Saturday, July 9, 2011

Practical Algorithms and Data Structures : Introduction

Welcome to the first of hopefully many posts on Algorithms and Data Structures.

One of the areas of computer science that I have always felt drawn to is the study of algorithms. That is not to say that I have developed any kind of specialized expertise in the field, but I do find it tremendously interesting and continuously strive to expand on my understanding and ability to apply algorithms effectively.

For me at least, the best way to evolve my knowledge of a topic is to attempt to explain some aspect of it to someone else. On countless occasions I have found that answering questions and having my explanations challenged has brought me to that “Aha” moment when a topic that I thought I understood suddenly became that much clearer.

Now you might be wondering why I would even bother writing something like this, especially given all the material out there covering this specific area. Well, other than my selfish motivation to learn more in the process, I have also realized that many developers today lack the basics of even the simplest algorithms. They might know the terminology, and sometimes not even that, but as soon as you start probing on the specifics of the algorithms, or how to decide which algorithm to use under which circumstances, things start to go sideways.

Why is it the case that so many practicing developers today find themselves lacking in this area? Well, I guess the truth is that with all the excellent tooling and libraries accompanying most languages and development platforms today, very few people actually need to delve into the depths of the algorithms and data structures they use. Everything is right there: if you have a list and want to find something in it, just call Find, Search, IndexOf or whatever function is documented to find an item in the list and be done with it. There is definitely nothing wrong with this. These libraries are professionally developed, robust and already used by thousands of developers, so they are well QA’d. There is, however, tremendous value in having a good understanding of the algorithms and data structures you use.

If you understand your data, and you know what it is you need to do with that data, the next step is selecting the most appropriate data structure to store it. Algorithms and data structures are tightly coupled: often the data structures you use to store your data will determine which algorithms can be used efficiently on that data. While most data structures can be searched, how efficiently that search can be performed will vary depending on the underlying data structure or even the ordering of the data in the structure. Having an understanding of the various data structures and the corresponding algorithms can help you choose the most efficient way to work with your data, and ensure your software does not buckle under the pressure of huge data volumes just because you selected the wrong data structure or algorithm to manage your data. Selecting the right algorithm for the job is often the key to a successful outcome, but to do that you need to understand the pros and cons of what is available to you.

With this series of blog posts I hope to share some of my learnings and at the same time gain a deeper level of understanding as we explore the algorithms and data structures together. I invite you to participate in this series. If you know a better way to implement something or have a better approach to explaining a specific algorithm, please share it with us and help enrich our journey.

What are Data Structures?

Data structures are the containers that you use to store and manage the data for your application.

As we design and develop software, one of the decisions we need to make is what data types to use to store the application specific data. If we were developing a simple contact management system, in which we can enter contact information and later retrieve that information, we would need to define how the contacts would be represented internally within the system. That means deciding on the data elements that represent a contact (such as first name, last name, address, email address, mobile number and date of birth), the data types used to store each of these elements, and how multiple contacts will be maintained and managed within the system. All of these decisions help define the data structures that will be required to build a functional system.
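As a rough illustration of the contact example, here is a minimal Python sketch. The class name `Contact`, the field names, the chosen types and the sample values are all assumptions made for illustration; a real system would likely make different choices.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Contact:
    """One contact record. Field names and types are illustrative assumptions."""
    first_name: str
    last_name: str
    email: str
    mobile: str
    address: str = ""
    date_of_birth: Optional[date] = None


# A plain list is the simplest way to hold multiple contacts.
# The sample values below are made up for the example.
contacts: List[Contact] = [
    Contact("Ada", "Lovelace", "ada@example.com", "555-0100"),
    Contact("Alan", "Turing", "alan@example.com", "555-0101"),
]
```

The record type answers the question of what a single contact looks like, while the list answers the question of how many contacts are kept together. Both choices affect which algorithms can later be applied efficiently.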

However, simply looking at the data requirements is not always enough. We also need to look at the algorithms we intend to apply to these data structures: how will we manipulate the data, perform searches, sort the data and so on? As we will discover through this series, the intended algorithms will have a bearing on the data structures we might select to represent the data in the system.

What is an Algorithm?

An algorithm is a recipe or set of instructions that can be followed to solve a specific type of problem.

Assume we have chosen to store our contacts from earlier in a list in which we can access each contact by walking through the list item by item, like paging through a book. We now need to find an algorithm we can use to search through the list to locate a contact by last name. How would you go about that? Given what we know about the data structure, the obvious solution would be to iterate through the list of contacts comparing the last name element to the search key. If you find a match you can stop the iteration and return the instance of the contact that was found, otherwise if you reach the end of the list and no match was found you return some indication that the contact does not exist. These steps describe what is known as a Sequential Search.
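The steps just described can be sketched in Python as follows. This is only a sketch: the function name `sequential_search` and the representation of a contact as a `(first_name, last_name)` tuple are assumptions made to keep the example short.

```python
from typing import List, Optional, Tuple

# Hypothetical contact representation: (first_name, last_name).
Contact = Tuple[str, str]


def sequential_search(contacts: List[Contact], last_name: str) -> Optional[Contact]:
    """Return the first contact whose last name matches, or None if there is none."""
    for contact in contacts:
        # Compare the last name element to the search key.
        if contact[1] == last_name:
            return contact  # match found, stop iterating
    # Reached the end of the list without finding a match.
    return None
```

For example, `sequential_search([("Ada", "Lovelace"), ("Alan", "Turing")], "Turing")` walks past the first contact and returns the second, while searching for a last name that is not in the list visits every contact before returning `None`.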

There are situations where using the simple Sequential Search algorithm might not be the best option and could severely hurt the performance of your system. For example, if the list in question contains a significant number of items and you need to perform frequent searches to determine the existence of an item in the list, this could quickly become a bottleneck. Every time we search for an item that does not exist, we will iterate through the entire list just to determine that the item is not there. In the best case the item we are searching for is found quickly within the first few items of the list, while on other occasions the item might only be found towards the end of the list. We will look at this in a little more depth in the section on Big-O notation.

Let’s look at one possible alternative: we could use a Binary Search. This algorithm can be significantly more efficient than a Sequential Search, especially in the worst case scenarios where the item being searched for is either not in the list or is far from the beginning of the list. However, to be able to use the Binary Search, the collection of items needs to meet two basic requirements.

  1. The list of items must be sorted
  2. The list structure must support what is often called random access, i.e. we should be able to access item 83 in a list of 100 items without needing to iterate over the first 82 items.

Given that the above constraints are met, we can use the Binary Search algorithm to introduce some significant optimization. The cost of ensuring the data is sorted is not insignificant, but for the purposes of this discussion we will ignore it for now.

Here is a quick introductory example of the basics of the Binary Search algorithm.

First you select the item in the middle of the list; in a list of 100 items that will be item 50. Now compare item 50 (the midpoint item) to the search key. If it is a match we can terminate the search and return the item. If the search key is greater than the midpoint item then we know that, if the item exists, it must be in the second half of the list. We know this because the list is sorted: if the search key is greater than the midpoint item, it must also be greater than all the items preceding the midpoint item. And vice versa, if the search key is less than the midpoint item, then a potentially matching item would be in the first half of the list. Can you see how with a single comparison we have eliminated half of the items to be searched?

Having determined which half of the list the item might be in, you can repeat the same logic on that subset of the data. Having narrowed the list of items down to a subset of 50, you can again select the midpoint item and compare it to the search key, which will either be a match or indicate that the search key potentially exists in the top or bottom half of the subset. After 2 comparisons we have eliminated roughly 75% of the items to be searched. You can continue until you either find the item or run out of items to search which would indicate that the item does not exist in the list.
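The halving procedure described above can be sketched in Python like this. It is a minimal iterative version that assumes a sorted sequence of integers; the function name `binary_search` is chosen for illustration.

```python
from typing import Optional, Sequence


def binary_search(items: Sequence[int], key: int) -> Optional[int]:
    """Return the index of key in the sorted sequence items, or None if absent."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2   # midpoint of the remaining range
        if items[mid] == key:
            return mid            # match found, stop searching
        if key > items[mid]:
            low = mid + 1         # key can only be in the upper half
        else:
            high = mid - 1        # key can only be in the lower half
    # The range is empty, so the key is not in the list.
    return None
```

Note that `items[mid]` relies on random access, which is the second requirement listed earlier, and that the comparisons only make sense because the sequence is sorted. Python's standard `bisect` module provides a similar halving search if you would rather not write your own.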

We will cover both the Sequential and Binary Search in more detail later. For now I just want to use them to demonstrate how selecting the right algorithm for the job can make a difference and how the nature of the data might influence the algorithms you can use. This also leads us into the next topic, which is Big-O notation.

Big-O notation

I am sure you have at some point seen or read about Big-O notation even if you did not know what it meant. If you have spent any time reading about algorithms you might have seen something like O(n), O(log n) or O(n^3). If you wondered what it all means, I will try to give a very brief non-mathematical description of how you can make some basic sense of this notation.

When selecting an algorithm there are a number of factors that you might have to take into consideration. For desktop or server based applications your primary criterion might be performance, where you want to select the algorithm that is going to give you the best performance regardless of the amount of memory the algorithm requires. On the other hand, on mobile devices you might be more concerned about the memory requirements of a particular algorithm. In either case, we need some way to represent these characteristics of an algorithm. This representation needs to be simple enough that just by looking at it I can tell if one algorithm will perform better than another, or if it will be more memory efficient, without needing to read or understand the complex mathematical analysis of each algorithm. For our purposes we will focus on the performance aspect and discuss the memory aspect later when we work with actual algorithms.

Big-O notation gives us a concise way to capture the performance characteristics of an algorithm over a collection of items. Basically we can see at a glance if the algorithm's performance will degrade rapidly, linearly or gradually as the number of items in the collection increases. Of course, as we saw earlier, algorithms have best case scenarios as well as worst case scenarios. Strictly speaking, Big-O describes an upper bound on growth and can be stated for the best, average or worst case; in this post the figures quoted are for the average case unless stated otherwise. So, looking at the Big-O for a Binary Search, I can say that on average the Binary Search will outperform a Sequential Search. That does not mean it will always outperform the Sequential Search. Remember, if the matching item is first in the list the Sequential Search will find it immediately, while the Binary Search will need to perform a few iterations before finding that the first item is the matching item. On average, though, we would expect the Binary Search to perform better for real world searches.

Using Big-O notation, the Sequential Search would be described as an O(n) algorithm, where n represents the number of items the algorithm will be working with. If we searched a list of 10 items then n=10 and if we searched a list of 1000 items then n=1000. From this we can conclude that on average the algorithm performs linearly: if we double the number of items, the average search time will double, so the relationship between the number of items and the execution time is linear.

How does that compare to our Binary Search algorithm? Well, without getting into the details now, I will tell you that a Binary Search is an O(log n) algorithm. This means that as the number of items increases the execution time increases logarithmically. Mathematically log(n) < n where n is a positive integer (see the table below), and the gap widens as n grows, so we can say that for large lists a Binary Search requires far fewer comparisons than a Sequential Search.

Let’s look at a quick analysis. If the item that I am searching for is the first item in a list of 1,000,000 items then the Sequential Search will clearly outperform the Binary Search, which is going to jump to the middle of the list, see that the item is in the first half of the list, halve that portion and so on, for a total of roughly 20 comparisons before locating the target item at the beginning of the list. However, if the item being searched for was the last item in the list then the Sequential Search would require 1,000,000 comparisons, while the Binary Search worst case would not be more than 20 comparisons. So the worst case of the Binary Search over a collection of 1,000,000 sorted items is 20 comparisons, while the Sequential Search will exceed that figure when searching for any of the 999,980 items that come after the first 20 items in the list.
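To check these figures for yourself, here is a small Python sketch that counts the work done by each search. The function names are assumptions for illustration, and for the Binary Search one "comparison" is counted each time a midpoint item is examined. A `range` object stands in for a sorted list of 1,000,000 items, since it supports random access without using much memory.

```python
from typing import Sequence


def sequential_comparisons(items: Sequence[int], key: int) -> int:
    """Count the items examined by a sequential search for key."""
    count = 0
    for item in items:
        count += 1
        if item == key:
            break
    return count


def binary_comparisons(items: Sequence[int], key: int) -> int:
    """Count the midpoint items examined by a binary search for key."""
    low, high = 0, len(items) - 1
    count = 0
    while low <= high:
        mid = (low + high) // 2
        count += 1
        if items[mid] == key:
            break
        if key > items[mid]:
            low = mid + 1
        else:
            high = mid - 1
    return count


items = range(1_000_000)  # sorted values 0 .. 999,999
first, last = items[0], items[-1]

print("first item:", sequential_comparisons(items, first), binary_comparisons(items, first))
print("last item: ", sequential_comparisons(items, last), binary_comparisons(items, last))
```

Running it shows the Sequential Search needing a single comparison for the first item and 1,000,000 for the last, while the Binary Search never needs more than 20 in either case.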

What we have seen here is that the Sequential Search has a best case execution of O(1), that is, constant time regardless of the number of items in the list. Of course this is the absolute best case, when you are lucky enough to have the item you are looking for be the first item in the list. The worst case, when the item is the last item or does not exist at all, is O(n), which is also the average case.

The Binary Search also has a best case scenario of O(1), that is, when the item you are searching for happens to be the item in the middle of the list, in which case the item would be found on the first comparison. The average and worst cases are both O(log n).

n        O(1)  O(n)     O(log n)  O(n log n)   O(n^2)
1        1     1        0         0            1
10       1     10       3.32      33.22        100
100      1     100      6.64      664.39       10000
1000     1     1000     9.97      9965.78      1000000
10000    1     10000    13.29     132877.12    100000000
100000   1     100000   16.61     1660964.05   10000000000
1000000  1     1000000  19.93     19931568.57  1000000000000

Looking at the above table you see a comparison of some Big-O representations for various values of n, with the logarithms taken to base 2. This should give you a feel for how one algorithm would perform relative to another based on the Big-O of the algorithm. The best general purpose comparison-based sorting algorithms today are O(n log n) algorithms.
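If you want to reproduce these figures or try other values of n, a short Python sketch along these lines should do it. The helper name `growth_row` is an assumption for illustration, and the logarithm is taken to base 2, which matches the values shown.

```python
import math


def growth_row(n: int) -> tuple:
    """Return (n, O(1), O(n), O(log n), O(n log n), O(n^2)) for a given n."""
    log_n = math.log2(n)
    return (n, 1, n, round(log_n, 2), round(n * log_n, 2), n ** 2)


# One row per power of ten, from 1 up to 1,000,000.
for exponent in range(7):
    print(*growth_row(10 ** exponent))
```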

Given the speed of computers today, even the worst performing algorithms will appear to perform efficiently for small values of n. That is why it is very important to understand the volume of data that your system might need to work with and make sure you test with volumes that are representative of what you expect to see in the production environment.

When selecting your algorithms, make sure that you fully grasp the context in which the algorithm will be used and how that scope might change over time as your system hopefully becomes more and more popular.

If you would like to see a visual representation of the table above take a look at my Big-O Visualizer. Note this application requires Silverlight 4.


Next: Arrays – Part 1