Chapter 17 Sorting

Computer science departments have an unhealthy obsession with sort algorithms. Based on the amount of time CS students spend on the topic, you would think that choosing sort algorithms is the cornerstone of modern software engineering. Of course, the reality is that software developers can go years, or entire careers, without thinking about how sorting works. For almost all applications, they use whatever general-purpose algorithm is provided by the language or libraries they use. And usually that’s just fine.

So if you skip this chapter and learn nothing about sort algorithms, you can still be an excellent developer. But there are a few reasons you might want to do it anyway:

1. Although general-purpose algorithms work well for the vast majority of applications, there are two special-purpose algorithms you might need to know about: radix sort and bounded heap sort.
2. One sort algorithm, merge sort, makes an excellent teaching example because it demonstrates an important and useful strategy for algorithm design, and when we analyze its performance, you will learn about an order of growth we have not seen before, linearithmic.
3. Finally, technical interviewers love to ask about sort algorithms.
So, in this chapter we’ll analyze insertion sort, you will implement merge sort, I’ll tell you about radix sort, and you will write a simple version of a bounded heap sort.

17.1 Insertion sort

We’ll start with insertion sort, mostly because it is simple to describe and implement. It is not very efficient, but it has some redeeming qualities, as we’ll see. Rather than explain the algorithm here, I suggest you read the insertion sort Wikipedia page at http://thinkdast.com/insertsort, which includes pseudocode and animated examples. Come back when you get the general idea.

Here’s an implementation of insertion sort in Java:

    public class ListSorter<T> {

        public void insertionSort(List<T> list, Comparator<T> comparator) {
            for (int i = 1; i < list.size(); i++) {
                T elt_i = list.get(i);
                int j = i;
                while (j > 0) {
                    T elt_j = list.get(j - 1);
                    if (comparator.compare(elt_i, elt_j) >= 0) {
                        break;
                    }
                    list.set(j, elt_j);
                    j--;
                }
                list.set(j, elt_i);
            }
        }
    }

I define a class, ListSorter, as a container for sort algorithms. By using the type parameter, T, we can write methods that work on lists containing any object type.
The following example shows how to call this method with a List of Integer objects:

    List<Integer> list = new ArrayList<Integer>(
            Arrays.asList(3, 5, 1, 4, 2));

    Comparator<Integer> comparator = new Comparator<Integer>() {
        @Override
        public int compare(Integer elt1, Integer elt2) {
            return elt1.compareTo(elt2);
        }
    };

    ListSorter<Integer> sorter = new ListSorter<Integer>();
    sorter.insertionSort(list, comparator);
    System.out.println(list);

The output is [1, 2, 3, 4, 5].
insertionSort takes two parameters, a List of any kind and a Comparator that knows how to compare type T objects. The outer loop iterates from 1 to list.size(); the inner loop shifts larger elements to the right until it finds the place where elt_i belongs. In the worst case, this algorithm is quadratic. If you are not sure about that, here’s the argument:

1. If the list contains n elements, the outer loop runs n−1 times.
2. The i-th time through the outer loop, the inner loop runs at most i times.
So the total number of times the inner loop runs is the sum of the series 1, 2, …, n−1, which is n(n−1)/2. The leading term of that expression (the one with the highest exponent) is n², so in the worst case insertion sort is quadratic.
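To spell out that arithmetic (this is just the standard closed form for a triangular sum, not anything specific to this implementation):

    1 + 2 + … + (n−1) = n(n−1)/2 = n²/2 − n/2

For example, with n = 5 the inner loop runs at most 1 + 2 + 3 + 4 = 10 times, and 5·4/2 = 10.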
However:

1. If the array is already sorted, or nearly so, insertion sort is linear; that is, the run time is proportional to the number of elements.
2. Since it is simple, the operations in the loops are fast; even though the worst-case run time is quadratic, the coefficient of the leading term is probably small.

So if we know that the array is nearly sorted, or is not very big, insertion sort might be a good choice. But for large arrays, we can do better. In fact, much better.

17.2 Exercise 14

Merge sort is one of several algorithms whose run time is better than quadratic. Again, rather than explaining the algorithm here, I suggest you read about it on Wikipedia at http://thinkdast.com/mergesort. Once you get the idea, come back and you can test your understanding by writing an implementation.

In the repository for this book, you’ll find the source files for this exercise:

- ListSorter.java, which contains the code above along with outlines of the methods you will fill in.
- ListSorterTest.java, which contains unit tests for ListSorter.
Run the tests in ListSorterTest; some of them should fail, because the merge sort methods are not implemented yet. In ListSorter.java you’ll find the outlines of two methods:

    public void mergeSortInPlace(List<T> list, Comparator<T> comparator) {
        List<T> sorted = mergeSort(list, comparator);
        list.clear();
        list.addAll(sorted);
    }

    private List<T> mergeSort(List<T> list, Comparator<T> comparator) {
        // TODO: fill this in!
        return null;
    }

These two methods do the same thing but provide different interfaces: mergeSort takes a list and returns a new list with the same elements sorted in ascending order; mergeSortInPlace is a void method that modifies the existing list.
Your job is to fill in mergeSort. Before you write a fully recursive version of merge sort, start with something like this:

1. Split the list in half.
2. Sort the halves using Collections.sort or insertionSort.
3. Merge the sorted halves into a complete sorted list.
This will give you a chance to debug the merge code without dealing with the complexity of a recursive method. Next, add a base case (see
http://thinkdast.com/basecase). If you are
given a list with only one element, you could return it immediately,
since it is already sorted, sort of. Or if the length of the list is
below some threshold, you could sort it using Collections.sort or insertionSort. Finally, modify your solution so it makes two recursive calls to sort the two halves of the array. When you get it working, run the tests again; they should all pass.
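If you get stuck on the merging step, here is a minimal sketch of a merge helper that could live inside ListSorter<T>. The method name, merge, and its signature are my own choices for illustration; they are not part of the book’s starter code:

    // Combines two sorted lists into one new sorted list.
    // Assumes java.util.ArrayList is imported.
    private List<T> merge(List<T> first, List<T> second,
                          Comparator<T> comparator) {
        List<T> result = new ArrayList<T>();
        int i = 0;
        int j = 0;
        while (i < first.size() && j < second.size()) {
            // Take whichever front element is smaller (<= keeps the sort stable).
            if (comparator.compare(first.get(i), second.get(j)) <= 0) {
                result.add(first.get(i));
                i++;
            } else {
                result.add(second.get(j));
                j++;
            }
        }
        // One list is exhausted; append whatever is left of the other.
        result.addAll(first.subList(i, first.size()));
        result.addAll(second.subList(j, second.size()));
        return result;
    }

With a helper like this, the recursive mergeSort only has to split the list, sort each half, and merge the results.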
17.3 Analysis of merge sort

To classify the run time of merge sort, it helps to think in terms of levels of recursion and how much work is done on each level. Suppose we start with a list that contains n elements. Here are the steps of the algorithm:

1. Make two new arrays and copy half of the elements into each.
2. Sort the two halves.
3. Merge the halves.
Figure 17.1 shows these steps. The first step copies each of the elements once, so it is linear. The third step also copies each element once, so it is also linear. Now we need to figure out the complexity of step 2. To do that, it helps to look at a different picture of the computation, which shows the levels of recursion, as in Figure 17.2.

At the top level, we have 1 list with n elements. For simplicity, let’s assume n is a power of 2. At the next level there are 2 lists with n/2 elements. Then 4 lists with n/4 elements, and so on until we get to n lists with 1 element.

On every level we have a total of n elements. On the way down, we have to split the arrays in half, which takes time proportional to n on every level. On the way back up, we have to merge a total of n elements, which is also linear.

If the number of levels is h, the total amount of work for the algorithm is O(nh). So how many levels are there? There are two ways to think about that:

1. How many times do we have to cut n in half to get to 1?
2. How many times do we have to double 1 to get to n?
Another way to ask the second question is “What power of 2 is n?”

    2^h = n

Taking the log base 2 of both sides yields

    h = log₂ n

So the total time is O(n log n). I didn’t bother to write the base of the logarithm because logarithms with different bases differ by a constant factor, so all logarithms are in the same order of growth.

Algorithms in O(n log n) are sometimes called “linearithmic”, but most people just say “n log n”.

It turns out that O(n log n) is the theoretical lower bound for sort algorithms that work by comparing elements to each other. That means there is no “comparison sort” whose order of growth is better than n log n. See http://thinkdast.com/compsort. But as we’ll see in the next section, there are non-comparison sorts that take linear time!

17.4 Radix sort

During the 2008 United States Presidential Campaign, candidate Barack Obama was asked to perform an impromptu algorithm analysis when he visited Google. Chief executive Eric Schmidt jokingly asked him for “the most efficient way to sort a million 32-bit integers.” Obama had apparently been tipped off, because he quickly replied, “I think the bubble sort would be the wrong way to go.” You can watch the video at http://thinkdast.com/obama.

Obama was right: bubble sort is conceptually simple but its run time is quadratic; and even among quadratic sort algorithms, its performance is not very good. See http://thinkdast.com/bubble.

The answer Schmidt was probably looking for is “radix sort”, which is a non-comparison sort algorithm that works if the size of the elements is bounded, like a 32-bit integer or a 20-character string. To see how this works, imagine you have a stack of index cards where each card contains a three-letter word. Here’s how you could sort the cards:

1. Make one pass through the cards and divide them into buckets based on the first letter, so words beginning with a are in one bucket, followed by words beginning with b, and so on.
2. Divide each bucket again based on the second letter.
3. Divide each bucket again based on the third letter.
At this point each bucket contains one element, and the buckets are sorted in ascending order. Figure 17.3 shows an example with three-letter words. The top row shows the unsorted words. The second row shows what the buckets look like after the first pass. The words in each bucket begin with the same letter.

After the second pass, the words in each bucket begin with the same two letters. After the third pass, there can be only one word in each bucket, and the buckets are in order.

During each pass, we iterate through the elements and add them to buckets. As long as the buckets allow addition in constant time, each pass is linear.

The number of passes, which I’ll call w, depends on the “width” of the words, but it doesn’t depend on the number of words, n. So the order of growth is O(wn), which is linear in n.

There are many variations on radix sort, and many ways to implement each one. You can read more about them at http://thinkdast.com/radix. As an optional exercise, consider writing a version of radix sort; a sketch of one variant follows.
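As a starting point, here is a sketch of the least-significant-digit variant for words that all have the same width and contain only lowercase letters. It is a self-contained example of my own; the names RadixSortExample and radixSort are not from the book:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class RadixSortExample {

        // Sorts fixed-width lowercase words with one bucket pass per
        // character position, starting from the last position.
        public static List<String> radixSort(List<String> words, int width) {
            List<String> current = new ArrayList<String>(words);
            for (int pos = width - 1; pos >= 0; pos--) {
                // 26 buckets, one per letter; adding to an ArrayList is
                // constant time, so each pass is linear.
                List<List<String>> buckets = new ArrayList<List<String>>();
                for (int i = 0; i < 26; i++) {
                    buckets.add(new ArrayList<String>());
                }
                for (String word : current) {
                    buckets.get(word.charAt(pos) - 'a').add(word);
                }
                // Concatenating the buckets in order keeps each pass stable,
                // so earlier letter positions dominate later ones.
                current = new ArrayList<String>();
                for (List<String> bucket : buckets) {
                    current.addAll(bucket);
                }
            }
            return current;
        }

        public static void main(String[] args) {
            List<String> words = Arrays.asList("sea", "tab", "sad", "tea", "ace");
            System.out.println(radixSort(words, 3));
            // prints [ace, sad, sea, tab, tea]
        }
    }

Note that this version works from the last letter to the first, while the index-card description above works from the first letter inward; both are standard, and both take w linear passes.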
17.5 Heap sort

In addition to radix sort, which applies when the things you want to sort are bounded in size, there is one other special-purpose sorting algorithm you might encounter: bounded heap sort. Bounded heap sort is useful if you are working with a very large dataset and you want to report the “Top 10” or “Top k” for some value of k much smaller than n.

For example, suppose you are monitoring a Web service that handles a billion transactions per day. At the end of each day, you want to report the k biggest transactions (or slowest, or any other superlative). One option is to store all transactions, sort them at the end of the day, and select the top k. That would take time proportional to n log n, and it would be very slow because we probably can’t fit a billion transactions in the memory of a single program. We would have to use an “out of core” sort algorithm. You can read about external sorting at http://thinkdast.com/extsort.

Using a bounded heap, we can do much better! Here’s how we will proceed:

1. First I’ll explain heaps and (unbounded) heap sort.
2. Then I’ll explain bounded heaps and show how to use one to keep track of the top k elements.
3. Finally, we’ll analyze the performance of the bounded heap algorithm.

To understand heap sort, you have to understand a heap, which is a data structure similar to a binary search tree (BST). Here are the differences:

1. In a BST, every node, x, has the “BST property”: all nodes in the left subtree of x are less than x, and all nodes in the right subtree are greater than x.
2. In a (min-)heap, every node, x, has the “heap property”: all nodes in both subtrees of x are greater than x.
3. Heaps are like balanced BSTs; when you add or remove elements, they do some additional work to keep the tree balanced. As a result, they can be implemented efficiently using an array of elements.
The smallest element in a heap is always at the root, so we can find it in constant time. Adding and removing elements from a heap takes time proportional to the height of the tree, h. And because the heap is always balanced, h is proportional to log n. You can read more about heaps at http://thinkdast.com/heap.

The Java PriorityQueue is implemented using a heap. It provides the methods specified in the Queue interface, including offer and poll:

- offer: Adds an element to the queue, updating the heap so that every node has the “heap property”; takes log n time.
- poll: Removes and returns the smallest element in the queue, then updates the heap; takes log n time.
Given a PriorityQueue, we can sort a collection of n elements like this:

1. Add all elements of the collection to a PriorityQueue using offer.
2. Remove the elements using poll and add them to a List; they come out in ascending order.
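Here is a minimal sketch of that algorithm as a method inside ListSorter<T>; the repository asks you to write a method like this yourself, so treat this body as one possible version rather than the official solution (it needs java.util.PriorityQueue):

    public void heapSort(List<T> list, Comparator<T> comparator) {
        // The +1 keeps the initial capacity positive even for an empty list.
        PriorityQueue<T> heap =
                new PriorityQueue<T>(list.size() + 1, comparator);
        heap.addAll(list);              // n offers: O(n log n)
        list.clear();
        while (!heap.isEmpty()) {
            list.add(heap.poll());      // n polls: O(n log n)
        }
    }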
Because offer and poll take log n time, adding n elements to the queue takes n log n time, and so does removing n elements. So the run time for heap sort is O(n log n).

In the repository for this book, in ListSorter.java, you’ll find an outline of a heap sort method you can fill in; run the tests to confirm that it works.

17.6 Bounded heap

A bounded heap is a heap that is limited to contain at most k elements. If you have n elements, you can keep track of the k largest elements like this: Initially, the heap is empty. For each element, x:

1. Branch 1: If the heap is not full, add x to the heap.
2. Branch 2: If the heap is full, compare x to the smallest element in the heap. If x is smaller, it cannot be one of the k largest elements, so you can discard it.
3. Branch 3: If x is larger than the smallest element in the heap, remove the smallest element and add x.
Using a heap with the smallest element at the top, we can keep track of the largest k elements. Let’s analyze the performance of this algorithm. For each element, we perform one of these operations:

1. Branch 1: Adding an element to the heap takes O(log k).
2. Branch 2: Finding the smallest element in the heap takes O(1).
3. Branch 3: Removing the smallest element and adding x both take O(log k).

In the worst case, if the elements appear in ascending order, we always run Branch 3. In that case, the total time to process n elements is O(n log k), which is linear in n.

In ListSorter.java you’ll find the outline of a method called topK that takes a List, a Comparator, and an integer k, and returns the k largest elements of the List in ascending order. As an exercise, fill it in; a sketch of the idea follows.
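Here is a minimal sketch of the bounded heap idea, again as a method inside ListSorter<T>. The parameter order matches the description above but is my assumption, not a confirmed signature, so check it against the starter code (it assumes k >= 1 and java.util.PriorityQueue):

    public List<T> topK(int k, List<T> list, Comparator<T> comparator) {
        // Min-heap holding at most k elements; the root is the smallest
        // of the "top k so far". The +1 keeps the capacity positive.
        PriorityQueue<T> heap = new PriorityQueue<T>(k + 1, comparator);
        for (T element : list) {
            if (heap.size() < k) {
                heap.offer(element);                       // Branch 1
            } else if (comparator.compare(element, heap.peek()) > 0) {
                heap.poll();                               // Branch 3
                heap.offer(element);
            }
            // Branch 2: element is no bigger than the smallest of the
            // top k, so we discard it without touching the heap.
        }
        // Drain the heap; elements come out in ascending order.
        List<T> result = new ArrayList<T>();
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        return result;
    }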
17.7 Space complexity

Until now we have talked a lot about run time analysis, but for many algorithms we are also concerned about space. For example, one of the drawbacks of merge sort is that it makes copies of the data. In our implementation, the total amount of space it allocates is O(n log n). With a more clever implementation, you can get the space requirement down to O(n).

In contrast, insertion sort doesn’t copy the data because it sorts the elements in place. It uses temporary variables to compare two elements at a time, and it uses a few other local variables. But its space use doesn’t depend on n.

Our implementation of heap sort creates a new PriorityQueue to store the elements, so the space is O(n); but if you are allowed to sort the list in place, you can run heap sort with O(1) space.

One of the benefits of the bounded heap algorithm you just implemented is that it only needs space proportional to k (the number of elements we want to keep), and k is often much smaller than n.

Software developers tend to pay more attention to run time than space, and for many applications, that’s appropriate. But for large datasets, space can be just as important or more so. For example:

1. If a dataset doesn’t fit into the memory of one program, the run time often increases dramatically, or it might not run at all. If you choose an algorithm that needs less space, and that makes it possible to fit the computation into memory, it might be much faster. In the same vein, a program that uses less space might make better use of CPU caches and run faster (see http://thinkdast.com/cache).
2. On a server that runs many processes at the same time, if you can reduce the space needed for each process, you might be able to run more processes on the same server, which reduces hardware and energy costs.
So those are some reasons you should know at least a little bit about the space needs of algorithms.