Quick sort is the fastest known comparison-based sorting algorithm when applied to large, unordered, sequences. It also has the advantage of being an in-place (or nearly in-place) sort. Unfortunately, quick sort has some serious failings: it's worst-case performance is O(n2), and it is not stable. Merge sort shares neither of these disadvantages: it is stable and it requires O(n log n) steps in the worst case. Unfortunately, merge sort requires O(n) additional space and it runs significantly more slowly than quick sort on most inputs.
What we really want is a sorting algorithm that is as fast as quick sort, stable, in-place, with O(n log n) worst-case performance. Sadly, no such algorithm has yet been discovered. In practice, most code libraries follow one of two paths: either they provide a modified version of quick sort that is able to avoid worst-case behavior on typical inputs, or they provide a modified version of merge sort that is able to close some of the performance gap through careful optimizations. Python's built-in sorting algorithm takes the latter path. In this project, we will attempt the former.
The objective of this assignment is to write two modified versions of quick sort that exhibit better performance than the standard version described in our textbook. You will then write timing code to experimentally compare your new sorting algorithms to the original quick sort and to the other sorting algorithms we've studied.
In the textbook's version of quick sort, the first item in the sequence is always chosen as the pivot item. This is the worst possible choice for lists that are already sorted. One improvement is to examine three elements from the list and use the median of those three as the pivot. Typically, the first, middle and last items are examined. Recall that the best case for quick sort occurs when the pivot item ends up near the middle of the partitioned list and the worst case occurs when it ends up near one of the extremes. Using the median-of-three approach greatly reduces the chance that we will end up with a pivot item that is near one of the extremes. For inputs that are already sorted, the median item will be in the exact center and we will end up with best-case performance instead of worst-case.
Note that the median-of-three approach doesn't guarantee that quick sort will perform well. It is still possible to construct input sequences that will result in O(n2) comparisons.
For the large majority of inputs quick sort is very fast. There are relatively few pathological inputs that result in O(n2) performance. The idea behind introspective sort is to perform a standard quick sort until there is evidence that the current input is pathological. If it looks like the current input will cause O(n2) performance, the algorithm switches strategies to a sorting algorithm with guaranteed O(n log n) performance. Typically, heap sort is used as the alternative, since it is in-place and takes O(n log n) time in the worst case. Since we haven't yet studied heap sort, our version of introspective sort will use merge sort as the alternate sorting algorithm. The resulting algorithm is as fast as quick sort for most inputs, but has a worst case performance of O(n log n).
It is possible to detect pathological inputs by tracking the recursion depth of the quick sort algorithm. When quick sort is working well, the partition operation typically moves the pivot item somewhere near the center of the current sub-sequence. When this is the case, the maximum recursion depth will be O(log n). Therefore, introspective sort switches strategies when the recursion depth exceeds 2 * ⌊log2 n⌋.
For this assignment you will write code to time sorting algorithms and also to count the number of comparisons and assignments that occur.
For timing, you can use the following function, provided in time_sorts.py:
def currentTime(): """ Return the current time in seconds, as a float. Annoyingly, time.time() and time.clock() behave differently from one platform to the next. This function should give the most accurate timings across platforms.""" if platform.system() == "Windows": return time.clock() else: return time.time()
For example, you might time a single call to insertion sort as follows:
start = currentTime() insertionSort(items) totalTime = currentTime() - start
When writing timing code, make sure that you only time the actual sorting operations. If you include other operations, such as generating random inputs, your results will be misleading.
Measuring clock time provides valuable data for evaluating sorting
algorithms: ultimately our main concern is speed. However, there are
other types of data that may provide useful insight into why one
sorting algorithm exhibits better performance than another. In
particular, it can be useful to count the number of comparison and
assignment operations that occur during different sorting algorithms.
We can accomplish this easily in Python by subclassing built-in Python
data types. The following two classes are subclasses of the Python
int
class and list
class respectively:
class CountInt(int): """ A subclass of Python's built-in int class that counts the number of comparisons between objects of the class. A CountInt is exactly the same as an int, except the class-level variable numComparisons is incremented during every comparison operation. """ numComparisons = 0 def __cmp__(self, rhs): CountInt.numComparisons += 1 return super(CountInt, self).__cmp__(rhs) class CountList(list): """ A subclass of Python's built-in list class that counts the number of assignment operations performed on instances of the class . A CountList is exactly the same as a list, except the class-level variable numAssignments is incremented during every assignment operation. """ numAssignments = 0 def __setitem__(self, indx, item): CountList.numAssignments += 1 super(CountList, self).__setitem__(indx, item)In the code above, numComparisons and numAssignments are class-level variables. This means that there is just one copy of each variable associated with the class as a whole. By overriding the comparison and assignment methods we ensure that each of those operation will be counted. In all other respects these two classes will behave exactly as the default versions would. The following example illustrates how these classes may be used:
>>> items = CountList([CountInt(i) for i in range(20)]) >>> CountInt.numComparisons = 0 >>> CountList.numAssignments = 0 >>> selectionSort(items) >>> print CountInt.numComparisons 190 >>> print CountList.numAssignments 38
You should use the python modules sorts.py and time_sorts.py as a starting point. You must supplement the sorting algorithms provided in sorts.py with two additional algorithms:
quickSortMedian3
- Median-of-three quick sort, where
the pivot item is chosen to be the median of the first middle and
last elements in the sequence.
introSort
- Introspective sort, where the sorting
strategy is switched to merge sort if the recursion depth exceeds 2
* ⌊log2 n⌋.
You must also modify the main function in time_sorts.py so that it prints out timing information for each of the sorting algorithms in sorts.py, including the new sorting algorithms that you develop. You should time the following sorts on randomly generated lists with 100,000 entries:
quickSort
quickSortMedian3
introSort
mergeSort
Random numbers should be generated using the provided functions. All results should be averaged across at least 10 runs with different randomly generated inputs. You must also time all sorts on both ordered and randomly generated lists of size 5,000.
The following output illustrates the correct format for the printed results:
-------------------------------------------------------- Sorting 10 Random Lists of Size 100000 -------------------------------------------------------- SORT TIME(s) #COMPARISONS #ASSIGNMENTS Quick 2.34261 2025491.67 750950.67 Quick3 0.00000 0.00 0.00 Intro 0.00000 0.00 0.00 Merge 4.60801 1536407.67 1668928.00 -------------------------------------------------------- Sorting 10 Random Lists of Size 5000 -------------------------------------------------------- SORT TIME(s) #COMPARISONS #ASSIGNMENTS Quick 0.08415 73340.33 27368.67 Quick3 0.00000 0.00 0.00 Intro 0.00000 0.00 0.00 Merge 0.16660 55217.67 61808.00 Selection 10.00425 12497500.00 9998.00 Insertion 10.70457 6262166.00 6262175.00 -------------------------------------------------------- Sorting 10 Ordered Lists of Size 5000 -------------------------------------------------------- SORT TIME(s) #COMPARISONS #ASSIGNMENTS Quick 9.44986 12502498.00 0.00 Quick3 0.00000 0.00 0.00 Intro 0.00000 0.00 0.00 Merge 0.14597 32004.00 61808.00 Selection 9.94826 12497500.00 9998.00 Insertion 0.00834 4999.00 4999.00
Along with your modified Python modules you must also submit a .txt files that contains the output of your program on a sample execution.
sort
method. For example:
>>> a = [random.randint(0, 500) for _ in range(500)] # Generate a random list. >>> b = a[:] # Copy the list. >>> b.sort() # Sort the copy. >>> sorts.insertionSort(a) # Sort using my sort. >>> a == b # Compare results. True
merge
function to use
a CountList
for temporary storage.
Producing the table of results requires some fairly complex string
formatting. You need to display the required number of decimal places
and you need to ensure that the columns line up correctly. The
recommended way to handle this sort of thing in Python is to use
the format
method of the string class. In its simplest
form, format
makes it possible to insert values into
arbitrary strings. For example, the following statement inserts the
values 1/3 and 2/3 into the indicated string:
>>> "A {0} B {1}".format(1./3, 2./3) 'A 0.333333333333 B 0.666666666667'
The first argument is converted to a string and inserted in place of
{0}
and the second is inserted in place of {1}
.
The format
method also makes it possible to control the way that
inserted values are displayed. For example, we can limit the number of
decimal places as follows:
>>> "A {0:.1f} B {1:.1f}".format(1./3, 2./3) 'A 0.3 B 0.7'Here
.1f
indicates that the values should be displayed as
floats, and that only one decimal place should be retained. More
generally, the information after the colon is referred to as a format
specifier. The syntax for format specifiers is governed by the
Format
Specification Mini-Language.
Format specifiers may also include a width value. The following ensures that both inserted fields will be at least 20 characters wide:
>>> "A {0:20.1f} B {1:20.1f}".format(1./3, 2./3) 'A 0.3 B 0.7'
This just scratches the surface of string formatting possibilities. I encourage you to read over the Python documentation for details and examples.
This assignment is based on a project originally developed by Dr. Chris Fox.