Data Structures And Algorithms: PA #4

James Madison University, Fall 2012

Programming Assignment #4: Sorting

Introduction

Quick sort is the fastest known comparison-based sorting algorithm when applied to large, unordered, sequences. It also has the advantage of being an in-place (or nearly in-place) sort. Unfortunately, quick sort has some serious failings: it's worst-case performance is O(n²), and it is not stable. Merge sort shares neither of these disadvantages: it is stable and it requires O(n log n) steps in the worst case. Unfortunately, merge sort requires O(n) additional space and it runs significantly more slowly than quick sort on most inputs.

What we really want is a sorting algorithm that is as fast as quick sort, stable, in-place, with O(n log n) worst-case performance. Sadly, no such algorithm has yet been discovered. In practice, most code libraries follow one of two paths: either they provide a modified version of quick sort that is able to avoid worst-case behavior on typical inputs, or they provide a modified version of merge sort that is able to close some of the performance gap through careful optimizations. Python's built-in sorting algorithm takes the latter path. In this project, we will attempt the former.

The objective of this assignment is to write two modified versions of quick sort that exhibit better performance than the standard version described in our textbook. You will then write timing code to experimentally compare your new sorting algorithms to the original quick sort and to the other sorting algorithms we've studied.

Improvement One: Median of Three

In the textbook's version of quick sort, the first item in the sequence is always chosen as the pivot item. This is the worst possible choice for lists that are already sorted. One improvement is to examine three elements from the list and use the median of those three as the pivot. Typically, the first, middle and last items are examined. Recall that the best case for quick sort occurs when the pivot item ends up near the middle of the partitioned list and the worst case occurs when it ends up near one of the extremes. Using the median-of-three approach greatly reduces the chance that we will end up with a pivot item that is near one of the extremes. For inputs that are already sorted, the median item will be in the exact center and we will end up with best-case performance instead of worst-case.

Note that the median-of-three approach doesn't guarantee that quick sort will perform well. It is still possible to construct input sequences that will result in O(n²) comparisons.

Improvement Two: Introspective Sort

For the large majority of inputs quick sort is very fast. There are relatively few pathological inputs that result in O(n²) performance. The idea behind introspective sort is to perform a standard quick sort until there is evidence that the current input is pathological. If it looks like the current input will cause O(n²) performance, the algorithm switches strategies to a sorting algorithm with guaranteed O(n log n) performance. Typically, heap sort is used as the alternative, since it is in-place and takes O(n log n) time in the worst case. Since we haven't yet studied heap sort, our version of introspective sort will use merge sort as the alternate sorting algorithm. The resulting algorithm is as fast as quick sort for most inputs, but has a worst case performance of O(n log n).

It is possible to detect pathological inputs by tracking the recursion depth of the quick sort algorithm. When quick sort is working well, the partition operation typically moves the pivot item somewhere near the center of the current sub-sequence. When this is the case, the maximum recursion depth will be O(log n). Therefore, introspective sort switches strategies when the recursion depth exceeds 2 * ⌊log₂ n⌋.

Experimentally Analyzing Sorting Algorithms

For this assignment you will write code to time sorting algorithms and also to count the number of comparisons and assignments that occur.

For timing, you can use the following function, provided in time_sorts.py:

def currentTime():
    """ Return the current time in seconds, as a float. 

    Annoyingly, time.time() and time.clock() behave differently from
    one platform to the next.  This function should give the most
    accurate timings across platforms."""

    if platform.system() == "Windows":
        return time.clock()
    else:
        return time.time()

For example, you might time a single call to insertion sort as follows:

start = currentTime()
insertionSort(items)
totalTime = currentTime() - start

When writing timing code, make sure that you only time the actual sorting operations. If you include other operations, such as generating random inputs, your results will be misleading.

Measuring clock time provides valuable data for evaluating sorting algorithms: ultimately our main concern is speed. However, there are other types of data that may provide useful insight into why one sorting algorithm exhibits better performance than another. In particular, it can be useful to count the number of comparison and assignment operations that occur during different sorting algorithms. We can accomplish this easily in Python by subclassing built-in Python data types. The following two classes are subclasses of the Python int class and list class respectively:

class CountInt(int):
    """ A subclass of Python's built-in int class that counts the
    number of comparisons between objects of the class.

    A CountInt is exactly the same as an int, except the class-level
    variable numComparisons is incremented during every comparison
    operation.
    """

    numComparisons = 0
    
    def __cmp__(self, rhs):
        CountInt.numComparisons += 1
        return super(CountInt, self).__cmp__(rhs)


class CountList(list):
    """ A subclass of Python's built-in list class that counts the
    number of assignment operations performed on instances of the
    class .

    A CountList is exactly the same as a list, except the class-level
    variable numAssignments is incremented during every assignment
    operation.
    """

    numAssignments = 0

    def __setitem__(self, indx, item):
        CountList.numAssignments += 1
        super(CountList, self).__setitem__(indx, item)

In the code above, numComparisons and numAssignments are class-level variables. This means that there is just one copy of each variable associated with the class as a whole. By overriding the comparison and assignment methods we ensure that each of those operation will be counted. In all other respects these two classes will behave exactly as the default versions would. The following example illustrates how these classes may be used:

>>> items = CountList([CountInt(i) for i in range(20)])
>>> CountInt.numComparisons = 0
>>> CountList.numAssignments = 0
>>> selectionSort(items)
>>> print CountInt.numComparisons
190
>>> print CountList.numAssignments
38

Requirements

You should use the python modules sorts.py and time_sorts.py as a starting point. You must supplement the sorting algorithms provided in sorts.py with two additional algorithms:

quickSortMedian3 - Median-of-three quick sort, where the pivot item is chosen to be the median of the first middle and last elements in the sequence.
introSort - Introspective sort, where the sorting strategy is switched to merge sort if the recursion depth exceeds 2 * ⌊log₂ n⌋.

You must also modify the main function in time_sorts.py so that it prints out timing information for each of the sorting algorithms in sorts.py, including the new sorting algorithms that you develop. You should time the following sorts on randomly generated lists with 100,000 entries:

quickSort
quickSortMedian3
introSort
mergeSort

Random numbers should be generated using the provided functions. All results should be averaged across at least 10 runs with different randomly generated inputs. You must also time all sorts on both ordered and randomly generated lists of size 5,000.

The following output illustrates the correct format for the printed results:

--------------------------------------------------------
Sorting 10 Random Lists of Size 100000
--------------------------------------------------------
    SORT        TIME(s)  #COMPARISONS       #ASSIGNMENTS
    Quick       2.34261    2025491.67          750950.67
    Quick3      0.00000          0.00               0.00
    Intro       0.00000          0.00               0.00
    Merge       4.60801    1536407.67         1668928.00

--------------------------------------------------------
Sorting 10 Random Lists of Size 5000
--------------------------------------------------------
    SORT        TIME(s)  #COMPARISONS       #ASSIGNMENTS
    Quick       0.08415      73340.33           27368.67
    Quick3      0.00000          0.00               0.00
    Intro       0.00000          0.00               0.00
    Merge       0.16660      55217.67           61808.00
    Selection  10.00425   12497500.00            9998.00
    Insertion  10.70457    6262166.00         6262175.00

--------------------------------------------------------
Sorting 10 Ordered Lists of Size 5000
--------------------------------------------------------
    SORT        TIME(s)  #COMPARISONS       #ASSIGNMENTS
    Quick       9.44986   12502498.00               0.00
    Quick3      0.00000          0.00               0.00
    Intro       0.00000          0.00               0.00
    Merge       0.14597      32004.00           61808.00
    Selection   9.94826   12497500.00            9998.00
    Insertion   0.00834       4999.00            4999.00

Along with your modified Python modules you must also submit a .txt files that contains the output of your program on a sample execution.

Hints

Keep in mind that your new sorting algorithms need to be correct as well as efficient. One way to test your code is to compare its output to Python's built in sort method. For example:

>>> a = [random.randint(0, 500) for _ in range(500)] # Generate a random list.
>>> b = a[:]                                         # Copy the list. 
>>> b.sort()                                         # Sort the copy.
>>> sorts.insertionSort(a)                           # Sort using my sort.
>>> a == b                                           # Compare results.  
True

Getting accurate assignment counts for merge sort will require you to modify the existing merge function to use a CountList for temporary storage.
Producing the table of results requires some fairly complex string formatting. You need to display the required number of decimal places and you need to ensure that the columns line up correctly. The recommended way to handle this sort of thing in Python is to use the format method of the string class. In its simplest form, format makes it possible to insert values into arbitrary strings. For example, the following statement inserts the values 1/3 and 2/3 into the indicated string:
```
>>> "A {0} B {1}".format(1./3, 2./3)
'A 0.333333333333 B 0.666666666667'
```
The first argument is converted to a string and inserted in place of {0} and the second is inserted in place of {1}.

The format method also makes it possible to control the way that inserted values are displayed. For example, we can limit the number of decimal places as follows:
```
>>> "A {0:.1f} B {1:.1f}".format(1./3, 2./3)
'A 0.3 B 0.7'
```
Here .1f indicates that the values should be displayed as floats, and that only one decimal place should be retained. More generally, the information after the colon is referred to as a format specifier. The syntax for format specifiers is governed by the Format Specification Mini-Language.
Format specifiers may also include a width value. The following ensures that both inserted fields will be at least 20 characters wide:
```
>>> "A {0:20.1f} B {1:20.1f}".format(1./3, 2./3)
'A                  0.3 B                  0.7'
```
This just scratches the surface of string formatting possibilities. I encourage you to read over the Python documentation for details and examples.

Handing In

You should submit two .py files and a .txt file through Blackboard before the deadline. As always, your code should conform to the CS240 Python Coding Conventions.

Acknowledgments

This assignment is based on a project originally developed by Dr. Chris Fox.