Sorting Project

Wed, Oct 19, 2016

Introduction - What is the World’s Best Sorting Algorithm?

There is no single best sorting algorithm. Quicksort is the fastest known comparison-based sorting algorithm when applied to large, unordered, sequences. It also has the advantage of being an in-place (or nearly in-place) sort. Unfortunately, quicksort has some weaknesses: it’s worst-case performance is $O(n^2)$, and it is not stable. Merge sort shares neither of these disadvantages: it is stable and it requires $O(n \log n)$ steps in the worst case. Unfortunately, merge sort requires $O(n)$ additional space and it runs more slowly than quick sort on most inputs.

What we really want is a sorting algorithm that is as fast as quicksort, stable, in-place, with $O(n \log n)$ worst-case performance. Sadly, no such algorithm has yet been discovered. In practice, most code libraries follow one of two paths: either they provide a modified version of quicksort that is able to avoid worst-case behavior on typical inputs, or they provide a modified version of merge sort that is able to close some of the performance gap through careful optimizations^*.

The objective of this assignment is to write modified versions of merge and quicksort that exhibit better performance than the standard versions described in our textbook. You will implement two improvements (one for merge sort and one for quicksort) that should improve some aspect of the algorithm’s performance without requiring a great deal of additional code. You will also experimentally evaluate the impact of your improvements.

The following zip file contains starter code, including the sort implementations provided by our textbook.

sorting.zip

This zip file has the following contents:

Classes for generating testing data:
- Generator.java - Abstract base class.
- RandomGenerator.java
- OrderedGenerator.java
- EvilGenerator.java (UNFINISHED. See Part 4 below.)
Sorting classes. Normally we would write sorting algorithms as static utility methods. For this project we are creating sorting classes so that we can take advantage of polymorphism in the development of the profiling tool described in Part 2.
- Sorter.java - Abstract base class for all sorting classes.
- InsertionSorter.java
- SelectionSorter.java
- MergeSorter.java
- QuickSorter.java
- TimSorter.java - The default Java sorting algorithm.
- MergeSorter1.java (UNFINISHED. See Part 3 below.)
- IntroSorter.java (UNFINISHED. See Part 5 below.)
CliDemo.java - Demo application illustrating the use of the JOpt Simple library. (See Part 2 below.)
DemoDriver.java - Demo driver illustrating how to use the sort and generator classes.

Part 1 - Unit Testing

Develop unit tests for each of the provided sorting classes. These tests don’t need to evaluate the running time of the sorts, they only need to ensure that each implementation produces the correct output. Your tests must test the stability of the sorts that should be stable.

Note that there is at least one defect in the provided sorts. For full credit your testing must uncover all defects and you must correct them before submission.

Part 2 - Sort Profiling Tool

So far in this course we have focused on asymptotic analysis. This usually gives us the information we want. When deciding between two algorithms we generally prefer the one with better asymptotic performance.

Our goals for this project are somewhat different. The focus now is on fine-tuning existing algorithms to improve their performance. This is a situation where we really care about the constant factors. (We would be very happy if we could write a version of merge sort that is twice as fast, even if the asymptotic performance were still $O(n \log n)$.)

The goal for this part of the project is to develop a command-line tool that can be used to systematically evaluate the run-time performance of different sorting algorithms.

Your tool should conform to the following help message:

Option (* = required)         Description                                    
---------------------         -----------                                    
* -s <Integer: NUMBER>        Starting (smallest) input size                 
* -i <Integer: NUMBER>        Input size increment                           
* -m <Integer: NUMBER>        Maximum input size to test                     
* -t <Integer: NUMBER>        Number of trials for each input size           
-w [String: SORT1,SORT2,...]  Comma separated list of sorts. Options include
                                InsertionSort, SelectionSort, MergeSort,     
                                MergeSort1, QuickSort, IntroSort and TimSort.
                                Default is to execute all sorts.
-g [String: GENERATOR]        Sequence generator. Options include random,    
                                ordered or evil. The default is random

For example, we could compare the performance of quicksort and merge sort on small inputs by executing the tool as follows:

$ java SortProfiler -s 1000 -i 1000 -m 10000 -t 10 -w QuickSort,IntroSort

The resulting output would look something like the following:

N,	Quicksort,	IntroSort
1000,	0.00008631,	0.00000076
2000,	0.00018698,	0.00000068
3000,	0.00028221,	0.00000066
4000,	0.00041534,	0.00000069
5000,	0.00056695,	0.00000062
6000,	0.00063308,	0.00000063
7000,	0.00073361,	0.00000064
8000,	0.00084264,	0.00000063
9000,	0.00096724,	0.00000063
10000,	0.00108252,	0.00000064

Note that IntroSort is very fast in these results because it hasn’t been implemented yet. Your exact timing results will vary depending on the speed of your computer.

Specific functional requirements for the tool include the following:

Columns must be separated by a comma character followed by a tab character.
Each row (including the last) must be terminated with a newline character.
Averaged times must be reported in seconds with 8 decimal places of precision.
The order of the columns must match the order that the sorts appear on the command line.
There should be error checking for the command line arguments. In the case of negative integers or other incorrect arguments a help message should be printed and the application should exit.
The command-line arguments should be interpreted correctly regardless of their ordering.

Processing Command-Line Arguments

It’s lot of work to write code that properly handles command line arguments. Rather than tackling this problem from scratch, you are encouraged to make use of the JOpt Simple library. You’ll need to place this jar file on your build path to compile your project: jopt-simple-5.0.3.jar. The file CliDemo.java provides an example of using this library.

Testing Pseudocode

Your implementation should follow the pseudocode below:

Print column headers

For each input size n:
	For t trials:
		Generate a sequence of size n
		For each sort:
			Time the execution of that algorithm on a copy of the sequence

	For each sort
		Calculate and store the average running time across all t trials

	Output the data for the current row

Note that this pseudocode is organized so that each algorithm is tested against the same sequences. This results in a more “fair” comparison between the different algorithms, particularly when the number of trials is small.

Part 3 - Merge Sort Improvement

The merge algorithm described in our textbook consists of the following two stages:

1. Copy all values from the two merge-sorted halves int a temporary array.
2. Merge the values from the temporary array back into the original array.

If $n$ is the combined size of the two sub-arrays being merged, this algorithm requires a temporary array of size $n$, and requires $n$ assignment operations in stage 1. The following alternative approach cuts both of those values from $n$ to around $n/2$:

1. Copy the values from the first merge-sorted half to a temporary array.
2. Merge the values from the temporary array and the second
   merge-sorted half into the original array.

Here are some ASCII graphics illustrating the process:

Original sorted sub-arrays:

      ________________________________________
  ...| 1 | 3 | 5 | 7 | 9 | 2 | 4 | 6 | 8 | 10 |...
      ----------------------------------------
 start-^           mid-^               end-^

Temporary Array

      ___________________
     |   |   |   |   |   |...
      -------------------
     0-^

      ________________________________________
  ...| 1 | 3 | 5 | 7 | 9 | 2 | 4 | 6 | 8 | 10 |...
      ----------------------------------------
 start-^           mid-^               end-^

Copy the sorted first half into the temporary array:

         ___________________
        | 1 | 3 | 5 | 7 | 9 |...
         -------------------

         ________________________________________
     ...|   |   |   |   |   | 2 | 4 | 6 | 8 | 10 |...
         ----------------------------------------

Initialize i1 to be the index of the first position (0) in the temporary array, i2 to be the index of the first position of the sorted right half, and curr to be the next available position for merging:

         ___________________
        | 1 | 3 | 5 | 7 | 9 |...
         -------------------
       i1-^
         ________________________________________
     ...|   |   |   |   |   | 2 | 4 | 6 | 8 | 10 |...
         ----------------------------------------
     curr-^                i2-^

Since temp[i1] < items[i2], temp[i1] is copied to items[curr]. i1 and curr are incremented.

       ___________________
      | 1 | 3 | 5 | 7 | 9 |...
       -------------------
         i1-^
       ________________________________________
   ...| 1 |   |   |   |   | 2 | 4 | 6 | 8 | 10 |...
       ----------------------------------------
       curr-^            i2-^

Since items[i2] < new[i1], items[i2] is copied to items[curr]. i2 and next are incremented.

       ___________________
      | 1 | 3 | 5 | 7 | 9 |...
       -------------------
         i1-^
       ________________________________________
   ...| 1 | 2 |   |   |   | 2 | 4 | 6 | 8 | 10 |...
       ----------------------------------------
           curr-^            i2-^

Continue Merging until complete:

       ___________________
      | 1 | 3 | 5 | 7 | 9 |...
       -------------------
       ________________________________________
   ...| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |...
       ----------------------------------------

Use your sort profiling tool to test your finished implementation. You should not expect your improved merge sort to be dramatically faster than the provided implementation. Since array assignments are very fast relative to calls to compareTo, the speed impact will be minimal. The main advantage here is the savings in space.

Part 4 - Black Hat Sequence Generation

Quicksort is usually very fast. However for any quicksort implementation that selects a pivot point deterministically it is possible to construct an input sequence that will cause the running time to be $O(n^2)$.

For this part of the project you will channel your inner evil algorithmicist to write a method that maliciously generates hard-to-sort sequences. You will need to:

Analyze the provided quicksort algorithm to determine how it selects the pivot point.
Use this information to determine the characteristics of the worst possible input sequence.
Complete the generate method in GenerateEvil so that it produces these malicious inputs.

You must updated your profiling tool to use this generator when the appropriate command line argument is provided. It should be possible to use the tool to test the success of your evil method.

Part 5 - Introspective Sort

As mentioned above, quicksort is usually very fast. There are relatively few pathological inputs that result in $O(n^2)$ performance. The idea behind introspective sort is to perform a standard quick sort until there is evidence that the current input is pathological. If it looks like the current input will cause $O(n^2)$ performance, the algorithm switches strategies to a sorting algorithm with guaranteed $O(n \log n)$ performance. Typically, heap sort is used as the alternative, since it is in-place and takes $O(n \log n)$ time in the worst case. Since we haven’t yet studied heap sort, your version of introspective sort should use the improved merge sort from Part 3 as the alternate sorting algorithm. The resulting algorithm is almost as fast as quick sort for most inputs, but has a worst case performance of $O(n log n)$.

It is possible to detect pathological inputs by tracking the recursion depth of the quicksort algorithm. When quicksort is working well, the partition operation typically moves the pivot item somewhere near the center of the current sub-sequence. When this is the case, the maximum recursion depth will be $O(\log n)$. Therefore, introspective sort switches strategies when the recursion depth exceeds $2 \lfloor \log_2 n\rfloor$.

Note that the provided version of merge sort sorts a full array. Introspective sort needs to be able to call a version of merge sort that can sort a portion of a provided array. You’ll need to add a public method with the following signature to MergeSorter1:

public void sort(T[] items, int left, int right)

You should confirm that introspective sort performs well even in the case of pathological inputs.

Submission A - Sorts and Tests

Submit your unit tests along with all of the sorting classes (including those that you didn’t modify) through Web-CAT.

Submission B - Sorts and Tools

Submit all of your completed files through Web-CAT. You don’t need to upload the jar file for the JOpt Simple library.

Grading

Web-CAT Funcionality and Testing for Submission A	30%
Web-CAT Functionality and Testing for Submission B	30%
Web-CAT Style Checks	20%
Web-CAT Instructor Style Points	20%

^*The second approach has recently become popular. Both Python (starting in version 2.3) and Java (starting in version 7) use Timsort as their default sorting algorithm. Timsort is a modified version of merge sort that includes several enhancements: it uses a more space-efficient merge operation, it takes advantage of partially sorted arrays by finding and merging existing sorted regions, and it uses a non-recursive binary insertion sort to handle short unsorted regions.