Projects

2-week Assignments

1 - Project 1 - Multiset

The Java collections framework provides interfaces corresponding to some of the most commonly used abstract data types: List, Map, Set, etc. One abstract data type that is not included is the Multiset, which you will implement in this project.

The Multiset ADT is a generalization of the Set ADT that allows duplicate elements. Multisets are useful in cases where we don’t care about the order of the elements but we do need to maintain a count of how many times each element appears. For example, in a machine learning document analysis project, we may need to count the number of occurrences of each word in the document. Such distributions of word frequencies can be useful for problems like authorship detection and topic classification.

For this project, you will be making several modifications to the following Multiset implementation. These modifications will improve the usability and efficiency of this collection. Make sure you understand the provided code before attempting to implement the rest of the project.

Multiset.java

The project stages below should be completed in order. Each part builds on the work completed in the earlier stages.

Part 1 - Generics

Your first task is to modify the provided Multiset class so that it makes appropriate use of generics. The functionality should be unchanged, except that your improved class should accept a type parameter specifying an element type.

Once this phase of the project is complete, code like the following should compile and execute:

    Multiset<Character> chars = new Multiset<>();
    
    chars.add('a');
    chars.add('b', 3);
    chars.add('a');
    
    System.out.println(chars); // prints "[a x 2, b x 3]"

You should test your updated class by creating a test driver or some simple JUnit tests.
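
For example, a minimal JUnit 5 test along these lines could verify the behavior shown above. The test class name is illustrative, and the size assertion assumes your Multiset has the usual size method:

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

public class SimpleMultisetTest {

  @Test
  public void addUpdatesCountsAndToString() {
    Multiset<Character> chars = new Multiset<>();
    chars.add('a');
    chars.add('b', 3);
    chars.add('a');

    assertEquals(5, chars.size()); // two a's plus three b's (assumes a size method)
    assertEquals("[a x 2, b x 3]", chars.toString());
  }
}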

There is nothing to submit for Part 1.

Part 2 - Multiset Application

For this part of the project you will build a simple utility class that makes use of your generic multiset. Complete the following dice rolling class so that the methods conform to the provided Javadoc comments:

The submission tests will make use of a seed value to create predictable results, so be careful not to make any extraneous calls to the random number generator.
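
To see why this matters, recall that a seeded java.util.Random produces a completely repeatable sequence, so a single extra call shifts every later value. A quick illustration (not part of the required code):

    Random r1 = new Random(42);
    Random r2 = new Random(42);
    System.out.println(r1.nextInt(6) == r2.nextInt(6)); // true: same seed, same sequence

    r2.nextInt(); // one extraneous call...
    System.out.println(r1.nextInt(6) == r2.nextInt(6)); // ...and the sequences will likely disagree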

The following example illustrates the required format for the output of the plotRolls method. In this example the call was roller.plotRolls(1000, 3, 6, 10).

Rolling 3d6 1000 Times
3 (0.40%)   :
4 (1.60%)   :#
5 (3.10%)   :###
6 (4.70%)   :####
7 (7.50%)   :#######
8 (8.00%)   :########
9 (12.10%)  :############
10 (12.60%) :############
11 (13.90%) :#############
12 (11.50%) :###########
13 (9.10%)  :#########
14 (7.20%)  :#######
15 (4.30%)  :####
16 (2.30%)  :##
17 (1.40%)  :#
18 (0.30%)  :

Notice that the percentages are reported to two decimal places and that the labels for each row are left-justified within a field of length 12.
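
This layout can be produced directly with Java's format strings. Here is a sketch of printing one row, where total, percent, and hashes are illustrative variable names (the sample output suggests truncating the percentage to whole points when counting '#' characters):

    int total = 10;
    double percent = 12.60;
    String hashes = "#".repeat((int) percent); // one '#' per whole percentage point

    // "%.2f" reports the percentage to two decimal places;
    // "%-12s" left-justifies the label in a field of width 12.
    String label = String.format("%d (%.2f%%)", total, percent);
    System.out.printf("%-12s:%s%n", label, hashes); // prints "10 (12.60%) :############"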

Once you have completed Part 2, submit Roller.java and your updated Multiset.java to Gradescope.

Part 3 - Interfaces

ADT vs Data Structure

At this point, your Multiset collection is type-safe, but it violates the principle of separating implementation from interface. The Java Collections Framework provides some nice examples of putting this principle into action. The UML below shows a small subset of the classes and interfaces included in the collections framework. In this diagram the List interface expresses the methods that are associated with the List ADT. LinkedList and ArrayList are concrete classes that implement the List interface, each using a different underlying data structure. The various abstract classes provide default implementations for some methods to simplify the implementation of the concrete classes.

In this diagram, interfaces are colored gray, abstract classes are yellow, and concrete classes are white.

Multiset Interface

Your goal for this stage of the project is to split the existing Multiset class into an interface named Multiset and a concrete implementation named ArrayListMultiset. Your finished code should conform to the UML diagram below. The entities you need to create are enclosed by the dashed red line. The remaining classes are already part of the Java Collections Framework. You should not change the way any of the existing classes are implemented.

Notice that your completed Multiset interface must extend the java.util.Collection interface. The advantage of this is that any method anywhere that expects a Collection object will now work correctly when passed a Multiset. The challenge is that the Collection interface requires several methods that are not included in the provided Multiset class.
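
The declaration of the interface itself is short. A sketch follows (only one Multiset-specific method is shown; your interface must include everything shown in the UML diagram):

import java.util.Collection;

public interface Multiset<E> extends Collection<E> {

  /**
   * Add the indicated number of occurrences of an item.
   *
   * @param item the item to add
   * @param occurrences the number of occurrences to add
   */
  void add(E item, int occurrences);

  // ... any remaining Multiset-specific methods from the UML diagram ...
}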

The optional methods removeAll and retainAll must throw an UnsupportedOperationException.

You may use the following implementation of the hashCode method:

  /**
   * The role of the hashCode method will be addressed later in the semester.
   *
   * @return 1, always 1.
   */
  @Override
  public int hashCode() {
    return 1;
  }

All other Collection methods must be implemented, including those that are described as optional. Note that the java.util.AbstractCollection class provides default implementations for some of these methods. You should take advantage of that fact.
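
One way to structure this (a sketch, assuming the elements are stored in an ArrayList; the field name is illustrative) is to extend AbstractCollection, which supplies working versions of methods like contains, addAll, and toArray on top of the three methods below:

import java.util.AbstractCollection;
import java.util.ArrayList;
import java.util.Iterator;

public class ArrayListMultiset<E> extends AbstractCollection<E> implements Multiset<E> {

  private ArrayList<E> elements = new ArrayList<>();

  @Override
  public boolean add(E item) { // AbstractCollection's add throws by default
    return elements.add(item);
  }

  @Override
  public Iterator<E> iterator() {
    return elements.iterator();
  }

  @Override
  public int size() {
    return elements.size();
  }

  // ... the Multiset-specific methods still need to be implemented ...
}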

You may use the following JUnit tests to confirm that your code matches the specification.

  • MultisetTest.java This abstract class defines testing methods for each of the methods included in the Multiset interface. Any correct implementation should pass these tests.
  • ArrayListMultisetTest.java Subclass of MultisetTest that can be used to test your ArrayListMultiset implementation.

Part 4 - Building a Better Data Structure

The modifications so far have improved the usability of our original Multiset class, but they have not improved the efficiency. The current implementation is terribly inefficient in terms of both space and time.

Consider the following code snippet:

    Multiset<String> set = new ArrayListMultiset<>();
    for (int i = 0; i < 10000; i++) {
       set.add("Harrisonburg");
    }

After this code executes, the ArrayList inside set will contain 10,000 entries, all referring to the string “Harrisonburg”.

It would be much more efficient to store a single copy of “Harrisonburg” along with a counter. Subsequent calls to add would then increment the counter without actually increasing the number of objects stored in the ArrayList.

For this stage of the project you will create a new implementation of the Multiset interface named CounterMultiset. Instead of storing duplicate objects, this class should store Pair objects representing counters. The resulting class hierarchy should match the following UML:

Once again, you must create the files inside the dashed red line.
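
As one illustration of the counter idea, the single-argument add might look something like the sketch below. It assumes a field named pairs of type ArrayList<Pair<E, Integer>>, a size field, and Pair accessors named getFirst, getSecond, and setSecond; check the provided Pair class for the actual names:

@Override
public boolean add(E item) {
  // Look for an existing counter for this item.
  for (Pair<E, Integer> pair : pairs) {
    if (pair.getFirst().equals(item)) {
      pair.setSecond(pair.getSecond() + 1); // bump the count
      size++;
      return true;
    }
  }
  // No counter yet: store a new (item, 1) pair.
  pairs.add(new Pair<>(item, 1));
  size++;
  return true;
}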

You may test your finished implementation using the following JUnit tests:

Implementation Tips

  1. Don’t be fooled by the fact that most of the methods in Multiset.java are simple one-liners. This will not be the case for CounterMultiset.java. For example, you will need to create a custom Iterator class for CounterMultiset. You won’t be able to use the existing iterator of the ArrayList class.

  2. Use the @Override annotation where appropriate! This has several benefits. It will prevent you from making some sneaky coding errors (overloading or duplicating methods instead of overriding them). It will also allow you to skip Javadoc comments for overridden methods: a comment is generally unnecessary as long as the description in the superclass or interface adequately describes the method’s behavior.

  3. Don’t forget everything you learned in CS 149 and CS 159!

    • Code duplication is bad: use helper methods where appropriate.
    • Code duplication is bad: use abstract superclasses to handle functionality that is the same across multiple sibling classes. For example, do ArrayListMultiset and CounterMultiset both need identical hashCode methods?
    • Unnecessary code is bad: Do you need to implement the addAll method? Or can you inherit from AbstractCollection to take advantage of an existing implementation?

Ignoring the advice above will negatively impact your grade, even if your code passes all of the submission tests.

Part 5 - Demonstration Workload

It was claimed above that the counter implementation is much more efficient than the ArrayList implementation for the Multiset ADT. Strictly speaking, this claim depends on the actual sequence of operations performed. Take a few minutes to think about scenarios where the CounterMultiset will have an advantage and scenarios where it will not, then complete the following driver class so that it conforms to the Javadoc comments.

Submission and Grading

The project will be graded as follows:

Project Part                                      Weight
------------------------------------------------  ------
Gradescope Functionality Tests for Parts 1 & 2    10%
Gradescope Functionality Tests for Parts 3 & 4    60%
Gradescope Style Checks for Parts 3 & 4           10%
Gradescope Timing Tests for Part 5                10%
Gradescope Instructor Style Points                10%

Submissions that fail any functionality tests will receive at most 75% of the available points for that component of the grade.

Looking Ahead

The counter-based Multiset that you developed is, in general, much more efficient than the terrible implementation we provided. This raises the question of whether it is possible to develop an even more efficient implementation. It is. We’ll see several data structures later in the semester that could be used to create much better implementations of this ADT.

Acknowledgements

The Multiset interface used for this project is based on a similar interface included with Google’s Guava Java libraries.

2 - Project 2 - Hybrid Deque

The Double Ended Queue, or Deque (pronounced “deck”), is an abstract data type that allows elements to be appended and removed from either end of a sequence. The Java collections framework includes a Deque interface (https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/Deque.html) along with several implementing classes. These include the ArrayDeque, implemented using a circular dynamic array, and LinkedList, implemented using a doubly-linked list.

Files

The following files are provided:

  • AbstractDeque.java - The Deque interface has three super-interfaces: Collection, Iterable and Queue. Any class that implements the Deque interface must provide concrete versions of the methods declared in Deque as well as all of its super-interfaces. In all, this is around 40 methods. The AbstractDeque class facilitates the creation of new Deque implementations by providing default implementations for many of these methods. YOU SHOULD NOT NEED TO MODIFY THIS CLASS.
  • HybridDeque.java - This is the starter code for your Deque implementation. Your implementation must conform to the implementation notes included in the comments for this file.

Introduction

Note that the Deque is not a widely used abstract data type. In most cases it makes more sense to use either a Stack or a Queue, depending on the application. The most commonly cited example of an application that does require a Deque is the work-stealing algorithm.

The goal of this assignment is to develop a new Deque implementation that satisfies the following design requirements:

  • Worst-case \(\Theta(1)\) append and remove operations at either end.
  • Efficient use of space for large collections.

Neither ArrayDeque nor LinkedList satisfies both of these requirements. While the ArrayDeque provides amortized \(\Theta(1)\) append and remove operations, the need to resize the underlying array means that appending will occasionally require \(\Theta(n)\) steps. The LinkedList does provide \(\Theta(1)\) append and remove in the worst case, but it is quite inefficient in terms of space. The need to store backward and forward references for each item means that as much as 2/3 of the required memory is “wasted” overhead.

Part 1 - Hybrid Deque

For this project we will develop a HybridDeque class that combines some of the advantages of contiguous and linked data structures. Our implementation is inspired by the Python deque collection. The following description is taken from the internal documentation for that class:

Data for deque objects is stored in a doubly-linked list of fixed length blocks. This assures that appends or pops never move any other data elements besides the one being appended or popped.

Textbook implementations of doubly-linked lists store one datum per link, but that gives them a 200% memory overhead (a prev and next link for each datum) and it costs one malloc() call per data element. By using fixed-length blocks, the link to data ratio is significantly improved and there are proportionally fewer calls to malloc() and free(). The data blocks of consecutive pointers also improve cache locality.

Essentially, our Deque implementation will be a doubly-linked list where each list node contains multiple data elements. Part 1 of the project involves implementing all HybridDeque methods that do not involve removing elements from the middle of the collection.
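
Concretely, the node type might look something like the following sketch (the names here are illustrative; your implementation must conform to the notes in HybridDeque.java):

// One node in the doubly-linked list. Instead of one element per node,
// each block holds BLOCK_SIZE element slots.
private class Block {
  private Object[] elements = new Object[BLOCK_SIZE];
  private Block next;
  private Block prev;
}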

Part 2 - Mid-Deque Removal

While the Deque interface is primarily designed to allow access to elements at the ends of the sequence, there are a few methods that allow an item to be removed from the middle. These are removeLastOccurrence, removeFirstOccurrence and the remove methods of the two iterators. These methods are relatively challenging to implement because removing an element from the middle of the sequence requires shifting all subsequent elements left to fill the vacated space.

Some issues to keep in mind while implementing removal:

  • Don’t code the removal logic four times! You should create a helper method to handle removal and use it as needed in your other methods.
  • The most efficient implementation would first determine whether the item to be removed is nearer to the beginning or the end of the sequence. Elements near the beginning would be removed by shifting preceding elements to the right while elements near the end would be removed by shifting subsequent elements to the left. YOUR CODE DOESN’T NEED TO HANDLE THESE TWO CASES DIFFERENTLY. For the sake of simplicity, we suggest that you always shift elements to the left to handle removal.

Implementation Advice

Cursor Class

One of the challenges in implementing the HybridDeque class is that it requires two pieces of information to specify a location in the Deque: a block reference and an index into that block. The implementation may be simplified by creating a private inner class that encapsulates both pieces of information and handles the logic of stepping forward and backward. Feel free to make use of the provided inner Cursor class if you would like to take this approach.
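
The idea is roughly the following (a sketch; the provided Cursor class may differ in its details):

// A (block, index) pair identifying one slot in the deque.
private class Cursor {
  private Block block;
  private int index;

  public Cursor(Block block, int index) {
    this.block = block;
    this.index = index;
  }

  // Step forward one slot, moving to the next block when we
  // walk off the end of the current one.
  public Cursor next() {
    if (index + 1 < BLOCK_SIZE) {
      return new Cursor(block, index + 1);
    }
    return new Cursor(block.next, 0);
  }
}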

Smaller Block Sizes

The default block size in the provided code is 64. This follows the original Python implementation and represents a trade-off: A large block size will waste a significant amount of space for small collections that don’t fill an entire block. A small block size will increase the overhead required for forward and backward references when storing large collections.

While 64 represents a good block size in terms of space efficiency, it probably isn’t ideal for tracing and debugging your code during the implementation stage. Bugs are likely to pop up at block boundaries, and it is inconvenient if you need to step through dozens of operations before reaching a boundary. We suggest changing the block size to 4 or 8 while you work on your implementation. You should switch back to 64 when you submit.

Drawing Pictures

This assignment will be impossible if you don’t develop a good mental model of how the data structure is organized. The readiness quiz will require that you take some time to draw the memory layout for some sample HybridDeque objects before you start coding. You will be asked to draw a HybridDeque at each of the indicated points in the sample code below.

HybridDeque.setBlockSize(8);
HybridDeque<String> deque = new HybridDeque<>();

// Draw the deque now... (This one is done for you.)

deque.offerLast("A");

// Draw the deque now...

deque.offerLast("B");
deque.offerLast("C");
deque.offerLast("D");
deque.offerLast("E");

// Draw the deque now...

deque.pollFirst();
deque.pollFirst();
deque.pollFirst();
deque.pollFirst();

// Draw the deque now...

Here is an example of what the correct memory diagram would look like for the initial step above:

Existing Implementation

As we mentioned above, the design of the HybridDeque class is based on the C implementation of Python’s deque collection. You are welcome to examine the existing C implementation as you work on your code. It probably isn’t a good idea to attempt a line-for-line translation from C to Java, but the C code might provide some helpful ideas.

Submission and Grading

The project will be graded as follows:

Project Part                       Weight
---------------------------------  ------
Gradescope Readiness Quiz          10%
Gradescope Functionality Part 1    55%
Gradescope Functionality Part 2    10%
Student JUnit Tests                10%
Gradescope Style Checks            5%
Instructor Style Points            10%

Submissions that fail any functionality tests will receive at most 75% of the available points for that component of the grade.

No late submissions will be accepted for the readiness quiz.

Note that we will not be providing copies of our Gradescope submission tests for this assignment. In order to get full credit for the testing portion of the grade, you must submit JUnit version 5 tests that provide full method coverage of the HybridDeque class. Recall that method coverage is a low bar for testing: it only requires that each method be executed by at least one of your tests. Our intention is to give you the responsibility for testing and debugging your own code without requiring you to submit a comprehensive test suite.

You will be limited to 10 free Gradescope submissions for this assignment. Your project grade will be reduced by 1% for each submission beyond 10.


The Gradescope submission tests will include checks for efficiency as well as correctness. In other words, if you code a \(\Theta(1) \) operation in such a way that it requires \(\Theta(n) \) time to execute, your implementation will be considered incorrect even if it produces the expected result.

Acknowledgements

The HybridDeque project was developed by Nathan Sprague.

3 - Project 3 - Best Sorting

There is no single best sorting algorithm. Quicksort is the fastest known comparison-based sorting algorithm when applied to large, unordered sequences. It also has the advantage of being an in-place (or nearly in-place) sort. Unfortunately, quicksort has some weaknesses: its worst-case performance is \(\Theta(n^2)\), and it is not stable. Merge sort shares neither of these disadvantages: it is stable and it requires \(n \log n\) steps in the worst case. Unfortunately, merge sort requires \(\Theta(n)\) additional space and it runs more slowly than quicksort on most inputs.

The objective of this assignment is to write modified versions of merge sort and quicksort that exhibit better performance than the standard versions described in our textbook. You will implement three improvements (two for merge sort and one for quicksort) that should improve some aspect of the algorithm’s performance without requiring a great deal of additional code. You will also experimentally evaluate the impact of your improvements.

Files

The following zip file contains starter code, including the sort implementations provided by our textbook.

This zip file has the following contents:

  • Sorting algorithms:
    • BasicSorts.java - Includes selection sort and insertion sort. (UNFINISHED. See Part 2 below.)
    • MergeSort.java - Includes merge sort.
    • QuickSort.java - Includes quicksort.
    • MergeSortImproved.java (UNFINISHED. See Parts 1 & 2 below.)
    • IntrospectiveSort.java (UNFINISHED. See Part 3 below.)
  • An application for timing sorting algorithms:
    • SortProfiler.java
    • Sorter.java - Functional Interface for sorting methods.
    • Generator.java - Functional Interface for generator methods.
    • Generators.java - Static methods for generating test data.

Note that SortProfiler relies on the JOpt Simple library for handling command-line arguments. You will need to have jopt-simple-5.0.4.jar in your classpath to use the application.

Part 1 - Improved Merges

The merge algorithm described in our textbook consists of the following two stages:

  1. Copy all values from the two pre-sorted halves into a temporary array.
  2. Merge the values from the temporary array back into the original array.

If \(n\) is the combined size of the two sub-arrays being merged, this algorithm requires a temporary array of size \(n\), and requires \(n\) assignment operations in stage 1. The following alternative approach cuts both of those values from \(n\) to around \(n/2\):

  1. Copy the values from the first pre-sorted half to a temporary array.
  2. Merge the values from the temporary array and the second pre-sorted half into the original array.*

Here are some ASCII graphics illustrating the process:

Original sorted sub-arrays:
      ________________________________________
  ...| 1 | 3 | 5 | 7 | 9 | 2 | 4 | 6 | 8 | 10 |...
      ----------------------------------------
 start-^           mid-^               end-^
Temporary Array
      ___________________
     |   |   |   |   |   |...
      -------------------
     0-^

      ________________________________________
  ...| 1 | 3 | 5 | 7 | 9 | 2 | 4 | 6 | 8 | 10 |...
      ----------------------------------------
 start-^           mid-^               end-^
Copy the sorted first half into the temporary array:
         ___________________
        | 1 | 3 | 5 | 7 | 9 |...
         -------------------

         ________________________________________
     ...|   |   |   |   |   | 2 | 4 | 6 | 8 | 10 |...
         ----------------------------------------
Initialize `i1` to be the index of the first position (0) in the temporary array, `i2` to be the index of the first position of the sorted right half, and `curr` to be the next available position for merging:
         ___________________
        | 1 | 3 | 5 | 7 | 9 |...
         -------------------
       i1-^
         ________________________________________
     ...|   |   |   |   |   | 2 | 4 | 6 | 8 | 10 |...
         ----------------------------------------
     curr-^                i2-^
Since `temp[i1] < items[i2]`, `temp[i1]` is copied to `items[curr]`. `i1` and `curr` are incremented.
       ___________________
      | 1 | 3 | 5 | 7 | 9 |...
       -------------------
         i1-^
       ________________________________________
   ...| 1 |   |   |   |   | 2 | 4 | 6 | 8 | 10 |...
       ----------------------------------------
       curr-^            i2-^
Since `items[i2] < temp[i1]`, `items[i2]` is copied to `items[curr]`. `i2` and `curr` are incremented.
       ___________________
      | 1 | 3 | 5 | 7 | 9 |...
       -------------------
         i1-^
       ________________________________________
   ...| 1 | 2 |   |   |   | 2 | 4 | 6 | 8 | 10 |...
       ----------------------------------------
           curr-^            i2-^
Continue Merging until complete:
       ___________________
      | 1 | 3 | 5 | 7 | 9 |...
       -------------------
       ________________________________________
   ...| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |...
       ----------------------------------------
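
In Java, this merge might look something like the following sketch. The method and variable names are illustrative (your method must match the declaration in MergeSortImproved.java), and the two sorted runs are assumed to be items[start..mid] and items[mid+1..end], inclusive:

@SuppressWarnings("unchecked")
public static <T extends Comparable<T>> void mergeHalf(T[] items, int start, int mid, int end) {
  // Stage 1: copy only the first run into a temporary array of size ~n/2.
  T[] temp = (T[]) new Comparable[mid - start + 1];
  System.arraycopy(items, start, temp, 0, temp.length);

  int i1 = 0;        // next unmerged element in temp
  int i2 = mid + 1;  // next unmerged element in the second run
  int curr = start;  // next open position in items

  // Stage 2: merge temp and the second run back into items.
  while (i1 < temp.length && i2 <= end) {
    if (temp[i1].compareTo(items[i2]) <= 0) { // <= keeps the merge stable
      items[curr++] = temp[i1++];
    } else {
      items[curr++] = items[i2++];
    }
  }

  // Copy any leftovers from temp. (If temp is exhausted first instead,
  // the remaining second-run elements are already in their final positions.)
  while (i1 < temp.length) {
    items[curr++] = temp[i1++];
  }
}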

Use the sort profiling tool to evaluate your finished implementation. You should not expect your improved merge sort to be dramatically faster than the provided implementation. Since array assignments are very fast relative to calls to compareTo, the speed impact will be minimal. The main advantage here is the savings in space.

Part 2 - Switching Strategies

The next improvement is based on the observation that merge sort is actually slower than simple \(\Theta(n^2)\) sorts for small input sizes. This may seem surprising given that merge sort is a \(\Theta(n \log n)\) sorting algorithm. However, it is important to keep in mind that asymptotic analysis is only concerned with rates of growth. A \(\Theta(n \log n)\) algorithm will always be faster than a \(\Theta(n^2)\) algorithm eventually, but that doesn’t mean the \(\Theta(n^2)\) algorithm can’t be faster for small inputs. The following figure was created by timing merge sort and insertion sort on small randomly ordered arrays from size 2 to size 150:

As you can see, insertion sort is faster until around \(n=100 \). At that point, merge sort becomes faster and it remains faster for all larger inputs.

As a reminder, the following pseudocode describes the overall logic of the merge sort algorithm:

    merge_sort(sub-array)
        If sub-array has more than one entry:
            Recursively merge_sort the left half
            Recursively merge_sort the right half
            Merge the two sorted halves.

This logic recursively splits the original array into smaller and smaller sub-arrays until the recursion bottoms out at sub-arrays of size one. This means that every time a large array is sorted, there are many recursive calls to merge sort that have small input sizes. In light of the figure above, that approach doesn’t make much sense: merge sort is not a competitive sorting algorithm on small inputs. It would make more sense to recursively break the input into smaller and smaller pieces until some threshold is reached, and then switch strategies to a sorting algorithm that is more efficient on those small inputs.

The following pseudocode describes this alternate approach:

    merge_sort(sub-array)
        If sub-array has fewer than MERGE_SORT_THRESHOLD entries:
            Sort the sub-array with insertion sort. 
        Otherwise: 
            Recursively merge_sort the left half
            Recursively merge_sort the right half
            Merge the two sorted halves.

Choosing an appropriate value for MERGE_SORT_THRESHOLD requires some experimentation. One of the requirements of this assignment is that you select an appropriate threshold value and provide data to justify your choice. Simply selecting the crossover point based on the figure above is not a reasonable strategy. The point where the two lines cross represents the input size where the two sorts are equally fast. Switching strategies at that input size will result in no speedup at all.

The only way to determine an appropriate choice for MERGE_SORT_THRESHOLD is to systematically experiment with different values until you find the one that leads to the best overall sorting performance. You should create a driver method that iterates over different values of MERGE_SORT_THRESHOLD and uses SortProfiler to evaluate each. The following minimal driver class illustrates how SortProfiler objects can be used directly without invoking the command line application:

import java.util.List;

public class ProfileDriver {
  public static void main(String[] args) {

    // Create a SortProfiler object.
    // See the JavaDoc for an explanation of the parameters.
    SortProfiler sp = new SortProfiler(List.of(MergeSort::mergeSort),
                                       List.of("mergesort"),
                                       1, 10, 100, 100,
                                       Generators::generateRandom);

    // Run the profiler and send the output to stdout.
    sp.run(System.out);
  }

}

Finally, note that we have provided an implementation of insertion sort that works on a full array, but you will need a version that works on a sub-array for this part of the assignment.
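
A sub-array version is a small change to the standard algorithm. Here is a sketch that sorts items[start..end] inclusive (the method name is illustrative; match whatever declaration the starter code expects):

public static <T extends Comparable<T>> void insertionSubsort(T[] items, int start, int end) {
  for (int i = start + 1; i <= end; i++) {
    T toInsert = items[i];
    int j = i - 1;
    // Shift larger elements right to open a slot for toInsert.
    while (j >= start && items[j].compareTo(toInsert) > 0) {
      items[j + 1] = items[j];
      j--;
    }
    items[j + 1] = toInsert;
  }
}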

Part 3 - Introspective Sort

As mentioned above, quicksort is usually very fast. There are relatively few pathological inputs that result in \(\Theta(n^2)\) performance. The idea behind introspective sort (https://en.wikipedia.org/wiki/Introsort) is to perform a standard quicksort until there is evidence that the current input is pathological. If it looks like the current input will cause \(\Theta(n^2)\) performance, the algorithm switches strategies to a sorting algorithm with guaranteed \(\Theta(n \log n)\) performance. Typically, heap sort is used as the alternative, since it is in-place and takes \(\Theta(n \log n)\) time in the worst case. Since we haven’t yet studied heap sort, your version of introspective sort should use the improved merge sort from Part 2 as the alternate sorting algorithm. The resulting algorithm is almost as fast as quicksort for most inputs, but has a worst-case performance of \(\Theta(n \log n)\).

It is possible to detect pathological inputs by tracking the recursion depth of the quicksort algorithm. When quicksort is working well, the partition operation typically moves the pivot item somewhere near the center of the current sub-sequence. When this is the case, the maximum recursion depth will be \(\log n\). Therefore, introspective sort switches strategies when the recursion depth exceeds \(2 \lfloor \log_2 n \rfloor\).

There are a number of possible variations on the basic Introsort algorithm. Your implementation should conform to the following pseudocode:

introspective_sort(array)
    introspective_sort(array, 2 ⌊log₂(array.length)⌋)

introspective_sort(sub-array, depth-limit)
    If depth-limit is 0
        merge_sort(sub-array)
    Otherwise
        If sub-array has 0 or 1 entries
            return
        Otherwise
            pivot ← partition(sub-array)
            introspective_sort(sub-array[0 to pivot - 1], 
                               depth-limit - 1)
            introspective_sort(sub-array[pivot + 1 to end],
                               depth-limit - 1) 

Note that the provided version of merge sort sorts a full array. Introspective sort needs to be able to call a version of merge sort that can sort a portion of a provided array. The MergeSortImproved class provides a declaration for such a method.
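
Translated into Java, the pseudocode might look something like the sketch below. The partition call is assumed to behave like the one in the provided QuickSort.java, and mergeSubsort stands in for the sub-array merge sort declared in MergeSortImproved; both names are assumptions, so check the starter code for the actual declarations:

public static <T extends Comparable<T>> void introspectiveSort(T[] items) {
  if (items.length <= 1) {
    return;
  }
  // Approximately 2 * floor(log2(length)); floating-point rounding is
  // close enough for a depth limit.
  int depthLimit = 2 * (int) (Math.log(items.length) / Math.log(2));
  introspectiveSort(items, 0, items.length - 1, depthLimit);
}

private static <T extends Comparable<T>> void introspectiveSort(T[] items, int left,
    int right, int depthLimit) {
  if (depthLimit == 0) {
    // Recursion is too deep: this input looks pathological, so fall back
    // to the guaranteed n log n sub-array merge sort from Part 2.
    MergeSortImproved.mergeSubsort(items, left, right); // assumed name
  } else if (left < right) {
    int pivot = partition(items, left, right); // as in QuickSort.java
    introspectiveSort(items, left, pivot - 1, depthLimit - 1);
    introspectiveSort(items, pivot + 1, right, depthLimit - 1);
  }
}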

Part 4 - Experimental Analysis

So far in this course we have focused on asymptotic analysis. This usually gives us the information we want. When deciding between two algorithms we generally prefer the one with better asymptotic performance.

Our goals for this project are somewhat different. The focus now is on fine-tuning existing algorithms to improve their performance. This is a situation where we really care about the constant factors. (We would be very happy if we could write a version of merge sort that is twice as fast, even if the asymptotic worst-case performance were still \(\Theta(nlogn) \).)

In order to help you evaluate your sorting improvements we are providing a command-line tool that can be used to systematically evaluate the run-time performance of different sorting algorithms.

Running the tool with no arguments should produce the following help message:

Option (* = required)         Description                                   
---------------------         -----------                                   
* -s <Integer: NUMBER>        Starting (smallest) input size                
* -i <Integer: NUMBER>        Input size increment                          
* -m <Integer: NUMBER>        Maximum input size to test                    
* -t <Integer: NUMBER>        Number of trials for each input size          
-w [String: SORT1,SORT2,...]  Comma separated list of sorts. Options include
                                insertion, selection, merge, merge_half,    
                                merge_adaptive, quick, introspective and    
                                timsort. Default is to execute all sorts.   
-g [String: GENERATOR]        Sequence generator. Options include random,   
                                ordered or evil. The default is random      

For example, we could compare the performance of quicksort and introspective sort on small inputs by executing the tool as follows:

$ java SortProfiler -s 1000 -i 1000 -m 10000 -t 10 -w quick,introspective

The resulting output would look something like the following:

N,  quick,  introspective
1000,   0.00009277, 0.00000085
2000,   0.00021152, 0.00000086
3000,   0.00030850, 0.00000088
4000,   0.00043710, 0.00000090
5000,   0.00055941, 0.00000088
6000,   0.00068144, 0.00000087
7000,   0.00081457, 0.00000091
8000,   0.00095075, 0.00000087
9000,   0.00105827, 0.00000089
10000,  0.00117701, 0.00000087

(Introspective sort is very fast here because it hasn’t been implemented yet.)

For this part of the project you will prepare a short document that quantifies the performance improvements you were able to achieve. This document should consist of the following figures:

  • Figure 1 - Improved Merges. This figure should compare the performance of mergeSortHalfSpace to the original merge sort implementation as well as quicksort. Results should be shown for both random and ordered inputs. To avoid clutter, it may be helpful to plot the results for random inputs and ordered inputs separately.
  • Figure 2 - Threshold Selection. This figure should show the data you used to select an appropriate threshold for switching strategies in mergeSortAdaptive.
  • Figure 3 - Switching Strategies. This figure should compare your tuned mergeSortAdaptive to the previous two merge sort implementations as well as quicksort. Results should be shown for both random and ordered inputs.
  • Figure 4 - Introspective Sort (Random Inputs). This figure should compare introspective sort to quicksort for randomly generated inputs.
  • Figure 5 - Introspective Sort (Pathological Inputs). This figure should compare introspective sort to quicksort for worst-case quicksort inputs.

Each figure must have clearly labeled axes, a caption, and appropriate legends. There must be a paragraph describing the results illustrated in each figure. The description must provide enough information for the reader to understand the point of the experiment and how the data supports the conclusion.

Submission

Submit the following five files through Gradescope:

  • BasicSorts.java
  • MergeSortImproved.java
  • IntrospectiveSort.java
  • SortAnalysis.pdf - The document containing your analysis
  • WhateverYouChooseToCallIt.java - The driver class that you used to select MERGE_SORT_THRESHOLD.

Test your code locally at each stage of development, and reason about its correctness. When you are reasonably confident in your solution, submit your code to Gradescope to run the autograder’s unit tests. There will be a point deduction for excessive Gradescope submissions.

Grading

Gradescope Tests for Improved Merge        20 points
Gradescope Tests for Switching Strategies  20 points
Gradescope Tests for Introspective Sort    20 points
Experimental Results Document              25 points
Instructor Style Points                    10 points
Gradescope Style Checks                     5 points
Excessive Submission Deduction             -1 for each submission beyond 10

Any submission that doesn’t pass all of the instructor unit tests will receive at most 75% of the points for that component of the grade (45/60).

The grading of your evaluation document will be based on the quality of the figures as well as the clarity of your writing.


*The second approach has recently become popular. Both Python (starting in version 2.3) and Java (starting in version 7) use Timsort as their default sorting algorithm. Timsort is a modified version of merge sort that includes several enhancements: it uses a more space-efficient merge operation, it takes advantage of partially sorted arrays by finding and merging existing sorted regions, and it uses a non-recursive binary insertion sort to handle short unsorted regions.

4 - Project 4 - Huffman Coding

The goal for this project is to develop a file compression utility to compete with applications like 7-Zip, gzip, and WinZip.

MadZip

You must develop a utility class named MadZip with two public static methods: zip and unzip. These methods are described below.

zip

  • The zip method must accept two java.io.File arguments. The first is the location of the file that should be compressed, and the second is the location where the compressed version of the file will be saved.

  • This method must perform the full Huffman coding process:

    • It must determine the frequencies of all bytes in the source file.
    • It must build a Huffman tree with the help of a Min-Heap. (You are welcome to use your own MinHeap class, or Java’s built-in PriorityQueue.)
    • It must use the Huffman tree to build a mapping from bytes to byte-encodings.
    • It must use that mapping to generate a compressed version of the original file. This file will include both the frequency data and a sequence of bits representing the encoded version of the original file. (See below for details of the file format.)
  • This method must throw an IOException if the source file cannot be read or the destination cannot be written to.

  • The return type must be void.

  • This method must overwrite the destination file if it already exists.

  • This method must not modify the source file.

unzip

  • The unzip method must accept two java.io.File arguments. The first is the location of a previously zipped file, and the second is the location where the un-compressed version of the file should be saved.

  • This method must perform the full Huffman decoding process:

    • It must reconstruct the Huffman tree from the frequencies stored in the compressed version of the file.
    • It must use that Huffman tree to decode the encoded bit sequence in the compressed file, saving all of the recovered bytes to the destination file.
  • This method must throw an IOException if the source file cannot be read or the destination file cannot be written to. It must throw a ClassNotFoundException if that exception occurs during deserialization. (See below for more information on deserialization.)

  • The return type must be void.

  • This method must overwrite the destination file if it already exists.

  • This method must not modify the source file.

Notice that this specification leaves most of the design up to you. It will be your responsibility to decompose the problem into an appropriate set of classes and methods. Take some time to develop your design before you start coding! You may find it helpful to create stubbed-out versions of your classes as a first step.

MadZip App

We are providing a Java GUI wrapper for your zip and unzip methods:

As part of your testing, you should use the GUI to zip several files, unzip those files with a different name, and confirm that the unzipped versions are identical to the original files.

There are many tools that will report whether two files are identical. For example, the diff command-line tool should be available by default under Linux and OS X.

File Format

You should already have experience reading and writing text files in Java. You may not have experience handling file I/O with binary data. For this application it will be necessary to both read and write binary data. We will need to read binary files because our compression tool should be able to compress any file, whether or not it contains plain-text. We will need to create binary files because the Huffman coding process results in a binary encoding. Saving a series of 0’s and 1’s as ASCII characters would not reduce the overall file size since we would end up using eight bits to store each individual bit of encoded data.

We are providing two Java files to help with efficiently manipulating and storing binary data.

  • BitSequence.java This class represents an arbitrarily long sequence of bits. You will use this class to represent the Huffman-encoded version of the input file. (You should not try to use it to store the data that is read in from the uncompressed file.)

  • HuffmanSave.java This is a simple container class that stores a BitSequence along with a HashMap of frequency data. Recall that decoding the bit sequence created during the Huffman coding process requires us to have access to the Huffman tree that was used to create the code. A HuffmanSave object stores all of the information necessary to reconstruct a compressed file. The frequency data can be used to rebuild the Huffman tree and the BitSequence stores the encoded data.

For this application the compressed files must be serialized versions of HuffmanSave objects. Object serialization is the process of converting a complete Java object into a binary form so that it can be saved to a file or communicated over a network. This tutorial (https://www.tutorialspoint.com/java/java_serialization.htm) provides some background on serialization and shows examples of saving and loading Java objects.
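
Saving and loading a HuffmanSave can be done with Java's standard object streams. Here is a sketch, assuming java.io imports, File variables named source and destination, and a HuffmanSave constructor that takes the bit sequence and frequency map (check HuffmanSave.java for the real signatures):

// Writing: serialize the complete HuffmanSave object to the destination.
try (ObjectOutputStream out =
    new ObjectOutputStream(new FileOutputStream(destination))) {
  out.writeObject(new HuffmanSave(bits, frequencies)); // constructor is an assumption
}

// Reading: deserialize and cast back to HuffmanSave.
try (ObjectInputStream in =
    new ObjectInputStream(new FileInputStream(source))) {
  HuffmanSave save = (HuffmanSave) in.readObject(); // may throw ClassNotFoundException
}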

We will follow the convention that compressed files should have a .mz file extension, but this is not a requirement that should be enforced by your zip method.

The files BitSequence.java and HuffmanSave.java must NOT be modified.

Reading Binary Data

You probably have the most experience reading files using the Scanner class. Recall that every file is really just a long sequence of bytes. The Scanner class is useful because it allows us to interpret those bytes as ints, doubles, Strings etc. For this project, we don’t care what the bytes actually represent. This means there is no reason to use the Scanner class. Instead you should use the read method of FileInputStream to process the bytes in the file directly. Wrapping your FileInputStream in a BufferedInputStream will significantly improve file I/O performance for large files.

Be careful! The documentation for the read method states that it “reads a byte of data from the input stream”, but the return type is actually an int. The read method returns a value of -1 (which is not a valid byte) if there are no more bytes available in the stream. You can cast the return value to a byte, but you need to check for the -1 before you do so.
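
A typical read loop therefore looks something like this sketch (file is a java.io.File; the frequency-counting step is left as a comment):

try (BufferedInputStream in =
    new BufferedInputStream(new FileInputStream(file))) {
  int next;
  // read() returns a value in 0-255, or -1 at end of stream.
  while ((next = in.read()) != -1) {
    byte b = (byte) next; // only safe to cast after the -1 check
    // ... update the frequency count for b ...
  }
}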

Space Efficiency

Keep in mind that we might be interested in compressing very large files. This means that we should avoid reading the entire contents of a file into memory. In other words, don’t use the read(byte[] b) method of the FileInputStream, and don’t save all of the bytes into an ArrayList as they are read. Instead, process each byte as it is read from the file. This means that you will end up reading the file twice during the compression process: once to determine the byte frequencies, and again to create the encoded bit sequence.

Under the current design, we will need to store the entire contents of the compressed version of the file in memory, since we need to completely build the HuffmanSave object before it can be saved.

Starter Code

Our textbook provides a partial implementation of a Huffman tree. You are welcome to use that code as a starting point, but you should be aware that it is incomplete and will require some modifications. For example, the leaves in the textbook implementation store char values. Your code will need to store bytes. There are also some questionable design choices in the textbook code. For example, HuffBaseNode is an interface, but would make more sense as an abstract class.

Handling Ties

One issue that needs to be addressed when developing a Huffman tree is how to handle ties: which trees should be selected when more than two options are tied for the lowest frequency? In one sense, it doesn’t matter. In the case of a tie any valid choice will result in the same encoding length.

On the other hand different choices in tie-breaking will result in different trees and thus different encodings. This is an issue for decoding because the resulting bits can only be decoded by a tree that exactly matches the tree that was used for the encoding process.

Address this by providing a deterministic mechanism for breaking ties. In the case of tied frequencies, your compareTo method should base its comparison on the lowest byte value from each tree.
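
For example, if each tree keeps track of the lowest byte value it contains, the comparison might look like this sketch (the class, field, and method names are illustrative):

@Override
public int compareTo(HuffTree other) {
  if (this.weight() != other.weight()) {
    // Primary ordering: total frequency (weight) of the tree.
    return Integer.compare(this.weight(), other.weight());
  }
  // Tie-breaker: deterministic ordering by lowest byte value in each tree.
  return Byte.compare(this.minByte(), other.minByte());
}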

Failing to address this issue can lead to symptoms that are difficult to diagnose. The encoding and decoding processes may seem to work well in most cases because the tree happens to be built the same way for decoding as it was for encoding.

Debugging and Testing

For the purposes of debugging and testing your implementation you will want to create some sample files that are small enough to evaluate by hand. For example:

aaaabbc

Under Linux and OS X, the xxd terminal program can be used to examine the byte values stored in a file. For example, the following commands create a file named tmp.txt then display the contents as both hexadecimal and binary:

$ echo aaaabbc > tmp.txt

$ xxd tmp.txt                   # This shows the contents as hex
00000000: 6161 6161 6262 630a                      aaaabbc.

$ xxd -b tmp.txt                # This shows the contents as binary
00000000: 01100001 01100001 01100001 01100001 01100010 01100010  aaaabb
00000006: 01100011 00001010                                      c.

Note that most standard text editors add a newline character (‘\n’) at the end of the file. If you’re using Windows, you might also get a carriage return (‘\r’). These characters will show up in your frequency counts as the bytes 0x0a and 0x0d, respectively. (Note that 0x means we are talking about hexadecimal numbers. In decimal, these bytes would be 10 and 13.)

Test Files

  • mary.txt A small text file. The correct binary encoding length for this file is 89 bits.

  • bytes.dat A larger binary file. The correct binary encoding length for this file is 337918 bits.

Note that these sizes represent the number of bits in the encoding generated using a correctly constructed Huffman tree. They are not the actual file sizes that will result from calling zip. The .mz files contain some overhead in addition to the bit sequence data: they include the frequency map and some additional overhead inherent to Java serialization.

The space overhead in the saved files means that you should not be surprised if some files actually end up bigger after compression. It is possible that the space required for bookkeeping will outweigh any space savings gained by a more efficient encoding. This will tend to be true for small files and for files that were already compressed using some other tool.

Honor Code Reminder

There are undoubtedly many Java Huffman tree implementations floating around on the internet. Obviously, submitting any of that code as your own would violate the JMU honor code. As always, if you obtain any information from outside sources you must provide a link to your source in the acknowledgment statement in your submission. (Copying external code with a citation wouldn’t violate the honor code, but it also wouldn’t result in any credit for the project.)

Submission

Submit the following files through Gradescope:

  • MadZip.java
  • Any supporting files you develop as part of your implementation

Test your code locally at each stage of development, and reason about its correctness. When you are reasonably confident in your solution, submit your code to Gradescope to run the autograder’s unit tests. There will be a point deduction for excessive Gradescope submissions.

Grading

Gradescope Functionality Tests  80 points
Style Checks                     5 points
Instructor Style Points         15 points