Huffman coding

Wed, Sep 21, 2016

Huffman Coding

Introduction

The goal for this project is to develop a file compression utility to compete with applications like 7-Zip, gzip, WinZip etc. You must develop two Java executables: JMZip and JMUnzip.

JMZip will take two command line arguments. The first is the the file to compress, and the second is the name of a file to create. A sample execution might look like the following:

$ java JMZip MyPaper.doc MyPaper.jmz

This command would perform Huffman encoding on the data in MyPaper.doc and save a compressed version of the file to MyPaper.jmz.

The JMUnzip application must undo the compression process to recreate a copy of the original file. The first argument will be the compressed file, the second argument will be the desired file name of the uncompressed file. We could decompress MyPaper.jmz as follows:

$ java JMUnzip MyPaper.jmz MyPaperRestored.doc

Detailed Requirements

Both applications must print appropriate error messages in the case of missing or incorrect arguments. In particular, the user must be informed if he or she attempts to open a file that cannot be opened or save to a location where they do not have permission.
Both applications should overwrite the file specified in the second argument if that file already exists.
Neither application may modify the file provided as the first argument.

File Format

You should already have experience reading and writing text files in Java. You may not have experience handling file I/O with binary data. For this application it will be necessary to both read and write binary data. We will need to read binary files because our compression tool should be able to compress any file whether or not it contains plain-text. We will need to create binary files because the Huffman coding process results in a binary encoding. Saving a series of 0’s and 1’s as ASCII characters would not reduce the overall file size since we would end up using eight bits to store each individual bit of encoded data.

We are providing two Java files to help with efficiently manipulating and storing binary data.

BitSequence.java - This class represents an arbitrarily long sequence of bits. You will use this class to represent the Huffman-encoded version of the input file.
HuffmanSave.java - This is a simple container class that stores a BitSequence along with a HashMap of frequency data. Recall that decoding the bit sequence created during the Huffman coding process requires us to have access to the Huffman tree that was used to create the code. A HuffmanSave object stores all of the information necessary to reconstruct a compressed file. The frequency data can be used to rebuild the Huffman tree and the BitSequence stores the encoded data.

For this application the .jmz files must be serialized versions of HuffmanSave objects. Object serialization is the process of converting a complete Java object into a binary form so that it can be saved to a file or communicated over a network. This tutorial provides some background on serialization and shows examples of saving and loading Java objects.*

The files BitSequence.java and HuffmanSave.java must not be modified.

Tips, Advice and Additional Information

Our textbook provides a partial implementation of a Huffman tree. You are welcome to use that code as a starting point, but you should be aware that it is incomplete and will require some modifications. For example, the leaves in the textbook implementation store char values. Your code will need to store bytes.

There are undoubtedly many other Java Huffman tree implementations floating around on the internet. Obviously, submitting any of that code as your own would violate the JMU honor code. As always, if you obtain any information from outside sources you must provide a link to your source in the acknowledgment statement in your submission. (Copying external code with a citation wouldn’t violate the honor code, but it also wouldn’t result in any credit for the project.)
One issue that needs to be addressed when developing a Huffman tree is how to handle ties: which trees should be selected when more than two options are tied for the lowest frequency? In one sense, it doesn’t matter. In the case of a tie any valid choice will result in the same encoding length as long as the frequencies actually match the frequencies in the data being encoded.

On the other hand different choices in tie-breaking will result in different trees and thus different encodings. This is an issue for decoding because the resulting bits can only be decoded by a tree that exactly matches the tree that was used for the encoding process.

This must be addressed by providing a deterministic mechanism for breaking ties. For example, ties go to the tree that contains the lowest byte value. You can implement this by modifying the existing compareTo method so that it performs some additional logic in the case of ties based on frequency. Failing to address this issue can lead to symptoms that are difficult to diagnose. The encoding/decoding process may seem to work well in most cases because the tree happens to be built the same way for decoding as it was for decoding.
You probably have the most experience reading files using the Scanner class. Recall that every file is really just a long sequence of bytes. The Scanner class is useful because it allows us to interpret those bytes as ints, doubles, Strings etc. For this project, we don’t care what the bytes actually represent. This means there is no reason to use the Scanner class. Instead you can just use the read method of FileInputStream to process the bytes in the file directly.

Grading

Web-CAT Funcionality	60%
Web-CAT Style Checks	20%
Web-CAT Instructor Style Points	20%

* Note that object serialization is not a good strategy for long-term data storage. If the implementation of the serialized class changes, it becomes impossible to deserialize using a Java program that uses a later version of the class. We are using serialization for this project because it saves us the trouble of specifying and implementing a new binary data format. If you were to continue developing this tool the next step would be to design a file format that is independent of the details of the implementation.