CS 240: Algorithms and Data Structures
James Madison University, Fall 2021

Huffman Coding

Introduction

The goal for this project is to develop a file compression utility to compete with applications like 7-Zip, gzip, WinZip etc. You must develop two Java executables: JMZip and JMUnzip.

JMZip will take two command line arguments. The first is the the file to compress, and the second is the name of a file to create. A sample execution might look like the following:

$ java JMZip MyPaper.doc MyPaper.jmz

This command would perform Huffman encoding on the data inq MyPaper.doc and save a compressed version of the file to MyPaper.jmz.

The JMUnzip application must undo the compression process to recreate a copy of the original file. The first argument will be the compressed file, the second argument will be the desired file name of the uncompressed file. We could decompress MyPaper.jmz as follows:

$ java JMUnzip MyPaper.jmz MyPaperRestored.doc

If you want to confirm that the contents of the restored file are the same as the original, you can use a command-line utility, such as shasum. If there is any difference in the file, the value shown on the two lines (e.g., the string starting with “92a40…”) will be different.

$ shasum MyPaper.doc MyPaperRestored.doc
92a40f993c73b5a56987ce5b6470c8879dc33903 MyPaper.doc
92a40f993c73b5a56987ce5b6470c8879dc33903 MyPaperRestored.doc

Detailed Requirements

File Format

You should already have experience reading and writing text files in Java. You may not have experience handling file I/O with binary data. For this application it will be necessary to both read and write binary data. We will need to read binary files because our compression tool should be able to compress any file whether or not it contains plain-text. We will need to create binary files because the Huffman coding process results in a binary encoding. Saving a series of 0’s and 1’s as ASCII characters would not reduce the overall file size since we would end up using eight bits to store each individual bit of encoded data.

We are providing two Java files to help with efficiently manipulating and storing binary data.

For this application the .jmz files must be serialized versions of HuffmanSave objects. Object serialization is the process of converting a complete Java object into a binary form so that it can be saved to a file or communicated over a network. This tutorial provides some background on serialization and shows examples of saving and loading Java objects.*

The files BitSequence.java and HuffmanSave.java must not be modified.

Tips, Advice and Additional Information

Testing

Here are the submission tests that will be used by Autolab:

Notice that these are not unit tests. Unit testing involves testing individual classes in isolation to make sure that each method works correctly. The tests in JMZipTest.java would be considered integration tests or system tests. They check that the components work together correctly and that the final application performs according to the specification.

These tests probably won’t be much use in helping you to debug your code. You will not be required to submit unit tests for this project, but we strongly suggest that you test your individual methods as you complete them.

Grading

Autolab Functionality 60%
Autolab Style Checks 20%
Autolab Instructor Style Points 20%

* Note that object serialization is not a good strategy for long-term data storage. If the implementation of the serialized class changes, it becomes impossible to deserialize using a Java program that uses a later version of the class. We are using serialization for this project because it saves us the trouble of specifying and implementing a new binary data format. If you were to continue developing this tool the next step would be to design a file format that is independent of the details of the implementation.