CS 240: Algorithms and Data Structures
James Madison University, Fall 2024

Huffman Coding

Introduction

The goal for this project is to develop a file compression utility to compete with applications like 7-Zip, gzip, WinZip etc.

Specification

You must develop a utility class named MadZip with two public static methods: zip and unzip.

zip

unzip

Note: The specification above describes the expected behavior of these methods. It should not be read as a list of tasks that each require code to accomplish. For example, the requirement that “this method must overwrite the destination file if it already exists” describes the default behavior of the write method for a FileOutputStream. There is no need to write additional code to check for the existence of a file in order to satisfy this requirement. Make sure you understand the behavior of the built-in Java classes before you unnecessarily replicate existing functionality.

MadZipApp

We are providing a Java GUI wrapper for your zip and unzip methods:

As part of your testing, you should use the GUI to zip several files, unzip those files with a different name, and confirm that the unzipped versions are identical to the original files.

There are many tools that will report whether two files are identical. For example, the diff command-line tool should be available by default under Linux and OS X.

Starter Code

Our textbook provides a partial implementation of a Huffman tree. We are providing a lightly refactored version of the textbook code that you may want to use as a starting point:

This code is incomplete and may require some modifications. You are free to use this code, ignore it, or to change it arbitrarily.

File Format

You should already have experience reading and writing text files in Java. You may not have experience handling file I/O with binary data. For this application it will be necessary to both read and write binary data. We will need to read binary files because our compression tool should be able to compress any file, whether or not it contains plain-text. We will need to create binary files because the Huffman coding process results in a binary encoding. Saving a series of 0’s and 1’s as ASCII characters would not reduce the overall file size since we would end up using eight bits to store each individual bit of encoded data.

We are providing two Java files to help with efficiently manipulating and storing binary data.

For this application the compressed files must be serialized versions of HuffmanSave objects. Object serialization is the process of converting a complete Java object into a binary form so that it can be saved to a file or communicated over a network. This tutorial provides some background on serialization and shows examples of saving and loading Java objects.

We will follow the convention that compressed files should have a .mz file extension, but this is not a requirement that should be enforced by your zip method.

The files BitSequence.java and HuffmanSave.java must not be modified.

Reading Binary Data

You probably have the most experience reading files using the Scanner class. Recall that every file is really just a long sequence of bytes. The Scanner class is useful because it allows us to interpret those bytes as ints, doubles, Strings etc. For this project, we don’t care what the bytes actually represent. This means there is no reason to use the Scanner class. Instead you should use the read method of FileInputStream to process the bytes in the file directly. Wrapping your FileInputStream in a BufferedInputStream will significantly improve file I/O performance for large files.

Be careful! The documentation for the read method states that it “reads a byte of data from the input stream”, but the return type is actually an int. The read method returns a value of -1 (which is not a valid byte) if there are no more bytes available in the stream. You can cast the return value to a byte, but you need to check for the -1 before you do so.

Space Efficiency

Keep in mind that we might be interested in compressing very large files. This means that we should avoid reading the entire contents of a file into memory. In other words, don’t use the read(byte[] b) method of the FileInputStream, and don’t save all of the bytes into an ArrayList as they are read. Instead, process each byte as it is read from the file. This means that you will end up reading the file twice during the compression process: once to determine the byte frequencies, and again to create the encoded bit sequence.

Under the current design, we will need to store the entire contents of the compressed version of the file in memory, since we need to completely build the HuffmanSave object before it can be saved.

Handling Ties

One issue that needs to be addressed when developing a Huffman tree is how to handle ties: which trees should be selected when more than two options are tied for the lowest frequency? In one sense, it doesn’t matter. In the case of a tie any valid choice will result in the same encoding length.

On the other hand different choices in tie-breaking will result in different trees and thus different encodings. This is an issue for decoding because the resulting bits can only be decoded by a tree that exactly matches the tree that was used for the encoding process.

Address this by providing a deterministic mechanism for breaking ties. In the case of tied frequencies, your compareTo method should base its comparison on the lowest byte value from each tree.

Failing to address this issue can lead to symptoms that are difficult to diagnose. The encoding and decoding processes may seem to work well in most cases because the tree happens to be built the same way for decoding as it was for decoding.

Debugging and Testing

For the purposes of debugging and testing your implementation you will want to create some sample files that are small enough to evaluate by hand. For example:

aaaabbc

Under Linux and OS X, the xxd terminal program can be used to examine the byte values stored in a file. For example, the following commands create a file named tmp.txt then display the contents as both hexadecimal and binary:

$ echo aaaabbc > tmp.txt

$ xxd tmp.txt                   # This shows the contents as hex
00000000: 6161 6161 6262 630a                      aaaabbc.

$ xxd -b tmp.txt                # This shows the contents as binary
00000000: 01100001 01100001 01100001 01100001 01100010 01100010  aaaabb
00000006: 01100011 00001010                                      c.

Note that most standard text editors add a newline character (‘\n’) at the end of the file. If you’re using Windows, you might also get a carriage return (‘\r’). These characters will show up in your frequency counts as the bytes 0x0a and 0x0d, respectively. (Note that 0x means we are talking about hexadecimal numbers. In decimal, these bytes would be 10 and 13.)

Test Files

Here are some test files that you can use to check your tree construction and encoding process:

Note that these sizes represent the number of bits in the encoding generated using a correctly constructed Huffman tree. They are not the actual file sizes that will result from calling zip. The .mz files contain some overhead in addition to the bit sequence data: they include the frequency map and some additional overhead inherent to Java serialization.

The space overhead in the saved files mean that you should not be surprised if some files actually end up bigger after compression. It is possible that the space required for bookkeeping will outweigh any space savings gained by a more efficient encoding. This will tend to be true for small files and for files that were already compressed using some other tool.

Honor Code Reminder

There are undoubtedly many Java Huffman tree implementations floating around on the internet. Obviously, submitting any of that code as your own would violate the JMU honor code. As always, if you obtain any information from outside sources you must provide a link to your source in the acknowledgment statement in your submission. (Copying external code with a citation wouldn’t violate the honor code, but it also wouldn’t result in any credit for the project.)

Grading

Submit MadZip.java in addition to any helper classes that you created through Gradescope. You do not need to upload HuffmanSave.java or BitSequence.java.

Gradescope Functionality Tests 80%
Style Checks 5%
Instructor Style Points 15%