The goal for this project is to develop a file compression utility to
compete with applications like 7-Zip, gzip, WinZip etc. You must
develop two Java executables: JMZip
and JMUnzip
.
JMZip
will take two command line arguments. The first is the the
file to compress, and the second is the name of a file to create. A
sample execution might look like the following:
$ java JMZip MyPaper.doc MyPaper.jmz
This command would perform Huffman encoding on the data in
MyPaper.doc
and save a compressed version of the file to
MyPaper.jmz
.
The JMUnzip
application must undo the compression process to
recreate a copy of the original file. The first argument will be the
compressed file, the second argument will be the desired file name of
the uncompressed file. We could decompress MyPaper.jmz
as follows:
$ java JMUnzip MyPaper.jmz MyPaperRestored.doc
If you want to confirm that the contents of the restored file are the
same as the original, you can use a command-line utility, such as
shasum
. If there is any difference in the file, the value shown on the
two lines (e.g., the string starting with "92a40...") will be
different.
$ shasum MyPaper.doc MyPaperRestored.doc
92a40f993c73b5a56987ce5b6470c8879dc33903 MyPaper.doc
92a40f993c73b5a56987ce5b6470c8879dc33903 MyPaperRestored.doc
You should already have experience reading and writing text files in Java. You may not have experience handling file I/O with binary data. For this application it will be necessary to both read and write binary data. We will need to read binary files because our compression tool should be able to compress any file whether or not it contains plain-text. We will need to create binary files because the Huffman coding process results in a binary encoding. Saving a series of 0's and 1's as ASCII characters would not reduce the overall file size since we would end up using eight bits to store each individual bit of encoded data.
We are providing two Java files to help with efficiently manipulating and storing binary data.
BitSequence.java
- This class
represents an arbitrarily long sequence of bits. You will use
this class to represent the Huffman-encoded version of the input
file.HuffmanSave.java
- This is a simple
container class that stores a BitSequence
along with a
HashMap
of frequency data. Recall that decoding the bit
sequence created during the Huffman coding process requires us to
have access to the Huffman tree that was used to create the code.
A HuffmanSave
object stores all of the information necessary to
reconstruct a compressed file. The frequency data can be used to
rebuild the Huffman tree and the BitSequence
stores the encoded
data.For this application the .jmz
files must be serialized versions of
HuffmanSave
objects. Object serialization is the process of
converting a complete Java object into a binary form so that it can be
saved to a file or communicated over a network. This
tutorial
provides some background on serialization and shows examples of saving
and loading Java objects.*
The files BitSequence.java
and HuffmanSave.java
must not be
modified.
Our textbook provides a partial implementation of a Huffman tree.
You are welcome to use that code as a starting point, but you
should be aware that it is incomplete and will require some
modifications. For example, the leaves in the textbook
implementation store char
values. Your code will need to store
byte
s.
There are undoubtedly many other Java Huffman tree implementations floating around on the internet. Obviously, submitting any of that code as your own would violate the JMU honor code. As always, if you obtain any information from outside sources you must provide a link to your source in the acknowledgment statement in your submission. (Copying external code with a citation wouldn't violate the honor code, but it also wouldn't result in any credit for the project.)
One issue that needs to be addressed when developing a Huffman tree is how to handle ties: which trees should be selected when more than two options are tied for the lowest frequency? In one sense, it doesn't matter. In the case of a tie any valid choice will result in the same encoding length as long as the frequencies actually match the frequencies in the data being encoded.
On the other hand different choices in tie-breaking will result in different trees and thus different encodings. This is an issue for decoding because the resulting bits can only be decoded by a tree that exactly matches the tree that was used for the encoding process.
This must be addressed by providing a deterministic mechanism for
breaking ties. For example, ties go to the tree that contains the
lowest byte value. You can implement this by modifying the
existing compareTo
method so that it performs some additional
logic in the case of ties based on frequency. Failing to address
this issue can lead to symptoms that are difficult to diagnose.
The encoding/decoding process may seem to work well in most cases
because the tree happens to be built the same way for decoding as
it was for decoding.
You probably have the most experience reading files using the
Scanner
class. Recall that every file is really just a long
sequence of bytes. The Scanner
class is useful because it
allows us to interpret those bytes as int
s, double
s, String
s
etc. For this project, we don't care what the bytes actually
represent. This means there is no reason to use the Scanner
class. Instead you can just use the read
method of
FileInputStream
to process the bytes in the file directly.
Wrapping your FileInputStream
in a BufferedInputStream
will
significantly improve file I/O performance for large files.
For the purposes of debugging and testing your implementation you will want to create some sample files that are small enough to evaluate by hand. For example:
aaaabbc
Under Linux, the xxd
terminal program can be used to examine the
byte values stored in a file. For example, the following commands
create a file named tmp.txt
then display the contents as both
hexadecimal and binary:
$ echo aaaabbc > tmp.txt
$ xxd tmp.txt # This shows the contents as hex
00000000: 6161 6161 6262 630a aaaabbc.
$ xxd -b tmp.txt # This shows the contents as binary
00000000: 01100001 01100001 01100001 01100001 01100010 01100010 aaaabb
00000006: 01100011 00001010 c.
Note that most standard text editors add a newline character
('\n') at the end of the file. If you're using Windows, you might
also get a carriage return ('\r'). These characters will show up
in your frequency counts as the bytes 0x0a
and 0x0d
,
respectively. (Note that 0x
means we are talking about
hexadecimal numbers. In decimal, these bytes would be 10 and 13.)
Here are the submission tests that will be used by Autolab:
Notice that these are not unit tests. Unit testing involves testing
individual classes in isolation to make sure that each method works
correctly. The tests in JMZipTest.java
would be considered
integration tests or system tests. They check that the components work
together correctly and that the final application performs according
to the specification.
These tests probably won't be much use in helping you to debug your code. You will not be required to submit unit tests for this project, but we strongly suggest that you test your individual methods as you complete them.
Autolab Functionality | 60% |
Autolab Style Checks | 20% |
Autolab Instructor Style Points | 20% |
* Note that object serialization is not a good strategy for long-term data storage. If the implementation of the serialized class changes, it becomes impossible to deserialize using a Java program that uses a later version of the class. We are using serialization for this project because it saves us the trouble of specifying and implementing a new binary data format. If you were to continue developing this tool the next step would be to design a file format that is independent of the details of the implementation.