Multiset
Introduction
The Java collections framework provides interfaces corresponding to some of the most commonly used abstract data types: List, Map, Set, etc. One abstract data type that is not included is the Multiset.
The Multiset ADT is a generalization of the Set ADT that allows duplicate elements. Multisets are useful in cases where we don't care about the order of the elements but we do need to maintain a count of how many times each element appears. For example, in a document analysis domain we may need to count the number of occurrences of each word in the document. Such distributions of word frequencies can be useful for problems like authorship detection and topic classification.
For this assignment, you will be working with two distinct multiset implementations: one provided for you and one that you will develop from scratch. Both classes will implement the following interface:
Here is a completed implementation of the Multiset interface:
The ArrayMultiset class simply stores repeated elements in distinct
array positions. Take a few minutes to read over the Multiset
interface and the ArrayMultiset class before moving on to the next
section.
A Better Data Structure
ArrayMultiset is terribly inefficient in terms of both space and
time.
Consider the following code snippet:
Multiset<String> set = new ArrayMultiset<>(10000);
for (int i = 0; i < 10000; i) {
set.add("Harrisonburg");
}
System.out.println(set.getCount("Harrisonburg");
After this code executes the array inside set will contain 10,000
references to the string "Harrisonburg". The call to getCount will
involve iterating over all 10,000 entries.
It would be much more efficient to store a single copy of
"Harrisonburg" along with a counter. Subsequent calls to add would
then increment the counter without actually increasing the number of
objects stored in the array.
Your goal in this assignment is to create a new implementation of the
Multiset interface named CounterMultiset. Instead of storing
duplicate objects, this class will store Counter objects
containing both the item and the associated count.
Here is a stubbed-out file to get you started:
Part 1 - Understanding the Provided Code
Take some time to review the provided files. Complete the Part 1 questions in Gradescope before proceeding to the rest of the assignment.
Part 2 - Implementing Counter Multiset
Complete the stubbed out methods in CounterMultiset.java. You may test your finished implementation using the following JUnit tests:
Part 3 Demonstration Workload
It was claimed above that the counter implementation is much more
efficient than the naive implementation for the Multiset ADT.
Strictly speaking, this claim depends on the actual sequence of
operations performed. Take a few minutes to think about scenarios
where the CounterMultiset will have an advantage and scenarios where
it will not, then complete the following driver class so that it
conforms to the Javadoc comments.
If you are stuck, consider the following hints:
Hints
- The
getCountmethod forArrayMultisetcan is particularly inefficient for cases where there are many duplicate items. - One weakness of
CounterMultisetis that it requires a sequential search for all add operations. WhileArrayMultisetsimply inserts the new element at the next available position,CounterMultisetneeds to first find the appropriate Counter object. This could be expensive for cases when the item being added happens to be near the end of the array.
Submission and Grading
The project will be graded as follows:
| Part 1 Questions | 20% |
| Gradescope Functionality Tests for Part 2 | 50% |
| Gradescope Style Checks for Parts 2 | 10% |
| Gradescope Timing Tests for Part 3 | 20% |
Looking Ahead
The counter-based Multiset that you developed is, in general, much more efficient than the terrible implementation we provided. This raises the question of whether it is possible to develop an even more efficient implementation. It is. We'll see several data structures later in the semester that could be used to create much better implementations of this ADT.
Acknowledgments
The Multiset interface used for this project is based on a similar
interface included with the the
Guava Google Java libraries.