Skip to content

Multiset

Introduction

The Java collections framework provides interfaces corresponding to some of the most commonly used abstract data types: List, Map, Set, etc. One abstract data type that is not included is the Multiset.

The Multiset ADT is a generalization of the Set ADT that allows duplicate elements. Multisets are useful in cases where we don't care about the order of the elements but we do need to maintain a count of how many times each element appears. For example, in a document analysis domain we may need to count the number of occurrences of each word in the document. Such distributions of word frequencies can be useful for problems like authorship detection and topic classification.

For this assignment, you will be working with two distinct multiset implementations: one provided for you and one that you will develop from scratch. Both classes will implement the following interface:

Multiset.java

Here is a completed implementation of the Multiset interface:

ArrayMultiset.java

The ArrayMultiset class simply stores repeated elements in distinct array positions. Take a few minutes to read over the Multiset interface and the ArrayMultiset class before moving on to the next section.

A Better Data Structure

ArrayMultiset is terribly inefficient in terms of both space and time.

Consider the following code snippet:

    Multiset<String> set = new ArrayMultiset<>(10000);
    for (int i = 0; i < 10000; i) {
       set.add("Harrisonburg");
    }
    System.out.println(set.getCount("Harrisonburg");

After this code executes the array inside set will contain 10,000 references to the string "Harrisonburg". The call to getCount will involve iterating over all 10,000 entries.

It would be much more efficient to store a single copy of "Harrisonburg" along with a counter. Subsequent calls to add would then increment the counter without actually increasing the number of objects stored in the array.

Your goal in this assignment is to create a new implementation of the Multiset interface named CounterMultiset. Instead of storing duplicate objects, this class will store Counter objects containing both the item and the associated count.

Here is a stubbed-out file to get you started:

CounterMultiset.java

Part 1 - Understanding the Provided Code

Take some time to review the provided files. Complete the Part 1 questions in Gradescope before proceeding to the rest of the assignment.

Part 2 - Implementing Counter Multiset

Complete the stubbed out methods in CounterMultiset.java. You may test your finished implementation using the following JUnit tests:

CounterMultisetTest.java

Part 3 Demonstration Workload

It was claimed above that the counter implementation is much more efficient than the naive implementation for the Multiset ADT. Strictly speaking, this claim depends on the actual sequence of operations performed. Take a few minutes to think about scenarios where the CounterMultiset will have an advantage and scenarios where it will not, then complete the following driver class so that it conforms to the Javadoc comments.

WorkloadDriver.java

If you are stuck, consider the following hints:

Hints
  • The getCount method for ArrayMultiset can is particularly inefficient for cases where there are many duplicate items.
  • One weakness of CounterMultiset is that it requires a sequential search for all add operations. While ArrayMultiset simply inserts the new element at the next available position, CounterMultiset needs to first find the appropriate Counter object. This could be expensive for cases when the item being added happens to be near the end of the array.

Submission and Grading

The project will be graded as follows:

Part 1 Questions20%
Gradescope Functionality Tests for Part 250%
Gradescope Style Checks for Parts 210%
Gradescope Timing Tests for Part 320%

Looking Ahead

The counter-based Multiset that you developed is, in general, much more efficient than the terrible implementation we provided. This raises the question of whether it is possible to develop an even more efficient implementation. It is. We'll see several data structures later in the semester that could be used to create much better implementations of this ADT.

Acknowledgments

The Multiset interface used for this project is based on a similar interface included with the the Guava Google Java libraries.