Lab 10: Word Clouds¶
In this (two-part) lab, you will practice using Collections and create a class that generates word clouds, which are visualizations of word frequencies in a given piece of text.
Learning Objectives¶
After completing this lab, you should be able to:
- Use collections to store information about text.
- Read and parse the contents of a plain text file.
- Implement interfaces in the `java.lang` package.
- Determine how to utilize a third-party library.
Background¶
A word cloud is a visualization of a bunch of text. See https://monkeylearn.com/word-clouds/ for examples.
The core part of a word cloud is a "frequency table," aka a mapping of each word to the number of times it appears in the given text. For example, consider the frequency table from our CS major description:
Word | Frequency |
---|---|
solutions | 6 |
problems | 5 |
computer | 4 |
science | 4 |
computing | 4 |
… | … |
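A frequency table like the one above can be built by counting words into a map. The following is only a minimal sketch under stated assumptions: the class name `FrequencyTableSketch`, the method `countWords`, and the choice of `HashMap` are illustrative and are not part of the lab's starter code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class FrequencyTableSketch {

    /** Counts how many times each word appears in the given array. */
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> table = new HashMap<>();
        for (String word : words) {
            // merge() stores 1 for a new word, or adds 1 to the existing count
            table.merge(word, 1, Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        String[] words = {"solutions", "problems", "solutions", "computer"};
        Map<String, Integer> table = countWords(words);
        System.out.println(table.get("solutions")); // prints 2
        System.out.println(table.get("problems"));  // prints 1
    }
}
```

A `HashMap` does not keep its keys in any particular order, so a different collection may be a better fit when sorted output is needed.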
In the corresponding word cloud, more frequent words appear larger and less frequent words appear smaller.
One need that arises when generating word clouds is combining related words. For example, the words devoured, devourers, devouring, and devours are all based on the root term devour. We'd want to group these words together in a word cloud, instead of treating them as individual words.
Fortunately, Dr. Martin Porter developed a "stemming" algorithm about 40 years ago that does this part for you. The tricky part is, that's someone else's code you'll need to read through and figure out how to use! A large part of computer science is working with other people's code, which could be well-documented, or not documented at all.
Today's Lab¶
For this lab, you will implement a `WordCloud` class that maintains a collection of terms (root words) and the number of words that correspond to each term. The `Stemmer` class from https://tartarus.org/martin/PorterStemmer/ will help you convert original words into these terms.
Starter Code¶
Download lab10.zip and extract its contents to `cs159/src/labs/lab10/`. It contains the following files:
- `WordCloud.java` – Contains stubs for the methods you will need to implement.
- `Stemmer.java` – The code for the stemming algorithm. You do not need to modify this class, just use it.
- `WordCloudDriver.java` – Run this code to create a word cloud in an HTML file that you can open in your browser!
- `data/easy.txt` – A sample file with a few words that you can use to test.
- `data/hard.txt` – A sample file with the text from this lab that you can use to test.
UML Diagram¶
classDiagram
class WordCloud {
...
+WordCloud()
+add(word : String)
+addAll(path : String) int
+getFrequency(term : String) int
+termCount() int
+totalWordCount() int
+remove(word : String) int
+removeAllLessThan(threshold : int) int
+generateHTML() String
+compareTo(wc : WordCloud) int
+iterator() Iterator<String>
+normalize(word : String) String$
}
Notice that the UML diagram above is incomplete! It lists the methods that must be implemented, but you will need to determine what attributes and other helper methods are necessary for `WordCloud`.
Think carefully about what needs to be stored. Avoid storing data that isn't needed, and don't store the same data more than once.
Stemmer Class¶
The `Stemmer` class is responsible for converting words to their base terms (e.g., "devoured" to "devour"). You will need to figure out how to use the `Stemmer` class. The easiest way is to look at the source code. Try to figure out which methods you need to call (don't copy and paste any code from `Stemmer`)!
JavaDocs?
You'll find that `Stemmer` lacks proper JavaDocs, has very few code comments, and follows a different style. It may be tricky to figure out what each method does, or what the arguments do. This is why we have you write JavaDocs and follow a style guide. It makes the code more readable/usable by others! You'll rarely write code just for yourself in the real world.
Instructions¶
Implement the `WordCloud` class. This class stores the number of words that correspond to each term. It has methods to add words, remove words, get the frequency of each term, etc. It also implements a couple of interfaces that make it easier for other classes to use it.
We recommend working through the following requirements in order:
Attributes and Constructor:¶
- Read through the stubbed `WordCloud` class and think about what a `WordCloud` needs to store/keep track of:
    - Each term, with the number of words that correspond to the term.
    - You need to be able to get/set the frequency for each term.
    - You will also need to display the terms in alphabetical order.
    - 👉 What collection would be best for this?
- Add a private attribute to the `WordCloud` class based on your answer.
    - There are multiple ways to solve this lab, so you may have 1 or more attributes depending on your approach.
    - Some approaches are better than others, and will save you work later on!
- Implement the constructor to initialize the class's attributes.
    - Initially, a `WordCloud` should have no terms/words.
Base Functionality:¶
- `public void add(String word)`
    - Updates the frequency of a term, based on the given `word`.
        - (Hint: If this is the first time you've seen this term, what should its frequency be? If you've already seen it, how should you update its frequency?)
    - The given `word` should be normalized to a term before storing it in the `WordCloud` (use the `normalize()` method, which returns a term from a given word).
        - You do not need to implement `normalize()` just yet – we can save that for later.
        - Right now, `normalize()` just returns the same word passed in.
- `public int addAll(String path) throws FileNotFoundException`
    - Adds all the terms in the given file to the word cloud.
        - This should read the file word by word, and `add` each word.
    - This method also returns the total number of words read/added.
    - This method must throw a `FileNotFoundException` if the given `path` is not found.
- `public int getFrequency(String term)`
    - Returns the frequency/word count for a given term in the word cloud.
    - If the term is not in the word cloud, return `0`.
- `public int termCount()`
    - Returns the number of terms in the word cloud.
- `public int totalWordCount()`
    - Returns the total number of words in the word cloud (i.e., the sum of the frequencies of all terms).
- Run `WordCloudDriver` to test this functionality!
    - You should see that `611` terms and `1561` words were added to the word cloud.
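Reading a file word by word, as `addAll` requires, is commonly done with `java.util.Scanner`. The sketch below only counts whitespace-separated words; it is an illustration under assumptions, not the lab's solution, and the names `WordReaderSketch` and `countWords` are hypothetical.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Hypothetical helper class, for illustration only.
class WordReaderSketch {

    /** Counts the whitespace-separated words in the file at the given path. */
    static int countWords(String path) throws FileNotFoundException {
        int count = 0;
        // new Scanner(File) throws FileNotFoundException if the file is missing
        try (Scanner in = new Scanner(new File(path))) {
            while (in.hasNext()) {
                String word = in.next(); // next whitespace-delimited token
                count++;                 // a word cloud would add(word) here
            }
        }
        return count;
    }
}
```

Because the `Scanner` constructor already throws `FileNotFoundException`, the method only needs to declare it rather than create the exception itself.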
Removing Terms:¶
- `public int remove(String word)`
    - Completely removes a term from the word cloud, based on the given `word`.
    - The given `word` should be normalized to a term as well, using `normalize()`.
    - Returns the number of words removed for that term, or 0 if not found.
- `public int removeAllLessThan(int threshold)`
    - Removes all terms with a frequency less than the `threshold`.
    - Returns the total number of words removed.
    - You will need to loop over your collection(s) using an `Iterator` and call the `Iterator`'s `remove()` method.
        - This is because modifying a collection while looping over it can result in a `ConcurrentModificationException`!
        - Hint: Now would be a good time to review the Java tutorial for your attribute type, paying special attention to using an iterator.
- Modify `WordCloudDriver` to test this functionality!
    - Uncomment the code after `"Test removing"` and then run `WordCloudDriver`.
    - It should remove `92` instances of `"the"`, and `417` terms with frequency less than 2.
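The iterator-based removal described above can be sketched on a plain map of counts. This is a minimal illustration that assumes the data lives in a `Map<String, Integer>`; your attribute may differ, and the names `IteratorRemoveSketch` and `removeBelow` are hypothetical.

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class IteratorRemoveSketch {

    /**
     * Removes every entry whose count is below the threshold.
     * Returns the sum of the counts that were removed.
     */
    static int removeBelow(Map<String, Integer> counts, int threshold) {
        int removed = 0;
        Iterator<Map.Entry<String, Integer>> it = counts.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> entry = it.next();
            if (entry.getValue() < threshold) {
                removed += entry.getValue();
                // Removing through the iterator keeps the iteration valid;
                // calling counts.remove(...) here could instead throw
                // ConcurrentModificationException on the next it.next()
                it.remove();
            }
        }
        return removed;
    }
}
```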
Generating HTML Output:¶
- `public String generateHTML()`
    - Generates HTML code (which is what webpages use) for displaying the word cloud. Each term is a "span" with a "font-size" that corresponds to the count. Terms must be in alphabetical order.
    - For example, a `WordCloud` that contains the words `[bb, aa, cc, cc, bb]` would return the following string:

            <span style="font-size: 1pt">aa</span>
            <span style="font-size: 2pt">bb</span>
            <span style="font-size: 2pt">cc</span>

      Notice there is a newline character at the end of each line.
- Modify `WordCloudDriver` to test this functionality!
    - Uncomment the code after `"Generate HTML file"` and then run `WordCloudDriver`.
    - It should create an HTML file in the `data` folder. You can open the HTML file in a browser to see the generated word cloud (you may need to zoom in).
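One way to assemble output in the span format shown above is with a `StringBuilder`. The sketch below assumes the counts are already held in a sorted map (a `TreeMap` is used here purely for illustration); the names `SpanBuilderSketch` and `toSpans` are hypothetical and not part of the starter code.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
class SpanBuilderSketch {

    /** Builds one span line per entry, in the map's iteration order. */
    static String toSpans(Map<String, Integer> counts) {
        StringBuilder html = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            html.append("<span style=\"font-size: ")
                .append(entry.getValue())
                .append("pt\">")
                .append(entry.getKey())
                .append("</span>\n"); // every line ends with a newline
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // A TreeMap iterates over its keys in sorted (alphabetical) order
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : new String[] {"bb", "aa", "cc", "cc", "bb"}) {
            counts.merge(word, 1, Integer::sum);
        }
        System.out.print(toSpans(counts));
    }
}
```

Running `main` prints the three span lines for `aa`, `bb`, and `cc` from the example above.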
Normalizing Words with the Stemmer:¶
- Read through the code in `Stemmer.java` to understand how to use it.
- `public static String normalize(String word)`
    - Normalizes a given word by removing punctuation, making it lowercase, and running the Porter stemming algorithm.
        - Punctuation is defined as any character that is not a letter or digit.
        - For example, both `normalize("coding")` and `normalize("Cod-Ing...")` should return the term `"code"`.
- Make sure that you are using `normalize()` to normalize the word in both `add()` and `remove()`. All the terms in the word cloud should be normalized.
- Run `WordCloudDriver` to test this functionality!
    - You should have the same number of words, but fewer terms.
        - There should now be `363` total terms and `1561` total words.
    - Open the HTML file in a browser to see the generated word cloud.
        - You should find that all the words are now lowercase, and there is no punctuation.
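The punctuation and lowercase steps of normalization can be sketched on their own. The example below deliberately leaves out the stemming step, since working out how to call `Stemmer` is part of the lab; the names `CleanWordSketch` and `clean` are hypothetical.

```java
// Hypothetical helper class, for illustration only.
class CleanWordSketch {

    /** Keeps only letters and digits, converted to lowercase. No stemming. */
    static String clean(String word) {
        StringBuilder cleaned = new StringBuilder();
        for (char c : word.toCharArray()) {
            // Anything that is not a letter or digit counts as punctuation
            if (Character.isLetterOrDigit(c)) {
                cleaned.append(Character.toLowerCase(c));
            }
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("Cod-Ing...")); // prints coding
        System.out.println(clean("coding"));     // prints coding
    }
}
```

A full `normalize()` would then pass the cleaned word to the stemmer to obtain the final term.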
Implementing Comparable and Iterable Interfaces:¶
- Change the declaration of `WordCloud` to implement two interfaces: it must be comparable with other `WordCloud` objects (`Comparable<WordCloud>`), and it must be iterable over the terms that it contains (`Iterable<String>`).
    - Don't forget to include the appropriate types inside the angle brackets! Otherwise, you will implement the "raw" (non-parameterized) version of each interface, which uses `Object` in place of the type parameter.
- `public int compareTo(WordCloud wc)`
    - Returns a negative integer, zero, or a positive integer depending on whether this object has fewer, the same, or more terms than `wc`.
    - This is required for the `Comparable` interface.
- `public Iterator<String> iterator()`
    - Returns an iterator that can go through each term in the `WordCloud`, in alphabetical order.
    - This is required for the `Iterable` interface.
    - (Hint: Don't "create" a new iterator, just return one that does what the method wants. Can you call a method on one of your attributes to get an appropriate iterator?)
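To show what implementing both interfaces looks like, here is a sketch on an unrelated toy class. `TagSetSketch` is hypothetical and is not the `WordCloud` solution: it compares by number of elements and hands out the iterator of its underlying sorted collection.

```java
import java.util.Iterator;
import java.util.TreeSet;

// Hypothetical toy class, for illustration only.
class TagSetSketch implements Comparable<TagSetSketch>, Iterable<String> {

    // A TreeSet keeps its elements in sorted order
    private final TreeSet<String> tags = new TreeSet<>();

    void add(String tag) {
        tags.add(tag);
    }

    @Override
    public int compareTo(TagSetSketch other) {
        // Negative, zero, or positive for fewer, the same, or more tags
        return Integer.compare(tags.size(), other.tags.size());
    }

    @Override
    public Iterator<String> iterator() {
        // Delegate to the collection instead of writing a new iterator class
        return tags.iterator();
    }
}
```

Because the class is `Iterable<String>`, it can be used directly in a for-each loop, for example `for (String tag : tagSet) { ... }`.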
Submission¶
Submit your `WordCloud.java` file to Gradescope. Your submission must pass a checkstyle audit to receive any points.