HW 6: Small Language Model¶

Learning Objectives¶
After completing this homework, you should be able to:
- Use Java collections to store and query data.
- Generate random numbers in Java.
- Read from text files and parse lines character by character.
- Write recursive code to process nested structures.
- Implement a simple predictive text generator.
Introduction¶
Modern AI platforms are built on systems called large language models (LLMs). At its core, all an LLM does is predict what word comes next in a sequence of words. For example, given the text "The cat sat on the", an LLM might predict that the next word is "mat".
Let's create a very simple version of this that only looks at the previous word to predict the next word. We'll create a statistical word predictor that reads a text file, counts how often each word is followed by other words, and then uses that information to generate new text.
We'll do this by creating "WordNode" objects that represent each word and keep track of which words can follow it and how often. Think of each "WordNode" as a city, and the words that can follow it as roads leading to other cities. The more often a road is used, the wider it is:

Thus, we could generate new text by starting at a word and then randomly following the roads to the next word, but weighted by how often each road is used. In the above example, "the end ." is the most likely sequence, but "the dog ran ." and "the cat ate ." are also possible. We use randomness to choose between them, but more common sequences are more likely to be chosen so that the generated text is more natural.
This is a simple technique that should generate text that at least follows the patterns of the original text (even if it doesn't always make sense). We'll call this our "Small Language Model"!
UML Diagram¶
```
classDiagram
class Tokenizer {
<<utility>>
+ getTokensFromLine(line : String) ArrayList<String>$
+ getTokensFromFile(filename : String) ArrayList<String>$
}
class WordNode {
- token : String
- allNextNodes : ArrayList<WordNode>
- nextNodeCounts : HashMap<WordNode, Integer>
+ WordNode(token : String)
+ getToken() String
+ getAllNextNodes() List<WordNode>
+ isEndingPunctuation() boolean
+ getNextNodes() Set<WordNode>
+ getNextCount(nextNode : WordNode) int
+ addNextNode(nextNode : WordNode) void
+ getRandomNextNode() WordNode
+ equals(obj : Object) boolean
+ hashCode() int
+ toString() String
}
class SmallLanguageModel {
- startingWords : HashSet<WordNode>
- wordNodes : HashMap<String, WordNode>
- getOrCreateWordNode(token : String) WordNode
+ SmallLanguageModel()
+ getStartingWords() Set<WordNode>
+ getWordNodes() Map<String, WordNode>
+ trainOnFile(filename : String) void
+ generateSentence(node : WordNode) String
}
SmallLanguageModel o-- WordNode : contains
WordNode --> WordNode : counts next words
SmallLanguageModel ..> Tokenizer : uses
```
Part A: Tokenizer and WordNode¶
In this part, you'll implement a Tokenizer class that reads text files and breaks them into tokens (words and punctuation). You'll also implement a WordNode class that represents a node in our model, which keeps track of a word and the words that can follow it.
What is a Token?¶
Let's define a "token" as either a word (a continuous sequence of letters and/or digits) or punctuation (a single non-alphanumeric character: `.`, `,`, `!`, `?`, `"`, `'`, etc.). Spaces and other whitespace characters are not tokens and should be ignored.
Thus, the sentence:
I have 1 cat and 12 dogs...
would be tokenized into the following tokens:
["I", "have", "1", "cat", "and", "12", "dogs", ".", ".", "."]
Note how multiple spaces are ignored, and each punctuation character is its own token.
Do note that numbers and letters together are considered part of the same token. For example, "CS159" would be a single token.
Tokenizer Class¶
```
classDiagram
class Tokenizer {
<<utility>>
+ getTokensFromLine(line : String) ArrayList<String>$
+ getTokensFromFile(filename : String) ArrayList<String>$
}
```
The Tokenizer class is a utility class that should have the following static methods:

- `getTokensFromLine(String line)`: Takes a line of text and returns an `ArrayList<String>` of tokens.
    - You will need to process each character in the line one at a time.
    - You can use the `.charAt(int index)` method of `String` to get individual characters.
    - We recommend building up a token character by character, then adding the completed token to the list.
    - Do this by hand first! Hint: Go through the line character by character on paper. Think about: (1) what to do when you see an alphanumeric character, (2) what to do when you see a punctuation character, and (3) what to do when you see whitespace. When do you add a token to the list?
    - Remember to handle the end of the line properly: if the line ends while you're building a token, you need to add that token to the list before finishing.
    - You can also solve this problem by using a combination of different `String` methods; up to you!
- `getTokensFromFile(String filename)`: Takes a filename, reads the file, and returns an `ArrayList<String>` of tokens from the entire file.
    - This method should use `getTokensFromLine` to process each line of the file.
    - If the file cannot be found or read, this method should print an error message (`"Error reading file!"`) and return an empty list.
    - Set your Scanner to UTF-8! When you create your `Scanner`, we recommend specifying the `"UTF-8"` charset to ensure proper reading of text files: `Scanner fileScanner = new Scanner(file, "UTF-8");` Otherwise, if you try to use a file with special characters, the Scanner may not output any tokens at all!

Useful methods from the `Character` class:

- `Character.isLetterOrDigit(char ch)`: Returns `true` if the character is a letter (a-z, A-Z) or digit (0-9).
- `Character.isWhitespace(char ch)`: Returns `true` if the character is a whitespace character (space, tab, newline, etc.).
- `Character.toString(char ch)`: Converts a character to a `String`.
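To make this concrete, here is one possible sketch of the character-by-character strategy. The method name matches the UML, but the wrapper class and the exact logic are our illustration, not the official solution:

```java
import java.util.ArrayList;

// Illustrative sketch only; your Tokenizer class may be structured differently.
class TokenizerSketch {
    public static ArrayList<String> getTokensFromLine(String line) {
        ArrayList<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder(); // the token being built
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (Character.isLetterOrDigit(ch)) {
                current.append(ch); // keep extending the current word
            } else {
                if (current.length() > 0) { // the word just ended; emit it
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (!Character.isWhitespace(ch)) {
                    tokens.add(Character.toString(ch)); // punctuation is its own token
                }
            }
        }
        if (current.length() > 0) { // don't drop a word at the end of the line
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Running this on the example sentence above yields the same ten tokens shown earlier; note the final `if` that handles a token still being built when the line ends.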
WordNode Class¶
```
classDiagram
class WordNode {
- token : String
- allNextNodes : ArrayList<WordNode>
- nextNodeCounts : HashMap<WordNode, Integer>
+ WordNode(token : String)
+ getToken() String
+ getAllNextNodes() List<WordNode>
+ isEndingPunctuation() boolean
+ getNextNodes() Set<WordNode>
+ getNextCount(nextNode : WordNode) int
+ addNextNode(nextNode : WordNode) void
+ getRandomNextNode() WordNode
+ equals(obj : Object) boolean
+ hashCode() int
+ toString() String
}
```
The WordNode class represents a "word" in our language model. Each WordNode keeps track of its token (token), a record of all nodes that have ever followed this node (allNextNodes), and the number of times each node has followed this node (nextNodeCounts). Essentially, it represents a "city", with roads leading to other cities.
Example
If I trained my model on the text:
The dog barks. The dog runs. The dog barks.
The WordNode for "dog" would have:
- `token`: `"dog"`
- `allNextNodes`: `["barks", "runs", "barks"]` (Note how this contains duplicates!)
- `nextNodeCounts`: `{ "barks" → 2, "runs" → 1 }`
Implement the class based on the UML diagram and the following specifications. We recommend implementing methods in the order listed below:
- `WordNode(String token)`: This constructor should initialize all three instance variables, including creating a new empty `HashMap` for `nextNodeCounts`.
- `getToken()`: Returns the token (word) represented by this `WordNode`.
- `getAllNextNodes()`: Returns the list of all nodes that have ever followed this node.
- `isEndingPunctuation()`: Returns `true` if the token is one of the following punctuation marks: `.`, `!`, or `?`. Otherwise, returns `false`.
- `getNextNodes()`: Returns a `Set<WordNode>` of all the `WordNode`s that can follow this node.
    - Hint: there is a method in `HashMap` that does this for you!
- `getNextCount(WordNode nextNode)`: Returns the count of how often the given next node follows this node.
    - If `nextNode` is `null`, throw an `IllegalArgumentException`.
    - If `nextNode` is not in `nextNodeCounts`, return `0`.
- `addNextNode(WordNode nextNode)`: Updates the `WordNode` to record that `nextNode` follows this node one more time.
    - If `nextNode` is `null`, throw an `IllegalArgumentException`.
    - This method should add `nextNode` to `allNextNodes`.
    - This method should also update the count of `nextNode` in `nextNodeCounts` (incrementing it by `1`).
        - If `nextNode` is not already in `nextNodeCounts`, add it with a count of `1`.
- `getRandomNextNode()`: Returns a randomly selected next `WordNode`. More frequent next nodes should have a higher chance of being selected.
    - If there are no next nodes (i.e., `allNextNodes` or `nextNodeCounts` has a size of `0`), return `null`.
    - How to pick a random node? The `allNextNodes` list already stores a record of every node that comes after this one, with each node appearing as many times as its count! So you can just pick a random index from that list to get a weighted random choice.
    - You can use either `Math.random()` or the `Random` class to generate random numbers.
- `equals(Object obj)`: Override the `equals` method to compare two `WordNode`s.
    - Make sure you properly handle `null` and check the type of `obj`.
    - Two `WordNode`s are considered equal if their `token`s are equal.
- `hashCode()`: This method must have one line: `return token.hashCode();`
- `toString()`: This method should return the token.
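Here is a minimal sketch of the counting and weighted-selection logic. The field names follow the UML, but the class is otherwise a stand-in of our own (for brevity it omits the required `equals`/`hashCode` overrides, so distinct objects are distinct map keys):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Random;

// Illustrative stand-in for WordNode; your class must also override equals/hashCode.
class WordNodeSketch {
    private final String token;
    private final ArrayList<WordNodeSketch> allNextNodes = new ArrayList<>();
    private final HashMap<WordNodeSketch, Integer> nextNodeCounts = new HashMap<>();
    private static final Random RAND = new Random();

    public WordNodeSketch(String token) {
        this.token = token;
    }

    public void addNextNode(WordNodeSketch nextNode) {
        if (nextNode == null) {
            throw new IllegalArgumentException("nextNode must not be null");
        }
        allNextNodes.add(nextNode); // duplicates are intentional: they encode the weights
        nextNodeCounts.merge(nextNode, 1, Integer::sum); // increment, or start at 1
    }

    public int getNextCount(WordNodeSketch nextNode) {
        if (nextNode == null) {
            throw new IllegalArgumentException("nextNode must not be null");
        }
        return nextNodeCounts.getOrDefault(nextNode, 0);
    }

    public WordNodeSketch getRandomNextNode() {
        if (allNextNodes.isEmpty()) {
            return null;
        }
        // A uniform random index into the duplicate-bearing list IS a weighted draw.
        return allNextNodes.get(RAND.nextInt(allNextNodes.size()));
    }

    @Override
    public String toString() {
        return token;
    }
}
```

Note how `addNextNode` keeps both structures in sync: the list gives cheap weighted sampling, while the map gives constant-time count lookups.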
Unit Tests¶
You should write unit tests for both the Tokenizer and WordNode classes. Write tests in files named TokenizerTest.java and WordNodeTest.java respectively.
We have provided you with two test files for the Tokenizer class:
- `test_file_1.txt`: a simple sentence with punctuation (contains 6 tokens).
- `test_file_2.txt`: a more complex file with multiple lines and various punctuation (contains 43 tokens).
You will need to place these files in src/hws/hw6/ and make sure to specify the correct path when reading them in your tests.
Use forward slashes!
Make sure you use "forward slashes" (/) in your file paths, even on Windows! This ensures that the autograder can read your files correctly.
You are free to create additional test files, but make sure to upload all test files to Gradescope, including these two (if you use them)!
Your tests must cover at least 90% of both branches and lines in both classes in order to receive credit for coverage. You should be able to achieve 95-97% coverage with thorough tests.
How to Test Randomness¶
How might you test random behavior in getRandomNextNode()? It will be different every time!
Here's what we're doing for the autograder:
- Instead of calling it once or twice, we call it many times in a loop (e.g., 1000 times).
- We keep a count of how many times each next node is returned (using a `HashMap`!).
- We then check that each word/node is returned approximately the expected number of times, based on its frequency.
- The counts won't be exact, since it's random, but they should be close enough (e.g., within 5% of the expected count).
Consider using a similar approach in your own tests!
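The same statistical approach can be demonstrated on a stand-in weighted list rather than a real `WordNode` (the class and method names here are hypothetical): draw many samples, tally them in a `HashMap`, and compare each tally to its expected count.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Random;

// Hypothetical helper showing the tally-and-compare pattern for random behavior.
class RandomnessTestSketch {
    // Draw `draws` uniform samples from a duplicate-bearing list and count each outcome.
    public static HashMap<String, Integer> tally(List<String> weighted, int draws, long seed) {
        Random rand = new Random(seed); // a fixed seed keeps the test reproducible
        HashMap<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < draws; i++) {
            String pick = weighted.get(rand.nextInt(weighted.size()));
            counts.merge(pick, 1, Integer::sum);
        }
        return counts;
    }

    // True if `actual` is within a fraction `tol` of `expected`.
    public static boolean withinTolerance(int actual, int expected, double tol) {
        return Math.abs(actual - expected) <= expected * tol;
    }
}
```

With the list `["barks", "barks", "runs"]` and 3000 draws, you would expect roughly 2000 `"barks"` and 1000 `"runs"`; fixing the seed (or using a generous tolerance) keeps such tests from failing sporadically.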
Part B: SmallLanguageModel¶
This is where the magic happens! You'll implement the SmallLanguageModel class that uses the Tokenizer and WordNode classes to build a simple language model from text files and generate new text. You should "normalize" all tokens to lowercase when building the model (i.e., treat "The" and "the" as the same word).
It should be designed so that you can have it read in multiple text files, updating the model each time. There are two main parts:
- Creating the model by reading text files, making `WordNode`s, and counting how often each node follows another node.
- Generating new sentences by starting at a random starting word and following the weighted paths through the `WordNode`s.
You do not need to write unit tests for Part B; instead, you can use a provided GUI (see below) to test your implementation.
Training the Model¶
The SmallLanguageModel class has two attributes:
- `startingWords` is a `HashSet<WordNode>` of all the words that can start a sentence. This includes both the first word in a file AND any word that immediately follows an ending punctuation mark (i.e., `.`, `!`, or `?`).
- `wordNodes` is a `HashMap<String, WordNode>` that maps each token (word) to its corresponding `WordNode` object.
To "train" the model on a text file, you will need to create WordNodes for each unique token in the file, and update their next node counts based on the sequence of tokens in the file.
For instance, the file:
The dog barks. The dog runs.
would result in five WordNodes: "the", "dog", "barks", ".", and "runs" (remember that tokens are normalized to lowercase). Each WordNode would keep track of which nodes follow it and how often.
The startingWords set would contain only one WordNode: "the".
Your goal is to create a linked structure of WordNodes like so:

Generating Text¶
You would start at a given WordNode (e.g., "the") and then pick a random next node based on the weights. You would then repeat this process until you reach a word that is ending punctuation (e.g., ., !, or ?).
The whole time, you would build up a string of the tokens you visit along the way.
For example, starting at "the", you might get the following sequence:
the dog ate .
Another time, you might get:
the end .
This is easily done recursively! Think about the following:
- Base case: When should you stop the recursive process? When do you end the sentence?
- Recursive case: Two parts:
- What is the smallest piece of work you can do to build the sentence? What does one node contribute to the final sentence?
- Then, what is the smaller subproblem you can solve recursively? How do you get the rest of the sentence?
SmallLanguageModel Class¶
```
classDiagram
class SmallLanguageModel {
- startingWords : HashSet<WordNode>
- wordNodes : HashMap<String, WordNode>
- getOrCreateWordNode(token : String) WordNode
+ SmallLanguageModel()
+ getStartingWords() Set<WordNode>
+ getWordNodes() Map<String, WordNode>
+ trainOnFile(filename : String) void
+ generateSentence(node : WordNode) String
}
```
Implement the SmallLanguageModel class based on the UML diagram and the following specifications. We recommend implementing methods in the order listed below:
- `SmallLanguageModel()`: This constructor should initialize both attributes by creating a new `HashSet` and `HashMap` for the respective attributes.
- `getStartingWords()`: A simple getter that returns the `startingWords` set.
- `getWordNodes()`: A simple getter that returns the `wordNodes` map.
- `getOrCreateWordNode(String token)`: A private helper method that will help you out in the next method!
    - This method should return the `WordNode` for the given token.
    - However, if the token does not exist, it should create a new `WordNode`, add it to the `wordNodes` map, and then return it.
- `trainOnFile(String filename)`: This method should train the language model on words from the given file.
    - Use the methods from `Tokenizer` to tokenize the file!
    - Make sure to normalize all tokens to lowercase when creating or looking up `WordNode`s.
    - You will need to go through each token and update `wordNodes` and `startingWords` accordingly.
        - When you encounter a token, use `getOrCreateWordNode` to get/create its corresponding `WordNode`.
        - If it's not the first token, update the previous token's `WordNode` to add the current token as a next word.
        - Hint: Make a variable outside of your loop to track the previous `WordNode` as you iterate through the tokens.
    - A word node should be treated as a starting word and added to `startingWords` if:
        - it is NOT a punctuation token, AND
        - either it is the first token in the file, OR it immediately follows an ending punctuation token (i.e., `.`, `!`, or `?`).
- `generateSentence(WordNode node)`: This method should generate a sentence starting from the given `node`.
    - This method should be recursive! No credit will be given for iterative solutions. See the "Generating Text" section above for tips.
    - It should return a `String` representing the generated sentence, starting from `node` and visiting one node at a time, continuing until an ending punctuation word is reached.
    - There should be a single space between each token (including punctuation) in the generated sentence. For example: `"the dog barks ."`
    - If `node` is `null`, return an empty string (`""`).
    - If `node` is an ending punctuation token, return a string with just that token.
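The training loop can be sketched over a pre-tokenized list. This stand-in uses plain token strings and lists instead of real `WordNode` objects, and helper names like `train` and `nextTokens` are our own; only the loop shape (previous-token tracking, lowercasing, starting-word detection) mirrors the spec:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

// Illustrative training-loop shape; the real class links WordNode objects instead.
class TrainingSketch {
    private final HashSet<String> startingWords = new HashSet<>();
    private final HashMap<String, ArrayList<String>> nextTokens = new HashMap<>();

    private static boolean isEndingPunctuation(String t) {
        return t.equals(".") || t.equals("!") || t.equals("?");
    }

    private static boolean isPunctuation(String t) {
        return t.length() == 1 && !Character.isLetterOrDigit(t.charAt(0));
    }

    public void train(List<String> tokens) {
        String previous = null;       // tracked outside the loop, per the hint
        boolean nextIsStart = true;   // the first token starts a sentence
        for (String raw : tokens) {
            String token = raw.toLowerCase(); // normalize before lookup
            if (nextIsStart && !isPunctuation(token)) {
                startingWords.add(token);
            }
            if (previous != null) { // record that `token` follows `previous`
                nextTokens.computeIfAbsent(previous, k -> new ArrayList<>()).add(token);
            }
            nextIsStart = isEndingPunctuation(token); // ". ! ?" end a sentence
            previous = token;
        }
    }

    public HashSet<String> getStartingWords() {
        return startingWords;
    }

    public HashMap<String, ArrayList<String>> getNextTokens() {
        return nextTokens;
    }
}
```

Training this sketch on `"The dog barks. The dog runs."` leaves `"the"` as the only starting word and records `"dog"` twice after `"the"`, matching the example in the Training the Model section.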
Testing the SmallLanguageModel¶
You do NOT need to write unit tests for the SmallLanguageModel class. Instead, we will provide you with a simple Graphical User Interface (GUI) to help you test and visualize your language model.
Download this file and add it to your src/hws/hw6/ directory.
As long as you've implemented the other classes, you should be able to run this file and see a window like this (yours will be blank at first):

You can use the File menu to load text files to train your model. Afterwards, you can use the top part to visualize the next-word frequencies, and use the bottom part to generate new sentences from a starting word!
As long as your Scanner is set to UTF-8, you should be able to load any text file (e.g., from Project Gutenberg). The above example was trained on a cleaned-up version of Alice in Wonderland, originally written by Lewis Carroll.
You can even try training on multiple files to see how the model changes!
We built the GUI with AI!
Speaking of language models, we used Claude Sonnet AI to write parts of the GUI code. It made development faster, but there was still a lot of manual work, debugging, reprompting, and testing needed.
We still needed to know how to write Java Swing GUIs to make it work!
Think: AI Language Models¶
You've created your own small language model! Obviously, it's nowhere near as good as ChatGPT, but it forms the basis for understanding how these models work.
Here are some things to consider (you don't need to submit answers to these questions, but think about them!):
- Right now, you only look at the previous word to predict the next word. How could you modify your model to look at the previous two, three, or N words instead?
- Does feeding more data into your model improve the quality of the generated text? Why or why not?
- Where do you think you could get more data to train your model?
- If I only trained it on Shakespeare, what kind of text would it generate?
- What if I only trained it on social media posts? Would the resulting model be biased in any way?
- What if we trained on all human writing ever produced?
- What do you think are the ethical or legal implications of training language models on text written by others without their permission?
- Does this constitute real "intelligence"? Why or why not?
Hopefully this assignment has helped peel back the curtain on how language models work!
Submission¶
You will submit this assignment in two parts:
Part A¶
Submit:
- `Tokenizer.java`
- `TokenizerTest.java`
- `WordNode.java`
- `WordNodeTest.java`
And any test (.txt) files used by your tests.
You have 10 free submissions for Part A, after which each additional submission will cost 2 pts.
Part B¶
Submit:
- `Tokenizer.java`
- `WordNode.java`
- `SmallLanguageModel.java`
(Tokenizer and WordNode will not be tested again, but we need them since SmallLanguageModel depends on them.)
You have 10 free submissions for Part B, after which each additional submission will cost 2 pts.
Grading Criteria¶
Your code must compile with the official tests and pass a Checkstyle audit for you to receive any points. Gradescope will provide you with hints but might not completely identify the defects in your submission. Read the error message carefully for hints! What is the method testing? What hint message is given? We strongly recommend asking your instructor for help if you are missing a test but cannot easily identify the issue.
After the due date, the professor may manually review your code. At that time, points may be deducted for inelegant code, inappropriate variable names, bad comments, etc.
Part A (50 points)¶
| Criterion | Points | Details |
|---|---|---|
| Compile | 0 pts | Success Required |
| CompileOfficialTests | 0 pts | Success Required |
| Style | 0 pts | Success Required |
| Passing Your Tests (SelfTests) | 10 pts | (Partial credit possible) |
| WordNode/Tokenizer Coverage (Coverage) | 10 pts | (All or nothing; you must have at least 90% coverage of branches and lines) |
| Correctness (OfficialTests) | 30 pts | (Partial credit possible) |
Part B (50 points)¶
| Criterion | Points | Details |
|---|---|---|
| Compile | 0 pts | Success Required |
| CompileOfficialTests | 0 pts | Success Required |
| Style | 0 pts | Success Required |
| Passing Official Tests (OfficialTests) | 50 pts | (Partial credit possible) |