HW 6: Small Language Model¶

Learning Objectives¶
After completing this homework, you should be able to:
- Use Java collections to store and query data.
- Generate random numbers in Java.
- Read from text files and parse lines character by character.
- Write recursive code to process nested structures.
- Implement a simple predictive text generator.
Introduction¶
Modern AI platforms are built on systems called large language models (LLMs). At its core, all an LLM does is predict what word comes next in a sequence of words. For example, given the text "The cat sat on the", an LLM might predict that the next word is "mat".
Let's create a very simple version of this that only looks at the previous word to predict the next word. We'll create a statistical word predictor that reads a text file, counts how often each word is followed by other words, and then uses that information to generate new text.
We'll do this by creating "WordNode" objects that represent each word and keep track of which words can follow it and how often. Think of each "WordNode" as a city, and the words that can follow it as roads leading to other cities. The more often a road is used, the wider it is:

Thus, we could generate new text by starting at a word and then randomly following the roads to the next word, but weighted by how often each road is used. In the above example, "the end ." is the most likely sequence, but "the dog ran ." and "the cat ate ." are also possible. We use randomness to choose between them, but more common sequences are more likely to be chosen so that the generated text is more natural.
This is a simple technique that should generate text that at least follows the patterns of the original text (even if it doesn't always make sense). We'll call this our "Small Language Model"!
UML Diagram¶
```
classDiagram
class Tokenizer {
<<utility>>
+ getTokensFromLine(line : String) ArrayList<String>$
+ getTokensFromFile(filename : String) ArrayList<String>$
}
class WordNode {
- token : String
- allNextNodes : ArrayList<WordNode>
- nextNodeCounts : HashMap<WordNode, Integer>
+ WordNode(token : String)
+ getToken() String
+ getAllNextNodes() List<WordNode>
+ isEndingPunctuation() boolean
+ getNextNodes() Set<WordNode>
+ getNextCount(nextNode : WordNode) int
+ addNextNode(nextNode : WordNode) void
+ getRandomNextNode() WordNode
+ equals(obj : Object) boolean
+ hashCode() int
+ toString() String
}
class SmallLanguageModel {
- startingWords : HashSet<WordNode>
- wordNodes : HashMap<String, WordNode>
- getOrCreateWordNode(token : String) WordNode
+ SmallLanguageModel()
+ getStartingWords() Set<WordNode>
+ getWordNodes() Map<String, WordNode>
+ trainOnFile(filename : String) void
+ generateSentence(node : WordNode) String
}
SmallLanguageModel o-- WordNode : contains
WordNode --> WordNode : counts next words
SmallLanguageModel ..> Tokenizer : uses
```
Part A: Tokenizer and WordNode¶
In this part, you'll implement a Tokenizer class that reads text files and breaks them into tokens (words and punctuation). You'll also implement a WordNode class that represents a node in our model, which keeps track of a word and the words that can follow it.
What is a Token?¶
Let's define a "token" as either a word (a continuous sequence of letters and/or digits) or punctuation (a single non-alphanumeric character: `.`, `,`, `!`, `?`, `"`, `'`, etc.). Spaces and other whitespace characters are not tokens and should be ignored.
Thus, the sentence:
I have 1 cat and 12 dogs...
would be tokenized into the following tokens:
["I", "have", "1", "cat", "and", "12", "dogs", ".", ".", "."]
Note how multiple spaces are ignored, and each punctuation character is its own token.
Do note that numbers and letters together are considered part of the same token. For example, "CS159" would be a single token.
Tokenizer Class¶
```
classDiagram
class Tokenizer {
<<utility>>
+ getTokensFromLine(line : String) ArrayList<String>$
+ getTokensFromFile(filename : String) ArrayList<String>$
}
```
The Tokenizer class is a utility class that should have the following static methods:

- `getTokensFromLine(String line)`: Takes a line of text and returns an `ArrayList<String>` of tokens.
    - You will need to process each character in the line one at a time.
    - You can use the `.charAt(int index)` method of `String` to get individual characters.
    - We recommend building up a token character by character, then adding the completed token to the list.
    - Do this by hand first! Hint: Go through the line character by character on paper. Think about: (1) what to do when you see an alphanumeric character, (2) what to do when you see a punctuation character, and (3) what to do when you see whitespace. When do you add a token to the list?
    - Remember to handle the end of the line properly: if the line ends while you're building a token, you need to add that token to the list before finishing.
    - You can also solve this problem by using a combination of different `String` methods; up to you!
- `getTokensFromFile(String filename)`: Takes a filename, reads the file, and returns an `ArrayList<String>` of tokens from the entire file.
    - This method should use `getTokensFromLine` to process each line of the file.
    - If the file cannot be found or read, this method should print an error message (`"Error reading file!"`) and return an empty list.
    - Set your Scanner to UTF-8! When you create your `Scanner`, we recommend specifying the `"UTF-8"` charset to ensure proper reading of text files: `Scanner fileScanner = new Scanner(file, "UTF-8");` Otherwise, if you try to use a file with special characters, the Scanner may not output any tokens at all!

Useful methods from the `Character` class:

- `Character.isLetterOrDigit(char ch)`: Returns `true` if the character is a letter (a-z, A-Z) or digit (0-9).
- `Character.isWhitespace(char ch)`: Returns `true` if the character is a whitespace character (space, tab, newline, etc.).
- `Character.toString(char ch)`: Converts a character to a `String`.
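To make this concrete, here is one possible sketch of the character-by-character strategy. The method name matches the UML, but the wrapper class and the exact logic are our illustration, not the official solution:

```java
import java.util.ArrayList;

// Illustrative sketch only; your Tokenizer class may be structured differently.
class TokenizerSketch {
    public static ArrayList<String> getTokensFromLine(String line) {
        ArrayList<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder(); // the token being built
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (Character.isLetterOrDigit(ch)) {
                current.append(ch); // keep extending the current word
            } else {
                if (current.length() > 0) { // the word just ended; emit it
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (!Character.isWhitespace(ch)) {
                    tokens.add(Character.toString(ch)); // punctuation is its own token
                }
            }
        }
        if (current.length() > 0) { // don't drop a word at the end of the line
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Running this on the example sentence above yields the same ten tokens shown earlier; note the final `if` that handles a token still being built when the line ends.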
WordNode Class¶
```
classDiagram
class WordNode {
- token : String
- allNextNodes : ArrayList<WordNode>
- nextNodeCounts : HashMap<WordNode, Integer>
+ WordNode(token : String)
+ getToken() String
+ getAllNextNodes() List<WordNode>
+ isEndingPunctuation() boolean
+ getNextNodes() Set<WordNode>
+ getNextCount(nextNode : WordNode) int
+ addNextNode(nextNode : WordNode) void
+ getRandomNextNode() WordNode
+ equals(obj : Object) boolean
+ hashCode() int
+ toString() String
}
```
The WordNode class represents a "word" in our language model. Each WordNode keeps track of its token (token), a record of all nodes that have ever followed this node (allNextNodes), and the number of times each node has followed this node (nextNodeCounts). Essentially, it represents a "city", with roads leading to other cities.
Example
If I trained my model on the text:
The dog barks. The dog runs. The dog barks.
The WordNode for "dog" would have:
- `token`: `"dog"`
- `allNextNodes`: `["barks", "runs", "barks"]` (Note how this contains duplicates!)
- `nextNodeCounts`: `{ "barks" → 2, "runs" → 1 }`
Implement the class based on the UML diagram and the following specifications. We recommend implementing methods in the order listed below:
- `WordNode(String token)`: This constructor should initialize all three instance variables, including creating a new empty `HashMap` for `nextNodeCounts`.
- `getToken()`: Returns the token (word) represented by this `WordNode`.
- `getAllNextNodes()`: Returns the list of all nodes that have ever followed this node.
- `isEndingPunctuation()`: Returns `true` if the token is one of the following punctuation marks: `.`, `!`, or `?`. Otherwise, returns `false`.
- `getNextNodes()`: Returns a `Set<WordNode>` of all the `WordNode`s that can follow this node.
    - Hint: there is a method in `HashMap` that does this for you!
- `getNextCount(WordNode nextNode)`: Returns the count of how often the given next node follows this node.
    - If `nextNode` is `null`, throw an `IllegalArgumentException`.
    - If `nextNode` is not in `nextNodeCounts`, return `0`.
- `addNextNode(WordNode nextNode)`: Updates the `WordNode` to record that `nextNode` follows this node one more time.
    - If `nextNode` is `null`, throw an `IllegalArgumentException`.
    - This method should add `nextNode` to `allNextNodes`.
    - This method should also update the count of `nextNode` in `nextNodeCounts` (incrementing it by `1`).
        - If `nextNode` is not already in `nextNodeCounts`, add it with a count of `1`.
- `getRandomNextNode()`: Returns a randomly selected next `WordNode`. More frequent next nodes should have a higher chance of being selected.
    - If there are no next nodes (i.e., `allNextNodes` or `nextNodeCounts` has a size of `0`), return `null`.
    - How to pick a random node? The `allNextNodes` list already stores a record of every node that comes after this one, with each node appearing as many times as its count! So you can just pick a random index from that list to get a weighted random choice.
    - You can use either `Math.random()` or the `Random` class to generate random numbers.
- `equals(Object obj)`: Override the `equals` method to compare two `WordNode`s.
    - Make sure you properly handle `null` and check the type of `obj`.
    - Two `WordNode`s are considered equal if their `token`s are equal.
- `hashCode()`: This method must have one line: `return token.hashCode();`
- `toString()`: This method should return the token.
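Here is a minimal sketch of the counting and weighted-selection logic. The field names follow the UML, but the class is otherwise a stand-in of our own (for brevity it omits the required `equals`/`hashCode` overrides, so distinct objects are distinct map keys):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Random;

// Illustrative stand-in for WordNode; your class must also override equals/hashCode.
class WordNodeSketch {
    private final String token;
    private final ArrayList<WordNodeSketch> allNextNodes = new ArrayList<>();
    private final HashMap<WordNodeSketch, Integer> nextNodeCounts = new HashMap<>();
    private static final Random RAND = new Random();

    public WordNodeSketch(String token) {
        this.token = token;
    }

    public void addNextNode(WordNodeSketch nextNode) {
        if (nextNode == null) {
            throw new IllegalArgumentException("nextNode must not be null");
        }
        allNextNodes.add(nextNode); // duplicates are intentional: they encode the weights
        nextNodeCounts.merge(nextNode, 1, Integer::sum); // increment, or start at 1
    }

    public int getNextCount(WordNodeSketch nextNode) {
        if (nextNode == null) {
            throw new IllegalArgumentException("nextNode must not be null");
        }
        return nextNodeCounts.getOrDefault(nextNode, 0);
    }

    public WordNodeSketch getRandomNextNode() {
        if (allNextNodes.isEmpty()) {
            return null;
        }
        // A uniform random index into the duplicate-bearing list IS a weighted draw.
        return allNextNodes.get(RAND.nextInt(allNextNodes.size()));
    }

    @Override
    public String toString() {
        return token;
    }
}
```

Note how `addNextNode` keeps both structures in sync: the list gives cheap weighted sampling, while the map gives constant-time count lookups.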
Unit Tests¶
You should write unit tests for both the Tokenizer and WordNode classes. Write tests in files named TokenizerTest.java and WordNodeTest.java respectively.
We have provided you with two test files for the Tokenizer class:
- `test_file_1.txt`: a simple sentence with punctuation (contains 6 tokens).
- `test_file_2.txt`: a more complex file with multiple lines and various punctuation (contains 43 tokens).
You will need to place these files in src/hws/hw6/ and make sure to specify the correct path when reading them in your tests.
Use forward slashes!
Make sure you use "forward slashes" (/) in your file paths, even on Windows! This ensures that the autograder can read your files correctly.
You are free to create additional test files, but make sure to upload all test files to Gradescope, including these two (if you use them)!
Your tests must cover at least 90% of both branches and lines in both classes in order to receive credit for coverage. You should be able to achieve 95-97% coverage with thorough tests.
How to Test Randomness¶
How might you test random behavior in getRandomNextNode()? It will be different every time!
Here's what we're doing for the autograder:
- Instead of calling it once or twice, we call it many times in a loop (e.g., 1000 times).
- We keep a count of how many times each next node is returned (using a `HashMap`!).
- We then check that each word/node is returned approximately the expected number of times, based on its frequency.
- The counts won't be exact, since it's random, but they should be close enough (e.g., within 5% of the expected count).
Consider using a similar approach in your own tests!
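The same statistical approach can be demonstrated on a stand-in weighted list rather than a real `WordNode` (the class and method names here are hypothetical): draw many samples, tally them in a `HashMap`, and compare each tally to its expected count.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Random;

// Hypothetical helper showing the tally-and-compare pattern for random behavior.
class RandomnessTestSketch {
    // Draw `draws` uniform samples from a duplicate-bearing list and count each outcome.
    public static HashMap<String, Integer> tally(List<String> weighted, int draws, long seed) {
        Random rand = new Random(seed); // a fixed seed keeps the test reproducible
        HashMap<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < draws; i++) {
            String pick = weighted.get(rand.nextInt(weighted.size()));
            counts.merge(pick, 1, Integer::sum);
        }
        return counts;
    }

    // True if `actual` is within a fraction `tol` of `expected`.
    public static boolean withinTolerance(int actual, int expected, double tol) {
        return Math.abs(actual - expected) <= expected * tol;
    }
}
```

With the list `["barks", "barks", "runs"]` and 3000 draws, you would expect roughly 2000 `"barks"` and 1000 `"runs"`; fixing the seed (or using a generous tolerance) keeps such tests from failing sporadically.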
Part B: SmallLanguageModel¶
This is where the magic happens! You'll implement the SmallLanguageModel class that uses the Tokenizer and WordNode classes to build a simple language model from text files and generate new text. You should "normalize" all tokens to lowercase when building the model (i.e., treat "The" and "the" as the same word).
It should be designed so that you can have it read in multiple text files, updating the model each time. There are two main parts:
- Creating the model by reading text files, making `WordNode`s, and counting how often each node follows another node.
- Generating new sentences by starting at a random starting word and following the weighted paths through the `WordNode`s.
You do not need to write unit tests for Part B; instead, you can use a provided GUI (see below) to test your implementation.
Training the Model¶
The SmallLanguageModel class has two attributes:
- `startingWords` is a `HashSet<WordNode>` of all the words that can start a sentence. This includes both the first word in a file AND any word that immediately follows an ending punctuation mark (i.e., `.`, `!`, or `?`).
- `wordNodes` is a `HashMap<String, WordNode>` that maps each token (word) to its corresponding `WordNode` object.
To "train" the model on a text file, you will need to create WordNodes for each unique token in the file, and update their next node counts based on the sequence of tokens in the file.
For instance, the file:
The dog barks. The dog runs.
would result in five WordNodes: "the", "dog", "barks", ".", and "runs" (remember that tokens are normalized to lowercase). Each WordNode would keep track of which nodes follow it and how often.
The startingWords set would contain only one WordNode: "the".
Your goal is to create a linked structure of WordNodes like so:

Generating Text¶
You would start at a given WordNode (e.g., "the") and then pick a random next node based on the weights. You would then repeat this process until you reach a word that is ending punctuation (e.g., ., !, or ?).
The whole time, you would build up a string of the tokens you visit along the way.
For example, starting at "the", you might get the following sequence:
the dog ate .
Another time, you might get:
the end .
This is easily done recursively! Think about the following:
- Base case: When should you stop the recursive process? When do you end the sentence?
- Recursive case: Two parts:
- What is the smallest piece of work you can do to build the sentence? What does one node contribute to the final sentence?
- Then, what is the smaller subproblem you can solve recursively? How do you get the rest of the sentence?
SmallLanguageModel Class¶
```
classDiagram
class SmallLanguageModel {
- startingWords : HashSet<WordNode>
- wordNodes : HashMap<String, WordNode>
- getOrCreateWordNode(token : String) WordNode
+ SmallLanguageModel()
+ getStartingWords() Set<WordNode>
+ getWordNodes() Map<String, WordNode>
+ trainOnFile(filename : String) void
+ generateSentence(node : WordNode) String
}
```
Implement the SmallLanguageModel class based on the UML diagram and the following specifications. We recommend implementing methods in the order listed below:
- `SmallLanguageModel()`: This constructor should initialize both attributes by creating a new `HashSet` and `HashMap` for the respective attributes.
- `getStartingWords()`: A simple getter that returns the `startingWords` set.
- `getWordNodes()`: A simple getter that returns the `wordNodes` map.
- `getOrCreateWordNode(String token)`: A private helper method that will help you out in the next method!
    - This method should return the `WordNode` for the given token.
    - However, if the token does not exist, it should create a new `WordNode`, add it to the `wordNodes` map, and then return it.
- `trainOnFile(String filename)`: This method should train the language model on words from the given file.
    - Use the methods from `Tokenizer` to tokenize the file!
    - Make sure to normalize all tokens to lowercase when creating or looking up `WordNode`s.
    - You will need to go through each token and update `wordNodes` and `startingWords` accordingly.
        - When you encounter a token, use `getOrCreateWordNode` to get/create its corresponding `WordNode`.
        - If it's not the first token, update the previous token's `WordNode` to add the current token as a next word.
        - Hint: Make a variable outside of your loop to track the previous `WordNode` as you iterate through the tokens.
    - A word node should be treated as a starting word and added to `startingWords` if:
        - it is NOT a punctuation token, AND
        - either it is the first token in the file, OR it immediately follows an ending punctuation token (i.e., `.`, `!`, or `?`).
- `generateSentence(WordNode node)`: This method should generate a sentence starting from the given `node`.
    - This method should be recursive! No credit will be given for iterative solutions. See the "Generating Text" section above for tips.
    - It should return a `String` representing the generated sentence, starting from `node` and visiting one node at a time, continuing until an ending punctuation word is reached.
    - There should be a single space between each token (including punctuation) in the generated sentence. For example: `"the dog barks ."`
    - If `node` is `null`, return an empty string (`""`).
    - If `node` is an ending punctuation token, return a string with just that token.
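The training loop can be sketched over a pre-tokenized list. This stand-in uses plain token strings and lists instead of real `WordNode` objects, and helper names like `train` and `nextTokens` are our own; only the loop shape (previous-token tracking, lowercasing, starting-word detection) mirrors the spec:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

// Illustrative training-loop shape; the real class links WordNode objects instead.
class TrainingSketch {
    private final HashSet<String> startingWords = new HashSet<>();
    private final HashMap<String, ArrayList<String>> nextTokens = new HashMap<>();

    private static boolean isEndingPunctuation(String t) {
        return t.equals(".") || t.equals("!") || t.equals("?");
    }

    private static boolean isPunctuation(String t) {
        return t.length() == 1 && !Character.isLetterOrDigit(t.charAt(0));
    }

    public void train(List<String> tokens) {
        String previous = null;       // tracked outside the loop, per the hint
        boolean nextIsStart = true;   // the first token starts a sentence
        for (String raw : tokens) {
            String token = raw.toLowerCase(); // normalize before lookup
            if (nextIsStart && !isPunctuation(token)) {
                startingWords.add(token);
            }
            if (previous != null) { // record that `token` follows `previous`
                nextTokens.computeIfAbsent(previous, k -> new ArrayList<>()).add(token);
            }
            nextIsStart = isEndingPunctuation(token); // ". ! ?" end a sentence
            previous = token;
        }
    }

    public HashSet<String> getStartingWords() {
        return startingWords;
    }

    public HashMap<String, ArrayList<String>> getNextTokens() {
        return nextTokens;
    }
}
```

Training this sketch on `"The dog barks. The dog runs."` leaves `"the"` as the only starting word and records `"dog"` twice after `"the"`, matching the example in the Training the Model section.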
Testing the SmallLanguageModel¶
You do NOT need to write unit tests for the SmallLanguageModel class. Instead, we will provide you with a simple Graphical User Interface (GUI) to help you test and visualize your language model.
Download this file and add it to your src/hws/hw6/ directory.
As long as you've implemented the other classes, you should be able to run this file and see a window like this (yours will be blank at first):

You can use the File menu to load text files to train your model. Afterwards, you can use the top part to visualize the next-word frequencies, and use the bottom part to generate new sentences from a starting word!
As long as your Scanner is set to UTF-8, you should be able to load any text file (e.g., from Project Gutenberg). The above example was trained on a cleaned-up version of Alice in Wonderland, originally written by Lewis Carroll.
You can even try training on multiple files to see how the model changes!
We built the GUI with AI!
Speaking of language models, we used Claude Sonnet AI to write parts of the GUI code. It made development faster, but there was still a lot of manual work, debugging, reprompting, and testing needed.
We still needed to know how to write Java Swing GUIs to make it work!
Think: AI Language Models¶
You've created your own small language model! Obviously, it's nowhere near as good as ChatGPT, but it forms the basis for understanding how these models work.
Here are some things to consider (you don't need to submit answers to these questions, but think about them!):
- Right now, you only look at the previous word to predict the next word. How could you modify your model to look at the previous two, three, or N words instead?
- Does feeding more data into your model improve the quality of the generated text? Why or why not?
- Where do you think you could get more data to train your model?
- If I only trained it on Shakespeare, what kind of text would it generate?
- What if I only trained it on social media posts? Would the resulting model be biased in any way?
- What if we trained on all human writing ever produced?
- What do you think are the ethical or legal implications of training language models on text written by others without their permission?
- Does this constitute real "intelligence"? Why or why not?
Hopefully this assignment has helped peel back the curtain on how language models work!
Submission¶
You will submit this assignment in two parts:
Part A¶
Submit:
- `Tokenizer.java`
- `TokenizerTest.java`
- `WordNode.java`
- `WordNodeTest.java`
And any test (.txt) files used by your tests.
You have 10 free submissions for Part A, after which each additional submission will cost 2 pts.
Part B¶
Submit:
- `Tokenizer.java`
- `WordNode.java`
- `SmallLanguageModel.java`
(Tokenizer and WordNode will not be tested again, but we need them since SmallLanguageModel depends on them.)
You have 10 free submissions for Part B, after which each additional submission will cost 2 pts.
Grading Criteria¶
Your code must compile with the official tests and pass a Checkstyle audit for you to receive any points. Gradescope will provide you with hints but might not completely identify the defects in your submission. Read the error message carefully for hints! What is the method testing? What hint message is given? We strongly recommend asking your instructor for help if you are missing a test but cannot easily identify the issue.
After the due date, the professor may manually review your code. At that time, points may be deducted for inelegant code, inappropriate variable names, bad comments, etc.
Part A (50 points)¶
| Criterion | Points | Details |
|---|---|---|
| Compile | 0 pts | Success Required |
| CompileOfficialTests | 0 pts | Success Required |
| Style | 0 pts | Success Required |
| Passing Your Tests (SelfTests) | 10 pts | (Partial credit possible) |
| WordNode/Tokenizer Coverage (Coverage) | 10 pts | (All or nothing; you must have at least 90% coverage of branches and lines) |
| Correctness (OfficialTests) | 30 pts | (Partial credit possible) |
Part B (50 points)¶
| Criterion | Points | Details |
|---|---|---|
| Compile | 0 pts | Success Required |
| CompileOfficialTests | 0 pts | Success Required |
| Style | 0 pts | Success Required |
| Passing Official Tests (OfficialTests) | 50 pts | (Partial credit possible) |