CS 445 Machine Learning
PA 0: Unigrams and Bigrams
Goals
- Help you get up to speed with the Python programming language.
- Provide an introduction to some simple algorithms for statistics-based natural language processing (NLP).
Supplied Components
Component | Description
---|---
text_gen.py | Starter code
huck.txt* | Sample text for learning the word distribution
text_gen_tests.py | Code to test your solution
Part 1: Warm-up
After you log in to CodingBat, follow the "prefs" link to enter your name and my email address (molloykp@jmu.edu) in the "Teacher Share" box. You must complete at least one exercise from each of the non-warm-up categories (at least 6 in all).
Feel free to skim the tutorials above if you are already comfortable programming in Python. If you are new to Python, I strongly encourage you to work through each step of the tutorials carefully. It will probably take a couple of hours, but the time spent will pay dividends throughout the semester.
Part 2: Bigrams and Trigrams
The unmodified starter code generates text using unigrams: each word is sampled independently according to its frequency in the input file. For example:

```
python3 text_gen.py --infile huck.txt
tell shot up finn man if unloads on wonderful go know swear s myself no to good in and no home times a pick inside janeero s warn misunderstood sometimes sweat wouldn sakes and i away didn i as next furnish two the it put his dick take scared nor i on said we a was i blankets up poor bull him and asked what mary old and you that night en and comfortable all from and re it running we a lonesome but bible he up hitched a t a i telling says yarter hot call can if then
```
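A minimal sketch of this unigram sampling idea is shown below. This is not the actual text_gen.py implementation (which may tokenize and structure things differently); it just illustrates frequency-weighted independent sampling:

```python
import random
from collections import Counter


def generate_unigram_text(words, num_words):
    """Sample each word independently, weighted by its frequency in the corpus."""
    counts = Counter(words)
    vocab = list(counts)
    weights = [counts[w] for w in vocab]
    return " ".join(random.choices(vocab, weights=weights, k=num_words))


if __name__ == "__main__":
    # huck.txt is the sample text supplied with the assignment.
    words = open("huck.txt", encoding="utf-8").read().split()
    print(generate_unigram_text(words, 100))
```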
Once you have implemented the bigram and trigram models, the generated text should look significantly more English-like.
Note that the code in text_gen.py uses the terms "bigram" and "trigram" in slightly non-standard ways. Typically, bigrams encode the probability of particular word pairs. For example, bigrams could tell us that the probability of the sequence ("pineapple", "extrapolate") is relatively low, while the probability of the sequence ("going", "home") is relatively high. Since we are interested in generating text, it is more useful for us to store the probability that one word will follow another: the probability of observing "home" after "going" is much higher than the probability of observing "cup" after "going". It turns out that the two representations are interchangeable, as described on the Wikipedia bigram page.
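As a concrete illustration of the "next word" representation, the sketch below maps each word to a Counter of its observed successors and samples from the resulting conditional distribution. The function names here are illustrative only and are not the structure required in text_gen.py:

```python
import random
from collections import Counter, defaultdict


def build_successor_counts(words):
    """Map each word to a Counter of the words observed immediately after it."""
    successors = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        successors[prev][nxt] += 1
    return successors


def sample_next(successors, word):
    """Draw the next word from the conditional distribution P(next | word)."""
    counts = successors[word]
    candidates = list(counts)
    weights = [counts[w] for w in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]


words = "i am going home because going home feels good".split()
successors = build_successor_counts(words)
# 'going' was followed by 'home' both times it appeared, so 'home' is
# the only possible successor here.
print(sample_next(successors, "going"))
```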
Submission and Rubric
You should submit the completed version of text_gen.py to Autolab by the deadline.
This project will be graded on the following scale:
Criterion | Weight
---|---
CodingBat exercises | 20%
Code in text_gen.py more or less conforms to PEP 8 | 10%
Functionality of bigram functions in text_gen.py | 60%
Functionality of trigram functions in text_gen.py | 10%
Keep in mind that testing does not prove the correctness of code. It is possible to write incorrect code that passes the provided unit tests. As always, it is your responsibility to ensure that your code functions correctly.
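One way to gain extra confidence is to write additional edge-case tests of your own. The sketch below assumes the tests use the standard unittest module, that your counting function returns a mapping from each word to a Counter of successors (as in the earlier sketch), and uses a hypothetical function name (build_bigram_counts); substitute the actual names from text_gen.py:

```python
import unittest

# Hypothetical import: substitute the actual function name from text_gen.py.
from text_gen import build_bigram_counts


class ExtraSanityTests(unittest.TestCase):
    def test_counts_cover_every_pair(self):
        words = "the cat sat on the mat".split()
        counts = build_bigram_counts(words)
        # Assuming counts maps each word to a Counter of successors,
        # the totals should equal the number of adjacent word pairs.
        total = sum(sum(c.values()) for c in counts.values())
        self.assertEqual(total, len(words) - 1)


if __name__ == "__main__":
    unittest.main()
```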
PA originally developed by Nathan Sprague and modified by Kevin Molloy.
* This is the complete text of *Adventures of Huckleberry Finn* by Mark Twain, obtained from Project Gutenberg.