Machine Learning: PA #0

Goals

The purpose of this programming assignment is to help you get refreshed or up-to-speed with the Python programming language and Jupyter notebooks. This PA will also give you an opportunity to experiment with some simple algorithms for statistics-based natural language processing.

Part 1: Tutorial

Get started by completing one or both of the following online tutorials:

Feel free to skim the tutorials above if you are already comfortable programming in Python. If you are new to Python, I strongly encourage you to work through each step of the tutorials carefully. It will probably take a couple of hours, but the time spent will pay dividends throughout the semester.

Part 2: nbgrader

This class will make extensive use of Jupyter Notebooks for lab activities. The goal for this part of the assignment is to gain experience working with Jupyter notebooks while practicing Python basics. Do not use AI code generation tools for this portion of the assignment.

Create a folder to contain your lab activities for this semester, and download the following file into that folder: IntroPython.ipynb

Then open that notebook using either Jupyter Notebook or JupyterLab, both of which can be launched from the terminal or from Anaconda Navigator.

If all goes well, this should bring up a browser window that will allow you to select the notebook and enter your solutions. Submit your completed notebook through Gradescope.

Part 3: Python Classes

For this part of the assignment you must translate the following Java class into Python:

Card.java

All of the basic functionality should be the same, but your Python version should use appropriate “Pythonic” style. Your class should be stored in a file named card.py. Your completed class should pass all of the unit tests in test_card.py.

For the purposes of this exercise, you do not need to provide any comments or documentation in your class. Do not use AI code generation tools for this portion of the assignment.

Properties

In Java it is standard to use private instance variables that are accessed through getter and setter methods. In Python, it is more common to access instance variables directly. (There is no language support for private members. The convention is to prefix members with an underscore if we want them to be considered private.)

For example, you typically won’t see code that looks like this:

alice = Person("Alice")
her_name = alice.get_name()  # get_name is not Pythonic!!

Instead, you’ll see something like this:

alice = Person("Alice")
her_name = alice.name
alice.name = "Alicia"  # Name change! Maybe this should not be possible?

Of course, there are cases (like our Card class) where we want to protect instance variables from being modified by outside code. The standard way of handling this in Python is through properties. Python properties make it look like we are accessing instance variables directly, when in reality appropriate methods are being called. You can read more about properties in the following tutorial:

https://www.tutorialsteacher.com/python/property-decorator
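
As a minimal sketch of the idea, here is a read-only property on the Person class from the examples above. The class body is an assumption made purely for illustration and is not part of the assignment code. Reading alice.name still looks like direct attribute access, but assigning to it raises an AttributeError because no setter is defined.

```python
class Person:
    """Hypothetical class used only to illustrate read-only properties."""

    def __init__(self, name):
        # Leading underscore: private by convention only.
        self._name = name

    @property
    def name(self):
        # Called automatically whenever code reads person.name.
        return self._name


alice = Person("Alice")
her_name = alice.name  # Looks like attribute access, but calls the method.

try:
    alice.name = "Alicia"  # No setter is defined, so this fails.
except AttributeError:
    rename_blocked = True
else:
    rename_blocked = False
```

If a value should be changeable but validated, a matching setter can be added with the @name.setter decorator.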

Comparisons and “toString”

In Java, it is standard practice to override the equals and toString methods when defining a new class. The same principle applies in Python, but we need to override a different set of methods.

Here is some documentation about enabling comparisons:

https://docs.python.org/3/reference/datamodel.html#object.__lt__

https://docs.python.org/3/library/functools.html#functools.total_ordering

This is the method that needs to be overridden to provide a string representation:

https://docs.python.org/3/reference/datamodel.html#object.__str__
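
The sketch below shows how these pieces might fit together on a small class. Version and its fields are hypothetical names chosen for illustration; they are unrelated to the Card class you are translating. With functools.total_ordering, defining __eq__ and __lt__ is enough for Python to supply the remaining comparison operators.

```python
from functools import total_ordering


@total_ordering
class Version:
    """Hypothetical class used only to illustrate comparisons and __str__."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def _key(self):
        # Tuples compare element by element, which gives the ordering we want.
        return (self.major, self.minor)

    def __eq__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        # Defining __eq__ disables the default __hash__, so provide one
        # that agrees with equality.
        return hash(self._key())

    def __str__(self):
        # Used by str() and print(), much like toString in Java.
        return f"{self.major}.{self.minor}"


old, new = Version(1, 9), Version(1, 10)
```

Here old < new and new >= old both work, even though only __lt__ and __eq__ were written by hand, and str(new) produces the text used by print.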

Part 4: Bigrams and Trigrams

Natural language processing (NLP) is an area of machine learning that involves understanding and generating human language. The goal for this PA is to take a tiny step into NLP by using n-grams for the purpose of generating random text in the style of particular authors or documents. Take a minute to read over the Wikipedia pages on n-grams and bigrams:

Your objective for this part of the PA is to complete the unfinished functions in text_gen.py so that they correspond to the provided docstrings. You can use the unit tests in test_text_gen.py to help test your code. You will probably want to add helper functions as needed. You may use AI tools for this portion of the assignment.

Text generation based on unigrams simply draws random words, each with probability proportional to its frequency in the training text. Once you’ve finished random_unigram_text, you should be able to generate text sequences like the following from the frequencies obtained from huck.txt*.

tell shot up finn man if unloads on wonderful go know swear s myself no to good in and no home times a pick inside janeero s warn misunderstood sometimes sweat wouldn sakes and i away didn i as next furnish two the it put his dick take scared nor i on said we a was i blankets up poor bull him and asked what mary old and you that night en and comfortable all from and re it running we a lonesome but bible he up hitched a t a i telling says yarter hot call can if then
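
The frequency-proportional sampling described above can be sketched with the standard library as follows. The function name sample_unigram_words, its parameters, and the tiny training sentence are all assumptions made for illustration. This is not the required interface of random_unigram_text, which is defined by the docstrings in text_gen.py.

```python
import random
from collections import Counter


def sample_unigram_words(words, num_words, rng=random):
    """Draw num_words words, each weighted by its count in words.

    Hypothetical helper for illustration only.
    """
    counts = Counter(words)
    vocabulary = list(counts)
    weights = [counts[word] for word in vocabulary]
    # random.choices samples with replacement, proportional to the weights.
    return rng.choices(vocabulary, weights=weights, k=num_words)


training = "the cat sat on the mat and the dog sat too".split()
sample = sample_unigram_words(training, 8, rng=random.Random(0))
print(" ".join(sample))
```

Because each word is drawn independently of its neighbors, the output has realistic word frequencies but no grammatical structure.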

Text generated using bigrams and trigrams should look significantly more English-like.
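
One way to see why is that a bigram model chooses each word conditioned on the word before it. The sketch below is one possible way to do that, under stated assumptions: the names count_bigrams and sample_bigram_words, the dictionary-of-Counters layout, and the training sentence are hypothetical and may not match the data structures that the docstrings in text_gen.py call for.

```python
import random
from collections import Counter, defaultdict


def count_bigrams(words):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        followers[first][second] += 1
    return followers


def sample_bigram_words(followers, start, num_words, rng=random):
    """Generate up to num_words words, each chosen given the previous word."""
    result = [start]
    while len(result) < num_words:
        options = followers.get(result[-1])
        if not options:
            break  # The last word never had a successor in the training text.
        choice = rng.choices(list(options), weights=list(options.values()))[0]
        result.append(choice)
    return result


training = "the cat sat on the mat and the dog sat too".split()
followers = count_bigrams(training)
text = sample_bigram_words(followers, "the", 6, rng=random.Random(0))
print(" ".join(text))
```

Every adjacent pair of words in the output appeared somewhere in the training text, which is what makes the result read more like English. A trigram model extends the same idea by conditioning on the previous two words.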

Submission and Rubric

You should submit the completed versions of IntroPython.ipynb, card.py, and text_gen.py through Canvas by the deadline.

This project will be graded on the following scale:

IntroPython.ipynb 20%
card.py 20%
text_gen.py 50%
Code in card.py and text_gen.py more or less conforms to PEP8 10%

Keep in mind that testing does not prove the correctness of code. It is possible to write incorrect code that passes the provided unit tests. As always, it is your responsibility to ensure that your code functions correctly.


* This is the complete text of Adventures of Huckleberry Finn by Mark Twain, obtained from Project Gutenberg.