PA5: Tries


Introduction

Your final assignment is to implement a trie in Python. This is a data structure that we have not discussed in class, so this assignment is meant to simulate the sorts of problems you might encounter as a developer: implementing a new data structure for a particular task.

The context for this assignment is a spell checker program. Spell checkers need to store a spelling dictionary in a way that makes it as fast as possible to see whether a word is in the dictionary. The data structure we will use to store the dictionary is called a trie, which is derived from the middle letters of the word retrieval. Tries were invented many years ago to speed information retrieval tasks.

A trie is a tree whose edges are labeled with letters. Paths from the root of the trie spell out words. The image below shows a trie.

Trie

The black nodes indicate that the path to this point in the trie spells a word. The words represented by this trie are a, an, and, ant, cab and cat. It is very fast to check whether a word is in a trie: one need only start from the root and trace out the word along the edges; if one ends up at a word-terminating node (a black node), then the word is in the trie, otherwise it is not. A trie is very good for spell checking: simply construct a trie representing all the words in a dictionary, and then check each word in a document to see if it is in the trie. The words not in the trie are (presumably) misspelled.

Tries can also efficiently support auto-completion. Given the prefix-based structure of the trie, it is straightforward to enumerate all of the words that begin with a given letter sequence: these will be represented by nodes that are descendants of the node corresponding to that prefix.

There are many ways to implement tries. We will use a simple implementation that uses a lot of space, but is fast. Each node of the trie has a Boolean field indicating whether a string ending there is a word (corresponding to whether the node is black or white in the diagram). Each node also contains a Python dictionary representing outgoing edges. The dictionary keys are individual letters, and the values are references to other trie nodes.

Assignment

You must create a Trie class that conforms to the following docstrings.

    class Trie
     |  Trie class that supports spell checking and auto-completion.
     |  
     |  Methods:
     |  
     |  __init__(self)
     |      Create a new empty Trie.
     |  
     |  __len__(self)
     |      Return the number of words stored in this Trie
     |  
     |  __contains__(self, word)
     |      Return True if the word is in the Trie, False otherwise.
     |  
     |  __iter__(self)
     |      Return an iterator that will allow iteration over all words in
     |      the trie in lexicographical (alphabetical) order.
     |  
     |  insert(self, word)
     |      Insert the indicated word into the Trie. 
     |      This method has no effect if the word is already in the Trie.
     |      No Return value.
     |  
     |  contains_prefix(self, prefix)
     |      Return true if the indicated string is a word in the Trie or is a 
     |      prefix of any word in the Trie. Return False otherwise.
     |  
     |  prefix_iter(self, prefix)
     |      Return an iterator that will allow iteration over all of the
     |      words in the trie that have the indicated prefix, in
     |      lexicographical (alphabetical) order.
     |

You should also create a node class to represent Trie nodes. All of your code must be submitted in a file named trie.py.

We are providing an example spell-checking program that uses code conforming to the specification above. To run the program, download auto_complete.zip and unzip the two files (a .py file and a dictionary text file) into the same folder as your trie.py, then run auto_complete.py. Here is an example screenshot of the program:

Autocomplete GUI

Acknowledgements

This project is based on a Go programming assignment designed by Dr. Chris Fox. Much of the wording above is borrowed from that project.

Submission

DEADLINE: Friday, December 5, 2014 at 23:59 (11:59PM) ET

You should submit a single file called trie.py through the Web-CAT online submission system before the deadline. As always, your code should conform to the CS240 Python Coding Conventions. Make sure that the file contains a header with your name and the JMU Honor Code statement.

Refer to the course syllabus for the policy on late submissions. These penalties will be applied manually while grading. Web-CAT does not apply them correctly; thus, that behavior has been disabled.