PA5: Tries
Introduction
Your final assignment is to implement a trie in Python. This is a data structure that we have not discussed in class, so this assignment is meant to simulate the sorts of problems you might encounter as a developer: implementing a new data structure for a particular task.
The context for this assignment is a spell checker program. Spell checkers need to store a spelling dictionary in a way that makes it as fast as possible to see whether a word is in the dictionary. The data structure we will use to store the dictionary is called a trie, which is derived from the middle letters of the word retrieval. Tries were invented many years ago to speed information retrieval tasks.
A trie is a tree whose edges are labeled with letters. Paths from the root of the trie spell out words. The image below shows a trie.
The black nodes indicate that the path to this point in the trie spells a word. The words represented by this trie are a, an, and, ant, cab and cat. It is very fast to check whether a word is in a trie: one need only start from the root and trace out the word along the edges; if one ends up at a word-terminating node (a black node), then the word is in the trie, otherwise it is not. A trie is very good for spell checking: simply construct a trie representing all the words in a dictionary, and then check each word in a document to see if it is in the trie. The words not in the trie are (presumably) misspelled.
Tries can also efficiently support auto-completion. Given the prefix-based structure of the trie, it is straightforward to enumerate all of the words that begin with a given letter sequence: these will be represented by nodes that are descendants of the node corresponding to that prefix.
There are many ways to implement tries. We will use a simple implementation that uses a lot of space, but is fast. Each node of the trie has a Boolean field indicating whether a string ending there is a word (corresponding to whether the node is black or white in the diagram). Each node also contains a Python dictionary representing outgoing edges. The dictionary keys are individual letters, and the values are references to other trie nodes.
Assignment
You must create a Trie class that conforms to the following docstrings.
class Trie | Trie class that supports spell checking and auto-completion. | | Methods: | | __init__(self) | Create a new empty Trie. | | __len__(self) | Return the number of words stored in this Trie | | __contains__(self, word) | Return True if the word is in the Trie, False otherwise. | | __iter__(self) | Return an iterator that will allow iteration over all words in | the trie in lexicographical (alphabetical) order. | | insert(self, word) | Insert the indicated word into the Trie. | This method has no effect if the word is already in the Trie. | No Return value. | | contains_prefix(self, prefix) | Return true if the indicated string is a word in the Trie or is a | prefix of any word in the Trie. Return False otherwise. | | prefix_iter(self, prefix) | Return an iterator that will allow iteration over all of the | words in the trie that have the indicated prefix, in | lexicographical (alphabetical) order. |
You should also create a node class to represent Trie nodes. All of your
code must be submitted in a file named trie.py
.
We are providing an example spell-checking program that uses code conforming
to the specification above. To run the program, download auto_complete.zip and unzip the two files (a
.py file and a dictionary text file) into the same folder as your
trie.py
, then run auto_complete.py
. Here is an example
screenshot of the program:
Acknowledgements
This project is based on a Go programming assignment designed by Dr. Chris Fox. Much of the wording above is borrowed from that project.
Submission
DEADLINE: Friday, December 5, 2014 at 23:59 (11:59PM) ET
You should submit a single file called trie.py
through the Web-CAT online submission system before the
deadline. As always, your code should conform to the CS240 Python Coding
Conventions. Make sure that the file contains a header with your
name and the JMU Honor Code statement.
Refer to the course syllabus for the policy on late submissions. These penalties will be applied manually while grading. Web-CAT does not apply them correctly; thus, that behavior has been disabled.