Data Structures And Algorithms: PA #5

James Madison University, Fall 2012

Programming Assignment #5: HashMap

Introduction

The goal for this assignment is to complete a partially implemented Python HashMap class, and to run some experiments to select an appropriate maximum load value. Your class will use the division method for hashing along with prime-number-sized hash tables. Collision resolution will be handled using separate chaining.

Map ADT

Here is the PyDoc documentation for the HashMap class that you will implement. The file hashmap.py contains a partial implementation. Methods in green are already implemented. It will be your job to complete the methods in red.

class HashMap(__builtin__.object)
 |  The HashMap associates stored values with unique keys.  This
 |  class provides a subset of the functionality of Python's built in
 |  dictionary class.  Key objects must implement the __hash__ function. 
 |  
 |  Hashing is based on the division method with prime-number-sized tables.
 |  Collisions are handled using separate chaining.
 |  
 |  MAX_LOAD is a public class attribute that determines the maximum load
 |  factor for the hash table.  The Default value is 1.0
 |  
 |  Methods defined here:
 |  
 |   __contains__(self, key)
 |      H.__contains__(k) -> True if H has a key k, else False
 |  
 |  __delitem__(self, key)
 |      x.__delitem__(y) <==> del x[y]
 |  
 |  __eq__(self, rhs)
 |      x.__eq__(y) <==> x==y
 |  
 |  __getitem__(self, key)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __init__(self)
 |      HashMap() -> new empty HashMap
 |  
 |  __iter__(self)
 |      x.__iter__() <==> iter(x)
 |  
 |  __len__(self)
 |      x.__len__() <==> len(x)
 |  
 |  __ne__(self, rhs)
 |      x.__ne__(y) <==> x!=y
 |  
 |  __repr__(self)
 |      x.__repr__() <==> repr(x)
 |  
 |  __setitem__(self, key, value)
 |      x.__setitem__(i, y) <==> x[i]=y
 |  
 |  clear(self)
 |      H.clear() -> None.  Remove all items from H.
 |  
 |  copy(self)
 |      H.copy() -> a shallow copy of H
 |  
 |  get(self, k, d=None)
 |      H.get(k[,d]) -> H[k] if k in H, else d.  d defaults to None.
 |  
 |  items(self)
 |      H.items() -> list of H's (key, value) pairs, as 2-tuples
 |  
 |  iteritems(self)
 |      H.iteritems() -> an iterator over the (key, value) items of H
 |  
 |  itervalues(self)
 |      H.itervalues() -> an iterator over the values of H
 |  
 |  keys(self)
 |      H.keys() -> list of H's keys
 |  
 |  pop(self, key)
 |      H.pop(k[,d]) -> v, remove specified key and return the corresponding value.
 |      If key is not found, d is returned if given, otherwise KeyError is raised
 |  
 |  popitem(self)
 |      H.popitem() -> (k, v), remove and return some (key, value) pair as a
 |      2-tuple; but raise KeyError if H is empty.
 |  
 |  values(self)
 |      H.values() -> list of H's values
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  MAX_LOAD = 1.0

Notice that the documentation above is minimal. For the sake of consistency, these are almost exactly the same as the pydoc comments for Python's built-in dictionary class. With some minor exceptions (noted below) the behavior of the HashMap methods above should be exactly the same as the corresponding Python dictionary methods. If you need clarification for any of these methods I suggest that you refer to the official Python dictionary documentation. If anything is still unclear, I suggest opening up a Python interpreter and experimenting with the Python dictionary methods directly.

There will be some minor differences in the functionality of the HashMap class in comparison to Python dictionaries:

The constructor for the Python dictionary class is able to take arguments specifying the initial contents of the dictionary. The HashMap constructor takes no arguments and always creates an empty HashMap.
For each of the methods above that involve iterating through the collection, the order that items are visited or returned is likely to differ between the HashMap and the Python dictionary. The following statement is from the on-line documentation for the Python dictionary class:
"CPython implementation detail: Keys and values are listed in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary's history of insertions and deletions."

Since the HashMap is based on a different implementation, it will visit items in a different arbitrary, non-random order.

Implementation Notes

Typically, hash tables use arrays as the underlying storage mechanism. Since Python has no array type, we need an alternative. We could use the turtle array class from PA#2, but the resulting implementation would be slow: there is a lot of overhead involved in the memory management and drawing. Furthermore, the visualization functionality for that Array class wouldn't be very helpful here because the actual keys and values are not stored directly in the Array.

Instead of arrays, this implementation will use Python lists for storage. This means that you need to be careful not to use list methods outside of those that would be provided by an array. For example, if you find yourself using the list iterator, or the contains method, you are probably doing something wrong. The danger here is that many list methods require O(n) time, while nearly all HashMap operations should be O(1).

Testing

You do not need to develop unit tests for this assignment. As always, careful testing is likely to make your life easier in the long run.

Selecting a Load Factor

The following script can be used to benchmark the performance of your HashMap class for increasing values of MAX_LOAD: max_load_experiment.py . Prepare a short (1 page) document that includes a graph of your timing results along with answers to the following questions:

Why is the timing curve shaped the way it is?
Given this data, what do you think is a good choice for MAX_LOAD?

Handing In

You should submit two files through blackboard before the deadline: your completed hashmap.py file, and a .pdf file containing your experimental results. As always, your code should conform to the CS240 Python Coding Conventions.