Hashing Lab

Objectives

The goal of this lab is to gain experience with the implementation of basic hashing structures.

Introduction

Download the lab starter files here:

Don't forget to download and customize your machine-specific makefile.

In this lab, you will implement a very small subset of the PA2/PA3 Set functionality using a hash table as the backing store instead of a dynamic array or skip list.

IMPORTANT: Until now, we have talked about hashing in the context of the Map ADT, but recall that a Set is basically equivalent to a Map in which the values don't matter. Thus, for the purposes of this assignment, "values" in a Set are equivalent to the "keys" in a Map from our previous hashing discussions.

Your hash table should use linear probing to resolve resolutions, and it should auto-expand whenever the load factor exceeds 0.5.

Hash Sets and Hash Tables

In hash_set.h, you will find new data types for hash-based sets.

typedef struct bucket {
    bool is_empty;
    data_t value;
    struct bucket *next;    // for chaining only
} bucket_t;

typedef struct hash_set {
    bucket_t *table;
    size_t capacity;
    size_t size;
} hash_set_t;

For the purposes of this lab, a "hash table" is just an array of bucket_t structs, which may or may not contain actual data values. If the is_empty field is false for a particular bucket, then that bucket stores a data element in the value field.

Unless you attempt the "challenge" portion of this lab, you shouldn't need to use the next pointers in the bucket_t struct.

We have also provided two helper functions for you to use:

  • size_t hash(data_t value)

    Calculate a hash code for the given value.

  • bucket_t* malloc_buckets(size_t capacity)

    Allocate space for a new hash table (array of buckets) on the heap.

Required Functions

For this lab, you will only need to implement two functions of the Set interface. Both functions should run in O(1) time.

  • hash_set_add(hash_set_t *set, data_t value)

    Adds an element to a set. Don't forget to rehash if the maximum load factor is exceeded.

  • hash_set_contains(hash_set_t *set, data_t value)

    Returns true if the element is a member of the set; false otherwise.

You will probably find the required functions much easier to write if you first implement a couple of helper functions (prototypes provided) that work on hash tables (bucket_t arrays) instead of the full set struct. Obviously, these helper functions should also run in O(1) time.

  • bucket_t* probe(bucket_t table[], size_t capacity, data_t value)

    Probe for the given value in a hash table. This function should return the bucket where the value belongs. Either the value will already be there, or the bucket will be empty.

  • void insert(bucket_t table[], size_t capacity, data_t value)

    Insert a new value into a hash table. This function itself will be easier to implement if you write probe first and then use it.

Evaluation

We have provided a basic testsuite that inserts a few numbers and performs membership testing. Your solution should pass this test. Sample output:

Running suite(s): Default
[ _, _, _, _ ]
[ _, _, 1, _ ]
[ 3, _, 1, _ ]
[ 3, _, 5, _, _, _, 1, _ ]
[ 3, _, 5, _, 7, _, 1, _ ]
[ 3, _, _, _, 7, _, 1, _, _, _, 5, _, _, _, 9, _ ]
contains 1? yes
contains 2? no
contains 4? no
contains 5? yes
contains 8? no
contains 9? yes
100%: Checks: 1, Failures: 0, Errors: 0
SUCCESS: All current tests passed!

The underscores in the hash table printouts represent empty buckets.

For performance evaluation, we have included a modified version of the comparison tests from PA3 (main.c). If you put a working implementation of PA2 (array_set.c) and PA3 (skip_set.c) in the same folder, you should be able to build and run main.

NOTE: You may receive compiler warnings about the redeclaration of the data_t typedef. These are safe to ignore as long as all of the declarations match.

If your hash set is working correctly, it should be an order of magnitude faster than even the skip set implementation from PA3. Here is some example output:

ARRAY ADD           10000   0.009480
SKIP  ADD           10000   0.002700
HASH  ADD           10000   0.000634
ARRAY CONTAINS      10000   0.015811
SKIP  CONTAINS      10000   0.001897
HASH  CONTAINS      10000   0.000243

Challenge: Chaining (Optional)

After you have the probing implementation working, save a copy of the file and then change your implementation to use chaining (open addressing) rather than linear probing.

For this part, you may wish to raise the maximum load factor to 0.9 or 1.0.