PA 5

kD-Tree PA

NOTE: Part of the objective for this project is to mimic what it’s like to work on a software solution at a company. Typically, you will not be handed pseudo-code for every operation and subroutine you are supposed to program, but will instead be given a specification for a program–what inputs it should take and what outputs it should produce. You will then have to go figure out how to make it work.

For this project write-up, we are doing a bit of a hybrid approach. We are giving you some information on the kD-tree data structure (which you will be coding), but not everything you will need to know to code it. You are responsible for finding good additional information as needed in order to complete the project. Piazza discussions are especially encouraged.

For this reason, it is also important to read the entire project description and the provided code before you start working. Do not write a single line of code until you are sure you understand what is expected of you and how it fits into the framework of the code we’ve already given you. There is a lot to read, so make sure you start early!

Learning Objectives:

Getting Started

First download and decompress one of the starter files:

This PA has two major parts, and so these archives contain a large number of files. You will need to create Makefiles for both parts. Here is an overview of all the files:

The bst/ folder

This folder contains the bst.h/c files and the files needed for the bst/main tester program.

The kdtree/ folder

COMING SOON: The rangefighter/ folder

We will release this to you in the week after Thanksgiving. It contains the actual game code for rangefighter, which uses your kD-tree implementation as its main data structure.

Part 1: Range Finding in Binary Search Trees

You have seen in class how binary search trees (BSTs) are useful for supporting a find(x) operation that determines whether a value x is stored in the tree. Consider the following tree:

       8
     /   \
    4    12
   / \   / \
  2   6 10 14

Let’s recall what happens when we execute the command find(6). First find(x) looks at the root node to check if the value at the root node is equal to what we are looking for. It is not (6 != 8). Thus we need to recurse on one or the other of the subtrees. Because we have a BST, since $6 < 8$, if 6 is to be found in the tree, it will have to be in the left subtree of 8. We thus search recursively on the left-subtree of 8 for 6.

Now let’s generalize this idea a little. Instead of finding a specific value $x$, what if we wanted to find any value in a given range? For instance, what if we wanted to retrieve all values stored in the tree between 3 and 11? How would we implement this operation?

This operation is called range searching. We do not want to determine whether a particular value exists or not–as in the case of find(x)–but instead we want to find all values in the tree within the given range $[3, 11]$.

To support this operation, we perform a recursive procedure that is very similar to the find(x) operation. Again, we start at the root node 8 and ask the question, “is 8 in our range $[3, 11]$?” The answer is, of course, yes, so we output 8. We then need to determine whether to continue searching recursively in the left subtree, the right-subtree, or both.

This is the main difference between the find operation and the range_search operation: in the find operation, there is always only one subtree we need to search in: either the left tree or the right. Now, however, there are three possible cases: 1) the range we are searching for can only possibly contain points from the left subtree, 2) it can only possibly contain points in the right subtree, or 3) it may possibly contain points from both. The third case arises precisely when the value of the current node is contained within the range. So in our case, because 8 itself falls inside the range $[3, 11]$, both the left and the right subtrees may contain points that fall within the range. This means that we need to continue the search recursively in both subtrees.

So what does that look like? Well, let’s first recurse on the left subtree. Since 4 is in $[3, 11]$ we output the 4 and again we have to recurse on both the subtrees of 4. We first recurse on its left, 2, which is not in the range, so we do not output it. Since 2 has no children execution returns to 4 and we recurse down the right subtree. Here, 6 is in the range $[3, 11]$, so we output 6. (If you are following along at home, we have now produced 8, 4, and 6 as output.) We have now finished with all branches of 4 and execution returns back to the recursive call at 8.

We now continue searching on the right child, 12. In this case 12 is not in the range $[3, 11]$. In fact, all possible values in the range $[3, 11]$ are less than 12. This means that only the left-subtree of 12 can possibly contain any values that fall within $[3, 11]$. So instead of recursing on both the left and right subtrees, this time we will only continue our search in the left subtree of 12. When we recurse there, we find that 10 is in our range, and so we output it. We have now exhausted all subtrees we needed to search and the process terminates, having outputted 8, 4, 6, and 10, which are exactly those nodes of the tree with values in the range $[3, 11]$.

Note that these values were not retrieved in order; however, we could easily modify this process to retrieve the values in order by changing the range_search function to use an inorder traversal instead of a preorder traversal.

Building a BST

Recall from class that the efficiency of operations on trees is often bound up with the height of the tree, and that is certainly the case here. In fact, the running time of the range search operation can be shown to be $O(\log~n + k)$ on a BST with $n$ nodes, where $k$ denotes the number of elements in the tree that falls within the range. This is a little different than running times you have seen before, because the running time $O(\log~n + k)$ is related to both the input size $n$ and the output size $k$. This is called an output-sensitive run time.

To make sure that we get the $\log~n$ term in the running time, our tree needs to be balanced. Suppose we have an array items of $n$ integers we want to put in a BST. How can we build such a balanced tree? Here’s a recursive idea. First select the median of the input array and make this the root node. Then build the left subtree from all the items in the array that are less than the median, and build the right subtree from all the items in the array that are greater than the median. We can do this reasonably efficiently by presorting the array. The steps look like this:

  1. Presort the items array to obtain sorted-items.
  2. Select the median (sorted-items[n/2]) and make it the value of the root node.
  3. Recursively create the left subtree from sorted-items[0:n/2-1].
  4. Recursively create the right subtree from sorted-items[n/2+1:n].

Task 1: Implement a Range Search BST

Your first task is to implement a binary search tree (BST) structure equipped with the following operations found in the bst.c file in the bst/ folder in the starter code:

And of course, the usual memory management functions:

NOTE: You will need a Makefile for the bst directory that builds the main program from the files found in the bst/ folder.

The ./main test program we have included should output the following to the console when run:

find(4)--should be true: true
find(10)--should be false: false
Items in [4, 12]: 7 9 12 4 5

Furthermore, it should produce a turtle graphics image named output.bmp that looks like this:

Output of bst main

Part 2: kD-Trees

A BST equipped with a range search operation as above is useful for many different applications. For instance, if we had a list of employees and wanted to answer queries like, “how many employees do we have between the ages of 20 and 25?” a BST would be a perfect data structure for supporting this type of query. However, what if we wanted instead to find all employees between the ages of 20 and 25 who make between 30 and 50 thousand dollars a year? A BST is not equipped to deal with this operation, because a BST only stores 1-dimensional data. Our query, however, is 2-dimensional, we want to query both an age range and a salary range. In order to handle such queries we need to generalize the binary search tree to higher dimensional data. We call this the kD-tree (where $k$ is the dimension of the data).

For this PA you will not implement a full-blown kD-tree, but instead will implement a 2D-tree for handling 2-dimensional data.

Now our data, instead of being a list of numbers, like 8, 4, 12, 2, 6, 10, 14, is a list of 2D points. In the age/salary example, we could have the x-coordinates represent age and the y-coordinates represent salaries. So we might have, (23, 25000), (20, 55000), (50, 55000), (27, 95000), (24, 32000), (40, 28000), (35, 50000) as a list of employee age-salary “points”. How can we store this using a tree? We could simply choose one of the dimensions, say the $x$ dimension and simply build a normal BST on this dimension. Something like:

                (24, 32000)
                /         \
     (23, 25000)            (40, 28000)
     /         \                /      \
(20, 55000) (27, 95000)  (35, 50000)   (50, 55000)

The problem with this approach is that we can still only support range search queries on the $x$-dimension.

Recall that we build a BST by recursively selecting the median, and then recursively building the left-subtree from all points less than the median, and the right-subtree from all points greater than the median. The key idea behind a kD-tree is to alternate which dimension we choose the median from each time we recurse.

In other words, if we are building a kD-tree age-salary points above, at the top level we would use the median of the age, we would then recurse and for the next level use the median of the salaries. At the following level we would again use the median of the age, and so on. The resulting tree looks like this:

                (24, 32000)  <-- median by age
                /         \
     (20, 55000)            (35, 50000)    <-- medians by salary
     /         \                /      \
(23, 25000) (27, 95000)  (40, 28000)  (50, 55000)

For higher dimensional data, you continue with the same idea, every time you recurse in your build tree, just cycle through the available dimensions.

Range searching a kD-tree

Now that we have seen how to generalize a BST to 2-dimensional data, let’s discuss the actual range searching. In 1 dimensional data, a range is just a range between two numbers, as we have seen. Something like $[3, 11]$ which is all (1D) points lying between 3 and 11. There is no one right way to generalize this idea to 2-dimensional data. In fact, here are two possibilities that are equally natural:

Box Range

Circular range

Motivation: Range Fighter

Your second task is to implement a kD-tree for 2-dimensional points. To make things a little more interesting, you are going to be implementing this kD-tree to support a computer game based on the old Atari Asteroids game. The game is called “range fighter”.

The player pilots a space ship through a debris field using the keys W, A, and D to move the ship forwards, left and right, respectively. The ship is equipped with two weapons. The first is a box range weapon that is activated by holding the space bar down. When the player hits the space bar, the box range weapon activates and a box is drawn from the initial position of the player when the space bar was pressed to the position of the player when the space bar was released. All of the debris within the box is destroyed when the space bar is released. The second weapon is a panic weapon that takes a long time to recharge but immediately destroys all debris within a certain distance of the player. It is activated by hitting the ‘e’ key.

Here is a screenshot of the game:

Game Screenshot

Your assignment is to implement a kD-tree for storing the debris field that supports both circular and box range searching operations. Rather than have you get started on a full-blown game (which might make debugging difficult!) we have provided a main program that uses turtle graphics to draw an image that looks like a screenshot from the game. You should first get your kd-tree working with this program, since it will be easier to debug. Once your kD-tree is working, you will be able to copy and paste your kdtree.c file into the game program code folder which we will give you the week after Thanksgiving.

Task 2: kD-Tree Implementation

All of the functions you need to implement are declared in the kdtree.h file and should be defined in kdtree.c. You may write helper functions in kdtree.c and declare them at the top of the file (but NOT in kdtree.h or your code won’t compile when you submit it). For this assignment we have provided working iterator code, but you will not need to use iterators for any of this assignment. The iterators are there so that the main game application can use them. You need to implement the remaining functions in the kdtree library:

Each piece of debris is represented by a debris_t object, which is defined in helpers.h in the kdtree/ folder. A debris_t object stores a 2D point named .pos for its coordinates and also stores a boolean value .active for whether the debris is active in the debris field or has been destroyed. Your kD-tree will store debris_t objects.

The helpers.h file also contains a few helpful types, such as bbox_t, which is used to define an axis-aligned bounding box for all points with x-coordinate between .minx and maxx and all points with y-coordinate between .miny and .maxy. It also defines point_t, which is used to specify a single 2D point.

Suggested Progression

First create a Makefile in kdtree/, get the code compiling, and run ./main. You should get the following output rangefighter.bmp:

Initial Output

Next, implement the memory management and build-tree functions. Once these are working your output rangefighter.bmp should look like:

Tree is built!

Next, implement the box range searching. Once this is working you should see this:

Box search is built!

Notice how all the debris within the green box is now highlighted green. The pieces of debris are randomly-generated, so the field will look slightly different every time you run the program, but all debris inside the box should be green regardless of their exact location.

Finally, get the circular range search working and the debris within the blue circle should appear blue:

Everything is working, yay!

If you are having trouble debugging your code, you may wish to enable the call to draw_kdtree in main. This should give you a graphical representation of your kD-tree:

kD-trees are cool!

HINT: For debugging purposes, it can also be useful to temporarily lower the number of pieces of debris. You can do this by modifying debris_count in the create_random_debris_tree function. Here is what it looks like with only 16 pieces of debris:

Fewer pieces of debris

Final Submission

DUE DATE: Friday, December 11, 2015 at 23:59 EST (11:59pm)

Submit ONLY your completed bst.c and kdtree.c files on Canvas using the appropriate assignment submission. Make sure the files contain a comment field at the top with your name and a statement of compliance with the JMU honor code.

IMPORTANT: There should be NO memory leaks in any of the code you submit for this assignment (both parts). You should test this using valgrind on both main programs.

Please check your code carefully and make sure that it complies with the project specification. In addition, make sure that your code adheres to general good coding style guidelines. Fix any spelling mistakes, incorrect code formatting, and inconsistent whitespace before submitting. Make sure you include any appropriate documentation.