PA4: BatchGeo

Programming Assignment 4

BatchGeo

Learning Objectives

This assignment is designed to help you learn several things. First, it will help you learn how to use type-safe collections. Second, it will help you learn about the capabilities of different kinds of collections (and when each is appropriate). Third, it will help you learn more about file I/O. Finally, it well help you understand the importance of testing and debugging.

Overview

Nearby is a (fictitious) company that develops software in three closely related areas: personal navigation systems, en-route and mobile commerce, and location-based services. You have been contracted to develop the classes they will need to complete an application named BatchGeo.

BatchGeo is an application that can be used to geocode street addresses (i.e., convert street addresses to longitude/latitude coordinates).

Background

Before you can start implementing the system, you need to learn a little bit about the terminology and file formats used by Nearby.

Acronyms

TIGER
Topologically Integrated Geographic Encoding and Referencing

Definitions

Address
A street name and a house/lot number.
Equator
The imaginary line on the surface of and encircling the Earth that is equidistant to the North and South Poles.
High End of the Segment
The end of the segment with the largest house/lot number
House/Lot Number
A number (for our purposes, an integer) ascribed to a particular location (often a vacant lot, house, or business) on a segment.
Interpolation Parameter for a House Number
Given a low house number, L, a high house number, H, and a house number of interest, A, in [L,H], the interpolation parameter for A is defined as m=(A-L)/(H-L), where the house numbers are treated like real numbers (not integers).
Latitude
The angular distance (in degrees) north (positive) or south (negative) of the Equator.
Line of Latitude (Parallel)
An imaginary line on the surface of and encircling the Earth drawn parallel to the Equator.
Line of Longitude (Meridian)
An imaginary great circle on the surface of the Earth connecting the North and South Poles and perpendicular to the Equator.
Linear Interpolation of Two Numbers
Given two numbers, a and b, and an interpolation parameter, m ∈ [0,1], the linear interpolation of a and b is denoted by v and defined as v=(1-m) ⋅ a + mb.
Linear Interpolation of Two Points
Given two points, a = < ax, ay > and b = < bx, by >, and an interpolation parameter, m ∈ [0,1], the linear interpolation of a and b is denoted by v and defined as v = < vx, vy > where vx=(1-m) ⋅ ax + mbx and vy=(1-m) ⋅ ay + mby .
Location
A point in which the coordinates are a longitude and latitude.
Longitude
The angular distance (in degrees) east (negative) or west (positive) of the Prime (Greenwich) Meridian.
Low End of the Segment
The end of the segment with the smallest house/lot number.
Point
A (2-dimensional) point is an ordered pair. The point a is typically written as the ordered pair < ax, ay >
Prime (Greenwich) Meridian
The line of longitude designated to be at 0 degrees.
Segment
A portion of a street. For example, Main Street might consist of several segments, one from Oak Street to Elm Street, one from Elm Street to Pine Street, and one from Pine Street to Broad Street. A segment has a low end house/lot number and high end house/lot number.
Segment Base
A “database” (i.e., group or collection) containing all of the segments of interest.
Street
A named roadway that consists of one or more segments (that define the street). For example, Main Street might consist of several segments, one from Oak Street to Elm Street, one from Elm Street to Pine Street, and one from Pine Street to Broad Street. A street has a name (e.g., “Main Street”).
Street Base
A “database” (i.e., group or collection) containing all of the streets of interest.

Data Files

The data files used in BatchGeo are “human readable” text files. Each line in a data file is called a record, and each record may contain one or more fields (which are, conceptually, columns or components of the record). Fields in a record are separated by a delimiter character.

BatchGeo currently uses two kinds of data files, .seg files and .str files, each of which is described below. A geographic region will consist of two files that have the same name and these two extensions. So, for example, the data files for the JMU area are named jmuarea.seg and jmuarea.str.

.seg Files

.seg files contain information about segments, as they are defined by the U.S. Census Bureau’s TIGER system (though using a slightly different format).

The Format of .seg Files
  • Records in .seg files use the tab character (i.e., '\t') as the delimiter

  • .seg files contain a variable number of records

  • .seg files are terminated by an end-of-stream (sometimes called end-of-file) character

  • Each record contains nine fields as follows:

  1. The segment ID (String)
  2. The longitude at the low end of the segment (double)
  3. The latitude at the low end of the segment (double)
  4. The longitude at the high end of the segment (double)
  5. The latitude at the high end of the segment (double)
  6. The length in KM (double)
  7. The TIGER road type code (String)
  8. The house/lot number at the low end of the segment (int)
  9. The house/lot number at the high end of the segment (int)
An Example .seg File Containing Seven Segments
Example .seg File
A Description of One Record in a .seg File
Description of a .seg Record

.str Files

.str files contain information about streets.

The Format of .str Files
  • Records in .str files use the tab character (i.e., '\t') as the delimiter (Note: A tab character is used because street names can contain commas)

  • .str files contain a variable number of records

  • .str files are terminated by an end-of-stream (sometimes called an end-of-file) character

  • .str files contain two kinds of records, name records and component records

  • Each name record contains two fields as follows:

    1. The street name (String)
    2. The number of segments in the street (int)
  • Each component record contains one field, the segment ID (String)

An Example .str File Containing Four Streets
Example .str File
A Description of One Street in a .str File
Description of a .str Street

A Geocoding Example

Using the example data files above, we can geocode the address 1970 Creek Ct (which consists of the int house number 1970 and the street name "Creek Ct").

From the .str file we see that Creek Ct consists of a single segment with ID 75739596. From the .seg file we see that the low end number on this segment is 1900 and the high end number on this segment is 1999. So, treating the house numbers as double values (not int values), the interpolation parameter for house/lot number 1970 is given by: m = (1970 - 1900) / (1999 - 1900) = 70/99 ≈ 0.707070

From the .seg file we see that the low end location is <-78.737162, +38.380782> and the high end location is <-78.736349, +38.381549>. Using m to interpolate between these two points yields:

(1.0 - 0.707070) · -78.73716 + 0.707070 · -78.736349

= 0.292930 · -78.73716 + 0.707070 · -78.736349

= -23.064476 + -55.672110

= -78.736586

and

(1.0 - 0.707070) · 38.380782 + 0.707070 · -38.381549

= 0.292930 · 38.380782 + 0.707070 · 38.381549

= 11.242882 + 27.138442

= 38.381324

So, 1970 Creek Ct is geocoded to <-78.736586, 38.381324>, where, using the Nearby convention, the longitude is listed before the latitude.

Note that, in this example, the interpolation parameter was truncated at 0.707070. So, you may get slightly different answers if you use more digits of accuracy. Your code must use the full accuracy of double values.

The Components to be Written

The components to be written are illustrated in black in the following UML class diagram. (The components in jade green are part of the Java API.)

Class Diagram

Note that attributes and methods that are inherited are not made explicit in this diagram. If a method appears in both a base class and a derived class it means that the method in the derived class overrides the method in the base class. Note also that methods that are in an interface are not made explicit in classes that realize/implement that interface.

In addition to the specifications in the UML class diagram, your classes must conform to the following specifications.

The Location Class

The toString() method must return the longitude and latitude formatted using a format String of "%+9.6f,%+9.6f".

The OnSegmentLocation Class

An OnSegmentLocation is a Location that knows the ID of the Segment it is on.

This class implements the Comparable interface so that the sort() method in the Collections class can be used to sort collections of such objects based on their segment IDs (which are String objects that, themselves, implement the Comparable interface).

The Segment Class

The default constructor must create a Segment object with default values for all of the attributes.

The contains() method must return true if the owning Segment contains the given house number, and must return false otherwise.

The interpolate() method must geocode the given house number using linear interpolation. (Hint: Be careful to use the longitude/latitude for the low end of the segment as a and the longitude/latitude of the high end of the segemnt as b.) It must return null if the Segment does not contain the house/lot number.

the fromTSV() method must set the value of all of the attributes from a tab-separated-value representation (as in .seg files). It may need to throw one or more exceptions (e.g., if the String has the wrong number of fields, if a numeric field is non-numeric, etc.). It must return the ID of the segment.

The Street Class

The constructor is passed the name of the Street and must allocate memory for the segments attribute.

The add() method must add the given Segment object to the segments attribute.

The geocode() method must return a List of OnSegmentLocation objects that contain the house number of interest. This List need not be ordered in any particular way. The returned List may be empty but this method must not return null. (Hint: This method should use the interpolate() method in the Segment class.)

The MapReader Class

The readSegments() method must read a collection of Segment objects from a .seg file. Any records in the .seg file that are incorrectly formatted (e.g., contain the wrong number of fields, or contain non-numeric values for fields that must be numeric) must be ignored. The Map that is returned must use the ID of the Segment as they key. (Hint: The readLine() method in the BufferedReader class will return null when the end-of-stream/end-of-file character has been reached. The Scanner class has hasNext() and hasNextLine() methods that can be used to determine if the end-of-stream/end-of-file character has been reached.)

The readStreets() method must read a collection of Street objects from a .str file, using the collection of Segment objects that it is passed. The Map that is returned must use the name of the Street as the key.

The Geocoder Class

The Constructor

The constructor will be passed the name (with no extension) of the .seg and .str files to use (with no path information, meaning the file can be found in the working directory). So, for example, for the JMU area this method will be passed the String "jmuarea" (not "jmuarea.seg" or "jmuarea.str", and not "/cs159/eclipse/pa4/src/testing/jmuarea").

The constructor must use the MapReader class to read the .seg and .str files (from the working directory) and store the information they contain in the Map attribute named streets. Your implementation must not read either data file more than once per execution. If it does, you will receive a grade of 0. In other words, your implementation must read the file once (essentially at the start of the execution), store all of the information needed in the Map named streets, and then use that Map to geocode the addresses entered by the user. (Hint: There is no reason to keep the segments after the constructor returns. They are only needed to read the streets.)

The fromAddress(String name, int number) Method

The fromAddress() method is passed an address (i.e., a street name and a house/lot number) to geocode. The String parameter will contain the street name. The int parameter will contain the house/lot number on the street.

The fromAddress() method must return a List of OnSegmentLocation objects that contains all of the longitude/latitude pairs that correspond to the given address (and nothing else).

  1. If the address can’t be geocoded, the List must contain no elements (but must be non-null).

  2. If the address can be geocoded, the calculated longitude and latitude must be a linear interpolation of the longitudes and latitudes at the low and high end of the relevant segment. If the address can be geocoded using more than one segment (i.e., there is more than one segment with the given street name that contains the given house/lot number), the List must be sorted in ascending order by segment ID. (Hint: This method should use the geocode() method in the Street class.)

Testing

Nearby has provided information for two areas that you can use for testing, each of which consists of two files. They are available for download from the following URLs:

and

You must put these files in the project directory so that your Gecoder will find them in the working directory when the code is executed. (See the section on “Help with Data Files” below.)

Obviously, for all of your tests, you will need a way to calculate the expected results (so that you can compare them with the actual results). While you can use a calculator for this purpose, it’s probably better to use a spreadsheet. Nearby has provided one that is available for download at:

tiny.xls

It contains one row for each segment in the tiny map. If you enter an address in the appropriate row of the “Address” column, it will calculate the interpolation parameter for that address, as well as the longitude and latitude for that address. For example, entering the address 1970 in cell L5 will yield the same results (in cell N5, O5 and P5) as in the example above.

Unit Testing

You should by now realize that you should test individual components before testing the entire system. Hence, you should write JUnit tests for all of the classes your write as you complete them (or as you complete indiviual methods in them). However, you are not required to do so. In other words, you will not be submitting your tests, you should write them because you know that doing so is part of a good process.

Nearby has provided a few unit tests to help you get started. They are available for download at:

These tests may need to be modified/corrected to be useful, and certainly need to be expanded.

Note that it may not be possible to completely cover all of the statements in the classes your write because it may not be possible to generate all of the exceptions that might be thrown. Nonetheless, you should try to cover as much of the code as possible.

Integration Testing

After you are confident that all of the methods in all of your classes pass your unit tests, you should perform integration testing (in which you ensure that they all work correctly together). You should do this by creating JUnit tests for the Geocoder class (which, directly or indirectly, uses all of the other classes). Note that there is no main class included in this assignment. So, if you do not want to use JUnit for integration testing, you will need to design and write a main class on your own. Regardless, make sure you use an appropriate tolerance when comparing double values.

Submission

You must submit (using Gradescope) a .zip file named pa4.zip that contains:

  1. Your implementation of all of the classes

packaged appropriately. The analytics and geog directories/folders must be at the top of the .zip file. (See the first programming assignment if you have forgotten how to do this.) Do not submit your JUnit tests and do not submit any data files.

Your grade will be reduced by 5 points for each submission after the 10th submission. So, you should try to ensure that your code is correct before you submit it the first time.

Grading

Your code will first be graded by Gradescope and then by the Professor. The grade you receive from Gradescope is the maximum grade that you can receive on the assignment

Gradescope Grading

Your code must compile (in Gradescope, this will be indicated in the section on “Does your code compile?”) and all class names and method signatures must comply with the specifications (in Gradescope, this will be indicated in the section on “Do your class names, method signatures, etc. comply with the specifications?”) for you to receive any points on this assignment. Gradescope will then grade your submission as follows:

CriterionPointsDetails
Conformance to the Style Guide0All or Nothing; Success Required
Correctness100Partial Credit Possible

As discussed above, 5 points will be deducted for every submission after the 10th. Gradescope will provide you with hints, but may not completely identify the defects in your submission.

Manual Grading

After the due date, the Professor may manually review your code. At this time, points may be deducted for duplicate code, inelegant code, inappropriate variable names, bad comments, etc.

At this point in the semester, you should be able to create a good process and use it. You should probably think about which classes use which other classes and work from the bottom up. In other words, start with the classes that stand alone (the bottom), then move up to the classes that use them, and keep moving up until you get to the Geocoder class.

Help

You might find the following discussions helpful.

Help with Data Files

You should download the data files into the downloads directory/folder that you created for this course.

After you have downloaded the data files you must open a file explorer or finder, select all of the files, and drag them into Eclipse. Specifically, you must drag them into the project (not the src directory/folder or anything underneath it). When you do so, make sure you select “Copy files” not “Link to files”. (It is possible to put the data files elsewhere, but then, when you submit your solution, your code will not be able to find them on the Gradescope server. In other words, for your code to work both on your computer and on Gradescope, the data files must be dragged into the Eclipse project.)

Then, in your code that needs to use these files, you must use only the file name and extension (i.e., do not include a path). For example, you might construct a BufferedReader named in as follows:

BufferedReader in;
in = new BufferedReader(new FileReader("tiny.seg"));

All of the data files will be available on Gradescope for you to use in your tests (assuming you follow the instructions above).

Help with Testing

The jmuarea files are about 3Mb in total, which means that it will be very difficult for you to debug your code using them. Hence, you should use the smaller tiny files for initial testing and debugging. These files contain four streets, two with one segment each and two with multiple segments.

Even with the small files, you will need to be systematic in the tests you use. Think carefully about the kinds of faults your implementation might contain and the kinds of trigger conditions that might allow you to detect them. Some important addresses to geocode are addresses at the low and high ends of the segment which need not be interpolated (i.e., will have interpolation parameters of 0 and 1 respectively), and addresses that are in the interior of the segment and must be interpolated (i.e., will have an interpolation parameter between 0 and 1).

For tiny.str and tiny.seg you should certainly start with the following test inputs. Of course, you’ll first need to calculate the expected answer for each of these inputs.

Addresses at Ends of Segments (i.e., No Interpolation Required)

  • 748 Milk Spring Rd
  • 798 Milk Spring Rd
  • 674 Milk Spring Rd
  • 746 Milk Spring Rd
  • 1900 Creek Ct
  • 1999 Creek Ct
  • 1 Creek Loop
  • 246 Creek Loop
  • 387 Hill Dr
  • 150 Hill Dr

Addresses in the Interior of Segments (i.e., Must Be Interpolated)

  • 750 Milk Spring Rd
  • 701 Milk Spring Rd
  • 1935 Creek Ct
  • 200 Creek Loop
  • 155 Hill Dr
  • 380 Hill Dr

No Such Street

  • 747 Buttermilk Spring Rd
  • 1 Creekside Ct
  • 700 Spring Rd
  • -1 Hilliard Dr

No Such Number

  • 800 Milk Spring Rd
  • 1750 Creek Ct
  • -1 Hill Dr

Multiple “Hits” (i.e., Segments Containing the Address) Involving Addresses at the End and in the Interior

  • 165 Hill Dr
  • 303 Hill Dr
  • 305 Hill Dr
  • 359 Hill Dr

Multiple “Hits” Involving Addresses in the Interior

  • 177 Hill Dr
  • 351 Hill Dr

Questions to Think About

You don’t have to submit your answers to these questions, but you should try to answer them before you start writing the code because they will help you understand how to do so.

  1. Why is the streets attribute in the Geocoder class a Map<String,Street> rather than some other kind of collection?
  1. Why is the segments attribute in the Street class a List<Segment> rather than some other kind of collection?
  1. Why does the readSegments() method in the MapReader class return a Map<String,Segment> rather than some other kind of collection?
  1. Can a defect in a test (e.g., an incorrect expected value, an inappropriate tolerance) make you think code is correct when it isnt?

  2. What should you do when your code fails an integration tests after passing your unit tests?