For this assignment you will write a command line utility that reads in two text files and displays information about the longest sequence of words that appears in both documents.
Your main class must be named DuplicateSearch, and it must accept up to three command-line arguments: two file names and an optional integer value.
DuplicateSearch file1.txt file2.txt [K]

No matches shorter than K will be returned. (See below for a more detailed description of K.) If no value for K is provided, then a default of 3 should be used.
For example, consider the two files:
A.txt:

Happy families are all alike; every unhappy family is unhappy in its own way. Everything was in confusion in the Oblonskys' house. The wife had discovered that the husband was carrying on an intrigue with a French girl, who had been a governess in their family, and she had announced to her husband that she could not go on living in the same house with him.

B.txt:

In the words of Leo Tolstoy: "Every unhappy family is unhappy in its own way."
Given these two files, an interaction with your program will look like the following:
$ java DuplicateSearch A.txt B.txt 2
Match length: 9
Position in A.txt: 5
Position in B.txt: 6
Matching words: every unhappy family is unhappy in its own way
The format of your output must exactly match the format of the example above. The search should not be case-sensitive, and it should disregard all punctuation and newlines. In addition, your program should print appropriate error messages in the case of missing or incorrect arguments. Note that Web-CAT will not test your code for proper error handling, but error handling will be included in the grading rubric.
If no matches of the appropriate length are found, your program must print the string "\nNo matches found.\n".
You may use the utility method in DocumentReader.java for reading text from the file. Note that this method uses any non-word character as a delimiter, so it will split some single words into multiple entries. For example, "Bob's" will be split into "Bob" and "s". This is fine for our purposes, since both documents will be parsed in the same way.
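To see the splitting behavior concretely, it is roughly equivalent to splitting on the regular expression \W+, as in the sketch below (this is only an illustration; it is not a copy of DocumentReader.java):

```java
// Illustration only: splitting on non-word characters, roughly the
// behavior described above for DocumentReader.java.
public class SplitDemo {
    public static void main(String[] args) {
        String text = "Bob's dog barked; everyone laughed.";
        // \W+ matches one or more characters that are not letters,
        // digits, or underscore, so apostrophes and other punctuation
        // act as delimiters.
        String[] words = text.toLowerCase().split("\\W+");
        for (String word : words) {
            System.out.println(word);
        }
        // Prints: bob, s, dog, barked, everyone, laughed (one per line)
    }
}
```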
There is a simple brute-force algorithm for finding matching sequences as described above: just start a search at every possible pair of locations in each document, and return the longest match that is discovered starting from any of those locations. Pseudocode for this approach would look like the following:
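A minimal Java-style sketch of this brute-force idea, assuming each document has already been read into an array of normalized words (the class, method, and variable names here are illustrative, not part of the required design):

```java
/**
 * Brute-force sketch for illustration only: try every pair of
 * starting positions and remember the longest run of matching words.
 */
public final class BruteForceSketch {

    /** Returns {length, positionInFirst, positionInSecond} of the best match. */
    public static int[] longestMatch(String[] first, String[] second) {
        int bestLength = 0;
        int bestPos1 = 0;
        int bestPos2 = 0;
        for (int i = 0; i < first.length; i++) {
            for (int j = 0; j < second.length; j++) {
                // Count how many consecutive words agree starting at (i, j).
                int length = 0;
                while (i + length < first.length
                        && j + length < second.length
                        && first[i + length].equals(second[j + length])) {
                    length++;
                }
                if (length > bestLength) {
                    bestLength = length;
                    bestPos1 = i;
                    bestPos2 = j;
                }
            }
        }
        return new int[] {bestLength, bestPos1, bestPos2};
    }
}
```

For each pair of starting positions, the sketch simply counts how far the two word sequences agree and remembers the best result seen so far.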
This algorithm works fine for files with a few hundred or a few thousand words, but it isn't very useful for large documents. For example, consider two documents that have 1,000,000 words each. There are then 1,000,000 x 1,000,000 = 10^12 = 1 trillion pairs of locations to check in the two documents. Computers are fast, but it will still take a long time to finish that computation.
The brute-force algorithm above is slow because it needs to scan all the way through Document2 for every word in Document1. The vast majority of those searches will terminate immediately with a mismatch. For example, the first word in A.txt, "Happy", never appears at all in B.txt, but the algorithm still looks for it at every possible location.
We can improve our algorithm by only searching for matches at "promising" locations in Document2: locations where we know we will find at least one matching word. The following pseudocode describes this approach:
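One possible Java sketch of this idea, using the same word-array representation as above (the names are again illustrative): index every word of the second document by the positions where it occurs, then start comparisons only at those positions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * "Promising locations" sketch for illustration only: comparisons are
 * started only where the current word actually occurs in the second
 * document.
 */
public final class IndexedSearchSketch {

    /** Returns {length, positionInFirst, positionInSecond} of the best match. */
    public static int[] longestMatch(String[] first, String[] second) {
        // Map each word of the second document to its list of positions.
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int j = 0; j < second.length; j++) {
            positions.computeIfAbsent(second[j], word -> new ArrayList<>()).add(j);
        }

        int bestLength = 0;
        int bestPos1 = 0;
        int bestPos2 = 0;
        for (int i = 0; i < first.length; i++) {
            // Only the positions where first[i] occurs are promising.
            List<Integer> promising =
                positions.getOrDefault(first[i], Collections.emptyList());
            for (int j : promising) {
                int length = 0;
                while (i + length < first.length
                        && j + length < second.length
                        && first[i + length].equals(second[j + length])) {
                    length++;
                }
                if (length > bestLength) {
                    bestLength = length;
                    bestPos1 = i;
                    bestPos2 = j;
                }
            }
        }
        return new int[] {bestLength, bestPos1, bestPos2};
    }
}
```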
This new algorithm is much more efficient than the brute-force approach, but it still has a problem. English text is dominated by a few high-frequency words like "the", "a", and "and". This results in many false positives: for every "the" that appears in Document1, we will need to check every "the" in Document2, even though most of them will not be part of a longer match.
The problem of high-frequency words can be addressed with one last refinement to the algorithm: rather than storing the locations of every individual word, we will store the locations of short sequences of K words. The word "the" might appear hundreds of times in a long document, but the phrase "the wife had" is unlikely to appear at all unless text has been copied from one document to the other. With this modification, it won't be possible to find matches shorter than K, but we usually aren't interested in short matches anyway. In practice, K=3 should usually be a reasonable value.
Note that the algorithm description above does not specify how word locations should be stored or retrieved. It is your responsibility to select appropriate Java collection types for solving this problem efficiently.
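As one illustration (an assumption, not a requirement), a HashMap that maps each K-word chunk of a document to the list of positions where that chunk begins would support the lookup described above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative storage choice only: index a document by its K-word
 * chunks so that matching chunks in the other document can be looked
 * up directly.
 */
public final class ChunkIndexSketch {

    /** Maps each K-word chunk to the positions where the chunk starts. */
    public static Map<String, List<Integer>> buildIndex(String[] words, int k) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int start = 0; start + k <= words.length; start++) {
            // Join k consecutive words into a single key, e.g. "the wife had".
            StringBuilder chunk = new StringBuilder();
            for (int offset = 0; offset < k; offset++) {
                if (offset > 0) {
                    chunk.append(' ');
                }
                chunk.append(words[start + offset]);
            }
            index.computeIfAbsent(chunk.toString(), key -> new ArrayList<>())
                 .add(start);
        }
        return index;
    }
}
```

With such an index built for one document, each K-word chunk of the other document can be looked up directly, and any matches that are found can then be extended to their full length.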
I am providing a sample command line utility that you may find instructive: Chunker.java. This file will not be part of your final submission. It is provided as a (hopefully) useful illustration of how you might break a document into multi-word chunks.
Finally, consider what happens when the same file is supplied as both inputs:

DuplicateSearch moby_dick.txt moby_dick.txt 3

Modify your program to address this problem.
NO LATE SUBMISSIONS WILL BE ACCEPTED FOR THIS ASSIGNMENT
Submit your .java files through Web-CAT by the project deadline. You should not submit DocumentReader.java.
Your grade will be based on the following factors, and may also be affected by the number of submissions that you make. See below for more details on how each component will be calculated.
Component | Weight |
---|---|
Autograding based on correctness/testing | 60% |
Autograding based on static analysis of style | 10% |
Instructor grading based on design, style and readability | 30% |
You do not need to submit your own unit tests for this project. Your code will be tested against instructor-provided unit tests.
Web-CAT will use checkstyle to analyze your code for conformance to the course style guide. I strongly suggest that you install checkstyle for Eclipse on your own machine so that you can run your own style checks before submission. You can find the checkstyle configuration file that Web-CAT will use on the course supplemental material page.
This portion of your grade will be based on stylistic and design issues that cannot be checked automatically.
Your grade will also be based on the number of submissions that you make. You may make one free submission per day. Beyond those free submissions, you may make up to six submissions with no penalty. Your grade will then be reduced by 3% after every six submissions:
Number of submissions | Penalty |
---|---|
1-6 | 0 |
7-12 | -3% |
13-18 | -6% |
... | ... |