PA #6: Plagiarism Detection

Honor Code

This assignment must be completed individually. Your submission must conform to the JMU Honor Code. Authorized help is limited to general discussion on Piazza, the lab assistants assigned to CS 139/149/159, and the instructor. Copying work from another student or the Internet is an honor code violation and will be grounds for a reduced or failing grade in the course. Helping someone by looking at their code is also an honor code offense.

Introduction

For this assignment you will write a command line utility that reads in two text files and displays information about the longest sequence of words that appears in both documents.

Program Specifications

Your main class must be named DuplicateSearch, and it must take three command line arguments: two file names and an optional integer value.

DuplicateSearch file1.txt file2.txt [K]

No matches shorter than K will be returned. (See below for a more detailed description of K.) If no value for K is provided, then a default of 3 should be used.

For example, consider the two files:

A.txt
Happy families are all alike; every unhappy family is unhappy in its own way. Everything was in confusion in the Oblonskys' house. The wife had discovered that the husband was carrying on an intrigue with a French girl, who had been a governess in their family, and she had announced to her husband that she could not go on living in the same house with him.

B.txt
In the words of Leo Tolstoy: "Every unhappy family is unhappy in its own way."

Given these two files, an interaction with your program will look like the following:

$ java DuplicateSearch A.txt B.txt 2

Match length: 9
Position in A.txt: 5
Position in B.txt: 6

Matching words: 
every unhappy family is unhappy in its own way

The format of your output must exactly match the format of the example above. The search must not be case-sensitive, and it must disregard all punctuation and newlines. For this project, Web-CAT will not test your code for proper error handling.

If no matches of the appropriate length are found, your program must print the string "\nNo matches found.\n".

You must use the the utility method in DocumentReader.java for reading text from the file. Note that this method uses any non-word character as a delimiter, so it will split some single words into multiple entries. For example "Bob's" will be split into "Bob" and "s". This is fine for our purposes, since both documents will be parsed in the same way.

Algorithm

There is a simple brute-force algorithm for finding matching sequences as described above: just start a search at every possible pair of locations in each document, and return the longest match that is discovered starting from any of those locations. Pseudocode for this approach would look like the following:

For every position p1 in Document1:
- For every position p2 in Document2:
  - Compare the word at p1 in Document1 with the word at p2 in Document2, then compare the words at p1 + 1 and p2 + 1, and so on, until a mismatch is found. If the length of the matching region is longer than any previous match, store it.
Return information about the longest match found.

This algorithm works fine for files with a few hundred or a few thousand words, but it isn't very useful for large documents. For example, consider two documents that have 1,000,000 words each. There are then 1,000,000 x 1,000,000 = 10¹² = 1 trillion pairs of locations to check in the two documents. Computers are fast, but it will still take a long time to finish that computation.

The brute force algorithm above is slow because it needs to scan all the way through Document2 for every word in Document1. The vast majority of those searches will terminate immediately with a mismatch. For example, the first word in A.txt, "Happy", never appears at all in B.txt, but the algorithm still looks for it at every possible location.

We can improve our algorithm by only searching for matches at "promising" locations in Document2: locations where we know we will find at least one matching word. The following pseudocode describes this approach:

Scan through Document2, recording all of the locations where every word occurs.
For every position p1 in Document1:
- Get the word at p1.
- Look up all of the locations where that word occurs in Document2.
- For every location p2 that was found:
  - Compare the word at p1 in Document1 with the word at p2 in Document2, then compare the words at p1 + 1 and p2 + 1, and so on, until a mismatch is found. If the length of the matching region is longer than any previous match, store it.
Return information about the longest match found.

This new algorithm is much more efficient than the brute-force approach, but it still has a problem. English text is dominated by a few high frequency words like "the", "a" and "and". This results in many false positives: for every "the" that appears in Document1, we will need to check every "the" in Document2, even though most of them will not be part of a longer match.

The problem of high frequency words will be addressed with one last refinement to the algorithm: rather than storing the locations of every individual word, we will store the locations of short sequences of K words. The word "the" might appear hundreds of times in a long document, but the phrase "the wife had" is unlikely to appear at all unless text has been copied from one document to the other. With this modification, it won't be possible to find matches shorter than K, but we usually aren't interested in short matches anyway. In practice, K=3 should usually be a reasonable value.

Note that the algorithm description above does not specify how word locations should be stored or retrieved. It is your responsibility to select appropriate java collection types for solving this problem efficiently.

Testing

Your code will only pass the submission tests if your output formatting is exactly correct. Here is a unit test corresponding to the example above:

DuplicateSearchOneTest.java

This test should help you confirm that your output matches the expected format.

It is a little known fact that Mark Twain was a shameless plagiarist. He is known to have lifted one 44-word passage of his novel Adventures of Huckleberry Finn verbatim from Herman Melville's Moby Dick^*. You can test your program by locating this passage.

If your code is working correctly, it should take no more than a few seconds to locate the offending text. Web-CAT will be configured to fail any submission that requires more than six seconds on this example.

Going Further

Here are some optional extensions:

What happens if you execute your program on two large identical files? Try it out:
```
 
     DuplicateSearch moby_dick.txt moby_dick.txt 3
```
If this runs slowly, modify your program to address the problem.
Modify your program to report ALL matches longer than a given threshold. (Be careful not to return pieces of the same match multiple times.)
Searching for patterns or pattern matching in general is a rich field of research in computer science. Many algorithms exist for finding exact matches, but the more interesting algorithms today are those around finding approximate matches. (Think of internet search results.) For more information on approximate match algorithms one could start by investigating the Wikipedia link here. https://en.wikipedia.org/wiki/Approximate_string_matching

Submission and Grading

NO LATE SUBMISSIONS WILL BE ACCEPTED FOR THIS ASSIGNMENT

Web-CAT Correctness	70%
Web-CAT Checkstyle	10%
Instructor grading based on style and code quality	20%

You must submit your completed code through Web-CAT by the project deadline. You should not submit DocumentReader.java. As with previous assignments, code that does not pass all submission tests will receive at most 50% credit. Your final grade will be reduced by one point for each submission beyond five.

Hints and Advice

If you didn't finish the assigned reading on Java collections all the way through, I would encourage you to do so before you get started on this project.
The "short sequences of K words" mentioned above might remind you of the n-gram iterator you completed for an earlier lab. You may find that code useful in developing your solution.

^* Completely untrue.