PA5: Sentiment Analysis

Introduction

Sentiment Analysis is the problem of determining the general attitude expressed by some text. For instance, we would like to have a program that could look at the text "The film was a breath of fresh air" and realize that it was a positive statement while "It made me want to poke out my eyeballs" is negative.

One algorithm that we can use for this problem is to assign a numeric value to each word based on how positive or negative that word is and then score the text as a whole based on the average sentiment value of the individual words. The challenge here is in finding a way to assign positive or negative values to individual words.

For the purposes of this project we will assign values to words by analyzing a large collection of movie reviews collected from the Rotten Tomatoes website. The text of each movie review is accompanied by a human-generated evaluation of whether the review is positive or negative overall.

The first few lines of the data file (movieReviews.txt) look like this:

1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .	
4 This quiet , introspective and entertaining independent is worth seeking .	
1 Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one .	
3 A positively thrilling combination of ethnography and all the intrigue , betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera .	
1 Aggressive self-glorification and a manipulative whitewash .	
4 A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis .	
1 Narratively , Trouble Every Day is a plodding mess .

Note that each review starts with a number 0 through 4 with the following meaning:

0 : negative
1 : somewhat negative
2 : neutral
3 : somewhat positive
4 : positive

Individual words will be scored by computing the average rating of all of the reviews that contain that word. For example, if we were only working with the reviews included above, the word "and" would be assigned the score \( (4 + 3 + 1) / 3 = 2.\overline{6} \): it appears in the 2nd, 4th, and 5th reviews which have scores of 4, 3, and 1 respectively.
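The arithmetic above can be checked with a short stand-alone sketch. It deliberately avoids the provided Review class: the class name WordScoreSketch, the method averageScore, and the parallel-array representation are assumptions made only for this illustration. Matching ignores case, as the instructions later require, and a review is counted once even when the word occurs in it more than once (as "and" does in the 4th review).

```java
/** Hypothetical stand-alone sketch; it does not use the provided Review class. */
class WordScoreSketch {

    /** Average score of the reviews that contain the word, ignoring case. */
    static double averageScore(int[] scores, String[][] texts, String word) {
        int total = 0;
        int count = 0;
        for (int i = 0; i < texts.length; i++) {
            boolean found = false;
            for (String w : texts[i]) {
                if (w.equalsIgnoreCase(word)) {
                    found = true;
                }
            }
            if (found) { // count each review once, even if the word repeats
                total += scores[i];
                count++;
            }
        }
        // Assumes the word occurs in at least one review.
        return (double) total / count;
    }

    public static void main(String[] args) {
        // The seven sample reviews shown above, split on single spaces.
        int[] scores = {1, 4, 1, 3, 1, 4, 1};
        String[] lines = {
            "A series of escapades demonstrating the adage that what is good for the goose"
                + " is also good for the gander , some of which occasionally amuses but none"
                + " of which amounts to much of a story .",
            "This quiet , introspective and entertaining independent is worth seeking .",
            "Even fans of Ismail Merchant 's work , I suspect , would have a hard time"
                + " sitting through this one .",
            "A positively thrilling combination of ethnography and all the intrigue ,"
                + " betrayal , deceit and murder of a Shakespearean tragedy or a juicy"
                + " soap opera .",
            "Aggressive self-glorification and a manipulative whitewash .",
            "A comedy-drama of nearly epic proportions rooted in a sincere performance by"
                + " the title character undergoing midlife crisis .",
            "Narratively , Trouble Every Day is a plodding mess ."
        };
        String[][] texts = new String[lines.length][];
        for (int i = 0; i < lines.length; i++) {
            texts[i] = lines[i].split(" ");
        }
        // Reviews 2, 4, and 5 contain "and": (4 + 3 + 1) / 3, roughly 2.67
        System.out.println(averageScore(scores, texts, "and"));
    }
}
```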

Instructions

The following files are provided. You should not need to modify any of these files.

Review.java - Objects of this class represent individual reviews. Review objects store the score associated with the review along with the text of the review as an array of words. Here is the documentation for Review.java.
ReviewLoader.java - This class contains code for loading a collection of Reviews from a text file. Here is the documentation for ReviewLoader.java.
DemoDriver.java - This file provides some example code that illustrates the use of Review.java and ReviewLoader.java.
movieReviews.txt - A text file containing 8529 movie reviews along with their scores.

You will need to create each of the following files from scratch:

SentimentUtils.java
SentimentUtilsTest.java
Sentiment.java

SentimentUtils.java

This class must provide the methods described below. It will not have a main method.

public static boolean containsWord(Review review, String word): This method must return true if the word appears in the text of the provided review and false if it does not. Comparisons should not be case-sensitive. Partial matches don't count. For example, if the review contains the words {"The", "movie", "is", "rotten"}, "the" would be considered a match, but "rot" would not.
public static double evaluateWord(Review[] reviews, String word): This method must return the average score of all reviews containing the indicated word. If the word does not appear in any review, the return value should be 2.0. You may assume that reviews will be a correctly initialized array of Review objects: it is not necessary to explicitly check for a null array or for null entries in the array.
public static double evaluateText(Review[] reviews, String[] text): This method must return the average score of all words in the array text. Your method should call evaluateWord to calculate the scores for the individual words. If text has length 0, this method should return 2.0. Again, you may assume that both array arguments are correctly initialized.
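As a hand-worked illustration of how these methods fit together, suppose (purely as an assumption for this example) that the seven sample reviews from the introduction were the entire data set and that text were {"and", "zebra"}. The word "and" appears in reviews scored 4, 3, and 1, while "zebra" appears in no review and so receives the default score of 2.0:

\[ \text{evaluateWord}(\text{"and"}) = \frac{4 + 3 + 1}{3} = 2.\overline{6}, \qquad \text{evaluateWord}(\text{"zebra"}) = 2.0 \]

\[ \text{evaluateText} = \frac{2.\overline{6} + 2.0}{2} = 2.\overline{3} \]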

SentimentUtilsTest.java

You must provide unit tests for each of the methods in SentimentUtils.java. Note that it is not practical to write your unit tests using all of the data in movieReviews.txt. In order to create useful tests you need a set of reviews that is small enough to enable you to work out the expected answers by hand. There are two ways you could accomplish this. One possibility is to write code to generate a small array of Review objects. Something like the following:

Review[] testReviews = new Review[4]; // Make space for four reviews

// Create the first review...
String[] reviewText1 = {"The", "movie", "is", "rotten"};
testReviews[0] = new Review(1, reviewText1);

// etc.

The other possibility is to create a small text file with the same format as movieReviews.txt. Your testing code can then use ReviewLoader.java to convert that file into an array of Review objects.

Sentiment.java

This file will provide a command-line utility for assigning sentiment scores to text. Here is an example of a possible interaction with the program:

$ java Sentiment
Enter your text (on a single line): 
This movie is rotten 
The sentiment score for this text is: 1.58 
This text is negative.

In this example, the line "This movie is rotten" is user input; every other line after the command was produced by the program. The displayed numeric score must be rounded to two decimal places. The last line of the output must be based on the numeric score: If the score is below 1.95, the output should be "This text is negative." If the score is greater than or equal to 1.95 and less than 2.05, the output should be "This text is neutral." If the score is greater than or equal to 2.05, the output should be "This text is positive."
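The rounding and threshold rules can be sketched as follows. This is only an illustration under stated assumptions: the class name SentimentLabelSketch and the helper names label and formatScore are hypothetical, the thresholds are applied to the unrounded score (the text above does not say which is intended), and Locale.US is chosen here so that the decimal separator is always a period.

```java
import java.util.Locale;

/** Hypothetical sketch of the score formatting and labeling rules. */
class SentimentLabelSketch {

    /** Maps a score to the required final output line. */
    static String label(double score) {
        if (score < 1.95) {
            return "This text is negative.";
        } else if (score < 2.05) {
            return "This text is neutral.";
        } else {
            return "This text is positive.";
        }
    }

    /** Formats a score with exactly two digits after the decimal point. */
    static String formatScore(double score) {
        return String.format(Locale.US, "%.2f", score);
    }

    public static void main(String[] args) {
        double score = 1.58; // assumed value, taken from the sample interaction
        System.out.println("The sentiment score for this text is: " + formatScore(score));
        System.out.println(label(score));
    }
}
```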

In order to pass the automated Web-CAT tests the output of your program must exactly match the format illustrated above.

You can use the split method of the String class to convert the user input into an array of words. For example:

String text = "This movie is   rotten.";

String[] words = text.split("[\\p{Punct}\\s]+");

The argument to split determines how the method should split the string. In this case, "[\\p{Punct}\\s]+" is a regular expression that will match any sequence of whitespace characters or punctuation. After the code above executes, words will contain the array {"This", "movie", "is", "rotten"}.
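A small runnable sketch of this behavior is shown below, including two edge cases worth knowing about. The class name SplitSketch and the helper toWords are hypothetical names used only for this illustration. When the input begins with punctuation or whitespace, split returns an empty string as the first element, and because the apostrophe counts as punctuation, a contraction such as "don't" is split into two pieces.

```java
import java.util.Arrays;

/** Hypothetical sketch exploring String.split with the pattern shown above. */
class SplitSketch {

    /** Splits text on runs of punctuation and/or whitespace. */
    static String[] toWords(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    public static void main(String[] args) {
        // Four words; the trailing period does not produce an extra element.
        System.out.println(Arrays.toString(toWords("This movie is   rotten.")));

        // Leading punctuation: the first element is an empty string.
        String[] leading = toWords("...Wow, great!");
        System.out.println(leading.length + " elements, first is empty: "
                + leading[0].isEmpty());

        // The apostrophe is punctuation, so "don't" becomes "don" and "t".
        System.out.println(Arrays.toString(toWords("I don't care")));
    }
}
```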

You do not need to provide unit tests for Sentiment.java.

Submitting UPDATED 3/28

Part A

Before the deadline for Part A you should read this document carefully. Once you have a clear understanding of the expectations for this assignment, complete the readiness quiz in Canvas. The grading for this quiz will be all or nothing: your score on the quiz will be 0 if you miss any questions. If you do not successfully complete Part A, you cannot receive any credit for Part B.

Part B

Zip SentimentUtils.java and SentimentUtilsTest.java along with any additional text files that you require for testing. Submit the .zip file through Web-CAT. You should not include Sentiment.java or any of the Java files that we have provided.

SentimentUtils.java must conform to the CS 139 Style Guide. You are not required to provide Javadoc comments for the methods in SentimentUtilsTest.java, but that file should conform to the style guide in all other respects.

Part C

Zip SentimentUtils.java and Sentiment.java and submit through Web-CAT. Your zip file should only contain these two Java files. You should not include the Java files that we provided or any testing code. Both SentimentUtils.java and Sentiment.java must conform to the CS 139 Style Guide.

Grading

Your submission will be graded using the following criteria:

Component	Points
Part A Quiz	10
Part B Web-CAT Correctness/Testing	40
Part B Checkstyle Tests	10
Part C Web-CAT Correctness/Testing	10
Part C Checkstyle Tests	10
Style and Code Organization	20

If Web-CAT deducts any points for correctness/testing, you will receive at most 25/50 on that component of the score.

Once again there will be a penalty for excessive submissions. The first 10 submissions are free. Each submission beyond 10 will result in a 0.5 reduction in the final score.

Honor Code

This assignment must be completed individually. Your submission must conform to the JMU Honor Code. Authorized help is limited to general discussion on Piazza, the lab assistants assigned to CS 139, and the instructor. Copying work from another student or the Internet is an honor code violation and will be grounds for a reduced or failing grade in the course.

Acknowledgments

The idea for this assignment was presented by Eric Manley and Timothy Urness at the 2016 SIGCSE nifty assignment session. This project uses their data files and borrows some text from their write-up. The Rotten Tomatoes data was originally collected for the Stanford sentiment analysis project.