Naive Bayes Classifier Assignment

Naive Bayes Implementation

The following file contains a very simple implementation of a Naive Bayes text classifier:

naive_bayes.py

Here is some sample data derived from the Rotten Tomatoes Reviews Dataset:

These CSV files each contain two columns. The first is a class label: ‘1’ for “fresh” and ‘0’ for “rotten”. The second column contains short movie reviews. The goal here is to learn to classify movie reviews as either fresh or rotten.

Assignment

Your goal in this assignment is to answer several questions by working with the provided data set and the provided classifier code. You are free to modify the provided code, but you should not make any changes to the algorithms that are currently in place. Make sure you comment any code changes to indicate how they relate to answering the questions.

Problem 1 - The Worst Movie

Find the most “rotten” movie review among all of the reviews in the test set, as determined by the Naive Bayes classifier that has been trained on the training set. You’ll need to make some changes to the code since it currently only provides a class label without returning a score.
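As a hint, one possible way to rank reviews once scores are available is sketched below. It assumes the classifier has been changed to expose a dictionary of per-class log scores for each review; the `scored_reviews` list, the `rotten_margin` helper, and the numbers are all made up for illustration and are not part of naive_bayes.py.

```python
# Hypothetical per-review log scores, keyed by class label
# ('0' = rotten, '1' = fresh). These values are invented.
scored_reviews = [
    ("a tedious, joyless slog", {'0': -40.1, '1': -47.9}),
    ("charming and heartfelt", {'0': -38.0, '1': -33.5}),
    ("fine, if forgettable", {'0': -29.3, '1': -29.9}),
]


def rotten_margin(class_scores, rotten='0', fresh='1'):
    """Return how much higher the rotten log score is than the fresh one."""
    return class_scores[rotten] - class_scores[fresh]


# The review with the largest margin is the most confidently rotten.
worst_review, worst_scores = max(
    scored_reviews, key=lambda item: rotten_margin(item[1])
)
print(worst_review, rotten_margin(worst_scores))
```

The margin between the two classes is used here rather than the raw rotten score because a raw log score becomes more negative as a review gets longer, so it is not comparable across reviews of different lengths.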

Problem 2 - Probability of Rottenness

Provide the predicted class probability distribution associated with the following review:

Jurassic Park is a cautionary tale about science gone wrong and
filmmaking gone lazy. For all its groundbreaking effects, the plot is
held together with dino-sized leaps in logic and characters who make
decisions so dumb they deserve to be eaten. The kids are annoying, the
adults are incompetent, and Jeff Goldblum spends half the movie
shirtless and smirking like he’s in a cologne ad. It’s less a
thrilling adventure and more a theme park ride with a script written
on the back of a napkin.

Again, you will need to modify the existing classifier to convert the log-based scores to a normalized probability distribution.
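One standard way to do this conversion is sketched below, assuming the log scores are available as a dictionary keyed by class label. The `normalize_log_scores` name and the example values are hypothetical. Subtracting the maximum score before exponentiating avoids underflow, since calling `math.exp` directly on a score like -300 or lower can round to zero.

```python
import math


def normalize_log_scores(class_scores):
    """Convert per-class log scores into probabilities that sum to 1."""
    # Shift so the largest score becomes 0; this leaves the ratios between
    # classes unchanged but keeps exp() from underflowing to 0.0.
    max_score = max(class_scores.values())
    exp_scores = {
        label: math.exp(score - max_score)
        for label, score in class_scores.items()
    }
    total = sum(exp_scores.values())
    return {label: value / total for label, value in exp_scores.items()}


# Invented log scores for a single review ('0' = rotten, '1' = fresh).
example_scores = {'0': -310.2, '1': -312.5}
probs = normalize_log_scores(example_scores)
print(probs)
```

With only two classes, the result depends solely on the difference between the two log scores, so long reviews with very negative scores are handled the same way as short ones.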

Problem 3 - ROC Curve

Generate an ROC curve using the provided test data, where “fresh” is the positive class. This will require determining a score value for every test instance. You should make use of the sklearn ROC curve function in generating your figure.
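The following sketch shows how `sklearn.metrics.roc_curve` might be used once every test instance has a score. The labels and scores are invented, and the choice of score (the fresh log score minus the rotten log score) is an assumption; any score that increases with confidence in “fresh” would work. The `plot_roc` helper is hypothetical.

```python
from sklearn.metrics import auc, roc_curve

# Invented data: true labels (1 = fresh, 0 = rotten) and one score per
# review, where a higher score means "more likely fresh".
y_true = [1, 0, 1, 1, 0, 0]
y_score = [2.3, -1.1, 0.4, -0.2, 0.1, -3.0]

# roc_curve sweeps a threshold over the scores and reports the false
# positive rate and true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.3f}")


def plot_roc(fpr, tpr, roc_auc):
    """Draw the ROC curve with the chance diagonal for reference."""
    # Imported here so the score computation above runs without matplotlib.
    import matplotlib.pyplot as plt

    plt.plot(fpr, tpr, label=f"Naive Bayes (AUC = {roc_auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate (fresh)")
    plt.legend()
    plt.show()
```

Calling `plot_roc(fpr, tpr, roc_auc)` in a notebook cell displays the figure. If the labels are read from the CSV as strings, they need to be converted to integers or `pos_label` set to `'1'` instead.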

Problem 4 - Predictive Words

Find the 10 words that, after training, are most indicative of rottenness, and the 10 words that are most indicative of freshness.

Answering this requires understanding how class scores are calculated by the classifier:

        # Calculate scores for each class
        for class_label in self.classes:
            # Start with log of class prior
            log_prob = math.log(self.class_priors[class_label])

            # Add log probabilities of usable words only
            for word in usable_words:
                word_prob = self.word_probs[class_label][word]
                log_prob += math.log(word_prob)                  # <------

            class_scores[class_label] = log_prob

The indicated line can be seen as a weighted vote cast by one of the words in the review. If a word is more strongly associated with one class than the other, its vote for that class will be relatively higher. The most telling words are those with the largest difference in their class-conditional probabilities between the two classes.
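One way to make this comparison concrete is sketched below. It assumes a `word_probs` dictionary shaped like `self.word_probs[class_label][word]` in the excerpt above, with invented probabilities, and it measures the difference in log space, since that is where the votes are added. This is one interpretation of "difference", not the only one; the `rank_words_by_log_ratio` helper is hypothetical.

```python
import math

# Invented class-conditional word probabilities ('0' = rotten, '1' = fresh).
word_probs = {
    '0': {'dull': 0.004, 'masterpiece': 0.0005, 'movie': 0.02},
    '1': {'dull': 0.0008, 'masterpiece': 0.003, 'movie': 0.021},
}


def rank_words_by_log_ratio(word_probs, positive='1', negative='0'):
    """Sort words from most rotten-leaning to most fresh-leaning."""
    # Only words with a probability under both classes can be compared.
    shared = set(word_probs[positive]) & set(word_probs[negative])
    log_ratios = {
        word: math.log(word_probs[positive][word])
        - math.log(word_probs[negative][word])
        for word in shared
    }
    return sorted(log_ratios, key=log_ratios.get)


ranked = rank_words_by_log_ratio(word_probs)
most_rotten = ranked[:10]        # most negative log ratios first
most_fresh = ranked[::-1][:10]   # most positive log ratios first
print(most_rotten, most_fresh)
```

In this toy example a common word such as "movie" lands in the middle of the ranking, because it is nearly as likely under either class and so contributes almost nothing to the final decision.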

Problem 5 - Writing Deceptive Reviews

Write a positive movie review that will be classified as highly negative by our trained classifier. Provide your review as well as the probability distribution (using the same approach as Problem 2).

Partners

This assignment may be completed individually or in pairs. If you are working with a partner, you must notify me at the beginning of the project. My expectation for pairs is that both members are actively involved and take full responsibility for all aspects of the project. In other words, I expect that you are working together, either in person or virtually, and not splitting up tasks to be completed separately. If either member of the pair is unable to fully explain the code to me, then the pair has not met this expectation.

Submission and Grading

Your answers should be provided in the following Jupyter Notebook file:

nb_answers.ipynb

Submit your completed notebook, as well as your updated version of naive_bayes.py, through Gradescope. The submitted version of your notebook should include the output of all code cells: I shouldn’t need to run your notebook to see the answers.