CS 445 Machine Learning
PA2: Naive Bayes Classifier
Learning Objectives
After completing this activity, students should be able to:
- define a Naive Bayes classifier for a mixture of both discrete and continuous components
- define and implement Laplace smoothing to handle zero probabilities
Partners
This assignment may be completed individually or in pairs. If you are doing this in pairs, you must notify me at the beginning of the project. My expectation for pairs is that both members are actively involved and take full responsibility for all aspects of the project. In other words, I expect that you are either sitting together or working together virtually, not splitting up tasks to be completed separately. If both members of the group are not able to fully explain the code to me, then this does not meet this expectation.
Part 1: Implementation
Construct a Naive Bayes classifier.
Resources
| Link to file | Purpose/description |
| --- | --- |
| nb_classifier | Stub for your classifier |
| nb_classifier_test.py | Unit tests for your classifier |
| MNIST_data.zip | MNIST image data for testing your classifier |
Tips for Working with Numpy Matrices with String data
Working with numpy matrices with mixed datatypes is not very convenient. In fact, the Pandas package is very popular with machine learning scientists because it eases some of this pain.
Here is an example of a numpy matrix with categorical and numeric data:
X = np.array([['Cat', 'Yes', 1.23], ['Dog', 'Yes', 2.45]])
print(X)
[['Cat' 'Yes' '1.23']
 ['Dog' 'Yes' '2.45']]
X.dtype
Out[14]: dtype('<U4')
Note that every element has been stored as a string: the dtype <U4 means that, at most, a column can contain data with 4 characters.
My solution includes a function that computes the probability estimate given μ, σ, and a value for the variable (x). Since the variable may be extracted from a numpy array, it is possible that it has a character type. On the other hand, the method feature_class_prob is used to test your code, and it is inconvenient to pass the data in this way (you just want to pass x as a float). The following code can address this situation. It queries the datatype of x (the variable value at which to estimate the probability) and, if it is a numpy str_, converts it to a float; otherwise, it just copies it to xfloat.
if isinstance(x, np.str_):
    xfloat = x.astype(float)  # convert the numpy string scalar to a float
else:
    xfloat = x
Variances of zero
When training your classifier, it is possible that every training example (for a given class) has the same value for a feature; this happens frequently when dealing with high-dimensional data. This causes an issue when estimating the distribution of a continuous feature with a Gaussian, since the variance is zero and the denominator of the first term of the Gaussian density is therefore zero.
If you encounter this situation, you may omit this feature from consideration when predicting the probability.
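For reference, here is a minimal sketch of the Gaussian density estimate with a zero-variance guard; the function name gaussian_prob and the convention of returning None for an omitted feature are illustrative choices, not part of the required interface.
import numpy as np

def gaussian_prob(x, mu, sigma):
    # Sketch: Gaussian density for x given mean mu and standard deviation sigma.
    # Returns None when sigma is zero so the caller can omit this feature.
    if sigma == 0:
        return None  # all training values were identical; skip this feature
    coeff = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))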
Smoothing with Continuous Values
When working with continuous features, theoretically any value should have a probability greater than zero. For THIS project, if your estimate returns a probability of zero for a continuous feature, you can assign 10e-9 as the probability.
For discrete/categorical features, Laplace smoothing (when requested) addresses this; in other words, do not use the 10e-9 floor for discrete features.
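For reference, add-one (Laplace) smoothing estimates P(feature = v | class = c) as (count of v within class c + 1) / (number of training examples in class c + K), where K is the number of distinct values the feature can take. A minimal sketch, with illustrative parameter names:
def laplace_estimate(value_count, class_count, num_values):
    # Sketch: add-one smoothed estimate of P(feature = v | class = c).
    #   value_count: training examples in class c where the feature equals v
    #   class_count: total training examples in class c
    #   num_values:  number of distinct values the feature can take (K)
    return (value_count + 1) / (class_count + num_values)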
Log Likelihood and Challenges with Smoothing
As discussed in class, using the log likelihood of the probabilities is common in many implementations of Naive Bayes because it prevents numeric underflow.
An issue arises with zero probabilities (a potential issue with the discrete, non-smoothed examples from the book) since log(0) is undefined. In numpy, log(0) produces a runtime warning and evaluates to -inf, and subsequent arithmetic can then yield NaN; you can treat any computation resulting in NaN as having a zero probability. You may need to check for this with if statements for now (I am working on a better approach).
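One possible way to accumulate the log probabilities while handling this case is sketched below; the helper name and the decision to map NaN to -inf are illustrative, not a required interface.
import numpy as np

def total_log_prob(probs):
    # Sketch: sum the logs of the per-feature probabilities, treating a NaN
    # result as a zero-probability outcome.
    with np.errstate(divide='ignore'):  # silence the warning from log(0)
        total = np.sum(np.log(probs))
    if np.isnan(total):
        return -np.inf                  # treat as zero probability
    return total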
Testing your Implementation
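Run the provided unit tests (nb_classifier_test.py from the resources table above) as you develop. For example, assuming the test file is pytest-compatible, running pytest nb_classifier_test.py from the project directory should exercise your classifier; the same tests, plus a few hidden ones, are used for grading on Autolab.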
Part 2: MNIST Dataset
You will apply your Naive Bayes classifier to the MNIST database of handwritten digits. The training and test datasets are provided in the resources section. Enclose all of the following analysis and plots in a Jupyter notebook.
Build Model
Build your model and compute the accuracy. You will need to create an X matrix; the image data provided is 60k x 28 x 28. Use numpy's reshape function to adjust this matrix so that it is 60k x 784.
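For example, assuming the raw training images have been loaded into a numpy array named train_images of shape (60000, 28, 28) (the variable name is illustrative):
X = train_images.reshape(-1, 28 * 28)  # now 60000 x 784, one flattened image per row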
Use a confusion matrix (10 x 10) to illustrate your results (you can use sklearn to build the confusion matrix). Include 2 to 3 sentences on the quality of your results.
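One way to build the confusion matrix with sklearn, assuming y_test holds the true labels and y_pred holds your classifier's predictions (both names are placeholders):
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)  # 10 x 10: rows are true digits, columns are predicted digits
print(cm)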
Visualize your Distributions
Being able to visualize your model to verify that it is built correctly is an important component of ML that is often omitted (sometimes with very bad consequences). In this setting, your model consists of 784 dimensions, so this could be challenging.
Fortunately, you recall that each of the features is derived from a pixel. For each class label/digit, build a heatmap where each pixel in the heatmap is the expected (μ) value for that pixel given the label you are plotting. Thus, you should end up with 10 heatmaps.
If you place your μ values in a length-784 numpy array, you can use reshape to make this a 2d array of 28 by 28 (the original dimensions of the image). You can then use plt.imshow to render the heatmap and plt.savefig to save it as a pdf or png file. The image shown here was created from my distributions for the label corresponding to the digit 3.
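A rough sketch of that plotting step, assuming mu_for_label is a length-784 numpy array holding the per-pixel means for one digit (the variable and file names are illustrative):
import matplotlib.pyplot as plt

img = mu_for_label.reshape(28, 28)  # back to the original image dimensions
plt.imshow(img)                     # render the means as a heatmap
plt.title('Mean pixel values for digit 3')
plt.savefig('digit_3_means.png')    # a pdf extension also works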
Submission
This PA has the following required submissions:

| Deliverable | Location to Submit | Description |
| --- | --- | --- |
| nb_classifier.py | Autolab | Graded using the test file provided plus a few hidden tests |
| nb_classifier.py | Canvas | |
| MNistBayesAnalysis.ipynb notebook | Canvas | Your notebook can expect that the data files and your nb_classifier.py file are located in the same directory. Your notebook MUST include the plots already rendered (in other words, nothing should need to be run). |
Grading
Grades will be calculated according to the following distribution.

| Criterion | Percentage | Description |
| --- | --- | --- |
| Overall Readability/Style | 10% | Your code should follow PEP8 conventions. It should be well documented and well organized. |
| Part 1: Reference Tests | 40% | nb_classifier_test.py |
| Part 1: Efficiency Contest | 10% | The main concern is clarity and correctness. That said, your implementation must be somewhat efficient. To test this, we will have an Autolab area just for testing the fit performance of your classifier. |
| Part 1: Hidden Tests | 20% | These tests are run on Autolab, so you can keep submitting until you pass these as well (but the contents of the tests remain hidden). |
| Part 2: Digit Recognition | 20% | Notebook illustrating your classifier in action on the MNIST digit dataset. |