CS 445 Machine Learning
PA2: Naive Bayes Classifier
Learning Objectives
After completing this activity, students should be able to:
- define a Naive Bayes classifier for a mixture of both discrete and continuous components
- define and implement Laplace smoothing to handle zero probabilities
Partners
This assignment may be completed individually or in pairs. If you are doing this in pairs, you must notify me at the beginning of the project. My expectation for pairs is that both members are actively involved and take full responsibility for all aspects of the project. In other words, I expect that you are either sitting together or working together virtually, not splitting up tasks to be completed separately. If both members of the group are not able to fully explain the code to me, then this does not meet this expectation.
Part 1: Implementation
Construct a Naive Bayes classifier.
Resources
| Link to file | Purpose/description |
| --- | --- |
| nb_classifier | Stub for your classifier |
| nb_classifier_test.py | Unit tests for your classifier |
| MNIST_data.zip | MNIST image data for testing your classifier |
Tips for Working with Numpy Matrices with String data
Working with numpy matrices with mixed datatypes is not very convenient. In fact, the Pandas package is very popular with machine learning scientists because it eases some of this pain.
Here is an example of a numpy matrix with categorical and numeric data:
X = np.array([['Cat', 'Yes', 1.23], ['Dog', 'Yes', 2.45]])
print(X)
[['Cat' 'Yes' '1.23']
 ['Dog' 'Yes' '2.45']]
X.dtype
Out[14]: dtype('<U4')
Note that every element has been stored as a string: the dtype <U4 means that, at most, a column can contain data with 4 characters.
My solution includes a function that computes the probability estimate given μ, σ, and a value for the variable (x). Since the variable may be extracted from a numpy array, it is possible that it has a character type. On the other hand, the method feature_class_prob is used to test your code, and it is inconvenient to pass the data in this way (you just want to pass x as a float). The following code can address this situation. It queries the datatype of x (the variable value at which to estimate the probability) and, if it is a numpy str_, converts it to a float; otherwise, it just copies it to xfloat.
if isinstance(x, np.str_):
    xfloat = x.astype(float)  # convert the numpy string scalar to a float
else:
    xfloat = x
Variances of zero
When training your classifier, it is possible that every training example (for a given class) has the same value for a feature; this happens frequently when dealing with high-dimensional data. This causes an issue when estimating the distribution of a continuous feature with a Gaussian, since the variance is zero and the denominator of the first term of the Gaussian density is therefore zero.
If you encounter this situation, you may omit this feature from consideration when predicting the probability.
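For reference, here is a minimal sketch of the Gaussian density estimate with a zero-variance guard; the function name gaussian_prob and the convention of returning None for an omitted feature are illustrative choices, not part of the required interface.
import numpy as np

def gaussian_prob(x, mu, sigma):
    # Sketch: Gaussian density for x given mean mu and standard deviation sigma.
    # Returns None when sigma is zero so the caller can omit this feature.
    if sigma == 0:
        return None  # all training values were identical; skip this feature
    coeff = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))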
Smoothing with Continuous Values
When working with continuous features, theoretically any value should have a probability greater than zero. For THIS project, if your estimate returns a probability of zero for a continuous feature, you can assign 10e-9 as the probability.
For discrete/categorical features, Laplace smoothing (when requested) addresses this; in other words, do not use the 10e-9 floor for discrete features.
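For reference, add-one (Laplace) smoothing estimates P(feature = v | class = c) as (count of v within class c + 1) / (number of training examples in class c + K), where K is the number of distinct values the feature can take. A minimal sketch, with illustrative parameter names:
def laplace_estimate(value_count, class_count, num_values):
    # Sketch: add-one smoothed estimate of P(feature = v | class = c).
    #   value_count: training examples in class c where the feature equals v
    #   class_count: total training examples in class c
    #   num_values:  number of distinct values the feature can take (K)
    return (value_count + 1) / (class_count + num_values)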
Log Likelihood and Challenges with Smoothing
As discussed in class, using the log likelihood of the probabilities is common in many implementations of Naive Bayes because it prevents numeric underflow.
An issue arises with zero probabilities (a potential issue with the discrete, non-smoothed examples from the book) since log(0) is undefined. In numpy, log(0) produces a runtime warning and evaluates to -inf, and subsequent arithmetic can then yield NaN; you can treat any computation resulting in NaN as having a zero probability. You may need to check for this with if statements for now (I am working on a better approach).
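One possible way to accumulate the log probabilities while handling this case is sketched below; the helper name and the decision to map NaN to -inf are illustrative, not a required interface.
import numpy as np

def total_log_prob(probs):
    # Sketch: sum the logs of the per-feature probabilities, treating a NaN
    # result as a zero-probability outcome.
    with np.errstate(divide='ignore'):  # silence the warning from log(0)
        total = np.sum(np.log(probs))
    if np.isnan(total):
        return -np.inf                  # treat as zero probability
    return total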
Testing your Implementation
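Run the provided unit tests (nb_classifier_test.py from the resources table above) as you develop. For example, assuming the test file is pytest-compatible, running pytest nb_classifier_test.py from the project directory should exercise your classifier; the same tests, plus a few hidden ones, are used for grading on Autolab.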
Part 2: MNIST Dataset
You will apply your Naive Bayes classifier to the MNIST database of handwritten digits. The training and test datasets are provided in the resources section. Enclose all of the following analysis and plots in a Jupyter notebook.
Build Model
Build your model and compute the accuracy. You will need to create an X matrix; the image data provided is 60k x 28 x 28. Use numpy's reshape function to adjust this matrix so that it is 60k x 784.
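For example, assuming the raw training images have been loaded into a numpy array named train_images of shape (60000, 28, 28) (the variable name is illustrative):
X = train_images.reshape(-1, 28 * 28)  # now 60000 x 784, one flattened image per row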
Use a confusion matrix (10 x 10) to illustrate your results (you can use sklearn to build the confusion matrix). Include 2 to 3 sentences on the quality of your results.
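One way to build the confusion matrix with sklearn, assuming y_test holds the true labels and y_pred holds your classifier's predictions (both names are placeholders):
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)  # 10 x 10: rows are true digits, columns are predicted digits
print(cm)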
Visualize your Distributions
Being able to visualize your model to verify that it is built correctly is an important component of ML that is often omitted (sometimes with very bad consequences). In this setting, your model consists of 784 dimensions, so this could be challenging.
Fortunately, you recall that each of the features is derived from a pixel. For each class label/digit, build a heatmap where each pixel in the heatmap is the expected (μ) value for that pixel given the label you are plotting. Thus, you should end up with 10 heatmaps.
If you place your μ values in a length-784 numpy array, you can use reshape to make this a 2d array of 28 by 28 (the original dimensions of the image). You can then use plt.imshow to render the heatmap and plt.savefig to save it as a pdf or png file. The image shown here was created from my distributions for the label corresponding to the digit 3.
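A rough sketch of that plotting step, assuming mu_for_label is a length-784 numpy array holding the per-pixel means for one digit (the variable and file names are illustrative):
import matplotlib.pyplot as plt

img = mu_for_label.reshape(28, 28)  # back to the original image dimensions
plt.imshow(img)                     # render the means as a heatmap
plt.title('Mean pixel values for digit 3')
plt.savefig('digit_3_means.png')    # a pdf extension also works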
Submission
This PA has the following required submissions:

| Deliverable | Location to Submit | Description |
| --- | --- | --- |
| nb_classifier.py | Autolab | Graded using the test file provided plus a few hidden tests |
| nb_classifier.py | Canvas | |
| MNistBayesAnalysis.ipynb notebook | Canvas | Your notebook can expect that the data files and your nb_classifier.py file are located in the same directory. Your notebook MUST include the plots already rendered (in other words, nothing should need to be run). |
Grading
Grades will be calculated according to the following distribution.

| Criterion | Percentage | Description |
| --- | --- | --- |
| Overall Readability/Style | 10% | Your code should follow PEP8 conventions. It should be well documented and well organized. |
| Part 1: Reference Tests | 40% | nb_classifier_test.py |
| Part 1: Efficiency Contest | 10% | The main concern is clarity and correctness. That said, your implementation must be somewhat efficient. To test this, we will have an Autolab area just for testing the fit performance of your classifier. |
| Part 1: Hidden Tests | 20% | These tests are run on Autolab, so you can keep submitting until you pass these as well (but the contents of the tests remain hidden). |
| Part 2: Digit Recognition | 20% | Notebook illustrating your classifier in action on the MNIST digit dataset. |