CS 445 Machine Learning
Fall 2020
Image Compression via Clustering
The learning objectives for this assignment are:- utilize *K-Means* to perform image compression
- utilize the *silhouette* measure to evaluate cluster quality
- utilize entropy to perforn an external measure of cluster quality
Resources
Introduction
K-Means is arguably the most utilized clustering algorithin in data science. In this assignment, you will utilize K-Means in a somewhat unconventional setting, which is to perform image compression.Clustering Colors
Images are represented with bitmaps, where each pixel has 3 integer values (red, green, blue). The domain of each integer is 0-255. While much faster algorithms exist for performing image compression, in this assignment you will utilize *K-Means* to do this. The main concept here is that you will be projecting each pixel's color coordinates into a 3-D space. This embedding projects colors that are *similar* to each other as neighbors within this space. The goal of this assignment is reduce the number of colors that are utilized in an image. This will be accomplished by clustering the color points and then replacing the colors in the image with the *centroid* from their assigned cluster.Compress Color space
Utilize k-means (you can use the Scikit-learn methods) on 3 images (2 of mine and 1 of your own (call this image_3). For your own image, make sure the image size is not too large (you can reduce the number of pixels in your image by using Mac's Preview or by using MS Window's Paint (although I have not used this utility). You can download this starter code which will accepts an image file path and a value of K on the command line. The starter code reads the image into an array and reshapes it for you. For image_1 and image_3, perform the following:- Perform clustering with the following values of k: 2, 5, 20. Make sure your k-means object sets the init parameter to random and sets n_init to 1. Replace the colors in the picture with the cluster centroid values and save the image.
- For each image produced above (the 3 plus the original), plot these images in a PDF for submission. Clearly label which k value is used for each image.
- Run K-Means 10 times for k = 15. Plot the RSS value (inertia) for each of these runs. Briefly explain why these values vary from run to run.
- When k=2, are the original colors found in the image? Explain what happen in this case.
- Do you think this picture was created via a photograph or computer generated. Justify your answer with 1 or 2 sentences.
Computing the Silhouette
Information on the silhouette can be found in the class slides, Wikipedia, or the textbook (section 7.5.2, the last subsection entitled The Silhouette Coefficient). Using your own code (and not scikit-learn), compute the silhouette coefficient. Since this process involves a lot of distance computations, you can compute the distance matrix using the pairwise_distances method in the sklearns.metric module. Here is a short example:from sklearn.metrics import pairwise_distances
dist = pairwise_distances(X, metric='euclidean')
Use numpy functions for computing the min and mean values
to speed up your code.
Because the distance matrix can be quite large for high resolution
photos, you only need to compute the silhouette on the
PennySmallerPhoto.jpg photo.
For this section, reduce the colors in
the PennySmallerPhoto.jpg photo
as before using k=2, k=5, k=10, and k=25. For each clustsering,
compute the mean silhouette coefficient for each value of
k and construct either a table and plot of these values (k on the
x-axis, mean silhouette coefficient on the y axis).
Rubrics
Item> | Description | Points |
RSS Plot | Report contains RSS Plot for image_1 and image_3 and comments on the variance between the 10 runs of K-means. | 25 |
Image Results | The original image and the resulting image from the color compression for each value of k are shown for all 3 images (the two I supplied and yours). | 20 |
Image 2 Questions | The two questions are answered in the PDF | 15 |
Silhouette Code | Code computes the mean silhouette correctly. PDF contains images the color reduced output from the PennySmallerPhoto.jpg photo and the mean silhouette coefficients in a table and a plot. | 40 |