CS 445 Machine Learning

Fall 2020

Image Compression via Clustering

The learning objectives for this assignment are:
  • utilize *K-Means* to perform image compression
  • utilize the *silhouette* measure to evaluate cluster quality
  • utilize entropy to perforn an external measure of cluster quality

Resources

Introduction

K-Means is arguably the most utilized clustering algorithin in data science. In this assignment, you will utilize K-Means in a somewhat unconventional setting, which is to perform image compression.

Clustering Colors

Images are represented with bitmaps, where each pixel has 3 integer values (red, green, blue). The domain of each integer is 0-255. While much faster algorithms exist for performing image compression, in this assignment you will utilize *K-Means* to do this. The main concept here is that you will be projecting each pixel's color coordinates into a 3-D space. This embedding projects colors that are *similar* to each other as neighbors within this space. The goal of this assignment is reduce the number of colors that are utilized in an image. This will be accomplished by clustering the color points and then replacing the colors in the image with the *centroid* from their assigned cluster.

Compress Color space

Utilize k-means (you can use the Scikit-learn methods) on 3 images (2 of mine and 1 of your own (call this image_3). For your own image, make sure the image size is not too large (you can reduce the number of pixels in your image by using Mac's Preview or by using MS Window's Paint (although I have not used this utility). You can download this starter code which will accepts an image file path and a value of K on the command line. The starter code reads the image into an array and reshapes it for you. For image_1 and image_3, perform the following:
  • Perform clustering with the following values of k: 2, 5, 20. Make sure your k-means object sets the init parameter to random and sets n_init to 1. Replace the colors in the picture with the cluster centroid values and save the image.
  • For each image produced above (the 3 plus the original), plot these images in a PDF for submission. Clearly label which k value is used for each image.
  • Run K-Means 10 times for k = 15. Plot the RSS value (inertia) for each of these runs. Briefly explain why these values vary from run to run.
For image_2, perform clustering with k=2 and k=3. Include the PDF these 3 images (original,results form k=2 and results from k=3). Answer the following question(s) in your PDF writeup:
  1. When k=2, are the original colors found in the image? Explain what happen in this case.
  2. Do you think this picture was created via a photograph or computer generated. Justify your answer with 1 or 2 sentences.

Computing the Silhouette

Information on the silhouette can be found in the class slides, Wikipedia, or the textbook (section 7.5.2, the last subsection entitled The Silhouette Coefficient). Using your own code (and not scikit-learn), compute the silhouette coefficient. Since this process involves a lot of distance computations, you can compute the distance matrix using the pairwise_distances method in the sklearns.metric module. Here is a short example:
from sklearn.metrics import pairwise_distances
                dist = pairwise_distances(X, metric='euclidean')
Use numpy functions for computing the min and mean values to speed up your code. Because the distance matrix can be quite large for high resolution photos, you only need to compute the silhouette on the PennySmallerPhoto.jpg photo. For this section, reduce the colors in the PennySmallerPhoto.jpg photo as before using k=2, k=5, k=10, and k=25. For each clustsering, compute the mean silhouette coefficient for each value of k and construct either a table and plot of these values (k on the x-axis, mean silhouette coefficient on the y axis).

Rubrics

Item> Description Points
RSS Plot Report contains RSS Plot for image_1 and image_3 and comments on the variance between the 10 runs of K-means. 25
Image Results The original image and the resulting image from the color compression for each value of k are shown for all 3 images (the two I supplied and yours). 20
Image 2 Questions The two questions are answered in the PDF 15
Silhouette Code Code computes the mean silhouette correctly. PDF contains images the color reduced output from the PennySmallerPhoto.jpg photo and the mean silhouette coefficients in a table and a plot. 40