# Larger Scale PCA Example and the Limits of PCA

The previous notebook discussed the mechanics of PCA by looking at a two-dimensional dataset.  PCA is generally more valuable when applied to high-dimensional data.  This notebook will illustrate the use of PCA to reduce the dimensionality of image data.


## PCA on Digit Images

In [None]:
#%matplotlib notebook

import os
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import sklearn.datasets


def load_digits():
    digits, target = sklearn.datasets.load_digits(return_X_y=True)
    
    # Rescale and convert to floats...
    result_images = np.array(digits / 16.0, dtype=np.float32)
    result_labels = np.array(target, dtype=np.float32)

    return result_images, result_labels


### Load the image dataset:

In [None]:
X, labels = load_digits()
print(X.shape)

## Scree Plots

Execute the cell below to visualize the amount of variance captured by each of the principal components for the digit data set.  Around 90% of the variance is captured by the first 20 principal components. 

In [None]:
pca = PCA()
pca.fit(X)

plt.plot(pca.explained_variance_ratio_, '.-')
plt.title('PC Variance')
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_), '.-')
plt.title('Cumulative PC Variance')
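
The cumulative curve above can also be read off numerically. The cell below is a minimal sketch, not part of the original notebook: it reloads the digits with the same rescaling as `load_digits()` and finds the smallest number of components whose cumulative explained variance reaches a target. The names `X_digits`, `pca_full`, `threshold`, and `n_needed` are introduced here for illustration only, and the 0.90 target is an assumption chosen to match the text.

```python
import numpy as np
import sklearn.datasets
from sklearn.decomposition import PCA

# Same rescaling as load_digits() above: pixel values go from 0..16 to 0..1.
digits, _ = sklearn.datasets.load_digits(return_X_y=True)
X_digits = digits / 16.0

pca_full = PCA().fit(X_digits)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# searchsorted returns the index of the first entry >= threshold;
# adding one turns that index into a component count.
threshold = 0.90  # illustrative target, not a value fixed by the notebook
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed, cumulative[n_needed - 1])
```

Changing `threshold` shows how quickly the component count grows as more of the variance is retained.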

## Visualizing the First Two Principal Components

The figure below shows all of the 0's, 1's and 2's in the data set projected onto the first two principal components. 

In [None]:
Z = pca.transform(X)
# Plot the 0's, 1's and 2's in red, green and blue respectively.
plt.plot(Z[labels==0, 0], Z[labels==0, 1],'.r', alpha=.6)
plt.plot(Z[labels==1, 0], Z[labels==1, 1],'.g', alpha=.6)
plt.plot(Z[labels==2, 0], Z[labels==2, 1],'.b', alpha=.6)
plt.show()

## PCA For Image Compression
The cells below use PCA to transform the 64-dimensional digit image data (8x8 images from `sklearn.datasets.load_digits`, not the larger MNIST images) down to 10 dimensions. The data is then projected back into the original 64-dimensional space to show an example of a reconstructed image.  Try increasing or decreasing the dimensionality of the compressed images to see the impact on the resulting image quality. 

In [None]:
pca = PCA(n_components=10)
pca.fit(X)
Z10 = pca.transform(X)
X_back = pca.inverse_transform(Z10)

# Reshape for display as images:
X_back = X_back.reshape(-1, 8, 8)
X_img = X.reshape(-1, 8, 8)

In [None]:
# Display the uncompressed and de-compressed versions of an image.

which_image = 50
plt.figure(figsize = (1,1))
plt.imshow(X_img[which_image, :, :],cmap='gray')
plt.figure(figsize = (1,1))
plt.imshow(X_back[which_image, : ,:],cmap='gray')
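
Image quality can also be tracked with a number instead of by eye. The cell below is a rough sketch, not part of the original notebook: it measures the mean squared reconstruction error for several choices of `n_components`. The list of sizes is arbitrary, and the names `X_digits`, `pca_k`, `X_rec`, and `errors` are introduced here for illustration. `svd_solver="full"` is set only to make the results exact and repeatable.

```python
import numpy as np
import sklearn.datasets
from sklearn.decomposition import PCA

# Same rescaling as load_digits() above.
digits, _ = sklearn.datasets.load_digits(return_X_y=True)
X_digits = digits / 16.0

errors = {}
for k in (2, 5, 10, 20, 40, 64):  # arbitrary sizes chosen for illustration
    pca_k = PCA(n_components=k, svd_solver="full").fit(X_digits)
    # Compress to k dimensions, then project back to 64 dimensions.
    X_rec = pca_k.inverse_transform(pca_k.transform(X_digits))
    # Mean squared error per pixel between original and reconstruction.
    errors[k] = float(np.mean((X_digits - X_rec) ** 2))
    print(k, errors[k])
```

The error should shrink as components are added and fall to essentially zero when all 64 are kept, since at that point nothing is discarded.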

## Where Does PCA Fall Short?

Consider the following dataset.  Take a second to predict what will happen when we use PCA to project this to one dimension.

In [None]:
X1 = np.random.multivariate_normal([-5, 3], [[1.5, 3], [3, 6.5]], 200)
X2 = np.random.multivariate_normal([-6.5, 5], [[1.5, 3], [3, 6.5]], 200)
X = np.vstack((X1, X2))
np.random.shuffle(X)
plt.plot(X[:, 0], X[:, 1], '.')
plt.axis('equal')
plt.show()

In [None]:
pca = PCA(n_components=1)
pca.fit(X)
Z1 = pca.transform(X)
X_back = pca.inverse_transform(Z1)
plt.plot(X_back[:, 0], X_back[:, 1], '.')
plt.axis('equal')
plt.show()

Moral of the story: The directions of maximum variance aren't always the most meaningful dimensions.
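
One way to put a number on this is to keep track of which cluster each point came from and compare how well the two clusters separate along each principal component. The cell below is a sketch under assumptions: the fixed seed, the `is_b` mask, and the `separation` helper are all introduced here for illustration and do not appear in the original cells, which discard the cluster membership.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # fixed seed, an assumption for repeatability
cov = [[1.5, 3], [3, 6.5]]
A = rng.multivariate_normal([-5, 3], cov, 200)
B = rng.multivariate_normal([-6.5, 5], cov, 200)
X_two = np.vstack((A, B))
is_b = np.arange(len(X_two)) >= len(A)  # True for points drawn from B

# Keep both components so the two directions can be compared.
Z_two = PCA(n_components=2).fit_transform(X_two)


def separation(z, mask):
    """Gap between the group means in units of the pooled within-group std."""
    gap = abs(z[mask].mean() - z[~mask].mean())
    pooled_std = np.sqrt((z[mask].var() + z[~mask].var()) / 2)
    return gap / pooled_std


sep_pc1 = separation(Z_two[:, 0], is_b)
sep_pc2 = separation(Z_two[:, 1], is_b)
print(sep_pc1, sep_pc2)
```

If the second value is much larger than the first, the component that PCA discards is the one that actually distinguishes the two clusters.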

## Nonlinear Structure

Execute the cell below to visualize some additional data with structure that is not well captured by PCA.

In [None]:
moons = sklearn.datasets.make_moons(n_samples=1000, noise=.05)[0]
circles = sklearn.datasets.make_circles(n_samples=1000, factor=.5,
                                noise=.05)[0]

plt.plot(moons[:, 0], moons[:, 1], '.')
plt.figure()

plt.plot(circles[:, 0], circles[:, 1], '.')

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
X, color = sklearn.datasets.make_swiss_roll(n_samples=1500)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=color, cmap=plt.cm.Spectral)

## PCA is a linear method

Linear methods *can* rotate, shear, scale, and project, but they *can't* unroll.

For that we need non-linear dimensionality reduction techniques:

https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.10-Manifold-Learning.ipynb
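
The linearity of PCA can be checked directly. The cell below is a minimal sketch, assuming the default `whiten=False`: it reproduces `transform` with nothing more than a mean subtraction and a matrix multiplication using the fitted `mean_` and `components_` attributes. The names `X_roll`, `pca_roll`, `Z_sklearn`, and `Z_manual`, and the `random_state` value, are introduced here for illustration.

```python
import numpy as np
import sklearn.datasets
from sklearn.decomposition import PCA

X_roll, _ = sklearn.datasets.make_swiss_roll(n_samples=1500, random_state=0)
pca_roll = PCA(n_components=2).fit(X_roll)

# Projection as computed by scikit-learn.
Z_sklearn = pca_roll.transform(X_roll)

# The same projection written out by hand: center the data, then
# multiply by the transposed matrix of principal directions.
Z_manual = (X_roll - pca_roll.mean_) @ pca_roll.components_.T

max_diff = float(np.max(np.abs(Z_sklearn - Z_manual)))
print(max_diff)
```

Because the whole mapping is a shift followed by a matrix product, it can only flatten the swiss roll onto a plane, not unroll it.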
