Dimensionality Reduction with Principal Component Analysis

PCA (Principal Component Analysis) is a dimensionality reduction technique.
PCA "distills" multi-dimensional data down to fewer dimensions while preserving as much of the variance in the original data as possible (a minimal sketch of the idea follows the pixel example below).

For example, each pixel in a black & white image can be described by a few dimensions:

  • X position
  • Y position
  • brightness
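
To make the "distilling" idea concrete, here is a minimal sketch of what PCA does under the hood: center the data, then take the directions of greatest variance (via an SVD). The 3-column toy data here is made up purely for illustration and is separate from the iris walkthrough below.

import numpy as np

rng = np.random.default_rng(0)
# toy data: 100 samples, 3 features; the 2nd feature is roughly 2x the 1st (made up for illustration)
base = rng.normal(size=(100, 1))
data = np.hstack([base,
                  2 * base + rng.normal(scale=0.1, size=(100, 1)),
                  rng.normal(size=(100, 1))])

centered = data - data.mean(axis=0)           # PCA works on mean-centered data
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
components = Vt[:2]                           # top-2 directions of greatest variance
projected = centered @ components.T           # the "distilled" 2-dimensional representation
print(projected.shape)                        # (100, 2)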

PCA with the IRIS dataset

This notebook uses the iris dataset to illustrate PCA.

The iris dataset is a small collection of samples with four dimensions (features) for each of three different kinds of iris flowers:

  • petal length
  • petal width
  • sepal length
  • sepal width

Dependencies

In [11]:
# NOTICE! the iris dataset is included in sklearn's bundled "toy" datasets
# https://scikit-learn.org/1.5/datasets/toy_dataset.html
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# pylab bundles matplotlib's pyplot and numpy into a single namespace
import pylab as pl
from itertools import cycle
%matplotlib inline
from pylab import *   # star-import is not strictly needed below; only pl.* is used
In [12]:
iris = load_iris()

numSamples, numFeatures = iris.data.shape
print("numSamples:",numSamples)
print("numFeatures:",numFeatures)
print("target names:",list(iris.target_names))
numSamples: 150
numFeatures: 4
target names: ['setosa', 'versicolor', 'virginica']

Distill Dimensions using PCA

Here, the 4 dimensions of the data (noted above) will be "distilled" into 2 dimensions.
PCA also exposes two 4-dimensional component "vectors" (arrays), one per distilled dimension, that describe how the original features combine into each new dimension.

In [13]:
irisData = iris.data

# PCA
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
# number of "components" (output dimensions) to keep
pcaComponentCount = 2

# whiten=True rescales each output component to have unit variance
pca = PCA(n_components=pcaComponentCount, whiten=True).fit(irisData)
pcaApplied = pca.transform(irisData)
In [14]:
print("resulting 'components'")
print(pca.components_)
resulting 'components'
[[ 0.36138659 -0.08452251  0.85667061  0.3582892 ]
 [ 0.65658877  0.73016143 -0.17337266 -0.07548102]]
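
A quick shape check (using the variables from the cells above) shows that pca.components_ holds the two 4-dimensional component vectors, while pcaApplied holds the 150 samples projected onto them:

print("components_ shape:", pca.components_.shape)   # (2, 4): 2 components x 4 original features
print("transformed shape:", pcaApplied.shape)        # (150, 2): 150 samples x 2 distilled dimensions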

PCA Comprehension: Explained Variance

Let's see how much information we've managed to preserve:

In [15]:
print("how much variance is preserved in each of the output components/dimensions:",pca.explained_variance_ratio_)
print("how much variance is preserved in total: ",sum(pca.explained_variance_ratio_))
how much variance is preserved in each of the output components/dimensions: [0.92461872 0.05306648]
how much variance is preserved in total:  0.977685206318795

PCA has chosen the resulting two dimensions well enough that we've captured

  • 92% of the variance in our data in a single dimension
  • 5% more variance in the 2nd dimension
  • PCA has lost less than 3% of the variance in the input data by projecting it down to two dimensions
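
Had the target dimensionality not been fixed at two, a common way to choose the component count is to fit PCA with all components and inspect the cumulative explained variance; a minimal sketch using the same irisData:

import numpy as np

# fit with all 4 components to see how the preserved variance accumulates
pcaFull = PCA().fit(irisData)
print(np.cumsum(pcaFull.explained_variance_ratio_))   # first entry ~0.92, first two ~0.98

# or let sklearn keep however many components preserve at least 95% of the variance
pca95 = PCA(n_components=0.95).fit(irisData)
print("components kept:", pca95.n_components_)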

Visualize the Applied PCA data

In [16]:
colors = cycle('rgb')
target_ids = range(len(iris.target_names))
print("target_ids")
print(target_ids)
pl.figure()

for idx, color, label in zip(target_ids, colors, iris.target_names):
    # plot this species' samples in the 2-D PCA space, one color per species
    pl.scatter(pcaApplied[iris.target == idx, 0], pcaApplied[iris.target == idx, 1],
        c=color, label=label)
pl.legend()
pl.show()
target_ids
range(0, 3)
[output: scatter plot of the two PCA dimensions, one color per iris species]
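
To get a feel for how little was lost in the projection, the 2-D points can be mapped back into the original 4-D feature space with inverse_transform and compared against the original measurements; a small sketch using the fitted pca from above:

import numpy as np

# map the 2-D PCA representation back into the original 4 feature dimensions
reconstructed = pca.inverse_transform(pcaApplied)

# average squared difference between the original and reconstructed measurements
print("mean squared reconstruction error:", np.mean((irisData - reconstructed) ** 2))
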
Page Tags:
python
data-science
jupyter
learning
numpy