Dimensionality Reduction with Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique.
PCA "distills" multi-dimensional data down to fewer dimensions, while preserving variance in the data as best it can.
Each pixel in a black & white image has a few:
- X position
- Y position
- brightness
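To make "preserving variance" concrete, here is a minimal toy sketch (not part of the iris walkthrough below; it assumes numpy and scikit-learn are installed, as in the Dependencies cell): 200 correlated 2-dimensional points are squeezed down to 1 dimension, and PCA reports how much of their variance the single remaining dimension keeps.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# the second column is strongly correlated with the first, plus a little noise
points = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=200)])

toyPca = PCA(n_components=1).fit(points)
print("variance preserved in 1 dimension:", toyPca.explained_variance_ratio_[0])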
PCA with the Iris dataset
This file uses the iris dataset to illustrate PCA. The iris dataset is a small collection of data with four dimensions for each of three different kinds of Iris flowers:
- petal length
- petal width
- sepal length
- sepal width
Dependencies
In [11]:
# NOTICE! the iris dataset is included in sklearn datasets
# https://scikit-learn.org/1.5/datasets/toy_dataset.html
from itertools import cycle

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# pylab bundles matplotlib's plotting interface; used for the scatter plot below
import pylab as pl
%matplotlib inline
In [12]:
iris = load_iris()
numSamples, numFeatures = iris.data.shape
print("numSamples:",numSamples)
print("numFeatures:",numFeatures)
print("target names:",list(iris.target_names))
Distill Dimensions using PCA
Here, the 4 dimensions of the data (noted above) will be "distilled" into 2 dimensions. PCA will return 2 four-dimensional "vectors" (lists, arrays), the principal components, that represent the 2 "distilled" dimensions.
In [13]:
irisData = iris.data
# PCA
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
# number of "components", dimensions, to output
pcaComponentCount = 2
# whiten=True rescales each output dimension to unit variance
pca = PCA(n_components=pcaComponentCount, whiten=True).fit(irisData)
pcaApplied = pca.transform(irisData)
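A quick way to confirm that the 4 input dimensions really were reduced to 2 is to compare the shapes before and after the projection (a small sketch, assuming the cell above has run):

# shapes before and after the PCA projection
print("input shape: ", irisData.shape)    # (150, 4)
print("output shape:", pcaApplied.shape)  # (150, 2)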
In [14]:
print("resulting 'components'")
print(pca.components_)
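Each row of pca.components_ is one of the 2 four-dimensional vectors described above: a weighting of the original features. A small sketch (assuming the cells above have run) pairs each component's weights with the feature names:

# pair each principal component's weights with the original feature names
for componentIndex, component in enumerate(pca.components_):
    weights = ", ".join(
        f"{name}: {weight:.3f}"
        for name, weight in zip(iris.feature_names, component)
    )
    print(f"component {componentIndex}: {weights}")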
PCA Comprehension: variance
Let's see how much information we've managed to preserve:
In [15]:
print("how much variance is preserved in each of the output components/dimensions:",pca.explained_variance_ratio_)
print("how much variance is preserved in total: ",sum(pca.explained_variance_ratio_))
PCA has chosen the resulting two dimensions well enough that:
- a single dimension captures roughly 92% of the variance in our data
- the 2nd dimension captures roughly 5% more variance
- in total, PCA has lost less than 3% of the variance in the input data by projecting it down to two dimensions (the sketch after this list double-checks the numbers)
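As a quick check of those numbers, here is a small sketch that fits PCA again with all 4 components, so the full split of variance across every output dimension is visible (it assumes irisData from the earlier cell):

from sklearn.decomposition import PCA

# fit with all four components to see the complete variance breakdown
pcaFull = PCA(n_components=4).fit(irisData)
print("variance ratio per component:", pcaFull.explained_variance_ratio_)
print("cumulative:                  ", pcaFull.explained_variance_ratio_.cumsum())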
Visualize the Applied PCA data
In [16]:
colors = cycle('rgb')
target_ids = range(len(iris.target_names))
print("target_ids:", list(target_ids))

pl.figure()
for idx, color, label in zip(target_ids, colors, iris.target_names):
    # plot only the points whose class matches the current label, in its own color
    pl.scatter(pcaApplied[iris.target == idx, 0],
               pcaApplied[iris.target == idx, 1],
               c=color, label=label)
pl.legend()
pl.show()
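For comparison, a more compact alternative (just a sketch, using matplotlib.pyplot directly rather than pylab) passes the integer class labels straight to the color argument and builds the legend from them:

import matplotlib.pyplot as plt

# color each point by its integer class label in a single scatter call
plt.figure()
scatter = plt.scatter(pcaApplied[:, 0], pcaApplied[:, 1], c=iris.target)
plt.legend(scatter.legend_elements()[0], iris.target_names)
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.show()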
Page Tags:
python
data-science
jupyter
learning
numpy