Principal component analysis: Cosmos — All that is, or was, or ever will be

Principal component analysis

cosmos 25th August 2017 at 12:02pm

aka PCA

Given $\{ x^{(1)}, ..., x^{(m)}\}$ where $x^{(i)} \in \mathbb{R}^n$

Reduce it to a $k$-dimensional data set ($k < n$, often $k \ll n$), so that the dimensions we retain reconstruct the original data as well as possible.

Examples, oxford notes

Algorithm

Summary of algorithm

Pre-processing of the data

vid

Zero-out mean
1. Set $\mu =\frac{1}{m}\sum _{i=1}^mx^{\left(i\right)}$
2. Replace $x^{(i)}$ with $x^{(i)} -\mu$
Normalize to unit variance
1. Set $\sigma _j^2=\frac{1}{m}\sum _{i=1}^m\left(x_j^{\left(i\right)}\right)^2$
2. Replace $x^{(i)}_j$ with $\frac{x^{(i)}_j}{\sigma_j}$
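
The two pre-processing steps can be sketched in NumPy as follows. This is a minimal sketch, not a reference implementation: the function name `preprocess`, the one-sample-per-row layout of `X`, and the choice to leave constant features unscaled are all assumptions made for illustration.

```python
import numpy as np

def preprocess(X):
    """Center each feature and scale it to unit variance.

    X is an (m, n) array: m samples in rows, n features in columns
    (an assumed layout, not something the notes prescribe).
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)                      # step 1: per-feature mean
    Xc = X - mu                              # step 2: subtract the mean
    sigma = np.sqrt((Xc ** 2).mean(axis=0))  # per-feature std of the centered data
    # Assumption: a constant feature (sigma == 0) is left unscaled to avoid dividing by zero.
    sigma_safe = np.where(sigma > 0, sigma, 1.0)
    return Xc / sigma_safe, mu, sigma

# Small made-up data set: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
Z, mu, sigma = preprocess(X)
```

After this, every column of `Z` has zero mean and unit variance, so features measured on different scales contribute comparably.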

Finding principal components

Example and intuition: we want to find the direction such that, when we project the data onto the line pointing in that direction, the variance of the projected data is as high as possible. (Note: if $||u||=1$, the projection of $x^{(i)}$ onto $u$ has length $(x^{(i)})^T u$.) This also minimizes the variance perpendicular to that line.

Choose $u$ s.t.:

$\max\limits_{u:||u||=1} \frac{1}{m}\sum\limits_{i=1}^m ((x^{(i)})^Tu)^2$

$=\max\limits_{u:||u||=1} \frac{1}{m}\sum\limits_{i=1}^m (u^Tx^{(i)})((x^{(i)})^Tu)$

$=\max\limits_{u:||u||=1} u^T \left [\frac{1}{m}\sum\limits_{i=1}^m x^{(i)}(x^{(i)})^T \right] u$

This implies that $u$ is the principal eigenvector of the covariance matrix:

$\mathbf{\Sigma} = \frac{1}{m}\sum\limits_{i=1}^m x^{(i)}(x^{(i)})^T$
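
The step from the maximization to the eigenvector statement can be filled in with a Lagrange multiplier $\lambda$ for the constraint $u^T u = 1$ (a standard argument, sketched here rather than taken from the linked derivation):

$\mathcal{L}(u, \lambda) = u^T \mathbf{\Sigma} u - \lambda (u^T u - 1)$

$\nabla_u \mathcal{L} = 2\mathbf{\Sigma} u - 2\lambda u = 0 \;\Rightarrow\; \mathbf{\Sigma} u = \lambda u$

So any stationary $u$ is an eigenvector of $\mathbf{\Sigma}$, and the objective there equals $u^T \mathbf{\Sigma} u = \lambda u^T u = \lambda$, which is largest for the eigenvector with the largest eigenvalue.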

See here for a nice derivation.

More generally, to project the data onto a $k$-dimensional subspace, choose the $k$ eigenvectors with the largest eigenvalues.

We can then transform the data to the new subspace by projecting each point onto this new basis, which gives a lower-dimensional representation of the data.
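
As a rough illustration of the eigenvector route, the sketch below builds the covariance matrix, keeps the $k$ eigenvectors with the largest eigenvalues, and projects the data onto them. The function name `pca_eig` and the synthetic data are assumptions for the example. `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_eig(X, k):
    """Top-k principal directions of already-centered data X of shape (m, n)."""
    m = X.shape[0]
    cov = X.T @ X / m                        # Sigma = (1/m) * sum_i x_i x_i^T
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    return eigvecs[:, order], eigvals[order]

# Synthetic data (made up for illustration): most variance lies along the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
X = X - X.mean(axis=0)

U, lam = pca_eig(X, k=2)   # U has shape (3, 2), one eigenvector per column
Y = X @ U                  # (200, 2) lower-dimensional representation
X_hat = Y @ U.T            # approximate reconstruction in the original space
```

The variance of the data along each column of `U` equals the corresponding eigenvalue in `lam`, which is why sorting by eigenvalue ranks directions by retained variance.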

Another view of PCA. There are several more views of PCA.

Implementation of PCA

Problem with covariance matrix

Using the Design matrix $X$ (one centered sample per row, so $X$ is $m \times n$), we can rewrite the covariance matrix as $\Sigma = \frac{1}{m} X^T X$

We can use Singular value decomposition, $X = U D V^T$, with the singular values sorted in decreasing order. The first $k$ columns of $V$ are then the top $k$ eigenvectors of $X^T X$, and hence of $\Sigma$, since the positive factor $\frac{1}{m}$ does not change the eigenvectors. If the number of samples is much smaller than their dimensionality ($m \ll n$), then $X$ is a fat matrix ($m \times n$) with far fewer entries than $\Sigma$ ($n \times n$), so it is more efficient to store in memory and to compute with.
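
A minimal sketch of the SVD route, assuming a design matrix with one sample per row. The function name `pca_svd` and the random data are illustrative only. The $n \times n$ covariance matrix is never formed inside the function.

```python
import numpy as np

def pca_svd(X, k):
    """Top-k principal directions of X (m, n) from the SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    # full_matrices=False gives the thin SVD; singular values come back in descending order.
    U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k].T                 # (n, k): first k columns of V
    variances = D[:k] ** 2 / Xc.shape[0]  # eigenvalues of Sigma = (1/m) X^T X
    return components, variances

# Fat matrix: m = 20 samples, n = 500 dimensions (synthetic data).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 500))
components, variances = pca_svd(X, k=3)
```

Here `X` holds 20 × 500 numbers, while the covariance matrix would hold 500 × 500, which is the storage saving described above.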

Applications of PCA

Video

Data visualization
Data compression
Machine learning, Feature selection
Anomaly detection
Matching/distance calculations. See here: using PCA to compare data points, which is where eigenfaces come from.
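
To make the matching idea concrete, here is a small sketch of comparing data points in the reduced space: stored points are projected onto the top principal directions, and a query is matched to the nearest stored point there. Everything in it (the synthetic `gallery`, the helper `embed`, the choice of 10 components, Euclidean distance) is an assumption for illustration, not the eigenfaces method as published.

```python
import numpy as np

rng = np.random.default_rng(2)
gallery = rng.normal(size=(50, 100))   # 50 stored points in 100 dimensions (synthetic)

# Fit PCA on the gallery via SVD and keep 10 principal directions.
mu = gallery.mean(axis=0)
_, _, Vt = np.linalg.svd(gallery - mu, full_matrices=False)
W = Vt[:10].T                          # (100, 10) projection basis

def embed(x):
    """Project one point or a batch of points into the 10-dimensional PCA space."""
    return (x - mu) @ W

codes = embed(gallery)                 # (50, 10) reduced representation of the gallery

# Query: a slightly noisy copy of stored point 7.
query = gallery[7] + 0.01 * rng.normal(size=100)
dists = np.linalg.norm(codes - embed(query), axis=1)
best = int(np.argmin(dists))           # index of the closest stored point
```

Distances are computed between 10-dimensional codes rather than 100-dimensional points, which is what makes the comparison cheaper.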

Latent semantic indexing

LSI is essentially an application of PCA to text data

Independent component analysis

PPCA (probabilistic PCA)

It has been proposed that PCA of the autocorrelation matrices of place cell activations produces Grid cells in the Spatial representation in the brain