Assume that there is a set of data points $D=\{x_1,x_2,x_3,...,x_N\}$, $x_i\in \mathbb{R}^2$. The data form a cluster in the two-dimensional Euclidean plane.
How can the data be reduced from two dimensions to one while preserving as much of the original information as possible? Imagine a line in the two-dimensional Euclidean plane: this line is the direction along which the data will be preserved. To reduce the dimension, every data point is projected onto this line; after projection, all data points lie on the line, i.e. the data become one-dimensional. However, there are uncountably many lines in a two-dimensional plane, so which line, or which direction, best describes the original data still needs to be determined.
In statistics, the variance of a population describes the variability of the observations, i.e. how much the data deviate around their mean. PCA attempts to find a low-dimensional latent space in which the "majority" of the variance is preserved; in other words, it maximizes the variance of the low-dimensional latent representation.
The variance of the latent representation $\{y_1,y_2,y_3,...,y_N\}$ is:
\[\sigma_y^2=\frac{1}{N}\sum_{i=1}^N(y_i-u_y)^2,\qquad u_y=\frac{1}{N}\sum_{i=1}^Ny_i\]
The aim is to find:
$\{y_1^o,y_2^o,...,y_N^o\}=\underset{y_1,y_2,...,y_N}{\arg\max}\quad\sigma_y^2$
This can be done via a linear projection, where $x_i$ is a column vector and $w^T$ is a row vector:
$y_i=w^Tx_i$
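As a quick numerical check (a minimal NumPy sketch; the data and the two candidate directions below are made up purely for illustration), projecting the same 2D points onto different unit directions yields projections with very different variances:

```python
import numpy as np

# Hypothetical 2D data: points scattered mostly along the x = y diagonal.
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 1)) @ np.array([[1.0, 1.0]]) + 0.1 * rng.normal(size=(100, 2))

def projected_variance(pts, w):
    """Variance of y_i = w^T x_i for a unit direction w."""
    w = w / np.linalg.norm(w)          # enforce ||w|| = 1
    y = pts @ w                        # y_i = w^T x_i for every point
    return y.var()                     # population variance (1/N normalization)

w_diag = np.array([1.0, 1.0])          # along the data's main spread
w_perp = np.array([1.0, -1.0])         # perpendicular to it
print(projected_variance(points, w_diag))   # large: most variance preserved
print(projected_variance(points, w_perp))   # small: most variance lost
```

The direction aligned with the data's spread keeps almost all of the variance, while the perpendicular one discards most of it.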
That is to say, the projection vector $w$ needs to be determined.
Hence,
\[
\begin{align*}
w_o&=\underset{w}{\arg\max}\quad\sigma_y^2\\
&=\underset{w}{\arg\max}\quad\frac{1}{N}\sum_{i=1}^N(w^Tx_i-w^Tu_x)^2\\
&=\underset{w}{\arg\max}\quad\frac{1}{N}\sum_{i=1}^Nw^T(x_i-u_x)(x_i-u_x)^Tw\\
&=\underset{w}{\arg\max}\quad w^T(\frac{1}{N}\sum_{i=1}^N(x_i-u_x)(x_i-u_x)^T)w\\
&=\underset{w}{\arg\max}\quad w^TS_tw\\
\end{align*}
\]
where:
$S_t=\frac{1}{N}XX^T$
$X=[x_1-u_x,x_2-u_x,...,x_N-u_x]$
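As a concrete sketch (NumPy; the function and variable names are mine), $S_t$ can be computed exactly as defined above, and it coincides with the biased sample covariance matrix:

```python
import numpy as np

def total_scatter(points):
    """S_t = (1/N) * X X^T, where X stacks the centered points as columns."""
    u = points.mean(axis=0)            # mean vector u_x
    X = (points - u).T                 # columns are x_i - u_x, shape (d, N)
    N = points.shape[0]
    return X @ X.T / N                 # d x d matrix

pts = np.random.default_rng(1).normal(size=(200, 2))
S_t = total_scatter(pts)
# Same as the biased sample covariance (divides by N rather than N - 1):
assert np.allclose(S_t, np.cov(pts.T, bias=True))
```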
Furthermore, $S_t$ is a positive semi-definite matrix.
Proof:
Let $\lambda$ be an eigenvalue of $XX^T$ and $v$ a corresponding eigenvector:
\[
\begin{align*}
XX^Tv&=\lambda v\\
v^TXX^Tv&=\lambda v^Tv\\
(X^Tv)^T(X^Tv)&=\lambda v^Tv\\
||X^Tv||^2&=\lambda ||v||^2\\
\Rightarrow\quad\lambda&\geqslant0
\end{align*}
\]
Since the symmetric matrix $XX^T$ has only non-negative eigenvalues, $S_t=\frac{1}{N}XX^T$ is positive semi-definite. Q.E.D.
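This non-negativity is easy to confirm numerically (a sketch with a randomly generated centered data matrix; `eigvalsh` is used because $XX^T$ is symmetric):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2, 50))           # columns play the role of x_i - u_x
eigenvalues = np.linalg.eigvalsh(X @ X.T)
# Every eigenvalue of X X^T is >= 0 (up to floating-point error),
# confirming that S_t = (1/N) X X^T is positive semi-definite.
print(eigenvalues)
assert np.all(eigenvalues >= -1e-10)
```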
Since $S_t$ is positive semi-definite, the quadratic form $w^TS_tw$ is non-negative for every $w$. Now, the problem is to compute its maximum:
\[w_o=\underset{w}{\arg\max}\quad w^TS_tw\]
However, without any constraint this problem only has a trivial "solution": $w^TS_tw$ can be made arbitrarily large by scaling $w$, i.e. by letting $||w||\to\infty$. This is avoided by adding an extra constraint that fixes the magnitude of $w$: $||w||^2=w^Tw=1$.
\[w_o=\underset{w}{\arg\max}\quad w^TS_tw\\
s.t.\quad w^Tw=1\]
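Before solving this constrained problem analytically, it can be checked by brute force in 2D: sweep unit vectors $w(\theta)=(\cos\theta,\sin\theta)^T$ over the half circle and evaluate $w^TS_tw$ for each (a sketch; the example matrix and grid resolution are arbitrary choices of mine):

```python
import numpy as np

S_t = np.array([[3.0, 1.0],
                [1.0, 2.0]])           # an example symmetric PSD matrix

thetas = np.linspace(0.0, np.pi, 1000)                     # unit vectors up to sign
ws = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)    # each row has ||w|| = 1
values = np.einsum('ij,jk,ik->i', ws, S_t, ws)             # w^T S_t w for each w

best_w = ws[np.argmax(values)]
print(values.max())                    # close to the largest eigenvalue of S_t
print(best_w)                          # points along the corresponding eigenvector
```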
Formulate the Lagrangian:
\[
\begin{align*}
\mathcal{L}(w,\lambda)&=w^TS_tw -\lambda (w^Tw-1)\\
\frac{\partial \mathcal{L}(w,\lambda)}{\partial w}&=2S_tw - 2\lambda w\\
\end{align*}
\]
Let $\frac{\partial \mathcal{L}(w,\lambda)}{\partial w}=0$
\[\Rightarrow\quad S_tw=\lambda w\]
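This stationarity condition can be verified numerically with an eigendecomposition (a sketch; `numpy.linalg.eigh` returns the eigenvalues of a symmetric matrix in ascending order, with unit-norm eigenvectors as columns):

```python
import numpy as np

S_t = np.array([[3.0, 1.0],
                [1.0, 2.0]])                      # same example matrix as above
eigenvalues, eigenvectors = np.linalg.eigh(S_t)   # ascending eigenvalues
for lam, w in zip(eigenvalues, eigenvectors.T):   # columns are unit eigenvectors
    assert np.allclose(S_t @ w, lam * w)          # S_t w = lambda w
    assert np.isclose(w @ S_t @ w, lam)           # objective value w^T S_t w = lambda
```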
Substitute $S_tw=\lambda w$ back into the optimization objective:
\[
\begin{align*}
w_o&=\underset{w}{\arg\max}\quad w^TS_tw\\
&=\underset{w}{\arg\max}\quad \lambda w^Tw\\
&=\underset{w}{\arg\max}\quad \lambda
\end{align*}
\]
This means that to maximize $w^TS_tw$, one needs to choose the largest eigenvalue $\lambda$ of $S_t$. Therefore, the optimal projection vector $w_o$ is the eigenvector corresponding to the largest eigenvalue of $S_t$.
Done.
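Putting everything together, below is a minimal end-to-end sketch of PCA to one dimension on hypothetical data (the names and data are mine): the eigenvector of the largest eigenvalue of $S_t$ is the projection direction, and the variance of the projected data equals that eigenvalue.

```python
import numpy as np

def pca_1d(points):
    """Project 2D points onto the top eigenvector of S_t (one principal component)."""
    u = points.mean(axis=0)
    X = (points - u).T                           # centered data, columns x_i - u_x
    S_t = X @ X.T / points.shape[0]
    eigenvalues, eigenvectors = np.linalg.eigh(S_t)
    w = eigenvectors[:, -1]                      # eigenvector of the largest eigenvalue
    y = w @ X                                    # y_i = w^T (x_i - u_x)
    return y, w, eigenvalues[-1]

rng = np.random.default_rng(3)
pts = rng.normal(size=(500, 1)) @ np.array([[2.0, 1.0]]) + 0.2 * rng.normal(size=(500, 2))
y, w, lam_max = pca_1d(pts)
assert np.isclose(y.var(), lam_max)              # preserved variance = largest eigenvalue
```

Using `eigh` rather than `eig` exploits the symmetry of $S_t$, so the eigenvalues come back real and sorted, which makes picking the largest one straightforward.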