
Thursday, November 24, 2016

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a statistical method used to compress data based on class labels. Ideally, after the projection the data within each class look more like each other, while the data of different classes look more dissimilar. To achieve this, one needs to minimize the variability within each class, so that the data in each class look more similar, and to maximize the distance between classes, so that the classes are separated as much as possible.

Two Classes
For the two-class case, one needs to find a latent space such that:

$\sigma_y^2(c1)+\sigma_y^2(c2)$ is minimized
$(\mu_y(c1)-\mu_y(c2))^2$ is maximized

The two objectives above can be combined into a single optimization problem:
\[minimize\quad\frac{\sigma _y^2(c1)+\sigma_y^2(c2)}{(\mu_y(c1)-\mu_y(c2))^2 }\]
or
\[maximize\quad\frac{(\mu_y(c1)-\mu_y(c2))^2 }{\sigma _y^2(c1)+\sigma_y^2(c2)}\]

Assume that the latent space is $\{y_1,y_2,...,y_n\}$ and $y_i=w^Tx_i$.
Hence,
\[
\begin{align*}
\sigma_y^2(c1)&=\frac{1}{N_{c1}}\sum_{x_i\in c1}(y_i-\mu_y(c1))^2\\
&=\frac{1}{N_{c1}}\sum_{x_i\in c1}(w^T(x_i-\mu(c1)))^2\\
&=\frac{1}{N_{c1}}\sum_{x_i\in c1}w^T(x_i-\mu(c1))(x_i-\mu(c1))^Tw\\
&=w^T\left(\frac{1}{N_{c1}}\sum_{x_i\in c1}(x_i-\mu(c1))(x_i-\mu(c1))^T\right)w\\
&=w^TS_1w\\
\end{align*}
\]
where
\[
\begin{align*}
\mu(c1)&=\frac{1}{N_{c1}}\sum_{x_i\in c1}x_i\\
\end{align*}
\]
and
\[
\begin{align*}
\sigma_y^2(c2)&=w^TS_2w\\
\end{align*}
\]
Combining these results, one obtains:
\[
\begin{align*}
\sigma_y^2(c1)+\sigma_y^2(c2)&=w^T(S_1+S_2)w=w^TS_ww\\
\end{align*}
\]
\[
\begin{align*}
(\mu_y(c1)-\mu_y(c2))^2&=w^T(\mu(c1)-\mu(c2))(\mu(c1)-\mu(c2))^Tw=w^TS_bw\\
\end{align*}
\]
Therefore
\[
\begin{align*}
maximize\quad\frac{(\mu_y(c1)-\mu_y(c2))^2}{\sigma_y^2(c1)+\sigma_y^2(c2)}&=\frac{w^TS_bw}{w^TS_ww}\\
\end{align*}
\]
Since the ratio is invariant to the scale of $w$, one may fix the denominator and add it as a constraint:
\[max\quad w^TS_bw\quad s.t.\quad w^TS_ww=1\]
Formulate the Lagrangian:
\[
\begin{align*}
\mathcal{L}(w,\lambda)&=w^TS_bw -\lambda(w^TS_ww-1)\\
\end{align*}
\]

Let  $\frac{\partial \mathcal{L}(w,\lambda)}{\partial w}=0$
\[
\begin{align*}
&\Rightarrow\quad S_bw=\lambda S_ww\\
&\Rightarrow\quad S_w^{-1}S_bw=\lambda w
\end{align*}\]
Substituting back, $w^TS_bw=\lambda\,w^TS_ww=\lambda$, so $w$ is the eigenvector corresponding to the largest eigenvalue of $S_w^{-1}S_b$.
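As a quick illustration of this result, below is a minimal NumPy sketch of two-class LDA; the synthetic data, class means, and variable names are purely illustrative. For two classes, the same direction can also be written in closed form as $w \propto S_w^{-1}(\mu(c1)-\mu(c2))$, since $S_bw$ is always parallel to $\mu(c1)-\mu(c2)$.

```python
import numpy as np

# Hypothetical two-class data: rows are samples x_i, columns are features.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))   # class c1
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(120, 2))   # class c2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter S_w = S_1 + S_2, each S_j averaged over its class as in the text.
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
Sw = S1 + S2

# Between-class scatter S_b = (mu1 - mu2)(mu1 - mu2)^T.
diff = (mu1 - mu2).reshape(-1, 1)
Sb = diff @ diff.T

# w is the eigenvector of S_w^{-1} S_b with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = eigvecs[:, np.argmax(eigvals.real)].real

# One-dimensional latent representation y_i = w^T x_i for each class.
y1, y2 = X1 @ w, X2 @ w
```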


Multiple Classes
Multi-class problems are more common in machine learning practice. Assume that the original data are reduced to $d$ dimensions, i.e. $W=[w_1,w_2,...,w_d]$.
Hence, within classes:
\[S_w=\sum_{j=1}^CS_j=\sum_{j=1}^C\frac{1}{N_{c_j}}\sum_{x_i\in c_j}(x_i-\mu(c_j))(x_i-\mu(c_j))^T\]
Between classes:
\[S_b=\sum_{j=1}^C(\mu(c_j)-\mu)(\mu(c_j)-\mu)^T\]
where $\mu$ is the overall mean.
The optimization problem becomes:
\[max\quad tr[W^TS_bW]\quad s.t.\quad W^TS_wW=I\]
Formulate the Lagrangian:
\[
\begin{align*}
\mathcal{L}(W,\Lambda)&=tr[W^TS_bW] -tr[\Lambda(W^TS_wW-I)]\\
\end{align*}
\]

Let  $\frac{\partial \mathcal{L}(W,\Lambda)}{\partial W}=0$
\[
\begin{align*}
&\Rightarrow\quad S_bW=S_wW\Lambda\\
&\Rightarrow\quad S_w^{-1}S_bW=W\Lambda
\end{align*}\]
Therefore, the columns of $W$ are the eigenvectors corresponding to the $d$ largest eigenvalues of $S_w^{-1}S_b$.
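The multi-class case can be sketched in the same way. The function below (the function name, data layout, and label format are assumptions made for illustration) builds $S_w$ and $S_b$ from a labelled data matrix and returns the top-$d$ eigenvectors of $S_w^{-1}S_b$ as the columns of $W$.

```python
import numpy as np

def lda_projection(X, labels, d):
    """X: (N, F) data matrix, labels: (N,) class indices, d: target dimension."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    F = X.shape[1]
    mu = X.mean(axis=0)                                   # overall mean
    Sw, Sb = np.zeros((F, F)), np.zeros((F, F))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c) / len(Xc)       # within-class scatter S_j
        dmu = (mu_c - mu).reshape(-1, 1)
        Sb += dmu @ dmu.T                                 # between-class scatter
    # Columns of W are the eigenvectors of S_w^{-1} S_b for the d largest eigenvalues.
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1][:d]
    return eigvecs[:, order].real                         # W, shape (F, d)
```

The reduced representation of each sample is then simply $y_i=W^Tx_i$, i.e. `Y = X @ W`.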

Monday, November 7, 2016

Multilinear Principal Component Analysis

In the previous blog post (PCA), a one-dimensional latent space was introduced. In practice, however, a latent space with more dimensions is often needed. To achieve this, a set of projections $\{w_1,w_2,...,w_d\}$ is required. Assume that the latent space is $\{y_1,y_2,...,y_n\}, y_i\in R^d$ and the original data set is $\{x_1,x_2,...,x_n\},x_i\in R^F$; then:
\[
y_i=\begin{bmatrix}
           y_{i1}\\
           ...\\
          y_{id}
       \end{bmatrix}
     =\begin{bmatrix}
          w_1^Tx_i\\
          ...\\
         w_d^Tx_i
       \end{bmatrix}
     =W^Tx_i
\]
where
\[W=[w_1,w_2,...,w_d]\]
Multilinear Principal Component Analysis can be carried out by maximising the variance in each latent dimension:
\[
\begin{align*}
W_o&=\underset{W}{\arg\max}\quad\frac{1}{N}\sum_{k=1}^d\sum_{i=1}^N(y_{ik}-u_{y_k})^2\\
&=\underset{W}{\arg\max}\quad\frac{1}{N}\sum_{k=1}^d\sum_{i=1}^Nw_k^T(x_i-u_x)(x_i-u_x)^Tw_k\\
&=\underset{W}{\arg\max}\quad\sum_{k=1}^dw_k^TS_tw_k\\
&=\underset{W}{\arg\max}\quad tr[W^TS_tW]\\
\end{align*}
\]
\[s.t.\quad W^TW=I\]
Formulate the Lagrangian:
\[
\begin{align*}
\mathcal{L}(W,\Lambda)&=tr[W^TS_tW] -tr[\Lambda(W^TW-I)]\\
\frac{\partial \mathcal{L}(W,\Lambda)}{\partial W}&=2S_tW - 2W\Lambda\\
\end{align*}
\]

Let  $\frac{\partial \mathcal{L}(W,\Lambda)}{\partial W}=0$
\[\Rightarrow\quad S_tW=W\Lambda\]

Substituting this back into the original optimisation objective:
\[
\begin{align*}
W_o&=\underset{W}{\arg\max}\quad tr[W^TS_tW]\\
&=\underset{W}{\arg\max}\quad tr[W^TW\Lambda]\\
&=\underset{W}{\arg\max}\quad tr[\Lambda]\\
\end{align*}
\]
Therefore, one needs to maximise the eigenvalues of $S_t$, and the columns of the projection matrix $W=[w_1,w_2,...,w_d]$ are the eigenvectors corresponding to the $d$ largest eigenvalues of $S_t$.
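A minimal NumPy sketch of this procedure (the data layout and function name are assumptions made for illustration):

```python
import numpy as np

def mpca_projection(X, d):
    """X: (N, F) data matrix, d: number of latent dimensions to keep."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    Xc = (X - mu).T                              # columns are centred samples, shape (F, N)
    St = Xc @ Xc.T / X.shape[0]                  # total scatter S_t = (1/N) X X^T
    eigvals, eigvecs = np.linalg.eigh(St)        # S_t is symmetric; eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :d]                  # eigenvectors of the d largest eigenvalues
    return W                                     # columns are w_1, ..., w_d

# The d-dimensional latent representation is then y_i = W^T x_i,
# i.e. Y = X @ mpca_projection(X, d).
```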

Done.

Sunday, November 6, 2016

Principal Component Analysis

Principal Component Analysis (PCA) is a widely used data analysis and compression technique which uses an orthogonal transformation to convert a set of correlated variables into a set of linearly uncorrelated variables called principal components. Usually, the number of principal components is much smaller than the number of original dimensions, yet this latent space describes the original population as a whole as well as possible. How can the principal components describe the population best?

Assume that there is a set of data points $D=\{x_1,x_2,x_3,...,x_n\},x_i\in R^2$, forming a cluster in a two-dimensional Euclidean plane.
How can one reduce the data dimension from two to one while preserving as much of the original information as possible? Imagine that there is a line in the two-dimensional Euclidean plane, and this line is the direction along which the data will be preserved. To do this, all data points need to be projected onto this line. After projection, all data points lie on the line, i.e. they become one-dimensional data. However, there are uncountably many lines in a two-dimensional plane. Hence, which line, or which direction, best describes the original data still needs to be determined.

In statistics, the variance of a population describes the variability of the observations, i.e. how much the data deviate around the mean. PCA attempts to find a low-dimensional latent space in which the "majority" of the variance is preserved. In other words, it attempts to maximize the variance of the data in the low-dimensional latent space.

Variance of latent space $\{y_1,y_2,y_3,...,y_n\}$:

$\sigma_y^2=\frac{1}{N}\sum_{i=1}^N(y_i-u_y)^2$, where $u_y=\frac{1}{N}\sum_{i=1}^Ny_i$

The aim is to find:

$\{y_1^o,y_2^o,...,y_n^o\}=\underset{y_1,y_2,...,y_n}{\arg\max}\quad\sigma_y^2$

This can be done via a linear projection ($x_i$ is a column vector and $w^T$ is a row vector):

$y_i=w^Tx_i$

That is to say, the projection vector $w$ needs to be determined.
Hence,

\[
\begin{align*}
w_o&=\underset{w}{\arg\max}\quad\sigma_y^2\\
&=\underset{w}{\arg\max}\quad\frac{1}{N}\sum_{i=1}^N(w^Tx_i-w^Tu_x)^2\\
&=\underset{w}{\arg\max}\quad\frac{1}{N}\sum_{i=1}^Nw^T(x_i-u_x)(x_i-u_x)^Tw\\
&=\underset{w}{\arg\max}\quad w^T(\frac{1}{N}\sum_{i=1}^N(x_i-u_x)(x_i-u_x)^T)w\\
&=\underset{w}{\arg\max}\quad w^TS_tw\\
\end{align*}
\]

where:

$S_t=\frac{1}{N}XX^T$
$X=[x_1-u_x,x_2-u_x,...,x_n-u_x]$

$S_t$ is a symmetric matrix because $(XX^T)^T=XX^T$.
Furthermore, $S_t$ is a positive semi-definite matrix.
Proof:

Denote $\lambda$ as an eigenvalue of $XX^T$ and $v$ as a corresponding eigenvector:

\[
\begin{align*}
XX^Tv&=\lambda v\\
v^T(XX^Tv)&=v^T(\lambda v)\\
(X^Tv)^T(X^Tv)&=\lambda v^Tv\\
||X^Tv||^2&=\lambda ||v||^2\\
\Rightarrow\quad\lambda&\geqslant0
\end{align*}
\]
Q.E.D.
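A quick numerical check of this fact (the data here are random and purely illustrative): building $S_t=\frac{1}{N}XX^T$ from centred data and printing its eigenvalues shows that none of them is negative.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(2, 50))                     # 50 two-dimensional points as columns
X = D - D.mean(axis=1, keepdims=True)            # centred data matrix [x_1 - u, ..., x_n - u]
St = X @ X.T / X.shape[1]                        # S_t = (1/N) X X^T
print(np.linalg.eigvalsh(St))                    # both eigenvalues are >= 0
```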

Since $S_t$ is positive semi-definite, the quadratic form $w^TS_tw$ is non-negative for every $w$. The problem now is to find the direction $w$ that maximizes it.

\[w_o=\underset{w}{\arg\max}\quad w^TS_tw\]

Without any constraint there is a trivial "solution": letting $||w||\to\infty$ makes the objective arbitrarily large.
This can be avoided by adding an extra constraint that fixes the magnitude of $w$ ($||w||^2=w^Tw=1$).

\[w_o=\underset{w}{\arg\max}\quad w^TS_tw\\
s.t.\quad w^Tw=1\]

Formulate the Lagrangian:

\[
\begin{align*}
\mathcal{L}(w,\lambda)&=w^TS_tw -\lambda (w^Tw-1)\\
\frac{\partial \mathcal{L}(w,\lambda)}{\partial w}&=2S_tw - 2\lambda w\\
\end{align*}
\]

Let  $\frac{\partial \mathcal{L}(w,\lambda)}{\partial w}=0$
\[\Rightarrow\quad S_tw=\lambda w\]

Substituting this back into the optimization objective:

\[
\begin{align*}
w_o&=\underset{w}{\arg\max}\quad w^TS_tw\\
&=\underset{w}{\arg\max}\quad \lambda w^Tw\\
&=\underset{w}{\arg\max}\quad \lambda
\end{align*}
\]

The above equations mean that to maximize $w^TS_tw$, one needs to maximize the eigenvalue $\lambda$ (since $w^Tw=1$). Therefore, the projection vector $w$ is the eigenvector corresponding to the largest eigenvalue of $S_t$.
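To make the result concrete, here is a minimal NumPy sketch of one-dimensional PCA on synthetic two-dimensional data (the data, covariance, and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.5], [1.5, 1.0]], size=200)   # (N, 2)

u = data.mean(axis=0)
X = (data - u).T                                 # X = [x_1 - u, ..., x_n - u], shape (2, N)
St = X @ X.T / data.shape[0]                     # S_t = (1/N) X X^T

eigvals, eigvecs = np.linalg.eigh(St)            # ascending eigenvalues of the symmetric S_t
w = eigvecs[:, -1]                               # eigenvector of the largest eigenvalue, ||w|| = 1

y = data @ w                                     # one-dimensional latent representation y_i = w^T x_i
print(y.var(), eigvals[-1])                      # the variance of y equals the largest eigenvalue (up to rounding)
```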

Done.