Thursday, November 24, 2016

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a statistical method that reduces the dimensionality of data by making use of class labels. Ideally, after the projection the samples within each class look as similar to each other as possible, while samples from different classes look as dissimilar as possible. To achieve this, one minimizes the variability within each class and maximizes the distance between the class means, so that the classes are separated as much as possible.

Two Classes
In the two-class case, one needs to find a latent space such that:

$\sigma_y^2(c1)+\sigma_y^2(c2)$  is minimized
$(\mu_y(c1)-\mu_y(c2))^2$  is maximized

The two objectives above can be combined into a single optimization problem:
\[minimize\quad\frac{\sigma _y^2(c1)+\sigma_y^2(c2)}{(\mu_y(c1)-\mu_y(c2))^2 }\]
or
\[maximize\quad\frac{(\mu_y(c1)-\mu_y(c2))^2 }{\sigma _y^2(c1)+\sigma_y^2(c2)}\]

Assume that the latent space is $\{y_1,y_2,...,y_n\}$ and $y_i=w^Tx_i$.
Hence,
\[
\begin{align*}
\sigma_y^2(c1)&=\frac{1}{N_{c1}}\sum_{x_i\in c1}(y_i-\mu_y(c1))^2\\
&=\frac{1}{N_{c1}}\sum_{x_i\in c1}(w^T(x_i-\mu(c1)))^2\\
&=\frac{1}{N_{c1}}\sum_{x_i\in c1}w^T(x_i-\mu(c1))(x_i-\mu(c1))^Tw\\
&=w^T\left[\frac{1}{N_{c1}}\sum_{x_i\in c1}(x_i-\mu(c1))(x_i-\mu(c1))^T\right]w\\
&=w^TS_1w
\end{align*}
\]
where
\[
\begin{align*}
\mu(c1)&=\frac{1}{N_{c1}}\sum_{x_i\in c1}x_i
\end{align*}
\]
and
\[
\begin{align*}
\sigma_y^2(c2)&=w^TS_2w\\
\end{align*}
\]
Combining the two classes, we obtain:
\[
\begin{align*}
\sigma_y^2(c1)+\sigma_y^2(c2)&=w^T(S_1+S_2)w=w^TS_ww\\
\end{align*}
\]
\[
\begin{align*}
(\mu_y(c1)-\mu_y(c2))^2&=w^T(\mu(c1)-\mu(c2))(\mu(c1)-\mu(c2))^Tw=w^TS_bw\\
\end{align*}
\]
Therefore
\[
\begin{align*}
maximize\quad\frac{(\mu_y(c1)-\mu_y(c2))^2}{\sigma_y^2(c1)+\sigma_y^2(c2)}&=\frac{w^TS_bw}{w^TS_ww}\\
\end{align*}
\]
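As a quick numerical sanity check of this identity, the sketch below compares the ratio computed in the projected space with $\frac{w^TS_bw}{w^TS_ww}$; the random arrays X1, X2 and the direction w are illustrative assumptions only.
```python
# Sanity check: projected-space ratio equals w^T S_b w / w^T S_w w.
# X1, X2 (one sample per row) and w are random, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(60, 3))   # class 1 samples
X2 = rng.normal(1.5, 1.0, size=(40, 3))   # class 2 samples
w = rng.normal(size=3)                    # an arbitrary projection direction

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)  # class 1 scatter
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)  # class 2 scatter
Sw, Sb = S1 + S2, np.outer(mu1 - mu2, mu1 - mu2)

y1, y2 = X1 @ w, X2 @ w                   # projections y_i = w^T x_i
lhs = (y1.mean() - y2.mean()) ** 2 / (y1.var() + y2.var())
rhs = (w @ Sb @ w) / (w @ Sw @ w)
assert np.isclose(lhs, rhs)
```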
Since the objective is invariant to rescaling of $w$, the denominator can be fixed to one and the problem rewritten with a constraint:
\[max\quad w^TS_bw\quad s.t.\quad w^TS_ww=1\]
Formulate the Lagrangian:
\[
\begin{align*}
\mathcal{L}(w,\lambda)&=w^TS_bw -\lambda(w^TS_ww-1)\\
\end{align*}
\]

Let  $\frac{\partial \mathcal{L}(w,\lambda)}{\partial w}=0$
\[
\begin{align*}
&\Rightarrow\quad S_bw=\lambda S_ww\\
&\Rightarrow\quad S_w^{-1}S_bw=\lambda w
\end{align*}\]
Therefore, $w$ is the eigenvector corresponding to the largest eigenvalue of $S_w^{-1}S_b$. Since $S_b$ has rank one and $S_bw$ always points along $\mu(c1)-\mu(c2)$, this eigenvector is simply $w\propto S_w^{-1}(\mu(c1)-\mu(c2))$.
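A minimal NumPy sketch of this two-class solution; the array names X1 and X2 (samples of each class, one row per sample) are assumptions used only for illustration.
```python
import numpy as np

def lda_two_class(X1, X2):
    """Return the direction w maximizing (w^T S_b w) / (w^T S_w w)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

    # Within-class scatter S_w = S_1 + S_2, each S_j averaged over its class.
    S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)
    S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
    Sw = S1 + S2

    # Between-class scatter S_b = (mu1 - mu2)(mu1 - mu2)^T.
    Sb = np.outer(mu1 - mu2, mu1 - mu2)

    # w is the eigenvector of S_w^{-1} S_b with the largest eigenvalue
    # (equivalently, w is proportional to S_w^{-1}(mu1 - mu2)).
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    return eigvecs[:, np.argmax(eigvals.real)].real

# Usage: project both classes onto the learned direction.
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(100, 3))
X2 = rng.normal(2.0, 1.0, size=(100, 3))
w = lda_two_class(X1, X2)
y1, y2 = X1 @ w, X2 @ w
```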


Multiple Classes
Multi-class problems are more common in machine learning practice. Assume that the original data are projected down to $d$ dimensions, i.e. $W=[w_1,w_2,...,w_d]$ and $y_i=W^Tx_i$.
Hence, within classes:
\[S_w=\sum_{j=1}^CS_j=\sum_{j=1}^C\frac{1}{N_{c_j}}\sum_{x_i\in c_j}(x_i-\mu(c_j))(x_i-\mu(c_j))^T\]
Between classes:
\[S_b=\sum_{j=1}^C(\mu(c_j)-\mu)(\mu(c_j)-\mu)^T\]
where $\mu$ is the mean of all classes.
The corresponding optimization problem is:
\[max\quad tr[W^TS_bW]\quad s.t.\quad W^TS_wW=I\]
Formulate the Lagrangian:
\[
\begin{align*}
\mathcal{L}(W,\Lambda)&=tr[W^TS_bW] -tr[\Lambda(W^TS_wW-I)]\\
\end{align*}
\]

Let  $\frac{\partial \mathcal{L}(W,\Lambda)}{\partial W}=0$
\[
\begin{align*}
&\Rightarrow\quad S_bW=S_wW\Lambda\\
&\Rightarrow\quad S_w^{-1}S_bW=W\Lambda
\end{align*}\]
Therefore, the columns of $W$ are the eigenvectors corresponding to the $d$ largest eigenvalues of $S_w^{-1}S_b$.
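A minimal NumPy sketch of the multi-class case, assuming X is an (N, D) data matrix, y holds integer class labels, and $\mu$ is taken as the overall sample mean (one common convention); all names are illustrative.
```python
import numpy as np

def lda_multiclass(X, y, d):
    """Return W whose columns are the top-d eigenvectors of S_w^{-1} S_b."""
    D = X.shape[1]
    mu = X.mean(axis=0)                              # overall mean, used as mu
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))

    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c) / len(Xc)  # within-class scatter S_j
        Sb += np.outer(mu_c - mu, mu_c - mu)         # between-class scatter

    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]           # largest eigenvalues first
    return eigvecs[:, order[:d]].real

# Usage: reduce 5-dimensional data with 3 classes down to d = 2 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(k, 1.0, size=(50, 5)) for k in range(3)])
y = np.repeat(np.arange(3), 50)
W = lda_multiclass(X, y, d=2)
Y = X @ W
```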
