Fisher Linear Discriminant Analysis is one of the dimentionality reduction technique. When it comes to dimentionality resduction, Principal component analysis is well-known. Nonetheless, as Principal component analysis is unsupervised technique, occasionally it would end up terrible projection for classificaiton. For instance, when I work on 2 class classification, if data are positioned in parallel and very closely together such that the biggest variance is in direction which ignore the labels, PCA would destroy the useful information and all lebels of data will be mixed... Thus I believe it's worth getting grasp for Fisher Linear Discriminant Analysis for classification.
0. Derivation of Fisher-LDA objective¶
Let $x$ be data, and we have two classification label $C_1$, $C_2$. Let the number of data in $C_1$ and $C_2$ be $N_1$, $N_2$, respectively. Let me think the projection of $x$ to $y$ by, $$y = w^T x + w_0$$
The mean of $y$ would be expressed by following,
$$\begin{eqnarray} m_k &=& \frac{1}{N_k}\sum_{i \in C_k} y_i\\ \end{eqnarray}$$Let $\mu1$, $\mu2$ be mean of $C_1$, $C_2$, respectively, mean of $y$ takes form,
$$\begin{eqnarray} m_k &=& \frac{1}{N_k}\sum_{i \in C_k} y_i\\ &=& w^T \mu_k \end{eqnarray}$$Since we'd like to project $x$ to $y$ where the mean $m_1$ and $m_2$ are distant as much as possible, following objective ought to be considered.
$$\begin{eqnarray} |m_1 - m_2|^2 &=& \left(w^T(\mu_1 - \mu_2)\right)\left((\mu_1 - \mu_2)^T w\right) \\ &=& w^T S_B w \tag{1} \end{eqnarray}$$Note :
In this case of binary classification, we can say $S_B$ is scatter of class $C_1$ with respect to the scatter of class 2. Generally speaking, $S_B$ is known for "between classes scatter matrix" defined as ,
Subsequently, we'd better take "within-class scatter matrix" into consideration.
$$\begin{eqnarray} S_k &=& \sum_{i \in C_k}y_i - m_k \\ &=& w^T \left\{\sum_{i \in C_k}(x_i - \mu_k)(x_i - \mu_k)^T \right\}w \end{eqnarray}$$Within-class scatter matrix" $S_w$ takes form,
$$S_w = \sum_c \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^T$$Consequently, $S_1 + S_2$ would be $$S_1 + S_2 = w^TS_ww \tag{2}$$
For the sake of slick projection, Fisher_LDA cosiders maximizing the following objective,
$$J(w) = \frac{w^TS_Bw}{w^TS_Ww} \tag{3}$$1. Compute $w$ from objective¶
Now our motive is maximize equation (3). We can obtain $w$ which maximize equation (3) by follwing.
$$ \frac{\partial J}{\partial w} = 0$$Apply quotient rule of differential, (You can check about quotient here)
$$\begin{eqnarray} \frac{\partial j(w)}{\partial w} &=& \frac{(w^TS_B w)' (wTSww) - (w^TS_B w)(wTSww)'}{\left(w^T S_w w\right)^2}&=& 0\\ S_Bw(w^TS_W w)&=&(w^TS_Bw)S_Ww\\ \end{eqnarray}$$Since both $w^TS_W w$ and $w^TS_Bw$ are scaler, we can denote
$$\begin{eqnarray} S_B w = \lambda S_w w \end{eqnarray}$$As long as $S_W$ is regular matrix,
$$S_W^{-1}S_Bw = \lambda w$$Consequently, it shows that we can obtain $w$ which maximize equation (3) by tackling regular eigen value problem :)
Reference :
Linear Discriminant Analysis