Sunday, October 14, 2018

Fisher Linear Discriminant Analysis

Fisher Linear Discriminant Analysis is one of the dimensionality reduction techniques. When it comes to dimensionality reduction, Principal Component Analysis is well known. Nonetheless, since Principal Component Analysis is an unsupervised technique, it can occasionally end up with a terrible projection for classification. For instance, in a 2-class classification problem, if the two classes lie in parallel and very close together so that the direction of largest variance ignores the labels, PCA destroys the useful information and the projected data of both labels get mixed up... Thus I believe it's worth getting a grasp of Fisher Linear Discriminant Analysis for classification.

0. Derivation of Fisher-LDA objective

Let $x$ be a data point, and suppose we have two classes $C_1$ and $C_2$. Let the number of data points in $C_1$ and $C_2$ be $N_1$ and $N_2$, respectively. Consider the projection of $x$ to $y$ given by,    $$y = w^T x + w_0$$

The mean of $y$ over class $C_k$ can be expressed as follows,

$$\begin{eqnarray} m_k &=& \frac{1}{N_k}\sum_{i \in C_k} y_i\\ \end{eqnarray}$$

Letting $\mu_1$ and $\mu_2$ be the means of $C_1$ and $C_2$, respectively, the mean of $y$ takes the form,

$$\begin{eqnarray} m_k &=& \frac{1}{N_k}\sum_{i \in C_k} y_i\\ &=& w^T \mu_k + w_0 \end{eqnarray}$$

Since we'd like to project $x$ to $y$ so that the means $m_1$ and $m_2$ are as far apart as possible, the following objective ought to be considered (note that the bias $w_0$ cancels out when we take the difference of the means).

$$\begin{eqnarray} |m_1 - m_2|^2 &=& \left(w^T(\mu_1 - \mu_2)\right)\left((\mu_1 - \mu_2)^T w\right) \\ &=& w^T S_B w \tag{1} \end{eqnarray}$$

Note :
In this binary classification case, $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$, which is the scatter of the mean of class $C_1$ with respect to the mean of class $C_2$. Generally speaking, $S_B$ is known as the "between-class scatter matrix", defined as,

$$S_B = \sum_k (\mu_k - \bar{x})(\mu_k - \bar{x})^T$$

Subsequently, we'd better take the within-class scatter into consideration. The scatter of the projected data within class $C_k$ is,

$$\begin{eqnarray} S_k &=& \sum_{i \in C_k}(y_i - m_k)^2 \\ &=& w^T \left\{\sum_{i \in C_k}(x_i - \mu_k)(x_i - \mu_k)^T \right\}w \end{eqnarray}$$

The "within-class scatter matrix" $S_W$ takes the form,

$$S_W = \sum_k \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^T$$

Consequently, $S_1 + S_2$ would be $$S_1 + S_2 = w^TS_Ww \tag{2}$$

For the sake of a good projection, Fisher LDA considers maximizing the following objective,

$$J(w) = \frac{w^TS_Bw}{w^TS_Ww} \tag{3}$$

1. Compute $w$ from objective

Now our goal is to maximize equation (3). We can obtain the $w$ which maximizes equation (3) from the following condition.

$$ \frac{\partial J}{\partial w} = 0$$

Applying the quotient rule of differentiation (you can check the quotient rule here),

$$\begin{eqnarray} \frac{\partial J(w)}{\partial w} &=& \frac{(w^TS_B w)' (w^TS_Ww) - (w^TS_B w)(w^TS_Ww)'}{\left(w^T S_W w\right)^2} = 0\\ \Rightarrow S_Bw(w^TS_W w)&=&(w^TS_Bw)S_Ww\\ \end{eqnarray}$$

Here $(\cdot)'$ denotes $\frac{\partial}{\partial w}$, and we used $\frac{\partial}{\partial w}(w^T S w) = 2Sw$ for a symmetric matrix $S$ (the factors of 2 cancel on both sides).

Since both $w^TS_W w$ and $w^TS_Bw$ are scalars, we can write, with $\lambda = \frac{w^TS_Bw}{w^TS_Ww} = J(w)$,

$$\begin{eqnarray} S_B w = \lambda S_w w \end{eqnarray}$$

As long as $S_W$ is a regular (invertible) matrix,

$$S_W^{-1}S_Bw = \lambda w$$

Consequently, we can obtain the $w$ which maximizes equation (3) by tackling an ordinary eigenvalue problem :)
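
To make this concrete, here is a minimal NumPy sketch (my own addition, using synthetic 2-D Gaussian data; all variable names are illustrative) that builds $S_W$ and $S_B$ as defined above and obtains $w$ both from the eigenvalue problem and from the well-known binary closed form $w \propto S_W^{-1}(\mu_1 - \mu_2)$:

```python
import numpy as np

# Synthetic two-class data, for illustration only
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))   # class C_1
X2 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(60, 2))   # class C_2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter S_W and (binary) between-class scatter S_B
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
S_B = np.outer(mu1 - mu2, mu1 - mu2)

# Solve the eigenvalue problem S_W^{-1} S_B w = lambda w
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w_eig = eigvecs[:, np.argmax(eigvals.real)].real

# For two classes, the maximizer is also proportional to S_W^{-1}(mu1 - mu2)
w_closed = np.linalg.solve(S_W, mu1 - mu2)

# The two directions agree up to scale and sign (cosine similarity ~ 1.0)
cos = abs(w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
print(cos)
```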

Tuesday, October 2, 2018

Differential of Softmax and Cross entropy layer

In neural networks, the Softmax function and Cross entropy are utilized as a set. Hence it's convenient to think of the differential of the combined "Softmax - cross entropy" layer. In this article, I will share with you how to derive the differential of the "Softmax - cross entropy" layer. As prior knowledge, reading the linked articles is preferred :)

0. Differential of Cross Entropy

Let $y_k$ be the output of the softmax function and $t_k$ be the correct label as a one-hot vector; the cross entropy takes the form, $$Cross\ Entropy = -\sum^n_{k=1}t_k \log y_k$$ Obviously, we can derive the differential of the cross entropy with comparative ease :) Letting $E$ be the cross entropy,

$$\frac{\partial E }{\partial y_k} = - \frac{t_k}{y_k}$$
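
As a quick numerical sanity check (my own addition; the numbers are arbitrary), this analytic derivative can be compared with a central finite-difference estimate:

```python
import numpy as np

y = np.array([0.2, 0.5, 0.3])   # a softmax output (sums to 1)
t = np.array([0.0, 1.0, 0.0])   # one-hot target

def cross_entropy(y, t):
    return -np.sum(t * np.log(y))

analytic = -t / y                # dE/dy_k = -t_k / y_k

# central finite differences
eps = 1e-6
numeric = np.zeros_like(y)
for k in range(len(y)):
    y_plus, y_minus = y.copy(), y.copy()
    y_plus[k] += eps
    y_minus[k] -= eps
    numeric[k] = (cross_entropy(y_plus, t) - cross_entropy(y_minus, t)) / (2 * eps)

print(analytic)   # roughly [0, -2, 0]
print(numeric)    # approximately the same
```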

1. Differential of Softmax - Cross Entropy Layer

As we discussed here, the differential of the softmax can be written as follows, where $a = (a_1, a_2, \cdots, a_n)$ is the input and $y=(y_1, y_2, \cdots, y_n)$ is the output of the Softmax function,

$$\begin{eqnarray} \frac{\partial y_l}{\partial a_k} = \begin{cases} y_k(1- y_k) & ( k = l ) \\ -y_k y_l & ( k \neq l ) \end{cases} \end{eqnarray}$$

So the differential of the Softmax - cross entropy layer can be derived as below :),

$$\begin{eqnarray} \frac{\partial E}{\partial a_k} &=& \sum^n_{i=1} \frac{\partial E}{\partial y_i}\frac{\partial y_i}{\partial a_k}\\ &=&- \frac{t_k}{y_k} y_k(1- y_k) + \sum_{i \neq k} \frac{t_i}{y_i}y_i y_k\\ &=& t_k y_k - t_k + y_k \sum_{i \neq k} t_i \\ &=& y_k - t_k \ \ \ \ \ \left(\because \sum^n_{i =1} t_i = 1\right) \end{eqnarray}\\ $$

Thus, we obtain an incredibly simple result for the differential, $$\frac{\partial E}{\partial a_k} = y_k - t_k$$
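
Here is a minimal sketch (my own addition, with arbitrary logits) that verifies $\frac{\partial E}{\partial a_k} = y_k - t_k$ against a finite-difference gradient of the combined layer:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())      # subtracting the max avoids overflow
    return e / e.sum()

def loss(a, t):
    return -np.sum(t * np.log(softmax(a)))   # cross entropy of the softmax output

a = np.array([0.3, -1.2, 2.0, 0.5])          # arbitrary logits
t = np.array([0.0, 0.0, 1.0, 0.0])           # one-hot target

analytic = softmax(a) - t                    # the result derived above: y - t

# central finite differences
eps = 1e-6
numeric = np.zeros_like(a)
for k in range(len(a)):
    a_plus, a_minus = a.copy(), a.copy()
    a_plus[k] += eps
    a_minus[k] -= eps
    numeric[k] = (loss(a_plus, t) - loss(a_minus, t)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```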

Monday, October 1, 2018

Gradient of Softmax Function

When it comes to multiclass classification in the context of neural networks, the Softmax function is utilized as the output activation function. In this article, I will share with you the derivation of the gradient of the Softmax function. If you are curious about the differential of the Affine transformation, you can check Gradient of Affine transformation.

0. Product and Quotient rules of differentiation

For the derivation of the gradient of the Softmax function, understanding the product and quotient rules of differentiation is imperative. Therefore, ahead of the derivation of the gradient, you might wanna wrap your head around them here.

0-a. Product rule

If two functions $f(x)$ and $g(x)$ are differentiable, then their product is also differentiable. The differential takes the form,

$$\left\{f(x)g(x)\right\}' = f'(x)g(x) + f(x)g'(x)$$
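
For instance (a quick example of my own), taking $f(x) = x^2$ and $g(x) = \sin x$,

$$\left\{x^2 \sin x\right\}' = 2x \sin x + x^2 \cos x$$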

0-b. Quotient rule

The quotient rule can be derived from the product rule :) It's worth deriving it here! Applying the product rule to $f^{-1}(x)g(x)$,

$$\begin{eqnarray} \left(f^{-1}(x)g(x)\right)' &=& (f^{-1}(x))'{g(x)} + f^{-1}(x){g'(x)} \\ &=& -\left(f(x)\right)^{-2}f'(x)g(x) + f^{-1}(x)g'(x) \tag{*}\\ &=& \frac{f(x)g'(x) - f'(x)g(x)}{\left\{f(x)\right\}^2} \end{eqnarray}$$

(*) This step follows from the differential of a composite function (the chain rule), as follows. Let $y = t^{-1}$ and $t = f(x)$,

$$y' = -t^{-2} f'(x) = -(f(x))^{-2} f'(x) $$
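
As a quick check of the quotient rule above (my own example), taking $f(x) = x$ and $g(x) = \sin x$,

$$\left(\frac{\sin x}{x}\right)' = \frac{x\cos x - \sin x}{x^2}$$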

1. Gradient of Softmax function

Let the input of the Softmax function be $a = (a_1, a_2, \cdots, a_n)$ and the output be $y=(y_1, y_2, \cdots, y_n)$; the Softmax function can be expressed as below,

$$ y_k = \frac{exp(a_k)}{\sum^n_{i=1}exp(a_i)} $$

For instance, regarding the partial differential with respect to $a_1$, $a_1$ is related to all the outputs $(y_1, y_2, \cdots, y_n)$. Hence, we'd better consider the cases separately.

  • In case of $l = k$
$$\begin{eqnarray} \frac{\partial y_l}{\partial a_k}&=&\frac{\sum^n_{i=1}exp(a_i) exp(a_k) - exp(a_k)exp(a_k)}{\left\{\sum^n_{i=1}exp(a_i)\right\}^2}\\ &=& \frac{exp(a_k) \left(\sum^n_{i=1}exp(a_i) - exp(a_k)\right)}{\left\{\sum^n_{i=1}exp(a_i)\right\}^2}\\ &=& \frac{exp(a_k) \sum_{i\neq k}exp(a_i)}{\left\{\sum^n_{i=1}exp(a_i)\right\}^2}\\ &=& y_k(1-y_k) \end{eqnarray}$$
  • In case of $l \neq k$
$$\begin{eqnarray} \frac{\partial y_l}{\partial a_k}&=&\frac{\sum^n_{i=1}exp(a_i)\cdot 0 - exp(a_l)exp(a_k)}{\left\{\sum^n_{i=1}exp(a_i)\right\}^2}\\ &=& - \frac{exp(a_l) exp(a_k)}{\left\{\sum^n_{i=1}exp(a_i)\right\}^2}\\ &=& -y_l y_k \end{eqnarray}$$

From the above results, we can say the differential of the Softmax function is,

$$\begin{eqnarray} \frac{\partial y_l}{\partial a_k} = \begin{cases} y_k(1- y_k) & ( k = l ) \\ - y_k y_l & ( k \neq l ) \end{cases} \end{eqnarray}$$
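
As a small sanity check (my own addition; the input values are arbitrary), the Jacobian built from this case formula can be compared with a finite-difference estimate:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())      # subtracting the max avoids overflow
    return e / e.sum()

a = np.array([1.0, -0.5, 0.3, 2.0])
y = softmax(a)

# From the formula: dy_l/da_k = y_k(1 - y_k) if l == k, else -y_l * y_k
jacobian = np.diag(y) - np.outer(y, y)

# Central finite differences; column k holds dy/da_k
eps = 1e-6
numeric = np.zeros((len(a), len(a)))
for k in range(len(a)):
    a_plus, a_minus = a.copy(), a.copy()
    a_plus[k] += eps
    a_minus[k] -= eps
    numeric[:, k] = (softmax(a_plus) - softmax(a_minus)) / (2 * eps)

print(np.allclose(jacobian, numeric, atol=1e-5))   # True
```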