Processing math: 68%

Sunday, October 14, 2018

Fisher Linear Discriminant Analysis

Fisher Linear Discriminant Analysis is one of the dimentionality reduction technique. When it comes to dimentionality resduction, Principal component analysis is well-known. Nonetheless, as Principal component analysis is unsupervised technique, occasionally it would end up terrible projection for classificaiton. For instance, when I work on 2 class classification, if data are positioned in parallel and very closely together such that the biggest variance is in direction which ignore the labels, PCA would destroy the useful information and all lebels of data will be mixed... Thus I believe it's worth getting grasp for Fisher Linear Discriminant Analysis for classification.

0. Derivation of Fisher-LDA objective

Let x be data, and we have two classification label C1, C2. Let the number of data in C1 and C2 be N1, N2, respectively. Let me think the projection of x to y by,    y=wTx+w0

The mean of y would be expressed by following,

mk=1NkiCkyi

Let μ1, μ2 be mean of C1, C2, respectively, mean of y takes form,

mk=1NkiCkyi=wTμk

Since we'd like to project x to y where the mean m1 and m2 are distant as much as possible, following objective ought to be considered.

|m1m2|2=(wT(μ1μ2))((μ1μ2)Tw)=wTSBw

Note :
In this case of binary classification, we can say SB is scatter of class C1 with respect to the scatter of class 2. Generally speaking, SB is known for "between classes scatter matrix" defined as ,

SB=k(μkˉx)(μkˉx)T

Subsequently, we'd better take "within-class scatter matrix" into consideration.

Sk=iCkyimk=wT{iCk(xiμk)(xiμk)T}w

Within-class scatter matrix" Sw takes form,

Sw=ciCk(xiμk)(xiμk)T

Consequently, S1+S2 would be S1+S2=wTSww

For the sake of slick projection, Fisher_LDA cosiders maximizing the following objective,

J(w)=wTSBwwTSWw

1. Compute w from objective

Now our motive is maximize equation (3). We can obtain w which maximize equation (3) by follwing.

Jw=0

Apply quotient rule of differential, (You can check about quotient here)

j(w)w=(wTSBw)(wTSww)(wTSBw)(wTSww)(wTSww)2=0SBw(wTSWw)=(wTSBw)SWw

Since both wTSWw and wTSBw are scaler, we can denote

SBw=λSww

As long as SW is regular matrix,

S1WSBw=λw

Consequently, it shows that we can obtain w which maximize equation (3) by tackling regular eigen value problem :)

Tuesday, October 2, 2018

Differential of Softmax and Cross entropy layer

In neural network, Softmax function and Cross entropy is utilized as a set. Hence it's convenient to think differential for Softmax - cross entorpy layer. In this article, I will share with you how to derive the differential of "Softmax - cross entorpy layer". As a prior knowledge, reading links is prefered :)

0. Differential of Cross Entropy

Let yk be output of softmax function , tk be correct data as one-hot vector , cross entropy takes form, Cross Entropy=nk=1tklogyk Obviously, we can derive differential of cross entropy with comperative ease :) Let E be cross entropy,

Eyk=tkyk

1. Differential of Cross Entropy

As we discussed here, differential of softmax can be depicted as following, where ak=(a1,a2,,an) is input and yk=(y1,y2,,yk) is ouput for Softmax function,

ylak={yk(1yk)(k=l)ykyl(kl)

So differential of Softmax function can be derived as below :),

Eak=ni=1Eyiyiak=tkykyk(1yk)+iktiyiyiyk=tkyktk+ykikti=yktk     (

Thus, we obtain incredibly simple result of differential, \frac{\partial E}{\partial a_k} = y_k - t_k

Monday, October 1, 2018

Gradient of Softmax Function

When it comes to multiclass classification in the context of neural network, Softmax function is utilized as activation function. In this article, I will share with you about derivation of gradient of Softmax function. If you have curiousity towards differential of Affine trasformation, you can check Gradient of Affine transformation.

0. Product and Quotient rule of differential

For the sake of derivation of gradient of Softmax function, Understanding of Product and Quotient rule of differential is somehow imperative. Therefore ahead of derivation of gradient, you might wanna wrapp you head around it here.

0-a. Product rule

If two functions f(x) and g(x) are differentiable then the product of them are also differentiable. Differential takes form,

\left\{f(x)g(x)\right\}' = f'(x)g(x) + f(x)g'(x)

0-b. Quotient rule

Quotient rule can be derived from Product rule :) It worth deriving from product rule here! Applying product rule to f^{-1}(x)g(x),

\begin{eqnarray} f^{-1}(x)g(x) &=& (f^{-1}(x))'{g(x)} + f^{-1}(x){g'(x)} \\ &=& -\left(f(x)\right)^{-2}f'(x)g(x) + f^{-1}g'(x) \tag{*}\\ &=& \frac{f(x)g'(x) - f'(x)g(x)}{\left\{f(x)\right\}^2} \end{eqnarray}

(*) it's trivial using differencetial of sythtic function as following. Let y be y=t^{-1}, and t be t=f(x),

y' = -t^{-2} f'(x) = -(f(x))^{-2} f'(x)

1. Gradient of Softmax function

Let the input of Softmax function be a_k = (a_1, a_2, \cdots, a_n), output be y_k=(y_1, y_2, \cdots, y_k), Softmax function can be expressed as below,

y_k = \frac{exp(a_k)}{\sum^n_{i=1}exp(a_i)}

For instance, regarding partial differential of a_1, a_1 is related to all output y_k=(y_1, y_2, \cdots, y_k). Hence, we'd better think in cases separately.

  • In case of l = k
\begin{eqnarray} \frac{\partial y_l}{\partial a_k}&=&\frac{\sum^n_{i=1}exp(a_i) exp(a_k) - exp(a_k)exp(a_k)}{\left\{\sum^n_{i=1}exp(a_k)\right\}^2}\\ &=& \frac{exp(a_k) \left(\sum^n_{i=1}exp(a_i) - exp(a_k)\right)}{\left\{\sum^n_{i=1}exp(a_i)\right\}^2}\\ &=& \frac{exp(a_k) \sum_{i\neq k}exp(a_i)}{\left\{\sum^n_{i=1}exp(a_k)\right\}^2}\\ &=& y_k(1-y_k) \end{eqnarray}
  • In case of l \neq k
\begin{eqnarray} \frac{\partial y_l}{\partial a_k}&=&\frac{\sum^n_{i=1}exp(a_i)\cdot 0 - exp(a_l)exp(a_k)}{\left\{\sum^n_{i=1}exp(a_k)\right\}^2}\\ &=& - \frac{exp(a_l) exp(a_k)}{\left\{\sum^n_{i=1}exp(a_i)\right\}^2}\\ &=& -y_l y_k \end{eqnarray}

From above result, we can say the differential of Softmax function is,

\begin{eqnarray} \frac{\partial y_l}{\partial a_k} = \begin{cases} y_k(1- y_k) & ( k = l ) \\ - y_k y_l & ( k \neq l ) \end{cases} \end{eqnarray}