Fisher Linear Discriminant Analysis is one of the dimentionality reduction technique. When it comes to dimentionality resduction, Principal component analysis is well-known. Nonetheless, as Principal component analysis is unsupervised technique, occasionally it would end up terrible projection for classificaiton. For instance, when I work on 2 class classification, if data are positioned in parallel and very closely together such that the biggest variance is in direction which ignore the labels, PCA would destroy the useful information and all lebels of data will be mixed... Thus I believe it's worth getting grasp for Fisher Linear Discriminant Analysis for classification.
0. Derivation of Fisher-LDA objective¶
Let x be data, and we have two classification label C1, C2. Let the number of data in C1 and C2 be N1, N2, respectively. Let me think the projection of x to y by, y=wTx+w0
The mean of y would be expressed by following,
mk=1Nk∑i∈CkyiLet μ1, μ2 be mean of C1, C2, respectively, mean of y takes form,
mk=1Nk∑i∈Ckyi=wTμkSince we'd like to project x to y where the mean m1 and m2 are distant as much as possible, following objective ought to be considered.
|m1−m2|2=(wT(μ1−μ2))((μ1−μ2)Tw)=wTSBwNote :
In this case of binary classification, we can say SB is scatter of class C1 with respect to the scatter of class 2. Generally speaking, SB is known for "between classes scatter matrix" defined as ,
Subsequently, we'd better take "within-class scatter matrix" into consideration.
Sk=∑i∈Ckyi−mk=wT{∑i∈Ck(xi−μk)(xi−μk)T}wWithin-class scatter matrix" Sw takes form,
Sw=∑c∑i∈Ck(xi−μk)(xi−μk)TConsequently, S1+S2 would be S1+S2=wTSww
For the sake of slick projection, Fisher_LDA cosiders maximizing the following objective,
J(w)=wTSBwwTSWw1. Compute w from objective¶
Now our motive is maximize equation (3). We can obtain w which maximize equation (3) by follwing.
∂J∂w=0Apply quotient rule of differential, (You can check about quotient here)
∂j(w)∂w=(wTSBw)′(wTSww)−(wTSBw)(wTSww)′(wTSww)2=0SBw(wTSWw)=(wTSBw)SWwSince both wTSWw and wTSBw are scaler, we can denote
SBw=λSwwAs long as SW is regular matrix,
S−1WSBw=λwConsequently, it shows that we can obtain w which maximize equation (3) by tackling regular eigen value problem :)
Reference :
Linear Discriminant Analysis