When it comes to multiclass classification with neural networks, the Softmax function is used as the output activation function. In this article, I will walk through the derivation of the gradient of the Softmax function. If you are curious about the derivative of the Affine transformation, you can check Gradient of Affine transformation.
0. Product and quotient rules of differentiation
To derive the gradient of the Softmax function, the product and quotient rules of differentiation are essential. So before the derivation itself, let's quickly wrap our heads around them here.
0-a. Product rule
If two functions f(x) and g(x) are differentiable, then their product is also differentiable, and its derivative takes the form
$$\{f(x)g(x)\}' = f'(x)g(x) + f(x)g'(x)$$
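As a quick illustration (this example is mine, not from the original post), applying the product rule to $f(x) = x^2$ and $g(x) = \sin x$ gives

$$\{x^2 \sin x\}' = 2x \sin x + x^2 \cos x$$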
0-b. Quotient rule
The quotient rule can be derived from the product rule :) It is worth deriving it here! Applying the product rule to $f^{-1}(x)g(x)$, i.e. $g(x)/f(x)$,
$$\begin{aligned}
\{f^{-1}(x)g(x)\}' &= (f^{-1}(x))'g(x) + f^{-1}(x)g'(x) \\
&= -(f(x))^{-2}f'(x)g(x) + f^{-1}(x)g'(x) \qquad (*) \\
&= \frac{f(x)g'(x) - f'(x)g(x)}{\{f(x)\}^2}
\end{aligned}$$

The step marked (*), namely $(f^{-1}(x))' = -(f(x))^{-2}f'(x)$, follows from the derivative of a composite function (the chain rule). Let $y = t^{-1}$ and $t = f(x)$; then

$$y' = -t^{-2}f'(x) = -(f(x))^{-2}f'(x)$$
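For instance (again, an example of mine rather than from the original post), applying the quotient rule with $g(x) = x$ and $f(x) = x^2 + 1$ gives

$$\left\{\frac{x}{x^2+1}\right\}' = \frac{(x^2+1)\cdot 1 - 2x \cdot x}{(x^2+1)^2} = \frac{1 - x^2}{(x^2+1)^2}$$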
1. Gradient of Softmax function
Let the input of the Softmax function be $a = (a_1, a_2, \cdots, a_n)$ and the output be $y = (y_1, y_2, \cdots, y_n)$. The Softmax function can then be expressed as
$$y_k = \frac{\exp(a_k)}{\sum_{i=1}^{n} \exp(a_i)}$$
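As a small sanity check of the definition, here is a minimal NumPy sketch (the function name `softmax` and the sample input are my own choices, not from the original post). Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(a):
    # Subtract the max for numerical stability; the output is unchanged
    # because the shift cancels between numerator and denominator.
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([1.0, 2.0, 3.0])
y = softmax(a)
print(y)        # [0.09003057 0.24472847 0.66524096]
print(y.sum())  # ~1.0, the outputs form a probability distribution
```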
Now consider the partial derivative $\partial y_l / \partial a_k$. Each input $a_k$ appears in every output $y_1, y_2, \cdots, y_n$ through the normalizing sum, so we'd better think in two cases separately.

- In case of $l = k$: applying the quotient rule with $g = \exp(a_k)$ and $f = \sum_{i=1}^{n}\exp(a_i)$,

$$\frac{\partial y_k}{\partial a_k} = \frac{\exp(a_k)\sum_{i=1}^{n}\exp(a_i) - \exp(a_k)\exp(a_k)}{\left(\sum_{i=1}^{n}\exp(a_i)\right)^2} = y_k - y_k^2 = y_k(1 - y_k)$$

- In case of $l \neq k$: the numerator $\exp(a_l)$ does not depend on $a_k$, so only the denominator contributes,

$$\frac{\partial y_l}{\partial a_k} = \frac{0 \cdot \sum_{i=1}^{n}\exp(a_i) - \exp(a_l)\exp(a_k)}{\left(\sum_{i=1}^{n}\exp(a_i)\right)^2} = -y_l y_k$$
From the results above, we can say the gradient of the Softmax function is
$$\frac{\partial y_l}{\partial a_k} =
\begin{cases}
y_k(1 - y_k) & (k = l) \\
-y_k y_l & (k \neq l)
\end{cases}$$
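To double-check this result numerically, here is a short NumPy sketch (the names `softmax_jacobian` and the test input are mine) that builds the Jacobian from the formula above and compares it against a central finite-difference approximation:

```python
import numpy as np

def softmax(a):
    # Same numerically stable softmax as in the sketch above.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_jacobian(a):
    # J[l, k] = dy_l/da_k = y_k * (1 - y_k) if l == k, else -y_k * y_l,
    # which is exactly diag(y) - outer(y, y).
    y = softmax(a)
    return np.diag(y) - np.outer(y, y)

a = np.array([0.3, -1.2, 2.0, 0.5])
J = softmax_jacobian(a)

# Central finite differences: perturb each input a_k by +/- eps.
eps = 1e-6
J_num = np.zeros_like(J)
for k in range(len(a)):
    d = np.zeros_like(a)
    d[k] = eps
    J_num[:, k] = (softmax(a + d) - softmax(a - d)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-8))  # True: the analytic gradient matches
```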