Monday, October 1, 2018

Gradient of Softmax Function

When it comes to multiclass classification with neural networks, the Softmax function is used as the activation function of the output layer. In this article, I will share with you the derivation of the gradient of the Softmax function. If you are curious about the derivative of the Affine transformation, you can check Gradient of Affine transformation.

0. Product and Quotient rules of differentiation

To derive the gradient of the Softmax function, an understanding of the product and quotient rules of differentiation is essential. So before the derivation itself, you might want to wrap your head around them here.

0-a. Product rule

If two functions $f(x)$ and $g(x)$ are differentiable, then their product is also differentiable, and its derivative takes the form

$$\left\{f(x)g(x)\right\}' = f'(x)g(x) + f(x)g'(x)$$
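For example, taking $f(x) = x^2$ and $g(x) = \sin x$, the rule gives

$$\left\{x^2 \sin x\right\}' = 2x\sin x + x^2\cos x$$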

0-b. Quotient rule

The Quotient rule can be derived from the Product rule :) It is worth deriving it here! Applying the product rule to $f^{-1}(x)g(x)$,

$$\begin{eqnarray} \left\{f^{-1}(x)g(x)\right\}' &=& \left(f^{-1}(x)\right)'{g(x)} + f^{-1}(x){g'(x)} \\ &=& -\left(f(x)\right)^{-2}f'(x)g(x) + f^{-1}(x)g'(x) \tag{*}\\ &=& \frac{f(x)g'(x) - f'(x)g(x)}{\left\{f(x)\right\}^2} \end{eqnarray}$$

(*) This step follows from the chain rule for composite functions. Let $y$ be $y=t^{-1}$ and $t$ be $t=f(x)$; then

$$y' = -t^{-2} f'(x) = -(f(x))^{-2} f'(x) $$
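As a quick check of the rule, take $g(x) = \sin x$ and $f(x) = \cos x$, so that $f^{-1}(x)g(x) = \tan x$:

$$(\tan x)' = \frac{\cos x \cos x - (-\sin x)\sin x}{\cos^2 x} = \frac{1}{\cos^2 x}$$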

1. Gradient of Softmax function

Let the input of the Softmax function be $(a_1, a_2, \cdots, a_n)$ and the output be $(y_1, y_2, \cdots, y_n)$. Then the Softmax function can be expressed as below,

$$ y_k = \frac{\exp(a_k)}{\sum^n_{i=1}\exp(a_i)} $$
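Just to make the definition concrete, here is a minimal sketch in Python (assuming NumPy; subtracting the max before exponentiating is a common numerical-stability trick, not part of the formula above):

```python
import numpy as np

def softmax(a):
    """Softmax over a 1-D array of pre-activations (a_1, ..., a_n)."""
    a = np.asarray(a, dtype=float)
    # Subtracting the max does not change the output, but prevents overflow in exp.
    e = np.exp(a - a.max())
    return e / e.sum()

print(softmax([1.0, 2.0, 3.0]))  # roughly [0.09, 0.245, 0.665]
```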

For instance, consider the partial derivative with respect to $a_1$: $a_1$ is related to every output $(y_1, y_2, \cdots, y_n)$ through the denominator. Hence, we had better treat the two cases below separately.

  • In the case $l = k$
$$\begin{eqnarray} \frac{\partial y_l}{\partial a_k}&=&\frac{\sum^n_{i=1}\exp(a_i) \exp(a_k) - \exp(a_k)\exp(a_k)}{\left\{\sum^n_{i=1}\exp(a_i)\right\}^2}\\ &=& \frac{\exp(a_k) \left(\sum^n_{i=1}\exp(a_i) - \exp(a_k)\right)}{\left\{\sum^n_{i=1}\exp(a_i)\right\}^2}\\ &=& \frac{\exp(a_k) \sum_{i\neq k}\exp(a_i)}{\left\{\sum^n_{i=1}\exp(a_i)\right\}^2}\\ &=& y_k(1-y_k) \end{eqnarray}$$
  • In the case $l \neq k$
$$\begin{eqnarray} \frac{\partial y_l}{\partial a_k}&=&\frac{\sum^n_{i=1}\exp(a_i)\cdot 0 - \exp(a_l)\exp(a_k)}{\left\{\sum^n_{i=1}\exp(a_i)\right\}^2}\\ &=& - \frac{\exp(a_l) \exp(a_k)}{\left\{\sum^n_{i=1}\exp(a_i)\right\}^2}\\ &=& -y_l y_k \end{eqnarray}$$

From the above result, we can say the derivative of the Softmax function is,

$$\begin{eqnarray} \frac{\partial y_l}{\partial a_k} = \begin{cases} y_k(1- y_k) & ( k = l ) \\ - y_k y_l & ( k \neq l ) \end{cases} \end{eqnarray}$$
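As a rough sanity check of this Jacobian, here is a small sketch (again assuming NumPy; `softmax` is restated so the snippet is self-contained) that builds the matrix from the formula and compares it with central finite differences:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_jacobian(a):
    """J[l, k] = dy_l/da_k = y_k (1 - y_k) if l == k, else -y_l y_k."""
    y = softmax(a)
    return np.diag(y) - np.outer(y, y)

# Compare against central finite differences on a random input.
a = np.random.randn(4)
eps = 1e-6
numeric = np.empty((4, 4))
for k in range(4):
    d = np.zeros(4)
    d[k] = eps
    numeric[:, k] = (softmax(a + d) - softmax(a - d)) / (2 * eps)

print(np.allclose(numeric, softmax_jacobian(a), atol=1e-6))  # expected: True
```

Note that `np.diag(y) - np.outer(y, y)` reproduces both cases at once: the diagonal entries are $y_k - y_k^2 = y_k(1-y_k)$ and the off-diagonal entries are $-y_l y_k$.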
