Monday, October 1, 2018

Gradient of Softmax Function

When it comes to multiclass classification with neural networks, the Softmax function is used as the activation function of the output layer. In this article, I will share with you the derivation of the gradient of the Softmax function. If you are curious about the derivative of the Affine transformation, you can check Gradient of Affine transformation.

0. Product and Quotient rules of differentiation

To derive the gradient of the Softmax function, an understanding of the product and quotient rules of differentiation is essential. So before the derivation itself, you might want to wrap your head around them here.

0-a. Product rule

If two functions $f(x)$ and $g(x)$ are differentiable, then their product is also differentiable, and its derivative takes the form

$$\left\{f(x)g(x)\right\}' = f'(x)g(x) + f(x)g'(x)$$
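For example, taking $f(x) = x^2$ and $g(x) = \sin x$, the rule gives

$$\left\{x^2 \sin x\right\}' = 2x\sin x + x^2\cos x$$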

0-b. Quotient rule

The Quotient rule can be derived from the Product rule :) It is worth deriving it here! Applying the product rule to $f^{-1}(x)g(x)$,

$$\begin{eqnarray} \left\{f^{-1}(x)g(x)\right\}' &=& \left(f^{-1}(x)\right)'{g(x)} + f^{-1}(x){g'(x)} \\ &=& -\left(f(x)\right)^{-2}f'(x)g(x) + f^{-1}(x)g'(x) \tag{*}\\ &=& \frac{f(x)g'(x) - f'(x)g(x)}{\left\{f(x)\right\}^2} \end{eqnarray}$$

(*) This step follows from the chain rule for composite functions. Let $y$ be $y=t^{-1}$ and $t$ be $t=f(x)$; then

$$y' = -t^{-2} f'(x) = -(f(x))^{-2} f'(x) $$
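As a quick check of the rule, take $g(x) = \sin x$ and $f(x) = \cos x$, so that $f^{-1}(x)g(x) = \tan x$:

$$(\tan x)' = \frac{\cos x \cos x - (-\sin x)\sin x}{\cos^2 x} = \frac{1}{\cos^2 x}$$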

1. Gradient of Softmax function

Let the input of the Softmax function be $(a_1, a_2, \cdots, a_n)$ and the output be $(y_1, y_2, \cdots, y_n)$. Then the Softmax function can be expressed as below,

$$ y_k = \frac{\exp(a_k)}{\sum^n_{i=1}\exp(a_i)} $$
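Just to make the definition concrete, here is a minimal sketch in Python (assuming NumPy; subtracting the max before exponentiating is a common numerical-stability trick, not part of the formula above):

```python
import numpy as np

def softmax(a):
    """Softmax over a 1-D array of pre-activations (a_1, ..., a_n)."""
    a = np.asarray(a, dtype=float)
    # Subtracting the max does not change the output, but prevents overflow in exp.
    e = np.exp(a - a.max())
    return e / e.sum()

print(softmax([1.0, 2.0, 3.0]))  # roughly [0.09, 0.245, 0.665]
```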

For instance, consider the partial derivative with respect to $a_1$: $a_1$ is related to every output $(y_1, y_2, \cdots, y_n)$ through the denominator. Hence, we had better treat the two cases below separately.

  • In the case $l = k$
$$\begin{eqnarray} \frac{\partial y_l}{\partial a_k}&=&\frac{\sum^n_{i=1}\exp(a_i) \exp(a_k) - \exp(a_k)\exp(a_k)}{\left\{\sum^n_{i=1}\exp(a_i)\right\}^2}\\ &=& \frac{\exp(a_k) \left(\sum^n_{i=1}\exp(a_i) - \exp(a_k)\right)}{\left\{\sum^n_{i=1}\exp(a_i)\right\}^2}\\ &=& \frac{\exp(a_k) \sum_{i\neq k}\exp(a_i)}{\left\{\sum^n_{i=1}\exp(a_i)\right\}^2}\\ &=& y_k(1-y_k) \end{eqnarray}$$
  • In the case $l \neq k$
$$\begin{eqnarray} \frac{\partial y_l}{\partial a_k}&=&\frac{\sum^n_{i=1}\exp(a_i)\cdot 0 - \exp(a_l)\exp(a_k)}{\left\{\sum^n_{i=1}\exp(a_i)\right\}^2}\\ &=& - \frac{\exp(a_l) \exp(a_k)}{\left\{\sum^n_{i=1}\exp(a_i)\right\}^2}\\ &=& -y_l y_k \end{eqnarray}$$

From the above result, we can say the derivative of the Softmax function is,

$$\begin{eqnarray} \frac{\partial y_l}{\partial a_k} = \begin{cases} y_k(1- y_k) & ( k = l ) \\ - y_k y_l & ( k \neq l ) \end{cases} \end{eqnarray}$$
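As a rough sanity check of this Jacobian, here is a small sketch (again assuming NumPy; `softmax` is restated so the snippet is self-contained) that builds the matrix from the formula and compares it with central finite differences:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_jacobian(a):
    """J[l, k] = dy_l/da_k = y_k (1 - y_k) if l == k, else -y_l y_k."""
    y = softmax(a)
    return np.diag(y) - np.outer(y, y)

# Compare against central finite differences on a random input.
a = np.random.randn(4)
eps = 1e-6
numeric = np.empty((4, 4))
for k in range(4):
    d = np.zeros(4)
    d[k] = eps
    numeric[:, k] = (softmax(a + d) - softmax(a - d)) / (2 * eps)

print(np.allclose(numeric, softmax_jacobian(a), atol=1e-6))  # expected: True
```

Note that `np.diag(y) - np.outer(y, y)` reproduces both cases at once: the diagonal entries are $y_k - y_k^2 = y_k(1-y_k)$ and the off-diagonal entries are $-y_l y_k$.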
