Affine transformations appear everywhere in neural networks, and stochastic gradient descent requires their gradients. You could of course find the formulas by typing "gradient" and "affine transformation" into Google, but in this article I want to walk through how to derive the gradients of an affine transformation from scratch.
0. What is an Affine transformation?¶
An Affine transformation is the combination of a linear transformation and a translation. In this article, to keep things simple, we will work with a 2-dimensional row vector $x = (x_1, x_2)$, a 2-by-3 matrix $w=\begin{pmatrix} w_{11},w_{12},w_{13}\\w_{21},w_{22},w_{23}\end{pmatrix}$, and a 3-dimensional row vector $b=(b_1,b_2,b_3)$. In that case, the Affine transformation takes the form
$$\begin{eqnarray}y = xw + b &=& (x_1, x_2)\begin{pmatrix} w_{11},w_{12},w_{13}\\w_{21},w_{22},w_{23}\end{pmatrix} + (b_1,b_2,b_3) \\ &=& \begin{pmatrix}w_{11}x_1 + w_{21}x_2+b_1,\ w_{12}x_{1}+w_{22}x_2 + b_2,\ w_{13}x_1+w_{23}x_2+b_3\end{pmatrix}\end{eqnarray}$$We write the output as $y = (y_1, y_2, y_3)$, so that $y = xw + b$; this notation is used in the derivations below.

1. Differentiation of composite functions¶
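Before deriving anything, it may help to see the forward computation itself. Here is a minimal sketch in NumPy (the concrete values of `x`, `w`, and `b` are my own illustrative choices, not from the derivation):

```python
import numpy as np

# Affine transformation y = xw + b with the shapes used in this article:
# x is a 1x2 row vector, w is a 2x3 matrix, b is a 1x3 row vector.
x = np.array([[1.0, 2.0]])
w = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = np.array([[0.5, 0.5, 0.5]])

y = x @ w + b   # a 1x3 row vector
print(y)        # [[ 9.5 12.5 15.5]]
```

Each component of `y` matches the expanded formula above, e.g. $y_1 = w_{11}x_1 + w_{21}x_2 + b_1 = 1 + 8 + 0.5 = 9.5$.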
To derive the gradients of the Affine transformation, we first need the chain rule for composite functions. Let $z = f(x, y)$ with $x = g(t)$ and $y = h(t)$. Then the derivative of $z$ with respect to $t$ is
$$\frac{d z}{d t} = \frac{\partial z}{\partial x}\frac{d x}{d t}+\frac{\partial z}{\partial y}\frac{d y}{d t}$$

2. Derivation of the gradient of x¶
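The chain rule above can be sanity-checked numerically. The functions below ($z = xy$, $x = t^2$, $y = \sin t$) are examples I chose for illustration; the analytic chain-rule derivative is compared against a finite-difference approximation:

```python
import numpy as np

# Composite function z(t) = x(t) * y(t), with x = t**2 and y = sin(t)
def z_of_t(t):
    return t**2 * np.sin(t)

t = 1.3

# Chain rule: dz/dt = (dz/dx)(dx/dt) + (dz/dy)(dy/dt)
x, y = t**2, np.sin(t)
analytic = y * 2 * t + x * np.cos(t)

# Central finite-difference approximation of dz/dt
eps = 1e-6
numeric = (z_of_t(t + eps) - z_of_t(t - eps)) / (2 * eps)
print(analytic, numeric)  # the two values agree closely
```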
Let $y = xw + b$ as above, and let $L$ be a scalar loss that depends on $y$. The gradient of $L$ with respect to $x$ can be derived as follows.
$$\begin{eqnarray} \frac{\partial L}{\partial x} &=& \left(\frac{\partial L}{\partial x_1},\frac{\partial L}{\partial x_2}\right)\\ &=& \left(\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial x_1} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial x_1} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial x_1},\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial x_2} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial x_2} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial x_2}\right) \\ &=& \left(\frac{\partial L}{\partial y_1}w_{11} + \frac{\partial L}{\partial y_2}w_{12} + \frac{\partial L}{\partial y_3}w_{13},\frac{\partial L}{\partial y_1}w_{21} + \frac{\partial L}{\partial y_2}w_{22} + \frac{\partial L}{\partial y_3}w_{23}\right) \\ &=& \frac{\partial L}{\partial y}\begin{pmatrix} w_{11},w_{21}\\w_{12},w_{22}\\w_{13},w_{23}\end{pmatrix}\\ &=& \frac{\partial L}{\partial y} w^T \end{eqnarray}$$

3. Derivation of the gradient of w¶
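The result $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} w^T$ can be checked numerically. As an illustrative assumption, I take $L = \sum_i y_i$, so that $\frac{\partial L}{\partial y}$ is a row vector of ones; the values of `x`, `w`, and `b` are also my own choices:

```python
import numpy as np

x = np.array([[1.0, 2.0]])
w = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = np.array([[0.5, 0.5, 0.5]])

dL_dy = np.ones((1, 3))   # gradient of L = sum(y) with respect to y
dL_dx = dL_dy @ w.T       # the formula derived above

# Finite-difference check on each component of x
eps = 1e-6
num = np.zeros_like(x)
for i in range(x.shape[1]):
    xp, xm = x.copy(), x.copy()
    xp[0, i] += eps
    xm[0, i] -= eps
    num[0, i] = ((xp @ w + b).sum() - (xm @ w + b).sum()) / (2 * eps)
print(dL_dx)  # [[ 6. 15.]] -- matches the finite-difference estimate
```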
For $w$, the gradient can be derived as follows. Note that each $y_j$ depends only on the entries in column $j$ of $w$, which is why most of the partial derivatives below vanish.
$$\begin{eqnarray} \frac{\partial L}{\partial w} &=& \begin{pmatrix} \frac{\partial L}{\partial w_{11}},\frac{\partial L}{\partial w_{12}},\frac{\partial L}{\partial w_{13}}\\ \frac{\partial L}{\partial w_{21}},\frac{\partial L}{\partial w_{22}},\frac{\partial L}{\partial w_{23}}\end{pmatrix}\\ &=& \begin{pmatrix} \frac{\partial L}{\partial y_{1}}\frac{\partial y_{1}}{\partial w_{11}}+ \frac{\partial L}{\partial y_{2}}\frac{\partial y_{2}}{\partial w_{11}} + \frac{\partial L}{\partial y_{3}}\frac{\partial y_{3}}{\partial w_{11}},\cdots,\cdots\\ \frac{\partial L}{\partial y_{1}}\frac{\partial y_{1}}{\partial w_{21}}+ \frac{\partial L}{\partial y_{2}}\frac{\partial y_{2}}{\partial w_{21}} + \frac{\partial L}{\partial y_{3}}\frac{\partial y_{3}}{\partial w_{21}},\cdots,\cdots\end{pmatrix}\\ &=& \begin{pmatrix} \frac{\partial L}{\partial y_{1}}x_1+ \frac{\partial L}{\partial y_{2}}\cdot 0 + \frac{\partial L}{\partial y_{3}}\cdot 0,\ \frac{\partial L}{\partial y_{1}}\cdot 0+ \frac{\partial L}{\partial y_{2}}x_1 + \frac{\partial L}{\partial y_{3}}\cdot 0,\ \frac{\partial L}{\partial y_{1}}\cdot 0+ \frac{\partial L}{\partial y_{2}}\cdot 0 + \frac{\partial L}{\partial y_{3}}x_1\\ \frac{\partial L}{\partial y_{1}}x_2+ \frac{\partial L}{\partial y_{2}}\cdot 0 + \frac{\partial L}{\partial y_{3}}\cdot 0,\ \frac{\partial L}{\partial y_{1}}\cdot 0+ \frac{\partial L}{\partial y_{2}}x_2 + \frac{\partial L}{\partial y_{3}}\cdot 0,\ \frac{\partial L}{\partial y_{1}}\cdot 0+ \frac{\partial L}{\partial y_{2}}\cdot 0 + \frac{\partial L}{\partial y_{3}}x_2\end{pmatrix}\\ &=&\begin{pmatrix} \frac{\partial L}{\partial y_{1}}x_1,\frac{\partial L}{\partial y_{2}}x_1,\frac{\partial L}{\partial y_{3}}x_1\\ \frac{\partial L}{\partial y_{1}}x_2 ,\frac{\partial L}{\partial y_{2}}x_2 ,\frac{\partial L}{\partial y_{3}}x_2\end{pmatrix}\\ &=& x^T \frac{\partial L}{\partial y} \end{eqnarray}$$

4. Derivation of the gradient of b¶
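The result $\frac{\partial L}{\partial w} = x^T \frac{\partial L}{\partial y}$ can be verified the same way. Again I assume $L = \sum_i y_i$ and illustrative values for `x`, `w`, and `b`:

```python
import numpy as np

x = np.array([[1.0, 2.0]])
w = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = np.array([[0.5, 0.5, 0.5]])

dL_dy = np.ones((1, 3))   # gradient of L = sum(y) with respect to y
dL_dw = x.T @ dL_dy       # the formula derived above: a 2x3 matrix

# Finite-difference check on each entry of w
eps = 1e-6
num = np.zeros_like(w)
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        wp, wm = w.copy(), w.copy()
        wp[i, j] += eps
        wm[i, j] -= eps
        num[i, j] = ((x @ wp + b).sum() - (x @ wm + b).sum()) / (2 * eps)
print(dL_dw)  # [[1. 1. 1.]
              #  [2. 2. 2.]] -- row i is x_i repeated, as derived
```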
Lastly, the gradient of $b$ follows from
$$\begin{eqnarray} \frac{\partial L}{\partial b} &=& \left(\frac{\partial L}{\partial b_1},\frac{\partial L}{\partial b_2},\frac{\partial L}{\partial b_3}\right)\\ &=& \left(\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial b_1} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial b_1} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial b_1},\ \frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial b_2} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial b_2} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial b_2},\ \frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial b_3} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial b_3} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial b_3}\right) \\ &=& \left(\frac{\partial L}{\partial y_1}\cdot 1 + 0 + 0,\ 0 + \frac{\partial L}{\partial y_2}\cdot 1 + 0,\ 0 + 0 + \frac{\partial L}{\partial y_3}\cdot 1\right) \\ &=& \left(\frac{\partial L}{\partial y_1},\frac{\partial L}{\partial y_2},\frac{\partial L}{\partial y_3}\right)\\ &=& \frac{\partial L}{\partial y} \end{eqnarray}$$
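Putting the three results together, here is a sketch of how they would appear in an Affine layer's backward pass. The function name `affine_backward` and the sample values are my own; `dL_dy` stands for the upstream gradient $\frac{\partial L}{\partial y}$:

```python
import numpy as np

def affine_backward(dL_dy, x, w):
    """Gradients of an Affine layer y = x @ w + b, given upstream dL_dy."""
    dL_dx = dL_dy @ w.T   # gradient with respect to the input x
    dL_dw = x.T @ dL_dy   # gradient with respect to the weights w
    dL_db = dL_dy         # gradient with respect to the bias b
    return dL_dx, dL_dw, dL_db

x = np.array([[1.0, 2.0]])
w = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
dL_dy = np.array([[0.1, 0.2, 0.3]])

dx, dw, db = affine_backward(dL_dy, x, w)
print(dx)  # [[1.4 3.2]]
```

With a batch of inputs, `dL_db` would additionally be summed over the batch axis, since every row of the batch shares the same $b$.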