Sunday, September 30, 2018

Gradient of Affine transformation

When it comes to neural networks, the affine transformation is used constantly, and for stochastic gradient descent its gradient is required. Needless to say, you can find it by putting words such as "gradient" and "affine transformation" into a Google search form. However, in this article, I'm gonna share with you the way of deriving the gradient of the affine transformation.

0. What is an Affine transformation?

An affine transformation is a combination of a linear transformation and a translation. In this article, to keep it simple, we will deal with a 2-dimensional vector $x= (x_1, x_2)$, a 2 by 3 matrix $w=\begin{pmatrix} w_{11},w_{12},w_{13}\\w_{21},w_{22},w_{23}\end{pmatrix}$, and a 3-dimensional vector $b=(b_1,b_2,b_3)$. In that case, the affine transformation $y = xw + b$ takes the form,

$$\begin{eqnarray}y &=& xw + b \\ &=& (x_1, x_2)\begin{pmatrix} w_{11},w_{12},w_{13}\\w_{21},w_{22},w_{23}\end{pmatrix} + (b_1,b_2,b_3) \\ &=& \begin{pmatrix}w_{11}x_1 + w_{21}x_2+b_1,w_{12}x_{1}+w_{22}x_2 + b_2,w_{13}x_1+w_{23}x_2+b_3\end{pmatrix}\end{eqnarray}$$
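Here is a minimal NumPy sketch of this forward computation (the numbers are made up, but the shapes match the example above: $x$ is a 2-vector, $w$ a 2 by 3 matrix, $b$ a 3-vector):

import numpy as np

# made-up example matching the shapes above
x = np.array([1.0, 2.0])                      # shape (2,)
w = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])               # shape (2, 3)
b = np.array([0.01, 0.02, 0.03])              # shape (3,)

y = x @ w + b                                 # affine transformation y = xw + b
print(y)                                      # [0.91 1.22 1.53]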

1. Differentiation of a composite function

For the sake of deriving the gradient of the affine transformation, the differentiation of a composite function (i.e. the chain rule) ought to be understood. Let $z = f(x,y)$, $x=g(t)$, and $y=h(t)$. Then the derivative of $z$ with respect to $t$ can be computed as below,

$$\frac{d z}{d t} = \frac{\partial z}{\partial x}\frac{d x}{d t}+\frac{\partial z}{\partial y}\frac{d y}{d t}$$
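As a tiny concrete check (the functions are my own made-up example, not from the text), take $z = xy$ with $x = t^2$ and $y = t^3$. The chain rule gives $\frac{dz}{dt} = y \cdot 2t + x \cdot 3t^2 = 5t^4$, which agrees with a finite-difference estimate:

import numpy as np

t = 1.3
analytic = 5 * t ** 4                         # chain-rule result for z = (t**2) * (t**3)

z = lambda t: (t ** 2) * (t ** 3)             # z as a direct function of t
h = 1e-5
numeric = (z(t + h) - z(t - h)) / (2 * h)     # central finite difference

print(analytic, numeric)                      # both approximately 14.28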

2. Derivation of gradient of x

Let $L$ be a scalar function of $y = xw + b$. The gradient with respect to $x$ can be derived as follows.

$$\begin{eqnarray} \frac{\partial L}{\partial x} &=& \left(\frac{\partial L}{\partial x_1},\frac{\partial L}{\partial x_2}\right)\\ &=& \left(\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial x_1} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial x_1} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial x_1},\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial x_2} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial x_2} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial x_2}\right) \\ &=& \left(\frac{\partial L}{\partial y_1}w_{11} + \frac{\partial L}{\partial y_2}w_{12} + \frac{\partial L}{\partial y_3}w_{13},\frac{\partial L}{\partial y_1}w_{21} + \frac{\partial L}{\partial y_2}w_{22} + \frac{\partial L}{\partial y_3}w_{23}\right) \\ &=& \frac{\partial L}{\partial y}\begin{pmatrix} w_{11},w_{21}\\w_{12},w_{22}\\w_{13},w_{23}\end{pmatrix}\\ &=& \frac{\partial L}{\partial y} w^T \end{eqnarray}$$
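This result is easy to verify numerically. Below is a minimal sketch in which $L = \sum_j y_j^2$ is an arbitrary scalar function chosen just for the check; the finite-difference gradient with respect to $x$ should match $\frac{\partial L}{\partial y} w^T$:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2)
w = rng.normal(size=(2, 3))
b = rng.normal(size=3)

L = lambda x: np.sum((x @ w + b) ** 2)        # arbitrary scalar function of y = xw + b

dLdy = 2 * (x @ w + b)                        # dL/dy for this particular L
analytic = dLdy @ w.T                         # dL/dx = dL/dy w^T

eps = 1e-6
numeric = np.array([(L(x + eps * e) - L(x - eps * e)) / (2 * eps)
                    for e in np.eye(2)])      # central finite differences
print(np.allclose(analytic, numeric))         # True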

3. Derivation of gradient of w

For $w$, the gradient can be derived as follows,

$$\begin{eqnarray} \frac{\partial L}{\partial w} &=& \begin{pmatrix} \frac{\partial L}{\partial w_{11}},\frac{\partial L}{\partial w_{12}},\frac{\partial L}{\partial w_{13}}\\ \frac{\partial L}{\partial w_{21}},\frac{\partial L}{\partial w_{22}},\frac{\partial L}{\partial w_{23}}\end{pmatrix}\\ &=& \begin{pmatrix} \frac{\partial L}{\partial y_{1}}\frac{\partial y_{1}}{\partial w_{11}}+ \frac{\partial L}{\partial y_{2}}\frac{\partial y_{2}}{\partial w_{11}} + \frac{\partial L}{\partial y_{3}}\frac{\partial y_{3}}{\partial w_{11}},\cdots,\cdots\\ \frac{\partial L}{\partial y_{1}}\frac{\partial y_{1}}{\partial w_{21}}+ \frac{\partial L}{\partial y_{2}}\frac{\partial y_{2}}{\partial w_{21}} + \frac{\partial L}{\partial y_{3}}\frac{\partial y_{3}}{\partial w_{21}},\cdots,\cdots\end{pmatrix}\\ &=& \begin{pmatrix} \frac{\partial L}{\partial y_{1}}x_1+ \frac{\partial L}{\partial y_{2}}0 + \frac{\partial L}{\partial y_{3}}0,\frac{\partial L}{\partial y_{1}}0+ \frac{\partial L}{\partial y_{2}}x_1 + \frac{\partial L}{\partial y_{3}}0,\frac{\partial L}{\partial y_{1}}0+ \frac{\partial L}{\partial y_{2}}0 + \frac{\partial L}{\partial y_{3}}x_1\\ \frac{\partial L}{\partial y_{1}}x_2+ \frac{\partial L}{\partial y_{2}}0 + \frac{\partial L}{\partial y_{3}}0,\frac{\partial L}{\partial y_{1}}0+ \frac{\partial L}{\partial y_{2}}x_2 + \frac{\partial L}{\partial y_{3}}0,\frac{\partial L}{\partial y_{1}}0+ \frac{\partial L}{\partial y_{2}}0 + \frac{\partial L}{\partial y_{3}}x_2\end{pmatrix}\\ &=&\begin{pmatrix} \frac{\partial L}{\partial y_{1}}x_1,\frac{\partial L}{\partial y_{2}}x_1,\frac{\partial L}{\partial y_{3}}x_1\\ \frac{\partial L}{\partial y_{1}}x_2 ,\frac{\partial L}{\partial y_{2}}x_2 ,\frac{\partial L}{\partial y_{3}}x_2\end{pmatrix}\\ &=& x^T \frac{\partial L}{\partial y} \end{eqnarray}$$
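The same kind of finite-difference check works for $w$ (again with the arbitrary choice $L = \sum_j y_j^2$):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2)
w = rng.normal(size=(2, 3))
b = rng.normal(size=3)

L = lambda w: np.sum((x @ w + b) ** 2)        # arbitrary scalar function of y = xw + b

dLdy = 2 * (x @ w + b)
analytic = np.outer(x, dLdy)                  # dL/dw = x^T dL/dy, shape (2, 3)

eps = 1e-6
numeric = np.zeros_like(w)
for i in range(2):
    for j in range(3):
        dw = np.zeros_like(w)
        dw[i, j] = eps
        numeric[i, j] = (L(w + dw) - L(w - dw)) / (2 * eps)
print(np.allclose(analytic, numeric))         # True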

4. Derivation of gradient of b

Lastly, the gradient of $b$ is derived as follows,

$$\begin{eqnarray} \frac{\partial L}{\partial b} &=& \left(\frac{\partial L}{\partial b_1},\frac{\partial L}{\partial b_2},\frac{\partial L}{\partial b_3}\right)\\ &=& \left(\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial b_1} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial b_1} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial b_1},\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial b_2} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial b_2} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial b_2},\frac{\partial L}{\partial y_1}\frac{\partial y_1}{\partial b_3} + \frac{\partial L}{\partial y_2}\frac{\partial y_2}{\partial b_3} + \frac{\partial L}{\partial y_3}\frac{\partial y_3}{\partial b_3}\right) \\ &=& \left(\frac{\partial L}{\partial y_1}1 + 0 + 0,0 + \frac{\partial L}{\partial y_2}1 + 0,0 + 0 + \frac{\partial L}{\partial y_3}1\right) \\ &=& \left(\frac{\partial L}{\partial y_1},\frac{\partial L}{\partial y_2},\frac{\partial L}{\partial y_3}\right)\\ &=& \frac{\partial L}{\partial y} \end{eqnarray}$$
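Putting the three results together, here is a minimal sketch of an Affine layer for a single input vector (the class name and interface are my own choices, in the style often used for hand-written neural nets):

import numpy as np

class Affine:
    """Affine layer y = x w + b, with the gradients derived in sections 2-4."""

    def __init__(self, w, b):
        self.w = w                               # shape (in_dim, out_dim)
        self.b = b                               # shape (out_dim,)

    def forward(self, x):
        self.x = x                               # cache the input for the backward pass
        return x @ self.w + self.b

    def backward(self, dLdy):
        self.dLdw = np.outer(self.x, dLdy)       # dL/dw = x^T dL/dy
        self.dLdb = dLdy                         # dL/db = dL/dy
        return dLdy @ self.w.T                   # dL/dx = dL/dy w^T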

Friday, September 21, 2018

K-Nearest Neighbors

I will share a little bit about k-Nearest Neighbors in this article.

0. What is k-Nearest Neighbors ?

I believe "k-Nearest Neighbors" is one of the simplest predictive model. It doesn't require mathematical assumptions. In a nutshell, k-Nearest Neighbors can be explained as following steps,

  1. Choose the value of k and a distance metric.
  2. Find the k nearest neighbors of the sample to be classified.
  3. Assign the class determined by a majority vote of the neighbors obtained in step 2.

Next I will share some characteristics of "k-Nearest Neighbors".

  • There is no cost during the learning process; all the work happens at prediction time.
  • It is a non-parametric model. (*)
  • It's not gonna help you understand the reason for the phenomenon you are looking at.
  • For data with high dimensionality, it's probably wise to apply dimensionality reduction first, because with high dimensionality distances tend to become large and all data points end up looking similarly far apart (the curse of dimensionality).

(*) A parametric model is one for which we estimate parameters from the training dataset in order to learn a function that can classify new data points without requiring the original training dataset, such as logistic regression and a linear SVM. A non-parametric model, on the other hand, doesn't have a fixed set of parameters; examples are the decision tree and the kernel SVM. Incidentally, you can check decision trees here.

1. Distance

So as to classify with "k-Nearest Neighbors", the calculation of a distance is required. In this section, we're gonna look at the Minkowski distance. Let $x$ and $y$ be N-dimensional vectors; the Minkowski distance takes the form,

$$d(x,y) = \left( \sum_{i=1}^N |x_i - y_i|^p\right)^{\frac{1}{p}}$$

Then, if $p=1$, it's called the "Manhattan distance". In the case $p=2$, it's called the "Euclidean distance".
The Minkowski distance is simply implemented as follows.

In [1]:
import numpy as np
In [2]:
def distance(x, y, p=2):
    """
    Compute the Minkowski distance between numpy arrays x and y.
    p=1 gives the Manhattan distance, p=2 the Euclidean distance.
    """
    return (np.absolute(x - y) ** p).sum() ** (1 / p)
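For example, with $p=1$ and $p=2$ we recover the Manhattan and Euclidean distances (made-up vectors):

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(distance(x, y, p=1))   # Manhattan distance: 5.0
print(distance(x, y, p=2))   # Euclidean distance: sqrt(13), about 3.606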

2. Implementation

There is an issue we have to deal with: what if the number of votes is the same for different classes? There are three common tactics, as follows,

  • Pick one of the tied classes at random.
  • Weight the votes by distance.
  • Reduce k until we find a unique winner.

The following implements the third tactic :)

In [3]:
def majority_vote(labels):
    """
    Assume that labels are ordered from nearest to farthest.
    """

    unique_label, unique_count = np.unique(labels, return_counts=True)
    maxarg = np.where(unique_count == np.max(unique_count))[0]

    if maxarg.shape[0] == 1:
        return unique_label[maxarg]
    else:
        # tie: drop the farthest label and vote again
        return majority_vote(labels[:-1])
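For instance, with a tie between the made-up labels 'a' and 'b' below, the farthest label is dropped until a unique winner remains:

labels = np.array(['a', 'b', 'a', 'b'])   # ordered from nearest to farthest
print(majority_vote(labels))              # ['a'] (returned as a length-1 array)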

Now we're ready to implement "k-Nearest Neighbors".

In [4]:
def knn_classify(k, x, y, new_point):

    # Sort the training points by distance to new_point
    # and take the labels of the k closest ones as candidates.
    candidate = y[np.argsort(np.array([
        distance(data, new_point) for data in x]))][:k]
    winner = majority_vote(candidate)

    return winner
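A quick check on a toy 2-D dataset (made up here just for illustration):

# two small clusters labeled 0 and 1
x_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_classify(3, x_train, y_train, np.array([0.2, 0.3])))   # [0]
print(knn_classify(3, x_train, y_train, np.array([0.8, 1.0])))   # [1]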

Tuesday, September 18, 2018

Tips for linear algebra

Recently, I was asked about some equations in Pattern Recognition and Machine Learning by a friend of mine. Hence, I will share some of them here :) In this article, we're gonna work on the two following equations on page 80. (Let $\Sigma$ be the variance-covariance matrix, $\vec{u}$ an eigenvector, and $\lambda$ an eigenvalue.)
$$ \Sigma = \sum_{i=1}^{D}\lambda_i \vec{u}_i \vec{u}_i^T \tag{2.48}$$ $$ \Sigma^{-1} = \sum_{i=1}^{D}\frac{1}{\lambda_i} \vec{u}_i \vec{u}_i^T \tag{2.49}$$

1. Prerequisite Knowledge

To understand the above two equations, there are two pieces of prerequisite knowledge.

  • Inverse matrix of a diagonal matrix
    Suppose we have an n x n diagonal matrix as follows,
$$D = \begin{pmatrix} \lambda_1& & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_n \end{pmatrix}$$
    This matrix is *invertible* if all of the elements on the main diagonal are non-zero. Then the *inverse matrix* of $D$ has the *reciprocals* of those elements on its main diagonal, as below. $$D^{-1} = \begin{pmatrix} \frac{1}{\lambda_1}& & & 0 \\ & \frac{1}{\lambda_2} & & \\ & & \ddots & \\ 0 & & & \frac{1}{\lambda_n} \end{pmatrix}$$
  • Inverse matrix of an orthogonal matrix
    If the matrix whose inverse we need is an orthogonal matrix, you can cut down on the time to compute the inverse. Let's say the matrix $Q$ is orthogonal. Then the inverse of $Q$ is simply its transpose, $Q^{-1} = Q^T$. (Both facts are verified numerically in the short sketch after this list.)
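Here is a quick NumPy check of the two facts above (the diagonal entries and the rotation angle are made-up values):

import numpy as np

# inverse of a diagonal matrix: reciprocals on the diagonal
D = np.diag([2.0, 3.0, 5.0])
print(np.allclose(np.linalg.inv(D), np.diag([1/2, 1/3, 1/5])))   # True

# inverse of an orthogonal matrix: its transpose
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])                  # a rotation matrix is orthogonal
print(np.allclose(np.linalg.inv(Q), Q.T))                        # True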

2. Derivation of equation

From the definition of an eigenvector,

$$\Sigma \vec{u}_i = \lambda_i \vec{u}_i$$

Therefore,

$$\Sigma (\vec{u}_1, \vec{u}_2, \dots, \vec{u}_D) = (\lambda_1\vec{u}_1, \lambda_2 \vec{u}_2, \dots, \lambda_D\vec{u}_D)\tag{1}$$

Since $(\vec{u}_1, \vec{u}_2, \dots, \vec{u}_D)$ is an orthogonal matrix (the eigenvectors of the symmetric matrix $\Sigma$ can be chosen orthonormal), we can apply the "inverse matrix of an orthogonal matrix" fact we discussed above: multiplying both sides of $(1)$ by $(\vec{u}_1, \vec{u}_2, \dots, \vec{u}_D)^T$ from the right,

$$\begin{eqnarray}\Sigma &=& (\lambda_1\vec{u}_1, \lambda_2 \vec{u}_2, \dots, \lambda_D\vec{u}_D)(\vec{u}_1, \vec{u}_2, \dots, \vec{u}_D)^T\\ \Sigma &=& (\lambda_1\vec{u}_1, \lambda_2 \vec{u}_2, \dots, \lambda_D\vec{u}_D)\begin{pmatrix}\vec{u}_1^T\\ \vec{u}_2^T\\ \vdots\\ \vec{u}_D^T \end{pmatrix}\end{eqnarray}$$

As a result we can get equation $(2.48)$ $$ \Sigma = \sum_{i=1}^{D}\lambda_i \vec{u}_i \vec{u}_i^T \tag{2.48}$$

Equation $(1)$ can also be expressed as below, $$\Sigma (\vec{u}_1, \vec{u}_2, \dots, \vec{u}_D) = (\vec{u}_1, \vec{u}_2, \dots, \vec{u}_D) \begin{pmatrix} \lambda_1& & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_D \end{pmatrix}$$

Multiplying both sides by the inverse of the covariance matrix from the left,

$$\Sigma^{-1}(\vec{u}_1, \vec{u}_2, \dots, \vec{u}_D) \begin{pmatrix} \lambda_1& & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_D \end{pmatrix}=(\vec{u}_1, \vec{u}_2, \dots, \vec{u}_D) $$

Now it is time to apply the inverse of the diagonal matrix we discussed above: multiply both sides by it from the right, and then by $(\vec{u}_1, \vec{u}_2, \dots, \vec{u}_D)^T$ from the right, using the orthogonality once more,

$$\Sigma^{-1} =(\vec{u}_1, \vec{u}_2, \dots, \vec{u}_D)\begin{pmatrix} \frac{1}{\lambda_1}& & & 0 \\ & \frac{1}{\lambda_2} & & \\ & & \ddots & \\ 0 & & & \frac{1}{\lambda_D} \end{pmatrix} \begin{pmatrix}\vec{u}_1^T\\\vec{u}_2^T\\ \vdots\\ \vec{u}_D^T \end{pmatrix}$$

Finally, we can get $(2.49)$ :), $$ \Sigma^{-1} = \sum_{i=1}^{D}\frac{1}{\lambda_i} \vec{u}_i \vec{u}_i^T \tag{2.49}$$
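Both identities are also easy to confirm numerically with NumPy (this sketch uses a covariance matrix built from random data, which is my own made-up example):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))
Sigma = np.cov(data, rowvar=False)               # 3 x 3 covariance matrix

lam, U = np.linalg.eigh(Sigma)                   # eigenvalues and orthonormal eigenvectors (columns of U)

# (2.48): Sigma = sum_i lambda_i u_i u_i^T
recon = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(3))
print(np.allclose(Sigma, recon))                 # True

# (2.49): Sigma^{-1} = sum_i (1 / lambda_i) u_i u_i^T
recon_inv = sum((1 / lam[i]) * np.outer(U[:, i], U[:, i]) for i in range(3))
print(np.allclose(np.linalg.inv(Sigma), recon_inv))   # True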