Friday, September 21, 2018

K-Nearest Neighbors

I will share a little bit about k-Nearest Neighbors in this article.

0. What is k-Nearest Neighbors?

I believe "k-Nearest Neighbors" is one of the simplest predictive model. It doesn't require mathematical assumptions. In a nutshell, k-Nearest Neighbors can be explained as following steps,

  1. Choose the value of "k" and a distance metric.
  2. Find the k nearest neighbors of the sample to be classified.
  3. Assign the class determined by the majority vote of the neighbors obtained in step 2.

Next, I will share some characteristics of "k-Nearest Neighbors".

  • There is no cost during the learning process; the model simply stores the training data.
  • Non-parametric model. (*)
  • It's not going to help you understand the reason behind the phenomenon you are looking at.
  • For data with high dimensionality, it's probably wise to apply dimensionality reduction first. In high-dimensional spaces, distances tend to become large and nearly uniform, so all data points end up looking similar.

(*) A parametric model is one where we estimate a fixed set of parameters from the training dataset to learn a function that can classify new data points without requiring the original training dataset, such as logistic regression and linear SVM. A non-parametric model, on the other hand, doesn't have a fixed set of parameters; examples are the decision tree and the kernel SVM. Incidentally, you can check my post on decision trees here.

1. Distance

In order to classify with "k-Nearest Neighbors", we need to calculate distances. In this section, we're going to look at the Minkowski distance. Let $x$ and $y$ be $N$-dimensional vectors; the Minkowski distance takes the form

$$d(x,y) = \left( \sum_{i=1}^{N} |x_i - y_i|^p \right)^{\frac{1}{p}}$$

If $p=1$, it's called the "Manhattan distance". In the case $p=2$, it's called the "Euclidean distance".
The Minkowski distance can be implemented simply as follows.

In [1]:
import numpy as np
In [2]:
def distance(x, y, p=2):
    """
    Compute the Minkowski distance between two vectors x and y.
    p=1 gives the Manhattan distance, p=2 the Euclidean distance.
    """
    return (np.absolute(x - y) ** p).sum() ** (1 / p)
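
As a quick sanity check, for the two sample points below (chosen just for illustration), the Manhattan distance should be $|3| + |4| = 7$ and the Euclidean distance $\sqrt{3^2 + 4^2} = 5$.

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

distance(x, y, p=1)   # Manhattan distance -> 7.0
distance(x, y, p=2)   # Euclidean distance -> 5.0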

2. Implementation

There is an issue we have to deal with: what if different classes receive the same number of votes? There are three common tactics, as follows,

  • Pick one of the tied classes at random.
  • Weigh the votes by distance.
  • Reduce k until we find a unique winner.

The following implements the third tactic :)

In [3]:
def majority_vote(labels):
    """
    Assume that labels are ordered from nearest to farthest.
    """

    # Count the votes for each unique label.
    unique_label, unique_count = np.unique(labels, return_counts=True)
    maxarg = np.where(unique_count == np.max(unique_count))[0]

    if maxarg.shape[0] == 1:
        # A unique winner exists.
        return unique_label[maxarg[0]]
    else:
        # Tie: drop the farthest neighbor and try again.
        return majority_vote(labels[:-1])
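
For instance, with the made-up labels below, "A" and "B" are tied at first; dropping the farthest labels one at a time eventually leaves "B" as the winner.

labels = np.array(["A", "B", "B", "A", "C"])
majority_vote(labels)   # tie between "A" and "B" is broken by dropping the farthest labels -> "B"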

Now we're ready to implement "k-nearest neighbors".

In [4]:
def knn_classify(k, x, y, new_point):

    # Sort the labels y by the distance of each point in x to new_point,
    # then keep the labels of the k nearest points as candidates.
    candidate = y[np.argsort(np.array([
        distance(data, new_point) for data in x]))][:k]
    winner = majority_vote(candidate)

    return winner
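
As a small sketch of how this can be used (the points and labels below are made up for illustration), a new point near each cluster is assigned that cluster's label.

# Two tiny clusters labelled "red" and "blue".
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [4.0, 4.0], [4.2, 3.9], [3.8, 4.1]])
labels = np.array(["red", "red", "red", "blue", "blue", "blue"])

knn_classify(3, X, labels, np.array([1.1, 0.9]))   # -> "red"
knn_classify(3, X, labels, np.array([4.1, 4.0]))   # -> "blue"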
