Machine Learning Notes (1): k-NN Algorithm and Implementation

Introduction

k-NN (k-nearest neighbors) is a classification and regression model. The idea of k-NN is to find the k nearest neighbors of an instance and predict the instance's class from their labels. The input is a feature vector and the output is its class. Majority voting is used to predict a new instance's class. k-NN is a supervised machine learning algorithm, so it needs a labeled dataset.

The three basic components of k-NN are:

  1. The value of k;
  2. The distance measure;
  3. The classification decision rule;

It is easy to see that k-NN can be implemented with a linear search, since the main job in the prediction phase is to find the nearest neighbors. Unfortunately, linear search is unacceptable for large datasets because of its high complexity in both time and space. A more general method is to use a kd tree, a binary tree with a special structure. With it, k-NN becomes an effective classification and regression model.

Main Contents

  1. Majority vote;
  2. Distance function;
  3. kd Tree;
  4. C++ implementation step by step;

The Mathematical Details

Majority vote

Majority vote is simple: we classify an instance with the label that occurs most frequently among its neighbors.

We note that:

$X = \{x_1, x_2, \dots, x_N\}$ is the set of instance feature vectors and $Y = \{y_1, y_2, \dots, y_N\}$ is the set of labels corresponding to the instances;

$N_k(x)$ is the set of the k nearest neighbors of instance $x$;

The objective function to optimize is:

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

where $I(y_i = c_j)$ is the indicator function, which equals 1 if $y_i = c_j$ is true and 0 otherwise;

So we can get the cost function (the misclassification rate) for majority vote:

$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

Minimizing the misclassification rate is thus equivalent to maximizing $\sum_{x_i \in N_k(x)} I(y_i = c_j)$, which is exactly what majority voting does.
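As a concrete illustration, here is a minimal C++ sketch of majority voting over the labels of the k nearest neighbors; the function name and the use of integer labels are my own choices for illustration:

```cpp
#include <unordered_map>
#include <vector>

// Return the most frequent label among the k nearest neighbors.
// Ties are broken by whichever label reached the highest count first.
int majorityVote(const std::vector<int>& neighborLabels) {
    std::unordered_map<int, int> counts;
    int bestLabel = -1;
    int bestCount = 0;
    for (int label : neighborLabels) {
        int c = ++counts[label];
        if (c > bestCount) {
            bestCount = c;
            bestLabel = label;
        }
    }
    return bestLabel;
}
```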

Distance function

In mathematics, the distance function is important because it is one of the basic elements that defines a normed vector space. In machine learning algorithms, it is used to compute the similarity between two feature vectors. Besides distance, there are many other similarity measures, such as cosine similarity or the Pearson correlation coefficient.

Let me introduce a commonly used distance, the $L_p$ distance:

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}$$

There are several special cases: when $p = 2$ it is called the Euclidean distance, when $p = 1$ it is called the Manhattan distance, and when $p = \infty$ it is the maximum of the coordinate-wise distances:

$$L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^2 \right)^{\frac{1}{2}}$$

$$L_1(x_i, x_j) = \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|$$

$$L_\infty(x_i, x_j) = \max_{l} \left| x_i^{(l)} - x_j^{(l)} \right|$$
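A minimal C++ sketch of the $L_p$ distance, assuming both vectors have the same length (the function name is mine; pass infinity for the $L_\infty$ case):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Compute the Lp distance between two feature vectors of equal length.
// Pass p = std::numeric_limits<double>::infinity() for L-infinity.
double lpDistance(const std::vector<double>& a,
                  const std::vector<double>& b,
                  double p) {
    if (std::isinf(p)) {
        double maxDiff = 0.0;
        for (std::size_t l = 0; l < a.size(); ++l)
            maxDiff = std::max(maxDiff, std::fabs(a[l] - b[l]));
        return maxDiff;
    }
    double sum = 0.0;
    for (std::size_t l = 0; l < a.size(); ++l)
        sum += std::pow(std::fabs(a[l] - b[l]), p);
    return std::pow(sum, 1.0 / p);
}
```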

k-NN

Mathematical Details

If we have a training dataset:

$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$$

and a distance measure has been given, we denote the set of the k nearest neighbors of $x$ by $N_k(x)$.

So we can predict $x$'s class $y$ by majority voting:

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$
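Putting the pieces together, a brute-force (linear-search) k-NN prediction could look like the following sketch. It assumes the lpDistance and majorityVote helpers from above, uses the Euclidean distance (p = 2), and requires k to be no larger than the training set size:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Predict the class of `query` by linear search over the training set.
// Assumes lpDistance() and majorityVote() defined earlier, and k <= X.size().
int knnPredict(const std::vector<std::vector<double>>& X,  // feature vectors
               const std::vector<int>& Y,                  // labels
               const std::vector<double>& query,
               int k) {
    // Compute the distance from the query to every training instance.
    std::vector<std::pair<double, int>> distLabel;
    distLabel.reserve(X.size());
    for (std::size_t i = 0; i < X.size(); ++i)
        distLabel.emplace_back(lpDistance(X[i], query, 2.0), Y[i]);

    // Move the k smallest distances to the front.
    std::partial_sort(distLabel.begin(), distLabel.begin() + k,
                      distLabel.end());

    std::vector<int> neighborLabels;
    for (int i = 0; i < k; ++i)
        neighborLabels.push_back(distLabel[i].second);
    return majorityVote(neighborLabels);
}
```

Each prediction scans all N training instances, which is exactly the cost that the kd tree below is meant to reduce.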

kd Tree

A kd tree is a binary-tree-like data structure that partitions a k-dimensional space. Most developers are familiar with how to construct a binary tree, so I will not describe that here. Constructing a kd tree is similar to constructing a binary tree, except that a kd tree's child nodes are generated by splitting the points based on their coordinates. For example, for a node at depth d, we can choose the l-th coordinate, with l = (d mod k) + 1, as the splitting axis, and split the points at the median value of that coordinate.

So the split strategy is central to building a balanced kd tree, which in turn determines how efficiently we can search it.
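To make the construction concrete, here is a minimal C++ sketch of building a kd tree by cycling through the coordinates and splitting at the median; the structure and names are my own, and nearest-neighbor search over the tree is left to the implementation section:

```cpp
#include <algorithm>
#include <memory>
#include <vector>

using Point = std::vector<double>;

struct KdNode {
    Point point;                    // the splitting point stored at this node
    int axis = 0;                   // coordinate used to split at this depth
    std::unique_ptr<KdNode> left;   // points with smaller coordinate on `axis`
    std::unique_ptr<KdNode> right;  // points with larger or equal coordinate
};

// Recursively build a kd tree over points[first, last).
// At depth d we split on axis d % k, at the median of that coordinate.
std::unique_ptr<KdNode> buildKdTree(std::vector<Point>& points,
                                    std::size_t first, std::size_t last,
                                    int depth) {
    if (first >= last) return nullptr;

    int k = static_cast<int>(points[first].size());
    int axis = depth % k;
    std::size_t mid = first + (last - first) / 2;

    // Place the median element on `axis` at position mid.
    std::nth_element(points.begin() + first, points.begin() + mid,
                     points.begin() + last,
                     [axis](const Point& a, const Point& b) {
                         return a[axis] < b[axis];
                     });

    auto node = std::make_unique<KdNode>();
    node->axis = axis;
    node->point = points[mid];
    node->left = buildKdTree(points, first, mid, depth + 1);
    node->right = buildKdTree(points, mid + 1, last, depth + 1);
    return node;
}
```

A full build would be invoked as buildKdTree(points, 0, points.size(), 0). Using std::nth_element instead of a full sort makes the median selection at each level linear on average, so construction takes O(N log N) expected time.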