Introduction
k-NN (k-nearest neighbours) is a classification and regression model. The goal of k-NN is to find the k nearest neighbours of an instance and use their labels to classify it. The input is a feature vector and the output is its class; majority voting is used to predict a new instance's class. k-NN is a supervised machine learning algorithm, so it needs a labelled dataset.
The three basic components of k-NN are:
- The value of k;
- The distance measure;
- The classification decision rule.
Most people can see that k-NN can be implemented by linear search, since the main job at prediction time is finding the nearest neighbours. Unfortunately, that is unacceptable for a large dataset because of its high time complexity. A more scalable method is the kd tree, a binary tree with a special structure that speeds up the neighbour search; with it, k-NN becomes an effective classification and regression model.
Main Contents
- Majority vote;
- Distance function;
- kd Tree;
- C++ implementation step by step;
The mathematical details
Majority vote
Majority voting is simple: we classify an instance with the label that occurs most frequently among its neighbours.
We note that:
- $X = \{x_1, x_2, \dots, x_N\}$ is the set of the instances' feature vectors and $Y = \{y_1, y_2, \dots, y_N\}$ is the set of labels corresponding to the instances, where each $y_i \in \{c_1, c_2, \dots, c_K\}$;
- $N_k(x)$ is the set of the k nearest neighbours of instance $x$.

The objective function to optimize is:

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

where $I(x = y)$, called the indicator function, equals 1 if $x = y$ is true and 0 otherwise.

So we can get the cost function for majority voting, the misclassification rate:

$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

Minimizing the misclassification rate is therefore equivalent to maximizing $\sum_{x_i \in N_k(x)} I(y_i = c_j)$, which is exactly what majority voting does.
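Since the article builds toward a C++ implementation step by step, here is a minimal sketch of the voting rule; the function name `majorityVote` and the use of `int` class labels are my own illustrative choices, and ties are broken toward the smaller label:

```cpp
#include <map>
#include <vector>

// Majority vote over the labels of the k nearest neighbours.
// Assumes `neighbourLabels` is non-empty; ties go to the smaller label.
int majorityVote(const std::vector<int>& neighbourLabels) {
    std::map<int, int> counts;
    for (int label : neighbourLabels) ++counts[label];

    int bestLabel = neighbourLabels.front();
    int bestCount = 0;
    for (const auto& [label, count] : counts) {
        if (count > bestCount) {
            bestCount = count;
            bestLabel = label;
        }
    }
    return bestLabel;
}
```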
Distance function
In mathematics, the distance function is important because it is one of the basic elements that define a normed vector space. In machine learning algorithms, it is used to compute the similarity between two feature vectors. By the way, there are many other similarity measures besides distance, such as cosine similarity or the Pearson correlation coefficient.
Let me introduce a commonly used distance, the Lp distance:

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}, \quad p \geq 1$$

There are several special cases: $p = 2$ is called the Euclidean distance, $p = 1$ is called the Manhattan distance, and $p = \infty$ is as follows:

$$L_\infty(x_i, x_j) = \max_{l} \left| x_i^{(l)} - x_j^{(l)} \right|$$
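A minimal C++ sketch of these formulas, assuming both vectors have the same dimension; the names `lpDistance` and `lInfDistance` are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Lp distance between two feature vectors of the same dimension.
double lpDistance(const std::vector<double>& a,
                  const std::vector<double>& b, double p) {
    double sum = 0.0;
    for (std::size_t l = 0; l < a.size(); ++l)
        sum += std::pow(std::fabs(a[l] - b[l]), p);
    return std::pow(sum, 1.0 / p);
}

// The p = infinity case: the largest coordinate-wise difference.
double lInfDistance(const std::vector<double>& a,
                    const std::vector<double>& b) {
    double best = 0.0;
    for (std::size_t l = 0; l < a.size(); ++l)
        best = std::max(best, std::fabs(a[l] - b[l]));
    return best;
}
```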
k-NN
Mathematical Details
If we have a training dataset

$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$$

and a distance measure has been given, we write the set of the k nearest neighbours of $x$ as $N_k(x)$. So we can predict $x$'s class $y$ by majority voting:

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad i = 1, 2, \dots, N; \; j = 1, 2, \dots, K$$
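Putting the pieces together, here is a sketch of the brute-force (linear search) prediction described in the introduction, reusing `lpDistance` and `majorityVote` from the sketches above; the `Sample` struct and `predict` signature are my own assumptions, and k is assumed to be at most the training-set size:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// One labelled training instance.
struct Sample {
    std::vector<double> x;  // feature vector
    int y;                  // class label
};

// Brute-force k-NN: compute the distance from `query` to every
// training sample, keep the k closest, and take a majority vote.
// Assumes 1 <= k <= training.size().
int predict(const std::vector<Sample>& training,
            const std::vector<double>& query, int k, double p = 2.0) {
    std::vector<std::pair<double, int>> distLabel;  // (distance, label)
    distLabel.reserve(training.size());
    for (const Sample& s : training)
        distLabel.emplace_back(lpDistance(s.x, query, p), s.y);

    // Partially sort so the k smallest distances come first.
    std::partial_sort(distLabel.begin(), distLabel.begin() + k,
                      distLabel.end());

    std::vector<int> labels;
    labels.reserve(k);
    for (int i = 0; i < k; ++i) labels.push_back(distLabel[i].second);
    return majorityVote(labels);
}
```

Every query scans all N training samples, which is exactly the cost the kd tree below is meant to avoid.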
kd Tree
A kd tree is a binary tree-like data structure that represents a partition of k-dimensional space. Most developers are familiar with constructing a binary tree, so I will not describe that again. Constructing a kd tree is similar, except that a kd tree's child nodes are generated by splitting the points based on their coordinates: for each node at depth j, we can choose the l-th coordinate, with l = (j mod k) + 1, as the splitting axis. So the split strategy is central to building the tree; splitting at the median of the chosen coordinate keeps the tree balanced, as the sketch below shows.
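Here is a hedged sketch of this construction in C++; `KdNode` and `buildKdTree` are illustrative names, and the median is found with `std::nth_element`, which reorders the points in place:

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// A node of a kd tree over k-dimensional points.
struct KdNode {
    std::vector<double> point;           // the splitting point stored here
    int axis = 0;                        // coordinate this node splits on
    std::unique_ptr<KdNode> left, right;
};

// Recursively build a kd tree from points[begin, end).
// A node at depth j splits on axis j mod k, at the median of that axis.
std::unique_ptr<KdNode> buildKdTree(std::vector<std::vector<double>>& points,
                                    std::size_t begin, std::size_t end,
                                    int depth, int k) {
    if (begin >= end) return nullptr;

    int axis = depth % k;
    std::size_t mid = begin + (end - begin) / 2;

    // Place the median element (along `axis`) at position `mid`.
    std::nth_element(points.begin() + begin, points.begin() + mid,
                     points.begin() + end,
                     [axis](const std::vector<double>& a,
                            const std::vector<double>& b) {
                         return a[axis] < b[axis];
                     });

    auto node = std::make_unique<KdNode>();
    node->point = points[mid];
    node->axis = axis;
    node->left = buildKdTree(points, begin, mid, depth + 1, k);
    node->right = buildKdTree(points, mid + 1, end, depth + 1, k);
    return node;
}
```

The whole tree is built with `buildKdTree(points, 0, points.size(), 0, k)`. Because each node takes the median, the two subtrees differ in size by at most one, so the tree has O(log N) depth.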