Machine Learning Notes (1): k-NN Algorithm and Implementation

Introduction

k-NN (k-nearest neighbors) is a classification and regression model. The idea of k-NN is to find the k nearest neighbors of an instance and predict the instance's class from their labels. The input is a feature vector and the output is its class. Majority voting is used to predict a new instance's class. k-NN is a supervised machine learning algorithm, so it needs a labeled dataset.

The three basic components of k-NN are:

  1. The value of k;
  2. The distance measure;
  3. The classification decision rule;

It is easy to see that k-NN can be implemented with a linear search, since the main job in the prediction phase is to find the nearest neighbors. Unfortunately, linear search is unacceptable for large datasets because of its high complexity in both time and space. A more general method is to use a kd tree, a binary tree with a special structure. With it, k-NN becomes an effective classification and regression model.

Main Contents

  1. Majority vote;
  2. Distance function;
  3. kd Tree;
  4. C++ implementation step by step;

The Mathematical Details

Majority vote

Majority vote is simple: we classify an instance with the label that occurs most frequently among its neighbors.

We note that:

$X = \{x_1, x_2, \dots, x_N\}$ is the set of instance feature vectors and $Y = \{y_1, y_2, \dots, y_N\}$ is the set of labels corresponding to the instances;

$N_k(x)$ is the set of the k nearest neighbors of instance $x$;

The objective function to optimize is:

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

where $I(y_i = c_j)$ is the indicator function, which equals 1 if $y_i = c_j$ is true and 0 otherwise;

So we can get the cost function (the misclassification rate) for majority vote:

$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

Minimizing the misclassification rate is thus equivalent to maximizing $\sum_{x_i \in N_k(x)} I(y_i = c_j)$, which is exactly what majority voting does.
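As a concrete illustration, here is a minimal C++ sketch of majority voting over the labels of the k nearest neighbors; the function name and the use of integer labels are my own choices for illustration:

```cpp
#include <unordered_map>
#include <vector>

// Return the most frequent label among the k nearest neighbors.
// Ties are broken by whichever label reached the highest count first.
int majorityVote(const std::vector<int>& neighborLabels) {
    std::unordered_map<int, int> counts;
    int bestLabel = -1;
    int bestCount = 0;
    for (int label : neighborLabels) {
        int c = ++counts[label];
        if (c > bestCount) {
            bestCount = c;
            bestLabel = label;
        }
    }
    return bestLabel;
}
```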

Distance function

In mathematics, the distance function is important because it is one of the basic elements that defines a normed vector space. In machine learning algorithms, it is used to compute the similarity between two feature vectors. Besides distance, there are many other similarity measures, such as cosine similarity or the Pearson correlation coefficient.

Let me introduce a commonly used distance, the $L_p$ distance:

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}$$

There are several special cases: when $p = 2$ it is called the Euclidean distance, when $p = 1$ it is called the Manhattan distance, and when $p = \infty$ it is the maximum of the coordinate-wise distances:

$$L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^2 \right)^{\frac{1}{2}}$$

$$L_1(x_i, x_j) = \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|$$

$$L_\infty(x_i, x_j) = \max_{l} \left| x_i^{(l)} - x_j^{(l)} \right|$$
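A minimal C++ sketch of the $L_p$ distance, assuming both vectors have the same length (the function name is mine; pass infinity for the $L_\infty$ case):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Compute the Lp distance between two feature vectors of equal length.
// Pass p = std::numeric_limits<double>::infinity() for L-infinity.
double lpDistance(const std::vector<double>& a,
                  const std::vector<double>& b,
                  double p) {
    if (std::isinf(p)) {
        double maxDiff = 0.0;
        for (std::size_t l = 0; l < a.size(); ++l)
            maxDiff = std::max(maxDiff, std::fabs(a[l] - b[l]));
        return maxDiff;
    }
    double sum = 0.0;
    for (std::size_t l = 0; l < a.size(); ++l)
        sum += std::pow(std::fabs(a[l] - b[l]), p);
    return std::pow(sum, 1.0 / p);
}
```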

k-NN

Mathematical Details

If we have a training dataset:

$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$$

and a distance measure has been given, we denote the set of the k nearest neighbors of $x$ by $N_k(x)$.

So we can predict $x$'s class $y$ by majority voting:

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$
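Putting the pieces together, a brute-force (linear-search) k-NN prediction could look like the following sketch. It assumes the lpDistance and majorityVote helpers from above, uses the Euclidean distance (p = 2), and requires k to be no larger than the training set size:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Predict the class of `query` by linear search over the training set.
// Assumes lpDistance() and majorityVote() defined earlier, and k <= X.size().
int knnPredict(const std::vector<std::vector<double>>& X,  // feature vectors
               const std::vector<int>& Y,                  // labels
               const std::vector<double>& query,
               int k) {
    // Compute the distance from the query to every training instance.
    std::vector<std::pair<double, int>> distLabel;
    distLabel.reserve(X.size());
    for (std::size_t i = 0; i < X.size(); ++i)
        distLabel.emplace_back(lpDistance(X[i], query, 2.0), Y[i]);

    // Move the k smallest distances to the front.
    std::partial_sort(distLabel.begin(), distLabel.begin() + k,
                      distLabel.end());

    std::vector<int> neighborLabels;
    for (int i = 0; i < k; ++i)
        neighborLabels.push_back(distLabel[i].second);
    return majorityVote(neighborLabels);
}
```

Each prediction scans all N training instances, which is exactly the cost that the kd tree below is meant to reduce.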

kd Tree

A kd tree is a binary-tree-like data structure that partitions a k-dimensional space. Most developers are familiar with how to construct a binary tree, so I will not describe that here. Constructing a kd tree is similar to constructing a binary tree, except that a kd tree's child nodes are generated by splitting the points based on their coordinates. For example, for a node at depth d, we can choose the l-th coordinate, with l = (d mod k) + 1, as the splitting axis, and split the points at the median value of that coordinate.

So the split strategy is central to building a balanced kd tree, which in turn determines how efficiently we can search it.
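To make the construction concrete, here is a minimal C++ sketch of building a kd tree by cycling through the coordinates and splitting at the median; the structure and names are my own, and nearest-neighbor search over the tree is left to the implementation section:

```cpp
#include <algorithm>
#include <memory>
#include <vector>

using Point = std::vector<double>;

struct KdNode {
    Point point;                    // the splitting point stored at this node
    int axis = 0;                   // coordinate used to split at this depth
    std::unique_ptr<KdNode> left;   // points with smaller coordinate on `axis`
    std::unique_ptr<KdNode> right;  // points with larger or equal coordinate
};

// Recursively build a kd tree over points[first, last).
// At depth d we split on axis d % k, at the median of that coordinate.
std::unique_ptr<KdNode> buildKdTree(std::vector<Point>& points,
                                    std::size_t first, std::size_t last,
                                    int depth) {
    if (first >= last) return nullptr;

    int k = static_cast<int>(points[first].size());
    int axis = depth % k;
    std::size_t mid = first + (last - first) / 2;

    // Place the median element on `axis` at position mid.
    std::nth_element(points.begin() + first, points.begin() + mid,
                     points.begin() + last,
                     [axis](const Point& a, const Point& b) {
                         return a[axis] < b[axis];
                     });

    auto node = std::make_unique<KdNode>();
    node->axis = axis;
    node->point = points[mid];
    node->left = buildKdTree(points, first, mid, depth + 1);
    node->right = buildKdTree(points, mid + 1, last, depth + 1);
    return node;
}
```

A full build would be invoked as buildKdTree(points, 0, points.size(), 0). Using std::nth_element instead of a full sort makes the median selection at each level linear on average, so construction takes O(N log N) expected time.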