Revision as of 22:32, 10 March 2008 by Yoder2 (Talk)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

k-Nearest Neighbor Algorithm

This algorithm is based on the observation that a sample that has features that are similar to the ones of points of one particular class it belongs to that class. These points are known as nearest neighbors. The parameter k specifies the number of neighbors (neighboring points) used to classify one particular sample point. Finally, the assignment of a sample to a particular class is done by having the k neighbors considered to "vote". In this fashion, the class represented by the largest number of points among the neighbors ought to be the class that the sample belongs to. The so called Nearest Neighbor algorithm is the particular instance of the kNN when k=1.

Consider the set of points in the feature space in the figure below:

KNNbase Old Kiwi.JPG


The next 3 figures illustrate the sample point in the feature space and neighbors for k={1,2,3}:

For k=1, we have:

KNNk1 Old Kiwi.JPG

For k=2, we have:

KNNk2 Old Kiwi.JPG

For k=3, we have:

KNNk3 Old Kiwi.JPG

Some of the important properties of the kNN algorithm are listed below:

  • The kNN can be used to classify data without requiring model building, this is called "instance-based learning".
  • A measurement of the distance between data points should be available
  • The kNN classification is based solely on local information (only the k nearest neighboring data points are inspected during the classification process).
  • The decision boundaries produced by the kNN classification can of of arbitrary shape
  • The classification is sensitive to the correct selection of k. If k is too small it will lead to "over-fitting"

The Algorithm's pseudo-code

Consider k as the desired number of nearest neighbors and S:={p_1, ... , p_n} be the set of training samples in the form p_1 = (x_i, c_i), where x_i is the d-dimensional feature vector of the point p_i and c_i is the class that p_i belongs to.

For each p'=(x',c')

  1. Compute the distance d(x', $ x_i $) between p' and all $ p_i \in S $
  2. Sort all points p_i according to the key d(x',$ x_i $)
  3. Select the first k points from the sorted list, those are the k closest training samples to p'
  4. Assign a class to p' based on majority vote: $ c' = arg \max_y \sum_{(x_i,c_i) \in S} I(y=c_i) $

end


Code available

Instance-Based learning Java Implementation

References

"Introduction to Data Mining", P. Tan, M. Steinbach and V. Kumar, Addison-Wesley, 2006.

Efficient Implementation of Nearest Neighbor Classification

Efficient Implementation of Nearest Neighbor Classification (Slides)

k-Nearest Neighbor (Wikipedia)

Alumni Liaison

Prof. Math. Ohio State and Associate Dean
Outstanding Alumnus Purdue Math 2008

Jeff McNeal