
Machine Learning

k-Nearest Neighbour

KOTA BARU PARAHYANGAN BANDUNG

k-NN

k-NN Overview
• The k-NN algorithm is among the simplest of all
machine learning algorithms
• Supervised learning
• Data are represented in a vector space
• The most basic instance-based method
– A disadvantage of instance-based methods is that the cost of classifying new instances can be high
– Nearly all computation takes place at classification time rather than at learning time
k-NN Overview
• k-NN is a non-parametric, lazy learning algorithm
– Non-parametric: it makes no assumptions about the underlying data distribution. In the “real world”, most data do not obey the typical theoretical assumptions made (as in linear regression models, for example)
– Lazy: it does not use the training data points to do any generalization. In other words, there is no explicit training phase, or it is very minimal. This is in contrast to so-called “eager learning” algorithms, which carry out learning before seeing any test example and can discard the training examples after learning
• Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point
• k-NN can be used for classification
k-NN Overview
• Learning = storing all training instances
• Classification = estimating the target function value for a new instance (sketched below)
• Referred to as “lazy” learning
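As a rough illustration of this “learning = storing” idea, here is a minimal sketch assuming scikit-learn is available; the toy data and variable names are made up for illustration and do not come from these slides.

```python
# Minimal sketch of "lazy" k-NN learning, assuming scikit-learn is installed;
# the toy data below is illustrative only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D training instances from two classes, represented in a vector space
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array([0, 0, 1, 1])

# "Learning" = essentially just storing the training instances (fit is almost free)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# "Classification" = where the real work happens: distances to the stored
# instances are computed only now, when a query point arrives
print(clf.predict([[5.5, 5.0]]))  # majority label among the 3 nearest neighbours
```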
k-NN Applications
• Credit ratings: comparing a person’s financial characteristics with those of people with similar financial profiles in a database
• Political science: classifying a potential voter as “will vote” or “will not vote”, or as “vote Democrat” or “vote Republican”
• Handwriting recognition (as in OCR), image recognition, and even video recognition
k-NN Architecture
k-NN Algorithm
• Given a new set of measurements,
perform the following test:
– Find (using Euclidean distance, for example) the k nearest entities from the training set.
These entities have known labels. The choice
of k is left to us.
– Among these k entities, which label is most
common? That is the label for the unknown
entity.
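A minimal from-scratch sketch of the procedure just described (Euclidean distance, then a majority vote among the k nearest labelled entities); the function name and toy arrays are illustrative, not taken from the slides.

```python
# From-scratch k-NN classification sketch (NumPy only).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query to every stored training entity
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest entities (whose labels are known)
    nearest = np.argsort(distances)[:k]
    # The most common label among the k nearest entities is the answer
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array(["neg", "neg", "pos", "pos"])
print(knn_predict(X_train, y_train, np.array([5.5, 5.0]), k=3))  # -> pos
```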
k-NN Algorithm
• Distance Metric
• k-Nearest Neighbor Predictions
• Distance Weighting
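One common way to make these three ingredients concrete is sketched below: Euclidean distance, a plain majority vote, and an inverse-square distance weighting. The notation (a_r(x) for the r-th attribute of instance x, V for the set of class labels, and the indicator δ(a, b) = 1 if a = b, else 0) is introduced here for illustration and is not defined in the slides.

```latex
% Euclidean distance between query x_q and stored instance x_i
d(x_q, x_i) = \sqrt{\sum_{r=1}^{n} \bigl(a_r(x_q) - a_r(x_i)\bigr)^{2}}

% Unweighted prediction: majority vote over the k nearest neighbours
\hat{f}(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta\bigl(v, f(x_i)\bigr)

% Distance-weighted variant: closer neighbours receive larger votes
w_i = \frac{1}{d(x_q, x_i)^{2}}, \qquad
\hat{f}(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i \,\delta\bigl(v, f(x_i)\bigr)
```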
k-NN Algorithm – Example
• For each training example <x, f(x)>, add the example to the list training_examples
• Given a query instance xq to be classified:
– Let x1, x2, ..., xk denote the k instances from training_examples that are nearest to xq
– Return the class that is most common among these k instances
• If k = 5, then in this case the query instance xq will be classified as negative, since three of its five nearest neighbors are classified as negative
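The k = 5 case on this slide can be checked directly with a majority vote; the neighbour labels below are assumed from the text (three negative, two positive), since the underlying figure is not reproduced here.

```python
# Majority vote for the slide's k = 5 example; labels assumed from the text.
from collections import Counter

nearest_labels = ["neg", "neg", "neg", "pos", "pos"]  # labels of the 5 nearest neighbours
print(Counter(nearest_labels).most_common(1)[0][0])   # -> neg: xq is classified as negative
```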
k-NN Summary
• Advantages
– Learning is extremely simple and intuitive
– Very flexible decision boundaries
– Variable-sized hypothesis space
• Disadvantages
– distance function must be carefully chosen or tuned
– irrelevant or correlated features have high impact and
must be eliminated
– typically cannot handle high dimensionality
– computational costs: memory and classification-time
computation
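Because the Euclidean distance is dominated by features on large scales, a common mitigation for the distance-tuning and feature-impact issues above is to standardize the features before applying k-NN. A minimal sketch, assuming scikit-learn and made-up data:

```python
# Sketch: standardize features before k-NN so that no single attribute dominates
# the Euclidean distance (assumes scikit-learn; the data is illustrative only).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# The second feature is on a much larger scale than the first
X_train = np.array([[1.0, 100.0], [1.5, 120.0], [5.0, 110.0], [6.0, 130.0]])
y_train = np.array([0, 0, 1, 1])

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
print(model.predict([[5.5, 115.0]]))
```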
When to Consider Nearest Neighbor?
• Lots of training data and no training stage; all the work is done during the test stage
• Advantages:
– Can be applied to data from any distribution
• for example, data does not have to be separable with a linear
boundary
– Training is very fast
– Learn complex target functions
– Don’t lose information
• Disadvantages:
– Choosing k may be tricky
– Slow at query time
– Easily fooled by irrelevant attributes
– Need large number of samples for accuracy