
K-Nearest Neighbors

Day 4
Introduction
At its core, the algorithm says:

• Pick the number of neighbors (K) you want to use for classification or regression
• Choose a method to measure distances
• Keep a data set of labeled records
• For every new point, identify the K nearest neighbors using the distance measure you chose
• Let them vote if it is a classification problem, or take a mean/median for regression (a minimal sketch of these steps follows below)
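As a minimal sketch of these steps, assuming plain Euclidean distance and an in-memory NumPy array (the function name and toy data are illustrative, not from the original notes):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    # Measure Euclidean distance from x_new to every stored record
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Pick the indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    if task == "classification":
        # Let the neighbors vote
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # ...or take the mean for regression
    return float(np.mean(y_train[nearest]))

# Toy usage: two points of class 0, two of class 1
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 0.9]), k=3))   # prints 1

Note that there is no training step beyond storing X and y; all the work happens at prediction time.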
Diagrammatically
• Here, K = 1. The new green point is labeled black, as its single nearest neighbor is black.
• Here, K = 3. The new green point is labeled white based on the vote of its three nearest neighbors.
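The same flip between K = 1 and K = 3 can be reproduced with scikit-learn's KNeighborsClassifier; the 2-D points below are made up for illustration and are not the figure's data:

from sklearn.neighbors import KNeighborsClassifier

# Illustrative 2-D points: 0 = white, 1 = black
X = [[0.0, 0.0], [0.3, 0.2], [0.4, 0.0], [1.0, 1.0]]
y = [0, 0, 0, 1]
new_point = [[0.8, 0.8]]

for k in (1, 3):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict(new_point))
# k=1 -> [1] (the single closest point is black)
# k=3 -> [0] (two of the three neighbors are white)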
From the algorithm, clearly KNN is

• Lazy: This is a technical term! All the techniques we learned so far have a phase called the "training phase" and try to identify a function from the training set, then apply this function to the test data. Such learning is called "eager learning". K-NN, on the other hand, does not generalize and uses all the training data (or a subset) in the testing phase. This type of learning is called lazy learning or instance-based learning.
• K-NN requires more prediction time, as all data points must be examined to make each decision.
• It requires more memory, as all the training data needs to be stored. Each query costs roughly O(N·d) distance computations, where N is the number of training examples and d is the dimension of each sample, so it is very expensive for large data sets and large dimensions.
• Hence, a lot of effort must be spent in reducing N and d. K-NN does suffer from the curse of dimensionality. (A short sketch of this per-query cost follows below.)
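A rough sketch of that per-query cost, assuming a brute-force NumPy scan; the sizes N and d are arbitrary choices for illustration:

import numpy as np

# Hypothetical sizes, chosen only to make the O(N*d) per-query cost visible
N, d = 100_000, 50
X_train = np.random.rand(N, d)           # "training" is just storing this array
y_train = np.random.randint(0, 2, N)

x_query = np.random.rand(d)
# Every single prediction computes a distance to all N stored points
dists = np.linalg.norm(X_train - x_query, axis=1)   # ~N*d arithmetic operations
k_nearest = np.argsort(dists)[:5]
prediction = int(y_train[k_nearest].sum() > 2)      # majority vote among 5 neighbors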
Attributes
Handling curse of dimensionality
• K-NN is heavily impacted by a huge number of dimensions
• Reduce the dimensions using
– Correlation, Principal Component Analysis
– Gain Ratio, Information Gain (filter approaches: we may lose some features that are important)
– Wrapper methods (forward selection, backward elimination)
– Weighting the attributes
• Scaling the attributes
– Attributes with a larger range can dominate the distance. To understand this, consider the pair of data points (0.1, 20) and (0.9, 720): the distance is almost completely dominated by (720 - 20) = 700. To avoid this, we standardize attributes to force them onto a common value range. The common techniques include
– Taking logarithms when one variable varies over several orders of magnitude
– Dividing by the highest value to get the variables between 0 and 1
– Standardizing, which brings most of the data to between -3 and 3 (see the preprocessing pipeline sketch after this list)
• Categorical and ordinal variables need to be converted to numeric values
• Handling missing values
– K-NN is impacted heavily by missing values
– Imputation is one option
• Handling overfitting
– Remove outliers (Wilson Editing)
• Speeding up KNN
– Condensation
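One possible way to combine scaling and dimensionality reduction before kNN is a scikit-learn Pipeline; the dataset and the choices of n_components and n_neighbors below are illustrative only:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale first so no attribute dominates the distance, then reduce d with PCA
knn = Pipeline([
    ("scale", StandardScaler()),           # standardize each attribute
    ("pca", PCA(n_components=5)),          # reduce d (5 is an arbitrary choice)
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
knn.fit(X_tr, y_tr)
print(knn.score(X_te, y_te))

Keeping the scaler and PCA inside the pipeline ensures they are fitted only on the training split and then applied unchanged to the test split.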
Feature Engineering
• kNN produces complex decision surfaces.
• As the complexity of the decision surface increases, generalization accuracy tends to decrease and we need more data.
• Increase K to decrease over-fitting.
• kNN gives no explicability (no interpretable model is produced).
• kNN is a distance-based method, so it handles only numeric variables; convert categorical/ordinal values into numeric ones.
• kNN works well in batch mode, not in real time.
• kNN fails when there are missing values (use kNN imputation in data pre-processing to fill them, as sketched below).
• In kNN, training is easy but predictions are expensive.
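A minimal sketch of kNN imputation using scikit-learn's KNNImputer; the toy matrix and the n_neighbors value are illustrative:

import numpy as np
from sklearn.impute import KNNImputer

# Tiny illustrative matrix with one missing value (np.nan)
X = np.array([[1.0, 2.0], [1.1, np.nan], [5.0, 6.0], [5.2, 6.1]])

# Each missing entry is filled from the mean of that column in the
# n_neighbors rows closest to the incomplete row
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))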
