
Introduction to kNN Classification and CNN Data Reduction
Oliver Sutton
February, 2012
1. The Classification Problem
   - Examples
   - The Problem
2. The k Nearest Neighbours Algorithm
   - The Nearest Neighbours Approach
   - The Idea
   - The Algorithm
   - Applet Example
   - Problems With the Algorithm
3. Condensed Nearest Neighbour Data Reduction
   - The Idea
   - The Algorithm
   - Example Using the Applet
4. Summary
The Classification Problem: Example 1

- Suppose we have a database of the characteristics of lots of different people, and their credit rating
  - e.g. how much they earn, whether they own their house, how old they are, etc.
- Want to be able to use this database to give a new person a credit rating
- Intuitively, we want to give similar credit ratings to similar people
The Classification Problem: Example 2

- We have a database of characteristic measurements from lots of different flowers, along with the type of flower
  - e.g. their height, their colour, their stamen size, etc.
- Want to be able to use this database to work out what type a new flower is, based on its measurements
- Again, we want to classify it with the type of flower it is most similar to
The Classification Problem

- In general, we start with a database of objects whose classification we already know
  - Known as the training database, since it trains us to know what the different types of things look like
- We take a new sample, and want to know what classification it should be
- The classification is based on which items of the training database the new sample is similar to

The Problem
What makes two items count as similar, and how do we measure similarity?
One way of solving these problems is with the nearest neighbours
approach...
The Nearest Neighbours Idea: Measuring Similarity

- The nearest neighbours approach involves interpreting each entry in the database as a point in space
- Each of the characteristics is a different dimension, and the entry's value for that characteristic is its coordinate in that dimension
- Then, the similarity of two points is measured by the distance between them
- Different metrics can be used, although in general they give different results (a small sketch of two common choices follows below)
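To make the point about metrics concrete, here is a minimal Python sketch (not from the slides or the applet; the coordinates are made up) showing that the Euclidean and Manhattan distances can disagree about which of two training points is nearer to a sample:

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of the absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

sample, c1, c2 = (0, 0), (2, 2), (0, 3)
print(euclidean(sample, c1), euclidean(sample, c2))  # 2.83, 3.0 -> c1 is nearer
print(manhattan(sample, c1), manhattan(sample, c2))  # 4, 3     -> c2 is nearer
```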
The Nearest Neighbours Idea: Sufficiently Similar

- The nearest neighbours approach then classifies the new sample by looking at the classifications of those closest to it
- In the k Nearest Neighbours (kNN) approach, this is achieved by selecting the k entries which are closest to the new point
- An alternative method might be to use all those points within a certain range of the new point
- The most common classification of these points is then given to the new point
The k Nearest Neighbours Algorithm
The algorithm (as described in [1] and [2]) can be summarised as:
1. A positive integer k is specified, along with a new sample
2. We select the k entries in our database which are closest to the new sample
3. We find the most common classification of these entries
4. This is the classification we give to the new sample
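The following is a minimal Python sketch of these four steps (it is not the applet's implementation; the function name knn_classify and the toy training points are illustrative only):

```python
import math
from collections import Counter

def knn_classify(training_set, new_sample, k):
    """Classify new_sample by majority vote among its k nearest
    training points. training_set is a list of (point, label) pairs."""
    # Step 2: find the k entries in the database closest to the new sample
    distances = [(math.dist(point, new_sample), label)
                 for point, label in training_set]
    nearest = sorted(distances)[:k]
    # Step 3: find the most common classification of these entries
    votes = Counter(label for _, label in nearest)
    # Step 4: that classification is given to the new sample
    return votes.most_common(1)[0][0]

# Toy example: pink points near one corner, blue points near the other
training = [((1, 1), "pink"), ((1, 2), "pink"), ((2, 1), "pink"),
            ((8, 8), "blue"), ((9, 8), "blue"), ((8, 9), "blue")]
print(knn_classify(training, (7, 7), k=3))  # -> "blue"
```

The later sketches below reuse knn_classify and this toy training list.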
Example Using the Applet

- First we set up the applet [1] like this:
- i.e. with a cluster of 40 points of different colours in opposite corners, and 20 points of random noise from each colour
Example Using the Applet

- These blue and pink points will be our training dataset, since we have specified their characteristics (x and y coordinates) and their classification (colour)
- We then select the Handle Test tab at the top, and click anywhere in the screen to simulate trying to classify a new point
- For the value of k specified in the parameters tab, the applet will find and highlight the k nearest data points to the test point
- The most common colour is then shown in the big circle where the test point was placed, as shown on the next slide
Example Using the Applet

[Screenshot: the applet highlights the k nearest training points and colours the test point with their most common colour]
Example Using the Applet

- Alternatively, we can use the applet to draw the map of the data
- The colour at each point indicates the result of classifying a test point at that location (a small sketch of the same idea follows below)
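A rough sketch of the same idea, reusing knn_classify and training from the earlier sketch (the grid size and the dictionary representation are just one possible way to hold the map):

```python
def classification_map(training_set, k, width, height):
    """Label every integer grid location with the class a test point
    placed there would receive -- a text version of the applet's map."""
    return {(x, y): knn_classify(training_set, (x, y), k)
            for x in range(width) for y in range(height)}

grid = classification_map(training, k=3, width=10, height=10)
print(grid[(1, 2)], grid[(8, 8)])  # -> "pink" "blue"
```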
Cross Validation

- In order to investigate how good our initial training set is, we can use a procedure called cross validation
- Essentially, this involves running the kNN algorithm on each of the points in the training set in order to determine whether they would be recognised as the correct type (see the sketch after this list)
- Clearly, this can very easily be corrupted by introducing random noise points into the training set
- To demonstrate this, the applet was used with increasing numbers of random data points in the training set, and the number of cross validation errors counted, as shown in the graph on the next slide
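A leave-one-out sketch of this check, again reusing knn_classify and training from before (the function name and the percentage reporting are just for illustration):

```python
def cross_validation_errors(training_set, k):
    """Classify each training point against the rest of the set and
    count how many are not recognised as their own class."""
    errors = 0
    for i, (point, label) in enumerate(training_set):
        rest = training_set[:i] + training_set[i + 1:]
        if knn_classify(rest, point, k) != label:
            errors += 1
    return errors

# Percentage of training points that fail cross validation
print(100 * cross_validation_errors(training, k=3) / len(training))
```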
Cross Validation
!"#$%&# ()*+#$ ,- .$$,$/
0
10
20
30
40
0 50 100 150 200
!"#$%&# 6#$7#89%&# ,- .$$,$/
6
#
$
7
#
8
9
%
&
#

,
-

#
$
$
,
$
/
()*+#$ ,- $%8:,* ;,<89/ %::#:
Clearly, there is a definite upward trend as the number of random points is increased, although the rate of increase seems to slow down when the number of random points gets large. Testing on a purely random dataset shows that one expects the percentage of points misclassified to tend to about half for 3 Nearest Neighbours, as would be expected (since the chance of any neighbour of a data point being from either data group is about equal). From Chart 1, it can be seen that the percentage of errors would be expected to eventually (as the number of data points tends to infinity) tend to about 50%. Taking both these observations together, this suggests that perhaps the percentage of errors follows a relationship similar to y = 1/(kx + 0.02) + 50.
Cross Validation

- It can clearly be seen that including more random noise points in the training set increases the number of cross validation errors
- As the number of random noise points becomes very large, the percentage of points which fail the cross validation tends to 50%
  - As would be expected for a completely random data set
Main Problems With the Algorithm

- Difficult to implement for some datatypes
  - e.g. colour, geographical location, favourite quotation, etc.
  - This is because it relies on being able to get a quantitative result from comparing two items
  - Can often be worked around by converting the data to a numerical value, for example converting colour to an RGB value, or geographical location to latitude and longitude (see the sketch after this list)
- Slow for large databases
  - Since each new entry has to be compared to every other entry
  - Can be sped up using data reduction...
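As a sketch of the conversion idea for colours (the lookup table below is hypothetical, and euclidean is the helper defined in the earlier metric sketch):

```python
# Hypothetical lookup table mapping colour names to (R, G, B) triples,
# so that colours can be compared with an ordinary distance metric
COLOUR_TO_RGB = {
    "red":  (255, 0, 0),
    "pink": (255, 192, 203),
    "blue": (0, 0, 255),
    "navy": (0, 0, 128),
}

def colour_distance(c1, c2):
    # Euclidean distance in RGB space stands in for colour similarity
    return euclidean(COLOUR_TO_RGB[c1], COLOUR_TO_RGB[c2])

print(colour_distance("pink", "red"))   # smaller: similar colours
print(colour_distance("pink", "navy"))  # larger: dissimilar colours
```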


Condensed Nearest Neighbour Data Reduction

Data Reduction
The database is summarised by finding only the important datapoints

- Datapoints in the training set are divided into three types (as described in [1]):
  1. Outliers: points which would not be recognised as the correct type if added to the database later
  2. Prototypes: the minimum set of points required in the training set for all the other non-outlier points to be correctly recognised
  3. Absorbed points: points which are not outliers, and would be correctly recognised based just on the set of prototype points
- New samples now only need to be compared with the prototype points, rather than the whole database
CNN Data Reduction Algorithm
The algorithm (as described in [1]) is as follows:
1. Go through the training set, removing each point in turn, and checking whether it is recognised as the correct class or not
   - If it is, then put it back in the set
   - If not, then it is an outlier, and should not be put back
2. Make a new database, and add a random point
3. Pick any point from the original set, and see if it is recognised as the correct class based on the points in the new database, using kNN with k = 1
   - If it is, then it is an absorbed point, and can be left out of the new database
   - If not, then it should be removed from the original set, and added to the new database of prototypes
4. Proceed through the original set like this
5. Repeat steps 3 and 4 until no new prototypes are added
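The following Python sketch follows these five steps, reusing knn_classify from the earlier sketch (the choice of k = 3 for the outlier check in step 1 is an assumption, since the slides do not specify it):

```python
import random

def cnn_reduce(training_set, k=3):
    """Return (prototypes, absorbed, outliers) as lists of (point, label)."""
    # Step 1: remove each point in turn; if the rest of the set does not
    # recognise it as the correct class, it is an outlier
    cleaned, outliers = [], []
    for i, entry in enumerate(training_set):
        rest = training_set[:i] + training_set[i + 1:]
        if knn_classify(rest, entry[0], k) == entry[1]:
            cleaned.append(entry)
        else:
            outliers.append(entry)

    # Step 2: start the new database of prototypes with a random point
    prototypes = [random.choice(cleaned)]

    # Steps 3-5: sweep the remaining points with k = 1; any point the
    # prototypes misclassify becomes a prototype itself, and we keep
    # sweeping until a full pass adds no new prototypes
    added = True
    while added:
        added = False
        for entry in cleaned:
            if (entry not in prototypes
                    and knn_classify(prototypes, entry[0], 1) != entry[1]):
                prototypes.append(entry)
                added = True

    absorbed = [e for e in cleaned if e not in prototypes]
    return prototypes, absorbed, outliers
```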
Effects of CNN Data Reduction

- After applying data reduction, we can classify new samples by using the kNN algorithm against the set of prototypes
- Note that we now have to use k = 1, because of the way we found the prototypes (see the short example after this list)
- Classifying a new sample point is now faster, since we don't have to compare it to so many other points
- Classifying new samples against the new reduced data set will sometimes lead to different results than comparing the new sample against the whole training set
- This is the trade-off we have to make, between speed and accuracy
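Continuing the earlier sketches, classification after reduction might look like this (cnn_reduce and knn_classify are the illustrative helpers defined above):

```python
prototypes, absorbed, outliers = cnn_reduce(training, k=3)

# New samples are compared only with the prototypes, using k = 1
print(knn_classify(prototypes, (7, 7), 1))  # usually matches the full-set result
```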
Example Using the Applet

- We set up a dataset as before
- Under the parameters tab, we select Implement Reduction, and reduce the number of nearest neighbours for test points to 1
- Crosses mark outliers, squares mark prototypes, and circles mark absorbed points
Example Using the Applet

- Introducing a test point as before, we can see that it still has the same result:
How Noise Affects CNN

- Again using the applet, we can investigate how well CNN manages to reduce the number of data points, in the presence of different quantities of random noise
- To do this, we record what percentage of points were assigned each classification on each of three different data sets as we increased the number of random noise points
- The results are presented in the graph on the next slide


How Noise Affects CNN
The effects of implementing CNN data reduction on increasingly noisy data sets:
As more random points are added, a point from either data group is more likely to have its nearest neighbours from the other group, making it more likely to be classified as an outlier. Similarly, the points which are not classified as outliers from each data group will be more disparate, so we will require more prototype points to represent our whole group. Then, since the absorbed points must be the points which are neither outliers nor prototypes, we can expect the percentage of these to decrease as more random points are added, which is represented by the green curve in Chart 2.
[Chart 2: Condensed NN Data Reduction - average percentage of points assigned each status (cross-validation errors, prototype points, outlier points, absorbed points) against the number of random data points added (0 to 200)]
How Noise Affects CNN

- As can be seen, increasing the number of noise points had the following effects:
  1. Percentage of points classed as outliers increased dramatically
  2. Percentage of points classed as absorbed decreased
  3. Percentage of points classed as prototypes increased slightly
- These are all as would be expected


Problems With CNN

- The CNN algorithm can take a long time to run, particularly on very large data sets, as described in [4] and [3]
- There are alternative methods for reducing the size of the data set, which are described in [3]
- An alternative version of CNN which often performs faster than standard CNN is also given in [4]
- Using CNN can cause our system to give us different classifications than if we just used kNN on the raw data
Summary

- We can use the kNN algorithm to classify new points, by giving them the same classification as the training points they are close to
- For large datasets, we can apply CNN data reduction to get a smaller training set to work with
- Test results after CNN will be obtained more quickly, and will normally be the same
References

[1] E. Mirkes. KNN and Potential Energy (Applet). University of Leicester, 2011. Available at: http://www.math.le.ac.uk/people/ag153/homepage/KNN/KNN3.html
[2] L. Kozma. k Nearest Neighbours Algorithm. Helsinki University of Technology, 2008. Available at: http://www.lkozma.net/knn2.pdf
[3] N. Bhatia et al. Survey of Nearest Neighbor Techniques. International Journal of Computer Science and Information Security, Vol. 8, No. 2, 2010.
[4] F. Angiulli. Fast Condensed Nearest Neighbor Rule. Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005.