Olga Veksler
Today
kNN classifier
the simplest classifier on earth
k-Nearest Neighbors
Find the k closest neighbors. Classify the unknown point with the most common class among those neighbors.
[Figures: example classifications, labeled "classify as green" and "classify as red"]
How to choose k? A good rule of thumb is k = sqrt(n), where n is the number of samples; this choice has interesting theoretical properties. In practice, k = 1 is often used for efficiency. We can also find the best k through cross-validation, to be studied later (a small preview sketch follows).
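As a rough preview, here is a minimal Matlab/Octave sketch of picking k on a single held-out validation set. The training samples are the ones used in the Matlab example later in these notes; the validation points, their labels, and the candidate values of k are made up for illustration.

% Training samples from the Matlab example later in these notes (one sample per row)
trainX = [2 4; 3 7; 5 4; 3 8; 5 9; 7 10; 6 8];
trainY = [1; 1; 1; 2; 2; 2; 2];
% Hypothetical held-out validation points and labels
valX = [4 5; 6 9];
valY = [1; 2];

bestK = 1;  bestAcc = -inf;
for k = 1:2:5                               % try a few odd values of k
    correct = 0;
    for i = 1:size(valX, 1)
        % squared Euclidean distance from validation point i to every training sample
        diff = trainX - repmat(valX(i, :), size(trainX, 1), 1);
        dist = sum(diff.^2, 2);
        [~, idx] = sort(dist);              % nearest samples first
        votes = trainY(idx(1:k));           % labels of the k closest samples
        correct = correct + (mode(votes) == valY(i));   % majority vote
    end
    acc = correct / size(valX, 1);
    if acc > bestAcc
        bestAcc = acc;  bestK = k;
    end
end
fprintf('best k = %d (validation accuracy %.2f)\n', bestK, bestAcc);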
The nearest-neighbor rule leads to an error rate greater than E*. But even for k = 1, as n → ∞, it can be shown that the nearest-neighbor error rate is smaller than 2E*. As we increase k, the upper bound on the error gets better and better; that is, the error rate (as n → ∞) of the kNN rule is smaller than cE*, with a smaller constant c for larger k. If we have a lot of samples, the kNN rule will do very well!
Here E* is the Bayes error rate, the best error rate any classifier can achieve; we do not study it in this course.
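For reference, the precise asymptotic statement behind these claims for k = 1 is the classical Cover–Hart bound; writing M for the number of classes and E_1NN for the asymptotic (n → ∞) nearest-neighbor error rate:

\[ E^{*} \;\le\; E_{1NN} \;\le\; E^{*}\left(2 - \frac{M}{M-1}\,E^{*}\right) \;\le\; 2E^{*} \]

so asymptotically the 1-NN rule is at most twice as bad as the best possible classifier.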
Example: for k = 1, ..., 5, point x gets classified correctly (red class); for larger k, the classification of x is wrong (blue class).
One way to reduce the complexity of kNN is to remove samples that do not affect the decision boundary. Advantages: the number of samples decreases, and we are guaranteed that the decision boundaries stay the same.
Disadvantages:
complexity may not decrease significantly; how much it decreases depends on our luck and on the data layout
Another way is to pre-structure the samples in a search tree and search the tree for (approximately) nearest neighbors. Advantages: complexity decreases. Disadvantages: finding a good search tree is not trivial, and the search will not necessarily find the closest neighbor, so the decision boundaries are not guaranteed to stay the same.
The Euclidean distance between samples a and b: D(a, b) = ||a - b|| = sqrt( sum_k (a_k - b_k)^2 )
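As a side note, in Matlab/Octave this distance from one point to a whole set of samples can be computed in a vectorized way; the sample matrix below is the Class1 matrix from the walkthrough later in these notes, stored with one sample per column.

samples = [2 3 5; 4 7 4];                        % three samples, one per column
x = [4; 7];                                      % query point
diff = samples - repmat(x, 1, size(samples, 2)); % subtract x from every column
dist = sqrt(sum(diff.^2, 1))                     % Euclidean distance to each sample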
However, some features (dimensions) may be much more discriminative than others, while the Euclidean distance treats each feature as equally important.
Suppose we have to find the class of x = [1 100], and we have two samples: [1 150] and [2 110].
D([1 100], [1 150]) = sqrt( (1 - 1)^2 + (100 - 150)^2 ) = 50
D([1 100], [2 110]) = sqrt( (1 - 2)^2 + (100 - 110)^2 ) = sqrt(101) ≈ 10.05
x = [1 100] is misclassified: its nearest neighbor is [2 110], because the large-scale second feature dominates the distance. The denser the samples, the less of a problem this is, but we rarely have samples that are dense enough.
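A quick numerical check of this example in Matlab/Octave; the class labels below are hypothetical, chosen so that feature 1 (the discriminative one) determines the true class.

x = [1 100];
samples = [1 150; 2 110];                        % two training samples, one per row
labels  = [1; 2];                                % hypothetical class labels
diff = samples - repmat(x, size(samples, 1), 1);
dist = sqrt(sum(diff.^2, 2))                     % approximately [50; 10.05]
[~, nn] = min(dist);
predicted = labels(nn)                           % nearest neighbor is [2 110] -> class 2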
[Figure: samples of the two classes plotted against the two unscaled features]
The decision boundaries for the blue and green classes are shown in red. These boundaries are really bad: feature 1 is discriminative but its scale is small, while feature 2 gives no class information (it is noise) but its scale is large.
Two approaches:
1. linearly scale the range of each feature to be, say, in [0,1]
f_new = [f - min(f)] / [max(f) - min(f)]
2. linearly scale to zero mean, variance 1: if Z is a random variable with mean μ and variance σ^2, then (Z - μ)/σ has mean 0 and variance 1. Thus for each feature f, compute its sample mean and variance, and let the new feature be [f - mean(f)]/sqrt[var(f)]. If X is a matrix with all the samples piled up as rows (that is, the i-th column of X holds the i-th feature), then this can be done in Matlab for all features simultaneously, as in the sketch below.
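A minimal Matlab/Octave sketch of both normalizations; the data matrix X is hypothetical, with one sample per row and one feature per column:

X = [1 150; 2 110; 1 100; 3 130];   % hypothetical samples (rows) with two features (columns)
n = size(X, 1);

% 1. scale each feature (column) linearly into [0, 1]
Xrange = (X - repmat(min(X), n, 1)) ./ repmat(max(X) - min(X), n, 1);

% 2. scale each feature to zero mean and unit variance
Xstd = (X - repmat(mean(X), n, 1)) ./ repmat(sqrt(var(X)), n, 1);

repmat is used here to match the style of the kNN walkthrough later in these notes; recent Matlab versions (and Octave) also broadcast these operations automatically.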
sum_k (a_k - b_k)^2 = sum_{i in discriminative features} (a_i - b_i)^2 + sum_{j in noisy features} (a_j - b_j)^2
If the number of discriminative features is smaller than the number of noisy features, Euclidean distance is dominated by noise
Weighted Euclidean distance: D_w(a, b) = sqrt( sum_k w_k (a_k - b_k)^2 )
We can use our prior knowledge about which features are more important, or we can learn the weights w_k from validation data: increase/decrease the weights until classification improves (see the sketch below).
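A minimal Matlab/Octave sketch of 1-NN classification with such a weighted distance; the samples, labels, and weights below are hypothetical, with the weights chosen to emphasize the discriminative first feature.

trainX = [1 150; 1 100; 2 110; 3 130];           % hypothetical samples, one per row
trainY = [1; 1; 2; 2];                           % hypothetical class labels
x = [1 120];                                     % query point
w = [100 0.01];                                  % hypothetical weights: feature 1 matters most

diff = trainX - repmat(x, size(trainX, 1), 1);
dist = sqrt((diff.^2) * w');                     % sqrt( sum_k w_k (a_k - b_k)^2 )
[~, nn] = min(dist);
predicted = trainY(nn)                           % nearest neighbor under the weighted distance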
kNN in Matlab
Class1 = [2 3 5; 4 7 4]                           % three class 1 samples, one per column
Class2 = [3 5 7 6; 8 9 10 8]                      % four class 2 samples, one per column
x = [4; 7]                                        % the sample to classify
k = 3
trueClass = [1 1 1 2 2 2 2]                       % class label of each sample
combinedSamples = [Class1 Class2] = [2 3 5 3 5 7 6; 4 7 4 8 9 10 8]
testMatrix = [4 4 4 4 4 4 4; 7 7 7 7 7 7 7]       % x replicated once per sample, e.g. repmat(x, 1, 7)
absDiff = abs(combinedSamples - testMatrix) = [2 1 1 1 1 3 2; 3 0 3 1 2 3 1]
absDiff.^2 = [4 1 1 1 1 9 4; 9 0 9 1 4 9 1]
dist = sum(absDiff.^2) = [13 1 10 2 5 18 5]       % squared distance from x to each sample
[Y, I] = sort(dist)
Y = [1 2 5 5 10 13 18]                            % distances in increasing order
I = [2 4 5 7 3 1 6]                               % sample indices, from closest to farthest
neighborsInd = I(1:k) = [2 4 5]                   % the k = 3 closest samples
neighbors = trueClass(neighborsInd) = [1 2 2]     % their class labels
class1 = [1]                                      % positions in neighbors that vote for class 1
class2 = [2 3]                                    % positions in neighbors that vote for class 2
joint = [1 2]                                     % vote counts for class 1 and class 2
class = 2                                         % majority vote: x is assigned to class 2
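Putting the walkthrough together, here is the complete classification as a runnable Matlab/Octave sketch, using the same data and (mostly) the same variable names; the final vote count is written with sum instead of find, but the result is the same.

Class1 = [2 3 5; 4 7 4];                  % class 1 samples, one per column
Class2 = [3 5 7 6; 8 9 10 8];             % class 2 samples, one per column
x = [4; 7];                               % sample to classify
k = 3;

combinedSamples = [Class1 Class2];        % 2 x 7 matrix of all samples
trueClass = [1 1 1 2 2 2 2];              % label of each column

testMatrix = repmat(x, 1, size(combinedSamples, 2));   % x copied once per sample
absDiff = abs(combinedSamples - testMatrix);
dist = sum(absDiff.^2);                   % squared Euclidean distance to each sample

[Y, I] = sort(dist);                      % Y: sorted distances, I: corresponding sample indices
neighborsInd = I(1:k);                    % the k closest samples: [2 4 5]
neighbors = trueClass(neighborsInd);      % their labels: [1 2 2]

votes = [sum(neighbors == 1), sum(neighbors == 2)];
[~, class] = max(votes);                  % majority vote
fprintf('x is assigned to class %d\n', class);   % prints: class 2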
kNN Summary
Advantages
Can be applied to data from any distribution. Very simple and intuitive. Gives good classification if the number of samples is large enough.
Disadvantages
Choosing the best k may be difficult. Computationally heavy, although improvements are possible. Needs a large number of samples for accuracy; this last issue can never be fixed without assuming a parametric form for the distribution.