
Supervised Learning

Nearest Neighbour Classification


Assume that instances x are members of the set X, while labels y are members of the set Y. A classifier is any function f : X → Y.

A supervised learning algorithm is not a classifier. Instead, it is an algorithm whose output is a classifier.

Slides last modified on November 6, 2012. Computer Science and Engineering, Indian Institute of Technology Rajasthan.


A supervised learning algorithm is a function of the type (X × Y)^n → (X → Y), where n is the cardinality of the training set.
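In code this is just a higher-order function: a learner consumes a training set and returns a classifier. A minimal Python sketch (the type names are illustrative, not part of the slides):

    from typing import Callable, List, Tuple, TypeVar

    X = TypeVar("X")   # the instance space
    Y = TypeVar("Y")   # the label space

    # A classifier is any function f : X -> Y.
    Classifier = Callable[[X], Y]

    # A supervised learning algorithm maps a training set (X x Y)^n to a classifier.
    LearningAlgorithm = Callable[[List[Tuple[X, Y]]], Classifier]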

k-Nearest Neighbour method


A.k.a. the k-NN classifier.
Training phase: simply store every training example with its label.
Prediction phase: compute the distance of the test example to every training example, keep the k closest training examples (where k ≥ 1), and identify the majority (most common) label.
The nearest neighbour method is a function of the type (X × Y)^n × X → Y.
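A minimal Python sketch of the method as just described (illustrative code, not from the slides), using Euclidean distance and a simple majority vote:

    import numpy as np
    from collections import Counter

    class KNNClassifier:
        """k-NN as described above: store the data, do all the work at prediction time."""

        def __init__(self, k=3):
            self.k = k

        def fit(self, X, y):
            # Training phase: simply store every training example with its label.
            self.X = np.asarray(X, dtype=float)
            self.y = np.asarray(y)
            return self

        def predict_one(self, x):
            # Prediction phase: Euclidean distance to every training example,
            # keep the k closest, return the majority (most common) label.
            dists = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
            nearest = np.argsort(dists)[: self.k]
            return Counter(self.y[nearest]).most_common(1)[0][0]

Usage: KNNClassifier(k=3).fit(X_train, y_train).predict_one(x_test).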


There are two major design choices for k-NN:

The value of k
The distance function to use

The commonly used distance function is the Euclidean distance between two real vectors x, y of the same length m:

d(x, y) = \lVert x - y \rVert = \sqrt{(x - y)^{\mathrm{T}} (x - y)} = \left( \sum_{i=1}^{m} (x_i - y_i)^2 \right)^{1/2}
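Equivalently, for numpy arrays of the same length (a small illustrative snippet, not from the slides):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])       # example vectors
    y = np.array([4.0, 0.0, 3.0])
    d = np.sqrt(np.sum((x - y) ** 2))   # identical to np.linalg.norm(x - y)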

Ties can happen when two or more classes get the highest number of votes. Ties can also happen when two distance values are the same. Some ad-hoc rule must then be used to resolve the tie.
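One possible ad-hoc rule (an illustrative choice, not prescribed by the slides): break a vote tie in favour of the class whose votes among the k nearest neighbours have the smaller total distance:

    from collections import defaultdict

    def majority_with_tie_break(neighbours):
        """neighbours: list of (distance, label) pairs for the k closest training examples."""
        votes = defaultdict(int)
        dist_sum = defaultdict(float)
        for d, label in neighbours:
            votes[label] += 1
            dist_sum[label] += d
        # Most votes wins; among tied classes, prefer the smaller summed distance.
        return min(votes, key=lambda c: (-votes[c], dist_sum[c]))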

Disadvantage: k-NN has a prediction-time complexity of O(nm) per query, where n is the number of training examples and m is their dimensionality.


Bayes Error Rate (BER) for a classification problem is the minimum achievable error rate, i.e. the error rate of the best possible classifier. BER will be non-zero if the classes overlap. BER is the average, over the space of all examples, of the minimum error probability for each example:

E = \sum_{x \in X} p(x) \left[ 1 - \max_i p(i \mid x) \right]

where the maximum is over the c possible labels i = 1 to i = c.
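For example (an illustrative two-class case, not from the slides): if at some x the posteriors are p(1 | x) = 0.7 and p(2 | x) = 0.3, the best possible classifier predicts label 1 there and still errs with probability

1 - \max_i p(i \mid x) = 1 - 0.7 = 0.3

and the BER is the average of this pointwise minimum error over p(x).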

Theorem: The 1-NN method has the theoretical property that, as the number of training examples tends to infinity, the error rate of a 1-NN classifier is at worst twice the BER.

Proof of the Theorem

Let n be the size of the training set. Let x be the query point and let r be its closest neighbour. The expected error rate of the 1-NN classifier is

\sum_{i=1}^{c} p(i \mid x)\,[1 - p(i \mid r)]

where p(i | x) is the probability that x has label i and 1 − p(i | r) is the probability that r has a different label.


Proof

If the number n of training examples is large enough, then the label probability distributions at x and at r will be essentially the same. In this case, the expected error rate of the 1-NN classifier is

\sum_{i=1}^{c} p(i \mid x)\,[1 - p(i \mid x)]

To prove the theorem we need to show that

\sum_{i=1}^{c} p(i \mid x)\,[1 - p(i \mid x)] \;\le\; 2\left[1 - \max_i p(i \mid x)\right]

Let \max_i p(i \mid x) = r and let this maximum be attained at i = j. Then

\mathrm{LHS} = r(1 - r) + \sum_{i \ne j} p(i \mid x)\,[1 - p(i \mid x)], \qquad \mathrm{RHS} = 2(1 - r)


The LHS summation over i ≠ j is maximized when all the values p(i | x), i ≠ j, are equal, i.e. each equals (1 − r)/(c − 1). In that case

\mathrm{LHS} = r(1 - r) + (c - 1)\,\frac{1 - r}{c - 1}\left(1 - \frac{1 - r}{c - 1}\right) = r(1 - r) + (1 - r)\,\frac{c + r - 2}{c - 1}

Now r ≤ 1 and c + r − 2 ≤ c − 1.

So LHS ≤ 2(1 − r), which proves the theorem.
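As a quick sanity check of the bound (an illustrative special case, not from the slides), take c = 2 and write r = \max_i p(i \mid x):

\sum_{i=1}^{2} p(i \mid x)\,[1 - p(i \mid x)] = r(1 - r) + (1 - r)\,r = 2r(1 - r) \le 2(1 - r)

since r ≤ 1; here 1 − r is exactly the Bayes error at x, so the asymptotic 1-NN error is at most twice it.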

AdaBoost

[Figure 14.1 from Bishop, PRML, reproduced on the slide: schematic illustration of the boosting framework. Each base classifier y_m(x) is trained on a weighted form of the training set (blue arrows) in which the weights w_n^{(m)} depend on the performance of the previous base classifier y_{m-1}(x) (green arrows). Once all base classifiers have been trained, they are combined to give the final classifier Y_M(x) (red arrows).]

Y_M(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m\, y_m(x) \right) \qquad (14.19)


1. Initialize the data weighting coefficients {w_n} by setting w_n^{(1)} = 1/N for n = 1, …, N.
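The remaining steps of the algorithm are not legible in this copy; the sketch below follows the standard AdaBoost formulation (weighted error ε_m, coefficient α_m = ln((1 − ε_m)/ε_m), multiplicative weight updates on misclassified points), using scikit-learn decision stumps as base classifiers. An illustrative reconstruction, not the slide's own code; labels are assumed to be in {−1, +1}.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, t, M=50):
        """Train AdaBoost with decision stumps; t holds labels in {-1, +1}."""
        N = len(t)
        w = np.full(N, 1.0 / N)                      # 1. initialise w_n^(1) = 1/N
        learners, alphas = [], []
        for m in range(M):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, t, sample_weight=w)         # 2(a) fit y_m to the weighted data
            miss = (stump.predict(X) != t)
            eps = np.sum(w * miss) / np.sum(w)       # 2(b) weighted error epsilon_m
            alpha = np.log((1 - eps) / max(eps, 1e-12))  # alpha_m = ln((1 - eps_m) / eps_m)
            w = w * np.exp(alpha * miss)             # 2(c) up-weight misclassified points
            learners.append(stump)
            alphas.append(alpha)
        return learners, alphas

    def adaboost_predict(X, learners, alphas):
        """3. Combine: Y_M(x) = sign(sum_m alpha_m y_m(x))."""
        score = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
        return np.sign(score)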

[Figure 14.3 from Bishop, PRML, reproduced on the slide: plot of the exponential (green) and rescaled cross-entropy (red) error functions, along with the hinge error (blue) used in support vector machines and the misclassification error (black), as functions of z = t y(x).]

Note that for large negative values of z = t y(x), the cross-entropy gives a linearly increasing penalty, whereas the exponential loss gives an exponentially increasing penalty. The exponential error function is therefore much less robust to outliers and mislabelled data points than the cross-entropy. Unlike the cross-entropy, it cannot be interpreted as the log likelihood function of any well-defined probabilistic model, and it does not generalize to classification problems having K > 2 classes.

The function that minimizes the expected exponential error is half the log-odds, so the AdaBoost algorithm is seeking the best approximation to the log-odds ratio within the space of functions represented by the linear combination of base classifiers, subject to the constrained minimization resulting from the sequential optimization strategy. This result motivates the use of the sign function in (14.19) to arrive at the final classification decision.

The interpretation of boosting as the sequential optimization of an additive model under an exponential error (Friedman et al., 2000) opens the door to a wide range of boosting-like algorithms, including multiclass extensions, by altering the choice of error function. It also motivates the extension to regression problems (Friedman, 2001): with a sum-of-squares error, sequential minimization of an additive model simply involves fitting each new base learner to the residual errors t_n − f_{m−1}(x_n) from the previous model. Since the sum-of-squares error is not robust to outliers, this can be addressed by basing the boosting algorithm on the absolute deviation |y − t| instead; these two error functions are compared in Figure 14.4.

Tree-based Models

There are various simple, but widely used, models that work by partitioning the input space into cuboid regions, whose edges are aligned with the axes, and then assigning a simple model (for example, a constant) to each region. They can be viewed as a model combination method in which only one model is responsible for making predictions at any given point in input space. The process of selecting a specific model, given a new input x, can be described by a sequential decision making process corresponding to the traversal of a binary tree (one that splits into two branches at each node). Here we focus on a particular tree-based framework called classification and regression trees, or CART (Breiman et al., 1984), although there are many other variants going by such names as ID3 and C4.5 (Quinlan, 1986; Quinlan, 1993).

[Figure 14.5 from Bishop, PRML, reproduced on the slide: illustration of a two-dimensional input space that has been partitioned into five regions using axis-aligned boundaries.]
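To make the axis-aligned partitioning concrete, here is a minimal sketch (illustrative, not from the slides; the helper name is hypothetical) of choosing a single CART-style regression split by minimizing the sum-of-squares error of fitting a constant in each resulting region:

    import numpy as np

    def best_axis_aligned_split(X, t):
        """Return (feature index, threshold) of the split that minimizes the total
        sum-of-squares error when each side is fitted with the mean of t."""
        best = (None, None, np.inf)
        for j in range(X.shape[1]):                   # try every input dimension
            for theta in np.unique(X[:, j])[:-1]:     # candidate axis-aligned thresholds
                left, right = t[X[:, j] <= theta], t[X[:, j] > theta]
                sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
                if sse < best[2]:
                    best = (j, theta, sse)
        return best[0], best[1]

Recursively applying such splits to each resulting region produces exactly the kind of binary tree described above.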


Figure 14.5 shows an illustration of a recursive binary partitioning of the input space, along with the corresponding tree structure. In this example, the first step divides the whole input space into two regions according to whether x_1 ≤ θ_1 or x_1 > θ_1. This creates two subregions, each of which can then be subdivided independently. For instance, the region x_1 ≤ θ_1 is further subdivided according to whether x_2 ≤ θ_2 or x_2 > θ_2, giving rise to the regions denoted A and B. The recursive subdivision can be described by the traversal of a binary tree.

Mixtures of Linear Regression Models

[Figure 14.7 from Bishop, PRML, reproduced on the slide: probabilistic directed graph representing a mixture of linear regression models, defined by (14.35).]

The EM algorithm begins by first choosing an initial value θ^old for the model parameters. In the E step, these parameter values are then used to evaluate the posterior probabilities, or responsibilities, of each component k for every data point n, given by

\gamma_{nk} = \mathbb{E}[z_{nk}] = p(k \mid \phi_n, \theta^{\text{old}}) = \frac{\pi_k\,\mathcal{N}(t_n \mid \mathbf{w}_k^{\mathrm{T}} \phi_n, \beta^{-1})}{\sum_j \pi_j\,\mathcal{N}(t_n \mid \mathbf{w}_j^{\mathrm{T}} \phi_n, \beta^{-1})} \qquad (14.37)

The responsibilities are then used to determine the expectation, with respect to the posterior distribution of the latent variables, of the complete-data log likelihood.
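A minimal numpy sketch of the E step in (14.37) (illustrative, not from the slides; the function and parameter names are hypothetical), assuming a design matrix Phi of shape (N, D), component weight vectors W of shape (K, D), mixing coefficients pi of shape (K,), and a shared noise precision beta:

    import numpy as np

    def e_step_responsibilities(Phi, t, W, pi, beta):
        """gamma[n, k] proportional to pi_k * N(t_n | w_k^T phi_n, 1/beta), rows normalised."""
        means = Phi @ W.T                                  # (N, K): component means w_k^T phi_n
        var = 1.0 / beta
        log_dens = -0.5 * np.log(2 * np.pi * var) - 0.5 * (t[:, None] - means) ** 2 / var
        log_gamma = np.log(pi)[None, :] + log_dens         # unnormalised log responsibilities
        log_gamma -= log_gamma.max(axis=1, keepdims=True)  # stabilise before exponentiating
        gamma = np.exp(log_gamma)
        return gamma / gamma.sum(axis=1, keepdims=True)    # normalise so each row sums to 1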

