
Chapter 12

Classification

Classification is an increasingly important application of modern methods in
statistics. In the statistical literature the word is used in two distinct senses. The
entry (Hartigan, 1982) in the original Encyclopedia of Statistical Sciences uses the
sense of cluster analysis discussed in Section 11.2. Modern usage is leaning to the
other meaning (Ripley, 1997) of allocating future cases to one of a number of
pre-specified classes, and it is in this sense that the word is used in this chapter.
(The older statistical literature sometimes refers to this as allocation.)


In pattern-recognition terminology this chapter is about supervised methods.
The classical methods of multivariate analysis (Krzanowski, 1988; Mardia, Kent
and Bibby, 1979; McLachlan, 1992) have largely been superseded by methods
from pattern recognition (Ripley, 1996; Webb, 1999; Duda et al., 2001), but some
still have a place.
It is sometimes helpful to distinguish discriminant analysis in the sense of
describing the differences between the classes from classification, the allocation
of future cases to classes. The first task provides some measure of explana-
tion. In many applications no explanation is required (no one cares how machines
read postal (zip) codes, only that the envelope is correctly sorted) but in others,
especially in medicine, some explanation may be necessary to get the methods
adopted.
Classification is also a central part of data mining, although some of data mining is ex-
ploratory in the sense of Chapter 11. Hand et al. (2001) and (especially) Hastie
et al. (2001) are pertinent introductions.
Some of the methods considered in earlier chapters are widely used for clas-
sification: classification trees, logistic regression for two groups and
multinomial log-linear models (Section 7.3) for more than two groups.

12.1 Discriminant Analysis

Suppose that we have a set of g classes, and for each case we know the class
(assumed correctly). We can then use the class information to help reveal the
structure of the data. Let W denote the within-class covariance matrix, that is the
covariance matrix of the variables centred on the class mean, and B denote the
between-classes covariance matrix, that is, of the predictions by the class means.
Let M be the g x p matrix of class means, and G be the n x g matrix of class
indicator variables (so g_ij = 1 if and only if case i is assigned to class j). Then
the predictions are GM. Let x-bar be the means of the variables over the whole
sample. Then the sample covariance matrices are

$$W = \frac{(X - GM)^T (X - GM)}{n - g}, \qquad B = \frac{(GM - 1\bar{x})^T (GM - 1\bar{x})}{g - 1} \tag{12.1}$$

Note that B has rank at most min(p, g − 1).
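As a concrete illustration (a sketch using the iris data, not code from the text), the quantities in (12.1) can be computed directly:

    X <- as.matrix(iris[, 1:4]); cl <- iris$Species
    n <- nrow(X); g <- nlevels(cl)
    M <- apply(X, 2, function(v) tapply(v, cl, mean))  # g x p matrix of class means
    G <- model.matrix(~ cl - 1)                        # n x g class indicator matrix
    W <- crossprod(X - G %*% M) / (n - g)              # within-class covariance matrix
    B <- crossprod(G %*% M - rep(1, n) %o% colMeans(X)) / (g - 1)  # between-classes
    qr(B)$rank                                         # at most min(p, g - 1) = 2 here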


Fisher (1936) introduced a linear discriminant analysis seeking a linear com-
bination of the variables that has a maximal ratio of the separation of the class
means to the within-class variance, that is, maximizing the ratio a^T B a / a^T W a.
To compute this, choose a sphering (see page 305) of the variables so that
they have the identity as their within-group correlation matrix. On the rescaled
variables the problem is to maximize a^T B a subject to ||a|| = 1, and as we saw
for PCA, this is solved by taking a to be the eigenvector of B corresponding to
the largest eigenvalue. The linear combination is unique up to a change of sign
(unless there are multiple eigenvalues). The exact multiple of a returned by a
program will depend on its definition of the within-class covariance matrix; we use
the conventional divisor of n − g, but divisors of n and n − 1 have been used.
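Continuing the sketch above, Fisher's procedure is an eigendecomposition after sphering (again an illustration, not code from the text):

    S <- solve(chol(W))                    # sphering: X %*% S has identity
                                           # within-group covariance
    ev <- eigen(t(S) %*% B %*% S, symmetric = TRUE)
    a <- ev$vectors[, 1]                   # direction with the largest ratio
    scaling <- S %*% ev$vectors            # discriminants on the original scale
    round(ev$values, 4)                    # at most min(p, g - 1) are positive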


As for principal components, we can take further linear components corre-
sponding to the next largest eigenvalues. There will be at most min(p, g − 1)
positive eigenvalues. Note that the eigenvalues are the proportions of the between-
classes variance explained by the linear combinations, which may help us to
choose how many to use. The corresponding transformed variables are called the
linear discriminants or canonical variates. It is often useful to plot the data on the
first few linear discriminants (Figure 12.1). Since the within-group covariances
should be the identity, we chose an equal-scaled plot. (Using plot on the fitted
object will give this plot without the colours.) The linear discriminants are convention-
ally centred to have mean zero on the dataset.
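A plot in the spirit of Figure 12.1 can be produced along the following lines (a sketch assuming the iris data; the exact call used for the figure is not reproduced here):

    library(MASS)
    ir.lda <- lda(log(iris[, 1:4]), iris$Species)
    ir.ld  <- predict(ir.lda)$x            # scores on the linear discriminants
    eqscplot(ir.ld, type = "n", xlab = "first linear discriminant",
             ylab = "second linear discriminant")
    # s = setosa, c = versicolor, v = virginica (letter coding assumed)
    text(ir.ld, labels = c("s", "c", "v")[unclass(iris$Species)])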

Figure 12.1: The log iris data on the first two linear discriminants (first linear
discriminant on the x-axis, second linear discriminant on the y-axis), with points
labelled by species (s, c, v).

The approach we have illustrated is the conventional one, following Bryan
(1951). The definition of the between-classes matrix B at (12.1) weights the
groups by their size in the dataset. Rao (1948) used the unweighted covariance
matrix of the group means, and our software uses a covariance matrix weighted
by the class prior probabilities when these are specified.
Discrimination for normal populations


An alternative approach to discrimination is via probability models. Let π_c de-
note the prior probabilities of the classes, and p_c(x) the densities of distribu-
tions of the observations for each class. Then the posterior distribution of the
classes after observing x is

$$p(c \mid x) = \frac{\pi_c\, p_c(x)}{p(x)} \propto \pi_c\, p_c(x) \tag{12.2}$$

and it is fairly simple to show that the allocation rule which makes the smallest
expected number of errors chooses the class with maximal p(c | x); this is known
as the Bayes rule. (We consider a more general version in Section 12.2.)

Now suppose the distribution for class c is multivariate normal with mean μ_c
and covariance Σ_c. Then the Bayes rule minimizes

$$Q_c = -2 \log p_c(x) - 2 \log \pi_c = (x - \mu_c)\,\Sigma_c^{-1} (x - \mu_c)^T + \log|\Sigma_c| - 2 \log \pi_c \tag{12.3}$$

The first term of (12.3) is the squared Mahalanobis distance to the class centre,
and can be calculated by the function mahalanobis. The difference between
the Q_c for two classes is a quadratic function of x, so the method is known as
quadratic discriminant analysis and the boundaries of the decision regions are
quadratic surfaces in x space. This is implemented by our function qda.
Further suppose that the classes have a common covariance matrix Σ. Dif-
ferences in the Q_c are then linear functions of x, and we can maximize −Q_c/2
or
$$L_c = x\,\Sigma^{-1}\mu_c^T - \mu_c\,\Sigma^{-1}\mu_c^T/2 + \log \pi_c \tag{12.4}$$
To use (12.3) or (12.4) we have to estimate μ_c and Σ_c or Σ. The obvious
estimates are used, the sample mean and covariance matrix within each class, and
W for Σ.
How does this relate to Fisher's linear discrimination? The latter gives new
variables, the linear discriminants, with unit within-class sample variance, and the
differences between the group means lie in the first min(p, g − 1) of the new
variables. Thus on these variables the Mahalanobis distance (with respect to the
estimate W of Σ) is just ||x − μ_c||², and only the first r = min(p, g − 1)
components of the vector depend on c. Similarly, on these variables

$$L_c = x\,\mu_c^T - \|\mu_c\|^2/2 + \log \pi_c$$

and we can work in r dimensions. If there are just two classes, there is a single
linear discriminant, and

$$L_2 - L_1 = x\,(\mu_2 - \mu_1)^T + \text{const}$$

so the rule depends on x only through the linear discriminant, whose coefficient
vector is (μ_2 − μ_1) rescaled to unit length.


Note that linear discriminant analysis uses a posterior p(c | x) that is a logistic regres-
sion for two classes and a multinomial log-linear model for more than two. However, it
differs from the methods of Chapter 7 in the methods of parameter estimation
used. Linear discriminant analysis will be better if the populations really are mul-
tivariate normal with equal within-group covariance matrices, but that superiority
is fragile, so the methods of Chapter 7 are usually preferred.

The crabs dataset

Can we construct a rule to predict the sex of a future Leptograpsus crab of un-
known colour form (species)? We noted that one of the measurements is recorded
differently for males and females, so it seemed prudent to omit it from the analysis.
To start with, we ignore the differences between the forms. Linear discriminant
analysis is performed on the logged measurements; a linear combination of logs is
the log of a product of powers of the measurements, and hence
a dimensionally neutral quantity. Six errors are made, all for the blue form:
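A sketch of such a fit (assuming the crabs data frame in MASS, with RW taken to be the omitted measurement; this is not the exact call used in the text, so the error count may differ):

    library(MASS)
    lcrabs <- log(crabs[, c("FL", "CL", "CW", "BD")])  # logged measurements, RW omitted
    cr.lda <- lda(lcrabs, crabs$sex)
    table(true = crabs$sex, predicted = predict(cr.lda)$class)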

It does make sense to take the colour forms into account, especially as the
within-group distributions look close to joint normality (look at Figure 4.13).
With four groups the first two linear discriminants dominate the
between-group variation; Figure 12.2 shows the data on those variables.

We cannot represent all the decision surfaces exactly on a plot. However,
a plot on the first two linear discriminants gives a good ap-
proximation; see Figure 12.2.



Figure 12.2: Linear discriminants for the crabs data (Second LD against First LD). Males are coded as capitals, fe-
males as lower case, colours as the initial letter of blue or orange. The crosses are the
group means. The solid line is the decision boundary of a linear discriminant for sex, and the dashed line is the deci-
sion boundary for sex based on four groups.

The reader is invited to try quadratic discrimination on this problem. It per-
forms very marginally better than linear discrimination, not surprisingly since the
covariances of the groups appear so similar, as can be seen by comparing the
estimated within-group covariance matrices.
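One way to look at this, continuing the sketch above (these are not the calls used in the text):

    by(lcrabs, list(crabs$sp, crabs$sex), var)   # within-group covariance matrices
    cr.qda <- qda(lcrabs, crabs$sex)             # quadratic discrimination
    table(true = crabs$sex, predicted = predict(cr.qda)$class)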

Robust estimation of multivariate location and scale


We may wish to consider more robust estimates of W (but not B). Somewhat
surprisingly, it does not suffice to apply a univariate robust estimator to each
component of a multivariate mean (Rousseeuw and Leroy, 1987, p. 250), and it is
easier to consider the estimation of mean and variance simultaneously.
Multivariate variances are very sensitive to outliers. Two methods for robust
covariance estimation are available via our function cov.rob¹ and the S-PLUS
functions cov.mve and cov.mcd (Rousseeuw, 1984; Rousseeuw and Leroy,
1987), as well as in other library sections. Suppose there are n observations
of p variables. The minimum volume ellipsoid method seeks an ellipsoid contain-
ing at least half the points that is of minimum volume, and the minimum
covariance determinant method seeks a subset of at least half the points whose
covariance matrix has minimum determinant.
¹ Adopted by R in package lqs.

Figure 12.3: The forensic glass data on the first two linear discriminants (LD1 and
LD2), computed with and without robust estimation of the common covariance matrix.

Our function cov.rob implements both.


The search for an MVE or MCD provides points whose mean and variance
matrix are used as an initial estimate. This is refined by selecting
those points whose Mahalanobis distance from the initial mean using the initial
covariance matrix is not too large (specifically, within the 97.5% point under normality),
and returning their mean and variance matrix.
An alternative approach is to extend the idea of M-estimation to this setting,
fitting a multivariate t distribution for a small number of degrees of free-
dom. This is implemented in our function cov.trob; the theory behind the algo-
rithm used is given in Kent, Tyler and Vardi (1994) and Ripley (1996). Normally
cov.trob is faster than cov.rob, but it lacks the latter's extreme resistance. We
can use linear discriminant analysis on more than two classes, and illustrate this
with the forensic glass dataset fgl.
Our function lda has an argument method = "mve" to use the minimum
volume ellipsoid estimate (but without robust estimation of the group centres) or
the multivariate t distribution by setting method = "t". This makes a consid-
erable difference for the forensic glass data, as Figure 12.3 shows. We use
the default number of degrees of freedom.
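A rough analogue of the fits behind Figure 12.3 (a sketch assuming the fgl data frame in MASS; the plotting details differ from the figure):

    library(MASS)
    fgl.lda  <- lda(type ~ ., data = fgl)                # classical estimates
    fgl.rlda <- lda(type ~ ., data = fgl, method = "t")  # multivariate t estimates
    plot(predict(fgl.rlda)$x[, 1:2], xlab = "LD1", ylab = "LD2",
         pch = as.integer(fgl$type))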

Try plotting the robust discriminant scores against the classical ones, which gives an almost linear plot.

12.2 Classification Theory

In the terminology of pattern recognition the given examples together with their
classifications are known as the training set, and future cases form the test set. Our
primary measure of success is the error (or misclassification) rate. Note that we
would obtain (possibly seriously) biased estimates by re-classifying the training
set, but that the error rate on a test set randomly chosen from the whole population
will be an unbiased estimator.
It may be helpful to know the type of errors made. A confusion matrix gives
the number of cases with true class i classified as of class j. In some problems
some errors are considered to be worse than others, so we assign costs L_ij to
allocating a case of class i to class j. Then we will be interested in the average
error cost rather than the error rate.
It is fairly easy to show (Ripley, 1996, p. 19) that the average error cost is
minimized by the Bayes rule, which is to allocate to the class c minimizing
the sum over i of L_ic p(i | x), where p(i | x) is the posterior distribution of the classes after ob-
serving x. If the costs of all errors are the same, this rule amounts to choosing
the class c with the largest posterior probability p(c | x). The minimum average
cost is known as the Bayes risk. We can often estimate a lower bound for it by the
method of Ripley (1996, pp. 196–7) (see the example on page 347).
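A toy illustration of a confusion matrix and an average error cost, with invented classes and costs (not from the text):

    L <- matrix(c(0, 1, 5,
                  1, 0, 1,
                  5, 1, 0), 3, 3, byrow = TRUE)   # cost of allocating class i to class j
    truth <- factor(c("a", "a", "a", "b", "b", "c"))
    pred  <- factor(c("a", "b", "a", "b", "c", "c"), levels = levels(truth))
    (confusion <- table(truth, pred))     # counts of true class i classified as j
    sum(confusion * L) / sum(confusion)   # average error cost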
We saw in Section 12.1 how p(c | x) can be computed for normal popula-
tions, and how estimating the Bayes rule with equal error costs leads to lin-
ear and quadratic discriminant analysis. As our functions lda and qda return
the posterior probabilities, they can also be used to estimate the Bayes rule for classifica-
tion with unequal error costs.


The posterior probabilities p(c | x) may also be estimated directly. For just
two classes we can model the log odds by a logistic regression. For more than
two classes we need a multiple logistic model; it may be possible to fit
this using a surrogate log-linear Poisson GLM model (Section 7.3), but using the
multinom function in library section nnet will usually be faster and easier.
Classification trees model p(c | x) directly by a special multiple logistic
model, one in which the right-hand side is a single factor specifying which leaf
the case will be assigned to by the tree. Again, since the posterior probabilities
are given by the predict method it is easy to estimate the Bayes rule for unequal
error costs.
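A sketch of such a direct fit on the forensic glass data (assuming fgl from MASS; the settings are illustrative):

    library(MASS); library(nnet)
    fgl.mn <- multinom(type ~ ., data = fgl, maxit = 1000, trace = FALSE)
    post <- predict(fgl.mn, type = "probs")   # estimated posterior probabilities
    head(round(post, 3))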

Predictive and ‘plug-in’ rules

To apply the Bayes rule we need to know the posterior probabilities p(c | x). Since these are unknown
we use an explicit or implicit parametric family p(c | x; θ). In the methods con-
sidered so far we act as if p(c | x; θ̂) were the actual posterior probabilities, where
θ̂ is an estimate computed from the training set T, often by maximizing some ap-
propriate likelihood. This is known as the 'plug-in' rule. However, the 'correct'
estimate of p(c | x) is (Ripley, 1996, §2.4) to use the predictive estimates

$$\tilde p(c \mid x) = p(c \mid x,\, T) \propto \pi_c \int p(x \mid c,\, \theta)\, p(\theta \mid T)\, d\theta \tag{12.5}$$

If we are very sure of our estimate θ̂ there will be little difference between
p(c | x; θ̂) and the predictive estimate; otherwise the predictive estimate will normally be less
extreme (not as near 0 or 1). The 'plug-in' estimate ignores the uncertainty in the
parameter estimate which the predictive estimate takes into account.
It is not often possible to perform the integration in (12.5) analytically, but it
is possible for linear and quadratic discrimination with appropriate 'vague' pri-
ors on θ (Aitchison and Dunsmore, 1975; Geisser, 1993; Ripley, 1996). This
estimate is implemented by method = "predictive" of the predict meth-
ods for our functions lda and qda. Often the differences are small, especially
for linear discrimination, provided there are enough data for a good estimate of
the variance matrices. When there are not, Moran and Murphy (1979) argue
that considerable improvement can be obtained by using an unbiased estimator
of log p(x | c), implemented by the argument method = "debiased".

A simple example: Cushing’s syndrome


We illustrate these methods by a small example taken from Aitchison and Dun-
smore (1975, Tables 11.1–3) and used for the same purpose by Ripley (1996).
The data are on diagnostic tests on patients with Cushing’s syndrome, a hy-
persensitive disorder associated with over-secretion of cortisol by the adrenal
gland. This dataset has three recognized types of the syndrome represented as
a, b, c. (These encode ‘adenoma’, ‘bilateral hyperplasia’ and ‘carcinoma’, and
represent the underlying cause of over-secretion. This can only be determined
histopathologically.) The observations are urinary excretion rates (mg/24 h) of
the steroid metabolites tetrahydrocortisone and pregnanetriol, and are considered
on log scale.
There are six patients of unknown type (marked u), one of whom was later
found to be of a fourth type, and another was measured faultily.
Figure 12.4 shows the classifications given by linear discriminant analysis and the various options
of quadratic discriminant analysis. This was produced by code along the following lines.
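A sketch of those fits (assuming the Cushings data frame in MASS; not the exact code used to draw Figure 12.4):

    library(MASS)
    cush <- log(as.matrix(Cushings[, -3]))      # log urinary excretion rates
    tp   <- factor(Cushings$Type[1:21])         # the 21 cases of known type
    cush.lda <- lda(cush[1:21, ], tp)
    cush.qda <- qda(cush[1:21, ], tp)
    ## posterior probabilities for the six cases of unknown type, three ways
    round(predict(cush.qda, cush[22:27, ])$posterior, 3)                         # plug-in
    round(predict(cush.qda, cush[22:27, ], method = "predictive")$posterior, 3)  # predictive
    round(predict(cush.qda, cush[22:27, ], method = "debiased")$posterior, 3)    # debiased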

Figure 12.4: Linear and quadratic discriminant analysis applied to the Cushing's syndrome
data. The panels show LDA, QDA, QDA (predictive) and QDA (debiased); in each,
Pregnanetriol is plotted against Tetrahydrocortisone on log scales, with the six cases of
unknown type marked u.

(The plotting function used here is given in the scripts.)


We can contrast these with logistic discrimination, fitted by multinom as
sketched below. (The plotting function is again given in the scripts.) When, as
here, the classes have quite different variance matrices, linear and logistic
discrimination can give quite different answers (compare Figures 12.4 and 12.5).
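A sketch of such a fit, continuing the Cushing's code above (the exact call in the text is not reproduced):

    library(nnet)
    Cf <- data.frame(tp = tp,
                     Tetrahydrocortisone = cush[1:21, 1],
                     Pregnanetriol       = cush[1:21, 2])
    cush.logit <- multinom(tp ~ Tetrahydrocortisone + Pregnanetriol, data = Cf,
                           maxit = 250, trace = FALSE)
    newd <- data.frame(Tetrahydrocortisone = cush[22:27, 1],
                       Pregnanetriol       = cush[22:27, 2])
    round(predict(cush.logit, newd, type = "probs"), 3)  # unknown-type cases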

Figure 12.5: Logistic discrimination applied to the Cushing's syndrome data, plotted as
in Figure 12.4.

Mixture discriminant analysis


Another application of the (plug-in) theory is mixture discriminant analysis
(Hastie and Tibshirani, 1996), which has an implementation in the library sec-
tion mda. This fits a mixture of normal distributions to each class and
then applies (12.2).

12.3 Non-Parametric Rules

There are several non-parametric classifiers based on non-parametric esti-
mates of the class densities or of the log posterior. Library section class imple-
ments the k-nearest neighbour classifier and related methods (Devijver and Kit-
tler, 1982; Ripley, 1996) and learning vector quantization (Kohonen, 1990, 1995).
These are all based on finding the k nearest examples in some
reference set, and taking a majority vote among the classes of these examples,
or, equivalently, estimating the posterior probabilities p(c | x) by the proportions
of the classes among the examples.
The methods differ in their choice of reference set. The k-nearest neighbour
methods use the whole training set or an edited subset. Learning vector quantiza-
tion is similar to K-means in selecting points in the space other than the training

Figure 12.6: k-nearest neighbours (1-NN and 3-NN) applied to the Cushing's syndrome
data, plotted as in Figure 12.4.

set examples to summarize the training set, but unlike K-means it takes the classes
of the examples into account.
These methods almost always measure ‘nearest’ by Euclidean distance. For
the Cushing’s syndrome data we use Euclidean distance on the logged covariates,
rather arbitrarily scaling them equally.
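A sketch of the kind of computation behind Figure 12.6, using cush and tp from the earlier Cushing's sketch (the grid ranges are illustrative):

    library(class)  # knn
    xp <- seq(0.6, 4.0, length = 100)
    yp <- seq(-3.25, 2.45, length = 100)
    grid <- expand.grid(Tetrahydrocortisone = xp, Pregnanetriol = yp)
    Z1 <- knn(cush[1:21, ], grid, tp, k = 1)   # 1-NN classification of the grid
    Z3 <- knn(cush[1:21, ], grid, tp, k = 3)   # 3-NN, with random tie-breaking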

This dataset is too small to try the editing and LVQ methods in library section
class.

12.4 Neural Networks

Neural networks provide a flexible non-linear extension of multiple logistic re-
gression, as we saw in Section 8.10. We can consider them for the Cushing's
syndrome example by the following code.²
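A sketch of such a fit, again using cush and tp from the earlier sketch (the size, decay and seed are illustrative, not the settings used in the text):

    library(nnet)
    set.seed(1)                              # results vary with the random start
    cush.nn <- nnet(cush[1:21, ], class.ind(tp), size = 2, decay = 0.001,
                    softmax = TRUE, maxit = 1000, trace = FALSE)
    ## predicted class for the six cases of unknown type
    levels(tp)[max.col(predict(cush.nn, cush[22:27, ]))]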

² … S environment.

Figure 12.7: Neural networks applied to the Cushing's syndrome data. The panels show
fits with size = 2 (no weight decay), size = 2 with lambda = 0.001, size = 2 with
lambda = 0.01, and sizes 5 and 20 with lambda = 0.01; axes as in Figure 12.4. Each panel shows
the fits from several random starting points.
The results are shown in Figure 12.7. We see that in all cases there are multiple
local maxima, and different local maxima can give quite different classifiers.
Once we have a penalty, the choice of the number of hidden units is often not
critical (see Figure 12.7). The spirit of the predictive approach is to average the
predicted posterior probabilities over the local maxima.
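A minimal sketch of this averaging (the size and decay echo Figure 12.8 but are assumptions, as is the number of restarts):

    unknown <- cush[22:27, ]
    pr <- matrix(0, nrow(unknown), nlevels(tp))
    for (i in 1:20) {
      fit <- nnet(cush[1:21, ], class.ind(tp), size = 3, decay = 0.01,
                  softmax = TRUE, maxit = 1000, trace = FALSE)
      pr <- pr + predict(fit, unknown)       # accumulate predicted probabilities
    }
    round(pr / 20, 3)                        # averaged over the random restarts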

Figure 12.8: Neural networks with three hidden units and weight decay applied to the Cush-
ing's syndrome data. The left panel ('Many local maxima') shows the decision boundaries
from individual fits; the right panel ('Averaged') shows the boundary from the averaged
predicted probabilities.

Note that there are two quite different types of local maxima occurring here, and
some local maxima occur several times (up to convergence tolerances). An aver-
age over the fits gives a more stable classifier; see Figure 12.8.

12.5 Support Vector Machines

Support vector machines (SVMs) have attracted a great deal of recent attention.
They have been promoted enthusiastically, but with little respect to the selection
effects of choosing the test problem and the member of the large class of classi-
fiers to use. The main ideas appear in Boser et al. (1992); Cortes and Vapnik
(1995); Vapnik (1995, 1998); the books by Cristianini and Shawe-Taylor (2000)
and Hastie et al. (2001, §§4.5, 12.2, 12.3) present the underlying theory.
The method for two classes is fairly simple to describe. Logistic regression
will fit the data perfectly in separable cases, where some hyperplane has all class-one
points on one side and all class-two points on the other. It would be a coincidence
for there to be only one such hyperplane, and the fitted hyperplane need not lie
in the middle of the 'gap' between the two classes. The support vector machine
chooses the separating hyperplane in the
middle of the gap, that is with maximal margin (the distance from the hyperplane
to the nearest point). This is a quadratic programming problem that can be solved
by standard methods.³ Such a hyperplane has support vectors, data points that are
exactly the margin distance away from the hyperplane. It will typically be a very
good classifier, but in most problems no separating hyperplane exists. That difficulty
is tackled in two ways. First, we can allow some points to be on the wrong side
of their margin (and for some on the wrong side of the hyperplane) subject to
a limit on the sum of the distances by which points violate their margin, a limit that
enters the optimization with a Lagrange multiplier C. This is still a quadratic programming problem,
but the resulting classifier is somewhat less natural,
because of the rather arbitrary use of a sum of distances.
Second, the set of variables is expanded greatly by taking non-linear functions
of the original set of variables. Thus rather than seeking a classifying hyperplane
f(x) = xβ + β₀, we seek f(x) = h(x)β + β₀ for a vector of non-linear
functions h. Finding the classifier is then equivalent
to solving

$$\min_{\beta_0,\,\beta} \; \sum_i \bigl[1 - y_i f(x_i)\bigr]_+ + \frac{1}{2C}\,\|\beta\|^2$$

where the two classes are coded as y_i = ±1. This is
not dissimilar (Hastie et al., 2001, p. 380) to a logistic regression with weight
decay (a penalty on ||β||²). The claimed advantage of SVMs is
that the optimization depends on the expanded variables only through inner products,
which for suitable families of functions can be computed cheaply via a kernel.
There is an implementation of SVMs for R in function svm in package
e1071.⁴ The default values do not do well, but after some tuning for the
crabs data we can get a good discriminant with 21 support vectors. Here is
a sketch of such a fit and its result.
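The cost and gamma values below are illustrative, not the tuned values used in the text:

    library(e1071); library(MASS)
    crabs.svm <- svm(sex ~ FL + CL + CW + BD, data = crabs,
                     cost = 100, gamma = 1)   # illustrative tuning values
    summary(crabs.svm)                        # reports the number of support vectors
    table(true = crabs$sex, predicted = predict(crabs.svm, crabs))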

We can try a 10-fold cross-validation by a call such as the one sketched below.
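For example, using svm's built-in cross-validation (a sketch, not the call used in the text):

    crabs.svm.cv <- svm(sex ~ FL + CL + CW + BD, data = crabs,
                        cost = 100, gamma = 1, cross = 10)
    summary(crabs.svm.cv)   # includes the accuracy in each of the 10 folds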

³ See Section 16.2 for S software for this problem; however, special-purpose software is often used.
⁴ Code by David Meyer, based on C++ code by Chih-Chung Chang and Chih-Jen Lin. A port to
S-PLUS is available for machines with a C++ compiler.

The extension to more than two classes is much less elegant, and several ideas have
been used. The function svm uses one attributed to Knerr et al. (1990) in which
a classifier is built for each pair of classes, and the majority vote amongst
the resulting classifiers determines the predicted class.

12.6 Forensic Glass Example

The forensic glass dataset fgl has 214 points from six classes with nine mea-
surements, and provides a fairly stiff test of classification methods. As we have
seen (Figures 4.17 on page 99, 5.4 on page 116, 11.5 on page 309 and 12.3 on
page 337) the types of glass do not form compact well-separated groupings, and
the marginal distributions are far from normal. There are some small classes (with
9, 13 and 17 examples), so we cannot use quadratic discriminant analysis.
We assess their performance by 10-fold cross-validation, using the same ran-
dom partition for all the methods. Logistic regression provides a suitable bench-
mark (as is often the case), and in this example linear discriminant analysis does
equally well.
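A sketch of such a cross-validation for the multinomial logistic fit (the book's helper functions and random partition are not reproduced; the seed is illustrative):

    library(MASS); library(nnet)
    set.seed(123)                                   # illustrative seed
    rand <- sample(rep(1:10, length = nrow(fgl)))   # a random 10-fold partition
    cv.pred <- rep(NA_character_, nrow(fgl))
    for (i in 1:10) {
      fit <- multinom(type ~ ., data = fgl[rand != i, ], maxit = 1000, trace = FALSE)
      cv.pred[rand == i] <- as.character(predict(fit, fgl[rand == i, ]))
    }
    table(true = fgl$type, predicted = factor(cv.pred, levels = levels(fgl$type)))
    mean(cv.pred != fgl$type)                       # cross-validated error rate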

We can use nearest-neighbour methods to estimate the lower bound on the Bayes
risk as about 10% (Ripley, 1996, pp. 196–7).
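A sketch of the kind of calculation involved (the scaling and the rough halving of the 1-NN error rate are simplifications of the method in Ripley, 1996):

    library(class)
    fgl.sc <- scale(fgl[, 1:9])               # standardized measurements
    nn1 <- knn.cv(fgl.sc, fgl$type, k = 1)    # leave-one-out 1-NN predictions
    err1 <- mean(nn1 != fgl$type)
    err1 / 2   # roughly a lower bound on the Bayes risk for small error rates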

Classification trees are another natural candidate for
this dataset. We need to cross-validate over the choice of tree size, which does
vary by group from four to seven.

Neural networks
We wrote some general functions for testing neural network models by V-fold
cross-validation. First we rescale the dataset so the inputs have range [0, 1].
We want to average across several fits and to choose the number of hidden units and the
amount of weight decay by an inner cross-validation. To do so we wrote a fairly
general function; see the scripts for the code. (The computations take a considerable
time on the PC.)

This code chooses between neural nets on the basis of their cross-validated
error rate. An alternative is to use logarithmic scoring, which is equivalent to
finding the deviance. Rather than counting 0 if the predicted
class is correct and 1 otherwise, we count −log p(c | x) for the true class c. We
can easily code this variant by replacing the line of the cross-validation function
that accumulates the error count by one that accumulates −log p(c | x).
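A sketch of the rescaling and of the logarithmic score for a single fitted network (the book's cross-validation wrappers are in the scripts and not reproduced here; size and decay are illustrative):

    library(MASS); library(nnet)
    rng <- apply(fgl[, 1:9], 2, range)
    fgl01 <- scale(fgl[, 1:9], center = rng[1, ], scale = rng[2, ] - rng[1, ])
    fgl.net <- nnet(fgl01, class.ind(fgl$type), size = 6, decay = 0.01,
                    softmax = TRUE, maxit = 1000, trace = FALSE)
    p <- predict(fgl.net, fgl01)
    ## logarithmic score: sum of -log p(true class | x)
    -sum(log(pmax(p[cbind(1:nrow(p), as.integer(fgl$type))], 1e-12)))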
Support vector machines

The following is faster, but not strictly comparable with the results above, as a
different random partition will be used.
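A sketch using svm's built-in cross-validation (illustrative tuning values):

    library(e1071); library(MASS)
    fgl.svm <- svm(type ~ ., data = fgl, cost = 100, gamma = 1, cross = 10)
    summary(fgl.svm)    # includes the 10-fold cross-validation accuracies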

Learning vector quantization


For LVQ as for k-nearest neighbour methods we have to select a suitable metric.
The following experiments used Euclidean distance on the original variables, but
the rescaled variables or Mahalanobis distance could also be tried.

We set an even prior over the classes as otherwise there are too few representatives
of the smaller classes. Our initialization code in lvqinit follows Kohonen's in
selecting the number of representatives; in this problem 24 points are selected,
four from each class.
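A sketch of such an LVQ fit (illustrative settings; with a different seed or settings the number of representatives selected may differ from the 24 quoted above):

    library(class); library(MASS)
    set.seed(101)                                  # initialization is random
    fgl.x <- as.matrix(fgl[, 1:9])                 # Euclidean distance on original variables
    cd0 <- lvqinit(fgl.x, fgl$type, prior = rep(1, 6)/6, k = 3)
    nrow(cd0$x)                                    # number of representatives selected
    cd1 <- olvq1(fgl.x, fgl$type, cd0)             # optimized learning VQ, first pass
    table(true = fgl$type, predicted = lvqtest(cd1, fgl.x))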

The initialization is random, so your results are likely to differ.

12.7 Calibration Plots

One measure that a suitable model for p(c | x) has been found is that the predicted
probabilities are well calibrated; that is, that a fraction of about p of the events
we predict with probability p actually occur. Methods for testing calibration of
probability forecasts have been developed in connection with weather forecasts
(Dawid, 1982, 1986).
For the forensic glass example we are making six probability forecasts for
each case, one for each class. To ensure that they are genuine forecasts, we should
use the cross-validation procedure. A minor change to the code gives the proba-
bility predictions:
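A sketch of that change, continuing the cross-validation sketch above (rand is the same random partition):

    library(MASS); library(nnet)
    cv.prob <- matrix(NA, nrow(fgl), nlevels(fgl$type),
                      dimnames = list(NULL, levels(fgl$type)))
    for (i in 1:10) {
      fit <- multinom(type ~ ., data = fgl[rand != i, ], maxit = 1000, trace = FALSE)
      cv.prob[rand == i, ] <- predict(fit, fgl[rand == i, ], type = "probs")
    }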

Figure 12.9: Calibration plot for the cross-validated probability forecasts on the forensic
glass data: the observed fraction of events is plotted against the predicted probability.

We can plot these and smooth them by
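A sketch of such a plot (supsmu is used here as one variable-bandwidth smoother; the text's exact code is not reproduced):

    library(MASS); library(nnet)
    y.ind <- class.ind(fgl$type)           # 0/1 indicators of the true classes
    p.hat <- as.vector(cv.prob)            # all six forecasts for every case
    y     <- as.vector(y.ind)
    plot(p.hat, y, xlab = "predicted probability", ylab = "")
    lines(supsmu(p.hat, y))                # a variable-bandwidth smoother
    abline(0, 1, lty = 2)                  # the line of perfect calibration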

A smoothing method with an adaptive bandwidth is needed here,
as the distribution of points along the x-axis can be very much more uneven
than in this example. The result is shown in Figure 12.9. This plot does show
a tendency for the predictions to be over-confident, especially at probabilities near
one. Indeed, only 22/64 of the events predicted with probability greater than
0.9 occurred. (The underlying cause is the multimodal nature of some of the
underlying class distributions.)

Some of this over-confidence can be attributed to the
use of plug-in rather than predictive estimates. Then the plot can be used to adjust
the probabilities (which may need further adjustment to sum to one for more than
two classes).
