
Chapter 12

Classification

Classification is an increasingly important application of modern methods in
statistics. In the statistical literature the word is used in two distinct senses. The
entry (Hartigan, 1982) in the original Encyclopedia of Statistical Sciences uses the
sense of cluster analysis discussed in Section 11.2. Modern usage is leaning to the
other meaning (Ripley, 1997) of allocating future cases to one of a number of
pre-specified classes, and it is in this sense that the word is used in this chapter.
(The older statistical literature sometimes refers to this as allocation.)


In pattern-recognition terminology this chapter is about supervised methods.
The classical methods of multivariate analysis (Krzanowski, 1988; Mardia, Kent
and Bibby, 1979; McLachlan, 1992) have largely been superseded by methods
from pattern recognition (Ripley, 1996; Webb, 1999; Duda et al., 2001), but some
still have a place.
It is sometimes helpful to distinguish discriminant analysis in the sense of
describing the differences between the classes from classification, the allocation
of future cases to classes. The first task provides some measure of explana-
tion. In many applications no explanation is required (no one cares how machines
read postal (zip) codes, only that the envelope is correctly sorted) but in others,
especially in medicine, some explanation may be necessary to get the methods
adopted.
Classification is also a central part of data mining, although some of data mining is ex-
ploratory in the sense of Chapter 11. Hand et al. (2001) and (especially) Hastie
et al. (2001) are pertinent introductions.
Some of the methods considered in earlier chapters are widely used for clas-
sification: classification trees, logistic regression for two groups and
multinomial log-linear models (Section 7.3) for more than two groups.

12.1 Discriminant Analysis

Suppose that we have a set of g classes, and for each case we know the class
(assumed correctly). We can then use the class information to help reveal the
structure of the data. Let W denote the within-class covariance matrix, that is the
covariance matrix of the variables centred on the class mean, and B denote the
between-classes covariance matrix, that is, of the predictions by the class means.
Let M be the g x p matrix of class means, and G be the n x g matrix of class
indicator variables (so g_ij = 1 if and only if case i is assigned to class j). Then
the predictions are GM. Let x-bar be the means of the variables over the whole
sample. Then the sample covariance matrices are

$$W = \frac{(X - GM)^T (X - GM)}{n - g}, \qquad B = \frac{(GM - 1\bar{x})^T (GM - 1\bar{x})}{g - 1} \tag{12.1}$$

Note that B has rank at most min(p, g − 1).
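As a concrete illustration (a sketch using the iris data, not code from the text), the quantities in (12.1) can be computed directly:

    X <- as.matrix(iris[, 1:4]); cl <- iris$Species
    n <- nrow(X); g <- nlevels(cl)
    M <- apply(X, 2, function(v) tapply(v, cl, mean))  # g x p matrix of class means
    G <- model.matrix(~ cl - 1)                        # n x g class indicator matrix
    W <- crossprod(X - G %*% M) / (n - g)              # within-class covariance matrix
    B <- crossprod(G %*% M - rep(1, n) %o% colMeans(X)) / (g - 1)  # between-classes
    qr(B)$rank                                         # at most min(p, g - 1) = 2 here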


Fisher (1936) introduced a linear discriminant analysis seeking a linear com-
bination of the variables that has a maximal ratio of the separation of the class
means to the within-class variance, that is, maximizing the ratio a^T B a / a^T W a.
To compute this, choose a sphering (see page 305) of the variables so that
they have the identity as their within-group correlation matrix. On the rescaled
variables the problem is to maximize a^T B a subject to ||a|| = 1, and as we saw
for PCA, this is solved by taking a to be the eigenvector of B corresponding to
the largest eigenvalue. The linear combination is unique up to a change of sign
(unless there are multiple eigenvalues). The exact multiple of a returned by a
program will depend on its definition of the within-class covariance matrix; we use
the conventional divisor of n − g, but divisors of n and n − 1 have been used.
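Continuing the sketch above, Fisher's procedure is an eigendecomposition after sphering (again an illustration, not code from the text):

    S <- solve(chol(W))                    # sphering: X %*% S has identity
                                           # within-group covariance
    ev <- eigen(t(S) %*% B %*% S, symmetric = TRUE)
    a <- ev$vectors[, 1]                   # direction with the largest ratio
    scaling <- S %*% ev$vectors            # discriminants on the original scale
    round(ev$values, 4)                    # at most min(p, g - 1) are positive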


As for principal components, we can take further linear components corre-
sponding to the next largest eigenvalues. There will be at most min(p, g − 1)
positive eigenvalues. Note that the eigenvalues are the proportions of the between-
classes variance explained by the linear combinations, which may help us to
choose how many to use. The corresponding transformed variables are called the
linear discriminants or canonical variates. It is often useful to plot the data on the
first few linear discriminants (Figure 12.1). Since the within-group covariances
should be the identity, we chose an equal-scaled plot. (Using plot on the fitted
object will give this plot without the colours.) The linear discriminants are convention-
ally centred to have mean zero on the dataset.
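A plot in the spirit of Figure 12.1 can be produced along the following lines (a sketch assuming the iris data; the exact call used for the figure is not reproduced here):

    library(MASS)
    ir.lda <- lda(log(iris[, 1:4]), iris$Species)
    ir.ld  <- predict(ir.lda)$x            # scores on the linear discriminants
    eqscplot(ir.ld, type = "n", xlab = "first linear discriminant",
             ylab = "second linear discriminant")
    # s = setosa, c = versicolor, v = virginica (letter coding assumed)
    text(ir.ld, labels = c("s", "c", "v")[unclass(iris$Species)])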

Figure 12.1: The log iris data on the first two linear discriminants (first linear
discriminant on the x-axis, second linear discriminant on the y-axis), with points
labelled by species (s, c, v).

The approach we have illustrated is the conventional one, following Bryan
(1951). The definition of the between-classes matrix B at (12.1) weights the
groups by their size in the dataset. Rao (1948) used the unweighted covariance
matrix of the group means, and our software uses a covariance matrix weighted
by the class prior probabilities when these are specified.
Discrimination for normal populations


An alternative approach to discrimination is via probability models. Let π_c de-
note the prior probabilities of the classes, and p_c(x) the densities of distribu-
tions of the observations for each class. Then the posterior distribution of the
classes after observing x is

$$p(c \mid x) = \frac{\pi_c\, p_c(x)}{p(x)} \propto \pi_c\, p_c(x) \tag{12.2}$$

and it is fairly simple to show that the allocation rule which makes the smallest
expected number of errors chooses the class with maximal p(c | x); this is known
as the Bayes rule. (We consider a more general version in Section 12.2.)

Now suppose the distribution for class c is multivariate normal with mean μ_c
and covariance Σ_c. Then the Bayes rule minimizes

$$Q_c = -2 \log p_c(x) - 2 \log \pi_c = (x - \mu_c)\,\Sigma_c^{-1} (x - \mu_c)^T + \log|\Sigma_c| - 2 \log \pi_c \tag{12.3}$$

The first term of (12.3) is the squared Mahalanobis distance to the class centre,
and can be calculated by the function mahalanobis. The difference between
the Q_c for two classes is a quadratic function of x, so the method is known as
quadratic discriminant analysis and the boundaries of the decision regions are
quadratic surfaces in x space. This is implemented by our function qda.
Further suppose that the classes have a common covariance matrix Σ. Dif-
ferences in the Q_c are then linear functions of x, and we can maximize −Q_c/2
or
$$L_c = x\,\Sigma^{-1}\mu_c^T - \mu_c\,\Sigma^{-1}\mu_c^T/2 + \log \pi_c \tag{12.4}$$
To use (12.3) or (12.4) we have to estimate μ_c and Σ_c or Σ. The obvious
estimates are used, the sample mean and covariance matrix within each class, and
W for Σ.
How does this relate to Fisher's linear discrimination? The latter gives new
variables, the linear discriminants, with unit within-class sample variance, and the
differences between the group means lie in the first min(p, g − 1) of the new
variables. Thus on these variables the Mahalanobis distance (with respect to the
estimate W of Σ) is just ||x − μ_c||², and only the first r = min(p, g − 1)
components of the vector depend on c. Similarly, on these variables

$$L_c = x\,\mu_c^T - \|\mu_c\|^2/2 + \log \pi_c$$

and we can work in r dimensions. If there are just two classes, there is a single
linear discriminant, and

$$L_2 - L_1 = x\,(\mu_2 - \mu_1)^T + \text{const}$$

so the rule depends on x only through the linear discriminant, whose coefficient
vector is (μ_2 − μ_1) rescaled to unit length.


Note that linear discriminant analysis uses a posterior p(c | x) that is a logistic regres-
sion for two classes and a multinomial log-linear model for more than two. However, it
differs from the methods of Chapter 7 in the methods of parameter estimation
used. Linear discriminant analysis will be better if the populations really are mul-
tivariate normal with equal within-group covariance matrices, but that superiority
is fragile, so the methods of Chapter 7 are usually preferred.

The crabs dataset

Can we construct a rule to predict the sex of a future Leptograpsus crab of un-
known colour form (species)? We noted that one of the measurements is recorded
differently for males and females, so it seemed prudent to omit it from the analysis.
To start with, we ignore the differences between the forms. Linear discriminant
analysis is performed on the logged measurements; a linear combination of logs is
the log of a product of powers of the measurements, and hence
a dimensionally neutral quantity. Six errors are made, all for the blue form:
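A sketch of such a fit (assuming the crabs data frame in MASS, with RW taken to be the omitted measurement; this is not the exact call used in the text, so the error count may differ):

    library(MASS)
    lcrabs <- log(crabs[, c("FL", "CL", "CW", "BD")])  # logged measurements, RW omitted
    cr.lda <- lda(lcrabs, crabs$sex)
    table(true = crabs$sex, predicted = predict(cr.lda)$class)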

It does make sense to take the colour forms into account, especially as the
within-group distributions look close to joint normality (look at Figure 4.13).
With four groups the first two linear discriminants dominate the
between-group variation; Figure 12.2 shows the data on those variables.

We cannot represent all the decision surfaces exactly on a plot. However,
a plot on the first two linear discriminants gives a good ap-
proximation; see Figure 12.2.



Figure 12.2: Linear discriminants for the crabs data (Second LD against First LD). Males are coded as capitals, fe-
males as lower case, colours as the initial letter of blue or orange. The crosses are the
group means. The solid line is the decision boundary of a linear discriminant for sex, and the dashed line is the deci-
sion boundary for sex based on four groups.

The reader is invited to try quadratic discrimination on this problem. It per-
forms very marginally better than linear discrimination, not surprisingly since the
covariances of the groups appear so similar, as can be seen by comparing the
estimated within-group covariance matrices.
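One way to look at this, continuing the sketch above (these are not the calls used in the text):

    by(lcrabs, list(crabs$sp, crabs$sex), var)   # within-group covariance matrices
    cr.qda <- qda(lcrabs, crabs$sex)             # quadratic discrimination
    table(true = crabs$sex, predicted = predict(cr.qda)$class)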

Robust estimation of multivariate location and scale


We may wish to consider more robust estimates of W (but not B). Somewhat
surprisingly, it does not suffice to apply a univariate robust estimator to each
component of a multivariate mean (Rousseeuw and Leroy, 1987, p. 250), and it is
easier to consider the estimation of mean and variance simultaneously.
Multivariate variances are very sensitive to outliers. Two methods for robust
covariance estimation are available via our function cov.rob¹ and the S-PLUS
functions cov.mve and cov.mcd (Rousseeuw, 1984; Rousseeuw and Leroy,
1987), as well as in other library sections. Suppose there are n observations
of p variables. The minimum volume ellipsoid method seeks an ellipsoid contain-
ing at least half the points that is of minimum volume, and the minimum
covariance determinant method seeks a subset of at least half the points whose
covariance matrix has minimum determinant.
¹ Adopted by R in package lqs.

Figure 12.3: The forensic glass data on the first two linear discriminants (LD1 and
LD2), computed with and without robust estimation of the common covariance matrix.

Our function cov.rob implements both.


The search for an MVE or MCD provides points whose mean and variance
matrix are used as an initial estimate. This is refined by selecting
those points whose Mahalanobis distance from the initial mean using the initial
covariance matrix is not too large (specifically, within the 97.5% point under normality),
and returning their mean and variance matrix.
An alternative approach is to extend the idea of M-estimation to this setting,
fitting a multivariate t distribution for a small number of degrees of free-
dom. This is implemented in our function cov.trob; the theory behind the algo-
rithm used is given in Kent, Tyler and Vardi (1994) and Ripley (1996). Normally
cov.trob is faster than cov.rob, but it lacks the latter's extreme resistance. We
can use linear discriminant analysis on more than two classes, and illustrate this
with the forensic glass dataset fgl.
Our function lda has an argument method = "mve" to use the minimum
volume ellipsoid estimate (but without robust estimation of the group centres) or
the multivariate t distribution by setting method = "t". This makes a consid-
erable difference for the forensic glass data, as Figure 12.3 shows. We use
the default number of degrees of freedom.
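A rough analogue of the fits behind Figure 12.3 (a sketch assuming the fgl data frame in MASS; the plotting details differ from the figure):

    library(MASS)
    fgl.lda  <- lda(type ~ ., data = fgl)                # classical estimates
    fgl.rlda <- lda(type ~ ., data = fgl, method = "t")  # multivariate t estimates
    plot(predict(fgl.rlda)$x[, 1:2], xlab = "LD1", ylab = "LD2",
         pch = as.integer(fgl$type))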

Try plotting the robust discriminant scores against the classical ones, which gives an almost linear plot.

12.2 Classification Theory

In the terminology of pattern recognition the given examples together with their
classifications are known as the training set, and future cases form the test set. Our
primary measure of success is the error (or misclassification) rate. Note that we
would obtain (possibly seriously) biased estimates by re-classifying the training
set, but that the error rate on a test set randomly chosen from the whole population
will be an unbiased estimator.
It may be helpful to know the type of errors made. A confusion matrix gives
the number of cases with true class i classified as of class j. In some problems
some errors are considered to be worse than others, so we assign costs L_ij to
allocating a case of class i to class j. Then we will be interested in the average
error cost rather than the error rate.
It is fairly easy to show (Ripley, 1996, p. 19) that the average error cost is
minimized by the Bayes rule, which is to allocate to the class c minimizing
the sum over i of L_ic p(i | x), where p(i | x) is the posterior distribution of the classes after ob-
serving x. If the costs of all errors are the same, this rule amounts to choosing
the class c with the largest posterior probability p(c | x). The minimum average
cost is known as the Bayes risk. We can often estimate a lower bound for it by the
method of Ripley (1996, pp. 196–7) (see the example on page 347).
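A toy illustration of a confusion matrix and an average error cost, with invented classes and costs (not from the text):

    L <- matrix(c(0, 1, 5,
                  1, 0, 1,
                  5, 1, 0), 3, 3, byrow = TRUE)   # cost of allocating class i to class j
    truth <- factor(c("a", "a", "a", "b", "b", "c"))
    pred  <- factor(c("a", "b", "a", "b", "c", "c"), levels = levels(truth))
    (confusion <- table(truth, pred))     # counts of true class i classified as j
    sum(confusion * L) / sum(confusion)   # average error cost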
We saw in Section 12.1 how p(c | x) can be computed for normal popula-
tions, and how estimating the Bayes rule with equal error costs leads to lin-
ear and quadratic discriminant analysis. As our functions lda and qda return
the posterior probabilities, they can also be used to estimate the Bayes rule for classifica-
tion with unequal error costs.


The posterior probabilities p(c | x) may also be estimated directly. For just
two classes we can model the log odds by a logistic regression. For more than
two classes we need a multiple logistic model; it may be possible to fit
this using a surrogate log-linear Poisson GLM model (Section 7.3), but using the
multinom function in library section nnet will usually be faster and easier.
Classification trees model p(c | x) directly by a special multiple logistic
model, one in which the right-hand side is a single factor specifying which leaf
the case will be assigned to by the tree. Again, since the posterior probabilities
are given by the predict method it is easy to estimate the Bayes rule for unequal
error costs.
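A sketch of such a direct fit on the forensic glass data (assuming fgl from MASS; the settings are illustrative):

    library(MASS); library(nnet)
    fgl.mn <- multinom(type ~ ., data = fgl, maxit = 1000, trace = FALSE)
    post <- predict(fgl.mn, type = "probs")   # estimated posterior probabilities
    head(round(post, 3))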

Predictive and ‘plug-in’ rules

To apply the Bayes rule we need to know the posterior probabilities p(c | x). Since these are unknown
we use an explicit or implicit parametric family p(c | x; θ). In the methods con-
sidered so far we act as if p(c | x; θ̂) were the actual posterior probabilities, where
θ̂ is an estimate computed from the training set T, often by maximizing some ap-
propriate likelihood. This is known as the 'plug-in' rule. However, the 'correct'
estimate of p(c | x) is (Ripley, 1996, §2.4) to use the predictive estimates

$$\tilde p(c \mid x) = p(c \mid x,\, T) \propto \pi_c \int p(x \mid c,\, \theta)\, p(\theta \mid T)\, d\theta \tag{12.5}$$

If we are very sure of our estimate θ̂ there will be little difference between
p(c | x; θ̂) and the predictive estimate; otherwise the predictive estimate will normally be less
extreme (not as near 0 or 1). The 'plug-in' estimate ignores the uncertainty in the
parameter estimate which the predictive estimate takes into account.
It is not often possible to perform the integration in (12.5) analytically, but it
is possible for linear and quadratic discrimination with appropriate 'vague' pri-
ors on θ (Aitchison and Dunsmore, 1975; Geisser, 1993; Ripley, 1996). This
estimate is implemented by method = "predictive" of the predict meth-
ods for our functions lda and qda. Often the differences are small, especially
for linear discrimination, provided there are enough data for a good estimate of
the variance matrices. When there are not, Moran and Murphy (1979) argue
that considerable improvement can be obtained by using an unbiased estimator
of log p(x | c), implemented by the argument method = "debiased".

A simple example: Cushing’s syndrome


We illustrate these methods by a small example taken from Aitchison and Dun-
smore (1975, Tables 11.1–3) and used for the same purpose by Ripley (1996).
The data are on diagnostic tests on patients with Cushing’s syndrome, a hy-
persensitive disorder associated with over-secretion of cortisol by the adrenal
gland. This dataset has three recognized types of the syndrome represented as
a, b, c. (These encode ‘adenoma’, ‘bilateral hyperplasia’ and ‘carcinoma’, and
represent the underlying cause of over-secretion. This can only be determined
histopathologically.) The observations are urinary excretion rates (mg/24 h) of
the steroid metabolites tetrahydrocortisone and pregnanetriol, and are considered
on log scale.
There are six patients of unknown type (marked u), one of whom was later
found to be of a fourth type, and another was measured faultily.
Figure 12.4 shows the classifications given by linear discriminant analysis and the various options
of quadratic discriminant analysis. This was produced by code along the following lines.
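A sketch of those fits (assuming the Cushings data frame in MASS; not the exact code used to draw Figure 12.4):

    library(MASS)
    cush <- log(as.matrix(Cushings[, -3]))      # log urinary excretion rates
    tp   <- factor(Cushings$Type[1:21])         # the 21 cases of known type
    cush.lda <- lda(cush[1:21, ], tp)
    cush.qda <- qda(cush[1:21, ], tp)
    ## posterior probabilities for the six cases of unknown type, three ways
    round(predict(cush.qda, cush[22:27, ])$posterior, 3)                         # plug-in
    round(predict(cush.qda, cush[22:27, ], method = "predictive")$posterior, 3)  # predictive
    round(predict(cush.qda, cush[22:27, ], method = "debiased")$posterior, 3)    # debiased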

Figure 12.4: Linear and quadratic discriminant analysis applied to the Cushing's syndrome
data. The panels show LDA, QDA, QDA (predictive) and QDA (debiased); in each,
Pregnanetriol is plotted against Tetrahydrocortisone on log scales, with the six cases of
unknown type marked u.

(The plotting function used here is given in the scripts.)


We can contrast these with logistic discrimination, fitted by multinom as
sketched below. (The plotting function is again given in the scripts.) When, as
here, the classes have quite different variance matrices, linear and logistic
discrimination can give quite different answers (compare Figures 12.4 and 12.5).
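A sketch of such a fit, continuing the Cushing's code above (the exact call in the text is not reproduced):

    library(nnet)
    Cf <- data.frame(tp = tp,
                     Tetrahydrocortisone = cush[1:21, 1],
                     Pregnanetriol       = cush[1:21, 2])
    cush.logit <- multinom(tp ~ Tetrahydrocortisone + Pregnanetriol, data = Cf,
                           maxit = 250, trace = FALSE)
    newd <- data.frame(Tetrahydrocortisone = cush[22:27, 1],
                       Pregnanetriol       = cush[22:27, 2])
    round(predict(cush.logit, newd, type = "probs"), 3)  # unknown-type cases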

Figure 12.5: Logistic discrimination applied to the Cushing's syndrome data, plotted as
in Figure 12.4.

Mixture discriminant analysis


Another application of the (plug-in) theory is mixture discriminant analysis
(Hastie and Tibshirani, 1996), which has an implementation in the library sec-
tion mda. This fits a mixture of normal distributions to each class and
then applies (12.2).

12.3 Non-Parametric Rules

There are several non-parametric classifiers based on non-parametric esti-
mates of the class densities or of the log posterior. Library section class imple-
ments the k-nearest neighbour classifier and related methods (Devijver and Kit-
tler, 1982; Ripley, 1996) and learning vector quantization (Kohonen, 1990, 1995).
These are all based on finding the k nearest examples in some
reference set, and taking a majority vote among the classes of these examples,
or, equivalently, estimating the posterior probabilities p(c | x) by the proportions
of the classes among the examples.
The methods differ in their choice of reference set. The k-nearest neighbour
methods use the whole training set or an edited subset. Learning vector quantiza-
tion is similar to K-means in selecting points in the space other than the training

Figure 12.6: k-nearest neighbours (1-NN and 3-NN) applied to the Cushing's syndrome
data, plotted as in Figure 12.4.

set examples to summarize the training set, but unlike K-means it takes the classes
of the examples into account.
These methods almost always measure ‘nearest’ by Euclidean distance. For
the Cushing’s syndrome data we use Euclidean distance on the logged covariates,
rather arbitrarily scaling them equally.
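A sketch of the kind of computation behind Figure 12.6, using cush and tp from the earlier Cushing's sketch (the grid ranges are illustrative):

    library(class)  # knn
    xp <- seq(0.6, 4.0, length = 100)
    yp <- seq(-3.25, 2.45, length = 100)
    grid <- expand.grid(Tetrahydrocortisone = xp, Pregnanetriol = yp)
    Z1 <- knn(cush[1:21, ], grid, tp, k = 1)   # 1-NN classification of the grid
    Z3 <- knn(cush[1:21, ], grid, tp, k = 3)   # 3-NN, with random tie-breaking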

This dataset is too small to try the editing and LVQ methods in library section
class.

12.4 Neural Networks

Neural networks provide a flexible non-linear extension of multiple logistic re-
gression, as we saw in Section 8.10. We can consider them for the Cushing's
syndrome example by the following code.²
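A sketch of such a fit, again using cush and tp from the earlier sketch (the size, decay and seed are illustrative, not the settings used in the text):

    library(nnet)
    set.seed(1)                              # results vary with the random start
    cush.nn <- nnet(cush[1:21, ], class.ind(tp), size = 2, decay = 0.001,
                    softmax = TRUE, maxit = 1000, trace = FALSE)
    ## predicted class for the six cases of unknown type
    levels(tp)[max.col(predict(cush.nn, cush[22:27, ]))]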

² … S environment.

Figure 12.7: Neural networks applied to the Cushing's syndrome data. The panels show
fits with size = 2 (no weight decay), size = 2 with lambda = 0.001, size = 2 with
lambda = 0.01, and sizes 5 and 20 with lambda = 0.01; axes as in Figure 12.4. Each panel shows
the fits from several random starting points.
The results are shown in Figure 12.7. We see that in all cases there are multiple
local maxima, and different local maxima can give quite different classifiers.
Once we have a penalty, the choice of the number of hidden units is often not
critical (see Figure 12.7). The spirit of the predictive approach is to average the
predicted posterior probabilities over the local maxima.
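A minimal sketch of this averaging (the size and decay echo Figure 12.8 but are assumptions, as is the number of restarts):

    unknown <- cush[22:27, ]
    pr <- matrix(0, nrow(unknown), nlevels(tp))
    for (i in 1:20) {
      fit <- nnet(cush[1:21, ], class.ind(tp), size = 3, decay = 0.01,
                  softmax = TRUE, maxit = 1000, trace = FALSE)
      pr <- pr + predict(fit, unknown)       # accumulate predicted probabilities
    }
    round(pr / 20, 3)                        # averaged over the random restarts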

Figure 12.8: Neural networks with three hidden units and weight decay applied to the Cush-
ing's syndrome data. The left panel ('Many local maxima') shows the decision boundaries
from individual fits; the right panel ('Averaged') shows the boundary from the averaged
predicted probabilities.

Note that there are two quite different types of local maxima occurring here, and
some local maxima occur several times (up to convergence tolerances). An aver-
age over the fits gives a more stable classifier; see Figure 12.8.

12.5 Support Vector Machines

Support vector machines (SVMs) have attracted a great deal of recent attention.
They have been promoted enthusiastically, but with little respect to the selection
effects of choosing the test problem and the member of the large class of classi-
fiers to use. The main ideas appear in Boser et al. (1992); Cortes and Vapnik
(1995); Vapnik (1995, 1998); the books by Cristianini and Shawe-Taylor (2000)
and Hastie et al. (2001, §§4.5, 12.2, 12.3) present the underlying theory.
The method for two classes is fairly simple to describe. Logistic regression
will fit the data perfectly in separable cases, where some hyperplane has all class-one
points on one side and all class-two points on the other. It would be a coincidence
for there to be only one such hyperplane, and the fitted hyperplane need not lie
in the middle of the 'gap' between the two classes. The support vector machine
chooses the separating hyperplane in the
middle of the gap, that is with maximal margin (the distance from the hyperplane
to the nearest point). This is a quadratic programming problem that can be solved
by standard methods.³ Such a hyperplane has support vectors, data points that are
exactly the margin distance away from the hyperplane. It will typically be a very
good classifier, but in most problems no separating hyperplane exists. That difficulty
is tackled in two ways. First, we can allow some points to be on the wrong side
of their margin (and for some on the wrong side of the hyperplane) subject to
a limit on the sum of the distances by which points violate their margin, a limit that
enters the optimization with a Lagrange multiplier C. This is still a quadratic programming problem,
but the resulting classifier is somewhat less natural,
because of the rather arbitrary use of a sum of distances.
Second, the set of variables is expanded greatly by taking non-linear functions
of the original set of variables. Thus rather than seeking a classifying hyperplane
f(x) = xβ + β₀, we seek f(x) = h(x)β + β₀ for a vector of non-linear
functions h. Finding the classifier is then equivalent
to solving

$$\min_{\beta_0,\,\beta} \; \sum_i \bigl[1 - y_i f(x_i)\bigr]_+ + \frac{1}{2C}\,\|\beta\|^2$$

where the two classes are coded as y_i = ±1. This is
not dissimilar (Hastie et al., 2001, p. 380) to a logistic regression with weight
decay (a penalty on ||β||²). The claimed advantage of SVMs is
that the optimization depends on the expanded variables only through inner products,
which for suitable families of functions can be computed cheaply via a kernel.
There is an implementation of SVMs for R in function svm in package
e1071.⁴ The default values do not do well, but after some tuning for the
crabs data we can get a good discriminant with 21 support vectors. Here is
a sketch of such a fit and its result.
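The cost and gamma values below are illustrative, not the tuned values used in the text:

    library(e1071); library(MASS)
    crabs.svm <- svm(sex ~ FL + CL + CW + BD, data = crabs,
                     cost = 100, gamma = 1)   # illustrative tuning values
    summary(crabs.svm)                        # reports the number of support vectors
    table(true = crabs$sex, predicted = predict(crabs.svm, crabs))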

We can try a 10-fold cross-validation by a call such as the one sketched below.
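For example, using svm's built-in cross-validation (a sketch, not the call used in the text):

    crabs.svm.cv <- svm(sex ~ FL + CL + CW + BD, data = crabs,
                        cost = 100, gamma = 1, cross = 10)
    summary(crabs.svm.cv)   # includes the accuracy in each of the 10 folds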

³ See Section 16.2 for S software for this problem; however, special-purpose software is often used.
⁴ Code by David Meyer, based on C++ code by Chih-Chung Chang and Chih-Jen Lin. A port to
S-PLUS is available for machines with a C++ compiler.

The extension to more than two classes is much less elegant, and several ideas have
been used. The function svm uses one attributed to Knerr et al. (1990) in which
a classifier is built for each pair of classes, and the majority vote amongst
the resulting classifiers determines the predicted class.

12.6 Forensic Glass Example

The forensic glass dataset fgl has 214 points from six classes with nine mea-
surements, and provides a fairly stiff test of classification methods. As we have
seen (Figures 4.17 on page 99, 5.4 on page 116, 11.5 on page 309 and 12.3 on
page 337) the types of glass do not form compact well-separated groupings, and
the marginal distributions are far from normal. There are some small classes (with
9, 13 and 17 examples), so we cannot use quadratic discriminant analysis.
We assess their performance by 10-fold cross-validation, using the same ran-
dom partition for all the methods. Logistic regression provides a suitable bench-
mark (as is often the case), and in this example linear discriminant analysis does
equally well.
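A sketch of such a cross-validation for the multinomial logistic fit (the book's helper functions and random partition are not reproduced; the seed is illustrative):

    library(MASS); library(nnet)
    set.seed(123)                                   # illustrative seed
    rand <- sample(rep(1:10, length = nrow(fgl)))   # a random 10-fold partition
    cv.pred <- rep(NA_character_, nrow(fgl))
    for (i in 1:10) {
      fit <- multinom(type ~ ., data = fgl[rand != i, ], maxit = 1000, trace = FALSE)
      cv.pred[rand == i] <- as.character(predict(fit, fgl[rand == i, ]))
    }
    table(true = fgl$type, predicted = factor(cv.pred, levels = levels(fgl$type)))
    mean(cv.pred != fgl$type)                       # cross-validated error rate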

We can use nearest-neighbour methods to estimate the lower bound on the Bayes
risk as about 10% (Ripley, 1996, pp. 196–7).
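A sketch of the kind of calculation involved (the scaling and the rough halving of the 1-NN error rate are simplifications of the method in Ripley, 1996):

    library(class)
    fgl.sc <- scale(fgl[, 1:9])               # standardized measurements
    nn1 <- knn.cv(fgl.sc, fgl$type, k = 1)    # leave-one-out 1-NN predictions
    err1 <- mean(nn1 != fgl$type)
    err1 / 2   # roughly a lower bound on the Bayes risk for small error rates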

Classification trees are another natural candidate for
this dataset. We need to cross-validate over the choice of tree size, which does
vary by group from four to seven.

Neural networks
We wrote some general functions for testing neural network models by V-fold
cross-validation. First we rescale the dataset so the inputs have range [0, 1].
We want to average across several fits and to choose the number of hidden units and the
amount of weight decay by an inner cross-validation. To do so we wrote a fairly
general function; see the scripts for the code. (The computations take a considerable
time on the PC.)

This code chooses between neural nets on the basis of their cross-validated
error rate. An alternative is to use logarithmic scoring, which is equivalent to
finding the deviance. Rather than counting 0 if the predicted
class is correct and 1 otherwise, we count −log p(c | x) for the true class c. We
can easily code this variant by replacing the line of the cross-validation function
that accumulates the error count by one that accumulates −log p(c | x).
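A sketch of the rescaling and of the logarithmic score for a single fitted network (the book's cross-validation wrappers are in the scripts and not reproduced here; size and decay are illustrative):

    library(MASS); library(nnet)
    rng <- apply(fgl[, 1:9], 2, range)
    fgl01 <- scale(fgl[, 1:9], center = rng[1, ], scale = rng[2, ] - rng[1, ])
    fgl.net <- nnet(fgl01, class.ind(fgl$type), size = 6, decay = 0.01,
                    softmax = TRUE, maxit = 1000, trace = FALSE)
    p <- predict(fgl.net, fgl01)
    ## logarithmic score: sum of -log p(true class | x)
    -sum(log(pmax(p[cbind(1:nrow(p), as.integer(fgl$type))], 1e-12)))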
Support vector machines

The following is faster, but not strictly comparable with the results above, as a
different random partition will be used.
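A sketch using svm's built-in cross-validation (illustrative tuning values):

    library(e1071); library(MASS)
    fgl.svm <- svm(type ~ ., data = fgl, cost = 100, gamma = 1, cross = 10)
    summary(fgl.svm)    # includes the 10-fold cross-validation accuracies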

Learning vector quantization


For LVQ as for k-nearest neighbour methods we have to select a suitable metric.
The following experiments used Euclidean distance on the original variables, but
the rescaled variables or Mahalanobis distance could also be tried.

We set an even prior over the classes as otherwise there are too few representatives
of the smaller classes. Our initialization code in lvqinit follows Kohonen's in
selecting the number of representatives; in this problem 24 points are selected,
four from each class.
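A sketch of such an LVQ fit (illustrative settings; with a different seed or settings the number of representatives selected may differ from the 24 quoted above):

    library(class); library(MASS)
    set.seed(101)                                  # initialization is random
    fgl.x <- as.matrix(fgl[, 1:9])                 # Euclidean distance on original variables
    cd0 <- lvqinit(fgl.x, fgl$type, prior = rep(1, 6)/6, k = 3)
    nrow(cd0$x)                                    # number of representatives selected
    cd1 <- olvq1(fgl.x, fgl$type, cd0)             # optimized learning VQ, first pass
    table(true = fgl$type, predicted = lvqtest(cd1, fgl.x))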

The initialization is random, so your results are likely to differ.

12.7 Calibration Plots

One measure that a suitable model for p(c | x) has been found is that the predicted
probabilities are well calibrated; that is, that a fraction of about p of the events
we predict with probability p actually occur. Methods for testing calibration of
probability forecasts have been developed in connection with weather forecasts
(Dawid, 1982, 1986).
For the forensic glass example we are making six probability forecasts for
each case, one for each class. To ensure that they are genuine forecasts, we should
use the cross-validation procedure. A minor change to the code gives the proba-
bility predictions:
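A sketch of that change, continuing the cross-validation sketch above (rand is the same random partition):

    library(MASS); library(nnet)
    cv.prob <- matrix(NA, nrow(fgl), nlevels(fgl$type),
                      dimnames = list(NULL, levels(fgl$type)))
    for (i in 1:10) {
      fit <- multinom(type ~ ., data = fgl[rand != i, ], maxit = 1000, trace = FALSE)
      cv.prob[rand == i, ] <- predict(fit, fgl[rand == i, ], type = "probs")
    }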

Figure 12.9: Calibration plot for the cross-validated probability forecasts on the forensic
glass data: the observed fraction of events is plotted against the predicted probability.

We can plot these and smooth them by
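A sketch of such a plot (supsmu is used here as one variable-bandwidth smoother; the text's exact code is not reproduced):

    library(MASS); library(nnet)
    y.ind <- class.ind(fgl$type)           # 0/1 indicators of the true classes
    p.hat <- as.vector(cv.prob)            # all six forecasts for every case
    y     <- as.vector(y.ind)
    plot(p.hat, y, xlab = "predicted probability", ylab = "")
    lines(supsmu(p.hat, y))                # a variable-bandwidth smoother
    abline(0, 1, lty = 2)                  # the line of perfect calibration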

A smoothing method with an adaptive bandwidth is needed here,
as the distribution of points along the x-axis can be very much more uneven
than in this example. The result is shown in Figure 12.9. This plot does show
a tendency for the predictions to be over-confident, especially at probabilities near
one. Indeed, only 22/64 of the events predicted with probability greater than
0.9 occurred. (The underlying cause is the multimodal nature of some of the
underlying class distributions.)

Some of this over-confidence can be attributed to the
use of plug-in rather than predictive estimates. Then the plot can be used to adjust
the probabilities (which may need further adjustment to sum to one for more than
two classes).
