
Pergamon

Vistas in Astronomy Vol. 41, No. 3, pp. 405-410, 1997
© 1997 Elsevier Science Ltd
Printed in Great Britain. All rights reserved
0083-6656/97 $15.00 + 0.00
PII: S0083-6656(97)00045-7

STATISTICAL CLASSIFICATION

JOSEF KITTLER

Centre for Vision, Speech and Signal Processing, School of Electronic Engineering, Information Technology and Mathematics, University of Surrey, Guildford GU2 5XH, UK

E-mail: J.Kittler@ee.surrey.ac.uk. This work was supported by the Engineering and Physical Sciences Research Council UK under Grant GR/K68165.

Abstract - Statistical pattern classification methodology is overviewed. The recent developments in classifier combination are reviewed. © 1997 Elsevier Science Ltd. All rights reserved.

1. INTRODUCTION
Many data analysis tasks involve pattern classification. The aim of pattern classification is to
assign observations (patterns) into semantic categories. This paper aims at providing an overview
of the main approaches to pattern classification system design. This will serve as a starting point
for the discussion of techniques which strive to improve the classification system performance by
means of combining multiple expert opinions. The discussion will be confined here to statistical
methods where pattern classes are adequately modelled by means of probability distributions.
The paper is organised as follows. In Section 2 we overview classical approaches to pattern
classification. In Section 3 we then review the recent advances in combining classifiers. Finally
Section 4 will provide a brief summary.

2. CLASSIFICATION
The commonly used methodology in pattern classification is underpinned by the statistical
decision theory in general, and the Bayes decision rule in particular (see, e.g., Refs. [1,2]). In
its general form the Bayes rule specifies how best decisions about the class membership of patterns can be made, taking into account their probability distribution and any given loss function. The loss function defines how misclassifications should be weighted, depending on the particular class assignment.
Let us assume that we associate a zero-one loss function with the decisions made by the pattern
classification system. In other words the loss is zero for every correct decision and it equals one
for any classification error made, regardless of the type of misclassification. Then the optimal minimum error decision rule, the Bayes minimum error decision rule, assigns pattern $x$ to class $\omega_j$ as
$$x \rightarrow \omega_j \quad \text{if} \quad P(\omega_j \mid x) = \max_i P(\omega_i \mid x), \tag{1}$$

where $P(\omega_i \mid x)$ is the $i$th class a posteriori probability, computable from the class conditional density $p(x \mid \omega_i)$ as

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i) P(\omega_i)}{\sum_{j=1}^{m} p(x \mid \omega_j) P(\omega_j)}. \tag{2}$$

The various classification schemes suggested in the literature depend on the assumptions made about the class distribution functions and on the method of estimating or approximating either the class densities $p(x \mid \omega_i)$ or the class a posteriori probability functions $P(\omega_i \mid x)$. The simplest decision rules can be developed under the assumption that the class conditional probability density functions have a known parametric form. Most frequently a Gaussian density is assumed, based on some physical argument regarding the pattern generation process or simply for the sake of simplicity. Accordingly, the density functions assume the form $p(x \mid \omega_i) = |2\pi \Sigma_i|^{-1/2} \exp\{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\}$, where $\mu_i$ is the mean vector of class $\omega_i$ and $\Sigma_i$ is the $i$th class covariance matrix. It is easy to show that the decision rule (1) can then be expressed in a parametric form.
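
To make the parametric route concrete, the following sketch (not from the paper; the class structure and data handling are illustrative assumptions) implements the Bayes minimum error rule (1) with Gaussian class conditional densities whose means, covariance matrices and priors are estimated from labelled training data:

```python
import numpy as np

class GaussianBayesClassifier:
    """Bayes minimum error rule (1) with Gaussian class conditional densities."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.means_, self.covs_ = [], [], []
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_.append(len(Xc) / len(X))        # P(w_i)
            self.means_.append(Xc.mean(axis=0))          # mu_i
            self.covs_.append(np.cov(Xc, rowvar=False))  # Sigma_i
        return self

    def _log_joint(self, X):
        # log p(x | w_i) + log P(w_i) for each class (monotonic in the posterior)
        scores = []
        for prior, mu, cov in zip(self.priors_, self.means_, self.covs_):
            diff = X - mu
            inv = np.linalg.inv(cov)
            maha = np.einsum('ij,jk,ik->i', diff, inv, diff)   # Mahalanobis distances
            log_det = np.linalg.slogdet(2 * np.pi * cov)[1]    # log |2 pi Sigma_i|
            scores.append(-0.5 * (maha + log_det) + np.log(prior))
        return np.column_stack(scores)

    def predict(self, X):
        # assign each pattern to the class with maximum a posteriori probability
        return self.classes_[np.argmax(self._log_joint(X), axis=1)]
```
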
When no assumption can be made about the form of the class conditional probability density
functions, a non-parametric approach to estimating the pdfs in (2) must be adopted. A classical
method devised for this purpose is the Parzen pdf estimation procedure [3,4]. However, the use
of a Parzen estimator for a direct implementation of the decision rule (1) is not prevalent because of its computational complexity. The motivation to reduce the computational complexity
of the Parzen approach led to the development of adaptive kernel methods and, more recently, radial basis function pdf estimation methods. In the adaptive kernel approach the number of kernels and
their position are free to vary in addition to the parameters associated with any other degree of
freedom of the kernel adopted [4]. The form of the estimate, or perhaps more accurately, of an
approximation $p^*(x \mid \omega_i)$ of the pdf, is as follows:

$$p^*(x \mid \omega_i) = \sum_{j=1}^{n_i} \nu_j K(x, a_j), \tag{3}$$

where $n_i$ denotes the number of kernels used in the approximation, $\nu_j$ is the weight associated with the $j$th kernel and $a_j$ is its position.
Given a set of training patterns $X_i$ from class $\omega_i$, the approximation $p^*(x \mid \omega_i)$ of $p(x \mid \omega_i)$ can be obtained by means of the EM algorithm. It should be noted that a certain trade-off can be achieved between the number of kernels used and their complexity, defined by the degrees of freedom available in their specification. Thus at one end of the spectrum we may resort to using a relatively large number of Gaussian kernels with an identity covariance matrix, while at the other end we may prefer a small number of kernels with a full covariance matrix. Note that from the computer storage point of view the former option may be preferable, in spite of the larger number of kernels, since for the full covariance matrix the number of model parameters grows quadratically with the dimensionality of the pattern space.
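
As an illustration of this trade-off, the following sketch (assuming the scikit-learn library; the data and parameter choices are hypothetical) fits a mixture approximation of the form (3) to the training patterns of one class with the EM algorithm, at both ends of the kernel-complexity spectrum:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_density(X_i, n_kernels, covariance_type="spherical"):
    """Approximate p(x | w_i), as in (3), by a Gaussian mixture fitted with EM.

    covariance_type='spherical' corresponds to many simple kernels,
    covariance_type='full' to fewer, more flexible kernels.
    """
    gmm = GaussianMixture(n_components=n_kernels,
                          covariance_type=covariance_type,
                          random_state=0)
    gmm.fit(X_i)
    return gmm

# hypothetical usage: compare the two ends of the kernel-complexity spectrum
X_i = np.random.randn(500, 4)                     # training patterns from class w_i
many_simple = fit_class_density(X_i, n_kernels=20, covariance_type="spherical")
few_full = fit_class_density(X_i, n_kernels=3, covariance_type="full")
log_density = many_simple.score_samples(X_i[:5])  # log p*(x | w_i) at five points
```
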
Alternatively, one may adopt the nearest neighbour approach, which aims to estimate directly the a posteriori probability functions $P(\omega_i \mid x)$ in (1). Suppose that we have available a training set of labelled patterns $X = \{(x_j, \beta_j) \mid j = 1, \ldots, N\}$, where $\beta_j \in \{\omega_1, \ldots, \omega_m\}$ is the class label for pattern $x_j$. Consider a point $x$ that we wish to classify and let us perform an experiment consisting of drawing, from the set $X$, the $k$ nearest neighbours to pattern $x$. Under some mild
assumptions, the ratio $k_i/k$, where $k_i$ is the number of nearest neighbours from class $\omega_i$, provides an unbiased estimate of the a posteriori probability $P(\omega_i \mid x)$.
The above argument suggests a very effective decision rule, known as the $k$-nearest neighbour (kNN) decision rule, which can be stated as follows:
$$x \rightarrow \omega_j \quad \text{if} \quad k_j = \max_i k_i. \tag{4}$$

Clearly $\sum_i k_i = k$. For a large training set the decision rule in (4) implements the Bayes decision rule (1). In practical situations with finite size training sets the kNN decision rule will only approximate the Bayes rule. Nevertheless, the rule (4) remains effective. It can be considered as a majority rule, i.e. an unknown pattern is assigned to the class of the majority of its $k$ nearest neighbours. Both theoretical arguments and empirical results suggest that the choice of $k$ should be related to the size $N$ of the training set. A rule of thumb relating $k$ to $N$ is $k = \sqrt{N}$.
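
A minimal sketch of the kNN rule (4), including the posterior estimates $k_i/k$, might look as follows (the training data and the Euclidean metric are illustrative assumptions):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k):
    """k-nearest neighbour rule (4): assign x to the class with the largest k_i.

    The ratios k_i / k also serve as estimates of the posteriors P(w_i | x).
    """
    dists = np.linalg.norm(X_train - x, axis=1)       # distances to all training patterns
    nearest_labels = y_train[np.argsort(dists)[:k]]   # labels of the k nearest neighbours
    classes, counts = np.unique(nearest_labels, return_counts=True)
    posteriors = dict(zip(classes, counts / k))       # k_i / k
    return classes[np.argmax(counts)], posteriors

# hypothetical usage, with k chosen by the sqrt(N) rule of thumb
X_train = np.random.randn(100, 2)
y_train = np.random.randint(0, 3, size=100)
k = int(round(np.sqrt(len(X_train))))
label, posteriors = knn_classify(np.zeros(2), X_train, y_train, k)
```
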

3. CLASSIFIER COMBINATION
The ultimate goal of designing pattern recognition systems is to achieve the best possible classification performance. Traditionally, this objective led to the development of different classification schemes, one of which would then be adopted on the basis of an experimental assessment of the different designs. It had been observed in such design studies that, although one of the designs would yield the best performance, the sets of patterns misclassified by the different classifiers would not necessarily overlap. This suggested that different classifier designs potentially offered complementary information about the patterns to be classified, which could be harnessed to improve the performance of the selected classifier.
These observations motivated the relatively recent interest in combining classifiers. The idea is not to rely on a single decision making scheme. Instead, all the designs, or a subset of them, are used for decision making by combining their individual opinions to derive a consensus decision. Various classifier combination schemes have been devised and it has been experimentally demonstrated that some of them consistently outperform a single best classifier.
An important issue in combining classifiers is that combination is particularly useful if the classifiers are different, see Ref. [5]. This can be achieved by using different feature sets [20,14] as well as different training sets, randomly selected [13,19] or based on a cluster analysis [7]. A possible application of a multistage classifier is that it may stabilise the training of classifiers based on a small sample size, e.g. by the use of bootstrapping [18]. The combination of ensembles of neural networks (based on different initialisations) has been studied in the neural network literature, see, e.g., Refs. [12,8,9,11,16,17]. If only labels are available, a majority vote [15,10] is used. Sometimes use can be made of a label ranking [6,14]. If continuous outputs like a posteriori probabilities are supplied, an average or some other linear combination has been suggested [12,20]. Whether this can be theoretically justified depends on the nature of the input classifiers and the feature space. An interesting study of these possibilities is given in [11]. Finally, it is possible to train the output classifier separately, using the outputs of the input classifiers as new features [16,19].
From the point of view of their analysis, there are basically two classifier combination scenarios. In the first scenario, all the classifiers use the same representation of the input pattern. A
typical example of this category is a set of k-nearest neighbour classifiers, each using the same
measurement vector, but different classifier parameters (number of nearest neighbours k). In the
second scenario each classifier uses its own representation of the input pattern. In other words, the measurements extracted from the pattern are unique to each classifier. An important application of combining classifiers in this scenario is the possibility of integrating physically different types of measurements/features. In this case it is not possible to consider the computed a posteriori probabilities to be estimates of the same functional value, as the classification systems operate in different measurement spaces.

3.1. Distinct representations


Let us assume that we have $R$ classifiers, each representing the given pattern by a distinct measurement vector. Denote the measurement vector used by the $i$th classifier by $x_i$. In the corresponding measurement space each class $\omega_k$ is modelled by the probability density function $p(x_i \mid \omega_k)$. Now, according to Bayesian decision theory, given measurements $x_i$, $i = 1, \ldots, R$, the pattern should be assigned to class $\omega_j$, i.e. its label $\theta$ should assume the value $\theta = \omega_j$, provided the a posteriori probability of that interpretation is maximum, i.e.
$$\text{assign} \ \theta \rightarrow \omega_j \quad \text{if} \quad P(\theta = \omega_j \mid x_1, \ldots, x_R) = \max_k P(\theta = \omega_k \mid x_1, \ldots, x_R). \tag{5}$$

We have recently demonstrated [21] that the above posterior class probability functions can be approximated either by a product or by a sum of class probabilities based on the individual measurements $x_i$, as
$$P(\theta = \omega_k \mid x_1, \ldots, x_R) = \frac{P^{-(R-1)}(\omega_k) \prod_{i=1}^{R} P(\theta = \omega_k \mid x_i)}{\sum_{j=1}^{m} P^{-(R-1)}(\omega_j) \prod_{i=1}^{R} P(\theta = \omega_j \mid x_i)} \tag{6}$$

and

$$P(\theta = \omega_k \mid x_1, \ldots, x_R) = \frac{(1 - R)P(\omega_k) + \sum_{i=1}^{R} P(\theta = \omega_k \mid x_i)}{\sum_{j=1}^{m} \left[ (1 - R)P(\omega_j) + \sum_{i=1}^{R} P(\theta = \omega_j \mid x_i) \right]}, \tag{7}$$

respectively, depending on the assumptions made. These approximations lead to a large number of classifier combination rules, which are represented in Fig. 1. It has also been shown that the rules derived from the sum approximation are much less sensitive to estimation errors than those based on the product approximation [21].
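
The following sketch (a simplified illustration, not the implementation of Ref. [21]; the function name and example posteriors are hypothetical) combines the outputs of $R$ classifiers operating on distinct representations, using either the product approximation (6) or the sum approximation (7):

```python
import numpy as np

def combine_posteriors(posteriors, priors, rule="sum"):
    """Combine per-classifier posteriors P(w_k | x_i) by the product rule (6) or sum rule (7).

    posteriors: array of shape (R, m), one row per classifier, one column per class.
    priors:     array of shape (m,) with the class prior probabilities P(w_k).
    """
    R, m = posteriors.shape
    if rule == "product":
        scores = priors ** (-(R - 1)) * np.prod(posteriors, axis=0)  # numerator of (6)
    elif rule == "sum":
        scores = (1 - R) * priors + np.sum(posteriors, axis=0)       # numerator of (7)
    else:
        raise ValueError("rule must be 'product' or 'sum'")
    return scores / scores.sum()                                     # normalise over classes

# hypothetical example: three classifiers, two classes
P = np.array([[0.60, 0.40],
              [0.55, 0.45],
              [0.50, 0.50]])
priors = np.array([0.5, 0.5])
print(combine_posteriors(P, priors, rule="product"))
print(combine_posteriors(P, priors, rule="sum"))
```
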
3.2. Identical representations
In many situations we wish to combine the results of multiple classifiers which use an identical representation of the input pattern. Then each of our $R$ classifiers estimates the same a posteriori probabilities as
$$P_j(\omega_i \mid x) = P(\omega_i \mid x) + \epsilon_j(\omega_i \mid x), \tag{8}$$

where $\epsilon_j(\omega_i \mid x)$ denotes the estimation error incurred by the $j$th classifier. For simplicity we shall assume that the estimation errors are unbiased, i.e. they are distributed with zero mean and variance $\sigma_\epsilon^2$. We can thus combine the multiple classifier outputs to obtain an estimate of each class a posteriori probability
$$\hat{P}(\omega_i \mid x) = \frac{1}{R} \sum_{j=1}^{R} P_j(\omega_i \mid x).$$

Fig. 1.

It follows that the estimate $\hat{P}(\omega_i \mid x)$ will be unbiased, i.e. $E\{\hat{P}(\omega_i \mid x)\} = P(\omega_i \mid x)$, and, provided the errors are independent, its variance will be reduced to $\sigma^2 = \sigma_\epsilon^2 / R$, which will have a beneficial effect on the error probability of the combined output.
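
A small simulation (with hypothetical values for the true posterior and the error standard deviation) illustrates the variance reduction obtained by averaging the $R$ unbiased estimates as above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true posterior P(w_i | x) = 0.7; each of R classifiers returns
# an unbiased noisy estimate with standard deviation sigma_e, as in (8).
true_posterior, sigma_e, R, trials = 0.7, 0.1, 10, 100_000

estimates = true_posterior + sigma_e * rng.standard_normal((trials, R))
averaged = estimates.mean(axis=1)   # combined estimate: average of the R outputs

print(np.var(estimates[:, 0]))      # approximately sigma_e**2      (single classifier)
print(np.var(averaged))             # approximately sigma_e**2 / R  (combined output)
```
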

4. CONCLUSIONS
Statistical pattern classification methodology has been overviewed. The recent developments
in classifier combination have been reviewed.
References

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition (Academic Press, New York, 1972).
[2] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach (Prentice-Hall, Englewood Cliffs, NJ, 1982).
[3] E. Parzen, On estimation of a probability density function and mode, Annals of Mathematical Statistics 33 (1962) 1065-1076.
[4] D.J. Hand, Kernel Discriminant Analysis, Research Studies Press (Wiley, Chichester, 1982).
[5] K.M. Ali, M.J. Pazzani, On the link between error correlation and error reduction in decision tree ensembles, Technical Report 95-38, ICS-UCI, 1995.
[6] S.C. Bagui, N.R. Pal, A multistage generalisation of the rank nearest neighbour classification rule, Pattern Recognition Letters 16 No. 6 (1995) 601-614.
[7] J. Cao, M. Ahmadi, M. Shridhar, Recognition of handwritten numerals with multiple feature and multistage classifier, Pattern Recognition 28 No. 2 (1995) 153-160.
[8] S.B. Cho, J.H. Kim, Combining multiple neural networks by fuzzy integral for robust classification, IEEE Transactions on Systems, Man, and Cybernetics 25 No. 2 (1995) 380-384.
[9] S.B. Cho, J.H. Kim, Multiple network fusion using fuzzy logic, IEEE Transactions on Neural Networks 6 No. 2 (1995) 497-501.
[10] J. Franke, E. Mandler, A comparison of two approaches for combining the votes of cooperating classifiers, Proceedings 11th IAPR International Conference on Pattern Recognition, Volume II, Conference B: Pattern Recognition Methodology and Systems (1992) 611-614.
[11] L.K. Hansen, P. Salamon, Neural network ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 No. 10 (1990) 993-1001.
[12] S. Hashem, B. Schmeiser, Improving model accuracy using optimal linear combinations of trained neural networks, IEEE Transactions on Neural Networks 6 No. 3 (1995) 792-794.
[13] T.K. Ho, Random Decision Forests, Proceedings 3rd International Conference on Document Analysis and Recognition, Montreal, August 14-16, 1995, 278-282.
[14] T.K. Ho, J.J. Hull, S.N. Srihari, Decision combination in multiple classifier systems, IEEE Transactions on Pattern Analysis and Machine Intelligence 16 No. 1 (1994) 66-75.
[15] F. Kimura, M. Shridhar, Handwritten numerical recognition based on multiple algorithms, Pattern Recognition 24 No. 10 (1991) 969-983.
[16] A. Krogh, J. Vedelsby, Neural network ensembles, cross validation, and active learning, in: Advances in Neural Information Processing Systems 7, G. Tesauro, D.S. Touretzky, T.K. Leen, Eds. (MIT Press, Cambridge, MA, 1995).
[17] G. Rogova, Combining the results of several neural network classifiers, Neural Networks 7 No. 5 (1994) 777-781.
[18] M. Skurichina, R.P.W. Duin, Stabilizing classifiers for very small sample sizes, Proceedings 13th International Conference on Pattern Recognition, Vienna, 1996.
[19] D.H. Wolpert, Stacked generalisation, Neural Networks 5 No. 2 (1992) 241-260.
[20] L. Xu, A. Krzyzak, C.Y. Suen, Methods of combining multiple classifiers and their applications to handwriting recognition, IEEE Transactions on Systems, Man, and Cybernetics 22 No. 3 (1992) 418-435.
[21] J. Kittler, M. Hatef, R.P.W. Duin, Combining classifiers, Proceedings 13th International Conference on Pattern Recognition, Vienna, 1996, Volume II, Track B, 897-901.
