
4.

Feature Reduction

Prof. Sebastiano B. Serpico

Università di Genova

Dipartimento di Ingegneria
Biofisica ed Elettronica

Complexity of a Classifier
As the number n of features increases, the design of a classifier faces several issues related to the dimensionality of the problem (curse of dimensionality):

Computational complexity;
Hughes phenomenon.

Computational complexity

As n increases, the computational complexity of a classifier increases. For some classification techniques this increase is linear in n, for others it is of a higher order (e.g., quadratic).
The increase in complexity implies longer computation times and a larger memory occupation.

Hughes Phenomenon
Intuitive reasoning

As n increases, the amount of information available to the classifier should increase, and consequently the classification accuracy should also increase, but...

Experimental observation

On the contrary, for a fixed number N of training samples, the probability of a correct decision of a classifier increases for 1 ≤ n ≤ n* up to a maximum and then decreases for n ≥ n* (Hughes phenomenon).

Interpretation

As n increases, the number of parameters Kn of the classifier becomes higher and higher. As the ratio Kn/N grows, the available training samples become too few to obtain a satisfactory estimate of such parameters.

Feature Reduction
A solution to these dimensionality issues is to reduce the number n of the features used in the classification process (feature reduction or parameter reduction).

Disadvantage: reducing the dimension of the feature space involves a loss of information.

Two main strategies exist to achieve feature reduction:

feature selection: within the set of the n available features, a subset of m features is identified by adopting an optimization criterion, chosen to minimize the loss of information or maximize the classification accuracy;
feature extraction: a transformation (often linear) of the original (n-dimensional) feature space into a space of smaller dimension m is applied in such a way as to minimize the information loss or maximize the classification accuracy.

Feature Selection
Problem setting:

Given a set X = {x1, x2, ..., xn} of n features, identify the subset S ⊂ X, composed of m features (m < n), that maximizes a functional J(·):

S* = arg max_{S ⊂ X} J(S)

An algorithm for feature selection is then defined on the basis of two distinct components:

the functional J(·): it has to be defined so that J(S) measures the goodness of the feature subset S in the classification process;
the algorithm for the search of the subset S*: the number of subsets of X is 2^n, so an exhaustive search is computationally not feasible, except for small values of n. Therefore, sub-optimal maximization strategies are adopted to find "good" solutions, even if they do not correspond to global optima (see the sketch below).
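To make the combinatorial issue concrete, the following minimal Python sketch performs an exhaustive search restricted to the subsets of a given cardinality m; the feature scores and the functional J below are purely illustrative placeholders, not part of the method itself.

```python
from itertools import combinations

def exhaustive_selection(features, J, m):
    """Brute-force search over all C(n, m) subsets of size m; feasible only for small n."""
    return max(combinations(features, m), key=lambda S: J(list(S)))

# Hypothetical functional: here simply the sum of individual feature scores
scores = {"x1": 0.9, "x2": 0.4, "x3": 0.7, "x4": 0.2}
print(exhaustive_selection(scores, lambda S: sum(scores[x] for x in S), m=2))
```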

Bhattacharyya Bounds
A choice of the functional J(·) that is significant from the classification point of view can be based on the criterion of the minimum error probability Pe.

In the presence of only two classes ω1 and ω2, the Bhattacharyya distance B and the Bhattacharyya coefficient ρ provide an upper bound on Pe:

Pe ≤ u = √(P1 P2) exp(−B),   where B = −ln ρ and ρ = ∫ √( p(x|ω1) p(x|ω2) ) dx

Moreover, it is possible to show that:

(1/2) [ 1 − √(1 − 4u²) ] ≤ Pe ≤ u

Bhattacharyya Distance and Coefficient


An approach to feature selection consists in the maximization of the Bhattacharyya distance B or (equivalently) in the minimization of the Bhattacharyya coefficient ρ.

In particular, a distance B(S) (or a coefficient ρ(S)) can be associated with each subset S of m features; in fact, indicating a vector of the feature subset S with xS, one can define:

B(S) = −ln ρ(S)   and   ρ(S) = ∫ √( p(xS|ω1) p(xS|ω2) ) dxS

Properties:

0 ≤ ρ(S) ≤ 1 and therefore B(S) ≥ 0;
if p(xS|ω1) and p(xS|ω2) are nonzero only in separated regions, then ρ(S) = 0 and B(S) = +∞;
if p(xS|ω1) = p(xS|ω2) for every xS, then ρ(S) is the integral of a pdf over the entire space ℝ^m, therefore ρ(S) = 1 and B(S) = 0.

Computation of the Bhattacharyya Coefficient and Distance


ρ(S) is a multiple integral in an m-dimensional space, so its analytical computation from the conditional pdfs is complex. Two particular cases exist in which the computation is simple:

if the features in the subset S are independent when conditioned on each class, we have:

p(xS|ωi) = ∏_{xr ∈ S} p(xr|ωi),   i = 1, 2

then the following additive property is valid:

ρ(S) = ∏_{xr ∈ S} ρ({xr})   and   B(S) = Σ_{xr ∈ S} B({xr})

if p(xS|ωi) = N(mi^S, Σi^S) (i = 1, 2), we obtain:

B(S) = (1/8) (m2^S − m1^S)^t [ (Σ1^S + Σ2^S) / 2 ]^(−1) (m2^S − m1^S) + (1/2) ln( | (Σ1^S + Σ2^S) / 2 | / √( |Σ1^S| |Σ2^S| ) )
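As an illustration of the Gaussian closed form and of the error bounds, the following minimal numpy sketch can be used; the class means, covariances, and priors are invented numbers, only meant to show the computation.

```python
import numpy as np

def bhattacharyya_gaussian(m1, S1, m2, S2):
    """Bhattacharyya distance between two Gaussian class models N(m1, S1) and N(m2, S2)."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    Sm = 0.5 * (S1 + S2)                                   # (Sigma1 + Sigma2) / 2
    d = m2 - m1
    term_mean = 0.125 * d @ np.linalg.solve(Sm, d)         # (1/8) d^t Sm^{-1} d
    term_cov = 0.5 * np.log(np.linalg.det(Sm) /
                            np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term_mean + term_cov

# Hypothetical two-class, two-feature example
m1, S1 = [0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]]
m2, S2 = [2.0, 1.0], [[1.5, 0.0], [0.0, 0.8]]
B = bhattacharyya_gaussian(m1, S1, m2, S2)
rho = np.exp(-B)                                           # Bhattacharyya coefficient
P1 = P2 = 0.5                                              # assumed equal priors
u = np.sqrt(P1 * P2) * rho                                 # upper bound on Pe
lower = 0.5 * (1.0 - np.sqrt(1.0 - 4.0 * u ** 2))          # lower bound on Pe
print(f"B = {B:.3f}, rho = {rho:.3f}, bounds on Pe: [{lower:.3f}, {u:.3f}]")
```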

Other Inter-Class Distances


In addition to the Bhattacharyya distance, other measures of inter-class distance have been introduced in the literature.

For example, the Divergence measures the separation between two classes as a function of the likelihood ratio between the respective conditional pdfs.
The Bhattacharyya distance and the Divergence are not upper bounded. This makes them less appropriate as measures of inter-class separation. In fact, focusing on the Gaussian case for simplicity, when two classes are already well separated, an increment of the distance ||m1^S − m2^S|| between the conditional means generates a large increment of B(S), but an irrelevant reduction of Pe.
Therefore, other measures of inter-class distance have been proposed (not treated in depth here) that, being upper bounded, do not present such a problem. Among them, we recall the Jeffries-Matusita Distance and the Modified Divergence [Richards 1999, Swain 1978].


Multiclass Extension
Extension to the case of M classes ω1, ω2, ..., ωM.

If ρij(S) and Bij(S) are the Bhattacharyya coefficient and distance between two classes ωi and ωj, computed over a feature subset S, and if Pi = P(ωi) is the a priori probability of the class ωi, the following average Bhattacharyya coefficient and average Bhattacharyya distance are defined:

ρ_ave(S) = Σ_{i=1}^{M−1} Σ_{j=i+1}^{M} Pi Pj ρij(S),      B_ave(S) = Σ_{i=1}^{M−1} Σ_{j=i+1}^{M} Pi Pj Bij(S)

Remarks

In the case M = 2, the maximization of B(S) was equivalent to the minimization of ρ(S) (because B(S) = −ln ρ(S)). In the multiclass case, the maximization of B_ave(S) is no longer equivalent to the minimization of ρ_ave(S), because the relation between B_ave(S) and ρ_ave(S) is no longer monotonic.
Under the hypothesis of class-conditional feature independence, we have:

B_ave(S) = Σ_{xr ∈ S} B_ave({xr})

Attention! This additive property is not valid for ρ_ave(S).
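A small Python sketch of the average distance, assuming the pairwise distances Bij(S) have already been computed (for instance with the Gaussian closed form above) and stored in a dictionary; the numbers are illustrative only.

```python
from itertools import combinations

def average_bhattacharyya(B_pair, priors):
    """Average Bhattacharyya distance from pairwise distances B_pair[(i, j)] (i < j)
    and a list of class prior probabilities (classes indexed from 0)."""
    M = len(priors)
    return sum(priors[i] * priors[j] * B_pair[(i, j)]
               for i, j in combinations(range(M), 2))

# Hypothetical pairwise distances for M = 3 classes with equal priors
B_pair = {(0, 1): 1.2, (0, 2): 0.4, (1, 2): 0.9}
print(average_bhattacharyya(B_pair, [1/3, 1/3, 1/3]))
```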

Maximization of the Functional


In a feature selection problem, the inter-class separation measures introduced above play the role of the functional J(·) to be maximized.

Preliminary observations:

The Bhattacharyya distance has to be maximized, while the Bhattacharyya coefficient has to be minimized. Therefore, in the following, the functional may correspond to B_ave(S) (to be maximized) or to ρ_ave(S) (to be minimized).

An exhaustive search over all possible subsets of X is, in general, computationally not affordable.
It becomes feasible if the features are independent when conditioned on each class and if the adopted functional is B_ave. In such a case, in fact, once the values of the functional associated with the single features have been computed, by the additive property the optimum subset S* of m features is simply composed of the m features that individually present the m highest values of B_ave({xr}), as sketched below.
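A minimal sketch of this shortcut, assuming the individual values B_ave({xr}) are already available in an array (the numbers below are invented):

```python
import numpy as np

def select_top_m(B_single, m):
    """Indices of the m features with the highest individual B_ave({x_r}) values
    (optimal under class-conditional feature independence)."""
    return np.argsort(np.asarray(B_single, float))[::-1][:m]

# Hypothetical individual average distances for n = 6 features
B_single = [0.8, 2.1, 0.3, 1.7, 0.9, 1.2]
print(select_top_m(B_single, m=3))   # indices of the 3 best features
```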


Sequential Forward Selection


In general, the search for a subset of m features is conducted by means of a sub-optimal algorithm. Among such algorithms we consider (for its simplicity) sequential forward selection (SFS), which is based on the following steps:

initialize S* = ∅;
compute the value of the functional for all the subsets S* ∪ {xi}, with xi ∉ S*, and choose the feature x* ∉ S* that corresponds to the maximum value of J(S* ∪ {xi});
update S* by setting S* = S* ∪ {x*};
continue by iteratively adding one feature at a time until S* reaches the desired cardinality m or until the value of the functional stabilizes (reaches saturation). A sketch is given below.
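A minimal Python sketch of SFS; the functional J used here is a hypothetical toy score (individual scores minus a redundancy penalty), introduced only to make the greedy behaviour visible.

```python
from itertools import combinations

def sequential_forward_selection(features, J, m):
    """Greedy SFS: starting from the empty set, add at each step the feature that
    maximizes the functional J evaluated on the enlarged subset."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < m:
        scores = {x: J(selected + [x]) for x in remaining}   # evaluate J on S* U {x}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical functional: sum of individual scores minus a pairwise redundancy penalty
single = {"x1": 2.0, "x2": 1.6, "x3": 1.5, "x4": 0.4}
redundant = {frozenset(("x2", "x3")): 1.4}   # x2 and x3 carry similar information

def J(subset):
    penalty = sum(redundant.get(frozenset(p), 0.0) for p in combinations(subset, 2))
    return sum(single[x] for x in subset) - penalty

print(sequential_forward_selection(single, J, m=3))   # e.g. ['x1', 'x2', 'x4']
```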


Remarks on SFS
SFS identifies the optimum subset that can be obtained by iteratively adding a single feature at a time.

At the first step, the single feature that corresponds to the maximum value of the functional is chosen. At the second step, the feature that, coupled with the previous one, provides the maximum value of the functional is added. And so on...
The method is sub-optimal. For example, the optimal couple of features does not always include the single optimal feature.

Advantage

SFS is not computationally heavy, even if X contains hundreds of features.

Disadvantage

A feature that has been included in the selected subset S* at a given iteration cannot be removed during the following iterations, i.e., SFS does not allow backtracking.


Sequential Backward Selection


Sequential backward selection (SBS) proceeds in a dual way with respect to SFS: it initializes S* = X and eliminates a single feature at a time from S*, so as to maximize the functional J(S) step by step.

Disadvantages

Like SFS, SBS does not allow backtracking: a feature eliminated from S* at a given iteration will never be recovered in the following steps;
SBS is usually computationally disadvantageous with respect to SFS: while SFS starts from an empty subset and adds one feature at a time, SBS starts from the original feature space. Therefore, SBS computes values of the functional in spaces with much higher dimensions than SFS. However, it is advantageous if m is close to n.

In the literature, other more complex methods have been proposed (not covered here) to search for sub-optimal subsets, which also allow backtracking [Serpico et al., 2001].
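Under the same assumptions as the SFS sketch above (a generic callable functional J), SBS can be sketched as its dual:

```python
def sequential_backward_selection(features, J, m):
    """Greedy SBS: starting from the full set, remove at each step the feature whose
    elimination keeps the functional J of the remaining subset as high as possible."""
    selected = list(features)
    while len(selected) > m:
        scores = {x: J([f for f in selected if f != x]) for x in selected}
        worst = max(scores, key=scores.get)   # removing 'worst' hurts J the least
        selected.remove(worst)
    return selected
```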


Operational Aspects of Feature Selection

The computation of the inter-class distance measures used in feature selection requires knowledge of the class-conditional pdfs and of the class prior probabilities.

Usually, such pdfs are not known a priori and must be estimated from a training set by means of parametric or non-parametric methods.
Overall, a classification system that includes a feature selection step can be summarized by the following flowchart.

[Flowchart] Data set → training set (training samples for each class ωi) → estimation of the class-conditional pdfs and of the class prior probabilities {p(x|ωi), Pi} → feature selection (output: S*) → training of the classifier → application of the classifier to the data set → classification of the data set.


Example
Hyperspectral data set with 202 features and 9 classes.
[Figure: map of the ground truth highlighting the training pixels; RGB composition of three of the 202 bands acquired by the sensor.]

Estimated probability of correct classification for a MAP classifier under the hypothesis of Gaussian classes.

[Plot: average Bhattacharyya distance B_ave as a function of the number m of selected features.]
[Plot: overall accuracy (OA) as a function of m; the maximum, Pc,max = 88.6%, is obtained for m = 40.]

Feature Extraction
Problem definition:

Given a set X = {x1, x2, ..., xn} of n features, we want to identify a linear transformation that provides a transformed set of m features Y = {y1, y2, ..., ym} (with m < n), minimizing the loss of information (measured by an appropriate functional) involved in the dimensionality reduction;
The linear transformation is defined by an m × n transformation matrix T such that y = Tx.

Remarks

The transformations are typically orthonormal, i.e., the matrix T is the row juxtaposition of m orthonormal vectors e1, e2, ..., em:

ei^t ej = δij,   i, j = 1, 2, ..., m   ⟹   T T^t = Im

Transforming linearly by such a matrix T means projecting onto the subspace generated by the orthonormal basis {e1, e2, ..., em}.


Extraction Based on Inter-Class Distances


Considering again the Bhattacharyya distance, in the case of two Gaussian classes we look for the orthonormal feature transformation that maximizes the distance in the transformed space.

Let mi^Y = E{y|ωi} = T mi and Σi^Y = Cov{y|ωi} = T Σi T^t (for i = 1, 2). B in the transformed space Y is given by:

B(Y) = (1/8) tr{ [ (Σ1^Y + Σ2^Y) / 2 ]^(−1) (m2^Y − m1^Y)(m2^Y − m1^Y)^t } + (1/2) ln( | (Σ1^Y + Σ2^Y) / 2 | / √( |Σ1^Y| |Σ2^Y| ) )

In this expression two distinct contributions appear: the first term, denoted Bm(Y), is linked to the conditional means, and the second term, denoted BΣ(Y), is linked to the conditional covariance matrices.


Extraction Based on Inter-Class Distances


In principle, we would search for the orthonormal matrix T that maximizes B(Y). However:

The general problem of the maximization of B(Y) with respect to T has no closed-form solution.
The problems of separately maximizing Bm(Y) or BΣ(Y) have closed-form solutions (eigenproblems). Details can be found in [Fukunaga, 1990].
Therefore, if one of the two contributions is largely dominant over the other (i.e., Bm(Y) >> BΣ(Y) or Bm(Y) << BΣ(Y)), these closed-form solutions can be applied.
Otherwise, the maximization of B(Y) has to be addressed numerically (e.g., through the projection pursuit approach [Jimenez et al., 1999]).
If the two Gaussian classes are equiprobable, the maximization of Bm(Y) leads to the same solution as the Fisher transform (discussed later).


Linear discriminant analysis


A popular method for feature extraction is linear discriminant analysis (LDA, also known as discriminant analysis feature extraction, DAFE), which maximizes a measure of separation and compactness of the classes defined directly on the training set.

Although explicit parametric assumptions are not stated, DAFE is usually considered parametric, because it works poorly, for example, with multimodal classes, and it characterizes the classes only through first- and second-order moments.
However, nonparametric extensions of this method have recently been introduced.

The method can be applied to both binary and multiclass problems.

Focusing first on the case of two classes, ω1 and ω2, linear discriminant analysis provides an optimum scalar projection, named the Fisher transform.


DAFE: Fisher transform


In general, even if the classes are well separated in the original n-dimensional space, they may not be so in a transformed one-dimensional space, because the projection can overlay samples drawn from different classes.
The problem is to find the orientation of the projection line that provides the best separation between the two classes.

Given a set {x1, x2, ..., xN} of N pre-classified samples, let Di be the subset of the samples assigned to ωi (i = 1, 2) and let Ni be the cardinality of Di (obviously N = N1 + N2).
A transformation y = w^t x projects the sample xk to yk = w^t xk. Let Ei = {y = w^t x : x ∈ Di}.
We search for the transformation y = w^t x that maximizes the inter-class separation and minimizes the intra-class dispersion, suitably quantified.


Inter-class separation and intra-class dispersion


First, a functional that measures the inter-class separation and the dispersion inside each class is needed.

As a measure of inter-class separation, the difference between the centroids of the samples in the transformed space is used:

μi = (1/Ni) Σ_{x ∈ Di} x,      μ̃i = (1/Ni) Σ_{y ∈ Ei} y = w^t μi,   i = 1, 2

As a measure of class dispersion around the centroids, the scatter values in the transformed space are adopted:

Si = Σ_{x ∈ Di} (x − μi)(x − μi)^t      (Si is called the scatter matrix of the class ωi, i = 1, 2)

si² = Σ_{y ∈ Ei} (y − μ̃i)² = w^t Si w,   i = 1, 2


The Fisher Functional


The goal of the Fisher transform is to maximize the distance between the centroids of the classes and to minimize the scatters in the one-dimensional transformed space.

For this purpose, the following Fisher functional is introduced:

J(w) = (μ̃1 − μ̃2)² / (s1² + s2²)

Let us explicitly write the functional as a function of w:

s1² + s2² = w^t (S1 + S2) w = w^t Sw w

(μ̃1 − μ̃2)² = [w^t (μ1 − μ2)]² = w^t (μ1 − μ2)(μ1 − μ2)^t w = w^t Sb w

J(w) = (w^t Sb w) / (w^t Sw w),

where Sb = (μ1 − μ2)(μ1 − μ2)^t is named the between-class scatter matrix and Sw = S1 + S2 is named the within-class scatter matrix.


Optimality condition for the Fisher functional


Optimality condition

Through the usual zero-gradient condition, one may prove that the vector w* that maximizes the Fisher functional is an eigenvector of the product matrix Sw⁻¹Sb:

(Sw⁻¹Sb − λI) w* = 0,   i.e.,   (Sb − λSw) w* = 0,

where λ is the corresponding eigenvalue.

Closed-form solution

Therefore, w* satisfies the condition:

Sw⁻¹Sb w* = λ w*   ⟹   Sw⁻¹ (μ1 − μ2)(μ1 − μ2)^t w* = λ w*

(μ1 − μ2)^t w* and λ are scalars, so w* is parallel to Sw⁻¹(μ1 − μ2). Since scale factors are irrelevant in linear projections, we obtain the following closed-form solution (with no need to explicitly compute eigenvectors):

w* = Sw⁻¹ (μ1 − μ2)

Typically, the vector w* is also normalized.
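As a check of the closed-form solution, the following minimal numpy sketch computes w* on two synthetic, purely illustrative clouds of samples.

```python
import numpy as np

def fisher_transform(X1, X2):
    """Fisher direction w* = Sw^{-1}(mu1 - mu2) from two arrays of training samples
    (one row per sample). Returns the normalized projection vector."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)          # class scatter matrices
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2                            # within-class scatter matrix
    w = np.linalg.solve(Sw, mu1 - mu2)      # closed-form solution
    return w / np.linalg.norm(w)

# Hypothetical 2-D data: two roughly Gaussian clouds
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(100, 2))
X2 = rng.normal([2.0, 1.0], [1.0, 0.3], size=(100, 2))
w = fisher_transform(X1, X2)
y1, y2 = X1 @ w, X2 @ w                     # projected (scalar) samples
print("w* =", w, "| centroid separation:", abs(y1.mean() - y2.mean()))
```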


DAFE: multiclass Fisher transform


We extend the discriminant analysis from the binary case to the case of M classes ω1, ω2, ..., ωM and of an m × n transformation matrix.
Let us consider a set {x1, x2, ..., xN} of N pre-classified samples, denote by Di the subset of the samples assigned to ωi (i = 1, 2, ..., M) and by Ni the cardinality of Di (N = N1 + N2 + ... + NM).
The transformation y = Tx maps xk to yk = Txk. Given Ei = {Tx : x ∈ Di}, let us define:

Centroids of ωi in the original and transformed spaces:

μi = (1/Ni) Σ_{x ∈ Di} x,      μ̃i = (1/Ni) Σ_{y ∈ Ei} y = T μi,   i = 1, 2, ..., M

Scatter matrices of ωi in the original and transformed spaces:

Si = Σ_{x ∈ Di} (x − μi)(x − μi)^t,      S̃i = Σ_{y ∈ Ei} (y − μ̃i)(y − μ̃i)^t = T Si T^t,   i = 1, 2, ..., M


DAFE: multiclass Fisher functional (1)


Let us extend the Fisher functional to the multiclass case.

In the multiclass case, we quantify the inter-class separation through the differences between the centroids of the classes and the centroid of the entire training set in the transformed space:

μ̃ = (1/N) Σ_{k=1}^{N} yk = T μ,   where   μ = (1/N) Σ_{k=1}^{N} xk = (1/N) Σ_{i=1}^{M} Ni μi

We measure the dispersions inside the single classes by means of the scatter matrices in the transformed space.
Then, the Fisher functional is generalized as follows (|·| denotes the determinant):

J(T) = | Σ_{i=1}^{M} Ni (μ̃i − μ̃)(μ̃i − μ̃)^t | / | Σ_{i=1}^{M} S̃i |


DAFE: multiclass Fisher functional (2)


Let us explicitly write the Fisher functional as a function of the unknown transformation matrix T.

Let us express the numerator and the denominator as functions of T and consequently introduce a within-class scatter matrix Sw and a between-class scatter matrix Sb:

Σ_{i=1}^{M} S̃i = Σ_{i=1}^{M} T Si T^t = T Sw T^t,   where   Sw = Σ_{i=1}^{M} Si

Σ_{i=1}^{M} Ni (μ̃i − μ̃)(μ̃i − μ̃)^t = T [ Σ_{i=1}^{M} Ni (μi − μ)(μi − μ)^t ] T^t = T Sb T^t,   where   Sb = Σ_{i=1}^{M} Ni (μi − μ)(μi − μ)^t

J(T) = | T Sb T^t | / | T Sw T^t |


Optimality condition for the multiclass case


Optimality condition

Again through a zero-gradient condition, one may prove that the row vectors e1, e2, ..., em of the matrix T* that maximizes the Fisher functional are eigenvectors of Sw⁻¹Sb:

(Sw⁻¹Sb − λi I) ei = 0,   i.e.,   (Sb − λi Sw) ei = 0,   i = 1, 2, ..., m,

where λi is the eigenvalue corresponding to ei and is nonzero.

Remarks

The M matrices (μi − μ)(μi − μ)^t, i = 1, 2, ..., M, have unit rank.
Because of the linear relationship between the overall centroid μ and the class centroids μi, i = 1, 2, ..., M, they are also linearly dependent.
Thus, rank(Sb) ≤ M − 1 and, then, rank(Sw⁻¹Sb) ≤ rank(Sb) ≤ M − 1.
Therefore, at most (M − 1) eigenvalues of Sw⁻¹Sb are nonzero, i.e., the eigenvector equation provides at most (M − 1) solution vectors.


DAFE: comments
DAFE allows up to (M − 1) transformed features to be linearly extracted (remember that M is the number of classes).

Operational issues

The eigenvalues of Sw⁻¹Sb can be computed as the roots of the characteristic polynomial, i.e.:

| Sw⁻¹Sb − λI | = 0   or, equivalently,   | Sb − λSw | = 0

The second formulation is more convenient because it does not require any matrix inversion.
The characteristic equation provides at most (M − 1) nonzero roots λ1, λ2, ..., λ_{M−1} and at least (n − M + 1) zero solutions.
An eigenvector ei is computed from each resulting nonzero eigenvalue λi.
The optimal transformation matrix T* is obtained through the row juxtaposition of the resulting eigenvectors, as sketched below.
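A possible implementation sketch of multiclass DAFE on synthetic data; it relies on scipy's generalized symmetric eigensolver, so the eigenproblem Sb e = λ Sw e is solved without explicitly inverting Sw. The data set and class layout are invented for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def dafe(X, labels, m=None):
    """Multiclass DAFE/LDA: the rows of the returned T* are the eigenvectors of
    Sb e = lambda Sw e associated with the largest eigenvalues (at most M - 1)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    classes = np.unique(labels)
    mu = X.mean(axis=0)
    n = X.shape[1]
    Sw, Sb = np.zeros((n, n)), np.zeros((n, n))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)               # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class scatter
    eigvals, eigvecs = eigh(Sb, Sw)                     # generalized eigenproblem
    order = np.argsort(eigvals)[::-1]
    m = m if m is not None else len(classes) - 1        # at most M - 1 useful axes
    return eigvecs[:, order[:m]].T                      # T*: m x n

# Hypothetical 3-class, 4-feature training set
rng = np.random.default_rng(1)
means = ([0, 0, 0, 0], [1, 0, 1, 0], [0, 2, 0, 2])
X = np.vstack([rng.normal(mu, 0.5, size=(50, 4)) for mu in means])
labels = np.repeat([0, 1, 2], 50)
T = dafe(X, labels)          # shape (2, 4), since M - 1 = 2
Y = X @ T.T                  # transformed samples y = T x
print(T.shape, Y.shape)
```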

Principal component analysis


Principal component analysis (PCA, or Karhunen-Loève transform, KL) is an unsupervised algorithm for feature extraction. In particular, PCA reduces the dimension of the feature space on the basis of a mean square error criterion.

Problem setting

Let a data set {x1, x2, ..., xN} composed of N samples be given.
A coordinate system in the n-dimensional feature space is determined by an orthonormal basis {e1, e2, ..., en} and by an origin c.
In such a coordinate system each sample is expressed as:

xk = c + Σ_{i=1}^{n} yik ei,   k = 1, 2, ..., N

To reduce the dimension of the feature space, one could keep only m components, i.e., approximate each sample as:

x̂k = c + Σ_{i=1}^{m} yik ei,   k = 1, 2, ..., N

However, it is not obvious that the optimal m components yik are the projections of (xk − c) along the ei.


Geometric interpretation
Two-dimensional example

Approximation of the samples in a two-dimensional feature


space (plane) as the sum of a constant vector c and of the
component along one unit vector e1.
[Figure: samples in the (x1, x2) plane, with the origin O, the constant vector c, and the unit vector e1 along which the samples are approximated.]


PCA: mean square error


If the components of xk along (n − m) axes are discarded, an error is obviously introduced. PCA selects the coordinate system that minimizes the mean square error.

The adopted functional is:

ε = (1/N) Σ_{k=1}^{N} ||xk − x̂k||² = (1/N) Σ_{k=1}^{N} ||xk − c − Σ_{i=1}^{m} yik ei||²

This functional has to be minimized with respect to all the related variables, i.e., the origin c, the vectors ei, and the components yik, under the following orthonormality constraint:

ei^t ej = δij,   i, j = 1, 2, ..., m

Plugging this constraint into the expression of the functional yields:

ε = (1/N) Σ_{k=1}^{N} [ ||xk − c||² − 2 Σ_{i=1}^{m} yik ei^t (xk − c) + Σ_{i=1}^{m} yik² ]

PCA: optimal components of the samples


Let us first compute the optimum components of the samples along the first m vectors of the basis {e1, e2, ..., en} (unconstrained minimization).

Setting to zero the derivative of the functional with respect to each component yik yields:

∂ε/∂yik = 0   ⟹   yik = ei^t (xk − c) = ei^t xk − bi,   k = 1, 2, ..., N,

where bi = ei^t c is the component of c along the i-th unit vector ei of the unknown orthonormal basis (i = 1, 2, ..., m).
Plugging these optimal values into ε allows obtaining:

ε = (1/N) Σ_{k=1}^{N} { ||xk − c||² − Σ_{i=1}^{m} [ei^t (xk − c)]² }
  = (1/N) Σ_{k=1}^{N} { Σ_{i=1}^{n} [ei^t (xk − c)]² − Σ_{i=1}^{m} [ei^t (xk − c)]² }
  = (1/N) Σ_{k=1}^{N} Σ_{i=m+1}^{n} [ei^t (xk − c)]² = (1/N) Σ_{k=1}^{N} Σ_{i=m+1}^{n} (ei^t xk − bi)²

(the second equality holds because {e1, e2, ..., en} is a complete orthonormal basis, so ||xk − c||² = Σ_{i=1}^{n} [ei^t (xk − c)]²).


PCA: optimal origin


ε now depends on the origin c only through the components bm+1, bm+2, ..., bn of c along em+1, em+2, ..., en.

The zero-gradient condition with respect to bi (i = m + 1, m + 2, ..., n) yields:

∂ε/∂bi = (2/N) Σ_{k=1}^{N} (bi − ei^t xk) = 0   ⟹   bi = ei^t (1/N) Σ_{k=1}^{N} xk = ei^t μ,

where μ = (1/N) Σ_{k=1}^{N} xk is the centroid of the data set.

Consequently:

ε = (1/N) Σ_{k=1}^{N} Σ_{i=m+1}^{n} (ei^t xk − ei^t μ)² = (1/N) Σ_{k=1}^{N} Σ_{i=m+1}^{n} [ei^t (xk − μ)]²
  = Σ_{i=m+1}^{n} ei^t [ (1/N) Σ_{k=1}^{N} (xk − μ)(xk − μ)^t ] ei = Σ_{i=m+1}^{n} ei^t Σ ei,

where Σ = (1/N) Σ_{k=1}^{N} (xk − μ)(xk − μ)^t is the sample covariance of the data set.

PCA: optimal orthonormal basis


The vectors ei (i = 1, 2, ..., n) are required to be orthonormal, so their optimization is a constrained problem.

Optimum basis vectors ei (through Lagrange multipliers):

min_{ei} ei^t Σ ei   subject to   ||ei||² = ei^t ei = 1   ⟹   2 Σ ei − 2 λi ei = 0   ⟹   (Σ − λi I) ei = 0

The sample covariance Σ is symmetric and positive semidefinite. Therefore, it has n real nonnegative eigenvalues λ1, λ2, ..., λn with corresponding orthonormal eigenvectors e1, e2, ..., en.
To establish which m eigenvectors should be preserved (and which (n − m) should be discarded), let us plug the obtained optimal values into the expression of the functional. This yields the following minimum mean square error:

ε* = Σ_{i=m+1}^{n} λi ei^t ei = Σ_{i=m+1}^{n} λi
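A quick numerical check of this identity on synthetic, purely illustrative data: the mean square reconstruction error obtained by keeping the m leading eigenvectors coincides with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # arbitrary correlated data
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)                    # sample covariance (1/N convention)
lam, E = np.linalg.eigh(Sigma)                            # eigh returns ascending eigenvalues
lam, E = lam[::-1], E[:, ::-1]                            # reorder so lambda_1 >= ... >= lambda_n

m = 2
T = E[:, :m].T                                            # rows = m leading eigenvectors
X_hat = mu + (X - mu) @ T.T @ T                           # reconstruction from m components
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, lam[m:].sum())                                 # the two values coincide
```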


PCA: feature reduction


Therefore, the minimum value of ε* is obtained if λm+1, λm+2, ..., λn are the smallest eigenvalues, i.e., if the preserved unit vectors e1, e2, ..., em correspond to the m largest eigenvalues λ1, λ2, ..., λm.

Expression of the PCA transformation

If the n eigenvalues of Σ are sorted in decreasing order (i.e., λ1 ≥ λ2 ≥ ... ≥ λn), the PCA transformation projects the samples (centered with respect to the centroid) along the axes e1, e2, ..., em corresponding to the first m eigenvalues:

yik = ei^t (xk − μ)   ∀ i, k,   i.e.,   yk = T (xk − μ),   with   T = [ e1^t ; e2^t ; ... ; em^t ]


PCA: remarks
Operatively, PCA is applied as follows:

Compute the centroid μ and the sample covariance Σ of the whole data set.
Compute the eigenvalues and the eigenvectors of Σ.
Sort the eigenvalues in decreasing order.
Compute the matrix T through the row juxtaposition of the eigenvectors corresponding to the first m eigenvalues.

Remarks

The PCA transformation is therefore y = T(x − μ).
According to the expression of the minimum mean square error, the information loss due to feature reduction through PCA is often quantified through the following efficiency factor η:

η = ( Σ_{i=1}^{m} λi ) / ( Σ_{i=1}^{n} λi )
[Plot: efficiency factor η as a function of the number m of preserved components.]
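A compact numpy sketch of this procedure on a synthetic, purely illustrative data set (the dimensions and values are arbitrary):

```python
import numpy as np

def pca(X, m):
    """PCA as described above: returns (T, mu, eta), where the rows of T are the m
    leading eigenvectors of the sample covariance and eta is the efficiency factor."""
    X = np.asarray(X, float)
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / len(X)        # sample covariance
    lam, E = np.linalg.eigh(Sigma)                # ascending eigenvalues
    lam, E = lam[::-1], E[:, ::-1]                # sort in decreasing order
    T = E[:, :m].T                                # row juxtaposition of eigenvectors
    eta = lam[:m].sum() / lam.sum()               # efficiency factor
    return T, mu, eta

# Hypothetical data set: 200 samples in a 10-D feature space
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
T, mu, eta = pca(X, m=3)
Y = (X - mu) @ T.T                                # transformed samples y = T(x - mu)
print(T.shape, Y.shape, round(eta, 3))
```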

PCA: interpretation of the principal components


The eigenvalue λi represents the sample variance along the axis ei (i = 1, 2, ..., n).

The components along the axes e1, e2, ..., en are named principal components. Therefore, one may say that PCA preserves the first m principal components.
Geometrically, e1 is the direction along which the samples exhibit the maximum dispersion and en is the direction along which the sample dispersion is lowest.
Since the transformed features associated with maximum dispersion are chosen, PCA implicitly assumes that information is conveyed by the variance of the data (see the earlier two-dimensional geometric interpretation).


PCA: remarks on the principal components


Choosing features related to maximum dispersion does not imply choosing features that discriminate the classes well.

In this 2D example, the separation between the classes is poor with only the first PCA component y1, while considering both y1 and y2 yields better separation:

[Figure: two classes in the (x1, x2) plane, together with the principal axes e1 and e2.]

Indeed, PCA does not use information about the class membership of the samples. If a training set is available, it is convenient to use a supervised feature extraction method (e.g., LDA or more sophisticated approaches).


Example (1)
Apply PCA to the following samples: (0, 0, 0), (1, 0, 0), (1, 0, 1),
(1, 1, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1).

Transformation matrix for the extraction of two features:


With N = 8, the centroid of the data set is μ = (1/2, 1/2, 1/2)^t. An orthonormal eigenvector basis of the sample covariance is:

e1 = (1/√3) (1, 1, 1)^t,   e2 = (1/√6) (2, −1, −1)^t,   e3 = (1/√2) (0, 1, −1)^t

so the transformation matrix for the extraction of the first two features is:

T = [ 1/√3    1/√3    1/√3
      2/√6   −1/√6   −1/√6 ]


Example (2)
Compute the transformed samples:

Subtraction of the centroid from the samples gives the centered vectors xk − μ (all components equal to ±1/2):

(−1/2, −1/2, −1/2)^t, (1/2, −1/2, −1/2)^t, (1/2, −1/2, 1/2)^t, (1/2, 1/2, −1/2)^t,
(−1/2, −1/2, 1/2)^t, (−1/2, 1/2, −1/2)^t, (−1/2, 1/2, 1/2)^t, (1/2, 1/2, 1/2)^t

Transformed samples yk = T(xk − μ):

y1 = (−√3/2, 0)^t,   y2 = (−1/(2√3), 2/√6)^t,   y3 = (1/(2√3), 1/√6)^t,   y4 = (1/(2√3), 1/√6)^t,
y5 = (−1/(2√3), −1/√6)^t,   y6 = (−1/(2√3), −1/√6)^t,   y7 = (1/(2√3), −2/√6)^t,   y8 = (√3/2, 0)^t

Selection vs. extraction


Advantage of extraction methods

An extraction method projects the feature space onto a subspace such that the maximum information is preserved, and is consequently more flexible (indeed, selection is a particular case of extraction).

Advantage of selection methods

The features provided by a selection method are a subset of the original ones. Therefore, they maintain their physical meaning. This is relevant when information about the interpretation of the features is used in the classification process (e.g., knowledge-based methods).
On the contrary, an extraction method generates "virtual" features, which are defined as linear combinations of the measured original features and usually have well-defined mathematical meanings but no physical meaning.
With selection, the discarded features are no longer needed. With extraction, one usually needs to use all the original features (e.g., to compute the linear combinations).


Bibliography

R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd edition, Wiley, New York, 2001.

P. H. Swain and S. M. Davis, Remote Sensing: The Quantitative Approach, McGraw-Hill, New York, 1978.

K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, New York, 1990.

J. A. Richards and X. Jia, Remote Sensing Digital Image Analysis, Springer-Verlag, Berlin, 1999.

G. Hughes, "On the mean accuracy of statistical pattern recognizers", IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, 1968.

L. O. Jimenez and D. A. Landgrebe, "Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data", IEEE Transactions on Systems, Man and Cybernetics, Part C, vol. 28, no. 1, pp. 39-54, 1998.

S. B. Serpico and L. Bruzzone, "A new search algorithm for feature selection in hyperspectral remote sensing images", IEEE Transactions on Geoscience and Remote Sensing, vol. 39, pp. 1360-1367, 2001.

L. O. Jimenez and D. A. Landgrebe, "Hyperspectral data analysis and feature reduction via projection pursuit", IEEE Transactions on Geoscience and Remote Sensing, vol. 37, pp. 2653-2667, 1999.

H. C. Andrews, Introduction to Mathematical Techniques in Pattern Recognition, Wiley-Interscience, New York, 1972.
