
4.

Feature Reduction

Prof. Sebastiano B. Serpico

Università di Genova

Dipartimento di Ingegneria
Biofisica ed Elettronica

Complexity of a Classifier
As the number n of features increases, the design of a classifier faces several issues related to the dimensionality of the problem (curse of dimensionality):

Computational complexity;
Hughes phenomenon.

Computational complexity

As n increases, the computational complexity of a classifier increases. For some classification techniques this increase is linear in n, for others it is of a higher order (e.g., quadratic).
The increase in complexity implies longer computation times and a larger memory occupation.

Hughes Phenomenon
Intuitive reasoning

As n increases, the amount of information available to the classifier should increase, and consequently the classification accuracy should also increase, but...

Experimental observation

On the contrary, for a fixed number N of training samples, the probability of a correct decision of a classifier increases for 1 ≤ n ≤ n* up to a maximum and then decreases for n ≥ n* (Hughes phenomenon).

Interpretation

As n increases, the number of parameters Kn of the classifier becomes higher and higher. As the ratio Kn/N grows, the available training samples become too few to obtain a satisfactory estimate of such parameters.

Feature Reduction
A solution to these dimensionality issues is to reduce the number n of the features used in the classification process (feature reduction or parameter reduction).

Disadvantage: reducing the dimension of the feature space involves a loss of information.

Two main strategies exist to achieve feature reduction:

feature selection: within the set of the n available features, a subset of m features is identified by adopting an optimization criterion, chosen to minimize the loss of information or maximize the classification accuracy;
feature extraction: a transformation (often linear) of the original (n-dimensional) feature space into a space of smaller dimension m is applied in such a way as to minimize the information loss or maximize the classification accuracy.

Feature Selection
Problem setting:

Given a set X = {x1, x2, ..., xn} of n features, identify the subset S ⊂ X, composed of m features (m < n), that maximizes a functional J(·):

S* = arg max_{S ⊂ X} J(S)

An algorithm for feature selection is then defined on the basis of two distinct components:

the functional J(·): it has to be defined so that J(S) measures the goodness of the feature subset S in the classification process;
the algorithm for the search of the subset S*: the number of subsets of X is 2^n, so an exhaustive search is computationally not feasible, except for small values of n. Therefore, sub-optimal maximization strategies are adopted to find "good" solutions, even if they do not correspond to global optima (see the sketch below).
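To make the combinatorial issue concrete, the following minimal Python sketch performs an exhaustive search restricted to the subsets of a given cardinality m; the feature scores and the functional J below are purely illustrative placeholders, not part of the method itself.

```python
from itertools import combinations

def exhaustive_selection(features, J, m):
    """Brute-force search over all C(n, m) subsets of size m; feasible only for small n."""
    return max(combinations(features, m), key=lambda S: J(list(S)))

# Hypothetical functional: here simply the sum of individual feature scores
scores = {"x1": 0.9, "x2": 0.4, "x3": 0.7, "x4": 0.2}
print(exhaustive_selection(scores, lambda S: sum(scores[x] for x in S), m=2))
```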

Bhattacharyya Bounds
A choice of the functional J(·) that is significant from the classification point of view can be based on the criterion of the minimum error probability Pe.

In the presence of only two classes ω1 and ω2, the Bhattacharyya distance B and the Bhattacharyya coefficient ρ provide an upper bound on Pe:

Pe ≤ u = √(P1 P2) exp(−B),   where B = −ln ρ and ρ = ∫ √( p(x|ω1) p(x|ω2) ) dx

Moreover, it is possible to show that:

(1/2) [ 1 − √(1 − 4u²) ] ≤ Pe ≤ u

Bhattacharyya Distance and Coefficient


An approach to feature selection consists in the maximization of the Bhattacharyya distance B or (equivalently) in the minimization of the Bhattacharyya coefficient ρ.

In particular, a distance B(S) (or a coefficient ρ(S)) can be associated with each subset S of m features; in fact, indicating a vector of the feature subset S with xS, one can define:

B(S) = −ln ρ(S)   and   ρ(S) = ∫ √( p(xS|ω1) p(xS|ω2) ) dxS

Properties:

0 ≤ ρ(S) ≤ 1 and therefore B(S) ≥ 0;
if p(xS|ω1) and p(xS|ω2) are nonzero only in separated regions, then ρ(S) = 0 and B(S) = +∞;
if p(xS|ω1) = p(xS|ω2) for every xS, then ρ(S) is the integral of a pdf over the entire space ℝ^m, therefore ρ(S) = 1 and B(S) = 0.

Computation of the Bhattacharyya Coefficient and Distance


ρ(S) is a multiple integral in an m-dimensional space, so its analytical computation from the conditional pdfs is complex. Two particular cases exist in which the computation is simple:

if the features in the subset S are independent when conditioned on each class, we have:

p(xS|ωi) = ∏_{xr ∈ S} p(xr|ωi),   i = 1, 2

then the following additive property is valid:

ρ(S) = ∏_{xr ∈ S} ρ({xr})   and   B(S) = Σ_{xr ∈ S} B({xr})

if p(xS|ωi) = N(mi^S, Σi^S) (i = 1, 2), we obtain:

B(S) = (1/8) (m2^S − m1^S)^t [ (Σ1^S + Σ2^S) / 2 ]^(−1) (m2^S − m1^S) + (1/2) ln( | (Σ1^S + Σ2^S) / 2 | / √( |Σ1^S| |Σ2^S| ) )
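As an illustration of the Gaussian closed form and of the error bounds, the following minimal numpy sketch can be used; the class means, covariances, and priors are invented numbers, only meant to show the computation.

```python
import numpy as np

def bhattacharyya_gaussian(m1, S1, m2, S2):
    """Bhattacharyya distance between two Gaussian class models N(m1, S1) and N(m2, S2)."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    Sm = 0.5 * (S1 + S2)                                   # (Sigma1 + Sigma2) / 2
    d = m2 - m1
    term_mean = 0.125 * d @ np.linalg.solve(Sm, d)         # (1/8) d^t Sm^{-1} d
    term_cov = 0.5 * np.log(np.linalg.det(Sm) /
                            np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term_mean + term_cov

# Hypothetical two-class, two-feature example
m1, S1 = [0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]]
m2, S2 = [2.0, 1.0], [[1.5, 0.0], [0.0, 0.8]]
B = bhattacharyya_gaussian(m1, S1, m2, S2)
rho = np.exp(-B)                                           # Bhattacharyya coefficient
P1 = P2 = 0.5                                              # assumed equal priors
u = np.sqrt(P1 * P2) * rho                                 # upper bound on Pe
lower = 0.5 * (1.0 - np.sqrt(1.0 - 4.0 * u ** 2))          # lower bound on Pe
print(f"B = {B:.3f}, rho = {rho:.3f}, bounds on Pe: [{lower:.3f}, {u:.3f}]")
```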

Other Inter-Class Distances


In addition to the Bhattacharyya distance, other measures of inter-class distance have been introduced in the literature.

For example, the Divergence measures the separation between two classes as a function of the likelihood ratio between the respective conditional pdfs.
The Bhattacharyya distance and the Divergence are not upper bounded. This makes them less appropriate as measures of inter-class separation. In fact, focusing on the Gaussian case for simplicity, when two classes are already well separated, an increment of the distance ||m1^S − m2^S|| between the conditional means generates a large increment of B(S), but an irrelevant reduction of Pe.
Therefore, other measures of inter-class distance have been proposed (not treated in depth here) that, being upper bounded, do not present such a problem. Among them, we recall the Jeffries-Matusita Distance and the Modified Divergence [Richards 1999, Swain 1978].


Multiclass Extension
Extension to the case of M classes ω1, ω2, ..., ωM.

If ρij(S) and Bij(S) are the Bhattacharyya coefficient and distance between two classes ωi and ωj, computed over a feature subset S, and if Pi = P(ωi) is the a priori probability of the class ωi, the following average Bhattacharyya coefficient and average Bhattacharyya distance are defined:

ρ_ave(S) = Σ_{i=1}^{M−1} Σ_{j=i+1}^{M} Pi Pj ρij(S),      B_ave(S) = Σ_{i=1}^{M−1} Σ_{j=i+1}^{M} Pi Pj Bij(S)

Remarks

In the case M = 2, the maximization of B(S) was equivalent to the minimization of ρ(S) (because B(S) = −ln ρ(S)). In the multiclass case, the maximization of B_ave(S) is no longer equivalent to the minimization of ρ_ave(S), because the relation between B_ave(S) and ρ_ave(S) is no longer monotonic.
Under the hypothesis of class-conditional feature independence, we have:

B_ave(S) = Σ_{xr ∈ S} B_ave({xr})

Attention! This additive property is not valid for ρ_ave(S).
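A small Python sketch of the average distance, assuming the pairwise distances Bij(S) have already been computed (for instance with the Gaussian closed form above) and stored in a dictionary; the numbers are illustrative only.

```python
from itertools import combinations

def average_bhattacharyya(B_pair, priors):
    """Average Bhattacharyya distance from pairwise distances B_pair[(i, j)] (i < j)
    and a list of class prior probabilities (classes indexed from 0)."""
    M = len(priors)
    return sum(priors[i] * priors[j] * B_pair[(i, j)]
               for i, j in combinations(range(M), 2))

# Hypothetical pairwise distances for M = 3 classes with equal priors
B_pair = {(0, 1): 1.2, (0, 2): 0.4, (1, 2): 0.9}
print(average_bhattacharyya(B_pair, [1/3, 1/3, 1/3]))
```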

Maximization of the Functional


In a feature selection problem, the inter-class separation measures introduced above play the role of the functional J(·) to be maximized.

Preliminary observations:

The Bhattacharyya distance has to be maximized, while the Bhattacharyya coefficient has to be minimized. Therefore, in the following, the functional may correspond to B_ave(S) (to be maximized) or to ρ_ave(S) (to be minimized).

An exhaustive search over all possible subsets of X is, in general, computationally not affordable.
It becomes feasible if the features are independent when conditioned on each class and if the adopted functional is B_ave. In such a case, in fact, once the values of the functional associated with the single features have been computed, by the additive property the optimum subset S* of m features is simply composed of the m features that individually present the m highest values of B_ave({xr}), as sketched below.
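A minimal sketch of this shortcut, assuming the individual values B_ave({xr}) are already available in an array (the numbers below are invented):

```python
import numpy as np

def select_top_m(B_single, m):
    """Indices of the m features with the highest individual B_ave({x_r}) values
    (optimal under class-conditional feature independence)."""
    return np.argsort(np.asarray(B_single, float))[::-1][:m]

# Hypothetical individual average distances for n = 6 features
B_single = [0.8, 2.1, 0.3, 1.7, 0.9, 1.2]
print(select_top_m(B_single, m=3))   # indices of the 3 best features
```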


Sequential Forward Selection


In general, the search for a subset of m features is conducted by means of a sub-optimal algorithm. Among such algorithms we consider (for its simplicity) sequential forward selection (SFS), which is based on the following steps:

initialize S* = ∅;
compute the value of the functional for all the subsets S* ∪ {xi}, with xi ∉ S*, and choose the feature x* ∉ S* that corresponds to the maximum value of J(S* ∪ {xi});
update S* by setting S* = S* ∪ {x*};
continue by iteratively adding one feature at a time until S* reaches the desired cardinality m or until the value of the functional stabilizes (reaches saturation). A sketch is given below.
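A minimal Python sketch of SFS; the functional J used here is a hypothetical toy score (individual scores minus a redundancy penalty), introduced only to make the greedy behaviour visible.

```python
from itertools import combinations

def sequential_forward_selection(features, J, m):
    """Greedy SFS: starting from the empty set, add at each step the feature that
    maximizes the functional J evaluated on the enlarged subset."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < m:
        scores = {x: J(selected + [x]) for x in remaining}   # evaluate J on S* U {x}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical functional: sum of individual scores minus a pairwise redundancy penalty
single = {"x1": 2.0, "x2": 1.6, "x3": 1.5, "x4": 0.4}
redundant = {frozenset(("x2", "x3")): 1.4}   # x2 and x3 carry similar information

def J(subset):
    penalty = sum(redundant.get(frozenset(p), 0.0) for p in combinations(subset, 2))
    return sum(single[x] for x in subset) - penalty

print(sequential_forward_selection(single, J, m=3))   # e.g. ['x1', 'x2', 'x4']
```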


Remarks on SFS
SFS identifies the optimum subset that can be obtained by iteratively adding a single feature at a time.

At the first step, the single feature that corresponds to the maximum value of the functional is chosen. At the second step, the feature that, coupled with the previous one, provides the maximum value of the functional is added. And so on...
The method is sub-optimal. For example, the optimal couple of features does not always include the single optimal feature.

Advantage

SFS is not computationally heavy, even if X contains hundreds of features.

Disadvantage

A feature that has been included in the selected subset S* at a given iteration cannot be removed during the following iterations, i.e., SFS does not allow backtracking.


Sequential Backward Selection


Sequential backward selection (SBS) proceeds in a dual way with respect to SFS: it initializes S* = X and eliminates a single feature at a time from S*, so as to maximize the functional J(S) step by step.

Disadvantages

Like SFS, SBS does not allow backtracking: a feature eliminated from S* at a given iteration will never be recovered in the following steps;
SBS is usually computationally disadvantageous with respect to SFS: while SFS starts from an empty subset and adds one feature at a time, SBS starts from the original feature space. Therefore, SBS computes values of the functional in spaces with much higher dimensions than SFS. However, it is advantageous if m is close to n.

In the literature, other more complex methods have been proposed (not covered here) to search for sub-optimal subsets, which also allow backtracking [Serpico et al., 2001].
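Under the same assumptions as the SFS sketch above (a generic callable functional J), SBS can be sketched as its dual:

```python
def sequential_backward_selection(features, J, m):
    """Greedy SBS: starting from the full set, remove at each step the feature whose
    elimination keeps the functional J of the remaining subset as high as possible."""
    selected = list(features)
    while len(selected) > m:
        scores = {x: J([f for f in selected if f != x]) for x in selected}
        worst = max(scores, key=scores.get)   # removing 'worst' hurts J the least
        selected.remove(worst)
    return selected
```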


Operational Aspects of Feature Selection

The computation of the inter-class distance measures used in feature selection requires knowledge of the class-conditional pdfs and of the class prior probabilities.

Usually, such pdfs are not known a priori and must be estimated from a training set by means of parametric or non-parametric methods.
Overall, a classification system that includes a feature selection step can be summarized by the following flowchart.

[Flowchart] Data set → training set (training samples for each class ωi) → estimation of the class-conditional pdfs and of the class prior probabilities {p(x|ωi), Pi} → feature selection (output: S*) → training of the classifier → application of the classifier to the data set → classification of the data set.


Example
Hyperspectral data set with 202 features and 9 classes.
[Figure: map of the ground truth highlighting the training pixels; RGB composition of three of the 202 bands acquired by the sensor.]

Estimated probability of correct classification for a MAP classifier under the hypothesis of Gaussian classes.

[Plot: average Bhattacharyya distance B_ave as a function of the number m of selected features.]
[Plot: overall accuracy (OA) as a function of m; the maximum, Pc,max = 88.6%, is obtained for m = 40.]

Feature Extraction
Problem definition:

Given a set X = {x1, x2, ..., xn} of n features, we want to identify a linear transformation that provides a transformed set of m features Y = {y1, y2, ..., ym} (with m < n), minimizing the loss of information (measured by an appropriate functional) involved in the dimensionality reduction;
The linear transformation is defined by an m × n transformation matrix T such that y = Tx.

Remarks

The transformations are typically orthonormal, i.e., the matrix T is the row juxtaposition of m orthonormal vectors e1, e2, ..., em:

ei^t ej = δij,   i, j = 1, 2, ..., m   ⟹   T T^t = Im

Transforming linearly by such a matrix T means projecting onto the subspace generated by the orthonormal basis {e1, e2, ..., em}.


Extraction Based on Inter-Class Distances


Considering again the Bhattacharyya distance, in the case of two Gaussian classes we look for the orthonormal feature transformation that maximizes the distance in the transformed space.

Let mi^Y = E{y|ωi} = T mi and Σi^Y = Cov{y|ωi} = T Σi T^t (for i = 1, 2). B in the transformed space Y is given by:

B(Y) = (1/8) tr{ [ (Σ1^Y + Σ2^Y) / 2 ]^(−1) (m2^Y − m1^Y)(m2^Y − m1^Y)^t } + (1/2) ln( | (Σ1^Y + Σ2^Y) / 2 | / √( |Σ1^Y| |Σ2^Y| ) )

In this expression two distinct contributions appear: the first term, denoted Bm(Y), is linked to the conditional means, and the second term, denoted BΣ(Y), is linked to the conditional covariance matrices.


Extraction Based on Inter-Class Distances


In principle, we would search for the orthonormal matrix T that maximizes B(Y). However:

The general problem of the maximization of B(Y) with respect to T has no closed-form solution.
The problems of separately maximizing Bm(Y) or BΣ(Y) have closed-form solutions (eigenproblems). Details can be found in [Fukunaga, 1990].
Therefore, if one of the two contributions is largely dominant over the other (i.e., Bm(Y) >> BΣ(Y) or Bm(Y) << BΣ(Y)), these closed-form solutions can be applied.
Otherwise, the maximization of B(Y) has to be addressed numerically (e.g., through the projection pursuit approach [Jimenez et al., 1999]).
If the two Gaussian classes are equiprobable, the maximization of Bm(Y) leads to the same solution as the Fisher transform (discussed later).


Linear discriminant analysis


A popular method for feature extraction is linear discriminant analysis (LDA, also known as discriminant analysis feature extraction, DAFE), which maximizes a measure of separation and compactness of the classes defined directly on the training set.

Although explicit parametric assumptions are not stated, DAFE is usually considered parametric, because it works poorly, for example, with multimodal classes, and it characterizes the classes only through first- and second-order moments.
However, nonparametric extensions of this method have recently been introduced.

The method can be applied to both binary and multiclass problems.

Focusing first on the case of two classes, ω1 and ω2, linear discriminant analysis provides an optimum scalar projection, named the Fisher transform.


DAFE: Fisher transform


In general, even if the classes are well separated in the original n-dimensional space, they may not be so in a transformed one-dimensional space, because the projection can overlay samples drawn from different classes.
The problem is to find the orientation of the projection line that provides the best separation between the two classes.

Given a set {x1, x2, ..., xN} of N pre-classified samples, let Di be the subset of the samples assigned to ωi (i = 1, 2) and let Ni be the cardinality of Di (obviously N = N1 + N2).
A transformation y = w^t x projects the sample xk to yk = w^t xk. Let Ei = {y = w^t x : x ∈ Di}.
We search for the transformation y = w^t x that maximizes the inter-class separation and minimizes the intra-class dispersion, suitably quantified.


Inter-class separation and intra-class dispersion


First, a functional that measures the inter-class separation and the dispersion inside each class is needed.

As a measure of inter-class separation, the difference between the centroids of the samples in the transformed space is used:

μi = (1/Ni) Σ_{x ∈ Di} x,      μ̃i = (1/Ni) Σ_{y ∈ Ei} y = w^t μi,   i = 1, 2

As a measure of class dispersion around the centroids, the scatter values in the transformed space are adopted:

Si = Σ_{x ∈ Di} (x − μi)(x − μi)^t      (Si is called the scatter matrix of the class ωi, i = 1, 2)

si² = Σ_{y ∈ Ei} (y − μ̃i)² = w^t Si w,   i = 1, 2


The Fisher Functional


The goal of the Fisher transform is to maximize the distance between the centroids of the classes and to minimize the scatters in the one-dimensional transformed space.

For this purpose, the following Fisher functional is introduced:

J(w) = (μ̃1 − μ̃2)² / (s1² + s2²)

Let us explicitly write the functional as a function of w:

s1² + s2² = w^t (S1 + S2) w = w^t Sw w

(μ̃1 − μ̃2)² = [w^t (μ1 − μ2)]² = w^t (μ1 − μ2)(μ1 − μ2)^t w = w^t Sb w

J(w) = (w^t Sb w) / (w^t Sw w),

where Sb = (μ1 − μ2)(μ1 − μ2)^t is named the between-class scatter matrix and Sw = S1 + S2 is named the within-class scatter matrix.


Optimality condition for the Fisher functional


Optimality condition

Through the usual zero-gradient condition, one may prove that the vector w* that maximizes the Fisher functional is an eigenvector of the product matrix Sw⁻¹Sb:

(Sw⁻¹Sb − λI) w* = 0,   i.e.,   (Sb − λSw) w* = 0,

where λ is the corresponding eigenvalue.

Closed-form solution

Therefore, w* satisfies the condition:

Sw⁻¹Sb w* = λ w*   ⟹   Sw⁻¹ (μ1 − μ2)(μ1 − μ2)^t w* = λ w*

(μ1 − μ2)^t w* and λ are scalars, so w* is parallel to Sw⁻¹(μ1 − μ2). Since scale factors are irrelevant in linear projections, we obtain the following closed-form solution (with no need to explicitly compute eigenvectors):

w* = Sw⁻¹ (μ1 − μ2)

Typically, the vector w* is also normalized.
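As a check of the closed-form solution, the following minimal numpy sketch computes w* on two synthetic, purely illustrative clouds of samples.

```python
import numpy as np

def fisher_transform(X1, X2):
    """Fisher direction w* = Sw^{-1}(mu1 - mu2) from two arrays of training samples
    (one row per sample). Returns the normalized projection vector."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)          # class scatter matrices
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2                            # within-class scatter matrix
    w = np.linalg.solve(Sw, mu1 - mu2)      # closed-form solution
    return w / np.linalg.norm(w)

# Hypothetical 2-D data: two roughly Gaussian clouds
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(100, 2))
X2 = rng.normal([2.0, 1.0], [1.0, 0.3], size=(100, 2))
w = fisher_transform(X1, X2)
y1, y2 = X1 @ w, X2 @ w                     # projected (scalar) samples
print("w* =", w, "| centroid separation:", abs(y1.mean() - y2.mean()))
```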


DAFE: multiclass Fisher transform


We extend the discriminant analysis from the binary case to the case of M classes ω1, ω2, ..., ωM and of an m × n transformation matrix.
Let us consider a set {x1, x2, ..., xN} of N pre-classified samples, denote by Di the subset of the samples assigned to ωi (i = 1, 2, ..., M) and by Ni the cardinality of Di (N = N1 + N2 + ... + NM).
The transformation y = Tx maps xk to yk = Txk. Given Ei = {Tx : x ∈ Di}, let us define:

Centroids of ωi in the original and transformed spaces:

μi = (1/Ni) Σ_{x ∈ Di} x,      μ̃i = (1/Ni) Σ_{y ∈ Ei} y = T μi,   i = 1, 2, ..., M

Scatter matrices of ωi in the original and transformed spaces:

Si = Σ_{x ∈ Di} (x − μi)(x − μi)^t,      S̃i = Σ_{y ∈ Ei} (y − μ̃i)(y − μ̃i)^t = T Si T^t,   i = 1, 2, ..., M


DAFE: multiclass Fisher functional (1)


Let us extend the Fisher functional to the multiclass case.

In the multiclass case, we quantify the inter-class separation through the differences between the centroids of the classes and the centroid of the entire training set in the transformed space:

μ̃ = (1/N) Σ_{k=1}^{N} yk = T μ,   where   μ = (1/N) Σ_{k=1}^{N} xk = (1/N) Σ_{i=1}^{M} Ni μi

We measure the dispersions inside the single classes by means of the scatter matrices in the transformed space.
Then, the Fisher functional is generalized as follows (|·| denotes the determinant):

J(T) = | Σ_{i=1}^{M} Ni (μ̃i − μ̃)(μ̃i − μ̃)^t | / | Σ_{i=1}^{M} S̃i |


DAFE: multiclass Fisher functional (2)


Let us explicitly write the Fisher functional as a function of the unknown transformation matrix T.

Let us express the numerator and the denominator as functions of T and consequently introduce a within-class scatter matrix Sw and a between-class scatter matrix Sb:

Σ_{i=1}^{M} S̃i = Σ_{i=1}^{M} T Si T^t = T Sw T^t,   where   Sw = Σ_{i=1}^{M} Si

Σ_{i=1}^{M} Ni (μ̃i − μ̃)(μ̃i − μ̃)^t = T [ Σ_{i=1}^{M} Ni (μi − μ)(μi − μ)^t ] T^t = T Sb T^t,   where   Sb = Σ_{i=1}^{M} Ni (μi − μ)(μi − μ)^t

J(T) = | T Sb T^t | / | T Sw T^t |


Optimality condition for the multiclass case


Optimality condition

Again through a zero-gradient condition, one may prove that the row vectors e1, e2, ..., em of the matrix T* that maximizes the Fisher functional are eigenvectors of Sw⁻¹Sb:

(Sw⁻¹Sb − λi I) ei = 0,   i.e.,   (Sb − λi Sw) ei = 0,   i = 1, 2, ..., m,

where λi is the eigenvalue corresponding to ei and is nonzero.

Remarks

The M matrices (μi − μ)(μi − μ)^t, i = 1, 2, ..., M, have unit rank.
Because of the linear relationship between the overall centroid μ and the class centroids μi, i = 1, 2, ..., M, they are also linearly dependent.
Thus, rank(Sb) ≤ M − 1 and, then, rank(Sw⁻¹Sb) ≤ rank(Sb) ≤ M − 1.
Therefore, at most (M − 1) eigenvalues of Sw⁻¹Sb are nonzero, i.e., the eigenvector equation provides at most (M − 1) solution vectors.


DAFE: comments
DAFE allows up to (M − 1) transformed features to be linearly extracted (remember that M is the number of classes).

Operational issues

The eigenvalues of Sw⁻¹Sb can be computed as the roots of the characteristic polynomial, i.e.:

| Sw⁻¹Sb − λI | = 0   or, equivalently,   | Sb − λSw | = 0

The second formulation is more convenient because it does not require any matrix inversion.
The characteristic equation provides at most (M − 1) nonzero roots λ1, λ2, ..., λ_{M−1} and at least (n − M + 1) zero solutions.
An eigenvector ei is computed from each resulting nonzero eigenvalue λi.
The optimal transformation matrix T* is obtained through the row juxtaposition of the resulting eigenvectors, as sketched below.
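A possible implementation sketch of multiclass DAFE on synthetic data; it relies on scipy's generalized symmetric eigensolver, so the eigenproblem Sb e = λ Sw e is solved without explicitly inverting Sw. The data set and class layout are invented for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def dafe(X, labels, m=None):
    """Multiclass DAFE/LDA: the rows of the returned T* are the eigenvectors of
    Sb e = lambda Sw e associated with the largest eigenvalues (at most M - 1)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    classes = np.unique(labels)
    mu = X.mean(axis=0)
    n = X.shape[1]
    Sw, Sb = np.zeros((n, n)), np.zeros((n, n))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)               # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class scatter
    eigvals, eigvecs = eigh(Sb, Sw)                     # generalized eigenproblem
    order = np.argsort(eigvals)[::-1]
    m = m if m is not None else len(classes) - 1        # at most M - 1 useful axes
    return eigvecs[:, order[:m]].T                      # T*: m x n

# Hypothetical 3-class, 4-feature training set
rng = np.random.default_rng(1)
means = ([0, 0, 0, 0], [1, 0, 1, 0], [0, 2, 0, 2])
X = np.vstack([rng.normal(mu, 0.5, size=(50, 4)) for mu in means])
labels = np.repeat([0, 1, 2], 50)
T = dafe(X, labels)          # shape (2, 4), since M - 1 = 2
Y = X @ T.T                  # transformed samples y = T x
print(T.shape, Y.shape)
```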

Principal component analysis


Principal component analysis (PCA, or Karhunen-Loève transform, KL) is an unsupervised algorithm for feature extraction. In particular, PCA reduces the dimension of the feature space on the basis of a mean square error criterion.

Problem setting

Let a data set {x1, x2, ..., xN} composed of N samples be given.
A coordinate system in the n-dimensional feature space is determined by an orthonormal basis {e1, e2, ..., en} and by an origin c.
In such a coordinate system each sample is expressed as:

xk = c + Σ_{i=1}^{n} yik ei,   k = 1, 2, ..., N

To reduce the dimension of the feature space, one could keep only m components, i.e., approximate each sample as:

x̂k = c + Σ_{i=1}^{m} yik ei,   k = 1, 2, ..., N

However, it is not obvious that the optimal m components yik are the projections of (xk − c) along the ei.


Geometric interpretation
Two-dimensional example

Approximation of the samples in a two-dimensional feature


space (plane) as the sum of a constant vector c and of the
component along one unit vector e1.
[Figure: samples in the (x1, x2) plane, with the origin O, the constant vector c, and the unit vector e1 along which the samples are approximated.]


PCA: mean square error


If the components of xk along (n − m) axes are discarded, an error is obviously introduced. PCA selects the coordinate system that minimizes the mean square error.

The adopted functional is:

ε = (1/N) Σ_{k=1}^{N} ||xk − x̂k||² = (1/N) Σ_{k=1}^{N} ||xk − c − Σ_{i=1}^{m} yik ei||²

This functional has to be minimized with respect to all the related variables, i.e., the origin c, the vectors ei, and the components yik, under the following orthonormality constraint:

ei^t ej = δij,   i, j = 1, 2, ..., m

Plugging this constraint into the expression of the functional yields:

ε = (1/N) Σ_{k=1}^{N} [ ||xk − c||² − 2 Σ_{i=1}^{m} yik ei^t (xk − c) + Σ_{i=1}^{m} yik² ]

PCA: optimal components of the samples


Let us first compute the optimum components of the samples along the first m vectors of the basis {e1, e2, ..., en} (unconstrained minimization).

Setting to zero the derivative of the functional with respect to each component yik yields:

∂ε/∂yik = 0   ⟹   yik = ei^t (xk − c) = ei^t xk − bi,   k = 1, 2, ..., N,

where bi = ei^t c is the component of c along the i-th unit vector ei of the unknown orthonormal basis (i = 1, 2, ..., m).
Plugging these optimal values into ε allows obtaining:

ε = (1/N) Σ_{k=1}^{N} { ||xk − c||² − Σ_{i=1}^{m} [ei^t (xk − c)]² }
  = (1/N) Σ_{k=1}^{N} { Σ_{i=1}^{n} [ei^t (xk − c)]² − Σ_{i=1}^{m} [ei^t (xk − c)]² }
  = (1/N) Σ_{k=1}^{N} Σ_{i=m+1}^{n} [ei^t (xk − c)]² = (1/N) Σ_{k=1}^{N} Σ_{i=m+1}^{n} (ei^t xk − bi)²

(the second equality holds because {e1, e2, ..., en} is a complete orthonormal basis, so ||xk − c||² = Σ_{i=1}^{n} [ei^t (xk − c)]²).


PCA: optimal origin


ε now depends on the origin c only through the components bm+1, bm+2, ..., bn of c along em+1, em+2, ..., en.

The zero-gradient condition with respect to bi (i = m + 1, m + 2, ..., n) yields:

∂ε/∂bi = (2/N) Σ_{k=1}^{N} (bi − ei^t xk) = 0   ⟹   bi = ei^t (1/N) Σ_{k=1}^{N} xk = ei^t μ,

where μ = (1/N) Σ_{k=1}^{N} xk is the centroid of the data set.

Consequently:

ε = (1/N) Σ_{k=1}^{N} Σ_{i=m+1}^{n} (ei^t xk − ei^t μ)² = (1/N) Σ_{k=1}^{N} Σ_{i=m+1}^{n} [ei^t (xk − μ)]²
  = Σ_{i=m+1}^{n} ei^t [ (1/N) Σ_{k=1}^{N} (xk − μ)(xk − μ)^t ] ei = Σ_{i=m+1}^{n} ei^t Σ ei,

where Σ = (1/N) Σ_{k=1}^{N} (xk − μ)(xk − μ)^t is the sample covariance of the data set.

PCA: optimal orthonormal basis


The vectors ei (i = 1, 2, ..., n) are required to be orthonormal, so their optimization is a constrained problem.

Optimum basis vectors ei (through Lagrange multipliers):

min_{ei} ei^t Σ ei   subject to   ||ei||² = ei^t ei = 1   ⟹   2 Σ ei − 2 λi ei = 0   ⟹   (Σ − λi I) ei = 0

The sample covariance Σ is symmetric and positive semidefinite. Therefore, it has n real nonnegative eigenvalues λ1, λ2, ..., λn with corresponding orthonormal eigenvectors e1, e2, ..., en.
To establish which m eigenvectors should be preserved (and which (n − m) should be discarded), let us plug the obtained optimal values into the expression of the functional. This yields the following minimum mean square error:

ε* = Σ_{i=m+1}^{n} λi ei^t ei = Σ_{i=m+1}^{n} λi
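A quick numerical check of this identity on synthetic, purely illustrative data: the mean square reconstruction error obtained by keeping the m leading eigenvectors coincides with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # arbitrary correlated data
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)                    # sample covariance (1/N convention)
lam, E = np.linalg.eigh(Sigma)                            # eigh returns ascending eigenvalues
lam, E = lam[::-1], E[:, ::-1]                            # reorder so lambda_1 >= ... >= lambda_n

m = 2
T = E[:, :m].T                                            # rows = m leading eigenvectors
X_hat = mu + (X - mu) @ T.T @ T                           # reconstruction from m components
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, lam[m:].sum())                                 # the two values coincide
```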


PCA: feature reduction


Therefore, the minimum value of ε* is obtained if λm+1, λm+2, ..., λn are the smallest eigenvalues, i.e., if the preserved unit vectors e1, e2, ..., em correspond to the m largest eigenvalues λ1, λ2, ..., λm.

Expression of the PCA transformation

If the n eigenvalues of Σ are sorted in decreasing order (i.e., λ1 ≥ λ2 ≥ ... ≥ λn), the PCA transformation projects the samples (centered with respect to the centroid) along the axes e1, e2, ..., em corresponding to the first m eigenvalues:

yik = ei^t (xk − μ)   ∀ i, k,   i.e.,   yk = T (xk − μ),   with   T = [ e1^t ; e2^t ; ... ; em^t ]


PCA: remarks
Operatively, PCA is applied as follows:

Compute the centroid μ and the sample covariance Σ of the whole data set.
Compute the eigenvalues and the eigenvectors of Σ.
Sort the eigenvalues in decreasing order.
Compute the matrix T through the row juxtaposition of the eigenvectors corresponding to the first m eigenvalues.

Remarks

The PCA transformation is therefore y = T(x − μ).
According to the expression of the minimum mean square error, the information loss due to feature reduction through PCA is often quantified through the following efficiency factor η:

η = ( Σ_{i=1}^{m} λi ) / ( Σ_{i=1}^{n} λi )
[Plot: efficiency factor η as a function of the number m of preserved components.]
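A compact numpy sketch of this procedure on a synthetic, purely illustrative data set (the dimensions and values are arbitrary):

```python
import numpy as np

def pca(X, m):
    """PCA as described above: returns (T, mu, eta), where the rows of T are the m
    leading eigenvectors of the sample covariance and eta is the efficiency factor."""
    X = np.asarray(X, float)
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / len(X)        # sample covariance
    lam, E = np.linalg.eigh(Sigma)                # ascending eigenvalues
    lam, E = lam[::-1], E[:, ::-1]                # sort in decreasing order
    T = E[:, :m].T                                # row juxtaposition of eigenvectors
    eta = lam[:m].sum() / lam.sum()               # efficiency factor
    return T, mu, eta

# Hypothetical data set: 200 samples in a 10-D feature space
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
T, mu, eta = pca(X, m=3)
Y = (X - mu) @ T.T                                # transformed samples y = T(x - mu)
print(T.shape, Y.shape, round(eta, 3))
```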

PCA: interpretation of the principal components


The eigenvalue λi represents the sample variance along the axis ei (i = 1, 2, ..., n).

The components along the axes e1, e2, ..., en are named principal components. Therefore, one may say that PCA preserves the first m principal components.
Geometrically, e1 is the direction along which the samples exhibit the maximum dispersion and en is the direction along which the sample dispersion is lowest.
Since the transformed features associated with maximum dispersion are chosen, PCA implicitly assumes that information is conveyed by the variance of the data (see the earlier two-dimensional geometric interpretation).


PCA: remarks on the principal components


Choosing features related to maximum dispersion does not imply choosing features that discriminate the classes well.

In this 2D example, the separation between the classes is poor with only the first PCA component y1, while considering both y1 and y2 yields better separation:

[Figure: two classes in the (x1, x2) plane, together with the principal axes e1 and e2.]

Indeed, PCA does not use information about the class membership of the samples. If a training set is available, it is convenient to use a supervised feature extraction method (e.g., LDA or more sophisticated approaches).


Example (1)
Apply PCA to the following samples: (0, 0, 0), (1, 0, 0), (1, 0, 1),
(1, 1, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1).

Transformation matrix for the extraction of two features:


With N = 8, the centroid of the data set is μ = (1/2, 1/2, 1/2)^t. An orthonormal eigenvector basis of the sample covariance is:

e1 = (1/√3) (1, 1, 1)^t,   e2 = (1/√6) (2, −1, −1)^t,   e3 = (1/√2) (0, 1, −1)^t

so the transformation matrix for the extraction of the first two features is:

T = [ 1/√3    1/√3    1/√3
      2/√6   −1/√6   −1/√6 ]


Example (2)
Compute the transformed samples:

Subtraction of the centroid from the samples gives the centered vectors xk − μ (all components equal to ±1/2):

(−1/2, −1/2, −1/2)^t, (1/2, −1/2, −1/2)^t, (1/2, −1/2, 1/2)^t, (1/2, 1/2, −1/2)^t,
(−1/2, −1/2, 1/2)^t, (−1/2, 1/2, −1/2)^t, (−1/2, 1/2, 1/2)^t, (1/2, 1/2, 1/2)^t

Transformed samples yk = T(xk − μ):

y1 = (−√3/2, 0)^t,   y2 = (−1/(2√3), 2/√6)^t,   y3 = (1/(2√3), 1/√6)^t,   y4 = (1/(2√3), 1/√6)^t,
y5 = (−1/(2√3), −1/√6)^t,   y6 = (−1/(2√3), −1/√6)^t,   y7 = (1/(2√3), −2/√6)^t,   y8 = (√3/2, 0)^t

Selection vs. extraction


Advantage of extraction methods

An extraction method projects the feature space onto a subspace such that the maximum information is preserved, and is consequently more flexible (indeed, selection is a particular case of extraction).

Advantage of selection methods

The features provided by a selection method are a subset of the original ones. Therefore, they maintain their physical meaning. This is relevant when information about the interpretation of the features is used in the classification process (e.g., knowledge-based methods).
On the contrary, an extraction method generates "virtual" features, which are defined as linear combinations of the measured original features and usually have well-defined mathematical meanings but no physical meaning.
With selection, the discarded features are no longer needed. With extraction, one usually needs to use all the original features (e.g., to compute the linear combinations).


Bibliography

R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd edition, Wiley, New York, 2001.

P. H. Swain and S. M. Davis, Remote Sensing: The Quantitative Approach, McGraw-Hill, New York, 1978.

K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, New York, 1990.

J. A. Richards and X. Jia, Remote Sensing Digital Image Analysis, Springer-Verlag, Berlin, 1999.

G. Hughes, "On the mean accuracy of statistical pattern recognizers", IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, 1968.

L. O. Jimenez and D. A. Landgrebe, "Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data", IEEE Transactions on Systems, Man and Cybernetics, Part C, vol. 28, no. 1, pp. 39-54, 1998.

S. B. Serpico and L. Bruzzone, "A new search algorithm for feature selection in hyperspectral remote sensing images", IEEE Transactions on Geoscience and Remote Sensing, vol. 39, pp. 1360-1367, 2001.

L. O. Jimenez and D. A. Landgrebe, "Hyperspectral data analysis and feature reduction via projection pursuit", IEEE Transactions on Geoscience and Remote Sensing, vol. 37, pp. 2653-2667, 1999.

H. C. Andrews, Introduction to Mathematical Techniques in Pattern Recognition, Wiley-Interscience, New York, 1972.
