Feature Reduction
Università di Genova
Dipartimento di Ingegneria
Biofisica ed Elettronica
Complexity of a Classifier
As the number n of features increases, the design of the classifier
runs into different issues connected to the dimensionality of the
problem (curse of dimensionality):
Computational complexity;
Hughes phenomenon.
Computational complexity
Hughes Phenomenon
Intuitive reasoning
Experimental observation
Interpretation
Feature Reduction
A solution to these dimensionality issues is to reduce the
number n of features used in the classification process
(feature reduction or parameter reduction).
Feature Selection
Problem setting:
Bhattacharyya Bounds
A choice of the functional that is significant from the
classification point of view can be based on the criterion of the
minimum error probability Pe.
$$u = \sqrt{P_1 P_2}\int \sqrt{p(\mathbf{x}\,|\,\omega_1)\,p(\mathbf{x}\,|\,\omega_2)}\;d\mathbf{x}$$
$$u^2 \;\le\; \frac{1}{2}\left(1-\sqrt{1-4u^2}\,\right) \;\le\; P_e \;\le\; u$$
Bhattacharyya distance computed on a feature subset S:
$$B(S) = -\ln \int \sqrt{p(\mathbf{x}_S\,|\,\omega_1)\,p(\mathbf{x}_S\,|\,\omega_2)}\;d\mathbf{x}_S$$
Properties:
Additive property: if the features in S are class-conditionally independent, i.e., $p(\mathbf{x}_S\,|\,\omega_i) = \prod_{x_r \in S} p(x_r\,|\,\omega_i)$, $i = 1, 2$, then
$$B(S) = \sum_{x_r \in S} B(\{x_r\})$$
Gaussian case: if $p(\mathbf{x}_S\,|\,\omega_i) = \mathcal{N}(\mathbf{m}_i, \Sigma_i)$, $i = 1, 2$, then
$$B(S) = \frac{1}{8}\,(\mathbf{m}_2-\mathbf{m}_1)^t\left[\frac{\Sigma_1+\Sigma_2}{2}\right]^{-1}(\mathbf{m}_2-\mathbf{m}_1) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1+\Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}$$
10
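In the Gaussian case, B(S) and the resulting bounds on Pe can be evaluated directly from the class means and covariances. The following NumPy sketch is not part of the original slides; the function names bhattacharyya_gauss and error_bounds are illustrative.

```python
import numpy as np

def bhattacharyya_gauss(m1, m2, S1, S2):
    """Bhattacharyya distance between N(m1, S1) and N(m2, S2)."""
    Sm = 0.5 * (S1 + S2)                              # (Sigma1 + Sigma2) / 2
    d = m2 - m1
    term_mean = 0.125 * d @ np.linalg.solve(Sm, d)    # 1/8 (m2-m1)^t Sm^-1 (m2-m1)
    logdet = lambda A: np.linalg.slogdet(A)[1]
    term_cov = 0.5 * (logdet(Sm) - 0.5 * (logdet(S1) + logdet(S2)))
    return term_mean + term_cov

def error_bounds(B, P1=0.5, P2=0.5):
    """Bounds u^2 <= (1 - sqrt(1 - 4u^2))/2 <= Pe <= u, with u = sqrt(P1 P2) exp(-B)."""
    u = np.sqrt(P1 * P2) * np.exp(-B)
    lower = 0.5 * (1.0 - np.sqrt(max(0.0, 1.0 - 4.0 * u * u)))
    return lower, u

# Example with two 2-D Gaussian classes (arbitrary illustrative statistics)
B = bhattacharyya_gauss(np.zeros(2), np.array([2.0, 1.0]), np.eye(2), np.diag([2.0, 0.5]))
print(B, error_bounds(B))
```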
Multiclass Extension
Extension to the case of M classes $\omega_1, \omega_2, \ldots, \omega_M$:
$$u_{ave}(S) = \sum_{i=1}^{M-1}\sum_{j=i+1}^{M} P_i P_j\, u_{ij}(S), \qquad B_{ave}(S) = \sum_{i=1}^{M-1}\sum_{j=i+1}^{M} P_i P_j\, B_{ij}(S)$$
where $u_{ij}(S)$ and $B_{ij}(S)$ denote the bound and the Bhattacharyya distance computed for the pair of classes $\omega_i$, $\omega_j$ on the feature subset S.
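Assuming Gaussian class-conditional pdfs, B_ave(S) can be evaluated by summing the pairwise distances over all class pairs, restricted to the feature subset S. The sketch below reuses the bhattacharyya_gauss function from the previous snippet; the name b_ave is illustrative.

```python
import itertools
import numpy as np

def b_ave(means, covs, priors, S):
    """Average Bhattacharyya distance over the feature subset S (list of indices)."""
    S = list(S)
    total = 0.0
    for i, j in itertools.combinations(range(len(means)), 2):
        Bij = bhattacharyya_gauss(means[i][S], means[j][S],
                                  covs[i][np.ix_(S, S)], covs[j][np.ix_(S, S)])
        total += priors[i] * priors[j] * Bij      # Pi * Pj * Bij(S)
    return total
```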
Remarks
Preliminary observations:
11
initialize S* = ∅;
compute the value of the functional for all the subsets S* ∪ {x_i},
with x_i ∉ S*, and choose the feature x* ∉ S* that corresponds to
the maximum value of the functional on S* ∪ {x_i};
update S* by setting S* = S* ∪ {x*};
continue by iteratively adding one feature at a time until S*
reaches the desired cardinality m or until the value of the
functional stabilizes (reaches saturation); a code sketch of this
procedure is given below.
12
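A minimal sketch of the SFS procedure described above, driven by an arbitrary criterion function (for instance, the b_ave sketch given earlier); the function name sfs is illustrative and the saturation check is omitted for brevity.

```python
def sfs(criterion, n_features, m):
    """Greedy forward selection: add the feature that maximizes criterion(S)."""
    S = []                                     # S* initialized to the empty set
    candidates = set(range(n_features))
    while len(S) < m and candidates:
        scores = {x: criterion(S + [x]) for x in candidates}   # evaluate S* U {x_i}
        best = max(scores, key=scores.get)     # x* with the maximum criterion value
        S.append(best)                         # S* = S* U {x*}
        candidates.remove(best)
    return S

# Usage (class statistics means, covs, priors assumed to be available):
# S_star = sfs(lambda S: b_ave(means, covs, priors, S), n_features=202, m=40)
```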
Remarks on SFS
SFS identifies the best subset that can be obtained by
iteratively adding a single feature at a time.
Advantage
Disadvantage
13
Disadvantages
14
[Block diagram: data set → training set (training samples for each class ω_i) → class-conditional pdfs and class prior probabilities {p(x|ω_i), P_i} → feature selection → selected subset S* → training of the classifier → application of the classifier to the data set.]
15
16
Example
Hyperspectral data set with 202 features and 9 classes.
Map of the ground truth highlighting the training pixels; RGB composition of three of the 202 bands acquired by the sensor.
[Plots: B_ave as a function of the number m of selected features; overall accuracy (OA) as a function of m, with maximum Pc,max = 88.6% for m = 40.]
Feature Extraction
Problem definition:
Remarks
17
Bhattacharyya distance between the two classes in the transformed space Y (Gaussian case):
$$B(Y) = \frac{1}{8}\,\mathrm{tr}\!\left\{\left[\frac{\Sigma_1^Y+\Sigma_2^Y}{2}\right]^{-1}(\mathbf{m}_2^Y-\mathbf{m}_1^Y)(\mathbf{m}_2^Y-\mathbf{m}_1^Y)^t\right\} + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1^Y+\Sigma_2^Y}{2}\right|}{\sqrt{|\Sigma_1^Y|\,|\Sigma_2^Y|}} = B_m(Y) + B_\Sigma(Y)$$
where $B_m(Y)$ is the term related to the difference between the class means and $B_\Sigma(Y)$ is the term related to the difference between the class covariance matrices.
18
19
20
21
Sample means of the two classes in the original space and in the projected space $y = \mathbf{w}^t\mathbf{x}$:
$$\boldsymbol{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}, \qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in E_i} y = \mathbf{w}^t\boldsymbol{\mu}_i, \qquad i = 1, 2$$
$S_i$ is called the scatter matrix of the class $\omega_i$ ($i = 1, 2$):
$$S_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\boldsymbol{\mu}_i)(\mathbf{x}-\boldsymbol{\mu}_i)^t$$
Scatter of the projected samples of $\omega_i$:
$$\tilde{s}_i^{\,2} = \sum_{y\in E_i}(y-\tilde{\mu}_i)^2 = \mathbf{w}^t S_i\,\mathbf{w}, \qquad i = 1, 2$$
Separation between the projected class means:
$$(\tilde{\mu}_1-\tilde{\mu}_2)^2 = \left[\mathbf{w}^t(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)\right]^2 = \mathbf{w}^t(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^t\,\mathbf{w} = \mathbf{w}^t S_b\,\mathbf{w}$$
Criterion to be maximized (Fisher criterion), with $S_w = S_1 + S_2$:
$$J(\mathbf{w}) = \frac{\mathbf{w}^t S_b\,\mathbf{w}}{\mathbf{w}^t S_w\,\mathbf{w}}$$
The maximization leads to the generalized eigenvalue problem, i.e.,
$$(S_b - \lambda S_w)\,\mathbf{w}^* = 0$$
24
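A compact NumPy sketch of the two-class case: since S_b has rank one, the solution of the generalized eigenvalue problem is proportional to S_w^{-1}(μ1 − μ2). The function name fisher_direction is illustrative.

```python
import numpy as np

def fisher_direction(X1, X2):
    """X1, X2: (N_i, n) arrays of training samples of the two classes."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)        # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)        # scatter matrix of class 2
    Sw = S1 + S2                          # within-class scatter matrix
    w = np.linalg.solve(Sw, mu1 - mu2)    # maximizes (w^t Sb w) / (w^t Sw w)
    return w / np.linalg.norm(w)          # projected feature: y = w^t x
```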
Sample means of class $\omega_i$ in the original and transformed ($\mathbf{y} = T\mathbf{x}$) spaces:
$$\boldsymbol{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}, \qquad \tilde{\boldsymbol{\mu}}_i = \frac{1}{N_i}\sum_{\mathbf{y}\in E_i}\mathbf{y} = T\boldsymbol{\mu}_i, \qquad i = 1, 2, \ldots, M$$
Scatter matrices of $\omega_i$ in the original and transformed spaces:
$$S_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\boldsymbol{\mu}_i)(\mathbf{x}-\boldsymbol{\mu}_i)^t, \qquad \tilde{S}_i = \sum_{\mathbf{y}\in E_i}(\mathbf{y}-\tilde{\boldsymbol{\mu}}_i)(\mathbf{y}-\tilde{\boldsymbol{\mu}}_i)^t = T S_i T^t, \qquad i = 1, 2, \ldots, M$$
Within-class scatter matrix in the transformed space:
$$\tilde{S}_w = \sum_{i=1}^{M}\tilde{S}_i = \sum_{i=1}^{M} T S_i T^t = T S_w T^t, \qquad \text{where } S_w = \sum_{i=1}^{M} S_i$$
Between-class scatter matrix in the transformed space ($\boldsymbol{\mu}$ and $\tilde{\boldsymbol{\mu}}$ being the overall means in the two spaces):
$$\tilde{S}_b = \sum_{i=1}^{M} N_i(\tilde{\boldsymbol{\mu}}_i-\tilde{\boldsymbol{\mu}})(\tilde{\boldsymbol{\mu}}_i-\tilde{\boldsymbol{\mu}})^t = T\left[\sum_{i=1}^{M} N_i(\boldsymbol{\mu}_i-\boldsymbol{\mu})(\boldsymbol{\mu}_i-\boldsymbol{\mu})^t\right]T^t = T S_b T^t, \qquad \text{where } S_b = \sum_{i=1}^{M} N_i(\boldsymbol{\mu}_i-\boldsymbol{\mu})(\boldsymbol{\mu}_i-\boldsymbol{\mu})^t$$
Criterion to be maximized with respect to the transformation T:
$$J(T) = \frac{\left|T S_b T^t\right|}{\left|T S_w T^t\right|}$$
27
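A sketch of the multiclass DAFE computation: build S_w and S_b from the labelled training samples and take as rows of T the leading eigenvectors of S_w^{-1} S_b (at most M − 1 of them are meaningful). The function name dafe is illustrative and no regularization of S_w is included.

```python
import numpy as np

def dafe(X, labels, m):
    """X: (N, n) training samples, labels: (N,) class indices, m: extracted features."""
    n = X.shape[1]
    mu = X.mean(axis=0)                                  # overall mean
    Sw, Sb = np.zeros((n, n)), np.zeros((n, n))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter S_i
        d = (mu_c - mu)[:, None]
        Sb += Xc.shape[0] * (d @ d.T)                    # N_i (mu_i - mu)(mu_i - mu)^t
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]                 # largest eigenvalues first
    T = evecs[:, order[:m]].real.T                       # rows of T: leading eigenvectors
    return T                                             # transformed samples: y = T x
```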
The rows of the optimal T are the eigenvectors of $S_w^{-1} S_b$ associated with the largest eigenvalues $\lambda_i$, $i = 1, 2, \ldots, m$.
Remarks
28
29
DAFE: comments
DAFE allows up to (M − 1) transformed features to be linearly
extracted (remember that M is the number of classes).
Operational issues
$$\left|S_b - \lambda S_w\right| = 0$$
Principal Component Analysis (PCA)
Each sample is approximated by an expansion onto m < n orthonormal basis vectors $\mathbf{e}_i$:
$$\hat{\mathbf{x}}_k = \mathbf{c} + \sum_{i=1}^{m} y_{ik}\,\mathbf{e}_i, \qquad k = 1, 2, \ldots, N$$
30
31
Geometric interpretation
Two-dimensional example
[Figure: two-dimensional example showing the center c and the direction e1 with respect to the origin O and the axis x1.]
32
Mean-square representation error:
$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\left\|\hat{\mathbf{x}}_k-\mathbf{x}_k\right\|^2 = \frac{1}{N}\sum_{k=1}^{N}\left\|\mathbf{x}_k-\mathbf{c}-\sum_{i=1}^{m} y_{ik}\,\mathbf{e}_i\right\|^2$$
Expanding and exploiting the orthonormality of the basis vectors:
$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\left[\left\|\mathbf{x}_k-\mathbf{c}\right\|^2 - 2\sum_{i=1}^{m} y_{ik}\,\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c}) + \sum_{i=1}^{m} y_{ik}^2\right]$$
Minimization with respect to $y_{ik}$ gives $y_{ik} = \mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c})$, so that
$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\left[\left\|\mathbf{x}_k-\mathbf{c}\right\|^2 - \sum_{i=1}^{m}\left[\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c})\right]^2\right]$$
Since $\left\|\mathbf{x}_k-\mathbf{c}\right\|^2 = \sum_{i=1}^{n}\left[\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c})\right]^2$:
$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\left[\sum_{i=1}^{n}\left[\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c})\right]^2 - \sum_{i=1}^{m}\left[\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c})\right]^2\right] = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left[\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c})\right]^2 = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left(\mathbf{e}_i^t\mathbf{x}_k-b_i\right)^2$$
where $b_i = \mathbf{e}_i^t\mathbf{c}$.
33
34
Setting to zero the derivative with respect to $b_i$:
$$\frac{\partial \varepsilon^2}{\partial b_i} = \frac{2}{N}\sum_{k=1}^{N}\left(b_i-\mathbf{e}_i^t\mathbf{x}_k\right) = 0$$
Consequently:
$$b_i = \mathbf{e}_i^t\,\frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k = \mathbf{e}_i^t\boldsymbol{\mu}, \qquad \text{where } \boldsymbol{\mu} = \frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k$$
Therefore:
$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left(\mathbf{e}_i^t\mathbf{x}_k-\mathbf{e}_i^t\boldsymbol{\mu}\right)^2 = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left[\mathbf{e}_i^t(\mathbf{x}_k-\boldsymbol{\mu})\right]^2 = \sum_{i=m+1}^{n}\mathbf{e}_i^t\left[\frac{1}{N}\sum_{k=1}^{N}(\mathbf{x}_k-\boldsymbol{\mu})(\mathbf{x}_k-\boldsymbol{\mu})^t\right]\mathbf{e}_i = \sum_{i=m+1}^{n}\mathbf{e}_i^t\,\Sigma\,\mathbf{e}_i$$
where $\Sigma = \frac{1}{N}\sum_{k=1}^{N}(\mathbf{x}_k-\boldsymbol{\mu})(\mathbf{x}_k-\boldsymbol{\mu})^t$ is the sample covariance of the data set.
Through Lagrange multipliers, the minimization with respect to the $\mathbf{e}_i$ subject to $\mathbf{e}_i^t\mathbf{e}_i = 1$ gives:
$$\min_{\mathbf{e}_i}\;\sum_{i=m+1}^{n}\mathbf{e}_i^t\,\Sigma\,\mathbf{e}_i \quad \text{s.t. } \mathbf{e}_i^t\mathbf{e}_i = 1 \qquad\Rightarrow\qquad 2\Sigma\,\mathbf{e}_i - 2\lambda_i\mathbf{e}_i = 0 \;\Leftrightarrow\; (\Sigma - \lambda_i I)\,\mathbf{e}_i = 0$$
$$\varepsilon^2 = \sum_{i=m+1}^{n}\lambda_i\,\mathbf{e}_i^t\mathbf{e}_i = \sum_{i=m+1}^{n}\lambda_i$$
Hence the $\mathbf{e}_i$ are eigenvectors of $\Sigma$, and $\varepsilon^2$ is minimized by discarding the eigenvectors associated with the (n − m) smallest eigenvalues, i.e., by retaining those associated with the m largest eigenvalues.
35
36
37
PCA: remarks
In practice, PCA is applied as follows:
Remarks
The number m* of principal components to be retained can be chosen as the smallest m for which the fraction of retained variance exceeds a prescribed threshold:
$$\frac{\displaystyle\sum_{i=1}^{m}\lambda_i}{\displaystyle\sum_{i=1}^{n}\lambda_i} \;\ge\; \text{threshold}$$
[Plot: this cumulative ratio as a function of m, with threshold levels between 80% and 100% determining m*.]
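A sketch of this procedure: eigendecomposition of the sample covariance, sorting of the eigenvalues, and selection of m* through the cumulative-variance ratio. The names pca_fit and threshold are illustrative.

```python
import numpy as np

def pca_fit(X, threshold=0.95):
    """X: (N, n) data matrix. Returns the mean, the retained eigenvectors and m*."""
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / X.shape[0]          # sample covariance
    evals, evecs = np.linalg.eigh(Sigma)                # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]          # sort in decreasing order
    ratio = np.cumsum(evals) / evals.sum()              # cumulative variance ratio
    m_star = int(np.searchsorted(ratio, threshold)) + 1 # smallest m with ratio >= threshold
    return mu, evecs[:, :m_star], m_star                # columns e_1, ..., e_m*

# Principal components of the samples: Y = (X - mu) @ E
# mu, E, m_star = pca_fit(X)
```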
The components along the axes e1, e2, ..., en are named principal
components. Therefore, one may say that PCA preserves the
first m principal components.
Geometrically, e1 is the direction along which the samples exhibit
the maximum dispersion and en is the direction along which the
sample dispersion is lowest.
Since the transformed features associated with maximum
dispersion are chosen, PCA implicitly assumes that information
is conveyed by the variance of the data (see the figure in slide 31).
38
[Figure: principal directions e1 and e2 of a two-dimensional sample set in the (x1, x2) plane.]
39
Example (1)
Apply PCA to the following samples: (0, 0, 0), (1, 0, 0), (1, 0, 1),
(1, 1, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1).
$$N = 8, \qquad \boldsymbol{\mu} = \frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k = \begin{pmatrix}1/2\\1/2\\1/2\end{pmatrix}, \qquad \Sigma = \frac{1}{N}\sum_{k=1}^{N}(\mathbf{x}_k-\boldsymbol{\mu})(\mathbf{x}_k-\boldsymbol{\mu})^t = \frac{1}{4}\begin{pmatrix}1&0&0\\0&1&0\\0&0&1\end{pmatrix}$$
$$\lambda_1 = \lambda_2 = \lambda_3 = \frac{1}{4}$$
Since all the eigenvalues coincide, any orthonormal basis of $\mathbb{R}^3$ is a valid set of eigenvectors; one admissible choice is
$$\mathbf{e}_1 = \frac{1}{\sqrt{3}}\begin{pmatrix}1\\1\\1\end{pmatrix}, \quad \mathbf{e}_2 = \frac{1}{\sqrt{6}}\begin{pmatrix}1\\1\\-2\end{pmatrix}, \quad \mathbf{e}_3 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\-1\\0\end{pmatrix}, \qquad T = \begin{pmatrix}1/\sqrt{3}&1/\sqrt{3}&1/\sqrt{3}\\[2pt]1/\sqrt{6}&1/\sqrt{6}&-2/\sqrt{6}\\[2pt]1/\sqrt{2}&-1/\sqrt{2}&0\end{pmatrix}$$
40
41
Example (2)
Compute the transformed samples:
Centered samples $\mathbf{x}_k - \boldsymbol{\mu}$ (each component equal to $\pm 1/2$):
$$\begin{pmatrix}-1/2\\-1/2\\-1/2\end{pmatrix},\ \begin{pmatrix}1/2\\-1/2\\-1/2\end{pmatrix},\ \begin{pmatrix}1/2\\-1/2\\1/2\end{pmatrix},\ \begin{pmatrix}1/2\\1/2\\-1/2\end{pmatrix},\ \begin{pmatrix}-1/2\\-1/2\\1/2\end{pmatrix},\ \begin{pmatrix}-1/2\\1/2\\-1/2\end{pmatrix},\ \begin{pmatrix}-1/2\\1/2\\1/2\end{pmatrix},\ \begin{pmatrix}1/2\\1/2\\1/2\end{pmatrix}$$
Transformed samples $\mathbf{y}_k = T(\mathbf{x}_k - \boldsymbol{\mu})$:
$$\mathbf{y}_1 = \begin{pmatrix}-\frac{\sqrt{3}}{2}\\[2pt]0\\[2pt]0\end{pmatrix},\ \mathbf{y}_2 = \begin{pmatrix}-\frac{1}{2\sqrt{3}}\\[2pt]\frac{1}{\sqrt{6}}\\[2pt]\frac{1}{\sqrt{2}}\end{pmatrix},\ \mathbf{y}_3 = \begin{pmatrix}\frac{1}{2\sqrt{3}}\\[2pt]-\frac{1}{\sqrt{6}}\\[2pt]\frac{1}{\sqrt{2}}\end{pmatrix},\ \mathbf{y}_4 = \begin{pmatrix}\frac{1}{2\sqrt{3}}\\[2pt]\frac{2}{\sqrt{6}}\\[2pt]0\end{pmatrix},$$
$$\mathbf{y}_5 = \begin{pmatrix}-\frac{1}{2\sqrt{3}}\\[2pt]-\frac{2}{\sqrt{6}}\\[2pt]0\end{pmatrix},\ \mathbf{y}_6 = \begin{pmatrix}-\frac{1}{2\sqrt{3}}\\[2pt]\frac{1}{\sqrt{6}}\\[2pt]-\frac{1}{\sqrt{2}}\end{pmatrix},\ \mathbf{y}_7 = \begin{pmatrix}\frac{1}{2\sqrt{3}}\\[2pt]-\frac{1}{\sqrt{6}}\\[2pt]-\frac{1}{\sqrt{2}}\end{pmatrix},\ \mathbf{y}_8 = \begin{pmatrix}\frac{\sqrt{3}}{2}\\[2pt]0\\[2pt]0\end{pmatrix}$$
42
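A quick NumPy check of the example, under the eigenvector choice shown above (with all eigenvalues equal, any other orthonormal basis would be equally valid):

```python
import numpy as np

X = np.array([[0, 0, 0], [1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 1, 1]], dtype=float)
mu = X.mean(axis=0)                            # (1/2, 1/2, 1/2)
Sigma = (X - mu).T @ (X - mu) / len(X)         # equals (1/4) I
print(np.allclose(Sigma, 0.25 * np.eye(3)))    # True: eigenvalues 1/4, 1/4, 1/4
T = np.vstack([np.array([1, 1, 1]) / np.sqrt(3),
               np.array([1, 1, -2]) / np.sqrt(6),
               np.array([1, -1, 0]) / np.sqrt(2)])
Y = (X - mu) @ T.T                             # transformed samples y_k = T (x_k - mu)
print(np.round(Y, 3))
```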
Bibliography
L. O. Jimenez and D. A. Landgrebe, "Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data," IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 28, no. 1, pp. 39-54, 1998.
43