
PCA

in
Face Recognition
Berrin Yanikoglu

FENS
Computer Science & Engineering
Sabancı University


Some slides from Derek Hoiem, Lana Lazebnik, Silvio Savarese, Fei-Fei Li
Overview
Face recognition, verification, tracking
Feature subspaces: PCA and FLD
Results from vendor tests
Interesting findings about human face recognition

Face detection and recognition
(Figure) Detection: locate the faces in the image. Recognition: identify them (e.g., "Sally").
Applications of Face Recognition
Surveillance
Digital photography
Album organization
Consumer application: iPhoto 2009
Can be trained to recognize pets!
http://www.maclife.com/article/news/iphotos_faces_recognizes_cats
Consumer application: iPhoto 2009

Error measure
Face Detection/Verification
False Positives (%)
False Negatives (%)

Face Recognition
Top-N rates (%)
Open/closed set problems

Sources of variations:
with / without glasses
3 lighting conditions
5 expressions
Starting idea of eigenfaces
1. Treat the pixels of a face image as a vector x
2. Recognize a face by nearest neighbor among the training faces y_1, ..., y_n:
k* = argmin_k || x − y_k ||
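A minimal sketch of this nearest-neighbor idea in NumPy; the array sizes, names, and random data below are illustrative assumptions, not the slides' data:

    import numpy as np

    # Hypothetical data: 50 training faces of size 100x100, flattened to vectors
    rng = np.random.default_rng(0)
    train_faces = rng.random((50, 100 * 100))   # y_1 ... y_n as rows
    train_labels = np.arange(50)                # one identity per training face
    x = rng.random(100 * 100)                   # query face as a pixel vector

    # Nearest neighbor: k* = argmin_k ||x - y_k||
    dists = np.linalg.norm(train_faces - x, axis=1)
    k_star = np.argmin(dists)
    print("predicted identity:", train_labels[k_star])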
The space of face images
When viewed as vectors of pixel values, face images
are extremely high-dimensional
100x100 image = 10,000 dimensions
Large memory and computational requirements
But very few 10,000-dimensional vectors are valid
face images
We want to reduce dimensionality and effectively
model the subspace of face images
Principal Component Analysis (PCA)
Pattern recognition in high-dimensional spaces
Problems arise when performing recognition in a high-dimensional space
(curse of dimensionality).
Significant improvements can be achieved by first mapping the data into a
lower-dimensional sub-space:

x = [x_1, x_2, ..., x_N]^T   →  (dimensionality reduction)  →   z = [z_1, z_2, ..., z_K]^T

where K << N.
The goal of PCA is to reduce the dimensionality of the data while retaining
as much as possible of the variation present in the original dataset.
Change of basis
The same point p can be written with respect to different bases.
In the standard basis (x_1, x_2):
p = [3 3]^T = 3·[1 0]^T + 3·[0 1]^T
In a new basis (z_1, z_2) with basis vectors [1 1]^T and [−1 1]^T:
p = 3·[1 1]^T + 0·[−1 1]^T
Dimensionality reduction
In the new basis, a point q is written exactly using both basis vectors q_1 = [1 1]^T and q_2 = [−1 1]^T:
q = z_1 q_1 + z_2 q_2
Keeping only the first coordinate gives a 1-D approximation:
q̂ = b_1 q_1
Error: || q − q̂ ||
Principal Component Analysis (PCA)
PCA allows us to compute a linear transformation that maps data from a
high-dimensional space to a lower-dimensional sub-space:
z_1 = u_11 x_1 + u_12 x_2 + ... + u_1N x_N
z_2 = u_21 x_1 + u_22 x_2 + ... + u_2N x_N
...
z_K = u_K1 x_1 + u_K2 x_2 + ... + u_KN x_N
In short, z = W x, where x = [x_1, x_2, ..., x_N]^T and
W = | u_11  u_12  ...  u_1N |
    | u_21  u_22  ...  u_2N |
    | ...                   |
    | u_K1  u_K2  ...  u_KN |
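As a sketch, the projection z = W x is a single matrix-vector product. Here W is an arbitrary K x N matrix standing in for the eigenvector matrix derived later; the sizes and random values are assumptions:

    import numpy as np

    N, K = 10_000, 50                 # original and reduced dimensionality (assumed)
    rng = np.random.default_rng(0)
    W = rng.standard_normal((K, N))   # rows u_1 ... u_K (placeholder for eigenvectors)
    x = rng.standard_normal(N)        # one data vector

    z = W @ x                         # z_i = sum_j u_ij * x_j, i.e. z = W x
    print(z.shape)                    # (50,)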
Principal Component Analysis (PCA)
Lower dimensionality basis
Approximate vectors by finding a basis in an appropriate lower dimensional
space.
(1) Higher-dimensional space representation:
x = x_1 v_1 + x_2 v_2 + ... + x_N v_N
where v_1, ..., v_N are the basis vectors of the N-dimensional space
(2) Lower-dimensional space representation:
x̂ = z_1 u_1 + z_2 u_2 + ... + z_K u_K
where u_1, ..., u_K are the basis vectors of the K-dimensional sub-space
Note: If N = K, then x̂ = x
Illustration for projection, variance and bases
(Figure: data points in the x_1–x_2 plane with the principal axes z_1, z_2.)
Principal Component Analysis (PCA)
Dimensionality reduction implies information loss!
Want to preserve as much information as possible, that is:

How to determine the best lower dimensional sub-space?
Principal Components Analysis (PCA)
The projection of x onto the direction of u is: z = u^T x
Find u such that Var(z) is maximized:
Var(z) = Var(u^T x)
       = E[(u^T x − u^T μ)(u^T x − u^T μ)^T]
       = E[(u^T x − u^T μ)^2]               // since (u^T x − u^T μ) is a scalar
       = E[(u^T x − u^T μ)(u^T x − u^T μ)]
       = E[u^T (x − μ)(x − μ)^T u]
       = u^T E[(x − μ)(x − μ)^T] u
       = u^T Σ u
where Σ = E[(x − μ)(x − μ)^T] (covariance of x)
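The identity Var(u^T x) = u^T Σ u can be checked numerically; a small sketch with synthetic data (all values assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[1.0, -2.0],
                                cov=[[3.0, 1.2], [1.2, 2.0]], size=100_000)

    u = np.array([0.6, 0.8])                  # a unit vector (||u|| = 1)
    z = X @ u                                 # projections u^T x

    Sigma = np.cov(X, rowvar=False)           # sample covariance of x
    print(z.var(ddof=1))                      # Var(z) estimated directly
    print(u @ Sigma @ u)                      # u^T Sigma u, nearly identical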
Principal Component Analysis
Direction that maximizes the variance of the projected data:
maximize  (1/N) Σ_{i=1}^{N} (u^T x_i − u^T μ)^2 = u^T Σ u
subject to ||u|| = 1
(u^T x_i: projection of data point x_i;  Σ: covariance matrix of the data)
Maximize Var(z) = u^T Σ u subject to ||u|| = 1.
Using a Lagrange multiplier α:
max_{u_1}  u_1^T Σ u_1 − α (u_1^T u_1 − 1)
Taking the derivative w.r.t. u_1 and setting it equal to 0, we get:
Σ u_1 = α u_1
so u_1 is an eigenvector of Σ.
Choose the eigenvector with the largest eigenvalue for Var(z) to be maximum.
Second principal component: maximize Var(z_2), s.t. ||u_2|| = 1 and u_2 orthogonal to u_1:
max_{u_2}  u_2^T Σ u_2 − α (u_2^T u_2 − 1) − β (u_2^T u_1 − 0)
(α, β: Lagrange multipliers)
A similar analysis shows that Σ u_2 = α u_2, i.e. u_2 is another eigenvector of Σ, and so on.

Maximize Var(z) = u^T Σ u.
Consider the eigenvectors of Σ, for which
Σ u = λ u
where u is an eigenvector of Σ and λ is the corresponding eigenvalue.
Multiplying by u^T:
u^T Σ u = λ u^T u = λ    for ||u|| = 1
Choose the eigenvector with the largest eigenvalue.
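A sketch of this result: take the eigenvector of Σ with the largest eigenvalue and compare the projected variance with a few random unit directions (synthetic data, names assumed):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=50_000)
    Sigma = np.cov(X, rowvar=False)

    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh: symmetric matrix, ascending order
    u1 = eigvecs[:, -1]                        # eigenvector with the largest eigenvalue

    print("lambda_max        :", eigvals[-1])
    print("Var along u1      :", (X @ u1).var(ddof=1))
    for _ in range(3):                         # any other unit direction does worse
        v = rng.standard_normal(2)
        v /= np.linalg.norm(v)
        print("Var along random v:", (X @ v).var(ddof=1))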
Principal Component Analysis (PCA)
Given: N data points x_1, ..., x_N in R^d
We want to find a new set of features that are
linear combinations of the original ones:
u(x_i) = u^T (x_i − μ)
(μ: mean of the data points)
Choose a unit vector u in R^d that captures the most data variance.
Forsyth & Ponce, Sec. 22.3.1, 22.3.2
What PCA does
The transformation z = W^T (x − μ),
where the columns of W are the eigenvectors of Σ and μ is the sample mean,
centers the data at the origin and rotates the axes.
The covariance matrix is symmetric and it can always be diagonalized as:
Σ = W Λ W^T
where
W = [u_1, u_2, ..., u_l] is the column matrix consisting of the eigenvectors of Σ,
W^T = W^(−1), and
Λ is the diagonal matrix whose elements are the eigenvalues of the covariance matrix Σ.
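This diagonalization can be verified directly with NumPy; a small sketch with arbitrary synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 3)) @ np.array([[2.0, 0.3, 0.0],
                                                   [0.0, 1.0, 0.5],
                                                   [0.0, 0.0, 0.2]])
    Sigma = np.cov(X, rowvar=False)

    lam, W = np.linalg.eigh(Sigma)        # columns of W: eigenvectors, lam: eigenvalues
    Lambda = np.diag(lam)

    print(np.allclose(W @ Lambda @ W.T, Sigma))   # Sigma = W Lambda W^T
    print(np.allclose(W.T, np.linalg.inv(W)))     # W^T = W^(-1) (orthogonal)

    # z = W^T (x - mu) centers and rotates the data: its covariance is Lambda
    Z = (X - X.mean(axis=0)) @ W
    print(np.allclose(np.cov(Z, rowvar=False), Lambda))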
Principal Component Analysis (PCA)
Methodology
Suppose x_1, x_2, ..., x_M are N x 1 vectors.
Principal Component Analysis (PCA)
Methodology cont.
Principal Component Analysis (PCA)
Linear transformation implied by PCA
The linear transformation R^N → R^K that performs the dimensionality
reduction is:
z = W^T (x − μ)   (the rows of W^T are the top K eigenvectors of Σ)
Principal Component Analysis (PCA)
How to choose the number of principal components k?
To choose k, use the following criterion:
Proportion of Variance (PoV) explained
PoV = (λ_1 + λ_2 + ... + λ_k) / (λ_1 + λ_2 + ... + λ_k + ... + λ_d)
where the eigenvalues λ_i are sorted in descending order.
Typically, stop at PoV > 0.9.
A scree graph plots PoV vs. k; stop at the elbow.
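A sketch of this criterion in NumPy; the 0.9 threshold follows the slide, and the data is synthetic:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 20)) * np.linspace(5, 0.1, 20)   # synthetic data
    Sigma = np.cov(X, rowvar=False)

    lam = np.linalg.eigvalsh(Sigma)[::-1]      # eigenvalues, sorted descending
    pov = np.cumsum(lam) / np.sum(lam)         # PoV(k) for k = 1 ... d

    k = int(np.searchsorted(pov, 0.9) + 1)     # smallest k with PoV above 0.9
    print("chosen k:", k)
    print("PoV(k):", pov[k - 1])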
Principal Component Analysis (PCA)
What is the error due to dimensionality reduction?
We saw above that an original vector x can be reconstructed using its
principal components:
It can be shown that the low-dimensional basis based on principal
components minimizes the reconstruction error:
It can be shown that the error is equal to:
Principal Component Analysis (PCA)
Standardization
The principal components are dependent on the units used to measure the
original variables as well as on the range of values they assume.
We should always standardize the data prior to using PCA.
A common standardization method is to transform each variable to have zero
mean and unit standard deviation:
x_j ← (x_j − μ_j) / σ_j   (μ_j, σ_j: mean and standard deviation of variable j)
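A sketch of this standardization step, assuming the data is stored as a rows-are-samples matrix X (values are illustrative):

    import numpy as np

    def standardize(X):
        """Transform each column (variable) to zero mean and unit standard deviation."""
        mu = X.mean(axis=0)
        sigma = X.std(axis=0, ddof=1)
        return (X - mu) / sigma

    rng = np.random.default_rng(0)
    X = rng.normal(loc=[10.0, -3.0], scale=[100.0, 0.5], size=(1000, 2))
    Xs = standardize(X)
    print(Xs.mean(axis=0).round(6), Xs.std(axis=0, ddof=1).round(6))  # ~[0, 0], [1, 1]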
Eigenface Implementation
Computation of the eigenfaces
Computation of the eigenfaces cont.
Computation of the eigenfaces cont.
Computation of the eigenfaces cont.
Face Recognition Using Eigenfaces
Face Recognition Using Eigenfaces cont.
The distance e_r can be computed using the common Euclidean distance;
however, it has been reported that the Mahalanobis distance
performs better:
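The slide's exact formula is not shown here; in the eigenface literature the Mahalanobis distance in face space is commonly taken as a sum of squared coefficient differences weighted by the inverse eigenvalues, since the covariance in that space is diagonal. A sketch under that assumption (coefficient vectors and eigenvalues are made up):

    import numpy as np

    def euclidean(w1, w2):
        return np.linalg.norm(w1 - w2)

    def mahalanobis_facespace(w1, w2, eigvals):
        """Mahalanobis distance in the eigenface space.

        The covariance there is diagonal with eigenvalues lambda_i,
        so d^2 = sum_i (w1_i - w2_i)^2 / lambda_i.  (Assumed form, not from the slide.)
        """
        d = w1 - w2
        return np.sqrt(np.sum(d * d / eigvals))

    # Illustrative coefficient vectors and eigenvalues
    w_query = np.array([2.0, -1.0, 0.5])
    w_train = np.array([1.5, -0.8, 1.5])
    lam = np.array([10.0, 4.0, 0.5])
    print(euclidean(w_query, w_train), mahalanobis_facespace(w_query, w_train, lam))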
Eigenfaces example
Training images: x_1, ..., x_N
Eigenfaces example
Top eigenvectors: u_1, ..., u_k
Mean: μ
Visualization of eigenfaces
Principal component (eigenvector) u_k
μ + 3 σ_k u_k
μ − 3 σ_k u_k
Representation and reconstruction
Face x in face space coordinates:
x → [u_1^T(x − μ), ..., u_k^T(x − μ)] = [w_1, ..., w_k]
Reconstruction:
x̂ = μ + w_1 u_1 + w_2 u_2 + w_3 u_3 + w_4 u_4 + ...
Reconstruction
Reconstructions using P = 4, P = 200, and P = 400 eigenfaces.
Eigenfaces are computed using the 400 face images from the ORL
face database. The size of each image is 92x112 pixels (x has ~10K
dimensions).
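A sketch of reconstruction from the first P coefficients, assuming U holds the eigenfaces as columns and mu is the mean face; the random orthonormal basis and vectors below are stand-ins, not the ORL data:

    import numpy as np

    rng = np.random.default_rng(0)
    D, P_total = 92 * 112, 400                                # ORL-like sizes from the slide
    U = np.linalg.qr(rng.standard_normal((D, P_total)))[0]    # orthonormal stand-in "eigenfaces"
    mu = rng.random(D)                                        # stand-in for the mean face
    x = rng.random(D)                                         # a face image as a vector

    def reconstruct(x, U, mu, P):
        w = U[:, :P].T @ (x - mu)        # coefficients w_1 ... w_P
        return mu + U[:, :P] @ w         # x_hat = mu + sum_i w_i u_i

    for P in (4, 200, 400):
        err = np.linalg.norm(x - reconstruct(x, U, mu, P))
        print(f"P = {P:3d}  reconstruction error = {err:.3f}")

On random data the error only shrinks slowly with P; on real face images it drops much faster, because faces lie close to a low-dimensional subspace.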
Recognition with eigenfaces
Process labeled training images:
Find the mean μ and covariance matrix Σ
Find the k principal components (eigenvectors of Σ): u_1, ..., u_k
Project each training image x_i onto the subspace spanned by the
principal components:
(w_i1, ..., w_ik) = (u_1^T(x_i − μ), ..., u_k^T(x_i − μ))

Given a novel image x:
Project onto the subspace:
(w_1, ..., w_k) = (u_1^T(x − μ), ..., u_k^T(x − μ))
Optional: check the reconstruction error ||x − x̂|| to determine
whether the image is really a face
Classify as the closest training face in the k-dimensional subspace

M. Turk and A. Pentland, Face Recognition using Eigenfaces, CVPR 1991
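Putting the steps above together, a compact eigenface-recognition sketch in NumPy; the toy random "images", array names, and k are assumptions (the SVD of the centered data yields the same top eigenvectors as the covariance matrix, without forming the D x D matrix):

    import numpy as np

    def train_eigenfaces(X_train, k):
        """X_train: (M, D) matrix, one flattened face image per row."""
        mu = X_train.mean(axis=0)
        Xc = X_train - mu
        # Right singular vectors of Xc are the eigenvectors of the covariance
        _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        U = Vt[:k].T                     # (D, k): top-k eigenfaces as columns
        W_train = Xc @ U                 # (M, k): training coefficients
        return mu, U, W_train

    def recognize(x, mu, U, W_train, labels):
        w = U.T @ (x - mu)               # project the novel image
        return labels[np.argmin(np.linalg.norm(W_train - w, axis=1))]

    # Toy example with random "images"
    rng = np.random.default_rng(0)
    X_train = rng.random((40, 92 * 112))
    labels = np.repeat(np.arange(10), 4)                     # 10 people, 4 images each
    mu, U, W_train = train_eigenfaces(X_train, k=20)
    print(recognize(X_train[7], mu, U, W_train, labels))     # -> 1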
PCA

General dimensionality reduction technique

Preserves most of variance with a much more
compact representation
Lower storage requirements (eigenvectors + a few
numbers per face)
Faster matching

What are the problems for face recognition?

Limitations
Global appearance method:
not robust at all to misalignment
Not very robust to background variation, scale
Principal Component Analysis (PCA)
Problems
Background (de-emphasize the outside of the face, e.g., by
multiplying the input image by a 2D Gaussian window centered on
the face)
Lighting conditions (performance degrades with light changes)
Scale (performance decreases quickly with changes to head size)
multi-scale eigenspaces
scale input image to multiple sizes
Orientation (performance decreases but not as fast as with scale
changes)
plane rotations can be handled
out-of-plane rotations are more difficult to handle
Fisherfaces
Limitations
The direction of maximum variance is not
always good for classification
A more discriminative subspace: FLD
Fisher Linear Discriminants (Fisherfaces)

PCA preserves maximum variance

FLD preserves discrimination
Find projection that maximizes scatter between
classes and minimizes scatter within classes
Reference: Eigenfaces vs. Fisherfaces, Belhumeur et al., PAMI 1997
Illustration of the Projection
Using two classes as an example (figure, x1-x2 plane): a poor projection mixes the
two classes; a good projection separates them.
Comparing with PCA
Variables
N sample images: {x_1, ..., x_N}
c classes: {χ_1, ..., χ_c}
Average of each class: μ_i = (1/N_i) Σ_{x_k ∈ χ_i} x_k
Average of all data: μ = (1/N) Σ_{k=1}^{N} x_k
Scatter Matrices
Scatter of class i:
S_i = Σ_{x_k ∈ χ_i} (x_k − μ_i)(x_k − μ_i)^T
Within class scatter:
S_W = Σ_{i=1}^{c} S_i
Between class scatter:
S_B = Σ_{i=1}^{c} N_i (μ_i − μ)(μ_i − μ)^T
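These scatter matrices translate directly into NumPy; a sketch assuming X is an (N, d) data matrix with integer class labels y (the toy clusters are made up):

    import numpy as np

    def scatter_matrices(X, y):
        """Within-class scatter S_W and between-class scatter S_B."""
        d = X.shape[1]
        mu = X.mean(axis=0)                      # overall mean
        S_W = np.zeros((d, d))
        S_B = np.zeros((d, d))
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            S_W += (Xc - mu_c).T @ (Xc - mu_c)   # sum over classes of S_i
            S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
        return S_W, S_B

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 1.0, size=(30, 2)) for m in ([0, 0], [4, 1], [2, 5])])
    y = np.repeat([0, 1, 2], 30)
    S_W, S_B = scatter_matrices(X, y)
    print(S_W.shape, S_B.shape)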
Illustration
(Figure, x1-x2 plane): class scatters S_1 and S_2,
within class scatter S_W = S_1 + S_2, and between class scatter S_B.
Mathematical Formulation
After projection: y_k = W^T x_k
Between class scatter (after projection): S̃_B = W^T S_B W
Within class scatter (after projection): S̃_W = W^T S_W W
Objective:
W_opt = argmax_W |S̃_B| / |S̃_W| = argmax_W |W^T S_B W| / |W^T S_W W|
Solution: generalized eigenvectors
S_B w_i = λ_i S_W w_i,   i = 1, ..., m
Rank of W_opt is limited:
Rank(S_B) <= |C| − 1
Rank(S_W) <= N − C
Recognition with FLD
Compute the within-class and between-class scatter matrices:
S_i = Σ_{x_k ∈ χ_i} (x_k − μ_i)(x_k − μ_i)^T
S_W = Σ_{i=1}^{c} S_i
S_B = Σ_{i=1}^{c} N_i (μ_i − μ)(μ_i − μ)^T
Solve the generalized eigenvector problem:
W_opt = argmax_W |W^T S_B W| / |W^T S_W W|
S_B w_i = λ_i S_W w_i,   i = 1, ..., m
Project to the FLD subspace and classify by nearest neighbor:
x̃ = W_opt^T x
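A sketch of this FLD projection using scipy.linalg.eigh to solve the generalized eigenproblem S_B w = λ S_W w; the toy 2-D clusters are assumptions, and on real face images S_W is singular, so in practice PCA is applied first (as in the Fisherface approach of Belhumeur et al.):

    import numpy as np
    from scipy.linalg import eigh

    def fld(X, y, n_components):
        """Fisher Linear Discriminant: projection matrix W_opt with n_components columns."""
        d = X.shape[1]
        mu = X.mean(axis=0)
        S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
        for c in np.unique(y):                       # scatter matrices as defined above
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            S_W += (Xc - mu_c).T @ (Xc - mu_c)
            S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
        # Generalized eigenproblem S_B w = lambda S_W w; keep the largest eigenvalues
        eigvals, eigvecs = eigh(S_B, S_W)
        return eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 1.0, size=(30, 2)) for m in ([0, 0], [4, 1], [2, 5])])
    y = np.repeat([0, 1, 2], 30)
    W_opt = fld(X, y, n_components=2)
    X_fld = X @ W_opt          # project, then classify by nearest neighbor in this subspace
    print(X_fld.shape)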

Results: Eigenface vs. Fisherface


Variation in Facial Expression, Eyewear, and Lighting
Input: 160 images of 16 people
Train: 159 images; Test: 1 image
With / without glasses, 3 lighting conditions, 5 expressions
Reference: Eigenfaces vs. Fisherfaces, Belhumeur et al., PAMI 1997
Eigenfaces vs. Fisherfaces
Reference: Eigenfaces vs. Fisherfaces, Belhumeur et al., PAMI 1997
1. Identify a Specific Instance
Frontal faces
Typical scenario: few examples per face,
identify or verify test example
What's hard: changes in expression, lighting,
age, occlusion, viewpoint
Basic approaches (all nearest neighbor):
1. Project into a new subspace (or kernel space)
(e.g., Eigenfaces = PCA)
2. Measure face features
3. Make a 3d face model, compare shape + appearance (e.g.,
AAM)


Large scale comparison of methods
FRVT 2006 Report
Not much (or any) information is available about the
methods, but it gives an idea of what is doable

FRVT Challenge
Frontal faces
FRVT 2006 evaluation
False Rejection Rate at False Acceptance Rate = 0.001
FRVT Challenge
Frontal faces
FRVT 2006 evaluation: controlled illumination


FRVT Challenge
Frontal faces
FRVT 2006 evaluation: uncontrolled illumination


FRVT Challenge
Frontal faces
FRVT 2006 evaluation: computers win!


Face recognition by humans


Face recognition by humans: 20 results (2005)

Slides by Jianchao Yang








Result 17: Vision progresses from
piecemeal to holistic

Things to remember

PCA is a generally useful dimensionality reduction
technique
But not ideal for discrimination

FLD better for discrimination, though only ideal
under Gaussian data assumptions

Computer face recognition works very well under
controlled environments; there is still room for improvement
in general conditions
Normal Distribution and Covariance Matrix
Assume we have a d-dimensional input (e.g. 2D), x.
We will see how we can characterize p(x), assuming it is
normally distributed.
For 1 dimension it was the mean (μ) and variance (σ^2):
Mean = E[x]
Variance = E[(x − μ)^2]
For d dimensions, we need
the d-dimensional mean vector
the d x d dimensional covariance matrix
If x ~ N_d(μ, Σ) then each dimension of x is univariate normal
(the converse is not true)


Normal Distribution &
Multivariate Normal Distribution
For a single variable, the normal density function is:
p(x) = 1 / (sqrt(2π) σ) · exp( −(x − μ)^2 / (2σ^2) )
For variables in higher dimensions, this generalizes to:
p(x) = 1 / ( (2π)^(d/2) |Σ|^(1/2) ) · exp( −(1/2)(x − μ)^T Σ^(−1) (x − μ) )
where the mean μ is now a d-dimensional vector, Σ is a d x d covariance matrix,
and |Σ| is the determinant of Σ.
Multivariate Parameters: Mean, Covariance
Mean: E[x] = μ = [μ_1, ..., μ_d]^T
Covariance: Σ = Cov(x) = E[(x − μ)(x − μ)^T] = E[x x^T] − μ μ^T
Σ = | σ_1^2   σ_12   ...   σ_1d  |
    | σ_21    σ_2^2  ...   σ_2d  |
    | ...                        |
    | σ_d1    σ_d2   ...   σ_d^2 |
Cov(x_i, x_j) ≡ σ_ij = E[(x_i − μ_i)(x_j − μ_j)]
Variance: how much X varies around the expected value.
Covariance measures the strength of the linear relationship between two random variables:
covariance becomes more positive for each pair of values which differ from their mean in the same direction
covariance becomes more negative for each pair of values which differ from their mean in opposite directions
if two variables are independent, then their covariance/correlation is zero (the converse is not true)
Covariance: Cov(x_i, x_j) ≡ σ_ij = E[(x_i − μ_i)(x_j − μ_j)]
Correlation: Corr(x_i, x_j) ≡ ρ_ij = σ_ij / (σ_i σ_j)
Correlation is a dimensionless measure of linear dependence; it ranges between −1 and +1.
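These quantities map directly onto np.cov and np.corrcoef; a small sketch with synthetic, positively related variables (values assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=10_000)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=10_000)   # positively related to x1

    Sigma = np.cov(x1, x2)          # 2x2 matrix [[var(x1), cov], [cov, var(x2)]]
    R = np.corrcoef(x1, x2)         # correlation: cov / (sigma_1 * sigma_2), in [-1, +1]
    print(Sigma)
    print(R)
    print(Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1]))   # matches R[0, 1]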

Covariance Matrices
Contours of constant probability density for 2D Gaussian distributions with
a) general covariance matrix
b) diagonal covariance matrix (covariance of x1, x2 is 0)
c) Σ proportional to the identity matrix (covariances are 0, variances of each
dimension are the same)
The shape and orientation of the hyper-ellipsoid centered at μ is defined by Σ
(its axes lie along the eigenvectors v_1, v_2).
The projection of a d-dimensional normal distribution
onto the vector w is univariate normal:
w^T x ~ N(w^T μ, w^T Σ w)
More generally, when W is a d x k matrix with k < d,
then the k-dimensional W^T x is k-variate normal with:
W^T x ~ N_k(W^T μ, W^T Σ W)
where the rows of W^T are the individual projection vectors:
W^T = [W_1^T; W_2^T; ...; W_k^T]
Eigenvalue Decomposition of the Covariance Matrix
Eigenvector and Eigenvalue
Given a linear transformation A, a non-zero vector x is
defined to be an eigenvector of the transformation if it
satisfies the eigenvalue equation
A x = λ x for some scalar λ.
In this situation, the scalar λ is called an eigenvalue of A
corresponding to the eigenvector x.
(A − λI) x = 0  =>  det(A − λI) must be 0.
This gives the characteristic polynomial, whose roots are the
eigenvalues of A.
The covariance matrix is symmetric and it can always be
diagonalized as:
Σ = Φ Λ Φ^T
where
Φ = [u_1, u_2, ..., u_l] is the column matrix consisting of the eigenvectors of Σ,
Φ^T = Φ^(−1), and
Λ is the diagonal matrix whose elements are the eigenvalues of the covariance matrix Σ.
If x ~ N_d(μ, Σ) then each dimension of x is univariate
normal
The converse is not true (counterexample?)
The projection of a d-dimensional normal distribution
onto the vector w is univariate normal:
w^T x ~ N(w^T μ, w^T Σ w)
More generally, when W is a d x k matrix with k < d,
then the k-dimensional W^T x is k-variate normal with:
W^T x ~ N_k(W^T μ, W^T Σ W)
Univariate normal transformed x
Proof for the first one:
E[w^T x] = w^T E[x] = w^T μ
Var[w^T x] = E[(w^T x − w^T μ)(w^T x − w^T μ)^T]
           = E[(w^T x − w^T μ)^2]              since (w^T x − w^T μ) is a scalar
           = E[(w^T x − w^T μ)(w^T x − w^T μ)]
           = E[w^T (x − μ)(x − μ)^T w]         based on (slide 9, rule 5)
           = w^T Σ w                           move w out and use the definition of Σ
Similarly, projecting onto a d x k matrix A (computing A^T x) results in
mean A^T μ and covariance A^T Σ A.
Whitening Transform
Given the previous, we can construct a transformation matrix A_w to
decorrelate the variables and obtain Σ = I,
i.e. compute A_w (x − μ) to obtain zero-mean and unit-variance data.
This matrix is called the whitening transform:
A_w = Λ^(−1/2) Φ^T
where
Λ is the diagonal matrix of the eigenvalues of the original distribution, and
Φ is the matrix composed of the eigenvectors as its columns.
See that the resulting covariance matrix, A_w Σ A_w^T, is indeed
the identity matrix I (has decorrelated dimensions):
From the previous slides, A_w x has covariance matrix A_w Σ A_w^T:
A_w Σ A_w^T = (Λ^(−1/2) Φ^T)(Φ Λ Φ^T)(Λ^(−1/2) Φ^T)^T
            = Λ^(−1/2) (Φ^T Φ) Λ (Φ^T Φ) (Λ^(−1/2))^T
            = Λ^(−1/2) (I) Λ (I) (Λ^(−1/2))^T        since Φ^T = Φ^(−1)
            = Λ^(−1/2) (I) Λ (I) Λ^(−1/2)            since Λ^(−1/2) is symmetric
            = Λ^(−1/2) Λ Λ^(−1/2)
            = I
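A sketch of the whitening transform A_w = Λ^(−1/2) Φ^T with a numerical check that the whitened data has (approximately) identity covariance; the 2-D Gaussian sample is an assumption:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([2.0, -1.0], [[4.0, 1.5], [1.5, 1.0]], size=100_000)

    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    lam, Phi = np.linalg.eigh(Sigma)             # Sigma = Phi diag(lam) Phi^T

    A_w = np.diag(lam ** -0.5) @ Phi.T           # whitening transform Lambda^(-1/2) Phi^T
    Z = (X - mu) @ A_w.T                         # apply A_w (x - mu) to every sample

    print(np.cov(Z, rowvar=False).round(6))      # ~ identity matrix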
