
Introduction to Statistical Machine Learning

Christfried Webers
Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University

Canberra
February – June 2013

© 2013 Christfried Webers, NICTA, The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Neural Networks 1
Neural Networks 2
Kernel Methods
Sparse Kernel Methods
Graphical Models 1
Graphical Models 2
Graphical Models 3
Mixture Models and EM 1
Mixture Models and EM 2
Approximate Inference
Sampling
Principal Component Analysis
Sequential Data 1
Sequential Data 2
Combining Models
Selected Topics
Discussion and Summary

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part III
Linear Algebra

Basic Concepts
Linear Transformations
Trace
Inner Product
Projection
Rank, Determinant, Trace
Matrix Inverse
Eigenvectors
Singular Value Decomposition
Directional Derivative, Gradient
Books

Intuition

Geometry
Points and Lines
Vector addition and scaling
Humans have experience with 3 dimensions (less with 1 or 2, though)
Generalisation to $N$ dimensions (possibly $N \to \infty$)
Line $\to$ vector space $V$
Point $\to$ vector $x \in V$
Example: $X \in \mathbb{R}^{n \times m}$
The space of matrices $\mathbb{R}^{n \times m}$ and the space of vectors $\mathbb{R}^{nm}$ are isomorphic.

Vector Space $V$, underlying Field $F$

Given vectors $u, v, w \in V$ and scalars $\lambda, \mu \in F$, the following holds:
Associativity of addition: $u + (v + w) = (u + v) + w$.
Commutativity of addition: $v + w = w + v$.
Identity element of addition: $\exists\, 0$ such that $v + 0 = v$ for all $v \in V$.
Inverse of addition: for all $v \in V$ there exists $w \in V$ such that $v + w = 0$. The additive inverse is denoted $-v$.
Distributivity of scalar multiplication with respect to vector addition: $\lambda (v + w) = \lambda v + \lambda w$.
Distributivity of scalar multiplication with respect to field addition: $(\lambda + \mu) v = \lambda v + \mu v$.
Compatibility of scalar multiplication with field multiplication: $\lambda (\mu v) = (\lambda \mu) v$.
Identity element of scalar multiplication: $1 v = v$, where $1$ is the identity in $F$.

Matrix-Vector Multiplication

for i in xrange(m):
    R[i] = 0.0
    for j in xrange(n):
        R[i] = R[i] + A[i, j] * V[j]

Listing 1: Code for elementwise matrix-vector multiplication.

$$
A V =
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}
=
\begin{bmatrix}
a_{11} v_1 + a_{12} v_2 + \cdots + a_{1n} v_n \\
a_{21} v_1 + a_{22} v_2 + \cdots + a_{2n} v_n \\
\vdots \\
a_{m1} v_1 + a_{m2} v_2 + \cdots + a_{mn} v_n
\end{bmatrix}
$$
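The elementwise loop in Listing 1 can be checked against numpy's built-in matrix-vector product. This is a minimal sketch, not part of the original slides; the names m, n, A, V, R simply mirror the listing above.

import numpy as np

# Small test problem; A is m x n, V has n entries.
m, n = 3, 4
rng = np.random.default_rng(0)
A = rng.normal(size=(m, n))
V = rng.normal(size=n)

# Elementwise (row by row) computation, as in Listing 1.
R = np.zeros(m)
for i in range(m):
    R[i] = 0.0
    for j in range(n):
        R[i] = R[i] + A[i, j] * V[j]

# Should agree with numpy's matrix-vector product.
print(np.allclose(R, A @ V))  # True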


The same product can also be accumulated column by column:

R = A[:, 0] * V[0]
for j in xrange(1, n):
    R += A[:, j] * V[j]

Listing 2: Code for columnwise matrix-vector multiplication.

Denote the $n$ columns of $A$ by $a_1, a_2, \dots, a_n$.
Each $a_i$ is now a (column) vector.

$$
A V = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}
= a_1 v_1 + a_2 v_2 + \cdots + a_n v_n
$$
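A quick numerical check of this column view, with $AV$ formed as a linear combination of the columns of $A$. A sketch of my own, not from the slides; the variables are chosen to match the notation above.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
v = rng.normal(size=4)

# A v as a linear combination of the columns a_1, ..., a_n of A,
# weighted by the entries v_1, ..., v_n.
combo = sum(v[j] * A[:, j] for j in range(A.shape[1]))

print(np.allclose(A @ v, combo))  # True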

Transpose of R

Given $R = AV$, what is $R^T$?

$$R = a_1 v_1 + a_2 v_2 + \cdots + a_n v_n$$

$$R^T = v_1 a_1^T + v_2 a_2^T + \cdots + v_n a_n^T$$

NOT equal to $A^T V^T$! (In fact, $A^T V^T$ is not even defined.)

Transpose

Reverse order rule: $(AV)^T = V^T A^T$

$$
R^T =
\underbrace{\begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix}}_{V^T}
\underbrace{\begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{bmatrix}}_{A^T}
= v_1 a_1^T + v_2 a_2^T + \cdots + v_n a_n^T
$$
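The reverse-order rule holds for any conformable product, not only matrix-vector products. A tiny numerical check (my own sketch, not part of the slides):

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))

# (AB)^T = B^T A^T; note A^T B^T is not even defined here
# (shapes 4x3 and 2x4 do not match).
print(np.allclose((A @ B).T, B.T @ A.T))  # True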

Trace

The trace of a square matrix $A$ is the sum of all diagonal elements of $A$:
$$\operatorname{tr}\{A\} = \sum_{k=1}^{n} A_{kk}$$

Example
$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}, \qquad \operatorname{tr}\{A\} = 1 + 5 + 9 = 15$$

The trace does not exist for a non-square matrix.

vec(x) operator

Define $\operatorname{vec}(X)$ as the vector which results from stacking all columns of a matrix $X$ on top of each other.

Example
$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}, \qquad \operatorname{vec}(A) = \begin{bmatrix} 1 \\ 4 \\ 7 \\ 2 \\ 5 \\ 8 \\ 3 \\ 6 \\ 9 \end{bmatrix}$$

Given two matrices $X, Y \in \mathbb{R}^{n \times p}$, the trace of the product $X^T Y$ can be written as
$$\operatorname{tr}\{X^T Y\} = \operatorname{vec}(X)^T \operatorname{vec}(Y)$$
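The identity $\operatorname{tr}\{X^T Y\} = \operatorname{vec}(X)^T \operatorname{vec}(Y)$ is easy to verify numerically. This is a sketch of my own; note that numpy stores arrays in row-major order, so stacking columns requires flattening in Fortran ("F") order.

import numpy as np

def vec(X):
    # Stack the columns of X on top of each other.
    return X.flatten(order="F")

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 3))
Y = rng.normal(size=(4, 3))

lhs = np.trace(X.T @ Y)
rhs = vec(X) @ vec(Y)
print(np.allclose(lhs, rhs))  # True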

Inner Product

Mapping from two vectors $x, y \in V$ to a field of scalars $F$ ($F = \mathbb{R}$ or $F = \mathbb{C}$):
$$\langle \cdot, \cdot \rangle : V \times V \to F$$
Conjugate symmetry:
$\langle x, y \rangle = \overline{\langle y, x \rangle}$
Linearity:
$\langle a x, y \rangle = a \langle x, y \rangle$, and $\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$
Positive-definiteness:
$\langle x, x \rangle \ge 0$, and $\langle x, x \rangle = 0$ for $x = 0$ only.

Canonical Inner Products

Inner product for real numbers $x, y \in \mathbb{R}$:
$$\langle x, y \rangle = x\, y$$

Dot product between two vectors: given $x, y \in \mathbb{R}^n$,
$$\langle x, y \rangle = x^T y = \sum_{k=1}^{n} x_k y_k = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$$

Canonical inner product for matrices: given $X, Y \in \mathbb{R}^{n \times p}$,
$$\langle X, Y \rangle = \operatorname{tr}\{X^T Y\} = \sum_{k=1}^{p} (X^T Y)_{kk} = \sum_{k=1}^{p} \sum_{l=1}^{n} (X^T)_{kl} (Y)_{lk} = \sum_{k=1}^{p} \sum_{l=1}^{n} X_{lk} Y_{lk}$$

Calculations with Matrices

Denote the $(i,j)$-th element of matrix $X$ by $X_{ij}$.
Transpose: $(X^T)_{ij} = X_{ji}$
Product: $(XY)_{ij} = \sum_{k} X_{ik} Y_{kj}$
Proof of the linearity of the canonical matrix inner product:
$$
\begin{aligned}
\langle X + Y, Z \rangle = \operatorname{tr}\{(X + Y)^T Z\}
&= \sum_{k=1}^{p} \left((X + Y)^T Z\right)_{kk} \\
&= \sum_{k=1}^{p} \sum_{l=1}^{n} ((X + Y)^T)_{kl}\, Z_{lk}
 = \sum_{k=1}^{p} \sum_{l=1}^{n} (X_{lk} + Y_{lk})\, Z_{lk} \\
&= \sum_{k=1}^{p} \sum_{l=1}^{n} \left( X_{lk} Z_{lk} + Y_{lk} Z_{lk} \right)
 = \sum_{k=1}^{p} \sum_{l=1}^{n} \left( (X^T)_{kl} Z_{lk} + (Y^T)_{kl} Z_{lk} \right) \\
&= \sum_{k=1}^{p} \left( (X^T Z)_{kk} + (Y^T Z)_{kk} \right)
 = \operatorname{tr}\{X^T Z\} + \operatorname{tr}\{Y^T Z\} = \langle X, Z \rangle + \langle Y, Z \rangle
\end{aligned}
$$

Projection

In linear algebra and functional analysis, a projection is a linear transformation $P$ from a vector space $V$ to itself such that
$$P^2 = P$$

A projection can have an effect only once: after a vector has been projected, applying the same transformation again does not move it any further.

[Figure: the point $(a, b)$ is projected onto the point $(a - b, 0)$ on the x-axis.]

$$A \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} a - b \\ 0 \end{bmatrix}, \qquad A = \begin{bmatrix} 1 & -1 \\ 0 & 0 \end{bmatrix}$$
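A minimal check, my own sketch, that the matrix $A$ above is idempotent and maps $(a, b)$ to $(a - b, 0)$:

import numpy as np

A = np.array([[1.0, -1.0],
              [0.0,  0.0]])

# Idempotence: applying the projection twice changes nothing.
print(np.allclose(A @ A, A))            # True

# The point (a, b) = (3, 1) is mapped to (a - b, 0) = (2, 0),
# and projecting again leaves it there.
x = np.array([3.0, 1.0])
print(A @ x)                            # [2. 0.]
print(np.allclose(A @ (A @ x), A @ x))  # True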

Orthogonal Projection

Orthogonality: need an inner product $\langle x, y \rangle$.
Choose two arbitrary vectors $x$ and $y$. For an orthogonal projection, $Px$ and $y - Py$ are orthogonal:
$$0 = \langle Px, y - Py \rangle = (Px)^T (y - Py) = x^T (P^T - P^T P)\, y$$
Orthogonal Projection:
$$P^2 = P \qquad P^T = P$$

Example: Given some unit vector $u \in \mathbb{R}^n$ characterising a line through the origin in $\mathbb{R}^n$, project an arbitrary vector $x \in \mathbb{R}^n$ onto this line by
$$P = u u^T$$
Proof: $P = P^T$, and
$$P^2 x = (u u^T)(u u^T) x = u u^T x = P x$$
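A sketch (an assumed example, not from the slides) of projecting onto the line spanned by a unit vector $u$ with $P = u u^T$:

import numpy as np

u = np.array([1.0, 2.0, 2.0])
u = u / np.linalg.norm(u)       # make u a unit vector

P = np.outer(u, u)              # P = u u^T

x = np.array([3.0, -1.0, 4.0])
p = P @ x                       # projection of x onto the line

print(np.allclose(P @ P, P))    # idempotent: True
print(np.allclose(P.T, P))      # symmetric:  True
# The residual x - p is orthogonal to the line.
print(np.isclose(u @ (x - p), 0.0))  # True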

Orthogonal Projection

Given a matrix $A \in \mathbb{R}^{n \times p}$ and a vector $x$. What is the closest point $\tilde{x}$ to $x$ in the column space of $A$?
Orthogonal projection into the column space of $A$:
$$\tilde{x} = A (A^T A)^{-1} A^T x$$
Projection matrix
$$P = A (A^T A)^{-1} A^T$$
Proof
$$P^2 = A (A^T A)^{-1} A^T A (A^T A)^{-1} A^T = A (A^T A)^{-1} A^T = P$$
Orthogonal projection?
$$P^T = \left( A (A^T A)^{-1} A^T \right)^T = A (A^T A)^{-1} A^T = P$$
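A numerical sketch (my own example) of the projection onto the column space of $A$; it also shows the link to least squares, since $A(A^T A)^{-1} A^T x$ is the point in the column space closest to $x$.

import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 2))    # tall matrix, generically of full column rank
x = rng.normal(size=5)

P = A @ np.linalg.inv(A.T @ A) @ A.T    # projection matrix
x_tilde = P @ x                         # closest point to x in col(A)

print(np.allclose(P @ P, P))            # idempotent: True
print(np.allclose(P.T, P))              # symmetric:  True

# Same point via least squares: x_tilde = A w with w = argmin ||A w - x||.
w, *_ = np.linalg.lstsq(A, x, rcond=None)
print(np.allclose(x_tilde, A @ w))      # True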

Linear Independence of Vectors

A set of vectors is linearly independent if none of them can be written as a linear combination of finitely many other vectors in the set.
Given a vector space $V$, $k$ vectors $v_1, \dots, v_k \in V$ and scalars $a_1, \dots, a_k$. The vectors are linearly independent if and only if
$$a_1 v_1 + a_2 v_2 + \cdots + a_k v_k = \mathbf{0}$$
(the zero vector) has only the trivial solution
$$a_1 = a_2 = \cdots = a_k = 0 .$$

Rank

The column rank of a matrix $A$ is the maximal number of linearly independent columns of $A$.
The row rank of a matrix $A$ is the maximal number of linearly independent rows of $A$.
For every matrix $A$, the row rank and column rank are equal, and are simply called the rank of $A$.
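Linear independence of a set of vectors can be checked numerically via the rank of the matrix that has those vectors as columns; a small sketch, assuming numpy:

import numpy as np

# Columns of A: v1, v2, v3 with v3 = v1 + v2, so they are dependent.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 3.0, 5.0]])

print(np.linalg.matrix_rank(A))     # 2 < 3 columns -> linearly dependent

# Row rank equals column rank.
print(np.linalg.matrix_rank(A.T))   # 2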

Determinant

Let $S_n$ be the set of all permutations of the numbers $\{1, \dots, n\}$. The determinant of the square matrix $A$ is then given by
$$\det\{A\} = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} A_{i, \sigma(i)}$$
where $\operatorname{sgn}(\sigma)$ is the signature of the permutation ($+1$ if the permutation is even, $-1$ if the permutation is odd).
$\det\{AB\} = \det\{A\} \det\{B\}$ for $A, B$ square matrices.
$\det\{A^{-1}\} = \det\{A\}^{-1}$
$\det\{A^T\} = \det\{A\}$
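The permutation formula can be implemented directly for small matrices and compared against numpy; this is a sketch of my own, with the signature computed from the number of inversions.

import numpy as np
from itertools import permutations

def det_permutation(A):
    # Determinant via the sum over all permutations (only feasible for small n).
    n = A.shape[0]
    total = 0.0
    for sigma in permutations(range(n)):
        # sgn(sigma) = (-1)^(number of inversions)
        inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                         if sigma[i] > sigma[j])
        sign = -1.0 if inversions % 2 else 1.0
        total += sign * np.prod([A[i, sigma[i]] for i in range(n)])
    return total

rng = np.random.default_rng(5)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))

print(np.isclose(det_permutation(A), np.linalg.det(A)))                       # True
print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))  # True
print(np.isclose(np.linalg.det(A.T), np.linalg.det(A)))                       # True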

Matrix Inverse

Identity matrix $I$:
$$I = A A^{-1} = A^{-1} A$$
The matrix inverse $A^{-1}$ only exists for square matrices which are NOT singular.
Singular matrix:
at least one eigenvalue is zero,
determinant $|A| = 0$.
Inversion reverses the order (like transposition):
$$(AB)^{-1} = B^{-1} A^{-1}$$
(Only then is $(AB)(AB)^{-1} = (AB) B^{-1} A^{-1} = A A^{-1} = I$.)
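A quick numerical check (my own sketch) of the reverse-order rule for the inverse:

import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))   # random matrices are almost surely non-singular

print(np.allclose(np.linalg.inv(A @ B),
                  np.linalg.inv(B) @ np.linalg.inv(A)))        # True
print(np.allclose((A @ B) @ np.linalg.inv(A @ B), np.eye(3)))  # True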

Matrix Inverse and Transpose

What is $(A^{-1})^T$?
Need to assume that $A^{-1}$ exists.
Rule: $(A^{-1})^T = (A^T)^{-1}$
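And the corresponding one-line check (sketch, assuming a generic invertible $A$):

import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(3, 3))   # almost surely invertible

print(np.allclose(np.linalg.inv(A).T, np.linalg.inv(A.T)))  # True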

Useful Identity

$$(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}$$

How to analyse and prove such an equation?
$A^{-1}$ must exist, therefore $A$ must be square, say $A \in \mathbb{R}^{n \times n}$.
Same for $C^{-1}$, so let's assume $C \in \mathbb{R}^{m \times m}$.
Therefore, $B \in \mathbb{R}^{m \times n}$.

Prove the Identity

$$(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}$$

Multiply by $(B A B^T + C)$:
$$(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} (B A B^T + C) = A B^T$$
Simplify the left-hand side:
$$
(A^{-1} + B^T C^{-1} B)^{-1} \left[ B^T C^{-1} B (A B^T) + B^T \right]
= (A^{-1} + B^T C^{-1} B)^{-1} \left[ B^T C^{-1} B + A^{-1} \right] A B^T
= A B^T
$$
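The identity can also be sanity-checked numerically, for example with symmetric positive definite $A$ and $C$ so that all inverses exist. A sketch of my own, not from the slides:

import numpy as np

rng = np.random.default_rng(8)
n, m = 4, 3
A = rng.normal(size=(n, n)); A = A @ A.T + n * np.eye(n)   # SPD, invertible
C = rng.normal(size=(m, m)); C = C @ C.T + m * np.eye(m)   # SPD, invertible
B = rng.normal(size=(m, n))

inv = np.linalg.inv
lhs = inv(inv(A) + B.T @ inv(C) @ B) @ B.T @ inv(C)
rhs = A @ B.T @ inv(B @ A @ B.T + C)

print(np.allclose(lhs, rhs))   # True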

Woodbury Identity

$$(A + B D^{-1} C)^{-1} = A^{-1} - A^{-1} B (D + C A^{-1} B)^{-1} C A^{-1}$$

Useful if $A$ is diagonal and easy to invert, and $B$ is very tall (many rows, but only a few columns) and $C$ is very wide (few rows, many columns).
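A numerical sanity check of the Woodbury identity (a sketch; the matrices are chosen so the required inverses exist, with diagonal $A$, tall $B$ and wide $C$ as described above):

import numpy as np

rng = np.random.default_rng(9)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # diagonal, easy to invert
B = rng.normal(size=(n, k))                  # tall
C = rng.normal(size=(k, n))                  # wide
D = np.eye(k)

inv = np.linalg.inv
lhs = inv(A + B @ inv(D) @ C)
rhs = inv(A) - inv(A) @ B @ inv(D + C @ inv(A) @ B) @ C @ inv(A)

print(np.allclose(lhs, rhs))   # True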

Manipulating Matrix Equations

Don't multiply by a matrix which does not have full rank. Why? In the general case, you will lose solutions.
Don't multiply by nonsquare matrices.
Don't assume a matrix inverse exists just because a matrix is square. The matrix might be singular, and you would be making the same mistake as deducing $a = b$ from $a \cdot 0 = b \cdot 0$.
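A small illustration (my own sketch) of why multiplying by a rank-deficient matrix is dangerous: it can make unequal quantities look equal, just like multiplying scalars by zero.

import numpy as np

M = np.array([[1.0, 1.0],
              [1.0, 1.0]])       # rank 1, not full rank

u = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])

# u != w, but after multiplying both by M they become indistinguishable,
# so M u = M w does not imply u = w.
print(np.allclose(M @ u, M @ w))  # True, although u != w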

Eigenvectors

Every square matrix $A \in \mathbb{R}^{n \times n}$ has an eigenvector decomposition
$$A x = \lambda x$$
where $x \in \mathbb{C}^n$ and $\lambda \in \mathbb{C}$.
Example:
$$\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} x = \lambda x$$
$$\lambda = \{i, -i\} \qquad x = \begin{bmatrix} i \\ 1 \end{bmatrix}, \begin{bmatrix} -i \\ 1 \end{bmatrix}$$
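numpy reproduces this example; note that the real rotation matrix has complex eigenvalues and eigenvectors (a sketch, not part of the slides):

import numpy as np

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])

lam, X = np.linalg.eig(A)        # columns of X are eigenvectors
print(lam)                        # eigenvalues i and -i (output order may differ)

# Check A x = lambda x for each eigenpair.
for k in range(2):
    print(np.allclose(A @ X[:, k], lam[k] * X[:, k]))   # True, True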

Eigenvectors

How many eigenvalue/eigenvector pairs?
$$A x = \lambda x$$
is equivalent to
$$(A - \lambda I) x = 0$$
This has a non-trivial solution only for $\det\{A - \lambda I\} = 0$:
a polynomial of $n$th order in $\lambda$; at most $n$ distinct solutions.

Real Eigenvalues

How can we enforce real eigenvalues?
Let's look at matrices with complex entries $A \in \mathbb{C}^{n \times n}$.
Transposition is replaced by the Hermitian adjoint, e.g.
$$\begin{bmatrix} 1 + 2i & 3 + 4i \\ 5 + 6i & 7 + 8i \end{bmatrix}^H = \begin{bmatrix} 1 - 2i & 5 - 6i \\ 3 - 4i & 7 - 8i \end{bmatrix}$$
Denote the complex conjugate of a complex number $a$ by $\overline{a}$.

Real Eigenvalues

How can we enforce real eigenvalues?
Let's assume $A \in \mathbb{C}^{n \times n}$ is Hermitian ($A^H = A$).
Calculate
$$x^H A x = \lambda\, x^H x$$
for an eigenvector $x \in \mathbb{C}^n$ of $A$.
Another possibility to calculate $x^H A x$:
$$
\begin{aligned}
x^H A x &= x^H A^H x && \text{($A$ is Hermitian)} \\
        &= (x^H A x)^H && \text{(reverse order)} \\
        &= (\lambda\, x^H x)^H && \text{(eigenvalue)} \\
        &= \overline{\lambda}\, x^H x
\end{aligned}
$$
and therefore $\lambda = \overline{\lambda}$ ($\lambda$ is real).

If $A$ is Hermitian, then all eigenvalues are real.
Special case: If $A$ has only real entries and is symmetric, then all eigenvalues are real.
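A numerical illustration (my own sketch): the eigenvalues of a random Hermitian matrix come out real, and numpy's eigvalsh exploits exactly this structure.

import numpy as np

rng = np.random.default_rng(10)
M = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
A = M + M.conj().T                 # A is Hermitian: A^H = A

lam = np.linalg.eigvals(A)         # general solver returns complex numbers...
print(np.allclose(lam.imag, 0.0))  # ...but their imaginary parts vanish: True

print(np.linalg.eigvalsh(A))       # Hermitian solver returns real eigenvalues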

Singular Value Decomposition

Every matrix $A \in \mathbb{R}^{n \times p}$ can be decomposed into a product of three matrices
$$A = U \Sigma V^T$$
where $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{p \times p}$ are orthogonal matrices ($U^T U = I$ and $V^T V = I$), and $\Sigma \in \mathbb{R}^{n \times p}$ has nonnegative numbers on the diagonal.
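A quick numpy sketch of the decomposition; numpy returns the singular values as a vector, so $\Sigma$ has to be assembled explicitly for a non-square $A$. The example matrix is my own.

import numpy as np

rng = np.random.default_rng(11)
A = rng.normal(size=(5, 3))            # n = 5, p = 3

U, s, Vt = np.linalg.svd(A)            # U: 5x5, s: 3 singular values, Vt = V^T: 3x3

Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)             # embed the singular values on the diagonal

print(np.allclose(A, U @ Sigma @ Vt))        # True
print(np.allclose(U.T @ U, np.eye(5)))       # U orthogonal: True
print(np.allclose(Vt @ Vt.T, np.eye(3)))     # V orthogonal: True
print(np.all(s >= 0))                        # nonnegative singular values: True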

How to calculate a gradient?

Given a vector space $V$, and a function $f : V \to \mathbb{R}$.
Example: $X \in \mathbb{R}^{n \times p}$, arbitrary $C \in \mathbb{R}^{n \times n}$, $f(X) = \operatorname{tr}\{X^T C X\}$.
How to calculate the gradient of $f$?
Ill-defined question. There is no gradient in a vector space.
Only a Directional Derivative at point $X \in V$ in direction $\xi \in V$:
$$D f(X)(\xi) = \lim_{h \to 0} \frac{f(X + h\xi) - f(X)}{h}$$
For the example $f(X) = \operatorname{tr}\{X^T C X\}$:
$$
D f(X)(\xi) = \lim_{h \to 0} \frac{\operatorname{tr}\{(X + h\xi)^T C (X + h\xi)\} - \operatorname{tr}\{X^T C X\}}{h}
= \operatorname{tr}\{\xi^T C X + X^T C \xi\} = \operatorname{tr}\{\xi^T (C^T + C) X\}
$$

How to calculate a gradient?

Given a vector space $V$, and a function $f : V \to \mathbb{R}$. How to calculate the gradient of $f$?
1. Calculate the Directional Derivative.
2. Define an inner product $\langle A, B \rangle$ on the vector space.
3. The gradient is now defined by
$$D f(X)(\xi) = \langle \operatorname{grad} f, \xi \rangle$$

For the example $f(X) = \operatorname{tr}\{X^T C X\}$:
Define the inner product $\langle A, B \rangle = \operatorname{tr}\{A^T B\}$.
Write the Directional Derivative as an inner product:
$$D f(X)(\xi) = \operatorname{tr}\{\xi^T (C + C^T) X\} = \langle (C + C^T) X, \xi \rangle = \langle \operatorname{grad} f, \xi \rangle$$
so $\operatorname{grad} f = (C + C^T) X$.
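The result $\operatorname{grad} f = (C + C^T) X$ can be checked against a finite-difference approximation of the directional derivative. A sketch of my own; the direction Xi is just a random matrix of the same shape as X.

import numpy as np

rng = np.random.default_rng(12)
n, p = 4, 2
C = rng.normal(size=(n, n))            # arbitrary, not necessarily symmetric
X = rng.normal(size=(n, p))
Xi = rng.normal(size=(n, p))           # direction in the same space as X

f = lambda X: np.trace(X.T @ C @ X)

# Finite-difference approximation of the directional derivative D f(X)(Xi).
h = 1e-6
numeric = (f(X + h * Xi) - f(X)) / h

# Closed form: <grad f, Xi> with grad f = (C + C^T) X and <A, B> = tr(A^T B).
grad = (C + C.T) @ X
analytic = np.trace(grad.T @ Xi)

print(np.isclose(numeric, analytic, rtol=1e-4))   # True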

Books

Gilbert Strang, "Introduction to Linear Algebra", Wellesley-Cambridge Press, 2009.
David C. Lay, "Linear Algebra and Its Applications", Addison Wesley, 2005.
