
Linear algebra

Kevin P. Murphy
Last updated January 25, 2008

1 Introduction
Linear algebra is the study of matrices and vectors. Matrices can be used to represent linear functions1 as well as
systems of linear equations. For example,
4x1 − 5x2 = −13
−2x1 + 3x2 = 9    (2)

can be represented more compactly by

Ax = b    (3)

where

A = \begin{bmatrix} 4 & -5 \\ -2 & 3 \end{bmatrix}, \quad b = \begin{bmatrix} -13 \\ 9 \end{bmatrix}    (4)
Linear algebra also forms the basis of many machine learning methods, such as linear regression, PCA, Kalman
filtering, etc.
Note: Much of this chapter is based on notes written by Zico Kolter, and is used with his permission.
2 Basic notation
We use the following notation:

• By A ∈ Rm×n we denote a matrix with m rows and n columns, where the entries of A are real numbers.
• By x ∈ Rn , we denote a vector with n entries. Usually a vector x will denote a column vector — i.e., a matrix
with n rows and 1 column. If we want to explicitly represent a row vector — a matrix with 1 row and n columns
— we typically write xT (here xT denotes the transpose of x, which we will define shortly).
• The ith element of a vector x is denoted xi :
 
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.

• We use the notation aij (or Aij , Ai,j , etc) to denote the entry of A in the ith row and jth column:
 
A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}.
1 A function f : R^m → R^n is called linear if
f(c_1 x + c_2 y) = c_1 f(x) + c_2 f(y)    (1)
for all scalars c_1, c_2 and vectors x, y. Hence we can predict the output of a linear function in terms of its response to simple inputs.

• We denote the jth column of A by aj or A:,j :
 
A = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}.

• We denote the ith row of A by aTi or Ai,: :


 
A = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix}.

• A diagonal matrix is a matrix where all non-diagonal elements are 0. This is typically denoted D = diag(d1 , d2 , . . . , dn ),
with

D = \begin{bmatrix} d_1 & & & \\ & d_2 & & \\ & & \ddots & \\ & & & d_n \end{bmatrix}    (5)

• The identity matrix, denoted I ∈ Rn×n , is a square matrix with ones on the diagonal and zeros everywhere
else, I = diag(1, 1, . . . , 1). It has the property that for all A ∈ Rm×n ,

AI = A = IA (6)

where the size of I is determined by the dimensions of A so that matrix multiplication is possible. (We define
matrix multiplication below.)
• A block diagonal matrix is one which contains matrices on its main diagonal, and is 0 everywhere else, e.g.,
 
\begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix}    (7)

• The unit vector ei is a vector of all 0’s, except entry i, which has value 1:

ei = (0, . . . , 0, 1, 0, . . . , 0)T (8)

The vector of all ones is denoted 1. The vector of all zeros is denoted 0.

3 Matrix Multiplication
The product of two matrices A ∈ Rm×n and B ∈ Rn×p is the matrix

C = AB ∈ Rm×p , (9)

where
C_{ij} = \sum_{k=1}^n A_{ik} B_{kj}.    (10)

Note that in order for the matrix product to exist, the number of columns in A must equal the number of rows in B.
There are many ways of looking at matrix multiplication, and we’ll start by examining a few special cases.
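As a quick sanity check of this definition, the following sketch (with arbitrary small matrices) compares an explicit triple loop against Matlab's built-in * operator.

% Check C(i,j) = sum_k A(i,k)*B(k,j) against the built-in matrix product
A = randn(3,4); B = randn(4,2);      % m=3, n=4, p=2
C = zeros(3,2);
for i=1:3
  for j=1:2
    for k=1:4
      C(i,j) = C(i,j) + A(i,k)*B(k,j);
    end
  end
end
assert(max(max(abs(C - A*B))) < 1e-10)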

3.1 Vector-Vector Products
Given two vectors x, y ∈ Rn , the quantity xT y, called the inner product, dot product or scalar product of the
vectors, is a real number given by
x^T y = \langle x, y \rangle = \sum_{i=1}^n x_i y_i \in \mathbb{R}.    (11)

Note that it is always the case that x^T y = y^T x.
Given vectors x ∈ Rm , y ∈ Rn (they no longer have to be the same size), xyT is called the outer product of the
vectors. It is a matrix whose entries are given by (xyT )ij = xi yj , i.e.,
 
xy^T \in \mathbb{R}^{m \times n} = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n \end{bmatrix}.    (12)
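In Matlab, the inner and outer products look as follows (a minimal sketch with arbitrary vectors).

x = [1; 2; 3]; y = [4; 5; 6];
innerProd = x'*y;            % scalar: sum(x .* y) = 32
outerProd = x*y';            % 3x3 matrix with entries x(i)*y(j)
assert(innerProd == y'*x)    % the inner product is symmetric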

3.2 Matrix-Vector Products


Given a matrix A ∈ Rm×n and a vector x ∈ Rn , their product is a vector y = Ax ∈ Rm . There are a couple ways of
looking at matrix-vector multiplication, and we will look at them both.
If we write A by rows, then we can express Ax as,
y = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} x = \begin{bmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_m^T x \end{bmatrix}.    (13)

In other words, the ith entry of y is equal to the inner product of the ith row of A and x, yi = aTi x. In Matlab notation,
we have
y = [A(1,:)*x; ...; A(m,:)*x]
Alternatively, let’s write A in column form. In this case we see that,
 
y = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n.    (14)

In other words, y is a linear combination of the columns of A, where the coefficients of the linear combination are
given by the entries of x. In Matlab notation, we have
y = A(:,1)*x1 + ...+ A(:,n)*xn
So far we have been multiplying on the right by a column vector, but it is also possible to multiply on the left by a
row vector. This is written, yT = xT A for A ∈ Rm×n , x ∈ Rm , and y ∈ Rn . As before, we can express yT in two
obvious ways, depending on whether we express A in terms of its rows or columns. In the first case we express A in
terms of its columns, which gives
 
y^T = x^T \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} = \begin{bmatrix} x^T a_1 & x^T a_2 & \cdots & x^T a_n \end{bmatrix}    (15)

which demonstrates that the ith entry of yT is equal to the inner product of x and the ith column of A.

Finally, expressing A in terms of rows we get the final representation of the vector-matrix product,
 
y^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix}    (16)

    = x_1 a_1^T + x_2 a_2^T + \cdots + x_m a_m^T    (18)

so we see that yT is a linear combination of the rows of A, where the coefficients for the linear combination are given
by the entries of x.
3.3 Matrix-Matrix Products
Armed with this knowledge, we can now look at four different (but, of course, equivalent) ways of viewing the
matrix-matrix multiplication C = AB as defined at the beginning of this section. First we can view matrix-matrix
multiplication as a set of vector-vector products. The most obvious viewpoint, which follows immediately from the
definition, is that the i, j entry of C is equal to the inner product of the ith row of A and the jth column of B. Symbolically,
this looks like the following,
   T 
C = AB = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} \begin{bmatrix} b_1 & b_2 & \cdots & b_p \end{bmatrix} = \begin{bmatrix} a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_p \\ a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_p \\ \vdots & \vdots & \ddots & \vdots \\ a_m^T b_1 & a_m^T b_2 & \cdots & a_m^T b_p \end{bmatrix}.    (19)

Remember that since A ∈ Rm×n and B ∈ Rn×p , ai ∈ Rn and bj ∈ Rn , so these inner products all make sense. This
is the most “natural” representation when we represent A by rows and B by columns.
A special case of this result arises in statistical applications where A = X and B = XT , where X is the n × d
design matrix, whose rows are the data cases. In this case, XXT is an n × n matrix of inner products called the
Gram matrix:

XX^T = \begin{bmatrix} x_1^T x_1 & \cdots & x_1^T x_n \\ \vdots & \ddots & \vdots \\ x_n^T x_1 & \cdots & x_n^T x_n \end{bmatrix}    (20)
Alternatively, we can represent A by columns, and B by rows, which leads to the interpretation of AB as a sum
of outer products. Symbolically,
 
C = AB = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \begin{bmatrix} b_1^T \\ b_2^T \\ \vdots \\ b_n^T \end{bmatrix} = \sum_{i=1}^n a_i b_i^T.    (21)

Put another way, AB is equal to the sum, over all i, of the outer product of the ith column of A and the ith row of B.
Since, in this case, ai ∈ Rm and bi ∈ Rp , the dimension of the outer product ai bTi is m × p, which coincides with
the dimension of C.
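The outer-product view is easy to verify numerically (a minimal sketch with random matrices).

A = randn(3,4); B = randn(4,2);
C = zeros(3,2);
for i=1:4
  C = C + A(:,i)*B(i,:);     % outer product of column i of A with row i of B
end
assert(max(max(abs(C - A*B))) < 1e-10)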
If A = XT and B = X, where X is the n × d design matrix, we get a d × d matrix which is proportional to the
empirical covariance matrix of the data (assuming it has been centered):
 2 
x xi,1 xi,2 · · · xi,1 xi,d
Xn X  i,1
XT X = xi xTi = .. 
(22)
 . 
i=1 i xi,d xi,1 xi,d xi,2 ··· x2i,d

Second, we can also view matrix-matrix multiplication as a set of matrix-vector products. Specifically, if we
represent B by columns, we can view the columns of C as matrix-vector products between A and the columns of B.
Symbolically,

C = AB = A \begin{bmatrix} b_1 & b_2 & \cdots & b_p \end{bmatrix} = \begin{bmatrix} Ab_1 & Ab_2 & \cdots & Ab_p \end{bmatrix}.    (23)
Here the ith column of C is given by the matrix-vector product with the vector on the right, ci = Abi . These matrix-
vector products can in turn be interpreted using both viewpoints given in the previous subsection. Finally, we have the
analogous viewpoint, where we represent A by rows, and view the rows of C as the matrix-vector products between
the rows of A and the matrix B. Symbolically,
   
C = AB = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} B = \begin{bmatrix} a_1^T B \\ a_2^T B \\ \vdots \\ a_m^T B \end{bmatrix}.    (24)

Here the ith row of C is given by the matrix-vector product with the vector on the left, cTi = aTi B.
It may seem like overkill to dissect matrix multiplication to such a large degree, especially when all these view-
points follow immediately from the initial definition we gave (in about a line of math) at the beginning of this section.
However, virtually all of linear algebra deals with matrix multiplications of some kind, and it is worthwhile to spend
some time trying to develop an intuitive understanding of the viewpoints presented here.
In addition to this, it is useful to know a few basic properties of matrix multiplication at a higher level:
• Matrix multiplication is associative: (AB)C = A(BC).
• Matrix multiplication is distributive: A(B + C) = AB + AC.
• Matrix multiplication is, in general, not commutative; that is, it can be the case that AB ≠ BA.

3.4 Multiplication by diagonal matrices


Sometimes either A or B are diagonal matrices. If we post multiply A by a diagonal matrix B = diag(s), then we
just scale each column by the corresponding scale factor in s. In Matlab notation
A * diag(s) = [s1*A(:,1), ..., sn*A(:,n)]
A special case of this is if A has just one row, call it aT . Then we are just scaling each element of a by the corre-
sponding entry in s. Hence in Matlab notation
a’ * diag(s) = [s1*a1, ..., sn*an] = a .* s
Similarly, if we pre multiply B by a diagonal matrix A = diag(s), then we just scale each row by the corresponding
scale factor in s. In Matlab notation
diag(s) * B = [s1*B(1,:); ...; sm*B(m,:)]
A special case of this is if B has just one column, call it b. Then we are just scaling each element of b by the
corresponding entry in s. Hence in Matlab notation
diag(s) * b = [s1*b1; ...; sm*bm] = s .* b

4 Operations on matrices
4.1 The transpose
The transpose of a matrix results from “flipping” the rows and columns. Given a matrix A ∈ Rm×n , its transpose,
written AT , is defined as
AT ∈ Rn×m , (AT )ij = Aji . (25)
We have in fact already been using the transpose when describing row vectors, since the transpose of a column vector
is naturally a row vector.
The following properties of transposes are easily verified:
• (AT )T = A
• (AB)T = BT AT
• (A + B)T = AT + BT
4.2 The trace of a square matrix
The trace of a square matrix A ∈ Rn×n , denoted tr(A) (or just trA if the parentheses are obviously implied), is the
sum of diagonal elements in the matrix:
\mathrm{tr}\,A = \sum_{i=1}^n A_{ii}.    (26)

The trace has the following properties:


• For A ∈ Rn×n , trA = trAT .
• For A, B ∈ Rn×n , tr(A + B) = trA + trB.
• For A ∈ Rn×n , t ∈ R, tr(tA) = t trA.
• For A, B such that AB is square, trAB = trBA.
• For A, B, C such that ABC is square,

trABC = trBCA = trCAB (27)

and so on for the product of more matrices. This is called the cyclic permutation property of the trace operator.

From the cyclic permutation property, we can derive the trace trick, which rewrites the scalar inner product xT Ax
as follows
xT Ax = tr(xT Ax) = tr(xxT A) (28)
We shall use this result later in the book.
The trace of a product of two matrices, tr(AB), is the sum of the inner products of the rows of A with the
corresponding columns of B. In Matlab notation we have

trace(A*B) = sum([A(1,:)*B(:,1), ..., A(n,:)*B(:,n)])

Thus tr(A^T B) behaves like an inner product between matrices, and can be used to define a measure of distance between two matrices.
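The cyclic permutation property and the trace trick can be checked numerically, as in the following sketch.

A = randn(3); B = randn(3); C = randn(3); x = randn(3,1);
assert(abs(trace(A*B*C) - trace(B*C*A)) < 1e-10)   % cyclic permutation
assert(abs(trace(A*B*C) - trace(C*A*B)) < 1e-10)
assert(abs(x'*A*x - trace(x*x'*A)) < 1e-10)        % trace trick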

4.3 Adjugate (classical adjoint) of a square matrix
Below we introduce some rather abstract material that will be necessary for defining matrix determinants and inverse.
However, it can be skipped without loss of continuity. This subsection is based on the wikipedia entry.2
Let A be an n × n matrix. Let M̃ij be the matrix obtained by deleting row i and column j from A. Define
Mij = det M̃ij to be the (i, j)’th minor of A. Define Cij = (−1)^{i+j} Mij to be the (i, j)’th cofactor of A. Finally,
define the adjugate or classical adjoint of A to be adj(A) = CT . Thus the adjugate stores the cofactors, but in
transposed order.
Let us consider a 2 × 2 example:

adj \begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} = \begin{bmatrix} C_{11} & C_{21} \\ C_{12} & C_{22} \end{bmatrix}    (29)

Now let A be the following matrix:

A = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}    (30)
Then its adjugate is

adj(A) = \begin{bmatrix} +\begin{vmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{vmatrix} & -\begin{vmatrix} A_{12} & A_{13} \\ A_{32} & A_{33} \end{vmatrix} & +\begin{vmatrix} A_{12} & A_{13} \\ A_{22} & A_{23} \end{vmatrix} \\ -\begin{vmatrix} A_{21} & A_{23} \\ A_{31} & A_{33} \end{vmatrix} & +\begin{vmatrix} A_{11} & A_{13} \\ A_{31} & A_{33} \end{vmatrix} & -\begin{vmatrix} A_{11} & A_{13} \\ A_{21} & A_{23} \end{vmatrix} \\ +\begin{vmatrix} A_{21} & A_{22} \\ A_{31} & A_{32} \end{vmatrix} & -\begin{vmatrix} A_{11} & A_{12} \\ A_{31} & A_{32} \end{vmatrix} & +\begin{vmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{vmatrix} \end{bmatrix}    (31)
For example, entry (2,1) is obtained by “crossing out” row 1 and column 2 from A, and computing -1 times the
determinant of the resulting submatrix:
 
adj(A)_{2,1} = -\det \begin{bmatrix} A_{21} & A_{23} \\ A_{31} & A_{33} \end{bmatrix}    (32)
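To make the indexing concrete, here is a sketch of a small helper (a hypothetical function, not part of Matlab) that computes the adjugate directly from the cofactor definition; for example, it returns [4 -2; -3 1] for the input [1 2; 3 4], matching Equation 29.

function B = adjugate(A)
% Adjugate via cofactors: B(j,i) = (-1)^(i+j) * det(A with row i and column j deleted)
n = size(A,1);
C = zeros(n);
for i=1:n
  for j=1:n
    M = A([1:i-1, i+1:n], [1:j-1, j+1:n]);  % delete row i and column j
    C(i,j) = (-1)^(i+j) * det(M);           % (i,j)'th cofactor
  end
end
B = C';                                     % adjugate = transposed cofactor matrix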

4.4 Determinant of a square matrix


A square n × n matrix A can be viewed as a linear transformation that scales and rotates a vector. The determinant
of a square matrix det(A) = |A| is a measure of how much it changes a unit volume. The determinant operator
satisfies these properties, where A, B ∈ Rn×n

|A| = |A^T|    (33)
|AB| = |A| |B|    (34)
|A| = 0 ⟺ A is singular    (35)
|A^{-1}| = 1/|A|    (36)

We can define the determinant in terms of an expansion of its cofactors (see Section 4.3) along any row i or column
j as follows:
\det A = \sum_{i=1}^n a_{ij} C_{ij} \;\; \text{(expanding down any column } j\text{)} = \sum_{j=1}^n a_{ij} C_{ij} \;\; \text{(expanding along any row } i\text{)}    (37)

This is called Laplace’s expansion of the determinant. We see that the determinant is given by the inner product of
any row i of A with the corresponding row i of C. Hence

A adj(A) = det(A) I (38)


2 http://en.wikipedia.org/wiki/Adjugate_matrix.

We will return to this important formula below.
We now give some examples. Consider a 2 × 2 matrix. From Equation 29, and summing along the rows for column
1, we have

\det \begin{bmatrix} a & b \\ c & d \end{bmatrix} = a C_{11} + c C_{21} = ad - cb    (39)

Alternatively, summing along the columns for row 2 we have

\det \begin{bmatrix} a & b \\ c & d \end{bmatrix} = c C_{21} + d C_{22} = -cb + da    (40)

Now consider a 3 × 3 matrix. From Equation 31, and expanding along the first row, we have

\det(A) = a_{11} \begin{vmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{vmatrix} - a_{12} \begin{vmatrix} A_{21} & A_{23} \\ A_{31} & A_{33} \end{vmatrix} + a_{13} \begin{vmatrix} A_{21} & A_{22} \\ A_{31} & A_{32} \end{vmatrix}    (41)
        = a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33} - a_{13}a_{22}a_{31}    (42)

Note that the definition of the cofactors involves a determinant. We can make this explicit by giving the general
(recursive) formula:
|A| = \sum_{i=1}^n (-1)^{i+j} a_{ij} |\tilde{M}_{i,j}| \quad \text{(for any } j \in 1, \ldots, n\text{)}
    = \sum_{j=1}^n (-1)^{i+j} a_{ij} |\tilde{M}_{i,j}| \quad \text{(for any } i \in 1, \ldots, n\text{)}

where M̃i,j is the minor matrix obtained by deleting row i and column j from A. The base case is that |A| = a11
for A ∈ R1×1 . If we were to expand this formula completely for A ∈ Rn×n , there would be a total of n! (n factorial)
different terms. For this reason, we hardly ever explicitly write the complete equation of the determinant for matrices
bigger than 3 × 3. Instead, we will see below that there are various numerical methods for computing |A|.
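For exposition only, the recursive formula translates directly into code (a sketch; as noted above, the cost is O(n!) so this should never be used for matrices of any significant size).

function d = detLaplace(A)
% Determinant by cofactor expansion along the first row (O(n!), for exposition only)
n = size(A,1);
if n == 1
  d = A(1,1);
  return
end
d = 0;
for j=1:n
  M = A(2:n, [1:j-1, j+1:n]);              % minor: delete row 1 and column j
  d = d + (-1)^(1+j) * A(1,j) * detLaplace(M);
end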
4.5 The inverse of a square matrix
The inverse of a square matrix A ∈ Rn×n is denoted A−1 , and is the unique matrix such that

A−1 A = I = AA−1 . (43)

Note that A−1 only exists if det(A) ≠ 0. If det(A) = 0, A is called a singular matrix.
The following are properties of the inverse; all assume that A, B ∈ Rn×n are non-singular:
• (A−1 )−1 = A
• If Ax = b, we can multiply by A−1 on both sides to obtain x = A−1 b. This demonstrates how the inverse can be
used to solve the original system of linear equations we began this review with.
• (AB)−1 = B−1 A−1
• (A−1 )T = (AT )−1 . For this reason this matrix is often denoted A−T .
From Equation 38, we have that if A is non-singular, then
A^{-1} = \frac{1}{|A|} \, \mathrm{adj}(A)    (44)
While this is a nice “explicit” formula for the inverse of a matrix, we should note that, numerically, there are in fact
much more efficient ways of computing the inverse. In Matlab, we just type inv(A).

For the case of a 2 × 2 matrix, the expression for A^{-1} is simple enough to give explicitly. We have

A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \quad A^{-1} = \frac{1}{|A|} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}    (45)
For a block diagonal matrix, the inverse is obtained by simply inverting each block separately, e.g.,

\begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} & 0 \\ 0 & B^{-1} \end{bmatrix}    (46)
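A quick numerical check of Equations 44 and 45, using the matrix from the introduction (a minimal sketch).

A = [4 -5; -2 3];                      % det(A) = 2
Ainv = (1/det(A)) * [3 5; 2 4];        % adj(A)/|A| for the 2x2 case
assert(max(max(abs(Ainv - inv(A)))) < 1e-10)
assert(max(max(abs(A*Ainv - eye(2)))) < 1e-10)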
4.6 Schur complements and the matrix inversion lemma
(This section is based on [Jor07, ch13].) We will use these results in Section ??, when deriving equations for inferring
marginals and conditionals of multivariate Gaussians. We will also use special cases of these results (such as the
matrix inversion lemma) in many other places throughout the book.
Consider a general partitioned matrix

M = \begin{bmatrix} E & F \\ G & H \end{bmatrix}    (47)
where we assume E and H are invertible. The goal is to derive an expression for M−1 . If we could block diagonalize
M, it would be easier to invert. To zero out the top right block of M we can pre-multiply as follows

\begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} E & F \\ G & H \end{bmatrix} = \begin{bmatrix} E - FH^{-1}G & 0 \\ G & H \end{bmatrix}    (48)

Similarly, to zero out the bottom left block we can post-multiply as follows

\begin{bmatrix} E - FH^{-1}G & 0 \\ G & H \end{bmatrix} \begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix} = \begin{bmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{bmatrix}    (49)
Putting it all together we get

\underbrace{\begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} E & F \\ G & H \end{bmatrix}}_{M} \underbrace{\begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix}}_{Z} = \underbrace{\begin{bmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{bmatrix}}_{W}    (50)
Taking determinants we get

|X| |M| |Z| = |M| = |W| = |E - FH^{-1}G| |H|    (51)

Let us define the Schur complement of M wrt H as

M/H = E - FH^{-1}G    (52)

Then Equation 51 becomes

|M| = |M/H| |H|    (53)

So we can see that M/H acts somewhat like a division operator. We can derive the inverse of M as follows

Z^{-1} M^{-1} X^{-1} = W^{-1}    (54)
M^{-1} = Z W^{-1} X    (55)
hence

\begin{bmatrix} E & F \\ G & H \end{bmatrix}^{-1} = \begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix} \begin{bmatrix} (M/H)^{-1} & 0 \\ 0 & H^{-1} \end{bmatrix} \begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix}    (56)

= \begin{bmatrix} (M/H)^{-1} & 0 \\ -H^{-1}G(M/H)^{-1} & H^{-1} \end{bmatrix} \begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix}    (57)

= \begin{bmatrix} (M/H)^{-1} & -(M/H)^{-1}FH^{-1} \\ -H^{-1}G(M/H)^{-1} & H^{-1} + H^{-1}G(M/H)^{-1}FH^{-1} \end{bmatrix}    (58)

Alternatively, we could have decomposed the matrix M in terms of E and M/E = (H − GE^{-1}F), yielding

\begin{bmatrix} E & F \\ G & H \end{bmatrix}^{-1} = \begin{bmatrix} E^{-1} + E^{-1}F(M/E)^{-1}GE^{-1} & -E^{-1}F(M/E)^{-1} \\ -(M/E)^{-1}GE^{-1} & (M/E)^{-1} \end{bmatrix}    (59)

Equating the top left blocks of these two expressions yields the following formula:

(M/H)^{-1} = E^{-1} + E^{-1}F(M/E)^{-1}GE^{-1}    (60)

Simplifying this leads to the widely used matrix inversion lemma or the Sherman-Morrison-Woodbury formula:

(E - FH^{-1}G)^{-1} = E^{-1} + E^{-1}F(H - GE^{-1}F)^{-1}GE^{-1}    (62)

In the special case that H = −1 (a scalar), F = u (a column vector), and G = v^T (a row vector), we get the following
formula for the rank one update of an inverse matrix

(E + uv^T)^{-1} = E^{-1} + E^{-1}u(-1 - v^T E^{-1}u)^{-1}v^T E^{-1}    (63)
              = E^{-1} - \frac{E^{-1}uv^T E^{-1}}{1 + v^T E^{-1}u}    (64)
We can derive another useful formula by equating the top right blocks of Equations 58 and 59 to get

-(M/H)^{-1}FH^{-1} = -E^{-1}F(M/E)^{-1}    (65)
(E - FH^{-1}G)^{-1}FH^{-1} = E^{-1}F(H - GE^{-1}F)^{-1}    (66)
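Both the Woodbury identity (Equation 62) and the rank-one special case (Equation 64) are easy to verify numerically; the following sketch uses small random perturbations of the identity to keep everything well conditioned.

E = eye(4) + 0.1*randn(4); H = eye(3) + 0.1*randn(3);
F = 0.1*randn(4,3); G = 0.1*randn(3,4);
lhs = inv(E - F*inv(H)*G);
rhs = inv(E) + inv(E)*F*inv(H - G*inv(E)*F)*G*inv(E);
assert(max(max(abs(lhs - rhs))) < 1e-8)

u = 0.1*randn(4,1); v = 0.1*randn(4,1);              % rank one update
lhs = inv(E + u*v');
rhs = inv(E) - (inv(E)*u*v'*inv(E)) / (1 + v'*inv(E)*u);
assert(max(max(abs(lhs - rhs))) < 1e-8)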

5 Special kinds of matrices


5.1 Symmetric Matrices
A square matrix is called Hermitian if A = A∗ , where A∗ is the transpose of the complex conjugate of A. A matrix
is symmetric if it is real and Hermitian, and hence A = AT . In the rest of this section, we shall assume A is real.
A matrix is anti-symmetric if A = −AT . It is easy to show that for any matrix A ∈ Rn×n , the matrix A + AT
is symmetric and the matrix A − AT is anti-symmetric. From this it follows that any square matrix A ∈ Rn×n can
be represented as a sum of a symmetric matrix and an anti-symmetric matrix, since
A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T)    (67)
and the first matrix on the right is symmetric, while the second is anti-symmetric. It turns out that symmetric matrices
occur a great deal in practice, and they have many nice properties which we will look at shortly. It is common to
denote the set of all symmetric matrices of size n as Sn , so that A ∈ Sn means that A is a symmetric n × n matrix.
5.2 Orthogonal Matrices
Two vectors x, y ∈ Rn are orthogonal if xT y = 0. A vector x ∈ Rn is normalized if ||x||_2 = 1. A square matrix
U ∈ Rn×n is orthogonal (note the different meaning of the term orthogonal when talking about vectors versus
matrices) if all its columns are orthogonal to each other and are normalized (the columns are then referred to as being
orthonormal).
It follows immediately from the definition of orthogonality and normality that

UT U = I = UUT . (68)

In other words, the inverse of an orthogonal matrix is its transpose. Note that if U is not square — i.e., U ∈
Rm×n , n < m — but its columns are still orthonormal, then UT U = I, but UUT ≠ I. We generally only
use the term orthogonal to describe the previous case, where U is square.

Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix will not
change its Euclidean norm, i.e.,
||Ux||_2 = ||x||_2    (69)
for any x ∈ Rn , U ∈ Rn×n orthogonal. Similarly, one can show that the angle between two vectors is preserved after
they are transformed by an orthogonal matrix. In summary, transformations by orthogonal matrices are generalizations
of rotations and reflections, since they preserve lengths and angles. Computationally, orthogonal matrices are very
desirable because they do not magnify roundoff or other kinds of errors.
An example of an orthogonal matrix is a rotation matrix. A rotation in 3d by angle α about the z axis is given by
 
R(α) = \begin{bmatrix} \cos(α) & -\sin(α) & 0 \\ \sin(α) & \cos(α) & 0 \\ 0 & 0 & 1 \end{bmatrix}    (70)

If α = 45°, this becomes

R(45) = \begin{bmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \end{bmatrix}    (71)

where 1/\sqrt{2} ≈ 0.7071. We see that R(−α) = R(α)^T, so this is an orthogonal matrix.
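We can check both claims numerically (a minimal sketch).

a = pi/4;                                       % 45 degrees
R = [cos(a) -sin(a) 0; sin(a) cos(a) 0; 0 0 1];
assert(max(max(abs(R'*R - eye(3)))) < 1e-10)    % columns are orthonormal
x = randn(3,1);
assert(abs(norm(R*x) - norm(x)) < 1e-10)        % the Euclidean norm is preserved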
5.3 Positive definite Matrices
Given a square matrix A ∈ R^{n×n} and a vector x ∈ R^n, the scalar value x^T Ax is called a quadratic form. Written
explicitly, we see that

x^T Ax = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j.    (72)

Note that,

x^T Ax = (x^T Ax)^T = x^T A^T x = x^T \left( \tfrac{1}{2}A + \tfrac{1}{2}A^T \right) x    (73)
i.e., only the symmetric part of A contributes to the quadratic form. For this reason, we often implicitly assume that
the matrices appearing in a quadratic form are symmetric.
We give the following definitions:
• A symmetric matrix A ∈ Sn is positive definite (pd) iff for all non-zero vectors x ∈ Rn , xT Ax > 0. This is
usually denoted A ≻ 0 (or just A > 0), and often times the set of all positive definite matrices is denoted Sn++ .
• A symmetric matrix A ∈ Sn is positive semidefinite (PSD) iff for all vectors x ∈ Rn , xT Ax ≥ 0. This is written A ⪰ 0
(or just A ≥ 0), and the set of all positive semidefinite matrices is often denoted Sn+ .
• Likewise, a symmetric matrix A ∈ Sn is negative definite (ND), denoted A ≺ 0 (or just A < 0), iff for all
non-zero x ∈ Rn , xT Ax < 0.
• Similarly, a symmetric matrix A ∈ Sn is negative semidefinite (NSD), denoted A ⪯ 0 (or just A ≤ 0), iff for
all x ∈ Rn , xT Ax ≤ 0.
• Finally, a symmetric matrix A ∈ Sn is indefinite, if it is neither positive semidefinite nor negative semidefinite
— i.e., if there exists x1 , x2 ∈ Rn such that xT1 Ax1 > 0 and xT2 Ax2 < 0.
It should be obvious that if A is positive definite, then −A is negative definite and vice versa. Likewise, if A is
positive semidefinite then −A is negative semidefinite and vice versa. If A is indefinite, then so is −A. It can also be
shown that positive definite and negative definite matrices are always invertible.
Finally, there is one type of positive definite matrix that comes up frequently, and so deserves some special mention.
Given any matrix A ∈ Rm×n (not necessarily symmetric or even square), the Gram matrix G = AT A is always

positive semidefinite. Further, if m ≥ n (and we assume for convenience that A is full rank), then G = AT A is
positive definite.
Note that if all elements of A are positive, it does not mean A is necessarily pd. For example,

A = \begin{bmatrix} 4 & 3 \\ 3 & 2 \end{bmatrix}    (74)

is not pd. Conversely, a pd matrix can have negative entries, e.g.,

A = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}    (75)

A sufficient condition for a (real, symmetric) matrix to be pd is that it is diagonally dominant, i.e., if in every row
of the matrix, the magnitude of the diagonal entry in that row is larger than the sum of the magnitudes of all the other
(non-diagonal) entries in that row. More precisely,
|a_{ii}| > \sum_{j \neq i} |a_{ij}| \quad \text{for all } i    (76)
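These examples are easy to check in Matlab by inspecting the eigenvalues (a minimal sketch; positive definiteness of a symmetric matrix is equivalent to all eigenvalues being positive, as shown in Section 7.2).

eig([4 3; 3 2])               % one negative eigenvalue: not pd, despite positive entries
eig([2 -1; -1 2])             % eigenvalues 1 and 3: pd, despite negative entries
A = [3 1 1; 1 4 2; 1 2 5];    % symmetric and diagonally dominant, hence pd
all(eig(A) > 0)               % returns 1 (true)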

6 Properties of matrices
6.1 Linear independence and rank
A set of vectors {x1 , x2 , . . . xn } is said to be (linearly) independent if no vector can be represented as a linear
combination of the remaining vectors. Conversely, a vector which can be represented as a linear combination of the
remaining vectors is said to be (linearly) dependent. For example, if
x_n = \sum_{i=1}^{n-1} \alpha_i x_i    (77)

for some {α1 , . . . , αn−1 } then xn is dependent on {x1 , . . . , xn−1 }; otherwise, it is independent of {x1 , . . . , xn−1 }.
The column rank of a matrix A is the largest number of columns of A that constitute a linearly independent set. In
the same way, the row rank is the largest number of rows of A that constitute a linearly independent set.
It is a basic fact of linear algebra that for any matrix A, columnrank(A) = rowrank(A), and so this quantity is
simply referred to as the rank of A, denoted as rank(A). The following are some basic properties of the rank (illustrated in the sketch after this list):
• For A ∈ Rm×n , rank(A) ≤ min(m, n). If rank(A) = min(m, n), then A is said to be full rank, otherwise it
is called rank deficient.
• For A ∈ Rm×n , rank(A) = rank(AT ).
• For A ∈ Rm×n , B ∈ Rn×p , rank(AB) ≤ min(rank(A), rank(B)).
• For A, B ∈ Rm×n , rank(A + B) ≤ rank(A) + rank(B).
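Here is the sketch referred to above, illustrating these rank properties in Matlab.

A = [1 2; 2 4; 3 6];          % second column is twice the first
rank(A)                       % 1: rank deficient
B = randn(3,2);
rank(B)                       % 2 (with probability 1): full rank
rank(A*B')                    % at most min(rank(A), rank(B)) = 1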
6.2 Range and nullspace of a matrix
The span of a set of vectors {x1 , x2 , . . . xn } is the set of all vectors that can be expressed as a linear combination of
{x1 , . . . , xn }. That is,

span({x_1, . . . , x_n}) = \left\{ v : v = \sum_{i=1}^n \alpha_i x_i, \; \alpha_i \in \mathbb{R} \right\}.    (78)

It can be shown that if {x1 , . . . , xn } is a set of n linearly independent vectors, where each xi ∈ Rn , then span({x1 , . . . xn }) =
Rn . In other words, any vector v ∈ Rn can be written as a linear combination of x1 through xn .
The dimension of some subspace S, dim(S), is defined to be d if there exists some spanning set {b1 , . . . , bd } that
is linearly independent.

Figure 1: Visualization of the nullspace and range of an m × n matrix A. Here y1 = Ax1 , so y1 is in the range (is reachable);
similarly for y2 . Also Ax3 = 0, so x3 is in the nullspace (gets mapped to 0); similarly for x4 .

The range (sometimes also called the column space) of a matrix A ∈ Rm×n is the span of the columns of A.
In other words,
range(A) = {v ∈ Rm : v = Ax, x ∈ Rn }. (79)
This can be thought of as the set of vectors that can be “reached” or “generated” by A. The nullspace of a matrix
A ∈ Rm×n is the set of all vectors that get mapped to the null vector when multiplied by A, i.e.,
nullspace(A) = {x ∈ Rn : Ax = 0}. (80)
See Figure 1 for an illustration of the range and nullspace of a matrix. We shall discuss how to compute the range and
nullspace of a matrix numerically in Section 12.1 below.
6.3 Norms of vectors
A norm of a vector ||x|| is informally a measure of the “length” of the vector. For example, we have the commonly-used
Euclidean or ℓ2 norm,

||x||_2 = \sqrt{\sum_{i=1}^n x_i^2}.    (81)

Note that ||x||_2^2 = x^T x.

More formally, a norm is any function f : Rn → R that satisfies 4 properties:


1. For all x ∈ Rn , f (x) ≥ 0 (non-negativity).
2. f (x) = 0 if and only if x = 0 (definiteness).
3. For all x ∈ Rn , t ∈ R, f (tx) = |t|f (x) (homogeneity).
4. For all x, y ∈ Rn , f (x + y) ≤ f (x) + f (y) (triangle inequality).
Other examples of norms are the ℓ1 norm,

||x||_1 = \sum_{i=1}^n |x_i|    (82)

and the ℓ∞ norm,

||x||_∞ = \max_i |x_i|.    (83)
In fact, all three norms presented so far are examples of the family of ℓp norms, which are parameterized by a real
number p ≥ 1, and defined as

||x||_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}.    (84)
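In Matlab, all of these vector norms are available through the norm function (a minimal sketch).

x = [3; -4; 0];
norm(x, 2)      % Euclidean (l2) norm: 5
norm(x, 1)      % l1 norm: 7
norm(x, Inf)    % l-infinity norm: 4
norm(x, 3)      % a general lp norm, here p = 3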

6.4 Matrix norms and condition numbers
(The following section is based on [Van06, p93].) Assume A is a non-singular matrix, and Ax = b, so x = A−1 b is
a unique solution. If we change b to b + ∆b, then
A(x + ∆x) = b + ∆b (85)
so the new solution is x + ∆x where
∆x = A−1 ∆b (86)
We say that A is well-conditioned if a small ∆b results in a small ∆x; otherwise we say that A is ill-conditioned.
For example, suppose
   
A = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 + 10^{-10} & 1 - 10^{-10} \end{bmatrix}, \quad A^{-1} = \begin{bmatrix} 1 - 10^{10} & 10^{10} \\ 1 + 10^{10} & -10^{10} \end{bmatrix}    (87)

The solution for b = (1, 1) is x = (1, 1). If we change b by ∆b, the solution changes to

∆x = A^{-1} ∆b = \begin{bmatrix} ∆b_1 - 10^{10}(∆b_1 - ∆b_2) \\ ∆b_1 + 10^{10}(∆b_1 - ∆b_2) \end{bmatrix}    (88)
So a small change in b can lead to an extremely large change in x.
In general, we can think of a matrix A as defining a linear function f (x) = Ax. The expression ||Ax||/||x||
gives the amplification factor or gain of the system in the direction x. We define the norm of A as the maximum gain
(over all directions):

||A|| = \max_{x \neq 0} \frac{||Ax||}{||x||} = \max_{||x||=1} ||Ax||    (89)

This is sometimes called the 2-norm of a matrix. For example, if A = diag(a1 , . . . , an ), then ||A|| = max_i |a_i |. In
general, we have to use numerical methods to compute ||A||; in Matlab, we can just type norm(A).
Other matrix norms exist. For example, the Frobenius norm of a matrix is defined as
||A||_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n a_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)}    (90)

Note that this is the same as the 2-norm of A(:), the vector gotten by stacking all the columns of A together. In Matlab,
just type norm(A,’fro’).
We now show how the norm of a matrix can be used to quantify how well-conditioned a linear system is. Since
∆x = A−1 ∆b, one can show the following upper bound on the absolute error

||∆x|| ≤ ||A^{-1}|| ||∆b||    (91)

So large ||A^{-1}|| means ||∆x|| can be large even when ||∆b|| is small. Also, since ||b|| ≤ ||A|| ||x||, we have
||x|| ≥ ||b||/||A||, and hence

\frac{||∆x||}{||x||} ≤ \frac{||A^{-1}|| \, ||∆b||}{||x||} ≤ ||A|| \, ||A^{-1}|| \frac{||∆b||}{||b||}    (92)
thus establishing an upper bound on the relative error. The term
κ(A) = ||A||||A−1 || (93)
is called the condition number of A. Large κ(A) means ||∆x||/||x|| can be large even when ||∆b||/||b|| is small.
For the 2-norm case, the condition number is equal to the ratio of the largest to smallest singular values: κ(A) =
σmax /σmin . In Matlab, just type cond(A).
We say A is well conditioned if κ(A) is small (close to 1), and ill-conditioned if κ(A) is large. A large condition
number means A is nearly singular. This is a better measure of nearness to singularity than the size of the determinant.
For example, suppose A = 0.1I100×100 . Then det(A) = 10−100 , but κ(A) = 1, reflecting the fact that Ax simply
scales the entries of x by 0.1, and thus is well-behaved.
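The ill-conditioned example from Equation 87 can be explored numerically, as in the sketch below (the exact digits depend on floating-point rounding).

A = 0.5*[1 1; 1+1e-10 1-1e-10];
cond(A)                        % roughly 2e10: ill-conditioned
b  = [1; 1];  x = A \ b;       % x is approximately (1, 1)
db = [1e-5; 0];                % a tiny perturbation of b
dx = A \ (b + db) - x;
norm(dx)/norm(x)               % huge, even though norm(db)/norm(b) is about 7e-6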

14
7 Eigenvalues and eigenvectors
7.1 Square matrices
Given a square matrix A ∈ Rn×n , we say that λ ∈ C is an eigenvalue of A and x ∈ Cn is the corresponding
eigenvector3 if
Ax = λx, \quad x ≠ 0.    (94)
Intuitively, this definition means that multiplying A by the vector x results in a new vector that points in the same
direction as x, but scaled by a factor λ. Also note that for any eigenvector x ∈ Cn and scalar c ∈ C, A(cx) = cAx =
cλx = λ(cx), so cx is also an eigenvector. For this reason when we talk about “the” eigenvector associated with λ,
we usually assume that the eigenvector is normalized to have length 1 (this still creates some ambiguity, since x and
−x will both be eigenvectors, but we will have to live with this).
We can rewrite the equation above to state that (λ, x) is an eigenvalue-eigenvector pair of A if,
(λI − A)x = 0, \quad x ≠ 0.    (95)

But (λI − A)x = 0 has a non-zero solution x if and only if (λI − A) has a non-trivial nullspace, which is only the
|(λI − A)| = 0 . (96)
We can now use the previous definition of the determinant to expand this expression into a (very large) polynomial
in λ, where λ will have maximum degree n; this is called the characteristic polynomial. We then find the n (possibly
complex) roots of this polynomial to find the n eigenvalues λ1 , . . . , λn . To find the eigenvector corresponding to
the eigenvalue λi , we simply solve the linear equation (λi I − A)x = 0. It should be noted that this is not the
method which is actually used in practice to numerically compute the eigenvalues and eigenvectors (remember that
the complete expansion of the determinant has n! terms); it is rather a mathematical argument.
The following are properties of eigenvalues and eigenvectors (in all cases assume A ∈ Rn×n has eigenvalues
λ1 , . . . , λn and associated eigenvectors x1 , . . . xn ):
• The trace of A is equal to the sum of its eigenvalues,

tr A = \sum_{i=1}^n λ_i.    (97)

• The determinant of A is equal to the product of its eigenvalues,

|A| = \prod_{i=1}^n λ_i.    (98)

• The rank of A is equal to the number of non-zero eigenvalues of A.


• If A is non-singular then 1/λi is an eigenvalue of A−1 with associated eigenvector xi , i.e., A−1 xi = (1/λi )xi .
• The eigenvalues of a diagonal matrix D = diag(d1 , . . . dn ) are just the diagonal entries d1 , . . . dn .
We can write all the eigenvector equations simultaneously as
AX = XΛ (99)
where the columns of X ∈ Rn×n are the eigenvectors of A and Λ is a diagonal matrix whose entries are the eigenvalues of A, i.e.,

X = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \in \mathbb{R}^{n \times n}, \quad Λ = \mathrm{diag}(λ_1, . . . , λ_n).    (100)
3 Note that λ and the entries of x are actually in C, the set of complex numbers, not just the reals; we will see shortly why this is necessary.
Don’t worry about this technicality for now, you can think of complex vectors in the same way as real vectors.

If the eigenvectors of A are linearly independent, then the matrix X will be invertible, so
A = XΛX−1 . (101)
A matrix that can be written in this form is called diagonalizable.
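The following sketch illustrates Equations 97, 98 and 101 numerically, using a random symmetric matrix so that the eigendecomposition is guaranteed to exist and be real.

A = randn(4); A = A + A';                          % random symmetric matrix
[X, Lam] = eig(A);
assert(max(max(abs(A - X*Lam*inv(X)))) < 1e-8)     % A = X*Lambda*inv(X)
assert(abs(trace(A) - sum(diag(Lam))) < 1e-8)      % trace = sum of eigenvalues
assert(abs(det(A) - prod(diag(Lam))) < 1e-8)       % determinant = product of eigenvalues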
7.2 Symmetric matrices
Two remarkable properties come about when we look at the eigenvalues and eigenvectors of a symmetric matrix A ∈
Sn . First, it can be shown that all the eigenvalues of A are real. Secondly, the eigenvectors of A can be chosen to be orthonormal, i.e.,
uTi uj = 0 if i ≠ j, and uTi ui = 1, where ui are the eigenvectors. In matrix form, this becomes UT U = UUT = I
and |U| = ±1. We can therefore represent A as
A = UΛU^T = \sum_{i=1}^n λ_i u_i u_i^T    (102)

remembering from above that the inverse of an orthogonal matrix is just its transpose. We can write this symbolically
as
  
A = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix} \begin{bmatrix} λ_1 & & & \\ & λ_2 & & \\ & & \ddots & \\ & & & λ_n \end{bmatrix} \begin{bmatrix} u_1^T \\ u_2^T \\ \vdots \\ u_n^T \end{bmatrix}    (103)
  = λ_1 u_1 u_1^T + \cdots + λ_n u_n u_n^T    (104)
As an example of diagonalization, suppose we are given the following real, symmetric matrix

A = \begin{bmatrix} 1.5 & -0.5 & 0 \\ -0.5 & 1.5 & 0 \\ 0 & 0 & 3 \end{bmatrix}    (105)
We can diagonalize this in Matlab as follows.
Listing 1: Part of rotationDemo
[U,D]=eig(A)
U =
-0.7071 -0.7071 0
-0.7071 0.7071 0
0 0 1.0000
D =
1 0 0
0 2 0
0 0 3

We recognize from Equation 71 that U corresponds to rotation by 45 degrees, and D corresponds to scaling by
diag(1, 2, 3). So the whole matrix is equivalent to a rotation by 45◦ , followed by scaling, followed by rotation by
−45◦ . Note that there is a sign ambiguity for each column of U, but the resulting answer is always correct:
Listing 2: Output of rotationDemo
>> U*D*U’
ans =
1.5000 -0.5000 0
-0.5000 1.5000 0
0 0 3.0000

We can also use the diagonalization property to show that the definiteness of a matrix depends entirely on the sign
of its eigenvalues. Suppose A ∈ Sn = UΛUT . Then
x^T Ax = x^T UΛU^T x = y^T Λ y = \sum_{i=1}^n λ_i y_i^2    (106)

where y = UT x (and since U is full rank, any vector y ∈ Rn can be represented in this form). Because yi2 is always
positive, the sign of this expression depends entirely on the λi ’s. If all λi > 0, then the matrix is positive definite;
if all λi ≥ 0, it is positive semidefinite. Likewise, if all λi < 0 or λi ≤ 0, then A is negative definite or negative
semidefinite respectively. Finally, if A has both positive and negative eigenvalues, it is indefinite.
7.3 Covariance matrices
An important application of eigenvectors is to visualizing the probability density function of a multivariate Gaus-
sian, also called a multivariate Normal (MVN). We study this in more detail in Section ??, but we introduce it here,
since it serves as a good illustration of many ideas in linear algebra. The density is defined as follows:

N(x|µ, Σ) \;\overset{def}{=}\; \frac{1}{(2π)^{d/2} |Σ|^{1/2}} \exp[-\tfrac{1}{2}(x - µ)^T Σ^{-1} (x - µ)]    (107)

where d is the dimensionality of x, µ = E[X] is the d × 1 mean vector, and Σ is a d × d symmetric and positive
definite covariance matrix, defined by
Cov(X) = Σ \;\overset{def}{=}\; E[(X − E X)(X − E X)^T]    (108)
        = \begin{bmatrix} Var(X_1) & Cov(X_1, X_2) & \cdots & Cov(X_1, X_d) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(X_d, X_1) & Cov(X_d, X_2) & \cdots & Var(X_d) \end{bmatrix}    (109)

If d = 1, this reduces to the familiar univariate Gaussian density

N(x|µ, σ^2) \;\overset{def}{=}\; \frac{1}{(2π)^{1/2} σ} \exp[-\tfrac{1}{2σ^2}(x − µ)^2]    (110)

The expression inside the exponent of the MVN is a quadratic form called the Mahalanobis distance, and is a scalar
value equal to
(x − µ)T Σ−1 (x − µ) = xT Σ−1 x + µT Σ−1 µ − 2µT Σ−1 x (111)
Intuitively, this just measures the distance of point x from µ, where each dimension gets weighted differently. In the
case of a diagonal covariance, it becomes a weighted Euclidean distance:
∆ = (x − µ)^T Σ^{-1} (x − µ) = \sum_{i=1}^d (x_i − µ_i)^2 Σ_{ii}^{-1}    (112)

So if Σii is very large, meaning dimension i has high variance, then it will be downweighted (since we use 1/Σii )
when comparing x to µ. This is a form of feature selection.
Figure 2 plots some MVN densities in 2d for three different kinds of covariance matrices.4 A full covariance matrix
has d(d + 1)/2 parameters (we divide by 2 since it is symmetric). A diagonal covariance matrix has d parameters. A
spherical or isotropic covariance, Σ = σ 2 Id , has one free parameter.
The equation ∆ = const defines an ellipsoid; these ellipsoids are the level sets of constant probability density. We see that
a general covariance matrix has elliptical contours; for a diagonal covariance, the ellipse is axis aligned; and for a
spherical covariance, the ellipse is a circle (same “spread” in all directions). We now explain why the contours have
this form, using eigenanalysis.
Letting Σ = UΛUT we find
Σ^{-1} = U^{-T} Λ^{-1} U^{-1} = U Λ^{-1} U^T = \sum_{i=1}^p \frac{1}{λ_i} u_i u_i^T    (113)

4 These plots were made by creating a dense grid of points using meshgrid, evaluating the pdf on each grid cell using mvnpdf, and then using

the surfc and contour commands. See gaussPlot2dDemo for details.

[Figure 2 panels: surface plots of p(x,y) and the corresponding contour plots in the (x, y) plane, for Full, S=[2.0 1.5 ; 1.5 4.0], ρ=0.53; Diagonal, S=[1.2 0.0 ; 0.0 4.8], ρ=0.00; Spherical, S=[1.0 0.0 ; 0.0 1.0], ρ=0.00.]

Figure 2: Visualization of 2 dimensional Gaussian densities. Left: a full covariance matrix Σ. Middle: Σ has been decorrelated
(see Section ??). Right: Σ has been whitened (see Section 7.3.1). This figure was produced by gaussPlot2dDemo.

Hence
(x − µ)^T Σ^{-1} (x − µ) = (x − µ)^T \left( \sum_{i=1}^p \frac{1}{λ_i} u_i u_i^T \right) (x − µ)    (114)
                        = \sum_{i=1}^p \frac{1}{λ_i} (x − µ)^T u_i u_i^T (x − µ) = \sum_{i=1}^p \frac{y_i^2}{λ_i}    (115)

where y_i \overset{def}{=} u_i^T (x − µ). The y variables define a new coordinate system that is shifted (by µ) and rotated (by U)
with respect to the original x coordinates: y = U^T(x − µ).
Recall that the equation for an ellipse in 2D is
\frac{y_1^2}{λ_1} + \frac{y_2^2}{λ_2} = 1    (116)
Hence we see that the contours of equal probability density of a Gaussian lie along ellipses: see Figure 3.
This gives us a fast way to plot a 2d Gaussian, without having to evaluate the density on a grid of points and use
the contour command. If X is a matrix of points on the circle, then Y = UΛ^{1/2}X is a matrix of points on the ellipse
represented by Σ = UΛU^T. To see this, we use the result

Cov[Ax] = A Cov[x] A^T    (117)

to conclude

Cov[y] = UΛ^{1/2} Cov[x] Λ^{1/2} U^T = UΛ^{1/2} Λ^{1/2} U^T = Σ    (118)

where Λ^{1/2} = \mathrm{diag}(\sqrt{Λ_{ii}}). We can implement this in Matlab as follows.5
5 We can explain the k = 6 term as follows. We use the fact that the Mahalanobis distance is a sum of squares of d Gaussian random variables,
to conclude that ∆ ∼ χ2_d , which is the χ-squared distribution with d degrees of freedom (see Section ??). Hence we can find the value of ∆
that corresponds to a 95% confidence interval by using chi2inv(0.95, 2), which is approximately 6. So by setting the squared radius to 6, we will
enclose 95% of the probability mass.


Figure 3: Visualization of a 2 dimensional Gaussian density. The major and minor axes of the ellipse are defined by the first two
eigenvectors of the covariance matrix, namely u1 and u2 . Source: [Bis06] Figure 2.7.

Listing 3: gaussPlot2d
function h=gaussPlot2d(mu, Sigma, color)
[U, D] = eig(Sigma);
n = 100;
t = linspace(0, 2*pi, n);
xy = [cos(t); sin(t)];
k = sqrt(6); %k = sqrt(chi2inv(0.95, 2));
w = (k * U * sqrt(D)) * xy;
z = repmat(mu, [1 n]) + w;
h = plot(z(1, :), z(2, :), color);
axis(’equal’);

7.3.1 Whitening data


The above plotting routine took a spherical covariance matrix and converted it into an ellipse. We can also go in the
opposite direction. Let x ∼ N (µ, Σ), where Σ = UΛUT . Decorrelation is a linear operation which removes any
off-diagonal entries from Σ. Define y = UT x. Then

Cov[y] = UT ΣU = UT UΛUT U = Λ (119)

which is a diagonal matrix.


We can go further and convert this to a spherical matrix, by making the diagonal entries be 1 (unit variance). This
is called whitening or sphering the data. Let

y = Λ^{-1/2} U^T x    (120)

Then

Cov[y] = Λ^{-1/2} U^T Σ U Λ^{-1/2} = Λ^{-1/2} U^T (UΛU^T) U Λ^{-1/2} = Λ^{-1/2} Λ Λ^{-1/2} = I    (121)

See Figure 2 for an example. In Matlab, we can implement this as follows

Listing 4: Part of gaussPlot2dDemo


mu = [1 0]’;
S = [2 1.5; 1.5 4];
[U,D] = eig(S);
A = sqrt(inv(D))*U’; % whitening transform (without centering)
mu2 = A*mu;
S2 = A*S*A’;
assert(approxeq(S2, eye(2)))

Finally, we can whiten the data and center it, by subtracting off the mean.
y = Λ^{-1/2} U^T (x − µ)    (122)
E[y] = Λ^{-1/2} U^T (µ − µ) = 0    (123)

7.3.2 Standardizing data


Standardizing data is a simple preprocessing step that is a weaker form of whitening, which makes each dimension
be zero mean and have unit variance, but which does not remove any correlation (since it operates on each dimension
separately). Standardized data is often represented by Z. Symbolically we have
z_{ij} = \frac{x_{ij} − \bar{x}_{\cdot j}}{σ_j}    (124)

We can do this in MATLAB using the MLAPA function standardize, or the Statistics function zscore.

Listing 5: standardize
function [Z, mu, sigma] = standardize(X, mu, sigma)
X = double(X);
if nargin < 2
mu = mean(X);
sigma = std(X);
ndx = find(sigma < eps);
sigma(ndx) = 1;
end
[n d] = size(X);
Z = X - repmat(mu, [n 1]);
Z = Z ./ repmat(sigma, [n 1]);

Of course, in a supervised learning setting, we must apply the same transformation to the test data. Thus it is very
common to see the following idiom (we will see examples in Chapter ??):

Listing 6: :
[Xtrain, mu, sigma] = standardize(Xtrain);
Xtrain = [ones(size(Xtrain,1), 1) Xtrain]; % prepend a column of 1s after standardizing
w = fitModel(Xtrain); % plug in appropriate learning procedure
Xtest = [ones(size(Xtest,1),1) standardize(Xtest, mu, sigma)];
ypred = testModel(Xtest, w); % plug in appropriate test procedure

Alternatively, we can rescale the data, e.g., to 0:1 or -1:1, using the rescaleData function. We must rescale the
test data in the same way, as in the example below.

Listing 7: :
xTrain = -10:1:10;
[zTrain, minx, rangex] = rescaleData(xTrain, -1, 1);% zTrain= -1:1
xTest = [-11, 8];
[zTest] = rescaleData(xTest, -1, 1, minx, rangex) % [-1.1, 0.8]

8 Matrix calculus
While the topics in the previous sections are typically covered in a standard course on linear algebra, one topic that
does not seem to be covered very often (and which we will use extensively) is the extension of calculus to the vector
setting. Despite the fact that all the actual calculus we use is relatively trivial, the notation can often make things look
much more difficult than they are. In this section we present some basic definitions of matrix calculus and provide a
few examples. We concentrate on functions that map vectors / matrices to scalars, i.e., their output is scalar valued.
8.1 The gradient of a function wrt a matrix
Suppose that f : Rm×n → R is a function that takes as input a matrix A of size m × n and returns a real value. Then
the gradient of f (with respect to A ∈ Rm×n ) is the matrix of partial derivatives, defined as:

 
∇_A f(A) \in \mathbb{R}^{m \times n} = \begin{bmatrix} \frac{∂f(A)}{∂A_{11}} & \frac{∂f(A)}{∂A_{12}} & \cdots & \frac{∂f(A)}{∂A_{1n}} \\ \frac{∂f(A)}{∂A_{21}} & \frac{∂f(A)}{∂A_{22}} & \cdots & \frac{∂f(A)}{∂A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{∂f(A)}{∂A_{m1}} & \frac{∂f(A)}{∂A_{m2}} & \cdots & \frac{∂f(A)}{∂A_{mn}} \end{bmatrix}    (125)

i.e., an m × n matrix with


(∇_A f(A))_{ij} = \frac{∂f(A)}{∂A_{ij}}.    (126)
We give some examples below (based on [Jor07, ch13]). We will use these examples when computing the maximum
likelihood estimate of the covariance matrix for a multivariate Gaussian in Section ??.
8.1.1 Derivative of trace and quadratic forms
Recall that

(AB)_{km} = \sum_{ℓ} a_{kℓ} b_{ℓm}    (127)

and hence that

\mathrm{tr}(AB) = \sum_k (AB)_{kk} = \sum_k \sum_{ℓ} a_{kℓ} b_{ℓk}    (128)

Hence

\frac{∂}{∂a_{ij}} \mathrm{tr}(AB) = \frac{∂}{∂a_{ij}} \sum_k \sum_{ℓ} a_{kℓ} b_{ℓk} = b_{ji}    (129)

so

∇_A \mathrm{tr}(AB) = B^T    (130)
We can use this, plus the trace trick, to compute the derivative of a quadratic form wrt a matrix:

∇A xT Ax = ∇A tr(xT Ax) = ∇A tr(xxT A) = (xxT )T = xxT (131)
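These identities can be checked against finite differences, as in this sketch.

A = randn(3); B = randn(3);
h = 1e-6; G = zeros(3);
for i=1:3
  for j=1:3
    Ap = A; Ap(i,j) = Ap(i,j) + h;
    G(i,j) = (trace(Ap*B) - trace(A*B)) / h;   % numerical partial derivative wrt A(i,j)
  end
end
assert(max(max(abs(G - B'))) < 1e-4)           % matches grad_A tr(AB) = B'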

8.1.2 Derivative of a (log) determinant


Recall from Equation 37 that |A| = \sum_j a_{ij} C_{ij} for any row i, where C_{ij} is the cofactor. Hence

\frac{∂}{∂a_{ij}} |A| = C_{ij}    (132)

since C_{ij} does not contain any terms involving a_{ij} (we deleted row i and column j when forming it). Hence

∇_A |A| = \mathrm{adj}(A)^T = |A| A^{-T}    (133)

Now we can use the chain rule to see that

\frac{∂ \log |A|}{∂A_{ij}} = \frac{∂ \log |A|}{∂|A|} \frac{∂|A|}{∂A_{ij}} = \frac{1}{|A|} \frac{∂|A|}{∂A_{ij}}.    (134)

Hence

∇_A \log |A| = \frac{1}{|A|} ∇_A |A| = A^{-T}    (135)
Note the similarity to the scalar case, where ∂/(∂x) log x = 1/x.
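A finite-difference check of Equation 135 (a minimal sketch; A is constructed to be positive definite so that log |A| is well defined).

A = eye(3) + 0.1*randn(3); A = A*A';           % random positive definite matrix, so |A| > 0
h = 1e-6; G = zeros(3);
for i=1:3
  for j=1:3
    Ap = A; Ap(i,j) = Ap(i,j) + h;
    G(i,j) = (log(det(Ap)) - log(det(A))) / h;
  end
end
assert(max(max(abs(G - inv(A)'))) < 1e-3)      % matches grad_A log|A| = inv(A)'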

8.2 The gradient of a function wrt a vector
Note that the size of ∇A f (A) is always the same as the size of A. So if, in particular, A is just a vector x ∈ Rn , we
have

∇_x f(x) = \begin{bmatrix} \frac{∂f(x)}{∂x_1} \\ \frac{∂f(x)}{∂x_2} \\ \vdots \\ \frac{∂f(x)}{∂x_n} \end{bmatrix}.    (136)

It is very important to remember that the gradient of a function is only defined if the function is real-valued, that is, if
it returns a scalar value. We can not, for example, take the gradient of Ax, A ∈ Rn×n with respect to x, since this
quantity is vector-valued.
It follows directly from the equivalent properties of partial derivatives that:
• ∇x (f (x) + g(x)) = ∇x f (x) + ∇x g(x).
• For t ∈ R, ∇x (t f (x)) = t∇x f (x).
We give some examples below, which we will use when computing the least squares estimate (see Section ??) as
well as when computing the maximum likelihood estimate of the mean of a multivariate Gaussian in Section ??.
8.2.1 Derivative of a scalar product
For x ∈ Rn , let f (x) = aT x for some known vector a ∈ Rn . Then
\frac{∂f(x)}{∂x_k} = \frac{∂}{∂x_k} \sum_{i=1}^n a_i x_i = a_k.    (137)

From this we can easily see that

∇_x a^T x = a    (138)
This should be compared to the analogous situation in single variable calculus, where ∂/(∂x) ax = a.
8.2.2 Derivative of a quadratic form
Now consider the quadratic form f (x) = xT Ax for A ∈ Sn . Remember that
f(x) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j    (139)

so

\frac{∂f(x)}{∂x_k} = \frac{∂}{∂x_k} \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j = \sum_{i=1}^n A_{ik} x_i + \sum_{j=1}^n A_{kj} x_j    (140)

Hence

∇_x x^T Ax = (A + A^T)x    (141)
If A is symmetric, this becomes ∇x xT Ax = 2Ax. Again, this should remind you of the analogous fact in single-
variable calculus, that ∂/(∂x) ax2 = 2ax.
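A finite-difference check of Equation 141 (a minimal sketch).

A = randn(4); x = randn(4,1); h = 1e-6;
g = zeros(4,1);
for k=1:4
  xp = x; xp(k) = xp(k) + h;
  g(k) = (xp'*A*xp - x'*A*x) / h;
end
assert(max(abs(g - (A + A')*x)) < 1e-4)        % matches (A + A')*x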
8.3 The Jacobian
Let f be a function that maps Rn to Rm , and let y = f (x). Define the Jacobian matrix as
J_{x→y} \;\overset{def}{=}\; \frac{∂(y_1, . . . , y_m)}{∂(x_1, . . . , x_n)} \;\overset{def}{=}\; \begin{bmatrix} \frac{∂y_1}{∂x_1} & \cdots & \frac{∂y_1}{∂x_n} \\ \vdots & \ddots & \vdots \\ \frac{∂y_m}{∂x_1} & \cdots & \frac{∂y_m}{∂x_n} \end{bmatrix}    (142)
If f_i denotes the i’th component of the output, then the i’th row of J is (∇_x f_i(x))^T .
If f is a linear function, so y = Ax, where A is m × n, we can compute the gradient of each component of
the output separately. We have y_i = A_{i,:} x, so from Equation ??, we have that the i’th row of J is A_{i,:} , so
Jacobian(Ax) = A.
8.4 The Hessian of a function wrt a vector
Suppose that f : Rn → R is a function that takes a vector in Rn and returns a real number. Then the Hessian matrix
with respect to x, written ∇2x f (x) or simply as H is the n × n matrix of partial derivatives,
∇_x^2 f(x) \in \mathbb{R}^{n \times n} = \begin{bmatrix} \frac{∂^2 f(x)}{∂x_1^2} & \frac{∂^2 f(x)}{∂x_1 ∂x_2} & \cdots & \frac{∂^2 f(x)}{∂x_1 ∂x_n} \\ \frac{∂^2 f(x)}{∂x_2 ∂x_1} & \frac{∂^2 f(x)}{∂x_2^2} & \cdots & \frac{∂^2 f(x)}{∂x_2 ∂x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{∂^2 f(x)}{∂x_n ∂x_1} & \frac{∂^2 f(x)}{∂x_n ∂x_2} & \cdots & \frac{∂^2 f(x)}{∂x_n^2} \end{bmatrix}.    (143)

Obviously this is a symmetric matrix, since


\frac{∂^2 f(x)}{∂x_i ∂x_j} = \frac{∂^2 f(x)}{∂x_j ∂x_i}.    (144)
We see that the Hessian is just the Jacobian of the gradient.
Note that the Hessian matrix is often written as
H = ∇x (∇Tx f (x)) (145)
where the term inside the bracket returns a row vector, which we then operate on elementwise, to create an outerproduct-
like matrix. We cannot write the Hessian as H = ∇x (∇x f (x)), since that would require computing the gradient of a
column vector.
8.4.1 Hessian of a quadratic form
Consider the quadratic form f (x) = xT Ax for A ∈ Sn . From above, we have
∇x xT Ax = (A + AT )x (146)
Hence

\frac{∂^2 f(x)}{∂x_k ∂x_ℓ} = \frac{∂^2}{∂x_k ∂x_ℓ} \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j = A_{kℓ} + A_{ℓk} = 2A_{kℓ}.    (147)

so

∇_x^2 x^T Ax = A + A^T    (148)
If A is symmetric, this is 2A, and is again analogous to the single-variable fact that ∂ 2 /(∂x2 ) ax2 = 2a.
8.5 Eigenvalues as Optimization
Finally, we use matrix calculus to solve an optimization problem in a way that leads directly to eigenvalue/eigenvector
analysis. Consider the following, equality constrained optimization problem:
maxx∈Rn xT Ax subject to kxk22 = 1 (149)
n
for a symmetric matrix A ∈ S . A standard way of solving optimization problems with equality constraints is by
forming the Lagrangian, an objective function that includes the equality constraints. The Lagrangian in this case can
be given by
L(x, λ) = xT Ax + λ(1 − xT x) (150)
where λ is called the Lagrange multiplier associated with the equality constraint. It can be established that for x∗ to
be an optimal point to the problem, the gradient of the Lagrangian has to be zero at x∗ (this is not the only condition,
but it is required). That is,
∇x L(x, λ) = 2AT x − 2λx = 0. (151)
Notice that this is just the linear equation Ax = λx. This shows that the only points which can possibly maximize (or
minimize) xT Ax assuming xT x = 1 are the eigenvectors of A.
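We can illustrate this connection numerically: for a symmetric A, the maximum of x^T Ax over unit vectors is the largest eigenvalue, attained at the corresponding eigenvector (a minimal sketch).

A = randn(3); A = A + A';                  % random symmetric matrix
[V, D] = eig(A);
[lamMax, idx] = max(diag(D));
u = V(:, idx);                             % eigenvector of the largest eigenvalue
assert(abs(u'*A*u - lamMax) < 1e-10)       % attains the maximum value
x = randn(3,1); x = x/norm(x);             % a random unit vector does no better
assert(x'*A*x <= lamMax + 1e-10)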

Figure 4: QR decomposition of a non-square matrix A = QR, where QT Q = I and R is upper triangular. The shaded parts of
R, and all the below-diagonal terms, are zero. The shaded entries in Q are not computed in the economy-sized version, since they
are not needed.

9 Matrix factorizations
There are various ways to decompose a matrix into a product of other matrices. In Section 7, we discussed diago-
nalizing a square matrix A = XΛX−1 . In Section ?? below, we will discuss a way to generalize this to non-square
matrices using a technique called SVD. In this section, we present some other useful matrix factorizations.
9.1 Cholesky factorization of pd matrices
Any symmetric positive definite matrix can be factorized as A = RT R, where R is upper triangular with real, positive
diagonal elements. This is called a Cholesky factorization: In Matlab, type R = chol(A). In fact, we can use this
function to determine if a matrix is positive definite, as follows:

Listing 8: isposdef
function b = isposdef(a)
% Written by Tom Minka
[R,p] = chol(a);
b = (p == 0);

(This method is faster than checking that all the eigenvalues are positive.) We can create a random pd matrix using
the following function.

Listing 9: randpd
function M = randpd(d)
A = rand(d);
M = A*A’;

The Cholesky decomposition of a covariance matrix can be used to sample from a multivariate Gaussian. Let
y ∼ N (µ, Σ) and Σ = LLT , where L = RT is lower triangular. We first sample x ∼ N (0, I), which is easy
because it just requires sampling from d separate 1d Gaussians. We then set y = Lx + µ. This is valid since

Cov[y] = L Cov[x] L^T = L I L^T = LL^T = Σ    (152)

In Matlab you type

Listing 10: :
X = chol(Sigma)'*randn(d,n) + repmat(mu,1,n);

Of course, if you have the Matlab statistics toolbox, you can just call mvnrnd, but it is useful to know this other
method, too.
9.2 QR decomposition
The QR decomposition of an n × d matrix X is defined by

X = QR (153)


Figure 5: SVD decomposition of a non-square matrix A = UΣVT . The shaded parts of Σ, and all the off-diagonal terms, are
zero. The shaded entries in U and Σ are not computed in the economy-sized version, since they are not needed.

where Q is an n × n orthogonal matrix (hence QT Q = I and QQT = I), and R is an n × d matrix, which is only
non-zero in its upper-right triangle (see Figure 4). In Matlab, type [Q,R] = qr(X). It is also possible to compute
an economy-sized QR decomposition, using [Q,R] = qr(X,0). In this case, Q is just n × d and R is d × d, where
the lower triangle is zero (see Figure 4). We will use this when we study least squares in Section ??.
10 Singular value decomposition (SVD)
Any (real) m × n matrix A can be decomposed as
   
A = UΣV^T = σ_1 u_1 v_1^T + \cdots + σ_r u_r v_r^T    (154)

where U is an m × m matrix whose columns are orthonormal (so U^T U = I), V is an n × n matrix whose rows and columns
are orthonormal (so V^T V = VV^T = I), and Σ is an m × n matrix containing the r = min(m, n) singular values
σ_i ≥ 0 on the main diagonal, with 0s filling the rest of the matrix. The columns of U are the left singular vectors, and
the columns of V are the right singular vectors. See Figure 5 for an example. In Matlab, you can compute the SVD
using [U,Sigma,V] = svd(A).
As is apparent from Figure 5, if m < n, there are some zero columns in Σ, and hence the last rows of VT are
not used. The economy sized SVD (in matlab, svd(A,’econ’)) in this case leaves U as m × m, but makes Σ be
m × m and only computes the first m columns of V . If m > n, we take the first n columns of U, make Σ be n × n,
and leave V as is (this is also called a thin SVD).
A square diagonal matrix is nonsingular iff its diagonal elements are nonzero. The SVD implies that any square
matrix is nonsingular if, and only if, its singular values are nonzero. The most numerically reliable way to determine
whether matrices are singular is to test their singular values. This is far better than trying to compute determinants,
which are numerically unstable.
10.1 Connection between SVD and eigenvalue decomposition
If A is real, symmetric and positive definite, then the singular values are equal to the eigenvalues, and the left and
right singular vectors are equal to the eigenvectors (up to a sign change). However, Matlab always returns the singular
values in decreasing order, whereas the eigenvalues need not necessarily be sorted. This is illustrated below.

Listing 11: :
>> A=randpd(3)
A =
0.9302 0.4036 0.7065
0.4036 0.8049 0.4521
0.7065 0.4521 0.5941

>> [U,S,V]=svd(A)
U =
-0.6597 0.5148 -0.5476
-0.5030 -0.8437 -0.1872
-0.5584 0.1520 0.8155
S =
1.8361 0 0

0 0.4772 0
0 0 0.0159
V =
-0.6597 0.5148 -0.5476
-0.5030 -0.8437 -0.1872
-0.5584 0.1520 0.8155

>> [U,Lam]=eig(A)
U =
0.5476 0.5148 0.6597
0.1872 -0.8437 0.5030
-0.8155 0.1520 0.5584
Lam =
0.0159 0 0
0 0.4772 0
0 0 1.8361

In general, for an arbitrary real matrix A, if A = UΣVT , we have

AT A = VΣT UT UΣVT = V(ΣT Σ)VT (155)


(A^T A)V = V(Σ^T Σ) = VD    (156)

so the eigenvectors of AT A are equal to V, the right singular vectors of A. Also, the eigenvalues of AT A are equal
to the elements of Σ2 , the squared singular values. Similarly

AAT = UΣVT VΣT UT = U(ΣΣT )UT (157)


(AAT )U = U(ΣΣT ) = UD (158)

so the eigenvectors of AAT are equal to U, the left singular vectors of A. Also, the eigenvalues of AAT are equal to
the squared singular values.
We will use these results when we study PCA (see Section ??), which is very closely related to SVD.
10.2 Pseudo inverse
The pseudo-inverse of an m × n matrix A, denoted A† , is defined as the unique matrix that satisfies the following 4
properties:

AA† A = A (159)
A† AA† = A† (160)
(AA† )T = AA† (161)
(A† A)T = A† A (162)

If A is square and non-singular, then A† = A−1 . If m > n and the columns of A are linearly independent, then

A† = (AT A)−1 AT (163)

which is the same expression as arises in the normal equations (see Section ??). In this case, A† is a left inverse
because
A† A = (AT A)−1 AT A = I (164)
but is not a right inverse because
AA† = A(AT A)−1 AT (165)
only has rank n, and so cannot be the m × m identity matrix.
In general, we can compute the pseudo inverse using the SVD decomposition A = UΣVT . If A is square and
non-singular, we can compute A† easily:

A† = A−1 = (UΣVT )−1 = V[diag(1/σj )]UT (166)

Figure 6: Truncated SVD.

Figure 7: Low rank approximations to a 200 × 320 image. Ranks 1, 2, 5, 10, 20 and 200 (original is bottom right). Produced by
svdImageDemo.

since V^T V = I and U^T U = I. When computing the pseudo-inverse, if any σj = 0, we replace 1/σj with 0. Suppose
the first r = rank(A) singular values are nonzero, and the rest are zero. Then

Σ† = diag(σ1^{-1}, . . . , σr^{-1}, 0, . . . , 0)    (167)

Hence
A† = VΣ†U^T    (168)
amounts to using only the first r columns of U and V, together with the first r singular values, since the remaining
entries of Σ† are 0. In Matlab, we can just type pinv(A). The (abbreviated) source code for this built-in function
is shown below. (Use type(which('pinv')) to see the full source.)
Listing 12: pinv
function B = pinv(A)
[U,S,V] = svd(A,0);                 % if m>n, only compute first n cols of U
s = diag(S);                        % vector of singular values
tol = max(size(A)) * eps(max(s));   % default tolerance (its definition is omitted from this excerpt)
r = sum(s > tol);                   % numerical rank
w = diag(ones(r,1) ./ s(1:r));      % diag(1/sigma_1, ..., 1/sigma_r)
B = V(:,1:r) * w * U(:,1:r)';
We will see how to use the pseudo-inverse to solve under-determined linear systems in Section 11.3.
10.3 Low rank approximation of matrices
(The following section is based on [Mol04, sec. 10.11].) The economy-sized SVD

A = UΣV^T    (169)
Figure 8: Singular values (left) and log singular values (right) for the clown image.

can be rewritten as
A = E1 + E2 + · · · + Er    (170)
where r = min(m, n). The component matrices Ei are rank-one outer products:

Ei = σi ui vi^T    (171)

The norm of each component matrix is the corresponding singular value, ||Ei|| = σi. If we truncate this sum to the first k terms,

Ak = E1 + · · · + Ek = U:,1:k Σ1:k,1:k (V:,1:k)^T    (172)

we get a rank k approximation to the matrix: see Figure 6. This is called a truncated SVD. The total number of
parameters needed to represent an m × n matrix using a rank k approximation is

mk + k + kn = k(m + n + 1)    (173)

As an example, consider the 200 × 320 pixel image in Figure 7 (top left). This has 64,000 numbers in it. We see that a
rank 20 approximation, with only (200 + 320 + 1) × 20 = 10,420 numbers, is a very good approximation.
The error in this approximation is given by

||A − Ak|| = σk+1    (174)
Since the singular values often die off very quickly (see Figure 8), a low-rank approximation is often quite accurate.
In Section ??, we will see that the truncated SVD is the same thing as principal components analysis (PCA).
We can construct a low rank approximation to a matrix (here, an image) as follows.

Listing 13: Part of svdImageDemo
load clown; % built-in image
[U,S,V] = svd(X,0);
k = 20;
Xhat = (U(:,1:k)*S(1:k,1:k)*V(:,1:k)');
image(Xhat);
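Continuing the listing above, we can also check the error formula (174) numerically (a sketch; X, Xhat and k are as in Listing 13):

% Sketch: the 2-norm error of the rank-k approximation equals sigma_{k+1}
s = svd(X);
err = norm(X - Xhat);              % 2-norm = largest singular value of the residual
assert(abs(err - s(k+1)) < 1e-6*s(1))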
11 Solving systems of linear equations


Consider solving a generic linear system Ax = b, where A is m × n. There are three cases, depending on whether
A is square (m = n), tall and skinny (m > n) or short and fat (m < n).
11.1 m = n: Square case
Suppose A is square; then there are n equations and n unknowns. In Section 12, we discuss the conditions under which
a solution to Ax = b exists, and when it is unique. A sufficient condition for a unique solution to exist is that A is
non-singular. In this case, we can write x = A^{-1} b, but there are more efficient (and more numerically stable) algorithms,
such as those based on the LU factorization. In Matlab, you can just use the backslash operator: x=A\b. We discuss the
numerical stability of solving linear systems of equations in Section 6.4.
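For example (a minimal sketch with an arbitrary 2 × 2 system):

% Sketch: solving a square, non-singular system with backslash
A = [2 1; 1 3];  b = [3; 5];
x = A \ b;                     % uses an LU-based solver internally
assert(norm(A*x - b) < 1e-12)
% x = inv(A)*b gives the same answer here, but forming the inverse explicitly
% is slower and less accurate in general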
11.2 m > n: Over-determined case
If m > n, the system is over-determined: there are more equations than unknowns. In general, there is no exact
solution. Typically one seeks the least squares solution

x̂ = arg min_x ||Ax − b||^2    (175)

This solution minimizes the sum of the squared residuals:

||Ax − b||^2 = ∑_{i=1}^m ri^2    (176)
ri = ai1 x1 + · · · + ain xn − bi ,   i = 1 : m    (177)

In Matlab, the backslash operator, x=A\b, gives the least squares solution; internally Matlab uses a QR decomposition:
see Section 9.2.
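For example (a sketch with hypothetical data, fitting a line to four points):

% Sketch: least squares solution of an over-determined system (m = 4, n = 2)
A = [1 1; 1 2; 1 3; 1 4];      % design matrix for a straight-line fit
b = [6; 5; 7; 10];
xhat = A \ b;                  % QR-based least squares solution
r = A*xhat - b;                % residual
assert(norm(A'*r) < 1e-10)     % the residual is orthogonal to the columns of A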
11.3 m < n: Under-determined case
If m < n, the system is under-determined: there is either no solution or there are infinitely many. In either case, the
“optimal” x is ambiguous. In Matlab, there are two main approaches one can use: the backslash operator, x=A\b,
which gives a least squares solution, or the pseudo-inverse, x = pinv(A)*b, which gives the minimum norm solution.
Note that in the under-determined case these two solutions generally differ, even when A has full row rank: backslash
returns a “basic” solution with at most m nonzero entries, whereas pinv returns the solution of smallest norm (and does
more work to obtain it). See Section ?? for details.
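For example (a sketch; the exact entries of the backslash solution depend on the factorization used internally):

% Sketch: backslash vs pinv for an under-determined system (1 equation, 2 unknowns)
A = [1 1];  b = 2;
x1 = A \ b;                    % a "basic" solution with at most m nonzero entries
x2 = pinv(A) * b;              % the minimum norm solution, here [1; 1]
assert(abs(A*x1 - b) < 1e-12 && abs(A*x2 - b) < 1e-12)   % both solve the system
assert(norm(x2) <= norm(x1) + 1e-12)                      % x2 has the smallest norm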
12 Existence/ uniqueness of solutions to linear systems: SVD analysis
Consider the problem of solving Ax = b for an n × n matrix A. Two natural questions are: when does some solution
x ∈ Rn exist? and if it does exist, when is it unique? The SVD can be used to answer these questions, as we now
describe. More interestingly, when the solution is not unique, one can use the singular vectors to determine what extra
data would need to be obtained in order to make the solution unique, as we show in Section ??. (The following section
is based on [Bee07, pp. 29, 143].)
12.1 Left and right singular vectors form an orthonormal basis for the range and null space
Let us start by writing the SVD symbolically as

A = UΣV^T,   where Σ = diag(σ1, . . . , σn) and the rows of V^T are (v1)^T, . . . , (vn)^T    (178)

Hence, for any x,

Ax = UΣV^T x = U ( σ1⟨v1, x⟩, σ2⟨v2, x⟩, . . . , σn⟨vn, x⟩ )^T    (179)
   = [ u1 u2 · · · un ] ( σ1⟨v1, x⟩, . . . , σn⟨vn, x⟩ )^T    (180)
   = ∑_{j=1}^n σj ⟨vj, x⟩ uj    (181)
Suppose that the first r singular values of A are nonzero and that the rest are zero. (This is the opposite of the
convention used in [Bee07].) Hence

Ax = ∑_{j:σj>0} σj ⟨vj, x⟩ uj = ∑_{j=1}^r σj ⟨vj, x⟩ uj    (182)

Thus any Ax can be written as a linear combination of the left singular vectors u1, . . . , ur, so the range of A is given
by
range(A) = span{uj : σj > 0}    (183)
with dimension r.
To find a basis for the null space, let us now define a second vector y ∈ R^n that is a linear combination solely of
the right singular vectors for the zero singular values,

y = ∑_{j:σj=0} cj vj = ∑_{j=r+1}^n cj vj    (184)
Since the vj's are orthonormal, we have

Ay = U ( σ1⟨v1, y⟩, . . . , σr⟨vr, y⟩, σr+1⟨vr+1, y⟩, . . . , σn⟨vn, y⟩ )^T
   = U ( σ1 · 0, . . . , σr · 0, 0 · ⟨vr+1, y⟩, . . . , 0 · ⟨vn, y⟩ )^T = U0 = 0    (185)

Hence these right singular vectors form an orthonormal basis for the null space:

nullspace(A) = span{vj : σj = 0}    (186)

with dimension n − r. We see that

dim(range(A)) + dim(nullspace(A)) = r + (n − r) = n    (187)

In words, this is often written as

rank + nullity = n    (188)

This is called the rank-nullity theorem.
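In Matlab, we can read off these bases directly from the SVD; a minimal sketch (using a hypothetical rank-deficient matrix; the built-in functions orth(A) and null(A) use essentially this construction):

% Sketch: orthonormal bases for range(A) and nullspace(A) from the SVD
A = [1 2 3; 2 4 6; 1 0 1];                 % rank 2: the second row is twice the first
[U,S,V] = svd(A);  s = diag(S);
r = sum(s > max(size(A)) * eps(max(s)));   % numerical rank (here r = 2)
R = U(:,1:r);                              % orthonormal basis for range(A)
N = V(:,r+1:end);                          % orthonormal basis for nullspace(A)
assert(norm(A*N) < 1e-10)                  % A maps the null space to zero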
12.2 Existence and uniqueness of solutions
We can finally answer our question about the existence and uniqueness of solutions to Ax = b. If A is full rank (r = n),
then Ax can generate every vector b ∈ R^n. Hence for any b, there is an x such that Ax = b, namely x = A^{-1} b,
which can be rewritten as

x = A^{-1} b = (UΣV^T)^{-1} b = VΣ^{-1} U^T b = ∑_{j=1}^n ⟨uj, b⟩ σj^{-1} vj    (189)

Furthermore, this solution must be unique. To see why, suppose y is some other solution, so Ay = b. Define
w = y − x. Then
Ay = A(x + w) = Ax + Aw = b + Aw    (190)
Since Ay = b, this implies Aw = 0, i.e., w ∈ nullspace(A). Since A is full rank, the nullity is 0 and
nullspace(A) = {0}, so we must have w = 0, and the solution x is unique.
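A quick numerical check of Equation 189 (a sketch; the system is an arbitrary non-singular example):

% Sketch: solving a non-singular square system via the SVD expansion (189)
A = [2 1; 1 3];  b = [3; 5];
[U,S,V] = svd(A);  s = diag(S);
x = V * ((U'*b) ./ s);         % sum_j <u_j,b> / sigma_j * v_j
assert(norm(A*x - b) < 1e-12)
assert(norm(x - A\b) < 1e-12)  % agrees with the backslash solution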
Now suppose A is not full rank (r < n), so A is singular and A^{-1} does not exist. If b ∉ range(A), then there
is no solution, since A can only generate values in range(A). If b ∈ range(A), then there are an infinite number of
solutions, of the form

x = z + ∑_{j:σj=0} cj vj    (191)

for some constants cj ∈ R, where z is a particular solution satisfying Az = b. To see this, let y = ∑_{j:σj=0} cj vj be
a vector in the null space. Then, from Equation 185,

Ax = Az + Ay = b + 0 = b    (192)

Hence x = z + y is also a solution for any y in the null space.
To find a particular solution z, we can use the pseudo-inverse:

z = A† b = VΣ† U^T b = ∑_{j=1}^r ⟨uj, b⟩ σj^{-1} vj    (193)
This satisfies Az = b since

Az = AA† b = U diag(σ1, . . . , σr, 0, . . . , 0) V^T V diag(1/σ1, . . . , 1/σr, 0, . . . , 0) U^T b    (194)
   = U:,1:r (U:,1:r)^T b = ∑_{j=1}^r uj ⟨uj, b⟩ = b    (195)
The last step follows since b ∈ range(A) (by assumption), and hence b must lie in span{uj : σj > 0}.
It is natural to ask: what happens if we apply the pseudo-inverse to a b that is not in the range of A? This happens,
for example, when the system is over-determined (more rows than columns in A). In this case, z = A† b is not a solution,
since Az ≠ b. However, it is the “closest thing to a solution”, in the sense that it minimizes the residual norm:

||Az − b|| ≤ ||Ay − b||   ∀y ∈ R^n    (196)

This is called a least squares solution: see Section ??.
12.3 Worked example
Consider the following example from [Van06, p. 53]. We have the following linear system of equations, where A is a
singular, lower triangular matrix:

[  6  0   0 ] [x1]   [b1]
[  2  0   0 ] [x2] = [b2]    (197)
[ −4  1  −2 ] [x3]   [b3]
From the first equation, x1 = b1/6. From the second equation,

2x1 = b2  ⇒  2(b1/6) = b2  ⇒  b1 = 3b2    (198)

So the system is only solvable if b1 = 3b2, and in that case, we can choose x2 arbitrarily. Finally, the third equation
can be satisfied by taking

x3 = −(1/2)(b3 + 4x1 − x2) = −b1/3 − b3/2 + x2/2    (199)
So if b1 ≠ 3b2, the system has no solution. If b1 = 3b2, there are infinitely many solutions: x = (b1/6, α, −b1/3 −
b3/2 + α/2) is a solution for all α.
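We can check this family of solutions numerically for a few values of α (a sketch):

% Sketch: verify that x = (b1/6, alpha, -b1/3 - b3/2 + alpha/2) solves Ax = b
% for any alpha, provided b1 = 3*b2
A = [6 0 0; 2 0 0; -4 1 -2];
b = [3; 1; 10];                % satisfies b1 = 3*b2
for alpha = [0 1 -2.5]
    x = [b(1)/6; alpha; -b(1)/3 - b(3)/2 + alpha/2];
    assert(norm(A*x - b) < 1e-12)
end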
Let us now see how we can derive these results using SVD. The following code fragment verifies that b = (3, 1, 10)
(which satisfies b1 = 3b2 ) is in the range of A, by showing that there are coefficients c that can multiply the left singular
vectors to generate b.

Listing 14: Part of svdSpanDemo
A = [6 0 0; 2 0 0; -4 1 -2];
[U,S,V]=svd(A,'econ'); S = diag(S);
ndx0 = find(S<=eps);
ndxPos = find(S>eps);
R = U(:,ndxPos); % range
N = V(:,ndx0); % null space
b = [3 1 10]'; % this satisfies b1=3b2
% verify that b is in the range
c = R \ b;
assert(approxeq(b, sum(R*diag(c),2)))
We can also generate a particular solution z:

Listing 15: Part of svdSpanDemo
z = pinv(A)*b;
assert(approxeq(norm(A*z - b), 0))
We can also create other solutions by adding random multiples of vectors in the null space.

Listing 16: Part of svdSpanDemo
c = rand(length(ndx0),1);
x = z + sum(N*diag(c),2);
assert(approxeq(norm(A*x - b), 0))
Similarly, it is easy to verify that b = (1, 1, 10) is not in the range of A (it cannot be generated by Az for any z), so
the corresponding system has no solution:

Listing 17: Part of svdSpanDemo
b = [1 1 10]'; % this does not satisfy b1=3b2
c = R \ b;
assert(not(approxeq(b, sum(R*diag(c),2))))
z = pinv(A)*b;
assert(not(approxeq(norm(A*z - b), 0)))
References

[Bee07] K. Beers. Numerical Methods for Chemical Engineering: Applications in MATLAB. Cambridge University Press, 2007.
[Bis06] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[Jor07] M. I. Jordan. An Introduction to Probabilistic Graphical Models. 2007. In preparation.
[Mol04] C. Moler. Numerical Computing with MATLAB. SIAM, 2004.
[Van06] L. Vandenberghe. Applied Numerical Computing. Lecture notes, 2006.