
Matrix Methods with Applications in

Science and Engineering


Kevin W. Cassel
Mechanical, Materials, and Aerospace Engineering Department
Illinois Institute of Technology
10 West 32nd Street
Chicago, IL 60616
cassel@iit.edu

© 2015 Kevin W. Cassel


Contents

Preface

Part I  Matrix Methods

1  Vector and Matrix Algebra
   1.1  Definitions
   1.2  Algebraic Operations
   1.3  Applications
        1.3.1  Systems of Linear Algebraic Equations
        1.3.2  Eigenproblems
        1.3.3  Quadratic Expressions
        1.3.4  Systems of First-Order, Linear Differential Equations
   1.4  Systems of Linear Algebraic Equations
        1.4.1  General Considerations
        1.4.2  Determinants
        1.4.3  Gaussian Elimination
        1.4.4  Matrix Inverse
        1.4.5  Cramer's Rule
   1.5  Linear Vector Spaces
        1.5.1  Three-Dimensional Vectors Review
        1.5.2  n-Dimensional Vectors
        1.5.3  Linear Independence of Vectors
        1.5.4  Basis of a Vector Space
        1.5.5  Gram-Schmidt Orthogonalization
   1.6  Linear Transformations
   1.7  Note on Norms

2  The Eigenproblem and Its Applications
   2.1  Eigenvalues and Eigenvectors
   2.2  Real Symmetric Matrices
        2.2.1  Properties of Eigenvectors
        2.2.2  Linear Systems of Equations
        2.2.3  Quadratic Forms
   2.3  Normal Matrices
   2.4  Diagonalization
        2.4.1  Matrices with Linearly-Independent Eigenvectors
        2.4.2  Jordan Canonical Form
   2.5  Systems of Ordinary Differential Equations
        2.5.1  General Approach for First-Order Systems
        2.5.2  Higher-Order Equations
   2.6  Decomposition of Matrices
        2.6.1  Polar Decomposition
        2.6.2  Singular-Value Decomposition
   2.7  Functions of Matrices

3  Eigenfunction Solutions of Differential Equations
   3.1  Function Spaces
        3.1.1  Definitions
        3.1.2  Linear Independence of Functions
        3.1.3  Basis Functions
   3.2  Eigenfunction Expansions
        3.2.1  Definitions
        3.2.2  Eigenfunctions of Differential Operators
   3.3  Adjoint and Self-Adjoint Differential Operators
        3.3.1  Adjoint of a Differential Operator
        3.3.2  Requirement for an Operator to be Self-Adjoint
        3.3.3  Eigenfunctions of Sturm-Liouville Operators
   3.4  Eigenfunction Solutions of Partial Differential Equations
        3.4.1  Laplace Equation
        3.4.2  Unsteady Diffusion Equation

4  Vector and Matrix Calculus
   4.1  Vector Calculus
        4.1.1  Derivative Operators
        4.1.2  Integral Theorems
        4.1.3  Coordinate Transformations
   4.2  Extrema of Functions
        4.2.1  General Considerations
        4.2.2  Constrained Extrema and Lagrange Multipliers
        4.2.3  Linear Programming
        4.2.4  Quadratic Programming
   4.3  Least-Squares Solutions of Algebraic Systems of Equations
        4.3.1  Overdetermined Systems (r > n)
        4.3.2  Underdetermined Systems (r < n)

Part II  Numerical Methods

5  Introduction to Numerical Methods
   5.1  Historical Background
   5.2  Impact on Research and Practice

6  Computational Linear Algebra
   6.1  Approximation and Its Effects
        6.1.1  Operation Counts
        6.1.2  Ill-Conditioning and Round-Off Errors
   6.2  Systems of Linear Algebraic Equations
        6.2.1  LU Decomposition
        6.2.2  Cholesky Decomposition
        6.2.3  Partitioning
        6.2.4  Iterative Convergence
        6.2.5  Jacobi Method
        6.2.6  Gauss-Seidel Method
        6.2.7  Successive Over-Relaxation (SOR)
        6.2.8  Conjugate-Gradient Method
   6.3  Numerical Solution of the Eigenproblem
        6.3.1  Similarity Transformation
        6.3.2  QR Method to Obtain Eigenvalues and Eigenvectors
        6.3.3  Arnoldi Method
   6.4  Singular-Value Decomposition

7  Nonlinear Algebraic Equations: Root Finding

8  Optimization of Algebraic Systems

9  Curve Fitting and Interpolation

10  Numerical Integration

11  Finite-Difference Methods
    11.1  General Considerations
          11.1.1  Numerical Solution Procedure
          11.1.2  Properties of a Numerical Solution
    11.2  Formal Basis for Finite Differences
    11.3  Extended-Fin Example
    11.4  Tridiagonal Systems of Equations
          11.4.1  Properties of Tridiagonal Matrices
          11.4.2  Thomas Algorithm
    11.5  Derivative Boundary Conditions

12  Finite-Difference Methods for ODEs

13  Classification of Second-Order PDEs
    13.1  Mathematical Classification
    13.2  Hyperbolic Equations
    13.3  Parabolic Equations
    13.4  Elliptic Equations
    13.5  Mixed Equations

14  Finite-Difference Methods for Elliptic PDEs
    14.1  Finite-Difference Methods for the Poisson Equation
    14.2  Direct Methods for Linear Systems
          14.2.1  Fourier Transform Methods
          14.2.2  Cyclic Reduction
    14.3  Iterative (Relaxation) Methods
          14.3.1  Jacobi Method
          14.3.2  Gauss-Seidel Method
          14.3.3  Successive Over-Relaxation (SOR)
    14.4  Boundary Conditions
          14.4.1  Dirichlet Boundary Conditions
          14.4.2  Neumann Boundary Conditions
    14.5  Alternating-Direction-Implicit (ADI) Method
          14.5.1  Motivation
          14.5.2  ADI Method
    14.6  Compact Higher-Order Methods
    14.7  Multigrid Methods
          14.7.1  Motivation
          14.7.2  Multigrid Methodology
          14.7.3  Speed Comparisons
    14.8  Treatment of Nonlinear Convective Terms
          14.8.1  Picard Iteration
          14.8.2  Upwind-Downwind Differencing
          14.8.3  Newton Linearization

15  Finite-Difference Methods for Parabolic PDEs
    15.1  Explicit Methods
          15.1.1  First-Order Explicit (Euler) Method
          15.1.2  Richardson Method
          15.1.3  DuFort-Frankel Method
    15.2  Numerical Stability Analysis
          15.2.1  Matrix Method
          15.2.2  von Neumann Method (Fourier Analysis)
    15.3  Implicit Methods
          15.3.1  First-Order Implicit Method
          15.3.2  Crank-Nicolson Method
    15.4  Nonlinear Convective Problems
          15.4.1  First-Order Explicit Method
          15.4.2  Crank-Nicolson Method
          15.4.3  Upwind-Downwind Differencing
    15.5  Multidimensional Problems
          15.5.1  First-Order Explicit Method
          15.5.2  First-Order Implicit Method
          15.5.3  ADI Method with Time Splitting (Fractional-Step Method)
          15.5.4  Factored ADI Method

16  Finite-Difference Methods for Hyperbolic PDEs

17  Additional Topics in Numerical Methods
    17.1  Numerical Solution of Coupled Systems of PDEs
          17.1.1  Introduction
          17.1.2  Sequential Method
    17.2  Grids and Grid Generation
          17.2.1  Uniform Versus Non-Uniform Grids
          17.2.2  Structured Versus Unstructured Grids
    17.3  Finite-Element and Spectral Methods
          17.3.1  Comparison of Methods
          17.3.2  Spectral Methods
          17.3.3  Finite-Element Methods
    17.4  Parallel Computing

Part III  Applications

18  Static Structural and Electrical Systems

19  Discrete Dynamical Systems
    19.1  An Illustrative Example
    19.2  Bifurcation Theory and the Phase Plane
    19.3  Stability of Discrete Systems
    19.4  Stability of Linear, Second-Order, Autonomous Systems
    19.5  Nonautonomous Systems: Forced Pendulum
    19.6  Non-Normal Systems: Transient Growth
    19.7  Nonlinear Systems

20  Continuous Systems
    20.1  Wave Equation
    20.2  Electromagnetics
    20.3  Schrödinger Equation
    20.4  Stability of Continuous Systems: Beam-Column Buckling
    20.5  Numerical Solution of the Differential Eigenproblem
          20.5.1  Exact Solution
          20.5.2  Numerical Solution
    20.6  Hydrodynamic Stability
          20.6.1  Linearized Navier-Stokes Equations
          20.6.2  Local Normal-Mode Analysis
          20.6.3  Numerical Solution of the Orr-Sommerfeld Equation
          20.6.4  Example: Plane-Poiseuille Flow
          20.6.5  Numerical Stability Revisited

21  Optimization and Control
    21.1  Controllability, Observability, and Reachability
    21.2  Control of Discrete Systems
    21.3  Kalman Filter
    21.4  Control of Continuous Systems

22  Image or Signal Processing and Data Analysis
    22.1  Fourier Analysis
    22.2  Proper-Orthogonal Decomposition

Appendix A  Row and Column Space of a Matrix

Preface

So much of mathematics was originally developed to treat particular problems and applications; the mathematics and applications were inseparable. As the years went by, the mathematical methods naturally were extended, unified, and formalized. This has provided a solid foundation on which to build mathematics as a standalone field and a basis for extension to additional application areas, sometimes, in fact, providing the impetus for whole new fields of endeavor. Although this evolution has served mathematics, science, and engineering immeasurably, it also widens the gap between pure theoretical mathematics and science and engineering applications. As such, it becomes increasingly difficult to strike the right balance that encourages one to learn the mathematics in the context of the applications in which the scientist and engineer are ultimately interested.
Given the volume of theory, methods, and techniques in the arsenal of the
researcher and practitioner, it is a tall task to devise a text with the optimal
balance and arrangement of topics to both teach and motivate these topics in an
effective and engaging manner. I believe that the answer is in how the mathematical subjects are discretized into somewhat self-contained topics around which the
associated applications are hung. The primary complaint of this approach is that
each application area is treated insufficiently. However, the primary virtue is that
the reader clearly sees the connections, mathematical and otherwise, between a
wide variety of topics. In addition, there is the temptation in an application-oriented text to overemphasize applications and underemphasize the underlying mathematics. For example, it is not uncommon for texts on numerical methods to include an appendix containing a brief review of vectors and matrices. Although seemingly appropriate, this doesn't do justice to the integral and essential nature
of linear algebra in numerical methods.
For the scientist or engineer evaluating books on matrix mathematics or linear
algebra, a simple litmus test is to see if the text includes a discussion of singular-value decomposition (SVD). Despite its widespread utility in numerical methods, system diagnostics, image and signal compression, reduced-order modeling, and optimal control, it is an all too rare text that even makes mention of SVD, let alone highlights its wealth of applications.
Working principles in developing this text:
Objectives: Consolidation, generalization, and unification of topics.
Matrix methods are interpreted loosely to include function spaces and eigenfunction expansions, vector calculus, linear systems, etc.
Focus on topics common to numerous areas of science and engineering, not subject-specific topics.
Part I is material typically found in an engineering analysis or linear algebra
text, Part II is material typically found in a numerical methods book, and Part III
is found in a variety of domain-specific texts on linear systems theory, dynamical
systems, image/signal processing, etc. Of course, all of this is treated within a
unified framework with matrix methods being the centerpiece.
Part I contains the mathematical foundations for the topics in the remainder
of the book. Mathematicians would refer to the material in Chapters 1 and 2 as
linear algebra, while scientists and engineers would more likely refer to it as matrix
analysis. Chapter 3 extends the approaches used in Chapters 1 and 2 for vectors
and matrices to functions and differential operators, which facilitates application
to ordinary and partial differential equations through the use of eigenvalues and
eigenfunctions.
The methods articulated in Part I are appropriate for small systems that can be solved exactly by hand or using symbolic mathematical software. Despite this obvious limitation, they are essential knowledge for all subsequent discussion of numerical methods and applications. In order to solve moderate to large systems, which is done approximately with digital computers, we must revisit linear systems of algebraic equations and the eigenproblem to see how the methods in Part I can be adapted for large systems. This is known as computational, or
numerical, linear algebra and is covered in Chapter 6.

Part I
Matrix Methods

1
Vector and Matrix Algebra

I did not look for matrix theory. It somehow looked for me.
(Olga Taussky Todd)

Although the typical undergraduate engineering student has not had a formal
course in linear algebra, which is the mathematics of vectors and matrices, they
have been exposed to such constructs as a convenient means of representing tensors and systems of linear algebraic and differential equations, for example. They
have likely observed several means of solving systems of equations, calculating
determinants, applying linear transformations, and determining eigenvalues and
eigenvectors. Typically, however, these topics have been presented to the student
in a disjointed fashion in the context of undergraduate engineering courses where
the need arose. The objective of this text is to unify these topics into a single
coherent subject and extend our skill with vectors and matrices to the level that
is required for graduate-level study and research in science and engineering.
Whereas vectors and matrices arise in a wide variety of applications and settings (see Section 1.3), the mathematics of these constructs is the same regardless
of where the vectors or matrices have their origin. We will focus here on the mathematics, but with little emphasis on formalism and proofs, and mention and/or
illustrate many of the applications to science and engineering.

1.1 Definitions
Let us begin by defining vectors and matrices and several different types of matrices.
Matrix: A matrix is an ordered arrangement of numbers, variables, or functions
comprised of a rectangular grouping of elements arranged in rows and columns
as follows:

$$A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{bmatrix} = [A_{ij}].$$
The size of the matrix is denoted by the number of rows, m, and the number of columns, n, that is, m × n, which we read as m by n. If m = n, then the matrix is said to be square. Each element Aij in the matrix is uniquely identified by two subscripts, with the first i being its row and the second j being its column. Thus, 1 ≤ i ≤ m and 1 ≤ j ≤ n. The elements Aij may be real or complex.
The main diagonal of the matrix A is given by A11, A22, . . . , Amm or Ann. If the matrix is square, then Amm = Ann. Two matrices are said to be equal, that is A = B, if their sizes are the same and Aij = Bij for all i, j.
Vector: A (column) vector is an n × 1 matrix. For example,
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$
The vector is said to be n-dimensional. By common convention, matrices are denoted with bold capital letters and vectors with bold lowercase letters.
Matrix Transpose (Adjoint): The transpose of matrix A is obtained by interchanging its rows and columns as follows:

$$A^T = \begin{bmatrix} A_{11} & A_{21} & \cdots & A_{m1} \\ A_{12} & A_{22} & \cdots & A_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ A_{1n} & A_{2n} & \cdots & A_{mn} \end{bmatrix} = [A_{ji}],$$
which results in an n × m matrix. If $A^T = A$, then A is said to be symmetric (Aji = Aij). Note that a matrix must be square to be symmetric.
If the elements of A are complex and $\bar{A}^T = A$, then A is a Hermitian matrix ($A_{ji} = \bar{A}_{ij}$), where the overbar represents the complex conjugate, and $\bar{A}^T$ is the conjugate transpose of A. Note that a symmetric matrix is a special case of a Hermitian matrix.
Zero Matrix (0): Matrix of all zeros.
Identity Matrix (I): Square matrix with ones on the main diagonal and zeros everywhere else, for example
$$I_5 = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} = [\delta_{ij}],$$
where
$$\delta_{ij} = \begin{cases} 1, & i = j, \\ 0, & i \neq j. \end{cases}$$

Scalar Matrix (S): Diagonal matrix with equal diagonal elements such that $S = aI = [a\,\delta_{ij}]$.
Triangular Matrix: All elements above (lower triangular) or below (upper triangular) the main diagonal are zeros. For example,
$$L = \begin{bmatrix} A_{11} & 0 & 0 & 0 & 0 \\ A_{21} & A_{22} & 0 & 0 & 0 \\ A_{31} & A_{32} & A_{33} & 0 & 0 \\ A_{41} & A_{42} & A_{43} & A_{44} & 0 \\ A_{51} & A_{52} & A_{53} & A_{54} & A_{55} \end{bmatrix}, \qquad
U = \begin{bmatrix} A_{11} & A_{12} & A_{13} & A_{14} & A_{15} \\ 0 & A_{22} & A_{23} & A_{24} & A_{25} \\ 0 & 0 & A_{33} & A_{34} & A_{35} \\ 0 & 0 & 0 & A_{44} & A_{45} \\ 0 & 0 & 0 & 0 & A_{55} \end{bmatrix}.$$

Tridiagonal Matrix: All elements are zero except along the lower (first subdiagonal), main, and upper (first superdiagonal) diagonals as follows:

$$A = \begin{bmatrix} A_{11} & A_{12} & 0 & 0 & 0 \\ A_{21} & A_{22} & A_{23} & 0 & 0 \\ 0 & A_{32} & A_{33} & A_{34} & 0 \\ 0 & 0 & A_{43} & A_{44} & A_{45} \\ 0 & 0 & 0 & A_{54} & A_{55} \end{bmatrix}.$$
Hessenberg Matrix: All elements are zero below the lower diagonal, that is
$$A = \begin{bmatrix} A_{11} & A_{12} & A_{13} & A_{14} & A_{15} \\ A_{21} & A_{22} & A_{23} & A_{24} & A_{25} \\ 0 & A_{32} & A_{33} & A_{34} & A_{35} \\ 0 & 0 & A_{43} & A_{44} & A_{45} \\ 0 & 0 & 0 & A_{54} & A_{55} \end{bmatrix}.$$
Toeplitz Matrix: Each diagonal is a constant, such that
$$A = \begin{bmatrix} A_{11} & A_{12} & A_{13} & A_{14} & A_{15} \\ A_{21} & A_{11} & A_{12} & A_{13} & A_{14} \\ A_{31} & A_{21} & A_{11} & A_{12} & A_{13} \\ A_{41} & A_{31} & A_{21} & A_{11} & A_{12} \\ A_{51} & A_{41} & A_{31} & A_{21} & A_{11} \end{bmatrix}.$$

Matrix Inverse: If a square matrix A is invertible, then its inverse A^{-1} is such that
$$AA^{-1} = A^{-1}A = I.$$
Note: If A is 2 × 2, that is,
$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},$$
then its inverse is
$$A^{-1} = \frac{1}{|A|}\begin{bmatrix} A_{22} & -A_{12} \\ -A_{21} & A_{11} \end{bmatrix},$$
where |A| = A11 A22 − A12 A21 is the determinant of A (see Section 1.4.2).

Table 1.1 Corresponding types of real and complex matrices.
Real: Symmetric, $A^T = A$                    Complex: Hermitian, $\bar{A}^T = A$
Real: Skew-Symmetric, $A^T = -A$              Complex: Skew-Hermitian, $\bar{A}^T = -A$
Real: Orthogonal, $A^{-1} = A^T$              Complex: Unitary, $A^{-1} = \bar{A}^T$
Orthogonal Matrix: An n × n matrix A is orthogonal if
$$A^T A = I.$$
It follows that
$$A^T = A^{-1}$$
for an orthogonal matrix. Such a matrix is called orthogonal because its column (and row) vectors are mutually orthogonal (see Section 1.5).
Remark: Generalizations of some of the above definitions to matrices with complex elements are given in Table 1.1.

1.2 Algebraic Operations


This being linear algebra, we must extend the algebraic operations of addition
and multiplication, that are so familiar for scalars, to vectors and matrices.
Addition: For A and B having the same size m × n, their sum A + B is an m × n matrix obtained by adding the corresponding elements in A and B. For example:
$$\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} + \begin{bmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{bmatrix} = \begin{bmatrix} A_{11}+B_{11} & A_{12}+B_{12} & A_{13}+B_{13} \\ A_{21}+B_{21} & A_{22}+B_{22} & A_{23}+B_{23} \\ A_{31}+B_{31} & A_{32}+B_{32} & A_{33}+B_{33} \end{bmatrix} = [A_{ij} + B_{ij}].$$
Subtraction is simply a special case of addition.
Multiplication: In the case of multiplication, we must consider multiplication
of a scalar and a matrix as well as a matrix and a matrix.
1. Multiplication by a scalar: When multiplying a matrix by a scalar c, we simply multiply each element of the matrix by the scalar. That is,
$$cA = [cA_{ij}].$$
2. Multiplication by a matrix : The definition of matrix multiplication is motivated
by its use in linear transformations (see Section 1.6 and Kreyszig, p. 281). For
the matrix product AB to exist, the number of columns of A must equal the
number of rows of B. Then
$$\underset{m \times r}{A}\;\underset{r \times n}{B}$$
produces an m × n matrix. Consider $\underset{m \times r}{A}\,\underset{r \times n}{B} = \underset{m \times n}{C}$:
$$\begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1r} \\ A_{21} & A_{22} & \cdots & A_{2r} \\ \vdots & \vdots & & \vdots \\ A_{i1} & A_{i2} & \cdots & A_{ir} \\ \vdots & \vdots & & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mr} \end{bmatrix}
\begin{bmatrix} B_{11} & B_{12} & \cdots & B_{1j} & \cdots & B_{1n} \\ B_{21} & B_{22} & \cdots & B_{2j} & \cdots & B_{2n} \\ \vdots & \vdots & & \vdots & & \vdots \\ B_{r1} & B_{r2} & \cdots & B_{rj} & \cdots & B_{rn} \end{bmatrix}
=
\begin{bmatrix} C_{11} & C_{12} & \cdots & C_{1n} \\ \vdots & & C_{ij} & \vdots \\ C_{m1} & C_{m2} & \cdots & C_{mn} \end{bmatrix},$$
where
$$C_{ij} = A_{i1}B_{1j} + A_{i2}B_{2j} + \cdots + A_{ir}B_{rj} = \sum_{k=1}^{r} A_{ik}B_{kj}.$$

That is, Cij is the inner product of the ith row of A and the j th column of B
(see Section 1.5.1 for a definition of the inner product).
Note that in general, AB ≠ BA (even if square), that is, premultiplying B
by A is not the same as postmultiplying B by A.
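To make the row-into-column rule concrete, here is a brief Python/NumPy sketch of the triple-loop definition of Cij; the matrices A and B are arbitrary illustrative values (not taken from the text), and NumPy's built-in @ operator is used only as a check.

```python
import numpy as np

def matmul(A, B):
    """Multiply an (m x r) matrix A by an (r x n) matrix B using
    C_ij = sum_k A_ik * B_kj."""
    m, r = A.shape
    r2, n = B.shape
    assert r == r2, "columns of A must equal rows of B"
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for k in range(r):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])

print(matmul(A, B))                        # same result as A @ B
print(np.allclose(matmul(A, B), A @ B))    # True
print(np.allclose(A @ B, B @ A))           # False: AB != BA in general
```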
Rules: For A, B, C, and I of appropriate size, and n an integer:
A(BC) = (AB)C (but we cannot change the order of A, B, and C)
A(B + C) = AB + AC
(B + C)A = BA + CA
AI = IA = A
(A + B)^T = A^T + B^T
(AB)^T = B^T A^T (note reverse order; shown in Hildebrand, p. 14)
(AB)^{-1} = B^{-1} A^{-1} (if A and B invertible; note reverse order)
A^n = AA · · · A (n factors)
A^{-n} = (A^{-1})^n = A^{-1} A^{-1} · · · A^{-1} (n factors)
A^n A^m = A^{n+m}
(A^n)^m = A^{nm}
A^0 = I
Note that scalar arithmetic is a special case of matrix arithmetic, that is, a scalar is simply a 1 × 1 matrix.
We can combine vector addition and scalar multiplication to devise a very
common algebraic operation in matrix methods known as a linear combination
of vectors. Let us say that we have the m vectors x1 , x2 , . . ., xm , which are
n-dimensional. A linear combination of these vectors is given by
u = a1 x1 + a2 x2 + · · · + am xm,
where ai , i = 1, . . . m are constants. Thus, we are taking the combination of some
portion of each vector through summing to form the n-dimensional vector u. In
Section 1.5, we will regard the vectors xi , i = 1, . . . , m as the basis of a vector space
comprised of all possible vectors u given by all possible values of ai , i = 1, . . . m.
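As a small illustration (with arbitrary vectors and coefficients, not from the text), the following Python snippet forms a linear combination directly and, equivalently, as a matrix-vector product whose columns are the xi; the latter viewpoint reappears in Section 1.4.1.

```python
import numpy as np

# Three 4-dimensional vectors (arbitrary illustrative values).
x1 = np.array([1.0, 0.0, 2.0, -1.0])
x2 = np.array([0.0, 1.0, 1.0,  3.0])
x3 = np.array([2.0, 1.0, 0.0,  1.0])

a1, a2, a3 = 2.0, -1.0, 0.5          # arbitrary constants

# Linear combination u = a1*x1 + a2*x2 + a3*x3 (again a 4-dimensional vector).
u = a1 * x1 + a2 * x2 + a3 * x3
print(u)

# Equivalently, stack the vectors as columns and multiply by the coefficients.
X = np.column_stack([x1, x2, x3])
print(np.allclose(u, X @ np.array([a1, a2, a3])))   # True
```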
1.3 Applications
There is nothing inherently physical about the way vectors and matrices are defined and manipulated. However, they provide a convenient means of representing
the mathematical models that apply in a vast array of applications that span all
areas of physics, engineering, computer science, and image processing, for example. Once we become comfortable with them, vectors and matrices simply become
a natural extension of the basic algebra that is so inherent to mathematics. Linear
algebra, then, is not so much a separate branch of applied mathematics as it is
an essential part of our mathematical repertoire.
Before getting into the particulars of manipulating vectors and matrices, it is
worthwhile to summarize and categorize the various classes of problems that are
represented using these methods and some of the applications that produce them.
The remainder of Chapters 1 and 2 will then focus on the methods and techniques
for treating each of these classes of problems. Applications will be interspersed
periodically in order to remind ourselves that we are developing the tools to
solve real problems, and also to expand our understanding of what the general
mathematical constructs represent in physical and numerical applications.
The following subsections provide a classification of the various types of matrix
problems that commonly arise in engineering and the sciences. Various examples


are drawn from these fields, and it is shown how the particular class of matrix
problem arises naturally from the physical or numerical situation under consideration.

1.3.1 Systems of Linear Algebraic Equations


The most common class of problems in matrix methods is in solving systems of
coupled linear algebraic equations. These arise, for example, in solving for the
forces in static (non-moving) systems of discrete objects (for example, trusses,
static spring-mass systems, etc.), parallel electrical circuits containing only voltage sources and resistors, and linear coordinate transformations (for example,
computer graphics and CAD). Perhaps the most important application is in numerical methods, for example finite difference and finite element methods, which
are designed to reduce continuous differential equations to (typically very large)
systems of linear algebraic equations so as to make them amenable to solution
via matrix methods.
A system of linear algebraic equations may be represented in the form
Ax = c,
where A is a given coefficient matrix, c is a given vector, and x is the solution
vector that is sought. Solution techniques are covered in Section 1.4.
Example 1.1 We seek to determine the currents, i, in the parallel electrical circuit shown in Figure 1.1 (see Jeffrey, p. 121).
Figure 1.1 Schematic of the electrical circuit in Example 1.1.
Solution: Recall Ohm's law for resistors
$$V = iR,$$
where V is the voltage drop across the resistor, i is the current through the resistor, and R is its resistance. In addition, we have Kirchhoff's laws:


1. The sum of the currents into each junction is equal to the sum of the currents
out of each junction.
2. The sum of the voltage drops around each closed circuit is zero.
Applying the second of Kirchhoff's laws around each of the three parallel circuits gives the three coupled equations
$$\begin{aligned}
\text{Loop 1:}&\quad 8 - 12i_1 - 8(i_1 - i_3) - 10(i_1 - i_2) = 0,\\
\text{Loop 2:}&\quad 4 - 10(i_2 - i_1) - 6(i_2 - i_3) = 0,\\
\text{Loop 3:}&\quad 6 - 6(i_3 - i_2) - 8(i_3 - i_1) - 4i_3 = 0.
\end{aligned}$$
Upon simplifying, this gives the system of linear algebraic equations
$$\begin{aligned}
30i_1 - 10i_2 - 8i_3 &= 8,\\
-10i_1 + 16i_2 - 6i_3 &= 4,\\
-8i_1 - 6i_2 + 18i_3 &= 6,
\end{aligned}$$
which in matrix form Ax = c is
$$\begin{bmatrix} 30 & -10 & -8 \\ -10 & 16 & -6 \\ -8 & -6 & 18 \end{bmatrix}\begin{bmatrix} i_1 \\ i_2 \\ i_3 \end{bmatrix} = \begin{bmatrix} 8 \\ 4 \\ 6 \end{bmatrix}.$$
Solving this system of coupled, linear, algebraic equations would provide the sought-after solution for the currents x = [i1 i2 i3]^T.
Note that if the circuit also includes capacitors and/or inductors, application of Kirchhoff's laws would produce a system of ordinary differential equations (see Section 1.3.4).
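As a quick check of Example 1.1, the following Python/NumPy snippet solves the 3 × 3 system above numerically; numpy.linalg.solve is used here simply as a stand-in for the hand methods of Section 1.4.

```python
import numpy as np

# Coefficient matrix and right-hand side from Example 1.1.
A = np.array([[ 30.0, -10.0,  -8.0],
              [-10.0,  16.0,  -6.0],
              [ -8.0,  -6.0,  18.0]])
c = np.array([8.0, 4.0, 6.0])

i = np.linalg.solve(A, c)       # currents i1, i2, i3
print(i)
print(np.allclose(A @ i, c))    # verify the solution satisfies Ax = c
```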

Example 1.2 Recall that for a structure in static equilibrium, the sums of the forces and moments are both zero, which holds for each individual member as
well as the entire structure. Determine the forces in each member of the truss
structure shown in Figure 1.2 (see Jeffrey, p. 227).
Solution: Note that all members are of length ℓ; therefore, the structure is comprised of equilateral triangles with all angles being π/3. We first obtain the reaction forces at the supports 1 and 2 by summing the forces and moments for the entire structure as follows:
$$\sum F = 0:\quad R_1 + R_2 - 3 = 0,$$
$$\sum M_A = 0:\quad 2\ell R_2 - 3\ell = 0;$$
therefore,
$$R_1 = \frac{3}{2}, \qquad R_2 = \frac{3}{2}.$$


Figure 1.2 Schematic of the truss structure in Example 1.2.

Next we draw a free-body diagram for each of the joints A, B, C, D, and E


and sum the forces in the x- and y-directions to zero as follows:

$$\begin{aligned}
\textstyle\sum F_{Ax} = 0:&\quad F_2 + F_1\cos\tfrac{\pi}{3} = 0,\\
\textstyle\sum F_{Ay} = 0:&\quad R_1 + F_1\sin\tfrac{\pi}{3} = 0,\\
\textstyle\sum F_{Bx} = 0:&\quad F_4 + F_3\cos\tfrac{\pi}{3} - F_1\cos\tfrac{\pi}{3} = 0,\\
\textstyle\sum F_{By} = 0:&\quad -F_1\sin\tfrac{\pi}{3} - F_3\sin\tfrac{\pi}{3} = 0,\\
&\quad\;\;\vdots
\end{aligned}$$
and similarly for joints C, D, and E. Because cos(π/3) = 1/2 and sin(π/3) = √3/2, the ten joint equations assemble into the matrix system Ax = c, where A is a 10 × 7 coefficient matrix of direction cosines (entries 0, ±1, ±1/2, and ±√3/2), x = [F1 F2 · · · F7]^T contains the unknown member forces, and c is determined by the reaction and applied loads.

Observe that there are ten equations (two for each joint), but only seven unknown forces; therefore, three of the equations must be linear combinations of the other equations in order to have a unique solution (see Section 1.4.3). Carefully examining the coefficient matrix and right-hand side vector, one can see that the coefficient matrix contains information that relates only to the structure's geometry, while the right-hand side vector is determined by the external loads only.
Hence, different loading scenarios for the same structure can be considered by
simply adjusting c accordingly.

The previous two examples illustrate how matrices can be used to represent the
governing equations of a physical system directly. There are many such examples
in science and engineering. In the truss problem, for example, the equilibrium
equations are applied to each discrete element of the structure leading to a system of linear algebraic equations for the forces in each member that must be
solved simultaneously. In other words, the discretization follows directly from the
geometry of the problem. For continuous systems, such as solids, fluids, and heat
transfer problems involving continuous media, the governing equations are in the
form of ordinary or partial differential equations for which the discretization is
less obvious. This gets us into the important and expansive topic of numerical
methods. Although this is beyond the scope of our considerations at this time
(see Part II), let us briefly motivate the overall approach using a simple one-dimensional example.

Example 1.3 Consider the fluid flow between two infinite parallel flat plates
with the upper surface moving with constant speed U and an applied pressure
gradient in the x-direction; this is known as Couette flow. This is shown schematically in Figure 1.3.
Figure 1.3 Schematic of Couette flow.
Solution: The one-dimensional, fully-developed flow is governed by the ordinary differential equation (derived from the Navier-Stokes equations enforcing conservation of momentum)
$$\frac{d^2 u}{dy^2} = \frac{1}{\mu}\frac{dp}{dx} = \text{const.}, \tag{1.1}$$
where u(y) is the fluid velocity in the x-direction, which we seek, p(x) is the specified linear pressure distribution in the x-direction (such that dp/dx is a constant), and μ is the fluid viscosity. The no-slip boundary conditions at the lower and upper surfaces are
$$u = 0 \quad \text{at} \quad y = 0, \tag{1.2}$$
$$u = U \quad \text{at} \quad y = H, \tag{1.3}$$

respectively. Equations (1.1) through (1.3) represent the mathematical model of


the physical phenomenon, which is in the form of a differential equation.
In order to discretize the continuous problem, we first divide the domain 0 ≤ y ≤ H into I equal subintervals of length Δy = H/I as shown in Figure 1.4.
Figure 1.4 Grid used for Couette flow example.
Now we must discretize the governing equation (1.1) on this grid or mesh. To
see how this is done using the finite-difference method, recall the definition of the
derivative

$$\left.\frac{du}{dy}\right|_{y=y_i} = \lim_{\Delta y \to 0} \frac{u(y_{i+1}) - u(y_i)}{\Delta y},$$
where Δy = y_{i+1} − y_i. This suggests that we can approximate the derivative by taking Δy to be small, but not in the limit as it goes all the way to zero. Thus,
$$\left.\frac{du}{dy}\right|_{y=y_i} \approx \frac{u(y_{i+1}) - u(y_i)}{\Delta y} = \frac{u_{i+1} - u_i}{\Delta y},$$
where ui = u(yi). This is called a forward difference. A more accurate approximation is given by a central difference of the form
$$\left.\frac{du}{dy}\right|_{y=y_i} \approx \frac{u_{i+1} - u_{i-1}}{2\Delta y}. \tag{1.4}$$
We can obtain a central difference approximation for the second-order derivative
$$\frac{d^2u}{dy^2} = \frac{d}{dy}\left(\frac{du}{dy}\right)$$
as in equation (1.1) by applying (1.4) midway between successive grid points as follows:
$$\left.\frac{d^2u}{dy^2}\right|_{y=y_i} \approx \frac{\left(\dfrac{du}{dy}\right)_{i+1/2} - \left(\dfrac{du}{dy}\right)_{i-1/2}}{\Delta y}.$$
Approximating du/dy as in equation (1.4) yields
$$\left.\frac{d^2u}{dy^2}\right|_{y=y_i} \approx \frac{\dfrac{u_{i+1} - u_i}{\Delta y} - \dfrac{u_i - u_{i-1}}{\Delta y}}{\Delta y},$$
or
$$\left.\frac{d^2u}{dy^2}\right|_{y=y_i} \approx \frac{u_{i+1} - 2u_i + u_{i-1}}{(\Delta y)^2}. \tag{1.5}$$

This is the central-difference approximation for the second-order derivative. As


we will see later, the approximations (1.4) and (1.5) can be formally derived using
Taylor series expansions.
Given equation (1.5), we can discretize the governing differential equation (1.1) according to
$$\frac{u_{i+1} - 2u_i + u_{i-1}}{(\Delta y)^2} = \frac{1}{\mu}\frac{dp}{dx},$$
or
$$a u_{i-1} + b u_i + c u_{i+1} = d, \tag{1.6}$$
where
$$a = 1, \qquad b = -2, \qquad c = 1, \qquad d = \frac{(\Delta y)^2}{\mu}\frac{dp}{dx}.$$
The boundary conditions (1.2) and (1.3) then require that
$$u_1 = 0, \qquad u_{I+1} = U. \tag{1.7}$$
Applying equation (1.6) at each node point in the grid corresponding to i = 2, 3, 4, . . . , I (u1 and u_{I+1} are known) gives the system of equations
$$\begin{aligned}
i = 2:&\quad a u_1 + b u_2 + c u_3 = d,\\
i = 3:&\quad a u_2 + b u_3 + c u_4 = d,\\
i = 4:&\quad a u_3 + b u_4 + c u_5 = d,\\
&\quad\;\;\vdots\\
i = i:&\quad a u_{i-1} + b u_i + c u_{i+1} = d,\\
&\quad\;\;\vdots\\
i = I-1:&\quad a u_{I-2} + b u_{I-1} + c u_I = d,\\
i = I:&\quad a u_{I-1} + b u_I + c u_{I+1} = d.
\end{aligned}$$
Given the boundary conditions (1.7), the system in matrix form is
$$\begin{bmatrix}
b & c & & & & \\
a & b & c & & & \\
 & \ddots & \ddots & \ddots & & \\
 & & a & b & c & \\
 & & & \ddots & \ddots & \ddots \\
 & & & & a & b
\end{bmatrix}
\begin{bmatrix} u_2 \\ u_3 \\ \vdots \\ u_i \\ \vdots \\ u_I \end{bmatrix}
=
\begin{bmatrix} d \\ d \\ \vdots \\ d \\ \vdots \\ d - cU \end{bmatrix},$$
where all of the empty elements in the coefficient matrix are zeros. This is called a tridiagonal matrix as only three of the diagonals have non-zero elements. This structure arises because of how the equation is discretized using central differences that involve u_{i-1}, u_i, and u_{i+1}.
I can be very large, thereby leading to a large system of algebraic equations to
solve for the velocities u2 , u3 , . . . , ui , . . . , uI .
In all three examples, we end up with a system of linear algebraic equations to
solve for the currents, forces, or discretized velocities.

1.3.2 Eigenproblems
As described in Section 1.6, a linear system of algebraic equations can be thought
of as a transformation from one vector space to another. Such transformations
can scale, translate, and/or rotate a vector. In many applications, it is of interest to determine for a given transformation which vectors get transformed into
themselves with only a scalar stretch factor, that is, the vectors point in the same
direction after the transformation has been applied. In such a case, the system
of equations is of the form
$$A\mathbf{x} = \lambda\mathbf{x},$$
where A is the transformation matrix, λ is an eigenvalue representing the stretch
factor, and x is known as an eigenvector and is the vector that is transformed
into itself.
Eigenproblems commonly arise in stability problems (see Section 19.3), obtaining natural frequencies of dynamical systems (see Section 19.1), and determining
the principal stresses in solid and fluid mechanics (see Section 2.1), for example.
It also plays a starring role in the diagonalization procedure used in quadratic
forms (see Section 2.2.3) and, more importantly, for solving systems of ordinary
differential equations (see Section 2.4). In addition, there are several powerful
matrix decompositions that are related to, based on, or used to solve the eigenproblem (see Section 2.6). Optimization using quadratic programming reduces
to solving generalized eigenproblems (see Section 4.2.4). The eigenproblem also


has important applications in numerical methods, such as determining if an iterative method will converge toward the exact solution (see Section 6.2.4). The
eigenproblem in its various forms occupies our attention in Chapter 2.
Example 1.4 Consider the stresses acting on an infinitesimally small tetrahedral element of a solid or fluid as illustrated in Figure 1.5.
Figure 1.5 Stresses acting on a tetrahedral element.
Solution: Note that A is the area of B-C-D. The stress field is given by the stress tensor, which is a 3 × 3 matrix of the form
$$\sigma = \begin{bmatrix} \sigma_x & \tau_{xy} & \tau_{xz} \\ \tau_{xy} & \sigma_y & \tau_{yz} \\ \tau_{xz} & \tau_{yz} & \sigma_z \end{bmatrix},$$
where σx, σy, and σz are the normal stresses, and τxy, τxz, and τyz are the shear stresses. Observe that the stress tensor is symmetric, that is, σ = σ^T.
The principal axes are the coordinate axes with respect to which only normal stresses act, that is, there are no shear stresses. These are called principal stresses. Let us take n to be the normal unit vector, having length one, to the inclined face B-C-D on which only a normal stress σn acts, then
$$\mathbf{n} = n_x\mathbf{i} + n_y\mathbf{j} + n_z\mathbf{k}.$$
Enforcing static equilibrium of the element, that is, ΣFx = 0, requires that
$$\left(\sigma_n\,\mathbf{n}\cdot\mathbf{i}\right) A = \sigma_x\left(A\,\mathbf{n}\cdot\mathbf{i}\right) + \tau_{xy}\left(A\,\mathbf{n}\cdot\mathbf{j}\right) + \tau_{xz}\left(A\,\mathbf{n}\cdot\mathbf{k}\right),$$
where the factor in parentheses on the left-hand side is the component of the stress in the x-direction that is acting on the B-C-D surface, and those in the terms on the right-hand side are the areas of the x-, y-, and z-faces, respectively. Simplifying yields
$$\sigma_n n_x = \sigma_x n_x + \tau_{xy} n_y + \tau_{xz} n_z.$$
Similarly,
$$\textstyle\sum F_y = 0:\quad \sigma_n n_y = \tau_{xy} n_x + \sigma_y n_y + \tau_{yz} n_z,$$
$$\textstyle\sum F_z = 0:\quad \sigma_n n_z = \tau_{xz} n_x + \tau_{yz} n_y + \sigma_z n_z.$$
In matrix form, this system of equations is given by
$$\begin{bmatrix} \sigma_x & \tau_{xy} & \tau_{xz} \\ \tau_{xy} & \sigma_y & \tau_{yz} \\ \tau_{xz} & \tau_{yz} & \sigma_z \end{bmatrix}\begin{bmatrix} n_x \\ n_y \\ n_z \end{bmatrix} = \sigma_n \begin{bmatrix} n_x \\ n_y \\ n_z \end{bmatrix},$$
or σn = σn n. Thus, the three eigenvalues σn of the stress tensor are the principal stresses, and the three corresponding eigenvectors n are the principal axes on which they each act.
In a similar manner, the principal moments of inertia are the eigenvalues of the moment of inertia tensor, which in two dimensions is
$$\begin{bmatrix} I_{xx} & I_{xy} \\ I_{xy} & I_{yy} \end{bmatrix},$$
where Ixx and Iyy are the moments of inertia, and Ixy is the product of inertia. The eigenvectors are the corresponding coordinate directions on which the principal moments of inertia act.
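As a numerical illustration of this eigenproblem, the short Python/NumPy sketch below computes the principal stresses and principal axes of an assumed symmetric stress tensor (the numerical values are hypothetical, not from the text).

```python
import numpy as np

# Illustrative symmetric stress tensor (assumed values).
sigma = np.array([[ 50.0,  30.0,   0.0],
                  [ 30.0, -20.0,   0.0],
                  [  0.0,   0.0,  10.0]])

# For a real symmetric matrix, eigh returns real eigenvalues (principal stresses)
# in ascending order and orthonormal eigenvectors (principal axes) as columns.
principal_stresses, principal_axes = np.linalg.eigh(sigma)
print(principal_stresses)
print(principal_axes)

# Check the defining relation sigma @ n = sigma_n * n for the first pair.
n0 = principal_axes[:, 0]
print(np.allclose(sigma @ n0, principal_stresses[0] * n0))   # True
```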

1.3.3 Quadratic Expressions


The focus of matrix methods is on linear systems, that is, those that can be
expressed as polynomials of first degree or less. However, there are some limited
scenarios in which these methods can be extended to nonlinear systems. In particular, this is the case for quadratic expressions involving polynomials of degree
two or less. This has applications in the method of least squares and in certain
optimization techniques (see Sections 4.2.4 and 4.3). Note also that kinetic energy
is expressed as a quadratic. Quadratic forms are briefly discussed in Section 2.2.3
and used as a means to introduce matrix diagonalization.

1.3.4 Systems of First-Order, Linear Differential Equations


Linear systems, or more generally dynamical systems, are largely synonymous
with matrix methods. The governing equations of any time-dependent linear system, whether electrical, mechanical, biological, or chemical, can be expressed as
a system of first-order linear differential equations of the form
$$\dot{\mathbf{x}} = A\mathbf{x},$$
where the dot represents a time derivative, A is a known coefficient matrix,
and x is the solution vector that is sought. Expressed in this form, it is often

18

Vector and Matrix Algebra

referred to as the state-space representation (see Section 2.5.2). The diagonalization procedure used to solve such systems is discussed in Section 2.5 and has
widespread application in a variety of fields of engineering. In addition, determining the natural frequencies of a system, that is when the entire system oscillates
with a single frequency, results in a differential eigenproblem as in Section 1.3.2
(see Section 19.1). Similarly, evaluating stability of a system subject to small
disturbances leads to a differential eigenproblem (see Sections 19.4 and 19.5).
When studying dynamical systems, we typically start with linear systems
from parallel electrical circuits and discrete mechanical systems involving masses,
springs, pendulums, and dampers, for example. However, some of the concepts
and methods extend to continuous systems of solids or fluids and to nonlinear
systems.
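As a preview of the machinery developed in Section 2.5, the following Python sketch integrates ẋ = Ax for an assumed damped-oscillator coefficient matrix A by diagonalizing it; the system and its parameter values are illustrative only.

```python
import numpy as np

# A simple damped oscillator in state-space form xdot = A x,
# with x = [position, velocity]; the parameter values are assumed.
omega, zeta = 2.0, 0.1
A = np.array([[0.0, 1.0],
              [-omega**2, -2.0 * zeta * omega]])

x0 = np.array([1.0, 0.0])      # initial condition

# x(t) = exp(A t) x0, evaluated via the eigen-decomposition A = V diag(lam) V^{-1}
# (the diagonalization procedure of Section 2.5).
lam, V = np.linalg.eig(A)
Vinv = np.linalg.inv(V)

def x_of_t(t):
    return (V @ np.diag(np.exp(lam * t)) @ Vinv @ x0).real

for t in (0.0, 1.0, 2.0):
    print(t, x_of_t(t))
```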
1.4 Systems of Linear Algebraic Equations
The primary application of vectors and matrices is in representing and solving
systems of coupled linear algebraic equations. The size of the system may be
as small as 2 × 2, for a simple dynamical system, to n × n, where n is in the
millions, for systems that arise from implementation of numerical methods applied to complex physical problems. In the following, we discuss the properties of
such systems and methods for determining their solution. Methods developed in
this chapter are suitable for hand calculations involving small systems of equations. Techniques for solving very large systems of equations computationally are
discussed in Part II.
1.4.1 General Considerations
A linear equation is one in which only polynomial terms of first degree or less
appear. For example, a linear equation in n variables, x1 , x2 , . . . , xn , is of the form
$$A_1 x_1 + A_2 x_2 + \cdots + A_n x_n = c,$$
where Ai, i = 1, . . . , n and c are constants. A system of m coupled linear equations for n variables is of the form
$$\begin{aligned}
A_{11}x_1 + A_{12}x_2 + \cdots + A_{1n}x_n &= c_1,\\
A_{21}x_1 + A_{22}x_2 + \cdots + A_{2n}x_n &= c_2,\\
&\;\;\vdots\\
A_{m1}x_1 + A_{m2}x_2 + \cdots + A_{mn}x_n &= c_m.
\end{aligned}$$
The solution vector x^T = [x1 x2 · · · xn] must satisfy all m equations simultaneously. This system may be written in matrix form Ax = c as
$$\begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{bmatrix}.$$

Note that the coefficient matrix A is m × n, the solution vector x is n × 1, and the right-hand side vector c is m × 1.
Remarks:
1. A system of equations may be considered a transformation, in which premultiplication by the matrix A transforms the vector x into the vector c (see
Section 1.6).
2. A system with c = 0 is called homogeneous.
3. The solution to Ax = c is the list of values x at which all of the equations
intersect. For example, consider two lines l1 and l2 given by
A11 x + A12 y = c1
A21 x + A22 y = c2

As shown in Figure 1.6, the possibilities are:
1. l1 and l2 are parallel lines ⇒ no solution (inconsistent).
2. l1 and l2 have a unique intersection point at (x, y) ⇒ one solution.
3. l1 and l2 are the same line ⇒ infinite solutions.
4. As in Strang (2009), one can also interpret a system of linear algebraic equations as a linear combination of the column vectors of A. If the column vectors of A are denoted by ai, i = 1, . . . , n, then the system of equations can be represented as a linear combination as follows:
$$x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_n\mathbf{a}_n = \mathbf{c}.$$
In this interpretation, the solution vector x contains the amounts of each of
the column vectors required to form the right-hand side vector c, that is, the
linear combination of the column vectors that produces c. This interpretation
emphasizes the fact that the right-hand side vector must be in the column
space of the matrix A in order to have a unique solution (see Appendix A).

1.4.2 Determinants
The determinant is an important quantity that characterizes a matrix. We first
motivate the determinant and why it is useful, then we describe how to compute
it for large matrices, and finally describe some of its properties.
Consider the system of linear algebraic equations
$$\begin{aligned}
A_{11}x_1 + A_{12}x_2 &= c_1,\\
A_{21}x_1 + A_{22}x_2 &= c_2,
\end{aligned}$$
where A11 , A12 , A21 , A22 , c1 and c2 are known constants, and x1 and x2 are the
variables to be found. In order to solve for the unknown variables, multiply the

Figure 1.6 Three possible types of solutions to a system of two linear algebraic equations.

first equation by A22 , the second by A12 and then subtract. This eliminates the
x2 variable, and solving for x1 gives
$$x_1 = \frac{A_{22}c_1 - A_{12}c_2}{A_{11}A_{22} - A_{12}A_{21}}.$$


Similarly, to obtain x2 , multiply the first equation by A21 , the second by A11 and
then subtract. This eliminates the x1 variable, and solving for x2 gives
$$x_2 = \frac{A_{11}c_2 - A_{21}c_1}{A_{11}A_{22} - A_{12}A_{21}}.$$

Observe that the denominators are the same in both cases. We call this denominator the determinant of the 2 × 2 coefficient matrix
$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},$$
and denote it by
$$|A| = A_{11}A_{22} - A_{12}A_{21}.$$
For a 3 × 3 matrix
$$A = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}, \tag{1.8}$$
the determinant is given by
$$|A| = A_{11}A_{22}A_{33} + A_{12}A_{23}A_{31} + A_{13}A_{21}A_{32} - A_{13}A_{22}A_{31} - A_{12}A_{21}A_{33} - A_{11}A_{23}A_{32}. \tag{1.9}$$

Remark: This diagonal multiplication method does not work for A larger than 3 × 3. Therefore, a cofactor expansion must be used.
Cofactor Expansion:
For a square matrix An , Mij is the minor of Aij , and Cij is the cofactor of Aij .
The minor Mij is the determinant of the sub-matrix that remains after the ith
row and j th column are removed, and the cofactor of A is
$$C_{ij} = (-1)^{i+j} M_{ij}.$$
We can then write the cofactor matrix, which is
$$C = [C_{ij}] = \begin{bmatrix} +M_{11} & -M_{12} & +M_{13} & -M_{14} & \cdots \\ -M_{21} & +M_{22} & -M_{23} & +M_{24} & \cdots \\ +M_{31} & -M_{32} & +M_{33} & -M_{34} & \cdots \\ -M_{41} & +M_{42} & -M_{43} & +M_{44} & \cdots \\ \vdots & & & & \ddots \end{bmatrix}.$$
Then, the determinant of the 3 × 3 matrix A [see equation (1.8)] may be evaluated from a cofactor expansion, such as


$$\begin{aligned}
|A| &= A_{11}C_{11} + A_{12}C_{12} + A_{13}C_{13}\\
&= A_{11}[+M_{11}] + A_{12}[-M_{12}] + A_{13}[+M_{13}]\\
&= A_{11}\begin{vmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{vmatrix} - A_{12}\begin{vmatrix} A_{21} & A_{23} \\ A_{31} & A_{33} \end{vmatrix} + A_{13}\begin{vmatrix} A_{21} & A_{22} \\ A_{31} & A_{32} \end{vmatrix}\\
&= A_{11}\left(A_{22}A_{33} - A_{23}A_{32}\right) - A_{12}\left(A_{21}A_{33} - A_{23}A_{31}\right) + A_{13}\left(A_{21}A_{32} - A_{22}A_{31}\right)\\
&= A_{11}A_{22}A_{33} + A_{12}A_{23}A_{31} + A_{13}A_{21}A_{32} - A_{11}A_{23}A_{32} - A_{12}A_{21}A_{33} - A_{13}A_{22}A_{31},
\end{aligned}$$
which is the same as equation (1.9). We must use the cofactor expansion for matrices larger than 3 × 3.
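A direct, if inefficient, way to encode the cofactor expansion is the recursive Python sketch below; it is intended only to mirror the definition, and the test matrix is the one used in Examples 1.5 through 1.7.

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row (fine for small
    matrices; the operation count grows factorially with size)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        # Minor M_1j: delete row 0 and column j.
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
        cofactor = (-1) ** j * det_cofactor(minor)   # (-1)^(1+(j+1)) = (-1)^j here
        total += A[0, j] * cofactor
    return total

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 2.0],
              [1.0, 2.0, 0.0]])
print(det_cofactor(A))        # -9.0, matching Example 1.5
print(np.linalg.det(A))       # same value from the library routine
```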
Properties of Determinants (A and B are n × n):
1. If any row or column is all zeros, |A| = 0.
2. If two rows (or columns) are interchanged, the sign of |A| changes.
3. If any row (or column) is a linear combination of the other rows (or columns), then |A| = 0.
4. |A^T| = |A|.
5. In general, |A + B| ≠ |A| + |B|.
6. |AB| = |BA| = |A||B|.
7. If |A| = 0, then A is singular and not invertible. If |A| ≠ 0, then A is nonsingular and invertible, and |A^{-1}| = 1/|A|.
8. The determinant of a triangular (or diagonal) matrix is the product of the
diagonal elements.

1.4.3 Gaussian Elimination


Using Gaussian elimination, we may solve the system of linear equations Ax = c
by reducing it to a new system that has the same solution x but is straightforward to solve. This is accomplished using elementary row operations, which do
not change the solution vector, in a prescribed sequence. The elementary row
operations are:
1. Interchange two equations (rows).
2. Multiply an equation (row) by a non-zero number.
3. Add a multiple of one equation (row) to another.

1.4 Systems of Linear Algebraic Equations

23

Procedure:
1. Write the augmented matrix:

A11
A21

[A|c] = ..
.

A12
A22
..
.

|
|

A1n
A2n
..
.

c1
c2

.. .
.

|
Am1 Am2 Amn | cm

2. Move rows (equations) with leading zeros to the


3. Divide the 1st row by A11 :

0
0
1
A12 A1n |
A21 A22 A2n |

.
..
..
..
.
.
|
Am1 Am2 Amn |
4. Multiply the 1st row by Ai1

1
0

.
..

bottom.
0
c1
c2

.
..
.

cm

and subtract from the ith row, 2 i m:


0
0
0
A12 A1n | c1
0
0
0
A22 A2n | c2

0
0
0
A32 A3n | c3
.
..
..
..
.
.
| .
0

0 Am2 Amn | cm
5. Repeat steps (2) (4), called forward elimination, on the (m 1) (n 1)
sub-matrix, etc . . . until the matrix is in row-echelon form, in which case the
first non-zero element in each row is a one with zeros below it.
6. Obtain the solution by back substitution. First, solve each equation for its
leading variable (for example, x1 = , x2 = , x3 = ). Second, starting
at the bottom, substitute each equation into all of the equations above it, that
is, back substitution. Finally, the remaining nr variables are arbitrary, where
the rank r is defined below.
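The procedure above translates almost line for line into code. The following Python sketch is a minimal implementation of forward elimination and back substitution for a square, nonsingular system (it is not the robust, pivoted algorithm discussed later in Chapter 6); the test system is the one solved in Example 1.5 below.

```python
import numpy as np

def gauss_solve(A, c):
    """Solve Ax = c for a square, nonsingular A by forward elimination to
    row-echelon form followed by back substitution."""
    A = np.asarray(A, dtype=float)
    c = np.asarray(c, dtype=float)
    n = len(c)
    aug = np.hstack([A.copy(), c.reshape(-1, 1)])   # augmented matrix [A|c]

    # Forward elimination.
    for k in range(n):
        if aug[k, k] == 0.0:                        # step 2: swap in a nonzero pivot
            swap = k + int(np.argmax(np.abs(aug[k:, k]) > 0))
            aug[[k, swap]] = aug[[swap, k]]
        aug[k] = aug[k] / aug[k, k]                 # step 3: make the leading entry one
        for i in range(k + 1, n):                   # step 4: zero the entries below it
            aug[i] -= aug[i, k] * aug[k]

    # Back substitution (step 6).
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = aug[i, -1] - aug[i, i + 1:n] @ x[i + 1:]
    return x

A = [[2, 0, 1], [0, 1, 2], [1, 2, 0]]
c = [3, 3, 3]
print(gauss_solve(A, c))     # [1. 1. 1.], as in Example 1.5
```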

Example 1.5 Using Gaussian elimination, solve the system of linear algebraic equations
$$\begin{aligned}
2x_1 + x_3 &= 3,\\
x_2 + 2x_3 &= 3,\\
x_1 + 2x_2 &= 3.
\end{aligned}$$
Solution: In the matrix form Ax = c, this system of equations is represented in the form
$$\begin{bmatrix} 2 & 0 & 1 \\ 0 & 1 & 2 \\ 1 & 2 & 0 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix}.$$


Note that the determinant of A is |A| = −9; therefore, the matrix is nonsingular.
In order to apply Gaussian elimination, let us write the augmented system
$$\left[\begin{array}{ccc|c} 2 & 0 & 1 & 3 \\ 0 & 1 & 2 & 3 \\ 1 & 2 & 0 & 3 \end{array}\right],$$
and perform elementary row operations to produce the row-echelon form via forward elimination. In order to have a leading one in the first column of the first row, dividing through by two produces
$$\left[\begin{array}{ccc|c} 1 & 0 & \tfrac{1}{2} & \tfrac{3}{2} \\ 0 & 1 & 2 & 3 \\ 1 & 2 & 0 & 3 \end{array}\right].$$
We first seek to eliminate the elements below the leading one in the first column of the first row. The first column of the second row already contains a zero, so we focus on the third row. To eliminate the leading one, subtract the third row from the first row to produce
$$\left[\begin{array}{ccc|c} 1 & 0 & \tfrac{1}{2} & \tfrac{3}{2} \\ 0 & 1 & 2 & 3 \\ 0 & -2 & \tfrac{1}{2} & -\tfrac{3}{2} \end{array}\right].$$
Now the sub-matrix that results from eliminating the first row and first column is considered. There is already a leading one in the second column of the second row, so we seek to eliminate the element directly beneath it. To do so, multiply the second row by two and add to the third row to yield
$$\left[\begin{array}{ccc|c} 1 & 0 & \tfrac{1}{2} & \tfrac{3}{2} \\ 0 & 1 & 2 & 3 \\ 0 & 0 & \tfrac{9}{2} & \tfrac{9}{2} \end{array}\right].$$
To obtain row-echelon form, divide the third row by 9/2 to give
$$\left[\begin{array}{ccc|c} 1 & 0 & \tfrac{1}{2} & \tfrac{3}{2} \\ 0 & 1 & 2 & 3 \\ 0 & 0 & 1 & 1 \end{array}\right].$$
Having obtained the row-echelon form, we now perform the back substitution to obtain the solution for x. Beginning with the third row, we see that
$$x_3 = 1.$$
Then from the second row
$$x_2 + 2x_3 = 3,$$
or substituting x3 = 1, we have
$$x_2 = 1.$$
Similarly, from the first row,
$$x_1 + \tfrac{1}{2}x_3 = \tfrac{3}{2},$$
or again substituting x3 = 1 leads to
$$x_1 = 1.$$
Therefore, the solution is
$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}.$$

In order to introduce some terminology, consider the following possible row-echelon form resulting from Gaussian elimination:
$$[A'|c'] = \left[\begin{array}{cccc|c} 1 & a & b & c & d \\ 0 & 0 & 1 & e & f \\ 0 & 0 & 0 & 1 & g \\ 0 & 0 & 0 & 0 & h \\ 0 & 0 & 0 & 0 & j \end{array}\right],$$
where the first r rows are the non-zero rows and the remaining rows are the residual equations.
The order (size) of the largest square sub-matrix of A' with a nonzero determinant is the rank of A, denoted by r = rank(A). Equivalently, the rank of A is the number of non-zero rows of A when reduced to row-echelon form A'. In the above example, the rank is r = 3.
Remarks:
1. Possibilities, with n being the number of unknowns (columns) of A, and m the number of equations (rows):
If the A'ij's for the ith row (i > r) are all zero, but c'i is nonzero, the system is inconsistent and no solution exists. In this case, the residual equations say that a nonzero number equals zero, which obviously cannot be the case.
If r = n (r remaining equations for n unknowns), then there is a unique solution.
If r < n, then there is an (n − r)-parameter family of solutions with defect n − r. That is, (n − r) of the variables are arbitrary. In this case, there are infinite solutions.
2. If the rank of A and the rank of the augmented matrix [A|c] are equal, the system is consistent and a solution(s) exists (cf. note 1).
3. rank(A) = rank(A^T), that is, the rank of the row space of A is the same as that of the column space of A (see Appendix B).
4. Elementary row operations do not change the rank of a matrix.
5. If A is an n × n triangular matrix, then
$$|A| = A_{11}A_{22}\cdots A_{nn},$$

in which case the determinant is the product of the elements on the main diagonal. Therefore, we may row reduce the matrix A to triangular form in order to simplify the determinant calculation. In doing so, however, we must take into account the influence of each elementary row operation on the determinant as follows:
1. Interchange rows: |A'| = −|A|.
2. Multiply a row by k: |A'| = k|A|.
3. Add a multiple of one row to another: |A'| = |A|.

1.4.4 Matrix Inverse


Given a system of equations Ax = c, with an invertible coefficient matrix (|A| ≠ 0), the system may also be solved using the matrix inverse. Premultiplying by A^{-1} gives¹
$$A^{-1}A\mathbf{x} = A^{-1}\mathbf{c}.$$
Because A^{-1}A = I, the solution vector may be obtained from
$$\mathbf{x} = A^{-1}\mathbf{c}.$$
Note that one can easily change c and premultiply by A^{-1} to obtain a solution for a different right-hand side vector c. If c = 0, that is, the system is homogeneous, the only solution is the trivial solution x = 0.²
One technique for obtaining the matrix inverse is to use Gaussian elimination.
This is based on the fact that the elementary row operations that reduce An to In also reduce In to An^{-1}:
$$[\,A_n \mid I_n\,] \;\xrightarrow{\;\text{row operations}\;}\; [\,I_n \mid A_n^{-1}\,].$$
Remarks:
1. Note that A is not invertible if the row operations do not produce In.
2. Similar to the transpose, (AB)^{-1} = B^{-1}A^{-1}, where we again note the reverse order.

¹ Despite our use of the notation A^{-1} for the matrix inverse, there is no such thing as a matrix reciprocal, that is, A^{-1} ≠ 1/A.
² Note that while the terms homogeneous and trivial both have to do with something being zero, homogeneous refers to the right-hand side of an equation, while trivial refers to a solution.

1.4 Systems of Linear Algebraic Equations

27

Example 1.6 As in the previous example using Gaussian elimination, consider the system of linear algebraic equations
$$\begin{aligned}
2x_1 + x_3 &= 3,\\
x_2 + 2x_3 &= 3,\\
x_1 + 2x_2 &= 3.
\end{aligned}$$
Solution: In matrix form Ax = c, this is
$$\begin{bmatrix} 2 & 0 & 1 \\ 0 & 1 & 2 \\ 1 & 2 & 0 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix}.$$
Recall that the determinant of A is |A| = −9; therefore, the matrix is nonsingular and invertible.
Augmenting the coefficient matrix A with the 3 × 3 identity matrix I gives
$$\left[\begin{array}{ccc|ccc} 2 & 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 2 & 0 & 1 & 0 \\ 1 & 2 & 0 & 0 & 0 & 1 \end{array}\right].$$
Performing the same Gaussian elimination steps as in the previous example to obtain the row-echelon form produces
$$\left[\begin{array}{ccc|ccc} 1 & 0 & \tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 \\ 0 & 1 & 2 & 0 & 1 & 0 \\ 0 & 0 & 1 & \tfrac{1}{9} & \tfrac{4}{9} & -\tfrac{2}{9} \end{array}\right].$$
To obtain the inverse of A on the right, we must have the identity matrix on the left, that is, we must have the left side in reduced row-echelon form. To eliminate the two in the third column of the second row, multiply the third row by minus two and add to the second row. Similarly, to eliminate the one-half in the third column of the first row, multiply the third row by minus one-half and add to the first row. This leads to
$$\left[\begin{array}{ccc|ccc} 1 & 0 & 0 & \tfrac{4}{9} & -\tfrac{2}{9} & \tfrac{1}{9} \\ 0 & 1 & 0 & -\tfrac{2}{9} & \tfrac{1}{9} & \tfrac{4}{9} \\ 0 & 0 & 1 & \tfrac{1}{9} & \tfrac{4}{9} & -\tfrac{2}{9} \end{array}\right].$$
Therefore, the inverse of A is
$$A^{-1} = \frac{1}{9}\begin{bmatrix} 4 & -2 & 1 \\ -2 & 1 & 4 \\ 1 & 4 & -2 \end{bmatrix},$$
and the solution vector can then be obtained as follows:
$$\mathbf{x} = A^{-1}\mathbf{c} = \frac{1}{9}\begin{bmatrix} 4 & -2 & 1 \\ -2 & 1 & 4 \\ 1 & 4 & -2 \end{bmatrix}\begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix} = \frac{1}{9}\begin{bmatrix} 9 \\ 9 \\ 9 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix},$$

which is the same solution as obtained using Gaussian elimination directly in the
previous example.
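The [A | I] reduction above is easy to automate. The following Python sketch is a minimal Gauss-Jordan inversion (with simple partial pivoting added for robustness, which the hand calculation does not need) applied to the matrix of Example 1.6.

```python
import numpy as np

def inverse_gauss_jordan(A):
    """Invert a square matrix by reducing [A | I] to [I | A^{-1}] with
    elementary row operations."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    aug = np.hstack([A.copy(), np.eye(n)])
    for k in range(n):
        # Pivot: swap in the row with the largest entry in column k.
        p = k + int(np.argmax(np.abs(aug[k:, k])))
        aug[[k, p]] = aug[[p, k]]
        aug[k] /= aug[k, k]                  # leading one
        for i in range(n):
            if i != k:                       # zero the rest of column k
                aug[i] -= aug[i, k] * aug[k]
    return aug[:, n:]

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 2.0],
              [1.0, 2.0, 0.0]])
Ainv = inverse_gauss_jordan(A)
print(np.round(9 * Ainv))                    # [[4,-2,1],[-2,1,4],[1,4,-2]], as in Example 1.6
print(np.allclose(A @ Ainv, np.eye(3)))      # True
```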
See Chapter 6 for additional methods for determining the inverse of a matrix
that are suitable for computer algorithms, including LU decomposition, Cholesky
decomposition, and partitioning.
1.4.5 Cramer's Rule
Cramer's rule is an alternative method for obtaining the unique solution to a system Ax = c, where A is n × n and nonsingular (|A| ≠ 0). Recall that the cofactor matrix for an n × n matrix A is
$$C = \begin{bmatrix} C_{11} & C_{12} & \cdots & C_{1n} \\ C_{21} & C_{22} & \cdots & C_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ C_{n1} & C_{n2} & \cdots & C_{nn} \end{bmatrix},$$
where Cij is the cofactor of Aij. We define the adjugate of A to be the transpose of the cofactor matrix as follows
$$\text{Adj}(A) = C^T.$$
The adjugate matrix is often referred to as the adjoint matrix; however, this can lead to confusion with the term adjoint used in other settings.
Evaluating the product of matrix A with its adjugate, it can be shown using the rules of cofactor expansions that
$$A\,\text{Adj}(A) = |A|\,I.$$
Therefore, because AA^{-1} = I (if A is invertible), we can write
$$A^{-1} = \frac{1}{|A|}\text{Adj}(A).$$
This leads, for example, to the familiar result for the inverse of a 2 × 2 matrix (see Section 1.1). More generally, this result leads to Cramer's rule for the solution of the system Ax = c by recognizing that the jth element of the vector Adj(A)c is |Aj|, where Aj, j = 1, 2, . . . , n, is obtained by replacing the jth column of A by the right-hand side vector
$$\mathbf{c} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}.$$
Then the system has a unique solution, which is
$$x_1 = \frac{|A_1|}{|A|}, \quad x_2 = \frac{|A_2|}{|A|}, \quad \ldots, \quad x_n = \frac{|A_n|}{|A|}.$$

1.4 Systems of Linear Algebraic Equations

29

Remarks:
1. If the system is homogeneous, in which case c = 0, then |Aj | = 0, and the
unique solution is x = 0, that is, the trivial solution.
2. Although Cramer's rule applies for any size nonsingular matrix, it is efficient for small systems (n ≤ 3) owing to the ease of finding determinants of 3 × 3
and smaller systems. For large systems, using Gaussian elimination is typically
more practical.

Example 1.7 Let us once again consider the system of linear algebraic equations from the last two examples, which is
$$\begin{aligned}
2x_1 + x_3 &= 3,\\
x_2 + 2x_3 &= 3,\\
x_1 + 2x_2 &= 3.
\end{aligned}$$
Solution: In matrix form Ax = c, the system is
$$\begin{bmatrix} 2 & 0 & 1 \\ 0 & 1 & 2 \\ 1 & 2 & 0 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix}.$$
Again, the determinant of A is |A| = −9. By Cramer's rule, the solution is
$$x_1 = \frac{|A_1|}{|A|} = \frac{1}{-9}\begin{vmatrix} 3 & 0 & 1 \\ 3 & 1 & 2 \\ 3 & 2 & 0 \end{vmatrix} = \frac{-9}{-9} = 1,$$
$$x_2 = \frac{|A_2|}{|A|} = \frac{1}{-9}\begin{vmatrix} 2 & 3 & 1 \\ 0 & 3 & 2 \\ 1 & 3 & 0 \end{vmatrix} = \frac{-9}{-9} = 1,$$
$$x_3 = \frac{|A_3|}{|A|} = \frac{1}{-9}\begin{vmatrix} 2 & 0 & 3 \\ 0 & 1 & 3 \\ 1 & 2 & 3 \end{vmatrix} = \frac{-9}{-9} = 1,$$

which is the same as obtained using the inverse and Gaussian elimination in the
previous two examples.
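For completeness, here is a short Python sketch of Cramer's rule, leaning on numpy.linalg.det for the determinants; as noted in the remarks above, this is only sensible for small systems.

```python
import numpy as np

def cramer_solve(A, c):
    """Solve Ax = c by Cramer's rule: x_j = |A_j| / |A|, where A_j is A with
    its j-th column replaced by c."""
    A = np.asarray(A, dtype=float)
    c = np.asarray(c, dtype=float)
    detA = np.linalg.det(A)
    x = np.empty(len(c))
    for j in range(len(c)):
        Aj = A.copy()
        Aj[:, j] = c                 # replace the j-th column by the right-hand side
        x[j] = np.linalg.det(Aj) / detA
    return x

A = [[2, 0, 1], [0, 1, 2], [1, 2, 0]]
c = [3, 3, 3]
print(cramer_solve(A, c))            # [1. 1. 1.], matching Example 1.7
```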

Problem Set # 1

Figure 1.7 Three-dimensional vector u.

1.5 Linear Vector Spaces


Vectors, which are simply n × 1 matrices, can serve many purposes. In mechanics,
vectors represent the magnitude and direction of quantities such as forces and
velocities. Mathematically, the primary use of vectors is in providing a basis for
a vector space. For example, the orthogonal unit vectors i, j, and k provide a
basis for three-dimensional Cartesian coordinates such that any three-dimensional
vector can be written as a linear combination of the basis vectors.
We first describe some operations and properties related to vectors followed by
a brief discussion of the use of vectors in forming bases for vector spaces. In order
to appeal to straightforward geometric interpretations, let us first review vectors
in three dimensions before extending to higher dimensions.
1.5.1 Three-Dimensional Vectors Review
Recall that a scalar is a quantity specified only by a magnitude, such as mass,
temperature, pressure, etc., whereas a vector is a quantity specified by both a
magnitude and direction, such as force, velocity, momentum, etc. In three dimensions, a vector u is given by

$$\mathbf{u} = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix},$$
which is shown in Figure 1.7 for an orthogonal coordinate system, in which all
three basis vectors are mutually orthogonal.
Vector addition is simply a special case of matrix addition. For example, adding
the two three-dimensional vectors u and v leads to

$$\mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} + \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ u_3 + v_3 \end{bmatrix}.$$

Figure 1.8 Vector addition.
Figure 1.9 Multiplication of a vector by a scalar.

Observe that adding two vectors results in another vector called the resultant
vector. The resultant vector u + v extends from the tail of one vector to the tip
of the other as illustrated in Figure 1.8.
Multiplying a scalar times a vector simply scales the length of the vector accordingly without changing its direction. It is a special case of matrix scalar
multiplication. For example,

$$k\mathbf{u} = k\begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} ku_1 \\ ku_2 \\ ku_3 \end{bmatrix},$$
which is a vector as shown in Figure 1.9.
Matrix-matrix multiplication applied to vectors is known as the inner product.3
The inner product of u and v is denoted by

$$\langle \mathbf{u}, \mathbf{v}\rangle = \mathbf{u}^T\mathbf{v} = \begin{bmatrix} u_1 & u_2 & u_3 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix} = u_1v_1 + u_2v_2 + u_3v_3 = \mathbf{v}^T\mathbf{u} = \langle \mathbf{v}, \mathbf{u}\rangle,$$
which is a scalar. If ⟨u, v⟩ = 0, we say that the vectors u and v are orthogonal. In two and three dimensions, this means they are geometrically perpendicular. Different authors use various notation for the inner product, including
$$\langle \mathbf{u}, \mathbf{v}\rangle = \mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T\mathbf{v} = (\mathbf{u}, \mathbf{v}).$$
The inner product can be used to determine the length of a vector, which is
known as its norm, and given by
$$\|\mathbf{u}\| = \langle \mathbf{u}, \mathbf{u}\rangle^{1/2} = \sqrt{u_1^2 + u_2^2 + u_3^2}.$$
³ The inner product also goes by the terms dot product and scalar product.
Figure 1.10 Angle between two vectors.

A unit vector is one having length (norm) equal to one, that is, unity. The inner
product and norm can also be used to determine the angle between vectors. The
angle between two vectors, as illustrated in Figure 1.10, is such that
cos =

hu, vi
.
kuk kvk

If hu, vi = 0, then = /2, and the vectors are orthogonal. For more on vector
norms, see Section 1.7.
Given the term inner product, one may wonder if there is an outer product.
Indeed there is; however we will not find much practical use for it. Whereas the
inner project of u and v is uT v, which produces a scalar (1 1 matrix), the outer
product is uvT , which produces an m n matrix, where u is m 1 and v is n 1
(m = n for there to be an inner product).
A vector operation that is unique to the two- and three-dimensional cases is
the cross product.4 Recall that the dot (inner or scalar) product of two vectors
produces a scalar; the cross product of two vectors produces a vector that is
perpendicular to the two vectors and obeys the right-hand rule.
In Cartesian coordinates, where the vectors u = u1 i + u2 j + u3 k and v =
v1 i + v2 j + v3 k, the cross product is


i
j k

w = u v = u1 u2 u3 = (u2 v3 u3 v2 ) i (u1 v3 u3 v1 ) j + (u1 v2 u2 v1 ) k.
v1 v2 v3
Note how the cofactor expansion about the first row is used.
Remarks:
1. Observe that the order of the vectors matters when taking the cross product
according to the right-hand rule.
4

The cross product is sometimes referred to as a vector product because it produces a vector in
contrast to the inner product, which produces a scalar.

1.5 Linear Vector Spaces

33

2. If either u or v is zero, or if u and v are parallel, then the cross product is the
zero vector w = 0.
3. If the two vectors u and v are taken to be the two adjacent sides of a parallelogram, then the length of the cross product |w| = |u v| is the area of the
parallelogram.
4. The cross product is a measure of rotation of some sort. For example, they
are typically first encountered in physics when computing moments of forces
about an axis.

1.5.2 n-Dimensional Vectors


The inner product and norm operations extend naturally to n-dimensional vectors; for example, the inner product is
hu, vi = u1 v1 + u2 v2 + + un vn ,
and the norm5 is
kuk = hu, ui

1/2

u21 + u22 + + u2n ;

however, graphical interpretations do not.

1.5.3 Linear Independence of Vectors


In order to form the basis for a vector space, the basis vectors must be linearly
independent. Essentially, this means that none of the vectors can be written as
linear combinations of the others.
In two dimensions, if two vectors u1 and u2 are such that constants c1 and c2
exist for which
c1 u1 + c2 u2 = 0,
where c1 and c2 are not both zero, then 1) u1 is a scalar multiple of u2 , 2) u1
and u2 are parallel, and 3) u1 and u2 are linearly dependent. If no such c1 and c2
exist, that is, the above expression is only satisfied if c1 = c2 = 0, then u1 and
u2 are linearly independent.
Similarly, in n dimensions, a set of m vectors u1 , u2 , , um are linearly independent if their linear combination is zero, in which case
c1 u1 + c2 u2 + + cm um = 0

(1.10)

only if c1 = c2 = = cm = 0. In other words, none of the vectors may be


expressed as a linear combination of the others. If the ci s are not all zero, then
the ui s are linearly dependent.
5

Note that u2 is taken to mean the


inner product of u with itself, that is, hu, ui; however, one must
be careful using this notation as u2 6= u as u2 is a scalar.

34

Vector and Matrix Algebra

To obtain a criteria for the existence of the ci s, take the inner product of each
vector ui (1 i m) with (1.10):
c1 u21 + c2 hu1 , u2 i + + cm hu1 , um i = 0
c1 hu2 , u1 i + c2 u22 + + cm hu2 , um i = 0
,
..
.
c1 hum , u1 i + c2 hum , u2 i + + cm u2m = 0
or in matrix form

u21
hu2 , u1 i

..


hu1 , um i
c1
0
c2 0
hu2 , um i

.. = .. .
..
..
. .
.
.
2
cm
0
hum , u1 i hum , u2 i
um
{z
}
|

hu1 , u2 i
u22
..
.

G0
Thus, if G = |G | 6= 0, where G is the Gram determinant, the only solution is
c = 0, that is, the trivial solution, and the vectors u1 , , um are linearly independent. If G = |G0 | = 0, many nontrivial solutions exist (non-unique solutions),
and the vectors u1 , , um are linearly dependent.
0

Remarks:
1. The matrix G0 is symmetric owing to the properties of inner products.
2. If the matrix A is formed by placing the vectors ui , i = 1, . . . , m as the
columns, then G0 = AT A.

Example 1.8 Consider the case when u1 , , um are all mutually orthogonal,
and determine if the vectors are linearly independent.
Solution: In this case, the vectors are non-zero, in which case


ku1 k2
0

0
c1
0
2
0

ku2 k

0 c2 0

..
..
.. .. = .. .
.
.
.
.
.
. . .
0

kum k2

cm

Thus, G 6= 0 and the only solution is c = 0 (trivial solution), and u1 , , um are


linearly independent.

1.5.4 Basis of a Vector Space


If u1 , u2 , , ur are linearly-independent, n-dimensional vectors (n elements in
each vector), all of the vectors v that are linear combinations of the vectors ui ,

1.5 Linear Vector Spaces

35

such that
v = c1 u1 + c2 u2 + + cr ur ,
where the ci s are arbitrary, form a vector space V with dimension r. We say that
1) V is a subspace of n-dimensional space, 2) the vectors u1 , , ur span V and
form a basis for it, 3) if ui are unit vectors, each ci is the component of v along
ui , 4) r is the rank of G0 (defect = n r), 5) r is also the rank of the matrix
with the ui s as rows or columns, and 6) the range of a matrix A, denoted by
range(A), is the vector space spanned by the columns of A.
Example 1.9

Determine whether the vectors





2
1
1

u1 = 1 , u2 = 0 , u3 = 1
3
1
2

span three-dimensional space.


Solution: In other words, can an arbitrary three-dimensional vector v be expressed as a linear combination of u1 , u2 and u3 ? That is
v = c1 u1 + c2 u2 + c3 u3 ,
or




v1
1
1
2
v2 = c1 1 + c2 0 + c3 1 ,
v3
2
1
3

or
v1 = c1 +c2 +2c3
v2 = c1
+c3 ,
v3 = 2c1 +c2 +3c3
or

1
1
2
|


1 2 c1
v1
0 1 c2 = v2 .
1 3 c3
v3
{z
}

A
Note that the columns of A are u1 , u2 and u3 .
Is this linear system consistent for all v? To find out, evaluate the determinant
(only possible if A is square)
|A| = 0 + 2 + 2 0 3 1 = 0.
Because the determinant is zero, the system is not consistent, and no unique
solution exists for c with any v. Hence, the vectors u1 , u2 and u3 do not span all
of three-dimensional space.

36

Vector and Matrix Algebra

Alternatively, we could determine G0 and evaluate the Gram determinant

u21
hu1 , u2 i hu1 , u3 i
u22
hu2 , u3 i
G0 = hu2 , u1 i
hu3 , u1 i hu3 , u2 i
u23

6 3 9
G0 = 3 2 5 ,
9 5 14

which is symmetric. Therefore,


G = |G0 | = 0,
and u1 , u2 , and u3 are not linearly independent and cannot span all of threedimensional space.
Alternatively, we could row reduce the system to determine if r = n. In this
example, r < n because the third equation is the sum of the first and second
equations. Observe that it is easiest to do the first method (if A is square).
Remarks:
1. If m > r, that is, we have u1 , u2 , , ur , , um , then (m r) us are linear
combinations of the r linearly-independent eigenvectors us.
2. n-dimensional space is spanned by any set of n linearly-independent, n-D
vectors, that is, having n elements.
3. A special case occurs when the basis vectors are mutually orthogonal, in which
case they are also linearly independent. Therefore, n mutually orthogonal, ndimensional vectors form a basis for n-dimensional space, which was shown
above. For example, for the standard basis
iT1 = [1 0 0 0]
|
{z
}
n elements

iT2 = [0 1 0 0] .
..
.
iTn = [0 0 0 1]
Summarizing: If A is an n n matrix, the following statements are equivalent:
|A| =
6 0.
A is nonsingular.
A is invertible.
A has rank n.
The row (or column) vectors of A are linearly independent.
The row (or column) vectors of A span n-dimensional space and form a basis
for it.
7. Ax = 0 has only the trivial solution because x = A1 0 = 0.

1.
2.
3.
4.
5.
6.

1.5 Linear Vector Spaces

u2

u2

e1

37

<u2 ,e1>

Figure 1.11 Step 2 in Gram-Schmidt orthogonalization.

8. Ax = c is consistent for every c, with x = A1 c, and there is a unique solution


for each c.
Remarks:
1. The opposite of (1)(7), but not (8), are also equivalent, in which case the
system of equations would either be inconsistent, having no solution, or the
system would have an infinity of solutions.
2. See Appendix B for a discussion of row and column spaces of a matrix.
1.5.5 Gram-Schmidt Orthogonalization
Although any set of n linearly-independent vectors can form a basis for an ndimensional vector space, it is often convenient to work with basis vectors that are
mutually orthogonal and of unit length, that is, orthonormal (orthogonal vectors
normalized to length one). This can be accomplished using Gram-Schmidt orthogonalization, which produces an orthogonal set of unit vectors (e1 , e2 , , es ) from
s linearly-independent vectors (u1 , u2 , , us ). Then any n-dimensional vector v
in the vector space V can be written in terms of the orthonormal basis vectors
as follows:
v = hv, e1 i e1 + hv, e2 i e2 + + hv, es i es ,
where, for example, hv, e1 i is the component of v in the e1 direction.
Procedure (using u1 as the reference vector):
1. Normalize u1 according to
e1 =

u1
,
ku1 k

such that e1 is a unit vector.


2. Subtract the component of u2 in the e1 direction from u2 to obtain
u02 = u2 hu2 , e1 i e1 ,
which is orthogonal to e1 as illustrated in Figure 1.11. Normalizing u02 gives

38

Vector and Matrix Algebra

e2 =

u02
.
ku02 k

3. Determine the component of u03 that is orthogonal to e1 and e2 . This is accomplished by subtracting the components of u3 in the e1 and e2 directions
from u3 as follows:
u03 = u3 hu3 , e1 i e1 hu3 , e2 i e2 .
Normalize u03 produces
e3 =

u03
.
ku03 k

4. Continue for 4 i s:
u0i

= ui

i1
X

hui , ek i ek ,

ei =

k=1

u0i
.
ku0i k

Remarks:
1. The Gram-Schmidt procedure results in an orthonormal (orthogonal and normalized) set of basis vectors e1 , e2 , , es .
2. This procedure always works if the ui s are linearly independent (the order of
the ui s affects the final orthonormal vectors).

1.6 Linear Transformations


Thus far, we have viewed the matrix problem Ax = c as a system of linear
algebraic equations, in which we seek the solution vector x for a given coefficient
matrix A and right-hand side vector c. Alternatively, the matrix problem
A

x = y ,

mn n1

m1

where A is specified and x and y are arbitrary vectors, may be viewed as a linear
transformation (mapping) from n-space (size of x) to m-space (size of y). In this
context, A transforms vectors from an n-space domain to an m-space range as
shown in Figure 1.12.
Example 1.10

Consider the transformation matrix




cos sin
A=
,
sin cos

and determine its action on a vector.

1.6 Linear Transformations

39

Ax
====>

Domain

Range

mspace

nspace

Figure 1.12 Linear transformation from x to y using A.


x2

(y1, y2)
(x1, x2)
x1 sin
+ x2 cos

x1

x1 cos - x2 sin

Figure 1.13 Components of the transformed vector in Example 1.10

Solution: Transforming xT = [x1 x2 ] to yT = [y1 y2 ], we see that


y

= Ax

 
cos sin x1
=
sin cos
x2
 


y1
x1 cos x2 sin
=
.
y2
x1 sin + x2 cos
From Figure 1.13, we see that premultiplying by A rotates x through an angle
.
Remarks:
1. Because the transformations are linear, they can be superimposed. For example,
Ax + Bx = y

40

Vector and Matrix Algebra

is equivalent to
(A + B)x = y.
2. Successively applying a series of transformations to x, such as
x1 = A1 x, x2 = A2 x1 , , xk = Ak xk1 ,
is equivalent to applying the transformation
xk = Ax,
where A = Ak A2 A1 (note the reverse order).
3. If the transformation matrix A is invertible, then A1 y transforms y back
to x, which is the inverse transformation. For example, consider the rotation
transformation. Rotating y through should return x. Recalling that sin
is odd and cos is even, we have
x = A0 y


cos() sin()
=
y
sin() cos()


cos
sin
x =
y.
sin
cos
Is A0 = A1 ? Check if A0 A = I:



cos sin cos sin
0
AA =
sin cos sin cos


cos2 + sin2
cos sin + sin cos
=
sin cos + cos sin
sin2 + cos2


1 0
=
0 1
= I,
in which case A0 = A1 .
4. If y = Ax, then y is a linear combination of the columns (a1 , a2 , . . . , an ) of A
in the form
y = Ax = x1 a1 + x2 a2 + + xn an .
5. In the case when a matrix A transforms a non-zero vector x to the zero vector
0, that is, Ax = 0, the vector x is said to be in the nullspace of A. The
nullspace is a subspace of that defined by the row or column vectors of A.

Problem Set # 2

1.7 Note on Norms

41

1.7 Note on Norms


Throughout this section we have made use of the norm as a measure of the size or
length of a vector. It essentially replaces the absolute value of real numbers (or the
modulus of complex numbers). Thus far, it has been defined for an n-dimensional
vector u as
q
1/2
kuk = hu, ui = u21 + u22 + + u2n ,
that is, the square root of the inner product of the vector with itself. In two and
three dimensions, this definition of the norm corresponds to the Euclidian length
of the vector. It turns out that while this is the most common definition of the
norm, there are others that are encountered as well.
In general, we define the so-called Lp norm, or simply the p-norm, of the vector
u as follows:
!1/p
n
X
kukp =
|ui |p
,
(1.11)
i=1

where p 1. Therefore, our definition of the norm corresponds to the L2 norm,


in which case it is called the two norm. As a result, many authors indicate this
norm as kuk2 . We will use this notation when necessary to prevent any confusion;
however, if no subscript is given, then it is assumed to be the L2 norm. Other
common norms correspond to p = 1 and p = . The L1 norm is the sum of the
absolute values of each component of u according to
kuk1 =

n
X

|ui |.

i=1

The L norm, or -norm, is given by


kuk = max |ui |,
1in

which is simply the largest component of u by absolute value.


We will also have occasion to compute norms of matrices in order to quantify
their size. The L1 norm is the largest L1 norm of the column vectors of A, that
is,
kAk1 = max kaj k1 ,
1jn

where aj denotes the jth column of A. The L norm is the largest L1 norm of
the row vectors of A, that is,
kAk = max kai k1 ,
1im

where ai denotes the ith row of A. Finally, the L2 norm is given by



1/2
kAk2 = AT A
,
where (AT A) is the spectral radius of AT A (see Section 2.1). Hence, it is also

42

Vector and Matrix Algebra

sometimes referred to as the spectral norm. To more directly mimic the L2 norm
for vectors, we also define the Frobenius norm of a matrix given by
!1/2
m X
n
X
2
kAkF =
|Aij |
.
i=1 j=1

This is simply the square root of the sum of the squares of all the elements of the
matrix A.
Remarks:
1. When both vectors and matrices are present, we use the same norm for both
when performing operations.
2. Unless indicated otherwise, we will use the L2 norm for both vectors and
matrices.
3. Although the L2 norm is most commonly used, the L1 norm and L norm are
typically more convenient to compute.
4. The Schwarz inequality for vectors is
| hu, vi | kuk2 kvk2 .
More generally, it can be proven that
kABk kAkkBk.
These inequalities prove useful in determining bounds on various quantities of
interest involving vectors and matrices.
5. For more on the mathematical properties of norms, see Golub and Van Loan
(2013) and Horn and Johnson (2013).
6. In Chapter 3, we will extend the L2 norm to functions, and vector and matrix
norms will play a prominent role in defining the condition number in Chapter 6
that is so central to computational methods.

2
The Eigenproblem and Its Applications

Mathematics is the music of reason.


(James Joseph Sylvester)

Recall from Section 1.6 that we may view a matrix A as a linear transformation
from a vector x to a vector y in the form Ax = y. It is often of interest to know
for what characteristic values of a matrix transforms a vector x into a constant
multiple of itself, such that Ax = x. Such values of are called eigenvalues, and
the corresponding vectors x are called eigenvectors. Some common applications
where eigenvalues and eigenvectors are encountered in mechanics are:
The eigenvalues of the stress tensor, which is a 3 3 matrix, in mechanics are
the principal stresses, with the eigenvectors defining the principal axes along
which they act (see Section 1.3.2).
Similarly, the principal moments of inertia of an object are the eigenvalues of
the moment of inertia matrix, with the eigenvectors defining the principal axes
of inertia.
The natural frequencies of dynamical (mechanical or electrical) systems correspond to eigenvalues (see Section 19.1).
Stability of dynamical (mechanical or electrical) systems is determined from a
consideration of eigenvalues (see Section 19.3).
2.1 Eigenvalues and Eigenvectors
For a given n n matrix A, we form the eigenproblem
Ax = x,

(2.1)

where is an unknown scalar, which is often a physical parameter, and x is an


unknown vector. For the majority of values of with a given matrix A, the system
(2.1) only has the trivial solution x = 0. We are interested in the values of that
produce nontrivial solutions, which are called the eigenvalues1 or characteristic
values. Each eigenvalue i has an associated eigenvector or characteristic vector
xi . Graphically, we may interpret the linear transformation Ax as simply scaling
the vector x by a factor as illustrated in Figure 2.1.
1

The German word eigen means proper or characteristic.

43

44

The Eigenproblem and Its Applications

x
x

Figure 2.1 Graphical interpretation of eigenvalues and eigenvectors x.

In order to determine the eigenvalues and eigenvectors, equation (2.1) may be


written
Ax = Ix
so that the right-hand side is of the same shape as the left-hand side. Rearranging
gives
(A I) x = 0,
where the right-hand side is the zero vector. Because this system is homogeneous,
a nontrivial solution will only exist if
|A I| = 0,
that is, if A I is singular. If |A I| 6= 0, then x = 0, which is the trivial
solution, is the only solution. Rewriting yields


A11
A12

A1n

A21
A22
A2n


= 0.
.
.
..
.
..
..
..


.


An1
An2
Ann
Setting this determinant equal to zero results in a polynomial equation of degree n for , which is called the characteristic equation. The n solutions to the
characteristic equation are the eigenvalues 1 , 2 , . . . , n , which may be real or
complex.
For each = i , i = 1, . . . , n, there is a non-trivial solution ui = x, i = 1, . . . , n;
these are the eigenvectors. For example, if n = 3, the characteristic equation is
of the form
c1 3 + c2 2 + c3 + c4 = 0;
therefore, there are three eigenvalues 1 , 2 , and 3 and three corresponding eigenvectors u1 , u2 , and u3 .

2.1 Eigenvalues and Eigenvectors

45

Figure 2.2 Stresses acting on an infinitesimally small area.

Example 2.1 Find the principal axes and principal stresses for the two-dimensional
stress distribution given by the stress tensor

 


3 1
A = xx xy =
.
yx yy
1 3
Solution: The normal stresses are xx and yy , and xy and yx are the shear
stresses acting on the body. Stresses are defined such that the first subscript
indicates the outward normal to the surface, and the second subscript indicates
the direction of the stress on that face as shown in Figure 2.2. Also recall that
xy = yx ; therefore, the stress tensor A is always symmetric.
The principal axes of the stress tensor A correspond to the orientations of
the axes for which only normal stresses act on the body, that is, there are no
shear stresses. The principal axes are the eigenvectors of the stress tensor A. The
eigenvalues of the stress tensor are the principal (normal) stresses. To find the
eigenvalues and eigenvectors of A, we write (A I)x = 0, which is


3 1
x = 0.
(2.2)
1 3
For a nontrivial solution to exist


3 1


1 3 = 0
(3 )(3 ) (1)(1) = 0
2 6 + 8 = 0
( 2)( 4) = 0
1 = 2, 2 = 4.

46

The Eigenproblem and Its Applications

To find the eigenvector u1 corresponding to 1 = 2, substitute = 2 into (2.2),


which produces


3 2 1
x = 0,
1 3 2
or
x1 x2 = 0
.
x1 + x2 = 0
Observe that these equations are the same; therefore, we have one equation for
two unknowns. Note that we always lose at least one equation when determining
eigenvectors. That is, for n n A, the rank of A I is always less than n. Let
x1 = c1 , in which case x2 = c1 , and the eigenvector is
 
1
.
u1 = c1
1
This is the eigenvector corresponding to 1 = 2 determined to a scalar multiple; hence, any scalar multiple of an eigenvector will satisfy equation (2.1). It is
because we lose one equation that we do not obtain a unique solution for the
eigenvectors.
Find the eigenvector u2 corresponding to 2 = 4:


3 4 1
x = 0,
1 3 4
or
x1 x2 = 0
,
x1 x2 = 0
which is one equation for two unknowns; therefore, we have one arbitrary constant. Let x2 = c2 , in which case x1 = c2 , and the eigenvector is
 
1
u2 = c2
,
1
which is the eigenvector corresponding to 2 = 4 determined to a scalar multiple.
Therefore, u1 and u2 are the principal axes, and 1 and 2 are the principal
(normal) stresses in the u1 and u2 directions, respectively, with no shear stresses
as shown in Figure 2.3.
Remarks:
1. The fact that such principal stresses and axes exist for any stress field is a consequence of the mathematical properties of real symmetric matrices. Specifically, they are always diagonalizable, such that in this case

 

0 0
0
2 0
A0 = x x
=
.
0
y0 y0
0 4
See Section 2.2.3 for more on diagonalization.

2.1 Eigenvalues and Eigenvectors

47

Figure 2.3 Principle stresses and principal axes.

2. Observe that the principal axes (eigenvectors) are orthogonal, such that hu1 , u2 i =
0. As shown in Section 2.2.1, this is always true for real symmetric matrices
with distinct eigenvalues.
In the above example, the eigenvalues are distinct, that is, 1 6= 2 . In some
cases, however, eigenvalues may be repeated as in the next example.
Example 2.2

Determine the eigenvalues and

0 0 1
1 2 0
A=
1 0 2
1 0 1

eigenvectors of the 4 4 matrix

1
1
.
1
0

Solution: In order to determine the eigenvalues, we take the determinant


|A I| = 0,
or



0
1
1

1 2
0
1

= 0.
1
0
2 1

1
0
1

To evaluate the determinant of a 4 4 matrix requires forming a cofactor expansion (see Section 1.4.2). Observing that the second column of our determinant
has all zeros except for one element, we take the cofactor expansion about this
column, which is



1
1

+(2 ) 1 2 1 = 0.
1
1

48

The Eigenproblem and Its Applications

Evaluating the determinant of the 3 3 minor matrix leads to


(2 ) [2 (2 ) + 1 + 1 (2 ) ] = 0,
(2 ) [2 (2 ) ] = 0,
( 2) (2 2 + 1) = 0,
( 2)( 1)2 = 0.
Note that when taking determinants of matrices that are larger than 2 2, look
for opportunities to factor out ( i ) factors rather than forming the nth -degree
polynomial in the standard form, which is typically difficult to factor. The four
eigenvalues are
1 = 0, 2 = 1, 3 = 1, 4 = 2.
Thus, there are two distinct eigenvalues, and two of the eigenvalues are repeated.
We say that the eigenvalue 2 = 3 = 1 has multiplicity two.
Let us first consider the eigenvectors corresponding to the two distinct eigenvalues. For 1 = 0, we have the system

0 0 1 1
1 2 0 1

1 0 2 1 x = 0.
1 0 1 0
Although it is not obvious by observation that we lose one equation, note that
the sum of the third and fourth equations equals the first equation. Therefore,
the system reduces to the final three equations for the four unknowns according
to
x1 + 2x2
+ x4 = 0,
x1
+ 2x3 + x4 = 0,
x1
x3
= 0.
Because we have only three equations for four unknowns, let x1 = c1 . Then back
substituting into the above equations leads to x3 = c1 , x4 = c1 , and x2 = c1 .
Therefore, the eigenvector corresponding to the eigenvalue 1 = 0 is

1
1

u1 = c1
1 .
1
Similarly, for the eigenvalue 4 = 2, we have the system

2 0 1
1
1 0 0
1

x = 0.
1 0 0
1
1 0 1 2

2.1 Eigenvalues and Eigenvectors

The second and third equations are the same;


are
2x1
+ x3 +
x1
+
x1
x3

49

therefore, the remaining equations


x4 = 0,
x4 = 0,
2x4 = 0.

In order to solve this system of three equations for four unknowns, we can use
Gaussian elimination to reduce the system to
x1

x3 2x4 = 0,
x3 + 3x4 = 0,
2x4 = 0.

Back substitution leads to the eigenvector corresponding to the eigenvalue 4 = 2


being

0
1

u4 = c4
0 ,
0
where we note that x2 is arbitrary as it does not appear in any of the equations.
For the repeated eigenvalue 2 = 3 = 1, we attempt to find two linearlyindependent eigenvectors. Setting = 1 in (A I)x = 0 gives

1 0 1
1
1 1 0
1

x = 0.
1 0 1
1
1 0 1 1
The first, third, and fourth rows (equations) are the same, so x1 , x2 , x3 , and x4
are determined from the two equations
x1
+ x3 + x4 = 0,
x1 + x2
+ x4 = 0.
We have two equations for four unknowns requiring two arbitrary constants, so
let x1 = c2 and x2 = c3 ; thus,
x1
x2
x4
x3

=
=
=
=

c2 ,
c3 ,
c2 c3 ,
c2 x4 = c2 (c2 c3 ) = c3 .

Therefore, the two eigenvectors must satisfy

c2
c3

u2,3 =
c3 .
c2 c3

50

The Eigenproblem and Its Applications

The parameters c2 and c3 are arbitrary, so we may choose any two unique pairs
of values that result in linearly-independent eigenvectors. Choosing, for example,
c2 = 1, c3 = 1 for u2 and c2 = 1, c3 = 0 for u3 gives the additional two eigenvectors
(along with u1 and u4 )


1
1
1
0


u2 =
1 , u3 = 0 .
0
1
Note that the arbitrary constant is always implied, even if it is not shown explicitly. In this case, the four eigenvectors are linearly independent; therefore, they
provide a basis for the 4-dimensional vector space associated with matrix A.
Remarks:
1.
2.
3.
4.

The
The
The
The

set of all eigenvalues of A, 1 , 2 , . . . , n , is the spectrum of A.


spectral radius of A is = max (|1 |, |2 |, , |n |).
eigenvalues of A and AT are the same.
n eigenvalues of an n n matrix A are such that
1 + 2 + + n = A11 + A22 + + Ann = tr(A),

where tr(A) is the trace of A, which is the sum of the elements on the main
diagonal.
5. The eigenvectors corresponding to distinct eigenvalues are always linearly independent.
6. It can be shown that |A| = 1 2 . . . n . Therefore, A is a singular matrix if
and only if at least one of the eigenvalues is zero.
7. The eigenvalues of an n n diagonal matrix are on the main diagonal, that is,

1 0 0
0 2 0

A = ..
.. . .
.. ,
.
. .
.
0 0 n
and the corresponding eigenvectors are



1
0
0
0
1
0



u1 = .. , u2 = .. , . . . , un = .. ,
.
.
.
0
0
1
which are linearly independent (and mutually orthogonal). If |A| = 1 2 n 6=

2.1 Eigenvalues and Eigenvectors

0, that is, i 6= 0, i = 1, . . . , n, then the inverse


1
0
1
1
0

A1 = .
.
.. . . .
..
0

51

of A is

0
0

.
..
.
1
n

8. Consider the eigenproblem Aui = i ui . Premultiplying by A gives


A (Aui ) = i Aui ,
or from the original eigenproblem
A2 ui = 2i ui .
Doing so again yields

A A2 ui = 2i Aui ,

or from the original eigenproblem


A3 ui = 3i ui .
Generalizing, we obtain the important result that
An ui = ni ui ,

n = 1, 2, 3, . . . .

That is, if i are the eigenvalues of A, then the eigenvalues of An are ni , and
the eigenvectors of A and An are the same.
9. Cayley-Hamilton Theorem: If Pn () = 0 is the nth -degree characteristic polynomial of an n n matrix A, then A satisfies
Pn (A) = 0.
That is, A satisfies its own characteristic polynomial. See Section 2.7 for an
example of the use of the Cayley-Hamilton theorem.
10. In some cases with repeated eigenvalues, there are fewer eigenvectors than
eigenvalues. For example, an eigenvalue with multiplicity two may only have
one corresponding regular eigenvector. When an n n matrix has fewer than
n linearly-independent eigenvectors, we say the matrix is defective. A procedure for obtaining generalized eigenvectors in such cases will be provided in
Section 2.4.2.
11. Computer algorithms for finding the eigenvalues and eigenvectors of large matrices are typically based on QR decomposition. Such algorithms are described
in Appendix C.
12. Some applications result in the so called generalized eigenproblem
Ax = Bx,
such that the regular eigenproblem (2.1) corresponds to B = I. Both types of
eigenproblems can be treated using very similar techniques. Do not confuse the
generalized eigenproblem with generalized eigenvectors; they are not related.

52

The Eigenproblem and Its Applications

2.2 Real Symmetric Matrices


In many applications, the square matrix A is real and symmetric, in which case
A = AT . This is the case in mechanics, for example, for the stress, strain, and
moment of inertia tensors (matrices). Let us consider some properties and applications of this special case.
2.2.1 Properties of Eigenvectors
Consider two distinct eigenvalues 1 and 2 of A, such that 1 6= 2 , and their
associated eigenvectors u1 and u2 , in which case
Au1 = 1 u1 ,

Au2 = 2 u2 .

Taking the transpose of the first eigenrelation


(Au1 )T = 1 uT1 ,
and recognizing that (Au1 )T = uT1 AT and postmultiplying by u2 gives
uT1 AT u2 = 1 uT1 u2 .

(2.3)

Premultiplying the second eigenrelation by uT1 gives


uT1 Au2 = 2 uT1 u2 .

(2.4)

Subtracting (2.3) from (2.4) with A = AT gives


(2 1 )uT1 u2 = 0,
or
(2 1 ) hu1 , u2 i = 0.
Because the eigenvalues are distinct, such that 2 6= 1 ,
hu1 , u2 i = 0.
Therefore, the two eigenvectors corresponding to two distinct eigenvalues of a real
symmetric matrix are orthogonal.
Remarks:
1. This is why the principal axes of a stress tensor are mutually orthogonal (see
example in Section 2.1).
2. It can also be shown that all of the eigenvalues of a real symmetric matrix are
real.2
3. If all eigenvalues of a real symmetric matrix are distinct, the eigenvectors
are all mutually orthogonal. In this case, we could normalize the eigenvectors
according to
ui
, i = 1, 2, . . . , n,
ei =
kui k
2

See Hildebrand, p. 32 for a proof.

2.2 Real Symmetric Matrices

53

in order to obtain an orthonormal basis for n-space (cf. Gram-Schmidt orthogonalization procedure).
4. If an eigenvalue is repeated s times, there are s corresponding eigenvectors
that are linearly independent, but not necessarily orthogonal (the remaining
ns eigenvectors are mutually orthogonal).3 The eigenvectors form a basis for
the n-dimensional vector space; one can orthogonalize using Gram-Schmidt if
desired.
5. Recall that a Hermitian matrix is such that

A11 A12 A1n


A12 A22 A2n
T

A = A = ..
..
.. ,
..
.
.
.
.
A1n A2n Ann
where the main diagonal terms must be real. As with real symmetric matrices, the eigenvalues of a Hermitian matrix are real, and the eigenvectors
corresponding to distinct eigenvalues are mutually orthogonal.
The following sections consider two applications involving real symmetric matrices. For an additional application to stability of numerical algorithms, see Section 15.2.

2.2.2 Linear Systems of Equations


In Chapter 1, we considered three methods for solving linear systems of algebraic equations Ax = c, namely 1) Gaussian elimination, 2) using the inverse
to evaluate x = A1 c, and 3) Cramers rule. Here, we obtain a particularly
straightforward and efficient method for solving such systems when A is real and
symmetric.
Consider the system
Ax = c,
where A is a real and symmetric n n matrix that is nonsingular with distinct
eigenvalues, c is known, and x is the solution vector to be found. Because x
is in n-space, it can be expressed as a linear combination of the orthonormal
eigenvectors, e1 , . . . , en , of A as follows
x = a1 e1 + a2 e2 + + an en .

(2.5)

The right-hand side vector c is also in n-space; consequently, we may write


c = b1 e1 + b2 e2 + + bn en .

(2.6)

Taking the inner product of e1 with c, which is equivalent to premultiplying (2.6)


3

See Hildebrand, Section 1.18 for a proof.

54

The Eigenproblem and Its Applications

by eT1 , gives
he1 , ci = eT1 c = b1 eT1 e1 +b2 eT1 e2 + + bn eT1 en ,
| {z }
| {z }
| {z }
1
0
0
but the eigenvectors are mutually orthogonal (and normalized) because the eigenvalues are distinct; therefore, he1 , ei i = 0 for i = 2, . . . , n, leaving
b1 = he1 , ci .
Generalizing, we have
bi = hei , ci ,

i = 1, 2, . . . , n,

(2.7)

which can all be evaluated to give the constants in equation (2.6). Substituting
(2.5) and (2.6) into Ax = c leads to
A (a1 e1 + a2 e2 + + an en ) = b1 e1 + b2 e2 + + bn en
a1 Ae1 + a2 Ae2 + + an Aen = b1 e1 + b2 e2 + + bn en
a1 1 e1 + a2 2 e2 + + an n en = b1 e1 + b2 e2 + + bn en .
Note that we could have used any linearly-independent basis vectors in (2.5) and
(2.6), but using the orthonormal eigenvectors as basis vectors allows us to perform
the last step above.
Because the ei s are linearly independent, each of their coefficients must be
equal according to
b2
bn
b1
,
a1 = , a2 = , , an =
1
2
n
or from (2.7)
ai =

hei , ci
,
i

i = 1, . . . , n.

Then from (2.5), the solution vector x for a system with a real symmetric coefficient matrix having distinct eigenvalues is
x=

n
X
hei , ci
i=1

ei .

Observe that once we have obtained the eigenvalues and eigenvectors of A, we


can easily determine the solution for various right-hand side vectors c.
Similarly, we can solve systems of the form
Ax x = c,

(2.8)

where is a parameter. Just as for Ax = c, the solution x and the right-hand


side vector c may be written in the form (2.5) and (2.6), respectively. Substituting
(2.5) and (2.6) into (2.8) gives
A (a1 e1 + + an en ) (a1 e1 + + an en ) = b1 e1 + + bn en .

2.2 Real Symmetric Matrices

55

If i , i = 1, . . . , n, are the eigenvalues of A, then the above equation becomes


a1 1 e1 + a2 2 e2 + + an n en a1 e1 a2 e2 an en
= b1 e1 + b2 e2 + + bn en .
Equating coefficients gives
a1 =

b1
b2
bn
, a2 =
, , an =
.
1
2
n

Therefore, from (2.7)


n
X
hei , ci
ei .
x=
i
i=1

Observe that if 6= i , i = 1, . . . , n, that is, the parameter is not equal to any


of the eigenvalues, then we obtain a unique solution. On the other hand, if = i
for any i, then there is no solution unless hei , ci = 0, such that c is orthogonal
to the eigenvector corresponding to i , in which case there are infinitely many
solutions.
Example 2.3 Let us once again consider the system of linear algebraic equations from the examples in Sections 1.4.3, 1.4.4, and 1.4.5, which is
2x1

x1

x3

= 3,

x2

+ 2x3 = 3,

+ 2x2

= 3.

Solution: In matrix form Ax = c,

2 0
0 1
1 2

the system is

1 x1
3
2 x2 = 3 ,
0 x3
3

where we observe that A is real and symmetric.


In order to apply the result of this section, it is necessary to obtain the eigenvalues and eigenvectors of the matrix A. The eigenvalues are the values of that
satisfy |A I| = 0, that is,


2
0
1

0
1 2 = 0.

1
2

Evaluating the determinant yields
(2 )(1 )() (1 ) 4(2 ) = 0,
where we note that there are no common ( i ) factors. Therefore, we simplify
to obtain the characteristic polynomial
3 32 3 + 9 = 0.

56

The Eigenproblem and Its Applications

Although there is an exact expression for determining the roots of a cubic polynomial (analogous to the quadratic formula), it is not widely known and not worth
memorizing. Instead, we can make use of mathematical software, such as Matlab
or Mathematica, to do the heavy lifting for us. Doing so results in the eigenvalues

1 = 3, 2 = 3, 3 = 3,
and the corresponding normalized eigenvectors


2 + 3
1
1
1
1 3 ,
e1 = 1 , e2 = q

3 1
6(2 3)
1

2 3
1
1 + 3 .
e3 = q

6(2 + 3)
1

Surprisingly enough, evaluating


x=

n
X
hei , ci
i=1

ei =

he1 , ci
he2 , ci
he3 , ci
e1 +
e2 +
e3
1
2
3

produces the correct solution



x1
1
x2 = 1 .
x3
1

This example illustrates that determining eigenvalues and eigenvectors is not


always neat and tidy even though A is only 3 3 and rather innocuous looking
in this case. Thankfully, the wide availability of tools such as Matlab and Mathematica allow us to tackle such problems without difficulty. See the Matrices
with Mathematica Demo for the details of this example along with examples of
many of the manipulations of vectors and matrices covered in Chapters 1 and 2.

2.2.3 Quadratic Forms


A second-degree expression in x1 , x2 , . . . , xn may be written
A

n X
n
X

Aij xi xj

i=1 j=1

(2.9)

A11 x21 + A22 x22 + + Ann x2n


+2A12 x1 x2 + 2A13 x1 x3 + + 2An1,n xn1 xn ,
where Aij and xi are real. This is known as the quadratic form, and the coefficients
are symmetric, such that Aij = Aji . Note that the 2 appears because of the
symmetry of Aij ; for example, there are both x1 x2 and x2 x1 terms that have the
same coefficient A12 = A21 . Observe that quadratics can be written in matrix
form as
A = xT Ax,

2.2 Real Symmetric Matrices

57

where A = [Aij ] is symmetric.


Some applications of quadratic forms include:
In two dimensions (x1 , x2 ): A = xT Ax represents a conic curve, for example,
ellipse, hyperbola, or parabola.
In three dimensions (x1 , x2 , x3 ): A = xT Ax represents a quadric surface, for
example, ellipsoid, hyperboloid, paraboloid, etc.
In mechanics, the moments of inertia, angular momentum of a rotating body,
and kinetic energy of a system of moving particles can be represented by a
quadratic form. In addition, the potential (strain) energy of a linear spring is
expressed as a quadratic.
The least-squares method of curve fitting leads to quadratics.
In electro-optics, the index ellipsoid of a crystal in the presence of an applied
electric field is given by a quadratic form (see Yariv and Yeh, Section 1.7).
Cost functions and functionals in optimization and control problems are often
expressed in terms of quadratic forms.4
What do quadratics, which are nonlinear, have to do with linear systems of
equations? We can convert the single quadratic equation (2.9) into a system of n
linear equations by differentiating the quadratic as follows
yi =

1
(xT Ax),
2 xi

i = 1, 2, . . . , n.

This leads to the system of linear equations


A11 x1 + A12 x2 + + A1n xn = y1
A12 x1 + A22 x2 + + A2n xn = y2
..
.

(2.10)

A1n x1 + A2n x2 + + Ann xn = yn


or
Ax = y,
where A is an n n real symmetric matrix with y defined in this way. Note then
that quadratics can be written in the equivalent forms
A = xT Ax = xT y = hx, yi .

Example 2.4
defined by

Determine the matrix A in the quadratic form for the ellipse


3x21 2x1 x2 + 3x22 = 8.

Solution: From equation (2.9), we have


A11 = 3,
4

(2.11)

A12 = 1,

A22 = 3.

See Sections 4.2.4, Chapter 21, and


Variational Methods with Applications in Science and Engineering, Sections 10.4 and 10.5.

58

The Eigenproblem and Its Applications

Thus, in matrix form



A11
A=
A12

 

A12
3 1
=
.
A22
1 3

Checking (2.11)
A

11

= xT A x
12 22 21

 

 3 1 x1
x
x
=
1
2
1 3
x2


 3x1 x2

= x1 x2
x1 + 3x2
= x1 (3x1 x2 ) + x2 (x1 + 3x2 )

= 3x21 2x1 x2 + 3x22 .

Setting A = 8 gives the particular ellipse under consideration.


Remarks:
1. If xT Ax > 0 for all x 6= 0, then the quadratic is positive definite; in other
words, it is definitely positive, and the real symmetric matrix A is said to
be positive definite. In this case,
The sign of the quadratic does not depend on x.
It occurs if and only if all eigenvalues of the symmetric matrix A are positive
(note that A is nonsingular in such a case).
See Appendix A for the Cholesky decomposition, which applies to positive
definite, real symmetric matrices.
2. See Appendix E for an application to optimization in which the function to
be minimized or maximized is a quadratic.
3. If there are no mixed terms in the quadratic, in which case it is of the form
A = A11 x21 + A22 x22 + + Ann x2n ,
the quadratic is said to be in canonical form.5 In such cases,
A is a diagonal matrix.
In two and three dimensions, we say that the conic curve or quadric surface
is in standard form. For example, the major and minor axes of an ellipse in
canonical form are aligned with the coordinate axes x1 and x2 .
In the context of mechanics, the x1 , x2 , and x3 directions in the canonical
form are called the principal axes, and the following diagonalization procedure gives rise to the parallel axes theorem, which allows one to transform
information from one coordinate system to another.
5

Canonical means standard.

2.2 Real Symmetric Matrices

59

Diagonalization of Symmetric Matrices with Distinct Eigenvalues


Given a general quadratic in terms of the coordinates xT = [x1 , x2 , . . . , xn ], we can
use a coordinate transformation, say x y, to put the quadratic in canonical
form, such that the quadratic is in canonical form relative to the coordinate
system y.
Suppose that the coordinate systems are related by the linear transformation
x = Q y,
nn

where the transformation matrix Q rotates and/or translates the coordinate system, such that the quadratic is in canonical form with respect to y. Substituting
into equation (2.11) gives
A = xT A x
T

= (Q y) A (Q y)
= yT (QT AQ)y
A = yT D y,
where D = QT A Q must be a diagonal matrix in order to produce the canonical
form of the quadratic with respect to y. We say that Q diagonalizes A, such that
premultiplying A by QT and postmultiplying by Q gives a matrix D that must
be diagonal. We call Q the modal matrix.
Procedure to find the modal matrix for real symmetric matrices with distinct
eigenvalues:
1. Determine the eigenvalues, 1 , 2 , . . . , n , of the n n real symmetric matrix
A.
2. Determine the orthonormal eigenvectors, e1 , e2 , . . . , en . Note that they are
already orthogonal owing to the distinct eigenvalues, and they simply need to
be normalized. Then
Ae1 = 1 e1 , Ae2 = 2 e2 , . . . , Aen = n en .

(2.12)

3. Construct the orthonormal modal matrix Q with columns given by the orthonormal eigenvectors e1 , e2 , . . . , en as follows:

..
..
..
.
.
.

e
e

e
Q =
2
n .
1
nn
..
..
..
.
.
.
Note: The order of the columns does not matter but corresponds to the order
in which the eigenvalues appear along the diagonal of D.

60

The Eigenproblem and Its Applications

To show that QT AQ diagonalizes A, premultiply Q by A and use (2.12)

..
..
..
.
.
.

e
e

e
AQ = A
2
n
1
..
..
..
.
.
.

..
..
..
..
..
..
.
. .
.
.
.

Ae
Ae

Ae

=
=
2
n
2 2
n en .
1
1 1
..
..
..
..
..
..
.
.
.
.
.
.

Then premultiply by QT

e1
e2

D = QT AQ =
..

.
en

.
..
..
.
.
.
.

1 e1 2 e2 n en

..
..
..
.
.
.

2 he1 , e2 i n he1 , en i
2 he2 , e2 i n he2 , en i

.
..
..
..

.
.
.
1 hen , e1 i 2 hen , e2 i n hen , en i

1 he1 , e1 i
1 he2 , e1 i

=
..

Because all vectors are of unit length and are mutually orthogonal, D is the
diagonal matrix

1 0 0
0 2 0

D = ..
.. . .
. .
.
. ..
.
0

Thus,
A = yT Dy = 1 y 21 + 2 y 22 + + n y 2n ,
which is in canonical form with respect to the coordinate system y.
Remarks:
1. Not only is D diagonal, we know what it is once we have the eigenvalues. That
is, it is not actually necessary to evaluate QT AQ.
2. The order in which the eigenvalues i appear in D corresponds to the order
in which the corresponding eigenvectors ei are placed in the modal matrix Q.
3. In the above case for quadratic forms, QT = Q1 , and we have an orthogonal
modal matrix.
4. The columns (and rows) of an orthogonal matrix form an orthonormal set of
vectors.

2.3 Normal Matrices

61

5. If A is symmetric, but with repeated eigenvalues, then Gram-Schmidt orthogonalization cannot be used to produce orthonormal eigenvectors from linearly
independent ones to form Q. This is because Gram-Schmidt orthogonalization
does not preserve eigenvectors.
6. Not only is transforming quadratic forms to canonical form an application
of diagonalization, it provides us with a geometric interpretation of what the
diagonalization procedure is designed to accomplish in other settings as well.
7. The diagonalization procedure introduced here is a special case (when Q is
orthogonal) of the general diagonalization procedure presented in Section 2.4
for symmetric A or non-symmetric A with distinct eigenvalues.
2.3 Normal Matrices
We focus our attention in Section 2.2 on real symmetric matrices because of
their prevalence in applications. The two primary results encountered for real
symmetric A are:
1. The eigenvectors of A corresponding to distinct eigenvalues are mutually orthogonal.
2. The matrix A is diagonalizable using the orthonormal modal matrix Q having
the orthonormal eigenvectors of A as its columns according to
D = QT AQ,
where D is a diagonal matrix containing the eigenvalues of A. This is referred
to as a similarity transformation because A and D have the same eigenvalues.6
One may wonder whether a real symmetric matrix is the most general matrix for
which these two results are true. It turns out that it is not.
The most general matrix for which they hold is a normal matrix. A normal
matrix is such that it commutes with its conjugate transpose so that
T A = AA
T,
A
or
AT A = AAT ,
if A is real.
Clearly symmetric and Hermitian matrices are normal. In addition, all orthogonal, skew-symmetric, unitary, and skew-Hermitian matrices are normal. However,
not all normal matrices are one of these forms. For example, the matrix

1 1 0
0 1 1
1 0 1
is normal, but it is not orthogonal, symmetric, or skew-symmetric.
6

Two matrices are said to be similar if they have the same eigenvalues.

62

The Eigenproblem and Its Applications

Example 2.5 Determine the requirements for a real 2 2 matrix A to be


normal.
Solution: For a real 2 2 matrix to be normal



A11 A21 A11
A12 A22 A21

AT A AAT = 0,
 


A12
A11 A12 A11 A21

= 0,
A22
A21 A22 A12 A22

 

A211 + A221
A11 A12 + A21 A22
A211 + A212
A11 A21 + A12 A22

= 0,
A11 A12 + A21 A22
A212 + A222
A11 A21 + A12 A22
A221 + A222


A221 A212
A11 A12 + A21 A22 A11 A21 A12 A22
= 0,
A11 A12 + A21 A22 A11 A21 A12 A22
A212 A221

(A12 A21 )



A12 A21 A11 A22
= 0.
A11 A22 A12 + A21

Thus, A is normal if and only if


A12 = A21 ,
in which case A is also symmetric, or
A21 = A12 ,

A11 = A22 ,
in which case



a b
A=
.
b a

Remarks:
1. The result from Section 2.2.1 applies for A normal. Specifically, the solution
to
Ax = c,
where A is an n n normal matrix, is
x=

n
X
hei , ci
i=1

ei ,

where i and ei , i = 1, . . . , n are the eigenvalues and eigenvectors of A, respectively.


2. Whether a matrix is normal or not has important consequences for stability of
systems governed by such matrices (see Section 19.6). Specifically, the stability
of systems governed by normal matrices is determined solely by the nature of
the eigenvalues of A. On the other hand, for systems governed by non-normal
matrices, the eigenvalues only determine the asymptotic stability of the system

2.4 Diagonalization

63

Table 2.1 Summary of the eigenvectors of a matrix based on the type of matrix and the nature
of the eigenvalues.

Symmetric A
Nonsymmetric A

Distinct Eigenvalues

Repeated Eigenvalues

Mutually Orthogonal Eigenvectors


Linearly-Independent Eigenvectors

Linearly-Independent Eigenvectors

as t , and the stability when t = O(1) may exhibit different behavior


known as transient growth.7
3. Given the rather special properties of normal matrices, one may wonder whether
the products AAT and AT A themselves have any special properties as well. It
turns out that for A of size mn, both products have the following properties:
They are always square; AAT is m m, and AT A is n n.
They are symmetric for any A.
rank(A) = rank(AAT ) = rank(AT A).
They both have the same nonzero eigenvalues. The additional eigenvalues
of the larger matrix are all zero.
5. The smaller of the two products is positive definite, that is, it has all positive
eigenvalues. The larger of the two is positive semi-definite (because of the
zero eigenvalues). If m = n, then both products are positive definite.
6. The eigenvectors are orthogonal.

1.
2.
3.
4.

We will encounter these products in polar decomposition (see Section 2.6.1),


singular-value decomposition (see Section 2.6.2), and least-squares methods
(see Section 4.3 and Chapter 9).
2.4 Diagonalization
In Section 2.2.3, we introduced the diagonalization procedure for real symmetric matrices having distinct eigenvalues. Here, we generalize this procedure for
symmetric matrices with repeated eigenvalues and nonsymmetric matrices. Diagonalization consists of performing a similarity transformation, in which a new
matrix D is produced that has the same eigenvalues as the original matrix A,
such that A and D are similar. The new matrix is diagonal with the eigenvalues of A residing on its diagonal. The outcome of the diagonalization process is
determined by the nature of the eigenvectors of A as summarized in Table 2.1.
The case considered in Section 2.2.3 is such that matrix A is real and symmetric
with distinct eigenvalues, in which case the eigenvectors are mutually orthogonal,
and the modal matrix is orthogonal (this also applies to other normal matrices as
described in Section 2.3). For a symmetric matrix with repeated eigenvalues or
a nonsymmetric matrix with distinct eigenvalues, the eigenvectors are mutually
linearly independent, but not necessarily orthogonal. As we will see, all three of
these cases allow for the matrix to be fully diagonalized. For a nonsymmetric
7

See Variational Methods with Applications in Science and Engineering, Section 6.5.

64

The Eigenproblem and Its Applications

matrix with repeated eigenvalues, however, the eigenvectors are not linearly independent, in general. Therefore, it cannot be fully diagonalized; however, the
same basic procedure will produce the so-called Jordan canonical form, which is
nearly diagonalized.
2.4.1 Matrices with Linearly-Independent Eigenvectors
The diagonalization procedure described in Section 2.2.3 for real symmetric matrices with distinct eigenvalues is a special case of the general procedure given
here. If the eigenvectors are linearly independent (including those that are also
orthogonal), then the modal matrix8 P, whose columns are the eigenvectors of
A, produce the similarity transformation

1 0 0
0 2 0

D = P1 AP = ..
.. . .
.. ,
.
. .
.
0 0 n
where D is diagonal with the eigenvalues along the diagonal as shown.
Remarks:
1. To prove that A and D are similar, consider the following:
|D I| = |P1 AP I|
= |P1 (A I) P|
= |P| |A I| |P1 |
= |P| |A I|

1
|P|

|D I| = |A I| ;
therefore, D and A have the same eigenvalues.
2. The general diagonalization procedure requires premultiplying by the inverse
of the modal matrix. When the modal matrix is orthogonal, as for symmetric
A with distinct eigenvalues, then P1 = PT = QT . Note that whereas it is
necessary for the orthonormal modal matrix Q to be formed from the normalized eigenvectors, it is not necessary to normalize the eigenvectors when
forming P; they simply need to be linearly independent.
3. An n n matrix A can be diagonalized if there are n regular eigenvectors
that are linearly independent [see ONeil (2012), for example, for a proof].
Consequently, all symmetric matrices can be diagonalized, and if the matrix
is not symmetric, the eigenvalues must be distinct.
4. It is not necessary to evaluate P1 AP, because we know the result D if the
eigenvectors are linearly independent.
8

We reserve use of Q to indicate orthonormal modal matrices.

2.4 Diagonalization

65

5. The term modal matrix arises from its application in dynamical systems in
which the diagonalization, or decoupling, procedure leads to isolation of the
natural modes of vibration of the system, which correspond to the eigenvalues.
The general motion of the system is then a superposition (linear combination)
of these modes (see Section 19.1).
2.4.2 Jordan Canonical Form
If an n n matrix A has fewer than n regular eigenvectors, as may be the case
for a nonsymmetric matrix with repeated eigenvalues, it is called defective and
additional generalized eigenvectors may be obtained such that they are linearly
independent with the regular eigenvectors in order to form the modal matrix. In
this case, P1 AP results in the Jordan canonical form, which is not completely
diagonalized.
For example, if A has two repeated eigenvalues 1 (multiplicity two) and three
repeated eigenvalues 2 (multiplicity three), then the Jordan canonical form is

1 a1 0 0 0
0 1 0 0 0

1
,
0
0

a
0
J = P AP =
2
2

0 0 0 2 a3
0 0 0 0 2
where a1 , a2 , and a3 above the repeated eigenvalues are 0 or 1.9 Unfortunately,
the only way to find the Jordan canonical matrix J is to actually evaluate P1 AP
just to determine the elements above the repeated eigenvalues.
In order to form the modal matrix P when there are less than n regular eigenvectors, we need a procedure for obtaining the necessary generalized eigenvectors
from the regular ones. Recall that regular eigenvectors ui satisfy the eigenproblem
(A i I) ui = 0.
If for a given eigenvalue, say 1 , with multiplicity k, we only obtain one regular eigenvector, then we need to obtain k 1 generalized eigenvectors. These
generalized eigenvectors satisfy the sequence of eigenproblems
m

(A 1 I) um = 0,

m = 2, 3, . . . , k.

Note that the regular eigenvector u1 results from the case with m = 1.
Rather than taking successive integer powers of (A 1 I) and obtaining the
corresponding eigenvectors, observe the following. If
(A 1 I) u1 = 0

(2.13)

produces the regular eigenvector u1 corresponding to the repeated eigenvalue 1 ,


then the generalized eigenvector u2 is obtained from
2

(A 1 I) u2 = 0.
9

See Hildebrand, p. 77.

(2.14)

66

The Eigenproblem and Its Applications

However, this is equivalent to writing the matrix problem


(A 1 I) u2 = u1 .
To see this, multiply both sides by (A 1 I) to produce
2

(A 1 I) u2 = (A 1 I) u1 ,
but from equation (2.13), the right-hand side is zero and we have equation (2.14).
Thus, the generalized eigenvectors can be obtained by successively solving the
systems of equations
(A 1 I) um = um1 ,

m = 2, 3, . . . , k,

(2.15)

starting with the regular eigenvector u1 on the right-hand side with m = 2. In


this manner, the previous regular or generalized eigenvector um1 becomes the
known right-hand side in the system of equations to be solved for the next generalized eigenvector um .
Remarks:
1. Because (A 1 I) is singular, we do not obtain a unique solution for each
successive generalized eigenvector. This is consistent with the fact that the
eigenvectors always include an arbitrary constant multiplier.
2. The resulting regular and generalized eigenvectors form a mutually linearlyindependent set of n eigenvectors.

Example 2.6 Determine the regular and generalized eigenvectors for the nonsymmetric matrix

2 1 2 0
0 3 1 0
.
A=
0 1
1 0
0 1 3 5
Then obtain the generalized modal matrix that reduces A to the Jordan canonical
form, and determine the Jordan canonical form.
Solution: The eigenvalues are
1 = 2 = 3 = 2,

4 = 5.

Corresponding to = 2 we obtain the single eigenvector



1
0

u1 =
0 ,
0

2.4 Diagonalization

67

and corresponding to = 5 we obtain the single eigenvector



0
0

u4 =
0 .
1
Therefore, we only have two regular eigenvectors.
In order to determine the two generalized eigenvectors, u2 and u3 , corresponding to = 2, we solve the system of equations
(A 1 I) u2 = u1 ,

(2.16)

to obtain u2 from the known u1 , and then we solve


(A 1 I) u3 = u2 ,

(2.17)

to obtain u3 from u2 . For example, using Gaussian elimination to solve the two
systems of equations in succession, we obtain

c1
c2
1
c1 + 2

u2 =
1 , u3 = c1 + 1 ,
2/3
2c1 /3 + 5/9
where c1 and c2 are arbitrary and arise because (2.16) and (2.17) do not have
unique solutions owing to the fact that |A 1 I| = 0. Choosing c1 = c2 = 0, we
have the generalized modal matrix

1 0
0 0
0 1
2 0
.
P=
0 1
1 0
0 2/3 5/9 1
This procedure produces four linearly-independent regular and generalized eigenvectors. Pseudo-diagonalizing then gives the Jordan canonical form

2 1 0 0
0 2 1 0

J = P1 AP =
0 0 2 0 .
0 0 0 5
Note that this requires that we actually invert P and evaluate P1 AP, unlike
cases for which D = P1 AP is a diagonal matrix. Observe that the eigenvalues
are on the diagonal with 0 or 1 above the repeated eigenvalues as expected.
Note that the generalized eigenvector(s), and the regular eigenvector(s) from
which it is obtained, must be placed in the order in which they are obtained to
form the modal matrix P.

68

The Eigenproblem and Its Applications

Problem Set # 3

2.5 Systems of Ordinary Differential Equations


One of the most important uses of the diagonalization procedure outlined in the
previous sections is in solving systems of first-order linear ordinary differential
equations.

2.5.1 General Approach for First-Order Systems


Recall that the general solution to the first-order linear ordinary differential equation
dx
x(t)

=
= ax(t),
dt
where a is a constant, is
x(t) = ceat ,
where c is a constant of integration. A dot denotes differentiation with respect to
the independent variable t.10
Now consider a system of n coupled first-order linear ordinary differential equations
x 1 (t) = A11 x1 (t) + A12 x2 (t) + + A1n xn (t) + f1 (t)
x 2 (t) = A21 x1 (t) + A22 x2 (t) + + A2n xn (t) + f2 (t)
,

..
.
x n (t) = An1 x1 (t) + An2 x2 (t) + + Ann xn (t) + fn (t)

where the Aij coefficients and fi (t) functions are known, and the functions x1 (t),
x2 (t), . . ., xn (t) are to be determined. If fi (t) = 0, i = 1, . . . , n, the system is
homogeneous. This system may be written in matrix form as

x(t)
= Ax(t) + f (t).

(2.18)

In order to solve this coupled system, we transform the solution vector $\mathbf{x}(t)$ to a new vector of dependent variables $\mathbf{y}(t)$ for which the equations are easily solved. In particular, we diagonalize the coefficient matrix $\mathbf{A}$ such that with respect to the new coordinates $\mathbf{y}$, the system is uncoupled. This is accomplished using
\[
\mathbf{x}(t) = \mathbf{P}\mathbf{y}(t), \qquad (2.19)
\]
$^{10}$ We follow the convention of indicating ordinary differentiation with respect to time using dots and primes otherwise, for example, with respect to $x$.

where the modal matrix is
\[
\mathbf{P} = \begin{bmatrix} \vdots & \vdots & & \vdots \\ \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_n \\ \vdots & \vdots & & \vdots \end{bmatrix}.
\]
The $\mathbf{u}_i$'s are linearly-independent eigenvectors of $\mathbf{A}$. Remember that there is no need to normalize the eigenvectors when forming the modal matrix $\mathbf{P}$. The elements of $\mathbf{P}$ are constants; therefore, differentiating equation (2.19) gives
\[
\dot{\mathbf{x}}(t) = \mathbf{P}\dot{\mathbf{y}}(t), \qquad (2.20)
\]
and substituting equations (2.19) and (2.20) into (2.18) leads to
\[
\mathbf{P}\dot{\mathbf{y}}(t) = \mathbf{A}\mathbf{P}\mathbf{y}(t) + \mathbf{f}(t).
\]
Premultiplying by $\mathbf{P}^{-1}$ gives
\[
\dot{\mathbf{y}}(t) = \mathbf{P}^{-1}\mathbf{A}\mathbf{P}\mathbf{y}(t) + \mathbf{P}^{-1}\mathbf{f}(t).
\]
Recall that unless $\mathbf{A}$ is nonsymmetric with repeated eigenvalues, $\mathbf{D} = \mathbf{P}^{-1}\mathbf{A}\mathbf{P}$ is a diagonal matrix with the eigenvalues of $\mathbf{A}$ on the main diagonal. For example, for a homogeneous system, having $\mathbf{f}(t) = \mathbf{0}$, we have
\[
\begin{bmatrix} \dot{y}_1(t) \\ \dot{y}_2(t) \\ \vdots \\ \dot{y}_n(t) \end{bmatrix} =
\begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}
\begin{bmatrix} y_1(t) \\ y_2(t) \\ \vdots \\ y_n(t) \end{bmatrix},
\]
or
\[
\begin{aligned}
\dot{y}_1(t) &= \lambda_1 y_1(t),\\
\dot{y}_2(t) &= \lambda_2 y_2(t),\\
&\;\;\vdots\\
\dot{y}_n(t) &= \lambda_n y_n(t).
\end{aligned}
\]

Therefore, the differential equations have been uncoupled, and the solutions in terms of the transformed variables are
\[
y_1(t) = c_1 e^{\lambda_1 t}, \quad y_2(t) = c_2 e^{\lambda_2 t}, \quad \ldots, \quad y_n(t) = c_n e^{\lambda_n t},
\]
or in vector form
\[
\mathbf{y}(t) = \begin{bmatrix} c_1 e^{\lambda_1 t} \\ c_2 e^{\lambda_2 t} \\ \vdots \\ c_n e^{\lambda_n t} \end{bmatrix}.
\]

Now transform back to determine the solution in terms of the original variable
x(t) using
x(t) = Py(t).
The constants of integration, ci , i = 1, . . . , n, are determined using the initial
conditions for the differential equations.
Remarks:
1. For this procedure to fully uncouple the equations requires linearly-independent eigenvectors; therefore, it works if $\mathbf{A}$ is symmetric or if $\mathbf{A}$ is nonsymmetric with distinct eigenvalues.
2. If $\mathbf{A}$ is nonsymmetric with repeated eigenvalues, that is, the eigenvectors are not all linearly independent and a Jordan canonical matrix results from the diagonalization procedure, then the system of equations in $\mathbf{y}$ can often still be solved, although the equations are not completely uncoupled.
3. Note that if $\mathbf{A}$ is symmetric with distinct eigenvalues, then $\mathbf{P}^{-1} = \mathbf{P}^{T}$ (if $\mathbf{P}$ is comprised of orthonormal eigenvectors), and $\mathbf{P} = \mathbf{Q}$ is an orthonormal matrix.
4. If the system is not homogeneous, determine the homogeneous and particular solutions of the uncoupled equations and sum them to obtain the general solution.
5. Although we can write down the uncoupled solution in terms of $\mathbf{y}(t)$ having only the eigenvalues, we need the modal matrix $\mathbf{P}$ to transform back to the original variable $\mathbf{x}(t)$.
6. Alternatively, reconsider the formulation for homogeneous systems
\[
\dot{\mathbf{x}}(t) = \mathbf{A}\mathbf{x}(t). \qquad (2.21)
\]
For first-order, linear ordinary differential equations with constant coefficients, we expect solutions of the form
\[
\mathbf{x}_i(t) = \mathbf{u}_i e^{\lambda_i t}. \qquad (2.22)
\]
Then
\[
\dot{\mathbf{x}}_i(t) = \lambda_i\mathbf{u}_i e^{\lambda_i t},
\]
and substituting into equation (2.21) leads to
\[
\lambda_i\mathbf{u}_i e^{\lambda_i t} = \mathbf{A}\mathbf{u}_i e^{\lambda_i t};
\]
therefore,
\[
\mathbf{A}\mathbf{u}_i = \lambda_i\mathbf{u}_i.
\]
Observe that we have an eigenproblem for the coefficient matrix $\mathbf{A}$, with the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ and eigenvectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$. After obtaining the eigenvalues and eigenvectors, the solutions (2.22) are
\[
\mathbf{x}_1(t) = \mathbf{u}_1 e^{\lambda_1 t}, \quad \mathbf{x}_2(t) = \mathbf{u}_2 e^{\lambda_2 t}, \quad \ldots, \quad \mathbf{x}_n(t) = \mathbf{u}_n e^{\lambda_n t},
\]
and the general solution is
\[
\mathbf{x}(t) = c_1\mathbf{u}_1 e^{\lambda_1 t} + c_2\mathbf{u}_2 e^{\lambda_2 t} + \cdots + c_n\mathbf{u}_n e^{\lambda_n t} \;(= \mathbf{P}\mathbf{y}).
\]
A brief computational sketch of this eigenvalue-based solution is given below.
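The following Python sketch (illustrative only; the coefficient matrix and initial condition are arbitrary examples and not from the text) assembles the general solution $\mathbf{x}(t) = \sum_i c_i\mathbf{u}_i e^{\lambda_i t}$ for a homogeneous system using NumPy's eigen-decomposition.

```python
import numpy as np

# Arbitrary symmetric coefficient matrix and initial condition (for illustration only)
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
x0 = np.array([1.0, 2.0])

lam, P = np.linalg.eig(A)          # eigenvalues and modal matrix (columns are eigenvectors)
c = np.linalg.solve(P, x0)         # constants from x(0) = P c

def x(t):
    """General solution x(t) = sum_i c_i u_i exp(lambda_i t)."""
    return (P * (c * np.exp(lam * t))).sum(axis=1)

# Check that the solution satisfies dx/dt = A x at a sample time
t, dt = 0.7, 1e-6
residual = (x(t + dt) - x(t - dt)) / (2 * dt) - A @ x(t)
print(np.max(np.abs(residual)))    # should be near zero
```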

Following are a series of examples of solving systems of first-order differential equations using diagonalization. Each one introduces the approach to handling an additional aspect of these problems. We begin with a homogeneous system of first-order equations.
Example 2.7 Solve the homogeneous system of first-order equations
\[
\begin{aligned}
\frac{dx_1}{dt} &= x_2 + x_3,\\
\frac{dx_2}{dt} &= x_1 + x_3,\\
\frac{dx_3}{dt} &= x_1 + x_2.
\end{aligned}
\]
Solution: We write this system in matrix form, $\dot{\mathbf{x}} = \mathbf{A}\mathbf{x}$, or
\[
\begin{bmatrix} \dot{x}_1(t) \\ \dot{x}_2(t) \\ \dot{x}_3(t) \end{bmatrix} =
\begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}
\begin{bmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \end{bmatrix}.
\]

Note that the coefficient matrix is symmetric; therefore, the eigenvectors are linearly independent (even if the eigenvalues are not distinct). The characteristic equation for the coefficient matrix is
\[
\lambda^3 - 3\lambda - 2 = 0,
\]
which gives the eigenvalues
\[
\lambda_1 = -1, \quad \lambda_2 = -1, \quad \lambda_3 = 2.
\]
Consider the repeated eigenvalue $\lambda_1 = \lambda_2 = -1$. Substituting into $(\mathbf{A} - \lambda\mathbf{I})\mathbf{x} = \mathbf{0}$ gives the single equation
\[
x_1 + x_2 + x_3 = 0.
\]
Let $x_1 = d_1$, $x_2 = d_2$; then $x_3 = -d_1 - d_2$, and we have
\[
\mathbf{u}_{1,2} = \begin{bmatrix} d_1 \\ d_2 \\ -d_1 - d_2 \end{bmatrix}.
\]
To obtain $\mathbf{u}_1$, we choose $d_1 = 1$, $d_2 = 0$, resulting in
\[
\mathbf{u}_1 = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix},
\]
and for $\mathbf{u}_2$ we choose $d_1 = -1$, $d_2 = 1$ to give
\[
\mathbf{u}_2 = \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}.
\]

Be sure to check and confirm that $\mathbf{u}_1$ and $\mathbf{u}_2$ are linearly independent. Substituting $\lambda_3 = 2$ gives the two equations (after some row reduction)
\[
-2x_1 + x_2 + x_3 = 0, \qquad -x_2 + x_3 = 0.
\]
If we let $x_3 = d_1$, then the final eigenvector is
\[
\mathbf{u}_3 = d_1\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}.
\]
Now we form the modal matrix
\[
\mathbf{P} = \begin{bmatrix} 1 & -1 & 1 \\ 0 & 1 & 1 \\ -1 & 0 & 1 \end{bmatrix}.
\]

Transforming to a new variable, $\mathbf{x}(t) = \mathbf{P}\mathbf{y}(t)$, the solution in terms of $\mathbf{y}(t)$ is
\[
y_1(t) = c_1 e^{-t}, \quad y_2(t) = c_2 e^{-t}, \quad y_3(t) = c_3 e^{2t}.
\]
Transforming this set of solutions back to the original variable $\mathbf{x}$ gives
\[
\begin{bmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \end{bmatrix} =
\begin{bmatrix} 1 & -1 & 1 \\ 0 & 1 & 1 \\ -1 & 0 & 1 \end{bmatrix}
\begin{bmatrix} c_1 e^{-t} \\ c_2 e^{-t} \\ c_3 e^{2t} \end{bmatrix}.
\]
Thus, the general solution to the system of first-order differential equations is
\[
\begin{aligned}
x_1(t) &= c_1 e^{-t} - c_2 e^{-t} + c_3 e^{2t},\\
x_2(t) &= c_2 e^{-t} + c_3 e^{2t},\\
x_3(t) &= -c_1 e^{-t} + c_3 e^{2t}.
\end{aligned}
\]
The coefficients $c_1$, $c_2$, and $c_3$ would be determined using the three required initial conditions. For example, the solution with initial conditions $x_1(0) = 1$, $x_2(0) = 0$, and $x_3(0) = 2$ is given in Figure 2.4. A brief numerical check of this example is given below.
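As a quick check (an illustrative sketch, not from the original text), the following Python snippet verifies the eigenvalues of the coefficient matrix in Example 2.7 and confirms that the modal matrix built from the hand-computed eigenvectors diagonalizes it.

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)

# Eigenvalues should be -1, -1, 2
print(np.sort(np.linalg.eigvalsh(A)))

# Modal matrix assembled from the eigenvectors found by hand (columns u1, u2, u3)
P = np.array([[ 1, -1, 1],
              [ 0,  1, 1],
              [-1,  0, 1]], dtype=float)

D = np.linalg.inv(P) @ A @ P
print(np.round(D, 12))             # expect diag(-1, -1, 2)
```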
In the next example, the system of equations that results from applying the relevant physical principle is not of the usual form.
Example 2.8 Obtain the differential equations governing the parallel electrical circuit shown in Figure 2.5.
Solution: In circuit analysis, we define the current $I$, resistance $R$, voltage $V$, inductance $L$, and capacitance $C$, which are related as follows. Ohm's law gives the voltage drop across a resistor as
\[
V = IR,
\]

Figure 2.4 Solution to system of ordinary differential equations in Example 2.7.

Figure 2.5 Schematic of the parallel electrical circuit in Example 2.8.

and the voltage drop across an inductor is
\[
V = L\frac{dI}{dt},
\]
and a capacitor is such that
\[
I = C\frac{dV}{dt},
\]
in which case
\[
V = \frac{1}{C}\int I\,dt.
\]

Parallel circuits must obey Kirchhoff's laws, which require that
1. $\sum I_{\text{in}} = \sum I_{\text{out}}$ at each junction,
2. $\sum V = 0$ around each closed circuit.

Applying Kirchhoff's second law around each loop:
Loop 1: $V_L + V_{R_1} = 0$, or
\[
L\frac{dI_1}{dt} + (I_1 - I_2)R_1 = 0.
\]
Loop 2: $V_C + V_{R_2} + V_{R_1} = 0$, or
\[
\frac{1}{C}\int I_2\,dt + I_2 R_2 + (I_2 - I_1)R_1 = 0.
\]

Differentiating the second equation with respect to $t$ (to remove the integral) leads to
\[
I_2 + CR_2\frac{dI_2}{dt} + CR_1\left(\frac{dI_2}{dt} - \frac{dI_1}{dt}\right) = 0.
\]
Therefore, we have the system of ordinary differential equations for the currents $I_1(t)$ and $I_2(t)$ given by
\[
\begin{aligned}
L\frac{dI_1}{dt} &= -R_1 I_1 + R_1 I_2,\\
-CR_1\frac{dI_1}{dt} + C(R_1 + R_2)\frac{dI_2}{dt} &= -I_2,
\end{aligned}
\]

or in matrix form
\[
\underbrace{\begin{bmatrix} L & 0 \\ -CR_1 & C(R_1 + R_2) \end{bmatrix}}_{\mathbf{A}_1}
\begin{bmatrix} \dot{I}_1(t) \\ \dot{I}_2(t) \end{bmatrix} =
\underbrace{\begin{bmatrix} -R_1 & R_1 \\ 0 & -1 \end{bmatrix}}_{\mathbf{A}_2}
\begin{bmatrix} I_1(t) \\ I_2(t) \end{bmatrix},
\]

or
\[
\mathbf{A}_1\dot{\mathbf{x}}(t) = \mathbf{A}_2\mathbf{x}(t).
\]
To obtain the usual form ($\dot{\mathbf{x}} = \mathbf{A}\mathbf{x}$), premultiply both sides by $\mathbf{A}_1^{-1}$ to give
\[
\dot{\mathbf{x}}(t) = \mathbf{A}_1^{-1}\mathbf{A}_2\mathbf{x}(t).
\]
Therefore,
\[
\mathbf{A} = \mathbf{A}_1^{-1}\mathbf{A}_2.
\]
We would then diagonalize $\mathbf{A}$ as usual in order to solve for $I_1(t)$ and $I_2(t)$.
A sample solution with $R_1 = 10$, $R_2 = 3$, $L = 30$, $C = 5$ and initial conditions $I_1(0) = 10$ and $I_2(0) = 5$ is shown in Figure 2.6. Note that because there is no voltage source, the current decays with time via a damped oscillation. A brief numerical sketch of this solution process is given below.
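The following Python sketch (an illustrative check using the sample parameter values quoted above; not part of the original text) forms $\mathbf{A} = \mathbf{A}_1^{-1}\mathbf{A}_2$ and integrates the circuit equations numerically with SciPy.

```python
import numpy as np
from scipy.integrate import solve_ivp

R1, R2, L, C = 10.0, 3.0, 30.0, 5.0

A1 = np.array([[L, 0.0],
               [-C * R1, C * (R1 + R2)]])
A2 = np.array([[-R1, R1],
               [0.0, -1.0]])
A = np.linalg.solve(A1, A2)        # A = A1^(-1) A2 without forming the inverse explicitly

# Complex eigenvalues with negative real parts indicate a damped, decaying oscillation
print(np.linalg.eigvals(A))

sol = solve_ivp(lambda t, x: A @ x, (0.0, 140.0), [10.0, 5.0], dense_output=True)
print(sol.y[:, -1])                # currents decay toward zero
```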
Next we consider an example consisting of a nonhomogeneous system of equations having $\mathbf{f}(t) \neq \mathbf{0}$.

Figure 2.6 Sample solution for Example 2.8.

Example 2.9 (Jeffrey 6.11 # 8) Solve the system of equations
\[
\begin{aligned}
\dot{x}_1(t) &= 2x_2 + \sin t,\\
\dot{x}_2(t) &= 2x_1 + t,
\end{aligned}
\]
or
\[
\dot{\mathbf{x}}(t) = \mathbf{A}\mathbf{x}(t) + \mathbf{f}(t),
\]
where
\[
\mathbf{A} = \begin{bmatrix} 0 & 2 \\ 2 & 0 \end{bmatrix}, \qquad
\mathbf{f}(t) = \begin{bmatrix} \sin t \\ t \end{bmatrix}.
\]

Solution: The eigenvalues and eigenvectors of $\mathbf{A}$ are
\[
\lambda_1 = 2, \quad \mathbf{u}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}; \qquad
\lambda_2 = -2, \quad \mathbf{u}_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}.
\]
Note that $\mathbf{A}$ is symmetric with distinct eigenvalues; therefore, the eigenvectors are mutually orthogonal.
Forming the modal matrix and taking its inverse, we have
\[
\mathbf{P} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad
\mathbf{P}^{-1} = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.
\]
We transform to the new variable $\mathbf{y}(t)$ using
\[
\mathbf{x}(t) = \mathbf{P}\mathbf{y}(t),
\]

with respect to which our system of equations becomes
\[
\dot{\mathbf{y}}(t) = \mathbf{P}^{-1}\mathbf{A}\mathbf{P}\mathbf{y}(t) + \mathbf{P}^{-1}\mathbf{f}(t).
\]
Recalling that $\mathbf{D} = \mathbf{P}^{-1}\mathbf{A}\mathbf{P}$ has the eigenvalues along its main diagonal and evaluating $\mathbf{P}^{-1}\mathbf{f}(t)$, this becomes
\[
\begin{bmatrix} \dot{y}_1(t) \\ \dot{y}_2(t) \end{bmatrix} =
\begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix}
\begin{bmatrix} y_1(t) \\ y_2(t) \end{bmatrix} +
\frac{1}{2}\begin{bmatrix} \sin t + t \\ \sin t - t \end{bmatrix};
\]
therefore, we have the two uncoupled equations
\[
\begin{aligned}
\dot{y}_1(t) &= 2y_1(t) + \tfrac{1}{2}\sin t + \tfrac{1}{2}t,\\
\dot{y}_2(t) &= -2y_2(t) + \tfrac{1}{2}\sin t - \tfrac{1}{2}t.
\end{aligned}
\]

Because the equations are nonhomogeneous, the solutions are of the form
\[
y_1(t) = y_{1c}(t) + y_{1p}(t), \qquad y_2(t) = y_{2c}(t) + y_{2p}(t),
\]
where $c$ represents the complementary, or homogeneous, solution, and $p$ represents the particular solution. The complementary solutions are
\[
y_{1c}(t) = c_1 e^{\lambda_1 t} = c_1 e^{2t}, \qquad y_{2c}(t) = c_2 e^{\lambda_2 t} = c_2 e^{-2t}.
\]

We determine the particular solutions using the method of undetermined coefficients,$^{11}$ which works for right-hand sides involving polynomials, exponential functions, and trigonometric functions.
Consider the equation for $y_1(t)$, for which the trial particular solution and its derivative are$^{12}$
\[
y_{1p}(t) = A\sin t + B\cos t + Ct + D, \qquad y_{1p}'(t) = A\cos t - B\sin t + C.
\]
Substituting yields
\[
A\cos t - B\sin t + C = 2(A\sin t + B\cos t + Ct + D) + \tfrac{1}{2}\sin t + \tfrac{1}{2}t.
\]

Equating like terms leads to four equations for the four unknown constants, as follows.
$^{11}$ See, for example, Jeffrey 6.4, Table 6.2, or Kreyszig 2.9.
$^{12}$ Be sure that $y_p(t)$ is not part of $y_c(t)$, for example, $y_p(t) = Ae^{2t} + \ldots$. If it is, include a factor of $t$ in the particular solution, for example, $y_p = Ate^{2t} + \ldots$ (see Jeffrey, Example 6.13, or Kreyszig, Section 2.7).

\[
\begin{aligned}
\cos t:&\quad A = 2B,\\
\sin t:&\quad -B = 2A + \tfrac{1}{2},\\
t:&\quad 0 = 2C + \tfrac{1}{2},\\
\text{const}:&\quad C = 2D,
\end{aligned}
\qquad\Longrightarrow\qquad
\begin{aligned}
-B &= 4B + \tfrac{1}{2} \;\Rightarrow\; B = -\tfrac{1}{10}, \quad A = -\tfrac{1}{5},\\
C &= -\tfrac{1}{4}, \qquad D = -\tfrac{1}{8}.
\end{aligned}
\]

Thus, the particular solution is
\[
y_{1p}(t) = -\frac{1}{5}\sin t - \frac{1}{10}\cos t - \frac{1}{4}t - \frac{1}{8},
\]
and the general solution is
\[
y_1(t) = y_{1c}(t) + y_{1p}(t) = c_1 e^{2t} - \frac{1}{5}\sin t - \frac{1}{10}\cos t - \frac{1}{4}t - \frac{1}{8}.
\]

Similarly, considering the equation for $y_2(t)$ leads to
\[
y_{2p}(t) = \frac{1}{5}\sin t - \frac{1}{10}\cos t - \frac{1}{4}t + \frac{1}{8},
\]
\[
y_2(t) = y_{2c}(t) + y_{2p}(t) = c_2 e^{-2t} + \frac{1}{5}\sin t - \frac{1}{10}\cos t - \frac{1}{4}t + \frac{1}{8}.
\]

To obtain the solution in terms of the original variable $\mathbf{x}(t)$, evaluate
\[
\mathbf{x}(t) = \mathbf{P}\mathbf{y}(t),
\]
or
\[
\begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix} =
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} y_1(t) \\ y_2(t) \end{bmatrix} =
\begin{bmatrix} y_1 + y_2 \\ y_1 - y_2 \end{bmatrix}.
\]

Thus, the general solution is
\[
\begin{aligned}
x_1(t) &= c_1 e^{2t} + c_2 e^{-2t} - \frac{1}{5}\cos t - \frac{1}{2}t,\\
x_2(t) &= c_1 e^{2t} - c_2 e^{-2t} - \frac{2}{5}\sin t - \frac{1}{4}.
\end{aligned}
\]
The integration constants $c_1$ and $c_2$ would be obtained using initial conditions if given.
One other eventuality must be addressed, that is, when the eigenvalues and/or eigenvectors are complex. An example of this is given in the next section.

2.5.2 Higher-Order Equations

We can extend the diagonalization approach applied above to systems of first-order, linear differential equations to any linear ordinary differential equation as follows. An $n$th-order linear differential equation
\[
x^{(n)} = F\!\left(t, x, \dot{x}, \ldots, x^{(n-1)}\right)
\]
can be converted to a system of $n$ first-order differential equations by the substitutions
\[
x_1(t) = x(t), \quad x_2(t) = \dot{x}(t), \quad x_3(t) = \ddot{x}(t), \quad \ldots, \quad x_{n-1}(t) = x^{(n-2)}(t), \quad x_n(t) = x^{(n-1)}(t).
\]
Differentiating the above substitutions results in a system of first-order equations
\[
\begin{aligned}
\dot{x}_1(t) &= \dot{x}(t) = x_2(t),\\
\dot{x}_2(t) &= \ddot{x}(t) = x_3(t),\\
\dot{x}_3(t) &= \dddot{x}(t) = x_4(t),\\
&\;\;\vdots\\
\dot{x}_{n-1}(t) &= x^{(n-1)}(t) = x_n(t),\\
\dot{x}_n(t) &= x^{(n)}(t) = F(t, x_1, x_2, \ldots, x_n),
\end{aligned}
\]
with the last equation following from the original differential equation.
Remarks:
1. This approach can be used to convert any system of higher-order linear differential equations to a system of first-order linear equations. For example, three coupled second-order equations could be converted to six first-order equations.
2. For dynamical systems considered in the following section, this first-order form is called the state-space representation. A brief computational sketch of this conversion is given below.
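For illustration (a sketch not in the original text), the following Python snippet converts the undamped-oscillator equation of the next example, $d^2x/dt^2 + x = 0$, to state-space form and integrates it numerically; the closed-form solution derived in Example 2.10 serves as a check.

```python
import numpy as np
from scipy.integrate import solve_ivp

# State vector: x1 = x, x2 = dx/dt, so dx1/dt = x2 and dx2/dt = -x1
def rhs(t, x):
    return [x[1], -x[0]]

# Initial conditions x(0) = 1, dx/dt(0) = 2 (as in Example 2.10)
sol = solve_ivp(rhs, (0.0, 10.0), [1.0, 2.0], rtol=1e-9, atol=1e-9, dense_output=True)

t = np.linspace(0.0, 10.0, 5)
exact = 2.0 * np.sin(t) + np.cos(t)          # closed-form solution found in Example 2.10
print(np.max(np.abs(sol.sol(t)[0] - exact))) # should be small
```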

Example 2.10 Obtain the solution to the second-order initial-value problem
\[
\frac{d^2x}{dt^2} + x = 0, \qquad (2.23)
\]

with the initial conditions
\[
x(0) = 1, \qquad \left.\frac{dx}{dt}\right|_{t=0} = 2. \qquad (2.24)
\]

Solution: In order to convert this second-order equation to a system of two first-order differential equations, we make the following substitutions
\[
x_1(t) = x(t), \qquad x_2(t) = \dot{x}(t). \qquad (2.25)
\]
Differentiating the substitutions and transforming to $x_1(t)$ and $x_2(t)$, we have the following system of two first-order equations
\[
\begin{aligned}
\dot{x}_1(t) &= \dot{x}(t) = x_2(t),\\
\dot{x}_2(t) &= \ddot{x}(t) = -x(t) = -x_1(t),
\end{aligned}
\]
for the two unknowns $x_1(t)$ and $x_2(t)$. Note that the original second-order equation (2.23) has been used in the final equation ($\ddot{x} = -x$). Written in matrix form, $\dot{\mathbf{x}}(t) = \mathbf{A}\mathbf{x}(t)$, we have
\[
\begin{bmatrix} \dot{x}_1(t) \\ \dot{x}_2(t) \end{bmatrix} =
\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}
\begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix},
\]
where $\mathbf{A}$ is not symmetric.
To obtain the eigenvalues, we evaluate $|\mathbf{A} - \lambda\mathbf{I}| = 0$, or
\[
\begin{vmatrix} -\lambda & 1 \\ -1 & -\lambda \end{vmatrix} = 0,
\]
which yields the characteristic equation
\[
\lambda^2 + 1 = 0.
\]
Factoring gives the eigenvalues
\[
\lambda_1 = -i, \qquad \lambda_2 = i,
\]
which is a complex conjugate pair but are distinct. Having complex eigenvalues requires a minor modification to the procedure outlined above, but for now we proceed as before. The corresponding eigenvectors are also complex and given by
\[
\mathbf{u}_1 = \begin{bmatrix} i \\ 1 \end{bmatrix}, \qquad \mathbf{u}_2 = \begin{bmatrix} -i \\ 1 \end{bmatrix}.
\]
Consequently, forming the modal matrix, we have
\[
\mathbf{P} = \begin{bmatrix} i & -i \\ 1 & 1 \end{bmatrix}.
\]
Note that because A is not symmetric but has distinct eigenvalues, we have
linearly-independent eigenvectors, and the system can be fully diagonalized. In

order to uncouple the system of differential equations, we transform the problem according to
\[
\mathbf{x}(t) = \mathbf{P}\mathbf{y}(t).
\]
With respect to $\mathbf{y}(t)$, the solution is
\[
y_1(t) = c_1 e^{-it}, \qquad y_2(t) = c_2 e^{it}.
\]
Transforming back using $\mathbf{x}(t) = \mathbf{P}\mathbf{y}(t)$ gives the solution to the system of first-order equations in terms of $\mathbf{x}(t)$. From the substitutions (2.25), we obtain the solution with respect to the original variable as follows
\[
x(t) = x_1(t) = c_1 i e^{-it} - c_2 i e^{it}.
\]
We would normally be finished at this point. Because the solution is complex, however, we must do a bit more work to obtain the real solution. It can be shown that for linear equations, both the real and imaginary parts of a complex solution are by themselves solutions of the differential equations, and that a linear combination of the real and imaginary parts, which are both real, is also a solution of the linear equations. We can extract the real and imaginary parts by applying Euler's formula, which is
\[
e^{iat} = \cos(at) + i\sin(at).
\]
Applying the Euler formula to our solution yields
\[
x(t) = c_1 i\left(\cos t - i\sin t\right) - c_2 i\left(\cos t + i\sin t\right);
\]
therefore, the real and imaginary parts are
\[
\operatorname{Re}(x) = (c_1 + c_2)\sin t, \qquad \operatorname{Im}(x) = (c_1 - c_2)\cos t.
\]
To obtain the general solution for $x(t)$, we then superimpose the real and imaginary parts to obtain the general form of the solution to the original second-order differential equation
\[
x(t) = A\sin t + B\cos t. \qquad (2.26)
\]

The constants $A$ and $B$ are obtained by applying the initial conditions (2.24), which lead to $A = 2$ and $B = 1$; therefore, the final solution to (2.23) subject to the initial conditions (2.24) is
\[
x(t) = 2\sin t + \cos t.
\]
Remarks:
1. Observe that the imaginary eigenvalues correspond to an oscillatory solution.
2. The second-order differential equation (2.23) considered in this example governs the motion of an undamped oscillator. Dynamical systems are discussed in more depth in Part III.

Remarks:
1. The approach used in the above example to handle complex eigenvalues and
eigenvectors holds for linear systems for which superposition of solutions is
valid.
2. For additional examples of systems of linear ordinary differential equations
and application to discrete dynamical systems, see Chapter 19.
3. For an illustration of the types of solutions possible when solving systems of
nonlinear equations, see the Mathematica notebook Lorenz.nb.

Problem Set # 4

2.6 Decomposition of Matrices


Matrix decomposition, or factorization, is a method of taking a matrix A and
factoring it into the product of two or more matrices that have particular properties, often for further analysis or computation. For example, the diagonalization
procedure that has served us so well can be thought of as decomposition. Namely,
for a square matrix A with linearly-independent eigenvectors, we can decompose
$\mathbf{A}$ into the product of three matrices as follows
\[
\mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1},
\]
where $\mathbf{P}$ is the modal matrix containing the eigenvectors of $\mathbf{A}$ as its columns, and $\mathbf{D}$ is a diagonal matrix having the eigenvalues of $\mathbf{A}$ along its diagonal. If $\mathbf{A}$ is symmetric with distinct eigenvalues, then $\mathbf{P} = \mathbf{Q}$, which is an orthonormal matrix, and the decomposition becomes the special case
\[
\mathbf{A} = \mathbf{Q}\mathbf{D}\mathbf{Q}^{T}.
\]
Although not covered thus far, we have referenced several additional decompositions that are included elsewhere:
LU decomposition Section 6.2.1
Cholesky decomposition Section 6.2.2
QR decomposition Section 6.3
LU and Cholesky decomposition are used to calculate the solution to a system of
linear algebraic equations and the inverse of large matrices, and QR decomposition is the basis for numerical algorithms designed to determine the eigenvalues
and eigenvectors of large matrices.
Here, we discuss two additional decompositions, polar decomposition and singular-value decomposition, that are based on solving eigenproblems. The latter is particularly useful and has numerous applications.

2.6.1 Polar Decomposition

Given a square matrix $\mathbf{A}$, there are matrices $\mathbf{R}$, $\mathbf{U}$, and $\mathbf{V}$ such that
\[
\mathbf{A} = \mathbf{R}\mathbf{U} = \mathbf{V}\mathbf{R}, \qquad (2.27)
\]
where $\mathbf{R}$ is orthogonal, and $\mathbf{U}$ and $\mathbf{V}$ are symmetric. To find $\mathbf{R}$, $\mathbf{U}$, and $\mathbf{V}$, take the transpose of equation (2.27)
\[
\mathbf{A}^{T} = \mathbf{U}\mathbf{R}^{T} = \mathbf{R}^{T}\mathbf{V}.
\]
Postmultiplying by $\mathbf{A} = \mathbf{R}\mathbf{U}$, where $\mathbf{R}^{T}\mathbf{R} = \mathbf{I}$ because $\mathbf{R}$ is orthogonal, gives
\[
\mathbf{A}^{T}\mathbf{A} = \mathbf{U}\mathbf{R}^{T}\mathbf{R}\mathbf{U} = \mathbf{U}^{2}, \qquad (2.28)
\]
or premultiplying by $\mathbf{A} = \mathbf{V}\mathbf{R}$ gives
\[
\mathbf{A}\mathbf{A}^{T} = \mathbf{V}\mathbf{R}\mathbf{R}^{T}\mathbf{V} = \mathbf{V}^{2}. \qquad (2.29)
\]

Recall that the eigenvectors of $\mathbf{U}$ and $\mathbf{U}^{2}$ are the same, and the eigenvalues of $\mathbf{U}^{2}$ are $\lambda_i^2$, where the $\lambda_i$ are the eigenvalues of $\mathbf{U}$. Therefore, we determine the eigenvalues and eigenvectors of the symmetric matrix $\mathbf{U}^{2} = \mathbf{A}^{T}\mathbf{A}$, and form the orthonormal modal matrix $\mathbf{Q}$. Then to diagonalize the symmetric matrices $\mathbf{U}^{2}$ and $\mathbf{U}$, we take
\[
\mathbf{Q}^{T}\mathbf{U}^{2}\mathbf{Q} = \begin{bmatrix} \lambda_1^2 & 0 & \cdots & 0 \\ 0 & \lambda_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n^2 \end{bmatrix},
\]
and
\[
\mathbf{Q}^{T}\mathbf{U}\mathbf{Q} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}.
\]

Premultiplying by $\mathbf{Q}$ and postmultiplying by $\mathbf{Q}^{T}$ gives the symmetric matrix $\mathbf{U}$:
\[
\mathbf{U} = \mathbf{Q}\begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}\mathbf{Q}^{T}.
\]
Then from equation (2.27), the orthogonal matrix $\mathbf{R}$ is
\[
\mathbf{R} = \mathbf{A}\mathbf{U}^{-1},
\]
and the symmetric matrix $\mathbf{V}$ is
\[
\mathbf{V} = \mathbf{A}\mathbf{R}^{-1} = \mathbf{A}\mathbf{R}^{T}.
\]
Remark: Decomposing a general square matrix $\mathbf{A}$ in this way allows us to take advantage of the properties of orthogonal and symmetric matrices. A brief computational sketch of the polar decomposition is given below.
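As an illustration (not part of the original text), SciPy provides a polar decomposition directly; the sketch below computes $\mathbf{A} = \mathbf{R}\mathbf{U}$ for an arbitrary example matrix and checks the claimed properties of $\mathbf{R}$ and $\mathbf{U}$.

```python
import numpy as np
from scipy.linalg import polar

# Arbitrary nonsingular example matrix (for illustration only)
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

R, U = polar(A, side='right')            # A = R U with R orthogonal, U symmetric positive semi-definite

print(np.allclose(A, R @ U))             # reconstruction
print(np.allclose(R.T @ R, np.eye(2)))   # R is orthogonal
print(np.allclose(U, U.T))               # U is symmetric
```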

2.6.2 Singular-Value Decomposition


A particularly important and useful decomposition is singular-value decomposition (SVD). It allows us to extend the ideas of eigenvalues and diagonalization,
which only exist for square matrices, to rectangular matrices. It can serve both as
a diagnostic tool and a solution technique in a number of important applications.
Motivation
Recall that a symmetric matrix $\mathbf{A}$ with distinct eigenvalues can be diagonalized using
\[
\mathbf{D} = \mathbf{Q}^{T}\mathbf{A}\mathbf{Q},
\]
where all matrices are $n \times n$ (square), $\mathbf{Q}$ is the orthogonal modal matrix comprised of the eigenvectors of $\mathbf{A}$, and $\mathbf{D}$ is a diagonal matrix having the eigenvalues of $\mathbf{A}$ along its diagonal. Let us suppose that we can diagonalize any $m \times n$ rectangular matrix $\mathbf{A}$ using a similar procedure. In the diagonalization of a square matrix, we only need one modal matrix representing a single set of basis vectors, which are the eigenvectors of $\mathbf{A}$. Because the matrix $\mathbf{A}$ in SVD is no longer square, we will need different sized modal matrices in order to perform the decomposition. If this is possible, we might imagine that it would look like
\[
\mathbf{\Sigma} = \mathbf{U}^{T}\mathbf{A}\mathbf{V}, \qquad (2.30)
\]
where $\mathbf{\Sigma}$ is an $m \times n$ diagonal matrix, $\mathbf{U}$ is an $m \times m$ orthogonal matrix, and $\mathbf{V}$ is an $n \times n$ orthogonal matrix. It seems too much to ask that the matrices $\mathbf{U}$ and $\mathbf{V}$ both be orthogonal for general $\mathbf{A}$; however, this turns out to be the case.
How can we obtain two distinct bases of the appropriate sizes from a single rectangular matrix? One possibility would be to use $\mathbf{A}\mathbf{A}^{T}$ and $\mathbf{A}^{T}\mathbf{A}$, which have size $m \times m$ and $n \times n$, respectively. As we saw in Section ??, these matrices have the added bonus of being symmetric and positive definite (or at least positive semi-definite). Forming the eigenproblems for these two matrices gives
\[
\mathbf{A}\mathbf{A}^{T}\mathbf{u} = \sigma^{2}\mathbf{u}, \qquad \mathbf{A}^{T}\mathbf{A}\mathbf{v} = \sigma^{2}\mathbf{v}, \qquad (2.31)
\]
where the $\sigma^{2}$ are the eigenvalues, and $\mathbf{u}$ and $\mathbf{v}$ are the respective sets of eigenvectors. As suggested by equations (2.31), the eigenvalues of both $\mathbf{A}\mathbf{A}^{T}$ and $\mathbf{A}^{T}\mathbf{A}$ are the same (despite their size difference). More specifically, all of the nonzero eigenvalues are the same, and the remaining eigenvalues of the larger matrix are all zeros. Note that because $\mathbf{A}\mathbf{A}^{T}$ and $\mathbf{A}^{T}\mathbf{A}$ are at least positive semi-definite, thereby having all non-negative eigenvalues, we can denote them as squared quantities.
If equations (2.31) hold, then it can be shown that the eigenvectors $\mathbf{u}$ and $\mathbf{v}$ are related through the relationships
\[
\mathbf{A}\mathbf{v} = \sigma\mathbf{u}, \qquad \mathbf{A}^{T}\mathbf{u} = \sigma\mathbf{v}, \qquad (2.32)
\]
where the values of $\sigma$ are called the singular values of the matrix $\mathbf{A}$. These relationships can be confirmed by substitution into equations (2.31). For example, substituting the first of equations (2.32) into the second of equations (2.31) and canceling one of the $\sigma$'s from both sides yields the second of equations (2.32).

Similarly, substituting the second of equations (2.32) into the first of equations (2.31) and canceling one of the $\sigma$'s from both sides yields the first of equations (2.32). Thus, there is a special relationship between the two sets of basis vectors $\mathbf{u}$ and $\mathbf{v}$.
Finally, from the relationship (2.30), we can obtain the singular-value decomposition of the matrix $\mathbf{A}$ by premultiplying both sides by $\mathbf{U}$ and postmultiplying both sides by $\mathbf{V}^{T}$ to yield
\[
\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}. \qquad (2.33)
\]
To further see the connection between the SVD and diagonalization using this decomposition, observe that
\[
\mathbf{A}\mathbf{A}^{T} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}\left(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}\right)^{T} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}\mathbf{V}\mathbf{\Sigma}^{T}\mathbf{U}^{T} = \mathbf{U}\mathbf{\Sigma}^{2}\mathbf{U}^{T},
\]
where we have used the fact that $\mathbf{V}^{T}\mathbf{V} = \mathbf{I}$ because $\mathbf{V}$ is orthogonal, and $\mathbf{\Sigma}\mathbf{\Sigma}^{T} = \mathbf{\Sigma}^{2}$ for the diagonal matrix $\mathbf{\Sigma}$, where $\mathbf{\Sigma}^{2}$ contains the squares of the singular values along the diagonal. Premultiplying both sides by $\mathbf{U}^{T}$ and postmultiplying both sides by $\mathbf{U}$ leads to
\[
\mathbf{\Sigma}^{2} = \mathbf{U}^{T}\mathbf{A}\mathbf{A}^{T}\mathbf{U}.
\]
Therefore, $\mathbf{U}$ is the orthogonal modal matrix that diagonalizes the symmetric matrix $\mathbf{A}\mathbf{A}^{T}$. Similarly,
\[
\mathbf{\Sigma}^{2} = \mathbf{V}^{T}\mathbf{A}^{T}\mathbf{A}\mathbf{V},
\]
such that $\mathbf{V}$ is the orthogonal modal matrix that diagonalizes the symmetric matrix $\mathbf{A}^{T}\mathbf{A}$.
Method
Any $m \times n$ matrix $\mathbf{A}$, even singular or nearly singular matrices, can be decomposed as follows
\[
\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T},
\]
where $\mathbf{U}$ is an $m \times m$ orthogonal matrix, $\mathbf{V}$ is an $n \times n$ orthogonal matrix, and $\mathbf{\Sigma}$ is an $m \times n$ diagonal matrix of the form
\[
\mathbf{\Sigma} = \begin{bmatrix}
\sigma_1 & 0 & \cdots & 0 & 0 & \cdots & 0\\
0 & \sigma_2 & \cdots & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots & & \vdots\\
0 & 0 & \cdots & \sigma_p & 0 & \cdots & 0
\end{bmatrix},
\]
where $p = \min(m, n)$ and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p \geq 0$. The $\sigma_i$'s are the square roots of the eigenvalues of $\mathbf{A}^{T}\mathbf{A}$ or $\mathbf{A}\mathbf{A}^{T}$, whichever is smaller, and are called the singular values of $\mathbf{A}$.
The columns $\mathbf{u}_i$ of matrix $\mathbf{U}$ and $\mathbf{v}_i$ of matrix $\mathbf{V}$ satisfy
\[
\mathbf{A}\mathbf{v}_i = \sigma_i\mathbf{u}_i, \qquad \mathbf{A}^{T}\mathbf{u}_i = \sigma_i\mathbf{v}_i,
\]

and $\|\mathbf{u}_i\| = 1$, $\|\mathbf{v}_i\| = 1$, that is, they are normalized. From the first equation,
\[
\mathbf{A}^{T}\mathbf{A}\mathbf{v}_i = \sigma_i\mathbf{A}^{T}\mathbf{u}_i = \sigma_i^2\mathbf{v}_i; \qquad (2.34)
\]
therefore, the $\mathbf{v}_i$'s are the eigenvectors of $\mathbf{A}^{T}\mathbf{A}$, with the $\sigma_i^2$'s being the eigenvalues. From the second equation,
\[
\mathbf{A}\mathbf{A}^{T}\mathbf{u}_i = \sigma_i\mathbf{A}\mathbf{v}_i = \sigma_i^2\mathbf{u}_i; \qquad (2.35)
\]
therefore, the $\mathbf{u}_i$'s are the eigenvectors of $\mathbf{A}\mathbf{A}^{T}$, with the $\sigma_i^2$'s being the eigenvalues.
We only need to solve one of the eigenproblems (2.34) or (2.35). Once we have $\mathbf{v}_i$, for example, then we obtain $\mathbf{u}_i$ from $\mathbf{A}\mathbf{v}_i = \sigma_i\mathbf{u}_i$. Whereas, if we have $\mathbf{u}_i$, then we obtain $\mathbf{v}_i$ from $\mathbf{A}^{T}\mathbf{u}_i = \sigma_i\mathbf{v}_i$. We compare $m$ and $n$ to determine which eigenproblem will be easier (smaller).
Example 2.11 Use singular-value decomposition to factor the matrix
\[
\mathbf{A} = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}.
\]
Solution: The transpose of $\mathbf{A}$ is
\[
\mathbf{A}^{T} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix};
\]
therefore,
\[
\mathbf{A}^{T}\mathbf{A} = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}, \qquad
\mathbf{A}\mathbf{A}^{T} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}.
\]

Let us determine the eigenvalues of $\mathbf{A}\mathbf{A}^{T}\mathbf{u} = \sigma^{2}\mathbf{u}$, which is the smaller of the two. The characteristic equation for $\sigma^{2}$ is
\[
\sigma^{4} - 4\sigma^{2} + 3 = 0 \quad\Longrightarrow\quad \sigma_1^{2} = 3, \quad \sigma_2^{2} = 1,
\]
noting that $\sigma_1^{2} \geq \sigma_2^{2}$. Next we obtain the corresponding eigenvectors:
\[
\sigma_1^{2} = 3: \quad \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix}\mathbf{u} = \mathbf{0} \quad\Longrightarrow\quad \mathbf{u}_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix},
\]
\[
\sigma_2^{2} = 1: \quad \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}\mathbf{u} = \mathbf{0} \quad\Longrightarrow\quad \mathbf{u}_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}.
\]
Note that $\mathbf{u}_1$ and $\mathbf{u}_2$ are orthogonal and have been normalized. Now we can form the matrix $\mathbf{U}$ using the eigenvectors
\[
\mathbf{U} = \begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.
\]

To obtain $\mathbf{v}_i$, recall that $\mathbf{A}^{T}\mathbf{u}_i = \sigma_i\mathbf{v}_i$; thus,
\[
\mathbf{v}_1 = \frac{1}{\sigma_1}\mathbf{A}^{T}\mathbf{u}_1 = \frac{1}{\sqrt{3}}\begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}\frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \frac{1}{\sqrt{6}}\begin{bmatrix} 2 \\ 1 \\ 1 \end{bmatrix},
\]
which is normalized. Similarly,
\[
\mathbf{v}_2 = \frac{1}{\sigma_2}\mathbf{A}^{T}\mathbf{u}_2 = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}\frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 0 \\ -1 \\ 1 \end{bmatrix}.
\]
Again, note that $\mathbf{v}_1$ and $\mathbf{v}_2$ are orthonormal. The third eigenvector, $\mathbf{v}_3$, also must be orthogonal to $\mathbf{v}_1$ and $\mathbf{v}_2$; thus, by inspection (let $\mathbf{v}_3 = [a\;\; b\;\; c]^{T}$ and evaluate $\langle\mathbf{v}_3, \mathbf{v}_1\rangle = 0$ and $\langle\mathbf{v}_3, \mathbf{v}_2\rangle = 0$ to determine two of the three constants, then normalize to determine the third),
\[
\mathbf{v}_3 = \frac{1}{\sqrt{3}}\begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix}.
\]
Therefore,
\[
\mathbf{V} = \begin{bmatrix} \mathbf{v}_1 & \mathbf{v}_2 & \mathbf{v}_3 \end{bmatrix} =
\begin{bmatrix} \frac{2}{\sqrt{6}} & 0 & \frac{1}{\sqrt{3}} \\[2pt] \frac{1}{\sqrt{6}} & -\frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{3}} \\[2pt] \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{3}} \end{bmatrix},
\]
and
\[
\mathbf{U}^{T}\mathbf{A}\mathbf{V} = \begin{bmatrix} \sqrt{3} & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} = \mathbf{\Sigma},
\]
where the singular values of $\mathbf{A}$ are $\sigma_1 = \sqrt{3}$ and $\sigma_2 = 1$. One can check the result of the decomposition by evaluating
\[
\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}.
\]
See the Matrices with Mathematica Demo, and the brief Python sketch below.
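The following Python check (illustrative, not from the original text) reproduces the decomposition of Example 2.11 with NumPy; note that numerical SVD routines may return the singular vectors with opposite signs, which is equally valid.

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

U, s, Vt = np.linalg.svd(A)        # A = U Sigma V^T with s = [sqrt(3), 1]

print(s)                           # singular values, expect [1.7320..., 1.0]
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))   # reconstruction check
```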
Remarks:
1. Note that we have not made any assumptions about $\mathbf{A}$; therefore, the SVD can be obtained for any real or complex rectangular matrix even if it is singular. It is rather remarkable, therefore, that any matrix can be decomposed into the product of two orthogonal matrices and one diagonal matrix.
2. For an $m \times n$ matrix $\mathbf{A}$, $r = \operatorname{rank}(\mathbf{A}) = \operatorname{rank}(\mathbf{A}^{T}\mathbf{A}) = \operatorname{rank}(\mathbf{A}\mathbf{A}^{T})$ is the number of nonzero singular values of $\mathbf{A}$. If $r < p = \min(m, n)$, then $\mathbf{A}$ is singular.
3. If $\mathbf{A}$ is nonsingular and invertible (it must be square), the singular values of $\mathbf{A}^{-1}$ are the reciprocals of the singular values of $\mathbf{A}$.

4. If $\mathbf{A}$ is square and invertible, we can take advantage of the orthogonality of $\mathbf{U}$ and $\mathbf{V}$ to compute the inverse of the matrix $\mathbf{A}$ as follows:
\[
\mathbf{A}^{-1} = \left(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}\right)^{-1} = \mathbf{V}\mathbf{\Sigma}^{-1}\mathbf{U}^{T},
\]
where the inverse of the diagonal matrix $\mathbf{\Sigma}$ is simply a diagonal matrix containing the reciprocals of the singular values of $\mathbf{A}$.
5. For $\mathbf{A}$ square, the columns of $\mathbf{U}$ corresponding to nonzero singular values form an orthonormal basis for the range of $\mathbf{A}$, and the columns of $\mathbf{V}$ corresponding to zero singular values form an orthonormal basis for the nullspace of $\mathbf{A}$ (see Section 1.5).
6. If $\mathbf{A}$ is an $n \times n$ real symmetric matrix, then $\mathbf{A}^{T}\mathbf{A} = \mathbf{A}\mathbf{A}^{T} = \mathbf{A}^{2}$, and the $n$ singular values of $\mathbf{A}$ are the absolute values of its eigenvalues.
7. If $\mathbf{A}$ is complex, then the singular-value decomposition of an $m \times n$ matrix $\mathbf{A}$ is
\[
\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\bar{\mathbf{V}}^{T},
\]
where $\mathbf{U}$ is an $m \times m$ unitary matrix, $\mathbf{V}$ is an $n \times n$ unitary matrix, and the singular values found along the diagonal of $\mathbf{\Sigma}$ are real and nonnegative (all previous instances of the transpose become the conjugate transpose for complex $\mathbf{A}$).
8. Mathematically, we are only concerned with whether a matrix is singular or nonsingular. When carrying out numerical operations with matrices, however, we are concerned with how close to being singular a matrix is. One measure of how close a matrix is to being singular is the condition number. While there are various definitions for the condition number, the most general and rigorous is the ratio of the largest singular value to the smallest. Matrices having very large condition numbers are said to be ill-conditioned. Recall that if one or more of the singular values is zero, then the matrix is singular; that is, the smallest singular value is zero and the matrix has infinite condition number. The condition number is considered to be large, and the system to be numerically ill-conditioned, if the reciprocal of the condition number is of the same or smaller order as the precision of the computer.
9. SVD can even be used to solve systems having non-unique (infinite) solutions, which occurs when there are more unknowns than equations, that is, when $n > m$.
10. See Numerical Recipes, Section 2.6, for more details on numerical implementation of SVD. In particular, see the method for determining the best solution of singular or ill-conditioned systems of equations (yes, that is right, singular systems can be solved).
11. Additional applications of SVD include least-squares curve fitting, obtaining orthonormal basis vectors without the round-off errors that can build up in the Gram-Schmidt orthogonalization procedure, image and signal compression, proper orthogonal decomposition (POD) (also called principal component analysis) for reduced-order modeling,$^{13}$ the pseudoinverse for solving ill-conditioned systems numerically, and optimal control and transient growth of linear systems.$^{14}$ In many applications, the SVD can be thought of as providing the ratio of the input to the output properties of a system.
$^{13}$ See Variational Methods with Applications in Science and Engineering, Section 11.3.

2.7 Functions of Matrices

We are familiar with functions of scalars, such as
\[
f(x) = \sin x, \qquad f(x) = e^{x}, \qquad f(x) = x^{2} + x + 1,
\]
so one may wonder if there is such a thing as a function of a matrix, such as
\[
f(\mathbf{A}) = \sin\mathbf{A}, \qquad f(\mathbf{A}) = e^{\mathbf{A}}, \qquad f(\mathbf{A}) = \mathbf{A}^{2} + \mathbf{A} + \mathbf{I}.
\]

Indeed, there is, and that is the subject of this section; and no, it is not simply the sine, exponential, etc. of each element.
Because we know how to take integer powers of matrices, we can express the matrix polynomial of degree $m$ in the form
\[
P_m(\mathbf{A}) = \alpha_0\mathbf{I} + \alpha_1\mathbf{A} + \alpha_2\mathbf{A}^{2} + \cdots + \alpha_m\mathbf{A}^{m},
\]
where $\mathbf{I}$ is the identity matrix of the same size as $\mathbf{A}$. Therefore, we have a straightforward generalization of algebraic polynomials. Because we can express any integer power of a matrix, we can also imagine expressing functions for which a Taylor series exists. For example, the Taylor series of the exponential function is
\[
f(x) = e^{x} = 1 + x + \frac{x^{2}}{2!} + \cdots;
\]
therefore, we define the exponential of a matrix by
\[
f(\mathbf{A}) = e^{\mathbf{A}} = \mathbf{I} + \mathbf{A} + \frac{\mathbf{A}^{2}}{2!} + \cdots.
\]

Likewise, for the trigonometric functions,
\[
\sin x = x - \frac{x^{3}}{3!} + \frac{x^{5}}{5!} - \cdots \quad\Longrightarrow\quad \sin\mathbf{A} = \mathbf{A} - \frac{\mathbf{A}^{3}}{3!} + \frac{\mathbf{A}^{5}}{5!} - \cdots,
\]
\[
\cos x = 1 - \frac{x^{2}}{2!} + \frac{x^{4}}{4!} - \cdots \quad\Longrightarrow\quad \cos\mathbf{A} = \mathbf{I} - \frac{\mathbf{A}^{2}}{2!} + \frac{\mathbf{A}^{4}}{4!} - \cdots.
\]
In addition, mathematical operators operate on $f(\mathbf{A})$ as for $f(x)$; for example,
\[
\frac{d}{dt}\left(e^{\mathbf{A}t}\right) = \mathbf{A}e^{\mathbf{A}t}.
\]
The following theorem is the basis for a number of important developments in functions of matrices.
$^{14}$ See Variational Methods with Applications in Science and Engineering, Chapters 6 and 10.
Cayley-Hamilton Theorem: If $P_n(\lambda) = 0$ is the characteristic polynomial of an $n \times n$ matrix $\mathbf{A}$, then $\mathbf{A}$ satisfies
\[
P_n(\mathbf{A}) = \mathbf{0}.
\]
That is, $\mathbf{A}$ satisfies its own characteristic polynomial!$^{15}$
Example 2.12 Confirm the Cayley-Hamilton theorem for the matrix
\[
\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}.
\]
Solution: The characteristic polynomial is
\[
|\mathbf{A} - \lambda\mathbf{I}| = \begin{vmatrix} 2 - \lambda & 1 \\ 0 & 2 - \lambda \end{vmatrix} = (2 - \lambda)(2 - \lambda) = 0
\quad\Longrightarrow\quad
P_2(\lambda) = \lambda^{2} - 4\lambda + 4 = 0.
\]
According to the Cayley-Hamilton theorem, then,
\[
P_2(\mathbf{A}) = \mathbf{A}^{2} - 4\mathbf{A} + 4\mathbf{I} = \mathbf{0},
\]
which is straightforward to verify.
A typical response to the Cayley-Hamilton theorem is "Neat, but so what?" One use is in obtaining simple expressions for more complex quantities. For example, say we want $\mathbf{A}^{3}$ for $\mathbf{A}$ given above. Recall that from the characteristic polynomial, we had
\[
\mathbf{A}^{2} - 4\mathbf{A} + 4\mathbf{I} = \mathbf{0},
\]
or
\[
\mathbf{A}^{2} = 4\mathbf{A} - 4\mathbf{I}.
\]
Then
\[
\begin{aligned}
\mathbf{A}^{3} &= \mathbf{A}\mathbf{A}^{2}\\
&= \mathbf{A}(4\mathbf{A} - 4\mathbf{I})\\
&= 4\mathbf{A}^{2} - 4\mathbf{A}\\
&= 4(4\mathbf{A} - 4\mathbf{I}) - 4\mathbf{A}\\
\mathbf{A}^{3} &= 12\mathbf{A} - 16\mathbf{I}.
\end{aligned}
\]
Therefore, we can determine $\mathbf{A}^{3}$ by simply multiplying $\mathbf{A}$ by a scalar and adding a diagonal matrix, rather than evaluating $\mathbf{A}\mathbf{A}\mathbf{A}$. While the advantage of this may not be readily clear for $\mathbf{A}^{3}$, imagine seeking to evaluate $\mathbf{A}^{100}$, for example! Only the coefficients, that is, 12 and 16, would be different from the expression above for $\mathbf{A}^{3}$; therefore, this would be a huge time savings. See the alternative method below for how to determine the constants, and the brief computational check that follows.
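A quick numerical confirmation of the identities above (an illustrative sketch, not from the original text):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 2.0]])

# Cayley-Hamilton: A satisfies its own characteristic polynomial
print(np.allclose(A @ A - 4 * A + 4 * np.eye(2), 0.0))

# Consequence derived in the text: A^3 = 12 A - 16 I
print(np.allclose(np.linalg.matrix_power(A, 3), 12 * A - 16 * np.eye(2)))
```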
$^{15}$ The Cayley-Hamilton theorem is proven for symmetric matrices in Hildebrand, Section 1.22.

Remarks:
1. The Cayley-Hamilton theorem may be used to obtain the inverse of a nonsingular matrix in terms of matrix multiplications. If a nonsingular $n \times n$ matrix $\mathbf{A}$ has the characteristic polynomial
\[
P_n(\lambda) = \lambda^{n} + c_1\lambda^{n-1} + \cdots + c_{n-1}\lambda + c_n = 0,
\]
then from the Cayley-Hamilton theorem
\[
\mathbf{A}^{n} + c_1\mathbf{A}^{n-1} + \cdots + c_{n-1}\mathbf{A} + c_n\mathbf{I} = \mathbf{0}.
\]
Premultiplying by $\mathbf{A}^{-1}$ and rearranging gives
\[
\mathbf{A}^{-1} = -\frac{1}{c_n}\left[\mathbf{A}^{n-1} + c_1\mathbf{A}^{n-2} + \cdots + c_{n-1}\mathbf{I}\right],
\]
which is an expression for the inverse of a matrix in terms of a linear combination of powers of the matrix.
2. Let us consider the case in which the matrix $\mathbf{A}$ has repeated eigenvalues. If $P(\lambda)$ has $(\lambda - \lambda_r)^{s}$, where $s > 1$, as a factor, that is, $\lambda_r$ is a repeated eigenvalue with multiplicity $s$, a symmetric matrix $\mathbf{A}$ also satisfies
\[
G(\mathbf{A}) = \mathbf{0},
\]
where
\[
G(\lambda) = \frac{P(\lambda)}{(\lambda - \lambda_r)^{s-1}}.
\]
This is called the reduced characteristic equation.


To further solidify the assertion in the previous example, consider Sylvester's formula. If all eigenvalues of $\mathbf{A}$ are distinct, then
\[
f(\mathbf{A}) = \sum_{k=1}^{n} f(\lambda_k)\prod_{r \neq k}\frac{\mathbf{A} - \lambda_r\mathbf{I}}{\lambda_k - \lambda_r},
\]
recalling that $\prod_{r \neq k}$ represents the product of factors with $r = 1, 2, \ldots, n$, excluding $r = k$.$^{16}$ The essential result is that any matrix function $f(\mathbf{A})$ can be expressed as a linear combination of $\mathbf{I}, \mathbf{A}, \mathbf{A}^{2}, \ldots, \mathbf{A}^{n-1}$, where $n$ is the size of $\mathbf{A}$. As such, rather than expressing a function in the form of a Taylor series having infinite powers of $\mathbf{A}$, one can express such functions as a polynomial in $\mathbf{A}$ of degree $n - 1$. Note that Sylvester's formula does not apply for the triangular matrix in the above example because $\lambda_1 = \lambda_2 = 2$.$^{17}$
Although Sylvester's formula tells us that $f(\mathbf{A})$ can be expressed as a polynomial, it is more straightforward to determine that polynomial using an alternative method, which applies for any $f(\mathbf{A})$, including matrices with repeated eigenvalues. To illustrate the alternative method, let us determine $f(\mathbf{A}) = e^{\mathbf{A}}$ for $\mathbf{A}$ $2 \times 2$ and
$^{16}$ Sylvester's formula is proven in Hildebrand, pp. 60-61.
$^{17}$ See Hildebrand, p. 62 for an example of Sylvester's formula applied to $e^{\mathbf{A}}$, where $\mathbf{A}$ is $2 \times 2$.

$\lambda_1 \neq \lambda_2$. Because $n = 2$, $f(\mathbf{A}) = e^{\mathbf{A}}$ can be expressed as a polynomial of degree $n - 1 = 1$ of the form
\[
f(\mathbf{A}) = a_0\mathbf{I} + a_1\mathbf{A}, \qquad (2.36)
\]
where we simply need to determine the constants $a_0$ and $a_1$. As a result of the Cayley-Hamilton theorem and Sylvester's formula, we do so by taking advantage of the fact that
\[
f(\lambda) = a_0 + a_1\lambda.
\]
Hence, the matrix equation becomes a scalar equation of the same form.
In this case, $f(\lambda) = e^{\lambda}$ with the two eigenvalues $\lambda_1$ and $\lambda_2$ of $\mathbf{A}$. Given these two eigenvalues, we have the two linear algebraic equations
\[
e^{\lambda_1} = a_0 + a_1\lambda_1, \qquad e^{\lambda_2} = a_0 + a_1\lambda_2,
\]
for the two unknowns $a_0$ and $a_1$. Solving, we obtain
\[
a_0 = \frac{\lambda_1 e^{\lambda_2} - \lambda_2 e^{\lambda_1}}{\lambda_1 - \lambda_2}, \qquad
a_1 = \frac{e^{\lambda_1} - e^{\lambda_2}}{\lambda_1 - \lambda_2}.
\]
Substituting $a_0$ and $a_1$ into (2.36) gives
\[
e^{\mathbf{A}} = \frac{1}{\lambda_1 - \lambda_2}\left[\left(\lambda_1 e^{\lambda_2} - \lambda_2 e^{\lambda_1}\right)\mathbf{I} + \left(e^{\lambda_1} - e^{\lambda_2}\right)\mathbf{A}\right].^{18}
\]

If $\mathbf{A}$ has repeated eigenvalues, take enough derivatives to obtain the necessary $n$ equations. For example, if $\lambda_1 = \lambda_2$ in the previous case with $n = 2$, then taking the derivative once gives
\[
f'(\lambda) = a_1,
\]
providing the second equation.
If $\mathbf{A}$ is $3 \times 3$ and all three eigenvalues are the same, then
\[
f(\lambda) = a_0 + a_1\lambda + a_2\lambda^{2}, \qquad
f'(\lambda) = a_1 + 2a_2\lambda, \qquad
f''(\lambda) = 2a_2,
\]
where, because $n = 3$, $f(\lambda)$ is a polynomial of degree two. This provides the necessary three equations for the three unknowns $a_0$, $a_1$, and $a_2$.
Example 2.13 Obtain $\cos\mathbf{A}$, where
\[
\mathbf{A} = \begin{bmatrix} 2 & 1 & 0 \\ 0 & 2 & 1 \\ 0 & 0 & 2 \end{bmatrix}.
\]
$^{18}$ This is the same as equation (230) in Hildebrand using Sylvester's formula.

Solution: Because $n = 3$, the polynomial is of degree two; therefore, we will need
\[
\mathbf{A}^{2} = \begin{bmatrix} 4 & 4 & 1 \\ 0 & 4 & 4 \\ 0 & 0 & 4 \end{bmatrix}.
\]
The polynomial is of the form
\[
f(\mathbf{A}) = a_0\mathbf{I} + a_1\mathbf{A} + a_2\mathbf{A}^{2}.
\]
Because the matrix $\mathbf{A}$ is triangular, the eigenvalues are on the diagonal, and $\lambda = 2$ has multiplicity three. Hence, owing to the repeated eigenvalue, we have
\[
f(\lambda) = a_0 + a_1\lambda + a_2\lambda^{2}, \qquad
f'(\lambda) = a_1 + 2a_2\lambda, \qquad
f''(\lambda) = 2a_2.
\]
In this case, with $f(\mathbf{A}) = \cos\mathbf{A}$, we have
\[
\begin{aligned}
\cos 2 &= a_0 + 2a_1 + 4a_2 &&\Longrightarrow\quad a_0 = 2\sin 2 - \cos 2,\\
-\sin 2 &= a_1 + 4a_2 &&\Longrightarrow\quad a_1 = 2\cos 2 - \sin 2,\\
-\cos 2 &= 2a_2 &&\Longrightarrow\quad a_2 = -\tfrac{1}{2}\cos 2.
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
\cos\mathbf{A} &= (2\sin 2 - \cos 2)\mathbf{I} + (2\cos 2 - \sin 2)\mathbf{A} - \tfrac{1}{2}\cos 2\,\mathbf{A}^{2}\\
&= (2\sin 2 - \cos 2)\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
+ (2\cos 2 - \sin 2)\begin{bmatrix} 2 & 1 & 0 \\ 0 & 2 & 1 \\ 0 & 0 & 2 \end{bmatrix}
- \frac{1}{2}\cos 2\begin{bmatrix} 4 & 4 & 1 \\ 0 & 4 & 4 \\ 0 & 0 & 4 \end{bmatrix},\\
\cos\mathbf{A} &= \begin{bmatrix} \cos 2 & -\sin 2 & -\tfrac{1}{2}\cos 2 \\ 0 & \cos 2 & -\sin 2 \\ 0 & 0 & \cos 2 \end{bmatrix}.
\end{aligned}
\]
A brief computational check of this result is given below.
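SciPy provides matrix trigonometric functions directly, so the result of Example 2.13 can be checked numerically (an illustrative sketch, not from the original text):

```python
import numpy as np
from scipy.linalg import cosm

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 2.0]])

# Closed-form result derived in Example 2.13
c2, s2 = np.cos(2.0), np.sin(2.0)
cosA_exact = np.array([[c2, -s2, -0.5 * c2],
                       [0.0, c2, -s2],
                       [0.0, 0.0, c2]])

print(np.allclose(cosm(A), cosA_exact))   # should print True
```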

3 Eigenfunction Solutions of Differential Equations
Time was when all the parts of the subject were dissevered, when algebra, geometry,
and arithmetic either lived apart or kept up cold relations of acquaintance confined to
occasional calls upon one another; but that is now at an end; they are drawn together
and are constantly becoming more and more intimately related and connected by a
thousand fresh ties, and we may confidently look forward to a time when they shall
form but one body with one soul.
(James Joseph Sylvester)

Thus far, we have focused our attention on linear algebraic equations represented as vectors and matrices. In this chapter, we build on analogies with
vectors and matrices to develop methods for solving ordinary and partial differential equations using eigenfunction expansions, which are series expansions
of solutions in terms of eigenfunctions of differential operators. As such, vectors
become functions, and matrices become differential operators.
As engineers and scientists, we view matrices and differential operators as two distinct entities, and we motivate similarities by analogy. Although we often think that the formal mathematical approach replete with theorems and proofs is unnecessarily cumbersome (even though we appreciate that someone has put such methods on a firm mathematical foundation), this is an example where the more mathematical approach is very valuable. Although from the applications point of view, matrices and differential operators seem very different and occupy different topical domains, the mathematician recognizes that matrices and certain differential operators are both linear operators that have particular properties, thus unifying these two constructs. Therefore, the mathematician is not surprised by the relationship between operations that apply to both; in fact, it couldn't be any other way!
We begin by defining a function space by analogy to vector spaces that allows us to consider linear independence and orthogonality of functions. We then consider eigenvalues and eigenfunctions of differential operators, which allow us to develop powerful techniques for solving ordinary, and in fact partial, differential equations. These analytical methods also provide the background for spectral numerical methods.
3.1 Function Spaces

3.1.1 Definitions
In order to define a function space, let us draw an analogy to the vector spaces considered in Chapter 1.
Vector Space:
1. A set of vectors $\mathbf{u}_i$, $i = 1, \ldots, m$, are linearly independent if their linear combination is zero, that is,
\[
c_1\mathbf{u}_1 + c_2\mathbf{u}_2 + \cdots + c_m\mathbf{u}_m = \mathbf{0},
\]
only if the $c_i$ coefficients are all zero. Thus, no vector $\mathbf{u}_i$ can be written as a linear combination of the others.
2. Any vector in $n$-space may be expressed as a linear combination of $n$ linearly-independent basis vectors. Although not necessary, we prefer mutually orthogonal basis vectors, in which case
\[
\langle\mathbf{u}_i, \mathbf{u}_j\rangle = 0, \qquad i \neq j.
\]
Function Space:
1. A set of functions $u_i(x)$, $i = 1, \ldots, m$, are linearly independent if their linear combination is zero, that is,
\[
c_1 u_1(x) + c_2 u_2(x) + \cdots + c_m u_m(x) = 0,
\]
only if the $c_i$ coefficients are all zero. Thus, no function $u_i(x)$ can be written as a linear combination of the others.
2. Any piecewise continuous function in an interval may be expressed as a linear combination of an infinite number of linearly-independent basis functions. Although not necessary, we prefer mutually orthogonal basis functions.

3.1.2 Linear Independence of Functions

To determine if a set of functions is linearly independent over some interval $a \leq x \leq b$, we form the Wronskian, which is a determinant of the functions and their derivatives. Say we have $m$ functions, $u_1(x)$, $u_2(x)$, $\ldots$, $u_m(x)$, for which the first $m - 1$ derivatives are continuous. The Wronskian is then defined by
\[
W[u_1(x), u_2(x), \ldots, u_m(x)] =
\begin{vmatrix}
u_1(x) & u_2(x) & \cdots & u_m(x)\\
u_1'(x) & u_2'(x) & \cdots & u_m'(x)\\
\vdots & \vdots & \ddots & \vdots\\
u_1^{(m-1)}(x) & u_2^{(m-1)}(x) & \cdots & u_m^{(m-1)}(x)
\end{vmatrix}.
\]
If the Wronskian is non-zero in the interval, that is, $W \neq 0$ for $a \leq x \leq b$, then the functions $u_1(x)$, $u_2(x)$, $\ldots$, $u_m(x)$ are linearly independent over that interval. A brief symbolic sketch of this test is given below.
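As an illustration (not from the original text), the following SymPy sketch builds the Wronskian of three example functions and evaluates it; the particular functions chosen are arbitrary.

```python
import sympy as sp

x = sp.symbols('x')
funcs = [1, x, sp.sin(x)]          # example set of functions (arbitrary choice)
m = len(funcs)

# Wronskian: rows are successive derivatives of the functions
W = sp.Matrix([[sp.diff(f, x, k) for f in funcs] for k in range(m)])
wronskian = sp.simplify(W.det())

print(wronskian)   # -sin(x); since W is not identically zero, the functions are linearly independent
```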

Figure 3.1 The function f(x) over the interval a ≤ x ≤ b.

3.1.3 Basis Functions

As mentioned above, any function in an interval may be expressed as a linear combination of an infinite number of linearly-independent basis functions. Consider a function $f(x)$ over the interval $a \leq x \leq b$ as illustrated in Figure 3.1. If $u_1(x), u_2(x), \ldots$ are linearly-independent basis functions over the interval $a \leq x \leq b$, then $f(x)$ can be expressed as
\[
f(x) = c_1 u_1(x) + c_2 u_2(x) + \cdots = \sum_{n=1}^{\infty} c_n u_n(x),
\]
where not all $c_i$'s are zero. Because this requires an infinite set of basis functions, we say that the function $f(x)$ is infinite dimensional.
Although not necessary, we prefer mutually orthogonal basis functions such that
\[
\langle u_i, u_j\rangle = \int_a^b r(x)\,u_i(x)\,u_j(x)\,dx = 0, \qquad i \neq j,
\]
that is, $u_i(x)$ and $u_j(x)$ are orthogonal over the interval $[a, b]$ with respect to the weight function $r(x)$. Note that $r(x) = 1$ unless stated otherwise.
Analogous to norms of vectors, we define the norm of a function, which gives a measure of a function's size. Recall that the norm of a vector$^{1}$ is defined by
\[
\|\mathbf{u}\| = \langle\mathbf{u}, \mathbf{u}\rangle^{1/2} = \left(\sum_{k=1}^{n} u_k^2\right)^{1/2}.
\]

Similarly, the norm of a function is defined by
\[
\|u(x)\| = \langle u, u\rangle^{1/2} = \left[\int_a^b r(x)\,u^2(x)\,dx\right]^{1/2},
\]
where the discrete sum is replaced by a continuous integral in both the inner product and norm. For example, we can use the Gram-Schmidt orthogonalization
$^{1}$ Recall from Section ?? that we use the L2 norm unless indicated otherwise.

procedure with definitions of norms and inner products of functions to obtain an orthonormal set of functions, such that
\[
\langle u_i, u_j\rangle = \int_a^b r(x)\,u_i(x)\,u_j(x)\,dx = \delta_{ij} =
\begin{cases} 0, & i \neq j,\\ 1, & i = j, \end{cases}
\]
where $\delta_{ij}$ is the Kronecker delta.
Example 3.1 Consider the polynomial functions $u_0(x) = x^0 = 1$, $u_1(x) = x^1 = x$, $u_2(x) = x^2$, $\ldots$, $u_n(x) = x^n, \ldots$ in the interval $-1 \leq x \leq 1$. Using these functions, which are linearly independent, obtain an orthonormal set of basis functions using the Gram-Schmidt orthogonalization procedure.
Solution: Check, for example, if $u_0$ and $u_2$ are orthogonal over the interval $-1 \leq x \leq 1$. Taking the inner product, we have
\[
\langle u_0, u_2\rangle \neq 0;
\]
therefore, this is not an orthogonal set. Begin by normalizing $u_0(x)$ to length one. The norm of $u_0(x)$ is
\[
\|u_0\| = \left[\int_{-1}^{1}(1)^2\,dx\right]^{1/2} = \left[\,x\,\big|_{-1}^{1}\right]^{1/2} = \sqrt{2};
\]
thus, normalizing gives
\[
\hat{u}_0(x) = \frac{u_0}{\|u_0\|} = \frac{1}{\sqrt{2}}.
\]

Consider $u_1(x) = x$. First, evaluate the inner product of $u_1(x)$ with the previous function $\hat{u}_0(x)$:
\[
\langle u_1, \hat{u}_0\rangle = \int_{-1}^{1}\frac{1}{\sqrt{2}}\,x\,dx = \frac{1}{\sqrt{2}}\left.\frac{x^2}{2}\right|_{-1}^{1} = 0;
\]
therefore, they are already orthogonal, and we simply need to normalize $u_1(x)$ as above:
\[
\|u_1\| = \left[\int_{-1}^{1} x^2\,dx\right]^{1/2} = \left(\left.\frac{x^3}{3}\right|_{-1}^{1}\right)^{1/2} = \sqrt{\frac{2}{3}};
\qquad
\hat{u}_1(x) = \frac{u_1}{\|u_1\|} = \sqrt{\frac{3}{2}}\,x.
\]

Recall that even functions are such that $f(-x) = f(x)$, and odd functions are such that $f(-x) = -f(x)$. Therefore, all of the odd functions ($x, x^3, \ldots$) are orthogonal to all of the even functions ($x^0, x^2, x^4, \ldots$) in the interval $-1 \leq x \leq 1$.
Now considering $u_2(x) = x^2$, the function that is mutually orthogonal to the previous ones is given by
\[
u_2' = u_2 - \langle u_2, \hat{u}_1\rangle\,\hat{u}_1 - \langle u_2, \hat{u}_0\rangle\,\hat{u}_0,
\]
but $\langle u_2, \hat{u}_1\rangle = 0$, and
\[
\langle u_2, \hat{u}_0\rangle = \int_{-1}^{1}\frac{1}{\sqrt{2}}\,x^2\,dx = \frac{1}{\sqrt{2}}\left.\frac{x^3}{3}\right|_{-1}^{1} = \frac{2}{3\sqrt{2}}.
\]
Thus,
\[
u_2' = x^2 - \frac{2}{3\sqrt{2}}\cdot\frac{1}{\sqrt{2}} = x^2 - \frac{1}{3}.
\]

Normalizing gives
\[
\|u_2'\| = \left[\int_{-1}^{1}\left(x^2 - \frac{1}{3}\right)^2 dx\right]^{1/2} = \sqrt{\frac{8}{45}};
\qquad
\hat{u}_2(x) = \frac{u_2'}{\|u_2'\|} = \sqrt{\frac{45}{8}}\left(x^2 - \frac{1}{3}\right) = \sqrt{\frac{5}{2}}\left(\frac{3x^2 - 1}{2}\right).
\]
Repeating for $u_3$ leads to
\[
\hat{u}_3(x) = \sqrt{\frac{7}{2}}\left(\frac{5x^3 - 3x}{2}\right).
\]
If the square root factors are removed, this produces the Legendre polynomials, which are denoted by $P_n(x)$ (see Section 3.3.3). Any continuous (or piecewise continuous) function $f(x)$ in the interval $-1 \leq x \leq 1$ can be written as a linear combination of Legendre polynomials according to
\[
f(x) = \sum_{n=0}^{\infty} a_n\hat{u}_n(x) = \sum_{n=0}^{\infty}\langle f(x), \hat{u}_n(x)\rangle\,\hat{u}_n(x),
\]
where $a_n = \langle f(x), \hat{u}_n(x)\rangle$ because the $\hat{u}_n(x)$ are orthonormal (cf. vectors). A brief symbolic sketch of this Gram-Schmidt construction is given below.

Example 3.2 Consider the trigonometric functions in the interval $0 \leq x \leq 2\pi$ (note that $\cos(0\cdot x) = 1$ and $\sin(0\cdot x) = 0$):
\[
1, \cos(x), \cos(2x), \ldots, \cos(nx), \ldots, \qquad n = 0, 1, 2, \ldots,
\]
\[
\sin(x), \sin(2x), \ldots, \sin(mx), \ldots, \qquad m = 0, 1, 2, \ldots.
\]
Using these functions, which are linearly independent, obtain an orthonormal set of basis functions using the Gram-Schmidt orthogonalization procedure.
Solution: One can confirm that all of these functions are mutually orthogonal over the interval $0 \leq x \leq 2\pi$; therefore, it is only necessary to normalize them

according to
\[
\frac{1}{\sqrt{2\pi}}, \qquad \frac{1}{\sqrt{\pi}}\cos(nx), \quad n = 1, 2, 3, \ldots, \qquad \frac{1}{\sqrt{\pi}}\sin(mx), \quad m = 1, 2, 3, \ldots.
\]
Thus, any function in $0 \leq x \leq 2\pi$ can be written as
\[
f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} a_n\frac{\cos(nx)}{\sqrt{\pi}} + \sum_{m=1}^{\infty} b_m\frac{\sin(mx)}{\sqrt{\pi}},
\]
where
\[
\frac{a_0}{2} = \left\langle f(x), \frac{1}{\sqrt{2\pi}}\right\rangle\frac{1}{\sqrt{2\pi}} = \frac{1}{2\pi}\int_0^{2\pi} f(x)\,dx,
\]
\[
a_n = \left\langle f(x), \frac{\cos(nx)}{\sqrt{\pi}}\right\rangle = \frac{1}{\sqrt{\pi}}\int_0^{2\pi} f(x)\cos(nx)\,dx,
\]
\[
b_m = \left\langle f(x), \frac{\sin(mx)}{\sqrt{\pi}}\right\rangle = \frac{1}{\sqrt{\pi}}\int_0^{2\pi} f(x)\sin(mx)\,dx.
\]
This produces the Fourier series (see Section 3.4.2). Once again, any piecewise continuous function $f(x)$ in the interval $0 \leq x \leq 2\pi$ can be expressed as a Fourier series.
Remarks:
1. An infinite set of orthogonal functions $u_0(x), u_1(x), u_2(x), \ldots$ is said to be complete if any piecewise continuous function in the interval can be expanded in terms of $u_0(x), u_1(x), u_2(x), \ldots$. Note that whereas an $n$-dimensional vector space is spanned by $n$ mutually orthogonal vectors, an infinite-dimensional function space requires an infinite number of mutually orthogonal functions to span it.
2. Analogous definitions and procedures can be developed for functions of more than one variable; for example, two functions $u(x, y)$ and $v(x, y)$ are orthogonal in the region $A$ if
\[
\langle u(x, y), v(x, y)\rangle = \iint_A r(x, y)\,u(x, y)\,v(x, y)\,dx\,dy = 0.
\]
3. The orthogonality of Legendre polynomials and Fourier series, along with other sets of functions, will prove very useful and is why they are referred to as special functions.

Figure 3.2 Transformation from x to y via the linear transformation matrix A.

3.2 Eigenfunction Expansions

In the previous section, it was shown that certain sets of functions have particularly attractive properties, namely that they are mutually orthogonal. Here, we will explore sets of functions obtained as the eigenfunctions of a differential operator and show how they can be used to obtain solutions of ordinary differential equations. First, of course, we must establish what an eigenfunction is.

3.2.1 Definitions
Let us further build on the analogy between vector and function spaces.
Vector Space: Consider the linear transformation (cf. Section 1.6)
\[
\underset{m \times n}{\mathbf{A}}\;\underset{n \times 1}{\mathbf{x}} = \underset{m \times 1}{\mathbf{y}},
\]
such that the matrix $\mathbf{A}$ transforms $\mathbf{x}$ into $\mathbf{y}$ as shown in Figure 3.2.


Function Space: Differential operators perform a role similar to the transformation matrix $\mathbf{A}$. Consider the general differential operator
\[
L = a_0 + a_1\frac{d}{dx} + a_2\frac{d^2}{dx^2} + \cdots + a_n\frac{d^n}{dx^n},
\]
which is an $n$th-order linear differential operator with constant coefficients. In the differential equation
\[
Lu(x) = f(x),
\]
the differential operator $L$ transforms $u(x)$ into $f(x)$ as illustrated in Figure 3.3.
Just as an eigenproblem can be formulated for a matrix $\mathbf{A}$ in the form
\[
\mathbf{A}\mathbf{u} = \lambda\mathbf{u},
\]

Figure 3.3 Transformation from u(x) to f(x) via the linear transformation L.

an eigenproblem can be formulated for a differential operator in the form
\[
Lu(x) = \lambda u(x),
\]
where any value of $\lambda$ that corresponds to a nontrivial solution $u(x)$ is an eigenvalue of $L$, and the corresponding solution $u(x)$ is an eigenfunction of $L$. Similar to the way that $\mathbf{A}$ transforms $\mathbf{u}$ into a constant multiple of itself, $L$ transforms $u(x)$ into a constant multiple of itself.
Note that just as we consider eigenvalues and eigenvectors for matrices, not systems of algebraic equations, we consider eigenvalues and eigenfunctions of differential operators, not entire differential equations. Note, however, that just as the eigenproblem for a matrix is a system of algebraic equations, the eigenproblem for a differential operator is a differential equation.
We do, however, require boundary conditions, which must be homogeneous. If $a \leq x \leq b$, then we may specify the value of the solution at the boundaries
\[
u(a) = 0, \qquad u(b) = 0,
\]
or the derivative of the solution
\[
u'(a) = 0, \qquad u'(b) = 0,
\]
or a linear combination of the two
\[
c_1 u'(a) + c_2 u(a) = 0, \qquad c_3 u'(b) + c_4 u(b) = 0,
\]
or combinations of the above, for example,
\[
u(a) = 0, \qquad u'(b) = 0.
\]

3.2.2 Eigenfunctions of Differential Operators

We will illustrate the eigenproblem for differential operators through an example. Note the analogy between the approach used in the following two examples with that used in Section 2.2.2 to solve systems of linear algebraic equations $\mathbf{A}\mathbf{x} = \mathbf{c}$, where $\mathbf{A}$ is symmetric. Recall that we expanded the unknown solution vector $\mathbf{x}$ and the right-hand side vector $\mathbf{c}$ as linear combinations of the eigenvectors of $\mathbf{A}$.

Example 3.3 Consider the simple differential equation
\[
\frac{d^2u}{dx^2} = f(x), \qquad (3.1)
\]
over the range $0 \leq x \leq 1$ with the homogeneous boundary conditions
\[
u(0) = 0, \qquad u(1) = 0.
\]
Solve for $u(x)$ in terms of the eigenfunctions of the differential operator in equation (3.1).
Solution: The differential operator is
\[
L = \frac{d^2}{dx^2}, \qquad (3.2)
\]

and the eigenproblem is then
\[
Lu(x) = \lambda u(x). \qquad (3.3)
\]
Observe once again that the original differential equation is not an eigenproblem, but that the eigenproblem is a differential equation.
We solve the eigenvalue problem using the techniques for solving linear ordinary differential equations to determine the eigenvalues $\lambda$ for which nontrivial solutions $u(x)$ exist. In general, $\lambda$ may be positive, negative, or zero. In order to solve (3.3) with (3.2), try letting $\lambda = +\mu^2 > 0$, giving
\[
u'' - \mu^2 u = 0.
\]
The solution of constant-coefficient, linear differential equations of this form is $u(x) = e^{rx}$, where $r$ is a constant. Upon substitution, this leads to the requirement that $r$ must satisfy
\[
r^2 - \mu^2 = 0,
\]
or
\[
(r + \mu)(r - \mu) = 0.
\]
Therefore, $r = \pm\mu$, and the solution is
\[
u(x) = c_1 e^{\mu x} + c_2 e^{-\mu x}, \qquad (3.4)
\]
or equivalently
\[
u(x) = c_1\cosh(\mu x) + c_2\sinh(\mu x).
\]
Applying the boundary conditions to determine the constants $c_1$ and $c_2$:
\[
u(0) = 0 \;\Rightarrow\; 0 = c_1 + c_2 \;\Rightarrow\; c_1 = -c_2,
\]
\[
u(1) = 0 \;\Rightarrow\; 0 = c_1 e^{\mu} + c_2 e^{-\mu} \;\Rightarrow\; e^{\mu} = e^{-\mu}.
\]
The last condition is only true if $\mu = 0$; therefore, the only solution (3.4) that

satisfies the boundary conditions is the trivial solution $u(x) = 0$. We are seeking nontrivial solutions, so let $\lambda = -\mu^2 < 0$, in which case we have
\[
u'' + \mu^2 u = 0.
\]
Again, considering a solution of the form $u(x) = e^{rx}$, $r$ must satisfy
\[
r^2 + \mu^2 = 0,
\]
or
\[
(r + i\mu)(r - i\mu) = 0.
\]
The solution is
\[
u(x) = c_3 e^{i\mu x} + c_4 e^{-i\mu x},
\]
or, from the Euler formula,
\[
u(x) = c_3\cos(\mu x) + c_4\sin(\mu x). \qquad (3.5)
\]
Applying the boundary conditions to determine the constants $c_3$ and $c_4$ gives
\[
u(0) = 0 \;\Rightarrow\; 0 = c_3(1) + c_4(0) \;\Rightarrow\; c_3 = 0,
\]
\[
u(1) = 0 \;\Rightarrow\; 0 = c_3\cos\mu + c_4\sin\mu.
\]
From the second equation with $c_3 = 0$, we see that $\sin\mu = 0$. More generally, we have two equations for the two unknown constants, which may be written in matrix form as follows
\[
\begin{bmatrix} 1 & 0 \\ \cos\mu & \sin\mu \end{bmatrix}
\begin{bmatrix} c_3 \\ c_4 \end{bmatrix} =
\begin{bmatrix} 0 \\ 0 \end{bmatrix}.
\]
For a nontrivial solution to exist for this homogeneous system,
\[
\begin{vmatrix} 1 & 0 \\ \cos\mu & \sin\mu \end{vmatrix} = 0,
\]
in which case
\[
\sin\mu = 0.
\]
This is the characteristic equation. Note that the characteristic equation is transcendental, rather than the polynomial characteristic equations encountered for matrices, and that we have an infinite number of eigenvalues (and corresponding eigenfunctions). Plotting the characteristic equation in Figure 3.4, the zeros (roots) are
\[
\mu = \pi, 2\pi, \ldots, n\pi, \ldots, \qquad n = 1, 2, \ldots.
\]
Recall that we must consider $\lambda$ positive, negative, and zero. We have considered the cases with $\lambda < 0$ and $\lambda > 0$. From equation (3.3) with $\lambda = 0$, the eigenproblem is $u'' = 0$, which has the solution
\[
u(x) = c_5 x + c_6.
\]

Figure 3.4 Zeros of the characteristic equation sin μ = 0.

Applying the boundary conditions,
\[
u(0) = 0 \;\Rightarrow\; 0 = c_5(0) + c_6 \;\Rightarrow\; c_6 = 0,
\]
\[
u(1) = 0 \;\Rightarrow\; 0 = c_5(1) + 0 \;\Rightarrow\; c_5 = 0.
\]
Therefore, we once again get the trivial solution. Thus, the roots that give nontrivial solutions are
\[
\mu = \mu_n = n\pi, \qquad n = 1, 2, 3, \ldots,
\]
and the eigenvalues are
\[
\lambda_n = -\mu_n^2 = -n^2\pi^2, \qquad n = 1, 2, 3, \ldots.
\]
The corresponding eigenfunctions (3.5) are
\[
u_n(x) = c_4\sin(n\pi x), \qquad n = 1, 2, 3, \ldots.
\]
Note that it is not necessary to consider negative $n$ because we obtain the same eigenfunctions as for positive $n$. For example,
\[
u_2(x) = c_4\sin(2\pi x), \qquad u_{-2}(x) = c_4\sin(-2\pi x) = -c_4\sin(2\pi x),
\]
and both represent the same eigenfunction given that $c_4$ is arbitrary.
The solution to the eigenproblem gives the eigenfunctions of the differential operator. Although the constant $c_4$ is arbitrary, it is often convenient to choose $c_4$ by normalizing the eigenfunctions, such that
\[
\|u_n\| = 1,
\]
or equivalently
\[
\|u_n\|^2 = 1.
\]

Then
\[
\int_0^1 u_n^2(x)\,dx = 1 \;\Rightarrow\; \int_0^1 c_4^2\sin^2(n\pi x)\,dx = 1 \;\Rightarrow\; c_4^2\cdot\frac{1}{2} = 1 \;\Rightarrow\; c_4 = \sqrt{2}.
\]
Finally, the normalized eigenfunctions of the differential operator (3.2) are
\[
u_n(x) = \sqrt{2}\sin(n\pi x), \qquad n = 1, 2, 3, \ldots,
\]
which is the Fourier sine series. Recall that the corresponding eigenvalues are
\[
\lambda_n = -\mu_n^2 = -n^2\pi^2, \qquad n = 1, 2, 3, \ldots.
\]
Observe that a solution of the differential eigenproblem (3.3) exists for any value of $\lambda$; however, not all of these solutions satisfy the boundary conditions. Those nontrivial solutions that do satisfy both equation (3.3) and the boundary conditions are the eigenfunctions, and the corresponding values of $\lambda$ are the eigenvalues, of the differential operator.
Remarks:
1. Changing the differential operator and/or the boundary conditions will change the eigenfunctions.
2. Although unusual, if more than one case ($\lambda < 0$, $\lambda = 0$, $\lambda > 0$) produces nontrivial solutions, superimpose the corresponding eigenvalues and eigenfunctions.
3. In what follows, we will take advantage of the orthogonality of the eigenfunctions $u_n(x)$. See Section 3.4 for why it is not necessary to explicitly check the orthogonality of the eigenfunctions in this case.
Having obtained the eigenfunctions of the differential operator with associated boundary conditions, we can now obtain a series solution to the original differential equation (3.1), which is repeated here:
\[
\frac{d^2u}{dx^2} = f(x), \quad 0 \leq x \leq 1, \qquad u(0) = 0, \quad u(1) = 0. \qquad (3.6)
\]
We follow the same steps as outlined in Section 2.2.2. Note that because the eigenfunctions, which form a Fourier sine series, are all mutually orthogonal, they provide a basis for the function space spanned by any piecewise continuous function $f(x)$ in the interval $[0, 1]$. Hence, the known right-hand side of the differential equation can be expressed as follows
\[
f(x) = \sum_{n=1}^{\infty} b_n u_n(x), \qquad 0 \leq x \leq 1.
\]

In order to obtain the coefficients $b_n$ for a given $f(x)$, take the inner product of the eigenfunctions $u_m(x)$ with both sides:
\[
\langle f(x), u_m(x)\rangle = \sum_{n=1}^{\infty} b_n\langle u_n(x), u_m(x)\rangle.
\]
Because the eigenfunctions are all orthonormal, the terms on the right-hand side vanish except for that corresponding to $m = n$, for which $\langle u_n(x), u_m(x)\rangle = 1$. Thus,
\[
b_n = \langle f(x), u_n(x)\rangle.
\]
We may also expand the unknown solution itself in terms of the orthonormal eigenfunctions
\[
u(x) = \sum_{n=1}^{\infty} a_n u_n(x),
\]
where the coefficients $a_n$ are to be determined. Evaluating the differential operator leads to
\[
Lu = \frac{d^2u}{dx^2} = \sum_{n=1}^{\infty} a_n u_n''(x) = \sum_{n=1}^{\infty} a_n\lambda_n u_n(x),
\]
where the last step is only possible because we have expanded $u(x)$ using the eigenfunctions of the differential operator. Substituting into the ordinary differential equation and equating like terms to determine the $a_n$ coefficients gives
\[
\frac{d^2u}{dx^2} = f(x)
\;\Rightarrow\;
\sum_{n=1}^{\infty} a_n\lambda_n u_n(x) = \sum_{n=1}^{\infty} b_n u_n(x) = \sum_{n=1}^{\infty}\langle f, u_n\rangle u_n(x)
\;\Rightarrow\;
a_n = \frac{\langle f, u_n\rangle}{\lambda_n}.
\]
Thus, the solution to the differential equation (3.6) is
\[
u(x) = \sum_{n=1}^{\infty}\frac{\langle f, u_n\rangle}{\lambda_n}\,u_n(x), \qquad 0 \leq x \leq 1.
\]

Note the similarity to the solution obtained in Section 2.2.2 using an analogous procedure. Substituting for the eigenvalues and eigenfunctions, we have the Fourier sine series for the given differential equation and boundary conditions
\[
u(x) = -\sum_{n=1}^{\infty}\frac{2}{n^2\pi^2}\left\langle f(x), \sin(n\pi x)\right\rangle\sin(n\pi x), \qquad 0 \leq x \leq 1.
\]
See the Mathematica notebook Solutions to Differential Equations using Eigenfunction Expansions for an illustration of the solution for $f(x) = x^5$.
Note: Solving $u'' = x^5$ exactly is straightforward simply by integrating twice, so what is the advantage of the eigenfunction solution approach?
1. In general, $Lu = f(x)$ can only be solved exactly for certain $f(x)$, whereas the eigenfunction expansion approach may be applied for general (even piecewise continuous) $f(x)$.
2. Solutions for various $f(x)$ may be obtained with minimal effort once the eigenvalues and eigenfunctions of the differential operator are obtained (cf. changing the right-hand side vector $\mathbf{c}$ in Section 2.2.2).
3. Eigenfunction expansions, for example Fourier series, may be applied to discrete data, such as from experiments. A popular approach is called proper-orthogonal decomposition (POD), which provides an alternative means of expressing large experimental or numerical data sets.
4. The eigenfunction approach provides the basis for spectral numerical methods (see Section 3.7). A brief numerical sketch of the eigenfunction-expansion solution for this example is given below.

Example 3.4 Instead of equation (3.6), consider an ordinary differential equation of the form
d2 u
u = f (x), u(0) = 0, u(1) = 0,
(3.7)
dx2
where is some physical parameter.
Solution: The eigenproblem for the differential operator L is the same as in the
previous example; therefore, we use the same eigenfunctions, and following the
same procedure as above gives the coefficients in the expansion for the solution
as
hf, un i
.
an =
n
Therefore, the solution to equation (3.7) is
u(x) =

X
n=1

1
hf (x), un (x)i un (x),
+

n2 2

0 x 1,

where as in the previous example, the eigenfunctions are

un (x) = 2 sin(nx), n = 1, 2, 3, . . . .
If the parameter equals one of the eigenvalues, that is, = n = n2 2
for a particular n, then no solution exists unless hf, un i = 0, in which case an is
arbitrary. Again, recall the similarities to the cases considered in Section 2.2.2.
Alternatively, note that the differential operator in equation (3.7) could be
regarded as L = d2 /dx2 ; however, this would require obtaining different
eigenvalues and eigenfunctions.

3.3 Adjoint and Self-Adjoint Differential Operators

107

3.3 Adjoint and Self-Adjoint Differential Operators


In the following sections, we want to build further on the analogy between eigenvectors and eigenfunctions to determine for which differential operators the eigenfunctions are mutually orthogonal.2
3.3.1 Adjoint of a Differential Operator
Consider an nn matrix A and arbitrary n1 vectors u and v. Let us premultiply
A by vT and postmultiply by u and take the transpose
T
vT Au = uT AT v.
Observe from the matrix multiplication that the left-hand side is a scalar, which
is a 1 1 matrix; therefore, we may remove the transpose and write in terms of
inner products


hv, Aui = u, AT v .
We can think of this as a means to define the transpose of a matrix AT .
By analogy, a differential operator L has an adjoint3 operator L that satisfies
hv, Lui = hu, L vi ,

(3.8)

where u(x) and v(x) are arbitrary functions with homogeneous boundary conditions.
In order to illustrate the approach for determining the adjoint of a differential operator, consider the second-order linear differential equation with variable
coefficients
1
[a0 (x)u00 + a1 (x)u0 + a2 (x)u] = 0, a x b,
(3.9)
Lu =
r(x)
where r(x) is a weight function. To obtain the adjoint operator, consider an
arbitrary function v(x), and take the inner product with Lu to obtain the lefthand side of (3.8) as follows


Z b
1
00
0
hv, Lui =
r(x)v(x)
[a0 (x)u + a1 (x)u + a2 (x)u] dx,
(3.10)
r(x)
a
where the inner product is taken with respect to the weight function r(x). We
want to switch the roles of u(x) and v(x) in the inner product, that is, interchange
derivatives on u(x) for derivatives on v(x), in order to obtain the right-hand side
2

This material is from Variational Methods with Applications in Science and Engineering, Section
1.8.
In an unfortunate twist of terminology, the term adjoint has different meanings in the context of
matrices. Recall from Section 1.4.5 that the adjoint of a matrix is related to its cofactor matrix,
while from Section 1.1, it is also another term for taking the transpose of a matrix.

108

Eigenfunction Solutions of Differential Equations

of (3.8). This is accomplished using integration by parts.4 Integrating the second


term by parts gives
b Z b
Z b

0
a1 vu dx = a1 vu
u(a1 v)0 dx,
a

where
Z

pdq = pq

qdp

with
p = va1 ,

q = u,

dp = (va1 )0 dx,

dq = u0 dx.

Similarly, integrating the first term of (3.10) by parts twice results in


b Z b

b Z b
Z b

0
0
00
0
0
0
u(a0 v)00 dx,
a0 vu dx = a0 vu
u (a0 v) dx = a0 vu (a0 v) u +
a
a
a
a
{z
} |
{za
}
|
(1)

(2)

(1)
p = va0 ,

q = u0 ,

dp = (va0 )0 dx,

dq = u00 dx,

p = (va0 )0 ,

v = u,

dp = (va0 )00 dx,

d
v = u0 dx.

(2)

Substituting into (3.10) gives



b
0
0
hv, Lui = a0 vu (a0 v) u + a1 vu
a


Z b
1
00
0
+
r(x)u(x)
[(a0 v) (a1 v) + a2 v] dx,
r(x)
a
where the expression in {} is L v, and the integral is hu, L vi. Therefore, if the
boundary conditions on u(x) and v(x) are homogeneous, in which case the terms
evaluated at x = a and x = b vanish, we have
hv, Lui = hu, L vi ,
where the adjoint operator L of the differential operator L is

1 
00
0
[a0 (x)v] [a1 (x)v] + a2 (x)v .
L v =
r(x)

(3.11)

Note that the variable coefficients move inside the derivatives, and the odd-order
4

See Variational Methods with Applications in Science and Engineering, Section 1.6, for a review of
integration by parts.

3.3 Adjoint and Self-Adjoint Differential Operators

109

derivatives change sign as compared to Lu. This is the case for higher-order
derivatives as well.
Example 3.5

Determine the adjoint operator for

d2
d
+x ,
dx2
dx
with homogeneous boundary conditions.
Solution: From equation (3.9), we have
L=

a0 (x) = 1,

a1 (x) = x,

0 x 1,

a2 (x) = 0,

r(x) = 1.

Then from equation (3.11), the adjoint operator is


L v = [a0 v]00 [a1 v]0 + a2 v
= v 00 (xv)0
L v = v 00 xv 0 v.
Thus, the adjoint operator of L is
L =

d2
d
x
1.
dx2
dx

Note that L 6= L in this example.

3.3.2 Requirement for an Operator to be Self-Adjoint


Recall that a matrix that is equal to its transpose is said to be symmetric, or
Hermitian for its conjugate transpose, and has the property that its eigenvectors
are mutually orthogonal if its eigenvalues are distinct (see Section 2.2). Likewise,
if the differential operator and its adjoint are the same, in which case L = L , then
the differential operator L is said to be self-adjoint, or Hermitian, and distinct
eigenvalues produce orthogonal eigenfunctions.
To show this, consider the case when L = L given above and let u(x) and v(x)
be two eigenfunctions of the differential operator, such that Lu = 1 u, Lv = 2 v.
Evaluating equation (3.8)
hu, Lvi = hv, Lui
hu, 2 vi = hv, 1 ui
(1 2 ) hu, vi = 0.
Consequently, if 1 6= 2 , the corresponding eigenfunctions must be orthogonal
in order for their inner product to be zero.
As illustrated in the previous example, not all second-order linear differential
equations of the form (3.9) have differential operators that are self-adjoint, that

110

Eigenfunction Solutions of Differential Equations

is, for arbitrary a0 (x), a1 (x), and a2 (x) coefficients. Let us determine the subset
of such equations that are self-adjoint.
Recall from equation (3.11) that the adjoint operator is given by
L v =


1 
00
0
[a0 v] [a1 v] + a2 v .
r(x)

We rewrite the adjoint L v of Lu by carrying out the differentiations (product


rule) and collecting terms as follows
L v =

1
{a0 v 00 + [2a00 a1 ] v 0 + [a000 a01 + a2 ] v} .
r(x)

(3.12)

For L to be self-adjoint, the operators L and L in equations (3.9) and (3.12),


respectively, must be the same. Given that the first term is already identical,
consider the second and third terms. Equivalence of the second terms requires
that
a1 (x) = 2a00 (x) a1 (x),
or
a1 (x) = a00 (x).

(3.13)

Equivalence of the third terms requires that


a2 (x) = a000 (x) a01 (x) + a2 (x),
but this is always the case if (3.13) is satisfied. Therefore, substitution of the
condition (3.13) into (3.9) requires that
Lu =

1
{a0 u00 + a00 u0 + a2 u} = 0.
r(x)

The differential operator in the above expression may be written in the form




1
d
d
L=
a0 (x)
+ a2 (x) ,
r(x) dx
dx
which is called the Sturm-Liouville differential operator.
Therefore, a second-order linear differential operator is self-adjoint if and only
if it is of the Sturm-Liouville form




d
d
1
p(x)
+ q(x) ,
(3.14)
L=
r(x) dx
dx
where p(x) > 0 and r(x) > 0 in a x b, and the boundary conditions
are homogeneous. It follows that the corresponding eigenfunctions of the SturmLiouville differential operator are orthogonal with respect to the weight function
r(x).
Similarly, consider the fourth-order Sturm-Liouville differential operator
 2 




1
d
d2
d
d
L=
s(x)
+
p(x)
+
q(x)
.
r(x) dx2
dx2
dx
dx

3.3 Adjoint and Self-Adjoint Differential Operators

111

This operator can also be shown to be self-adjoint if the boundary conditions are
homogeneous of the form
u = 0, u0 = 0,
or
u = 0,

s(x)u00 = 0,

or
u0 = 0,

[s(x)u00 ] = 0,

at x = a or x = b. Then the corresponding eigenfunctions are orthogonal on the


interval a x b with respect to r(x).
3.3.3 Eigenfunctions of Sturm-Liouville Operators
Let us consider the eigenproblem Lu(x) = u(x) for the Sturm-Liouville differential operator (3.14), which is


d
du
p(x)
+ [q(x) + r(x)] u = 0, a x b.
(3.15)
dx
dx
Recall that p(x) > 0 and r(x) > 0 in a x b. The Sturm-Liouville equation is
a linear, homogeneous, second-order ordinary differential equation, and it is the
eigenproblem for the Sturm-Liouville differential operator. The Sturm-Liouville
equation is a boundary-value problem and requires homogeneous boundary conditions of the form
u(a) + u0 (a) = 0,

u(b) + u0 (b) = 0,

where and are not both zero, and and are not both zero.
As shown in the previous section, for the Sturm-Liouville differential operator:
1. The eigenvalues are distinct and nonnegative.
2. The eigenfunctions un (x) are orthogonal with respect to the weight function
r(x), such that
Z b
hun , um i =
r(x)un (x)um (x)dx = 0, m =
6 n.
a

Recall that the norm of un (x) is also defined with respect to the weight function
Z b
2
kun k =
r(x)u2n (x)dx.
a

Solutions to the eigenproblems associated with the Sturm-Liouville differential operator with various p(x), q(x), and r(x) and having appropriate boundary
conditions produce several common eigenfunctions, for example Fourier Series,
Legendre Polynomials, Bessel Functions, and Chebyshev Polynomials (see, for
example, Jeffrey 2002 and Asmar 2005).

112

Eigenfunction Solutions of Differential Equations

Fourier Series:
We have already found in Section 3.2 that for eigenproblems of the form
d2 u
+ u = 0, 0 x 1,
dx2
u(0) = 0, u(1) = 0,
the eigenfunctions of the differential operator produce a Fourier sine series
un (x) = an sin(nx),

n = 1, 2, 3, . . . .

Similarly, the boundary-value problem


d2 u
+ u = 0, 0 x 1,
dx2
u0 (0) = 0, u0 (1) = 0,
produces eigenfunctions that form the Fourier cosine series, for which
un (x) = an cos(nx),

n = 0, 1, 2, . . . .

In the interval 1 x 1, the complete Fourier series is required involving both


sine and cosine expansions (see Section 3.1).
Bessel Functions:
The differential operator associated with the equation


 
du
2
d
x
+ + 2 x u = 0, 0 x 1,
dx dx
x
has as its eigenfunctions the Bessel functions J (x), J (x) if is not an integer, and J (x), Y (x) if is an integer. J are the Bessel functions of the
first kind, and Y are the Bessel functions of the second kind. For example, see
Figure 3.5 for plots of J0 (x), J1 (x), Y0 (x), and Y1 (x). See Jeffrey, Section 8.6 for
expansions of Bessel functions.
Note that the above differential equation is a Sturm-Liouville equation with
p(x) = x,

q(x) =

2
,
x

r(x) = x,

= 2 .

Bessel functions are orthogonal over the interval 0 x 1 with respect to the
weight function r(x) = x, and the Bessel equation arises when solving partial
differential equations involving the Laplacian operator 2 in cylindrical coordinates (see Section 3.6.2).
Legendre Polynomials:

3.3 Adjoint and Self-Adjoint Differential Operators


1

0.5

113

J0
J1

Y0

10

12

14

0.5

Y1
1

Figure 3.5 Bessel functions J0 (x), J1 (x), Y0 (x), and Y1 (x).

Legendre polynomials arise as eigenfunctions of the differential operator associated with the equation


d
2 du
(1 x )
+ ( + 1)u = 0, 1 x 1,
dx
dx
where in the Sturm-Liouville equation
p(x) = 1 x2 ,

q(x) = 0,

r(x) = 1,

= ( + 1).

The Legendre polynomials are


P0 (x) = 1,

P1 (x) = x,

1
P2 (x) = (3x2 1), . . .
2

with the recursion relation


( + 1)P+1 (x) (2 + 1)xP (x) + P1 (x) = 0.
Given any two successive Legendre polynomials, the remaining sequence may
be obtained using the recursion relation. We have already shown that Legendre
polynomials are orthogonal over the interval 1 x 1 (see Section 3.1). Note
that the Legendre equation arises when solving Laplaces equation 2 u = 0 in
spherical coordinates.
Chebyshev Polynomials:
The differential operator from the following equation produces Chebyshev polynomials as eigenfunctions


du
d
(1 x2 )1/2
+ 2 (1 x2 )1/2 u = 0, 1 x 1,
(3.16)
dx
dx
where in the Sturm-Liouville equation
p(x) = (1 x2 )1/2 ,

q(x) = 0,

r(x) = (1 x2 )1/2 ,

= 2.

114

Eigenfunction Solutions of Differential Equations

The Chebyshev polynomials T (x) of degree are


T0 (x) = 1,

T1 (x) = x,

T2 (x) = 2x2 1, . . .

with the recursion relation


T+1 (x) 2xT (x) + T1 (x) = 0.
Chebyshev polynomials are orthogonal over the interval 1 x 1.
Converting to Self-Adjoint Form:
The Sturm-Liouville forms of the equations given above are all self-adjoint.
However, alternative forms of these same equations may not be. For example, the
Chebyshev equation written in the form
(1 x2 )u00 xu0 + 2 u = 0
is not self-adjoint. The Sturm-Liouville, that is, self-adjoint, form (3.16) is obtained by multiplying by (1 x2 )1/2 .
In general, any second-order linear ordinary differential eigenproblem of the
form
d2 u
du
a0 (x) 2 + a1 (x)
+ [a2 (x) + a3 (x)] u = 0
(3.17)
dx
dx
can be reformulated as a Sturm-Liouville equation. To show how, let us write
equation (3.17) in the form


d2 u a1 (x) du
a2 (x)
a3 (x)
+
+
+
u = 0.
dx2 a0 (x) dx
a0 (x)
a0 (x)
Consider the Sturm-Liouville equation (3.15), which may be written in the form
p(x)

d2 u dp du
+
+ [q(x) + r(x)] u = 0,
dx2 dx dx

or


d2 u
1 dp du
q(x)
r(x)
+
+
+
u = 0.
dx2 p(x) dx dx
p(x)
p(x)

Comparing these two equations, we see that in order to be self-adjoint, we must


have
p0
a1
q
a2
r
a3
= ,
= ,
= .
p
a0
p
a0
p
a0
Integrating the first of these yields
Z

a1
dx;
a0

Z


a1 (x)
dx ,
a0 (x)

ln p =
thus,
p(x) = exp

(3.18)

3.4 Eigenfunction Solutions of Partial Differential Equations

115

and the other two relationships lead to


q(x) =

a2 (x)
p(x),
a0 (x)

r(x) =

a3 (x)
p(x).
a0 (x)

(3.19)

Note that a3 (x) 6= 0 for differential eigenproblems. Using equations (3.18) and
(3.19), we can convert any second-order linear eigenproblem of the form (3.17)
into the self-adjoint Sturm-Liouville form (3.15).
Non-Homogeneous Equations:
Just as in Section 3.2, solutions to nonhomogeneous forms of differential equations with the above operators are obtained by expanding both the solution and
the right hand side in terms of the eigenfunctions of the differential operators
and determining the coefficients in the expansions.
We expand the right hand side f (x) in terms of eigenfunctions un (x), n =
0, 1, 2, . . . according to
f (x) =

bn un (x) = b0 u0 (x) + b1 u1 (x) + . . . .

n=0

To determine the coefficients bn , evaluate the inner product hum (x), f (x)i, which
is equivalent to multiplying the above expression by r(x)um (x) and integrating
over the interval a x b. If the eigenfunctions are orthogonal, then all of the
terms in the expansion with n 6= m are zero leaving that with n = m providing
Z b
Z b
2
r(x)um (x)f (x)dx = bm
r(x) [um (x)] dx,
a

or
hum (x), f (x)i = bm kum (x)k2 .
Thus, the coefficients in the expansion for f (x) are
bn =

hun (x), f (x)i


.
kun (x)k2

Note that if the eigenfunctions are normalized with respect to the weight function
r(x), then kun (x)k2 = 1. Once again, it is the orthogonality of the eigenfunctions
that makes this approach possible.

3.4 Eigenfunction Solutions of Partial Differential Equations


Thus far, we have considered eigenfunction expansions for cases governed by ordinary differential equations, having one independent variable x. In some cases, we
can use the eigenfunction expansion approach along with the method of separation
of variables to solve partial differential equations as well.

116

Eigenfunction Solutions of Differential Equations

Figure 3.6 Domain for Laplace equation with n indicating outward facing
normals to the boundary.

3.4.1 Laplace Equation


Let us begin by illustrating the method of separation of variables using the Laplace
equation, which governs heat conduction and electrostatic fields, for example. The
Laplace equation in two-dimensional Cartesian coordinates for u(x, y) is
2u 2u
+ 2 = 0,
x2
y

(3.20)

u
specified on each boundary, where n represents
n
the normal to the boundary (see Figure 3.6).
It is supposed that the solution u(x, y) can be written as the product of two
functions, one a function of x only and one a function of y only, thereby separating
the variables, as follows
with boundary conditions u or

u(x, y) = (x)(y).

(3.21)

Substituting into the Laplace equation (3.20) gives

d2
d2
+

= 0,
dx2
dy 2

or upon separating the variables


1 d2
1 d2
=

= .
(x) dx2
(y) dy 2
Because the left-hand side is a function of x only, and the right-hand side is a
function of y only, the equation must be equal to a constant, say , as x and y
may be varied independently. Thus, we have two equations
1 d2
=
(x) dx2

d2
(x) = 0,
dx2

(3.22)

and

1 d2
=
(y) dy 2

d2
+ (y) = 0.
dy 2

(3.23)

3.4 Eigenfunction Solutions of Partial Differential Equations

117

Figure 3.7 Domain for heat conduction example.

Observe that when it applies, that is, when an equation is separable in the
sense shown above, the method of separation of variables allows one to convert a
partial differential equation into a set of ordinary differential equations. Only the
differential equation with two homogeneous boundary conditions is an eigenproblem. Solve this one first, and then solve the other ordinary differential equation
using the same values of the eigenvalues from the eigenproblem.5
Example 3.6 Consider the temperature distribution u(x, y) due to conduction
in the rectangular domain given in Figure 3.7. Heat conduction is governed by
Laplaces equation (3.20), and the boundary conditions are
u=0

at x = 0, x = a, y = 0,

u = f (x)

at

y = b.

(3.24)

That is, the temperature is zero on three sides and some specified distribution of
x on the fourth side.
Solution: Separating variables as in (3.21) leads to the ordinary differential equations (3.22) and (3.23). We first consider the eigenproblem, which is the equation
having two homogeneous boundary conditions. Thus, consider equation (3.22)
d2
= 0,
dx2
which for = 2 < 0 ( = 0 and = +2 > 0 produce trivial solutions) has
the solution
(x) = c1 cos(x) + c2 sin(x).

(3.25)

The boundary condition u(0, y) = 0 requires that (0) = 0; therefore, c1 =


0. The boundary condition u(a, y) = 0 requires that (a) = 0; therefore, the
characteristic equation is
sin (n a) = 0,
5

The method of separation of variables is also used in Section 2.2 of Complex Variables.

118

Eigenfunction Solutions of Differential Equations

and the eigenvalues are


n =

n
,
a

n = 1, 2, . . . .

(3.26)

The eigenfunctions are then


n (x) = c2 sin

 n 
x ,
a

n = 1, 2, . . . ,

(3.27)

where c2 is arbitrary and set equal to one for convenience. Now consider equation
2
(3.23), recalling that n = 2n = n
,
a
d2
2n = 0,
dy 2
which has the solution6
n (y) = c3 cosh

 n 
 n 
y + c4 sinh
y .
a
a

However, u(x, 0) = 0; therefore, n (0) = 0 requiring that c3 = 0, and


 n 
n (y) = c4 sinh
y .
a

(3.28)

Then the solution is (cn = c4 )


u(x, y) =

X
n=1

un (x, y) =

n (x)n (y) =

n=1

X
n=1

cn sin

 n 
 n 
x sinh
y , (3.29)
a
a

which isP
essentially an eigenfunction expansion with variable coefficients n (y) (cf.
(x) = cn n (x)). We determine the cn coefficients by applying the remaining
boundary condition (3.24) at y = b as follows
u(x, b) =

n (x)n (b) = f (x).

n=1

Recognizing that the n (b) are constants, taking the inner product of the eigenfunctions m (x) with both sides gives
kn (x)k2 n (b) = hf (x), n (x)i ,
where all the terms in the summation on the left-hand side vanish owing to
orthogonality of eigenfunctions except when m = n. Then with kn k2 = a/2,
solving for the constants n (b) yields
Z
 n 
2 a
hf (x), n (x)i
=
f
(x)
sin
x dx, n = 1, 2, . . . ,
(3.30)
n (b) =
kn (x)k2
a 0
a
which are the Fourier sine coefficients of f (x). As before, if we had chosen the c2
6

We typically prefer trigonometric functions for finite domains and exponential functions for infinite
or semi-infinite domains.

3.4 Eigenfunction Solutions of Partial Differential Equations

119

coefficients in order to normalize the eigenfunctions (3.27), then kn (x)k2 = 1.


From (3.28) at y = b, the constants cn in the solution (3.29) are
cn =

n (b)
;
sinh nb
a

therefore, the solution (3.29) becomes

 n  sinh
u(x, y) =
un (x, y) =
n (x)n (y) =
n (b) sin
x
a
sinh
n=1
n=1
n=1

n
y
a 
nb
a

(3.31)
where n (b) are the Fourier sine coefficients of f (x) obtained from equation (3.30).
For example, consider the case with a = b = 1, and f (x) = 1. Then equation
(3.30) becomes
Z 1
2
n (1) = 2
sin (nx) dx =
[1 cos(n)] , n = 1, 2, . . . ,
n
0
4
for n odd. Therefore, let us define a new index
which is zero for n even, and n
according to n = 2m + 1, m = 0, 1, 2, 3, . . ., in which case

m (1) =

4
.
(2m + 1)

The solution (3.31) is then


u(x, y) =

4 X sin [(2m + 1)x] sinh [(2m + 1)y]


.
m=0
(2m + 1) sinh [(2m + 1)]

A contour plot of this solution is shown in Figure 3.8, where the contours represent
constant temperature isotherms in heat conduction.
Remarks:
1. The above approach works when three of the four sides have homogeneous
boundary conditions. Because the equations are linear, more general cases
may be treated using superposition. For example, see Figure 3.9 for an example
having two non-homogeneous boundary conditions.
2. When we obtain the eigenfunctions
un (x, y) = n (x)n (y),
they each satisfy the Laplace equation individually for n = 1, 2, 3, . . .. Because the Laplace equation is linear, we obtain the most general solution by
superimposing these solutions according to
X
X
u(x, y) =
un (x, y) =
n (x)n (y).

120

Eigenfunction Solutions of Differential Equations


1

0.8

0.6

0.4

0.2

0
0

0.2

0.4

0.6

0.8

Figure 3.8 Isotherms for Example 3.6 in increments of u = 0.1.

Figure 3.9 Superposition of eigenfunction solutions for a problem with two


non-homogeneous boundary conditions.

3.4.2 Unsteady Diffusion Equation


Consider the one-dimensional, unsteady diffusion equation
u
2u
= 2,
t
x

(3.32)

where is the diffusivity. It governs the temperature distribution u(x, y) owing


to unsteady conduction in the one-dimensional domain x `. The initial
condition is
u(x, 0) = f (x),

(3.33)

3.4 Eigenfunction Solutions of Partial Differential Equations

121

and the boundary conditions are


u(0, t) = 0,

u(`, t) = 0.

(3.34)

The unsteady diffusion equation may also be solved using the method of separation of variables in certain cases. As with the Laplace equation, we seek a
solution comprised of the product of two functions, the first a function of time
only and the second a function of space only according to
u(x, t) = (x)(t).

(3.35)

Substituting into equation (3.32) gives


d2
d
= 2 ,
dt
dx
which after separating variables leads to

1 d
1 d2
=
= = 2 ,
dt
dx2
where we note that = 0 and > 0 lead to trivial solutions. Consequently, the
single partial differential equation (3.32) becomes the two ordinary differential
equations
d2
+ 2 = 0,
(3.36)
dx2
d
+ 2 = 0.
dt

(3.37)

Equation (3.36) along with the homogeneous boundary conditions (3.34) represents the eigenproblem for (x). The general solution to equation (3.36) is
(x) = c1 cos(x) + c2 sin(x).

(3.38)

From the boundary condition u(0, t) = 0, we must have (0) = 0, which requires
that c1 = 0. Then from u(`, t) = 0, we must have (`) = 0, which gives the
characteristic equation
sin(n `) = 0;
therefore,
n
, n = 1, 2, 3, . . . ,
`
from which we obtain the eigenvalues
 n 2
n = 2n =
, n = 1, 2, 3, . . . .
`
If we let c2 = 1 for convenience, the eigenfunctions are
 n 
n (x) = sin
x , n = 1, 2, 3, . . . .
`
n =

(3.39)

(3.40)

(3.41)

122

Eigenfunction Solutions of Differential Equations

Let us now consider equation (3.37) with the result (3.39). The solution to this
first-order ordinary differential equation is
n (t) = cn exp(2n t),

n = 1, 2, 3, . . . .

From equation (3.35) with (3.41) and (3.42), the solution is





 n 
X
X
n2 2
u(x, t) =
n (x)n (t) =
cn exp 2 t sin
x ,
`
`
n=1
n=1

(3.42)

n = 1, 2, 3, . . . .

(3.43)
The cn coefficients are determined through application of the initial condition
(3.33) applied at t = 0, which requires that

 n 
X
u(x, 0) =
cn sin
x = f (x).
`
n=1
Alternatively, we may write this as

cn n (x) = f (x).

n=1

Taking the inner product of the eigenfunctions m (x) with both sides, the only
non-vanishing term occurs when m = n. Therefore,
cn kn (x)k2 = hf (x), n (x)i ,
which gives the coefficients as
Z
 n 
2 `
hf (x), n (x)i
=
f
(x)
sin
x dx,
cn =
kn (x)k2
` 0
`

n = 1, 2, 3, . . . .

(3.44)

These are the Fourier sine coefficients of the initial condition f (x). Thus, the
eigenfunction solution is given by equation (3.43) with the coefficients (3.44).
Remarks:
1. It is rather remarkable that we have been able to extend and generalize methods developed to solve algebraic systems of equations to solve ordinary differential equations and now partial differential equations. This is the remarkable
power of mathematics!
2. For much more on Sturm-Liouville and eigenfunction theory, see Asmar (2005).
For example, see Section 11.1 for application to Schrodingers equation of quantum mechanics, and Section 3.9 for application to non-homogeneous partial
differential equations, such as the Poisson equation.
3. For additional examples of the method of separation of variables for partial
differential equations and application to continuous systems governed by the
wave equation, see Section 20.1.
4. Observe that all the partial differential equations considered here have been
on bounded domains, in which case separation of variables with eigenfunction
expansions provides a solution in some cases. For partial differential equations

3.4 Eigenfunction Solutions of Partial Differential Equations

123

on unbounded domains, we can sometimes use integral transforms, such as


Fourier, Laplace, Hankel, etc... (see MMAE 502).

Problem Set # 5

4
Vector and Matrix Calculus

The world of ideas which [mathematics] discloses or illuminates, the contemplation


of divine beauty and order which it induces, the harmonious connexion of its parts,
the infinite hierarchy and absolute evidence of the truths with which it is concerned,
these, and such like, are the surest grounds of the title of mathematics to human
regard, and would remain unimpeached and unimpaired were the plan of the universe
unrolled like a map at our feet, and the mind of man qualified to take in the whole
scheme of creation at a glance.
(James Joseph Sylvester)

In Chapters 1 and 2, we focused on algebra involving vectors and matrices; here,


we treat calculus of vectors and matrices. There are several important operators
that arise involving differentiation (gradient, divergence, and curl) and integration
(Greens theorem, divergence theorem, and Stokes theorem). In addition, special
attention is focused on tensors, which are 2 2 or 3 3 matrices that have
certain additional properties that lend themselves to being used in continuum
mechanics. Coordinate transformations as applied to differential equations also
requires an understanding of vector calculus. Finally, determining extrema of
algebraic functions requires evaluating first and second derivatives; finding such
extrema segues into optimization, which is treated briefly here and in more detail
in Chapter 8.

4.1 Vector Calculus


Recall that a scalar is a quantity that requires only one component, that is,
magnitude, to specify its state. Physical quantities, such as pressure, temperature,
and density, are scalars; note that they are independent of the choice of coordinate
system.
A three-dimensional vector is a quantity that requires three components to
specify its state, that is, it requires both a magnitude and direction. Physical
quantities, such as force and velocity, are given by vectors in three-dimensional
space.
A tensor in three dimensions is a quantity that requires nine components to
specify its state. Physical quantities, such as stress, strain, and strain rate, are
124

4.1 Vector Calculus

125

Figure 4.1 Tetrahedral element of a solid or fluid with the stresses on each
of the three orthogonal faces to counterbalance the force on the inclined face.

Figure 4.2 Cartesian coordinate system.

defined by tensors. For example, the stress tensor is defined by

xx xy xz
= yx yy yz .
zx zy zz
The ii components on the diagonal are the normal stresses and the ij components are the shear stresses on an infinitesimally small tetrahedral element
of a solid or fluid substance (see Figure 4.1). In general, such tensors have nine
components; however, equilibrium requires that they be symmetric, in which case
ij = ji and there are only six unique components. A scalar is a rank zero tensor,
a vector in three dimensions is a rank one tensor, and a tensor in three dimensions
is a rank two tensor.
Before proceeding to the derivative operators, let us first consider further the
two most commonly used coordinate systems, namely Cartesian and cylindrical.
In the Cartesian coordinate system (see Figure 4.2), the unit vectors in the x, y,
and z coordinate directions are i, j, and k, respectively. Therefore, a vector in
three-dimensional Cartesian coordinates is given by
a = ax i + ay j + az k,
where, for example, ax is the component of the vector a in the x-direction, which
is given by ha, ii = a i. Note that in the Cartesian coordinate system, the unit
vectors i, j, and k are independent of changes in x, y, and z. Therefore, derivatives
of the unit vectors with respect to each of the coordinate directions vanish. For

126

Vector and Matrix Calculus

Figure 4.3 Cylindrical coordinate system.

example,
a
x

(ax i + ay j + az k) ,
x
0
0
0
7
i ay
j az
k
ax
i + ax  +
j + ay  +
k + az  ,
=
x
x
x
x
x
x
a
ax
ay
az
=
i+
j+
k.
x
x
x
x
Note that this is the same as simply taking the partial derivative with respect to
x of each of the components of a.
Cylindrical coordinates, which are shown in Figure 4.3, requires a bit more
care. A vector in three-dimensional cylindrical coordinates is given by
=

a = ar er + a e + az ez .
In terms of Cartesian coordinates, the unit vectors are
er = cos i + sin j,

e = sin i + cos j,

ez = k.

Note that the unit vector ez is independent of r, , and z, and the unit vectors
er and e are independent of r and z; however, they are dependent on changes
in . For example,

er
=
(cos i + sin j) = sin i + cos j = e .

Similarly,
e

=
( sin i + cos j) = cos i sin j = er .

Summarizing, we have that


er
= 0,
r
er
= e ,

er
= 0,
z

e
= 0,
r
e
= er ,

e
= 0,
z

ez
= 0,
r
ez
= 0,

ez
= 0.
z

4.1 Vector Calculus

127

Figure 4.4 The gradient of a scalar field.

The key in cylindrical coordinates is to remember that one must first take the
derivative before performing vector operations. For an example, see Example 4.1.

4.1.1 Derivative Operators


The vector differential operator , called del, is given in Cartesian coordinates
by1
=

i+
j+
k.
x
y
z

(4.1)

It is a vector operator that operates on scalars, such as (x, y, z), and vectors,
such as f (x, y, z), in three distinct ways. The first is denoted by , and is called
the gradient, which is the spatial rate of change of the scalar field (x, y, z). The
second is denoted by f and is called the divergence. The third is denoted by
f and is called the curl. The latter two are spatial rates of change of the
vector field f (x, y, z). We first focus on these operations in Cartesian coordinates
and then generalize to cylindrical and spherical coordinates.
Gradient (grad )
In Cartesian coordinates, the gradient of the scalar field (x, y, z) is


i+
j+
k =
i+
j+
k.
x
y
z
x
y
z

(4.2)

This is a vector field that represents the direction and magnitude of the greatest
spatial rate of change of (x, y, z) at each point. That is, it indicates the steepest
slope of the scalar field at each point as illustrated in Figure 4.4.
1

It is important to realize that when writing vectors containing dertivatives in this manner, we do
not mean to imply that in the first term, for example, that /x is operating on the unit vector i; it
is merely the x-component of the vector. This is why some authors place the unit vector before the
respective components to prevent any such confusion.

128

Vector and Matrix Calculus

Divergence (div f )
In Cartesian coordinates, the divergence of the vector field f (x, y, z) is the inner
(dot) product of the gradient and force vectors as follows:



fx fy
fz

i+
j+
k (fx i + fy j + fz k) =
+
+
.
(4.3)
f =
x
y
z
x
y
z
This is a scalar field that represents the net normal component of the vector field
f passing out through the surface of an infinitesimal volume surrounding each
point. It indicates the expansion of a vector field.
Curl (curl f )
In Cartesian coordinates, the curl of the vector field f (x, y, z) is the corss product
of the gradient and force vectors as follows:


i





j k 

fz
fy
fz
fx
fy
fx

j+

k.
f = x y z =
y
z
x
z
x
y
f
fy fz
x
(4.4)
Recall that we use a cofactor expansion about the first row to evaluate the determinant. This is a vector field that represents the net tangential component of
the vector field f along the surface of an infinitesimal volume surrounding each
point. It indicates the rotational characteristics of a vector field. For example,
the curl of a force field is the moment or torque, and the curl of a velocity field
yields the rate of rotation of the particles.
Laplacian
In Cartesian coordinates, the Laplacian operator is
 



2
2
2

2
i+
j+
k
i+
j+
k =
+
+
.
==
x
y
z
x
y
z
x2 y 2 z 2
(4.5)
The Laplacian can operate on both scalars and vectors to produce a scalar or
vector expression, respectively. For example, 2 = 0 is the Laplace equation,
which we have encountered several times already.
Gradient, Divergence, Curl, and Laplacian in Curvilinear Orthogonal
Coordinates
If is a scalar function and f = f1 e1 + f2 e2 + f3 e3 is a vector function of orthogonal curvilinear coordinates u1 , u2 , and u3 with e1 , e2 , and e3 being unit vectors
in the direction of increasing u1 , u2 , and u3 , respectively, then the gradient, divergence, curl, and Laplacian are defined as follows:
Gradient:
=

1
1
1
e1 +
e2 +
e3 .
h1 u1
h2 u2
h3 u3

(4.6)

4.1 Vector Calculus

129

Divergence:



1
(h2 h3 f1 ) +
(h3 h1 f2 ) +
(h1 h2 f3 ) .
f =
h1 h2 h3 u1
u2
u3

(4.7)

Curl:


h1 e1
h2 e2
h3 e3





1

/u1 /u2 /u3 .
f =

h1 h2 h3



h1 f1
h2 f2
h3 f3

(4.8)

Laplacian:







1

h2 h3
h3 h1
h1 h2

+
+
.
h1 h2 h3 u1
h1 u1
u2
h2 u2
u3
h3 u3
(4.9)
In the above expressions h1 , h2 , and h3 are the scale factors, which are given by






r
r
r





,
h1 =
, h2 =
, h3 =
u1
u2
u3

2 =

where r represents the position vector of a point in space. In Cartesian coordinates, for example, r = xi + yj + zk.
Below we provide the ui , hi , and ei for Cartesian, cylindrical, and spherical
coordinates for use in equations (4.6)(4.9).
Cartesian Coordinates: (x, y, z)
u1 = x h1 = 1 e1 = i
u2 = y h2 = 1 e2 = j
u3 = z h3 = 1 e3 = k
Cylindrical Coordinates: (r, , z)
u1 = r h1 = 1 e1 = er
u2 = h2 = r e2 = e
u3 = z h3 = 1 e3 = ez
Spherical Coordinates: (R, , )
u1 = R h1 = 1
e1 = eR
u2 = h2 = R
e2 = e
u3 = h3 = R sin e3 = e
The following properties can be established for the gradient, divergence, and

130

Vector and Matrix Calculus

curl of scalar and vector functions. The curl of the gradient of any scalar vanishes,
that is,
= 0.
The divergence of the curl of any vector vanishes according to
( f ) = 0.
Note also that the divergence and curl of the product of a scalar and vector obey
the product rule. For example,
(f ) = f + f ,

(f ) = f + f .

In these relationships, recall that the gradient of a scalar is a vector, the curl of
a vector is a vector, and the divergence of a vector is a scalar.
Example 4.1

Evaluate the divergence of the vector


f = fr er

in cylindrical coordinates, where fr is a constant.


Solution: Remembering that one must first take the derivative before performing vector operations, evaluating the divergence in cylindrical coordinates yields
(keeping in mind that fr is a constant)


1

er +
e +
ez (fr er )
f =
r
r
z
1

(fr er ) + e
(fr er ) + ez
(fr er )
= er
r
r
z
0
0
7
7
er
fr er
er
= er fr  + e
+ ez fr 
r
r
z
fr
= e e
r
fr
f =
,
r
where we have used the fact that er / = e obtained above. If we would have
first carried out the dot products before differentiating, we would have obtained
f = fr /r = 0, which is incorrect.

4.1.2 Integral Theorems


In addition to differentiation, calculus involves its inverse integration. Because
of the different forms of differentiation, there are corresponding integral theorems.

4.1 Vector Calculus

131

Figure 4.5 Schematic for divergence theorem.

Divergence Theorem
Let f be any continuously differentiable vector field in a volume V surrounded
by a surface S, then
I
Z
f dA =
f dV ,
(4.10)
V

where a differential area on surface S is dA = dA n, with n being the outward


facing normal at dA as shown in Figure 4.5. The divergence theorem transforms
a surface ingegral over S to a volume integral over V and states that the integral
of the normal component of f over the bounding surface is equal to the volume
integral of the divergence of f throughout the volume V . Recalling that the divergence of a vector field measures its expansion properties, the divergence theorem
takes into account that the expansion of each differential volume within V is absorbed by the neighboring differential volumes until the surface is reached such
that the net (integrated) result is that the total expansion of the volume is given
by the expansion at its bounding surface.
There are additional forms of the divergence theorem corresponding to the
other types of derivatives as well. For the gradient, we have
I
Z
dV ,
(4.11)
dA =
V

and for the curl


I

f dA =

f dV .

(4.12)

The latter case is also known as Gauss theorem.


Stokes Theorem
In two dimensions, we have a similar relationship between the integral around a
curve C bounding an area domain A (see Figure 4.6). Here, ds is a unit vector
tangent to C. Stokes theorem states that
I
Z
f ds =
( f ) dA.
(4.13)
C

Stokes theorem transforms a line integral over the bounding curve to an area
integral and relates the line integral of the tangential component of f over the

132

Vector and Matrix Calculus

Figure 4.6 Schematic for Stokes theorem.


f (x)

x0

Figure 4.7 One-dimensional function f (x) with a local minimum at x = x0 .

closed curve C to the integral of the normal component of the curl of f over the
area A.

4.1.3 Coordinate Transformations


4.2 Extrema of Functions
One of the common uses of calculus is in determining extrema, that is, minimums,
maximums, or stationary points, of functions. Essential in its own right, this also
provides the basis for optimization of systems governed by algebraic equations.
We must address cases with and without constraints; we consider unconstrained
functions of only one independent variable first.

4.2.1 General Considerations


First, let us consider a one-dimensional function f (x) as shown in Figure 4.7.2
The point x0 is a stationary (or critical) point of f (x) if
df
=0
dx
2

at x = x0 ,

This material is from Variational Methods with Applications in Science and Engineering, Section
1.4.

4.2 Extrema of Functions

133

or equivalently
df =

df
dx = 0
dx

at x = x0 ,

where df is the total differential of f (x). That is, the slope of f (x) is zero at
x = x0 . The following possibilities exist at a stationary point:
1. If d2 f /dx2 < 0 at x0 , the function f (x) has a local maximum at x = x0 .
2. If d2 f /dx2 > 0 at x0 , the function f (x) has a local minimum at x = x0 .
3. If d2 f /dx2 = 0, then f (x) may still have a local minimum (for example,
f (x) = x4 at x = 0) or a local maximum (for example, f (x) = x4 at x = 0),
or it may have neither (for example, f (x) = x3 at x = 0).
The important point here is that the requirement that f 0 (x0 ) = 0, in which case
x0 is a stationary point, provides a necessary condition for a local extremum,
while possibilities (1) and (2) provide additional sufficient conditions for a local
extremum at x0 .
Now consider the two-dimensional function f (x, y). For an extremum to occur
at (x0 , y0 ), it is necessary (but not sufficient) that
f
f
=
=0
x
y

at x = x0 , y = y0 ,

or equivalently
df =

f
f
dx +
dy = 0
x
y

at x = x0 , y = y0 ,

where df is the total differential of f (x, y). The point (x0 , y0 ) is a stationary point
of f (x, y) if df = 0 at (x0 , y0 ), for which the rate of change of f (x, y) at (x0 , y0 )
in all directions is zero. At a stationary point (x0 , y0 ), the following possibilities
exist, where subscripts denote partial differentiation with respect to the indicated
variable:
1.
2.
3.
4.

If
If
If
If

2
fxx fyy fxy
2
fxx fyy fxy
2
fxx fyy fxy
2
fxx fyy fxy

>0
>0
<0
=0

and fxx < 0 at (x0 , y0 ), then it is a local maximum.


and fxx > 0 at (x0 , y0 ), then it is a local minimum.
at (x0 , y0 ), then it is a saddle point.
at (x0 , y0 ), then (1), (2), or (3) is possible.

Observe that the evaluated expression can be expressed as the determinant




fxx fxy
2


fyx fyy = fxx fyy fxy .
This is called the Hessian, which is the determinant of the Hessian matrix, which
includes all possible second derivatives of the function. Possibilities (1) and (2)
above provide sufficient conditions, along with the necessary condition that df =
0 at (x0 , y0 ), for the existence of an extremum. See Hildebrand (1976) for a
derivation of these conditions.

134

Vector and Matrix Calculus

The above criteria can be generalized to n-dimensions. That is, we have a


stationary point of f (x1 , x2 , . . . , xn ) if
df =

f
f
f
dx1 +
dx2 + +
dxn = 0.
x1
x2
xn

The Hessian matrix is now n n and given

fx1 x1 fx1 x2
fx2 x1 fx2 x2

H = ..
..
.
.
fxn x1

fxn x2

by

fx1 xn
fx2 xn

.. .
..
.
.
fxn xn

If all of the derivatives of f are continuous, then the Hessian matrix is symmetric.
Using this Hessian matrix, the second-derivative test is used to determine what
type of extrema exists at each stationary point as follows:
1. If the Hessian matrix is negative definite (all eigenvalues are negative), then
the function f (x1 , . . . , xn ) has a local maximum.
2. If the Hessian matrix is positive definite (all eigenvalues are positive), then
the function f (x1 , . . . , xn ) has a local minimum.
3. If the Hessian matrix has both positive and negative eigenvalues, then the
function f (x1 , . . . , xn ) has a saddle point.
4. If the Hessian matrix is semidefinite (at least one eigenvalue is zero), then the
second-derivative test is inconclusive.
Note that the cases with f (x) and f (x, y) considered above are special cases of
this more general result.
Example 4.2 Obtain the location (x1 , x2 , x3 ) at which a minimum value of the
function
f1 (x1 , x2 , x3 )
x2 + 6x1 x2 + 4x22 + x23
= 1
f (x1 , x2 , x3 ) =
f2 (x1 , x2 , x3 )
x21 + x22 + x23
occurs.
Solution: At a point (x1 , x2 , x3 ) where f has zero slope, we have
f
= 0,
xi

i = 1, 2, 3.

With f = f1 /f2 , this requires that




1 f1 f1 f2

= 0,
f2 xi f2 xi

i = 1, 2, 3,

or letting = f = f1 /f2
f1
f2

= 0,
xi
xi

i = 1, 2, 3.

4.2 Extrema of Functions

135

Substituting
f1 = x21 + 6x1 x2 + 4x22 + x23 ,

f2 = x21 + x22 + x23 ,

and evaluating the partial derivatives produces three equations for the three unknown coordinate values. In matrix form, these equations are


1
3
0
x1
0
3
4
0 x2 = 0 .
0
0
1 x3
0
For a nontrivial solution, the required eigenvalues are

1
1
1 = 1, 2 = (5 + 3 5), 3 = (5 3 5),
2
2
and the corresponding eigenvectors are



4
4
0
3 + 61 (5 + 3 5)
3 + 61 (5 3 5)
, u3 =
.
u1 = 0 , u2 =
1
1
1
0
0

Of the three eigenvalues, f = 3 = 21 (5 3 5) has the minimum value, which


occurs at


4
x1
3 + 61 (5 3 5)
x2 = u3 =
.
1
x3
0
Alternatively, the nature of each stationary point could be determined using the
Hessian matrix and the second-derivative test.

4.2.2 Constrained Extrema and Lagrange Multipliers


It is often necessary to find an extremum of a function subject to some constraint(s). As in the previous section, let us first consider the case with a single
function before extending to systems of equations.
Recall from the previous section that a necessary condition for an extremum
of a function f (x, y, z) at (x0 , y0 , z0 ) without constraints3 is
df = fx dx + fy dy + fz dz = 0.

(4.14)

Because dx, dy, and dz are arbitrary (x, y, and z are independent variables), this
requires that
fx = 0,

fy = 0,

fz = 0,

which is solved to find x0 , y0 , and z0 .


3

This material is from Variational Methods with Applications in Science and Engineering, Section
1.5.

136

Vector and Matrix Calculus

Now let us determine the stationary point(s) of a function f (x, y, z) subject to


a constraint, say
g(x, y, z) = c,
where c is a specified constant. The constraint provides an algebraic relationship
between the coordinates x, y, and z. In addition to equation (4.14), we have that
the total differential of g(x, y, z) is
dg = gx dx + gy dy + gz dz = 0,

(4.15)

which is zero because g is equal to a constant. Because both (4.14) and (4.15)
equal zero at (x0 , y0 , z0 ), it follows that we can add them according to
df + dg = (fx + gx )dx + (fy + gy )dy + (fz + gz )dz = 0,

(4.16)

where is an arbitrary constant, which we call the Lagrange multiplier.4


Note that because of the constraint g = c, the variables x, y, and z are no
longer independent. With one constraint, for example, we can only have two of
the three variables being independent. As a result, we cannot regard dx, dy, and
dz as all being arbitrary and set the three coefficients equal to zero as in (4.14).
Instead, suppose that gz 6= 0 at (x0 , y0 , z0 ). Then the last term in (4.16) may
be eliminated by specifying the arbitrary Lagrange multiplier to be = fz /gz
giving
(fx + gx )dx + (fy + gy )dy = 0.
The remaining variables, x and y, may now be regarded as independent, and the
coefficients of dx and dy must each vanish. This results in four equations for the
four unknowns x0 , y0 , z0 , and as follows
fx + gx = 0,

fy + gy = 0,

fz + gz = 0,

g = c.

Consequently, finding the stationary point of the function f (x, y, z) subject to


the constraint g(x, y, z) = c is equivalent to finding the stationary point of the
augmented function
f(x, y, z) = f (x, y, z) + [g(x, y, z) c]
subject to no constraint. Thus, we seek the point (x0 , y0 , z0 ) where
fx = fy = fz = 0.
Remarks:
1. Observe that because c is a constant, we may write the augmented function
as f = f + (g c), as above, or as f = f + g.
2. Additional constraints may be imposed in the same manner, with each constraint having its own Lagrange multiplier.
4

Most authors use to denote Lagrange multipliers. Throughout this text, however, we use to
denote eigenvalues (as is common) and for Lagrange multipliers.

4.2 Extrema of Functions

Example 4.3
by

137

Find the semi-major and semi-minor axes of the ellipse defined


(x1 + x2 )2 + 2(x1 x2 )2 = 8,

which may be written


3x21 2x1 x2 + 3x22 = 8.
Solution: To determine the semi-major (minor) axis, calculate the farthest (nearest) point on the ellipse from the origin. Therefore, we maximize (minimize)
d = x21 + x22 , the square of the distance from the origin, subject to the constraint
that the coordinates (x1 , x2 ) be on the ellipse. Define an augmented function as
follows

d = d + 3x2 2x1 x2 + 3x2 8 ,
1

where is the Lagrange multiplier and is multiplied by the constraint that the

extrema be on the ellipse. To determine the extrema of the algebraic function d,


we evaluate
d
d
= 0,
= 0,
x1
x2
with

d = x21 + x22 + 3x21 2x1 x2 + 3x22 8 .

Evaluating the partial derivatives and setting equation to zero gives


d
= 2x1 + (6x1 2x2 ) = 0,
x1
d
= 2x2 + (2x1 + 6x2 ) = 0.
x2
Thus, we have two equations for x1 and x2 given by
3x1 x2 = x1 ,
x1 + 3x2 = x2 ,
where = 1 . This is an eigenproblem of the form Ax = x, where


3 1
A=
.
1 3
The eigenvalues of the symmetric matrix A are 1 = 2 and 2 = 4 with the
corresponding eigenvectors
 
 
1
1
u1 =
, u2 =
.
1
1
The two eigenvectors u1 and u2 (along with u1 and u2 ) give the directions of
the semi-major and semi-minor axes, which are along lines that bisect the first
and third quadrants and second and fourth quadrants, respectively. Note that

138

Vector and Matrix Calculus

because A is real and symmetric with distinct eigenvalues, the eigenvectors are
mutually orthogonal.
In order to determine which eigenvectors correspond to the semi-major and
semi-minor axes, we recognize that a point on the ellipse must satisfy
(x1 + x2 )2 + 2(x1 x2 )2 = 8.
Considering uT1 = [1 1], let us set x1 = c1 and x2 = c1 . Substituting into the
equation for the ellipse yields
4c21 + 0 = 8,

in which
1 = 2 and
case c1 = 2. Therefore, x1 = 2 and x2 =p 22 or (x
x2 = 2), and the length of the corresponding axis is x1 + x22 = 2. Similarly,
considering uT2 = [1 1], let us set x1 = c2 and x2 = c2 . Substituting into the
equation for the ellipse yields
0 + 8c22 = 8,
in which case c2 = 1. Therefore, x1 = 1 and
px2 = 1 (or x1 = 1 and x2 = 1),
and the length of the corresponding axis is x21 + x22 = 2. As a result, the
eigenvector u1 corresponds to the semi-major axis, and u2 corresponds to the
semi-minor axis.

4.2.3 Linear Programming


The application of the previous sections to determining extrema of linear algebraic
equations with constraints is known as linear programming, which is an important
topic in optimization. In linear programming, both the function, for which we seek
the extremum, and the constraints, which may be equality or inequality, are linear
algebraic.
Let us say that our objective is to minimize the linear function having n independent variables
J(x1 , x2 , . . . , xn ) = c1 x1 + c2 x2 + + cn xn ,
which is referred to as the objective function. The m linear constraints, where
m < n, are
A11 x1 + A12 x2 + + A1n xn = b1 ,
A21 x1 + A22 x2 + + A2n xn = b2 ,
..
.
Am1 x1 + Am2 x2 + + Amn xn = bm .
It is also often the case that we constrain the variables x1 , x2 , . . . , xn to be positive
or negative. In matrix form, we have the objective function
J(x) = hc, xi

(4.17)

4.2 Extrema of Functions

139

subject to the constraints given by Ax = b, where A is m n with m < n.


That is, there are fewer constraint equations than independent variables. These
are equality constraints; inequality constraints can be accommodated within this
framework using slack variables, which will be discussed in Chapter 8.
Let us first pursue the approach outlined in the previous section for optimization of algebraic equations with constraints.
 The constraint equations are incorporated using Lagrange multipliers T = 1 2 . . . m . Thus, we now want
to minimize the augmented function expressed in matrix form
J = hc, xi T (Ax b) ,

(4.18)

where the minus sign is for convenience. A necessary condition for an extremum of
a function is that its derivative with respect to each of the independent variables
must be zero. This requires that
J
= 0,
xi

i = 1, 2, . . . , n.

(4.19)

Applying (4.19) to the augmented function (4.18) yields the result that
AT = c
must be satisfied.5 This is a system of linear algebraic equations for the Lagrange
multipliers . However, we do not seek the Lagrange multipliers; we seek the
stationary solution(s) x that minimize the objective function and satisfy the constraints.
Remarks:
1. The reason why the usual approach does not work for linear programming is
because we are dealing with straight lines (in two dimensions), planes (in three
dimensions), and linear functions (in n dimensions) that do not in general have
finite maximums and minimums.
2. Thankfully, the standard approach described in the previous sections do work
for quadratic programming and the least-squares method, which are covered
in the following sections.
3. Linear programming requires introduction of additional techniques that do not
simply follow from differentiating the algebraic functions and setting equal to
zero. These will be covered in Chapter 8.

4.2.4 Quadratic Programming


A certain class of optimization problems, known as quadratic programming, involves obtaining extremums of quadratic forms (see Section 2.2.3). For example,
5

We do not include the details here because the method does not work. The details for similar
problems (that do work) will be given in subsequent sections.

140

Vector and Matrix Calculus

consider determining the minimum6 of the quadratic function


J(x1 , x2 , . . . , xn ) = A11 x21 + A22 x22 + . . . + Ann x2n +
2A12 x1 x2 + 2A13 x1 x3 + . . . + 2An1,n xn1 xn ,
which is now our objective function. The constraint is also in the form of a
quadratic
B11 x21 + B22 x22 + . . . + Bnn x2n + 2B12 x1 x2 + 2B13 x1 x3 + . . . + 2Bn1,n xn1 xn = 1.
Note that both the objective function and the constraint consist of a single
quadratic function. Alternatively, in matrix form we have the objective function
J(x) = xT Ax

(4.20)

subject to the quadratic constraint that


xT Bx = 1.

(4.21)

Here, A and B are n n symmetric matrices containing constants, and x is an


n 1 vector containing the variables. As before, the constraint is incorporated
using a Lagrange multiplier .7 Thus, we now want to maximize the augmented
function expressed in matrix form

J = xT Ax xT Bx 1 ,
(4.22)
where the negative sign is merely for convenience. Again, the necessary conditions
for an extremum of the augmented function is that its derivative with respect to
each of the variables must be zero as follows:
J
= 0,
xi

i = 1, 2, . . . , n.

(4.23)

In order to evaluate equation (4.23) for (4.22), observe that


A11 A12 A1n
x1
A12 A22 A2n x2




xT Ax = x1 x2 xn ..
..
.. ..
..
.
.
.
. .
A1n A2n Ann xn
xT Ax = A11 x21 + A22 x22 + . . . + Ann x2n
+2A12 x1 x2 + 2A13 x1 x3 + . . . + 2An1,n xn1 xn ,
where each term is quadratic in xi . Differentiating with respect to xi , i = 1, 2, . . . , n
6

Note that a minimum can be changed to a maximum or vice versa simply by changing the sign of
the objective function.
We use for the Lagrange multiplier here because it turns out to be the eigenvalues of an
eigenproblem.

4.2 Extrema of Functions

141

yields


xT Ax = 2A11 x1 + 2A12 x2 + 2A13 x3 + . . . + 2A1n xn ,


x1


xT Ax = 2A21 x1 + 2A22 x2 + 2A23 x3 + . . . + 2A2n xn ,


x2

..
.


xT Ax
xi

= 2Ai1 x1 + 2Ai2 x2 + . . . + 2Aii xi + . . . + 2Ain xn ,

..
.


xT Ax = 2An1 x1 + 2An2 x2 + . . . + 2An,n1 xn1 + 2Ann xn ,


xn

which is simply 2Ax. For the augmented function (4.22), therefore, equation
(4.23) leads to
2 (Ax Bx) = 0.
Because of the constraint (4.21), we are seeking non-trivial solutions for x.
Such solutions correspond to solving the generalized eigenproblem
Ax = Bx,

(4.24)

where the eigenvalues are the values of the Lagrange multiplier, and the eigenvectors x are the candidate stationary points for which (4.20) is an extremum.
The resulting eigenvectors must be checked against the constraint (4.21) to be
sure that it is satisfied and are thus stationary points. To determine whether
the stationary points are a minimum or maximum would require checking the
behavior of the second derivatives. In the special case when B = I, we have the
regular eigenproblem
Ax = x.

Example 4.4

Determine the stationary points of the quadratic function


f (x1 , x2 ) = x21 + x22

(4.25)

subject to the quadratic constraint


x1 x2 = 1.
Solution: In this case, we have equation (4.24) with




 
1 0
0 1
x
A=
, B= 1 2 , x= 1 .
0 1
0
x2
2

(4.26)

142

Vector and Matrix Calculus

The generalized eigenproblem (4.24) may be rewritten as a regular eigenproblem


of the form
B1 Ax = x,
which in this case is


0 2
x = x.
2 0

The corresponding eigenvalues and eigenvectors are


 
1
,
1 = 2, x1 = c1
1
2 = 2,

 
1
.
x2 = c2
1

Observe that x1 does not satisfy the constraint for any c1 , and that c2 = 1
in order for x2 to satisfy the constraint. Therefore, the stationary points are
(x1 , x2 ) = (1, 1) and (x1 , x2 ) = (1, 1) corresponding to = 2. It can be
confirmed that both of these points are local minimums.
Let us reconsider Example 4.3 involving determining the major and minor axes
of an ellipse.
Example 4.5

Determine the stationary points of the quadratic function


f (x1 , x2 ) = x21 + x22

(4.27)

subject to the quadratic constraint


3x21 2x1 x2 + 3x22 = 8.
Solution: In this case, we

1
A=
0

(4.28)

have equation (4.24) with





 
0
3 1
x
, B=
, x= 1 .
1
1 3
x2

The generalized eigenproblem (4.24) may be rewritten as a regular eigenproblem


of the form
B1 Ax = x,
which in this case is


1 3 1
x = x.
8 1 3

4.3 Least-Squares Solutions of Algebraic Systems of Equations

143

The corresponding eigenvalues and eigenvectors are


 
1
1
,
1 = , x1 = c1
1
4
 
1
1
.
2 = , x2 = c2
1
2

Observe that c1 = 1 and c2 = 2 to satisfy the constraint. Therefore,


the
stationary points
are (x1 , x2 ) = (1, 1), (x1 , x2 ) = (1, 1), (x1 , x2 ) = ( 2, 2),
and (x1 , x2 ) = ( 2, 2); the first two are minimums, that is, semi-minor axes,
and the last two are maximums, that is, semi-major axes.

4.3 Least-Squares Solutions of Algebraic Systems of Equations


In Chapter 1, our primary focus was on solving systems of equations Ax = c for
which the rank of the coefficient matrix was equal to the number of unknowns
(r = n), in which case there is a unique solution. This is the most common case
in applications. Using the principles developed in the previous section, we can
develop techniques for solving systems of linear algebraic equations that are
either inconsistent (r > n), which have no solution, or have an infinite number of
solutions (r < n).
4.3.1 Overdetermined Systems (r > n)
An inconsistent system has more linearly-independent equations than unknowns;
such systems are overdetermined. Although they do not have any solutions that
satisfy all of the equations, we may seek the solution that comes closest to
satisfying the equations. This will be the case, for example, in curve fitting in
Chapter 9.
For example, consider a system of equations Ax = c for which the rank of A
is greater than the number of unknowns. Although this system does not have a
according to
solution, let us define the residual of a vector x
r = c A
x,

(4.29)

which, for a given coefficient matrix A and right-hand side vector c, is essentially
is to satisfying the system of equations.8
a measure of how close the vector x
Hence, one could define the solution of an overdetermined system to be the
that comes closest to satisfying the system of equations, that is, the one
vector x
with the smallest residual. As before, we measure size of vectors using the norm.
Thus, we seek to minimize the square of the norm of the residual, which is the
scalar quantity
J(
x) = krk2 = kc A
xk2 .
(4.30)
8

If the system did have an exact solution, then the residual of the exact solution would be zero.

144

Vector and Matrix Calculus

This is the objective function. Because the norm involves the sum of squares, and
we are seeking a minimum, this is known as the least-squares solution.
In order to minimize the objective function (4.30), we will need to differentiate
. In preparation, let us expand the square of the
with respect to the solution x
norm of the residual as follows:
J(
x) = krk2
= kc A
xk2
T

= (c A
x) (c A
x)
h
i
T
= cT (A
x) (c A
x)
T AT ) (c A
= (cT x
x)
T AT c + x
T AT A
= cT c cT A
xx
x
T

T AT A
J(
x) = cT c cT A
x (cT A
x) + x
x.
Note that each of these four terms are scalars; thus, the transpose can be removed
on the third term. This yields
T AT A
J(
x) = cT c 2cT A
x+x
x.

(4.31)

and set equal


Because we seek a minimum, let us differentiate with respect to x
to zero as follows. The first term is a constant and vanishes, while the final
term is a quadratic with AT A being symmetric (see Section 2.2.3). Recall from
Section 4.2.4 that the derivative of the quadratic is


T AT A
x
x = 2AT A
x,

x
and we will show that


cT A
x = AT c.
(4.32)

x
yields
Therefore, differentiating equation (4.31) with respect to x

J(
x) = 2AT c + 2AT A
x.

x
Setting equal to zero gives
AT A
x = AT c.

(4.33)

Recall that A is m n, then A A is n n. If it is invertible, then the solution


that minimizes the square of the residual is
x
1 T
= AT A
A c.
(4.34)
x
Remarks:
1. Observe from equation (4.33) that although Ax = c does not have a solution,
premultiplying both sides by AT gives the system of equations AT A
x = AT c,
which does have a unique solution! Equation (4.32) is sometimes called a

4.3 Least-Squares Solutions of Algebraic Systems of Equations

145

normal system of equations because the coefficient matrix AT A is normal (see


Section 2.3.
2. Geometrically, we can interpret the least-squares solution as follows:
is the linear combination of the columns of A
The least-squares solution x
that is closest to c, that is, taking A
x gives the vector that is closest to c.
The vector A
x is the projection of c on the range of A.
1

3. The quantity (AT A) AT from equation (4.34) is called the pseudoinverse of


the matrix A and is denoted by A+ .
4. For large systems of equations, the least-squares solution is typically computed
using the QR method to be discussed in Section 6.3.2.
Before doing an example, let us show that equation (4.32) holds. Expanding
cT A
x gives the scalar expression


A11 A12 A1n
x1
A21 A22 A2n x2




cT A
x = c1 c2 cm ..
..
.. ..
..
.
.
.
. .
Am1 Am2 Amn xn
= (c1 A11 + c2 A21 + + cm Am1 ) x1
+ (c1 A12 + c2 A22 + + cm Am2 ) x2
..
.
+ (c1 A1n + c2 A2n + + cm Amn ) xn .
Differentiating with respect to each of the variables xi , i = 1, 2, . . . , n, we have


cT A
x = c1 A11 + c2 A21 + + cm Am1 ,
x1


cT A
x = c1 A12 + c2 A22 + + cm Am2 ,
x2

..
.


cT A
x = c1 A1n + c2 A2n + + cm Amn .
xn

A careful examination of this result will reveal that




cT A
x = AT c.

146

Vector and Matrix Calculus

Example 4.6 Determine the least-squares solution of the overdetermined system of linear algebraic equations Ax = c, where

1
2 0

1
1
, c = 0 .
A=
1
0 2
Solution: To evaluate equation (4.34), observe that


 2 0


2 1 0
5 1
T

1 1 =
A A=
,
0 1 2
1 5
0 2
which is symmetric as expected. Then the inverse is


1
1 5 1
T
A A
=
.
24 1 5
Also


 1
 
2 1 0
2
0 =
AT c =
.
0 1 2
2
1

Thus, the least-squares solution is



 
 
 
1 T
1 8
1 1
1 5 1
2
T

=
=
.
x= A A
A c=
24 1 5 2
24 8
3 1
We will encounter the versatile least-squares method in several contexts throughout the remainder of the text. For example, ....
4.3.2 Underdetermined Systems (r < n)
Can we use a similar technique to obtain a unique solution to systems that have
an infinite number of solutions? Such systems are called underdetermined and
have more unknowns than linearly-independent equations, that is, the rank is
less than the number of unknowns. We can use the least-squares method as in
the previous section, but with two key modifications.
Consider a system of equations Ax = c for which the rank of A is less than the
number of unknowns. Because this system has an infinite number of solutions, the
task now is to determine the unique solution that satisfies an additional criteria;
that is closest to the origin. Again, we use the norm
we seek the solution x = x

to quantify this distance. Thus, we seek to minimize the square of the norm of x
J(
x) = k
xk2 ,

(4.35)

which is now our objective function. Of all the solutions x, we seek the one x
that satisfies the original system of equations Ax = c and minimizes the objective
function. Using the Lagrange multiplier method, we form the augmented function
x) = k
J(
xk2 + T (c A
x) ,
(4.36)

4.3 Least-Squares Solutions of Algebraic Systems of Equations

147

where is the vector of Lagrange multipliers. Differentiating with respect to x


yields

J(
x) = 2
x AT .

x
Setting the derivative equal to zero, we can write
1
= AT .
x
(4.37)
2
Substituting equation (4.37) into the original system of equations c = A
x gives
1
c = AAT .
2
If AAT is invertible, then
= 2 AAT

1

c.

Substituting into equation (4.37) eliminates the Lagrange multiplier and provides
the least-squares solution
1
= AT AAT
c.
(4.38)
x
that satisfies Ax = c and minimizes J(
This is the x
x) = k
xk2 .

Part II
Numerical Methods

149

5
Introduction to Numerical Methods

So long as a man remains a gregarious and sociable being, he cannot cut himself off
from the gratification of the instinct of imparting what he is learning, of propagating
through others the ideas and impressions seething in his own brain, without stunting
and atrophying his moral nature and drying up the surest sources of his future
intellectual replenishment.
(James Joseph Sylvester)

The advent of the digital computer in the middle of the twentieth century
ushered in a true revolution in scientific and engineering research and practice.
Progressively more sophisticated algorithms and software paired with ever more
powerful hardware continues to provide for increasingly realistic simulations of
physical, chemical, and biological systems. This revolution depends upon the
remarkable efficiency of the computer algorithms that have been developed to
solve larger and larger systems of equations, thereby building on the fundamental
mathematics of vectors and matrices covered in Chapters 1 and 2.
In fact, one of the most common and important uses of matrix methods in
modern science and engineering contexts is in the development and execution
of numerical methods. Although numerical methods draw significantly from the
mathematics of linear algebra, they have not typically been considered subsets
of matrix methods or linear algebra. However, this association is quite natural
both in terms of applications and the methods themselves. Therefore, the reader
will benefit from learning these numerical methods in association with the mathematics of vectors, matrices, and differential operators.
The subsequent chapters of Part II provide a comprehensive treatment of numerical methods spanning the solution of systems of linear algebraic equations
and the eigenproblem to solution of ordinary and partial differential equations,
which is so central to scientific and engineering research and practice.

5.1 Historical Background


Approximation methods, e.g. Euler method
Determination of the value of , e.g. the method of exhaustion.
Logs and the slide rule
151

152

Introduction to Numerical Methods

Analog and digital computers improvements in algorithms and software layered on top of improvements in hardware (Moores law).
Technological milestones:
Hardware: vacuum tubes (1940s 1960s), transistors and integrated circuits
(1960s Present)
Software: FORTRAN (FORmula TRANslation) (developed 1954), Unix (1970)
Eras of calculations:

Paper and pencil


Math tables (1600s 1960s) - culminating in Abramowitz and Stegun
Slide rule (1930s 1970s)
Pocket calculator (1970s Present)
Punch cards (1950s 1970s)
Keyboard entry (1970s Present)
Programming languages:

Low-level languages, e.g. machine, assembly


High-level languages, e.g. Fortran, C, C++
Mathematical software, e.g. Matlab and Mathematica
Digital computers:

Manually programmed, e.g. ENIAC (1940s)


Group use, e.g. mainframes, servers, etc.
Individual use, e.g. workstations, personal computers, laptops, etc.
Supercomputers
5.2 Impact on Research and Practice

Traditionally, scientists and engineers have approached problems using analytical


methods for solving the governing differential equations and experimental modeling and prototyping of the system itself. Unfortunately, analytical solutions are
limited to simple problems in terms of the physics as well as the geometry. Therefore, experimental methods have typically formed the basis for engineering design
and research of complex problems in science and engineering. The arrival of the
digital computer in the middle of the last century and its continuous advancements over the ensuing years has given rise to an increasingly viable alternative
using computational methods.1 Hence, researchers and practitioners now have
three approaches to address problems in science and engineering: analytical, experimental, and computational.
To show how computational methods complement analytical and experimental
ones, let us consider some advantages (+) and disadvantages () of each approach
to engineering problems:
1

We will use the terms numerical and computational interchangeably.

5.2 Impact on Research and Practice

153

Analytical:
+
+
+

Provides exact solutions to the governing equations.


Gives physical insight, for example, relative importance of different effects.
Can consider hypothetical problems by neglecting friction, gravity, etc.
Exact solutions are only available for simple problems and geometries.
Of limited value in design.

Computational:
+ Address more complex problems, including physics and geometries.
+ Provides detailed solutions from which a good understanding of the physics
can be discerned.
+ Can easily try different configurations, for example, geometry, boundary conditions, etc., which is important in design.
+ Computers are becoming faster and cheaper; therefore, the range of applicability of computational mechanics continues to expand.
+ More cost effective and faster than experimental prototyping.
Requires accurate governing equations.
Boundary conditions are sometimes difficult to implement.
Difficult to do in certain parameter regimes, for example, highly nonlinear
physics.
Experimental:
+ Easier to get overall quantities for problem, for example, lift and drag on an
airfoil.
+ No modeling or assumptions necessary.
Often requires intrusive measurement probes.
Limited measurement accuracy.
Some quantities are difficult to measure, for example, the stress in the interior
of beam.
Experimental equipment often expensive.
Difficult and costly to test full-scale models.
As an example of how the rise of computational methods is altering engineering
practice, consider the typical design process. Figure 5.1 shows the traditional
approach used before computers became ubiquitous, where the heart of the design
process consists of iteration between the design and physical prototyping stages
until a satisfactory product or process is obtained. This approach can be both
time consuming and costly as it involves repeated prototype building and testing.
The modern approach to design is illustrated in Figure 5.2, where computational
modeling is inserted in two steps of the design process. Along with theoretical
modeling and experimental testing, computational modeling can be used in the
early stages of the design process in order to better understand the underlying
physics of the process or product before the initial design phase. Computational
prototyping can then be used to test various designs, thereby narrowing down
the possible designs before performing physical prototyping.

154

Introduction to Numerical Methods


Simple
Theoretical
Modeling

Experimental
Testing

Iterate

Design

Physical
Prototyping

Product

Figure 5.1 The traditional design approach.


Simple
Theoretical
Modeling

Computational
Modeling

Experimental
Testing

Iterate

Design

Iterate

Computational
Prototyping

Physical
Prototyping

Product

Figure 5.2 The modern design approach incorporating computational


modeling.

Using computational modeling, the modern design approach provides for more
knowledge and understanding going into the initial design before the first physical prototype is built. For example, whereas the traditional approach may have
required on the order of ten to fifteen wind tunnel tests to develop a wing design,
the modern approach incorporating computational modeling may only require on

5.2 Impact on Research and Practice

155

the order of two to four wind tunnel tests. As a bonus, computational modeling
and prototyping are generally faster and cheaper than experimental testing
and physical prototyping. This reduces time-to-market and design costs, as fewer
physical prototypes and design iterations are required, but at the same time, it
holds the potential to result in a better final product.

6
Computational Linear Algebra

In Chapters 1 and 2, we developed methods for determining exact solutions to


small matrix and eigenproblems. While these methods are essential knowledge
for everything that is to come, they are only appropriate or efficient for small
systems having two to five unknowns if solved by hand or tens of unknowns if
solved exactly using mathematical software, such as Matlab or Mathematica. In
scientific and engineering applications, however, we will frequently have occasion
to perform similar operations on systems involving hundreds, thousands, or even
millions of unknowns. Therefore, we must consider how these methods can be
extended (scaled) efficiently, and typically approximately, to such large systems.
This comprises the subject of computational, or numerical, linear algebra and
provides the methods and algorithms used by general purpose numerical software
codes and libraries.
When solving small systems of equations, eigenproblems, etc., we typically
emphasize ease of execution over algorithmic efficiency. That is, we prefer methods
that are straightforward to carry out even if they require more steps than an
alternative method that may be more complex. When solving large systems,
however, the opposite is true. The primary emphasis will be placed on efficiency
so as to be able to obtain solutions as quickly as possible via computers, even
at the expense of simplicity of the actual method. Because a computer will be
carrying out the steps, we would favor the more complex algorithm if it produces a
solution more quickly (we only have to program the method once the computer
does the work to solve each problem).
Moreover, this drive for efficiency will often suggest that we favor iterative,
rather than direct, methods. Although iterative methods only provide approximate solutions, they typically scale better to large systems than direct methods.
The methods considered in Chapters 1 and 2 are all direct methods that produce
an exact solution to the system of equations, eigenproblem, etc. in a finite number
of steps carried out in a predefined sequence. Think, for example, about Gaussian elimination. In contrast, iterative methods are step-by-step algorithms that
produce a sequence of approximations to the sought after solution, and when repeated many times converges toward the exact solution of the problem at least
we hope so.
Before discussing algorithms for large problems of interest, it is necessary to
first consider some of the issues that arise owing to the fact that the mathematics
will be carried out approximately by a digital computer. This includes operation
156

6.1 Approximation and Its Effects

157

counts, round-off error, and a brief mention of numerical stability. As suggested by


the above, the tone of our discussion in describing numerical methods often will
seem quite different than in previous chapters; however, keep in mind that we have
the same goals, that is, solving systems of algebraic equations, eigenproblems,
systems of first-order linear ordinary differential equations, etc. It is just that the
systems will be much larger and solved by a computer.

6.1 Approximation and Its Effects


6.1.1 Operation Counts
In order to compare various methods and algorithms, both direct and iterative,
against one another, we will often appeal to operation counts that show how
the methods scale with the size of the problem. Such operation counts allow us
to compare different methods for executing the same matrix operation as well as
determining which matrix method(s) is most computationally efficient for a given
application. For example, we have observed several uses of the matrix inverse in
Chapter 1; however, evaluating the matrix inverse is notoriously inefficient for
large matrices. As a result, it is actually quite rare for computational methods to
be based on the inverse; instead, we use alternatives whenever possible.
Operation counts provide an exact, or more often, order-of-magnitude, indication of the number of additions, subtractions, multiplications, and divisions
required by a particular algorithm or method. Take, for example, the inner product of two n-dimensional vectors
T

hu, vi = [u1 , u2 , , un ] [v1 , v2 , , vn ] = u1 v1 + u2 v2 + + un vn .


Evaluating the inner product requires n multiplications and n 1 additions;
therefore, the total operation count is 2n 1, which we say is 2n or O(n) for
large n.1 By extension, matrix multiplication of two n n matrices requires
n2 (2n 1) operations, that is, it requires n2 inner products, one for each element
in the resulting n n matrix. Consequently, the operation count is 2n3 or O(n3 )
for large n. Gaussian elimination of an n n matrix has an operation count
that is 32 n3 or O(n3 ). If, for example, we could come up with a method that
accomplishes the same thing as Gaussian elimination but is O(n2 ), then this
would represent a dramatic improvement in efficiency and scalability as n2 grows
much more slowly than n3 with increasing n. We say that O(n2 )  O(n3 ) for
large n, where  means much smaller (or less) than. For future reference, note
that nm  nm log n  nm+1 for large n as illustrated in Figure 6.1.
Operation counts for numerical algorithms can be used along with computer
hardware specifications to estimate how long it would take for a particular algorithm to be completed on a particular computers central processing unit (CPU).
1

O(A) is read is of the same order as A, or simply, order of A, and indicates that the operation
count, in this case, is directly proportional to A for large n.

158


Computational Linear Algebra

Figure 6.1 Comparison of n2 (solid line), n2 log n (dashed line), and n3


(dot-dashed line) with increasing n.

For numerical calculations, the important figure of merit is the FLOPS234 rating
of the CPU or its cores.

6.1.2 Ill-Conditioning and Round-Off Errors


The primary task of computational linear algebra involves solving a very large
system of algebraic equations of the form
Ax = c.

(6.1)

Unfortunately, the methods used to solve small systems, eigenproblems, etc. by


hand are almost never the most efficient or robust methods for solving them on
a computer. As mentioned in the previous section, for example, the number of
operations is O(n3 ) for Gaussian elimination. While it is convenient for hand
calculations involving small systems, therefore, it is not suitable for large-scale
calculations on a computer. Here, we discuss some issues unique to large-scale
calculations that are common to both direct and iterative methods.
There are two issues that arise when solving large systems of equations with
digital computers.5 First, real, that is floating point, numbers are not represented
2
3

FLOPS stands for FLoating point Operations Per Second.


A floating point operation is an addition, subtraction, multiplication, or division involving two
floating point numbers.
Real numbers as represented by computers are called floating point numbers because they contain a
fixed number of significant figures, and the decimal point floats. In contrast, a fixed point number
is an integer as the decimal point is fixed and does not float.
Nearly all computers today are digital computers. However, the fact that they are called digital
computers suggests that there are also analog computers. Before digital computers were developed
in the 1940s and 1950s, analog computers had become rather sophisticated. For example, an analog
computer, called a Norden bombsight, was used by U.S. bomber planes in WW II to calculate bomb
trajectories based on the planes current flight conditions, and it actually flew planes during the

6.1 Approximation and Its Effects

159

exactly on digital computers; in other words, all numbers, including rational numbers, are represented as decimals. This leads to round-off errors. More specifically,
the continuous number line, with an infinity of real numbers within any interval,
has smallest intervals of 251 = 4.4 1016 when using double precision6 (thus,
any number smaller than this is effectively zero). Every calculation, for example,
addition or multiplication, introduces a round-off error of roughly this magnitude.
Second, because of the large number of operations required to solve large systems
of algebraic equations, we must be concerned with whether these small round-off
errors can grow to pollute the final solution that is obtained; this is the subject
of numerical stability.
When solving small to moderate sized systems of equations by hand, the primary diagnostic required is the determinant, which tells us whether the matrix
A is singular and, thus, not invertible. This is the case if |A| = 0. When dealing
with round-off errors in solving large systems of equations, however, we also must
be concerned when the matrix is nearly singular, that is, when the determinant
is close to zero. How close? That is determined by the condition number. The
condition number (A) of the matrix A is defined by
(A) = kAk kA1 k,

(6.2)

where 1 (A) . As discussed in Section 1.7, there are various possible


norms that can be used, but the L2 norm is most common, in which case the
condition number becomes
2 (A) = kAk2 kA1 k2 =

||max
.
||min

(6.3)

A matrix with a small condition number is said to be well-conditioned, and a


matrix with a large condition number is called ill-conditioned. For example, a
singular matrix with ||min = 0 has (A) = .
What is considered to be a small or large condition number depends upon
the problem being considered. Using the condition number, we can estimate the
number of decimal places of accuracy that may be lost owing to round-off errors
in the calculation. The number of decimal places d that we expect to lose is
d = log10 (A).
Generally, we use double precision in numerical calculations, which includes 16
decimal places of accuracy, and it is reasonable to lose no more than five to six
decimal places of accuracy due to round-off errors (obviously, this depends upon
the application under consideration and what will be done with the result). This
is why we generally use double precision for numerical computations, because it
gives us more digits of accuracy to work with.

bombing process in order to more accurately target the bombs. These devices were considered so
advanced that, if a bomber was shot down or to land in enemy territory, the bombardier was
instructed to first destroy the bombsight before seeking to save himself.
By default, most computers use single precision for calculations; however, double precision is
standard for numerical calculations of the sort we are interested in throughout this text.

160

Computational Linear Algebra

In order to observe the influence of the condition number, consider how a


system of algebraic equations responds to the introduction of small perturbations
(errors) in the coefficient matrix A. A system is considered to be well-conditioned
if small perturbations to the problem only lead to small changes in the solution
vector x. When this is not the case, the system is said to be ill-conditioned. Let
us perturb the coefficient matrix A by the small amount A and evaluate the
resulting perturbation in the solution vector x.7 From the system of equations
(6.1), we have
(A + A) (x + x) = c.
Carrying out the multiplication produces
Ax + Ax + xA + Ax = c.
Because the first and last terms cancel from (6.1) and the Ax term is presumed
to be smaller than the others because it is the product of two small perturbations,
solving for x yields
x = A1 xA.
(6.4)
It is a property of vector and matrix norms that
kAxk kAk kxk;
therefore, applying to equation (6.4) requires
kxk kA1 k kxk kAk,
or
kxk
kA1 k kAk.
kxk
From the definition of the condition number (6.2), this becomes
kAk
kxk
(A)
.
kxk
kAk

(6.5)

Therefore, the magnitude of the resulting inaccuracy in the solution x is related


to the magnitude of the perturbation of the coefficient matrix A through the
condition number, where each is normalized by the norm of its parent vector or
matrix. The larger the condition number, the larger the magnitude of the error
in the solution as compared to that of the disturbance to the coefficient matrix.
One may wonder why we do not simply use the determinant of a matrix to
quantify how close to being singular it is. Recall that the determinant of a matrix
is the product of its eigenvalues, which depend upon the scale of the matrix. In
contrast, the condition number is scale invariant. For example, consider the n n
matrix A = aI. The determinant is |A| = an , which is small if |a| < 1, whereas
cond2 (A) = a/a = 1 for any a. As we develop various numerical methods for
7

Alternatively, one could perturb the right-hand side vector c by the small amount c and evaluate
the resulting perturbation in the solution vector x. The result (6.5) is the same but with A
replaced with c.

6.2 Systems of Linear Algebraic Equations

161

solving systems of algebraic equations, we must be cognizant of the influence


that each decision has such that it leads to well-conditioned matrices for the
equations we are interested in solving.
Although the condition number provides a definitive measure of the conditioning of a matrix, its determination requires calculation of eigenvalues of the
very large matrix in order to evaluate equation (6.3). As an alternative, it is often
possible to check if a matrix is diagonally dominant. Diagonal dominance simply
requires comparing the sums of the magnitudes of the off-diagonal elements with
that of the main diagonal element of each row of the matrix. The matrix A is
diagonally dominant if
N
X
|aii |
|aij |
(6.6)
j=1,j6=i

holds for each row of the matrix. If the greater than sign applies, then the matrix
is said to be strictly diagonally dominant. If the equal sign applies, it is weakly
diagonally dominant. It can be shown that diagonally dominant matrices are wellconditioned. More specifically, it can be proven that for any strictly diagonally
dominant matrix A:
1. A is nonsingular; therefore, it is invertible.
2. Gaussian elimination does not require row interchanges for conditioning.
3. Computations are stable with respect to round-off errors.
In the future, we will check our algorithms for diagonal dominance in order to
ensure that we have well-conditioned matrices.
To see the influence of condition number and diagonal dominance on round-off
errors, see the Mathematica notebook Ill-Conditioned Matrices and Round-Off
Error (incorporate into text).
The condition number and diagonal dominance addresses how mathematically
amenable a particular system is to providing an accurate solution. In addition, we
also must be concerned with whether the algorithm that is used to actually obtain
this mathematical solution on a computer is numerically amenable to producing
an accurate solution. This is determined by the algorithms numerical stability,
which will be discussed in Section 15.2. That is, conditioning is a mathematical
property of the algebraic system itself, and numerical stability is a property of the
numerical algorithm used to obtain its solution. See Trefethen and Bau (1997)
for much more on numerical stability of common operations and algorithms in
numerical linear algebra, including many of those treated in the remainder of this
chapter.

6.2 Systems of Linear Algebraic Equations


With a better understanding of how computer approximations may affect numerical calculations and an appreciation for operation counts, we are ready to consider
the primary task of computational linear algebra, which is solving large systems

162

Computational Linear Algebra

of linear algebraic equations. We begin with a series of direct methods that are
faster than Gaussian elimination. Two are decomposition methods similar to polar and singular-value decomposition discussed in Chapter 2. These methods are
appropriate for implementation in computer algorithms that determine the solution for large systems. We then discuss iterative, or relaxation, methods for
obtaining approximate solutions of systems of linear algebraic equations, including the Jacobi, Gauss-Seidel, successive over-relaxation, and conjugate gradient
methods. The essential issue of whether these iterative methods converge toward
the exact solution is also addressed.
Finally, we discuss methods for systems of equations having sparse coefficient
matrices. A sparse matrix is one in which relatively few of the elements of the
matrix are nonzero. This is often the case for systems of equations that arise
from numerical methods, for example. In some cases, these sparse matrices also
have a particular structure. For example, we will encounter tridiagonal, blocktridiagonal, and other types of banded matrices.

6.2.1 LU Decomposition
LU decomposition provides a general method for solving systems governed by any
nonsingular matrix A that has several advantages over Gaussian elimination. We
decompose (factor) A into the product of two matrices
A = LU,
where L is lower triangular, and U is upper triangular. For example, if A is 4 4,
then

1
0
0 0
U11 U12 U13 U14
L21 1
0 U22 U23 U24
0 0

.
L=
L31 L32 1 0 , U = 0
0 U33 U34
L41 L42 L43 1
0
0
0 U44
From matrix multiplication, observe that
A11 = U11 ,

A12 = U12 ,

A13 = U13 ,

A14 = U14 ,

which gives the first row of U. Continuing the matrix multiplication


A21 = L21 U11

L21 =

A21
,
U11

A22 = L21 U12 + U22 U22 = A22 L21 U12 ,


A23 = L21 U13 + U23 U23 = A23 L21 U13 ,
A24 = L21 U14 + U24 U24 = A24 L21 U14 .
Note that U11 , U12 , U13 , and U14 are known from the first row above; therefore,
these expressions give the second rows of L and U. This procedure is continued
to obtain all Lij and Uij .

6.2 Systems of Linear Algebraic Equations

163

Once the LU decomposition of A is completed, the system Ax = c may be


solved in the two-step procedure as follows. Given the LU decomposition, we have
LUx = c.
Let us introduce the vector y = Ux. Then
Ly = c,
or

1
L21

L31

..
.

0
1
L32

0
0
1

Ln1 Ln2 Ln3


0 y1
c1
y2 c2
0


0
y3 = c3 .
. .
..
. .. ..
1 yn
cn

This system of equations may be solved by straightforward forward substitution,


for example,
y1 = c1 , y2 = c2 L21 y1 , . . .
or more generally
yi = ci

i1
X

Lij yj ,

i = 2, 3, . . . , n,

j=1

where n is the size of A. Given y, we may now solve


Ux = y
for the desired solution x using back substitution as follows
!
n
X
yn
1
xn =
Uij xj , i = n 1, n 2, . . . , 1.
, xi =
yi
Unn
Uii
j=i+1
Remarks:
1. LU decomposition is a direct method as it requires a prescribed number of
known steps, but it is approximately twice as fast as Gaussian elimination
(n3 /3 operations compared to 2n3 /3 for large n). It works for any nonsingular
matrix A.
2. Both matrices L and U may be stored as a single matrix, for example,

U11 U12 U13 U14


L21 U22 U23 U24

L31 L32 U33 U34 ,


L41 L42 L43 U44
where we know that L has ones on the main diagonal. In fact, the procedure
is such that the elements of A can be replaced by the corresponding elements
of L and U as they are determined, thereby minimizing computer memory
requirements.

164

Computational Linear Algebra

3. The above approach is called Doolittles method and results in ones on the
main diagonal of L. Alternatively, Crouts method leads to ones on the main
diagonal of U.
4. This procedure is particularly efficient if A remains the same, while c is
changed (cf. the truss example in Section 1.3). In this case, the L and U
matrices only must be determined once.
5. Sometimes it is necessary to use pivoting, in which rows of A are exchanged,
in order to avoid division by zero (or small numbers). This is analogous to
exchanging rows (equations) in Gaussian elimination.
6. Recall that the determinant of a triangular matrix is the product of the elements along the main diagonal. Because the determinant of L is unity, the
determinant of A is simply
|A| = |U| = ni=1 Uii ,
which is the product of the main diagonal elements of the triangular matrix
U.
7. For more details on the implementation of LU decomposition, see Numerical
Recipes.
Given the LU decomposition of matrix A, the inverse A1 may be obtained as
follows. Recall that
AA1 = I.
Let us consider the 3 3 case for illustration, for which we have

A11 A12 A13 B11 B12 B13


1 0 0
A21 A22 A23 B21 B22 B23 = 0 1 0 ,
A31 A32 A33 B31 B32 B33
0 0 1
where Bij are the elements of A1 . We may then solve for the columns of B = A1
separately by solving the three systems


B11
1
B12
0
B13
0
A B21 = 0 , A B22 = 1 , A B23 = 0 .
B31
0
B32
0
B33
1
Because A does not change, these three systems of equations may be solved
efficiently using the LU decomposition of A.
6.2.2 Cholesky Decomposition
LU decomposition applies for any nonsingular matrix. If A is also positive definite,
whereby it is real symmetric (or Hermitian more generally) with all positive
eigenvalues, a special case of LU decomposition can be devised that is even more
efficient to compute. This is known as Cholesky decomposition (factorization)
and is given by
A = UT U,

6.2 Systems of Linear Algebraic Equations

165

or alternatively A = LLT . Whereas LU decomposition requires determination of


two triangular matrices, Cholesky decomposition only requires determination of
one; the other triangular matrix is simply its transpose. That is, L = UT .
Because U is upper triangular

U11 U12 U1n


0 U22 U2n

U = ..
..
.. .
..
.
.
.
.
0

Unn

Then A = UT U is

A11 A12 A1n


U11
0 0
U11 U12 U1n
A21 A22 A2n U12 U22


0 U22 U2n
=
.
..

..
.
.
.
.
.
.
.
..
..
..
.. ..
..
.. ..
..
..
.

.
.
.
.
An1 An2 Ann
U1n U2n Unn
0
0 Unn
Multiplying the matrices on the right-hand side, we obtain

2
A11 = U11
U11 = A11
A12 = U11 U12
..
.

U12 =

A1n = U11 U1n U1n =

A12
U11

A1n
U11

which is the first row of U. This procedure is continued to calculate all Uij .
After completion of the Cholesky decomposition of A, the system Ax = c may
be solved using forward and backward substitution as in LU decomposition (see
previous section). If the inverse of A is desired, observe that
T
A1 = U1 U1 ,
where
U1

B11 B12
0 B22

= B = ..
..
.
.
0
0

B1n
B2n

.. ,
..
.
.
Bnn

which is upper triangular. We find the Bij by evaluating


UB = I,
where both U and B are upper triangular, and U is already known.
Remark:
1. Cholesky decomposition can be thought of as taking the square root of the
matrix A.

166

Computational Linear Algebra

2. Once U is determined, the solution to Ax = c simply requires a back substitution as in LU decomposition.


3. When it applies, solving systems of equations using Cholesky decomposition is
another two times faster than using LU decomposition, which is approximately
twice as fast as Gaussian elimination. That is, for large n, the operation counts
are:
Gaussian elimination: 32 n3
LU decomposition: 13 n3
Cholesky decomposition: 16 n3
Because of symmetry, it is also only necessary to store one-half of the matrix
as compared to LU decomposition. In addition, Cholesky decomposition has
improved numerical stability properties as compared to alternative algorithms.
Because of these properties, it is the preferred algorithm for obtaining the
solution of symmetric, positive-definite matrices.
4. The Cholesky decomposition will fail if the matrix A is not positive definite.
Because of its efficiency, therefore, it is often the best technique for determining
if a matrix is positive definite (unless the eigenvalues are desired for some other
reason).
5. We will commonly encounter tridiagonal systems of equations when solving
differential equations using finite-difference methods. We will discuss their
properties and the Thomas algorithm for their solution in Section 11.4. This
algorithm only requires O(n) operations, which is as good a scaling as one
could hope for.

6.2.3 Partitioning
Partitioning allows for determination of the inverse of a large matrix by reducing
it down to finding the inverses of many small matrices. Let us partition matrix
A into four sub-matrices as follows:


A11 A12
,
A=
A21 A22
where A11 and A22 are square, but A12 and A21 need not be square (but the
number of rows of A12 must equal the number of rows of A11 , and its number of
columns must equal that of A22 , for example. Now let


B11 B12
,
B = A1 =
B21 B22
where each sub-matrix is the same size as its corresponding sub-matrix in A.
Then


I1 0
AB = I =
.
0 I2

6.2 Systems of Linear Algebraic Equations

167

Multiplying


 
A11 A12 B11 B12
I
= 1
A21 A22 B21 B22
0


0
,
I2

we obtain
A11 B11 + A12 B21
A21 B11 + A22 B21
A11 B12 + A12 B22
A21 B12 + A22 B22

= I1
=0
.
=0
= I2

(6.7)

From the second equation


B21 = A1
22 A21 B11 .

(6.8)

Substituting into the first of equations (6.7) gives



A11 A12 A1
22 A21 B11 = I1 ;
therefore,
B11 = A11 A12 A1
22 A21

1

In this manner, we obtain the partitions of A1 , for example, B11 , by inverting portions of A then substituting into equation (6.8) to obtain B21 . Repeating
this procedure with the last two equations of (6.7) provides B12 and B22 . Because smaller is better when inverting matrices, the partitioning procedure can
be implemented recursively.

6.2.4 Iterative Convergence


When we perform the direct methods covered in Chapter 1 for solving systems
of linear algebraic equations by hand, we obtain the exact solution if we do so
without making any approximations.8 When these direct methods, as well as those
in this chapter, are performed using floating-point arithmetic on a computer,
however, one encounters round-off errors as discussed in Section 4.1.2.
In addition to round-off errors, iterative methods are also subject to iterative
convergence errors that result from the iterative process itself. In so far as iterative
methods produce a sequence of successive approximations to the solution of the
original system of equations, one now also must be concerned with whether this
sequence of iterative approximations will converge to the true exact solution.
Here, we seek to determine a criterion for determining if an iterative algorithm
will converge toward the exact solution or not.
8

Note that Mathematica and Matlab are also capable of obtaining exact solutions using symbolic
arithmetic, which is the computer analogy to hand calculations in that no round-off errors are
incurred.

168

Computational Linear Algebra

Let us say we have a system of linear algebraic equations of the form


A11 x1 + A12 x2 + A13 x3 + + A1n xn = c01
A21 x1 + A22 x2 + A23 x3 + + A2n xn = c02
,
..
.
An1 x1 + An2 x2 + An3 x3 + + Ann xn = c0n
or Ax = c0 in matrix form. A simple iterative technique, known as the Jacobi
method, can be obtained by solving the first equation for x1 , the second equation
for x2 , and so forth, as follows:
x1

x2
..
.

12
A
x
A11 2
21
x
A
A22 1

13
A
x
A11 3

1n
A
x
A11 n

1
+ Ac11

23
A
x
A22 3

2n
A
x
A22 n

2
+ Ac22

An1
An2
An3
xn = A
x1 A
x2 A
x3
nn
nn
nn

n
+ Acnn

The xi , i = 1, . . . , n, on the right-hand sides are taken from the previous iteration
in order to update the solution vector on the left-hand side. In matrix form, this
and other iterative numerical schemes may be expressed as
x(r+1) = Mx(r) + c,

(6.9)

where (r) is the iteration number; x(r) , r = 0, 1, 2, . . ., is the sequence of iterative


approximations to the exact solution x; x(0) is the initial guess; and M is the nn
iteration matrix (determined by the iterative method). Note that the iteration
number (r) is placed in parentheses so as not to be confused with powers of the
vectors or matrix.
We want this iterative procedure to converge to the exact solution x of
x = Mx + c,

(6.10)

as r for any initial guess x(0) . Consequently, exact iterative convergence,


which would take an infinity of iterations, occurs when the solution x input on the
right-hand side is returned precisely upon execution of an iteration. In practice,
we terminate the iterative process when the approximation returned has changed
very little by an iteration.
In order to determine a convergence criteria, let 1 , . . . , n be the eigenvalues
of the iteration matrix M and u1 , . . . , un the corresponding eigenvectors, which
we assume are linearly independent. We define the error at the rth iteration by
e(r) = x x(r) ,
where x is the exact solution. Substituting into equation (6.9) gives


x e(r+1) = M x e(r) + c,
or
e(r+1) = x M x + M e(r) c,

6.2 Systems of Linear Algebraic Equations

169

but from equation (6.10)


e(r+1) = Me(r) ,

r = 0, 1, 2, . . . ,

which provides an alternative way to write equation (6.9) for convenience. It


follows that


(6.11)
e(r) = Me(r1) = M M e(r2) = . . . = Mr e(0) .
We may write the error of the initial guess e(0) as a linear combination of the
linearly-independent eigenvectors as follows
e(0) = 1 u1 + 2 u2 + + n un ,
where i 6= 0. Substituting for e(0) in equation (6.11) gives
e(r) = Mr (1 u1 + 2 u2 + + n un ) ,
or
e(r) = 1 Mr u1 + 2 Mr u2 + + n Mr un .

(6.12)

But Mui = i ui , and multiplying by M gives


M (Mui ) = i Mui = 2i ui .
Generalizing to any integer power of M, we have the general result
Mr ui = ri ui ,
where now r does represent the rth power. Consequently, if i are the eigenvalues
of M, the eigenvalues of Mr are ri , and the eigenvectors of M and Mr are the
same. Substituting into equation (6.12) leads to
e(r) = 1 r1 u1 + 2 r2 u2 + + n rn un .
For convergence |e(r) | must go to zero as r , but |i ui | =
6 0, in which case |ri |
must go to zero as r . Thus, we have the general result that for convergence
of the iterative numerical method
= |i |max < 1,
where is the spectral radius of the iteration matrix M. Therefore, iterative methods must be devised such that the magnitude of the eigenvalues of the iteration
matrix are all less than one.
A more general iterative scheme may be obtained by decomposing the coefficient matrix A as follows
A = M1 M2 .
Then we construct an iterative technique of the form
M1 x(r+1) = M2 x(r) + c.

170

Computational Linear Algebra

To be efficient numerically, M1 should be easily invertible. If we write the above


equation as
(r)
x(r+1) = M1
+ M1
1 M2 x
1 c,
we can follow the same analysis as above, with M = M1
1 M2 , leading to the
necessary and sufficient condition that
= |i |max < 1.
Remarks:
1. The i are the eigenvalues of the iterative matrix M = M1
1 M2 , and is its
spectral radius.
2. While the spectral radius provides a necessary and sufficient condition, strict
diagonal dominance provides a sufficient condition. Because it is not a necessary condition, an iterative method may converge even if it is not strictly
diagonally dominant; however, we cannot guarantee this to be the case unless
we check the spectral radius.
3. In addition to determining whether an iterative method will converge, the
spectral radius also determines the rate at which they converge, in which case
smaller is better.
4. For the Jacobi method, M1 is a diagonal matrix, and for the commonly used
Gauss-Seidel method, M1 is a triangular matrix (see Section 6.2.6).
5. This result applies for all iterative methods for which the eigenvectors of the
iteration matrix M are linearly independent.
6.2.5 Jacobi Method
Recall from the previous section that in the Jacobi method, we solve the difference
equation as an explicit equation for each of xi , i = 1, . . . , n. As indicated in
equation (6.9), the values of xi on the right-hand side are then taken from the
previous iteration, denoted by superscript (r), while the single value on the lefthand side is taken at the current iteration, denoted by superscript (r + 1).
Although we do not actually implement iterative methods in the following form,
it is instructive to view iterative methods in the matrix form Ax = c in order
to analyze its convergence properties. As in the previous section, we write the
coefficient matrix A = M1 M2 , such that an iterative scheme may be devised
by writing Ax = c in the form
M1 x(r+1) = M2 x(r) + c.
Multiplying by M1
1 gives
(r)
x(r+1) = M1
+ M1
1 M2 x
1 c,

or
x(r+1) = Mx(r) + M1
1 c,
where M = M1
1 M2 is the iteration matrix.

6.2 Systems of Linear Algebraic Equations

171

Let
D = diagonal elements of A,
L = lower triangular elements of A less main diagonal,
U = upper triangular elements of A less main diagonal.
Therefore, Ax = c becomes
(D L U)x = c.
Using this notation, the Jacobi iteration (M1 = D, M2 = L + U) is of the form
Dx(r+1) = (L + U)x(r) + c,
or
x(r+1) = D1 (L + U)x(r) + D1 c,
such that the iteration matrix is
1
M = M1
(L + U).
1 M2 = D

We then check to be sure that the spectral radius of M satisfies the requirement
that < 1 to ensure convergence of the iterative scheme according to the general
result in the previous section.
In addition to checking for iterative convergence, a smaller spectral radius
results in more rapid convergence, that is, in fewer iterations. For the Jacobi
method, the spectral radius of the iteration matrix M is



.
Jac (n) = cos
n+1
If n is large, then from the Taylor series for cosine



2

Jac (n) = cos


=1
+ ;
n+1
2 n+1

(6.13)

therefore, as n , Jac (n) 1. As a result, we observe slower convergence as n


is increased. In other words, there is a disproportionate increase in computational
time as n is increased. This is why the Jacobi method is not used in practice.
6.2.6 Guass-Seidel Method
In addition to the slow iterative convergence properties of the Jacobi method, it
also requires one to store both the current and previous iterations of the solution
fully. The Gauss-Seidel method addresses both issues by using the most recently
updated information during the iterative process. In this way, it is no longer
necessary to store xri at the previous iteration as with Jacobi iteration. In fact, the
values of xi are all stored in the same array, and it is not necessary to distinguish
between the (r)th or (r + 1)st iterates. We simply use the most recently updated

172

Computational Linear Algebra


(r+1)

(r)

information as each new value of xi


overwrites its corresponding value xi
from the previous iteration.
In matrix form, the Gauss-Seidel method consists of M1 = DL and M2 = U,
in which case
(D L)x(r+1) = Ux(r) + c,
or
x(r+1) = (D L)1 Ux(r) + (D L)1 c.
Therefore, the iteration matrix is
1
M = M1
U.
1 M2 = (D L)

It can be shown that


GS (n) = 2Jac (n);
therefore, from equation (6.13), the spectral radius for large n is
#2
 "

2

2

1

2
2
= 1
+ = 1
+ .
GS (n) = Jac (n) = cos
n+1
2 n+1
n+1
(6.14)
Consequently, the rate of convergence is twice as fast as for the Jacobi method
for large n, that is, the Gauss-Seidel method requires only one-half the iterations
for the same level of accuracy.
Remark:
1. It can be shown that strong diagonal dominance of A is a sufficient, but not
necessary, condition for convergence of the Jacobi and Gauss-Seidel iteration
methods. That is, the spectral radius is such that < 1 for the iteration matrix
M = M1
1 M2 (see Morton and Mayers, p. 205 for proof).

6.2.7 Successive Over-Relaxation (SOR)


In Gauss-Seidel iteration, the sign of the error typically does not change from
iteration to iteration. That is, the approximate iterative solution tends to approach the exact solution from the positive or negative side exclusively. In addition, it normally does so relatively slowly. Iterative convergence can often be
accelerated by over-relaxing, or magnifying, the change at each iteration. This
over-relaxation is accomplished by taking a weighted average (linear combination) of the previous iterate xri and the Gauss-Seidel iterate xr+1
to form the new
i
approximation at each iteration.
If we denote the Gauss-Seidel iterate (14.20) by xi , the new SOR iterate is
given by
(r+1)

xi

(r)

= (1 )xi + xi ,

(6.15)

6.2 Systems of Linear Algebraic Equations

173

where is the acceleration, or relaxation, parameter and 1 < < 2 for convergence (Morton & Mayers, p. 206). Note that = 1 corresponds to the GaussSeidel method.
In matrix form, the SOR method is given by M1 = D L, M2 = (1 )D +
U. Then
(D L)x(r+1) = [(1 )D + U] x(r) + c,
or
x(r+1) = (D L)1 [(1 )D + U] x(r) + (D L)1 c.
Therefore, the iteration matrix is
1
M = M1
[(1 )D + U] .
1 M2 = (D L)

It can be shown that the optimal value of that minimizes the spectral radius,
and consequently the number of iterations, is (see, for example, Morton and
Mayers, p. 212, and Moin, p. 146)
opt =

1+

2
,
1 2Jac

(6.16)

and for this opt , the spectral radius for SOR is


Jac
p
1 + 1 2Jac

SOR =

Recall that Jac = 1

1
2

n+1

opt =

2

(6.17)

+ for large n; thus,

2
p
1 + 1 2Jac

1+
=


1 1

1
2

n+1

2

2

2



2

1 1 n+1 +

1+
opt

!2

2
.
1 + n+1

(6.18)

174

Computational Linear Algebra

Figure 6.2 Spectral radius versus acceleration parameter for SOR.

Then for large n, the spectral radius for SOR is


2 2
1

2 n+1

SOR

1 + n+1
("
)2

2 # 
1

1
1
+
2 n+1
n+1

2

1
+
n+1
2
SOR 1
.
n+1

(6.19)

Thus, as n , opt 2, and SOR 1. However, from a comparison of


equations (6.19) and (14.21), we see that
SOR < GS ,
such that SOR converges at a rate 2(n+1)
times faster than Gauss-Seidel if the

optimal value of the relaxation parameter is used. Therefore, the convergence


rate improves linearly with increasing n relative to Gauss-Seidel. This analysis
assumes that we know opt . Typically, we do not, and the rate of convergence
depends significantly on the choice of ; for example, the typical behavior for
linear problems with given n is given in Figure 6.2.
Remarks:
1. SOR can be thought of as choosing the relaxation parameter that most
decreases the spectral radius of the iteration matrix.
2. For a given problem, opt often must be estimated from a similar problem
and/or trial and error.
3. For direct methods, we only need to be concerned with the influence of roundoff errors as determined by the condition number or diagonal dominance. For
iterative methods, we are also concerned with iterative convergence rate as
determined by the spectral radius. For both, we are concerned with the rate
of convergence as determined by the truncation error (see Section 11.1.2).

6.3 Numerical Solution of the Eigenproblem

175

4. We will discuss iterative methods, including Jacobi, Gauss-Seidel, successive


over-relation (SOR), and alternating-direction-implicit (ADI) methods, in more
detail in the context of obtaining numerical solutions of elliptic partial differential equations in Chapter 14.

6.2.8 Conjugate-Gradient Method


For systems of equations with sparse coefficient matrices that are symmetric and
positive definite.

6.3 Numerical Solution of the Eigenproblem


Along with solving systems of linear algebraic equations, the other workhorse
of computational linear algebra is algorithms for obtaining the eigenvalues and
eigenvectors of large matrices. For large matrices, the procedure outlined in Section 2.1 is inefficient or too memory intensive for implementation on a computer.
For example, it is not practical or efficient to try to factor the high-order characteristic polynomials numerically. Therefore, we require numerical solution techniques. The standard method for numerically approximating the eigenvalues and
eigenvectors of a matrix is based on QR decomposition, which entails performing
a series of similarity transformations that leave the eigenvalues unchanged.9

6.3.1 Similarity Transformation


Recall that the diagonalization procedure outlined in Section ?? relies on the fact
that the diagonal matrix is similar to the matrix from which it was obtained. Two
matrices are similar if they have the same eigenvalues. Similarity transformations
form the basis for the QR method of obtaining eigenvalues and eigenvectors.
Consider the eigenproblem
Ax = x,

(6.20)

where A is a real, square matrix. Suppose that Q is an orthogonal matrix such


that Q1 = QT . Let us consider the transformation
B = QT AQ.

(6.21)

Postmultiplying both sides by QT x leads to


BQT x = QT AQQT x,
= QT Ax,
= QT x,
BQT x = QT x.
9

This is the approach used by the built-in Mathematica and Matlab functions Eigenvalues[]/
Eigenvectors[] and eig(), respectively.

176

Computational Linear Algebra

Defining y = QT x, this can be written as


By = y,

(6.22)

which is an eigenproblem for the matrix B defined by the transformation (6.21).


Note that the eigenproblems (6.20) and (6.22) have the same eigenvalues ;
therefore, we call equation (6.21) a similarity transformation because A and
B = QT AQ have the same eigenvalues. This is the case because Q is orthogonal.
The eigenvectors x of A and y of B are related by
y = QT x

(x = Qy) .

(6.23)

If in addition to being real and square, A is symmetric such that A = AT , observe


that
BT = [QT AQ]T = [Q]T [A]T [QT ]T = QT AQ = B.
Therefore, if A is symmetric, then B is symmetric as well when Q is orthogonal. In
summary, for A real and symmetric, the similarity transformation (6.21) preserves
the eigenvalues and symmetry of A, and the eigenvectors are related by (6.23).
6.3.2 QR Method to Obtain Eigenvalues and Eigenvectors
Having confirmed the properties of similarity transformations, we now turn our
attention to the iterative QR method for obtaining the eigenvalues and eigenvectors of a matrix A, which requires such transformations.
Basic Approach
Consider A real and symmetric, for which A = A^T. A QR decomposition exists
such that

    A = QR,

where Q is an orthogonal matrix, and R is an upper (right) triangular matrix.¹⁰
Letting

    A_0 = A,   Q_0 = Q,   R_0 = R,

the QR decomposition of the given matrix is

    A_0 = Q_0 R_0.                                              (6.24)

Let us form the product (note the order)

    A_1 = R_0 Q_0.                                              (6.25)

Because Q_0 is orthogonal, premultiplying equation (6.24) by Q_0^{-1} = Q_0^T gives

    Q_0^T A_0 = Q_0^T Q_0 R_0 = R_0.

¹⁰ Recall that in LU decomposition, the L refers to a lower triangular matrix, and U refers to an
upper triangular matrix, whereas here it is customary to use L for left and R for right
triangular matrices. More to the point, a left triangular matrix is lower triangular, and a
right triangular matrix is upper triangular.


Therefore, substituting for R_0 in equation (6.25), we may determine A_1 from

    A_1 = Q_0^T A_0 Q_0,                                        (6.26)

which is a similarity transformation. That is, forming R_0 Q_0 is equivalent to the
similarity transformation (6.26), and A_1 has the same eigenvalues as A_0 = A.
Thus, generalizing equation (6.25), we have

    A_{k+1} = R_k Q_k,   k = 0, 1, 2, . . . ,                   (6.27)

where all of A_1, A_2, . . . , A_k, . . . are similar to A_0 = A. Not only do they all have
the same eigenvalues, the similarity transformations maintain the same structure
as A, for example tridiagonal.
It can be shown (not easily) that the sequence of similar matrices A_0, A_1, A_2, . . .
converges to a diagonal or upper triangular matrix if A is symmetric or
non-symmetric, respectively. In either case, the eigenvalues of A (and of A_1, A_2, . . .)
appear on the main diagonal, ordered by absolute magnitude.
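A minimal sketch of the unshifted QR iteration of equation (6.27) follows, using numpy.linalg.qr for each factorization. The test matrix, iteration limit, and tolerance are illustrative choices, not values from the text.

    import numpy as np

    def qr_eigenvalues(A, max_iter=500, tol=1e-12):
        Ak = A.astype(float).copy()
        for _ in range(max_iter):
            Q, R = np.linalg.qr(Ak)    # A_k = Q_k R_k
            Ak = R @ Q                 # A_{k+1} = R_k Q_k, similar to A_k
            if np.linalg.norm(np.tril(Ak, -1)) < tol:   # subdiagonal ~ 0
                break
        return np.diag(Ak)             # approximate eigenvalues on the diagonal

    A = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])    # symmetric tridiagonal test matrix
    print(np.sort(qr_eigenvalues(A)))
    print(np.sort(np.linalg.eigvalsh(A)))   # comparison with a library routine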
But how do we determine the Q and R matrices for each iteration? This
requires plane rotations.
Plane Rotations
Consider the n × n transformation matrix P comprised of the identity matrix
with only four elements changed in the pth and qth rows and columns according to

    P_pp = P_qq = c,   P_pq = s,   P_qp = −s,

where c = cos θ and s = sin θ. That is,

    P = \begin{bmatrix}
          1 &        &        &        &        &        &   \\
            & \ddots &        &        &        &        &   \\
            &        &   c    & \cdots &   s    &        &   \\
            &        & \vdots & \ddots & \vdots &        &   \\
            &        &  -s    & \cdots &   c    &        &   \\
            &        &        &        &        & \ddots &   \\
            &        &        &        &        &        & 1
        \end{bmatrix},

with the entries c and ±s located in rows and columns p and q, and ones elsewhere
on the main diagonal.
Observe the effect of transforming an n-dimensional vector x according to the
transformation

    y = Px,                                                     (6.28)

where x^T = [x_1  x_2  · · ·  x_p  · · ·  x_q  · · ·  x_n]. Then

    y = Px = [x_1  x_2  · · ·  y_p  · · ·  y_q  · · ·  x_n]^T,

where the only two elements that are altered are

    y_p = c x_p + s x_q,                                        (6.29)
    y_q = −s x_p + c x_q.                                       (6.30)

For example, consider the case for n = 2, that is, p = 1, q = 2:

    y_1 = c x_1 + s x_2,
    y_2 = −s x_1 + c x_2,

or

    \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} =
    \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}
    \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.

This transformation rotates the vector x through an angle −θ (that is, clockwise
through θ) to obtain y; equivalently, it may be interpreted as rotating the
coordinate axes through an angle θ. Note that y = P^T x rotates the vector x
through an angle +θ.
Thus, in the general n-dimensional case (6.28), P rotates the vector x through
an angle θ in the x_p x_q-plane. The angle θ may be chosen with one of several
objectives in mind. For example, it could be used to zero all elements below (or
to the right of) a specified element, for example y^T = [y_1  y_2  · · ·  y_j  0  · · ·  0]. This
could be accomplished using the Householder transformation (reflection), which
is efficient for dense matrices. Alternatively, the transformation could be used
to zero a single element, for example y_p or y_q [see equations (6.29) and (6.30)].
This could be accomplished using the Givens transformation (rotation), which is
efficient for sparse, structured (for example banded) matrices.
Remarks:
1. The transformation matrix P is orthogonal, such that P^T = P^{-1}.
2. We can generalize to rotate a set of vectors, in the form of a matrix, by taking
   Y = PX.


We can imagine a series of Givens or Householder transformations that reduce
the matrix A to a matrix that is upper triangular, which is the R matrix in
a QR decomposition. Thus, if m such transformations are required to produce an
upper triangular matrix, R is given by

    R = P_m · · · P_2 P_1 A.                                    (6.31)

Because A = QR, that is, R = Q^T A, the orthogonal matrix Q is then obtained from

    Q^T = P_m · · · P_2 P_1.

Taking the transpose leads to

    Q = P_1^T P_2^T · · · P_m^T.                                (6.32)

In this manner, the QR decomposition (6.31) and (6.32) is obtained from a series
of plane (Givens or Householder) rotations. Givens transformations are most efficient
for large, sparse, structured matrices, as they can be configured to zero only
the elements that are not already zero. There is a fast Givens transformation,
for which the P matrices are not orthogonal, but the QR decomposition can be
obtained two times faster than with the standard Givens transformation illustrated
here. Convergence of the iterative QR method may be accelerated using shifting
(see, for example, Numerical Recipes, Section 11.3). A sketch of a Givens-based
QR factorization is given below.
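The following sketch builds Q and R by successive Givens rotations, following equations (6.31) and (6.32). The plane-rotation convention matches equations (6.29) and (6.30); the test matrix is an illustrative choice, not one from the text.

    import numpy as np

    def givens_qr(A):
        m, n = A.shape
        R = A.astype(float).copy()
        Qt = np.eye(m)                   # accumulates P_m ... P_2 P_1 = Q^T
        for j in range(n):               # zero column j below the diagonal
            for i in range(j + 1, m):
                r = np.hypot(R[j, j], R[i, j])
                if r == 0.0:
                    continue
                c, s = R[j, j] / r, R[i, j] / r
                P = np.eye(m)            # plane rotation in the (j, i) plane
                P[j, j] = P[i, i] = c
                P[j, i] = s
                P[i, j] = -s
                R = P @ R                # one rotation of equation (6.31)
                Qt = P @ Qt
        return Qt.T, R                   # Q from equation (6.32)

    A = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    Q, R = givens_qr(A)
    print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))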
The operation count for the QR method per iteration is as follows: O(n^3)
for a dense matrix, O(n^2) for a Hessenberg matrix, and O(n) for a tridiagonal
matrix. Thus, the most efficient procedure is as follows:
1. Transform A to a similar tridiagonal or Hessenberg form if A is symmetric or
   non-symmetric, respectively. This is done using a series of similarity transformations
   based on Householder reflections for dense matrices or Givens rotations for
   sparse matrices.
2. Use the iterative QR method to obtain the eigenvalues of the tridiagonal or
   Hessenberg matrix.
See the Mathematica notebook QRmethod.nb for an illustration of how QR
decomposition is used in an iterative algorithm to obtain the eigenvalues of a
matrix.
See Trefethen and Bau (1997) for the QR algorithm with shifting, which accelerates the convergence of the iterative method.
The iterative QR method is the workhorse of the vast majority of eigenproblem
solvers. Although we are interested in square matrices here, the QR decomposition
exists for any rectangular matrix as well. For A being m × n, Q is m × m, and R is
m × n. Carried to completion, the QR method provides approximations for the
full spectrum consisting of all eigenvalues and eigenvectors of a matrix.
Although primarily used to obtain eigenvalues and eigenvectors, the QR decomposition
can also be used to determine the solution of the system of equations
Ax = c. Given the QR decomposition of A, the system of equations becomes

    QRx = c,

or multiplying both sides by Q^{-1} = Q^T yields the system

    Rx = c̃,

where c̃ = Q^T c. Because R is upper triangular, this system of equations can be
solved via back substitution. This approach is rarely used, however, as it takes
twice as long as LU decomposition.

6.3.3 Arnoldi Method


Particularly in stability problems, where we only seek the least stable mode, it is
only necessary to obtain a small number of eigenvalues (and possibly eigenvectors)
of a large sparse matrix, that is, it is not necessary to obtain the full spectrum.
This is done efficiently using the Arnoldi method.
Suppose we seek the largest k eigenvalues (by magnitude) of the large sparse
n × n matrix A, where k ≪ n. Given an arbitrary n-dimensional vector q_0, we
define the Krylov subspace by

    K_k(A, q_0) = span{q_0, Aq_0, A^2 q_0, . . . , A^{k−1} q_0},

which has dimension k and is a subspace of R^n. The Arnoldi method is based on
constructing an orthonormal basis of the Krylov subspace K_k, for example using
Gram-Schmidt orthogonalization, that can be used to project a general n × n
matrix A onto the k-dimensional Krylov subspace K_k(A, q_0).
We form the orthonormal projection matrix Q using the following step-by-step
(non-iterative) method that produces a Hessenberg matrix H whose eigenvalues
approximate the largest k eigenvalues of A:
1. Specify the starting Arnoldi vector q_0.
2. Normalize: q_1 = q_0/‖q_0‖.
3. Set Q = q_1.
4. Do for i = 2, . . . , k:
   i)   Multiply q_i = A q_{i−1}.
   ii)  Orthonormalize q_i against q_1, q_2, . . . , q_{i−1}.
   iii) Append q_i to Q.
   iv)  Form the Hessenberg matrix H = Q^T A Q.
   v)   Determine the eigenvalues of H.
5. End Do
At each step i = 2, . . . , k, an n × i orthonormal matrix Q is produced that forms
an orthonormal basis for the Krylov subspace K_i(A, q_0). Using the projection
matrix Q, we transform A to produce an i × i Hessenberg matrix H (or tridiagonal
matrix for symmetric A), which is an orthogonal projection of A onto the Krylov
subspace K_i. The eigenvalues of H, sometimes called the Ritz eigenvalues,
approximate the largest i eigenvalues of A. The approximations of the eigenvalues
improve as each step is incorporated, and each step adds the approximation of one
additional eigenvalue.
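A minimal sketch of the Arnoldi procedure above follows, using modified Gram-Schmidt to orthonormalize each Krylov vector; for brevity the Hessenberg projection and Ritz eigenvalues are formed once after k steps rather than at every step. The matrix, starting vector, and subspace size are illustrative choices.

    import numpy as np

    def arnoldi_ritz(A, q0, k):
        n = len(q0)
        Q = np.zeros((n, k))
        Q[:, 0] = q0 / np.linalg.norm(q0)      # step 2: normalize q_0
        for i in range(1, k):
            v = A @ Q[:, i - 1]                # step 4(i): multiply by A
            for j in range(i):                 # step 4(ii): orthogonalize
                v -= (Q[:, j] @ v) * Q[:, j]
            Q[:, i] = v / np.linalg.norm(v)    # step 4(iii): append to Q
        H = Q.T @ A @ Q                        # step 4(iv): Hessenberg projection
        return np.linalg.eigvals(H)            # step 4(v): Ritz eigenvalues

    rng = np.random.default_rng(0)
    n = 200
    A = np.diag(np.arange(1.0, n + 1))         # known eigenvalues 1, 2, ..., n
    A += 1e-3 * rng.standard_normal((n, n))    # small perturbation
    ritz = arnoldi_ritz(A, rng.standard_normal(n), k=20)
    print(np.sort(ritz.real)[-3:])             # approximations of the largest eigenvalues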
Remarks:
1. Because k ≪ n, we only require the determination of eigenvalues of Hessenberg
   matrices that are no larger than k × k, as opposed to the original n × n matrix A.
2. Although the outcome of each step depends upon the starting Arnoldi vector
   q_0 used, the procedure converges to the correct eigenvalues of matrix A for any q_0.
3. The more sparse the matrix A is, the smaller k can be while still obtaining a good
   approximation of the largest k eigenvalues of A.
4. When applied to symmetric matrices, the Arnoldi method reduces to the Lanczos
   method.
5. A shift and invert approach can be incorporated to determine the k eigenvalues
   close to a specified part of the spectrum rather than those with the largest
   magnitude. For example, it can be designed to determine the k eigenvalues
   with the largest real or imaginary part.
6. When seeking a set of eigenvalues in a particular portion of the full spectrum,
   it is desirable that the starting Arnoldi vector q_0 be in (or nearly in) the
   subspace spanned by the eigenvectors corresponding to the sought-after eigenvalues.
   As the Arnoldi method progresses, we get better approximations of the
   desired eigenvectors that can then be used to form a more desirable starting
   vector. This is known as the implicitly restarted Arnoldi method and is based
   on the implicitly-shifted QR decomposition method. Restarting also reduces
   storage requirements by keeping k small.
7. The Arnoldi method may also be adapted to solve linear systems of equations;
   this is called the generalized minimal residual (GMRES) method.
8. The Arnoldi method can be designed to apply to the generalized eigenproblem

       Ax = λBx,

   where it is required that B be positive definite, that is, have all positive
   eigenvalues. The generalized eigenproblem is encountered in structural design
   problems in which A is called the stiffness matrix, and B is called the mass
   matrix. It also arises in hydrodynamic stability.
9. Many have standardized on the Arnoldi method as implemented in ARPACK
   (http://www.caam.rice.edu/software/ARPACK/). ARPACK was developed
   at Rice University in the mid 1990s, first as a Fortran 77 library of subroutines,
   and subsequently it has been implemented as ARPACK++ for C++.
   It has been implemented in Matlab via the eigs() function, where the s denotes
   sparse. In addition, it has been implemented in Mathematica, where one includes
   the option Method -> "Arnoldi" in the Eigenvalues[] function.
The Arnoldi method is illustrated in more detail in the Mathematica notebook
Arnoldi.nb.


References:
Arnoldi, W. (1951) Q. Appl. Math. 9, 17. (Did not originally apply to the eigenproblem!)
Nayar, N. & Ortega, J. M. (1993) "Computation of Selected Eigenvalues of Generalized
Eigenvalue Problems," J. Comput. Phys. 108, pp. 8–14.
Saad, Y. (2003) Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia.
Radke, R. (1996) A Matlab Implementation of the Implicitly Restarted Arnoldi Method
for Solving Large-Scale Eigenvalue Problems, MS Thesis, Rice University.
6.4 Singular-Value Decomposition

7
Nonlinear Algebraic Equations – Root Finding

8
Optimization of Algebraic Systems

9
Curve Fitting and Interpolation

10
Numerical Integration

11
Finite-Difference Methods

One of the central goals of applied mathematics is to develop methods for solving differential equations as they form the governing equations for many topics
in the sciences and engineering. In Chapter 2, we addressed the solution of systems of linear first-order (and by extension higher-order) ordinary differential
equations by diagonalization. In Chapter 3, we used eigenfunction expansions
to develop methods for solving self-adjoint ordinary differential equations, with
extension to certain linear partial differential equations via the method of separation of variables. While these methods represent fundamental techniques in
applied mathematics, the scope of their application is very limited, primarily owing to their restriction to linear differential equations. As a complement to these
analytical techniques, therefore, numerical methods open up the full spectrum of
ordinary and partial differential equations for solution. Although these solutions
are approximate, the techniques are adaptable to ordinary differential equations
in the form of initial- and boundary-value problems as well as large classes of
partial differential equations.
In so far as many topics in science and engineering involve solving differential
equations, and in so far as very few practical problems are amenable to exact
closed-form solution, numerical methods form an essential tool in the researcher's
and practitioner's arsenal. In this chapter, we introduce finite-difference methods
and focus on aspects common to solving both ordinary and partial differential
equations. A simple initial-value problem and a boundary-value problem are used
to develop many of the ideas and provide a framework for thinking about numerical methods as applied to differential equations. The following two chapters then
address methods for ordinary and partial differential equations in turn.

11.1 General Considerations


11.1.1 Numerical Solution Procedure
Before getting into a detailed discussion of numerical methods, it is helpful to
view the numerical solution procedure in the form of a general framework as illustrated in Figure 11.1. We begin with the physical system of interest to be
modeled. The first step is to apply appropriate physical laws and models in order to derive the mathematical model, or governing equations, of the system.
These are typically in the form of ordinary or partial differential equations.

Figure 11.1 The general numerical solution procedure: from the physical system
(reality), through physical laws and models, to the mathematical model (governing
ODEs or PDEs), and then either to an analytical solution or, via discretization, to a
system of linear equations, a matrix solver, and the numerical solution.

Figure 11.2 Schematic of a discretized two-dimensional domain.

The

physical laws include, for example, conservation of mass, momentum, and energy,
and models include any assumptions or idealizations applied in order to simplify
the governing equations. When possible, analytical solutions of the mathematical
model are sought. If this is not possible, we turn to numerical methods, which
is the focus of Part II. The second step is to discretize the mathematical model,
which involves approximation of the continuous differential equation(s) by a system of algebraic equations for the dependent variables at discrete locations in the
independent variables (space and time). For example, see Figure 11.2. The discretization step leads to a system of linear algebraic equations, whose numerical
solution comprises step three of the numerical solution procedure. The method
of discretization often produces a large, sparse matrix problem with a particular
structure. For example, we will see that second-order accurate, central differences


Figure 11.3 Schematic of the forced spring-mass system: a mass m suspended from
a linear spring and subjected to a vertical forcing F cos(ωt).

lead to a tridiagonal system(s) of equations. This large matrix problem is then


solved using the direct or iterative methods to be discussed in due course.
Introducing a simple example will allow us to discuss the steps in the numerical
solution procedure in a more concrete fashion. For this purpose, let us consider the
forced spring-mass system shown in Figure 11.3 to be our physical system. The
height of the mass u(t) is measured vertically upward from its neutral position
at which no spring force fs acts on the mass; that is, fs = 0 at u = 0. We assume
that the spring is linear, such that the spring force is directly proportional to its
elongation according to
    f_s = −ku,
where k is the linear spring constant. We neglect the mass of the spring but
account for the drag on the moving mass. The drag force is assumed to behave
according to the Stokes model for low speeds given by
    f_d = −cv,
where c is the drag coefficient, and v(t) is the velocity of the mass. The minus
sign arises because the drag force is in the opposite direction of the motion of the
mass.
In order to obtain the equation of motion, let us consider Newton's second law

    ma = Σ f_i,

where a(t) is the acceleration of the mass, and f_i are the forces acting on it.
Recall that the velocity v(t) and position u(t) are related to the acceleration via

    v = du/dt,   a = dv/dt = d²u/dt².

From a free-body diagram of the forces acting on the mass, with W = mg being


the weight, and f_f = F cos(ωt) being the force owing to the forcing, Newton's
second law leads to

    ma = f_d + f_s − W − f_f,

or

    ma = −cv − ku − mg − F cos(ωt).

Writing in terms of the mass position u(t) only, the governing equation is

    d²u/dt² + (c/m) du/dt + (k/m) u = −g − (F/m) cos(ωt),       (11.1)

which is a second-order, linear, non-homogeneous ordinary differential equation
in the form of an initial-value problem.
Now let us consider the forced spring-mass system in the context of the general
numerical solution procedure. Step 1 is application of the physical law, namely
conservation of momentum in the form of Newton's second law, and models. In
this case, we assume that the spring is linear elastic and that the moving mass is
subject to low-speed Stokes drag. This results in the mathematical model given by
equation (11.1) in the form of a second-order, linear ordinary differential equation.
Although in this case, it is possible to obtain an exact solution analytically, let
us continue with the numerical solution procedure.
The second step of the numerical solution procedure involves discretizing the
continuous governing equation and time domain. We approximate the continuous
solution u(t) and the differential equation at discrete locations in time t_i, which
are separated by the small time step Δt. In order to see how the derivatives in the
governing equation (11.1) are discretized, recall the definition of the derivative
with Δt small, but finite:

    du/dt = u'(t) = lim_{Δt→0} [u(t + Δt) − u(t)]/Δt ≈ [u(t + Δt) − u(t)]/Δt.

As suggested by this definition, the derivative of u(t) can be approximated by
linear combinations of the values of u at adjacent time steps (more on this later).
Such finite differences, that is, differences of the dependent variable between
adjacent finite time steps, allow for calculation of the position u at the current
time step in terms of values at previous time steps. In this way, the value of
the position is calculated at successive time steps in turn. See Figure 11.4 for a
sample solution.
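As a sketch of this idea, the following time-marching loop applies explicit Euler finite differences to the first-order system u' = v, v' = a obtained from equation (11.1). The parameter values match those quoted for Figure 11.4; the time step, final time, and rest initial conditions are assumptions for illustration.

    import numpy as np

    g, m, k, c, F, omega = 9.81, 1.0, 1.0, 0.01, 1.0, 0.2
    dt, t_final = 0.01, 100.0
    n_steps = int(t_final / dt)

    u, v = 0.0, 0.0                  # assumed initial conditions: at rest at u = 0
    history = np.zeros(n_steps + 1)
    for i in range(n_steps):
        t = i * dt
        a = -(c / m) * v - (k / m) * u - g - (F / m) * np.cos(omega * t)  # from (11.1)
        u += dt * v                  # u_{i+1} = u_i + dt * v_i
        v += dt * a                  # v_{i+1} = v_i + dt * a_i
        history[i + 1] = u

    print(history[-5:])              # last few positions of the mass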
11.1.2 Properties of a Numerical Solution
Each step of the numerical solution procedure produces its own source of error.
Step one produces modeling errors, which are the differences between the actual
physical system and the exact solution of the mathematical model. The difference
between the exact solution of the governing equations and the exact solution of
the system of algebraic equations is the discretization error. This error, which is
produced by step two, is comprised of two contributing factors: 1) the inherent


Figure 11.4 Solution u(t) of the forced spring-mass system with g = 9.81,
m = 1, k = 1, c = 0.01, F = 1, ω = 0.2.

error of the method of discretization, that is, the truncation error, and 2) the
error owing to the resolution of the computational grid used in the discretization.
Finally, unless a direct method is used, there is the iterative convergence error,
which is the difference between the iterative numerical solution and the exact
solution of the algebraic equations. In both direct and iterative methods, round-off errors arise as discussed in Section 6.1.2.
As we discuss various numerical methods applied to a variety of types of problems and equations, there are several properties of successful numerical solution
methods that must be considered. The first is consistency, which requires that
the discretized equations formally become the governing equations as the grid
size goes to zero; this is a property of the discretization method. Specifically, the
truncation error, which is the difference between the solution to the discretized
equations and the exact solution of the governing equations, must go to zero
as the grid size goes to zero. For example, we will see that a finite-difference
approximation to the first-order derivative of u with respect to x is given by
    du/dx = (u_{i+1} − u_{i−1})/(2Δx) + O(Δx²),

where O(Δx²) is the truncation error. Therefore, from the definition of the
derivative, as Δx → 0, (u_{i+1} − u_{i−1})/(2Δx) → du/dx, as required for consistency.
The second property is stability, which requires that a time-marching numerical
procedure must not magnify round-off errors produced in the numerical solution
such that the numerical solution diverges from the exact solution. This will be
discussed in more detail in Sections (odes) and 15.2. Note the similarity between
stability and conditioning discussed in Section 6.1.2 as they both involve examining the effect of disturbances. The distinction is that conditioning quantifies the
effect of disturbances in the system of equations itself, whereas stability determines the effect of disturbances in the algorithm that is used to solve the system
of equations.
The third property is convergence, whereby the numerical solution of the
discretized equations approaches the exact solution of the governing equations as
the grid size goes to zero (contrast this with iterative convergence). Of course,
we generally do not have the exact solution; therefore, we refine the grid until a
grid-independent solution is obtained. The truncation error gives the rate of
convergence to the exact solution as the grid is reduced. For example, for an O(Δx²)
truncation error, halving the grid size reduces the error by a factor of four.
Finally, we desire correctness, whereby the numerical solution should compare favorably with available analytical solutions, experimental results, or other
computational solutions within the limitations of each (see validation below).
Given the numerous sources of errors and the potential for numerical instability, how do we determine whether a numerical solution is accurate and reliable?
This is accomplished through validation and verification. Validation is quantification of the modeling errors in the computational model, that is, step 1 of the
numerical solution procedure. It seeks to answer the questions: "Am I solving the
correct equations?" and "Am I capturing the physics correctly?" Validation is
accomplished through comparison with highly accurate experiments carried out on
the actual system. Verification is quantification of the discretization and iterative
convergence errors in the computational model and its solution, that is, steps
2 and 3 of the numerical solution procedure. It seeks to answer the questions:
"Am I solving the equations correctly?" and "Am I capturing the mathematics
correctly?" Verification is accomplished through comparison with existing benchmark
solutions in the form of exact or highly accurate numerical solutions to
which the numerical solutions can be compared.
The discretization step (step 2 in the numerical solution procedure) may be
accomplished using different methods. The most common approaches used in
science and engineering applications are finite-difference methods, finite-element
methods, and spectral methods. Our focus here is on finite-difference methods
owing to their widespread use, flexibility, and fundamental nature relative to
other numerical methods. For more on finite-element and spectral methods, see
Section 11.

11.2 Formal Basis for Finite Differences


In Section 11.1.1, we motivated the finite-difference approximation using the
definition of the derivative with small, but finite, step size. Here, we will formalize
this approach using Taylor series expansions, but first reconsider the definition
of the derivative for the function u(x) at a point x_i as follows:

    u'(x_i) = du/dx|_{x=x_i} = lim_{Δx→0} [u(x_i + Δx) − u(x_i)]/Δx.

This may be interpreted as a forward difference if Δx is small, but not going all the
way to zero, thereby resulting in a finite difference. If, for example, x_{i+1} = x_i + Δx
and u_{i+1} = u(x_{i+1}), then we can interpret this graphically as shown in Figure 11.5.
Thus, we have three approximations to the first derivative:

Figure 11.5 Graphical representation of the forward, backward, and central
difference approximations to the first derivative.

    Forward Difference:   du/dx|_{x_i} ≈ (u_{i+1} − u_i)/Δx,

    Backward Difference:  du/dx|_{x_i} ≈ (u_i − u_{i−1})/Δx,

    Central Difference:   du/dx|_{x_i} ≈ (u_{i+1} − u_{i−1})/(2Δx).

Intuitively, we might expect that the central difference will provide a more accurate approximation than the forward and backward differences. Indeed, this is
the case as shown formally using Taylor series expansions.
Finite-difference approximations are based on truncated Taylor series expansions, which allow us to express the local behavior of a function in the vicinity of
some point in terms of the value of the function and its derivatives at that point.
Consider the Taylor series expansion of u(x) in the vicinity of the point x_i:

    u(x) = u(x_i) + (x − x_i) (du/dx)_i + [(x − x_i)²/2!] (d²u/dx²)_i
           + [(x − x_i)³/3!] (d³u/dx³)_i + · · · + [(x − x_i)ⁿ/n!] (dⁿu/dxⁿ)_i + · · · .   (11.2)

Likewise, apply the Taylor series at x = x_{i+1}, with x_{i+1} − x_i = Δx, as follows:

    u_{i+1} = u_i + Δx (du/dx)_i + (Δx²/2) (d²u/dx²)_i + (Δx³/6) (d³u/dx³)_i
              + · · · + (Δxⁿ/n!) (dⁿu/dxⁿ)_i + · · · .                          (11.3)

Solving for (du/dx)_i gives

    (du/dx)_i = (u_{i+1} − u_i)/Δx − (Δx/2) (d²u/dx²)_i − · · ·
                − (Δxⁿ⁻¹/n!) (dⁿu/dxⁿ)_i − · · · .                              (11.4)

Similarly, apply the Taylor series at x = x_{i−1}, with x_{i−1} − x_i = −Δx, to give

    u_{i−1} = u_i − Δx (du/dx)_i + (Δx²/2) (d²u/dx²)_i − (Δx³/6) (d³u/dx³)_i
              + · · · + [(−1)ⁿ Δxⁿ/n!] (dⁿu/dxⁿ)_i + · · · .                    (11.5)

Solving again for (du/dx)_i gives

    (du/dx)_i = (u_i − u_{i−1})/Δx + (Δx/2) (d²u/dx²)_i − (Δx²/6) (d³u/dx³)_i
                + · · · + [(−1)ⁿ Δxⁿ⁻¹/n!] (dⁿu/dxⁿ)_i + · · · .                (11.6)
Alternatively, subtract equation (11.5) from (11.3) to obtain

    u_{i+1} − u_{i−1} = 2Δx (du/dx)_i + (Δx³/3) (d³u/dx³)_i + · · ·
                        + [2Δx^{2n+1}/(2n+1)!] (d^{2n+1}u/dx^{2n+1})_i + · · · ,      (11.7)

and solve for (du/dx)_i to obtain

    (du/dx)_i = (u_{i+1} − u_{i−1})/(2Δx) − (Δx²/6) (d³u/dx³)_i − · · ·
                − [Δx^{2n}/(2n+1)!] (d^{2n+1}u/dx^{2n+1})_i − · · · .                 (11.8)
If all of the terms are retained in the expansions, equations (11.4), (11.6), and
(11.8) are exact expressions for the first derivative (du/dx)_i. Approximate finite-
difference expressions for the first derivative may then be obtained by truncating
the series after the first term:

    Forward Difference:   du/dx|_{x_i} = (u_{i+1} − u_i)/Δx + O(Δx),

    Backward Difference:  du/dx|_{x_i} = (u_i − u_{i−1})/Δx + O(Δx),

    Central Difference:   du/dx|_{x_i} = (u_{i+1} − u_{i−1})/(2Δx) + O(Δx²).

The O(Δx) and O(Δx²) terms represent the truncation error of the corresponding
approximation. For small Δx, successive terms in the Taylor series get smaller,
and the order of the truncation error is given by the first truncated term. We
say that the forward- and backward-difference approximations are first-order
accurate, and the central-difference approximation is second-order accurate.
Observe that the central-difference approximation is indeed better than the forward and backward differences as expected. Observe that the truncation error
arises because of our choice of algorithm, whereas the round-off error arises because of the way calculations are carried out on a computer. Therefore, truncation
error would result even on a perfect computer using exact arithmetic.
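A brief numerical check of these truncation orders is sketched below, using u(x) = sin(x) at x = 1 as an illustrative test function (not from the text).

    import numpy as np

    x0, exact = 1.0, np.cos(1.0)      # exact derivative of sin(x) at x = 1
    for dx in (0.1, 0.05, 0.025):
        u = np.sin
        fwd = (u(x0 + dx) - u(x0)) / dx                # first-order accurate
        cen = (u(x0 + dx) - u(x0 - dx)) / (2.0 * dx)   # second-order accurate
        print(f"dx={dx:7.4f}  forward error={abs(fwd - exact):.2e}  "
              f"central error={abs(cen - exact):.2e}")
    # Halving dx roughly halves the forward-difference error but reduces the
    # central-difference error by about a factor of four.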
Higher-order approximations and/or higher-order derivatives may be obtained
by various manipulations of the Taylor series at additional points. For example,
to obtain a second-order accurate forward-difference approximation to the first
derivative, apply the Taylor series at x_{i+2} as follows:

    u_{i+2} = u_i + 2Δx (du/dx)_i + [(2Δx)²/2!] (d²u/dx²)_i
              + [(2Δx)³/3!] (d³u/dx³)_i + · · · .                               (11.9)

The (d²u/dx²)_i term can be eliminated by taking 4×(11.3) − (11.9) to obtain

    4u_{i+1} − u_{i+2} = 3u_i + 2Δx (du/dx)_i − (2Δx³/3) (d³u/dx³)_i + · · · .

Solving for (du/dx)_i gives

    (du/dx)_i = (−3u_i + 4u_{i+1} − u_{i+2})/(2Δx) + (Δx²/3) (d³u/dx³)_i + · · · ,   (11.10)

which is second-order accurate and involves the point of interest and the next
two points to the right.
For a second-order accurate central-difference approximation to the second derivative,
add equations (11.3) and (11.5) for u_{i+1} and u_{i−1}, respectively, to eliminate
the (du/dx)_i term. This gives

    u_{i+1} + u_{i−1} = 2u_i + Δx² (d²u/dx²)_i + (Δx⁴/12) (d⁴u/dx⁴)_i + · · · .

Solving for (d²u/dx²)_i leads to

    (d²u/dx²)_i = (u_{i+1} − 2u_i + u_{i−1})/Δx² − (Δx²/12) (d⁴u/dx⁴)_i + · · · ,     (11.11)

which is second-order accurate and involves the point of interest and its nearest
neighbors to the left and right.
We call finite-difference approximations that only involve the point of interest
and its nearest neighbors on either side compact. Therefore, the second-order
accurate central-difference approximations given above for both the first
and second derivatives are compact, whereas the second-order accurate forward-
difference approximation to the first derivative is not compact.
In comparing the first-order and second-order accurate forward-difference approximations, observe that increasing the order of accuracy of a finite-difference
approximation requires including additional grid points. Thus, as more complex
situations are encountered, for example, involving higher-order derivatives and/or
approximations, determining how to combine the linear combination of Taylor
series to produce such finite-difference formulae can become very difficult. Alternatively, it is sometimes easier to frame the question as follows: For a given set of
adjacent grid points, called the finite-difference stencil, what is the highest-order
finite-difference approximation possible? Or stated slightly differently: What is
the best finite-difference approximation using a given pattern of grid points, in
other words, the one with the smallest truncation error?
We illustrate the procedure using an example. Equation (11.10) provides a
second-order accurate, forward-difference approximation to the first derivative,
and it involves three adjacent points at xi , xi+1 , and xi+2 . Let us instead determine


the most accurate approximation to the first derivative that involves the four
points x_i, x_{i+1}, x_{i+2}, and x_{i+3}. This will be of the form

    u'_i + c_0 u_i + c_1 u_{i+1} + c_2 u_{i+2} + c_3 u_{i+3} = T.E.,            (11.12)

where primes denote derivatives of u with respect to x, and T.E. is the truncation
error. The objective is to determine the constants c_0, c_1, c_2, and c_3 that produce
the highest-order truncation error. The Taylor series approximations for u(x) at
x_{i+1}, x_{i+2}, and x_{i+3} about x_i are

    u_{i+1} = u_i + Δx u'_i + (Δx²/2!) u''_i + (Δx³/3!) u'''_i + (Δx⁴/4!) u''''_i + · · · ,
    u_{i+2} = u_i + 2Δx u'_i + [(2Δx)²/2!] u''_i + [(2Δx)³/3!] u'''_i + [(2Δx)⁴/4!] u''''_i + · · · ,
    u_{i+3} = u_i + 3Δx u'_i + [(3Δx)²/2!] u''_i + [(3Δx)³/3!] u'''_i + [(3Δx)⁴/4!] u''''_i + · · · .

Substituting these expansions into equation (11.12) and collecting terms leads to

    (c_0 + c_1 + c_2 + c_3) u_i + [1 + Δx (c_1 + 2c_2 + 3c_3)] u'_i
    + Δx² [(1/2) c_1 + 2c_2 + (9/2) c_3] u''_i + Δx³ [(1/6) c_1 + (4/3) c_2 + (9/2) c_3] u'''_i
    + Δx⁴ [(1/24) c_1 + (2/3) c_2 + (27/8) c_3] u''''_i + · · · = T.E.          (11.13)

The highest-order truncation error will occur when the maximum number of
lower-order derivative terms are eliminated in this equation. Because we have
four constants to determine in this case, we can eliminate the first four terms
in the expansion. This requires the following four simultaneous linear algebraic
equations to be solved for the coefficients:

    c_0 + c_1 + c_2 + c_3 = 0,
    c_1 + 2c_2 + 3c_3 = −1/Δx,
    (1/2) c_1 + 2c_2 + (9/2) c_3 = 0,
    (1/6) c_1 + (4/3) c_2 + (9/2) c_3 = 0.

The solution to this system of equations is

    c_0 = 11/(6Δx),   c_1 = −3/Δx,   c_2 = 3/(2Δx),   c_3 = −1/(3Δx),

or with the least common denominator

    c_0 = 11/(6Δx),   c_1 = −18/(6Δx),   c_2 = 9/(6Δx),   c_3 = −2/(6Δx).

The remaining term that has not been zeroed provides the leading-order truncation
error. Substituting the solutions for the coefficients just obtained, this term


Figure 11.6 Schematic of the extended fin.

becomes

    T.E. = −(Δx³/4) u''''_i,

which indicates that the approximation is third-order accurate. Therefore, the
approximation (11.12) that has the highest-order truncation error is

    (du/dx)_i = (−11u_i + 18u_{i+1} − 9u_{i+2} + 2u_{i+3})/(6Δx) + O(Δx³).      (11.14)
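As a brief check, the four conditions that zero the u_i, u'_i, u''_i, and u'''_i terms in (11.13) can be solved numerically as a small linear system; dx is an illustrative value.

    import numpy as np

    dx = 0.1
    # Rows: coefficients of c0..c3 in the u_i, u'_i, u''_i, u'''_i conditions.
    M = np.array([[1.0, 1.0,     1.0,     1.0],
                  [0.0, 1.0,     2.0,     3.0],
                  [0.0, 1.0/2.0, 2.0,     9.0/2.0],
                  [0.0, 1.0/6.0, 4.0/3.0, 9.0/2.0]])
    rhs = np.array([0.0, -1.0/dx, 0.0, 0.0])
    c = np.linalg.solve(M, rhs)
    print(c * 6.0 * dx)   # expect [11, -18, 9, -2], matching equation (11.14)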
11.3 Extended-Fin Example
As a second example, let us now consider a boundary-value problem, the onedimensional model of the heat conduction in an extended fin, an array of which
may be used to cool an electronic device, for example. This example will assist in
further solidifying our understanding of the general numerical solution procedure
as well as introduce a number of the issues in finite-difference methods. After
introducing this example, the Thomas algorithm will be devised to solve the resulting tridiagonal system of equations. In addition, there will be some discussion
of handling different types of boundary conditions. Figure 11.6 is a schematic of
the extended fin with heat conduction in the fin and convection between the fin
and ambient air.
Step 1 of the numerical solution procedure consists of applying conservation
of energy within the fin along with any simplifying assumptions. In this case, we
assume that the heat transfer is one-dimensional, that is, only axially along the
length of the fin. This is a good assumption for a fin with small cross section
relative to its length and for which the cross-sectional area changes gradually
along the length of the fin. It is further assumed that the convective heat transfer
coefficient is a constant. Based on this, the heat transfer within the extended fin is
governed by the one-dimensional ordinary-differential equation (see, for example,
Incropera & DeWitt)




    d²T/dx² + (1/A_c)(dA_c/dx)(dT/dx) − (1/A_c)(h/k)(dA_s/dx)(T − T_∞) = 0,     (11.15)


Figure 11.7 Schematic of the uniform, one-dimensional grid for the extended fin
example.

where T(x) is the temperature distribution along the length of the fin, T_∞ is
the ambient air temperature away from the fin, A_c(x) is the cross-sectional area,
A_s(x) is the surface area from the base, h is the convective heat transfer coefficient
at the surface of the fin, and k is the thermal conductivity of the fin material.
Observe that equation (11.15) is a second-order ordinary differential equation
with variable coefficients, and it is a boundary-value problem requiring boundary
conditions at both ends of the domain.
Letting u(x) = T(x) − T_∞, rewrite equation (11.15) as

    d²u/dx² + f(x) du/dx + g(x) u = 0,                                          (11.16)

where

    f(x) = (1/A_c) dA_c/dx,   g(x) = −(1/A_c)(h/k) dA_s/dx.

For now, consider the case where we have a specified temperature at both the
base and tip of the fin, such that

    u = u_b = T_b − T_∞  at x = 0,
                                                                                (11.17)
    u = u_ℓ = T_ℓ − T_∞  at x = ℓ.
These are called Dirichlet boundary conditions. Equation (11.16) with boundary
conditions (11.17) represents the mathematical model (step 1 in the numerical
solution procedure).
Step 2 consists of discretizing the domain and governing differential equation.
Dividing the interval 0 ≤ x ≤ ℓ into I equal subintervals of length Δx = ℓ/I
gives the uniform grid illustrated in Figure 11.7.¹ Here, f_i = f(x_i) and g_i = g(x_i)
are known at each grid point, and the solution u_i = u(x_i) is to be determined for
all interior points i = 2, . . . , I. Approximating the derivatives in equation (11.16)
using second-order accurate finite differences gives

    (u_{i+1} − 2u_i + u_{i−1})/Δx² + f_i (u_{i+1} − u_{i−1})/(2Δx) + g_i u_i = 0,
¹ We will begin our grid indices at one in agreement with the typical notation used for vectors and
matrices as in Chapters 1 and 2; that is, i = 1 corresponds to x = 0. This is despite the fact that
many programming languages begin array indices at zero by default.


for the point x_i. Multiplying by (Δx)² and collecting terms leads to

    a_i u_{i−1} + b_i u_i + c_i u_{i+1} = d_i,   i = 2, . . . , I,              (11.18)

where the difference equation is applied at each interior point of the domain, and

    a_i = 1 − (Δx/2) f_i,   b_i = −2 + Δx² g_i,   c_i = 1 + (Δx/2) f_i,   d_i = 0.

Note that because we have discretized the differential equation at each interior
grid point, we obtain a set of (I − 1) algebraic equations for the (I − 1) unknown
values of the temperature u_i, i = 2, . . . , I at each interior grid point. The
coefficient matrix for the difference equation (11.18) is tridiagonal:

    \begin{bmatrix}
      b_2 & c_2 &     &        &         &         \\
      a_3 & b_3 & c_3 &        &         &         \\
          & a_4 & b_4 & c_4    &         &         \\
          &     & \ddots & \ddots & \ddots &       \\
          &     &     & a_{I-1} & b_{I-1} & c_{I-1} \\
          &     &     &         & a_I     & b_I
    \end{bmatrix}
    \begin{bmatrix} u_2 \\ u_3 \\ u_4 \\ \vdots \\ u_{I-1} \\ u_I \end{bmatrix}
    =
    \begin{bmatrix} d_2 - a_2 u_1 \\ d_3 \\ d_4 \\ \vdots \\ d_{I-1} \\ d_I - c_I u_{I+1} \end{bmatrix},

where we note that the right-hand-side coefficients have been adjusted to account
for the known values of u at the boundaries.
Remarks:
1. As in equation (11.18), it is customary to write difference equations with the
   unknowns on the left-hand side and knowns on the right-hand side.
2. We multiply through by (Δx)² in equation (11.18) such that the resulting
   coefficients in the difference equation are O(1).
3. To prevent ill-conditioning, the tridiagonal system of equations should be
   diagonally dominant, such that |b_i| ≥ |a_i| + |c_i|.
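A minimal sketch of assembling and solving this tridiagonal system with the Dirichlet conditions (11.17) follows. The fin data are assumptions for illustration: a uniform fin, for which f(x) = 0 and g(x) = −hP/(kA_c) is a negative constant (here −4), with base and tip values u_b = 100 and u_ℓ = 20; a general dense solver is used here, whereas the Thomas algorithm of Section 11.4.2 exploits the tridiagonal structure.

    import numpy as np

    I, ell = 50, 1.0
    dx = ell / I
    x = np.linspace(0.0, ell, I + 1)
    f = lambda x: 0.0 * x            # uniform cross section (assumed)
    g = lambda x: -4.0 + 0.0 * x     # constant -hP/(kA_c) (assumed value)
    u_b, u_l = 100.0, 20.0           # Dirichlet data at base and tip (assumed)

    a = 1.0 - 0.5 * dx * f(x)        # a_i
    b = -2.0 + dx**2 * g(x)          # b_i
    c = 1.0 + 0.5 * dx * f(x)        # c_i

    n = I - 1                        # interior unknowns u_2, ..., u_I
    A = np.zeros((n, n))
    d = np.zeros(n)
    for row, i in enumerate(range(1, I)):   # i is the 0-based grid index
        A[row, row] = b[i]
        if row > 0:
            A[row, row - 1] = a[i]
        if row < n - 1:
            A[row, row + 1] = c[i]
    d[0] -= a[1] * u_b               # move known boundary values to the right
    d[-1] -= c[I - 1] * u_l

    u = np.zeros(I + 1)
    u[0], u[-1] = u_b, u_l
    u[1:-1] = np.linalg.solve(A, d)
    print(u[:5])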

11.4 Tridiagonal Systems of Equations


Because second-order accurate, central difference approximations generally produce tridiagonal matrices as in our one-dimensional extended fin example, we
consider properties of such matrices in the next section and an algorithm for
their solution in the following section.


11.4.1 Properties of Tridiagonal Matrices

Consider an N × N tridiagonal matrix with constant a, b, and c along the lower,
main, and upper diagonals, respectively, of the form

    A = \begin{bmatrix}
          b & c &        &        &   \\
          a & b &   c    &        &   \\
            & a &   b    & \ddots &   \\
            &   & \ddots & \ddots & c \\
            &   &        &   a    & b
        \end{bmatrix}.

It can be shown that the eigenvalues of such a tridiagonal matrix with constants
along each diagonal are

    λ_j = b + 2√(ac) cos[jπ/(N + 1)],   j = 1, . . . , N.                       (11.19)

The eigenvalues with the largest and smallest magnitudes are (which is largest
or smallest depends upon a, b, and c)

    |λ_1| = |b + 2√(ac) cos[π/(N + 1)]|,
    |λ_N| = |b + 2√(ac) cos[Nπ/(N + 1)]|.
Let us consider N large. In this case, expanding cosine in a Taylor series gives
[the first is expanded about π/(N + 1) ≈ 0 and the second about Nπ/(N + 1) ≈ π]

    cos[π/(N + 1)] = 1 − (1/2!)[π/(N + 1)]² + (1/4!)[π/(N + 1)]⁴ − · · · ,

    cos[Nπ/(N + 1)] = −1 + (1/2!)[π/(N + 1)]² − · · · = −1 + (1/2)[π/(N + 1)]² − · · · .
Consider the common case that may result from the use of central differences
for a second-order derivative:

    a = 1,   b = −2,   c = 1,

which is weakly diagonally dominant. Then with the large-N expansions from above,

    |λ_1| = |−2 + 2√((1)(1)) [1 − (1/2)(π/(N + 1))² + · · ·]| = [π/(N + 1)]² + · · · ,
                                                                                 (11.20)
    |λ_N| = |−2 + 2√((1)(1)) [−1 + (1/2)(π/(N + 1))² − · · ·]| = 4 − · · · .


Thus, the condition number for large N is approximately

    cond₂(A) ≈ 4/[π/(N + 1)]² = 4(N + 1)²/π²,   for large N.

Therefore, the condition number increases proportional to N² with increasing N.
Now consider the case with

    a = 1,   b = −4,   c = 1,

which is strictly diagonally dominant. Then from equation (11.20), the condition
number for large N is approximately

    cond₂(A) ≈ 6/2 = 3,   for large N,

in which case it is constant with increasing N.


See the Mathematica notebook Ill-Conditioned Matrices and Round-Off Error for the influence of the condition number and diagonal dominance on tridiagonal matrices.
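A brief numerical check of the two condition-number estimates above is sketched here; the matrix sizes are illustrative choices.

    import numpy as np

    def tridiag(a, b, c, N):
        return (np.diag(b * np.ones(N)) +
                np.diag(a * np.ones(N - 1), -1) +
                np.diag(c * np.ones(N - 1), 1))

    for N in (20, 40, 80):
        weak = tridiag(1.0, -2.0, 1.0, N)     # weakly diagonally dominant
        strict = tridiag(1.0, -4.0, 1.0, N)   # strictly diagonally dominant
        print(N,
              round(np.linalg.cond(weak, 2), 1),     # grows roughly like 4(N+1)^2/pi^2
              round(np.linalg.cond(strict, 2), 3))   # approaches 3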

11.4.2 Thomas Algorithm


Recall that our extended fin equation with Dirichlet boundary conditions produces
the tridiagonal system of equations

    \begin{bmatrix}
      b_2 & c_2 &     &        &         &         \\
      a_3 & b_3 & c_3 &        &         &         \\
          & a_4 & b_4 & c_4    &         &         \\
          &     & \ddots & \ddots & \ddots &       \\
          &     &     & a_{I-1} & b_{I-1} & c_{I-1} \\
          &     &     &         & a_I     & b_I
    \end{bmatrix}
    \begin{bmatrix} u_2 \\ u_3 \\ u_4 \\ \vdots \\ u_{I-1} \\ u_I \end{bmatrix}
    =
    \begin{bmatrix} d_2 - a_2 u_1 \\ d_3 \\ d_4 \\ \vdots \\ d_{I-1} \\ d_I - c_I u_{I+1} \end{bmatrix}.

This tridiagonal form is typical of other finite-difference methods having compact
stencils and Dirichlet boundary conditions.
Tridiagonal systems may be solved directly, and efficiently, using the Thomas
algorithm, which is based on Gaussian elimination. Recall that Gaussian elimination
consists of a forward elimination and a back substitution step. First, consider
the forward elimination, which eliminates the a_i coefficients along the lower
diagonal. Let us begin by dividing the first equation through by b_2 to give

    \begin{bmatrix}
      1   & F_2 &     &        &         &         \\
      a_3 & b_3 & c_3 &        &         &         \\
          & a_4 & b_4 & c_4    &         &         \\
          &     & \ddots & \ddots & \ddots &       \\
          &     &     & a_{I-1} & b_{I-1} & c_{I-1} \\
          &     &     &         & a_I     & b_I
    \end{bmatrix}
    \begin{bmatrix} u_2 \\ u_3 \\ u_4 \\ \vdots \\ u_{I-1} \\ u_I \end{bmatrix}
    =
    \begin{bmatrix} \rho_2 \\ d_3 \\ d_4 \\ \vdots \\ d_{I-1} \\ d_I - c_I u_{I+1} \end{bmatrix},

where

    F_2 = c_2/b_2,   ρ_2 = (d_2 − a_2 u_1)/b_2.                                 (11.21)

To eliminate a_3 in the second equation, subtract a_3 times the first equation from
the second equation to produce

    \begin{bmatrix}
      1 & F_2           &     &        &         &         \\
      0 & b_3 - a_3 F_2 & c_3 &        &         &         \\
        & a_4           & b_4 & c_4    &         &         \\
        &               & \ddots & \ddots & \ddots &       \\
        &               &     & a_{I-1} & b_{I-1} & c_{I-1} \\
        &               &     &         & a_I     & b_I
    \end{bmatrix}
    \begin{bmatrix} u_2 \\ u_3 \\ u_4 \\ \vdots \\ u_{I-1} \\ u_I \end{bmatrix}
    =
    \begin{bmatrix} \rho_2 \\ d_3 - a_3 \rho_2 \\ d_4 \\ \vdots \\ d_{I-1} \\ d_I - c_I u_{I+1} \end{bmatrix}.

Dividing the second equation through by b_3 − a_3 F_2 then leads to

    \begin{bmatrix}
      1 & F_2 &     &        &         &         \\
      0 & 1   & F_3 &        &         &         \\
        & a_4 & b_4 & c_4    &         &         \\
        &     & \ddots & \ddots & \ddots &       \\
        &     &     & a_{I-1} & b_{I-1} & c_{I-1} \\
        &     &     &         & a_I     & b_I
    \end{bmatrix}
    \begin{bmatrix} u_2 \\ u_3 \\ u_4 \\ \vdots \\ u_{I-1} \\ u_I \end{bmatrix}
    =
    \begin{bmatrix} \rho_2 \\ \rho_3 \\ d_4 \\ \vdots \\ d_{I-1} \\ d_I - c_I u_{I+1} \end{bmatrix},

where

    F_3 = c_3/(b_3 − a_3 F_2),   ρ_3 = (d_3 − a_3 ρ_2)/(b_3 − a_3 F_2).          (11.22)

Similarly, to eliminate a_4 in the third equation, subtract a_4 times the second
equation from the third equation to give

    \begin{bmatrix}
      1 & F_2 &               &        &         &         \\
      0 & 1   & F_3           &        &         &         \\
        & 0   & b_4 - a_4 F_3 & c_4    &         &         \\
        &     & \ddots        & \ddots & \ddots  &         \\
        &     &               & a_{I-1} & b_{I-1} & c_{I-1} \\
        &     &               &         & a_I     & b_I
    \end{bmatrix}
    \begin{bmatrix} u_2 \\ u_3 \\ u_4 \\ \vdots \\ u_{I-1} \\ u_I \end{bmatrix}
    =
    \begin{bmatrix} \rho_2 \\ \rho_3 \\ d_4 - a_4 \rho_3 \\ \vdots \\ d_{I-1} \\ d_I - c_I u_{I+1} \end{bmatrix}.

Dividing the third equation through by b_4 − a_4 F_3 then leads to

    \begin{bmatrix}
      1 & F_2 &     &        &         &         \\
      0 & 1   & F_3 &        &         &         \\
        & 0   & 1   & F_4    &         &         \\
        &     & \ddots & \ddots & \ddots &       \\
        &     &     & a_{I-1} & b_{I-1} & c_{I-1} \\
        &     &     &         & a_I     & b_I
    \end{bmatrix}
    \begin{bmatrix} u_2 \\ u_3 \\ u_4 \\ \vdots \\ u_{I-1} \\ u_I \end{bmatrix}
    =
    \begin{bmatrix} \rho_2 \\ \rho_3 \\ \rho_4 \\ \vdots \\ d_{I-1} \\ d_I - c_I u_{I+1} \end{bmatrix},

where

    F_4 = c_4/(b_4 − a_4 F_3),   ρ_4 = (d_4 − a_4 ρ_3)/(b_4 − a_4 F_3).          (11.23)

Comparing equations (11.21), (11.22), and (11.23), observe that we can define
the following recursive coefficients to perform the forward elimination:

    F_1 = 0,   ρ_1 = u_1,
    F_i = c_i/(b_i − a_i F_{i−1}),   ρ_i = (d_i − a_i ρ_{i−1})/(b_i − a_i F_{i−1}),   i = 2, . . . , I.

Upon completion of the forward elimination step, we have the row-echelon form
of the system of equations given by

    \begin{bmatrix}
      1 & F_2 &     &        &   &         \\
        & 1   & F_3 &        &   &         \\
        &     & 1   & F_4    &   &         \\
        &     &     & \ddots & \ddots &    \\
        &     &     &        & 1 & F_{I-1} \\
        &     &     &        &   & 1
    \end{bmatrix}
    \begin{bmatrix} u_2 \\ u_3 \\ u_4 \\ \vdots \\ u_{I-1} \\ u_I \end{bmatrix}
    =
    \begin{bmatrix} \rho_2 \\ \rho_3 \\ \rho_4 \\ \vdots \\ \rho_{I-1} \\ \rho_I - F_I u_{I+1} \end{bmatrix}.

We then apply back substitution to obtain the solutions for u_i. Starting with
the last equation, we have

    u_I = ρ_I − F_I u_{I+1},

where u_{I+1} is known from the Dirichlet boundary condition at the tip. Using this
result, the second-to-last equation then gives

    u_{I−1} = ρ_{I−1} − F_{I−1} u_I.

Generalizing yields

    u_i = ρ_i − F_i u_{i+1},   i = I, . . . , 2,

where we note the order, starting at the tip and ending at the base.
Summarizing, the Thomas algorithm consists of the two recursive stages:
1. Forward elimination:

       F_1 = 0,   ρ_1 = u_1 = u_b = boundary condition,
       F_i = c_i/(b_i − a_i F_{i−1}),   ρ_i = (d_i − a_i ρ_{i−1})/(b_i − a_i F_{i−1}),   i = 2, . . . , I.

2. Back substitution:

       u_{I+1} = u_ℓ = boundary condition,
       u_i = ρ_i − F_i u_{i+1},   i = I, . . . , 2.

Again, note the order of evaluation for the back substitution.


Remarks:
1. The Thomas algorithm only requires O(I) operations, which is as good a scaling
   as one could hope for. In contrast, Gaussian elimination of a full (dense)
   matrix requires O(I³) operations. It is so efficient that numerical methods for
   more complex situations, such as two- and three-dimensional partial differential
   equations, are often designed specifically to take advantage of the Thomas
   algorithm. A sketch of the algorithm follows these remarks.
2. Observe that it is only necessary to store each of the three diagonals in a vector
   (one-dimensional array), and not the entire matrix, owing to the structure of
   the matrix.
3. Similar algorithms are available for other banded matrices, such as pentadiagonal
   matrices.
4. Notice how round-off errors could accumulate in the F_i and ρ_i coefficients
   during the forward elimination step, and in u_i during the back substitution step.
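The following minimal sketch implements the two-stage Thomas algorithm summarized above, with the text's 1-based grid indexing mapped onto 0-based arrays: a, b, c, d hold the coefficients of (11.18) for i = 2, . . . , I, and u_b, u_l are the Dirichlet values u_1 and u_{I+1}. The small test system is an assumed example.

    import numpy as np

    def thomas(a, b, c, d, u_b, u_l):
        n = len(b)                        # interior unknowns u_2 ... u_I
        F = np.zeros(n)
        rho = np.zeros(n)
        F_prev, rho_prev = 0.0, u_b       # F_1 = 0, rho_1 = u_1 = u_b
        for i in range(n):                # forward elimination
            denom = b[i] - a[i] * F_prev
            F[i] = c[i] / denom
            rho[i] = (d[i] - a[i] * rho_prev) / denom
            F_prev, rho_prev = F[i], rho[i]
        u = np.zeros(n)
        u_next = u_l                      # u_{I+1}, known boundary value
        for i in range(n - 1, -1, -1):    # back substitution, tip to base
            u[i] = rho[i] - F[i] * u_next
            u_next = u[i]
        return u

    # Test: a = c = 1, b = -2, d = 0 with boundary values 1 and 0 corresponds to
    # the 1D Laplace equation, whose discrete solution is exactly linear.
    n = 9
    u = thomas(np.ones(n), -2.0 * np.ones(n), np.ones(n), np.zeros(n), 1.0, 0.0)
    print(np.round(u, 4))   # decreases linearly from near 1 to near 0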

11.5 Derivative Boundary Conditions


In order to illustrate treatment of derivative boundary conditions, let us return
to the extended fin example and consider a more realistic convection boundary
condition at the tip of the fin in place of the Dirichlet boundary condition, which
assumes that we know the temperature there. The convection condition at the
tip is given by
    −k dT/dx = h(T − T_∞)   at x = ℓ,

which is a balance between conduction and convection. Substituting u(x) =
T(x) − T_∞ gives

    −k (du/dx)_{x=ℓ} = h u(ℓ).

Note that the convection condition results in specifying a linear combination of
the temperature and its derivative at the tip. This is known as a Robin, or mixed,
boundary condition and is of the form

    p u + q du/dx = r   at x = ℓ,                                               (11.24)

where for the convection condition p = h, q = k, and r = 0.


To apply the boundary condition (11.24) at the tip of the fin, first write the
general difference equation (11.18) at the right boundary, where i = I + 1, as
follows:

    a_{I+1} u_I + b_{I+1} u_{I+1} + c_{I+1} u_{I+2} = d_{I+1}.                  (11.25)

Observe that u_{I+2} is outside the domain, as x_{I+1} corresponds to x = ℓ; therefore,
it must be eliminated from the above equation. To do so, let us approximate
the boundary condition (11.24) at the tip of the fin as well using a second-order
accurate, central difference for the derivative. This yields

    p u_{I+1} + q (u_{I+2} − u_I)/(2Δx) = r,

which also contains the value u_{I+2} outside the domain. Hence, solving for this
gives

    u_{I+2} = u_I + (2Δx/q)(r − p u_{I+1}).

Substituting into equation (11.25) in order to eliminate the point outside the
domain and collecting terms yields

    q(a_{I+1} + c_{I+1}) u_I + (q b_{I+1} − 2Δx p c_{I+1}) u_{I+1} = q d_{I+1} − 2Δx r c_{I+1}.

This equation is appended to the end of the tridiagonal system of equations to
allow for determination of the additional unknown u_{I+1}.
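A brief sketch of forming that appended row is shown below; the values of p, q, r, Δx, and the coefficients at i = I + 1 are illustrative placeholders.

    p, q, r, dx = 2.0, 1.0, 0.0, 0.1          # assumed Robin data and grid spacing
    aI1, bI1, cI1, dI1 = 1.0, -2.0, 1.0, 0.0  # assumed coefficients of (11.18) at i = I + 1

    a_new = q * (aI1 + cI1)                   # multiplies u_I
    b_new = q * bI1 - 2.0 * dx * p * cI1      # multiplies u_{I+1}
    d_new = q * dI1 - 2.0 * dx * r * cI1      # right-hand side of the appended row
    print(a_new, b_new, d_new)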
Finally, let us consider evaluation of the heat flux at the base of the fin according
to Fourier's law

    q_b = −k A_c(0) (dT/dx)_{x=0} = −k A_c(0) (du/dx)_{x=0}.

In order to evaluate du/dx at the base x = 0, we must use a forward difference.
From equation (11.4) applied at i = 1, we have the first-order accurate forward-
difference approximation

    (du/dx)_{x=0} ≈ (u_2 − u_1)/Δx + O(Δx).

For a more accurate approximation, we may use the second-order accurate
approximation from equation (11.10) applied at i = 1:

    (du/dx)_{x=0} ≈ (−3u_1 + 4u_2 − u_3)/(2Δx) + O(Δx²).

Even higher-order finite-difference approximations may be formed. For example,
the third-order, forward-difference approximation from equation (11.14) is

    (du/dx)_{x=0} ≈ (−11u_1 + 18u_2 − 9u_3 + 2u_4)/(6Δx) + O(Δx³).

Observe that each successive approximation requires one additional point in the
interior of the domain.

12
Finite-Difference Methods for Ordinary Differential Equations
BVPs and IVPs

13
Classification of Second-Order Partial
Differential Equations
Our focus in the remainder of Part II will be development of methods for solving
partial differential equations. As we develop these numerical methods, it is essential that they be faithful to the physical behavior inherent within the various
types of partial differential equations. This behavior is determined by the equation's classification, which depends upon the nature of its characteristics. These are the curves
within the domain along which information propagates in space and/or time. The
nature of the characteristics depends upon the coefficients of the highest-order
derivatives. Owing to their prominence in applications, we focus on second-order
partial differential equations.

13.1 Mathematical Classification


Consider the general second-order partial differential equation for u(x, y) given by

    a u_xx + b u_xy + c u_yy + d u_x + e u_y + f u = g,                          (13.1)

where subscripts denote partial differentiation with respect to the indicated variable. The equation is linear if the coefficients a, b, c, d, e, and f are only functions
of (x, y). If they are functions of (x, y, u, ux , uy ), then the equation is said to be
quasi-linear. In this case, the equation is linear in the highest derivatives, and
any nonlinearity is confined to the lower-order derivatives.
Let us determine the criteria necessary for the existence of a smooth (differentiable)
and unique (single-valued) solution along a characteristic curve C as
illustrated in Figure 13.1. Along C, we define the parametric functions

    φ_1(τ) = u_xx,   φ_2(τ) = u_xy,   φ_3(τ) = u_yy,
                                                                                 (13.2)
    ψ_1(τ) = u_x,    ψ_2(τ) = u_y,

where τ is a variable along a characteristic. Substituting into equation (13.1)
gives (keeping the second-order terms on the left-hand side)

    a φ_1 + b φ_2 + c φ_3 = g − d ψ_1 − e ψ_2 − f u = H.                        (13.3)

Transforming from (x, y) to τ, we have

    d/dτ = (dx/dτ) ∂/∂x + (dy/dτ) ∂/∂y.

Figure 13.1 The characteristic C in a two-dimensional domain along which
the solution propagates.

Thus, from equation (13.2) we have

    dψ_1/dτ = d(u_x)/dτ = (dx/dτ) u_xx + (dy/dτ) u_xy = (dx/dτ) φ_1 + (dy/dτ) φ_2,   (13.4)

and

    dψ_2/dτ = d(u_y)/dτ = (dx/dτ) u_xy + (dy/dτ) u_yy = (dx/dτ) φ_2 + (dy/dτ) φ_3.   (13.5)

Equations (13.3)–(13.5) are three equations for three unknowns, namely the
second-order derivatives φ_1, φ_2, and φ_3. Written in matrix form, they are

    \begin{bmatrix} a & b & c \\ dx/d\tau & dy/d\tau & 0 \\ 0 & dx/d\tau & dy/d\tau \end{bmatrix}
    \begin{bmatrix} \varphi_1 \\ \varphi_2 \\ \varphi_3 \end{bmatrix}
    =
    \begin{bmatrix} H \\ d\psi_1/d\tau \\ d\psi_2/d\tau \end{bmatrix}.
Because the system is non-homogeneous, if the determinant of the coefficient
matrix is not equal to zero, a unique solution exists for the second derivatives
along the curve C. It also can be shown that if the second-order derivatives exist,
then derivatives of all orders exist along C as well, in which case they are smooth.
On the other hand, if the determinant of the coefficient matrix is equal to zero,
then the solution is not unique, and the second derivatives are discontinuous
along C. Setting the determinant equal to zero gives

    a (dy/dτ)² − b (dx/dτ)(dy/dτ) + c (dx/dτ)² = 0,

or multiplying by (dτ/dx)² yields

    a (dy/dx)² − b (dy/dx) + c = 0.

This is a quadratic equation for dy/dx, which is the slope of the characteristic
curve C. Consequently, from the quadratic formula, the slope is

    dy/dx = [b ± √(b² − 4ac)]/(2a).                                             (13.6)


The characteristic curves C of equation (13.1), for which y(x) satisfies (13.6), are
curves along which the second-order derivatives are discontinuous.
Because the characteristics must be real, their behavior is determined by the
sign of b² − 4ac as follows:

    b² − 4ac > 0   2 real roots    2 characteristics    Hyperbolic PDE,
    b² − 4ac = 0   1 real root     1 characteristic     Parabolic PDE,
    b² − 4ac < 0   no real roots   no characteristics   Elliptic PDE.


Remarks:
1. Physically, characteristics are curves along which information propagates in
   the solution.
2. The classification only depends on the coefficients of the highest-order
   derivatives, that is, a, b, and c.
3. The terminology arises from classification of the second-degree algebraic equation
   ax² + bxy + cy² + dx + ey + f = 0, whose graphs are the conic sections.
4. It can be shown that the classification of a partial differential equation is
   independent of the coordinate system (see, for example, Tannehill et al.). For
   example, if it is elliptic in Cartesian coordinates, then it is elliptic in all
   curvilinear coordinate systems.
5. Note that if, from equation (13.1), we let x = x_1, y = x_2, then
   a u_11 + (b/2) u_12 + (b/2) u_21 + c u_22 = H, and we can write the matrix form

       A = \begin{bmatrix} a & b/2 \\ b/2 & c \end{bmatrix},

   which is analogous to quadratic forms in Section 2.2.3. Taking the determinant
   then gives

       det[A] = ac − b²/4,

   or upon rearranging we have

       −4 det[A] = b² − 4ac   { > 0: hyperbolic;  = 0: parabolic;  < 0: elliptic }.

   Thus, the classification of a second-order partial differential equation can be
   determined from the determinant of A.
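A tiny sketch of classifying an equation of the form (13.1) directly from the discriminant b² − 4ac follows; the sample coefficients are illustrative and correspond, with unit parameters, to the wave, diffusion, and Laplace equations discussed below.

    def classify(a, b, c):
        disc = b * b - 4.0 * a * c
        if disc > 0.0:
            return "hyperbolic"
        if disc == 0.0:
            return "parabolic"
        return "elliptic"

    print(classify(-1.0, 0.0, 1.0))   # wave equation form: hyperbolic
    print(classify(1.0, 0.0, 0.0))    # diffusion equation form: parabolic
    print(classify(1.0, 0.0, 1.0))    # Laplace equation: elliptic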
Let us consider each classification of second-order partial differential equation
to see how their respective solutions behave. In particular, take note of the nature
of initial and boundary conditions along with the domain of influence (DoI) and
domain of dependence (DoD) for each type of equation. Each type of equation
will be represented by a canonical partial differential equation representative of
its classification. In general, the coefficients a, b, and c in equation (13.6) can
be functions of the coordinates x and y, in which case the characteristics C are
curves in the two-dimensional domain. For the sake of simplicity, however, let


Figure 13.2 The characteristics of the wave equation indicating the domain of
influence (DoI) and domain of dependence (DoD).

us consider the case when they are constants, such that the characteristics are
straight lines. The conclusions drawn for each type of equation are then naturally
generalizable.

13.2 Hyperbolic Equations


When b² − 4ac > 0 in equation (13.6), the partial differential equation is hyperbolic,
and there are two real roots corresponding to the two characteristics. For
a, b, and c constant, let these two real roots be given by

    dy/dx = λ_1,   dy/dx = λ_2,                                                 (13.7)

where λ_1 and λ_2 are real constants. Then we may integrate to obtain the straight
lines

    y = λ_1 x + x_1,   y = λ_2 x + x_2.

Therefore, the solution propagates along two linear characteristic curves.
For example, consider the wave equation

    ∂²u/∂t² = α² ∂²u/∂x²,

where u(x, t) is the amplitude of the wave, and α is the wave speed within the
medium. In this case, the independent variable y becomes time t, and a = −α²,
b = 0, and c = 1. Comparing with equations (13.6) and (13.7), we have

    λ_1 = 1/α,   λ_2 = −1/α.

Therefore, the characteristics of the wave equation with a, b, and c constant are
straight lines with slopes 1/α and −1/α as shown in Figure 13.2. Take particular
notice of the domains of influence and dependence. The domain of influence of
point P indicates the region within (x, t) space whose solution can influence that


at P. The domain of dependence of point P indicates the region within (x, t)
space whose solution depends on that at P.
The solution to the wave equation with constant wave speed is of the form

    u(x, t) = F_1(x + αt) + F_2(x − αt),

with F_1 being a left-moving traveling wave, and F_2 being a right-moving traveling
wave. Initial conditions are required at, say, t = 0, such as

    u(x, 0) = f(x),   u_t(x, 0) = g(x).

The first is an initial condition on the amplitude, and the second is on its velocity.
Note that no boundary conditions are necessary unless there are boundaries at
finite x.

13.3 Parabolic Equations


When b² − 4ac = 0 in equation (13.6), the partial differential equation is parabolic,
and there is one real root corresponding to a single characteristic. For a, b, and
c constant, the single root is given by

    dy/dx = b/(2a).                                                             (13.8)

Integrating yields

    y = [b/(2a)] x + constant,

which is a straight line. Therefore, the solution propagates along one linear
characteristic direction (usually time).
For example, consider the one-dimensional, unsteady diffusion equation

    ∂u/∂t = α ∂²u/∂x²,

which governs unsteady heat conduction, for example. Here, u(x, t) is the quantity
undergoing diffusion (for example, temperature), and α is the diffusivity of the
medium. Again, the independent variable y becomes time t, and a = α and
b = c = 0. Because b = 0 in equation (13.8), the characteristics are lines of
constant t, corresponding to a solution that marches forward in time as illustrated
in Figure 13.3. Observe that the DoI is every position and time prior to the current
time, and the DoD is every position and time subsequent to the current time.
Initial and boundary conditions are required, such as

    u(x, 0) = u_0(x),   u(x_1, t) = f(t),   u(x_2, t) = g(t),

which can be interpreted as the initial temperature throughout the domain and
the temperatures at the boundaries in the context of unsteady heat conduction.


Figure 13.3 The characteristic of the unsteady diffusion equation indicating the
domain of influence (DoI) and domain of dependence (DoD).

Figure 13.4 The domain of influence (DoI) and domain of dependence (DoD) for
an elliptic partial differential equation.

13.4 Elliptic Equations


When b² − 4ac < 0 in equation (13.6), the partial differential equation is elliptic,
and there are no real roots or characteristics. Consequently, disturbances
have infinite speed of propagation in all directions. In other words, a disturbance
anywhere affects the solution everywhere instantaneously. As illustrated in
Figure 13.4, the DoI and DoD are the entire domain.
For example, consider the Laplace equation

    ∂²u/∂x² + ∂²u/∂y² = 0,
which governs steady heat conduction, potential fluid flow, electrostatic fields, etc.
In this case, a = 1, b = 0, and c = 1. Because of the instantaneous propagation of
the solution in all directions, elliptic equations require a global solution strategy,
and the boundary conditions, which typically specify u or its normal derivative at
the boundary, must be specified on a closed contour bounding the entire domain.


Figure 13.5 Transonic fluid flow past an airfoil with distinct regions where
the equation is elliptic, parabolic, and hyperbolic.

13.5 Mixed Equations


If a, b, and c are variable coefficients, then b² − 4ac may change sign with space
and/or time, in which case the character of the partial differential equation may
be different in certain regions. For example, consider transonic fluid flow that
occurs near the speed of sound of the medium. The governing equation for two-
dimensional, steady, compressible, potential flow about a slender body is

    (1 − M²) ∂²φ/∂x² + ∂²φ/∂y² = 0,

where φ is the velocity potential, and M is the local Mach number. Here,
a = 1 − M², b = 0, and c = 1. To determine the nature of the equation, observe
that

    b² − 4ac = 0 − 4(1 − M²)(1) = −4(1 − M²);

therefore,

    M < 1 (subsonic)     b² − 4ac < 0   Elliptic,
    M = 1 (sonic)        b² − 4ac = 0   Parabolic,
    M > 1 (supersonic)   b² − 4ac > 0   Hyperbolic.

For example, transonic flow past an airfoil is illustrated in Figure 13.5. Observe
that in this case, all three types of equations are present, each one exhibiting very
different physical behavior. In the above example, we have the same equation, but
different behavior in various regions. In some cases, we have different equations
in different regions of the domain owing to the local governing physics.

14
Finite-Difference Methods for Elliptic
Partial Differential Equations
Recall from the previous chapter that the canonical second-order, elliptic partial differential equation is the Laplace equation, which in two-dimensional Cartesian coordinates is
\[
\frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} = 0. \tag{14.1}
\]
The non-homogeneous version of the Laplace equation, that is,
\[
\frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} = f(x, y), \tag{14.2}
\]

is called the Poisson equation. Also recall that elliptic problems have no preferred direction of propagation; therefore, they require a global solution strategy and boundary conditions on a closed contour surrounding the entire domain as illustrated in Figure 14.1; that is, elliptic problems are essentially boundary-value problems (whereas parabolic and hyperbolic problems are initial-value problems). The types of boundary conditions include Dirichlet, in which the values of $\phi$ on the boundary are specified; Neumann, in which the normal derivative of $\phi$ is specified on the boundary; and Robin, or mixed, in which a linear combination of $\phi$ and its normal derivative is specified along the boundary. In the context of heat transfer, for example, a Dirichlet condition corresponds to an isothermal boundary condition, a Neumann condition corresponds to a specified heat flux, and a Robin condition arises from a convection condition at the boundary (see, for example, Section 11.5). Combinations of the above boundary conditions may be applied on different portions of the boundary as long as some boundary condition is applied at every point along the boundary contour.
Remarks:
1. Solutions to the Laplace and Poisson equations with Neumann boundary conditions on the entire boundary can only be determined relative to an unknown constant; that is, $\phi(x, y) + c$ is also a solution. We will need to take this into account when devising numerical algorithms.
2. A linear elliptic problem consists of a linear equation and linear boundary conditions, whereas a nonlinear problem is one in which the equation and/or the boundary conditions are nonlinear. For example, the Laplace or Poisson equation with Dirichlet, Neumann, or Robin boundary conditions is linear. An example of a nonlinear boundary condition that would render a problem nonlinear is a radiation boundary condition, such as
\[
\frac{\partial \phi}{\partial n} = D\left(\phi^4 - \phi_{sur}^4\right),
\]
where n represents the outward-facing normal to the boundary. The nonlinearity arises owing to the fourth power on the temperature.

Figure 14.1 Schematic of a two-dimensional elliptic problem with boundary conditions around the entire closed domain.

Figure 14.2 Schematic of the Poisson equation in a two-dimensional rectangular domain.

14.1 Finite-Difference Methods for the Poisson Equation


As a representative elliptic equation, which is important in its own right, consider the two-dimensional Poisson equation in Cartesian coordinates (14.2) as
illustrated in Figure 14.3. Recall that if f (x, y) = 0, we have the Laplace equation.
In order to define the grid, the x domain $0 \le x \le a$ is divided into I equal subintervals of length $\Delta x$, with $x_i = \Delta x\,(i - 1)$. Likewise, the y domain $0 \le y \le b$ is divided into J equal subintervals of length $\Delta y$, with $y_j = \Delta y\,(j - 1)$. See Figure 14.3 and note that the domain has been discretized into a two-dimensional grid intersecting at $(I + 1) \times (J + 1)$ points.

Figure 14.3 Schematic of the two-dimensional finite-difference grid.

Figure 14.4 Finite-difference stencil for second-order accurate approximations.
Consider second-order accurate, central difference approximations to the derivatives in equation (14.2) at a typical point (i, j); the five-point finite-difference stencil is shown in Figure 14.4. With $\phi_{i,j} = \phi(x_i, y_j)$, the second-order accurate, central difference approximations are given by
\[
\frac{\partial^2 \phi}{\partial x^2} = \frac{\phi_{i+1,j} - 2\phi_{i,j} + \phi_{i-1,j}}{\Delta x^2} + O(\Delta x^2),
\qquad
\frac{\partial^2 \phi}{\partial y^2} = \frac{\phi_{i,j+1} - 2\phi_{i,j} + \phi_{i,j-1}}{\Delta y^2} + O(\Delta y^2).
\]
Substituting into (14.2), multiplying by $(\Delta x)^2$, and collecting terms gives the final form of the finite-difference equation
\[
\phi_{i+1,j} - 2\left[1 + \left(\frac{\Delta x}{\Delta y}\right)^2\right]\phi_{i,j} + \phi_{i-1,j} + \left(\frac{\Delta x}{\Delta y}\right)^2\left(\phi_{i,j+1} + \phi_{i,j-1}\right) = \Delta x^2 f_{i,j}, \tag{14.3}
\]
which results in a system of $(I + 1) \times (J + 1)$ equations for the $(I + 1) \times (J + 1)$ unknowns.
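As a quick check of equation (14.3), the following short Python sketch (illustrative, not from the text) applies the five-point formula to the manufactured solution $\phi = \sin(\pi x)\sin(\pi y)$, for which $f = -2\pi^2\sin(\pi x)\sin(\pi y)$, and confirms that the residual behaves as $O(\Delta x^2)$:

```python
import numpy as np

# Apply the five-point formula (14.3) to an exact solution of the Poisson
# equation and verify second-order accuracy of the discretization.
for I in (16, 32, 64):
    J = I
    x = np.linspace(0.0, 1.0, I + 1)
    y = np.linspace(0.0, 1.0, J + 1)
    dx, dy = x[1] - x[0], y[1] - y[0]
    X, Y = np.meshgrid(x, y, indexing="ij")
    phi = np.sin(np.pi * X) * np.sin(np.pi * Y)
    f = -2.0 * np.pi**2 * phi
    beta = (dx / dy)**2
    res = (phi[2:, 1:-1] - 2.0 * (1.0 + beta) * phi[1:-1, 1:-1] + phi[:-2, 1:-1]
           + beta * (phi[1:-1, 2:] + phi[1:-1, :-2]) - dx**2 * f[1:-1, 1:-1])
    # residual divided by dx^2 approximates the truncation error, which
    # should decrease by roughly a factor of four with each refinement
    print(I, np.max(np.abs(res)) / dx**2)
```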
As described in Chapter 6, there are two classes of techniques for solving such systems of equations:
1. Direct Methods:
• A set procedure with a predefined number of steps that leads to the solution of the system of algebraic equations, for example, Gaussian elimination.
• No iterative convergence errors.
• Efficient for certain types of linear systems, for example, tridiagonal, where we use the Thomas algorithm, and block-tridiagonal.
• Less efficient for large, dense systems of equations, which typically require $O(N^3)$ operations for N unknowns. Thus, if we double the number of points in both directions of a two-dimensional problem, the number of unknowns increases by a factor of four, and the number of operations increases by approximately a factor of sixty-four.
• Typically cannot be adapted to nonlinear problems.
2. Iterative (Relaxation) Methods:
• Beginning with an initial guess, an iterative process is repeated that (hopefully) results in successively closer approximations to the exact solution of the system of algebraic equations.
• Introduces iterative convergence errors.
• Generally more efficient for large, sparse systems of equations. In particular, they allow us to get closer to $O(N^2)$ operations for an $N \times N$ grid, such that the operation count scales with the total number of points, as for the Thomas algorithm.
• Can be applied to linear and nonlinear problems.
In general, the method for solving the system $A\phi = d$ depends upon the form of the coefficient matrix A. If the matrix is full and dense, then Gaussian elimination, LU decomposition, etc. are required; these are very expensive computationally. If the matrix is sparse and structured, for example, banded with only a small number of non-zero diagonals, then the systems that result from discretization of certain classes of problems can often be solved using more efficient algorithms. For example, tridiagonal systems may be solved very effectively using the Thomas algorithm. We will add two additional methods to our arsenal of direct methods: the Fast Fourier Transform (FFT) and cyclic reduction are generally the fastest methods for problems to which they apply, for example, block-tridiagonal matrices.
Although these direct methods remain in widespread use in many applications, very large systems involving thousands or millions of unknowns demand methods that display improved scaling properties and have more modest storage requirements. This brings us to iterative methods as introduced in Section 6.2. In order to gain something, improved scalability in this case, there is almost always a price to pay. In the case of iterative methods, the price is that we will no longer obtain the exact solution of the algebraic system of equations that results from our numerical discretization. Instead, we will carry out the iteration process until an approximate solution is obtained having an acceptably small amount of iterative convergence error. As in Section 6.2, we will begin with a discussion of the Jacobi, Gauss-Seidel, and SOR methods. This will be followed by the alternating-direction implicit (ADI) method, which improves even more on scalability properties. Finally, the multigrid framework will be described, which accelerates the underlying iteration technique, such as Gauss-Seidel, and results in the fastest and most flexible algorithms currently available.

Figure 14.5 Conversion of the two-dimensional indices (i, j) into the one-dimensional index n.

14.2 Direct Methods for Linear Systems


Before describing the iteration methods that are our primary focus owing to their improved scalability, let us describe two direct methods that are particularly efficient when they apply. Repeating the difference equation (14.3) for the Poisson equation, we have
\[
\phi_{i+1,j} - 2(1 + \beta)\phi_{i,j} + \phi_{i-1,j} + \beta\left(\phi_{i,j+1} + \phi_{i,j-1}\right) = \Delta x^2 f_{i,j}, \tag{14.4}
\]
where $\beta = (\Delta x/\Delta y)^2$, and $i = 1, \ldots, I + 1$; $j = 1, \ldots, J + 1$. In order to write the system of difference equations in a convenient matrix form $A\phi = d$, let us renumber the two-dimensional mesh¹ (i, j) into a one-dimensional array (n) so that $\phi$ is a vector rather than a matrix as illustrated in Figure 14.5. Thus, the relationship between the one-dimensional and two-dimensional indices is
\[
n = (i - 1)(J + 1) + j,
\]
where $i = 1, \ldots, I + 1$, $j = 1, \ldots, J + 1$, and $n = 1, \ldots, (I + 1)(J + 1)$. Therefore, our five-point finite-difference stencil becomes that shown in Figure 14.6, and the finite-difference equation (14.4) for the Poisson equation becomes (with $\Delta = \Delta x = \Delta y$; therefore, $\beta = 1$)
\[
\phi_{n+J+1} + \phi_{n-(J+1)} + \phi_{n+1} + \phi_{n-1} - 4\phi_n = \Delta^2 f_n, \tag{14.5}
\]
where $n = 1, \ldots, (I + 1)(J + 1)$.

¹ The terms grid and mesh are used interchangeably.

Figure 14.6 Finite-difference stencil with two- and one-dimensional indices.

Thus, the system of equations $A\phi = d$ is of the form
\[
\begin{bmatrix}
D & I & 0 & \cdots & 0 & 0 \\
I & D & I & \cdots & 0 & 0 \\
0 & I & D & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & D & I \\
0 & 0 & 0 & \cdots & I & D
\end{bmatrix}
\begin{bmatrix}
\phi_1 \\ \phi_2 \\ \phi_3 \\ \vdots \\ \phi_{(I+1)(J+1)-1} \\ \phi_{(I+1)(J+1)}
\end{bmatrix}
=
\begin{bmatrix}
d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_{(I+1)(J+1)-1} \\ d_{(I+1)(J+1)}
\end{bmatrix}.
\]
The coefficient matrix A is a block-tridiagonal matrix with $(I + 1)$ blocks in both directions, where each block is a $(J + 1) \times (J + 1)$ matrix, and d contains all known information from the boundary conditions. Recall that 0 is a zero matrix block, I is an identity matrix block, and the tridiagonal blocks are
\[
D =
\begin{bmatrix}
-4 & 1 & 0 & \cdots & 0 & 0 \\
1 & -4 & 1 & \cdots & 0 & 0 \\
0 & 1 & -4 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -4 & 1 \\
0 & 0 & 0 & \cdots & 1 & -4
\end{bmatrix}.
\]
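For illustration, a system with this block-tridiagonal structure can be assembled and solved directly with a sparse solver. The sketch below (a minimal example, not the text's implementation) builds the five-point matrix for interior unknowns with homogeneous Dirichlet data using Kronecker products of one-dimensional second-difference matrices; the row ordering differs from the (i, j) → n mapping above, but the matrix structure is the same.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def poisson_direct(f, delta):
    """Assemble and solve the block-tridiagonal system of equation (14.5)
    for interior unknowns only, with zero Dirichlet data on the boundary
    (illustrative sketch)."""
    ny, nx = f.shape                                  # interior points in y and x
    Tx = sp.diags([np.ones(nx - 1), -2.0 * np.ones(nx), np.ones(nx - 1)], [-1, 0, 1])
    Ty = sp.diags([np.ones(ny - 1), -2.0 * np.ones(ny), np.ones(ny - 1)], [-1, 0, 1])
    Ix, Iy = sp.identity(nx), sp.identity(ny)
    A = sp.kron(Iy, Tx) + sp.kron(Ty, Ix)             # block-tridiagonal 5-point Laplacian
    d = delta**2 * f.ravel()                          # right-hand side, row-major ordering
    phi = spsolve(A.tocsr(), d)
    return phi.reshape(ny, nx)
```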

14.2.1 Fourier Transform Methods


The Fourier transform of a continuous, one-dimensional function $\phi(x)$ is defined by
\[
\hat{\phi}(\omega) = \int_{-\infty}^{\infty} \phi(x)\, e^{-2\pi i \omega x}\, dx,
\]
where $\omega$ is the frequency, and i is the imaginary number. The inverse transform is
\[
\phi(x) = \int_{-\infty}^{\infty} \hat{\phi}(\omega)\, e^{2\pi i \omega x}\, d\omega.
\]

Consider the discrete form of the Fourier transform in which we have K values of $\phi(x)$ at discrete points defined by
\[
\phi_k = \phi(x_k), \qquad x_k = k\Delta, \qquad k = 0, 1, 2, \ldots, K - 1,
\]
where $\Delta$ is the step size in the data.


After taking the Fourier transform, we will have K discrete points in the frequency domain defined by
\[
\omega_m = \frac{m}{K\Delta}, \qquad m = -\frac{K}{2}, \ldots, \frac{K}{2},
\]
corresponding to the Nyquist critical frequency range. Thus, the discrete Fourier transform is
\[
\hat{\phi}(\omega_m) = \int_{-\infty}^{\infty} \phi(x)\, e^{-2\pi i \omega_m x}\, dx,
\]
or, approximating the integral by summing over each interval,
\[
\hat{\phi}(\omega_m) \approx \Delta \sum_{k=0}^{K-1} \phi_k\, e^{-2\pi i \omega_m x_k}.
\]
Using the fact that $\omega_m x_k = (m/(K\Delta))\,k\Delta = mk/K$ gives
\[
\hat{\phi}(\omega_m) \approx \Delta \sum_{k=0}^{K-1} \phi_k\, e^{-2\pi i k m/K}.
\]
Denoting the summation in the previous expression by $\hat{\phi}_m$, we write
\[
\hat{\phi}(\omega_m) \approx \Delta\,\hat{\phi}_m, \qquad \hat{\phi}_m = \sum_{k=0}^{K-1} \phi_k\, e^{-2\pi i k m/K}. \tag{14.6}
\]
The discrete inverse Fourier transform is then
\[
\phi_k = \frac{1}{K} \sum_{m=-K/2}^{K/2} \hat{\phi}_m\, e^{2\pi i k m/K}. \tag{14.7}
\]

The above expressions are all for a one-dimensional function. For a two-dimensional spatial grid, we define the physical grid $(x_k, y_l)$ with
\[
k = 0, \ldots, K - 1, \qquad l = 0, \ldots, L - 1.
\]
The transform (frequency) space is then
\[
m = -\frac{K}{2}, \ldots, \frac{K}{2}, \qquad n = -\frac{L}{2}, \ldots, \frac{L}{2}.
\]
Analogous to equation (14.6), the two-dimensional discrete Fourier transform of $\phi(x, y)$ is
\[
\hat{\phi}(\omega_m, \omega_n) \approx \Delta^2\,\hat{\phi}_{m,n}, \qquad
\hat{\phi}_{m,n} = \sum_{k=0}^{K-1}\sum_{l=0}^{L-1} \phi_{k,l}\, e^{-2\pi i m k/K}\, e^{-2\pi i n l/L}. \tag{14.8}
\]

This is equivalent to taking the Fourier transform in each direction. Analogous to equation (14.7), the inverse Fourier transform is
\[
\phi_{k,l} = \frac{1}{KL} \sum_{m=-K/2}^{K/2} \sum_{n=-L/2}^{L/2} \hat{\phi}_{m,n}\, e^{2\pi i k m/K}\, e^{2\pi i l n/L}. \tag{14.9}
\]
Now let us apply this to the discretized Poisson equation (14.4). For simplicity, set $\Delta = \Delta x = \Delta y$, giving (with $i \rightarrow k$, $j \rightarrow l$)
\[
\phi_{k+1,l} + \phi_{k-1,l} + \phi_{k,l+1} + \phi_{k,l-1} - 4\phi_{k,l} = \Delta^2 f_{k,l}. \tag{14.10}
\]
Substituting equation (14.9) into equation (14.10) leads to the following equation for each Fourier mode
\[
\hat{\phi}_{m,n}\left[ e^{2\pi i (k+1) m/K} e^{2\pi i l n/L} + e^{2\pi i (k-1) m/K} e^{2\pi i l n/L} + e^{2\pi i k m/K} e^{2\pi i (l+1) n/L} + e^{2\pi i k m/K} e^{2\pi i (l-1) n/L} - 4 e^{2\pi i k m/K} e^{2\pi i l n/L} \right]
= \Delta^2 \hat{f}_{m,n}\, e^{2\pi i k m/K} e^{2\pi i l n/L},
\]
where $\hat{f}_{m,n}$ is the Fourier transform of the right-hand side $f_{k,l}$. Canceling the common factor $e^{2\pi i k m/K} e^{2\pi i l n/L}$, we have
\[
\hat{\phi}_{m,n}\left[ e^{2\pi i m/K} + e^{-2\pi i m/K} + e^{2\pi i n/L} + e^{-2\pi i n/L} - 4 \right] = \Delta^2 \hat{f}_{m,n}.
\]
Recalling that $\cos(ax) = \tfrac{1}{2}\left(e^{iax} + e^{-iax}\right)$ gives
\[
\hat{\phi}_{m,n}\left[ 2\cos\left(\frac{2\pi m}{K}\right) + 2\cos\left(\frac{2\pi n}{L}\right) - 4 \right] = \Delta^2 \hat{f}_{m,n}.
\]
Solving for the Fourier transform $\hat{\phi}_{m,n}$ leads to
\[
\hat{\phi}_{m,n} = \frac{\Delta^2 \hat{f}_{m,n}}{2\left[\cos\left(\dfrac{2\pi m}{K}\right) + \cos\left(\dfrac{2\pi n}{L}\right) - 2\right]}, \tag{14.11}
\]
for $m = 1, \ldots, K - 1$; $n = 1, \ldots, L - 1$.
Therefore, the procedure to solve the difference equation (14.10) using Fourier transform methods is as follows:
1) Compute the Fourier transform $\hat{f}_{m,n}$ of the right-hand side $f_{k,l}$ using
\[
\hat{f}_{m,n} = \sum_{k=0}^{K-1}\sum_{l=0}^{L-1} f_{k,l}\, e^{-2\pi i m k/K}\, e^{-2\pi i n l/L}, \tag{14.12}
\]
which is similar to equation (14.8).
2) Compute the Fourier transform of the solution $\hat{\phi}_{m,n}$ from equation (14.11).
3) Compute the solution $\phi_{k,l}$ using the inverse Fourier transform (14.9).
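A minimal numpy sketch of this three-step procedure, assuming periodic boundary conditions and $\Delta = \Delta x = \Delta y$ as in equation (14.10); the function name and interface are illustrative:

```python
import numpy as np

def poisson_fft_periodic(f, delta):
    """Solve the five-point discrete Poisson equation (14.10) with periodic
    boundary conditions using the discrete Fourier transform."""
    K, L = f.shape
    fhat = np.fft.fft2(f)                     # Step 1: transform of the right-hand side (14.12)
    m = np.arange(K).reshape(-1, 1)           # wavenumber index in x
    n = np.arange(L).reshape(1, -1)           # wavenumber index in y
    denom = 2.0 * (np.cos(2.0 * np.pi * m / K) + np.cos(2.0 * np.pi * n / L)) - 4.0
    denom[0, 0] = 1.0                         # avoid division by zero for the mean mode
    phihat = delta**2 * fhat / denom          # Step 2: equation (14.11)
    phihat[0, 0] = 0.0                        # fix the arbitrary additive constant
    return np.real(np.fft.ifft2(phihat))      # Step 3: inverse transform (14.9)
```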
Remarks:
1. The above procedure works for periodic boundary conditions, such that the solution satisfies
\[
\phi_{k,l} = \phi_{k+K,l} = \phi_{k,l+L}.
\]
For Dirichlet boundary conditions, we use the Fourier sine transform, and for Neumann boundary conditions, we use the Fourier cosine transform.
2. In practice, the Fourier (and inverse) transforms are computed using a Fast Fourier Transform (FFT) technique (see, for example, Numerical Recipes).
3. Fourier transform methods can only be used for partial differential equations with constant coefficients in the direction(s) in which the Fourier transform is applied.
4. We use Fourier transforms to solve the difference equation, not the differential equation; therefore, this is not a spectral method (see Section 17.3).

Figure 14.7 Two-dimensional grid for cyclic reduction.

14.2.2 Cyclic Reduction


Again, consider the Poisson equation
\[
\frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} = f(x, y)
\]
discretized on a two-dimensional grid with $\Delta = \Delta x = \Delta y$ and $i = 0, \ldots, I$; $j = 0, \ldots, J$, where $I = 2^m$ with integer m. This results in $(I + 1) \times (J + 1)$ points as shown in Figure 14.7.
Applying central differences to the Poisson equation, the difference equation for constant x-lines becomes
\[
u_{i-1} - 2u_i + u_{i+1} + B^0 u_i = \Delta^2 f_i, \qquad i = 0, \ldots, I, \tag{14.13}
\]

where
\[
u_i = \begin{bmatrix} \phi_{i,0} \\ \phi_{i,1} \\ \phi_{i,2} \\ \vdots \\ \phi_{i,J-1} \\ \phi_{i,J} \end{bmatrix},
\qquad
f_i = \begin{bmatrix} f_{i,0} \\ f_{i,1} \\ f_{i,2} \\ \vdots \\ f_{i,J-1} \\ f_{i,J} \end{bmatrix},
\qquad
B^0 = \begin{bmatrix}
-2 & 1 & 0 & \cdots & 0 & 0 \\
1 & -2 & 1 & \cdots & 0 & 0 \\
0 & 1 & -2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -2 & 1 \\
0 & 0 & 0 & \cdots & 1 & -2
\end{bmatrix}.
\]

The first three terms in equation (14.13) correspond to the central difference in the x-direction, and the fourth term corresponds to the central difference in the y-direction (see $B^0$). Taking $B = -2I + B^0$, where I is the identity matrix, equation (14.13) becomes
\[
u_{i-1} + B u_i + u_{i+1} = \Delta^2 f_i, \qquad i = 0, \ldots, I, \tag{14.14}
\]
where the $(J + 1) \times (J + 1)$ matrix B is
\[
B = \begin{bmatrix}
-4 & 1 & 0 & \cdots & 0 & 0 \\
1 & -4 & 1 & \cdots & 0 & 0 \\
0 & 1 & -4 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -4 & 1 \\
0 & 0 & 0 & \cdots & 1 & -4
\end{bmatrix}.
\]

Note that equation (14.14) corresponds to the block-tridiagonal matrix for equation (14.5), where B is the tridiagonal portion and the coefficients of the $u_{i-1}$ and $u_{i+1}$ terms are the fringes. Writing three successive equations of (14.14) for $i - 1$, i, and $i + 1$ gives
\[
\begin{aligned}
u_{i-2} + B u_{i-1} + u_i &= \Delta^2 f_{i-1}, \\
u_{i-1} + B u_i + u_{i+1} &= \Delta^2 f_i, \\
u_i + B u_{i+1} + u_{i+2} &= \Delta^2 f_{i+1}.
\end{aligned}
\]
Multiplying the middle equation by $-B$ and adding all three gives
\[
u_{i-2} + B^* u_i + u_{i+2} = \Delta^2 f_i^*, \tag{14.15}
\]
where
\[
B^* = 2I - B^2, \qquad f_i^* = f_{i-1} - B f_i + f_{i+1}.
\]
This is an equation of the same form as (14.14); therefore, applying this procedure to all even-numbered i equations in (14.14) reduces the number of equations by a factor of two. This cyclic reduction procedure can be repeated recursively until a single equation remains for the middle line of variables $u_{I/2}$, which is tridiagonal. This is why, in the x-direction, $I = 2^m$ with integer m.
Using the solution for $u_{I/2}$, solutions for all other i are obtained by successively solving the tridiagonal problems at each level in reverse as illustrated in Figure 14.8. This results in a total of I tridiagonal problems to obtain $u_i$, $i = 0, \ldots, I$.

Figure 14.8 Cyclic reduction pattern.
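The one-level reduction can be verified algebraically with a short script (illustrative only). It builds a random set of line vectors satisfying (14.14) with $\Delta = 1$ and confirms that the reduced equation (14.15) holds with $B^* = 2I - B^2$ and $f_i^* = f_{i-1} - B f_i + f_{i+1}$:

```python
import numpy as np

# Algebraic check of one level of cyclic reduction (delta = 1, random data).
rng = np.random.default_rng(0)
Jp1 = 5
B = -4.0 * np.eye(Jp1) + np.diag(np.ones(Jp1 - 1), 1) + np.diag(np.ones(Jp1 - 1), -1)
u = {k: rng.standard_normal(Jp1) for k in range(-2, 3)}        # u_{i-2}, ..., u_{i+2}
f = {k: u[k - 1] + B @ u[k] + u[k + 1] for k in (-1, 0, 1)}    # equation (14.14)
Bstar = 2.0 * np.eye(Jp1) - B @ B
fstar = f[-1] - B @ f[0] + f[1]
print(np.allclose(u[-2] + Bstar @ u[0] + u[2], fstar))         # prints True
```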
Remarks:
1. The number of grid points in the direction in which cyclic reduction is applied, x in the above case, must be an integer power of two.
2. The speeds of the FFT and cyclic reduction methods are comparable, with FFT being a bit faster.
3. Cyclic reduction may be applied to somewhat more general equations, such as those with variable coefficients.
4. One can accelerate the solution by taking the FFT in the direction with constant coefficients and using cyclic reduction in the other.
5. These so-called fast Poisson solvers require $O(N \log N)$ operations, where $N = I \times J$.
6. This algorithm is not (easily) implemented in parallel.

14.3 Iterative (Relaxation) Methods


Returning to the difference equation (14.3) for the Poisson equation (14.2), recall that we have
\[
\phi_{i+1,j} - 2(1 + \beta)\phi_{i,j} + \phi_{i-1,j} + \beta\left(\phi_{i,j+1} + \phi_{i,j-1}\right) = \Delta x^2 f_{i,j}, \tag{14.16}
\]
where $\beta = (\Delta x/\Delta y)^2$. This may be written in general form as
\[
L\phi = f, \tag{14.17}
\]
where L is the finite-difference operator for the particular problem. Iterative methods consist of beginning with an initial guess and iteratively relaxing equation (14.17) until convergence.


14.3.1 Jacobi Method


Recall from Section 6.2.5 that the difference equation is an explicit equation for $\phi_{i,j}$ at each grid point in the domain for the Jacobi method. The values of $\phi_{i,j}$ on the right-hand side are then taken from the previous iteration, denoted by superscript n, while the single value on the left-hand side is taken at the current iteration, denoted by superscript $n + 1$.² Solving equation (14.16) for $\phi_{i,j}$ yields
\[
\phi_{i,j}^{n+1} = \frac{1}{2(1 + \beta)}\left[\phi_{i+1,j}^{n} + \phi_{i-1,j}^{n} + \beta\left(\phi_{i,j+1}^{n} + \phi_{i,j-1}^{n}\right) - \Delta x^2 f_{i,j}\right]. \tag{14.18}
\]
The procedure is as follows:
1. Provide an initial guess $\phi_{i,j}^{0}$ for $\phi_{i,j}$ at each point $i = 1, \ldots, I + 1$, $j = 1, \ldots, J + 1$.
2. Relax (iterate) by applying equation (14.18) at each grid point to produce successive approximations:
\[
\phi_{i,j}^{1}, \ \phi_{i,j}^{2}, \ \ldots, \ \phi_{i,j}^{n}, \ldots.
\]
3. Continue until the iteration process converges, which is determined, for example, by the requirement that
\[
\frac{\max\left|\phi_{i,j}^{n+1} - \phi_{i,j}^{n}\right|}{\max\left|\phi_{i,j}^{n+1}\right|} < \epsilon,
\]
where $\epsilon$ is a small tolerance value set by the user.
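A minimal Python sketch of the Jacobi procedure above, assuming Dirichlet values are already stored in the boundary entries of the initial guess (the interface is illustrative):

```python
import numpy as np

def jacobi_poisson(f, dx, dy, phi0, tol=1e-6, max_iter=10000):
    """Jacobi relaxation of equation (14.18); axis 0 is i (x) and axis 1 is j (y)."""
    beta = (dx / dy)**2
    phi = phi0.copy()
    for _ in range(max_iter):
        phi_new = phi.copy()
        # update interior points only; boundary values remain fixed (Dirichlet)
        phi_new[1:-1, 1:-1] = (phi[2:, 1:-1] + phi[:-2, 1:-1]
                               + beta * (phi[1:-1, 2:] + phi[1:-1, :-2])
                               - dx**2 * f[1:-1, 1:-1]) / (2.0 * (1.0 + beta))
        # iterative convergence criterion from Step 3
        if np.max(np.abs(phi_new - phi)) < tol * np.max(np.abs(phi_new)):
            return phi_new
        phi = phi_new
    return phi
```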
Remarks:
1. Iterative convergence is too slow; therefore, the Jacobi method is not used in practice.
2. Requires that $\phi_{i,j}^{n+1}$ and $\phi_{i,j}^{n}$ be stored for all $i = 1, \ldots, I + 1$, $j = 1, \ldots, J + 1$.
3. Used as a basis for comparison with the other methods to follow.
Although we do not actually implement iterative methods in the following form, it is instructive to view them in the matrix form $A\mathbf{x} = \mathbf{c}$ in order to analyze their convergence properties. Recall from Section 6.2.5 that the coefficient matrix is split such that $A = M_1 - M_2$, where $M_1 = D$ and $M_2 = L + U$ for the Jacobi method. Then the iteration matrix is
\[
M = M_1^{-1} M_2 = D^{-1}(L + U).
\]
In addition to checking for iterative convergence, which requires that $\rho < 1$ for the iteration matrix M, note that a smaller spectral radius results in more rapid convergence, that is, in fewer iterations.

² Note that we have switched to n being the iteration number, rather than r, and the parentheses have been removed from the superscripted iteration number as there is no confusion with powers in the present context.

Similar to Section 6.2.5, it can be shown for equation (14.18) with $\beta = 1$ ($\Delta x = \Delta y$) and Dirichlet boundary conditions that
\[
\rho_{Jac}(I, J) = \frac{1}{2}\left[\cos\left(\frac{\pi}{I + 1}\right) + \cos\left(\frac{\pi}{J + 1}\right)\right].
\]
If $I = J$ and I is large, then from the Taylor series for cosine
\[
\rho_{Jac}(I) = \cos\left(\frac{\pi}{I + 1}\right) = 1 - \frac{1}{2}\left(\frac{\pi}{I + 1}\right)^2 + \cdots; \tag{14.19}
\]
therefore, as $I \rightarrow \infty$, $\rho_{Jac}(I) \rightarrow 1$. As a result, we observe slower convergence as I is increased. In other words, there is a disproportionate increase in computational time as I is increased. This is why the Jacobi method is not used in practice.
Note that in the subsequent discussion of the Gauss-Seidel and SOR methods, reference to the "model problem" consists of solving the Poisson equation with Dirichlet boundary conditions using second-order accurate central differences with $\Delta x = \Delta y$ and $I = J$.

Figure 14.9 Sweeping directions for Gauss-Seidel iteration.
14.3.2 Gauss-Seidel Method
Recall that the Gauss-Seidel method improves on the Jacobi method by using the most recently updated information. This not only eliminates the need to store the full solution array at both the previous and current iterations, it also speeds up the iteration process substantially. For example, consider sweeping through the grid with increasing i and j as illustrated in Figure 14.9. Hence, when updating $\phi_{i,j}$, the points $\phi_{i,j-1}$ and $\phi_{i-1,j}$ have already been updated during the current iteration, and using these updated values, equation (14.18) is changed to
\[
\phi_{i,j}^{n+1} = \frac{1}{2(1 + \beta)}\left[\phi_{i+1,j}^{n} + \phi_{i-1,j}^{n+1} + \beta\left(\phi_{i,j+1}^{n} + \phi_{i,j-1}^{n+1}\right) - \Delta x^2 f_{i,j}\right]. \tag{14.20}
\]
The values of $\phi_{i,j}$ are all stored in the same array, and it is not necessary to distinguish between the nth and $(n + 1)$st iterates. We simply use the most recently updated information.
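Because the updates are made in place, a Gauss-Seidel sweep of equation (14.20) is only a small change from the Jacobi sketch above (again an illustrative sketch assuming Dirichlet boundaries):

```python
def gauss_seidel_sweep(phi, f, dx, dy):
    """One Gauss-Seidel sweep of equation (14.20); phi is updated in place,
    so the most recently computed neighbours are used automatically."""
    beta = (dx / dy)**2
    ni, nj = phi.shape
    for i in range(1, ni - 1):
        for j in range(1, nj - 1):
            phi[i, j] = (phi[i + 1, j] + phi[i - 1, j]
                         + beta * (phi[i, j + 1] + phi[i, j - 1])
                         - dx**2 * f[i, j]) / (2.0 * (1.0 + beta))
    return phi
```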

In matrix form, the Gauss-Seidel method consists of $M_1 = D - L$ and $M_2 = U$, in which case the iteration matrix is
\[
M = M_1^{-1} M_2 = (D - L)^{-1} U.
\]
As in Section 6.2.6, it can be shown that for our model problem
\[
\rho_{GS} = \rho_{Jac}^2;
\]
therefore, from equation (14.19), the spectral radius is
\[
\rho_{GS}(I) = \rho_{Jac}^2(I) = \left[1 - \frac{1}{2}\left(\frac{\pi}{I + 1}\right)^2 + \cdots\right]^2 = 1 - \left(\frac{\pi}{I + 1}\right)^2 + \cdots. \tag{14.21}
\]

Consequently, the rate of convergence is twice as fast as that for the Jacobi
method for large I, that is, the Gauss-Seidel method requires one-half the iterations for the same level of accuracy. Recall from Section 6.2.6 that it can be
shown that diagonal dominance of A is a sufficient, but not necessary, condition
for convergence of the Jacobi and Gauss-Seidel iteration methods.
14.3.3 Successive Over-Relaxation (SOR)
Recall from Section 6.2.7 that SOR accelerates Gauss-Seidel iteration by magnifying the change in the solution accomplished by each iteration. By taking a weighted average of the previous iterate $\phi_{i,j}^{n}$ and the Gauss-Seidel iterate $\phi_{i,j}^{n+1}$, the iteration process may be accelerated toward the exact solution.
If we denote the Gauss-Seidel iterate (14.20) by $\phi_{i,j}^{*}$, the new SOR iterate is given by
\[
\phi_{i,j}^{n+1} = (1 - \omega)\phi_{i,j}^{n} + \omega\,\phi_{i,j}^{*}, \tag{14.22}
\]
where $\omega$ is the relaxation parameter, and $0 < \omega < 2$ for convergence (Morton & Mayers, p. 206). We then have the following three possibilities:
\[
\begin{array}{ll}
1 < \omega < 2 & \text{Over-Relaxation} \\
\omega = 1 & \text{Gauss-Seidel} \\
0 < \omega < 1 & \text{Under-Relaxation}
\end{array}
\]
Over-relaxation typically accelerates convergence of linear problems, while under-relaxation is often required for convergence when dealing with nonlinear problems.
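An SOR sweep simply blends the Gauss-Seidel iterate with the previous value according to equation (14.22); a minimal sketch, with the relaxation parameter supplied by the caller, is:

```python
def sor_sweep(phi, f, dx, dy, omega):
    """One SOR sweep: each point is the weighted average (14.22) of its
    previous value and the Gauss-Seidel iterate; omega is the relaxation parameter."""
    beta = (dx / dy)**2
    ni, nj = phi.shape
    for i in range(1, ni - 1):
        for j in range(1, nj - 1):
            gs = (phi[i + 1, j] + phi[i - 1, j]
                  + beta * (phi[i, j + 1] + phi[i, j - 1])
                  - dx**2 * f[i, j]) / (2.0 * (1.0 + beta))
            phi[i, j] = (1.0 - omega) * phi[i, j] + omega * gs
    return phi
```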
In matrix form, the SOR method is given by $M_1 = D - \omega L$, $M_2 = (1 - \omega)D + \omega U$, and the iteration matrix is
\[
M = M_1^{-1} M_2 = (D - \omega L)^{-1}\left[(1 - \omega)D + \omega U\right].
\]
As determined in Section 6.2.7, the optimal value of $\omega$ that minimizes the spectral radius for large I is
\[
\omega_{opt} \approx \frac{2}{1 + \dfrac{\pi}{I + 1}},
\]
which for large I leads to the spectral radius
\[
\rho_{SOR} \approx 1 - \frac{2\pi}{I + 1}.
\]
Thus, as $I \rightarrow \infty$, $\omega_{opt} \rightarrow 2$, and $\rho_{SOR} \rightarrow 1$. However, SOR converges at a rate $2(I + 1)/\pi$ times faster than Gauss-Seidel if the optimal value of the relaxation parameter is used for the model problem under consideration. Therefore, the convergence rate improves linearly with increasing I relative to Gauss-Seidel.
Remarks:
1. Although $\omega_{opt}$ does not depend on the right-hand side f(x, y), it does depend upon the shape of the domain, the differential equation, the boundary conditions, and the method of discretization.
2. For a given problem, $\omega_{opt}$ must be estimated from a similar problem and/or determined via trial and error.
3. Recall from Figure 6.2 that the spectral radius is very sensitive to the choice of the acceleration parameter. Therefore, if one is willing to spend the time to find $\omega_{opt}$ for a given problem, then SOR can result in dramatic reductions in the number of iterations required. If such an exercise is not warranted, for example, because only one solution is sought for the system of equations, then simply use Gauss-Seidel.

14.4 Boundary Conditions


Let us consider how Dirichlet and Neumann boundary conditions are implemented within the iterative methods discussed.

14.4.1 Dirichlet Boundary Conditions


Implementation of specified Dirichlet boundary conditions is very straightforward, as no calculation is required at the boundary. Simply apply the specified values at the boundaries to $\phi_{1,j}$, $\phi_{I+1,j}$, $\phi_{i,1}$, and $\phi_{i,J+1}$, and iterate on the Jacobi (14.18), Gauss-Seidel (14.20), or SOR (14.22) equation in the interior, $i = 2, \ldots, I$, $j = 2, \ldots, J$. For example, the Jacobi or Gauss-Seidel iterate at $i = 2$, $j = 2$ is illustrated in Figure 14.10 and given by
\[
\phi_{2,2}^{n+1} = \frac{1}{2(1 + \beta)}\left[\phi_{3,2}^{n} + \phi_{1,2} + \beta\left(\phi_{2,3}^{n} + \phi_{2,1}\right) - \Delta x^2 f_{2,2}\right].
\]
Hence, we simply apply the general finite-difference equation for Jacobi or Gauss-Seidel in the interior as usual, and the values on the boundary are picked up as necessary.

Figure 14.10 Schematic of the Jacobi or Gauss-Seidel iterate at the point (i, j) = (2, 2).

Figure 14.11 Neumann boundary condition.

14.4.2 Neumann Boundary Conditions


As an example of a Neumann boundary condition, consider
\[
\frac{\partial \phi}{\partial x} = c \quad \text{at } x = 0, \tag{14.23}
\]
as shown in Figure 14.11. The simplest treatment would be to use the Jacobi (14.18), Gauss-Seidel (14.20), or SOR (14.22) equation to update $\phi_{i,j}$ in the interior for $i = 2, \ldots, I$, and then to approximate the boundary condition (14.23) by a forward difference applied at $i = 1$ according to
\[
\frac{\phi_{2,j} - \phi_{1,j}}{\Delta x} + O(\Delta x) = c. \tag{14.24}
\]
This could then be used to update $\phi_{1,j}$, $j = 2, \ldots, J$, using
\[
\phi_{1,j} = \phi_{2,j} - c\,\Delta x.
\]
This is unacceptable, however, because equation (14.24) is only first-order accurate. If a lower-order approximation, such as equation (14.24), is used at a boundary, its truncation error generally dominates the convergence rate of the entire scheme.

Figure 14.12 Boundary conditions in the top left corner of the domain.

A better alternative is to use the same method as in Section 11.5 for the Thomas algorithm with a Robin boundary condition. In this approach, the interior points are updated as before, but we now apply the difference equation at the boundary. For example, we could apply Jacobi (14.18) at $i = 1$ as follows:
\[
\phi_{1,j}^{n+1} = \frac{1}{2(1 + \beta)}\left[\phi_{2,j}^{n} + \phi_{0,j}^{n} + \beta\left(\phi_{1,j+1}^{n} + \phi_{1,j-1}^{n}\right) - \Delta x^2 f_{1,j}\right]. \tag{14.25}
\]
However, this involves a value $\phi_{0,j}^{n}$ that is outside the domain. A second-order accurate, central-difference approximation for the boundary condition (14.23) is
\[
\frac{\phi_{2,j}^{n} - \phi_{0,j}^{n}}{2\Delta x} + O(\Delta x^2) = c, \tag{14.26}
\]
which also involves the value $\phi_{0,j}^{n}$. Therefore, solving equation (14.26) for $\phi_{0,j}^{n}$ gives
\[
\phi_{0,j}^{n} = \phi_{2,j}^{n} - 2c\,\Delta x,
\]
and substituting into the difference equation (14.25) to eliminate $\phi_{0,j}^{n}$ leads to
\[
\phi_{1,j}^{n+1} = \frac{1}{2(1 + \beta)}\left[2\left(\phi_{2,j}^{n} - c\,\Delta x\right) + \beta\left(\phi_{1,j+1}^{n} + \phi_{1,j-1}^{n}\right) - \Delta x^2 f_{1,j}\right]. \tag{14.27}
\]
Thus, we use equation (14.27) to update $\phi_{1,j}^{n+1}$, $j = 2, \ldots, J$. This is the same procedure used for a Dirichlet condition but with an additional sweep along the left boundary using equation (14.27) for $\phi_{1,j}^{n+1}$, $j = 2, \ldots, J$.
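A sketch of this boundary update, with the ghost point eliminated as in equation (14.27); Python indexing starts at zero, so $i = 1$ in the text corresponds to row 0 here, and the interface is illustrative:

```python
def jacobi_left_neumann(phi, f, dx, dy, c):
    """Jacobi-style update of the left boundary using the ghost-point
    elimination of equation (14.27) for dphi/dx = c at x = 0."""
    beta = (dx / dy)**2
    phi_new = phi.copy()
    phi_new[0, 1:-1] = (2.0 * (phi[1, 1:-1] - c * dx)
                        + beta * (phi[0, 2:] + phi[0, :-2])
                        - dx**2 * f[0, 1:-1]) / (2.0 * (1.0 + beta))
    return phi_new
```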
This approach requires special treatment at corners, depending upon the boundary condition along the adjacent boundary. For example, consider if, in addition to the boundary condition (14.23), we have
\[
\frac{\partial \phi}{\partial y} = d \quad \text{at } y = b. \tag{14.28}
\]
This is illustrated in Figure 14.12. Applying equation (14.27) at the corner $i = 1$, $j = J + 1$ yields
\[
\phi_{1,J+1}^{n+1} = \frac{1}{2(1 + \beta)}\left[2\left(\phi_{2,J+1}^{n} - c\,\Delta x\right) + \beta\left(\phi_{1,J+2}^{n} + \phi_{1,J}^{n}\right) - \Delta x^2 f_{1,J+1}\right], \tag{14.29}
\]
where $\phi_{1,J+2}^{n}$ is outside the domain. Approximating equation (14.28) using a central difference in the same manner as equation (14.26) gives
\[
\frac{\phi_{1,J+2}^{n} - \phi_{1,J}^{n}}{2\Delta y} = d,
\]
which leads to
\[
\phi_{1,J+2}^{n} = \phi_{1,J}^{n} + 2d\,\Delta y.
\]
Substituting into equation (14.29) to eliminate $\phi_{1,J+2}^{n}$ gives
\[
\phi_{1,J+1}^{n+1} = \frac{1}{2(1 + \beta)}\left[2\left(\phi_{2,J+1}^{n} - c\,\Delta x\right) + 2\beta\left(\phi_{1,J}^{n} + d\,\Delta y\right) - \Delta x^2 f_{1,J+1}\right],
\]
which is used to update the corner value $i = 1$, $j = J + 1$.


14.5 Alternating-Direction-Implicit (ADI) Method
14.5.1 Motivation
Recall that in elliptic problems, the solution anywhere depends on the solution everywhere; that is, there is an infinite speed of propagation in all directions. However, in the Jacobi, Gauss-Seidel, and SOR methods, information only propagates through the mesh one point at a time. For example, if we are sweeping from left to right along lines of constant y with $0 \le x \le a$, it takes I iterations before the boundary condition at $x = a$ is felt at $x = 0$. Consequently, these point-by-point relaxation techniques are not very "elliptic like." A more "elliptic-like" method could be obtained by solving entire lines in the grid in an implicit manner. For example, sweeping along lines of constant y and solving each constant y-line implicitly, that is, all at once, would allow for the boundary condition at $x = a$ to influence the solution in the entire domain after only one sweep through the grid.
In order to see how such a scheme might be devised, let us return to the difference equation (14.16) for the Poisson equation
\[
\phi_{i+1,j} - 2(1 + \beta)\phi_{i,j} + \phi_{i-1,j} + \beta\left(\phi_{i,j+1} + \phi_{i,j-1}\right) = \Delta x^2 f_{i,j}. \tag{14.30}
\]
Consider the $j$th line and assume that values along the $(j + 1)$st and $(j - 1)$st lines are taken from the previous iterate. Rewriting equation (14.30) as an implicit equation for the values of $\phi_{i,j}$ along the $j$th line gives
\[
\phi_{i+1,j}^{n+1} - 2(1 + \beta)\phi_{i,j}^{n+1} + \phi_{i-1,j}^{n+1} = \Delta x^2 f_{i,j} - \beta\left(\phi_{i,j+1}^{n} + \phi_{i,j-1}^{n}\right), \qquad i = 2, \ldots, I. \tag{14.31}
\]
Therefore, we have a tridiagonal problem for $\phi_{i,j}$ along the $j$th line, which can be solved using the Thomas algorithm.
Remarks:
1. If sweeping through the j-lines, $j = 2, \ldots, J$, then $\phi_{i,j-1}^{n}$ becomes $\phi_{i,j-1}^{n+1}$ in equation (14.31), as it has already been updated. In other words, we might as well update the values as in Gauss-Seidel.


2. SOR can also be incorporated after each tridiagonal solve to accelerate iterative convergence.
3. This approach is more efficient at spreading information throughout the domain; therefore, it reduces the number of iterations required for convergence,
but there is more computation per iteration.
4. This illustration provides the motivation for the alternating-direction implicit
(ADI) method, which accomplishes the above in both x- and y-directions.
14.5.2 ADI Method
In the ADI method, we sweep along lines but in alternating directions. Although
the order is arbitrary, let us sweep along lines of constant y first followed by lines
of constant x.
In the first half of the iteration, from n to $n + 1/2$, we perform a sweep along constant y-lines by solving the series of tridiagonal problems for each of $j = 2, \ldots, J$ given by
\[
\phi_{i+1,j}^{n+1/2} - (2 + \lambda)\phi_{i,j}^{n+1/2} + \phi_{i-1,j}^{n+1/2} = \Delta x^2 f_{i,j} - \left[\phi_{i,j+1}^{n} - (2 - \lambda)\phi_{i,j}^{n} + \phi_{i,j-1}^{n+1/2}\right], \qquad i = 2, \ldots, I. \tag{14.32}
\]
This results in a tridiagonal system of equations to solve for each constant y-line, that is, for each j and all i. The tridiagonal system corresponding to each value of j is solved in succession as we sweep along each constant y-line. This is why the $\phi_{i,j-1}^{n+1/2}$ term on the right-hand side has been updated from the previous line solved. Unlike in equation (14.31), the differencing in the x- and y-directions is kept separate to mimic diffusion in each direction. This is why $\phi_{i,j}$ appears on both sides of the equation, once from the derivative in the x-direction and once from the derivative in the y-direction. This approach is called a splitting method, in which all of the terms associated with the derivatives in each direction are kept together on one side of the difference equation. Observe that we have added a term $\lambda\phi_{i,j}$ to each side of the equation. The numerical parameter $\lambda$ is an acceleration parameter to enhance diagonal dominance ($\lambda \ge 0$); $\lambda = 0$ corresponds to no acceleration. Note that the $\lambda\phi_{i,j}$ terms on each side of the equation cancel so as not to alter the solution for $\phi$. Although it would appear to enhance diagonal dominance, larger $\lambda$ is not necessarily better.
In the second half of the iteration, from $n + 1/2$ to $n + 1$, we sweep along constant x-lines by solving the series of tridiagonal problems for $i = 2, \ldots, I$ given by
\[
\phi_{i,j+1}^{n+1} - (2 + \lambda)\phi_{i,j}^{n+1} + \phi_{i,j-1}^{n+1} = \Delta x^2 f_{i,j} - \left[\phi_{i+1,j}^{n+1/2} - (2 - \lambda)\phi_{i,j}^{n+1/2} + \phi_{i-1,j}^{n+1}\right], \qquad j = 2, \ldots, J, \tag{14.33}
\]
where $\phi_{i-1,j}^{n+1}$ has been updated from the previous line. Once again, there is a tridiagonal system to solve for each constant x-line, and splitting has been used.
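A minimal sketch of the first ADI half-sweep (14.32), assuming Dirichlet boundaries and $\Delta x = \Delta y$, using a simple Thomas-algorithm routine (both functions are illustrative, not the text's code):

```python
import numpy as np

def thomas(a, b, c, d):
    """Thomas algorithm for a tridiagonal system with sub-, main and
    super-diagonals a, b, c and right-hand side d (all length n)."""
    n = len(d)
    cp, dp = np.zeros(n), np.zeros(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for k in range(1, n):
        m = b[k] - a[k] * cp[k - 1]
        cp[k] = c[k] / m
        dp[k] = (d[k] - a[k] * dp[k - 1]) / m
    x = np.zeros(n)
    x[-1] = dp[-1]
    for k in range(n - 2, -1, -1):
        x[k] = dp[k] - cp[k] * x[k + 1]
    return x

def adi_half_sweep_x(phi, f, dx, lam):
    """First ADI half-iteration, equation (14.32): implicit along constant
    y-lines, Dirichlet boundaries, dx = dy."""
    I1, J1 = phi.shape                # (I+1) x (J+1) points
    new = phi.copy()
    for j in range(1, J1 - 1):        # sweep the interior y-lines in order
        n = I1 - 2                    # interior unknowns along this line
        a, c = np.ones(n), np.ones(n)
        b = -(2.0 + lam) * np.ones(n)
        d = (dx**2 * f[1:-1, j]
             - (phi[1:-1, j + 1] - (2.0 - lam) * phi[1:-1, j] + new[1:-1, j - 1]))
        d[0] -= phi[0, j]             # move known boundary values to the right-hand side
        d[-1] -= phi[-1, j]
        new[1:-1, j] = thomas(a, b, c, d)
    return new
```

The second half-sweep (14.33) is obtained by exchanging the roles of i and j in the same way.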


Remarks:
1. Each ADI iteration involves $(I - 1) + (J - 1)$ tridiagonal solves (for Dirichlet boundary conditions, in which case it is not necessary to solve along the boundaries).
2. For $\Delta x = \Delta y$ ($\beta = 1$), it can be shown for the Poisson (or Laplace) equation with Dirichlet boundary conditions that the acceleration parameter that gives the best speedup is
\[
\lambda = 2\cos\left(\frac{\pi}{R}\right),
\]
where $R = \max(I + 1, J + 1)$.
3. The ADI method with splitting is typically faster than Gauss-Seidel or SOR for equations in which the terms are easily factored into x and y directions, but some of the efficiency gains are lost if there are mixed derivative terms, such as $\partial^2\phi/\partial x\partial y$.
4. When using the Thomas algorithm as described in Section 3.5 to solve the tridiagonal system along each constant x or y line, the vector coefficients a, b, c, and d for the constant y-lines are defined for $i = 1, \ldots, I + 1$, and the necessary adjustments for the boundary conditions at $x = 0$ and $x = a$ are made within the Thomas algorithm, with the same holding along lines of constant x.
5. If we have Neumann or Robin boundary conditions, then it is necessary to solve a tridiagonal system along such boundaries, with the points outside the domain being eliminated as for Gauss-Seidel.
6. In order to emphasize the splitting of derivative directions, let us write the Poisson equation
\[
\frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} = f
\]
in the abbreviated form
\[
T_x\phi + T_y\phi = f,
\]
where
\[
T_x\phi = \frac{\partial^2 \phi}{\partial x^2} = \frac{\phi_{i+1,j} - 2\phi_{i,j} + \phi_{i-1,j}}{(\Delta x)^2},
\qquad
T_y\phi = \frac{\partial^2 \phi}{\partial y^2} = \frac{\phi_{i,j+1} - 2\phi_{i,j} + \phi_{i,j-1}}{(\Delta y)^2}.
\]
Then in the ADI method with splitting, along constant y-lines we solve
\[
(\Delta x)^2\, T_x\phi = (\Delta x)^2\left(f - T_y\phi\right),
\]
and along constant x-lines we solve
\[
(\Delta x)^2\, T_y\phi = (\Delta x)^2\left(f - T_x\phi\right).
\]

Problem Set # 2


14.6 Compact Higher-Order Methods


Thus far, we have used second-order accurate finite differences throughout, which results in a five-point stencil for two-dimensional problems. In some cases, however, it is desirable to incorporate higher-order approximations. For example, this may allow for numerical solution of a problem with fewer grid points while maintaining the same level of accuracy. In other words, the reduction in truncation error owing to the higher-order approximation offsets the increase in error owing to the use of a coarser grid, producing the same overall level of accuracy. This is particularly important in simulations within three-dimensional domains, for which any reduction in the number of grid points required is welcomed.
Recall that in order to obtain higher-order finite-difference approximations,
it is necessary to incorporate additional points in the finite-difference stencil.
However, this produces matrices with additional non-zero bands and requires
special treatment near boundaries. If possible, it is advantageous to maintain the
compact nature of second-order accurate central differences because they produce
tridiagonal systems of equations that may be solved efficiently and do not require
special treatment near boundaries.
Here, we derive a compact, fourth-order accurate, central-difference approximation for the second-order derivatives in the Poisson equation using a method
based on that in Weinan E and J. G. Liu (1996), Journal of Computational
Physics, 126, p. 122. For convenience, we define the following second-order accurate, central-difference operators
\[
\delta_x^2 \phi_{i,j} = \frac{\phi_{i-1,j} - 2\phi_{i,j} + \phi_{i+1,j}}{\Delta x^2},
\qquad
\delta_y^2 \phi_{i,j} = \frac{\phi_{i,j-1} - 2\phi_{i,j} + \phi_{i,j+1}}{\Delta y^2}.
\]
Recall from equation (11.11) in Section 11.2 that from the Taylor series, we have
\[
\frac{\partial^2 \phi}{\partial x^2} = \delta_x^2 \phi - \frac{\Delta x^2}{12}\frac{\partial^4 \phi}{\partial x^4} + O(\Delta x^4), \tag{14.34}
\]
where the second term is the truncation error for the second-order accurate approximation, which we will now include in our approximation. Therefore,
\[
\delta_x^2 \phi = \frac{\partial^2 \phi}{\partial x^2} + \frac{\Delta x^2}{12}\frac{\partial^4 \phi}{\partial x^4} + O(\Delta x^4)
= \left(1 + \frac{\Delta x^2}{12}\frac{\partial^2}{\partial x^2}\right)\frac{\partial^2 \phi}{\partial x^2} + O(\Delta x^4). \tag{14.35}
\]
But from equation (14.34), observe that
\[
\frac{\partial^2}{\partial x^2} = \delta_x^2 + O(\Delta x^2).
\]


Substituting into equation (14.35) gives
\[
\delta_x^2 \phi = \left[1 + \frac{\Delta x^2}{12}\left(\delta_x^2 + O(\Delta x^2)\right)\right]\frac{\partial^2 \phi}{\partial x^2} + O(\Delta x^4)
= \left(1 + \frac{\Delta x^2}{12}\delta_x^2\right)\frac{\partial^2 \phi}{\partial x^2} + O(\Delta x^4).
\]
Solving for $\partial^2 \phi/\partial x^2$ yields
\[
\frac{\partial^2 \phi}{\partial x^2} = \left(1 + \frac{\Delta x^2}{12}\delta_x^2\right)^{-1}\delta_x^2 \phi + \left(1 + \frac{\Delta x^2}{12}\delta_x^2\right)^{-1}O(\Delta x^4). \tag{14.36}
\]
From a binomial expansion (with $\Delta x$ sufficiently small) observe that
\[
\left(1 + \frac{\Delta x^2}{12}\delta_x^2\right)^{-1} = 1 - \frac{\Delta x^2}{12}\delta_x^2 + O(\Delta x^4). \tag{14.37}
\]

Because the last term in equation (14.36) is still $O(\Delta x^4)$, we can write equation (14.36) as
\[
\frac{\partial^2 \phi}{\partial x^2} = \left(1 + \frac{\Delta x^2}{12}\delta_x^2\right)^{-1}\delta_x^2 \phi + O(\Delta x^4). \tag{14.38}
\]
Substituting the expression (14.37) into equation (14.38) leads to an $O(\Delta x^4)$ accurate central-difference approximation for the second derivative given by
\[
\frac{\partial^2 \phi}{\partial x^2} = \left(1 - \frac{\Delta x^2}{12}\delta_x^2\right)\delta_x^2 \phi + O(\Delta x^4).
\]
Owing to the $\delta_x^2(\delta_x^2 \phi)$ operator, however, this approximation involves the five points $\phi_{i-2}$, $\phi_{i-1}$, $\phi_i$, $\phi_{i+1}$, and $\phi_{i+2}$; therefore, it is not compact. In order to obtain a compact scheme, we also consider the derivative in the y-direction. Similar to equation (14.38), we have in the y-direction
\[
\frac{\partial^2 \phi}{\partial y^2} = \left(1 + \frac{\Delta y^2}{12}\delta_y^2\right)^{-1}\delta_y^2 \phi + O(\Delta y^4). \tag{14.39}
\]
Now consider the Poisson equation
\[
\frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} = f(x, y).
\]
Substituting equations (14.38) and (14.39) into the Poisson equation leads to
\[
\left(1 + \frac{\Delta x^2}{12}\delta_x^2\right)^{-1}\delta_x^2 \phi + \left(1 + \frac{\Delta y^2}{12}\delta_y^2\right)^{-1}\delta_y^2 \phi + O(\Delta^4) = f(x, y),
\]
where $\Delta = \max(\Delta x, \Delta y)$. Multiplying by $\left(1 + \frac{\Delta x^2}{12}\delta_x^2\right)\left(1 + \frac{\Delta y^2}{12}\delta_y^2\right)$ gives
\[
\left(1 + \frac{\Delta y^2}{12}\delta_y^2\right)\delta_x^2 \phi + \left(1 + \frac{\Delta x^2}{12}\delta_x^2\right)\delta_y^2 \phi + O(\Delta^4)
= \left(1 + \frac{\Delta x^2}{12}\delta_x^2 + \frac{\Delta y^2}{12}\delta_y^2 + O(\Delta^4)\right) f(x, y), \tag{14.40}
\]

which is a fourth-order accurate finite-difference approximation to the Poisson equation. Expanding the first term in equation (14.40) yields
\[
\begin{aligned}
\left(1 + \frac{\Delta y^2}{12}\delta_y^2\right)\delta_x^2 \phi
&= \left(1 + \frac{\Delta y^2}{12}\delta_y^2\right)\frac{\phi_{i-1,j} - 2\phi_{i,j} + \phi_{i+1,j}}{\Delta x^2} \\
&= \frac{1}{\Delta x^2}\left(\phi_{i-1,j} - 2\phi_{i,j} + \phi_{i+1,j}\right)
+ \frac{1}{12\Delta x^2}\left[\left(\phi_{i-1,j-1} - 2\phi_{i-1,j} + \phi_{i-1,j+1}\right)
- 2\left(\phi_{i,j-1} - 2\phi_{i,j} + \phi_{i,j+1}\right)
+ \left(\phi_{i+1,j-1} - 2\phi_{i+1,j} + \phi_{i+1,j+1}\right)\right] \\
&= \frac{1}{12\Delta x^2}\left[-20\phi_{i,j} + 10\left(\phi_{i-1,j} + \phi_{i+1,j}\right)
- 2\left(\phi_{i,j-1} + \phi_{i,j+1}\right)
+ \phi_{i-1,j-1} + \phi_{i-1,j+1} + \phi_{i+1,j-1} + \phi_{i+1,j+1}\right].
\end{aligned}
\]
Therefore, we have a nine-point stencil, but the approximation only requires three points in each direction, and thus it is compact. Similarly, expanding the second term in equation (14.40) yields
\[
\left(1 + \frac{\Delta x^2}{12}\delta_x^2\right)\delta_y^2 \phi
= \frac{1}{12\Delta y^2}\left[-20\phi_{i,j} + 10\left(\phi_{i,j-1} + \phi_{i,j+1}\right)
- 2\left(\phi_{i-1,j} + \phi_{i+1,j}\right)
+ \phi_{i-1,j-1} + \phi_{i-1,j+1} + \phi_{i+1,j-1} + \phi_{i+1,j+1}\right],
\]
and the right-hand side of equation (14.40) is
\[
\left(1 + \frac{\Delta x^2}{12}\delta_x^2 + \frac{\Delta y^2}{12}\delta_y^2\right) f(x, y)
= f_{i,j} + \frac{1}{12}\left[f_{i-1,j} - 2f_{i,j} + f_{i+1,j} + f_{i,j-1} - 2f_{i,j} + f_{i,j+1}\right]
= \frac{1}{12}\left[8f_{i,j} + f_{i-1,j} + f_{i+1,j} + f_{i,j-1} + f_{i,j+1}\right].
\]
Thus, the coefficients in the nine-point finite-difference stencils for $\phi(x, y)$ and $f(x, y)$ are illustrated graphically in Figures 14.13 and 14.14, respectively.
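With $\Delta x = \Delta y = h$, the two expansions above combine into a single nine-point operator, $\left[-20\phi_{i,j} + 4(\phi_{i\pm1,j} + \phi_{i,j\pm1}) + \text{corners}\right]/(6h^2)$, and the fourth-order accuracy can be checked numerically with a short script (illustrative; it uses the manufactured solution $\phi = \sin(\pi x)\sin(\pi y)$):

```python
import numpy as np

# Residual of the compact nine-point scheme for an exact solution of the
# Poisson equation; it should decrease roughly as O(h^4), i.e. by ~16 per refinement.
for I in (16, 32, 64):
    x = np.linspace(0.0, 1.0, I + 1)
    h = x[1] - x[0]
    X, Y = np.meshgrid(x, x, indexing="ij")
    phi = np.sin(np.pi * X) * np.sin(np.pi * Y)
    f = -2.0 * np.pi**2 * phi
    c = phi[1:-1, 1:-1]
    nbrs = phi[1:-1, 2:] + phi[1:-1, :-2] + phi[2:, 1:-1] + phi[:-2, 1:-1]
    corners = phi[2:, 2:] + phi[2:, :-2] + phi[:-2, 2:] + phi[:-2, :-2]
    lhs = (-20.0 * c + 4.0 * nbrs + corners) / (6.0 * h**2)
    fc = f[1:-1, 1:-1]
    fnbrs = f[1:-1, 2:] + f[1:-1, :-2] + f[2:, 1:-1] + f[:-2, 1:-1]
    rhs = (8.0 * fc + fnbrs) / 12.0
    print(I, np.max(np.abs(lhs - rhs)))
```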

Remarks:
1. Observe that in equation (14.40), the two-dimensionality of the equation has been exploited to obtain the compact finite-difference stencil. That is, the $\delta_x^2\delta_x^2$ and $\delta_y^2\delta_y^2$ operators have been converted to $\delta_x^2\delta_y^2$ and $\delta_y^2\delta_x^2$ difference operators.
2. Because the finite-difference stencil is compact, that is, only involving three points in each direction, application of the ADI method as in the previous section results in a set of tridiagonal problems to solve. In this manner, the fourth-order, compact finite-difference approach is no less efficient than the second-order scheme used in the previous section (there are simply additional terms on the right-hand side of the equations that must be evaluated).
3. The primary disadvantage of using higher-order schemes is that it is generally necessary to use lower-order approximations for derivative boundary conditions, which may affect the convergence rate. This is not a problem, however, for Dirichlet boundary conditions.

Figure 14.13 Graphical representation of the coefficients in the nine-point stencil for $\phi_{i,j}$.

Figure 14.14 Graphical representation of the coefficients in the nine-point stencil for $f_{i,j}$.

14.7 Multigrid Methods


Multigrid methods are a framework that allows one to accelerate an underlying
iteration (relaxation) technique, such as the Gauss-Seidel or ADI method. As
such, they take advantage of the scalability and flexibility of iterative methods but
allow for further efficiency gains with respect to iterative convergence behavior. In
particular, multigrid methods are comparable in speed with fast Poisson solvers
as discussed in Section 14.2. However, whereas FFT and cyclic reduction are
limited to linear partial differential equations, multigrid methods can be adapted

238

Finite-Difference Methods for Elliptic PDEs

to nonlinear equations with minimal loss in efficiency. They are currently among
the best methods for solving partial differential equations numerically, both in
terms of efficiency and flexibility.

14.7.1 Motivation
If one were to carefully examine the iterative convergence history of the typical iterative techniques, such as Gauss-Seidel or ADI, the following properties would be observed: high-frequency components of the error experience fast convergence, whereas low-frequency components of the error exhibit relatively slower convergence. As in Briggs et al., let us illustrate this behavior by considering the following simple one-dimensional problem
\[
\frac{d^2 \phi}{dx^2} = 0, \qquad 0 \le x \le 1, \tag{14.41}
\]
with $\phi(0) = \phi(1) = 0$, which has the exact solution $\phi(x) = 0$. Therefore, all plots of the numerical solution for $\phi(x)$ are also plots of the error. Discretizing equation (14.41) over I equal subintervals ($I + 1$ points) using central differences gives³
\[
\phi_{i+1} - 2\phi_i + \phi_{i-1} = 0, \qquad i = 1, \ldots, I - 1, \qquad \phi_0 = \phi_I = 0.
\]
To show how the nature of the error affects convergence, consider an initial guess, that is, error, consisting of the Fourier mode $\phi(x) = \sin(k\pi x)$, where k is the wavenumber and indicates the number of half sine waves on the interval $0 \le x \le 1$. In discretized form, with $x_i = i\Delta x = i/I$, this is
\[
\phi_i = \sin\left(\frac{k\pi i}{I}\right), \qquad i = 0, \ldots, I,
\]
with the wavenumber $1 \le k \le I - 1$. Thus, the error for the initial guess is such that small wavenumber k corresponds to long, smooth (low-frequency) waves, and large k corresponds to highly oscillatory (high-frequency) waves in the initial condition. Figure 14.15 illustrates several different modes in the error. Applying Jacobi iteration with $I = 64$, the solution converges more rapidly for the higher-frequency initial guess as illustrated in Figure 14.16. A more realistic situation is one in which the initial guess contains multiple modes, for example
\[
\phi_i = \frac{1}{3}\left[\sin\left(\frac{\pi i}{I}\right) + \sin\left(\frac{6\pi i}{I}\right) + \sin\left(\frac{32\pi i}{I}\right)\right],
\]
in which the $k = 1$, 6, and 32 terms represent low-, medium-, and high-frequency modes, respectively. Applying Jacobi iteration with $I = 64$, the error is reduced rapidly during the early iterations but more slowly thereafter as shown in Figure 14.17.

³ Note that the first grid point at the left boundary $x = 0$ is designated as being at $i = 0$, rather than $i = 1$ as before. This is more natural given the way that the grids will be defined.

Figure 14.15 Initial guesses (errors) with k = 1, 3, and 6.

Figure 14.16 Convergence rate for error with modes having k = 1, 3, or 6 using Jacobi iteration.

Thus, there is rapid convergence of the overall solution until the high-frequency modes are smoothed out, followed by slow convergence when only lower-frequency modes are present. To further illustrate this phenomenon, consider the following sequence using Jacobi iteration. Figure 14.18 shows the result of relaxation acting on a mode with k = 3 after one iteration (left) and ten iterations (right), and Figure 14.19 shows the result of relaxation acting on a mode with k = 16 after one iteration (left) and ten iterations (right). Figure 14.20 shows the result of relaxation acting on an error having modes with k = 2 and k = 16 after one iteration (left) and ten iterations (right). Again, we observe that the high-frequency error is reduced more rapidly as we iterate (relax) compared to the low-frequency error.
Multigrid methods take advantage of this property of relaxation techniques by recognizing that smooth components of the error become more oscillatory with respect to the grid size on a coarse grid. That is, there are fewer grid points per wavelength on a coarser grid as compared to a finer grid, and the error appears more oscillatory on the coarser grid. Thus, relaxation would be expected to be more effective on a coarse-grid representation of the error. Note that it is also faster, as there are fewer points to compute.

Figure 14.17 Convergence rate for error with modes having k = 1, 6, and 32 using Jacobi iteration.

Figure 14.18 Error of a mode with k = 3 after one iteration (left) and ten iterations (right).
Remarks:
1. Multigrid methods are not so much a specific set of techniques as they are a
framework for accelerating relaxation (iterative) methods.
2. Multigrid methods are comparable in speed with fast direct methods, such as
Fourier methods and cyclic reduction, but they can be used to solve general
elliptic equations with variable coefficients and even nonlinear equations.

14.7.2 Multigrid Methodology


In order to show how the multigrid methodology can be applied to an elliptic partial differential equation, let us consider the general second-order, linear, elliptic partial differential equation with variable coefficients of the form
\[
A(x, y)\frac{\partial^2 \phi}{\partial x^2} + B(x, y)\frac{\partial \phi}{\partial x} + C(x, y)\frac{\partial^2 \phi}{\partial y^2} + D(x, y)\frac{\partial \phi}{\partial y} + E(x, y)\phi = F(x, y). \tag{14.42}
\]

Figure 14.19 Error of a mode with k = 16 after one iteration (left) and ten iterations (right).

Figure 14.20 Error of modes with k = 2 and k = 16 after one iteration (left) and ten iterations (right).

To be elliptic, $A(x, y)C(x, y) > 0$ for all (x, y). Approximating this differential equation using second-order accurate central differences gives
\[
A_{i,j}\frac{\phi_{i+1,j} - 2\phi_{i,j} + \phi_{i-1,j}}{\Delta x^2} + B_{i,j}\frac{\phi_{i+1,j} - \phi_{i-1,j}}{2\Delta x}
+ C_{i,j}\frac{\phi_{i,j+1} - 2\phi_{i,j} + \phi_{i,j-1}}{\Delta y^2} + D_{i,j}\frac{\phi_{i,j+1} - \phi_{i,j-1}}{2\Delta y}
+ E_{i,j}\phi_{i,j} = F_{i,j},
\]
where $A_{i,j} = A(x_i, y_j)$, etc. We rewrite this difference equation in the form
\[
a_{i,j}\phi_{i+1,j} + b_{i,j}\phi_{i-1,j} + c_{i,j}\phi_{i,j+1} + d_{i,j}\phi_{i,j-1} + e_{i,j}\phi_{i,j} = F_{i,j}, \tag{14.43}
\]
where
\[
\begin{aligned}
a_{i,j} &= \frac{A_{i,j}}{\Delta x^2} + \frac{B_{i,j}}{2\Delta x}, &\qquad
b_{i,j} &= \frac{A_{i,j}}{\Delta x^2} - \frac{B_{i,j}}{2\Delta x}, \\
c_{i,j} &= \frac{C_{i,j}}{\Delta y^2} + \frac{D_{i,j}}{2\Delta y}, &\qquad
d_{i,j} &= \frac{C_{i,j}}{\Delta y^2} - \frac{D_{i,j}}{2\Delta y}, \\
e_{i,j} &= E_{i,j} - \frac{2A_{i,j}}{\Delta x^2} - \frac{2C_{i,j}}{\Delta y^2}. &&
\end{aligned}
\]

Figure 14.21 The coarse and fine grids, with the coarse grid consisting of every other point in the fine grid.

Coarse-Grid Correction
For convenience, write (14.43) (or some other difference equation) as
\[
L\phi = f, \tag{14.44}
\]
where L represents the difference operator. If $\phi(x, y)$ is the exact solution to equation (14.43), then the error is defined by
\[
e = \phi - \tilde{\phi}, \tag{14.45}
\]
where $\tilde{\phi}(x, y)$ is a numerical approximation to $\phi(x, y)$. The residual⁴ is defined by
\[
r = f - L\tilde{\phi}. \tag{14.46}
\]
Observe from equation (14.44) that if $\tilde{\phi} = \phi$, then the residual is zero; therefore, the residual is a measure of how "wrong" the approximate solution is. Substituting (14.45) into equation (14.44) gives
\[
Le + L\tilde{\phi} = f,
\]
or
\[
Le = f - L\tilde{\phi},
\]
which is the error equation
\[
Le = r. \tag{14.47}
\]

For the multigrid method, it is necessary to have a means of clearly identifying operators and quantities on successive grids. As illustrated in Figure 14.21, the fine grid is denoted by $\Omega^h$ and the coarse grid by $\Omega^{2h}$.⁵ Note that grids are typically reduced by a factor of two in each direction. In order to shift information between the fine and coarse grids, we use the restriction and interpolation operators. The restriction operator, given by $I_h^{2h}$, moves information from the fine grid $\Omega^h$ to the coarse grid $\Omega^{2h}$. The interpolation, or prolongation, operator, given by $I_{2h}^{h}$, moves information from the coarse grid $\Omega^{2h}$ to the fine grid $\Omega^h$. In the operators, observe that the subscript indicates the grid from which information is moved, and the superscript indicates the grid to which the information is moved.

⁴ Our use of r to denote the residual here is not to be confused with the rank of a matrix in Chapters 1 and 2.
⁵ In the numerical methods literature, h is often used to indicate the grid size, here given by $\Delta x$ and $\Delta y$.

Figure 14.22 Coarse-grid correction sequence between a fine grid $\Omega^h$ and a coarse grid $\Omega^{2h}$.
From these definitions, we can devise a scheme with which to correct the solution on a fine grid by solving for the error on a coarse grid. This is known as coarse-grid correction (CGC) and consists of the following steps as illustrated in Figure 14.22:
1. Relax the original difference equation $L^h\phi^h = f^h$ on the fine grid $\Omega^h$ using Gauss-Seidel, ADI, etc. $\nu_1$ times with an initial guess $\tilde{\phi}^h$.
2. Compute the residual on the fine grid $\Omega^h$ and restrict it to the coarse grid $\Omega^{2h}$:
\[
r^{2h} = I_h^{2h} r^h = I_h^{2h}\left(f^h - L^h\tilde{\phi}^h\right).
\]
3. Using the residual as the right-hand side, solve the error equation $L^{2h}e^{2h} = r^{2h}$ on the coarse grid $\Omega^{2h}$.
4. Interpolate the error to the fine grid and correct the fine-grid approximation according to
\[
\tilde{\phi}^h \leftarrow \tilde{\phi}^h + I_{2h}^{h} e^{2h}.
\]
5. Relax $L^h\phi^h = f^h$ on the fine grid $\Omega^h$ $\nu_2$ times with the corrected approximation $\tilde{\phi}^h$ as the initial guess.
This CGC scheme is the primary component of all the many multigrid algorithms.
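A minimal sketch of one two-grid CGC cycle for the one-dimensional model problem (14.41), using Gauss-Seidel relaxation, injection for restriction, and linear interpolation; it assumes an even number of fine-grid intervals and is illustrative rather than the text's implementation:

```python
import numpy as np

def relax(phi, f, h, sweeps):
    """Gauss-Seidel relaxation of the 1D problem d2(phi)/dx2 = f."""
    for _ in range(sweeps):
        for i in range(1, len(phi) - 1):
            phi[i] = 0.5 * (phi[i - 1] + phi[i + 1] - h**2 * f[i])
    return phi

def coarse_grid_correction(phi, f, h, nu1=3, nu2=3):
    """One two-grid CGC cycle (Steps 1-5); the coarse-grid error equation
    is itself treated by a few relaxation sweeps."""
    phi = relax(phi, f, h, nu1)                        # Step 1: pre-relaxation on the fine grid
    r = np.zeros_like(phi)
    r[1:-1] = f[1:-1] - (phi[:-2] - 2.0 * phi[1:-1] + phi[2:]) / h**2
    r2 = r[::2].copy()                                 # Step 2: restrict the residual (injection)
    e2 = relax(np.zeros_like(r2), r2, 2.0 * h, 50)     # Step 3: relax the error equation L e = r
    e = np.zeros_like(phi)                             # Step 4: interpolate the error and correct
    e[::2] = e2
    e[1::2] = 0.5 * (e2[:-1] + e2[1:])
    phi += e
    return relax(phi, f, h, nu2)                       # Step 5: post-relaxation on the fine grid
```

Replacing Step 3 by a recursive call to the same routine on the coarser grid turns this two-grid cycle into the full V-cycle described next.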
Remarks:
1. In practice, $\nu_1$ and $\nu_2$ are small, typically 1, 2, or 3.
2. In the CGC scheme, it is the residual r that is being restricted and the error e that is interpolated. However, we will present these operators as acting on the dependent variable $\phi$ for generality.
To illustrate the CGC sequence, consider equation (14.41), that is, $\phi'' = 0$, with $I = 64$ and $\nu_1 = 3$. The initial guess is given by
\[
\phi_i = \frac{1}{2}\left[\sin\left(\frac{16\pi i}{I}\right) + \sin\left(\frac{40\pi i}{I}\right)\right],
\]
which includes a low-frequency mode ($k = 16$) and a high-frequency mode ($k = 40$) as shown in Figure 14.23. Figure 14.24 shows the error from Step 1 after one relaxation sweep on the fine grid. Observe that the high-frequency mode is eliminated after only one relaxation step. Step 1 after three relaxation sweeps on the fine grid is given in Figure 14.25, for which the low-frequency mode is reduced very little by the second and third relaxation sweeps. The residual is then computed on the fine grid and restricted to the coarse grid in Step 2.

Figure 14.23 Initial condition (error) for the CGC illustration.

Figure 14.24 Error from Step 1 after one relaxation sweep on the fine grid.
Figure 14.25 Error from Step 1 after three relaxation sweeps on the fine grid.

Figure 14.26 Error from Step 3 after one relaxation sweep on the coarse grid.

In lieu of solving the error equation exactly on the coarse grid, let us simply relax it as we did on the fine grid. Figure 14.26 shows the error after one relaxation sweep on the coarse grid. An additional two relaxation sweeps on the coarse grid produce the error shown in Figure 14.27. Note that restriction to the coarse grid accelerates convergence of the low-frequency mode, which has a higher frequency relative to the coarser grid. In Step 4, the error is interpolated from the coarse to the fine grid and used to correct the approximate solution on the fine grid. Step 5 after three relaxation sweeps on the fine grid produces the error shown in Figure 14.28.
Figure 14.27 Error from Step 3 after three relaxation sweeps on the coarse grid.

Figure 14.28 Error from Step 5 after three relaxation sweeps on the fine grid.

As you can see, the low-frequency mode is nearly eliminated after the CGC sequence is completed.
An obvious question at this point is, how do we obtain the coarse-grid solution for $e^{2h}$ in Step 3? Actually, we already know the answer to this: perform additional CGCs. That is, if the CGC works between two grid levels, it should be even more effective between three, four, or as many grid levels as we can accommodate. To implement this, then, we recursively replace Step 3 by additional CGCs on progressively coarser grids until it is no longer possible to further reduce the grid. This leads to the so-called V-cycle as illustrated in Figure 14.29. The V-cycles are then repeated until convergence; each V of the V-cycle is essentially a multigrid iteration. Although we simply denote the error at each grid level as e(x, y), it is important to realize that on each successively coarser grid, what is being solved for is the error of the error on the next finer grid. Thus, on the finest grid, say $\Omega^h$, relaxation is carried out on the original equation for $\phi(x, y)$; on the next coarser grid $\Omega^{2h}$, relaxation is on the equation for the error on $\Omega^h$; on the next coarser grid $\Omega^{4h}$, relaxation is on the equation for the error on $\Omega^{2h}$; and so forth.

Figure 14.29 Schematic of a four-grid V-cycle.
This simple V-cycle scheme is appropriate when a good initial guess is available to start the V-cycles. For example, when considering the solution of equation (14.42) in the context of an unsteady calculation, the solution for $\phi^h$ from the previous time step is a good initial guess for the current time step. If no good initial guess is available, then the full multigrid V-cycle (FMG) may be applied according to the following procedure, which utilizes the same components as in the coarse-grid correction sequence:
1. Solve $L\phi = f$ on the coarsest grid (note: $\phi$, not e).
2. Interpolate $\phi$ to the next finer grid.
3. Perform a V-cycle to correct $\phi$.
4. Interpolate $\phi$ to the next finer grid.
5. Repeat (3) and (4) until the finest grid is reached.
6. Perform V-cycles until convergence.

This is illustrated in Figure 14.30.

Figure 14.30 Schematic of the full multigrid V-cycle to obtain a good initial guess.

The coarse-grid correction sequence represents the core of the multigrid method; however, we must consider in more detail the restriction and interpolation operators, boundary conditions, and relaxation methods. We begin with the grid definition.
Grid Definitions
Because each successive grid differs by a factor of two in each direction, the finest grid size is often taken as $2^n + 1$ points, where n is an integer. This is overly restrictive, particularly when we get to large grid sizes. Somewhat more general grids may be obtained using the following grid definitions. The differential equation (14.42) is discretized on a uniform grid having $N_x \times N_y$ points, which are defined by
\[
N_x = m_x 2^{(n_x - 1)} + 1, \qquad N_y = m_y 2^{(n_y - 1)} + 1, \tag{14.48}
\]
where $n_x$ and $n_y$ determine the number of grid levels, and $m_x$ and $m_y$ determine the size of the coarsest grid, which is $(m_x + 1) \times (m_y + 1)$.
In order to maximize the benefits of the multigrid methodology, we want to maximize the number of grid levels between which the algorithm will move. Therefore, for a given grid, $n_x$ and $n_y$ should be as large as possible, and $m_x$ and $m_y$ should be as small as possible for maximum efficiency. Typically, $m_x$ and $m_y$ are 2, 3, or 5. For example:
\[
\begin{array}{ll}
N_x = 65: & m_x = 2, \ n_x = 6 \\
N_x = 129: & m_x = 2, \ n_x = 7 \\
N_x = 49: & m_x = 3, \ n_x = 5 \\
N_x = 81: & m_x = 5, \ n_x = 5
\end{array}
\]
The number of grid levels is given by
\[
N = \max(n_x, n_y), \tag{14.49}
\]
resulting in the series of grids
\[
G(1) < \ldots < G(L) < \ldots < G(N), \tag{14.50}
\]
where G(1) is the coarsest grid, G(N) is the finest grid, and $L = 1, \ldots, N$. Each grid G(L) has $M_x(L) \times M_y(L)$ grid points, where
\[
M_x(L) = m_x 2^{[\max(n_x + L - N,\,1) - 1]} + 1, \qquad M_y(L) = m_y 2^{[\max(n_y + L - N,\,1) - 1]} + 1. \tag{14.51}
\]
For example, if
\[
N_x = 65, \qquad N_y = 49,
\]

then
\[
m_x = 2, \ n_x = 6 \qquad \text{and} \qquad m_y = 3, \ n_y = 5.
\]
Thus, $N = \max(n_x, n_y) = 6$, and
\[
\begin{array}{lll}
G(6): & M_x(6) = 65, & M_y(6) = 49 \\
G(5): & M_x(5) = 33, & M_y(5) = 25 \\
G(4): & M_x(4) = 17, & M_y(4) = 13 \\
G(3): & M_x(3) = 9, & M_y(3) = 7 \\
G(2): & M_x(2) = 5, & M_y(2) = 4 \\
G(1): & M_x(1) = 3, & M_y(1) = 4
\end{array}
\]
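Equation (14.51) is easy to evaluate programmatically; the short sketch below (illustrative) reproduces the table above for $N_x = 65$, $N_y = 49$:

```python
def grid_sizes(mx, nx, my, ny):
    """Number of points on each grid level from equation (14.51)."""
    N = max(nx, ny)
    sizes = []
    for L in range(1, N + 1):
        Mx = mx * 2**(max(nx + L - N, 1) - 1) + 1
        My = my * 2**(max(ny + L - N, 1) - 1) + 1
        sizes.append((L, Mx, My))
    return sizes

# Nx = 65, Ny = 49 corresponds to mx = 2, nx = 6, my = 3, ny = 5.
for L, Mx, My in reversed(grid_sizes(2, 6, 3, 5)):
    print(L, Mx, My)
```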

Boundary Conditions
At each boundary, let us consider the general form of the boundary condition
\[
p\phi + q\frac{\partial \phi}{\partial n} = s, \tag{14.52}
\]
where n is the direction normal to the surface. A Dirichlet boundary condition has $q = 0$, and a Neumann boundary condition has $p = 0$. This boundary condition is applied directly on the finest grid $\Omega^h$, that is,
\[
p^h \phi^h + q^h \frac{\partial \phi^h}{\partial n} = s^h. \tag{14.53}
\]
On the coarser grids, however, we need the boundary condition for the error. In order to obtain such a condition, consider the following. On the coarse grid $\Omega^{2h}$, equation (14.52) applies to the solution $\phi$; thus,
\[
p^{2h} \phi^{2h} + q^{2h} \frac{\partial \phi^{2h}}{\partial n} = s^{2h}.
\]
If we enforce the error to be zero on the boundaries, then from $e^{2h} = \phi^{2h} - \tilde{\phi}^{2h} = 0$, in which case $\tilde{\phi}^{2h} = \phi^{2h}$, we also have
\[
p^{2h} \tilde{\phi}^{2h} + q^{2h} \frac{\partial \tilde{\phi}^{2h}}{\partial n} = s^{2h}.
\]
Therefore, to obtain a boundary condition for the error on $\Omega^{2h}$, observe that
\[
p^{2h} e^{2h} + q^{2h} \frac{\partial e^{2h}}{\partial n}
= p^{2h}\left(\phi^{2h} - \tilde{\phi}^{2h}\right) + q^{2h} \frac{\partial}{\partial n}\left(\phi^{2h} - \tilde{\phi}^{2h}\right)
= \left(p^{2h}\phi^{2h} + q^{2h}\frac{\partial \phi^{2h}}{\partial n}\right) - \left(p^{2h}\tilde{\phi}^{2h} + q^{2h}\frac{\partial \tilde{\phi}^{2h}}{\partial n}\right)
= s^{2h} - s^{2h} = 0. \tag{14.54}
\]
Thus, the boundary conditions are homogeneous on all but the finest grid, where the original condition on $\phi$ is applied. For example, a Dirichlet boundary condition has $e = 0$ on the boundary, and a Neumann boundary condition has $\partial e/\partial n = 0$ on the boundary. In other words, the type of boundary condition (Dirichlet, Neumann, or Robin) does not change, in which case the p and q coefficients are the same, but they become homogeneous, in which case $s = 0$.

Figure 14.31 Red-black Gauss-Seidel iteration.

Relaxation
At the heart of the multigrid method is an iterative (relaxation) scheme that is used to update the numerical approximation to the solution. Typically, red-black Gauss-Seidel iteration is used to relax the difference equation as illustrated in Figure 14.31. By performing the relaxation on all of the red and all of the black grid points separately, data dependencies are eliminated, so the scheme is easily implemented on parallel computers (see Section 12). Note that when Gauss-Seidel is used, SOR should not be applied because it destroys the high-frequency smoothing of the multigrid approach.

Figure 14.31 Red-black Gauss-Seidel iteration.

Although Gauss-Seidel is most commonly used owing to its ease of implementation, particularly in parallel, it is better to use alternating-direction implicit (ADI) relaxation for the same reason that ADI is preferred over Gauss-Seidel as a stand-alone iterative method. When sweeping along lines of constant y, the following tridiagonal problem is solved for each j = 1, ..., M_y(L) (see equation (14.43)):

    a_{i,j} φ_{i+1,j} + e_{i,j} φ_{i,j} + b_{i,j} φ_{i−1,j} = f_{i,j} − c_{i,j} φ*_{i,j+1} − d_{i,j} φ*_{i,j−1},    (14.55)

for i = 1, ..., M_x(L). Here, φ* denotes the most recent approximation, which may be from the previous or current iteration depending upon the sweep direction. Similar to red-black Gauss-Seidel, we could sweep all lines with j even and j odd separately to eliminate data dependencies; we will refer to this as zebra relaxation. Then lines of constant x are swept by solving the tridiagonal problem for each i = 1, ..., M_x(L) given by

    c_{i,j} φ_{i,j+1} + e_{i,j} φ_{i,j} + d_{i,j} φ_{i,j−1} = f_{i,j} − a_{i,j} φ*_{i+1,j} − b_{i,j} φ*_{i−1,j},    (14.56)

for j = 1, ..., M_y(L). Again, we could sweep all lines with i even and i odd separately to accommodate parallelization. A red-black Gauss-Seidel sketch is given below.
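The following is a minimal sketch of one red-black Gauss-Seidel sweep, specialized to the five-point Laplacian (Poisson) stencil rather than the general coefficients of equation (14.43); the function name, boundary treatment, and problem data are illustrative assumptions.

```python
import numpy as np

# One red-black Gauss-Seidel sweep for del^2(phi) = f on a uniform grid with
# spacing h; Dirichlet boundary values are held fixed.  This is a special case
# of relaxing the general difference equation (14.43).

def red_black_gauss_seidel(phi, f, h, sweeps=1):
    for _ in range(sweeps):
        for color in (0, 1):                      # update the two colors separately
            for i in range(1, phi.shape[0] - 1):
                j0 = 1 + (i + color) % 2          # first interior column of this color in row i
                phi[i, j0:-1:2] = 0.25*(phi[i + 1, j0:-1:2] + phi[i - 1, j0:-1:2]
                                        + phi[i, j0 + 1::2] + phi[i, j0 - 1:-2:2]
                                        - h*h*f[i, j0:-1:2])
    return phi
```

Because same-colored points are never neighbors in the five-point stencil, each colored half-sweep has no data dependencies and can be vectorized or parallelized, which is the property exploited above.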

Table 14.1 Notation for fine and coarse grids.

    Grid      Indices     Range
    Coarser   (i, j)      1 ≤ i ≤ N_x,    1 ≤ j ≤ N_y
    Finer     (i′, j′)    1 ≤ i′ ≤ N′_x,  1 ≤ j′ ≤ N′_y

Restriction Operator: I_h^{2h}
The restriction operator is required for moving information from a finer grid to a coarser grid. The notation used is provided in Table 14.1. Applying restriction in both directions, we have the following relationships between the coarser and finer grids:

    N_x = ½(N′_x − 1) + 1,    N_y = ½(N′_y − 1) + 1,
    i′ = 2i − 1,              j′ = 2j − 1.
The easiest restriction operator is straight injection, for which

    φ^{2h}_{i,j} = φ^h_{i′,j′},

in which we simply drop the points that are not common to both the coarser and finer grids. The matrix symbol for straight injection is [1].
A better restriction operator is full weighting, with the matrix symbol given in Figure 14.32. In this case,

    φ^{2h}_{i,j} = (1/16)(φ^h_{i′−1,j′−1} + φ^h_{i′−1,j′+1} + φ^h_{i′+1,j′−1} + φ^h_{i′+1,j′+1})
                 + (1/8)(φ^h_{i′,j′−1} + φ^h_{i′,j′+1} + φ^h_{i′−1,j′} + φ^h_{i′+1,j′})
                 + (1/4)φ^h_{i′,j′}.    (14.57)

Figure 14.32 Matrix symbol for full-weighting restriction.

This represents a weighted average of the surrounding points in the fine mesh. We then use straight injection on the boundaries, such that φ^{2h}_{i,j} = φ^h_{i′,j′} for i = 1, N_x with j = 1, ..., N_y, and for j = 1, N_y with i = 1, ..., N_x.
In general, the grids in the x and y directions may have different numbers of grid levels as discussed above. Therefore, restriction may be applied in both directions or in only one direction. If, for example, restriction is applied only in the x-direction, then N_y = N′_y and j′ = j in equation (14.57). A sketch of the full-weighting operator is given below.
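The following is a minimal sketch of full-weighting restriction, equation (14.57), assuming coarsening in both directions, 0-based arrays (fine point (2i, 2j) sits over coarse point (i, j)), and straight injection on the boundaries; the function name is illustrative.

```python
import numpy as np

def restrict_full_weighting(fine):
    """Full-weighting restriction (14.57); boundaries transferred by injection."""
    coarse = fine[::2, ::2].copy()                # straight injection (boundaries keep this)
    coarse[1:-1, 1:-1] = (
          0.0625*(fine[1:-2:2, 1:-2:2] + fine[1:-2:2, 3::2]
                  + fine[3::2, 1:-2:2] + fine[3::2, 3::2])        # corner neighbors, weight 1/16
        + 0.125 *(fine[2:-2:2, 1:-2:2] + fine[2:-2:2, 3::2]
                  + fine[1:-2:2, 2:-2:2] + fine[3::2, 2:-2:2])    # edge neighbors, weight 1/8
        + 0.25  * fine[2:-2:2, 2:-2:2]                            # center point, weight 1/4
    )
    return coarse
```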
Interpolation (Prolongation) Operator: I_{2h}^h
The interpolation operator is required for moving information from the coarser grid to the finer grid. The most commonly used interpolation operator is based on bilinear interpolation as illustrated in Figure 14.33. In bilinear interpolation, the information at the four corners of a general grid cell on the finer and coarser grids is related by

    φ^h_{i′,j′}     = φ^{2h}_{i,j}    (copy common points),
    φ^h_{i′+1,j′}   = ½(φ^{2h}_{i,j} + φ^{2h}_{i+1,j}),
    φ^h_{i′,j′+1}   = ½(φ^{2h}_{i,j} + φ^{2h}_{i,j+1}),
    φ^h_{i′+1,j′+1} = ¼(φ^{2h}_{i,j} + φ^{2h}_{i+1,j} + φ^{2h}_{i,j+1} + φ^{2h}_{i+1,j+1}).

Figure 14.33 Schematic of bilinear interpolation from coarser to finer grid.

A sketch of this prolongation operator is given below.
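The following is a minimal sketch of bilinear prolongation following the four relations above, again with 0-based arrays and fine point (2i, 2j) over coarse point (i, j); the function name is illustrative.

```python
import numpy as np

def prolong_bilinear(coarse):
    """Bilinear interpolation from the coarse grid to the next finer grid."""
    nxc, nyc = coarse.shape
    fine = np.zeros((2*(nxc - 1) + 1, 2*(nyc - 1) + 1))
    fine[::2, ::2]   = coarse                                        # copy common points
    fine[1::2, ::2]  = 0.5 *(coarse[:-1, :] + coarse[1:, :])         # midpoints in x
    fine[::2, 1::2]  = 0.5 *(coarse[:, :-1] + coarse[:, 1:])         # midpoints in y
    fine[1::2, 1::2] = 0.25*(coarse[:-1, :-1] + coarse[1:, :-1]
                             + coarse[:-1, 1:] + coarse[1:, 1:])     # cell centers
    return fine
```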

14.7.3 Speed Comparisons
Consider the test problem

    A(x) ∂²φ/∂x² + B(x) ∂φ/∂x + C(y) ∂²φ/∂y² + D(y) ∂φ/∂y = F(x, y),

with Neumann boundary conditions. The following times are for an SGI Indy R5000-150MHz workstation. The grid is N × N.

ADI:
                 ε = 10⁻⁴                   ε = 10⁻⁵
    N      Iterations   Time (sec)    Iterations   Time (sec)
    65         673        22.35           821        27.22
    129      2,408       366.06         2,995       456.03

Note that in both cases, the total time required for the N = 129 case is approximately 16× that with N = 65 (approximately a 4× increase in points and a 4× increase in iterations).
Multigrid:
V-cycles with ADI relaxation (no FMG to obtain an improved initial guess). Here the convergence criterion ε is evaluated between V-cycles.

                 ε = 10⁻⁴                   ε = 10⁻⁵
    N       V-Cycles   Time (sec)      V-Cycles   Time (sec)
    65         18         1.78            23         2.28
    129        23        10.10            29        12.68

Remarks:
1. In both cases, the total time required for the N = 129 case is approximately 6× that with N = 65 (the minimum possible is 4×). The multigrid method therefore scales to larger grid sizes more effectively than ADI alone; note the small increase in the number of V-cycles with increasing N.
2. The case with N = 65 is approximately 13× faster than ADI, and the case with N = 129 is approximately 36× faster!
3. Aside from the additional programming complexity, which is considerable, the only cost for the dramatic speed and scalability improvements is a doubling of memory requirements in order to store the solution φ and the errors e at each grid level. Hence, the multigrid method is essentially a trade-off between computational time and memory requirements.
4. Because of their speed and generality, multigrid methods are currently the preferred framework for solving elliptic partial differential equations, including nonlinear equations. Nonlinear equations are solved using the full approximation storage (FAS) method.
5. References:
   - Developed by Achi Brandt in the 1970s; see the original references on multi-level adaptive techniques.
   - Briggs, W.L., Henson, V.E. and McCormick, S.F., A Multigrid Tutorial, 2nd Edition, SIAM (2000).
   - Thomas, J.L., Diskin, B. and Brandt, A., "Textbook Multigrid Efficiency for Fluid Simulations," Annual Review of Fluid Mechanics (2003), 35, pp. 317-340.

14.8 Treatment of Nonlinear Convective Terms

14.8.1 Picard Iteration
The methods discussed thus far apply to general linear elliptic partial differential equations. In fluid mechanics and heat transfer applications, the Laplacian operator ∇² = ∇·∇ accounts for the diffusion of momentum and/or heat throughout the domain. In applications that also involve convection, additional terms are required that are nonlinear. We now discuss how to treat nonlinear convective terms using the Burgers equations.
Consider the two-dimensional, steady Burgers equations

    Re (u ∂u/∂x + v ∂u/∂y) = ∂²u/∂x² + ∂²u/∂y²,    (14.58)

    Re (u ∂v/∂x + v ∂v/∂y) = ∂²v/∂x² + ∂²v/∂y²,    (14.59)

which represent a simplified prototype of the Navier-Stokes equations, as there are no pressure terms. The velocity components are u(x, y) and v(x, y) in the x and y directions, respectively. The Reynolds number Re is a non-dimensional parameter representing the ratio of inertial (convective) to viscous forces in the flow; larger Reynolds numbers result in increased nonlinearity of the equations. The terms on the left-hand side are the convection terms, and those on the right-hand side are the viscous, or diffusion, terms. The Burgers equations are elliptic owing to the nature of the second-order viscous terms, but the convection terms make the equations nonlinear (more precisely, quasi-linear).
A simple approach to linearizing the convective terms is known as Picard iteration, in which we take the coefficients of the nonlinear (first-derivative) terms to be known from the previous iteration, denoted by u*_{i,j} and v*_{i,j}.
Let us begin by approximating equation (14.58) using central differences for all derivatives as follows:

    Re [u*_{i,j} (u_{i+1,j} − u_{i−1,j})/(2Δx) + v*_{i,j} (u_{i,j+1} − u_{i,j−1})/(2Δy)]
        = (u_{i+1,j} − 2u_{i,j} + u_{i−1,j})/Δx² + (u_{i,j+1} − 2u_{i,j} + u_{i,j−1})/Δy².

Multiplying by Δx² and rearranging leads to the difference equation

    (1 − ½ Re Δx u*_{i,j}) u_{i+1,j} + (1 + ½ Re Δx u*_{i,j}) u_{i−1,j}
        + (β − ½ Re Δx β^{1/2} v*_{i,j}) u_{i,j+1} + (β + ½ Re Δx β^{1/2} v*_{i,j}) u_{i,j−1}
        − 2(1 + β) u_{i,j} = 0,    (14.60)

where β = (Δx/Δy)². We can solve equation (14.60) using any of the iterative methods discussed except SOR; we generally need under-relaxation for nonlinear problems. For the iteration to converge, however, we must check that equation (14.60) is diagonally dominant. To be diagonally dominant, we must have

    |1 − p| + |1 + p| + |β − q| + |β + q| ≤ |−2(1 + β)|,

where

    p = ½ Re Δx u*_{i,j},    q = ½ Re Δx β^{1/2} v*_{i,j}.

Suppose, for example, that p > 1 and q > β; then this requires that

    (p − 1) + (1 + p) + (q − β) + (β + q) ≤ 2(1 + β),

or

    2(p + q) ≤ 2(1 + β);

but with p > 1 and q > β this condition cannot be satisfied, and equation (14.60) is not diagonally dominant. The same result holds for p < −1 and q < −β. Therefore, we must have |p| ≤ 1 and |q| ≤ β, or

    |½ Re Δx u*_{i,j}| ≤ 1    and    |½ Re Δx v*_{i,j}| ≤ β^{1/2},

which is a restriction on the mesh size for a given Reynolds number and velocity field. There are two difficulties with this approach:
1. As the Reynolds number Re increases, the grid sizes Δx and Δy must decrease.
2. The velocities u*_{i,j} and v*_{i,j} vary throughout the domain and are not known in advance.
The underlying cause of these numerical convergence problems is the use of central-difference approximations for the first-order derivatives: they contribute to the off-diagonal terms but not to the main-diagonal term, thereby adversely affecting diagonal dominance.

14.8.2 Upwind-Downwind Differencing
In order to restore diagonal dominance, we use forward or backward differences for the first-derivative terms depending upon the signs of their coefficients, that is, the velocities. For example, consider the u ∂u/∂x term:
1. If u*_{i,j} > 0, then using a backward difference gives

    u ∂u/∂x ≈ u*_{i,j} (u_{i,j} − u_{i−1,j})/Δx + O(Δx),

which gives a positive addition to the u_{i,j} term to promote diagonal dominance (note the sign of the u_{i,j} terms arising from the viscous terms on the right-hand side of the difference equation).
2. If u*_{i,j} < 0, then using a forward difference gives

    u ∂u/∂x ≈ u*_{i,j} (u_{i+1,j} − u_{i,j})/Δx + O(Δx),

which again gives a positive addition to the u_{i,j} term to promote diagonal dominance.
Similarly, for the v ∂u/∂y term:
1. If v*_{i,j} > 0, then use a backward difference according to

    v ∂u/∂y ≈ v*_{i,j} (u_{i,j} − u_{i,j−1})/Δy + O(Δy).

2. If v*_{i,j} < 0, then use a forward difference according to

    v ∂u/∂y ≈ v*_{i,j} (u_{i,j+1} − u_{i,j})/Δy + O(Δy).

Let us consider diagonal dominance of the approximations to the x-derivative terms in the Burgers equation (14.58):

    T_x = ∂²u/∂x² − Re u ∂u/∂x
        ≈ (u_{i+1,j} − 2u_{i,j} + u_{i−1,j})/Δx²
          − (Re u*_{i,j}/Δx) × { u_{i,j} − u_{i−1,j},   u*_{i,j} > 0
                                 u_{i+1,j} − u_{i,j},   u*_{i,j} < 0 };

therefore, multiplying by Δx² to keep the coefficients O(1) gives

    Δx² T_x = { u_{i+1,j} + (1 + Re Δx u*_{i,j}) u_{i−1,j} − (2 + Re Δx u*_{i,j}) u_{i,j},   u*_{i,j} > 0
                (1 − Re Δx u*_{i,j}) u_{i+1,j} + u_{i−1,j} − (2 − Re Δx u*_{i,j}) u_{i,j},   u*_{i,j} < 0 }.

As a result, if we are using ADI with splitting of the x- and y-derivative terms, diagonal dominance of the tridiagonal problems along lines of constant y requires the following:
1. For u*_{i,j} > 0 (Re Δx u*_{i,j} > 0), we must have

    |1| + |1 + Re Δx u*_{i,j}| ≤ |−(2 + Re Δx u*_{i,j})|,

in which case

    1 + 1 + Re Δx u*_{i,j} = 2 + Re Δx u*_{i,j},

which is weakly diagonally dominant.
2. For u*_{i,j} < 0 (Re Δx u*_{i,j} < 0), we must have

    |1| + |1 − Re Δx u*_{i,j}| ≤ |−(2 − Re Δx u*_{i,j})|,

in which case

    1 + 1 − Re Δx u*_{i,j} = 2 − Re Δx u*_{i,j},

which is also weakly diagonally dominant.
The same is true for the y-derivative terms when sweeping along lines of constant x.
Remarks:
1. Upwind-downwind differencing forces diagonal dominance; therefore, the iteration will always converge with no mesh restrictions.
2. In order to maintain consistency, use upwind-downwind differencing of both first-order terms whether sweeping along lines of constant x or constant y. Thus, there are four possible cases at each point in the grid.
3. Note that we have linearized the difference equation, not the differential equation, in order to obtain a linear system of algebraic equations. Consequently, the physical nonlinearity is still being accounted for in the numerical solution.
4. Upwind-downwind differencing is motivated by numerical, not physical, considerations. It is simply a scheme to increase the magnitude of the diagonal coefficients of u_{i,j}.
5. The forward and backward differences used for the first-order derivatives are only first-order accurate, that is, the method is O(Δx, Δy) accurate. A sketch of how the upwinded coefficients are assembled is given below.
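The following is a minimal sketch of how the tridiagonal coefficients of Δx² T_x above might be assembled along a line of constant y, switching between backward and forward differences according to the sign of u*; only the x-sweep contribution is shown, and all names are illustrative.

```python
import numpy as np

def upwind_x_coefficients(ustar, Re, dx):
    """Tridiagonal coefficients of dx^2 * Tx with upwind-downwind differencing."""
    r = Re * dx * ustar
    lower = np.where(ustar > 0, 1.0 + r, 1.0)               # coefficient of u_{i-1,j}
    upper = np.where(ustar > 0, 1.0, 1.0 - r)               # coefficient of u_{i+1,j}
    diag  = np.where(ustar > 0, -(2.0 + r), -(2.0 - r))     # coefficient of u_{i,j}
    return lower, diag, upper

# Weak diagonal dominance holds for either sign of ustar: |lower| + |upper| = |diag|.
lo, di, up = upwind_x_coefficients(np.array([-1.5, 0.3, 2.0]), Re=40.0, dx=0.05)
print(np.abs(lo) + np.abs(up) - np.abs(di))                 # prints zeros
```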
To see the potential effects of the first-order accuracy in the convective terms, consider the one-dimensional Burgers equation

    Re u du/dx = d²u/dx².    (14.61)

Recall from Section 3.2 that, for example, the first-order, backward-difference approximation to the first derivative is

    (du/dx)_i = (u_i − u_{i−1})/Δx + (Δx/2)(d²u/dx²)_i + ...,

where we have included the truncation error. Substituting into (14.61) gives

    Re u_i [(u_i − u_{i−1})/Δx + (Δx/2)(d²u/dx²)_i + ...] = (d²u/dx²)_i,

or

    Re u_i (u_i − u_{i−1})/Δx = [1 − (Re/2) Δx u_i] (d²u/dx²)_i.

Therefore, depending upon the values of Re, Δx, and u, the truncation error from the first-derivative terms, which is not included in the numerical solution, may be of the same order as, or even larger than, the physical diffusion term. This is often referred to as artificial, or numerical, diffusion, the effects of which increase with increasing Reynolds number. There are two potential remedies to the first-order approximations inherent in the upwind-downwind differencing approach as usually implemented:
1. Second-order, that is, O(Δx², Δy²), accuracy can be restored using deferred correction, in which we use the approximate solution for u to evaluate the leading term of the truncation error, which is then added to the original discretized equation as a source term.
2. Alternatively, we could use second-order accurate forward and backward differences, but the resulting system of equations would no longer be tridiagonal. In other words, it would no longer be compact.

14.8.3 Newton Linearization
Picard iteration is the standard method for treating the nonlinear convective terms that appear in transport equations. An alternative approach that can be applied to any type of nonlinear term is Newton linearization. To illustrate the method, let us consider the u ∂u/∂x term, which is approximated using a central difference as follows:

    u_{i,j} (u_{i+1,j} − u_{i−1,j})/(2Δx) + O(Δx²).    (14.62)

Take û_{i,j} to be a close numerical approximation of u(x_i, y_j); then the exact solution can be written as

    u_{i,j} = û_{i,j} + Δu_{i,j},    (14.63)

where Δu_{i,j} is assumed to be small. Substituting into equation (14.62) yields (neglecting the factor 1/(2Δx) for now)

    (û_{i,j} + Δu_{i,j}) (û_{i+1,j} + Δu_{i+1,j} − û_{i−1,j} − Δu_{i−1,j}).

Multiplying out and neglecting terms that are quadratic in Δu gives

    û_{i,j} û_{i+1,j} − û_{i,j} û_{i−1,j} + û_{i,j} Δu_{i+1,j} − û_{i,j} Δu_{i−1,j} + û_{i+1,j} Δu_{i,j} − û_{i−1,j} Δu_{i,j} + H.O.T.    (14.64)

Substituting equation (14.63) for the Δu terms gives

    û_{i,j} û_{i+1,j} − û_{i,j} û_{i−1,j} + û_{i,j} (u_{i+1,j} − û_{i+1,j}) − û_{i,j} (u_{i−1,j} − û_{i−1,j})
        + û_{i+1,j} (u_{i,j} − û_{i,j}) − û_{i−1,j} (u_{i,j} − û_{i,j}).    (14.65)

We then replace equation (14.62) with equation (14.65) divided by 2Δx and iterate until convergence as usual.
Remarks:
1. Newton's method exhibits a quadratic convergence rate if a good initial guess is used, that is, if Δu is small.
2. It can diverge if Δu is too large.
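As a concrete illustration of equations (14.62)-(14.65), the following sketch evaluates the linearized u ∂u/∂x-type product about a current iterate û along one grid line and confirms that the neglected terms are quadratic in Δu; names and data are illustrative, and the surrounding solver is not shown.

```python
import numpy as np

def newton_linearized_convection(u, uhat):
    """Linearized u_i*(u_{i+1} - u_{i-1}) per (14.65), interior points only."""
    return (uhat[1:-1]*uhat[2:] - uhat[1:-1]*uhat[:-2]          # uhat*uhat terms
            + uhat[1:-1]*(u[2:] - uhat[2:])                     # uhat_i * du_{i+1}
            - uhat[1:-1]*(u[:-2] - uhat[:-2])                   # -uhat_i * du_{i-1}
            + (uhat[2:] - uhat[:-2])*(u[1:-1] - uhat[1:-1]))    # (uhat_{i+1} - uhat_{i-1}) * du_i

# Halving the perturbation du reduces the linearization error by about 4x,
# consistent with the neglected O(du^2) terms (quadratic convergence behavior).
uhat = np.linspace(1.0, 2.0, 11)
for eps in (1e-2, 5e-3):
    u = uhat + eps*np.sin(np.arange(11.0))
    exact = u[1:-1]*(u[2:] - u[:-2])
    print(eps, np.max(np.abs(exact - newton_linearized_convection(u, uhat))))
```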

15
Finite-Difference Methods for Parabolic Partial Differential Equations

Whereas the elliptic partial differential equations that were the subject of the previous chapter are boundary-value problems, parabolic partial differential equations are initial-value problems. Parabolic equations have one preferred direction of propagation of the solution, which is usually time. Time-dependent problems are often referred to as unsteady or transient. The canonical parabolic partial differential equation is the one-dimensional, unsteady diffusion equation

    ∂φ/∂t = α ∂²φ/∂x²,    (15.1)

where α is the diffusivity of the material through which diffusion is occurring. Because parabolic problems are initial-value problems, we need to develop numerical methods that march in the preferred direction (time) in a step-by-step manner. Unlike elliptic problems, for which the approximate numerical solution must be stored throughout the entire domain, in parabolic problems it is only necessary to store in memory the approximate solution at the current and previous time steps. Solutions obtained at earlier time steps can be saved to permanent storage, but do not need to be retained in memory.
Consider the general linear, one-dimensional, unsteady equation

    ∂φ/∂t = a(x, t) ∂²φ/∂x² + b(x, t) ∂φ/∂x + c(x, t) φ + d(x, t),    (15.2)

which is parabolic forward in time for a(x, t) > 0. The one-dimensional, unsteady diffusion equation (15.1) corresponds to a = α and b = c = d = 0. In various fields, the dependent variable φ(x, t) represents different quantities, for example, temperature in heat conduction, velocity in momentum diffusion, and concentration in mass diffusion. Techniques developed for the canonical equation (15.1) can be used to solve the more general equation (15.2).
There are two basic techniques for numerically solving parabolic problems:
1. Method of lines: Discretize the spatial derivatives to reduce the partial differential equation to a set of ordinary differential equations in time, and solve using predictor-corrector, Runge-Kutta, or similar methods.
2. Marching methods: Discretize in both space and time.
   i) Explicit methods obtain a single equation for φ(x, t) at each mesh point.
   ii) Implicit methods obtain a set of algebraic equations for φ(x, t) at all mesh points for each time step.
Explicit methods are designed to have a single unknown when the difference equation is applied at each grid point, which allows the solution to be updated explicitly at each point in terms of the approximate solution at surrounding points (cf. Gauss-Seidel iteration). Implicit methods, on the other hand, result in multiple unknowns when applied at each grid point, which requires the solution of a system of algebraic equations at each time step (cf. ADI).

15.1 Explicit Methods
In explicit methods, the spatial derivatives are all evaluated at the previous time level(s), resulting in a single unknown φ_i^{n+1} = φ(x_i, t_{n+1}) on the left-hand side of the difference equation. Note that now the superscript n denotes the time step rather than the iteration number.

15.1.1 First-Order Explicit (Euler) Method
As illustrated in Figure 15.1, second-order accurate central differences are used for the spatial derivatives, and a first-order accurate forward difference is used for the temporal derivative. For each of the methods, take particular note of the location in the grid at which the equation is approximated.

Figure 15.1 Schematic of the first-order explicit method.

Applied to the one-dimensional, unsteady diffusion equation, we have the first-order accurate forward difference for the time derivative

    ∂φ/∂t = (φ_i^{n+1} − φ_i^n)/Δt + O(Δt),

and the second-order accurate central difference for the spatial derivative at the previous, that is, nth, time level, at which the approximate solution is known:

    ∂²φ/∂x² = (φ_{i+1}^n − 2φ_i^n + φ_{i−1}^n)/Δx² + O(Δx²).

Substituting into equation (15.1) yields

    (φ_i^{n+1} − φ_i^n)/Δt = α (φ_{i+1}^n − 2φ_i^n + φ_{i−1}^n)/Δx²,

and solving for the only unknown φ_i^{n+1} gives the explicit expression

    φ_i^{n+1} = φ_i^n + (αΔt/Δx²)(φ_{i+1}^n − 2φ_i^n + φ_{i−1}^n),

or

    φ_i^{n+1} = (1 − 2s)φ_i^n + s(φ_{i+1}^n + φ_{i−1}^n),   i = 1, ..., I + 1,    (15.3)

where s = αΔt/Δx².
Remarks:
1. Equation (15.3) is an explicit equation for φ_i^{n+1} at the (n + 1)st time step in terms of φ_i^n at the nth time step.
2. It requires one sweep over i = 1, ..., I + 1 at each time step n + 1.
3. The method is second-order accurate in space and first-order accurate in time.
4. The time step Δt may be varied from step to step.
5. As will be shown in Section 15.2, there are restrictions on Δt and Δx for the first-order explicit method applied to the one-dimensional, unsteady diffusion equation to remain stable. We say that the method is conditionally stable, because for the numerical method to remain stable requires that

    s = αΔt/Δx² ≤ 1/2,

which is very restrictive. If s > 1/2, then the method is unstable, and errors in the solution grow to become unbounded as time proceeds. A short implementation sketch is given below.
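To make the time-marching loop concrete, the following is a minimal sketch of equation (15.3) for the 1D unsteady diffusion equation with homogeneous Dirichlet boundary conditions; the diffusivity, grid, and initial condition are illustrative choices.

```python
import numpy as np

alpha, L, I = 1.0, 1.0, 50
dx = L / I
dt = 0.4*dx**2/alpha                     # keeps s = alpha*dt/dx**2 = 0.4 <= 1/2 (stable)
s = alpha*dt/dx**2

x = np.linspace(0.0, L, I + 1)
phi = np.sin(np.pi*x)                    # initial condition with known decay rate

for n in range(200):                     # march forward in time
    phi[1:-1] = (1 - 2*s)*phi[1:-1] + s*(phi[2:] + phi[:-2])   # equation (15.3)
    # phi[0] and phi[-1] remain zero (Dirichlet boundaries)

t = 200*dt
exact = np.exp(-alpha*np.pi**2*t)*np.sin(np.pi*x)
print(np.max(np.abs(phi - exact)))       # small discretization error
```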

15.1.2 Richardson Method
The Richardson method seeks to improve on the temporal accuracy of the Euler method by using a second-order accurate central difference for the time derivative, as illustrated in Figure 15.2.

Figure 15.2 Schematic of the Richardson method.

Thus, the one-dimensional, unsteady diffusion equation becomes

    (φ_i^{n+1} − φ_i^{n−1})/(2Δt) + O(Δt²) = α (φ_{i+1}^n − 2φ_i^n + φ_{i−1}^n)/Δx² + O(Δx²),

or

    φ_i^{n+1} = φ_i^{n−1} + 2s(φ_{i+1}^n − 2φ_i^n + φ_{i−1}^n),   i = 1, ..., I + 1.    (15.4)

Remarks:
1. The Richardson method is second-order accurate in both space and time.
2. We must keep Δt constant as we move from time step to time step, and a starting method is required because we need φ_i at the two previous time levels.
3. The method is unconditionally unstable for s > 0; therefore, it is not used.

15.1.3 DuFort-Frankel Method
In order to maintain second-order accuracy in time but improve the numerical stability, let us modify the Richardson method by taking an average between time levels for φ_i^n. To devise such an approximation, consider the Taylor series expansion at t_{n+1} about t_n:

    φ_i^{n+1} = φ_i^n + Δt (∂φ/∂t)_i^n + (Δt²/2)(∂²φ/∂t²)_i^n + ....

Similarly, consider the Taylor series expansion at t_{n−1} about t_n:

    φ_i^{n−1} = φ_i^n − Δt (∂φ/∂t)_i^n + (Δt²/2)(∂²φ/∂t²)_i^n + ....

Adding these Taylor series together gives

    φ_i^{n+1} + φ_i^{n−1} = 2φ_i^n + Δt² (∂²φ/∂t²)_i^n + ...,

and solving for φ_i^n leads to

    φ_i^n = ½(φ_i^{n+1} + φ_i^{n−1}) − (Δt²/2)(∂²φ/∂t²)_i^n + ....    (15.5)

Therefore, averaging between time levels in this manner (with a uniform time step Δt) is O(Δt²) accurate.
Substituting the time average (15.5) into the difference equation (15.4) from the Richardson method gives

    φ_i^{n+1} = φ_i^{n−1} + 2s(φ_{i+1}^n − φ_i^{n+1} − φ_i^{n−1} + φ_{i−1}^n),

or

    (1 + 2s)φ_i^{n+1} = (1 − 2s)φ_i^{n−1} + 2s(φ_{i+1}^n + φ_{i−1}^n).

Solving for the single unknown yields

    φ_i^{n+1} = [(1 − 2s)/(1 + 2s)] φ_i^{n−1} + [2s/(1 + 2s)](φ_{i+1}^n + φ_{i−1}^n),   i = 1, ..., I + 1.    (15.6)
However, let us consider the consistency of this approximation (see Section 2.1). Including the truncation errors of each approximation, the one-dimensional, unsteady diffusion equation is approximated as in the Richardson method (with central differences for the temporal and spatial derivatives) in the form

    ∂φ/∂t − α ∂²φ/∂x² = (φ_i^{n+1} − φ_i^{n−1})/(2Δt) − (Δt²/6)(∂³φ/∂t³)_i^n
                         − α (φ_{i+1}^n − 2φ_i^n + φ_{i−1}^n)/Δx² + α (Δx²/12)(∂⁴φ/∂x⁴)_i^n + ....

Substituting the time-averaging equation (15.5) for φ_i^n to implement the DuFort-Frankel method leads to

    ∂φ/∂t − α ∂²φ/∂x² = (φ_i^{n+1} − φ_i^{n−1})/(2Δt) − (α/Δx²)(φ_{i+1}^n + φ_{i−1}^n − φ_i^{n+1} − φ_i^{n−1})
                         − α (Δt/Δx)² (∂²φ/∂t²)_i^n − (Δt²/6)(∂³φ/∂t³)_i^n + α (Δx²/12)(∂⁴φ/∂x⁴)_i^n + ....

For consistency, all of the truncation error terms must go to zero as Δt → 0 and Δx → 0; that is, the difference equation must reduce to the differential equation as Δx, Δt → 0. The second and third truncation error terms do so; the first term, however, requires that Δt → 0 faster than Δx → 0, that is, Δt << Δx, for consistency. Because this is not the case in general, the DuFort-Frankel method is considered inconsistent.
Remarks:
1. The method is second-order accurate in both space and time.
2. We must keep Δt constant, and a starting method is necessary because the finite-difference equation involves two previous time levels.
3. The method is unconditionally stable for any s = αΔt/Δx².
4. The method is inconsistent; therefore, it is not used.
15.2 Numerical Stability Analysis
Whereas for elliptic problems our concern is the iterative convergence rate, determined by the spectral radius of the iteration matrix, in parabolic problems the concern is numerical stability. In particular, the issue is how the numerical time-marching scheme handles the inevitable small errors, for example round-off errors, that are inherent in all numerical calculations. These errors effectively act as disturbances in the solution, and the question is what happens to these disturbances as the solution progresses in time. If the small errors decay in time, that is, they are damped out, then the numerical solution is stable. If the small errors grow in time, that is, they are amplified, then the numerical solution is unstable.
There are two general techniques commonly used to test numerical methods for stability:
1. Matrix Method: The matrix method is more rigorous as it evaluates the stability of the entire scheme, including the treatment of boundary conditions. However, it is more difficult to perform because it involves determining the eigenvalues of a large matrix.
2. von Neumann Method (Fourier Analysis): The von Neumann method is only applicable to linear initial-value problems with constant coefficients; nonlinear problems must first be linearized. Therefore, it is more restrictive than the matrix method. In addition, it does not account for boundary conditions. As such, it often provides useful guidance, but it is not always conclusive for complex problems. Nevertheless, it is the most commonly used because it is significantly easier to evaluate stability with it than with the more rigorous matrix method.

15.2.1 Matrix Method
Let us denote the exact solution of the difference equation at t = t_n by Φ_i^n; then the error is

    e_i^n = Φ_i^n − φ_i^n,   i = 2, ..., I,    (15.7)

where φ_i^n is the approximate numerical solution at t = t_n. In order to illustrate the matrix method, consider the first-order explicit (Euler) method given by equation (15.3), which is repeated here:

    φ_i^{n+1} = (1 − 2s)φ_i^n + s(φ_{i+1}^n + φ_{i−1}^n),    (15.8)

where s = αΔt/Δx². Suppose that we have Dirichlet boundary conditions at x = a, b. Both Φ_i^n and φ_i^n satisfy equation (15.8); thus, the error satisfies the same equation

    e_i^{n+1} = (1 − 2s)e_i^n + s(e_{i+1}^n + e_{i−1}^n),   i = 2, ..., I.    (15.9)

This may be written in matrix form as

    e^{n+1} = A e^n,   n = 0, 1, 2, ...,    (15.10)

so that we perform a matrix multiplication to advance each time step (cf. the matrix form for iterative methods). The (I − 1) × (I − 1) matrix A and the (I − 1)-vector e^n are

    A = [ 1−2s    s                            ]         e^n = [ e_2^n     ]
        [  s     1−2s    s                     ]               [ e_3^n     ]
        [         s     1−2s    s              ]               [ e_4^n     ]
        [               ...    ...    ...      ]               [   ...     ]
        [                       s    1−2s   s  ]               [ e_{I−1}^n ]
        [                             s   1−2s ]               [ e_I^n     ]

Note that if φ is specified at the boundaries, then the error is zero there, that is, e_1^n = e_{I+1}^n = 0.

The numerical method is stable if the eigenvalues λ_j of the matrix A are such that

    |λ_j| ≤ 1   for all   λ_j,    (15.11)

that is, the spectral radius is such that ρ ≤ 1, in which case the error will not grow. This is based on the same arguments used for convergence in Section 6.2.4. Because A is tridiagonal with constant elements along each diagonal, the eigenvalues are (see Section 11.4.1)

    λ_j = 1 − 4s sin²(jπ/(2I)),   j = 1, ..., I − 1.    (15.12)

Note that there are (I − 1) eigenvalues of the (I − 1) × (I − 1) matrix A. For equation (15.11) to hold,

    −1 ≤ 1 − 4s sin²(jπ/(2I)) ≤ 1.

The right inequality is true for all j (since s > 0 and sin² ≥ 0). The left inequality is true if

    4s sin²(jπ/(2I)) ≤ 2,

that is,

    s sin²(jπ/(2I)) ≤ 1/2.

Because 0 ≤ sin²(jπ/(2I)) ≤ 1, this is true for all j if

    s = αΔt/Δx² ≤ 1/2.

Thus, the first-order explicit method with Dirichlet boundary conditions is stable for s ≤ 1/2.

Remarks:
• Whereas here we only needed the eigenvalues of a tridiagonal matrix, in general the matrix method requires determination of the eigenvalues of an (I + 1) × (I + 1) matrix.
• The effect of different boundary conditions is reflected in A and, therefore, in the resulting eigenvalues.
• Because the eigenvalues depend on the number of grid points I + 1, stability is influenced by the grid size Δx.
• This is the same method used to obtain the convergence properties of iterative methods for elliptic problems. Recall that the spectral radius ρ(I, J) is the modulus of the largest eigenvalue of the iteration matrix, and ρ ≤ 1 for an iterative method to converge. In iterative methods, however, we are concerned not only with whether they will converge or not, but also with the rate at which they converge, for example, the Gauss-Seidel method as compared to the Jacobi method. Consequently, we seek to devise algorithms that minimize the spectral radius for maximum convergence rate. For parabolic problems, however, we are only concerned with stability in an absolute sense. There is no such thing as a time-marching method that is more stable than another one; we only care about whether the spectral radius is less than or equal to one, not how much less than one. Because of this, it is often advantageous to use a time-marching (parabolic) scheme for solving steady (elliptic) problems owing to the less restrictive stability criterion. This is sometimes referred to as the pseudo-transient method. If we seek to solve the Laplace equation ∇²φ = 0, for example, it may be more computationally efficient to solve its unsteady counterpart

    ∂φ/∂t = ∇²φ

until ∂φ/∂t → 0, which corresponds to the steady solution. A numerical check of the stability bound via the eigenvalues of A is sketched below.
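The following is a minimal sketch that forms the amplification matrix A of equation (15.10) for the first-order explicit method and compares its spectral radius against the analytical eigenvalues (15.12); the values of s and I are illustrative.

```python
import numpy as np

def spectral_radius_explicit(s, I):
    """Spectral radius of the (I-1)x(I-1) matrix A of equation (15.10)."""
    m = I - 1
    A = (np.diag((1 - 2*s)*np.ones(m))
         + np.diag(s*np.ones(m - 1), 1) + np.diag(s*np.ones(m - 1), -1))
    return np.max(np.abs(np.linalg.eigvals(A)))

I = 20
for s in (0.4, 0.5, 0.6):
    rho = spectral_radius_explicit(s, I)
    lam = 1 - 4*s*np.sin(np.arange(1, I)*np.pi/(2*I))**2      # equation (15.12)
    print(f"s = {s}: rho = {rho:.4f}, max|lambda_j| = {np.max(np.abs(lam)):.4f}")
# rho <= 1 for s <= 1/2, while s = 0.6 gives a spectral radius exceeding one (unstable).
```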
15.2.2 von Neumann Method (Fourier Analysis)
If the difference equation is linear, then we can take advantage of superposition of solutions using Fourier analysis. The error is regarded as a superposition of Fourier modes, and the linearity allows us to evaluate the stability of each mode individually; if each mode is stable, then the linear superposition of the modes is also stable.
We expand the error along grid lines at each time level as a Fourier series and then determine whether the individual Fourier modes decay or amplify in time. Expanding the error at t = 0 (n = 0), we have

    e(x, 0) = Σ_{m=1}^{I−1} e_m(x, 0) = Σ_{m=1}^{I−1} a_m(0) e^{iθ_m x},    (15.13)

where 0 ≤ x ≤ 1, a_m(0) are the amplitudes of the Fourier modes, θ_m = mπ, and i = √(−1). At a later time t, the error is expanded as follows:

    e(x, t) = Σ_{m=1}^{I−1} e_m(x, t) = Σ_{m=1}^{I−1} a_m(t) e^{iθ_m x}.    (15.14)

We want to determine how a_m(t) behaves with time for each Fourier mode m. To do this, define the gain of mode m, denoted by G_m(Δx, Δt), as

    G_m(Δx, Δt) = e_m(x, t)/e_m(x, t − Δt) = a_m(t)/a_m(t − Δt),

which is the amplification factor for the mth mode during one time step. Hence, the error will not grow, that is, the method is stable, if |G_m| ≤ 1 for all m. If it takes n time steps to get to time t, then the amplification after n time steps is

    (G_m)^n = [a_m(t)/a_m(t − Δt)] [a_m(t − Δt)/a_m(t − 2Δt)] ... = a_m(t)/a_m(0),

where (G_m)^n is the nth power of G_m.


The von Neumann method consists of seeking a solution of the form (15.14), with a_m(t) = (G_m)^n a_m(0), of the error equation corresponding to the difference equation. For the first-order explicit (Euler) method, the error equation (15.9) is (using index j instead of i)

    e_j^{n+1} = (1 − 2s)e_j^n + s(e_{j+1}^n + e_{j−1}^n),   j = 2, ..., I.    (15.15)

This equation is linear; therefore, each mode m must satisfy the equation independently. Thus, substituting equation (15.14) with a_m(t) = (G_m)^n a_m(0) into equation (15.15) gives (canceling a_m(0) in each term)

    (G_m)^{n+1} e^{iθ_m x} = (1 − 2s)(G_m)^n e^{iθ_m x} + s(G_m)^n [e^{iθ_m(x+Δx)} + e^{iθ_m(x−Δx)}],

where we note that j corresponds to x, j + 1 to x + Δx, and j − 1 to x − Δx. Dividing by (G_m)^n e^{iθ_m x}, which is common to each term, we have

    G_m = (1 − 2s) + s(e^{iθ_m Δx} + e^{−iθ_m Δx})
        = 1 − 2s[1 − cos(θ_m Δx)]                     [cos(ax) = ½(e^{iax} + e^{−iax})]
        = 1 − 4s sin²(θ_m Δx/2).                      [sin² x = ½(1 − cos 2x)]

For stability, |G_m| ≤ 1; therefore,

    −1 ≤ 1 − 4s sin²(θ_m Δx/2) ≤ 1   for all   θ_m = mπ.

The right inequality holds for all m and s > 0 because

    0 ≤ sin²(θ_m Δx/2) ≤ 1.

From the same arguments used in the application of the matrix method, the left inequality holds if s = αΔt/Δx² ≤ 1/2. For this case, therefore, we obtain the same stability criterion as from the matrix method (this is unusual).
Remarks:
1. |G_m| is the modulus; thus, if the gain is complex, |G_m| equals the square root of the sum of the squares of the real and imaginary parts.
2. The advantage of the von Neumann method lies in the fact that we can evaluate the stability of each individual Fourier mode separately, in which case it is not necessary to consider a large matrix problem to evaluate stability. This is only the case, however, for linear equations, which is the primary limitation of the approach.
3. von Neumann analysis applies to linear differential equations having constant coefficients, and the boundary conditions are not accounted for; in fact, they are assumed to be periodic. A short numerical check of the gain is given below.
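The following short sketch evaluates the gain G_m = 1 − 4s sin²(θ_m Δx/2) of the first-order explicit method over all Fourier modes and reports the largest |G_m|; the grid and values of s are illustrative.

```python
import numpy as np

def max_gain_explicit(s, I, dx):
    """Largest |G_m| over the Fourier modes theta_m = m*pi, m = 1, ..., I-1."""
    theta = np.arange(1, I)*np.pi
    G = 1.0 - 4.0*s*np.sin(theta*dx/2.0)**2
    return np.max(np.abs(G))

I = 50
dx = 1.0/I
for s in (0.25, 0.5, 0.75):
    print(f"s = {s}: max |G_m| = {max_gain_explicit(s, I, dx):.3f}")
# |G_m| <= 1 for s <= 1/2, while s = 0.75 produces modes with |G_m| ~ 2 (unstable).
```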

15.3 Implicit Methods
Explicit methods are notoriously prone to numerical stability problems, as we have seen in our brief treatment, and the situation generally gets worse as the equations become more complex. We can improve dramatically on the stability properties of explicit methods by solving implicitly for more information at the new time level.

15.3.1 First-Order Implicit Method
Recall that in the first-order explicit method, a central-difference approximation is used for the spatial derivatives at the nth (previous) time level, and a first-order forward difference is used for the time derivative. In the first-order implicit method, we instead apply a central-difference approximation for the spatial derivatives at the (n + 1)st (current) time level, and a first-order backward difference for the time derivative, as illustrated in Figure 15.3.

Figure 15.3 Schematic of the first-order implicit method.

Application of the first-order implicit method to the one-dimensional, unsteady diffusion equation yields

    (φ_j^{n+1} − φ_j^n)/Δt + O(Δt) = α (φ_{j+1}^{n+1} − 2φ_j^{n+1} + φ_{j−1}^{n+1})/Δx² + O(Δx²).

Collecting the unknowns on the left-hand side, we have the difference equation

    s φ_{j+1}^{n+1} − (1 + 2s)φ_j^{n+1} + s φ_{j−1}^{n+1} = −φ_j^n,   j = 2, ..., I,    (15.16)

applied at each time step. This is a tridiagonal system for φ_j^{n+1} at the current time level. Observe that it is strictly diagonally dominant.

Let us consider a von Neumann stability analysis. The error satisfies the finite-difference equation (15.16), so

    s e_{j+1}^{n+1} − (1 + 2s)e_j^{n+1} + s e_{j−1}^{n+1} = −e_j^n.    (15.17)

We expand the error at time t according to equation (15.14) with a_m(t) = (G_m)^n a_m(0) as follows:

    e(x, t) = Σ_{m=1}^{I−1} (G_m)^n a_m(0) e^{iθ_m x},   θ_m = mπ.    (15.18)

Substituting into equation (15.17) for the error gives (after canceling a_m(0) in each term)

    s (G_m)^{n+1} e^{iθ_m(x+Δx)} − (1 + 2s)(G_m)^{n+1} e^{iθ_m x} + s (G_m)^{n+1} e^{iθ_m(x−Δx)} = −(G_m)^n e^{iθ_m x},

    [s (e^{iθ_m Δx} + e^{−iθ_m Δx}) − (1 + 2s)] G_m = −1,

    [2s cos(θ_m Δx) − (1 + 2s)] G_m = −1,

    {1 + 2s[1 − cos(θ_m Δx)]} G_m = 1.

Thus, the gain of mode m is

    G_m = [1 + 4s sin²(θ_m Δx/2)]^{−1}.

The method is stable if |G_m| ≤ 1. Therefore, noting that

    1 + 4s sin²(θ_m Δx/2) > 1

for all m and s > 0, the method is unconditionally stable.
Remarks:
1. The first-order implicit method is only O(Δt) accurate.
2. More computations are required per time step than for explicit methods. However, one can typically use much larger time steps because they are limited only by the temporal resolution required, not by stability. Specifically, for explicit methods we must choose the time step Δt for both accuracy and stability, whereas for implicit methods we need only be concerned with accuracy. A sketch using a tridiagonal (Thomas algorithm) solve is given below.
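The following is a minimal sketch of the first-order implicit method, equation (15.16), for the 1D unsteady diffusion equation with homogeneous Dirichlet boundaries, solved with the Thomas algorithm at each time step; the problem data and function names are illustrative.

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-, b = main, c = super-diagonal, d = rhs."""
    n = len(d); cp = np.zeros(n); dp = np.zeros(n); x = np.zeros(n)
    cp[0], dp[0] = c[0]/b[0], d[0]/b[0]
    for k in range(1, n):
        m = b[k] - a[k]*cp[k-1]
        cp[k] = c[k]/m
        dp[k] = (d[k] - a[k]*dp[k-1])/m
    x[-1] = dp[-1]
    for k in range(n - 2, -1, -1):
        x[k] = dp[k] - cp[k]*x[k+1]
    return x

alpha, I = 1.0, 50
dx = 1.0/I
dt = 5.0*dx**2/alpha                     # s = 5: far beyond the explicit limit, yet stable
s = alpha*dt/dx**2

x = np.linspace(0.0, 1.0, I + 1)
phi = np.sin(np.pi*x)
a = s*np.ones(I - 1); b = -(1 + 2*s)*np.ones(I - 1); c = s*np.ones(I - 1)

for n in range(40):
    rhs = -phi[1:-1]                     # right-hand side of (15.16); zero boundaries need no correction
    phi[1:-1] = thomas(a, b, c, rhs)

t = 40*dt
print(np.max(np.abs(phi - np.exp(-alpha*np.pi**2*t)*np.sin(np.pi*x))))
```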
15.3.2 Crank-Nicolson Method
Although the first-order implicit method dramatically improves the stability properties as compared to the explicit methods, it is only first-order accurate in time. Because we prefer second-order accuracy in time, consider approximating the equation midway between time levels as shown in Figure 15.4. This is known as the Crank-Nicolson method.

Figure 15.4 Schematic of the Crank-Nicolson method.

For the one-dimensional, unsteady diffusion equation, this is accomplished as follows:

    (φ_i^{n+1} − φ_i^n)/Δt + O(Δt²) = (α/2)[(∂²φ/∂x²)^{n+1} + (∂²φ/∂x²)^n] + O(Δt²),

which becomes

    (φ_i^{n+1} − φ_i^n)/Δt = (α/2)[(φ_{i+1}^{n+1} − 2φ_i^{n+1} + φ_{i−1}^{n+1})/Δx² + (φ_{i+1}^n − 2φ_i^n + φ_{i−1}^n)/Δx²].

Later we will show that averaging the second-derivative terms across time levels in this manner is second-order accurate in time. Writing the difference equation in tridiagonal form, with the unknowns on the left-hand side, we have

    s φ_{i+1}^{n+1} − 2(1 + s)φ_i^{n+1} + s φ_{i−1}^{n+1} = −s φ_{i+1}^n − 2(1 − s)φ_i^n − s φ_{i−1}^n,   i = 2, ..., I,    (15.19)

which we solve for φ_i^{n+1} at the current time level. As with the first-order implicit method, this results in a tridiagonal system of equations, but with more calculation required to evaluate the right-hand-side values.
Remarks:
1. The Crank-Nicolson method is second-order accurate in both space and time.
2. It is unconditionally stable for all s.
3. Derivative boundary conditions would be applied at the current time level.
4. The Crank-Nicolson method is sometimes called the trapezoidal method.
5. Because of its accuracy and stability properties, it is a very popular scheme for parabolic problems.
6. Observe that the approximations of the time derivative for the first-order explicit, first-order implicit, and Crank-Nicolson methods all appear to be the same; however, they are actually forward-, backward-, and central-difference approximations, respectively, owing to the point in the grid at which the equation is approximated.

Figure 15.5 Schematic for averaging across time levels.

When describing the DuFort-Frankel method, it was established that averaging across time levels is second-order accurate in time. Here we confirm this again, and in doing so we illustrate a method to determine the accuracy of a specified finite-difference approximation.
As illustrated in Figure 15.5, consider averaging a quantity φ_i midway between time levels as follows:

    φ_i^{n+1/2} = ½(φ_i^{n+1} + φ_i^n) + T.E.    (15.20)

We seek an expression of the form φ_i^{n+1/2} = φ̄ + T.E., where φ̄ is the exact value of φ_i^{n+1/2} midway between time levels. Let us expand each term in the expression (15.20) as a Taylor series about (x_i, t_{n+1/2}) as follows:

    φ_i^{n+1} = Σ_{k=0}^∞ (1/k!)(Δt/2 D_t)^k φ̄,

    φ_i^n = Σ_{k=0}^∞ (1/k!)(−Δt/2 D_t)^k φ̄ = Σ_{k=0}^∞ ((−1)^k/k!)(Δt/2 D_t)^k φ̄,

where D_t = ∂/∂t. Substituting these expansions into equation (15.20) gives

    φ_i^{n+1/2} = ½ Σ_{k=0}^∞ (1/k!)[1 + (−1)^k](Δt/2 D_t)^k φ̄.

Noting that

    1 + (−1)^k = { 0,  k = 1, 3, 5, ...
                   2,  k = 0, 2, 4, ... },

we can rewrite this expansion as (letting k = 2m)

    φ_i^{n+1/2} = Σ_{m=0}^∞ [1/(2m)!](Δt/2 D_t)^{2m} φ̄.

The first three terms of this expansion are

    φ_i^{n+1/2} = φ̄ + (1/2!)(Δt/2)² D_t² φ̄ + (1/4!)(Δt/2)⁴ D_t⁴ φ̄ + ....

The first term in the expansion (m = 0) is the exact value, and the second term (m = 1) gives the truncation error of the approximation. This truncation-error term is

    (1/8) Δt² ∂²φ̄/∂t²;

therefore, the approximation (15.20) is second-order accurate, that is,

    φ_i^{n+1/2} = ½(φ_i^{n+1} + φ_i^n) + O(Δt²).

This confirms that averaging across time levels gives an O(Δt²) approximation of φ_i at the mid-time level t_{n+1/2}.
15.4 Nonlinear Convective Problems
As with elliptic equations, which correspond to steady problems, it is essential in fluid dynamics, heat transfer, and mass transfer to be able to handle nonlinear convective terms in unsteady, that is, parabolic, contexts. Consider the one-dimensional, unsteady Burgers equation, which is a one-dimensional, unsteady diffusion equation with a convection term, given by

    ∂u/∂t = ν ∂²u/∂x² − u ∂u/∂x,    (15.21)

where ν is the viscosity. Let us consider how the nonlinear convection term u ∂u/∂x is treated in the various schemes.

15.4.1 First-Order Explicit Method
Approximating the spatial derivatives at the previous time level, and using a forward difference in time, we have

    (u_i^{n+1} − u_i^n)/Δt = ν (u_{i+1}^n − 2u_i^n + u_{i−1}^n)/Δx² − u_i^n (u_{i+1}^n − u_{i−1}^n)/(2Δx) + O(Δx², Δt),

where the u_i^n in the convection term is known from the previous time level. Writing this in explicit form leads to

    u_i^{n+1} = (s − ½ C_i^n) u_{i+1}^n + (1 − 2s) u_i^n + (s + ½ C_i^n) u_{i−1}^n,

where s = νΔt/Δx², and C_i^n = u_i^n Δt/Δx is the Courant number.
Remark:
1. For stability, this method requires that Re_Δx ≤ 2/C_i, where Re_Δx = u_i^n Δx/ν is the mesh Reynolds number. This is very restrictive.
15.4.2 Crank-Nicolson Method
In the Crank-Nicolson method, all spatial derivatives are approximated at the mid-time level as shown in Figure 15.6.

Figure 15.6 Schematic of the Crank-Nicolson method for the one-dimensional, unsteady Burgers equation.

Therefore, the one-dimensional, unsteady Burgers equation becomes

    (u_i^{n+1} − u_i^n)/Δt = (ν/2)[(∂²u/∂x²)^{n+1} + (∂²u/∂x²)^n] − ½ u_i^{n+1/2} [(∂u/∂x)^{n+1} + (∂u/∂x)^n]
        = (ν/(2Δx²))[u_{i+1}^{n+1} − 2u_i^{n+1} + u_{i−1}^{n+1} + u_{i+1}^n − 2u_i^n + u_{i−1}^n]
          − (u_i^{n+1/2}/(4Δx))[u_{i+1}^{n+1} − u_{i−1}^{n+1} + u_{i+1}^n − u_{i−1}^n] + O(Δx², Δt²),    (15.22)

where we average across time levels to obtain the velocity according to

    u_i^{n+1/2} = ½(u_i^{n+1} + u_i^n) + O(Δt²).    (15.23)

This results in the implicit finite-difference equation

    −(s − ½ C_i^{n+1/2}) u_{i+1}^{n+1} + 2(1 + s) u_i^{n+1} − (s + ½ C_i^{n+1/2}) u_{i−1}^{n+1}
        = (s − ½ C_i^{n+1/2}) u_{i+1}^n + 2(1 − s) u_i^n + (s + ½ C_i^{n+1/2}) u_{i−1}^n,    (15.24)

where here C_i^{n+1/2} = u_i^{n+1/2} Δt/Δx, but we do not yet know u_i^{n+1/2}. Therefore, this procedure requires iteration at each time step (a sketch is given after the remarks below):
1. Begin with u_i^k = u_i^n (k = 0), that is, use the u_i from the previous time step as the initial guess at the current time step.
2. Increment k = k + 1. Compute an update for u_i^k = u_i^{n+1}, i = 1, ..., I + 1, using equation (15.24).
3. Update u_i^{n+1/2} = ½(u_i^{n+1} + u_i^n).
4. Return to step (2) and repeat until u_i^k = u_i^{n+1} converges for all i.
Remarks:
1. It typically requires fewer than ten iterations to converge at each time step; if more are required, then the time step Δt is too large.
2. In elliptic problems, we use Picard iteration because we only care about the final converged solution, whereas here we want an accurate solution at each time step, thereby requiring iteration.
3. When nonlinear convective terms are included, explicit methods lead to very restrictive time steps in order to maintain numerical stability. Implicit methods improve on the stability properties but require iteration at each time step.
4. See Appendix B for an unusual case in which the unsteady Burgers equation can be transformed into the linear steady diffusion equation.
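The following is a minimal sketch of the Crank-Nicolson scheme (15.24) with the iteration on u^{n+1/2} described in steps (1)-(4) above, for homogeneous Dirichlet end values; the problem data, iteration limits, and tolerances are illustrative.

```python
import numpy as np

def thomas(a, b, c, d):
    """Tridiagonal solve: a = sub-, b = main, c = super-diagonal, d = rhs."""
    n = len(d); cp = np.zeros(n); dp = np.zeros(n); x = np.zeros(n)
    cp[0], dp[0] = c[0]/b[0], d[0]/b[0]
    for k in range(1, n):
        m = b[k] - a[k]*cp[k-1]
        cp[k] = c[k]/m; dp[k] = (d[k] - a[k]*dp[k-1])/m
    x[-1] = dp[-1]
    for k in range(n - 2, -1, -1):
        x[k] = dp[k] - cp[k]*x[k+1]
    return x

nu, I = 0.1, 100
dx = 1.0/I; dt = 0.01; s = nu*dt/dx**2
x = np.linspace(0.0, 1.0, I + 1)
u = np.sin(np.pi*x)                          # u(0) = u(1) = 0 held fixed

for n in range(50):
    un = u.copy()
    uh = un.copy()                           # step 1: u^{n+1/2} guess from the previous step
    uprev = un.copy()
    for k in range(10):                      # steps 2-4: iterate within the time step
        C = uh[1:-1]*dt/dx                   # Courant number at the mid-time level
        lower = -(s + 0.5*C)                 # coefficient of u_{i-1}^{n+1} in (15.24)
        diag  = 2.0*(1.0 + s)*np.ones(I - 1)
        upper = -(s - 0.5*C)                 # coefficient of u_{i+1}^{n+1}
        rhs = (s - 0.5*C)*un[2:] + 2.0*(1.0 - s)*un[1:-1] + (s + 0.5*C)*un[:-2]
        unew = un.copy()
        unew[1:-1] = thomas(lower, diag, upper, rhs)
        uh = 0.5*(unew + un)                 # step 3: update u^{n+1/2}
        if np.max(np.abs(unew - uprev)) < 1e-10:
            break
        uprev = unew
    u = unew

print(np.max(u))                             # peak velocity after 50 time steps
```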
15.4.3 Upwind-Downwind Differencing
As in the elliptic case, we can also use upwind-downwind differencing. Consider the first-order convective term in the Crank-Nicolson approximation,

    u^{n+1/2} ∂u^{n+1/2}/∂x.

Figure 15.7 Crank-Nicolson with upwind-downwind differencing of the first-order convection term when u^{n+1/2} > 0.

If u^{n+1/2} > 0, then we approximate ∂u^{n+1/2}/∂x as shown in Figure 15.7, in which case

    ∂u/∂x = ½[(∂u/∂x)^{n+1}_{i−1/2} + (∂u/∂x)^n_{i+1/2}]

is given by

    ∂u/∂x = ½[(u_i^{n+1} − u_{i−1}^{n+1})/Δx + (u_{i+1}^n − u_i^n)/Δx].    (15.25)

Figure 15.8 Crank-Nicolson with upwind-downwind differencing of the first-order convection term when u^{n+1/2} < 0.

Although the finite-difference approximation at the current time level appears to be a backward difference and that at the previous time level appears to be a forward difference, they are in fact central differences evaluated at half-points in the grid.
If instead u^{n+1/2} < 0, then we approximate ∂u^{n+1/2}/∂x as shown in Figure 15.8, in which case

    ∂u/∂x = ½[(∂u/∂x)^{n+1}_{i+1/2} + (∂u/∂x)^n_{i−1/2}]

is given by

    ∂u/∂x = ½[(u_{i+1}^{n+1} − u_i^{n+1})/Δx + (u_i^n − u_{i−1}^n)/Δx].    (15.26)

Substituting equations (15.25) and (15.26) into equation (15.22) in place of the central differences yields

    u_i^{n+1} − u_i^n = (s/2)[u_{i+1}^{n+1} − 2u_i^{n+1} + u_{i−1}^{n+1} + u_{i+1}^n − 2u_i^n + u_{i−1}^n]
        − ½ C_i^{n+1/2} { u_i^{n+1} − u_{i−1}^{n+1} + u_{i+1}^n − u_i^n,   u_i^{n+1/2} > 0
                          u_{i+1}^{n+1} − u_i^{n+1} + u_i^n − u_{i−1}^n,   u_i^{n+1/2} < 0 }.

This gives the tridiagonal problem

    −s u_{i+1}^{n+1} + 2(1 + s)u_i^{n+1} − s u_{i−1}^{n+1} + C_i^{n+1/2} { u_i^{n+1} − u_{i−1}^{n+1},   u_i^{n+1/2} > 0
                                                                          u_{i+1}^{n+1} − u_i^{n+1},   u_i^{n+1/2} < 0 }
        = s u_{i+1}^n + 2(1 − s)u_i^n + s u_{i−1}^n − C_i^{n+1/2} { u_{i+1}^n − u_i^n,   u_i^{n+1/2} > 0
                                                                    u_i^n − u_{i−1}^n,   u_i^{n+1/2} < 0 }.    (15.27)

Remarks:
1. Equation (15.27) is diagonally dominant for the one-dimensional, unsteady Burgers equation for all s and C_i^{n+1/2} (note that C_i^{n+1/2} may be positive or negative). Be sure to check this for different equations.
2. Iteration at each time step is required owing to the nonlinear term u ∂u/∂x, which may require under-relaxation on u_i; therefore,

    u_i^{k+1} = (1 − ω)u_i^k + ω ũ_i^{k+1},   k = 0, 1, 2, ...,

where ũ_i^{k+1} is the newly computed value and 0 < ω < 1 for under-relaxation.
3. As shown below, this method is (almost) O(Δx², Δt²) accurate.
Let us determine the truncation error of the Crank-Nicolson scheme with upwind-downwind differencing using the same technique as in the previous section. We have used second-order accurate central differences for the ∂u/∂t and ∂²u/∂x² terms; therefore, consider the u^{n+1/2} ∂u^{n+1/2}/∂x term with u^{n+1/2} < 0 from (15.26), which yields

    ∂u/∂x = (1/(2Δx))(u_{i+1}^{n+1} − u_i^{n+1} + u_i^n − u_{i−1}^n) + T.E.    (15.28)

We would like to determine the order of accuracy, that is, the truncation error T.E., of this approximation. Here, D_t = ∂/∂t, D_x = ∂/∂x, and ū is the exact value of u_i^{n+1/2} midway between time levels as illustrated in Figure 15.9.

Figure 15.9 Schematic for determination of truncation error of the Crank-Nicolson method with upwind-downwind differencing.

We seek an expression of the form

    ∂u/∂x = ∂ū/∂x + T.E.

Expanding each term in equation (15.28) as a two-dimensional Taylor series about (x_i, t_{n+1/2}) gives

    u_{i+1}^{n+1} = Σ_{k=0}^∞ (1/k!)(Δt/2 D_t + Δx D_x)^k ū,

    u_i^{n+1} = Σ_{k=0}^∞ (1/k!)(Δt/2 D_t)^k ū,

    u_i^n = Σ_{k=0}^∞ (1/k!)(−Δt/2 D_t)^k ū = Σ_{k=0}^∞ ((−1)^k/k!)(Δt/2 D_t)^k ū,

    u_{i−1}^n = Σ_{k=0}^∞ (1/k!)(−Δt/2 D_t − Δx D_x)^k ū = Σ_{k=0}^∞ ((−1)^k/k!)(Δt/2 D_t + Δx D_x)^k ū.
Substituting these expansions into equation (15.28) produces

    ∂u/∂x = (1/(2Δx)) Σ_{k=0}^∞ (1/k!) {[1 − (−1)^k](Δt/2 D_t + Δx D_x)^k − [1 − (−1)^k](Δt/2 D_t)^k} ū,

or

    ∂u/∂x = (1/(2Δx)) Σ_{k=0}^∞ [(1 + (−1)^{k+1})/k!] [(Δt/2 D_t + Δx D_x)^k − (Δt/2 D_t)^k] ū.

Note that

    1 + (−1)^{k+1} = { 0,  k = 0, 2, 4, ...
                       2,  k = 1, 3, 5, ... }.

Therefore, letting k = 2l + 1,

    ∂u/∂x = (1/Δx) Σ_{l=0}^∞ [1/(2l + 1)!] [(Δt/2 D_t + Δx D_x)^{2l+1} − (Δt/2 D_t)^{2l+1}] ū.    (15.29)
In order to treat the term that arises from the two-dimensional Taylor series, recall the binomial theorem

    (a + b)^k = Σ_{m=0}^k C(k, m) a^{k−m} b^m,

where C(k, m) = k!/[m!(k − m)!] are the binomial coefficients (0! = 1).
Application of the binomial theorem to the expression (15.29) leads to

    ∂u/∂x = (1/Δx) Σ_{l=0}^∞ [1/(2l + 1)!] [ Σ_{m=0}^{2l+1} C(2l+1, m)(Δt/2 D_t)^{2l+1−m} (Δx D_x)^m − (Δt/2 D_t)^{2l+1} ] ū
          = (1/Δx) Σ_{l=0}^∞ [1/(2l + 1)!] Σ_{m=1}^{2l+1} C(2l+1, m)(Δt/2 D_t)^{2l−m+1} (Δx D_x)^m ū,

because the m = 0 term, with C(2l+1, 0) = 1, cancels the −(Δt/2 D_t)^{2l+1} contribution. Thus,

    ∂u/∂x = Σ_{l=0}^∞ Σ_{m=1}^{2l+1} [1/(m!(2l − m + 1)!)] (Δt/2 D_t)^{2l−m+1} Δx^{m−1} D_x^m ū
          = ∂ū/∂x + Σ_{l=1}^∞ Σ_{m=1}^{2l+1} [1/(m!(2l − m + 1)!)] (Δt/2 D_t)^{2l−m+1} Δx^{m−1} D_x^m ū,

where the ∂ū/∂x term results from taking l = 0, m = 1 in the double summation. To obtain the truncation error, consider the l = 1 term, for which m = 1, 2, 3 produces

    (1/(1!2!))(Δt/2)² D_t² D_x ū + (1/(2!1!))(Δt/2) Δx D_t D_x² ū + (1/3!) Δx² D_x³ ū.

Therefore, the truncation error is

    O(Δt², Δt Δx, Δx²).

Consequently, if Δt < Δx, then the approximation is O(Δx²) accurate, and if Δt > Δx, then it is O(Δt²) accurate. This is better than O(Δx) or O(Δt), but strictly speaking it is not O(Δt², Δx²). Note that the O(Δt Δx) term arises owing to the diagonal averaging across time levels combined with the upwind-downwind differencing.
Remark:
1. The method is unconditionally stable.

15.5 Multidimensional Problems
In order to illustrate the various methods for treating parabolic partial differential equations, we have focused on the one-dimensional case in space. More likely, however, we are faced with solving problems that are two- or three-dimensional in space. Let us see how these methods extend to multiple dimensions.

Figure 15.10 Discretization of a two-dimensional domain for parabolic problems.

Consider the two-dimensional, unsteady diffusion equation

    ∂φ/∂t = α (∂²φ/∂x² + ∂²φ/∂y²),   φ = φ(x, y, t),    (15.30)

with initial condition

    φ(x, y, 0) = φ_0(x, y)   at   t = 0,    (15.31)

and boundary conditions (Dirichlet, Neumann, or mixed) on a closed contour C enclosing the domain. As usual, we discretize the domain as shown in Figure 15.10, but now with t being normal to the (x, y) plane.

15.5.1 First-Order Explicit Method
Approximating equation (15.30) using a forward difference in time and central differences in space at the previous time level gives

    (φ_{i,j}^{n+1} − φ_{i,j}^n)/Δt = α [(φ_{i+1,j}^n − 2φ_{i,j}^n + φ_{i−1,j}^n)/Δx² + (φ_{i,j+1}^n − 2φ_{i,j}^n + φ_{i,j−1}^n)/Δy²]
                                      + O(Δt, Δx², Δy²).

Therefore, solving for the only unknown gives the explicit expression

    φ_{i,j}^{n+1} = (1 − 2s_x − 2s_y)φ_{i,j}^n + s_x(φ_{i+1,j}^n + φ_{i−1,j}^n) + s_y(φ_{i,j+1}^n + φ_{i,j−1}^n),    (15.32)

where s_x = αΔt/Δx² and s_y = αΔt/Δy². For numerical stability, a von Neumann stability analysis requires that

    s_x + s_y ≤ 1/2.

Thus, for example, if Δx = Δy, in which case s_x = s_y = s, we must have

    s ≤ 1/4,

which is even more restrictive than for the one-dimensional, unsteady diffusion equation, which requires s ≤ 1/2 for stability. The three-dimensional case is still more restrictive, with s ≤ 1/6 required for numerical stability.

15.5.2 First-Order Implicit Method
Applying a backward difference in time and central differences in space at the current time level leads to the implicit expression

    (1 + 2s_x + 2s_y)φ_{i,j}^{n+1} − s_x(φ_{i+1,j}^{n+1} + φ_{i−1,j}^{n+1}) − s_y(φ_{i,j+1}^{n+1} + φ_{i,j−1}^{n+1}) = φ_{i,j}^n.    (15.33)

Observe that this contains five unknowns and produces a banded matrix as illustrated in Figure 15.11, rather than a tridiagonal matrix as in the one-dimensional case.

Figure 15.11 The banded matrix produced by application of the first-order implicit method to the two-dimensional, unsteady diffusion equation.

Remarks:
1. For the two-dimensional, unsteady diffusion equation, the first-order implicit method is unconditionally stable for all s_x and s_y.
2. The usual Crank-Nicolson method could be used to obtain second-order accuracy in time. It produces an implicit equation similar to that for the one-dimensional case, but with more terms on the right-hand side evaluated at the previous time step.

15.5.3 ADI Method with Time Splitting (Fractional-Step Method)
Rather than using the Crank-Nicolson method in its usual form to maintain second-order accuracy in time, it is more common to combine it with the ADI method with time splitting, which is often referred to as a fractional-step method. In this method, we split each time step into two half steps (or three for three dimensions), which results in two (or three) sets of tridiagonal problems per time step. In step 1, we solve implicitly for the terms associated with one coordinate direction, and in step 2, we solve implicitly for the terms associated with the other coordinate direction. A sketch of the two half steps is given at the end of this subsection.
For example, if we sweep along lines of constant y during the first half time step, the difference equation with Crank-Nicolson becomes

    (φ_{i,j}^{n+1/2} − φ_{i,j}^n)/(Δt/2) = α [(φ_{i+1,j}^{n+1/2} − 2φ_{i,j}^{n+1/2} + φ_{i−1,j}^{n+1/2})/Δx² + (φ_{i,j+1}^n − 2φ_{i,j}^n + φ_{i,j−1}^n)/Δy²],    (15.34)

where a central difference is used for the time derivative, and averaging across time levels is used for the spatial derivatives. Consequently, putting the unknowns on the left-hand side and the knowns on the right-hand side yields

    ½ s_x φ_{i+1,j}^{n+1/2} − (1 + s_x)φ_{i,j}^{n+1/2} + ½ s_x φ_{i−1,j}^{n+1/2} = −½ s_y φ_{i,j+1}^n − (1 − s_y)φ_{i,j}^n − ½ s_y φ_{i,j−1}^n.    (15.35)

Taking i = 1, ..., I + 1 leads to the tridiagonal problems (15.35) to be solved for φ_{i,j}^{n+1/2}, at each j = 1, ..., J + 1, at the intermediate time level.
We then sweep along lines of constant x during the second half time step using the difference equation in the form

    (φ_{i,j}^{n+1} − φ_{i,j}^{n+1/2})/(Δt/2) = α [(φ_{i+1,j}^{n+1/2} − 2φ_{i,j}^{n+1/2} + φ_{i−1,j}^{n+1/2})/Δx² + (φ_{i,j+1}^{n+1} − 2φ_{i,j}^{n+1} + φ_{i,j−1}^{n+1})/Δy²],    (15.36)

which becomes

    ½ s_y φ_{i,j+1}^{n+1} − (1 + s_y)φ_{i,j}^{n+1} + ½ s_y φ_{i,j−1}^{n+1} = −½ s_x φ_{i+1,j}^{n+1/2} − (1 − s_x)φ_{i,j}^{n+1/2} − ½ s_x φ_{i−1,j}^{n+1/2}.    (15.37)

Taking j = 1, ..., J + 1 leads to the tridiagonal problems (15.37) to be solved for φ_{i,j}^{n+1}, at each i = 1, ..., I + 1, at the current time level.
Note that this approach requires boundary conditions at the intermediate time level n + 1/2 for equation (15.35). This is straightforward if the boundary condition does not change with time; however, a bit of care is required if this is not the case. For example, if the boundary condition at x = 0 is Dirichlet, but changing with time, as follows:

    φ(0, y, t) = a(y, t),

then

    φ_{1,j}^n = a_j^n.

Subtracting equation (15.36) from (15.34) gives

    (φ_{i,j}^{n+1/2} − φ_{i,j}^n)/(Δt/2) − (φ_{i,j}^{n+1} − φ_{i,j}^{n+1/2})/(Δt/2)
        = α [(φ_{i,j+1}^n − 2φ_{i,j}^n + φ_{i,j−1}^n)/Δy² − (φ_{i,j+1}^{n+1} − 2φ_{i,j}^{n+1} + φ_{i,j−1}^{n+1})/Δy²],

and solving for the unknown at the intermediate time level results in

    φ_{i,j}^{n+1/2} = ½(φ_{i,j}^n + φ_{i,j}^{n+1}) + ¼ s_y [φ_{i,j+1}^n − 2φ_{i,j}^n + φ_{i,j−1}^n − (φ_{i,j+1}^{n+1} − 2φ_{i,j}^{n+1} + φ_{i,j−1}^{n+1})].

Applying this equation at the boundary x = 0 leads to

    φ_{1,j}^{n+1/2} = ½(a_j^n + a_j^{n+1}) + ¼ s_y [a_{j+1}^n − 2a_j^n + a_{j−1}^n − (a_{j+1}^{n+1} − 2a_j^{n+1} + a_{j−1}^{n+1})].

This provides the boundary condition for φ_{1,j} at the intermediate (n + 1/2) time level. Note that the first term on the right-hand side is the average of a at the n and n + 1 time levels, while the second and third terms are proportional to ∂²a^n/∂y² and −∂²a^{n+1}/∂y², respectively. Thus, if the boundary condition a(t) does not depend on y, then φ_{1,j}^{n+1/2} is simply the average of a^n and a^{n+1}.
Remarks:
1. The ADI method with time splitting is O(Δt², Δx², Δy²) accurate.
2. For stability, it is necessary to apply the von Neumann analysis at each half step and take the product of the resulting amplification factors, G_1 and G_2, to obtain G for the full time step. Such an analysis shows that the method is unconditionally stable for all s_x and s_y for the two-dimensional, unsteady diffusion equation.
3. In three dimensions, we require three fractional steps (Δt/3) for each time step, and the method is only conditionally stable, with

    s_x, s_y, s_z ≤ 3/2

required for stability (s_z = αΔt/Δz²).
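The following is a minimal sketch of the two half steps, equations (15.35) and (15.37), for the 2D unsteady diffusion equation with homogeneous Dirichlet boundaries; the grid, time step, and initial condition are illustrative, and thomas() is the tridiagonal solver sketched earlier.

```python
import numpy as np

def thomas(a, b, c, d):
    n = len(d); cp = np.zeros(n); dp = np.zeros(n); x = np.zeros(n)
    cp[0], dp[0] = c[0]/b[0], d[0]/b[0]
    for k in range(1, n):
        m = b[k] - a[k]*cp[k-1]
        cp[k] = c[k]/m; dp[k] = (d[k] - a[k]*dp[k-1])/m
    x[-1] = dp[-1]
    for k in range(n - 2, -1, -1):
        x[k] = dp[k] - cp[k]*x[k+1]
    return x

alpha, I, J, dt = 1.0, 40, 40, 0.01
dx = dy = 1.0/I
sx, sy = alpha*dt/dx**2, alpha*dt/dy**2

x = np.linspace(0, 1, I + 1); y = np.linspace(0, 1, J + 1)
phi = np.outer(np.sin(np.pi*x), np.sin(np.pi*y))         # boundary values are zero

lo_x = 0.5*sx*np.ones(I - 1); di_x = -(1 + sx)*np.ones(I - 1); up_x = 0.5*sx*np.ones(I - 1)
lo_y = 0.5*sy*np.ones(J - 1); di_y = -(1 + sy)*np.ones(J - 1); up_y = 0.5*sy*np.ones(J - 1)

for n in range(20):
    half = np.zeros_like(phi)
    for j in range(1, J):                                # first half step, equation (15.35)
        rhs = -0.5*sy*phi[1:-1, j+1] - (1 - sy)*phi[1:-1, j] - 0.5*sy*phi[1:-1, j-1]
        half[1:-1, j] = thomas(lo_x, di_x, up_x, rhs)
    new = np.zeros_like(phi)
    for i in range(1, I):                                # second half step, equation (15.37)
        rhs = -0.5*sx*half[i+1, 1:-1] - (1 - sx)*half[i, 1:-1] - 0.5*sx*half[i-1, 1:-1]
        new[i, 1:-1] = thomas(lo_y, di_y, up_y, rhs)
    phi = new

t = 20*dt
exact = np.exp(-2*alpha*np.pi**2*t)*np.outer(np.sin(np.pi*x), np.sin(np.pi*y))
print(np.max(np.abs(phi - exact)))                       # small, despite sx = sy >> 1/2
```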
15.5.4 Factored ADI Method
We can improve on the ADI method with time splitting using the factored ADI
method. It provides a minor reduction in computational cost as well as improving
on the stability properties for thee-dimensional cases. In addition, it can be extended naturally to the nonlinear convection case. Let us once again reconsider
the two-dimensional, unsteady diffusion equation
 2

2

=
+ 2 ,
(15.38)
t
x2
y
and apply the Crank-Nicolson approximation
n

n+1
2 n+1
i,j i,j
2 n
=
x i,j + x2 ni,j + y2 n+1
i,j + y i,j ,
t
2

where 2 represents second-order central difference operators (as in Section 14.6)


x2 i,j =

i+1,j 2i,j + i1,j


,
x2

y2 i,j =

i,j+1 2i,j + i,j1


.
y 2

Rewriting the difference equation with the


the knowns on the right leads to




1
1 t x2 + y2 n+1
=
1+
i,j
2

unknowns on the left-hand side and




1
t x2 + y2 ni,j .
2

(15.39)

15.5 Multidimensional Problems

We factor the difference operator on


 


1
2
2
1 t x + y 1
2

the left-hand side as follows




1
1
2
2
tx 1 ty ,
2
2

283

(15.40)

where the first factor only involves the difference operator in the x-direction, and
the second factor only involves the difference operator in the y-direction. Observe
that the factored operator produces an extra term as compared to the unfactored
operator
1 2 2 2 2
t x y = O(t2 ),
4
which is O(t2 ). Therefore, the factorization (15.40) is consistent with the secondorder accuracy in time of the Crank-Nicolson approximation.
The factored form of equation (15.39) is





 n
1
1
1
2
2
n+1
2
2
1 tx 1 ty i,j = 1 + t x + y i,j ,
2
2
2
which can be solved in two steps by defining the intermediate variable


1
i,j = 1 ty2 n+1
i,j .
2

(15.41)

n+1/2
Note that i,j is not the same as i,j , which is an intermediate approximation
to i,j at the half time step, in ADI with time splitting. The two-stage solution
process is:

1. Sweep along constant y-lines solving







1
1
1 tx2 i,j = 1 + t x2 + y2 ni,j ,
2
2

(15.42)

which produces a tridiagonal problem at each j for the intermediate variable


i,j , i = 1, . . . , I + 1.
2. Sweep along constant x-lines solving [from equation (15.41)]


1
2

1 ty n+1
(15.43)
i,j = i,j ,
2
which produces a tridiagonal problem at each i for n+1
i,j , j = 1, . . . , J + 1 at
the current time step. Note that the right-hand side of this equation is the
intermediate variable solved for in equation (15.42).
Remarks:
1. The factored ADI method is similar to the ADI method with time splitting, but we have an intermediate variable $\phi^{*}_{i,j}$ rather than the half-time-step value $\phi^{n+1/2}_{i,j}$. The factored ADI method is somewhat faster as it only requires one evaluation of the spatial derivatives on the right-hand side per time step [for equation (15.42)] rather than two for the ADI method with time splitting [see equations (15.35) and (15.37)].
2. The factored ADI method is $O(\Delta t^2, \Delta x^2, \Delta y^2)$ accurate and is unconditionally stable (even for three-dimensional implementations of the unsteady diffusion equation).
3. Boundary conditions are required for the intermediate variable $\phi^{*}_{i,j}$ in order to solve (15.42). These are obtained from equation (15.41) applied at the boundaries (see Moin, Section 5.9, and Fletcher, Section 8.4.1).
4. The order of solution can be reversed, that is, we could define
\[
\phi^{*}_{i,j} = \left(1 - \frac{1}{2}\Delta t\,\delta_x^2\right)\phi^{n+1}_{i,j}
\]
instead of as in equation (15.41).
5. The method can be extended naturally to three dimensions.
6. If we have nonlinear convective terms, as in the unsteady Navier-Stokes or transport equations for example, use upwind-downwind differencing as in Section 15.4.3. See Peridier, Smith & Walker, JFM, Vol. 232, pp. 99-131 (1991), which shows factored ADI with upwind-downwind differencing applied to the unsteady boundary-layer equations (but the method applies to more general equations). This algorithm is compact and maintains second-order accuracy in space and time.
7. Observe that numerical methods for parabolic problems do not require relaxation or acceleration parameters as is often the case for elliptic solvers. Recall that the issue is numerical stability, not iterative convergence. As mentioned previously, it is often advantageous to solve steady elliptic equations reframed as their unsteady parabolic counterparts using the pseudo-transient method.

16
Finite-Difference Methods for Hyperbolic
Partial Differential Equations
Like parabolic equations, hyperbolic equations represent initial-value problems. Therefore, methods for their numerical solution bear a strong resemblance to those for parabolic equations. However, there are several unique issues that arise in the solution of hyperbolic problems that must be adequately accounted for in obtaining their numerical solution.

17
Additional Topics in Numerical Methods

17.1 Numerical Solution of Coupled Systems of Partial Differential Equations
17.1.1 Introduction
Until now, we have considered methods for solving single partial differential equations; but in many fields we must solve systems of coupled equations for multiple dependent variables. In fluid dynamics, for example, we must solve the Navier-Stokes equations, which for two-dimensional, unsteady flows consist of two parabolic equations for the two velocity components and one elliptic equation for the pressure. There are two general methods for treating coupled equations numerically: sequential and simultaneous (or coupled) solutions. In the sequential method, we solve (iterate on) each equation for its dominant variable, treating the other variables as known by using the most recent values. This requires one pass through the mesh for each equation at each iteration, and iteration until convergence at each time step (if unsteady). The simultaneous (or coupled) solution method combines the coupled equations into a single system of algebraic equations. If we have n dependent variables, it produces an $n \times n$ block tridiagonal system of equations that is solved for all the dependent variables simultaneously. See, for example, S. P. Vanka, J. Comput. Phys., Vol. 65, pp. 138-158 (1986).
If solving using Gauss-Seidel, for example, a sequential method would require one Gauss-Seidel expression for each of the n dependent variables, updated in succession. A simultaneous method, in contrast, would require solution of n equations for the n unknown dependent variables directly at each grid point. The sequential method is most common because it is easiest to implement, as it directly uses existing methods for elliptic and parabolic equations. It is also straightforward to incorporate additional physics simply by adding the additional equation(s) into the iterative process.

17.1.2 Sequential Method


Let us consider the sequential method in the context of steady (elliptic) and unsteady (parabolic) problems. The iterative process for the steady (elliptic) case is illustrated in Figure 17.1. Note that it may be necessary to use under-relaxation for convergence owing to nonlinearity (see Sections 6.2.7 and 14.3.3).
For unsteady problems, which are governed by parabolic partial differential equations, an outer loop is added to account for the time marching as shown in Figure 17.2. The inner loop is performed to obtain the solution of the coupled equations at the current time step. Generally, there is a trade-off between the time step and the number of iterations required at each time step. Specifically, reducing the time step $\Delta t$ reduces the number of inner-loop iterations required for convergence. In practice, the time step should be small enough such that no more than ten to twenty iterations are required at each time step.

Figure 17.1 Schematic of the iterative process for coupled elliptic equations solved sequentially.

Figure 17.2 Schematic of the time-marching process for coupled parabolic equations solved sequentially.
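As a schematic illustration (not from the text) of the two-loop structure in Figure 17.2, the following Python sketch marches two coupled dependent variables u and v in time, iterating the two equations sequentially at each time step. The single-variable solvers advance_u and advance_v are hypothetical callables (standing in for, say, ADI solvers) that treat the other variable as known.

import numpy as np

def march_coupled(u, v, advance_u, advance_v, n_steps, dt, tol=1e-6, max_iters=20):
    """Sequential solution of two coupled unsteady equations."""
    for n in range(n_steps):                   # outer loop: time marching
        for it in range(max_iters):            # inner loop: iterate the coupled set
            u_new = advance_u(u, v, dt)        # solve u-equation, v treated as known
            v_new = advance_v(u_new, v, dt)    # solve v-equation with the latest u
            change = max(np.max(np.abs(u_new - u)), np.max(np.abs(v_new - v)))
            u, v = u_new, v_new
            if change < tol:                   # converged at this time step
                break
    return u, v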

17.2 Grids and Grid Generation


Thus far, we have used what are called uniform, collocated, structured grids. Uniform means that the grid spacings $\Delta x$ and $\Delta y$ in each direction are uniform. Collocated means that all dependent variables are approximated at the same points. Finally, structured means that there is an ordered relationship between neighboring grid cells. In the following two sections we consider alternatives to uniform and structured grids.

Figure 17.3 Schematic of a non-uniform grid.

17.2.1 Uniform Versus Non-Uniform Grids


In many physical problems, the solution has local regions of intense gradients.
This is particularly common in fluid dynamics, where thin boundary layers near
solid surfaces exhibit large gradients in velocity, for example. Therefore, it would
be beneficial to be able to impose a finer grid in such regions in order to resolve
the large gradients but without having to increase the resolution elsewhere in
the domain where it is not needed. In other words, a uniform grid would waste
computational resources where they are not needed.
Alternatively, a non-uniform grid would allow us to refine the grid only where
it is needed. Let us obtain the finite-difference approximations using Taylor series
as before, but without assuming that all xs are equal. For example, consider
the first-derivative term $\partial\phi/\partial x$ as illustrated in Figure 17.3. Applying Taylor series at $x_{i-1}$ and $x_{i+1}$ with $\Delta x_{i-1} \neq \Delta x_i$ and solving for $\partial\phi/\partial x$ leads to
\[
\frac{\partial\phi}{\partial x} \approx \frac{\phi_{i+1} - \phi_{i-1}}{\Delta x_{i-1} + \Delta x_i}
- \frac{\Delta x_i^2 - \Delta x_{i-1}^2}{2\left(\Delta x_{i-1} + \Delta x_i\right)}\left(\frac{\partial^2\phi}{\partial x^2}\right)_i
- \frac{\Delta x_i^3 + \Delta x_{i-1}^3}{6\left(\Delta x_{i-1} + \Delta x_i\right)}\left(\frac{\partial^3\phi}{\partial x^3}\right)_i + \cdots.
\]
If the grid is uniform, that is, $\Delta x_{i-1} = \Delta x_i$, then the second term vanishes, and the approximation reduces to the usual $O(\Delta x^2)$-accurate central difference approximation for the first derivative. However, for a non-uniform grid, the truncation error is only $O(\Delta x)$. We could restore second-order accuracy by using an appropriate approximation to $(\partial^2\phi/\partial x^2)_i$ in the second term, which results in
\[
\frac{\partial\phi}{\partial x} = \frac{\phi_{i+1}\,\Delta x_{i-1}^2 - \phi_{i-1}\,\Delta x_i^2 + \phi_i\left(\Delta x_i^2 - \Delta x_{i-1}^2\right)}{\Delta x_i\,\Delta x_{i-1}\left(\Delta x_{i-1} + \Delta x_i\right)} + O(\Delta x^2).
\]
As one can imagine, this gets very complicated, and it is difficult to ensure consistent accuracy for all approximations.
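As a quick numerical check (an illustration, not from the text), the second-order non-uniform approximation above can be evaluated on an arbitrarily spaced grid and compared with the exact derivative of a smooth test function:

import numpy as np

def d1_nonuniform(f, x, i):
    """Second-order first derivative at interior point i of a non-uniform grid."""
    dxm = x[i] - x[i - 1]       # Delta x_{i-1}
    dxp = x[i + 1] - x[i]       # Delta x_i
    return (f[i + 1] * dxm**2 - f[i - 1] * dxp**2
            + f[i] * (dxp**2 - dxm**2)) / (dxp * dxm * (dxm + dxp))

x = np.sort(np.concatenate([np.linspace(0.0, 1.0, 11), [0.03, 0.61]]))  # non-uniform grid
f = np.sin(x)
i = 5
print(d1_nonuniform(f, x, i), np.cos(x[i]))   # approximation versus exact derivative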
An alternative approach for concentrating grid points in certain regions of a domain employs grid transformations that map the physical domain to a computational domain that may have a simple overall shape and/or cluster grid points in regions of the physical domain where the solution varies rapidly, that is, where large gradients occur. The mapping is such that a uniform grid in the computational domain corresponds to a non-uniform grid in the physical domain, thereby allowing for utilization of all of the finite-difference approximations throughout Part II for the numerical solution carried out in the computational domain. The solution is then transformed back to the physical domain.
There are three general approaches to grid generation: algebraic, elliptic, and variational. In algebraic grid generation, an algebraic function is chosen a priori that accomplishes the desired transformation. In elliptic grid generation, an elliptic partial differential equation(s) is solved for the mapping. In variational grid generation, an optimization problem is formulated that, when solved, produces the mapping; this is called solution-adaptive grid generation.1

1 See Variational Methods with Applications in Science and Engineering, Chapter 12, for more on algebraic, elliptic, and variational grid generation.
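As a simple illustration of algebraic grid generation (an example constructed here, not taken from the text), a uniform computational coordinate $0 \le \xi \le 1$ can be mapped through an exponential stretching function to a physical grid that clusters points near x = 0, where, for instance, a thin boundary layer might need to be resolved:

import numpy as np

def stretched_grid(n, beta=3.0, length=1.0):
    """Map a uniform grid in xi to a physical grid clustered near x = 0 (beta > 0)."""
    xi = np.linspace(0.0, 1.0, n)
    return length * (np.exp(beta * xi) - 1.0) / (np.exp(beta) - 1.0)

x = stretched_grid(11)
print(np.diff(x))   # spacings are smallest near x = 0 and grow toward x = length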

17.2.2 Structured Versus Unstructured Grids


Recall that a structured grid is comprised of an ordered arrangement of cells, typically quadrilaterals (for example, rectangles) in two dimensions or hexahedra in three dimensions. They are commonly used for problems in which the geometry of the domain is of a regular shape, and they can be extended to complex domains through use of grid transformations from a complex to a simple domain.
Alternatively, additional flexibility in treating complex domain shapes, including those that have irregular shapes and/or cutouts, etcetera, can be obtained by employing unstructured grids. The cells of an unstructured grid are not constrained to be arranged in an ordered fashion, thereby providing additional flexibility in arranging the cells to accommodate complex shapes; the only requirement is that faces of adjacent cells must coincide. In this way, one can locally refine the grid as desired simply by defining smaller grid cells. For two-dimensional domains, the elements are typically quadrilateral, defined by four nodes, or triangular, defined by three nodes. In three-dimensional domains, the elements would be generalized to become hexahedral or tetrahedral. Unstructured grids present a significant challenge (and additional computational overhead) in accounting for the irregular orientation and arrangement of cells; however, they obviate the need for complicated grid generation strategies.

17.3 Finite-Element and Spectral Methods


Our focus throughout has been on the application of finite-difference methods to elliptic, parabolic, and hyperbolic partial differential equations. As we have seen, finite-difference methods:
• Are based on discretizing the governing equations in differential form using Taylor-series-based approximations.
• Provide a general and flexible framework for discretizing partial differential equations of any form.
• Utilize structured grids with transformation-based grid generation to handle complex geometries.
• Can be applied to ordinary and partial differential equations in any field.
Other numerical methods are also regularly used in solving partial differential equations. They include finite-element methods and spectral methods.
17.3.1 Comparison of Methods
Let us summarize the basic approaches along with the advantages and disadvantages of each method.
Finite-Difference Methods
Basic approach:
• Discretize the governing equations in differential form using Taylor-series-based finite-difference approximations at each grid point.
• Produces linear algebraic equations involving each grid point and surrounding points.
• Local approximation method.
• Popular in research.
Advantages:
• Relatively straightforward to understand and implement (based on Taylor series).
• Utilizes the familiar differential form of the governing equations.
• Very general, as it applies to a wide variety of problems, including complex physics.
• Can be extended to higher-order approximations.
Disadvantages:
• More difficult to implement for complex geometries.

Finite-Element Methods
Basic approach:
• Apply the conservation equations in variational form with a weight function to a set of finite elements.
• Produces a set of linear or nonlinear algebraic equations.
• Local approximation method.
• Popular in commercial codes (particularly for solid mechanics and heat transfer).
Advantages:
• Easy to treat complex geometries using inherently unstructured grids.
Disadvantages:
• Unstructured grids significantly increase computational complexity as compared to structured grids.
• Solution methods are comparatively inefficient for the dense and unstructured matrices typically resulting from finite-element discretizations; recall that finite-difference methods typically produce sparse, highly structured matrices.

Spectral Methods
Basic approach:
• Solutions of the governing equations in differential form are approximated using truncated (usually orthogonal) eigenfunction expansions.
• Produces a system of algebraic equations (steady) or a system of ordinary differential equations (unsteady) involving the coefficients in the eigenfunction expansion.
• Global approximation method, that is, we solve for the solution in the entire domain simultaneously.
• Popular for situations in which a highly accurate solution is required.
Advantages:
• Obtain highly accurate solutions when the underlying solution is smooth.
• Can achieve rapid, that is, spectral, convergence.
• Can be combined with the finite-element method to produce the spectral-element method, which combines many of the advantages of each.
Disadvantages:
• Less straightforward to implement than finite-difference methods.
• More difficult to treat complicated boundary conditions, for example, Neumann.
• Small changes in the problem, for example, boundary conditions, can cause large changes in the algorithm.
• Not well suited for solutions having large gradients.

In order to better see the relative advantages and disadvantages between the various methods, let us discuss spectral and finite-element methods in more detail for comparison with the finite-difference methods that have been the focus of our attention. We primarily focus on one-dimensional ordinary differential equations to illustrate the basic methods, recognizing that their true advantages are realized in multidimensional problems in complex domains.
17.3.2 Spectral Methods
In finite-difference methods, the continuous governing differential equation is approximated in such a way that a large system of algebraic equations is produced for the solution at each point in a grid defined by the discretization process. Spectral methods are a class of numerical techniques based on the principles of eigenfunction expansions (see Chapter 3). Rather than approximating the differential equation itself, the unknown solution is approximated directly in the form of a truncated series expansion. The primary virtue of spectral methods is that they converge very rapidly toward the exact solution as the number of terms in the expansion is increased. At its best, this convergence is exponential, which is referred to as spectral convergence.
Along with finite-element methods, spectral methods utilize the method of weighted residuals. We begin by approximating the unknown solution to a differential equation $Lu = f$ using the expansion
\[
\hat{u}(x,y,z,t) = \phi_0(x,y,z) + \sum_{n=1}^{N} c_n(t)\,\phi_n(x,y,z), \qquad (17.1)
\]
where $\hat{u}(x,y,z,t)$ is the approximate solution to the differential equation; $\phi_n(x,y,z)$ are the spatial basis, or trial, functions; $c_n(t)$ are the time-dependent coefficients; and $\phi_0(x,y,z)$ is chosen to satisfy the boundary conditions, such that $\phi_n = 0$, $n = 1, \ldots, N$ at the boundaries. Similarly, we expand the forcing function $f(x,y,z)$ in terms of the same basis functions.
Consider the exact solution $u(x,y,z,t)$ of the differential equation
\[
Lu = f. \qquad (17.2)
\]
Because the numerical solution $\hat{u}(x,y,z,t)$ is only approximate, it does not exactly satisfy the original differential equation. Substituting its expansion (17.1) into the differential equation (17.2), we define the residual as
\[
r = f - L\hat{u}.
\]
Note that as $N \to \infty$, $\hat{u} \to u$, and $r \to 0$.
In order to determine the coefficients $c_n(t)$ in the expansion (17.1), we require that the integral of the weighted residual be zero over the spatial domain, that is, that the inner product of $r$ and $w_i$ be zero, where $w_i(x)$ are the weight, or test, functions. For example, in the one-dimensional case
\[
\int_{x_0}^{x_1} r(x,t)\,w_i(x)\,dx = 0, \qquad i = 1, \ldots, N. \qquad (17.3)
\]
Substituting the trial-function expansion (17.1) into (17.3) requires that
\[
\int_{x_0}^{x_1}\left\{ f(x) - L\left[\phi_0(x) + \sum_{n=1}^{N} c_n(t)\,\phi_n(x)\right]\right\} w_i(x)\,dx = 0, \qquad i = 1, \ldots, N. \qquad (17.4)
\]
Different weighted residual methods correspond to different choices for the weight (test) functions:
Collocation:
\[
w_i(x) = \delta(x - x_i),
\]
where $\delta$ is the Dirac delta function centered at the collocation points $x_i$.

17.3 Finite-Element and Spectral Methods

293

Least squares:
wi (x) =
which results in
Galerkin:

r
,
ci

r2 dx being a minimum.
wi (x) = i (x);

that is, the weight (test) functions are the same as the basis (trial) functions.
In spectral methods, the trial functions are chosen such that they are mutually
orthogonal for reasons that will become apparent in the following example.
As a one-dimensional example, let us consider the ordinary differential equation
\[
Lu = \frac{d^2u}{dx^2} + u = 0, \qquad 0 \le x \le 1, \qquad (17.5)
\]
with the boundary conditions
\[
u = 0 \quad \text{at } x = 0, \qquad \text{and} \qquad u = 1 \quad \text{at } x = 1.
\]
Let us use sines, which are mutually orthogonal over the specified domain, for the trial functions, such that
\[
\phi_n(x) = \sin(n\pi x), \qquad n = 1, \ldots, N.
\]
Note that $\phi_n = 0$ at $x = 0, 1$. To satisfy the boundary conditions, we choose $\phi_0(x) = x$. Thus, the spectral solution (17.1) becomes
\[
\hat{u}(x) = x + \sum_{n=1}^{N} c_n\sin(n\pi x), \qquad (17.6)
\]
where the $c_n$ are constant as there is no time dependence. Differentiating gives
\[
\hat{u}'(x) = 1 + \sum_{n=1}^{N} c_n\,n\pi\cos(n\pi x),
\]
and again
\[
\hat{u}''(x) = -\sum_{n=1}^{N} c_n\,(n\pi)^2\sin(n\pi x).
\]

Substituting into the differential equation (17.5) to obtain the residual yields
\[
r(x) = f - L\hat{u} = \sum_{n=1}^{N} c_n(n\pi)^2\sin(n\pi x) - x - \sum_{n=1}^{N} c_n\sin(n\pi x),
\]
or
\[
r(x) = -x - \sum_{n=1}^{N} c_n\left[1 - (n\pi)^2\right]\sin(n\pi x).
\]

From equation (17.3) with $w_i(x) = \phi_i(x)$, that is, the Galerkin method, we have
\[
\int_0^1 r(x)\,\phi_i(x)\,dx = 0, \qquad i = 1, \ldots, N,
\]
which results in $N$ equations as follows. Substituting the weight functions and residual gives
\[
-\int_0^1\left\{ x + \sum_{n=1}^{N} c_n\left[1 - (n\pi)^2\right]\sin(n\pi x)\right\}\sin(i\pi x)\,dx = 0,
\]
or
\[
\int_0^1 x\sin(i\pi x)\,dx + \sum_{n=1}^{N} c_n\left[1 - (n\pi)^2\right]\int_0^1\sin(i\pi x)\sin(n\pi x)\,dx = 0 \qquad (17.7)
\]
for $i = 1, \ldots, N$.

Owing to the orthogonality of the sines, that is, $\int_0^1\sin(i\pi x)\sin(n\pi x)\,dx = 0$ for $n \neq i$, the only contribution to the summation in equation (17.7) arises when $n = i$. Let us evaluate the resulting integrals:
\[
\int_0^1 x\sin(i\pi x)\,dx = \left[\frac{\sin(i\pi x)}{(i\pi)^2} - \frac{x\cos(i\pi x)}{i\pi}\right]_0^1 = -\frac{(-1)^i}{i\pi} =
\begin{cases} \dfrac{1}{i\pi}, & i = 1, 3, 5, \ldots \\[4pt] -\dfrac{1}{i\pi}, & i = 2, 4, 6, \ldots \end{cases}
\]
and
\[
\int_0^1\sin^2(i\pi x)\,dx = \left[\frac{x}{2} - \frac{\sin(2i\pi x)}{4i\pi}\right]_0^1 = \frac{1}{2}.
\]

Thus, equation (17.7) becomes
\[
-\frac{(-1)^i}{i\pi} + \frac{c_i}{2}\left[1 - (i\pi)^2\right] = 0, \qquad i = 1, \ldots, N.
\]
Solving for the coefficients yields
\[
c_i = \frac{2(-1)^i}{i\pi\left[1 - (i\pi)^2\right]}, \qquad i = 1, \ldots, N, \qquad (17.8)
\]
which completes the expansion (17.6)
\[
\hat{u}(x) = x + \sum_{n=1}^{N}\frac{2(-1)^n}{n\pi\left[1 - (n\pi)^2\right]}\sin(n\pi x). \qquad (17.9)
\]

Figure 17.4 Spectral solution (17.9) with N = 1 compared with the exact solution of (17.5).

Figure 17.4 is a plot of the spectral solution with $N = 1$, which is
\[
\hat{u}(x) = x - \frac{2}{\pi\left(1 - \pi^2\right)}\sin(\pi x).
\]
For comparison, the exact solution to (17.5) is
\[
u(x) = \frac{\sin x}{\sin 1}.
\]
Observe from Figure 17.4 that, even with only one term in the spectral solution, it is indistinguishable from the exact solution.
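A brief Python sketch (assuming NumPy is available) that evaluates the Galerkin spectral solution (17.9) and compares it with the exact solution is:

import numpy as np

def u_spectral(x, N):
    """Galerkin spectral solution (17.9) of u'' + u = 0, u(0) = 0, u(1) = 1."""
    n = np.arange(1, N + 1)
    c = 2.0 * (-1.0)**n / (n * np.pi * (1.0 - (n * np.pi)**2))   # coefficients (17.8)
    return x + np.sin(np.outer(x, n * np.pi)) @ c                # expansion (17.9)

x = np.linspace(0.0, 1.0, 5)
print(u_spectral(x, 1))          # one-term spectral solution (as plotted in Figure 17.4)
print(np.sin(x) / np.sin(1.0))   # exact solution of (17.5)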
Remarks:
1. Only small N is required in the above example because the underlying solution is a sine function. This, of course, is not normally the case.
2. Because each trial function spans the entire domain, spectral methods provide global approximations.
3. Spectral methods give highly accurate approximate solutions when the underlying solution is smooth, and it is when the solution is smooth that exponential (spectral) convergence is achieved with increasing N. The number of terms required becomes very large when the solution contains large gradients, such as shocks. Finite-difference methods only experience geometric convergence. For example, in a second-order accurate method, the error scales according to $1/N^2$, which is much slower than exponential decay.
4. For periodic problems, trigonometric functions are used for the trial functions, whereas for non-periodic problems, Chebyshev or Legendre polynomials are typically used.
5. For steady problems, the $c_n$ coefficients are in general determined by solving a system of N algebraic equations.
6. For unsteady problems, the $c_n(t)$ coefficients are in general determined by solving a system of N ordinary differential equations.
7. For nonlinear problems, collocation is typically used rather than the Galerkin method. This is referred to as a pseudospectral method.
17.3.3 Finite-Element Methods
As with spectral methods, finite-element methods are typically based on weighted-residual methods. After choosing the basis (trial) and weight (test) functions, the weighted-residual equation (17.4) is integrated by parts to produce the so-called weak form of the problem; the original differential equation is known as the strong form.
In the commonly used Galerkin method, the weight functions are the same as the trial functions; therefore, equation (17.4) for the one-dimensional case becomes
\[
\int_{x_0}^{x_1}\left\{ f(x) - L\left[\phi_0(x) + \sum_{n=1}^{N} c_n(t)\,\phi_n(x)\right]\right\}\phi_i(x)\,dx = 0, \qquad i = 1, \ldots, N.
\]
Observe the similarities between obtaining the weak form and the inverse problem in variational methods.
The primary difference between finite-element and spectral methods is that N is small, that is, N = 1 (linear) or 2 (quadratic) is typical, for finite-element methods. However, these shape functions are applied individually across many small elements that make up the entire domain. In other words, just like finite-difference methods, finite-element methods are local approximation methods. Recall that in spectral methods, each basis function is applied across the entire domain, which is why N must in general be larger than in finite-element methods.
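To make the comparison concrete, the following is a minimal sketch (under the stated assumptions, not a general finite-element code) of a Galerkin finite-element solution of the same model problem (17.5), u'' + u = 0 with u(0) = 0 and u(1) = 1, using piecewise-linear shape functions. Integrating the weak form by parts over each element of length h gives the standard element stiffness matrix (1/h)[[1, -1], [-1, 1]] from the u'v' term and the element mass matrix (h/6)[[2, 1], [1, 2]] from the uv term.

import numpy as np

def fem_linear(n_el):
    """Piecewise-linear Galerkin FEM for u'' + u = 0, u(0) = 0, u(1) = 1."""
    n = n_el + 1                        # number of nodes
    x = np.linspace(0.0, 1.0, n)
    A = np.zeros((n, n))
    for e in range(n_el):
        h = x[e + 1] - x[e]
        ke = np.array([[1.0, -1.0], [-1.0, 1.0]]) / h        # element stiffness
        me = (h / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])  # element mass
        idx = [e, e + 1]
        A[np.ix_(idx, idx)] += me - ke   # weak form: -u'v' + uv integrated over element
    u = np.zeros(n)
    u[0], u[-1] = 0.0, 1.0                         # Dirichlet boundary values
    b = -A[1:-1, [0, -1]] @ u[[0, -1]]             # move known boundary values to RHS
    u[1:-1] = np.linalg.solve(A[1:-1, 1:-1], b)    # solve for interior nodal values
    return x, u

x, u = fem_linear(10)
print(np.max(np.abs(u - np.sin(x) / np.sin(1.0))))   # nodal error versus exact solution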
Remarks:
1. References:
• For more details on the variational underpinnings of finite-element methods, see Chapter 3 of Variational Methods with Applications in Science and Engineering by Cassel (2013).
• For additional details of finite-element methods and the method of weighted residuals, see Sections 6.6 and 6.7 of Fundamentals of Engineering Numerical Analysis by Moin (2010) and Computational Fluid Dynamics by Chung (2010).
2. Finite-element methods can be combined with spectral methods to obtain the spectral-element method, for which the shape functions across each element in the finite-element method are replaced by spectral expansion approximations. This typically allows for larger elements as compared to finite-element methods that have only linear or quadratic shape functions. One can then take advantage of the flexibility of finite elements in representing complex domains with the spectral accuracy and convergence rate of spectral methods.

17.4 Parallel Computing


The reader may have wondered when reading Chapters 6-16 why a range of methods were described, usually progressing from simple, but inefficient or unstable, to more complicated, but efficient and stable. It is tempting to simply cover the best algorithms and not waste our time on the others. However, with the evolution of computer hardware technologies and programming paradigms, it is impossible to anticipate which methods and algorithms will be best on future computer architectures. This is particularly true for supercomputing architectures, which are highly (and increasingly) parallel, and programming methodologies. Therefore, it is essential that we be familiar with the full range of approaches and algorithms, and be open to development of still more approaches that are optimized for new computer architectures.
As a simple example, consider the Jacobi and Gauss-Seidel methods. While the traditional view (repeated here) is that the Jacobi method should never be used because Gauss-Seidel is twice as fast and only requires half of the memory, the Jacobi method does have one advantage. With the large supercomputers available today, which have increasing numbers of CPUs, each with increasing numbers of cores, it is not hard to imagine a day when it would be possible to map each grid point in our physical domains to a single core on a supercomputer, thereby allowing for all of the points to be computed simultaneously in one single step. While this would appear to lead to an astounding speedup in our iterative algorithms, Gauss-Seidel depends upon the sweeping sequence throughout the grid, disallowing such extreme parallelism. The Jacobi method, on the other hand, has no such limitation. Despite its inherent disadvantages on serial computers, therefore, it may actually be preferred on some parallel architectures. Similarly, observe that our beloved, and widely used, Thomas algorithm is not parallelizable. Thus, while we can compute multiple grid lines in parallel, we must perform each tridiagonal solve on a single core.
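As an illustration of the point about Jacobi versus Gauss-Seidel (a sketch, not from the text), a Jacobi sweep for the Laplace equation updates every interior point from the previous iterate only, so the whole update is a single data-parallel operation with no sweeping order to respect; a Gauss-Seidel sweep, by contrast, uses values already updated during the current sweep, which serializes the traversal of the grid.

import numpy as np

def jacobi_sweep(phi):
    """One Jacobi iteration for the 2D Laplace equation on a uniform grid."""
    phi_new = phi.copy()
    phi_new[1:-1, 1:-1] = 0.25 * (phi[2:, 1:-1] + phi[:-2, 1:-1]
                                  + phi[1:-1, 2:] + phi[1:-1, :-2])
    return phi_new

phi = np.zeros((65, 65))
phi[-1, :] = 1.0                 # Dirichlet boundary condition on one side
for _ in range(200):             # every point in each sweep could be computed in parallel
    phi = jacobi_sweep(phi)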

Part III
Applications


18
Static Structural and Electrical Systems


19
Discrete Dynamical Systems

The mathematical theory of discrete dynamical systems, known generally as linear systems theory, plays a prominent role in almost all areas of science and engineering, including mechanical systems, biological systems, electrical systems, communications, signal processing, structural systems, and many others. For a system to be linear, it simply means that the mathematical model governing the system's mechanics is comprised of linear algebraic and/or differential operators. Linear systems have the property that when the input is increased or decreased, the output is increased or decreased by a proportional amount. The mathematics of linear systems is very well established and tightly integrated.
Linear systems theory is so powerful, and the theory so well developed, that when dealing with a nonlinear system, it is common to first analyze the system by linearizing the model in the vicinity of a characteristic (often equilibrium) state or solution trajectory, thereby bringing to bear the full range of tools available for analyzing linear systems. Although such an analysis only applies when the system state is close to that about which it has been linearized, this approach is often very fruitful and provides a wealth of knowledge about the system's behaviors.
In discrete systems, the masses are rigid (nondeformable), whereas in continuous systems, the mass is distributed and deformable. Discrete dynamical systems are comprised of masses, pendulums, springs, etc., and are generally governed by systems of linear ordinary differential equations (see Chapter 2). Continuous dynamical systems are comprised of strings, membranes, beams, etc., and are generally governed by linear partial differential equations (see Chapter 3).
Linear systems theory is most closely associated with the dynamics of discrete systems, for which there are two formulations that are commonly used in the context of dynamical systems. Application of Newton's second law and Hamilton's principle (through the Euler-Lagrange equations) directly produces a system of n second-order ordinary differential equations in time, where n is the number of degrees of freedom of the system. Alternatively, the state-space form is often emphasized. This form follows from Hamilton's canonical equations, which produce a system of 2n first-order linear ordinary differential equations in time.1 Of course, these two formulations are mathematically equivalent, and we can easily transform one to the other as desired.
1 See Variational Methods with Applications in Science and Engineering by Cassel (2013) for more on Hamilton's principle, the Euler-Lagrange equations, and Hamilton's canonical equations.

Figure 19.1 Schematic of the spring-mass system in Example 19.1.

19.1 An Illustrative Example


Now let us consider a physical example from dynamics.
Example 19.1 Consider the spring-mass system shown in Figure 19.1. The
system is such that there is no force in any of the springs when the masses are
located at x1 = 0, x2 = 0. Obtain the equations of motion for the system, the
natural frequencies of the system, and general solutions for the motion of the
masses.
Solution: This spring-mass system has two degrees of freedom, which is the
minimum number of dependent variables (x1 (t) and x2 (t) in this case) required
to specify the state of the system. Physically, this means that the system has two
natural frequencies, or modes, of vibration, and the general motion of the system
is a superposition of these modes.
In order to obtain the governing equations, that is, equations of motion, for this
system, recall Newton's second law: $F = ma = m\ddot{x}$. Applying Newton's second law to the free-body diagram for each mass gives
\[
m\frac{d^2x_1}{dt^2} = k(x_2 - x_1) - 2kx_1 = -3kx_1 + kx_2, \qquad (19.1)
\]
\[
2m\frac{d^2x_2}{dt^2} = -kx_2 - k(x_2 - x_1) = kx_1 - 2kx_2. \qquad (19.2)
\]
Thus, we have a system of two second-order ordinary differential equations for $x_1(t)$ and $x_2(t)$.
It is common to write such systems of equations for dynamical systems in the matrix form
\[
\mathbf{M}\ddot{\mathbf{x}} + \mathbf{C}\dot{\mathbf{x}} + \mathbf{K}\mathbf{x} = \mathbf{f}(t),
\]
where $\mathbf{M}$ is the mass matrix, $\mathbf{C}$ is the damping matrix, $\mathbf{K}$ is the stiffness matrix, $\ddot{\mathbf{x}}(t)$ is the acceleration vector, $\dot{\mathbf{x}}(t)$ is the velocity vector, $\mathbf{x}(t)$ is the displacement vector, and $\mathbf{f}(t)$ is the force vector. In the present case
\[
\mathbf{M} = \begin{bmatrix} m & 0 \\ 0 & 2m \end{bmatrix}, \quad
\mathbf{C} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad
\mathbf{K} = \begin{bmatrix} 3k & -k \\ -k & 2k \end{bmatrix}, \quad
\mathbf{f} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.
\]
Before seeking the general solution to this system of equations, let us first obtain the natural frequencies (normal modes) of the system.2 These will occur when the two masses oscillate with the same frequency; therefore, assume periodic motion of the form3
\[
x_1(t) = A_1 e^{i\omega t}, \qquad x_2(t) = A_2 e^{i\omega t},
\]
where $\omega$ is the natural frequency, and $A_1$ and $A_2$ are the amplitudes of masses 1 and 2, respectively. It is understood that we are taking the real part of the final expressions for $x_1(t)$ and $x_2(t)$. For example,
\[
x_1(t) = \mathrm{Re}\left[A_1 e^{i\omega t}\right] = \mathrm{Re}\left[A_1\cos(\omega t) + A_1 i\sin(\omega t)\right] = A_1\cos(\omega t).
\]
Evaluating the derivatives gives
\[
\frac{d^2x_1}{dt^2} = -A_1\omega^2 e^{i\omega t}, \qquad \frac{d^2x_2}{dt^2} = -A_2\omega^2 e^{i\omega t}.
\]

Substituting into equations (19.1) and (19.2) and canceling eit gives the two
linear algebraic equations
m 2 A1 = 3kA1 + kA2 ,
2m 2 A2 = kA1 2kA2 ,
or upon rearranging
3A1 A2 = A1 ,
21 A1 + A2 = A2 ,
where = m 2 /k. Thus, we have two linear algebraic equations for the two
unknown amplitudes A1 and A2 . In matrix form, this is

 
 
3 1 A1
A
= 1 .
21 1
A2
A2
Consequently, we have an eigenproblem in which the eigenvalues are related to
the natural frequencies of the system, which is typical of dynamical systems.
Solving, we find that the eigenvalues are

6
m22
6
m12
=2
, 2 =
=2+
.
1 =
k
2
k
2
2 A normal mode is a solution in which all of the parts of the system oscillate with the same frequency, so the motion of the entire system is periodic.
3 This form is better than setting $x_1(t) = A_1\cos(\omega t)$ directly, for example, which does not work if we have odd-order derivatives.

Therefore, the two natural frequencies, or modes, of the system, corresponding to the two degrees of freedom, are
\[
\omega_1 = 0.8805\sqrt{\frac{k}{m}}, \qquad \omega_2 = 1.7958\sqrt{\frac{k}{m}}.
\]
The smallest of the natural frequencies, $\omega_1$, is known as the fundamental mode. As we will show, the general motion of the system is a superposition of these two modes of oscillation for systems governed by linear equations of motion, as is the case here.
Now we return to the general solution of the system of two second-order differential equations (19.1) and (19.2). In order to convert these to a system of first-order differential equations, we make the following substitutions
\[
\tilde{x}_1(t) = x_1(t), \qquad \tilde{x}_2(t) = x_2(t), \qquad \tilde{x}_3(t) = \dot{x}_1(t), \qquad \tilde{x}_4(t) = \dot{x}_2(t). \qquad (19.3)
\]
Recall from Section 2.5.2 that two second-order ordinary differential equations transform to four first-order equations. Differentiating the substitutions and transforming to $\tilde{x}_1(t), \ldots, \tilde{x}_4(t)$, we have the following system of four equations
\[
\dot{\tilde{x}}_1(t) = \dot{x}_1(t) = \tilde{x}_3(t),
\]
\[
\dot{\tilde{x}}_2(t) = \dot{x}_2(t) = \tilde{x}_4(t),
\]
\[
\dot{\tilde{x}}_3(t) = \ddot{x}_1(t) = -3Kx_1(t) + Kx_2(t) = -3K\tilde{x}_1(t) + K\tilde{x}_2(t),
\]
\[
\dot{\tilde{x}}_4(t) = \ddot{x}_2(t) = \tfrac{1}{2}Kx_1(t) - Kx_2(t) = \tfrac{1}{2}K\tilde{x}_1(t) - K\tilde{x}_2(t),
\]
for the four unknowns $\tilde{x}_1(t)$, $\tilde{x}_2(t)$, $\tilde{x}_3(t)$, and $\tilde{x}_4(t)$, where $K = k/m$. Written in matrix form, $\dot{\tilde{\mathbf{x}}}(t) = \mathbf{A}\tilde{\mathbf{x}}(t)$, we have
\[
\begin{bmatrix} \dot{\tilde{x}}_1(t) \\ \dot{\tilde{x}}_2(t) \\ \dot{\tilde{x}}_3(t) \\ \dot{\tilde{x}}_4(t) \end{bmatrix} =
\begin{bmatrix}
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
-3K & K & 0 & 0 \\
\tfrac{1}{2}K & -K & 0 & 0
\end{bmatrix}
\begin{bmatrix} \tilde{x}_1(t) \\ \tilde{x}_2(t) \\ \tilde{x}_3(t) \\ \tilde{x}_4(t) \end{bmatrix}.
\]
For simplicity, let us take $K = k/m = 1$. Then the eigenvalues of $\mathbf{A}$, which are obtained using Matlab or Mathematica, are
\[
\lambda_1 = 1.7958i, \qquad \lambda_2 = -1.7958i, \qquad \lambda_3 = 0.8805i, \qquad \lambda_4 = -0.8805i,
\]
which are complex conjugate pairs. The corresponding eigenvectors are also complex
\[
\mathbf{u}_1 = \begin{bmatrix} -0.4747i \\ 0.1067i \\ 0.8524 \\ -0.1916 \end{bmatrix}, \quad
\mathbf{u}_2 = \begin{bmatrix} 0.4747i \\ -0.1067i \\ 0.8524 \\ -0.1916 \end{bmatrix}, \quad
\mathbf{u}_3 = \begin{bmatrix} 0.3077 \\ 0.6846 \\ 0.2709i \\ 0.6027i \end{bmatrix}, \quad
\mathbf{u}_4 = \begin{bmatrix} 0.3077 \\ 0.6846 \\ -0.2709i \\ -0.6027i \end{bmatrix}.
\]
Thus, forming the modal matrix, we have
\[
\mathbf{P} = \begin{bmatrix}
-0.4747i & 0.4747i & 0.3077 & 0.3077 \\
0.1067i & -0.1067i & 0.6846 & 0.6846 \\
0.8524 & 0.8524 & 0.2709i & -0.2709i \\
-0.1916 & -0.1916 & 0.6027i & -0.6027i
\end{bmatrix}.
\]
In order to uncouple the system of differential equations, we transform the problem according to (recall that $\mathbf{A}$ is non-symmetric with distinct eigenvalues)
\[
\tilde{\mathbf{x}}(t) = \mathbf{P}\mathbf{y}(t).
\]
With respect to $\mathbf{y}(t)$, the solution is
\[
y_1(t) = c_1 e^{1.7958it}, \qquad y_2(t) = c_2 e^{-1.7958it}, \qquad y_3(t) = c_3 e^{0.8805it}, \qquad y_4(t) = c_4 e^{-0.8805it}.
\]
Transforming back using $\tilde{\mathbf{x}}(t) = \mathbf{P}\mathbf{y}(t)$ gives the solution to the system of first-order equations in terms of $\tilde{\mathbf{x}}(t)$. From the substitutions (19.3), we obtain the solution with respect to the original variables as follows
\[
x_1(t) = \tilde{x}_1(t) = -0.4747i\,y_1(t) + 0.4747i\,y_2(t) + 0.3077\,y_3(t) + 0.3077\,y_4(t),
\]
\[
x_2(t) = \tilde{x}_2(t) = 0.1067i\,y_1(t) - 0.1067i\,y_2(t) + 0.6846\,y_3(t) + 0.6846\,y_4(t).
\]
As in Example ??, we can extract the real and imaginary parts by applying Euler's formula. Doing this, the real and imaginary parts of $x_1(t)$ and $x_2(t)$ are
\[
\mathrm{Re}(x_1) = 0.4747(c_1 + c_2)\sin(1.7958t) + 0.3077(c_3 + c_4)\cos(0.8805t),
\]
\[
\mathrm{Im}(x_1) = -0.4747(c_1 - c_2)\cos(1.7958t) + 0.3077(c_3 - c_4)\sin(0.8805t),
\]
\[
\mathrm{Re}(x_2) = -0.1067(c_1 + c_2)\sin(1.7958t) + 0.6846(c_3 + c_4)\cos(0.8805t),
\]
\[
\mathrm{Im}(x_2) = 0.1067(c_1 - c_2)\cos(1.7958t) + 0.6846(c_3 - c_4)\sin(0.8805t).
\]
To obtain the general solution for $x_1(t)$ and $x_2(t)$, we then superimpose the real and imaginary parts, that is, $x_i(t) = A\,\mathrm{Re}(x_i) + B\,\mathrm{Im}(x_i)$, where $A$ and $B$ are constants, giving
\[
x_1(t) = 0.4747a_1\sin(1.7958t) + 0.3077a_2\cos(0.8805t) - 0.4747a_3\cos(1.7958t) + 0.3077a_4\sin(0.8805t),
\]
\[
x_2(t) = -0.1067a_1\sin(1.7958t) + 0.6846a_2\cos(0.8805t) + 0.1067a_3\cos(1.7958t) + 0.6846a_4\sin(0.8805t), \qquad (19.4)
\]
where the constants are
\[
a_1 = A(c_1 + c_2), \qquad a_2 = A(c_3 + c_4), \qquad a_3 = B(c_1 - c_2), \qquad a_4 = B(c_3 - c_4).
\]

Figure 19.2 Solution for $x_1(t)$ (solid line) and $x_2(t)$ (dashed line) for Example 19.1.

Note that the general solution consists of a superposition of the two modes determined by the natural modes calculation above. Also observe that the imaginary
eigenvalues correspond to oscillatory behavior as expected for the spring-mass
system.
To obtain the four integration constants, we require four initial conditions. We
will specify the positions and velocities of the two masses as follows:
\[
x_1 = 0, \quad \dot{x}_1 = 0 \quad \text{at } t = 0, \qquad x_2 = 1, \quad \dot{x}_2 = 0 \quad \text{at } t = 0.
\]

Applying these initial conditions to (19.4), we obtain a system of four equations for the four unknown constants
\[
\begin{bmatrix}
0 & 0.3077 & -0.4747 & 0 \\
0.8525 & 0 & 0 & 0.2709 \\
0 & 0.6846 & 0.1067 & 0 \\
-0.1916 & 0 & 0 & 0.6028
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} =
\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}.
\]
Observe that the initial conditions appear in the right-hand-side vector. Solving this system gives
\[
a_1 = 0, \qquad a_2 = 1.3267, \qquad a_3 = 0.8600, \qquad a_4 = 0.
\]
Substituting these constants into the general solutions (19.4) for $x_1(t)$ (solid line) and $x_2(t)$ (dashed line) yields the solution shown in Figure 19.2.
Recall that all general solutions, including the one above, are linear combinations of the two natural frequencies. Thus, we would expect that a set of initial conditions can be found that excites only one of the natural frequencies. For example, choosing initial conditions such that only $\omega_1$ is excited is shown in Figure 19.3, and that which only excites $\omega_2$ is shown in Figure 19.4.

Figure 19.3 Solution for $x_1(t)$ (solid line) and $x_2(t)$ (dashed line) for Example 19.1 when only $\omega_1$ is excited.

Figure 19.4 Solution for $x_1(t)$ (solid line) and $x_2(t)$ (dashed line) for Example 19.1 when only $\omega_2$ is excited.

Remarks:
1. In this example, the matrix A is comprised entirely of constants, that is, it
does not depend on time. Such systems are called autonomous.
2. The diagonalization procedure is equivalent to finding an alternative set of
coordinates, y(t), sometimes called principal coordinates, with respect to which
the motions of the masses in the system are uncoupled. Although this may
seem physically counter-intuitive for a coupled mechanical system, such as our
spring-mass example, the mathematics tells us that it must be the case (except
for systems that result in nonsymmetric matrices with repeated eigenvalues,
which result in the Jordan canonical form).
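A short numerical check (a sketch assuming K = k/m = 1, as in the example, with NumPy and SciPy available) reproduces the eigenvalues of A and the response of Figure 19.2 by integrating the first-order system directly:

import numpy as np
from scipy.integrate import solve_ivp

K = 1.0
A = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [-3.0 * K, K, 0.0, 0.0],
              [0.5 * K, -K, 0.0, 0.0]])

print(np.linalg.eigvals(A))        # approximately +-1.7958i and +-0.8805i

x0 = [0.0, 1.0, 0.0, 0.0]          # x1 = 0, x2 = 1, zero initial velocities
sol = solve_ivp(lambda t, x: A @ x, (0.0, 40.0), x0,
                dense_output=True, rtol=1e-8, atol=1e-8)
t = np.linspace(0.0, 40.0, 401)
x1, x2 = sol.sol(t)[0], sol.sol(t)[1]   # displacements plotted in Figure 19.2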

19.2 Bifurcation Theory and the Phase Plane


19.3 Stability of Discrete Systems
For a given dynamical system, the procedure to determine its stability is as follows:4
1. Determine the system's equilibrium positions, that is, positions where the system remains at rest.
2. Determine stability of the system about each equilibrium position as $t \to \infty$. If, when given a small disturbance, the motion remains bounded, then the system is stable. If the motion grows and becomes unbounded, then it is unstable.
We illustrate this procedure with two examples.

4 This material is from Variational Methods with Applications in Science and Engineering, Chapter 6.
Example 19.2 Let us consider stability of the spring-mass system in Example 19.1 and shown in Figure 19.1.
Solution: Recall that the governing equations are ($K = k/m$)
\[
\frac{d^2x_1}{dt^2} = -3Kx_1 + Kx_2, \qquad \frac{d^2x_2}{dt^2} = \frac{1}{2}Kx_1 - Kx_2,
\]
or when converted to a system of first-order equations $\dot{\tilde{\mathbf{x}}} = \mathbf{A}\tilde{\mathbf{x}}$ ($\tilde{x}_1 = x_1$, $\tilde{x}_2 = x_2$, $\tilde{x}_3 = \dot{x}_1$, $\tilde{x}_4 = \dot{x}_2$), they are
\[
\begin{bmatrix} \dot{\tilde{x}}_1 \\ \dot{\tilde{x}}_2 \\ \dot{\tilde{x}}_3 \\ \dot{\tilde{x}}_4 \end{bmatrix} =
\begin{bmatrix}
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
-3K & K & 0 & 0 \\
\tfrac{1}{2}K & -K & 0 & 0
\end{bmatrix}
\begin{bmatrix} \tilde{x}_1 \\ \tilde{x}_2 \\ \tilde{x}_3 \\ \tilde{x}_4 \end{bmatrix}. \qquad (19.5)
\]
First, we obtain the equilibrium positions, for which the velocities and accelerations of the masses are both zero. Thus,
\[
\{\dot{x}_1, \dot{x}_2, \ddot{x}_1, \ddot{x}_2\} = \{\dot{\tilde{x}}_1, \dot{\tilde{x}}_2, \dot{\tilde{x}}_3, \dot{\tilde{x}}_4\} = \{0, 0, 0, 0\}.
\]
This is equivalent to setting $\dot{\tilde{\mathbf{x}}} = \mathbf{0}$ in equation (19.5) and solving $\mathbf{A}\tilde{\mathbf{x}} = \mathbf{0}$. Because $\mathbf{A}$ is invertible, the only solution to this homogeneous system is the trivial solution $\mathbf{s} = \tilde{\mathbf{x}} = \mathbf{0}$, or
\[
\mathbf{s} = \left[\tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \tilde{x}_4\right]^T = \left[0, 0, 0, 0\right]^T, \qquad (19.6)
\]
where $\mathbf{s} = [s_1, s_2, s_3, s_4]^T$ denotes an equilibrium (or stationary) point of the system at which the masses could remain indefinitely. Note that $\mathbf{s} = \mathbf{0}$ is the neutral position of the masses for which the forces in the three springs are zero.
The second step is to consider the behavior, that is, stability, of the system about this equilibrium point subject to small disturbances
\[
\tilde{\mathbf{x}} = \mathbf{s} + \epsilon\mathbf{u},
\]
where $\epsilon \ll 1$. Given that $\dot{\tilde{\mathbf{x}}} = \epsilon\dot{\mathbf{u}}$ and $\mathbf{s} = \mathbf{0}$ for this case, the linear system $\dot{\tilde{\mathbf{x}}} = \mathbf{A}\tilde{\mathbf{x}}$ becomes
\[
\dot{\mathbf{u}} = \mathbf{A}\mathbf{u}. \qquad (19.7)
\]
Observe that the equations for $\tilde{\mathbf{x}}$ and the disturbances $\mathbf{u}$ are the same in the linear case; therefore, introduction of the disturbances is often implied, but not done explicitly.
The behavior of the system (19.7) is given by its solution, which is determined from the eigenvalues of $\mathbf{A}$, which may be obtained through diagonalization as in Section 2.6.1. Recall that the eigenvalues of the coefficient matrix $\mathbf{A}$ are the complex conjugate pairs
\[
\lambda_{1,2} = \pm 1.7958\,i, \qquad \lambda_{3,4} = \pm 0.8805\,i,
\]
which correspond to harmonic motion, that is, linear combinations of $\sin(\omega_i t)$ and $\cos(\omega_i t)$, $i = 1, 2, 3, 4$. Harmonic motion remains bounded for all time; therefore, it is stable. We say that the equilibrium (stationary) point $\mathbf{s}$ of the system is linearly stable in the form of a stable center. Note that the motion does not decay toward $\mathbf{s}$ due to the lack of damping in the system.
Remarks:
1. Stability of dynamical systems is determined by the nature of their eigenvalues.
2. More degrees of freedom lead to larger systems of equations and additional natural frequencies.

Example 19.3 Now let us consider stability of a nonlinear system. As illustrated in Figure 19.5, a simple pendulum (with no damping) is a one-degree-of-freedom system with the angle $\theta(t)$ being the dependent variable. It is governed by
\[
\ddot{\theta} + \frac{g}{\ell}\sin\theta = 0. \qquad (19.8)
\]
Figure 19.5 Schematic of the simple pendulum.

Solution: To transform the nonlinear governing equation to a system of first-order equations, let
\[
x_1(t) = \theta(t), \qquad x_2(t) = \dot{\theta}(t),
\]

such that $x_1(t)$ and $x_2(t)$ give the angular position and velocity, respectively. Differentiating and substituting the governing equation gives
\[
\dot{x}_1 = \dot{\theta} = x_2, \qquad
\dot{x}_2 = \ddot{\theta} = -\frac{g}{\ell}\sin\theta = -\frac{g}{\ell}\sin x_1. \qquad (19.9)
\]
Therefore, we have a system of first-order nonlinear equations owing to the sine function.
Equilibrium positions occur where the angular velocity and angular acceleration are zero; thus,
\[
\{\dot{\theta}, \ddot{\theta}\} = \{\dot{x}_1, \dot{x}_2\} = \{0, 0\}.
\]
Thus, from the system (19.9),
\[
x_2 = 0, \qquad \frac{g}{\ell}\sin x_1 = 0,
\]
and the stationary points are given by
\[
s_1 = x_1 = n\pi, \qquad s_2 = x_2 = 0, \qquad n = 0, 1, 2, \ldots.
\]
The equilibrium states of the system $\mathbf{s} = [s_1, s_2]^T$ are then
\[
\mathbf{s} = [s_1, s_2]^T = [n\pi, 0]^T, \qquad n = 0, 1, 2, \ldots.
\]
Therefore, there are two equilibrium points corresponding to n even and n odd. The equilibrium point with n even corresponds to when the pendulum is hanging vertically downward ($\theta = 0, 2\pi, \ldots$), and the equilibrium point with n odd corresponds to when the pendulum is located vertically above the pivot point ($\theta = \pi, 3\pi, \ldots$).
In order to evaluate stability of the equilibrium states, let us impose small disturbances about the equilibrium points according to
\[
x_1(t) = s_1 + \epsilon u_1(t) = n\pi + \epsilon u_1(t), \qquad
x_2(t) = s_2 + \epsilon u_2(t) = \epsilon u_2(t),
\]
where $\epsilon \ll 1$. Substituting into equations (19.9) leads to
\[
\dot{u}_1 = u_2, \qquad \epsilon\dot{u}_2 = -\frac{g}{\ell}\sin(n\pi + \epsilon u_1). \qquad (19.10)
\]

Note that
\[
\sin(n\pi + \epsilon u_1) = \sin(n\pi)\cos(\epsilon u_1) + \cos(n\pi)\sin(\epsilon u_1) =
\begin{cases} \sin(\epsilon u_1), & n \text{ even} \\ -\sin(\epsilon u_1), & n \text{ odd} \end{cases}
\]
Because $\epsilon$ is small, we may expand the sine function in terms of a Taylor series
\[
\sin(\epsilon u_1) = \epsilon u_1 - \frac{(\epsilon u_1)^3}{3!} + \cdots = \epsilon u_1 + O(\epsilon^3),
\]
where we have neglected higher-order terms in $\epsilon$. Essentially, we have linearized the system about the equilibrium points by considering an infinitesimally small perturbation to the equilibrium (stationary) solutions of the system. Substituting into the system of equations (19.10) and canceling $\epsilon$ leads to the linear system of equations
\[
\dot{u}_1 = u_2, \qquad \dot{u}_2 = \mp\frac{g}{\ell}u_1, \qquad (19.11)
\]
where the minus sign corresponds to the equilibrium point with n even, and the plus sign to n odd. Let us consider each case in turn.
For n even, the system (19.11) in matrix form $\dot{\mathbf{u}} = \mathbf{A}\mathbf{u}$ is
\[
\begin{bmatrix} \dot{u}_1 \\ \dot{u}_2 \end{bmatrix} =
\begin{bmatrix} 0 & 1 \\ -\dfrac{g}{\ell} & 0 \end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \end{bmatrix}.
\]
In order to diagonalize the system, determine the eigenvalues of $\mathbf{A}$. We write
\[
(\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0}.
\]
For a nontrivial solution, the determinant must be zero, such that
\[
\begin{vmatrix} -\lambda & 1 \\ -\dfrac{g}{\ell} & -\lambda \end{vmatrix} = 0,
\]
or
\[
\lambda^2 + \frac{g}{\ell} = 0.
\]
Factoring the characteristic polynomial gives the eigenvalues
\[
\lambda_{1,2} = \pm\sqrt{-\frac{g}{\ell}} = \pm\sqrt{\frac{g}{\ell}}\,i \quad (\lambda \text{ imaginary}).
\]
Then in uncoupled variables $\mathbf{y}$ ($\mathbf{u} = \mathbf{P}\mathbf{y}$), where
\[
\dot{\mathbf{y}} = \mathbf{P}^{-1}\mathbf{A}\mathbf{P}\mathbf{y}, \qquad
\mathbf{P}^{-1}\mathbf{A}\mathbf{P} = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix},
\]

the solution near the equilibrium point corresponding to n even is of the form
\[
u(t) = c_1 e^{\sqrt{g/\ell}\,i t} + c_2 e^{-\sqrt{g/\ell}\,i t},
\]
or
\[
u(t) = c_1\sin\left(\sqrt{\frac{g}{\ell}}\,t\right) + c_2\cos\left(\sqrt{\frac{g}{\ell}}\,t\right).
\]
This oscillatory solution is linearly stable in the form of a stable center (see the next section) as the solution remains bounded for all time.
In the case of n odd, the system (19.11) is
\[
\begin{bmatrix} \dot{u}_1 \\ \dot{u}_2 \end{bmatrix} =
\begin{bmatrix} 0 & 1 \\ \dfrac{g}{\ell} & 0 \end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \end{bmatrix}.
\]
Finding the eigenvalues produces
\[
\begin{vmatrix} -\lambda & 1 \\ \dfrac{g}{\ell} & -\lambda \end{vmatrix} = 0, \qquad
\lambda^2 - \frac{g}{\ell} = 0, \qquad
\lambda_{1,2} = \pm\sqrt{\frac{g}{\ell}} \quad (\lambda \text{ real}).
\]
Thus, in uncoupled variables
\[
y_1 = c_1 e^{\sqrt{g/\ell}\,t}, \qquad y_2 = c_2 e^{-\sqrt{g/\ell}\,t}.
\]
Then the solution near the equilibrium point corresponding to n odd is of the form
\[
u(t) = c_1 e^{\sqrt{g/\ell}\,t} + c_2 e^{-\sqrt{g/\ell}\,t}.
\]
Observe that while the second term decays exponentially as $t \to \infty$, the first term grows exponentially, eventually becoming unbounded. Therefore, this equilibrium point is linearly unstable in the form of an unstable saddle (see the next section).
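A quick check of these conclusions (a sketch with g/l = 1 chosen only for illustration) computes the eigenvalues of the linearized coefficient matrices at the two equilibrium points:

import numpy as np

g_over_l = 1.0
A_even = np.array([[0.0, 1.0], [-g_over_l, 0.0]])   # theta_s = 0, 2*pi, ... (n even)
A_odd = np.array([[0.0, 1.0], [g_over_l, 0.0]])     # theta_s = pi, 3*pi, ... (n odd)
print(np.linalg.eigvals(A_even))   # purely imaginary: stable center
print(np.linalg.eigvals(A_odd))    # real, opposite signs: unstable saddle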

19.4 Stability of Linear, Second-Order, Autonomous Systems


As we have seen in the previous section, the stability of at least some dynamical
systems is determined from the nature of the eigenvalues of the coefficient matrix
in the equations of motion expressed as a system of first-order linear equations
$\dot{\mathbf{u}} = \mathbf{A}\mathbf{u}$. This is the case when the coefficient matrix A is comprised of constants,
that is, it does not change with time. Such systems are called autonomous. We
consider an example of a nonautonomous system in the next section. Stability
analysis that is based on the behavior of the eigenvalues, or modes, of the system
is called modal stability analysis.

In order to illustrate the possible types of stable and unstable equilibrium points, we evaluate stability of the general second-order system expressed as a system of first-order equations in the form5
\[
\begin{bmatrix} \dot{u}_1 \\ \dot{u}_2 \end{bmatrix} =
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \end{bmatrix}.
\]
The original system may be nonlinear; however, it has been linearized about the equilibrium point(s) as illustrated for the simple pendulum. For autonomous systems, the equilibrium points are steady and do not change with time. Therefore, $\dot{\mathbf{u}} = \mathbf{0}$, and the steady equilibrium points are given by $\mathbf{A}\mathbf{u} = \mathbf{0}$.
In order to characterize the stability of the system, we obtain the eigenvalues of $\mathbf{A}$ as follows:
\[
|\mathbf{A} - \lambda\mathbf{I}| = 0
\]
\[
\begin{vmatrix} A_{11} - \lambda & A_{12} \\ A_{21} & A_{22} - \lambda \end{vmatrix} = 0
\]
\[
(A_{11} - \lambda)(A_{22} - \lambda) - A_{12}A_{21} = 0
\]
\[
\lambda^2 - (A_{11} + A_{22})\lambda + A_{11}A_{22} - A_{12}A_{21} = 0
\]
\[
\lambda^2 - \mathrm{tr}(\mathbf{A})\,\lambda + |\mathbf{A}| = 0,
\]
where from the quadratic formula
\[
\lambda_{1,2} = \frac{1}{2}\left[\mathrm{tr}(\mathbf{A}) \pm \sqrt{\mathrm{tr}(\mathbf{A})^2 - 4|\mathbf{A}|}\right].
\]

5 This material is from Variational Methods with Applications in Science and Engineering, Chapter 6.
Many references summarize the various possibilities graphically by demarcating
the various stable and unstable regions on a plot of |A| versus tr(A). It is also
informative to plot solution trajectories in the phase plane, which is a plot of
u2 (t) versus u1 (t). This can be interpreted as a plot of velocity versus position of
the solution trajectory. In the phase plane, steady solutions are represented by a
single point, and periodic, or limit cycle, solutions are closed curves.
Each type of stable or unstable point is considered in turn by plotting several
solution trajectories on the phase plane. The dots on each trajectory indicate the
initial condition for that trajectory. In each case, the origin of the phase plane
is the equilibrium point around which the system has been linearized for consideration of stability. In general, complex systems may have numerous equilibrium
points, each of which exhibits one of the stable or unstable behaviors illustrated
below:
1. Unstable Saddle Point: $\mathrm{tr}(\mathbf{A})^2 - 4|\mathbf{A}| > 0$, $|\mathbf{A}| < 0$
If the eigenvalues of $\mathbf{A}$ are positive and negative real values, the equilibrium point is an unstable saddle as shown in Figure 19.6. Observe that while some trajectories move toward the equilibrium point, all trajectories eventually move away from the origin and become unbounded. Therefore, it is called an unstable saddle point. This is the behavior exhibited by the simple pendulum with n odd ($\theta = \pi$).

Figure 19.6 Phase plane plot of an unstable saddle point.

2. Unstable Node or Source: $\mathrm{tr}(\mathbf{A})^2 - 4|\mathbf{A}| > 0$, $\mathrm{tr}(\mathbf{A}) > 0$, $|\mathbf{A}| > 0$
If the eigenvalues of $\mathbf{A}$ are both positive real values, the equilibrium point is an unstable node as shown in Figure 19.7. In the case of an unstable node, all trajectories move away from the origin to become unbounded.

Figure 19.7 Phase plane plot of an unstable node or source.
3. Stable Node or Sink: $\mathrm{tr}(\mathbf{A})^2 - 4|\mathbf{A}| > 0$, $\mathrm{tr}(\mathbf{A}) < 0$, $|\mathbf{A}| > 0$
If the eigenvalues of $\mathbf{A}$ are both negative real values, the equilibrium point is a stable node as shown in Figure 19.8. In the case of a stable node, all trajectories move toward the equilibrium point and remain there.

Figure 19.8 Phase plane plot of a stable node or sink.

4. Unstable Spiral or Focus: $\mathrm{tr}(\mathbf{A})^2 - 4|\mathbf{A}| < 0$, $\mathrm{Re}[\lambda_{1,2}] > 0$ ($\mathrm{tr}(\mathbf{A}) \neq 0$)
If the eigenvalues of $\mathbf{A}$ are complex, but with positive real parts, the equilibrium point is an unstable focus. A plot of position versus time is shown in Figure 19.9 and the phase plane plot is shown in Figure 19.10. As can be seen from both plots, the trajectory begins at the origin, that is, the equilibrium point, and spirals away, becoming unbounded.

Figure 19.9 Plot of position versus time for an unstable spiral or focus.

Figure 19.10 Phase plane plot of an unstable spiral or focus.
5. Stable Spiral or Focus: $\mathrm{tr}(\mathbf{A})^2 - 4|\mathbf{A}| < 0$, $\mathrm{Re}[\lambda_{1,2}] < 0$ ($\mathrm{tr}(\mathbf{A}) \neq 0$)
If the eigenvalues of $\mathbf{A}$ are complex, but with negative real parts, the equilibrium point is a stable focus. A plot of position versus time is shown in Figure 19.11 and the phase plane plot is shown in Figure 19.12. In the case of a stable focus, the trajectory spirals in toward the equilibrium point from any initial condition.

Figure 19.11 Plot of position versus time for a stable spiral or focus.

6. Stable Center: $\mathrm{tr}(\mathbf{A})^2 - 4|\mathbf{A}| < 0$, $\mathrm{tr}(\mathbf{A}) = 0$
If the eigenvalues of $\mathbf{A}$ are purely imaginary, the equilibrium point is a stable center. A plot of position versus time is shown in Figure 19.13 and the phase plane plot is shown in Figure 19.14. A stable center is comprised of a periodic limit cycle solution centered at the equilibrium point. Because the trajectory remains bounded, the solution is stable. Recall that the simple pendulum with n even ($\theta = 0$) has a stable center. A compact classification of these cases based on $\mathrm{tr}(\mathbf{A})$ and $|\mathbf{A}|$ is sketched below.
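The following short function (a sketch based on the trace and determinant criteria listed above, using the pendulum equilibria of Example 19.3 with g/l = 1 as test cases) classifies the equilibrium point of a two-dimensional linear autonomous system:

import numpy as np

def classify(A):
    """Classify the equilibrium point of u' = A u for a 2 x 2 constant matrix A."""
    tr, det = np.trace(A), np.linalg.det(A)
    disc = tr**2 - 4.0 * det
    if disc > 0.0:
        if det < 0.0:
            return "unstable saddle point"
        return "unstable node (source)" if tr > 0.0 else "stable node (sink)"
    if np.isclose(tr, 0.0):
        return "stable center"
    return "unstable spiral (focus)" if tr > 0.0 else "stable spiral (focus)"

print(classify(np.array([[0.0, 1.0], [1.0, 0.0]])))    # pendulum, n odd: saddle
print(classify(np.array([[0.0, 1.0], [-1.0, 0.0]])))   # pendulum, n even: stable center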
There are situations for which such a modal analysis is not possible, as for
nonautonomous systems illustrated in the next section, or incomplete, as for
non-normal systems illustrated in Section 19.6.

19.5 Nonautonomous Systems - Forced Pendulum


Figure 19.12 Phase plane plot of a stable spiral or focus.

Figure 19.13 Plot of position versus time for a stable center.

6 This material is from Variational Methods with Applications in Science and Engineering, Chapter 6.

Let us make a small change to the simple pendulum considered in Section 19.3 by adding vertical forcing of the pivot point.6 Consider the motion of a pendulum
of mass m that is forced vertically according to $y_O = A\cos(\Omega t)$ as shown in Figure 19.15. The equation of motion for the forced pendulum is
\[
\ddot{\theta} + \left[\frac{g}{\ell} - \frac{A\Omega^2}{\ell}\cos(\Omega t)\right]\sin\theta = 0.
\]
Observe that this reduces to the equation of motion (19.8) for the simple pendulum when the amplitude of forcing vanishes ($A = 0$). For convenience, let us nondimensionalize according to the forcing frequency by setting $\tau = \Omega t$, in which case $d^2/dt^2 = \Omega^2\,d^2/d\tau^2$, and the equation of motion becomes
\[
\ddot{\theta} + (\delta + \epsilon\cos\tau)\sin\theta = 0, \qquad (19.12)
\]
where $\delta = g/(\ell\Omega^2)$ and $\epsilon = A/\ell$ (the sign of the forcing term is immaterial to the stability analysis, as it corresponds only to a phase shift of the forcing by half a period). Note that the natural frequency of the unforced pendulum is $\sqrt{g/\ell}$; therefore, $\delta$ is the square of the ratio of the natural frequency of the pendulum to the forcing frequency $\Omega$. Forcing at the natural frequency, $\Omega = \sqrt{g/\ell}$, corresponds to $\delta = 1$.

Figure 19.14 Phase plane plot of a stable center.

Figure 19.15 Schematic of the forced pendulum.
As for the simple pendulum, the equilibrium positions $\theta_s$, for which $\dot{q}_i = \ddot{q}_i = 0$, require that $\sin\theta_s = 0$; thus, $\theta_s = n\pi$, $n = 0, 1, 2, \ldots$. The cases for which n is even correspond to $\theta_s = 0$, in which case the pendulum is hanging vertically down. Likewise, the cases for which n is odd correspond to $\theta_s = \pi$, for which the pendulum is located vertically above the pivot point.
Now let us consider stability of the equilibrium position $\theta_s = 0$, which is stable for the simple pendulum ($A = 0$). Introducing small perturbations $u(\tau)$ about the equilibrium position,
\[
\theta(\tau) = \theta_s + u(\tau) = u(\tau),
\]
equation (19.12) leads to the perturbation equation
\[
\ddot{u} + (\delta + \epsilon\cos\tau)\,u = 0, \qquad (19.13)
\]
where again we linearize using the Taylor series $\sin u = u - u^3/3! + \cdots \approx u$. Equation (19.13) is known as Mathieu's equation. As with the simple pendulum, we can convert this to a system of first-order equations in matrix form using the transformation
\[
u_1(\tau) = u(\tau), \qquad u_2(\tau) = \dot{u}(\tau).
\]
Differentiating and substituting Mathieu's equation leads to the system
\[
\begin{bmatrix} \dot{u}_1 \\ \dot{u}_2 \end{bmatrix} =
\begin{bmatrix} 0 & 1 \\ -(\delta + \epsilon\cos\tau) & 0 \end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \end{bmatrix}. \qquad (19.14)
\]
Observe that the matrix $\mathbf{A}$ in (19.14) for Mathieu's equation is time-dependent owing to the forcing. Such systems are called nonautonomous, and we cannot evaluate stability by examining the eigenvalues. Instead, we directly solve Mathieu's equation numerically for the perturbations $u(\tau)$.
Whereas the simple pendulum is stable near the equilibrium position corresponding to $\theta_s = 0$, surprisingly, the forced pendulum allows for regions of instability for $\theta_s = 0$ and certain values of the parameters $\delta$ and $\epsilon$, which are related to the forcing frequency and amplitude, respectively. To illustrate this behavior, we will set the initial conditions to be $u(0) = 1$ and $\dot{u}(0) = 0$, and the amplitude parameter is set to $\epsilon = 0.01$. Figure 19.16 shows results obtained by setting the forcing frequency parameter to $\delta = 0.24$ and $\delta = 0.25$. For $\delta = 0.24$, the oscillation amplitude remains bounded and is stable. For just a small change in the forcing frequency, however, the case with $\delta = 0.25$ becomes unbounded and is unstable. The case with $\delta = 0.26$ (not shown) is stable and looks very similar to Figure 19.16(a) for $\delta = 0.24$. Recall that all of these cases correspond to the equilibrium state for which the pendulum is hanging down, which is stable when there is no forcing ($A = 0$). Thus, when the forcing is at a certain frequency, we can observe unstable behavior of the forced pendulum. In fact, there is a very narrow range of forcing frequencies for which the solution becomes unstable. Observe also that $\delta = 0.25$ does not correspond to forcing at the natural frequency of the unforced pendulum, which would be $\delta = 1$. Therefore, the instability is not a result of some resonance phenomenon, nor is it due to nonlinear effects, as Mathieu's equation is linear. Based on further analysis, it can be shown that there is an infinity of unstable tongues similar to the one near $\delta = 1/4$, located at $\delta = \tfrac{1}{4}n^2$, $n = 0, 1, 2, \ldots$.
Even more surprising than the fact that there are unstable regions near the $\theta_s = 0$ equilibrium point is that there is a small region for negative $\delta$, which corresponds to the $\theta_s = \pi$ equilibrium point, for which the system is stable. An example is shown in Figure 19.17. As the pendulum begins to fall off to one side or the other, which would be unstable without the forcing, the small-amplitude forcing acts to attenuate the pendulum's motion and force it back toward the equilibrium position, keeping the pendulum's motion stable. This unusual behavior is only observed for small magnitudes of negative $\delta$, that is, large forcing frequencies.

Figure 19.16 Oscillation amplitude $u(\tau)$ for a case with $\epsilon = 0.01$ and two forcing frequencies: (a) $\delta = 0.24$; (b) $\delta = 0.25$.
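The behavior just described can be reproduced numerically (a sketch using scipy.integrate.solve_ivp, with the parameter values quoted above) by integrating Mathieu's equation (19.13) for the two forcing-frequency parameters of Figure 19.16:

import numpy as np
from scipy.integrate import solve_ivp

def mathieu(tau, u, delta, eps):
    """Mathieu's equation (19.13) written as a first-order system, as in (19.14)."""
    return [u[1], -(delta + eps * np.cos(tau)) * u[0]]

for delta in (0.24, 0.25):
    sol = solve_ivp(mathieu, (0.0, 500.0), [1.0, 0.0], args=(delta, 0.01),
                    rtol=1e-8, atol=1e-8)
    print(delta, np.max(np.abs(sol.y[0])))   # bounded for 0.24, growing for 0.25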

19.6 Non-Normal Systems - Transient Growth


A conventional stability analysis considers each individual stability mode, that is,
eigenvalue, separately, with the overall stability being determined by the growth
rate of the most unstable (or least stable) mode. Therefore, it is called a modal

322

Discrete Dynamical Systems


u( )
1.0

0.5

100

200

300

400

500

0.5

1.0

Figure 19.17 Oscillation amplitude u( ) for case with = 0.1 and


= 0.001.

stability analysis, and the stability characteristics as t are determined from


each corresponding eigenvalue. This is known as asymptotic stability as it applies
for large times, and the eigenvalues provide the growth rate of the infinitesimal
perturbations. A modal analysis is sufficient for systems governed by normal
matrices (see Section 2.3), for which the eigenvectors are mutually orthogonal
and, therefore, do not interact with one another, allowing for each mode to be
considered individually. A normal matrix is one for which the coefficient matrix
commutes with its conjugate transpose according to
T

AA = A A.
A real, symmetric matrix, for example, is normal, but not all normal matrices
are symmetric.
For systems governed by non-normal matrices, the asymptotic stability behavior for large times is still determined by the eigenvalues of the systems coefficient
matrix. However, owing to exchange of energy between the nonorthogonal modes,
it is possible that the system will exhibit a different behavior for O(1) times as
compared to the asymptotic behavior observed for large times. This is known as
transient growth.7 For example, a system may be asymptotically stable, meaning
that none of the eigenvalues lead to a growth of their corresponding modes (eigenvectors) as t , but it displays a transient growth for finite times in which
the amplitude of a perturbation grows for some period of time before yielding to
the large-time asymptotic behavior.
Consider a second-order example given in Farrell and Ioannou (1996) in order
to illustrate the effect of non-normality of the system on the transient growth
behavior for finite times before the solution approaches its asymptotically stable
equilibrium point for large time. The system is governed by the following first-order ordinary differential equations

[ u̇1 ]   [ −1   cot φ ] [ u1 ]
[ u̇2 ] = [  0    −2   ] [ u2 ] .        (19.15)
The coefficient matrix A is normal if φ = π/2 and non-normal for all other φ.
The system has an equilibrium point at (u1, u2) = (0, 0).
Let us first consider the behavior of the normal system with φ = π/2. In this case, the eigenvalues are λ1 = −2 and λ2 = −1. Because the eigenvalues are negative and real, the system is asymptotically stable in the form of a stable node. The corresponding eigenvectors are given by

v1 = [ 0 ] ,   v2 = [ 1 ] ,
     [ 1 ]          [ 0 ]

which are mutually orthogonal as expected for a normal system. The solution for the system (19.15) with φ = π/2 is shown in Figures 19.18 and 19.19 for the initial conditions u1(0) = 1 and u2(0) = 1. For such a stable normal system, the perturbed solution decays exponentially toward the stable equilibrium point (u1, u2) = (0, 0) according to the rates given by the eigenvalues. In phase space, the trajectory starts at the initial condition (u1(0), u2(0)) = (1, 1) and moves progressively toward the equilibrium point (0, 0).

Figure 19.18 Solution of the normal system with φ = π/2 for u1(t) (solid line) and u2(t) (dashed line).

Figure 19.19 Solution of the normal system with φ = π/2 for u1(t) and u2(t) in phase space (the dot indicates the initial condition).
Now consider the system (19.15) with φ = π/100, for which the system is no longer normal. While the eigenvalues λ1 = −2 and λ2 = −1 remain the same, the corresponding eigenvectors are now

v1 = [ −cot(π/100) ] ,   v2 = [ 1 ] ,
     [      1      ]          [ 0 ]


which are no longer orthogonal, giving rise to the potential for transient growth behavior. The solution for the system (19.15) with φ = π/100 is shown in Figures 19.20 and 19.21 for the same initial conditions u1(0) = 1 and u2(0) = 1 as before. Recall that the system is asymptotically stable as predicted by its eigenvalues, and indeed the perturbations do decay toward zero for large times. Whereas u2(t) again decays monotonically⁸ toward the equilibrium point, however, u1(t) first grows to become quite large before decaying toward the stable equilibrium point. In phase space, the trajectory starts at the initial condition (u1(0), u2(0)) = (1, 1) but moves farther away from the equilibrium point at the origin before being attracted to it, essentially succumbing to the asymptotically stable large-t behavior.

Figure 19.20 Solution of the non-normal system with φ = π/100 for u1(t) (solid line) and u2(t) (dashed line).

Figure 19.21 Solution of the non-normal system with φ = π/100 for u1(t) and u2(t) in phase space (the dot indicates the initial condition).

⁸ If a function decays monotonically, it never increases over any interval.

The transient growth behavior can occur for non-normal systems because the nonorthogonal eigenvectors lead to the possibility that the individual modes can exchange energy, leading to transient growth. In other words, even though each eigenvalue leads to exponential decay of its respective mode, the corresponding nonorthogonal eigenvectors may interact to induce a temporary growth of the solution owing to the different rates of decay of each mode.
Note that for a nonlinear system, the linearized behavior represented by the perturbation equations only governs in the vicinity of the equilibrium point. Therefore, transient growth could lead to the solution moving too far from the equilibrium point for the linearized behavior to remain valid, thereby bringing in nonlinear effects and altering the stability properties before the eventual decay to the asymptotically stable state can be realized.
Not all non-normal systems exhibit such transient growth behavior. For example, try the system (19.15) with φ = π/4. Also, observe that the coefficient matrices are non-normal for both the simple and forced pendulum examples considered

previously; however, neither exhibits transient growth behavior. For non-normal systems that do exhibit transient growth, the question becomes: What is the initial perturbation that leads to the maximum possible transient growth in the system within finite time before succumbing to the asymptotic stability behavior as t → ∞?
The initial perturbation that experiences maximum growth over a specified time range is called the optimal perturbation and is found using variational methods. Specifically, we seek to determine the initial perturbation u(0) for the system

u̇ = Au,        (19.16)

for which the energy growth over some time interval 0 ≤ t ≤ tf is a maximum. We define the energy of a disturbance u(t) at time t using the norm operator as follows:

E(t) = ‖u(t)‖² = ⟨u(t), u(t)⟩,

where ⟨·, ·⟩ is the inner product. We seek the initial perturbation u(0) that produces the greatest relative energy gain during the time interval 0 ≤ t ≤ tf defined by

G(tf) = max over all u(0) of E[u(tf)]/E[u(0)].

In other words, we seek the initial disturbance that maximizes the gain functional

G[u] = ‖u(tf)‖²/‖u(0)‖² = ⟨u(tf), u(tf)⟩/⟨u(0), u(0)⟩ = uᵀ(tf)u(tf)/uᵀ(0)u(0),        (19.17)

subject to the differential constraint that the governing equation u̇ = Au is satisfied for all time in the range 0 ≤ t ≤ tf.
As shown in Section 6.5 of Variational Methods with Applications in Science and Engineering by Cassel (2013), the adjoint equation for the adjoint variables ψ(t) is

ψ̇ = −Aᵀψ,        (19.18)

and the initial conditions are

u(0) = [uᵀ(0)u(0)]²/[2uᵀ(tf)u(tf)] ψ(0),        (19.19)

ψ(tf) = 2/[uᵀ(0)u(0)] u(tf).        (19.20)

The differential operator in the adjoint equation (19.18) is the adjoint of that in the original governing equation u̇ = Au. Note that the negative sign requires equation (19.18) to be integrated backward in time from t = tf to t = 0, which is consistent with the initial condition (19.20).
In order to determine the optimal initial perturbation that maximizes the gain for a given terminal time tf, we solve the governing equation (19.16) forward in time using the initial condition obtained from equation (19.19). The solution at the terminal time u(tf) is then used in equation (19.20) to determine the initial condition ψ(tf) for the backward-time integration of the adjoint equation (19.18). The resulting solution for the adjoint variable ψ(t) provides ψ(0) for use in equation (19.19), which is used to obtain the initial condition for the governing equation (19.16). This procedure is repeated iteratively until a converged solution for u(t) and ψ(t) is obtained, from which the optimal initial perturbation u(0) is determined.
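A minimal sketch of this direct-adjoint iteration for the 2×2 system (19.15) is given below; because the system is linear, the forward and backward solves are written with the matrix exponential, and the renormalization of u(0) at each pass is a convenience of this sketch rather than a step prescribed in the text.

# Illustrative sketch of the direct-adjoint iteration for the optimal initial
# perturbation of u' = A u (equation 19.16). Phi is the forward propagator,
# so u(tf) = Phi u(0), and the adjoint solution satisfies psi(0) = Phi^T psi(tf).
import numpy as np
from scipy.linalg import expm

phi = np.pi/100
A = np.array([[-1.0, 1.0/np.tan(phi)], [0.0, -2.0]])
tf = 0.69
Phi = expm(A*tf)

u0 = np.array([1.0, 1.0])                      # initial guess for u(0)
for _ in range(50):
    utf = Phi @ u0                             # forward solve of (19.16)
    psitf = 2.0*utf/(u0 @ u0)                  # terminal condition (19.20)
    psi0 = Phi.T @ psitf                       # backward solve of adjoint (19.18)
    u0 = (u0 @ u0)**2/(2.0*(utf @ utf))*psi0   # condition (19.19)
    u0 *= np.sqrt(2.0)/np.linalg.norm(u0)      # keep ||u(0)||^2 = 2 (a choice)

gain = ((Phi @ u0) @ (Phi @ u0))/(u0 @ u0)
print("optimal u(0):", u0, " gain G(tf):", gain)   # roughly 63.6 for tf = 0.69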
Let us return to the system governed by equation (19.15). Recall that the system is non-normal for φ = π/100 and leads to transient growth for finite times before succumbing to the asymptotically stable behavior for large times predicted by a modal analysis. After carrying out a series of optimal perturbation calculations as described above for a range of terminal times tf, we can calculate the gain G(tf) as defined by equation (19.17) and shown in Figure 19.22 for a range of terminal times. The optimal initial perturbation u(0) that produces these maximum gains for each terminal time tf is plotted in Figure 19.23. The maximum gain is found to be G = 63.6 for a terminal time of tf = 0.69, which means that the transient growth can result in a perturbation energy growing to nearly 64 times its initial energy at t = 0. The initial perturbation that produces this maximum gain is (u1(0), u2(0)) = (0.134, 2.134). As shown by the plot of gain G(tf), the gain eventually decays to zero as the terminal time increases, consistent with the fact that the system is asymptotically stable for large times.

Figure 19.22 Gain G(tf) for a range of terminal times tf for the non-normal system with φ = π/100.

Figure 19.23 Initial perturbations u1(0) (solid line) and u2(0) (dashed line) that produce the maximum gain for a range of terminal times tf for the non-normal system with φ = π/100.
For the linear stability analysis of small perturbations about stationary equilibrium points considered here, the transient growth analysis for non-normal systems
can be reduced to performing a singular-value decomposition (SVD) to obtain the
optimal disturbance. The primary virtue of the variational approach presented
here is that it can be extended to evaluate stability of nonlinear state equations
and/or stability of perturbations about time-evolving (unsteady) solutions. The
nonlinear case corresponds to analysis of nonlinear systems subject to perturbations that are not infinitesimally small. Stability of non-normal, continuous fluid
dynamical systems will be encountered in Section 20.6.
19.7 Nonlinear Systems


20
Continuous Systems

Recall that for discrete systems there are a finite number of degrees of freedom, and the equations of motion are systems of ordinary differential equations in time. Furthermore, stability is governed by algebraic eigenproblems. For continuous systems, in which the mass is distributed throughout the system, there are an infinite number of degrees of freedom, and the equations of motion are partial differential equations in space and time. Stability is governed by differential eigenproblems.
We begin our discussion of continuous systems by considering the wave equation, which is the canonical hyperbolic partial differential equation (see Section 13.2). Using the method of separation of variables described in Chapter 3, we obtain solutions in terms of eigenfunction expansions for the one-dimensional and two-dimensional wave equations.
20.1 Wave Equation
Consider a general partial differential equation, for example the wave equation, unsteady beam equation, etc., of the form

Mü + Ku = 0,        (20.1)

where M and K are linear differential operators in space, dots denote differentiation in time, and u(x, y, z, t) is the unsteady displacement of the string, membrane, beam, etc.
In order to convert the partial differential equation (20.1) into ordinary differential equations, we use the method of separation of variables. We write u(x, y, z, t) as the product of two functions, one that accounts for the spatial dependence and one for the temporal dependence, as follows

u(x, y, z, t) = φ(x, y, z)η(t).

Substituting into the governing equation (20.1) gives

Mφη̈ + Kφη = 0,

or separating variables leads to

Kφ(x, y, z)/[Mφ(x, y, z)] = −η̈(t)/η(t) = λ.        (20.2)

Figure 20.1 Vibration of an elastic string.

Figure 20.2 Vibration of a longitudinal rod.

Because the left-hand side is a function of x, y, and z only, and the right-hand side of t only, the equation must be equal to a constant, say λ. Then

Kφ(x, y, z) = λMφ(x, y, z),        (20.3)

η̈(t) = −λη(t),        (20.4)

where (20.3), which is a partial differential equation, is an eigenproblem for the spatial vibrational displacement modes φ(x, y, z) of the system, and equation (20.4) is an ordinary differential equation in time.
For a continuous system, there are an infinity of eigenvalues λ representing the natural frequencies of the system and corresponding eigenfunctions φ(x, y, z) representing the vibrational mode shapes. The general, time-dependent motion of the system is a superposition of these eigenmodes, which is determined from a solution of the ordinary differential equation (20.4) in time (with λ being the eigenvalues of (20.3)).
Example 20.1  Consider the one-dimensional wave equation

∂²u/∂t² = c² ∂²u/∂x²,        (20.5)

where c is the wave speed in the material, and u(x, t) is the displacement. The wave equation¹ governs, for example:
1. Lateral vibration of a string, with c² = P/ρ, where ρ is the mass per unit length of the string, as shown in Figure 20.1.
2. Longitudinal vibration of a rod, with c² = E/ρ, where E is Young's modulus of the rod, and ρ is the mass per unit volume of the rod, as shown in Figure 20.2.
Solve the wave equation using an eigenfunction expansion.

¹ See Variational Methods with Applications in Science and Engineering for a derivation.


Solution: Writing the wave equation (20.5) in the form of equation (20.1), we have the spatial differential operators

M = 1/c²,   K = −∂²/∂x².

From equations (20.3) and (20.4), and letting λ = ω² > 0, we have

d²φ/dx² + (ω²/c²)φ = 0,

d²η/dt² + ω²η = 0,

which are two ordinary differential equations in x and t, respectively, from one partial differential equation. The solutions to these equations are (see the first example in Section 3.2 with the constant equal to ω/c and ω, respectively)

φ(x) = c1 cos(ωx/c) + c2 sin(ωx/c),        (20.6)

η(t) = c3 cos(ωt) + c4 sin(ωt).        (20.7)

Recall that (20.6) gives the spatial eigenfunctions φ(x), that is, the vibrational modes, and ω²/c² are the eigenvalues.
To obtain the four constants, we need boundary and initial conditions. For a vibrating string, for example, consider the homogeneous boundary conditions

u(0, t) = 0,   u(ℓ, t) = 0,

corresponding to zero displacement at both ends. Noting that the boundary conditions in x on u(x, t) = φ(x)η(t) must be satisfied by φ(x), we require

u(0, t) = 0   ⇒   φ(0) = 0   ⇒   c1 = 0.

Similarly, from

u(ℓ, t) = 0   ⇒   φ(ℓ) = 0,

we have

sin(ωℓ/c) = 0,        (20.8)

which is the characteristic, or frequency, equation. The eigenvalues are such that

ωn ℓ/c = nπ,   n = 1, 2, . . . ;

therefore,

ωn = nπc/ℓ,   n = 1, 2, . . . ,        (20.9)

where ωn are the natural frequencies (rad/s). The eigenfunctions are

φn(x) = c2 sin(nπx/ℓ),   n = 1, 2, . . . ,        (20.10)


where c2 is arbitrary and may be taken as c2 = 1. Thus, the eigenfunctions, that is, vibration mode shapes, form a Fourier sine series. Note that because it is a continuous system, the string or rod has infinite degrees of freedom, that is, infinite natural frequencies (cf. the spring-mass example).
The general, time-dependent motion of the string is a superposition of the infinite eigenfunctions, that is, natural modes of vibration. The actual motion is determined from the temporal solution ηn(t) and initial conditions, for example

Initial displacement:   u(x, 0) = f(x),
Initial velocity:   ∂u/∂t (x, 0) = g(x).

Note that two initial conditions are required as the equation is second order in time. From equation (20.7), the solution is

ηn(t) = c3 cos(ωn t) + c4 sin(ωn t),        (20.11)

where ωn = nπc/ℓ, and differentiating in time gives

η̇n(t) = −c3 ωn sin(ωn t) + c4 ωn cos(ωn t).        (20.12)

From the initial conditions

u(x, 0) = f(x)   ⇒   Σₙ φn(x) ηn(0) = f(x),
                                                        (20.13)
∂u/∂t (x, 0) = g(x)   ⇒   Σₙ φn(x) η̇n(0) = g(x).

But from equations (20.11) and (20.12)

ηn(0) = c3,   η̇n(0) = c4 ωn;

therefore, substituting into (20.11) leads to

ηn(t) = ηn(0) cos(ωn t) + (1/ωn) η̇n(0) sin(ωn t).        (20.14)

Taking the inner product of φm(x) with both sides of the first equation in (20.13) gives

‖φn(x)‖² ηn(0) = ⟨f(x), φn(x)⟩,

where all terms are zero due to orthogonality except when m = n. Then

ηn(0) = ⟨f(x), φn(x)⟩/‖φn(x)‖² = (2/ℓ) ∫₀^ℓ f(x) sin(nπx/ℓ) dx.

Similarly, from the second equation in (20.13)

‖φn(x)‖² η̇n(0) = ⟨g(x), φn(x)⟩,

η̇n(0) = ⟨g(x), φn(x)⟩/‖φn(x)‖² = (2/ℓ) ∫₀^ℓ g(x) sin(nπx/ℓ) dx.

Figure 20.3 Solution of the wave equation with ℓ = 1, a = 1, b = 100.

Both ηn(0) and η̇n(0) are Fourier sine coefficients of f(x) and g(x), respectively. Finally, the solution is

u(x, t) = Σₙ₌₁^∞ φn(x) ηn(t) = Σₙ₌₁^∞ sin(nπx/ℓ) [ηn(0) cos(ωn t) + (1/ωn) η̇n(0) sin(ωn t)],        (20.15)

where from (20.9)

ωn = nπc/ℓ,   n = 1, 2, . . . .

For example, consider the initial conditions

u(x, 0) = f(x) = a e^(−b(x−1/2)²),   ∂u/∂t (x, 0) = g(x) = 0,

which corresponds to a Gaussian displacement distribution centered at x = 1/2 at t = 0. Then we obtain

ηn(0) = ⟨f(x), φn(x)⟩/‖φn(x)‖² = (2/ℓ) ∫₀^ℓ a e^(−b(x−1/2)²) sin(nπx/ℓ) dx = . . . ,

η̇n(0) = ⟨g(x), φn(x)⟩/‖φn(x)‖² = 0.

Plotting the solution (20.15) for u(x, t) with ℓ = 1, a = 1, b = 100 gives the solution shown in Figure 20.3. Observe that the initial Gaussian distribution at t = 0 leads to the propagation of right- and left-moving waves that reflect off the boundaries at x = 0 and x = ℓ = 1.
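The following is a minimal sketch (not the book's plotting notebook) of how the expansion (20.15) can be evaluated for this example; the number of retained modes and the use of numerical quadrature for the sine coefficients are choices of this sketch.

# Illustrative sketch: evaluate the eigenfunction-expansion solution (20.15)
# for f(x) = a*exp(-b*(x - 1/2)^2), g(x) = 0, with l = 1, c = 1, a = 1, b = 100.
import numpy as np
from scipy.integrate import quad

l, c, a, b, N = 1.0, 1.0, 1.0, 100.0, 50
f = lambda x: a*np.exp(-b*(x - 0.5)**2)

# Fourier sine coefficients eta_n(0) = (2/l) * int_0^l f(x) sin(n*pi*x/l) dx
eta0 = np.array([2.0/l*quad(lambda x, n=n: f(x)*np.sin(n*np.pi*x/l), 0, l)[0]
                 for n in range(1, N + 1)])

def u(x, t):
    n = np.arange(1, N + 1)
    wn = n*np.pi*c/l                      # natural frequencies (20.9)
    # the eta_n'(0) terms vanish here since g(x) = 0
    return np.sum(np.sin(np.outer(x, n)*np.pi/l)*eta0*np.cos(wn*t), axis=1)

x = np.linspace(0.0, l, 101)
print(u(x, 0.25))   # two counter-propagating pulses at t = 0.25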


Thus far our examples have involved Fourier series for the eigenfunctions. In
the following example we encounter Bessel functions as the eigenfunctions.
Example 20.2 As an additional application to vibrations problems, consider
the vibration of a circular membrane of radius one.
Solution: The governing equation for the lateral displacement u(r, θ, t) is the two-dimensional wave equation in cylindrical coordinates

∂²u/∂t² = c² (∂²u/∂r² + (1/r) ∂u/∂r + (1/r²) ∂²u/∂θ²) = c²∇²u,        (20.16)

where in equation (20.1)

M = 1/c²,   K = −(∂²/∂r² + (1/r) ∂/∂r + (1/r²) ∂²/∂θ²).

The boundary condition is

u(1, θ, t) = 0,

that is, the membrane is clamped along its perimeter. Upon separating the spatial and temporal variables according to u(r, θ, t) = φ(r, θ)η(t), equation (20.3) for the vibrational modes becomes

∂²φ/∂r² + (1/r) ∂φ/∂r + (1/r²) ∂²φ/∂θ² + (ω²/c²)φ = 0,        (20.17)

which is known as the Helmholtz partial differential equation, ∇²φ + (ω²/c²)φ = 0.


Because equation (20.17) is a partial differential equation, let us again separate variables according to

φ(r, θ) = G(r)H(θ),

which leads to

(r²/G) [d²G/dr² + (1/r) dG/dr + (ω²/c²)G] = −(1/H) d²H/dθ² = μ,        (20.18)

where μ is the separation constant. Then we have the two ordinary differential eigenproblems

r² d²G/dr² + r dG/dr + (ω²r²/c² − μ)G = 0,        (20.19)

d²H/dθ² + μH = 0.        (20.20)

The general solution of equation (20.20) (with μ > 0) is of the form

H(θ) = c1 cos(√μ θ) + c2 sin(√μ θ).        (20.21)


To be single-valued in the circular domain, however, the solution must be 2π-periodic in θ, requiring that

φ(r, θ + 2π) = φ(r, θ).

For the cosine term to be 2π-periodic,

cos[√μ (θ + 2π)] = cos(√μ θ),

or

cos(√μ θ) cos(2π√μ) − sin(√μ θ) sin(2π√μ) = cos(√μ θ).

Matching coefficients of the cos(√μ θ) and sin(√μ θ) terms requires that

cos(2π√μ) = 1   ⇒   √μ = 0, 1, 2, 3, . . . ,

and

sin(2π√μ) = 0   ⇒   √μ = 0, 1/2, 1, 3/2, 2, . . . .

Because both conditions must be satisfied, 2π-periodicity of the cos(√μ θ) term requires that the eigenvalues be

√μ = n = 0, 1, 2, 3, . . . .        (20.22)

Consideration of 2π-periodicity of the sin(√μ θ) term in (20.21) produces the same requirement. Therefore, equation (20.21) becomes

Hn(θ) = c1 cos(nθ) + c2 sin(nθ),   n = 0, 1, 2, 3, . . . .        (20.23)

Equation (20.19) may be written in the form

d/dr (r dGn/dr) + (−n²/r + ω²r/c²) Gn = 0,        (20.24)

which is Bessel's equation. As stated in Section 3.4.2, we expect to arrive at the Bessel equation when solving an equation with the Laplacian operator ∇² in cylindrical coordinates, as is the case here. The general solution is

Gn(r) = c3 Jn(ωr/c) + c4 Yn(ωr/c),        (20.25)

where Jn and Yn are Bessel functions of the first and second kind, respectively. Recall that Yn are unbounded at r = 0; therefore, we must set c4 = 0. Taking c3 = 1, we have

φn(r, θ) = Jn(ωr/c) [c1 cos(nθ) + c2 sin(nθ)],   n = 0, 1, 2, . . . ,

where in order to satisfy the boundary condition at r = 1, we have the characteristic equation Jn(ω/c) = 0. That is, λm,n is chosen such that √λm,n/c are zeros of the Bessel functions Jn (there are an infinity of zeros, m = 1, 2, 3, . . ., for each Bessel function Jn). For example, four modes are shown in Figure 20.4.

Figure 20.4 Various modes in the solution of the wave equation in cylindrical coordinates: (a) n = 0, m = 1 and n = 0, m = 2; (b) n = 1, m = 1 and n = 1, m = 2.

Observe that n + 1 corresponds to the number of lobes in the azimuthal θ-direction, and m corresponds to the number of regions having positive or negative sign in the radial r-direction. As in the previous example, the solutions for η(t) along with initial conditions would determine how the spatial modes evolve in time to form the full solution

u(r, θ, t) = Σₙ₌₀^∞ φn(r, θ) ηn(t).
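As a sketch of how these frequencies can be computed in practice, the zeros of the Bessel functions can be obtained numerically; the wave speed c = 1 and the Python/SciPy routines used below are choices of this illustration, not part of the example.

# Illustrative sketch: the natural frequencies of the unit circular membrane
# follow from the characteristic equation J_n(omega/c) = 0, so omega_{m,n}/c
# is the m-th zero of J_n.
import numpy as np
from scipy.special import jn_zeros, jv

c = 1.0
for n in range(3):                       # azimuthal index n = 0, 1, 2
    zeros = jn_zeros(n, 3)               # first three zeros of J_n
    print(f"n = {n}: omega_(m,n) =", c*zeros)

# Samples of one mode shape, phi_(m,n)(r, theta) = J_n(omega r/c) cos(n theta):
n, m = 1, 2
omega = c*jn_zeros(n, m)[-1]
r, theta = np.linspace(0, 1, 5), np.pi/4
print(jv(n, omega*r/c)*np.cos(n*theta))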

Remarks:
1. Whereas the eigenvalues and eigenfunctions have physical meaning in the vibrations context, where they are the natural frequencies and modes of vibration, respectively, in other contexts, such as heat conduction, electromagnetics, etc., they do not and are merely a mathematical device by which to obtain the solution via an eigenfunction expansion.

20.2 Electromagnetics
20.3 Schrödinger Equation
20.4 Stability of Continuous Systems Beam-Column Buckling
Recall from Section 2.6.3 that stability of discrete systems, with n degrees of freedom, results in an algebraic eigenproblem of the form

Ax = λx.

In contrast, stability of continuous systems, with infinite degrees of freedom, results in a differential eigenproblem of the form

Lu = λu.

For example, let us consider the beam-column buckling problem.

Example 20.3  The buckling equation governing the lateral stability of a beam-column under axial load is the fourth-order differential equation²

d²/dx² (EI d²u/dx²) + P d²u/dx² = 0,   0 ≤ x ≤ ℓ.

Here, x is the axial coordinate along the column, u(x) is the lateral deflection of the column, and P is the axial compressive force applied to the ends of the column as shown in Figure 20.5. The properties of the column are the Young's modulus E and the moment of inertia I of the cross-section. The product EI represents the stiffness of the column.

Figure 20.5 Schematic of the column-buckling problem.

If the end at x = 0 is hinged, in which case it cannot sustain a moment, and the end at x = ℓ = 1 is fixed, then the boundary conditions are

u(0) = 0,   u″(0) = 0,   u(1) = 0,   u′(1) = 0.

² See Variational Methods with Applications in Science and Engineering, Section 6.6, for a derivation.


Determine the buckling mode shapes and the corresponding critical buckling loads.
Solution: For a constant cross-section column with uniform properties, E and I are constant; therefore, we may define

λ = P/(EI),

and write the equation in the form

d⁴u/dx⁴ + λ d²u/dx² = 0,        (20.26)

which may be regarded as a generalized eigenproblem of the form L1 u = λL2 u.³ Note that this may more readily be recognized as an eigenproblem of the usual form if the substitution w = d²u/dx² is made, resulting in w″ + λw = 0; however, we will consider the fourth-order equation for the purposes of this example. Trivial solutions with u(x) = 0, corresponding to no lateral deflection, will occur for most λ = P/(EI). However, certain values of the parameter λ will produce non-trivial solutions corresponding to buckling; these values of λ are the eigenvalues.
The fourth-order linear ordinary differential equation with constant coefficients (20.26) has solutions of the form u(x) = e^(rx). Letting λ = +α² > 0 (λ = 0 and λ < 0 produce only trivial solutions for the given boundary conditions⁴), then r must satisfy

r⁴ + α²r² = 0,   r²(r² + α²) = 0,   r²(r + iα)(r − iα) = 0,   r = 0, 0, ±iα.

Taking into account that r = 0 is a double root, the general solution to (20.26) is

u(x) = c1 e^(0x) + c2 x e^(0x) + c3 e^(iαx) + c4 e^(−iαx),

or

u(x) = c1 + c2 x + c3 sin(αx) + c4 cos(αx).        (20.27)

Applying the boundary conditions at x = 0, we find that

u(0) = 0   ⇒   0 = c1 + c4   ⇒   c1 = −c4,

u″(0) = 0   ⇒   0 = −α²c4   ⇒   c4 = 0.

Thus, c1 = c4 = 0. Applying the boundary conditions at x = 1 leads to

u(1) = 0   ⇒   0 = c2 + c3 sin α,
                                                        (20.28)
u′(1) = 0   ⇒   0 = c2 + c3 α cos α.

³ Recall that the generalized algebraic eigenproblem is given by Ax = λBx.
⁴ This makes sense physically because λ = P/(EI) = 0 means there is no load, and P < 0 means there is a tensile load, in which case there is no buckling.


Figure 20.6 Plots of tan α and α.

In order to have a nontrivial solution for c2 and c3, we must have

| 1    sin α   |
| 1   α cos α | = 0,

which gives the characteristic equation

tan α = α.

Plotting tan α and α in Figure 20.6, the points of intersection are the roots. The roots of the characteristic equation are (α0 = 0 gives the trivial solution)

α1 = 1.43π,   α2 = 2.46π,   α3 = 3.47π, . . . ,

which have been obtained numerically. Thus, with λn = αn², the eigenvalues are

λ1 = 2.05π²,   λ2 = 6.05π²,   λ3 = 12.05π², . . . .
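A minimal sketch of this numerical root finding is given below; the bracketing intervals just inside each branch of tan α are a choice of this sketch.

# Illustrative sketch: find the nonzero roots of tan(alpha) = alpha and form
# the eigenvalues lambda_n = alpha_n^2 (critical loads are P_n = lambda_n*E*I).
import numpy as np
from scipy.optimize import brentq

g = lambda alpha: np.tan(alpha) - alpha
roots = [brentq(g, (2*k + 1)*np.pi/2 + 1e-6, (2*k + 3)*np.pi/2 - 1e-6)
         for k in range(3)]
for alpha in roots:
    print(f"alpha = {alpha/np.pi:.2f}*pi, lambda = {alpha**2/np.pi**2:.2f}*pi^2")
# alpha_n ~ 1.43*pi, 2.46*pi, 3.47*pi and lambda_n ~ 2.05*pi^2, 6.05*pi^2, 12.05*pi^2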
From the first relationship in (20.28), we see that

c3 = −c2/sin αn;

therefore, from the solution (20.27), the eigenfunctions are

un(x) = c2 [x − sin(αn x)/sin αn],   n = 1, 2, 3, . . . .        (20.29)

As before, c2 is an arbitrary constant that may be chosen such that the eigenfunctions are normalized if desired.
Any differential equation with the same differential operator and boundary conditions as in equation (20.26), including nonhomogeneous forms, can be solved
by expanding the solution (and the right-hand side) in terms of the eigenfunctions (20.29). A nonhomogeneous equation in this context would correspond to
application of lateral loads on the column.
For the homogeneous case considered here, however, solutions for different values of n correspond to various possible modes of the solution for the lateral deflection. In particular, recall that P = λEI; therefore, with respect to the eigenvalues, we have

Pn = λn EI,   n = 1, 2, 3, . . . ,

which are called the critical buckling loads of the column. So for P < P1, the only solution is the trivial solution, and the column does not deflect laterally. When

P = P1 = 2.05π²EI,

which is called the Euler buckling load, the column deflects laterally according to the mode shape

u1(x) = x − sin(1.43πx)/sin(1.43π),

corresponding to a single bulge in the column as shown in Figure 20.7. Recall that the column is hinged at x = 0 and fixed at x = 1.

Figure 20.7 Mode shape u1(x) for the Euler buckling load.

Remarks:
1. Achieving the higher critical loads and corresponding deflection modes would require appropriately placed constraints that restrict the lateral movement of the column and suppression of the lower modes.
2. Observe the correspondence between the mathematical and physical aspects of the column buckling problem shown in Table 20.1.

Table 20.1 Corresponding mathematical and physical aspects of the column-buckling problem.

    Mathematical                   Physical
    Eigenproblem                   Governing (buckling) equation
    Eigenvalues                    Critical buckling loads
    Trivial solution               No buckling (stable)
    Eigenfunction (non-trivial)    Buckling mode shapes
    Non-homogeneous equation       Lateral loads


20.5 Numerical Solution of the Differential Eigenproblem


When stability of systems governed by partial differential equations is considered, it is typically necessary to solve the resulting differential eigenproblem numerically for the eigenvalues and eigenfunctions. In order to illustrate the process for numerically determining the eigenvalues and eigenvectors/functions in such situations, let us explore a scenario in which we know the exact eigenvalues and eigenfunctions. Recall that in Chapter 3 we obtained the eigenvalues and eigenfunctions for the differential operator in the one-dimensional, unsteady diffusion equation

∂u/∂t = α ∂²u/∂x²,   0 ≤ x ≤ ℓ,        (20.30)

with the boundary conditions

u(0, t) = 0,   u(ℓ, t) = 0,        (20.31)

and the initial condition

u(x, 0) = f(x).        (20.32)

20.5.1 Exact Solution

In Section 3.4.2, the method of separation of variables, with u(x, t) = φ(x)η(t), was used to obtain the eigenvalues and eigenfunctions of the differential operator. The differential eigenproblem that results for the spatial eigenfunctions φ(x) is

d²φ/dx² + λφ = 0.        (20.33)

The eigenvalues are

λn = n²π²/ℓ²,   n = 1, 2, 3, . . . ,

and the spatial eigenfunctions are

φn(x) = sin(nπx/ℓ),   n = 1, 2, 3, . . . .        (20.34)

20.5.2 Numerical Solution

For comparison, and to show how we obtain eigenvalues and eigenvectors numerically, let us consider solving the differential eigenproblem (20.33) with Dirichlet boundary conditions (20.31) numerically. Using a central-difference approximation for the second-order derivative, the differential equation becomes

(φᵢ₊₁ − 2φᵢ + φᵢ₋₁)/Δx² = −λφᵢ,   i = 2, 3, . . . , I,

where the eigenvalues λ are to be determined. Multiplying by −Δx², and applying the difference equation at each of the interior points in the domain leads to the
342

Continuous Systems

tridiagonal system of equations (φ1 = 0, φI+1 = 0)

    [  2  −1   0  ···   0   0 ] [ φ2   ]           [ φ2   ]
    [ −1   2  −1  ···   0   0 ] [ φ3   ]           [ φ3   ]
    [  0  −1   2  ···   0   0 ] [ φ4   ]  = λΔx²  [ φ4   ] ,
    [  ·   ·   ·  ···   ·   · ] [  ·   ]           [  ·   ]
    [  0   0   0  ···   2  −1 ] [ φI−1 ]           [ φI−1 ]
    [  0   0   0  ···  −1   2 ] [ φI   ]           [ φI   ]

or

Aφ = λ̃φ.        (20.35)

This is an algebraic eigenproblem for the eigenvalues λ̃ = λΔx² and the eigenvectors φ, which are discrete approximations of the continuous eigenfunctions (20.34). Hence, we see that for a continuous (exact) problem, we have a differential eigenproblem, and for a discrete (numerical) problem, we have an algebraic eigenproblem.
Because A is tridiagonal with constants along each diagonal, we have a closed
form expression for the eigenvalues (see Section 11.4.1). In general, of course,
this is not the case, and the eigenvalues (and eigenvectors) must be determined
numerically for a large matrix A. See the Mathematica notebook 1Ddiff.nb for
a comparison of the eigenvalues obtained numerically as described here versus
the exact eigenvalues obtained using the method of separation of variables.
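The following Python sketch mirrors the comparison described above (the book's own comparison is in the Mathematica notebook 1Ddiff.nb); the grid size is an arbitrary choice of this sketch.

# Illustrative sketch: assemble the tridiagonal matrix in (20.35) and compare
# its eigenvalues lambda = lambda_tilde/dx^2 with the exact values n^2*pi^2/l^2.
import numpy as np

l, I = 1.0, 50                            # domain length and number of intervals
dx = l/I
A = 2*np.eye(I - 1) - np.eye(I - 1, k=1) - np.eye(I - 1, k=-1)

lam_tilde = np.sort(np.linalg.eigvalsh(A))   # eigenvalues of A (= lambda*dx^2)
lam_num = lam_tilde/dx**2
n = np.arange(1, I)
lam_exact = n**2*np.pi**2/l**2
print(lam_num[:4])        # close to pi^2, 4 pi^2, 9 pi^2, 16 pi^2 for the lowest modes
print(lam_exact[:4])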
The standard method for numerically determining the eigenvalues and eigenvectors of a large, dense matrix is based on QR decomposition, which entails
performing a series of similarity transformations. This is the approach used by
the built-in Mathematica and Matlab functions Eigenvalues[]/ Eigenvectors[]
and eig(), respectively. See Chapter 6 for more detail on the QR method.

20.6 Hydrodynamic Stability


One of the more fascinating stability problems governed by partial differential
equations is that of fluid flows, often referred to as hydrodynamic stability.5 This
brief treatment will not do justice to the myriad of topics and physical phenomena
that can arise in hydrodynamic stability; it is simply intended to provide a context
in which stability of partial differential equations and the resulting differential
eigenproblem can be derived and discussed.
⁵ This is a somewhat curious term to use for this subject as "hydro" normally refers to water specifically, whereas fluid dynamics and hydrodynamic stability theory apply universally to all fluids and gases that behave as a continuum.


20.6.1 Linearized Navier-Stokes Equations


Consider the nondimensional Navier-Stokes equations for two-dimensional, incompressible flow
u v
+
= 0,
x y

(20.36)

u
u
p
1
u
+u
+v
=
+
t
x
y
x Re

v
v
v
p
1
+u
+v
=
+
t
x
y
y Re


2u 2u
+
,
x2
y 2

(20.37)

2v
2v
+
x2 y 2

(20.38)

Here, x and y are the spatial Cartesian coordinates, and t is time. The velocity
components in the x and y directions are u(x, y, t) and v(x, y, t), respectively,
and p(x, y, t) is the pressure. The Reynolds number, Re, is a non-dimensional
number that characterizes the relative importance of viscous diffusion and convection. Equation (20.36) enforces conservation of mass and is often referred to
as the continuity equation. Equations (20.37) and (20.38) enforce conservation of
momentum in the x and y directions, respectively.
We denote the solution to (20.36)–(20.38), which we call the base flow, by u0(x, y, t), v0(x, y, t), and p0(x, y, t), and seek the behavior of small disturbances to this base flow. If the amplitude of the small disturbances grows with time or space, the flow is hydrodynamically unstable. There are two possibilities that may be considered: 1) a temporal analysis determines whether the amplitude of a spatial disturbance, for example, owing to a wavy wall, grows or decays with time, and 2) a spatial analysis determines whether the amplitude of a temporal disturbance, for example, owing to a vibrating ribbon, grows or decays with distance. The former is known as an absolute instability, and the latter is known as a convective instability.
For infinitesimally small disturbances (ε ≪ 1), the flow may be decomposed as follows

u(x, y, t) = u0(x, y, t) + ε û(x, y, t),
v(x, y, t) = v0(x, y, t) + ε v̂(x, y, t),
p(x, y, t) = p0(x, y, t) + ε p̂(x, y, t).

Substituting into the Navier-Stokes equations (20.36)–(20.38) gives

∂u0/∂x + ε ∂û/∂x + ∂v0/∂y + ε ∂v̂/∂y = 0,

∂u0/∂t + ε ∂û/∂t + (u0 + εû)(∂u0/∂x + ε ∂û/∂x) + (v0 + εv̂)(∂u0/∂y + ε ∂û/∂y)
    = −∂p0/∂x − ε ∂p̂/∂x + (1/Re)(∂²u0/∂x² + ε ∂²û/∂x² + ∂²u0/∂y² + ε ∂²û/∂y²),        (20.39)

∂v0/∂t + ε ∂v̂/∂t + (u0 + εû)(∂v0/∂x + ε ∂v̂/∂x) + (v0 + εv̂)(∂v0/∂y + ε ∂v̂/∂y)
    = −∂p0/∂y − ε ∂p̂/∂y + (1/Re)(∂²v0/∂x² + ε ∂²v̂/∂x² + ∂²v0/∂y² + ε ∂²v̂/∂y²).

As expected, the O(1) terms are simply the Navier-Stokes equations (20.36)–(20.38) for the base flow u0(x, y, t), v0(x, y, t), and p0(x, y, t). The O(ε) terms for the disturbance flow are

∂û/∂x + ∂v̂/∂y = 0,        (20.40)

∂û/∂t + u0 ∂û/∂x + v0 ∂û/∂y + û ∂u0/∂x + v̂ ∂u0/∂y = −∂p̂/∂x + (1/Re)(∂²û/∂x² + ∂²û/∂y²),        (20.41)

∂v̂/∂t + u0 ∂v̂/∂x + v0 ∂v̂/∂y + û ∂v0/∂x + v̂ ∂v0/∂y = −∂p̂/∂y + (1/Re)(∂²v̂/∂x² + ∂²v̂/∂y²).        (20.42)

Because ε is small, we neglect O(ε²) terms. Thus, the evolution of the disturbances û(x, y, t), v̂(x, y, t), and p̂(x, y, t) is governed by the linearized Navier-Stokes (LNS) equations (20.40)–(20.42), where the base flow is known. Because we assume that the disturbances are infinitesimally small, such that the nonlinear Navier-Stokes equations are linearized, this is called linear stability theory.
In principle, we could impose a disturbance û, v̂, p̂ at any time ti and track its evolution in time and space to determine if the flow is stable to the imposed disturbance. To fully characterize the stability of the base flow, however, would require many calculations of the LNS equations with different disturbance shapes imposed at different times. We can formulate a more manageable stability problem by doing one or both of the following: 1) consider simplified base flows, and/or 2) impose well-behaved disturbances.

20.6.2 Local Normal-Mode Analysis


Classical linear stability analysis (see, for example, Drazin 2002 and Drazin & Reid 1981) takes advantage of both simplifications mentioned above as follows. First, the base flow is assumed to be steady and parallel. A steady flow is simply one whose velocities do not depend on time, such that the time-derivative term in the base flow equations vanishes, that is,

∂u0/∂t = ∂v0/∂t = 0.        (20.43)

A parallel base flow is one in which the velocity components depend upon only one coordinate direction. For example, pressure-driven flow through a horizontal channel, called Poiseuille flow, is such that (see Figure 20.8)

u0 = u0(y),   v0 = 0,   p0 = p0(x).        (20.44)

Figure 20.8 Schematic of Poiseuille flow.

Second, the disturbances are imposed as normal modes for û, v̂, and p̂ of the form (temporal analysis):

û(x, y, t) = u1(y) e^(iα(x−ct)),
v̂(x, y, t) = v1(y) e^(iα(x−ct)),        (20.45)
p̂(x, y, t) = p1(y) e^(iα(x−ct)),

where α is the wavenumber, which is real, and c = cr + i ci is the complex wavespeed. Note that the wavelength of the imposed disturbance is proportional to 1/α.
Remarks:
1. It is understood that we take the real parts of the normal modes (20.45) (or equivalently add the complex conjugate). Consider the disturbance, for example,

û(x, y, t) = Re[u1(y) e^(iα(x−ct))],   c = cr + i ci
           = Re[u1(y) e^(αci t) e^(iα(x−cr t))]
           = Re[u1(y) e^(αci t) {cos[α(x − cr t)] + i sin[α(x − cr t)]}]
           = u1(y) e^(αci t) cos[α(x − cr t)].

Hence, we have a sine wave with wavenumber α and phase velocity cr, which are the normal modes. If ci > 0, the amplitude of the disturbance grows unbounded as t → ∞ with growth rate αci.
2. Because equations (20.40)–(20.42) are linear, each normal mode with wavenumber α may be considered independently of one another (cf. von Neumann numerical stability analysis). For a given mode α, we are looking for the eigenvalue (wavespeed) with the fastest growth rate, that is, max(ci), the most unstable (or least stable) mode. The presupposition in modal analysis is that any infinitesimally small perturbation can be expressed as a superposition of the necessary individual modes, such as in the form of a Fourier series.
3. This is regarded as a local analysis because the stability of only one velocity profile (20.44), that is, at a single streamwise location, is considered at a time


due to the parallel-flow assumption. In some cases, the parallel-flow assumption can be justified on formal grounds if the wavelength is such that 1/α ≪ L, where L is a typical streamwise length scale in the flow.
4. For the temporal analysis considered here, α is real and c is complex. For a spatial analysis, α is complex and c is real in the perturbations (20.45).
For steady, parallel base flow (20.43) and (20.44), the Navier-Stokes equations (20.36)–(20.38) reduce to [from equation (20.37)]

d²u0/dy² = Re dp0/dx,        (20.46)

where Re p0′(x) is a constant for Poiseuille flow. The disturbance equations (20.40)–(20.42) become

∂û/∂x + ∂v̂/∂y = 0,        (20.47)

∂û/∂t + u0 ∂û/∂x + (∂u0/∂y) v̂ = −∂p̂/∂x + (1/Re)(∂²û/∂x² + ∂²û/∂y²),        (20.48)

∂v̂/∂t + u0 ∂v̂/∂x = −∂p̂/∂y + (1/Re)(∂²v̂/∂x² + ∂²v̂/∂y²).        (20.49)

Substitution of the normal modes (20.45) into the disturbance equations (20.47)–(20.49) leads to

iα u1 + v1′ = 0,        (20.50)

iα(u0 − c)u1 + u0′ v1 = −iα p1 + (1/Re)(u1″ − α²u1),        (20.51)

iα(u0 − c)v1 = −p1′ + (1/Re)(v1″ − α²v1),        (20.52)

where primes denote differentiation with respect to y. Solving equation (20.50) for u1 and substituting into equation (20.51) results in

−(u0 − c)v1′ + u0′ v1 = −iα p1 − (1/(iαRe))(v1‴ − α²v1′),        (20.53)

leaving equations (20.52) and (20.53) for v1(y) and p1(y). Solving equation (20.53) for p1(y), differentiating, and substituting into equation (20.52) leads to

(u0 − c)(v1″ − α²v1) − u0″ v1 = (1/(iαRe))(v1′′′′ − 2α²v1″ + α⁴v1),        (20.54)

which is the Orr-Sommerfeld equation.

Remarks:


1. For a given base flow velocity profile u0(y), Reynolds number Re, and wavenumber α, the Orr-Sommerfeld equation is a differential eigenproblem of the form

L1 v1 = c L2 v1,

where the wavespeeds c are the (complex) eigenvalues, the disturbance velocities v1(y) are the eigenfunctions, and L1 and L2 are differential operators.
2. The Orr-Sommerfeld equation applies for steady, parallel, viscous flow perturbed by infinitesimally small normal modes, which is a local analysis.
3. For inviscid flow, which corresponds to Re → ∞, the Orr-Sommerfeld equation reduces to the Rayleigh equation

(u0 − c)(v1″ − α²v1) − u0″ v1 = 0.

4. For non-parallel flows, a significantly more involved global stability analysis is required in which the base flow is two- or three-dimensional, that is,

u0 = u0(x, y),   v0 = v0(x, y),   p0 = p0(x, y),

and the disturbances are of the form

û(x, y, t) = u1(x, y) e^(−ict),
v̂(x, y, t) = v1(x, y) e^(−ict),
p̂(x, y, t) = p1(x, y) e^(−ict),

in place of equation (20.45).

20.6.3 Numerical Solution of the Orr-Sommerfeld Equation


Recall that the continuous differential eigenproblem (20.54), that is, the Orr-Sommerfeld equation, is of the form

L1 v1 = c L2 v1.

We seek a corresponding discrete, that is, algebraic, generalized eigenproblem of the form

M(α, Re)v = c N(α)v,        (20.55)

where we have dropped the subscript on v. Let us rewrite the Orr-Sommerfeld equation in the form

(i/(αRe)) v′′′′ + Pj v″ + Qj v = c (v″ − α²v),        (20.56)

where

P(yj) = u0(yj) − 2iα/Re,

Q(yj) = iα³/Re − α²u0(yj) − u0″(yj).


To discretize equation (20.56) using central differences, we take

d²v/dy² = (vj+1 − 2vj + vj−1)/Δy² + O(Δy²),

d⁴v/dy⁴ = (vj+2 − 4vj+1 + 6vj − 4vj−1 + vj−2)/Δy⁴ + O(Δy²).

Substituting these approximations into equation (20.56) and collecting terms leads to the difference equation

C vj−2 + Bj vj−1 + Aj vj + Bj vj+1 + C vj+2 = c (B̄ vj−1 + Ā vj + B̄ vj+1),        (20.57)

where

Aj = 6i/(αRe) + Δy²(Δy²Qj − 2Pj),
Bj = −4i/(αRe) + Δy²Pj,
C = i/(αRe),
Ā = −Δy²(2 + α²Δy²),
B̄ = Δy².

Because the Orr-Sommerfeld equation is 4th-order, we need two boundary conditions on the disturbance velocity at each boundary. For solid surfaces at y = a, b, we set

v = v′ = 0   at y = a, b.        (20.58)

Because v is known at y = a, b, where v1 = vJ+1 = 0, the unknowns are vj, j = 2, . . . , J.
Applying equation (20.57) at j = 2, we have

C v0 + B2 v1 + A2 v2 + B2 v3 + C v4 = c (B̄ v1 + Ā v2 + B̄ v3),

where the v1 terms vanish because v1 = 0, and from v′ = 0 at y = a (j = 1)

(v2 − v0)/(2Δy) = 0   ⇒   v0 = v2.

Substituting into the difference equation for j = 2 results in

(C + A2)v2 + B2 v3 + C v4 = c (Ā v2 + B̄ v3).        (20.59)

Similarly, for j = J

C vJ−2 + BJ vJ−1 + (C + AJ)vJ = c (B̄ vJ−1 + Ā vJ).        (20.60)

Also, for j = 3 and j = J − 1, we have v1 = vJ+1 = 0. Therefore, the matrices in the algebraic form of the eigenproblem (20.55) are

             [ C+A2   B2    C     0    ···    0      0       0    ]
             [  B3    A3    B3    C    ···    0      0       0    ]
             [  C     B4    A4    B4   ···    0      0       0    ]
M(α, Re) =   [  ·      ·     ·     ·   ···    ·      ·       ·    ] ,
             [  0     0     0     0    ···    C    BJ−1    AJ−1   BJ−1 ]
             [  0     0     0     0    ···    0      C      BJ    C+AJ ]

         [ Ā   B̄   0   ···   0   0 ]
         [ B̄   Ā   B̄   ···   0   0 ]
N(α) =   [  ·    ·    ·   ···   ·    · ] ,
         [ 0   0   0   ···   Ā   B̄ ]
         [ 0   0   0   ···   B̄   Ā ]

and the eigenvector is vᵀ = [v2  v3  ···  vJ]. Thus, M is pentadiagonal, and N is tridiagonal.
The (large) generalized eigenproblem (20.55) must be solved to obtain the complex wavespeeds c, which are the eigenvalues, and the discretized eigenfunctions v(y) for a given α, Re, and u0(y).
Remarks:
1. The J − 1 eigenvalues and eigenvectors obtained approximate the first J − 1 of the infinity of eigenvalues and eigenfunctions of the continuous differential eigenproblem (20.54), namely the Orr-Sommerfeld equation.
2. The least stable mode, which is the one with the fastest growth rate αci, is given by max(ci). For example, the full spectrum of eigenvalues (wavespeeds) may be distributed as in Figure 20.9.
3. A marginal stability curve may be obtained for a given velocity profile u0 (y)

by determining max(ci) for a range of Re and α and plotting the max(ci) = 0 contour. A typical result is shown in Figure 20.10.

Figure 20.9 Typical distribution of eigenvalues of the Orr-Sommerfeld equation.

Figure 20.10 Sample marginal stability curve.

Figure 20.11 Schematic of a shooting method for solving the stability boundary-value problem.
Left unaddressed thus far is how the generalized eigenproblem (20.55) is actually solved. There are three techniques that are typically used:
1. Convert the generalized eigenproblem (20.55) into a regular eigenproblem by multiplying both sides by the inverse of N. That is, find the eigenvalues from

det(N⁻¹M − cI) = 0.

This requires inverting a large matrix, and although M and N are typically sparse and banded, N⁻¹M is a full, dense matrix. Recall that the standard method for numerically determining the eigenvalues and eigenvectors of a large, dense matrix is based on QR decomposition.
2. In order to avoid solving the large matrix problem that results from the boundary-value problem (BVP), traditionally shooting methods for initial-value problems (IVPs) have been used. This is illustrated in Figure 20.11. This approach avoided the need to find the eigenvalues of large matrices in the days when computers were not capable of such large calculations. In addition, it allowed for use of well-developed algorithms for IVPs. However, using methods for IVPs to solve BVPs is like using a hammer to drive a screw.
3. Solve the generalized eigenproblem

Mv = cNv,

where M and N are large, sparse matrices. In addition to the fact that the matrices are sparse, in stability contexts such as this, we only need the least stable mode, not the entire spectrum of eigenvalues. Recall that the least stable mode is that with the largest imaginary part. Currently, the state-of-the-art in such situations is the Arnoldi method (see Section 4.?).
Note that N must be positive definite for use in the Arnoldi method. That is, it must have all positive eigenvalues. In our case, this requires us to take the negatives of the matrices M and N as defined above.
20.6.4 Example: Plane-Poiseuille Flow
As an illustration, let us consider stability of plane-Poiseuille flow, which is pressure-driven flow in a channel, as illustrated in Figure 20.12. Such a flow is parallel; therefore, the base flow is a solution of equation (20.46). The solution is a parabolic velocity profile given by

u0(y) = y(2 − y),   0 ≤ y ≤ 2.        (20.61)

Figure 20.12 Schematic for plane-Poiseuille flow between two parallel plates.

Note that the base flow is independent of the Reynolds number. See the Mathematica notebook OS Psvll.nb for a solution of the Orr-Sommerfeld equation using the approach outlined in the previous section. This notebook calculates the complex wavespeeds, that is, the eigenvalues, for a given wavenumber α and Reynolds number Re in order to evaluate stability. Recall that a flow is unstable if the imaginary part of one of the discrete eigenvalues is positive, that is, if max(ci) > 0. The growth rate of the instability is then α max(ci).
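A minimal Python sketch of this calculation is given below as a stand-in for the Mathematica notebook; it assembles M and N from the difference coefficients reconstructed in the previous section and calls a dense generalized eigenvalue solver. The grid size, α, and Re are choices of this sketch, and near the critical point the sign of max(ci) is sensitive to resolution.

# Illustrative sketch: build M and N of (20.55) from the coefficients in
# (20.57) for plane-Poiseuille flow, u0(y) = y(2 - y) on 0 <= y <= 2, and
# solve the generalized eigenproblem for the complex wavespeeds c.
import numpy as np
from scipy.linalg import eig

alpha, Re, J = 1.02, 5800.0, 201
y = np.linspace(0.0, 2.0, J + 1)             # y_1, ..., y_{J+1}
dy = y[1] - y[0]
yj = y[1:-1]                                 # interior points j = 2, ..., J
u0 = yj*(2.0 - yj)                           # base flow
u0pp = -2.0*np.ones_like(yj)                 # u0''

P = u0 - 2.0j*alpha/Re
Q = 1.0j*alpha**3/Re - alpha**2*u0 - u0pp
C = 1.0j/(alpha*Re)
Aj = 6.0*C + dy**2*(dy**2*Q - 2.0*P)
Bj = -4.0*C + dy**2*P
Abar = -dy**2*(2.0 + alpha**2*dy**2)
Bbar = dy**2

n = J - 1                                    # unknowns v_2, ..., v_J
M = np.zeros((n, n), dtype=complex)
N = np.zeros((n, n), dtype=complex)
for k in range(n):
    M[k, k] = Aj[k]
    N[k, k] = Abar
    if k > 0:
        M[k, k-1] = Bj[k]; N[k, k-1] = Bbar
    if k < n - 1:
        M[k, k+1] = Bj[k]; N[k, k+1] = Bbar
    if k > 1:
        M[k, k-2] = C
    if k < n - 2:
        M[k, k+2] = C
M[0, 0] += C                                 # wall conditions, eqs. (20.59)
M[-1, -1] += C                               # and (20.60)

c = eig(M, N, right=False)                   # complex wavespeeds (eigenvalues)
ci_max = np.max(c.imag[np.isfinite(c)])
print("max(c_i) =", ci_max)                  # > 0 indicates instability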
Here, the base-flow solution is known analytically. More typically, the base flow
is computed numerically. If this is the case, note that the accuracy requirements


for the base flow in order to perform stability calculations are typically much
greater than that required for the base-flow resolution alone. For example, 201
points have been used in the current calculations, which is much greater than
would be required to accurately represent the parabolic base flow.
By performing a large number of such calculations for a range of wavenumbers and Reynolds numbers, we can plot the marginal stability curve for plane-Poiseuille flow. This shows the curve in α-Re parameter space for which max(ci) = 0, thereby delineating the regions of parameter space in which the flow is stable and unstable to normal-mode disturbances. The marginal stability curve for plane-Poiseuille flow is shown in Figure 20.13. Thus, the critical Reynolds number is approximately Rec = 5,800.

Figure 20.13 Marginal stability curve for plane-Poiseuille flow.
20.6.5 Numerical Stability Revisited
In light of our discussion of hydrodynamic instability here and numerical instability in Section 6.3, let us revisit these issues in relation to one another. Recall that in real flows, small disturbances, for example, imperfections, vibrations, etc., affect the base flow. In numerical solutions, on the other hand, small errors, for example, round-off, etc., act as disturbances in the flow. Consequently, the issue is: What happens to small disturbances or errors as a real flow and/or numerical solution evolves in time? If they decay, then the flow is stable, and the disturbances/errors are damped out. If they grow, then it is unstable, and the disturbances/errors are amplified.
In computational fluid dynamics (CFD), therefore, there are two possible sources of instability: 1) hydrodynamic instability, in which the flow itself is inherently unstable, and 2) numerical instability, in which the numerical algorithm magnifies the small errors. It is important to realize that the former is real and physical, whereas the latter is not physical and signals that a new numerical method is needed. The difficulty in CFD is that both are manifest in similar ways, that is, in the form of oscillatory solutions; therefore, it is often difficult to determine


whether oscillatory numerical solutions are a result of a numerical or hydrodynamic instability.


For an example of this, see Supersonic Boundary-Layer Flow Over a Compression Ramp, Cassel, Ruban, & Walker, JFM 1995. In this case, we can
clearly identify both physical and numerical instabilities (normally it is not so
definitive).
Hydrodynamic versus numerical instability:
1. Hydrodynamic stability analysis:
Often difficult to perform.
Assumptions must typically be made, for example, parallel flow.
Can provide conclusive evidence for hydrodynamic instability (particularly
if confirmed by analytical or numerical results).
For example, in supersonic flow over a ramp, Rayleigh's and Fjørtoft's theorems provide necessary conditions that can be tested for.
2. Numerical stability analysis:
Often gives guidance, but not always conclusive for complex problems. For
example, recall that von Neumann analysis does not account for boundary
conditions.
Remarks:
1. Typically, these topics are treated separately; hydrodynamic stability in books
on theory, and numerical stability in books on CFD. In practice, however, this
distinction is often not so clear.
2. Just because a numerical solution does not become oscillatory does not mean
that no physical instabilities are present! For example, unsteady calculations
are required to reveal a physical or numerical instability. In addition, there
may not be sufficient resolution to reveal the unstable modes.
3. It is generally not possible to obtain grid-independent solutions when hydrodynamic instabilities are present. Typically, finer grids allow for additional modes that may have different, often faster, growth rates.
4. Both physical and numerical instabilities are suppressed in steady solutions;
the issue numerically is convergence, not numerical stability.

21
Optimization and Control

21.1 Controllability, Observability, and Reachability


21.2 Control of Discrete Systems
21.3 Kalman Filter
21.4 Control of Continuous Systems


22
Image or Signal Processing and Data
Analysis
Until the early seventeenth century, our sight was limited to those things that
could be observed by the naked eye. The invention of the lens facilitated development of both the microscope and the telescope that allowed us to directly
observe the very small and the very large but distant. This allowed us to extend
our sight in both directions on the spatial scale from O(1). The discovery of the
electromagnetic spectrum and invention of devices to view and utilize it extended
our sight in both directions on the electromagnetic spectrum from the visible
range.
As I write this in July 2015, the New Horizons probe has just passed Pluto
after nine and one-half years traveling through the solar system. There are seven
instruments on the probe that are recording pictures and data that are then
beamed billions of miles back to earth for recording and processing. Three of the
instruments are recording data in the visible, infrared, and ultraviolet ranges of
the electromagnetic spectrum, and the data from all of the instruments is transmitted back to earth as radio waves. In addition to interplanetary exploration,
medical imaging, scientific investigation, and surveillance, much of modern life
relies on myriad signals transmitting our communication, television, GPS, internet, and encryption data. More and more of this data is digital, rather than
analog, and requires rapid processing of large quantities of images and signals
that each contain increasing amounts of detailed data. All of these developments
have served to dramatically increase the volume, type, and complexity of images
and signals requiring processing.
Signals and images are produced, harvested, and processed by all manner of
electronic device that utilize the entire electromagnetic spectrum. The now traditional approach to image and signal processing is based on Fourier analysis,
which is the primary subject of the early part of this chapter. This accounts for
both discrete and continuous data. More recently, advances in variational image
processing have supplemented these techniques as well (see, for example, Cassel
2013).
As the volume of experimental and computational data rapidly proliferates,
there is a growing need for creative techniques to characterize and consolidate the
data such that the essential features readily can be extracted. This has given rise
to the development of the field of reduced-order modeling, of which the method of
proper-orthogonal decomposition (POD) (or principal-component analysis (PCA))
is the workhorse.

22.1 Fourier Analysis


22.2 Proper-Orthogonal Decomposition

Appendix A
Row and Column Space of a Matrix

Suppose xH is a solution of the homogeneous system


AxH = 0,
and xP is a particular solution of
AxP = c.
Then
x = xP + kxH ,
where k is an arbitrary constant, is also a solution of
Ax = c.
In order to show this, observe that

A(xP + kxH) = AxP + kAxH = c + k(0) = c.
Now consider the m × n matrix A of the form

    [ A11  A12  ···  A1n ]
    [ A21  A22  ···  A2n ]
    [  ·    ·    ·    ·  ]
    [ Am1  Am2  ···  Amn ],
where ui are the column vectors of A, and vi are the row vectors of A. The row
representation of A is

        [ v1ᵀ ]                     [ A11 ]
    A = [ v2ᵀ ] ,    where   v1 =  [ A12 ] ,   etc.
        [  ·  ]                     [  ·  ]
        [ vmᵀ ]                     [ A1n ]


Then AxH = 0 requires that


⟨v1, xH⟩ = 0,   ⟨v2, xH⟩ = 0,   . . . ,   ⟨vm, xH⟩ = 0.
Thus, xH is orthogonal to each of the row vectors vi of A. The following possibilities exist:
1. If the rank of A is n (r = n), then xH = 0 is the only solution to AxH = 0,
and the m vi vectors are linearly independent and span n-space.
2. If the rank is such that r < n, in which case only r of the vi , i = 1, . . . , m
vectors are linearly independent, then the r-dimensional space spanned by
the r linearly-independent v vectors is called the row space. There are then
(n r) arbitrary vectors that satisfy AxH = 0, and are thus orthogonal to the
row space. The (n r)-dimensional space where these xH s exist is called the
orthogonal complement of the row space of A. It follows that a nonzero vector
xH satisfies AxH = 0 if and only if it is in the orthogonal complement of the
row space of A.
Similarly, consider the column representation of A, which is

    A = [ u1  u2  ···  un ],

where

         [ A11 ]
    u1 = [ A21 ] ,   etc.
         [  ·  ]
         [ Am1 ]
Then, for Ax = c, if c is a linear combination of the ui vectors, that is,

x1 u1 + x2 u2 + · · · + xn un = c,

then c is in the column space of A.
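A small numerical check of the earlier statement that x = xP + k xH solves Ax = c is sketched below; the particular 2×3 matrix and right-hand side are arbitrary choices of this sketch.

# Illustrative sketch: verify that x = xP + k*xH satisfies A x = c whenever
# A xH = 0 and A xP = c. Here A is 2x3 (rank 2 < n = 3), so a nontrivial
# null space (the orthogonal complement of the row space) exists.
import numpy as np
from scipy.linalg import null_space, lstsq

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
c = np.array([6.0, 15.0])

xH = null_space(A)[:, 0]        # a vector orthogonal to every row of A
xP = lstsq(A, c)[0]             # one particular solution of A x = c
for k in (0.0, 1.0, -2.5):
    print(np.allclose(A @ (xP + k*xH), c))   # True for every k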
