
Gustaf Söderlind

Numerical Methods for


Differential Equations
An Introduction to Scientific Computing

November 29, 2017

Springer
Contents

Part I Scientific Computing: An Orientation

1 Why numerical methods? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


1.1 Concepts and problems in analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Concepts and problems in algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Computable problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 Principles of numerical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


2.1 The First Principle: Discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 The Second Principle: Polynomials and linear algebra . . . . . . . . . . . . 23
2.3 The Third Principle: Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 The Fourth Principle: Linearization . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5 Correspondence Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Accuracy, residuals and errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3 Differential equations and applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


3.1 Initial value problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Two-point boundary value problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Partial Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4 Summary: Objectives and methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 51

Part II Initial Value Problems

5 First Order Initial Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


5.1 Existence and uniqueness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
5.2 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

6 Stability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
6.1 Linear stability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
6.2 Logarithmic norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
6.3 Inner product norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14


6.4 Matrix categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


6.5 Nonlinear stability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6.6 Stability in discrete systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

7 The Explicit Euler method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


7.1 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
7.2 Alternative bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.3 The Lipschitz assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

8 The Implicit Euler Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


8.1 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.2 Numerical stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.3 Stiff problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.4 Simple methods of order 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
8.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

9 Runge–Kutta methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
9.1 An elementary explicit RK method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
9.2 Explicit Runge–Kutta methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
9.3 Taylor series expansions and elementary differentials . . . . . . . . . . . . . 67
9.4 Order conditions and convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
9.5 Implicit Runge–Kutta methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
9.6 Stability of Runge–Kutta methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
9.7 Embedded Runge–Kutta methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
9.8 Adaptive step size control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.9 Stiffness detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

10 Linear Multistep methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89


10.1 Adams methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
10.2 BDF methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.3 Operator theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.4 General theory of linear multistep methods . . . . . . . . . . . . . . . . . . . . . 99
10.5 Stability and convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
10.6 Stability regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
10.7 Adaptive multistep methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

11 Special Second Order Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105


11.1 Standard methods for the harmonic oscillator . . . . . . . . . . . . . . . . . . . 106
11.2 The symplectic Euler method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
11.3 Hamiltonian systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
11.4 The Störmer–Verlet method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
11.5 Time symmetry, reversibility and adaptivity . . . . . . . . . . . . . . . . . . . . 114

Part III Boundary Value Problems



12 Two-point Boundary Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


12.1 The Poisson equation in 1D. Boundary conditions . . . . . . . . . . . . . . . 3
12.2 Existence and uniqueness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
12.3 Notation: Spaces and norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
12.4 Integration by parts and ellipticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
12.5 Self-adjoint operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
12.6 Sturm–Liouville eigenvalue problems . . . . . . . . . . . . . . . . . . . . . . . . . . 11

13 Finite difference methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


13.1 FDM for the 1D Poisson equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
13.2 Toeplitz matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
13.3 The Lax Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
13.4 Other types of boundary conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
13.5 FDM for general 2pBVPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
13.6 Higher order methods. Cowell’s difference correction . . . . . . . . . . . . 39
13.7 Higher order problems. The biharmonic equation . . . . . . . . . . . . . . . . 43
13.8 Singular problems and boundary layer problems . . . . . . . . . . . . . . . . . 49
13.9 Nonuniform grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

14 Finite element methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


14.1 The weak form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
14.2 The cG(1) finite element method in 1D . . . . . . . . . . . . . . . . . . . . . . . . 54
14.3 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
14.4 Neumann conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
14.5 cG(1) FEM on nonuniform grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Part I
Scientific Computing: An Orientation
Chapter 1
Why numerical methods?

Numerical computing is the continuation of mathematics by other means

Science and engineering rely on both qualitative and quantitative aspects of mathe-
matical models. Qualitative insight is usually gained from simple model problems
that may be solved using analytical methods. Quantitative insight, on the other hand,
typically requires more advanced models and numerical solution techniques. With
increasingly powerful computers, ever larger and more complex mathematical models can be studied. Results are often analyzed by visualizing the solution, sometimes
for a large number of cases defined by varying some problem parameter.
Scientific computing is the systematic use of highly specialized numerical meth-
ods for solving specific classes of mathematical problems on a computer. But are
numerical methods different from just solving the mathematical problem, and then
inserting the data to evaluate the solution? The answer is yes. Most problems of
interest do not have a “closed form solution” at all. There is no formula to evaluate.
The problems are often nonlinear and almost always too complex to be solved by
analytical techniques. In such cases numerical methods allow us to use the powers
of a computer to obtain quantitative results. All important problems in science and
engineering are solved in this manner.
It is important to note that a numerical solution is approximate. As we cannot
obtain an exact solution to our problem, we construct an approximating problem
that is amenable to automatic computation. The construction and analysis of com-
putational methods is a mathematical science in its own right. It deals with questions
such as how to obtain accurate results, and whether they can be computed efficiently.
This cannot be taken for granted. As a simple example, let us consider the prob-
lem of solving a linear system of equations, Ax = b, on a computer using standard
Gaussian elimination. Let us assume that A is large, with (say) N = 10,000 equations, and that A is a dense matrix. Because Gaussian elimination has an operation count of O(N^3), the total number of operations in solving the problem is on the order of 10^12 operations. Not only do we need a fast computer and a large memory, but
we might also ask the question whether it is at all possible to obtain any accuracy.
After all, in standard IEEE arithmetic, operations are “only” carried out to 16 digit
precision. Can we trust the computed result, given that our computational sequence
is so long?


This is a nontrivial issue, and the answer depends both on the problem’s mathe-
matical properties as well as on the numerical algorithms used to solve the problem.
It typically requires a high level of mathematical and numerical skills in order to
deal with such problems successfully. Nevertheless, it can be done on a routine ba-
sis, provided that one masters central notions and techniques in scientific computing.
Mathematical problems can roughly be divided into four categories. On the one hand, we distinguish problems in algebra from problems in analysis. On the other hand, we distinguish between linear problems and nonlinear problems, see Table 1.1.
The distinction between these categories is very important. In algebra, all con-
structs are finite, but in analysis, one is allowed to use transfinite constructs such
as limits. Thus analysis involves infinite processes, and the question of whether these converge or not. In scientific computing this presents an issue, because on a computer we can
only carry out finite computations. Therefore all computational methods are based
on algebraic methodology, and it becomes a central issue whether we can devise
algebraic problems that (in one sense or another) have solutions that approximate
the solutions of analysis problems.

Table 1.1 Categories of mathematical problems. Only problems in linear algebra are computable.

Category    Algebra          Analysis
Linear      computable       not computable
Nonlinear   not computable   not computable

In a similar way, the distinction between linear and nonlinear problems is of im-
portance. Most classical mathematical techniques deal with linear problems, which
share a lot of structure, while nonlinear problems usually have to be approached on
a case-by-case basis, if they can be solved by analytical techniques at all.
Borrowing a notion from computer science, we say that a solution is computable
if it can be obtained by a finite algorithm. However, a computer is limited to finite
combinations of the four arithmetic operations, +, −, ×, ÷, and logical operations.
Therefore, in general, only problems in the linear algebra category are computable,
with the standard examples being linear systems of equations, linear least squares
problems, linear programming problems in optimization, and some problems in dis-
crete Fourier analysis.
There are a few problems in analysis or nonlinear problems that can be solved by
analytical techniques, but the vast majority cannot. For such problems, the only way
to obtain quantitative results is by using numerical methods to obtain approximate
results. This is where scientific computing plays its central role, and as we shall see,
computational methods almost invariably work by, in one way or another, approxi-
mating the problem at hand by (a sequence of) problems in linear algebra, since this
is where we possess finite, or terminating, algorithms.

However, in practice it is also impossible to solve linear systems exactly, unless they are very small. It is not uncommon to encounter large linear systems, perhaps
having millions or even billions of unknowns. In such cases it is again only possible
to obtain approximate results. This may perhaps sound disheartening to the math-
ematician, but the interesting challenge for the numerical analyst is to construct
algorithms that have the capacity of computing approximations to any prescribed
accuracy. Better still, if we can construct fast converging algorithms, such accurate
approximations can be obtained with a moderate computational effort. Needless
to say, there is a trade-off; more accuracy will almost always come at a cost, although
on a good day, a good numerical analyst may occasionally enjoy a free lunch.
This lays out some of the interesting issues in scientific computing. The objective
is to solve complex mathematical problems on a computer, to construct reliable mathematical software for given classes of problems, and to make sure that quantitative results can be obtained both with accuracy and efficiency. In order to further
outline some basic thoughts, we have to investigate what mathematical concepts we
have to deal with in the different problem categories above.

1.1 Concepts and problems in analysis

The central ideas of analysis that distinguish it from algebra are limits and con-
vergence. The two most important concepts from analysis are derivatives and in-
tegrals. Let us start with the derivative, classically defined by the limit

    \frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.    (1.1)
If the function f is differentiable, the difference quotient converges to the limit f'(x)
as h goes to zero. Later, we will consider several alternative expressions to that of
(1.1) for approximating derivatives. In practical computations, mere convergence is
rarely enough; we would also like fast convergence, so that the trade-off between
accuracy and computational effort becomes favorable.
The derivatives of elementary functions, such as polynomials, trigonometric
functions and exponentials, are well known. Nevertheless, if a function is complex
enough, or an explicit expression of the function is unknown, the derivative may
be impossible to compute analytically. However, it can always be approximated. At
first, this may not seem to be an important application, but as we shall see later, it
is important in scientific computing to approximate derivatives to a high order of
accuracy, as this is key to solving differential equations.
Let us consider the problem of computing an “algebraic” approximation to (1.1).
Since we cannot compute the limit in a finite process, we consider a finite difference
approximation,
    \frac{df}{dx} \approx \frac{f(x + \Delta x) - f(x)}{\Delta x},    (1.2)

Fig. 1.1 Finite difference approximation to a derivative. The graph plots the relative error r in the finite difference approximation (1.3) vs. ∆x in a log-log diagram. The blue straight line on the right represents r as described by (1.4). A nearby dashed reference line of slope +1 shows that r = O(∆x). For ∆x < 10^-8, however, roundoff becomes dominant, as demonstrated by the red erratic part of the graph on the left. Roundoff error grows like O(∆x^-1), as indicated by the second dashed reference line on the left, of slope −1. Thus the maximum accuracy that can be obtained by (1.3) is ∼ 10^-8.

where ∆ x > 0 is small, but non-zero. Thus the finite difference approximation sim-
ply skips taking the limit in (1.1), replacing the limit by (1.2). Let us see how accu-
rate this approximation is.

Example Consider the function f(x) = e^x with derivative f'(x) = e^x, and consider computing a finite difference approximation to f'(1), i.e.,

    f'(1) \approx \frac{f(1 + \Delta x) - f(1)}{\Delta x}.    (1.3)
Since f'(1) = e, the relative error in the approximation is, by Taylor series expansion,

    r = \frac{f(1 + \Delta x) - f(1)}{e \Delta x} - 1 = \frac{e^{\Delta x} - 1}{\Delta x} - 1 = \frac{\Delta x}{2} + O(\Delta x^2).    (1.4)
Thus we can expect to obtain an accuracy proportional to ∆x. For example, in order to obtain six-digit accuracy, we should take ∆x ≈ 10^-6. In order to check this result, we compute the difference quotient (1.3) for a large number of values of ∆x and compare to the exact result, f'(1) = e, to evaluate the error.
It is important to note that the approximation (1.2) is convergent, as shown by (1.4).
Convergence means that the error can be made arbitrarily small, by reducing ∆ x. However,
in the numerical experiment in Figure 1.1, we see that we cannot obtain arbitrarily high
accuracy. How is this discrepancy consistent with the claim that the approximation (1.2) is
convergent?
The answer is that convergence can only take place on a dense set of numbers, such as on
the set of real numbers, which allows us to use the standard “epsilon – delta” arguments of
analysis. However, the real number system cannot be implemented in computer arithmetic
(since it requires an infinite amount of information) and in numerical computation we have
to make do with computer representable numbers, usually the IEEE 754 standard. Be-
cause this is a finite set of numbers, the notion of convergence does not exist in computer
arithmetic, which explains why there will be a breakdown, sooner or later, if we take ∆ x
too small. Even so, the set of computer numbers is dense enough for almost all practical
purposes, and in the right part of Figure 1.1, we can clearly see “the beginning of conver-
gence,” as the error behaves exactly as expected according to (1.4). Thus we can in practice
usually observe the correct order of convergence. In this case the order of convergence is
p = 1, meaning that the error is r = O(∆ x p ).

In this example we have also seen that the results were plotted in a log-log di-
agram. This is a standard technique used whenever the plotted quantity obeys a
power law. A power law is a relation of the form

    η = C · ξ^p.

Assuming that ξ and η are positive and taking logarithms, we obtain

    log η = log C + p · log ξ.

This is the equation of a straight line. Thus log η is a linear function of log ξ , with an
easily identifiable slope of p, which is the “power” in the power law. In our example,
η is the error r, and ξ is the “step size” ∆ x. Consequently, we have the power law

    r ≈ C · ∆x^p,

where we have neglected the higher order terms in the expansion (1.4). We plot
log r as a function of log ∆ x to be able to identify p. In this case we have p = 1 and
C = 1/2, which represents the leading term of the error. If ∆ x is not too large, the
higher order terms can be neglected and the order of convergence clearly observed.
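To make this concrete, the observed order can be estimated directly from computed errors. The following sketch (in Python, which the text itself does not use; all function names are chosen here for illustration) evaluates the relative error of the one-sided quotient (1.3) for f(x) = e^x at x = 1, and fits the slope p of the power law from two step sizes:

```python
import math

def forward_diff(f, x, dx):
    """One-sided difference quotient (1.3): the limit in (1.1) is skipped."""
    return (f(x + dx) - f(x)) / dx

def rel_err(dx):
    """Relative error of (1.3) for f(x) = exp(x) at x = 1, where f'(1) = e."""
    return abs(forward_diff(math.exp, 1.0, dx) - math.e) / math.e

# Fit the slope p in the power law r ~ C * dx^p from two step sizes:
# log r is linear in log dx, so p is a difference quotient of logarithms.
r1, r2 = rel_err(1e-3), rel_err(1e-4)
p = math.log(r1 / r2) / math.log(1e-3 / 1e-4)
print(p)  # close to 1, confirming first order convergence
```

Plotting log rel_err(dx) against log dx for many step sizes reproduces the qualitative behavior of Figure 1.1, including the roundoff breakdown for very small ∆x.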
A first order approximation, such as the one above, is rarely satisfactory in ad-
vanced scientific computing. We are therefore interested in whether it is possible to
improve accuracy, and, in particular, to improve the order of convergence. This will
play a central role in our analysis later, because it will allow us to construct highly
accurate methods that require only a moderate computational effort. That sounds
like a “free lunch,” but we shall see that if we develop our techniques well, it is
indeed possible to obtain a vastly improved performance.

Example We shall again consider the function f(x) = e^x with derivative f'(x) = e^x. This time, however, we are going to use a symmetric finite difference approximation to f'(1), of the form

    f'(1) \approx \frac{f(1 + \Delta x) - f(1 - \Delta x)}{2 \Delta x}.    (1.5)

Fig. 1.2 Symmetric finite difference approximation. Relative error r in the symmetric difference quotient (1.5) is plotted vs. ∆x in a log-log diagram. The green straight line on the right represents r as described by (1.6). A nearby dashed reference line of slope +2 shows that r = O(∆x^2). For ∆x < 10^-5, roundoff becomes dominant, as demonstrated by the red erratic part of the graph on the left. Roundoff error grows like O(∆x^-1), as indicated by the second dashed reference line on the left. The maximum accuracy that can be obtained by (1.5) is ∼ 10^-11.

The limit, as ∆ x → 0, equals the derivative, but as we shall see both theoretically and prac-
tically, the convergence is much faster. By expanding both f (1 + ∆ x) and f (1 − ∆ x) in a
Taylor series around x = 1, we find the relative error

    r = \frac{f(1 + \Delta x) - f(1 - \Delta x)}{2 e \Delta x} - 1 = \frac{e^{\Delta x} - e^{-\Delta x}}{2 \Delta x} - 1 = \frac{\Delta x^2}{6} + O(\Delta x^4).    (1.6)
Once again we would observe a power law, but this time p = 2, which makes the symmetric finite difference approximation convergent of order p = 2. Again, we compare with real computations, and obtain the result in Figure 1.2.
Evidently we can obtain considerably higher accuracy, while we still only use a finite difference quotient requiring two function evaluations. It is of some interest to make a closer comparison, and by superimposing the two plots we obtain Figure 1.3.
This shows that a higher order approximation will produce much higher accuracy, and obtain that higher accuracy at a relatively large value of ∆x. We also see that the roundoff error is largely unaffected. To see how the roundoff error occurs, we assume that the function f(x) can be evaluated to a relative accuracy of ε ≈ 10^-16, which approximately equals the IEEE 754 roundoff unit. This means that we obtain an approximate value f̃(x) satisfying

    \left| \frac{\tilde{f}(x) - f(x)}{f(x)} \right| \le \varepsilon.

Fig. 1.3 Comparison of finite difference approximations. Relative errors in first and second order finite difference approximations are compared as functions of ∆x in a log-log diagram. The straight lines correspond to order p = 1 (blue) and p = 2 (green), respectively. At ∆x = 10^-5, the second order approximation is more than five orders of magnitude more accurate. Roundoff errors remain similar for both approximations.

As a consequence, when we evaluate the difference quotient (1.5) we obtain a perturbed value,

    f'(1) \approx \frac{\tilde{f}(1 + \Delta x) - \tilde{f}(1 - \Delta x)}{2 \Delta x} = \frac{f(1 + \Delta x) - f(1 - \Delta x)}{2 \Delta x} + \rho,    (1.7)

where we will have a maximum roundoff error of

    |\rho| \le \hat{\rho} = \frac{\varepsilon f(1)}{\Delta x} = \frac{\varepsilon e}{\Delta x}.
Likewise, the maximum roundoff in (1.3) is 2ρ̂. Upon close inspection, the graphs also reveal that the roundoff is slightly larger in (1.3). However, in both cases ρ̂ = O(∆x^-1), meaning that the effect of the roundoff is inversely proportional to ∆x when ∆x → 0. In other words, the approximation eventually deteriorates when ∆x is reduced, explaining the negative unit slope observed in the three plots above.
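The trade-off between truncation error and roundoff described above is easy to reproduce numerically. In this Python sketch (not part of the text; it assumes IEEE double precision and uses f(x) = e^x as in the examples), the first order quotient (1.3) and the second order quotient (1.5) are compared at the same step size:

```python
import math

e, f = math.e, math.exp

def r_forward(dx):
    """Relative error of the one-sided quotient (1.3); r = O(dx)."""
    return abs((f(1 + dx) - f(1)) / dx - e) / e

def r_symmetric(dx):
    """Relative error of the symmetric quotient (1.5); r = O(dx^2)."""
    return abs((f(1 + dx) - f(1 - dx)) / (2 * dx) - e) / e

# At dx = 1e-5 the symmetric formula is several orders of magnitude more
# accurate; for very small dx, roundoff of size ~ eps/dx dominates both.
print(r_forward(1e-5), r_symmetric(1e-5), r_symmetric(1e-13))
```

The last value illustrates the breakdown for too small ∆x: at ∆x = 10^-13 the symmetric quotient is far less accurate than at ∆x = 10^-5, exactly as in the erratic left parts of the figures.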

The brief glimpse above of problems in analysis demonstrates that computations (and problem solving) require approximate techniques. These are constructed in
such a way that the approximating quantities converge, in a mathematical sense,
to the original analysis quantities. However, because convergence itself is a notion
from analysis, the successive improvement of the approximations is limited by the
accuracy of the computer number system. This means that, while it is possible to
obtain highly accurate approximations, one must be aware of the shortcomings of
finite computations. Some difficulties can be overcome by constructing fast converging approximations, but the bottom line is still that the approximations will be
in error, and that it is important to be able to estimate and bound the remaining er-
rors. This requires considerable mathematical skill as well as a thorough knowledge
of the main principles of scientific computing.

1.2 Concepts and problems in algebra

The key concepts in linear algebra are vectors and matrices. The central problems
relate vectors and matrices, and the most important problem is linear systems of
equations,
Ax = b, (1.8)
where A is an n × n matrix, x is the unknown n-vector, and b is a given n-vector. In
mathematics the solution (if it exists) is often expressed as

    x = A^{-1} b,    (1.9)

where A^{-1} is the inverse of A.


In practice the inverse is rarely computed. Instead we employ some computa-
tional technique, such as Gaussian elimination, to solve the problem. This standard
numerical method obtains the exact solution in a finite number of steps, i.e., the solution is computable. More precisely, it takes about 2n^3/3 arithmetic operations to compute the solution in this way. Although the expression A^{-1}b is frequently used in theoretical discussions, Gaussian elimination avoids computing the inverse, which is more expensive to compute.
In fact, the elimination procedure is equivalent to a matrix factorization, A =
LU, where L and U are lower and upper triangular matrices. This transforms the
original system to
LUx = b, (1.10)
which is solved in two steps: first the triangular system Ly = b is solved for y, and
then the system Ux = y is solved for x. This procedure is used for the simple reason
that it is faster to solve triangular systems than full systems, thus economizing the
total work for obtaining x. In summary, Gaussian elimination first factorizes the
matrix A = LU (this is the expensive operation), after which the equation solving
is reduced to solving triangular systems. In scientific computing it is very common
that one has to solve a sequence of systems,

    Ax^k = b^k,    (1.11)

where the right hand sides b^k often depend on the previous solution x^{k-1}. Here the superscript denotes the recursion index, not to be confused with the vector component index. In such situations the LU factorization of A is only carried out once, and
the factors L and U are then used to solve each system (1.11) cheaply. These forward
and back substitutions only have an operation count of 2n^2 arithmetic operations.
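The factor-once, solve-many pattern can be sketched in a few lines. The following Python/NumPy code is an illustration only, not a robust implementation: it performs Doolittle's LU factorization without pivoting, which is acceptable for this small example but not in general.

```python
import numpy as np

def lu_factor(A):
    """Doolittle factorization A = LU, no pivoting (illustration only)."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float).copy()
    for k in range(n - 1):                 # the expensive O(n^3) step
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U

def lu_solve(L, U, b):
    """Forward substitution (L y = b) then back substitution (U x = y)."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = b[i] - L[i, :i] @ y[:i]
    x = np.zeros(n)
    for i in reversed(range(n)):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

# Factor once, then solve a whole sequence A x^k = b^k at O(n^2) each.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
L, U = lu_factor(A)
x = lu_solve(L, U, np.array([1.0, 2.0]))
```

Only lu_factor costs O(n^3); each additional right hand side reuses L and U through the O(n^2) substitutions, which is exactly the saving described above.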
Most computational algorithms in linear algebra are built on some matrix factor-
ization technique, chosen to provide suitable benefits. For example, in linear least
squares problems it is standard procedure to factorize the matrix A = QR, where Q
is an orthogonal matrix and R is upper triangular. This again reduces the amount
of work involved, and the use of orthogonal matrices is beneficial to maintaining
stability and accuracy in long computations.
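As a small illustration of the QR approach (a Python/NumPy sketch, not from the text; the data points are invented for the example), a least squares line fit reduces to a triangular solve:

```python
import numpy as np

# Fit a straight line c0 + c1*t to four data points that lie
# exactly on the line 1 + 2t, so the residual should vanish.
t = np.array([0.0, 1.0, 2.0, 3.0])
A = np.column_stack([np.ones_like(t), t])   # 4x2 design matrix
b = 1.0 + 2.0 * t

Q, R = np.linalg.qr(A)            # reduced factorization A = QR
x = np.linalg.solve(R, Q.T @ b)   # solve the triangular system R x = Q^T b
print(x)  # approximately [1. 2.]
```

Because Q has orthonormal columns, forming Q^T b does not amplify errors, and only the small upper triangular system R x = Q^T b remains to be solved.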
For very large systems (say problems involving millions of unknowns) the com-
putational complexity of matrix factorization can become prohibitive. It is then
common to employ iterative methods for solving linear systems. This means that
we sacrifice obtaining an exact solution (up to roundoff), in favor of obtaining an
approximate solution at a much reduced computational effort. In fact, large-scale
problems in linear partial differential equations may be too large for factorization
methods, leaving us with no choice other than iterative methods.
Another important standard problem in linear algebra is to find the eigenvalues
and eigenvectors of a matrix A. They satisfy the equation

Ax = λ x. (1.12)

As both λ and x are unknowns, this problem is technically nonlinear, although it is referred to as the linear eigenvalue problem. This designation stems from the fact that A is a linear operator.
The eigenvalues are the (complex) roots λ of the characteristic equation

det(A − λ I) = 0. (1.13)

If A is n × n, then P(λ) := det(A − λI) is a polynomial of degree n, so we seek the n roots of the polynomial equation P(λ) = 0. Polynomial equations are obviously
nonlinear whenever the degree of the polynomial is 2 or higher. The major compli-
cation is, however, that there are in general no closed form expressions for the roots
of (1.13) if n > 4. For higher order matrices, therefore, numerical methods are neces-
sary. In practice eigenvalues are always computed by iterative methods, producing
a sequence of approximations λ_k converging to the eigenvalue λ. These computational techniques are non-trivial, and usually employ various matrix factorizations
as part of the computation. If the matrix A is large, the computational effort can eas-
ily become overwhelming, and most computational techniques for such problems
offer the possibility of only computing a few (dominant) eigenvalues.
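The simplest such iteration is the power method, which approximates a single dominant eigenvalue. A minimal Python/NumPy sketch (not from the text; the matrix and iteration count are chosen for illustration):

```python
import numpy as np

def power_iteration(A, iters=100):
    """Approximate the dominant eigenpair of A by repeated multiplication."""
    x = np.array([1.0] + [0.0] * (A.shape[0] - 1))
    lam = 0.0
    for _ in range(iters):
        y = A @ x
        x = y / np.linalg.norm(y)   # normalize to prevent overflow
        lam = x @ (A @ x)           # Rayleigh quotient estimate of lambda
    return lam, x

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 3 and 1
lam, v = power_iteration(A)
print(lam)  # converges to the dominant eigenvalue 3
```

Each step multiplies by A, so the component along the dominant eigenvector is amplified; the sequence λ_k converges at a rate governed by the ratio of the two largest eigenvalues, here 1/3 per iteration.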
Thus we see that convergence becomes an issue also for computational meth-
ods in some standard linear algebra problems. Although this may appear coun-
terintuitive, it shows that numerical methods are necessary in the vast majority of
mathematical problems. Computational methods are constructive approximations,
and are not a matter of merely inserting data into a mathematical formula in order
to obtain the solution to the problem.

It is fair to say that almost every single problem in applied mathematics involves
linear algebra problems at some level. Therefore linear algebra techniques are at the
core of scientific computing, although we will see that real problems in science and
engineering often have to be approached by a number of preparatory techniques be-
fore the task has been brought to a linear algebra problem where standard techniques
may be employed.

1.3 Computable problems

We have seen that mathematical problems can be divided into problems from alge-
bra and analysis, and that these two categories can further be divided into linear
and nonlinear problems. That leaves us with four categories of problems. We say
that a problem is computable if its exact solution can be obtained in a finite num-
ber of operations. Unfortunately, only some of the problems in linear algebra are
computable.
It is important to note that the notion of computable problems takes no account of
inexact computer arithmetic. For example, Gaussian elimination will in theory solve
a linear system exactly in a finite number of operations, but on the computer it is sub-
ject to roundoff errors, since it is impossible to implement the real number system
on a computer. Instead we have to make do with the IEEE 754 computer arithmetic.
This system only contains a (small) subset of rational numbers, and further, each
arithmetic operation will in general incur additional errors, since the set of computer
representable numbers is not closed under the standard arithmetical operations.
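The non-closure is easy to observe in practice. The short Python snippet below (an illustration added here; the book contains no code at this point) shows a sum that rounds, and a case where the order of operations changes the result:

```python
# IEEE 754 double precision in action (illustrative snippet).
# 0.1, 0.2 and 0.3 are not exactly representable in binary, so the
# computed sum 0.1 + 0.2 differs from 0.3 by a tiny rounding error.
a = 0.1 + 0.2
print(a == 0.3)        # False
print(a - 0.3)         # a small but nonzero residual

# Floating point addition is not associative either. Beyond 2^53,
# consecutive doubles are 2 apart, so adding 1.0 can be lost entirely.
big = 2.0 ** 53
x = (big + 1.0) - big  # big + 1.0 rounds back to big
y = big + (1.0 - big)  # 1.0 - big is exact, so the 1.0 survives
print(x, y)            # 0.0 and 1.0
```

Both effects are consequences of rounding each individual operation to the nearest representable number.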
As a consequence, not even computable problems are necessarily solved exactly
on a computer. This is why numerical methods are necessary, even for the smallest
of problems. They are not an alternative to analytical techniques, but rather the only
way to obtain quantitative results.
In spite of the necessity of using computational methods in almost all of applied
mathematics, the analytical tools of pure mathematics are no less important in nu-
merical analysis. The construction and analysis of computational methods rely on
basic as well as advanced concepts in mathematics, which the numerical analyst
must master in order to devise stable, accurate and efficient computational methods.
Numerical analysis is the continuation of mathematics by other means. Its
goal is to construct computational methods and provide software for the efficient
approximate solution of mathematical problems of all kinds, using the computer as
its main tool. The aim is to compute an approximation to a prescribed accuracy,
preferably in tandem with error bounds or error estimates. In other words, we want
to design methods that produce accurate approximations that converge as fast as
possible to the exact solution. This convergence will only be observed in part, as the
exact solution is generally both unknown and non-computable.
The design of such computational methods is a nontrivial task. Although their
aim is to address the most challenging problems in nonlinear analysis, computa-

tional methods are usually assessed by applying them to well known problems from
applied mathematics, where there are known analytical expressions for the exact
solution. It is of course not necessary to solve such problems, but they remain the
best benchmarks for new computational methods. Thus, a method which cannot ex-
cel at solving a standard problem in applied mathematics is almost bound to fail on
real-life problems.
Chapter 2
Principles of numerical analysis

The subject of numerical analysis revolves around how to construct computable
problems that approximate the mathematical problem of interest. There is a very
large number of computational methods, superficially having rather little in com-
mon. But it is important to realize that all computational methods are constructed
from, and rely on, only four basic principles of numerical analysis. These are:
• The principle of discretization
• Linear algebra, including polynomial spaces
• The principle of iteration
• The principle of linearization.
Because the computable problems are essentially those of linear algebra, almost
all numerical methods will at the core work with linear algebra techniques. The
purpose of the other three principles listed above is to construct various approxima-
tions that bring the original problem to computable form. Below we shall outline
what these principles entail and why these ideas are so important. They will be en-
countered throughout the book in many different shapes.

2.1 The First Principle: Discretization

The purpose of discretization is to reduce the amount of information in a problem to
a finite set, making way for algebraic computations. It is used as a general technique
for converting analysis problems into (approximating) algebra problems, see Table
1.1, and is the key to numerical methods for differential equations.
Consider an interval Ω = [0, 1] and select from it a subset ΩN = {x0 , x1 , . . . , xN }
of distinct points, ordered so that x0 < x1 < · · · < xN . The set ΩN is called discrete,
to distinguish it from the “continuous set” (or rather the continuum) Ω . We refer to
ΩN as the grid.
Let a continuous function f be defined on Ω. Discretization (of f) means that
we convert f into a vector, by the map f ↦ F = f(Ω_N), i.e.,

Fig. 2.1 Discretization of a function. A continuous function f on [0, 1] (top). A grid ΩN is selected
on the x-axis and samples from f are drawn (center). The grid function F (bottom) is a vector with
a finite number of components (red), plotted vs. the grid points ΩN (black)

 
F = \begin{pmatrix} f(x_0) \\ \vdots \\ f(x_N) \end{pmatrix}. \qquad (2.1)

The function f has the independent variable x ∈ [0, 1], meaning that f (x) is a par-
ticular value of that function. Likewise, the vector F has an index as its independent
variable, and Fk = f (xk ) is a particular value of the vector, with 0 ≤ k ≤ N. As F is
only defined on the grid, it is also called a grid function, see Figure 2.1. As the grid
function is obtained by drawing equidistant samples of the function f , we may in
effect think of the process as an analog-to-digital conversion, akin to recording an
analog audio signal to a digital format.
For theoretical reasons, one may also wish to consider the case N → ∞. In such
situations the grid as well as grid functions are sequences rather than vectors. In
practical computations, however, the number of grid points is always finite, meaning
that computational methods work with vectors as discrete representations of functions.
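As a minimal sketch of the map f ↦ F (written in Python here, although the book's own examples use MATLAB; the choice f(x) = sin πx is an arbitrary illustration), discretization is nothing more than sampling on the grid:

```python
import math

N = 10
grid = [k / N for k in range(N + 1)]         # equidistant grid Omega_N on [0, 1]
F = [math.sin(math.pi * x) for x in grid]    # grid function F = f(Omega_N)

# F is an ordinary vector: the index k has replaced the variable x,
# and F[k] = f(x_k).
print(len(F))          # N + 1 components
```

All further computations then operate on the vector F rather than on the function f itself.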
To see how discretization can be used in differential equations, consider the sim-
ple radioactive decay problem

u̇ = αu ; u(0) = u0 , (2.2)

with exact solution u(t) = eαt u0 , and suppose we want to solve this equation on
[0, T ]. To make the problem computable, we will turn it into a linear algebra problem
by using a discrete approximation to the derivative.
Introduce a grid ΩN = {t0 , . . . ,tN } with tk = k∆t and N∆t = T . Noting that

\dot{u}(t) = \lim_{\Delta t \to 0} \frac{u(t+\Delta t) - u(t)}{\Delta t}, \qquad (2.3)
we introduce a grid function U approximating u, i.e., Uk ≈ u(tk ). We then have

\dot{u}(t_k) \approx \frac{u(t_k+\Delta t) - u(t_k)}{\Delta t} \approx \frac{U_{k+1}-U_k}{\Delta t}. \qquad (2.4)
This is referred to as a finite difference approximation. Next, we will use this to re-
place the derivative in (2.2). Thus the original, non-computable problem is replaced
by a computable, discrete problem,
\frac{U_{k+1}-U_k}{\Delta t} = \alpha U_k. \qquad (2.5)
This can be rewritten as a linear system of equations,
 
\begin{pmatrix}
1 & & & & \\
-(1+\alpha\Delta t) & 1 & & & \\
& -(1+\alpha\Delta t) & 1 & & \\
& & \ddots & \ddots & \\
& & & -(1+\alpha\Delta t) & 1
\end{pmatrix}
\begin{pmatrix} U_0 \\ U_1 \\ U_2 \\ \vdots \\ U_N \end{pmatrix}
=
\begin{pmatrix} u_0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad (2.6)

showing that the approximating problem is indeed a problem of linear algebra.


As the matrix is lower triangular, the system is easily solved using forward substi-
tution, meaning that we can compute successive approximations to u(∆t), u(2∆t), . . .
by repetitive use of the formula Uk+1 = (1 + α∆t)Uk . We then find that

u(k\Delta t) \approx U_k = (1+\alpha\Delta t)^k u_0.

All approximations are obtained using elementary arithmetic operations. If, in par-
ticular, we want to find an approximation to u(T ) in N steps, we take ∆t = T /N,
and get
u(T) \approx U_N = \left(1 + \frac{\alpha T}{N}\right)^{N} u_0.
From analysis we recognize the limit

\lim_{N\to\infty} \left(1 + \frac{\alpha T}{N}\right)^{N} = e^{\alpha T},

so apparently the method is convergent — the numerical approximation approaches
the exact solution u(T) = e^{αT} u_0 as N → ∞. Formally, the method needs an infinite
number of steps to generate the exact solution.
Although this may at first seem to be a drawback, it is the very idea of conver-
gence that saves the situation. Thus, for every ε > 0, an ε-accurate approximation is
computable in N(ε) steps. This means that we can obtain a solution to any desired
accuracy with a finite effort. Although N(ε) is finite for every ε > 0, it need not be
bounded as ε → 0; the numerical problem is still computable, although the original
analysis problem is not.
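The convergence is easy to check in a few lines. The Python sketch below (illustrative; the book's own experiments further on use MATLAB) evaluates U_N = (1 + αT/N)^N u_0 for α = −1, T = 1, u_0 = 1, and prints the error against the exact value e^{αT}u_0:

```python
import math

alpha, T, u0 = -1.0, 1.0, 1.0
exact = math.exp(alpha * T) * u0

errors = []
for N in (10, 100, 1000, 10000):
    UN = (1.0 + alpha * T / N) ** N * u0   # result of N explicit Euler steps
    errors.append(abs(UN - exact))
    print(N, errors[-1])

# Every tenfold increase in N reduces the error by roughly a factor
# of ten, consistent with first order convergence.
```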
Note that the discretization above is built on the finite difference approximation
(2.4), which incidentally is the same as the approximation (1.2) we used to
approximate a derivative numerically. The resulting method, (2.5), is known as the
explicit Euler method. Since the finite difference approximation (2.4) of the deriva-
tive is of first order, we might expect that the resulting method for the differential
equation is also first order convergent. Indeed, one can show that

|U_N − u(T)| ≤ C · N^{−1} = O(∆t).

Because the error is proportional to ∆t p with p = 1, the method is first order con-
vergent. This is a slow convergence; if we want ten times higher accuracy, we will
have to use a ten times shorter time step ∆t, or, equivalently, ten times more steps
(work) to reach the end point T , since ∆t = T /N.
This is demonstrated below using a M ATLAB implementation of the function
eulertest(alpha, u0, t0, tf), where alpha is the parameter α in the
problem, u0 is the initial value u(0), and t0 and tf are the initial time and terminal
time of the integration. The problem was run with α = −1 on the time interval [0, 1]
with initial value u(0) = 1. It was further run for N = 10, 100, 1000 and 10000,
corresponding to step sizes ∆t = 1/N. The results are found in Figures 2.2 and 2.3.

function eulertest(alpha, u0, t0, tf)
% Test of explicit Euler method. Written (c) 2017-09-06

for k = 1:4
    N = 10^(5-k);              % Number of steps
    dt = (tf-t0)/N;            % Time step
    u = u0;                    % Initial value
    t = t0;
    sol = u;
    time = t;
    err = 0;

    for i = 1:N                % Time stepping loop
        u = u + dt*alpha*u;    % A forward Euler step
        t = t + dt;            % Update time
        sol = [sol u];         % Collect solution data
        time = [time t];
    end

    uexact = exp(alpha*(time - t0))*u0;   % Exact solution
    error = sol - uexact;                 % Numerical error
    h(k) = dt;                            % Save step size
    r(k) = abs(error(end));               % Save endpoint error

    figure(1)
    subplot(2,1,1)
    plot(time,sol,'r')                    % Numerical solution (red)
    hold on
    plot(time,uexact,'b')                 % Exact solution (blue)
    grid on
    axis([0 1 0.3 1])
    xlabel('t')
    ylabel('u(t)')
    hold off

    subplot(2,1,2)
    semilogy(time,abs(error))             % Error vs. time
    grid on
    hold on
    axis([0 1 1e-6 1e-1])
    xlabel('t')
    ylabel('error')
end

figure(2)
loglog(h,r,'b')                           % Endpoint error vs dt
xlabel('dt')
ylabel('error')
grid on
hold on
xref = [1e-4 1e-1];
yref = [1e-5 1e-2];
loglog(xref,yref,'k--')
hold off

The explicit Euler method performs as expected. Although the results are text-
book perfect, the convergence is slow and the accuracy is less impressive. Even at
N = 104 we do not obtain more than four-digit accuracy. However, when approxi-
mating derivatives, we also used a second order approximation, (1.5). It would then
be of interest to try out that approximation too, for the differential equation u̇ = αu.
This leads to a different discretization,
\frac{V_{k+1}-V_{k-1}}{2\Delta t} = \alpha V_k, \qquad (2.7)
where Vk ≈ u(tk ) as before, and tk = k · ∆t. By Taylor series expansion, we find that

\frac{u(t_{k+1}) - u(t_{k-1})}{2\Delta t} = \frac{u(t_k+\Delta t) - u(t_k-\Delta t)}{2\Delta t} = \dot{u}(t_k) + O(\Delta t^2),
Fig. 2.2 Test of the explicit Euler method. Top panel shows the exact solution (blue) and the nu-
merical solution (red) using N = 10 steps. The step size is coarse enough to make the error readily
visible. Bottom panel shows the error |Uk − u(tk )| along the solution in a lin-log diagram. From top
to bottom, the individual graphs correspond to N = 10, 100, 1000 and N = 10000. At each point in
time, the error decreases by a factor of 10 when N increases by a factor of 10, demonstrating that
the error is O(∆t)

so we would expect the method (2.7) to be of second order and therefore more accu-
rate than the explicit Euler method. The method is known as the explicit midpoint
method. We note that there is one minor complication using this method — it re-
quires two starting values instead of one, since we need both V0 and V1 to get the
recursion
V_{k+1} = V_{k-1} + 2\alpha\Delta t\, V_k \qquad (2.8)
started. We shall take V_0 = u(0) and V_1 = e^{α∆t} u(0). This corresponds to taking initial
values from the exact solution.
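The recursion (2.8) runs in a few lines of code. The Python sketch below (paralleling the MATLAB experiment described next; the values α = −4.5 and N = 10 match that setup) steps the midpoint method for the decay problem:

```python
import math

alpha, T, N = -4.5, 1.0, 10
dt = T / N
V = [1.0, math.exp(alpha * dt)]       # V_0 = u(0), V_1 from the exact solution
for k in range(1, N):
    V.append(V[k - 1] + 2 * alpha * dt * V[k])   # explicit midpoint step (2.8)

print(V)   # the iterates oscillate and even turn negative,
           # while the exact solution e^{alpha*t} stays positive
```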
The previous code can easily be modified to test the explicit midpoint method.
When this was done, the code was tested in a slightly more challenging setting, by
taking α = −4.5 but otherwise solving the same problem as before. A wider range
of step sizes was chosen, by taking N = 10, 32, 100, . . . , 32000, 100000, so that ∆t
varies between 0.1 and 10−5 .
This time the results are not in line with expectations. As we see in Figures 2.4
and 2.5, the method appears to be second order convergent, but the error is not as
regular as one would have hoped for. This turns out to depend on the construction of
the method, and it illustrates that advanced methods in scientific computing cannot,
in general, be constructed by intuitive techniques. The loss of performance for this
Fig. 2.3 Test of the explicit Euler method. The endpoint error |UN − u(1)| is plotted in a log-log
diagram for N = 10, 100, 1000 and N = 10000 (blue). A dashed reference line of slope 1 shows
that the error is O(∆t), i.e., the method is first order convergent

method is due to some stability issues; in fact, this method is unsuitable for the
radioactive decay problem, although it excels for other classes of problems, such
as problems in celestial mechanics and in hyperbolic partial differential equations.
Thus, one needs a deep understanding of both the mathematical problem and the
discretization method in order to match the two and obtain reliable results. For the
time being, and until the computational behavior of the explicit midpoint method
has been sorted out and analyzed, we have to consider the method a potential failure,
in spite of the unquestionable success we observed before, when using exactly the
same difference quotient for approximating derivatives.
We have seen above that a discretization method only computes approximations
at discrete points to a continuous function. In the simplest case, we approximate
derivatives at distinct points, and in the more advanced cases we compute grid func-
tions (vectors) that approximate a continuous function solving a differential equa-
tion. The distinctive feature is that discretization uses algebraic computations to
approximate problems from analysis. Thus the computations can be carried out in
finite time, at the price of being approximate. Even so, we have seen that it is a non-
trivial task to find such approximations; intuitive techniques do not always produce
the desired results. For this reason, we need to carefully examine how approximate
methods are constructed, and how the approximate methods differ in character from
the exact solutions to problems of analysis.
Fig. 2.4 Test of the explicit midpoint method. Top panel shows the exact solution (blue) and the
numerical solution (red) using N = 10 steps. The numerical solution has an undesirable oscillatory
behavior of growing amplitude, indicating instability. The method eventually produces negative
values, in spite of the exact solution always remaining positive, being an exponential. Bottom
panel shows the error |Vk − u(tk )| along the solution in a lin-log diagram. From top to bottom, the
individual graphs correspond to N = 10, 320, 1000, . . . 100000. Initially, the error is O(∆t^2), but
here, too, we observe an oscillatory behavior, and a faster error growth when t grows

The examples above are very simple. In particular, the differential equation is
linear. It is chosen simply to illustrate the errors of the approximate methods. Linear
problems usually have solutions that can be expressed in terms of elementary func-
tions. By contrast, most nonlinear differential equations of interest cannot be solved
in terms of elementary functions. For example, the van der Pol equation,

u′ = v
v′ = µ(1 − u²)v − u

with u(0) = 2 and v(0) = 0, is a system of ordinary differential equations, which
is nonlinear if µ ≠ 0. An analytical solution can only be found for µ = 0; then
u(t) = 2 cos t and v(t) = −2 sin t. Nonlinearity is therefore an added difficulty, and
for most nonlinear problems, there is no alternative to computing approximate nu-
merical solutions.
In a linear problem the unknowns enter only in their first power. There are no
other powers, like u² or v^{−1/2}, nor any nonlinear functions, such as e^u, sin u or log v,
occurring in the equation. In the van der Pol equation above, we see that the cubic
Fig. 2.5 Test of the explicit midpoint method. The endpoint error |VN − u(1)| is plotted in a log-log
diagram for N = 10, . . . , 100000 (blue). A dashed reference line of slope 2 shows that the error is
O(∆t^2), but only if ∆t < 3 · 10^{−4} (left part of graph, corresponding to N > 3200). Thus the method
is second order convergent, but for larger ∆t, the error grows rapidly and the proper convergence
order is no longer observed (right part of graph)

term u²v occurs. Note that a nonlinearity can take the form of a product of first-
degree terms, such as uv, which is a quadratic term.
Discretization methods are in general aimed at solving nonlinear problems, pro-
vided that the original problem has a unique solution. A necessary condition for
solving nonlinear problems successfully is that we are able to solve linear problems
in an accurate and robust way. In addition, many classical problems in applied math-
ematics are linear, but still have to be solved numerically due to various complica-
tions, such as problem size or complex geometry. Discretization therefore remains
one of the most important principles in scientific computing.

2.2 The Second Principle: Polynomials and linear algebra

Linear algebra is synonymous with matrix problems of various kinds, such as solv-
ing linear systems of equations and eigenvalue problems. But to fully appreciate the
role of linear algebra in numerical analysis, it must be recognized that most compu-
tational problems that give rise to algebraic computations do not come ready-made
in matrix–vector form.

Problems in analysis typically involve continuous functions, for instance the
computation of integrals. It may sometimes be difficult to solve such problems
analytically, as one cannot always find primitive functions in terms of elementary
functions. But polynomials are different. It is straightforward to work with
polynomials in analysis – the primitive function of a polynomial is again a
polynomial, and conversely, the derivative of a polynomial is also a polynomial.
This makes them attractive and convenient tools in numerical analysis, and a large
number of numerical methods are therefore based on polynomial approximation.
The simplest (but far from the best) representation of a polynomial is

P(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_N x^N. \qquad (2.9)

Thus every polynomial is completely defined by a finite set of information, the co-
efficient vector, (c0 , c1 , . . . , cN )T . Consequently, many problems involving polyno-
mials lead directly to matrix–vector computations, or, in other words, linear algebra.
The interpolation problem is one of the most important uses of polynomials
in numerical analysis. In interpolation, given a grid function F on a grid ΩN , one
constructs a polynomial P on the interval Ω , such that P(ΩN ) = F. The purpose
is the opposite of discretization. Interpolation aims at generating a continuous
function from discrete data. If we think of discretization as an analog-to-digital
conversion, interpolation is the opposite, a digital-to-analog conversion.
Example. Let us consider approximating the function f (x) = 1 + 0.8 sin 2πx by a
polynomial P on the interval [0, 1], such that the polynomial reproduces the correct
values of f at some selected points, say at x0 = 0, x1 = 1/4, x2 = 1/2, x3 = 3/4 and
x4 = 1. At those points, f (x) takes the values 1, 1.8, 1, 0.2 and 1, respectively. Using
the ansatz (2.9) with N = 4 and imposing the five conditions

c_0 + c_1 x_k + c_2 x_k^2 + c_3 x_k^3 + c_4 x_k^4 = f(x_k)\,; \qquad k = 0, 1, \ldots, 4

we obtain the linear system of equations

\begin{pmatrix}
1 & x_0 & x_0^2 & x_0^3 & x_0^4 \\
1 & x_1 & x_1^2 & x_1^3 & x_1^4 \\
1 & x_2 & x_2^2 & x_2^3 & x_2^4 \\
1 & x_3 & x_3^2 & x_3^3 & x_3^4 \\
1 & x_4 & x_4^2 & x_4^3 & x_4^4
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix}
=
\begin{pmatrix} f(x_0) \\ f(x_1) \\ f(x_2) \\ f(x_3) \\ f(x_4) \end{pmatrix}. \qquad (2.10)

This determines the unknown coefficients c_0, . . . , c_4 that uniquely characterize the
interpolation polynomial. Inserting the data and solving for the coefficient vector
c gives

P(x) = 1 + \frac{128}{15}x - \frac{128}{5}x^2 + \frac{256}{15}x^3 + 0 \cdot x^4.
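The computation can be verified directly. The Python sketch below (an illustration; `solve` is a hypothetical helper implementing basic Gaussian elimination with partial pivoting, not code from the book) sets up the Vandermonde system (2.10) and recovers the coefficients above:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (illustrative helper)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]            # row swap
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

nodes = [0.0, 0.25, 0.5, 0.75, 1.0]
fvals = [1.0, 1.8, 1.0, 0.2, 1.0]
A = [[x ** j for j in range(5)] for x in nodes]    # the Vandermonde matrix
c = solve(A, fvals)
print(c)   # approximately [1, 128/15, -128/5, 256/15, 0]
```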
Fig. 2.6 Interpolation of a function. A grid function F, indicated by blue markers, represents sam-
ples of a trigonometric function f on [0, 1], indicated by dashed curve (top panel). The grid function
is interpolated by a polynomial P of degree 3 (solid blue curve, center panel). As P deviates visibly
from f , interpolation does not recover the original function f from the grid function F. The error
P(x) − f (x) is plotted as a function of x ∈ [0, 1] (red curve, lower panel). The error is zero only at
the interpolation points, indicated by red markers

We note that there is no fourth degree term due to the “partial symmetry” of the
function f . In general, having to interpolate at five points, we would expect the
polynomial to have five coefficients, i.e., P would have to be a degree four polyno-
mial. Here, however, the interpolant is a polynomial of degree 3.
Let us leave aside the question of whether this is a good approximation – the
point here is rather to recognize that although both f (x) and the approximating
polynomial P(x) are nonlinear functions, the interpolation problem is linear and
therefore computable. The unknown coefficients ck enter the problem linearly, and
we obtained the linear system of equations (2.10). The interpolation problem there-
fore falls in the linear algebra category.
As the example in Figure 2.6 shows, interpolation generates a continuous func-
tion from a grid function, but it is not the inverse operation of discretization. Thus,
discretization maps a function f to a grid function F, and interpolation maps F
to a polynomial P. This does not recover f unless f originally was identical to P.
Therefore, in general, it holds that f (x) 6= P(x), except at the interpolation points.
In scientific computing it is often preferable to use more general functions than
the standard polynomial (2.9). Just like linear algebra uses basis vectors, it is com-
mon to choose a set of basis functions, ϕ0 (x), ϕ1 (x), . . . ϕN (x). These are often, but

not always, polynomials that are carefully chosen for their computational advan-
tages, or to reflect some special properties of the problem at hand.
Given a function f (x), we can then try to find the best approximation among
all linear combination of the basis functions ϕk (x). In other words, we want to find
the best choice of coefficients ck such that

\sum_{k} c_k \varphi_k(x) \approx f(x), \qquad (2.11)

for x in some set. This could either be an interval, say Ω = [0, 1], or a grid ΩN =
{x0 , x1 , . . . , xN }. The interpolation problem above was a particular case, where we
used the well-known monomial basis,

\varphi_k(x) = x^k\,; \qquad k = 0, \ldots, N \qquad (2.12)

together with the grid ΩN . With general basis functions, that system would become
    
ϕ0 (x0 ) ϕ1 (x0 ) . . . ϕN (x0 ) c0 f (x0 )
 ϕ0 (x1 ) ϕ1 (x1 ) ϕN (x1 ) 
  c1   f (x1 ) 
   
= . . (2.13)

 .. .. .. .
  ..   .. 
   
 . . .
ϕ0 (xN ) ... ϕN (xN ) cN f (xN )

The important observation here is that the basis functions enter the matrix as its
column vectors, and that the system remains linear.
A natural question is whether the interpolation is “improved” by letting N → ∞.
In general this is not the case; it depends in a complicated way on the choice of
basis functions, as well as on the grid points ΩN and the regularity of f . Because the
grid points are often defined by the application (e.g. in image processing the grid
points correspond to pixels with fixed locations), the basis functions usually play
the more important role. An example is the monomial basis (2.12), a choice that is
fraught with prohibitive shortcomings for large N. It is often better to use piecewise
low-degree interpolation, a technique frequently used in connection with differential
equations and the Finite Element Method, where N is often very large.
Let ϕk and F denote the grid functions
   
ϕk (x0 ) f (x0 )
ϕk =  ...  F =  ...  . (2.14)
   

ϕk (xN ) f (xN )

Then (2.13) can be written

\sum_{k=0}^{N} c_k \varphi_k = F. \qquad (2.15)

It expresses the vector F as a linear combination of the basis vectors ϕ_k, while
(2.11) expresses the function f(x) as a linear combination of the basis functions
ϕ_k(x).
The least-squares problem is quite closely related to the interpolation problem.
There the number of basis functions, M + 1, is small compared to the number of
elements N + 1 in ΩN (not to mention the case when Ω is an interval). For example,
fitting a straight line (a first-degree polynomial) to a data set means that M = 1, i.e.,
we have but two coefficients, c0 and c1 , to determine, while the data set, defined by
f -values on ΩN , may be arbitrarily large. The system
\sum_{k=0}^{M} c_k \varphi_k = F \qquad (2.16)

is then an overdetermined system; it has more equations than unknowns. This cor-
responds to (2.13) but with a rectangular matrix containing only a few columns. In
such a situation, we form the scalar product of (2.16) with any ϕ j , to get
\sum_{k=0}^{M} c_k\, \varphi_j^T \varphi_k = \varphi_j^T F\,; \qquad j = 0, \ldots, M. \qquad (2.17)

This is an (M+1) × (M+1) linear system of equations, referred to as the normal
equations. It can be written Ac = g, with matrix elements a_{jk} = ϕ_j^T ϕ_k, and where
g_j = ϕ_j^T F. If the basis vectors {ϕ_k} are linearly independent, it can be solved for
the vector c of unknown coefficients c_k.
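For the straight-line fit mentioned above (M = 1, with basis functions φ_0(x) = 1 and φ_1(x) = x), the normal equations form a 2 × 2 system. The Python sketch below (the data values are made up purely for illustration) assembles and solves it:

```python
# Least-squares line fit via the normal equations (illustrative data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
F  = [1.1, 2.9, 5.2, 6.8, 9.1]            # noisy samples of roughly 2x + 1

phi0 = [1.0] * len(xs)                    # basis vector for phi_0(x) = 1
phi1 = xs                                 # basis vector for phi_1(x) = x

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# a_jk = phi_j^T phi_k and g_j = phi_j^T F, as in (2.17)
a00, a01, a11 = dot(phi0, phi0), dot(phi0, phi1), dot(phi1, phi1)
g0, g1 = dot(phi0, F), dot(phi1, F)

det = a00 * a11 - a01 * a01               # 2x2 system solved by Cramer's rule
c0 = (g0 * a11 - g1 * a01) / det          # intercept
c1 = (a00 * g1 - a01 * g0) / det          # slope
print(c0, c1)
```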
An even better approach is to first select the basis functions ϕ_k(x) so that the
vectors ϕ_k are orthogonal on Ω_N, which means that ϕ_j^T ϕ_k = 0 if j ≠ k. Then the sum
on the left hand side of (2.17) has only one term; the system reads c_j ϕ_j^T ϕ_j = ϕ_j^T F.
Thus we can immediately solve for the coefficient c j , to get

c_j = \frac{\varphi_j^T F}{\varphi_j^T \varphi_j}. \qquad (2.18)

This means that the coefficient vector is easily computed in terms of scalar prod-
ucts alone, one of the basic techniques in numerical linear algebra. A most impor-
tant special case is Fourier analysis, which is exclusively based on this technique,
making it very useful and efficient in practical computation. The coefficients c j in
(2.18) are commonly referred to as Fourier coefficients. In Fourier analysis, the
basis functions are typically chosen as trigonometric functions. By Euler’s formula,
e^{iωn} = cos ωn + i sin ωn. Therefore, trigonometric functions are polynomials in the
variable x = e^{iω}, motivating the commonly used term “trigonometric polynomials.”
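A small Python experiment (illustrative; the grid size and test function are arbitrary choices, not taken from the book) shows both the discrete orthogonality and the coefficient formula (2.18) at work:

```python
import math

N = 16
grid = [j / N for j in range(N)]              # periodic grid on [0, 1)

def sample(g):                                # grid vector of g on the grid
    return [g(x) for x in grid]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

s1 = sample(lambda x: math.sin(2 * math.pi * x))   # sin(2*pi*x) on the grid
c2 = sample(lambda x: math.cos(4 * math.pi * x))   # cos(4*pi*x) on the grid
print(dot(s1, c2))                            # ~0: orthogonal on the grid

# For f(x) = 3 sin(2*pi*x) + 0.5, formula (2.18) recovers the amplitude 3,
# since the constant part is orthogonal to s1 on this grid.
F = sample(lambda x: 3.0 * math.sin(2 * math.pi * x) + 0.5)
coef = dot(s1, F) / dot(s1, s1)
print(coef)
```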
More interestingly, but less obviously, the Finite Element Method (FEM) for
differential equations is based on similar ideas, using scalar products to construct
the linear system that needs to be solved. In effect, the best approximation is then
found by solving a system of the form (2.17), although we then have M = N. There

are many different FEM approaches, consisting in choosing the basis functions {ϕ j }
to carefully represent the desired properties of the solution, in terms of the degree
of the polynomial basis functions as well as in order to satisfy boundary and conti-
nuity conditions in a proper way. All these methods lead to linear systems of special
structure, and provide strong examples of how numerical analysis prefers to work in
various polynomial settings to make efficient use of linear algebra for the determi-
nation of the best linear combination of the chosen basis functions. Thus, it is fair
to say that approximation by polynomials is a cornerstone of scientific computing,
allowing us to employ the full range of linear algebra techniques in the process of
finding approximate solutions to problems from analysis.
An added difficulty is that in many applications, not least in FEM, the linear
systems are extremely large and sparse. It is not uncommon to have millions of
unknowns or more. In such cases it may no longer be possible to use standard fac-
torization methods to solve the system. Instead, other approximate techniques have
to be considered. These are iterative.

2.3 The Third Principle: Iteration

Nonlinear problems can rarely be solved by analytical techniques. The computa-
tional techniques used are iterative. This means that they are “repetitive,” in the
sense that the computational method generates a sequence of successively improved
approximations that converge to the solution of the nonlinear problem. Since con-
vergence is usually an infinite process, the iteration will in practice be terminated
after a finite number of steps, but if convergence is observed it is possible to termi-
nate the process when an acceptable accuracy has been achieved.
The very simplest example is the problem of computing the square root of a
number, say √2. As this is an irrational number, we only have numerical approxi-
mations to it, although today’s software allows us to compute such approximations
at the touch of a button.

The square root of 2, symbolized by √2, is the positive root of the nonlinear
equation x2 = 2. The root is obviously greater than 1 and less than 2, so a simple
guess is x ≈ 1.5. Since the root must then lie in the interval [1, 1.5], one could
repeatedly halve the interval to improve the accuracy. This process, called bisection,
is however
unacceptably slow. Much better and more general techniques are needed.
Two thousand years ago, Heron of Alexandria noted that if an approximation x
was a bit too large (like 1.5), then 2/x would be a bit too small, and vice versa. He
thus suggested that one take the average of x and 2/x to obtain a better approxima-
tion. This looks deceptively similar to bisection, but it was an enormous advance,
which 1,700 years later became known as Newton’s method. If iterated, the com-
putation is

x^{k+1} = \frac{1}{2}\Bigl(x^k + \frac{2}{x^k}\Bigr),

where the superscript k is the iteration index. To verify that lim x^k = √2, and to
analyze the iterative process, let ε^k = (x^k − √2)/√2 denote the relative error of the
approximation x^k. Then, by expanding in a Taylor series, we get

x^{k+1} = \frac{1}{2}\Bigl(x^k + \frac{2}{x^k}\Bigr) = \frac{1}{2}\Bigl(\sqrt{2}\,(1+\varepsilon^k) + \frac{\sqrt{2}}{1+\varepsilon^k}\Bigr)
\approx \frac{1}{\sqrt{2}}\Bigl(1 + \varepsilon^k + \bigl(1 - \varepsilon^k + (\varepsilon^k)^2 - \cdots\bigr)\Bigr) \approx \sqrt{2} + \frac{(\varepsilon^k)^2}{\sqrt{2}}.

Hence it follows that the relative error in x^{k+1} is

\varepsilon^{k+1} = \frac{x^{k+1} - \sqrt{2}}{\sqrt{2}} \approx \frac{(\varepsilon^k)^2}{2}.

This means that the number of correct digits roughly doubles in each iteration! If the relative error of x^0 is 10^{-2} (or 1%, corresponding to two correct digits, as in x^0 = 1.4), then the relative error of x^1 is 0.5·10^{-4}. In fact, the third iterate, x^3, is correct to 16 digits, implying that full IEEE precision has been attained. By contrast, simple bisection would require about 50 iterations to achieve the same.
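Heron's iteration is easily tried out in a few lines of code. The following sketch (in Python; the starting guess, tolerance and iteration cap are arbitrary choices) makes the doubling of correct digits visible:

```python
# Heron's iteration x_{k+1} = (x^k + 2/x^k)/2 for computing sqrt(2).
# Illustrative sketch: starting guess, tolerance and iteration cap are arbitrary.
import math

def heron_sqrt2(x0=1.5, tol=1e-15, max_iter=10):
    """Return the list of iterates x^0, x^1, ... produced by Heron's formula."""
    iterates = [x0]
    x = x0
    for _ in range(max_iter):
        x_new = 0.5 * (x + 2.0 / x)
        iterates.append(x_new)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return iterates

xs = heron_sqrt2()
errors = [abs(x - math.sqrt(2)) / math.sqrt(2) for x in xs]
# each relative error is roughly the square of the previous one, divided by two
```

Printing `errors` shows the quadratic pattern derived above: 10⁻² becomes roughly 0.5·10⁻⁴, then about 10⁻⁹, and then machine precision.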
This shows the power of well-designed iterative methods. In scientific computing
there are two basic types of iterative methods for solving nonlinear equations. These
are fixed-point iteration and Newton’s method. The iteration demonstrated above is
an example of both methods. In fact, they solve slightly different problems. New-
ton’s method solves problems of the form f (x) = 0, while fixed-point iteration
solves problems of the form x = g(x). The functions f and g are both nonlinear.
We shall leave Newton’s method for the next section, and only analyze fixed-point
iteration here.
Given a nonlinear function g : D ⊂ R^m → R^m, fixed-point iteration starts from an initial approximation x^0 and computes the next approximation from x^1 = g(x^0). The iteration becomes

\[
x^{k+1} = g(x^k). \tag{2.19}
\]

The name fixed-point iteration comes from the fact that if the iteration converges, we have lim x^k = x^*, where

\[
x^* = g(x^*), \tag{2.20}
\]

i.e., x^* is left unchanged by the map g. Thus x^* maps to itself, and is termed a fixed point of g. Whether there exist fixed points and whether the iteration converges depend on the function g.
Subtracting (2.20) from (2.19), we get

\[
x^{k+1} - x^* = g(x^k) - g(x^*). \tag{2.21}
\]

Taking norms (for the time being, any vector norm will do), we find that

\[
\|x^{k+1} - x^*\| = \|g(x^k) - g(x^*)\|.
\]
Next, we shall assume that the map g is Lipschitz continuous.

Definition 2.1. Let g : D ⊂ R^m → R^m. The Lipschitz constant of g on D is defined by

\[
L[g] = \sup_{u \neq v} \frac{\|g(u) - g(v)\|}{\|u - v\|}, \tag{2.22}
\]

where the supremum is taken over all u, v ∈ D.

Using this definition, it holds that

\[
\|x^{k+1} - x^*\| = \|g(x^k) - g(x^*)\| \le L[g] \cdot \|x^k - x^*\|.
\]

Letting ε^k = ‖x^k − x^*‖ denote the norm of the absolute error in x^k, we note that

\[
\varepsilon^{k+1} \le L[g] \cdot \varepsilon^k.
\]
Hence, if L[g] < 1, the error decreases in every iteration, and, in fact, ε^k → 0. A map with L[g] < 1 is called a contraction map, as the distance between two points (here x^k and x^*) decreases under the map g. For contraction maps the fixed-point iteration is therefore convergent, provided that x^0, x^* ∈ D. The fixed point theorem (also known as the contraction mapping theorem) is one of the most important theorems in nonlinear analysis, and it states the following:

Theorem 2.1. (Fixed point theorem) Let D be a closed set and assume that g is a
Lipschitz continuous map satisfying g : D ⊂ Rm → D. Then there exists a fixed point
x∗ ∈ D. If, in addition, L[g] < 1 on D, then the fixed point x∗ is unique, and the fixed
point iteration (2.19) converges to x∗ for every starting value x0 ∈ D.

We shall not give a complete proof. The key points here are that g maps the
closed set D into itself (existence) and that g is a contraction, L[g] < 1 (uniqueness).
Proving existence is the hard part (20th century mathematics), while uniqueness and
convergence are quite simple. Naturally, existence is always of fundamental impor-
tance, but in computational practice it is the contraction that matters most, since the
theorem offers a constructive way of obtaining successively improved approxima-
tions, converging to the proper limit x∗ .
It is worth noting that Lipschitz continuity is quite close to differentiability. In fact, we can rewrite (2.21) as

\[
x^{k+1} - x^* = g(x^k) - g(x^*) = g\big(x^* + (x^k - x^*)\big) - g(x^*) \approx g'(x^*) \cdot (x^k - x^*), \tag{2.23}
\]

provided that g is differentiable. Thus

\[
\|x^{k+1} - x^*\| \lesssim \|g'(x^*)\| \cdot \|x^k - x^*\|.
\]

This indicates that the error is diminishing provided that ‖g'(x^*)‖ < 1. In fact, if g is differentiable and the set D is convex, then one can show that

\[
L[g] = \sup_{x \in D} \|g'(x)\|.
\]

Returning to the classical (but trivial) problem of computing the square root of 2, we note that

\[
g(x) = \frac{1}{2}\left( x + \frac{2}{x} \right).
\]

It follows that

\[
g'(x) = \frac{1}{2}\left( 1 - \frac{2}{x^2} \right).
\]

Therefore, g'(x^*) = g'(√2) = 0. This means that g has a very small Lipschitz constant in a neighborhood of the root, explaining the fast convergence of Heron's formula.
In scientific computing, it is always of interest to find error estimates. Again, we can rewrite (2.21) as

\[
x^{k+1} - x^* = g(x^k) - g(x^*) = g(x^k) - g(x^{k+1}) + g(x^{k+1}) - g(x^*). \tag{2.24}
\]

By the triangle inequality, we have

\[
\varepsilon^{k+1} \le \|g(x^k) - g(x^{k+1})\| + \|g(x^{k+1}) - g(x^*)\| \le L[g] \cdot \|x^k - x^{k+1}\| + L[g] \cdot \varepsilon^{k+1}.
\]

Hence we have

\[
(1 - L[g])\,\varepsilon^{k+1} \le L[g] \cdot \|x^k - x^{k+1}\|.
\]

Solving for ε^{k+1}, we obtain the error bound

\[
\varepsilon^{k+1} \le \frac{L[g]}{1 - L[g]} \cdot \|x^k - x^{k+1}\|. \tag{2.25}
\]

Thus, while the true error remains unknown, it can be bounded in terms of the computable quantity ‖x^k − x^{k+1}‖, provided that we know L[g]. Unfortunately, the Lipschitz constant is rarely known, but a rough lower estimate can be obtained during the iteration process from

\[
L[g] \approx \frac{\|g(x^k) - g(x^{k-1})\|}{\|x^k - x^{k-1}\|} = \frac{\|x^{k+1} - x^k\|}{\|x^k - x^{k-1}\|}.
\]

This makes it possible to compute a reasonably accurate error estimate from (2.25).
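The estimate and the bound (2.25) combine into a simple practical stopping criterion. A sketch (the map g(x) = cos x, the starting point and the tolerance are illustrative choices, not from the text):

```python
# Fixed-point iteration with the a posteriori error bound (2.25), using a
# running estimate of L[g] from successive increments. Illustrative sketch.
import math

def fixed_point(g, x0, tol=1e-12, max_iter=200):
    x_prev, x = x0, g(x0)
    for _ in range(max_iter):
        x_next = g(x)
        denom = abs(x - x_prev)
        if denom == 0.0:                # increments vanished: converged
            return x
        L = abs(x_next - x) / denom     # estimate of the Lipschitz constant
        # bound (2.25): error <= L/(1-L) * |x^k - x^{k+1}|
        if L < 1.0 and L / (1.0 - L) * abs(x - x_next) < tol:
            return x_next
        x_prev, x = x, x_next
    return x

# x = cos x has a unique fixed point near 0.739; L[g] = |sin x*| < 1 nearby
root = fixed_point(math.cos, 1.0)
```

Since L[g] ≈ 0.67 here, the iteration is only linearly convergent: each step gains roughly a fixed fraction of a digit, in contrast with Heron's formula.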
While iterative methods are necessary for nonlinear problems, iterative methods
are also of interest for large-scale linear systems. For example, linear systems arising
in partial differential equations may have hundreds of millions of equations, which
excludes the use of conventional “direct methods” such as Gaussian elimination.
The remaining option is then iterative methods. If the problem is Ax = b, one usually tries to split the matrix so that A = M − N, and rewrite the system as

\[
Mx = Nx + b.
\]

If one can find a splitting such that M is inexpensive to invert (for example if M is diagonal), we can use the fixed-point iteration

\[
x^{k+1} = M^{-1}(Nx^k + b).
\]

Here, if x^k = x + δx^k, with δx^k representing the absolute error, we see that

\[
\delta x^{k+1} = M^{-1} N\, \delta x^k.
\]

This iteration will be convergent (δx^k → 0) if all eigenvalues of M^{-1}N are located strictly inside the unit circle; this condition is then the hallmark of a "well designed" iterative method. Interestingly, whether this can be achieved depends on the choice of discretization as well as on the properties of the differential equation.
It is easily seen that the iteration above can be rewritten as

\[
x^{k+1} = x^k - M^{-1}(Ax^k - b).
\]
Here the quantity r^k = Ax^k − b is known as the residual. In many large-scale problems it is easy to compute the residual, but the matrix A itself may not be available for separate processing. The iterative method then attempts to successively improve the approximation x^k by only evaluating the residual, transforming it by M^{-1} in order to speed up convergence. For such a technique to be successful, much work goes into the construction of M, using all available knowledge of the problem at hand. Depending on the construction of the iterative method, the matrix M is often referred to as a preconditioner.
The “ideal” choice of M would be to take M = A, but this choice requires that
A is inverted or factorized by conventional techniques. Therefore M is always some
approximation, for example an “incomplete” factorization of A.
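The simplest concrete instance of such a splitting takes M = diag(A), which gives Jacobi's method. A small illustrative sketch (the matrix and right hand side are arbitrary; strict diagonal dominance guarantees the eigenvalue condition above):

```python
# Splitting iteration x^{k+1} = M^{-1}(N x^k + b) with M = diag(A), N = M - A:
# Jacobi's method. Small illustrative example with a diagonally dominant matrix.

def jacobi(A, b, x0, sweeps=100):
    n = len(b)
    x = list(x0)
    for _ in range(sweeps):
        # all components updated from the previous iterate simultaneously
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x = jacobi(A, b, [0.0, 0.0, 0.0])
residual = max(abs(sum(A[i][j] * x[j] for j in range(3)) - b[i])
               for i in range(3))
```

For this matrix the eigenvalues of M^{-1}N have modulus well below one, so a hundred sweeps reduce the residual to roundoff level.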
As mentioned before, there are many other kinds of problems that require iter-
ative methods, e.g. eigenvalue problems. It is impossible to give a comprehensive
overview in a limited space, but suffice it to say that without iterative methods, sci-
entific computing would not be able to address important classes of problems in
applied mathematics.

2.4 The Fourth Principle: Linearization

The last principle that is a key element in many numerical methods is linearization.
This simply means that one converts a nonlinear problem to a linear one. Often, this
implies that one considers small variations around a given point, using differentia-
tion as an approximation.
Consider a differentiable map f : R^m → R^n, mapping x ∈ R^m to y ∈ R^n, so that y = f(x). Analysis then allows us to write

\[
dy = f'(x)\, dx. \tag{2.26}
\]

This expresses that the differential dy is a linear function of the differential dx. It is valid for any differentiable function f, and the "derivative" f'(x) can be a scalar (when n = m = 1), a gradient (when n = 1 with m arbitrary but finite), a Jacobian matrix (with both n and m arbitrary, but finite), or a linear operator (both n and m possibly infinite), as the case may be; the formalism is always the same.

In numerical analysis, it is common to approximate the infinitesimal differentials by finite differences, such as (1.2). We then write

\[
\Delta y \approx f'(x) \cdot \Delta x. \tag{2.27}
\]

This again expresses (small) variations in y as a linear function of Δx. That is, we consider the effect Δy, due to small variations Δx in the independent variable x, to be proportional to Δx. This is the essential idea of linearization.
To take this further, let f be a differentiable nonlinear map f : R^m → R^m. Given any fixed vector x and a small, varying offset Δx, we can expand f in a Taylor series about x, to obtain

\[
f(x + \Delta x) \approx f(x) + f'(x) \cdot \Delta x, \tag{2.28}
\]

by retaining the first two terms. This is equivalent to linearization because it approximates f(x + Δx) by a linear function of Δx; the approximation on the right hand side of (2.28) only contains Δx to its first power.
Note that if the problem is scalar (m = 1) this is a standard Taylor series, with all quantities involved being scalar. One can then graph the right hand side of (2.28) and get a straight line with slope f'(x). In the vector case, however, x is an m-vector, f(x) is an m-vector, and each component of f(x) depends on all components of x, i.e.,

\[
f_i(x) = f_i(x_1, \dots, x_j, \dots, x_m).
\]

Hence each component f_i can be differentiated with respect to every component of x, to obtain its gradient f_i'(x). In other words,

\[
f_i'(x) = \operatorname{grad}_x f_i(x) = \left( \frac{\partial f_i}{\partial x_1}, \dots, \frac{\partial f_i}{\partial x_j}, \dots, \frac{\partial f_i}{\partial x_m} \right). \tag{2.29}
\]

If we write down the derivatives of all components of f simultaneously, we obtain the m × m Jacobian matrix,

\[
f'(x) = \begin{pmatrix}
\dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \dots & \dfrac{\partial f_1}{\partial x_m} \\[1ex]
\dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \dots & \dfrac{\partial f_2}{\partial x_m} \\[1ex]
\vdots & \vdots & & \vdots \\[1ex]
\dfrac{\partial f_m}{\partial x_1} & \dfrac{\partial f_m}{\partial x_2} & \dots & \dfrac{\partial f_m}{\partial x_m}
\end{pmatrix}. \tag{2.30}
\]

The Jacobian matrix enters as f'(x) in the Taylor series expansion (2.28), which remains a linear approximation.
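When an analytic Jacobian is unavailable, (2.27) suggests approximating it column by column with finite differences. A sketch (the test function and the step h are illustrative choices):

```python
# Forward-difference approximation of the Jacobian matrix (2.30), one column
# per component of x. Sketch; the step h is a common but arbitrary choice.

def numerical_jacobian(f, x, h=1e-7):
    fx = f(x)
    m, n = len(fx), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp = list(x)
        xp[j] += h                      # perturb one component at a time
        fxp = f(xp)
        for i in range(m):
            J[i][j] = (fxp[i] - fx[i]) / h
    return J

# example: f(x) = (x1^2 + x2, x1*x2) has Jacobian [[2*x1, 1], [x2, x1]]
J = numerical_jacobian(lambda x: [x[0] ** 2 + x[1], x[0] * x[1]], [1.0, 2.0])
```

Each column costs one extra evaluation of f, so for large m this is itself an expensive way of obtaining f'(x).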
Let us now turn to the problem of solving a nonlinear equation f(x) = 0. If one has some approximation x^0 of the solution, we can expand around x^0 to get

\[
f(x) = f\big(x^0 + (x - x^0)\big) \approx f(x^0) + f'(x^0) \cdot (x - x^0),
\]

retaining the first two terms of the expansion, corresponding to a linear approximation. Because nonlinear problems are not computable but linear ones are, we may consider the linear system of equations

\[
f(x^0) + f'(x^0) \cdot (x - x^0) = 0, \tag{2.31}
\]

as an approximation to the nonlinear system f(x) = 0. In (2.31), f(x^0) is an m-vector and f'(x^0) an m × m matrix, so this is a linear system that can be used to determine x. The formal solution is

\[
x = x^0 - \big( f'(x^0) \big)^{-1} f(x^0),
\]

where all expressions on the right hand side are computable, provided that the Jacobian matrix is available and nonsingular. It is obvious, however, that this is not the solution to the problem f(x) = 0, as that problem was replaced by a linear approximation. If the linearization is in good agreement with the nonlinear function, we may obtain an improved solution compared to x^0. Naturally, this process can be repeated, leading to the iterative method known as Newton's method,

\[
x^{k+1} = x^k - \big( f'(x^k) \big)^{-1} f(x^k), \tag{2.32}
\]

where the superscript is used as the iteration index as before, to distinguish it from
the vector component subscript index used above.
As we have seen above, Newton's method is based on approximating the original problem f(x) = 0 by a sequence of linear, computable problems, whose solutions are expected to produce successively better approximations, converging to the true solution. We encountered Newton's method already in the previous section, applied to the quadratic equation x² − 2 = 0. Thus, taking f(x) = x² − 2, we have f'(x) = 2x, and Newton's method becomes

\[
x^{k+1} = x^k - \frac{(x^k)^2 - 2}{2x^k} = \frac{x^k}{2} + \frac{1}{x^k} = \frac{1}{2}\left( x^k + \frac{2}{x^k} \right).
\]
Thus we have obtained Heron’s formula for computing square roots. Newton’s
method is far more general, however, and is one of the most frequently used methods
for nonlinear equations, of equal importance to the fixed point iteration of the previ-
ous section. It is by far the most important example of the principle of linearization.
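In the vector case, each step of (2.32) solves the linear system f'(x^k)·Δx = f(x^k) rather than forming the inverse explicitly. A sketch for a 2×2 system (the equations and the starting point are illustrative; at this size Cramer's rule suffices for the linear solve):

```python
# Newton's method (2.32) for a 2x2 nonlinear system; the linear Newton system
# is solved by Cramer's rule. The system f = 0 below is an illustrative example.

def newton2(f, jac, x, tol=1e-12, max_iter=20):
    for _ in range(max_iter):
        f1, f2 = f(x)
        (a, b), (c, d) = jac(x)
        det = a * d - b * c               # Jacobian must be nonsingular
        dx = (d * f1 - b * f2) / det      # first component of J^{-1} f
        dy = (a * f2 - c * f1) / det      # second component of J^{-1} f
        x = [x[0] - dx, x[1] - dy]        # x^{k+1} = x^k - J^{-1} f(x^k)
        if abs(dx) + abs(dy) < tol:
            break
    return x

f = lambda x: [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0]
jac = lambda x: [[2.0 * x[0], 2.0 * x[1]], [x[1], x[0]]]
root = newton2(f, jac, [2.0, 0.5])
```

Started close enough to a solution, the residual drops quadratically and only a handful of iterations are needed.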
Nevertheless, the convergence of Newton's method remains an extremely difficult matter. To summarize, it will converge if f'(x) is nonsingular in the vicinity of the solution x^* to f(x) = 0. This is equivalent to requiring that ‖(f'(x))^{-1}‖ ≤ C_1 in a neighborhood B(r) = {x : ‖x − x^*‖ ≤ r} of x^*. In addition, we have to require that the second derivative (a 3-tensor) is bounded, i.e., ‖f''(x)‖ ≤ C_2 for x ∈ B(r). Finally, we need an initial approximation x^0 ∈ B(r) and ‖f(x)‖ ≤ C_0 on B(r). Success depends on further conditions on f and on the relation between the bounds C_0, C_1 and C_2.
In practice, it is impossible to verify these conditions mathematically, and as
a consequence, one often experiences considerable difficulties or even failures
in computational practice. This is not necessarily due to any shortcomings of the
method; it is more of an indication that strong skills in mathematics and numerical
analysis are needed. On the positive side, if one is successful, Newton’s method is
quadratically convergent, as we already saw in the case of Heron’s formula. Thus,
when all conditions of convergence are fulfilled, the error behaves like

\[
\varepsilon^{k+1} = O\big( (\varepsilon^k)^2 \big),
\]

which means that Newton's method offers extraordinarily fast convergence when all conditions are in place. By contrast, fixed point iteration is in general only linearly convergent, i.e.,

\[
\varepsilon^{k+1} = O(\varepsilon^k),
\]
thus calling for a much larger number of iterations before an acceptable approxima-
tion can be obtained. These issues are not much noticed for small systems, but many
mathematical models in science and engineering lead to large-scale systems, and to
considerable difficulties.
As an illustration of what Newton's method does in other special cases, consider solving the problem f(x) = 0 with f(x) = Ax − b. (This means that we are going to use Newton's method to solve a system of linear equations.) Then f'(x) = A, and Newton's method becomes (cf. (2.32))

\[
x^{k+1} = x^k - A^{-1}(Ax^k - b) = x^k - x^k + A^{-1}b = A^{-1}b.
\]

Hence Newton’s method solves the problem in a single iteration. This comes as no
surprise; the idea of linearization is “wasted” on a linear problem, as the linear prob-
lem is its own linearization and the method is designed to solve the linear problem
exactly. However, we also note that Newton’s method is “expensive” to use, as it
requires the Jacobian matrix and is supposed to work with a full linear solver. This
again means that if the nonlinear problem is very large, a conventional factoriza-
tion method is not affordable, and Newton’s method will be modified to some less
expensive variant, e.g.
\[
x^{k+1} = x^k - M^{-1} f(x^k), \tag{2.33}
\]


where M ≈ f 0 (xk ) is inexpensive to invert. The matrix M is often kept constant for
many successive iterations to further reduce computational effort. This is referred
to as modified Newton iteration. While each iteration is cheaper to carry out, the
savings come at a price – the fast, quadratic convergence is lost, and convergence (if
it converges at all) will be reduced to linear convergence. Naturally, (2.33) may be
viewed as a fixed point iteration, with

g : x 7→ x − M −1 f (x)

and Jacobian matrix


g0 (x) = I − M −1 f 0 (x).
In the fixed point iteration, we need kg0 (x)k  1 in a neighborhood of the solution
x∗ . This obviously requires M −1 f 0 (x) ≈ I, i.e., M −1 must approximate ( f 0 (x))−1
well enough. Needless to say, finding such an M, which is also cheap to invert, is a
tall order. Even so, it is a standard task for the numerical analyst who works with
advanced, large-scale problems in applied mathematics.
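A scalar sketch of the modified iteration (2.33), with M frozen at the derivative of the starting point (the function f(x) = x² − 2 reuses the earlier example; the iteration count is an arbitrary choice):

```python
# Modified Newton iteration (2.33): x^{k+1} = x^k - M^{-1} f(x^k), with M kept
# fixed at f'(x^0). Scalar sketch; convergence is linear instead of quadratic.

def modified_newton(f, M, x0, iters=60):
    x = x0
    for _ in range(iters):
        x = x - f(x) / M          # M is never re-evaluated
    return x

f = lambda x: x * x - 2.0
x0 = 1.5
M = 2.0 * x0                      # f'(x0), computed once and reused
root = modified_newton(f, M, x0)
```

Here g'(x) = 1 − f'(x)/M is small but nonzero near the root, so the error is reduced by a fixed factor per step rather than squared; the iteration still converges, only more slowly than Newton's method proper.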

2.5 Correspondence Principles

Out of the four principles outlined above, three aim to turn noncomputable problems into computable ones, typically problems in linear algebra. Discretization is the most important principle; it reduces the information in an analysis problem to a finite set and enables computations to be completed in finite time. Linearization is a tool for approximating a nonlinear problem by a linear one; because the latter is not the same as the original problem, it must be combined with iteration to create a sequence of problems whose solutions converge to the solution of the original problem. Finally, the "principle" of linear algebra and polynomials represents the core methodology, allowing us to approximate and compute in finite time.
On top of this methodology, it is important to understand that analysis problems
are set in the “continuous world” of functions, derivatives and integrals. By contrast,
the algebraic problems are set in the “discrete world” of vectors, polynomials and
linear systems.
There is a strong correspondence between the continuous world and the discrete
world. Both have rich theories that are largely parallel, with similar structure. The
numerical analyst must be well acquainted with both worlds, and be able to move
freely between them, as all numerical methods work with finite data sets but are
intended to solve problems from analysis.
As an example, consider a linear first-order initial value problem

\[
\dot{u} = Au. \tag{2.34}
\]

Its discrete-world counterpart is

\[
v_{k+1} = Bv_k. \tag{2.35}
\]

While the first problem is a differential equation, the second is a difference equation. If we discretize the differential equation (2.34) in a way similar to (2.5), and let v_k denote an approximation to u(t_k), where t_k = kΔt, we get the difference equation

\[
v_{k+1} = (I + \Delta t A)v_k, \tag{2.36}
\]

corresponding to taking B = I + ΔtA. The latter equation links the two equations above. Now, we are often interested in questions such as whether u(t) → 0 as t → ∞. This will happen if all eigenvalues of A are located strictly inside the left half plane. But if we have discretized the system, we are also interested in knowing whether v_k → 0 as k → ∞. This will happen in (2.35) if all eigenvalues of B are located strictly inside the unit circle. Thus we see that the conditions we arrive at in the continuous and discrete worlds are different, although the two theories are largely analogous.
The real problem in numerical analysis, however, is to relate the two. Thus we would like our discretization (2.36) to behave in a way that replicates the behavior of the continuous problem (2.34). As B = I + ΔtA, we can find out whether this is the case. Let x be an eigenvector of A, i.e., Ax = λx. We then have

\[
Bx = (I + \Delta t A)x = x + \Delta t\, Ax = (1 + \Delta t \lambda)x,
\]

so we conclude that the eigenvalues of B are 1 + Δtλ.

Now, if Re λ < 0, will |1 + Δtλ| be less than 1? Interestingly, this puts a condition on Δt, which must be chosen small enough. In a broader context, it implies that not every discretization will do. Thus, in order to find out how to solve a problem successfully, we need to have a strong foundation both in the classical continuous world and in the discrete world, as well as knowing how problems, concepts and theorems from one world correspond to those of the other.
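The condition is easy to observe numerically. A sketch for the scalar case u̇ = λu, where (2.36) reduces to v_{k+1} = (1 + Δtλ)v_k (the values of λ and Δt are illustrative choices):

```python
# Scalar test of the discrete stability condition |1 + dt*lambda| < 1 for
# the recursion v_{k+1} = (1 + dt*lambda) v_k. Illustrative lambda and dt.

def euler_decay(lam, dt, steps):
    v = 1.0
    for _ in range(steps):
        v = (1.0 + dt * lam) * v
    return v

lam = -10.0                                  # Re(lambda) < 0: u(t) -> 0
stable = abs(euler_decay(lam, 0.05, 200))    # |1 - 0.5| = 0.5 < 1: decays
unstable = abs(euler_decay(lam, 0.25, 200))  # |1 - 2.5| = 1.5 > 1: blows up
```

The exact solution decays in both cases; only the discretization with the smaller Δt reproduces that behavior.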
It is of particular importance to learn to recognize the structural similarity of the
continuous and discrete worlds. Above we have seen the similarity and connections
between u̇ = Au and vk+1 = Bvk . The beginner in numerical analysis is probably
better acquainted with the analysis side, but one quickly recognizes how the dis-
crete world works. In fact, for almost every continuous principle, there is a discrete
principle, and vice versa.
For example, the usual scalar product f^T g between two vectors f and g is

\[
f^T g = \sum_{k=0}^{N} f_k g_k.
\]

Is there a "continuous" scalar product? The answer is yes, and we will use it in particular in connection with the finite element method. Thus the corresponding operation in the continuous case is

\[
\langle f, g \rangle = \int_0^1 f(x)\, g(x)\, dx,
\]

where ⟨·, ·⟩ denotes the inner product of two functions. Here we recognize the pointwise product of the two functions, integrated ("summed") over the entire range of the independent variable, x.
Many operations in algebra have similar counterparts in analysis. Another important example is linear systems of equations, Ax = b, which, written in component form, are

\[
\sum_j a_{i,j}\, x_j = b_i.
\]

There are many different types of linear operators in analysis. A direct counterpart to the linear system above is the integral equation

\[
\int_0^1 k(x, y)\, u(y)\, dy = v(x).
\]

Here we see that a function u on [0, 1] is mapped to another function v. Again this
happens by multiplying the function u at the point y by a function of two independent
variables, x and y, and integrating (“summing”) over the entire range of y, leaving
us with a function of the remaining independent variable, x. This operation is often
written, just like in the linear algebra case, as Ku = v. If the operator K and the
function v are given, we need to solve the integral equation Ku = v. Such a problem
can be solved using discretization, which takes us back to a linear algebraic system
of equations.
The simplest example of an integral equation, as well as of a differential equation, is an equation of the form

\[
\dot{x} = f(t) \quad \Rightarrow \quad x(t) = \int_0^t f(\tau)\, d\tau.
\]

Obviously, this is a problem from analysis, and we would approximate its solution in
numerical analysis by using a discretization method. For example, we may choose a
grid Ω and sample the function f on Ω . Because it may in general be impossible to
find an elementary primitive function of f , we convert it to a grid function F. This,
in turn, is reconverted to an approximating polynomial P, with | f (t) − P(t)| ≤ ε on
the entire interval.
Replacing f in the integral by the polynomial P, we are now ready to find a numerical approximation to the integral. We have obtained a computable problem. As primitive functions of polynomials are polynomials, the integral can easily be computed. Thus

\[
x(t) \approx \int_0^t P(\tau)\, d\tau.
\]
In the section on linear algebra and polynomials, we have seen that an interpolation polynomial is obtained by solving a linear system of equations, where the discrete samples F of the continuous function f form the data vector. As a result, the interpolation polynomial is a linear combination of the data samples, i.e., it can be written

\[
P(t) = \sum_k F_k\, \varphi_k(t),
\]

where the basis functions φ_k(t) are also polynomials. Therefore,

\[
\int_0^t P(\tau)\, d\tau = \int_0^t \sum_k F_k\, \varphi_k(\tau)\, d\tau
                        = \sum_k F_k \int_0^t \varphi_k(\tau)\, d\tau.
\]

By introducing the notation w_k = ∫₀ᵗ φ_k(τ) dτ, we arrive at

\[
\int_0^t f(\tau)\, d\tau \approx \int_0^t P(\tau)\, d\tau = \sum_k F_k w_k.
\]

This means that the integral in the continuous world can be approximated by a weighted sum in the discrete world; the latter is easily computable. In fact, we recognize this computation as a simple scalar product of the grid function vector F and the weight vector w.

We further note that if | f(t) − P(t)| ≤ ε, then

\[
\left| \int_0^t \big( f(\tau) - P(\tau) \big)\, d\tau \right|
\le \int_0^t | f(\tau) - P(\tau) |\, d\tau \le \varepsilon t.
\]

This means that not only can we compute an approximation in the discrete world,
but we can also obtain an error bound, provided that we master the interpolation
problem.
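The weights w_k are easily computed from the basis functions; on three equidistant nodes the procedure reproduces the classical Simpson weights (1/6, 4/6, 1/6). A sketch (the basis integrals are evaluated here with a fine midpoint rule, an implementation shortcut rather than exact polynomial integration):

```python
# Quadrature weights w_k = integral of the Lagrange basis polynomial phi_k.
# On nodes {0, 1/2, 1} this reproduces Simpson's rule. The basis integrals are
# approximated with a fine midpoint rule (they could also be done exactly).
import math

def lagrange_weights(nodes, a, b, m=20000):
    h = (b - a) / m

    def phi(k, t):                       # k-th Lagrange basis polynomial
        p = 1.0
        for j, tj in enumerate(nodes):
            if j != k:
                p *= (t - tj) / (nodes[k] - tj)
        return p

    return [h * sum(phi(k, a + (i + 0.5) * h) for i in range(m))
            for k in range(len(nodes))]

nodes = [0.0, 0.5, 1.0]
w = lagrange_weights(nodes, 0.0, 1.0)    # expect [1/6, 4/6, 1/6]
# the integral is then approximated by the weighted sum of the samples F_k
approx = sum(math.exp(t) * wk for t, wk in zip(nodes, w))
```

With f(t) = e^t the weighted sum approximates e − 1 to a few decimal places, exactly the scalar product of the grid function F and the weight vector w described above.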
In a similar way, it is of fundamental importance to master the four basic principles of numerical analysis, the correspondence between the discrete and the continuous worlds, and how well we can approximate solutions to problems in analysis by problems in algebra. In this book, we are going to focus on differential equations, and therefore the principle of discretization is of primary importance. The quality of the results is a matter of two questions: stability (which we have not yet been able to explore) and accuracy, which is usually a matter of how fine we make the discretization.
Naturally there is a trade-off. The finer the discretization, the higher is the com-
putational cost. On the other hand, a finer discretization offers better accuracy. Can
we obtain high accuracy at a low cost? The answer is yes, provided that we can con-
struct stable methods of a high order of convergence. This is not an easy task, but
it is exactly what makes scientific computing an interesting and challenging field of
study.
For reasons of efficiency, as well as for computability, we need to stay away from
infinity. However, accuracy is obtained “near infinity.” We obviously need to strike
a balance.
2.6 Accuracy, residuals and errors


Chapter 3
Differential equations and applications

Differential equations are perhaps the most common type of mathematical model
in science and engineering. Our objective is to give an introduction to the basic
principles of computational methods for differential equations. This entails the con-
struction and analysis of such methods, as well as many other aspects linked to
computing. Thus, one needs to have an understanding of
• the application
• the mathematical model and what it represents
• the mathematical properties of the problem
• what properties a computational method needs to solve the problem
• how to verify method properties
• qualitative discrepancies between numerical and exact solutions
• how to obtain accuracy and estimate errors
• how to interpret computational results.
We are going to consider three different but basic types of problems. The first is
initial value problems (IVP) of the form

ẏ = f (t, y),

with initial condition y(0) = y0 . In general this is a system of equations, and the
dot represents differentiation with respect to the independent variable t, which is
interpreted as "time." Thus the differential operator of interest is

\[
\frac{d}{dt}. \tag{3.1}
\]

The second type of problem is a boundary value problem (BVP) of the form

−u00 = f (x),

with boundary conditions u(0) = u_L and u(1) = u_R. Here prime denotes differentiation with respect to the independent variable x, which is interpreted as "space." The differential operator of interest is now

\[
\frac{d^2}{dx^2}. \tag{3.2}
\]
In connection with BVPs, we will also consider other types of boundary conditions.
Occasionally, we will also consider first order derivatives, i.e., the operator d/d x.
We will find that there are very significant differences in properties and com-
putational techniques between dealing with initial and boundary value problems.
Interestingly, these theories are combined in the third type of problem we will con-
sider: time-dependent partial differential equations (PDE). This is a very large
field, and we limit ourselves to some simple standard problems in a single space
dimension.
Thus, we are going to combine the differential operators (3.1) and (3.2). The
simplest example is the diffusion equation,

\[
\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2},
\]
which requires initial as well as boundary conditions. We will follow standard notation in PDE theory, and rewrite this equation as

\[
u_t = u_{xx},
\]

where the subscript t denotes the partial derivative with respect to time t, and the subscript x denotes the partial derivative with respect to space x. Both initial conditions and boundary conditions are needed to have a well-defined solution, and the solution u is a function of two independent variables, t and x.
Apart from combining ∂/∂t with ∂²/∂x², we shall also be interested in combining it with the first order space derivative ∂/∂x, as in the advection equation,

\[
u_t = u_x.
\]

The two PDEs above have completely different properties. Both are standard model
problems in mathematical physics. The advection equation has wave solutions,
while the diffusion equation is dissipative with strong damping. The two prob-
lems cannot be solved using the same numerical methods, since the methods must
be constructed to match the special properties of each problem.
We will also see that there are many variants of the equations above, and we
will consider various nonlinear versions of the equations, as well as other types of
boundary conditions. In addition, we will consider eigenvalue problems for the dif-
ferential operators mentioned above, and build a comprehensive theory from the
elementary understanding that can be derived from the IVP and BVP ordinary dif-
ferential equations.
3.1 Initial value problems

As mentioned above, the general form of a first order IVP is

ẏ = f (t, y),

with f : R × R^m → R^m, and an initial condition y(0) = y_0. The task is to compute the evolution of the dependent variable, y(t), for t > 0. In practice, the computational problem is always set on a finite (compact) interval, [0, t_end].
Almost all software for initial value problems addresses this problem. This means that one has to provide the function f and the initial condition in order to solve the problem numerically. Most problems of interest have a nonlinear function f, but even in the linear case numerical methods are usually necessary, due to the size of the problem.
A simple example of an IVP is a second-order equation, describing the motion of a pendulum of length L, subject to gravitation, as characterized by the constant g,

\[
\ddot{\varphi} + \frac{g}{L} \sin \varphi = 0, \qquad \varphi(0) = \varphi_0, \quad \dot{\varphi}(0) = \omega_0.
\]

Here φ represents the angle of the pendulum. The problem is nonlinear, since the term sin φ occurs in the equation. Another issue is that the equation is second order, modeling Newtonian mechanics. Since standard software solves first order problems, we need to rewrite the system in first order form by a variable transformation. Thus we introduce the additional variable ω = φ̇, representing the angular velocity. Then we have

\[
\dot{\varphi} = \omega, \qquad \dot{\omega} = -\frac{g}{L} \sin \varphi.
\]
This is a first order system of initial value problems. The general principle is that
any scalar differential equation of order d can be transformed in a similar manner to
a system of d first order differential equations.
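The transformation is mechanical, and once in first order form the system can be fed to any IVP method. A sketch using the forward Euler method (the parameter values, step size and initial data are illustrative; a production code would use a standard IVP solver):

```python
# Pendulum as a first order system, phi' = omega, omega' = -(g/L) sin(phi),
# integrated with forward Euler. Illustrative parameters and step size.
import math

def pendulum_euler(phi0, omega0, g=9.81, L=1.0, dt=1e-4, t_end=1.0):
    phi, omega = phi0, omega0
    for _ in range(int(round(t_end / dt))):
        phi, omega = (phi + dt * omega,
                      omega - dt * (g / L) * math.sin(phi))
    return phi, omega

# small initial angle: the solution stays close to 0.1*cos(sqrt(g/L)*t)
phi, omega = pendulum_euler(phi0=0.1, omega0=0.0)
```

For this small initial angle the computed angle after one time unit agrees closely with the linearized (harmonic oscillator) solution.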
Another nonlinear system of equations is the predator–prey model

\[
\dot{y}_1 = k_1 y_1 - k_2 y_1 y_2, \qquad
\dot{y}_2 = k_3 y_1 y_2 - k_4 y_2,
\]

where y_1 represents the prey population and y_2 the predator species. The coefficients k_i are supposed to be positive, and the variables y_1 and y_2 are non-negative. This is a classical model known as the Lotka–Volterra equation, and it exhibits interesting behavior with periodic solutions. The problem is nonlinear, because it contains the quadratic terms y_1 y_2. The model can be used to describe the interaction of rabbits (y_1) and foxes (y_2).
Let us consider the case where there are no foxes, i.e., y_2 = 0. The problem then reduces to ẏ_1 = k_1 y_1, which has exponentially growing solutions. Thus, without predators, the rabbits multiply and the population grows exponentially. On the other hand, if there are no rabbits (y_1 = 0), the system reduces to ẏ_2 = −k_4 y_2. The solution is then a decaying exponential, going to zero, representing the fact that the foxes starve without a food supply.
If there are both rabbits and foxes, the product term y1 y2 represents the chance
of an encounter between a fox and a rabbit. The term −k2 y1 y2 in the first equation
is negative, representing the fact that the encounter has negative consequences for
the rabbit. Likewise, the positive term k3 y1 y2 in the second equation models the fact
that the same encounter has positive consequences for the fox. When these inter-
actions are accounted for, one gets an interesting dynamical system, with periodic
variations over time in the fox and rabbit populations. The system does not reach an
equilibrium.
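The periodic behavior can be checked numerically. The sketch below is an illustration, not from the text: all coefficients are set to k_i = 1 (so the equilibrium is at y1 = y2 = 1), and a standard fourth-order Runge–Kutta integrator is an illustrative choice. For equal coefficients, the quantity V = y1 − ln y1 + y2 − ln y2 is conserved along exact solutions, which gives a convenient correctness check.

```python
import math

def lotka_volterra(y, k1=1.0, k2=1.0, k3=1.0, k4=1.0):
    # Predator-prey right-hand side; y = (y1, y2) = (prey, predators)
    y1, y2 = y
    return (k1 * y1 - k2 * y1 * y2, k3 * y1 * y2 - k4 * y2)

def rk4_step(f, y, h):
    # One classical fourth-order Runge-Kutta step for a system
    k1 = f(y)
    k2 = f(tuple(yi + 0.5 * h * ki for yi, ki in zip(y, k1)))
    k3 = f(tuple(yi + 0.5 * h * ki for yi, ki in zip(y, k2)))
    k4 = f(tuple(yi + h * ki for yi, ki in zip(y, k3)))
    return tuple(yi + (h / 6.0) * (a + 2 * b + 2 * c + d)
                 for yi, a, b, c, d in zip(y, k1, k2, k3, k4))

def invariant(y):
    # With all coefficients equal to 1, the quantity
    # V = y1 - ln y1 + y2 - ln y2 is constant along exact solutions.
    y1, y2 = y
    return y1 - math.log(y1) + y2 - math.log(y2)

y = (2.0, 1.0)            # start away from the equilibrium (1, 1)
V0 = invariant(y)
h = 0.01
for _ in range(2000):     # integrate to t = 20, several cycles
    y = rk4_step(lotka_volterra, y, h)
drift = abs(invariant(y) - V0)
```

The populations oscillate without ever settling down, and the small drift in V measures how well the numerical method respects the conservation structure of the problem.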
Initial value problems have applications in a large number of areas. Some well-
known examples are
• Mechanics, where Newton’s second law is M q̈ = F(q). Here M is a mass matrix,
and F represents external forces. The dependent variable q represents position,
and its second derivative represents acceleration. Thus mass times acceleration
equals force. This is a second order equation, and it is usually transformed to a
first order system before it is solved numerically. Applications range from celes-
tial mechanics to vehicle dynamics and robotics.
• Electrical circuits, where Kirchhoff’s law is Cv̇ = −I(v), and where C is a capac-
itance matrix, v represents nodal voltages, and I represents currents. Applications
are found in circuit design and VLSI simulation.
• Chemical reaction kinetics, where the mass action law reads ċ = f (c). Here the
vector c represents the concentration of the reactants in a perfectly mixed reaction
vessel, and the function f represents the actual reaction interactions. The same
type of equation is used in biochemistry as well as in chemical process industry.
We note, in the last case, that many other applications lead to equations that are
structurally similar to the reaction equations. Thus the Lotka–Volterra predator-prey
model has a similar structure.
Other examples of this type of equation are epidemiological models, where the
spreading of an infectious disease has a similar dynamical form. The interactions de-
scribe the number of infected people, given that some are susceptible, while others
are infected and some have recovered (or are vaccinated) to be immune to further in-
fections. This is the classical “SIR model,” developed by Kermack and McKendrick
in 1927; it was a breakthrough in the understanding of epidemiology and vaccina-
tion.
Equations such as those above are all important for modeling complex, nonlin-
ear interactions, whose outcome cannot be foreseen by intuitive or analytical tech-
niques.

3.2 Two-point boundary value problems

The two-point boundary value problems (2p-BVP) we will consider are mostly sec-
ond order ordinary differential equations. A classical problem takes the form

−u'' = f(x), (3.3)

on [0, 1], with Dirichlet boundary conditions u(0) = uL and u(1) = uR . Here the
function f is a source term, which only depends on the independent variable x.
Occasionally one of the boundary conditions will be replaced by a Neumann condi-
tion, u'(0) = u'_L. The Neumann condition can also be imposed at the right endpoint
of the interval, but in order to have a unique solution, at least one boundary condition
has to be of Dirichlet type.
The task is to compute (an approximation to) a function u ∈ C2 [0, 1] that satisfies
the boundary conditions and the differential equation on [0, 1]. Because we cannot
compute functions in general, we will have to introduce a discretization, to compute
a discrete approximation to u(x).
The equation above is a one-dimensional version of the Poisson equation. Un-
derstanding this 2p-BVP is key to understanding the Laplace and Poisson equations
in 2D or 3D. The latter cases are certainly more complicated, but many basic prop-
erties concerning solvability and error bounds are built on similar theories.
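A minimal sketch of such a discretization, assuming the standard second-order central-difference scheme on a uniform grid and a Thomas-algorithm tridiagonal solve (the grid size and the manufactured test problem are illustrative choices, not from the text):

```python
import math

def solve_poisson_1d(f, uL, uR, n):
    # Solve -u'' = f on [0,1] with u(0)=uL, u(1)=uR at n interior points,
    # using second-order central differences. The tridiagonal system with
    # stencil (-1, 2, -1)/h^2 is solved by the Thomas algorithm.
    h = 1.0 / (n + 1)
    x = [(i + 1) * h for i in range(n)]
    b = [f(xi) * h * h for xi in x]
    b[0] += uL                        # fold boundary values into the rhs
    b[-1] += uR
    d = [2.0] * n
    for i in range(1, n):             # forward elimination
        d[i] -= 1.0 / d[i - 1]
        b[i] += b[i - 1] / d[i - 1]
    u = [0.0] * n
    u[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):    # back substitution
        u[i] = (b[i] + u[i + 1]) / d[i]
    return x, u

# Manufactured test: f(x) = pi^2 sin(pi x) has exact solution u(x) = sin(pi x)
x, u = solve_poisson_1d(lambda s: math.pi ** 2 * math.sin(math.pi * s), 0.0, 0.0, 99)
err = max(abs(ui - math.sin(math.pi * xi)) for xi, ui in zip(x, u))
```

Because the exact solution is known here, the maximum error can be measured directly; halving h should reduce it by roughly a factor of four, consistent with second-order accuracy.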
An example of a fourth order problem is the beam equation

−M'' = f(x)
−u'' = M/(EI),
where M represents the bending moment of a slender beam supported at both ends,
subject to a distributed load f (x) on [0, 1]. If the supports do not sustain any bending
moment, the boundary conditions for the first equation are of homogeneous Dirich-
let type, M(0) = M(1) = 0. In the second equation, which is structurally identical
to the first, u represents the deflection of the beam, under the bending moment M.
Here, too, the boundary conditions are Dirichlet conditions, u(0) = u(1) = 0. Fi-
nally, E is a material constant and I (which may depend on x) is the cross-section
moment of inertia of the beam, which depends on the shape of the beam’s cross
section.
This is a linear problem, as the dependent variables M and u only enter linearly.
Many solvers for 2p-BVPs are written for second order problems, and often de-
signed especially for Poisson-type equations. In this particular example, one first
solves the moment equation, and once the moment has been computed, it is used as
data for solving the second, deflection equation.
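The two-step procedure can be sketched by calling the same tridiagonal Poisson-type solver twice. This is a simplified illustration, not from the text: the load f ≡ 1 and EI = 1 are arbitrary choices, for which the midpoint deflection of a simply supported uniform beam is known to be 5/384.

```python
def solve_dirichlet(rhs, n):
    # Solve -w'' = rhs on [0,1] with w(0) = w(1) = 0, where rhs holds the
    # n interior grid values; central differences + Thomas algorithm.
    h = 1.0 / (n + 1)
    b = [r * h * h for r in rhs]
    d = [2.0] * n
    for i in range(1, n):             # forward elimination
        d[i] -= 1.0 / d[i - 1]
        b[i] += b[i - 1] / d[i - 1]
    w = [0.0] * n
    w[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):    # back substitution
        w[i] = (b[i] + w[i + 1]) / d[i]
    return w

n = 199                                       # interior points; x_i = (i+1)/(n+1)
EI = 1.0                                      # bending stiffness, set to 1 here
M = solve_dirichlet([1.0] * n, n)             # step 1: -M'' = f with f(x) = 1
u = solve_dirichlet([m / EI for m in M], n)   # step 2: -u'' = M / EI
mid_deflection = u[n // 2]                    # x = 0.5; exact value is 5/384
```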
Whether such an approach is possible or not depends on the boundary conditions.
Thus, there is a variant of the beam equation, where the beam is “clamped” at both
ends. This means that the supports do sustain bending moments, and that the bound-

ary conditions take the form u'(0) = u'(1) = 0 and u(0) = u(1) = 0. In this case,
one cannot solve the problem in two steps, because the problem is a genuine fourth
order equation, known as the biharmonic equation. If I is constant, the equation is
equivalent to
u'''' = f(x)/(EI),
and will require its own dedicated numerical solver.
In connection with 2p-BVPs, we will also consider problems containing first
order derivatives, e.g.
−u'' + u' + u = f(x),
or nonlinear problems, such as

−u'' + uu' = f(x),

or
−u'' + u = f(u).
In the last case, the function f is a function of the solution u, not just a source
term. The structure of these equations will become clear in connection with PDEs,
as these operators will correspond to the spatial operators of time-dependent PDEs.
Another very important, and distinct, type of BVP is eigenvalue problems for
differential operators. The simplest example is

−u'' = λ u, (3.4)

with homogeneous boundary conditions u(0) = u(1) = 0. Since an algebraic eigen-
value problem is usually written Au = λ u, where λ corresponds to the eigenvalues
of the matrix A, we see that (3.4) is in fact an “analytical” eigenvalue problem for
the differential operator −d²/dx². Problems of this type are usually referred to as
Sturm–Liouville problems. Here we will construct discretization methods that ap-
proximate this analysis problem by a linear algebra problem; in other words, we will
be able to approximate the eigenvalues of (3.4) by solving a linear algebra eigen-
value problem.
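As a sketch of this idea (the grid size and the use of inverse power iteration as eigensolver are illustrative choices, not from the text): for (3.4) the eigenfunctions are sin kπx with eigenvalues (kπ)², so the smallest eigenvalue of the standard difference matrix should approach π² ≈ 9.87.

```python
import math

def smallest_eigenvalue(n, iterations=50):
    # Approximate the smallest eigenvalue of -u'' = lambda u, u(0)=u(1)=0,
    # by inverse power iteration on the matrix (1/h^2) * tridiag(-1, 2, -1).
    h = 1.0 / (n + 1)

    def solve(b):
        # Thomas algorithm for tridiag(-1, 2, -1) x = b
        d = [2.0] * n
        c = list(b)
        for i in range(1, n):
            d[i] -= 1.0 / d[i - 1]
            c[i] += c[i - 1] / d[i - 1]
        x = [0.0] * n
        x[-1] = c[-1] / d[-1]
        for i in range(n - 2, -1, -1):
            x[i] = (c[i] + x[i + 1]) / d[i]
        return x

    v = [1.0] * n
    for _ in range(iterations):
        w = solve(v)                  # one inverse iteration step
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    # Rayleigh quotient v^T A v with A = tridiag(-1, 2, -1)/h^2 and |v| = 1
    Av = [2.0 * v[i]
          - (v[i - 1] if i > 0 else 0.0)
          - (v[i + 1] if i < n - 1 else 0.0) for i in range(n)]
    return sum(a * b for a, b in zip(Av, v)) / (h * h)

lam = smallest_eigenvalue(99)         # exact value: pi^2 ~ 9.8696
```

The converged iteration vector approximates the first eigenfunction sin πx on the grid, so the discretization turns the analytical Sturm–Liouville problem into an ordinary matrix eigenvalue problem, exactly as described above.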
Two-point boundary value problems occur in a large number of applications. A
short list of examples includes
• Materials science and structural analysis, as exemplified by the beam equation
above.
• Microphysics, as exemplified by the (stationary) Schrödinger equation

−(h̄²/2m) ψ'' + V(x)ψ = Eψ,
where ψ is the wave function, V (x) is a “potential well,” and E is the energy of a
particle in the state ψ.

• Eigenmode analysis, such as in −u'' = λ u, which may describe buckling of
structures, or eigenfrequency oscillation modes of e.g. a bridge or a musical
instrument.
The last two examples are Sturm–Liouville problems, which in many cases are
directly related to Fourier analysis. Thus, we find new links between computational
methods and advanced analysis.
One of the great insights of applied mathematics is that there are applications
in vastly differing areas that still share the same structure of their equations. For
example, it is by no means obvious that buckling problems, first solved by Euler
in the middle of the 18th century, satisfy the same (or nearly the same) equation as
Schrödinger’s “particle-in-box” problems of the 20th century, or could be used in the
design of musical instruments. The applications in macroscopic material science and
in quantum mechanics appear to have nothing in common, but mathematics tells
us otherwise. This makes it possible to develop common techniques in scientific
computing, with only minor changes in details. The overall methodology will still
be the same, having an impact in all of these areas.

3.3 Partial Differential Equations

Partial differential equations are characterized by having two or more independent
variables. Here we shall limit ourselves to the simplest case, where the two inde-
pendent variables are time and space (in 1D). This is simply the combination of
initial and boundary value problems, without having to approach the difficulty of
representing geometry in space.
While IVPs and BVPs have their own special interests, it is in PDEs that differen-
tial equations and scientific computing get really exciting. The particular difficulties
of initial and boundary value problems are present simultaneously, and conspire to
create a whole new range of difficulties, often far harder than one would have ex-
pected.
As mentioned above, we are mainly going to consider two equations, the parabolic
diffusion equation
ut = uxx , (3.5)
and the hyperbolic advection equation,

ut = ux . (3.6)

The latter is also often referred to as the convection equation. The space operators
can be combined, and the equation

ut = uxx + ux (3.7)

is usually referred to as the convection–diffusion equation. It has two differen-
tial operators in space, ∂²/∂x² (giving rise to diffusion), and ∂/∂x (creating con-
vection/advection). Convection and advection are transport phenomena, usually
associated with some material flow, like in hydrodynamics or gas dynamics, and
model wave propagation. By contrast, diffusion does not require a mass flow, and
(3.5) is the standard model for heat transfer. The combined equation, possibly in-
cluding further terms, is the simplest model equation for problems in heat and mass
transfer.
We have previously seen that in IVPs, equations of the form u̇ = f (u) can be used
to model chemical reactions. For this reason, if we add such a term to the diffusion
equation (3.5), we obtain
ut = uxx + f (u),
known as the reaction–diffusion equation. Here f depends on the dependent vari-
able u, but in case it would only depend on the independent variable x, it is a source
term, and the equation becomes

ut = uxx + f (x).

This is a plain diffusion equation with source term. Many time-dependent problems
involving diffusion have stationary solutions. This means that after a long time,
equilibrium is reached, and there is no longer any time evolution. The stationary
solution is independent of time, implying that ut = 0. In the last equation, the equi-
librium state therefore satisfies

0 = uxx + f (x).

We already encountered this equation in the context of 2p-BVPs. Thus

−u00 = f (x),

is the Poisson equation (3.3). While the time dependent problems considered above
are either parabolic (both ut and uxx are present) or hyperbolic (ut is present, but
the highest order derivative in space is ux ), the stationary equation is different; the
Poisson equation is elliptic. We will later have a closer look at the classification of
these equations. For the time being, let us just note that there are elliptic, parabolic
and hyperbolic equations, and that the 2p-BVP (3.3) represents elliptic problems,
so long as there is only one space dimension.
With the exception of special equations such as Poisson’s, the names of the equa-
tions we consider usually only list the terms included in the right hand side, pro-
vided that the left hand side only contains the time derivative ut . For example, by in-
cluding the first derivative ux in the diffusion equation, we obtained the convection–
diffusion equation. Likewise, if we also include the reaction term f (u), we get

ut = uxx + ux + f (u),
3.3 Partial Differential Equations 49

known as the convection–diffusion–reaction equation. Such names are practical, as
the different terms often put different requirements on the solution techniques. Thus
the name of the equation tells us what difficulties we expect to encounter.
Elliptic problems have characteristics that are similar to plain 2p-BVPs. Parabolic
and hyperbolic equations, on the other hand, are both time- and space-dependent,
and are more complicated. Parabolic equations are perhaps somewhat simpler, while
hyperbolic equations are particularly difficult because they represent conservation
laws. They have wave solutions without damping, and one of the main difficulties is
to create discretization methods that replicate the conservation properties. As very
few numerical methods conserve energy, qualitative differences between the exact
and numerical solutions soon become obvious.
There are also second order equations in time, with the most obvious example
being the classical wave equation,

utt = uxx .

This equation is closely related to the advection equation, ut = ux , and both equa-
tions are hyperbolic. The most interesting problems, however, are nonlinear. A fa-
mous test problem is the inviscid Burgers equation,

ut + uux = 0,

which again is a hyperbolic conservation law with wave solutions. However, due to
the nonlinearity, this equation may develop discontinuous solutions, corresponding
to shock waves (cf. “sonic booms”). Such solutions are extremely difficult to ap-
proximate, and we shall have to introduce a new notion, of weak solutions. These
do not have to be differentiable at all points in time and space, and therefore only
satisfy the equation in a generalized sense.
The situation is somewhat relaxed if diffusion is also present, such as in the
viscous Burgers equation,
ut + uux = uxx .
The diffusion term on the right will then introduce dissipation, representing vis-
cosity, meaning that wave energy will dissipate over distance. (The inviscid Burg-
ers equation has no viscosity term.) The solution may initially be discontinuous,
but over time it becomes increasingly regular and eventually becomes smooth. The
viscous Burgers equation is a simple model in several applications, for example
modeling seismic waves. Because of the presence of the diffusion term, this equa-
tion is no longer hyperbolic but parabolic. However, if the diffusion is weak, wave
phenomena will still be observed over a considerable range in time and space. For
this reason, the viscous Burgers equation is sometimes referred to by the oxymoron
parabolic wave equation.
PDEs is an extremely rich field of study, with applications e.g. in materials
science, fluid flow, electromagnetics, potential problems, field theory, and multi-
physics. Multiphysics is any area that combines two or more fields of application,

as have been suggested by the many combinations of terms above. One example
is fluid–structure interaction, such as air flow over an elastic wing, or blood flow
inside the heart. Because of the different nature of the equations, there is typically
a need for dedicated numerical methods, even though many components of such
methods are common for special classes of problems. The area is a highly active field
of research.
From the vast variety of applications, it is clear that some difficulties will be en-
countered in the numerical solution of differential equations. Although basic prin-
ciples covering e.g. discretization and polynomial approximation apply, it is still
necessary to construct methods with a special focus on the class of problems we
want to address. Even so, the basic principles are few, and the challenge is to find
the right combination of techniques in each separate case.
Chapter 4
Summary: Objectives and methodology

One of the pioneers in scientific computing, Peter D Lax, has said that

“The computer is to mathematics what the telescope
is to astronomy, and the microscope is to biology.”

Thus ever faster computers, with ever larger memory capacity, allow us to consider
more and more advanced mathematical models, and it is possible to explore by nu-
merical simulation how complex mathematical models behave. Naturally, this leads
to large-scale computations that cannot be carried out by analytical methods. The
role of scientific computing is to bridge the gap between mathematics proper and
computable approximations.
Although our focus is on differential equations, the previous chapters have intro-
duced a variety of basic principles in scientific computing, and outlined why they
are needed in order to construct approximate solutions to mathematical problems. In
particular, we noted that only problems in linear algebra are computable, in the sense
that an exact solution can be constructed in a finite number of steps. By contrast,
nonlinear problems, and problems from analysis (such as differential equations) can
only be solved by approximate methods.

The purpose of scientific computing is


• to construct and analyze methods for solving specific problems in applied
mathematics
• to construct software implementing such methods for solving applied mathe-
matics problems on a computer.

Scientific computing is a separate field of study because conventional, analytical


mathematical techniques have a very limited range. Few problems of interest can be
solved analytically. Instead, we must use approximate methods. This can be justified
by taking great care to prove that the computational methods are convergent, and
therefore (at least in principle) able to find approximate solutions to any prescribed
accuracy. This extends the analytical techniques by a systematic use of numerical


computing, which, from the scientific point of view, is subject to no less rigorous
standards than conventional mathematics.

The methodology of scientific computing is


• based on conventional mathematics, with the usual proof requirements
• built on classical results in continuous as well as discrete mathematics.

A few principles stand out as the main building blocks in all computational meth-
ods. These are
• Standard techniques in linear algebra
• A systematic use of polynomials, including trigonometric polynomials
• The principle of discretization, which brings a problem from analysis to a prob-
lem in algebra
• The principle of linearization, which approximates a nonlinear problem locally
by a linear problem
• The principle of iteration, which constructs a sequence of approximations con-
verging to the final result.

All numerical methods are constructed by using elements of these principles. This
does not mean that the methods are similar, or that a “trick” that works in one prob-
lem context also works in another. In fact, the variety of techniques used is very
large; without recognizing the basic principles, one can easily be overwhelmed by
the technicalities and miss the broader patterns that are characteristic of a general
approach.
On top of the approximation techniques mentioned above, all computations are
carried out in finite precision arithmetic, usually defined by the IEEE 754 stan-
dard. This will lead to roundoff errors. It is a common misunderstanding about sci-
entific computing that the results are erroneous due to roundoff. However, most
computational algorithms are little affected by roundoff, as other errors, such as dis-
cretization errors, linearization errors and iteration errors typically dominate. The
only problems that are truly affected by roundoff are problems from linear algebra
solved by finite algorithms. Such problems are unaffected by discretization, lin-
earization and iteration errors, meaning that only roundoff remains.
Apart from computing approximate solutions, scientific computing is also con-
cerned with error bounds and error estimates. Thus we are not only concerned
with obtaining an approximation, but also the accuracy of the computation. Finally
we are interested in obtaining these results reasonably fast. In order of importance,
computational methods must be
1. Stable
2. Accurate
3. Reliable
4. Efficient
5. Easy to use.

Stability is priority number one, as unstable algorithms are unable to obtain any
accuracy at all; instability will ruin the results. If stable algorithms can be devised, we
are interested in obtaining accuracy. Accurate algorithms should also be reliable
and robust, and be able to solve broad classes of problems; high accuracy on a few
selected test problems is not good enough. As part of the reliability, we usually also
want reliable error estimates, indicating the accuracy obtained.
Once these criteria have been met, we want to address efficiency. Efficiency is
a very broad issue, ranging from data structures allowing efficient memory use,
to adaptivity, which means that the software “tunes” itself automatically to the
mathematical properties of the problem at hand. Finally, when efficient software
has been constructed, we want ease of use, meaning that various types of interfaces
are needed to set up problems (entering or generating data, e.g. the geometry of
the computational domain), as well as postprocessing tools, such as visualization
software.
In this introduction to scientific computing, we focus less on a rigorous mathe-
matical treatment of method construction and analysis, and more on an understand-
ing of elementary computational methods, and how they interact with standard prob-
lems in applied mathematics. This entails understanding what the original equations
represent, meaning that we will emphasize the links from physics, via mathematics,
to methods and actual computational results.
These will also be investigated in model implementations. We need to understand
the basic mathematics of the problems, the properties of our computational methods,
and how to assess their performance. Assessing performance is a difficult matter.
It is invariably done by trying out the method on various well-known test problems
and benchmarks, but the assessment is never complete. Ideally, we want to infer
some general conclusions, but have to keep in mind that any numerical test only
yields results for that particular problem.
Since we are going to use standard test problems and benchmarks, which may
often have a known analytical solution, we work in an idealized setting. In real
computational problems, the exact solution is never known. However, unless we
can demonstrate that standard benchmarks can be solved correctly, with both stabil-
ity and accuracy, there is little hope that the method would work for the advanced
problems.
Many of our benchmarks will use simple tools of visualization. For example,
we typically want to demonstrate that an implementation is correct by verifying its
theoretical convergence order. This is usually done in graphs, like those used in the
section on discretization above. There we saw use of lin-lin graphs, lin-log graphs
and log-log graphs. This may appear to be a simple remark, but part of the skill
in scientific computing is in visualizing the results in a proper way. Therefore it is
important to master these techniques, and carefully try out the best tools in each sit-
uation. As we have remarked previously, log-log diagrams are used for power laws.
Likewise, lin-log diagrams are preferred for exponentials, but there are numerous
other situations where scaling is key to revealing the relevant information.
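As one illustrative way to automate the log-log convergence check (the forward-difference test problem below is a stand-in of my choosing, not from the text): compute the errors for a sequence of step sizes, and fit the slope of log(error) against log(h) by least squares. For a first-order method the fitted slope, i.e. the observed order, should be close to 1.

```python
import math

def fitted_order(hs, errors):
    # Least-squares slope of log(error) against log(h):
    # the observed order of convergence (power-law exponent).
    lh = [math.log(h) for h in hs]
    le = [math.log(e) for e in errors]
    n = len(hs)
    mh = sum(lh) / n
    me = sum(le) / n
    return (sum((a - mh) * (b - me) for a, b in zip(lh, le))
            / sum((a - mh) ** 2 for a in lh))

# Forward difference approximation of f'(x) is first-order accurate:
hs = [0.1 / 2 ** k for k in range(6)]
errors = [abs((math.sin(1.0 + h) - math.sin(1.0)) / h - math.cos(1.0)) for h in hs]
p = fitted_order(hs, errors)   # observed order, should be close to 1
```

In a log-log plot, these error points fall on a straight line whose slope is exactly the quantity p computed here; the fit simply replaces reading the slope off the graph by eye.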
Part II
Initial Value Problems
Chapter 5
First Order Initial Value Problems

Initial value problems occur in a large number of applications, from rigid-body me-
chanics, via electrical circuits to chemical reactions. They describe the time evo-
lution of some process, and the task is usually to predict how a given system will
evolve. Most standard software is written for initial value problems of the form

ẏ = f (t, y); y(0) = y0 , (5.1)

where f : R × R^d → R^d is a given function, and where the initial condition y0 is
also given.
Before constructing computational methods, it is often a good idea to verify,
through mathematical means, whether there exists a solution and whether the so-
lution is unique. A somewhat more ambitious approach is to verify that the initial
value problem is well posed. This means that we need to demonstrate that there is a
unique solution, which depends continuously on the data. This usually means that
a small change in the initial value, or a small change in a forcing function, will only
have a minor effect on the solution to the problem.
While such a theory exists for initial value problems in ordinary differential equa-
tions, much less is known for partial differential equations. Fortunately, the
theory of initial value problems is quite comprehensive compared to other areas in
differential equations.

5.1 Existence and uniqueness

The classical result on existence and uniqueness is built on the continuity properties
of the function f .

Definition 5.1. Let f : R × R^d → R^d be a given function, and define its Lipschitz
constant with respect to the second argument by


L[f] = sup_{u ≠ v} ‖f(t, u) − f(t, v)‖ / ‖u − v‖. (5.2)

Obviously, the Lipschitz constant depends on t, and we shall also assume that the
function f is continuous with respect to time t. The classical existence and unique-
ness result on (5.1) is the following.

Theorem 5.1. Let f : R × R^d → R^d be a given function, and assume that it is con-
tinuous with respect to time t, and that it is Lipschitz continuous with respect to y,
with L[f] < ∞. Then there is a unique solution to (5.1) for t ≥ 0.

If f is a linear, constant coefficient system, i.e., ẏ = Ay, then

L[f] = sup_{u ≠ v} ‖Au − Av‖ / ‖u − v‖ = sup_{x ≠ 0} ‖Ax‖ / ‖x‖ = ‖A‖, (5.3)

where ‖A‖ is the norm of the matrix A induced by the vector norm ‖·‖. Thus,
for a linear map A, the Lipschitz constant is just the operator norm. But nonlinear
functions are harder to deal with. In general, very few nonlinear maps satisfy a
Lipschitz condition on all of R^d. If f satisfies such a condition on a bounded domain
D, then one can guarantee the existence of a solution up to the point where the
solution y reaches the boundary of D. A classical example is the following.

Example Consider the IVP

ẏ = y2 ; y(0) = y0 > 0,

with analytical solution

y(t) = y0 / (1 − t y0).
The function f (t, y) = y2 obviously does not satisfy a Lipschitz condition on all of R, but
only on a finite domain. In fact, inspecting the solution, we see that the solution remains
bounded only up to t = 1/y0 , when the solution “blows up.” Initial value problems having
this property are said to have “finite escape time.” Thus, no matter how large the region D
is where the Lipschitz condition holds, the solution will reach the boundary of D in finite
time and escape.
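The blow-up can be observed numerically; the sketch below (the step size and the Runge–Kutta integrator are illustrative choices, not from the text) tracks the exact solution up to t = 0.9, where y(t) = 1/(1 − t) has already grown to 10.

```python
def rk4_step(f, y, h):
    # One classical fourth-order Runge-Kutta step for a scalar y' = f(y)
    k1 = f(y)
    k2 = f(y + 0.5 * h * k1)
    k3 = f(y + 0.5 * h * k2)
    k4 = f(y + h * k3)
    return y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

f = lambda y: y * y      # right-hand side of y' = y^2
y = 1.0                  # y0 = 1, so the escape time is 1/y0 = 1
h = 1e-4
for _ in range(9000):    # integrate to t = 0.9, close to blow-up
    y = rk4_step(f, y, h)
exact = 1.0 / (1.0 - 0.9)   # analytical solution at t = 0.9
```

Pushing the integration past t = 1 fails no matter how small the step size is chosen; the numerical method cannot continue a solution that ceases to exist.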

There are also other examples where the Lipschitz condition does not hold.

Example Consider the IVP



ẏ = −√(−y); y(0) = 0.

This problem does not have a unique solution. The solution may be chosen as y(t) = 0 on
the interval t ∈ [0, 2τ], with τ ≥ 0 an arbitrary number, followed by y(t) = −(t/2 − τ)2 for
t > 2τ. While this might seem like a contrived problem, it can actually arise, largely due
to poor mathematical modeling. Thus, assume that we want to model the free fall of a
particle of mass m. Its potential energy is U = mgy, where g is the gravitational constant,
and where the kinetic energy is T = mẏ2 /2. The total energy is the sum of the two, i.e.,
5.1 Existence and uniqueness 5

E = mgy + mẏ²/2.
Let us assume (a matter of normalization) that E = 0. Then, obviously

ẏ² = −2gy,

and, because y ≤ 0, ẏ = −√(−2gy). To obtain the original equation, let us choose units so
that g = 1/2. Then ẏ = −√(−y).
The model is, at least in principle, correct. However, it is “unfortunate”, since we choose the
initial condition y(0) = 0 and the constant E = 0. Although this choice may seem natural,
it just happens to be at the singularity of the right-hand side function f of the differential
equation, and we do not necessarily get the expected solution, y(t) = −t 2 /4.
Thus, when f(y) = −√(−y), we have f'(y) = 1/(2√(−y)). Hence f'(y) is not defined at
the initial value y = 0, and since the Lipschitz constant L[f] ≥ max |f'(y)|, the Lipschitz
condition is not satisfied at the starting point either. Therefore, we are not guaranteed a
unique solution. What this example shows is that even if one uses “sensible” mathematical
modeling, one may end up with a poor mathematical description, lacking unique solutions.
This appears to be less understood; even so, one needs to be aware that not all models, and
not all normalizations of coordinate systems and initial values, will lead to proper models.

Example A variant of the same problem is obtained by modeling a leaky bucket. Assuming
that a cylindrical bucket initially contains water up to a level y(0) = y0 and that the water
runs out of a hole in the bottom, we apply Bernoulli’s law, according to which the flow
rate v out of the hole satisfies v2 ∼ p, where p is the pressure at the opening. Because of the
cylindrical shape of the bucket, the pressure is proportional to the level y of water. Likewise,
if the flow rate is v, then ẏ ∼ v. Dropping all proportionality constants, it follows that ẏ² = y,
so that

ẏ = −√y; y(0) = y0.
Even though this equation is similar to the former, we do not face the same problem, since
we start at a different initial condition. The problem can be solved analytically, and
y(t) = (√y0 − t/2)².

Thus we find that the bucket will be empty (y = 0) at time t = 2√y0 (where, again, propor-
tionality constants have been neglected).
This problem has an interesting connection to the former. The previous problem, model-
ing the free fall of a mass, leads to essentially the same equation. While the free fall problem
did not have a unique solution, the leaky bucket problem does. However, the relation be-
tween the two problems is that the free fall problem is equivalent to the leaky bucket prob-
lem in reverse time. Thus, if the bucket is currently empty, we cannot answer the question
of when it was last full. Such a question is not well posed. This is evident in the leaky
bucket problem, but less so in the free fall problem.
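Integrated forward in time, the bucket problem is well behaved, and a numerical solution can be checked against the closed-form level. The sketch below uses y0 = 1 (so the bucket empties at t = 2); the integrator and step size are illustrative choices, not from the text.

```python
import math

def rk4_step(f, y, h):
    # One classical fourth-order Runge-Kutta step for a scalar y' = f(y)
    k1 = f(y)
    k2 = f(y + 0.5 * h * k1)
    k3 = f(y + 0.5 * h * k2)
    k4 = f(y + h * k3)
    return y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# y' = -sqrt(y); clamping at zero keeps intermediate stages in the domain
f = lambda y: -math.sqrt(max(y, 0.0))

y = 1.0                   # initial water level y0 = 1
h = 1e-3
for _ in range(1500):     # integrate to t = 1.5 (bucket empties at t = 2)
    y = rk4_step(f, y, h)
exact = (math.sqrt(1.0) - 1.5 / 2.0) ** 2   # closed-form level, 0.0625
```

Running the same model backward from an empty bucket is where the non-uniqueness bites: the computation has nothing to latch onto, mirroring the ill-posedness discussed above.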

As the examples above demonstrate, questions of existence, uniqueness and well-


posedness are of great importance not only to mathematics, but also to the applied
mathematician and the numerical analyst. In general it is of importance to obtain a
good understanding of whether the problems we want to solve have unique solutions
that depend continuously on the data. If this is not the case, we can hardly expect
our computational methods to be successful.

Even so, it is often impossible to verify that Lipschitz conditions are satisfied on
sufficiently large domains. In fact, this is often neglected in practice, but one must
then be aware of the risk of the odd surprise. Failures to solve problems are not
uncommon, in which case it is also important to understand why there is a failure. Is it because
of problem properties, or because of an unsuitable choice of computational method?
Mastering theory is never a waste of time. Nothing is as practical as a good theory.

5.2 Stability

Stability is a key concept in all types of differential equations. It will appear nu-
merous times in this book. It will refer to the stability of the original mathematical
problem, as well as to the stability of the discrete problem, and, which is more com-
plex, to the numerical method itself. By and large, these concepts are broadly re-
lated in the following way: if the mathematical problem is stable, then the discrete
problem will be stable as well, provided that the numerical method is stable. This re-
lation will, however, not hold without qualifications. For this reason, we will see that
stability plays the central role in the numerical solution of differential equations.
For initial value problems, stability refers to the question whether two neighbor-
ing solutions will remain close when t → ∞. Here our original IVP is

dy/dt = f(t, y); y(0) = y0,
and we will consider a perturbed problem
d(y + Δy)/dt = f(t, y + Δy); Δy(0) = Δy0.
It then follows that the perturbation satisfies
dΔy/dt = f(t, y + Δy) − f(t, y); Δy(0) = Δy0.
In classical Lyapunov stability theory, we ask whether for every ε > 0 there is a
δ > 0 such that
k∆ y0 k ≤ δ ⇒ k∆ y(t)k ≤ ε,
for all t > 0. If this holds, the solution to the IVP obviously depends continuously
on the data, i.e., on the initial value. The interpretation is that, even if we perturb the
initial value by a small amount, the perturbed solution will never deviate by more
than a small amount from the original solution, even as t → ∞.
We then say that the solution y(t) is stable. If, in addition, k∆y(t)k → 0 as t → ∞,
then we say that y(t) is asymptotically stable. And if k∆y(t)k ≤ K · e^{−αt} for some
positive constants K and α, then y(t) is exponentially stable. There are further
qualifications that offer additional stability notions.
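The stability notions above can be illustrated numerically. The following sketch (a hypothetical example, not from the text) integrates two neighboring solutions of y′ = −y with the explicit Euler method and tracks the perturbation; since this problem is asymptotically stable, the deviation never exceeds its initial size and decays to zero:

```python
# Sketch: two neighboring solutions of y' = -y, integrated with a crude
# explicit Euler method; the perturbation decays, illustrating asymptotic
# stability.
def euler(f, y0, h, n):
    ys = [y0]
    for k in range(n):
        ys.append(ys[-1] + h * f(k * h, ys[-1]))
    return ys

f = lambda t, y: -y
a = euler(f, 1.0, 0.01, 1000)       # unperturbed solution
b = euler(f, 1.001, 0.01, 1000)     # perturbed initial value, dy0 = 0.001
dev = [abs(x - y) for x, y in zip(a, b)]
print(dev[0], max(dev), dev[-1])    # perturbation never grows
```

The same experiment with y′ = +y would show the perturbation growing exponentially, so closeness of initial values alone guarantees nothing without stability.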
Chapter 6
Stability theory

Because stability plays such a central role in the numerical treatment of differen-
tial equations, we shall devote a chapter to lay down some foundations of stability
theory. The details vary between initial value problems and boundary value prob-
lems; between differential equations and their discretizations (usually some form of
difference equations); and between linear systems and nonlinear systems. However,
the stability notions have something in common: the solutions to our problem must
have a continuous dependence on the data, even though what constitutes “data”
also varies.
The continuous data dependence would not be such a special issue if it did not
also include some transfinite process. Thus we are interested in how solutions be-
have as t → ∞ in the initial value case, or in how a discretization behaves as the
number of discretization points N → ∞. As long as we study ordinary differential
equations, we are interested in single parameter limits, but in partial differential
equations we face multiparametric limits, which makes stability an intricate matter.
As we cannot deal with all these issues at once, this chapter will focus on the im-
mediate needs for first order initial value problems, but we will nevertheless develop
tools that allow extensions also to other applications.

6.1 Linear stability theory

By considering a linear, constant coefficient problem, the stability notions become
clearer, since linear systems have a much simpler behavior. Thus, if

    dy/dt = Ay;    y(0) = y0

and

    d(y + ∆y)/dt = A(y + ∆y);    ∆y(0) = ∆y0 ,

it follows that

    d∆y/dt = A ∆y;    ∆y(0) = ∆y0 .
We note that the perturbed differential equation is identical to the original problem.
Thus we may consider the original linear system directly, and whether it has solu-
tions depending continuously on y0 . In particular, no matter how we choose y0 , the
solution y(t) must remain bounded. Since y = 0 is a solution of the problem (for
y0 = 0), we say that the zero solution is stable if y(t) remains bounded for all t > 0.
But as this applies to every solution, we can speak of a stable linear system.
We begin by considering a linear system with constant coefficients,

ẏ = Ay; y(0) = y0 .

Does the solution grow or decay? The exact solution is

    y(t) = e^{tA} y0 ,

and it follows that ky(t)k ≤ ke^{tA}k ky0 k. Therefore, ky(t)k → 0 as ky0 k → 0, provided
that, for all t ≥ 0, it holds that

    ke^{tA}k ≤ C.    (6.1)

This is the crucial stability condition, and the question is: for what matrices A is e^{tA}
bounded?
The answer is well known, and is formulated in terms of the eigenvalues of A.
Let us therefore make a brief excursion into eigenvalue theory, which will play a
central role not only for initial value problems but also in boundary value problems
and partial differential equations.

Definition 6.1. The set of all eigenvalues of A ∈ Rd×d will be denoted by

    λ[A] = {λ ∈ C : Au = λu for some u ≠ 0},

and is referred to as the spectrum of A. Whenever the eigenvalues and eigenvectors
are numbered, we write

    Auk = λk [A] · uk

for k = 1 : d, where uk is the kth eigenvector, associated with the kth eigenvalue λk [A].

For A ∈ Rd×d there are always d eigenvalues. If the eigenvalues are distinct, then
there are also d linearly independent eigenvectors, and the matrix can be diagonal-
ized. Thus, writing the eigenproblem

AU = UΛ ,

with Λ = diag λk , and the eigenvectors arranged as the d columns of the d ×d matrix
U, we have U −1 AU = Λ .
Now, if Au = λu, it follows that A²u = λAu = λ²u, and in general,

    λk [A^p] = (λk [A])^p ,

for every power p ≥ 0. This motivates the notation λ[A^p] = λ^p[A]. From this it
follows that if P is any polynomial, we have λ[P(A)] = P(λ[A]). Hence:

Lemma 6.1. Let P be a polynomial of any degree, and let λ [A] denote the spectrum
of a matrix A ∈ Rd×d . Then

λ [P(A)] = P(λ [A]). (6.2)

If the matrix is diagonalizable, i.e., U −1 AU = Λ , then U −1 A pU = Λ p .

For the exponential function, which is defined by the power series

    e^A = ∑_{n=0}^∞ A^n / n! ,

it holds that λ[e^A] = e^{λ[A]}, even though the “polynomial” is a power series. This
follows from

    e^A uk = ∑_{n=0}^∞ (A^n uk) / n! = ∑_{n=0}^∞ (λk^n[A] uk) / n! = e^{λk[A]} uk .

Although this result does not rely on A being diagonalizable, it also holds that if
U⁻¹AU = Λ, then U⁻¹e^A U = e^{U⁻¹AU} = e^Λ. In fact, if f(z) is any analytic function,
then U⁻¹ f(A) U = f(U⁻¹AU) = f(Λ). More generally, we have

Lemma 6.2. Let A ∈ Rd×d , and let λ[A] denote its spectrum. Then
λ[e^A] = e^{λ[A]}, and

    |λ[e^{tA}]| = |e^{tλ[A]}| = e^{t Re λ[A]} .    (6.3)

It immediately follows that e^{tA} is bounded as t → ∞ if λ[A] ⊂ C⁻, i.e., if the
eigenvalues of A are located in the left half plane. Eigenvalues with zero real part
are also acceptable if they are simple.
The result of Lemma 6.1 can also be extended. We note that if A is nonsingular,
it holds that

    Au = λu  ⇒  A⁻¹u = λ⁻¹u.

Therefore λk [A^p] = (λk [A])^p holds also for negative powers, provided that A is invertible, so
that no eigenvalue is zero. It follows that if a polynomial Q(z) has the property that
λ[Q(A)] = Q(λ[A]) ≠ 0, then Q⁻¹(A) exists. This implies that Lemma 6.1 can be
extended to rational functions. This will be useful in connection with Runge–Kutta
methods. Thus we have the following extension:

Lemma 6.3. Let R(z) = P(z)/Q(z) be a rational function, where the degrees of the
polynomials P and Q are arbitrary. Let A ∈ Rd×d , and let λ[A] denote its spectrum.
If Q(λ[A]) ≠ 0, then

    λ[R(A)] = R(λ[A]),    (6.4)

where R(A) = P(A) · Q⁻¹(A) = Q⁻¹(A) · P(A).

Although a linear system ẏ = Ay is stable whenever the eigenvalues are located in
the left half plane, this eigenvalue theory does not extend to non-autonomous linear
systems ẏ = A(t)y and to nonlinear problems ẏ = f (t, y). As more powerful tools
are needed, the standard approach is to analyze stability in terms of some norm
condition on the vector field.

6.2 Logarithmic norms

Let us begin by considering the linear constant coefficient system ẏ = Ay once more,
and find an equation for the time evolution of kyk. It satisfies

    dkyk/dt+ = lim sup_{h→0+} (ky(t + h)k − ky(t)k)/h
             = lim_{h→0+} (ky(t) + hAy(t) + O(h²)k − ky(t)k)/h
             ≤ lim_{h→0+} ((kI + hAk − 1)/h) · ky(t)k,

where d/dt+ denotes the right-hand derivative. This is used as we are interested
in the forward time evolution of ky(t)k.

Definition 6.2. Let A ∈ Rd×d . The upper logarithmic norm of A is defined by

    M[A] = lim_{h→0+} (kI + hAk − 1)/h.    (6.5)

This limit can be shown to exist for all matrices. From the derivation above, we
have obtained the differential inequality

    dkyk/dt+ ≤ M[A] · kyk.    (6.6)

This differential inequality is easily solved. Note that

    d/dt+ (kyk e^{−tM[A]}) = e^{−tM[A]} (dkyk/dt+ − M[A] · kyk) ≤ 0.

Hence ky(t)k e^{−tM[A]} ≤ ky(0)k, and it follows that ky(t)k ≤ e^{tM[A]} ky(0)k for all t ≥ 0.
Since y(t) = e^{tA} y(0), we have:
    Vector norm                  Matrix norm         Log norm M[A]
    kxk1 = ∑i |xi|               maxj ∑i |aij|       maxj [Re ajj + ∑i≠j |aij|]
    kxk2 = (∑i |xi|²)^{1/2}      (ρ[A∗A])^{1/2}      α[(A + A∗)/2]
    kxk∞ = maxi |xi|             maxi ∑j |aij|       maxi [Re aii + ∑j≠i |aij|]

Table 6.1 Computation of matrix and logarithmic norms. The functions ρ[·] and α[·] refer to
the spectral radius and spectral abscissa, respectively. The matrix A∗ is the (possibly complex
conjugate) transpose of A

Theorem 6.1. For every A ∈ Rd×d , for any vector norm, and for t ≥ 0, it holds that

    ke^{tA}k ≤ e^{tM[A]} .    (6.7)

The reason why this is of interest is that we have the following result on stability:

Corollary 6.1. Let A ∈ Rd×d and assume that M[A] ≤ 0. Then

    ke^{tA}k ≤ 1,    (6.8)

for all t ≥ 0.

Therefore, if the logarithmic norm is non-positive, the system is stable. Note that
the logarithmic norm is not a “norm” in the proper sense of the word. Unlike a true
norm, the logarithmic norm can be negative, which makes it especially interesting
in connection with stability theory. Table 6.1 shows how the logarithmic norm is
calculated for the most common norms. Note that it is easily computed for the norms
k · k1 and k · k∞ , but that it is harder to compute for the standard Euclidean norm,
k·k2 . We then need the spectral radius ρ[·] and the spectral abscissa α[·], defined
by

    ρ[A] = max_k |λk [A]|,    α[A] = max_k Re λk [A].    (6.9)
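The explicit formulas of Table 6.1 make the log norm directly computable. The sketch below (an illustration, not part of the text) evaluates M[A] in the max norm for a sample matrix, and verifies the bound ke^{tA}k ≤ e^{tM[A]} of Theorem 6.1, approximating e^{tA} by a truncated power series:

```python
import math

# Sketch: the max-norm log norm formula from Table 6.1, plus a numerical
# check of ||e^{tA}|| <= e^{t M[A]} (here t = 1) for a sample 2x2 matrix.
def norm_inf(A):
    return max(sum(abs(x) for x in row) for row in A)

def lognorm_inf(A):
    # M_inf[A] = max_i [ a_ii + sum_{j != i} |a_ij| ]   (real A)
    n = len(A)
    return max(A[i][i] + sum(abs(A[i][j]) for j in range(n) if j != i)
               for i in range(n))

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(A, terms=30):
    # e^A by truncated power series; adequate for small matrices/entries
    n = len(A)
    E = [[float(i == j) for j in range(n)] for i in range(n)]
    T = [row[:] for row in E]
    for k in range(1, terms):
        T = [[x / k for x in row] for row in matmul(T, A)]   # T = A^k / k!
        E = [[E[i][j] + T[i][j] for j in range(n)] for i in range(n)]
    return E

A = [[-2.0, 1.0], [0.5, -1.0]]
M = lognorm_inf(A)            # = max(-2 + 1, -1 + 0.5) = -0.5
EtA = expm(A)                 # e^{tA} with t = 1
print(M, norm_inf(EtA), math.exp(M))
```

Note that M[A] = −0.5 is negative although kAk∞ = 3, which is exactly the extra information a log norm provides over an ordinary norm.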

In order to use norms and logarithmic norms in the analysis that follows, we
recall the properties of these norms.

Definition 6.3. A vector norm k · k satisfies the following axioms:

1. kxk ≥ 0; kxk = 0 ⇔ x = 0
2. kγxk = |γ| · kxk
3. kx + yk ≤ kxk + kyk.

Definition 6.4. The operator norm induced by the vector norm k · k is defined by

    kAk = sup_{x≠0} kAxk / kxk.    (6.10)

The operator norm is a matrix norm, and hence satisfies the same rules as the
vector norm. However, being an operator norm, it has an additional property. By
construction, it satisfies
kAxk ≤ kAk · kxk, (6.11)
from which it follows that the operator norm is submultiplicative. This means that

kABk ≤ kAk · kBk. (6.12)

This follows directly from kABxk ≤ kAk · kBxk ≤ kAk · kBk · kxk.
The logarithmic norm has already been defined in (6.5), by the limit

    M[A] = lim_{h→0+} (kI + hAk − 1)/h.

Thus it is defined in terms of the operator norm, which in turn is defined in terms
of the vector norm. All three are thus connected, and the computation of these quan-
tities is linked as shown in Table 6.1.
The logarithmic norm has a wide range of applications in both initial value and
boundary value problems, as well as in algebraic equations. Later on, we shall also
see that it can be extended to nonlinear maps and to differential operators. Like the
operator norm, it has a number of useful properties that play an important role in
deriving error and perturbation bounds.

Theorem 6.2. The upper logarithmic norm M[A] of a matrix A ∈ Rd×d has the fol-
lowing properties:
1. M[A] ≤ kAk
2. M[A + zI] = M[A] + Re z
3. M[γA] = γ M[A], γ ≥ 0
4. M[A + B] ≤ M[A] + M[B]
5. ke^{tA}k ≤ e^{tM[A]}, t ≥ 0

It is also easily demonstrated that the operator norm and the logarithmic norm
are related to the spectral bounds (6.9). Thus

α[A] ≤ M[A]; ρ[A] ≤ kAk. (6.13)

Consequently, if M[A] < 0, then all eigenvalues of A lie in C⁻ and A is invertible.

We are now in a position to address more important stability issues. Let us con-
sider a perturbed linear constant coefficient problem,

ẏ = Ay + p(t); y(0) = y0 , (6.14)

where p is a perturbation function. This satisfies the differential inequality

    dkyk/dt+ ≤ M[A] · kyk + kpk,

with solution

    ky(t)k ≤ e^{tM[A]} ky0 k + ∫_0^t e^{(t−τ)M[A]} kp(τ)k dτ.
We have already treated the case p ≡ 0, so let us instead consider the case p ≠ 0
and y0 = 0, and answer the question of how large ky(t)k can become, given the
perturbation p. Thus, letting

    kpk∞ = sup_{t≥0} kp(t)k,

we have

    ky(t)k ≤ kpk∞ ∫_0^t e^{(t−τ)M[A]} dτ.

Assuming that M[A] ≠ 0 and evaluating the integral, we find the perturbation bound

    ky(t)k ≤ kpk∞ (e^{tM[A]} − 1)/M[A].    (6.15)

In case M[A] > 0 the bound grows exponentially. More interestingly, if M[A] < 0,
the exponential term decays, and

    kyk∞ ≤ −kpk∞ / M[A].    (6.16)

Then y(t) can never exceed the bound given by (6.16). We summarize in a theorem:

Theorem 6.3. Let ẏ = Ay + p(t) with y(0) = 0. Let kpk∞ = sup_{t≥0} kp(t)k. Assume
that M[A] ≠ 0. Then

    ky(t)k ≤ kpk∞ (e^{tM[A]} − 1)/M[A];    t ≥ 0.    (6.17)

If M[A] = 0, then

    ky(t)k ≤ kpk∞ · t;    t ≥ 0.    (6.18)

If M[A] < 0 it holds that

    kyk∞ ≤ −kpk∞ / M[A].    (6.19)

Note that this theorem allows an exponentially growing bound in (6.17), a lin-
early growing bound in (6.18), and a uniform upper bound in (6.19). The latter is
the limit of (6.17) as t → ∞ if M[A] < 0. Note that the different bounds are primarily
distinguished by the sign of M[A], which governs stability.
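The uniform bound (6.19) can be checked on a scalar example. In the sketch below (an illustration, not from the text), ẏ = −y + sin t gives M[A] = −1 and kpk∞ = 1, so (6.19) predicts kyk∞ ≤ 1; a crude Euler integration confirms that the solution stays within the bound:

```python
import math

# Sketch: y' = -y + sin(t), y(0) = 0.  Here M[A] = -1 and ||p||_inf = 1,
# so (6.19) predicts ||y||_inf <= -||p||_inf / M[A] = 1.
h, n = 0.001, 40000          # integrate up to t = 40 with explicit Euler
y, t, ymax = 0.0, 0.0, 0.0
for _ in range(n):
    y += h * (-y + math.sin(t))
    t += h
    ymax = max(ymax, abs(y))
print(ymax)                  # stays below the uniform bound 1
```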
Let us simplify the problem further and assume that p is constant in (6.14). Then,
as M[A] < 0 implies that the exponential goes to zero (since λ[A] ⊂ C⁻ by (6.13)),
there must be a unique stationary solution ȳ to (6.14), satisfying

    0 = Aȳ + p  ⇒  ȳ = −A⁻¹p.

Thus

    kȳk = kA⁻¹pk ≤ −kpk / M[A],

and it follows that

    kA⁻¹pk / kpk ≤ −1 / M[A].

Because p is an arbitrary constant vector,

    sup_{p≠0} kA⁻¹pk / kpk = kA⁻¹k ≤ −1 / M[A].

Thus we have derived the following important, but less obvious result:

Theorem 6.4. Let A ∈ Rd×d and assume that M[A] < 0. Then A is invertible, with

    kA⁻¹k ≤ −1 / M[A].    (6.20)

This result will be used numerous times in various situations, both in initial value
problems and in boundary value problems. It is the simplest version of a more gen-
eral result, known as the uniform monotonicity theorem.
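Theorem 6.4 holds in any norm. A quick numerical sketch (hypothetical example) in the max norm: for a 2×2 matrix with M[A] = −0.5 < 0, the inverse exists and its norm indeed stays below −1/M[A] = 2:

```python
# Sketch: verify (6.20) in the max norm for a sample matrix with M[A] < 0.
def lognorm_inf(A):
    # Table 6.1 formula for the max-norm log norm (real A)
    n = len(A)
    return max(A[i][i] + sum(abs(A[i][j]) for j in range(n) if j != i)
               for i in range(n))

def norm_inf(A):
    return max(sum(abs(x) for x in row) for row in A)

def inv2(A):
    # closed-form inverse of a 2x2 matrix
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[-2.0, 1.0], [0.5, -1.0]]
M = lognorm_inf(A)                   # -0.5 < 0, so A is invertible
print(norm_inf(inv2(A)), -1.0 / M)   # ||A^{-1}|| stays below -1/M[A]
```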
The theory above collects some of the most important results in linear stability
theory, both in terms of eigenvalues and in terms of norms and logarithmic norms.
The theory is general and is valid for all norms. However, if we specialize to inner
product norms (Hilbert space) we obtain stronger results, in the sense that they can
be directly extended beyond elementary matrix theory.

6.3 Inner product norms

Inner products generate (and generalize) the Euclidean norm. They are defined as
follows.

Definition 6.5. An inner product is a bilinear form h·, ·i : Cd × Cd → C satisfying

1. hu, ui ≥ 0 ; hu, ui = 0 ⇔ u = 0
2. hu, vi = hv, ui∗
3. hu, αvi = αhu, vi
4. hu, v + wi = hu, vi + hu, wi,

generating the Euclidean norm by hu, ui = kuk2² .

Above, the star denotes the complex conjugate. In most cases we will only consider
real inner products, in which case the conjugation can be neglected. However, the complex
notation above enables us to also discuss operators with complex eigenvalues.
An inner product generalizes the notion of scalar product. Apart from the proper-
ties listed above, it has a few more essential properties. One of the most important
is the following:

Theorem 6.5. (Cauchy–Schwarz inequality) For all u, v ∈ Cd , it holds that

−kuk2 · kvk2 ≤ Re hu, vi ≤ kuk2 · kvk2 .

For u, v ∈ Rd , it holds that

−kuk2 · kvk2 ≤ hu, vi ≤ kuk2 · kvk2 .

This distinguishes between the case of real and complex vector spaces. Note that
when we have operators with complex conjugate eigenvalues, the corresponding
eigenvectors are also complex, which more or less necessitates the use of complex
vector spaces. Whether the vector space is real or complex, however, we always
have the following:

Definition 6.6. The operator norm associated with h·, ·i is

    kAk2² = sup_{u≠0} hAu, Aui / hu, ui = sup_{u≠0} kAuk2² / kuk2² .

For vectors in finite dimensions, we may denote the inner product by hu, vi = u∗ v,
where u∗ denotes the transpose, or, in the complex case, the complex conjugate
transpose. With this notation, we have u∗ u = kuk22 , from which it follows that
hAu, Aui ≤ kAk22 · kuk22 . This is easily seen to be equivalent to the standard definition
of the operator norm for a general choice of norm. For the logarithmic norm, the
situation is similar, but we give an alternative definition for an inner product norm,
since this is not only convenient, but also turns out to allow the logarithmic norm
to be defined for a somewhat wider class of vector fields. In addition, we obtain a
natural definition of a lower as well as the upper logarithmic norm.

Definition 6.7. The lower and upper logarithmic norms associated with the inner
product h·, ·i are defined as

    m2 [A] = inf_{u≠0} Re hu, Aui / kuk2² ,    M2 [A] = sup_{u≠0} Re hu, Aui / kuk2² .    (6.21)

This implies that m2 [A] = −M2 [−A], and that we have the following bounds,

    m2 [A] · kuk2² ≤ Re hu, Aui ≤ M2 [A] · kuk2² .    (6.22)

Again, if we write the inner product as hu, vi = u∗v, we find that

    kAk2² = sup_{u≠0} u∗A∗Au / u∗u

and

    M2 [A] = sup_{u≠0} Re u∗Au / u∗u.

Thus the norm of a matrix, as well as the lower and upper logarithmic norms, are
extrema of quadratic forms, albeit with different matrices, A∗A and A, respec-
tively. It is therefore in order to investigate how these extrema are found.
Let C be a given matrix, and let

    q(u) = Re u∗Cu / u∗u

denote the Rayleigh quotient formed by C and the vector u. We will find its extrema
by finding its stationary points, i.e., by solving the equation grad_u q = 0. Now,

    grad_u q = Re [u∗u · grad_u (u∗Cu) − u∗Cu · grad_u (u∗u)] / (u∗u)²
             = Re [u∗u · (u∗C + u∗C∗) − u∗Cu · (2u∗)] / (u∗u)² := 0,
which, upon (conjugate) transposition, gives the equation

    ((C + C∗)/2) u = q(u) · u

for the determination of the stationary points. This is obviously an eigenvalue prob-
lem, where q(u) is an eigenvalue of the symmetric matrix (C + C∗)/2. Thus, in case
we take C = A∗A, we obtain the eigenvalue problem

    A∗Au = λu,

showing that kAk2² is the largest eigenvalue of A∗A. This eigenvalue is real and pos-
itive, and σ² := λmax [A∗A] is the square of the maximal singular value of A.
On the other hand, if we take C = A (which is not a priori symmetric), we still
end up with a symmetric eigenvalue problem for the stationary points,

    ((A + A∗)/2) u = λ · u.

The eigenvalues of (A + A∗)/2 are real, but they are not necessarily positive. In fact,
we have just demonstrated that the logarithmic norm is given by

    M2 [A] = max_k λk [(A + A∗)/2],

as was indicated in Table 6.1. The Euclidean norm is sometimes referred to as the
spectral norm, as operator norms and logarithmic norms are determined by the spec-
trum of symmetrized operators associated with A. We summarize:

Theorem 6.6. In terms of the spectral radius and spectral abscissa, it holds that

    kAk2 = (ρ[A∗A])^{1/2} ;    M2 [A] = α[(A + A∗)/2].    (6.23)
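Theorem 6.6 is easy to test numerically. The sketch below (a hypothetical example) computes M2[A] for a 2×2 real matrix from the eigenvalues of the symmetric part, and checks that randomly sampled Rayleigh quotients never exceed it:

```python
import math, random

# Sketch: M2[A] = largest eigenvalue of (A + A^T)/2 (Theorem 6.6), checked
# against sampled Rayleigh quotients Re(u^T A u)/(u^T u) <= M2[A].
def M2(A):
    p, r = A[0][0], A[1][1]
    q = 0.5 * (A[0][1] + A[1][0])    # symmetric part is [[p, q], [q, r]]
    # eigenvalues of a symmetric 2x2 matrix, via the quadratic formula
    return 0.5 * ((p + r) + math.sqrt((p - r) ** 2 + 4 * q * q))

def rayleigh(A, u):
    x, y = u
    Ax = A[0][0] * x + A[0][1] * y
    Ay = A[1][0] * x + A[1][1] * y
    return (x * Ax + y * Ay) / (x * x + y * y)

A = [[-2.0, 1.0], [0.5, -1.0]]
m = M2(A)
random.seed(0)
samples = [rayleigh(A, (random.uniform(-1, 1), random.uniform(-1, 1)))
           for _ in range(1000)]
print(m, max(samples))    # every Rayleigh quotient lies below M2[A]
```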

It remains to show that the alternative definition (6.21) is compatible with the previ-
ous definition (6.5) whenever kAk2 < ∞ (which corresponds to a Lipschitz condition).
Note that, as h → 0+,

    kI + hAk2 = sup_{u≠0} ( k(I + hA)uk2² / kuk2² )^{1/2}
              = sup_{u≠0} ( u∗(I + hA)∗(I + hA)u / u∗u )^{1/2}
              = sup_{u≠0} ( (u∗u + h u∗(A + A∗)u + O(h²)) / u∗u )^{1/2}
              = sup_{u≠0} ( 1 + h u∗(A + A∗)u / u∗u + O(h²) )^{1/2}
              = 1 + h sup_{u≠0} u∗(A + A∗)u / (2u∗u) + O(h²)
              = 1 + h M2 [A] + O(h²).

Hence it follows that (6.21) and (6.5) represent the same limit, in case kAk2 < ∞. How-
ever, we shall see later that (6.21) applies also in the case when A is an unbounded
operator.
A more important aspect of using inner products is that, since kuk2² = u∗u is
differentiable,

    kuk2 · dkuk2/dt = (1/2) dkuk2²/dt = (1/2) d(u∗u)/dt = (u̇∗u + u∗u̇)/2 = Re (u∗u̇).

Therefore, if we consider the linear system u̇ = Au, we can assess stability by inves-
tigating the projection of the derivative u̇ on u, i.e.,

    Re u∗u̇ = Re u∗Au ≤ M2 [A] · kuk2² ,

and it follows that

    m2 [A] · kuk2 ≤ dkuk2/dt ≤ M2 [A] · kuk2 .    (6.24)
The upper bound is the same differential inequality as we had before, when the
concept was introduced for general norms. The technique above is of special
importance because it is standard in partial differential equations, when A
represents a differential operator in space.

6.4 Matrix categories

Although general norms have their place in matrix theory and in the analysis of
differential equations, inner product norms are particularly useful. The mathematics
of Hilbert space plays a central role in most of applied mathematics, and will be
the preferred setting in this book.
Inner products allow the notion of orthogonality. Thus two vectors are orthog-
onal if hu, vi = 0. In line with the notation used above, we will write this u∗v = 0.
Orthogonality is the key idea behind some of the best known methods in applied
mathematics, such as the least squares method, and, in the present context, the finite
element method. These methods find a best approximation by requiring that the
residual is orthogonal to the span of the basis functions; hence the residual cannot be
made any smaller in the inner product norm.
Although there are several ways of constructing inner product norms, we will let
kuk2² = u∗u denote the associated norm unless a special construction is emphasized.
In this section, however, the norm refers to the standard Euclidean norm.
Just like there is a (conjugate) transpose of a vector, there is a (conjugate) trans-
pose of a matrix. The definition is

hu, Avi = hA∗ u, vi,

for all vectors u, v. Now, since hu, vi = u∗ v, we have

u∗ Av = hu, Avi = hA∗ u, vi = (A∗ u)∗ v,

so (A∗ u)∗ = u∗ A, and (A∗ )∗ = A. We shall return to this in connection with differen-
tial operators, where A∗ is known as the adjoint of A.
    Property      Name               λk [A]     m2 [A]    M2 [A]   kAk2
    A∗ = A        symmetric          real       −α[−A]    α[A]     ρ[A]
    A∗ = −A       skew-symmetric     iωk        0         0        ρ[A]
    A∗ = A⁻¹      orthogonal         e^{iϕk}    −α[−A]    α[A]     1
    A∗A = AA∗     normal             complex    −α[−A]    α[A]     ρ[A]
    —             positive definite  —          > 0       —        —
    —             negative definite  —          —         < 0     —
    —             indefinite         —          < 0       > 0     —
    —             contraction        —          —         —       < 1

Table 6.2 Matrix categories and logarithmic norms. A∗ is the (conjugate) transpose of A. All listed
categories of matrices have (or can be arranged to have) orthogonal eigenvectors. The most general
class is normal matrices; all categories above are normal

Within this framework, there are several important classes of matrices that we
will encounter many times. These classes are characterized in Table 6.2, where, in
addition, the spectral properties and logarithmic norms are given for each class.
The names of the classes of matrices vary depending on the context. The names
given in Table 6.2 refer to standard terminology for real matrices, A ∈ Rd×d . For
complex matrices A ∈ Cd×d , the corresponding terms are, respectively, Hermitian;
skew-Hermitian; unitary; and normal. For more general linear operators, such as
linear differential operators, the terms are self-adjoint; anti-selfadjoint; unitary;
and normal.
For example, in a linear system u̇ = Au, where A is skew-symmetric, we have

    d/dt log kuk2 = (1/(2kuk2²)) dkuk2²/dt = Re hu, u̇i / hu, ui = Re hu, Aui / hu, ui = Re u∗Au / u∗u = 0.

Thus it follows that ku(t)k2 remains constant in such a system; problems of this
type are referred to as conservation laws, and occur e.g. in transport equations in
partial differential equations. They require special numerical methods that replicate
the conservation law, keeping the norm of the solution constant. For other classes
of matrices, there may be similar concerns about whether we can construct methods
that have a proper qualitative behavior.
By contrast, if M2 [A] < 0, it follows that ku(t)k2 → 0. Thus the magnitude of the
solution will decrease as t → ∞, and ku(t)k2 ≤ ku(0)k2 for all t ≥ 0.
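The conservation issue is visible even in the simplest skew-symmetric system, the rotation u̇ = Au with A = [[0, ω], [−ω, 0]]. The sketch below (an illustration; the implicit midpoint rule used here is one example of a norm-conserving method, introduced later in the book) shows that explicit Euler inflates the norm, while the midpoint rule conserves it exactly:

```python
import math

# Sketch: u' = Au with skew-symmetric A = [[0, w], [-w, 0]]; the exact flow
# is a rotation, so ||u(t)||_2 is conserved.  Compare explicit Euler with
# the implicit midpoint rule.
w, h = 1.0, 0.1

def euler_step(u):
    return (u[0] + h * w * u[1], u[1] - h * w * u[0])

def midpoint_step(u):
    # (I - hA/2) u_new = (I + hA/2) u : solve the 2x2 system directly.
    # This Cayley-transform map is orthogonal, hence norm-conserving.
    a = h * w / 2
    bx, by = u[0] + a * u[1], u[1] - a * u[0]
    det = 1 + a * a
    return ((bx + a * by) / det, (by - a * bx) / det)

def norm(u):
    return math.hypot(u[0], u[1])

ue = um = (1.0, 0.0)
for _ in range(100):
    ue = euler_step(ue)
    um = midpoint_step(um)
print(norm(ue), norm(um))   # Euler inflates the norm; midpoint conserves it
```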
We also see that definiteness can be characterized in terms of the logarithmic
norms. Thus a positive definite matrix has m2 [A] > 0, corresponding to

    0 < m2 [A] = inf_{u≠0} Re u∗Au / u∗u,

so the quadratic form only takes values in the right half-plane. Likewise, a negative
definite matrix is characterized by M2 [A] < 0. The upper and lower logarithmic
norms provide additional quantitative information, however, as the actual values of
the logarithmic norms tell us how positive (or negative) definite a matrix is; this
allows us to find stability margins. Note, however, that there are matrices that are
neither positive nor negative definite.

6.5 Nonlinear stability theory

The stability of nonlinear systems is considerably more complicated than that of linear
systems. Yet there are strong similarities, even though the stability of a solution must
be considered on a case-by-case basis. Let u and v be two solutions of the same
differential equation, with initial conditions u(0) = u0 and v(0) = v0 . Then u − v satisfies

    d(u − v)/dt = f (t, u) − f (t, v).

Taking the inner product with u − v, we find the differential inequality

    (1/2) d/dt ku − vk2² = hu − v, f (t, u) − f (t, v)i ≤ M2 [ f ] · ku − vk2² ,
where the upper logarithmic norm of f (t, ·) is

    M2 [ f ] = sup_{u≠v} hu − v, f (t, u) − f (t, v)i / ku − vk2² ,    (6.25)

where u, v are in the domain of f (t, ·). In a similar way, taking the infimum instead,
we obtain the lower logarithmic norm m2 [ f ]. Consequently, letting ∆u = u − v de-
note the difference between u and v, we have the differential inequalities

    m2 [ f ] · k∆uk2 ≤ d/dt k∆uk2 ≤ M2 [ f ] · k∆uk2 ,

which are completely analogous to those we obtained in the linear case. Thus we
can bound the growth rate of k∆uk2 in terms of M2 [ f ].
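A scalar sketch (hypothetical example) makes (6.25) concrete. For f(u) = −u − u³ one computes (u − v)(f(u) − f(v))/(u − v)² = −1 − (u² + uv + v²) ≤ −1, so M2[f] = −1 and any two solutions contract at least as fast as e^{−t}:

```python
import math

# Sketch: f(u) = -u - u^3 has M2[f] = sup -(1 + u^2 + uv + v^2) = -1,
# so |u(t) - v(t)| <= |u(0) - v(0)| e^{-t}.  Checked with small-step Euler.
f = lambda u: -u - u**3
h, n = 0.001, 2000
u, v = 1.0, 0.5
for _ in range(n):
    u, v = u + h * f(u), v + h * f(v)
t = h * n
print(abs(u - v), 0.5 * math.exp(-t))   # contraction at least as fast as e^{-t}
```

Note that the Lipschitz constant of f on this interval is larger than 1, so a Lipschitz-based (Grönwall) estimate would only give a growing bound; the log norm captures the contraction.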
In fact, if f (t, ·) is Lipschitz with respect to its second argument, we have

    L2² [ f ] = sup_{u≠v} h f (t, u) − f (t, v), f (t, u) − f (t, v)i / ku − vk2² = sup_{u≠v} k f (t, u) − f (t, v)k2² / ku − vk2² .

We can easily verify that L[·] is an operator (semi)norm; in fact, if f (t, u) = Au
is a linear map, we find that L2 [A] = kAk2 . This part of the theory is easily extended
to general norms, so that we can define

    M[ f ] = lim_{h→0+} (L[I + h f (t, ·)] − 1)/h.

In the case of the Euclidean norm, the logarithmic norm defined in this way is identical to
the expression (6.25).
No matter what norm we choose, we obtain differential inequalities and pertur-
bation bounds of the same structure as in the linear case. This extends linear theory
to nonlinear systems. Let us therefore state two results of great importance in the
analysis that follows.

Theorem 6.7. Let u̇ = f (u) + p(t) and v̇ = f (v) with u(0) − v(0) = 0. Let kpk∞ =
sup_{t≥0} kp(t)k. Assume that f : Rd → Rd with M[ f ] ≠ 0, and let ∆u = u − v. Then

    k∆u(t)k ≤ kpk∞ (e^{tM[ f ]} − 1)/M[ f ];    t ≥ 0.    (6.26)

If M[ f ] = 0, then

    k∆u(t)k ≤ kpk∞ · t;    t ≥ 0.    (6.27)

If M[ f ] < 0 it holds that

    k∆uk∞ ≤ −kpk∞ / M[ f ].    (6.28)
We note that this is a nonlinear version of the bounds (6.17)–(6.19). Here v(t)
represents the unperturbed solution, and u(t) the solution obtained when a forcing
perturbation term p(t) drives the solution away from v(t). Whether the perturbed
solution grows or not is primarily governed by M[ f ]. Note that this result
is valid for any norm, even though we will give preference to the Euclidean norm.
In the linear case we saw that if M[A] < 0, then any stationary solution is stable.
In the nonlinear case we have a similar result. First we note that if p = 0 then u = v;
this means that the solution to the system is unique. Now, in the theorem above,
assume that M[ f ] < 0 and that p ≠ 0 is constant. Then there is a unique stationary
solution ū, satisfying

    0 = f (ū) + p.

We shall see that we can write ū = f ⁻¹(−p), i.e., we want to show that the inverse
map f ⁻¹ exists. To this end, note that, by the Cauchy–Schwarz inequality,

    −k f (u) − f (v)k2 ku − vk2 ≤ hu − v, f (u) − f (v)i ≤ M2 [ f ] · ku − vk2² < 0

for any distinct vectors u, v ∈ Rd . Simplifying, we have

    k f (u) − f (v)k2 ≥ −M2 [ f ] · ku − vk2 > 0.

This means that if u ≠ v, then necessarily f (u) ≠ f (v). Hence f is one-to-one, and
we may write f (u) = x and f (v) = y, with u = f ⁻¹(x) and v = f ⁻¹(y). It follows
that

    k f ⁻¹(x) − f ⁻¹(y)k2 / kx − yk2 ≤ −1 / M2 [ f ]

holds for all x ≠ y. By taking the supremum of the left hand side we arrive at the
following theorem:

Theorem 6.8. (Uniform Monotonicity Theorem) Assume that f : Rd → Rd with
M2 [ f ] < 0. Then f is invertible on Rd with a Lipschitz inverse, and

    L2 [ f ⁻¹] ≤ −1 / M2 [ f ].    (6.29)

The derivation above is only for inner product norms, but the result also holds for
general norms. Likewise, it holds if we replace the condition M[ f ] < 0 by m[ f ] > 0.
This can be compared to the linear case. Thus, if A is a positive definite matrix (i.e.,
m2 [A] > 0) it has a bounded inverse, with kA−1 k2 ≤ 1/m2 [A]. Similarly, a negative
definite matrix has a bounded inverse. The uniform monotonicity theorem above
generalizes those classical results to nonlinear maps.
An interesting consequence of the uniform monotonicity theorem is the follow-
ing:

Corollary 6.2. Let h > 0 and assume that f : Rd → Rd with M2 [h f ] < 1. Then I − h f
is invertible on Rd with a Lipschitz inverse, and

    L2 [(I − h f )⁻¹] ≤ 1 / (1 − M2 [h f ]).    (6.30)

Proof. This result is obtained from the elementary properties of M[·] in Theorem
6.2. Thus we note that

    M2 [h f − I] = M2 [h f ] − 1.

By assumption M2 [h f − I] < 0, so (I − h f )⁻¹ is Lipschitz, with the constant given
by (6.30).

This corollary will be seen to guarantee existence and uniqueness of solutions in
implicit time-stepping methods for initial value problems.
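The sketch below (a hypothetical example) shows what Corollary 6.2 buys us in practice. For f(u) = −u³ we have M2[hf] ≤ 0 < 1 for every h > 0, so the implicit equation u − hf(u) = w has a unique solution for every w, which we compute by Newton's method; the Lipschitz bound (6.30) then limits how far the solution moves when the data w is perturbed:

```python
# Sketch for Corollary 6.2: with f(u) = -u^3 we have M2[f] <= 0, hence
# M2[h f] <= 0 < 1, and u - h f(u) = w has a unique solution for every w.
# Here we solve g(u) = u + h u^3 - w = 0 by Newton's method (g' >= 1).
def solve_implicit(w, h, tol=1e-14):
    u = w
    for _ in range(100):
        g = u + h * u**3 - w
        dg = 1.0 + 3.0 * h * u * u
        step = g / dg
        u -= step
        if abs(step) < tol:
            break
    return u

h = 0.5
u1 = solve_implicit(2.0, h)
u2 = solve_implicit(2.1, h)
# Lipschitz bound (6.30): |u1 - u2| <= |w1 - w2| / (1 - M2[h f]) <= 0.1
print(u1, u2, abs(u1 - u2))
```

This is exactly the situation in an implicit Euler step u_{n+1} = u_n + h f(u_{n+1}), with w playing the role of u_n.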

6.6 Stability in discrete systems

There is a strong analogy between differential equations and difference equations.
Corresponding to the linear and nonlinear initial value problems

    ẏ = Ay,    ẏ = f (y),

we have the discrete systems

    yn+1 = Ayn ,    yn+1 = f (yn ).

Beginning with the linear system, we saw that in the continuous case, stability was
governed by having all eigenvalues in the left half-plane, α[A] < 0, or, if norms were
used, by having a non-positive upper logarithmic norm, M[A] ≤ 0. In the discrete
linear case, the stability conditions instead require that the eigenvalues lie in the unit
circle, ρ[A] < 1, or, in terms of norms, that kAk ≤ 1.
Thus the role of left half-plane is “replaced by” the unit circle in the discrete
case; the spectral abscissa by the spectral radius; and the logarithmic norm by the
matrix norm. In the discrete, linear case, the solution is yn = An y0 , and the solution
is bounded (stable) if
kAn k ≤ C
for all n ≥ 1. A matrix satisfying this condition is called power bounded. Power
boundedness is necessary for stability, but as it depends on the spectrum of A as
well as the eigenvectors, it may often be difficult to establish. On the other hand,
using the submultiplicativity of the operator norm, we have

kAn k ≤ kAkn .

Therefore A is power bounded if kAk ≤ 1, and the latter condition is often much
easier to establish. Since A may be power bounded even if kAk > 1, the condition
kAk ≤ 1 is sufficient for stability, but not necessary.
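The gap between the two conditions is easy to demonstrate. In the sketch below (a hypothetical example), ρ[A] = 0.9 < 1 while kAk∞ = 5.9 > 1; the powers kAⁿk grow transiently before they decay, so A is power bounded even though the norm condition fails:

```python
# Sketch: rho[A] < 1 makes A power bounded even though ||A||_inf > 1;
# the norms ||A^n|| may grow transiently before they decay.
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def norm_inf(A):
    return max(sum(abs(x) for x in row) for row in A)

A = [[0.9, 5.0], [0.0, 0.9]]        # rho[A] = 0.9 < 1, ||A||_inf = 5.9 > 1
P = [[1.0, 0.0], [0.0, 1.0]]
norms = []
for n in range(200):
    P = matmul(P, A)                # P = A^{n+1}
    norms.append(norm_inf(P))
print(max(norms), norms[-1])        # transient growth, then decay
```

The transient growth stems from the non-normality of A (a large off-diagonal block with a double eigenvalue); for normal matrices, ρ[A] = kAk2 and no transient occurs.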
In an analogous way, the solution of the continuous system is y(t) = e^{tA} y(0), and
the solution is bounded (stable) if

    ke^{tA}k ≤ C.

This, too, is more difficult to establish than the result obtained by using norms. Thus
we have seen, by using differential inequalities, that

    ke^{tA}k ≤ e^{tM[A]}

for t ≥ 0, and it follows that the system is stable if M[A] ≤ 0. Again, the latter is a
sufficient but not a necessary condition.
In the nonlinear continuous case, we need to investigate the difference of two
solutions, u and v, and whether they remain close as t → ∞. This is the case if
M[ f ] ≤ 0. The situation is similar in the discrete case. We then have

un+1 = f (un )
vn+1 = f (vn ).

The difference between the solutions satisfies

    kun+1 − vn+1 k = k f (un ) − f (vn )k ≤ L[ f ] · kun − vn k,

where L[ f ] is the Lipschitz constant of f . Thus, if L[ f ] < 1, the distance between
the solutions decreases. We have already seen in Theorem 2.1 that, provided that
f : D → f (D) ⊂ D and L[ f ] < 1, there is a unique fixed point ū solving the equation

    ū = f (ū).

This is then a “stationary solution” of the discrete dynamical system. In addition,
we saw in (2.25) that (I − f )⁻¹ exists and is Lipschitz. In fact, in view of Corollary
6.2, the inverse exists under the slightly relaxed condition M[ f ] < 1, and the error
estimate (2.25) can be sharpened. We shall derive an improved bound, but restrict
the derivation to inner product norms. Note that

    un+1 − ū = f (un ) − f (un+1 ) + f (un+1 ) − f (ū).

Hence

kun+1 − ūk22 = hun+1 − ū, un+1 − ūi


= hun+1 − ū, f (un ) − f (un+1 )i + hun+1 − ū, f (un+1 ) − f (ū)i
≤ kun+1 − ūk2 k f (un ) − f (un+1 )k2 + M2 [ f ] · kun+1 − ūk22
≤ L2 [ f ] · kun+1 − ūk2 kun − un+1 k2 + M2 [ f ] · kun+1 − ūk22 .

Simplifying, we obtain

kun+1 − ūk2 ≤ L2 [ f ] · kun − un+1 k2 + M2 [ f ] · kun+1 − ūk2 .

Note that M2 [ f ] ≤ L2 [ f ] for all f , and that we have assumed L2 [ f ] < 1. Therefore
M2 [ f ] < 1 and
kun+1 − ūk2 ≤ L2 [ f ]/(1 − M2 [ f ]) · kun+1 − un k2 .
This bound is always preferable to (2.25) since M2 [ f ] ≤ L2 [ f ]. In particular, if
M2 [ f ] ≤ 0, it follows that kun+1 − ūk2 ≤ kun+1 − un k2 , expressing that the error
is less than the computable difference between the last two iterates. Such a bound
cannot be obtained using the Lipschitz constant alone, as in (2.25). We restate the
fixed point theorem in improved form, for general norms, even though only the Eu-
clidean norm was used above:

Theorem 6.9. (Fixed point theorem) Let D be a closed set and assume that f is a
Lipschitz continuous map satisfying f : D ⊂ Rd → D. Then there exists a fixed point
6.6 Stability in discrete systems 25

System type          ẏ = Ay      ẏ = f (y)     yn+1 = Ayn    yn+1 = f (yn )
Spectral condition   α[A] < 0    −             ρ[A] < 1      −
Norm condition       M[A] ≤ 0    M[ f ] ≤ 0    kAk ≤ 1       L[ f ] ≤ 1

Table 6.3 Stability conditions for linear and nonlinear systems. Elementary stability conditions
are given in terms of the spectrum and norms in the linear constant coefficient case. The spectrum
may reach the boundary of the left half plane or the unit circle, provided that a multiple eigenvalue
has a full set of eigenvectors. The norm conditions are sufficient, but not necessary

ū ∈ D. If, in addition, L[ f ] < 1 on D, then the fixed point ū is unique, and the fixed
point iteration converges to ū for every starting value u0 ∈ D, with the error estimate

kun+1 − ūk ≤ L[ f ]/(1 − M[ f ]) · kun+1 − un k. (6.31)

Thus, in connection with discrete dynamical systems, there are also links to iter-
ative methods for solving equations; such iterations are also discrete time dynamical
systems, to which stability and contractivity apply. Returning to the stability in-
terpretation, we have collected some elementary stability conditions in Table 6.3.
While the conditions on the eigenvalues (spectrum) only apply to linear constant
coefficient systems, the norm conditions apply to linear as well as nonlinear sys-
tems, but are only sufficient conditions; a system could be stable and yet fail to
fulfill the norm condition.
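To make the error estimate (6.31) concrete, here is a small numerical sketch (in Python; the code and the example map are added for illustration and are not part of the text). The map f (u) = e−u on [0.1, 1] has Lipschitz constant L[ f ] = e−0.1 ≈ 0.90 and, in the scalar inner product case, M[ f ] = sup f ′ = −e−1 < 0, so the bound is considerably sharper than one based on the Lipschitz constant alone.

```python
import math

def fixed_point(f, u0, tol=1e-14, max_iter=200):
    """Fixed point iteration u_{n+1} = f(u_n), stopped on small increments."""
    u = u0
    for _ in range(max_iter):
        u_next = f(u)
        if abs(u_next - u) < tol:
            return u_next
        u = u_next
    return u

f = lambda u: math.exp(-u)        # maps [0.1, 1] into itself
L = math.exp(-0.1)                # Lipschitz constant of f on [0.1, 1]
M = -math.exp(-1.0)               # sup of f' on [0.1, 1], playing the role of M[f]

u_bar = fixed_point(f, 0.5)       # fixed point of u = exp(-u), approx. 0.5671

# A posteriori bound (6.31) after one step from u0 = 0.5:
u0, u1 = 0.5, f(0.5)
bound = L / (1 - M) * abs(u1 - u0)
error = abs(u1 - u_bar)
```

Since M[ f ] < 0 here, the computed error is indeed below the difference of the last two iterates, in line with the discussion following (2.25).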
Another analogy is found in linear differential and difference equations.

Example Consider the linear differential equation

ÿ + 3ẏ + y = 0

with suitable initial condition. The standard procedure is to insert the ansatz y = etλ into
the equation, to obtain the characteristic equation from which the possible values of λ are
determined. Thus
λ 2 + 3λ + 1 = 0, (6.32)
and it follows that λ1,2 = (−3 ± √(9 − 4))/2 = (−3 ± √5)/2. The general solution is

y(t) = C1 etλ1 +C2 etλ2 ,

where the constants of integration C1 and C2 are determined by the initial condition. Obvi-
ously y(t) = 0 is a solution. It is stable, since both roots λ1,2 ∈ C− .

Example The corresponding example in difference equations is

yn+2 + 3yn+1 + yn = 0.

This time, however, we make the ansatz yn = λ n , which is an “exponential function” in


discrete time. Upon insertion, we get

λ n+2 + 3λ n+1 + λ n = 0,

leading to the same characteristic equation (6.32) as before, since λ ≠ 0 when we seek a
nonzero solution. Naturally, we have the same roots λ1,2 . The general solution is

yn = C1 λ1n +C2 λ2n ,

where C1 and C2 are determined by the initial conditions. The zero solution yn = 0 is now
unstable, since one root is outside the unit circle:

|λ1 | = | − 3 + √5|/2 ≈ 0.382 < 1
|λ2 | = | − 3 − √5|/2 ≈ 2.618 > 1.

Thus, in linear differential and difference equations, stability is once more deter-
mined by having the roots in the left half-plane, or in the unit circle. This is in fact
the same result as we saw before:

Example By putting ẏ = z in the second order problem above, it is transformed into a


system of first order equations. Thus

ẏ = z
ż = −y − 3z,

with initial conditions y(0) = y0 and z(0) = ẏ(0) = ẏ0 . In matrix–vector form we have

d/dt [y; z] = [0, 1; −1, −3] · [y; z].

The matrix thus obtained is referred to as the companion matrix of the differential equa-
tion, and for stability its eigenvalues must be located in the left half plane. To determine its
eigenvalues λk [A], we need the characteristic equation,

det(A − λ I) = det [−λ , 1; −1, −3 − λ ] = λ (λ + 3) + 1 = λ 2 + 3λ + 1 = 0.

This is the same characteristic equation as before, showing that spectral stability conditions
are identical to those derived directly for higher order differential or difference equations.
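This equivalence is easy to verify numerically; the following sketch (Python with NumPy, added here for illustration) checks that the eigenvalues of the companion matrix coincide with the roots of the characteristic equation (6.32):

```python
import numpy as np

# Companion matrix of y'' + 3y' + y = 0 (the system above)
A = np.array([[ 0.0,  1.0],
              [-1.0, -3.0]])

eigs  = np.sort(np.linalg.eigvals(A).real)       # eigenvalues of A
roots = np.sort(np.roots([1.0, 3.0, 1.0]).real)  # roots of l^2 + 3l + 1 = 0

# Both equal (-3 +- sqrt(5))/2; the root of magnitude 2.618 > 1 explains
# why the corresponding difference equation is unstable.
```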
Chapter 7
The Explicit Euler method

The first discrete method for solving initial value problems was devised by Euler in
the mid 18th century. One of the greatest mathematicians of all time, Euler realized
that many of the emerging problems of analysis could only be solved approximately.
In the problem ẏ = f (t, y), the “issue” is the derivative. Thus we have already noted
that derivatives need to be approximated by finite differences in order to construct
computable approximations.
In the differential equation
ẏ = f (t, y), (7.1)
we start from the simplest approximation. We will compute a sequence of approxi-
mations, yn ≈ y(tn ), such that
(yn+1 − yn )/∆t = f (tn , yn ). (7.2)
This follows the pattern from the standard definition of the derivative. Since

lim∆t→0 (y(tn + ∆t) − y(tn ))/∆t = ẏ(tn ),
the finite difference approximation (7.2) is obtained by replacing the derivative in
(7.1), using a finite time step ∆t > 0. From (7.2) we create a recursion,

yn+1 = yn + ∆t · f (tn , yn ), (7.3)

starting from the initial value y0 = y(0). This is the Explicit Euler method. It is
the original time-stepping method, and all other types of time-stepping method con-
structions include the explicit Euler method as the simplest case.
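As a sketch, the recursion (7.3) takes only a few lines in any programming language; here in Python (the test problem ẏ = −y is our own illustrative choice, not the text's):

```python
def explicit_euler(f, y0, t0, T, N):
    """Explicit Euler for y' = f(t, y), y(t0) = y0, with N steps on [t0, T].

    Returns the time points t_n and the approximations y_n ~ y(t_n).
    """
    dt = (T - t0) / N
    ts, ys = [t0], [y0]
    for n in range(N):
        ys.append(ys[-1] + dt * f(ts[-1], ys[-1]))   # recursion (7.3)
        ts.append(t0 + (n + 1) * dt)
    return ts, ys

# Example: y' = -y, y(0) = 1, whose exact solution is y(t) = exp(-t)
ts, ys = explicit_euler(lambda t, y: -y, 1.0, 0.0, 1.0, 1000)
```

With ∆t = 10−3 , the value ys[-1] agrees with e−1 ≈ 0.3679 to roughly three decimal places, consistent with the O(∆t) global error established below.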


Fig. 7.1 Leonhard Euler (1707–1783). Portrait by J.E. Handmann (1753), Kunstmuseum Basel

7.1 Convergence

The Euler recursion implies that we sample the vector field f (t, y) at the current
point of approximation, i.e., at (tn , yn ), and then take one step of size ∆t in the
direction of the tangent. Naturally, the exact solution will not satisfy this recursion.
As before, we let y(tn ) denote the exact solution of the differential equation at time
tn . Inserting the exact solution into the recursion (7.3), we obtain

y(tn+1 ) = y(tn ) + ∆t · f (tn , y(tn )) − rn , (7.4)

where the local residual rn ≠ 0 signifies that the exact solution does not satisfy the
recursion.
The first question is, how large is rn ? Let us assume that the exact solution y is
twice continuously differentiable on [0, T ]. Expanding in a Taylor series, we obtain

y(tn+1 ) = y(tn ) + ∆t · ẏ(tn ) + ∆t 2 · ÿ(θn )/2,
for some θn ∈ [tn ,tn+1 ]. Since ẏ(tn ) = f (tn , y(tn )), we can compare to (7.4) and con-
clude that
rn = −∆t 2 · ÿ(θn )/2. (7.5)
Hence krn k = O(∆t 2 ) as ∆t → 0, provided that y ∈ C2 [0, T ].

Lemma 7.1. Let {y(tn )} denote the exact solution to (7.1) at the time points {tn }.
Assume that y ∈ C2 [0, T ], and that f is Lipschitz. When the exact solution is inserted
into the explicit Euler recursion (7.3), the local residual is

krn k ≤ ∆t 2 · maxt∈[0,T ] kÿ(t)k/2. (7.6)

This means that the difference equation is consistent with the differential equa-
tion, and that the approximation improves as ∆t → 0. We will return to this notion
later; for the time being, we say that the method has order of consistency p, if

krn k = O(∆t p+1 )

as ∆t → 0 (or, equivalently, N → ∞).


Now, because the exact solution does not satisfy the recursion, it follows that the
numerical solution will deviate from the exact solution. We introduce the following
definition.

Definition 7.1. Let the sequence {yn } denote the numerical solution generated by
the Euler method (7.3) and let {y(tn )} denote the exact solution to (7.1) at the time
points {tn }. Then the difference

en = yn − y(tn )

is called the global error at time tn .

The next question is, therefore, how large is en ? Now, the objective of all time
stepping methods is to generate a numerical solution {yn } whose global error can be
bounded. In fact, we want more: the method must be convergent. This means that
given any prescribed error tolerance ε, we must be able to choose the step size ∆t
accordingly, so that the numerical solution attains the prescribed accuracy ε.
Let us see how this is done. The local residual is related to the global error.
Subtracting (7.4) from (7.3), we get

en+1 = en + ∆t · f (tn , y(tn ) + en ) − ∆t · f (tn , y(tn )) + rn . (7.7)

This is a recursion for the global error, where the local residual is the forcing func-
tion. It should be noted that the terminology varies in the literature, and that rn is
often called the local truncation error, or the local error. The reason for this dis-
crepancy will become clear later, in connection with implicit methods.
Taking norms in (7.7), using the triangle inequality and the Lipschitz condition
for f , yields

ken+1 k ≤ ken k + ∆t L[ f ] · ken k + krn k = (1 + ∆t L[ f ])ken k + krn k. (7.8)


30 7 The Explicit Euler method

This is a difference inequality, and from it we are going to derive a bound on the
global error. To this end, we need the following lemma.

Lemma 7.2. Let the sequence {un }n≥0 satisfy un+1 ≤ (1 + µ)un + vn , with u0 = 0,
vn > 0 and µ > −1, but µ 6= 0. Then

un ≤ maxk<n vk · (enµ − 1)/µ. (7.9)

In case µ = 0, it holds that un ≤ n · maxk<n vk .

Proof. The case µ = 0 is trivial. For µ 6= 0, we first prove by induction that un ≤ Un ,


where
n−1
Un = ∑ (1 + µ)n−k−1 vk . (7.10)
k=0

This obviously holds for n = 1. Assume that un ≤ Un holds for a given n. Then
n
un+1 ≤ (1 + µ)un + vn ≤ (1 + µ)Un + vn = ∑ (1 + µ)n−k vk = Un+1 ,
k=0

so un ≤ Un holds for all n ≥ 1. From (7.10), it now follows that

Un ≤ maxk<n vk · ∑k<n e(n−k−1)µ = maxk<n vk · (enµ − 1)/(eµ − 1) ≤ maxk<n vk · (enµ − 1)/µ,

since 1 + µ ≤ eµ and the sum is a geometric series. The proof is complete.
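Lemma 7.2 is easily sanity-checked numerically. The sketch below (Python; the particular µ and vk are arbitrary illustrative choices) drives the recursion at equality, the worst case, and verifies the bound (7.9) at every n:

```python
import math

mu = 0.05
v = [0.01 * (1.0 + 0.5 * math.sin(k)) for k in range(50)]   # arbitrary v_k > 0

u = 0.0                      # u_0 = 0
ok = True
for n in range(1, len(v) + 1):
    u = (1.0 + mu) * u + v[n - 1]                 # equality: the worst case
    bound = max(v[:n]) * (math.exp(n * mu) - 1.0) / mu
    ok = ok and (u <= bound)                      # the bound (7.9) holds
```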

We are now ready to construct a global error bound for the explicit Euler method,
by applying (7.9) to the error recursion (7.8). Thus we obtain the following result.

Theorem 7.1. Let the initial value problem ẏ = f (t, y) be given with y(0) = y0 on
a compact interval [0, T ]. Assume that L[ f ] < ∞ for y ∈ Rd , and that the solution
y ∈ C2 [0, T ]. When this problem is solved, taking N steps with the explicit Euler
method using step size ∆t = T /N, the global error at t = T is bounded by

kyN − y(T )k ≤ ∆t · maxt∈[0,T ] kÿk/2 · (eT L[ f ] − 1)/L[ f ]. (7.11)

Proof. The result follows immediately from identifying µ = ∆t L[ f ] in (7.8), and

vk = krk k = ∆t 2 kÿ(θk )k/2
in (7.5), and noting that n · ∆t = tn in general, and N · ∆t = T in particular.

This classical result proves that the explicit Euler method is convergent, i.e.,
by choosing the step size ∆t = T /N small enough, the method can generate approximations that are arbitrarily accurate. Thus the structure of the bound (7.11) is

kyN − y(T )k ≤ C(T ) · ∆t. (7.12)

An alternative formulation of (7.12) uses ∆t = T /N, to express the bound as

kyN − y(T )k ≤ K(T )/N, (7.13)
with K(T ) = T ·C(T ).

Definition 7.2. If a time stepping method generates a sequence of approximations


yn ≈ y(tn ) at the time points tn = n · ∆t with ∆t = T /N, and the exact solution y(t)
is sufficiently smooth, the method is said to be of convergence order p, if there is
a constant C and an N ∗ such that

kyN − y(T )k ≤ C · N −p

for all N > N ∗ .

Here we note that, by Theorem 7.1, the explicit Euler method is of convergence
order p = 1. This implies that the global error is keN k = O(∆t), and that given any
accuracy requirement ε, we can pick ∆t (or N) such that keN k ≤ ε.

Example The theory is easily illustrated on a simple test problem. We choose

ẏ = λ (y − g(t)) + ġ(t); y(0) = y0 . (7.14)

The exact solution is


y(t) = eλt (y0 − g(0)) + g(t),
and is composed of a homogeneous solution (the exponential) and the particular solution g(t). We will choose g(t) = sin πt and y0 = 0 so that the exact solution is y(t) = sin πt,
and we will solve the problem on [0, 1]. We deliberately choose N = 5 and N = 10 steps,
to obtain large errors, visible to the naked eye. Such a computational setup is obviously
exaggerated for the purpose of creating insight.
The results are seen in Figure 7.2, where we have taken λ = −0.2. In spite of using so few
steps, we still observe the expected behavior, with a global error O(∆t) and local residuals
O(∆t 2 ). This corroborates the first order convergence.
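The same experiment can be carried out in code. The following sketch (Python; larger N than in the figure, so as to be in the asymptotic regime) estimates the convergence order of the explicit Euler method on the test problem (7.14) by halving the step size:

```python
import math

lam = -0.2
g  = lambda t: math.sin(math.pi * t)               # exact solution y(t) = g(t)
dg = lambda t: math.pi * math.cos(math.pi * t)
f  = lambda t, y: lam * (y - g(t)) + dg(t)         # test problem (7.14)

def euler_error(N):
    """Global error |y_N - y(1)| of explicit Euler with N steps on [0, 1]."""
    dt, y = 1.0 / N, 0.0
    for n in range(N):
        y += dt * f(n * dt, y)
    return abs(y - g(1.0))

e1, e2 = euler_error(100), euler_error(200)
order = math.log2(e1 / e2)     # should be close to 1 for a first order method
```

Halving ∆t should roughly halve the error, so the estimated order log2 (e1 /e2 ) comes out close to 1.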

The convergence proof for the explicit Euler method is key to a broader under-
standing of time stepping methods. We shall elaborate on the numerical tests by
varying the parameters in the test problem, and also compare the results to those
we obtain for the implicit Euler method, to be studied next. However, before we go
on, it is important to discuss the interpretation of the convergence proof, as well as
some important, critical remarks on what has been achieved so far.


Fig. 7.2 Demonstration of the Explicit Euler method. The simple test problem (7.14) is solved on
[0, 1] with N = 5 steps (top) and then N = 10 steps (bottom). The exact solution y(t) = sin πt is
indicated by emphasized blue curve, while the explicit Euler method generates the red, polygonal,
discrete solution. Each step is taken in the tangential directions of local solutions to the differential
equation (green intermediate curves). The global error at the endpoint is approximately 0.6 for
∆t = 1/5 and half as large, 0.3, for ∆t = 1/10, in agreement with an O(∆t) global error. The
local residuals approximately correspond to the distance between the green curves. Since there are
twice as many green curves in the lower graph, with only half the distance between them, the local
residual is O(∆t 2 )

Remark 1 (Convergence) The notion of a convergent method applies to a general class


of Lipschitz continuous problems whose solutions are smooth enough, in this case with
y ∈ C2 [0, T ]. Thus convergence is a nominal method property, and the convergence order
is the best generic performance the method will exhibit.
However, there are exceptions. For example, if the solution y is a polynomial of degree
1 (a straight line), then ÿ ≡ 0 and the local residual vanishes. The explicit Euler method
then generates the exact solution. Conversely, if a given problem fails to satisfy the Lip-
schitz condition, or if the solution is not in C2 [0, T ], the convergence order may drop, or
the method may fail altogether. In practice, one rarely verifies the theoretical assumptions,
and occasional failures are encountered. Using a convergent method does not guarantee
unconditional success.
Finally, convergence is a notion from analysis, requiring a dense vector space, such as Rd .
In computer arithmetic, there is no “true convergence,” since machine representable num-
bers are few and far between. Even so, it is rare to experience difficulties due to roundoff,
and in most cases the nominal convergence order is observed. In the explicit Euler case, this
means that if ∆t is reduced by a factor of 10, we will typically observe a global error that is
10 times smaller.

Remark 2 (Consistency and stability imply convergence) The error bound has the form

kyN − y(T )k ≤ C(T ) · ∆t,

and it has two essential components. First, the single power of ∆t is due to consistency.
Second, C(T ) must be bounded; this is referred to as stability, and implies that the bound
depends continuously on ∆t. Here C(T ) is sometimes referred to as the stability constant.
Let us have a closer look at where these concepts were employed. We used the inequality
n−1
un ≤ ∑ (1 + µ)n−k−1 vk ,
k=0

and we need uN → 0 as N → ∞. We then need (1 + µ)N to be bounded. This appears to


require µ ≤ 0, but in fact we have a little bit of leeway. In our case

(1 + µ)N = (1 + ∆t · L[ f ])N = (1 + T L[ f ]/N)N → eT L[ f ] ,
N

so the exponential term is bounded for fixed T , even though µ > 0. That is where stability
entered. Without stability, C(T ) would not have been bounded. Thus stability is necessary.
Second, for the error to go to zero, we needed vk → 0. This is where consistency entered.
Above we saw that vk ≤ O(∆t 2 ) → 0, since the local residual is

krn k ≤ O(∆t 2 ).

Without consistency, the upper bound of the global error kyN − y(T )k ≤ C(T ) · ∆t would
not have contained the factor ∆t. Thus consistency is necessary, too.

Remark 3 (What is a stability constant?) The convergence proof above is a mathematical


proof. It is “sharp” in the sense that equality could be attained, but it is far too weak for
numerical purposes. Consider a plain initial value problem

ẏ = −10y; y(0) = 1,

to be solved on [0, T ] with T = 10. The exact solution is y(t) = e−10t , with ÿ = 100y, im-
plying that max |ÿ| = 100. The Lipschitz constant is L[ f ] = 10.
Inserting these data into (7.11), we get

kyN − y(T )k ≤ ∆t · maxt∈[0,T ] kÿk/2 · (eT L[ f ] − 1)/L[ f ] = ∆t · 50 · (e100 − 1)/10 ≈ 1.3 · 1044 ∆t.

Thus C(T ) is stupendous; such “constants” do not belong in numerical analysis. In real
computations, stability constants must have a reasonable magnitude, keeping in mind that
computations are carried out in finite precision, and need to finish in finite time.
Fortunately, the real error is much smaller. Suppose we take ∆t = 0.01 and N = 103 steps
to complete the integration from t = 0 to T = 10. Because y(t) is convex, it is easily seen
that 0 < yn < y(tn ) during the entire integration. Therefore the error at T is certainly less
than y(T ). Now,
y(T ) = e−10T = e−100 = 3.7 · 10−44 ,
so kyN − y(T )k ≈ 3.7 · 10−44 . Thus the error bound overestimates the error by 88 orders of
magnitude. This is unacceptable, especially when the differential equation poses no special
problems at all.

To summarize the analysis above, we note that stability and consistency are two
distinct necessary conditions for convergence. Later on, we will simplify the con-
vergence proofs, reducing them to a matter of verifying the order of consistency,
and stability. Proving consistency is usually easy, only requiring a Taylor series ex-
pansion. Stability is more difficult, but once established, it holds that the order of
convergence equals the order of consistency.
Bridging the gap from consistency to convergence, stability plays the key role.
It will turn up in many different forms depending on the problem type and method
construction. Because the pattern remains the same, we will discover that the Lax
Principle is the most important idea in numerical analysis: consistency and stabil-
ity imply convergence.

7.2 Alternative bounds

The convergence proof derived above is only a “mathematical” proof, and we need
to find better estimates. The problem with the derivation above stems from two things: a reckless use of the triangle inequality, and the consequent damaging use of the Lipschitz constant. As a result, we obtained a stability constant
C(T ) which is so large as to suggest that accurate numerical solution of differential
equations is impossible over reasonably long time intervals.
However, this is not the case. By using logarithmic norms, we will be able to
derive much improved error bounds that support the observation from computational
practice, that most initial value problems can be solved to a very high accuracy.
Going back to the recursion (7.7) for the global error, we had

en+1 = en + ∆t · f (tn , y(tn ) + en ) − ∆t · f (tn , y(tn )) + rn .

Again, we take norms and use the triangle inequality, without splitting the first three
terms. Thus we get

ken+1 k ≤ L[I + ∆t f ]ken k + krn k ≈ (1 + ∆t M[ f ])ken k + krn k,

where the approximation is derived from

L[I + ∆t f ] = 1 + ∆t M[ f ] + O(∆t 2 ),

in accordance with Definition 6.2. Dropping the O(∆t 2 ) term, we have effectively
just replaced the Lipschitz constant L[ f ] in (7.8) by the logarithmic norm M[ f ].
Otherwise, everything remains the same. The convergence proof now only offers
an approximate global error bound, but it is much improved due to the fact that
M[ f ] ≤ L[ f ].

Theorem 7.2. Let the initial value problem ẏ = f (t, y) be given with y(0) = y0 on
a compact interval [0, T ]. Assume that L[ f ] < ∞ for y ∈ Rd , and that the solution

y ∈ C2 [0, T ]. When this problem is solved, taking N steps with the explicit Euler
method using step size ∆t = T /N, the global error at t = T is bounded by

kyN − y(T )k ≲ ∆t · maxt∈[0,T ] kÿk/2 · (eT M[ f ] − 1)/M[ f ]. (7.15)

Remark 1 (Perturbation bound) We note that the error bound (7.15) has the same struc-
ture as the perturbation bound (6.26). While the latter was derived for the differential equa-
tion, the global error bound was derived for the discretization. The shared structure of the
bounds shows that the error accumulation in the discrete recursion is similar to the effect of
a continuous perturbation p(t) in the differential equation.

Remark 2 (The stability constant, revisited) Let us again consider the problem

ẏ = −10y; y(0) = 1,

to be solved on [0, T ] with T = 10. The exact solution is y(t) = e−10t , with ÿ = 100y, im-
plying that max |ÿ| = 100. While the Lipschitz constant is L[ f ] = 10, the logarithmic norm
is M[ f ] = −10.
This gives a completely different error bound. Inserting the data into (7.15), we get

kyN − y(T )k ≤ ∆t · maxt∈[0,T ] kÿk/2 · (eT M[ f ] − 1)/M[ f ] = ∆t · 50 · (e−100 − 1)/(−10) ≈ 5∆t.

Thus the stability constant C(T ) ≈ 5 is moderate in size. The error bound is still an over-
estimate, but the new error bound shows that the numerical method can achieve realistic
accuracy.
Even with the logarithmic norm, the error bound usually overestimates the error. However,
the main difference is that the Lipschitz constant is always positive, so it cannot pick up any
information on the stability of the solutions of the equation. By contrast, the logarithmic
norm distinguishes between forward and reverse time integration, and therefore con-
tains some information about stability. This is necessary in order to have realistic error
bounds.
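The gap between the two stability constants is easily quantified; the following sketch (Python, added for illustration) evaluates both constants for the data of the example (T = 10, L[ f ] = 10, M[ f ] = −10, max kÿk/2 = 50):

```python
import math

T, L, M = 10.0, 10.0, -10.0
half_max_ydd = 50.0                 # max |y''| / 2 = 100 / 2

# Stability constants in the bounds (7.11) and (7.15):
C_lipschitz = half_max_ydd * (math.exp(T * L) - 1.0) / L   # ~ 1.3e44
C_lognorm   = half_max_ydd * (math.exp(T * M) - 1.0) / M   # ~ 5

# The two bounds differ by roughly 43 orders of magnitude on the same problem.
```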

7.3 The Lipschitz assumption

While a much better error bound could be obtained when the logarithmic norm
replaced the Lipschitz constant in the derivation, the classical assumption for establishing existence of solutions on some compact interval [0, T ] is still that the vector
field is Lipschitz with respect to its second argument, i.e.,

L[ f (t, ·)] < ∞.

Noting that it always holds that M[ f ] ≤ L[ f ] (see the basic properties in Theorem
6.2, which apply also to nonlinear maps), the Lipschitz assumption automatically

implies that M[ f ] exists and is bounded. Therefore it is always possible to work with
the logarithmic norm instead of the Lipschitz constant in the estimates, although
this occasionally leads to approximate upper bounds.
More importantly, it may happen that M[ f ] ≪ L[ f ], implying that vastly improved error bounds can be obtained with the logarithmic norm. This is of particular
importance in connection with stiff differential equations, which will be studied
later. Since the error bounds typically contain a factor eT L[ f ] , which can be replaced
by eT M[ f ] (see Theorem 6.7), the difference is enormous. In case T is also large, the
classical bound, based on the Lipschitz constant, loses its computational relevance
altogether. Bounds and estimates have to be as tight as possible.
Unlike the Lipschitz constant, the logarithmic norm may be negative. Thus,
in cases where M[ f ] < 0, we can obtain uniform error bounds, also when T → ∞,
which is otherwise impossible in the classical setting. In fact, stiff problems have
T L[ f ] ≫ 1 but T M[ f ] small or even negative. Such problems cannot be dealt with in
a meaningful way without using the logarithmic norm. Typical examples are found
in parabolic partial differential equations, such as in the diffusion equation.
For this reason, we shall in the sequel start our derivations from the (weaker)
assumption that
M[ f (t, ·)] < ∞,
keeping in mind that this is compatible with classical existence theory for ordinary
differential equations, no matter how large L[ f (t, ·)] is.
Chapter 8
The Implicit Euler Method

The explicit Euler method is


(yn+1 − yn )/∆t = f (tn , yn ),
and is obtained from the finite difference approximation to the derivative,

(y(tn + ∆t) − y(tn ))/∆t ≈ ẏ(tn ).
In the Implicit Euler Method, we instead interpret the difference quotient as an
approximation to ẏ(tn+1 ), which is the right endpoint of the interval [tn ,tn+1 ] rather
than the left. This leads to the discretization
(yn+1 − yn )/∆t = f (tn+1 , yn+1 ).
The method is implicit, because given the point (tn , yn ), we cannot compute yn+1
directly. To use this method, we have to solve the (nonlinear) equation

yn+1 − ∆t f (tn+1 , yn+1 ) = yn (8.1)

on every single step. This leads to a number of questions:


1. Under what conditions can this equation be solved?
2. Which method should be used to solve this equation?
3. Can the added cost of equation solving be justified?

Let us start with existence of solutions. In ordinary differential equations we


generally assume that f is Lipschitz with respect to the second argument. For sim-
plicity, let us assume that L[ f (t, ·)] ≤ L[ f ] < ∞ on all of Rd . This implies that
M[ f ] ≤ L[ f ] < ∞. By Corollary 6.2 we have the following result:

M[∆t f ] < 1 ⇒ L[(I − ∆t f )−1 ] ≤ 1/(1 − ∆t M[ f ]).


Throughout the analysis, we shall assume that M[∆t f ] < 1. We note that this is a
considerably weaker assumption than assuming L[∆t f ] < 1, in which case the fixed
point theorem would apply. Thus, a solution to (8.1) exists for (possibly) much larger
step sizes ∆t than the fixed point theorem would indicate.
This brings us to the second question. If we were to use step sizes ∆t such that
M[∆t f ] < 1 but L[∆t f ] ≫ 1, then obviously we cannot solve the equation by fixed
point iteration. We will see that these conditions are typical. For this reason, we need
to use Newton’s method, which may converge in the operative conditions defined
by M[∆t f ] < 1.
As for the third question, being implicit, the implicit Euler method is going to
be more expensive per step than the explicit method. But using Newton’s method is
going to make it far more expensive per step. Can this extra cost be justified?
There are two possible benefits – if the method is more accurate or has improved
stability properties, it might be possible to employ larger steps ∆t than in the explicit
method. Then the implicit method would compensate for the inefficiency of the
explicit method. It turns out that the advantage is in improved stability, and that
there are cases when the implicit Euler method can use step sizes ∆t that are orders
of magnitude greater than those of the explicit method. These conditions depend on
the differential equation, and do not violate the solvability issues raised in the first
question. Whenever these conditions are at hand, the implicit method easily makes
up for its more expensive computations. The issue is not the cost per step, but the
cost per integrated unit of time, often referred to as the cost per unit step.
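One possible implementation for scalar problems is sketched below in Python (the use of Newton's method follows the discussion above, but the fixed iteration count and the particular test problem are our own illustrative simplifications; production codes iterate to a tolerance):

```python
import math

def implicit_euler(f, dfdy, y0, t0, T, N, newton_steps=8):
    """Implicit Euler for y' = f(t, y): solve equation (8.1) by Newton each step."""
    dt = (T - t0) / N
    y = y0
    for n in range(N):
        t1 = t0 + (n + 1) * dt
        z = y                                  # predictor: previous value
        for _ in range(newton_steps):
            F  = z - dt * f(t1, z) - y         # residual of equation (8.1)
            dF = 1.0 - dt * dfdy(t1, z)        # derivative of that equation
            z  = z - F / dF                    # Newton update
        y = z
    return y

# Mildly stiff example: y' = -50*(y - cos t), y(0) = 0
y_end = implicit_euler(lambda t, y: -50.0 * (y - math.cos(t)),
                       lambda t, y: -50.0, 0.0, 0.0, 1.0, 20)
```

With only N = 20 steps (∆t = 0.05) on this example, ∆t · 50 = 2.5, so an explicit Euler step of the same length would amplify errors, while the implicit method remains stable and tracks the smooth solution.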

8.1 Convergence

Let us begin by investigating consistency. We follow standard procedure and insert


the exact solution y(t) into the discretization, to obtain

y(tn+1 ) = y(tn ) + ∆t · f (tn+1 , y(tn+1 )) − rn , (8.2)

where we want to find the magnitude of the local residual rn . We assume that the
exact solution y is twice continuously differentiable and expand in a Taylor series.
Here we note that we need to expand both y(tn+1 ) and ẏ(tn+1 ) = f (tn+1 , y(tn+1 ))
around tn . Thus we have

y(tn+1 ) = y(tn ) + ∆t · ẏ(tn ) + ∆t 2 · ÿ(tn )/2 + O(∆t 3 )
ẏ(tn+1 ) = ẏ(tn ) + ∆t · ÿ(tn ) + O(∆t 2 ).

Inserting into (8.2) we conclude that

rn = ∆t 2 · ÿ(tn )/2 + O(∆t 3 ). (8.3)

Hence the order of consistency of the implicit Euler method is p = 1, just like for the
explicit method. The only difference is that the local residual in the implicit Euler
method has the opposite sign from that of the explicit Euler method.

Lemma 8.1. Let {y(tn )} denote the exact solution to (7.1) at the time points {tn }.
Assume that y ∈ C2 [0, T ], and that f is Lipschitz. When the exact solution is inserted
into the implicit Euler recursion (8.2), the local residual is

krn k ≤ ∆t 2 · maxt∈[0,T ] kÿ(t)k/2 (8.4)

as ∆t → 0.

As for the global error and convergence, the analysis is now slightly more com-
plicated because the method is implicit. Assuming that ∆t M[ f ] < 1, the inverse map
(I − ∆t · f )−1 exists and is Lipschitz on account of Theorem 6.2. We now have

(I − ∆t · f )(yn+1 ) = yn
(I − ∆t · f )(y(tn+1 )) = y(tn ) − rn .

Inverting I − ∆t f , subtracting and taking norms, it follows that

ken+1 k ≤ L[(I − ∆t f )−1 ] · ken + rn k, (8.5)

where en = yn − y(tn ) is the global error at tn . Again, by Theorem 6.2, we obtain

ken+1 k ≤ ken + rn k/(1 − ∆t M[ f ]) ≤ (ken k + krn k)/(1 − ∆t M[ f ]) ≈ (1 + ∆t M[ f ])ken k + krn k + O(∆t 3 ). (8.6)

Thus we have the approximate difference inequality

ken+1 k ≲ (1 + ∆t M[ f ]) · ken k + krn k,

conforming to Lemma 7.2. Identifying µ = ∆t M[ f ] and vk = krk k, we obtain the


following standard convergence result.

Theorem 8.1. Let the initial value problem ẏ = f (t, y) be given with y(0) = y0 on a
compact interval [0, T ]. Assume that ∆t M[ f ] < 1 for y ∈ Rd , and that the solution
y ∈ C2 [0, T ]. When this problem is solved, taking N steps with the implicit Euler
method using step size ∆t = T /N, the global error at t = T is bounded by

kyN − y(T )k ≲ ∆t · maxt∈[0,T ] kÿk/2 · (eT M[ f ] − 1)/M[ f ]. (8.7)

While this conforms to the bound obtained for the explicit method, there ap-
pears to be little new to learn from the implicit method. Before we run some tests,

comparing the explicit and implicit Euler methods, let us note that there is another
way of deriving the error estimate above. Thus, starting all over, we note that if the
method in a single step starts from a point y(tn ) on the exact solution, it produces an
approximation
ŷn+1 = y(tn ) + ∆t · f (tn+1 , ŷn+1 ). (8.8)
Because ŷn+1 ≠ y(tn+1 ), it is warranted to introduce a special notation for the discrepancy. Thus we introduce the local error ln+1 , defined by

ŷn+1 = y(tn+1 ) + ln+1 (8.9)

The local error is, naturally, related to the local residual. Subtracting (8.2) from
(8.8), we have

ln+1 = ∆t · f (tn+1 , y(tn+1 ) + ln+1 ) − ∆t · f (tn+1 , y(tn+1 )) + rn .

Let us for simplicity assume that f (t, y) = Jy, i.e., that the vector field f is a linear
constant coefficient system. Then

ln+1 = (I − ∆t J)−1 rn . (8.10)

The inverse of the matrix exists, since we assumed ∆tM[ f ] < 1, and M[J] = M[ f ] if
f = J. Obviously, if k∆t Jk → 0 as ∆t → 0, we have ln+1 → rn . Thus, if the step sizes
are small enough to make k∆t Jk ≪ 1, it holds that ln+1 ≈ rn . However, the point in
using the implicit Euler method is to employ large step sizes for which k∆t Jk ≫ 1,
while it still holds that ∆t M[J] ≪ 1. This is the case in stiff differential equations,
where the local error ln+1 can be much smaller than the local residual rn .
For the explicit Euler method, the local residual and the local error are identical;
thus ln+1 ≡ rn . For implicit methods, however, especially in realistic operational
conditions, the difference is significant. In practical computations, it is therefore
more important to control the magnitude of the local error than the local residual.
For this reason, we emphasize the local error perspective in the sequel.
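To make the relation (8.10) concrete, the following sketch (our own illustration, with parameter values chosen only for demonstration) compares the local residual with the local error of the implicit Euler method for a stiff scalar problem with f(t, y) = λy:

```python
import math

# Sketch (our own illustration): for the scalar linear problem y' = lam*y,
# compare the local residual r_n of the implicit Euler method with the
# local error l_{n+1} = r_n / (1 - dt*lam), the scalar version of (8.10).
def local_residual(lam, dt, t):
    # r_n = y(t_{n+1}) - y(t_n) - dt*lam*y(t_{n+1}), with y(t) = exp(lam*t)
    y0, y1 = math.exp(lam * t), math.exp(lam * (t + dt))
    return y1 - y0 - dt * lam * y1

lam, dt, t = -1000.0, 0.1, 0.0   # stiff: |dt*lam| >> 1, yet dt*M[f] < 1
r = local_residual(lam, dt, t)
l = r / (1 - dt * lam)           # local error via (8.10) with J = lam
print(abs(r), abs(l))            # residual ~ 1.0, local error ~ 0.0099
```

Even though the residual is of order one for this large step, the local error is smaller by the factor 1 − ∆tλ ≈ 100, in line with the discussion above.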
Let us now turn to comparing the computational behavior of the explicit and
implicit Euler methods. This will be done by considering a few simple test problems
that illustrate both stability and accuracy. It is important to note that stability is the
highest priority; without stability no accuracy can be obtained.

Example We shall compare the explicit and implicit Euler methods using the same test
problem (7.14) as before, specifically

ẏ = λ (y − sin ωt) + ω cos ωt; y(0) = y0 , (8.11)

with exact solution


y(t) = e^{λt}·y0 + sin ωt.
We shall take y0 = 0 so that the homogeneous solution is not present in the exact solution
y(t) = sin ωt, but the exponential term will show up in the local solutions passing through
the points {yn } generated by the numerical methods. We will take ω = π and solve the
problem on [0, 1], and just like before we will use N = 5 and 10 steps, respectively. The

[Figure 8.1: four panels, explicit Euler (left) and implicit Euler (right) with N = 5 (top) and N = 10 (bottom), plotted vs t ∈ [0, 1]]

Fig. 8.1 Comparing the Euler methods. Test problem (8.11) is solved for λ = −0.2 with N = 5
(top) and N = 10 steps (bottom), comparing explicit Euler (left panels) to implicit Euler (right
panels). Blue curve is the exact solution y(t); red polygons represent the numerical solutions;
green curves represent local solutions through numerically computed points. The local residuals
have opposite signs, since explicit solutions proceed above y(t), and below it in the implicit case.
Going from 5 to 10 steps, the global error is O(∆t) for both methods, with local residuals O(∆t 2 )

prime motivation for the test is to investigate the effect of varying λ , which controls the
damping of the exponential. We will use three different values, λ = −0.2, −2 and −20, and
otherwise use the same computational setup for both methods.
The results are shown in Figures 8.1 – 8.3. For λ = −0.2 the damping is weak and local
solutions almost run “parallel” to the exact solution. The test verifies that the global error is
of the same magnitude for both methods.
When λ = −2 there is more exponential damping. This has the interesting effect that the
global error becomes smaller, due to the fact that the stability constant C(T ) is smaller when
the damping rate increases. In addition, local solutions now show a moderate damping, but
we still observe how the global error is similar in both methods, and still O(∆t).
For λ = −20, there is strong exponential damping, as is evident from the fast contracting
local solutions. The big surprise, however, is that for N = 5 (or, equivalently, ∆t = 0.2), the
explicit method goes unstable, with a numerical solution exhibiting growing oscillations
diverging from the exact solution, even though the initial value was taken on the exact
solution. This effect is due to numerical instability, and will be investigated in detail.
The instability may at first seem surprising, since we do have a convergence proof for the
method, but we note that λ ∆t = −4, which is too large for the method to remain stable. By
contrast, there is no sign of instability in the implicit method, which remains stable. Nor is
there any instability in the explicit method when N = 10 and ∆t = 0.1, which puts λ∆t at −2.
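The experiment is easy to reproduce. The sketch below (our own code; for this linear problem the implicit Euler equation can be solved in closed form at each step) runs the λ = −20, N = 5 case:

```python
import math

# Our own sketch of the experiment above: integrate (8.11),
#   y' = lam*(y - sin(w*t)) + w*cos(w*t),  y(0) = 0,
# with the explicit and the implicit Euler method.
def euler_runs(lam, w, T, N):
    dt = T / N
    ye = yi = 0.0
    for n in range(N):
        t0, t1 = n * dt, (n + 1) * dt
        # explicit Euler step
        ye = ye + dt * (lam * (ye - math.sin(w * t0)) + w * math.cos(w * t0))
        # implicit Euler step, solved for yi_{n+1} in closed form
        yi = (yi + dt * (-lam * math.sin(w * t1) + w * math.cos(w * t1))) / (1 - dt * lam)
    return ye, yi

# lam = -20, N = 5: dt*lam = -4, so the explicit solution oscillates with a
# large error, while the implicit one stays close to y(1) = sin(pi) = 0.
ye, yi = euler_runs(-20.0, math.pi, 1.0, 5)
print(ye, yi)   # roughly 0.69 and -0.016
```

With N = 10 the explicit method regains stability, matching the figures.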

[Figure 8.2: four panels, explicit Euler (left) and implicit Euler (right) with N = 5 (top) and N = 10 (bottom), plotted vs t ∈ [0, 1]]

Fig. 8.2 Comparing the Euler methods. Test problem (8.11) is solved for λ = −2 with N = 5
(top) and N = 10 steps (bottom), comparing explicit Euler (left panels) to implicit Euler (right
panels). When the exponential damping increases, the global error decreases, but otherwise the
results remain similar, with the exception of local solutions clearly displaying a faster damping
rate

However, we have seen that in the convergence proof stability depends on (1 + ∆t·L[f])^N being bounded as N → ∞ and ∆t → 0. While this is still true, we need to recognize that in a practical computation, we fix a ∆t > 0 and then take a large number of steps with that step size. In such a case we have a new situation; the successive powers (1 + µ)^n will naturally grow unless |1 + µ| ≤ 1. We will see that this condition has been violated in our situation when the method goes unstable.

The main discovery in the comparison of the explicit and implicit Euler methods
is that the explicit method suddenly goes unstable when the product of the step size
∆t and the problem parameter λ is too large. This means that we need to develop
a stability theory for the methods. It is not sufficient to investigate the stability of
the mathematical problem and the discrete problem; we need to establish under
what conditions stability of the mathematical problem carries over to the discrete
problem. This will pose (possibly) restrictive conditions on the magnitude of the
step size ∆t.

[Figure 8.3: four panels, explicit Euler (left) and implicit Euler (right) with N = 5 (top) and N = 10 (bottom), plotted vs t ∈ [0, 1]]

Fig. 8.3 Comparing the Euler methods. Test problem (8.11) is solved for λ = −20 with N = 5
(top) and N = 10 steps (bottom), comparing explicit Euler (left panels) to implicit Euler (right
panels). For N = 5, the explicit method shows numerical instability (top left) as indicated by
growing oscillations, diverging from the exact solution. When the step size is shortened (N = 10,
lower left), the method regains stability. In the implicit method, there is no instability at all; the
implicit Euler method can use larger steps than the explicit Euler method. Finally, due to the
strong exponential damping, the global error is very small whenever the computation is stable

8.2 Numerical stability

Numerical stability is investigated in terms of the linear test equation,

ẋ = λ x, x(0) = 1, (8.12)

where λ ∈ C. This requires a special motivation.

Motivation Consider a linear constant coefficient system

ẏ = Ay, (8.13)

where A is diagonalizable by the transformation U⁻¹AU = Λ, and Λ is the diagonal matrix


containing the eigenvalues of A. The transformation y = Ux then implies ẏ = U ẋ, and

U ẋ = AUx ⇒ ẋ = U⁻¹AUx = Λx.

Since A may have λ [A] ∈ C, we take λ ∈ C in (8.12).


If (8.13) is discretized by (say) the explicit Euler method, we obtain

                Euler discretization
   ẏ = Ay  ─────────────────────────▸  yn+1 = (I + ∆t A)yn
      │                                      │
      │ diagonalization                      │ diagonalization
      ▾                                      ▾
   ẋ = Λx  ─────────────────────────▸  xn+1 = (I + ∆t Λ)xn
                Euler discretization

Fig. 8.4 Commutative diagram. Diagonalization U⁻¹AU = Λ of the vector field A commutes with
the Euler discretization, justifying the linear test equation for investigating numerical stability

yn+1 = (I + ∆t A)yn .

Putting yn = Uxn , we get

Uxn+1 = (I + ∆t A)Uxn ⇒ xn+1 = U⁻¹(I + ∆t A)Uxn = (I + ∆tΛ)xn.

But this is the explicit Euler method applied to the diagonalized problem ẋ = Λ x. Thus,
diagonalization and discretization commute; it does not matter in which order these opera-
tions are carried out, see Figure 8.4.
Therefore (8.13) can be analyzed eigenvalue by eigenvalue; this is what the linear test equa-
tion (8.12) does, with λ ∈ λ [A]. This justifies the interest in (8.12) as a standard test problem.
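The commutation can also be checked numerically; the sketch below (our own, with an arbitrarily chosen diagonalizable matrix) verifies that U⁻¹(I + ∆t A)U = I + ∆tΛ:

```python
import numpy as np

# Our own numerical check of the diagram in Fig. 8.4: one explicit Euler
# step commutes with diagonalization, U^{-1}(I + dt*A)U = I + dt*Lambda.
A = np.array([[-2.0, 1.0],
              [0.0, -5.0]])        # an arbitrary diagonalizable matrix
lam, U = np.linalg.eig(A)          # A @ U = U @ diag(lam)
dt = 0.1
left = np.linalg.inv(U) @ (np.eye(2) + dt * A) @ U
right = np.eye(2) + dt * np.diag(lam)
print(np.allclose(left, right))    # True: the operations commute
```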

Now, let us consider the mathematical stability of (8.12). Since the solution is

x(t) = e^{tλ},

it follows that
|x(t)| = |e^{tλ}| = e^{t·Re λ}.
Hence |x(t)| remains bounded for t ≥ 0 if Re λ ≤ 0. For a system ẏ = Ay, this corre-
sponds to having λ [A] ∈ C− , or equivalently α[A] ≤ 0. This is the stability condition
for the differential equation.
Checking the stability of the discretization, we start by investigating the explicit
Euler method. For the linear test equation (8.12) we obtain

xn+1 = (1 + ∆tλ )xn ,

and it follows that |xn+1 | = |1 + ∆tλ | · |xn |. Thus the numerical solution is nonin-
creasing provided that
|1 + ∆tλ | ≤ 1. (8.14)
This is the condition for numerical stability. Unlike mathematical stability, it does
not only depend on λ , but on the step size ∆t as well. More precisely, numerical
stability depends on the product ∆tλ , and does not automatically follow from

[Figure 8.5: stability regions of the explicit Euler (left) and implicit Euler (right) methods in the complex z-plane]

Fig. 8.5 Stability regions of explicit and implicit Euler methods. Left panel shows the explicit
Euler stability region SEE in the complex plane. The method is stable for ∆tλ inside the green
disk. Right panel shows the stability region SIE of the implicit Euler method. The method is stable
in C, except inside the red disk, which corresponds to the region where the method is unstable.
Thus the implicit Euler method is stable in the entire left half-plane, but also in most of the right
half-plane, where the differential equation is unstable

mathematical stability. Instead, numerical stability must be examined method by


method, establishing the unique step size limitations associated with each method.
In (8.14), λ is a complex number. We put z = ∆tλ ∈ C, and rewrite (8.14) as

|1 + z| ≤ 1.

This is the interior of a circle in C, with center at z = −1 and radius 1. It is referred


to as the stability region of the explicit Euler method, formally defined as the disk

SEE = {z ∈ C : |1 + z| ≤ 1}.

The discrete problem is numerically stable for ∆tλ ∈ SEE .


A similar analysis for the implicit Euler method yields

xn+1 = xn + ∆tλ·xn+1 ⇒ xn+1 = (1 − ∆tλ)⁻¹·xn.

Hence the numerical solution remains bounded if |1 − z|⁻¹ ≤ 1, and the stability
region of the implicit Euler method is defined by

SIE = {z ∈ C : |1 − z| ≥ 1}.

Numerical stability requires that ∆tλ ∈ SIE . The shape of the stability region of the
implicit Euler method is also a circle, but now with center at z = 1 and radius 1. The
important difference is that while SEE is the interior of a circle, SIE is the exterior of
a circle. In particular, we note that C− ⊂ SIE . Thus, if ∆tλ ∈ C− , the implicit Euler
method is stable.
The implicit Euler method has a large stability region, covering the entire nega-
tive half plane, while the explicit method has a small stability region, putting strong
restrictions on the step size ∆t. Thus the explicit Euler method can only use short
steps. By contrast, the implicit Euler method is stable whenever ∆tλ ∈ C− . Since
∆t > 0 is real, it follows that there is no restriction on the step size if λ ∈ C− .
For this reason, the method is sometimes referred to as unconditionally stable. The
stability regions of the explicit and implicit Euler methods are found in Figure 8.5.
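The two stability conditions are easy to test programmatically; in the sketch below (our own helper functions), z = ∆tλ:

```python
# Our own helpers for the two stability conditions derived above,
# with z = dt*lambda:
def stable_ee(z: complex) -> bool:
    return abs(1 + z) <= 1          # z in S_EE

def stable_ie(z: complex) -> bool:
    return abs(1 - z) >= 1          # z in S_IE

print(stable_ee(-0.2), stable_ie(-0.2))   # True True
print(stable_ee(-4.0), stable_ie(-4.0))   # False True
print(stable_ee(3.0), stable_ie(3.0))     # False True: IE stable even in
                                          # much of the right half-plane
```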
We can now analyze the results we obtained when testing the two methods.

Example The previous test problem, used to assess the properties of the explicit and im-
plicit Euler methods, was
ẏ = λ (y − g) + ġ.
This has a particular solution y(t) = g(t) and exponential homogeneous solutions. Putting
x = y − g, the test problem is transformed into the linear test equation ẋ = λ x. Thus, sta-
bility is only a function of ∆tλ , and only depends on the homogeneous solutions and the
method’s ability to handle exponential solutions.
In the previous tests, we used two step sizes, ∆t = 0.2 and ∆t = 0.1. We also used three
different values of λ , viz., λ = −0.2, λ = −2, and λ = −20. Since in all cases λ ∈ C− , the
implicit Euler method is stable, no matter how the step size is chosen. This explains why no
stability issues were observed.
For the explicit Euler method, we have the following table of parameter combinations:

Parameters λ = −0.2 λ = −2 λ = −20

∆t = 0.1 ∆tλ = −0.02 ∆tλ = −0.2 ∆tλ = −2

∆t = 0.2 ∆tλ = −0.04 ∆tλ = −0.4 ∆tλ = −4

From this table we see that only one parameter combination, ∆tλ = −4, is such that ∆tλ ∉ SEE, causing numerical instability. Another combination, ∆tλ = −2, is marginally stable. This is in full agreement with the tests, and for ∆tλ = −4 the numerical solution became oscillatory and diverged from the exact solution, see Figure 8.3. This is the typical behavior
when numerical instability is encountered.
[Figure 8.6: solutions plotted vs t ∈ [0, 1]]

Fig. 8.6 Vector field and flow of a stiff problem. The test problem (8.11) is solved using the implicit
Euler method with N = 20 steps. The exact solution y(t) = sin πt (blue) and the discrete solution
(red markers) are plotted vs t ∈ [0, 1]. Neighboring solutions (green) to the differential equation, for
other initial conditions, illustrate the vector field away from the particular solution. At λ = −30,
these solutions quickly converge on the particular solution. This is typical of stiff problems

8.3 Stiff problems

Stiff initial value problems require time stepping methods with stability regions cov-
ering (near) all of C− . The simplest example of such a method is the implicit Euler
method. The simplest illustration of a stiff problem is the Prothero–Robinson test
problem (8.11), i.e.,
ẏ = λ (y − g) + ġ.
whose particular solution is y(t) = g(t), and where the homogeneous solutions are
exponentials e^{tλ}.
Stiffness is a question of how the numerical method interacts with the problem. If λ ≪ −1, the homogeneous solutions decay very fast, after which the solution is near the particular solution, y(t) ≈ g(t), no matter what the initial condition was. This is demonstrated in Figure 8.6 for λ = −30. This is not a very stiff problem, but the parameter values are chosen to make the effect readily visible to the naked eye. By taking λ ≪ −30 the neighboring flow becomes nearly vertical.

Example (Stiffness) In the Prothero–Robinson test problem, assume that λ = −1000,


and that g(t) = sin t. Analyzing the stability of a time stepping method for this problem is
equivalent to considering the linear test equation (8.12) with the given λ .

Let us consider the numerical stability of the explicit Euler method. Stability requires |1 + ∆t λ| ≤ 1, or, given that λ < 0 is real,

∆t ≤ ∆t_S = 2/|λ| = 2 · 10⁻³,

where ∆t_S is the maximum stable step size.
Meanwhile, approximating the particular solution y(t) ≈ sin t using the explicit Euler method produces a local residual (equal to its local error)

|rn| ≈ ∆t²·|ÿ|/2 ≤ ∆t²/2.

Let us assume that we need an accuracy specified by |ln| ≤ TOL = 10⁻⁴, where TOL is a prescribed local error tolerance. Then, obviously, we can accept a step size

∆t_TOL = √(2·TOL) = 1.4 · 10⁻².

However, it will not be possible to use such a step size, because the method would go
unstable. There is a conflict between the stability requirement and the accuracy requirement,
since ∆t_S < ∆t_TOL. This is the problem of stiffness; being restricted by having to maintain
stability, an explicit method cannot reach its potential accuracy.
The problem is overcome by using an appropriate implicit method. For example, as the
implicit Euler method is unconditionally stable, it has no stability restriction on ∆t. Its
local residual is the same as that of the explicit method, so it will be possible to use ∆t =
1.4 · 10⁻². In fact, one can use an even larger step size, as the local error is smaller. Thus

ln+1 = rn/(1 − ∆t λ) ≈ ∆t²/(2 + 2·10³·∆t) ≈ ∆t/(2·10³) ≤ TOL,

which requires ∆t ≤ 2·10³·TOL = 0.2. This step size will produce the requested accuracy,
and, since it is 100 times larger than the maximum stable step size ∆t S available to the
explicit Euler method, the implicit Euler method is likely going to be far more efficient,
even though it is more expensive per step due to the necessary equation solving.
In real applications, one often encounters stiff problems where the ratio ∆t_TOL/∆t_S ≫ 1 for any explicit method. This ratio can be arbitrarily large, making it impossible to solve such
stiff problems unless dedicated methods are used. On the other hand, when an appropriate
implicit method is used, these problems can often be solved very quickly.
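The step size limits in the example above can be restated in a few lines (our own sketch, using the formulas derived in the example):

```python
import math

# The step size limits of the stiffness example, restated (our own sketch):
lam, TOL = -1000.0, 1e-4

dt_S = 2 / abs(lam)                   # explicit Euler stability limit
dt_TOL_explicit = math.sqrt(2 * TOL)  # accuracy limit from |r_n| ~ dt^2/2 <= TOL
dt_TOL_implicit = 2 * abs(lam) * TOL  # accuracy limit from l_{n+1} ~ dt/(2|lam|)

print(dt_S)              # 0.002: stability forces tiny explicit steps
print(dt_TOL_explicit)   # ~0.014: accuracy alone would allow this
print(dt_TOL_implicit)   # 0.2: implicit Euler may take 100x larger steps
```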

To develop a comprehensive theory of stiffness, we may consider a system of


nonlinear equations having a structure similar to the Prothero–Robinson example.
Thus we will consider
ẏ = f (y − g) + ġ, (8.15)
where f : Rd → Rd is a nonlinear map having f (0) = 0. Then g can be viewed as
a “particular solution,” as y(t) = g(t) satisfies (8.15). But since the initial condition
y(0) may not be chosen to equal g(0), there is also a nonlinear transient, correspond-
ing to the “homogeneous solution.”
Putting u = y − g, (8.15) is turned into the simpler nonlinear problem

u̇ = f (u), u(0) = u0 . (8.16)



The central issue is now whether an explicit method would suffer severe step size
restrictions when solving this problem. This depends on the mathematical stability
properties of the system, and in particular on whether there are strongly damped so-
lutions near the zero solution u = 0, corresponding to the linear Prothero–Robinson
problem with λ  −1.
If the initial value u(0) is small enough, the system (8.16) can be linearized
around the equilibrium solution u = 0, to obtain

u̇ ≈ f′(0)u,

where f′(0) is the Jacobian matrix of f evaluated at u = 0, as long as ‖u‖₂ ≪ 1.


Since the matrix is constant, the linearized problem is a simple linear system,

u̇ = Au,

where A ∈ Rd×d . Unlike having a single, scalar λ as before, we now have to deal
with a matrix (and its spectrum), asking whether it will lead to stability restrictions
on the step size ∆t. This is done using the techniques of Chapter 6.
Investigating its mathematical stability, we take the inner product with u to find the differential inequalities

m₂[A] · ‖u‖₂ ≤ d‖u‖₂/dt ≤ M₂[A] · ‖u‖₂.

Here m₂[A] and M₂[A] are the lower and upper logarithmic norms, respectively, and the differential inequalities imply that

e^{t·m₂[A]} ≤ ‖e^{tA}‖₂ ≤ e^{t·M₂[A]}.

Thus the matrix exponential is bounded above and below. The lower bound gives
the maximum decay rate of homogeneous solutions, while the upper bound gives
the maximum growth rate.
In the Prothero–Robinson example above, we saw that the stability restriction in an explicit method is caused by a fast decay rate. This happened when λ ≪ −1. In a linear system, we have such a fast decay rate when m₂[A] ≪ −1. This is characteristic of stiff problems: m₂[A] ≪ −1 is a necessary condition for stiffness. To give an alternative interpretation, if one were to integrate the problem in reverse time, then we would solve u̇ = −Au, whose maximum growth rate is M₂[−A] = −m₂[A] ≫ 1. Thus the reverse time problem has very unstable solutions.
Since mathematical and numerical stability are not equivalent, we again consider
the explicit Euler method. We then see that a sufficient condition for numerical
stability is the circle condition

‖I + ∆t A‖₂ ≤ 1.

By the triangle inequality, −1 + ‖∆t A‖₂ ≤ ‖I + ∆t A‖₂ ≤ 1, the circle condition implies ‖∆t A‖₂ ≤ 2. However, ‖∆t A‖₂ ≥ −m₂[∆t A]. Therefore, if m₂[∆t A] ≪ −1 the circle condition cannot possibly be satisfied, and instability is bound to happen. It follows that m₂[∆t A] ≪ −1 is a necessary condition for stiffness.
However, a system of equations is more complicated than a scalar equation. The
investigation of the scalar problem excluded growing solutions. In a system, it is
possible that we have growing as well as decaying solutions. We therefore introduce
the average decay rate,

s[A] = (m₂[A] + M₂[A])/2.

The reason why the condition m₂[A] ≪ −1 alone might not cause stiffness is that if M₂[A] ≫ 1 at the same time, the system has both rapidly decaying and rapidly growing solutions. While the decaying solutions put a stability restriction on the step size, the growing solutions put an accuracy restriction, since they must be resolved.

Definition 8.1. Let A ∈ Rd×d . The stiffness indicator of A is defined by

s[A] = (m₂[A] + M₂[A])/2. (8.17)

Here we note that if A = λ ∈ R, then s[λ] = λ. The stiffness indicator is compatible with the previous discussion of scalar problems, and we can proceed to generalize the concept to systems. For scalar problems, λ puts a restriction on the step size. More generally, we now need to relate s[A] to a time scale τ, which is not
necessarily the same as the step size ∆t.

Definition 8.2. Assume that s[A] < 0. Then the local reference time scale τ is de-
fined by
τ = −1/s[A]. (8.18)

The reference time scale approximates the largest step size by which an explicit
method can proceed, without losing numerical stability. Any desired time interval,
be it the length of the integration interval or the preferred step size, can be related to
the reference time scale.

Definition 8.3. Let τ be the local reference time scale. For a given step size ∆t the
stiffness factor is defined by
r(∆t) = ∆t/τ.

Irrespective of whether an explicit or implicit method is used, stiffness is deter-


mined by whether a step size ∆t > τ is desired or not. This depends on many factors,
not least the accuracy requirement and error tolerance T OL. As a simple observa-
tion, if the problem is to be solved on [0, T ], stiffness cannot occur if r(T ) < 1, since

the step sizes will obviously be shorter than τ in such a case. However, it may very
well occur that r(T) ≫ 1, in which case the problem may turn out to be stiff.
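For the Euclidean norm, m₂[A] and M₂[A] are the extreme eigenvalues of the symmetric part (A + Aᵀ)/2, so the stiffness indicator and the reference time scale are easily computed; the sketch below (our own helper, with illustrative names) does so for a simple stiff matrix:

```python
import numpy as np

# Our own sketch: in the Euclidean norm, m2[A] and M2[A] are the extreme
# eigenvalues of the symmetric part (A + A^T)/2, from which the stiffness
# indicator (8.17) and reference time scale (8.18) follow.
def stiffness_data(A):
    ev = np.linalg.eigvalsh((A + A.T) / 2)   # eigenvalues in ascending order
    m2, M2 = ev[0], ev[-1]
    s = (m2 + M2) / 2                        # stiffness indicator s[A]
    tau = -1 / s if s < 0 else np.inf        # reference time scale
    return m2, M2, s, tau

A = np.diag([-1000.0, -1.0])                 # a simple stiff linear system
m2, M2, s, tau = stiffness_data(A)
print(s, tau)    # s = -500.5, tau ~ 0.002
```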
We are now in a position to discuss stiffness in more general problems. Following
the techniques outlined in Section 6.5, we consider two neighboring solutions to a
nonlinear problem,

u̇ = f (u), u(0) = u0
v̇ = f (v), v(0) = u0 + ∆ u0 ,

where we no longer require that f(0) = 0. Thus we can consider stiffness in terms
of perturbations along a non-constant solution. The difference ∆ u = v − u satisfies
d∆u/dt = f(v) − f(u).
As before, we will only consider small (“infinitesimal”) perturbations ∆ u, allowing
us to linearize the perturbed problem around the non-constant solution u. Taking the
inner product with ∆ u, we obtain the differential inequalities
m₂[J(u)] · ‖∆u‖₂ ≤ d‖∆u‖₂/dt ≤ M₂[J(u)] · ‖∆u‖₂,
where J(u) = f 0 (u) is the Jacobian matrix of f , evaluated along the nominal solution
u(t). Thus the matrix is no longer constant but varies along the solution trajectory.
Nevertheless, the same theory applies, and if s[J(u)] < 0, we obtain a reference time
scale τ(u). Thus the stiffness factor too can vary along the solution. Stiffness occurs
whenever we need to use a step size ∆t such that

r(∆t) = ∆t/τ(u) ≫ 1.

By evaluating s[J(u)] along a trajectory stiffness can be assessed locally.

Remarks on stiffness For any nonlinear system u̇ = f (u), with f ∈ C1 , stiffness is defined
locally at any point u of the vector field, in terms of s[ f 0 (u)]. The stiffness indicator deter-
mines a local time scale τ(u). In case f = A is a linear constant coefficient system, s[A] and
τ are constant on [0, T ].
1. Since |s[J(u)]| ≤ L[f], a necessary condition for stiffness is that T·L[f] ≫ 1. However, the latter cannot be used as a characterization of stiffness. As L[f] = L[−f], this quantity is independent of whether the problem is integrated in forward time or reverse time. One of the most typical characteristics of stiffness is that the problem has strong damping in forward time, and is strongly unstable in reverse time. This is reflected by s[J(u)].
2. Depending on the requested error tolerance TOL, as well as on the choice of method, the step size ∆t may have to be chosen shorter than τ; in such a case stiffness will not occur either. Likewise, should s[J(u)] become positive during some subinterval, stiffness is no longer an issue there. Unless T·s[J(u)] ≪ −1, stiffness will not occur; this means that any time stepping method can be used, without loss of efficiency.
3. It is not uncommon to encounter problems where T·s[J(u)] ≈ −10⁶ or of even greater magnitude. In some problems of practical interest T·s[J(u)] ≈ −10¹² or beyond; in such cases, an explicit method will never finish the integration, as trillions of steps would be necessary. By contrast, there are stiff problems where a well designed implicit method solves the problem in N steps, where N is practically independent of T·s[J(u)], or of T·L[f].
4. Given the choice between an explicit and an implicit method, using the same error tolerance, the explicit method may be restricted to using step sizes ∆t ≪ τ, while an "unconditionally stable" implicit method might be able to employ a step size ∆t ≫ τ.

8.4 Simple methods of order 2

The explicit and implicit Euler methods are the simplest time stepping methods, but
they illustrate the essential aspects of such methods. They are simple to analyze and
can be understood both intuitively and theoretically. However, the methods are only
first order convergent and therefore of little practical interest. In real computations
we need higher order methods. Before we proceed to advanced methods, we shall
consider a few simple methods of convergence order p = 2.
The construction of the Euler methods started from approximating the derivative
by a finite difference quotient. If y ∈ C1 [tn ,tn+1 ], then, by the mean value theorem,
there is a θ ∈ [0, 1] such that

(y(tn+1) − y(tn))/(tn+1 − tn) = ẏ((1 − θ)tn + θ·tn+1) = ẏ(tn + θ(tn+1 − tn)).
But this is only an existence theorem, not telling us the value of θ . In the explicit
Euler method, we used θ = 0 to create a first order approximation. Likewise, in the
implicit Euler we used θ = 1. However, there is a better choice. Thus, assuming that
y is sufficiently differentiable, expanding in Taylor series around tn yields, for the
left hand side,
(y(tn+1) − y(tn))/(tn+1 − tn) = ẏ(tn) + ((tn+1 − tn)/2)·ÿ(tn) + O((tn+1 − tn)²);
and for the right hand side,

ẏ(tn + θ(tn+1 − tn)) = ẏ(tn) + θ(tn+1 − tn)·ÿ(tn) + O((tn+1 − tn)²).

Matching terms, we see that by taking θ = 1/2, we have a second order approx-
imation. This is the best that can be achieved in general, and corresponds to the
approximation

(y(tn+1) − y(tn))/(tn+1 − tn) ≈ ẏ((tn + tn+1)/2).
We can transform this into a computational method for first order IVP’s in two ways.
For obvious reasons, both are referred to as the midpoint method; one is explicit and
the other implicit.

Beginning with the Implicit Midpoint method, we define the scheme


 
yn+1 − yn = ∆t · f((tn + tn+1)/2, (yn + yn+1)/2). (8.19)

The method is implicit since yn+1 appears both in the left and right hand sides.
The explicit construction is just a matter of re-indexation. We use three consecu-
tive equidistant points tn−1 ,tn and tn+1 , all separated by a step size ∆t. Then

(y(tn+1) − y(tn−1))/(2∆t) = ẏ(tn) + O(∆t²).
This leads to the Explicit Midpoint method, defined by

yn+1 − yn−1 = 2∆t · f (tn , yn ). (8.20)

This is a two-step method, since it needs both yn−1 and yn to compute yn+1 . On the
other hand, the method is explicit and needs no equation solving.
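Being a two-step method, (8.20) also needs a starting value y₁ before the recursion can begin; the sketch below (our own code, using one explicit Euler step for startup, which is one of several possible choices) applies the method to ẏ = λy:

```python
import math

# Our own sketch: the explicit midpoint method (8.20) needs two starting
# values; here y_1 is generated by one explicit Euler step (an assumed,
# but common, startup choice). Test problem: y' = lam*y, y(0) = 1, on [0, 1].
def explicit_midpoint(lam, T, N):
    dt = T / N
    y_prev = 1.0
    y = y_prev + dt * lam * y_prev               # startup: one explicit Euler step
    for _ in range(1, N):
        y_prev, y = y, y_prev + 2 * dt * lam * y  # y_{n+1} = y_{n-1} + 2*dt*f(y_n)
    return y

err = abs(explicit_midpoint(-1.0, 1.0, 200) - math.exp(-1.0))
print(err)   # small on this short, nonstiff interval
```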
There is a third, less obvious, construction. Consider the approximation
 
ẏ((tn + tn+1)/2) ≈ (ẏ(tn) + ẏ(tn+1))/2.

This means that we approximate the derivative at the midpoint by the average of the
derivatives at the endpoints. Expanding ẏ(tn ) and ẏ(tn+1 ) around the midpoint, we
obtain
ẏ(tn) = ẏ(t̄) − (∆t/2)·ÿ(t̄) + O(∆t²),
ẏ(tn+1) = ẏ(t̄) + (∆t/2)·ÿ(t̄) + O(∆t²),
where t¯ = (tn + tn+1 )/2 represents the midpoint. It follows that

(ẏ(tn) + ẏ(tn+1))/2 = ẏ(t̄) + O(∆t²)
is a second order approximation. This leads to the Trapezoidal Rule,
 
yn+1 − yn = ∆t · (f(tn, yn) + f(tn+1, yn+1))/2. (8.21)

It is an implicit method, and it is obviously related to the implicit midpoint method.


Thus, if we consider a linear constant coefficient problem ẏ = Ay, both methods
produce the discretization
 
yn+1 − yn = ∆t · (A·yn + A·yn+1)/2.

Solving for yn+1 , we obtain

yn+1 = (I − ∆t·A/2)⁻¹(I + ∆t·A/2)·yn.

However, the two methods are no longer identical for nonlinear systems, or for linear
systems with time dependent coefficients.
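For the linear test equation, then, each trapezoidal step reduces to multiplication by R = (1 + ∆tλ/2)/(1 − ∆tλ/2), which makes second order convergence easy to verify numerically (our own sketch):

```python
import math

# Our own convergence check: for y' = lam*y each trapezoidal step (8.21)
# reduces to multiplication by R = (1 + dt*lam/2)/(1 - dt*lam/2).
def trapezoidal_error(lam, N):
    dt = 1.0 / N
    R = (1 + 0.5 * dt * lam) / (1 - 0.5 * dt * lam)
    y = 1.0
    for _ in range(N):
        y *= R
    return abs(y - math.exp(lam))   # global error at t = 1

lam = -1.0
e1, e2 = trapezoidal_error(lam, 50), trapezoidal_error(lam, 100)
print(e1 / e2)   # close to 4: halving dt divides the error by ~4, so p = 2
```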
Among advanced methods, there are two dominating classes, Runge–Kutta
(RK) methods, and linear multistep (LM) methods. While the latter may use an
arbitrary number of steps to advance the solution, RK methods are one-step meth-
ods. We shall study both method classes below.
Of the three second order methods above, the explicit midpoint method is in the
LM class, but not in RK. The implicit midpoint method is in the RK class, but not in
LM. Finally, the trapezoidal rule, as well as the explicit and implicit Euler methods
studied before, can be seen as (some of the simplest) members of both the LM and
the RK class.
Let us now turn to comparing these methods. For simplicity, we take the scalar Prothero–Robinson test problem (8.11), which means that we can solve the equations in the implicit methods exactly. In Figure 8.7 the trapezoidal rule is compared to the explicit Euler method. The test demonstrates the need for higher order methods.
Thus, going from a first order to a second order method has a strong impact on
accuracy. Although the step size is the same for both methods, the second order
method can achieve several orders of magnitude higher accuracy. The same relative
effect takes place each time we select a higher method order. Since it is possible
to construct methods of very high convergence orders, it is in practice possible to
solve many problems to full numerical accuracy at a reasonable computational cost.
Modern codes typically implement methods of up to order p = 6, although there are
standard solvers of even higher orders.
The test problem in Figure 8.7 is not stiff, but slightly unstable. It is not a particularly difficult problem. It is solved using a rather coarse step size, again to emphasize the differences in accuracy. In real computations the step size would be smaller, making the difference even more pronounced. To see the effect as a function of the step size ∆t, we compare the explicit and implicit Euler methods to the trapezoidal rule in Figure 8.8. Here, the setting is moderately stiff, at λ = −50, so as to also demonstrate when the explicit method goes unstable. More importantly, we see that for smaller (but still relevant) step sizes, the second order method achieves several orders of magnitude better accuracy.
This is the central idea in discretization methods: in a convergent method, ac-
curacy increases as ∆t → 0, but the smaller the step size, the more computational
effort is needed. So how do we obtain high accuracy without a too large compu-
tational cost? The answer is by using high order methods. Then the error can be
made extremely small, even without taking ∆t exceedingly small.
The only concern is that we have to make sure that the method remains stable,
since stability is required in order to have convergence. Note that this is not a matter

[Figure 8.7: Explicit Euler, N = 50 (top) and Trapezoidal Rule, N = 50 (bottom), plotted vs t ∈ [0, 4]]

Fig. 8.7 The effect of second order convergence. The test problem (8.11) is solved using the explicit Euler method (top) and with the trapezoidal rule (bottom), both using N = 50 steps. The exact solution y(t) = sin πt (blue) and the discrete solution (red markers) are plotted vs t ∈ [0, 4], covering two full periods. The choice λ = 0.1 makes the problem slightly unstable, posing a greater challenge to the methods. The first order explicit Euler method has a readily visible and growing error. By contrast, the second order trapezoidal rule offers much higher accuracy

of the differential equation being unstable; such problems can be solved too, as
demonstrated in Figure 8.7. Instead, it is a matter of whether the method as such
is a stable discretization. To illustrate this point, we return to the test problem
(8.11), and solve it using the explicit midpoint method. The results are found in
Figure 8.9. Comparing the explicit midpoint method to the trapezoidal rule, both of
second order, we find that there are substantial differences. Here we have returned
to a nonstiff problem, with λ = −1, but even so, the explicit method soon develops
unacceptable oscillations. These are due to instability, although the issue is less
serious than before. Even so, the trapezoidal rule is far better, as the error plots
demonstrate.
This test suggests that stability, convergence and accuracy are delicate matters
that require a deep understanding. We shall later return to these methods and find
that the explicit midpoint method has a unique niche in Hamiltonian problems
(e.g. in celestial mechanics) and in hyperbolic conservation laws. This is due to
the mathematical properties of such problems.
[Figure 8.8: three panels, EE, IE and TR, each showing error magnitude against dt on logarithmic axes.]

Fig. 8.8 Global error vs. step size. The test problem (8.11) is solved using explicit Euler (left),
implicit Euler (center) and the trapezoidal rule (right), over [0, 1] for λ = −50. Red graphs show
the global error magnitude at t = 1 as a function of ∆t. The Euler methods are of order p = 1, as
indicated by a dashed reference line of slope 1. The trapezoidal rule is of order p = 2, indicated
by a reference line of slope 2. The error of the trapezoidal rule is 1000 times smaller at ∆t = 10−4 ,
showing the impact of higher order methods. The explicit Euler error graph goes “through the roof”
at ∆t ≥ 4 · 10−2 due to numerical instability, when ∆t λ is outside the method’s stability region

To analyze the instability of the explicit midpoint method, we apply it to the
linear test equation ẏ = λ y. We then obtain the recursion

yn+1 − yn−1 = 2∆t λ yn .

Putting q = ∆t λ , this is a linear difference equation

yn+1 − 2q yn − yn−1 = 0.

This has stable solutions provided that the roots of the characteristic equation lie
in the closed unit disk, with any roots of unit modulus being simple. The characteristic equation is

κ 2 − 2q κ − 1 = 0.

Since this can be factorized into (κ − κ1 )(κ − κ2 ) = 0, where κ1 , κ2 are the roots of
the characteristic equation, we see that κ1 κ2 = −1. Thus, if one root is less than one
in magnitude, the other is greater than one. Therefore we can write
[Figure 8.9: panels "Explicit midpoint method, N=75" (top), "Trapezoidal rule, N=75" (center), and "Error magnitudes" in lin-log scale (bottom), plotted against t ∈ [0, 6].]

Fig. 8.9 Comparison of second order methods. The test problem (8.11) is solved using the explicit
midpoint method (top) and the trapezoidal rule (center), over [0, 6] for λ = −1. Both methods have
order p = 2 and use N = 75 steps. The explicit method develops growing oscillations over
time, indicative of instability. No such issues are observed in the implicit method. Graphs of how
the error evolves over time (bottom) show that while the trapezoidal rule (green) maintains an error
never exceeding 5 · 10−3 , the explicit midpoint method has an exponentially growing error (red),
as indicated by the straight trendline in the lin-log diagram

κ1 = e^{iϕ} ; κ2 = −e^{−iϕ} .

Now, since we must also have κ1 + κ2 = 2q, we obviously have

2q = e^{iϕ} − e^{−iϕ} = 2i sin ϕ.

Consequently, q = i sin ϕ, and it follows that

∆t λ = i sin ϕ.

Since ∆t > 0, the surprising result is that the method is only stable when λ is on the
imaginary axis. Writing λ = iω, we must therefore have

∆t ω ∈ (−1, 1).

Note that we cannot allow |∆t ω| = 1 since we would then have a double root,
leading to unbounded solutions.
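This root analysis is easy to confirm numerically. The sketch below (not from the text) computes the roots of κ^2 − 2qκ − 1 = 0 directly: for real q ≠ 0 one root always leaves the unit disk, while for q = i sin ϕ with |q| < 1 both roots stay on the unit circle.

```python
import cmath

def midpoint_roots(q):
    """Roots of kappa^2 - 2*q*kappa - 1 = 0, the characteristic equation of
    the explicit midpoint method applied to y' = lam*y, with q = dt*lam."""
    d = cmath.sqrt(q * q + 1)
    return q + d, q - d

# Real q != 0: the product of the roots is -1, so one root leaves the unit disk.
k1, k2 = midpoint_roots(-0.1)
print(abs(k1 * k2))                # ≈ 1  (k1*k2 = -1)
print(max(abs(k1), abs(k2)) > 1)   # True: unstable for real lam != 0

# Purely imaginary q with |q| < 1: both roots stay on the unit circle.
k1, k2 = midpoint_roots(0.5j)
print(abs(k1), abs(k2))            # both ≈ 1
```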
[Figure 8.10: stability regions of the trapezoidal rule (left) and the explicit midpoint method (right) in the complex ∆t λ plane.]

Fig. 8.10 Stability regions. The stability region of the trapezoidal rule is the entire negative half-
plane C− , as indicated by the green area in the left panel. The stability region of the explicit
midpoint method, however, is very “small” (right panel) as it only consists of the open interval
i(−1, 1). Thus the endpoints ±i are not included

The conclusion is that the method is only stable for ∆t λ on the open interval
i · (−1, 1) in the complex plane. This is simply a short strip of the imaginary axis.
Since we used λ = −1 in our test, we chose λ in the negative half plane. The method
is therefore obviously unstable, no matter how we choose ∆t. For this reason, the
method is unsuitable for the test problem.
By contrast, the trapezoidal rule is stable, which explains why its performance is
superior. No matter how λ is chosen in the left half-plane, the trapezoidal rule will not fail. To see this,
we again consider the linear test equation, ẏ = λ y. We obtain the recursion

yn+1 − yn = (∆t λ /2)(yn+1 + yn ),

resulting in

yn+1 = ((1 + z/2)/(1 − z/2)) yn = R(z) yn ,
where z = ∆t λ . Thus we need |R(z)| ≤ 1 for stability. This implies that

|1 + z/2| ≤ |1 − z/2|.

This requires that the distance from the point z/2 ∈ C to +1 is at least as great as
its distance to −1 in the complex plane. This implies that z ∈ C− . The
method’s stability region is therefore S TR ≡ C− . The stability regions of the explicit
midpoint method and the trapezoidal rule are shown in Figure 8.10.
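The condition |R(z)| ≤ 1 with R(z) = (1 + z/2)/(1 − z/2) can be probed directly; a small check (the sample points are illustrative choices, not from the text):

```python
def R(z):
    """Stability function of the trapezoidal rule."""
    return (1 + z/2) / (1 - z/2)

# |R(z)| <= 1 exactly for z in the closed left half-plane:
print(abs(R(-1.0 + 5j)) <= 1)   # True  (Re z < 0)
print(abs(R(-1000.0)))          # ≈ 0.996: stable even for very stiff dt*lam
print(abs(R(0.5 + 5j)) <= 1)    # False (Re z > 0)
```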
This explains why two second order methods can produce such different results.
In fact, the stability properties of the explicit midpoint method need to be qualified.
We have already determined its stability region. Let us again assume that we solve
ẏ = λ y with λ ∈ R to obtain the linear difference equation

yn+1 − 2q yn − yn−1 = 0,

where q = ∆t λ , and characteristic equation κ 2 − 2q κ − 1 = 0. As noted above, the
product of the roots is −1, and the sum is 2q. If λ is real, then q is real, so for
stability the only possibility is that κ1 = 1 and κ2 = −1. The sum is zero, implying
λ = 0. Thus, if λ ≠ 0 is real, the method is necessarily unstable, with roots

κ1 ≈ 1 + ∆t λ , κ2 ≈ −1/(1 + ∆t λ ) ≈ −1 + ∆t λ .
It follows that the solution has the behavior

yn ≈ C1 (1 + ∆t λ )^n + C2 (−1)^n (1 − ∆t λ )^n ≈ C1 e^{tn λ} + C2 (−1)^n e^{−tn λ} ,

where tn = n∆t. Thus, there is a discrete oscillatory behavior, indicated by the factor (−1)^n . In case λ ∈ R− , the amplitude of this oscillation grows exponentially,
as indicated by the factor e^{−tn λ} . This is confirmed in Figure 8.9, in full
agreement with theory, but the undesirable behavior only emerges over time, since the
coefficient C2 is very small. The method does remain convergent, but it is not “stable” in the same sense as the other methods considered so far. The explicit midpoint
method is weakly stable, and is only of practical value for problems where λ = iω
is located on the imaginary axis.
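The weak instability can be reproduced in a few lines. The sketch below is not from the text; the starting value y1 is taken from one explicit Euler step, which is one common choice and is what seeds the small coefficient C2.

```python
import math

def midpoint_solve(lam=-1.0, T=6.0, N=75):
    """Explicit midpoint (two-step) method on y' = lam*y, y(0) = 1.
    The extra starting value y_1 is taken from one explicit Euler step."""
    dt = T / N
    y = [1.0, 1.0 + dt * lam]
    for n in range(1, N):
        y.append(y[n-1] + 2 * dt * lam * y[n])
    return y, dt

y, dt = midpoint_solve()                                  # lam = -1, exact solution e^{-t}
errors = [abs(yn - math.exp(-n * dt)) for n, yn in enumerate(y)]
# The parasitic mode grows like e^{+t}: the late-time error dwarfs the early one.
print(errors[10] < 0.01, errors[-1] > 0.1)   # True True
```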
The qualification of stability that is needed is the following. A method is called
zero stable if it produces stable solutions when applied to the (trivial) problem
ẏ = 0. The method is called strongly zero stable if the characteristic equation asso-
ciated with this problem has a single root of unit modulus, κ = 1. In case there are
other roots of unit modulus, but still no roots of multiplicity greater than one, the
method is weakly zero stable. The explicit and implicit Euler methods, as well as
the trapezoidal rule, only have the single root κ = 1, and are strongly zero stable.
(All one-step methods are strongly zero stable, since they only have a single root.)
The explicit midpoint method, however, is a two-step method having two unimodu-
lar roots, κ = 1 and κ = −1. Therefore it is weakly zero stable. It is the root κ = −1
that limits method performance.
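Since zero stability depends only on the roots of the characteristic polynomial ρ(κ), it can be checked mechanically. A sketch (not from the text, using numpy's polynomial root finder):

```python
import numpy as np

def zero_stability(rho_coeffs, tol=1e-10):
    """Classify zero stability from the characteristic polynomial rho(kappa),
    given as coefficients [highest degree, ..., constant].
    Returns 'strong', 'weak', or 'unstable'."""
    roots = np.roots(rho_coeffs)
    if any(abs(r) > 1 + tol for r in roots):
        return "unstable"
    on_circle = [r for r in roots if abs(abs(r) - 1) <= tol]
    # a multiple root of unit modulus gives unbounded solutions
    for i, r in enumerate(on_circle):
        if any(abs(r - s) <= tol for s in on_circle[i+1:]):
            return "unstable"
    return "strong" if len(on_circle) == 1 else "weak"

print(zero_stability([1, -1]))      # strong: Euler methods, trapezoidal rule (rho = kappa - 1)
print(zero_stability([1, 0, -1]))   # weak: explicit midpoint (rho = kappa^2 - 1, roots +1 and -1)
```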

8.5 Conclusions

In a stable method, the global error has a more or less generic structure of the form

‖e‖ ≲ C · ∆t^p · ‖y^(p)‖∞ · (e^{T M[ f ]} − 1)/M[ f ] . (8.22)

For convergence, stability is necessary. The necessary stability condition is rather
modest – a method is only required to solve the linear test equation ẏ = λ y in
a stable manner for λ = 0, and for this reason, the condition is referred to as zero
stability. It requires that the point 0 ∈ C lies in the method’s stability region.
While this is the case also for the explicit midpoint method studied above, further
qualifications are needed, and in most cases we require strong zero stability, which
excludes the explicit midpoint method.
Apart from stability, which is crucial, the global error bound depends on
• A constant C, characteristic of the method, known as the error constant
• The step size ∆t
• The method order p
• The regularity of the solution y, as indicated by ky(p) k∞
• The range of integration [0, T ], or the interval on which the problem is solved
• The logarithmic norm M[ f ], determining the damping rate of perturbations.
Thus a strongly stable (convergent) method will produce any desired accuracy by
choosing ∆t small enough. Choosing a method with a higher order of convergence
p may often be preferable. The regularity of the solution, the damping rate, and the
range of integration are parameters given by the problem, and little can be done
about them. Nevertheless, it is of importance to understand that the final computa-
tional accuracy also depends on those parameters.
Returning to stability, there are three different notions involved:
• Stability of the problem, essentially governed by M[ f ]
• Stability of the discretization for a fixed ∆t > 0
• Stability of the method as ∆t → 0 for a given T .
Although this may at first seem like an overuse of the same term, they all refer to a
continuous data dependence, as some variable goes to infinity.
In a stable problem, also referred to as mathematical stability, we are interested
in whether a small perturbation of the differential equation only results in a small
perturbation of the solution y(t) as t → ∞. The usual setting is that we consider a
single solution, often an equilibrium, and ask whether small perturbations of the initial
condition will grow or stay bounded. If they stay bounded, the solution is stable.
One sufficient condition for mathematical stability is M[ f ] < 0, and in (8.22), we
see that the stability of the problem affects the computational accuracy, in particular
how fast the global error is accumulated.
In a stable discretization, also referred to as numerical stability, we are inter-
ested in whether the discrete system has a similar property. The standard setting is

that we take the given problem, fix a finite time step ∆t, and let the recursion index
n → ∞. The question is whether the numerical solution yn remains bounded under
perturbations, again usually in the initial value.
In the best of worlds, numerical stability would follow from mathematical stability. However, more is required. The third notion, stability of the method, is what is
needed for convergence. The setting is different; here we fix an interval [0, T ], and con-
sider solving the problem on that interval using N steps ∆t = T /N. The question
is whether the numerical solution remains bounded as per (8.22) when N → ∞. In
effect, we ask that the accumulated error at time T remains bounded, independent
of how many steps N we use to reach T . This is key to convergence, since the bound
(8.22) contains the factor ∆t p , allowing us to make the accumulated error arbitrarily
small by choosing N large enough.
In all three cases above, we ask that perturbations remain bounded as some pa-
rameter tends to infinity. This is stability, and it keeps recurring in various guises
throughout all of numerical analysis, explaining the importance of stability theory.
This concludes our analysis of elementary methods for first order initial value
problems. Advanced methods work with similar notions, but require a different ap-
proach to the construction of the methods. The two main contenders are Runge–
Kutta and linear multistep methods.
Chapter 9
Runge–Kutta methods

We have seen that higher order methods offer significantly higher accuracy at a
moderate extra cost. Here we shall explore a systematic approach to the construction
of high order one-step methods. In Runge–Kutta methods the key idea is to sample
the vector field f at several points in a single step, combining the results to obtain
a high order end result in a single step. This entails using the samples to match as
many terms in a Taylor series expansion of the solution as possible.

9.1 An elementary explicit RK method

Let us consider one of the simplest explicit Runge–Kutta methods, achieving second
order convergence by sampling the vector field at two points per step. For clarity, we
shall make the simplifying assumption that the differential equation is autonomous,
i.e., the vector field does not depend on time t, but has the form

ẏ = f (y); y(0) = y0 .

The following computational procedure is a simple RK method. We first compute
the stage derivatives,

Y1′ = f (yn )
Y2′ = f (yn + ∆t Y1′ ).

These are samples of the vector field f at two points Y1 and Y2 near the solution
trajectory. They are not derivatives in the true sense, since they are not functions
of time, but only evaluations of f . For this reason we use a prime to denote the
stage derivatives, while a dot represents the derivative ẏ of the solution y, which is a
differentiable function.
The points Y1 and Y2 are referred to as the stage values, and are defined by


Y1 = yn
Y2 = yn + ∆t Y1′ .

Thus it holds that Yi′ = f (Yi ). We finally update the solution according to

yn+1 = yn + (∆t/2)(Y1′ + Y2′ ). (9.1)
This is an explicit method, since there is no need to solve nonlinear equations in
the process. We shall see that it is a second order convergent method, which can be
viewed as an explicit method that “emulates” the trapezoidal rule.
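The scheme fits in a few lines of code. The sketch below is not from the text (the test problem ẏ = −y is an illustrative choice); it implements (9.1) and confirms second order convergence by halving the step size.

```python
import math

def rk2_step(f, y, dt):
    """One step of the elementary second order ERK (9.1)."""
    Y1p = f(y)                 # stage derivative at Y1 = y_n
    Y2p = f(y + dt * Y1p)      # stage derivative at the explicit Euler predictor Y2
    return y + dt / 2 * (Y1p + Y2p)

def err(N, T=1.0):
    """Global error at t = T for y' = -y, y(0) = 1."""
    y, dt = 1.0, T / N
    for _ in range(N):
        y = rk2_step(lambda v: -v, y, dt)
    return abs(y - math.exp(-T))

print(err(40) / err(80))   # ≈ 4: halving dt quarters the error (order 2)
```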
We shall compare this method to the second order convergent trapezoidal rule,
starting from the same point yn . It computes an update ȳn+1 , defined by

ȳn+1 = yn + (∆t/2)( f (yn ) + f (ȳn+1 )).
This method is implicit, which makes it relatively expensive. However, the explicit
second order RK method (9.1) above has been constructed along similar lines, by
replacing the vector field sample f (ȳn+1 ) by Y2′ . To justify this operation, note that

Y2 = yn + ∆t f (yn ).

Thus the stage value Y2 is simply an explicit Euler step starting from yn , so that
Y2 − ȳn+1 = O(∆t^2 ), corresponding to the local error of the explicit
Euler method in a single step. It follows that

Y2′ = f (Y2 ) = f (ȳn+1 + O(∆t^2 )) = f (ȳn+1 ) + O(∆t^2 ),

provided that f is Lipschitz (the standard assumption). Therefore (9.1) computes

yn+1 = yn + (∆t/2)(Y1′ + Y2′ ) = ȳn+1 + O(∆t^3 ).
This implies that the RK method (9.1) is a second order explicit “workaround” producing nearly the same result as the trapezoidal rule. It is cheaper, but the benefit
comes at a price: the explicit RK method does not have the excellent stability
properties of the implicit trapezoidal rule.
Runge–Kutta methods are divided into explicit (ERK) and implicit (IRK) meth-
ods. The classical ERK methods date back to 1895, when Carl Runge and Wil-
helm Kutta developed some of the ERK methods that are still in use today. Runge
and Kutta were involved in mathematics and its applications in physics and fluid
mechanics, and realized that the construction of accurate and stable computational
methods for initial value problems required serious mathematical thought. The mod-
ern theory of RK methods, however, was largely initiated and developed by John C.
Butcher of the University of Auckland, New Zealand, around 1965. Because of the
complexity of RK theory, this area is still a lively field of research.

9.2 Explicit Runge–Kutta methods

So far we have based the construction of our methods on replacing the derivative by
a finite difference. By contrast, the idea behind Runge–Kutta methods is closely
related to interpolation theory. Integrating the differential equation ẏ = f (t, y) over
the interval [tn ,tn+1 ], we obtain

y(tn+1 ) − y(tn ) = ∫_{tn}^{tn+1} f (τ, y(τ)) dτ. (9.2)

This transforms the differential equation into an integral equation. Here we have
returned to the nonautonomous formulation ẏ = f (t, y), although we will shortly go
back to the autonomous formulation.
As the integral cannot be evaluated exactly, it needs to be approximated numer-
ically. The standard approach to numerical integration is to sample the integrand
f (τ, y(τ)) at a number of points, and approximate the integral by a weighted sum,
∫_{tn}^{tn+1} f (τ, y(τ)) dτ ≈ ∆t ∑_{i=1}^{s} bi Yi′ ,

where Yi′ = f (τi ,Yi ). The accuracy of this approximation depends on how we construct the stage values Yi and the corresponding stage derivatives, Yi′ = f (τi ,Yi ).
This means that (9.2) generates a numerical integration formula

yn+1 − yn = ∆t ∑_{i=1}^{s} bi Yi′ ,

and the difficulty lies in the construction of the stage values Yi , which generate the
stage derivatives, Yi′ = f (τi ,Yi ). There are many ways to choose the stage values,
corresponding to the many different ways in which integrals can be approximated
by discrete sums. However, because the stage values and derivatives have to be
generated sequentially in an explicit computation, we must have

Y1′ = f (tn , yn ).

Subsequent stage values are obtained by advancing a local solution based on linear
combinations of previously computed stage derivatives. Thus

Y2′ = f (tn + c2 ∆t, yn + a2,1 ∆t Y1′ )
Y3′ = f (tn + c3 ∆t, yn + a3,1 ∆t Y1′ + a3,2 ∆t Y2′ )
...

This means that an explicit Runge–Kutta method for the problem ẏ = f (t, y) is given
by the computational process

Yi′ = f (tn + ci ∆t, yn + ∑_{j=1}^{i−1} ai, j ∆t Y j′ ), (9.3)

together with the updating formula


yn+1 = yn + ∆t ∑_{i=1}^{s} bi Yi′ . (9.4)

The method is determined by three sets of parameters, the nodes ci , forming a
vector c, the weights bi forming a vector b, and the matrix A with coefficients ai, j .
These are usually arranged in the Butcher tableau,

0  | 0     0    ···  0
c2 | a2,1  0    ···  0
⋮  | ⋮     ⋮    ⋱    ⋮
cs | as,1  as,2 ···  0
   | b1    b2   ···  bs

or, compactly,

c | A
  | bᵀ
For explicit RK methods the coefficient matrix A is strictly lower triangular. In
implicit RK methods, this is no longer the case.
In the sequel, we shall use the simplifying assumption
ci = ∑_{j=1}^{s} ai, j . (9.5)

This means that the nodes are determined by the row sums of the coefficient matrix
A. We then only need to consider the autonomous initial value problem ẏ = f (y)
to derive order and stability conditions. While it is possible to construct RK meth-
ods that do not satisfy the simplifying assumption, such methods are never used
in practice, since important invariance properties are lost. Thus all state-of-the-art
Runge–Kutta methods, including the original methods of 1895, satisfy the simpli-
fying assumption.
With the simplifying assumption, we can describe the RK process as
1. Compute the i th stage value Yi = yn + ∆t ∑_{j=1}^{i−1} ai, j Y j′
2. Sample the vector field to compute the i th stage derivative Yi′ = f (Yi )
3. After computing all stage derivatives, update yn+1 = yn + ∆t ∑_{i=1}^{s} bi Yi′
In this process, we note that stage derivatives are always multiplied by the time step
∆t. Thus the process should be viewed as computing stage values Yi and scaled
stage derivatives, ∆t Yi′ . The latter quantity is then computed from the vector field
by ∆t Yi′ = ∆t f (Yi ).
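The three-step process above can be written as a generic stepper parametrized by the Butcher tableau. A sketch (not from the text; numpy for vector states, and Heun's method on ẏ = −y as illustrative choices):

```python
import numpy as np

def erk_step(f, y, dt, A, b):
    """One explicit Runge-Kutta step (9.3)-(9.4) for an autonomous y' = f(y).
    A is strictly lower triangular; the nodes c are implied by the row sums."""
    s = len(b)
    Yp = []                                                   # stage derivatives Y'_i
    for i in range(s):
        Yi = y + dt * sum(A[i][j] * Yp[j] for j in range(i))  # stage value Y_i
        Yp.append(f(Yi))
    return y + dt * sum(bi * Ypi for bi, Ypi in zip(b, Yp))

# Heun's method as a tableau (c = row sums of A = (0, 1)):
A = [[0, 0], [1, 0]]
b = [0.5, 0.5]
y = np.array([1.0])
for _ in range(100):
    y = erk_step(lambda v: -v, y, 0.01, A, b)
print(float(y[0]))   # ≈ e^{-1} ≈ 0.3679
```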

9.3 Taylor series expansions and elementary differentials

In the derivation of RK methods, we need to match terms in the Taylor series expansions of the method’s updating formula and the expansion of the exact solution.
Because RK methods by construction employ nested evaluations of the vector field
f when the stage derivatives Yi′ are computed, the Taylor series expansions are more
complicated than otherwise. The standard approach is to express the Taylor series
in terms of the function f and its derivatives, rather than in terms of the solution y
and its derivatives.
Below, we shall use a short-hand notation for function values and their deriva-
tives. Since all expansions are around tn in time and yn in “space,” we let y, ẏ, ÿ, . . .
denote the values y(tn ), ẏ(tn ), ÿ(tn ), etc.
Likewise, we let f denote f (y), while fy = ∂ f /∂ y denotes the Jacobian matrix
with elements {∂ fi /∂ y j }. Note that, due to the simplifying assumption we only
have to consider the differential equation ẏ = f (y); without that assumption, we
would have had to consider ẏ = f (t, y), requiring two partial derivatives, fy and ft .
As will become clear, the simplifying assumption saves a lot of work, without losing
generality.
We also need higher order derivatives, and fyy denotes the 3-tensor with ele-
ments {∂ 2 fi /∂ y j ∂ yk }. Having three indices, it is a multilinear operator producing
a vector if applied to two vectors. Thus fyy f f is a vector, which can be computed
successively, from ( fyy f ) f , where the product of the 3-tensor fyy and the vector (1-
tensor) f produces the matrix (2-tensor) fyy f . This, in turn, can then multiply the
vector f in the usual way; thus ( fyy f ) f is simply a matrix-vector multiply.
If this sounds complicated, the worst is yet to come. Now, since

y(t + ∆t) = y + ∆t ẏ + (∆t^2/2) ÿ + (∆t^3/6) y⃛ + · · ·
we will have to convert derivatives of y into derivatives of f using the differential
equation ẏ = f . By the chain rule it follows that

ÿ = fy ẏ = fy f .

Then, again using the chain rule,

y⃛ = (d/dt)( fy f ) = ( fyy ẏ) f + fy fy f = ( fyy f ) f + fy fy f .
Before computing higher order derivatives, we introduce some simplifications and
short-hand notation. First, in an expression of the type fyy gh, the order of the two
arguments g and h does not matter; thus ( fyy g)h = ( fyy h)g. Second, in an expression
like fyy f f , where the 3-tensor has two identical arguments f , we will allow the
(slightly abusive) short-hand notation fyy f^2 , although f^2 does not represent a power
(which has no meaning for a vector) but only that the argument occurs twice. Finally,

in an expression of the type fy fy f the Jacobian multiplies f twice, justifying the
notation fy^2 f ; this is indeed a power of fy . We then have y⃛ = fyy f^2 + fy^2 f .
From here it is all uphill. Thus, applying the chain rule to each term of the third
derivative, observing the rules of the simplified notation, we have

y⃜ = fyyy f^3 + fyy ( fy f ) f + ( fyy f ) fy f + ( fyy f ) fy f + fy ( fyy f^2 ) + fy^3 f .

Noting that three terms are identical, omitting superfluous parentheses, we collect
terms to get

y⃜ = fyyy f^3 + 3 fyy f fy f + fy fyy f^2 + fy^3 f .
The terms appearing in the total derivatives are called elementary differentials, and
every total derivative is composed of several elementary differentials. Unfortunately
the number of elementary differentials grows exponentially with the order of the
derivative, soon making the expressions very complicated. Nevertheless, collecting
the expressions obtained so far, we have

ẏ = f
ÿ = fy f
y⃛ = fyy f^2 + fy^2 f
y⃜ = fyyy f^3 + 3 fyy f fy f + fy fyy f^2 + fy^3 f ,

so that the Taylor series is

y(t + ∆t) = y + ∆t f + (∆t^2/2!) fy f + (∆t^3/3!)( fyy f^2 + fy^2 f )
+ (∆t^4/4!)( fyyy f^3 + 3 fyy f fy f + fy fyy f^2 + fy^3 f ) + · · ·
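In the scalar case the elementary differentials reduce to products of ordinary derivatives, which makes these formulas easy to verify symbolically. The sketch below is not from the text: it checks them along the exact solution y = 1/(1 − t) of ẏ = y², for which f = y², fy = 2y, fyy = 2 and fyyy = 0.

```python
import sympy as sp

t = sp.symbols('t')
y = 1 / (1 - t)        # exact solution of y' = y^2, y(0) = 1 (scalar test case)

f, fy, fyy, fyyy = y**2, 2*y, sp.Integer(2), sp.Integer(0)

# Check: y'' = fy f,  y''' = fyy f^2 + fy^2 f,
#        y'''' = fyyy f^3 + 3 fyy f fy f + fy fyy f^2 + fy^3 f
assert sp.simplify(y.diff(t, 2) - fy*f) == 0
assert sp.simplify(y.diff(t, 3) - (fyy*f**2 + fy**2*f)) == 0
assert sp.simplify(y.diff(t, 4) - (fyyy*f**3 + 3*fyy*f*fy*f + fy*fyy*f**2 + fy**3*f)) == 0
print("total derivative formulas verified")
```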
The next step is to compare this Taylor series to that of the RK method’s updating
formula. To exemplify, let us consider a two-stage ERK. Its Butcher tableau is

0  | 0   0
c2 | a21 0
   | b1  b2

where c2 = a21 . Thus a two-stage ERK has three free parameters, a21 , b1 and b2 ,
which can be chosen to maximize the order of the method. The method advances
the solution a single step by

yn+1 = yn + ∆t (b1 f (yn ) + b2 f (yn + a21 ∆t f (yn ))) . (9.6)

We now need to select a21 , b1 and b2 so as to match as many terms as possible to the
previous Taylor series expansion. To this end, we need to expand (9.6) in a Taylor
series as well. Fortunately, since by assumption f (yn ) = ẏ(tn ), there is only one term
to expand. Noting that a21 ∆t f (yn ) is “small,” we have

f (yn + a21 ∆t f (yn )) = f + a21 ∆t · fy f + O(∆t^2 ).

Thus we can assemble the Taylor series from (9.6) to obtain

yn+1 = y + ∆t (b1 f + b2 f ) + b2 a21 ∆t^2 · fy f + O(∆t^3 ).

Matching terms, we achieve second order if

b1 + b2 = 1
b2 c2 = 1/2,
where we have preferred to let the parameter c2 replace a21 . Now, since we have
three parameters but only two equations, the solution is not unique; there is a one-
parameter family of two-stage RK methods of order p = 2. Choosing β = b2 as the
free parameter, the family can be written

0 0 0
1 1
2β 2β 0
1−β β

where we typically choose β ∈ [ 12 , 1]. The methods at the endpoints are perhaps the
best known. Thus, at β = 1/2 we obtain Heun’s method,

0 | 0   0
1 | 1   0
  | 1/2 1/2

corresponding to the simple ERK we introduced to emulate the trapezoidal rule by


an explicit method. On the other hand, taking β = 1, we get the modified Euler
method,
0   | 0   0
1/2 | 1/2 0
    | 0   1
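That both order conditions hold for every β in the family can be checked symbolically; a small sketch (not from the text):

```python
import sympy as sp

beta = sp.symbols('beta', positive=True)

# The one-parameter family: c2 = a21 = 1/(2*beta), b = (1 - beta, beta)
b1, b2 = 1 - beta, beta
c2 = 1 / (2*beta)

assert sp.simplify(b1 + b2 - 1) == 0                  # order 1: b1 + b2 = 1
assert sp.simplify(b2*c2 - sp.Rational(1, 2)) == 0    # order 2: b2*c2 = 1/2
print("order conditions hold for all beta > 0")
```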

This procedure looks complicated, and it is. For higher order methods we “only”
need to allow more stage values and parameters, and expand the Taylor series to
include more terms. This quickly goes out of hand, and the construction of RK
methods needs special techniques. Thus the elementary differentials are usually rep-
resented by graphs (“trees”), and order conditions are derived using group theory,
combinatorics and symmetry properties. That is the easy part. The hard part is that,
as we saw above, the equations for determining the method coefficients are non-
linear algebraic equations, which means that it is often difficult to solve for the
method parameters, and that computer software (numerical and symbolic) is usually
needed in this step. In the light of this, it is quite remarkable that Runge managed to
derive methods of order p = 4.

9.4 Order conditions and convergence

To specify order conditions, we take the derivations one step further and consider a
three-stage ERK with Butcher tableau

0  | 0   0   0
c2 | a21 0   0
c3 | a31 a32 0
   | b1  b2  b3

Here we proceed in the same fashion as in the two-stage case, expanding the updating formula and comparing to a third order Taylor series. We now have six parameters, and as there are four elementary differentials for total derivatives not exceeding
order three, we will again have a non-unique solution, now with two degrees of
freedom. Two different families (therefore not exhaustive) are

0   | 0            0       0
2/3 | 2/3          0       0
2/3 | 2/3 − 1/(4β) 1/(4β)  0
    | 1/4          3/4 − β β

and

0   | 0        0      0
2/3 | 2/3      0      0
0   | −1/(4β)  1/(4β) 0
    | 1/4 − β  3/4    β
Of these two, the first is typically preferred, since its bi coefficients are positive if
β < 3/4. The best known method from this family is the Nyström method,

0   | 0   0   0
2/3 | 2/3 0   0
2/3 | 0   2/3 0
    | 1/4 3/8 3/8

A method of order p = 3, but not a member of either of these two families, is the
classical RK3 method

0   | 0   0   0
1/2 | 1/2 0   0
1   | −1  2   0
    | 1/6 2/3 1/6
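Whether a given three-stage tableau really is third order can be checked against the four order conditions directly. A sketch (not from the text; exact rational arithmetic avoids rounding issues) for the Nyström and classical RK3 tableaus:

```python
from fractions import Fraction as F

def order3_conditions(A, b):
    """Check the four conditions for order p = 3, with c_i = row sums of A."""
    s = len(b)
    c = [sum(A[i]) for i in range(s)]
    return (sum(b) == 1
            and sum(b[i]*c[i] for i in range(s)) == F(1, 2)
            and sum(b[i]*c[i]**2 for i in range(s)) == F(1, 3)
            and sum(b[i]*A[i][j]*c[j] for i in range(s) for j in range(s)) == F(1, 6))

nystrom = ([[F(0), F(0), F(0)], [F(2, 3), F(0), F(0)], [F(0), F(2, 3), F(0)]],
           [F(1, 4), F(3, 8), F(3, 8)])
rk3     = ([[F(0), F(0), F(0)], [F(1, 2), F(0), F(0)], [F(-1), F(2), F(0)]],
           [F(1, 6), F(2, 3), F(1, 6)])
print(order3_conditions(*nystrom), order3_conditions(*rk3))   # True True
```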

The procedure can be continued to find higher order methods, but it soon gets
out of hand. More importantly, the degrees of freedom are lost, since the number
of elementary differentials (order conditions) grows faster than the number of free
parameters. In fact, already for order four, we have eight order conditions and ten

parameters; hence still some degrees of freedom. But for order five, there are already
17 order conditions (elementary differentials), but only 15 parameters to choose in
a five-stage explicit RK method. Unsurprisingly, there is no explicit Runge–Kutta
method of order five with five stages; six stages are necessary.
To give an impression of the increasing complexity, we consider a four-stage
ERK with Butcher tableau

0  | 0   0   0   0
c2 | a21 0   0   0
c3 | a31 a32 0   0
c4 | a41 a42 a43 0
   | b1  b2  b3  b4
After expanding in the relevant Taylor series, matching elementary differentials, we
obtain the order conditions for order p = 1,

f: ∑ bi = 1.
i

In addition, for order p = 2,

fy f : ∑_i bi ci = 1/2.

For order p = 3, it must also hold that

fyy f^2 : ∑_i bi ci^2 = 1/3
fy^2 f : ∑_{i, j} bi ai j c j = 1/6.

For order p = 4, we further require

fyyy f^3 : ∑_i bi ci^3 = 1/4
fyy f fy f : ∑_{i, j} bi ci ai j c j = 1/8
fy fyy f^2 : ∑_{i, j} bi ai j c j^2 = 1/12
fy^3 f : ∑_{i, j,k} bi ai j a jk ck = 1/24.

By now, it is clear that the order conditions cannot be solved for the free coefficients
without hard work. Even so, in 1895 Runge found the classical RK4 method,

0   | 0   0   0   0
1/2 | 1/2 0   0   0
1/2 | 0   1/2 0   0
1   | 0   0   1   0
    | 1/6 1/3 1/3 1/6

corresponding to the computational scheme

Y1′ = f (tn , yn )
Y2′ = f (tn + ∆t/2, yn + ∆t Y1′ /2)
Y3′ = f (tn + ∆t/2, yn + ∆t Y2′ /2)
Y4′ = f (tn + ∆t, yn + ∆t Y3′ )
yn+1 = yn + (∆t/6)(Y1′ + 2Y2′ + 2Y3′ + Y4′ ).
This method is still in wide use today, demonstrating how powerful it is. It is linked
to Simpson’s rule, a fourth order method for computing integrals of a function of t.
At considerable extra work, Runge was able to extend this idea to ordinary differential equations.
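A tableau-level sketch (not from the text; exact rational arithmetic) confirming that the classical RK4 tableau satisfies all eight order conditions listed above:

```python
from fractions import Fraction as F

A = [[F(0), F(0), F(0), F(0)],
     [F(1, 2), F(0), F(0), F(0)],
     [F(0), F(1, 2), F(0), F(0)],
     [F(0), F(0), F(1), F(0)]]
b = [F(1, 6), F(1, 3), F(1, 3), F(1, 6)]
c = [sum(row) for row in A]          # nodes from the row sums: (0, 1/2, 1/2, 1)
s = range(4)

checks = {
    "sum b":     sum(b) == 1,
    "b.c":       sum(b[i]*c[i] for i in s) == F(1, 2),
    "b.c^2":     sum(b[i]*c[i]**2 for i in s) == F(1, 3),
    "b.A.c":     sum(b[i]*A[i][j]*c[j] for i in s for j in s) == F(1, 6),
    "b.c^3":     sum(b[i]*c[i]**3 for i in s) == F(1, 4),
    "b.(c*A.c)": sum(b[i]*c[i]*A[i][j]*c[j] for i in s for j in s) == F(1, 8),
    "b.A.c^2":   sum(b[i]*A[i][j]*c[j]**2 for i in s for j in s) == F(1, 12),
    "b.A.A.c":   sum(b[i]*A[i][j]*A[j][k]*c[k] for i in s for j in s for k in s) == F(1, 24),
}
print(all(checks.values()))   # True
```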
There is a full theory for how to construct explicit Runge–Kutta methods, and
today, there are ERK methods in use of orders up to p = 8. Such methods are diffi-
cult to construct, but due to their extremely high accuracy, they are useful for high
precision computations. To illustrate how difficult it is to construct such methods,
we note that for an s-stage ERK method, there are s(s + 1)/2 coefficients, as seen in
the following table.

Stages s 1 2 3 4 5 6 7 8 9 10 11
Coefficients 1 3 6 10 15 21 28 36 45 55 66
Table 9.1 Stages and coefficients in ERK methods. The number of free parameters in an s-stage
ERK method is s(s + 1)/2

But there is an overwhelming number of order conditions to achieve order p.


Comparing a given order to the number of order conditions, and the minimum num-
ber of stages s required to achieve the requested order, we have the following table.

Order p 1 2 3 4 5 6 7 8 9 10
Conditions 1 2 4 8 17 37 85 200 486 1205
Min stages 1 2 3 4 6 7 9 11 ? ?
Table 9.2 Necessary number of stages in ERK methods. The number of order conditions in ERK
methods grows prohibitively fast

Thus it is currently not known how many stages are minimally needed to con-
struct methods of orders 9 and 10. Naturally, this might not seem to be important.
However, what is more surprising is that it has been possible to construct (say) 7-stage ERK methods of order p = 6; such methods are subject to no less than 37
order conditions due to the large number of elementary differentials, yet they only
have 28 parameters. Thus 28 parameters must satisfy 37 equations; while this seems
unlikely, it is nevertheless possible. It is even more remarkable that an order p = 8
method (with 200 order conditions) can be constructed using only 11 stages and 66
parameters (Butcher, 1985).
The one thing that is simple about Runge–Kutta methods is that every RK
method satisfying the first order condition ∑_i bi = 1 (consistency) is convergent. This follows from the methods being one-step methods. All consistent one-step methods are convergent, and have global error bounds similar to those derived
for the Euler methods. A possible breakdown of convergence only happens in
multistep methods, or in connection with time-dependent partial differential equations.
The construction of explicit Runge–Kutta methods is far from trivial and re-
quires considerable expertise. Luckily, there are several high performing methods
to choose from, also with built-in error estimators. We will return to how these
methods are made adaptive, using automatic step size control to meet a prescribed
accuracy requirement.
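Although constructing good coefficients is hard, running a given ERK tableau is mechanical: each stage derivative is computed from the previously computed ones. The following sketch (Python; the function and variable names are my own, not from the text) checks the classical RK4 tableau on ẏ = y:

```python
import math

def erk_step(f, t, y, dt, A, b, c):
    """One explicit RK step for a scalar ODE y' = f(t, y).
    A is strictly lower triangular; b, c are the weights and nodes."""
    k = []
    for i in range(len(b)):
        yi = y + dt * sum(A[i][j] * k[j] for j in range(i))
        k.append(f(t + c[i] * dt, yi))
    return y + dt * sum(bi * ki for bi, ki in zip(b, k))

# Classical RK4 tableau
A4 = [[0, 0, 0, 0], [1/2, 0, 0, 0], [0, 1/2, 0, 0], [0, 0, 1, 0]]
b4 = [1/6, 1/3, 1/3, 1/6]
c4 = [0, 1/2, 1/2, 1]

# Integrate y' = y from y(0) = 1 to t = 1; the exact solution is e
t, y, dt = 0.0, 1.0, 0.01
for _ in range(100):
    y = erk_step(lambda t, y: y, t, y, dt, A4, b4, c4)
    t += dt
print(abs(y - math.e))  # global error is O(dt^4), here around 1e-10
```

The same function runs any explicit tableau; only the arrays change.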

9.5 Implicit Runge–Kutta methods

Implicit RK methods have a Butcher tableau

c A
bT

where the matrix A is no longer required to be strictly lower triangular. The order
conditions are exactly the same as in the ERK case, as the parameters are again
determined by matching coefficients in the Taylor series expansions of the solution
y(t + ∆t) and the updating formula for yn+1 .
Let us consider a general two-stage IRK method. It has a Butcher tableau

c1 a11 a12
c2 a21 a22
b1 b2

This corresponds to the equations

Y1′ = f (yn + ∆t(a11 Y1′ + a12 Y2′))
Y2′ = f (yn + ∆t(a21 Y1′ + a22 Y2′))
yn+1 = yn + ∆t(b1 Y1′ + b2 Y2′),
and it becomes evident that the two first stage equations form a single nonlinear
equation system which must be solved in order to advance the solution.
The aim of using IRK methods is to increase the stability region so as to obtain methods useful for solving stiff differential equations, for which ∆t L[ f ] ≫ 1. However, because it is expensive to use a general IRK with a full A matrix, only a few such methods are ever used. Among them we find the well-known 3-stage order p = 5 Radau IIa method, also known as RADAU5. Its Butcher tableau is
2/5 − √6/10    11/45 − 7√6/360       37/225 − 169√6/1800    −2/225 + √6/75
2/5 + √6/10    37/225 + 169√6/1800   11/45 + 7√6/360        −2/225 − √6/75
1              4/9 − √6/36           4/9 + √6/36            1/9
               4/9 − √6/36           4/9 + √6/36            1/9

Due to its structure, the computations can be arranged in a quite efficient way, and RADAU5 is probably the most powerful IRK method available today for solving stiff problems.
For IRK methods, the maximum order when using s stages is p = 2s. This is
achieved by the Gauss–Legendre methods. They are also useful for stiff prob-
lems. Due to symmetry their stability regions coincide with C− ; thus the methods
are A-stable. However, the computations associated with these methods are more
complicated than for RADAU5. The latter method also has better damping when
∆tλ → −∞; this is often an advantage in practice. The Gauss–Legendre method of
order six has the Butcher tableau:
1/2 − √15/10   5/36            2/9 − √15/15   5/36 − √15/30
1/2            5/36 + √15/24   2/9            5/36 − √15/24
1/2 + √15/10   5/36 + √15/30   2/9 + √15/15   5/36
               5/18            4/9            5/18

The methods discussed so far have a full A matrix. However, one can achieve
sufficiently improved stability properties already when the matrix A is lower trian-
gular, with a nonzero diagonal. Such methods are referred to as DIRK methods
(diagonally implicit Runge–Kutta methods). A further restriction is to demand that
the diagonal elements are all equal. Such methods are known as SDIRK methods
(singly diagonally implicit RK), and have the Butcher tableau (in the 2-stage case)

γ γ 0
c2 a21 γ
b1 b2

This corresponds to the equations



Y1′ = f (yn + γ ∆t Y1′)
Y2′ = f (yn + a21 ∆t Y1′ + γ ∆t Y2′)
yn+1 = yn + ∆t(b1 Y1′ + b2 Y2′).

The first two equations form a system, but the equations are decoupled. Thus, they
can be rewritten

(I − γ ∆t f )(yn + γ ∆t Y1′) = yn
(I − γ ∆t f )(yn + a21 ∆t Y1′ + γ ∆t Y2′) = yn + a21 ∆t Y1′,

and we see that they will share the same Jacobian matrix, I − γ∆t fy. After the first equation has been solved, the right-hand side of the second equation can be computed, and the second equation solved. Thus we now have two separate, sequential
systems to solve, and this requires less work than simultaneously solving two cou-
pled equations. This substantially reduces the computational effort and complexity
compared to the more advanced methods.
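As an illustration of this sequential process, the sketch below (Python, my own construction) takes one step of a two-stage SDIRK method for a scalar problem. The particular coefficients γ = 1 − √2/2, a21 = 1 − γ, b = (1 − γ, γ) are an assumption on my part (a standard second order choice); the text leaves them open. Each stage is a scalar nonlinear equation solved by Newton's method, and both stages involve the same quantity 1 − γ∆t f′:

```python
import math

def newton_scalar(g, dg, x0, tol=1e-12):
    # Basic Newton iteration for g(x) = 0
    x = x0
    for _ in range(50):
        dx = g(x) / dg(x)
        x -= dx
        if abs(dx) < tol:
            break
    return x

def sdirk2_step(f, df, y, dt):
    """One step of a 2-stage, order-2 SDIRK method (gamma = 1 - sqrt(2)/2,
    an assumed standard choice)."""
    g = 1 - math.sqrt(2) / 2
    # Stage 1: K1 = f(y + g*dt*K1), solved by Newton
    K1 = newton_scalar(lambda k: k - f(y + g * dt * k),
                       lambda k: 1 - g * dt * df(y + g * dt * k), f(y))
    # Stage 2: same Jacobian structure, but now K1 is known data
    a21 = 1 - g
    K2 = newton_scalar(lambda k: k - f(y + a21 * dt * K1 + g * dt * k),
                       lambda k: 1 - g * dt * df(y + a21 * dt * K1 + g * dt * k),
                       K1)
    return y + dt * ((1 - g) * K1 + g * K2)

# Stiff scalar test problem y' = -100*y: stable even at dt = 0.1
y = 1.0
for _ in range(10):
    y = sdirk2_step(lambda u: -100 * u, lambda u: -100, y, 0.1)
print(y)  # has decayed essentially to zero, no blow-up
```

Note that an explicit method would be violently unstable here, since ∆tλ = −10.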
Let us now turn to some of the most elementary IRK methods. The implicit
Euler method is a one-stage method with Butcher tableau

1  1
   1

and equations

Y1′ = f (yn + ∆t Y1′)
yn+1 = yn + ∆t Y1′.

Here we make the important observation that the stage value Y1 = yn + ∆t Y1′ coincides with yn+1, which means that we can rewrite the first equation above as Y1′ = f (yn+1), with updating formula yn+1 = yn + ∆t f (yn+1). This is recognized as the implicit Euler method.
The implicit midpoint method is also a one-stage method, with the Butcher
tableau
1/2 1/2
1
together with the equations

Y1′ = f (yn + ∆t Y1′/2)
yn+1 = yn + ∆t Y1′.

The first equation now needs a minor transformation,

yn + ∆t Y1′/2 = yn + (∆t/2) f (yn + ∆t Y1′/2),
implying that yn + ∆t Y1′/2 is the solution to the equation

(I − (∆t/2) f )(yn + ∆t Y1′/2) = yn.

Hence yn + ∆t Y1′/2 = (I − (∆t/2) f )⁻¹(yn). Therefore the updating formula becomes

yn+1 = 2(I − (∆t/2) f )⁻¹(yn) − yn
     = 2(I − (∆t/2) f )⁻¹(yn) − (I − (∆t/2) f )(I − (∆t/2) f )⁻¹(yn)
     = (2I − (I − (∆t/2) f ))(I − (∆t/2) f )⁻¹(yn)
     = (I + (∆t/2) f )(I − (∆t/2) f )⁻¹(yn).
Hence the implicit midpoint method can be expressed as

yn+1 = (I + (∆t/2) f )(I − (∆t/2) f )⁻¹(yn). (9.7)
2 2
Let us now compare to the trapezoidal rule. This is a two-stage, one-step IRK
method of order p = 2, whose Butcher tableau is

0 0 0
1 1/2 1/2
1/2 1/2

This corresponds to the equations

Y1′ = f (yn)
Y2′ = f (yn + ∆t Y1′/2 + ∆t Y2′/2)
yn+1 = yn + (∆t/2) Y1′ + (∆t/2) Y2′.
Here we note that Y2′ = f (yn+1), so it immediately follows that

yn+1 = yn + (∆t/2) ( f (yn) + f (yn+1)),
which is recognized as the trapezoidal rule in the form it was previously discussed.
Moreover, the latter formula can be rearranged to obtain

(I − (∆t/2) f )(yn+1) = (I + (∆t/2) f )(yn),

resulting in the formula

yn+1 = (I − (∆t/2) f )⁻¹(I + (∆t/2) f )(yn). (9.8)
2 2

Thus, comparing (9.8) to (9.7) we see that the difference between the trapezoidal
rule and the implicit midpoint method is that the two operators, corresponding to
one half-step explicit Euler method and one half-step implicit Euler method, are
commuted. In the case of a linear constant coefficient system ẏ = Ay this does not
matter since the factors commute, but in the nonlinear case there is a difference.
This also makes a difference in terms of stability.
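The difference is easy to observe numerically. Below, one step of each method is applied to the nonlinear problem ẏ = y², y(0) = 1 (my choice of test problem, with exact solution 1/(1 − t)); the implicit equations are solved by a scalar Newton iteration. Both results are second order accurate, but they are not identical:

```python
def newton(g, dg, x0):
    # Basic Newton iteration for a scalar equation g(x) = 0
    x = x0
    for _ in range(50):
        x -= g(x) / dg(x)
    return x

f = lambda y: y * y          # nonlinear test problem y' = y^2
df = lambda y: 2 * y
y0, dt = 1.0, 0.1

# Implicit midpoint: y1 = y0 + dt*f((y0 + y1)/2)
mid = newton(lambda y: y - y0 - dt * f((y0 + y) / 2),
             lambda y: 1 - dt * df((y0 + y) / 2) / 2, y0)

# Trapezoidal rule: y1 = y0 + dt/2*(f(y0) + f(y1))
trap = newton(lambda y: y - y0 - dt / 2 * (f(y0) + f(y)),
              lambda y: 1 - dt / 2 * df(y), y0)

exact = 1 / (1 - dt * y0)    # exact solution 1/(1 - t) at t = dt
print(mid, trap, exact)      # close to each other, but distinct
```

For the linear problem ẏ = λy the two computations give exactly the same number, as the operators then commute.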
All of the three elementary IRK methods above are SDIRK methods. In addition, the trapezoidal rule has a first explicit stage, and is therefore sometimes referred to as an ESDIRK method. It has one more property of significance: the second stage value Y2 is identical to the output yn+1, which, of course, becomes the first stage value on the next step. Methods with this property are called “first same as last” (FSAL). The FSAL property can be used to economize the computations.

9.6 Stability of Runge–Kutta methods

To assess stability, we use the linear test equation, ẏ = λ y, with y(0) = 1. Let us
investigate the stability of the classical RK4 method, with equations

Y1′ = f (tn, yn)
Y2′ = f (tn + ∆t/2, yn + ∆t Y1′/2)
Y3′ = f (tn + ∆t/2, yn + ∆t Y2′/2)
Y4′ = f (tn + ∆t, yn + ∆t Y3′)
yn+1 = yn + (∆t/6)(Y1′ + 2Y2′ + 2Y3′ + Y4′).
Since f (t, y) = λ y, we obtain

∆t Y1′ = ∆t λ yn
∆t Y2′ = ∆t λ (yn + ∆t Y1′/2) = (∆t λ + (∆t λ)²/2) yn
∆t Y3′ = ∆t λ (yn + ∆t Y2′/2) = (∆t λ + (∆t λ)²/2 + (∆t λ)³/4) yn
∆t Y4′ = ∆t λ (yn + ∆t Y3′) = (∆t λ + (∆t λ)² + (∆t λ)³/2 + (∆t λ)⁴/4) yn.

We now assemble these expressions in the updating formula, to get

yn+1 = yn + (∆t/6)(Y1′ + 2Y2′ + 2Y3′ + Y4′)
     = (1 + ∆t λ + (∆t λ)²/2 + (∆t λ)³/6 + (∆t λ)⁴/24) yn.

Thus, when the classical RK4 method is applied to the linear test equation, the
updating formula is a polynomial of degree four, yn+1 = P(∆t λ )yn , where

Fig. 9.1 Stability region of the classical RK4 method. Colors indicate values z ∈ C where |P(z)| ∈
[0, 1]. Dark blue areas reveal the location of the four zeros of the polynomial P(z)

P(z) = 1 + z + z²/2 + z³/6 + z⁴/24 ≈ e^z.
2 6 24

This is no coincidence; since ẏ = λ y implies y(t + ∆t) = e∆t λ y(t), the polynomial
P(z) must necessarily approximate ez accurately.
The same thing will happen for any explicit RK method. Thus for every ERK
method the updating formula is of the form yn+1 = P(∆t λ )yn , where the polynomial
P is characteristic for each method. Since

|yn+1 | ≤ |P(∆t λ )| · |yn |,

it follows that the method is numerically stable if |P(z)| ≤ 1. For this reason, P(z) is
referred to as the stability function of the method. The stability region of the RK4 method is plotted in Figure 9.1.
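The extent of the stability region along the negative real axis can be located numerically: |P(x)| < 1 just inside and |P(x)| > 1 outside, so a bisection on |P(x)| = 1 finds the boundary. This is a sketch of my own; the resulting value ≈ −2.785 is consistent with the approximate figure −2.7 quoted later in the chapter:

```python
def P(z):
    # Stability polynomial of the classical RK4 method
    return 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24

# |P(-2)| < 1 and |P(-3)| > 1, so the boundary lies in between;
# bisect on |P(x)| = 1 along the negative real axis
lo, hi = -3.0, -2.0
for _ in range(60):
    m = (lo + hi) / 2
    if abs(P(m)) > 1:
        lo = m          # still outside the stability region
    else:
        hi = m          # inside the stability region
print(hi)  # about -2.785, the real-axis stability limit of RK4
```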
A similar procedure can be followed for implicit RK methods, with the only difference being that the stability function is then a rational function, R(z) = P(z)/Q(z), with P and Q polynomials such that R(z) approximates e^z near z = 0. An implicit
RK method is A-stable if

Re z ≤ 0 ⇒ |R(z)| ≤ 1.

Since R(z) must be bounded in all of the left half plane, it follows that deg Q ≥
deg P. Obviously, no explicit method can be A-stable, since for every polynomial,
P(z) → ∞ as z → ∞.

To check for A-stability, we first check that R(z) has no poles in the left half
plane (Q has no zeros in C− ). Then, invoking the maximum principle, one needs
to verify that |R(iω)| ≤ 1 for all ω ∈ R, i.e., an IRK method without poles in C−
is A-stable if it is stable on the imaginary axis.
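As a concrete example of this test, the trapezoidal rule has stability function R(z) = (1 + z/2)/(1 − z/2), read off by applying (9.8) to ẏ = λy. Its only pole is z = 2, in the right half plane, so checking the imaginary axis settles A-stability. A quick numerical confirmation (Python, my own sketch):

```python
# Stability function of the trapezoidal rule: R(z) = (1 + z/2)/(1 - z/2)
def R(z):
    return (1 + z / 2) / (1 - z / 2)

# The only pole is z = 2 (right half plane), so by the maximum principle
# it suffices to check |R| on the imaginary axis:
on_axis = [abs(R(1j * w)) for w in range(-50, 51)]
print(max(on_axis))  # 1.0: |R(iw)| = 1 for all w, so the method is A-stable
```

In fact |1 + iω/2| = |1 − iω/2| for every real ω, so |R(iω)| = 1 identically: the trapezoidal rule has no damping on the imaginary axis.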

9.7 Embedded Runge–Kutta methods

In general, the solution of an initial value problem changes character over time. For
example, in the Prothero–Robinson problem

ẏ = λ (y − ϕ(t)) + ϕ̇(t)

there is a particular solution ϕ(t) and homogeneous solutions eλt. These may be very different in character. Consequently, during the transient, the step size needs to be adapted to the homogeneous solution. But once the homogeneous solution has
decayed and the solution is dominated by the particular solution ϕ(t), the step size
needs to be readapted to ϕ(t). This is especially important in stiff problems, where
the step size satisfying the error tolerance may vary by several orders of magnitude
during the integration. Without adaptivity, the efficiency may suffer to the point
of even making the integration impossible.
To control the step size, we need an error estimate. This can be provided by
embedded RK methods. A simple example is given by the classical RK3 and RK4
methods. Thus the RK3, with Butcher tableau

0 0 0 0
1/2 1/2 0 0
1 −1 2 0
1/6 2/3 1/6

has the equations

Y1′ = f (yn)
Y2′ = f (yn + ∆t Y1′/2)
Z2′ = f (yn − ∆t Y1′ + 2∆t Y2′)
zn+1 = yn + (∆t/6)(Y1′ + 4Y2′ + Z2′).
This is a method of order p = 3. Noting that the first two stage derivatives are iden-
tical to those of the RK4 method, the two methods RK3 and RK4 can be embedded
into the same computational scheme, based on the equations

Y1′ = f (yn)
Y2′ = f (yn + ∆t Y1′/2)
Z2′ = f (yn − ∆t Y1′ + 2∆t Y2′)
Y3′ = f (yn + ∆t Y2′/2)
Y4′ = f (yn + ∆t Y3′)
zn+1 = yn + (∆t/6)(Y1′ + 4Y2′ + Z2′)
yn+1 = yn + (∆t/6)(Y1′ + 2Y2′ + 2Y3′ + Y4′).
Thus, with one extra function evaluation for computing Z2′, we can obtain both a result of order p = 3 (i.e., zn+1) and the fourth order result yn+1.
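The combined scheme is straightforward to program. The sketch below (Python; an illustration of the equations above, with names of my own choosing) returns the fourth order result together with the estimate zn+1 − yn+1, tried on ẏ = −y:

```python
import math

def rk34_step(f, t, y, dt):
    """One step of the embedded RK34 pair: returns the 4th order
    result and the embedded error estimate z - y (3rd order)."""
    Y1 = f(t, y)
    Y2 = f(t + dt / 2, y + dt * Y1 / 2)
    Z2 = f(t + dt, y - dt * Y1 + 2 * dt * Y2)
    Y3 = f(t + dt / 2, y + dt * Y2 / 2)
    Y4 = f(t + dt, y + dt * Y3)
    y4 = y + dt / 6 * (Y1 + 2 * Y2 + 2 * Y3 + Y4)
    z3 = y + dt / 6 * (Y1 + 4 * Y2 + Z2)
    return y4, z3 - y4

# One step on y' = -y, y(0) = 1, dt = 0.1
y4, err = rk34_step(lambda t, y: -y, 0.0, 1.0, 0.1)
print(abs(y4 - math.exp(-0.1)), abs(err))  # true error << estimate
```

As expected, the true error of the advanced solution (order 4, here about 1e−7) is far smaller than the estimate (order 3, here about 1e−6), illustrating why the estimate can safely be used to control the step size.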
This embedded RK pair can be displayed in terms of a common Butcher tableau

       Y1′   Y2′   Z2′   Y3′   Y4′

0      0     0     0     0     0
1/2    1/2   0     0     0     0
1      −1    2     0     0     0
1/2    0     1/2   0     0     0
1      0     0     0     1     0
       1/6   2/3   1/6   0     0
       1/6   1/3   0     1/3   1/6

where column headings in terms of the stage derivatives have been included for clarity.
Now, we have seen that there is a dramatic difference in accuracy between a
method of order p and one of order p + 1. For practical purposes, the more accurate
result can then be regarded as “exact” when compared to the lower order result. This
means that we can here estimate the local error by the difference

ln+1 = zn+1 − yn+1 = O(∆t⁴),

where we remind the reader that the local error is of order O(∆t p+1 ) for an order
p method. Thus, the statement above is that zn+1 has a local error of magnitude
O(∆t 4 ).
This is then an estimate of the local error in the lower order result, i.e., in zn+1 .
However, the most common approach, referred to as local extrapolation, only uses
the error estimate ln+1 for controlling the step size, while retaining the best available
result (here of order p = 4) as the output. This implies that a somewhat questionable
approach is used, regarding ln+1 as a tool for adjusting the step size ∆t.
The embedded pair above is referred to as the RK34 method, meaning that it
advances the solution by a method of order p = 4 (the RK4 method) and uses a
built-in error estimator of order p = 3. This is a simple but relevant example. To
see what advanced methods look like, we can consider the low order Shampine–
Bogacki SB23 method,

0 0 0 0 0
1/2 1/2 0 0 0
3/4 0 3/4 0 0
1 2/9 1/3 4/9 0
2/9 1/3 4/9 0
7/24 1/4 1/3 1/8
where the first row of b coefficients corresponds to order p = 3 and the second to order p = 2.
MATLAB offers the Dormand–Prince DOPRI45 solver ode45,

0 0 0 0 0 0 0 0
1/5 1/5 0 0 0 0 0 0
3/10 3/40 9/40 0 0 0 0 0
4/5 44/45 −56/15 32/9 0 0 0 0
8/9 19372/6561 −25360/2187 64448/6561 −212/729 0 0 0
1 9017/3168 −355/33 46372/5247 49/176 −5103/18656 0 0
1 35/384 0 500/1113 125/192 −2187/6784 11/84 0
5179/57600 0 7571/16695 393/640 −92097/339200 187/2100 1/40
35/384 0 500/1113 125/192 −2187/6784 11/84 0

Here the first row of b coefficients corresponds to order p = 4 and the second to
order p = 5. The 5th order method has the FSAL property.
There are further ERK methods; another state-of-the-art method of a similar order is the Cash–Karp CK45 method,

0 0 0 0 0 0 0
1/5 1/5 0 0 0 0 0
3/10 3/40 9/40 0 0 0 0
3/5 3/10 −9/10 6/5 0 0 0
1 −11/54 5/2 −70/27 35/27 0 0
7/8 1631/55296 175/512 575/13824 44275/110592 253/4096 0
2825/27648 0 18575/48384 13525/55296 277/14336 1/4
37/378 0 250/621 125/594 0 512/1771

where the first row of b coefficients corresponds to order p = 4 and the second to
order p = 5.
The coefficients of these methods clearly suggest that it is a nontrivial task to
construct high performance ERK methods. IRK methods are extended in a similar
way to embedded methods, but there are added difficulties obtaining good error
estimators for IRK methods. For stiff problems, MATLAB offers the implicit solver ode23s.

9.8 Adaptive step size control

Let us consider a method of order p taking a single step of size ∆t from a point
yn = ŷ(tn ), where ŷ(t) is a local solution to the differential equation. We then obtain

Rp : yn ↦ yn+1,

where the local error is

lp+1,n+1 = yn+1 − ŷ(tn+1) = O(∆t^(p+1)).

In the analysis of the Euler methods, we saw that global error bounds at time T are of the form

‖eN‖ ≲ C · (maxn ‖lp+1,n‖ / ∆t) · (e^(T M[ f ]) − 1) / M[ f ].

This bound shows that in order to control the magnitude of the global error, which is O(∆t^p), we need to control the local error per unit step (EPUS), ‖lp+1,n‖/∆t, which is also O(∆t^p). Thus, if step sizes can be varied along the solution trajectory so that ‖lp+1,n‖/∆t ≈ TOL, where TOL is the local error tolerance, then

‖eN‖ ≲ C · TOL · (e^(T M[ f ]) − 1) / M[ f ].

This means that an EPUS strategy will make the global error proportional to TOL, irrespective of how many steps are needed to reach the end-point T. A solver that achieves this is called tolerance proportional.
While EPUS is common in nonstiff solvers, it is not always used in connection
with stiff problems, simply because stiffness typically requires step sizes to vary by
several orders of magnitude. EPUS would then cause the solver to spend a dispro-
portionate computational effort on resolving initial transients, especially since the
system’s damping will make these errors decay rather than accumulate. Therefore,
in stiff solvers, it is common to only control the error per step (EPS), which means that one attempts to adjust the step size along the trajectory so that ‖lp+1,n‖ ≈ TOL.
Due to damping, it is still possible to obtain highly accurate results.
Now, in order to control the local error in practice, we need a local error esti-
mate obtained from an embedded method. Let us therefore assume that we have two
methods, of orders p and p − 1, respectively, both starting from a point yn = ŷ(tn ),
producing two different results,

Rp : yn ↦ yn+1
Rp−1 : yn ↦ zn+1.

The local errors are



lp+1,n+1 = yn+1 − ŷ(tn+1) = O(∆t^(p+1))
lp,n+1 = zn+1 − ŷ(tn+1) = O(∆t^p),

and the embedded method produces a local error estimate

l̃p,n+1 = zn+1 − yn+1 = O(∆t^p),

where the asymptotic behavior of the error estimate is evidently determined by the
lower order method. In advancing the solution, however, it is common to choose
the higher order method, R p , so that no accuracy is “wasted.” Because this method
has a global error O(∆t^p), one could expect to achieve a performance similar to tolerance proportionality by controlling the magnitude of ‖l̃p,n+1‖ directly.
The basic assumption in step size control is that the norm of the local error esti-
mate, wn = kl˜p,n k, is accurately represented by the asymptotic model
wn = φn−1 (∆tn−1)^p,

where p is the order of the error estimator. (The power would change if EPUS is
used instead of EPS). The step-size indexing used above is defined

∆tn−1 = tn − tn−1 ,

and the principal error function φn−1 is assumed to vary slowly along the solution
trajectory. The step size is continually adjusted to keep wn (approximately) constant,
equal to the requested tolerance TOL. The objective is to keep the scaled control error
cn = (TOL / wn)^(1/p) (9.9)
close to 1 for all n. This is achieved by selecting the next step-size ∆tn as

∆tn = ρn−1 ∆tn−1 , (9.10)

where ρn−1 is the step size ratio. Taking logarithms, this multiplicative recursion is
transformed into a simple summation,

log ∆tn = log ∆tn−1 + log ρn−1 ,

referred to as a discrete integrating controller, as it adjusts log ∆tn by a summation


of step size ratios, which in turn are determined by past scaled control errors. The
simplest choice is to take ρn−1 = cn . Inserted into (9.10), it yields
∆tn = (TOL / wn)^(1/p) ∆tn−1. (9.11)

This elementary controller is used, with various restrictions, in most conventional codes. It is a single-input single-output feedback control system, known as a deadbeat controller, as it immediately tries to compensate any deviation from the target tolerance in a single correction. Unfortunately, this approach has several shortcomings and is prone to overreact to variations in the error estimator, especially if the numerical method operates near instability.
However, control theory offers many alternatives, and a more advanced approach
is to process the scaled control errors by applying a recursive digital filter for the
construction of ρn−1 . A filter is a linear difference equation (multistep method) de-
signed to regularize the step size sequence, eliminating “noise,” while still making
the step size track the error, which is kept close to the preset tolerance. The general
multiplicative form of a two-step filter is
ρn−1 = cn^γ1 · cn−1^γ2 · ρn−2^(−κ2). (9.12)

The filter coefficients γ1, γ2 and κ2 are chosen to produce a smooth step size sequence while maintaining overall stability. They must satisfy specific order and stability conditions, and can be selected to match method and problem classes. A well designed controller produces step size ratios ρn−1 close to 1. The
same filter coefficients can be used irrespective of method order.
The full step size controller consists of the filter recursion (with its two-step
character taking back values into account) followed by a simple integrator. Thus
ρn−1 = cn^γ1 · cn−1^γ2 · ρn−2^(−κ2)
∆tn = ρn−1 ∆tn−1

controls the step size during the integration. By taking logarithms, we see that this
step size control mechanism consists of two linear difference equations for the log-
arithmic quantities,

log ρn−1 + κ2 log ρn−2 = γ1 log cn + γ2 log cn−1


log ∆tn = log ∆tn−1 + log ρn−1 .

This can therefore be analyzed by standard techniques from the theory of linear
difference equations.
In computational practice, however, the multiplicative form is preferred. Thus the
combination of the filter and the integrator can be assembled into a multiplicative
form similar to that of the elementary controller, to yield
∆tn = (TOL / wn)^(γ1/p) · (TOL / wn−1)^(γ2/p) · (∆tn−1 / ∆tn−2)^(−κ2) · ∆tn−1. (9.13)

Due to the similarity between (9.11) and (9.13), we see that it is straightforward
to replace the elementary controller in an existing code by a specifically designed
digital filter. The computations must, however, also be equipped with the usual pro-
tections against division by zero, etc.
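In Python, such a controller might be sketched as follows (function and variable names are mine); the general filter (9.13) reduces to the elementary controller (9.11) when γ1 = 1 and γ2 = κ2 = 0:

```python
def filtered_step_size(dt1, dt2, w, w_prev, tol, p, g1, g2, k2):
    """Next step size via the digital filter (9.13).
    dt1, dt2: two previous step sizes; w, w_prev: error estimates."""
    return ((tol / w) ** (g1 / p)
            * (tol / w_prev) ** (g2 / p)
            * (dt1 / dt2) ** (-k2)
            * dt1)

# H211PI filter (g1 = g2 = 1/6, k2 = 0), order p = 3 estimator,
# with the error estimate at twice the tolerance:
dt = filtered_step_size(0.01, 0.01, 2e-6, 2e-6, 1e-6, 3, 1/6, 1/6, 0)
print(dt)  # smaller than 0.01: the step is reduced, but gently

# The elementary deadbeat controller (g1 = 1, g2 = k2 = 0) cuts harder:
dt_deadbeat = filtered_step_size(0.01, 0.01, 2e-6, 2e-6, 1e-6, 3, 1, 0, 0)
```

A production code would additionally cap the ratio ∆tn/∆tn−1 and guard against wn = 0, as the text points out.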

Among the filter designs are standard integrating controllers and proportional–integral (PI) controllers, as well as PI lowpass filters that suppress step size oscillations. A selection of filter coefficients of proven designs can be found in Table 9.3. The filter choice has little effect on the total number of steps as long as stability is maintained, but it does have a significant impact on computational stability. Thus, for ERK methods and nonstiff problems, the best choice is the PI3333 controller, while for IRK methods and stiff problems, the H211PI digital filter is preferable.

The reason why the same controller cannot be used with all methods is that in
explicit methods the step size needs to have a stable, smooth behavior also when
the step size becomes limited by numerical stability. For ERK methods, this can
be shown to require γ2 < 0. Meanwhile, in a lowpass filter suppressing (−1)n os-
cillations in the step size, it is necessary to have γ1 = γ2 , and since the “integral
gain” γ1 + γ2 must remain positive (although not too large) to make wn → T OL,
some of these criteria are conflicting. This means that we need different controllers
to manage regularity and computational stability well. It also implies that little can
be gained from varying the filter coefficients at random; the designs below are well
tested and robust, yet still have to be properly selected with respect to methods and
problem classes.

γ1 γ2 κ2 Designation
1 0 0 Elementary
1/3 0 0 Convolution
2/3 −1/3 0 PI3333
1/6 1/6 0 H211PI
1/b 1/b 1/b H211b; b ∈ [2, 6]
Table 9.3 Controllers and digital filters. A selection of filter coefficients for two-step step size
controllers. The first controller is the elementary deadbeat controller, which tends to react too fast
and induce step size oscillations. The second two are more benign and well suited to ERK methods
and nonstiff problems, with a low integral gain of γ1 + γ2 = 1/3. The last two are digital filters
suppressing step size oscillations (γ1 = γ2 ) in IRK methods applied to stiff or nonstiff problems

9.9 Stiffness detection

When discussing stiff problems, we defined the stiffness indicator of a linear system ẏ = Jy as

s[J] = (m2[J] + M2[J]) / 2,
and we found that the problem is stiff if and only if s[J] ≪ −1. As it is too expensive to compute this quantity along a solution, we need a simpler stiffness detection procedure.

We recall that, for a given inner product and any two vectors u and v, it holds that

m2[ f ] ≤ ⟨u − v, f (u) − f (v)⟩ / ‖u − v‖2² ≤ M2[ f ].

Rather than computing the average of the upper and lower bounds, we can obtain
an inexpensive estimate by simply computing the inner product, provided that two
different samples of the vector field can be computed at the same point in time. This
is easily done in RK methods, since RK methods sample the vector field at several
neighboring points.
To describe how this is done, let us consider the embedded RK34 pair,

Y1′ = f (tn, yn)
Y2′ = f (tn + ∆t/2, yn + ∆t Y1′/2)
Z2′ = f (tn + ∆t, yn − ∆t Y1′ + 2∆t Y2′)
Y3′ = f (tn + ∆t/2, yn + ∆t Y2′/2)
Y4′ = f (tn + ∆t, yn + ∆t Y3′)
zn+1 = yn + (∆t/6)(Y1′ + 4Y2′ + Z2′)
yn+1 = yn + (∆t/6)(Y1′ + 2Y2′ + 2Y3′ + Y4′).
Putting tn+1 = tn + ∆tn, we note that two of the five stage derivatives are

Z2′ = f (tn+1, Z2)
Y4′ = f (tn+1, Y4),

where the stage values Z2 and Y4 are introduced to simplify the notation. Moreover,
we note that the corresponding stage derivatives are both evaluated at time tn+1 . This
allows us to estimate the stiffness indicator at time tn+1 by

s[ f ] ≈ ⟨Z2 − Y4, f (tn+1, Z2) − f (tn+1, Y4)⟩ / ‖Z2 − Y4‖2² = ⟨Z2 − Y4, Z2′ − Y4′⟩ / (∆t² ‖−Y1′ + 2Y2′ − Y3′‖2²).

Since all quantities are generated in every single step taken by the RK34 method,
the stiffness indicator can be estimated at a moderate extra cost of computing the
inner product and norm as specified above.
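For the linear test problem ẏ = λy the estimate is in fact exact, since then Z2′ − Y4′ = λ(Z2 − Y4). The sketch below (Python, scalar case, so the inner product reduces to an ordinary product) recovers λ = −100 from the RK34 stage quantities:

```python
# Stiffness indicator estimate from RK34 stages, scalar linear problem
lam = -100.0
f = lambda t, y: lam * y
t, y, dt = 0.0, 1.0, 0.001

Y1 = f(t, y)
Y2 = f(t + dt / 2, y + dt * Y1 / 2)
Z2v = y - dt * Y1 + 2 * dt * Y2      # stage value Z2
Z2 = f(t + dt, Z2v)                  # stage derivative Z2'
Y3 = f(t + dt / 2, y + dt * Y2 / 2)
Y4v = y + dt * Y3                    # stage value Y4
Y4 = f(t + dt, Y4v)                  # stage derivative Y4'

# <Z2 - Y4, Z2' - Y4'> / ||Z2 - Y4||^2, scalar version
s = (Z2v - Y4v) * (Z2 - Y4) / (Z2v - Y4v) ** 2
print(s)  # recovers lam = -100 (up to rounding)
```

For nonlinear problems the quotient instead samples a local Rayleigh-type estimate of the logarithmic norms, as described above.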
Another, often better, alternative is to rearrange the computational sequence of the RK34 pair thus:

Y2′ = f (tn + ∆t/2, yn + ∆t Y1′/2)
Z2′ = f (tn + ∆t, yn − ∆t Y1′ + 2∆t Y2′)
Y3′ = f (tn + ∆t/2, yn + ∆t Y2′/2)
Y4′ = f (tn + ∆t, yn + ∆t Y3′)
zn+1 = yn + (∆t/6)(Y1′ + 4Y2′ + Z2′)
yn+1 = yn + (∆t/6)(Y1′ + 2Y2′ + 2Y3′ + Y4′)
Ŷ1′ = f (tn+1, yn+1).

Thus we now have the first stage derivative of the next step available before we estimate the stiffness indicator, enabling us to form an approximation from Y4′ and the new Ŷ1′, which are both computed at time tn+1. Thus we obtain the estimate

s[ f ] ≈ ⟨yn+1 − Y4, f (tn+1, yn+1) − f (tn+1, Y4)⟩ / ‖yn+1 − Y4‖2² = ⟨yn+1 − Y4, Ŷ1′ − Y4′⟩ / ‖yn+1 − Y4‖2².

Similar approximations can be found (possibly after minor rearrangements of the computational process) for all embedded ERK methods.
The reason for computing the stiffness indicator is that if an adaptive ERK
method is used and the differential equation starts exhibiting stiffness, the step size
will be limited by stability. A well designed PI step size controller will then manage
the situation, keeping the method stable. But the method will be inefficient, and it
is valuable to have this situation diagnosed automatically. In a dissipative system
(let us for simplicity consider ẏ = Jy with M2[J] < 0) the most common onset of stiffness is that the largest-magnitude negative eigenvalue λk[∆tJ] reaches the boundary of the stability region on the negative real axis. For example, in the RK4 method, we
have seen that the stability region reaches out to −2.7 on the real axis. This will be
identified by the stiffness detection, by observing that

∆t · s[J] ≈ −2.7

for several consecutive steps. By checking for this condition, an explicit solver can
return the diagnostics that the problem ought to be solved using an implicit method
intended for stiff problems instead.
In the construction of the estimate s[ f ] it is desirable to use the “last” stage derivative as described above. Since RK methods use nested function evaluations, it is the last, or “most nested,” function evaluation that gives rise to the elementary differential fy⁴ f (in the case of four stages). In the linear case, this elementary differential has the form J⁴ · Jy = J⁵y. Thus it is similar to a power iteration. As a power iteration converges to the dominant eigenvalue, which here typically is the negative eigenvalue of largest magnitude, the stiffness detector will pick up the mode that is likely to limit the step size due to numerical stability.
Chapter 10
Linear Multistep methods

As opposed to Runge–Kutta methods, linear multistep methods include additional information about the vector field by approximating the differential equation by a difference equation. Two different approaches dominate.
For nonstiff problems, the differential equation ẏ = f (t, y) is integrated, so that it is represented by an integral equation,

y(tn+k) − y(tn+k−1) = ∫[tn+k−1, tn+k] f (τ, y(τ)) dτ.

If an interpolation polynomial Ṗ(t) ≈ f (t, y(t)) can be constructed, the integral is approximated by a k-step numerical integration method, as a weighted sum. This yields schemes of the form

yn+k = yn+k−1 + ∆t · ∑( j = 0..k) β j f (tn+ j, yn+ j), (10.1)

where ∆t = tn+k − tn+k−1 is a constant step size. The coefficients β j are deter-
mined by integrating polynomial basis functions, and the numerical solution is
yn+k = P(tn+k ). The approach is reminiscent of RK methods, which also approx-
imate the integral by using several samples of the vector field f , but now these
samples are “recycled” function values from previous steps, rather than obtained
through nested evaluations of f on the current step.
This corresponds to the Adams methods, introduced in the second half of the 19th century by the English astronomer John Couch Adams. The methods were first used in the various attempts to locate an unknown planet (the discovery of Neptune in 1846). The methods proved so successful that they remain a standard computational technique to this very day. There are explicit and implicit Adams methods, known as Adams–Bashforth (AB) and Adams–Moulton (AM) methods, respectively. Both are only intended for nonstiff problems, although AM methods have better stability and significantly smaller errors. There are AB and AM methods of all orders.


For stiff problems, on the other hand, the differential equation is not converted
into an integral equation. Instead, the derivative is discretized using a finite differ-
ence. Thus ẏ = f (t, y) is replaced by a difference equation
∑( j = 0..k) α j yn+ j = ∆t · f (tn+k, yn+k), (10.2)

where the step size ∆t = tn+k − tn+k−1 is constant, and where

(1/∆t) ∑( j = 0..k) α j y(tn+ j) = ẏ(tn+k) + O(∆t^k).

These methods are known as backward differentiation formulas (BDF). They are
designed to have large stability regions, covering most of the negative half plane, so
as to make the BDF methods suitable for stiff problems. The methods were intro-
duced in the early 1950s, and efficient software emerged in the early 1970s. Since
then, these methods remain one of the most competitive approaches available to-
day for solving stiff initial value problems. The BDF family only contains methods
of orders 1–6, but the 6th order method is not used in practice due to insufficient
stability.
The Adams and BDF method families are the most common linear multistep
methods, even though there are other methods that potentially could offer higher ac-
curacy or better stability. One of the key advantages of the Adams and BDF methods
is that both families are based on a local polynomial representation of the solution,
making it comparatively simple to make the methods adaptive, using a varying
time step ∆t. Still, it is far more complicated than for RK methods.
To make the polynomial representation clear, Adams and BDF methods are dis-
tinguished as follows. In Adams methods we construct a polynomial Ṗ(t) interpo-
lating past f values. Since
P(tn+k) − P(tn+k−1) = ∫[tn+k−1, tn+k] Ṗ(t) dt, (10.3)

we obtain a computable numerical approximation yn+k = P(tn+k). In the BDF methods, on the other hand, we instead construct a polynomial P(t) interpolating past y values, requiring that this polynomial satisfies the differential equation at tn+k, i.e.,

Ṗ(tn+k) = f (tn+k, P(tn+k)). (10.4)
This is referred to as a collocation condition. Thus we see that both classes of
methods discussed above are related to interpolation theory, which also makes it
possible to extend the methods to variable step sizes.
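For instance, the two-step BDF method (BDF2) uses the difference operator (3/2)yn+2 − 2yn+1 + (1/2)yn on the left-hand side of (10.2); these coefficients are not listed in this excerpt and are quoted here as a standard assumption. A quick numerical check (Python) confirms that the scaled operator approximates ẏ(tn+2) to second order:

```python
import math

# Assumed BDF2 coefficients:
#   (3/2) y_{n+2} - 2 y_{n+1} + (1/2) y_n = dt * f(t_{n+2}, y_{n+2})
def bdf2_lhs(y, t, dt):
    # Left-hand side divided by dt: approximates y'(t + 2*dt)
    return (1.5 * y(t + 2 * dt) - 2 * y(t + dt) + 0.5 * y(t)) / dt

y = math.exp                 # test function whose derivative is known
err1 = abs(bdf2_lhs(y, 0.0, 0.1) - y(0.2))
err2 = abs(bdf2_lhs(y, 0.0, 0.05) - y(0.1))
print(err1 / err2)  # roughly 4: halving dt divides the error by ~4
```

The error ratio close to 4 under halving of ∆t is exactly the O(∆t²) behavior the general formula above predicts for k = 2.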

10.1 Adams methods

Adams methods compute a sequence {yn}, where yn ≈ y(tn), and where the corresponding samples of the vector field are y′n = f (tn, yn). These are denoted y′n instead of ẏn, since they are not derivatives of a function, but merely values of f obtained near the exact solution trajectory.
The Adams-Bashforth methods are explicit. The simplest AB method is the
one-step AB1, and is identical to the explicit Euler method. To derive the first dis-
tinct AB method, we turn to the two-step AB2 method. This is a method that takes
y′n and y′n+1 and computes

yn+2 − yn+1 = ∆t (β1 y′n+1 + β0 y′n),
where we need to determine the two coefficients β0 and β1 . This can be done in
several different ways. In keeping with a polynomial representation, let
Ṗ(t) = y′n+1 · (t − tn)/(tn+1 − tn) + y′n · (t − tn+1)/(tn − tn+1) = y′n+1 · ϕ1(t) + y′n · ϕ0(t),
where the basis functions ϕ j are Lagrange interpolation polynomials of degree 1, satisfying

ϕ j(tn+k) = δ jk = 1 if j = k, and 0 if j ≠ k.
Here δ jk is the Kronecker delta. It follows that

Ṗ(tn ) = y0n , Ṗ(tn+1 ) = y0n+1 ,

showing that the polynomial Ṗ(t) interpolates the two derivative samples y0n (at tn )
and y0n+1 (at tn+1 ). We now have
P(tn+2) − P(tn+1) = ∫_{tn+1}^{tn+2} Ṗ(t) dt
                 = y′n+1 ∫_{tn+1}^{tn+2} ϕ1(t) dt + y′n ∫_{tn+1}^{tn+2} ϕ0(t) dt
                 = ∆t (β1 y′n+1 + β0 y′n).


Determining the coefficients, we have

β1 = (1/∆t) ∫_{tn+1}^{tn+2} ϕ1(t) dt = (1/∆t) ∫_{tn+1}^{tn+2} (t − tn)/(tn+1 − tn) dt = (1/∆t²) [(t − tn)²/2]_{tn+1}^{tn+2} = 3/2

and

β0 = (1/∆t) ∫_{tn+1}^{tn+2} ϕ0(t) dt = (1/∆t) ∫_{tn+1}^{tn+2} (t − tn+1)/(tn − tn+1) dt = −(1/∆t²) [(t − tn+1)²/2]_{tn+1}^{tn+2} = −1/2.
92 10 Linear Multistep methods

Relabeling the derivative samples y′n = f(tn, yn) etc., we have the constant step size form of the Adams–Bashforth AB2 method,

yn+2 − yn+1 = ∆t ((3/2) f(tn+1, yn+1) − (1/2) f(tn, yn)).
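As a sanity check on this derivation, AB2 takes only a few lines of code, and its second order convergence can be observed by halving the step size. The sketch below is illustrative Python (not from the text; the helper name `ab2` is ours), bootstrapping with one explicit Euler step, i.e. AB1, and integrating ẏ = −y on [0, 1]:

```python
import math

def ab2(f, t0, y0, dt, n):
    # First step with explicit Euler (AB1) to obtain the second back value,
    # then y_{n+2} = y_{n+1} + dt*(3/2 f_{n+1} - 1/2 f_n).
    t = [t0, t0 + dt]
    y = [y0, y0 + dt * f(t0, y0)]
    for k in range(1, n):
        y.append(y[k] + dt * (1.5 * f(t[k], y[k]) - 0.5 * f(t[k - 1], y[k - 1])))
        t.append(t[k] + dt)
    return y

f = lambda t, y: -y                      # test problem, exact solution e^{-t}
errs = [abs(ab2(f, 0.0, 1.0, 1.0 / n, n)[-1] - math.exp(-1.0))
        for n in (100, 200, 400)]
orders = [math.log2(errs[i] / errs[i + 1]) for i in range(2)]
```

Halving ∆t reduces the error by a factor of about four, confirming the order p = 2.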

The same process can be repeated for increasing polynomial degrees. The key is to interpolate the f-samples y′n, . . . , y′n+k−1 by a polynomial Ṗ of degree k − 1, constructed from the Lagrange basis ϕ0(t), . . . , ϕk−1(t), and compute the method coefficients

βj = (1/∆t) ∫_{tn+k−1}^{tn+k} ϕj(t) dt.    (10.5)
This results in the explicit k-step ABk method, of order p = k,

yn+k − yn+k−1 = ∆t (βk−1 f (tn+k−1 , yn+k−1 ) + · · · + β0 f (tn , yn )) , (10.6)

where the coefficients are given in Table 10.1.

Steps k Order p βk−1 βk−2 βk−3 βk−4 βk−5


1 1 1
2 2 3/2 −1/2
3 3 23/12 −16/12 5/12
4 4 55/24 −59/24 37/24 −9/24
5 5 1901/720 −2774/720 2616/720 −1274/720 251/720
Table 10.1 Adams–Bashforth method coefficients. The first five ABk methods are explicit, with
coefficients β j normalized so that Σ β j = 1. Note that AB1 is the explicit Euler method

While explicit methods are inexpensive, implicit methods typically offer bet-
ter stability. The procedure above can be used to construct the implicit Adams–
Moulton methods. The simplest AM method is the implicit Euler method of order
p = 1, but there is also another one-step AM1 method of order p = 2, the trapezoidal
rule. The first multistep AM method is the two-step AM2 method,

yn+2 − yn+1 = ∆t (β2 y′n+2 + β1 y′n+1 + β0 y′n),

where we need to determine three coefficients β0, β1 and β2. The method is implicit since the vector field sample y′n+2 = f(tn+2, yn+2) depends on yn+2, which also appears in the left hand side.
For the AM2 method, we construct an interpolation polynomial

Ṗ(t) = y′n+2 · (t − tn+1)(t − tn) / ((tn+2 − tn+1)(tn+2 − tn))
     + y′n+1 · (t − tn+2)(t − tn) / ((tn+1 − tn+2)(tn+1 − tn))
     + y′n · (t − tn+2)(t − tn+1) / ((tn − tn+2)(tn − tn+1))
     = y′n+2 · ϕ2(t) + y′n+1 · ϕ1(t) + y′n · ϕ0(t),

where the basis functions ϕj are Lagrange interpolation polynomials of degree 2. The coefficients are determined in the same way as in the explicit case, by computing the integrals

βj = (1/∆t) ∫_{tn+k−1}^{tn+k} ϕj(t) dt.
For example,

β2 = (1/∆t) ∫_{tn+1}^{tn+2} (t − tn+1)(t − tn) / ((tn+2 − tn+1)(tn+2 − tn)) dt = 5/12.
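Since ϕ2 is a quadratic, this value can be verified numerically with Simpson's rule, which is exact for polynomials of degree up to three. A small check (with the assumed uniform grid tn = 0, tn+1 = 1, tn+2 = 2, so ∆t = 1):

```python
# Verify beta_2 = 5/12 for AM2 by integrating phi_2 over [t_{n+1}, t_{n+2}]
# with Simpson's rule (exact for quadratics). Assumed grid: dt = 1.
tn, tn1, tn2 = 0.0, 1.0, 2.0
dt = tn1 - tn

def phi2(t):
    # Lagrange basis polynomial attached to the node t_{n+2}
    return (t - tn1) * (t - tn) / ((tn2 - tn1) * (tn2 - tn))

h = tn2 - tn1
integral = (h / 6.0) * (phi2(tn1) + 4.0 * phi2(0.5 * (tn1 + tn2)) + phi2(tn2))
beta2 = integral / dt
```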

The form of the implicit k-step AMk method, of order p = k + 1, is

yn+k − yn+k−1 = ∆t (βk f (tn+k , yn+k ) + · · · + β0 f (tn , yn )) , (10.7)

where the coefficients are given in Table 10.2.

Steps k Order p βk βk−1 βk−2 βk−3 βk−4


1 1 1
1 2 1/2 1/2
2 3 5/12 8/12 −1/12
3 4 9/24 19/24 −5/24 1/24
4 5 251/720 646/720 −264/720 106/720 −19/720
Table 10.2 Adams–Moulton method coefficients. The first five AMk methods are implicit, with
coefficients β j normalized so that Σ β j = 1. Note that there are two one-step methods, the implicit
Euler method of order p = 1, and the trapezoidal rule of order p = 2

Now, because the AM methods are implicit, we need to solve for yn+k . In the
AMk method, the equation for yn+k is

yn+k = ∆t βk f (tn+k , yn+k ) + ψ, (10.8)

where the vector ψ only depends on past data,

ψ = yn+k−1 + ∆t (βk−1 f (tn+k−1 , yn+k−1 ) + · · · + β0 f (tn , yn )) . (10.9)

The nonlinear equation (10.8) can in principle be solved using either fixed point
iteration or Newton iteration. For fixed point iteration to converge, we need

∆t βk L[ f ] < 1, (10.10)

where L[f] is the Lipschitz constant of the vector field f with respect to its second argument. Because the AMk methods are intended for nonstiff problems, where L[f] is of moderate size, this condition is usually acceptable. Selecting ∆t small enough to satisfy (10.10) is comparable to the step size restriction imposed by numerical stability, which also has to be satisfied.

But we still need an initial approximation y^{0}_{n+k} to get the iteration started. This is obtained from the explicit ABk method; this is referred to as a predictor–corrector (PC) scheme, where the ABk method is the predictor (P), and the AMk is the corrector (C). The classical PC scheme is

y^{0}_{n+k} = yn+k−1 + ∆t (β̃k−1 f(tn+k−1, yn+k−1) + · · · + β̃0 f(tn, yn))
y^{m+1}_{n+k} = ∆t βk f(tn+k, y^{m}_{n+k}) + ψ,

where the β̃ j coefficients refer to the ABk predictor method, and βk to the AMk
corrector method. The superscript m refers to the iteration index, and the corrector
is iterated “until convergence.” This means that the iteration is terminated when the
remaining error in the AMk equation is acceptably small.
If the explicit method is used as a predictor, it first evaluates (E) the vector field f(tn+k−1, yn+k−1) to compute the predictor (P). If the explicit method is used to advance the solution, the procedure continues in this EP-EP-EP. . . mode. If the implicit method is used, however, the EP step of the predictor is followed by a function evaluation (E) and a subsequent correction (C) using the implicit method. The correction iteration is repeated m times, and the scheme is denoted EP(EC)^m. The vector field f is usually sampled also at the accepted point y^{m+1}_{n+k}, which is equivalent to the first evaluation (E) of the predictor for the next step. It is important to avoid evaluating the vector field at the accepted point in stiff computations, when (10.10) does not hold; there such an evaluation typically degrades accuracy.
Because convergence can be slow and each iteration requires a new function
evaluation, the AMk equation (10.8) is sometimes solved using Newton’s method.
This speeds up convergence, but at the cost of having to compute the Jacobian matrix
∂ f /∂ y and solving full linear systems of equations. In this case, too, we need an
initial approximation, and the ABk method is still of use for this purpose.
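A minimal EP(EC)^m scheme with AB2 as predictor and AM2 as corrector can be sketched as follows (illustrative Python, with coefficients taken from Tables 10.1 and 10.2, and exact starting values assumed for simplicity):

```python
import math

def pc_ab2_am2(f, t, y, dt, n, m=2):
    # EP(EC)^m per step: AB2 predictor (P), then m fixed-point
    # corrections (EC) with the AM2 corrector.
    for k in range(1, n):
        fk = f(t[k - 1], y[k - 1])
        fk1 = f(t[k], y[k])
        tk2 = t[k] + dt
        ynew = y[k] + dt * (1.5 * fk1 - 0.5 * fk)                   # P: AB2
        psi = y[k] + dt * ((8.0 / 12.0) * fk1 - (1.0 / 12.0) * fk)  # (10.9) for AM2
        for _ in range(m):
            ynew = psi + dt * (5.0 / 12.0) * f(tk2, ynew)           # (EC): AM2
        t.append(tk2)
        y.append(ynew)
    return t, y

f = lambda t, y: -y
dt, n = 0.01, 100
t, y = pc_ab2_am2(f, [0.0, dt], [1.0, math.exp(-dt)], dt, n)
err = abs(y[-1] - math.exp(-1.0))
```

With ∆t βk L[f] well below 1, two corrections already bring the iterate close to the exact AM2 solution, and the third order accuracy of the corrector is retained.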

10.2 BDF methods

The one-step BDF method, BDF1, is identical to the implicit Euler method. The
two-step BDF2 is obtained by constructing a polynomial P(t) of degree 2, interpo-
lating the back values yn and yn+1 , and satisfying the differential equation at tn+2 .
This is a set of three conditions, which determine the three coefficients of the poly-
nomial:

P(tn ) = yn
P(tn+1 ) = yn+1
Ṗ(tn+2 ) = f (tn+2 , P(tn+2 )).

Writing the polynomial as



P(t) = yn ϕ0 (t) + yn+1 ϕ1 (t) + yn+2 ϕ2 (t),

where ϕ j (t) are the Lagrange basis polynomials of degree 2, the first two interpola-
tion conditions are automatically satisfied. As for the third condition, we have

yn ϕ̇0 (tn+2 ) + yn+1 ϕ̇1 (tn+2 ) + yn+2 ϕ̇2 (tn+2 ) = f (tn+2 , yn+2 ).

Let us introduce the following notation,

αj = ∆t · ϕ̇j(tn+2).

We then have a nonlinear equation for the determination of yn+2 ,

α2 yn+2 − ∆t f (tn+2 , yn+2 ) = ψ,

where ψ = −α0 yn − α1 yn+1 only depends on past data. Thus, to work out the coef-
ficients of the BDF2 method, we need to find the derivatives of the basis functions.
(This is the opposite of the case with Adams methods, where we had to find the
integrals of the basis functions.) To this end, we note that

ϕ0(t) = (t − tn+1)(t − tn+2) / ((tn − tn+1)(tn − tn+2))
ϕ1(t) = (t − tn)(t − tn+2) / ((tn+1 − tn)(tn+1 − tn+2))
ϕ2(t) = (t − tn)(t − tn+1) / ((tn+2 − tn)(tn+2 − tn+1)).

Noting that the basis functions are of the form a · (t − b)(t − c), their derivatives are
a · (2t − (b + c)), so that, assuming that the step size ∆t is constant, we get

ϕ̇0(tn+2) = (2tn+2 − (tn+1 + tn+2)) / ((tn − tn+1)(tn − tn+2)) = ∆t/(2∆t²) = 1/(2∆t)
ϕ̇1(tn+2) = (2tn+2 − (tn + tn+2)) / ((tn+1 − tn)(tn+1 − tn+2)) = −2∆t/∆t² = −2/∆t
ϕ̇2(tn+2) = (2tn+2 − (tn + tn+1)) / ((tn+2 − tn)(tn+2 − tn+1)) = 3∆t/(2∆t²) = 3/(2∆t).

It follows that the constant step size BDF2 method is

(3/2) yn+2 − 2 yn+1 + (1/2) yn = ∆t f(tn+2, yn+2).
When this is extended to BDFk, the method takes the form

αk yn+k + · · · + α0 yn = ∆t f (tn+k , yn+k ),

where the coefficients αj are found in Table 10.3.



Steps k Order p αk αk−1 αk−2 αk−3 αk−4 αk−5


1 1 1 −1
2 2 3/2 −2 1/2
3 3 11/6 −3 3/2 −1/3
4 4 25/12 −4 3 −4/3 1/4
5 5 137/60 −5 5 −10/3 5/4 −1/5
Table 10.3 BDF method coefficients. The first five BDFk methods are implicit, with coefficients α j
normalized so that βk = 1. Note that the BDF1 method is equivalent to the implicit Euler method.
The BDFk method is of order p = k as long as k ≤ 6
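The stability payoff of this implicit form is easy to demonstrate on the linear test problem ẏ = λy with λ ≪ 0, where the implicit BDF2 step can be solved in closed form. In the following sketch (illustrative problem data), BDF2 decays as the exact solution does at a step size where the explicit Euler method explodes:

```python
lam, dt, nsteps = -1000.0, 0.1, 50    # stiff: |lam|*dt = 100

# BDF2 for y' = lam*y: (3/2) y_{n+2} - 2 y_{n+1} + (1/2) y_n = dt*lam*y_{n+2};
# for a linear problem the implicit equation is solved in closed form.
ys = [1.0, 1.0 / (1.0 - dt * lam)]    # start with one implicit Euler (BDF1) step
for _ in range(nsteps):
    ys.append((2.0 * ys[-1] - 0.5 * ys[-2]) / (1.5 - dt * lam))

# explicit Euler at the same step size is violently unstable
ye = 1.0
for _ in range(nsteps):
    ye *= 1.0 + dt * lam              # growth factor |1 + dt*lam| = 99 per step
```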

10.3 Operator theory

However, there is another, more interesting, way to derive these methods. For con-
stant steps, this is based on operator theory. Thus we introduce the forward shift
operator, acting on continuous functions y(t), according to

E∆t : y(t) ↦ y(t + ∆t).    (10.11)

The forward shift operator has the following properties:


1. E⁰ = 1
2. E∆t E∆s = E∆s E∆t = E∆t+∆s
3. (E∆t)⁻¹ = E⁻∆t.
The properties are reminiscent of an exponential function. This is no coincidence.
Let us denote the usual differentiation operator by D, so that Dy = ẏ. For an analytic
function Taylor’s theorem now reads

E∆t y = y(t + ∆t) = y + ∆t Dy + ((∆t D)²/2!) y + ((∆t D)³/3!) y + · · · = e^{∆t D} y,
where the operator function e∆t D is interpreted as the operator series

e^{∆t D} = Σ_{j=0}^{∞} (∆t D)^j / j!.    (10.12)

In operator calculus the actual convergence of this series is an issue. Here we shall
refrain from investigating this aspect, as we will truncate every operator series we
derive. In fact, this is what numerical computation is about – we always truncate
infinite expansions, to consider the approximations that can be obtained from the
first few terms. This may appear less rigorous, but even for a truncated series, the
resulting expansion will be exact for polynomials up to a corresponding degree.
The result of the calculus above is that Taylor’s theorem can formally be written

E∆t = e∆t D (10.13)

explaining the exponential character of the forward shift operator.


Since we need finite differences, we introduce the forward difference operator

∆ : y(t) ↦ y(t + ∆t) − y(t)    (10.14)

as well as the backward difference operator

∇ : y(t) ↦ y(t) − y(t − ∆t).    (10.15)

Obviously, ∆ y = E∆t y − y, and ∇y = y − E−∆t y. Hence we can write

∆ = E∆t − 1
∇ = 1 − E−∆t .

Since “time” t is a function in its own right, we may apply these operators to y(t) = t,
obtaining the seemingly trivial result

∆ t = E∆t t − t = (t + ∆t) − t = ∆t.

This shows that a “time increment” ∆t is indeed only the forward difference operator
acting on the function y(t) = t. Later, in connection with boundary value problems,
the independent variable is “space” x, and a space discretization ∆ x is then in a
similar way a forward difference operator acting on the independent variable. With
this notation, we have

∆y/∆t = (y(t + ∆t) − y(t))/∆t ≈ dy/dt = Dy.

All operators introduced here are linear operators. They can be added and com-
posed, with composition being identified by “multiplication.” With these two oper-
ations, these operators satisfy the axioms of a ring, and form an operator algebra.
By permitting infinite series (this needs further qualifications), we are working in a
somewhat broader context of operator calculus. Let the game begin.
Since ∇ = 1 − E−∆t and E∆t = e∆t D , we have e−∆t D = 1 − ∇. Hence, taking
logarithms, we get
∆t D = − log(1 − ∇),
where the logarithm function is to be interpreted in terms of its power series expan-
sion. Thus,

∆t D = ∇ + ∇²/2 + ∇³/3 + ∇⁴/4 + · · · = Σ_{j=1}^{∞} ∇^j / j.    (10.16)

Now consider a differential equation ẏ = f (t, y). Then ∆t Dy = ∆t f (t, y), and a lin-
ear multistep method can be obtained by replacing the operator ∆t D by a truncated

operator series from (10.16). This yields a k-step finite difference method,

(∇ + ∇²/2 + · · · + ∇^k/k) yn = ∆t f(tn, yn).    (10.17)

This is simply the BDFk method in backward difference representation. If k = 1,


we have the implicit Euler method in backward difference representation, as

∇yn+1 = ∆t f (tn+1 , yn+1 ).

More interestingly, for k = 2 we obtain the method

(∇ + ∇²/2) yn = ∆t f(tn, yn).

Here the operators act on the discrete sequence {yn } rather than on functions, and
∇{yn } = {yn − yn−1 }, which we represent in the informal shorthand notation

∇yn = yn − yn−1 .

Given that yn ≈ y(tn ), we see that this is the same action as when the operator acts
on continuous functions. Powers of the operator ∇ are defined recursively, by

∇² yn = ∇(∇yn) = ∇(yn − yn−1) = yn − 2yn−1 + yn−2.

The same principle applies to higher order differences. Considering (10.17), we find
that

yn − yn−1 + (yn − 2yn−1 + yn−2)/2 = ∆t f(tn, yn),

which is simplified to

(3/2) yn − 2 yn−1 + (1/2) yn−2 = ∆t f(tn, yn).
This is recognized as the BDF2 formula obtained from the Lagrange polynomial
basis.
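The backward-difference derivation mechanizes nicely in exact rational arithmetic: expanding Σ_{j=1}^{k} ∇^j/j with ∇^j yn = Σ_i (−1)^i C(j, i) yn−i recovers the BDF coefficients. A verification sketch in Python (our helper name, not book code):

```python
from fractions import Fraction
from math import comb

def bdf_coefficients(k):
    # Coefficient of y_{n-i} in (sum_{j=1}^k nabla^j / j) y_n,
    # using nabla^j y_n = sum_i (-1)^i C(j, i) y_{n-i}.
    c = [Fraction(0)] * (k + 1)
    for j in range(1, k + 1):
        for i in range(j + 1):
            c[i] += Fraction((-1) ** i * comb(j, i), j)
    return c
```

For k = 2 this gives [3/2, −2, 1/2], the BDF2 formula just derived; k = 3 reproduces the BDF3 row of Table 10.3.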
The use of operator series is very powerful so long as the step size is constant.
(Variable step size multistep methods are very complicated, although necessary in
advanced scientific computing.) In principle, any multistep method can be derived
using difference operators on a uniform grid. Thus, noting that

∇ + ∇²/2 + ∇³/3 + · · · = ∇ (1 + ∇/2 + ∇²/3 + · · ·) = ∇ · L(∇),

the operator L(∇) can be formally inverted, to construct a method

∇yn = L^{−1}(∇) ∆t f(tn, yn).



Expanding L^{−1}(∇) in powers of ∇ we obtain

∇yn = (1 − ∇/2 − ∇²/12 − ∇³/24 − 19∇⁴/720 − 3∇⁵/160 − · · ·) ∆t f(tn, yn).

These are the AMk methods in backward difference representation, where we include as many powers of ∇ as needed. If the highest power on the right hand side is ∇^k, the method is the k-step AMk.
A similar technique can be used to derive the ABk methods as well as many other
types of methods. Operator techniques are particularly convenient to derive constant
step size multistep methods.
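The formal inversion of L(∇) can be checked in the same exact-arithmetic style: invert the power series L(x) = 1 + x/2 + x²/3 + · · · term by term and compare with the coefficients quoted above (a verification sketch, not book code):

```python
from fractions import Fraction

def invert_series(a, nterms):
    # Coefficients b_j of the reciprocal power series of sum_j a_j x^j,
    # assuming a_0 != 0, from the convolution identity (a*b)_m = 0, m >= 1.
    b = [Fraction(1) / a[0]]
    for m in range(1, nterms):
        s = sum(a[j] * b[m - j] for j in range(1, m + 1))
        b.append(-s / a[0])
    return b

# L(x) = 1 + x/2 + x^2/3 + ..., i.e. a_j = 1/(j+1)
a = [Fraction(1, j + 1) for j in range(6)]
b = invert_series(a, 6)
# expected: 1, -1/2, -1/12, -1/24, -19/720, -3/160
```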

10.4 General theory of linear multistep methods

To develop a general theory, the problem to be solved is ẏ = f (t, y). Linear multistep
methods approximate this problem by a difference equation of the form
Σ_{j=0}^{k} αj yn+j = ∆t Σ_{j=0}^{k} βj f(tn+j, yn+j).    (10.18)

As the difference equation is of order k, the method (10.18) is referred to as a k-step method. The two sets of coefficients {αj}_{j=0}^{k} and {βj}_{j=0}^{k} define the generating polynomials,

ρ(ζ) = Σ_{j=0}^{k} αj ζ^j;    σ(ζ) = Σ_{j=0}^{k} βj ζ^j,    (10.19)

which are assumed to have no common factor. We also assume that αk ≠ 0. The coefficients are normalized by requiring that σ(1) = 1. Then the pair (ρ, σ) uniquely defines a linear k-step method, which can be written in terms of the forward shift operator as

ρ(E∆t) yn = ∆t σ(E∆t) f(tn, yn).

As αk ≠ 0 the difference equation (10.18) can be written

yn+k − ∆t (βk/αk) f(tn+k, yn+k) = ψ,    (10.20)

where ψ only depends on past data. The method is explicit if βk = 0 and implicit if βk ≠ 0. In the latter case (10.20) must be solved numerically to determine yn+k.
Inserting any sufficiently differentiable function y and its derivative ẏ into (10.18)
we find, by Taylor series expansion, that
rn[y] := Σ_{j=0}^{k} αj y(tn+j) − ∆t Σ_{j=0}^{k} βj ẏ(tn+j) = ck ∆t^{p+1} y^{(p+1)}(tn + θ k∆t),    (10.21)

for some θ ∈ [0, 1] as ∆t → 0, where the remainder term rn is the local residual.
Here ck is the error constant, and p is the order of consistency. The order of
consistency is determined by inserting polynomials y(t) = t^q, with ẏ(t) = q t^{q−1},
since rn[t^q] ≡ 0 for all q ≤ p.
Specifically for q = 0 the order condition is ∑ α j = ρ(1) = 0. Hence ρ(ζ ) = 0
must always have one root ζ = 1, known as the principal root. Taking n = 0 and
t j = j · ∆t, for q ≥ 1, the order conditions are

Σ_{j=0}^{k} (αj j^q − βj q j^{q−1}) = 0;    q = 1, . . . , p,    (10.22)

where the (p + 1)th condition fails. The first order condition (q = 1) can also be written ρ′(1) = σ(1). The two conditions ρ(1) = 0 and ρ′(1) = σ(1) are referred to as pre-consistency and consistency, respectively.
As a k-step method has 2k + 1 coefficients (one coefficient being lost to normalization), and the coefficients {αj}_{j=0}^{k} and {βj}_{j=0}^{k} enter (10.22) linearly, it is possible to have rn[t^q] ≡ 0 for q = 0, . . . , 2k. Thus the maximal order of consistency is p = 2k.
However, for k > 2 such methods are unstable and fail to be convergent.
We shall see that zero-stability restricts the order, and we have to distinguish be-
tween order of consistency, and order of convergence. The latter is the important
property, which means that the global error en = yn − y(tn ) = O(∆t p ) as ∆t → 0.
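The order conditions (10.22) translate directly into a computational test of a method's order of consistency. The sketch below (our helper name, exact rational arithmetic) confirms p = 2 for both AB2 and BDF2:

```python
from fractions import Fraction as F

def consistency_order(alpha, beta, qmax=10):
    # Order of consistency: largest p such that (10.22) holds for q = 1..p
    # (the q = 0 condition is sum(alpha) = rho(1) = 0).
    assert sum(alpha) == 0
    p = 0
    for q in range(1, qmax + 1):
        r = sum(a * j**q - b * q * j**(q - 1)
                for j, (a, b) in enumerate(zip(alpha, beta)))
        if r != 0:
            break
        p = q
    return p

# AB2:  y_{n+2} - y_{n+1} = dt*(3/2 f_{n+1} - 1/2 f_n)
p_ab2 = consistency_order([F(0), F(-1), F(1)], [F(-1, 2), F(3, 2), F(0)])
# BDF2: (3/2) y_{n+2} - 2 y_{n+1} + (1/2) y_n = dt f_{n+2}
p_bdf2 = consistency_order([F(1, 2), F(-2), F(3, 2)], [F(0), F(0), F(1)])
```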

10.5 Stability and convergence

Let us consider the simple problem ẏ = f (t). Then (10.18) can be written

ρ(E∆t )yn = ∆t σ (E∆t ) fn . (10.23)

The solution is yn = un + vn , where {un } is a particular solution of the difference


equation, and the homogeneous solutions {vn } satisfy ρ(E∆t )vn = 0. The latter are
determined by the roots ζν of the characteristic equation ρ(ζ ) = 0.
Thus the homogeneous solutions depend on the method but are unrelated to the given problem ẏ = f(t). They must therefore remain bounded for all n to avoid an unstable numerical solution diverging from the particular solution {un}, which approximates the exact solution, ∫ f(t) dt.
The homogeneous solutions are unstable unless all roots ζν are inside or on the
unit circle. Furthermore, {vn } also grows if any root on the unit circle is multiple.
Thus it is necessary to impose the root condition,

ρ(ζ ) = 0 ⇒ |ζν | ≤ 1; ν = 1, . . . , k
|ζν | = 1 ⇒ ζν is a simple root.

A method whose ρ polynomial satisfies the root condition is zero stable. Zero sta-
bility is necessary for convergence, i.e., for the numerical solution to converge to
the exact solution, yn → y(tn ) as ∆t → 0.
If a zero-stable method (ρ, σ ) has order of consistency p, it is also convergent,
with order of convergence p, i.e., kyn − y(tn )k = O(∆t p ) as ∆t → 0. In particular,
every consistent one-step method is convergent, since ρ has but a single root. This
is why zero stability is not an issue for RK methods.
Thus the bridge from order of consistency to order of convergence is stability. For a stable method, the order of convergence equals the order of consistency. We will see later that this is a general principle in numerical analysis.
In the case of linear multistep methods, the crucial stability notion (for conver-
gence) is zero stability. Of the methods studied here, the Adams–Bashforth and Adams–Moulton methods have a single root ζ = 1 and the remaining roots at ζ = 0.
Hence they satisfy the root condition; they are zero stable and convergent. In the
present context, zero stability is only an issue for the BDF methods, due to the
structure of their ρ polynomials. The BDF1–6 are zero stable, but the BDF7 and
higher are not. Thus, while the BDFk has order of consistency p = k for all k, the
methods are only convergent for k ≤ 6.
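The root condition is easy to check numerically. Building ρ for BDFk from its backward-difference form, ρ(ζ) = Σ_{j=1}^{k} (1/j) ζ^{k−j} (ζ − 1)^j, and computing its roots shows an extraneous root of BDF7 leaving the unit disc (a numpy sketch):

```python
import numpy as np

def bdf_rho(k):
    # rho(zeta) = sum_{j=1}^k (1/j) * zeta^(k-j) * (zeta - 1)^j
    rho = np.poly1d([0.0])
    for j in range(1, k + 1):
        zeta_pow = np.poly1d([1.0] + [0.0] * (k - j))
        rho = rho + (1.0 / j) * zeta_pow * (np.poly1d([1.0, -1.0]) ** j)
    return rho

max_mod = {k: max(abs(np.roots(bdf_rho(k).coeffs))) for k in range(1, 8)}
```

For k ≤ 6 the largest modulus belongs to the principal root ζ = 1; for k = 7 an extraneous root has modulus greater than one, so the root condition fails.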
As only convergent methods are of interest, the maximum order needs to be re-
examined, taking zero stability into account. According to the First Dahlquist Bar-
rier Theorem, the maximal order of convergence of a k-step method is

pmax = k for explicit methods, pmax = k + 1 for implicit methods with k odd, and pmax = k + 2 for implicit methods with k even.

Implicit methods of order p = k + 2 are weakly stable, meaning that ρ(ζ ) = 0


has two or more distinct roots on the unit circle. This usually results in a spuri-
ous, oscillatory error caused by undamped homogeneous solutions, and has been
demonstrated in the explicit midpoint method,

yn+2 − yn = 2∆t f(tn+1, yn+1).

This method has ρ(ζ) = ζ² − 1 with zeros ζ = ±1. Such methods are only used
in exceptional cases or for special problems. In practice, the maximal order of an
implicit method is p = k + 1, as exemplified by the AMk methods. Similarly, the
ABk methods are explicit, and of maximal order p = k. The implicit BDFk methods,
on the other hand, are only of order p = k, as they sacrifice maximal order for
improved stability, so that they can be used for stiff differential equations.

10.6 Stability regions

Stability regions are determined by applying the method to the linear test equation
ẏ = λ y for λ ∈ C, with y(0) = 1. Since y(t) = etλ , the zero solution y = 0 is stable
whenever Re λ ≤ 0.
Applying (ρ, σ ) to the linear test equation leads to the homogeneous difference
equation ρ(E)yn − ∆t λ σ (E)yn = 0. Stability is then governed by the roots of the
characteristic equation
ρ(ζ ) − ∆t λ σ (ζ ) = 0. (10.24)
Like before, solutions are stable if and only if the k roots of (10.24) satisfy
|ζν (∆t λ )| ≤ 1, with simple roots of unit modulus. The stability region is defined
as

S(ρ, σ ) = {∆t λ ∈ C ρ(ζ ) − ∆t λ σ (ζ ) satisfies the root condition}. (10.25)

Note that zero stability can be expressed as 0 ∈ S(ρ, σ ). This is a very modest
requirement. In practice, a method needs a fairly large stability region.
Since few methods have the property that S(ρ, σ ) = C− we distinguish between
numerical stability and the mathematical stability of the problem. Moreover,
since S(ρ, σ ) is defined in terms of ∆t λ , combining method and problem parame-
ters ∆t and λ , there is in general a restriction on the step size ∆t in order to maintain
numerical stability. This is referred to as conditional stability.
The stability regions of the Euler methods, the trapezoidal rule, the explicit mid-
point method and RK methods have already been investigated. A method whose
stability region covers the left half-plane, C− ⊂ S(ρ, σ ), is called A-stable. This
leads to additional constraints on the method’s coefficients. Thus, according to the
Second Dahlquist Barrier theorem, the maximum order of an A-stable multistep
method is p = 2, and of all A-stable second order multistep methods, the trapezoidal
rule has the smallest error constant.
Although this appears to be a disappointing result, among higher order multistep
methods, the BDF methods have stability regions covering most of C− . Otherwise,
high order A-stable methods can be found among IRK methods. The stability re-
gions of some of the AMk and BDFk methods are plotted in Figure 10.1.
The characteristic equation ρ(ζ) − ∆tλ σ(ζ) = 0 implies that

∆t λ = ρ(ζ)/σ(ζ).

To find the stability region, we note that on the boundary ∂S(ρ, σ), there is at least one root of unit modulus, i.e., ζν(∆tλ) = e^{iϕ}. We therefore let ζ = e^{iϕ}, and consider

z = ρ(e^{iϕ})/σ(e^{iϕ}).


Fig. 10.1 Stability regions of AMk and BDFk methods. Adams–Moulton methods of orders p =
3 − 6 (left) and BDF methods of orders p = 2 − 5 (right) plotted in the complex ∆t λ plane. AMk
methods are stable inside each closed curve. The stability regions shrink with increasing order,
starting from p = 3. BDF methods are stable outside each closed curve, where the largest curve
corresponds to p = 5. Here, too, stability is lost with increasing order

By plotting the image of the unit circle under the rational map ρ/σ , we obtain the
boundary locus in the complex ∆t λ plane. The boundary of the stability region
consists of (a subset of) this curve.
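For BDF2, with ρ(ζ) = (3/2)ζ² − 2ζ + 1/2 and σ(ζ) = ζ², the boundary locus takes only a few lines (numpy sketch; plotting omitted). Two features can be checked: the curve passes through the origin at ϕ = 0, reflecting zero stability, and its real part never becomes negative, consistent with the A-stability of BDF2:

```python
import numpy as np

phi = np.linspace(0.0, 2.0 * np.pi, 721)   # phi[360] = pi
zeta = np.exp(1j * phi)
rho = 1.5 * zeta**2 - 2.0 * zeta + 0.5     # BDF2 generating polynomials
sigma = zeta**2
z = rho / sigma                            # boundary locus in the dt*lambda plane
```

At ϕ = π the locus crosses the real axis at ∆tλ = 4, the rightmost point of the closed curve bounding the instability region.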

10.7 Adaptive multistep methods

Most linear multistep methods are implemented as adaptive methods, meaning that
order as well as step size are varied along the solution, to suit its local behavior. Such
implementations are referred to as variable order – variable step size methods.
They are very complex codes, since it is already difficult to construct multistep
methods on nonuniform grids. Step size control then proceeds in a manner similar
to that we have already seen in RK methods. Order control, on the other hand, is a
difficult issue, since it will require that error estimates for different orders must be
available for comparison.
To advance one step, an implicit method requires that a nonlinear equation of the
generic form

yn − ∆t (βk/αk) f(tn, yn) = ψ    (10.26)
be solved on each step. The actual iteration can be carried out using fixed-point iter-
ation or Newton iteration. As fixed-point iteration effectively requires ∆t βk L[ f ]/αk <
1 for convergence, the step size ∆t is restricted to ∆t ∼ 1/L[ f ], making fixed-point
iteration useless for stiff problems. In stiff problems, therefore, BDF methods are
used with some Newton-type iteration to overcome step size restrictions.

Using Newton’s method is relatively expensive, as it requires the Jacobian matrix of (10.26),

J(y) = I − (∆t βk/αk) ∂f/∂y.
To reduce the cost of evaluating and factorizing the Jacobian, it is common to use
a modified Newton iteration, only recomputing ∂ f /∂ y when convergence slows
down. For example, in a linear system with constant coefficients, the Jacobian is
constant and needs no re-evaluation at all. In nonlinear systems, re-evaluation will
usually be necessary, but a modified Newton iteration keeps the Jacobian un-
changed over several iterations, and sometimes over several steps. Such strategies
also become part of a well designed adaptive code.
For very large systems it may not be affordable to use Newton iterations, as solving the linear systems by conventional Gaussian elimination may be too time-consuming. In such cases, there are other alternatives, e.g. matrix-free iterations based on conjugate gradient methods or Krylov subspace iteration.
Because multistep solvers are very complex, it is impossible to construct your
own code and expect it to perform well. The existing efficient and reliable codes
for stiff and nonstiff problems are time-tested pieces of software, based on decades
of research and experience. Some well-known codes are VODE and LSODE (non-
stiff and stiff problems), SUNDIALS (nonstiff, stiff and differential-algebraic prob-
lems), DASSL and DASPK (stiff and differential-algebraic) and MEBDF (modified
extended BDF methods). MATLAB offers the solvers ode15s (stiff problems) and
ode113 (nonstiff problems).
Apart from the problem types mentioned here, there are differential-algebraic
equations (DAEs), stochastic differential equations (SDEs), delay differential equa-
tions (DDEs), and combinations of these. They go outside the standard setting of
initial value problems in ODEs, and cannot be dealt with here. Nevertheless, these
areas offer their own theories, methodologies and algorithms, often drawing on the
elementary techniques for ODEs which have been presented above.
Chapter 11
Special Second Order Problems

There are special cases when the standard first-order formulation and its methods
may not be the best choice. Special second order equations have the form

ÿ = f (y); y(0) = y0 , ẏ(0) = ẏ0 .

They are called special second order problems because they have a second order
derivative but no first order derivative. Second order initial value problems are
found in mechanics, and if the first order derivative is missing, the problem has
no damping. Such problems occur for instance in celestial mechanics (describing
planetary motion and orbits), and also in molecular dynamics. Without damping,
the problems usually have some form of oscillatory behavior, often in combination
with some conservation principle.
The simplest case would be an undamped pendulum,

ϕ̈ + (g/L) sin ϕ = 0;    ϕ(0) = ϕ0, ϕ̇(0) = ϕ̇0,    (11.1)
or its linearization, the harmonic oscillator

ϕ̈ = −ω 2 ϕ; ϕ(0) = ϕ0 , ϕ̇(0) = ϕ̇0 . (11.2)

The linearization around ϕ = 0 is obtained by considering small amplitudes and approximating the nonlinear function, sin ϕ ≈ ϕ. Here, obviously, the angular frequency is

ω = √(g/L),
where g is the gravitational acceleration and L is the length of the pendulum. The characteristic equation of (11.2) is λ² = −ω², with solution λ = ±iω. Consequently, the solution is
ϕ(t) = A sin ωt + B cos ωt.


For simplicity, let us assume that the initial conditions have been chosen so that
ϕ(t) = sin ωt. This is a plain harmonic oscillation. It is undamped, and goes on
indefinitely, with constant amplitude. The nonlinear pendulum (11.1) has the same
behavior, except with an amplitude-dependent frequency.

11.1 Standard methods for the harmonic oscillator

In the harmonic oscillator, the amplitude remains unchanged, and no energy is lost.
The question now is whether it is possible to find numerical methods that also pre-
serve an undamped oscillation. This is the central issue in geometric integration,
where we seek to find discretizations with conservation properties similar to those
of the differential equation.
A first approach to this problem is to rewrite it as a system of two first order
equations. For the harmonic oscillator we put y = ϕ = sin ωt and x = ϕ̇/ω = cos ωt.
Then we have the system

d/dt [x; y] = [0 −ω; ω 0] [x; y] = K(ω) · [x; y],    (11.3)

where the skew-symmetric matrix K(ω) has two complex conjugate eigenvalues,
λ1,2 [K(ω)] = ±iω. These are obviously located on the imaginary axis. It is there-
fore immediately clear that the simplest methods like the explicit and implicit Euler
methods will not do. In the explicit method, iω∆t will always be outside the stability
region, leading to numerical instability. In the implicit method, on the other hand,
iω∆t will fall strictly inside the stability region, leading to exponential damping.
Thus neither of these methods has a qualitative behavior agreeing with that of the
differential equation, see Figure 11.1.
Recalling linear stability theory, the Trapezoidal Rule has a stability region
whose boundary coincides with the imaginary axis. Therefore this method should be
“ideal,” in the sense that the eigenvalues of the discretization lie on the unit circle.
The trapezoidal method for a system ż = Az is

zn+1 = (I − ∆tA/2)^{−1} (I + ∆tA/2) zn.    (11.4)

Therefore, if A = iω (corresponding to the linear test equation), we have

zn+1 = ((1 + iω∆t/2)/(1 − iω∆t/2)) zn,

where

|(1 + iω∆t/2)/(1 − iω∆t/2)|² = (1 + ω²∆t²/4)/(1 + ω²∆t²/4) ≡ 1


Fig. 11.1 Euler methods applied to harmonic oscillator. When the explicit Euler method (left
panel) and the implicit Euler method (right panel) are applied to the harmonic oscillator, the for-
mer method is unstable and forms an outward spiral. The implicit method, on the other hand,
spirals inward, due to overstability. Neither method is satisfactory, in spite of both methods being
convergent. In both cases, ω = 1 and N = 5000 steps were taken to integrate the system on [0, 20π].
The unit circle is in both cases indicated by a blue circle

for all ω∆t ∈ R.


The analysis is only slightly more complicated in the vector case, when A =
K(ω). Putting a = ω∆t/2 and applying the trapezoidal method to (11.3) then yields

[xn+1; yn+1] = 1/(1 + a²) · [1 − a², −2a; 2a, 1 − a²] [xn; yn] = T(a) · [xn; yn],    (11.5)

where we are interested in the eigenvalues of T(a). These are

λ1,2[T(a)] = (1 − a² ± 2ai)/(1 + a²),

implying that

|λ1,2[T(a)]|² = |1 − a² ± 2ai|²/(1 + a²)² = (1 − 2a² + a⁴ + 4a²)/(1 + 2a² + a⁴) ≡ 1

for all a ∈ R, and hence for all ω∆t ∈ R. Thus the discretization will also represent
oscillations, and since the eigenvalues of T (a) have unit modulus, there is no damp-

Fig. 11.2 Explicit midpoint and trapezoidal method applied to harmonic oscillator. When the ex-
plicit midpoint method (left panel) and the trapezoidal rule (right panel) are applied to the harmonic
oscillator, both methods show an outstanding performance with excellent long-time behavior. In
both cases, ω = 1 and a mere N = 200 steps were taken to integrate the system on [0, 20π]. The
unit circle is in both cases indicated by a blue circle, and the numerical solutions (red) only de-
viate negligibly from the true solution. In both cases ω∆t = π/10 satisfies the Nyquist–Shannon
sampling criterion

ing, meaning that the norm of the solution remains bounded above and below, for
all n > 1. This is a desirable behavior, as it is similar to the behavior of the harmonic
oscillator. However, in order to resolve the correct frequency of the oscillation, the
time step ∆t has to be chosen short enough to have |ω∆t| < 1, in accordance with
the Nyquist–Shannon sampling theorem.
Looking for an explicit alternative to the implicit trapezoidal method, we have
previously encountered the two-step Explicit Midpoint method. When applied to
a system ż = Az it reads
zn+1 = zn−1 + 2∆t Azn .
Again, let us take A = iω, corresponding to the linear test equation. For this two-step
difference equation, we have the characteristic equation

    q² − 2∆t iω q − 1 = 0.

Since the product of the roots is −1, the method is only stable if the roots are

    q1 = e^{iξ},  q2 = −e^{−iξ}

with sum q1 + q2 = 2i sin ξ = 2∆t iω. Hence

sin ξ = ω∆t.

Obviously, this requires ω∆t ∈ (−1, 1), or |ω∆t| < 1. Equality is excluded because
if ω∆t = ±1 we obtain a double root, q1 = q2 = ±i, leading to instability.
By taking A = K(ω) this result is immediately generalized to the full recursion
for (11.3). The scheme is
     
    [xn+1]   [xn−1]                 [xn]
    [yn+1] = [yn−1] + 2∆t · K(ω) · [yn] ,    (11.6)

and it is evidently stable whenever |λ1,2 [∆t K(ω)]| < 1, or simply |ω∆t| < 1.
As there was no such stability restriction on ∆t for the trapezoidal method, one might think that the explicit midpoint method is at a disadvantage. However, in order to resolve the frequency of the actual oscillation, the step size ∆t has to fulfill a similar requirement according to the Nyquist–Shannon sampling theorem. One can therefore expect similar performance from both methods: they are both second order convergent, and can use similar step sizes. The drawback is that the trapezoidal rule is implicit, and hence more expensive to use. On the other hand, it is more robust in case the system should also include a small damping, in which case the trapezoidal rule still manages well, while the explicit midpoint method immediately becomes unstable.
As the numerical tests in Figures 11.1 and 11.2 demonstrate, there is simply no comparison between conventional methods and methods designed for special second order equations. Even at coarse step sizes, the special methods beat conventional methods hands down, and their accuracy far exceeds standard expectations. The special methods are not perfect: they still have phase errors, even though they nearly conserve total energy over very long times. This demonstrates that complete knowledge of the problem at hand, together with an understanding of the special properties of numerical methods, can have a tremendous impact on the quality of the numerical solution.
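The qualitative difference between the Euler methods and the trapezoidal rule can also be checked directly from their amplification factors for the linear test equation ż = iωz. A minimal sketch (the function name and the sample value of ω∆t are our own choices):

```python
def amplification(method, w_dt):
    """Amplification factor R for the linear test equation z' = i*w*z,
    so that z_{n+1} = R * z_n, with w_dt = w * dt."""
    s = 1j * w_dt
    if method == "explicit_euler":
        return 1 + s
    if method == "implicit_euler":
        return 1 / (1 - s)
    if method == "trapezoidal":
        return (1 + s / 2) / (1 - s / 2)
    raise ValueError(method)

w_dt = 0.3
r_exp = abs(amplification("explicit_euler", w_dt))   # > 1: outward spiral
r_imp = abs(amplification("implicit_euler", w_dt))   # < 1: overstable, spirals inward
r_trap = abs(amplification("trapezoidal", w_dt))     # = 1: amplitude preserved
```

For every real ω∆t the trapezoidal factor has modulus one, which is the computational content of the identity derived above; the Euler methods err on opposite sides of the unit circle.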

11.2 The symplectic Euler method

As there are few other conventional methods that will prove conservative for the
harmonic oscillator, we have to seek other alternatives. One possibility is to combine
the Explicit and Implicit Euler methods in the following way,

xn+1 = xn − ∆t · ωyn
yn+1 = yn + ∆t · ωxn+1 .

Thus the first equation is discretized using the explicit Euler method, and the second
equation by the implicit Euler method. The overall method is nevertheless explicit,
because once xn+1 has been computed from the first equation, it can be substituted
into the second equation to compute yn+1 . This unusual method is known as the
Symplectic Euler method.
In order to analyze the method, we can eliminate y. Thus, from the first equation we obtain

    yn = (xn − xn+1)/(ω∆t).
Substituting into the second equation, eliminating the variable y, we obtain, after
rearranging terms,
    (xn+2 − 2xn+1 + xn)/∆t² = −ω² xn+1.    (11.7)
The difference quotient on the left hand side is a second order approximation to the
derivative ẍ at time tn+1 , making (11.7) a two-step direct discretization of ẍ = −ω 2 x.
This particular discretization will be used in connection with the wave equation in
PDEs, where it is sometimes referred to under a nickname, the “leap-frog method.”
The method is related to the explicit midpoint method, and is sometimes a method of
choice for problems in wave mechanics. We note, in passing, that the classical wave
equation is utt = uxx , i.e., there is no first order time derivative. This is therefore
a “special second order equation” without damping, and it therefore benefits from
special time-stepping methods.
Let us now analyze this method the way we evaluated the other methods. We
note that (11.7) is a linear difference equation. Its characteristic equation is

    q² − (2 − (ω∆t)²) q + 1 = 0.

Here, the product of the roots is +1, and their sum is 2 − (ω∆t)². Therefore, as
long as the method is stable, we have

    q1 = e^{iξ},  q2 = e^{−iξ}

with q1 + q2 = 2 cos ξ = 2 − (ω∆t)². It follows that

    1 − cos ξ = 2 sin²(ξ/2) = ω²∆t²/2,
which requires |ω∆t| < 2. This stability condition is slightly more permissive than the
previous conditions, obtained for the trapezoidal and explicit midpoint methods. In
addition, the symplectic Euler method has both roots on the unit circle, implying
that the solution remains bounded above and below, which is desirable for a method
simulating wave propagation.
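A short experiment confirms the bounded amplitude. For this linear problem one can in fact verify by direct substitution that the symplectic Euler map conserves the shifted quadratic x² + y² − ω∆t·xy exactly, so that x² + y² itself oscillates in a fixed band. The sketch below (variable names are our own) checks this along a long trajectory:

```python
import math

def symplectic_euler(x, y, w, dt, n):
    """Explicit Euler in x, implicit Euler in y; explicit overall."""
    traj = [(x, y)]
    for _ in range(n):
        x = x - dt * w * y
        y = y + dt * w * x          # uses the freshly computed x_{n+1}
        traj.append((x, y))
    return traj

w, dt = 1.0, math.pi / 10           # |w*dt| < 2: inside the stability bound
traj = symplectic_euler(1.0, 0.0, w, dt, 2000)

# Shifted energy, conserved by the map up to roundoff.
inv = [x * x + y * y - w * dt * x * y for x, y in traj]
drift = max(abs(v - inv[0]) for v in inv)
radius = [x * x + y * y for x, y in traj]   # bounded above and below
```

The conserved quantity is a perturbation of the true energy of size O(ω∆t), which is consistent with the visible (but bounded) error seen for this method in Figure 11.3.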

11.3 Hamiltonian systems

A most important class of second order problems are Hamiltonian systems. Such
a system is described by a scalar function, the Hamiltonian H(q, p), where the two
variables usually refer to position (q) and momentum (p). The time evolution of the
system is described by the equations

    ṗ = −Hq
    q̇ = Hp,

where Hq = ∂H/∂q and Hp = ∂H/∂p.


An important special case is when the Hamiltonian is separable. It can then be
written as a sum of two independent functions,

H(q, p) = U(q) + T (p).

This is often the case in mechanics, where U(q) denotes a potential (depending
only on the position q) and T (p) denotes kinetic energy (depending only on the
momentum p). For example, in a harmonic oscillator we could have

    H(q, p) = kq²/2 + p²/(2m),
where the parameters m and k represent mass and spring constant. In such a prob-
lem total energy is conserved, although there is energy transfer from potential to
kinetic energy and vice versa. This means that both p and q evolve in time, while
H(p, q) = C, independent of time. The computational challenge, then, is whether
we can find numerical methods that solve the differential equations while keeping
the total energy invariant.
In a more general setting, U(q) is a potential depending on the position vector
q, and the kinetic energy is pT M −1 p/2, where p is the momentum vector, and M
is the mass matrix. In this case, Uq = ∇qU is the gradient of the potential, and
Tp = M⁻¹p. Since Hq = Uq and Hp = Tp, the system takes the form

ṗ = −Uq (11.8)
q̇ = Tp . (11.9)

For such problems, it is again possible to combine explicit and implicit time-
stepping, because the right hand side of the first equation only depends on q and
the right hand side of the second equation only on p.
The methods we have considered so far are reasonably good contenders, but we
will find that they do not keep H(q, p) constant. Instead, they keep H(q, p) nearly
constant. While q and p cannot be computed exactly, there will be a (possibly grow-
ing) phase error, but the deviation from constant energy will remain small over ex-
tremely long integration times. Thus, if the phase error can be tolerated, all the
methods presented here will produce numerical solutions with a qualitative behav-
ior in good agreement with that of the continuous system.

11.4 The Störmer–Verlet method

For the separable Hamiltonian system (11.8), a simple discretization is the follow-
ing,

    pn+1/2 = pn − (∆t/2) · Uq(qn)
    qn+1 = qn + ∆t · Tp(pn+1/2)
    pn+1 = pn+1/2 − (∆t/2) · Uq(qn+1).
This method, known as the Störmer–Verlet method in connection with Hamilto-
nian problems, splits the integration into separate steps for p and q. The first step
advances p half a step from tn to tn+1/2 . Next, we advance q a full step from tn to
tn+1 , making use of the fact that we can evaluate the right hand side at the midpoint,
as pn+1/2 is available from the previous operation. To complete the integration, we take the second half step in p, from tn+1/2 to tn+1, using the recently updated value qn+1. The method is explicit and of second order. It is easy to use and is highly
successful in molecular dynamics and celestial mechanics.
After the three steps above have been completed, the integration continues by
repeating the procedure. Considering two consecutive steps, we see that the splitting
can then be rewritten as

    pn+1/2 = pn−1/2 − ∆t · Uq(qn)
    qn+1 = qn + ∆t · Tp(pn+1/2).

This means that the variable q advances on a grid where the time points have an
integer index, tn , while the variable p advances on a staggered grid, where the time
points have fractional indices, with tn+1/2 = (tn +tn+1 )/2. Thus the method is a map

    (pn−1/2, qn) ↦ (pn+1/2, qn+1).

Whenever p and q are needed at a single point in time, this is easily computed
according to the original scheme.
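The scheme is only a few lines of code. The sketch below (function names are our own; the harmonic oscillator with m = k = 1 serves as test problem) applies the one-step form and records the energy error along the trajectory:

```python
import math

def stormer_verlet(p, q, Uq, Tp, dt, n):
    """Stormer-Verlet for a separable Hamiltonian H(q, p) = U(q) + T(p)."""
    traj = [(p, q)]
    for _ in range(n):
        p_half = p - 0.5 * dt * Uq(q)       # half step in p
        q = q + dt * Tp(p_half)             # full step in q, midpoint value of p
        p = p_half - 0.5 * dt * Uq(q)       # second half step in p
        traj.append((p, q))
    return traj

# Harmonic oscillator with m = k = 1: H = q^2/2 + p^2/2, exact energy 1/2.
dt = 2 * math.pi / 100                      # 100 steps per period
traj = stormer_verlet(0.0, 1.0, lambda q: q, lambda p: p, dt, 10000)
energy_error = max(abs(0.5 * (p * p + q * q) - 0.5) for p, q in traj)
```

Over one hundred periods the energy error stays bounded at the O((ω∆t)²) level, in agreement with the behavior reported in Figure 11.4.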
As a final test of the Störmer–Verlet method, we integrate the harmonic oscillator over a very long time interval, [0, 200π], corresponding to one hundred full periods of oscillation, see Figure 11.4. We take N = 10⁴ steps, corresponding to 100 steps per period. We compute the error of the Hamiltonian, which in this case means that we check how far the computed quantity x² + y² deviates from the trigonometric identity x² + y² = 1.
The remarkable feature of the Störmer–Verlet method is that there is no growth of this error; in fact, the error shows a strong periodic behavior, and one can integrate over extremely long times with a bounded error, as demonstrated in Figure 11.4.

Fig. 11.3 Symplectic Euler and Störmer–Verlet methods applied to harmonic oscillator. When the symplectic Euler method (left panel) and the Störmer–Verlet method (right panel) are applied to the harmonic oscillator, both methods show a periodic behavior. In both cases, ω = 1 and N = 200 steps were taken to integrate the system on [0, 20π]. The unit circle is in both cases indicated by a blue circle. The symplectic Euler method shows a very visible error. By comparison, the Störmer–Verlet method has a far smaller error.
Even if the energy remains near the invariant, there is still a phase error, i.e., a deviation between (x(t), y(t)) and the exact solution, (cos t, sin t). The phase error initially grows linearly. Eventually, for very large times, it will ruin the global accuracy of the solution, as the numerical solution goes completely out of phase, even though the energy is still correct. This limits the time horizon over which simulations remain accurate. Both the energy error and the global error in the Störmer–Verlet method are of second order. The energy error is O((ω∆t)²), while the global error, as long as it is small, behaves like O(t · (ω∆t)²).
The special methods discussed here are not suitable (indeed often complete fail-
ures) in problems having damping, however small. In connection with conservative
systems and wave phenomena, on the other hand, the methods excel, and have some
unparalleled properties, with a unique ability to conserve energy and/or amplitude.
Even so, the method comparisons demonstrated here show that methods have to be
selected with great care also for special second order equations. In particular, Figure 11.4 shows that even if a method preserves symplectic structure (as the symplectic Euler method does), it might still not have a strong enough performance when compared to a dedicated method, such as the Störmer–Verlet method.

Fig. 11.4 Error in Störmer–Verlet method applied to harmonic oscillator. When the Störmer–Verlet method solves the harmonic oscillator with ω = 1 over a long time interval [0, 200π], the energy error e = x² + y² − 1 is plotted as a function of time t (top panel). Using 100 steps per period, for a total of N = 10⁴ steps, we note that this error does not increase with time, but remains bounded for all t > 0. Thus the solution remains close to the invariant x² + y² = 1 for extremely long times. However, there is also a phase error, which causes the global error to grow essentially linearly (lower panel). The global error is √((x − cos t)² + (y − sin t)²) and measures the distance between the numerical solution and the exact solution.

11.5 Time symmetry, reversibility and adaptivity

Many special second order problems are reversible, or symmetric with respect to
time, in the sense that the problems of integrating in forward or reverse time are
essentially the same. For example, simulating the solar system in reverse time (to
find its state in the past) is not qualitatively different from simulating it in forward
time (predicting its state in the future). In case damping had been present, these
tasks would have been entirely different.
Consider a system ż = F(z). Integrating it in reverse time is equivalent to making
the variable change t ↔ −t, which implies the transformation d z/dt ↔ −d z/dt.
Thus solving ż = F(z) in reverse time is the same as solving ż = −F(z) in forward
time. But this transformation reverses the stability properties of the system. If they
have the same stability properties, then the system must be neutrally stable.
For the Hamiltonian problem (11.8), reversibility is a slightly more technical
issue. With q representing position and p momentum, we obtain the solution in
reverse time by merely making the variable change p ↔ −p. While the position
variables are unchanged, the “velocities” are reversed, and (11.8) takes the form

− ṗ = −Uq (11.10)
q̇ = −Tp . (11.11)

Meanwhile, if we only change the independent variable, t ↔ −t in (11.8), we have

− ṗ = −Uq (11.12)
−q̇ = Tp , (11.13)

which is immediately seen to be the same system as (11.10). Combining the two
changes p ↔ −p and t ↔ −t then produces the original system.
Let us specialize ż = F(z) to a linear system ż = Az with initial condition z0. Advancing the solution ∆t units in time, we have z1 = e^{∆tA} z0. If, on the other hand, the solution at time ∆t is known, then z0 is given by z0 = e^{−∆tA} z1, and

    (e^{∆tA})⁻¹ = e^{−∆tA}.

A symmetric one-step method Φ∆t : z0 ↦ z1 must have the corresponding property,

    (Φ∆t)⁻¹ = Φ−∆t.

While conventional methods such as the Euler methods fail to have this property, we note that e.g. the trapezoidal rule is symmetric, since, by (11.4),

    [(I − (∆t/2)A)⁻¹ (I + (∆t/2)A)]⁻¹ = (I − (−∆t/2)A)⁻¹ (I + (−∆t/2)A).
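The symmetry property is easy to test numerically: a step of size ∆t followed by a step of size −∆t must return the initial value exactly for the trapezoidal rule, while the explicit Euler method misses by O(∆t²). A sketch for A = K(ω), with our own small 2×2 helper routines:

```python
def apply(M, v):
    return (M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1])

def solve2(M, b):
    """Solve the 2x2 system M x = b by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return ((b[0] * M[1][1] - b[1] * M[0][1]) / det,
            (b[1] * M[0][0] - b[0] * M[1][0]) / det)

w = 1.0
A = ((0.0, -w), (w, 0.0))                   # A = K(omega)

def trapezoidal_step(z, dt):
    # (I - dt/2 A) z_new = (I + dt/2 A) z_old
    Mm = ((1 - 0.5 * dt * A[0][0], -0.5 * dt * A[0][1]),
          (-0.5 * dt * A[1][0], 1 - 0.5 * dt * A[1][1]))
    Mp = ((1 + 0.5 * dt * A[0][0], 0.5 * dt * A[0][1]),
          (0.5 * dt * A[1][0], 1 + 0.5 * dt * A[1][1]))
    return solve2(Mm, apply(Mp, z))

def euler_step(z, dt):
    Az = apply(A, z)
    return (z[0] + dt * Az[0], z[1] + dt * Az[1])

z0, dt = (1.0, 0.0), 0.1
back_trap = trapezoidal_step(trapezoidal_step(z0, dt), -dt)   # returns to z0
back_euler = euler_step(euler_step(z0, dt), -dt)              # misses by O(dt^2)
```

For the explicit Euler method the forward-then-backward composition is (I − ∆t²A²), which for A = K(ω) inflates the solution by the factor 1 + (ω∆t)².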

Explicit methods cannot be symmetric unless they are two-step methods (like the explicit midpoint method and the Störmer–Verlet method), and such methods require a more elaborate theory.
Symmetric methods play a key role in geometric integration, since they replicate
the symmetry and reversibility of special second order problems. Usually, explicit
methods are preferred. Implicit methods are more expensive, and require that the
(nonlinear) equations are solved to full precision.
Now, the real difficulty comes in making symmetric methods adaptive. The con-
ventional way of changing the time step is multiplicative, as in

∆tn+1 = θn · ∆tn .

This is not symmetric, since it uses the back history of the solution to compute θn
and decide on the next step size. (It is only symmetric for θn = ±1, which fails to
make the method adaptive.)
It is therefore necessary to reconsider adaptivity. We need to construct a step size
control which is in itself a Hamiltonian problem. This can be obtained by transform-
ing the independent variable t so that d/dt = ρd/dτ, where ρ(τ) is a time rescaling
function. The original system ż = F(z) is transformed into

    z′ = F(z)/ρ,    (11.14)

where prime denotes differentiation with respect to τ. This equation is augmented by the control system

    ρ′ = G(z)    (11.15)
    t′ = 1/ρ,    (11.16)

where (11.15) generates ρ from a chosen control function G(z) and (11.16) recovers
the original time t. The initial values are ρ(0) = 1 and t(0) = 0. By solving the
augmented system numerically, the continuous function ρ(τ) will be represented by
a discrete sequence {ρn+1/2}, where the varying time step ∆t is constructed from

    ∆tn+1/2 = ε/ρn+1/2.    (11.17)

The fixed parameter ε is interpreted as the initial step size.


The target is to keep ρ = Q(z), where Q is a prescribed symmetric function of
the solution z. We need to construct a function G(z) to achieve this end. Defining
h̄(t, ρ) = log[Q(z(t))/ρ], we construct a (separable) Hamiltonian system

    t′ = −h̄ρ
    ρ′ = h̄t.

By construction, h̄(t, ρ) = log[Q(z(t))/ρ] is a first integral and will remain constant.


Now, noting that h̄ρ = −1/ρ and

    h̄t = ∇zQ(z)ᵀF(z)/Q(z),

we make the control system (11.15–11.16) Hamiltonian by taking

    G(z) = d(log Q)/dt = ∇zQ(z)ᵀF(z)/Q(z).

As a result, ρ will track the quantity Q(z) along the solution z(t).
In numerical computations, we use the explicit Störmer–Verlet method to solve
(11.15), resulting in the discrete control law

    ρn+1/2 = ρn−1/2 + ε ∇zQ(zn)ᵀF(zn)/Q(zn).    (11.18)



This generates a sequence {ρn+1/2} such that Q(zn)/ρn remains nearly constant, and from which the step size is computed using (11.17). For separable Hamiltonian problems, the adaptive Störmer–Verlet method then becomes

    ρn+1/2 = ρn−1/2 + ε ∇zQ(zn)ᵀF(zn)/Q(zn)
    ∆t = ε/ρn+1/2
    pn+1/2 = pn − (∆t/2) · Uq(qn)
    qn+1 = qn + ∆t · Tp(pn+1/2)
    pn+1 = pn+1/2 − (∆t/2) · Uq(qn+1)
    tn+1 = tn + ∆t.

Thus the method is made adaptive by simply adding the recursion for ρ.
There are many ways of choosing the tracking function Q(z). It is important to note that the adaptive Störmer–Verlet method above uses a constant pseudo-time step ε, which is converted to the real time step through ∆t = ε/ρ. The latter varies along the solution, since ρ tracks Q(z). However, this is not a controller in the usual sense, since Q(z) is not an error estimate. Another important aspect is that the tracking control used here is not of the multiplicative type found in conventional error control: it is additive, and it controls the inverse of ∆t, hence it is a nonlinear control law.

Example We use the classical Kepler problem as an illustration. This describes the motion
of a special two-body problem, which models the trajectory of a light body in a highly
eccentric orbit around a heavy body. This could represent a comet orbiting the Sun, or
the “free” return trajectory of a manned space capsule around the back of the Moon. The
structure of the problem is given by its Hamiltonian (with normalized constants)

    H(p, q) = pᵀp/2 − 1/√(qᵀq).

We take the initial conditions

    q(0) = (1 − e, 0)ᵀ
    p(0) = (0, √((1 + e)/(1 − e)))ᵀ,

where e is the eccentricity of the orbit. At eccentricity e = 0 the orbit is circular with
constant velocity, with no need to vary the step size. The higher the value of the eccentricity
e, the more dramatic is the change of velocity near the heavy mass. The solution is highly
sensitive to perturbations near the heavy mass. Thus we need a short step size there, but far
away accuracy is less crucial and the step size can be large.
Figure 11.5 shows the numerical solution of the Kepler problem with an eccentricity of e =
0.8. The Störmer–Verlet method is used as described, first with constant steps, and then in adaptive mode, with variable steps constructed by letting ρ track the quantity Q = 1/√(qᵀq). This means that ρ is small (and ∆t is large) when ‖q‖ is large (away from the central mass), and vice versa. In the adaptive computation this is achieved by taking G = −pᵀq/qᵀq.

Fig. 11.5 Thirty orbits of the Kepler problem. When the Störmer–Verlet method integrates the Kepler problem with eccentricity 0.8 using N = 10⁴ constant steps (left), numerical precession (phase error) is significant. When reversible adaptivity is used with N = 10⁴ variable steps (right), numerical precession is strongly suppressed, and the energy error is reduced by a factor of 30.

Both the constant step size method and the adaptive method (nearly) conserve energy, but
adaptivity significantly improves both accuracy and efficiency. In fact, the computational
cost of varying the step size is negligible. For the same total work (the same number of
steps), the global error is much reduced while energy is still nearly conserved.
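The adaptive scheme is easy to reproduce. The sketch below is our own pure-Python version (with ρ initialized at Q(z0) rather than at 1, a convenient normalization); it applies the additive control (11.18) with G = −pᵀq/qᵀq to the Kepler problem and monitors the energy H = pᵀp/2 − 1/√(qᵀq), which equals −1/2 for the stated initial data:

```python
import math

def adaptive_verlet_kepler(e, eps, nsteps):
    """Adaptive Stormer-Verlet for the Kepler problem, with rho tracking
    Q(z) = 1/|q| via the additive control law (11.18)."""
    q = [1.0 - e, 0.0]
    p = [0.0, math.sqrt((1.0 + e) / (1.0 - e))]
    rho = 1.0 / math.hypot(q[0], q[1])       # start rho on the target Q(z0)
    energies = []
    for _ in range(nsteps):
        # G(z) = d(log Q)/dt = -p.q / q.q for Q = 1/|q|
        rho += eps * (-(p[0] * q[0] + p[1] * q[1]) / (q[0] ** 2 + q[1] ** 2))
        dt = eps / rho                        # short steps near the central mass
        r3 = math.hypot(q[0], q[1]) ** 3      # Uq = q/|q|^3 for U = -1/|q|
        p = [p[i] - 0.5 * dt * q[i] / r3 for i in range(2)]
        q = [q[i] + dt * p[i] for i in range(2)]   # Tp = p
        r3 = math.hypot(q[0], q[1]) ** 3
        p = [p[i] - 0.5 * dt * q[i] / r3 for i in range(2)]
        energies.append(0.5 * (p[0] ** 2 + p[1] ** 2)
                        - 1.0 / math.hypot(q[0], q[1]))
    return energies

H = adaptive_verlet_kepler(0.8, 0.02, 2000)   # H(0) = -1/2 for these data
```

With ε = 0.02, roughly 2π/ε ≈ 314 steps are spent per orbit, short near perihelion and long near aphelion, while the energy stays close to −1/2 throughout.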
Part III
Boundary Value Problems
Chapter 12
Two-point Boundary Value Problems

Boundary value problems (BVPs) occur in materials science, structural analysis, physics and similar applications. They are often connected to time dependent partial
differential equations, where the BVP represents a stationary solution. The simplest
BVP has one independent variable, x, usually interpreted as “space,” and one depen-
dent variable, u, which is then a scalar function of x. The standard case is to consider
a second order differential equation on a compact interval, say x ∈ [0, 1], with one
boundary condition at each endpoint of the interval.

12.1 The Poisson equation in 1D. Boundary conditions

We begin by considering the simplest two-point boundary value problem (2pBVP),

    −u″ = f (x);  u(0) = uL, u(1) = uR.    (12.1)

Here f is a given data function on Ω = [0, 1], and the specified type of boundary
conditions, defining the value of the solution on the boundary ∂ Ω , are referred to as
Dirichlet conditions.
There are also other types of boundary conditions. Neumann conditions refer to

    u(0) = uL,  u′(1) = u′R,    (12.2)

which combine prescribing the value of u at one endpoint, and the derivative u′ at
the other endpoint.
Robin conditions refer to a linear combination of the value of u and its derivative
at one of the endpoints, as in

    u(0) = uL,  αu(1) + βu′(1) = γ,    (12.3)

where the numbers α, β and γ are given, in addition to uL .


In general, the terms Dirichlet condition, Neumann condition and Robin condition (in the singular) are commonly used for any single condition having the structure
indicated above. Such a condition can be prescribed at either one of the boundary
points, but a Dirichlet condition is needed at the other endpoint.
The problem (12.1) may look over-simplified. It is linear, and solving the prob-
lem is (technically) only a matter of integrating the data function f twice, deter-
mining the constants of integration from the boundary conditions. However, there is
more to this problem; it can be viewed as the Poisson equation in 1D. The Poisson
equation occurs frequently in applied mathematics, and is the foremost example of
an elliptic problem. Thus we shall see that (12.1) is well worth the study. Some
important insights gained from this problem will carry over to the 2D case, where
the Poisson equation is a partial differential equation.
There are basically two different approaches to solve this problem – the finite
difference method (FDM), and the finite element method (FEM). We shall study
both, starting with the FDM, since it can be understood intuitively using elementary
methodology. The FEM is theoretically much more advanced, but in the 1D case it,
too, becomes easily accessible.
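As a preview of the FDM approach, the standard second order difference approximation turns (12.1) into a tridiagonal linear system, solvable in O(N) operations. A sketch (our own code; the manufactured solution u = sin πx, with f = π² sin πx and homogeneous Dirichlet conditions, verifies the second order accuracy):

```python
import math

def poisson_fdm(f, uL, uR, n):
    """Solve -u'' = f on [0, 1], u(0) = uL, u(1) = uR, with central
    differences on n interior points and the Thomas algorithm."""
    h = 1.0 / (n + 1)
    x = [(i + 1) * h for i in range(n)]
    b = [2.0] * n                            # diagonal of tridiag(-1, 2, -1)
    d = [h * h * f(xi) for xi in x]          # right hand side, scaled by h^2
    d[0] += uL                               # boundary values move to the rhs
    d[-1] += uR
    for i in range(1, n):                    # forward elimination (a = c = -1)
        w = 1.0 / b[i - 1]
        b[i] = 2.0 - w
        d[i] = d[i] + w * d[i - 1]
    u = [0.0] * n
    u[-1] = d[-1] / b[-1]                    # back substitution
    for i in range(n - 2, -1, -1):
        u[i] = (d[i] + u[i + 1]) / b[i]
    return x, u

x, u = poisson_fdm(lambda s: math.pi ** 2 * math.sin(math.pi * s), 0.0, 0.0, 99)
err = max(abs(ui - math.sin(math.pi * xi)) for xi, ui in zip(x, u))
```

With h = 0.01 the maximum error is of size O(h²), roughly 10⁻⁴ for this test problem.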

12.2 Existence and uniqueness

Just like with initial value problems, we need to investigate existence and unique-
ness before we attempt to solve the boundary value problem numerically. In BVPs,
existence and uniqueness are more intricate issues than in IVPs. However, in (12.1),
these questions are simple. The problem is linear and consists of a homogeneous so-
lution and a particular solution. Whether a particular solution exists is only a matter of whether f is twice integrable, and the homogeneous solution is simply a straight line, uH(x) = C0 + C1x. The constants are determined by the boundary conditions.
Recalling that the 1D Poisson equation is a special example of a 2pBVP, one
needs to discuss existence and uniqueness for more general problems. The question
is then more difficult. Let us consider 2pBVPs of the form

    −u″ + au = f (x);  u(0) = uL, u(1) = uR.    (12.4)

(Further terms could also be added.) Because the homogeneous solutions now have
a different character, a more advanced theory is needed. The basic result is:

Theorem 12.1. (Fredholm alternative) Let f ∈ L²[0, 1]. The two-point BVP (12.4)
either has a unique solution, or the homogeneous problem

    −u″ + au = 0;  u(0) = 0, u(1) = 0    (12.5)

has nontrivial solutions.



This needs an explanation. The Fredholm alternative allows us to determine whether there is a unique solution by merely considering the homogeneous problem. Note that the homogeneous problem does not only have a zero right-hand side,
but homogeneous boundary conditions as well. Naturally, the trivial solution to the
homogeneous problem (12.5) is the zero solution, u(x) ≡ 0. Can there be nonzero
solutions?
The answer is yes. For a < 0, write a = −ω² with ω > 0. The solutions to the homogeneous equation −u″ + au = 0 are then

    u(x) = C0 cos ωx + C1 sin ωx.

(For a ≥ 0 the homogeneous solutions are exponentials or straight lines, and only the trivial solution satisfies the boundary conditions.) The left boundary condition u(0) = 0 implies C0 = 0, while the right boundary condition u(1) = 0 implies

    0 = C1 sin ω.

The obvious solution is C1 = 0, leading to the trivial solution u(x) ≡ 0. But there is another possibility: we could have sin ω = 0, which holds whenever

    a = −ω² = −k²π²,

for some positive integer k. Thus it could happen that the homogeneous problem has nontrivial solutions. Suppose that uP(x) is a particular solution to (12.4), and that a = −k²π². Then

    u(x) = uP(x) + C1 sin kπx

is also a solution, for every C1. Therefore, the solution is not unique. However, if a ≠ −k²π² for every positive integer k, then C1 = 0 and the solution is unique, all in accordance with the Fredholm alternative for (12.4).
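The alternative can be observed numerically. For the standard difference approximation of −d²/dx² with Dirichlet conditions, tridiag(−1, 2, −1)/h², the eigenvalues are known in closed form, λk = 4 sin²(kπh/2)/h² ≈ k²π². The discrete operator for −u″ + au is shifted by a, so it becomes nearly singular precisely when −a is close to some k²π². A sketch (our own code) comparing a resonant and a non-resonant shift:

```python
import math

def min_shifted_eigenvalue(a, n=1000):
    """Smallest |lambda_k + a| over the eigenvalues lambda_k of the
    second difference approximation of -d^2/dx^2 on n interior points."""
    h = 1.0 / (n + 1)
    lam = (4.0 * math.sin(k * math.pi * h / 2.0) ** 2 / h ** 2
           for k in range(1, n + 1))
    return min(abs(l + a) for l in lam)

near_singular = min_shifted_eigenvalue(-math.pi ** 2)  # resonant case
invertible = min_shifted_eigenvalue(1.0)               # safely elliptic case
```

The resonant shift leaves an eigenvalue within O(h²) of zero, while a generic shift keeps the spectrum well away from the origin, so the discrete problem remains uniquely solvable.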
The Fredholm alternative applies to more general equations, usually formulated as L u = f, where L is a linear differential operator. A sufficient condition for establishing unique solutions is to demonstrate that the operator L is elliptic. This means that, for all functions u satisfying u(0) = u(1) = 0, the operator must satisfy

    ⟨u, L u⟩ > 0,    (12.6)

where the inner product is defined by

    ⟨u, v⟩ = ∫₀¹ uv dx.    (12.7)

12.3 Notation: Spaces and norms

In order to develop a theory and a methodology for 2pBVPs, notation is important, together with the choice of norms. We let the computational domain be denoted by Ω, with boundary ∂Ω. In most cases, we will take Ω = [0, 1]. In this standard setting, ∂Ω only consists of two points, x = 0 and x = 1, where the boundary
conditions are imposed.
The space of square integrable functions on Ω is denoted by L²(Ω), or, if the computational domain is made specific, by L²[0, 1]. We write f ∈ L²(Ω) for any function defined on Ω such that

    ‖f‖²_{L²(Ω)} = ∫_Ω |f(x)|² dx < ∞.

The norm on L²(Ω) is associated with the inner product ⟨·, ·⟩. For u, v ∈ L²(Ω) it is defined by

    ⟨u, v⟩ = ∫_Ω uv dx,

and ‖u‖²_{L²(Ω)} = ⟨u, u⟩. In most cases, we will use the simplified notation ‖f‖₂ for the L²-norm of a function.
The space Cᵖ(Ω) consists of all p times continuously differentiable functions. Usually we need to be more specific. Thus we say that u ∈ H¹(Ω) if u′ ∈ L²(Ω). Because we also need to impose boundary conditions on the function u, we introduce the function space

    H₀¹(Ω) = {u : u′ ∈ L²(Ω) and u = 0 on ∂Ω}.

In the standard setting with Ω = [0, 1], this corresponds to differentiable functions with ‖u′‖²_{L²[0,1]} < ∞, satisfying homogeneous boundary conditions, u(0) = u(1) = 0. An example of a function in this space is u(x) = sin πx.
If f ∈ L²[0, 1] and we seek a solution to the 1D Poisson equation, −u″ = f with u(0) = u(1) = 0, we are looking for a function u ∈ H₀¹[0, 1] ∩ H²[0, 1]. The space H¹(Ω) plays an important role. It is referred to as a Sobolev space, and is usually equipped with its own norm, defined by

    ‖u‖²_{H¹(Ω)} = ‖u‖²_{L²(Ω)} + ‖u′‖²_{L²(Ω)}.

For our purposes, however, it will be sufficient to consider the standard L²(Ω) norm of functions, even though we will have to require that solutions are u ∈ H₀¹[0, 1] ∩ H²[0, 1].

12.4 Integration by parts and ellipticity

To show ellipticity, we need more advanced tools. Let us consider the operator

    L = −d²/dx²,
and the 1D Poisson equation L u = f with homogeneous boundary conditions, u(0) = u(1) = 0. We shall prove that, for all sufficiently differentiable functions u satisfying the boundary conditions, it holds that

    0 < m₂[L] · ‖u‖₂² ≤ ⟨u, L u⟩,

where m₂[L] is the lower logarithmic norm of L. Thus, even differential operators have logarithmic norms, and m₂[L] > 0 implies that L is elliptic. By the uniform monotonicity theorem,

    m₂[L] > 0  ⇒  ‖L⁻¹‖₂ ≤ 1/m₂[L].

Thus an elliptic operator has a bounded, continuous inverse, and for the problem L u = f it follows that

    ‖u‖₂ = ‖L⁻¹ f‖₂ ≤ ‖f‖₂/m₂[L].

As a result, ‖f‖₂ → 0 implies that ‖u‖₂ → 0, i.e., the solution is unique and depends continuously on the data f. This is traditionally expressed by saying that the problem is well posed. We also note that the bound breaks down if m₂[L] → 0+, which corresponds to loss of ellipticity. This explains why ellipticity is a key property in boundary value problems.
The discussion above is still sketchy, and we need to qualify and quantify the results. More specifically, we shall show that L = −d²/dx² is elliptic for all u ∈ H₀¹[0, 1] ∩ H²[0, 1], with inner product defined by (12.7). As mentioned above, this is a Sobolev space of differentiable functions with u′ ∈ L²[0, 1], satisfying the boundary conditions u(0) = u(1) = 0.
Consider u, v ∈ H₀¹[0, 1] ∩ H¹[0, 1]. By the Leibniz rule, (uv)′ = u′v + uv′. Integration yields

    ∫₀¹ uv′ dx = [uv]₀¹ − ∫₀¹ u′v dx.

Due to the boundary conditions, we have [uv]₀¹ = 0. Consequently,

    ∫₀¹ uv′ dx = −∫₀¹ u′v dx.

In terms of the inner product (12.7), integration by parts can then be written

    ⟨u, v′⟩ = −⟨u′, v⟩.

Integration by parts is a key technique in the analysis of boundary value problems and in finite element analysis. To illustrate this, we consider L = −d²/dx² for functions u ∈ H₀¹[0, 1] ∩ H²[0, 1]. Integration by parts then yields

    −⟨u, u″⟩ = ⟨u′, u′⟩ = ‖u′‖₂² > 0,    (12.8)

whenever u is a nonzero function. Note that ‖u′‖₂ = 0 is equivalent to u′ = 0; then u must be constant, and in fact u(x) ≡ 0, due to the homogeneous boundary conditions satisfied by all functions u ∈ H₀¹[0, 1] ∩ H²[0, 1]. Therefore the expressions in (12.8) are strictly positive whenever u ≠ 0. This shows that L = −d²/dx² is elliptic.
However, we can say more. In (12.8), we have not yet determined m₂[−d²/dx²]. This requires that we find a lower bound of ‖u′‖₂² for functions u ∈ H₀¹[0, 1] ∩ H²[0, 1]. To be precise, we have to find m₂[−d²/dx²] such that

    ‖u′‖₂² ≥ m₂[−d²/dx²] · ‖u‖₂².

We can determine the largest constant m₂[−d²/dx²] for which this inequality holds.
Since u(0) = u(1) = 0, any function u ∈ H₀¹[0, 1] can be written as a Fourier series

    u = √2 ∑_{k=1}^∞ ck sin kπx,

and by Parseval's theorem, we have ‖u‖₂² = ∑_{k=1}^∞ ck². Likewise,

    u′ = √2 ∑_{k=1}^∞ ck kπ cos kπx,

and it follows that

    ‖u′‖₂² = ∑_{k=1}^∞ k²π² ck² = π² ∑_{k=1}^∞ k² ck² ≥ π² ‖u‖₂².

This inequality is sharp, since equality holds for the function u(x) = √2 sin πx, for which ‖u‖₂² = 1 and ‖u′‖₂² = π². Thus we have obtained the following result:

Theorem 12.2. Let u ∈ H = H₀¹[0,1]. Then the Poincaré inequality

    ‖u′‖₂ ≥ π ‖u‖₂    (12.9)

holds, and for the operator −d²/dx² on H = H₀¹[0,1] ∩ H²[0,1], it holds that

    m₂[−d²/dx²] = π².    (12.10)

Thus we have shown that L is an elliptic operator. The Poincaré inequality is
the simplest case of the Sobolev inequalities, which offer lower bounds on derivatives
in terms of the norm of u. Note that there is no corresponding upper bound, since d²/dx² is an
unbounded operator; no matter how small (but nonzero) the function u is, u′ can
be arbitrarily large.
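The Poincaré constant can also be observed numerically. The following Python sketch (not part of the book's own development; it borrows the standard second order finite difference matrix introduced in Chapter 13, and N = 200 is an arbitrary choice) computes the smallest eigenvalue of the discretized operator, which approximates m₂[−d²/dx²] = π²:

```python
import numpy as np

# Discretize -d^2/dx^2 on [0,1] with homogeneous Dirichlet conditions,
# using the standard second order difference matrix (see Chapter 13).
N = 200                                  # number of interior grid points
dx = 1.0 / (N + 1)
T = 2*np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
L = T / dx**2

# The smallest eigenvalue approximates m2[-d^2/dx^2] = pi^2.
lam_min = np.linalg.eigvalsh(L)[0]
print(lam_min, np.pi**2)                 # the two values agree closely
```

The agreement improves as N grows, at the second order rate discussed in Chapter 13.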

12.5 Self-adjoint operators

Let u, v ∈ H₀¹[0,1] ∩ H²[0,1] and let L be a linear operator. Its adjoint operator
L* is defined by the relation

    ⟨v, L u⟩ = ⟨L*v, u⟩.

This is an operator analogue of taking the transpose of a matrix in linear algebra.
Thus, if A ∈ R^{d×d} is a matrix, and u, v ∈ R^d are two vectors, with inner product
⟨v, u⟩ = v*u, where v* denotes the transposed vector, then

    ⟨v, Au⟩ = v*Au = (A*v)*u = ⟨A*v, u⟩,

where A* denotes the transpose of A. Hence the transpose A* is the adjoint of A.


However, a linear differential operator does not have a transpose, even though
an adjoint operator exists. The adjoint plays a role similar to that of the transpose
in algebra. When we solve (12.1) numerically by a finite difference method, our
discretization of this linear problem results in a linear algebraic system,

    L_N u = f,    (12.11)

where we want the N × N matrix to have properties reflecting the properties of L.
We therefore need to find out what properties L has. Let us begin by considering
L = −d²/dx² on u, v ∈ H₀¹[0,1] ∩ H²[0,1]. We then have, integrating by parts
twice,

    ⟨v, L u⟩ = −⟨v, u″⟩ = ⟨v′, u′⟩ = −⟨v″, u⟩ = ⟨L*v, u⟩.

Thus L u = −u″ and L*v = −v″, for all u, v. Therefore L* = L.

Definition 12.1. A linear operator satisfying L* = L is called self-adjoint.

A self-adjoint operator is a counterpart to the notion of a symmetric matrix.
This is a property we want to retain when we discretize our BVP. Thus, in the 1D
Poisson equation with homogeneous boundary conditions, −u″ = f, we obtain the
linear system (12.11), where we wish to have L_N* = L_N, i.e., we want L_N to be a
symmetric matrix, since for the differential operator

    L = −d²/dx²

it holds that L* = L.
It is important to note that not all operators are self-adjoint. For example, if we
consider L = d/dx on u, v ∈ H₀¹[0,1] ∩ H²[0,1], we integrate by parts to obtain

    ⟨v, L u⟩ = ⟨v, u′⟩ = −⟨v′, u⟩ = ⟨L*v, u⟩.

Thus L u = u′ and L*v = −v′. It follows that L* = −L.



Definition 12.2. A linear operator satisfying L* = −L is called anti-selfadjoint.

This is a counterpart to skew-symmetric matrices, which are defined by the property
A* = −A. The symmetry and anti-symmetry properties have important consequences.
We have already computed the lower logarithmic norm m₂[−d²/dx²], and
we now wish to do the same for d/dx. To this end, we note that

    m₂[d/dx] · ‖u‖₂² ≤ ⟨u, u′⟩ ≤ M₂[d/dx] · ‖u‖₂².    (12.12)

Now, integrating by parts, we get ⟨u, u′⟩ = −⟨u′, u⟩, since d/dx is anti-selfadjoint
on H₀¹[0,1] ∩ H²[0,1]. However, recalling that

    ⟨v, u⟩ = ∫₀¹ vu dx = ∫₀¹ uv dx = ⟨u, v⟩,

we have ⟨u, u′⟩ = −⟨u′, u⟩ = −⟨u, u′⟩. It follows that ⟨u, u′⟩ = 0 for all u ∈ H₀¹[0,1] ∩
H²[0,1]. Thus every differentiable function u satisfying u(0) = u(1) = 0 is orthogonal
to its derivative u′. As a consequence, putting ⟨u, u′⟩ = 0 in (12.12), we find the
logarithmic norms

    m₂[d/dx] = M₂[d/dx] = 0.    (12.13)
With this information, we can consider more interesting operators. For example,
if we consider the boundary value problem L u = f with homogeneous Dirichlet
boundary conditions u(0) = u(1) = 0 and

    L u = −u″ + au′ + bu,

we find, integrating by parts,

    ⟨u, L u⟩ = ⟨u, −u″ + au′ + bu⟩
             = ⟨u, −u″⟩ + a⟨u, u′⟩ + b⟨u, u⟩
             ≥ (π² + b) · ‖u‖₂².

Thus

    m₂[L] = m₂[−d²/dx² + a·d/dx + b] = π² + b.

Therefore L is elliptic on H₀¹[0,1] ∩ H²[0,1] as long as b > −π², independent of a.
It follows that the problem

    −u″ + au′ + bu = f,    u(0) = u(1) = 0,

is well posed for b > −π² and for all a ∈ R. The solution is unique, and depends
continuously on the data, as

    ‖u‖₂ ≤ ‖f‖₂ / (π² + b).
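This continuous dependence can be illustrated numerically. The sketch below (borrowing the finite difference discretizations of Chapter 13; the data f(x) = eˣ and the coefficients a = 5, b = −4 are arbitrary choices for illustration) solves the problem and checks the bound ‖u‖₂ ≤ ‖f‖₂/(π² + b):

```python
import numpy as np

# Solve -u'' + a u' + b u = f, u(0) = u(1) = 0, by finite differences,
# and check the well-posedness bound ||u||_2 <= ||f||_2 / (pi^2 + b).
N = 200
dx = 1.0 / (N + 1)
I = np.eye(N)
D2 = (2*I - np.eye(N, k=1) - np.eye(N, k=-1)) / dx**2   # approximates -d^2/dx^2
D1 = (np.eye(N, k=1) - np.eye(N, k=-1)) / (2*dx)        # approximates d/dx

a, b = 5.0, -4.0                     # b > -pi^2; a may be arbitrary
x = np.linspace(dx, 1.0 - dx, N)
f = np.exp(x)                        # an arbitrary data function
u = np.linalg.solve(D2 + a*D1 + b*I, f)

rms = lambda v: np.sqrt(np.sum(v**2) / (N + 1))   # discrete L2 norm
print(rms(u), rms(f) / (np.pi**2 + b))            # first value is smaller
```

The computed solution norm stays well below the theoretical bound; the bound is attained only in the limit where the data excites the lowest mode alone.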

12.6 Sturm–Liouville eigenvalue problems

Eigenvalue problems play an important role in applied mathematics, with appli-


cations in structural analysis, quantum mechanics, wave problems and resonance
analysis. Besides solving the plain 2pBVPs of the previous sections, we shall
also develop methods for solving eigenvalue problems.
An eigenvalue problem for a differential operator is a BVP of the form

    L u = λu,    (12.14)

where λ is a scalar eigenvalue, and u is the corresponding eigenfunction. The
boundary conditions are homogeneous. Thus, in the 2pBVP case, the boundary conditions
are either homogeneous Dirichlet conditions u(0) = u(1) = 0, or mixed
homogeneous Dirichlet–Neumann conditions, corresponding to u(0) = 0 and u′(1) = 0, or u′(0) = 0
and u(1) = 0.
The reason for the homogeneous boundary conditions is that the entire eigen-
value problem is homogeneous. This means that if u(x) is an eigenfunction, then so
is αu(x) for any scalar α 6= 0, since αu also satisfies (12.14) as well as the chosen
boundary conditions. As a consequence, the eigenfunctions are only determined up
to a constant, even though the eigenvalues are unique. An example is the vibration
modes of a string; the frequency (corresponding to the eigenvalue λ ) is well de-
fined, but the amplitude (corresponding to α) is not; one can have a large or small
amplitude oscillation.

Example. The simplest eigenvalue problem is

    −u″ = λu,    u(0) = u(1) = 0.

Its general solution is A sin √λ x + B cos √λ x. Imposing the Dirichlet condition u(0) = 0
implies B = 0. Imposing the second Dirichlet condition u(1) = 0 implies

    A sin √λ = 0.

Here we cannot infer that the amplitude A is zero, since the solution is then identically zero.
Rather, we must have sin √λ = 0. Hence the eigenvalues and eigenfunctions are

    λ_k = k²π²,    k ∈ Z⁺    (12.15)
    u_k(x) = sin kπx.    (12.16)

We note that there is an infinite sequence of eigenvalues and corresponding eigenfunctions.
The amplitude A of the eigenfunction is undetermined, but the eigenfunctions form an orthonormal
basis

    {û_k}_{k=1}^∞

with û_k(x) = √2 sin kπx. The amplitude √2 is chosen so that

    ⟨û_i, û_j⟩ = δ_ij,

where δ_ij is the Kronecker delta. This orthonormal basis is, naturally, the standard basis for
Fourier analysis on H₀¹[0,1] ∩ H²[0,1].
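Orthonormality is easy to confirm numerically. The following Python sketch (the grid resolution and the five modes checked are arbitrary choices; the helper name u_hat is ours) approximates the Gram matrix ⟨û_i, û_j⟩ by a trapezoidal sum:

```python
import numpy as np

# Verify numerically that u_hat_k(x) = sqrt(2) sin(k*pi*x) satisfy
# <u_hat_i, u_hat_j> = delta_ij on [0, 1].
x = np.linspace(0.0, 1.0, 2001)
dx = x[1] - x[0]

def u_hat(k):
    return np.sqrt(2.0) * np.sin(k * np.pi * x)

# Gram matrix of the first five modes; the integrands vanish at the
# endpoints, so a plain sum * dx equals the trapezoidal rule here.
G = np.array([[np.sum(u_hat(i) * u_hat(j)) * dx for j in range(1, 6)]
              for i in range(1, 6)])
print(np.round(G, 6))    # close to the 5x5 identity matrix
```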

Thus we have determined the basic properties of the operator −d²/dx². It is self-adjoint,
and it is elliptic with lower logarithmic norm m₂[−d²/dx²] = π².

The eigenvalue problem above is the simplest example of a Sturm–Liouville
eigenvalue problem. These are eigenvalue problems for general selfadjoint operators,
i.e., operators satisfying L* = L. The general form is

    −(p(x)u′)′ + q(x)u = λu,    (12.17)

together with homogeneous boundary conditions. In this equation, the scalar coefficients
must satisfy p(x) > 0 and q(x) ≥ 0. The simple worked example above has
p(x) ≡ 1 and q(x) ≡ 0.
Let us now show that L, defined by

    L u = −(pu′)′ + qu,

is selfadjoint. To this end, we consider ⟨v, L u⟩ and integrate by parts,

    ⟨v, −(pu′)′ + qu⟩ = ⟨v, −(pu′)′⟩ + ⟨v, qu⟩
                      = ⟨v′, pu′⟩ + ⟨qv, u⟩
                      = ⟨pv′, u′⟩ + ⟨qv, u⟩
                      = ⟨−(pv′)′, u⟩ + ⟨qv, u⟩
                      = ⟨−(pv′)′ + qv, u⟩ = ⟨L v, u⟩.

By definition, it holds that ⟨v, L u⟩ = ⟨L*v, u⟩. Consequently, L* = L, showing
that L is selfadjoint.
We note that there is no first derivative appearing in the Sturm–Liouville operator
(12.17). The reason is, naturally, that d/dx is anti-selfadjoint, and would change
the properties of the operator L . In spite of this, one can also consider eigenvalue
problems involving d/dx, although there will then be a loss of structure.
Self-adjoint operators have several important properties. Among the most impor-
tant are the following.

Theorem 12.3. Self-adjoint operators have real eigenvalues and orthogonal eigen-
functions.

Proof. Let L be self-adjoint, and let L u = λ₁u and L v = λ₂v. We first prove that
λ_k is real. As L* = L, we have

    ⟨u, L u⟩ = ⟨u, λ₁u⟩ = λ₁‖u‖₂²
             = ⟨L*u, u⟩ = ⟨L u, u⟩ = ⟨λ₁u, u⟩ = λ₁*‖u‖₂²,

and it follows that λ₁* = λ₁, so λ₁ is real. The same applies to all other eigenvalues.
Now consider the two eigenpairs λ₁, u and λ₂, v, with λ₁ ≠ λ₂. Then

    ⟨v, L u⟩ = ⟨L v, u⟩
    ⟨v, λ₁u⟩ = ⟨λ₂v, u⟩
    λ₁⟨v, u⟩ = λ₂⟨v, u⟩.

Hence (λ₁ − λ₂)⟨v, u⟩ = 0. Since λ₁ ≠ λ₂, it follows that ⟨v, u⟩ = 0, proving orthogonality.

It is worthwhile noting that for an anti-selfadjoint operator (L* = −L) there is
a similar result: the eigenvalues are then purely imaginary, but the eigenfunctions
remain orthogonal. This is in full agreement with the properties of skew-symmetric
matrices. These have imaginary eigenvalues and orthogonal eigenvectors.
In fact, there is a more general result. An operator is normal if L*L = L L*.
(This condition obviously holds if L* = ±L.) Such operators all have orthogonal
eigenfunctions (hence the name “normal”), but the eigenvalues may now be complex.
These results hold both in function spaces and in coordinate vector spaces.
In summary, a self-adjoint operator has real eigenvalues and orthogonal eigen-
functions. Since the regular Sturm–Liouville problem is self-adjoint, all such prob-
lems have this property. Above, in the worked example, we have seen that this ap-
plies to the operator −d2 /dx2 with homogeneous Dirichlet boundary conditions, but
a similar behavior holds for all Sturm–Liouville operators.
When constructing numerical methods for such problems, it is important that
the discretization preserves these properties. All properties of normal operators
are analogous in the discrete setting. Thus, if L is normal, the numerical method
should replicate these properties, in order to recover the proper behavior of the orig-
inal problem.
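For symmetric matrices these properties can be checked directly. A small Python sketch (using the difference matrix of Chapter 13 as a discrete stand-in for −d²/dx²; N = 30 is arbitrary):

```python
import numpy as np

# The difference matrix for -d^2/dx^2 (see Chapter 13) is symmetric,
# so it has real eigenvalues and orthogonal eigenvectors (Theorem 12.3).
N = 30
T = 2*np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)

lam, V = np.linalg.eigh(T)              # eigh exploits symmetry
print(np.allclose(T, T.T))              # True: discrete self-adjointness
print(np.all(np.isreal(lam)))           # True: real eigenvalues
print(np.allclose(V.T @ V, np.eye(N)))  # True: orthonormal eigenvectors
```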
Chapter 13
Finite difference methods

Finite difference methods (FDM) are based on approximating derivatives by difference
quotients. The basis is a forward difference,

    y′(x) = (y(x + ∆x) − y(x)) / ∆x + O(∆x),    (13.1)

or a backward difference,

    y′(x) = (y(x) − y(x − ∆x)) / ∆x + O(∆x).    (13.2)

Both are first order approximations. Taking the average of the two yields a symmetric
difference quotient,

    y′(x) = (y(x + ∆x) − y(x − ∆x)) / (2∆x) + O(∆x²).    (13.3)

This is a second order approximation. Symmetric difference approximations are in
general at least second order accurate. Therefore one can usually construct second
order convergent FDMs.
In a similar way, we can approximate a second order derivative by the symmetric
difference quotient

    y″(x) ≈ (1/∆x) · [ (y(x + ∆x) − y(x))/∆x − (y(x) − y(x − ∆x))/∆x ]
          = (y(x + ∆x) − 2y(x) + y(x − ∆x)) / ∆x² + O(∆x²).
The orders of these approximations are easily verified by Taylor series expansions.
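The verification can also be carried out empirically. In the Python sketch below (using y(x) = sin x at x = 1, a choice made purely for illustration), reducing ∆x by a factor of 10 should reduce a first order error by about 10 and a second order error by about 100:

```python
import numpy as np

# Empirical orders of the difference quotients (13.1)-(13.3),
# tested on y(x) = sin(x) at x = 1, where y' = cos(1), y'' = -sin(1).
x = 1.0
dx = np.array([1e-1, 1e-2, 1e-3])

fwd = (np.sin(x + dx) - np.sin(x)) / dx                        # (13.1)
sym = (np.sin(x + dx) - np.sin(x - dx)) / (2*dx)               # (13.3)
d2  = (np.sin(x + dx) - 2*np.sin(x) + np.sin(x - dx)) / dx**2  # y'' approx.

e_fwd = np.abs(fwd - np.cos(1.0))   # shrinks ~10x per step: first order
e_sym = np.abs(sym - np.cos(1.0))   # shrinks ~100x per step: second order
e_d2  = np.abs(d2 + np.sin(1.0))    # shrinks ~100x per step: second order
print(e_fwd, e_sym, e_d2)
```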
Finite difference methods for 2pBVPs are constructed from these approxima-
tions and some further variants building on the same techniques. This converts the
differential equation into an approximating algebraic problem, which can be solved
with a finite computational effort.


13.1 FDM for the 1D Poisson equation

The 1D Poisson equation, −u00 = f on Ω = [0, 1], with Dirichlet boundary condi-
tions u(0) = uL and u(1) = uR , is both the simplest 2pBVP and the simplest example
of how to construct and apply a finite difference method. For this reason we take this
model equation to describe the main techniques and ideas of finite difference meth-
ods for 2pBVPs. This includes proving that the FDM is convergent. Later, we shall
extend the techniques to also discuss Neumann and other boundary conditions, as
well as eigenvalue problems and non-selfadjoint boundary value problems.
The independent variable, x ∈ Ω, is discretized by choosing an equidistant grid
Ω_N = {x_j}_{j=0}^{N+1} ⊂ Ω, where the grid points are given by

    x_j = j · ∆x,

and where the spacing between the grid points is referred to as the mesh width,

    ∆x = 1/(N + 1).

This means that x₀ = 0 is the left endpoint, and x_{N+1} = 1 is the right endpoint. In
between, there are N internal points in the interval [0, 1]. In case one wants to solve
the problem on an interval Ω = [0, L] the only change is that ∆x = L/(N + 1). Below,
we shall only describe the method on the unit interval.
The solution will be approximated numerically on this grid, by a vector u =
{u_j}_{j=1}^N of internal points, augmented by the boundary values, u₀ = u_L and u_{N+1} =
u_R. The vector u approximates the solution to the differential equation, according to
u_j ≈ u(x_j). Now the second order derivative −u″ is approximated by a symmetric
finite difference,

    −u″ ≈ −(u_{j−1} − 2u_j + u_{j+1}) / ∆x².

We thus obtain a linear system of equations,

    (−u_L + 2u₁ − u₂) / ∆x² = f(x₁)
    (−u_{j−1} + 2u_j − u_{j+1}) / ∆x² = f(x_j),    j = 2 : N − 1
    (−u_{N−1} + 2u_N − u_R) / ∆x² = f(x_N),

where the first and last equations include the boundary conditions, and the middle
equation is the generic equation, only involving internal points. Given that the
boundary values are known, we retain the unknowns in the left hand side, to get
Fig. 13.1 FDM solution of −u″ = f. The continuous data function f on [0, 1] is sampled on the
grid (red markers, top). The FDM system L_N u = f + b is then solved to produce an approximate
solution (red markers, bottom) with inhomogeneous Dirichlet conditions indicated separately

    (2u₁ − u₂) / ∆x² = f(x₁) + u_L/∆x²
    (−u_{j−1} + 2u_j − u_{j+1}) / ∆x² = f(x_j),    j = 2 : N − 1
    (−u_{N−1} + 2u_N) / ∆x² = f(x_N) + u_R/∆x².

This linear system is conveniently represented in matrix-vector form, as

            [  2 −1  0 ···  0 ] [ u₁      ]   [ f₁      ]           [ u_L ]
            [ −1  2 −1 ···  0 ] [ u₂      ]   [ f₂      ]           [  0  ]
    1/∆x² · [     ⋱  ⋱  ⋱     ] [  ⋮      ] = [  ⋮      ] + 1/∆x² · [  ⋮  ]    (13.4)
            [    −1  2 −1     ] [ u_{N−1} ]   [ f_{N−1} ]           [  0  ]
            [  0 ··· 0 −1  2  ] [ u_N     ]   [ f_N     ]           [ u_R ]

In compact form, the FDM discretization can be written

    L_N u = f + b.    (13.5)

Here the vector b ∈ R^N represents the boundary conditions; in case the Dirichlet
conditions are homogeneous we have b = 0. The N × N tridiagonal matrix can be
written

    L_N = (1/∆x²) T    (13.6)

with

    T = tridiag(−1 2 −1).    (13.7)

Thus the self-adjoint problem −u″ = f, in the form L u = f with L* = L and
u(0) = u(1) = 0, is approximated by the linear algebraic system L_N u = f, where the
matrix is symmetric, i.e., L_N* = L_N. The FDM solution of an example problem with
inhomogeneous Dirichlet conditions is illustrated in Figure 13.1.
We shall later prove that the method constructed above is convergent of order
p = 2. As u = L_N⁻¹(f + b), convergence will depend on whether we can show that
the inverse L_N⁻¹ is continuous (bounded). This requires that we explore the special
properties of L_N and similar tridiagonal matrices.
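To make the construction concrete, here is a Python sketch of the full procedure (with a manufactured example: the exact solution u(x) = sin πx, so f(x) = π² sin πx and b = 0). Comparing two grids already exhibits the second order convergence proved later in this chapter:

```python
import numpy as np

# Solve -u'' = f, u(0) = u(1) = 0, with exact solution u(x) = sin(pi x),
# i.e. f(x) = pi^2 sin(pi x), and measure the maximum grid error.
def poisson_fdm_error(N):
    dx = 1.0 / (N + 1)
    x = np.linspace(dx, 1.0 - dx, N)                    # interior grid points
    T = 2*np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)  # tridiag(-1 2 -1)
    f = np.pi**2 * np.sin(np.pi * x)
    u = np.linalg.solve(T / dx**2, f)                   # L_N u = f (here b = 0)
    return np.max(np.abs(u - np.sin(np.pi * x)))

e1, e2 = poisson_fdm_error(50), poisson_fdm_error(100)
print(e1 / e2)   # roughly 4: halving dx reduces the error by about 2^2
```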

13.2 Toeplitz matrices

In the 1D Poisson problem L u = f the operator L = −d²/dx² is elliptic, allowing
us to solve this problem for all f ∈ L²[0,1]. As we shall see, the symmetric second
order discretization above, L_N u = f + b, is also a solvable problem, as L_N is symmetric
and positive definite. The symmetry was already established in (13.6), and
we now have to demonstrate that L_N is positive definite.
This is related to the eigenvalues of L_N, and, in particular, to its lower logarithmic
norm. Thus we have to show that m₂[L_N] > 0. To establish such a result, we turn to
the algebraic eigenvalue problem

    L_N u = λu,    (13.8)

where L_N is given by (13.6). We also noted that L_N = T/∆x², where the tridiagonal
matrix T is a Toeplitz matrix.

Definition 13.1. A matrix A = {a_{i,j}} is called Toeplitz if a_{i,j} = a_{i−j}.

This means that along diagonals (whether the main diagonal or super- or sub-diagonals)
the matrix elements are constant. For example, on the main diagonal,
i = j, so a_{i,j} = a₀, for all i, j.
Toeplitz matrices occur frequently in applied mathematics. They typically occur
when local operators (such as difference operators or digital filters) are represented
on equidistant grids. Toeplitz matrices are discrete analogues of convolutions, since,
if v = Au, then

    v_i = ∑_{j=1}^N a_{i,j} u_j = ∑_{j=1}^N a_{i−j} u_j.

This is a discrete counterpart to the convolution integral of two functions a and u,

    v(x) = (a ∗ u)(x) = ∫₀¹ a(x − t) u(t) dt.

In our case, T = tridiag(−1 2 −1). This is sometimes written

    T = ] −1 2 −1 [,    (13.9)

where the reversed brackets indicate the neighboring diagonals, and the main diagonal
is emphasized in boldface. The bracket is usually referred to as the convolution
kernel. The convolution kernel completely defines the matrix and its action.
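The correspondence between matrix action and convolution can be checked in a few lines. In the sketch below, np.convolve with mode='same' implements exactly the zero-padded convolution, matching homogeneous Dirichlet boundaries:

```python
import numpy as np

# Multiplying by T = ]-1 2 -1[ is convolution with the kernel (-1, 2, -1),
# where values outside the grid are taken as zero (Dirichlet conditions).
N = 8
T = 2*np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
u = np.arange(1.0, N + 1)              # any grid function

v_matrix = T @ u
v_conv = np.convolve(u, [-1.0, 2.0, -1.0], mode='same')
print(np.allclose(v_matrix, v_conv))   # True
```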
Much is known about Toeplitz matrices. For example, we can compute the Euclidean
norm and logarithmic norm of tridiagonal Toeplitz matrices. This is key
to proving that L_N is positive definite, and that the FDM method is convergent as
N → ∞. Such a result is by no means trivial; the matrix L_N is N × N and so grows in
size, but moreover, its elements also grow in magnitude, since the diagonal element
of L_N is 2(N + 1)² → ∞.
Now, in (13.8) we need to compute the eigenvalues of L_N. To this end, it is sufficient
to find the eigenvalues of the N × N Toeplitz matrix T, since

    λ[L_N] = (N + 1)² · λ[T].

Because we will also be interested in skew-symmetric and non-symmetric Toeplitz
matrices, we will consider a general tridiagonal matrix. Since the diagonal only
shifts the eigenvalues by a constant, the general result is obtained from considering
a matrix with zero diagonal. The main result is the following:

Theorem 13.1. Let K be an N × N tridiagonal Toeplitz matrix, with kernel

    K = ] a 0 c [,    (13.10)

where a, c ∈ R. The N eigenvalues λ[K] are given by

    λ_k = 2√(ac) cos(kπ/(N + 1)),    k = 1 : N,    (13.11)

and the kth eigenvector v, satisfying Kv = λ_k v, has components

    v_j = (a/c)^{j/2} sin(kπj/(N + 1)) = (a/c)^{j/2} sin kπx_j,    j = 1 : N,    (13.12)

on the grid points x_j ∈ Ω_N ⊂ Ω = [0, 1].

Proof. We first note that if a ≠ c, we can construct a diagonal symmetrizer D, such
that

    DKD⁻¹ = √(ac) S,    (13.13)

where the tridiagonal matrix S is symmetric, with kernel

    S = ] 1 0 1 [.    (13.14)

Assume that Kv = λ_k v. Then

    DKD⁻¹ Dv = λ_k Dv,

showing that Su = λ_k u, with u = Dv. Thus, since (13.13) is a similarity transformation,
the eigenvalues are preserved, i.e.,

    λ[K] = λ[DKD⁻¹] = √(ac) λ[S],

while the eigenvectors are transformed according to u = Dv.
Starting with the symmetrizer, we illustrate the construction by taking N = 3. We
then have

            [ d₁ 0  0  ] [ 0 c 0 ] [ d₁⁻¹ 0    0    ]   [ 0       d₁c/d₂  0      ]
    DKD⁻¹ = [ 0  d₂ 0  ] [ a 0 c ] [ 0    d₂⁻¹ 0    ] = [ d₂a/d₁  0       d₂c/d₃ ].
            [ 0  0  d₃ ] [ 0 a 0 ] [ 0    0    d₃⁻¹ ]   [ 0       d₃a/d₂  0      ]

Hence the subdiagonal elements in the jth column are d_{j+1}a/d_j, while the superdiagonal
elements in the jth row are d_j c/d_{j+1}. These can be made equal, making
the matrix DKD⁻¹ symmetric, by requiring

    (d_{j+1}/d_j) · a = (d_j/d_{j+1}) · c,

from which it follows that d_{j+1}/d_j = √(c/a). Consequently, the transformed sub-
and superdiagonal elements are all equal to √(ac), and DKD⁻¹ = √(ac) S, with the
symmetric matrix S given by (13.14). The diagonal matrix D is found from the
recursion

    d_{j+1} = √(c/a) · d_j,    d₁ = √(c/a).

It follows that d_j = (c/a)^{j/2}. Note that if c/a < 0, the symmetrizer D is complex
valued.
The remaining problem is to find λ[S]. For the eigenvalue problem Su = λu, we
note that the jth equation reads

    u_{j+1} + u_{j−1} = λ u_j.

This is a linear difference equation, with boundary conditions u₀ = u_{N+1} = 0. Its
characteristic equation is

    κ² − λκ + 1 = 0.    (13.15)

The product of the two roots is 1, due to the last coefficient in the characteristic
equation. Hence, denoting one root by κ, the other root is 1/κ, and the general
solution to the difference equation can be written

    u_j = Aκ^j + Bκ^{−j}.

Inserting u₀ = 0, we find A + B = 0, so the general solution is

    u_j = A(κ^j − κ^{−j}).    (13.16)

Inserting the other boundary condition, u_{N+1} = 0, we find

    0 = A(κ^{N+1} − κ^{−(N+1)}).

Since A ≠ 0 (otherwise we obtain the trivial solution u ≡ 0), we must have

    κ^{2(N+1)} = 1.

Thus we need to find the 2(N + 1)st roots of unity. Writing 1 = e^{2πik}, we obtain

    κ_k = e^{ikπ/(N+1)},    k = 1 : N.

Now, in the characteristic equation (13.15) we see that the sum of the two roots is
λ. Therefore,

    λ_k[S] = κ_k + 1/κ_k = e^{ikπ/(N+1)} + e^{−ikπ/(N+1)} = 2 cos(kπ/(N + 1)).

Thus we have found λ[S]. It follows that

    λ_k[K] = 2√(ac) cos(kπ/(N + 1)).

We now have to construct the eigenvectors. We note in (13.16), taking A = 1/(2i),
that the kth eigenvector u of S has components

    u_j = A(κ_k^j − κ_k^{−j}) = (e^{ikπj/(N+1)} − e^{−ikπj/(N+1)})/(2i) = sin(kπj/(N + 1)) = sin kπx_j,

where the x_j are the grid points. The transformation v = D⁻¹u implies that the
eigenvector v, satisfying Kv = λ_k v, has components

    v_j = (a/c)^{j/2} sin kπx_j.

This completes the proof.

Theorem 13.1 is central in 1D FDM and FEM theory, and useful in 2pBVPs as
well as in parabolic and hyperbolic PDEs. Because we are not only interested in
self-adjoint problems, the theorem also covers skew-symmetric and nonsymmetric
Toeplitz matrices, since a and c may be different, and even have opposite signs, in
which case the eigenvalues become imaginary and the eigenvectors complex.
Applying Theorem 13.1 to the Poisson equation with Dirichlet conditions, which
is a self-adjoint problem, and whose FDM discretization is symmetric, we have

Corollary 13.1. Let L_N = (N + 1)² T be an N × N tridiagonal Toeplitz matrix, where
the kernel of T is given by (13.9). Then

    λ_k[L_N] = 4(N + 1)² sin²(kπ/(2(N + 1))),    k = 1 : N.    (13.17)

The kth eigenvector u, satisfying L_N u = λ_k u, has vector components

    u_n = √2 sin kπx_n = √2 sin(kπn/(N + 1)),    n = 1 : N    (13.18)

on the uniform grid Ω_N = {x_n}_{n=1}^N, with grid points x_n = n∆x = n/(N + 1).

Proof. Note that T = 2I + K, with a = c = −1 in Theorem 13.1. Hence, ordering the
eigenvalues appropriately (k ↔ N + 1 − k),

    λ_k[T] = 2 − 2 cos(kπ/(N + 1)) = 4 sin²(kπ/(2(N + 1))),

and it follows that λ_k[L_N] = (N + 1)² λ_k[T], see Figure 13.2. The eigenvectors are
rescaled to conform to the orthonormal eigenfunctions of the continuous operator.
The proof is complete.
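Theorem 13.1 is easily checked numerically. A Python sketch for a nonsymmetric example (the values a = 3, c = 0.75 are arbitrary choices with ac > 0):

```python
import numpy as np

# Verify the eigenvalue formula of Theorem 13.1 for K = ]a 0 c[.
N, a, c = 12, 3.0, 0.75
K = a*np.eye(N, k=-1) + c*np.eye(N, k=1)     # subdiagonal a, superdiagonal c

k = np.arange(1, N + 1)
lam_formula = 2*np.sqrt(a*c) * np.cos(k*np.pi/(N + 1))
lam_numeric = np.linalg.eigvals(K).real      # eigenvalues are real here

print(np.allclose(np.sort(lam_numeric), np.sort(lam_formula)))  # True
```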

As a further application, we turn to the Sturm–Liouville eigenvalue problem

    −u″ = λu,    u(0) = u(1) = 0.

For this problem, we already know that the eigenvalues and eigenfunctions are

    λ_k[−d²/dx²] = k²π²,    k ∈ Z⁺,
    u_k(x) = √2 sin kπx.

We employ the same discretization as for the 2pBVP above, to obtain the algebraic
eigenvalue problem

    L_N u = λu.    (13.19)
We can now compare the eigenvalues and eigenfunctions of the continuous problem,
L u = λu, and those of the discrete problem, L_N u = λ_k u.

Fig. 13.2 Eigenvalues of L_N. The eigenvalues λ_k[L_N] = (N + 1)² · (2 − 2 cos(kπ/(N + 1))) are projections
on the real axis of equally spaced angular increments. Here N = 19

Starting with the

eigenfunctions, we note that the kth eigenfunction of the continuous problem is
u_k(x) = √2 sin kπx. Sampled on the grid, we obtain

    u_k(x_n) = √2 sin kπx_n = u_n,

where u = {u_n}_{n=1}^N is the kth eigenvector of the discrete problem. Thus the discrete
eigenvectors are exact samples of the continuous eigenfunctions, without
discretization errors.
The first three eigenvectors (the “lowest modes”) are plotted on the grid Ω_N in
Figure 13.3, and are seen to be accurate renditions of the continuous eigenfunctions,
even when N is small. The last eigenvector (the “highest mode”), however, appears
to bear little resemblance to the continuous eigenfunction, in spite of coinciding
with the exact solution at every grid point, as seen in Figure 13.4. This illustrates
the fact that the sampling of a highly oscillatory function must be dense enough
in order to visually represent the function well, even though the Nyquist–Shannon
sampling theorem is not violated in Figure 13.4.
The visual deterioration of the eigenvectors is perhaps better revealed by the
eigenvalues, which are subject to discretization errors. We have already seen that
λ_k[L] = k²π². For the discrete eigenvalues, we expand the first eigenvalues of
(13.17) in a Taylor series as N → ∞, to obtain

    λ_k[L_N] = 4(N + 1)² sin²(kπ/(2(N + 1))) = k²π² − k⁴π⁴/(12(N + 1)²) + O(N⁻⁴).

It follows that

    λ_k[L_N] = λ_k[L] − k⁴π⁴/(12(N + 1)²) + O(N⁻⁴),    (13.20)

implying that the numerically computed eigenvalues converge to the eigenvalues
of the differential operator with order p = 2, since the error is O((N + 1)⁻²) = O(∆x²).

Fig. 13.3 First three eigenvector modes of L_N. From top to bottom, the panels show the eigenvectors
u₁, u₂ and u₃ of the N × N matrix L_N, for N = 19. The graphs are in close agreement with the
eigenfunctions u_k(x) = sin kπx of L = −d²/dx² with u(0) = u(1) = 0, for k = 1, 2 and 3. Note
that the boundary values (indicated separately) are not components of the eigenvectors

Thus a second order FDM will produce second order approximations to the
eigenvalues. However, high accuracy will only be obtained for the first few
eigenvalues. The relative error in the kth eigenvalue is

    (λ_k[L_N] − λ_k[L]) / λ_k[L] = −λ_k[L]/(12(N + 1)²) + O(N⁻⁴).

It increases with the magnitude of λ_k[L], which is proportional to k². At the same
time, for a fixed k, the relative error decreases like N⁻² as N increases, due to the second
order convergence of the FDM. Therefore, as the relative error is O(k²/N²), we
obtain increased accuracy only if k grows “slower” than N. As a rule of thumb, one
can obtain high accuracy for the first k ∼ √N eigenvalues. Likewise, we typically
obtain an acceptable discrete rendition of the first √N eigenfunctions.
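This rule of thumb can be illustrated directly from (13.17). A sketch (N = 100 is an arbitrary choice):

```python
import numpy as np

# Relative errors of the discrete eigenvalues (13.17) versus k^2 pi^2:
# small for the low modes, O(1) for the highest modes.
N = 100
k = np.arange(1, N + 1)
lam_discrete = 4*(N + 1)**2 * np.sin(k*np.pi/(2*(N + 1)))**2
lam_exact = (k*np.pi)**2
rel_err = np.abs(lam_discrete - lam_exact) / lam_exact

print(rel_err[0], rel_err[9], rel_err[-1])   # grows roughly like k^2/N^2
```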
The “mismatch” for higher eigenvalues and eigenvectors is typical of numerical
eigenvalue computations. It is due to the fact that the continuous operator L has an
infinite sequence of eigenvalues and eigenfunctions, while the discrete FDM operator
only has N eigenvalues and eigenvectors.
Since L_N* = L_N, the eigenvalues are real, as observed above, and the eigenvectors
are orthogonal. Apart from calculating the eigenvalues and eigenvectors, we can also
Fig. 13.4 Highest eigenvector mode of L_N. Top panel shows the eigenvector u₁₉ of the N × N
matrix L_N, corresponding to the Nth mode for N = 19. Center panel shows the eigenfunction
u₁₉(x) = sin 19πx of L = −d²/dx² with u(0) = u(1) = 0. When both graphs are overlaid (bottom)
we see that the eigenvector u₁₉ consists of exact samples of u₁₉(x) on the grid

determine the Euclidean norms and logarithmic norms of L_N, as these values are
determined by the extreme eigenvalues in symmetric (normal) matrices.

Theorem 13.2. Let L_N = (N + 1)² T be the N × N symmetric tridiagonal Toeplitz
matrix given by (13.9). Then

    m₂[L_N] = λ₁[L_N] = π² + O(N⁻²)
    ‖L_N‖₂ = λ_N[L_N] = 4(N + 1)² − O(1)
    ‖L_N⁻¹‖₂ = 1/λ₁[L_N] = 1/π² + O(N⁻²).

Proof. The results follow directly from the fact that, for a symmetric matrix, the
Euclidean norm and the upper and lower logarithmic norms are determined by the
extreme eigenvalues. Since the eigenvalues are all real, and are positive for the
matrix L_N, it follows that m₂[L_N] = λ₁[L_N] and ‖L_N‖₂ = λ_N[L_N]. Further, since
λ[L_N⁻¹] = λ⁻¹[L_N] and L_N⁻¹, too, is symmetric, it follows that ‖L_N⁻¹‖₂ = 1/λ₁[L_N].
Taylor series expansions as N → ∞ then provide the approximations.

We note that, since m₂[L_N] > 0, the matrix L_N is symmetric positive definite.
Thus the FDM discretization reflects the fact that L is self-adjoint and elliptic.
This qualitative preservation of the properties of the original operator is very important
for success in numerical computations.
It is worth noting that, by the uniform monotonicity theorem, it holds that

    m₂[L_N] > 0  ⇒  ‖L_N⁻¹‖₂ ≤ 1/m₂[L_N].    (13.21)

However, while this bound holds for general matrices, (13.21) now holds with equality,
since the matrix L_N and its inverse are both symmetric positive definite. Thus
the Euclidean norm is “optimal” for symmetric positive definite systems.
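The equality in (13.21) for the symmetric positive definite matrix L_N can be confirmed numerically (a sketch; N = 50 is arbitrary):

```python
import numpy as np

# For the SPD matrix L_N, the bound ||L_N^{-1}||_2 <= 1/m2[L_N]
# holds with equality.
N = 50
dx = 1.0 / (N + 1)
LN = (2*np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / dx**2

m2 = np.linalg.eigvalsh(LN)[0]                   # m2[L_N] = smallest eigenvalue
inv_norm = np.linalg.norm(np.linalg.inv(LN), 2)  # ||L_N^{-1}||_2
print(inv_norm * m2)                             # equals 1
```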
We are now in a position to prove that the FDM is convergent for elliptic prob-
lems. This will follow the standard pattern of inserting the exact solution into the
discretization to obtain a local error, followed by establishing stability, from which
convergence results.

13.3 The Lax Principle

The Lax Principle is a “meta theorem” in numerical analysis. It states a general pat-
tern in numerical analysis, establishing how consistency is related to convergence.
The bridge between consistency and convergence is stability: without stability a
method is not convergent.

Theorem 13.3. (Lax Principle) Consistency and Stability imply Convergence.

We have already seen this principle in action in initial value problems, proving that
time stepping methods are convergent. We first proved the order of consistency, and
then proved that the method was stable, to conclude that the method was convergent.
Here, we shall do the same thing for the FDM discretization of the 1D Poisson equa-
tion. This is but a single example of 2pBVPs, and whenever the problem changes,
or the boundary conditions change, we have to reconsider the proof. Nevertheless,
we will see that all convergence proofs follow the same pattern: we first prove the
order of consistency, then prove stability, and the order of convergence follows.
The present task is to prove that the FDM discretization of the 1D Poisson equation
is convergent. We therefore consider (13.5), L_N u = f + b. For simplicity, but
without loss of generality, we may consider homogeneous boundary conditions, i.e.,
b = 0. The generic equation of this discretization reads

    (−u_{j−1} + 2u_j − u_{j+1}) / ∆x² = f(x_j).    (13.22)

Upon inserting the exact solution to −u″ = f, we have

    (−u(x_{j−1}) + 2u(x_j) − u(x_{j+1})) / ∆x² = f(x_j) − r_j,    (13.23)
where r_j is the local residual (often referred to as the local error in the literature).
Expanding the exact solution around the grid point x_j, we find that

    (−u(x_j − ∆x) + 2u(x_j) − u(x_j + ∆x)) / ∆x² = −u″(x_j) − (∆x²/12) u^(4)(x_j) + O(∆x⁴).

Since −u″ = f, it follows that

    r_j = (∆x²/12) u^(4)(x_j) + O(∆x⁴),

so the method's order of consistency is p = 2. Now, defining the global error as
e_j = u_j − u(x_j) and subtracting (13.23) from (13.22), we obtain

    (−e_{j−1} + 2e_j − e_{j+1}) / ∆x² = r_j.
This is referred to as the error equation, as it defines the global error in terms of
the local residual. Interestingly, the structure of the error equation is identical to that
of (13.22). Thus, denoting the global error vector on ΩN by e, and the local residual
by r, we have
LN e = r. (13.24)
Because LN is symmetric positive definite, it is invertible, and therefore

e = LN−1 r. (13.25)

We now need to bound the error. To this end, we need to choose an appropriate norm.
In the continuous problem −u00 = f the standard choice is the L2 norm, defined by
‖u‖²_{L²(Ω)} = ∫_Ω |u(x)|² dx.

We need a discrete counterpart: the root mean square norm (often referred to as the
RMS norm, or the discrete L² norm), defined by

‖u‖²_{L²(Ω_N)} = Σ_{j=1}^{N} |u_j|² ∆x,

where ∆x = 1/(N + 1) in the case of Dirichlet boundary conditions.


The main reason for using this norm is that we need a norm such that the norm
of the unit function, u(x) ≡ 1, satisfies ‖u‖_{L²} = 1 not only in the continuous case, but
also in the discrete case, independent of the number of grid points. This precludes
the use of the standard Euclidean norm, since the Euclidean norm of the unit function
is √N, which grows with the number of grid points. With the discrete L²
norm, this problem is rectified. (In fact, the discrete L²(Ω_N) norm is a second order
approximation to the continuous L²(Ω) norm.)
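This scale invariance is easy to observe numerically. The following Python/NumPy sketch (our own illustration; the book's codes are in MATLAB, and the helper name discrete_L2 is ours) evaluates both norms on the unit function for increasing N:

```python
import numpy as np

def discrete_L2(u, dx):
    """Discrete L2 norm: (sum_j |u_j|^2 * dx)^(1/2)."""
    return np.sqrt(np.sum(np.abs(u)**2) * dx)

for N in (10, 100, 1000):
    dx = 1.0 / (N + 1)              # Dirichlet grid spacing
    u = np.ones(N)                  # samples of the unit function
    # discrete L2 norm stays close to 1; Euclidean norm grows like sqrt(N)
    print(N, discrete_L2(u, dx), np.linalg.norm(u))
```

The discrete L² norm of the unit function stays near 1 for every N, while the Euclidean norm grows without bound.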
To use the discrete L2 (ΩN ) norm, we need the corresponding operator norm.
Given a matrix A ∈ RN×N , we have

‖A‖²_{L²(Ω_N)} = sup_{u≠0} ‖Au‖²_{L²(Ω_N)}/‖u‖²_{L²(Ω_N)} = sup_{u≠0} (Σ_j (Au)_j²/(N+1)) / (Σ_j u_j²/(N+1)) = sup_{u≠0} ‖Au‖₂²/‖u‖₂² = ‖A‖₂².

Definition 13.2. The discrete L²(Ω_N) norm of an N-vector u is defined by

‖u‖_{L²(Ω_N)} = ( (1/(N+1)) Σ_{j=1}^{N} |u_j|² )^{1/2}.    (13.26)

The induced operator norm of a matrix A : L²(Ω_N) → L²(Ω_N) is

‖A‖²_{L²(Ω_N)} = ‖A‖₂².    (13.27)

We note that the structure of the expression (13.26) explains the term “root mean
square.” Now, by (13.25), we can bound the global error,

kekL2 (ΩN ) ≤ kLN−1 k2 · krkL2 (ΩN ) . (13.28)

The global error is bounded if the two factors on the right hand side are bounded.

Theorem 13.4. The local residual in the FDM discretization (13.5) of the 1D Pois-
son equation, with Dirichlet boundary conditions, is

krkL2 (ΩN ) = O(∆ x2 ),

implying that the method’s order of consistency is p = 2.

This result has already been demonstrated above, using Taylor series expansions.
Thus, with every component r j being O(∆ x2 ), we have krkL2 (ΩN ) = O(∆ x2 ), show-
ing that the method is 2nd order consistent. To translate this result into convergence,
we need stability.

Definition 13.3. The FDM discretization (13.5) is stable, if there is a constant C,
independent of N, such that

‖L_N^{−1}‖₂ ≤ C.    (13.29)

By Theorem 13.2 the stability condition is also satisfied, as kLN−1 k2 ≈ 1/π 2 . This
implies that the global error can be bounded according to (13.28), proving the con-
vergence of the finite difference method:

Theorem 13.5. The global error in the FDM discretization (13.5) of the 1D Poisson
equation, with Dirichlet boundary conditions, is bounded by

‖e‖_{L²(Ω_N)} ≲ ‖r‖_{L²(Ω_N)}/π² = O(∆x²),    (13.30)
implying that the method’s order of convergence is p = 2.

This result is classical: consistency and stability imply convergence. It ex-
emplifies the Lax Principle as stated above, also referred to as the fundamental
theorem of numerical analysis. It is a recurring theme, and the pattern is found in
most numerical approximation schemes, whether in initial value problems, bound-
ary value problems, or in partial differential equations. It states that in a stable
method, krkL2 (ΩN ) → 0 implies that kekL2 (ΩN ) → 0.
In FDMs for 2pBVPs, consistency means that the local residual r → 0 as N → ∞.
Likewise, convergence means that the global error e → 0 as the discretization is
made finer. The bridge from consistency to convergence is stability, which requires
that kLN−1 k2 is uniformly bounded as N → ∞. Then (13.28) implies that the order
of convergence is equal to the order of consistency.
It is important to note that the stability condition kLN−1 k2 ≤ C is not a condition
on a single matrix. It is a condition on the entire family of matrices {L_N^{−1}}, N ≥ N₀,
which must remain continuous (uniformly bounded) as N → ∞. For different N, the matrices
have different dimensions, and, naturally, different elements. Yet they have to share
a common constant, C, such that for any N, it holds that kLN−1 k2 ≤ C, where the
stability constant C is independent of N.
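The uniform bound can be observed numerically. A small Python/NumPy sketch (an illustration of ours, not from the text) assembles L_N for several N and evaluates ‖L_N^{−1}‖₂ = 1/λ₁[L_N], which approaches 1/π²:

```python
import numpy as np

def poisson_matrix(N):
    """Standard 2nd order FDM matrix L_N for -u'' with Dirichlet BCs."""
    dx = 1.0 / (N + 1)
    return (np.diag(2.0 * np.ones(N))
            - np.diag(np.ones(N - 1), 1)
            - np.diag(np.ones(N - 1), -1)) / dx**2

for N in (20, 40, 80):
    # L_N is symmetric positive definite, so ||L_N^{-1}||_2 = 1/lambda_1[L_N]
    lam1 = np.linalg.eigvalsh(poisson_matrix(N))[0]
    print(N, 1.0 / lam1)   # approaches 1/pi^2 ~ 0.1013 as N grows
```

The stability constant is thus shared by the whole family of matrices, even though their dimensions differ.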

Remark 1 (Continuity and stability) On close inspection, we see that stability is just
another term for continuity. Thus, since e = LN−1 r, we see that if LN−1 is a continuous map,
then r → 0 implies e → 0, see Figure 13.5. While we typically establish the consistency
condition r → 0 by straightforward Taylor series expansions, the rest of the convergence
proof “only” requires that we establish stability (continuity). This is usually the harder part
of the proof.

Remark 2 (Ellipticity and stability) Above, we have established stability from elliptic-
ity. Thus, by investigating the associated Sturm–Liouville eigenvalue problem, we demon-
strated that the differential operator L is elliptic, and that the FDM preserves that prop-
erty. As a consequence, LN is uniformly positive definite for all N, as demonstrated by
m2 [LN ] ≈ π 2 > 0. This means that ellipticity plays a central role. Because ellipticity also
depends on the boundary conditions, it is typically necessary to establish ellipticity in each
separate case. We note that ellipticity is a sufficient condition for demonstrating stability;
thus there are operators that are not elliptic, but whose discretizations are still stable. While
stability is always necessary, it is typically much harder to establish stability (and hence
convergence) in the non-elliptic case.

[Diagram: L maps u ∈ H₀²(Ω) to f ∈ L²(Ω); the grid restriction Γ_N maps these to H₀²(Ω_N) and L²(Ω_N); on the grid, L_N^{−1} maps f − r to the numerical solution, with local residual r and global error e.]

Fig. 13.5 The Lax Principle. A finite difference method is applied to L u = f on Ω = [0, 1] with
homogeneous Dirichlet conditions. A grid ΩN with N interior points is used to obtain the algebraic
problem LN u = f. When the exact solution is inserted into the discretization, we obtain a discrep-
ancy, represented by the local residual −r. The corresponding global error is e = LN−1 r. As r → 0
(consistency), it follows that e → 0 (convergence), provided that kLN−1 k ≤ C for all N (stability).
Thus the stability condition is equivalent to the family of maps LN−1 being continuous

13.4 Other types of boundary conditions

The standard boundary conditions for a 2pBVP are Dirichlet conditions (already
dealt with), Neumann conditions, Robin conditions, and periodic boundary condi-
tions. Both Neumann and Robin conditions involve u0 on the boundary. Without loss
of generality, we begin by assuming that the boundary condition at x = 0 is a Dirich-
let condition, and that we have a Neumann condition at x = 1. This means that the
simplest model problem is the 1D Poisson equation

−u00 = f , u(0) = uL , u0 (1) = u0R . (13.31)

There are (at least) three different ways to approach this problem. In all cases the
condition on the derivative at x = 1 must be discretized, and to maintain 2nd order
convergence, the boundary condition must also be discretized to 2nd order accuracy.
The different approaches will be seen to involve the construction of the grid, so as
to meet the specific requirements.
In the first approach, we construct a grid with N internal points, such that

x j = j · ∆ x,

with ∆ x = 1/(N + 1/2). This means that x0 = 0 and that all grid points are still
spaced by ∆ x, but that
x_N = 1 − ∆x/2,    x_{N+1} = 1 + ∆x/2,
with xN+1 outside the [0, 1] interval. Since xN and xN+1 are symmetrically located
around the boundary point x = 1, we have

u′(1) = (u(x_{N+1}) − u(x_N))/∆x + O(∆x²).
Conforming to this approximation, we define the numerical approximation uN+1 by
u′_R = (u_{N+1} − u_N)/∆x
and solve for uN+1 to obtain uN+1 = uN + ∆ x · u0R . Inserting this expression into the
FDM discretization of −u00 = f , we get
(−u_L + 2u₁ − u₂)/∆x² = f(x₁)
(−u_{j−1} + 2u_j − u_{j+1})/∆x² = f(x_j),    j = 2 : N−1
(−u_{N−1} + 2u_N − (u_N + ∆x · u′_R))/∆x² = f(x_N).
Simplifying the first and last equations, we now have
(2u₁ − u₂)/∆x² = f(x₁) + u_L/∆x²
(−u_{j−1} + 2u_j − u_{j+1})/∆x² = f(x_j),    j = 2 : N−1
(−u_{N−1} + u_N)/∆x² = f(x_N) + u′_R/∆x.
In matrix-vector form, this system reads LN u = f + b, with
      
\frac{1}{\Delta x^2}
\begin{pmatrix}
2 & -1 & 0 & \cdots & 0 \\
-1 & 2 & -1 & \cdots & 0 \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
0 & \cdots & 0 & -1 & 1
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix}
+ \frac{1}{\Delta x^2}
\begin{pmatrix} u_L \\ 0 \\ \vdots \\ 0 \\ \Delta x \cdot u'_R \end{pmatrix}.    (13.32)

The main difference compared to the previous case, with Dirichlet conditions at both
endpoints, is that ∆ x = 1/(N + 1/2) and that the lower right element of LN is now 1
rather than 2; thus the matrix is no longer Toeplitz. As the matrix is still symmetric,
we have kLN−1 k2 = 1/λ1 [LN ] < C. While one has to reconsider the invertibility of
the matrix, and whether the FDM is convergent, this only requires that we study the
Sturm–Liouville eigenvalue problem −u00 = λ u with boundary conditions u(0) =
u0 (1) = 0, showing that its smallest eigenvalue is positive. Finally, to complete the
approach, we need a numerical approximation to u(1), e.g. for plotting purposes.


This, too, has to be 2nd order, and we make the symmetric approximation

u(1) ≈ (u_{N+1} + u_N)/2 = u_N + ∆x · u′_R/2.
This is computable as soon as the interior discrete solution, u = {u j }N1 , has been
computed.
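Assuming the construction above, the first approach can be sketched as follows in Python/NumPy (our own illustration, not from the text; the test problem u(x) = cos x, with f = cos x, u_L = 1 and u′_R = −sin 1, is ours):

```python
import numpy as np

def solve_mixed_bvp(N, f, uL, duR):
    """First approach for -u'' = f, u(0) = uL, u'(1) = duR.

    Staggered grid with dx = 1/(N + 1/2); the ghost value
    u_{N+1} = u_N + dx*duR is folded into the last equation."""
    dx = 1.0 / (N + 0.5)
    x = dx * np.arange(1, N + 1)              # interior points x_1..x_N
    LN = (np.diag(2.0 * np.ones(N))
          - np.diag(np.ones(N - 1), 1)
          - np.diag(np.ones(N - 1), -1))
    LN[-1, -1] = 1.0                          # Neumann modification
    LN /= dx**2
    rhs = f(x)
    rhs[0] += uL / dx**2                      # Dirichlet data at x = 0
    rhs[-1] += duR / dx                       # Neumann data at x = 1
    return x, np.linalg.solve(LN, rhs)

# Illustrative test problem: u(x) = cos x, so f = -u'' = cos x,
# uL = cos 0 = 1 and u'(1) = -sin 1.
for N in (50, 100, 200):
    x, u = solve_mixed_bvp(N, np.cos, 1.0, -np.sin(1.0))
    dx = 1.0 / (N + 0.5)
    err = np.sqrt(np.sum((u - np.cos(x))**2) * dx)
    print(N, err)      # error divides by ~4 as dx is halved
```

Since both the interior scheme and the boundary treatment are second order, the discrete L² error is reduced by roughly a factor of four each time ∆x is halved.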
In the second approach, we construct a grid such that

x j = j · ∆ x,

with ∆ x = 1/N. This means that x0 = 0 and xN = 1, although the grid point spacing
is still ∆ x. We introduce an extra grid point outside the interval, xN+1 = 1 + ∆ x, and
approximate the boundary condition by
u′(1) = u′_R = (u_{N+1} − u_{N−1})/(2∆x) + O(∆x²).
This approximation is still 2nd order, again because xN−1 and xN+1 are symmetri-
cally located around the boundary point x = 1. Defining uN+1 by
(u_{N+1} − u_{N−1})/(2∆x) = u′_R,
we have uN+1 = uN−1 + 2∆ x · u0R . When this approximation is inserted into the
generic FDM discretization, we obtain
(2u₁ − u₂)/∆x² = f(x₁) + u_L/∆x²
(−u_{j−1} + 2u_j − u_{j+1})/∆x² = f(x_j),    j = 2 : N−1
(−2u_{N−1} + 2u_N)/∆x² = f(x_N) + 2u′_R/∆x.
In matrix-vector form, this system reads
      
\frac{1}{\Delta x^2}
\begin{pmatrix}
2 & -1 & 0 & \cdots & 0 \\
-1 & 2 & -1 & \cdots & 0 \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
0 & \cdots & 0 & -2 & 2
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix}
+ \frac{1}{\Delta x^2}
\begin{pmatrix} u_L \\ 0 \\ \vdots \\ 0 \\ 2\Delta x \cdot u'_R \end{pmatrix},    (13.33)

with ∆ x = 1/N. As in the previous approach, the matrix is not Toeplitz, and no
longer symmetric. Again invertibility has to be reconsidered in order to prove
convergence. In this case, too, stability can be established. Comparing (13.32) to
(13.33), the systems look very similar. The only difference is in the last row of the
matrices, which have different elements, and in the location of the grid points, corre-
sponding to ∆x = 1/(N + 1/2) in the first case and ∆x = 1/N in the latter. Finally, we
note that in this approach, the discretization generates the internal approximations
u1 , . . . , uN−1 as well as the numerical solution uN on the boundary. This explains
why this N × N system corresponds to a discretization with ∆ x = 1/N.
In the third approach, we work with the standard grid x j = j · ∆ x and mesh
width ∆ x = 1/(N + 1). Again, x0 = 0 and xN+1 = 1, but we do not employ any extra
grid points outside the interval. Instead, we approximate the boundary condition
u0 (1) = u0R to 2nd order accuracy using the BDF2 method (see methods for initial
value problems). This means that we use a “one-sided” difference approximation
(3/2)u_{N+1} − 2u_N + (1/2)u_{N−1} = ∆x · u′_R.
Solving for uN+1 (which approximates the solution u(1) at the boundary), we get

u_{N+1} = (4/3)u_N − (1/3)u_{N−1} + (2/3)∆x · u′_R.    (13.34)
As before, this is inserted into the standard discretization, again affecting only the
last row (the N th equation) of the system, which now reads
(−2u_{N−1} + 2u_N)/(3∆x²) = f(x_N) + (2/(3∆x)) u′_R.    (13.35)
The matrix is neither Toeplitz nor symmetric, calling for a separate analysis of the
invertibility to prove stability and convergence. Once the interior approximations
u = {u j }N1 have been computed, the solution at the right boundary is approximated
by (13.34).
The three approaches above work on different grids, with ∆ x = 1/N, ∆ x =
1/(N + 1/2), and ∆ x = 1/(N + 1). Apart from the mesh width factor, the main
change in the system is found in the last row of the matrix. Without affecting sta-
bility, the last equation can be rescaled (by multiplying the last equation by a suit-
able factor) so that the last row is the same as in (13.32). Thus all approaches will
be stable if the symmetric system (13.32) is stable. For the latter system, we will
demonstrate later that λ1 [LN ] ≈ π 2 /4, which guarantees that ellipticity is preserved.
For Robin conditions, we employ the same techniques in any suitable combina-
tion. For example, if the boundary condition is uR + αu0R = β and we use the second
approach above, with mesh width ∆ x = 1/N, we define
u_N + α (u_{N+1} − u_{N−1})/(2∆x) = β,
and solve for uN+1 = uN−1 + 2∆ x(β − uN )/α, eliminating this variable from the last
equation. Again, we see that this will affect the last row of the matrix, as well as the
right hand side of the system. If the non-Dirichlet condition is located at x = 0, the
procedures are completely analogous.

Now, let us return to a convergence proof in the case where we have a Neumann
condition. As noted above, it is sufficient to consider the first approach, where the matrix
LN is defined in (13.32). That is a symmetric matrix (though no longer Toeplitz), and it is
therefore sufficient to consider its smallest eigenvalue, since ‖L_N^{−1}‖₂ = 1/λ₁[LN], provided that
we can show that this eigenvalue is strictly positive for all N.
A prerequisite is that the Sturm–Liouville eigenvalue problem −u″ = λu with
boundary conditions u(0) = 0 and u′(1) = 0 is elliptic. As trigonometric functions
are eigenfunctions of the second derivative, we note that

u_k(x) = sin((2k − 1)πx/2),    k ∈ Z⁺
are eigenfunctions satisfying both boundary conditions, with eigenvalues

λ_k = (2k − 1)² π²/4.
4
Thus the eigenvalue sequence is π 2 /4, 9π 2 /4, . . . . Since we are using a 2nd order
consistent discretization, we expect the matrix LN as defined in (13.32) to have sim-
ilar eigenvalues, and, in particular, that the smallest eigenvalue is obtained for k = 1,
with λ1 [LN ] ≈ π 2 /4. In order to demonstrate that this is in fact the case, we consider
the FDM discretization
(−u_{j−1} + 2u_j − u_{j+1})/∆x² = λ u_j,
where ∆ x = 1/(N + 1/2) and uN+1 = uN , as described in the first approach above.
Simplifying, we can rewrite this difference equation as

u j−1 + u j+1 = (2 − ∆ x2 λ )u j =: µ u j . (13.36)

Thus we may consider the difference equation u j−1 + u j+1 = µu j , the characteristic
equation of which is
z2 − µz + 1 = 0.
Then the general solution is u j = Az j + Bz− j , since the product of the two roots is 1.
Inserting u0 = 0, we find A + B = 0. Thus the general solution is u j = A(z j − z− j ).
Upon inserting the second boundary condition, uN+1 = uN , we get
z^N (z − 1) = z^{−N} (1/z − 1),
from which it follows that z^{2N+1} = −1 = e^{−iπ} e^{i2kπ}. Thus z_k = e^{i(2k−1)π/(2N+1)}, and

μ_k = z_k + 1/z_k = 2 cos((2k − 1)π/(2N + 1)),    k = 1 : N.
By (13.36) it follows that
λ_k[L_N] = (2 − μ_k)/∆x² = (N + 1/2)² (2 − 2 cos((2k − 1)π/(2N + 1))),    k = 1 : N.

Thus, expanding λ_k[L_N] in a Taylor series, we find, for the first few eigenvalues,

λ_k[L_N] ≈ (2k − 1)² π²/4.    (13.37)
This is in perfect agreement with the eigenvalues of the differential operator, and
since LN is symmetric with λ₁[L_N] ≈ π²/4, LN is positive definite (reflecting
the fact that the differential operator is elliptic also for Neumann conditions),
and

‖L_N^{−1}‖₂ = 1/λ₁[L_N] ≈ 4/π²
as N → ∞. This is the stability constant of the discrete problem, and it proves that
the standard 2nd order FDM for the 1D Poisson equation is stable, and therefore
convergent, also for Neumann conditions. Thus, once more, we have applied the
Lax Principle to prove that an FDM is convergent, and once more it has turned out
to depend on the ellipticity of the differential operator.
In addition, as has been pointed out above, the other approaches to implementing
the Neumann condition will yield a modified matrix which is directly turned into the
matrix treated above, by rescaling the last equation of the FDM. Thus convergence
is completely determined by the matrix LN, no matter how we choose to represent
the Neumann condition, provided that it is done so as to maintain the 2nd order
consistency of the FDM.
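The closed-form eigenvalues just derived can be checked against a direct numerical eigenvalue computation. The following Python/NumPy sketch (ours, not from the text) assembles the Neumann-modified matrix of the first approach and compares:

```python
import numpy as np

N = 100
dx = 1.0 / (N + 0.5)
LN = (np.diag(2.0 * np.ones(N))
      - np.diag(np.ones(N - 1), 1)
      - np.diag(np.ones(N - 1), -1))
LN[-1, -1] = 1.0                      # Neumann modification at x = 1
LN /= dx**2

k = np.arange(1, N + 1)
# Closed-form eigenvalues derived above (exact for this matrix)
lam_formula = (N + 0.5)**2 * (2.0 - 2.0 * np.cos((2 * k - 1) * np.pi / (2 * N + 1)))
lam_numeric = np.linalg.eigvalsh(LN)  # ascending order

print(np.max(np.abs(lam_formula - lam_numeric)))  # agreement to rounding
print(lam_numeric[0], np.pi**2 / 4)               # lambda_1 -> pi^2/4
```

The smallest eigenvalue tends to π²/4, confirming the stability constant ‖L_N^{−1}‖₂ ≈ 4/π².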

13.5 FDM for general 2pBVPs

The principles for more general problems than the 1D Poisson equation with Dirich-
let conditions follow similar lines. Here we shall examine general self-adjoint oper-
ators as well as problems involving first order derivatives, and nonlinear problems.
Let us begin by considering the linear problem L u = f , with

L u = −(pu0 )0 + qu, (13.38)

where p(x) > 0 and q(x) ≥ 0. This is a self-adjoint elliptic problem – multiplying
by u and integrating by parts, using homogeneous Dirichlet conditions on Ω = [0, 1],
one easily shows that

m₂[L] ≥ min_x p(x) · π² + min_x q(x) > 0.

We want to construct a discretization having similar properties. This means that we
need to construct a system LN u = f + b where LN is symmetric and positive definite.

When discretizing this operator, we start with the expression pu0 . On an interval
[xi , xi+1 ], we make the approximation

pi+1/2 · (ui+1 − ui )
(pu0 )(x1+1/2 ) = + O(∆ x2 ),
∆x
where pi+1/2 = p((xi+1 + xi )/2) is the value of the function p at the interval mid-
point. Likewise, we have

(pu′)(x_{i−1/2}) = p_{i−1/2} (u_i − u_{i−1})/∆x + O(∆x²).
We can next approximate −(pu0 )0 by a symmetric difference quotient. This can be
obtained as
−(pu′)′ = −[(pu′)(x_{i+1/2}) − (pu′)(x_{i−1/2})]/(x_{i+1/2} − x_{i−1/2}) + O(∆x²)
        = −[p_{i+1/2}(u_{i+1} − u_i) − p_{i−1/2}(u_i − u_{i−1})]/∆x² + O(∆x²)
        = [−p_{i−1/2}u_{i−1} + (p_{i−1/2} + p_{i+1/2})u_i − p_{i+1/2}u_{i+1}]/∆x² + O(∆x²).
Consequently, we obtain a second order consistent approximation to the self-adjoint
operator L u = −(pu0 )0 + qu through

−(pu′)′ + qu = [−p_{i−1/2}u_{i−1} + (p_{i−1/2} + p_{i+1/2})u_i − p_{i+1/2}u_{i+1}]/∆x² + q_i u_i + O(∆x²).
We note that if p(x) ≡ 1 and q(x) ≡ 0, then

−(pu′)′ = −u″ = (−u_{i−1} + 2u_i − u_{i+1})/∆x² + O(∆x²).
For the second order discretization of the operator L u = −(pu′)′ + qu, we obtain a
matrix whose central rows are

L_N = \frac{1}{\Delta x^2}
\begin{pmatrix}
 & \ddots & & \\
 & & -p_{i-1/2} & \\
\cdots & -p_{i-1/2} & \; p_{i-1/2} + p_{i+1/2} + q_i \Delta x^2 \; & -p_{i+1/2} & \cdots \\
 & & -p_{i+1/2} & \\
 & & & \ddots
\end{pmatrix}.    (13.39)

Seeing that the matrix elements above and to the left of the diagonal elements are
the same, as well as that the elements below and to the right of the center element
are the same, we conclude that the matrix is symmetric, i.e., LN∗ = LN . Thus the
discretization above preserves the self-adjoint symmetry of L .
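The discretization is straightforward to assemble. A Python/NumPy sketch (our illustration, with arbitrarily chosen coefficients p(x) = 1 + x² and q(x) = eˣ) verifies that the resulting matrix is symmetric and positive definite:

```python
import numpy as np

def self_adjoint_matrix(p, q, N):
    """2nd order FDM matrix for L u = -(p u')' + q u on (0,1),
    homogeneous Dirichlet BCs, using midpoint samples of p."""
    dx = 1.0 / (N + 1)
    x = dx * np.arange(1, N + 1)
    pm = p(x - dx / 2)               # p_{i-1/2}
    pp = p(x + dx / 2)               # p_{i+1/2}
    LN = np.diag(pm + pp + q(x) * dx**2)
    LN -= np.diag(pp[:-1], 1)        # -p_{i+1/2} above the diagonal
    LN -= np.diag(pm[1:], -1)        # -p_{i-1/2} below the diagonal
    return LN / dx**2

# Arbitrary coefficients with p > 0 and q >= 0 (our choice)
LN = self_adjoint_matrix(lambda x: 1 + x**2, lambda x: np.exp(x), 50)
print(np.allclose(LN, LN.T))          # symmetric, like the operator
print(np.linalg.eigvalsh(LN)[0] > 0)  # positive definite (ellipticity)
```

Symmetry is exact, because the entry above the diagonal in row i and the entry below the diagonal in row i + 1 both sample p at the same midpoint.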

Let us now consider a general linear 2pBVP,

−u00 + au0 + bu = f , (13.40)

with Dirichlet boundary conditions u(0) = uL and u(1) = uR. We have already dealt
with the discretization of u00 , so here it remains to discretize u0 . We introduce a
standard grid x j = j · ∆ x on Ω = [0, 1], with

∆x = 1/(N + 1).

The discretization of −u″ is given by (N + 1)² · Tu. To maintain second order con-
sistency, we use a symmetric discretization of u′, and represent it by

u′ ≈ (u_{j+1} − u_{j−1})/(2∆x).
In matrix-vector form, this becomes ((N + 1)/2) · Su, where S is a skew-symmetric
Toeplitz matrix, with kernel

S = ] −1 0 1 [.    (13.41)

Skew-symmetry means that S∗ = −S. For simplicity, we introduce the notation

D20 = (N + 1)² · T,    D10 = ((N + 1)/2) · S,

where the superscript refers to the order of the derivative, not to a power of the
matrix. In this notation, a discretization of the linear problem (13.40) becomes a
linear system of equations,

(D20 + aD10 + bI)u = f + b. (13.42)

Thus, the differential operators are linear operators, and upon discretization, they are
represented by finite difference matrices, since linear operators on finite dimensional
spaces can be represented by matrices. We may now introduce the notation

LN = D20 + aD10 + bI.

The solvability of the system, and the convergence of the discretization, depends on
the boundedness of LN−1 . We note that

m2 [LN ] = m2 [D20 + aD10 + bI] ≥ m2 [D20 ] + m2 [aD10 ] + b ≈ π 2 + b,

since skew-symmetry implies that m2 [aD10 ] = 0 for every a. Thus the presence of
the first derivative does not affect solvability and the stability of the discretization.
The problem is not symmetric (although the matrix LN is Toeplitz), but it is elliptic,
as long as π 2 + b > 0. It follows that the problem is solvable, and that the FDM is
stable and convergent if
b ≳ −π².
Again, we see that if the problem is elliptic, the FDM discretization is convergent
as long as we choose the mesh width ∆ x small enough to keep LN positive definite.
As for the eigenvalues of LN, the matrix is Toeplitz, and therefore Theorem
13.1 applies. We note that the kernel of LN is
L_N = ] −1/∆x² − a/(2∆x)    2/∆x² + b    −1/∆x² + a/(2∆x) [,    (13.43)
and it follows that the eigenvalues are
λ_k[L_N] = b + 4(N + 1)² √(1 − (a∆x)²/4) sin²(kπ/(2(N + 1))).    (13.44)

Here we have chosen to represent the eigenvalues both in terms of ∆ x and N, in spite
of the relation ∆ x = 1/(N + 1). We note that the parameter b only shifts the eigen-
values to the right in the complex plane, increasing the ellipticity. The parameter a,
however, which does not affect ellipticity at all, has a different influence. Thus, as
long as |a∆ x| < 2 the eigenvalues are real, but if |a∆ x| > 2 they become complex.
We shall return to this problem later, to see that the continuous operator only has
real eigenvalues, and that we therefore want to restrict ∆ x so that |a∆ x| < 2. This is
an unexpected condition; it is not a stability condition, and does not matter to con-
vergence, but it does matter to grid quality. In practical computations, we want to
use the largest mesh width that produces acceptable results, and keeping |a∆ x| < 2
will be seen to be necessary to resolve the solution on the grid. The quantity |a∆ x|
is usually referred to as the mesh Péclet number.
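The transition at mesh Péclet number |a∆x| = 2 is easy to observe numerically. In the following Python/NumPy sketch (ours, not from the text), the eigenvalues of the discretization matrix are real for |a∆x| = 1 but complex for |a∆x| = 3:

```python
import numpy as np

def advection_diffusion_matrix(a, b, N):
    """Central difference FDM matrix for -u'' + a u' + b u,
    Dirichlet BCs, dx = 1/(N+1); tridiagonal kernel as in (13.43)."""
    dx = 1.0 / (N + 1)
    lower = -1.0 / dx**2 - a / (2 * dx)
    diag = 2.0 / dx**2 + b
    upper = -1.0 / dx**2 + a / (2 * dx)
    return (np.diag(diag * np.ones(N))
            + np.diag(upper * np.ones(N - 1), 1)
            + np.diag(lower * np.ones(N - 1), -1))

N = 20
dx = 1.0 / (N + 1)
for Pe in (1.0, 3.0):                     # mesh Peclet numbers |a dx|
    lam = np.linalg.eigvals(advection_diffusion_matrix(Pe / dx, 0.0, N))
    print(Pe, np.max(np.abs(lam.imag)))   # ~0 for Pe < 2, large for Pe > 2
```

The product of the off-diagonal entries changes sign exactly at |a∆x| = 2, which is where the spectrum leaves the real axis.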
As the last example of general 2pBVPs, we shall consider nonlinear problems.
By and large, what is meant by a nonlinear problem is an operator where the highest
order derivative is linear, and lower order terms are nonlinear. In a second order
differential equation, this means that the structure of the equation is

u00 = f (x, u, u0 )

together with suitable boundary conditions. The FDM discretization proceeds with
the same techniques as before, applied to each occurring derivative. Stability con-
ditions will depend on the properties of f . For simplicity, let us consider a specific
case,
u00 − uu0 = 0. (13.45)
This equation will recur in connection with hyperbolic partial differential equations.
It has an alternative formulation,
u″ − (u²/2)′ = 0.    (13.46)

Using symmetric difference approximations, (13.45) yields


(u_{j−1} − 2u_j + u_{j+1})/∆x² − u_j (u_{j+1} − u_{j−1})/(2∆x) = 0.    (13.47)
In (13.46) we instead obtain
(u_{j−1} − 2u_j + u_{j+1})/∆x² − (u²_{j+1} − u²_{j−1})/(4∆x) = 0.    (13.48)
Due to maintaining symmetry, both discretizations are second order consistent. Sta-
bility is now considerably more difficult, as is proving convergence. The two dis-
cretizations are not the same. In fact, there is a variant of (13.47) which is still 2nd
order consistent, where we replace one factor to get
(u_{j−1} − 2u_j + u_{j+1})/∆x² − ((u_{j+1} + u_{j−1})/2) · ((u_{j+1} − u_{j−1})/(2∆x)) = 0.    (13.49)
This is immediately seen to be identical to (13.48). As already mentioned, we shall
return to these discretizations in connection with partial differential equations.
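That (13.49) coincides with (13.48) can be confirmed by evaluating both expressions on arbitrary values (a small Python/NumPy check of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
um, u0, up = rng.standard_normal(3)   # arbitrary u_{j-1}, u_j, u_{j+1}
dx = 0.1

# (13.48): divided difference of u^2/2
lhs_1348 = (um - 2 * u0 + up) / dx**2 - (up**2 - um**2) / (4 * dx)
# (13.49): average of neighbors times central difference
lhs_1349 = (um - 2 * u0 + up) / dx**2 - (up + um) / 2 * (up - um) / (2 * dx)

print(abs(lhs_1348 - lhs_1349))       # zero up to rounding
```

The identity is simply (u_{j+1} + u_{j−1})(u_{j+1} − u_{j−1}) = u²_{j+1} − u²_{j−1}.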

13.6 Higher order methods. Cowell’s difference correction

So far, we have only considered second order discretizations. Such methods are
easily constructed by merely using symmetric difference approximations to deriva-
tives, and, in case it is needed, the boundary conditions. Let us now consider the
possibility of constructing higher order FDM techniques for 2pBVPs. Again, we
limit ourselves to the Poisson equation, demonstrating the idea behind higher order
methods.
In the Poisson equation −u00 = f we have approximated the second order deriva-
tive by the finite difference quotient

(−u(x_{j−1}) + 2u(x_j) − u(x_{j+1}))/∆x² = −u″(x_j) − (∆x²/12) u⁽⁴⁾(x_j) + O(∆x⁴).
Thus the asymptotically dominant term in the local residual is

−(∆x²/12) u⁽⁴⁾(x_j) = (∆x²/12) f″(x_j),
since −u00 = f . The dominating residual term can therefore be eliminated by the
approximation

(∆x²/12) f″(x_j) = (f(x_{j−1}) − 2 f(x_j) + f(x_{j+1}))/12 + O(∆x⁴).
With this estimate, we will consider the discretization
(−u_{j−1} + 2u_j − u_{j+1})/∆x² = f(x_j) + (f(x_{j−1}) − 2 f(x_j) + f(x_{j+1}))/12,
which has order of consistency p = 4, provided that u and f are sufficiently differen-
tiable. Simplifying, we obtain Cowell’s method, where the dominant O(∆x²) resid-
ual in the standard 2nd order discretization has been eliminated by a difference
correction. Cowell’s method is
(−u_{j−1} + 2u_j − u_{j+1})/∆x² = (f(x_{j−1}) + 10 f(x_j) + f(x_{j+1}))/12.    (13.50)

Since the method’s consistency order is p = 4, we need to verify its convergence


order. By the Lax Principle, this will only require that we establish stability. This is,
however, a next to trivial issue. Thus, writing the Cowell method in matrix–vector
form, we have
LN u = MN f + b, (13.51)
where LN is exactly the same matrix as in the standard 2nd order FDM, and where
the vector b accounts for the Dirichlet boundary conditions in exactly the same way
as before. The difference correction is carried out by the Toeplitz matrix MN , which
is an averaging operator, whose convolution kernel is
M_N = (1/12) ] 1 10 1 [.    (13.52)
We note that the row sum of elements in the matrix MN is one. The two function
values surrounding f (x j ) will modify the right-hand side, so as to annihilate the
O(∆ x2 ) term in the residual, leaving only fourth and higher order terms. The only
thing to keep in mind is that in order to apply the difference correction, we cannot
make do with only the interior points { f j }N1 , but we also have to include the samples
of f (x) on the boundary. Thus we have to consider a convolution of f -values from
f (x0 ) to f (xN+1 ).
To clarify this detail, we give the linear system, specifying the first and last equa-
tions, and the generic equation. This system is, for inhomogeneous Dirichlet condi-
tions u(0) = uL and u(1) = uR ,

(2u₁ − u₂)/∆x² = (f₀ + 10 f₁ + f₂)/12 + u_L/∆x²
(−u_{j−1} + 2u_j − u_{j+1})/∆x² = (f_{j−1} + 10 f_j + f_{j+1})/12,    j = 2 : N−1
(−u_{N−1} + 2u_N)/∆x² = (f_{N−1} + 10 f_N + f_{N+1})/12 + u_R/∆x².

Now, in order to verify stability, we note that in (13.51) the system matrix is
LN , and that stability only depends on the inverse LN−1 being uniformly bounded as
N → ∞. Since the matrix LN is identical to that of the standard 2nd order FDM, and
we have already demonstrated that kLN−1 k2 ≈ 1/π 2 , Cowell’s method, too, is stable.
Since the method has consistency order p = 4, it now follows from the Lax Prin-
ciple that Cowell’s method is convergent of order p = 4. Naturally, this requires
the data function f to be sufficiently smooth; the convergence order of any method
is always the nominal order obtained when the problem data are smooth enough.

Example A simple MATLAB code implementing the standard second order FDM, as well
as Cowell’s 4th order method is given below. In Figure 13.6 this code is tested for a simple
problem constructed from an exact solution,

u(x) = −2 + x + e−x + sin(5x + 1/3),

so that the right hand side is

f (x) = 25 sin(5x + 1/3) − e−x .

Thus we are able to compute the error and verify the convergence orders. This simple
demonstration shows the very significant accuracy improvement achieved by the 4th order
method. Already at ∆x = 10⁻², the 4th order method produces a solution with an error four
orders of magnitude smaller for this particular problem. It is to be noted that the computational
effort is similar for the 2nd and 4th order methods. The only additional work needed in
Cowell’s method is the convolution of the right hand side f to obtain the difference correction.
Because the kernel corresponds to a tridiagonal Toeplitz matrix, the additional work is
marginal, but delivers a vast gain in accuracy.

function bvp24(N)
% Written (c) 2017-11-17 for 1D Poisson problem -u'' = f.
% Test of standard 2nd order FDM + 4th order Cowell method

L = 1;
dx = L/(N+1); % Mesh width
xx = linspace(0,L,N+2)'; % Grid on [0,L]

% Define RHS (any data function)


ff = 1 + xx - exp(-xx) + sin(5*xx + 1/3);

% Define Dirichlet boundary conditions


uL = 0;
uR = 1;

% Assemble matrix
a = zeros(N,1);
a(1) = 2;
a(2) = -1;
LN = toeplitz(a)/dx^2; % System matrix

% Construct difference correction for 4th order method


ker = [1 10 1]/12; % Convolution kernel
f4 = conv(ff,ker,'same'); % RHS for Cowell's method

% Reduce right hand sides to interior points


f4 = f4(2:end-1); % 4th order Cowell

Fig. 13.6 Comparison of 2nd and 4th order methods. The 1D Poisson equation −u00 = f with
Dirichlet conditions is solved using a standard 2nd order FDM and the 4th order Cowell method for
N = 49, 99, 199, 399 and 799. Left panel shows the absolute value of the global error vs. x ∈ [0, 1]
for the 2nd order method (blue) as well as for Cowell’s 4th order method (red). The sharp dip near
x = 0.65 corresponds to a sign change in the global error. Right panel shows the discrete L2 norm
of the error vs. ∆ x = 1/(N + 1) for the 2nd order method (blue, top) and the 4th order Cowell
method (red, bottom). The superior accuracy of the higher order method is visible in both panels

f2 = ff(2:end-1); % Standard 2nd order FDM

% Select method (f=f2 or f=f4). Include boundary values


f = f4;
f(1) = f(1) + uL/dx^2;
f(end) = f(end) + uR/dx^2;

% Solve problem and append boundary conditions


u = LN\f;
uu = [uL; u; uR];

% Plot solution
plot(xx,uu,'b')
xlabel('x')
grid on

Later on, in connection with the Laplace and Poisson equations in 2D (or higher),
we shall see that standard 2nd order FDMs can be enhanced to produce 4th order
convergent results by applying a similar difference correction to the one employed
above. Because the system to be solved has the same matrix as the second order
method, stability is unaffected and the extra work is marginal. The gain in accuracy,
however, going from second to fourth order convergence, is very substantial. The
only drawback is the difficulty to apply this technique on nonuniform grids.
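The convergence orders can also be verified in Python/NumPy; the following sketch (our own translation of the experiment, not the book's MATLAB code) uses the test problem stated above and measures the observed order when ∆x is halved:

```python
import numpy as np

def solve_poisson(N, f, uL, uR, cowell=False):
    """-u'' = f on (0,1), Dirichlet BCs, dx = 1/(N+1);
    optionally apply Cowell's difference correction to the RHS."""
    dx = 1.0 / (N + 1)
    x = dx * np.arange(N + 2)                 # grid including boundaries
    LN = (np.diag(2.0 * np.ones(N))
          - np.diag(np.ones(N - 1), 1)
          - np.diag(np.ones(N - 1), -1)) / dx**2
    fx = f(x)
    if cowell:
        # kernel ]1 10 1[/12; needs boundary samples f(x_0), f(x_{N+1})
        rhs = (fx[:-2] + 10 * fx[1:-1] + fx[2:]) / 12
    else:
        rhs = fx[1:-1].copy()
    rhs[0] += uL / dx**2
    rhs[-1] += uR / dx**2
    return x[1:-1], np.linalg.solve(LN, rhs)

# Test problem from the text: u(x) = -2 + x + exp(-x) + sin(5x + 1/3)
def u_exact(x):
    return -2 + x + np.exp(-x) + np.sin(5 * x + 1.0 / 3)

def f_rhs(x):
    return 25 * np.sin(5 * x + 1.0 / 3) - np.exp(-x)

for cowell in (False, True):
    errs = []
    for N in (49, 99):
        x, u = solve_poisson(N, f_rhs, u_exact(0.0), u_exact(1.0), cowell)
        errs.append(np.sqrt(np.sum((u - u_exact(x))**2) / (N + 1)))
    print('observed order:', np.log2(errs[0] / errs[1]))
```

The observed orders come out close to 2 and 4, respectively, while both variants solve a system with the same matrix L_N.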

13.7 Higher order problems. The biharmonic equation

The boundary value problems discussed so far have involved derivatives of order
two. Higher order problems refer to differential equations involving third order
derivatives or higher. A typical application is in elasticity theory, and in particular
the biharmonic equation, which is used in plate theory. Interestingly, both the for-
mulation of the problem, and the solution techniques are influenced by the boundary
conditions.
Consider the beam equation, modeling a simply supported elastic beam. It con-
sists of two second order problems,

M 00 = f
u00 = M/(EI).

Here f is an external load density on [0, 1], and M is the resulting bending moment
on the beam. The boundary conditions for the first equation are M(0) = M(1) = 0
(implying that the supported endpoints do not sustain any bending moment). In the
second equation, u is the deflection of the beam’s center line under the bending
moment M. The boundary conditions are u(0) = u(1) = 0 (reflecting firm supports
that are level). Finally E is a material constant, the modulus of elasticity, and I is a shape
parameter, the cross-sectional moment of inertia, which is determined by the beam’s
physical size and cross-section geometry.
Thus the beam equation consists of two coupled 1D Poisson problems with
Dirichlet conditions, often referred to as articulated conditions. (A Neumann con-
dition is referred to as a free boundary condition.) The sequence of two second order
problems is mathematically (but not numerically) equivalent to a single fourth order
equation. Let us for simplicity take EI = 1 to focus on the mathematical form of the
problem. Eliminating M, we obtain the fourth order problem

u'''' = f    (13.53)

together with the articulated boundary conditions,

u''(0) = u''(1) = 0
u(0) = u(1) = 0.

Due to the boundary conditions, this problem can be decomposed into the two equa-
tions we started with. However, if the beam ends are clamped instead of articulated,

the boundary conditions are

u'(0) = u'(1) = 0
u(0) = u(1) = 0.

Due to the boundary conditions now only involving u and u', the problem can no
longer be decomposed into two second order problems. It is a genuine fourth order
problem. This is the biharmonic equation.
In the beam equation with articulated supports, it is important to take advantage
of its structure of being equivalent to two second order problems. The term “harmonic”
comes from the fact that the eigenfunctions of d²/dx² are trigonometric
functions, leading to Fourier analysis or harmonic analysis. In higher dimensions
the same holds for the Laplacian operator,

∆ = ∂²/∂x² + ∂²/∂y²,

sometimes referred to as the harmonic operator. The biharmonic operator is simply

∆² = ∂⁴/∂x⁴ + 2 ∂⁴/∂x²∂y² + ∂⁴/∂y⁴,

which, in our 1D context above, simplifies to d⁴/dx⁴.


Using a simple second order FDM on an equidistant grid for the biharmonic and
beam equations requires that we approximate d⁴/dx⁴ on the usual grid ΩN ⊂ [0, 1],
with points x_j = j∆x and ∆x = 1/(N + 1), as follows:

u'''' ≈ (1/∆x²) · [ −(−u_{j−2} + 2u_{j−1} − u_j)/∆x² + 2(−u_{j−1} + 2u_j − u_{j+1})/∆x² − (−u_j + 2u_{j+1} − u_{j+2})/∆x² ]

      = (u_{j−2} − 4u_{j−1} + 6u_j − 4u_{j+1} + u_{j+2}) / ∆x⁴.
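A quick sanity check of the stencil (my own, in Python): for the quartic u(x) = x⁴ the five-point formula reproduces u'''' = 24 exactly up to roundoff, since its leading error term involves the sixth derivative.

```python
def fourth_diff(u, j, dx):
    # Five-point approximation of u'''' at node j
    return (u[j-2] - 4*u[j-1] + 6*u[j] - 4*u[j+1] + u[j+2]) / dx**4

dx = 0.1
grid = [j * dx for j in range(11)]
u = [x**4 for x in grid]          # u = x^4, so u'''' = 24 everywhere
approx = fourth_diff(u, 5, dx)    # evaluated at x = 0.5
```

Here approx agrees with 24 to roundoff; for a general smooth u the error is O(∆x²), consistent with a second order method.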
The articulated boundary conditions u(0) = u(1) = 0 correspond to u0 = uN+1 = 0.
The clamping conditions are represented by

u'(0) ≈ (−u_{−1} + u_1) / (2∆x)
u'(1) ≈ (−u_N + u_{N+2}) / (2∆x).

Given that u'(0) = u'(1) = 0, the clamped conditions are

u_{−1} = u_1
u_{N+2} = u_N,

allowing the elimination of the exterior variables and modifying the matrix elements
accordingly.
If the clamped conditions are replaced by the articulated moment boundary conditions
used in the beam equation, the boundary conditions are represented by

u''(0) ≈ (u_{−1} − 2u_0 + u_1) / ∆x²
u''(1) ≈ (u_N − 2u_{N+1} + u_{N+2}) / ∆x².

Given that u''(0) = u''(1) = 0 and u_0 = u_{N+1} = 0, we use the articulated conditions

u_{−1} = −u_1
u_{N+2} = −u_N,

which would be used as an alternative to the clamped conditions.


Collecting the information, we construct a system PN(β)u = f, where the N × N
pentadiagonal matrix PN(β) is given by (here exemplified for N = 5)

                     ⎡  β −4  1  0  0 ⎤
                     ⎢ −4  6 −4  1  0 ⎥
PN(β) = (N + 1)⁴ ·   ⎢  1 −4  6 −4  1 ⎥ ,
                     ⎢  0  1 −4  6 −4 ⎥
                     ⎣  0  0  1 −4  β ⎦

where the boundary conditions only affect the top left and bottom right elements.
For the biharmonic equation (clamped conditions), β = 7, while for the simply
supported beam (articulated conditions), β = 5.
As noted above the beam problem can be factorized into two Poisson problems,

M'' = f
u'' = M

with M(0) = M(1) = 0 and u(0) = u(1) = 0. The standard second order discretiza-
tion is then

LN M = f
LN u = M,

where LN is described by the kernel LN = (N + 1)² · ] 1 −2 1 [. It is easily verified
that PN(5) = LN², reflecting the splitting into two consecutive Poisson problems. It is
important to note that the simply supported beam problem should always be solved
as two second order problems. The reason is that the condition numbers differ. Thus

κ₂[LN] ≈ 4(N + 1)²/π²,

while

κ₂[PN(β)] = O((N + 1)⁴).
This means that the biharmonic equation is more sensitive to perturbations than
the second order problem. When solving the biharmonic, the matrix PN(7) must be
factorized numerically as is, while PN(5) = LN² is an exact factorization into two
factors LN, only requiring that LN be factorized numerically. In case one factorizes
PN(5) numerically, perturbation sensitivity is comparable in both problems.
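The identity PN(5) = LN² is easy to confirm numerically. The following Python sketch (my own; the book's codes are MATLAB) builds both matrices densely for N = 5 and compares them entrywise:

```python
def tridiag_LN(N, scale):
    # Dense LN with kernel scale * ] 1 -2 1 [
    L = [[0.0] * N for _ in range(N)]
    for i in range(N):
        L[i][i] = -2 * scale
        if i > 0:
            L[i][i-1] = scale
        if i < N - 1:
            L[i][i+1] = scale
    return L

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def penta_PN(N, beta, scale4):
    # Dense PN(beta): kernel scale4 * ] 1 -4 6 -4 1 [, corners replaced by beta
    P = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j, v in zip(range(i-2, i+3), (1, -4, 6, -4, 1)):
            if 0 <= j < N:
                P[i][j] = v * scale4
    P[0][0] = beta * scale4
    P[-1][-1] = beta * scale4
    return P

N = 5
scale = (N + 1)**2                 # LN = (N+1)^2 * ] 1 -2 1 [
LN2 = matmul(tridiag_LN(N, scale), tridiag_LN(N, scale))
P5 = penta_PN(N, 5, scale**2)      # PN(5) with factor (N+1)^4
```

Squaring LN produces exactly the corner value 5 = (−2)² + 1, which is why the articulated beam matrix, and not the clamped one, factorizes into LN·LN.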
We further note that the biharmonic equation is elliptic. The eigenvalue problem

u'''' = λu

has eigenfunctions

u_k(x) = sin kπx,

satisfying the articulated conditions u(0) = u(1) = u''(0) = u''(1) = 0, giving eigen-
values λ_k = k⁴π⁴. These are just the squares of the eigenvalues of −d²/dx².
With clamped conditions, the eigenfunctions of the genuine biharmonic equation
are more complicated. They can be written

u(x) = (cos α − cosh α)(sin αx − sinh αx) − (sin α − sinh α)(cos αx − cosh αx),

where α = λ^{1/4}. By construction, u(x) satisfies the three conditions u(0) = u(1) = 0
and u'(0) = 0, and the parameter α is determined by the last clamped condition,
u'(1) = 0. Since

u'(1) = α·(cos α − cosh α)² + α·(sin α − sinh α)(sin α + sinh α)
      = α·(cos²α − 2 cos α cosh α + cosh²α + sin²α − sinh²α)
      = 2α − 2α cos α cosh α,

it follows that α must satisfy the nonlinear equation

cos α = 1/cosh α.    (13.54)

There is an infinite suite of solutions α_k, for k ∈ Z⁺, as illustrated in Figure 13.7.
The smallest positive root, α₁ ≳ 3π/2, must be computed numerically, and as the
corresponding eigenvalue of the differential operator is λ₁ = α₁⁴, we get

α₁ ≈ 4.730 040 75,    λ₁ ≈ 500.564.
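The root can be computed, for instance, by bisection on g(α) = cos α · cosh α − 1 (a minimal Python sketch of my own; the bracket [π, 2π] is chosen because g is negative at the left end and positive at the right, with a single sign change at α₁):

```python
import math

def g(a):
    # g(a) = 0  <=>  cos a = 1/cosh a
    return math.cos(a) * math.cosh(a) - 1.0

lo, hi = math.pi, 2*math.pi     # g(lo) < 0 < g(hi)
for _ in range(60):             # bisection: interval halves each step
    mid = 0.5 * (lo + hi)
    if g(lo) * g(mid) <= 0:
        hi = mid
    else:
        lo = mid
alpha1 = 0.5 * (lo + hi)
lambda1 = alpha1**4
```

Sixty halvings of the bracket reduce its width far below double precision, so alpha1 and lambda1 reproduce the quoted values 4.730 040 75 and 500.564.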


Thus the biharmonic operator is selfadjoint and elliptic. Except for the first few
eigenvalues, which are larger, the eigenvalues of d⁴/dx⁴ on the interval [0, 1] with
clamped boundary conditions, u(0) = u(1) = 0 and u'(0) = u'(1) = 0, approach, for
k ∈ Z⁺,

λ_k ≈ (2k + 1)⁴π⁴ / 2⁴ ≈ k⁴π⁴.

This is easily verified by computing the eigenvalues of PN(7), see Figure 13.7. They
can also be compared to the eigenvalues of PN(5). The latter are smaller, corre-
sponding to the fact that the clamped beam is stiffer than the simply supported
beam. This is in particular noticeable in the first eigenvalue of the fourth order
differential operator with articulated conditions, for which λ₁ = π⁴ ≈ 97.409, more
than a factor of 5 smaller than the first eigenvalue of the operator with clamped
conditions.

Fig. 13.7 Eigenvalues of biharmonic operator. Top panel shows left and right hand sides of the
equation cos α = 1/cosh α. Roots are found where the graphs intersect, indicated by blue markers.
The first root α₁ occurs at α ≈ 3π/2, and yields λ₁ = α₁⁴ ≈ 500. Lower panel shows the first
30 eigenvalues of PN(7) (blue) for N = 999, approximating the first 30 eigenvalues of d⁴/dx⁴
with clamped conditions. These are compared to the corresponding eigenvalues of d⁴/dx⁴ with
articulated conditions (red). Asymptotically, for large k, the eigenvalues are almost the same
Example Consider the problem u'''' = −1 on Ω = [0, 1]. With articulated boundary condi-
tions u(0) = u(1) = u''(0) = u''(1) = 0, this equation represents a simply supported beam
under a uniform load density of f(x) = −1. The operator is elliptic with m₂[d⁴/dx⁴] = π⁴.
Since the load is constant, the analytic solution can be found through four integrations; it is

u(x) = (−x⁴ + 2x³ − x) / 24.

With clamped conditions u(0) = u(1) = u'(0) = u'(1) = 0 the problem instead represents the
biharmonic equation for a beam sustaining bending moments at the endpoints. The operator
is elliptic, but the lower logarithmic norm has now increased to m₂[d⁴/dx⁴] ≳ 81π⁴/16.
solution is
48 13 Finite difference methods
Beam and Biharmonic equations u"" = -1
0

-0.005

-0.01

-0.015
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
x

10 -2

10 -4

-6
10

10 -8

10 -10
10 -4 10 -3 10 -2 10 -1 10 0
1/(N+1)

Fig. 13.8 Beam and biharmonic equations. The problem u0000 = −1 is solved on a grid with N = 49
interior points (top panel), using clamped conditions (upper curve, red markers), and with articu-
lated conditions (lower curve, blue markers). The clamped beam is stiffer than the simply supported
beam, due to nonzero bending moments at the endpoints. Lower panel shows the discrete L2 norm
of the global error for the simply supported beam (top, blue) and the biharmonic problem (bottom,
red) vs. mesh width ∆ x = 1/(N + 1). Second order convergence is clearly visible. For ∆ x small,
the error starts increasing due to roundoff in the factorization of PN (7)

1
−x4 + 2x3 − x2 .

u(x) =
24
Using a symmetric 2nd order consistent FDM on a uniform grid ΩN ⊂ Ω with N interior
points will in both cases produce a local residual of the form C·∆x² u''''(x). However, the
global error will be different; it is bounded by

‖e‖_{L²(ΩN)} ≲ ‖r‖_{L²(ΩN)} / m₂[PN(β)],

where β = 7 for clamped conditions, and β = 5 for articulated conditions. Because the
logarithmic norm is determined by the smallest eigenvalue of d⁴/dx⁴, and the FDM ap-
proximates these eigenvalues to second order accuracy, we have m₂[PN(5)] ≈ π⁴ ≈ 100,
and m₂[PN(7)] ≈ 81π⁴/16 ≈ 500, so the error is expected to be smaller with clamped con-
ditions. In both cases, the FDM is second order convergent, by virtue of the global error
bound. Solving the problems, one obtains the solutions shown in Figure 13.8.
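Both closed-form solutions are easy to check in a few lines (a Python verification of my own, not from the text): each should satisfy u'''' = −1, which the five-point stencil reproduces exactly for quartic polynomials, together with its respective boundary conditions.

```python
def u_art(x):    # simply supported (articulated) beam
    return (-x**4 + 2*x**3 - x) / 24

def u_clamp(x):  # clamped beam (biharmonic problem)
    return (-x**4 + 2*x**3 - x**2) / 24

def d4(u, x, h=1e-2):
    # Five-point stencil for u''''; exact for quartics up to roundoff
    return (u(x-2*h) - 4*u(x-h) + 6*u(x) - 4*u(x+h) + u(x+2*h)) / h**4

def d1(u, x, h=1e-6):
    # Central difference for u'
    return (u(x+h) - u(x-h)) / (2*h)

def d2(u, x, h=1e-4):
    # Central difference for u''
    return (u(x-h) - 2*u(x) + u(x+h)) / h**2
```

With these, u_art satisfies u = u'' = 0 at both endpoints, u_clamp satisfies u = u' = 0 at both endpoints, and both have u'''' = −1.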

13.8 Singular problems and boundary layer problems

13.9 Nonuniform grids

In problems with singularities and boundary layers it is desirable to allocate the
grid points to the subintervals where there is significant change in the solution. In
boundary layer problems, this is at one of the boundaries, or possibly both. In FDM,
derivatives are still approximated by difference quotients, but the mesh width ∆x
varies.
For a nonuniform grid {x_j}₀^{N+1} we define the mesh width locally, in terms of
forward and backward differences, i.e.,

∆x_j = x_{j+1} − x_j
∇x_j = x_j − x_{j−1}.

To approximate derivatives by divided differences, we use the forward and back-
ward difference operators,

∆y_j = y_{j+1} − y_j
∇y_j = y_j − y_{j−1}.

Left and right divided differences are now defined as

D₋y_j = (y_j − y_{j−1}) / (x_j − x_{j−1}) = ∇y_j / ∇x_j
D₊y_j = (y_{j+1} − y_j) / (x_{j+1} − x_j) = ∆y_j / ∆x_j.

Using these difference quotients we approximate derivatives by the divided differ-
ences

y'(x_j) ≈ (∇x_j D₊y_j + ∆x_j D₋y_j) / (∆x_j + ∇x_j)

y''(x_j) ≈ 2 (D₊y_j − D₋y_j) / (∆x_j + ∇x_j).

This is 2nd order only on smooth grids, which means that the local mesh width
must change slowly, with ∆x_j/∇x_j = 1 + O(N⁻¹).
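As a small illustration (a Python sketch of my own; the grid and test function are arbitrary choices), both formulas are exact for the quadratic y = x² even on a nonuniform grid, since their error terms involve third and higher derivatives of y:

```python
# Nonuniform-grid difference quotients applied to y = x^2 (y' = 2x, y'' = 2)
x = [0.0, 0.1, 0.25, 0.45, 0.7, 1.0]   # deliberately nonuniform grid
y = [xj**2 for xj in x]

def derivs(j):
    dxf = x[j+1] - x[j]                 # forward mesh width,  Delta x_j
    dxb = x[j] - x[j-1]                 # backward mesh width, nabla x_j
    Dp = (y[j+1] - y[j]) / dxf          # right divided difference
    Dm = (y[j] - y[j-1]) / dxb          # left divided difference
    d1 = (dxb*Dp + dxf*Dm) / (dxf + dxb)
    d2 = 2 * (Dp - Dm) / (dxf + dxb)
    return d1, d2

d1, d2 = derivs(2)    # at x = 0.25: expect y' = 0.5 and y'' = 2
```

For general smooth functions the 2nd order accuracy of these quotients does require the smooth-grid condition above.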
Chapter 14
Finite element methods

While the finite difference method (FDM) takes a linear operator equation

L u = f    (14.1)

on Ω = [0, 1] and converts it into a linear algebraic equation

Lu = f + b

on a grid ΩN ⊂ Ω, where L is a matrix and u is a vector, the finite element method
(FEM) leaves the operator L intact, instead representing the solution u(x) by a
piecewise polynomial,

u_{∆x}(x) = ∑_{j=1}^{N} c_j ϕ_j(x).    (14.2)

Thus the approximant u∆ x (x) is a linear combination of basis functions ϕ j (x), often
referred to as shape functions. We will still use a grid for the construction of the
functions ϕ j (x), which are also piecewise polynomials of compact support. The
term “compact support” means that each basis function is nonzero only on a (small)
compact interval. Usually the construction is such that a given basis function is
nonzero only on two adjacent cells, where a “cell” refers to the interval between two
neighboring grid points. The shape functions are often constructed to be a partition
of unity, implying that
∑_{j=1}^{N} ϕ_j(x) = 1.

When u∆ x (x) is inserted into the original operator equation (14.1), there will
obviously be a residual,
L u∆ x = f + r.
The finite element method now needs to define the coefficients c j . The criterion is
to minimize the residual with respect to the L2 (Ω ) norm. This is equivalent to using
the least squares method. Given that the residual is


r = L u∆ x − f ,

we require that the residual is orthogonal to the set of basis functions, {ϕ_j}₁^N. This
is referred to as the Galerkin method.
Since orthogonality is defined in terms of an inner product, the best approxima-
tion is characterized by
⟨ϕ_i, r⟩ = 0,
for all i = 1 : N, or simply
∑_{j=1}^{N} c_j ⟨ϕ_i, L ϕ_j⟩ = ⟨ϕ_i, f⟩ ;    i = 1 : N,

where we have used the linearity of the operator L , and the fact that the inner
product is a bilinear form. This now results in a linear system of equations for
determining the coefficients c j .
The set of basis functions spans a space V_{∆x}. Since u_{∆x} ∈ V_{∆x}, the orthogonality
requirement above implies that u∆ x is the best approximation to be found in V∆ x ;
any change in the coefficients will increase the norm of the residual, and therefore
be worse. Thus, the discrete solution cannot be improved upon, without finding a
“better space” V∆ x , e.g. by refining the mesh width ∆ x.
The procedure described above has a shortcoming, however. We will be inter-
ested in solving the Poisson equation −u00 = f , where the operator L is a second
order derivative. The simplest basis functions are piecewise linear functions. But a
piecewise linear function cannot be used, because its second derivative is zero ev-
erywhere, except at the grid points, where u∆ x (x) is not two times differentiable.
Therefore the finite element method needs a modification of the approach above.
Thus we will reformulate the operator equation in its weak form, which is obtained
by integration by parts. With this change, the FEM proceeds according to the pat-
tern above.
The finite element method has many advantages over finite difference methods.
There are numerous variations on the theme. For example, we can choose basis
functions of high degrees to create high order methods, and with different continu-
ity requirements between cells. In the simplest case mentioned above, the approxi-
mating function u∆ x (x) is piecewise linear and continuous. That method is referred
to as the continuous Galerkin method cG(1), where the number 1 refers to the
degree of the piecewise polynomials. There are also other variants, where the
approximant is not required to be continuous (favoring other properties, such as
conservation principles). This is referred to as discontinuous Galerkin methods,
dG(·). Further variants may impose the orthogonality condition differently. But the
outstanding advantage of finite elements in 2D and 3D problems is that the grid
and cells can easily be adapted to complex geometries. By contrast, finite differ-
ence methods have considerable difficulties in such situations.

14.1 The weak form

Let us begin by considering the 1D Poisson equation −u'' = f on Ω = [0, 1], with
homogeneous Dirichlet boundary conditions, u(0) = u(1) = 0. This formulation is
called the strong form. It requires that the differential equation is satisfied point-
wise, for all x ∈ Ω .
To obtain the weak form, let v ∈ H₀¹(Ω) be a function satisfying the boundary
conditions. Construct the inner product of v and the differential equation in the
strong form to get

−⟨v, u''⟩ = ⟨v, f⟩.
In the left hand side, we integrate by parts to get

⟨v', u'⟩ = ⟨v, f⟩.    (14.3)

This is the weak form of the differential equation. It is “weak,” because u is now
only required to be differentiable a single time, and (14.3) does not require that
−u'' = f holds pointwise, but only in an average sense. Thus, while the strong
solution satisfies the weak formulation, the converse is not true; a solution to the
weak formulation is an approximate solution to −u'' = f.
This approach is formalized in the following way.

Definition 14.1. (Energy norm) Let u, v ∈ H₀¹(Ω). The energy norm is the bilinear
form a : H₀¹(Ω) × H₀¹(Ω) → R, defined by

a(v, u) = ⟨v', u'⟩.    (14.4)

The energy norm derives its name from the fact that a(u, u) = ‖u'‖₂², and that the
potential energy of an elastic beam, whose deflection from the mean, equilibrium
line is u, is proportional to ‖u'‖₂². This is similar to the energy stored in a linear
spring, which has been compressed a given distance. While this notion of energy is,
at least in part, an intuitive concept, it is also related to a variational formulation of
the problem. The equilibrium solution is found where the virtual work is zero. This
principle is of key importance in finite element methods, where the (approximate)
solution is found by a variational principle. In order to compute such an approxima-
tion, we need to formulate the equation that characterizes the optimal solution. This
is done by using the energy norm.

Definition 14.2. (Weak form) The weak form of the Poisson equation −u'' = f with
homogeneous Dirichlet conditions is defined by requiring that

a(v, u) = ⟨v, f⟩    (14.5)

holds for all test functions v ∈ H₀¹(Ω). A function u ∈ H₀¹(Ω) satisfying (14.5) is
called a weak solution to the Poisson equation.

We shall return to the question of whether there exists a solution to this problem.
This follows from the Lax–Milgram lemma, which relies on ellipticity and some
further properties of the problem.

14.2 The cG(1) finite element method in 1D

The cG(1) method refers to the continuous Galerkin method with linear elements.
(A linear element is a polynomial of degree 1.) Its shape functions are piecewise
linear functions, of compact support. Given a grid Ω̄N = {x_j}_{j=0}^{N+1}, the basis
functions {ϕ_i(x)}_{i=0}^{N+1} satisfy

ϕ_i(x_j) = δ_ij,

where δ_ij is the Kronecker delta. Thus, since the shape functions are piecewise
linear, they satisfy

ϕ_i(x) = (x − x_{i−1}) / (x_i − x_{i−1}) = (x − x_{i−1}) / ∆x,    x ∈ [x_{i−1}, x_i]

ϕ_i(x) = (x_{i+1} − x) / (x_{i+1} − x_i) = (x_{i+1} − x) / ∆x,    x ∈ [x_i, x_{i+1}].

This choice of basis functions is considered the simplest in FEM theory. They are
often referred to as hat functions, perhaps motivated by their appearance when
graphed, see Figure 14.1.
Now, let us consider the 1D Poisson equation −u'' = f with inhomogeneous
Dirichlet conditions u(0) = uL and u(1) = uR. To apply the cG(1) method, we write

u_{∆x}(x) = uL ϕ₀(x) + ∑_{j=1}^{N} c_j ϕ_j(x) + uR ϕ_{N+1}(x),    (14.6)

and require that this ansatz satisfy the weak formulation,

⟨ϕ_i', u'_{∆x}⟩ = ⟨ϕ_i, f⟩,    i = 1 : N.    (14.7)

Writing this as a system of equations, especially including the first and last equations
to demonstrate the influence from the boundary conditions, we have

uL⟨ϕ_1', ϕ_0'⟩ + c_1⟨ϕ_1', ϕ_1'⟩ + c_2⟨ϕ_1', ϕ_2'⟩ = ⟨ϕ_1, f⟩
c_{i−1}⟨ϕ_i', ϕ_{i−1}'⟩ + c_i⟨ϕ_i', ϕ_i'⟩ + c_{i+1}⟨ϕ_i', ϕ_{i+1}'⟩ = ⟨ϕ_i, f⟩,    i = 2 : N−1
c_{N−1}⟨ϕ_N', ϕ_{N−1}'⟩ + c_N⟨ϕ_N', ϕ_N'⟩ + uR⟨ϕ_N', ϕ_{N+1}'⟩ = ⟨ϕ_N, f⟩.
Fig. 14.1 Shape functions for cG(1) FEM. Piecewise linear basis functions are plotted on Ω̄N ⊂
[0, 1] for ∆ x = 0.1. Top panel shows a generic basis function (blue) at the center, and the two
special boundary basis functions (red, green) used to match boundary conditions. Bottom panel
shows three adjacent basis functions. Because each function has support only on two neighboring
cells, only three consecutive basis functions overlap

This is a system of N linear equations, for the N unknowns c = {c_j}₁^N,

c_1⟨ϕ_1', ϕ_1'⟩ + c_2⟨ϕ_1', ϕ_2'⟩ = ⟨ϕ_1, f⟩ − uL⟨ϕ_1', ϕ_0'⟩
c_{i−1}⟨ϕ_i', ϕ_{i−1}'⟩ + c_i⟨ϕ_i', ϕ_i'⟩ + c_{i+1}⟨ϕ_i', ϕ_{i+1}'⟩ = ⟨ϕ_i, f⟩,    i = 2 : N−1
c_{N−1}⟨ϕ_N', ϕ_{N−1}'⟩ + c_N⟨ϕ_N', ϕ_N'⟩ = ⟨ϕ_N, f⟩ − uR⟨ϕ_N', ϕ_{N+1}'⟩.

It can be written

KN c = g + b,    (14.8)

where KN ∈ R^{N×N} is referred to as the stiffness matrix. It is tridiagonal, due to the
fact that a given shape function only overlaps with its nearest neighbors. The matrix
elements are

k_ii = ⟨ϕ_i', ϕ_i'⟩,  and  k_ij = ⟨ϕ_i', ϕ_j'⟩,  for j = i ± 1.

Since ⟨ϕ_i', ϕ_{i+1}'⟩ = ⟨ϕ_{i+1}', ϕ_i'⟩ it follows that k_ij = k_ji; in other words, the stiffness
matrix is symmetric. This reflects the fact that the operator −d²/dx² is selfadjoint.
To assemble the matrix on a uniform grid, we compute
k_ii = ⟨ϕ_i', ϕ_i'⟩ = ∫_{x_{i−1}}^{x_{i+1}} |ϕ_i'(x)|² dx = 2∆x · (1/∆x²) = 2/∆x,

where the limits of the integral represent the subinterval of Ω = [0, 1] where ϕ_i' has
support, cp. Figure 14.1. We also compute

k_{i,i+1} = ⟨ϕ_i', ϕ_{i+1}'⟩ = ∫_{x_i}^{x_{i+1}} ϕ_i'(x) ϕ_{i+1}'(x) dx = ∆x · (−1/∆x²) = −1/∆x,

where the integration interval corresponds to the overlapping of the two shape func-
tions, where ϕ_i'(x)ϕ_{i+1}'(x) ≠ 0. Having computed these inner products, the stiffness
matrix can be assembled, and

              ⎡  2 −1  0  ⋯  0 ⎤
              ⎢ −1  2 −1  ⋯  0 ⎥
KN = (1/∆x) · ⎢     ⋱  ⋱  ⋱    ⎥ .    (14.9)
              ⎢    −1  2 −1    ⎥
              ⎣  0  ⋯  0 −1  2 ⎦

Here we recognize the usual symmetric positive definite tridiagonal Toeplitz ma-
trix T = ] − 1 2 − 1 [ that we encountered when the 1D Poisson problem was
solved using the finite difference method.
Since we have already computed ⟨ϕ_i', ϕ_{i+1}'⟩, we can also account for the boundary
conditions, which involve the shape functions on the boundary. Thus the first and
last elements of the vector b in (14.8) are

b_1 = uL/∆x,    b_N = uR/∆x.
The rest of its elements are zero. This, too, conforms to the results for FDM. How-
ever, the difference between the cG(1) FEM and the standard 2nd order FDM be-
comes apparent when considering the vector g. The issue is that one needs to com-
pute the integrals gi = hϕi , f i. This is in general not possible. Instead, these integrals
must be approximated numerically. This is done as follows.
We sample the function f on the grid Ω̄N, including its boundary points, since,
in general, f(x) is nonzero also on the boundary of Ω = [0, 1]. This gives us a
vector f̄ = {f(x_j)}₀^{N+1}, allowing us to represent the function f as a piecewise
linear function,

f(x) ≈ ∑_{j=0}^{N+1} f_j ϕ_j(x),

where f j = f (x j ). This means that we use linear interpolation between the sam-
ples; it is a second order accurate approximation, provided that f ∈ C2 [0, 1]. We can
now compute the inner products

⟨ϕ_i, f⟩ ≈ ∑_{j=0}^{N+1} f_j ⟨ϕ_i, ϕ_j⟩ = ∑_{ν=−1}^{1} f_{i+ν} ⟨ϕ_i, ϕ_{i+ν}⟩,

again due to the fact that the shape function only overlaps with its nearest neighbors.
We now have

⟨ϕ_i, ϕ_i⟩ = ∫_{x_{i−1}}^{x_i} ((x − x_{i−1})/∆x)² dx + ∫_{x_i}^{x_{i+1}} ((x_{i+1} − x)/∆x)² dx = 2∆x/3,

and

⟨ϕ_i, ϕ_{i+1}⟩ = ∫_{x_i}^{x_{i+1}} (x_{i+1} − x)(x − x_i)/∆x² dx = ∆x/6.
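These element integrals are easy to confirm numerically (a Python check of my own, using Simpson's rule on each cell, which is exact for the quadratic integrands; the particular grid points are arbitrary):

```python
def simpson(g, a, b, n=2):
    # Composite Simpson's rule; exact for polynomials of degree <= 3
    h = (b - a) / (2 * n)
    s = g(a) + g(b) \
        + 4 * sum(g(a + (2*k + 1)*h) for k in range(n)) \
        + 2 * sum(g(a + 2*k*h) for k in range(1, n))
    return s * h / 3

dx = 0.1
x0, x1, x2 = 0.3, 0.4, 0.5       # three consecutive grid points, hat centered at x1

def phi1(x):
    # Piecewise linear hat function centered at x1
    return (x - x0)/dx if x <= x1 else (x2 - x)/dx

def phi2_left(x):
    # Left flank of the neighboring hat centered at x2 (its support on [x1, x2])
    return (x - x1)/dx

mii = simpson(lambda x: phi1(x)**2, x0, x1) + simpson(lambda x: phi1(x)**2, x1, x2)
mij = simpson(lambda x: phi1(x) * phi2_left(x), x1, x2)
```

The results reproduce ⟨ϕ_i, ϕ_i⟩ = 2∆x/3 and ⟨ϕ_i, ϕ_{i+1}⟩ = ∆x/6; the stiffness entries 2/∆x and −1/∆x follow even more directly, since the derivatives of the hats are the constants ±1/∆x.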
This again yields a Toeplitz matrix, but it is N × (N + 2). Computing the right hand
side vector g corresponds to convolving the sampled function f̄, with a kernel

MN = (∆x/6) · ] 1 4 1 [.    (14.10)

This is an averaging operator, which can be compared to the difference correction
used in the Cowell FDM, but its origin is different. Also, it does not increase the
order of accuracy; it is merely a matter of calculating the load function g in the
linear system. Thus the cG(1) equations for the 1D Poisson equation can be written
       
          ⎡  2 −1       ⋯  0 ⎤ ⎡ c_1 ⎤          ⎡ 4 1       ⋯ 0 ⎤ ⎡ f_1 ⎤   ⎡ d_1 ⎤
    1     ⎢ −1  2 −1    ⋯  0 ⎥ ⎢ c_2 ⎥    ∆x    ⎢ 1 4 1     ⋯ 0 ⎥ ⎢ f_2 ⎥   ⎢  0  ⎥
   ——  ·  ⎢     ⋱  ⋱  ⋱      ⎥ ⎢  ⋮  ⎥ =  ——  · ⎢    ⋱ ⋱ ⋱      ⎥ ⎢  ⋮  ⎥ + ⎢  ⋮  ⎥ ,
   ∆x     ⎢       −1  2 −1   ⎥ ⎢     ⎥     6    ⎢      1 4 1    ⎥ ⎢     ⎥   ⎢  0  ⎥
          ⎣  0   ⋯  0 −1  2  ⎦ ⎣ c_N ⎦          ⎣ 0  ⋯  0 1 4   ⎦ ⎣ f_N ⎦   ⎣ d_N ⎦

where the first and last elements of the vector d are

d_1 = f_0 ∆x/6 + uL/∆x,    d_N = f_{N+1} ∆x/6 + uR/∆x.
In matrix–vector form the system reads

KN c = MN f + d,    (14.11)

where MN is referred to as the mass matrix. The N × N mass matrix is a symmetric
positive definite tridiagonal Toeplitz matrix. By solving (14.11), we obtain the
vector c, and therefore also the piecewise linear approximate solution u_{∆x}(x) for
x ∈ Ω = [0, 1]. Moreover, since ϕ_j(x_i) = δ_ij, it follows from (14.6) that

u_{∆x}(x_0) = uL
u_{∆x}(x_i) = ∑_{j=1}^{N} c_j ϕ_j(x_i) = c_i,    i = 1 : N
u_{∆x}(x_{N+1}) = uR.

Thus ci ≈ u(xi ) approximates the exact solution on the grid, and the interpolant
u∆ x (x) ≈ u(x) for all x ∈ Ω = [0, 1].
The cG(1) FEM can be compared to the standard FDM for the same problem.
The latter produced the linear system LN u = f + b. The relation between LN and KN
is trivial, as KN = ∆ x · LN . The main difference is in the right hand side. While the
boundary conditions enter in a similar way (scaled by ∆ x), the load function f is
treated differently, inasmuch as the FEM employs the mass matrix to create a local
averaging of function values. In addition, the FEM also uses the values of f on the
boundary, f (0) and f (1). This is not unlike the 4th order Cowell FDM, but there
the matrix differs from the mass matrix in the FEM. Later we shall also see that on
nonuniform grids the differences are greater than they appear. There, we will find
significant differences also between LN and KN .
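To make the assembly concrete, here is a short Python sketch (mine, not the book's MATLAB; the test problem −u'' = π² sin πx with uL = uR = 0 and exact solution sin πx is my own choice). It builds the stiffness system with the mass-matrix right hand side and solves it with a tridiagonal elimination:

```python
import math

def thomas(sub, diag, sup, rhs):
    # Solve a tridiagonal system by forward elimination and back substitution
    n = len(rhs)
    d, r = diag[:], rhs[:]
    for i in range(1, n):
        w = sub[i] / d[i-1]
        d[i] = d[i] - w * sup[i-1]
        r[i] = r[i] - w * r[i-1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n-2, -1, -1):
        x[i] = (r[i] - sup[i] * x[i+1]) / d[i]
    return x

def cg1_error(N):
    # cG(1) for -u'' = f, u(0) = u(1) = 0, with f = pi^2 sin(pi x)
    dx = 1.0 / (N + 1)
    x = [(j + 1) * dx for j in range(N)]
    fbar = [math.pi**2 * math.sin(math.pi * j * dx) for j in range(N + 2)]
    # Mass-matrix right hand side: (dx/6) * ] 1 4 1 [ applied to the sampled load
    g = [dx/6 * (fbar[i] + 4*fbar[i+1] + fbar[i+2]) for i in range(N)]
    # Stiffness matrix: (1/dx) * ] -1 2 -1 [
    c = thomas([-1/dx]*N, [2/dx]*N, [-1/dx]*N, g)
    return max(abs(ci - math.sin(math.pi * xi)) for ci, xi in zip(c, x))
```

The maximal nodal error shrinks by roughly a factor of 4 when ∆x is halved, as expected of a 2nd order method.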
Naturally, the cG(1) FEM can be used also to solve eigenvalue problems. For
further comparison with the FDM, we apply the FEM to the eigenvalue problem

−u'' = λu,    u(0) = u(1) = 0.    (14.12)

Due to the homogeneous Dirichlet conditions, we have uL = uR = 0, so the ansatz


(14.6) does not contain the boundary shape functions ϕ0 (x) and ϕN+1 (x). This holds
also in the right hand side, simplifying the problem. The construction of the system
is otherwise identical to the derivation above, and one arrives at the generalized
eigenvalue problem
KN u = λ MN u. (14.13)
This is not a standard eigenvalue problem due to the appearance of the mass matrix
in the right hand side. There are special computational methods that solve general-
ized eigenvalue problems directly, but here we note that (14.13) is mathematically
equivalent to the problem

M_N⁻¹ K_N u = λu.    (14.14)

Since MN is symmetric positive definite it is invertible, with a symmetric positive
definite inverse. Thus, in the cG(1) method we obtain discrete eigenvalues

λ_k[M_N⁻¹ K_N].

These should be close to those produced by the standard 2nd order FDM, and the
error should be similar since the cG(1) method is 2nd order. This is also seen to be
the case in Figure 14.2, where the same problem is also solved using Cowell’s 4th
order method. The algebraic problem is again a generalized eigenvalue problem, like
in the cG(1) method, but with a different weight matrix replacing the mass matrix.


Fig. 14.2 Relative error in eigenvalues. The cG(1) and standard 2nd order FDM solve the Dirichlet
eigenvalue problem −u'' = λu using ∆x = 10⁻². Relative errors in λ_k for the first ten discrete
eigenvalues are practically indistinguishable (top curve, red markers), and the relative error in λ_k
is O(k²∆x²). These results are compared to Cowell’s 4th order FDM (lower curve, blue markers),
which produces much higher accuracy, with a relative error in λ_k proportional to O(k⁴∆x⁴)

14.3 Convergence

We shall only sketch a convergence proof for the cG(1) FEM applied to the 1D
Poisson equation with Dirichlet conditions. Above, we have already noted that

u_{∆x}(x_i) = c_i.

Thus c_i ≈ u(x_i). The FEM equation (14.11) is obviously consistent of order p = 2 at
the grid points x_i provided that the function f is smooth. Since ‖K_N⁻¹‖₂ ≤ C implies
stability, it follows that c_i = u(x_i) + O(∆x²) as ∆x → 0, implying that the method is
convergent, at the grid points, in the sense that c_i → u(x_i).
Between the grid points, the method is also 2nd order convergent, due to linear
interpolation being 2nd order accurate. We have the following classical result.

Lemma 14.1. Let v ∈ C²[a, b] and assume that x ∈ [a, b]. Then the linear interpolant

P(x) = v(a) + (x − a) · (v(b) − v(a)) / (b − a)

satisfies P(a) = v(a) and P(b) = v(b), with an error

v(x) − P(x) = (x − a)(x − b) · v''(ξ)/2!

for some ξ ∈ (a, b).

Note that |(x − a)(x − b)| ≤ (b − a)²/4 for all x ∈ [a, b] and that the maximum
is attained at the midpoint, x = (a + b)/2. Suppose that b − a = ∆x, so that the
interpolant represents the piecewise linear interpolation on the grid. Then the inter-
polation error satisfies the estimate

|v(x) − P(x)| ≤ (∆x²/8) · |v''(ξ)|,

showing that linear interpolation is 2nd order accurate as ∆x → 0. Thus, at the grid
points, |u_{∆x}(x_i) − u(x_i)| = O(∆x²), and since linear interpolation between the grid
points is also second order accurate, the global error is

|u_{∆x}(x) − u(x)| = O(∆x²)    (14.15)

for all x ∈ Ω = [0, 1], provided that the solution u(x) is twice continuously differentiable.
In fact, if u is a strong (pointwise) solution to −u'' = f, the error bound
(14.15) holds whenever f ∈ C⁰[0, 1].
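The factor ∆x²/8 in the interpolation estimate is easy to observe numerically (a Python check of my own, with v = sin on a single cell):

```python
import math

a, b = 0.0, 0.1                  # one cell of width dx = 0.1
dx = b - a
v = math.sin

def P(x):
    # Linear interpolant of v on [a, b]
    return v(a) + (x - a) * (v(b) - v(a)) / (b - a)

# Sample the interpolation error densely over the cell
err = max(abs(v(a + t*dx/1000) - P(a + t*dx/1000)) for t in range(1001))

# dx^2/8 times max |v''| on [a, b]; here |v''| = sin is increasing, so the
# maximum is at the right endpoint
bound = dx**2 / 8 * max(abs(math.sin(a)), abs(math.sin(b)))
```

The sampled maximum error sits below the bound, and within the same order of magnitude, confirming that the estimate is sharp up to the variation of v'' over the cell.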
If u_{∆x}(x) → u(x) with a second order error, we generally lose an order for the
derivative u'_{∆x}(x). The derivative u'_{∆x}(x) is piecewise constant and will deviate by
an error |u'_{∆x}(x) − u'(x)| = O(∆x) for an arbitrary x ∈ Ω = [0, 1]. However, by the
mean value theorem, there is a ξ_i ∈ (x_i, x_{i+1}) such that u'_{∆x}(ξ_i) = u'(ξ_i). As always,
each ξ_i is unknown, but a second order approximation is found at the midpoint of
each cell. Thus it holds that

u'_{∆x}((x_i + x_{i+1})/2) − u'((x_i + x_{i+1})/2) = O(∆x²).

As a consequence, in the finite element method there is a shift in importance from
grid points to the cells (intervals), where the cG(1) method can produce a global
O(∆x²) accurate solution, as well as a similar accuracy in the derivative, although
only at the cell centers.

14.4 Neumann conditions

We have previously solved the Neumann problem for the 1D Poisson equation, using
FDM. There we saw that there were many options for representing the boundary
condition so as to achieve 2nd order accuracy. In the cG(1) method, the options are fewer,

and second order convergence is obtained by constructing a grid with N internal
points, such that

x_j = j · ∆x,

with ∆x = 1/(N + 1/2). Hence x_0 = 0 and

x_N = 1 − ∆x/2,    x_{N+1} = 1 + ∆x/2,

with x_{N+1} outside the [0, 1] interval. Since x_N and x_{N+1} are symmetrically located
around the boundary point x = 1, the latter is the center of the cell [x_N, x_{N+1}], where
the derivative will be represented to second order accuracy. Using the standard
piecewise linear shape functions, we have

u∆ x (x0 ) = uL
N+1
u∆ x (x) = ∑ c j ϕ j (x) i=1:N
j=1

where the last shape function ϕN+1 (x) is used to impose the Neumann boundary
condition u0∆ x (1) = u0R at x = 1. The value of u(1) is approximated by u∆ x (1). Due
to the piecewise linear construction, we have
cN+1 + cN
u∆ x (1) = cN+1 ϕN+1 (1) + cN ϕN (1) = .
2
Since

u'_{∆x}(1) = c_{N+1} ϕ'_{N+1}(1) + c_N ϕ'_N(1) = (c_{N+1} − c_N)/∆x = u'_R,

it follows that c_{N+1} = c_N + ∆x u'_R, and consequently

u_{∆x}(1) = (c_{N+1} + c_N)/2 = c_N + ∆x u'_R/2 = u_{∆x}(1 − ∆x/2) + ∆x u'_R/2.
Thus c_{N+1} is determined by c_N and the Neumann condition u'(1) = u'_R. The remain-
ing coefficients {c_i}₁^N are determined by the linear system

c_1⟨ϕ_1', ϕ_1'⟩ + c_2⟨ϕ_1', ϕ_2'⟩ = ⟨ϕ_1, f⟩ − uL⟨ϕ_1', ϕ_0'⟩
c_{i−1}⟨ϕ_i', ϕ_{i−1}'⟩ + c_i⟨ϕ_i', ϕ_i'⟩ + c_{i+1}⟨ϕ_i', ϕ_{i+1}'⟩ = ⟨ϕ_i, f⟩,    i = 2 : N−1
c_{N−1}⟨ϕ_N', ϕ_{N−1}'⟩ + c_N⟨ϕ_N', ϕ_N'⟩ + c_{N+1}⟨ϕ_N', ϕ_{N+1}'⟩ = ⟨ϕ_N, f⟩.

Using c_{N+1} = c_N + ∆x u'_R, the system becomes

c_1⟨ϕ_1', ϕ_1'⟩ + c_2⟨ϕ_1', ϕ_2'⟩ = ⟨ϕ_1, f⟩ − uL⟨ϕ_1', ϕ_0'⟩
c_{i−1}⟨ϕ_i', ϕ_{i−1}'⟩ + c_i⟨ϕ_i', ϕ_i'⟩ + c_{i+1}⟨ϕ_i', ϕ_{i+1}'⟩ = ⟨ϕ_i, f⟩,    i = 2 : N−1
c_{N−1}⟨ϕ_N', ϕ_{N−1}'⟩ + c_N(⟨ϕ_N', ϕ_N'⟩ + ⟨ϕ_N', ϕ_{N+1}'⟩) = ⟨ϕ_N, f⟩ − ∆x u'_R⟨ϕ_N', ϕ_{N+1}'⟩.

As a result, we get the system


       
2 −1 0 · · · 0 c1 4 1 0 0 ···f1 d0
−1 2 −1 · · · 0   c2  1 4 1 0 ···
 f2   0 
1  .. .. ..
  ∆x 
  ..   .. ..
   
  ..   .. 
..
 .  =  . + . .


∆x  . . .   6 
 . . .
   
 −1 2 −1    1 4 1    0 
0 · · · 0 −1 1 cN 0 ··· 0 1 4 fN dN

In matrix–vector form the system reads

KN c = MN f + d,    (14.16)

where the first and last elements of the vector d are

d_1 = f_0 ∆x/6 + uL/∆x,    d_N = f_{N+1} ∆x/6 + u'_R.
In addition, the lower right element of the stiffness matrix has been changed from
2 to 1, and the mesh width is adjusted from ∆x = 1/(N + 1) to ∆x = 1/(N + 1/2).
This problem representation is similar to the first approach used in the FDM analysis
of the Neumann problem for the 1D Poisson problem.
Convergence is obvious, since the modified stiffness matrix is the same as the
matrix in the FDM treatment of the eigenvalue problem with Neumann condition.
Therefore the method is stable. Likewise, consistency at the grid points follows if f
is regular. By the Lax principle the method is convergent of order p = 2 at the grid
points, and the piecewise linear interpolation implies that the cG(1) solution u_{∆x}(x)
is globally 2nd order accurate.

14.5 cG(1) FEM on nonuniform grids

Let us consider a nonuniform grid constructed as a differentiable deformation of a


uniform grid. This means that we choose a differentiable map Φ : [0, 1] → [0, 1] such
that Φ(0) = 0 and Φ(1) = 1, and such that a uniform grid

ξn = n/(N + 1)

for n = 0 : N + 1 is mapped to a nonuniform grid xn = Φ(ξn ).

You might also like