MATH 337, by T. Lakoba, University of Vermont


0 Preliminaries
0.1 Motivation
The numerical methods for solving differential equations that we will study in this course are
based on the following concept: Given a differential equation, e.g.,

    y'(x) = f(x, y),                                                        (0.1)

replace the derivative by an appropriate finite difference, e.g.:

    y'(x) ≈ [ y(x + h) − y(x) ] / h,   when h is small (h ≪ 1).             (0.2)

Then Eq. (0.1) becomes (in the approximate sense)

    y(x + h) − y(x) = h f(x, y(x)),                                         (0.3)

from which the new value y(x + h) of the unknown function y can be found given the old
value y(x).
In this course, we will consider both equations more complicated than (0.1) and
discretization schemes more sophisticated than (0.3).
0.2 Taylor series expansions
Taylor series expansion of functions will play a central role when we study the accuracy of
discretization schemes. Below is a reminder from Calculus II, and its generalization.
If a function f(x) has infinitely many derivatives, then

    f(x) = f(x_0) + [(x − x_0)/1!] f'(x_0) + [(x − x_0)²/2!] f''(x_0) + . . .
         = \sum_{n=0}^{∞} [(x − x_0)^n / n!] f^{(n)}(x_0).                  (0.4)

If f(x) has derivatives up to the (N+1)st (i.e. f^{(N+1)} exists), then

    f(x) = \sum_{n=0}^{N} [(x − x_0)^n / n!] f^{(n)}(x_0)
           + [(x − x_0)^{N+1} / (N+1)!] f^{(N+1)}(ξ),    ξ ∈ (x_0, x).       (0.5)

For functions of two variables, Eq. (0.4) generalizes as follows (we denote Δx = (x − x_0)
and Δy = (y − y_0)):
    f(x, y) = \sum_{n=0}^{∞} [(Δx)^n / n!] ∂^n f(x_0, y) / ∂x^n

            = \sum_{n=0}^{∞} [(Δx)^n / n!] { \sum_{m=0}^{∞} [(Δy)^m / m!] ∂^{n+m} f(x_0, y_0) / ∂x^n ∂y^m }     (explained below)

            = \sum_{k=0}^{∞} (1/k!) ( Δx ∂/∂x + Δy ∂/∂y )^k f(x̃, ỹ) |_{x̃ = x_0, ỹ = y_0}

            = f(x_0, y_0) + [ Δx f_x(x_0, y_0) + Δy f_y(x_0, y_0) ]
              + (1/2!) [ (Δx)² f_xx(x_0, y_0) + 2 Δx Δy f_xy(x_0, y_0) + (Δy)² f_yy(x_0, y_0) ] + . . .          (0.6)
The step of going from the second to the third line in the above calculations is based on the
binomial expansion formula

    (x + y)^k = \sum_{n=0}^{k} [ k! / ( n! (k − n)! ) ] x^n y^{k−n}

and takes some effort to verify. (For example, one would write out all terms in line two with
n + m = 2 and verify that they equal the term in line three with k = 2. Then one would
repeat this for n + m = k = 3 and so on, until one sees the pattern.) For our purposes, it will
be sufficient to just accept the end result, i.e. the last line of (0.6).
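As a quick sanity check of the last line of (0.6), one can compare the second-order two-variable
Taylor polynomial with the function itself for a small displacement. The sketch below is not part
of the original notes; the test function f(x, y) = e^x sin y and the expansion point (0.5, 0.3) are
arbitrary choices made only for illustration.

    import numpy as np

    # Test function and its partial derivatives (chosen for illustration only)
    f   = lambda x, y: np.exp(x) * np.sin(y)
    fx  = lambda x, y: np.exp(x) * np.sin(y)
    fy  = lambda x, y: np.exp(x) * np.cos(y)
    fxx, fxy = fx, fy
    fyy = lambda x, y: -np.exp(x) * np.sin(y)

    x0, y0, dx, dy = 0.5, 0.3, 1e-2, 2e-2

    # Second-order two-variable Taylor polynomial, last line of (0.6)
    taylor2 = (f(x0, y0) + dx*fx(x0, y0) + dy*fy(x0, y0)
               + 0.5*(dx**2*fxx(x0, y0) + 2*dx*dy*fxy(x0, y0) + dy**2*fyy(x0, y0)))

    print(abs(f(x0 + dx, y0 + dy) - taylor2))   # ~1e-6, i.e. third order in (dx, dy)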
0.3 Existence and uniqueness theorem for ODEs
In the first two parts of this course, we will deal exclusively with ordinary differential equations
(ODEs), i.e. equations that involve the derivative(s) with respect to only one independent
variable (usually denoted as x).
To solve an ODE numerically, we first have to be sure that its solution exists and is unique;
otherwise, we may be looking for something that simply is not there! The following theorem
establishes this fundamental fact for ODEs.

Theorem  Let y(x) satisfy the initial-value problem (IVP), i.e. an ODE plus the initial
condition:

    y'(x) = f(x, y),    y(x_0) = y_0.                                       (0.7)

Let f(x, y) be defined and continuous in a closed region R that contains the point (x_0, y_0). Let, in
addition, f(x, y) satisfy the Lipschitz condition with respect to y:

    For any x, y_1, y_2 ∈ R,    |f(x, y_1) − f(x, y_2)| ≤ L |y_1 − y_2|,     (Lipschitz)

where the constant L depends on the region R and the function f, but not on y_1 and y_2. Then
a solution of IVP (0.7) exists and is unique on some interval containing the point x_0.

Remarks to the Theorem:
1. Any f(x, y) that is differentiable with respect to y and such that |f_y| ≤ L in R satisfies
the Lipschitz condition. In this case, the Lipschitz constant is L = max_R |f_y(x, y)|.
2. In addition, f(y) = |y| also satisfies the Lipschitz condition, even though this function
does not have a derivative with respect to y at y = 0. In general, L = max |f_y(x, y)|, where the
maximum is taken over the part of R where f_y exists. For example, for f(y) = |y|, one
has L = 1.
3. f(y) = √y does not satisfy the Lipschitz condition on [0, 1]. Indeed, one cannot find a
constant L that would be independent of y and such that

    |√y − √0| ≤ L |y − 0|

for sufficiently small y.
Question: What happens to the solution of the ODE when the Lipschitz condition is violated?
Consider the IVP

    y'(x) = √y,    y(0) = 0.                                                (0.8)

As we have just said in Remark 3, the function f(y) = √y does not satisfy the Lipschitz
condition. One can verify (by substitution) that IVP (0.8) has the following solutions:

    1st solution:   y = x²/4.
    2nd solution:   y = 0.
    infinitely many other solutions:
                    y = 0 for 0 ≤ x ≤ a  (a > 0),   y = (x − a)²/4 for x > a.

[Figure: plots of the 1st solution y = x²/4, the 2nd solution y = 0, and several of the other
solutions for different values of a.]
Thus, if f(x, y) does not satisfy the Lipschitz condition, the solution of IVP (0.7) may not be
unique.
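A hedged numerical illustration of this non-uniqueness (not part of the original notes): running
the simple Euler iteration of the next lecture on (0.8), starting exactly at y(0) = 0 keeps the
computed solution at zero forever, even though y = x²/4 also solves (0.8); an arbitrarily small
positive perturbation of the initial value sends the computed solution to (approximately) that
parabola instead.

    from math import sqrt

    f = lambda y: sqrt(max(y, 0.0))    # right-hand side of (0.8)

    def march(y0, h=1e-3, x_end=6.0):
        y = y0
        for _ in range(int(x_end / h)):
            y += h * f(y)              # simple Euler step
        return y

    print(march(0.0))     # stays exactly 0: the 2nd solution
    print(march(1e-12))   # ~ 9, essentially the 1st solution y = x^2/4 at x = 6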
0.4 Solution of a linear inhomogeneous IVP
We recall here the procedure for solving the IVP

    y'(x) = a y + g(x),   a = const,    y(x_0) = y_0.                       (0.9)

Step 1: Solve the homogeneous ODE y' = a y:

    y'_hom = a y_hom   ⟹   y_hom(x) = e^{a(x − x_0)}.                      (0.10)

Step 2: Look for the solution of the inhomogeneous problem in the form y(x) = y_hom(x) c(x),
where c(x) is determined by substituting the latter expression into Eq. (0.9):

    c y'_hom + c' y_hom = a c y_hom + g(x),

    c' = g(x) / y_hom,

    c(x) = ∫ g(x̃) e^{−a(x̃ − x_0)} dx̃,

    y(x) = [ y_0 + ∫_{x_0}^{x} g(x̃) e^{−a(x̃ − x_0)} dx̃ ] e^{a(x − x_0)}.   (0.11)

In the first line of (0.11), the terms c y'_hom and a c y_hom cancel on the two sides of the
equation, which occurs due to (0.10).
0.5 A very useful limit from Calculus
In Calculus I, you learned that

    lim_{h→0} (1 + h)^{1/h} = e,                                            (0.12)

where e is the base of the natural logarithm.
The following useful corollary is derived from (0.12):

    lim_{h→0} (1 + ah)^{b/h} = e^{ab},                                      (0.13)

where a, b are any finite numbers. Indeed, if we denote ah = g, then g → 0 as h → 0, and then
the l.h.s. (left-hand side) of (0.13) becomes:

    lim_{g→0} (1 + g)^{b/(g/a)} = lim_{g→0} (1 + g)^{ab/g} = [ lim_{g→0} (1 + g)^{1/g} ]^{ab} = e^{ab}.

Note also that

    lim_{h→0} (1 + ah²)^{b/h} = e^0 = 1                                     (0.14)

for any finite numbers a and b.
1 Simple Euler method and its modifications
1.1 Simple Euler method for the 1st-order IVP
Consider the IVP

    y'(x) = f(x, y),    y(x_0) = y_0.                                       (1.1)

Let:  x_i = x_0 + i h,  i = 0, 1, . . . , n;
      y_i = y(x_i) — the true solution evaluated at the points x_i;
      Y_i — the solution to be calculated numerically.
Replace

    y'(x)  →  ( Y_{i+1} − Y_i ) / h.

Then Eq. (1.1) gets replaced with

    Y_{i+1} = Y_i + h f(x_i, Y_i),    Y_0 = y_0.                            (1.2)
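Below is a minimal implementation sketch of the simple Euler method (1.2), written in Python
purely for illustration (the codes accompanying the course are in MATLAB); the test problem
y' = −y + sin x, y(0) = 1 is an arbitrary example, not one from the notes.

    import numpy as np

    def euler(f, x0, y0, h, n):
        # Simple Euler method (1.2): Y_{i+1} = Y_i + h f(x_i, Y_i).
        x = x0 + h * np.arange(n + 1)
        Y = np.empty(n + 1)
        Y[0] = y0
        for i in range(n):
            Y[i + 1] = Y[i] + h * f(x[i], Y[i])
        return x, Y

    # Example: y' = -y + sin(x), y(0) = 1, integrated over [0, 5] with h = 0.1
    f = lambda x, y: -y + np.sin(x)
    x, Y = euler(f, 0.0, 1.0, 0.1, 50)
    print(Y[-1])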
1.2 Local error of the simple Euler method
The calculated solution satisfies Eq. (1.2). Next, assuming that the true solution of IVP (1.1)
has (at least) a second derivative y''(x), one can use the Taylor expansion to write:

    y_{i+1} = y(x_i + h) = y_i + y'_i h + y''(x̄_i) h²/2 = y_i + h f(x_i, y_i) + O(h²).     (1.3)

Here x̄_i is some point between x_i and x_{i+1} = x_i + h, and we have used Eq. (0.5).
The notation O(h^k), for any k, means the following:

    q = O(h^k)   whenever   lim_{h→0} q / h^k = const < ∞,   const ≠ 0.

For example,

    5h² + 1000h³ = O(h²);    or    h / [ 1 + h cos(3 + 2h) ] = O(h).

We now introduce a new notation. The local truncation error shows how well the solution
Y_{i+1} of the finite-difference scheme approximates the exact solution y_{i+1} of the ODE at the point
x_{i+1}, assuming that at x_i the two solutions were the same, i.e. Y_i = y_i. Comparing the last line
of Eq. (1.3) with Eq. (1.2), we see that the local truncation error of the simple Euler method
is O(h²). It tends to zero when h → 0.
Another useful notation is that of the discretization error. It shows how well the finite-difference
scheme approximates the ODE. Let us now estimate this error. First, we note from (1.2) and
(1.3) that the computed and exact solutions satisfy:

    ( Y_{i+1} − Y_i ) / h = f(x_i, Y_i)    and    ( y_{i+1} − y_i ) / h = f(x_i, y_i) + O(h),

whence the discretization error of the simple Euler method is seen to be O(h).
1.3 Global error of the Euler method; Propagation of errors
As we have said above, the local truncation error shows how well the computed solution
approximates the exact solution at one given point, assuming that these two solutions have been
the same up to that point. However, as we compute the solution of the finite-difference scheme,
the local truncation errors made at each step accumulate. As a result, the difference between
the computed solution Y_i and the exact solution y_i at some point x_i down the line becomes much
greater than the local truncation error.
Let ε_i = y_i − Y_i denote the error (the difference between the true and computed solutions)
at x = x_i. This error (or, sometimes, its absolute value) is called the global error of the
finite-difference method.
Our goal in this subsection will be to find an upper bound for this error. Let us emphasize
that finding an upper bound for the error rather than the error itself is the best one can do.
(Indeed, if one could have found the actual error ε_i, one would have then simply added it to
the numerical solution Y_i and obtained the exact solution y_i.) The main purpose of finding the
upper bound for the error is to determine how it depends on the step size h. We will do this
now for the simple Euler method (1.2).
To this end, we begin by considering Eq. (1.2) and the 1st line of Eq. (1.3):

    Y_{i+1} = Y_i + h f(x_i, Y_i),
    y_{i+1} = y_i + h f(x_i, y_i) + (h²/2) y''(x̄_i).

Subtract the 1st equation above from the 2nd to obtain the error at x_{i+1}:

    ε_{i+1} = ε_i + h [ f(x_i, y_i) − f(x_i, Y_i) ] + (h²/2) y''(x̄_i).          (1.4)

Now apply the triangle inequality, valid for any three numbers a, b, c:

    a = b + c   ⟹   |a| ≤ |b| + |c|,                                            (1.5)

to Eq. (1.4) and obtain:

    |ε_{i+1}| ≤ |ε_i| + hL|ε_i| + (h²/2) |y''(x̄_i)|
             = (1 + hL)|ε_i| + (h²/2) |y''(x̄_i)|.                               (1.6)
In writing the second term in the above formula, we used the fact that f(x, y) satisfies the
Lipschitz condition with respect to y (see Lecture 0).
To complete finding the upper bound for the error |ε_{i+1}|, we need to estimate y''(x̄_i). We
use the Chain rule for a function of two variables (recall Calculus III) to obtain:

    y''(x) = d²y(x)/dx² = [use the ODE] = df(x, y)/dx = f_x (dx/dx) + f_y (dy/dx) = f_x + f_y f.    (1.7)

Considering the first term on the r.h.s. of (1.7), let us assume that

    |f_x| ≤ M_1   for some M_1 < ∞.                                             (1.8)

In cases when this assumption does not hold (as, for example, for f(x, y) = x^{1/3} sin(1/x)), the
estimate obtained below (see (1.16)) is not valid, but a modified estimate can usually be found
on a case-by-case basis. So here we proceed with assumption (1.8).
Considering the second term on the r.h.s. of (1.7), we first recall that f satisfies the Lipschitz
condition with respect to y, which means that

    |f_y| ≤ M_2   for some M_2 < ∞,                                             (1.9)

except possibly at a finite number of y-values where f_y does not exist (like at y = 0 for
f(y) = |y|). Finally, the other factor of the second term on the r.h.s. of (1.7) is also bounded,
because f is assumed to be continuous on the closed region R (see the Existence and
Uniqueness Theorem in Lecture 0). Thus,

    |f| ≤ M_3   for some M_3 < ∞.                                               (1.10)

Combining Eqs. (1.7)–(1.10), we see that

    |y''(x̄_i)| ≤ M_1 + M_2 M_3 ≡ M < ∞.                                        (1.11)

Now combining Eqs. (1.6) and (1.11), we obtain:

    |ε_{i+1}| ≤ (1 + hL)|ε_i| + (h²/2) M.                                       (1.12)

This last equation implies that |ε_{i+1}| ≤ z_{i+1}, where z_{i+1} satisfies the following recurrence
equation:

    z_{i+1} = (1 + hL) z_i + (h²/2) M,    z_0 = 0.                              (1.13)

(The condition z_0 = 0 follows from the fact that ε_0 = 0; see the initial conditions in Eqs. (1.1) and
(1.2).)
Thus, the error |ε_i| is bounded by z_i, and we need to solve Eq. (1.13) to find that bound.
The way to do so is analogous to solving a linear inhomogeneous equation (see Section 0.4).
However, before we obtain the solution, let us develop an intuitive understanding of what kind
of answer we should expect. To this end, let us assume for the moment that L = 0 in Eq.
(1.13). Then we have:

    z_{i+1} = z_i + (h²/2)M = ( z_{i−1} + (h²/2)M ) + (h²/2)M = . . .
            = z_0 + (h²/2)M · i = 0 + (h²/2)M · (x_i − x_0)/h = h M(x_i − x_0)/2 = O(h).

That is, the global error |ε_i| should have the size O(h). In other words,

    Global error = Number of steps × Local error,    or    O(h) = O(1/h) · O(h²).

Now let us show that a similar estimate also holds for L ≠ 0. First, solve the homogeneous
version of (1.13):

    z_{i+1} = (1 + hL) z_i   ⟹   z_{i,hom} = (1 + hL)^i.                        (1.14)

Note that this is an analogue of e^{a(x_i − x_0)} in Section 0.4, because

    (1 + hL)^i = (1 + hL)^{(x_i − x_0)/h}  →  e^{L(x_i − x_0)}   as h → 0,
where we have used the definition of x_i, found after (1.1), and also the results of Section 0.5.
In close analogy to the method used in Section 0.4, we seek the solution of (1.13) in the
form z_i = c_i z_{i,hom} (with c_0 = 0). Substituting this form into (1.13) and using Eq. (1.14), we
obtain:

    c_{i+1} (1 + hL)^{i+1} = (1 + hL) c_i (1 + hL)^i + (h²/2) M

    ⟹  c_{i+1} = c_i + (h² M) / [ 2 (1 + hL)^{i+1} ]
               = c_{i−1} + (h² M) / [ 2 (1 + hL)^{i} ] + (h² M) / [ 2 (1 + hL)^{i+1} ] = . . .
               = c_0 + \sum_{k=1}^{i+1} (h² M / 2) · 1/(1 + hL)^k

             = [ geometric series ]
             = (h² M) / [ 2 (1 + hL) ] · [ 1 − (1/(1 + hL))^{i+1} ] / [ 1 − 1/(1 + hL) ]
             = (hM / 2L) [ 1 − 1/(1 + hL)^{i+1} ].                               (1.15)

Combining (1.14) and (1.15), and using (0.13), we finally obtain:

    z_{i+1} = (hM / 2L) [ (1 + hL)^{i+1} − 1 ] = (hM / 2L) [ (1 + hL)^{(x_{i+1} − x_0)/h} − 1 ]
            ≈ (hM / 2L) [ e^{L(x_{i+1} − x_0)} − 1 ] = O(h),

    ⟹  |ε_{i+1}| ≤ (hM / 2L) [ e^{L(x − x_0)} − 1 ] = O(h).                     (1.16)

This is the upper bound for the global error of the simple Euler method (1.2).
Thus, in the last two subsections, we have shown that for the simple Euler method:

    Local truncation error = O(h²);
    Discretization error = O(h);
    Global error = O(h).

The exponent of h in the global error is often referred to as the order of the finite-difference
method. Thus, the simple Euler method is a 1st-order method.
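The O(h) behavior of the global error is easy to check numerically: halving h should roughly halve
the error at a fixed endpoint. A minimal sketch (an illustration added here, not from the original
notes), using the test problem y' = −y, y(0) = 1, whose exact solution is e^{−x}:

    import numpy as np

    f = lambda x, y: -y             # y' = -y, exact solution y = exp(-x)
    x_end, y_exact = 2.0, np.exp(-2.0)

    for n in (20, 40, 80, 160):     # the step h = x_end/n is halved each time
        h = x_end / n
        Y = 1.0
        for i in range(n):
            Y += h * f(i * h, Y)    # simple Euler step (1.2)
        print(h, abs(Y - y_exact))  # the error decreases roughly in proportion to h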
Question: How does the above bound for the error change when we include the machine
round-off error (which occurs because numbers are computed with finite accuracy, usually
about 10^{−16})?
Answer: In the above formulae, replace h²M/2 by h²M/2 + r, where r is the maximum
value of the round-off error. Then Eq. (1.16) gets replaced with

    |ε_{i+1}| ≤ ( h²M/2 + r ) (1/hL) [ e^{L(x − x_0)} − 1 ]
             = ( hM/2L + r/(hL) ) [ e^{L(x − x_0)} − 1 ].                       (1.17)

The r.h.s. of the above bound is schematically plotted in the figure below. We see that for very
small h, the term r/h can be dominant.
Moral: Decreasing the step size of the difference equation does not always result in
increased accuracy of the obtained solution.

[Figure: schematic plot of the total error versus the step size h, showing the discretization error
(which grows with h) and the round-off error (which grows as h decreases).]
1.4 Modifications of the Euler method
In this subsection, our goal is to find finite-difference schemes which are more accurate than the
simple Euler method (i.e., the global error of the sought methods should be O(h²) or better).
Again, we first want to develop an intuitive understanding of how this can be done, and
then actually do it. So, to begin, we notice the obvious fact that the ODE y' = f(x, y) is just a
more general case of y' = f(x). The solution of the latter equation is y = ∫ f(x) dx. Whenever
we cannot evaluate the integral analytically in closed form, we resort to approximating the
integral by Riemann sums.
A very crude approximation to ∫_a^b f(x) dx is provided by the left Riemann sums:

    Y_{i+1} = Y_i + h f(x_i).

This is the analogue of the simple Euler method (1.2):

    Y_{i+1} = Y_i + h f(x_i, Y_i).

[Figure: the left Riemann sums over the partition x_0, x_1, x_2, x_3, . . .]

Approximations of the integral ∫_a^b f(x) dx that are known to be more accurate than the left
Riemann sums are the Trapezoidal Rule and the Midpoint Rule:
Trapezoidal Rule:

    Y_{i+1} = Y_i + h [ f(x_i) + f(x_{i+1}) ] / 2.

Its analogue for the ODE is to look like this:

    Y_{i+1} = Y_i + (h/2) [ f(x_i, Y_i) + f(x_{i+1}, Y_i + Ah) ],               (1.18)

where the coefficient A is to be determined. Method (1.18) is called the Modified Euler method.

[Figure: the Trapezoidal Rule over the partition x_0, x_1, x_2, x_3, . . .]

Midpoint Rule:

    Y_{i+1} = Y_i + h f( x_i + h/2 ).

Its analogue for the ODE is to look like this:

    Y_{i+1} = Y_i + h f( x_i + h/2, Y_i + Bh ),                                 (1.19)

where the coefficient B is to be determined. We will refer to method (1.19) as the Midpoint method.

[Figure: the Midpoint Rule over the partition x_0, x_1, x_2, x_3, . . .]

The coefficients A in (1.18) and B in (1.19) are determined from the requirement that the
corresponding finite-difference scheme have the global error O(h²) (as opposed to the simple
Euler's O(h)), or equivalently, the local truncation error O(h³). Below we will determine the
value of A. You will be asked to compute B along similar lines in one of the homework problems.
To determine the coefficient A in the Modified Euler method (1.18), let us rewrite that
equation while Taylor-expanding its r.h.s. using Eq. (0.6) with Δx = h and Δy = Ah:

    Y_{i+1} = Y_i + (h/2) f(x_i, Y_i)
              + (h/2) { f(x_i, Y_i) + [ h f_x(x_i, Y_i) + (Ah) f_y(x_i, Y_i) ] + O(h²) }
            = Y_i + h f(x_i, Y_i) + (h²/2) [ f_x(x_i, Y_i) + A f_y(x_i, Y_i) ] + O(h³).     (1.20)

Equation (1.20) yields the Taylor expansion of the computed solution Y_{i+1}. Let us compare
it with the Taylor expansion of the exact solution y(x_{i+1}). To simplify the notations, we will
denote y'_i ≡ y'(x_i), etc. Then, using Eq. (1.7):

    y_{i+1} = y_i + h y'_i + (h²/2) y''_i + O(h³)
            = y_i + h f(x_i, y_i) + (h²/2) [ f_x(x_i, y_i) + f(x_i, y_i) f_y(x_i, y_i) ] + O(h³).    (1.21)

Upon comparing the last lines of Eqs. (1.20) and (1.21), we see that in order for method (1.18)
to have the local truncation error of O(h³), one should take A = f(x_i, Y_i).
Thus, the Modified Euler method can be programmed into a computer code as follows:

    Y_0 = y_0,
    Ȳ_{i+1} = Y_i + h f(x_i, Y_i),
    Y_{i+1} = Y_i + (h/2) [ f(x_i, Y_i) + f(x_{i+1}, Ȳ_{i+1}) ].                (1.22)

Remark: An alternative way to code the last line of the above equation is

    Y_{i+1} = (1/2) [ Y_i + Ȳ_{i+1} + h f(x_{i+1}, Ȳ_{i+1}) ].                  (1.23)

This way is more efficient, because it requires only one evaluation of the function f, which is
usually the most time-consuming operation, while the last line of (1.22) requires two function
evaluations.
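A sketch of one step of the Modified Euler method in the economical form (1.23), with the Midpoint
step (1.24) of the next paragraph shown alongside for comparison (Python is used only for
illustration; the driver loop is the same as in the simple Euler sketch above):

    def modified_euler_step(f, x, Y, h):
        # One step of (1.22)-(1.23): an Euler predictor, then the averaged slope.
        Y_bar = Y + h * f(x, Y)                          # \bar{Y}_{i+1}
        return 0.5 * (Y + Y_bar + h * f(x + h, Y_bar))   # Eq. (1.23); two f-evaluations in total

    def midpoint_step(f, x, Y, h):
        # One step of the Midpoint method (1.24).
        Y_half = Y + 0.5 * h * f(x, Y)                   # Y_{i+1/2}
        return Y + h * f(x + 0.5 * h, Y_half)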
In a homework problem, you will show that in Eq. (1.19), B = (1/2) f(x_i, Y_i). Then the
Midpoint method can be programmed as follows:

    Y_0 = y_0,
    Y_{i+1/2} = Y_i + (h/2) f(x_i, Y_i),
    Y_{i+1} = Y_i + h f( x_i + h/2, Y_{i+1/2} ).                                (1.24)

Both the Modified Euler and the Midpoint methods have the local truncation error of O(h³)
and the discretization and global errors of O(h²). Thus, these are 2nd-order methods. The
derivation of the local truncation error for the Modified Euler method is given in the Appendix
to this section. This derivation will be needed for solving some of the homework problems.
Remark about notations: Different books use different names for the methods which we have
called the Modified Euler and Midpoint methods.
1.5 An alternative way to improve the accuracy of a finite-difference method: Richardson method / Romberg extrapolation
We have shown that the global error of the simple Euler method is O(h), which means that

    Y^h_i = y_i + O(h) = y_i + (a h + b h² + . . .) = y_i + a h + O(h²),        (1.25)

where a, b, etc. are some constant coefficients that depend on the function f and its derivatives
(as well as on the values of x), but not on h. The superscript h in Y^h_i means that this particular
numerical solution has been computed with the step size h. We can now halve the step size
and re-compute Y^{h/2}_i, which will satisfy

    Y^{h/2}_i = y_i + ( a(h/2) + b(h/2)² + . . . ) = y_i + ( a(h/2) + O(h²) ).   (1.26)

Let us clarify that Y^{h/2}_i is not the numerical solution at x_i + (h/2) but rather the numerical
solution computed from x_0 up to x_i with the step size (h/2).
Equations (1.25) and (1.26) form a system of linear equations for the unknowns a and y_i.
Solving this system, we find

    y_i = 2 Y^{h/2}_i − Y^h_i + O(h²).                                          (1.27)

Thus, a better approximation to the exact solution than either Y^h_i or Y^{h/2}_i is
Y^{improved}_i = 2 Y^{h/2}_i − Y^h_i.
The above method of improving the accuracy of the computed solution is called either the
Romberg extrapolation or the Richardson method. It works for any finite-difference scheme, not
just for the simple Euler. However, it is not computationally efficient. For example, to compute
Y^{improved} as per Eq. (1.27), one requires one function evaluation to compute Y^h_{i+1} from Y^h_i and
two function evaluations to compute Y^{h/2}_{i+1} from Y^{h/2}_i (since we need to use two steps of size h/2
each). Thus, the total number of function evaluations to move from point x_i to point x_{i+1} is
three, compared with two required for either the Modified Euler or Midpoint methods.
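A sketch of the Romberg/Richardson idea (1.27) applied to the simple Euler method: run the same
integration with steps h and h/2 and combine the endpoint values. The helper below assumes the
euler routine sketched in Section 1.1.

    def richardson_euler(f, x0, y0, h, n):
        # Improved endpoint value 2*Y^{h/2} - Y^h of Eq. (1.27):
        # two first-order Euler runs combine into a second-order-accurate value.
        _, Y_h  = euler(f, x0, y0, h,     n)        # step size h
        _, Y_h2 = euler(f, x0, y0, h / 2, 2 * n)    # step size h/2, same final point
        return 2 * Y_h2[-1] - Y_h[-1]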
1.6 Appendix: Derivation of the local truncation error of the Modified Euler method
The idea of this derivation is the same as in Section 1.2, where we derived an estimate for
the local truncation error of the simple Euler method. The details of the present derivation,
however, are more involved. In particular, we will use the following formula, obtained similarly
to (1.7):
    y'''(x) = d³y(x)/dx³ = [use the ODE] = d²f(x, y)/dx² = [use (1.7)]
            = (f_x + f_y f)_x (dx/dx) + (f_x + f_y f)_y (dy/dx) = [use the Product rule]
            = f_xx + f_x f_y + 2 f f_xy + f (f_y)² + f² f_yy.                   (1.28)

Let us recall that in deriving the local truncation error at the point x_{i+1}, one always assumes
that the exact solution y_i and the computed solution Y_i at the previous step (i.e. at the point x_i)
are equal: y_i = Y_i. Also, for brevity of notations, we will write f without arguments to mean
either f(x_i, y_i) or f(x_i, Y_i):

    f ≡ f(x_i, y_i) = f(x_i, Y_i).

By the definition given in Section 1.2, the local truncation error of the Modified Euler
method is computed as follows:

    ε^{ME}_{i+1} = y_{i+1} − Y^{ME}_{i+1},                                      (1.29)

where y_{i+1} and Y^{ME}_{i+1} are the exact and computed solutions at the point x_{i+1}, respectively
(assuming that y_i = Y_i). We first find y_{i+1} using ODE (1.1):

    y_{i+1} = y(x_i + h)
            = y_i + h y'_i + (h²/2) y''_i + (h³/6) y'''_i + O(h⁴)      [use (1.7) and (1.28)]
            = y_i + h f + (h²/2)(f_x + f f_y)
              + (h³/6) [ f_xx + f_x f_y + 2 f f_xy + f (f_y)² + f² f_yy ] + O(h⁴).    (1.30)
We now find Y^{ME}_{i+1} from Eq. (1.22):

    Y^{ME}_{i+1} = Y_i + (h/2) [ f + f(x_i + h, Y_i + hf) ]      [for the last term, use (0.6) with Δx = h and Δy = hf]
                 = Y_i + (h/2) { f + ( f + [ h f_x + h f f_y ]
                   + (1/2!) [ h² f_xx + 2·h·hf·f_xy + (hf)² f_yy ] + O(h³) ) }
                 = Y_i + h f + (h²/2)(f_x + f f_y) + (h³/4)( f_xx + 2 f f_xy + f² f_yy ) + O(h⁴).    (1.31)

Finally, subtracting (1.31) from (1.30), one obtains:

    ε^{ME}_{i+1} = h³ [ (1/6 − 1/4)( f_xx + 2 f f_xy + f² f_yy ) + (1/6)( f_x + f f_y ) f_y ] + O(h⁴)
                 = h³ [ −(1/12)( f_xx + 2 f f_xy + f² f_yy ) + (1/6)( f_x + f f_y ) f_y ] + O(h⁴).   (1.32)

For example, let f(x, y) = ay, where a = const. Then

    f_x = f_xx = f_xy = 0,   f_y = a,   and   f_yy = 0,

so that from (1.32) the local truncation error of the Modified Euler method, applied to the
ODE y' = ay, is found to be

    ε^{ME}_{i+1} = (h³/6) a³ y + O(h⁴).
1.7 Questions for self-assessment
1. What does the notation O(h^k) mean?
2. What are the meanings of the local truncation error, discretization error, and global error?
3. Give an example when the triangle inequality (1.5) holds with the < sign.
4. Be able to explain all steps made in the derivations in Eqs. (1.15) and (1.16).
5. Why are the Modified Euler and Midpoint methods called 2nd-order methods?
6. Obtain (1.27) from (1.25) and (1.26).
7. Explain why the properly programmed Modified Euler method requires exactly two evaluations of f per step.
8. Why may one prefer the Modified Euler method over the Romberg extrapolation based on the simple Euler method?
2 Runge-Kutta methods
2.1 The family of Runge-Kutta methods
In this section, we will introduce a family of increasingly accurate, and time-efficient, methods
called Runge-Kutta methods after two German scientists: the mathematician and physicist Carle
Runge (1856–1927) and the mathematician Martin Kutta (1867–1944).
The Modified Euler and Midpoint methods of the previous section can be written in a form
common to both of these methods:

    Y_{i+1} = Y_i + ( a k_1 + b k_2 );
    k_1 = h f(x_i, Y_i),
    k_2 = h f(x_i + αh, Y_i + β k_1);
    a, b, α, β are some constants.                                              (2.1)

Specifically, for the Modified Euler,

    a = b = 1/2,    α = β = 1;                                                  (2.2)

and for the Midpoint method,

    a = 0,  b = 1,    α = β = 1/2.                                              (2.3)

In general, if we require that method (2.1) have the global error O(h²), we can repeat
the calculations we carried out in Section 1.4 for the Modified Euler method and obtain the
following 3 equations for the 4 unknown coefficients a, b, α, β:

    a + b = 1,    bα = 1/2,    bβ = 1/2.                                        (2.4)

Observations:
Since there are fewer equations than unknowns in (2.4), there are infinitely many
finite-difference methods whose global error is O(h²).
One can generalize form (2.1) and seek methods of higher order (i.e. with the global error
of O(h^k) with k ≥ 3) as follows:

    Y_{i+1} = Y_i + ( a k_1 + b k_2 + c k_3 + . . . );
    k_1 = h f(x_i, Y_i),
    k_2 = h f(x_i + α_2 h, Y_i + β_21 k_1),
    k_3 = h f(x_i + α_3 h, Y_i + β_31 k_1 + β_32 k_2),
    etc.                                                                        (2.5)

This family of methods is called the Runge-Kutta (RK) methods.
For example, if one looks for 4th-order methods, one obtains 11 equations for 13 coefficients.
Again, this says that there are infinitely many 4th-order methods. Historically, the most popular
such method has been

    Y_{i+1} = Y_i + (1/6)( k_1 + 2k_2 + 2k_3 + k_4 );
    k_1 = h f( x_i, Y_i ),
    k_2 = h f( x_i + (1/2)h, Y_i + (1/2)k_1 ),
    k_3 = h f( x_i + (1/2)h, Y_i + (1/2)k_2 ),
    k_4 = h f( x_i + h, Y_i + k_3 ).                                            (2.6)

We will refer to this as the classical Runge-Kutta (cRK) method.
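A direct transcription of (2.6) into code (a sketch in Python rather than MATLAB; the driver loop
is the same as in the earlier Euler sketches):

    def cRK_step(f, x, Y, h):
        # One step of the classical 4th-order Runge-Kutta method (2.6).
        k1 = h * f(x,         Y)
        k2 = h * f(x + h / 2, Y + k1 / 2)
        k3 = h * f(x + h / 2, Y + k2 / 2)
        k4 = h * f(x + h,     Y + k3)
        return Y + (k1 + 2 * k2 + 2 * k3 + k4) / 6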
The table below compares the time-efficiency of the cRK and Modified Euler methods and
shows that the former method is much more efficient.

    Method           | Global error | # of function evaluations per step
    cRK              | O(h⁴)        | 4
    Modified Euler   | O(h²)        | 2

One of the reasons why the cRK method is so popular is that the number of function
evaluations per step in it equals the order of the method. It is known that RK methods of
order n ≥ 5 require more than n function evaluations; i.e. they are less efficient than the cRK
and other lower-order RK methods. For example, a 5th-order RK method would require a
minimum of 6 function evaluations per step.
2.2 Adaptive methods: Controlling step size for given accuracy
In this subsection, we discuss an important question of how the error of the numerical solution
can be controlled and/or kept within a prescribed bound. A more complete and thorough
discussion of this issue can be found in a paper by L.F. Shampine, Error estimation and
control for ODEs, SIAM J. of Scientific Computing, 25, 316 (2005). A preprint of this paper
is available on the course website.
To begin, we emphasize two important points about error control algorithms.
1. These algorithms control the local truncation error, and not the global error, of the solution.
Indeed, the only way to control the global error is to run the simulations more than
once. For example, one can run a simulation with the step h and then repeat it with the
step h/2 to verify that the difference between the two solutions is within a prescribed
accuracy. Although this can be done occasionally (for example, when confirming a key
result of one's paper), it is too time-expensive to do so routinely. Therefore, the error
control algorithms make sure that the local error at each step is less than a given tolerance
(which is in some way related to the prescribed global accuracy), and then just let the
user hope that the global accuracy is met. Fortunately, this hope comes true in most
cases; but see the aforementioned paper for possible problematic cases.
2. The goal of the error control is not only to control the error but also to optimize the
step size used to obtain different portions of the solution. For example, if it is found
that the solution changes very smoothly on a subinterval I_smooth of the computational
interval, then the step size on I_smooth can be taken sufficiently large. On the contrary,
if one detects that the solution changes rapidly on another interval, I_rapid, then the step
size there should be decreased.
Methods where both the solution and its error are evaluated at each step of the calculation
are called adaptive methods. They are most useful in problems with abruptly (or rapidly)
changing coefficients. One simple example of such a problem is the motion of a skydiver: the
air resistance changes abruptly at the moment the parachute opens. This will be discussed in
more detail in the homework.
To present the idea of the algorithm used by adaptive methods, assume for the moment that
we know the exact solution y_i. Let ε_glob be the maximum desired global error and n be the order
of the method. Then the actual local truncation error must be O(h^{n+1}), or c h^{n+1} + O(h^{n+2})
with some constant c. Since the maximum allowed local truncation error, ε_loc, is not prescribed,
it has to be postulated in some plausible manner. The common choice is to take ε_loc = h ε_glob.
Then, the steps of the algorithm of an adaptive method are as follows (a code sketch follows
this description).
1. At each x_i, compute the actual local truncation error ε_i = |y_i − Y_i| and compare it with
ε_loc. (The practical implementation of this step is described later.)
2a. If ε_i < ε_loc, then accept the solution, multiply the step size by γ (ε_loc/ε_i)^{1/(n+1)} (where
γ is some numerical coefficient less than 1), and proceed to the next step.
2b. If ε_i > ε_loc, then multiply the step size by γ (ε_loc/ε_i)^{1/(n+1)}, re-calculate the solution,
and check the error. If the actual error is acceptable, proceed to the next step. If not, repeat
this step again.
Note that with the above step size adjustment, the error at the next step is expected to be
approximately

    c [ γ (ε_loc/ε_i)^{1/(n+1)} h ]^{n+1} = ε_loc γ^{n+1} ( c h^{n+1} / ε_i ) ≈ ε_loc γ^{n+1}.

The coefficient γ < 1 (say, γ = 0.9) is included to avoid the situation where the computed error
just slightly exceeds the allowed bound, which would be acceptable to a human, but the
computer will have to recalculate the entire step, thereby wasting expensive function evaluations.
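The accept/reject logic of steps 1–2b can be sketched as follows. The one-step routine step and the
error estimator err are generic placeholders (for instance, err could return the RK-Fehlberg
difference |Y^[5] − Y^[4]| described below); the names step, err, eps_glob and the safety factor
gamma = 0.9 are assumptions of this sketch, not notation from the notes.

    def adaptive_march(step, err, x, Y, h, x_end, eps_glob, order, gamma=0.9):
        # March from x to x_end keeping the local error below eps_loc = h*eps_glob.
        while x < x_end:
            h = min(h, x_end - x)                    # do not overshoot the endpoint
            eps_loc = h * eps_glob
            e = max(err(x, Y, h), 1e-300)            # estimate of the local truncation error
            factor = gamma * (eps_loc / e) ** (1.0 / (order + 1))
            if e <= eps_loc:                         # step 2a: accept, then enlarge h
                Y = step(x, Y, h)
                x += h
                h *= factor
            else:                                    # step 2b: reject, shrink h and retry
                h *= factor
        return x, Y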
Now, in reality, the exact solution of the ODE is not known. Then one can use the following
trick. Suppose the numerical method we use is of sufficiently high order (e.g., the order 4 of the
cRK method is sufficiently high for all practical purposes). Then we can compute the solution
Y^h_i with the step size h and at each step compare it with the solution Y^{h/2}_i, obtained with the
step size halved. For example, the cRK method is of fourth order, and hence Y^{h/2}_i
should be closer to the exact solution than Y^h_i is by about 2⁴ = 16 times. Then one can declare
Y^{h/2}_i to be the exact solution, compute ε̃_i = |Y^{h/2}_i − Y^h_i|, and use ε̃_i in place of the ε_i above.
However, this way is very inefficient. For example, for the cRK method, it would require 7
additional function evaluations per step (needed to advance Y^{h/2} from x_i to x_{i+1}). Therefore,
people have designed alternative approaches to control the error size. Below we briefly describe
the ideas behind two such approaches.
Runge-Kutta-Fehlberg method¹
Idea: Design a 5th-order method that would share some of the function evaluations with
a 4th-order method. The solution Y^{[5]}_i, obtained using the 5th-order method, is expected to
be much more accurate than the solution Y^{[4]}_i, obtained using the 4th-order method. Then we
declare ε̃_i = |Y^{[5]}_i − Y^{[4]}_i| to be the numerical error and adjust the step size based on that error
relative to the allowed tolerance.
Implementation:

    Y^{[4]}_{i+1} = Y_i + [ (25/216) k_1 + (1408/2565) k_3 + (2197/4104) k_4 − (1/5) k_5 ],
    Y^{[5]}_{i+1} = Y_i + [ (16/135) k_1 + (6656/12825) k_3 + (28561/56430) k_4 − (9/50) k_5 + (2/55) k_6 ];

    k_1 = h f( x_i, Y_i ),
    k_2 = h f( x_i + (1/4)h, Y_i + (1/4) k_1 ),
    k_3 = h f( x_i + (3/8)h, Y_i + (3/32) k_1 + (9/32) k_2 ),
    k_4 = h f( x_i + (12/13)h, Y_i + (1932/2197) k_1 − (7200/2197) k_2 + (7296/2197) k_3 ),
    k_5 = h f( x_i + h, Y_i + (439/216) k_1 − 8 k_2 + (3680/513) k_3 − (845/4104) k_4 ),
    k_6 = h f( x_i + (1/2)h, Y_i − (8/27) k_1 + 2 k_2 − (3544/2565) k_3 + (1859/4104) k_4 − (11/40) k_5 ),

    where Y_i ≡ Y^{[5]}_i.                                                      (2.7)
Altogether, there are only 6 function evaluations per step, because the 4th- and 5th-order
methods share 4 function evaluations.
Runge-Kutta-Merson method
Idea: For certain choices of the auxiliary functions k_1, k_2, etc., the local truncation error
of, say, a 4th-order RK method can be made equal to C_5 h⁵ y^{(5)}(x_i) + O(h⁶) with some known
coefficient C_5. (Note that this local truncation error is proportional to the (n+1)-st derivative
of the solution, where n is the order of the method. We observed a similar situation earlier
for the simple Euler method; see Eq. (1.3).) On the other hand, a certain linear combination
of the k's can also be chosen to equal C_5 h⁵ y^{(5)}(x_i) + O(h⁶) for a certain class of functions
(namely, for linear functions: f(x, y) = a(x) + b y, where b = const). Thus, we can obtain
both an approximate solution and an estimate for its error. We can then use that estimate to
adjust the step size so as to always make the (estimate for the) local truncation error below a
prescribed maximum value.
For example, if one computes the solution Y_i using the cRK method and then, in addition,
evaluates

    k_5 = h f( x_i + (3/4)h, Y_i + (1/32)[ 5 k_1 + 7 k_2 + 13 k_3 − k_4 ] ),    (2.8)
¹ It is interesting to note that while the cRK method was developed in the early 1900s, its extension by Fehlberg
was proposed only in 1970.
then it can be shown (with a great deal of algebra) that

    Local truncation error ≈ (2/3) h ( k_1 + 3 k_2 + 3 k_3 + 3 k_4 − 8 k_5 ) + O(h⁶).    (2.9)

Here the ≈ sign is used instead of the = because the equality holds only for f(x, y) =
a(x) + by, where b = const. Thus, again, by evaluating the function f just one extra time compared
to the cRK method, one obtains both the numerical solution and a crude estimate for its error.
Then this error estimate can be used as the actual error ε_i in the algorithm of the corresponding
adaptive method.
Implementation: More popular than the method described by (2.8) and (2.9), however, is
another method based on the same idea and called the Runge-Kutta-Merson method:

    Y_{i+1} = Y_i + (1/6)( k_1 + 4 k_4 + k_5 );

    k_1 = h f( x_i, Y_i ),
    k_2 = h f( x_i + (1/3)h, Y_i + (1/3) k_1 ),
    k_3 = h f( x_i + (1/3)h, Y_i + (1/6)( k_1 + k_2 ) ),
    k_4 = h f( x_i + (1/2)h, Y_i + (1/8)( k_1 + 3 k_3 ) ),
    k_5 = h f( x_i + h, Y_i + (1/2)( k_1 − 3 k_3 + 4 k_4 ) ),

    Local truncation error ≈ (1/30)( 2 k_1 − 9 k_3 + 8 k_4 − k_5 ).             (2.10)
Once again, one should note that the last line above is only a crude estimate for the trun-
cation error (valid only when f(x, y) is a linear function of y). Indeed, if it had been valid for
any f(x, y), then we would have a contradiction with a statement found at the end of Sec. 2.1.
(Which statement is that?)
To conclude this presentation of the adaptive RK methods, we must specify what solution
is taken at x_{i+1}. For example, for the RK-Fehlberg method, we have the choice between setting
Y_{i+1} to either Y^{[4]}_{i+1} or Y^{[5]}_{i+1}. Common sense suggests setting Y_{i+1} = Y^{[5]}_{i+1}, because, after all,
it is Y^{[5]}_{i+1} that we have declared to be our etalon solution. This choice does work in most
circumstances, although there are important exceptions (see the paper by L. Shampine). Thus,
what the RK-Fehlberg method does is compute a 5th-order-accurate solution while controlling
the error of a less accurate 4th-order solution related to it.
2.3 Questions for self-assessment
1. List the 13 coefficients mentioned in the paragraph after Eq. (2.5). Do not write the 11
equations.
2. If the step size is reduced by a factor of 2, how much will the errors of the cRK and the
Modified Euler methods be reduced? Which of these methods is more accurate?
3. Suppose f = f(x) (on the r.h.s. of the ODE); that is, f does not depend on y but only
on x. What numerical integration method (studied in Calculus 2) does the cRK method
reduce to? [Hint: Rewrite Eq. (2.6) for f = f(x).]
4. List the 7 function evaluations mentioned in the paragraph before the title
Runge-Kutta-Fehlberg method.
5. Describe the idea behind the Runge-Kutta-Fehlberg method.
6. Describe the idea behind the Runge-Kutta-Merson method.
7. Which statement is meant in the paragraph following Eq. (2.10)?
8. One of the built-in ODE solvers in MATLAB is called ode45. What do you think the
origin of this name is? Without reading the description of this solver under MATLAB's
help browser, can you guess what order this method is?
3 Multistep, Predictor-Corrector, and Implicit methods
In this section, we will introduce methods that may be as accurate as high-order Runge-Kutta
methods but will require fewer function evaluations.
We will also introduce implicit methods, whose significance will become clearer in a later
section.
3.1 Idea behind multistep methods
The figure on the right illustrates the (familiar) fact that if you know y'(x_i), i.e. the slope
of y(x), then you can compute a first-order accurate approximation Y^{1st order}_{i+1} to the solution
y_{i+1}. Likewise, if you know the slope and the curvature of your solution at a given point, you
can compute a second-order accurate approximation, Y^{2nd order}_{i+1}, to the solution at the next
step.

[Figure: the values Y_i, Y^{1st ord}_{i+1}, Y^{2nd ord}_{i+1}, and y_{i+1}; the first-order value matches the slope,
while the second-order value matches both the slope and the curvature.]

Now, recall that curvature is proportional to y''. This motivates the following.
Question: How can we find an approximation to y''_i using already computed values Y_{i−k},
k = 0, 1, 2, . . . ?
Answer: Note that

    y''_i ≈ ( y'_i − y'_{i−1} ) / h = ( f_i − f_{i−1} ) / h.                    (3.1)
Here and below we will use the notation f_i in two slightly different ways:

    f_i ≡ f(x_i, y_i)    or    f_i ≡ f(x_i, Y_i),                               (3.2)

whenever this does not cause any confusion.
Continuing with Eq. (3.1), we can state it more specifically by writing

    y''_i = ( y'_i − y'_{i−1} ) / h + O(h) = ( f_i − f_{i−1} ) / h + O(h),      (3.3)

where we will compute the O(h) term later. For now, we use (3.3) to approximate y_{i+1} as
follows:

    y_{i+1} = y(x_i + h) = y_i + h y'_i + (h²/2) y''_i + O(h³)
            = y_i + h f_i + (h²/2) [ ( f_i − f_{i−1} )/h + O(h) ] + O(h³)
            = y_i + h [ (3/2) f_i − (1/2) f_{i−1} ] + O(h³).                    (3.4)

Remark 1: To start the corresponding finite-difference method, i.e.

    Y_{i+1} = Y_i + h [ (3/2) f_i − (1/2) f_{i−1} ],                            (3.5)
(now we use f_i as f(x_i, Y_i)), one needs two initial points of the solution, Y_0 and Y_1. These can
be computed, e.g., by the simple Euler method; this is discussed in more detail in Section 3.4.
Remark 2: Equation (3.4) becomes exact rather than approximate if y(x) = p_2(x) ≡ ax² + bx + c
is a second-degree polynomial in x. Indeed, in such a case,

    y'_i = 2a x_i + b,   and   y''_i = 2a = ( y'_i − y'_{i−1} ) / h;            (3.6)

(note the exact equality in the last formula). We will use this remark later on.
Method (3.5) is of the second order. If we want to obtain a third-order method along the
same lines, we need to use the third derivative of the solution:

    y'''_i = ( y'_i − 2 y'_{i−1} + y'_{i−2} ) / h² + O(h)                       (3.7)

(you will be asked to verify this equation in one of the homework problems). Then we proceed
as in Eq. (3.4), namely:

    y_{i+1} = y_i + h y'_i + (h²/2) y''_i + (h³/6) y'''_i + O(h⁴).              (3.8)
If you now try to substitute the expression on the r.h.s. of (3.3) for y''_i, you notice that you
actually need an expression for the O(h)-term there that would have accuracy of O(h²). Here
is the corresponding calculation:

    ( y'_i − y'_{i−1} ) / h = [ y'(x_i) − y'(x_{i−1}) ] / h
                            = [ y'_i − ( y'_i − h y''_i + (h²/2) y'''_i + O(h³) ) ] / h
                            = y''_i − (h/2) y'''_i + O(h²),                      (3.9)

whence

    y''_i = ( y'_i − y'_{i−1} ) / h + (h/2) y'''_i + O(h²).                      (3.10)

To complete the derivation of the third-order finite-difference method, we substitute Eqs.
(3.10), (3.7), and y'_i = f_i etc. into Eq. (3.8). The result is:

    Y_{i+1} = Y_i + (h/12) [ 23 f_i − 16 f_{i−1} + 5 f_{i−2} ];                 (3.11)

the local truncation error of this method is O(h⁴). Method (3.11) is called the 3rd-order
Adams-Bashforth method.
Similarly, one can derive higher-order Adams-Bashforth methods. For example, the 4th-order
Adams-Bashforth method is

    Y_{i+1} = Y_i + (h/24) [ 55 f_i − 59 f_{i−1} + 37 f_{i−2} − 9 f_{i−3} ].    (3.12)
Methods like (3.5), (3.11), and (3.12) are called multistep methods. To start a multistep
method, one requires more than one initial point of the solution (in the examples considered
above, the number of required initial points equals the order of the method).
Comparison of multistep and Runge-Kutta methods
The advantage of multistep over single-step RK methods of the same accuracy is that the
multistep methods require only one function evaluation per step, while, e.g., the cRK method
requires 4, and the RK-Fehlberg method 6, function evaluations.
The disadvantage of the multistep methods is that changing the step size for them is rather
complicated (it requires interpolation of the numerical solution), while for the single-step RK
methods this is a straightforward procedure.
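A sketch of the 3rd-order Adams-Bashforth method (3.11) in code (Python, for illustration only).
The two starting values Y_1, Y_2 are produced here by a one-step routine such as the cRK_step
sketched in Lecture 2; any starting method of sufficient order works (see Section 3.4).

    import numpy as np

    def adams_bashforth3(f, x0, y0, h, n, start_step):
        # 3rd-order Adams-Bashforth (3.11); start_step(f, x, Y, h) supplies Y_1 and Y_2.
        x = x0 + h * np.arange(n + 1)
        Y = np.empty(n + 1)
        Y[0] = y0
        for i in range(2):                        # two starting values from a one-step method
            Y[i + 1] = start_step(f, x[i], Y[i], h)
        F = [f(x[i], Y[i]) for i in range(3)]     # holds f_{i-2}, f_{i-1}, f_i
        for i in range(2, n):
            Y[i + 1] = Y[i] + (h / 12) * (23 * F[2] - 16 * F[1] + 5 * F[0])
            F = [F[1], F[2], f(x[i + 1], Y[i + 1])]   # one new f-evaluation per step
        return x, Y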
3.2 An alternative way to derive formulae for multistep methods
Recall that the 2nd-order Adams-Bashforth method (3.5) was exact on solutions y(x) that are
2nd-degree polynomials: y(x) = p_2(x) (see Remark 2 after Eq. (3.4)). Similarly, one expects
that the 3rd-order Adams-Bashforth method should be exact for y(x) = p_3(x). We will now
use this observation to derive the formula for this method, Eq. (3.11), in a different manner
than in Sec. 3.1.
To begin, we take, according to the above observation, f(x, y) = y'(x) = (p_3(x))' = p_2(x),
i.e. a 2nd-degree polynomial in x. We now integrate the differential equation y' = f(x, y) from
x_i to x_{i+1} and obtain:

    y_{i+1} = y_i + ∫_{x_i}^{x_{i+1}} f(x, y(x)) dx.                            (3.13)

Let us approximate the integral by a quadrature formula, as follows:

    ∫_{x_i}^{x_{i+1}} f(x, y(x)) dx ≈ h ( b_0 f_i + b_1 f_{i−1} + b_2 f_{i−2} )    (3.14)

and require that the above equation hold exactly, rather than approximately, for any f(x, y(x)) =
p_2(x). This is equivalent to requiring that (3.14) hold exactly for f = 1, f = x, and f = x².
Without loss of generality², one can set x_i = 0 and then rewrite Eq. (3.14) for the above three
forms of f:

    for f = 1:    ∫_0^h 1 dx  = h      = h ( b_0·1 + b_1·1 + b_2·1 ),
    for f = x:    ∫_0^h x dx  = h²/2   = h ( b_0·0 + b_1·(−h) + b_2·(−2h) ),
    for f = x²:   ∫_0^h x² dx = h³/3   = h ( b_0·0 + b_1·(−h)² + b_2·(−2h)² ).   (3.15)

Equations (3.15) constitute a linear system of 3 equations for the 3 unknowns b_0, b_1, and b_2. Solving
it, we obtain

    b_0 = 23/12,    b_1 = −16/12,    b_2 = 5/12,

which in combination with Eq. (3.14) yields the same method as (3.11). Methods of higher
order can be obtained similarly.
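The small system (3.15) can also be solved mechanically; the sketch below (an added illustration)
reproduces b_0 = 23/12, b_1 = −16/12, b_2 = 5/12.

    import numpy as np

    h = 1.0                                   # the h's cancel, so h = 1 suffices
    A = np.array([[1.0,  1.0,    1.0   ],     # equation for f = 1
                  [0.0, -h,     -2*h   ],     # equation for f = x
                  [0.0,  h**2,   4*h**2]])    # equation for f = x^2
    rhs = np.array([h, h**2 / 2, h**3 / 3]) / h
    print(np.linalg.solve(A, rhs))            # -> [ 1.9167, -1.3333, 0.4167 ]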
² In a homework problem, you will be asked to show this.
3.3 A more general form of multistep methods, with examples
The Adams-Bashforth methods above have the following common form:

    Y_{i+1} − Y_i = h \sum_{k=0}^{N} b_k f_{i−k}.                               (3.16)

As has been shown in Sec. 3.2, the sum on the r.h.s. approximates ∫_{x_i}^{x_{i+1}} f(x, y(x)) dx.
Let us now consider multistep methods of a more general form:

    Y_{i+1} − \sum_{k≥0} a_k Y_{i−k} = h \sum_{k=0}^{N} b_k f_{i−k}.            (3.17)

Note that the sum on the r.h.s. of (3.17), unlike that in (3.16), does not have a straightforward
interpretation. In the next Lecture, we will discover that many methods of the form (3.17)
have a serious flaw in them, but for now let us consider two particular examples, focusing only
on the accuracy of the following methods.
Simple center-difference (Leap-frog) method
Recall that

    ( y_i − y_{i−1} ) / h = y'_i + O(h).                                        (3.18)

However³,

    ( y_{i+1} − y_{i−1} ) / (2h) = y'_i + O(h²).                                (3.19)

Thus, the l.h.s. of (3.19) provides a more accurate approximation to y'_i than does the l.h.s. of
(3.18). So we use Eq. (3.19) to produce a 2nd-order method:

    Y_{i+1} = Y_{i−1} + 2 h f_i,                                                (3.20)

which is of the form (3.17). We need both Y_0 and Y_1 to start this method.

A divergent third-order method
(The term divergent will be explained in the next Lecture.)
Let us try to increase the order of method (3.20) from 2nd to 3rd by including extra terms
into the scheme:

    Y_{i+1} − ( a_0 Y_i + a_1 Y_{i−1} + a_2 Y_{i−2} ) = b_0 h f_i,              (3.21)

where we now require that the local truncation error of (3.21) be O(h⁴). We can follow the
derivation found either in Sec. 3.1 (Taylor-series expansion) or Sec. 3.2 (requiring that (3.21)
hold true for y = p_3(x)) to obtain the values of the coefficients a_0 through a_2, and b_0. The
result is:

    Y_{i+1} + (3/2) Y_i − 3 Y_{i−1} + (1/2) Y_{i−2} = 3 h f_i.                  (3.22)

Supposedly, method (3.22) is more accurate than the Leap-frog method (3.20). However, we will
show in the next Lecture that method (3.22) is completely useless for numerical computations.

³ Again, you will be asked to verify this.
3.4 Starting a multistep method
To start any of the single-step methods considered in Lectures 1 and 2, one only needs to know
the initial condition, Y_0 = y_0, at x = x_0. To start any multistep method, one needs to know
the numerical solution at several points. For example, to start an Adams-Bashforth method of
order m, one would need the values Y_0, . . . , Y_{m−1} (see Eqs. (3.5), (3.11), and (3.12)). That is,
to start an mth-order method, one needs to know the solution at the first m points. We will
now address the following question:
Suppose that we want to start a multistep method of order m using the values Y_1, . . . , Y_{m−1}
that have been computed by a starting (single-step) method of order n. What should the order
n of the starting method be so as not to compromise the order m of the multistep method?
First, it is clear that if n ≥ m, then the local error made in the computation of Y_1, . . . , Y_{m−1}
and of the terms on the r.h.s. of (3.16) and (3.17) will be at least as small (in the order-of-magnitude
sense) as the local error of the multistep method. So, using a starting method whose
order is no less than the order of the multistep method will not degrade the accuracy of the
latter method. But is it possible to use a starting method with n < m with the same end result?
We will now show that for methods of the form (3.16) it is possible to take n = m−1 (i.e.,
the starting method's order may be one less than the multistep method's order).⁴ With a little
more work, one can show that the same answer holds also for the particular case of method
(3.17) given by Eq. (3.24) below. (E.g., the Leap-frog method is a particular representative of
the latter case.)
The local truncation errors of Y_1 through Y_{m−1} are O(h^{n+1}). Then the error contributed to
Y_m from the second term (i.e., from Y_i with i = m−1) on the l.h.s. of (3.16) is O(h^{n+1}):

    error of l.h.s. of (3.16) = O(h^{n+1}).

Next, if f_i through f_{i−N} on the r.h.s. were calculated using the exact solution y(x), then the
error of the r.h.s. would have been O(h^{m+1}). Indeed, this error is just the local truncation error
of method (3.16) that arises due to the approximation of ∫_{x_i}^{x_{i+1}} f(x, y(x)) dx by h \sum_{k=0}^{N} b_k f_{i−k}.
However, the f_{i−k}'s are calculated using values Y_1 through Y_{m−1} which themselves have been
obtained with the error O(h^{n+1}) of the starting method. Then the error of each f_{i−k} is also
O(h^{n+1}).⁵ Therefore,

    error of r.h.s. of (3.16) = O(h^{m+1}) + h · O(h^{n+1}) = max{ O(h^{n+2}), O(h^{m+1}) }.

Thus, the total local truncation error in Y_m that comes from the l.h.s. and r.h.s. of (3.16) is
O(h^{n+1}) (recall that we are only interested in the situation where n < m). In order not to
decrease the accuracy of the multistep method, this error must satisfy two criteria:
(i) It must have the same order of magnitude as the global error at the end of the computation,
i.e., O(h^m); and in addition,
(ii) It may propagate to the next computed solution, i.e., to Y_{i+2}, but it must not accumulate
at each step with other errors of the same magnitude.

⁴ Unfortunately, I was unable to find any detailed published proof of this result, and so the derivation found
below is my own. As such, it is subject to mistakes. However, a set of Matlab codes accompanying this
Lecture, where the 3rd-order Adams-Bashforth method (3.11) can be started using the Modified Euler, Midpoint,
or simple Euler method, shows that if not this derivation itself, then at least its result is probably correct.
⁵ If the last statement is not clear, do not worry and just read on. More details about it are presented in the
derivation of Eq. (3.29) in the next Section, and also in the Appendix.
One can easily see that criterion (i) is indeed satisfied for n + 1 = m, i.e., when n = m − 1.
As for criterion (ii), it is also satisfied. To see that this is the case, it suffices to repeat the
above derivation for the error at the next step, i.e. at Y_{i+2} ≡ Y_{m+1}. Then one can see that
the only contribution of order O(h^{n+1}) to the error in Y_{m+1} will come from Y_m, and it will
not combine with any other error of the same order. An analogous statement will hold for the
errors in Y_{m+2}, Y_{m+3}, etc. In this fashion, the original error made in the computation of Y_{m−1}
will simply propagate to the end of the computational interval. Thus, the global error will be:

    [ propagated error from Y_{m−1} = O(h^{n+1}) ]
    + [ accumulated local truncation error = O(1/h) · O(h^{m+1}) ]
    = O(h^m)   for n = m−1.

Thus, the order n of the starting method should be no lower than (m−1), where m is the
order of the multistep method (3.16).
For methods of the more general form (3.17) that do not reduce to (3.24), an answer to
the question stated at the beginning of this Section cannot be obtained as simply. Moreover,
I have seen indirect indications in published literature that the above derivation may not be
valid for those more general multistep methods. Therefore, it is a good idea to use a method
of order m when starting a multistep method (3.17) of order m.
3.5 Predictor-corrector methods: General form
Let us recall the Modified Euler method introduced in Lecture 1 and write it here using slightly
different notations:

    Y^p_{i+1} = Y_i + h f_i,
    Y^c_{i+1} = Y_i + (1/2) h [ f_i + f(x_{i+1}, Y^p_{i+1}) ],
    Y_{i+1} = Y^c_{i+1}.                                                        (3.23)

We can interpret the above as follows: We first predict the new value of the solution Y_{i+1} by
the first equation, and then correct it by the second equation. Methods of this kind are called
predictor-corrector (PC) methods.
Question: What is the optimal relation between the orders of the predictor and corrector
equations?
Answer: The example of the Modified Euler method suggests that the order of the corrector
should be one higher than that of the predictor. More precisely, the following theorem
holds:
Theorem  If the order of the corrector equation is n, then the order of the corresponding
PC method is also n, provided that the order of the predictor equation is no less than n − 1.
Proof  We will assume that the global error of the corrector equation by itself is O(h^n)
and the global error of the predictor equation by itself is O(h^{n−1}). Then we will prove that the
global error of the combined PC method is O(h^n).
The general forms of the predictor and corrector equations are, respectively:

    Predictor:   Y^p_{i+1} = Y_{i−Q} + h \sum_{k=0}^{N} p_k f_{i−k},            (3.24)

    Corrector:   Y^c_{i+1} = Y_{i−D} + h \sum_{k=0}^{M} c_k f_{i−k} + h c_{−1} f(x_{i+1}, Y^p_{i+1}).    (3.25)
In the above two equations, Q, D, N, M are some integer nonnegative numbers. (One of the
questions at the end of this Lecture asks you to represent Eq. (3.23) in the form (3.24), (3.25),
i.e. to give values for Q, D, N, M and the coefficients p_k's and c_k's.)
As we have done in previous derivations, let us assume that all computed values Y_{i−k},
k = 0, 1, 2, . . . coincide with the exact solution at the corresponding points: Y_{i−k} = y_{i−k}. Then
we can use the identity

    y_{i+1} = y_{i−Q} + ( y_{i+1} − y_{i−Q} )  = [see (3.13)]
            = y_{i−Q} + ∫_{x_{i−Q}}^{x_{i+1}} y'(x) dx = Y_{i−Q} + ∫_{x_{i−Q}}^{x_{i+1}} f(x, y(x)) dx

and rewrite Eq. (3.24) as:

    Y^p_{i+1} = Y_{i−Q} + ∫_{x_{i−Q}}^{x_{i+1}} f(x, y(x)) dx
                + [ h \sum_{k=0}^{N} p_k f_{i−k} − ∫_{x_{i−Q}}^{x_{i+1}} f(x, y(x)) dx ]

    ⟹  Y^p_{i+1} = y_{i+1} + E_P.                                               (3.26)

Here E_P is the error made by replacing the exact integral

    ∫_{x_{i−Q}}^{x_{i+1}} f(x, y(x)) dx

by the linear combination of the f_{i−k}'s found on the r.h.s. of (3.24). Since, by the condition of the
Theorem, the global error of the predictor equation is O(h^{n−1}), the local truncation error
E_P has the order of O(h^{(n−1)+1}) = O(h^n).
Similarly, Eq. (3.25) is rewritten as

    Y^c_{i+1} = y_{i+1} + E_C + h c_{−1} [ f(x_{i+1}, Y^p_{i+1}) − f(x_{i+1}, y_{i+1}) ].    (3.27)
Here E_C is the error obtained by replacing the exact integral

    ∫_{x_{i−D}}^{x_{i+1}} f(x, y(x)) dx

by the quadrature formula

    h \sum_{k=−1}^{M} c_k f_{i−k}

(note that the lower limit of the summation is different from that in (3.25)!). The last term on
the r.h.s. of (3.27) occurs because, unlike all previously computed Y_{i−k}'s, the Y^p_{i+1} ≠ y_{i+1}.
To complete the proof,⁶ we need to show that Y^c_{i+1} − y_{i+1} = O(h^{n+1}) in (3.27). By the
condition of the Theorem, the corrector equation has order n, and hence the local truncation
error E_C = O(h^{n+1}). Then all that remains to be estimated is the last term on the r.h.s. of
(3.27). To that end, we recall that f satisfies the Lipschitz condition with respect to y, whence

    | f(x_{i+1}, Y^p_{i+1}) − f(x_{i+1}, y_{i+1}) | ≤ L | Y^p_{i+1} − y_{i+1} | = L |E_P|,    (3.28)

where L is the Lipschitz constant. Combining Eqs. (3.27) and (3.28) and using the triangle
inequality (1.5), we finally obtain

    | Y^c_{i+1} − y_{i+1} | ≤ |E_C| + h L |E_P| = O(h^{n+1}) + h · O(h^n) = O(h^{n+1}),       (3.29)

which proves that the PC method has the local truncation error of order n + 1, and hence is
an nth-order method.   q.e.d.
We now present two PC pairs that in applications are sometimes preferred⁷ over the Modified
Euler method.
(⁶ At this point, you have probably forgotten what we are proving. Pause, re-read the Theorem's
statement, and then come back to finish the reading.
⁷ In the next Section we will explain why this is so.)
The first pair is:
    Predictor:   Y^p_{i+1} = Y_i + (1/2) h ( 3 f_i − f_{i−1} ),
    Corrector:   Y^c_{i+1} = Y_i + (1/2) h [ f_i + f^p_{i+1} ],                 (3.30)

where f^p_{i+1} = f(x_{i+1}, Y^p_{i+1}). The order of the PC method (3.30) is two.
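A sketch of the pair (3.30) in code (Python, for illustration). The single additional starting value
Y_1 is produced here by one Modified Euler step (a 2nd-order one-step method), using the helper
modified_euler_step sketched in Lecture 1.

    import numpy as np

    def pc2(f, x0, y0, h, n):
        # Predictor-corrector pair (3.30): 2nd-order Adams-Bashforth predictor + trapezoidal corrector.
        x = x0 + h * np.arange(n + 1)
        Y = np.empty(n + 1)
        Y[0] = y0
        Y[1] = modified_euler_step(f, x[0], Y[0], h)       # 2nd-order starting step
        f_prev, f_cur = f(x[0], Y[0]), f(x[1], Y[1])
        for i in range(1, n):
            Yp = Y[i] + 0.5 * h * (3 * f_cur - f_prev)             # predictor
            Y[i + 1] = Y[i] + 0.5 * h * (f_cur + f(x[i + 1], Yp))  # corrector
            f_prev, f_cur = f_cur, f(x[i + 1], Y[i + 1])           # two f-evaluations per step
        return x, Y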
The other pair is:

    Predictor: 4th-order Adams-Bashforth
        Y^p_{i+1} = Y_i + (1/24) h ( 55 f_i − 59 f_{i−1} + 37 f_{i−2} − 9 f_{i−3} ).
    Corrector: 4th-order Adams-Moulton
        Y^c_{i+1} = Y_i + (1/24) h ( 9 f^p_{i+1} + 19 f_i − 5 f_{i−1} + f_{i−2} ),      (3.31)

This PC method as a whole has the same name as its corrector equation: the 4th-order
Adams-Moulton.
3.6 Predictor-corrector methods: Error control
An observation one can make from Eqs. (3.30) is that both the predictor and corrector equations
have order two (i.e. local truncation errors of O(h³)). In view of the Theorem of
the previous subsection, this may seem to be unnecessary. Indeed, the contribution of the
predictor's local truncation error is h·O(h³) = O(h⁴) (see Eq. (3.29)), while the local truncation
error of the corrector equation (which determines that of the entire PC method) is only O(h³).
There is, however, an important consideration because of which method (3.30) may be preferred
over the Modified Euler. Namely, one can monitor the error size in (3.30), whereas the Modified
Euler does not give its user such a capability. Below we explain this statement in detail. A
similar treatment can be applied to the Adams-Moulton method (3.31).
The key fact is that the local truncation errors of the predictor and corrector equations
(3.30) are proportional to each other in the leading order:

    y_{i+1} − Y^p_{i+1} =  (5/12) h³ y'''_i + O(h⁴),                            (3.32)
    y_{i+1} − Y^c_{i+1} = −(1/12) h³ y'''_i + O(h⁴).                            (3.33)

For the reader's information, the analogues of the above estimates for the Adams-Moulton
method (3.31) are:
4th-order Adams-Moulton method:

    y_{i+1} − Y^p_{i+1} =  (251/720) h⁵ y^{(5)}_i + O(h⁶),
    y_{i+1} − Y^c_{i+1} = −(19/720) h⁵ y^{(5)}_i + O(h⁶).

We derive (3.33) in the Appendix to this Lecture, while the derivation of (3.32) is left as an
exercise. Here we only note that the derivation of (3.33) hinges upon the fact that y_i − Y^p_i =
O(h³), which is guaranteed by (3.32). Otherwise, i.e. if y_i − Y^p_i = O(h²), as in the predictor for
the Modified Euler method, the term on the r.h.s. of (3.33) would not have had such a simple
form.
We will now explain how (3.32) and (3.33) can be used together to control the error of the
PC method (3.30). From (3.33) we obtain the error of the corrector equation:
|
c
i+1
|
1
12
h
3
|y

i
| . (3.34)
On the other hand, from Eqs. (3.32) and (3.33) together, we have
|Y
p
i+1
Y
c
i+1
|
_
5
12
+
1
12
_
h
3
|y

i
|. (3.35)
Thus, from (3.34) and (3.35) one can estimate the error via the dierence of the predicted and
corrected values of the solution:
|
c
i+1
|
1
6
|Y
p
i+1
Y
c
i+1
| . (3.36)
Moreover, Eqs. (3.32) and (3.33) can also be used to obtain a higher-order method than
(3.30), because they imply that
y
i+1
=
1
6
_
Y
p
i+1
+ 5Y
c
i+1
_
+ O(h
4
) .
Hence
Y
i+1
=
1
6
_
Y
p
i+1
+ 5Y
c
i+1
_
(3.37)
produces a more accurate approximation to the solution than either Y
p
i+1
or Y
c
i+1
alone. (Note
a similarity with the Romberg extrapolation described in Lecture 1.)
Thus, Eqs. (3.30), (3.36), and (3.37) can be used to program a PC method with an adaptive step size. Namely, suppose that we have a goal that the local truncation error of our numerical solution does not exceed a specified number $\epsilon_{loc}$. Then:
1. Compute $Y^p_i$ and $Y^c_i$ from (3.30).
2. Compare the error calculated from (3.36) with $\epsilon_{loc}$ and then adjust the step size as explained in Lecture 2. (As we pointed out at the end of Sec. 3.1, the adjustment of the step size in multistep methods is awkward; so it may be more practical to just keep a record of the error magnitude without changing the step size.)
3. Upon accepting the calculations for a particular step, calculate the solution at this step from (3.37).
The above procedure produces a 3rd-order-accurate solution (3.37) while controlling the error size of the associated 2nd-order method (3.30). This is exactly the same idea as was used by the adaptive RK methods described in Lecture 2.
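The monitoring step can be sketched as follows; this is only an illustration of items 1-3 above, with hypothetical names (`pc_step_monitored`, `eps_loc`), and it records the error estimate rather than changing the step size.

```python
def pc_step_monitored(f, x_i, Y_i, f_i, f_im1, h, eps_loc):
    """One monitored step of (3.30): returns the 3rd-order value (3.37),
    the error estimate (3.36), and a flag telling whether the estimate
    stayed within the target eps_loc."""
    Yp = Y_i + 0.5 * h * (3.0 * f_i - f_im1)   # predictor
    fp = f(x_i + h, Yp)
    Yc = Y_i + 0.5 * h * (f_i + fp)            # corrector
    err_est = abs(Yp - Yc) / 6.0               # estimate (3.36) of the local error
    Y_new = (Yp + 5.0 * Yc) / 6.0              # extrapolated value (3.37)
    return Y_new, err_est, err_est <= eps_loc
```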
Remark 1 As we have just discussed, using the predictor and corrector equations of the same order has the advantage of allowing one to monitor or control the error. However, it may have a disadvantage of making such schemes less stable compared to schemes where the predictor's order is one less than that of the corrector. (We will study the concept of stability in the next Lecture.) Thus, choosing a particular PC pair may depend on the application, and a significant body of research has been devoted to this issue.
Remark 2 Suppose we plan to use a PC method with the predictor and corrector equations having the same order, say $m$, so as to monitor the error, as described above. If the predictor equation comes from a multistep method of the form (3.16) (as, e.g., in (3.30) or (3.31)), then we need to re-examine the question addressed in Section 3.4. Namely, what order starting method should we use for the predictor equation to be able to monitor the error? In Section 3.4 we showed that a starting method of order $(m-1)$ would suffice to make the predictor method have order $m$. More precisely, the global error of the predictor method is $O(h^m)$ in this case. However, the local truncation error of such a predictor method is also $O(h^m)$ and not $O(h^{m+1})$! This is because the error made by the starting method would propagate (but not accumulate) to the last computed value of the numerical solution, as we showed in Section 3.4. For example, if we start the 2nd-order predictor method in (3.30) by the simple Euler method, then the local truncation error of $O(h^2)$ that the simple Euler made in computing $Y^p_1$ will propagate up to $Y_i$ for any $i$. Such an error will invalidate the derivation of the local truncation error of the corrector equation, found in the Appendix (Section 3.8). Thus, we conclude: If you want to be able to monitor the error in a PC method where both the predictor and corrector equations are multistep methods of order $m$, you need to start the predictor equation by an $m$th-order starting method.
To conclude our consideration of the PC methods, let us address the following important issue. Note that we can apply the corrector formula more than once. For example, for the method (3.30), we will then have:
$$Y^p_{i+1} = Y_i + \frac{1}{2}h\left(3f_i - f_{i-1}\right),$$
$$Y^{c,1}_{i+1} = Y_i + \frac{1}{2}h\left(f_i + f(x_{i+1}, Y^p_{i+1})\right),$$
$$Y^{c,2}_{i+1} = Y_i + \frac{1}{2}h\left(f_i + f(x_{i+1}, Y^{c,1}_{i+1})\right),$$
$$\text{etc.}, \qquad Y_{i+1} = Y^{c,k}_{i+1}. \qquad (3.38)$$
Question: How many times should we apply the corrector equation?
We need to strike a compromise here. If we apply the corrector too many times, then we will waste computer time if each iteration of the corrector changes the solution by less than the truncation error of the method. On the other hand, we may have to apply it more than once in order to make the difference $|Y^{c,k}_{i+1} - Y^{c,k-1}_{i+1}|$ between the last two iterations much smaller than the truncation error of the corrector equation (since the latter error is basically the error of the method; see Eq. (3.33)).
Ideally, one would like to know the conditions under which it is sufficient to apply the corrector equation only once, so that no benefits would be gained by its successive applications. Below we derive such a sufficient condition for the method (3.30)/(3.38). For another PC method, e.g., Adams-Moulton, an analogous condition can be derived along the same lines.
Suppose the maximum allowed global error of the solution is $\epsilon_{glob}$. The allowed local truncation error is then about $\epsilon_{loc} = h\,\epsilon_{glob}$ (see Sec. 2.2). We impose two requirements:
(i) The local truncation error of our solution should not exceed $\epsilon_{loc}$;
(ii) The difference $|Y^{c,2}_{i+1} - Y^{c,1}_{i+1}|$ should be much smaller than $\epsilon_{loc}$.
Requirement (i) is necessary to satisfy in order to obtain the required accuracy of the numerical solution. Requirement (ii) is necessary to satisfy in order to use the corrector equation only once.
Requirement (i) along with Eq. (3.36) yields
$$\frac{1}{6}\,|Y^p_{i+1} - Y^{c,1}_{i+1}| < \epsilon_{loc}, \qquad\text{i.e.}\qquad |Y^p_{i+1} - Y^{c,1}_{i+1}| < 6\,\epsilon_{loc}. \qquad (3.39)$$
If at some $x_i$ condition (3.39) does not hold, the step size needs to be reduced in accordance with the error's order: $|Y^p_{i+1} - Y^{c,1}_{i+1}| = O(h^3)$.
Requirement (ii) implies:
$$|Y^{c,2}_{i+1} - Y^{c,1}_{i+1}| = \left|\left(Y_i + \frac{1}{2}h\left(f_i + f(x_{i+1}, Y^{c,1}_{i+1})\right)\right) - \left(Y_i + \frac{1}{2}h\left(f_i + f(x_{i+1}, Y^{p}_{i+1})\right)\right)\right|$$
$$= \frac{1}{2}h\left|f(x_{i+1}, Y^{c,1}_{i+1}) - f(x_{i+1}, Y^{p}_{i+1})\right| \;\overset{\text{Lipschitz}}{\le}\; \frac{1}{2}hL\left|Y^{c,1}_{i+1} - Y^{p}_{i+1}\right| \;\overset{(3.39)}{\le}\; \frac{1}{2}hL\cdot 6\,\epsilon_{loc}. \qquad (3.40)$$
Thus, a sufficient condition for $|Y^{c,2}_{i+1} - Y^{c,1}_{i+1}| \ll \epsilon_{loc}$ to hold is
$$\frac{1}{2}hL\cdot 6\,\epsilon_{loc} \ll \epsilon_{loc}, \qquad\text{or}\qquad hL \ll \frac{1}{3}. \qquad (3.41)$$
If condition (3.41) is satisfied, then a single application of the corrector equation is adequate. If, however, the step size is not small enough, we may require two iterations of the corrector. Then a second application of (3.40) would produce the condition:
$$|Y^{c,3}_{i+1} - Y^{c,2}_{i+1}| \ll \epsilon_{loc} \quad\Leftarrow\quad \left(\frac{1}{2}hL\right)^2 6\,\epsilon_{loc} \ll \epsilon_{loc}, \qquad\text{or}\qquad hL \ll \sqrt{\frac{2}{3}} \approx 0.82, \qquad (3.42)$$
which is less restrictive than (3.41).
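In practice one can simply iterate the corrector of (3.38) until the latest application changes the answer by much less than the allowed local error. The sketch below does exactly that; the name `iterate_corrector`, the iteration cap, and the specific stopping factor are illustrative assumptions, not prescriptions from these notes.

```python
def iterate_corrector(f, x_next, Y_i, f_i, Yp, h, eps_loc, k_max=5):
    """Repeatedly apply the corrector of (3.38), starting from the predicted
    value Yp, until the change produced by the latest application is much
    smaller than eps_loc (here: by a factor of 100) or k_max is reached."""
    Y_old = Yp
    Y_new = Y_old
    for _ in range(k_max):
        Y_new = Y_i + 0.5 * h * (f_i + f(x_next, Y_old))
        if abs(Y_new - Y_old) < 0.01 * eps_loc:
            break
        Y_old = Y_new
    return Y_new
```

When condition (3.41) holds, the loop terminates after very few passes, so little is lost by writing it this way.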
To summarize on the PC methods:
1. The PC methods may provide both high accuracy and the capability of error control, all at a potentially lower computational cost than RK-Fehlberg or RK-Merson methods. For example, the Adams-Moulton method (3.31) has the error of the same (fourth) order as the aforementioned RK methods, while requiring $k+1$ function evaluations, where $k$ is the number of times one has to iterate the corrector equation. If $k < 4$, then Adams-Moulton requires fewer function evaluations than either RK-Merson or RK-Fehlberg.
2. The adjustment of the step size in PC methods is awkward (as it is in all multistep methods); it requires interpolation of the solution between the nodes of the computational grid.
3. One may ask, why not just halve the step size of the Adams-Bashforth method (which would reduce the global error by a factor of $2^4 = 16$, i.e. a lot) and then use it alone, without the Adams-Moulton corrector formula? The answer is this. First, one will then lose control over the error. Second, the Adams-Bashforth may sometimes produce a numerical solution which has nothing to do with the exact solution, while the PC Adams-Moulton's solution will stay close to the exact one. This issue will be discussed in detail in the next Lecture.
3.7 Implicit methods

We noted in Lecture 1 that the simple Euler method is analogous to the left Riemann sums when integrating the differential equation $y' = f(x)$. The method analogous to the right Riemann sums is:
$$Y_{i+1} = Y_i + hf(x_{i+1}, Y_{i+1}). \qquad (3.43)$$
It is called the implicit Euler, or backward Euler, method. This is a first-order method: its global error is $O(h)$ and the local truncation error is $O(h^2)$.
[Figure: right Riemann sums on the grid $x_0, x_1, x_2, x_3, \ldots$]
We note that if $f(x, y) = a(x)y + b(x)$, then the implicit equation (3.43) can be easily solved:
$$Y_{i+1} = \frac{Y_i + hb_{i+1}}{1 - ha_{i+1}}. \qquad (3.44)$$
However, for a general nonlinear $f(x, y)$, equation (3.43) cannot be solved exactly, and its solution then has to be found numerically, say, by the Newton-Raphson method.
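For instance, a Newton-Raphson solution of (3.43) can be sketched as follows; the function names (`backward_euler_step`, `dfdy`), the tolerance, and the iteration cap are illustrative assumptions.

```python
def backward_euler_step(f, dfdy, x_next, Y_i, h, tol=1e-12, max_iter=20):
    """Solve the implicit equation (3.43) for Y_{i+1} by Newton-Raphson.
    dfdy(x, y) is the user-supplied partial derivative f_y(x, y)."""
    Y = Y_i                                    # starting guess: the previous value
    for _ in range(max_iter):
        g = Y - Y_i - h * f(x_next, Y)         # residual of (3.43)
        dg = 1.0 - h * dfdy(x_next, Y)         # d(residual)/dY
        Y_new = Y - g / dg
        if abs(Y_new - Y) < tol:
            return Y_new
        Y = Y_new
    return Y
```

For the linear case $f = a(x)y + b(x)$ the residual is linear in $Y$, so the very first Newton step already reproduces formula (3.44).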
Question: Why does one want to use the implicit Euler, which is so much harder to solve than the simple Euler method?
Answer: Implicit methods have stability properties that are much better than those of explicit methods (like the simple Euler). We will discuss this in the next Lecture.
Note that the last remark about Adams-Bashforth vs. Adams-Moulton, found at the end of the previous Section, is also related to the stability issue. Indeed, the Adams-Bashforth method (the first equation in (3.31)) is explicit, and thus according to the above, it should be not as stable as the Adams-Moulton method (the second equation in (3.31)), which is implicit if one treats $Y^p_{i+1}$ in it as being approximately equal to $Y^c_{i+1}$.
Finally, we present equations for the Modified implicit Euler method:
$$Y_{i+1} = Y_i + \frac{h}{2}\left(f(x_i, Y_i) + f(x_{i+1}, Y_{i+1})\right). \qquad (3.45)$$
This is a second-order method.
3.8 Appendix: Derivation of (3.33)
Here we derive the local truncation error of the corrector equation in the method (3.30). Assuming, as usual, that $Y_i = y_i$, and using $Y^p_{i+1} = y_{i+1} + O(h^3)$ (since the order of the predictor equation is two), one obtains from the corrector equation of (3.30):
$$Y^c_{i+1} = y_i + \frac{1}{2}h\left(y'_i + f(x_{i+1},\, y_{i+1} + O(h^3))\right) = y_i + \frac{1}{2}h\left(y'_i + f(x_{i+1}, y_{i+1}) + O(h^3)\right)$$
$$= y_i + \frac{1}{2}h\left(y'_i + y'_{i+1} + O(h^3)\right) = y_i + \frac{1}{2}h\left(y'_i + \left[y'_i + hy''_i + \frac{1}{2}h^2 y'''_i + O(h^3)\right] + O(h^3)\right)$$
$$= y_i + hy'_i + \frac{1}{2}h^2 y''_i + \frac{1}{4}h^3 y'''_i + O(h^4). \qquad (3.46)$$
On the other hand, for the exact solution we have the usual Taylor series expansion:
$$y_{i+1} = y_i + hy'_i + \frac{1}{2}h^2 y''_i + \frac{1}{6}h^3 y'''_i + O(h^4). \qquad (3.47)$$
Subtracting (3.46) from (3.47), we obtain
$$y_{i+1} - Y^c_{i+1} = -\frac{1}{12}h^3 y'''_i + O(h^4),$$
which is (3.33).
3.9 Questions for self-assessment

1. Make sure you can reproduce the derivation of Eq. (3.4).
2. What is the idea behind the derivation of Eq. (3.5)?
3. Derive Eqs. (3.9) and (3.10).
4. Derive Eq. (3.11) as indicated in the text.
5. Describe two alternative ways to derive formulae for multistep methods.
6. Verify Eq. (3.19).
7. For a multistep method of order $m$, what should the order of the starting method be?
8. Convince yourself that method (3.23) is of the form (3.24) and (3.25).
9. What is the origin of the error $E_P$ in Eq. (3.26)?
10. What is the origin of the error $E_C$ in Eq. (3.27)?
11. How should the orders of the predictor and corrector equations be related? Why?
12. Is there a reason to use a predictor as accurate as the corrector?
13. What is the significance of Requirements (i) and (ii) found before Eq. (3.39)?
14. Make sure you can explain the derivations of (3.40) and (3.41).
15. What are the advantages and disadvantages of the PC methods compared to the RK methods?
16. What is the reason one may want to use an implicit method?
4 Stability analysis of finite-difference methods for ODEs

4.1 Consistency, stability, and convergence of a numerical method; Main Theorem

Recall that we are solving the IVP
$$y' = f(x, y), \qquad y(x_0) = y_0. \qquad (4.1)$$
Suppose we are using the simple Euler method:
$$\frac{Y_{n+1} - Y_n}{h} - f(x_n, Y_n) = 0. \qquad (4.2)$$
Note that in this and subsequent Lectures, we abandon the subscript $i$ in favor of $n$, because we want to reserve $i$ for $\sqrt{-1}$. If we denote the l.h.s. of (4.2) as $F[Y_n, h]$, then the above equation can be rewritten as
$$F[Y_n, h] = 0. \qquad (4.3)$$
In general, any finite-difference method can be written in the form (4.3).
Recall that if we substitute into (4.3) the exact solution $y(x_n)$ of the ODE, then we obtain
$$F[y_n, h] = \tau_n. \qquad (4.4)$$
In Lecture 1, we called $\tau_n$ the discretization error and $h\tau_n$ the local truncation error.
Now we need a notation for the norm. For any sequence of numbers $\{a_n\}$, let
$$\|a\|_\infty = \max_n |a_n|. \qquad (4.5)$$
This norm is called the $L_\infty$-norm of the sequence $\{a_n\}$. There are many other kinds of norms, each of which is useful in its own range of circumstances. In this course, we will only deal with the $L_\infty$-norm, and therefore we will simply denote it by $\|a\|$, dropping the subscript $\infty$. The reason we are interested in this particular norm is that at the end of the day, we want to know that the maximum error of our solution is bounded by some tolerance $\epsilon_{tol}$:
$$\max_n |\epsilon_n| \le \epsilon_{tol}, \qquad\text{or}\qquad \|\epsilon\| \le \epsilon_{tol}. \qquad (4.6)$$
We will now give a series of definitions.
Definition 1: A numerical method $F[Y_n, h] = 0$ is called consistent if
$$\lim_{h\to 0} \|\tau\| = 0, \qquad (4.7)$$
where $\tau_n$ is defined by Eq. (4.4).
Note that if the order of a method is $l > 0$, then the method is consistent, since $\tau_n = O(h^l)$.
However, when we are solving an ODE numerically, our main concern is not so much that the discretization error be small but that the global error be small. Hence we have another definition.
Definition 2: A numerical method is called convergent if
$$\lim_{h\to 0} \|\epsilon\| \equiv \lim_{h\to 0} \|y - Y\| = \lim_{h\to 0}\,\max_n |y_n - Y_n| = 0. \qquad (4.8)$$
Question: What do we need to require of a consistent method in order for it to be convergent?
Answer: That the accumulated local truncation error not grow out of bound.
This motivates yet another definition.
Definition 3: Consider a finite-difference scheme of the form (4.3). Let an approximate solution to that scheme be $u_n$. (Recall that its exact solution is $Y_n$.) The approximation may arise, for example, due to the finite precision of the computer or due to a slight error in the initial condition. If we substitute $u_n$ into the equation of the numerical method, it satisfies:
$$F[u_n, h] = \delta_n, \qquad (4.9)$$
where $\delta_n$ is some small number. The method is called stable if
$$\|u - Y\| \le C\,\|\delta\|, \qquad (4.10)$$
where the constant $C$ is required to be independent of $h$. The latter requirement means that for a stable method, one wants to ensure that small errors (e.g., due to rounding off) made at each step do not accumulate. That is, one wants to preclude the possibility that $C$ is proportional to the number of steps ($\sim O(1/h)$) or, worse yet, that $C = \exp[O(1/h)]$.
Theorem 4.1 (P. Lax): If a method is both consistent and stable, then it converges. In short:
$$\text{Consistency} + \text{Stability} \;\Rightarrow\; \text{Convergence}$$
Remark: Note that all three properties of the method: consistency, stability, and convergence, must be defined with respect to the same norm (in this course, we are using only one kind of norm, so that is not an issue anyway).
The idea of the Proof: Consistency of the method means that the local truncation error at each step, $h\tau_n$, is sufficiently small so that the accumulated (i.e., global) error, which is on the order of $\tau_n$, tends to zero as $h$ is decreased (see (4.7)). Thus:
$$\text{Consistency} \;\Rightarrow\; \|y - Y\| \text{ is small}, \qquad (4.11)$$
where, as above, $Y$ is the "ideal" solution of the numerical scheme (4.3) obtained in the absence of machine round-off errors and any errors in initial conditions.
Stability of the method means that if at any given step, the actual solution $u_n$ slightly deviates from the "ideal" solution $Y_n$ due to the round-off errors, then these small deviations remain small and do not grow as $n$ increases. Thus:
$$\text{Stability} \;\Rightarrow\; \|Y - u\| \text{ is small}. \qquad (4.12)$$
But then the above two equations together imply that the maximum difference between the actual computed solution $u_n$ and the exact solution $y_n$ also remains small, because there are no other sources of errors and no reasons for the errors to grow (other than merely add up). Thus:
$$\|y - u\| = \|(y - Y) + (Y - u)\| \le \|y - Y\| + \|Y - u\|, \qquad (4.13)$$
which must be small because each term on the r.h.s. is small. The fact that the l.h.s. of the above equation is small means, by Definition 2, that the method is convergent. q.e.d.
4.2 Some general comments about stability of solutions of ODEs

In the remainder of this Lecture, and also in some of the following Lectures, we will study stability of a given numerical method by applying it to a model problem
$$y' = \lambda y, \qquad y(0) = y_0 \qquad (4.14)$$
with $\lambda < 0$ or, more generally, with $\mathrm{Re}\,\lambda < 0$ (since, as we will see later, $\lambda$ may need to be allowed to be a complex number).
We will first explain why we have excluded the case $\lambda > 0$ (or, more generally, $\mathrm{Re}\,\lambda > 0$). The solution of (4.14) is $y = y_0 e^{\lambda x}$. Suppose we make a small error in the initial condition:
$$y_0 \to y_0 + \epsilon \qquad\Rightarrow\qquad y = (y_0 + \epsilon)e^{\lambda x} \equiv y_{\rm true} + \epsilon e^{\lambda x}. \qquad (4.15)$$
Now, if $\lambda > 0$, then a small change in the initial condition produces, over a sufficiently large $x$, a large change in the solution. In other words, the problem is unstable in the absolute sense. However, if $|\epsilon| \ll |y_0|$, then the error is still small compared to the exact solution, which means that the problem is stable in the relative sense.
On the other hand, if $\lambda < 0$, the error is $\epsilon e^{-|\lambda|x} \to 0$ as $x$ increases, and therefore problem (4.14) is stable in both absolute and relative senses.
Thus, the instability of the solution for $\lambda > 0$ is intrinsic to the ODE, i.e. it does not depend on how we solve the equation numerically. Moreover, we have seen that such an instability does not present any danger to the numerical solution, because the error still remains much smaller than the solution (as long as $|\epsilon| \ll |y_0|$). On the contrary, the exact solution of (4.14) with $\lambda < 0$ is intrinsically stable (and, in particular, non-growing). Therefore, we want a numerical method to also produce a stable solution. If, however, it produces a growing solution, then we immediately know that the method is unstable.
Let us now explain why model problem (4.14) is relevant for studying stability of numerical methods. A very brief answer is: Because it arises as the local linear approximation (also known as a linearization) near every point of a generic trajectory $y = y(x)$. Indeed, consider a general ODE $y' = f(x, y)$ along with its exact solution $y(x)$ and numerical solution $Y_n$. To illustrate our point, let us suppose that we have used the simple Euler method to obtain the numerical solution. Thus, the exact solution, the numerical solution, and the error $\epsilon_n = y_n - Y_n$ satisfy:
$$y_{n+1} = y_n + hf(x_n, y_n) + h\tau_n, \qquad Y_{n+1} = Y_n + hf(x_n, Y_n),$$
$$\epsilon_{n+1} = \epsilon_n + h\left(f(x_n, y_n) - f(x_n, Y_n)\right) + h\tau_n, \qquad (4.16)$$
where $h\tau_n$ is the local truncation error (see (4.4)). Now, since $|\epsilon_n|$ is supposed to be small, we can Taylor-expand the difference term inside the parentheses in the last equation of (4.16); then that equation becomes
$$\epsilon_{n+1} = \epsilon_n + hf_y(x_n, y_n)\,\epsilon_n + \left(h\tau_n + O(\epsilon_n^2)\right). \qquad (4.17)$$
If we disregard the $O(\epsilon_n^2)$-term above and, in addition, replace $f_y(x_n, y_n)$ by a constant $\lambda = \max |f_y(x, y)|$, then (4.17) becomes
$$\epsilon_{n+1} = \epsilon_n + h\lambda\epsilon_n + h\tau_n. \qquad (4.18a)$$
Equation (4.18a) is nothing but the simple-Euler approximation of the linear inhomogeneous ODE
$$\epsilon'(x) = \lambda\,\epsilon(x) + \tau(x). \qquad (4.18b)$$
The solution to this ODE was obtained in Lecture 0:
$$\epsilon(x) = \epsilon_0\, e^{\lambda(x - x_0)} + \int_{x_0}^{x} \tau(\tilde{x})\, e^{\lambda(x - \tilde{x})}\, d\tilde{x}.$$
One can see that unless $\tau(x)$ itself grows with $x$ (which we do not expect of a local truncation error, since it is supposed to be always small), the presence of the $\tau(x)$-term in Eq. (4.18b) does not cause the error $\epsilon(x)$ to grow. Rather, it is the $\lambda\epsilon$ term that may result in an exponential growth of the initially small error! On these grounds, we neglect the $\tau(x)$-term in (4.18b) and hence obtain the model problem (4.14). Thus, considering how the numerical solution behaves for this model problem, we should be able to predict whether the numerical errors will grow or remain bounded in the original IVP (4.1).
The above consideration has missed one subtle possibility. Namely, what if $\lambda < 0$ while $\tau(x) \to \text{const}$ as $x \to \infty$? One can show that in this case, $\epsilon(x)$ does not decay but asymptotically tends to a constant. On the other hand, from the model problem (4.14) with $\lambda < 0$, we would mistakenly conclude that the error between the exact and numerical solutions should decay, whereas according to the primordial and hence more correct equations (4.18), this error would level off at a nonzero value! This important conclusion is worth to be repeated: In some cases, when the numerical scheme is predicted to be stable by model equation (4.14), the error between the exact and numerical solutions may tend to a nonzero constant. This means that model equation (4.14) does not always correctly predict the behavior of the error between the exact and numerical solutions. To be more specific, (4.14) would always correctly predict only an unstable numerical solution.
A natural question then is: Is there some other kind of error whose behavior would always be correctly predicted by Eq. (4.14)? The answer is: Yes, and this error is the deviation between two numerical solutions, say $Y_n$ and $U_n$, whose initial values are close to each other. Indeed, if both $Y$ and $U$ satisfy (4.2), then a simple calculation along the lines of (4.16) would yield for $(U_n - Y_n)$ an equation similar to (4.17) where now there would be no term $h\tau_n$. This would then yield (4.18b) with $\tau(x) \equiv 0$, which is (4.14). Thus, we conclude that model equation (4.14) always correctly predicts (at least in short term) the behavior of the deviation between two initially close solutions of the numerical scheme.
Finally, we note that if in (4.1), $f(x, y) = a(x)y$, then the model equation (4.14) coincides with the original ODE, i.e. the aforementioned deviation $(U_n - Y_n)$ and either of the numerical solutions $Y_n$ and $U_n$ satisfy the same equation. This simple observation is also worth repeating: For linear ODEs, the numerical solution and a deviation $(U_n - Y_n)$ between any two numerical solutions $U_n$ and $Y_n$ evolve in the same way.
4.3 Stability analyses of some familiar numerical methods

Below we present stability criteria for the numerical methods we have studied in the preceding Lectures for the model problem (4.14).
We begin with the simple Euler method. As we have shown above, for it the error satisfies
$$\epsilon_{n+1} = \epsilon_n + h\lambda\epsilon_n \qquad\Rightarrow\qquad \epsilon_n = \epsilon_0\,(1 + h\lambda)^n, \qquad (4.19)$$
where $\epsilon_0$ is the initial error.
Let $\lambda > 0$. Then both the solution of the ODE (4.14), $y = y_0 e^{\lambda x}$, and the error (4.19) increase as the calculation proceeds. As we said above, one can do nothing about that.
Now let $\lambda < 0$. Then the true solution $y_0 e^{\lambda x}$ decreases, but the error will decrease only if
$$|1 + h\lambda| < 1 \qquad\Leftrightarrow\qquad -1 < 1 + h\lambda < 1 \qquad\Leftrightarrow\qquad h < \frac{2}{|\lambda|}. \qquad (4.20)$$
E.g., to solve $y' = -30y$ (with any initial condition), we must use $h < \frac{2}{30}$ in order to guarantee that the round-off and truncation errors will decay and not inundate the solution.
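This threshold is easy to see in a quick numerical experiment; the following sketch (with illustrative names and step sizes) simply iterates the error recursion (4.19).

```python
def euler_error_growth(lam, h, n_steps, eps0=1e-6):
    """Amplification of an initial error eps0 by the simple Euler method
    applied to y' = lam*y; this iterates the recursion (4.19)."""
    eps = eps0
    for _ in range(n_steps):
        eps *= (1.0 + h * lam)
    return eps

# lam = -30: the error decays if h < 2/30 and grows otherwise.
print(euler_error_growth(-30.0, 0.05, 200))   # |1 + h*lam| = 0.5, decays
print(euler_error_growth(-30.0, 0.10, 200))   # |1 + h*lam| = 2.0, blows up
```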
Thus, for the model problem (4.14) with $\lambda < 0$, the simple Euler method is stable only when the step size satisfies Eq. (4.20). This conditional stability is referred to as partial stability; thus, the simple Euler method is partially stable. For the general case where $\lambda$ is a complex number, partial stability is defined as below.
Definition 4: A method is called partially stable if, when applied to the model problem (4.14) with $\mathrm{Re}\,\lambda < 0$, the corresponding numerical solution is stable only for some values of $h$. The region in the $h\lambda$-plane where the method is stable is called the region of stability of the method.
Let us find the region of stability of the simple Euler method. To this end, write $\lambda$ as the sum of its real and imaginary parts: $\lambda = \lambda_R + i\lambda_I$ (note that here and below $i = \sqrt{-1}$). Then the first of the inequalities in (4.20) becomes
$$|1 + h\lambda_R + ih\lambda_I| < 1 \qquad\Leftrightarrow\qquad (1 + h\lambda_R)^2 + (h\lambda_I)^2 < 1. \qquad (4.21)$$
Thus, the region of stability of the simple Euler method is the inside of the circle
$$(1 + h\lambda_R)^2 + (h\lambda_I)^2 = 1,$$
as shown in the figure.
[Figure: stability region for the simple Euler method in the $(h\lambda_R, h\lambda_I)$-plane: the interior of the unit circle centered at $(-1, 0)$, extending from $-2$ to $0$ along the real axis.]
We now present brief details about the region of stability for the Modified Euler method. In a homework problem, you will be asked to supply the missing details.
Substituting the ODE from (4.14) into Eqs. (1.22) (see Lecture 1), we find that
$$Y_{n+1} = \left(1 + h\lambda + \frac{1}{2}(h\lambda)^2\right) Y_n. \qquad (4.22)$$
Remark 1: The evolution of any error in this method will also satisfy the same equation (4.22).
Remark 2: Note that the factor on the r.h.s. of (4.22) is quite expected: Since the Modified Euler is the 2nd-order method, its solution of the model problem (4.14) should be the 2nd-degree polynomial that approximates the exponential in the exact solution $y = y_0 e^{\lambda x}$.
The boundary of the stability region is obtained by setting the modulus of the factor on the r.h.s. of (4.22) to 1:
$$\left|1 + h\lambda + \frac{1}{2}(h\lambda)^2\right| = 1.$$
Indeed, if the factor on the l.h.s. is less than 1, all errors will decay, and if it is greater than 1, they will grow, even though the exact solution may decay. The above equation can be equivalently written as
$$\left(1 + h\lambda_R + \frac{1}{2}\left((h\lambda_R)^2 - (h\lambda_I)^2\right)\right)^2 + \left(h\lambda_I + h^2\lambda_R\lambda_I\right)^2 = 1. \qquad (4.23)$$
The corresponding region is shown in the figure.
[Figure: stability region for the Modified Euler method in the $(h\lambda_R, h\lambda_I)$-plane.]
When the cRK method is applied to the model problem (4.14), the corresponding stability criterion becomes
$$\left|\sum_{k=0}^{4} \frac{(h\lambda)^k}{k!}\right| \le 1. \qquad (4.24)$$
The expression on the l.h.s. is the fourth-degree polynomial approximating $e^{h\lambda}$; this is consistent with Remark 2 made after Eq. (4.22).
For real $\lambda$, criterion (4.24) reduces to
$$-2.79 \le h\lambda \le 0. \qquad (4.25)$$
Note that the cRK method is not only more accurate than the simple and Modified Euler methods, but also has a greater stability region for negative real values of $\lambda$.
4.4 Stability analysis of multistep methods

We begin with the 2nd-order Adams-Bashforth method (3.5):
$$Y_{n+1} = Y_n + h\left(\frac{3}{2}f_n - \frac{1}{2}f_{n-1}\right). \qquad (3.5)$$
Substituting the model ODE (4.14) into that equation, one obtains
$$Y_{n+1} - \left(1 + \frac{3}{2}h\lambda\right)Y_n + \frac{1}{2}h\lambda\, Y_{n-1} = 0. \qquad (4.26)$$
To solve this difference equation, we use the same procedure as we would use to solve a linear ODE. Namely, for the ODE
$$y'' + a_1 y' + a_0 y = 0$$
with constant coefficients $a_1, a_0$, we need to substitute the ansatz $y = e^{rx}$, which yields the following polynomial equation for $r$:
$$r^2 + a_1 r + a_0 = 0.$$
Similarly, for the difference equation (4.26), we substitute $Y_n = r^n$ and, upon cancelling by the common factor $r^{n-1}$, obtain:
$$r^2 - \left(1 + \frac{3}{2}h\lambda\right)r + \frac{1}{2}h\lambda = 0. \qquad (4.27)$$
This quadratic equation has two roots:
$$r_1 = \frac{1}{2}\left[\left(1 + \frac{3}{2}h\lambda\right) + \sqrt{\left(1 + \frac{3}{2}h\lambda\right)^2 - 2h\lambda}\,\right], \qquad r_2 = \frac{1}{2}\left[\left(1 + \frac{3}{2}h\lambda\right) - \sqrt{\left(1 + \frac{3}{2}h\lambda\right)^2 - 2h\lambda}\,\right]. \qquad (4.28)$$
In the limit of $h\lambda \to 0$ (which is the limit where the difference method (4.26) reduces to the ODE (4.14)), one can use the Taylor expansion (and, in particular, the formula $\sqrt{1 + \delta} = 1 + \frac{1}{2}\delta + O(\delta^2)$) to obtain the asymptotic forms of $r_1$ and $r_2$:
$$r_1 \approx 1 + h\lambda, \qquad r_2 \approx \frac{1}{2}h\lambda. \qquad (4.29)$$
The solution $Y_n$ that corresponds to root $r_1$ turns, in the limit $h \to 0$, into the true solution of the ODE $y' = \lambda y$, because
$$\lim_{h\to 0} r_1^n = \lim_{h\to 0}(1 + h\lambda)^n = \lim_{h\to 0}(1 + h\lambda)^{x/h} = e^{\lambda x}; \qquad (4.30)$$
see Sec. 0.5. However, the solution of the difference method (4.26) corresponding to root $r_2$ does not correspond to any actual solution of the ODE! For that reason, root $r_2$ and the corresponding difference solution $r_2^n$ are called parasitic.
A good thing about the parasitic solution for the 2nd-order Adams-Bashforth method is that it does not grow for sufficiently small $h$. In fact, since for sufficiently small $h$, $|r_2| \approx \frac{1}{2}h|\lambda| < 1$, that parasitic solution decays to zero rather rapidly and therefore does not contaminate the numerical solution.
To require that the 2nd-order Adams-Bashforth method be stable is equivalent to requiring that both $r_1$ and $r_2$ satisfy
$$|r_1| \le 1 \qquad\text{and}\qquad |r_2| \le 1. \qquad (4.31)$$
The stability region is inside the oval-shaped region shown in the figure (the little "horns" are a plotting artifice). This figure is produced by Mathematica; in a homework problem, you will be asked to obtain this figure on your own.
A curious point to note is that the two requirements, $|r_1| \le 1$ and $|r_2| \le 1$, produce two non-overlapping parts of the stability region boundary (its right-hand and left-hand parts, respectively).
[Figure: stability region of the 2nd-order Adams-Bashforth method in the $(h\lambda_r, h\lambda_i)$-plane; along the real axis it extends from $-1$ to $0$.]
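The figure can also be reproduced by checking condition (4.31) numerically over a grid of complex values of $h\lambda$. A minimal sketch, using NumPy's polynomial root finder (the function name and tolerance are illustrative choices):

```python
import numpy as np

def ab2_is_stable(hlam):
    """Check the stability condition (4.31) for the 2nd-order Adams-Bashforth
    method at a (possibly complex) value of h*lambda, using the roots of the
    characteristic equation (4.27)."""
    roots = np.roots([1.0, -(1.0 + 1.5 * hlam), 0.5 * hlam])
    return np.all(np.abs(roots) <= 1.0 + 1e-12)

# Sampling the real axis: the method is stable for -1 <= h*lambda <= 0.
print(ab2_is_stable(-0.5))   # True
print(ab2_is_stable(-1.5))   # False
```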
A similar analysis for the 3rd-order Adams-Bashforth method (3.11) shows that the corresponding difference equation has three roots, of which one (say, $r_1$) corresponds to the true solution of the ODE and the other two ($r_2$ and $r_3$) are the parasitic roots. Fortunately, these roots decay to zero as $O(h)$ for $h \to 0$, so they do not affect the numerical solution for sufficiently small $h$. For finite $h$, the requirement
$$|r_1| \le 1, \qquad |r_2| \le 1, \qquad |r_3| \le 1$$
results in the stability region whose shape is qualitatively shown in the figure.
[Figure: stability region of the 3rd-order Adams-Bashforth method in the $(h\lambda_r, h\lambda_i)$-plane; along the real axis it extends from $-6/11$ to $0$.]
From the above consideration of the 2nd- and 3rd-order Adams-Bashforth methods there follows an observation that is shared by some other families of methods: the more accurate method has a smaller stability region.
Let us now analyze the stability of the two methods considered in Sec. 3.3.
Leap-frog method
Substituting the model ODE into Eq. (3.20), one obtains
$$Y_{n+1} - Y_{n-1} = 2h\lambda\, Y_n. \qquad (4.32)$$
For $Y_n = r^n$ we find:
$$r^2 - 2h\lambda\, r - 1 = 0 \qquad\Rightarrow\qquad r_{1,2} = h\lambda \pm \sqrt{1 + (h\lambda)^2}. \qquad (4.33)$$
Considering the limit $h \to 0$, as before, we find:
$$r_1 \approx 1 + h\lambda, \qquad r_2 \approx -1 + h\lambda. \qquad (4.34)$$
Again, as before, the solution of the difference equation with $r_1$ corresponds to the solution of the ODE: $r_1^n \approx (1 + h\lambda)^{x/h} \approx e^{\lambda x}$. The solution corresponding to root $r_2$ is parasitic. Its behavior needs to be analyzed separately for $\lambda > 0$ and $\lambda < 0$.
$\lambda > 0$: Then $r_2 = -(1 - h\lambda)$, so that $|r_2| < 1$, and the parasitic solution decays, whereas the true solution, $(1 + h\lambda)^n \approx e^{\lambda x}$, grows. Thus, the method is stable in this case.
$\lambda < 0$: Then $r_2 = -(1 + h|\lambda|)$, so that $|r_2| > 1$. Thus, the parasitic solution grows exponentially and makes the method unstable. Specifically, the difference solution will be the linear combination
$$Y_n = c_1 r_1^n + c_2 r_2^n \approx c_1 e^{\lambda x} + c_2 (-1)^n e^{-\lambda x}. \qquad (4.35)$$
We see that the parasitic part (the 2nd term in (4.35)) of this solution grows for $\lambda < 0$, whereas the true solution (the first term in (4.35)) decays; thus, for a sufficiently large $x$, the numerical solution will bear no resemblance to the true solution.
The stability region of the Leap-frog method, shown in the figure, is disappointingly small: the method is stable only for
$$\lambda_R = 0 \qquad\text{and}\qquad -1 \le h\lambda_I \le 1. \qquad (4.36)$$
However, since the numerical solution will still stay close to the true solution for $|\lambda| x \ll 1$, the Leap-frog method is called weakly unstable.
[Figure: stability region of the Leap-frog method: the segment from $-i$ to $i$ on the imaginary axis of the $h\lambda$-plane.]
Divergent 3rd-order method (3.22)
For the model problem (4.14), that method becomes
$$Y_{n+1} + \frac{3}{2}Y_n - 3Y_{n-1} + \frac{1}{2}Y_{n-2} = 3h\lambda\, Y_n. \qquad (4.37)$$
Proceeding as before, we obtain the characteristic equation for the roots:
$$r^3 + \left(\frac{3}{2} - 3h\lambda\right)r^2 - 3r + \frac{1}{2} = 0. \qquad (4.38)$$
To consider the limit $h \to 0$, we can simply set $h = 0$ as the lowest-order approximation. The cubic equation (4.38) reduces to
$$r^3 + \frac{3}{2}r^2 - 3r + \frac{1}{2} = 0, \qquad (4.39)$$
which has the roots
$$r_1 = 1, \qquad r_2 \approx -2.69, \qquad r_3 \approx 0.19. \qquad (4.40)$$
Then for small $h$, the numerical solution is
$$Y_n = c_1(1 + h\lambda)^n + c_2(-2.69 + O(h))^n + c_3(0.19 + O(h))^n, \qquad (4.41)$$
where the first term approximates the true solution, the second is a parasitic solution that explodes, and the third is a parasitic solution that decays.
The second term, corresponding to a parasitic solution, grows (in magnitude) much faster than the term approximating the true solution, and therefore the numerical solution very quickly becomes complete "garbage". This happens much faster than for the Leap-frog method with $\lambda < 0$. Therefore, method (3.22) is called strongly unstable; obviously, it is useless for any computations.
The above considerations of multistep methods can be summarized as follows. Consider a multistep method of the general form (3.17). For the model problem (4.14), it becomes
$$Y_{n+1} - \sum_{k=0}^{M} a_k Y_{n-k} = h\lambda \sum_{k=0}^{N} b_k Y_{n-k}. \qquad (4.42)$$
The first step of its stability analysis is to set $h = 0$, which will result in the following characteristic polynomial:
$$r^{M+1} - \sum_{k=0}^{M} a_k r^{M-k} = 0. \qquad (4.43)$$
This equation must always have a root $r_1 = 1$, which corresponds to the true solution of the ODE. If any of the other roots, i.e. $\{r_2, r_3, \ldots, r_{M+1}\}$, satisfies $|r_k| > 1$, then the method is strongly unstable. If any root with $k \ge 2$ satisfies $|r_k| = 1$, then the method may be weakly unstable (like the Leap-frog method). Finally, if all $|r_k| < 1$ for $k = 2, \ldots, M+1$, then the method is stable for $h \to 0$. It may be either partially stable, as the single-step methods and Adams-Bashforth methods, or absolutely stable, as the implicit methods that we will consider next.
4.5 Stability analysis of implicit methods

Consider the implicit Euler method (3.43). For the model problem (4.14) it yields
$$Y_{n+1} = Y_n + h\lambda Y_{n+1} \qquad\Rightarrow\qquad Y_n = Y_0 \left(\frac{1}{1 - h\lambda}\right)^n. \qquad (4.44)$$
It can be verified (do it, following the lines of Sec. 0.5) that the r.h.s. of the last equation reduces for $h \to 0$ to the exact solution $y_0 e^{\lambda x}$, as it should. The stability condition is
$$\left|\frac{1}{1 - h\lambda}\right| \le 1 \qquad\Leftrightarrow\qquad |1 - h\lambda| \ge 1. \qquad (4.45)$$
The boundary of the stability region is the circle
$$(1 - h\lambda_R)^2 + (h\lambda_I)^2 = 1; \qquad (4.46)$$
the stability region is the outside of that circle (see the second of inequalities (4.45)).
[Figure: stability region of the implicit Euler method: the exterior of the circle of radius 1 centered at $(1, 0)$ in the $(h\lambda_r, h\lambda_i)$-plane.]
Definition 5: If a numerical method, when applied to the model problem (4.14), is stable for all $\lambda$ with $\lambda_R < 0$, such a method is called absolutely stable, or A-stable for short.
Thus, we have shown that the implicit Euler method is A-stable. Similarly, one can show that the Modified implicit Euler method is also A-stable (you will be asked to do so in a homework problem).
Theorem 4.2:
1) No explicit finite-difference method is A-stable.
2) No implicit method of order higher than 2 is A-stable.
Thus, according to Theorem 4.2, implicit methods of order 3 and higher are only partially stable; however, their regions of stability are usually larger than those of explicit methods of the same order.
For more information about stability of numerical methods, one may consult the book by P. Henrici, Discrete variable methods in ordinary differential equations (Wiley, 1968).
4.6 Questions for self-assessment

1. Explain the meanings of the concepts of consistency, stability, and convergence of a numerical method.
2. State the Lax Theorem.
3. Give the idea behind the proof of the Lax Theorem.
4. Why is the model problem (4.14) relevant to analyze stability of numerical methods?
5. The behavior of what kind of error does the model problem (4.14) always predict correctly?
6. Verify the statement about $(U_n - Y_n)$ in the next to the last paragraph of Sec. 4.2.
7. What is the general procedure of analyzing stability of a numerical method?
8. Obtain Eq. (4.27).
9. Obtain Eq. (4.29).
10. Why does the characteristic polynomial for the 3rd-order Adams-Bashforth method (3.11) have exactly 3 roots?
11. Would you use the Leap-frog method to solve the ODE $y' = -y + x^2$?
12. Obtain (4.34) from (4.33).
13. Why is the Leap-frog method called weakly unstable?
14. Why is the method (3.22) called strongly unstable?
15. Are Adams-Bashforth and Runge-Kutta methods always partially stable, or can they be weakly unstable? (Hint: Look at Eqs. (4.42) and (4.43) and think of the roots $r_1, r_2, \ldots$ that these methods can have when $h \to 0$.)
16. Verify the statement made after Eq. (4.44).
17. Obtain (4.46) from (4.45).
18. Is the 3rd-order Adams-Moulton method that you obtained in Homework 3 (problem 3) A-stable?
5 Higher-order ODEs and systems of ODEs

5.1 General-purpose discretization schemes for systems of ODEs

The strategy of generalizing a discretization scheme from one to $N > 1$ ODEs is, for the most part, straightforward. Therefore, below we will consider only the case of a system of $N = 2$ ODEs. This case will also allow us to investigate a certain issue that is specific to systems of ODEs and does not occur for a single ODE. We will denote the exact solutions of the ODE system in question as $y^{(1)}(x)$ and $y^{(2)}(x)$, while the corresponding numerical solutions of this system, as $Y^{(1)}$ and $Y^{(2)}$; the functions appearing on the r.h.s. of the system will be denoted as $f^{(1)}$ and $f^{(2)}$. Thus, the IVP for the two unknowns, $y^{(1)}$ and $y^{(2)}$, is:
$$y^{(1)\prime} = f^{(1)}(x, y^{(1)}, y^{(2)}), \qquad y^{(1)}(x_0) = y^{(1)}_0,$$
$$y^{(2)\prime} = f^{(2)}(x, y^{(1)}, y^{(2)}), \qquad y^{(2)}(x_0) = y^{(2)}_0. \qquad (5.1)$$
We now consider generalizations of some of the methods introduced in Lectures 1 and 2.
Simple Euler method
Probably the most intuitive form of this method for two ODEs is
$$Y^{(1)}_{n+1} = Y^{(1)}_n + hf^{(1)}\!\left(x_n, Y^{(1)}_n, Y^{(2)}_n\right),$$
$$Y^{(2)}_{n+1} = Y^{(2)}_n + hf^{(2)}\!\left(x_n, Y^{(1)}_n, Y^{(2)}_n\right). \qquad (5.2)$$
Already for this most basic example, we can identify the issue, mentioned above, that is specific to systems of ODEs and does not occur for a single first-order ODE. Namely, notice that once we have found the new value $Y^{(1)}_{n+1}$ for the first component of the solution, we can substitute it into the second equation instead of substituting $Y^{(1)}_n$, as it is done in (5.2). The result is:
$$Y^{(1)}_{n+1} = Y^{(1)}_n + hf^{(1)}\!\left(x_n, Y^{(1)}_n, Y^{(2)}_n\right),$$
$$Y^{(2)}_{n+1} = Y^{(2)}_n + hf^{(2)}\!\left(x_n, Y^{(1)}_{n+1}, Y^{(2)}_n\right). \qquad (5.3)$$
Since the components $Y^{(1)}$ and $Y^{(2)}$ enter Eqs. (5.2) on equal footing, we can interchange their order in (5.3) and obtain:
$$Y^{(2)}_{n+1} = Y^{(2)}_n + hf^{(2)}\!\left(x_n, Y^{(1)}_n, Y^{(2)}_n\right),$$
$$Y^{(1)}_{n+1} = Y^{(1)}_n + hf^{(1)}\!\left(x_n, Y^{(1)}_n, Y^{(2)}_{n+1}\right). \qquad (5.4)$$
It is rather straightforward to see that all the three implementations of the simple Euler method are first-order methods.
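For concreteness, here is how the three variants look in code for a system of two ODEs; the function names (which simply reference the equation numbers) are illustrative choices.

```python
def euler_52(f1, f2, x, y1, y2, h):
    """Method (5.2): both components advanced from the old values."""
    return (y1 + h * f1(x, y1, y2),
            y2 + h * f2(x, y1, y2))

def euler_53(f1, f2, x, y1, y2, h):
    """Method (5.3): the freshly computed first component is used
    when advancing the second one."""
    y1_new = y1 + h * f1(x, y1, y2)
    y2_new = y2 + h * f2(x, y1_new, y2)
    return y1_new, y2_new

def euler_54(f1, f2, x, y1, y2, h):
    """Method (5.4): same idea with the roles of the components swapped."""
    y2_new = y2 + h * f2(x, y1, y2)
    y1_new = y1 + h * f1(x, y1, y2_new)
    return y1_new, y2_new
```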
An obvious question that now comes to mind is this: Is there any aspect because of which methods (5.3) and (5.4) may be preferred over method (5.2)? The short answer is: yes, for a certain form of $f^{(1)}$ and $f^{(2)}$, there is. We will present more detail in Sec. 5.3 below. For now we continue with presenting the discretization scheme for the Modified Euler equation for two first-order ODEs.
Modified Euler method
$$\bar{Y}^{(k)} = Y^{(k)}_n + hf^{(k)}\!\left(x_n, Y^{(1)}_n, Y^{(2)}_n\right),$$
$$Y^{(k)}_{n+1} = Y^{(k)}_n + \frac{h}{2}\left[f^{(k)}\!\left(x_n, Y^{(1)}_n, Y^{(2)}_n\right) + f^{(k)}\!\left(x_{n+1}, \bar{Y}^{(1)}, \bar{Y}^{(2)}\right)\right], \qquad k = 1, 2. \qquad (5.5)$$
Let us verify that (5.5) is a second-order method, as it has been for a single ODE. We proceed in exactly the same steps as in Lecture 1. We will also use the shorthand notations:
$$\vec{Y}_n = \left(Y^{(1)}_n, Y^{(2)}_n\right), \qquad \vec{y}_n = \left(y^{(1)}_n, y^{(2)}_n\right), \qquad \vec{f}_n = \left(f^{(1)}(x_n, \vec{Y}_n),\, f^{(2)}(x_n, \vec{Y}_n)\right) \equiv \left(f^{(1)}_n, f^{(2)}_n\right).$$
Since in the derivation of the local truncation error we always assume that $Y^{(k)}_n = y^{(k)}_n$, then also
$$\left(f^{(1)}(x_n, \vec{y}_n),\, f^{(2)}(x_n, \vec{y}_n)\right) = \left(f^{(1)}_n, f^{(2)}_n\right).$$
Expanding the r.h.s. of the second of Eqs. (5.5) about the point $(x_n, \vec{Y}_n)$ in a Taylor series, we obtain:
$$Y^{(k)}_{n+1} = Y^{(k)}_n + \frac{h}{2}\left[f^{(k)}\!\left(x_n, \vec{Y}_n\right) + f^{(k)}\!\left(x_{n+1}, \vec{Y}_n + h\vec{f}_n\right)\right]$$
$$\overset{\text{Taylor expansion}}{=} Y^{(k)}_n + \frac{h}{2}\left[f^{(k)}_n + \left(f^{(k)}_n + h\frac{\partial f^{(k)}_n}{\partial x} + hf^{(1)}_n \frac{\partial f^{(k)}_n}{\partial y^{(1)}} + hf^{(2)}_n \frac{\partial f^{(k)}_n}{\partial y^{(2)}}\right)\right] + O(h^3)$$
$$= Y^{(k)}_n + hf^{(k)}_n + \frac{h^2}{2}\left[\frac{\partial f^{(k)}_n}{\partial x} + f^{(1)}_n \frac{\partial f^{(k)}_n}{\partial y^{(1)}} + f^{(2)}_n \frac{\partial f^{(k)}_n}{\partial y^{(2)}}\right] + O(h^3). \qquad (5.6)$$
Now expanding the exact solution $y^{(k)}_{n+1} = y^{(k)}(x_{n+1})$ in a Taylor series, we obtain:
$$y^{(k)}_{n+1} \overset{\text{Taylor expansion}}{=} y^{(k)}_n + h\frac{d}{dx}y^{(k)}_n + \frac{h^2}{2}\frac{d^2}{dx^2}y^{(k)}_n + O(h^3)$$
$$\overset{\text{definition of } dy^{(k)}/dx}{=} y^{(k)}_n + hf^{(k)}_n + \frac{h^2}{2}\frac{d}{dx}f^{(k)}_n + O(h^3)$$
$$\overset{\text{Chain rule}}{=} y^{(k)}_n + hf^{(k)}_n + \frac{h^2}{2}\left[\frac{\partial f^{(k)}_n}{\partial x} + f^{(1)}_n \frac{\partial f^{(k)}_n}{\partial y^{(1)}} + f^{(2)}_n \frac{\partial f^{(k)}_n}{\partial y^{(2)}}\right] + O(h^3). \qquad (5.7)$$
Here the coefficient of the $h^2$-term has been computed by using the fact that $f^{(k)} = f^{(k)}(x, y^{(1)}(x), y^{(2)}(x))$ and then using the Chain rule. Comparing the last lines in (5.6) and (5.7), we see that $y^{(k)}_{n+1} = Y^{(k)}_{n+1} + O(h^3)$, which confirms that the order of the local truncation error in the Modified Euler method (5.5) is 3, and hence the method is second-order accurate.
In the homework problems, you will be asked to write out the forms of discretization schemes for the Midpoint and cRK methods for a system of two ODEs.
To conclude this subsection, we note that any higher-order IVP, say,
$$y''' + f(x, y, y', y'') = 0, \qquad y(x_0) = y_0, \quad y'(x_0) = z_0, \quad y''(x_0) = w_0, \qquad (5.8)$$
can be rewritten as a system of first-order ODEs with appropriate initial conditions:
$$y' = z, \qquad z' = w, \qquad w' = -f(x, y, z, w), \qquad (5.9)$$
$$y(x_0) = y_0, \qquad z(x_0) = z_0, \qquad w(x_0) = w_0.$$
This is a system of three first-order ODEs, for which the forms of the discretization schemes have been considered above. Obviously, any higher-order ODE that can be explicitly solved for the highest derivative can be dealt with along the same lines.
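As a small illustration of this rewriting, the sketch below builds the right-hand side of the system (5.9) from a given $f(x, y, y', y'')$; the helper name `third_order_rhs` is an illustrative choice.

```python
def third_order_rhs(f):
    """Return the right-hand side of the first-order system (5.9)
    equivalent to y''' + f(x, y, y', y'') = 0.
    The state is the tuple (y, z, w) = (y, y', y'')."""
    def rhs(x, state):
        y, z, w = state
        return (z, w, -f(x, y, z, w))
    return rhs

# Example: y''' + y = 0 corresponds to f(x, y, z, w) = y.
rhs = third_order_rhs(lambda x, y, z, w: y)
print(rhs(0.0, (1.0, 0.0, 0.0)))   # (0.0, 0.0, -1.0)
```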
5.2 Special methods for the second-order ODE $y'' = f(y)$. I: Central-difference methods

A second-order ODE, along with the appropriate initial conditions:
$$y'' = f(y), \qquad y(x_0) = y_0, \quad y'(x_0) = y'_0, \qquad (5.10)$$
occurs in applications quite frequently because it describes the motion of a Newtonian particle (i.e. a particle that obeys the laws of Newtonian mechanics) in the presence of a conservative force (i.e. a force that depends only on the position of the particle but not on its speed and/or the time). In the remainder of this subsection, it will be convenient to think of $y$ as the position of the particle, of $x$ as the time, and of $y'$ as the particle's velocity.
The first special method that we introduce for Eq. (5.10) (and for systems of such equations) uses a second-order accurate approximation for $y''$:
$$y''_n = \frac{y_{n+1} - 2y_n + y_{n-1}}{h^2} + O(h^2); \qquad (5.11)$$
you encountered a similar formula in Lecture 3 (see Sec. 3.1). Combining Eqs. (5.10) and (5.11), we arrive at the central-difference method for Eq. (5.10):
$$Y_{n+1} - 2Y_n + Y_{n-1} = h^2 f_n. \qquad (5.12)$$
(Method (5.12) is sometimes referred to as the simple central-difference method, because the r.h.s. of the ODE (5.10) enters it in the simplest possible way.) Since this is a two-step method, one needs two initial points to start it. The first point, $Y_0$, is simply the initial condition for the particle's position: $Y_0 = y_0$. The second point, $Y_1$, has to be determined from the initial position $y_0$ and the initial velocity $y'_0$. The natural question then is: To what accuracy should we determine $Y_1$ so as to be consistent with the accuracy of the method (5.12)?
To answer this question, we first show that the global error in the simple central-difference method is $O(h^2)$. Indeed, the local truncation error is $O(h^4)$, as it follows from (5.10)-(5.12). For numerical methods for a first-order ODE, considered earlier, this would imply that the global error must be $O(\frac{1}{h})\cdot O(h^4) = O(h^3)$, since the local error of $O(h^4)$ would accumulate over $O(\frac{1}{h})$ steps. However, (5.10) is a second-order ODE, and for it, the error accumulates differently than for a first-order one. To see this qualitatively, we consider the simplest case where the same error is made at every step, and all these errors simply add together. This can then be modeled by the following second-order ODE in the discrete variable $n$:
$$\frac{d^2(\text{GlobalError})}{dn^2} = \text{LocalError}, \qquad\text{where LocalError} = \text{const}; \qquad (5.13)$$
$$\text{GlobalError}(0) = 0, \qquad \text{GlobalError}'(0) = \text{StartupError}. \qquad (5.14)$$
The StartupError is actually the error one makes in computing $Y_1$. If we now treat the discrete variable $n$ as continuous (which is acceptable if we want to obtain an estimate for the answer), then the solution of the above is, obviously,
$$\text{GlobalError}(n) = \text{StartupError}\cdot n + \text{LocalError}\cdot\frac{n^2}{2}, \qquad (5.15)$$
which on the account of
$$n = \frac{b - a}{h} = O\!\left(\frac{1}{h}\right) \qquad ([a, b] \text{ being the interval of integration})$$
becomes
$$\text{GlobalError}(n) = \text{StartupError}\cdot O\!\left(\frac{1}{h}\right) + \text{LocalError}\cdot O\!\left(\frac{1}{h^2}\right). \qquad (5.16)$$
In the Appendix, we derive an analog of (5.16) for the discrete equation (5.12) rather than for the continuous equation (5.13); that derivation confirms the validity of our replacing the discrete equation by its continuous equivalent for the purposes of the estimation of error accumulation.
Equation (5.16) along with the aforementioned fact that the local truncation error is $O(h^4)$ imply that the global error is indeed $O(h^2)$, provided that the startup error (i.e., the error in $Y_1$) is appropriately small. Using the same equation (5.16), it is now easy to see that $Y_1$ needs to be determined with accuracy $O(h^3)$. Therefore, we supplement Eq. (5.12) with the following
Initial conditions for method (5.12):
$$Y_0 = y_0, \qquad Y_1 = y_0 + hy'_0 + \frac{h^2}{2}f(y_0), \qquad (5.17)$$
where in the last equation we have used the ODE $y'' = f(y)$.
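A minimal sketch of the method (5.12) with the startup value (5.17), applied to the harmonic oscillator as a quick check (the function name `central_difference` and the test parameters are illustrative assumptions):

```python
import numpy as np

def central_difference(f, y0, yp0, h, n_steps):
    """Simple central-difference method (5.12) with startup value (5.17).
    f is the force f(y) in y'' = f(y); yp0 is the initial velocity y'(x_0)."""
    Y = np.empty(n_steps + 1)
    Y[0] = y0
    Y[1] = y0 + h * yp0 + 0.5 * h**2 * f(y0)      # startup value, accurate to O(h^3)
    for n in range(1, n_steps):
        Y[n + 1] = 2.0 * Y[n] - Y[n - 1] + h**2 * f(Y[n])
    return Y

# Harmonic oscillator y'' = -y with y(0) = 0, y'(0) = 1; exact solution sin(x).
Y = central_difference(lambda y: -y, 0.0, 1.0, 0.01, 1000)
print(abs(Y[-1] - np.sin(10.0)))   # global error at x = 10, expected O(h^2)
```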
Another method that uses the central-difference approximation (5.11) for $y''$ is:
$$Y_{n+1} - 2Y_n + Y_{n-1} = \frac{h^2}{12}\left(f_{n+1} + 10f_n + f_{n-1}\right). \qquad (5.18)$$
This is called Numerov's method, or the Royal Road formula. The local truncation error of this method is $O(h^6)$. Therefore, the global error will be $O(h^4)$ (i.e., 2 orders better than the global error in the simple central-difference method), provided we calculate $Y_1$ with accuracy $O(h^5)$. In principle, this can be done using, for example, the Taylor expansion:
$$Y_1 = y_0 + hy'_0 + \frac{h^2}{2}y''_0 + \frac{h^3}{6}y'''_0 + \frac{h^4}{24}y^{(iv)}_0, \qquad (5.19)$$
where $y_0$ and $y'_0$ are given as the initial conditions and the higher-order derivatives are computed successively as follows:
$$y''_0 = f(y_0), \qquad y'''_0 = \left.\frac{d}{dx}f(y)\right|_{y=y_0} = f_y(y_0)\,y'_0,$$
$$y^{(iv)}_0 = \left.\frac{d}{dx}\left[f_y(y)\,y'(x)\right]\right|_{y=y_0} = f_{yy}(y_0)\,(y'_0)^2 + f_y(y_0)\,f(y_0). \qquad (5.20)$$
However, Numerov's method is implicit (why?), which makes it unpopular for numerical integration of the IVPs (5.10). The only exception would be the case when $f(y) = ay + b$, a linear function of $y$, when the equation for $Y_{n+1}$ can be easily solved. We will encounter Numerov's method later in this course when we study boundary value problems; there, this method is the method of choice because of its high accuracy.
5.3 Special methods for the second-order ODE $y'' = f(y)$. II: Methods that approximately preserve energy

As we said at the beginning of the previous subsection, Eq. (5.10) describes the motion of a particle in the field of a conservative force. For example, the gravitational or electrostatic force is conservative, but any form of friction is not. We now rename the independent variable $x$ as $t$ (the time) and denote $v(t) = y'(t)$ (the velocity). As before, $y(t)$ denotes the particle's position. Equation (5.10) can be rewritten as
$$y' = v, \qquad (5.21)$$
$$v' = f(y), \qquad (5.22)$$
$$y(t_0) = y_0, \qquad v(t_0) = v_0.$$
In Eq. (5.22), the r.h.s. can be thought of as a force acting on the particle of unit mass. Note that these equations admit a conserved quantity, called the Hamiltonian (which in Newtonian mechanics is just the total energy of the particle):
$$H(v, y) = \frac{1}{2}v^2 + U(y), \qquad U(y) = -\int f(y)\,dy. \qquad (5.23)$$
The first and second terms on the r.h.s. of (5.23) are the kinetic and potential energies of the particle. Using the equations of motion, (5.21) and (5.22), it is easy to see that the Hamiltonian (i.e., the total energy) is indeed conserved:
$$\frac{dH}{dt} = \frac{\partial H}{\partial v}\frac{dv}{dt} + \frac{\partial H}{\partial y}\frac{dy}{dt} = v\,f(y) + \frac{dU}{dy}\,v = 0 \quad \forall t. \qquad (5.24)$$
It is now natural to ask: Do any of the methods considered so far conserve the Hamiltonian? That is, if $V_n, Y_n$ is a numerical solution of (5.21) and (5.22), is $H(V_n, Y_n)$ independent of $n$? The answer is no. However, some of the methods do conserve the Hamiltonian approximately over very long time intervals. We now consider specific examples of such methods.
Consider the three implementations of the simple Euler method, given by Eqs. (5.2)-(5.4). We will refer to method (5.2) as the regular Euler method; the other two methods are conventionally referred to as symplectic⁸ Euler methods. Let us apply these methods with $h = 0.02$ to integration of the equations of a simple harmonic oscillator
$$y'' = -y, \qquad y(0) = 0, \quad y'(0) = 1. \qquad (5.25)$$
The results are presented below. We plot the numerical solutions along with the exact one in the phase plane for $t \le 20$, which corresponds to slightly more than 3 oscillation periods. The orbits of the solutions obtained by the symplectic methods lie very close to the orbit of the exact solution, while the orbit corresponding to the regular Euler method winds off into infinity (provided one waits infinitely long, of course).
⁸ "Symplectic" is a term from Hamiltonian mechanics that means "preserving areas in the phase space". If this explanation does not make the matter clearer to you, simply ignore it and treat the word "symplectic" just as a new adjective in your vocabulary.
[Figure: phase plane ($v$ vs. $y$) of the simple harmonic oscillator; curves: regular Euler, symplectic Eulers and exact solution.]
[Figure: error in the Hamiltonian, $H_{\rm computed} - H_{\rm exact}$, versus time for the simple harmonic oscillator; curves: regular Euler, symplectic Eulers.]
It is known that the symplectic Euler (and higher-order symplectic) methods lose their remarkable property of near-preservation of the energy if the step size is varied. To illustrate this fact, we show the error in the Hamiltonian obtained for the same Eq. (5.25) when the step size is sinusoidally varied with a frequency incommensurable with that of the oscillator itself. Specifically, we took $h = 0.02 + 0.01\sin(1.95t)$. We see, however, that the error in the symplectic methods is still much smaller than that obtained by the regular Euler method.
[Figure: error in the Hamiltonian, $H_{\rm computed} - H_{\rm exact}$, versus time for the variable step size; curves: regular Euler, symplectic Eulers.]
At this point, we are ready to ask two more questions.
Question: What feature of the symplectic Euler methods allows them to maintain the Hamiltonian of the numerical solution near that of the exact solution?
Answer: Perhaps surprisingly, systematic studies of symplectic methods began relatively recently, in the late 1980s. The theory behind these methods goes far beyond the scope of this course (and the expertise of this instructor). A recent review of such methods is posted on the website of this course. We will only briefly touch upon the main reason for the superior performance of the symplectic methods over non-symplectic ones, in Sec. 5.4.
Question: Among the methods we have considered in this Section, are there other methods that possess the property of near-conservation of the Hamiltonian?
Answer: The short answer is yes. To present a more detailed answer, let us look back at the figure for the error in the Hamiltonian, obtained with the two symplectic Euler methods. We see that these errors are nearly opposite to each other and hence, being added, will nearly cancel one another. Therefore, if we somehow manage to combine the symplectic methods (5.3) and (5.4) so that the Hamiltonian error of the new method is the sum of those two old errors, then that new error will be dramatically reduced in comparison with either of the old errors. (This is similar to how the second-order trapezoidal rule for integration of $f(x)$ is obtained as the average of the first-order accurate left and right Riemann sums, whose errors are nearly opposite and thus, being added, nearly cancel each other.) Below we produce such a combination of methods (5.3) and (5.4).
Let us split the step from $x_n$ to $x_{n+1}$ into two substeps: from $x_n$ to $x_{n+1/2}$ and then from $x_{n+1/2}$ to $x_{n+1}$ (see the figure below). Let us now advance the solution in the first half-step using method (5.4) and then advance it in the second half-step using method (5.3). Here is this process in detail:
[Figure: the step from $x_n$ to $x_{n+1}$ is split at $x_{n+1/2}$; symplectic Euler (5.4) is used on the first half-step and symplectic Euler (5.3) on the second.]
$$x_n \to x_{n+1/2},\ \text{use (5.4)}: \qquad V_{n+1/2} = V_n + \frac{h}{2}f(Y_n), \qquad Y_{n+1/2} = Y_n + \frac{h}{2}V_{n+1/2},$$
$$x_{n+1/2} \to x_{n+1},\ \text{use (5.3)}: \qquad Y_{n+1} = Y_{n+1/2} + \frac{h}{2}V_{n+1/2}, \qquad V_{n+1} = V_{n+1/2} + \frac{h}{2}f(Y_{n+1}). \qquad (5.26)$$
Combining the above equations (simply add the 2nd and 3rd equations, and then add the 1st and 4th ones), we obtain:
$$Y_{n+1} = Y_n + hV_n + \frac{h^2}{2}f(Y_n), \qquad V_{n+1} = V_n + \frac{h}{2}\left(f(Y_n) + f(Y_{n+1})\right). \qquad (5.27)$$
Method (5.27) is called the Verlet method, after Dr. Loup Verlet who discovered it in 1967. Later, however, Verlet himself found accounts of his method in works dated as far back as the late 18th century. In particular, in 1907, G. Stormer used higher-order versions of this method for computation of the motion of the ionized particles in the Earth's magnetic field. About 50 years earlier, J.F. Encke had used method (5.27) for computation of planetary orbits. For this reason, this method is also sometimes related with the names of Stormer and/or Encke.
The Verlet method is extensively used in applications dealing with long-time computations, such as molecular dynamics, planetary motion, and computer animation⁹. Its benefits are:
(i) It nearly conserves the energy of the modeled system;
(ii) It is second-order accurate; and
(iii) It requires only one function evaluation per step.
To make the value of these benefits evident, in a homework problem you will be asked to compare the performance of the Verlet method with that of the higher-order cRK method, which is not symplectic and does not have the property of near-conservation of the energy.
⁹ For example, you may visit a game-developer's website at http://www.gamedev.net, go to their Forums and there do a search for "Verlet".
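A minimal sketch of the Verlet update (5.27), applied to the oscillator (5.25) as an energy check; the names and parameters are illustrative assumptions.

```python
import numpy as np

def verlet(f, y0, v0, h, n_steps):
    """Verlet method (5.27) for y'' = f(y); only one new force
    evaluation is needed per step."""
    y, v = y0, v0
    fy = f(y)
    traj = [(y, v)]
    for _ in range(n_steps):
        y_new = y + h * v + 0.5 * h**2 * fy
        f_new = f(y_new)
        v = v + 0.5 * h * (fy + f_new)
        y, fy = y_new, f_new
        traj.append((y, v))
    return np.array(traj)

# Harmonic oscillator (5.25): H = v^2/2 + y^2/2 should stay near 0.5.
traj = verlet(lambda y: -y, 0.0, 1.0, 0.02, 1000)
H = 0.5 * traj[:, 1]**2 + 0.5 * traj[:, 0]**2
print(H.max() - H.min())   # the spread stays very small over the whole run
```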
We now complete the answer to the question asked about a page ago and show that the Verlet method is equivalent to the simple central-difference method. To this end, let us write the Verlet equations at two consecutive steps:
$$Y_{n+1} = Y_n + hV_n + \frac{h^2}{2}f(Y_n), \qquad V_{n+1} = V_n + \frac{h}{2}\left(f(Y_n) + f(Y_{n+1})\right),$$
$$Y_{n+2} = Y_{n+1} + hV_{n+1} + \frac{h^2}{2}f(Y_{n+1}), \qquad V_{n+2} = V_{n+1} + \frac{h}{2}\left(f(Y_{n+1}) + f(Y_{n+2})\right). \qquad (5.28)$$
In fact, we will only need the first three of the above equations. Subtracting the 1st equation from the 3rd and slightly rearranging the terms, we obtain:
$$Y_{n+2} - 2Y_{n+1} + Y_n = \left(hV_{n+1} + \frac{h^2}{2}f(Y_{n+1})\right) - \left(hV_n + \frac{h^2}{2}f(Y_n)\right). \qquad (5.29)$$
We now use the 2nd equation of (5.28) to eliminate $V_{n+1}$. The straightforward calculation yields
$$Y_{n+2} - 2Y_{n+1} + Y_n = h^2 f(Y_{n+1}),$$
which is the simple central-difference method (5.12). Thus, we have shown that the simple central-difference method nearly conserves the energy of the system.
To conclude this section, we note that although the Verlet method nearly conserves the Hamiltonian of the simulated system, it may not always conserve or nearly-conserve other constants of the motion, whenever such exist. As an example, consider the Kepler two-body problem (two particles in each other's gravitational field):
$$q'' = \frac{-q}{(q^2 + r^2)^{3/2}}, \qquad r'' = \frac{-r}{(q^2 + r^2)^{3/2}}, \qquad (5.30)$$
where $q$ and $r$ are the Cartesian coordinates of a certain radius vector relative to the center of mass of the particles. Let us denote the velocities corresponding to $q$ and $r$ as $Q$ and $R$, respectively. This problem has the following three constants of the motion:
Hamiltonian of (5.30):
$$H = \frac{1}{2}(Q^2 + R^2) - \frac{1}{\sqrt{q^2 + r^2}}, \qquad (5.31)$$
Angular momentum of (5.30):
$$A = qR - rQ, \qquad (5.32)$$
Runge-Lenz vector of (5.30):
$$\vec{L} = \vec{i}\left(R(qR - rQ) - \frac{q}{\sqrt{q^2 + r^2}}\right) + \vec{j}\left(-Q(qR - rQ) - \frac{r}{\sqrt{q^2 + r^2}}\right). \qquad (5.33)$$
It turns out that the Verlet method nearly conserves the Hamiltonian and exactly conserves the angular momentum $A$, but does not conserve the Runge-Lenz vector $\vec{L}$. In a homework problem, you will be asked to examine what effect this nonconservation has on the numerical solution.
5.4 Stability of numerical methods for systems of ODEs and higher-order ODEs

Following the lines of the previous three subsections, we will first comment on the stability of general-purpose methods, and then on that of special methods for $y'' = f(y)$ and similar equations.
In Lecture 4, we showed that in order to analyze stability of numerical methods for a single first-order ODE $y' = f(x, y)$, we needed to consider that stability for the model problem (4.14), $y' = \lambda y$ with $\lambda = \text{const}$. This is because any small error, coming either from the truncation error or from elsewhere, satisfies the linearized equation
$$\epsilon'(x) = f_y(x, y)\,\epsilon(x) + \text{driving terms}; \qquad (5.34)$$
see Eqs. (4.17) and (4.18). Thus, if we replace the variable coefficient $f_y(x, y)$, whose exact value at each $x$ we do not know, with a constant $\lambda$ (such that, hopefully, $|\lambda| > |f_y(x, y)|$), we then arrive at the model problem (4.14), which we can easily analyze.
Question: What is the counterpart of (5.34) for a system of ODEs?
Answer, stated for two ODEs:
$$\vec{\epsilon}\,'(x) = \frac{\partial(f^{(1)}, f^{(2)})}{\partial(y^{(1)}, y^{(2)})}\,\vec{\epsilon}(x) + \text{driving terms}, \qquad (5.35)$$
where
$$\vec{\epsilon} = \begin{pmatrix}\epsilon^{(1)}\\ \epsilon^{(2)}\end{pmatrix}, \qquad \frac{\partial(f^{(1)}, f^{(2)})}{\partial(y^{(1)}, y^{(2)})} = \begin{pmatrix} \partial f^{(1)}/\partial y^{(1)} & \partial f^{(1)}/\partial y^{(2)}\\ \partial f^{(2)}/\partial y^{(1)} & \partial f^{(2)}/\partial y^{(2)} \end{pmatrix}. \qquad (5.36)$$
The last matrix is called the Jacobian of the r.h.s. of system (5.1). Equations (5.35) and (5.36) generalize straightforwardly for more than two equations.
We now give a brief derivation of Eqs. (5.35) and (5.36) which parallels that of Eqs. (4.18) for a single first-order ODE. For convenience of the reader, we re-state here the IVP system (5.1):
$$y^{(1)\prime} = f^{(1)}(x, y^{(1)}, y^{(2)}), \qquad y^{(1)}(x_0) = y^{(1)}_0,$$
$$y^{(2)\prime} = f^{(2)}(x, y^{(1)}, y^{(2)}), \qquad y^{(2)}(x_0) = y^{(2)}_0, \qquad (5.1)$$
whose exact solutions are $y^{(1)}$ and $y^{(2)}$. Let $Y^{(1)}$ and $Y^{(2)}$ be the numerical solutions of system (5.1). They satisfy a related system:
$$\acute{Y}^{(1)} = f^{(1)}(x, Y^{(1)}, Y^{(2)}) + \frac{1}{h}\,\text{Local Truncation Error}, \qquad Y^{(1)}(x_0) = y^{(1)}_0,$$
$$\acute{Y}^{(2)} = f^{(2)}(x, Y^{(1)}, Y^{(2)}) + \frac{1}{h}\,\text{Local Truncation Error}, \qquad Y^{(2)}(x_0) = y^{(2)}_0. \qquad (5.37)$$
(Here we used the other symbol of "prime", $\acute{\ }$, to denote the finite-difference approximation of the derivative, while we reserve the regular symbol of prime, $'$, for the exact analytical derivative.) Next, we subtract each of Eqs. (5.37) from the corresponding equation in (5.1)
MATH 337, by T. Lakoba, University of Vermont 53
(assigning now the same meaning to the two kinds of primes) and obtain the following equations
for the global errors
(k)
= y
(k)
Y
(k)
, k = 1, 2:

(k)
= f
(k)
(x, y
(1)
, y
(2)
) f
(k)
(x, Y
(1)
, Y
(2)
)
1
h
Local Truncation Error
=
f
(k)
y
(1)

(1)
+
f
(k)
y
(2)

(2)
+ O(
2
)
1
h
Local Truncation Error . (5.38)
It is now easy to see that Eq. (5.38) is the same as Eqs. (5.35) and (5.36).
Since we do not know the values of the entries in the Jacobian matrix in Eq. (5.35), we
simply replace that matrix by a matrix with constant entries. Thus, the model problem that
one should use to analyze stability of numerical methods for a system of ODEs is

    y' = Ay,     A is a constant matrix.        (5.39)
Now, for a single first-order ODE, we had only one parameter, λ, in the model problem.
Question: How many parameters do we have in the model problem (5.39) for a system of N
ODEs?
The answer depends on which of the two categories of methods one uses. Namely, in Sec.
5.1, we saw that some methods (e.g., the regular Euler (5.2) and the Modified Euler (5.5)) use
the solution Y_n at x = x_n to simultaneously advance to the next step, x = x_{n+1}. Moreover,
each component Y^(k) is obtained using the same discretization rule. To be consistent with
the terminology of Sec. 5.1, we will call this first category of methods the general-purpose
methods. Methods of the other category, which included the symplectic Euler and Verlet,
obtain a component Y^(m)_{n+1} at x_{n+1} by using previously obtained components at x_{n+1},
namely Y^(k)_{n+1} with k < m, as well as the components Y^(p)_n with p ≥ m at x_n. In other words,
they apply different discretization rules to different components. We will call methods from this
category the special methods.
Returning to the above question, we now show that for the general-purpose methods (regular
Euler, modified Euler, cRK, etc.), the answer is N (even though matrix A contains N^2
entries!). We will explain this using the regular Euler method as an example. Details for other
general-purpose methods are more involved, but follow the same logic. First, we note that
most matrices are diagonalizable (see footnote 10), which means that there exists a matrix S and a diagonal
matrix D such that

    A = S^{-1} D S;     D = diag(λ_1, λ_2, ..., λ_N).        (5.40)

Moreover, the diagonal entries of D are the eigenvalues of A. Substitution of (5.40) into (5.39)
leads to the following chain of transformations:

    y' = Ay   ⇒   y' = S^{-1}DSy   ⇒   Sy' = DSy   ⇒   (Sy)' = D(Sy)   ⇒   z' = Dz,   where z = Sy.        (5.41)

Footnote 10: Some matrices, e.g. ( 1  1 ; 0  1 ), are not diagonalizable. However, if we perturb this matrix as,
say, ( 1.01  1 ; 0  1 ), the latter matrix is diagonalizable.
Therefore, the important (for the stability analysis) information about a diagonalizable matrix
A is concentrated in its eigenvalues. Namely, given the diagonal form (5.40) of matrix D, the
last equation in (5.41) can be written as

    z^(k)' = λ_k z^(k),     for k = 1, ..., N,        (5.42)

which means that the matrix model problem (5.39) reduces to the model problem (4.14) for a
single first-order ODE.
Now, when we apply the regular Euler method to system (5.39), we get

    Y_{n+1} - Y_n = h A Y_n.        (5.43)

Repeating now the steps of (5.41), we rewrite this as

    Z_{n+1} - Z_n = h D Z_n,        (5.44)

where Z_n is the numerical approximation to z_n. Given the diagonal form of D, for the components
of Z_n we obtain:

    Z^(k)_{n+1} - Z^(k)_n = h λ_k Z^(k)_n,        (5.45)

which is just the simple Euler method applied separately to the individual model problems (5.42).
Thus, we have confirmed our earlier statement that for a general-purpose method, the stability
analysis for a system of ODEs reduces to the stability analysis for a single equation.
We will now show that the above statement does not apply to special methods like the
symplectic Euler, etc. Indeed, let us apply the symplectic Euler (5.3) to a 2×2 model problem
(5.39), assuming that A = ( a_11  a_12 ; a_21  a_22 ). We have (verify):

    Y_{n+1} - Y_n = h ( a_11  a_12 ; 0  0 ) Y_n + h ( 0  0 ; a_21  a_22 ) Y_{n+1}.        (5.46)

This can no longer be written in the form (5.43), and hence the subsequent calculations that led
to (5.45) are no longer valid. Therefore, the only avenue to proceed with the stability analysis
for special methods is to consider the original matrix model problem (5.39); obviously, this
model problem has, in general, as many parameters as the matrix A, i.e. N^2.
Below we give an example of doing stability analyses for the regular and symplectic Euler
methods. In a homework problem, you will be asked to do similar calculations for the modified
Euler and Verlet methods. As a particular problem, we choose that of a simple harmonic
oscillator (5.25). That problem can be written in the matrix form as follows:

    ( y ; v )' = ( 0  1 ; -1  0 ) ( y ; v ),        (5.47)

so that y^(1) = y and y^(2) = v. The matrix in (5.47) has the eigenvalues λ_{1,2} = ±i:

    det( -λ  1 ; -1  -λ ) = 0   ⇒   λ^2 + 1 = 0   ⇒   λ = ±i.
Thus, if we use the regular Euler, which is a general-purpose method, it suffices to study the
stability of the simple Euler method for a single ODE

    y' = λy   with λ = i or λ = -i.        (5.48)

Before we proceed, let us perform a sanity check and confirm that Eq. (5.48) does indeed
describe a solution that we expect of a harmonic oscillator, i.e.

    y = c_1 sin x + c_2 cos x        (5.49)

for some constants c_1, c_2. Indeed, the solution of (5.48) with, say, λ = i, is

    y = e^{ix}.        (5.50)

Using the Euler formula for complex numbers,

    e^{ix} = cos x + i sin x,

we see that solution (5.50) is indeed of the form (5.49) with c_1 = i and c_2 = 1. Now, if we
want both constants c_1, c_2 to be real, we need to also account for the solution with λ = -i
and perform more tedious calculations. As a result, we would not only confirm that the y-
component of the solution has the form (5.49), but would also show that y and v are given
by

    y = B sin(x + φ),     v = C sin(x + ψ),        (5.51)

where B, C, φ, and ψ are some constants. We will not do these calculations here, since this
would distract us from our main goal, which is the stability analysis of the regular Euler
method applied to (5.47). We will be content with summarizing the above discussion
by stating that the exact solutions (5.51) of the harmonic oscillator model describe oscillations
with a constant amplitude (related to the constants B and C).
Let us now return to the stability analysis for the regular Euler method applied to system
(5.47). According to the discussion that led to Eq. (5.45), this reduces to the stability analysis
of the single Eq. (5.48). The result is given by Eq. (4.21) of Lecture 4. Namely, we have that

    |r| = |1 + h·(±i)| = sqrt(1 + h^2) ≈ 1 + (1/2) h^2,        (5.52)

so that

    |r|^n = |r|^{x/h} ≈ (1 + (1/2) h^2)^{x/h} = (1 + h·(1/2)h)^{x/h} ≈ e^{xh/2}.        (5.53)

Since the absolute value |r|^n determines the amplitude of the numerical solution, we see that
Eq. (5.53) shows that this amplitude grows with x, whereas the amplitude of the exact solution
of (5.47) is constant (see (5.51) and footnote 11). Therefore, the regular Euler method applied to (5.47) is
unstable for any step size h!
This result is corroborated by the figure accompanying Eq. (4.21). Namely, for λ = i
or -i, the value hλ lies on the imaginary axis, which is outside the stability region for the
simple Euler method. Since the magnitude of any error will then grow exponentially, so will
the amplitude of the solution, because, as we discussed in Lecture 4, for linear equations the
error and the solution satisfy the same equation. In a homework problem, you will be asked
to show that the behavior of the Hamiltonian of the numerical solution shown in a figure in
Sec. 5.3 quantitatively agrees with Eq. (5.53).
Footnote 11: To better visualize what is going on, you may imagine that the numerical solution at each x_n is simply the
exact solution multiplied by |r|^n.
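The growth predicted by (5.53) is easy to reproduce. The short sketch below (an illustration of mine, not part of the notes) applies the regular Euler method to system (5.47) and compares the "numerical amplitude" sqrt(y^2 + v^2) with the factor e^{xh/2} from (5.53); the step size and integration interval are arbitrary choices.

    import numpy as np

    h, x_end = 0.05, 50.0
    n = int(x_end / h)
    y, v = 1.0, 0.0
    amp = []
    for k in range(n):
        y, v = y + h*v, v - h*y          # regular (forward) Euler step for (5.47)
        amp.append(np.hypot(y, v))       # "amplitude" sqrt(y^2 + v^2)

    x = h*np.arange(1, n + 1)
    print(amp[-1], np.exp(x[-1]*h/2))    # numerical amplitude vs. e^{xh/2} from (5.53)

Both printed numbers come out close to each other (about 3.5 for these parameters), confirming that the growth is a property of the method itself and not of the round-off error.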
Now we turn to the stability analysis of the symplectic Euler method (say, (5.3)). To that
end, we will apply these finite-difference equations to a model problem, which we take as a
slight generalization of (5.25):

    y'' = -ω^2 y.        (5.54)

The finite-difference equations are:

    y_{n+1} = y_n + h v_n
    v_{n+1} = v_n - h ω^2 y_{n+1}

    ⇒   ( 1  0 ; h ω^2  1 ) ( y ; v )_{n+1} = ( 1  h ; 0  1 ) ( y ; v )_n

    ⇒   ( y ; v )_{n+1} = ( 1  0 ; h ω^2  1 )^{-1} ( 1  h ; 0  1 ) ( y ; v )_n

    ⇒   ( y ; v )_{n+1} = ( 1  0 ; -h ω^2  1 ) ( 1  h ; 0  1 ) ( y ; v )_n

    ⇒   ( y ; v )_{n+1} = ( 1  h ; -h ω^2  1 - h^2 ω^2 ) ( y ; v )_n.        (5.55)
Once we have obtained this matrix relation between the solutions at the nth and (n + 1)st
steps, we need to obtain the eigenvalues of the matrix on the r.h.s. of (5.55). Indeed, it is
known from Linear Algebra that the solution of (5.55) is

    ( y ; v )_n = u_1 r_1^n + u_2 r_2^n,        (5.56)

where r_{1,2} are the eigenvalues of the matrix in question and u_{1,2} are the corresponding eigenvectors.
(We have used the notation r_{1,2} instead of λ_{1,2} for the eigenvalues in order to emphasize
the connection with the characteristic root r that arises in the stability analysis of a single
ODE.) If we find that the modulus of either of the eigenvalues r_1 or r_2 exceeds 1, this would
mean that the symplectic method is unstable (well, we know already that it is not, but we need
to demonstrate that). A simple calculation similar to that found after Eq. (5.47) yields

    r_{1,2} = 1 - (1/2) ( h^2 ω^2 ∓ sqrt( h^4 ω^4 - 4 h^2 ω^2 ) ).        (5.57)

With some help from Mathematica, one can show that

    |r_1| = |r_2| = 1   for -2 ≤ hω ≤ 2;
    either |r_1| > 1 or |r_2| > 1   for any other complex hω.        (5.58)

Thus, the symplectic Euler method is stable for the simple harmonic oscillator equation (and,
in general, other oscillatory models), provided that h is sufficiently small, so that |hω| < 2.
Note that ω in Eq. (5.54) is a counterpart of -iλ in the model equation (4.14): indeed, simply
differentiate (4.14), with λ being replaced by iω, one more time. Using this relation between λ
and ω, we then observe that the stability region for the symplectic Euler method, given by
the first line of (5.58), is reminiscent of the stability region of the Leap-frog method (see Eq.
(4.36)). This may suggest that the Leap-frog method, applied to an oscillatory equation, will
also have the property of near-conservation of the total energy; however, we will not consider
this issue in detail.
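The boundary |hω| = 2 in (5.58) can be checked directly by computing the eigenvalues of the update matrix in (5.55) numerically. The following sketch (mine; the sample values of hω are arbitrary) does this with NumPy.

    import numpy as np

    def growth_factors(h, omega):
        """Moduli of the eigenvalues r_{1,2} of the symplectic-Euler update matrix in (5.55)."""
        M = np.array([[1.0,          h],
                      [-h*omega**2,  1.0 - (h*omega)**2]])
        return np.abs(np.linalg.eigvals(M))

    for h_omega in (0.5, 1.9, 2.1):
        print(h_omega, growth_factors(h_omega, 1.0))   # |r| = 1 for |h*omega| <= 2, > 1 otherwise

For hω = 0.5 and 1.9 both moduli equal 1 to round-off, while for hω = 2.1 one of them exceeds 1, in agreement with (5.58).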
To conclude this subsection, let us mention that for the simple central-difference method
(5.12) and Numerov's method (5.18), the stability analysis should also be applied to the simple
harmonic oscillator equation (5.54). For example, substituting Y_n = r^n into the simple central-
difference equation, where f(y) = -ω^2 y, one finds

    r^2 - (2 - h^2 ω^2) r + 1 = 0.        (5.59)

Again, with help from Mathematica, one can show that the two roots r_{1,2} of Eq. (5.59)
satisfy Eq. (5.58). This is not at all surprising, given that the simple central-difference method
is equivalent to the Verlet method (see the text around (5.28)) and the latter, in its turn, is
simply a composition of two symplectic Euler methods.
Similarly, one can show that the stability region for Numerov's method is given by

    -sqrt(6) ≤ hω ≤ sqrt(6),   where |r_1| = |r_2| = 1,        (5.60)

whereas for any other complex hω, either |r_1| > 1 or |r_2| > 1.
5.5 Stiff equations

Here we will encounter, for the first time in this course, a class of equations that are very
difficult to solve numerically. These equations are called numerically stiff. It is important to
be able to recognize cases where one has to deal with such systems of equations; otherwise, the
numerical solution that one would obtain will have no connection to the exact one.
Let us consider an IVP

    ( u ; v )' = ( 998  1998 ; -999  -1999 ) ( u ; v ),     ( u ; v )|_{x=0} = ( 1 ; 0 ).        (5.61)

Its exact solution is

    ( u ; v ) = ( 2 ; -1 ) e^{-x} + ( -1 ; 1 ) e^{-1000x}.        (5.62)

The IVP (5.61) is an example of a stiff equation. Although there is no rigorous definition of
numerical stiffness, it is usually accepted that a stiff system should satisfy the following two
criteria:
(i) The system of ODEs must contain at least two groups of solutions, where solutions in one
group vary rapidly relative to the solutions of the other group. That is, among the eigenvalues
of the corresponding matrix A there must be two, λ_slow and λ_rapid, such that

    |λ_rapid| / |λ_slow| ≫ 1.        (5.63)

(ii) The rapidly changing solution(s) must be stable. That is, the large-in-magnitude eigenvalues,
λ_rapid, of the matrix A in Eq. (5.39) must have Re λ_rapid < 0. As for the slowly
changing solutions, they may be either stable or unstable.
Let us verify that system (5.61) is stiff. Indeed, criterion (i) above is satisfied for this system
because, of its two solutions, given by the two terms in (5.62), the first (with λ_slow = -1) varies
slowly compared to the other term (with λ_rapid = -1000). Criterion (ii) is satisfied because the
rapidly changing solution has λ_rapid < 0.
Another example of a stiff system is

    ( u ; v )' = ( -499  501 ; 501  -499 ) ( u ; v ),     ( u ; v )|_{x=0} = ( 0 ; 2 ),        (5.64)

whose solution is

    ( u ; v ) = ( 1 ; 1 ) e^{2x} + ( -1 ; 1 ) e^{-1000x}.        (5.65)

Here, again, the first and second terms in (5.65) represent the slow and fast parts of the solution,
with λ_slow = 2 and λ_rapid = -1000, so that |λ_rapid| ≫ |λ_slow|. Thus, criterion (i) is satisfied.
Criterion (ii) is satisfied because the rapid solution is stable: λ_rapid < 0.
The difficulty with stiff equations can be understood from the above examples (5.61), (5.62)
and (5.64), (5.65). Namely, the rapid parts of those solutions are important only very close to
x = 0 and are almost zero everywhere else. However, in order to integrate, e.g., (5.61) using,
say, the simple Euler method, one would be required to keep h·1000 ≤ 2 (see Eq. (4.20)), i.e.
h ≤ 0.002. That is, we are forced to use a very small step size in order to avoid the numerical
instability caused by the least important part of the solution!
Thus, in layman terms, a problem that involves processes evolving on two (or more) disparate
scales, with the rapid process(es) being stable, is stiff. Moreover, as the above example
shows, the meaning of stiffness is that one needs to work the hardest (i.e., use the smallest
h) to resolve the least important part of the solution (i.e., the second terms on the r.h.s.'s of
(5.62) and (5.65)).
An obvious way to deal with a stiff equation is to use an A-stable method (implicit or
modified implicit Euler). This would eliminate the issue of numerical instability; however, the
problem of (low) accuracy will still remain.
In practice, one strikes a compromise between the accuracy and stability of the method.
Matlab, for example, uses a family of methods known as BDF (backward-difference formula)
methods. Matlab's built-in solvers for stiff problems are ode15s (this uses a method of order
between 1 and 5) and ode23s.
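The step-size restriction h ≤ 0.002 quoted above is easy to demonstrate. The sketch below (an illustration of mine) integrates the stiff IVP (5.61) with the simple explicit Euler method using two step sizes, one just inside and one just outside the stability limit, and compares the results with the exact solution (5.62) at x = 0.5.

    import numpy as np

    # Stiff system (5.61): y' = A y,  y(0) = (1, 0)^T
    A = np.array([[ 998.0,  1998.0],
                  [-999.0, -1999.0]])
    y0 = np.array([1.0, 0.0])

    def euler(h, x_end=0.5):
        """Simple (explicit) Euler; stable only for h*1000 <= 2, i.e. h <= 0.002."""
        y = y0.copy()
        for _ in range(int(round(x_end / h))):
            y = y + h * (A @ y)
        return y

    exact = np.array([2*np.exp(-0.5) - np.exp(-500.0),
                      -np.exp(-0.5) + np.exp(-500.0)])   # solution (5.62) at x = 0.5
    for h in (0.001, 0.0021):
        print("h =", h, " Euler:", euler(h), " exact:", exact)

With h = 0.001 the answer is accurate to several digits, while with h = 0.0021 the rapid mode, although negligible in the exact solution, is amplified by roughly a factor of 1.1 per step and completely swamps the result.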
5.6 Appendix: Derivation of Eq. (5.12) with f_n = const

Here we will derive the solution of Eq. (5.12) with its right-hand side being replaced by a
constant:

    Y_{n+1} - 2Y_n + Y_{n-1} = M,     M = const.        (5.66)

This will provide a rigorous justification for the solution (5.15) of the system (5.13)-(5.14).
The method that we will use closely follows the lines of the method of variation of parameters
for the second-order ODE

    y'' + By' + Cy = F(x).        (5.67)

In what follows we will refer to Eq. (5.67) as the continuous case. Namely, we first obtain the
solutions of the homogeneous version of (5.66):

    Y_n = c^(1) + c^(2) n,     c^(1) and c^(2) are arbitrary constants.        (5.68)

Solution (5.68) was obtained by the substitution into (5.66) with M = 0 of the ansätze Y_n = r^n
and Y_n = n r^n. This is analogous to how the solution y = c^(1) + c^(2) x of the ODE y'' = 0 is
obtained.
Next, to solve Eq. (5.66) with M ≠ 0, we allow the constants c^(1) and c^(2) to depend on n.
Substituting the result into (5.66), we obtain:

    ( c^(1)_{n+1} - 2c^(1)_n + c^(1)_{n-1} ) + ( (n+1)c^(2)_{n+1} - 2n c^(2)_n + (n-1)c^(2)_{n-1} ) = M.        (5.69)

Now, similarly to how in the continuous case the counterparts of our c^(1) and c^(2) are set to
satisfy an equation

    (c^(1))' y^(1) + (c^(2))' y^(2) = 0,

where y^(k), k = 1, 2 are the homogeneous solutions of (5.67), here we impose the following
condition:

    k = n:     ( c^(1)_{k+1} - c^(1)_k ) + k ( c^(2)_{k+1} - c^(2)_k ) = 0.        (5.70)

Subtracting from (5.70) its counterpart for k = n - 1, one obtains:

    ( c^(1)_{n+1} - 2c^(1)_n + c^(1)_{n-1} ) + ( n c^(2)_{n+1} - (2n - 1)c^(2)_n + (n - 1)c^(2)_{n-1} ) = 0.        (5.71)

Next, subtracting the last equation from (5.69), we obtain a recurrence equation for c^(2) only,
which has a simple solution (assuming c^(2)_0 = 0):

    c^(2)_{n+1} - c^(2)_n = M,     ⇒     c^(2)_n = nM.        (5.72)

From (5.72) and (5.70) one obtains the solution for c^(1):

    c^(1)_{n+1} - c^(1)_n = -nM,     ⇒     c^(1)_n = -( n(n - 1)/2 ) M.        (5.73)

(Again, we have assumed that c^(1)_0 = 0.) Finally, combining the results of (5.68), (5.72), and
(5.73), we obtain the solution of Eq. (5.66):

    Y_n = -( n(n - 1)/2 ) M + n^2 M = ( n(n + 1)/2 ) M = O(n^2)·M.        (5.74)

The leading-order dependence on n of this solution is that claimed in formula (5.15).
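As a quick sanity check of (5.74), one can verify numerically that Y_n = n(n+1)M/2 does satisfy the recurrence (5.66); the value of M below is an arbitrary choice of mine.

    import numpy as np

    M = 0.37                       # any constant
    n = np.arange(1, 20)
    Y = n*(n + 1)/2 * M            # solution (5.74)
    # second difference Y_{n+1} - 2Y_n + Y_{n-1} should equal M for every interior n
    print(np.allclose(Y[2:] - 2*Y[1:-1] + Y[:-2], M))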
5.7 Questions for self-assessment

1. Verify (5.6).
2. Verify (5.7).
3. What would be your first step to solve a 5th-order ODE using the methodology of Sec. 5.1?
4. Use Eq. (5.11) to explain why the local truncation error of the simple central-difference method (5.12) is O(h^4).
5. Explain why Y_1 for that method needs to be calculated with accuracy O(h^3).
6. What is the global error of the simple central-difference method?
7. How does the rate of the error accumulation for a second-order ODE differ from the rate of the error accumulation for a first-order ODE?
8. Explain the last term in the expression for Y_1 in (5.17).
9. Why is Numerov's method implicit?
10. What is the physical meaning of the Hamiltonian for a Newtonian particle?
11. Verify (5.24).
12. What is the advantage of the symplectic Euler methods over the regular Euler method?
13. State the observation that prompted us to combine the two symplectic Euler methods into the Verlet method.
14. Obtain (5.27) from (5.26).
15. Obtain (5.29) and the next (unnumbered) equation.
16. What is the model problem for the stability analysis for a system of ODEs?
17. Show that for a general-purpose method, the stability analysis for a system of ODEs reduces to the stability analysis for the model problem (4.14).
18. Why is this not so for the special methods, like the symplectic Euler?
19. Make sure you can follow (5.55).
20. Verify that (5.56) is the solution of (5.55). That is, substitute (5.56), with the corresponding subindices, into both sides of the last equation of (5.55). Then for the expression on the r.h.s., use the stated fact that u_1 and u_2 are the eigenvectors of the matrix appearing on the r.h.s. of that equation. (You do not need to use the explicit form of that matrix.)
21. Obtain (5.57).
22. Would you apply the Verlet method to the following strongly damped oscillator: y'' = -(2 + i)^2 y? Please explain.
23. Same question for Numerov's method.
24. What is numerical stiffness, in layman terms?
6 Boundary-value problems (BVPs): Introduction

A typical BVP consists of an ODE and a set of conditions that its solution has to satisfy at
both ends of a certain interval [a, b]. For example:

    y'' = f(x, y, y'),     y(a) = α,   y(b) = β.        (6.1)

Now, let us recall that for IVPs

    y' = f(x, y),     y(x_0) = y_0,

we saw (in Lecture 0) that there are certain conditions on the function f(x, y) which guarantee
that the solution y(x) of the IVP exists and is unique.
The situation with BVPs is considerably more complicated. Namely, relatively few theorems
exist that can guarantee existence and uniqueness of a solution to a BVP. Below we will state,
without proof, two such theorems. In this course we will always assume that f(x, y, y') in
(6.1) and in similar BVPs is a continuous function of its arguments.

Theorem 6.1  Consider a BVP of a special form:

    y'' = f(x, y),     y(a) = α,   y(b) = β.        (6.2)

(Note that unlike (6.1), the function f in (6.2) is assumed not to depend on y'.)
If ∂f/∂y > 0 for all x ∈ [a, b] and all values of the solution y, then the solution y(x) of the
BVP (6.2) exists and is unique.

This Theorem is not very useful for a general nonlinear function f(x, y), because we do
not know the solution y(x) and hence cannot always determine whether ∂f/∂y > 0 or < 0
for the specific solution that we are going to obtain. Sometimes, however, as, e.g., when
f(x, y) = f_0(x) + y^3, we are guaranteed that ∂f/∂y > 0 for any y, and hence the solution of
the corresponding BVP does exist and is unique. Another useful and common case is when the
BVP is linear. In this case, we have the following result.

Theorem 6.2  Consider a linear BVP:

    y'' + P(x)y' + Q(x)y = R(x),     y(a) = α,   y(b) = β.        (6.3)

Let the coefficients P(x), Q(x), and R(x) be continuous on [a, b] and, in addition, let Q(x) ≤ 0
on [a, b]. Then the BVP (6.3) has a unique solution.
Note that in addition to the Dirichlet boundary conditions considered above, i.e.

    y(a) = α,   y(b) = β,        (Dirichlet b.c.)

there may also be boundary conditions on the derivative, called Neumann boundary conditions:

    y'(a) = α,   y'(b) = β.        (Neumann b.c.)

Also, the boundary conditions may be of the mixed type:

    A_1 y(a) + A_2 y'(a) = α,     B_1 y(b) + B_2 y'(b) = β.        (mixed b.c.)

Boundary conditions that involve values of the solution at both boundaries are also possible;
e.g., periodic boundary conditions:

    y(a) = y(b),   y'(a) = y'(b);        (periodic b.c.)

however, we will only consider boundary conditions of the first three types (Dirichlet,
Neumann, and mixed) in this course.
Note that Theorems 6.1 and 6.2 are stated specifically for the Dirichlet boundary conditions.
For the Neumann boundary conditions, they are not valid. Specifically, in the case of Theorem
6.2 applied to the linear ODE (6.3) with Neumann boundary conditions, the solution will exist
but will only be unique up to an arbitrary constant. (That is, if y(x) is a solution, then so is
y(x) + C, where C is any constant.)
A large collection of other theorems about existence and uniqueness of solutions of linear
and nonlinear BVPs can be found in a very readable book by P.B. Bailey, L.F. Shampine, and
P.E. Waltman, Nonlinear Two-Point Boundary Value Problems, Ser.: Mathematics in Science
and Engineering, vol. 44 (Academic Press, 1968).
Unless the BVP satisfies the conditions of Theorems 6.1 or 6.2, it is not guaranteed to
have a unique solution. In fact, depending on the specific combination of the ODE and the
boundary conditions, the BVP may have: (i) no solutions, (ii) one solution, (iii) a finite number
of solutions, or (iv) infinitely many solutions. Possibility (iii) can take place only for nonlinear
BVPs, while the other three possibilities can take place for both linear and nonlinear BVPs.
In the remainder of this lecture, we will focus on linear BVPs.
Thus, a linear BVP can have 0 solutions, 1 solution, or infinitely many solutions. This is similar
to how a matrix equation

    M x = b        (6.4)

can have 0 solutions, 1 solution, or infinitely many solutions, depending on whether the matrix M
is singular or not. The reason behind this similarity will become apparent as we proceed, an
early indication of this reason appearing later in this lecture and more evidence appearing in
the subsequent lectures. Below we give three examples where the BVP does not satisfy the
conditions of Theorem 6.2, and each of the above three possibilities is realized.

    y'' + π^2 y = 1,     y(0) = 0,   y(1) = 0        (Problem I)

has 0 solutions.

    y'' + π^2 y = 1,     y(0) = 0,   y'(1) = 1        (Problem II)

has exactly 1 solution.

    y'' + π^2 y = 1,     y(0) = 0,   y(1) = 2/π^2        (Problem III)

has infinitely many solutions.
Below we demonstrate that the above statements about the numbers of solutions in Problems
I-III are indeed correct.
One can verify that the general solution of the ODE

    y'' + π^2 y = 1        (6.5)

is

    y = A sin πx + B cos πx + 1/π^2.        (6.6)
The constants A and B are determined by the boundary conditions. Namely, substituting the
solution (6.6) into the boundary conditions of Problem I above, we have:

    y(0) = 0  ⇒  B + 1/π^2 = 0;     y(1) = 0  ⇒  -B + 1/π^2 = 0;        (6.7)

hence no such B (and hence no pair A, B) exists. Note that the above equations can be written
as a linear system for the coefficients A and B:

    ( 0  1 ; 0  -1 ) ( A ; B ) = ( -1/π^2 ; -1/π^2 ).        (6.8)

The coefficient matrix in (6.8) is singular and, in addition, the vector on the r.h.s. does not
belong to the range (column space) of this coefficient matrix. Therefore, the linear system has
no solution, as we have stated above.
Substituting now the solution (6.6) of the ODE (6.5) into the boundary conditions of Problem
II, we arrive at the following linear system for A and B:

    y(0) = 0  ⇒  B + 1/π^2 = 0;     y'(1) = 1  ⇒  -πA = 1,        (6.9)

or in the matrix form,

    ( 0  1 ; -π  0 ) ( A ; B ) = ( -1/π^2 ; 1 ).        (6.10)

This system obviously has the unique solution A = -1/π, B = -1/π^2; note that the matrix in the
linear system (6.10) is nonsingular.
Finally, substituting (6.6) into the boundary conditions of Problem III, one finds:

    y(0) = 0  ⇒  B + 1/π^2 = 0;     y(1) = 2/π^2  ⇒  -B + 1/π^2 = 2/π^2,        (6.11)

or in the matrix form,

    ( 0  1 ; 0  -1 ) ( A ; B ) = ( -1/π^2 ; 1/π^2 ).        (6.12)

The solution of (6.12) is: A = arbitrary, B = -1/π^2. Although the matrix in (6.12) is singular,
the vector on the r.h.s. of this equation belongs to the column space of this matrix (that is, it
can be written as a linear combination of the columns), and hence the linear system in (6.12)
has infinitely many solutions.
The above simple examples illustrate a connection between linear BVPs and systems of linear
equations. We can use this connection to formulate an analogue of the well-known theorem
for linear systems, namely:

Theorem in Linear Algebra: The linear system (6.4) has a unique solution if and only
if the matrix M is nonsingular.

Equivalently, either the linear system has a unique solution, or the homogeneous linear system

    M x = 0

has nontrivial solutions. (Recall that the second part of the previous sentence is one of the
definitions of a singular matrix.)
Similarly, for BVPs we have

The Alternative Principle for BVPs: Either the homogeneous linear BVP (i.e. the
one with both the r.h.s. R(x) = 0 and zero boundary conditions) has nontrivial solutions, or
the original BVP has a unique solution.

In a homework problem, you will be asked to verify this principle for Problems I-III.
To conclude this introduction to BVPs, let us again consider Problem I and exhibit a
danger associated with that case. By itself, the fact that the BVP in Problem I has no
solutions is neither good nor bad; it is simply a fact of life. However, suppose that the
coefficients in this problem are known only approximately (for example, because of round-off
error). Then the matrix in (6.8) is no longer singular (in general), but almost singular. This,
in turn, makes the linear system (6.8) ill-conditioned: a tiny change of the vector on the r.h.s.
will generically lead to a large change in the solution. Thus, any numerical results obtained
for such a system will not be reliable. In a homework problem, you will be asked to consider a
specific example illustrating this case.
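The ill-conditioning described in the last paragraph can be seen in a few lines of NumPy. The sketch below (my illustration; the size and placement of the perturbation are arbitrary choices) perturbs the singular matrix of (6.8) by a round-off-sized amount and solves the resulting system.

    import numpy as np

    # A tiny perturbation (mimicking round-off) of the singular system (6.8):
    eps = 1e-10
    M = np.array([[eps, 1.0],
                  [0.0, -1.0]])                # nearly the singular matrix of (6.8)
    rhs = np.array([-1/np.pi**2, -1/np.pi**2])
    A_, B_ = np.linalg.solve(M, rhs)
    print(A_, B_)                              # A_ is of order 1/eps: ill-conditioning in action

The computed coefficient A is of order 1/eps, i.e. a microscopic change in the data produced an enormous change in the "solution", which is exactly the danger described above.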
Questions for self-assessment

1. Can you say anything about the existence and uniqueness of the solution of the BVP
    y'' = arctan y,   y(-1) = α,   y(1) = β?
2. Same question about the BVP
    y'' + (arctan x) y' + (sin x) y = 1,   y(-π) = 1,   y(0) = 1.
3. Same question about the BVP
    y'' + (arctan x) y' - (sin x) y = 1,   y'(0) = 0,   y'(1) = 1.
4. How many solutions are possible for a BVP? What if the BVP is linear?
5. Verify that (6.6) is the solution of (6.5).
6. Verify each of the equations (6.7)-(6.12).
7. Verify that the vector on the r.h.s. of (6.12) belongs to the range (column space) of the matrix on the l.h.s.
8. State the Alternative Principle for BVPs.
9. What is the danger associated with the case when the BVP has no solution?
7 The shooting method for solving BVPs

7.1 The idea of the shooting method

In this and the next lectures we will only consider BVPs that satisfy the conditions of Theorems
6.1 or 6.2 and thus are guaranteed to have a unique solution.
Suppose we want to solve a BVP with Dirichlet boundary conditions:

    y'' = f(x, y, y'),     y(a) = α,   y(b) = β.        (7.1)

We can rewrite this BVP in the form:

    y' = z,
    z' = f(x, y, z),
    y(a) = α,
    y(b) = β.        (7.2)

The BVP (7.2) will turn into an IVP if we replace the boundary condition at x = b with the
condition

    z(a) = θ,        (7.3)

where θ is some number. Then we can solve the resulting IVP by any method that we have
studied in Lecture 5 and obtain the value of its solution y(b) at x = b. If y(b) = β, then we
have solved the BVP. Most likely, however, we will find that after the first try, y(b) ≠ β.
Then we should choose another value for θ and try again. There is actually a strategy for how
the values of θ need to be chosen. This strategy is simpler for linear BVPs, so this is the case
we consider next.

7.2 Shooting method for the Dirichlet problem of linear BVPs

Thus, our immediate goal is to solve the linear BVP

    y'' + P(x)y' + Q(x)y = R(x)   with Q(x) ≤ 0,     y(a) = α,   y(b) = β.        (7.4)

To this end, consider two auxiliary IVPs:

    u'' + Pu' + Qu = R,     u(a) = α,   u'(a) = 0,        (7.5)

and

    v'' + Pv' + Qv = 0,     v(a) = 0,   v'(a) = 1,        (7.6)

where we omit the arguments of P(x) etc., as this should cause no confusion. Next, consider
the function

    w = u + θ v,     θ = const.        (7.7)

Using Eqs. (7.5) and (7.6), it is easy to see that

    (u + θv)'' + P (u + θv)' + Q (u + θv) = R,
    (u + θv)(a) = α,   (u + θv)'(a) = θ,        (7.8)

i.e. w satisfies the IVP

    w'' + Pw' + Qw = R,     w(a) = α,   w'(a) = θ.        (7.9)

Note that the only difference between (7.9) and (7.4) is that in (7.9), we know the value of w'
at x = a but do not know whether w(b) = β. If we can choose θ in such a way that w(b) does
equal β, this will mean that we have solved the BVP (7.4).
To determine such a value of θ, we first solve the IVPs (7.5) and (7.6) by an appropriate
method of Lecture 5 and find the corresponding values u(b) and v(b). We then choose the value
θ = θ_0 by requiring that the corresponding w(b) = β, i.e.

    w(b) = u(b) + θ_0 v(b) = β.        (7.10)

This w(x) is the solution of the BVP (7.4), because it satisfies the same ODE and the same
boundary conditions at x = a and x = b; see Eqs. (7.9) and (7.10). Equation (7.10) yields the
following equation for θ_0:

    θ_0 = ( β - u(b) ) / v(b).        (7.11)

Thus, solving only two IVPs (7.5) and (7.6) and constructing the new function w(x) according
to (7.7) and (7.11), we obtain the solution to the linear BVP (7.4).
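Here is a compact illustration of the whole procedure (7.5)-(7.11). The BVP used below is my own example (y'' - y = x with y(0) = 0, y(1) = 1, so that P = 0, Q = -1 ≤ 0, R = x), not one from the notes, and SciPy's solve_ivp stands in for "an appropriate method of Lecture 5".

    import numpy as np
    from scipy.integrate import solve_ivp

    # example linear BVP (my choice):  y'' - y = x,  y(0) = alpha = 0,  y(1) = beta = 1
    a, b, alpha, beta = 0.0, 1.0, 0.0, 1.0

    def rhs_u(x, s):            # IVP (7.5): u'' = u + x,  u(a) = alpha, u'(a) = 0
        return [s[1], s[0] + x]

    def rhs_v(x, s):            # IVP (7.6): v'' = v,      v(a) = 0,     v'(a) = 1
        return [s[1], s[0]]

    xs = np.linspace(a, b, 101)
    u = solve_ivp(rhs_u, (a, b), [alpha, 0.0], t_eval=xs, rtol=1e-10, atol=1e-12).y[0]
    v = solve_ivp(rhs_v, (a, b), [0.0, 1.0],   t_eval=xs, rtol=1e-10, atol=1e-12).y[0]

    theta0 = (beta - u[-1]) / v[-1]          # Eq. (7.11)
    w = u + theta0 * v                       # Eq. (7.7): solution of the BVP

    exact = 2*np.sinh(xs)/np.sinh(1.0) - xs  # exact solution of this example BVP
    print("max error:", np.max(np.abs(w - exact)))

The maximum difference from the exact solution is at the level of the integrator's tolerance, as one would expect.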
Consistency check: In (7.4), we required that Q(x) ≤ 0, which guarantees that a
unique solution of that BVP exists. What would have happened if we had overlooked imposing
that requirement? Then, according to Theorem 6.2, we could have run into a situation where
the BVP would have had no solutions. This would occur if

    v(b) = 0.        (7.12)

But then Eqs. (7.6) and (7.12) would mean that the homogeneous BVP

    v'' + Pv' + Qv = 0,     v(a) = 0,   v(b) = 0,        (7.13)

must have a nontrivial solution (see footnote 12). The above considerations agree, as they should, with the
Alternative Principle for BVPs, namely: the BVP (7.4) may have no solutions if the
corresponding homogeneous BVP (7.13) has nontrivial solutions.
Footnote 12: Indeed, since we also know that v'(a) = 1, then v(x) cannot identically equal zero on [a, b].
7.3 Generalizations of the shooting method for linear BVPs

If the BVP has boundary conditions other than of the Dirichlet type, we will still proceed
exactly as we did above. For example, suppose we need to solve the BVP

    y'' + Py' + Qy = R,     y(a) = α,   y'(b) = β,        (7.14)

which has the Neumann boundary condition at the right end point. Denote y_1 = y, y_2 = y',
and rewrite this BVP as

    y_1' = y_2,
    y_2' = -P y_2 - Q y_1 + R,
    y_1(a) = α,   y_2(b) = β.        (7.15)

Using the vector/matrix notation with

    y = ( y_1 ; y_2 ),

we can further rewrite this BVP as

    y' = ( 0  1 ; -Q  -P ) y + ( 0 ; R ),     y_1(a) = α,   y_2(b) = β.        (7.16)
Now, in analogy with Eqs. (7.5) and (7.6), consider two auxiliary IVPs:

    u' = ( 0  1 ; -Q  -P ) u + ( 0 ; R ),     u(a) = ( α ; 0 );        (7.17)

    v' = ( 0  1 ; -Q  -P ) v,     v(a) = ( 0 ; 1 ).        (7.18)

Solve these IVPs by an appropriate method and obtain the values u(b) and v(b). Next, consider
the vector

    w = u + θ v,     θ = const.        (7.19)

Using Eqs. (7.17)-(7.19), it is easy to see that this new vector satisfies the IVP

    w' = ( 0  1 ; -Q  -P ) w + ( 0 ; R ),     w(a) = ( α ; θ ).        (7.20)

At x = b, its value is

    w(b) = ( u_1(b) + θ v_1(b) ; u_2(b) + θ v_2(b) ),        (7.21)

where u_1 is the first component of u, etc. From the last equation in (7.16), it follows that we
are to require that

    w_2(b) = β.        (7.22)

Equations (7.21) and (7.22) together yield

    u_2(b) + θ v_2(b) = β,        (7.23)

    ⇒   θ_0 = ( β - u_2(b) ) / v_2(b).        (7.24)

Thus, the vector w given by Eq. (7.19), where u, v, and θ satisfy Eqs. (7.17), (7.18), and
(7.24), respectively, is the solution of the BVP (7.16).
Also, the shooting method can be used with BVPs of order higher than the second. For
example, consider the BVP

    x^3 y''' + x y' - y = 3 + ln x,     y(a) = α,   y'(b) = β,   y''(b) = γ.        (7.25)

As in the previous example, denote y_1 = y, y_2 = y', y_3 = y'' and rewrite the BVP (7.25) in the
matrix form:

    y' = A y + r,     y_1(a) = α,   y_2(b) = β,   y_3(b) = γ,        (7.26)

where

    y = ( y_1 ; y_2 ; y_3 ),     A = ( 0  1  0 ; 0  0  1 ; 1/x^3  -1/x^2  0 ),     r = ( 0 ; 0 ; (3 + ln x)/x^3 ).
u

= Au +r ,
u(a) =

0
0

;
v

= Av ,
v(a) =

0
1
0

;
w

= A w,
w(a) =

0
0
1

.
(7.27)
Solve them and obtain u(b), v(b), and w(b). Then, construct z = u +v + w, where and
are numbers which will be determined shortly. At x = b, one has
z(b) =

. . .
u
2
(b) + v
2
(b) + w
2
(b)
u
3
(b) + v
3
(b) + w
3
(b)

. (7.28)
If we require that z satisfy the BVP (7.26), we must have
z(b) =

. . .

. (7.29)
From Eqs. (7.28) and (7.29) we form a system of two linear equations for the unknown coe-
cients and :
v
2
(b) + w
2
(b) = u
2
(b) ,
v
3
(b) + w
3
(b) = u
3
(b) .
(7.30)
Solving this linear system, we obtain values
0
and
0
such that the corresponding z = u +

0
v +
0
w solves the BVP (7.26) and hence the original BVP (7.25).
7.4 Caveat with the shooting method, and its remedy, the multiple shooting method

Here we will encounter a situation where the shooting method in its form described above does
not work. We will also provide a way to modify the method so that it would be usable again.
Let us consider the BVP

    y'' = 30^2 (y - 1 + 2x),     y(0) = 1,   y(b) = 1 - 2b;   b > 0.        (7.31)

Its exact solution is

    y = 1 - 2x;        (7.32)

by Theorem 6.2 this solution is unique. Note that the general solution of only the ODE (without
boundary conditions) in (7.31) is

    y = 1 - 2x + A e^{30x} + B e^{-30x}.        (7.33)

Now let us try to use the shooting method to solve the BVP (7.31). Following the lines of
Sec. 7.2, we set up auxiliary IVPs

    u'' = 30^2 (u - 1 + 2x),     u(0) = 1,   u'(0) = 0;
    v'' = 30^2 v,                v(0) = 0,   v'(0) = 1;        (7.34)

and solve them. The exact solutions of (7.34) are:

    u = 1 - 2x + (1/30) ( e^{30x} - e^{-30x} ),     v = (1/60) ( e^{30x} - e^{-30x} ).        (7.35)

Then Eq. (7.11) provides the value of the auxiliary parameter θ_0:

    θ_0 = [ (1 - 2b) - ( 1 - 2b + (1/30)( e^{30b} - e^{-30b} ) ) ] / [ (1/60)( e^{30b} - e^{-30b} ) ]
        ≈ [ -(1/30) e^{30b} ] / [ (1/60) e^{30b} ] = -2.        (7.36)
Above we have used the ≈ sign to emphasize that we have kept only the largest terms in each
of the numerator and denominator; but, as a matter of fact, simple algebra shows that θ_0 = -2
exactly. Indeed,

    u from (7.35) + (-2)·v from (7.35) = 1 - 2x = exact solution.        (7.37)

However, in any realistic situation, the value of θ_0 will be determined with a round-off error,
i.e. instead of (7.36) we should expect to get

    θ_0 = -2 + ε,        (7.38)

where ε is the round-off error. Then the solution that one obtains from the auxiliary IVPs
(7.34) and Eqs. (7.35) and (7.38) is:

    y = u from (7.35) + (-2 + ε)·v from (7.35) = 1 - 2x + (ε/60) ( e^{30x} - e^{-30x} ).        (7.39)

In Matlab, ε ~ 10^{-15}. Suppose b = 1.4; then the last term in (7.39) is

    (ε/60) ( e^{30b} - e^{-30b} ) ~ 10^{-15} · (1/60) e^{30·1.4} ≈ 29,

which is much larger than either of the first two terms. Thus, the exact solution will be drowned
in the round-off error amplified by a large exponential factor. This is illustrated in the figure
below, where both the exact and the numerical solutions for b = 1.4 are shown.

[Figure: the exact solution y = 1 - 2x and the numerical (shooting) solution on 0 ≤ x ≤ 1.4; the vertical axis runs from 0 to 60, and the numerical curve departs sharply from the exact one near the right end point.]

The way in which the shooting method is to be modified in order to handle the above
problem is suggested by this figure. Namely, one can see that the numerical solution is quite
accurate up to the vicinity of the right end point, where x ≈ 1.4 and the factor e^{30x} overtakes
the terms of the exact solution. Therefore, if we split the interval [0, 1.4] into two adjoining
subintervals, [0, 0.7] and [0.7, 1.4], and perform shooting on each of these subintervals, then
the corresponding exponential factor, e^{30·0.7} ~ 10^9, will be too small to distort the numerical
solution, because then e^{30·0.7} · 10^{-15} ~ 10^9 · 10^{-15} = 10^{-6} ≪ 1.
Below we show the implementation details of this approach, known as the multiple shooting
method. (Obviously, the name comes from the fact that the shooting is performed on multiple
(sub)intervals.) These details are worked out for the case of two subintervals, [0, b/2] and
[b/2, b]; a generalization to the case of more subintervals is fairly straightforward.
Consider two sets of auxiliary IVPs that are similar to the IVPs (7.34):

On [0, b/2]:

    u^(1)'' = 30^2 ( u^(1) - 1 + 2x ),     u^(1)(0) = α,   u^(1)'(0) = 0;
    v^(1)'' = 30^2 v^(1),                  v^(1)(0) = 0,   v^(1)'(0) = 1.        (7.40)

On [b/2, b]:

    u^(2)'' = 30^2 ( u^(2) - 1 + 2x ),     u^(2)(b/2) = 0,   u^(2)'(b/2) = 0;
    v^(2,1)'' = 30^2 v^(2,1),              v^(2,1)(b/2) = 1,   v^(2,1)'(b/2) = 0;        (7.41)
    v^(2,2)'' = 30^2 v^(2,2),              v^(2,2)(b/2) = 0,   v^(2,2)'(b/2) = 1.

Note 1: The initial condition for u^(1) at x = 0 is denoted as α, even though in the example
considered α = 1. This is done to emphasize that the given initial condition is always used for
u^(1) at the left end point of the original interval.
Note 2: Note that in the 2nd subinterval (and, in general, in the kth subinterval with k ≥ 2),
the initial conditions for the u^(k) must be taken to always be zero (note the zero initial
conditions for u^(2) in (7.41)).
Continuing with solving the IVPs (7.40) and (7.41), we construct solutions

    w^(1) = u^(1) + θ^(1) v^(1),     w^(2) = u^(2) + θ^(2,1) v^(2,1) + θ^(2,2) v^(2,2),        (7.42)

where the numbers θ^(1), θ^(2,1), and θ^(2,2) are to be determined. Namely, these three numbers
are determined from three requirements:

    w^(1)(b/2) = w^(2)(b/2),
    w^(1)'(b/2) = w^(2)'(b/2)
    (the solution and its derivative must be continuous at x = b/2);        (7.43)

and

    w^(2)(b) = β (= 1 - 2b)
    (the solution must satisfy the boundary condition at x = b).        (7.44)

Equations (7.43) and (7.44) yield:

    w^(1)(b/2) = w^(2)(b/2)   ⇒   u^(1)(b/2) + θ^(1) v^(1)(b/2) = 0 + θ^(2,1)·1 + θ^(2,2)·0;
    w^(1)'(b/2) = w^(2)'(b/2) ⇒   u^(1)'(b/2) + θ^(1) v^(1)'(b/2) = 0 + θ^(2,1)·0 + θ^(2,2)·1;
    w^(2)(b) = u^(2)(b) + θ^(2,1) v^(2,1)(b) + θ^(2,2) v^(2,2)(b) = β.        (7.45)

In writing out the r.h.s.'s of the first two equations above, we have used the initial conditions
of the IVPs (7.41).
The three equations (7.45) form a linear system for the three unknowns θ^(1), θ^(2,1), and
θ^(2,2). (Recall that u^(1)(b/2) etc. are known from solving the IVPs (7.40) and (7.41).) Thus,
finding the θ^(1), θ^(2,1), and θ^(2,2) from (7.45) and substituting them back into (7.42), we obtain
the solution to the original BVP (7.31).
Important note: The multiple shooting method, at least in the above form, can only
be used for linear BVPs, because it is only for them that the linear superposition principle,
allowing us to write

    w = u + θ v,

can be used.
Further reading on the multiple shooting method can be found in:
P. Deuflhard, Recent advances in multiple shooting techniques, in Computational Techniques
for Ordinary Differential Equations, I. Gladwell and D.K. Sayers, eds. (Academic Press, 1980);
J. Stoer and R. Bulirsch, Introduction to Numerical Analysis (Springer-Verlag, 1980);
G. Hall and J.M. Watt, Modern Numerical Methods for Ordinary Differential Equations (Clarendon Press, 1976).
7.5 Shooting method for nonlinear BVPs

As has just been noted above, for nonlinear BVPs the linear superposition of the auxiliary
solutions u and v cannot be used. However, one can still proceed with using the shooting
method by following the general guidelines of Sec. 7.1.
As an example, let us consider the BVP

    y'' = y^2 / (2 + x),     y(0) = 1,   y(2) = 1.        (7.46)

We again consider the auxiliary IVP

    y_1' = y_2,
    y_2' = y_1^2 / (2 + x);
    y_1(0) = 1,   y_2(0) = θ.        (7.47)

The idea is now to find the right value(s) of θ iteratively. To motivate the iteration algorithm,
let us actually solve the IVP (7.47) for an (equidistant) set of θ's inside some large interval and
look at the result, which is shown in the figure below. (The reason this particular interval of θ
values is chosen is simply because the instructor knows what the result should be.) This figure
shows that the boundary condition at the right end point,

    y(2)|_θ = 1        (7.48)

(with y(2)|_θ regarded as a function of θ), can be considered as a nonlinear algebraic equation
with respect to θ. Correspondingly, we can employ well-known methods of solving nonlinear
algebraic equations for solving nonlinear BVPs.

[Figure: y(2)|_θ versus θ for -15 ≤ θ ≤ 0; the horizontal line marks the prescribed boundary value y(2) = 1, which the curve crosses at two values of θ.]
Probably the simplest such method is the secant method. Below we will show how to use
it to find the values θ* and θ**, for which y(2) = 1 (see the figure). Suppose we have tried two
values, θ_1 and θ_2, and found the corresponding values y(2)|_{θ_1} and y(2)|_{θ_2}. Denote

    F(θ_k) = y(2)|_{θ_k} - 1,     k = 1, 2.        (7.49)

Thus, our goal is to find the roots of the equation

    F(θ) = 0.        (7.50)

Given the first two values of F(θ) at θ = θ_{1,2}, the secant method proceeds as follows:

    θ_{k+1} = θ_k - F(θ_k) (θ_k - θ_{k-1}) / ( F(θ_k) - F(θ_{k-1}) ),   and then compute F(θ_{k+1}) from (7.49).        (7.51)

The iterations are stopped when |F(θ_{k+1}) - F(θ_k)| becomes less than a prescribed tolerance.
In this manner, one will find the values θ* and θ** and hence the corresponding two solutions of
the nonlinear BVP (7.46). You will be asked to do so in a homework problem.
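A minimal sketch of this procedure is given below (my illustration, not part of the notes). It integrates the IVP (7.47) with SciPy, scans θ over the interval shown in the figure to locate sign changes of F(θ), and then refines each root with the secant iteration (7.51). The right-hand side is coded with the signs as printed in (7.46); if your copy of the notes has different signs, only the function rhs needs to change.

    import numpy as np
    from scipy.integrate import solve_ivp

    def rhs(x, s):
        # auxiliary IVP (7.47): y1' = y2, y2' = y1^2/(2 + x)
        y1, y2 = s
        return [y2, y1**2 / (2.0 + x)]

    def F(theta):
        """F(theta) = y(2)|_theta - 1, cf. Eq. (7.49)."""
        sol = solve_ivp(rhs, (0.0, 2.0), [1.0, theta], rtol=1e-10, atol=1e-12)
        return sol.y[0, -1] - 1.0

    def secant(t0, t1, tol=1e-10, maxit=50):
        """Secant iteration (7.51) applied to F(theta) = 0."""
        F0, F1 = F(t0), F(t1)
        for _ in range(maxit):
            t2 = t1 - F1 * (t1 - t0) / (F1 - F0)
            t0, F0 = t1, F1
            t1, F1 = t2, F(t2)
            if abs(F1) < tol:
                break
        return t1

    # coarse scan over theta, as in the figure, followed by secant refinement
    thetas = np.linspace(-15.0, 0.0, 31)
    vals = [F(t) for t in thetas]
    for tl, tr, Fl, Fr in zip(thetas[:-1], thetas[1:], vals[:-1], vals[1:]):
        if Fl * Fr < 0:
            print("root of F(theta) near theta =", secant(tl, tr))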
7.6 Broader applicability of the shooting method

We will conclude with two remarks. The first will outline another case where the shooting
method can be used. The other will mention an important case where this method cannot be
used.

7.6.1 Shooting method for finding discrete eigenvalues

Consider a BVP

    y'' + ( 2 sech^2 x - λ^2 ) y = 0,     x ∈ (-∞, ∞),     y(|x| → ∞) → 0.        (7.52)

Here the term 2 sech^2 x could be generalized to any potential V(x) that has one or several
"humps" in its central region and decays to zero as |x| → ∞. Such a BVP is solvable (i.e., a
y(x) can be found such that y(|x| → ∞) → 0) only for some special values of λ, called the
eigenvalues of this BVP. The corresponding solution, called an eigenfunction, is a localized
"blob", which has some structure in the region where the potential is significantly different
from zero and which vanishes at the ends of the infinite line. An example of such an eigenfunction
is shown in the figure below. Note that in general, an eigenfunction may have a more complicated
structure at the center than just a single hump.

[Figure: a typical eigenfunction y(x): a single localized hump centered near x = 0, decaying to zero for |x| of order 10.]

A variant of the shooting method which can find these eigenvalues is the following. First,
since one cannot literally model the infinite line interval (-∞, ∞), consider the above BVP on
the interval [-R, R] for some reasonably large R (say, R = 10). For a given λ in the BVP,
choose the initial conditions for the shooting as

    y(-R) = y_0,     y'(-R) = λ y(-R),        (7.53)

for some very small y_0, which we will discuss later. The reason behind the above relation between
y'(-R) and y(-R) is this. Since the potential 2 sech^2 x (almost) vanishes at |x| = R,
(7.52) reduces to y'' - λ^2 y ≈ 0, and hence y' ≈ ±λy at x = -R. Note that of the two possibilities,
y' ≈ λy and y' ≈ -λy, which are implied by y'' - λ^2 y ≈ 0, we have chosen the former, because
it is its solution,

    y = e^{λx},        (7.54)

which agrees with the behavior of the eigenfunction at the left end of the real line (see the
figure above).
The constant y_0 in (7.53) can be taken as

    y_0 = e^{-cR},        (7.55)

where the constant c is of order one. Often one can simply take c = 1.
Now, compute the solution of the IVP consisting of the ODE in (7.52) and the initial
conditions (7.53) and record the value y(R). This value can be denoted as G(λ) since it has
been obtained for a particular value of λ: G(λ) ≡ y(R)|_λ. Repeat this process for values of
λ = λ_min + j·Δλ, j = 0, 1, 2, ... in some specified interval [λ_min, λ_max]; as a result, one obtains
a set of points representing a curve G(λ). Those values of λ where this curve crosses zero
correspond to the eigenvalues (see footnote 13). Indeed, there y(R) = 0, which is the approximate relation
satisfied by eigenfunctions of (7.52) at x = R for R ≫ 1.
Footnote 13: In practice, one uses a more accurate method, but the description of this technical detail is outside the
scope of this lecture.
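A sketch of this eigenvalue search is shown below (my illustration, not part of the notes). It shoots across [-R, R] for a grid of λ values and reports the brackets where G(λ) = y(R)|_λ changes sign; for the potential 2 sech^2 x the bound state is known to occur at λ = 1, which is what the scan should bracket. The grid of λ values and the tolerances are arbitrary choices.

    import numpy as np
    from scipy.integrate import solve_ivp

    R = 10.0

    def G(lam):
        """G(lambda) = y(R) obtained by shooting from x = -R with conditions (7.53), (7.55)."""
        def rhs(x, s):
            y, z = s
            return [z, (lam**2 - 2.0/np.cosh(x)**2) * y]   # y'' = (lam^2 - 2 sech^2 x) y
        y0 = np.exp(-R)                                    # (7.55) with c = 1
        sol = solve_ivp(rhs, (-R, R), [y0, lam*y0], rtol=1e-10, atol=1e-14)
        return sol.y[0, -1]

    lams = np.arange(0.2, 2.01, 0.05)
    vals = [G(l) for l in lams]
    for l1, l2, g1, g2 in zip(lams[:-1], lams[1:], vals[:-1], vals[1:]):
        if g1*g2 < 0:
            print("eigenvalue bracketed between", l1, "and", l2)   # should bracket lambda = 1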
7.6.2 Inapplicability of the shooting method in higher dimensions

Boundary value problems considered in this section are one-dimensional, in that they involve
derivatives with respect to only one variable x. Such BVPs often arise in the description of
one-dimensional objects such as beams and strings. A natural generalization of these to two
dimensions are plates and membranes. For example, the classic Helmholtz equation

    ∂^2 u/∂x^2 + ∂^2 u/∂y^2 + k^2 u = 0,        (7.56)

where u = 0 along the boundaries of a square with vertices (x, y) = (0, 0), (1, 0), (1, 1), (0, 1),
arises in the mathematical description of oscillations of a square membrane.
One could attempt to obtain a solution of this BVP by "shooting" from, say, the left side
of this square to the right side. However, not only is this tedious to implement while
accounting for all possible combinations of ∂u/∂x along the side (x = 0, 0 ≤ y ≤ 1), but
also the results of such shooting will be dominated by numerical error and will have nothing in
common with the true solution. The reason is that an IVP for certain two-dimensional equations,
of which (7.56) is a particular case, is ill-posed. We will not go further into this
issue (see footnote 14), since it would substantially rely on the material studied in a course on partial
differential equations. What is important to remember from this brief discussion is that the
shooting method can be used only for one-dimensional BVPs.
In the next lecture we will introduce alternative methods that can be used to solve BVPs
both in one and many dimensions. However, we will only consider their applications in one
dimension.
Footnote 14: We will, however, arrive at the same conclusion, but from another viewpoint, in Lecture 11.

7.7 Questions for self-assessment

1. Explain the basic idea behind the shooting method (Sec. 7.1).
2. Why did we require that Q(x) ≤ 0 in (7.4)?
3. Verify (7.8).
4. Suppose that Q(x) > 0 and hence we can have v(b) = 0, as in (7.12). Does (7.12) mean only that the BVP (7.4) will have no solutions, or are there other possibilities?
5. Verify (7.15) and (7.16).
6. Verify (7.20).
7. Verify (7.26).
8. Suppose you need to solve a 5th-order linear BVP. How many auxiliary systems do you need to consider? How many parameters (analogues of θ) do you need to introduce and then solve for?
9. What allows us to say that by Theorem 6.2 the solution (7.32) is unique?
10. Verify (7.37) and (7.39).
11. What causes the large deviation of the numerical solution from the exact one in the figure found under Eq. (7.39)?
12. Describe the key idea behind the multiple shooting method.
13. Suppose we use 3 subintervals for multiple shooting. How many parameters analogous to θ^(1), θ^(2,1), and θ^(2,2) will we need? What are the meanings of the conditions from which these parameters can be determined?
14. Verify that the r.h.s.'s in (7.45) are correct.
15. Describe the key idea behind the shooting method for nonlinear BVPs.
16. For a general nonlinear BVP, can one tell how many solutions one will find?
8 Finite-difference methods for BVPs

In this Lecture, we consider methods for solving BVPs whose idea is to replace the
BVP with a system of approximating algebraic equations. In the first five sections, we will
(mostly) deal with linear BVPs, so that the corresponding system of equations will be linear.
The last section will show how nonlinear BVPs can be approached.

8.1 Matrix problem for the discretized solution

Let us begin by considering a linear BVP with Dirichlet boundary conditions:

    y'' + P(x)y' + Q(x)y = R(x),     y(a) = α,   y(b) = β.        (8.1)

As before, we assume that P, Q, and R are twice continuously differentiable, so that y is a four
times continuously differentiable function of x. Also, we consider the case where Q(x) ≤ 0 on
[a, b], so that the BVP (8.1) is guaranteed by Theorem 6.2 to have a unique solution.
Let us replace y'' and y' in (8.1) by their second-order accurate discretizations:

    y'' = (1/h^2) ( y_{n+1} - 2y_n + y_{n-1} ) + O(h^2),        (8.2)

    y' = (1/(2h)) ( y_{n+1} - y_{n-1} ) + O(h^2).        (8.3)
Upon omitting the O(h^2) terms and substituting (8.2) and (8.3) into (8.1), we obtain the
following system of linear equations:

    Y_0 = α;
    (1 + (h/2)P_n) Y_{n+1} - (2 - h^2 Q_n) Y_n + (1 - (h/2)P_n) Y_{n-1} = h^2 R_n,     1 ≤ n ≤ N - 1;
    Y_N = β.        (8.4)

In the matrix form, this is

    A Y = r,        (8.5)

where A is the (N-1)×(N-1) tridiagonal matrix

        ( -(2 - h^2 Q_1)       1 + (h/2)P_1          0                    ...                  0                )
        (  1 - (h/2)P_2       -(2 - h^2 Q_2)         1 + (h/2)P_2          0                   ...              )
    A = (  0                   1 - (h/2)P_3         -(2 - h^2 Q_3)         1 + (h/2)P_3        ...              )        (8.6)
        (  ...                                                                                                  )
        (  0    ...            1 - (h/2)P_{N-2}     -(2 - h^2 Q_{N-2})     1 + (h/2)P_{N-2}                     )
        (  0    ...            0                     1 - (h/2)P_{N-1}     -(2 - h^2 Q_{N-1})                    )

Y = [Y_1, Y_2, ..., Y_{N-1}]^T, and

    r = [ h^2 R_1 - α(1 - (h/2)P_1),  h^2 R_2,  h^2 R_3,  ...,  h^2 R_{N-2},  h^2 R_{N-1} - β(1 + (h/2)P_{N-1}) ]^T;        (8.7)

the superscript T in (8.7) denotes the transpose.
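To make the bookkeeping in (8.4)-(8.7) concrete, here is a small sketch (my illustration) that assembles the matrix A and the vector r for the example BVP y'' - y = x, y(0) = 0, y(1) = 1 (so P = 0, Q = -1, R = x; this test problem is my choice, not from the notes) and solves the system with a dense solver for simplicity.

    import numpy as np

    a, b, alpha, beta, N = 0.0, 1.0, 0.0, 1.0, 50
    h = (b - a) / N
    x = a + h*np.arange(N + 1)
    P = np.zeros(N + 1); Q = -np.ones(N + 1); R = x.copy()

    # assemble the tridiagonal matrix (8.6) and right-hand side (8.7)
    A = np.zeros((N - 1, N - 1))
    r = h**2 * R[1:N]
    for i, n in enumerate(range(1, N)):          # interior nodes n = 1 .. N-1
        A[i, i] = -(2.0 - h**2 * Q[n])
        if i > 0:
            A[i, i-1] = 1.0 - 0.5*h*P[n]
        if i < N - 2:
            A[i, i+1] = 1.0 + 0.5*h*P[n]
    r[0]  -= alpha * (1.0 - 0.5*h*P[1])
    r[-1] -= beta  * (1.0 + 0.5*h*P[N-1])

    Y = np.zeros(N + 1)
    Y[0], Y[N] = alpha, beta
    Y[1:N] = np.linalg.solve(A, r)               # in practice: the Thomas algorithm of Sec. 8.2

    exact = 2*np.sinh(x)/np.sinh(1.0) - x        # exact solution of this example BVP
    print("max error:", np.max(np.abs(Y - exact)))

Halving h should reduce the reported error by roughly a factor of 4, consistent with the O(h^2) accuracy discussed in Sec. 8.3; in practice one would solve the tridiagonal system with the Thomas algorithm of the next section rather than a dense solver.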
From Linear Algebra, it is known that the linear system (8.5) has a unique solution if (and
only if) the matrix A is nonsingular. Therefore, below we list some results that will allow us
to guarantee that, under certain conditions, a particular A is nonsingular.

Theorem 8.1 (Gerschgorin Circles Theorem)  Let a_{ij} be the entries of an M×M matrix A and let
λ_k, k = 1, ..., M be the eigenvalues of A. Then:
(i) Each eigenvalue lies in the union of the row circles R_i, where

    R_i = { z : |z - a_{ii}| ≤ Σ_{j=1, j≠i}^{M} |a_{ij}| }.        (8.8)

(In words, Σ_{j=1, j≠i}^{M} |a_{ij}| is the sum of the absolute values of all off-diagonal entries in the ith row.)
(ii) Similarly, since A and A^T have the same eigenvalues, each eigenvalue also lies in
the union of the column circles C_j, where

    C_j = { z : |z - a_{jj}| ≤ Σ_{i=1, i≠j}^{M} |a_{ij}| }.        (8.9)

(In words, Σ_{i=1, i≠j}^{M} |a_{ij}| is the sum of the absolute values of all off-diagonal entries in the jth column.)
(iii) Let ∪_{i=k}^{l} R_i be a cluster of (l - k + 1) row circles that is disjoint from all the other row
circles. Then it contains exactly (l - k + 1) eigenvalues.
A similar statement holds for column circles.

Example  Use the Gerschgorin Circles Theorem to estimate the eigenvalues of the matrix

    E = ( 1  2  1 ; 1  1  1 ; 1  3  -1 ).        (8.10)

Solution  The circles are listed and sketched below:

    R_1 = { z : |z - 1| ≤ |2| + |1| = 3 }        C_1 = { z : |z - 1| ≤ |1| + |1| = 2 }
    R_2 = { z : |z - 1| ≤ |1| + |1| = 2 }        C_2 = { z : |z - 1| ≤ |2| + |3| = 5 }
    R_3 = { z : |z + 1| ≤ |1| + |3| = 4 }        C_3 = { z : |z + 1| ≤ |1| + |1| = 2 }

[Figure: two sketches in the complex plane (Re z, Im z); the left panel shows the row circles R_1 and R_3, the right panel shows the column circles C_2 and C_3.]

(Circles R_2 and C_1 are not sketched because they are concentric with, and lie entirely within,
circles R_1 and C_2, respectively.)
The Gerschgorin Circles Theorem says that the eigenvalues of E must lie within the intersection
of ∪_{i=1}^{3} R_i and ∪_{j=1}^{3} C_j. For example, this gives that |Re λ| ≤ 4 for each and any of the
eigenvalues.
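For a quick numerical confirmation, the sketch below (my illustration) computes the eigenvalues of E directly and checks them against the bound |Re λ| ≤ 4 obtained from the circles. (The signs of the off-diagonal entries do not affect the Gerschgorin circles; the entries used below follow the form of (8.10) given above.)

    import numpy as np

    E = np.array([[1.0, 2.0, 1.0],
                  [1.0, 1.0, 1.0],
                  [1.0, 3.0, -1.0]])        # the matrix of (8.10)
    lam = np.linalg.eigvals(E)
    print(lam)                               # approximately -2, 3.30, -0.30
    print(np.all(np.abs(lam.real) <= 4.0))   # True: consistent with the Gerschgorin bound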
Before we can apply the Gerschgorin Circles Theorem, we need to introduce some new
terminology.

Definition  A matrix A is called diagonally dominant if

    either |a_{ii}| ≥ Σ_{j=1, j≠i}^{M} |a_{ij}|   or   |a_{ii}| ≥ Σ_{j=1, j≠i}^{M} |a_{ji}|,     1 ≤ i ≤ M,        (8.11)

with the strict inequality holding for at least one i.
A matrix is called strictly diagonally dominant (SDD) if

    either |a_{ii}| > Σ_{j=1, j≠i}^{M} |a_{ij}|   or   |a_{ii}| > Σ_{j=1, j≠i}^{M} |a_{ji}|   for all i = 1, ..., M.        (8.12)

In other words, in an SDD matrix, the sums of the absolute values of the off-diagonal elements
along either every row or every column are less than the corresponding diagonal entries (in
absolute value).

Theorem 8.2  If a matrix A is SDD, then it is nonsingular.

Proof  If A is SDD, then one of the inequalities (8.12) holds. Suppose the inequality for the
rows holds. Comparing that inequality with the r.h.s. of (8.8), we conclude, by the Gerschgorin
Circles Theorem, that the point λ = 0 lies outside of the union ∪_{i=1}^{M} R_i of Gerschgorin circles. Hence
it is automatically outside of the intersection of the unions ∪_{i=1}^{M} R_i and ∪_{i=1}^{M} C_i. Therefore,
λ = 0 is not an eigenvalue of A, and hence A is nonsingular.  q.e.d.

Theorem 8.3  Consider the BVP (8.1). If Q(x) ≤ 0, and if P(x) is bounded on [a, b] (i.e.
|P(x)| ≤ T for some T), then the discrete version (8.4) of the BVP in question has a unique
solution, provided that the step size satisfies hT ≤ 2.
Proof  Case (a): Q(x) < 0. In this case, matrix A in (8.6) is SDD, provided that hT ≤ 2.
Indeed, A is tridiagonal, with the diagonal elements being

    a_{ii} = -(2 - h^2 Q_i),     |a_{ii}| > 2.        (8.13)

The sum of the absolute values of the off-diagonal elements is

    |a_{i,i+1}| + |a_{i,i-1}| = |1 + (h/2)P_i| + |1 - (h/2)P_i| = ( 1 + (h/2)P_i ) + ( 1 - (h/2)P_i ) = 2.        (8.14)

In removing the absolute value signs in the above equation, we have used the fact that hT ≤ 2.
Now, comparing (8.13) with (8.14), we see that

    |a_{i,i+1}| + |a_{i,i-1}| < |a_{ii}|,

which means that A is SDD and hence, by Theorem 8.2, (8.4) has a unique solution.  q.e.d.
Case (b): Q(x) ≤ 0 requires a more involved proof, which we omit.
Note: We emphasize that Theorem 8.3 gives a bound on the step size,

    h · max_{x∈[a,b]} |P(x)| ≤ 2,        (8.15)

which is a sufficient condition for the discretization (8.4) (with Q(x) ≤ 0) to be solvable.
8.2 Thomas algorithm

In the previous section, we considered the issue of whether the linear system (8.4), which
approximates the BVP (8.1), has a unique solution. In this section, we will consider the issue
of solving (8.4) in an efficient manner. The key fact that will allow us to do so is the tridiagonal
form of A; that is, A has nonzero entries only on the main diagonal and on the subdiagonals
directly above and below it. Let us recall that in order to solve (see footnote 15) a linear system of the
form (8.5) with a full M×M matrix A, one requires O(M^3) operations. However, this would be an
inefficient way to solve a linear system with a tridiagonal A. Below we will present an algorithm,
which has been discovered independently by several researchers in the 1950s, that allows one to
solve a linear system with a tridiagonal matrix using only O(M) operations.
A common practical way to numerically solve a linear system of the form

    A y = r        (8.5)

(with any matrix A) is via LU decomposition. Namely, one seeks two matrices L (lower
triangular) and U (upper triangular) such that (see footnote 16)

    L U = A.        (8.16)

Then (8.5) is solved in two steps:

    Step 1: L z = r,     and     Step 2: U y = z.        (8.17)

Footnote 15: by Gaussian elimination or by any other direct (i.e., non-iterative) method.
Footnote 16: Disclaimer: The theory of LU decomposition is considerably more involved than the simple excerpt from
it given here. We will not go into further details of that theory in this course.

The linear systems in (8.17) are then solved using forward (for Step 1) and backward (for Step
2) substitutions; details of this will be provided later on.
When A is tridiagonal, both finding the matrices L and U and solving the systems in (8.17)
require only O(M) operations. You will be asked to demonstrate that in a homework problem.
Here we will present details of the algorithm itself.
Let

        ( b_1   c_1   0     ...    0      )
        ( a_2   b_2   c_2   0      ...    )
    A = ( 0     a_3   b_3   c_3    ...    )        (8.18)
        ( ...                             )
        ( 0  ...  a_{M-1}  b_{M-1}  c_{M-1} )
        ( 0  ...  0        a_M      b_M    )

then we seek L and U in the form:

          ( 1     0     0    ...  0 )   ( β_1   c_1   0     ...   0        )
          ( α_2   1     0    ...  0 )   ( 0     β_2   c_2   ...   0        )
    L U = ( 0     α_3   1    ...  0 ) · ( ...                              )        (8.19)
          ( ...                     )   ( 0     ...   0   β_{M-1}  c_{M-1} )
          ( 0     ...   α_M       1 )   ( 0     ...   0     0      β_M     )
Multiplying the matrices in (8.19) and comparing the result with (8.18), we obtain:
Row 1:
1
= b
1
;
Row 2:
2

1
= a
2
,
2
c
1
+
2
= b
2
;
Row j:
j

j1
= a
j
,
j
c
j1
+
j
= b
j
.
(8.20)
Equations (8.20) are easily solved for the unknown coecients
j
and
j
:

1
= b
1
,

j
= a
j
/
j1
,
j
= b
j

j
c
j1
, j = 2, . . . , M .
(8.21)
Finally, we show how the systems in (8.17) can be solved. Let
z = [z
1
, z
2
, , z
M
]
T
, etc.
Then the forward substitution in Lz =r gives:
z
1
= r
1
,
z
j
= r
j

j
z
j1
, j = 2, . . . , M .
(8.22)
The backward substitution in Uy =z gives:
y
M
= z
M
/
M
,
y
j
= (z
j
c
j
y
j+1
)/
j
, j = M 1, . . . , 1 .
(8.23)
Thus, y is found in terms of r.
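As a concrete illustration (not part of the original notes), here is a minimal Python sketch of the procedure (8.21)–(8.23); the array names a, b, c, alpha, beta, r, z, y mirror the notation above.

```python
import numpy as np

def thomas_solve(a, b, c, r):
    """Solve a tridiagonal system with subdiagonal a, diagonal b, superdiagonal c
    and right-hand side r, following (8.21)-(8.23); a[0] and c[-1] are not used."""
    M = len(b)
    alpha = np.zeros(M); beta = np.zeros(M)
    z = np.zeros(M); y = np.zeros(M)

    beta[0] = b[0]                            # LU decomposition, Eq. (8.21)
    for j in range(1, M):
        alpha[j] = a[j] / beta[j - 1]
        beta[j] = b[j] - alpha[j] * c[j - 1]

    z[0] = r[0]                               # forward substitution, Eq. (8.22)
    for j in range(1, M):
        z[j] = r[j] - alpha[j] * z[j - 1]

    y[M - 1] = z[M - 1] / beta[M - 1]         # backward substitution, Eq. (8.23)
    for j in range(M - 2, -1, -1):
        y[j] = (z[j] - c[j] * y[j + 1]) / beta[j]
    return y
```

Each of the three loops performs $O(M)$ operations, in agreement with the operation count quoted above.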
The entire procedure (8.21)–(8.23) is fast, as we said earlier, and also requires the storage of only 8 one-dimensional arrays of size $O(M)$, for the coefficients $a_j$, $b_j$, $c_j$, $\alpha_j$, $\beta_j$, $r_j$, $z_j$, and $y_j$. Moreover, it is possible to show that when $A$ is SDD, i.e. when
$$|b_j| > |a_j| + |c_j|, \qquad j = 1, \ldots, M, \qquad (8.24)$$
then small round-off (or any other) errors do not get amplified by this algorithm; specifically, the numerical error remains small and independent of the size $M$ of the problem$^{17}$.
To conclude this section, we note that similar algorithms exist for other banded (e.g., pentadiagonal) matrices. The details of those algorithms can be found, e.g., in Sec. 3-2 of W. Ames, Numerical methods for partial differential equations, 3rd ed. (Academic Press, 1992).
8.3 Error estimates, and higher-order discretization
In this section, we will state, without proof, two theorems about the accuracy of the solutions
of systems of discretized equations approximating a given BVP.
Theorem 8.4 Let $\{Y_n\}_{n=1}^{N-1}$ be the solution of the discretized problem (8.4) (or, which is the same, (8.5)–(8.7)) and let $y(x)$ be the exact solution of the original BVP (8.1); then $\epsilon_n = y(x_n) - Y_n$ is the error of the numerical solution. In addition, let $T = \max_{x \in [a,\,b]} |P(x)|$ and also let $Q(x) \le \bar{Q} < 0$ (recall that $Q(x) \le 0$ is required for the BVP to have a unique solution). Then the error satisfies the following estimate:
$$\max_n |\epsilon_n| \le \frac{1}{h^2\left(|\bar{Q}| + \dfrac{8}{(b-a)^2}\right)} \left(\frac{1}{12}\, h^4 \left(M_4 + T M_3\right) + 2\delta\right), \qquad (8.25)$$
where $M_3 = \max_{x \in [a,\,b]} |y'''|$, $M_4 = \max_{x \in [a,\,b]} |y''''|$, and $\delta$ is the round-off error.
When the round-off error is neglected, estimate (8.25) shows that the discrete approximation (8.4) to the BVP produces an error on the order of $O(h^2)$ (i.e., in other words, it is second-order accurate). At first sight, this is similar to how the finite-difference approximations (8.2) and (8.3) led to second-order accurate methods for IVPs. In fact, for both the IVPs and BVPs, the expression inside the largest parentheses on the r.h.s. of (8.25) is the local truncation error. However, the interpretation of the $O(1/h^2)$ factor in front of those parentheses differs from the interpretation of the similar factor for IVPs. In the latter case, that factor arose from accumulation of the error over $O(1/h)$ steps. On the contrary, for BVPs, that factor is the so-called condition number of the matrix $A$ in (8.5), defined as
$$\mathrm{cond}\, A = \|A\| \cdot \|A^{-1}\|\,, \qquad (8.26)$$
where $\|A\|$ is the norm of matrix $A$, subordinate to a particular vector norm$^{18}$. In courses on Numerical Analysis, it is shown that the condition number is the factor by which a small error in the vector $\vec{r}$ is amplified in the corresponding error of the solution $\vec{Y}$:
$$A(\vec{Y} + \delta\vec{Y}) = \vec{r} + \delta\vec{r} \quad\Rightarrow\quad \frac{\|\delta\vec{Y}\|}{\|\vec{Y}\|} \le \mathrm{cond}\, A\ \frac{\|\delta\vec{r}\|}{\|\vec{r}\|}\,. \qquad (8.27)$$
$^{17}$ In line with the earlier disclaimer about the LU decomposition, we also note that condition (8.24) is just the simplest of possible conditions that guarantee boundedness of the error. Other conditions exist and are studied in advanced courses in Numerical Analysis.
$^{18}$ Definition (8.26) is given here only for completeness of the presentation; we will not need more precise definitions of the norms involved.
One can show (but we will not do so here) that the condition number of the matrix $A$ in (8.5) is $O\!\left(\tfrac{1}{h^2}\right)$; this fact, along with Eq. (8.27) and the local truncation error $\delta\vec{r}$ being $O(h^4)$, implies estimate (8.25) for the solution error.
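This growth of $\mathrm{cond}\,A$ can be checked numerically. The following small sketch (an illustration, not part of the notes) builds the simplest matrix of the form (8.6), i.e. the one with $P = Q = 0$, for several grid sizes and shows that $h^2 \cdot \mathrm{cond}\,A$ approaches a constant:

```python
import numpy as np

for N in (10, 20, 40, 80):                      # subintervals of [a, b] = [0, 1]
    h = 1.0 / N
    M = N - 1                                   # number of interior points
    A = (np.diag(-2.0 * np.ones(M))
         + np.diag(np.ones(M - 1), 1)
         + np.diag(np.ones(M - 1), -1))
    print(N, h**2 * np.linalg.cond(A, p=np.inf))   # tends to a constant as h -> 0
```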
The numerical error of the discrete approximation of the BVP (8.1) can be significantly reduced if, instead of the simple central-difference approximation to $y'' = f(x, y)$, as in (8.2), one uses Numerov's formula (5.18). Specifically, if $y'$ does not enter the BVP, i.e. when the BVP is
$$y'' = f(x, y), \qquad y(a) = \alpha, \quad y(b) = \beta, \qquad (8.28)$$
then Numerov's formula leads to the following system of discrete equations:
$$\begin{array}{l} Y_0 = \alpha; \\ Y_{n+1} - 2Y_n + Y_{n-1} = \dfrac{h^2}{12}\left(f_{n+1} + 10 f_n + f_{n-1}\right), \qquad 1 \le n \le N-1; \\ Y_N = \beta. \end{array} \qquad (8.29)$$
In particular, when $f(x, y) = -Q(x)y + R(x)$, system (8.29) is linear. Since the local truncation error of Numerov's method is $O(h^6)$, the solution of system (8.29) must be a fourth-order accurate approximation to the exact solution of the BVP (8.28). More precisely, the following theorem holds true:
Theorem 8.5 Let $\{Y_n\}_{n=1}^{N-1}$ be the solution of the discretized problem (8.29), where $f(x, y) = -Q(x)y + R(x)$ (so that this system is linear). Then the error $\epsilon_n = y(x_n) - Y_n$ satisfies the following estimate:
$$\max_n |\epsilon_n| \le \frac{1}{h^2\left(|\bar{Q}| + \dfrac{8}{(b-a)^2}\right)} \left(\frac{1}{240}\, h^6 M_6 + 2\delta\right), \qquad (8.30)$$
where the notations are the same as in Theorem 8.4 and, in addition, $M_6 = \max_{x \in [a,\,b]} |y^{(6)}|$.
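For the linear case $f(x, y) = -Q(x)y + R(x)$, system (8.29) is itself a tridiagonal linear system for $Y_1, \ldots, Y_{N-1}$, and it can be solved with the Thomas routine sketched in Sec. 8.2. The fragment below is an illustration only (not part of the notes); Qf and Rf are assumed to be vectorized Python functions returning $Q(x)$ and $R(x)$, and thomas_solve is the routine from Sec. 8.2.

```python
import numpy as np

def numerov_linear_bvp(Qf, Rf, a, b, alpha, beta, N):
    """Assemble and solve system (8.29) for y'' = -Q(x) y + R(x),
    y(a) = alpha, y(b) = beta, on N subintervals (a sketch)."""
    h = (b - a) / N
    x = a + h * np.arange(N + 1)
    Q, R = Qf(x), Rf(x)
    c = 1.0 + h**2 / 12.0 * Q            # coefficient of Y_{n+1} and Y_{n-1}
    d = -2.0 + 10.0 * h**2 / 12.0 * Q    # coefficient of Y_n

    # tridiagonal system for the interior unknowns Y_1 ... Y_{N-1}
    sub = np.r_[0.0, c[1:N-1]]           # sub[j] multiplies the unknown to the left
    diag = d[1:N]
    sup = np.r_[c[2:N], 0.0]             # sup[j] multiplies the unknown to the right
    rhs = h**2 / 12.0 * (R[2:N+1] + 10.0 * R[1:N] + R[0:N-1])
    rhs[0] -= c[0] * alpha               # known boundary values moved to the r.h.s.
    rhs[-1] -= c[N] * beta

    Y = np.empty(N + 1)
    Y[0], Y[N] = alpha, beta
    Y[1:N] = thomas_solve(sub, diag, sup, rhs)   # Thomas routine from Sec. 8.2
    return x, Y
```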
8.4 Neumann and mixed boundary conditions
Suppose that at, say, $x = a$, the boundary condition is
$$A_1\, y(a) + A_2\, y'(a) = \gamma, \qquad (8.31)$$
where $A_1$ and $A_2$ are some constants. (Recall that for $A_1 = 0$, this is the Neumann boundary condition, while for $A_1 A_2 \ne 0$, this condition is of the mixed type.) Below we present two methods that allow one to construct generalizations of systems (8.4) (simple central-difference discretization) and (8.29) (Numerov's formula) for the case of the boundary condition (8.31) instead of
$$y(a) = \alpha.$$
Note that the problem we need to handle for such a generalization is obtaining an approximation to $y'(a)$ in (8.31) that has accuracy consistent with the accuracy of the discretization scheme. That is, the accuracy needs to be second order for a generalization of (8.4) and fourth order for a generalization of (8.29).
Method 1
This method is efficient only for a generalization of the second-order accurate approximation. Thus, we will look for a second-order accurate finite-difference approximation to $y'(a)$.
Introduce a fictitious point $x_{-1} = a - h$ and let the approximate solution at that point be $Y_{-1}$. Once one knows $Y_{-1}$, the approximation sought is
$$y'(a) = \frac{Y_1 - Y_{-1}}{2h} + O(h^2), \qquad (8.32)$$
so that the equation approximating the boundary condition (8.31) is
$$x = a: \qquad A_1 Y_0 + A_2\, \frac{Y_1 - Y_{-1}}{2h} = \gamma\,. \qquad (8.33)$$
The discrete analog of the ODE itself at $x = a$ is:
$$\left(1 + \frac{h}{2}P_0\right) Y_1 - \left(2 - h^2 Q_0\right) Y_0 + \left(1 - \frac{h}{2}P_0\right) Y_{-1} = h^2 R_0\,. \qquad (8.34)$$
Equations (8.33) and (8.34) should replace the first equation,
$$Y_0 = \alpha, \qquad (8.35)$$
in (8.4), so that the dimension of the resulting system of equations in (8.4) increases from $(N-1) \times (N-1)$ to $(N+1) \times (N+1)$. Note that adding the two new equations (8.33) and (8.34) to system (8.4) is consistent with introducing the two new unknowns $Y_{-1}$ and $Y_0$.
Later on we will see that, rather than dealing with the two equations (8.33) and (8.34), it is more practical to solve (8.33) for $Y_{-1}$ and substitute the result into (8.34). Then, instead of the two equations (8.33) and (8.34), we will have one equation that needs to replace (8.35):
$$2Y_1 - \left[2 - h^2 Q_0 - \frac{2h A_1}{A_2}\left(1 - \frac{h}{2}P_0\right)\right] Y_0 = h^2 R_0 + \frac{2h\,\gamma}{A_2}\left(1 - \frac{h}{2}P_0\right). \qquad (8.36)$$
(We are justified to assume in (8.36) that $A_2 \ne 0$, since otherwise the boundary condition (8.31) becomes Dirichlet.) The resulting system then contains $1 + (N-1) = N$ equations for the $N$ unknowns $Y_0$ through $Y_{N-1}$, and hence can be solved for a unique solution $\{Y_n\}_{n=0}^{N-1}$, unless the coefficient matrix of this system is singular. We will remark on the latter possibility after we describe the other method.
Method 2
This method does not use a fictitious point to approximate the boundary condition with the required accuracy. We will first show how this method can be used for the second-order accurate approximation (8.4) and then indicate how it (the method) can be modified for the fourth-order accurate approximation (8.29).
In analogy with Eq. (3.10) of Lecture 3, one can obtain:
$$\frac{y_{n+1} - y_n}{h} = y'_n + \frac{h}{2}\, y''_n + O(h^2)\,. \qquad (8.37)$$
Then, using the ODE in (8.1) to express $y''$ as $R - P y' - Q y$, one finds:
$$\begin{aligned} y'_n &= \frac{y_{n+1} - y_n}{h} - \frac{h}{2}\left(R_n - P_n y'_n - Q_n y_n\right) + O(h^2) \\ &= \frac{y_{n+1} - y_n}{h} - \frac{h}{2}\left[R_n - P_n\left(\frac{y_{n+1} - y_n}{h} + O(h)\right) - Q_n y_n\right] + O(h^2)\,. \end{aligned} \qquad (8.38)$$
Therefore, the boundary condition (8.31) can be approximated by
$$A_1 Y_0 + A_2\left[\frac{Y_1 - Y_0}{h} - \frac{h}{2}\left(R_0 - P_0\,\frac{Y_1 - Y_0}{h} - Q_0 Y_0\right)\right] = \gamma, \qquad (8.39)$$
with the discretization error being $O(h^2)$. To answer some of the Questions for self-assessment, you will need to collect the coefficients of $Y_0$ and $Y_1$ in (8.39) to put it in a form similar to that of (8.36). Such a rearranged Eq. (8.39) should then be used along with the $N-1$ equations in (8.4) for $\{Y_n\}_{n=1}^{N-1}$. Thus we have a system of $N$ equations for the $N$ unknowns $\{Y_n\}_{n=0}^{N-1}$, which can be solved (again, unless its coefficient matrix is singular).
If one requires a fourth-order accurate approximation to $y'(a)$, one needs to use two more terms in expansion (8.37):
$$\frac{y_{n+1} - y_n}{h} = y'_n + \frac{h}{2}\, y''_n + \frac{h^2}{6}\, y'''_n + \frac{h^3}{24}\, y''''_n + O(h^4) \qquad (8.40)$$
and replace $y''$ and the higher-order derivatives using $y''_n = f(x_n, y_n) \equiv f_n$, $f_{n+1} = f_n + h\,\frac{d}{dx} f_n + \ldots = y''_n + h\, y'''_n + \ldots$, etc. The result is:
$$y'_n = \frac{y_{n+1} - y_n}{h} - \frac{h}{24}\left(7 f_n + 6 f_{n+1} - f_{n+2}\right) + O(h^4)\,. \qquad (8.41)$$
Equation (8.41) with n = 0 can then be used to obtain a fourth-order accurate counterpart of
(8.39).
Remark 1: Recall that the coefficient matrix in the discretized BVP (8.4) with Dirichlet boundary conditions is SDD and hence nonsingular, provided that $h$ satisfies the condition of Theorem 8.3. We will now state a requirement that the coefficients $A_1$, $A_2$ in the non-Dirichlet boundary condition (8.31) must satisfy in order to guarantee that the corresponding BVP has a unique solution. To this end, we consider the situations arising in Methods 1 and 2 separately.
When Eq. (8.36) replaces the Dirichlet boundary condition (8.35) in Method 1, the resulting coefficient matrix is SDD provided that
$$A_1 A_2 \le 0, \qquad (8.42)$$
in addition to the earlier requirements $Q(x) < 0$ and $h \max |P(x)| \le 2$. (You will be asked to verify (8.42) in one of the Questions for self-assessment.) On the other hand, if the two individual equations (8.33) and (8.34) are used instead of their combined form (8.36), the corresponding coefficient matrix is no longer SDD. This is one advantage of using (8.36) instead of (8.33) and (8.34).
Thus, condition (8.42) is sufficient, but not necessary, to guarantee that the corresponding discretized BVP has a unique solution. That is, even when (8.42) does not hold, one can still attempt to solve the corresponding linear system, since strict diagonal dominance is only a sufficient, but not necessary, condition for $A$ to be nonsingular.
In the case of Method 2, by collecting the coefficients of $Y_0$ and $Y_1$ in Eq. (8.39), it is straightforward (although, perhaps, a little tedious) to show that the coefficient matrix is SDD if (8.42) and the two conditions stated one line below it are satisfied. You will verify this in a homework problem. The fourth-order accurate analog of (8.39), based on Eq. (8.41), also yields the same conditions for strict diagonal dominance of the coefficient matrix.
Remark 2: In the case of the BVP with Dirichlet boundary conditions, the coefficient matrix is tridiagonal (see Eq. (8.6)), and hence the corresponding linear system can be solved efficiently by the Thomas algorithm. In the case of non-Dirichlet boundary conditions, one can show (and you will be asked to do so) that Method 1 based on Eq. (8.36) yields a tridiagonal system, but the same method using Eqs. (8.33) and (8.34) does not. This is the other advantage of using the single Eq. (8.36) in this method.
Method 2 for a second-order accurate approximation to the BVP gives a tridiagonal matrix. This can be straightforwardly shown by the same rearrangement of terms in Eq. (8.39) that was used above to show that the corresponding matrix is SDD. However, the fourth-order accurate modification of (8.39), based on (8.41), produces a matrix $\widetilde{A}$ that is no longer tridiagonal. One can handle this situation in two ways. First, if one is willing to sacrifice one order of accuracy in exchange for the convenience of having a tridiagonal coefficient matrix, then instead of (8.41), one can use
$$y'_n = \frac{y_{n+1} - y_n}{h} - \frac{h}{6}\left(2 f_n + f_{n+1}\right) + O(h^3)\,. \qquad (8.43)$$
Then the modification of (8.39) based on (8.43) does result in a tridiagonal coefficient matrix. Alternatively, one can see that the matrix $\widetilde{A}$ obtained with the fourth-order accurate formula (8.41) differs from a tridiagonal one only in having its $(1, 3)$ entry nonzero (verify). Thus, $\widetilde{A}$ is in some sense very close to a tridiagonal matrix, and we can hope that this fact could be used to find $\widetilde{A}^{-1}$ with only $O(M)$ operations. This can indeed be done by using the algorithm described in the next section.
8.5 Periodic boundary condition; Sherman–Morrison algorithm
In this section, we will consider the case when BVP (8.1) has periodic boundary conditions. We will see that the coefficient matrix arising in this case is not tridiagonal but is, in some sense, close to it. We will then present an algorithm that will allow us to find the inverse of that matrix using only $O(M)$ operations, where $M \times M$ is the dimension of the matrix. The same method can also be used in other situations, including the inversion of the matrix $\widetilde{A}$ defined in the last paragraph of the preceding section.
We first obtain the analogues of Eqs. (8.5)–(8.7) in the case of the BVP having periodic boundary conditions. Consider the corresponding counterpart of the BVP (8.1):
$$y'' + P(x)y' + Q(x)y = R(x), \qquad y(a) = y(b)\,. \qquad (8.44)$$
The corresponding counterpart of system (8.4) is
$$\begin{array}{l} Y_0 = Y_N; \\ \left(1 + \dfrac{h}{2}P_n\right) Y_{n+1} - \left(2 - h^2 Q_n\right) Y_n + \left(1 - \dfrac{h}{2}P_n\right) Y_{n-1} = h^2 R_n, \qquad 0 \le n \le N\,. \end{array} \qquad (8.45)$$
Note 1: The index $n$ in (8.45) runs from 0 to $N$, while in (8.4) it runs from 1 to $N-1$.
Note 2: Since our problem has periodic boundary conditions, it is logical to let
$$Y_{-n} = Y_{N-n} \quad \text{and} \quad Y_{N+n} = Y_n, \qquad 0 \le n \le N\,.$$
In particular,
$$Y_{-1} = Y_{N-1} \quad \text{and} \quad Y_N = Y_0 \qquad (8.46)$$
(the last equation here is just the original periodic boundary condition).
In view of Eq. (8.46), the equations in system (8.45) with $n = 0$ and $n = N-1$ can be written as follows:
$$\begin{array}{l} \left(1 + \dfrac{h}{2}P_0\right) Y_1 - \left(2 - h^2 Q_0\right) Y_0 + \left(1 - \dfrac{h}{2}P_0\right) Y_{N-1} = h^2 R_0, \\ \left(1 + \dfrac{h}{2}P_{N-1}\right) Y_0 - \left(2 - h^2 Q_{N-1}\right) Y_{N-1} + \left(1 - \dfrac{h}{2}P_{N-1}\right) Y_{N-2} = h^2 R_{N-1}\,. \end{array} \qquad (8.47)$$
With this result, system (8.45) can be written in the matrix form (8.5), where now
$$\vec{Y} = [Y_0, Y_1, \ldots, Y_{N-1}]^T, \qquad (8.48)$$
$$A = \begin{pmatrix} -(2 - h^2 Q_0) & 1 + \frac{h}{2}P_0 & 0 & \cdots & 0 & 1 - \frac{h}{2}P_0 \\ 1 - \frac{h}{2}P_1 & -(2 - h^2 Q_1) & 1 + \frac{h}{2}P_1 & 0 & \cdots & 0 \\ 0 & 1 - \frac{h}{2}P_2 & -(2 - h^2 Q_2) & 1 + \frac{h}{2}P_2 & \cdots & 0 \\ & & \ddots & \ddots & \ddots & \\ 0 & \cdots & 0 & 1 - \frac{h}{2}P_{N-2} & -(2 - h^2 Q_{N-2}) & 1 + \frac{h}{2}P_{N-2} \\ 1 + \frac{h}{2}P_{N-1} & 0 & \cdots & 0 & 1 - \frac{h}{2}P_{N-1} & -(2 - h^2 Q_{N-1}) \end{pmatrix} \qquad (8.49)$$
and
$$\vec{r} = \left[h^2 R_0,\ h^2 R_1,\ h^2 R_2,\ \ldots,\ h^2 R_{N-2},\ h^2 R_{N-1}\right]^T. \qquad (8.50)$$
Matrix $A$ in Eq. (8.49) differs from its counterpart in Eq. (8.6) in two respects: (i) its dimension is $N \times N$ rather than $(N-1) \times (N-1)$, and (ii) it is not tridiagonal, due to the terms in its upper-right and lower-left corners. Such matrices are called circulant.
Thus, to obtain the solution of the BVP (8.44), we will need to solve the linear system (8.5) with the non-tridiagonal matrix $A$. We will now show how this problem can be reduced to the solution of a system with a tridiagonal matrix. To this end, we first make a preliminary observation. Let $\vec{w}$ be a vector whose only nonzero entry is its $i$th entry, equal to $w_i$. Let $\vec{z}$ be a vector whose only nonzero entry is its $j$th entry, equal to $z_j$. Then $C = \vec{w}\vec{z}^{\,T}$ is an $N \times N$ matrix whose only nonzero entry is $C_{ij} = w_i z_j$ (verify). Similarly, if $\vec{w} = [w_1, 0, 0, \ldots, w_N]^T$ and $\vec{z} = [z_1, 0, 0, \ldots, z_N]^T$, then
$$C = \vec{w}\vec{z}^{\,T} = \begin{pmatrix} w_1 z_1 & 0 & \cdots & 0 & w_1 z_N \\ 0 & 0 & \cdots & 0 & 0 \\ & & \ddots & & \\ 0 & 0 & \cdots & 0 & 0 \\ w_N z_1 & 0 & \cdots & 0 & w_N z_N \end{pmatrix}. \qquad (8.51)$$
Therefore, the circulant matrix $A$ in Eq. (8.49) can be represented as:
$$A = A_{\rm tridiag} + \vec{w}\vec{z}^{\,T}, \qquad (8.52)$$
where $A_{\rm tridiag}$ is some tridiagonal matrix and $\vec{w}$ and $\vec{z}$ are some properly chosen vectors. Note that while the choice of $\vec{w}$ and $\vec{z}$ allows much freedom, the form of $A_{\rm tridiag}$ is unique for a given circulant matrix $A$ once $\vec{w}$ and $\vec{z}$ have been chosen. In a homework problem, you will be asked to make a choice of $\vec{w}$ and $\vec{z}$ and consequently come up with the expression for $A_{\rm tridiag}$, given that $A$ is as in Eq. (8.49).
Linear systems (8.5) with the coefficient matrix $A$ given by (8.52) can be solved time-efficiently, i.e., in $O(M)$ operations, by the so-called Sherman–Morrison algorithm. This algorithm can be found in most textbooks on Numerical Analysis, or online.
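As a hedged sketch (not part of the notes) of how this is typically organized in code: given a routine that solves tridiagonal systems with $A_{\rm tridiag}$ (for example, the Thomas routine of Sec. 8.2), the Sherman–Morrison identity reduces the solve with $A_{\rm tridiag} + \vec{w}\vec{z}^{\,T}$ to two tridiagonal solves.

```python
import numpy as np

def sherman_morrison_solve(tri_solve, w, z, r):
    """Solve (A_tridiag + w z^T) y = r, where tri_solve(rhs) returns A_tridiag^{-1} rhs.
    Uses y = p - [(z.p)/(1 + z.q)] q with p = A_tridiag^{-1} r and q = A_tridiag^{-1} w."""
    p = tri_solve(r)
    q = tri_solve(w)
    return p - (z @ p) / (1.0 + z @ q) * q
```

Since each tridiagonal solve costs $O(M)$ operations, the total cost is $O(M)$, as stated above. Here tri_solve could be, for instance, `lambda rhs: thomas_solve(sub, diag, sup, rhs)` with sub, diag, sup holding the three diagonals of $A_{\rm tridiag}$.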
8.6 Nonlinear BVPs
The analysis presented in this section is carried out with three restrictions. First, we consider only BVPs with Dirichlet boundary conditions. Generalizations to boundary conditions of the form (8.31) can be done straightforwardly along the lines of the previous section.
Second, we only consider the BVP
$$y'' = f(x, y), \qquad y(a) = \alpha, \quad y(b) = \beta, \qquad (8.28)$$
which does not involve $y'$. Although the methods described below can also be adapted to the more general BVP
$$y'' = f(x, y, y'), \qquad y(a) = \alpha, \quad y(b) = \beta, \qquad (7.1)$$
the analysis of convergence of those methods to a solution of the BVP (7.1) is significantly more complex than the one presented here. Thus, the methods that we will develop in this section can be applied to (7.1) without a guarantee that one will obtain a solution of that BVP.
Third, we will focus our attention on the second-order accurate discretization of the BVPs. The analysis for the fourth-order accurate discretization scheme is essentially the same and produces similar results.
Consider the BVP (8.28). The counterpart of system (8.4) for this BVP has the same form as (8.5), i.e.:
$$A\vec{Y} = \vec{r}, \qquad (8.5)$$
where now
$$A = \begin{pmatrix} -2 & 1 & 0 & \cdots & 0 & 0 \\ 1 & -2 & 1 & 0 & \cdots & 0 \\ 0 & 1 & -2 & 1 & \cdots & 0 \\ & & \ddots & \ddots & \ddots & \\ 0 & \cdots & 0 & 1 & -2 & 1 \\ 0 & 0 & \cdots & 0 & 1 & -2 \end{pmatrix}; \qquad \vec{r} = \begin{pmatrix} h^2 f(x_1, Y_1) \\ h^2 f(x_2, Y_2) \\ \vdots \\ h^2 f(x_{N-2}, Y_{N-2}) \\ h^2 f(x_{N-1}, Y_{N-1}) \end{pmatrix}. \qquad (8.53)$$
Equations (8.5) and (8.53) constitute a system of nonlinear algebraic equations. Below we consider three methods of iterative solution of such a system. Of these methods, Method 1 is an analogue of the fixed-point iteration method for solving a single nonlinear equation$^{19}$
$$\mathcal{A}\, y = r(y), \qquad (8.54)$$
Method 2 contains its modifications, and Method 3 is the analog of the Newton–Raphson method for (8.54).
Method 1 (Picard iterations)
The fixed-point iteration scheme, also called the Picard iteration scheme, for the single nonlinear equation (8.54) is simply
$$y^{(k+1)} = \frac{1}{\mathcal{A}}\, r\!\left(y^{(k)}\right), \qquad (8.55)$$
where $y^{(k)}$ denotes the $k$th iteration of the solution of (8.54). To start the iteration scheme (8.55), one, of course, needs an initial guess $y^{(0)}$.
$^{19}$ The constant $\mathcal{A}$ in (8.54) could have been absorbed into the function $r(y)$, but we kept it so as to mimic the notation of (8.5).
Now, let $\vec{Y}^{(k)}$ denote the $k$th iteration of the solution of the matrix nonlinear equation (8.5), (8.53). Then the corresponding Picard iteration scheme is
$$A\vec{Y}^{(k+1)} = \vec{r}\!\left(x, \vec{Y}^{(k)}\right), \qquad k = 0, 1, 2, \ldots. \qquad (8.56)$$
Importantly, unlike the original nonlinear Eqs. (8.5) and (8.53), Eq. (8.56) is a linear system. Indeed, at the $(k+1)$st iteration, the unknown vector $\vec{Y}^{(k+1)}$ enters linearly, while the nonlinear r.h.s. contains $\vec{Y}^{(k)}$, which has been determined at the previous iteration.
Let us investigate the rate of convergence of the iteration scheme (8.56). For its one-variable counterpart, Eq. (8.54), the convergence condition of the fixed-point iterations (8.55) is well known to be
$$\left|\mathcal{A}^{-1}\, \frac{dr(y)}{dy}\right| < 1 \qquad (8.57)$$
for all $y$ sufficiently close to the fixed point. The condition we will establish for (8.56) will be a direct analog of (8.57).
Before we proceed, let us remark that the analysis of convergence of Picard's iteration scheme (8.56) can proceed in two slightly different ways. In one way, one can transform (8.56) into an iteration scheme for $\vec{\epsilon}^{\,(k)} = \vec{Y}^{(k)} - \vec{Y}^{(k-1)}$. Then the condition that the iterations converge is expressed by
$$\left\|\vec{\epsilon}^{\,(k)}\right\| < \left\|\vec{\epsilon}^{\,(k-1)}\right\| \qquad \text{for all sufficiently large } k, \qquad (8.58)$$
where, as in Lecture 4 (see (4.5)), $\|\ldots\|$ denotes the $\infty$-norm:
$$\|\vec{\epsilon}\|_\infty = \max_n |\epsilon_n|\,.$$
Indeed, this condition implies that $\lim_{k\to\infty} \left\|\vec{\epsilon}^{\,(k)}\right\| = 0$. This, in its turn, implies that the sequence of iterations $\left\{\vec{Y}^{(k)}\right\}$ tends to a limit, which then, according to (8.56), must be a solution of (8.5).
We will, however, proceed in a slightly different way. Namely, we will assume that our starting guess, $\vec{Y}^{(0)}$, is sufficiently close to the exact solution, $\vec{Y}$, of (8.5). Then, as long as the iterations converge, $\vec{Y}^{(k)}$ will stay close to $\vec{Y}$ for all $k$, and one can write
$$\vec{Y}^{(k)} = \vec{Y} + \vec{\epsilon}^{\,(k)}, \qquad \text{where } \ \left\|\vec{\epsilon}^{\,(k)}\right\| \ll 1\,. \qquad (8.59)$$
The condition that iterations (8.56) converge has exactly the same form, (8.58), as before. However, its interpretation is slightly different (although equivalent): now the fact that $\lim_{k\to\infty} \left\|\vec{\epsilon}^{\,(k)}\right\| = 0$ implies that $\lim_{k\to\infty} \vec{Y}^{(k)} = \vec{Y}$, i.e. that the iterative solutions converge to the exact solution of (8.5).
Both methods of convergence analysis described above can be shown to yield the same conclusions. We chose to follow the second method, based on (8.59), because it can be more naturally related to the linearization of the nonlinear equation at hand (see below). Linearization results in replacing the analysis of the original nonlinear equation (8.5) by the analysis of a system of linear equations, which can be carried out using well-developed methods of Linear Algebra.
Thus, we begin the convergence analysis of the iteration scheme (8.56) by substituting there expression (8.59) and linearizing the right-hand side using the first two terms of the Taylor expansion near $\vec{Y}$:
$$A\vec{Y} + A\vec{\epsilon}^{\,(k+1)} = \vec{r}(x, \vec{Y}) + \frac{\partial \vec{r}}{\partial \vec{Y}}\,\vec{\epsilon}^{\,(k)} + O\!\left(\left\|\vec{\epsilon}^{\,(k)}\right\|^2\right). \qquad (8.60)$$
The first terms on both sides of the above equation cancel out by virtue of the exact equation (8.5). Then, upon discarding quadratically small terms and using the definition of $\vec{r}$ from (8.53), one obtains a linear system
$$A\vec{\epsilon}^{\,(k+1)} = \frac{\partial \vec{r}}{\partial \vec{Y}}\,\vec{\epsilon}^{\,(k)}, \qquad \text{where } \ \frac{\partial \vec{r}}{\partial \vec{Y}} = h^2\, \mathrm{diag}\!\left(\frac{\partial f(x_1, Y_1)}{\partial Y_1}, \ \ldots, \ \frac{\partial f(x_{N-1}, Y_{N-1})}{\partial Y_{N-1}}\right). \qquad (8.61)$$
We will use this equation to relate the norms of $\vec{\epsilon}^{\,(k+1)}$ and $\vec{\epsilon}^{\,(k)}$ and thereby establish a sufficient condition for convergence of iterations (8.56).
Let us assume that
$$\max_{1 \le n \le N-1} \left|\frac{\partial f(x_n, Y_n)}{\partial Y_n}\right| = L\,. \qquad (8.62)$$
Multiplying both sides of (8.61) by $A^{-1}$ and taking their norm, one obtains:
$$\left\|\vec{\epsilon}^{\,(k+1)}\right\| \le \|A^{-1}\|\; h^2 L\, \left\|\vec{\epsilon}^{\,(k)}\right\|, \qquad (8.63)$$
where we have also used a known fact from Linear Algebra, stating that for any matrix $A$ and vector $\vec{z}$,
$$\|A\vec{z}\| \le \|A\|\, \|\vec{z}\|$$
(actually, the latter inequality follows directly from the definition of a matrix norm). For completeness, let us mention that
$$\|A\|_\infty = \max_{1 \le i \le M} \sum_{j=1}^{M} |a_{ij}|\,. \qquad (8.64)$$
Inequality (8.63) shows that
$$\left\|\vec{\epsilon}^{\,(k)}\right\| \le \left(h^2 L\, \|A^{-1}\|\right)^k \left\|\vec{\epsilon}^{\,(0)}\right\|, \qquad (8.65)$$
which implies that Picard iterations converge when
$$h^2 L\, \|A^{-1}\| < 1\,. \qquad (8.66)$$
As promised earlier, this condition is analogous to (8.57).
It now remains to find $\|A^{-1}\|$. Since the matrix $A$ shown in Eq. (8.53) arises in a great many applications, the explicit form of its inverse has been calculated for any size $N = (b-a)/h$. The derivation of $A^{-1}$ can be found on photocopied pages posted on the course website; the corresponding result for $\|A^{-1}\|$, obtained with the use of (8.64), is:
$$\|A^{-1}\| = \frac{(b-a)^2}{8 h^2}\,. \qquad (8.67)$$
Substituting (8.67) into (8.66), we finally obtain that for Picard iterations to converge, it is sufficient (but not necessary) that
$$\frac{(b-a)^2}{8}\, L < 1\,. \qquad (8.68)$$
Thus, whether the Picard iterations converge to the discretized solution of the BVP (8.28) depends not only on the function $f(x, y)$ but also on the length of the interval $[a, b]$.
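In code, the Picard scheme (8.56) amounts to repeatedly solving the tridiagonal system (8.5), (8.53) with the right-hand side evaluated at the previous iterate. The sketch below is an illustration, not part of the notes: f is assumed to be a vectorized Python function f(x, y), and the thomas_solve routine of Sec. 8.2 is assumed to be available; by (8.68), convergence should only be expected when $(b-a)^2 \max|\partial f/\partial y| / 8 < 1$.

```python
import numpy as np

def picard_bvp(f, a, b, alpha, beta, N, tol=1e-10, max_iter=200):
    """Picard iterations (8.56) for y'' = f(x, y), y(a) = alpha, y(b) = beta (a sketch)."""
    h = (b - a) / N
    x = a + h * np.arange(N + 1)
    Y = alpha + (beta - alpha) * (x - a) / (b - a)     # initial guess: a straight line

    M = N - 1                                          # matrix A of (8.53)
    sub = np.ones(M); sub[0] = 0.0
    diag = -2.0 * np.ones(M)
    sup = np.ones(M); sup[-1] = 0.0

    for _ in range(max_iter):
        rhs = h**2 * f(x[1:N], Y[1:N])
        rhs[0] -= alpha                                # known boundary values moved to r.h.s.
        rhs[-1] -= beta
        Y_new = Y.copy()
        Y_new[1:N] = thomas_solve(sub, diag, sup, rhs)
        if np.max(np.abs(Y_new - Y)) < tol:
            return x, Y_new
        Y = Y_new
    return x, Y                                        # may not have converged; cf. (8.68)
```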
Method 2 (modified Picard iterations)
The idea of the modified Picard iterations can be explained using the example of the single equation (8.54), where we will set $\mathcal{A} = 1$ for convenience and without loss of generality. Suppose the simple fixed-point iteration scheme (8.55) does not converge because at the fixed point $\bar{y}$, $dr(\bar{y})/dy \approx \lambda > 1$,$^{20}$ so that the convergence condition (8.57) is violated. Then, instead of iterating (8.54), let us iterate
$$y - \lambda y = r(y) - \lambda y \ \ \Rightarrow\ \ (1 - \lambda)\, y^{(k+1)} = r\!\left(y^{(k)}\right) - \lambda y^{(k)} \ \ \Rightarrow\ \ y^{(k+1)} = \frac{r\!\left(y^{(k)}\right) - \lambda y^{(k)}}{1 - \lambda}\,. \qquad (8.69)$$
Note that the exact equation which we start with in (8.69) is equivalent to Eq. (8.54) (with $\mathcal{A} = 1$), but the iteration equation in (8.69) is different from (8.55). If our guess $\lambda$ at the true value of the derivative $dr(\bar{y})/dy$ is sufficiently close, then the derivative of the r.h.s. of (8.69) is less than 1, and hence, by (8.57), the iteration scheme (8.69) converges, in contrast to the scheme (8.55) for (8.54), which diverges. In other words, by subtracting from both sides of the equation a linear term whose slope closely matches the slope of the nonlinear term at the solution $\bar{y}$, one drastically reduces the magnitude of the slope of the right-hand side of the iterative scheme, thereby making it converge.
$^{20}$ Here $\approx$ is used instead of $=$ because $dr(\bar{y})/dy$ is usually not known exactly.
Let us return to the iterative solution of Eq. (8.28), where now, instead of iterating, as in Picard's method, the equation
$$\left(y^{(k+1)}\right)'' = f\!\left(x, y^{(k)}\right), \qquad (8.70)$$
we will iterate the equation
$$\left(y^{(k+1)}\right)'' - c\, y^{(k+1)} = f\!\left(x, y^{(k)}\right) - c\, y^{(k)} \qquad (8.71)$$
with some constant $c$. The corresponding linearized equation in vector form is easily obtained from (8.61):
$$\left(A - h^2 c\, I\right) \vec{\epsilon}^{\,(k+1)} = h^2\, \mathrm{diag}\!\left(\frac{\partial f(x_1, Y_1)}{\partial Y_1} - c, \ \ldots, \ \frac{\partial f(x_{N-1}, Y_{N-1})}{\partial Y_{N-1}} - c\right) \vec{\epsilon}^{\,(k)}, \qquad (8.72)$$
where $I$ is the identity matrix of the same size as $A$. We will now address the question of how one should choose the constant $c$ so as to ensure convergence of (8.72) and hence of the modified scheme (8.71).
The main difference between the multi-component equation (8.72) and the single-component equation (8.69) is that no single value of $c$ can simultaneously match all of the values $\partial f(x_n, Y_n)/\partial Y_n$, which, in general, are distributed in some interval
$$L_- \le \frac{\partial f(x_n, Y_n)}{\partial Y_n} \le L_+, \qquad n = 1, \ldots, N-1\,. \qquad (8.73)$$
It may be intuitive to suppose that the optimal choice for $c$ may be at the midpoint of that interval, i.e.,
$$c_{\rm opt} = L_{\rm av} = \frac{1}{2}\left(L_- + L_+\right). \qquad (8.74)$$
Below we will show that this is indeed the case.
Specifically, let us only consider the case where $\partial f/\partial y > 0$, when a unique solution of BVP (8.28) is guaranteed to exist by Theorem 6.1 of Lecture 6. Then in (8.73), both
$$L_\pm > 0, \qquad (8.75)$$
and hence $L_{\rm av} > 0$. Next, by following the steps of the derivation of (8.63), one obtains from (8.72):
$$\left\|\vec{\epsilon}^{\,(k+1)}\right\| \le \left(\left\|\left(A - h^2 c\, I\right)^{-1}\right\|\ h^2 \max_{L_- \le \zeta \le L_+} |\zeta - c|\right) \left\|\vec{\epsilon}^{\,(k)}\right\|. \qquad (8.76)$$
Here $\zeta$ stands for any of the $\partial f(x_n, Y_n)/\partial Y_n$, which satisfy (8.73). Note that the maximum above is taken with respect to $\zeta$ while $c$ is assumed to be fixed. Then:
$$L_- - c \le \zeta - c \le L_+ - c \quad\Rightarrow\quad \max_{L_- \le \zeta \le L_+} |\zeta - c| = \max\{|c - L_-|,\ |L_+ - c|\}\,. \qquad (8.77)$$
Our immediate goal is to determine for what $c$ the coefficient multiplying $\left\|\vec{\epsilon}^{\,(k)}\right\|$ in (8.76) is the smallest: this will yield the fastest convergence of the modified Picard iterations (8.71). The entire analysis of this question is a little tedious, since one would have to consider separately the cases where $c \in [L_-, L_+]$ and $c \notin [L_-, L_+]$. Since we have announced that the answer, (8.74), corresponds to the former case, we will present the details only for it. Details for the case $c \notin [L_-, L_+]$ are similar, but will not yield an optimal value of $c$, and so we will omit them.
Thus, we are looking to determine
$$K \equiv \min_{L_- \le c \le L_+} \left\{ h^2 \left\|\left(A - h^2 c\, I\right)^{-1}\right\| \max\{|c - L_-|,\ |L_+ - c|\} \right\}, \qquad (8.78)$$
for which we will first need to find the norm $\left\|\left(A - h^2 c\, I\right)^{-1}\right\|$. Since $A$ is a symmetric matrix (see (8.53)), so are the matrices $(A - h^2 c\, I)$ and $(A - h^2 c\, I)^{-1}$. In graduate-level courses on Linear Algebra it is shown that the norm of a real symmetric matrix $B$ equals the modulus of its largest eigenvalue. Since (an eigenvalue of $B^{-1}$) $= 1/$(an eigenvalue of $B$), then
$$\left\|\left(A - h^2 c\, I\right)^{-1}\right\| = \frac{1}{\min \left|\text{eigenvalue of } (A - h^2 c\, I)\right|}\,. \qquad (8.79a)$$
To find the lower bound for the latter eigenvalue, we use the result of Problem 2 of HW 8 (which is based on the Gerschgorin Circles Theorem of Section 8.1). Thus,
$$\left\|\left(A - h^2 c\, I\right)^{-1}\right\| = \frac{1}{h^2 c} \qquad (8.79b)$$
(recall that $c > 0$ because we assumed in (8.75) that $L_\pm > 0$ and also since $c \in [L_-, L_+]$).
Combining (8.78) and (8.79b), we will now determine
$$K = \min_{L_- \le c \le L_+} \left\{ \max\!\left[\frac{c - L_-}{c},\ \frac{L_+ - c}{c}\right] \right\} = \min_{L_- \le c \le L_+} \left\{ \max\!\left[1 - \frac{L_-}{c},\ \frac{L_+}{c} - 1\right] \right\}. \qquad (8.80)$$
[Figure: the functions $1 - L_-/c$ (increasing) and $L_+/c - 1$ (decreasing) plotted versus $c$ for $L_- \le c \le L_+$; they intersect at $c = c_{\rm opt}$.]
The corresponding optimal $c$ is found as shown in the figure:
$$1 - \frac{L_-}{c_{\rm opt}} = \frac{L_+}{c_{\rm opt}} - 1 \quad\Rightarrow\quad c_{\rm opt} = \frac{1}{2}\left(L_- + L_+\right),$$
which is (8.74).
Substituting (8.74) into (8.80) and then using the result in (8.76), one finally obtains:
$$\left\|\vec{\epsilon}^{\,(k+1)}\right\| \le \left(\frac{L_+ - L_-}{L_+ + L_-}\right) \left\|\vec{\epsilon}^{\,(k)}\right\|. \qquad (8.81)$$
Since, according to our assumption (8.75), $L_\pm > 0$, the factor $(L_+ - L_-)/(L_+ + L_-) < 1$. Thus, the modified Picard scheme (8.71), (8.74) converges.
The issues of using the modified Picard iterations on BVPs where conditions (8.75) are not met, or using scheme (8.71) with a non-optimal constant $c$, are considered in homework problems. Let us only emphasize that the modified Picard iterations can sometimes converge even when (8.75) and/or (8.74) do not hold.
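In code, the modified scheme (8.71) with the optimal constant (8.74) differs from the plain Picard loop sketched at the end of Method 1 only in the matrix and the right-hand side of the linear solve. The fragment below is an illustration, not part of the notes; c would be chosen as $c_{\rm opt} = (L_- + L_+)/2$, with $L_\pm$ estimated by the user, and thomas_solve is again the routine of Sec. 8.2.

```python
import numpy as np

def modified_picard_step(f, x, Y, alpha, beta, c):
    """One iteration of (8.71): solve (A - h^2 c I) Y_new = h^2 [f(x, Y) - c Y] on the
    interior points, with the boundary values alpha, beta moved to the r.h.s. (a sketch)."""
    N = len(x) - 1
    h = x[1] - x[0]
    M = N - 1
    sub = np.ones(M); sub[0] = 0.0
    diag = (-2.0 - h**2 * c) * np.ones(M)
    sup = np.ones(M); sup[-1] = 0.0
    rhs = h**2 * (f(x[1:N], Y[1:N]) - c * Y[1:N])
    rhs[0] -= alpha
    rhs[-1] -= beta
    Y_new = Y.copy()
    Y_new[1:N] = thomas_solve(sub, diag, sup, rhs)   # Thomas routine from Sec. 8.2
    return Y_new
```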
Method 3 (Newton–Raphson method)
Although this method can be described for a general BVP of the form (7.1), we will only do so for a particular BVP, encountered in one of the homework problems for Lecture 7. Namely, consider the BVP
$$y'' = \frac{y^2}{2 + x}, \qquad y(0) = 1, \quad y(2) = 1\,. \qquad (8.82)$$
Let us begin by writing down the system of second-order accurate discrete equations for this BVP:
$$Y_{n+1} - 2Y_n + Y_{n-1} = \frac{h^2\, Y_n^2}{2 + x_n}, \qquad Y_0 = 1, \quad Y_N = 1\,. \qquad (8.83)$$
Let $Y_n^{(0)}$ be the initial guess for the solution of (8.83) at $x_n$. Similarly to (8.59), we relate it to the exact solution $Y_n$ of (8.83):
$$Y_n^{(0)} = Y_n + \epsilon_n^{(0)}, \qquad |\epsilon_n^{(0)}| \ll 1\,. \qquad (8.84)$$
Substituting (8.84) into (8.83) and using the fact that the exact solution $Y_n$ satisfies (8.83), we obtain an equivalent system:
$$\left[\epsilon_{n+1}^{(0)} - 2\epsilon_n^{(0)} + \epsilon_{n-1}^{(0)} - \frac{2 h^2\, Y_n^{(0)}\, \epsilon_n^{(0)}}{2 + x_n}\right] + \frac{h^2 \left(\epsilon_n^{(0)}\right)^2}{2 + x_n} = \left[Y_{n+1}^{(0)} - 2Y_n^{(0)} + Y_{n-1}^{(0)}\right] - \frac{h^2 \left(Y_n^{(0)}\right)^2}{2 + x_n}\,. \qquad (8.85)$$
If we now assume that the $\epsilon_n^{(0)}$ are sufficiently small for all $n$, then we neglect the last term on the l.h.s. of (8.85) and obtain a linear system for the $\epsilon_n^{(0)}$:
$$\epsilon_{n+1}^{(0)} - \left(2 + \frac{2 h^2\, Y_n^{(0)}}{2 + x_n}\right) \epsilon_n^{(0)} + \epsilon_{n-1}^{(0)} = \left[Y_{n+1}^{(0)} - 2Y_n^{(0)} + Y_{n-1}^{(0)}\right] - \frac{h^2 \left(Y_n^{(0)}\right)^2}{2 + x_n}\,. \qquad (8.86)$$
If our initial guess satisfies the boundary conditions of the BVP (8.82), then we also have
$$\epsilon_0^{(0)} = 0, \qquad \epsilon_N^{(0)} = 0\,. \qquad (8.87)$$
System (8.86), (8.87) can be solved time-efficiently by the Thomas algorithm. Thus we obtain $\epsilon_n^{(0)}$. According to (8.84), this gives the next iteration for our solution $Y_n$:
$$Y_n \approx Y_n^{(1)} = Y_n^{(0)} - \epsilon_n^{(0)}\,. \qquad (8.88)$$
We then substitute
$$Y_n^{(1)} = Y_n + \epsilon_n^{(1)} \qquad (8.89)$$
into (8.83), obtain a system analogous to (8.85), and then solve it for $\epsilon_n^{(1)}$ in exactly the same way we have solved (8.85) for $\epsilon_n^{(0)}$. Repeating these steps, we stop when the process converges, i.e. when $\left\|\vec{Y}^{(k+1)} - \vec{Y}^{(k)}\right\|$ becomes less than a given tolerance.
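A sketch of the complete Newton–Raphson loop for system (8.83) is given below; it is an illustration, not part of the notes, and it reuses the thomas_solve routine of Sec. 8.2. Each pass assembles the tridiagonal system (8.86)–(8.87) for the current correction and then applies the update (8.88).

```python
import numpy as np

def newton_bvp_882(N, tol=1e-12, max_iter=50):
    """Newton-Raphson iterations for the discretized BVP (8.82)-(8.83) (a sketch)."""
    a, b = 0.0, 2.0
    ya, yb = 1.0, 1.0                         # boundary values of (8.82)
    h = (b - a) / N
    x = a + h * np.arange(N + 1)
    Y = ya + (yb - ya) * (x - a) / (b - a)    # initial guess satisfying the BCs

    for _ in range(max_iter):
        # right-hand side of (8.86): the residual of (8.83) at the interior points
        res = (Y[2:] - 2.0 * Y[1:-1] + Y[:-2]) - h**2 * Y[1:-1]**2 / (2.0 + x[1:-1])

        # tridiagonal matrix of (8.86), with the boundary conditions (8.87)
        M = N - 1
        sub = np.ones(M); sub[0] = 0.0
        diag = -(2.0 + 2.0 * h**2 * Y[1:-1] / (2.0 + x[1:-1]))
        sup = np.ones(M); sup[-1] = 0.0

        eps = thomas_solve(sub, diag, sup, res)   # corrections at the interior points
        Y[1:-1] -= eps                            # update (8.88)
        if np.max(np.abs(eps)) < tol:
            break
    return x, Y
```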
Using iterative methods in more complicated situations
First, let us note that iterative methods, which we have described above for ODEs, are equally applicable to partial differential equations (PDEs). This distinguishes iterative methods from the shooting method considered in Lecture 7: as we stressed at the end of that Lecture, the shooting method cannot be extended to solving BVPs for PDEs. Thus, iterative methods remain the only group of methods for solving nonlinear BVPs for PDEs. The ideas of these methods are the same as we have described above for ODEs.
Second, the fixed-point iteration methods (i.e., counterparts of the Picard and modified Picard methods described above) are not used (or are rarely used) in commercial software. The main reason is that they are considerably slower than the Newton–Raphson method and its variations, and also than the so-called Krylov subspace methods, studied in advanced courses on Numerical Linear Algebra. We will only mention that the most famous of those Krylov subspace methods is the Conjugate Gradient method (CGM) for solving symmetric positive (or negative) definite linear systems. An accessible introduction to this method can be found in many textbooks and also in an online paper by J.R. Shewchuk, "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain."$^{21}$ An extension of the CGM to nonlinear BVPs whose linearization yields symmetric matrices is known as the Newton–CGM or, more generally, as a class of Newton–Krylov methods.
Another reason why Krylov subspace methods are used much more widely than fixed-point iterative methods is that convergence conditions of the latter methods are typically restricted by a condition analogous to (8.75). Some of the Krylov subspace methods either do not have such restrictions, or are less sensitive to them. Also, the Newton–Raphson method does not have any restrictions similar to (8.75). Yet, this method also has its own issues, and a number of books have been written about the application of the Newton–Raphson method to systems of nonlinear equations.
$^{21}$ Some pain, however, is to be expected.

8.7 Questions for self-assessment
1. Verify that (8.4) follows from (8.1)–(8.3).
2. Explain why the form of the first and last terms of $\vec{r}$ in (8.7) is different from the form of the other terms.
3. What does the Gerschgorin Circles Theorem allow one to do?
4. What is the difference between diagonally dominant and strictly diagonally dominant matrices?
5. Make sure you can follow the derivations in the Example about the Gerschgorin Circles Theorem.
6. Make sure you can follow the proof of Theorem 8.2.
7. What can you say about a solution of a linear system (8.5) where $A$ is SDD?
8. Under what condition(s) on $h$ is the discretized BVP (8.4) guaranteed to have a unique solution?
9. Verify (8.20).
10. Verify (8.21) through (8.23).
11. Explain why the local truncation error of the discretized BVP is multiplied by a factor $O\!\left(\tfrac{1}{h^2}\right)$ to give the error of the solution.
12. Describe the idea of Method 1 for handling non-Dirichlet boundary conditions.
13. Derive (8.36).
14. Describe the idea of Method 2 for handling non-Dirichlet boundary conditions.
15. Convince yourself (and be prepared to convince the instructor) that if condition (8.42) and the two conditions stated one line below it hold, then the coefficient matrix in Method 1 based on Eq. (8.36) is SDD.
Hint: Write the condition that you want to prove and then assume that (8.42) and the other two conditions hold.
16. Suppose that you need to solve a BVP with a mixed-type boundary condition and such that (8.42) does not hold. Will you attempt to solve such a BVP? What should your expectations be?
17. Suppose that the BVP has a non-Dirichlet boundary condition of the form (8.31) at $x = b$ (the right end point of the interval) rather than at $x = a$. What will the analog of condition (8.42) be in this case?
Hint 1: Remember that this is a QSA and not an exercise requiring calculations.
Hint 2: To better visualize the situation, suppose $[a, b] = [-1, 1]$. What mathematical operation will take the left end point of this interval into the right end point? How will this operation transform the terms on the l.h.s. of (8.31)?
18. Convince yourself (and be prepared to convince the instructor) that the statements in Remark 2 in Sec. 8.4 about the coefficient matrices for the second-order methods being tridiagonal are correct.
19. Describe the idea of the Picard iterations.
20. Derive (8.63) as explained in the text.
21. What is the condition for the Picard iterations to converge? Is this a necessary or sufficient condition? In other words, if that condition does not hold, should one still attempt to solve the BVP?
22. Describe the idea behind the modified Picard iterations.
23. How is the finite-difference implementation of (8.71) different from that of (8.56)?
24. Make sure you can obtain (8.72).
25. Make sure you can obtain (8.76) and (8.77).
26. Explain how the result of Problem 2 of HW 8 leads to (8.79b).
27. Derive (8.74) from the explanation found after (8.80).
28. Obtain (8.81) as explained in the text.
29. Are (8.75) and (8.74) necessary or sufficient conditions for convergence of the modified Picard iterations (8.71)?
30. Make sure you can see where (8.85) comes from.
31. Write down a linear system satisfied by $\epsilon_n^{(1)}$, defined in (8.89).
32. Describe the idea behind the Newton–Raphson method.
33. Can one use iterative methods when solving nonlinear BVPs for PDEs?
9 Concepts behind finite-element method
The power of the finite-element method (FEM) becomes evident when handling problems in two or three spatial dimensions. In this lecture (and this course), we will only consider applications of the FEM to problems in one spatial dimension (i.e., BVPs on an interval), so as to demonstrate some of the basic concepts behind this method.
9.1 General idea of FEM
Suppose we are looking for a solution of the BVP
$$-y'' + Q(x)\,y = R(x), \qquad y(0) = y(1) = 0\,. \qquad (9.1)$$
This BVP has zero boundary conditions at both end points of the interval; however, this does not restrict the generality of the subsequent exposition. Namely, a way to treat nonzero boundary conditions is described in a homework problem.
The idea of the FEM is as follows. Select a set of linearly independent functions $\{\phi_j(x)\}_{j=1}^{M}$ such that each $\phi_j$ satisfies the boundary conditions of the BVP, i.e.
$$\phi_j(0) = \phi_j(1) = 0, \qquad j = 1, \ldots, M\,. \qquad (9.2)$$
Then, look for an approximate solution of the BVP in the form
$$Y(x) = \sum_{j=1}^{M} c_j\, \phi_j(x), \qquad (9.3)$$
where the coefficients $c_j$ are to be determined. Note that since the $\phi_j$'s satisfy the boundary conditions of the BVP, so does the solution $Y(x)$. The problem has now become the following: (i) decide which basis functions $\phi_j(x)$ to use, and (ii) determine the coefficients $c_j$ so as to make the error between $Y(x)$ and the exact solution $y(x)$ as small as possible.
The term "basis" describing the set $\{\phi_j\}$ is used here to indicate that the functions $\phi_j$ must be linearly independent and, moreover, their linear superpositions (i.e., the r.h.s. of (9.3)) should be able to approximate functions (i.e., solutions of the BVP) from a sufficiently large class sufficiently closely. A quantitative characterization of the two "sufficiently"s in the previous sentence is a serious mathematical task, which we will not attempt to undertake. Instead, we will proceed at the intuitive level.
One possible set of basis functions which satisfy boundary conditions (9.2) is
$$\phi_j(x) = \sin(j\pi x), \qquad j = 1, \ldots, M\,. \qquad (9.4)$$
Another, more convenient, set will be introduced later on as we proceed. In general, there may be many choices for $\{\phi_j(x)\}$; the decision as to which one to use is usually made on a problem-by-problem basis.
The problem of determining the coefficients $c_j$ can be handled in three different ways, which we will now describe.
9.2 Collocation method
Let us substitute expansion (9.3) into BVP (9.1):
$$-\sum_{j=1}^{M} c_j\, \phi_j''(x) + Q(x) \sum_{j=1}^{M} c_j\, \phi_j(x) = R(x)\,. \qquad (9.5)$$
Recall that due to the choice (9.2), the boundary conditions are satisfied automatically.
Now, ideally, we want Eq. (9.5) to hold identically, i.e. for all $x \in [0, 1]$. However, since we only have $M$ free parameters, $\{c_j\}_{j=1}^{M}$, at our disposal, we can only require that Eq. (9.5) be satisfied at $M$ points, called collocation points. That is, if $\{x_k\}_{k=1}^{M}$ is a set of such points on $[0, 1]$, then we require that the following system of (linear) equations hold:
$$\sum_{j=1}^{M} \left[-\phi_j''(x_k) + Q(x_k)\, \phi_j(x_k)\right] c_j = R(x_k), \qquad k = 1, \ldots, M\,. \qquad (9.6)$$
Upon solving this system of $M$ equations for the $M$ unknowns $c_j$ and substituting their values into (9.3), one finds the approximate solution $Y(x)$ of the BVP.
Linear system (9.6) can be written in the standard form
$$A\vec{c} = \vec{r}, \qquad (9.7)$$
where
$$(A)_{kj} = -\phi_j''(x_k) + Q(x_k)\, \phi_j(x_k), \qquad \vec{r} = \left(R(x_1), \ldots, R(x_M)\right)^T. \qquad (9.8)$$
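As an illustration (not part of the notes), the following sketch assembles and solves system (9.6)–(9.8) for the sine basis (9.4), for which $\phi_j''(x) = -(j\pi)^2 \phi_j(x)$; Qf and Rf are assumed to be vectorized Python functions, and the collocation points are taken equally spaced in $(0, 1)$.

```python
import numpy as np

def collocation_sine(Qf, Rf, M):
    """Collocation solution of -y'' + Q(x) y = R(x), y(0) = y(1) = 0, with basis (9.4)."""
    xk = np.arange(1, M + 1) / (M + 1)            # collocation points x_1 ... x_M
    j = np.arange(1, M + 1)
    phi = np.sin(np.pi * np.outer(xk, j))          # phi_j(x_k)
    phi_xx = -(np.pi * j)**2 * phi                 # phi_j''(x_k)
    A = -phi_xx + Qf(xk)[:, None] * phi            # matrix of Eq. (9.8); full, in general
    c = np.linalg.solve(A, Rf(xk))                 # coefficients c_j
    return lambda xx: np.sin(np.pi * np.outer(np.atleast_1d(xx), j)) @ c   # Y(x) of (9.3)
```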
The coefficient matrix $A$ is, in general, full (i.e. not tri- or pentadiagonal), whereas the matrix $A$ that arose in the finite-difference approach of Lecture 8 was tridiagonal. Thus it appears that the collocation method leads to a system that is more difficult to solve than the system produced by the finite-difference approach. However, one can make the matrix $A$ of the collocation method also be tridiagonal if one chooses the basis functions in a special form. Namely, for $j = 2, \ldots, M-1$, take the $\phi_j$'s to be the following cubic B-splines:
$$B_j(x) = \begin{cases} \dfrac{(\Delta x_{j-2})^3}{4h^3}, & x_{j-2} \le x \le x_{j-1} \\[2mm] \dfrac{1}{4} + \dfrac{3\,\Delta x_{j-1}}{4h}\left[1 + \dfrac{\Delta x_{j-1}}{h} - \left(\dfrac{\Delta x_{j-1}}{h}\right)^2\right], & x_{j-1} \le x \le x_j \\[2mm] \dfrac{1}{4} - \dfrac{3\,\Delta x_{j+1}}{4h}\left[1 - \dfrac{\Delta x_{j+1}}{h} - \left(\dfrac{\Delta x_{j+1}}{h}\right)^2\right], & x_j \le x \le x_{j+1} \\[2mm] -\dfrac{(\Delta x_{j+2})^3}{4h^3}, & x_{j+1} \le x \le x_{j+2} \\[2mm] 0, & \text{otherwise,} \end{cases} \qquad j = 2, \ldots, M-1, \qquad (9.9)$$
where $\Delta x_j = x - x_j$, and we have assumed for simplicity that all points are equidistant, so that $h = x_{j+1} - x_j$. Recall that $x_0 = 0$ and $x_{M+1} = 1$ for BVP (9.1). The functions $B_j(x)$ of (9.9) have continuous first and second derivatives everywhere on $[0, 1]$; a typical such function is shown in the middle plot of the accompanying figure.
[Figure: three panels showing $B_1(x)$ (left), a typical $B_j(x)$, $j = 2, \ldots, M-1$, centered at $x_j$ (middle), and $B_M(x)$ (right).]
When $j = 1$ or $j = M$, these functions have to be slightly modified, so that, say, $B_1(x)$ satisfies the boundary condition at $x_0$: $B_1(x_0) = 0$, but no condition is placed on its derivative at that point. The plots of $B_1$ and $B_M$ are shown in the left and right plots of the figure; the analytical expression for, say, $B_1$ is:
$$B_1(x) = \begin{cases} \dfrac{1}{2(4 - 3h)}\left[3(6 - 5h)\,\dfrac{\Delta x_0}{h} - 9(1 - h)\left(\dfrac{\Delta x_0}{h}\right)^2 - \left(\dfrac{\Delta x_0}{h}\right)^3\right], & x_0 \le x \le x_1 \\[2mm] \dfrac{1}{4} - \dfrac{3\,\Delta x_2}{4h}\left[1 - \dfrac{\Delta x_2}{h} - \left(\dfrac{\Delta x_2}{h}\right)^2\right], & x_1 \le x \le x_2 \\[2mm] -\dfrac{(\Delta x_3)^3}{4h^3}, & x_2 \le x \le x_3 \\[2mm] 0, & \text{otherwise.} \end{cases} \qquad (9.10)$$
Now, with $\phi_j(x) = B_j(x)$, system (9.6) has a tridiagonal matrix $A$, because for any $x_k$, only $B_k(x_k)$ and $B_{k\pm 1}(x_k)$ are nonzero, i.e. $B_j(x_k) = 0$ for $|k - j| > 1$. Then all one needs in order to write out the explicit form of (9.6) are the values of $B_k(x_k)$ and $B_{k\pm 1}(x_k)$. These are shown in the table below.

                x_{k-1}       x_k        x_{k+1}
  B_k(x)         1/4           1          1/4
  B_k''(x)     3/(2h^2)     -3/h^2      3/(2h^2)

To conclude this subsection, we point out the advantage of the collocation method over the finite-difference method of Lecture 8: the points $x_k$ do not need to be equidistant. This will have no effect on the form of system (9.6); only the coefficients in (9.9) and (9.10) will be slightly modified. One can use this freedom in distributing the collocation points over the interval so as to place more of them in the region(s) where the solution is expected to change rapidly.
9.3 Galerkin method
This method allows one to use basis functions that are simpler than the B-splines considered
in the previous subsection. The solution obtained, of course, will not be as smooth as that
obtained by the collocation method.
As we said above, the approximate solution $Y(x)$ is sought as a linear combination of the basis functions $\phi_j$; see (9.3). We can now draw an analogy with Linear Algebra and call the basis functions $\phi_j$ "vectors" which span (i.e., form) a linear, $M$-dimensional space $S_M$. Then, according to (9.3), $Y(x)$ is a vector in that space.
Let us again substitute (9.3) into BVP (9.1) and write the result as
$$\sum_{j=1}^{M} \left[-\phi_j''(x) + Q(x)\, \phi_j(x)\right] c_j - R(x) = \rho(x)\,. \qquad (9.11)$$
(Recall that in the collocation method, we required that the residual $\rho(x_k) = 0$ at all the collocation points $x_k$.) Now, the residual $\rho(x)$ does not, in general, belong to the linear space $S_M$ (in other words, it is not a linear combination of the $\phi_j$'s). Geometrically, we can represent this situation (for $M = 2$) as in the accompanying figure. Namely, the vector $\rho(x)$ has a component that belongs to $S_M$ and another component that lies outside of $S_M$.
[Figure: the plane $S_2$ spanned by $\phi_1$ and $\phi_2$, with the residual $\rho$ drawn as a vector having one component in $S_2$ and one component perpendicular to it.]
The idea of the Galerkin method is to select the coefficients $c_j$ so as to make the residual $\rho(x)$ orthogonal to all of the basis functions $\phi_j$, $j = 1, \ldots, M$. In that case, the projection of $\rho(x)$ on $S_M$ is zero, and hence the length of $\rho(x)$ is minimized, since
$$\text{length}\ \rho(x) = \sqrt{\left(\text{length}_{\,\parallel\ \text{to}\ S_M}\ \rho(x)\right)^2 + \left(\text{length}_{\,\perp\ \text{to}\ S_M}\ \rho(x)\right)^2}\,.$$
Thus, we need to specify what we mean by "orthogonal" and "length" for functions. Two functions $f(x)$ and $g(x)$ are called orthogonal if
$$\int_0^1 f(x)\, g(x)\, dx = 0\,. \qquad (9.12)$$
Two remarks are in order. First, note that the integral in (9.12) is over $[0, 1]$. This is because the BVP we are considering is defined over that interval. If a BVP is defined over $[a, b]$, the corresponding definition of orthogonality would contain $\int_a^b$ instead of $\int_0^1$. Second, (9.12) is not the only definition of orthogonality of functions, but just one of those which are used frequently. For different applications, different definitions of function orthogonality may prove to be more convenient.
The definition of the length of a function is subordinate to that of orthogonality, namely:
$$\|f(x)\|_2 = \sqrt{\int_0^1 \left(f(x)\right)^2 dx}\,. \qquad (9.13)$$
The subscript 2 of $\|\ldots\|_2$ is used because the l.h.s. of (9.13) is also known as the $L_2$-norm of a function, which is different from the $\infty$-norm that we have considered so far.
Thus, the Galerkin method requires that
$$\int_0^1 \rho(x)\, \phi_k(x)\, dx = 0 \qquad \text{for } k = 1, \ldots, M\,. \qquad (9.14)$$
With the account of (9.11), this gives:
$$\sum_{j=1}^{M} c_j \int_0^1 \left[-\phi_j''(x) + Q(x)\, \phi_j(x)\right] \phi_k(x)\, dx = \int_0^1 R(x)\, \phi_k(x)\, dx\,; \qquad k = 1, \ldots, M\,. \qquad (9.15)$$
If we now define a matrix $A$ to have the coefficients
$$a_{kj} = \int_0^1 \left[-\phi_j''(x) + Q(x)\, \phi_j(x)\right] \phi_k(x)\, dx \qquad (9.16)$$
and the vector $\vec{r}$ to be
$$\vec{r} = \left(\int_0^1 R(x)\, \phi_1(x)\, dx, \ \ldots, \ \int_0^1 R(x)\, \phi_M(x)\, dx\right)^T, \qquad (9.17)$$
then the system of linear equations (9.15) takes on the familiar form (9.7).
So far, there has been no real advantage of the Galerkin method over the collocation method. Such an advantage arises when we use integration by parts to rewrite (9.16) in the form
$$a_{kj} = \int_0^1 \phi_j'(x)\, \phi_k'(x)\, dx + \int_0^1 Q(x)\, \phi_j(x)\, \phi_k(x)\, dx\,. \qquad (9.18)$$
In deriving (9.18), we have used the boundary conditions (9.2). From (9.18), which is equivalent to (9.16), we immediately observe two things which were not evident from (9.16):
- $a_{kj} = a_{jk}$, i.e. the coefficient matrix in the Galerkin method is symmetric.
- To calculate $a_{kj}$, one only requires $\phi_j'$, but not $\phi_j''$, to exist. Moreover, one does not require $\phi_j'$ to be continuous; it suffices that it be integrable.
Then, the following simple choice of $\phi_j(x)$ can be made:
$$\phi_j(x) = \begin{cases} 1 - \dfrac{|x - x_j|}{h}, & x_{j-1} \le x \le x_{j+1} \\[1mm] 0, & \text{otherwise.} \end{cases} \qquad (9.19)$$
These functions $\phi_j(x)$ are called hat functions, or linear B-splines.
[Figure: the hat function $\phi_j(x)$, $j = 1, \ldots, M$, equal to 1 at $x_j$ and to 0 outside $(x_{j-1}, x_{j+1})$.]
With this choice for the $\phi_j$'s, the matrix $A$ is tridiagonal. Indeed, in this case $\phi_j'$ is as shown in the second figure, whence one can calculate that
$$\int_0^1 \phi_j'(x)\, \phi_k'(x)\, dx = \begin{cases} -\dfrac{1}{h}, & k = j \pm 1 \\[1mm] \dfrac{2}{h}, & k = j \\[1mm] 0, & \text{otherwise.} \end{cases} \qquad (9.20)$$
The quantities $\int_0^1 Q(x)\, \phi_j(x)\, \phi_k(x)\, dx$ are also nonzero only for $k = j-1,\ j,\ j+1$, as is evident from the figure above.
[Figure: the derivative $\phi_j'(x)$, equal to $1/h$ on $(x_{j-1}, x_j)$, to $-1/h$ on $(x_j, x_{j+1})$, and to 0 elsewhere.]
To conclude this subsection, let us remark on the issue of the calculation of the second integral in (9.18) and the integrals in (9.17). For brevity, we will only speak here about the former type of integrals; the same will apply to the latter ones. The integrals $\int_0^1 Q(x)\, \phi_j(x)\, \phi_k(x)\, dx$ may be evaluated in one of the following ways. First, if analytical expressions for them can be obtained for all required pairs of $k$ and $j$ (e.g., with Mathematica or another computer algebra package), the numeric values of these expressions should be used. If such expressions are not available, then the computation may differ depending on whether the $\phi_j$'s are smooth and non-vanishing over the entire interval $[0, 1]$, as, say, functions (9.4), or they are the hat functions (or any other highly localized functions). In the former case, the integrals in question may be computed by one of the standard methods (say, Simpson's) using the existing subdivision of $[0, 1]$, or by Matlab's built-in integrators (quad or quadl). In the latter case, i.e. for highly localized $\phi_j$'s, the integrals can be approximated as
$$\int_0^1 Q(x)\, \phi_j(x)\, \phi_k(x)\, dx \approx Q(x_{\rm mid}) \int_0^1 \phi_j(x)\, \phi_k(x)\, dx, \qquad (9.21)$$
where $x_{\rm mid}$ is the middle of the interval over which the product $\phi_j(x)\, \phi_k(x)$ is nonzero. One can show that the accuracy of approximation (9.21) is $O(h^2)$. In the case when the $\phi_j$'s are the hat functions (9.19), the integral on the r.h.s. of (9.21) can be explicitly calculated to be:
$$\int_0^1 \phi_j(x)\, \phi_k(x)\, dx = \begin{cases} \dfrac{h}{6}, & k = j \pm 1 \\[1mm] \dfrac{2h}{3}, & k = j \\[1mm] 0, & \text{otherwise.} \end{cases} \qquad (9.22)$$
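Putting the pieces (9.18)–(9.22) together, the following sketch (an illustration, not part of the notes) assembles and solves the Galerkin system for the hat functions on a uniform grid; Qf and Rf are assumed to be vectorized Python functions, and the right-hand side integrals of (9.17) are approximated in the same $O(h^2)$ spirit as (9.21), i.e. $\int R\,\phi_k\, dx \approx R(x_k)\, h$.

```python
import numpy as np

def galerkin_hat(Qf, Rf, M):
    """Galerkin solution of -y'' + Q(x) y = R(x), y(0) = y(1) = 0, with hat functions (9.19)."""
    h = 1.0 / (M + 1)
    x = h * np.arange(1, M + 1)                    # interior nodes x_1 ... x_M
    Q = Qf(x)

    A = np.zeros((M, M))
    r = Rf(x) * h                                  # int R(x) phi_k dx ~ R(x_k) * h
    for k in range(M):
        A[k, k] = 2.0 / h + Q[k] * 2.0 * h / 3.0   # diagonal entries, Eqs. (9.20), (9.22)
        if k + 1 < M:
            q_mid = Qf(np.array([x[k] + h / 2.0]))[0]          # midpoint of the overlap
            A[k, k + 1] = A[k + 1, k] = -1.0 / h + q_mid * h / 6.0
    c = np.linalg.solve(A, r)                      # a tridiagonal solver could be used instead
    return x, c                                    # with hat functions, c_j = Y(x_j)
```

Because each hat function equals 1 at its own node and 0 at all other nodes, the computed coefficients $c_j$ are also the nodal values of the approximate solution $Y(x)$.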
9.4 Rayleigh–Ritz method
This method replaces the problem of solving BVP (9.1) by the problem of finding the minimum of a certain functional. We will not consider this method in more detail, but only mention that: (i) the Rayleigh–Ritz method can be shown to be equivalent to the Galerkin method, and (ii) the functional mentioned in the previous sentence is the length of the residual $\rho(x)$, defined according to (9.13).
9.5 Questions for self-assessment
1. Describe the idea behind the collocation method.
2. What condition (or conditions) should the basis functions in the collocation method satisfy?
3. In addition to the above condition(s), what other condition should the basis functions satisfy in order to make the corresponding coefficient matrix tridiagonal?
4. What is the advantage of the collocation method over the finite-difference method?
5. Write down the explicit form of $B_M$.
6. Verify the entries in the Table in Sec. 9.2.
7. Describe the idea behind the Galerkin method.
8. Try to explain the analogy between definition (9.12) of orthogonality of functions and the definition of orthogonality of vectors in $\mathbb{R}^n$. Hint: Interpret the integral as (the limit of) a finite, say, Riemann, sum. (The fact that it is the limit is not really important here.)
9. Consequently, explain why (9.13) is analogous to the Euclidean length of a vector.
10. Continuing from the last two questions, try to explain the close analogy between the Galerkin method and the least-squares solution of inconsistent linear systems.
11. What is the advantage of the Galerkin method over the collocation method?
11 Classification of partial differential equations (PDEs)
In this lecture, we will begin studying differential equations involving more than one independent variable. Since they involve partial derivatives with respect to these variables, they are called partial differential equations (PDEs). Although this course is concerned with numerical methods for solving such equations, we will first need to provide some analytical background on where those equations arise and how their setup is different from that of ODEs. This will be done in this lecture, while the subsequent lectures, except Lecture 16, will deal with the numerical methods proper.
11.1 Classification of physical problems described by PDEs
The majority of problems in physics and engineering fall into one of the following categories: (i) equilibrium problems, (ii) eigenvalue problems, and (iii) evolution problems.
(i) Equilibrium problems are those where a steady-state spatial distribution of some quantity $u$ inside a given domain $D$ is to be determined by solving a differential equation
$$L[u] = f(x, y), \qquad (x, y) \in D$$
subject to the boundary condition
$$B[u] = g(x, y), \qquad (x, y) \in \partial D,$$
where $\partial D$ is the boundary of $D$. Here $L$ is a differential operator involving derivatives with respect to $x$ and $y$ (for the case of two spatial dimensions); $B$, in general, may also involve derivatives. These BVPs generalize, to two or more dimensions, the one-dimensional BVPs we studied in Lectures 6 through 9.
[Figure: a domain $D$ with the PDE $L[u] = f$ holding inside $D$ and the boundary condition $B[u] = g$ imposed on the boundary $\partial D$ of $D$.]
Examples of equilibrium problems include: Steady flows of liquids and gases; steady temperature distributions; equilibrium stress distributions in elastic structures.
(ii) Eigenvalue problems are extensions of equilibrium problems with no external forces, where nontrivial (i.e. not identically zero) steady-state distributions exist only for special values of certain parameters, called eigenvalues. These eigenvalues, denoted $\lambda$, are to be determined along with the steady-state distributions themselves. The simplest form of an eigenvalue problem is
$$L[u] = \lambda u \ \text{ for } (x, y) \in D; \qquad B[u] = 0 \ \text{ for } (x, y) \in \partial D.$$
In a more complex setup, the eigenvalue may enter into the PDE, and even into the boundary condition, in a more complicated way.
Examples of eigenvalue problems include: Natural frequencies of vibrating strings and beams; resonances in electric circuits, mechanics, and acoustics; energy levels in quantum mechanics.
(iii) Evolution problems are extensions of initial value problems, where the distribution of a quantity $u$ is not steady but exhibits transient behavior. Generally, the problem is to predict the evolution of the system at any time given its initial state. This is done by solving the PDE
$$L[u] = f(x, t), \qquad x \in D \ \text{ for } t > t_0,$$
given the initial state
$$I[u] = h(x), \qquad x \in D \ \text{ for } t = t_0,$$
and the boundary conditions
$$B[u] = g(x, t), \qquad x \in \partial D \ \text{ and } t \ge t_0.$$
The differential operator $L$ now involves derivatives with respect to $x$ and $t$.
Examples of evolution problems include: Propagation of waves of any nature; diffusion of a substance in a room; cooling down or heating an object.
For example, the mathematical problem of determining the evolution of a temperature distribution $u(x, t)$ inside a rod of length 1 is set up as follows:
$$L[u] = f(x, t), \qquad 0 < x < 1, \quad t > 0,$$
where the form of the operator $L$ will be specified later;
$$u(x, t = 0) = h(x), \qquad 0 \le x \le 1,$$
where $h(x)$ is the initial temperature distribution inside the rod;
$$u(x = 0, t) = g_0(t), \qquad u(x = 1, t) = g_1(t), \qquad t \ge 0,$$
where $g_{0,1}(t)$ are the temperature values maintained at the two ends of the rod.
[Figure: the space-time domain $D$: $0 \le x \le 1$, $t \ge 0$, with the initial condition prescribed on the segment $t = 0$ and the boundary conditions imposed on the lines $x = 0$ and $x = 1$.]
In subsequent lectures, we will consider exclusively evolution problems. To that end, we would like to obtain an unambiguous, mathematically rigorous criterion which allows one to distinguish problems of different categories. This is done in the next subsection.
11.2 Classification of PDEs into three types; characteristics
Here we will consider the question of how many initial or boundary conditions can or should be specified for a PDE, and where (in the $(x, y)$-space or $(x, t)$-space) they can or should be specified. We will concentrate on the case of two-dimensional spaces; generalizations to the three- and four-dimensional (i.e., the time plus three spatial dimensions) cases are possible and for the most part straightforward. For definiteness, let us speak about the $(x, y)$-space until otherwise is indicated. (That is, for now, $y$ may denote either the second spatial variable or the time variable $t$.)
As a reference, let us recall the situation with ODEs and, for concreteness, consider a second-order ODE. There, we could either specify the initial values for the dependent function $y$ and its derivative $y'$ at one point $x = x_0$, or the values of $y$ (or more complicated expressions, e.g., (8.31)) at two points, $x = a$ and $x = b$. In the former case, we have an IVP, and in the latter case, a BVP. Since, as we said earlier, we will be concerned with evolution problems, which are higher-dimensional counterparts of IVPs, we proceed to recall how we were able to solve (conceptually, not technically) the latter, i.e.
$$y'' = f(x, y, y'), \qquad y(x_0) = y_0, \quad y'(x_0) = v_0\,. \qquad (11.1)$$
Namely, for any point $x_0 + h$ sufficiently near $x_0$, we could write the Taylor expansion
$$y(x_0 + h) = y(x_0) + h\, y'(x_0) + \frac{1}{2} h^2\, y''(x_0) + \frac{1}{6} h^3\, y'''(x_0) + \ldots\,. \qquad (11.2)$$
The first two terms on the r.h.s. of (11.2) are known from the initial conditions; the third term is known from the ODE, which we assume to be satisfied at $x = x_0$. The last term in (11.2) can then be found from
$$y'''(x) = \frac{dy''}{dx} = \frac{df}{dx} = \frac{\partial f}{\partial x} + y'\, \frac{\partial f}{\partial y} + y''\, \frac{\partial f}{\partial y'}\,. \qquad (11.3)$$
All omitted higher-order terms in (11.2) can be found analogously to (11.3). Thus, $y(x)$ can be determined at all points that are sufficiently close to $x_0$.
When we move from one independent variable (as in ODEs) to two (as in PDEs), it is intuitive to suppose that now the initial and/or boundary conditions should be specified along certain curves in the $(x, y)$-space rather than at a point. (In that case, the dimensions of both the differential equation and the initial/boundary condition are each increased by one.) Thus, let us assume that we know the dependent function $u$ along some curve $\Gamma$ in the $(x, y)$-plane and also the derivative $\partial u/\partial n$ in the direction normal to $\Gamma$:
$$\begin{cases} u(x, y) = g_0(x, y), \\[1mm] \dfrac{\partial u(x, y)}{\partial n} = G_1(x, y), \end{cases} \qquad (x, y) \in \Gamma\,. \qquad (11.4)$$
Note that if one knows $u$ along $\Gamma$, one automatically knows also the derivative of $u$ along $\Gamma$ (simply take the directional derivative of $u$ in the direction tangent to $\Gamma$ at each of its points). Knowing the derivatives of $u$ in both the normal and tangent directions to $\Gamma$ is equivalent to knowing $u_x$ and $u_y$ separately at each point. (Here and below we use subscripts to denote partial differentiation; i.e., $u_x \equiv \partial u/\partial x$ and $u_y \equiv \partial u/\partial y$.) Thus, Eqs. (11.4) are equivalent to
$$\begin{cases} u(x, y) = g_0(x, y), \\[1mm] u_x(x, y) = g_1(x, y) \ \text{ and } \ u_y(x, y) = g_2(x, y), \end{cases} \qquad (x, y) \in \Gamma\,. \qquad (11.5)$$
In the remainder of this course, we will consider PDEs of the form
$$A u_{xx} + 2B u_{xy} + C u_{yy} + D u_x + E u_y + F = 0\,, \qquad (11.6)$$
where the coefficients $A$, $B$, $C$ may depend on any or all of $x$, $y$, $u$, $u_x$, $u_y$, and the coefficients $D$, $E$, $F$ may depend on $x$, $y$, $u$. Given the PDE (11.6) and the initial/boundary conditions (11.5) on the curve $\Gamma$, we would like to determine $u$ at some point that is sufficiently near $\Gamma$. So, let $(x_0, y_0)$ be some point on $\Gamma$ and $(x_0 + h, y_0 + k)$, with $h, k \ll 1$, be a nearby point where we want to determine $u$. The fundamental question that we now ask is: What are the restrictions on the curve $\Gamma$ and on the coefficients $A$, $B$, $C$ in (11.6) under which one can determine $u(x_0 + h, y_0 + k)$?
We begin answering this question by writing the Taylor expansion for u(x_0 + h, y_0 + k) near the point (x_0, y_0) on G, where we know both u and its first derivatives from (11.5):

    u(x_0 + h, y_0 + k) = u(x_0, y_0) + h u_x(x_0, y_0) + k u_y(x_0, y_0)
        + (1/2) h^2 u_xx(x_0, y_0) + h k u_xy(x_0, y_0) + (1/2) k^2 u_yy(x_0, y_0)
        + (1/6) h^3 u_xxx(x_0, y_0) + ...    (11.7)

This expansion is the analog of (11.2). Now, as we have said, all terms on the r.h.s. of the first line of (11.7) are known from (11.5). If each of the three terms in the second line of (11.7) can be found separately (i.e., as opposed to in the combination in which they enter Eq. (11.6)), then all the higher-order terms in expansion (11.7) can be found similarly to (11.3). Indeed, suppose we have found expressions for u_xx, u_xy, and u_yy in the form that generalizes (11.1) to two variables:

    u_xx = f_1(x, y, u, u_x, u_y),   u_xy = f_2(x, y, u, u_x, u_y),   u_yy = f_3(x, y, u, u_x, u_y).    (11.8)
Then the third-order partial derivatives (see the last line in (11.7)) can be computed using the Chain Rule for a function of several variables. For example,

    u_yyy = d u_yy / dy |_{x=const} = d f_3(x, y, u(x, y), u_x(x, y), u_y(x, y)) / dy |_{x=const}
          = df_3/dy + (df_3/du) u_y + (df_3/du_x) du_x/dy + (df_3/du_y) du_y/dy
          = df_3/dy + (df_3/du) u_y + (df_3/du_x) u_xy + (df_3/du_y) u_yy .    (11.9)

The first two terms on the r.h.s. of (11.9) are known from (11.5). Therefore, if we also know u_xy and u_yy on G, we then can compute the last two terms in (11.9) and hence the u_yyy. Other third- and higher-order derivatives in the Taylor expansion (11.7) can be computed analogously. Thus, we will be able to find u(x_0 + h, y_0 + k) if and only if we know u_xx, u_xy, and u_yy on G.
Now, we need three equations to be able to uniquely determine the three quantities u_xx, u_xy, and u_yy. The first equation is the PDE (11.6). The other two equations are found by differentiating the two equations on the second line of (11.5) along G^22:

    u_xx d_G x + u_xy d_G y = d_G g_1(x, y)   ( = g_{1,x} d_G x + g_{1,y} d_G y ) ,    (11.10)
    u_yx d_G x + u_yy d_G y = d_G g_2(x, y)   ( = g_{2,x} d_G x + g_{2,y} d_G y ) ,    (11.11)

where the symbol d_G denotes a differential (i.e., an infinitesimally small step) along G. Note that both the r.h.s.es and the ratio d_G y / d_G x (i.e., the slope of G) are known once G and the functions in (11.5) are known.
Further, if we assume u to be continuously differentiable at least twice, then

    u_xy = u_yx ,    (11.12)
and Eqs. (11.6) and (11.10), (11.11) can be written as a linear system for u_xx, u_xy, and u_yy at any point on G:

    (   A      2B      C   ) ( u_xx )   ( -(D u_x + E u_y + F) )
    ( d_G x   d_G y    0   ) ( u_xy ) = (        d_G g_1       )    (11.13)
    (   0     d_G x  d_G y ) ( u_yy )   (        d_G g_2       )

^22 Differentiating, or taking the directional derivative, along a curve is equivalent to computing the infinitesimal increments u_x(x + d_G x, y + d_G y) - u_x(x, y) and u_y(x + d_G x, y + d_G y) - u_y(x, y), which yields the l.h.s.es of (11.10) and (11.11), respectively.
This system yields a unique solution for u_xx, u_xy, and u_yy provided that the coefficient matrix is nonsingular. The matrix would be singular if its determinant vanishes:

    A (d_G y)^2 - 2B (d_G x)(d_G y) + C (d_G x)^2 = 0    <=>    A (dy/dx)_G^2 - 2B (dy/dx)_G + C = 0 .    (11.14)

Thus, we have obtained the answer to the fundamental question posed above. Namely, if the initial/boundary conditions are prescribed on a curve G whose tangent at any point satisfies Eq. (11.14), then the corresponding initial-boundary value problem (IBVP) (11.4) and (11.6) cannot be solved for a twice-continuously differentiable (see (11.12)) function u(x, y). If the initial/boundary conditions (11.4) are prescribed along any other curve, the IBVP can be solved. Alternatively, the IBVP can still be solved if a smaller set of initial/boundary conditions (say, just the first line in (11.4)) is specified along G, or if u_xy (or any lower-order derivative of u) is allowed to be discontinuous across G.
Equation (11.14) gives one the mathematically rigorous criterion that separates all PDEs (11.6) into three types, depending on the relation among A, B, and C.
B^2 - AC < 0

In this case, no real solution for the slope (dy/dx)_G can be found from the quadratic equation (11.14). This means that one can specify the initial/boundary conditions (11.4) along any curve in the plane, and be able to obtain the solution u sufficiently close to that curve. Such equations are called elliptic. Physical problems leading to elliptic equations are the equilibrium and eigenvalue problems, described in Sec. 11.1. Typical examples of such problems are the Laplace and Helmholtz equations:

    u_xx + u_yy = 0,        (Laplace)
    u_xx + u_yy = lambda u.    (Helmholtz)

The boundary conditions for elliptic equations are usually imposed along the boundary dD of a closed domain D, as in the first figure in Sec. 11.1. One can also show that to obtain the solution inside the entire domain D, rather than only sufficiently close to its boundary dD, one needs to impose only one of the conditions (11.4), but not both. On this remark, we leave the elliptic equations and will not consider them again in this course.
B^2 - AC > 0

In this case, two real solutions for the slopes of curve G exist:

    (dy/dx)_G = ( B +- sqrt(B^2 - AC) ) / A .    (11.15)

These slopes specify two distinct directions in the (x, y)-plane, called characteristics. The corresponding PDEs are called hyperbolic. Physical problems that lead to hyperbolic equations are the evolution problems dealing with propagation of waves (e.g., light or sound). The coordinates in this case are x, the spatial coordinate of propagation, and t, the time, rather than the second spatial coordinate y. The typical example is the Wave equation:

    u_xx - u_tt = 0 .    (Wave)

The importance of characteristics in hyperbolic problems is two-fold: (i) the initial data for a smooth solution cannot be prescribed on a characteristic, and (ii) initial disturbances propagate along the characteristics. We will consider this latter issue in more detail when we begin to study numerical methods for hyperbolic PDEs.
B^2 - AC = 0

In this case, only one value of the slope of G exists:

    (dy/dx)_G = B / A .    (11.16)

This gives only one direction of characteristics. The corresponding PDEs are called parabolic. Physical problems that lead to parabolic equations are usually diffusion-type problems. The typical example is the Heat equation,

    u_xx - u_t = 0 ,   or   u_t = u_xx ,    (Heat)

which describes, e.g., the evolution of temperature inside a rod.
Since in the next four lectures we will consider methods of numerical solution of the Heat equation, let us discuss how boundary conditions can or should be set up for it. In fact, this was considered in the example at the end of Sec. 11.1. Namely, the initial condition for the Heat equation on x in [0, 1] is

    u(x, t = 0) = u_0(x),   0 <= x <= 1,    (Initial condition for Heat equation)

and the boundary conditions are

    u(0, t) = g_0(t),   u(1, t) = g_1(t),   t >= 0 .    (Boundary conditions for Heat equation)

Note that the initial condition is prescribed along a characteristic! Indeed, for the Heat equation, A = 1, B = C = 0, and Eq. (11.16) gives the slope of the characteristic as dt/dx = 0, which means that any line t = const is a characteristic. The above, however, does not contradict the results of the analysis of this subsection, because the initial condition corresponds only to the first equation in (11.4), while the second equation is absent. Thus, one cannot prescribe the rate of change u_t at the initial moment for the Heat equation.
11.3 Questions for self-assessment

1. Give examples from physics of equilibrium, eigenvalue, and evolution problems.
2. Explain how system (11.13) is set up (i.e., where its equations come from).
3. What is the significance of characteristics?
4. What types of physical problems lead to elliptic, hyperbolic, and parabolic equations?
5. How many characteristics does the Wave equation have?
6. Why does prescribing initial data on a characteristic for the Heat equation not prevent one from finding the solution of that IBVP?
12 The Heat equation in one spatial dimension: Simple explicit method and Stability analysis

12.1 Formulation of the IBVP and the minimax property of its solution

We begin by writing down the Heat equation (in its simplest form) on the interval x in [0, 1] and the corresponding initial and boundary conditions. In fact, this is just a restatement from the end of Lecture 11.

    u_t = u_xx ,                          0 < x < 1,  t > 0 ;    (12.1)
    u(x, t = 0) = u_0(x),                 0 <= x <= 1 ;          (12.2)
    u(0, t) = g_0(t),  u(1, t) = g_1(t),  t >= 0 .               (12.3)

The IBVP (12.1)-(12.3) will be the subject of this and the next lectures. Boundary conditions of a form more general than (12.3) will be considered in Lecture 14. Recall that in order to produce a continuous solution, the boundary and initial conditions must match:

    u_0(0) = g_0(0)  and  u_0(1) = g_1(0) .    (12.4)

On physical grounds, in what follows we will always require that the matching conditions (12.4) be satisfied.
It is always useful to know what general properties one may expect of the analytical solution of a given IBVP, so that one could verify that the corresponding numerical solution also has these properties (this is a basic sanity check for the numerical code). Such a property for IBVP (12.1)-(12.3), stated below, is proved in courses on PDEs.

Minimax principle  Suppose u_t (and hence u_xx, and both u_x and u) is continuous in the region D = [0, 1] x [0, infinity) (see the figure on the right)^a. Then the solution u of the IBVP (12.1)-(12.3) achieves its maximum and minimum values on dD (i.e. either for t = 0 or for x = 0 or x = 1). In other words, u cannot achieve its maximum or minimum values strictly inside D.

^a Note that here domain D and its boundary dD are defined slightly differently than in the figure at the end of Sec. 11.1.

[Figure: the semi-infinite strip D = [0, 1] x [0, infinity) in the (x, t)-plane and its boundary dD.]

Note that this, at least partially, agrees with our intuition in real life. Indeed, suppose one creates some distribution of nonnegative temperature in the rod at t = 0 while keeping the ends of the rod at zero temperature at all times. Then we expect that the temperature inside the rod at any t > 0 will be less than it was at t = 0 (because the rod will cool down); that is, the maximum temperature was observed somewhere along the rod at t = 0, i.e. at the bottom part of dD. On the other hand, we also expect that the temperature in this setup will not drop below zero; that is, the temperature will be minimum at the ends of the rod, i.e. at the sides of dD.
12.2 The simplest explicit method for the Heat equation

Let us cover the region D with a mesh (or grid), as shown on the right. Denote

    x_m = m h,   m = 0, 1, ..., M   ( h = 1/M ) ;
    t_n = n kappa,   n = 0, 1, ..., N   ( kappa = T_max / N ) ;    (12.5)

here T_max is the maximum time until we want to compute the solution. Also, let U^n_m be the solution computed at node (x_m, t_n). For simplicity, in this lecture we will assume that the boundary conditions are homogeneous:

    g_0(t) = g_1(t) = 0   for all t ;    (12.6)

note that this implies that u_0(0) = u_0(1) = 0.
When restricted to the grid, the initial and boundary conditions become:

    (12.2)  =>   U^0_m = u_0(m h),   0 <= m <= M ;    (12.7)
    (12.3)  =>   U^n_0 = 0,   U^n_M = 0,   n >= 0 .    (12.8)
Let us now use the simplest finite-difference approximations to replace the derivatives in the Heat equation:

    u_t  ~  ( U^{n+1}_m - U^n_m ) / kappa + O(kappa) ,    (12.9)
    u_xx ~  ( U^n_{m+1} - 2 U^n_m + U^n_{m-1} ) / h^2 + O(h^2) .    (12.10)

Substituting these formulae into (12.1) yields the simplest explicit method for solving the Heat equation:

    ( U^{n+1}_m - U^n_m ) / kappa = ( U^n_{m+1} - 2 U^n_m + U^n_{m-1} ) / h^2 + O(kappa + h^2) ,    (12.11)

or, equivalently,

    U^{n+1}_m = r U^n_{m+1} + (1 - 2r) U^n_m + r U^n_{m-1} ,    (12.12)

where

    r = kappa / h^2 .    (12.13)

The numerical solution at node (x_m, t_{n+1}) can thus be found if one knows the solution at nodes (x_{m-1}, t_n), (x_m, t_n), and (x_{m+1}, t_n). These four nodes form a stencil for scheme (12.12), as shown schematically on the right. Given the initial and boundary conditions (12.7) and (12.8), one can advance the solution U^n_m from time level number n to time level number (n + 1) using the recurrence formula of scheme (12.12).

[Stencil of scheme (12.12): the three nodes (m-1, n), (m, n), (m+1, n) at time level n, and the node (m, n+1) at level n+1.]
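To make the preceding recipe concrete, here is a minimal sketch of the time-stepping loop for scheme (12.12) with the homogeneous boundary conditions (12.8). It is written in Python with NumPy only for illustration (the course itself refers to Matlab), and the initial condition sin(pi x) is chosen simply because its exact solution, exp(-pi^2 t) sin(pi x), is known and can be used as a check.

    import numpy as np

    def explicit_heat(u0, M, kappa, nsteps):
        """Advance u_t = u_xx by the simple explicit scheme (12.12)
        with homogeneous Dirichlet boundary conditions (12.8)."""
        h = 1.0 / M
        r = kappa / h**2                       # the parameter r of (12.13)
        x = np.linspace(0.0, 1.0, M + 1)
        U = u0(x)                              # U^0_m, initial condition (12.7)
        for n in range(nsteps):
            Unew = U.copy()
            # interior nodes m = 1, ..., M-1: recurrence (12.12)
            Unew[1:M] = r*U[2:M+1] + (1 - 2*r)*U[1:M] + r*U[0:M-1]
            Unew[0] = 0.0                      # boundary conditions (12.8)
            Unew[M] = 0.0
            U = Unew
        return x, U

    # Example with r = 0.4, which satisfies the stability condition derived in Sec. 12.3.
    M, kappa, nsteps = 20, 0.4*(1.0/20)**2, 200
    x, U = explicit_heat(lambda x: np.sin(np.pi*x), M, kappa, nsteps)
    exact = np.exp(-np.pi**2 * kappa * nsteps) * np.sin(np.pi*x)
    print(np.max(np.abs(U - exact)))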
12.3 Stability analysis

From Eq. (12.11) one can see that the simple explicit method is consistent with the PDE (12.1). Recall from Lecture 4 that consistency means that the solution of the finite-difference scheme approaches the solution of the differential equation as the step sizes, kappa and h in this case, tend to zero. In other words, the local truncation error satisfies lim_{kappa, h -> 0} (truncation error) = 0.
From the study of ODEs, we know that consistency alone is not sufficient for the numerical solution to converge to the analytical solution of the PDE. To assure the convergence, we must also require that the finite-difference scheme be stable. Recall that stability means that small errors made during one step of the computation must not grow at subsequent steps. For ODEs, we stated a theorem that said that stability + consistency implied convergence of the numerical solution to the analytical one. For PDEs, a similar result also holds:

Theorem 12.1 (Lax Equivalence Theorem)  For a properly posed (as discussed in Lecture 11) IBVP and for a finite-difference scheme that is consistent with the IBVP, stability is a necessary and sufficient condition for convergence.

As for ODEs, this theorem can be understood from the following simple consideration. Let u^n_m = u(x_m, t_n) be the exact solution of the PDE, U~^n_m be the exact solution of the finite-difference scheme, and U^n_m be the actually computed solution of that scheme. (It may differ from the exact one because, e.g., of round-off errors.) Then

    | u^n_m - U^n_m | = | ( u^n_m - U~^n_m ) + ( U~^n_m - U^n_m ) |
                     <= | u^n_m - U~^n_m | + | U~^n_m - U^n_m | .    (12.14)

If the difference scheme is consistent, then the first term on the r.h.s. is small. If the difference scheme is stable, then the second term on the r.h.s. is small for all n (i.e., it does not grow). Thus, if the scheme is both consistent and stable, then the l.h.s. of (12.14) is small for all n, which, in words, means that the numerical solution of the finite-difference scheme closely approximates the analytical solution of the PDE.
Now we will show how the stability of a finite-difference scheme for a PDE can be studied. We will do this using two alternative methods. Method 1 will show a relation between the stability analysis for PDEs and that for systems of ODEs. Method 2 will be new. It is specific to PDEs and, quite pleasantly, is easier to apply than Method 1. However, nothing is free: this simplicity comes at the price that this method gives less complete information than Method 1. We will provide more details after we have described both methods.
Method 1 (Matrix stability analysis)

One can view scheme (12.11) (and hence (12.12)) as the simple explicit Euler method applied to the following coupled system of ODEs:

       ( u_1     )         ( -2   1                 ) ( u_1     )
    d  ( u_2     )     1   (  1  -2    1            ) ( u_2     )
    -- (  ...    )  = ---  (      ...  ...   ...    ) (  ...    )    (12.15)
    dt ( u_{M-2} )    h^2  (           1    -2    1 ) ( u_{M-2} )
       ( u_{M-1} )         (                 1   -2 ) ( u_{M-1} )

(In writing out (12.15), we have also used the homogeneous boundary conditions (12.8).) Indeed, Eqs. (12.11) are obtained by discretizing the time derivative in (12.15) according to (12.9). Thus, studying the stability of scheme (12.11) is equivalent to studying the stability of the simple Euler method for system (12.15). You will be asked to do so, using techniques of Lecture 5, in one of the homework problems. Below we will proceed in a slightly different, although, of course, equivalent, way.
We write Eqs. (12.12) in the matrix form:

    ( U^{n+1}_1     )     ( 1-2r    r                      ) ( U^n_1     )
    ( U^{n+1}_2     )     (   r   1-2r    r                ) ( U^n_2     )
    (     ...       )  =  (        ...    ...     ...      ) (    ...    )    (12.16)
    ( U^{n+1}_{M-2} )     (               r     1-2r    r  ) ( U^n_{M-2} )
    ( U^{n+1}_{M-1} )     (                      r    1-2r ) ( U^n_{M-1} )

or

    U^{n+1} = A U^n ,    (12.17)

where r is defined by (12.13),

    U^n = [ U^n_1, U^n_2, ..., U^n_{M-1} ]^T ,

and A is the matrix on the r.h.s. of (12.16).
Iteration scheme (12.16) will converge to a solution only if all the eigenvalues of A do not exceed 1 in magnitude. Indeed, if any of these eigenvalues exceeds 1 (say, |lambda_1| > 1), then U^n = A^n U^0 will grow as lambda_1^n. Therefore, to continue with the stability analysis, we need to know bounds for the eigenvalues of matrix A. In fact, for the matrix of the very special form appearing in (12.16), exact eigenvalues are well known. We present the following result without a proof (which can be found, e.g., in D. Kincaid and W. Cheney, Numerical Analysis: Mathematics of Scientific Computing, 3rd Ed. (Brooks/Cole, 2002); Sec. 9.1).
Lemma  Let B be an N x N tridiagonal matrix of the form

    B = ( b   c                 )
        ( a   b   c             )
        (     ...  ...   ...    )    (12.18)
        (          a     b    c )
        (                a    b )

The eigenvalues and the corresponding eigenvectors of B are:

    lambda_j = b + 2 sqrt(ac) cos( j pi / (N+1) ) ,
    v_j = [ (a/c)^{1/2} sin( 1 j pi/(N+1) ),  (a/c)^{2/2} sin( 2 j pi/(N+1) ),  ...,  (a/c)^{N/2} sin( N j pi/(N+1) ) ]^T ,
    j = 1, ..., N .    (12.19)
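Since the Lemma is quoted without proof, it may be reassuring to verify it numerically. The sketch below (Python/NumPy, with purely illustrative values of M and r) compares the eigenvalues of the matrix A of (12.16) with formula (12.20) that follows from the Lemma.

    import numpy as np

    M, r = 10, 0.3                               # illustrative grid size and r = kappa/h^2
    N = M - 1                                    # A in (12.16) is (M-1) x (M-1)
    A = (1 - 2*r)*np.eye(N) + r*(np.eye(N, k=1) + np.eye(N, k=-1))

    numeric = np.sort(np.linalg.eigvalsh(A))     # A is symmetric, so eigvalsh applies
    j = np.arange(1, M)
    formula = np.sort(1 - 2*r + 2*r*np.cos(j*np.pi/M))   # Eq. (12.20): b = 1-2r, a = c = r
    print(np.max(np.abs(numeric - formula)))     # should be at round-off level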
Using this Lemma, we immediately deduce that the eigenvalues of matrix A in (12.17) are

    lambda_j = 1 - 2r + 2r cos( j pi / M ) ,   j = 1, ..., M-1 ,    (12.20)

whence

    lambda_min = lambda_{M-1} = 1 - 2r + 2r cos( (M-1) pi / M ) ,    (12.21)
    lambda_max = lambda_1     = 1 - 2r + 2r cos( pi / M ) .          (12.22)

If pi/M << 1 (i.e., if there are sufficiently many grid points on the interval [0, 1]), the preceding expressions reduce to

    lambda_min ~ 1 - 4r + r (pi/M)^2 ,    (12.23)
    lambda_max ~ 1 - r (pi/M)^2 ,         (12.24)

where we have used the expansion cos(phi) ~ 1 - phi^2/2 for phi << 1. Then the condition for convergence of the iterations (12.16), which is, as we said before the Lemma,

    -1 <= lambda_j <= 1 ,   j = 1, ..., M-1 ,    (12.25)

yields

    lambda_min ~ 1 - 4r + r (pi/M)^2 >= -1 ;      lambda_max ~ 1 - r (pi/M)^2 <= 1 .

The second of these equations is satisfied automatically because r = kappa/h^2 > 0. The first equation yields:

    r <= 2 / ( 4 - (pi/M)^2 ) = 2 / ( 4 - (pi h)^2 ) ~ 1/2 .    (12.26)
This condition, in the simplified form

    r <= 1/2 ,   or   kappa <= h^2 / 2 ,    (12.27)

is usually taken as the stability condition of the finite-difference scheme (12.12). This means that if kappa <= h^2/2, then all round-off errors will eventually decay, and the scheme is stable. The corresponding numerical solution will converge to the solution of IBVP (12.1)-(12.3). If, on the other hand, kappa > h^2/2, then the errors will grow, thereby making the scheme unstable. The corresponding numerical solution, starting at some t > 0, will have nothing in common with the exact solution of the IBVP.
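The threshold (12.27) is easy to observe numerically. The short sketch below (Python/NumPy, illustrative values) evaluates the eigenvalues (12.20) for several values of r and shows that max_j |lambda_j| first exceeds 1 once r passes the bound (12.26), which is just slightly above 1/2.

    import numpy as np

    M = 50
    j = np.arange(1, M)
    for r in (0.3, 0.5, 0.51, 0.6):
        lam = 1 - 2*r + 2*r*np.cos(j*np.pi/M)      # eigenvalues (12.20)
        print(r, np.max(np.abs(lam)))              # exceeds 1 once r passes ~1/2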
Remark 1  Above we said that for stability of iterations (12.17), the eigenvalues of A must be less than 1 in magnitude. Let us stress that this is true only for diagonalizable (e.g., symmetric) matrices. For nondiagonalizable matrices, e.g., for

    N = ( 1   1                 )
        ( 0   1   1             )
        (     ...  ...   ...    )    (12.28)
        (          0     1    1 )
        (                0    1 )

an eigenvalue-based stability analysis will fail. Indeed, all of N's eigenvalues equal 1, yet one can show (e.g., using Matlab's command norm) that the norm of N^n grows as n^{M-1}, where M is defined in (12.5). There is an entire field of matrix analysis that deals with such nondiagonalizable matrices (with the descriptive keyword being "pseudospectra"), but we will not go into its details here.
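The claim made in Remark 1 can be checked directly. Below is a short sketch (Python/NumPy standing in for the Matlab norm command mentioned above; the matrix size is purely illustrative): all eigenvalues of N equal 1, yet the norm of N^n grows rapidly with n.

    import numpy as np

    M = 8                                          # small illustrative size
    N = np.eye(M) + np.eye(M, k=1)                 # the bidiagonal matrix (12.28)
    print(np.linalg.eigvals(N))                    # all eigenvalues equal 1
    for n in (1, 5, 10, 20):
        P = np.linalg.matrix_power(N, n)
        print(n, np.linalg.norm(P, 2))             # the 2-norm grows polynomially in n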
Condition (12.27) highlights the main drawback of the simple explicit scheme (12.12). Namely, in order for this scheme to be stable (and hence converge to the analytical solution of the IBVP), one must take very small steps in time, kappa <= h^2/2. This will make the code very time-consuming. We will consider alternative approaches, which do not face that problem, in the next lecture.
Now we turn to the second method for stability analysis, announced earlier in this section.
Method 2 (von Neumann stability analysis)
It is rare that eigenvalues of a matrix, like those of matrix A in (12.17), are available. Therefore, we would like to be able to deduce stability of a scheme without finding those eigenvalues. To that end, observe that, since the Heat equation and its discrete version (12.12) are linear, the computational errors satisfy the same equations as the solution itself. Let us denote the error at node (mh, n kappa) as eps^n_m. According to the above, it satisfies Eq. (12.12):

    eps^{n+1}_m = r eps^n_{m+1} + (1 - 2r) eps^n_m + r eps^n_{m-1} .    (12.29)

At each time level, the error can be expanded as a linear superposition of Fourier harmonics:

    eps^n_m = sum_l c_l(n) exp( i beta_l x_m )   (here i = sqrt(-1)).    (12.30)

The range of values for beta_l will be specified as we proceed.
Since Eq. (12.29) is linear, we can substitute in it each individual term of the above expansion. In doing so, we will also let

    c_l(n) = lambda^n ,

where lambda is the number to be determined. Thus, substituting eps^n_m = lambda^n exp(i beta m h) into (12.29), one obtains

    lambda^{n+1} e^{i beta m h} = r lambda^n e^{i beta (m+1) h} + (1 - 2r) lambda^n e^{i beta m h} + r lambda^n e^{i beta (m-1) h} .    (12.31)

Let us make two remarks about the notations in (12.31). First, the superscript in eps^n_m means that the error is evaluated at the nth time level. On the other hand, the superscript in lambda^n means that the factor lambda is raised to the nth power. Second, we have dropped the subscript l of beta, since we now deal with only one term in expansion (12.30).
Continuing with our derivation, we divide all terms in (12.31) by lambda^n exp(i beta m h) and obtain:

    lambda = r e^{i beta h} + (1 - 2r) + r e^{-i beta h} = 1 - 2r + 2r cos(beta h) .    (12.32)

The condition |lambda| <= 1, which would guarantee that the errors do not grow, yields:

    -1 <= 1 - 2r + 2r cos(beta h) <= 1 .    (12.33)

To obtain a condition on r from this double inequality, we need to know what values the parameter beta can take. Even though periodic boundary conditions, which are tacitly implied by the use of the Fourier expansion (12.30) (as shown in graduate courses on Fourier analysis), yield certain discrete values for beta, we will follow an alternative and simplified approach. Namely, we will assume that the cosine in (12.33) can take its full range of values:

    -1 <= cos(beta h) <= 1    <=>    0 <= beta h <= pi .    (12.34)
Using now the half-angle formula, valid for any phi:

    1 - cos(phi) = 2 sin^2( phi/2 ) ,

one rewrites (12.33) as

    -1 <= 1 - 4r sin^2( beta h / 2 ) <= 1 .    (12.35)

The right-hand inequality in (12.35) holds automatically, while the left-hand one implies:

    r sin^2( beta h / 2 ) <= 1/2 .    (12.36)

To guarantee stability of the method, this inequality must hold for all values of beta h from (12.34). In particular, it must hold for the worst-case value that yields the largest value of sin^2(beta h / 2). The latter value is 1, occurring for beta h = pi. Then, the stability condition is

    r <= 1/2 ,    (12.27)

which is the simplified form of the stability condition obtained in Method 1 above.
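A quick numerical check of the above analysis (a Python/NumPy sketch, not part of the original notes): evaluating the amplification factor (12.32)/(12.35) over the full range (12.34) confirms that lambda first leaves the interval [-1, 1] at beta h = pi, once r exceeds 1/2.

    import numpy as np

    beta_h = np.linspace(0.0, np.pi, 201)             # full range (12.34) of beta*h
    for r in (0.3, 0.5, 0.55):
        lam = 1 - 4*r*np.sin(beta_h/2)**2             # amplification factor (12.35)
        print(r, lam.min(), lam.max())                # lam.min() drops below -1 once r > 1/2,
                                                      # and it does so at beta*h = pi (cf. (12.36))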
A few remarks are now in order.

Remark 2  The reason why the condition obtained by the von Neumann analysis is slightly different from the exact condition (12.26) is that the latter, based on the eigenvalues of matrix A in (12.16), takes into account the boundary conditions (QSA: how?), while the von Neumann analysis, based on expansion (12.30), ignores those conditions.

Remark 3, related to Remark 2.  A condition on r obtained via the von Neumann analysis is a necessary, but not sufficient, condition for stability of a finite-difference scheme. That is, a scheme may be found to be stable according to the von Neumann analysis, but taking into account the information about the boundary conditions may reveal that there still is an instability. A simple example of this can be found in R.D. Richtmyer and K.W. Morton, Difference Methods for Initial-Value Problems, 2nd Ed. (Interscience/John Wiley, New York, 1967); pp. 154-156 (that book also contains a thorough and rather clear presentation of sufficient conditions for stability). We will not, however, consider the generalization of the von Neumann analysis that takes into account boundary conditions. A simple, yet practical, approach that one may take is to apply the von Neumann stability analysis to a given scheme, find the necessary condition (usually on r) that is required for the scheme to be stable, and then test the scheme on the problem of interest while monitoring whether any modes localized near the boundaries tend to become unstable.

Note that Method 1 provides a sufficient condition for stability of the numerical scheme^23, because it takes into account the boundary conditions when setting up matrix A. However, that method is difficult to apply in practice since it requires the knowledge of the eigenvalues of A.

Remark 4  Note, however, that in finite-difference discretizations of hyperbolic equations, where the counterpart of matrix A may turn out to be nondiagonalizable, the von Neumann analysis would provide more information about the stability of the numerical scheme than Method 1. An extreme example is that of matrix N in (12.28), for which the information about its eigenvalues is useless for the stability analysis (see above). Yet, the von Neumann analysis in this case can be shown to correctly predict stability or instability of the numerical scheme.

^23 We refer to the case of the Heat equation, where matrix A is diagonalizable and hence has a basis of eigenvectors over which any initial condition U^0 can be expanded.
An important feature of the von Neumann analysis is that it tells the user which harmonics (or modes) of the numerical solution will first become unstable if the stability condition is slightly violated. For example, it follows from (12.33) and (12.36) that if r just slightly exceeds the critical value of 1/2, then modes with beta ~ pi/h will have an amplification factor that is slightly less than -1:

    r > 1/2    =>    lambda( beta ~ pi/h ) < -1 .    (12.37)

Now recall that the modes are proportional to exp(i beta m h); hence the unstable modes mentioned above are

    exp(i beta m h) = exp( i (pi/h) m h ) = exp(i pi m) .    (12.38)

Therefore, with the account of e^{i pi} = -1, the mode changes its sign from one node to the next, as shown on the right. In other words, it is the modes with the highest frequency that can cause numerical instability of the simple explicit method for the Heat equation.

[Figure: the highest-frequency mode exp(i pi m) = (-1)^m, alternating between +1 and -1 at the consecutive nodes m-2, ..., m+2.]
12.4 Explicit methods of higher order

As follows from (12.11), scheme (12.12) has the first order of consistency in t and the second order of consistency in x (i.e., the global error is O(kappa + h^2)). Note, however, that since the stability condition (12.27),

    kappa <= h^2 / 2 ,    (12.27)

must hold, one always has O(kappa) = O(h^2) for a stable scheme. In other words, it would not make sense to derive a method with the global error of O(kappa^2 + h^2) while keeping kappa <= h^2/2. However, it will still be of value to derive a method with the truncation error O(kappa^2 + h^4), which we will now do.
Remembering how we derived higher-order methods for ODEs, we start off by writing out the Taylor expansions for the finite differences appearing in (12.9) and (12.10):

    ( U^{n+1}_m - U^n_m ) / kappa = d_t U^n_m + (kappa/2) d^2_t U^n_m + O(kappa^2) ,    (12.39)

    ( U^n_{m+1} - 2 U^n_m + U^n_{m-1} ) / h^2 = d^2_x U^n_m + (h^2/12) d^4_x U^n_m + O(h^4) .    (12.40)
Equation (12.39) is the counterpart of Eq. (8.37), and Eq. (12.40) was obtained in Problem 3 of HW 5. Substituting (12.39) and (12.40) into the Heat equation (12.1), we obtain:

    ( U^{n+1}_m - U^n_m ) / kappa - ( U^n_{m+1} - 2 U^n_m + U^n_{m-1} ) / h^2
        = ( d_t U^n_m - d^2_x U^n_m ) + ( (kappa/2) d^2_t U^n_m - (h^2/12) d^4_x U^n_m ) + O(kappa^2 + h^4) .    (12.41)

The first term on the r.h.s. of (12.41) vanishes, because U^n_m is assumed to satisfy the Heat equation. By differentiating both sides of the Heat equation with respect to t and then using the Heat equation again, we obtain:

    d_t ( u_t - u_xx ) = u_tt - d^2_x u_t = u_tt - u_xxxx ,    hence    u_tt = u_xxxx .    (12.42)

Note that in the middle part of the first equation above, we have used that u_xxt = u_txx, which implies that the solution has to be differentiable sufficiently many times with respect to x and t. We will state some results on the effect of smoothness of the solution on the order of the error in the next section.
Continuing with the derivation of a higher-order scheme, we use (12.42) to write the second term on the r.h.s. of (12.41) as

    ( (kappa/2) u_tt - (h^2/12) u_xxxx ) = ( kappa/2 - h^2/12 ) u_xxxx .    (12.43)

Thus, if one chooses

    kappa = h^2 / 6 ,   or   r = 1/6 ,    (12.44)

then the term (12.43) vanishes identically. Then the r.h.s. of (12.41) becomes O(kappa^2 + h^4) = O(h^4) (or O(kappa^2)), since kappa and h^2 are related by (12.44). Thus, scheme (12.12) with r = 1/6 has the error O(kappa^2) = O(h^4); it is sometimes called the Douglas method.
12.5 Effect of smoothness of initial condition (12.2) on accuracy of scheme (12.12)

As has been noted after Eq. (12.42), the order of the truncation error of the numerical scheme depends on the smoothness of the solution, which, in its turn, is determined by the smoothness of the initial and boundary data. Below we give a corresponding result, whose proof may be found in Sec. 1.7 of the book by Richtmyer and Morton, mentioned a couple of pages back.
Consider the IBVP (12.1)-(12.3) with constant boundary conditions (g_0(t) = const and g_1(t) = const). Let the initial condition u_0(x) have (p - 1) continuous derivatives, while its pth derivative is discontinuous but bounded. Then for scheme (12.12) with r <= 1/2 and r != 1/6, there hold the following conservative estimates for the error of the numerical solution:

    ||eps^n|| =  O(kappa^{p/4}) = O(h^{p/2}),              for 1 <= p <= 3;
                 O(kappa |ln kappa|) = O(h^2 |ln h|),      for p = 4;
                 O(kappa) = O(h^2),                        for p > 4.        (12.45)

For the Douglas method (i.e. scheme (12.12) with r = 1/6), the analogous error estimates are:

    ||eps^n|| =  O(kappa^{p/3}) = O(h^{2p/3}),             for 1 <= p <= 5;
                 O(kappa^2 |ln kappa|) = O(h^4 |ln h|),    for p = 6;
                 O(kappa^2) = O(h^4),                      for p > 6.        (12.46)
Let us emphasize that these estimates are very conservative and, according to Richtmyer and Morton, more precise estimates can be obtained, which would show that the error tends to zero with kappa and h faster than predicted by (12.45) and (12.46). These estimates, however, do show two important trends, namely:
(i) If the initial condition is not sufficiently smooth, the numerical error will tend to zero slower than for a smooth initial condition. In other words, the full potential of a scheme in regards to its accuracy can be utilized only for sufficiently smooth initial data; see the last lines in (12.45) and (12.46).
(ii) The higher the (formally derived) order of the truncation error, the smoother the initial condition needs to be for the numerical solution to actually achieve that order of accuracy.
It appears likely that similar statements also hold for boundary conditions; we will not, however, consider that issue.
Finally, let us mention that there is one more important trend in regards to the accuracy of numerical schemes, which estimates (12.45) and (12.46) do not illustrate. Namely, the accuracy of a scheme depends also on how close the parameter r is to the stability threshold (which is 1/2 for scheme (12.12)). Intuitively, the reason for this dependence can be understood as follows. Note that when r is at the stability threshold, there is a mode that does not decay, because for it, the amplification factor satisfies |lambda| = 1 (lambda was introduced before Eq. (12.31)). According to the end of Sec. 12.3, such a mode for scheme (12.12) is the highest-frequency mode with beta = pi/h = M pi. It is intuitively clear that any jagged initial condition will contain such a mode and modes with similar values of beta (i.e. beta = (M-1) pi, (M-2) pi, etc.). For those modes, |lambda| will be just slightly less than 1, and hence they will decay very slowly, thereby lowering the accuracy of the scheme. On the contrary, when r is, say, 0.4, i.e. less than the threshold by a finite amount, then all modes will decay at a finite rate, and the accuracy of the scheme is expected to be higher than for r = 0.5. In a homework problem, you will be asked to use a model initial datum to explore the effect of its smoothness, as well as the effect of the proximity of r to the stability threshold, on the accuracy of scheme (12.12).
12.6 Questions for self-assessment

1. State the minimax principle and provide its intuitive interpretation. When can this principle be useful?
2. Obtain (12.12).
3. State the Lax Equivalence Theorem and provide a justification for it, based on (12.14).
4. Make sure you can obtain (12.15) as explained in the text below that equation. Where are the boundary conditions (12.8) used in this derivation?
5. Make sure you can obtain (12.16) from (12.12).
6. Describe the idea behind Method 1 of stability analysis of the Heat equation.
7. What will happen to the solution of scheme (12.12) if condition (12.27) is not satisfied?
8. Describe the idea behind the von Neumann stability analysis.
9. Make sure you can obtain Eqs. (12.31) and (12.32).
10. Answer the QSA posed in Remark 2 after the description of the von Neumann stability analysis.
11. Describe advantages and disadvantages of the von Neumann method relative to Method 1.
12. What piece of information would be required to turn a von Neumann-like analysis from a necessary to a sufficient condition of stability?
13. Which harmonics are most dangerous from the point of view of making scheme (12.12) unstable? How would you proceed to answer this question for an arbitrary numerical scheme?
14. Make sure you can follow the derivation of (12.42).
15. Can you recall a counterpart of the Douglas method for ODEs?
16. Which factors affect the accuracy of a numerical scheme?
13 Implicit methods for the Heat equation

13.1 Derivation of the Crank-Nicolson scheme

We continue studying numerical methods for the IBVP (12.1)-(12.3). In Sec. 12.3, we have seen that the Heat equation (12.1) could be represented as a coupled system of ODEs (12.15). Moreover, in one of the problems in Homework #12, you were asked whether that system was stiff (and the answer was "yes, it is stiff"). As we remember from Lecture 5, the way to deal with stiff systems is by using implicit methods, which may be constructed so as to be unconditionally stable irrespective of the step size in the evolution variable. In Lecture 4, we stated that the highest order that an unconditionally stable method can have is 2 (that is, in our current notations, the error can tend to zero no faster than O(kappa^2)). From Lecture 4 we also recall the particular example of an implicit, unconditionally stable method of order 2: this is the modified implicit Euler method (3.45). In our current notations, it is:

    U^{n+1} = U^n + (kappa/2) [ f(t_n, U^n) + f(t_{n+1}, U^{n+1}) ] .    (13.1)

In the case of system (12.15), the form of the (vector) function f is:

    f_m(U^n) = ( U^n_{m+1} - 2 U^n_m + U^n_{m-1} ) / h^2 .    (13.2)

Since the operator on the r.h.s. of the above equation will appear very frequently in the remainder of this course, we introduce a special notation for it:

    ( U_{m+1} - 2 U_m + U_{m-1} ) / h^2 = (1/h^2) delta^2_x U_m .    (13.3)

Similarly, we denote

    U^{n+1} - U^n = delta_t U^n .    (13.4)
Then Eq. (13.1) with f given by (13.2) takes on the form:

    ( U^{n+1}_m - U^n_m ) / kappa = (1/2) [ ( U^n_{m+1} - 2 U^n_m + U^n_{m-1} ) / h^2 + ( U^{n+1}_{m+1} - 2 U^{n+1}_m + U^{n+1}_{m-1} ) / h^2 ] ,    (13.5)

or, in the above shorthand notations,

    (1/kappa) delta_t U^n_m = ( 1/(2h^2) ) [ delta^2_x U^n_m + delta^2_x U^{n+1}_m ] .    (13.6)

The finite-difference equation (13.5) can be rewritten as

    U^{n+1}_m - (r/2) ( U^{n+1}_{m+1} - 2 U^{n+1}_m + U^{n+1}_{m-1} ) = U^n_m + (r/2) ( U^n_{m+1} - 2 U^n_m + U^n_{m-1} ) ;    (13.7)

and correspondingly, Eq. (13.6), as

    ( 1 - (r/2) delta^2_x ) U^{n+1}_m = ( 1 + (r/2) delta^2_x ) U^n_m ,   m = 1, ..., M-1 ,    (13.8)

where

    r = kappa / h^2 .    (12.13)
Scheme (13.7) (or, equivalently, (13.8)) is called the Crank-Nicolson (CN) method. Its stencil is shown on the right.

[Stencil of the CN method: nodes (m-1, n), (m, n), (m+1, n) at level n and (m-1, n+1), (m, n+1), (m+1, n+1) at level n+1.]

Both from the stencil and from the defining equations one can see that U^{n+1}_m cannot be determined in isolation. Rather, one has to determine U on the entire (n+1)th time level. Using our standard notation for the solution vector,

    U^n = [ U^n_1, U^n_2, ..., U^n_{M-1} ]^T ,

we rewrite Eq. (13.8) in the vector form:

    ( I - (r/2) A ) U^{n+1} = ( I + (r/2) A ) U^n + b ,    (13.9)

where I is the unit matrix and

    A = ( -2   1                 )
        (  1  -2    1            )
        (      ...  ...   ...    )
        (           1    -2    1 )
        (                 1   -2 )

and

    b = (r/2) [ U^n_0 + U^{n+1}_0,  0,  ...,  0,  U^n_M + U^{n+1}_M ]^T
      = (r/2) [ g_0(t_n) + g_0(t_{n+1}),  0,  ...,  0,  g_1(t_n) + g_1(t_{n+1}) ]^T .    (13.10)

Thus, to find U^{n+1}, we need to solve a tridiagonal linear system, which we can do by the Thomas algorithm of Lecture 8, using only O(M) operations.
Above, we have derived the CN scheme using the analogy with the modified implicit Euler method for ODEs. This analogy allows us to expect that the two key features of the latter method, namely the unconditional stability and the second-order accuracy, are inherited by the CN method. Below we show that this is indeed the case.
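For concreteness, here is a minimal sketch of one CN time step (Python/NumPy, for illustration only). The Thomas algorithm is written out explicitly rather than called from a library, and homogeneous boundary conditions are assumed, so that b = 0 in (13.9); the function names are ours.

    import numpy as np

    def thomas(a, b, c, d):
        """Solve a tridiagonal system with subdiagonal a, diagonal b,
        superdiagonal c, and right-hand side d (all 1-D arrays)."""
        n = len(d)
        cp, dp = np.zeros(n), np.zeros(n)
        cp[0], dp[0] = c[0]/b[0], d[0]/b[0]
        for i in range(1, n):
            m = b[i] - a[i]*cp[i-1]
            cp[i] = c[i]/m
            dp[i] = (d[i] - a[i]*dp[i-1])/m
        x = np.zeros(n)
        x[-1] = dp[-1]
        for i in range(n-2, -1, -1):
            x[i] = dp[i] - cp[i]*x[i+1]
        return x

    def cn_step(U, r):
        """One Crank-Nicolson step (13.7) for the interior unknowns U[1..M-1],
        with U[0] = U[M] = 0 (homogeneous boundary conditions, b = 0)."""
        M = len(U) - 1
        # right-hand side: (I + (r/2)A) U^n, cf. (13.9)
        rhs = U[1:M] + 0.5*r*(U[2:M+1] - 2*U[1:M] + U[0:M-1])
        a = -0.5*r*np.ones(M-1); a[0] = 0.0     # subdiagonal of I - (r/2)A
        c = -0.5*r*np.ones(M-1); c[-1] = 0.0    # superdiagonal of I - (r/2)A
        b = (1.0 + r)*np.ones(M-1)              # main diagonal of I - (r/2)A
        Unew = np.zeros_like(U)
        Unew[1:M] = thomas(a, b, c, rhs)
        return Unew

Note that the cost per step is indeed O(M) operations, as stated above for the Thomas algorithm.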
13.2 Truncation error of the Crank-Nicolson method

The easiest (and, probably, quite instructive) way to derive the truncation error of the CN method is to use the following observation. Note that the stencil for this method is symmetric relative to the virtual node (mh, (n + 1/2) kappa), marked by a cross in the figure on the right. This motivates one to expand quantities U^n_m etc. about that virtual node. Let us denote the value of the solution at that point by U-bar:

    U-bar = u( mh, (n + 1/2) kappa ) .

[Stencil of the CN method with the virtual node at (mh, (n + 1/2) kappa) marked by a cross between the two time levels.]

Then, using the Taylor expansion of a function of two variables (see Lecture 0), we obtain:
for alpha = -1, 0, or 1:

    U^{n+1}_{m+alpha} = U-bar_m + [ (kappa/2) U-bar_{m,t} + alpha h U-bar_{m,x} ]
        + (1/2!) [ (kappa/2)^2 U-bar_{m,tt} + 2 (kappa/2)(alpha h) U-bar_{m,xt} + (alpha h)^2 U-bar_{m,xx} ]
        + (1/3!) [ (kappa/2)^3 U-bar_{m,ttt} + 3 (kappa/2)^2 (alpha h) U-bar_{m,xtt} + 3 (kappa/2)(alpha h)^2 U-bar_{m,xxt} + (alpha h)^3 U-bar_{m,xxx} ]
        + O( kappa^4 + kappa^3 h + kappa^2 h^2 + kappa h^3 + h^4 ) ,    (13.11)

where U-bar_{m,t} = d_t U-bar_m evaluated at t = (n + 1/2) kappa, etc. Similarly,

    U^n_{m+alpha} = U-bar_m + [ -(kappa/2) U-bar_{m,t} + alpha h U-bar_{m,x} ]
        + (1/2!) [ (kappa/2)^2 U-bar_{m,tt} - 2 (kappa/2)(alpha h) U-bar_{m,xt} + (alpha h)^2 U-bar_{m,xx} ]
        + (1/3!) [ -(kappa/2)^3 U-bar_{m,ttt} + 3 (kappa/2)^2 (alpha h) U-bar_{m,xtt} - 3 (kappa/2)(alpha h)^2 U-bar_{m,xxt} + (alpha h)^3 U-bar_{m,xxx} ]
        + O( kappa^4 + kappa^3 h + kappa^2 h^2 + kappa h^3 + h^4 ) .    (13.12)
In a homework problem you will be asked to provide details of the following derivation. Namely, substituting expressions (13.11) and (13.12) with alpha = 0 into the l.h.s. of (13.6), one obtains:

    (1/kappa) delta_t U^n_m = U-bar_{m,t} + O(kappa^2) .    (13.13)

Next, substituting expressions (13.11) and (13.12) into the r.h.s. of (13.5), one obtains:

    ( 1/(2h^2) ) [ delta^2_x U^n_m + delta^2_x U^{n+1}_m ] = U-bar_{m,xx} + O(h^2 + kappa^2) .    (13.14)

Finally, combining the last two equations yields

    (1/kappa) delta_t U^n_m - ( 1/(2h^2) ) [ delta^2_x U^n_m + delta^2_x U^{n+1}_m ] = U-bar_{m,t} - U-bar_{m,xx} + O(kappa^2 + h^2) ,    (13.15)

which means that the CN scheme (13.6) is second-order accurate in time.

Remark  Note that the notation O(kappa^2 + h^2) for the truncation error of the CN method does not necessarily imply that the sizes of kappa and h may be taken to be about the same in practice. One of the homework problems explores this issue in detail.
Let us now consider the following obvious generalization of the modified implicit Euler scheme:

    U^{n+1} = U^n + kappa [ (1 - theta) f(t_n, U^n) + theta f(t_{n+1}, U^{n+1}) ] ,    (13.16)

where the constant theta lies in [0, 1]. The corresponding method for the Heat equation is, instead of (13.9):

    ( I - r theta A ) U^{n+1} = ( I + r (1 - theta) A ) U^n + b .    (13.17)

Obviously, when:
theta = 1/2, (13.17) is the CN method;
theta = 0, (13.17) is the simple explicit method (12.12);
theta = 1, (13.17) is an analogue of the simple implicit Euler method for the Heat equation; its stencil is shown on the right.
We will refer to methods (13.17) with all possible values of theta as the theta-family of methods.

[Stencil of the theta = 1 (fully implicit) method: nodes (m-1, n+1), (m, n+1), (m+1, n+1) and (m, n).]

Following the derivation of the Douglas method in Sec. 12.4 and of Eq. (13.15) above, one can show that the truncation error of the theta-family of methods is

    truncation error of (13.17) = [ ( 1/2 - theta ) kappa - h^2/12 ] u_xxxx + O(kappa^2 + h^4) .    (13.18)

Then it follows that in addition to the special value theta = 1/2, which gives the second-order accurate in time CN method, there is another special value of theta:

    theta = 1/2 - 1/(12 r) .    (13.19)

When theta is given by the above formula, the first term on the r.h.s. of (13.18) vanishes, and the truncation error becomes O(kappa^2 + h^4). The corresponding scheme is called Crandall's method, or the optimal method. For other values of theta, the truncation error of (13.17) is only O(kappa + h^2).
13.3 Stability of the theta-family of methods

Here we use Method 1 of Lecture 12 to study stability of scheme (13.17). In a homework problem, you will be asked to obtain the same results using the von Neumann stability analysis (Method 2 of Lecture 12).
Before we begin with the analysis, let us recall the idea of Method 1. First, since the PDE we deal with in this Lecture is linear, both the solution and the error of the numerical scheme satisfy the difference equation with the same homogeneous part: namely, in this case, Eq. (13.17) with b = 0. Therefore, to establish the condition for stability of a numerical scheme, we consider only its homogeneous part, because inhomogeneous terms (i.e., b in (13.17)) do not alter the stability properties. Next, with a difference scheme written in the matrix form as

    U^{n+1} = M U^n ,

one needs to determine whether the magnitude of any eigenvalue of matrix M exceeds 1. The following possibilities can occur.
If at least one eigenvalue lambda_j exists such that |lambda_j| > 1, then the scheme is unstable, because small errors aligned along the corresponding eigenvector(s) will grow.
At most one eigenvalue of M satisfies |lambda_j| = 1, with all the other eigenvalues being strictly less than 1 in magnitude. Then the scheme is stable.
There are several eigenvalues satisfying |lambda_{j_1}| = ... = |lambda_{j_J}| = 1, with the other eigenvalues being strictly less than 1 in magnitude. Moreover, the eigenvectors corresponding to lambda_{j_1}, ..., lambda_{j_J} are all distinct^24. Then the numerical scheme is also stable. If, however, some of the aforementioned eigenvectors coincide (this may, although does not always have to, happen when some lambda_{j_k} is a double eigenvalue), then the scheme is unstable. In the latter case, the errors will grow, although very slowly.

U
n+1
= (I rA)
1
(I + r(1 )A)

U
n
. (13.20)
Now recall from Linear Algebra that matrices A, aA + bI, and (cA + dI)
1
have the same
eigenvectors, with the corresponding eigenvalues being , a + b, and (c + d)
1
. Then, since
the eigenvectors of (I + r(1 )A) and (I rA)
1
are the same, the eigenvalues of the matrix
appearing on the r.h.s. of (13.20) are easily found to be
1 + r(1 )
j
1 r
j
, (13.21)
where
j
are the eigenvalues of A. (You will be asked to conrm this in a homework problem.)
According to the Lemma of Lecture 12,

j
= 2 + 2 cos
j
M
= 4 sin
2
_
j
2M
_
, j = 1, . . . , M. (13.22)
As pointed out above, for stability of the scheme, it is necessary that

1 + r(1 )
j
1 r
j

1, j = 1, . . . , M. (13.23)
Using (13.22) and denoting

j

j
2M
,
one rewrites (13.23) as

1 4r(1 ) sin
2

1 + 4r sin
2

. (13.24)
Since 0 by assumption and r > 0 by denition, then the expression under the absolute
value sign on the r.h.s. of (13.24) is positive, and therefore the above inequality can be written
as

_
1 + 4r sin
2

j
_
1 4r(1 ) sin
2

j
1 + 4r sin
2

j
. (13.25)
The right part of this double inequality is automatically satised for all
j
. The left part is
satised when
4r(1 2) sin
2

j
2 . (13.26)
The strongest restriction on r (and hence on the step size in time) occurs when sin
2

j
assumes
its largest value, i.e. 1. In that case, (13.26) yields
(1 2)r
1
2
. (13.27)
24
From Linear Algebra, we recall that a sucient, but not necessary, condition for those eigenvectors to be
distinct is that all the
j
1
, ...
j
J
are distinct.
This inequality should be considered separately in two cases:

    1/2 <= theta <= 1    =>    r is arbitrary.    (13.28)

That is, scheme (13.17) is unconditionally stable for any r.

    0 <= theta < 1/2    =>    Scheme (13.17) is stable provided that

    r <= 1 / ( 2 (1 - 2 theta) ) .    (13.29)

The above results show that the CN method, as well as the purely implicit method (13.17) with theta = 1, are unconditionally stable. That is, no relation between kappa and h must hold in order for these schemes to converge to the exact solution of the Heat equation. This is the main advantage of the CN method over the explicit methods of Lecture 12.
Let us also note that Crandall's method (13.17) + (13.19) belongs to the conditionally stable case, (13.29). However, since for Crandall's method,

    r = 1 / ( 6 (1 - 2 theta) ) < 1 / ( 2 (1 - 2 theta) ) ,    (13.30)

then, according to (13.29), this method is stable.
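One can confirm (13.28) and (13.29) numerically by evaluating the growth factor (13.21) with the eigenvalues (13.22). The sketch below (Python/NumPy, purely illustrative parameters) does this for theta = 0, 1/2, and 1; only the explicit case theta = 0 with r > 1/2 produces a factor exceeding 1 in magnitude.

    import numpy as np

    M = 40
    lam = -4*np.sin(np.arange(1, M)*np.pi/(2*M))**2          # eigenvalues (13.22) of A
    for theta in (0.0, 0.5, 1.0):
        for r in (0.4, 0.6, 5.0):
            g = (1 + r*(1 - theta)*lam) / (1 - r*theta*lam)  # growth factor (13.21)
            print(theta, r, np.max(np.abs(g)))               # exceeds 1 only for theta = 0, r > 1/2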
13.4 Ways to improve on the accuracy of the Crank-Nicolson method

To improve on the accuracy of the CN method in time, one may use higher-order multi-step methods or implicit Runge-Kutta methods. Recall, however, that no method of order higher than 2 is absolutely stable, and therefore any scheme that one may expect to obtain along these lines will be (at best) only conditionally stable. Note also that the stencil for a multi-step generalization of the CN scheme will contain nodes on more than two time levels.
To improve the accuracy of the CN method in space, one can use the analogy with Numerov's method. Namely, we rewrite the Heat equation u_xx = u_t as

    ( delta^2_x U^n_m + delta^2_x U^{n+1}_m ) / (2h^2) = (1/12) [ ( delta_t U^n_{m+1} ) / kappa + 10 ( delta_t U^n_m ) / kappa + ( delta_t U^n_{m-1} ) / kappa ] .    (13.31)

However, one can verify that the resulting scheme is nothing but Crandall's method!
Finally, we also mention a method attributed to DuFort and Frankel:

    ( U^{n+1}_m - U^{n-1}_m ) / (2 kappa) = (1/h^2) [ U^n_{m+1} - ( U^{n+1}_m + U^{n-1}_m ) + U^n_{m-1} ] .    (13.32)

This method has the truncation error of

    O( kappa^2 + h^2 + (kappa/h)^2 ) ,

which means that it is consistent with the Heat equation only when (kappa/h) -> 0. When (kappa/h) -> const != 0, the DuFort-Frankel method approximates a different, hyperbolic, PDE. Although the DuFort-Frankel method can be shown to be unconditionally stable, it is not used for solution of parabolic PDEs because of the aforementioned need to have kappa << h in order to provide a scheme consistent with the PDE.
13.5 Questions for self-assessment

1. Why does one want to use an implicit method to solve the Heat equation?
2. Make sure you can obtain (13.7), and from it, (13.8) and (13.9).
3. How many (order of magnitude) operations are required to advance the solution of the Heat equation from one time level to the next using the CN method? Is the CN a time-efficient method?
4. Make sure you can obtain (13.11) and (13.12).
5. What is the order of the truncation error of the CN method?
6. Make sure you can obtain (13.17).
7. Why do you think Crandall's method is called the optimal method?
8. Verify the first equality in (13.22).
9. What is the significance of the inequality (13.27)?
10. What is the main advantage of the CN method over the simple explicit method of Lecture 12?
11. Do you think Crandall's method has the same advantage? Also, explain the origin of each part of formula (13.30).
12. Is it possible to derive an unconditionally stable method with accuracy O(kappa^3) (or better) for the Heat equation? If yes, then how? If no, why?
13. Draw the stencil for the DuFort-Frankel method.
14. Do you think the DuFort-Frankel method is time-efficient?
14 Generalizations of the simple Heat equation

In this Lecture, we will consider the following generalizations of the IBVP (12.1)-(12.3), based on the simple Heat equation:
- Derivative (Neumann and mixed-type) boundary conditions;
- The linear Heat equation with variable coefficients;
- Nonlinear parabolic equations.

14.1 Boundary conditions involving derivatives

Let us consider the modified IBVP (12.1)-(12.3) where the only modification concerns the boundary condition at x = 0:

    u_t = u_xx ,                              0 < x < 1,  t > 0 ;    (14.1)
    u(x, t = 0) = u_0(x),                     0 <= x <= 1 ;          (14.2)
    u_x(0, t) + p(t) u(0, t) = q(t),          t >= 0 ;               (14.3)
    u(1, t) = g_1(t),                         t >= 0 .               (14.4)

The boundary condition involving the derivative can be handled by either of the two methods described in Section 8.4 for one-dimensional BVPs. Below we will describe in detail how the first of those methods can be applied to the Heat equation. We will proceed in two steps, whereby we will first consider a modification of the simple explicit scheme (12.12) and then a modification of the Crank-Nicolson method (13.8) for the boundary condition (14.3).
Modification of the simple explicit scheme (12.12)

For n = 0, i.e. for t = 0, U^0_m, m = 0, 1, ..., M-1, M, are given by the initial condition (14.2). Then, discretizing (14.3) with the second order of accuracy in x as

    ( U^0_1 - U^0_{-1} ) / (2h) + p^0 U^0_0 = q^0 ,    (14.5)

one immediately finds U^0_{-1} (because p^0 = p(0) and q^0 = q(0) are given by the boundary condition (14.3)). Thus, at the time level n = 0, one knows U^0_m, m = -1, 0, 1, ..., M-1, M.
For n = 1, we first determine U^1_m for m = 0, 1, ..., M-1 as prescribed by the scheme:

    U^1_m = U^0_m + r ( U^0_{m-1} - 2 U^0_m + U^0_{m+1} ) .    (14.6)

(Note that the value U^0_{-1} is used to determine the value of U^1_0.) Having thus found U^1_0 and U^1_1, we next find U^1_{-1} from the equation analogous to (14.5):

    ( U^1_1 - U^1_{-1} ) / (2h) + p^1 U^1_0 = q^1 .    (14.7)

Finally, U^1_M is given by the boundary condition (14.4).
For n >= 2, the above steps are repeated.

Remark  We used the second-order accurate approximation for u_x in (14.5) and its counterparts for n > 0 because we wanted the order of the error at the boundary to be consistent with the order of the error of the scheme, which is O(h^2).
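A sketch of one step of the procedure just described is given below (Python/NumPy; the functions p(t), q(t), g1(t) are the boundary data of (14.3)-(14.4), passed in as callables, and the names are ours). It shows how the fictitious value U_{-1} is carried along via the discretized boundary condition (14.5)/(14.7).

    import numpy as np

    def explicit_step_mixed_bc(U, Um1, r, h, p, q, g1, t_new):
        """One step of scheme (12.12) for u_t = u_xx with the mixed boundary
        condition (14.3) at x = 0 (handled through the ghost value Um1 = U_{-1})
        and the Dirichlet condition (14.4) at x = 1."""
        M = len(U) - 1
        Unew = np.empty_like(U)
        # Eq. (14.6) at m = 0 uses the ghost value U_{-1}:
        Unew[0] = U[0] + r*(Um1 - 2*U[0] + U[1])
        Unew[1:M] = U[1:M] + r*(U[0:M-1] - 2*U[1:M] + U[2:M+1])
        Unew[M] = g1(t_new)                               # boundary condition (14.4)
        # new ghost value from the discretized boundary condition (14.7):
        Um1_new = Unew[1] - 2*h*(q(t_new) - p(t_new)*Unew[0])
        return Unew, Um1_new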
Modification of the Crank-Nicolson scheme (13.8)

For n = 0, one finds U^0_{-1} from Eq. (14.5).
For n = 1, one has,
from the boundary condition (14.3):

    ( U^1_1 - U^1_{-1} ) / (2h) + p^1 U^1_0 = q^1 ;    (14.7)

from the scheme (13.7):

    U^1_m - (r/2) ( U^1_{m-1} - 2 U^1_m + U^1_{m+1} ) = U^0_m + (r/2) ( U^0_{m-1} - 2 U^0_m + U^0_{m+1} ) ,   m = 0, 1, ..., M-1 .    (14.8)

Equations (14.7) and (14.8) yield M + 1 equations for the M + 1 unknowns U^1_{-1}, U^1_0, U^1_1, ..., U^1_{M-1}. This system of linear equations can, in principle, be solved. However, as we know from Sec. 8.4 (see Remark 2 there), the coefficient matrix in such a system will not be tridiagonal, which would preclude a straightforward application of the time-efficient Thomas algorithm. The way around that problem was also indicated in the aforementioned Remark. Namely, one needs to eliminate U^1_{-1} from (14.7) and the Eq. (14.8) with m = 0. For example, we can solve (14.7) for U^1_{-1} and substitute the result into Eq. (14.8) with m = 0. This yields:
    U^1_0 - (r/2) [ ( U^1_1 - 2h ( q^1 - p^1 U^1_0 ) ) - 2 U^1_0 + U^1_1 ]
        = U^0_0 + (r/2) [ ( U^0_1 - 2h ( q^0 - p^0 U^0_0 ) ) - 2 U^0_0 + U^0_1 ] ,    (14.9)

where on the r.h.s. we have also used (14.5). Upon simplifying the above equation, one can write the linear system for the vector

    U^n = [ U^n_0, U^n_1, ..., U^n_{M-1} ]^T ,   n = 0 or 1,

in the form:

    A U^1 = B U^0 + b ,    (14.10)

where

    A = ( 1 + r(1 - h p^1)    -r                                )
        (      -r/2          1+r    -r/2                        )
        (        0           -r/2   1+r    -r/2                 )    (14.11)
        (                            ...     ...      ...       )
        (                           -r/2    1+r      -r/2       )
        (                                   -r/2      1+r       )

and

    B = ( 1 - r(1 - h p^0)     r                                )
        (       r/2          1-r     r/2                        )
        (        0            r/2   1-r     r/2                 )
        (                            ...     ...      ...       )
        (                            r/2    1-r       r/2       )
        (                                    r/2      1-r       ) ,

    b = [ -r h ( q^0 + q^1 ),  0,  0,  ...,  0,  (r/2) ( g^0_1 + g^1_1 ) ]^T ,   where g^n_1 = g_1(t_n).    (14.12)

System (14.10) with the tridiagonal matrix A given by (14.11) can now be efficiently solved by the Thomas algorithm.
For n >= 2, the above step is repeated.
14.2 Linear parabolic PDEs with variable coefficients

Generalization of the explicit scheme (12.12) to such PDEs is straightforward. For example, if instead of the Heat equation (14.1) we have a PDE

    u_t = a(x, t) u_xx ,    (14.13)

then we use the following obvious discretization:

    a(x, t) u_xx   ->   a^n_m ( delta^2_x U^n_m ) / h^2 .    (14.14)

For the CN method, only slightly more effort is required. Note that the main concern here is to maintain the O(kappa^2 + h^2) accuracy of the method. Maintaining this accuracy is achieved by using the (well-known to you by now) fact that

    ( f(X + H) - f(X - H) ) / (2H) = f'(X) + O(H^2) ,
    or, equivalently,
    ( f(X + H) - f(X) ) / H = f'( X + H/2 ) + O(H^2) ,    (14.15a)

where f(X) is any sufficiently smooth function, and X can stand for either x or t (then H stands for either h or kappa, respectively). Similarly, using the Taylor expansion, you will be asked in a QSA to show that

    ( f(X + H) + f(X) ) / 2 = f( X + H/2 ) + O(H^2) .    (14.15b)

In other words, we can use values f(X) and f(X + H) to approximate the values of the function and its derivative at (X + H/2), the midpoint between X and X + H, with accuracy O(H^2). Using the idea expressed by (14.15), the schemes that we will list below can be shown to have the required accuracy of O(kappa^2 + h^2).
For the PDE

    u_t = a(x, t) u_xx + b(x, t) u_x + c(x, t) u ,    (14.16)

we discretize the terms in a rather obvious way:

    u_t            ->   (1/kappa) delta_t U^n_m ,
    a(x, t) u_xx   ->   ( 1/(2h^2) ) [ a^n_m delta^2_x U^n_m + a^{n+1}_m delta^2_x U^{n+1}_m ] ,
    b(x, t) u_x    ->   ( 1/(4h) ) [ b^n_m ( U^n_{m+1} - U^n_{m-1} ) + b^{n+1}_m ( U^{n+1}_{m+1} - U^{n+1}_{m-1} ) ] ,
    c(x, t) u      ->   (1/2) [ c^n_m U^n_m + c^{n+1}_m U^{n+1}_m ] .    (14.17)

Let us explain the origin of the expressions on the r.h.s.es of the first and third lines above. The term on the first line approximates u_t with accuracy O(kappa^2) at the virtual node (mh, (n + 1/2) kappa); this is just a straightforward corollary of the second line of (14.15a). The term on the third line has two parts. The first part (with 1/(2h) factored into it) approximates b u_x with accuracy O(h^2) at the node (mh, n kappa); this is just a straightforward corollary of the first line of (14.15a). Similarly, the second part approximates b u_x with accuracy O(h^2) at the node (mh, (n + 1) kappa). Hence the average of these two parts approximates b u_x with accuracy O(kappa^2 + h^2) at the virtual node (mh, (n + 1/2) kappa); this is a straightforward corollary of (14.15b). (If you still have difficulty following these explanations, draw the stencil for the CN method and then draw all the nodes mentioned above.)
Often, the PDE arises in a physical problem in the form

    rho(x, t) u_t = ( sigma(x, t) u_x )_x + gamma(x, t) u .    (14.18)

Instead of manipulating the terms so as to transform this to the form of (14.16) and then use the discretization (14.17), one can discretize (14.18) directly (here delta_x U_m = U_{m+1} - U_m denotes the one-sided first difference):

    rho(x, t) u_t        ->   (1/2) ( rho^n_m + rho^{n+1}_m ) (1/kappa) delta_t U^n_m ,   or   rho^{n+1/2}_m (1/kappa) delta_t U^n_m ,

    ( sigma(x, t) u_x )_x ->  ( 1/(2h) ) [ sigma^n_{m+1/2} ( delta_x U^n_m ) / h - sigma^n_{m-1/2} ( delta_x U^n_{m-1} ) / h ]
                            + ( 1/(2h) ) [ sigma^{n+1}_{m+1/2} ( delta_x U^{n+1}_m ) / h - sigma^{n+1}_{m-1/2} ( delta_x U^{n+1}_{m-1} ) / h ] ,

    gamma(x, t) u        ->   (1/2) [ gamma^n_m U^n_m + gamma^{n+1}_m U^{n+1}_m ] .    (14.19)

Here we only explain the term on the r.h.s. of the second line, since the other two discretizations are analogous to those presented in (14.17). The first term in the first parentheses approximates sigma u_x with accuracy O(h^2) at the virtual node ((m + 1/2) h, n kappa); this is a corollary of the second line of (14.15a). Similarly, the second term in the first parentheses approximates sigma u_x with accuracy O(h^2) at the virtual node ((m - 1/2) h, n kappa). Consequently, the entire expression in the first parentheses, with 1/h factored into it, approximates (sigma u_x)_x with accuracy O(h^2) at the node (mh, n kappa); this is a corollary of the first line of (14.15a). Finally, the entire expression on the r.h.s. of the second line of (14.19) approximates (sigma u_x)_x with accuracy O(kappa^2 + h^2) at the virtual node (mh, (n + 1/2) kappa).
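The essential new ingredient in (14.19) is the evaluation of the coefficient at the half-integer (virtual) nodes. The short sketch below (Python/NumPy; the coefficient sigma(x, t) is supplied as a vectorized function, and the function name is ours) assembles that spatial operator, which can then be applied on both time levels n and n+1 of a CN step.

    import numpy as np

    def flux_form_operator(U, sigma, x, t, h):
        """Second-order approximation of (sigma(x,t) u_x)_x at the interior
        nodes m = 1, ..., M-1, using sigma at the midpoints (m +- 1/2)h
        as in the second line of (14.19)."""
        s_half = sigma(0.5*(x[:-1] + x[1:]), t)        # sigma at (m + 1/2)h, m = 0, ..., M-1
        flux = s_half * (U[1:] - U[:-1]) / h           # sigma_{m+1/2} * (delta_x U_m) / h
        return (flux[1:] - flux[:-1]) / h              # difference of fluxes, at m = 1, ..., M-1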
14.3 Von Neumann stability analysis for PDEs with variable coefficients

Let us recall that the idea of the von Neumann analysis was to expand the error of the PDE with constant coefficients into a set of exponentials lambda^n exp(i beta x) = lambda^n exp(i beta m h), each of which exactly satisfies the discretized PDE for a certain lambda. Note also that for both the simple explicit scheme (12.12) and the modified Euler-like scheme considered in Problem 4 for Homework #12, the harmonics exp(i beta m h) that would first become unstable, should the stability condition for the scheme be violated, are those with the largest spatial frequency, i.e. with beta = pi/h (see the figure at the end of Sec. 12.3). The same appears to be true for most other conditionally stable schemes.
Now let us consider the PDE (14.13) (or either of (14.16) and (14.18)) where the coefficient(s) does (do) not vary too rapidly. Then such a coefficient can be considered to be almost constant in comparison with the highest-frequency harmonic that can potentially cause the instability. This simple consideration suggests that for PDEs with sufficiently smooth coefficients, the von Neumann analysis can be carried out without any changes, while assuming that at each point in space and time the coefficients are constant.
For example, the stability criterion for the simple explicit method applied to (14.13) becomes

    r <= 1 / ( 2 a(x, t) ) .    (14.20)

This can be interpreted in the following two different ways.
(i) If the programmer decides to use constant values for kappa and h, and hence r, over the entire grid, then he/she should ensure that

    r <= 1 / ( 2 max_{x,t} a(x, t) )    (14.21)

for the scheme to be stable.
(ii) If the programmer decides to vary the step sizes in x and/or t, then at every time level the step size is to be chosen so as to satisfy the condition

    r(t) <= 1 / ( 2 max_x a(x, t) ) .    (14.22)
Let us now point out another issue, unrelated to the above one, which, however, is also specific to PDEs (14.16) and (14.18) and could not occur for the simple Heat equation. Namely, note that (14.16) and (14.18) may have exponentially growing solutions, which the Heat equation (14.1) or (14.13) does not have. For example, Eq. (14.16), where each of $a$, $b$, and $c$ is constant, has a solution $u = \exp(ct)$. In such a case, when carrying out the von Neumann analysis, one should not require that $|\rho| \le 1$ for the stability of the scheme, because this would preclude obtaining the above exponentially growing solution. Instead, one should stipulate that the largest$^{25}$ value of $|\rho|$ satisfy (for the above example)
$$\max |\rho| = 1 + c\kappa + \text{smaller terms}, \qquad (14.23)$$
while all the other $\rho$'s must be strictly less than 1 in absolute value. Equation (14.23) allows the (largest) amplification factor corresponding to very low-frequency harmonics (i.e. those with $\beta \approx 0$) to be greater than 1 because of the true nature of the solution. If one does not include the $c\kappa$ term in the modified definition of stability, Eq. (14.23), then it would not be possible to find a range for $r$ where the scheme (12.12) could be stable.
For the above example of Eq. (14.16) with constant coefficients $a$, $b$, and $c$, the condition on $r$ based on this modified stability criterion can be shown, by a straightforward but somewhat lengthy calculation, to be
$$r \;\le\; \frac{2 + c\kappa - \frac12\, r^2 b^2 \kappa^2 h^4}{4a} \;\approx\; \frac{1}{2a}, \qquad (14.24)$$
i.e. the same as (14.20).
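For a quick numerical illustration of the modified criterion (14.23), one can evaluate the amplification factor of the simple explicit scheme applied to $u_t = a u_{xx} + cu$ (i.e. with $b = 0$), for which $\rho(\beta) = 1 - 4ar\sin^2(\beta h/2) + c\kappa$. The sketch below uses illustrative (assumed) parameter values and shows that the largest $|\rho|$ equals $1 + c\kappa$, attained as $\beta \to 0$:

% numerical check of (14.23) for u_t = a*u_xx + c*u with the simple explicit scheme
a = 1;  c = 2;  h = 0.02;
kappa = 0.4*h^2/(2*a);                       % r chosen below the bound (14.20)
r     = kappa/h^2;
beta  = linspace(0, pi/h, 1000);
rho   = 1 - 4*a*r*sin(beta*h/2).^2 + c*kappa;
[max(abs(rho)), 1 + c*kappa]                 % both entries coincide: max|rho| = 1 + c*kappa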
14.4 Nonlinear parabolic PDEs: I. Explicit schemes, and the Newton–Raphson method for implicit schemes

Explicit schemes for nonlinear parabolic PDEs can be constructed straightforwardly. For example, for the PDE
$$u_t = \big(u^2 u_x\big)_x, \qquad (14.25)$$
$^{25}$ if more than one value of $\rho$ for a given $\beta$ exists, as for a multi-level scheme
the simple explicit scheme is
$$\frac{\delta_t U^n_m}{\kappa} = \frac{1}{h^2}\left[\left(\frac{U^n_{m+1}+U^n_m}{2}\right)^{\!2}\big(U^n_{m+1}-U^n_m\big) - \left(\frac{U^n_m+U^n_{m-1}}{2}\right)^{\!2}\big(U^n_m-U^n_{m-1}\big)\right]. \qquad (14.26)$$
The von Neumann stability analysis can no longer be rigorously justified for (most) nonlinear PDEs, but it can be justified approximately, if one assumes that the solution $u(x,t)$ (and hence its numerical counterpart $U^n_m$) does not vary too rapidly. This is analogous to the condition on the coefficients of linear PDEs, mentioned in Sec. 14.3. Below we provide an intuitive explanation for this claim using (14.25) as a model problem, and then write down the stability criterion for that PDE.
Let us suppose that at a given moment $t$, we have obtained the solution $u(x,t)$ to (14.25). We now want to use (14.25) to advance this solution in time by a small step $\kappa$. It is reasonable to expect that the solution at $t+\kappa$ will be close to that at $t$:
$$u(x,t+\kappa) = u(x,t) + w(x,t;\kappa), \qquad\text{where } |w| \ll |u|. \qquad (14.27)$$
Since $w$ is small in the above sense, it must satisfy (14.25) linearized on the background of the initial solution $u(x,t)$:
$$w_t = \big(u^2 w_x\big)_x + \big(2uw\,u_x\big)_x = u^2 w_{xx} + 2(u^2)_x w_x + (u^2)_{xx} w, \qquad (14.28)$$
where we have omitted terms that are quadratic and cubic in $w$. In other words, to advance the solution of a nonlinear PDE in time by a small amount, we need to solve a linear PDE. Note that the linear PDE (14.28) has the form (14.16), where the role of the known coefficients $a(x,t)$ etc. is now played by the terms $u^2(x,t)$ etc. These terms are also known from the solution $u(x,t)$ at time $t$. Then the stability condition is given by (14.24), which for (14.28) takes on the form
$$r \;\le\; \frac{1}{2u^2(x,t)}. \qquad (14.29)$$
Condition (14.29) means that the step size $\kappa$ needs to be adjusted accordingly at each time level so as to maintain the stability of the scheme.
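Here is a minimal MATLAB sketch of one explicit step of (14.26) in which the time step is (re)chosen from (14.29) with a safety factor. The grid vector U (holding $U^n_m$ at the nodes $m = 0,\ldots,M$), the step h, the current time t, and the safety factor are illustrative assumptions; Dirichlet boundary values are assumed given and are left untouched.

% one step of (14.26) with kappa chosen adaptively from (14.29)
safety = 0.9;                                  % illustrative safety factor
kappa  = safety * h^2 / (2*max(U.^2));         % stability bound (14.29)
r      = kappa/h^2;
Uav_p  = (U(3:end)   + U(2:end-1))/2;          % (U_{m+1}+U_m)/2
Uav_m  = (U(2:end-1) + U(1:end-2))/2;          % (U_m+U_{m-1})/2
U(2:end-1) = U(2:end-1) + r*( Uav_p.^2 .* (U(3:end)-U(2:end-1)) ...
                            - Uav_m.^2 .* (U(2:end-1)-U(1:end-2)) );
t = t + kappa;                                 % advance the current time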
As far as implicit methods for nonlinear PDEs are concerned, there are quite a few possibilities in which such methods can be designed. Here we will discuss in detail an equivalent of the Newton–Raphson method considered in Lecture 8 and also mention a couple of other methods: an ad hoc method suitable for a certain class of nonlinear parabolic PDEs and an operator-splitting method. In the Appendix, we will briefly describe a group of methods known as Implicit–Explicit (IMEX) methods.

The main difficulty that one faces with the Newton–Raphson method is, similarly to Lecture 8, the need to solve systems of nonlinear algebraic equations to obtain the solution at the new time level. We will now discuss approaches to this problem using Eq. (14.25) as the model PDE.
To begin, we can use the following slight modification of scheme (14.19) for the PDE (14.18), where now $\beta = u^2$, $\gamma = 0$, and $\alpha = 1$:
$$u_t \;\approx\; \frac{\delta_t U^n_m}{\kappa};$$
$$\big(u^2 u_x\big)_x \;\approx\; \frac{1}{2h}\left[\frac{(U^n_m)^2+(U^n_{m+1})^2}{2}\,\frac{\delta_x U^n_m}{h} - \frac{(U^n_{m-1})^2+(U^n_m)^2}{2}\,\frac{\delta_x U^n_{m-1}}{h}\right] + \frac{1}{2h}\left[\frac{(U^{n+1}_m)^2+(U^{n+1}_{m+1})^2}{2}\,\frac{\delta_x U^{n+1}_m}{h} - \frac{(U^{n+1}_{m-1})^2+(U^{n+1}_m)^2}{2}\,\frac{\delta_x U^{n+1}_{m-1}}{h}\right]. \qquad (14.30)$$
Next, we substitute the discretized derivatives in (14.30) into Eq. (14.25). You will be asked to write down the resulting scheme in a homework problem. This scheme, which is just a nonlinear algebraic system of equations for $U^{n+1}_m$ with $m = 1, \ldots, M-1$, can be solved by any of the iterative methods of Sec. 8.6. We will show the details for the Newton–Raphson method. In fact, that method is essentially the linearization used in Eqs. (14.27) and (14.28). Namely, the Newton–Raphson method (as any other iterative method) requires one to use an initial guess for $U^{n+1}_m$, and an obvious candidate for such a guess is the known value of $U^n_m$. Then we let
$$\vec{U}^{n+1} = \vec{U}^n + \vec{\epsilon}^{\,(0)}, \qquad \vec{\epsilon}^{\,(0)} \ll \vec{U}^n; \qquad (14.31)$$
compare this with (8.84). Upon substituting (14.30) and (14.31) into (14.25) and discarding terms $O\big((\epsilon^{(0)})^2\big)$, one obtains:
$$\epsilon^{(0)}_m - \frac{\kappa}{2h}\left[\big(\epsilon^{(0)}_m U^n_m + \epsilon^{(0)}_{m+1} U^n_{m+1}\big)\frac{\delta_x U^n_m}{h} - \big(\epsilon^{(0)}_{m-1} U^n_{m-1} + \epsilon^{(0)}_m U^n_m\big)\frac{\delta_x U^n_{m-1}}{h} + \frac{(U^n_m)^2+(U^n_{m+1})^2}{2}\,\frac{\delta_x \epsilon^{(0)}_m}{h} - \frac{(U^n_{m-1})^2+(U^n_m)^2}{2}\,\frac{\delta_x \epsilon^{(0)}_{m-1}}{h}\right]$$
$$= \frac{\kappa}{h}\left[\frac{(U^n_m)^2+(U^n_{m+1})^2}{2}\,\frac{\delta_x U^n_m}{h} - \frac{(U^n_{m-1})^2+(U^n_m)^2}{2}\,\frac{\delta_x U^n_{m-1}}{h}\right]. \qquad (14.32)$$
Let us outline how the above expression is obtained; you will be asked to fill in the missing details in a homework problem. Although (14.32) can be obtained by direct multiplication of terms in (14.30), the easier, and mathematically literate, way is to use the following form of the familiar Product Rule from Calculus:
$$\Delta(fg) \equiv (f+\Delta f)(g+\Delta g) - fg \approx f\,\Delta g + g\,\Delta f, \qquad\text{where } \Delta f \ll f,\; \Delta g \ll g. \qquad \text{(Product Rule)}$$
As a preliminary step of the calculation that you will need to complete on your own, consider the term $(U^{n+1}_m)^2$. Let us use the substitution (14.31) and denote $U^n_m \equiv f$ and $U^{n+1}_m = U^n_m + \epsilon^{(0)}_m \equiv f + \Delta f$. Then, using the form of the Product Rule stated above with $g = f$, you can write
$$(U^{n+1}_m)^2 \approx (U^n_m)^2 + 2U^n_m\,\epsilon^{(0)}_m. \qquad (14.33)$$
Next, consider the first term in the first large parentheses in (14.30) and denote $\big((U^n_m)^2 + (U^n_{m+1})^2\big)$ by $f$ and $\delta_x U^n_m/h$ by $g$. (So, you denote $f$, $g$, $\Delta f$, and $\Delta g$ anew each time you use the Product Rule.) Then it is reasonable to use the following names for the corresponding quantities in the second large parentheses in (14.30):
$$(U^{n+1}_m)^2 + (U^{n+1}_{m+1})^2 \equiv f + \Delta f, \qquad \delta_x U^{n+1}_m/h \equiv g + \Delta g, \qquad (14.34)$$
where $\Delta f$ and $\Delta g$ are proportional to $\epsilon^{(0)}$. At home you will obtain the form of $\Delta f$ in (14.34) using Eq. (14.31). Directly from Eq. (14.31) you will be able to obtain the $\Delta g$. Then all that remains is to use the Product Rule on these $f$, $g$, $\Delta f$, and $\Delta g$. The remaining terms in (14.30) should be handled similarly.

From (14.32), the vector $\vec{\epsilon}^{\,(0)}$ can be solved for in a time-efficient manner (since the coefficient matrix is tridiagonal). In most circumstances, one iteration (14.31) is sufficient, but if need be, the iterations can be continued in complete analogy with the procedure described at the end of Sec. 8.6. Namely, we first compute
$$\vec{U}^{(1)} \equiv \vec{U}^n + \vec{\epsilon}^{\,(0)} \qquad (14.35)$$
and then seek a correction to that solution in the form
$$\vec{U}^{n+1} = \vec{U}^{(1)} + \vec{\epsilon}^{\,(1)}, \qquad \vec{\epsilon}^{\,(1)} \ll \vec{U}^{(1)}. \qquad (14.36)$$
Substituting (14.36) along with (14.30) into (14.25), we obtain an equation similar to (14.32):
$$\epsilon^{(1)}_m - \frac{\kappa}{2h}\left[\big(\epsilon^{(1)}_m U^{(1)}_m + \epsilon^{(1)}_{m+1} U^{(1)}_{m+1}\big)\frac{\delta_x U^{(1)}_m}{h} - \big(\epsilon^{(1)}_{m-1} U^{(1)}_{m-1} + \epsilon^{(1)}_m U^{(1)}_m\big)\frac{\delta_x U^{(1)}_{m-1}}{h} + \frac{(U^{(1)}_m)^2+(U^{(1)}_{m+1})^2}{2}\,\frac{\delta_x \epsilon^{(1)}_m}{h} - \frac{(U^{(1)}_{m-1})^2+(U^{(1)}_m)^2}{2}\,\frac{\delta_x \epsilon^{(1)}_{m-1}}{h}\right]$$
$$= -\,\epsilon^{(0)}_m + \frac{\kappa}{2h}\left[\frac{(U^n_m)^2+(U^n_{m+1})^2}{2}\,\frac{\delta_x U^n_m}{h} - \frac{(U^n_{m-1})^2+(U^n_m)^2}{2}\,\frac{\delta_x U^n_{m-1}}{h}\right] + \frac{\kappa}{2h}\left[\frac{(U^{(1)}_m)^2+(U^{(1)}_{m+1})^2}{2}\,\frac{\delta_x U^{(1)}_m}{h} - \frac{(U^{(1)}_{m-1})^2+(U^{(1)}_m)^2}{2}\,\frac{\delta_x U^{(1)}_{m-1}}{h}\right]. \qquad (14.37)$$
Recall that here, $U^{(1)}$, $U^n$, and $\epsilon^{(0)}$ are known, and one's goal is to solve this linear equation for $\epsilon^{(1)}$. This can be done time-efficiently, because the coefficient matrix of the equation for $\epsilon^{(1)}$ is tridiagonal. Once $\epsilon^{(1)}$ has been found, one can define, and solve for, $\epsilon^{(2)}$, etc. These iterations can be carried out in the above manner as many times as need be.
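For concreteness, here is a minimal MATLAB sketch of one Newton–Raphson step, i.e. the assembly and solution of the tridiagonal system (14.32) for $\epsilon^{(0)}$. It assumes time-independent Dirichlet boundary conditions (so that $\epsilon^{(0)}$ vanishes at the boundary nodes $m = 0$ and $m = M$); all variable names and the index mapping are illustrative.

% U(1..M+1) holds U^n at nodes m = 0..M (MATLAB index i = m+1); h, kappa given.
M  = length(U) - 1;
S  = (U(1:M).^2 + U(2:M+1).^2)/2;     % S(i) = ((U_m)^2 + (U_{m+1})^2)/2, m = i-1
D  = (U(2:M+1) - U(1:M))/h;           % D(i) = delta_x U_m / h
cf = kappa/(2*h);
j  = 1:M-1;                           % interior nodes m = 1..M-1
lo = cf*( U(j).*D(j) - S(j)/h );                           % multiplies eps0_{m-1}
di = 1 - cf*( U(j+1).*(D(j+1)-D(j)) - (S(j+1)+S(j))/h );   % multiplies eps0_m
up = -cf*( U(j+2).*D(j+1) + S(j+1)/h );                    % multiplies eps0_{m+1}
T  = diag(di) + diag(lo(2:end),-1) + diag(up(1:end-1),1);  % tridiagonal matrix of (14.32)
rhs  = (kappa/h)*( S(j+1).*D(j+1) - S(j).*D(j) );          % r.h.s. of (14.32)
eps0 = (T \ rhs.').';                                      % the correction in (14.31)
Unew = U;  Unew(2:M) = U(2:M) + eps0;                      % U^{n+1} = U^n + eps0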
As we have seen above, the strength of the Newton–Raphson method is that it can be applied to programming an implicit numerical scheme for any nonlinear equation or system of equations. However, a drawback of this method is that it is quite cumbersome (see, e.g., (14.32) and (14.37)). Therefore, a considerable amount of research has been done on finding other methods which, on one hand, would to a large extent retain the good stability properties of implicit methods while, on the other hand, would be much easier to program. Two such systematic alternatives to the Newton–Raphson method, which can be applied to a very wide class of equations and which do not require the solution of a system of nonlinear equations, are described in the next Section.
To conclude this Section, we will point out one issue that is specific to the discretization of nonlinear differential equations.

Remark 1 Let us continue using (14.25) as the model problem. Note that it can be written in an equivalent form:
$$u_t = \frac13\,(u^3)_{xx}. \qquad (14.38)$$
We can use the following discretization that has the accuracy of $O(\kappa^2 + h^2)$:
$$\frac{1}{\kappa}\,\delta_t U^n_m = \frac13\cdot\frac{1}{2h^2}\left[\delta^2_x (U^3)^n_m + \delta^2_x (U^3)^{n+1}_m\right]; \qquad (14.39)$$
recall the definition (13.3) of the operator $\delta^2_x$. The point we want to make here is that the nonlinear system (14.39) is different from the nonlinear system obtained upon substitution of (14.30) into (14.25)!

The issue we have encountered can be understood from the following simple example, pertaining to a single time level (hence we omit the superscript of the functions). Consider a nonlinear function $u^3$. Obviously,
$$(u^3)_x = 3u^2 u_x. \qquad (14.40)$$
With second-order accuracy, the l.h.s. can be discretized as, e.g.,
$$(u^3)_x \;\approx\; \frac{(U_{m+1})^3 - (U_{m-1})^3}{2h} = \frac{(U_{m+1}-U_{m-1})\big(U_{m+1}^2 + U_{m+1}U_{m-1} + U_{m-1}^2\big)}{2h}. \qquad (14.41)$$
Using the same central-difference formula to discretize the derivative on the r.h.s. of (14.40), one obtains
$$3u^2 u_x \;\approx\; 3U_m^2\,\frac{U_{m+1}-U_{m-1}}{2h}, \qquad (14.42)$$
which, obviously, does not equal the r.h.s. of (14.41), although it differs from it by an amount $O(h^2)$.

Thus, a nonlinear term can have several representations, which are equivalent in the continuous limit (like the l.h.s. and r.h.s. of (14.40)). However, these different representations, when discretized using the same rule, can still lead to distinct finite-difference equations.
14.5 Nonlinear parabolic PDEs: II. Semi-implicit, implicit-explicit (IMEX), and other methods

14.5.1 A semi-implicit method

Let us present a simple alternative to the Newton–Raphson method using (14.38) as the model problem. With the accuracy of $O(\kappa^2)$, the $u^3$ term can be discretized as follows:
$$u^3 \;\approx\; \left(U^{n+\frac12}\right)^{\!2}\,\frac{U^n + U^{n+1}}{2}. \qquad (14.43)$$
The r.h.s. of (14.43) is now linear with respect to $U^{n+1}$, but the problem is that we do not yet know $U^{n+\frac12}$. The latter can be approximated by an explicit method that should have the local truncation error $O(\kappa^2)$, and hence the global accuracy of one order less, i.e. only $O(\kappa)$. That is, one can first compute $U^{n+\frac12}$ and then use it as a known value in (14.43). A simple, $O(\kappa^2)$-accurate way to compute $U^{n+\frac12}$ is by a multi-step method similar to (3.4):
$$U^{n+\frac12} = U^n + \frac12\big(U^n - U^{n-1}\big) = \frac32 U^n - \frac12 U^{n-1}. \qquad (14.44)$$
Then the scheme
$$\delta_t U^n_m = \frac{r}{6}\,\delta^2_x\!\left[\left(U^{n+\frac12}_m\right)^{\!2}\big(U^n_m + U^{n+1}_m\big)\right], \qquad (14.45)$$
becomes an implicit scheme for the linear equation.
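A minimal MATLAB sketch of one step of (14.45) with $U^{n+1/2}$ from (14.44) is given below (see also the general description that follows). Homogeneous Dirichlet boundary conditions and the variable names U (level $n$), Uold (level $n-1$), kappa, h are illustrative assumptions.

% one step of the semi-implicit scheme (14.45), (14.44)
r  = kappa/h^2;
V2 = (1.5*U - 0.5*Uold).^2;                    % (U^{n+1/2})^2 from (14.44)
i  = 2:length(U)-1;                            % interior nodes
rhs = U(i) + (r/6)*( V2(i+1).*U(i+1) - 2*V2(i).*U(i) + V2(i-1).*U(i-1) );
di  = 1 + (r/3)*V2(i);                         % diagonal of the tridiagonal matrix
lo  = -(r/6)*V2(i-1);                          % sub-diagonal (multiplies U^{n+1}_{m-1})
up  = -(r/6)*V2(i+1);                          % super-diagonal (multiplies U^{n+1}_{m+1})
T   = diag(di) + diag(lo(2:end),-1) + diag(up(1:end-1),1);
Unew = zeros(size(U));  Unew(i) = (T \ rhs.').';
Uold = U;  U = Unew;                           % shift the two stored time levels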
Method (14.45), (14.44) is a member of a large class of semi-implicit methods. It can be straightforwardly generalized to the following class of equations:
$$u_t = a(u, u_x, x, t)\,u_{xx} + b(u, u_x, x, t)\,u_x, \qquad (14.46)$$
where, as stated above, the coefficients $a$ and $b$ may depend on the solution $u$ and its derivative $u_x$. (Further generalizations of this form are possible, but for the purpose of our brief discussion, form (14.46) is sufficient.) An extension of scheme (14.45), (14.44) for (14.46) is:
$$\frac{\delta_t U^n_m}{\kappa} = a\!\left(U^{n+\frac12}_m, (U^{n+\frac12}_m)_x, x_m, t_{n+\frac12}\right)\frac{(U^n_m)_{xx} + (U^{n+1}_m)_{xx}}{2} + b\!\left(U^{n+\frac12}_m, (U^{n+\frac12}_m)_x, x_m, t_{n+\frac12}\right)\frac{(U^n_m)_x + (U^{n+1}_m)_x}{2}, \qquad (14.47)$$
where $U^{n+\frac12}_m$ is given by (14.44) and $(U^n_m)_x$ denotes the second-order accurate finite-difference approximation of $u_x(x_m, t_n)$, etc.
Since this scheme is not fully implicit, it cannot be unconditionally stable (see Theorem 4.2 at the end of Lecture 4). However, one can show that it is unconditionally stable on the background of the constant solution, $u = C$ where $C$ is any constant, of (14.46). To show this, let us consider an ansatz
$$U^n_m = C + \epsilon\,\rho^n e^{i\beta m h}, \qquad (14.48)$$
where $\rho$ is the amplification factor in the von Neumann analysis and $\epsilon \ll 1$ indicates the smallness of the perturbation to the exact solution $u = C$. Then, according to (14.44),
$$U^{n+\frac12}_m = C + \epsilon\left(\frac32\rho - \frac12\right)\rho^{n-1} e^{i\beta m h}. \qquad (14.49)$$
When, however, (14.49) is substituted into (14.47) and terms of order $O(\epsilon^2)$ and higher are neglected, the $O(\epsilon)$-term from (14.49) drops out, since the terms that it multiplies are already $O(\epsilon)$ (the $C$-term is absent from $(U^n_m)_{xx}$ and similar terms due to the $x$-derivative). Then, in the equation that results from the stability analysis, $a$ and $b$ take on the forms $a(C, 0, x_m, t_{n+\frac12})$ and $b(C, 0, x_m, t_{n+\frac12})$, which is the same as in the case of linear parabolic PDEs with variable coefficients, considered in Section 14.2. Thus, method (14.47) is unconditionally stable on the background of the constant solution of Eqs. (14.46). While it may be unstable (for too large a time step) on the background of other, non-constant, solutions, it may still be a good first method to try since it is much easier to implement than the Newton–Raphson method.
14.5.2 The idea behind Implicit–Explicit (IMEX) methods

IMEX methods present another attractive alternative to the Newton–Raphson method because, as the semi-implicit method above, they also do not require the solution of a system of nonlinear algebraic equations. They do require the step size to be restricted since they are not fully implicit and hence cannot be unconditionally stable (see Lecture 4). However, such a restriction can be significantly weaker than that for a fully explicit method. Below we present only the basic idea of IMEX methods. A more detailed, and quite readable, exposition, as well as references, can be found in Section IV.4 of the book by W. Hundsdorfer and J.G. Verwer, "Numerical Solution of Time-Dependent Advection-Diffusion-Reaction Equations" (Springer Series in Comput. Math., vol. 33, Springer, 2003).

The idea behind IMEX methods can be explained without any explicit reference to spatial variables. Let the evolution equation that we want to solve have the form
$$u_t = F(u(t), t) \equiv F_0(u(t), t) + F_1(u(t), t), \qquad (14.50)$$
where $F_0$ is a non-stiff term suitable for explicit time-integration and $F_1$ is a stiff term that requires implicit treatment. Usually, $F_0$ and $F_1$ include, respectively, the advection and diffusion terms (i.e., the second and first terms in (14.16) or (14.46), respectively; recall from Lecture 12 that the simple Heat equation $u_t = u_{xx}$ is a stiff problem). The last term in (14.16), which is usually referred to as the reaction term (because it often describes chemical reactions), can belong to either $F_0$ or $F_1$. To make the splitting (14.50) useful for a numerical implementation, which means avoiding the solution of a system of nonlinear equations, it suffices to require that $F_1$ be linear in $u$. Below we will proceed with this assumption, but at the end of our discussion will mention a generalization where $F_1$ may contain nonlinear terms. Let us also note that our consideration applies equally well both to a single Eq. (14.50) and to a system of coupled equations whose r.h.s. can be split as a sum of non-stiff and stiff terms.
A simple first-order accurate IMEX method for (14.50) is:
$$\frac{U^{n+1} - U^n}{\kappa} = F_0(U^n, t_n) + (1-\theta)\,F_1(U^n, t_n) + \theta\,F_1(U^{n+1}, t_{n+1}), \qquad (14.51)$$
where $\theta$ is a parameter, as in Lecture 13. Note that since, by design, $F_1$ depends on $U^{n+1}$ linearly, scheme (14.51) does not require its user to solve any nonlinear algebraic equations.
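As a minimal MATLAB sketch of one step of (14.51), consider $u_t = F_0(u) + \nu u_{xx}$, where the nonlinear reaction term $F_0(u) = u(1-u)$ and the diffusion coefficient $\nu$ are illustrative assumptions, the diffusion term is the linear stiff part $F_1$, and homogeneous Dirichlet boundary conditions are assumed:

% one IMEX step (14.51): explicit reaction, theta-implicit diffusion
i   = 2:length(U)-1;                                   % interior nodes
n_i = length(i);
Lap = ( -2*diag(ones(n_i,1)) + diag(ones(n_i-1,1),1) ...
      + diag(ones(n_i-1,1),-1) ) * nu/h^2;             % matrix of F1 on interior values
F0  = U(i).*(1 - U(i));                                 % non-stiff part, treated explicitly
rhs = U(i).' + kappa*F0.' + kappa*(1-theta)*(Lap*U(i).');
Unew = zeros(size(U));
Unew(i) = ( (eye(n_i) - kappa*theta*Lap) \ rhs ).';     % linear solve for U^{n+1}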
The stability analysis for this scheme is done as follows. Instead of the model equation
$$u_t = \lambda u, \qquad (4.14)$$
which does not distinguish between the stiff and non-stiff parts, one considers a model equation
$$u_t = \lambda_0 u + \lambda_1 u, \qquad (14.52)$$
where $\lambda_0$ and $\lambda_1$ correspond to $F_0$ and $F_1$. Substituting $U^n = \rho^n$ into scheme (14.51) applied to Eq. (14.52), one finds that
$$\rho(z_0, z_1) = \frac{1 + z_0 + (1-\theta)z_1}{1 - \theta z_1}, \qquad (14.53)$$
where $z_0 = \lambda_0\kappa$ and $z_1 = \lambda_1\kappa$. As usual, one requires
$$|\rho(z_0, z_1)| < 1 \qquad (14.54)$$
for stability. We will now explain that this condition can be interpreted in two different ways.
for stability. We will now explain that this condition can be interepreted in two dierent ways.
First interpretation of (14.54)
Suppose one has to design a method (14.51) that should be applicable to equations of the
form (14.50) where parameters of F
0
cause the values of
0
to be anywhere in the left-half
complex plane (i.e., not just on the negative real line). Then one should insist on using the full
stability region of the explicit method, i.e., to have |1 + z
0
| < 1, while being willing to give
up some exibility in selecting z
1
. An example of such a situation is when F
0
contains terms
describing nonlinear, but non-sti, reaction or advection, while F
1
contains the simple diusion
term u
xx
, for which all
1
s lie on the negative real axis (see Problem 2 in HW 12). Let us now
explain why the restriction on the values of z
1
are expected to occur.
For the sake of argument, consider the value = 1/2 in (14.51), which would lead to the
CrankNicolson scheme if F
0
were absent. Since that scheme is nothing but the implementation
of the modied implicit Euler method for the Heat equation (see Sec. 13.1), its stability region
is the entire left-half complex plane (recall the result of Problem 7 in HW 4). That is,

1 +
1
2
z
1

1
1
2
z
1

1 (14.55)
whenever Re (z
1
) 0. Graphically, this is illustrated in Figure (a) below. There, the ex-
pressions in the numerator and denominator on the l.h.s. of (14.55) are depicted by the thick
MATH 337, by T. Lakoba, University of Vermont 136
vectors in the left-half and right-half planes, respectively. It is clear that the ratio of the lengths
of those vectors is indeed always less than one.
[Figure: panels (a) and (b) depict the complex $z_1$ plane (axes Re $z_1$, Im $z_1$). In panel (a), the vectors $1 + \frac12 z_1$ and $1 - \frac12 z_1$ are shown for a $z_1$ in the left-half plane. In panel (b), the vector $1 + \frac12 z_1$ is replaced by $(1+z_0) + \frac12 z_1$.]
On the other hand, the stability condition (14.54) with $\theta = 1/2$ is
$$\left|\frac{(1 + z_0) + \frac12 z_1}{1 - \frac12 z_1}\right| \le 1, \qquad (14.56)$$
which must hold for all $z_0$ such that $|1 + z_0| \le 1$. As illustrated in Figure (b) above, condition (14.56) can be violated for some of such $z_0$ unless $\mathrm{Im}(z_1) = 0$. Thus, if one insists on having the full stability region for the explicit part of the IMEX method (14.51), the stability region of this method with respect to its implicit part can be less than the corresponding stability region of (14.51) with $F_0 \equiv 0$.
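This point is easy to verify numerically; the short MATLAB sketch below (with illustrative sampling choices) evaluates $\rho$ from (14.53) for $\theta = 1/2$, first with $z_1$ on the negative real axis and then with a $z_1$ having a large imaginary part:

rho = @(z0,z1,th) (1 + z0 + (1-th).*z1)./(1 - th.*z1);
th  = 1/2;
z0  = -1 + exp(1i*linspace(0,2*pi,721));        % the circle |1 + z0| = 1
z1  = -logspace(-3,3,200);                      % z1 on the negative real axis
[Z0,Z1] = ndgrid(z0,z1);
max(max(abs(rho(Z0,Z1,th))))                    % = 1 (to roundoff): (14.54) holds
abs(rho(-1+1i, -0.1+10i, th))                   % > 1: violated when Im(z1) is not 0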
In general, in this case one can show that the stability condition (14.54) yields the inequality
$$1 + |(1-\theta)z_1| < |1 - \theta z_1|. \qquad (14.57)$$
With some effort, one can further show that the unconditional stability of the IMEX method (14.51) is attained only for $\theta = 1$. For $\theta < 1/2$, scheme (14.51) is unstable. For $\theta = 1/2$, its stability region $D_1$, given by (14.57), collapses onto the negative real axis: $z_1 < 0$. (Above, we have illustrated this graphically.) However, already for $\theta$ just slightly exceeding the critical value of $1/2$, the stability region $D_1$ becomes a sector with a significantly nonzero angle on both sides of the negative real axis; for example, $\approx 25^\circ$ and $> 50^\circ$ for $\theta = 0.51$ and $\theta = 0.6$, respectively (see, e.g., Fig. 4.1 in the book by Hundsdorfer and Verwer cited above).
Second interpretation of (14.54)

Alternatively, suppose that (14.50) is a system of coupled equations for variables $u^{(1)}, u^{(2)}, \ldots$, and suppose that $F_1$ contains both the diffusion term and the stiff part of the reaction term. Then, the eigenvalues $\lambda_1^{(1)}, \lambda_1^{(2)}, \ldots$ (and hence the corresponding values $z_1^{(1)}, z_1^{(2)}, \ldots$) of the Jacobian matrix $\partial(F^{(1)}, F^{(2)}, \ldots)/\partial(u^{(1)}, u^{(2)}, \ldots)$ (see Sec. 5.4 in Lecture 5) can be found anywhere in the left half of the complex plane, i.e. $\mathrm{Re}\,z_1^{(j)} \le 0$ for all $j$. Thus, one may want to know for which complex $z_0$ one can fulfill condition (14.54) given that $z_1$ can be allowed anywhere in the left-half complex plane.

Similarly to the previous case, the corresponding nonempty region $D_0$ exists only for $\theta \ge 1/2$. That is, if $\theta < 1/2$, then the IMEX method (14.51), where $z_1$ can be found anywhere in the left-half plane, is unstable for any $z_0 \ne 0$ with $\mathrm{Re}(z_0) \le 0$! (This should be contrasted with the situation when $F_0 \equiv 0$, for which method (14.51) with $\theta < 1/2$ is conditionally stable, as we showed in Sec. 13.3.) For $\theta < 1$, the stability region $D_0$ of the IMEX method is smaller than the region $|1 + z_0| < 1$, which would result in the absence of the $F_1$ term in (14.50). For $\theta = 1/2$, the region $D_0$ collapses into the segment $-2 < z_0 < 0$ along the negative real axis, while for $\theta = 1$, the stability region of the explicit Euler method, i.e., $D_0(\theta = 1) = \{\text{all } z_0 \text{ such that } |1 + z_0| < 1\}$, is recovered (see, again, Fig. 4.1 in the book by Hundsdorfer and Verwer).
The book by Hundsdorfer and Verwer provides an overview of higher-order accurate members of the IMEX family, which are preferred in practice over the lowest-order method (14.51). Among them are, for example, IMEX Runge–Kutta and multistep IMEX methods. Below we will list two second-order accurate IMEX methods and briefly comment on their properties.

Second-order IMEX-Adams methods have the form:
$$\frac{U^{n+1} - U^n}{\kappa} = \frac32 F_0(U^n, t_n) - \frac12 F_0(U^{n-1}, t_{n-1}) + \theta F_1(U^{n+1}, t_{n+1}) + \left(\frac32 - 2\theta\right)F_1(U^n, t_n) + \left(\theta - \frac12\right)F_1(U^{n-1}, t_{n-1}). \qquad (14.58)$$
If we insist that it be stable for all $z_1$ in the left-half plane, its stability region with respect to $z_0$ depends on $\theta$ (similarly to what we discussed above in the second interpretation of (14.54)). For example, for $\theta = 1/2$, this method is stable only when $z_0$ belongs to a segment along the negative real axis, $z_0 \in [-1, 0]$. For $\theta = 1$, the stability region of the second-order Adams–Bashforth method is recovered (see Problem 4 in HW 4). For $\theta = 3/4$, the stability region is an oval, part of whose boundary follows the imaginary axis most closely (out of all values of $\theta$). Thus, the IMEX-Adams method with $\theta = 3/4$ is preferred for equations that have $z_0$ both on, and to the left of, the imaginary axis.
If the $z_0$ are known to lie only on the imaginary axis, then the so-called IMEX-CNLF (Crank–Nicolson Leap-frog) method can be used. Its scheme is:
$$\frac{U^{n+1} - U^{n-1}}{2\kappa} = F_0(U^n, t_n) + \frac12\left[F_1(U^{n+1}, t_{n+1}) + F_1(U^{n-1}, t_{n-1})\right]. \qquad (14.59)$$
This scheme is stable for all $z_1$ in the left-half plane and for $z_0 \in [-2i,\, 2i]$. An example of a non-stiff term $F_0$ for which $\lambda_0$ lies on the imaginary axis is the advection term $b(x, t, u)u_x$. (It is beyond the scope of this course to explain why this is so, but if you are familiar with Fourier analysis, you may figure it out on your own.) Thus, equations of the form (14.46) where $a$ is independent of $u$ can be solved by this method.
Finally, we note that the same considerations can often be generalized when $F_1$ is not a linear function of $u$. For example, consider Eq. (14.46) where now the coefficient $a$ does depend on $u$. Then one can replace the implicit integration in (14.51) with an analogue of the semi-implicit method (14.47). This would still result in the equation for $U^{n+1}$ being linear, and hence easily solvable. Stability properties of such a method are not, however, clear, and may need to be verified by numerical experiments.
14.5.3 Comments on other methods

Let us mention a popular method called a split-step method, which we will illustrate with the example of the celebrated Nonlinear Schrödinger equation:
$$i u_t + u_{xx} + |u|^2 u = 0, \qquad\text{(note the } i = \sqrt{-1}\text{ in front of } u_t\text{)} \qquad (14.60)$$
which appears in a great many applications involving propagation of wave packets. The split-step method is based on the observation that the linear and nonlinear parts of this equation can be solved exactly (we do not need to consider here how this can be done). Then the split-step algorithm is:
Given $U^n(x) \approx u(x, t_n)$,
solve $i u_t + u_{xx} = 0$ from $t_n$ to $t_{n+1}$; get $U_{\rm aux}$;
using $U_{\rm aux}$ as the initial condition,
solve $i u_t + u|u|^2 = 0$ from $t_n$ to $t_{n+1}$; get $U^{n+1}$.
(14.61)
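As an illustration only, here is a minimal MATLAB sketch of one step of (14.61). It assumes periodic boundary conditions on a domain of length Lx (so that the linear substep can be done exactly with the FFT, a detail not covered in these notes) and an even number of grid points N; all names are illustrative.

% one split-step step (14.61) for the NLS equation (14.60), periodic BCs assumed
N  = length(U);                               % U = row vector of u(x, t_n) on the periodic grid
k  = (2*pi/Lx) * [0:N/2-1, -N/2:-1];          % discrete wavenumbers (N even)
Uaux = ifft( exp(-1i*k.^2*kappa) .* fft(U) ); % exact solution of i*u_t + u_xx = 0
U    = exp(1i*abs(Uaux).^2*kappa) .* Uaux;    % exact solution of i*u_t + |u|^2*u = 0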
The last class of methods that we will mention are valuable only for PDEs that possess conserved quantities, like energy. Usually, such equations are hyperbolic PDEs or parabolic PDEs with imaginary time, like the Nonlinear Schrödinger equation (14.60). Such equations are multi-dimensional counterparts of the harmonic oscillator equation. There are classes of numerical schemes that preserve some (or, in rare cases, all!) of the conserved quantities of those equations. Such schemes are relatives of the symplectic methods for ODEs, discussed in Lecture 5. One can read about those conservation-laws-based schemes in, e.g., a textbook by J.W. Thomas, "Numerical Partial Differential Equations: Conservation Laws and Elliptic Equations" (Springer, 1999). True parabolic equations, like the Heat equation or, more generally, any equation with diffusion in real-valued time, do not have conserved quantities like energy, and hence conservation-laws-based schemes are not applicable to them.
14.6 Questions for self-assessment

1. In (14.5) and (14.7), why did we not use the simpler discretization
$$\frac{U^n_1 - U^n_0}{h} + p^n U^n_0 = q^n, \qquad n = 0, 1, \ldots,$$
which would have eliminated the need to deal with the solution $U^n_{-1}$ at the virtual node?

2. Be able to explain the idea(s) behind handling the derivative boundary condition for both the simple explicit and Crank-Nicolson schemes.

3. Make sure you can obtain (14.9) and hence (14.10)–(14.12).

4. Obtain (14.15b).

5. Explain, argumentatively but without calculations, that discretization (14.17) produces a scheme of the accuracy stated in the text. (Drawing the stencil should help.)

6. Same question about (14.19).

7. What condition on the variable coefficients of a linear PDE should hold in order for the von Neumann stability analysis to proceed along the same lines as for the simple Heat equation? Why?

8. Describe two ways in which the person who is numerically solving PDE (14.13) may use the stability condition (14.20).

9. When and why does one need to modify the stability criterion to be (14.23)?

10. What is the order of accuracy of scheme (14.26)?

11. Explain qualitatively (i.e., without calculations) that discretization (14.30) produces a scheme of the accuracy $O(\kappa^2 + h^2)$. (Drawing the stencil should help.)

12. What is the main difficulty in solving nonlinear PDEs by implicit methods?

13. Using discretization (14.30) as an example, explain the idea behind the Newton–Raphson method when applied to nonlinear PDEs.

14. Describe the issue about discretization of nonlinear terms, pointed out in Remark 1.

15. Describe the idea behind the semi-implicit method presented in Sec. 14.5.

16. Explain why the r.h.s. of (14.44) approximates the l.h.s. of that equation.

17. Obtain (14.53).

18. What are two possible interpretations of (14.54)?

19. Make sure you can follow the argument made around condition (14.56).

20. Why does method (14.58) have the name "Adams" in it?

21. When can method (14.59) be used?
15 The Heat equation in 2 and 3 spatial dimensions

In this Lecture, which concludes our treatment of parabolic equations, we will develop numerical methods for the Heat equation in 2 and 3 dimensions in space. We will present the details of these developments for the 2-dimensional case, while for the 3-dimensional case, we will mention only those aspects which cannot be straightforwardly generalized from 2 to 3 spatial dimensions.

Since this Lecture is quite long, here we give a brief preview of its results. First, we will explain how the solution vector can be set up on a 3-dimensional grid (two dimensions in space and one in time). We will discuss both the conceptual part of this setup and its implementation in Matlab. Then we will present the simple explicit scheme for the 2D Heat equation and will show that it is even more time-inefficient than it was for the Heat equation in one dimension. In search of a time-efficient substitute, we will analyze the naive version of the Crank-Nicolson scheme for the 2D Heat equation, and will discover that that scheme is not time-efficient either! We will then show how a number of time-efficient generalizations of the Crank-Nicolson scheme to 2 and 3 dimensions can be constructed. These generalizations are known under the common name of Alternating Direction methods, and are a particular case of an even more general class of so-called operator-splitting methods. In Appendix 1 we will point out a relation between these methods and the IMEX methods mentioned in Lecture 14, as well as with the predictor-corrector methods considered in Lecture 3. Finally, we will also show that prescribing boundary conditions (even the Dirichlet ones) for those time-efficient schemes is not always a trivial matter, and demonstrate how they can be prescribed.
15.1 Setting up the solution vector on a three-dimensional grid

In this Lecture, we study the following IBVP:
$$u_t = u_{xx} + u_{yy}, \qquad 0 < x < 1,\; 0 < y < Y,\; t > 0; \qquad (15.1)$$
$$u(x, y, t=0) = u_0(x, y), \qquad 0 \le x \le 1,\; 0 \le y \le Y; \qquad (15.2)$$
$$u(0, y, t) = g_0(y, t), \quad u(1, y, t) = g_1(y, t), \qquad 0 \le y \le Y,\; t \ge 0; \qquad (15.3)$$
$$u(x, 0, t) = g_2(x, t), \quad u(x, Y, t) = g_3(x, t), \qquad 0 \le x \le 1,\; t \ge 0. \qquad (15.4)$$
We will always assume that the boundary conditions are consistent with the initial condition:
$$g_0(y, 0) = u_0(0, y), \quad g_1(y, 0) = u_0(1, y), \quad g_2(x, 0) = u_0(x, 0), \quad g_3(x, 0) = u_0(x, Y), \qquad (15.5)$$
and, at the corners of the domain, with each other:
$$g_0(0, t) = g_2(0, t), \quad g_0(Y, t) = g_3(0, t), \quad g_3(1, t) = g_1(Y, t), \quad g_1(0, t) = g_2(1, t), \qquad \text{for } t > 0. \qquad (15.6)$$
The figure below shows the two-dimensional spatial domain where the Heat equation (15.1) holds, as well as the domain's boundary, where the boundary conditions (15.3), (15.4) are specified. Note that we have allowed the lengths of the domain in the $x$ and $y$ directions to be different (if $Y \ne 1$). Although, in principle, one can always make $Y = 1$ by a suitable scaling of the spatial coordinates, we prefer not to do so in order to allow, later on, the step sizes in $x$ and $y$ to be the same. The latter is simply a matter of convenience.

[Figure: the domain $D = [0,1]\times[0,Y]$ with boundary $\partial D$; the boundary functions $g_0$, $g_1$, $g_2$, $g_3$ are prescribed on the sides $x = 0$, $x = 1$, $y = 0$, $y = Y$, respectively. A second figure shows two time levels ($n$ and $n+1$) of the 2D spatial grid for $M = 5$, $L = 5$; the interior nodes of the lower level are numbered 1 through 16 in lexicographic order.]
To discretize the Heat equation (15.1), we cover domain $D$ with a two-dimensional grid. As we have just noted above, in what follows we will assume that the step sizes in the $x$ and $y$ directions are the same and equal $h$. We also discretize the time variable with a step size $\kappa$. Then the three-dimensional grid for the 2D Heat equation consists of points $(x = mh,\; y = lh,\; t = n\kappa)$, $0 \le m \le M = 1/h$, $0 \le l \le L = Y/h$, and $0 \le n \le N = t_{\max}/\kappa$. Two time levels of such a grid for the case $M = 5$ and $L = 5$ are shown in the figure above.

We will denote the solution on the above grid as
$$U^n_{ml} = u(mh, lh, n\kappa), \qquad 0 \le m \le M,\; 0 \le l \le L,\; 0 \le n \le N. \qquad (15.7)$$
We expect that any numerical scheme that we will design will give some recurrence relation
between $U^{n+1}_{ml}$ and $U^n_{ml}$ (and, possibly, $U^{n-1}_{ml}$ etc.). As long as our grid is rectangular, the array of values $U^n_{ml}$ at each given $n$ can be conveniently represented as an $(M+1)\times(L+1)$ matrix. In this Lecture, we will consider only this case of a rectangular grid. Then, to step from level $n$ to level $(n+1)$, we just apply the recurrence formula to each element of the matrix $U^n_{ml}$. For example:
for m = 2 : mmax-1
    for ell = 2 : ellmax-1
        % apply the scheme's recurrence formula at the interior node (m,ell);
        % Unew is used so that the level-n values are not overwritten prematurely
        Unew(m,ell) = a*U(m,ell) + b*U(m+1,ell-1);
    end
end
U(2:mmax-1, 2:ellmax-1) = Unew(2:mmax-1, 2:ellmax-1);
If we want to record and keep the value of the solution at each time level, we can instead use:
U(m,ell,n+1) = a*U(m,ell,n) + b*U(m+1,ell-1,n);
On the other hand, if the spatial domain is not rectangular, as occurs in many practical problems, then defining $U^n_{ml}$ as a matrix is not possible, or at least not straightforward. In this case, one needs to reshape the two-dimensional array $U^n_{ml}$ into a one-dimensional vector. Even though it will not be needed for the purposes of this lecture or homework, we will still illustrate the idea and implementation behind this reshaping. For simplicity, we will consider the case where the grid is rectangular. This reshaping can be done in more than one way. Here we will consider only the so-called lexicographic ordering. In this ordering, the first $(M-1)$ components of the solution vector $\vec{U}$ will be the values $U_{m,1}$ with $m = 1, 2, \ldots, M-1$. (Here, for brevity, we have omitted the superscript $n$, and also inserted a comma between the subscripts pertaining to the $x$ and $y$ axes for visual convenience.) The next $M-1$ components will be $U_{m,2}$ with $m = 1, 2, \ldots, M-1$, and so on. The resulting vector is:
$$\vec{U} = \big[\,U_{1,1},\, U_{2,1},\, \ldots,\, U_{M-1,1};\;\; U_{1,2},\, U_{2,2},\, \ldots,\, U_{M-1,2};\;\; \ldots;\;\; U_{1,L-1},\, U_{2,L-1},\, \ldots,\, U_{M-1,L-1}\,\big]^T, \qquad (15.8)$$
where the first group of $(M-1)$ entries is the 1st row of the 2D level (along $y = 1\cdot h$, i.e. $l = 1$), the second group is the 2nd row (along $y = 2\cdot h$, i.e. $l = 2$), and so on, with the last group being the $(L-1)$th row (along $y = (L-1)h$, i.e. $l = L-1$).

An example of the lexicographic ordering of the nodes of one time level is shown in the figure on the previous page for $M = 5$ and $L = 5$ (see the numbers next to the filled circles on the lower level).
Let us now show how one can set up one time level of the grid and construct a vector of the form (15.8), using built-in commands in Matlab. In order to avoid possible confusion, we will define the domain $D$ slightly differently than was done above. Namely, we let $x \in [0, 2]$ and $y \in [3, 4]$. Next, let us discretize the $x$ coordinate as
>> x=[0 1 2]
x =
0 1 2
and the y coordinate as
>> y=[3 4]
y =
3 4
(Such a coarse discretization is quite sufficient for the demonstration of how Matlab commands can be used.) Now, let us construct a two-dimensional grid as follows:
>> [X,Y]=meshgrid(x,y)
X =
0 1 2
0 1 2
Y =
3 3 3
4 4 4
Thus, entries of matrix X along the rows equal the values of x, and these entries do not change
along the columns. Similarly, entries of matrix Y along the columns equal the values of y, and
these entries do not change along the rows.
Here is a function Z of two variables x and y, constructed with the help of the above matrices
X and Y:
>> Z=100*Y+X
Z =
300 301 302
400 401 402
Now, if we need to reshape matrix Z into a vector, we simply say:
>> Zr=reshape(Z,prod(size(Z)),1)
Zr =
300
400
301
401
302
402
(Note that
>> size(Z)
ans =
2 3
and command prod simply computes the product of all entries of its argument.) If we want to
go back and forth between using Z and Zr, we can use the reversibility of command reshape:
>> Zrr=reshape(Zr,size(X,1),size(X,2))
Zrr =
300 301 302
400 401 402
which, of course, gives you back the Z. Finally, if you want to plot the two-dimensional function Z(x, y), you can type either mesh(x,y,Z) or mesh(X,Y,Z). You may always look up the help for any of the above (or any other) commands if you have questions about them.
15.2 Simple explicit method for the 2D Heat equation

Construction of the simple explicit scheme for the 2D Heat equation is a fairly straightforward matter. Namely, we discretize the terms in (15.1) in the standard way:
$$u_t \approx \frac{U^{n+1}_{m,l} - U^n_{m,l}}{\kappa} \equiv \frac{\delta_t U^n_{m,l}}{\kappa}, \qquad
u_{xx} \approx \frac{U^n_{m+1,l} - 2U^n_{m,l} + U^n_{m-1,l}}{h^2} \equiv \frac{\delta^2_x U^n_{m,l}}{h^2}, \qquad
u_{yy} \approx \frac{U^n_{m,l+1} - 2U^n_{m,l} + U^n_{m,l-1}}{h^2} \equiv \frac{\delta^2_y U^n_{m,l}}{h^2}, \qquad (15.9)$$
and substitute these expressions into (15.1) to obtain:
$$U^{n+1}_{ml} = U^n_{ml} + r\big(\delta^2_x U^n_{ml} + \delta^2_y U^n_{ml}\big) \equiv \big(1 + r\delta^2_x + r\delta^2_y\big) U^n_{ml}. \qquad (15.10)$$
Three remarks about notations in (15.9) and (15.10) are in order. First, $r = \kappa/h^2$, as before. Second, we will use the notations $U_{m,l}$ and $U_{ml}$ (i.e. with and without a comma between $m$ and $l$) interchangeably; i.e., they denote the same thing. Third, the operators $\delta^2_x$ and $\delta^2_y$ will be used extensively in this Lecture.

The stencil for the simple explicit scheme (15.10) is shown below. Implementation of this scheme is discussed in a homework problem.

[Stencil: node $(m, l)$ at level $n+1$; nodes $(m, l)$, $(m\pm 1, l)$, $(m, l\pm 1)$ at level $n$.]
LEVEL n+1
Next, we perform the von Neumann stability analysis of scheme (15.10). To this end, we
use the fact that the solution of this constant-coecient dierence equation is satised by the
Fourier harmonics
U
n
ml
=
n
e
imh
e
ilh
, (15.11)
MATH 337, by T. Lakoba, University of Vermont 145
which we substitute into (15.10) to nd the amplication factor . In this calculation, as well as
in many other calculations in the remainder of this Lecture, we will use the following formulae:

2
x
_
e
imh
e
ilh

= 4 sin
2
_
h
2
_
_
e
imh
e
ilh

, (15.12)

2
y
_
e
imh
e
ilh

= 4 sin
2
_
h
2
_
_
e
imh
e
ilh

. (15.13)
(You will be asked to conrm the validity of these formulae in a homework problem.) Substi-
tuting (15.11) into (15.10) and using (15.12) and (15.13), one nds
= 1 4r
_
sin
2
h
2
+ sin
2
h
2
_
. (15.14)
The harmonics most prone to instability are, as for the one-dimensional Heat equation, those
with the highest spatial frequency, and for which
sin
2
h
2
= sin
2
h
2
= 1 .
For these harmonics, the stability condition || 1 implies
r
1
4
or, equivalently,
h
2
4
. (15.15)
Thus, in order to ensure the stability of the simple explicit scheme (15.10), one has to impose a restriction on the time step $\kappa$ that is twice as strong as the analogous restriction in the case of the one-dimensional Heat equation. Therefore, the simple explicit scheme is computationally inefficient, and our next step is, of course, to look for a computationally efficient scheme. As the first candidate for that position, we will analyze the Crank-Nicolson scheme.

15.3 Naive generalization of Crank-Nicolson scheme for the 2D Heat equation

Our main finding in this subsection will be that a naive generalization of the CN method (13.6) is also computationally inefficient. The underlying analysis will allow us to formulate specific properties that a computationally efficient scheme must possess.

The naive generalization to two dimensions of the CN scheme, (13.5) or (13.6), is:
$$U^{n+1}_{ml} = U^n_{ml} + \frac{r}{2}\big(\delta^2_x + \delta^2_y\big)\big(U^n_{ml} + U^{n+1}_{ml}\big), \qquad (15.16)$$
or, equivalently,
$$\left(1 - \frac{r}{2}\delta^2_x - \frac{r}{2}\delta^2_y\right) U^{n+1}_{ml} = \left(1 + \frac{r}{2}\delta^2_x + \frac{r}{2}\delta^2_y\right) U^n_{ml}. \qquad (15.17)$$
Following the lines of Lecture 13, one can show that the accuracy of this scheme is $O(\kappa^2 + h^2)$. Also, the von Neumann analysis yields the following expression for the error amplification factor:
$$\rho = \frac{1 - 2r\left(\sin^2\frac{\beta h}{2} + \sin^2\frac{\gamma h}{2}\right)}{1 + 2r\left(\sin^2\frac{\beta h}{2} + \sin^2\frac{\gamma h}{2}\right)}, \qquad (15.18)$$
so that $|\rho| \le 1$ for any $r$, and hence the CN scheme (15.17) is unconditionally stable.
We will now demonstrate that scheme (15.16)/(15.17) is computationally inefficient. To that end, we need to exhibit the explicit matrix form of that scheme. We begin by rewriting (15.16) in the form$^{26}$:
$$(1 + 2r)U^{n+1}_{m,l} - \frac{r}{2}\big(U^{n+1}_{m+1,l} + U^{n+1}_{m-1,l}\big) - \frac{r}{2}\big(U^{n+1}_{m,l+1} + U^{n+1}_{m,l-1}\big) = (1 - 2r)U^n_{m,l} + \frac{r}{2}\big(U^n_{m+1,l} + U^n_{m-1,l}\big) + \frac{r}{2}\big(U^n_{m,l+1} + U^n_{m,l-1}\big). \qquad (15.19)$$
To write down Eqs. (15.19) for all $m$ and $l$ in a compact form, we will need the following notations:
$$A = \begin{pmatrix} 2r & -r/2 & 0 & \cdots & 0 \\ -r/2 & 2r & -r/2 & \cdots & 0 \\ & \ddots & \ddots & \ddots & \\ 0 & \cdots & -r/2 & 2r & -r/2 \\ 0 & \cdots & 0 & -r/2 & 2r \end{pmatrix}, \qquad
\vec{U}_{;\,l} = \begin{pmatrix} U_{1,l} \\ U_{2,l} \\ \vdots \\ U_{M-2,l} \\ U_{M-1,l} \end{pmatrix}, \qquad (15.20)$$
and
$$\vec{B}_k = \begin{pmatrix} (g_k)_1 \\ (g_k)_2 \\ \vdots \\ (g_k)_{M-1} \end{pmatrix}, \quad k = 2, 3; \qquad
\vec{b}^{\,n}_l = \begin{pmatrix} (g_0)^n_l + (g_0)^{n+1}_l \\ 0 \\ \vdots \\ 0 \\ (g_1)^n_l + (g_1)^{n+1}_l \end{pmatrix}. \qquad (15.21)$$
Using these notations, one can recast Eq. (15.19) in matrix form. Namely, for $l = 2, \ldots, L-2$ (i.e. for layers with constant $y$ which are not adjacent to the boundaries), Eq. (15.19) becomes:
$$(I + A)\vec{U}^{n+1}_{;\,l} - \frac{r}{2} I\,\vec{U}^{n+1}_{;\,l+1} - \frac{r}{2} I\,\vec{U}^{n+1}_{;\,l-1} = (I - A)\vec{U}^{n}_{;\,l} + \frac{r}{2} I\,\vec{U}^{n}_{;\,l+1} + \frac{r}{2} I\,\vec{U}^{n}_{;\,l-1} + \frac{r}{2}\vec{b}^{\,n}_l, \qquad (15.22)$$
where $I$ is the $(M-1)\times(M-1)$ identity matrix. Note that Eq. (15.22) is analogous to Eq. (13.9), although the meaning of the notation $A$ is different in these two equations. Continuing, for the layer with $l = 1$ one obtains:
$$(I + A)\vec{U}^{n+1}_{;\,l} - \frac{r}{2} I\,\vec{U}^{n+1}_{;\,l+1} - \frac{r}{2}\vec{B}^{n+1}_2 = (I - A)\vec{U}^{n}_{;\,l} + \frac{r}{2} I\,\vec{U}^{n}_{;\,l+1} + \frac{r}{2}\vec{B}^{n}_2 + \frac{r}{2}\vec{b}^{\,n}_l. \qquad (15.23)$$
The equation for $l = L-1$ has a similar form. Combining now all these equations into one, we obtain:
$$(\mathbf{I} + \mathbf{A})\vec{U}^{n+1} = (\mathbf{I} - \mathbf{A})\vec{U}^{n} + \mathbf{B}^n, \qquad (15.24)$$
where $\vec{U}$ has been defined in (15.8), $\mathbf{I}$ is the $[(M-1)(L-1)]\times[(M-1)(L-1)]$ identity matrix, and
$$\mathbf{A} = \begin{pmatrix} A & -\frac{r}{2}I & O & \cdots & O \\ -\frac{r}{2}I & A & -\frac{r}{2}I & \cdots & O \\ & \ddots & \ddots & \ddots & \\ O & \cdots & -\frac{r}{2}I & A & -\frac{r}{2}I \\ O & \cdots & O & -\frac{r}{2}I & A \end{pmatrix}, \qquad
\mathbf{B}^n = \frac{r}{2}\begin{pmatrix} \vec{B}^n_2 + \vec{B}^{n+1}_2 + \vec{b}^{\,n}_1 \\ \vec{b}^{\,n}_2 \\ \vdots \\ \vec{b}^{\,n}_{L-2} \\ \vec{B}^n_3 + \vec{B}^{n+1}_3 + \vec{b}^{\,n}_{L-1} \end{pmatrix}. \qquad (15.25)$$
$^{26}$ Recall our convention to use the notations $U_{ml}$ and $U_{m,l}$ interchangeably.
In (15.25), $O$ stands for the $(M-1)\times(M-1)$ zero matrix; hopefully, the use of the same character here and in the O-symbol (e.g., $O(h^2)$) will not cause any confusion.

Now, the $[(M-1)(L-1)]\times[(M-1)(L-1)]$ matrix $\mathbf{A}$ in (15.25) is block-tridiagonal, but not tridiagonal. Namely, it has only 5 nonzero diagonals or subdiagonals, but the outer subdiagonals are not located next to the inner subdiagonals; they are separated from them by a band of zeros, with the band's width being $(M-2)$. Thus, the total width of the central nonzero band in matrix $\mathbf{A}$ is $2(M-2)+3$. Inverting such a matrix is not a computationally efficient process in the sense that it will require not $O(ML)$, but $O(ML)^2$ or $O(ML)^3$ operations. In other words, the number of operations required to solve Eq. (15.24) is much greater than the number of unknowns.$^{27}$

Let us summarize what we have established about the CN method (15.17) for the 2D Heat equation. The method: (i) has accuracy $O(\kappa^2 + h^2)$, (ii) is unconditionally stable, but (iii) requires many more operations per time step than the number of unknown variables. We are satisfied with features (i) and (ii), but not with (iii). In the remainder of this Lecture, we will be concerned with constructing methods that do not have the deficiency stated in (iii). For reference purposes, we will now repeat the properties that we want our "dream scheme" to have.

In order to be considered computationally efficient, the scheme:
(i) must have accuracy $O(\kappa^2 + h^2)$ (or better);
(ii) must be unconditionally stable;
(iii) must require a number of operations per time step that is proportional to the number of the unknowns.
(15.26)

In the next subsection, we will set the ground for obtaining such schemes.
15.4 Derivation of a computationally efficient scheme

In this section, we will derive a scheme which we will use later on to obtain methods that satisfy all three conditions (15.26). Specifically, we pose the problem as follows: Find a scheme that (a) reduces to the Crank-Nicolson scheme (13.6) in the case of the one-dimensional Heat equation and (b) has the same order of truncation error, i.e. $O(\kappa^2 + h^2)$; or, in other words, satisfies property (i) of (15.26). Of course, there are many (probably, infinitely many) such schemes. A significant contribution by computational scientists in the 1950s was finding, among those schemes, the ones which are unconditionally stable (property (ii)) and could be implemented in a time-efficient manner (property (iii)). In the remainder of this section, we

$^{27}$ One might have reasoned that, since $\mathbf{A}$ in (15.25) is block-tridiagonal, then one could solve Eq. (15.24) by the block-Thomas algorithm. This well-known generalization of the Thomas algorithm presented in Lecture 8 assumes that the coefficients $a_k$, $b_k$, $c_k$ and $\alpha_k$, $\beta_k$ in (8.18) and (8.19) are $(M-1)\times(M-1)$ square matrices. Then formulae (8.21)–(8.23) of the Thomas algorithm are straightforwardly generalized by assigning the matrix sense to all the operations in those formulae.
However, this naive idea of being able to solve (15.24) by the block-Thomas algorithm does not work. Indeed, consider the defining equation for $\alpha_2$ in (8.21). It involves $\alpha_1^{-1}$. While matrix $\alpha_1 = b_1$ is tridiagonal, its inverse $\alpha_1^{-1}$ is full. Hence $\alpha_2$ is also a full matrix. Then, by the last equation in (8.21), all subsequent $\alpha_k$'s are also full matrices. But then finding the inverse of each $\alpha_k$ in (8.21)–(8.23) would require $O(M^3)$ operations, and this would have to be repeated $O(L)$ times. Thus, the total operation count in this naive approach is $O(M^3 L)$, which renders the approach computationally inefficient.
will concentrate on the derivation of a scheme, alternative to (15.17), that has property (i). We postpone the discussion of the implementation of that scheme, as well as the demonstration of the unconditional stability of such implementations, until the next section.

Since we want to obtain a scheme that reduces to the Crank-Nicolson method for the one-dimensional Heat equation, it is natural to start with its naive 2D generalization, scheme (15.17). Now, note the following: When applied to the solution of the discretized equation, the operators $\frac{1}{\kappa}\delta_t$, $\frac{1}{h^2}\delta^2_x$, and $\frac{1}{h^2}\delta^2_y$ (see (15.9)) produce quantities of order $O(1)$ (that is, not $O(\kappa)$, $O(\kappa^{-1})$, or anything else):
$$\frac{\delta^2_x}{h^2}\, U^n_{ml} = O(1), \qquad \frac{\delta_t}{\kappa}\, U^n_{ml} = O(1), \qquad \frac{\delta_t}{\kappa}\,\frac{\delta^2_x}{h^2}\, U^n_{ml} = O(1), \quad\text{etc.} \qquad (15.27)$$
Before we proceed with the derivation, we will pause and make a number of comments about handling operators in equations. Note that the operators mentioned before (15.27) are simply the discrete analogues of the continuous operators $\partial/\partial t$, $\partial^2/\partial x^2$, and $\partial^2/\partial y^2$, respectively. In the discrete case, the latter two operators become matrices; for example, in the one-dimensional case, the operator $\delta^2_x$ coincides with matrix $A$ in (13.10)$^{28}$. Therefore, when reading about, or writing yourself, formulae involving operators (which you will have to do extensively in the remainder of this Lecture), think of the latter as matrices. From this simple observation there follows an important practical conclusion: If a formula involves a product of two operators, the order of the operators in the product must not be arbitrarily changed, because different operators, in general, do not commute. This is completely analogous to the fact that for two matrices $A$ and $B$,
$$AB \ne BA \quad\text{in general.}$$
(But, of course,
$$A + B = B + A,$$
and the same is true about any two operators.)

We conclude this detour about operator notations with two remarks.

Remark 1 One can show that the operators $\delta^2_x$ and $\delta^2_y$ actually do commute, as do their continuous prototypes. However, we will not use this fact in our derivation, so that the latter remains valid for more general operators that do not necessarily commute.

Remark 2 Any two operators which are (arbitrary) functions of the same "primordial" operator commute. That is, if $\mathcal{O}$ is any operator and $f(\cdot)$ and $g(\cdot)$ are any two functions, then
$$f(\mathcal{O})\, g(\mathcal{O}) = g(\mathcal{O})\, f(\mathcal{O}). \qquad (15.28)$$
For example,
$$\big(a + b\,\delta^2_x\big)\big(c + d\,\delta^2_x\big) = \big(c + d\,\delta^2_x\big)\big(a + b\,\delta^2_x\big) \qquad (15.29)$$
for any scalars $a$, $b$, $c$, $d$.
We now return to the derivation of a suitable modification of (15.17). From (15.27) it follows that
$$\frac{\delta^2_x}{h^2}\,\frac{\delta^2_y}{h^2}\, U^n_{ml} = O(1), \qquad\text{and so, for instance,}\qquad \frac{\kappa^2}{4}\,\frac{\delta^2_x}{h^2}\,\frac{\delta^2_y}{h^2}\;\frac{U^{n+1}_{ml} - U^n_{ml}}{\kappa} = O(\kappa^2). \qquad (15.30)$$

$^{28}$ In the two-dimensional case, the matrices for $\delta^2_x$ and $\delta^2_y$ are more complicated and depend on the order in which the grid points are arranged into the vector $\vec{U}$. Fortunately for us, we will not require the corresponding explicit forms of $\delta^2_x$ and $\delta^2_y$.
The accuracy of scheme (15.17) is $O(\kappa^2 + h^2)$, and therefore we can add to it any term of the same order without changing the accuracy of the scheme. Let us use this observation and add the term appearing on the l.h.s. of the second equation in (15.30) to the l.h.s. of scheme (15.17), both of whose sides are divided by $\kappa$. The result is:
$$\frac{U^{n+1}_{ml} - U^n_{ml}}{\kappa} + \frac{\kappa^2}{4}\,\frac{\delta^2_x}{h^2}\,\frac{\delta^2_y}{h^2}\;\frac{U^{n+1}_{ml} - U^n_{ml}}{\kappa} = \frac{1}{2h^2}\big(\delta^2_x + \delta^2_y\big)\big(U^{n+1}_{ml} + U^n_{ml}\big). \qquad (15.31)$$
Note that scheme (15.31) still has the accuracy $O(\kappa^2 + h^2)$.
Next, we rewrite the last equation in the equivalent form:
$$\left(1 - \frac{r}{2}\delta^2_x - \frac{r}{2}\delta^2_y + \frac{r}{2}\delta^2_x\,\frac{r}{2}\delta^2_y\right) U^{n+1}_{ml} = \left(1 + \frac{r}{2}\delta^2_x + \frac{r}{2}\delta^2_y + \frac{r}{2}\delta^2_x\,\frac{r}{2}\delta^2_y\right) U^{n}_{ml}. \qquad (15.32)$$
The operator expressions on both sides of the above equation can be factored, resulting in
$$\left(1 - \frac{r}{2}\delta^2_x\right)\left(1 - \frac{r}{2}\delta^2_y\right) U^{n+1}_{ml} = \left(1 + \frac{r}{2}\delta^2_x\right)\left(1 + \frac{r}{2}\delta^2_y\right) U^{n}_{ml}. \qquad (15.33)$$
Note that when factoring the operator expressions, we did not change the order of the operators in their product.
Scheme (15.33) is the main result of this section. In the next section, we will show how this scheme can be implemented in a time-efficient manner. The methods that do so are called the Alternating Direction Implicit (ADI) methods. Here we preview the basic idea common to all of them. Namely, the computations are split into 2 (for the 2D case; 3, for the 3D case) steps. In the first step, one applies an implicit method in the x-direction and an explicit method in the y-direction, producing an intermediate solution. The operations count for this step is as follows: One needs to solve $(L-1)$ tridiagonal $(M-1)\times(M-1)$ systems; this can be done with $O(ML)$ operations. In the second step, one applies an implicit method in the y-direction and an explicit method in the x-direction, which can also be implemented with $O(ML)$ operations. Hence the total operations count is also $O(ML)$.

15.5 Alternating Direction Implicit methods
Peaceman–Rachford method

For this ADI method, the two steps mentioned at the end of the previous section are implemented as follows:
$$\text{(a):}\quad \left(1 - \frac{r}{2}\delta^2_x\right)\bar{U}_{ml} = \left(1 + \frac{r}{2}\delta^2_y\right) U^{n}_{ml}, \qquad
\text{(b):}\quad \left(1 - \frac{r}{2}\delta^2_y\right) U^{n+1}_{ml} = \left(1 + \frac{r}{2}\delta^2_x\right)\bar{U}_{ml}. \qquad (15.34)$$
Let us first show that this method is equivalent to (15.33). This will imply that it satisfies property (i) of the "dream scheme" conditions (15.26). Indeed, let us apply the operator $\left(1 - \frac{r}{2}\delta^2_x\right)$ to both sides of (15.34b). Then we obtain the following sequence of equations:
$$\left(1 - \frac{r}{2}\delta^2_x\right)\left(1 - \frac{r}{2}\delta^2_y\right) U^{n+1}_{ml}
= \left(1 - \frac{r}{2}\delta^2_x\right)\left(1 + \frac{r}{2}\delta^2_x\right)\bar{U}_{ml}
\;\overset{(15.29)}{=}\; \left(1 + \frac{r}{2}\delta^2_x\right)\left(1 - \frac{r}{2}\delta^2_x\right)\bar{U}_{ml}
\;\overset{(15.34\text{a})}{=}\; \left(1 + \frac{r}{2}\delta^2_x\right)\left(1 + \frac{r}{2}\delta^2_y\right) U^{n}_{ml}, \qquad (15.35)$$
which proves that (15.34) is equivalent to (15.33).
It is easy to see that the Peaceman–Rachford method (15.34) possesses property (iii) of (15.26), i.e. is computationally efficient. Indeed, in order to compute each of the $L-1$ sub-vectors
$$\vec{\bar{U}}_{;\,l} = \left[\bar{U}_{1,l},\, \bar{U}_{2,l},\, \ldots,\, \bar{U}_{M-2,l},\, \bar{U}_{M-1,l}\right]^T \qquad (15.36)$$
of the intermediate solution $\bar{U}_{ml}$, one needs to solve a tridiagonal $(M-1)\times(M-1)$ system given by Eq. (15.34a) for each $l$. Thus, the step described by (15.34a) requires $O(ML)$ operations. Specifically, for $l = 2, \ldots, L-2$ (i.e. away from the boundaries), such a system has the form
$$\left(1 - \frac{r}{2}\delta^2_x\right)\vec{\bar{U}}_{;\,l} = \vec{U}^{n}_{;\,l} + \frac{r}{2}\left(\vec{U}^{n}_{;\,l+1} - 2\vec{U}^{n}_{;\,l} + \vec{U}^{n}_{;\,l-1}\right), \qquad 2 \le l \le L-2, \qquad (15.37)$$
where $\vec{U}^{n}_{;\,l}$ is defined in (15.20). The counterpart of (15.37) for the boundary rows (with $l = 1$ and $l = L-1$) will be given in the next section. Continuing, the operator $\delta^2_x$ in (15.37) is an $(M-1)\times(M-1)$ tridiagonal matrix, whose specific form depends on the boundary conditions and will be discussed in the next section. Note that the operator $\delta^2_y$ on the r.h.s. of (15.34a) is not a matrix. Indeed, if it were a matrix, it would have been $(L-1)\times(L-1)$, because the discretization along the $y$-direction contains $L-1$ inner (i.e. non-boundary) points. However, it would then have been impossible to multiply such a matrix with the $(M-1)$-component vectors $\vec{U}^{n}_{;\,l}$. Therefore, in (15.34a), $\delta^2_y$ is interpreted not as a matrix but as the operation of addition and subtraction of the vectors $\vec{U}^{n}_{;\,l}$, as shown on the r.h.s. of (15.37).
, as shown on the r.h.s. of (15.37).
Similarly, after all components of the intermediate solution have been determined, it remains
to solve (M 1) equations (15.34b) for the unknown vectors

U
m;
= [U
m,1
, U
m,2
. . . , U
m,L2
, U
m,L1
]
T
, m = 1, . . . , M 1 . (15.38)
Each of these equations is an (L 1) (L 1) tridiagonal system of the form
_
1
r
2

2
y
_

U
n+1
m;
=

Um;
+
r
2
_

Um+1;
2

Um;
+

Um1;
_
, 1 m M 1 , (15.39)
where

Um;
are dened similarly to

U
m;
. Note that now the interpretations of operators
2
x
and

2
y
have interchanged. Namely, the
2
y
on the l.h.s. of (15.39) is an (L 1) (L 1) matrix,
while the
2
x
has to be interpreted as an operation of addition and subtraction of (L 1)-
component vectors

Um;
. The solution of M 1 tridiagonal systems (15.39), and hence the
implementation of step (15.34b), requires O(ML) operations, and thus the total operations
count for the PeacemanRachford method is O(ML).
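A minimal MATLAB sketch of one Peaceman–Rachford step is given below. It assumes homogeneous Dirichlet boundary conditions (so that the boundary values of both $U$ and the intermediate solution $\bar{U}$ are zero; the correct boundary values of $\bar{U}$ for general boundary conditions are discussed in the next section), and all names are illustrative. For simplicity the tridiagonal matrices are built as full matrices; in practice one would use sparse storage or the Thomas algorithm of Lecture 8.

% one Peaceman-Rachford step (15.34); U is an (M+1)-by-(L+1) matrix of U^n_{ml}
r   = kappa/h^2;
M   = size(U,1) - 1;   L = size(U,2) - 1;
m   = 2:M;   ell = 2:L;                      % interior nodes
tri = @(n) (1+r)*eye(n) - (r/2)*( diag(ones(n-1,1),1) + diag(ones(n-1,1),-1) );
Ax  = tri(M-1);                              % matrix form of (1 - (r/2)*delta^2_x)
Ay  = tri(L-1);                              % matrix form of (1 - (r/2)*delta^2_y)
% step (15.34a): implicit in x, explicit in y -- one tridiagonal solve per row l
RHSa = U(m,ell) + (r/2)*( U(m,ell+1) - 2*U(m,ell) + U(m,ell-1) );
Ubar = zeros(size(U));   Ubar(m,ell) = Ax \ RHSa;      % all l solved at once
% step (15.34b): implicit in y, explicit in x -- one tridiagonal solve per column m
RHSb = Ubar(m,ell) + (r/2)*( Ubar(m+1,ell) - 2*Ubar(m,ell) + Ubar(m-1,ell) );
U(m,ell) = (Ay \ RHSb.').';                            % all m solved at once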
Finally, it remains to show that the Peaceman–Rachford method is unconditionally stable, i.e. has property (ii) of (15.26). This can be done as follows. Equations (15.34) have constant (in $x$ and $y$) coefficients and hence their solution can be sought in the form:
$$U^{n}_{ml} = \rho^n e^{i\beta m h} e^{i\gamma l h}, \qquad \bar{U}_{ml} = \bar{\rho}\,\rho^n e^{i\beta m h} e^{i\gamma l h}. \qquad (15.40)$$
Substituting (15.40) into (15.34) and using (15.12) and (15.13), one obtains:
$$\bar{\rho} = \frac{1 - Y}{1 + X}, \qquad \rho = \bar{\rho}\,\frac{1 - X}{1 + Y} = \frac{1 - X}{1 + X}\cdot\frac{1 - Y}{1 + Y}, \qquad (15.41)$$
where we have introduced two more shorthand notations:
$$X = 2r\sin^2\frac{\beta h}{2}, \qquad Y = 2r\sin^2\frac{\gamma h}{2}. \qquad (15.42)$$
From the second of Eqs. (15.41) it follows that $|\rho| \le 1$ for all harmonics (i.e., for all $\beta$ and $\gamma$), because
$$\left|\frac{1 - X}{1 + X}\right| \le 1 \qquad\text{for all } X \ge 0. \qquad (15.43)$$
This shows that the Peaceman–Rachford method for the 2D Heat equation is unconditionally stable. Altogether, the above has shown that this method satisfies all three conditions (15.26) of a "dream scheme".
A drawback of the Peaceman–Rachford method is that its generalization to 3 spatial dimensions is no longer unconditionally stable. Below we provide a sketch of the proof of this statement. For the 3D Heat equation
$$u_t = u_{xx} + u_{yy} + u_{zz}, \qquad (15.44)$$
the generalization of the Peaceman–Rachford method is:
$$\text{(a):}\quad \left(1 - \frac{r}{3}\delta^2_x\right)\bar{U}_{mlj} = \left(1 + \frac{r}{3}\delta^2_y + \frac{r}{3}\delta^2_z\right) U^{n}_{mlj},$$
$$\text{(b):}\quad \left(1 - \frac{r}{3}\delta^2_y\right)\bar{\bar{U}}_{mlj} = \left(1 + \frac{r}{3}\delta^2_x + \frac{r}{3}\delta^2_z\right)\bar{U}_{mlj},$$
$$\text{(c):}\quad \left(1 - \frac{r}{3}\delta^2_z\right) U^{n+1}_{mlj} = \left(1 + \frac{r}{3}\delta^2_x + \frac{r}{3}\delta^2_y\right)\bar{\bar{U}}_{mlj}, \qquad (15.45)$$
where $\delta^2_z$ is defined similarly to $\delta^2_x$ and $\delta^2_y$. Substituting into (15.45) the ansatze
$$U^{n}_{mlj} = \rho^n e^{i\beta m h} e^{i\gamma l h} e^{i\zeta j h}, \qquad \bar{U}_{mlj} = \bar{\rho}\,\rho^n e^{i\beta m h} e^{i\gamma l h} e^{i\zeta j h}, \qquad \bar{\bar{U}}_{mlj} = \bar{\bar{\rho}}\,\rho^n e^{i\beta m h} e^{i\gamma l h} e^{i\zeta j h}, \qquad (15.46)$$
where
2
z
is dened similarly to
2
x
and
2
y
. Substituting into (15.45) the ansatze
U
n
mlj
=
n
e
imh
e
ilh
e
ijh
,

Umlj
=


n
e
imh
e
ilh
e
ijh
,

Umlj
=


n
e
imh
e
ilh
e
ijh
,
(15.46)
one obtains, similarly to (15.41):
=
_
1
2
3
(Y + Z)
_ _
1
2
3
(X + Z)
_ _
1
2
3
(X + Y )
_
_
1 +
2
3
X
_ _
1 +
2
3
Y
_ _
1 +
2
3
Z
_ , (15.47)
where X and Y have been dened in (15.42) and Z is dened similarly. The amplication
factor (15.47) is not always less than 1 in magnitude. For example, when X, Y , and Z are all
large numbers (and hence r is large), the value of the amplication factor is 8 (you will
MATH 337, by T. Lakoba, University of Vermont 152
be asked to verify this in a QSA), and hence the 3D PeacemanRachford method (15.45) is not
unconditionally stable.
An alternative ADI method that has an unconditionally stable generalization to 3 spatial
dimensions is described next.
Douglas method, a.k.a. Douglas–Gunn method$^{29}$

The equations of this method are:
$$\text{(a):}\quad \left(1 - \frac{r}{2}\delta^2_x\right)\bar{U}_{ml} = \left(1 + \frac{r}{2}\delta^2_x + r\delta^2_y\right) U^{n}_{ml}, \qquad
\text{(b):}\quad \left(1 - \frac{r}{2}\delta^2_y\right) U^{n+1}_{ml} = \bar{U}_{ml} - \frac{r}{2}\delta^2_y U^{n}_{ml}. \qquad (15.48)$$
Let us now demonstrate that all three properties (15.26) hold for the Douglas method.

To demonstrate property (i), it is sufficient to show that (15.48) is equivalent to scheme (15.33). One can do so following the idea(s) of (15.35); you will be asked to provide the details in a homework problem.

To demonstrate property (ii), one proceeds similarly to the lines of (15.40) and (15.41). Namely, substituting (15.40) into (15.48) and using Eqs. (15.12), (15.13), and (15.42), one finds:
$$\bar{\rho} = \frac{1 - X - 2Y}{1 + X}, \qquad \rho = \frac{\bar{\rho} + Y}{1 + Y} = \frac{1 - X}{1 + X}\cdot\frac{1 - Y}{1 + Y}. \qquad (15.49)$$
Thus, the amplification factor for the Douglas method in 2D is the same as that of the Peaceman–Rachford method, and hence the Douglas method is unconditionally stable in 2D.

Finally, property (iii) for the Douglas method is established in complete analogy with how that was done for the Peaceman–Rachford method (see the text around Eqs. (15.36)–(15.39)).
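The algebra leading from the middle expression in (15.49) to the product form is easy to spot-check numerically; a two-line MATLAB check (with randomly chosen nonnegative X and Y) is:

% spot-check of (15.49): the two expressions for rho coincide
X = 2*rand;  Y = 2*rand;
rho_bar = (1 - X - 2*Y)/(1 + X);
abs( (rho_bar + Y)/(1 + Y) - (1-X)/(1+X)*(1-Y)/(1+Y) )   % ~ 1e-16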
Let us now show that the generalization of the Douglas method to 3D is also unconditionally
stable. The corresponding equations have the form:

(a):  (1 − (r/2) δ_x²) \hat{U}_{mlj} = (1 + (r/2) δ_x² + r δ_y² + r δ_z²) U^n_{mlj},
(b):  (1 − (r/2) δ_y²) \bar{U}_{mlj} = \hat{U}_{mlj} − (r/2) δ_y² U^n_{mlj},
(c):  (1 − (r/2) δ_z²) U^{n+1}_{mlj} = \bar{U}_{mlj} − (r/2) δ_z² U^n_{mlj}.    (15.50)

Using the von Neumann analysis, one can show that the amplification factor for (15.50) is

ρ = 1 − 2(X + Y + Z) / [(1 + X)(1 + Y)(1 + Z)],    (15.51)

so that, clearly, ρ ≤ 1. Using techniques from multivariable Calculus, it is easy to show that
also ρ ≥ −1, and hence the 3D Douglas method is unconditionally stable.
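To complement that argument, here is a small numerical check (an addition to the notes) that
(15.51) stays within [−1, 1] over a sampled range of X, Y, Z ≥ 0.

% Sample the 3D Douglas amplification factor (15.51) and record its extreme values.
v = linspace(0, 500, 80);
[X, Y, Z] = ndgrid(v, v, v);
rho = 1 - 2*(X + Y + Z)./((1 + X).*(1 + Y).*(1 + Z));
fprintf('min rho = %g, max rho = %g\n', min(rho(:)), max(rho(:)));  % both within [-1, 1]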
^{29} This method was proposed by J. Douglas for the two- and three-dimensional Heat equation in [On the
numerical integration of u_xx + u_yy = u_t by implicit methods, J. Soc. Indust. Appl. Math. 3, 42–65 (1955)]
and in [Alternating direction methods for three space variables, Numerische Mathematik 4, 41–63 (1962)]. A
general form of such methods was discussed by J. Douglas and J. Gunn in [A general formulation of alternating
direction methods, I. Parabolic and hyperbolic problems, Numerische Mathematik 6, 428–453 (1964)].
To conclude this subsection, we mention two more methods for the 2D Heat equation.

Dyakonov method

The equations of this method are

(a):  (1 − (r/2) δ_x²) \hat{U}_{ml} = (1 + (r/2) δ_x²)(1 + (r/2) δ_y²) U^n_{ml},
(b):  (1 − (r/2) δ_y²) U^{n+1}_{ml} = \hat{U}_{ml}.    (15.52)

One can show, similarly to how that was done for the Peaceman–Rachford and Douglas methods,
that the Dyakonov method possesses all three properties (15.26).
Fairweather–Mitchell scheme

This scheme is

(1 − θ_0 δ_x²)(1 − θ_0 δ_y²) U^{n+1}_{ml} = (1 + (r − θ_0) δ_x²)(1 + (r − θ_0) δ_y²) U^n_{ml},    θ_0 = r/2 − 1/12.    (15.53)

This scheme improves scheme (15.33) in the same manner in which the Crandall method
(13.17)+(13.19) for the 1D Heat equation improves the Crank–Nicolson method. Consequently,
its accuracy is O(τ² + h⁴), and the scheme is stable. As far as implementing this scheme in a
time-efficient manner is concerned, this can be done straightforwardly by using suitable
modifications of the Peaceman–Rachford or Dyakonov methods.
Generalizations

In Appendix 1 we will present two important generalizations.

First, we will take another look at the Douglas method (15.48) and thereby observe its
relation to the predictor-corrector methods considered in Lecture 3 and to the IMEX methods
mentioned in Lecture 14.

Second, we will show how one can construct an unconditionally stable method whose global
error is of the order O(τ² + h²) for a parabolic-type equation with a mixed-derivative term,
e.g.:

u_t = a^{(xx)} u_xx + a^{(xy)} u_xy + a^{(yy)} u_yy;    (15.54)

here a^{(xx)}, etc., are coefficients, and the mixed-derivative term is the middle one on the r.h.s. Our
construction will utilize the first generalization considered in Appendix 1. It is worth pointing
out that the construction of a scheme with the aforementioned properties for (15.54) was not a trivial
problem. This is attested by the fact that it was solved more than 30 years after the pioneering
works by Peaceman, Rachford, Douglas, and others on the Heat equation (15.1). The paper^{30}
where this problem was solved is posted on the course website.

A good reference on finite difference methods in two and three spatial dimensions is the book
by A.R. Mitchell and G.F. Griffiths, The Finite Difference Method in Partial Differential
Equations (Wiley, 1980).

^{30} I.J.D. Craig and A.D. Sneyd, An alternating-direction implicit scheme for parabolic equations with mixed
derivatives, Computers and Mathematics with Applications 16(4), 341–350 (1988).
15.6 Boundary conditions for the ADI methods

Here we will show how to prescribe boundary conditions for the intermediate solution \hat{U}_{ml}
appearing in the ADI methods considered above. We will do so for the Dirichlet boundary
conditions (15.3) and (15.4) and for the Neumann boundary conditions

u_x(0, y, t) = g_0(y, t),   u_x(1, y, t) = g_1(y, t),   0 ≤ y ≤ Y,  t ≥ 0;    (15.55)
u_y(x, 0, t) = g_2(x, t),   u_y(x, Y, t) = g_3(x, t),   0 ≤ x ≤ 1,  t ≥ 0.    (15.56)

The corresponding generalizations for the mixed boundary conditions (14.3) can be obtained
straightforwardly. Note that the counterpart of the matching conditions (15.5) between the
boundary conditions on the one hand and the initial condition on the other, for the Neumann
boundary conditions, has the form:

g_0(y, 0) = (u_0)_x(0, y),   g_1(y, 0) = (u_0)_x(1, y),
g_2(x, 0) = (u_0)_y(x, 0),   g_3(x, 0) = (u_0)_y(x, Y).    (15.57)

The counterpart of the requirement (15.6) that the boundary conditions match at the corners
of the domain follows from the relation u_xy(x, y) = u_yx(x, y) and has the form:

(g_0)_y(0, t) = (g_2)_x(0, t),   (g_0)_y(Y, t) = (g_3)_x(0, t),
(g_3)_x(1, t) = (g_1)_y(Y, t),   (g_1)_y(0, t) = (g_2)_x(1, t),   for t > 0.    (15.58)
Peaceman–Rachford method

Dirichlet boundary conditions

Note that in order to solve Eq. (15.34a) for the \hat{U}_{ml} with 1 ≤ {m, l} ≤ {(M−1), (L−1)}, one requires
the values of \hat{U}_{0,l} and \hat{U}_{M,l} with 1 ≤ l ≤ L−1. The corresponding nodes are shown as open circles in
the figure on the right. Note that one does not need the other boundary values, \hat{U}_{m,0} and \hat{U}_{m,L},
to solve (15.34b), because the l.h.s. of the latter equation is only defined for 1 ≤ l ≤ L−1. Hence, one does
not need (and cannot determine) the values \hat{U}_{m,0} and \hat{U}_{m,L} in the Peaceman–Rachford method.

[Figure: the computational rectangle with boundary data g_0 along x = 0 (m = 0), g_1 along x = 1
(m = M), g_2 along y = 0, and g_3 along y = Y; open circles mark the nodes where \hat{U}_{0,l} and
\hat{U}_{M,l} are required.]

Thus, how does one find the required boundary values \hat{U}_{0,l} and \hat{U}_{M,l}? To answer this
question, note that the term in (15.34a) that produces \hat{U}_{0,l} and \hat{U}_{M,l} is (r/2) δ_x² \hat{U}_{ml} (with m = 1
and m = M−1). Let us then eliminate this term using both Eqs. (15.34). The most convenient
way to do so is to rewrite these equations in an equivalent form:

(1 − (r/2) δ_x²) \hat{U}_{ml} = (1 + (r/2) δ_y²) U^n_{ml},
(1 + (r/2) δ_x²) \hat{U}_{ml} = (1 − (r/2) δ_y²) U^{n+1}_{ml},

and then add them. The result is:

\hat{U}_{ml} = (1/2) (U^n_{ml} + U^{n+1}_{ml}) + (r/4) δ_y² (U^n_{ml} − U^{n+1}_{ml}).    (15.59)

Now, this equation, unlike (15.34a), can be evaluated for m = 0 and m = M, yielding

\hat{U}_{{0,M}, l} = (1/2) (U^n_{{0,M}, l} + U^{n+1}_{{0,M}, l}) + (r/4) δ_y² (U^n_{{0,M}, l} − U^{n+1}_{{0,M}, l})
 = (1/2) [ (g_{{0,1}})^n_l + (g_{{0,1}})^{n+1}_l ] + (r/4) δ_y² [ (g_{{0,1}})^n_l − (g_{{0,1}})^{n+1}_l ]    (15.60)
 = (1/2) [ (g_{{0,1}})^n_l + (g_{{0,1}})^{n+1}_l ]
   + (r/4) { [ (g_{{0,1}})^n_{l+1} − 2(g_{{0,1}})^n_l + (g_{{0,1}})^n_{l−1} ] − [ (g_{{0,1}})^{n+1}_{l+1} − 2(g_{{0,1}})^{n+1}_l + (g_{{0,1}})^{n+1}_{l−1} ] }
 ≡ (G_{{0,1}})_l,    1 ≤ l ≤ L−1.
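For concreteness, a short sketch of how (15.60) might be evaluated in code is given below; the
function handle g0 and the workspace variables r, tau, h, L, n are illustrative assumptions, not
part of the notes.

% Boundary values (G_0)_l of the intermediate solution at x = 0, Eq. (15.60).
% g0(y,t) is an assumed Dirichlet boundary function returning a column the size of y.
y   = (0:L)'*h;                                 % y-coordinates of the boundary nodes
gn  = g0(y, n*tau);   gnp = g0(y, (n+1)*tau);   % boundary data at levels n and n+1
li  = 2:L;                                      % interior l = 1,...,L-1 (Matlab is 1-based)
d2  = @(v) v(li+1) - 2*v(li) + v(li-1);         % second difference in l
G0  = 0.5*(gn(li) + gnp(li)) + (r/4)*(d2(gn) - d2(gnp));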
It is now time to complete the discussion about the implementations of the operators (r/2) δ_x²
and (r/2) δ_y² in each of the equations (15.34). Recall that we started this discussion after Eq. (15.36),
but were unable to complete it then because we did not have the information about boundary
conditions. Let us begin with Eq. (15.34a). There, operator (r/2) δ_x² is the following (M−1) × (M−1)
matrix:

On the l.h.s. of (15.34a):

(r/2) δ_x² = [ −r    r/2    0    …    0
               r/2   −r    r/2   …    0
                …     …     …    …    …
                0     …    r/2   −r   r/2
                0     …     0    r/2  −r ],    (15.61)

which has been obtained in analogy with matrix A in (13.10). Operator δ_y² should be interpreted
not as a matrix but as an operation of adding and subtracting (M−1)-component vectors,
as was shown on the r.h.s. of (15.37). Below we present the generalization of (15.37) for the
boundary rows l = 1 and l = L−1 (here U^n_{•; l} denotes the column vector (U^n_{1,l}, …, U^n_{M−1,l})^T):

On the r.h.s. of (15.34a):
(1 + (r/2) δ_y²) U^n_{•; l} = U^n_{•; l} + (r/2) (U^n_{•; l+1} − 2 U^n_{•; l} + U^n_{•; l−1}),   2 ≤ l ≤ L−2;
(1 + (r/2) δ_y²) U^n_{•; 1} = U^n_{•; 1} + (r/2) (U^n_{•; 2} − 2 U^n_{•; 1} + B^n_2),
(1 + (r/2) δ_y²) U^n_{•; L−1} = U^n_{•; L−1} + (r/2) (B^n_3 − 2 U^n_{•; L−1} + U^n_{•; L−2}),    (15.62)

where B_2 and B_3 have been defined in (15.21). Note that the r.h.s. of (15.34a) also contains
terms contributed by \hat{U}_{{0,M}, l}, as we will explicitly show shortly.

In Eq. (15.34b), as has been mentioned earlier, the interpretations of δ_x² and δ_y² are reversed.
Namely, now (r/2) δ_y² has the form given by the r.h.s. of (15.61); the dimension of this matrix is
(L−1) × (L−1). Operator δ_x² is implemented as was shown on the r.h.s. of (15.39); below we
show its form for completeness of the presentation:

On the r.h.s. of (15.34b):
(1 + (r/2) δ_x²) \hat{U}_{m; •} = \hat{U}_{m; •} + (r/2) (\hat{U}_{m+1; •} − 2 \hat{U}_{m; •} + \hat{U}_{m−1; •}),   1 ≤ m ≤ M−1,    (15.63)

where \hat{U}_{m; •} denotes the column vector (\hat{U}_{m,1}, …, \hat{U}_{m,L−1})^T.
Let us now summarize the above steps in the form of an algorithm.

Algorithm for solving the 2D Heat equation with Dirichlet boundary conditions
by the Peaceman–Rachford method:

The following steps need to be performed inside the loop over n (i.e., advancing in time).
Suppose that the solution has been computed at the nth time level.

Step 1 (set up boundary conditions):
Define the boundary conditions at the (n+1)th time level:

U^{n+1}_{{0,M}, l} = (g_{{0,1}})^{n+1}_l,  0 ≤ l ≤ L;    U^{n+1}_{m, {0,L}} = (g_{{2,3}})^{n+1}_m,  1 ≤ m ≤ M−1.    (15.64)

(Recall that the boundary conditions match at the corners; see (15.6).)
Next, determine the necessary boundary values of the intermediate solution:

\hat{U}_{{0,M}, l} = (G_{{0,1}})_l,   1 ≤ l ≤ L−1,    (15.65)

where (G_{{0,1}})_l are defined in (15.60).

Step 2:
For each l = 1, …, L−1, solve the tridiagonal system

(1 − (r/2) δ_x²) \hat{U}_{•; l} = U^n_{•; l} + (r/2) (U^n_{•; l+1} − 2 U^n_{•; l} + U^n_{•; l−1}) + (r/2) b_{•; l},   1 ≤ l ≤ L−1,    (15.66)

where (r/2) δ_x² is an (M−1) × (M−1) matrix of the form (15.61),

\hat{U}_{•; l} = (\hat{U}_{1,l}, \hat{U}_{2,l}, …, \hat{U}_{M−2,l}, \hat{U}_{M−1,l})^T,
U^n_{•; l} = (U^n_{1,l}, U^n_{2,l}, …, U^n_{M−2,l}, U^n_{M−1,l})^T,
b_{•; l} = ((G_0)_l, 0, …, 0, (G_1)_l)^T      (cf. (15.20) and (15.36)).

Note that U^n_{•; 0} ≡ B^n_2 and U^n_{•; L} ≡ B^n_3 are determined from the boundary conditions on the nth
time level.

Thus, combining the results of (15.64), (15.65), and (15.66), one has the following values of
the intermediate solution:

\hat{U}_{m,l}  for  0 ≤ m ≤ M,  1 ≤ l ≤ L−1.

Step 3:
The solution U^{n+1}_{m,l} with 1 ≤ {m, l} ≤ {M−1, L−1} is then determined from

(1 − (r/2) δ_y²) U^{n+1}_{m; •} = \hat{U}_{m; •} + (r/2) (\hat{U}_{m+1; •} − 2 \hat{U}_{m; •} + \hat{U}_{m−1; •}) + (r/2) b^{n+1}_{m; •},   1 ≤ m ≤ M−1.    (15.67)
Here (r/2) δ_y² is the (L−1) × (L−1) matrix of the form (15.61), and

\hat{U}_{m; •} = (\hat{U}_{m,1}, \hat{U}_{m,2}, …, \hat{U}_{m,L−2}, \hat{U}_{m,L−1})^T,
U^{n+1}_{m; •} = (U^{n+1}_{m,1}, U^{n+1}_{m,2}, …, U^{n+1}_{m,L−2}, U^{n+1}_{m,L−1})^T,
b^{n+1}_{m; •} = ((g_2)^{n+1}_m, 0, …, 0, (g_3)^{n+1}_m)^T.    (15.68)
This completes the process of advancing the solution by one step in time.
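For readers who prefer to see the algorithm in code, a minimal sketch of one Peaceman–Rachford
step with Dirichlet data is given below. It is an addition to the notes; the array U, the boundary
functions g0, g1, g2, g3, and the parameters h, tau, M, L, n are assumed to exist, with U(m+1,l+1)
storing U_{m,l} and each g returning a column the size of its first argument.

% One Peaceman-Rachford step, Eqs. (15.64)-(15.68), for u_t = u_xx + u_yy.
r   = tau/h^2;
x   = (0:M)'*h;   y = (0:L)'*h;
tn  = n*tau;      tnp = (n+1)*tau;
Unew = zeros(M+1, L+1);                                   % solution at level n+1
Unew(1,:)   = g0(y, tnp)';   Unew(M+1,:) = g1(y, tnp)';   % (15.64), x-boundaries
Unew(:,1)   = g2(x, tnp);    Unew(:,L+1) = g3(x, tnp);    % (15.64), y-boundaries
g0n = g0(y,tn);  g0p = g0(y,tnp);  g1n = g1(y,tn);  g1p = g1(y,tnp);
li  = 2:L;                                                % interior l (Matlab indices)
d2  = @(v) v(li+1) - 2*v(li) + v(li-1);
G0  = 0.5*(g0n(li)+g0p(li)) + (r/4)*(d2(g0n) - d2(g0p));  % (15.60)
G1  = 0.5*(g1n(li)+g1p(li)) + (r/4)*(d2(g1n) - d2(g1p));
Ax  = (1+r)*eye(M-1) - (r/2)*(diag(ones(M-2,1),1) + diag(ones(M-2,1),-1));  % 1-(r/2)d_x^2
Ay  = (1+r)*eye(L-1) - (r/2)*(diag(ones(L-2,1),1) + diag(ones(L-2,1),-1));  % 1-(r/2)d_y^2
Uhat = zeros(M+1, L+1);
Uhat(1,li) = G0';   Uhat(M+1,li) = G1';                   % (15.65)
for j = li                                                % Step 2: x-sweeps, Eq. (15.66)
    rhs      = U(2:M,j) + (r/2)*(U(2:M,j+1) - 2*U(2:M,j) + U(2:M,j-1));
    rhs(1)   = rhs(1)   + (r/2)*Uhat(1,j);                % (G_0)_l contribution
    rhs(end) = rhs(end) + (r/2)*Uhat(M+1,j);              % (G_1)_l contribution
    Uhat(2:M,j) = Ax \ rhs;
end
for i = 2:M                                               % Step 3: y-sweeps, Eq. (15.67)
    rhs      = Uhat(i,li).' + (r/2)*(Uhat(i+1,li) - 2*Uhat(i,li) + Uhat(i-1,li)).';
    rhs(1)   = rhs(1)   + (r/2)*Unew(i,1);                % (g_2)^{n+1}_m
    rhs(end) = rhs(end) + (r/2)*Unew(i,L+1);              % (g_3)^{n+1}_m
    Unew(i,li) = (Ay \ rhs).';
end
U = Unew;                                                 % advance to level n+1

In a production code one would, of course, use a dedicated tridiagonal (Thomas) solver instead
of the generic backslash on a full matrix; the sketch above only mirrors the structure of the
algorithm.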
Neumann boundary conditions
This case is technically more involved than the case of Dirichlet boundary conditions. There-
fore, here we only list the steps of the algorithm of advancing the solution from the nth to the
(n+1)st time level, while relegating the detailed derivation of these steps to Appendix 2. Also,
note that you will not need to use this algorithm in any of the homework problems. It is pre-
sented here so that you would be able to use it whenever you have to solve a problem of this
kind in your future career.
Algorithm for solving the 2D Heat equation with Neumann boundary conditions
by the Peaceman–Rachford method:

Step 1 (set up boundary conditions):
Given the solution U^n_{m,l} with 0 ≤ {m, l} ≤ {M, L}, find the values at the virtual nodes, U^n_{−1,l}
and U^n_{M+1,l} with −1 ≤ l ≤ L+1, and U^n_{m,−1} and U^n_{m,L+1} with 0 ≤ m ≤ M, from (15.94) and
(15.95) of Appendix 2:

U^n_{−1,l} = U^n_{1,l} − 2h (g_0)^n_l,   U^n_{M+1,l} = U^n_{M−1,l} + 2h (g_1)^n_l,   0 ≤ l ≤ L;
U^n_{m,−1} = U^n_{m,1} − 2h (g_2)^n_m,   U^n_{m,L+1} = U^n_{m,L−1} + 2h (g_3)^n_m,   0 ≤ m ≤ M;    (15.94)

U^n_{−1,−1} = U^n_{1,−1} − 2h (g_0)^n_{−1},   U^n_{M+1,−1} = U^n_{M−1,−1} + 2h (g_1)^n_{−1},
U^n_{−1,L+1} = U^n_{1,L+1} − 2h (g_0)^n_{L+1},   U^n_{M+1,L+1} = U^n_{M−1,L+1} + 2h (g_1)^n_{L+1}.    (15.95)
Define the auxiliary functions given by (15.97) and (15.98) of Appendix 2, which will later be
used to compute the boundary values of the intermediate solution \hat{U}:

For 0 ≤ l ≤ L:

(G_0)_l = (1/2) [ (g_0)^n_l + (g_0)^{n+1}_l ]
         + (r/4) { [ (g_0)^n_{l+1} − 2(g_0)^n_l + (g_0)^n_{l−1} ] − [ (g_0)^{n+1}_{l+1} − 2(g_0)^{n+1}_l + (g_0)^{n+1}_{l−1} ] },    (15.97)

(G_1)_l = (1/2) [ (g_1)^n_l + (g_1)^{n+1}_l ]
         + (r/4) { [ (g_1)^n_{l+1} − 2(g_1)^n_l + (g_1)^n_{l−1} ] − [ (g_1)^{n+1}_{l+1} − 2(g_1)^{n+1}_l + (g_1)^{n+1}_{l−1} ] }.    (15.98)

(Note that the form of G_0 and G_1 above is the same as in the case of the Dirichlet boundary
conditions, see (15.60), although the meanings of g_0 and g_1 are different in these two cases.)
Step 2:
For each 0 ≤ l ≤ L, solve the linear system, whose form follows from (15.92) and (15.99) of
Appendix 2:

(1 − (r/2) δ_x²) \hat{U}_{•; l} = U^n_{•; l} + (r/2) (U^n_{•; l+1} − 2 U^n_{•; l} + U^n_{•; l−1}) + (r/2) b_{•; l},   0 ≤ l ≤ L,    (15.69)

where (r/2) δ_x² is an (M+1) × (M+1) matrix of the form

[ −r     r     0    …    0
  r/2   −r    r/2   …    0
   …     …     …    …    …
   0     …    r/2   −r   r/2
   0     …     0     r   −r ],    (15.70)

(here the entries that differ from those of the corresponding matrix for Dirichlet boundary
conditions are the r's in the first and last rows),

\hat{U}_{•; l} = (\hat{U}_{0,l}, \hat{U}_{1,l}, …, \hat{U}_{M−1,l}, \hat{U}_{M,l})^T,
U^n_{•; l} = (U^n_{0,l}, U^n_{1,l}, …, U^n_{M−1,l}, U^n_{M,l})^T,
b_{•; l} = (−2h (G_0)_l, 0, …, 0, 2h (G_1)_l)^T,    (15.71)

and G_0 and G_1 are defined in (15.97) and (15.98) (see above and in Appendix 2).

Having thus determined the following values of the intermediate solution:

\hat{U}_{m,l}  for  0 ≤ {m, l} ≤ {M, L},

find the values \hat{U}_{−1,l} and \hat{U}_{M+1,l} for 0 ≤ l ≤ L from the rearranged forms of (15.97) and
(15.98) of Appendix 2:

\hat{U}_{−1,l} = \hat{U}_{1,l} − 2h (G_0)_l,   0 ≤ l ≤ L,    (15.97′)
\hat{U}_{M+1,l} = \hat{U}_{M−1,l} + 2h (G_1)_l,   0 ≤ l ≤ L.    (15.98′)

Thus, upon completing Step 2, one has the following values of the intermediate solution:

\hat{U}_{m,l}  for  −1 ≤ m ≤ M+1  and  0 ≤ l ≤ L,

which are shown in Appendix 2 to be necessary and sufficient to find the solution on the (n+1)st
time level.

Step 3:
The solution at the new time level, U^{n+1}_{m,l} with 0 ≤ {m, l} ≤ {M, L}, is determined from
(15.85), (15.88), and (15.89) of Appendix 2, which constitute the following (L+1) × (L+1)
linear systems, one for each m = 0, …, M:

(1 − (r/2) δ_y²) U^{n+1}_{m; •} = \hat{U}_{m; •} + (r/2) (\hat{U}_{m+1; •} − 2 \hat{U}_{m; •} + \hat{U}_{m−1; •}) + (r/2) b^{n+1}_{m; •},   0 ≤ m ≤ M.    (15.72)
Here (r/2) δ_y² is the (L+1) × (L+1) matrix of the form (15.70), and

\hat{U}_{m; •} = (\hat{U}_{m,0}, \hat{U}_{m,1}, …, \hat{U}_{m,L−1}, \hat{U}_{m,L})^T,
U^{n+1}_{m; •} = (U^{n+1}_{m,0}, U^{n+1}_{m,1}, …, U^{n+1}_{m,L−1}, U^{n+1}_{m,L})^T,
b^{n+1}_{m; •} = (−2h (g_2)^{n+1}_m, 0, …, 0, 2h (g_3)^{n+1}_m)^T.    (15.73)
This completes the process of advancing the solution by one step in time.
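Only the assembly of the matrix (15.70) differs materially from the Dirichlet case; as an
illustration (an addition to the notes, with r and M assumed given), it can be built as follows.

% Operator 1 - (r/2)*delta_x^2 for the Neumann case, cf. the matrix (15.70),
% acting on the full vector (U_0, ..., U_M).
A = (1+r)*eye(M+1) - (r/2)*(diag(ones(M,1),1) + diag(ones(M,1),-1));
A(1,2)   = -r;      % first row  [1+r, -r, 0, ...]: ghost node at m = -1 eliminated
A(M+1,M) = -r;      % last row   [..., 0, -r, 1+r]: ghost node at m = M+1 eliminated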
We conclude this subsection with the counterparts of Eq. (15.60) for the Douglas and
Dyakonov methods. You will be asked to derive these results in a homework problem. We
will not state any results for Neumann boundary conditions for the Douglas and Dyakonov
methods.
Douglas method

The Dirichlet boundary conditions for the intermediate solution \hat{U} have the form:

\hat{U}_{{0,M}, l} = U^{n+1}_{{0,M}, l} + (r/2) δ_y² (U^n_{{0,M}, l} − U^{n+1}_{{0,M}, l}),   1 ≤ l ≤ L−1.    (15.74)

Dyakonov method

The Dirichlet boundary conditions for the intermediate solution \hat{U} have the form:

\hat{U}_{{0,M}, l} = (1 − (r/2) δ_y²) U^{n+1}_{{0,M}, l},   1 ≤ l ≤ L−1.    (15.75)
15.7 Appendix 1: A generalized form of the ADI methods, and
a second-order ADI method for the parabolic equation with
mixed derivatives, Eq. (15.54)

A brief preview of this section was given at the end of Section 15.5. The presentation below is
based on the papers by K.J. in 't Hout and B.D. Welfert, "Stability of ADI schemes applied to
convection-diffusion equations with mixed derivative terms," Applied Numerical Mathematics
57, 19–35 (2007), and by I.J.D. Craig and A.D. Sneyd, "An alternating-direction implicit scheme
for parabolic equations with mixed derivatives," Computers and Mathematics with Applications
16(4), 341–350 (1988). Both papers are posted on the course website.

Let us begin by writing a general form of the equation that includes the Heat equation
(15.1) as a special case:

u_t = F ≡ F^{(0)} + F^{(1)} + F^{(2)},    (15.76)

where F^{(1)} and F^{(2)} are terms that contain only the derivatives of u with respect to x and y,
respectively, and F^{(0)} contains all other terms (e.g., nonlinear or with mixed derivatives). For
example, in (15.54),

F^{(0)} = a^{(xy)}(x, y) u_xy,   F^{(1)} = a^{(xx)}(x, y) u_xx,   F^{(2)} = a^{(yy)}(x, y) u_yy.
Next, note that the Douglas method (15.48), which we repeat here for the reader's convenience:

(1 − (r/2) δ_x²) \hat{U}_{ml} = (1 + (r/2) δ_x² + r δ_y²) U^n_{ml},
(1 − (r/2) δ_y²) U^{n+1}_{ml} = \hat{U}_{ml} − (r/2) δ_y² U^n_{ml},    (15.48)

can be written in an equivalent, but different, form:

W^{(0)} = U^n + r δ_x² U^n + r δ_y² U^n,
W^{(1)} = W^{(0)} + (1/2) (r δ_x² W^{(1)} − r δ_x² U^n),
W^{(2)} = W^{(1)} + (1/2) (r δ_y² W^{(2)} − r δ_y² U^n),
U^{n+1} = W^{(2)}.    (15.77)

Here, for brevity, we have omitted the subscripts {m, l} in U^n_{m,l}, etc. The correspondence between
the notations of (15.48) and (15.77) is:

W^{(1)} of (15.77) = \hat{U} of (15.48).
In the notations introduced in (15.76), this can be written as

W^{(0)} = U^n + τ F(U^n),
W^{(k)} = W^{(k−1)} + (τ/2) [ F^{(k)}(W^{(k)}) − F^{(k)}(U^n) ],   k = 1, 2;
U^{n+1} = W^{(2)}.    (15.78)

Recall that the F used in the first equation above is defined in (15.76).

Let us make two observations about scheme (15.78). First, it can be interpreted as a
predictor-corrector method, which we considered in Lecture 3. Indeed, the first equation in
(15.78) predicts the value of the solution at the next time level by the simple Euler method.
The purpose of each of the subsequent steps is to stabilize the predictor step by employing an
implicit modified Euler step in one particular direction (i.e., along either x or y). Indeed, if
we set F^{(0)} = F^{(2)} = 0 in (15.76), then (15.78) reduces to the implicit modified Euler method.
You will be asked to verify this in a QSA.

Second, (15.78) is seen to be closely related to the IMEX family of methods; see scheme
(14.51) in Lecture 14.
For F^{(0)} ≠ 0, method (15.78) has accuracy O(τ + h²) (for F^{(0)} = 0, its accuracy is O(τ² + h²),
as we know from the discussion of the Douglas method in Section 15.5). It is of interest and
of considerable practical significance to construct an extension of this scheme that would have
accuracy O(τ² + h²) even when F^{(0)} ≠ 0. Two such schemes were presented in the paper by
in 't Hout and Welfert, who generalized schemes presented earlier by other researchers. The
first scheme is:

W^{(0)} = U^n + τ F(U^n),
W^{(k)} = W^{(k−1)} + (τ/2) [ F^{(k)}(W^{(k)}) − F^{(k)}(U^n) ],   k = 1, 2;
V^{(0)} = W^{(0)} + (τ/2) [ F^{(0)}(W^{(2)}) − F^{(0)}(U^n) ],
V^{(k)} = V^{(k−1)} + (τ/2) [ F^{(k)}(V^{(k)}) − F^{(k)}(U^n) ],   k = 1, 2;
U^{n+1} = V^{(2)}.    (15.79)
After one round of prediction and correction, accomplished by the first two lines of this scheme,
it proceeds to do another round of prediction and correction, given by the third and fourth lines.
It appears that it is this second round that brings the accuracy of the scheme up to the order
O(τ²). Intuitively, the reason why this is so can be understood by making an analogy of these
two rounds with the two steps of the modified explicit Euler method. Specifically, in a QSA
you will be asked to show that the scheme (15.79) with F^{(1)} = F^{(2)} = 0 reduces to the modified
explicit Euler method.

The second scheme proposed by in 't Hout and Welfert is obtained from (15.79) by replacing
F^{(0)} in its third line by F.

Stability of the above schemes, as well as of the generalized Douglas scheme (15.78), has
been investigated by in 't Hout and Welfert. In particular, they showed that schemes (15.78)
and (15.79) are unconditionally stable for the so-called convection-diffusion equation

u_t = c^{(x)} u_x + c^{(y)} u_y + a^{(xx)} u_xx + a^{(xy)} u_xy + a^{(yy)} u_yy,    (15.80)

where all the coefficients may depend on x and y, and the quadratic form a^{(xx)} x² + a^{(xy)} xy +
a^{(yy)} y² is positive definite. The third scheme, mentioned after (15.79), can also be made
unconditionally stable upon replacing the coefficient 1/2 in its second and fourth lines by any
number ≥ 3/4.
Below we will specify scheme (15.79) for the case of equation (15.80) with c^{(x)} = c^{(y)} = 0:

W^{(0)} = U^n + r (a^{(xx)} δ_x² + a^{(xy)} δ_xy + a^{(yy)} δ_y²) U^n,
(1 − (1/2) r a^{(xx)} δ_x²) W^{(1)} = W^{(0)} − (1/2) r a^{(xx)} δ_x² U^n,
(1 − (1/2) r a^{(yy)} δ_y²) W^{(2)} = W^{(1)} − (1/2) r a^{(yy)} δ_y² U^n,
V^{(0)} = W^{(0)} + (1/2) r (a^{(xy)} δ_xy W^{(2)} − a^{(xy)} δ_xy U^n),
(1 − (1/2) r a^{(xx)} δ_x²) V^{(1)} = V^{(0)} − (1/2) r a^{(xx)} δ_x² U^n,
(1 − (1/2) r a^{(yy)} δ_y²) U^{n+1} = V^{(1)} − (1/2) r a^{(yy)} δ_y² U^n.    (15.81)
Here all the coefficients are evaluated at node (m, l), and the mixed-derivative operator is:

δ_xy U_{m,l} = (1/(4h²)) (U_{m+1,l+1} + U_{m−1,l−1} − U_{m−1,l+1} − U_{m+1,l−1}).    (15.82)
Scheme (15.81) was originally proposed by Craig and Sneyd in their paper cited above.
There, it was given by their Eq. (7) in more condensed notations, which we will write as:

(1 − (1/2) r a^{(xx)} δ_x²)(1 − (1/2) r a^{(yy)} δ_y²) W^{(2)}
   = (1 + (1/2) r a^{(xx)} δ_x²)(1 + (1/2) r a^{(yy)} δ_y²) U^n + r a^{(xy)} δ_xy U^n,

(1 − (1/2) r a^{(xx)} δ_x²)(1 − (1/2) r a^{(yy)} δ_y²) U^{n+1}
   = (1 + (1/2) r a^{(xx)} δ_x²)(1 + (1/2) r a^{(yy)} δ_y²) U^n + (1/2) r a^{(xy)} δ_xy (W^{(2)} + U^n).    (15.83)
In order to turn the Craig–Sneyd scheme (15.81) into a practical algorithm, one needs to
specify what boundary conditions for its auxiliary variables are needed and how those can
be found. I have been unable to find an answer to this question in the published literature;
therefore, below I present my own answer. Let us start at the first line of (15.81). Its left-hand
side can be computed only at the interior grid points, i.e., for 1 ≤ m ≤ M−1 and 1 ≤ l ≤ L−1,
since the calculation of the terms on the right-hand side requires boundary values of U^n along
the entire perimeter of the computational domain. Next, to determine W^{(1)} in the second line,
we need its boundary values W^{(1)}_{{0,M}, l} for 1 ≤ l ≤ L−1. Those can be found from the third
line with m = 0 and m = M if one knows the boundary values W^{(2)}_{{0,M}, l} for all l, i.e., for
0 ≤ l ≤ L. Thus, we focus on specifying or finding these latter boundary values. It turns out
that one cannot find them. Indeed, the fourth line of (15.81) does not provide any information
about W^{(2)}_{{0,M}, l} but rather, to aggravate the matters, requires the values W^{(2)}_{m, {0,L}} at the other
boundary in order to compute δ_xy W^{(2)} for all interior points 1 ≤ m ≤ M−1, 1 ≤ l ≤ L−1.
One can also verify that none of these boundary values can be computed if we start from the
last line of (15.81), either. Thus, the only remaining option is to specify the values of W^{(2)} along
the entire perimeter of the computational domain.

This option is consistent with the form (15.83) of the Craig–Sneyd scheme. Indeed, then
the first line of that scheme is nothing but the inhomogeneous version of (15.33) (with the
inhomogeneity being the δ_xy U^n term), and we know that to solve it, the boundary values of
the variable on the left-hand side must be specified.

The way to specify the boundary values of W^{(2)} appears to be to let them equal those of
U^{n+1}:

W^{(2)}_{{0,M}, l} = U^{n+1}_{{0,M}, l},   0 ≤ l ≤ L;
W^{(2)}_{m, {0,L}} = U^{n+1}_{m, {0,L}},   1 ≤ m ≤ M−1.    (15.84)

To see that, let the mixed-derivative term in (15.81) vanish: a^{(xy)} = 0. Then from the fourth
line of that scheme, V^{(0)} = W^{(0)}, and then W^{(2)} simply coincides with U^{n+1} at all nodes.
To summarize, we list the steps of implementing the algorithm (15.81), (15.84) in a code.

Step 1: Define the boundary values U^{n+1}_{{0,M}, l} and W^{(2)}_{{0,M}, l} for 0 ≤ l ≤ L. Next, compute the
boundary values V^{(1)}_{{0,M}, l} and W^{(1)}_{{0,M}, l} for 1 ≤ l ≤ L−1 from the last and third lines of (15.81),
respectively.

Step 2: Define the boundary values U^{n+1}_{m, {0,L}} and W^{(2)}_{m, {0,L}} for 1 ≤ m ≤ M−1.

Step 3: Find the variables on the left-hand sides of the first two lines of (15.81). This can be
done because the required boundary values of W^{(1)} have been computed in Step 1.

Step 4: Find W^{(2)} at all the interior points from the third line of (15.81). This can be done
because the required boundary values of W^{(2)} have been defined in Step 2.

Step 5: Find the variables on the left-hand sides of the fifth and sixth lines of (15.81). This
can be done because the required boundary values of V^{(1)} have been computed in Step 1 and
the required boundary values of W^{(2)} have been defined in Steps 1 and 2.

Step 6: Find U^{n+1} at all the interior points from the last line of (15.81). This can be done
because the required boundary values have been defined in Step 2.

Generalization of the Craig–Sneyd scheme (15.81) to three spatial dimensions is straightforward.
Craig and Sneyd also generalized it to a system of coupled equations of the form
(15.80); see "An alternating direction implicit scheme for parabolic systems of partial differential
equations," Computers and Mathematics with Applications 20(3), 53–62 (1990). Also, in 't
Hout and Welfert published a follow-up paper to their paper cited above; it is: "Unconditional
stability of second-order ADI schemes applied to multi-dimensional diffusion equations with
mixed derivative terms," Applied Numerical Mathematics 59, 677–692 (2009). There, they
consider a more restricted class of equations than (15.80) (diffusion only, no convection), but in
exchange are able to prove unconditional stability of a certain scheme in any number of spatial
dimensions. This has applications in, e.g., financial mathematics.
15.8 Appendix 2: Derivation of the Peaceman–Rachford algorithm
for the 2D Heat equation with Neumann boundary conditions

In order for you to understand the details of this derivation better, it is recommended that you
first review the corresponding derivation for the Crank–Nicolson method in Sec. 14.1, since it
is that derivation on which the present one is based. It should also help you to draw a single
time level and refer to that drawing throughout the derivation.

We begin by determining which boundary values of the intermediate solution \hat{U} are required
to compute the solution U^{n+1}_{ml}, 0 ≤ {m, l} ≤ {M, L}, at the new time level. To that end, let us
write down Eq. (15.34b) in a detailed form:

U^{n+1}_{m,l} − (r/2) [ U^{n+1}_{m,l+1} − 2U^{n+1}_{m,l} + U^{n+1}_{m,l−1} ] = \hat{U}_{m,l} + (r/2) [ \hat{U}_{m+1,l} − 2\hat{U}_{m,l} + \hat{U}_{m−1,l} ].    (15.85)

We need to determine the solution on the l.h.s. for 0 ≤ {m, l} ≤ {M, L}. First, we note that
we can set l = 0 in (15.85) despite the fact that U^{n+1}_{m,−1} will then appear in the last term on the
l.h.s., and that value is not part of the solution. To eliminate that value, we use the boundary
condition at y = 0:

(U^{n+1}_{m,1} − U^{n+1}_{m,−1}) / (2h) = (g_2)^{n+1}_m   ⇒   U^{n+1}_{m,−1} = U^{n+1}_{m,1} − 2h (g_2)^{n+1}_m,   0 ≤ m ≤ M.    (15.86)

Similarly, at y = Y, we have

(U^{n+1}_{m,L+1} − U^{n+1}_{m,L−1}) / (2h) = (g_3)^{n+1}_m   ⇒   U^{n+1}_{m,L+1} = U^{n+1}_{m,L−1} + 2h (g_3)^{n+1}_m,   0 ≤ m ≤ M.    (15.87)

Therefore, Eq. (15.85) for l = 0 and l = L is replaced by the respective equations:

U^{n+1}_{m,0} − (r/2) [ 2U^{n+1}_{m,1} − 2U^{n+1}_{m,0} − 2h (g_2)^{n+1}_m ] = \hat{U}_{m,0} + (r/2) [ \hat{U}_{m+1,0} − 2\hat{U}_{m,0} + \hat{U}_{m−1,0} ];    (15.88)

U^{n+1}_{m,L} − (r/2) [ 2h (g_3)^{n+1}_m − 2U^{n+1}_{m,L} + 2U^{n+1}_{m,L−1} ] = \hat{U}_{m,L} + (r/2) [ \hat{U}_{m+1,L} − 2\hat{U}_{m,L} + \hat{U}_{m−1,L} ].    (15.89)

Thus, in order to determine from (15.85), (15.88), and (15.89) the solution U^{n+1}_{m,l} for all
0 ≤ {m, l} ≤ {M, L}, we will need to know

\hat{U}_{m,l}  for  −1 ≤ m ≤ M+1  and  0 ≤ l ≤ L.    (15.90)

The difficulty that we need to overcome is the determination of the boundary values

\hat{U}_{−1,l}  and  \hat{U}_{M+1,l}  for  0 ≤ l ≤ L.    (15.91)

Now let us see which of the values \hat{U}_{m,l} we can determine directly from (15.34a). To that
end, let us write down that equation in a detailed form, similarly to (15.85):

\hat{U}_{m,l} − (r/2) [ \hat{U}_{m+1,l} − 2\hat{U}_{m,l} + \hat{U}_{m−1,l} ] = U^n_{m,l} + (r/2) [ U^n_{m,l+1} − 2U^n_{m,l} + U^n_{m,l−1} ].    (15.92)

From this equation, we see that in order to determine the values required in (15.90), we need
to know

U^n_{m,l}  for  −1 ≤ m ≤ M+1  and  −1 ≤ l ≤ L+1.    (15.93)

While those values for 0 ≤ {m, l} ≤ {M, L} are known from the solution on the nth time level,
the values U^n_{m,l} with {m, l} = −1 and {m, l} = {M+1, L+1} are not, and hence they need to
be found from the boundary conditions. This is done similarly to (15.86) and (15.87):

U^n_{−1,l} = U^n_{1,l} − 2h (g_0)^n_l,   U^n_{M+1,l} = U^n_{M−1,l} + 2h (g_1)^n_l,   0 ≤ l ≤ L;
U^n_{m,−1} = U^n_{m,1} − 2h (g_2)^n_m,   U^n_{m,L+1} = U^n_{m,L−1} + 2h (g_3)^n_m,   0 ≤ m ≤ M.    (15.94)

Once the values in (15.94) have been found, we determine the values at the corners:

U^n_{−1,−1} = U^n_{1,−1} − 2h (g_0)^n_{−1},   U^n_{M+1,−1} = U^n_{M−1,−1} + 2h (g_1)^n_{−1},
U^n_{−1,L+1} = U^n_{1,L+1} − 2h (g_0)^n_{L+1},   U^n_{M+1,L+1} = U^n_{M−1,L+1} + 2h (g_1)^n_{L+1}.    (15.95)

Note that the smoothness of the solution at the corners is ensured by the matching conditions
(15.58). Thus, with (15.94) and (15.95), we have all the values required in (15.93).

We now turn back to (15.92). From it, we see that with all the values (15.93) being known,
the l.h.s. of (15.92) can be determined from an (M+1) × (M+1) system of equations only if
we know the values in (15.91). To determine these values, we use Eq. (15.59) in the following
way: we subtract that equation with m → (m−1) from the same equation with m → (m+1)
and divide the result by (2h). For m = 0, this yields:

(\hat{U}_{1,l} − \hat{U}_{−1,l}) / (2h)
 = (1/2) [ (U^n_{1,l} − U^n_{−1,l}) / (2h) + (U^{n+1}_{1,l} − U^{n+1}_{−1,l}) / (2h) ]
 + (r/4) { [ (U^n_{1,l+1} − U^n_{−1,l+1}) / (2h) − 2 (U^n_{1,l} − U^n_{−1,l}) / (2h) + (U^n_{1,l−1} − U^n_{−1,l−1}) / (2h) ]
         − [ (U^{n+1}_{1,l+1} − U^{n+1}_{−1,l+1}) / (2h) − 2 (U^{n+1}_{1,l} − U^{n+1}_{−1,l}) / (2h) + (U^{n+1}_{1,l−1} − U^{n+1}_{−1,l−1}) / (2h) ] },
 for 0 ≤ l ≤ L.    (15.96)

If we can find each term on the r.h.s. of this equation, then we know the value of the term
on the l.h.s. and hence can determine \hat{U}_{−1,l}. But each of these terms can be found from the
boundary conditions! Upon this observation, (15.96) can be rewritten as follows:

(\hat{U}_{1,l} − \hat{U}_{−1,l}) / (2h)
 = (1/2) [ (g_0)^n_l + (g_0)^{n+1}_l ]
 + (r/4) { [ (g_0)^n_{l+1} − 2(g_0)^n_l + (g_0)^n_{l−1} ] − [ (g_0)^{n+1}_{l+1} − 2(g_0)^{n+1}_l + (g_0)^{n+1}_{l−1} ] }
 ≡ (G_0)_l,   for 0 ≤ l ≤ L.    (15.97)

Similarly,

(\hat{U}_{M+1,l} − \hat{U}_{M−1,l}) / (2h)
 = (1/2) [ (g_1)^n_l + (g_1)^{n+1}_l ]
 + (r/4) { [ (g_1)^n_{l+1} − 2(g_1)^n_l + (g_1)^n_{l−1} ] − [ (g_1)^{n+1}_{l+1} − 2(g_1)^{n+1}_l + (g_1)^{n+1}_{l−1} ] }
 ≡ (G_1)_l,   for 0 ≤ l ≤ L.    (15.98)

Using Eqs. (15.97) and (15.98), the linear systems given, for each l = 0, …, L, by Eqs. (15.92)
with 1 ≤ m ≤ M−1 can be supplemented by the following equations for m = 0 and m = M:

\hat{U}_{0,l} − (r/2) [ 2\hat{U}_{1,l} − 2\hat{U}_{0,l} − 2h (G_0)_l ] = U^n_{0,l} + (r/2) [ U^n_{0,l+1} − 2U^n_{0,l} + U^n_{0,l−1} ],
\hat{U}_{M,l} − (r/2) [ 2h (G_1)_l − 2\hat{U}_{M,l} + 2\hat{U}_{M−1,l} ] = U^n_{M,l} + (r/2) [ U^n_{M,l+1} − 2U^n_{M,l} + U^n_{M,l−1} ],
 0 ≤ l ≤ L.    (15.99)

From (15.99) and the remaining equations in (15.92) one can determine

\hat{U}_{m,l}  for  0 ≤ {m, l} ≤ {M, L},    (15.100)

and the remaining values (15.91) are determined from (15.97) and (15.98).
15.9 Questions for self-assessment

1. According to the lexicographic ordering, which quantity appears earlier in the vector U in Eq. (15.8): U_{2,5} or U_{5,2}?
2. Verify Eq. (15.14).
3. Verify Eq. (15.15).
4. Verify Eq. (15.19).
5. What is the length of the vector b^n_l in Eq. (15.21)?
6. Write down Eq. (15.19) for l = 2. Then verify that it is equivalent to Eq. (15.22) for the same value of l.
7. Obtain the analog of Eq. (15.23) for l = L−1.
8. What is the length of the vector B^n in Eq. (15.25)?
9. Why is the CN scheme (15.16) computationally inefficient?
10. State the three properties that a computationally efficient scheme for the 2D Heat equation must have.
11. What is the order of the truncation error of scheme (15.31)?
12. Verify that (15.33) is equivalent to (15.32).
13. What is the order of the truncation error of scheme (15.33)?
14. Make sure you can justify each step in (15.35).
15. What is the order of the truncation error of scheme (15.34)?
16. Explain in detail why (15.34) is computationally efficient (that is, which systems need to be solved at each step).
17. Obtain both equations in (15.41).
18. Why does one want to look for alternatives to the Peaceman–Rachford method?
19. Make sure you can obtain (15.47).
20. Produce an example of r, β, γ, and ζ such that the corresponding amplification factor (15.47) is greater than 1 in magnitude.
21. Explain in detail why (15.48) is computationally efficient (that is, which systems need to be solved at each step).
22. Consider the Peaceman–Rachford method for the Heat equation with Dirichlet boundary conditions. Explain which boundary values of the intermediate solution one requires, and why one does not need other boundary values.
23. Verify that the scheme (15.78) with F^{(0)} = F^{(2)} = 0 reduces to the modified implicit Euler method.
24. Verify that the scheme (15.79) with F^{(1)} = F^{(2)} = 0 reduces to the modified explicit Euler method.
25. Make sure you can follow the argument made around (15.84).
16 Hyperbolic PDEs:
Analytical solutions and characteristics

Hyperbolic PDEs describe propagation of disturbances in space and time when the total energy
of the disturbances remains conserved. It is the condition of energy conservation that makes
the hyperbolic equations different from parabolic ones, considered in Lectures 12 through 15.
The following analogy with ODEs is intended to clarify the difference between hyperbolic and
parabolic PDEs. Parabolic equations are multi-dimensional counterparts of the ODE

y′ = −λ y,   Re λ > 0,    (16.1)

and thus describe processes of relaxation of the initial disturbance towards an equilibrium
(which is y = 0 in the case of (16.1)). Hyperbolic equations are multi-dimensional counterparts
of the ODE

y″ = −ω² y,   ω² > 0,    (16.2)

which describes oscillations (see Lecture 5). However, hyperbolic PDEs describe not only oscillations,
but also (and, in fact, much more often) propagation of initial disturbances. Examples
include, e.g., propagation of sound and light.

16.1 Solution of the Wave equation

In fact, the basic form (i.e., before any perturbations or specific details are included into the
model) of the equation that governs propagation of light and sound is the same. That same
equation, called the Wave equation, also arises in a great variety of applications in physics and
engineering. A classic example, considered in most textbooks, is the vibration of a string. The
corresponding equation is

u_tt = c² u_xx.    (16.3)

In the above example of a string, c = sqrt(T/ρ), where T and ρ are the string's tension and
density, respectively. As we will see shortly, in general, c is the speed of propagation of initial
disturbances (e.g., the speed of sound for sound waves or the light speed for light waves).

To solve Eq. (16.3), we need to supplement it with initial and boundary conditions. We will
do so later on. For now, let us discuss the general solution of (16.3). We will use this analytic
solution as a reference for numerical solutions that we will obtain in Lecture 17.

Rewriting (16.3) in the form presented in Lecture 11:

1 · u_tt + 2 · 0 · u_xt + (−c²) u_xx = 0,

and then using Eq. (11.15), we obtain the equations for the two characteristics of (16.3):

(dx/dt)_{1,2} = ±c.    (16.4)

Thus,

Along characteristics 1:   x − ct = const;    (16.5a)
Along characteristics 2:   x + ct = const.    (16.5b)

This is illustrated in the figure on the right.

[Figure: the (x, t)-plane with the two straight-line families of characteristics, x − ct = const
and x + ct = const.]

The significance of the characteristics follows from the fact that any piece of initial or
boundary data propagates along the characteristics and thereby determines the solution of
(16.3) at any point in space and time. We will derive parts of this result later, and you will be
asked to complete that derivation in a homework problem. For now, it will be sufficient for our
purposes to give the general solution of (16.3) without a derivation:

u(x, t) = F(x − ct) + G(x + ct).    (16.6)

Here F and G are functions determined by the initial and boundary conditions, as we will show
shortly. The meaning of solution (16.6) is this: the solution of the Wave equation splits into
two waveforms, each of which travels along its own characteristic.
Now let us show how F and G are found assuming that the initial conditions

u(x, t=0) = φ(x),   u_t(x, t=0) = ψ(x),   −∞ < x < ∞    (16.7)

are prescribed on the infinite line. That is, for now we will assume no explicit boundary
conditions; implicitly, we will assume that there is no disturbance coming into the region of
finite x-values from either x = −∞ or x = +∞. Note that in (16.7), φ(x) can be interpreted
as the initial shape of the disturbance and ψ(x), as its initial velocity. By substituting (16.6)
into (16.7) and following a calculation outlined in Appendix 1, one obtains:

u(x, t) = (1/2) [ φ(x − ct) + φ(x + ct) ] + (1/(2c)) ∫_{x−ct}^{x+ct} ψ(s) ds.    (16.8)

This formula is called the d'Alembert solution of the Wave equation (16.3) set up on the infinite
line with the initial conditions (16.7). For example, when the initial velocity is zero everywhere,
the solution (16.8) at any time is given by two replicas of the initial disturbance φ(x) that travel
along the characteristics x − ct = const and x + ct = const. Note especially that if the initial
disturbance is not smooth (e.g., is discontinuous), the discontinuities are not smoothed out
during the propagation but simply propagate along the characteristics.
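A small sketch (an addition to the notes) of evaluating the d'Alembert formula (16.8) numerically
is given below; the initial data phi, psi and all parameters are illustrative assumptions.

% Evaluate the d'Alembert solution (16.8) at time t on a grid x, using numerical
% quadrature for the integral of psi.  phi and psi are assumed function handles.
c   = 1;  t = 0.7;                        % illustrative wave speed and time
phi = @(x) exp(-20*(x-0.5).^2);           % assumed initial shape
psi = @(x) exp(-20*(x-0.5).^2);           % assumed initial velocity
x   = linspace(-2, 3, 1001);
u   = 0.5*(phi(x - c*t) + phi(x + c*t));
for k = 1:numel(x)                        % add (1/(2c)) * integral of psi over [x-ct, x+ct]
    u(k) = u(k) + integral(psi, x(k)-c*t, x(k)+c*t)/(2*c);
end
plot(x, u);  xlabel('x');  ylabel('u(x,t)');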
Now let us show how formula (16.6) can be used to obtain a solution of (16.3) on a finite
interval. Instead of initial conditions (16.7), we will now consider initial conditions

u(x, t=0) = φ(x),   u_t(x, t=0) = ψ(x),   0 ≤ x ≤ L    (16.9)

along with boundary conditions

u(x=0, t) = g_0,   u(x=L, t) = g_L,   t > 0.    (16.10)

Note that the boundary values for u_t need not be specified because they are determined by
(16.10). Also, on physical grounds, we require that the boundary and initial conditions match:

φ(x=0) = g_0(t=0),   φ(x=L) = g_L(t=0);
ψ(x=0) = g_0′(t=0),   ψ(x=L) = g_L′(t=0).    (16.11)

In what follows we illustrate the method of finding a solution of (16.3) on a finite interval
for the special case when the boundary values g_0 and g_L do not depend on time. (The
same method, but with additional effort, can be used in the general case of time-dependent
boundary conditions.) When the boundary conditions are time-independent, we will first show
that they can be set to zero without loss of generality. Using a trick analogous to that used in
the homework problems for Lecture 9, we consider a modified function

\bar{u} = u − [ g_0 + ((g_L − g_0)/L) x ],    (16.12)

which satisfies both the Wave equation (16.3) and the zero boundary conditions

\bar{u}(0, t) = 0,   \bar{u}(L, t) = 0.

Thus we set g_0 = g_L = 0 in (16.10) in what follows.
Now we will use the so-called method of reflections, where we claim that the solution of
(16.3), (16.9), (16.10) (with g_0 = g_L = 0) is given by formula (16.6) with φ(x) and ψ(x) replaced
by their anti-symmetric, 2L-periodic extensions about the points x = 0 and x = L:

\hat{φ}(x) = { −φ(−x),  −L ≤ x ≤ 0;   φ(x),  0 ≤ x ≤ L;   −φ(2L − x),  L ≤ x ≤ 2L;   … }    (16.13)

and similarly for \hat{ψ}(x).

[Figure: the extension \hat{φ}(x) on −L ≤ x ≤ 2L, showing the odd reflections of φ(x) about x = 0
and x = L.]

Then the solution

u(x, t) = (1/2) [ \hat{φ}(x − ct) + \hat{φ}(x + ct) ] + (1/(2c)) ∫_{x−ct}^{x+ct} \hat{ψ}(s) ds    (16.14)

satisfies the PDE (16.3) (by virtue of (16.6)) and the initial condition (16.9) (by virtue of
(16.13)). In a question for self-assessment you will be asked to verify that it also satisfies the
zero boundary conditions at x = 0 and x = L.
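The extension (16.13) is easy to realize in code; the sketch below (an addition to the notes, with
an illustrative initial shape) builds \hat{φ} and evaluates (16.14) for zero initial velocity.

% Anti-symmetric, 2L-periodic extension (16.13) of phi and the reflected solution
% (16.14) with psi = 0.  phi is an assumed initial shape vanishing at x = 0 and x = L.
L    = 1;  c = 1;  t = 0.4;
phi  = @(x) sin(pi*x).^3;                        % illustrative choice, phi(0) = phi(L) = 0
s    = @(x) mod(x, 2*L);                         % reduce the argument to one 2L-period
phih = @(x) (s(x) <= L).*phi(s(x)) - (s(x) > L).*phi(2*L - s(x));   % Eq. (16.13)
x    = linspace(0, L, 400);
u    = 0.5*(phih(x - c*t) + phih(x + c*t));      % Eq. (16.14) with psi = 0
plot(x, u);  xlabel('x');  ylabel('u(x,t)');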
16.2 Wave equation as a system of first-order PDEs

Let us now present another point of view of the Wave equation. The numerical method developed
in Lecture 17 will utilize this point of view.

If we denote

u_t = p,   c u_x = q,

then Eq. (16.3) becomes

p_t − c q_x = 0.    (16.15a)

From the formula u_xt = u_tx we obtain

q_t − c p_x = 0.    (16.15b)

In matrix form, these equations are written as

∂/∂t (p, q)^T − c [ 0  1 ; 1  0 ] ∂/∂x (p, q)^T = (0, 0)^T.    (16.16)

We proceed by diagonalizing the matrix in the above equation:

[ 0  1 ; 1  0 ] = S^{−1} [ 1  0 ; 0  −1 ] S,   S = (1/√2) [ 1  1 ; 1  −1 ] = S^{−1}.    (16.17)

Multiplying (16.16) on the left by S^{−1} and using (16.17), we arrive at a diagonal (i.e., decoupled)
system of first-order hyperbolic equations:

∂/∂t (v, w)^T − [ c  0 ; 0  −c ] ∂/∂x (v, w)^T = (0, 0)^T,    (16.18)

where

(v, w)^T = S^{−1} (p, q)^T = (1/√2) (p + q, p − q)^T.    (16.19)

In the component-by-component form, (16.18) is

v_t − c v_x = 0,    (16.18a)
w_t + c w_x = 0.    (16.18b)

In Appendix 2 we show that the general solutions of (16.18) are

v(x, t) = g(x + ct),    (16.20a)
w(x, t) = f(x − ct).    (16.20b)

Substituting these solutions into (16.19) and solving for p and q, we obtain

p = u_t = (1/√2) [ g(x + ct) + f(x − ct) ],
q = c u_x = (1/√2) [ g(x + ct) − f(x − ct) ].    (16.21)

Integration of the latter equations yields

u(x, t) = F(x − ct) + G(x + ct),    (16.6)

where

F(ξ) = −(1/(√2 c)) ∫^ξ f(ξ′) dξ′,   G(η) = (1/(√2 c)) ∫^η g(η′) dη′.    (16.22)

Thus, we have re-obtained the general solution (16.6) of the Wave equation.
As we have noted, the value of representing the Wave equation (16.3) in the form of a
system of first-order equations, (16.16), is that in the next Lecture we will develop methods of
numerical solution of first-order hyperbolic PDEs. In preparation for this development, let us
set up an initial-boundary value problem (IBVP) for the simplest first-order hyperbolic PDE,

w_t + c w_x = 0.    (16.18b)

In what follows, we assume c > 0 unless stated otherwise. As shown in Appendix 2, the solution
of (16.18b) is given by (16.20b). Characteristics

x − ct = ξ = const,   or   t = (1/c)(x − ξ)    (16.23)

of that equation are shown in the figure next to formulae (16.5). By looking at those characteristics,
one sees that one can prescribe the IBVP for (16.18b) in two ways: either by an initial
condition on the entire line,

w(x, t=0) = φ(x),   −∞ < x < ∞,    (16.24)

or on the boundary of the first quadrant of the (x, t)-plane:

w(x, t=0) = φ(x),   x ≥ 0;
w(x=0, t) = g(t),   t ≥ 0.    (16.25)

Then the initial and boundary (if applicable) values will propagate along the characteristics
(16.23) and thereby determine the solution at any point inside the first quadrant (x ≥ 0, t ≥ 0).
Note that the solution of (16.18b) cannot be defined in the second quadrant, (x ≤ 0, t ≥ 0),
because the characteristics do not extend there.
16.3 Appendix 1: Derivation of d'Alembert's formula (16.8)

Substituting (16.6) into (16.7) and using the identities

F_t(x − ct) = −c F′(ξ) = −c F_x(ξ),   G_t(x + ct) = c G′(η) = c G_x(η),    (16.26)

where F′ ≡ dF/dξ and G′ ≡ dG/dη, one obtains:

F(x) + G(x) = φ(x),   −c F_x(x) + c G_x(x) = ψ(x).    (16.27)

Upon differentiating the first of these equations with respect to x, one obtains a system of two
equations for F_x(x) and G_x(x). In a homework problem you will be asked to verify that its
solution integrated over x is

F(x) = (1/2) [ φ(x) − (1/c) ∫ ψ(s) ds ],   G(x) = (1/2) [ φ(x) + (1/c) ∫ ψ(s) ds ].    (16.28a)

Hence

F(x − ct) = (1/2) [ φ(x − ct) − (1/c) ∫^{x−ct} ψ(s) ds ],
G(x + ct) = (1/2) [ φ(x + ct) + (1/c) ∫^{x+ct} ψ(s) ds ].    (16.28b)

The substitution of (16.28b) into (16.6) yields (16.8).
16.4 Appendix 2: Solution of (16.18) is given by (16.20)

Let us begin by putting the PDE

w_t + c w_x = 0    (16.18b)

in the form of an ODE. Consider a change of variables:

(x, t) → (ξ = x − ct, η = x + ct).    (16.29)

Using the Chain Rule for a function of several variables,

∂/∂t = (∂ξ/∂t) ∂_ξ + (∂η/∂t) ∂_η,   ∂/∂x = (∂ξ/∂x) ∂_ξ + (∂η/∂x) ∂_η,    (16.30)

we obtain:

(−c ∂_ξ + c ∂_η) w + c (∂_ξ + ∂_η) w = 0   ⇒   w_η = 0.    (16.31)

The last equation means that w depends only on ξ, which, in view of (16.23), implies (16.20b).
Similarly, one shows that the solution of (16.18a) is given by (16.20a).

Thus, for any two points (x_1, t_1) and (x_2, t_2) in the (x, t)-plane that satisfy x_1 − ct_1 =
x_2 − ct_2, the solution of (16.18b) satisfies

w(x_1, t_1) = w(x_2, t_2).    (16.32)
16.5 Questions for self-assessment

1. Suppose one wants to develop a second-order accurate finite-difference scheme for a hyperbolic PDE. Which of the two ODE methods should one mimic the scheme after: the modified Euler or the leap-frog?
2. What is the meaning of the solution (16.6)?
3. What is the meaning of each piece of the initial conditions (16.7)?
4. Where will the trick based on substitution (16.12) cause problems if the boundary conditions were to depend on time?
5. Verify the statements made after formula (16.14).
6. Why did we want to diagonalize the matrix in (16.16)?
7. Verify that you obtain (16.6) and (16.22) from (16.21).
8. Verify (16.31).
17 Method of characteristics for solving hyperbolic PDEs

In this lecture we will describe a method of numerical integration of hyperbolic PDEs which
uses the fact that all solutions of such PDEs propagate along characteristics.

17.1 Method of characteristics for a single hyperbolic PDE

Let us start the discussion with the simplest, first-order hyperbolic PDE

w_t + c w_x = 0,    (17.1)

where we will take c > 0 for concreteness. For now, we assume that c = const; later on this
restriction will be removed. The general solution of (17.1), derived in Appendix 2 of Lecture
16, is

w(x, t) = w(x − ct).    (17.2)

Thus, if the steps in x and t are related so that

Δx = c Δt,    (17.3)

then

w(x + Δx, t + Δt) = w( x + Δx − c(t + Δt) ) = w(x − ct);    (17.4)

see also (16.32). This simply illustrates the fact that the solution does not change along the
characteristic x − ct = ξ.

To put (17.4) at the foundation of a numerical method, consider the mesh

x_m = mh,   m = 0, 1, 2, …;     t_n = nτ,   n = 0, 1, 2, …,    (17.5a)

where h and τ are related as per (17.3):

h = cτ.    (17.5b)

This is illustrated in the figure on the right.

[Figure: the uniform (x, t) grid (17.5); because h = cτ, the characteristics ξ = 0, ±h, ±2h, …
pass exactly through the grid nodes.]

The initial and boundary conditions for this problem are given by

w(x, t=0) = φ(x),   x ≥ 0;
w(x=0, t) = g(t),   t ≥ 0.    (16.25)

In the discretized form, they are:

W^0_m = φ(x_m),   m ≥ 0;
W^n_0 = g(t_n),   n ≥ 0.    (17.6)

(Obviously, we require φ(0) = g(0) for the boundary and initial conditions to be consistent
with each other.) Then, according to (17.4), the solution at the node (m, n) with m > 0 and
n > 0 is found as

W^n_m = { W^0_{m−n} = φ(x_{m−n}),  m ≥ n;    W^{n−m}_0 = g(t_{n−m}),  n ≥ m. }    (17.7)
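A direct transcription of (17.6)-(17.7) into code might look as follows; this is an illustrative
sketch added to the notes, and the data phi, g and the grid sizes are assumptions.

% Method of characteristics (17.6)-(17.7) for w_t + c w_x = 0, c > 0, on x >= 0.
c   = 2;  h = 0.05;  tau = h/c;                 % step sizes related by (17.3)
M   = 100;  N = 80;                             % numbers of spatial nodes and time steps
phi = @(x) exp(-10*(x-1).^2);                   % assumed initial condition
g   = @(t) zeros(size(t));                      % assumed boundary data (phi(0) is ~0, so consistent)
W   = zeros(N+1, M+1);                          % W(n+1, m+1) stores W^n_m
W(1,:) = phi((0:M)*h);                          % level n = 0
for n = 1:N
    for m = 0:M
        if m >= n
            W(n+1, m+1) = W(1, m-n+1);          % = phi(x_{m-n}),  m >= n
        else
            W(n+1, m+1) = g((n-m)*tau);         % = g(t_{n-m}),    n >= m
        end
    end
end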
This method, called the method of characteristics, can be generalized to the equation

w_t + c(x, t, w) w_x = f(x, t, w).    (17.8a)

For the sake of clarity, we will work out this generalization in two steps. First, we consider the
case when f(x, t, w) ≡ 0, i.e.

w_t + c(x, t, w) w_x = 0.    (17.8b)

Then, making a change of variables

(x, t) → (ξ, t),   where   ξ = x − ∫_0^t c( x(t′), t′, w(x(t′), t′) ) dt′,    (17.9)

and proceeding similarly^{31} to Appendix 2 of Lecture 16, one can show that

w_t = 0   ⇒   w(x, t) = w(ξ)   irrespective of a specific value of t.    (17.10)

The equation for the characteristics ξ = const of (17.8) is obtained by differentiating the expression

x − ∫_0^t c( x(t′), t′, w(x(t′), t′) ) dt′ = const

(see (17.9)) with respect to t. The result is:

dx/dt = c(x, t, w),   w = const,    (17.11)

where the last condition (w = const) appears because along the characteristic, the solution w does
not change (see (17.10)).

[Figure: characteristics of (17.8b) in the (x, t)-plane. The curves emanate from the points
x = 0, ±h, ±2h, … at t = 0 (labeled m = 0, ±1, ±2, …); the points x^1_m mark where these
characteristics cross the time level t = τ.]

Note that unlike in the figure next to Eqs. (17.5), the characteristics corresponding to Eqs. (17.11)
are curved, not straight, lines, as illustrated above.
The numerical solution of Eqs. (17.10) and (17.11) can be generated as follows. Let us denote
by x^n_m the grid point at the intersection of the time level t = nτ and the characteristic
ξ = mh (see the figure above for an illustration). Note that this definition of x^n_m is different
from the definition of x_m in (17.5a). Namely, there, x_m are fixed points of the spatial grid
which are defined independently of the time grid. On the contrary, in scheme (17.12) below,
x^n_m moves along the m-th characteristic and hence is different at each time level.

Continuing with setting up a scheme for (17.10) and (17.11), let W^n_m denote the value of w
at the grid point x^n_m, i.e. W^n_m = w(ξ = mh, t = nτ). Then:

n = 0:   x^0_m = mh,   W^0_m = φ(mh),   m ≥ 0;    (17.12a)

n = 1:   x^1_{−1} = 0,   W^1_{−1} = g(τ),
         x^1_m = x^0_m + ∫_0^τ c( x, t, W^0_m ) dt,   W^1_m = φ(mh),   m ≥ 0;    (17.12b)

n ≥ 2:   x^n_{−n} = 0,   W^n_{−n} = g(nτ),
         x^n_m = x^{n−1}_m + ∫_{(n−1)τ}^{nτ} c( x, t, W^{n−1}_m ) dt,
         W^n_m = g(−mτ) for −(n−1) ≤ m < 0,   W^n_m = φ(mh) for m ≥ 0.    (17.12c)

Above, the expression ∫_{(n−1)τ}^{nτ} c(x, t, w) dt is a symbol denoting the result of integration of
the ODE (17.11) from t = (n−1)τ to t = nτ. This integration may be performed either
analytically (if the problem so allows) or numerically using any of the numerical methods for
ODEs. Note that this integration is the only computation required in (17.12); the rest of it is
just the assignment of known values to the grid points at each time level.

Let us emphasize the meaning of scheme (17.12). First, it computes the values x^n_m along
the respective characteristics for each m as per the first equation in (17.11). Then, the value
of w is kept constant along each characteristic, as specified by (17.10).

Remark: If one wants to keep the number of grid points at each time level of (17.12) the same
(say, (M + 1)), then one needs to chop off the right-most point at every time step. Recall
also that one cannot prescribe a boundary condition at the right boundary x = Mh.

^{31} E.g., ∂/∂t = (∂ξ/∂t) ∂_ξ + (∂t/∂t) ∂_t = −c ∂_ξ + ∂_t.
Let us illustrate the solution of (17.8b) by Eqs. (17.12) for the so-called shock wave equation^{32}

u_t + u u_x = 0,    (17.13)

which arises in a great many applications (e.g., in gas dynamics or in traffic flow modelling).
As an initial condition, let us take

u(x, 0) ≡ φ(x) = { a sin²(πx),  0 ≤ x ≤ 1;   0,  otherwise },    (17.14)

where a is some constant. We will consider this problem on the infinite line, x ∈ (−∞, ∞),
but in our numerical solution will only follow points where u ≠ 0.

According to Eqs. (17.10) and (17.9), the solution of problem (17.13), (17.14) is given by
the implicit formula

u = φ( x − ∫_0^t u(x(t′), t′) dt′ ),    (17.15)

where we have used the initial condition u(x, t=0) = φ(x). Recall that in (17.15), x(t) stands
for the equation of one given characteristic; in other words, the integral is computed along that
characteristic. We will now show that the characteristics of (17.13) have a special form that allows
the integral in (17.15) to be simplified. Indeed, the equation for the characteristics of (17.13)
follows from (17.11):

dx/dt = u,   u = const.    (17.16)

Thus, since the u on the right-hand side of (17.15) is constant along the characteristic,
(17.15) reduces to

u = φ(x − ut).    (17.15′)

This is now an implicit algebraic equation for u which, in principle, can be solved for each pair
(x, t).

^{32} Another name for this equation, or, more precisely, for the more general equation u_t + c(u) u_x = 0, is the
simple wave equation, with the adjective "simple" originating from physical considerations.
To obtain a numerical solution of (17.13), (17.14) explicitly, we use a modification of (17.12)
which allows us to keep track of only those grid points where u ≠ 0. The corresponding
scheme on such a moving grid is:

n = 0:   x^0_m = mh,   U^0_m = φ(mh);    (17.17a)
n = 1:   x^1_m = x^0_m + τ U^0_m,   U^1_m = φ(mh);    (17.17b)
(where the meaning of index m is clarified in (17.17d) below)
n ≥ 2:   x^n_m = x^{n−1}_m + τ U^{n−1}_m,   U^n_m = U^{n−1}_m = … = U^0_m.    (17.17c)

Note that in all these equations,

m = 0, 1, …, M,  and  h = 1/M  (see (17.14)),    (17.17d)

so that a particular value of m labels the characteristic emanating from the point (x = mh, t = 0).
This way of labeling is illustrated in the figure next to Eq. (17.11). It defines a grid which
moves to the right (given that the initial velocity φ(x) ≥ 0). Also, at each time level except the
one at t = 0, the internode spacing along x is not uniform. Note that (17.17) is the discretized
form of the exact analytical solution (17.15′). You will be asked to plot solution (17.17) in a
homework problem.
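The sketch below (an addition to the notes; the plotting choices and parameter values are
illustrative) implements the moving-grid scheme (17.17) for the initial condition (17.14).

% Moving-grid solution (17.17) of u_t + u u_x = 0 with u(x,0) = a*sin^2(pi*x) on [0,1].
a = 1;  M = 100;  h = 1/M;  tau = 0.02;  N = 25;
x = (0:M)*h;                     % x^0_m
U = a*sin(pi*x).^2;              % U^0_m; stays attached to its characteristic for all n
figure;  hold on
plot(x, U);
for n = 1:N
    x = x + tau*U;               % x^n_m = x^{n-1}_m + tau*U^{n-1}_m; U does not change
    if mod(n, 5) == 0, plot(x, U); end
end
xlabel('x');  ylabel('u');       % the profile steepens as the characteristics converge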
Let us now return to Eq. (17.8) with f ≢ 0. Similarly to (17.10), one then obtains

w_t = \bar{f}(ξ, t, w),    (17.18a)

where \bar{f}(ξ, t, w) is obtained from f(x, t, w) by the change of variables (17.9). (For example,
if f(x, t) = x + t² and c = 3, then \bar{f}(ξ, t) = ξ + 3t + t².) Equation (17.18a) says that w(ξ, t)
is no longer a constant along a characteristic

ξ = const    (17.18b)

but instead varies along it in the prescribed manner. When solving (17.18a), ξ should be
considered as a constant parameter. To find the equation of the characteristics, one needs to
solve the first equation in (17.11), where instead of the second equation in (17.11), i.e. w = const,
one now needs to use (17.18a). Thus, the solution of the original PDE (17.8a) reduces to the
solution of two coupled ODEs: (17.18) and

dx/dt = \bar{c}(ξ, t, w),   x(t=0) = ξ,    (17.19)

where \bar{c}(ξ, t, w) is obtained from c(x, t, w) by the change of variables (17.9) (see the clarification
after (17.18a)). An implementation of the solution of (17.18) and (17.19) that assumes the
boundary conditions (17.6) is given below:

n = 0:   x^0_m = mh,   ξ_m = x^0_m,   W^0_m = φ(mh);    (17.20a)

n = 1:   x^1_{−1} = 0,   W^1_{−1} = g(τ),
         ( x^1_m, W^1_m )^T = ( x^0_m, W^0_m )^T + ∫_0^τ ( \bar{c}(ξ_m, t, w_m(t)), \bar{f}(ξ_m, t, w_m(t)) )^T dt,   m ≥ 0;    (17.20b)

n ≥ 2:   x^n_{−n} = 0,   W^n_{−n} = g(nτ),
         ( x^n_m, W^n_m )^T = ( x^{n−1}_m, W^{n−1}_m )^T + ∫_{(n−1)τ}^{nτ} ( \bar{c}(ξ_m, t, w_m(t)), \bar{f}(ξ_m, t, w_m(t)) )^T dt,   m ≥ −n + 1.    (17.20c)

Here the expression

∫_{(n−1)τ}^{nτ} ( \bar{c}(ξ_m, t, w_m(t)), \bar{f}(ξ_m, t, w_m(t)) )^T dt

is a symbol that denotes the result of integration of the coupled ODEs (17.18) and (17.19). In
practice, this integration can be done numerically by any suitable ODE method. Also, w_m(t)
above means the solution along the characteristic ξ = ξ_m (see (17.20a)).

The meaning of scheme (17.20) is the following. As previously with scheme (17.12), it computes
the characteristic curves x_m(t), labeled by ξ_m, as per (17.19). However, unlike (17.12), now the value
of w is not constant along each characteristic but instead varies according to (17.18). Note
that now the equations for the characteristic and for the solution w are coupled and need to
be solved simultaneously.
17.2 Method of characteristics for a system of hyperbolic PDEs

In this section, we will first point out technical difficulties that can arise when using the
method of characteristics for a system of PDEs. Then we will work out an example where those
difficulties do not occur.

If we attempt to generalize the approach that led to schemes (17.12) and (17.20) to the case where
one has a coupled system of two PDEs with intersecting families of curved (i.e., not straight-line)
characteristics, one is likely to encounter a problem depicted in the figure on the right. Namely,
suppose the characteristics of the two families are chosen to intersect at level t = 0. In the figure,
the intersection points are x_{m−1}, x_m, x_{m+1}, etc. However, these characteristics no longer intersect
at subsequent time levels; this is especially visible at levels t = 2τ and t = 3τ.

[Figure: two families of curved characteristics that intersect at the points x_{m−1}, x_m, x_{m+1}
at t = 0 but at no common points at later time levels.]

An analogous problem can also occur if one has three or more characteristics, even if they are
straight lines. The only case where this will not occur is when all the characteristics can be chosen
to intersect at uniformly spaced points at each level. An example of such a special situation is
shown on the right. Note that the vertical characteristics are just the lines

dx/dt = 0   ⇒   x(t) = x_m = const.    (17.21)

[Figure: three families of straight-line characteristics (left-going, vertical, and right-going)
intersecting at the uniformly spaced points x_{m−1}, x_m, x_{m+1} at every time level.]

Characteristics (17.21) occur whenever the system of PDEs includes an equation

w_{j, t} = f( x, t, \vec{w} ).    (17.22)

You will be asked to verify this in a QSA.

A way around this issue is to interpolate the values of the solution at each time level. For
example, suppose one is to solve a system of two PDEs for w_1 and w_2 on the segment 0 ≤ x ≤ 1,
with the characteristics for w_1 (w_2) going "northeast" ("northwest"). Let there be (M−1) internal
points, x_m = mh, m = 1, …, (M−1), at the initial time level t_0 = 0. Suppose that the
characteristics for w_j, j = 1, 2, intersect the next time level t_1 = τ at points x^{(j)}_m. Then one
can interpolate the set of values w^{(j)}_m from the respective nonuniform grid x^{(j)}_m onto the same
grid x_m as at the initial time level. This interpolation process is then repeated at every time
level.

Matlab's command to interpolate a vector y from a grid defined by a vector x (such that
length(x) = length(y)) onto a vector xx is:

yy = spline(x, y, xx);
We will now work out a solution of a system of two PDEs with straight-line characteristics:

w_{1, t} + c w_{1, x} = f_1(x, t, w_1, w_2),
w_{2, t} − c w_{2, x} = f_2(x, t, w_1, w_2),
w_j(x, 0) = φ_j(x),   0 ≤ x ≤ 1,   j = 1, 2;
w_1(0, t) = g_1(t),   w_2(1, t) = g_2(t),   t ≥ 0.    (17.23)

The two characteristic directions of (17.23) are

ξ_j = x − c_j t,   j = 1, 2;   c_1 = c,   c_2 = −c.    (17.24)

If f_j for j = 1 and/or 2 in (17.23) vanishes, then the respective w_j will not change along its
characteristic ξ_j. Therefore, for f_j ≠ 0, it is convenient to calculate the change of w_j along the
characteristic ξ_j. Then, similarly to (17.18a), we can write the first two equations of (17.23) as

w_{j, t} = f_j( ξ_j + c_j t, t, w_1(ξ_1, t), w_2(ξ_2, t) )   along ξ_j = const,   j = 1, 2.    (17.25)
MATH 337, by T. Lakoba, University of Vermont 179
Note that while integrating, say, the equation with j = 1, the argument
2
of w
2
should not be
considered as constant. At the moment, this prescription is rather vague, but later on we will
present a specic example of how it can be implemented.
The formal numerical implementation of the solution of (17.25) is given below on the grid
(17.5), where the maximum value of m = M corresponds to the right boundary x = 1. Note that
this grid is stationary and hence is different from the moving grids used in schemes (17.12),
(17.17), and (17.20). In particular, in this stationary grid, m does not label a particular
characteristic.
The scheme for (17.25) is:
$$
n = 0: \qquad (\xi_j)^0_m = x_m, \qquad (W_j)^0_m = \phi_j(mh), \qquad j = 1, 2; \qquad (17.26a)
$$
$$
n \ge 1: \qquad
\begin{array}{ll}
(\xi_1)^n_m = x_m - c_1\, n\tau = (m - n)h & \text{(see (17.5a,b))}, \\[1mm]
(W_1)^n_0 = g_1(n\tau), & \\[1mm]
(W_1)^n_m = (W_1)^{n-1}_{m-1} + \displaystyle\int_{(n-1)\tau}^{n\tau}
  f_1\big((\xi_1)^{n-1}_{m-1} + c_1 t,\; t,\; W_1,\; W_2\big)\, dt, & m = 1, \ldots, M; \\[3mm]
(\xi_2)^n_m = x_m - c_2\, n\tau = (m + n)h, & \\[1mm]
(W_2)^n_M = g_2(n\tau), & \\[1mm]
(W_2)^n_m = (W_2)^{n-1}_{m+1} + \displaystyle\int_{(n-1)\tau}^{n\tau}
  f_2\big((\xi_2)^{n-1}_{m+1} + c_2 t,\; t,\; W_1,\; W_2\big)\, dt, & m = 0, \ldots, M - 1.
\end{array}
\qquad (17.26b)
$$
Note that with the step sizes along the temporal and spatial coordinates being related by
(17.5b), the values of ξ_1 and ξ_2 stay constant along the lines m − n = const and m + n = const,
respectively.
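As a quick check, using the expressions for (ξ_j)^n_m from (17.26b):
$$
(\xi_1)^n_m = (m - n)h = \big((m-1) - (n-1)\big)h = (\xi_1)^{n-1}_{m-1},
\qquad
(\xi_2)^n_m = (m + n)h = (\xi_2)^{n-1}_{m+1}.
$$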
To turn scheme (17.26) into a useful tool, we need to specify how the integrals
$$
\int_{(n-1)\tau}^{n\tau} f_j\big((\xi_j)^{n-1}_{m+(-1)^j} + c_j t,\; t,\; W_1,\; W_2\big)\, dt,
\qquad j = 1, 2,
$$
can be computed. Recall that these integrals are just the symbols denoting the increment of the
solutions of (17.25) from t = (n − 1)τ to t = nτ along the respective characteristic ξ_j = const.
Below we show how this can be done by the modified explicit Euler method. We will write the
equations first and then will comment on their meaning.
$$
\begin{array}{l}
\overline{W}_1 = (W_1)^{n-1}_{m-1} + \tau\, f_1\big((\xi_1)^{n-1}_{m-1} + c_1\tau(n-1),\;
  \tau(n-1),\; (W_1)^{n-1}_{m-1},\; (W_2)^{n-1}_{m-1}\big), \\[2mm]
\overline{W}_2 = (W_2)^{n-1}_{m+1} + \tau\, f_2\big((\xi_2)^{n-1}_{m+1} + c_2\tau(n-1),\;
  \tau(n-1),\; (W_1)^{n-1}_{m+1},\; (W_2)^{n-1}_{m+1}\big);
\end{array}
\qquad (17.27a)
$$
$$
\begin{array}{l}
(W_1)^n_m = \dfrac{1}{2}\Big[(W_1)^{n-1}_{m-1} + \overline{W}_1
  + \tau\, f_1\big((\xi_1)^n_m + c_1\tau n,\; \tau n,\; \overline{W}_1,\; \overline{W}_2\big)\Big], \\[2mm]
(W_2)^n_m = \dfrac{1}{2}\Big[(W_2)^{n-1}_{m+1} + \overline{W}_2
  + \tau\, f_2\big((\xi_2)^n_m + c_2\tau n,\; \tau n,\; \overline{W}_1,\; \overline{W}_2\big)\Big].
\end{array}
\qquad (17.27b)
$$
Note that the notations (ξ_1)^{n−1}_{m−1} + c_1 τ(n−1) and (ξ_2)^{n−1}_{m+1} + c_2 τ(n−1) in
(17.27a) have been used only to mimic the corresponding terms in (17.25). Those terms, as
evident from the first
two equations of (17.23) and from (17.25), must equal x_{m−1} and x_{m+1} for j = 1 and j = 2,
respectively. Indeed:
$$
\begin{array}{l}
(\xi_1)^{n-1}_{m-1} + c_1\tau(n-1) = \big[h(m-1) - c\tau(n-1)\big] + c\tau(n-1) = x_{m-1}, \\[2mm]
(\xi_2)^{n-1}_{m+1} + c_2\tau(n-1) = \big[h(m+1) + c\tau(n-1)\big] - c\tau(n-1) = x_{m+1},
\end{array}
$$
where we have used the equations for (ξ_j)^n_m from (17.26b). Similarly, the terms
(ξ_j)^n_m + c_j τn in (17.27b) equal x_m for both j = 1 and 2.
The meaning of the first equation in (17.27a) is the following. The change of W_1 is computed
along the characteristic ξ_1 = (ξ_1)^{n−1}_{m−1} by the simple Euler approximation, whereby all
arguments of f_1 are evaluated at the starting node (x = x_{m−1}, t = t_{n−1}). Since, as we have
said, this change occurs along the characteristic ξ_1 = (ξ_1)^{n−1}_{m−1}, which is labeled
ξ_1 = const in the figure on the right, the final node of this step is (x = x_m, t = t_n).
[Figure: nodes m−1, m, m+1 at the time levels n−1 and n; the characteristic labeled ξ_1 = const runs from (x_{m−1}, t_{n−1}) to (x_m, t_n), and the characteristic labeled ξ_2 = const runs from (x_{m+1}, t_{n−1}) to (x_m, t_n).]
Similarly, the change of W_2 in (17.27a) is computed along the characteristic
ξ_2 = (ξ_2)^{n−1}_{m+1} by the simple Euler approximation; hence all arguments of f_2 are
evaluated at the starting node (x = x_{m+1}, t = t_{n−1}) for that characteristic (which is
labeled ξ_2 = const in the figure above). The step along this characteristic ends at the same
node (x = x_m, t = t_n).
Finally, the equations in (17.27b) are the standard corrector equations of the explicit
modified Euler method.
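To make the preceding description concrete, below is a minimal Matlab sketch of one time step of scheme (17.26)–(17.27), showing the update of W_1 only (the update of W_2 is analogous). The right-hand sides f1, f2, the boundary data g1, g2, and the initial profiles are arbitrary sample choices made only so that the sketch is self-contained; they are not taken from these notes.

% One time step of (17.26)-(17.27) for system (17.23), assuming c*tau = h.
c = 1;  M = 50;  h = 1/M;  tau = h/c;
f1 = @(x,t,w1,w2) -w1 + w2;   g1 = @(t) sin(t);     % sample choices
f2 = @(x,t,w1,w2)  w1 - w2;   g2 = @(t) 0*t;
x  = (0:M)*h;                          % x(i) is the node x_{i-1}, i = 1..M+1
W1 = sin(pi*x);   W2 = cos(pi*x);      % sample solution at level t_{n-1}
t0 = 0;   t1 = t0 + tau;               % levels t_{n-1} and t_n

W1new = zeros(1,M+1);   W1new(1)   = g1(t1);   % BC for w_1 at x = 0
W2new = zeros(1,M+1);   W2new(M+1) = g2(t1);   % BC for w_2 at x = 1

for i = 2:M+1                          % update W1 at the nodes x_1, ..., x_M
    % Predictors (17.27a): Euler steps along xi_1 (starting at x_{m-1})
    % and along xi_2 (starting at x_{m+1}):
    P1 = W1(i-1) + tau*f1(x(i-1), t0, W1(i-1), W2(i-1));
    if i <= M
        P2 = W2(i+1) + tau*f2(x(i+1), t0, W1(i+1), W2(i+1));
    else
        P2 = g2(t1);                   % at x_M the value of w_2 comes from the BC
    end
    % Corrector (17.27b), first equation:
    W1new(i) = ( W1(i-1) + P1 + tau*f1(x(i), t1, P1, P2) )/2;
end
% The update of W2 at x_0, ..., x_{M-1} is analogous, with the roles of the
% two characteristics (and of the indices m-1 and m+1) interchanged.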
Scheme (17.26), (17.27) can be straightforwardly generalized for more than two coupled
first-order hyperbolic PDEs, as long as all the characteristics can be chosen to intersect at
uniformly spaced points at each time level. An example of that situation is shown in the
figure next to Eq. (17.21). Extending the scheme to use a Runge–Kutta type method of order
higher than two (which is the order of the modified explicit Euler method) also appears to be
straightforward.
17.3 Questions for self-assessment
1. What is the meaning of scheme (17.7)?
2. Where can one specify a boundary condition for Eqs. (17.8) and where can one not?
3. Why is w = const in (17.11)?
4. What is the meaning of scheme (17.12)?
5. What does the expression $\int_{(n-1)\tau}^{n\tau} c\big(x(t'),\, t',\, w\big)\, dt'$ in (17.12) stand for?
6. Explain where solution (17.15) comes from and then how it is reduced to (17.15′).
7. What is the meaning of scheme (17.20)?
8. Describe a technical problem that is likely to occur when solving a system of coupled
PDEs by the method of characteristics. How can this problem be overcome?
9. Verify the statement found after Eq. (17.22).
10. What is the difference between the grids used in schemes (17.12) and (17.20), on one
hand, and in scheme (17.26), on the other?
11. Why is W_2 in the first line of (17.27a) evaluated at x_{m−1}? Why is W_1 in the second line
of (17.27a) evaluated at x_{m+1}?