
http://homepage.cs.uiowa.edu/~atkinson/ena_master.html

CALCULATION OF FUNCTIONS

Using hand calculations, a hand calculator, or a com-


puter, what are the basic operations of which we are
capable? In essence, they are addition, subtraction,
multiplication, and division (and even this will usually
require a truncation of the quotient at some point).
In addition, we can make logical decisions, such as
deciding which of the following are true for two real
numbers a and b:
a > b, a = b, a<b
Furthermore, we can carry out only a finite number
of such operations. If we limit ourselves to just addi-
tion, subtraction, and multiplication, then in evaluat-
ing functions f (x) we are limited to the evaluation of
polynomials:
p(x) = a_0 + a_1 x + ... + a_n x^n

In this, n is the degree (provided a_n ≠ 0) and {a_0, ..., a_n}
are the coefficients of the polynomial. Later we will
discuss the efficient evaluation of polynomials; but for
now, we ask how we are to evaluate other functions
such as e^x, cos x, log x, and others.
TAYLOR POLYNOMIAL APPROXIMATIONS

We begin with an example, that of f(x) = e^x from
the text. Consider evaluating it for x near 0. We
look for a polynomial p(x) whose values will be the
same as those of e^x to within acceptable accuracy.

Begin with a linear polynomial p(x) = a_0 + a_1 x. Then
to make its graph look like that of e^x, we ask that the
graph of y = p(x) be tangent to that of y = e^x at
x = 0. Doing so leads to the formula

p(x) = 1 + x
[Figure: graphs of y = e^x and y = p_1(x) = 1 + x on [-1, 1]]
Continue in this manner looking next for a quadratic
polynomial

p(x) = a_0 + a_1 x + a_2 x^2

We again make it tangent; and to determine a_2, we
also ask that p(x) and e^x have the same curvature
at the origin. Combining these requirements, we have
for f(x) = e^x that

p(0) = f(0),   p'(0) = f'(0),   p''(0) = f''(0)

This yields the approximation

p(x) = 1 + x + (1/2) x^2
[Figure: graphs of y = e^x, y = p_1(x), and y = p_2(x) on [-1, 1]]
We continue this pattern, looking for a polynomial

p(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n

We now require that

p(0) = f(0),   p'(0) = f'(0),   ...,   p^(n)(0) = f^(n)(0)

This leads to the formula

p(x) = 1 + x + (1/2!) x^2 + ... + (1/n!) x^n

What are the problems when evaluating points x that


are far from 0?
[Figure: the errors y = e^x - p_k(x) for k = 1, 2, 3, 4 on [-1, 1]; vertical scale 0 to .75]
TAYLOR'S APPROXIMATION FORMULA

Let f(x) be a given function, and assume it has deriv-
atives around some point x = a (with as many deriv-
atives as we find necessary). We seek a polynomial
p(x) of degree at most n, for some non-negative inte-
ger n, which will approximate f(x) by satisfying the
following conditions:

p(a) = f(a)
p'(a) = f'(a)
p''(a) = f''(a)
...
p^(n)(a) = f^(n)(a)

The general formula for this polynomial is

p_n(x) = f(a) + (x - a) f'(a) + (1/2!)(x - a)^2 f''(a)
         + ... + (1/n!)(x - a)^n f^(n)(a)

Then f(x) ≈ p_n(x) for x close to a.
TAYLOR POLYNOMIALS FOR f (x) = log x

In this case, we expand about the point x = 1, making
the polynomial tangent to the graph of f(x) = log x
at the point x = 1. For a general degree n ≥ 1, this
results in the polynomial

p_n(x) = (x - 1) - (1/2)(x - 1)^2 + (1/3)(x - 1)^3
         - ... + (-1)^(n-1) (1/n)(x - 1)^n
Note the graphs of these polynomials for varying n.
[Figure: graphs of y = log x and y = p_n(x) for n = 1, 2, 3 near x = 1]
THE TAYLOR POLYNOMIAL ERROR FORMULA

Let f(x) be a given function, and assume it has deriva-
tives around some point x = a (with as many deriva-
tives as we find necessary). For the error in the Taylor
polynomial p_n(x), we have the formulas

f(x) - p_n(x) = (1/(n + 1)!) (x - a)^(n+1) f^(n+1)(c_x)
             = (1/n!) ∫_a^x (x - t)^n f^(n+1)(t) dt

The point c_x is restricted to the interval bounded by x
and a, and otherwise c_x is unknown. We will use the
first form of this error formula, although the second
is more precise in that you do not need to deal with
the unknown point c_x.
Consider the special case of n = 0. Then the Taylor
polynomial is the constant function:

f(x) ≈ p_0(x) = f(a)

The first form of the error formula becomes

f(x) - p_0(x) = f(x) - f(a) = (x - a) f'(c_x)

with c_x between a and x. You have seen this in
your beginning calculus course, and it is called the
mean-value theorem. The error formula

f(x) - p_n(x) = (1/(n + 1)!)(x - a)^(n+1) f^(n+1)(c_x)

can be considered a generalization of the mean-value
theorem.
EXAMPLE: f(x) = e^x

For general n ≥ 0, and expanding e^x about x = 0, we
have that the degree n Taylor polynomial approxima-
tion is given by

p_n(x) = 1 + x + (1/2!) x^2 + (1/3!) x^3 + ... + (1/n!) x^n

For the derivatives of f(x) = e^x, we have

f^(k)(x) = e^x,   f^(k)(0) = 1,   k = 0, 1, 2, ...

For the error,

e^x - p_n(x) = (1/(n + 1)!) x^(n+1) e^(c_x)

with c_x located between 0 and x. Note that for x ≥ 0,
we must have c_x ≥ 0 and

e^x - p_n(x) ≥ (1/(n + 1)!) x^(n+1)

This last term is also the final term in p_(n+1)(x), and
thus

e^x - p_n(x) ≥ p_(n+1)(x) - p_n(x)
Consider calculating an approximation to e. Then let
x = 1 in the earlier formulas to get

p_n(1) = 1 + 1 + 1/2! + 1/3! + ... + 1/n!

For the error,

e - p_n(1) = (1/(n + 1)!) e^(c_x),   0 ≤ c_x ≤ 1

To bound the error, we have

e^0 ≤ e^(c_x) ≤ e^1
1/(n + 1)! ≤ e - p_n(1) ≤ e/(n + 1)!

To have an approximation accurate to within 10^(-5),
we choose n large enough to have

e/(n + 1)! ≤ 10^(-5)

which is true if n ≥ 8. In fact,

e - p_8(1) ≤ e/9! ≈ 7.5 × 10^(-6)

Then calculate p_8(1) ≈ 2.71827877, and e - p_8(1) ≈
3.06 × 10^(-6).
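As a quick check, the numbers above can be reproduced with a minimal MATLAB fragment of our own (the variable names are not from the text):

% Evaluate p_n(1) = 1 + 1 + 1/2! + ... + 1/n! and its error for n = 8
n = 8;
p = sum(1 ./ factorial(0:n));   % p_8(1)
err = exp(1) - p;               % about 3.06e-6, within the bound e/9!
fprintf('p_%d(1) = %.8f   error = %.2e\n', n, p, err)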
FORMULAS OF STANDARD FUNCTIONS

1/(1 - x) = 1 + x + x^2 + ... + x^n + x^(n+1)/(1 - x)

cos x = 1 - x^2/2! + x^4/4! - ... + (-1)^m x^(2m)/(2m)!
        + (-1)^(m+1) x^(2m+2)/(2m + 2)! cos c_x

sin x = x - x^3/3! + x^5/5! - ... + (-1)^(m-1) x^(2m-1)/(2m - 1)!
        + (-1)^m x^(2m+1)/(2m + 1)! cos c_x

with c_x between 0 and x.
OBTAINING TAYLOR FORMULAS

Most Taylor polynomials have been obtained by means
other than directly using the formula

p_n(x) = f(a) + (x - a) f'(a) + (1/2!)(x - a)^2 f''(a)
         + ... + (1/n!)(x - a)^n f^(n)(a)

because of the difficulty of obtaining the derivatives
f^(k)(x) for larger values of k. Actually, this is now
much easier, as we can use Maple or Mathematica.
Nonetheless, most formulas have been obtained by
manipulating standard formulas; and examples of this
are given in the text.

For example, use

e^t = 1 + t + (1/2!) t^2 + (1/3!) t^3 + ... + (1/n!) t^n
      + (1/(n + 1)!) t^(n+1) e^(c_t)

in which c_t is between 0 and t. Let t = -x^2 to obtain

e^(-x^2) = 1 - x^2 + (1/2!) x^4 - (1/3!) x^6 + ... + ((-1)^n/n!) x^(2n)
           + ((-1)^(n+1)/(n + 1)!) x^(2n+2) e^(c_t)

Because c_t must be between 0 and -x^2, it must be
negative. Thus we can write c_t = -ξ in the error
term, with 0 ≤ ξ ≤ x^2.
EVALUATING A POLYNOMIAL

Consider having a polynomial

p(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n

which you need to evaluate for many values of x. How
do you evaluate it? This may seem a strange question,
but the answer is not as obvious as you might think.

The standard way, written in a loose algorithmic for-
mat:

    poly = a_0
    for j = 1:n
        poly = poly + a_j * x^j
    end

To compare the costs of different numerical meth-
ods, we do an operations count, and then we compare
these for the competing methods. Above, the counts
are as follows:

    additions: n
    multiplications: 1 + 2 + 3 + ... + n = n(n + 1)/2

This assumes each term a_j x^j is computed indepen-
dently of the remaining terms in the polynomial.
Next, compute the powers x^j recursively:

x^j = x · x^(j-1)

Then to compute {x^2, x^3, ..., x^n} will cost n - 1 mul-
tiplications. Our algorithm becomes

    poly = a_0 + a_1 * x
    power = x
    for j = 2:n
        power = x * power
        poly = poly + a_j * power
    end

The total operations cost is

    additions: n
    multiplications: n + n - 1 = 2n - 1

When n is even moderately large, this is much less
than for the first method of evaluating p(x). For ex-
ample, with n = 20, the first method has 210 multi-
plications, whereas the second has 39 multiplications.
We now consider nested multiplication. As exam-
ples of particular degrees, write

n = 2:  p(x) = a_0 + x(a_1 + a_2 x)
n = 3:  p(x) = a_0 + x(a_1 + x(a_2 + a_3 x))
n = 4:  p(x) = a_0 + x(a_1 + x(a_2 + x(a_3 + a_4 x)))

These contain, respectively, 2, 3, and 4 multiplica-
tions. This is fewer than the preceding method, which
would have needed 3, 5, and 7 multiplications, respec-
tively.

For the general case, write

p(x) = a_0 + x(a_1 + x(a_2 + ... + x(a_(n-1) + a_n x) ... ))

This requires n multiplications, which is only about
half that for the preceding method. For an algorithm,
write

    poly = a_n
    for j = n-1 : -1 : 0
        poly = a_j + x * poly
    end

With all three methods, the number of additions is n;
but the number of multiplications can be dramatically
different for large values of n.
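As an illustration, here is a minimal MATLAB version of the nested multiplication algorithm (the function name horner and the coefficient ordering a = [a_0, a_1, ..., a_n] are our own choices, not from the text):

function poly = horner(a, x)
% HORNER  Evaluate p(x) = a(1) + a(2)*x + ... + a(n+1)*x^n
% by nested multiplication (n multiplications, n additions).
n = length(a) - 1;
poly = a(n+1);
for j = n:-1:1
    poly = a(j) + x .* poly;   % works elementwise for a vector of x values
end
end

For example, horner([-1 3 -3 1], 2) returns 1, since -1 + 3x - 3x^2 + x^3 = (x - 1)^3.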
NESTED MULTIPLICATION

Imagine we are evaluating the polynomial

p(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n

at a point x = z. Thus with nested multiplication

p(z) = a_0 + z(a_1 + z(a_2 + ... + z(a_(n-1) + a_n z) ... ))

We can write this as the following sequence of oper-
ations:

b_n = a_n
b_(n-1) = a_(n-1) + z b_n
b_(n-2) = a_(n-2) + z b_(n-1)
...
b_0 = a_0 + z b_1

The quantities b_(n-1), ..., b_0 are simply the quantities in
parentheses, starting from the innermost and working
outward.
Introduce

q(x) = b_1 + b_2 x + b_3 x^2 + ... + b_n x^(n-1)

Claim:

p(x) = b_0 + (x - z) q(x)     (*)

Proof: Simply expand

b_0 + (x - z) [b_1 + b_2 x + b_3 x^2 + ... + b_n x^(n-1)]

and use the fact that

z b_j = b_(j-1) - a_(j-1),   j = 1, ..., n

With this result (*), we have

p(x)/(x - z) = b_0/(x - z) + q(x)

Thus q(x) is the quotient when dividing p(x) by x - z,
and b_0 is the remainder.
If z is a zero of p(x), then b_0 = 0; and then

p(x) = (x - z) q(x)

For the remaining roots of p(x), we can concentrate
on finding those of q(x). In rootfinding for polynomi-
als, this process of reducing the size of the problem is
called deflation.

Another consequence of (*) is the following. Form
the derivative of (*) with respect to x, obtaining

p'(x) = (x - z) q'(x) + q(x)
p'(z) = q(z)

Thus to evaluate p(x) and p'(x) simultaneously at x =
z, we can use nested multiplication for p(z) and we
can use the intermediate steps of this to also evaluate
p'(z). This is useful when doing rootfinding problems
for polynomials by means of Newton's method.
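A compact MATLAB sketch of this simultaneous evaluation (again with our own function name and the coefficient ordering a = [a_0, ..., a_n]) might be:

function [pz, dpz] = nested_eval(a, z)
% Evaluate p(z) and p'(z) together by nested multiplication,
% using p'(z) = q(z), where q(z) is accumulated from the b_j.
n = length(a) - 1;
b = a(n+1);          % b_n
q = 0;               % running value of q(z)
for j = n:-1:1
    q = b + z*q;     % uses the previous b_j before it is updated
    b = a(j) + z*b;  % b_{j-1} = a_{j-1} + z*b_j
end
pz = b;              % p(z) = b_0
dpz = q;             % p'(z) = q(z)
end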
APPROXIMATING SF(x)

Define

SF(x) = (1/x) ∫_0^x (sin t)/t dt,   x ≠ 0

We use Taylor polynomials to approximate this func-
tion, to obtain a way to compute it with accuracy and
simplicity.

[Figure: graph of y = SF(x) on [-8, 8]]
As an example, begin with the degree 3 Taylor ap-
proximation to sin t, expanded about t = 0:

sin t = t - (1/6) t^3 + (1/120) t^5 cos c_t

with c_t between 0 and t. Then

(sin t)/t = 1 - (1/6) t^2 + (1/120) t^4 cos c_t

∫_0^x (sin t)/t dt = ∫_0^x [1 - (1/6) t^2 + (1/120) t^4 cos c_t] dt
                   = x - (1/18) x^3 + (1/120) ∫_0^x t^4 cos c_t dt

(1/x) ∫_0^x (sin t)/t dt = 1 - (1/18) x^2 + R_2(x)

R_2(x) = (1/(120 x)) ∫_0^x t^4 cos c_t dt

How large is the error in the approximation

SF(x) ≈ 1 - (1/18) x^2

on the interval [-1, 1]? Since |cos c_t| ≤ 1, we have
for x > 0 that

0 ≤ R_2(x) ≤ (1/(120 x)) ∫_0^x t^4 dt = (1/600) x^4

and the same result can be shown for x < 0. Then
for |x| ≤ 1, we have

0 ≤ R_2(x) ≤ 1/600

To obtain a more accurate approximation, we can pro-
ceed exactly as above, but simply use a higher degree
approximation to sin t.
In the book we consider finding a Taylor polynomial
approximation to SF(x) with its error satisfying

|R_8(x)| ≤ 5 × 10^(-9),   |x| ≤ 1

A Matlab program, plot_sint.m, implementing this
approximation is given in the text and in the class
account. The one in the class account includes the
needed additional functions sint_tay.m and poly_even.m.

Begin with a Taylor series for sin t,

sin t = t - t^3/3! + t^5/5! - ... + (-1)^(n-1) t^(2n-1)/(2n - 1)!
        + (-1)^n t^(2n+1)/(2n + 1)! cos(c_t)

with c_t between 0 and t. Then write

SF(x) = (1/x) ∫_0^x [1 - t^2/3! + t^4/5! - ...
              + (-1)^(n-1) t^(2n-2)/(2n - 1)!] dt + R_(2n-2)(x)
      = 1 - x^2/(3!·3) + x^4/(5!·5) - ...
              + (-1)^(n-1) x^(2n-2)/[(2n - 1)!(2n - 1)] + R_(2n-2)(x)

R_(2n-2)(x) = (1/x) ∫_0^x (-1)^n [t^(2n)/(2n + 1)!] cos(c_t) dt

To simplify matters, let x > 0. Since |cos(c_t)| ≤ 1,

|R_(2n-2)(x)| ≤ (1/x) ∫_0^x t^(2n)/(2n + 1)! dt = x^(2n)/[(2n + 1)!(2n + 1)]

It is easy to see that this bound is also valid for x < 0.
As required, choose the degree so that

|R_(2n-2)(x)| ≤ 5 × 10^(-9)

From the error bound,

max_{|x| ≤ 1} |R_(2n-2)(x)| ≤ 1/[(2n + 1)!(2n + 1)]

Choose n so that this upper bound is itself bounded
by 5 × 10^(-9). This is true if 2n + 1 ≥ 11, i.e. n ≥ 5.
The polynomial is

p(x) = 1 - x^2/(3!·3) + x^4/(5!·5) - x^6/(7!·7) + x^8/(9!·9),   -1 ≤ x ≤ 1
and

|SF(x) - p(x)| ≤ 5 × 10^(-9),   |x| ≤ 1

To evaluate it efficiently, we set u = x^2 and evaluate

g(u) = 1 - u/18 + u^2/600 - u^3/35280 + u^4/3265920

After the evaluation of the coefficients (done once),
the total number of arithmetic operations is 4 addi-
tions and 5 multiplications to evaluate p(x) for each
value of x.
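A minimal MATLAB version of this evaluation (our own function name; the nested form matches the operation count just given) is:

function s = sf_approx(x)
% Approximate SF(x) = (1/x) * integral of sin(t)/t on [0,x], for |x| <= 1,
% using the degree 8 polynomial derived above; error below 5e-9.
u = x.^2;
s = 1 + u.*(-1/18 + u.*(1/600 + u.*(-1/35280 + u/3265920)));
end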
COMPUTING ANOMALIES

These examples are meant to help motivate the study


of machine arithmetic.
1. Calculator example: Use an HP-15C calculator,
which contains 10 digits in its display. Let

x_1 = x_2 = x_3 = 98765

There are keys on the calculator for the mean

x̄ = (1/n) Σ_{j=1}^{n} x_j

and the standard deviation s, where

s^2 = (1/(n - 1)) Σ_{j=1}^{n} (x_j - x̄)^2

In our case, what should these equal? In fact, the
calculator gives

x̄ = 98765,   s = 1.58

Why?
2. A Fortran program example: Consider two pro-
grams run on a now extinct computer.

Program A:

A = 1.0 + 2.0**(-23)
B = A - 1.0
PRINT *, A, B
END

Output: 1.0   1.19E-7   (≈ 2^(-23))

Program B:

A = 1.0 + 2.0**(-23)
SILLY = 0.0
B = A - 1.0
PRINT *, A, B
END

Output: 1.0   0.0

Why the change, since presumably the variable
SILLY does not have any connection to B?
DECIMAL FLOATING-POINT NUMBERS

Floating point notation is akin to what is called sci-
entific notation in high school algebra. For a nonzero
number x, we can write it in the form

x = σ · x̄ · 10^e

with e an integer, 1 ≤ x̄ < 10, and σ = +1 or -1.
Thus

50/3 = (1.66666...)_10 × 10^1,   with σ = +1

On a decimal computer or calculator, we store x by
instead storing σ, x̄, and e. We must restrict the
number of digits in x̄ and the size of the exponent e.
For example, on an HP-15C calculator, the number of
digits kept in x̄ is 10, and the exponent is restricted
to -99 ≤ e ≤ 99.
BINARY FLOATING-POINT NUMBERS

We now do something similar with the binary repre-
sentation of a number x. Write

x = σ · x̄ · 2^e

with

1 ≤ x̄ < (10)_2 = 2

and e an integer. For example,

(.1)_10 = σ · (1.10011001100...)_2 × 2^(-4),   σ = +1

The number x is stored in the computer by storing the
σ, x̄, and e. On all computers, there are restrictions
on the number of digits in x̄ and the size of e.
FLOATING POINT NUMBERS

When a number x outside a computer or calculator
is converted into a machine number, we denote it by
fl(x). On an HP calculator,

fl(.3333...) = (3.333333333)_10 × 10^(-1)

The decimal fraction of infinite length will not fit in
the registers of the calculator, but the latter 10-digit
number will fit. Some calculators actually carry more
digits internally than they allow to be displayed.

On a binary computer, we use a similar notation. I will
concentrate on a particular form of computer float-
ing point number, called the IEEE floating point
standard.

In single precision, we write such a number as

fl(x) = σ · (1.a_1 a_2 ... a_23)_2 × 2^e

The significand x̄ = (1.a_1 a_2 ... a_23)_2 immediately sat-
isfies 1 ≤ x̄ < 2. What are the limits on e?

To understand the limits on e and the number of bi-
nary digits chosen for x̄, we must look roughly at how
the number x will be stored in the computer. Basi-
cally, we store σ as a single bit, the significand as 24
bits (only 23 need be stored), and the exponent fills
out 32 bits (= 4 bytes). Thus the exponent e occupies
8 bits, including both negative and positive integers.
Roughly speaking, we have that e must satisfy

-(1111111)_2 ≤ e ≤ (1111111)_2
-127 ≤ e ≤ 127
In actuality, the limits are

-126 ≤ e ≤ 127

for reasons related to the storage of 0 and other num-
bers such as ±∞.

What is the connection of the 24 bits in the significand
to the number of decimal digits in the storage of
a number x into floating point form? One way of
answering this is to find the integer M for which

1. 0 < x ≤ M and x an integer implies fl(x) = x; and
2. fl(M + 1) ≠ M + 1

This integer M is at least as big as

(1.11...1)_2 × 2^23 = 2^23 + ... + 2^0     (23 ones after the binary point)

This sums to 2^24 - 1. In addition, 2^24 = (1.0...0)_2 × 2^24
also stores exactly. What about 2^24 + 1? It does
not store exactly, as

2^24 + 1 = (1.0...01)_2 × 2^24     (23 zeros, then a final 1)

Storing this would require 25 bits, one more than al-
lowed. Thus

M = 2^24 = 16777216

This means that all 7 digit decimal integers store ex-
actly, along with a few 8 digit integers.
THE MACHINE EPSILON

Let y be the smallest number representable in the ma-
chine arithmetic that is greater than 1 in the machine.
The machine epsilon is ε = y - 1. It is a widely used
measure of the accuracy possible in representing num-
bers in the machine.

The number 1 has the simple floating point represen-
tation

1 = (1.00...0)_2 × 2^0

What is the smallest number that is greater than 1?
It is

1 + 2^(-23) = (1.0...01)_2 × 2^0 > 1

and the machine epsilon in IEEE single precision float-
ing point format is ε = 2^(-23) ≈ 1.19 × 10^(-7).
THE UNIT ROUND

Consider the smallest number δ > 0 that is repre-
sentable in the machine and for which

1 + δ > 1

in the arithmetic of the machine.

For any number 0 < α < δ, the result of 1 + α is
exactly 1 in the machine's arithmetic. Thus α drops
off the end of the floating point representation in the
machine. The size of δ is another way of describing
the accuracy attainable in the floating point represen-
tation of the machine. The machine epsilon has been
replacing it in recent years.

It is not too difficult to derive δ. The number 1 has
the simple floating point representation

1 = (1.00...0)_2 × 2^0

What is the smallest number which can be added to
this without disappearing? Certainly we can write

1 + 2^(-23) = (1.0...01)_2 × 2^0 > 1

Past this point, we need to know whether we are us-
ing chopped arithmetic or rounded arithmetic. We
will shortly look at both of these. With chopped
arithmetic, δ = 2^(-23); and with rounded arithmetic,
δ = 2^(-24).
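These quantities are easy to check numerically. The short MATLAB script below (our own; MATLAB works in IEEE double precision with rounding, so its eps is 2^(-52) rather than the single precision value above) illustrates both ideas:

eps                   % machine epsilon, 2^-52 in double precision
eps('single')         % 2^-23, the single precision value discussed above
1 + eps/2 == 1        % true: with rounding, eps/2 "drops off the end"
1 + eps > 1           % true: eps itself does change the sum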
ROUNDING AND CHOPPING

Let us first consider these concepts with decimal arith-
metic. We write a computer floating point number z
as

z = σ · (a_1.a_2 ... a_n)_10 × 10^e

with a_1 ≠ 0, so that there are n decimal digits in the
significand (a_1.a_2 ... a_n)_10.

Given a general number

x = σ · (a_1.a_2 ... a_n a_(n+1) ...)_10 × 10^e,   a_1 ≠ 0

we must shorten it to fit within the computer. This
is done by either chopping or rounding. The floating
point chopped version of x is given by

fl(x) = σ · (a_1.a_2 ... a_n)_10 × 10^e

where we assume that e fits within the bounds re-
quired by the computer or calculator.

For the rounded version, we must decide whether to
round up or round down. A simplified formula is

fl(x) = σ · (a_1.a_2 ... a_n)_10 × 10^e                      if a_(n+1) < 5
fl(x) = σ · [(a_1.a_2 ... a_n)_10 + (0.0...1)_10] × 10^e     if a_(n+1) ≥ 5

The term (0.0...1)_10 denotes 10^(-n+1), giving the or-
dinary sense of rounding with which you are familiar.
In the single case

(0.0...0 a_(n+1) a_(n+2) ...)_10 = (0.0...0500...)_10

a more elaborate procedure is used so as to assure an
unbiased rounding.
CHOPPING/ROUNDING IN BINARY

Let

x = σ · (1.a_2 ... a_n a_(n+1) ...)_2 × 2^e

with all a_i equal to 0 or 1. Then for a chopped floating
point representation, we have

fl(x) = σ · (1.a_2 ... a_n)_2 × 2^e

For a rounded floating point representation, we have

fl(x) = σ · (1.a_2 ... a_n)_2 × 2^e                      if a_(n+1) = 0
fl(x) = σ · [(1.a_2 ... a_n)_2 + (0.0...1)_2] × 2^e      if a_(n+1) = 1
ERRORS

The error x - fl(x) = 0 when x needs no change to be
put into the computer or calculator. Of more interest
is the case when the error is nonzero. Consider first
the case x > 0 (meaning σ = +1). The case with
x < 0 is the same, except for the sign being opposite.

With x ≠ fl(x), and using chopping, we have

fl(x) < x

and the error x - fl(x) is always positive. This later
has major consequences in extended numerical com-
putations. With x ≠ fl(x) and rounding, the error
x - fl(x) is negative for half the values of x, and it is
positive for the other half of possible values of x.

We often write the relative error as

ε = (fl(x) - x)/x

This can be rearranged to obtain

fl(x) = (1 + ε) x

Thus fl(x) can be considered as a perturbed value
of x. This is used in many analyses of the effects of
chopping and rounding errors in numerical computa-
tions.

For bounds on ε, we have

-2^(-n) ≤ ε ≤ 2^(-n)       rounding
-2^(-n+1) ≤ ε ≤ 0          chopping
IEEE ARITHMETIC

We are only giving the minimal characteristics of IEEE
arithmetic. There are many options available on the
types of arithmetic and the chopping/rounding. The
default arithmetic uses rounding.

Single precision arithmetic:

n = 24,   -126 ≤ e ≤ 127

This results in

M = 2^24 = 16777216
ε = 2^(-23) ≈ 1.19 × 10^(-7)

Double precision arithmetic:

n = 53,   -1022 ≤ e ≤ 1023

What are M and ε?

There is also an extended representation, having n =
69 digits in its significand.
NUMERICAL PRECISION IN MATLAB

MATLAB can be used to generate the binary float-


ing point representation of a number. Execute the
command
format hex
This will cause all subsequent numerical output to the
screen to be given in hexadecimal format (base 16).
For example, listing the number 7 results in an output
of
401c000000000000
The 16 hexadecimal digits are {0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
a, b, c, d, e, f }. To obtain the binary representation,
convert each hexadecimal digit to a four digit binary
number. For the above number, we obtain the binary
expansion

0100 0000 0001 1100 0000 . . . 0000


for the number 7 in IEEE double precision floating-
point format.
NUMERICAL PRECISION IN FORTRAN

In Fortran, variables take on default types if no explicit


typing is given. If a variable begins with I, J, K, L, M,
or N , then the default type is INTEGER. Otherwise,
the default type is REAL, or SINGLE PRECISION.
We have other variable types, including DOUBLE PRE-
CISION.

Redefining the default typing : Use the statement

IMPLICIT DOUBLE PRECISION(A-H,O-Z)


to change the original default, of REAL, to DOUBLE
PRECISION. You can always override the default typ-
ing with explicit typing. For example
DOUBLE PRECISION INTEGRAL, MEAN
INTEGER P, Q, TML OUT
FORTRAN CONSTANTS

If you want to have a constant be DOUBLE PRECI-


SION, you should make a habit of ending it with D0.
For example, consider
DOUBLE PRECISION PI
..
PI=3.14159265358979
This will be compiled in a way you did not intend. The
number will be rounded to single precision length and
then stored in a constant table. At run time, it will
be retrieved, zeros will be appended to extend it to
double precision length, and then it will be stored in
PI. Instead, write

PI=3.14159265358979D0
SOME DEFINITIONS

Let x_T denote the true value of some number, usually
unknown in practice; and let x_A denote an approxi-
mation of x_T.

The error in x_A is

error(x_A) = x_T - x_A

The relative error in x_A is

rel(x_A) = error(x_A)/x_T = (x_T - x_A)/x_T

Example: x_T = e, x_A = 19/7. Then

error(x_A) = e - 19/7 ≈ .003996
rel(x_A) ≈ .00147

We also speak of significant digits. We say x_A has m
significant digits with respect to x_T if the magnitude
of error(x_A) is at most 5 units in the (m + 1)st digit, be-
ginning with the first nonzero digit in x_T. Above, x_A
has 3 significant digits with respect to x_T.
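The numbers in this example can be reproduced with a couple of MATLAB lines (our own):

xT = exp(1);  xA = 19/7;
err = xT - xA          % about 3.996e-3
rel = err / xT         % about 1.47e-3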
SOURCES OF ERROR

This is a very rough categorization of the sources of


error in the calculation of the solution of a mathemat-
ical model for some physical situation.

(E1) Modelling Error: As an example, if a projectile
of mass m is travelling through the earth's atmosphere,
then a popular description of its motion is given by

m d^2 r(t)/dt^2 = -m g k - b dr(t)/dt

with b ≥ 0. In this, r(t) is the vector position of the
projectile; and the final term in the equation repre-
sents friction. If there is an error in this as a model of a
physical situation, then the numerical solution of this
equation is not going to improve the results.
(E2) Blunders: In the pre-computer era, these were
likely to be arithmetic errors. In the earlier years of the
computer era, the typical blunder was a programming
error. These were usually easy to find as they generally
resulted in absurd calculated answers.

Present day blunders are still often programming


errors. But now they are often much more difficult to
find, as they are often embedded in very large codes
which may mask their effect. Some simple rules:
(i) Break programs into small testable subprograms.
(ii) Run test cases for which you know the outcome.
(iii) When running the full code, maintain a skeptical
eye on the output, checking whether the output is
reasonable or not.
(E3) Observational Error: The radius of an electron
is given by

(2.81777 + ε) × 10^(-13) cm,   |ε| ≤ .00011

This error cannot be removed, and it must affect the
accuracy of any computation in which it is used. We
need to be aware of these effects and to so arrange
the computation as to minimize the effects.

(E4) Rounding/chopping Error: This is the main source


of many problems, especially problems in solving sys-
tems of linear equations. We later look at the effects
of such errors.
(E5) Approximation Error: This is also called dis-
cretization error and truncation error; and it is the
main source of error with which we deal in this course.
Such errors generally occur when we replace a compu-
tationally unsolvable problem with a nearby problem
that is more tractable computationally.

For example, the Taylor polynomial approximation

e^x ≈ 1 + x + (1/2) x^2

contains an approximation error.

The numerical integration

∫_0^1 f(x) dx ≈ (1/n) Σ_{j=1}^{n} f(j/n)

contains an approximation error.

Finally, the numerical differentiation formula

f'(x) ≈ [f(x + h) - f(x - h)] / (2h)

contains an approximation error.
LOSS OF SIGNIFICANCE ERRORS

This can be considered a source of error or a conse-


quence of the finiteness of calculator and computer
arithmetic. We begin with some illustrations.

Example. Define

f(x) = x [sqrt(x + 1) - sqrt(x)]

and consider evaluating it on a 6-digit decimal calcu-
lator which uses rounded arithmetic. The values of
f(x), taken from the text in Section 2.2:
x Computed f (x) True f (x)
1 .414210 .414214
10 1.54340 1.54347
100 4.99000 4.98756
1000 15.8000 15.8074
10000 50.0000 49.9988
100000 100.000 158.113

What happened?
Example. Define

f(x) = (1 - cos x)/x^2,   x ≠ 0

Values for a sequence of decreasing positive values
of x are given in Section 2.2 of the text, using a past
model of a popular calculator. The calculator carried
10 decimal digits, and it used rounded arithmetic.
x Computed f (x) True f (x)
0.1 0.4995834700 0.4995834722
0.01 0.4999960000 0.4999958333
0.001 0.5000000000 0.4999999583
0.0001 0.5000000000 0.4999999996
0.00001 0.0 0.5000000000
Consider one case, that of x = .001. Then on the
calculator:

cos(.001) = .9999994999
1 - cos(.001) = 5.001 × 10^(-7)
[1 - cos(.001)]/(.001)^2 = .5001000000

The true answer is f(.001) = .4999999583. The rel-
ative error in our answer is

(.4999999583 - .5001)/.4999999583 ≈ -.0001000417/.4999999583 ≈ -.0002

There are 3 significant digits in the answer. How can such
a straightforward and short calculation lead to such a
large error (relative to the accuracy of the calculator)?
When two numbers are nearly equal and we subtract
them, then we suffer a loss of significance error in
the calculation. In some cases, these can be quite
subtle and difficult to detect. And even after they are
detected, they may be difficult to fix.

The last example, fortunately, can be fixed in a num-
ber of ways. Easiest is to use a trigonometric identity:

cos(2θ) = 2 cos^2(θ) - 1 = 1 - 2 sin^2(θ)

Let x = 2θ. Then

f(x) = (1 - cos x)/x^2 = 2 sin^2(x/2)/x^2
     = (1/2) [sin(x/2)/(x/2)]^2

This latter formula, with x = .001, yields a computed
value of .4999999584, nearly the true answer. We
could also have used a Taylor polynomial for cos(x)
around x = 0 to obtain a better approximation to
f(x) for small values of x.
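The effect is easy to reproduce in MATLAB's double precision arithmetic; the short comparison below (our own script, with a smaller x so that double precision shows the same breakdown) contrasts the two formulas:

x = 1e-8;
f_naive  = (1 - cos(x)) / x^2            % ruined by cancellation in 1 - cos(x)
f_stable = 0.5 * (sin(x/2) / (x/2))^2    % close to the true value, about 0.5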
A MORE SUBTLE EXAMPLE

Evaluate e^(-5) using a Taylor polynomial approxima-
tion:

e^(-5) ≈ 1 + (-5)/1! + (-5)^2/2! + (-5)^3/3! + ... + (-5)^n/n!

With n = 25, the error is

|(-5)^26/26! · e^c| ≤ 10^(-8)

Imagine calculating this polynomial using a computer


with 4 digit decimal arithmetic and rounding. To
make the point about cancellation more strongly, imag-
ine that each of the terms in the above polynomial is
calculated exactly and then rounded to the arithmetic
of the computer. We add the terms exactly and then
we round to four digits.

See the table of results in Section 2.2 of the text. It


gives a result of 0.009989 whereas the correct answer
is 0.006738 to four significant digits.
To understand more fully the source of the error, look
at the numbers being added and their accuracy. For
example,

(-5)^3/3! = -125/6 ≈ -20.83

in the 4 digit decimal calculation, with an error of
magnitude 0.00333... Note that this error in an in-
termediate step is of the same magnitude as the true an-
swer 0.006738 being sought. Other similar errors
are present in calculating other coefficients, and thus
they cause a major error in the final answer being cal-
culated.

Whenever a sum is being formed in which the fi-


nal answer is much smaller than some of the terms
being combined, then a loss of significance error is
occurring.
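The text's table uses simulated 4-digit arithmetic, but the same phenomenon appears in double precision if the argument is made larger. The MATLAB fragment below (our own choice of x = -20) sums the series directly:

x = -20;  n = 100;                  % enough terms for the series to converge
terms = x.^(0:n) ./ factorial(0:n);
series_value = sum(terms)           % contaminated by cancellation
true_value   = exp(x)               % about 2.061e-9
largest_term = max(abs(terms))      % about 4.3e7; the source of the trouble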
NOISE IN FUNCTION EVALUATION

Consider using a 4-digit decimal calculator (with round-
ing) to evaluate the two functions

f_1(x) = x^2 - 3
f_2(x) = x^4 - 6 x^2 + 9 = [f_1(x)]^2

Note that f_2(x) = [f_1(x)]^2. On our calculator,

f_1(x) is  < 0  for 0 ≤ x ≤ 1.731
           = 0  for x = 1.732
           > 0  for x ≥ 1.733

However,

f_2(x) is  > 0  for 0 ≤ x ≤ 1.725
           = 0  for 1.726 ≤ x ≤ 1.738
           > 0  for x ≥ 1.739

Thus f_2(x) has 13 distinct zeros on this calculator,
whereas we know that f_2(x) has only the zeros ±sqrt(3)
≈ ±1.732. What happened in our calculations?
Whenever a function f(x) is evaluated, there are arith-
metic operations carried out which involve rounding or
chopping errors. This means that what the computer
eventually returns as an answer contains noise. This
noise is generally random and small. But it can af-
fect the accuracy of other calculations which depend
on f(x). For example, we illustrate the evaluation of

f(x) = -1 + 3x - 3x^2 + x^3
     = -1 + x(3 + x(-3 + x))

which is simply (x - 1)^3 and has only a single root
at x = 1. We use MATLAB with its IEEE double
precision arithmetic.
[Figure: computed values of f(x) = -1 + x(3 + x(-3 + x)) for x between 0.99998 and 1.00002; vertical scale of order 10^(-15)]
UNDERFLOW ERRORS

Consider evaluating

f(x) = x^10

for x near 0. When using IEEE single precision arith-
metic, the smallest nonzero positive number express-
ible in normalized floating-point format is

m = 2^(-126) ≈ 1.18 × 10^(-38);

see the table on IEEE single precision arithmetic with
E = 1 and (a_1 a_2 ... a_23)_2 = (00...0)_2. Thus f(x)
will be set to zero if

x^10 < m
|x| < m^(1/10) ≈ 1.61 × 10^(-4)
-0.000161 < x < 0.000161
OVERFLOW ERRORS

Attempts to use numbers that are too large for the
floating-point format will lead to overflow errors. These
are generally fatal errors on most computers. With
the IEEE floating-point format, overflow errors can
be carried along as having a value of ±∞ or NaN,
depending on the context. Usually an overflow error
is an indication of a more significant problem or error
in the program, and the user needs to be aware of such
errors.

When using IEEE single precision arithmetic, the largest
nonzero positive number expressible in normalized floating-
point format is

m = 2^128 (1 - 2^(-24)) ≈ 3.40 × 10^38;

see the table on IEEE single precision arithmetic with
E = (254)_10 and (a_1 a_2 ... a_23)_2 = (11...1)_2. Thus
f(x) will overflow if

x^10 > m
|x| > m^(1/10) ≈ 7131.6
PROPAGATION OF ERROR

Suppose we are evaluating a function f(x) in the ma-
chine. Then the result is generally not f(x), but rather
an approximation of it which we denote by f̃(x). Now
suppose that we have a number x_A ≈ x_T. We want
to calculate f(x_T), but instead we evaluate f̃(x_A).
What can we say about the error in this latter com-
puted quantity?

f(x_T) - f̃(x_A) = [f(x_T) - f(x_A)] + [f(x_A) - f̃(x_A)]

The quantity f(x_A) - f̃(x_A) is the noise in the
evaluation of f(x_A) in the computer, and we will
return later to some discussion of it. The quantity
f(x_T) - f(x_A) is called the propagated error; and it
is the error that results from using perfect arithmetic
in the evaluation of the function.

If the function f(x) is differentiable, then we can use
the mean-value theorem to write

f(x_T) - f(x_A) = f'(ξ)(x_T - x_A)

for some ξ between x_T and x_A.
Since usually x_T and x_A are close together, we can
say ξ is close to either of them, and

f(x_T) - f(x_A) ≈ f'(x_T)(x_T - x_A)     (*)


Example. Define

f(x) = b^x

where b is a positive real number. Then (*) yields

b^(x_T) - b^(x_A) ≈ (log b) b^(x_T) (x_T - x_A)

Rel(b^(x_A)) ≈ x_T (log b) (x_T - x_A)/x_T
            = x_T (log b) Rel(x_A)
            = K · Rel(x_A)

with K = x_T (log b). Note that if K = 10^4 and Rel(x_A) =
10^(-7), then Rel(b^(x_A)) ≈ 10^(-3). This is a large decrease
in accuracy; and it is independent of how we actually
calculate b^x. The number K is called a condition number
for the computation.
PROPAGATION IN
ARITHMETIC OPERATIONS

Let ω denote an arithmetic operation such as +, -, ×,
or /. Let ω̂ denote the same arithmetic operation
as it is actually carried out in the computer, including
rounding or chopping error. Let x_A ≈ x_T and y_A ≈
y_T. We want to obtain x_T ω y_T, but we actually obtain
x_A ω̂ y_A. The error in x_A ω̂ y_A is given by

x_T ω y_T - x_A ω̂ y_A = [x_T ω y_T - x_A ω y_A]
                      + [x_A ω y_A - x_A ω̂ y_A]

The final term is the error introduced by the inex-
actness of the machine arithmetic. For it, we usually
assume

x_A ω̂ y_A = fl(x_A ω y_A)

This means that the quantity x_A ω y_A is computed ex-
actly and is then rounded or chopped to fit the answer
into the floating point representation of the machine.
The formula

x_A ω̂ y_A = fl(x_A ω y_A)

implies

x_A ω̂ y_A = (1 + ε)(x_A ω y_A)     (**)

with the limits given earlier for ε. Manipulating (**),
we see that the relative error introduced by the machine
operation is of size |ε|. With rounded binary arithmetic
having n digits in the mantissa,

-2^(-n) ≤ ε ≤ 2^(-n)

The term

x_T ω y_T - x_A ω y_A

is the propagated error; and we now examine it for
particular cases.

Consider first ω = × (multiplication). Then for the rel-
ative error in x_A ω̂ y_A ≈ x_A y_A,

Rel(x_A y_A) = (x_T y_T - x_A y_A)/(x_T y_T)

Write

x_T = x_A + ε,   y_T = y_A + η

Then

Rel(x_A y_A) = (x_T y_T - x_A y_A)/(x_T y_T)
            = [x_T y_T - (x_T - ε)(y_T - η)]/(x_T y_T)
            = (ε y_T + η x_T - ε η)/(x_T y_T)
            = ε/x_T + η/y_T - (ε/x_T)(η/y_T)
            = Rel(x_A) + Rel(y_A) - Rel(x_A) Rel(y_A)

Since we usually have

|Rel(x_A)|, |Rel(y_A)| ≪ 1

the relation

Rel(x_A y_A) = Rel(x_A) + Rel(y_A) - Rel(x_A) Rel(y_A)

says

Rel(x_A y_A) ≈ Rel(x_A) + Rel(y_A)

Thus small relative errors in the arguments x_A and y_A
lead to a small relative error in the product x_A y_A.
Also, note that there is some cancellation if these rel-
ative errors are of opposite sign.

There is a similar result for division:

Rel(x_A/y_A) ≈ Rel(x_A) - Rel(y_A)

provided

|Rel(y_A)| ≪ 1
ADDITION AND SUBTRACTION

For ω equal to - or +, we have

[x_T ω y_T] - [x_A ω y_A] = [x_T - x_A] ω [y_T - y_A]

Thus the error in a sum is the sum of the errors in
the original arguments, and similarly for subtraction.
However, there is a more subtle error occurring here.

Suppose you are solving

x^2 - 26x + 1 = 0

Using the quadratic formula, we have the true answers

r_T^(1) = 13 + sqrt(168),   r_T^(2) = 13 - sqrt(168)

From a table of square roots, we take

sqrt(168) ≈ 12.961

Since this is correctly rounded to 5 digits, we have

|sqrt(168) - 12.961| ≤ .0005

Then define

r_A^(1) = 13 + 12.961 = 25.961,   r_A^(2) = 13 - 12.961 = .039

Then for both roots,

|r_T - r_A| ≤ .0005

For the relative errors, however,

|Rel(r_A^(1))| ≤ .0005/25.9605 ≈ 1.93 × 10^(-5)
|Rel(r_A^(2))| ≤ .0005/.0385 ≈ .0130

Why does r_A^(2) have such poor accuracy in comparison
to r_A^(1)?

The answer is due to the loss of significance error
involved in the formula for calculating r_A^(2). Instead,
use the mathematically equivalent formula

r_T^(2) = 1/(13 + sqrt(168)) ≈ 1/25.961

This results in a much more accurate answer, at the
expense of an additional division.
SUMMATION

How should we compute a sum

S = a_1 + a_2 + ... + a_n

of machine numbers {a_1, ..., a_n}? Should
we add from largest to smallest, should we add from
smallest to largest, or should we just add the numbers
based on their original given order? In other words,
does it matter how we calculate the sum?

Recall the relationship between a number x and its
machine approximation fl(x):

fl(x) = (1 + ε) x

For bounds on ε for a binary floating point represen-
tation with N binary digits in the mantissa, we have

-2^(-N) ≤ ε ≤ 2^(-N)       rounding
-2^(-N+1) ≤ ε ≤ 0          chopping

We use these results as tools for analyzing the error
in computing the sum S.
We create the sum S by a sequence of simple additions.
Define

S_2 = fl(a_1 + a_2) = (1 + ε_2)(a_1 + a_2)
S_3 = fl(S_2 + a_3) = (1 + ε_3)(S_2 + a_3)
S_4 = fl(S_3 + a_4) = (1 + ε_4)(S_3 + a_4)
...
S_n = fl(S_(n-1) + a_n) = (1 + ε_n)(S_(n-1) + a_n)

This says each simple addition is performed exactly,
following which it is rounded or chopped back to the
precision of the machine. All of the numbers ε_j satisfy
the inequalities given earlier.

We now combine the above to obtain a formula that
is simpler to work with. In particular, we find that

S - S_n ≈ -a_1 (ε_2 + ... + ε_n)
         - a_2 (ε_2 + ... + ε_n)
         - a_3 (ε_3 + ... + ε_n)
         - a_4 (ε_4 + ... + ε_n)
         ...
         - a_n ε_n

In obtaining this, we have neglected all terms contain-
ing products ε_i ε_j, as these are generally much smaller
than the remaining terms.

For example,

S_2 = (a_1 + a_2) + ε_2 (a_1 + a_2)
S_3 = (1 + ε_3)[a_3 + (1 + ε_2)(a_1 + a_2)]
    = (a_1 + a_2 + a_3) + (a_1 + a_2)(ε_2 + ε_3)
      + a_3 ε_3 + ε_2 ε_3 (a_1 + a_2)
    ≈ (a_1 + a_2 + a_3) + (a_1 + a_2)(ε_2 + ε_3) + a_3 ε_3

Continue in this manner to get

S_n ≈ (a_1 + a_2 + ... + a_n)
      + (a_1 + a_2)(ε_2 + ε_3 + ... + ε_n)
      + a_3 (ε_3 + ... + ε_n)
      + a_4 (ε_4 + ... + ε_n)
      + ... + a_n ε_n

Using this yields the formula given earlier for S - S_n.
Consider now the formula

S - S_n ≈ -a_1 (ε_2 + ... + ε_n)
         - a_2 (ε_2 + ... + ε_n)
         - a_3 (ε_3 + ... + ε_n)
         - a_4 (ε_4 + ... + ε_n)
         ...
         - a_n ε_n

and what it suggests as a method for adding the num-
bers a_1, ..., a_n.

Since a_1 and a_2 have the largest number of quantities
ε_j multiplied times them, it makes sense to have a_1
and a_2 be the smallest numbers in magnitude. We can
continue in this vein to motivate ordering the numbers
to satisfy

|a_1| ≤ |a_2| ≤ ... ≤ |a_n|

This is not an optimal method, but it generally is
superior to using a random ordering or to adding from
the largest term to the smallest.
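An experiment along these lines is easy to run in MATLAB; in the sketch below (our own script), single precision plays the role of the short decimal arithmetic in the text's example:

n = 1000;
a = single(1 ./ (1:n));         % machine versions of 1/j
S_true = sum(1 ./ (1:n));       % double precision reference value
S_SL = sum(a(end:-1:1));        % add from smallest to largest
S_LS = sum(a);                  % add from largest to smallest
[double(S_SL) - S_true, double(S_LS) - S_true]   % compare the two errors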
EXAMPLE

An example is given in the text for

Σ_{j=1}^{n} a_j

with

a_j = fl(1/j),   j ≥ 1

The numbers a_j are obtained by rounding to 4 signif-
icant decimal digits the numbers 1/j. Then two dif-
ferent decimal arithmetics are used, along with adding
from smallest to largest (SL), and from largest to
smallest (LS). The results are given in Section 2.4.

For n = 1000, we have the following results:

Chopped arithmetic, smallest to largest: Error=.202


Chopped arithmetic, largest to smallest: Error=.417
Rounded arithmetic, smallest to largest: Error=0
Rounded arithmetic, largest to smallest: Error=.037

Why is it so much larger with chopped arithmetic?


Using chopped 4-digit decimal arithmetic
n True SL Error LS Error
10 2.929 2.928 0.001 2.927 0.002
25 3.816 3.813 0.003 3.806 0.010
50 4.499 4.491 0.008 4.479 0.020
100 5.187 5.170 0.017 5.142 0.045
200 5.878 5.841 0.037 5.786 0.092
500 6.793 6.692 0.101 6.569 0.224
1000 7.486 7.284 0.202 7.069 0.417

Using rounded 4-digit decimal arithmetic


n True SL Error LS Error
10 2.929 2.929 0 2.929 0
25 3.816 3.816 0 3.817 0.001
50 4.499 4.500 0.001 4.498 0.001
100 5.187 5.187 0 5.187 0
200 5.878 5.878 0 5.876 0.002
500 6.793 6.794 0.001 6.783 0.010
1000 7.486 7.486 0 7.449 0.037
Recall the formula

S_n ≈ (a_1 + a_2 + ... + a_n)
      + (a_1 + a_2)(ε_2 + ε_3 + ... + ε_n)
      + a_3 (ε_3 + ... + ε_n)
      + a_4 (ε_4 + ... + ε_n)
      + ... + a_n ε_n

and examine one of the quantities on the right, as an
example of the general case. The first term contains
the sum

ε_2 + ε_3 + ... + ε_n

Write this as

(n - 1) · [1/(n - 1)] Σ_{j=2}^{n} ε_j ≡ (n - 1) ε̄

In the case of rounding, the mean ε̄ is approximately
zero, because the negative and positive values of ε_j
are likely to cancel when large numbers of such ε_j are
being summed; and thus (n - 1) ε̄ is usually quite
small. But for the case of chopping, ε̄ is approxi-
mately -2^(-N), and therefore

(n - 1) ε̄ ≈ -(n - 1) 2^(-N)

Thus (n - 1) ε̄ will grow in a manner proportional
to n. By more advanced arguments, it can be shown
that for the case of rounded arithmetic, (n - 1) ε̄ will
grow in a manner proportional to sqrt(n), which grows
much slower than n.
EXTENDED PRECISION ARITHMETIC

Sometimes a limited use of a higher precision arith-


metic will greatly cut the amount of rounding error in
a calculation. Consider as an important example the
calculation of the dot product or inner product
n
X
S= aj bj
j=1
with the numbers aj , bj all machine numbers.

Imagine calculating this in single precision arithmetic.


Then there will be approximately n rounding errors
from the multiplications and n 1 rounding errors
from the additions. To cut this, imagine extending
each of the numbers a_j, b_j to double precision by ap-
pending sufficient zeros to them. Then multiply and
add them in double precision; and when completed,
round the answer back to single precision. This re-
places 2n - 1 single precision rounding errors with 1
such rounding error, a significant improvement.
What do you do if you are already in double preci-
sion? With the IEEE standard, you can use extended
precision arithmetic, which gives 16 binary digits of
extra accuracy. It must be accessed by using special
routines, but it allows calculation of much more accu-
rate inner products at a minimal increase in operation
time.

In linear algebra problems, rounding errors in many-


term summations, including inner products, are the
principal source of error. We will discuss this further
in the chapter on numerical linear algebra.

In MATLAB, all arithmetic is in double precision, and


extended IEEE arithmetic is not available. Thus we
illustrate these ideas with only a Fortran program.
REAL FUNCTION SUMPRD(A,B,N)
C
C THIS CALCULATES THE INNER PRODUCT
C
C I=N
C SUMPRD = SUM A(I)*B(I)
C I=1
C
C THE PRODUCTS AND SUMS ARE DONE IN DOUBLE
C PRECISION, AND THE FINAL RESULT IS
C CONVERTED BACK TO SINGLE PRECISION.
C
REAL A(*), B(*)
DOUBLE PRECISION DSUM
C
DSUM = 0.0D0
DO I=1,N
DSUM = DSUM + DBLE(A(I))*DBLE(B(I))
END DO
SUMPRD = DSUM
RETURN
END
ROOTFINDING

We want to find the numbers x for which f(x) = 0,
with f a given function. Here, we denote such roots
or zeroes by the Greek letter α. Rootfinding problems
occur in many contexts. Sometimes they are a direct
formulation of some physical situation; but more often,
they are an intermediate step in solving a much larger
problem.

An example with annuities. Suppose you are paying
into an account an amount of P_in per period of time,
for N_in periods of time. The amount you have de-
posited is compounded at an interest rate of r per pe-
riod of time. Then at the beginning of period N_in + 1,
you will withdraw an amount of P_out per time period,
for N_out periods. In order that the amount you with-
draw balance that which has been deposited, including
interest, what is the needed interest rate? The equa-
tion is

P_in [(1 + r)^(N_in) - 1] = P_out [1 - (1 + r)^(-N_out)]

We assume the interest rate r holds over all N_in + N_out
periods.
As a particular case, suppose you are paying in P_in =
$1,000 each month for 40 years. Then you wish to
withdraw P_out = $5,000 per month for 20 years.
What interest rate do you need? If the interest rate
is R per year, compounded monthly, then r = R/12.
Also, N_in = 40 × 12 = 480 and N_out = 20 × 12 = 240.
Thus we wish to solve

1000 [(1 + R/12)^480 - 1] = 5000 [1 - (1 + R/12)^(-240)]

What is the needed yearly interest rate R? The answer
is 2.92%. How did I obtain this answer?

This example also shows the power of compound in-


terest.
THE BISECTION METHOD

Most methods for solving f (x) = 0 are iterative meth-


ods. We begin with the simplest of such methods, one
which most people use at some time.

We assume we are given a function f (x); and in addi-


tion, we assume we have an interval [a, b] containing
the root, on which the function is continuous. We
also assume we are given an error tolerance ε > 0,
and we want an approximate root α̃ in [a, b] for which

|α - α̃| ≤ ε

We further assume the function f(x) changes sign on
[a, b], with

f(a) f(b) < 0
Algorithm Bisect(f, a, b, ε).

Step 1: Define c = (a + b)/2.

Step 2: If b - c ≤ ε, accept c as our root, and then
stop.

Step 3: If b - c > ε, then compare the sign of
f(c) to that of f(a) and f(b). If

sign(f(b)) · sign(f(c)) ≤ 0

then replace a with c; and otherwise, replace b with
c. Return to Step 1.

Denote the initial interval by [a_1, b_1], and denote each
successive interval by [a_j, b_j]. Let c_j denote the center
of [a_j, b_j]. Then

|α - c_j| ≤ b_j - c_j = c_j - a_j = (1/2)(b_j - a_j)

Since each interval decreases by half from the preced-
ing one, we have by induction

|α - c_n| ≤ (1/2)^n (b_1 - a_1)
EXAMPLE Find the largest root of

f(x) ≡ x^6 - x - 1 = 0

accurate to within ε = 0.001. With a graph, it is easy
to check that 1 < α < 2. We choose a = 1, b =
2; then f(a) = -1, f(b) = 61, and the requirement
f(a) f(b) < 0 is satisfied. The results from Bisect are
shown in the table. The entry n indicates the iteration
number n.
n    a        b        c        b - c     f(c)
1    1.0000   2.0000   1.5000   0.5000     8.8906
2    1.0000   1.5000   1.2500   0.2500     1.5647
3    1.0000   1.2500   1.1250   0.1250    -0.0977
4    1.1250   1.2500   1.1875   0.0625     0.6167
5    1.1250   1.1875   1.1562   0.0312     0.2333
6    1.1250   1.1562   1.1406   0.0156     0.0616
7    1.1250   1.1406   1.1328   0.0078    -0.0196
8    1.1328   1.1406   1.1367   0.0039     0.0206
9    1.1328   1.1367   1.1348   0.0020     0.0004
10   1.1328   1.1348   1.1338   0.00098   -0.0096
Recall the original example with the function

f(r) = P_in [(1 + r)^(N_in) - 1] - P_out [1 - (1 + r)^(-N_out)]

Checking, we see that f(0) = 0. Therefore, with a
graph of y = f(r) on [0, 1], we see that f(r) < 0 if
we choose r very small, say r = .001. Also f(1) > 0.
Thus we choose [a, b] = [.001, 1]. Using ε = .000001
yields the answer

α̃ = .02918243

with an error bound of

|α - c_n| ≤ 9.53 × 10^(-7)

for n = 20 iterates. We could also have calculated
this error bound from

(1/2^20)(1 - .001) = 9.53 × 10^(-7)
Suppose we are given the initial interval [a, b] = [1.6, 4.5],
with ε = .00005. How large need n be in order to have

|α - c_n| ≤ ε ?

Recall that

|α - c_n| ≤ (1/2)^n (b - a)

Then we ensure the error bound by requiring

(1/2)^n (b - a) ≤ ε
(1/2)^n (4.5 - 1.6) ≤ .00005

Dividing and solving for n, we have

n ≥ log_2(2.9/.00005) ≈ 15.82

Therefore, we need to take n = 16 iterates.
ADVANTAGES AND DISADVANTAGES

Advantages: 1. It always converges.


2. You have a guaranteed error bound, and it de-
creases with each successive iteration.
3. You have a guaranteed rate of convergence. The
error bound decreases by 1/2 with each iteration.

Disadvantages: 1. It is relatively slow when com-
pared with other rootfinding methods we will study,
especially when the function f(x) has several contin-
uous derivatives about the root α.
2. The algorithm has no check to see whether the ε
is too small for the computer arithmetic being used.
[This is easily fixed by reference to the machine epsilon
of the computer arithmetic.]

We also assume the function f(x) is continuous on
the given interval [a, b]; but there is no way for the
computer to confirm this.
ROOTFINDING : A PRINCIPLE

We want to find the root of a given function f (x).


Thus we want to find the point x at which the graph of
y = f (x) intersects the x-axis. One of the principles
of numerical analysis is the following.

If you cannot solve the given problem, then solve a


nearby problem.

How do we obtain a nearby problem for f (x) = 0?


Begin first by asking for types of problems which we
can solve easily. At the top of the list should be that
of finding where a straight line intersects the x-axis.
Thus we seek to replace f(x) = 0 by the problem of solv-
ing p(x) = 0 for some linear polynomial p(x) that
approximates f(x) in the vicinity of the root α.
[Figure: the tangent line to y = f(x) at (x_0, f(x_0)) crosses the x-axis at x_1]

Given an estimate of α, say x_0, approximate f(x)
by its linear Taylor polynomial at (x_0, f(x_0)):

p(x) = f(x_0) + (x - x_0) f'(x_0)

If x_0 is very close to α, then the root of p(x) should
be close to α. Denote this approximating root by x_1;
repeat the process to further improve our estimate of
α.
To illustrate this procedure, we consider a well-known
example. For a number b > 0, consider solving the
equation

f(x) ≡ b - 1/x = 0

The solution is, of course, α = 1/b. Nonetheless, bear
with the example as it has some practical application.
[Figure: y = b - 1/x with the tangent line at (x_0, f(x_0)); the tangent crosses the x-axis at x_1, and the root is at 1/b]
Let x_0 be an estimate of the root α = 1/b. Then the
line tangent to the graph of y = f(x) at (x_0, f(x_0))
is given by

p(x) = f(x_0) + (x - x_0) f'(x_0)

with

f'(x) = 1/x^2

Denoting the root of p(x) = 0 by x_1, we solve for x_1
in

f(x_0) + (x_1 - x_0) f'(x_0) = 0
x_1 = x_0 - f(x_0)/f'(x_0)

For our particular case, this yields

x_1 = x_0 - (b - 1/x_0)/(1/x_0^2) = x_0 - b x_0^2 + x_0

or

x_1 = x_0 (2 - b x_0)

Note that no division is used in our final formula.

If we repeat the process, now using x_1 as our initial
estimate of α, then we obtain a sequence of numbers
x_1, x_2, ...

x_(n+1) = x_n (2 - b x_n),   n = 0, 1, 2, ...

Do these numbers x_n converge to the root α? We
return to this after a bit.

This algorithm has been used in a practical way in a


number of circumstances.
The general Newton's method for solving f(x) = 0 is
derived exactly as above. The result is a sequence of
numbers x_0, x_1, x_2, ... defined by

x_(n+1) = x_n - f(x_n)/f'(x_n),   n = 0, 1, 2, ...

Again, we want to know whether these numbers con-
verge to the desired root α; and we would also like
to know something about the speed of convergence
(which says something about how many such iterates
must actually be computed).

Return to the iteration

x_(n+1) = x_n (2 - b x_n),   n = 0, 1, 2, ...

for solving

f(x) ≡ b - 1/x = 0

We use a method of analysis which works for only
this example, and later we use another approach for
the general Newton's method.
Write

x_(n+1) = x_n (1 + r_n),   r_n = 1 - b x_n

Note that the error and relative error in x_n are given
by

e_n = α - x_n = 1/b - x_n = r_n/b
rel(x_n) = e_n/α = b e_n = r_n

Thus r_n is the relative error in x_n, and x_n converges
to α if and only if r_n tends to zero.

We find a recursion formula for r_n, recalling that r_n =
1 - b x_n for all n. Then

r_(n+1) = 1 - b x_(n+1)
        = 1 - b x_n (1 + r_n)
        = 1 - (1 - r_n)(1 + r_n)
        = 1 - (1 - r_n^2) = r_n^2

Thus

r_(n+1) = r_n^2

for every integer n ≥ 0. Thus

r_1 = r_0^2,   r_2 = r_1^2 = r_0^4,   r_3 = r_2^2 = r_0^8

By induction, we obtain

r_n = r_0^(2^n),   n = 0, 1, 2, 3, ...
We can use this to analyze the convergence of

x_(n+1) = x_n (1 + r_n),   r_n = 1 - b x_n

In particular, we have r_n → 0 if and only if

|r_0| < 1

This is equivalent to saying

-1 < 1 - b x_0 < 1
0 < x_0 < 2/b

A look at a graph of f(x) ≡ b - 1/x will show the
reason for this condition. If x_0 is chosen greater than
2/b, then x_1 will be negative, which is unacceptable.
The interval

0 < x_0 < 2/b

is called the interval of convergence. With most
equations, we cannot find this exactly, but rather only
some smaller subinterval which guarantees convergence.

Using r_(n+1) = r_n^2 and r_n = b e_n, we have

b e_(n+1) = (b e_n)^2
e_(n+1) = b e_n^2
α - x_(n+1) = b (α - x_n)^2

Methods with this type of error behaviour are said to
be quadratically convergent; and this is an especially
desirable behaviour.

To see why, consider the relative errors in the above.
Assume the initial guess x_0 has been so chosen that
r_0 = .1. Then

r_1 = 10^(-2),   r_2 = 10^(-4),   r_3 = 10^(-8),   r_4 = 10^(-16)

Thus very few iterates need be computed.

The iteration

x_(n+1) = x_n (1 + r_n),   r_n = 1 - b x_n

has been used on a number of machines as a means
of doing division, of calculating 1/b.
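A few MATLAB lines (our own script, with b = 3 as an arbitrary choice) show the quadratic convergence of this division-free iteration:

b = 3;
x = 0.3;                       % initial guess in the interval (0, 2/b)
for n = 1:5
    x = x * (2 - b*x);         % no division anywhere
    fprintf('n = %d   x = %.16f   r_n = %.3e\n', n, x, 1 - b*x)
end
% the relative error r_n = 1 - b*x_n is squared at every step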
NEWTON'S METHOD

For a general equation f(x) = 0, we assume we are
given an initial estimate x_0 of the root α. The iterates
are generated by the formula

x_(n+1) = x_n - f(x_n)/f'(x_n),   n = 0, 1, 2, ...

EXAMPLE Consider solving

f(x) ≡ x^6 - x - 1 = 0

for its positive root α. An initial guess x_0 can be
generated from a graph of y = f(x). The iteration is
given by

x_(n+1) = x_n - (x_n^6 - x_n - 1)/(6 x_n^5 - 1),   n ≥ 0

We use an initial guess of x_0 = 1.5.
The column x_n - x_(n-1) is an estimate of the error
α - x_(n-1); justification is given later.

n   x_n          f(x_n)     x_n - x_(n-1)   α - x_(n-1)
0   1.5          8.89E+0
1   1.30049088   2.54E+0    -2.00E-1        -3.65E-1
2   1.18148042   5.38E-1    -1.19E-1        -1.66E-1
3   1.13945559   4.92E-2    -4.20E-2        -4.68E-2
4   1.13477763   5.50E-4    -4.68E-3        -4.73E-3
5   1.13472415   7.11E-8    -5.35E-5        -5.35E-5
6   1.13472414   1.55E-15   -6.91E-9        -6.91E-9

As seen from the output, the convergence is very


rapid. The iterate x6 is accurate to the machine pre-
cision of around 16 decimal digits. This is the typical
behaviour seen with Newton's method for most prob-
lems, but not all.
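A general MATLAB sketch of Newton's method (our own function name; f and its derivative df are supplied as function handles) is:

function x = newton(f, df, x0, tol, nmax)
% NEWTON  Newton's method for f(x) = 0, starting from x0.
x = x0;
for n = 1:nmax
    dx = f(x) / df(x);
    x = x - dx;
    if abs(dx) <= tol          % x_n - x_{n-1} estimates the error
        return
    end
end
end

The table above can be regenerated with a call such as newton(@(x) x^6 - x - 1, @(x) 6*x^5 - 1, 1.5, 1e-12, 20).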
We could also have considered the problem of solving
the annuity equation

f(x) ≡ 1000 [(1 + x/12)^480 - 1] - 5000 [1 - (1 + x/12)^(-240)] = 0

However, it turns out that you have to be very close to
the root in this case in order to get good convergence.
This phenomenon is discussed further at a later time;
and the bisection method is preferable in this instance.
AN ERROR FORMULA

Suppose we use Taylor's formula to expand f(α) about
x = x_n. Then we have

f(α) = f(x_n) + (α - x_n) f'(x_n) + (1/2)(α - x_n)^2 f''(c_n)

for some c_n between α and x_n. Note that f(α) = 0.
Then divide both sides of this equation by f'(x_n),
yielding

0 = f(x_n)/f'(x_n) + (α - x_n) + (α - x_n)^2 f''(c_n)/(2 f'(x_n))

Note that

x_n - f(x_n)/f'(x_n) = x_(n+1)

and thus

α - x_(n+1) = -(α - x_n)^2 f''(c_n)/(2 f'(x_n))

For x_n close to α, and therefore c_n also close to α,
we have

α - x_(n+1) ≈ -(α - x_n)^2 f''(α)/(2 f'(α))

Thus Newton's method is quadratically convergent,
provided f'(α) ≠ 0 and f(x) is twice differentiable in
the vicinity of the root α.

We can also use this to explore the interval of con-
vergence of Newton's method. Write the above as

α - x_(n+1) ≈ -M (α - x_n)^2,   M = f''(α)/(2 f'(α))

Multiply both sides by M and take magnitudes to get

|M (α - x_(n+1))| ≈ |M (α - x_n)|^2

Then we want these quantities to decrease; and this
suggests choosing x_0 so that

|M (α - x_0)| < 1
|α - x_0| < 1/|M| = |2 f'(α)/f''(α)|

If |M| is very large, then we may need to have a very
good initial guess in order to have the iterates x_n
converge to α.
ADVANTAGES & DISADVANTAGES

Advantages: 1. It is rapidly convergent in most cases.


2. It is simple in its formulation, and therefore rela-
tively easy to apply and program.
3. It is intuitive in its construction. This means it is
easier to understand its behaviour, when it is likely to
behave well and when it may behave poorly.

Disadvantages: 1. It may not converge.


2. It is likely to have difficulty if f'(α) = 0. This
condition means the x-axis is tangent to the graph of
y = f(x) at x = α.
3. It needs to know both f(x) and f'(x). Contrast
this with the bisection method which requires only
f(x).
THE SECANT METHOD

Newton's method was based on using the line tangent
to the curve of y = f(x), with the point of tangency
(x_0, f(x_0)). When x_0 ≈ α, the graph of the tangent
line is approximately the same as the graph of y =
f(x) around x = α. We then used the root of the
tangent line to approximate α.

Consider using an approximating line based on inter-
polation. We assume we have two estimates of the
root α, say x_0 and x_1. Then we produce a linear
function

q(x) = a_0 + a_1 x

with

q(x_0) = f(x_0),   q(x_1) = f(x_1)     (*)

This line is sometimes called a secant line. Its equa-
tion is given by

q(x) = [(x_1 - x) f(x_0) + (x - x_0) f(x_1)] / (x_1 - x_0)
[Figures: two configurations of the secant line through (x_0, f(x_0)) and (x_1, f(x_1)); its root x_2 approximates α]
q(x) = [(x_1 - x) f(x_0) + (x - x_0) f(x_1)] / (x_1 - x_0)

This is linear in x; and by direct evaluation, it satis-
fies the interpolation conditions of (*). We now solve
the equation q(x) = 0, denoting the root by x_2. This
yields

x_2 = x_1 - f(x_1) (x_1 - x_0)/(f(x_1) - f(x_0))

We can now repeat the process. Use x_1 and x_2 to
produce another secant line, and then use its root
to approximate α. This yields the general iteration
formula

x_(n+1) = x_n - f(x_n) (x_n - x_(n-1))/(f(x_n) - f(x_(n-1))),   n = 1, 2, 3, ...

This is called the secant method for solving f(x) = 0.
Example. We solve the equation

f(x) ≡ x^6 - x - 1 = 0

which was used previously as an example for both the
bisection and Newton methods. The quantity x_n -
x_(n-1) is used as an estimate of α - x_(n-1). The iterate
x_8 equals α rounded to nine significant digits. As with
Newton's method for this equation, the initial iterates
do not converge rapidly. But as the iterates become
closer to α, the speed of convergence increases.

n   x_n          f(x_n)     x_n - x_(n-1)   α - x_(n-1)
0   2.0          61.0
1   1.0         -1.0        -1.0
2   1.01612903  -9.15E-1     1.61E-2         1.35E-1
3   1.19057777   6.57E-1     1.74E-1         1.19E-1
4   1.11765583  -1.68E-1    -7.29E-2        -5.59E-2
5   1.13253155  -2.24E-2     1.49E-2         1.71E-2
6   1.13481681   9.54E-4     2.29E-3         2.19E-3
7   1.13472365  -5.07E-6    -9.32E-5        -9.27E-5
8   1.13472414   1.13E-9     4.92E-7         4.92E-7
It is clear from the numerical results that the se-
cant method requires more iterates than the New-
ton method. But note that the secant method does
not require a knowledge of f'(x), whereas Newton's
method requires both f(x) and f'(x).

Note also that the secant method can be considered
an approximation of the Newton method

x_(n+1) = x_n - f(x_n)/f'(x_n)

by using the approximation

f'(x_n) ≈ [f(x_n) - f(x_(n-1))] / (x_n - x_(n-1))
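A matching MATLAB sketch of the secant method (our own function name; two starting values are required) is:

function x1 = secant(f, x0, x1, tol, nmax)
% SECANT  Secant method for f(x) = 0, with starting values x0 and x1.
f0 = f(x0);
for n = 1:nmax
    f1 = f(x1);
    dx = f1 * (x1 - x0) / (f1 - f0);
    x0 = x1;  f0 = f1;         % shift: only one new f evaluation per step
    x1 = x1 - dx;
    if abs(dx) <= tol          % x_{n+1} - x_n estimates the error
        return
    end
end
end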
CONVERGENCE ANALYSIS

With a combination of algebraic manipulation and the
mean-value theorem from calculus, we can show

α - x_(n+1) = -(α - x_n)(α - x_(n-1)) f''(ξ_n)/(2 f'(ζ_n))     (**)

with ξ_n and ζ_n unknown points. The point ξ_n is lo-
cated between the minimum and maximum of x_(n-1), x_n,
and α; and ζ_n is located between the minimum and
maximum of x_(n-1) and x_n. Recall for Newton's method
that the Newton iterates satisfied

α - x_(n+1) = -(α - x_n)^2 f''(c_n)/(2 f'(x_n))

which closely resembles (**) above.

Using (**), it can be shown that x_n converges to α,
and moreover,

lim_(n→∞) |α - x_(n+1)| / |α - x_n|^r = |f''(α)/(2 f'(α))|^(r-1) ≡ c

where r = (1 + sqrt(5))/2 ≈ 1.62. This assumes that x_0
and x_1 are chosen sufficiently close to α; and how
close this is will vary with the function f. In addition,
the above result assumes f(x) has two continuous
derivatives for all x in some interval about α.

The above says that when we are close to α,

|α - x_(n+1)| ≈ c |α - x_n|^r

This looks very much like the Newton result

|α - x_(n+1)| ≈ M |α - x_n|^2,   M = |f''(α)/(2 f'(α))|

and c = M^(r-1). Both the secant and Newton meth-
ods converge at faster than a linear rate, and they are
called superlinear methods.

The secant method converges more slowly than Newton's
method; but it is still quite rapid. It is rapid enough
that we can prove

lim_(n→∞) |x_(n+1) - x_n| / |α - x_n| = 1

and therefore,

|α - x_n| ≈ |x_(n+1) - x_n|

is a good error estimator.

A note of warning: Do not combine the secant for-


mula and write it in the form
x_(n+1) = [f(x_n) x_(n-1) - f(x_(n-1)) x_n] / [f(x_n) - f(x_(n-1))]
This has enormous loss of significance errors as com-
pared with the earlier formulation.
COSTS OF SECANT & NEWTON METHODS

The Newton method

x_(n+1) = x_n - f(x_n)/f'(x_n),   n = 0, 1, 2, ...

requires two function evaluations per iteration, that
of f(x_n) and f'(x_n). The secant method

x_(n+1) = x_n - f(x_n) (x_n - x_(n-1))/(f(x_n) - f(x_(n-1))),   n = 1, 2, 3, ...

requires one function evaluation per iteration, following
the initial step.

For this reason, the secant method is often faster in
time, even though more iterates are needed with it
than with Newton's method to attain a similar accu-
racy.
ADVANTAGES & DISADVANTAGES

Advantages of the secant method:
1. It converges at faster than a linear rate, so that it is more rapidly convergent than the bisection method.
2. It does not require use of the derivative of the function, something that is not available in a number of applications.
3. It requires only one function evaluation per iteration, as compared with Newton's method which requires two.

Disadvantages of the secant method:
1. It may not converge.
2. There is no guaranteed error bound for the computed iterates.
3. It is likely to have difficulty if f'(α) = 0. This means the x-axis is tangent to the graph of y = f(x) at x = α.
4. Newton's method generalizes more easily to new methods for solving simultaneous systems of nonlinear equations.
BRENT'S METHOD

Richard Brent devised a method combining the advantages of the bisection method and the secant method.

1. It is guaranteed to converge.
2. It has an error bound which will converge to zero in practice.
3. For most problems f(x) = 0, with f(x) differentiable about the root α, the method behaves like the secant method.
4. In the worst case, it is not too much worse in its convergence than the bisection method.

In Matlab, it is implemented as fzero; and it is present in most Fortran numerical analysis libraries.
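For example (a usage sketch, not from the text), the earlier equation x^6 − x − 1 = 0 can be solved with

alpha = fzero(@(x) x^6 - x - 1, [1, 2])   % bracketing interval on which f changes sign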
FIXED POINT ITERATION

We begin with a computational example. Consider


solving the two equations
E1: x = 1 + .5 sin x
E2: x = 3 + 2 sin x
Graphs of these two equations are shown on accom-
panying graphs, with the solutions being
E1: α = 1.49870113351785
E2: α = 3.09438341304928
We are going to use a numerical scheme called fixed
point iteration. It amounts to making an initial guess
of x0 and substituting this into the right side of the
equation. The resulting value is denoted by x1; and
then the process is repeated, this time substituting x1
into the right side. This is repeated until convergence
occurs or until the iteration is terminated.

In the above cases, we show the results of the first 10


iterations in the accompanying table. Clearly conver-
gence is occurring with E1, but not with E2. Why?
Graphs: y = x together with y = 1 + .5 sin x (E1) and y = 3 + 2 sin x (E2); the intersections with y = x are the solutions.

E1 E2
n xn xn
0 0.00000000000000 3.00000000000000
1 1.00000000000000 3.28224001611973
2 1.42073549240395 2.71963177181556
3 1.49438099256432 3.81910025488514
4 1.49854088439917 1.74629389651652
5 1.49869535552190 4.96927957214762
6 1.49870092540704 1.06563065299216
7 1.49870112602244 4.75018861639465
8 1.49870113324789 1.00142864236516
9 1.49870113350813 4.68448404916097
10 1.49870113351750 1.00077863465869
The above iterations can be written symbolically as
E1: xn+1 = 1 + .5 sin xn
E2: xn+1 = 3 + 2 sin xn
for n = 0, 1, 2, ... Why does one of these iterations converge, but not the other? The graphs show similar behaviour, so why the difference?

As another example, note that the Newton method
xn+1 = xn − f(xn)/f'(xn)
is also a fixed point iteration, for the equation
x = x − f(x)/f'(x)
In general, we are interested in solving equations
x = g(x)
by means of fixed point iteration:
xn+1 = g(xn),   n = 0, 1, 2, ...
It is called fixed point iteration because the root α is a fixed point of the function g(x), meaning that α is a number for which g(α) = α.
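A minimal sketch of the iteration (assuming a function handle g, a starting guess x0, and an iteration count nmax):

function x = fixed_point(g, x0, nmax)
% Fixed point iteration sketch: x(n) holds the nth iterate x_n = g(x_{n-1}).
x = zeros(nmax, 1);
for n = 1:nmax
    x0 = g(x0);
    x(n) = x0;
end

For instance, fixed_point(@(x) 1 + 0.5*sin(x), 0, 10) should reproduce the E1 column of the table above, and g = @(x) 3 + 2*sin(x) with x0 = 3 the E2 column.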
EXISTENCE THEOREM

We begin by asking whether the equation x = g(x)


has a solution. For this to occur, the graphs of y = x
and y = g(x) must intersect, as seen on the earlier
graphs. The lemmas and theorems in the book give
conditions under which we are guaranteed there is a
fixed point α.

Lemma: Let g(x) be a continuous function on the


interval [a, b], and suppose it satisfies the property
a ≤ x ≤ b   =>   a ≤ g(x) ≤ b   (#)
Then the equation x = g(x) has at least one solution in the interval [a, b]. See the graphs for examples.

The proof of this is fairly intuitive. Look at the func-


tion
f(x) = x − g(x),   a ≤ x ≤ b
Evaluating at the endpoints,
f(a) ≤ 0,   f(b) ≥ 0
The function f(x) is continuous on [a, b], and therefore, by the intermediate value theorem, it has a zero in the interval.
Theorem: Assume g(x) and g'(x) exist and are continuous on the interval [a, b]; and further, assume
a ≤ x ≤ b   =>   a ≤ g(x) ≤ b
λ ≡ max_{a≤x≤b} |g'(x)| < 1
Then:
S1. The equation x = g(x) has a unique solution α in [a, b].
S2. For any initial guess x0 in [a, b], the iteration
xn+1 = g(xn),   n = 0, 1, 2, ...
will converge to α.
S3.
|α − xn| ≤ (λ^n / (1 − λ)) |x1 − x0|,   n ≥ 0
S4.
lim_{n→∞} (α − xn+1) / (α − xn) = g'(α)
Thus for xn close to α,
α − xn+1 ≈ g'(α) (α − xn)


The proof is given in the text, and I go over only a
portion of it here. For S2, note that from (#), if x0
is in [a, b], then
x1 = g(x0)
is also in [a, b]. Repeat the argument to show that

x2 = g(x1)
belongs to [a, b]. This can be continued by induction
to show that every xn belongs to [a, b].

We need the following general result. For any two points w and z in [a, b],
g(w) − g(z) = g'(c) (w − z)
for some unknown point c between w and z. Therefore,
|g(w) − g(z)| ≤ λ |w − z|
for any a ≤ w, z ≤ b.
For S3, subtract xn+1 = g(xn) from α = g(α) to get
α − xn+1 = g(α) − g(xn) = g'(cn) (α − xn)   ($)
|α − xn+1| ≤ λ |α − xn|   (*)
with cn between α and xn. From (*), we have that the error is guaranteed to decrease by a factor of λ with each iteration. This leads to
|α − xn| ≤ λ^n |α − x0|,   n ≥ 0
With some extra manipulation, we can obtain the error bound in S3.

For S4, use ($) to write
(α − xn+1) / (α − xn) = g'(cn)
Since xn → α and cn is between α and xn, we have g'(cn) → g'(α).
The statement
α − xn+1 ≈ g'(α) (α − xn)
tells us that when near to the root α, the errors will decrease by a constant factor of g'(α). If this is negative, then the errors will oscillate between positive and negative, and the iterates will be approaching α from both sides. When g'(α) is positive, the iterates will approach α from only one side.

The statements
α − xn+1 = g'(cn) (α − xn)
α − xn+1 ≈ g'(α) (α − xn)
also tell us a bit more of what happens when
|g'(α)| > 1
Then the errors will increase as we approach the root rather than decrease in size.
Look at the earlier examples
E1: x = 1 + .5 sin x
E2: x = 3 + 2 sin x
In the first case E1,
g(x) = 1 + .5 sin x
g'(x) = .5 cos x
|g'(α)| ≤ 1/2
Therefore the fixed point iteration
xn+1 = 1 + .5 sin xn
will converge for E1.

For the second case E2,
g(x) = 3 + 2 sin x
g'(x) = 2 cos x
g'(α) = 2 cos(3.09438341304928) ≐ −1.998
Therefore the fixed point iteration
xn+1 = 3 + 2 sin xn
will diverge for E2.
Corollary: Assume x = g(x) has a solution α, and further assume that both g(x) and g'(x) are continuous for all x in some interval about α. In addition, assume
|g'(α)| < 1   (**)
Then for any sufficiently small number ε > 0, the interval [a, b] = [α − ε, α + ε] will satisfy the hypotheses of the preceding theorem.

This means that if (**) is true, and if we choose x0 sufficiently close to α, then the fixed point iteration xn+1 = g(xn) will converge and the earlier results S1–S4 will all hold. The corollary does not tell us how close we need to be to α in order to have convergence.
NEWTON'S METHOD

For Newton's method
xn+1 = xn − f(xn)/f'(xn)
we have it is a fixed point iteration with
g(x) = x − f(x)/f'(x)
Check its convergence by checking the condition (**):
g'(x) = 1 − f'(x)/f'(x) + f(x) f''(x) / [f'(x)]^2 = f(x) f''(x) / [f'(x)]^2
g'(α) = 0
Therefore the Newton method will converge if x0 is chosen sufficiently close to α.
HIGHER ORDER METHODS

What happens when g'(α) = 0? We use Taylor's theorem to answer this question.
Begin by writing
g(x) = g(α) + g'(α)(x − α) + (1/2) g''(c)(x − α)^2
with c between x and α. Substitute x = xn, and recall that g(xn) = xn+1 and g(α) = α. Also assume g'(α) = 0. Then
xn+1 = α + (1/2) g''(cn)(xn − α)^2
xn+1 − α = (1/2) g''(cn)(xn − α)^2
with cn between α and xn. Thus if g'(α) = 0, the fixed point iteration is quadratically convergent or better. In fact, if g''(α) ≠ 0, then the iteration is exactly quadratically convergent.
ANOTHER RAPID ITERATION

Newton's method is rapid, but requires use of the derivative f'(x). Can we get by without this? The answer is yes! Consider the method
Dn = [f(xn + f(xn)) − f(xn)] / f(xn)
xn+1 = xn − f(xn)/Dn
This is an approximation to Newton's method, with f'(xn) ≈ Dn. To analyze its convergence, regard it as a fixed point iteration with
D(x) = [f(x + f(x)) − f(x)] / f(x)
g(x) = x − f(x)/D(x)
Then we can, with some difficulty, show g'(α) = 0 and g''(α) ≠ 0. This will prove this new iteration is quadratically convergent.
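A sketch of this derivative-free iteration (often called Steffensen's method); the function handle f, the initial guess, the tolerance, and the iteration limit are assumed inputs:

function x = steffensen(f, x0, tol, nmax)
% Derivative-free iteration: D approximates f'(x) using only values of f.
x = x0;
for n = 1:nmax
    fx = f(x);
    D  = (f(x + fx) - fx) / fx;
    xnew = x - fx/D;
    if abs(xnew - x) <= tol
        x = xnew;  return
    end
    x = xnew;
end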
FIXED POINT ITERATION: ERROR

Recall the result
lim_{n→∞} (α − xn) / (α − xn−1) = g'(α)
for the iteration
xn = g(xn−1),   n = 1, 2, ...
Thus
α − xn ≈ λ (α − xn−1)   (***)
with λ = g'(α) and |λ| < 1.

If we were to know λ, then we could solve (***) for α:
α ≈ (xn − λ xn−1) / (1 − λ)
Usually, we write this as a modification of the currently computed iterate xn:
α ≈ (xn − λ xn−1) / (1 − λ)
  = (xn − λ xn) / (1 − λ) + (λ xn − λ xn−1) / (1 − λ)
  = xn + (λ / (1 − λ)) [xn − xn−1]
The formula
xn + (λ / (1 − λ)) [xn − xn−1]
is said to be an extrapolation of the numbers xn−1 and xn. But what is λ?

From
lim_{n→∞} (α − xn) / (α − xn−1) = g'(α)
we have
λ ≈ (α − xn) / (α − xn−1)
Unfortunately this also involves the unknown root α which we seek; and we must find some other way of estimating λ.

To calculate λ, consider the ratio
λn = (xn − xn−1) / (xn−1 − xn−2)
To see this is approximately λ as xn approaches α, write
(xn − xn−1) / (xn−1 − xn−2) = (g(xn−1) − g(xn−2)) / (xn−1 − xn−2) = g'(cn)
with cn between xn−1 and xn−2. As the iterates approach α, the number cn must also approach α. Thus λn approaches λ as xn → α.
We combine these results to obtain the estimation
x̂n = xn + (λn / (1 − λn)) [xn − xn−1],   λn = (xn − xn−1) / (xn−1 − xn−2)
We call x̂n the Aitken extrapolate of {xn−2, xn−1, xn}; and α ≈ x̂n.

We can also rewrite this as
α − xn ≈ x̂n − xn = (λn / (1 − λn)) [xn − xn−1]
This is called Aitken's error estimation formula.

The accuracy of these procedures is tied directly to the accuracy of the formulas
α − xn ≈ λ (α − xn−1),   α − xn−1 ≈ λ (α − xn−2)
If this is accurate, then so are the above extrapolation and error estimation formulas.
EXAMPLE

Consider the iteration

xn+1 = 6.28 + sin(xn), n = 0, 1, 2, ...


for solving
x = 6.28 + sin x
Iterates are shown on the accompanying sheet, including calculations of λn and the error estimate
α − xn ≈ x̂n − xn = (λn / (1 − λn)) [xn − xn−1]   (Estimate)
The latter is called "Estimate" in the table. In this instance,
g'(α) ≐ .9644
and therefore the convergence is very slow. This is apparent in the table.
AITKEN'S ALGORITHM

Step 1: Select x0.
Step 2: Calculate
x1 = g(x0),   x2 = g(x1)
Step 3: Calculate
x3 = x2 + (λ2 / (1 − λ2)) [x2 − x1],   λ2 = (x2 − x1) / (x1 − x0)
Step 4: Calculate
x4 = g(x3),   x5 = g(x4)
and calculate x6 as the extrapolate of {x3, x4, x5}. Continue this procedure, ad infinitum.

Of course, in practice we will have some kind of error test to stop this procedure when we believe we have sufficient accuracy.
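A sketch of the algorithm (g, x0, the tolerance, and the cycle limit are assumed inputs; the stopping test is the simple one based on successive extrapolates):

function x = aitken(g, x0, tol, nmax)
% Each cycle: two fixed point steps, then one Aitken extrapolation.
x = x0;
for k = 1:nmax
    x1 = g(x);  x2 = g(x1);
    lam  = (x2 - x1)/(x1 - x);              % lambda_n, estimate of g'(alpha)
    xhat = x2 + lam/(1 - lam)*(x2 - x1);    % Aitken extrapolate of {x, x1, x2}
    if abs(xhat - x2) <= tol
        x = xhat;  return
    end
    x = xhat;
end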
EXAMPLE

Consider again the iteration

xn+1 = 6.28 + sin(xn), n = 0, 1, 2, ...


for solving
x = 6.28 + sin x
Now we use the Aitken method, and the results are
shown in the accompanying table. With this we have

x3 = 7.98 104, x6 = 2.27 106


In comparison, the original iteration had

x6 = 1.23 102
GENERAL COMMENTS

Aitken extrapolation can greatly accelerate the con-


vergence of a linearly convergent iteration

xn+1 = g(xn)
This shows the power of understanding the behaviour
of the error in a numerical process. From that un-
derstanding, we can often improve the accuracy, thru
extrapolation or some other procedure.

This is a justification for using mathematical analyses


to understand numerical methods. We will see this
repeated at later points in the course, and it holds
with many different types of problems and numerical
methods for their solution.
MULTIPLE ROOTS

We study two classes of functions for which there is


additional difficulty in calculating their roots. The first
of these are functions in which the desired root has a
multiplicity greater than 1. What does this mean?

Let α be a root of the function f(x), and imagine writing it in the factored form
f(x) = (x − α)^m h(x)
with some integer m ≥ 1 and some continuous function h(x) for which h(α) ≠ 0. Then we say that α is a root of f(x) of multiplicity m. For example, the function
f(x) = e^(x^2) − 1
has x = 0 as a root of multiplicity m = 2. In particular, define
h(x) = (e^(x^2) − 1) / x^2
for x ≠ 0.
Using Taylor polynomial approximations, we can show for x ≠ 0 that
h(x) ≈ 1 + (1/2) x^2 + (1/6) x^4
lim_{x→0} h(x) = 1
This leads us to extend the definition of h(x) to
h(x) = (e^(x^2) − 1) / x^2,   x ≠ 0
h(0) = 1
Thus
f(x) = x^2 h(x)
as asserted, and x = 0 is a root of f(x) of multiplicity m = 2.

Roots for which m = 1 are called simple roots, and


the methods studied to this point were intended for
such roots. We now consider the case of m > 1.
If the function f(x) is m-times differentiable around α, then we can differentiate
f(x) = (x − α)^m h(x)
m times to obtain an equivalent formulation of what it means for the root to have multiplicity m.
For an example, consider the case
f(x) = (x − α)^3 h(x)
Then
f'(x) = 3 (x − α)^2 h(x) + (x − α)^3 h'(x) ≡ (x − α)^2 h2(x)
h2(x) = 3 h(x) + (x − α) h'(x)
h2(α) = 3 h(α) ≠ 0
This shows α is a root of f'(x) of multiplicity 2.

Differentiating a second time, we can show
f''(x) = (x − α) h3(x)
for a suitably defined h3(x) with h3(α) ≠ 0, and α is a simple root of f''(x).
Differentiating a third time, we have
f'''(α) = h3(α) ≠ 0
We can use this as part of a proof of the following: α is a root of f(x) of multiplicity m = 3 if and only if
f(α) = f'(α) = f''(α) = 0,   f'''(α) ≠ 0

In general, α is a root of f(x) of multiplicity m if and only if
f(α) = ··· = f^(m−1)(α) = 0,   f^(m)(α) ≠ 0
DIFFICULTIES OF MULTIPLE ROOTS

There are two main difficulties with the numerical calculation of multiple roots (by which we mean m > 1 in the definition).

1. Methods such as Newton's method and the secant method converge more slowly than for the case of a simple root.

2. There is a large interval of uncertainty in the pre-


cise location of a multiple root on a computer or
calculator.

The second of these is the more difficult to deal with, but we begin with the first for the case of Newton's method.
Recall that we can regard Newton's method as a fixed point method:
xn+1 = g(xn),   g(x) = x − f(x)/f'(x)
Then we substitute
f(x) = (x − α)^m h(x)
to obtain
g(x) = x − (x − α)^m h(x) / [m (x − α)^(m−1) h(x) + (x − α)^m h'(x)]
     = x − (x − α) h(x) / [m h(x) + (x − α) h'(x)]
Then we can use this to show
g'(α) = 1 − 1/m = (m − 1)/m
For m > 1, this is nonzero, and therefore Newton's method is only linearly convergent:
α − xn+1 ≈ λ (α − xn),   λ = (m − 1)/m
Similar results hold for the secant method.
There are ways of improving the speed of convergence of Newton's method, creating a modified method that is again quadratically convergent. In particular, consider the fixed point iteration formula
xn+1 = g(xn),   g(x) = x − m f(x)/f'(x)
in which we assume to know the multiplicity m of the root α being sought. Then modifying the above argument on the convergence of Newton's method, we obtain
g'(α) = 1 − m · (1/m) = 0
and the iteration method will be quadratically convergent.
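A sketch of the modified iteration (the handles f and df for f and f', the multiplicity m, and a starting guess are assumed known):

function x = newton_multiplicity(f, df, m, x0, tol, nmax)
% Modified Newton method for a root of known multiplicity m.
x = x0;
for n = 1:nmax
    dx = m*f(x)/df(x);
    x  = x - dx;
    if abs(dx) <= tol
        return
    end
end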

But this is not the fundamental problem posed by


multiple roots.
NOISE IN FUNCTION EVALUATION

Recall the discussion of noise in evaluating a function


f(x), and in our case consider the evaluation for values of x near to α. In the following figures, the noise as measured by vertical distance is the same in both graphs.

Figures: the same amount of noise near a simple root and near a double root.

Noise was discussed earlier in §2.2, and Figures 2.1 and 2.2 were of
f(x) = x^3 − 3x^2 + 3x − 1 ≡ (x − 1)^3

Because of the noise in evaluating f(x), it appears from the graph that f(x) has many zeros around x = 1, whereas the exact function outside of the computer has only the root α = 1, of multiplicity 3.

Any rootfinding method to find a multiple root that


uses evaluation of f (x) is doomed to having a large
interval of uncertainty as to the location of the root.
If high accuracy is desired, then the only satisfactory solution is to reformulate the problem as a new problem F(x) = 0 in which α is a simple root of F. Then use a standard rootfinding method to calculate α. It is important that the evaluation of F(x) not involve f(x) directly, as that is the source of the noise and the uncertainty.
EXAMPLE

From the text, consider finding the roots of
f(x) = 2.7951 − 8.954x + 10.56x^2 − 5.4x^3 + x^4

This has a root α to the right of 1. From an examination of the rate of linear convergence of Newton's method applied to this function, one can guess with high probability that the multiplicity is m = 3. Then form exactly the second derivative
f''(x) = 21.12 − 32.4x + 12x^2

Applying Newton's method to this with a guess of x = 1 will lead to rapid convergence to α = 1.1.

In general, if we know the root α has multiplicity m > 1, then replace the problem by that of solving
f^(m−1)(x) = 0
since α is a simple root of this equation.
STABILITY

Generally we expect the world to be stable. By this,


we mean that if we make a small change in something,
then we expect to have this lead to other correspond-
ingly small changes. In fact, if we think about this
carefully, then we know this need not be true. We
now illustrate this for the case of rootfinding.

Consider the polynomial
f(x) = x^7 − 28x^6 + 322x^5 − 1960x^4 + 6769x^3 − 13132x^2 + 13068x − 5040
This has the exact roots {1, 2, 3, 4, 5, 6, 7}. Now consider the perturbed polynomial
F(x) = x^7 − 28.002x^6 + 322x^5 − 1960x^4 + 6769x^3 − 13132x^2 + 13068x − 5040
This is a relatively small change in one coefficient, of relative error
.002 / 28 ≐ 7.14 × 10^−5
What are the roots of F(x)?

Root of f(x)   Root of F(x)              Error
1              1.0000028                 −2.8E−6
2              1.9989382                 1.1E−3
3              3.0331253                 −0.033
4              3.8195692                 0.180
5              5.4586758 + .54012578i    −.46 − .54i
6              5.4586758 − .54012578i    −.46 + .54i
7              7.2330128                 −0.233

Why have some of the roots departed so radically from


the original values? This phenomenon goes under a
variety of names. We sometimes say this is an example
of an unstable or ill-conditioned rootfinding problem.
These words are often used in a casual manner, but
they also have a very precise meaning in many areas
of numerical analysis (and more generally, in all of
mathematics).
A PERTURBATION ANALYSIS

We want to study what happens to the root of a function f(x) when it is perturbed by a small amount. For some function g(x) and for all small ε, define a perturbed function
F_ε(x) = f(x) + ε g(x)

The polynomial example would fit this if we use
g(x) = −x^6,   ε = .002


Let α0 be a simple root of f(x). It can be shown (using the implicit differentiation theorem from calculus) that if f(x) and g(x) are differentiable for x ≈ α0, and if f'(α0) ≠ 0, then F_ε(x) has a unique simple root α(ε) near to α0 = α(0) for all small values of ε. Moreover, α(ε) will be a differentiable function of ε. We use this to estimate α(ε).
The linear Taylor polynomial approximation of α(ε) is given by
α(ε) ≈ α(0) + ε α'(0)
We need to find a formula for α'(0). Recall that
F_ε(α(ε)) = 0
for all small values of ε. Differentiate this as a function of ε using the chain rule. Then we obtain
f'(α(ε)) α'(ε) + g(α(ε)) + ε g'(α(ε)) α'(ε) = 0
for all small ε. Substitute ε = 0, recall α(0) = α0, and solve for α'(0) to obtain
f'(α0) α'(0) + g(α0) = 0
α'(0) = −g(α0) / f'(α0)
This then leads to
α(ε) ≈ α(0) + ε α'(0) = α0 − ε g(α0) / f'(α0)   (*)

Example: In our earlier polynomial example, consider the simple root α0 = 3. Then
α(ε) ≈ 3 − ε (−3^6) / 48 ≐ 3 + 15.2 ε
With ε = .002, we obtain
α(.002) ≈ 3 + 15.2 (.002) = 3.0304
This is close to the actual root 3.0331253.

However, the approximation (*) is not good at esti-


mating the change in the roots 5 and 6. By ob-
servation, the perturbation in the root is a complex
number, whereas the formula (*) predicts only a per-
turbation that is real. The value of is too large to
have (*) be accurate for the roots 5 and 6.
DISCUSSION

Looking again at the formula
α(ε) ≈ α0 − ε g(α0) / f'(α0)
we have that the size of
g(α0) / f'(α0)
is an indication of the stability of the solution α0.

If this quantity is large, then potentially we will have difficulty. Of course, not all functions g(x) are equally possible, and we need to look only at functions g(x) that will possibly occur in practice.

One quantity of interest is the size of f'(α0). If it is very small relative to g(α0), then we are likely to have difficulty in finding α0 accurately.
INTERPOLATION

Interpolation is a process of finding a formula (often


a polynomial) whose graph will pass through a given
set of points (x, y).

As an example, consider defining
x0 = 0,   x1 = π/4,   x2 = π/2
and
yi = cos xi,   i = 0, 1, 2
This gives us the three points
(0, 1),   (π/4, 1/sqrt(2)),   (π/2, 0)
Now find a quadratic polynomial
p(x) = a0 + a1x + a2x2
for which
p(xi) = yi, i = 0, 1, 2
The graph of this polynomial is shown on the accom-
panying graph. We later give an explicit formula.
Figure: quadratic interpolation of cos(x); graphs of y = cos(x) and y = p2(x) on [0, π/2].
PURPOSES OF INTERPOLATION

1. Replace a set of data points {(xi, yi)} with a func-


tion given analytically.

2. Approximate functions with simpler ones, usually


polynomials or piecewise polynomials.

Purpose #1 has several aspects.

The data may be from a known class of functions.


Interpolation is then used to find the member of
this class of functions that agrees with the given
data. For example, data may be generated from
functions of the form
p(x) = a0 + a1 e^x + a2 e^(2x) + ··· + an e^(nx)
Then we need to find the coefficients {aj} based on the given data values.
We may want to take function values f (x) given
in a table for selected values of x, often equally
spaced, and extend the function to values of x
not in the table.
For example, given numbers from a table of loga-
rithms, estimate the logarithm of a number x not
in the table.

Given a set of data points {(xi, yi)}, find a curve


passing thru these points that is pleasing to the
eye. In fact, this is what is done continually with
computer graphics. How do we connect a set of
points to make a smooth curve? Connecting them
with straight line segments will often give a curve
with many corners, whereas what was intended
was a smooth curve.
Purpose #2 for interpolation is to approximate func-
tions f (x) by simpler functions p(x), perhaps to make
it easier to integrate or differentiate f(x). That will
be the primary reason for studying interpolation in this
course.

As an example of why this is important, consider the problem of evaluating
I = ∫_0^1 dx / (1 + x^10)
This is very difficult to do analytically. But we will look at producing polynomial interpolants of the integrand; and polynomials are easily integrated exactly.

We begin by using polynomials as our means of doing


interpolation. Later in the chapter, we consider more
complex piecewise polynomial functions, often called
spline functions.
LINEAR INTERPOLATION

The simplest form of interpolation is probably the


straight line, connecting two points by a straight line.
Let two data points (x0, y0) and (x1, y1) be given.
There is a unique straight line passing through these
points. We can write the formula for a straight line
as
P1(x) = a0 + a1x
In fact, there are other more convenient ways to write
it, and we give several of them below.
P1(x) = [(x − x1)/(x0 − x1)] y0 + [(x − x0)/(x1 − x0)] y1
      = [(x1 − x) y0 + (x − x0) y1] / (x1 − x0)
      = y0 + [(x − x0)/(x1 − x0)] [y1 − y0]
      = y0 + (x − x0) (y1 − y0)/(x1 − x0)
Check each of these by evaluating them at x = x0 and x1 to see if the respective values are y0 and y1.
Example. Following is a table of values for f (x) =
tan x for a few values of x.

x 1 1.1 1.2 1.3


tan x 1.5574 1.9648 2.5722 3.6021

Use linear interpolation to estimate tan(1.15). Then use
x0 = 1.1,   x1 = 1.2
with corresponding values for y0 and y1. Then
tan x ≈ y0 + [(x − x0)/(x1 − x0)] [y1 − y0]
tan(1.15) ≈ 1.9648 + [(1.15 − 1.1)/(1.2 − 1.1)] [2.5722 − 1.9648] = 2.2685
The true value is tan 1.15 = 2.2345. We will want to examine formulas for the error in interpolation, to know when we have sufficient accuracy in our interpolant.
Figures: y = tan(x) on [1, 1.3], and y = tan(x) together with the linear interpolant y = p1(x) on [1.1, 1.2].
QUADRATIC INTERPOLATION

We want to find a polynomial

P2(x) = a0 + a1x + a2x2


which satisfies

P2(xi) = yi, i = 0, 1, 2
for given data points (x0, y0) , (x1, y1) , (x2, y2). One
formula for such a polynomial follows:

P2(x) = y0 L0(x) + y1 L1(x) + y2 L2(x)   (∗∗)
with
L0(x) = (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)]
L1(x) = (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)]
L2(x) = (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)]
The formula (∗∗) is called Lagrange's form of the interpolation polynomial.
LAGRANGE BASIS FUNCTIONS

The functions
L0(x) = (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)]
L1(x) = (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)]
L2(x) = (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)]
are called Lagrange basis functions for quadratic interpolation. They have the properties
Li(xj) = 1 if i = j,   Li(xj) = 0 if i ≠ j
for i, j = 0, 1, 2. Also, they all have degree 2. Their graphs are on an accompanying page.

As a consequence of each Li(x) being of degree 2, we


have that the interpolant

P2(x) = y0L0(x) + y1L1(x) + y2L2(x)


must have degree ≤ 2.
UNIQUENESS

Can there be another polynomial, call it Q(x), for which
deg(Q) ≤ 2
Q(xi) = yi,   i = 0, 1, 2
That is, is the Lagrange formula P2(x) unique?

Introduce
R(x) = P2(x) − Q(x)
From the properties of P2 and Q, we have deg(R) ≤ 2. Moreover,
R(xi) = P2(xi) − Q(xi) = yi − yi = 0
for all three node points x0, x1, and x2. How many polynomials R(x) are there of degree at most 2 and having three distinct zeros? The answer is that only the zero polynomial satisfies these properties, and therefore
R(x) = 0 for all x
Q(x) = P2(x) for all x


SPECIAL CASES

Consider the data points


(x0, 1), (x1, 1), (x2, 1)
What is the polynomial P2(x) in this case?

Answer: We must have that the polynomial interpolant is
P2(x) ≡ 1
meaning that P2(x) is the constant function 1. Why? First, the constant function satisfies the property of being of degree ≤ 2. Next, it clearly interpolates the given data. Therefore by the uniqueness of quadratic interpolation, P2(x) must be the constant function 1.

Consider now the data points


(x0, mx0), (x1, mx1), (x2, mx2)
for some constant m. What is P2(x) in this case? By
an argument similar to that above,
P2(x) = mx for all x
Thus the degree of P2(x) can be less than 2.
HIGHER DEGREE INTERPOLATION

We consider now the case of interpolation by polynomials of a general degree n. We want to find a polynomial Pn(x) for which
deg(Pn) ≤ n
Pn(xi) = yi,   i = 0, 1, ..., n   (∗∗)
with given data points
(x0, y0), (x1, y1), ..., (xn, yn)
The solution is given by Lagrange's formula
Pn(x) = y0 L0(x) + y1 L1(x) + ··· + yn Ln(x)
The Lagrange basis functions are given by
Lk(x) = [(x − x0) ··· (x − xk−1)(x − xk+1) ··· (x − xn)] / [(xk − x0) ··· (xk − xk−1)(xk − xk+1) ··· (xk − xn)]
for k = 0, 1, 2, ..., n. The quadratic case was covered earlier.

In a manner analogous to the quadratic case, we can show that the above Pn(x) is the only solution to the problem (∗∗).
In the formula
Lk(x) = [(x − x0) ··· (x − xk−1)(x − xk+1) ··· (x − xn)] / [(xk − x0) ··· (xk − xk−1)(xk − xk+1) ··· (xk − xn)]
we can see that each such function is a polynomial of degree n. In addition,
Lk(xi) = 1 if k = i,   Lk(xi) = 0 if k ≠ i
Using these properties, it follows that the formula

Pn(x) = y0L0(x) + y1L1(x) + + ynLn(x)


satisfies the interpolation problem of finding a solution
to
deg(Pn) ≤ n
Pn(xi) = yi,   i = 0, 1, ..., n
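A sketch of evaluating Pn(x) directly from Lagrange's formula (x is the vector of distinct nodes, y the data values, t the evaluation points; all assumed inputs):

function p = lagrange_eval(x, y, t)
% Evaluate the Lagrange form: sum over k of y_k * L_k(t).
n = length(x);
p = zeros(size(t));
for k = 1:n
    Lk = ones(size(t));
    for i = [1:k-1, k+1:n]
        Lk = Lk .* (t - x(i)) / (x(k) - x(i));   % build the basis function L_k
    end
    p = p + y(k)*Lk;
end

For the tangent table in the example that follows, lagrange_eval([1 1.1 1.2 1.3], [1.5574 1.9648 2.5722 3.6021], 1.15) should reproduce the degree 3 value reported there.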
EXAMPLE

Recall the table


x 1 1.1 1.2 1.3
tan x 1.5574 1.9648 2.5722 3.6021

We now interpolate this table with the nodes

x0 = 1, x1 = 1.1, x2 = 1.2, x3 = 1.3


Without giving the details of the evaluation process,
we have the following results for interpolation with
degrees n = 1, 2, 3.
n 1 2 3
Pn(1.15) 2.2685 2.2435 2.2296
Error .0340 .0090 .0049

It improves with increasing degree n, but not at a very


rapid rate. In fact, the error becomes worse when n is
increased further. Later we will see that interpolation
of a much higher degree, say n ≥ 10, is often poorly
behaved when the node points {xi} are evenly spaced.
A FIRST ORDER DIVIDED DIFFERENCE

For a given function f (x) and two distinct points x0


and x1, define
f[x0, x1] = (f(x1) − f(x0)) / (x1 − x0)
This is called a first order divided difference of f(x).

By the mean-value theorem,
f(x1) − f(x0) = f'(c) (x1 − x0)
for some c between x0 and x1. Thus
f[x0, x1] = f'(c)
and the divided difference is very much like the derivative, especially if x0 and x1 are quite close together. In fact,
f'((x1 + x0)/2) ≈ f[x0, x1]
is quite an accurate approximation of the derivative (see §5.4).
SECOND ORDER DIVIDED DIFFERENCES

Given three distinct points x0, x1, and x2, define


f[x0, x1, x2] = (f[x1, x2] − f[x0, x1]) / (x2 − x0)
This is called the second order divided difference of f(x).

By a fairly complicated argument, we can show
f[x0, x1, x2] = (1/2) f''(c)
for some c intermediate to x0, x1, and x2. In fact, as we investigate in §5.4,
f''(x1) ≈ 2 f[x0, x1, x2]
in the case the nodes are evenly spaced,
x1 − x0 = x2 − x1
EXAMPLE

Consider the table


x 1 1.1 1.2 1.3 1.4
cos x .54030 .45360 .36236 .26750 .16997

Let x0 = 1, x1 = 1.1, and x2 = 1.2. Then


f[x0, x1] = (.45360 − .54030) / (1.1 − 1) = −.86700
f[x1, x2] = (.36236 − .45360) / (1.2 − 1.1) = −.91240
f[x0, x1, x2] = (f[x1, x2] − f[x0, x1]) / (x2 − x0) = (−.91240 − (−.86700)) / (1.2 − 1.0) = −.22700
For comparison,
f'((x1 + x0)/2) = −sin(1.05) = −.86742
(1/2) f''(x1) = −(1/2) cos(1.1) = −.22680
GENERAL DIVIDED DIFFERENCES

Given n + 1 distinct points x0, ..., xn, with n ≥ 2, define
f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., xn−1]) / (xn − x0)
This is a recursive definition of the nth-order divided difference of f(x), using divided differences of order n − 1. Its relation to the derivative is as follows:
f[x0, ..., xn] = (1/n!) f^(n)(c)
for some c intermediate to the points {x0, ..., xn}. Let
I denote the interval

I = [min {x0, ..., xn} , max {x0, ..., xn}]


Then c ∈ I, and the above result is based on the assumption that f(x) is n-times continuously differentiable on the interval I.
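A sketch of computing a full divided difference table from the recursive definition (x and y are the node and value vectors, assumed given); column k + 1 of the result holds the kth order differences:

function D = divdif(x, y)
% D(i,k) = f[x_i, ..., x_{i+k-1}], built column by column from the recursion.
n = length(x);
D = zeros(n, n);
D(:,1) = y(:);
for k = 2:n
    for i = 1:n-k+1
        D(i,k) = (D(i+1,k-1) - D(i,k-1)) / (x(i+k-1) - x(i));
    end
end

With x = [1 1.1 1.2 1.3 1.4] and y = cos(x), the first row of D should reproduce (up to rounding) the table in the example that follows.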
EXAMPLE

The following table gives divided dierences for the


data in
x 1 1.1 1.2 1.3 1.4
cos x .54030 .45360 .36236 .26750 .16997

For the column headings, we use


Dk f (xi) = f [xi, ..., xi+k ]

i xi f (xi) Df (xi) D2f (xi) D3f (xi) D4f (xi)


0 1.0 .54030 -.8670 -.2270 .1533 .0125
1 1.1 .45360 -.9124 -.1810 .1583
2 1.2 .36236 -.9486 -.1335
3 1.3 .26750 -.9753
4 1.4 .16997

These were computed using the recursive definition


f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., xn−1]) / (xn − x0)
ORDER OF THE NODES

Looking at f[x0, x1], we have
f[x0, x1] = (f(x1) − f(x0)) / (x1 − x0) = (f(x0) − f(x1)) / (x0 − x1) = f[x1, x0]
The order of x0 and x1 does not matter. Looking at
f[x0, x1, x2] = (f[x1, x2] − f[x0, x1]) / (x2 − x0)
we can expand it to get
f[x0, x1, x2] = f(x0) / [(x0 − x1)(x0 − x2)] + f(x1) / [(x1 − x0)(x1 − x2)] + f(x2) / [(x2 − x0)(x2 − x1)]
With this formula, we can show that the order of the arguments x0, x1, x2 does not matter in the final value of f[x0, x1, x2] we obtain. Mathematically,
f[x0, x1, x2] = f[xi0, xi1, xi2]
for any permutation (i0, i1, i2) of (0, 1, 2).
We can show in general that the value of f[x0, ..., xn] is independent of the order of the arguments {x0, ..., xn}, even though the intermediate steps in its calculation using
f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., xn−1]) / (xn − x0)
are order dependent.

We can show
f[x0, ..., xn] = f[xi0, ..., xin]
for any permutation (i0, i1, ..., in) of (0, 1, ..., n).
COINCIDENT NODES

What happens when some of the nodes {x0, ..., xn} are not distinct? Begin by investigating what happens when they all come together as a single point x0.

For first order divided differences, we have
lim_{x1→x0} f[x0, x1] = lim_{x1→x0} (f(x1) − f(x0)) / (x1 − x0) = f'(x0)
We extend the definition of f[x0, x1] to coincident nodes using
f[x0, x0] = f'(x0)
For second order divided differences, recall
f[x0, x1, x2] = (1/2) f''(c)
with c intermediate to x0, x1, and x2.

Then as x1 → x0 and x2 → x0, we must also have that c → x0. Therefore,
lim_{x1→x0, x2→x0} f[x0, x1, x2] = (1/2) f''(x0)
We therefore define
f[x0, x0, x0] = (1/2) f''(x0)
For the case of general f[x0, ..., xn], recall that
f[x0, ..., xn] = (1/n!) f^(n)(c)
for some c intermediate to {x0, ..., xn}. Then
lim_{x1,...,xn→x0} f[x0, ..., xn] = (1/n!) f^(n)(x0)
and we define
f[x0, ..., x0] = (1/n!) f^(n)(x0),   with x0 repeated n + 1 times

What do we do when only some of the nodes are coincident? This too can be dealt with, although we do so here only by examples.
f[x0, x1, x1] = (f[x1, x1] − f[x0, x1]) / (x1 − x0) = (f'(x1) − f[x0, x1]) / (x1 − x0)
The recursion formula can be used in general in this way to allow all possible combinations of possibly coincident nodes.
LAGRANGE'S FORMULA FOR
THE INTERPOLATION POLYNOMIAL

Recall the general interpolation problem: find a poly-


nomial Pn(x) for which
deg(Pn) ≤ n
Pn(xi) = yi,   i = 0, 1, ..., n
with given data points
(x0, y0), (x1, y1), ..., (xn, yn)
and with {x0, ..., xn} distinct points.

In §5.1, we gave the solution as Lagrange's formula
Pn(x) = y0 L0(x) + y1 L1(x) + ··· + yn Ln(x)
with {L0(x), ..., Ln(x)} the Lagrange basis polynomials. Each Lj is of degree n and it satisfies
Lj(xi) = 1 if j = i,   Lj(xi) = 0 if j ≠ i
for i = 0, 1, ..., n.
THE NEWTON DIVIDED DIFFERENCE FORM
OF THE INTERPOLATION POLYNOMIAL

Let the data values for the problem


deg(Pn) ≤ n
Pn(xi) = yi,   i = 0, 1, ..., n
be generated from a function f (x):

yi = f (xi), i = 0, 1, ..., n
Using the divided dierences

f [x0, x1], f [x0, x1, x2], ..., f [x0, ..., xn]


we can write the interpolation polynomials

P1(x), P2(x), ..., Pn(x)


in a way that is simple to compute.
P1(x) = f(x0) + f[x0, x1] (x − x0)
P2(x) = f(x0) + f[x0, x1] (x − x0) + f[x0, x1, x2] (x − x0)(x − x1)
      = P1(x) + f[x0, x1, x2] (x − x0)(x − x1)
For the case of the general problem
deg(Pn) ≤ n
Pn(xi) = yi,   i = 0, 1, ..., n
we have
Pn(x) = f(x0) + f[x0, x1] (x − x0) + f[x0, x1, x2] (x − x0)(x − x1) + f[x0, x1, x2, x3] (x − x0)(x − x1)(x − x2) + ··· + f[x0, ..., xn] (x − x0) ··· (x − xn−1)
From this we have the recursion relation
Pn(x) = Pn−1(x) + f[x0, ..., xn] (x − x0) ··· (x − xn−1)
in which Pn−1(x) interpolates f(x) at the points in {x0, ..., xn−1}.
Example: Recall the table
i xi f (xi) Df (xi) D2f (xi) D3f (xi) D4f (xi)
0 1.0 .54030 -.8670 -.2270 .1533 .0125
1 1.1 .45360 -.9124 -.1810 .1583
2 1.2 .36236 -.9486 -.1335
3 1.3 .26750 -.9753
4 1.4 .16997

with Dk f (xi) = f [xi, ..., xi+k ], k = 1, 2, 3, 4. Then

P1(x) = .5403 − .8670 (x − 1)
P2(x) = P1(x) − .2270 (x − 1)(x − 1.1)
P3(x) = P2(x) + .1533 (x − 1)(x − 1.1)(x − 1.2)
P4(x) = P3(x) + .0125 (x − 1)(x − 1.1)(x − 1.2)(x − 1.3)
Using this table and these formulas, we have the following table of interpolants for the value x = 1.05. The true value is cos(1.05) = .49757105.

n          1         2         3          4
Pn(1.05)   .49695    .49752    .49758     .49757
Error      6.20E−4   5.00E−5   −1.00E−5   0.0
EVALUATION OF THE DIVIDED DIFFERENCE
INTERPOLATION POLYNOMIAL

Let
d1 = f[x0, x1]
d2 = f[x0, x1, x2]
...
dn = f[x0, ..., xn]
Then the formula
Pn(x) = f(x0) + f[x0, x1] (x − x0) + f[x0, x1, x2] (x − x0)(x − x1) + ··· + f[x0, ..., xn] (x − x0) ··· (x − xn−1)
can be written as
Pn(x) = f(x0) + (x − x0) (d1 + (x − x1) (d2 + ··· + (x − xn−2) (dn−1 + (x − xn−1) dn) ··· ))
Thus we have a nested polynomial evaluation, and this is quite efficient in computational cost.
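A sketch of this nested evaluation (the node vector x and the coefficient vector c = [f(x0), d1, ..., dn] are assumed given, e.g. taken from the first row of a divided difference table as in the earlier sketch):

function p = newton_eval(x, c, t)
% Nested (Horner-like) evaluation of the Newton divided difference form at the points t.
n = length(c);
p = c(n)*ones(size(t));
for k = n-1:-1:1
    p = c(k) + (t - x(k)).*p;
end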
ERROR IN LINEAR INTERPOLATION

Let P1(x) denote the linear polynomial interpolating


f (x) at x0 and x1, with f (x) a given function (e.g.
f(x) = cos x). What is the error f(x) − P1(x)?

Let f(x) be twice continuously differentiable on an interval [a, b] which contains the points {x0, x1}. Then for a ≤ x ≤ b,
f(x) − P1(x) = [(x − x0)(x − x1) / 2] f''(cx)
for some cx between the minimum and maximum of x0, x1, and x.

If x1 and x are close to x0, then
f(x) − P1(x) ≈ [(x − x0)(x − x1) / 2] f''(x0)
Thus the error acts like a quadratic polynomial, with zeros at x0 and x1.
EXAMPLE

Let f (x) = log10 x; and in line with typical tables of


log10 x, we take 1 x, x0, x1 10. For definiteness,
let x0 < x1 with h = x1 x0. Then
log10 e
f 00(x) =
x2
" #
(x x0) (x x1) log e
log10 x P1(x) = 10
2 c2x
" #
log10 e
= (x x0) (x1 x)
2c2x
We usually are interpolating with x0 x x1; and
in that case, we have

(x x0) (x1 x) 0, x0 cx x1
(x x0) (x1 x) 0, x0 cx x1
and therefore
" #
log10 e
(x x0) (x1 x) 2 log10 x P1(x)
2x1
" #
log10 e
(x x0) (x1 x)
2x20

For h = x1 x0 small, we have for x0 x x1


" #
log10 e
log10 x P1(x) (x x0) (x1 x)
2x20

Typical high school algebra textbooks contain tables of log10 x with a spacing of h = .01. What is the error in this case? To look at this, we use
0 ≤ log10 x − P1(x) ≤ (x − x0)(x1 − x) (log10 e) / (2 x0^2)
By simple geometry or calculus,
max_{x0≤x≤x1} (x − x0)(x1 − x) = h^2 / 4
Therefore,
0 ≤ log10 x − P1(x) ≤ (h^2 / 4) (log10 e) / (2 x0^2) ≐ .0543 h^2 / x0^2
If we want a uniform bound for all points 1 ≤ x0 ≤ 10, we have
0 ≤ log10 x − P1(x) ≤ h^2 (log10 e) / 8 ≐ .0543 h^2


For h = .01, as is typical of the high school textbook tables of log10 x,
0 ≤ log10 x − P1(x) ≤ 5.43 × 10^−6

If you look at most tables, a typical entry is given to only four decimal places to the right of the decimal point, e.g.
log10 5.41 ≐ .7332
Therefore the entries are in error by as much as .00005. Comparing this with the interpolation error, we see the latter is less important than the rounding errors in the table entries.

From the bound
0 ≤ log10 x − P1(x) ≤ h^2 (log10 e) / (8 x0^2) ≐ .0543 h^2 / x0^2
we see the error decreases as x0 increases, and it is about 100 times smaller for points near 10 than for points near 1.
AN ERROR FORMULA:
THE GENERAL CASE

Recall the general interpolation problem: find a poly-


nomial Pn(x) for which
deg(Pn) ≤ n
Pn(xi) = f(xi),   i = 0, 1, ..., n
with distinct node points {x0, ..., xn} and a given function f(x). Let [a, b] be a given interval on which f(x) is (n + 1)-times continuously differentiable; and assume the points x0, ..., xn, and x are contained in [a, b]. Then
f(x) − Pn(x) = [(x − x0)(x − x1) ··· (x − xn) / (n + 1)!] f^(n+1)(cx)
with cx some point between the minimum and maximum of the points in {x, x0, ..., xn}.
As shorthand, introduce
Ψn(x) = (x − x0)(x − x1) ··· (x − xn)
a polynomial of degree n + 1 with roots {x0, ..., xn}. Then
f(x) − Pn(x) = [Ψn(x) / (n + 1)!] f^(n+1)(cx)
THE QUADRATIC CASE

For n = 2, we have
f(x) − P2(x) = [(x − x0)(x − x1)(x − x2) / 3!] f^(3)(cx)   (*)
with cx some point between the minimum and maximum of the points in {x, x0, x1, x2}.

To illustrate the use of this formula, consider the case of evenly spaced nodes:
x1 = x0 + h,   x2 = x1 + h
Further suppose we have x0 ≤ x ≤ x2, as we would usually have when interpolating in a table of given function values (e.g. log10 x). The quantity
Ψ2(x) = (x − x0)(x − x1)(x − x2)
can be evaluated directly for a particular x.
Graph of Ψ2(x) = (x + h) x (x − h), using (x0, x1, x2) = (−h, 0, h).
In the formula (*), however, we do not know cx, and therefore we replace f^(3)(cx) with a maximum of |f^(3)(x)| as x varies over x0 ≤ x ≤ x2. This yields
|f(x) − P2(x)| ≤ [|Ψ2(x)| / 3!] max_{x0≤x≤x2} |f^(3)(x)|   (**)
If we want a uniform bound for x0 ≤ x ≤ x2, we must compute
max_{x0≤x≤x2} |Ψ2(x)| = max_{x0≤x≤x2} |(x − x0)(x − x1)(x − x2)|
Using calculus,
max_{x0≤x≤x2} |Ψ2(x)| = 2 h^3 / (3 sqrt(3)),   attained at x = x1 ± h/sqrt(3)
Combined with (**), this yields
|f(x) − P2(x)| ≤ [h^3 / (9 sqrt(3))] max_{x0≤x≤x2} |f^(3)(x)|
for x0 ≤ x ≤ x2.
For f(x) = log10 x, with 1 ≤ x0 ≤ x ≤ x2 ≤ 10, this leads to
|log10 x − P2(x)| ≤ [h^3 / (9 sqrt(3))] max_{x0≤x≤x2} (2 log10 e) / x^3 = .05572 h^3 / x0^3
For the case of h = .01, we have
|log10 x − P2(x)| ≤ 5.57 × 10^−8 / x0^3 ≤ 5.57 × 10^−8
Question: How much larger could we make h so that quadratic interpolation would have an error comparable to that of linear interpolation of log10 x with h = .01? The error bound for the linear interpolation was 5.43 × 10^−6, and therefore we want the same to be true of quadratic interpolation. Using a simpler bound, we want to find h so that
|log10 x − P2(x)| ≤ .05572 h^3 ≤ 5 × 10^−6
This is true if h ≤ .04477. Therefore a spacing of h = .04 would be sufficient. A table with this spacing and quadratic interpolation would have an error comparable to a table with h = .01 and linear interpolation.
For the case of general n,
f(x) − Pn(x) = [(x − x0) ··· (x − xn) / (n + 1)!] f^(n+1)(cx) = [Ψn(x) / (n + 1)!] f^(n+1)(cx)
Ψn(x) = (x − x0)(x − x1) ··· (x − xn)
with cx some point between the minimum and maximum of the points in {x, x0, ..., xn}. When bounding the error we replace f^(n+1)(cx) with its maximum over the interval containing {x, x0, ..., xn}, as we have illustrated earlier in the linear and quadratic cases.

Consider now the function
Ψn(x) / (n + 1)!
over the interval determined by the minimum and maximum of the points in {x, x0, ..., xn}. For evenly spaced node points on [0, 1], with x0 = 0 and xn = 1, we give graphs for n = 2, 3, 4, 5 and for n = 6, 7, 8, 9 on accompanying pages.
DISCUSSION OF ERROR

Consider the error
f(x) − Pn(x) = [Ψn(x) / (n + 1)!] f^(n+1)(cx),   Ψn(x) = (x − x0)(x − x1) ··· (x − xn)
as n increases and as x varies. As noted previously, we cannot do much with f^(n+1)(cx) except to replace it with a maximum value of |f^(n+1)(x)| over a suitable interval. Thus we concentrate on understanding the size of
Ψn(x) / (n + 1)!
ERROR FOR EVENLY SPACED NODES

We consider first the case in which the node points


are evenly spaced, as this seems the natural way to
define the points at which interpolation is carried out.
Moreover, using evenly spaced nodes is the case to
consider for table interpolation. What can we learn
from the given graphs?

The interpolation nodes are determined by using
h = 1/n,   x0 = 0, x1 = h, x2 = 2h, ..., xn = nh = 1
For this case,
Ψn(x) = x (x − h)(x − 2h) ··· (x − 1)
Our graphs are the cases of n = 2, ..., 9.
Graphs of Ψn(x) on [0, 1] for n = 2, 3, 4, 5 and for n = 6, 7, 8, 9.

Graph of
Ψ6(x) = (x − x0)(x − x1) ··· (x − x6)
with evenly spaced nodes x0, ..., x6.
Using the following table,
n   Mn        n    Mn
1   1.25E−1   6    4.76E−7
2   2.41E−2   7    2.20E−8
3   2.06E−3   8    9.11E−10
4   1.48E−4   9    3.39E−11
5   9.01E−6   10   1.15E−12
we can observe that the maximum
Mn ≡ max_{x0≤x≤xn} |Ψn(x)| / (n + 1)!
becomes smaller with increasing n.
From the graphs, there is enormous variation in the size of Ψn(x) as x varies over [0, 1]; and thus there is also enormous variation in the error as x so varies. For example, in the n = 9 case,
max_{x0≤x≤x1} |Ψn(x)| / (n + 1)! = 3.39 × 10^−11
max_{x4≤x≤x5} |Ψn(x)| / (n + 1)! = 6.89 × 10^−13
and the ratio of these two errors is approximately 49. Thus the interpolation error is likely to be around 49 times larger when x0 ≤ x ≤ x1 as compared to the case when x4 ≤ x ≤ x5. When doing table interpolation, the point x at which you are interpolating should be centrally located with respect to the interpolation nodes {x0, ..., xn} being used to define the interpolation, if possible.
AN APPROXIMATION PROBLEM

Consider now the problem of using an interpolation


polynomial to approximate a given function f (x) on
a given interval [a, b]. In particular, take interpolation
nodes
a ≤ x0 < x1 < ··· < xn−1 < xn ≤ b
and produce the interpolation polynomial Pn(x) that interpolates f(x) at the given node points. We would like to have
max_{a≤x≤b} |f(x) − Pn(x)| → 0 as n → ∞
Does it happen?

Recall the error bound
max_{a≤x≤b} |f(x) − Pn(x)| ≤ max_{a≤x≤b} [|Ψn(x)| / (n + 1)!] · max_{a≤x≤b} |f^(n+1)(x)|
We begin with an example using evenly spaced node points.
RUNGE'S EXAMPLE

Use evenly spaced node points:
h = (b − a)/n,   xi = a + ih for i = 0, ..., n
For some functions, such as f(x) = e^x, the maximum error goes to zero quite rapidly. But the size of the derivative term f^(n+1)(x) in
max_{a≤x≤b} |f(x) − Pn(x)| ≤ max_{a≤x≤b} [|Ψn(x)| / (n + 1)!] · max_{a≤x≤b} |f^(n+1)(x)|
can badly hurt or destroy the convergence of other cases.

In particular, we show the graph of f(x) = 1/(1 + x^2) and Pn(x) on [−5, 5] for the cases n = 8 and n = 12. The case n = 10 is in the text on page 127. It can be proven that for this function, the maximum error on [−5, 5] does not converge to zero. Thus the use of evenly spaced nodes is not necessarily a good approach to approximating a function f(x) by interpolation.
Runge's example with n = 10: graphs of y = P10(x) and y = 1/(1 + x^2).
OTHER CHOICES OF NODES

Recall the general error bound
max_{a≤x≤b} |f(x) − Pn(x)| ≤ max_{a≤x≤b} [|Ψn(x)| / (n + 1)!] · max_{a≤x≤b} |f^(n+1)(x)|
There is nothing we really can do with the derivative term for f; but we can examine the way of defining the nodes {x0, ..., xn} within the interval [a, b]. We ask how these nodes can be chosen so that the maximum of |Ψn(x)| over [a, b] is made as small as possible. This problem has quite an elegant solution, and it is taken up in §4.6. The node points {x0, ..., xn} turn out to be the zeros of a particular polynomial Tn+1(x) of degree n + 1, called a Chebyshev polynomial. These zeros are known explicitly, and with them
max_{a≤x≤b} |Ψn(x)| = 2^(−n) [(b − a)/2]^(n+1)
This turns out to be smaller than for evenly spaced cases; and although this polynomial interpolation does not work for all functions f(x), it works for all differentiable functions and more.
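As an illustrative sketch (the node formula is the standard one for the zeros of Tn+1 mapped to [a, b]; polyfit is used only as a convenient way to form the interpolant, and may warn about conditioning for larger n):

% Interpolate Runge's function at Chebyshev nodes and measure the maximum error.
f = @(x) 1./(1 + x.^2);
a = -5;  b = 5;  n = 10;
j  = 0:n;
xc = (a + b)/2 + (b - a)/2 * cos((2*j + 1)*pi/(2*(n + 1)));  % Chebyshev nodes
p  = polyfit(xc, f(xc), n);
t  = linspace(a, b, 1001);
max_error = max(abs(f(t) - polyval(p, t)))

Repeating this with evenly spaced nodes in place of xc should show the much larger error discussed above.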
ANOTHER ERROR FORMULA

Recall the error formula
f(x) − Pn(x) = [Ψn(x) / (n + 1)!] f^(n+1)(c),   Ψn(x) = (x − x0)(x − x1) ··· (x − xn)
with c between the minimum and maximum of {x0, ..., xn, x}.
A second formula is given by
f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]
To show this is a simple, but somewhat subtle, argument.

Let Pn+1(x) denote the polynomial of degree ≤ n + 1 which interpolates f(x) at the points {x0, ..., xn, xn+1}. Then
Pn+1(x) = Pn(x) + f[x0, ..., xn, xn+1] (x − x0) ··· (x − xn)
Substituting x = xn+1, and using the fact that Pn+1(x) interpolates f(x) at xn+1, we have
f(xn+1) = Pn(xn+1) + f[x0, ..., xn, xn+1] (xn+1 − x0) ··· (xn+1 − xn)
In this formula, the number xn+1 is completely arbitrary, other than being distinct from the points in {x0, ..., xn}. To emphasize this fact, replace xn+1 by x throughout the formula, obtaining
f(x) = Pn(x) + f[x0, ..., xn, x] (x − x0) ··· (x − xn) = Pn(x) + Ψn(x) f[x0, ..., xn, x]
provided x ≠ x0, ..., xn.
The formula is easily seen to be true also for x a node point, since then both sides equal f(x); and provided f(x) is differentiable, the divided difference f[x0, ..., xn, x] remains defined when x is a node point.

This shows
f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]

Compare the two error formulas
f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]
f(x) − Pn(x) = [Ψn(x) / (n + 1)!] f^(n+1)(c)
Then
Ψn(x) f[x0, ..., xn, x] = [Ψn(x) / (n + 1)!] f^(n+1)(c)
f[x0, ..., xn, x] = f^(n+1)(c) / (n + 1)!
for some c between the smallest and largest of the numbers in {x0, ..., xn, x}.

To make this somewhat symmetric in its arguments, let m = n + 1 and write x = xm. Then
f[x0, ..., xm−1, xm] = f^(m)(c) / m!
with c an unknown number between the smallest and largest of the numbers in {x0, ..., xm}. This was given in an earlier lecture where divided differences were introduced.
PIECEWISE POLYNOMIAL INTERPOLATION

Recall the examples of higher degree polynomial interpolation of the function f(x) = 1/(1 + x^2) on [−5, 5]. The interpolants Pn(x) oscillated a great deal, whereas the function f(x) was nonoscillatory. To obtain interpolants that are better behaved, we look at other forms of interpolating functions.

Consider the data


x 0 1 2 2.5 3 3.5 4
y 2.5 0.5 0.5 1.5 1.5 1.125 0

What are methods of interpolating this data, other than using a degree 6 polynomial? Shown in the text are the graphs of the degree 6 polynomial interpolant, along with those of piecewise linear and piecewise quadratic interpolating functions.

Since we only have the data to consider, we would gen-


erally want to use an interpolant that had somewhat
the shape of that of the piecewise linear interpolant.
Figures: the data points; the piecewise linear interpolant; the degree 6 polynomial interpolant; the piecewise quadratic interpolant.
PIECEWISE POLYNOMIAL FUNCTIONS

Consider being given a set of data points (x1, y1), ..., (xn, yn), with
x1 < x2 < ··· < xn
Then the simplest way to connect the points (xj, yj) is by straight line segments. This is called a piecewise linear interpolant of the data {(xj, yj)}. This graph has corners, and often we expect the interpolant to have a smooth graph.

To obtain a somewhat smoother graph, consider using


piecewise quadratic interpolation. Begin by construct-
ing the quadratic polynomial that interpolates

{(x1, y1), (x2, y2), (x3, y3)}


Then construct the quadratic polynomial that inter-
polates
{(x3, y3), (x4, y4), (x5, y5)}
Continue this process of constructing quadratic inter-
polants on the subintervals

[x1, x3], [x3, x5], [x5, x7], ...


If the number of subintervals is even (and therefore
n is odd), then this process comes out fine, with the
last interval being [xn−2, xn]. This was illustrated
on the graph for the preceding data. If, however, n is
even, then the approximation on the last interval must
be handled by some modification of this procedure.
Suggest such!

With piecewise quadratic interpolants, however, there


are corners on the graph of the interpolating func-
tion. With our preceding example, they are at x3 and
x5. How do we avoid this?

Piecewise polynomial interpolants are used in many


applications. We will consider them later, to obtain
numerical integration formulas.
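For example, a piecewise linear interpolant of the data above can be formed and plotted with Matlab's interp1 (a usage sketch):

x  = [0 1 2 2.5 3 3.5 4];
y  = [2.5 0.5 0.5 1.5 1.5 1.125 0];
xx = 0 : 0.01 : 4;
yy = interp1(x, y, xx);       % piecewise linear is the default method
plot(x, y, 'o', xx, yy)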
SMOOTH NON-OSCILLATORY
INTERPOLATION

Let data points (x1, y1), ..., (xn, yn) be given, and let
x1 < x2 < ··· < xn
Consider finding functions s(x) for which the following properties hold:
(1) s(xi) = yi, i = 1, ..., n
(2) s(x), s'(x), s''(x) are continuous on [x1, xn].
Then among such functions s(x) satisfying these properties, find the one which minimizes the integral
∫_{x1}^{xn} [s''(x)]^2 dx
The idea of minimizing the integral is to obtain an interpolating function for which the first derivative does not change rapidly. It turns out there is a unique solution to this problem, and it is called a natural cubic spline function.
SPLINE FUNCTIONS

Let a set of node points {xi} be given, satisfying
a ≤ x1 < x2 < ··· < xn ≤ b
for some numbers a and b. Often we use [a, b] = [x1, xn]. A cubic spline function s(x) on [a, b] with breakpoints or knots {xi} has the following properties:
1. On each of the intervals
[a, x1], [x1, x2], ..., [xn−1, xn], [xn, b]
s(x) is a polynomial of degree ≤ 3.
2. s(x), s'(x), s''(x) are continuous on [a, b].

In the case that we are given data points (x1, y1), ..., (xn, yn), we say s(x) is a cubic interpolating spline function for this data if
3. s(xi) = yi, i = 1, ..., n.
EXAMPLE

Define
(x − α)^3_+ = (x − α)^3 for x ≥ α,   and 0 for x ≤ α
This is a cubic spline function on (−∞, ∞) with the single breakpoint x1 = α.

Combinations of these form more complicated cubic spline functions. For example,
s(x) = 3 (x − 1)^3_+ − 2 (x − 3)^3_+
is a cubic spline function on (−∞, ∞) with the breakpoints x1 = 1, x2 = 3.

Define
s(x) = p3(x) + Σ_{j=1}^{n} aj (x − xj)^3_+
with p3(x) some cubic polynomial. Then s(x) is a cubic spline function on (−∞, ∞) with breakpoints {x1, ..., xn}.
Return to the earlier problem of choosing an interpolating function s(x) to minimize the integral
∫_{x1}^{xn} [s''(x)]^2 dx
There is a unique solution to this problem. The solution s(x) is a cubic interpolating spline function, and moreover, it satisfies
s''(x1) = s''(xn) = 0
Spline functions satisfying these boundary conditions are called natural cubic spline functions, and the solution to our minimization problem is a natural cubic interpolatory spline function. We will show a method to construct this function from the interpolation data.

Motivation for these boundary conditions can be given


by looking at the physics of bending thin beams of
flexible materials to pass thru the given data. To the
left of x1 and to the right of xn, the beam is straight
and therefore the second derivatives are zero at the
transition points x1 and xn.
CONSTRUCTION OF THE
INTERPOLATING SPLINE FUNCTION

To make the presentation more specific, suppose we have data
(x1, y1), (x2, y2), (x3, y3), (x4, y4)
with x1 < x2 < x3 < x4. Then on each of the intervals
[x1, x2], [x2, x3], [x3, x4]
s(x) is a cubic polynomial. Taking the first interval, s(x) is a cubic polynomial and s''(x) is a linear polynomial. Let
Mi = s''(xi),   i = 1, 2, 3, 4
Then on [x1, x2],
s''(x) = [(x2 − x) M1 + (x − x1) M2] / (x2 − x1),   x1 ≤ x ≤ x2
We can find s(x) by integrating twice:
s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / [6 (x2 − x1)] + c1 x + c2
We determine the constants of integration by using
s(x1) = y1,   s(x2) = y2   (*)
Then
s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / [6 (x2 − x1)]
     + [(x2 − x) y1 + (x − x1) y2] / (x2 − x1)
     − [(x2 − x1) / 6] [(x2 − x) M1 + (x − x1) M2]
for x1 ≤ x ≤ x2.

Check that this formula satisfies the given interpolation condition (*)!
We can repeat this on the intervals [x2, x3] and [x3, x4], obtaining similar formulas.

For x2 ≤ x ≤ x3,
s(x) = [(x3 − x)^3 M2 + (x − x2)^3 M3] / [6 (x3 − x2)]
     + [(x3 − x) y2 + (x − x2) y3] / (x3 − x2)
     − [(x3 − x2) / 6] [(x3 − x) M2 + (x − x2) M3]
For x3 ≤ x ≤ x4,
s(x) = [(x4 − x)^3 M3 + (x − x3)^3 M4] / [6 (x4 − x3)]
     + [(x4 − x) y3 + (x − x3) y4] / (x4 − x3)
     − [(x4 − x3) / 6] [(x4 − x) M3 + (x − x3) M4]
We still do not know the values of the second derivatives {M1, M2, M3, M4}. The above formulas guarantee that s(x) and s''(x) are continuous for x1 ≤ x ≤ x4. For example, the formula on [x1, x2] yields
s(x2) = y2,   s''(x2) = M2
The formula on [x2, x3] also yields
s(x2) = y2,   s''(x2) = M2

All that is lacking is to make s'(x) continuous at x2 and x3. Thus we require
s'(x2 + 0) = s'(x2 − 0)
s'(x3 + 0) = s'(x3 − 0)   (**)
This means
lim_{x→x2+} s'(x) = lim_{x→x2−} s'(x)
and similarly for x3.
To simplify the presentation somewhat, I assume in the following that our node points are evenly spaced:
x2 = x1 + h,   x3 = x1 + 2h,   x4 = x1 + 3h
Then our earlier formulas simplify to
s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / (6h)
     + [(x2 − x) y1 + (x − x1) y2] / h
     − (h/6) [(x2 − x) M1 + (x − x1) M2]
for x1 ≤ x ≤ x2, with similar formulas on [x2, x3] and [x3, x4].
Without going thru all of the algebra, the conditions (**) lead to the following pair of equations:
(h/6) M1 + (2h/3) M2 + (h/6) M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6) M2 + (2h/3) M3 + (h/6) M4 = (y4 − y3)/h − (y3 − y2)/h
This gives us two equations in four unknowns. The earlier boundary conditions on s''(x) give us immediately
M1 = M4 = 0
Then we can solve the linear system for M2 and M3.
EXAMPLE

Consider the interpolation data points
x   1   2     3     4
y   1   1/2   1/3   1/4

In this case, h = 1, and the linear system becomes
(2/3) M2 + (1/6) M3 = y3 − 2y2 + y1 = 1/3
(1/6) M2 + (2/3) M3 = y4 − 2y3 + y2 = 1/12
This has the solution
M2 = 1/2,   M3 = 0
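As a check (a small sketch, not from the text), the 2-by-2 system can be solved directly in Matlab:

A = [2/3 1/6; 1/6 2/3];
b = [1/3; 1/12];
M = A \ b        % returns M2 = 0.5 and M3 = 0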
This leads to the spline function formula on each
subinterval.
On [1, 2],
s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / (6h)
     + [(x2 − x) y1 + (x − x1) y2] / h
     − (h/6) [(x2 − x) M1 + (x − x1) M2]
     = [(2 − x)^3 · 0 + (x − 1)^3 · (1/2)] / 6 + [(2 − x) · 1 + (x − 1) · (1/2)] / 1 − (1/6) [(2 − x) · 0 + (x − 1) · (1/2)]
     = (1/12) (x − 1)^3 − (7/12) (x − 1) + 1

Similarly, for 2 ≤ x ≤ 3,
s(x) = −(1/12) (x − 2)^3 + (1/4) (x − 2)^2 − (1/3) (x − 2) + 1/2
and for 3 ≤ x ≤ 4,
s(x) = −(1/12) (x − 4) + 1/4
x   1   2     3     4
y   1   1/2   1/3   1/4

Graph of this example of natural cubic spline interpolation: y = 1/x and y = s(x) on [0, 4].

x   0     1     2     2.5   3     3.5     4
y   2.5   0.5   0.5   1.5   1.5   1.125   0

Graph of the interpolating natural cubic spline function for this data.

ALTERNATIVE BOUNDARY CONDITIONS

Return to the equations
(h/6) M1 + (2h/3) M2 + (h/6) M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6) M2 + (2h/3) M3 + (h/6) M4 = (y4 − y3)/h − (y3 − y2)/h
Sometimes other boundary conditions are imposed on s(x) to help in determining the values of M1 and M4. For example, the data in our numerical example were generated from the function f(x) = 1/x. With it, f''(x) = 2/x^3, and thus we could use
M1 = 2,   M4 = 1/32
With this we are led to a new formula for s(x), one that approximates f(x) = 1/x more closely.
THE CLAMPED SPLINE

In this case, we augment the interpolation conditions
s(xi) = yi,   i = 1, 2, 3, 4
with the boundary conditions
s'(x1) = y1',   s'(x4) = y4'   (#)
The conditions (#) lead to another pair of equations, augmenting the earlier ones. Combined, these equations are
(h/3) M1 + (h/6) M2 = (y2 − y1)/h − y1'
(h/6) M1 + (2h/3) M2 + (h/6) M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6) M2 + (2h/3) M3 + (h/6) M4 = (y4 − y3)/h − (y3 − y2)/h
(h/6) M3 + (h/3) M4 = y4' − (y4 − y3)/h
For our numerical example, it is natural to obtain these derivative values from f'(x) = −1/x^2:
y1' = −1,   y4' = −1/16
When combined with our earlier equations, we have the system
(1/3) M1 + (1/6) M2 = 1/2
(1/6) M1 + (2/3) M2 + (1/6) M3 = 1/3
(1/6) M2 + (2/3) M3 + (1/6) M4 = 1/12
(1/6) M3 + (1/3) M4 = 1/48
This has the solution
[M1, M2, M3, M4] = [173/120, 7/60, 11/120, 1/60]
We can now write the functions s(x) for each of the subintervals [x1, x2], [x2, x3], and [x3, x4]. Recall for x1 ≤ x ≤ x2,
s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / (6h)
     + [(x2 − x) y1 + (x − x1) y2] / h
     − (h/6) [(x2 − x) M1 + (x − x1) M2]
We can substitute in from the data
x   1   2     3     4
y   1   1/2   1/3   1/4
and the solutions {Mi}. Doing so, consider the error f(x) − s(x). As an example,
f(x) = 1/x,   f(3/2) = 2/3,   s(3/2) = .65260
This is quite a decent approximation.
THE GENERAL PROBLEM

Consider the spline interpolation problem with n nodes

(x1, y1) , (x2, y2) , ..., (xn, yn)


and assume the node points {xi} are evenly spaced,

xj = x1 + (j 1) h, j = 1, ..., n
We have that the interpolating spline s(x) on
xj x xj+1 is given by
3 3
xj+1 x Mj + x xj Mj+1
s(x) =
6h
xj+1 x yj + x xj yj+1
+
h
h h i
xj+1 x Mj + x xj Mj+1
6
for j = 1, ..., n 1.
To enforce continuity of s'(x) at the interior node
points x2, ..., x_{n-1}, the second derivatives {M_j} must
satisfy the linear equations

(h/6) M_{j-1} + (2h/3) M_j + (h/6) M_{j+1} = (y_{j-1} - 2y_j + y_{j+1}) / h

for j = 2, ..., n - 1. Writing them out,

(h/6) M1 + (2h/3) M2 + (h/6) M3 = (y1 - 2y2 + y3) / h
(h/6) M2 + (2h/3) M3 + (h/6) M4 = (y2 - 2y3 + y4) / h
        ...
(h/6) M_{n-2} + (2h/3) M_{n-1} + (h/6) M_n = (y_{n-2} - 2y_{n-1} + y_n) / h
This is a system of n - 2 equations in the n unknowns
{M1, ..., Mn}. Two more conditions must be imposed
on s(x) in order to have the number of equations equal
the number of unknowns, namely n. With the added
boundary conditions, this form of linear system can be
solved very eciently.
BOUNDARY CONDITIONS

Natural boundary conditions


s''(x1) = s''(xn) = 0
Spline functions satisfying these conditions are called
natural cubic splines. They arise out of the minimiza-
tion problem stated earlier. But generally they are not
considered as good as some other cubic interpolating
splines.

Clamped boundary conditions We add the condi-


tions
s'(x1) = y1',   s'(xn) = yn'

with y1', yn' given slopes for the endpoints of s(x) on
[x1, xn]. This has many quite good properties when
compared with the natural cubic interpolating spline;
but it does require knowing the derivatives at the end-
points.

Not a knot boundary conditions This is more com-


plicated to explain, but it is the version of cubic spline
interpolation that is implemented in Matlab.
THE NOT A KNOT CONDITIONS

As before, let the interpolation nodes be


(x1, y1) , (x2, y2) , ..., (xn, yn)
We separate these points into two categories. For
constructing the interpolating cubic spline function,
we use the points
(x1, y1), (x3, y3), ..., (x_{n-2}, y_{n-2}), (xn, yn)

thus deleting two of the points. We now have n - 2
points, and the interpolating spline s(x) can be deter-
mined on the intervals

[x1, x3], [x3, x4], ..., [x_{n-3}, x_{n-2}], [x_{n-2}, xn]

This leads to n - 4 equations in the n - 2 unknowns
M1, M3, ..., M_{n-2}, Mn. The two additional boundary
conditions are

s(x2) = y2,   s(x_{n-1}) = y_{n-1}

These translate into two additional equations, and we
obtain a system of n - 2 linear simultaneous equations
in the n - 2 unknowns M1, M3, ..., M_{n-2}, Mn.
For the data

x   0    1    2    2.5   3    3.5   4
y   2.5  0.5  0.5  1.5   1.5  1.125 0

[Figure: the interpolating cubic spline function with not-a-knot boundary conditions.]
MATLAB SPLINE FUNCTION LIBRARY

Given data points

(x1, y1) , (x2, y2) , ..., (xn, yn)


type in arrays containing the x and y coordinates:

x = [x1 x2 ... xn]
y = [y1 y2 ... yn]
plot(x, y, 'o')

The last statement will draw a plot of the data points,
marking them with the letter 'o'. To find the inter-
polating cubic spline function and evaluate it at the
points of another array xx, say

h = (xn - x1)/(10*n);  xx = x1:h:xn;

use

yy = spline(x, y, xx)
plot(x, y, 'o', xx, yy)
The last statement will plot the data points, as be-
fore, and it will plot the interpolating spline s(x) as a
continuous curve.
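As an illustration (ours, not from the text), here is a minimal MATLAB sketch applying these commands to the earlier data x = 1:4, y = 1, 1/2, 1/3, 1/4; the variable names are assumptions.

% Interpolate the sample data with MATLAB's cubic spline (not-a-knot).
x  = [1 2 3 4];
y  = [1 1/2 1/3 1/4];
xx = 1 : 0.01 : 4;          % evaluation points
yy = spline(x, y, xx);      % cubic spline interpolant evaluated at xx
plot(x, y, 'o', xx, yy)     % data points and spline curve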
ERROR IN CUBIC SPLINE INTERPOLATION

Let an interval [a, b] be given, and then define


h = (b - a)/(n - 1),   xj = a + (j - 1) h,   j = 1, ..., n
Suppose we want to approximate a given function
f (x) on the interval [a, b] using cubic spline inter-
polation. Define

yi = f(xi),   i = 1, ..., n
Let sn(x) denote the cubic spline interpolating this
data and satisfying the not a knot boundary con-
ditions. Then it can be shown that for a suitable
constant c,

En ≡ max_{a≤x≤b} |f(x) - sn(x)| ≤ c h^4
The corresponding bound for natural cubic spline in-
terpolation contains only a term of h^2 rather than h^4;
it does not converge to zero as rapidly.
EXAMPLE

Take f (x) = arctan x on [0, 5]. The following ta-


ble gives values of the maximum error En for various
values of n. The values of h are being successively
halved.

n     En        E_{n/2} / En
7     7.09E-3
13    3.24E-4   21.9
25    3.06E-5   10.6
49    1.48E-6   20.7
97    9.04E-8   16.4
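A small MATLAB sketch (ours, not from the text) that reproduces errors of this kind; spline uses the not-a-knot conditions, so the computed En should behave like the table above.

f = @(x) atan(x);                 % test function on [0, 5]
for n = [7 13 25 49 97]
    x  = linspace(0, 5, n);       % evenly spaced nodes
    xx = linspace(0, 5, 5001);    % fine grid for measuring the error
    En = max(abs(f(xx) - spline(x, f(x), xx)));
    fprintf('n = %3d   En = %9.2e\n', n, En)
end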
BEST APPROXIMATION

Given a function f (x) that is continuous on a given


interval [a, b], consider approximating it by some poly-
nomial p(x). To measure the error in p(x) as an ap-
proximation, introduce
E(p) = max_{a≤x≤b} |f(x) - p(x)|
This is called the maximum error or uniform error of
approximation of f (x) by p(x) on [a, b].

With an eye towards efficiency, we want to find the


best possible approximation of a given degree n.
With this in mind, introduce the following:
ρn(f) = min_{deg(p)≤n} E(p)
      = min_{deg(p)≤n} [ max_{a≤x≤b} |f(x) - p(x)| ]

The number ρn(f) will be the smallest possible uni-
form error, or minimax error, when approximating f(x)
by polynomials of degree at most n. If there is a
polynomial giving this smallest error, we denote it by
mn(x); thus E(mn) = ρn(f).
Example. Let f (x) = ex on [1, 1]. In the following
table, we give the values of E(tn), tn(x) the Tay-
lor polynomial of degree n for ex about x = 0, and
E(mn).

Maximum Error in:


n    tn(x)       mn(x)
1    7.18E-1     2.79E-1
2    2.18E-1     4.50E-2
3    5.16E-2     5.53E-3
4    9.95E-3     5.47E-4
5    1.62E-3     4.52E-5
6    2.26E-4     3.21E-6
7    2.79E-5     2.00E-7
8    3.06E-6     1.11E-8
9    3.01E-7     5.52E-10
Consider graphically how we can improve on the Tay-
lor polynomial
t1(x) = 1 + x
as a uniform approximation to ex on the interval [1, 1].

The linear minimax approximation is

m1(x) = 1.2643 + 1.1752x

[Figure: y = e^x with the linear Taylor approximation y = t1(x) and the linear minimax approximation y = m1(x) on [-1, 1].]

[Figure: the error e^x - t3(x) in the cubic Taylor approximation on [-1, 1]; the maximum error is about 0.0516.]

[Figure: the error e^x - m3(x) in the cubic minimax approximation on [-1, 1]; the error oscillates between about -0.00553 and 0.00553.]


Accuracy of the minimax approximation.

ρn(f) ≤ ( [(b - a)/2]^{n+1} / ((n + 1)! 2^n) ) max_{a≤x≤b} |f^{(n+1)}(x)|
This error bound does not always become smaller with
increasing n, but it will give a fairly accurate bound
for many common functions f (x).

Example. Let f (x) = ex for 1 x 1. Then


ρn(e^x) ≤ e / ((n + 1)! 2^n)     (*)

n    Bound (*)    ρn(f)
1    6.80E-1      2.79E-1
2    1.13E-1      4.50E-2
3    1.42E-2      5.53E-3
4    1.42E-3      5.47E-4
5    1.18E-4      4.52E-5
6    8.43E-6      3.21E-6
7    5.27E-7      2.00E-7
CHEBYSHEV POLYNOMIALS

Chebyshev polynomials are used in many parts of nu-


merical analysis, and more generally, in applications
of mathematics. For an integer n 0, define the
function

Tn(x) = cos( n cos^{-1} x ),   -1 ≤ x ≤ 1     (1)

This may not appear to be a polynomial, but we will
show it is a polynomial of degree n. To simplify the
manipulation of (1), we introduce

θ = cos^{-1}(x)   or   x = cos(θ),   0 ≤ θ ≤ π     (2)

Then

Tn(x) = cos(nθ)     (3)
Example.
n = 0:   T0(x) = cos(0·θ) = 1
n = 1:   T1(x) = cos(θ) = x
n = 2:   T2(x) = cos(2θ) = 2 cos^2(θ) - 1 = 2x^2 - 1
[Figure: the Chebyshev polynomials T0(x), T1(x), T2(x) on [-1, 1].]

[Figure: the Chebyshev polynomials T3(x), T4(x) on [-1, 1].]
The triple recursion relation. Recall the trigonomet-
ric addition formulas,

cos(α ± β) = cos(α) cos(β) ∓ sin(α) sin(β)

Let n ≥ 1, and apply these identities to get

Tn+1(x) = cos[(n + 1)θ] = cos(nθ + θ)
        = cos(nθ) cos(θ) - sin(nθ) sin(θ)
Tn-1(x) = cos[(n - 1)θ] = cos(nθ - θ)
        = cos(nθ) cos(θ) + sin(nθ) sin(θ)

Add these two equations, and then use (1) and (3) to
obtain

Tn+1(x) + Tn-1(x) = 2 cos(nθ) cos(θ) = 2x Tn(x)

Tn+1(x) = 2x Tn(x) - Tn-1(x),   n ≥ 1     (4)

This is called the triple recursion relation for the Cheby-
shev polynomials. It is often used in evaluating them,
rather than using the explicit formula (1).
Example. Recall

T0(x) = 1,   T1(x) = x
Tn+1(x) = 2x Tn(x) - Tn-1(x),   n ≥ 1

Let n = 2. Then

T3(x) = 2x T2(x) - T1(x)
      = 2x(2x^2 - 1) - x
      = 4x^3 - 3x

Let n = 3. Then

T4(x) = 2x T3(x) - T2(x)
      = 2x(4x^3 - 3x) - (2x^2 - 1)
      = 8x^4 - 8x^2 + 1
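A small MATLAB sketch (ours) that evaluates Tn(x) by this triple recursion; the function name chebT is an assumed name, not from the text.

function T = chebT(n, x)
% Evaluate the Chebyshev polynomial T_n at the points x
% using the triple recursion T_{k+1} = 2 x T_k - T_{k-1}.
Tm1 = ones(size(x));      % T_0(x)
T   = x;                  % T_1(x)
if n == 0, T = Tm1; return, end
for k = 1 : n-1
    Tnew = 2*x.*T - Tm1;
    Tm1  = T;
    T    = Tnew;
end
end

For example, plotting chebT(4, x) for x = linspace(-1,1) reproduces the T4 graph above.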
The minimum size property. Note that

|Tn(x)| ≤ 1,   -1 ≤ x ≤ 1     (5)

for all n ≥ 0. Also, note that

Tn(x) = 2^{n-1} x^n + lower degree terms,   n ≥ 1     (6)

This can be proven using the triple recursion relation
and mathematical induction.

Introduce a modified version of Tn(x),

T̃n(x) = (1/2^{n-1}) Tn(x) = x^n + lower degree terms     (7)

From (5) and (6),

|T̃n(x)| ≤ 1/2^{n-1},   -1 ≤ x ≤ 1,   n ≥ 1     (8)

Example.

T̃4(x) = (1/8)(8x^4 - 8x^2 + 1) = x^4 - x^2 + 1/8
A polynomial whose highest degree term has a coeffi-
cient of 1 is called a monic polynomial. Formula (8)
says the monic polynomial T̃n(x) has size 1/2^{n-1} on
-1 ≤ x ≤ 1, and this becomes smaller as the degree
n increases. In comparison,

max_{-1≤x≤1} |x^n| = 1

Thus x^n is a monic polynomial whose size does not
change with increasing n.

Theorem. Let n ≥ 1 be an integer, and consider all
possible monic polynomials of degree n. Then the
degree n monic polynomial with the smallest maxi-
mum on [-1, 1] is the modified Chebyshev polynomial
T̃n(x), and its maximum value on [-1, 1] is 1/2^{n-1}.

This result is used in devising applications of Cheby-


shev polynomials. We apply it to obtain an improved
interpolation scheme.
A NEAR-MINIMAX APPROXIMATION METHOD

Let f(x) be continuous on [a, b] = [-1, 1]. Consider
approximating f by an interpolatory polynomial of de-
gree at most n = 3. Let x0, x1, x2, x3 be interpo-
lation node points in [-1, 1]; let c3(x) be of degree
≤ 3 and interpolate f(x) at {x0, x1, x2, x3}. The in-
terpolation error is

f(x) - c3(x) = (ω(x)/4!) f^{(4)}(c_x),   -1 ≤ x ≤ 1     (1)

ω(x) = (x - x0)(x - x1)(x - x2)(x - x3)     (2)

with c_x in [-1, 1]. We want to choose the nodes
{x0, x1, x2, x3} so as to minimize the maximum value
of |f(x) - c3(x)| on [-1, 1].

From (1), the only general quantity, independent of f ,


is ω(x). Thus we choose {x0, x1, x2, x3} to minimize

max_{-1≤x≤1} |ω(x)|     (3)
Expand to get

ω(x) = x^4 + lower degree terms

This is a monic polynomial of degree 4. From the
theorem in the preceding section, the smallest possible
value for (3) is obtained with

ω(x) = T̃4(x) = T4(x)/2^3 = (1/8)(8x^4 - 8x^2 + 1)     (4)

and the smallest value of (3) is 1/2^3 in this case. The
equation (4) defines implicitly the nodes {x0, x1, x2, x3}:
they are the roots of T4(x).

In our case this means solving

T4(x) = cos(4θ) = 0,   x = cos(θ)

4θ = ±π/2, ±3π/2, ±5π/2, ±7π/2, ...
 θ = ±π/8, ±3π/8, ±5π/8, ±7π/8, ...
 x = cos(π/8), cos(3π/8), cos(5π/8), ...     (5)

using cos(-θ) = cos(θ),

x = cos(π/8), cos(3π/8), cos(5π/8), cos(7π/8), ...

The first four values are distinct; the following ones
are repetitive. For example,

cos(9π/8) = cos(7π/8)

The first four values are

{x0, x1, x2, x3} = {±0.382683, ±0.923880}     (6)


Example. Let f(x) = e^x on [-1, 1]. Use these nodes
to produce the interpolating polynomial c3(x) of de-
gree ≤ 3. From the interpolation error formula and the
bound of 1/2^3 for |ω(x)| on [-1, 1], we have

max_{-1≤x≤1} |f(x) - c3(x)| ≤ ( (1/2^3)/4! ) max_{-1≤x≤1} e^x
                            = e/192 ≈ 0.014158

By direct calculation,

max_{-1≤x≤1} |e^x - c3(x)| ≈ 0.00666
Interpolation Data: f(x) = e^x

i    xi          f(xi)       f[x0, ..., xi]
0     0.923880   2.5190442   2.5190442
1     0.382683   1.4662138   1.9453769
2    -0.382683   0.6820288   0.7047420
3    -0.923880   0.3969760   0.1751757

[Figure: the error e^x - c3(x) on [-1, 1]; it varies between about -0.00624 and 0.00666.]

For comparison, E(t3) ≈ 0.0142 and ρ3(e^x) ≈ 0.00553.
THE GENERAL CASE

Consider interpolating f(x) on [-1, 1] by a polyno-
mial of degree ≤ n, with the interpolation nodes
{x0, ..., xn} in [-1, 1]. Denote the interpolation poly-
nomial by cn(x). The interpolation error on [-1, 1] is
given by
f(x) - cn(x) = (ω(x)/(n + 1)!) f^{(n+1)}(c_x)     (7)

ω(x) = (x - x0) ··· (x - xn)

with c_x an unknown point in [-1, 1]. In order to
minimize the interpolation error, we seek to minimize

max_{-1≤x≤1} |ω(x)|     (8)

The polynomial being minimized is monic of degree
n + 1,

ω(x) = x^{n+1} + lower degree terms


From the theorem of the preceding section, this min-
imum is attained by the monic polynomial
T̃n+1(x) = (1/2^n) Tn+1(x)

Thus the interpolation nodes are the zeros of Tn+1(x);
and by the procedure that led to (5), they are given
by

xj = cos( (2j + 1)π / (2n + 2) ),   j = 0, 1, ..., n     (9)
The near-minimax approximation cn(x) of degree n is
obtained by interpolating to f (x) at these n+1 nodes
on [-1, 1].

The polynomial cn(x) is sometimes called a Cheby-


shev approximation.
Example. Let f(x) = e^x. The following table contains
the maximum errors in cn(x) on [1, 1] for varying
n. For comparison, we also include the corresponding
minimax errors. These figures illustrate that for prac-
tical purposes, cn(x) is a satisfactory replacement for
the minimax approximation mn(x).
n    max |e^x - cn(x)|    ρn(e^x)
1    3.72E-1              2.79E-1
2    5.65E-2              4.50E-2
3    6.66E-3              5.53E-3
4    6.40E-4              5.47E-4
5    5.18E-5              4.52E-5
6    3.80E-6              3.21E-6
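A brief MATLAB sketch (ours, not from the text) that builds cn(x) by interpolating e^x at the Chebyshev nodes (9) and measures the maximum error; polyfit with exactly n+1 nodes returns the interpolating polynomial.

f = @(x) exp(x);
for n = 1 : 6
    j   = 0 : n;
    xj  = cos((2*j + 1)*pi / (2*n + 2));   % Chebyshev nodes on [-1, 1]
    p   = polyfit(xj, f(xj), n);           % interpolating polynomial c_n
    xx  = linspace(-1, 1, 2001);
    err = max(abs(f(xx) - polyval(p, xx)));
    fprintf('n = %d   max error = %8.2e\n', n, err)
end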
THEORETICAL INTERPOLATION ERROR

For the error


f(x) - cn(x) = (ω(x)/(n + 1)!) f^{(n+1)}(c_x)

we have

max_{-1≤x≤1} |f(x) - cn(x)| ≤ ( max_{-1≤x≤1} |ω(x)| / (n + 1)! ) max_{-1≤t≤1} |f^{(n+1)}(t)|

From the theorem of the preceding section,

max_{-1≤x≤1} |T̃n+1(x)| = max_{-1≤x≤1} |ω(x)| = 1/2^n

in this case. Thus

max_{-1≤x≤1} |f(x) - cn(x)| ≤ ( 1/((n + 1)! 2^n) ) max_{-1≤t≤1} |f^{(n+1)}(t)|
OTHER INTERVALS

Consider approximating f (x) on the finite interval


[a, b]. Introduce the linear change of variables
x = (1/2) [(1 - t) a + (1 + t) b]     (10)

t = (2/(b - a)) ( x - (b + a)/2 )     (11)

Introduce

F(t) = f( (1/2) [(1 - t) a + (1 + t) b] ),   -1 ≤ t ≤ 1
The function F (t) on [1, 1] is equivalent to f (x) on
[a, b], and we can move between them via (10)-(11).
We can now proceed to approximate f (x) on [a, b] by
instead approximating F (t) on [1, 1].

Example. Approximating f(x) = cos x on [0, π/2] is
equivalent to approximating

F(t) = cos( π(1 + t)/4 ),   -1 ≤ t ≤ 1
LEAST SQUARES APPROXIMATION

Another approach to approximating a function f (x)


on an interval a ≤ x ≤ b is to seek an approximation
p(x) with a small average error over the interval of
approximation. A convenient definition of the average
error of the approximation is given by
" Z b #1
1 2
E(p; f ) [f (x) p(x)]2 dx (1)
ba a
This is also called the root-mean-square-error (de-
noted subsequently by RMSE ) in the approximation
of f (x) by p(x). Note first that choosing p(x) to
minimize E(p; f ) is equivalent to minimizing
Z b
[f (x) p(x)]2 dx
a
thus dispensing with the square root and multiplying
fraction (although the minimums are generally dier-
ent). The minimizing of (1) is called the least squares
approximation problem.
Example. Let f(x) = e^x, and let p(x) = α0 + α1 x, with α0,
α1 unknown. Approximate f(x) over [-1, 1]. Choose
α0, α1 to minimize

g(α0, α1) ≡ ∫_{-1}^{1} [e^x - α0 - α1 x]^2 dx     (2)

g(α0, α1) = ∫_{-1}^{1} [ e^{2x} + α0^2 + α1^2 x^2 - 2 α0 e^x
                       - 2 α1 x e^x + 2 α0 α1 x ] dx

Integrating,

g(α0, α1) = c1 α0^2 + c2 α1^2 + c3 α0 α1 + c4 α0 + c5 α1 + c6

with constants {c1, ..., c6}, e.g.

c1 = 2,   c6 = (e^2 - e^{-2})/2

g is a quadratic polynomial in the two variables α0,
α1. To find its minimum, solve the system

∂g/∂α0 = 0,   ∂g/∂α1 = 0

It is simpler to return to (2) to differentiate, obtaining

∫_{-1}^{1} 2 [e^x - α0 - α1 x] (-1) dx = 0

∫_{-1}^{1} 2 [e^x - α0 - α1 x] (-x) dx = 0

This simplifies to

2 α0 = ∫_{-1}^{1} e^x dx = e - e^{-1}

(2/3) α1 = ∫_{-1}^{1} x e^x dx = 2 e^{-1}

α0 = (e - e^{-1})/2 ≈ 1.1752
α1 = 3 e^{-1} ≈ 1.1036
Using these values for α0 and α1, we denote the re-
sulting linear approximation by

ℓ1(x) = α0 + α1 x

It is called the best linear approximation to e^x in the
sense of least squares. For the error,

max_{-1≤x≤1} |e^x - ℓ1(x)| ≈ 0.439
Errors in linear approximations of ex:

Approximation          Max Error    RMSE
Taylor t1(x)           0.718        0.246
Least squares ℓ1(x)    0.439        0.162
Chebyshev c1(x)        0.372        0.184
Minimax m1(x)          0.279        0.190

[Figure: y = e^x and the linear least squares approximation y = ℓ1(x) on [-1, 1].]
THE GENERAL CASE

Approximate f(x) on [a, b], and let n ≥ 0. Seek p(x)
to minimize the RMSE. Write

p(x) = α0 + α1 x + ··· + αn x^n

g(α0, α1, ..., αn) ≡ ∫_a^b [ f(x) - α0 - α1 x - ··· - αn x^n ]^2 dx

Find coefficients α0, α1, ..., αn to minimize this in-
tegral. The integral g(α0, α1, ..., αn) is a quadratic
polynomial in the n + 1 variables α0, α1, ..., αn.

To minimize g(α0, α1, ..., αn), invoke the conditions

∂g/∂αi = 0,   i = 0, 1, ..., n

This yields a set of n + 1 equations that must be satis-
fied by a minimizing set α0, α1, ..., αn for g. Manip-
ulating this set of conditions leads to a simultaneous
linear system.

To better understand the form of the linear system,
consider the special case of [a, b] = [0, 1]. Differenti-
ating g with respect to each αi, we obtain

∫_0^1 2 [f(x) - α0 - ··· - αn x^n] (-1) dx = 0
∫_0^1 2 [f(x) - α0 - ··· - αn x^n] (-x) dx = 0
    ...
∫_0^1 2 [f(x) - α0 - ··· - αn x^n] (-x^n) dx = 0

Then the linear system is

Σ_{j=0}^{n} αj / (i + j + 1) = ∫_0^1 x^i f(x) dx,   i = 0, 1, ..., n

We will study the solution of simultaneous linear sys-


tems in Chapter 6. There we will see that this linear
system is ill-conditioned and is difficult to solve ac-
curately, even for moderately sized values of n such as
n = 5. As a consequence, this is not a good approach
to solving for a minimizer of g(α0, α1, ..., αn).
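The coefficient matrix of this linear system is the Hilbert matrix, and its conditioning can be checked directly; a small MATLAB sketch (ours):

% Condition numbers of the Hilbert matrix H(i,j) = 1/(i+j-1).
% For degree n on [0,1] the coefficient matrix is hilb(n+1).
for n = [3 5 8 11]
    fprintf('n = %2d   cond = %9.2e\n', n, cond(hilb(n + 1)))
end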
LEGENDRE POLYNOMIALS

Define the Legendre polynomials as follows.

P0(x) = 1

Pn(x) = (1/(n! 2^n)) (d^n/dx^n) [ (x^2 - 1)^n ],   n = 1, 2, ...

For example,

P1(x) = x
P2(x) = (1/2)(3x^2 - 1)
P3(x) = (1/2)(5x^3 - 3x)
P4(x) = (1/8)(35x^4 - 30x^2 + 3)
The Legendre polynomials have many special proper-
ties, and they are widely used in numerical analysis
and applied mathematics.
[Figure: the Legendre polynomials P1(x), P2(x), P3(x), P4(x) of degrees 1, 2, 3, 4 on [-1, 1].]


PROPERTIES

Introduce the special notation


(f, g) = ∫_a^b f(x) g(x) dx
for general functions f (x) and g(x).

Degree and normalization:

deg Pn = n,   Pn(1) = 1,   n ≥ 0

Triple recursion relation: For n ≥ 1,


Pn+1(x) = ((2n + 1)/(n + 1)) x Pn(x) - (n/(n + 1)) Pn-1(x)

Orthogonality and size:




(Pi, Pj) = 0,            i ≠ j
(Pj, Pj) = 2/(2j + 1)
Zeroes:
All zeroes of Pn(x) are located in [-1, 1];
all zeroes are simple roots of Pn(x).

Basis: Every polynomial p(x) of degree ≤ n can
be written in the form

p(x) = Σ_{j=0}^{n} βj Pj(x)

with the choice of β0, β1, ..., βn uniquely deter-
mined from p(x):

βj = (p, Pj) / (Pj, Pj),   j = 0, 1, ..., n
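A short MATLAB sketch (ours) evaluating Pn(x) with the triple recursion above; legP is an assumed name. (MATLAB's built-in legendre computes associated Legendre functions, so a hand-written recursion is the simpler route here.)

function P = legP(n, x)
% Evaluate the Legendre polynomial P_n at the points x using
% (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}.
Pm1 = ones(size(x));      % P_0
P   = x;                  % P_1
if n == 0, P = Pm1; return, end
for k = 1 : n-1
    Pnew = ((2*k + 1)*x.*P - k*Pm1) / (k + 1);
    Pm1  = P;
    P    = Pnew;
end
end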
FINDING THE LEAST SQUARES
APPROXIMATION

We solve the least squares approximation problem on


only the interval [-1, 1]. Approximation problems on
other intervals [a, b] can be accomplished using a lin-
ear change of variable.

We seek to find a polynomial p(x) of degree ≤ n that


minimizes

∫_a^b [f(x) - p(x)]^2 dx

This is equivalent to minimizing

(f - p, f - p)     (3)

We begin by writing p(x) in the form

p(x) = Σ_{j=0}^{n} βj Pj(x)
Substitute into (3), obtaining

g̃(β0, β1, ..., βn) ≡ (f - p, f - p)
   = ( f - Σ_{j=0}^{n} βj Pj ,  f - Σ_{i=0}^{n} βi Pi )

Expand this into the following:

g̃ = (f, f) - Σ_{j=0}^{n} (f, Pj)^2 / (Pj, Pj)
      + Σ_{j=0}^{n} (Pj, Pj) [ βj - (f, Pj)/(Pj, Pj) ]^2

Looking at this carefully, we see that it is smallest
when

βj = (f, Pj) / (Pj, Pj),   j = 0, 1, ..., n

The minimum for this choice of coefficients is

g̃ = (f, f) - Σ_{j=0}^{n} (f, Pj)^2 / (Pj, Pj)

We call

ℓn(x) = Σ_{j=0}^{n} ( (f, Pj)/(Pj, Pj) ) Pj(x)     (4)

the least squares approximation of degree ≤ n to f(x)
on [-1, 1].

If the coefficient βn = 0, then the actual degree of ℓn(x) is less than n.


Example. Approximate f(x) = e^x on [-1, 1]. We use
(4) with n = 3:

ℓ3(x) = Σ_{j=0}^{3} βj Pj(x),   βj = (f, Pj)/(Pj, Pj)     (5)

The inner products (f, Pj) are as follows.

j         0         1         2         3
(f, Pj)   2.35040   0.73576   0.14313   0.02013

Using (5) and simplifying,

ℓ3(x) = .996294 + .997955 x + .536722 x^2 + .176139 x^3

The error in various cubic approximations:

Approximation          Max Error    RMSE
Taylor t3(x)           .0516        .0145
Least squares ℓ3(x)    .0112        .00334
Chebyshev c3(x)        .00666       .00384
Minimax m3(x)          .00553       .00388

[Figure: the error e^x - ℓ3(x) in the cubic least squares approximation on [-1, 1]; it varies between about -0.00460 and 0.0112.]


NUMERICAL INTEGRATION

How do you evaluate

I = ∫_a^b f(x) dx ?

From calculus, if F(x) is an antiderivative of f(x),
then

I = ∫_a^b f(x) dx = F(x)|_a^b = F(b) - F(a)
However, in practice most integrals cannot be evalu-
ated by this means. And even when this can work, an
approximate numerical method may be much simpler
and easier to use. For example, the integrand in
∫_0^1 dx/(1 + x^5)
has an extremely complicated antiderivative; and it is
easier to evaluate the integral by approximate means.
Try evaluating this integral with Maple or Mathemat-
ica.
NUMERICAL INTEGRATION
A GENERAL FRAMEWORK

Returning to a lesson used earlier with rootfinding:


If you cannot solve a problem, then replace it with a
near-by problem that you can solve.
In our case, we want to evaluate

I = ∫_a^b f(x) dx

To do so, many of the numerical schemes are based
on choosing approximations of f(x). Calling one such
f̃(x), use

I ≈ ∫_a^b f̃(x) dx ≡ Ĩ

What is the error?

E = I - Ĩ = ∫_a^b [f(x) - f̃(x)] dx

|E| ≤ ∫_a^b |f(x) - f̃(x)| dx
    ≤ (b - a) ||f - f̃||_∞

||f - f̃||_∞ ≡ max_{a≤x≤b} |f(x) - f̃(x)|

We also want to choose the approximations f̃(x) of a
form we can integrate directly and easily. Examples
are polynomials, trig functions, piecewise polynomials,
and others.

If we use polynomial approximations, then how do we
choose them? At this point, we have two choices:

1. Taylor polynomials approximating f (x)

2. Interpolatory polynomials approximating f (x)


EXAMPLE

Consider evaluating

I = ∫_0^1 e^{x^2} dx

Use

e^t = 1 + t + (1/2!) t^2 + ··· + (1/n!) t^n + (1/(n+1)!) t^{n+1} e^{c_t}

e^{x^2} = 1 + x^2 + (1/2!) x^4 + ··· + (1/n!) x^{2n} + (1/(n+1)!) x^{2n+2} e^{c_x}

with 0 ≤ c_x ≤ x^2. Then

I = ∫_0^1 [ 1 + x^2 + (1/2!) x^4 + ··· + (1/n!) x^{2n} ] dx
    + ∫_0^1 (1/(n+1)!) x^{2n+2} e^{c_x} dx

Taking n = 3, we have

I = 1 + 1/3 + 1/10 + 1/42 + E ≈ 1.4571 + E

0 < E ≤ (e/4!) ∫_0^1 x^8 dx = e/216 ≈ .0126
USING INTERPOLATORY POLYNOMIALS

In spite of the simplicity of the above example, it is


generally more difficult to do numerical integration by
constructing Taylor polynomial approximations than
by constructing polynomial interpolants. We therefore
construct the function f̃ in

∫_a^b f(x) dx ≈ ∫_a^b f̃(x) dx
by means of interpolation.

Initially, we consider only the case in which the in-


terpolation is based on interpolation at evenly spaced
node points.
LINEAR INTERPOLATION

The linear interpolant to f (x), interpolating at a and


b, is given by

P1(x) = [ (b - x) f(a) + (x - a) f(b) ] / (b - a)

Using this linear interpolant we obtain the approximation

∫_a^b f(x) dx ≈ ∫_a^b P1(x) dx
             = (1/2)(b - a) [f(a) + f(b)] ≡ T1(f)

The rule

∫_a^b f(x) dx ≈ T1(f)

is called the trapezoidal rule.
[Figure: y = f(x) and the linear interpolant y = P1(x) on [a, b], illustrating I ≈ T1(f).]

Example.

∫_0^{π/2} sin x dx ≈ (π/4) [sin 0 + sin(π/2)]
                   = π/4 ≈ .785398

Error ≈ .215
HOW TO OBTAIN GREATER ACCURACY?

How do we improve our estimate of the integral


I = ∫_a^b f(x) dx
One direction is to increase the degree of the approxi-
mation, moving next to a quadratic interpolating poly-
nomial for f (x). We first look at an alternative.

Instead of using the trapezoidal rule on the original


interval [a, b], apply it to integrals of f (x) over smaller
subintervals. For example:

I = ∫_a^c f(x) dx + ∫_c^b f(x) dx,   c = (a + b)/2

  ≈ ((c - a)/2) [f(a) + f(c)] + ((b - c)/2) [f(c) + f(b)]

  = (h/2) [f(a) + 2f(c) + f(b)] ≡ T2(f),   h = (b - a)/2

Example.

∫_0^{π/2} sin x dx ≈ (π/8) [sin 0 + 2 sin(π/4) + sin(π/2)]
                   ≈ .948059

Error ≈ .0519
[Figure: y = f(x) and the piecewise linear interpolant on a = x0 < x1 < x2 < x3 = b, illustrating I ≈ T3(f).]
THE TRAPEZOIDAL RULE

We can continue as above by dividing [a, b] into even


smaller subintervals and applying

∫_α^β f(x) dx ≈ ((β - α)/2) [f(α) + f(β)]     (*)

on each of the smaller subintervals. Begin by intro-
ducing a positive integer n ≥ 1,

h = (b - a)/n,   xj = a + j h,   j = 0, 1, ..., n

Then

I = ∫_{x0}^{xn} f(x) dx
  = ∫_{x0}^{x1} f(x) dx + ∫_{x1}^{x2} f(x) dx + ··· + ∫_{x_{n-1}}^{xn} f(x) dx

Use [α, β] = [x0, x1], [x1, x2], ..., [x_{n-1}, xn], for each
of which the subinterval has length h.
Then applying (*) we have

I ≈ (h/2) [f(x0) + f(x1)] + (h/2) [f(x1) + f(x2)]
    + ··· + (h/2) [f(x_{n-2}) + f(x_{n-1})] + (h/2) [f(x_{n-1}) + f(xn)]

Simplifying,

I ≈ h [ (1/2) f(a) + f(x1) + ··· + f(x_{n-1}) + (1/2) f(b) ]
  ≡ Tn(f)
This is called the composite trapezoidal rule, or
more simply, the trapezoidal rule.
Example. Again integrate sin x over [0, π/2]. Then we
have
n     Tn(f)        Error     Ratio
1     .785398163   2.15E-1
2     .948059449   5.19E-2   4.13
4     .987115801   1.29E-2   4.03
8     .996785172   3.21E-3   4.01
16    .999196680   8.03E-4   4.00
32    .999799194   2.01E-4   4.00
64    .999949800   5.02E-5   4.00
128   .999987450   1.26E-5   4.00
256   .999996863   3.14E-6   4.00

Note that the errors are decreasing by a constant fac-


tor of 4. Why do we always double n?
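A minimal MATLAB sketch (ours) of the composite trapezoidal rule that reproduces this table:

f = @(x) sin(x);  a = 0;  b = pi/2;
for n = 2.^(0:8)
    h  = (b - a)/n;
    x  = a + (0:n)*h;
    Tn = h*( f(a)/2 + sum(f(x(2:n))) + f(b)/2 );   % composite trapezoidal rule
    fprintf('n = %3d   Tn = %.9f   error = %8.2e\n', n, Tn, 1 - Tn)
end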
USING QUADRATIC INTERPOLATION

We want to approximate I = ∫_a^b f(x) dx using quadratic
interpolation of f(x). Interpolate f(x) at the points
{a, c, b}, with c = (1/2)(a + b). Also let h = (1/2)(b - a).
The quadratic interpolating polynomial is given by

P2(x) = ((x - c)(x - b)/(2h^2)) f(a) + ((x - a)(x - b)/(-h^2)) f(c)
        + ((x - a)(x - c)/(2h^2)) f(b)

Replacing f(x) by P2(x), we obtain the approximation

∫_a^b f(x) dx ≈ ∫_a^b P2(x) dx
             = (h/3) [f(a) + 4f(c) + f(b)] ≡ S2(f)

This is called Simpson's rule.
[Figure: y = f(x) and its quadratic interpolant at a, (a+b)/2, b, illustrating I ≈ S2(f).]

Example.

∫_0^{π/2} sin x dx ≈ (π/12) [sin 0 + 4 sin(π/4) + sin(π/2)]
                   ≈ 1.00227987749221

Error ≈ .00228
SIMPSONS RULE

As with the trapezoidal rule, we can apply Simpson's


rule on smaller subdivisions in order to obtain better
accuracy in approximating
I = ∫_a^b f(x) dx

Again, Simpson's rule is given by

∫_α^β f(x) dx ≈ (h/3) [f(α) + 4f(γ) + f(β)],   γ = (α + β)/2

with h = (1/2)(β - α).

Let n be a positive even integer, and

h = (b - a)/n,   xj = a + j h,   j = 0, 1, ..., n

Then write

I = ∫_{x0}^{xn} f(x) dx
  = ∫_{x0}^{x2} f(x) dx + ∫_{x2}^{x4} f(x) dx + ··· + ∫_{x_{n-2}}^{xn} f(x) dx

Apply the basic rule to each of these subintegrals, with

[α, β] = [x0, x2], [x2, x4], ..., [x_{n-2}, xn]

In all cases, (1/2)(β - α) = h. Then

I ≈ (h/3) [f(x0) + 4f(x1) + f(x2)]
    + (h/3) [f(x2) + 4f(x3) + f(x4)]
    + ···
    + (h/3) [f(x_{n-4}) + 4f(x_{n-3}) + f(x_{n-2})]
    + (h/3) [f(x_{n-2}) + 4f(x_{n-1}) + f(xn)]

This can be simplified to

∫_a^b f(x) dx ≈ Sn(f) ≡ (h/3) [f(x0) + 4f(x1)
    + 2f(x2) + 4f(x3) + 2f(x4)
    + ··· + 2f(x_{n-2}) + 4f(x_{n-1}) + f(xn)]
This is called the composite Simpson's rule or, more
simply, Simpson's rule.
EXAMPLE

Approximate ∫_0^{π/2} sin x dx. The Simpson rule results
are as follows.

n     Sn(f)              Error      Ratio
2     1.00227987749221   2.28E-3
4     1.00013458497419   1.35E-4    16.94
8     1.00000829552397   8.30E-6    16.22
16    1.00000051668471   5.17E-7    16.06
32    1.00000003226500   3.23E-8    16.01
64    1.00000000201613   2.02E-9    16.00
128   1.00000000012600   1.26E-10   16.00
256   1.00000000000788   7.88E-12   16.00
512   1.00000000000049   4.92E-13   15.99
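A matching MATLAB sketch (ours) for the composite Simpson's rule:

f = @(x) sin(x);  a = 0;  b = pi/2;
for n = 2.^(1:9)
    h  = (b - a)/n;
    x  = a + (0:n)*h;
    w  = [1, repmat([4 2], 1, n/2 - 1), 4, 1];    % Simpson weights 1 4 2 4 ... 2 4 1
    Sn = (h/3) * sum(w .* f(x));
    fprintf('n = %3d   Sn = %.14f   error = %9.2e\n', n, Sn, 1 - Sn)
end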

Note that the ratios of successive errors have con-


verged to 16. Why? Also compare this table with
that for the trapezoidal rule. For example,
I - T4 = 1.29E-2
I - S4 = 1.35E-4

There is a great deal to be learned by doing a number
of examples. For example, are the rates of conver-
gence for our numerical examples typical of the trape-
zoidal and Simpson rules? Several of these are given
on pages 188 (for the trapezoidal rule) and 192 (for Simp-
son's rule). They are for the integrals

I^(1) = ∫_0^1 e^{-x^2} dx ≈ .74682413281234

I^(2) = ∫_0^4 dx/(1 + x^2) = arctan 4

I^(3) = ∫_0^{2π} dx/(2 + cos x) = 2π/sqrt(3)
Look carefully at those tables.
TRAPEZOIDAL METHOD
ERROR FORMULA

Theorem. Let f(x) have two continuous derivatives on
the interval a ≤ x ≤ b. Then

E_n^T(f) ≡ ∫_a^b f(x) dx - Tn(f) = -(h^2 (b - a)/12) f''(cn)

for some cn in the interval [a, b].

Later I will say something about the proof of this re-


sult, as it leads to some other useful formulas for the
error.

The above formula says that the error decreases in


a manner that is roughly proportional to h2. Thus
doubling n (and halving h) should cause the error to
decrease by a factor of approximately 4. This is what
we observed with a past example from the preceding
section.
Example. Consider evaluating

I = ∫_0^2 dx/(1 + x^2)

using the trapezoidal method Tn(f). How large should
n be chosen in order to ensure that

|E_n^T(f)| ≤ 5 × 10^{-6} ?

We begin by calculating the derivatives:

f'(x) = -2x/(1 + x^2)^2,   f''(x) = (-2 + 6x^2)/(1 + x^2)^3

From a graph of f''(x),

max_{0≤x≤2} |f''(x)| = 2

Recall that b - a = 2. Therefore,

E_n^T(f) = -(h^2 (b - a)/12) f''(cn)

|E_n^T(f)| ≤ (h^2 · 2/12) · 2 = h^2/3

We bound |f''(cn)| since we do not know cn, and
therefore we must assume the worst possible case, that
which makes the error formula largest. That is what
has been done above.

When do we have

|E_n^T(f)| ≤ 5 × 10^{-6} ?     (1)

To ensure this, we choose h so small that

h^2/3 ≤ 5 × 10^{-6}

This is equivalent to choosing h and n to satisfy

h ≤ .003873
n = 2/h ≥ 516.4

Thus n ≥ 517 will imply (1).
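A quick MATLAB check (ours) of this bound, comparing T_517 with the exact value arctan 2:

f = @(x) 1./(1 + x.^2);
n = 517;  a = 0;  b = 2;  h = (b - a)/n;
x = a + (0:n)*h;
Tn = h*( f(a)/2 + sum(f(x(2:n))) + f(b)/2 );     % composite trapezoidal rule
fprintf('error = %.2e  (bound 5.0e-06)\n', atan(2) - Tn)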
DERIVING THE ERROR FORMULA

There are two stages in deriving the error:


(1) Obtain the error formula for the case of a single
subinterval (n = 1);
(2) Use this to obtain the general error formula given
earlier.

For the trapezoidal method with only a single subin-


terval, we have

∫_α^{α+h} f(x) dx - (h/2) [f(α) + f(α + h)] = -(h^3/12) f''(c)

for some c in the interval [α, α + h].

A sketch of the derivation of this error formula is given


in the problems.
Recall that the general trapezoidal rule Tn(f ) was ob-
tained by applying the simple trapezoidal rule to a sub-
division of the original interval of integration. Recall
defining and writing
h = (b - a)/n,   xj = a + j h,   j = 0, 1, ..., n

I = ∫_{x0}^{xn} f(x) dx
  = ∫_{x0}^{x1} f(x) dx + ∫_{x1}^{x2} f(x) dx + ··· + ∫_{x_{n-1}}^{xn} f(x) dx

I ≈ (h/2) [f(x0) + f(x1)] + (h/2) [f(x1) + f(x2)]
    + ··· + (h/2) [f(x_{n-2}) + f(x_{n-1})] + (h/2) [f(x_{n-1}) + f(xn)]

Then the error

E_n^T(f) ≡ ∫_a^b f(x) dx - Tn(f)

can be analyzed by adding together the errors over the
subintervals [x0, x1], [x1, x2], ..., [x_{n-1}, xn]. Recall

∫_α^{α+h} f(x) dx - (h/2) [f(α) + f(α + h)] = -(h^3/12) f''(c)

Then on [x_{j-1}, xj],

∫_{x_{j-1}}^{xj} f(x) dx - (h/2) [f(x_{j-1}) + f(xj)] = -(h^3/12) f''(γj)

with x_{j-1} ≤ γj ≤ xj, but otherwise γj unknown.

Then combining these errors, we obtain

E_n^T(f) = -(h^3/12) f''(γ1) - ··· - (h^3/12) f''(γn)
This formula can be further simplified, and we will do
so in two ways.
Rewrite this error as

E_n^T(f) = -(h^3 n/12) [ (f''(γ1) + ··· + f''(γn)) / n ]

Denote the quantity inside the brackets by σn. This
number satisfies

min_{a≤x≤b} f''(x) ≤ σn ≤ max_{a≤x≤b} f''(x)

Since f''(x) is a continuous function (by original as-
sumption), there must be some number cn in [a, b] for
which

f''(cn) = σn

Recall also that hn = b - a. Then

E_n^T(f) = -(h^3 n/12) [ (f''(γ1) + ··· + f''(γn)) / n ]
         = -(h^2 (b - a)/12) f''(cn)

This is the error formula given on the first slide.
AN ERROR ESTIMATE

We now obtain a way to estimate the error EnT (f ).


Return to the formula

E_n^T(f) = -(h^3/12) f''(γ1) - ··· - (h^3/12) f''(γn)

and rewrite it as

E_n^T(f) = -(h^2/12) [ f''(γ1) h + ··· + f''(γn) h ]

The quantity

f''(γ1) h + ··· + f''(γn) h

is a Riemann sum for the integral

∫_a^b f''(x) dx = f'(b) - f'(a)

By this we mean

lim_{n→∞} [ f''(γ1) h + ··· + f''(γn) h ] = ∫_a^b f''(x) dx

Thus

f''(γ1) h + ··· + f''(γn) h ≈ f'(b) - f'(a)

for larger values of n. Combining this with the earlier
error formula

E_n^T(f) = -(h^2/12) [ f''(γ1) h + ··· + f''(γn) h ]

we have

E_n^T(f) ≈ -(h^2/12) [ f'(b) - f'(a) ] ≡ Ẽ_n^T(f)

This is a computable estimate of the error in the nu-
merical integration. It is called an asymptotic error
estimate.
Example. Consider evaluating

I(f) = ∫_0^π e^x cos x dx = -(e^π + 1)/2 ≈ -12.070346

In this case,

f'(x) = e^x [cos x - sin x]
f''(x) = -2 e^x sin x

max_{0≤x≤π} |f''(x)| = |f''(.75π)| ≈ 14.921

Then

E_n^T(f) = -(h^2 (b - a)/12) f''(cn)

|E_n^T(f)| ≤ (h^2 π/12) · 14.921 ≈ 3.906 h^2

Also

Ẽ_n^T(f) = -(h^2/12) [ f'(π) - f'(0) ]
         = (h^2/12) [e^π + 1] ≈ 2.012 h^2

In looking at the table (in a separate file on the website)
for evaluating the integral I by the trapezoidal rule,
we see that the error E_n^T(f) and the error estimate
Ẽ_n^T(f) are quite close. Therefore

I(f) - Tn(f) ≈ (h^2/12) [e^π + 1]

I(f) ≈ Tn(f) + (h^2/12) [e^π + 1]

This last formula is called the corrected trapezoidal
rule, and it is illustrated in the second table (on the
separate page). We see it gives a much smaller er-
ror for essentially the same amount of work; and it
converges much more rapidly.

In general,

I(f) - Tn(f) ≈ -(h^2/12) [ f'(b) - f'(a) ]

I(f) ≈ Tn(f) - (h^2/12) [ f'(b) - f'(a) ]

This is the corrected trapezoidal rule. It is easy to
obtain from the trapezoidal rule, and in most cases,
it converges more rapidly than the trapezoidal rule.
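A brief MATLAB sketch (ours) comparing Tn and the corrected rule on the example above; the exact value -(e^π + 1)/2 is used for the errors.

f  = @(x) exp(x).*cos(x);
fp = @(x) exp(x).*(cos(x) - sin(x));       % f'(x)
a = 0;  b = pi;  I = -(exp(pi) + 1)/2;
for n = [4 8 16 32]
    h   = (b - a)/n;   x = a + (0:n)*h;
    Tn  = h*( f(a)/2 + sum(f(x(2:n))) + f(b)/2 );
    CTn = Tn - (h^2/12)*(fp(b) - fp(a));   % corrected trapezoidal rule
    fprintf('n = %2d   I-Tn = %10.2e   I-CTn = %10.2e\n', n, I - Tn, I - CTn)
end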
SIMPSONS RULE ERROR FORMULA

Recall the general Simpsons rule


∫_a^b f(x) dx ≈ Sn(f) ≡ (h/3) [f(x0) + 4f(x1) + 2f(x2)
    + 4f(x3) + 2f(x4) + ···
    + 2f(x_{n-2}) + 4f(x_{n-1}) + f(xn)]

For its error, we have

E_n^S(f) ≡ ∫_a^b f(x) dx - Sn(f) = -(h^4 (b - a)/180) f^{(4)}(cn)

for some a ≤ cn ≤ b, with cn otherwise unknown. For
an asymptotic error estimate,

∫_a^b f(x) dx - Sn(f) ≈ Ẽ_n^S(f) ≡ -(h^4/180) [ f'''(b) - f'''(a) ]
DISCUSSION

For Simpson's rule, both error formulas assume


that the integrand f (x) has four continuous deriva-
tives on the interval [a, b]. What happens when this
is not valid? We return later to this question.

Both formulas also say the error should decrease by a


factor of around 16 when n is doubled.

Compare these results with those for the trapezoidal
rule error formulas:

E_n^T(f) ≡ ∫_a^b f(x) dx - Tn(f) = -(h^2 (b - a)/12) f''(cn)

E_n^T(f) ≈ -(h^2/12) [ f'(b) - f'(a) ] ≡ Ẽ_n^T(f)
EXAMPLE

Consider evaluating

I = ∫_0^2 dx/(1 + x^2)

using Simpson's rule Sn(f). How large should n be
chosen in order to ensure that

|E_n^S(f)| ≤ 5 × 10^{-6} ?

Begin by noting that

f^{(4)}(x) = 24 (5x^4 - 10x^2 + 1)/(1 + x^2)^5

max_{0≤x≤2} |f^{(4)}(x)| = f^{(4)}(0) = 24

Then

E_n^S(f) = -(h^4 (b - a)/180) f^{(4)}(cn)

|E_n^S(f)| ≤ (h^4 · 2/180) · 24 = (4/15) h^4

Then |E_n^S(f)| ≤ 5 × 10^{-6} is true if

(4/15) h^4 ≤ 5 × 10^{-6}
h ≤ .0658
n ≥ 30.39

Therefore, choosing n ≥ 32 will give the desired er-
ror bound. Compare this with the earlier trapezoidal
example in which n ≥ 517 was needed.

For the asymptotic error estimate, we have

f'''(x) = 24 x (x^2 - 1)/(1 + x^2)^4

Ẽ_n^S(f) = -(h^4/180) [ f'''(2) - f'''(0) ]
         = -(h^4/180) (144/625) = -(4/3125) h^4
INTEGRATING sqrt(x)

Consider the numerical approximation of

∫_0^1 sqrt(x) dx = 2/3

In the following table, we give the errors when using
both the trapezoidal and Simpson rules.

n     E_n^T      Ratio   E_n^S      Ratio
2     6.311E-2           2.860E-2
4     2.338E-2   2.70    1.012E-2   2.82
8     8.536E-3   2.74    3.587E-3   2.83
16    3.085E-3   2.77    1.268E-3   2.83
32    1.108E-3   2.78    4.485E-4   2.83
64    3.959E-4   2.80    1.586E-4   2.83
128   1.410E-4   2.81    5.606E-5   2.83

The rate of convergence is slower because the func-
tion f(x) = sqrt(x) is not sufficiently differentiable on
[0, 1]. Both methods converge with a rate propor-
tional to h^{1.5}.
ASYMPTOTIC ERROR FORMULAS

If we have a numerical integration formula,

∫_a^b f(x) dx ≈ Σ_{j=0}^{n} wj f(xj)

let En(f) denote its error,

En(f) = ∫_a^b f(x) dx - Σ_{j=0}^{n} wj f(xj)

We say another formula Ẽn(f) is an asymptotic error
formula for this numerical integration if it satisfies

lim_{n→∞} Ẽn(f)/En(f) = 1

Equivalently,

lim_{n→∞} [En(f) - Ẽn(f)] / En(f) = 0

These conditions say that Ẽn(f) looks increasingly
like En(f) as n increases, and thus

En(f) ≈ Ẽn(f)

Example. For the trapezoidal rule,

E_n^T(f) ≈ Ẽ_n^T(f) = -(h^2/12) [ f'(b) - f'(a) ]

This assumes f(x) has two continuous derivatives on
the interval [a, b].

Example. For Simpson's rule,


E_n^S(f) ≈ Ẽ_n^S(f) = -(h^4/180) [ f'''(b) - f'''(a) ]
This assumes f (x) has four continuous derivatives on
the interval [a, b].

Note that both of these formulas can be written in an


equivalent form as

Ẽn(f) = c/n^p

for an appropriate constant c and exponent p. With the
trapezoidal rule, p = 2 and

c = -((b - a)^2 / 12) [ f'(b) - f'(a) ]

and for Simpson's rule, p = 4 with a suitable c.
The formula

Ẽn(f) = c/n^p     (2)

occurs for many other numerical integration formulas
that we have not yet defined or studied. In addition,
if we use the trapezoidal or Simpson rules with an
integrand f(x) which is not sufficiently differentiable,
then (2) may hold with an exponent p that is less than
the ideal.
Example. Consider

I = ∫_0^1 x^α dx

in which -1 < α < 1, α ≠ 0. Then the conver-
gence of the trapezoidal rule can be shown to have an
asymptotic error formula

En ≈ Ẽn = c/n^{α+1}     (3)

for some constant c dependent on α. A similar result
holds for Simpson's rule, with -1 < α < 3, α not an
integer. We can actually specify a formula for c; but
the formula is often less important than knowing that
(2) is valid for some c.
APPLICATION OF ASYMPTOTIC
ERROR FORMULAS

Assume we know that an asymptotic error formula

I - In ≈ c/n^p

is valid for some numerical integration rule denoted by
In. Initially, assume we know the exponent p. Then
imagine calculating both In and I2n. With I2n, we
have

I - I2n ≈ c/(2^p n^p)

This leads to

I - In ≈ 2^p [I - I2n]

I ≈ (2^p I2n - In)/(2^p - 1) = I2n + (I2n - In)/(2^p - 1)

The formula

I ≈ I2n + (I2n - In)/(2^p - 1)     (4)

is called Richardson's extrapolation formula.
Example. With the trapezoidal rule and with the in-
tegrand f(x) having two continuous derivatives,

I ≈ T2n + (1/3) [T2n - Tn]

Example. With Simpson's rule and with the integrand
f(x) having four continuous derivatives,

I ≈ S2n + (1/15) [S2n - Sn]

We can also use the formula (2) to obtain error esti-
mation formulas:

I - I2n ≈ (I2n - In)/(2^p - 1)     (5)

This is called Richardson's error estimate. For exam-
ple, with the trapezoidal rule,

I - T2n ≈ (1/3) [T2n - Tn]

These formulas are illustrated for the trapezoidal rule
in an accompanying table, for

∫_0^π e^x cos x dx = -(e^π + 1)/2 ≈ -12.07034632
AITKEN EXTRAPOLATION

In this case, we again assume

I - In ≈ c/n^p

But in contrast to previously, we do not know either
c or p. Imagine computing In, I2n, and I4n. Then

I - In ≈ c/n^p
I - I2n ≈ c/(2^p n^p)
I - I4n ≈ c/(4^p n^p)

We can directly try to estimate I. Dividing,

(I - In)/(I - I2n) ≈ 2^p ≈ (I - I2n)/(I - I4n)

Solving for I, we obtain

(I - I2n)^2 ≈ (I - In)(I - I4n)

I (In + I4n - 2 I2n) ≈ In I4n - I2n^2

I ≈ (In I4n - I2n^2) / (In + I4n - 2 I2n)

This can be improved computationally, to avoid loss
of significance errors:

I ≈ I4n + [ (In I4n - I2n^2)/(In + I4n - 2 I2n) - I4n ]
  = I4n - (I4n - I2n)^2 / [ (I4n - I2n) - (I2n - In) ]

This is called Aitken's extrapolation formula.

To estimate p, we use

(I2n - In)/(I4n - I2n) ≈ 2^p

To see this, write

(I2n - In)/(I4n - I2n) = [ (I - In) - (I - I2n) ] / [ (I - I2n) - (I - I4n) ]

Then substitute from the following and simplify:

I - In ≈ c/n^p
I - I2n ≈ c/(2^p n^p)
I - I4n ≈ c/(4^p n^p)
Example. Consider the following table of numerical
integrals. What is its order of convergence?

n     In              In - I_{n/2}    Ratio
2     .28451779686
4     .28559254576    1.075E-3
8     .28570248748    1.099E-4        9.78
16    .28571317731    1.069E-5        10.28
32    .28571418363    1.006E-6        10.62
64    .28571427643    9.280E-8        10.84

It appears

2^p ≈ 10.84,   p = log2(10.84) ≈ 3.44

We could now combine this with Richardson's error
formula to estimate the error:

I - In ≈ (1/(2^p - 1)) [ In - I_{n/2} ]

For example,

I - I64 ≈ (1/9.84) [9.280E-8] = 9.43E-9
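A tiny MATLAB sketch (ours) of this order estimate, using the last three entries of the table:

In  = .28571317731;   % I_16
I2n = .28571418363;   % I_32
I4n = .28571427643;   % I_64
p   = log2( (I2n - In)/(I4n - I2n) );   % estimated order, about 3.44
err = (I4n - I2n)/(2^p - 1);            % Richardson error estimate for I_64
fprintf('p = %.2f   estimated I - I64 = %.2e\n', p, err)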
PERIODIC FUNCTIONS

A function f (x) is periodic if the following condition


is satisfied: there is a smallest real number τ > 0 for
which

f(x + τ) = f(x),   -∞ < x < ∞     (6)

The number τ is called the period of the function
f(x). The constant function f(x) ≡ 1 is also consid-
ered periodic, but it satisfies this condition with any
τ > 0. Basically, a periodic function is one which
repeats itself over intervals of length τ.

The condition (6) implies

f^{(m)}(x + τ) = f^{(m)}(x),   -∞ < x < ∞     (7)

for the m-th derivative of f(x), provided there is such
a derivative. Thus the derivatives are also periodic.

Periodic functions occur very frequently in applica-


tions of mathematics, reflecting the periodicity of many
phenomena in the physical world.
PERIODIC INTEGRANDS

Consider the special class of integrals


I(f) = ∫_a^b f(x) dx

in which f(x) is periodic, with b - a an integer multiple
of the period of f(x). In this case, the performance
of the trapezoidal rule and other numerical integration
rules is much better than that predicted by the earlier
error formulas.
To hint at this improved performance, recall

∫_a^b f(x) dx - Tn(f) ≈ Ẽn(f) = -(h^2/12) [ f'(b) - f'(a) ]

With our assumption on the periodicity of f(x), we
have

f(a) = f(b),   f'(a) = f'(b)

Therefore,

Ẽn(f) = 0

and we should expect improved performance in the
convergence behaviour of the trapezoidal sums Tn(f).

If in addition to being periodic on [a, b], the integrand


f(x) also has m continuous derivatives, then it can be
shown that

I(f) - Tn(f) = c/n^m + smaller terms

By "smaller terms", we mean terms which decrease
to zero more rapidly than n^{-m}.

Thus if f(x) is periodic with b - a an integer multiple
of the period of f(x), and if f(x) is infinitely differ-
entiable, then the error I - Tn decreases to zero more
rapidly than n^{-m} for any m > 0. For periodic inte-
grands, the trapezoidal rule is an optimal numerical
integration method.
Example. Consider evaluating

I = ∫_0^{2π} sin(x) dx / (1 + e^{sin x})

Using the trapezoidal rule, we have the results in the
following table. In this case, the formulas based on
Richardson extrapolation are no longer valid.

n     Tn                   Tn - T_{n/2}
2      0.0
4     -0.72589193317292    -7.259E-1
8     -0.74006131211583    -1.417E-2
16    -0.74006942337672    -8.111E-6
32    -0.74006942337946    -2.746E-12
64    -0.74006942337946     0.0
NUMERICAL INTEGRATION:
ANOTHER APPROACH

We look for numerical integration formulas


∫_{-1}^{1} f(x) dx ≈ Σ_{j=1}^{n} wj f(xj)

which are to be exact for polynomials of as large a
degree as possible. There are no restrictions placed
on the nodes {xj} nor the weights {wj} in working
towards that goal. The motivation is that if the formula
is exact for high degree polynomials, then perhaps it
will be very accurate when integrating functions that
are well approximated by polynomials.

There is no guarantee that such an approach will work.
In fact, it turns out to be a bad idea when the node
points {xj} are required to be evenly spaced over the
interval of integration. But without this restriction on
{xj} we are able to develop a very accurate set of
quadrature formulas.
The case n = 1. We want a formula

w1 f(x1) ≈ ∫_{-1}^{1} f(x) dx

The weight w1 and the node x1 are to be so chosen
that the formula is exact for polynomials of as large a
degree as possible.

To do this we substitute f(x) = 1 and f(x) = x. The
first choice leads to

w1 · 1 = ∫_{-1}^{1} 1 dx = 2
w1 = 2

The choice f(x) = x leads to

w1 x1 = ∫_{-1}^{1} x dx = 0
x1 = 0

The desired formula is

∫_{-1}^{1} f(x) dx ≈ 2 f(0)

It is called the midpoint rule and was introduced in
the problems of Section 5.1.
The case n = 2. We want a formula

w1 f(x1) + w2 f(x2) ≈ ∫_{-1}^{1} f(x) dx

The weights w1, w2 and the nodes x1, x2 are to be so
chosen that the formula is exact for polynomials of as
large a degree as possible. We substitute and force
equality for

f(x) = 1, x, x^2, x^3

This leads to the system

w1 + w2 = ∫_{-1}^{1} 1 dx = 2
w1 x1 + w2 x2 = ∫_{-1}^{1} x dx = 0
w1 x1^2 + w2 x2^2 = ∫_{-1}^{1} x^2 dx = 2/3
w1 x1^3 + w2 x2^3 = ∫_{-1}^{1} x^3 dx = 0

The solution is given by

w1 = w2 = 1,   x1 = -1/sqrt(3),   x2 = 1/sqrt(3)

This yields the formula

∫_{-1}^{1} f(x) dx ≈ f(-1/sqrt(3)) + f(1/sqrt(3))     (1)

We say it has degree of precision equal to 3 since it
integrates exactly all polynomials of degree ≤ 3. We
can verify directly that it does not integrate exactly
f(x) = x^4:

∫_{-1}^{1} x^4 dx = 2/5

f(-1/sqrt(3)) + f(1/sqrt(3)) = 2/9

Thus (1) has degree of precision exactly 3.

EXAMPLE Integrate

∫_{-1}^{1} dx/(3 + x) = log 2 ≈ 0.69314718

The formula (1) yields

1/(3 + x1) + 1/(3 + x2) = 0.69230769

Error ≈ .000839
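A two-line MATLAB check (ours) of this two-point Gauss rule:

f  = @(x) 1./(3 + x);
G2 = f(-1/sqrt(3)) + f(1/sqrt(3));     % two-point Gauss-Legendre rule on [-1,1]
fprintf('G2 = %.8f   error = %.6f\n', G2, log(2) - G2)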
THE GENERAL CASE

We want to find the weights {wi} and nodes {xi} so
as to have

∫_{-1}^{1} f(x) dx ≈ Σ_{j=1}^{n} wj f(xj)

be exact for polynomials f(x) of as large a degree
as possible. As unknowns, there are n weights wi and
n nodes xi. Thus it makes sense to initially impose
2n conditions so as to obtain 2n equations for the 2n
unknowns. We require the quadrature formula to be
exact for the cases

f(x) = x^i,   i = 0, 1, 2, ..., 2n - 1

Then we obtain the system of equations

w1 x1^i + w2 x2^i + ··· + wn xn^i = ∫_{-1}^{1} x^i dx

for i = 0, 1, 2, ..., 2n - 1. For the right sides,

∫_{-1}^{1} x^i dx = 2/(i + 1),   i = 0, 2, ..., 2n - 2
                 = 0,           i = 1, 3, ..., 2n - 1
The system of equations

w1 x1^i + ··· + wn xn^i = ∫_{-1}^{1} x^i dx,   i = 0, ..., 2n - 1

has a solution, and the solution is unique except for
re-ordering the unknowns. The resulting numerical
integration rule is called Gaussian quadrature.

In fact, the nodes and weights are not found by solv-


ing this system. Rather, the nodes and weights have
other properties which enable them to be found more
easily by other methods. There are programs to pro-
duce them; and most subroutine libraries have either
a program to produce them or tables of them for com-
monly used cases.
SYMMETRY OF FORMULA

The nodes and weights possess symmetry properties.


In particular,

x_i = -x_{n+1-i},   w_i = w_{n+1-i},   i = 1, 2, ..., n


A table of these nodes and weights for n = 2, ..., 8 is
given in the text in Table 5.7. A MATLAB program
to give the nodes and weights for an arbitrary finite
interval [a, b] is given in the class account.

In addition, it can be shown that all weights satisfy

wi > 0
for all n > 0. This is considered a very desirable
property from a practical point of view. Moreover, it
permits us to develop a useful error formula.
CHANGE OF INTERVAL
OF INTEGRATION

Integrals on other finite intervals [a, b] can be con-


verted to integrals over [-1, 1], as follows:

∫_a^b F(x) dx = ((b - a)/2) ∫_{-1}^{1} F( (b + a + t(b - a))/2 ) dt

based on the change of integration variables

x = (b + a + t(b - a))/2,   -1 ≤ t ≤ 1

EXAMPLE Over the interval [0, π], use

x = (1 + t) π/2

Then

∫_0^π F(x) dx = (π/2) ∫_{-1}^{1} F( (1 + t) π/2 ) dt
EXAMPLE Consider again the integrals used as ex-
amples in Section 5.1:
I^(1) = ∫_0^1 e^{-x^2} dx ≈ .74682413281234

I^(2) = ∫_0^4 dx/(1 + x^2) = arctan 4

I^(3) = ∫_0^{2π} dx/(2 + cos x) = 2π/sqrt(3)

n    I - I^(1)    I - I^(2)    I - I^(3)
2    2.29E-4      2.33E-2      8.23E-1
3    9.55E-6      3.49E-2      4.30E-1
4    3.35E-7      1.90E-3      1.77E-1
5    6.05E-9      1.70E-3      8.12E-2
6    7.77E-11     2.74E-4      3.55E-2
7    8.60E-13     6.45E-5      1.58E-2
10                1.27E-6      1.37E-3
15                7.40E-10     2.33E-5
20                             3.96E-7

Compare these results with those of Section 5.1.


AN ERROR FORMULA

The usual error formula for the Gaussian quadrature for-
mula,

En(f) = ∫_{-1}^{1} f(x) dx - Σ_{j=1}^{n} wj f(xj)

is not particularly intuitive. It is given by

En(f) = en · f^{(2n)}(cn) / (2n)!

en = 2^{2n+1} (n!)^4 / [ (2n + 1) ((2n)!)^2 ]

for some cn with -1 ≤ cn ≤ 1.

To help in understanding the implications of this error
formula, introduce

Mk = max_{-1≤x≤1} |f^{(k)}(x)| / k!

With many integrands f(x), this sequence {Mk} is
bounded or even decreases to zero. For example,

f(x) = cos x      ⇒   Mk ≤ 1/k!
f(x) = 1/(2 + x)  ⇒   Mk ≤ 1

Then for our error formula,

En(f) = en · f^{(2n)}(cn) / (2n)!

|En(f)| ≤ en M2n     (2)

By other methods, we can show

en ≈ π/4^n

When combined with (2) and an assumption of uni-
form boundedness for {Mk}, we see that the error de-
creases by a factor of at least 4 with each increase of
n to n + 1. Compare this to the convergence of the
trapezoidal and Simpson rules for such functions, to
help explain the very rapid convergence of Gaussian
quadrature.
quadrature.
A SECOND ERROR FORMULA

Let f(x) be continuous for a ≤ x ≤ b; let n ≥ 1.
Then, for the Gaussian numerical integration formula

I ≡ ∫_a^b f(x) dx ≈ Σ_{j=1}^{n} wj f(xj) ≡ In

on [a, b], the error in In satisfies

|I(f) - In(f)| ≤ 2 (b - a) ρ_{2n-1}(f)     (3)

Here ρ_{2n-1}(f) is the minimax error of degree 2n - 1
for f(x) on [a, b]:

ρ_m(f) = min_{deg(p)≤m} [ max_{a≤x≤b} |f(x) - p(x)| ],   m ≥ 0
EXAMPLE Let f(x) = e^{-x^2}. Then the minimax er-
rors ρ_m(f) are given in the following table.

m    ρ_m(f)      m     ρ_m(f)
1    5.30E-2     6     7.82E-6
2    1.79E-2     7     4.62E-7
3    6.63E-4     8     9.64E-8
4    4.63E-4     9     8.05E-9
5    1.62E-5     10    9.16E-10

Using this table, apply (3) to

I = ∫_0^1 e^{-x^2} dx

For n = 3, (3) implies

|I - I3| ≤ 2 ρ5(e^{-x^2}) ≈ 3.24 × 10^{-5}

The actual error is 9.55E-6.


INTEGRATING
A NON-SMOOTH INTEGRAND

Consider using Gaussian quadrature to evaluate

I = ∫_0^1 sqrt(x) dx = 2/3

n     I - In      Ratio
2     7.22E-3
4     1.16E-3     6.2
8     1.69E-4     6.9
16    2.30E-5     7.4
32    3.00E-6     7.6
64    3.84E-7     7.8

The column labeled Ratio is defined by

(I - I_{n/2}) / (I - In)

It is consistent with I - In ≈ c/n^3, which can be proven
theoretically. In comparison, for the trapezoidal and
Simpson rules, I - In ≈ c/n^{1.5}.
WEIGHTED GAUSSIAN QUADRATURE

Consider needing to evaluate integrals such as

∫_0^1 f(x) log x dx,   ∫_0^1 x^{1/3} f(x) dx

How do we proceed? Consider numerical integration
formulas

∫_a^b w(x) f(x) dx ≈ Σ_{j=1}^{n} wj f(xj)

in which f(x) is considered a "nice" function (one
with several continuous derivatives). The function
w(x) is allowed to be singular, but must be integrable.
We assume here that [a, b] is a finite interval. The
function w(x) is called a weight function, and it is
implicitly absorbed into the definition of the quadra-
ture weights {wj}. We again determine the nodes
{xj} and weights {wj} so as to make the integration
formula exact for f(x) a polynomial of as large a de-
gree as possible.
The resulting numerical integration formula

∫_a^b w(x) f(x) dx ≈ Σ_{j=1}^{n} wj f(xj)

is called a Gaussian quadrature formula with weight
function w(x). We determine the nodes {xj} and
weights {wj} by requiring exactness in the above for-
mula for

f(x) = x^i,   i = 0, 1, 2, ..., 2n - 1

To make the derivation more understandable, we con-
sider the particular case

∫_0^1 x^{1/3} f(x) dx ≈ Σ_{j=1}^{n} wj f(xj)

We follow the same pattern as used earlier.
The case n = 1. We want a formula

w1 f(x1) ≈ ∫_0^1 x^{1/3} f(x) dx

The weight w1 and the node x1 are to be so chosen
that the formula is exact for polynomials of as large a
degree as possible. Choosing f(x) = 1, we have

w1 = ∫_0^1 x^{1/3} dx = 3/4

Choosing f(x) = x, we have

w1 x1 = ∫_0^1 x^{1/3} x dx = 3/7
x1 = 4/7

Thus

∫_0^1 x^{1/3} f(x) dx ≈ (3/4) f(4/7)

has degree of precision 1.
The case n = 2. We want a formula

w1 f(x1) + w2 f(x2) ≈ ∫_0^1 x^{1/3} f(x) dx

The weights w1, w2 and the nodes x1, x2 are to be
so chosen that the formula is exact for polynomials of
as large a degree as possible. We determine them by
requiring equality for

f(x) = 1, x, x^2, x^3

This leads to the system

w1 + w2 = ∫_0^1 x^{1/3} dx = 3/4
w1 x1 + w2 x2 = ∫_0^1 x · x^{1/3} dx = 3/7
w1 x1^2 + w2 x2^2 = ∫_0^1 x^2 x^{1/3} dx = 3/10
w1 x1^3 + w2 x2^3 = ∫_0^1 x^3 x^{1/3} dx = 3/13

The solution is

x1 = 7/13 - (3/65) sqrt(35),   x2 = 7/13 + (3/65) sqrt(35)
w1 = 3/8 - (3/392) sqrt(35),   w2 = 3/8 + (3/392) sqrt(35)

Numerically,

x1 = .2654117024,   x2 = .8115113746
w1 = .3297238792,   w2 = .4202761208

The formula

∫_0^1 x^{1/3} f(x) dx ≈ w1 f(x1) + w2 f(x2)     (4)

has degree of precision 3.
EXAMPLE Consider evaluating the integral

∫_0^1 x^{1/3} cos x dx     (5)

In applying (4), we take f(x) = cos x. Then

w1 f(x1) + w2 f(x2) = 0.6074977951

The true answer is

∫_0^1 x^{1/3} cos x dx ≈ 0.6076257393

and our numerical answer is in error by E2 ≈ .000128.
This is quite a good answer involving very little com-
putational effort (once the formula has been deter-
mined). In contrast, the trapezoidal and Simpson
rules applied to (5) would converge very slowly be-
cause the first derivative of the integrand is singular
at the origin.
CHANGE OF VARIABLES

As a side note to the preceding example, we observe
that the change of variables x = t^3 transforms the
integral (5) to

3 ∫_0^1 t^3 cos(t^3) dt

and both the trapezoidal and Simpson rules will per-
form better with this formula, although still not as
well as our weighted Gaussian quadrature.

A change of the integration variable can often im-
prove the performance of a standard method, usually
by increasing the differentiability of the integrand.

EXAMPLE Using x = t^r for some r > 1, we have

∫_0^1 g(x) log x dx = r^2 ∫_0^1 t^{r-1} g(t^r) log t dt

The new integrand is generally smoother than the
original one.
NUMERICAL DIFFERENTIATION

There are two major reasons for considering numeri-
cal approximations of the differentiation process.

1. Approximation of derivatives in ordinary differen-
tial equations and partial differential equations.
This is done in order to reduce the differential
equation to a form that can be solved more easily
than the original differential equation.

2. Forming the derivative of a function f(x) which is
known only as empirical data {(xi, yi) | i = 1, ..., m}.
The data generally is known only approximately,
so that yi ≈ f(xi), i = 1, ..., m.
Recall the definition

f'(x) = lim_{h→0} [f(x + h) - f(x)] / h

This justifies using

f'(x) ≈ [f(x + h) - f(x)] / h ≡ Dh f(x)     (1)

for small values of h. The approximation Dh f(x) is
called a numerical derivative of f(x) with stepsize h.

Example. Use Dh f(x) to approximate the derivative
of f(x) = cos(x) at x = π/6. In the table, the error
is almost halved when h is halved.

h          Dh f       Error      Ratio
0.1        -0.54243   0.04243
0.05       -0.52144   0.02144    1.98
0.025      -0.51077   0.01077    1.99
0.0125     -0.50540   0.00540    1.99
0.00625    -0.50270   0.00270    2.00
0.003125   -0.50135   0.00135    2.00
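A small MATLAB sketch (ours) reproducing this forward-difference table:

f  = @(x) cos(x);   x = pi/6;   fp = -sin(x);    % exact derivative
for h = 0.1 ./ 2.^(0:5)
    Dh = (f(x + h) - f(x)) / h;                  % forward difference
    fprintf('h = %-9.6g  Dh f = %9.5f   error = %8.5f\n', h, Dh, fp - Dh)
end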
Error behaviour. Using Taylor's theorem,

f(x + h) = f(x) + h f'(x) + (1/2) h^2 f''(c)

with c between x and x + h. Evaluating (1),

Dh f(x) = (1/h) { [f(x) + h f'(x) + (1/2) h^2 f''(c)] - f(x) }
        = f'(x) + (1/2) h f''(c)

f'(x) - Dh f(x) = -(1/2) h f''(c)     (2)

Using a higher order Taylor expansion,

f'(x) - Dh f(x) = -(1/2) h f''(x) - (1/6) h^2 f'''(c),

f'(x) - Dh f(x) ≈ -(1/2) h f''(x)     (3)

for small values of h.

For f(x) = cos x,

f'(x) - Dh f(x) = (1/2) h cos c,   c ∈ [π/6, π/6 + h]

In the preceding table, check the accuracy of the ap-
proximation (3) with x = π/6.
The formula (1),

f'(x) ≈ Dh f(x) = [f(x + h) - f(x)] / h

is called a forward difference formula for approximat-
ing f'(x). In contrast, the approximation

f'(x) ≈ [f(x) - f(x - h)] / h,   h > 0     (4)

is called a backward difference formula for approxi-
mating f'(x). A similar derivation leads to

f'(x) - [f(x) - f(x - h)] / h = (h/2) f''(c)     (5)

for some c between x and x - h. The accuracy of
the backward difference formula (4) is essentially the
same as that of the forward difference formula (1).

The motivation for this formula is in applications to


solving differential equations.
DIFFERENTIATION USING INTERPOLATION

Let Pn(x) be the degree n polynomial that interpo-
lates f(x) at n + 1 node points x0, x1, ..., xn. To
calculate f'(x) at some point x = t, use

f'(t) ≈ Pn'(t)     (6)

Many different formulas can be obtained by varying n
and by varying the placement of the nodes x0, ..., xn
relative to the point t of interest.

Example. Take n = 2, and use evenly spaced nodes
x0, x1 = x0 + h, x2 = x1 + h. Then

P2(x) = f(x0) L0(x) + f(x1) L1(x) + f(x2) L2(x)
P2'(x) = f(x0) L0'(x) + f(x1) L1'(x) + f(x2) L2'(x)

with

L0(x) = (x - x1)(x - x2) / [(x0 - x1)(x0 - x2)]
L1(x) = (x - x0)(x - x2) / [(x1 - x0)(x1 - x2)]
L2(x) = (x - x0)(x - x1) / [(x2 - x0)(x2 - x1)]

Forming the derivatives of these Lagrange basis func-
tions and evaluating them at x = x1 gives

f'(x1) ≈ P2'(x1) = [f(x1 + h) - f(x1 - h)] / (2h) ≡ Dh f(x1)     (7)

For the error,

f'(x1) - [f(x1 + h) - f(x1 - h)] / (2h) = -(h^2/6) f'''(c2)     (8)

with x1 - h ≤ c2 ≤ x1 + h.

A proof of this begins with the interpolation error for-
mula

f(x) - P2(x) = Ψ2(x) f[x0, x1, x2, x]

Ψ2(x) = (x - x0)(x - x1)(x - x2)

Differentiate to get

f'(x) - P2'(x) = Ψ2(x) (d/dx) f[x0, x1, x2, x]
                 + Ψ2'(x) f[x0, x1, x2, x]

With properties of the divided difference, we can show

f'(x) - P2'(x) = (1/24) Ψ2(x) f^{(4)}(c1,x) + (1/6) Ψ2'(x) f^{(3)}(c2,x)

with c1,x and c2,x between the smallest and largest of
the values {x0, x1, x2, x}. Letting x = x1 and noting
that Ψ2(x1) = 0, we obtain (8).

Example. Take f(x) = cos(x) and x1 = π/6. Then
(7) is illustrated as follows.

h         Dh f           Error          Ratio
0.1       -0.49916708    -0.0008329
0.05      -0.49979169    -0.0002083     4.00
0.025     -0.49994792    -0.00005208    4.00
0.0125    -0.49998698    -0.00001302    4.00
0.00625   -0.49999674    -0.000003255   4.00

Note the smaller errors and faster convergence as com-


pared to the forward difference formula (1).
UNDETERMINED COEFFICIENTS

Derive an approximation for f''(x) at x = t. Write

f''(t) ≈ Dh^{(2)} f(t) ≡ A f(t + h) + B f(t) + C f(t - h)     (9)

with A, B, and C unspecified constants. Use Taylor
polynomial approximations

f(t - h) ≈ f(t) - h f'(t) + (h^2/2) f''(t) - (h^3/6) f'''(t) + (h^4/24) f^{(4)}(t)

f(t + h) ≈ f(t) + h f'(t) + (h^2/2) f''(t) + (h^3/6) f'''(t) + (h^4/24) f^{(4)}(t)     (10)

Substitute into (9) and rearrange:

Dh^{(2)} f(t) ≈ (A + B + C) f(t)
    + h (A - C) f'(t) + (h^2/2) (A + C) f''(t)     (11)
    + (h^3/6) (A - C) f'''(t) + (h^4/24) (A + C) f^{(4)}(t)

To have

Dh^{(2)} f(t) ≈ f''(t)     (12)

for arbitrary functions f(x), require

A + B + C = 0:        coefficient of f(t)
h (A - C) = 0:        coefficient of f'(t)
(h^2/2)(A + C) = 1:   coefficient of f''(t)

Solution:

A = C = 1/h^2,   B = -2/h^2     (13)

This determines

Dh^{(2)} f(t) = [f(t + h) - 2 f(t) + f(t - h)] / h^2     (14)

For the error, substitute (13) into (11):

Dh^{(2)} f(t) ≈ f''(t) + (h^2/12) f^{(4)}(t)

Thus

f''(t) - [f(t + h) - 2 f(t) + f(t - h)] / h^2 ≈ -(h^2/12) f^{(4)}(t)     (15)
Example. Let f(x) = cos(x), t = π/6; use (14) to
calculate f''(t) = -cos(π/6).

h         Dh^{(2)} f     Error       Ratio
0.5       -0.84813289    -1.789E-2
0.25      -0.86152424    -4.501E-3   3.97
0.125     -0.86489835    -1.127E-3   3.99
0.0625    -0.86574353    -2.819E-4   4.00
0.03125   -0.86595493    -7.048E-5   4.00
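A matching MATLAB sketch (ours) for the central second-difference formula (14):

f   = @(x) cos(x);   t = pi/6;   fpp = -cos(t);   % exact f''(t)
for h = 0.5 ./ 2.^(0:4)
    D2 = (f(t + h) - 2*f(t) + f(t - h)) / h^2;    % formula (14)
    fprintf('h = %-8.5g  D2 = %11.8f   error = %10.3e\n', h, D2, fpp - D2)
end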
EFFECTS OF ERROR IN FUNCTION VALUES

Recall

Dh^{(2)} f(x1) = [f(x2) - 2 f(x1) + f(x0)] / h^2 ≈ f''(x1)

with x2 = x1 + h, x0 = x1 - h. Assume the ac-
tual function values used in the computation contain
data error, and denote these values by f̂0, f̂1, and f̂2.
Introduce the data errors:

εi = f(xi) - f̂i,   i = 0, 1, 2     (16)

The actual quantity calculated is

D̂h^{(2)} f(x1) = [f̂2 - 2 f̂1 + f̂0] / h^2     (17)

For the error in this quantity, replace f̂j by f(xj) - εj,
j = 0, 1, 2, to obtain the following:

f''(x1) - D̂h^{(2)} f(x1)
  = f''(x1) - [ (f(x2) - ε2) - 2(f(x1) - ε1) + (f(x0) - ε0) ] / h^2
  = [ f''(x1) - (f(x2) - 2 f(x1) + f(x0)) / h^2 ] + (ε2 - 2 ε1 + ε0) / h^2
  ≈ -(1/12) h^2 f^{(4)}(x1) + (ε2 - 2 ε1 + ε0) / h^2     (18)

The last line uses (15).

The errors {ε0, ε1, ε2} are generally random in some
interval [-δ, δ]. If f̂0, f̂1, f̂2 are experimental data,
then δ is a bound on the experimental error. If the f̂j
are obtained from computing f(x) in a computer, then
the errors εj are the combination of rounding or chop-
ping errors and δ is a bound on these errors.
In either case, (18) yields the approximate inequality

|f''(x1) - D̂h^{(2)} f(x1)| ≤ (h^2/12) |f^{(4)}(x1)| + 4δ/h^2     (19)

This suggests that as h → 0, the error will eventually
increase, because of the final term 4δ/h^2.

Example. Calculate D̂h^{(2)} f(x1) for f(x) = cos(x) at
x1 = π/6. To show the effect of rounding errors, the
values f̂i are obtained by rounding f(xi) to six signif-
icant digits; and the errors satisfy

|εi| ≤ 5.0 × 10^{-7} = δ,   i = 0, 1, 2

Other than these rounding errors, the formula D̂h^{(2)} f(x1)
is calculated exactly. In this example, the bound (19)
becomes

|f''(x1) - D̂h^{(2)} f(x1)| ≤ (h^2/12) cos(π/6) + (4/h^2)(5 × 10^{-7})
                           ≈ 0.0722 h^2 + (2 × 10^{-6})/h^2 ≡ E(h)

For h = 0.125, the bound E(h) ≈ 0.00126, which is
not too far off from the actual error given in the table.

h            D̂h^{(2)} f(x1)    Error
0.5          -0.848128         -0.017897
0.25         -0.861504         -0.004521
0.125        -0.864832         -0.001193
0.0625       -0.865536         -0.000489
0.03125      -0.865280         -0.000745
0.015625     -0.860160         -0.005865
0.0078125    -0.851968         -0.014057
0.00390625   -0.786432         -0.079593

The bound E(h) indicates that there is a smallest
value of h, call it h*, below which the error bound
will begin to increase. To find it, let E'(h) = 0, with
its root being h*. This leads to h* ≈ 0.0726, which is
consistent with the behavior of the errors in the table.
LINEAR SYSTEMS

Consider the following example of a linear system:


x1 + 2x2 + 3x3 = 5
-x1      +  x3 = 3
3x1 + x2 + 3x3 = 3

Its unique solution is

x1 = -1,   x2 = 0,   x3 = 2

In general we want to solve n equations in n un-
knowns. For this, we need some simplifying nota-
tion. In particular we introduce arrays. We can think
of these as means for storing information about the
linear system in a computer. In the above case, we
introduce

     [  1  2  3 ]        [ 5 ]        [ -1 ]
A =  [ -1  0  1 ],   b = [ 3 ],   x = [  0 ]
     [  3  1  3 ]        [ 3 ]        [  2 ]
These arrays completely specify the linear system and
its solution. We also know that we can give mean-
ing to multiplication and addition of these quantities,
calling them matrices and vectors. The linear system
is then written as
Ax = b
with Ax denoting a matrix-vector multiplication.

The general system is written as


a1,1x1 + + a1,nxn = b1
..
an,1x1 + + an,nxn = bn
This is a system of n linear equations in the n un-
knowns x1, ..., xn. This can be written in matrix-
vector notation as
Ax = b

a1,1 a1,n b1 x1
.. .. ..
A = .. ... , b = x=
an,1 an,n bn xn
A TRIDIAGONAL SYSTEM

Consider the tridiagonal linear system

     3x_1 -  x_2                         = 2
    - x_1 + 3x_2 -  x_3                  = 1
            ...
        - x_{n-2} + 3x_{n-1} -  x_n      = 1
                   - x_{n-1} + 3x_n      = 2

The solution is
    x_1 = ... = x_n = 1
This has the associated arrays

        [  3  -1   0  ...        0 ]        [ 2 ]        [ 1 ]
        [ -1   3  -1  ...        0 ]        [ 1 ]        [ 1 ]
    A = [         ...              ],   b = [ ...],  x = [ ...]
        [         -1    3      -1  ]        [ 1 ]        [ 1 ]
        [  0  ...       -1      3  ]        [ 2 ]        [ 1 ]
SOLVING LINEAR SYSTEMS

Linear systems Ax = b occur widely in applied math-


ematics. They occur as direct formulations of real
world problems; but more often, they occur as a part
of the numerical analysis of some other problem. As
examples of the latter, we have the construction of
spline functions, the numerical solution of systems of
nonlinear equations, ordinary and partial dierential
equations, integral equations, and the solution of op-
timization problems.

There are many ways of classifying linear systems.

Size: Small, moderate, and large. This of course


varies with the machine you are using. Most PCs
are now being sold with a memory of 256 to 512
megabytes (MB), and my old HP workstation has 768
MB (it is over four years old).
For a matrix A of order n n, it will take 8n2 bytes
to store it in double precision. Thus a matrix of order
8000 will need around 512 MB of storage. The latter
would be too large for most present day PCs, if the
matrix was to be stored in the computers memory,
although one can easily expand a PC to contain much
more memory than this.

Sparse vs. Dense. Many linear systems have a matrix


A in which almost all the elements are zero. These
matrices are said to be sparse. For example, it is quite
common to work with tridiagonal matrices

a1 c1 0 0
..
b2 a2 c2 0

A=
0 b3 a3 c3

.. ...

0 bn an
in which the order is 104 or much more. For such
matrices, it does not make sense to store the zero ele-
ments; and the sparsity should be taken into account
when solving the linear system Ax = b. Also, the
sparsity need not be as regular as in this example.
BASIC DEFINITIONS & THEORY

A homogeneous linear system Ax = b is one for which the right-hand constants are
all zero. Using vector notation, we say b is the zero vector for a homogeneous
system. Otherwise the linear system is called non-homogeneous.

Theorem. The following are equivalent statements.


(1) For each b, there is exactly one solution x.
(2) For each b, there is a solution x.
(3) The homogeneous system Ax = 0 has only the
solution x = 0.
(4) det (A) 6= 0.
(5) A1 exists. [The matrix inverse and determinant
are introduced in 6.2, but they belong as a part of
this theorem.]
EXAMPLE. Consider again the tridiagonal system
     3x_1 -  x_2                         = 2
    - x_1 + 3x_2 -  x_3                  = 1
            ...
        - x_{n-2} + 3x_{n-1} -  x_n      = 1
                   - x_{n-1} + 3x_n      = 2
The homogeneous version is simply
     3x_1 -  x_2                         = 0
    - x_1 + 3x_2 -  x_3                  = 0
            ...
        - x_{n-2} + 3x_{n-1} -  x_n      = 0
                   - x_{n-1} + 3x_n      = 0
Assume x ≠ 0, and therefore that x has at least one nonzero component. Let x_k
denote a component of maximum size:
    |x_k| = max_{1≤j≤n} |x_j|
Consider now equation k, and assume 1 < k < n. Then
    - x_{k-1} + 3x_k - x_{k+1} = 0
    x_k = (1/3)(x_{k-1} + x_{k+1})
    |x_k| ≤ (1/3)(|x_{k-1}| + |x_{k+1}|)
         ≤ (1/3)(|x_k| + |x_k|)
         = (2/3)|x_k|
This implies x_k = 0, and therefore x = 0. A similar proof is valid if k = 1 or
k = n, using the first or the last equation, respectively.

Thus the original tridiagonal linear system Ax = b has


a unique solution x for each right side b.
METHODS OF SOLUTION

There are two general categories of numerical methods


for solving Ax = b.

Direct Methods: These are methods with a finite


number of steps; and they end with the exact solution
x, provided that all arithmetic operations are exact.
The most used of these methods is Gaussian elimi-
nation, which we begin with. There are other direct
methods, but we do not study them here.

Iteration Methods: These are used in solving all types


of linear systems, but they are most commonly used
with large sparse systems, especially those produced
by discretizing partial dierential equations. This is
an extremely active area of research.
MATRICES in MATLAB

Consider the matrices



1 2 3 1

A = 2 2 3 , b= 1
3 3 3 1
In MATLAB, A can be created as follows.
A = [1 2 3; 2 2 3; 3 3 3];
A = [1, 2, 3; 2, 2, 3; 3, 3, 3];
A = [1 2 3
2 2 3
3 3 3] ;
Commas can be used to replace the spaces. The vec-
tor b can be created by

b = ones(3, 1);
Consider setting up the matrices for the system
Ax = b with

Ai,j = max {i, j} , bi = 1, 1 i, j n


One way to set up the matrix A is as follows:
A = zeros(n, n);
for i = 1 : n
A(i, 1 : i) = i;
A(i, i + 1 : n) = i + 1 : n;
end
and set up the vector b by

b = ones(n, 1);
MATRICES

Matrices are rectangular arrays of real or complex


numbers. With them, we define arithmetic operations that are generalizations of
those for real and complex numbers. The general form of a matrix of order m × n
is
        [ a_{1,1}  ...  a_{1,n} ]
    A = [   ...    ...    ...   ]
        [ a_{m,1}  ...  a_{m,n} ]
We say it has order m × n. Matrices that consist of a single column are called
column vectors, and those consisting of a single row are called row vectors. In
both cases, they have properties identical to the geometric vectors studied
earlier in multivariable calculus. I assume that most of you have seen this
material previously in a course named linear algebra, matrix algebra, or
something similar. Section 6.2 in the text is intended as both a quick
introduction and review of this material.
MATRIX ADDITION

Let A = [a_{i,j}] and B = [b_{i,j}] be matrices of order m × n. Then
    C = A + B
is another matrix of order m × n, with
    c_{i,j} = a_{i,j} + b_{i,j}

EXAMPLE.
    [ 1  2 ]   [  1  -1 ]   [ 2  1 ]
    [ 3  4 ] + [ -1   1 ] = [ 2  5 ]
    [ 5  6 ]   [  1  -1 ]   [ 6  5 ]
MULTIPLICATION BY A CONSTANT


a1,1 a1,n ca1,1 ca1,n
..
c .. ... =
.. ... ..
am,1 am,n cam,1 cam,n

EXAMPLE.

1 2 5 10

5 3 4 = 15 20
5 6 25 30
" # " #
a b a b
(1) =
c d c d
THE ZERO MATRIX 0

Define the zero matrix of order m n as the matrix


of that order having all zero entries. It is sometimes
written as 0mn, but more commonly as simply 0.
Then for any matrix A of order m n,

A+0=0+A=A
The zero matrix 0mn acts in the same role as does
the number zero when doing arithmetic with real and
complex numbers.

EXAMPLE.
" # " # " #
1 2 0 0 1 2
+ =
3 4 0 0 3 4
We denote by A the solution of the equation

A+B =0
It is the matrix obtained by taking the negative of all
of the entries in A. For example,
" # " # " #
a b a b 0 0
+ =
c d c d 0 0
" # " # " #
a b a b a b
= = (1)
c d c d c d
" # " #
a1,1 a1,2 a1,1 a1,2
=
a2,1 a2,2 a2,1 a2,2
MATRIX MULTIPLICATION

h i h i
Let A = ai,j have order m n and B = bi,j have
order n p. Then

C = AB
is a matrix of order m p and
ci,j = Ai,B,j
= ai,1b1,j + ai,2b2,j + + ai,nbn,j
or equivalently

b1,j
h i b2,j

ci,j = ai,1 ai,2 ai,n ..

bn,j
= ai,1b1,j + ai,2b2,j + + ai,nbn,j
EXAMPLES


" # 1 2 " #
1 2 3 22 28
3 4 =
4 5 6 49 64
5 6

1 2 " # 9 12 15
1 2 3
3 4 = 19 26 33
4 5 6
5 6 29 40 51

a1,1 a1,n x1 a1,1x1 + + a1,nxn
. ... .. . ..
. . =
an,1 an,n xn an,1x1 + + an,nxn
Thus we write the linear system
a1,1x1 + + a1,nxn = b1
..
an,1x1 + + an,nxn = bn
as
Ax = b
THE IDENTITY MATRIX I

For a given integer n 1, Define In to be the matrix


of order n n with 1s in all diagonal positions and
zeros elsewhere:

1 0 ... 0
0 1 0

In = .. . . . ..

0 ... 1
More commonly it is denoted by simply I.

Let A be a matrix of order m n. Then

AIn = A, ImA = A
The identity matrix I acts in the same role as does
the number 1 when doing arithmetic with real and
complex numbers.
THE MATRIX INVERSE

Let A be a matrix of order n n for some n 1. We


say a matrix B is an inverse for A if

AB = BA = I
It can be shown that if an inverse exists for A, then
it is unique.

EXAMPLES. If ad - bc ≠ 0, then
    [ a  b ]^(-1)        1      [  d  -b ]
    [ c  d ]      =   -------   [ -c   a ]
                      ad - bc

    [ 1  2 ]^(-1)   [ -1    1   ]
    [ 2  2 ]      = [  1  -1/2  ]

    [  1   1/2  1/3 ]^(-1)   [   9   -36    30 ]
    [ 1/2  1/3  1/4 ]      = [ -36   192  -180 ]
    [ 1/3  1/4  1/5 ]        [  30  -180   180 ]
Recall the earlier theorem on the solution of linear
systems Ax = b with A a square matrix.

Theorem. The following are equivalent statements.

1. For each b, there is exactly one solution x.

2. For each b, there is a solution x.

3. The homogeneous system Ax = 0 has only the


solution x = 0.

4. det (A) 6= 0.

5. A1 exists.
EXAMPLE


        [ 1  2  3 ]
    det [ 4  5  6 ] = 0
        [ 7  8  9 ]
Therefore, the linear system
    [ 1  2  3 ] [ x_1 ]   [ b_1 ]
    [ 4  5  6 ] [ x_2 ] = [ b_2 ]
    [ 7  8  9 ] [ x_3 ]   [ b_3 ]
is not always solvable, the coefficient matrix does not have an inverse, and the
homogeneous system Ax = 0 has a solution other than the zero vector, namely
    [ 1  2  3 ] [  1 ]   [ 0 ]
    [ 4  5  6 ] [ -2 ] = [ 0 ]
    [ 7  8  9 ] [  1 ]   [ 0 ]
The arithmetic properties of matrix addition and multiplication are listed on
page 248, and some of them require some work to show. For example, consider
showing the associative law for matrix multiplication,
    (AB) C = A (BC)
with A, B, C matrices of respective orders m × n, n × p, and p × q. Writing this
out, we want to show
    sum_{k=1}^{p} (AB)_{i,k} C_{k,l} = sum_{j=1}^{n} A_{i,j} (BC)_{j,l}
for 1 ≤ i ≤ m, 1 ≤ l ≤ q.

With new situations, we often use notation to suggest


what should be true. But this is done only after de-
ciding what actually is true. You should read carefully
the properties given in the text on page 248.
PARTITIONED MATRICES

Matrices can be built up from smaller matrices; or


conversely, we can decompose a large matrix into a
matrix of smaller matrices. For example, consider

1 2 0 " #
B c
A= 2 1 1 =
d e
0 1 5
" # " #
1 2 0 h i
B= c= d= 0 1 e=5
2 1 1
Matlab allows you to build up larger matrices out of
smaller matrices in exactly this manner; and smaller
matrices can be defined as portions of larger matrices.
We will often write an n n square matrix in terms
of its columns:
h i
A = A,1, ..., A,n
For the n n identity matrix I, we write
I = [e1, ..., en]
with ej denoting a column vector with a 1 in position
j and zeros elsewhere.
ARITHMETIC OF PARTITIONED MATRICES

As with matrices, we can do addition and multiplica-


tion with partitioned matrices provided the individual
constituent parts have the proper orders.

For example, let A, B, C, D be n n matrices. Then


" #" # " #
I A I C I + AD C + A
=
B I D I B + D I + BC

Let A be n n and x be a column vector of length


n. Then

h i x1

Ax = A,1, ..., A,n .. = x1A,1 + +xnA,n
xn
Compare this to

a1,1 a1,n x1 a1,1x1 + + a1,nxn
. ... .. . ..
. . =
an,1 an,n xn an,1x1 + + an,nxn
PARTITIONED MATRICES IN MATLAB

In MATLAB, matrices can be constructed using smaller


matrices. For example, let

A = [1, 2; 3, 4]; x = [5, 6]; y = [7, 8]0;


Then
B = [A, y; x, 9];
forms the matrix

1 2 7

B= 3 4 8
5 6 9
ELEMENTARY ROW OPERATIONS

As preparation for the discussion of Gaussian Elimi-


nation in Section 6.3, we introduce three elementary
row operations on general rectangular matrices. They
are:
i) Interchange of two rows.
ii) Multiplication of a row by a nonzero scalar.
iii) Addition of a nonzero multiple of one row to an-
other row.

Consider the rectangular matrix



3 3 3 1

A= 2 2 3 1
1 2 3 1
We add row 2 times (1) to row 1, and then add row
3 times (1) to row 2 to obtain the matrix:

1 1 0 0

1 0 0 0
1 2 3 1
Add row 2 times (1) to row 1, and to row 3 as well:

0 1 0 0

1 0 0 0
0 2 3 1
Add row 1 times (2) to row 3:

0 1 0 0

1 0 0 0
0 0 3 1
Interchange row 1 and row 2:

1 0 0 0

0 1 0 0
0 0 3 1
Finally, we multiply row 3 by 1/3:

1 0 0 0

0 1 0 0
0 0 1 1/3
This is obtained from A using elementary row opera-
tions. A reverse sequence of operations of the same
type converts this result back to A.
SOLVING LINEAR SYSTEMS

We want to solve the linear system


a1,1x1 + + a1,nxn = b1
..
an,1x1 + + an,nxn = bn
This will be done by the method used in beginning
algebra, by successively eliminating unknowns from
equations, until eventually we have only one equation
in one unknown. This process is known as Gaussian
elimination. To put it onto a computer, however, we
must be more precise than is generally the case in high
school algebra.

We begin with the linear system


     3x_1 - 2x_2 -  x_3 =  0    (E1)
     6x_1 - 2x_2 + 2x_3 =  6    (E2)
    -9x_1 + 7x_2 +  x_3 = -1    (E3)
[1] Eliminate x_1 from equations (E2) and (E3). Subtract 2 times (E1) from (E2);
and subtract -3 times (E1) from (E3). This yields
     3x_1 - 2x_2 -  x_3 =  0    (E1)
            2x_2 + 4x_3 =  6    (E2)
             x_2 - 2x_3 = -1    (E3)
[2] Eliminate x_2 from equation (E3). Subtract 1/2 times (E2) from (E3). This
yields
     3x_1 - 2x_2 -  x_3 =  0    (E1)
            2x_2 + 4x_3 =  6    (E2)
                  -4x_3 = -4    (E3)
Using back substitution, solve for x_3, x_2, and x_1, obtaining
    x_3 = x_2 = x_1 = 1
In the computer, we work on the arrays rather than
on the equations. To illustrate this, we repeat the
preceding example using array notation.

The original system is Ax = b, with
         [  3  -2  -1 ]        [  0 ]
    A =  [  6  -2   2 ],   b = [  6 ]
         [ -9   7   1 ]        [ -1 ]
We often write these in combined form as an augmented matrix:
               [  3  -2  -1 |  0 ]
    [A | b]  = [  6  -2   2 |  6 ]
               [ -9   7   1 | -1 ]
In step 1, we eliminate x_1 from equations 2 and 3. We multiply row 1 by 2 and
subtract it from row 2; and we multiply row 1 by -3 and subtract it from row 3.
This yields
    [  3  -2  -1 |  0 ]
    [  0   2   4 |  6 ]
    [  0   1  -2 | -1 ]
In step 2, we eliminate x_2 from equation 3. We multiply row 2 by 1/2 and
subtract it from row 3. This yields
    [  3  -2  -1 |  0 ]
    [  0   2   4 |  6 ]
    [  0   0  -4 | -4 ]
Then we proceed with back substitution as previously.
For the general case, we reduce

              [ a_{1,1}^(1)  ...  a_{1,n}^(1) | b_1^(1) ]
    [A | b] = [     ...      ...      ...     |   ...   ]
              [ a_{n,1}^(1)  ...  a_{n,n}^(1) | b_n^(1) ]

in n - 1 steps to the form

    [ a_{1,1}^(1)  ...       a_{1,n}^(1)  | b_1^(1) ]
    [      0       ...           ...      |   ...   ]
    [      0       ...  0   a_{n,n}^(n)   | b_n^(n) ]

More simply, and introducing new notation, this is equivalent to the
matrix-vector equation Ux = g:

    [ u_{1,1}  ...  u_{1,n} ] [ x_1 ]   [ g_1 ]
    [    0     ...    ...   ] [ ... ] = [ ... ]
    [    0      0   u_{n,n} ] [ x_n ]   [ g_n ]

This is the linear system
    u_{1,1} x_1 + u_{1,2} x_2 + ... + u_{1,n-1} x_{n-1} + u_{1,n} x_n = g_1
        ...
                           u_{n-1,n-1} x_{n-1} + u_{n-1,n} x_n = g_{n-1}
                                                  u_{n,n} x_n = g_n
We solve for x_n, then x_{n-1}, and backwards to x_1. This process is called
back substitution:
    x_n = g_n / u_{n,n}
    x_k = [ g_k - ( u_{k,k+1} x_{k+1} + ... + u_{k,n} x_n ) ] / u_{k,k}
for k = n-1, ..., 1. What we have done here is simply a more carefully defined
and methodical version of what you have done in high school algebra.
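As a concrete illustration of the reduction to Ux = g followed by back substitution, here is a minimal MATLAB sketch without pivoting (not the text's program); it uses the 3 × 3 example system above as test data:

    A = [3 -2 -1; 6 -2 2; -9 7 1];   b = [0; 6; -1];   % example system from above
    n = length(b);
    for k = 1:n-1                        % forward elimination, no pivoting
        for i = k+1:n
            m = A(i,k) / A(k,k);         % multiplier m_{i,k}
            A(i,k:n) = A(i,k:n) - m*A(k,k:n);
            b(i) = b(i) - m*b(k);
        end
    end
    x = zeros(n,1);                      % back substitution on U x = g
    x(n) = b(n) / A(n,n);
    for k = n-1:-1:1
        x(k) = (b(k) - A(k,k+1:n)*x(k+1:n)) / A(k,k);
    end
    disp(x')                             % should print 1 1 1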
How do we carry out the conversion of

(1) (1) (1)
a
1,1 a1,n b1
.
. ... .. ..



(1) (1) (1)
an,1 an,n bn
to

(1) (1) (1)
a
1,1 a1,n b1
... .. ..
0

.. ... ..


(n) (n)
0 0 an,n bn
To help us keep track of the steps of this process, we
will denote the initial system by

(1) (1) (1)
a
1,1
a1,n b
1.

(1) (1)
[A | b ] = . ... .. .
.


(1) (1) (1)
an,1 an,n bn
Initially we will make the assumption that every pivot
element will be nonzero; and later we remove this
assumption.
Step 1. We will eliminate x_1 from equations 2 thru n. Begin by defining the
multipliers
    m_{i,1} = a_{i,1}^(1) / a_{1,1}^(1),   i = 2, ..., n
Here we are assuming the pivot element a_{1,1}^(1) ≠ 0. Then in succession,
multiply m_{i,1} times row 1 (called the pivot row) and subtract the result from
row i. This yields new matrix elements
    a_{i,j}^(2) = a_{i,j}^(1) - m_{i,1} a_{1,j}^(1),   j = 2, ..., n
    b_i^(2) = b_i^(1) - m_{i,1} b_1^(1)
for i = 2, ..., n.

Note that the index j does not include j = 1. The reason is that with the
definition of the multiplier m_{i,1}, it is automatic that
    a_{i,1}^(2) = a_{i,1}^(1) - m_{i,1} a_{1,1}^(1) = 0,   i = 2, ..., n
The augmented matrix now is

(1) (1) (1) (1)
a a1,2 a1,n b
1,1 1

(2) (2) (2)
(2) (2) 0 a2,2 a2,n b
[A | b ] =
.
2
. .. ... .. ..



(2) (2) b(2)
0 an,2 an,n n
Step k: Assume that for i = 1, ..., k 1 the unknown
xi has been eliminated from equations i + 1 thru n.
We have the augmented matrix

(1) (1) (1)
(1)
a1,1 a1,2 a1,n b1

(2) (2) (2)
0 a2,2 a2,n b2

... ... .. ..

(k) (k)
[A | b ] = (k) (k)
.. 0 ak,k ak,n (k)
bk

.. .. ... .. ..


(k)
(k) (k) bn
0 0 an,k an,n
We want to eliminate unknown xk from equations k +
1 thru n. Begin by defining the multipliers
(k)
ai,k
mi,k = (k) , i = k + 1, ..., n
ak,k
(k)
The pivot element is ak,k , and we assume it is nonzero.
Using these multipliers, we eliminate xk from equa-
tions k + 1 thru n. Multiply mi,k times row k (the
pivot row ) and subtract from row i, for i = k + 1 thru
n.
(k+1) (k) (k)
ai,j = ai,j mi,k ak,j , j = k + 1, ..., n

(k+1) (k) (k)


bi = bi mi,k bk
for i = k + 1, ..., n. This yields the augmented matrix
[A(k+1) | b(k+1)]:

(1) (1)
a a1,n (1)
1,1 b1
... ..
0 ..


(k) (k) (k)
ak,k ak,k+1 ak,n b(k)

k
(k+1) (k+1)
.
. 0 ak+1,k+1 ak+1,n b(k+1)

k+1
.. .. ... .. ..



(k+1) (k+1) b(k+1)
0 0 an,k+1 an,n n
Doing this for k = 1, 2, ..., n 1 leads to the upper
triangular system with the augmented matrix

(1) (1) (1)
a
1,1 a1,n b1
... .. ..
0

.. ... ..


(n) (n)
0 0 an,n bn
We later remove the assumption
(k)
ak,k 6= 0, k = 1, 2, ..., n
QUESTIONS

How do we remove the assumption on the pivot elements?

How many operations are involved in this procedure?

How much error is there in the computed solution due to rounding errors in the
calculations?

How does the machine architecture affect the implementation of this algorithm?
PARTIAL PIVOTING

Recall the reduction of



(1) (1) (1)
a
1,1 a1,n b1
.
(1) (1)
[A | b ] = ... .. ..
.


(1) (1) (1)
an,1 an,n bn
to

(1) (1) (1) (1)
a a1,2 a1,n b
1,1 1

(2) (2) (2)
(2) (2) 0 a2,2 a2,n b
[A | b ] =
.
2
. .. ... .. ..


(2)

(2) (2)
0 an,2 an,n bn
What if a_{1,1}^(1) = 0? In that case we look for an equation in which x_1 is
present. To do this in such a way as to avoid zero pivots to the maximum extent
possible, we do the following.

Look at all the elements in the first column,
    a_{1,1}^(1), a_{2,1}^(1), ..., a_{n,1}^(1)
and pick the largest in size. Say it is
    |a_{k,1}^(1)| = max_{j=1,...,n} |a_{j,1}^(1)|

Then interchange equations 1 and k, which means interchanging rows 1 and k in
the augmented matrix [A^(1) | b^(1)]. Then proceed with the elimination of x_1
from equations 2 thru n as before.

Having obtained

(1) (1) (1) (1)
a a1,2 a1,n b
1,1 1

(2) (2) (2)
(2) (2) 0 a2,2 a2,n b
[A | b ] =
.
2
.
. .. ... .. .

(2)
(2) (2) b
0 an,2 an,n n
(2)
what if a2,2 = 0? Then we proceed as before.
Among the elements
(2) (2) (2)
a2,2, a3,2, ..., an,2
pick the one of largest size:

(2) (2)
a = max a
k,2
j=2,...,n j,2

Interchange rows 2 and k. Then proceed as before to


eliminate x2 from equations 3 thru n, thus obtaining

(1) (1) (1) (1)
a a a1,3 a1,n (1)
1,1 1,2 b1

(2) (2) (2) (2)
0 a a2,3 a2,n b2
2,2

[A(3) | b(3)] =
0 0
(3)
a3,3
(3)
a3,n


(3)
b3



. .. .. ... .. ..
.

(3) (3) (3)
bn
0 0 an,3 an,n

This is done at every stage of the elimination process.


This technique is called partial pivoting, and it is a
part of most Gaussian elimination programs (including
the one in the text).
Consequences of partial pivoting. Recall the defini-
tion of the elements obtained in the process of elimi-
nating x1 from equations 2 thru n.
(1)
ai,1
mi,1 = (1) , i = 2, ..., n
a1,1

(2) (1) (1)


ai,j = ai,j mi,1a1,j , j = 2, ..., n
(2) (1) (1)
bi = bi mi,1b1
for i = 2, ..., n. By our definition of the pivot element a_{1,1}^(1), we have
    |m_{i,1}| ≤ 1,   i = 2, ..., n
Thus in the calculation of a_{i,j}^(2) and b_i^(2), the elements do not grow
rapidly in size. This is in comparison to what might happen otherwise, in which
the multipliers m_{i,1} might have been very large. This property is true of the
multipliers at every step of the elimination process:
    |m_{i,k}| ≤ 1,   i = k+1, ..., n,   k = 1, ..., n-1
The property
    |m_{i,k}| ≤ 1,   i = k+1, ..., n
leads to good error propagation properties in Gaussian
elimination with partial pivoting. The only error in
Gaussian elimination is that derived from the round-
ing errors in the arithmetic operations. For example,
at the first elimination step (eliminating x1 from equa-
tions 2 thru n),
(2) (1) (1)
ai,j = ai,j mi,1a1,j , j = 2, ..., n
(2) (1) (1)
bi = bi mi,1b1
The above property on the size of the multipliers pre-
vents these numbers and the errors in their calculation
from growing as rapidly as they might if no partial piv-
oting was used.

As an example of the improvement in accuracy ob-


tained with partial pivoting, see the example on pages
262-263.
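The row-interchange logic is a small change to the earlier elimination sketch. The following MATLAB function is a minimal illustration of elimination with partial pivoting (not the program in the text):

    function x = gauss_pp(A, b)
    % Gaussian elimination with partial pivoting, then back substitution.
    n = length(b);
    for k = 1:n-1
        [~, p] = max(abs(A(k:n,k)));     % row with the largest pivot candidate
        p = p + k - 1;
        A([k p],:) = A([p k],:);         % interchange rows k and p
        b([k p])   = b([p k]);
        for i = k+1:n
            m = A(i,k) / A(k,k);         % |m| <= 1 because of the pivoting choice
            A(i,k:n) = A(i,k:n) - m*A(k,k:n);
            b(i) = b(i) - m*b(k);
        end
    end
    x = zeros(n,1);
    x(n) = b(n) / A(n,n);
    for k = n-1:-1:1
        x(k) = (b(k) - A(k,k+1:n)*x(k+1:n)) / A(k,k);
    end
    end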
OPERATION COUNTS

One of the major ways in which we compare the e-


ciency of dierent numerical methods is to count the
number of needed arithmetic operations. For solving
the linear system
a1,1x1 + + a1,nxn = b1
..
an,1x1 + + an,nxn = bn
using Gaussian elimination, we have the following op-
eration counts.
1. A → U, where we are converting Ax = b to Ux = g:
    Divisions          n(n - 1)/2
    Additions          n(n - 1)(2n - 1)/6
    Multiplications    n(n - 1)(2n - 1)/6
2. b → g:
    Additions          n(n - 1)/2
    Multiplications    n(n - 1)/2
3. Solving Ux = g:
    Divisions          n
    Additions          n(n - 1)/2
    Multiplications    n(n - 1)/2

On some machines, the cost of a division is much more than that of a
multiplication; whereas on others there is not any important difference. We
assume the latter; and then the operation costs are as follows.
    MD(A → U)   = n(n^2 - 1)/3
    MD(b → g)   = n(n - 1)/2
    MD(Find x)  = n(n + 1)/2
    AS(A → U)   = n(n - 1)(2n - 1)/6
    AS(b → g)   = n(n - 1)/2
    AS(Find x)  = n(n - 1)/2
Thus the total number of operations is
    Additions                       (2n^3 + 3n^2 - 5n)/6
    Multiplications and Divisions   (n^3 + 3n^2 - n)/3
Both are around n^3/3, and thus the total operation count is approximately
    (2/3) n^3
What happens to the cost when n is doubled?
Solving Ax = b and Ax = c. What is the cost? Only
the modification of the right side is dierent in these
two cases. Thus the additional cost is
!
M D(b g)
= n2
MD(Find x)
!
AS(b g)
= n(n 1)
AS(Find x)
The total is around 2n2 operations, which is quite a
bit smaller than 23 n3 when n is even moderately large,
say n = 100.

Thus one can solve the linear system Ax = c at little


additional cost to that for solving Ax = b. This has
important consequences when it comes to estimation
of the error in computed solutions.
CALCULATING THE MATRIX INVERSE

Consider finding the inverse of a 3 3 matrix



a1,1 a1,2 a1,3 h i

A = a2,1 a2,2 a2,3 = A,1, A,2, A,3
a3,1 a3,2 a3,3
We want to find a matrix
h i
X = X,1, X,2, X,3
for which
AX = I
h i
A X,1, X,2, X,3 = [e1, e2, e3]
h i
AX,1, AX,2, AX,3 = [e1, e2, e3]
This means we want to solve

AX,1 = e1, AX,2 = e2, AX,3 = e3


We want to solve three linear systems, all with the
same matrix of coecients A.
In augmented matrix notation, we want to work with

[A | I]


a1,1 a1,2 a1,3 1 0 0

a2,1 a2,2 a2,3 0 1 0

a3,1 a3,2 a3,3 0 0 1
Then we proceed as before with a single linear system,
only now we have three right hand sides. We first
introduce zeros into positions 2 and 3 of column 1;
and then we introduce zero into position 3 of column
2. Following that, we will need to solve three upper
triangular linear systems by back substitution. See the
numerical example on pages 264-266.
MATRIX INVERSE EXAMPLE


1 1 2

A= 1 1 1
1 1 0

1
1 2 1 0 0

1 1 1 0 1 0

1 1 0 0 0 1
m2,1 = 1 m3,1 = 1

1
1 2 1 0 0

0 0 3 1 1 0

0 2 2 1 0 1

1
1 2 1 0 0

0 2 2 1 0 1

0 0 3 1 1 0

1
1 2 1 0 0

0 2 2 1 0 1

0 0 3 1 1 0
Then by using back substitution to solve for each col-
umn of the inverse, we obtain

1 1 1
6 3 2
1 1 1
A = 6 3 12

13 1
3 0
COST OF MATRIX INVERSION

In 1
h calculating A , we i are solving for the matrix X =
X,1, X,2, . . . , X,n where
h i
A X,1, X,2, . . . , X,n = [e1, e2, . . . , en]
and ej is column j of the identity matrix. Thus we
are solving n linear systems
AX,1 = e1, AX,2 = e2, . . . , AX,n = en (1)
all with the same coecient matrix. Returning to
the earlier operation counts for solving a single linear
system, we have the following.
Cost of triangulating A:   approximately (2/3) n^3 operations
Cost of solving Ax = b:    2n^2 operations
Thus solving the n linear systems in (1) costs approximately
    (2/3) n^3 + n · 2n^2 = (8/3) n^3 operations
It costs approximately four times as many operations to invert A as to solve a
single system. With attention to the form of the right-hand sides in (1) this
can be reduced to 2n^3 operations.
MATLAB MATRIX OPERATIONS

To solve the linear system Ax = b in Matlab, use

x = A\b
In Matlab, the command

inv (A)
will calculate the inverse of A.

There are many matrix operations built into Matlab,


both for general matrices and for special classes of
matrices. We do not discuss those here, but recom-
mend the student to investigate these thru the Matlab
help options.
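For instance, on the small system set up earlier with A(i,j) = max{i, j} (a short illustration, not from the text):

    A = [1 2 3; 2 2 3; 3 3 3];      % the matrix with A(i,j) = max(i,j), n = 3
    b = ones(3,1);
    x1 = A \ b;                     % preferred: solve by elimination
    x2 = inv(A) * b;                % forms the inverse explicitly (more work)
    disp(norm(x1 - x2))             % the two answers agree to rounding error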
GAUSSIAN ELIMINATION - REVISITED

Consider solving the linear system


2x1 + x2 x3 + 2x4 = 5
4x1 + 5x2 3x3 + 6x4 = 9
2x1 + 5x2 2x3 + 6x4 = 4
4x1 + 11x2 4x3 + 8x4 = 2
by Gaussian elimination without pivoting. We denote
this linear system by Ax = b. The augmented matrix
for this system is

2 1 1 2 5

4 5 3 6 9

[A | b] =
2 5 2 6 4

4 11 4 8 2
To eliminate x1 from equations 2, 3, and 4, use mul-
tipliers

m2,1 = 2, m3,1 = 1, m4,1 = 2


To eliminate x1 from equations 2, 3, and 4, use mul-
tipliers

m2,1 = 2, m3,1 = 1, m4,1 = 2


This will introduce zeros into the positions below the
diagonal in column 1, yielding

2 1 1 2 5

0 3 1 2 1


0 6 3 8 9

0 9 2 4 8

To eliminate x2 from equations 3 and 4, use multipli-


ers
m3,2 = 2, m4,2 = 3
This reduces the augmented matrix to

2 1 1 2 5

0 3 1 2 1


0 0 1 4 11

0 0 1 2 5
To eliminate x3 from equation 4, use the multiplier

m4,3 = 1
This reduces the augmented matrix to

2 1 1 2 5

0 3 1 2 1


0 0 1 4 11

0 0 0 2 6
Return this to the familiar linear system
2x1 + x2 x3 + 2x4 = 5
3x2 x3 + 2x4 = 1
x3 + 4x4 = 11
2x4 = 6
Solving by back substitution, we obtain

x4 = 3, x3 = 1, x2 = 2, x1 = 1
There is a surprising result involving matrices asso-
ciated with this elimination process. Introduce the
upper triangular matrix

2 1 1 2
0 3 1 2

U =
0 0 1 4
0 0 0 2
which resulted from the elimination process. Then
introduce the lower triangular matrix

1 0 0 0 1 0 0 0
m2,1 1 0 0 2 1 0 0

L= =
m3,1 m3,2 1 0 1 2 1 0
m4,1 m4,2 m4,3 1 2 3 1 1
This uses the multipliers introduced in the elimination
process. Then
A = LU

2 1 1 2 1 0 0 0 2 1 1 2
4 5 3 6 2 1 0 0 0 3 1 2

=
2 5 2 6 1 2 1 0 0 0 1 4
4 11 4 8 2 3 1 1 0 0 0 2
In general, when the process of Gaussian elimination
without pivoting is applied to solving a linear system
Ax = b, we obtain A = LU with L and U constructed
as above.

For the case in which partial pivoting is used, we ob-


tain the slightly modified result

LU = P A
where L and U are constructed as before and P is a
permutation matrix. For example, consider

0 0 1 0
1 0 0 0

P =
0 0 0 1
0 1 0 0
Then

0 0 1 0 a1,1 a1,2 a1,3 a1,4 A3,
1 0 0 0 a2,1 a2,2 a2,3 a2,4 A1,

PA = =
0 0 0 1 a3,1 a3,2 a3,3 a3,4 A4,
0 1 0 0 a4,1 a4,2 a4,3 a4,4 A2,

0 0 1 0 a1,1 a1,2 a1,3 a1,4
1 0 0 0 a a2,2 a2,3 a2,4
2,1
PA =
0 0 0 1 a3,1 a3,2 a3,3 a3,4
0 1 0 0 a4,1 a4,2 a4,3 a4,4

A3,
A1,

=
A4,
A2,
The matrix P A is obtained from A by switching around
rows of A. The result LU = P A means that the LU -
factorization is valid for the matrix A with its rows
suitably permuted.
Consequences: If we have a factorization

A = LU
with L lower triangular and U upper triangular, then
we can solve the linear system Ax = b in a relatively
straightforward way.

The linear system can be written as
    LUx = b
Write this as a two stage process:
    Lg = b,   Ux = g
The system Lg = b is a lower triangular system
    g_1 = b_1
    l_{2,1} g_1 + g_2 = b_2
    l_{3,1} g_1 + l_{3,2} g_2 + g_3 = b_3
        ...
    l_{n,1} g_1 + ... + l_{n,n-1} g_{n-1} + g_n = b_n
We solve it by forward substitution. Then we solve the upper triangular system
Ux = g by back substitution.
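In MATLAB the two-stage solve can be written directly with the built-in lu factorization (discussed again later in these notes). A minimal sketch, using a small test matrix of my own choosing:

    A = [4 -1 0; -1 4 -1; 0 -1 4];   b = [1; 2; 3];   % hypothetical test system
    [L, U, P] = lu(A);      % P*A = L*U, computed with partial pivoting
    g = L \ (P*b);          % forward substitution (L is lower triangular)
    x = U \ g;              % back substitution (U is upper triangular)
    disp(norm(A*x - b))     % residual should be at rounding-error level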
VARIANTS OF GAUSSIAN ELIMINATION

If no partial pivoting is needed, then we can look for


a factorization
A = LU
without going thru the Gaussian elimination process.

For example, suppose A is 4 4. We write



a1,1 a1,2 a1,3 a1,4
a2,1 a2,2 a2,3 a2,4


a3,1 a3,2 a3,3 a3,4
a4,1 a4,2 a4,3 a4,4

1 0 0 0 u1,1 u1,2 u1,3 u1,4
1 0 0 0 u2,2 u2,3 u2,4

= 2,1
3,1 3,2 1 0 0 0 u3,3 u3,4
4,1 4,2 4,3 1 0 0 0 u4,4
n o n o
To find the elements i,j and ui,j , we multiply
the right side matrices L and U and match the results
with the corresponding elements in A.
Multiplying the first row of L times all of the columns
of U leads to

u1,j = a1,j , j = 1, 2, 3, 4
Then multiplying rows 2, 3, 4 times the first column
of U yields

i,1u1,1 = ai,1, i = 2, 3, 4
n o
and we can solve for 2,1, 3,1, 4,1 . We can con-
tinue this process, finding the second row of U and
then the second column of L, and so on. For example,
to solve for 4,3, we need to solve for it in

4,1u1,3 + 4,2u2,3 + 4,3u3,3 = a4,3


Why do this? A hint of an answer is given by this
last equation. If we had an n n matrix A, then we
would find n,n1 by solving for it in the equation

n,1u1,n1+ n,2u2,n1+ + n,n1un1,n1 = an,n1


h i
an,n1 n,1u1,n1 + + n,n2un2,n1
n,n1 =
un1,n1
Embedded in this formula we have a dot product. This
is in fact typical of this process, with the length of the
inner products varying from one position to another.

Recalling 2.4 and the discussion of dot products, we


can evaluate this last formula by using a higher pre-
cision arithmetic and thus avoid many rounding er-
rors. This leads to a variant of Gaussian elimination
in which there are far fewer rounding errors.

With ordinary Gaussian elimination, the number of


rounding errors is proportional to n3. This reduces
the number of rounding errors, with the number now
being proportional to only n2. This can lead to major
increases in accuracy, especially for matrices A which
are very sensitive to small changes.
TRIDIAGONAL MATRICES


b1 c1 0 0 0

a2 b2 c2 0
..
0 a3 b3 c3
A=
...


..
an1 bn1 cn1
0 an bn
These occur very commonly in the numerical solution
of partial dierential equations, as well as in other ap-
plications (e.g. computing interpolating cubic spline
functions).

We factor A = LU , as before. But now L and U


take very simple forms. Before proceeding, we note
with an example that the same may not be true of the
matrix inverse.
EXAMPLE

Define an n × n tridiagonal matrix

        [ -1   1    0   ...           0     ]
        [  1  -2    1   ...                 ]
    A = [  0   1   -2    1  ...             ]
        [              ...                  ]
        [             1    -2        1      ]
        [  0   ...         1   -(n-1)/n     ]

Then A^{-1} is given by
    (A^{-1})_{i,j} = max{i, j}
Thus the sparse matrix A can (and usually does) have a dense inverse.
We factor A = LU, with

1 0 0 0 0

2 1 0 0
..
0 3 1 0
L=
...


..
n1 1 0
0 n 1

1 c1 0 0 0

0 2 c2 0
..
0 0 3 c3
U =
...


..
0 n1 cn1
0 0 n
Multiply these and match coecients with A to find
{i, i}.
By doing a few multiplications of rows of L times
columns of U , we obtain the general pattern as fol-
lows.
    β_1 = b_1                                        : row 1 of LU
    α_2 β_1 = a_2,   α_2 c_1 + β_2 = b_2             : row 2 of LU
        ...
    α_n β_{n-1} = a_n,   α_n c_{n-1} + β_n = b_n     : row n of LU

These are straightforward to solve:
    β_1 = b_1
    α_j = a_j / β_{j-1},   β_j = b_j - α_j c_{j-1},   j = 2, ..., n
Here α_2, ..., α_n denote the subdiagonal entries of L and β_1, ..., β_n the
diagonal entries of U.

To solve the linear system
    Ax = f,   or   LUx = f,
instead solve the two triangular systems
    Lg = f,   Ux = g
Solving Lg = f:
    g_1 = f_1
    g_j = f_j - α_j g_{j-1},   j = 2, ..., n
Solving Ux = g:
    x_n = g_n / β_n
    x_j = (g_j - c_j x_{j+1}) / β_j,   j = n-1, ..., 1
See the numerical example on page 278.
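A compact MATLAB version of this factor-and-solve procedure might look as follows. This is a sketch assuming no pivoting is needed; the vectors a, b, c hold the sub-, main, and superdiagonals of A and f is the right side, following the notation above (a(1) and c(n) are unused):

    function x = trisolve(a, b, c, f)
    % Solve a tridiagonal system: factor A = L*U, then forward/back substitution.
    n = length(b);
    alpha = zeros(n,1);  beta = zeros(n,1);  g = zeros(n,1);  x = zeros(n,1);
    beta(1) = b(1);
    for j = 2:n                          % factorization
        alpha(j) = a(j) / beta(j-1);
        beta(j)  = b(j) - alpha(j)*c(j-1);
    end
    g(1) = f(1);
    for j = 2:n                          % forward substitution: L*g = f
        g(j) = f(j) - alpha(j)*g(j-1);
    end
    x(n) = g(n) / beta(n);
    for j = n-1:-1:1                     % back substitution: U*x = g
        x(j) = (g(j) - c(j)*x(j+1)) / beta(j);
    end
    end

For the earlier system with diagonal 3 and off-diagonals -1, this returns the vector of all ones.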
OPERATIONS COUNT

Factoring A = LU .
Additions: n1
Multiplications: n 1
Divisions: n1
Solving Lz = f and U x = z:
Additions: 2n 2
Multiplications: 2n 2
Divisions: n
Thus the total number of arithmetic operations is ap-
proximately 3n to factor A; and it takes about 5n to
solve the linear system using the factorization of A.

If we had A1 at no cost, what would it cost to com-


pute x = A1f ?
n
X
xi = A1 f , i = 1, ..., n
i,j j
j=1
MATLAB MATRIX OPERATIONS

To obtain the LU-factorization of a matrix, including


the use of partial pivoting, use the Matlab command
lu. In particular,

[L, U, P ] = lu(X)
returns the lower triangular matrix L, upper triangular
matrix U , and permutation matrix P so that

P X = LU
ESTIMATION OF ERROR

Let x̂ denote an approximate solution for Ax = b; perhaps x̂ is obtained by
Gaussian elimination. Let x denote the exact solution. Then introduce
    r = b - A x̂
a quantity called the residual for x̂. Then
    r = b - A x̂ = Ax - A x̂ = A(x - x̂)
    x - x̂ = A^{-1} r
or: the error e = x - x̂ is the exact solution of
    Ae = r
Thus we can solve this system to obtain an estimate ê of our error e.
EXAMPLE. Recall the linear system
.729x1 + .81x2 + .9x3 = .6867
x1 + x2 + x3 = .8338
1.331x1 + 1.21x2 + 1.1x3 = 1.000
The true solution, rounded to four significant digits,
is
x = [.2245, .2814, .3279]T
Using Gaussian elimination without pivoting and four
digit decimal floating point arithmetic with rounding,
the resulting solution and error are
b = [.2251, .2790, .3295]T
x
e = [.0006, .0024, .0016]T
Then

b = [.00006210, .0002000, .0003519]T


r = b Ax
Solving Ae = r by Gaussian elimination, we obtain

e eb = [.0004471, .002150, .001504]T


THE RESIDUAL CORRECTION METHOD

If in the above we had taken eb and added it to x,


b then
we would have obtained an improved answer:

b + eb = [.2247, .2811, .3280]T


xx
Recall
x = [.2245, .2814, .3279]T
With the new approximation, we can repeat the earlier
process of estimating the error and then using it to
improve the answer. This iterative process is called
the residual correction method. It is illustrated with
another example on page 286 in the text.
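The correction loop is easy to express in MATLAB. A minimal sketch, in which a deliberately low-precision first solve stands in for the text's four-digit arithmetic:

    A = [0.729 0.81 0.9; 1 1 1; 1.331 1.21 1.1];
    b = [0.6867; 0.8338; 1.0];
    xhat = double(single(A) \ single(b));   % crude first approximation
    for k = 1:3
        r    = b - A*xhat;      % residual of the current approximation
        ehat = A \ r;           % error estimate from A*e = r
        xhat = xhat + ehat;     % residual correction step
    end
    disp(xhat')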
ERROR ANALYSIS

Begin with a simple example. The system
    7x + 10y = 1
    5x +  7y = .7
has the solution
    x = 0,   y = .1
The perturbed system
    7x̂ + 10ŷ = 1.01
    5x̂ +  7ŷ = .69
has the solution
    x̂ = -.17,   ŷ = .22
Why is there such a difference?
Consider the following Hilbert matrix example.
          [  1   1/2  1/3 ]          [ 1.000  .5000  .3333 ]
    H_3 = [ 1/2  1/3  1/4 ],   Ĥ_3 = [ .5000  .3333  .2500 ]
          [ 1/3  1/4  1/5 ]          [ .3333  .2500  .2000 ]

               [   9   -36    30 ]               [  9.062  -36.32   30.30 ]
    H_3^{-1} = [ -36   192  -180 ],   Ĥ_3^{-1} = [ -36.32   193.7  -181.6 ]
               [  30  -180   180 ]               [  30.30  -181.6   181.5 ]

We have changed H_3 in the fifth decimal place (by rounding the fractions to
four decimal digits). But we have ended with a change in H_3^{-1} in the third
decimal place.
VECTOR NORMS

A norm is a generalization of the absolute value func-


tion, and we use it to measure the size of a vector.
There are a variety of ways of defining the norm of
a vector, with each definition tied to certain applica-
tions.

Euclidean norm: Let x be a column vector with n components. Define
    ||x||_2 = ( sum_{j=1}^{n} |x_j|^2 )^(1/2)
This is the standard definition of the length of a vector, giving us the
straight line distance between head and tail of the vector.

1-norm: Let x be a column vector with n components. Define
    ||x||_1 = sum_{j=1}^{n} |x_j|
For planar applications (n = 2), this is sometimes called the taxi cab norm, as
it corresponds to distance as measured when driving in a city laid out with a
rectangular grid of streets.

∞-norm: Let x be a column vector with n components. Define
    ||x||_∞ = max_{1≤j≤n} |x_j|
This is also called the maximum norm and the Chebyshev norm. It is often used in
numerical analysis where we want to measure the maximum error component in some
vector quantity.

EXAMPLES

Let
    x = [ 1  2  3 ]^T
Then
    ||x||_1 = 6,   ||x||_2 = sqrt(14) ≈ 3.74,   ||x||_∞ = 3

PROPERTIES

Let kk denote a generic norm. Then:

(a) kxk = 0 if and only if x = 0.


(b) kcxk = |c| kxk for any vector x and constant c.
(c) kx + yk kxk + kyk, for all vectors x and y.
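MATLAB's norm function computes all three vector norms just defined; for the example vector above:

    x = [1; 2; 3];
    [norm(x,1), norm(x,2), norm(x,inf)]   % returns 6, 3.7417, 3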
MATRIX NORMS

We also need to measure the sizes of general matrices,


and we need to have some way of relating the sizes
of A and x to the size of Ax. In doing this, we will
consider only square matrices A.

We say a matrix norm is a way of defining the size


of a matrix, again satisfying the properties seen with
vector norms. Thus:

1. kAk = 0 if and only if A = 0.

2. kcAk = |c| kAk for any matrix A and constant c.

3. kA + Bk kAk + kBk, for all matrices A and B


of equal order.
In addition, we can multiply matrices, forming AB
from A and B. With absolute values, we have |ab| =
|a| |b| for all complex numbers a and b. There is no
way of generalizing exactly this relation to matrices A
and B. But we can obtain definitions for which

(d) kABk kAk kBk


Finally, if we are given some vector norm kkv , we can
obtain an associated matrix norm definition for which

(e) kAxkv kAk kxkv


for all n n matrices A and n 1 vectors x.

Often we use as our definition of kAk the smallest


number for which this last inequality is satisfied for
all vectors x. In that case, we also obtain the useful
property
kIk = 1
Let the vector norm be ||·||_∞ for n × 1 vectors x. Then the associated matrix
norm definition is
    ||A||_∞ = max_{1≤i≤n} sum_{j=1}^{n} |a_{i,j}|
This is sometimes called the row norm of a matrix A.

EXAMPLE. Let
    A = [ 1  2 ],   z = [  1 ],   Az = [ -1 ]
        [ 5  7 ]        [ -1 ]         [ -2 ]
Then
    ||A||_∞ = 12,   ||z||_∞ = 1,   ||Az||_∞ = 2
and clearly ||Az||_∞ ≤ ||A||_∞ ||z||_∞. Also let z = [1, 1]^T. Then
    Az = [  3 ],   ||z||_∞ = 1,   ||Az||_∞ = 12
         [ 12 ]
and ||Az||_∞ = ||A||_∞ ||z||_∞.
Let the vector norm be ||·||_1. Then the associated matrix norm definition is
    ||A||_1 = max_{1≤j≤n} sum_{i=1}^{n} |a_{i,j}|
This is sometimes called the column norm of the matrix A.

Let the vector norm be ||·||_2. Then the associated matrix norm definition is
    ||A||_2 = sqrt( r_σ(A^T A) )
To understand this, let B denote an arbitrary square matrix of order n × n. Then
introduce
    σ(B) = { λ : λ an eigenvalue of B }
    r_σ(B) = max_{λ ∈ σ(B)} |λ|
The set σ(B) is called the spectrum of B, and it contains all the eigenvalues of
B. The number r_σ(B) is called the spectral radius of B. There are easily
computable bounds for ||A||_2, but the norm itself is difficult to compute.
ERROR BOUNDS

Let Ax = b and A x̂ = b̂, and we are interested in cases with b̂ ≈ b. Then
    ||x - x̂||_v / ||x||_v  ≤  ||A|| ||A^{-1}|| · ||b - b̂||_v / ||b||_v
where ||·||_v is some vector norm and ||·|| is an associated matrix norm.

Proof:
    Ax - A x̂ = b - b̂
    A(x - x̂) = b - b̂
    x - x̂ = A^{-1}(b - b̂)
    ||x - x̂||_v = ||A^{-1}(b - b̂)||_v ≤ ||A^{-1}|| ||b - b̂||_v
    ||x - x̂||_v / ||x||_v ≤ ||A^{-1}|| ||b - b̂||_v / ||x||_v
Rewrite this as
    ||x - x̂||_v / ||x||_v ≤ ||A|| ||A^{-1}|| · ||b - b̂||_v / ( ||A|| ||x||_v )
Since Ax = b, we have
    ||b||_v = ||Ax||_v ≤ ||A|| ||x||_v
Using this,
    ||x - x̂||_v / ||x||_v ≤ ||A|| ||A^{-1}|| · ||b - b̂||_v / ||b||_v
This completes the proof of the earlier assertion.

The quantity
    cond(A) = ||A|| ||A^{-1}||
is called a condition number for the matrix A.
EXAMPLE. Recall the earlier example:
    7x_1 + 10x_2 = 1          with solution  [ x_1 ]   [  0 ]
    5x_1 +  7x_2 = .7                        [ x_2 ] = [ .1 ]

    7x̂_1 + 10x̂_2 = 1.01       with solution  [ x̂_1 ]   [ -.17 ]
    5x̂_1 +  7x̂_2 = .69                       [ x̂_2 ] = [  .22 ]
Then
    ||b||_∞ = 1,    ||b - b̂||_∞ = .01
    ||x||_∞ = .1,   ||x - x̂||_∞ = .17

    A = [ 7  10 ],   A^{-1} = [ -7   10 ]
        [ 5   7 ]             [  5   -7 ]

    ||A||_∞ = 17,   ||A^{-1}||_∞ = 17,   cond(A) = 289

    ||x - x̂||_∞ / ||x||_∞ = 1.7 = 170 · (.01) = 170 · ||b - b̂||_∞ / ||b||_∞
so that indeed
    ||x - x̂||_∞ / ||x||_∞ ≤ cond(A) · ||b - b̂||_∞ / ||b||_∞
The result

b
b
kx xkv b b
cond(A) v
kxkv kbkv
has another aspect which we do not prove here. Given
any matrix A, then there is a vector b and a nearby
perturbation bb for which the above inequality can be
replaced by equality. Moreover, there is no simple way
to know of these b and bb in advance. For such b and
b
b, we have

b
b
kx xkv b b
cond(A) = v
kxkv kbkv
Thus if cond(A) is very large, say 108, then there are
b and bb for which


b v
kx xk b b
b
= 10 8 v
kxkv kbkv
We call such systems ill-conditioned.
Recall an earlier discussion of error in Gaussian elim-
ination. Let x b denote an approximate solution for
Ax = b; perhaps x b is obtained by Gaussian elimina-
tion. Let x denote the exact solution. Then introduce
the residual
b
r = b Ax
We then obtained x x b = A1r. But we could also
have discussed this as a special case of our present
results. Write

Ax = b and b =br b
Ax b
Then

b
b
kx xkv b b
cond(A) v
kxkv kbkv
becomes
b v
kx xk krkv
cond(A)
kxkv kbkv
ILL-CONDITIONED EXAMPLE

Define the 4 × 4 Hilbert matrix:
          [  1   1/2  1/3  1/4 ]
    H_4 = [ 1/2  1/3  1/4  1/5 ]
          [ 1/3  1/4  1/5  1/6 ]
          [ 1/4  1/5  1/6  1/7 ]
Its inverse is given by
               [   16   -120    240   -140 ]
    H_4^{-1} = [ -120   1200  -2700   1680 ]
               [  240  -2700   6480  -4200 ]
               [ -140   1680  -4200   2800 ]
For the matrix row norm,
    cond(H_4) = (25/12) · 13620 = 28375
Thus rounding error in defining b should lead to errors in solving H_4 x = b
that are larger than the rounding errors by a factor of 10^4 or more.
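MATLAB reproduces this condition number directly; a small check using the built-in functions hilb and invhilb:

    H4 = hilb(4);
    cond(H4, inf)                            % returns 28375, the row-norm condition number
    norm(H4, inf) * norm(invhilb(4), inf)    % the same product computed explicitly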
ITERATION METHODS

These are methods which compute a sequence of pro-


gressively accurate iterates to approximate the solu-
tion of Ax = b.

We need such methods for solving many large lin-


ear systems. Sometimes the matrix is too large to
be stored in the computer memory, making a direct
method too dicult to use.

More importantly, the operations cost of 23 n3 for


Gaussian elimination is too large for most large sys-
tems. With iteration methods, the cost
can often be
reduced to something of cost O n2 or less. Even
when a special form for A can be used to reduce the
cost of elimination, iteration will often be faster.

There are other, more subtle, reasons, which we do


not discuss here.
JACOBIS ITERATION METHOD

We begin with an example. Consider the linear system
    9x_1 +   x_2 +   x_3 = b_1
    2x_1 + 10x_2 +  3x_3 = b_2
    3x_1 +  4x_2 + 11x_3 = b_3
In equation #k, solve for x_k:
    x_1 = (1/9) [b_1 - x_2 - x_3]
    x_2 = (1/10)[b_2 - 2x_1 - 3x_3]
    x_3 = (1/11)[b_3 - 3x_1 - 4x_2]
Let x^(0) = [x_1^(0), x_2^(0), x_3^(0)]^T be an initial guess to the solution x.
Then define
    x_1^(k+1) = (1/9) [b_1 - x_2^(k) - x_3^(k)]
    x_2^(k+1) = (1/10)[b_2 - 2x_1^(k) - 3x_3^(k)]
    x_3^(k+1) = (1/11)[b_3 - 3x_1^(k) - 4x_2^(k)]
for k = 0, 1, 2, .... This is called the Jacobi iteration method or the method
of simultaneous replacements.
NUMERICAL EXAMPLE. Let b = [10, 19, 0]^T. The solution is x = [1, 2, -1]^T.
To measure the error, we use
    Error = ||x - x^(k)||_∞ = max_i | x_i - x_i^(k) |

    k    x_1^(k)   x_2^(k)   x_3^(k)    Error       Ratio
    0    0         0          0         2.00E+0
    1    1.1111    1.9000     0         1.00E+0     0.500
    2    0.9000    1.6778    -0.9939    3.22E-1     0.322
    3    1.0351    2.0182    -0.8556    1.44E-1     0.448
    4    0.9819    1.9496    -1.0162    5.06E-2     0.349
    5    1.0074    2.0085    -0.9768    2.32E-2     0.462
    6    0.9965    1.9915    -1.0051    8.45E-3     0.364
    7    1.0015    2.0022    -0.9960    4.03E-3     0.477
    8    0.9993    1.9985    -1.0012    1.51E-3     0.375
    9    1.0003    2.0005    -0.9993    7.40E-4     0.489
    10   0.9999    1.9997    -1.0003    2.83E-4     0.382
    30   1.0000    2.0000    -1.0000    3.01E-11    0.447
    31   1.0000    2.0000    -1.0000    1.35E-11    0.447
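The iteration above is only a few lines of MATLAB; a minimal sketch for this particular 3 × 3 system:

    A = [9 1 1; 2 10 3; 3 4 11];   b = [10; 19; 0];
    x = zeros(3,1);                 % initial guess x^(0)
    D = diag(diag(A));              % N = diagonal part of A
    for k = 1:30
        x = D \ (b - (A - D)*x);    % Jacobi step: N*x^(k+1) = b + P*x^(k)
    end
    disp(x')                        % approaches [1 2 -1]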
GAUSS-SEIDEL ITERATION METHOD

Again consider the linear system
    9x_1 +   x_2 +   x_3 = b_1
    2x_1 + 10x_2 +  3x_3 = b_2
    3x_1 +  4x_2 + 11x_3 = b_3
and solve for x_k in equation #k:
    x_1 = (1/9) [b_1 - x_2 - x_3]
    x_2 = (1/10)[b_2 - 2x_1 - 3x_3]
    x_3 = (1/11)[b_3 - 3x_1 - 4x_2]
Now immediately use every new iterate:
    x_1^(k+1) = (1/9) [b_1 - x_2^(k) - x_3^(k)]
    x_2^(k+1) = (1/10)[b_2 - 2x_1^(k+1) - 3x_3^(k)]
    x_3^(k+1) = (1/11)[b_3 - 3x_1^(k+1) - 4x_2^(k+1)]
for k = 0, 1, 2, .... This is called the Gauss-Seidel iteration method or the
method of successive replacements.
NUMERICAL EXAMPLE. Let b = [10, 19, 0]^T. The solution is x = [1, 2, -1]^T.
To measure the error, we use
    Error = ||x - x^(k)||_∞ = max_i | x_i - x_i^(k) |

    k    x_1^(k)   x_2^(k)   x_3^(k)    Error      Ratio
    0    0         0          0         2.00E+0
    1    1.1111    1.6778    -0.9131    3.22E-1    0.161
    2    1.0262    1.9687    -0.9958    3.13E-2    0.097
    3    1.0030    1.9981    -1.0001    3.00E-3    0.096
    4    1.0002    2.0000    -1.0001    2.24E-4    0.074
    5    1.0000    2.0000    -1.0000    1.65E-5    0.074
    6    1.0000    2.0000    -1.0000    2.58E-6    0.155

The values of Ratio do not approach a limiting value with larger values of the
iteration index k.
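The corresponding MATLAB sketch only changes which part of A is kept on the left-hand side:

    A = [9 1 1; 2 10 3; 3 4 11];   b = [10; 19; 0];
    x = zeros(3,1);
    N = tril(A);                % lower triangular part of A, including the diagonal
    P = N - A;
    for k = 1:10
        x = N \ (b + P*x);      % Gauss-Seidel step, solved by forward substitution
    end
    disp(x')                    % approaches [1 2 -1], faster than Jacobi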
A GENERAL SCHEMA

Rewrite Ax = b as
    Nx = b + Px                                      (1)
with A = N - P a splitting of A. Choose N to be nonsingular. Usually we want
Nz = f to be easily solvable for arbitrary f. The iteration method is
    N x^(k+1) = b + P x^(k),   k = 0, 1, 2, ...      (2)

EXAMPLE. Let N be the diagonal of A, and let P = N - A. The iteration method is
the Jacobi method:
    a_{i,i} x_i^(k+1) = b_i - sum_{j≠i} a_{i,j} x_j^(k),   1 ≤ i ≤ n
for k = 0, 1, ....
EXAMPLE. Let N be the lower triangular part of A, including its diagonal, and
let P = N - A. The iteration method is the Gauss-Seidel method:
    sum_{j=1}^{i} a_{i,j} x_j^(k+1) = b_i - sum_{j=i+1}^{n} a_{i,j} x_j^(k),   1 ≤ i ≤ n
for k = 0, 1, ....

EXAMPLE. Another method could be defined by let-


ting N be the tridiagonal matrix formed from the di-
agonal, super-diagonal, and sub-diagonal of A, with
P = N A:

a a1,2 0 0
1,1 ... ..
a2,1 a2,2 a 2,3
...
N =
0 0

.. ... a
n1,n2 an1,n1 an1,n
0 0 an,n1 an,n

Solving N x(k+1) = b + P x(k) uses the algorithm for


tridiagonal systems from 6.4.
CONVERGENCE

When does the iteration method (2) converge? Subtract (2) from (1), obtaining
    N (x - x^(k+1)) = P (x - x^(k))
    x - x^(k+1) = N^{-1} P (x - x^(k))
    e^(k+1) = M e^(k),   M = N^{-1} P                 (3)
with e^(k) ≡ x - x^(k).

Return now to the matrix and vector norms of 6.5. Then
    ||e^(k+1)|| ≤ ||M|| ||e^(k)||,   k ≥ 0
Thus the error e^(k) converges to zero if ||M|| < 1, with
    ||e^(k)|| ≤ ||M||^k ||e^(0)||,   k ≥ 0
EXAMPLE. For the earlier example with the Jacobi method,
    x_1^(k+1) = (1/9) [b_1 - x_2^(k) - x_3^(k)]
    x_2^(k+1) = (1/10)[b_2 - 2x_1^(k) - 3x_3^(k)]
    x_3^(k+1) = (1/11)[b_3 - 3x_1^(k) - 4x_2^(k)]

        [    0     -1/9    -1/9  ]
    M = [ -2/10      0     -3/10 ]
        [ -3/11   -4/11      0   ]

    ||M||_∞ = 7/11 ≈ 0.636
This is consistent with the earlier table of values, although the actual
convergence rate was better than predicted by (3).
EXAMPLE. For the earlier example with the Gauss-Seidel method,
    x_1^(k+1) = (1/9) [b_1 - x_2^(k) - x_3^(k)]
    x_2^(k+1) = (1/10)[b_2 - 2x_1^(k+1) - 3x_3^(k)]
    x_3^(k+1) = (1/11)[b_3 - 3x_1^(k+1) - 4x_2^(k+1)]

        [ 9   0   0 ]^{-1} [ 0  -1  -1 ]   [ 0   -1/9    -1/9  ]
    M = [ 2  10   0 ]      [ 0   0  -3 ] = [ 0   1/45    -5/18 ]
        [ 3   4  11 ]      [ 0   0   0 ]   [ 0   1/45    13/99 ]

    ||M||_∞ = 0.3
This too is consistent with the earlier numerical results.
DIAGONALLY DOMINANT MATRICES

Matrices A for which
    |a_{i,i}| > sum_{j≠i} |a_{i,j}|,   i = 1, ..., n
are called diagonally dominant. For the Jacobi iteration method,

        [        0          -a_{1,2}/a_{1,1}   ...   -a_{1,n}/a_{1,1} ]
    M = [ -a_{2,1}/a_{2,2}          0          ...   -a_{2,n}/a_{2,2} ]
        [       ...                ...                      ...       ]
        [ -a_{n,1}/a_{n,n}  ...  -a_{n,n-1}/a_{n,n}          0        ]

With diagonally dominant matrices A,
    ||M||_∞ = max_{1≤i≤n} sum_{j≠i} |a_{i,j}| / |a_{i,i}| < 1           (4)
Thus the Jacobi iteration method for solving Ax = b is convergent.
GAUSS-SEIDEL ITERATION

Assuming A is diagonally dominant, we can show that


the Gauss-Seidel iteration will also converge. How-
ever, constructing M = N 1P is not reasonable for
this method and an alternative approach is needed.
Return to the error equation

N e(k+1) = P e(k)
and write it in component form for the Gauss-Seidel
method:
i
X Xn
(k+1) (k)
ai,j ej = ai,j ej , 1in
j=1 j=i+1

i1
X ai,j (k+1) Xn a
(k+1) i,j (k)
ei = ej ej (5)
a
j=1 i,i a
j=i+1 i,i
Introduce
i1

n a

X ai,j X i,j
i = , i = , 1in
ai,i ai,i
j=1 j=i+1
with 1 = n = 0. Taking bounds in (5),

(k+1)
e e(k+1) + e(k)

, i = 1, ..., n (6)
i i i

Let be an index for which



(k+1) (k+1) (k+1)
e = max e = e

1in i

Then using i = in (6),



(k+1) (k+1) (k)
e e + e


(k+1) (k)
e e
1
Define
i
= max
i 1 i
Then

(k+1) (k)
e e
For A diagonally dominant, it can be shown that

kMk (7)
where kM k is for the definition of M for the Jacobi
method, given earlier in (4) as
n

X ai,j

kM k = max = max (i + i) < 1
1in ai,i 1in
j=1
j6=i
Consequently, for A diagonally dominant, the Gauss-
Seidel method also converges and it does so more
rapidly than the Jacobi method in most cases.

Showing (7) follows by showing


i
(i + i) 0, 1in
1 i
For our earlier example with A of order 3, we have
= 0.375 This is not as good as computing kMk
directly for the Gauss-Seidel method, but it does show
that the rate of convergence is better than for the
Jacobi method.
CONVERGENCE: AN ADDENDUM

Since
kMk = kN 1P k kN 1k kP k,
kMk < 1 is satisfied if N satisfies
kN 1k kP k < 1
Using P = N A, this can be rewritten as
1
kA N k <
kN 1k
We also want to choose N so that systems N z = f
is easily solvable.

GENERAL CONVERGENCE THEOREM:


N x(k+1) = b + P x(k), k = 0, 1, 2, . . . ,
will converge, for all right sides b and all initial guesses
x(0), if and only if all eigenvalues of M = N 1P
satisfy
|| < 1
This is the basis of deriving other splittings A = N P
that lead to convergent iteration methods.
RESIDUAL CORRECTION METHODS

N be an invertible approximation of the matrix A; let


x(0) x for the solution of Ax = b. Define

r(0) = b Ax(0)
Since Ax = b for the true solution x,

r(0) = Ax Ax(0)
= A(x x(0)) = Ae(0)
with e(0) = x x(0). Let e(0) be the solution of

N e(0) = r(0)
and then define

x(1) = x(0) + e(0)


Repeat this process inductively.
RESIDUAL CORRECTION

For k = 0, 1, . . . , define
r(k) = b Ax(k)
N e(k) = r(k)
x(k+1) = x(k) + e(k)
This is the general residual correction method.

To see how this fits into our earlier framework, proceed


as follows:
x(k+1) = x(k) + e(k) = x(k) + N 1r(k)
= x(k) + N 1(b Ax(k))
Thus,
N x(k+1) = N x(k) + b Ax(k)
= b + (N A)x(k)
= b + P x(k)
Sometimes the residual correction scheme is a prefer-
able way of approaching the development of an itera-
tive method.
LEAST SQUARES DATA FITTING

Experiments generally have error or uncertainty in mea-


suring their outcome. Error can be human error, but it is more usually due to
inherent limitations in the equipment being used to make measurements.
Uncertainty can be due to lack of precise definition or of human variation in
what is being measured. (For example, how do you measure how much you like
something?)

We often want to represent the experimental data by


some functional expression. Interpolation is often un-
satisfactory because it fails to account for the error or
uncertainty in the data.

We may know from theory that the data is taken from a


particular form of function (e.g. a quadratic polynomial),
or we may choose to use some particular type of formula
to represent the data.
EXAMPLE

Consider the following data.

Table 1: Empirical data

xi yi xi yi
1 :0 1:945 3:2 0:764
1 :2 1:253 3:4 0:532
1 :4 1:140 3:6 1:073
1 :6 1:087 3:8 1:286
1 :8 0:760 4:0 1:502
2 :0 0:682 4:2 1:582
2 :2 0:424 4:4 1:993
2 :4 0:012 4:6 2:473
2 :6 0:190 4:8 2:503
2 :8 0:452 5:0 2:322
3 :0 0:337

From the following Figure 1 it appears to be approxi-


mately linear.
Figure 1: The plot of empirical data

An experiment seeks to obtain an unknown functional


relationship
y = f ( x) (1)
involving two related variables x and y . We choose vary-
ing values of x, say, x1; x2; : : : ; xn. Then we measure
a corresponding set of values for y . Let the actual mea-
surements be denoted by y1; : : : ; yn, and let

i = f ( xi ) yi
denote the unknown measurement errors. We want to
use the points (x1; y1); : : : ; (xn; yn) to determine the
analytic relationship (1) as accurately as possible.
Often we suspect that the unknown function f (x) lies
within some known class of functions, for example, poly-
nomials. Then we want to choose the member of that
class of functions that will best approximate the unknown
function f (x), taking into account the experimental er-
rors f ig.

As an example of such a situation, consider the data in


Table 1 and the plot of it in Figure 1. From this plot,
it is reasonable to expect f (x) to be close to a linear
polynomial,
f (x) = mx + b (2)
Assuming this to be the true form, the problem of deter-
mining f (x) is now reduced to that of determining the
constants m and b.

We can choose to determine m and b in a number of


ways. We list three such ways.
1. Choose m and b so as to minimize the quantity
n
1X
jf (xi) yi j
n i=1
which can be considered an average approximation error.

2. Choose m and b so as to minimize the quantity


v
u n
u1 X
t [f ( x i ) y i ]2
n i=1
which can also be considered an average approximation
error. It is called the root mean square error in the ap-
proximation of the data f(xi; yi)g by the function f (x).

3. Choose m and b so as to minimize the quantity

max jf (xi) yi j
1 i n
which is the maximum error of approximation.
All of these can be used, but #2 is the favorite, and we
now comment on why. To do so we need to understand
more about the nature of the unknown errors f ig.

Standard assumption: Each error i is a random vari-


able chosen from a normal probability distribution. Intu-
itively, such errors satisfy the following.
(1) If the experiment is repeated many times for the same
x = xi, then the associated unknown errors i in the em-
pirical values yi will be likely to have an average of zero.
(2) For this same experimental case with x = xi, as the
size of i increases, the likelihood of its occurring will
decrease rapidly.

This is the normal error assumption. We also assume


that the individual errors i, 1 i n, are all random
variables from the same normal probability distribution
function, meaning that the size of i is unrelated to the
size of xi or yi.
Assume f (x) is in a known class of functions, call it C .
An example is the assumption that f (x) is linear for the
data in Table 1.

Then among all functions fb(x) in C , it can be shown


that the function fb that is most likely to equal f will
also minimize the expression
v
u n h i2
u1 X
E=t fb(xi) yi (3)
n i=1

among all functions fb in C .

This is called the root-mean-square error in the approx-


imation of the data fyig by fb(x). The function fb (x)
that minimizes E relative to all fb in C is called the least
squares approximation to the data f(xi; yi)g.
EXAMPLE

Return to the data in Table 1, pictured in Figure 1. The


least squares approximation is given by

fb (x) = 1:06338x 2:74605 (4)


It is illustrated graphically in Figure 2.

Figure 2: The linear least squares t fb (x)


CALCULATING THE LEAST SQUARES
APPROXIMATION

How did we calculate fb (x)? We want to minimize


v
u n
u1 X
E t [f ( x i ) y i ]2
n i=1
when considering all possible functions
f (x) = mx + b. Note that minimizing E is equivalent
to minimizing the sum, although the minimum values will
be dierent. Thus we seek to minimize
n
X
G(b; m) = [mxi + b y i ]2 (5)
i=1
as b and m are allowed to vary arbitrarily.

The choices of b and m that minimize G(b; m) will satisfy


@G(b; m) @G(b; m)
= 0; =0 (6)
@b @m
Use
    ∂G/∂b = sum_{i=1}^{n} 2 [m x_i + b - y_i]
    ∂G/∂m = sum_{i=1}^{n} 2 [m x_i + b - y_i] x_i
          = sum_{i=1}^{n} 2 [m x_i^2 + b x_i - x_i y_i]
This leads to the linear system
    n b + ( sum_{i} x_i ) m = sum_{i} y_i
    ( sum_{i} x_i ) b + ( sum_{i} x_i^2 ) m = sum_{i} x_i y_i            (7)
This is uniquely solvable if the determinant is nonzero,
    n sum_{i} x_i^2 - ( sum_{i} x_i )^2 ≠ 0                              (8)
This is true unless
    x_1 = x_2 = ... = x_n = constant
and this is false for our case.
For our example in Table 1,
    sum x_i = 63.0,   sum x_i^2 = 219.8,   sum y_i = 9.326,   sum x_i y_i = 60.7302
Using this in (7), the linear system becomes
    21 b + 63.0 m = 9.326
    63.0 b + 219.8 m = 60.7302
The solution is
    b ≈ -2.74605,   m ≈ 1.06338
    f̂*(x) = 1.06338 x - 2.74605
The root-mean-square error in f̂*(x) is
    E ≈ 0.171
Recall the graph of f̂*(x) is given in Figure 2.
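In MATLAB the same coefficients come out of polyfit, or from solving the normal equations (7) directly. A sketch, assuming the data of Table 1 have been stored in column vectors x and y:

    % x, y: column vectors holding the data of Table 1
    n = length(x);
    S = [n, sum(x); sum(x), sum(x.^2)];       % coefficient matrix of (7)
    rhs = [sum(y); sum(x.*y)];
    bm = S \ rhs;                             % bm(1) ≈ -2.74605 (b), bm(2) ≈ 1.06338 (m)
    p = polyfit(x, y, 1);                     % built-in alternative: p = [m, b]
    E = sqrt(mean((polyval(p, x) - y).^2));   % root-mean-square error, about 0.171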
GENERALIZATION

To represent the data f(xi; yi) j 1 i ng, let

fb(x) = a1'1(x) + a2'2(x) + + a m ' m ( x) (9)


a1; a2; : : : ; am arbitrary numbers, '1(x); : : : ; 'm(x) given
functions.

If fb(x) is to be a quadratic polynomial, write

fb(x) = a1 + a2x + a3x2 (10)

' 1 ( x) 1; '2(x) = x; ' 3 ( x) = x2


Under the normal error assumption, the function fb(x) is
to be chosen to minimize the root-mean-square error
v
u n h i2
u1 X
E=t fb(xi) yi
n i=1
Consider m = 3. Then

fb(x) = a1'1(x) + a2'2(x) + a3'3(x)


Choose a1, a2, a3 to minimize
n
X
G(a1; a2; a3) = [a1'1(xj )+a2'2(xj )+a3'3(xj ) yj ]2
j=1
At the minimizing point (a1; a2; a3),
@G @G @G
= 0; = 0; =0
@a1 @a2 @a3
This leads to the three equations. For i = 1; 2; 3,
Xn
@G
0= = 2[a1'1(xj )+a2'2(xj )+a3'3(xj ) yj ]'i(xj )
@ai j=1
2 3 2 3
n
X n
X
4 ' 1 ( xj ) ' i ( xj ) 5 a 1 + 4 ' 2 ( xj ) ' i ( xj ) 5 a 2
j=1 j=1
2 3
n
X n
X
+4 ' 3 ( xj ) ' i ( xj ) 5 a 3 = y j ' i ( xj ) ; (11)
j=1 j=1
Apply this to the quadratic formula

fb(x) = a1 + a2x + a3x2

' 1 ( x) 1; '2(x) = x; ' 3 ( x) = x2


Then the three equations are
2 3 2 3
n
X n
X n
X
na1 + 4 xj 5 a 2 + 4 x2j 5 a3 = yj
j=1 j=1 j=1
2 3 2 3
2 3
n
X Xn n
X Xn
4 xj 5 a 1 + 4 2
xj 5 a 2 + 4 3
xj 5 a 3 = y j xj
j=1 j=1 j=1 j=1
(12)
2 3 2 3 2 3
n
X n
X n
X n
X
4 x2j 5 a1 + 4 x3j 5 a2 + 4 x4j 5 a3 = yj x2j
j=1 j=1 j=1 j=1
This can be shown a nonsingular system due to the as-
sumption that the points fxj g are distinct.
Generalization. Let

fb(x) = a1'1(x) + a2'2(x) + + a m ' m ( x)


The root-mean-square error E in (3) is minimized with
the coe cients a1; : : : ; am satisfying
2 3
m
X n
X n
X
ak 4 ' k ( xj ) ' i ( x j ) 5 = y j ' i ( xj ) (13)
k=1 j=1 j=1
for i = 1; : : : ; m. For the special case of a polynomial
of degree (m 1),

fb(x) = a1 + a2x + a3x2 + + a m xm 1


write
'1(x) = 1; '2(x) = x; '3(x) = x2;
(14)
: : : ; ' m ( x) = xm 1
System (13) becomes
2 3
m
X n
X n
X
25
ak 4 xi+k
j = yj xij 1; i = 1 ; 2; : : : ; m
k=1 j=1 j=1
(15)
When m = 3, this yields the system (12) obtained earlier.
ILL-CONDITIONING

This system (15) is nonsingular (for m < n). Unfortu-


nately it is increasingly ill-conditioned as the degree m 1
increases.

The condition number for the matrix of coe cients can


be very large for fairly small values of m, say, m = 4.

For this reason, it is seldom advisable to use


' 1 ( x ) = 1; '2(x) = x; '3(x) = x2;
: : : ; ' m ( x) = xm 1
to do a least squares polynomial t, except for
degree 2.
To do a least squares t to data f(xi; yi) j 1 i ng
with a higher degree polynomial fb(x), write

fb(x) = a1'1(x) + + a m ' m ( x)


with '1(x); : : : ; 'm(x) so chosen that the matrix of co-
e cients in
2 3
m
X n
X n
X
ak 4 ' k ( xj ) ' i ( xj ) 5 = y j ' i ( xj )
k=1 j=1 j=1
is not ill-conditioned.

There are optimal choices of these functions 'j (x), with


deg('j ) = j 1 and with the coe cient matrix becom-
ing diagonal.
IMPROVED BASIS FUNCTIONS

A nonoptimal but still satisfactory choice in general can


be based on the Chebyshev polynomials fTk (x)g of Sec-
tion 5.5, and a somewhat better choice is the Legendre
polynomials of Section 5.7.

Suppose that the nodes {x_i} are chosen from an interval [α, β]. Introduce
modified Chebyshev polynomials
    φ_k(x) = T_{k-1}( (2x - α - β)/(β - α) ),   α ≤ x ≤ β,   k ≥ 1        (16)
Then degree(φ_k) = k - 1; and any polynomial f̂(x) of degree ≤ (m - 1) can be
written as a combination of φ_1(x), ..., φ_m(x).
EXAMPLE

Consider the following data.

Table 2: Data for a cubic least squares t

xi yi xi yi
0:00 0:486 0:55 1:102
0:05 0:866 0:60 1:099
0:10 0:944 0:65 1:017
0:15 1:144 0:70 1:111
0:20 1:103 0:75 1:117
0:25 1:202 0:80 1:152
0:30 1:166 0:85 1:265
0:35 1:191 0:90 1:380
0:40 1:124 0:95 1:575
0:45 1:095 1:00 1:857
0:50 1:122

From the following Figure 3 it appears to be approxi-


mately cubic. We begin by using
fb(x) = a1 + a2x + a3x2 + a4x3 (17)
Figure 3: The plot of data of Table 2

The resulting linear system (15), denoted here by


La = b, is given by
2 3
21 10:5 7:175 5:5125
6 10:5 7:175 5:5125 4:51666 7
6 7
L=6 7
4 7:175 5:5125 4:51666 3:85416 5
5:5125 4:51666 3:85416 3:38212
a = [ a 1 ; a 2 ; a 3 ; a 4 ]T
b = [24:1180; 13:2345; 9:46836; 7:55944]T
The solution is

a = [0.5747, 4.7259, −11.1282, 7.6687]^T

The condition number is

cond(L) = ||L|| ||L^{−1}|| ≈ 22000     (18)

This is very large; it may be difficult to obtain an accurate answer for La = b.

To verify this, perturb b above by adding to it the per-


turbation

[0:01; 0:01; 0:01; 0:01]T


This will change b in its second place to the right of the
decimal point, within the range of possible perturbations
due to errors in the data. The solution of the new per-
turbed system is

a = [0:7408; 2:6825; 6:1538; 4:4550]T


This is very dierent from the earlier result for a.
The main point here is that use of

f̂(x) = a_1 + a_2 x + a_3 x^2 + a_4 x^3

leads to a rather ill-conditioned system of linear equations for determining {a_1, a_2, a_3, a_4}.
A better basis.

Use the modified Chebyshev functions of (16) on [α, β] = [0, 1]:

f̂(x) = a_1 φ_1(x) + a_2 φ_2(x) + a_3 φ_3(x) + a_4 φ_4(x)

φ_1(x) = T_0(2x − 1) ≡ 1
φ_2(x) = T_1(2x − 1) = 2x − 1
φ_3(x) = T_2(2x − 1) = 8x^2 − 8x + 1
φ_4(x) = T_3(2x − 1) = 32x^3 − 48x^2 + 18x − 1

The values {a_1, a_2, a_3, a_4} are completely different than in the representation (17).
The linear system (13) is again denoted by La = b:

L = [ 21        0        −5.6       0
      0         7.7       0        −2.8336
     −5.6       0        10.4664    0
      0        −2.8336    0        11.01056 ]

b = [24.118, 2.351, −6.01108, 1.523576]^T
The solution is

a = [1.160969, 0.393514, 0.046850, 0.239646]^T

The linear system is very stable with respect to the type of perturbation made in b with the earlier approach to the cubic least squares fit using (17).

This is implied by the small condition number of L:

cond(L) = ||L|| ||L^{−1}|| ≈ (26.6)(0.1804) ≈ 4.8

Relatively small perturbations in b will lead to relatively small changes in the solution a.
The graph of f̂(x) is shown in Figure 4.

Figure 4: The cubic least squares fit for Table 2

To give some idea of the accuracy of f̂(x) in approximating the data in Table 2, we compute the root-mean-square error from (3):

E ≈ 0.0421

a fairly small value when compared with the function values of f̂(x).
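The conditioning contrast between the two bases can be reproduced numerically. The sketch below (Python/NumPy; names are ours) builds both normal-equation matrices for the data of Table 2; the exact condition numbers reported depend on the norm used, so they may differ somewhat from (18).

```python
import numpy as np

# Data of Table 2: x_i = 0, 0.05, ..., 1.0
x = np.arange(0.0, 1.0001, 0.05)
y = np.array([0.486, 0.866, 0.944, 1.144, 1.103, 1.202, 1.166, 1.191,
              1.124, 1.095, 1.122, 1.102, 1.099, 1.017, 1.111, 1.117,
              1.152, 1.265, 1.380, 1.575, 1.857])

def normal_equations(B, y):
    """Given basis values B[j, k] = phi_{k+1}(x_j), return (L, b) of (13)."""
    return B.T @ B, B.T @ y

# Monomial basis 1, x, x^2, x^3 versus T_k(2x - 1), k = 0, ..., 3
B_mono = np.vander(x, 4, increasing=True)
t = 2.0 * x - 1.0
B_cheb = np.column_stack([np.ones_like(t), t, 2*t**2 - 1, 4*t**3 - 3*t])

for name, B in [("monomial", B_mono), ("Chebyshev", B_cheb)]:
    L, b = normal_equations(B, y)
    a = np.linalg.solve(L, b)
    print(name, "cond(L) =", np.linalg.cond(L, p=np.inf), "a =", a)
```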
THE EIGENVALUE PROBLEM

Let A be an n × n square matrix. If there is a number λ and a column vector v ≠ 0 for which

Av = λv

then we say λ is an eigenvalue of A and v is an associated eigenvector. Note that if v is an eigenvector, then any nonzero multiple of v is also an eigenvector for the same eigenvalue λ.

Example: Let

A = [ 1.25  0.75
      0.75  1.25 ]     (1)

The eigenvalue-eigenvector pairs for A are

λ_1 = 2,     v^(1) = [1, 1]^T
λ_2 = 0.5,   v^(2) = [−1, 1]^T     (2)
Eigenvalues and eigenvectors are often used to give additional intuition for the function

F(x) = Ax,   x ∈ R^n or C^n

Example. The eigenvectors in the preceding example (2) form a basis for R^2. For x = [x_1, x_2]^T,

x = c_1 v^(1) + c_2 v^(2)

c_1 = (x_1 + x_2)/2,   c_2 = (x_2 − x_1)/2

Using (2), the function

F(x) = Ax,   x ∈ R^2

can be written as

F(x) = c_1 A v^(1) + c_2 A v^(2) = 2 c_1 v^(1) + (1/2) c_2 v^(2),   x ∈ R^2

The eigenvectors provide a better way to understand the meaning of F(x) = Ax.

See the following Figure 1 and Figure 2.


Figure 1: Decomposition of x

Figure 2: Decomposition of Ax
How do we calculate λ and v? Rewrite Av = λv as

(λI − A)v = 0,   v ≠ 0     (3)

a homogeneous system of linear equations with coefficient matrix λI − A and a nonzero solution v. This can be true if and only if

f(λ) ≡ det(λI − A) = 0

The function f(λ) is called the characteristic polynomial of A, and its roots are the eigenvalues of A. Assuming A has order n,

f(λ) = λ^n + α_{n−1} λ^{n−1} + ⋯ + α_1 λ + α_0     (4)

For the case n = 2,

f(λ) = det [ λ − a_11   −a_12
             −a_21      λ − a_22 ]
     = (λ − a_11)(λ − a_22) − a_21 a_12
     = λ^2 − (a_11 + a_22) λ + a_11 a_22 − a_21 a_12

Formula (4) shows that a matrix A of order n can have at most n distinct eigenvalues.
Example. Let

A = [ −7    13   −16
      13   −10    13
     −16    13    −7 ]     (5)

Then

f(λ) = det [ λ + 7   −13     16
             −13     λ + 10  −13
              16     −13     λ + 7 ]
     = λ^3 + 24λ^2 − 405λ + 972

is the characteristic polynomial of A.

The roots are

λ_1 = −36,   λ_2 = 9,   λ_3 = 3     (6)

Finding an eigenvector. For λ = −36, find an associated eigenvector v by solving (λI − A)v = 0, which becomes

(−36I − A)v = 0

[ −29   −13    16 ] [ v_1 ]   [ 0 ]
[ −13   −26   −13 ] [ v_2 ] = [ 0 ]
[  16   −13   −29 ] [ v_3 ]   [ 0 ]

If v_1 = 0, then the only solution is v = 0. Thus v_1 ≠ 0, and we arbitrarily choose v_1 = 1. This leads to the system

−13v_2 + 16v_3 = 29
−26v_2 − 13v_3 = 13
−13v_2 − 29v_3 = −16

The solution is v_2 = −1, v_3 = 1. Thus the eigenvector v for λ = −36 is

v^(1) = [1, −1, 1]^T     (7)

or any nonzero multiple of this vector.
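As a quick check of the example (5) as reconstructed above, the eigenvalues and the eigenvector (7) can be verified numerically; a minimal sketch in Python/NumPy:

```python
import numpy as np

# The symmetric matrix of example (5)
A = np.array([[ -7.0,  13.0, -16.0],
              [ 13.0, -10.0,  13.0],
              [-16.0,  13.0,  -7.0]])

# Coefficients of det(lambda*I - A): lambda^3 + 24 lambda^2 - 405 lambda + 972
print(np.poly(A))              # -> [1., 24., -405., 972.]
print(np.linalg.eigvalsh(A))   # -> eigenvalues -36, 3, 9 (ascending order)

# Verify the eigenvector for lambda = -36
v = np.array([1.0, -1.0, 1.0])
print(A @ v)                   # -> -36 * v
```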
SYMMETRIC MATRICES

Recall that A symmetric means A^T = A, assuming that A contains only real number entries. Such matrices are very common in applications.

Example. A general 3 × 3 symmetric matrix looks like

A = [ a  b  c
      b  d  e
      c  e  f ]

A general n × n matrix A = [a_{i,j}] is symmetric if and only if

a_{i,j} = a_{j,i},   1 ≤ i, j ≤ n
Following is a very important set of results for symmetric matrices, explaining much of their special character.

THEOREM. Let A be a real, symmetric, n × n matrix. Then there is a set of n eigenvalue-eigenvector pairs {λ_i, v^(i)}, 1 ≤ i ≤ n, that satisfy the following properties.

(i) The numbers λ_1, λ_2, ..., λ_n are all of the roots of the characteristic polynomial f(λ) of A, repeated according to their multiplicity. Moreover, all the λ_i are real numbers.

(ii) If the column matrices v^(i), 1 ≤ i ≤ n, are regarded as vectors in n-dimensional space, then they are mutually perpendicular and of length 1:

v^(i)T v^(j) = 0,   1 ≤ i, j ≤ n,  i ≠ j
v^(i)T v^(i) = 1,   1 ≤ i ≤ n

(iii) For each column vector x = [x_1, x_2, ..., x_n]^T, there is a unique choice of constants c_1, ..., c_n for which

x = c_1 v^(1) + ⋯ + c_n v^(n)

The constants are given by

c_i = Σ_{j=1}^{n} x_j v_j^(i) = x^T v^(i),   1 ≤ i ≤ n

where v^(i) = [v_1^(i), ..., v_n^(i)]^T.

(iv) Define the matrix U of order n by

U = [v^(1), v^(2), ..., v^(n)]     (8)

Then

U^T A U = D ≡ diag(λ_1, λ_2, ..., λ_n)

and

U U^T = U^T U = I     (9)

A = U D U^T is a useful decomposition of A.
Example. Recall

A = [ 1.25  0.75
      0.75  1.25 ]

and its eigenvalue-eigenvector pairs

λ_1 = 2,     v^(1) = [1, 1]^T
λ_2 = 0.5,   v^(2) = [−1, 1]^T

Figure 1 illustrates that the eigenvectors are perpendicular. To have them be of length 1, replace the above by

λ_1 = 2,     v^(1) = [1/√2, 1/√2]^T
λ_2 = 0.5,   v^(2) = [−1/√2, 1/√2]^T

which are multiples of the original v^(i).

The matrix U of (8) is given by

U = [ 1/√2   −1/√2
      1/√2    1/√2 ]

Easily, U^T U = I. Also,

U^T A U = [ 2    0
            0    0.5 ]

as specified in (9).
NONSYMMETRIC MATRICES

Nonsymmetric matrices have a wide variety of possible behaviour. We illustrate some of this with two simple examples.

Example. We illustrate the existence of complex eigenvalues. Let

A = [ 0   1
     −1   0 ]

The characteristic polynomial is

f(λ) = det [ λ   −1
             1    λ ] = λ^2 + 1

The roots of f(λ) are complex,

λ_1 = i ≡ √(−1),   λ_2 = −i

and corresponding eigenvectors are

v^(1) = [−i, 1]^T,   v^(2) = [i, 1]^T
Example. For A an n × n nonsymmetric matrix, there may not be n independent eigenvectors. Let

A = [ a  1  0
      0  a  1
      0  0  a ]

where a is a constant.

Then λ = a is the eigenvalue of A with multiplicity 3, and any associated eigenvector must be of the form

v = c [1, 0, 0]^T

for some c ≠ 0.

Thus, up to a nonzero multiplicative constant, we have only one eigenvector,

v = [1, 0, 0]^T

for the three equal eigenvalues λ_1 = λ_2 = λ_3 = a.
THE POWER METHOD

This numerical method is used primarily to find the eigenvalue of largest magnitude, if such exists.

We assume that the eigenvalues {λ_1, ..., λ_n} of an n × n matrix A satisfy

|λ_1| > |λ_2| ≥ ⋯ ≥ |λ_n|     (10)

Denote the eigenvector for λ_1 by v^(1). We define an iteration method for computing improved estimates of λ_1 and v^(1).

Choose z^(0) ≈ v^(1), usually chosen randomly. Define

w^(1) = A z^(0)

Let α_1 be the maximum component of w^(1), in size. If there is more than one such component, choose the first such component as α_1. Then define

z^(1) = (1/α_1) w^(1)

Repeat the process iteratively. Define

w^(m) = A z^(m−1)     (11)

Let α_m be the maximum component of w^(m), in size. Define

z^(m) = (1/α_m) w^(m)     (12)
for m = 1, 2, .... Then, roughly speaking, the vectors z^(m) will converge to some multiple of v^(1).

To find λ_1 by this process, also pick some nonzero component of the vectors z^(m) and w^(m), say component k, and fix k. Often this is picked as the maximal component of z^(l), for some large l. Define

λ_1^(m) = w_k^(m) / z_k^(m−1),   m = 1, 2, ...     (13)

where z_k^(m−1) denotes component k of z^(m−1).

It can be shown that λ_1^(m) converges to λ_1 as m → ∞.

Example. Recall the earlier example A = [ 1.25  0.75 ; 0.75  1.25 ]. Double precision was used in the computation, with rounded values shown in the table given here.

Table 1: Power method example

 m   z_1^(m)   z_2^(m)    λ_1^(m)    λ_1^(m) − λ_1^(m−1)   Ratio
 0   1.0       0.5
 1   1.0       0.84615    1.62500
 2   1.0       0.95918    1.88462    2.60E−1
 3   1.0       0.98964    1.96939    8.48E−2                0.33
 4   1.0       0.99740    1.99223    2.28E−2                0.27
 5   1.0       0.99935    1.99805    5.82E−3                0.26
 6   1.0       0.99984    1.99951    1.46E−3                0.25
 7   1.0       0.99996    1.99988    3.66E−4                0.25
Note that in this example, λ_1 = 2 and v^(1) = [1, 1]^T, and the numerical results are converging to these values.

The column of successive differences λ_1^(m) − λ_1^(m−1) and their successive ratios are included to show that there is a regular pattern to the convergence.
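The iteration (11)-(13) is only a few lines of code. Below is a minimal sketch in Python/NumPy (the function name is ours); running it on the matrix above reproduces the estimates of Table 1.

```python
import numpy as np

def power_method(A, z0, num_iters=20):
    """Power method (11)-(13): returns the final z and the sequence of
    eigenvalue estimates lambda_1^(m), using a fixed component k."""
    z = np.array(z0, dtype=float)
    k = np.argmax(np.abs(z))              # fix a component for formula (13)
    estimates = []
    for _ in range(num_iters):
        w = A @ z                         # w^(m) = A z^(m-1)
        estimates.append(w[k] / z[k])     # lambda_1^(m)
        alpha = w[np.argmax(np.abs(w))]   # first component of maximum size
        z = w / alpha                     # z^(m) = w^(m)/alpha_m
    return z, estimates

A = np.array([[1.25, 0.75], [0.75, 1.25]])
z, est = power_method(A, [1.0, 0.5], 8)
print(est)   # tends to lambda_1 = 2, as in Table 1
```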
CONVERGENCE

Assume A is a real n × n matrix.

It can be shown, by induction, that

z^(m) = σ_m A^m z^(0) / ||A^m z^(0)||,   m ≥ 1     (14)

where |σ_m| = 1.

Further assume that the eigenvalues {λ_1, ..., λ_n} satisfy (10), and also that there are n corresponding eigenvectors {v^(1), ..., v^(n)} that form a basis for R^n. Thus

z^(0) = c_1 v^(1) + ⋯ + c_n v^(n)     (15)

for some choice of constants {c_1, ..., c_n}.

Assume c_1 ≠ 0, something that a truly random choice of z^(0) will usually guarantee.

Apply A to z^(0) in (15), to get

A z^(0) = c_1 A v^(1) + ⋯ + c_n A v^(n)
        = λ_1 c_1 v^(1) + ⋯ + λ_n c_n v^(n)

Apply A repeatedly to get

A^m z^(0) = λ_1^m c_1 v^(1) + ⋯ + λ_n^m c_n v^(n)
          = λ_1^m [ c_1 v^(1) + (λ_2/λ_1)^m c_2 v^(2) + ⋯ + (λ_n/λ_1)^m c_n v^(n) ]
From (14),

z^(m) = σ_m (λ_1/|λ_1|)^m
        × [ c_1 v^(1) + (λ_2/λ_1)^m c_2 v^(2) + ⋯ + (λ_n/λ_1)^m c_n v^(n) ]
          / || c_1 v^(1) + (λ_2/λ_1)^m c_2 v^(2) + ⋯ + (λ_n/λ_1)^m c_n v^(n) ||     (16)

As m → ∞, the terms (λ_j/λ_1)^m → 0 for 2 ≤ j ≤ n, with (λ_2/λ_1)^m the largest. Also,

| σ_m (λ_1/|λ_1|)^m | = 1

Thus as m → ∞, most terms in (16) converge to zero. Cancel c_1 from numerator and denominator, obtaining

z^(m) ≈ σ̂_m v^(1)/||v^(1)||

where |σ̂_m| = 1.

If the normalization of z^(m) is modified, to always have some particular component be positive, then

z^(m) → v^(1)/||v^(1)|| ≡ v̂^(1)     (17)

with a fixed sign independent of m. Our earlier normalization of sign, dividing by α_m, will usually accomplish this, but not always.

The error in z^(m) will satisfy

||z^(m) − v̂^(1)|| ≤ c |λ_2/λ_1|^m,   m ≥ 0     (18)

for some constant c > 0.
CONVERGENCE OF λ_1^(m)

A similar convergence analysis can be given for λ_1^(m), with the same kind of error bound.

Moreover, if we also assume

|λ_1| > |λ_2| > |λ_3| ≥ |λ_4| ≥ ⋯ ≥ |λ_n| ≥ 0     (19)

then

λ_1 − λ_1^(m) ≈ c (λ_2/λ_1)^m     (20)

for some constant c, as m → ∞.

The error decreases geometrically with a ratio of λ_2/λ_1. In the earlier example with A = [ 1.25  0.75 ; 0.75  1.25 ],

λ_2/λ_1 = 0.5/2 = 0.25

which is the ratio observed in Table 1.
Example. Consider the symmetric matrix

A = [ −7    13   −16
      13   −10    13
     −16    13    −7 ]

From earlier, the eigenvalues are

λ_1 = −36,   λ_2 = 9,   λ_3 = 3

The eigenvector v^(1) associated with λ_1 is v^(1) = [1, −1, 1]^T. The results of using the power method are shown in the following Table 2.

Note that the ratios of the successive differences of λ_1^(m) are approaching

λ_2/λ_1 = −0.25

Also note that the location of the maximal component of z^(m) changes from one iteration to the next.

The initial guess z^(0) was chosen closer to v^(1) than would be usual in actual practice. It was so chosen for purposes of illustration.
Table 2: Power method example

 m   z_1^(m)      z_2^(m)      z_3^(m)
 0    1.000000    −0.800000     0.900000
 1   −0.972477     1.000000    −1.000000
 2    1.000000    −0.995388     0.993082
 3    0.998265    −0.999229     1.000000
 4    1.000000    −0.999775     0.999566
 5    0.999891    −0.999946     1.000000
 6    1.000000    −0.999986     0.999973
 7    0.999993    −0.999997     1.000000

 m   λ_1^(m)      λ_1^(m) − λ_1^(m−1)   Ratio
 1   −31.80000
 2   −36.82075    −5.03E+0
 3   −35.82936     9.91E−1              −0.197
 4   −36.04035    −2.11E−1              −0.213
 5   −35.99013     5.02E−2              −0.238
 6   −36.00245    −1.23E−2              −0.245
 7   −35.99939     3.06E−3              −0.249
AITKEN EXTRAPOLATION

From (20),

λ_1 − λ_1^(m+1) ≈ r (λ_1 − λ_1^(m)),   r = λ_2/λ_1     (21)

for large m. Choose r using

r ≈ (λ_1^(m+1) − λ_1^(m)) / (λ_1^(m) − λ_1^(m−1))     (22)

as with Aitken extrapolation in Section 3.4 on linear iteration methods.

Using this r, solve for λ_1 in (21):

λ_1 ≈ (λ_1^(m+1) − r λ_1^(m)) / (1 − r)
    = λ_1^(m+1) + [r/(1 − r)] (λ_1^(m+1) − λ_1^(m))     (23)

This is Aitken's extrapolation formula. It also gives us the Aitken error estimate

λ_1 − λ_1^(m+1) ≈ [r/(1 − r)] (λ_1^(m+1) − λ_1^(m))     (24)

Example. In Table 2, take m + 1 = 7. Then (24) yields

λ_1 − λ_1^(7) ≈ [r/(1 − r)] (λ_1^(7) − λ_1^(6))
             = [−0.249/(1 + 0.249)] (0.00306) ≈ −0.00061

which is the actual error. Also, the Aitken formula (23) will give the exact answer for λ_1, to seven significant digits.
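The Aitken estimate is easy to compute from any three successive iterates. A minimal sketch (Python; the function name is ours, and the sample call uses the sign conventions restored in Table 2):

```python
def aitken_estimate(lam_prev, lam_curr, lam_next):
    """Given lambda_1^(m-1), lambda_1^(m), lambda_1^(m+1), return the
    ratio r of (22), the extrapolated value (23), and the error
    estimate (24) for lambda_1^(m+1)."""
    r = (lam_next - lam_curr) / (lam_curr - lam_prev)     # (22)
    error_est = (r / (1.0 - r)) * (lam_next - lam_curr)   # (24)
    return r, lam_next + error_est, error_est             # r, (23), (24)

# The last three entries of Table 2 (m + 1 = 7):
print(aitken_estimate(-35.99013, -36.00245, -35.99939))
# -> r close to -0.25, extrapolated value close to -36, error estimate close to -0.00061
```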
SYSTEMS OF NONLINEAR EQUATIONS

Widely used in the mathematical modeling of real world


phenomena.
We introduce some numerical methods for their solution.

For better intuition, we examine systems of two nonlinear


equations and numerical methods for their solution. We
then generalize to systems of an arbitrary order.

The Problem: Consider solving a system of two nonlinear equations

f(x, y) = 0
g(x, y) = 0     (1)

Example: Consider solving the system

f(x, y) ≡ x^2 + 4y^2 − 9 = 0
g(x, y) ≡ 18y − 14x^2 + 45 = 0     (2)
A graph of z = f(x, y) is given in Figure 1, along with the curve for f(x, y) = 0.

Figure 1: Graph of z = x^2 + 4y^2 − 9, along with z = 0 on that surface

To visualize the points (x, y) that satisfy both equations in (1) simultaneously, look at the intersection of the zero curves of the functions f and g. Figure 2 illustrates this. The zero curves intersect at four points, each of which corresponds to a solution of the system (2). For example, there is a solution near the point (1, −1).

Figure 2: The graphs of f(x, y) = 0 and g(x, y) = 0
TANGENT PLANE APPROXIMATION

The graph of z = f(x, y) is a surface in xyz-space.

For (x, y) ≈ (x_0, y_0), we approximate f(x, y) by the plane tangent to the surface at the point (x_0, y_0, f(x_0, y_0)). The equation of this plane is z = p(x, y) with

p(x, y) ≡ f(x_0, y_0) + (x − x_0) f_x(x_0, y_0) + (y − y_0) f_y(x_0, y_0)     (3)

f_x(x, y) = ∂f(x, y)/∂x,   f_y(x, y) = ∂f(x, y)/∂y

the partial derivatives of f with respect to x and y, respectively.

If (x, y) ≈ (x_0, y_0), then

f(x, y) ≈ p(x, y)

For functions of two variables, p(x, y) is the linear Taylor polynomial approximation to f(x, y).
NEWTON'S METHOD

Let (α, β) denote a solution of the system

f(x, y) = 0
g(x, y) = 0

Let (x_0, y_0) ≈ (α, β) be an initial guess at the solution.

Approximate the surface z = f(x, y) with the tangent plane at (x_0, y_0, f(x_0, y_0)).

If f(x_0, y_0) is sufficiently close to zero, then the zero curve of p(x, y) will be an approximation of the zero curve of f(x, y) for those points (x, y) near (x_0, y_0).

Because the graph of z = p(x, y) is a plane, its zero curve is simply a straight line.
Example. Consider f(x, y) ≡ x^2 + 4y^2 − 9. Then

f_x(x, y) = 2x,   f_y(x, y) = 8y

At (x_0, y_0) = (1, −1),

f(x_0, y_0) = −4,   f_x(x_0, y_0) = 2,   f_y(x_0, y_0) = −8

At (1, −1, −4) the tangent plane to the surface z = f(x, y) has the equation

z = p(x, y) ≡ −4 + 2(x − 1) − 8(y + 1)

The graphs of the zero curves of f(x, y) and p(x, y) for (x, y) near (x_0, y_0) are given in Figure 3.

Figure 3: f(x, y) = 0 and p(x, y) = 0
The tangent plane to the surface z = g(x, y) at (x_0, y_0, g(x_0, y_0)) has the equation z = q(x, y) with

q(x, y) ≡ g(x_0, y_0) + (x − x_0) g_x(x_0, y_0) + (y − y_0) g_y(x_0, y_0)

Recall that the solution (α, β) to

f(x, y) = 0
g(x, y) = 0

is the intersection of the zero curves of z = f(x, y) and z = g(x, y).

Approximate these zero curves by those of the tangent planes z = p(x, y) and z = q(x, y). The intersection of these latter zero curves gives an approximate solution to the above nonlinear system.

Denote the solution to

p(x, y) = 0
q(x, y) = 0

by (x_1, y_1).
Example. Return to the equations

f(x, y) ≡ x^2 + 4y^2 − 9 = 0
g(x, y) ≡ 18y − 14x^2 + 45 = 0

with (x_0, y_0) = (1, −1).

The use of the zero curves of the tangent plane approximations is illustrated in Figure 4.

Figure 4: f = g = 0 and p = q = 0
Calculating (x_1, y_1). To find the intersection of the zero curves of the tangent planes, we must solve the linear system

f(x_0, y_0) + (x − x_0) f_x(x_0, y_0) + (y − y_0) f_y(x_0, y_0) = 0
g(x_0, y_0) + (x − x_0) g_x(x_0, y_0) + (y − y_0) g_y(x_0, y_0) = 0

The solution is denoted by (x_1, y_1). In matrix form,

[ f_x(x_0, y_0)  f_y(x_0, y_0) ] [ x − x_0 ]     [ f(x_0, y_0) ]
[ g_x(x_0, y_0)  g_y(x_0, y_0) ] [ y − y_0 ] = − [ g(x_0, y_0) ]

It is actually computed as follows. Define δ_x and δ_y to be the solution of the linear system

[ f_x(x_0, y_0)  f_y(x_0, y_0) ] [ δ_x ]     [ f(x_0, y_0) ]
[ g_x(x_0, y_0)  g_y(x_0, y_0) ] [ δ_y ] = − [ g(x_0, y_0) ]

[ x_1 ]   [ x_0 ]   [ δ_x ]
[ y_1 ] = [ y_0 ] + [ δ_y ]

Usually the point (x_1, y_1) is closer to the solution than is the original point (x_0, y_0).
Continue this process, using (x_1, y_1) as a new initial guess, to obtain an improved estimate (x_2, y_2).

This iteration process is continued until a solution with sufficient accuracy is obtained.

The general iteration is given by

[ f_x(x_k, y_k)  f_y(x_k, y_k) ] [ δ_{x,k} ]     [ f(x_k, y_k) ]
[ g_x(x_k, y_k)  g_y(x_k, y_k) ] [ δ_{y,k} ] = − [ g(x_k, y_k) ]

[ x_{k+1} ]   [ x_k ]   [ δ_{x,k} ]
[ y_{k+1} ] = [ y_k ] + [ δ_{y,k} ],   k = 0, 1, ...     (4)

This is Newton's method for solving

f(x, y) = 0
g(x, y) = 0

Many numerical methods for solving nonlinear systems are variations on Newton's method.
Example. Consider again the system

f(x, y) ≡ x^2 + 4y^2 − 9 = 0
g(x, y) ≡ 18y − 14x^2 + 45 = 0

Newton's method (4) becomes

[  2x_k    8y_k ] [ δ_{x,k} ]     [ x_k^2 + 4y_k^2 − 9     ]
[ −28x_k   18   ] [ δ_{y,k} ] = − [ 18y_k − 14x_k^2 + 45   ]

[ x_{k+1} ]   [ x_k ]   [ δ_{x,k} ]
[ y_{k+1} ] = [ y_k ] + [ δ_{y,k} ]     (5)

Choose (x_0, y_0) = (1, −1). The resulting Newton iterates are given in Table 1, along with

Error = ||(α, β) − (x_k, y_k)||_∞ ≡ max{|α − x_k|, |β − y_k|}

Table 1: Newton iterates for the system (5)

 k   x_k                   y_k                     Error
 0   1.0                   −1.0                    3.74E−1
 1   1.170212765957447     −1.457446808510638      8.34E−2
 2   1.202158829506705     −1.376760321923060      2.68E−3
 3   1.203165807091535     −1.374083486949713      2.95E−6
 4   1.203166963346410     −1.374080534243534      3.59E−12
 5   1.203166963347774     −1.374080534239942      2.22E−16

The final iterate is accurate to the precision of the computer arithmetic.
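The iteration (5) can be reproduced with a few lines of code; a minimal sketch in Python/NumPy (the function and variable names are ours):

```python
import numpy as np

def F(v):
    x, y = v
    return np.array([x**2 + 4*y**2 - 9.0,
                     18*y - 14*x**2 + 45.0])

def J(v):                       # Jacobian of F
    x, y = v
    return np.array([[  2*x,  8*y],
                     [-28*x, 18.0]])

v = np.array([1.0, -1.0])       # (x_0, y_0)
for k in range(6):
    delta = np.linalg.solve(J(v), -F(v))   # solve J*delta = -F, as in (4)/(5)
    v = v + delta
    print(k + 1, v)
# converges to (1.203166963..., -1.374080534...), as in Table 1
```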
AN ALTERNATIVE NOTATION. We introduce a more general notation for the preceding work. The problem to be solved is

F_1(x_1, x_2) = 0
F_2(x_1, x_2) = 0     (6)

Introduce

x = [ x_1 ]        F(x) = [ F_1(x_1, x_2) ]
    [ x_2 ]                [ F_2(x_1, x_2) ]

F'(x) = [ ∂F_1/∂x_1   ∂F_1/∂x_2 ]
        [ ∂F_2/∂x_1   ∂F_2/∂x_2 ]

F'(x) is called the Frechet derivative of F(x). It is a generalization to higher dimensions of the ordinary derivative of a function of one variable.

The system (6) can now be written as

F(x) = 0     (7)

A solution of this equation will be denoted by α.

Newton's method becomes

F'(x^(k)) δ^(k) = −F(x^(k))
x^(k+1) = x^(k) + δ^(k)     (8)

for k = 0, 1, ....
A shorter and mathematically equivalent form:

x^(k+1) = x^(k) − [F'(x^(k))]^{−1} F(x^(k))     (9)

for k = 0, 1, ....

This last formula is often used in discussing and analyzing Newton's method for nonlinear systems. But (8) is used for practical computations, since it is usually less expensive to solve a linear system than to find the inverse of the coefficient matrix.

Note the analogy of (9) with Newton's method for a single equation.
THE GENERAL NEWTON METHOD

Consider the system of n nonlinear equations

F_1(x_1, ..., x_n) = 0
   ⋮                       (10)
F_n(x_1, ..., x_n) = 0

Define

x = [x_1, ..., x_n]^T,   F(x) = [F_1(x_1, ..., x_n), ..., F_n(x_1, ..., x_n)]^T

F'(x) = [ ∂F_1/∂x_1  ⋯  ∂F_1/∂x_n ]
        [    ⋮               ⋮    ]
        [ ∂F_n/∂x_1  ⋯  ∂F_n/∂x_n ]

The nonlinear system (10) can be written as

F(x) = 0

Its solution is denoted by α ∈ R^n.

Newton's method is

F'(x^(k)) δ^(k) = −F(x^(k))
x^(k+1) = x^(k) + δ^(k),   k = 0, 1, ...     (11)

Alternatively, as before,

x^(k+1) = x^(k) − [F'(x^(k))]^{−1} F(x^(k)),   k = 0, 1, ...     (12)

This formula is often used in theoretical discussions of Newton's method for nonlinear systems. But (11) is used for practical computations, since it is usually less expensive to solve a linear system than to find the inverse of the coefficient matrix.
CONVERGENCE. Under suitable hypotheses, it can be shown that there is a c > 0 for which

||α − x^(k+1)||_∞ ≤ c ||α − x^(k)||_∞^2,   k = 0, 1, ...

||α − x^(k)||_∞ ≡ max_{1 ≤ i ≤ n} |α_i − x_i^(k)|

Newton's method is quadratically convergent.
DIFFERENTIAL EQUATIONS

A principal model of physical phenomena.

The equation:

y' = f(x, y)

The initial value:

y(x_0) = Y_0

Find the solution Y(x) on some interval x_0 ≤ x ≤ b. Together these two conditions constitute an initial value problem.

We will study methods for solving systems of first order equations, but we begin with a single equation.

Many of the crucial ideas in the numerical analysis arise from properties of the original equation.
SPECIAL CASES

1. y'(x) = λ y(x) + b(x), x ≥ x_0;  f(x, z) = λz + b(x).
General solution:

Y(x) = c e^{λx} + ∫_{x_0}^{x} e^{λ(x−t)} b(t) dt

with c arbitrary. With y(x_0) = Y_0,

Y(x) = Y_0 e^{λ(x−x_0)} + ∫_{x_0}^{x} e^{λ(x−t)} b(t) dt

2. y'(x) = a y(x)^2;  f(x, z) = a z^2.
General solution:

Y(x) = −1/(ax + c),   c arbitrary

With y(x_0) = Y_0, use

c = −1/Y_0 − a x_0

3. y'(x) = −[y(x)]^2 + y(x);  f(x, z) = −z^2 + z.
General solution:

Y(x) = 1/(1 + c e^{−x})

4. "Separable equations": y'(x) = g(y(x)) h(x);  f(x, z) = g(z) h(x).
General solution: Write

(1/g(y)) dy/dx = h(x)

Let z = y(x), dz = y'(x) dx. Evaluate the integrals in

∫ dz/g(z) = ∫ h(x) dx

Replace z by Y(x) and solve for Y(x), if possible.
DIRECTION FIELDS

At each point (x, y) at which the function f is defined, evaluate it to get f(x, y). Then draw a small line segment at this point with slope f(x, y). With enough of these, we have a picture of how the solutions of the differential equation

y' = f(x, y)

behave.

Consider the differential equation

y' = −y + 2 cos x

We can draw direction fields by hand by the method described above, by using the Matlab program given in the book, or by using the Matlab program provided in the class account.

Figure: Direction field for y' = −y + 2 cos x, together with example solution curves.
SOLVABILITY THEORY

Consider whether there is a function Y(x) which satisfies

y' = f(x, y),   x ≥ x_0,   y(x_0) = Y_0     (1)

Assume there is some open set D, a subset of the xy-plane containing (x_0, Y_0), for which:

1. If two points (x, y) and (x, z) are contained in D, then the line segment joining them is also contained in D.

2. f(x, y) is continuous for all points (x, y) contained in D.

3. ∂f(x, y)/∂y is continuous for all points (x, y) contained in D.

Then there is an interval [c, d] containing x_0 and a unique function Y(x) defined on [c, d] which satisfies (1), with the graph of Y(x) contained in D.
THE LIPSCHITZ CONDITION

The preceding condition on the partial derivative of f is an easy way to specify that the following condition is satisfied; it is the condition that is really needed. The Lipschitz condition: there is a non-negative constant K for which

|f(x, y) − f(x, z)| ≤ K |y − z|

for all points (x, y), (x, z) in the region D. In practice, we use

K = max_{(x,y) ∈ D} |∂f(x, y)/∂y|

The Lipschitz condition occurs throughout our treatment of both the theory of differential equations and the theory of the numerical methods for their solution.

For this course, we simplify matters by assuming

K = max_{x_0 ≤ x ≤ b, −∞ < y < ∞} |∂f(x, y)/∂y| < ∞

with [x_0, b] the interval on which we are solving the initial value problem.
EXAMPLE

Let β > 0 be a given constant, and consider solving

y' = (2x/β^2) y^2,   x ≥ 0,   y(0) = 1

Then the partial derivative is

f_y(x, y) = 4xy/β^2

and f_y(0, 1) = 0. Thus f_y(x, y) is small for (x, y) near (0, 1), and it is continuous for all (x, y). Choose

D = {(x, y) : |x| < ∞, |y| ≤ B}

for some B > 0. Then there is a solution Y(x) on some interval [c, d] containing x_0 = 0. How big is [c, d]? In this case,

Y(x) = β^2/(β^2 − x^2),   −β < x < β

If β is small, then the interval is small.
IMPROVED SOLVABILITY THEORY

Assume there is a Lipschitz constant K for which f satisfies

|f(x, y) − f(x, z)| ≤ K |y − z|

for all (x, y), (x, z) satisfying

x_0 ≤ x ≤ b,   −∞ < y, z < ∞

Then the initial value problem

y' = f(x, y),   x_0 ≤ x ≤ b,   y(x_0) = Y_0

has a solution Y(x) on the entire interval [x_0, b].

Example: Consider y' = y + g(x) with g(x) continuous for all x. Then

y' = y + g(x),   y(x_0) = Y_0

has a unique continuous solution Y(x) for −∞ < x < ∞.
STABILITY

The concept of stability refers in a loose sense to what


happens to the solution Y (x) of an initial value prob-
lem if we make a small change in the data, which
includes both the di erential equation and the initial
value.

If small changes in the data lead to large changes in


the solution, then we say the initial value problem is
unstable or ill-conditioned; whereas if small changes
in the data lead to small changes in the solution, we
call the problem stable or well-conditioned.
EXAMPLE

Consider solving

y' = 100y − 101e^{−x},   y(0) = 1     (2)

This has the solution Y(x) = e^{−x}.

Now consider the perturbed problem

y_ε' = 100y_ε − 101e^{−x},   y_ε(0) = 1 + ε

where ε is some small number. The solution of this is Y_ε(x) = e^{−x} + εe^{100x}, and

Y_ε(x) − Y(x) = εe^{100x}

Thus Y_ε(x) − Y(x) increases very rapidly as x increases, and we say (2) is an "unstable" or "ill-conditioned" problem.
NUMERICAL METHODS FOR ODEs

Consider the initial value problem

y' = f(x, y),   x_0 ≤ x ≤ b,   y(x_0) = Y_0

and denote its solution by Y(x). Most numerical methods solve this by finding values at a set of node points:

x_0 < x_1 < ⋯ < x_N ≤ b

The approximating values are denoted in this book in various ways. Most simply, we have

y_1 ≈ Y(x_1), ..., y_N ≈ Y(x_N)

We also use

y(x_i) ≡ y_i,   i = 0, 1, ..., N

To begin with, and for much of our work, we use a fixed stepsize h, and we generate the node points by

x_i = x_0 + i h,   i = 0, 1, ..., N

Then we also write

y_h(x_i) ≡ y_i ≈ Y(x_i),   i = 0, 1, ..., N
EULER'S METHOD

Euler's method is defined by

y_{n+1} = y_n + h f(x_n, y_n),   n = 0, 1, ..., N − 1

with y_0 = Y_0. Where does this method come from?

There are various perspectives from which we can derive numerical methods for solving

y' = f(x, y),   x_0 ≤ x ≤ b,   y(x_0) = Y_0

and Euler's method is the simplest example of most such perspectives. Moreover, the error analysis for Euler's method is an introduction to the error analysis of most more rapidly convergent (and more practical) numerical methods.
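Euler's method is only a few lines of code. Below is a minimal sketch in Python/NumPy (the function name is ours); the test problem is the one solved with Euler's method later in these notes.

```python
import numpy as np

def euler(f, x0, y0, b, h):
    """Euler's method y_{n+1} = y_n + h f(x_n, y_n) on [x0, b]."""
    N = int(round((b - x0) / h))
    xs = x0 + h * np.arange(N + 1)
    ys = np.empty(N + 1)
    ys[0] = y0
    for n in range(N):
        ys[n + 1] = ys[n] + h * f(xs[n], ys[n])
    return xs, ys

# Example: y' = -y + cos x - sin x, y(0) = 2, true solution e^{-x} + cos x
f = lambda x, y: -y + np.cos(x) - np.sin(x)
xs, ys = euler(f, 0.0, 2.0, 5.0, 0.1)
print(np.max(np.abs((np.exp(-xs) + np.cos(xs)) - ys)))   # maximum error on [0, 5]
```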
A GEOMETRIC PERSPECTIVE

Look at the graph of y = Y (x), beginning at x =


x0. Approximate this graph by the line tangent at
(x0; Y (x0)):
y = Y (x0) + (x x0)Y 0(x0)
= Y (x0) + (x x0)f (x0; Y0)
Evaluate this tangent line at x1 and use this value
to approximate Y (x1). This yields Euler's approxima-
tion.
We could generalize this by looking for more accurate
means of approximating a function, e.g. by using a
higher degree Taylor approximation.

An illustration of Euler's method derivation


TAYLOR'S SERIES

Approximate Y(x) about x_0 by a Taylor polynomial approximation of some degree:

Y(x_0 + h) ≈ Y(x_0) + h Y'(x_0) + (h^2/2!) Y''(x_0) + ⋯ + (h^p/p!) Y^(p)(x_0)

Euler's method is the case p = 1:

Y(x_0 + h) ≈ Y(x_0) + h Y'(x_0) = y_0 + h f(x_0, y_0) ≡ y_1

We have an error formula for Taylor polynomial approximations; and in this case,

Y(x_1) − y_1 = (h^2/2) Y''(ξ_0)

with some x_0 ≤ ξ_0 ≤ x_1.
GENERAL ERROR FORMULA

In general,

y_{n+1} = y_n + h f(x_n, y_n),   n = 0, 1, ..., N − 1

Y(x_{n+1}) = Y(x_n) + h Y'(x_n) + (h^2/2) Y''(ξ_n)
           = Y(x_n) + h f(x_n, Y(x_n)) + (h^2/2) Y''(ξ_n)

with some x_n ≤ ξ_n ≤ x_{n+1}.

We will use this as the starting point of our error analyses of Euler's method. In particular,

Y(x_{n+1}) − y_{n+1} = Y(x_n) − y_n + h [f(x_n, Y(x_n)) − f(x_n, y_n)] + (h^2/2) Y''(ξ_n)
NUMERICAL DIFFERENTIATION

From beginning calculus,

Y'(x_n) ≈ [Y(x_n + h) − Y(x_n)]/h

This leads to

Y(x_n + h) ≈ Y(x_n) + h Y'(x_n)
           = Y(x_n) + h f(x_n, Y(x_n))
           ≈ y_n + h f(x_n, y_n)

Most numerical differentiation approximations can be used to obtain numerical methods for solving the initial value problem. However, most such formulas turn out to be poor methods for solving differential equations. We will see an example of this in one of the following sections of the book.
NUMERICAL INTEGRATION

Consider the numerical approximation

∫_a^{a+h} g(x) dx ≈ h g(a)

which is called the left-hand rectangle rule. It is the area of the rectangle with base [a, a + h] and height g(a).

Figure: An illustration of the left-hand rectangle rule, ∫_a^{a+h} g(x) dx ≈ h g(a).
EULER'S METHOD VIA NUMERICAL INTEGRATION

Return to the differential equation y' = f(x, y) and substitute the solution Y(x) for y:

Y'(x) = f(x, Y(x))

Integrate this over the interval [x_n, x_{n+1}]:

∫_{x_n}^{x_{n+1}} Y'(x) dx = ∫_{x_n}^{x_{n+1}} f(x, Y(x)) dx

Y(x_{n+1}) = Y(x_n) + ∫_{x_n}^{x_{n+1}} f(x, Y(x)) dx

Approximate this with the left-hand rectangle rule,

Y(x_{n+1}) ≈ Y(x_n) + h f(x_n, Y(x_n))

Again this leads to Euler's method.
EXAMPLE. Solve

Y'(x) = (Y(x) + x^2 − 2)/(x + 1),   Y(0) = 2

The solution is

Y(x) = x^2 + 2x + 2 − 2(x + 1) log(x + 1)

We give selected results for three values of h. Note the behaviour of the error as h is halved.

 h      x     y_h(x)    Error      Relative Error
 0.2    1.0   2.1592    6.82E−2    0.0306
        3.0   5.4332    4.76E−1    0.0805
        5.0   14.406    1.09       0.0703
 0.1    1.0   2.1912    3.63E−2    0.0163
        3.0   5.6636    2.46E−1    0.0416
        5.0   14.939    5.60E−1    0.0361
 0.05   1.0   2.2087    1.87E−2    0.00840
        3.0   5.7845    1.25E−1    0.0212
        5.0   15.214    2.84E−1    0.0183
GENERAL ERROR FORMULA

In general,

y_{n+1} = y_n + h f(x_n, y_n),   n = 0, 1, ..., N − 1

Y(x_{n+1}) = Y(x_n) + h Y'(x_n) + (h^2/2) Y''(ξ_n)
           = Y(x_n) + h f(x_n, Y(x_n)) + (h^2/2) Y''(ξ_n)

with some x_n ≤ ξ_n ≤ x_{n+1}.

We will use this as the starting point of our discussion of the error in Euler's method. In particular,

Y(x_{n+1}) − y_{n+1} = Y(x_n) − y_n + h [f(x_n, Y(x_n)) − f(x_n, y_n)] + (h^2/2) Y''(ξ_n)
ERROR ANALYSIS - SPECIAL CASES

We begin with a couple of special cases, to obtain some additional intuition on the behaviour of the error. Consider

y' = 2x,   y(0) = 0

This has the solution Y(x) = x^2. Euler's method becomes

y_{n+1} = y_n + 2 x_n h,   y_0 = 0

y_1 = y_0 + 2 x_0 h = x_1 x_0
y_2 = y_1 + 2 x_1 h = x_1 x_0 + 2 x_1 h = x_2 x_1
y_3 = y_2 + 2 x_2 h = x_2 x_1 + 2 x_2 h = x_3 x_2

By an induction argument,

y_n = x_n x_{n−1},   n ≥ 1

For the error,

Y(x_n) − y_n = x_n^2 − x_n x_{n−1} = x_n h

Note that Y(x_n) − y_n is proportional to h at each fixed x_n.
Return to our error equation

Y(x_{n+1}) − y_{n+1} = Y(x_n) − y_n + h [f(x_n, Y(x_n)) − f(x_n, y_n)] + (h^2/2) Y''(ξ_n)     (1)

With the mean value theorem,

f(x_n, Y(x_n)) − f(x_n, y_n) = [∂f(x_n, ζ_n)/∂y] [Y(x_n) − y_n]

for some ζ_n between Y(x_n) and y_n. As shorthand, use e_h(x) = Y(x) − y_h(x). Then we can write

e_h(x_{n+1}) = [1 + h ∂f(x_n, ζ_n)/∂y] e_h(x_n) + (h^2/2) Y''(ξ_n)     (2)

with e_h(x_0) = 0. We also will assume henceforth that

K ≡ max_{x_0 ≤ x ≤ b, −∞ < y < ∞} |∂f(x, y)/∂y| < ∞

Consider those differential equations with

∂f(x, y)/∂y ≤ 0,   x_0 ≤ x ≤ b,   −∞ < y < ∞

Then

−1 ≤ 1 + h ∂f(x_n, ζ_n)/∂y ≤ 1

provided h is chosen sufficiently small, e.g. if

h < 1/K

Using this in our error formula (2),

|e_h(x_{n+1})| ≤ |e_h(x_n)| + (h^2/2) ||Y''||_∞,   n ≥ 0     (3)

in which

||Y''||_∞ = max_{x_0 ≤ t ≤ b} |Y''(t)|

Using induction with (3), we can prove

|e_h(x_n)| ≤ (x_n − x_0) (h/2) ||Y''||_∞

Again the error is bounded by something of the form c(x_n) h.
EXAMPLE

Solve the initial value problem

y' = −y + cos x − sin x,   0 ≤ x ≤ 5,   y(0) = 2

Its true solution is

Y(x) = e^{−x} + cos x

Then

Y''(x) = e^{−x} − cos x
||Y''||_∞ = max_{0 ≤ x ≤ 5} |Y''(x)| ≈ 1.0442

We solve the problem with Euler's method on [0, 5].

For this differential equation,

f(x, y) = −y + cos x − sin x,   ∂f/∂y = −1

Then the error bound becomes

|e_h(x_n)| ≤ (x_n − x_0)(h/2) ||Y''||_∞ = 0.5221 h x_n

The following table gives the error when solving this problem with h = 0.1 on the interval [0, 5]. The final column is the error bound, and it clearly exceeds the magnitude of the actual error at each value of x_n.

 x_n   y_n            Y(x_n) − y_n   0.5221 h x_n
 0      2.00000000     0.000000       0.00000
 1      0.91533345    −0.007152       0.05221
 2     −0.28535798     0.004546       0.10442
 3     −0.97071646     0.030511       0.15663
 4     −0.67539538     0.040067       0.20884
 5      0.27163709     0.018763       0.26105
GENERAL ERROR ANALYSIS

Return to

e_h(x_{n+1}) = [1 + h ∂f(x_n, ζ_n)/∂y] e_h(x_n) + (h^2/2) Y''(ξ_n)

in which

e_h(x) = Y(x) − y_h(x)

As before, we assume

K ≡ max_{x_0 ≤ x ≤ b, −∞ < y < ∞} |∂f(x, y)/∂y| < ∞

This is much more restrictive than is needed, but it simplifies the statement and analysis of the error in solving differential equations. From this we can show

|e_h(x_n)| ≤ e^{(x_n − x_0)K} |e_h(x_0)| + h [ (e^{(x_n − x_0)K} − 1)/(2K) ] ||Y''||_∞

for all points x_0 ≤ x_n ≤ b. In theory, we would have e_h(x_0) = 0. But we can also examine what would happen if we allowed for errors in the choice of y_0.
EXAMPLE

Recall the last example, of solving

y' = −y + cos x − sin x,   0 ≤ x ≤ 5,   y(0) = 2

whose true solution is Y(x) = e^{−x} + cos x. Again,

f(x, y) = −y + cos x − sin x,   ∂f/∂y = −1

and then K = 1. Recall

Y''(x) = e^{−x} − cos x
||Y''||_∞ = max_{0 ≤ x ≤ 5} |Y''(x)| ≈ 1.0442

and also assume e_h(0) = 0. Then our general error bound becomes

|e_h(x_n)| ≤ h [ (e^{(x_n − x_0)K} − 1)/(2K) ] ||Y''||_∞
           = h (1.0442) (e^{x_n} − 1)/2

This is consistent with our earlier bound since

e^{x_n} − 1 ≥ x_n
ASYMPTOTIC ERROR FORMULA

We see from the earlier results that the error satisfies

|Y(x_n) − y_n| ≤ c(x_n) h,   x_0 ≤ x_n ≤ b

for some number c(x_n). We can improve upon this. Namely, it can be shown that

Y(x_n) − y_h(x_n) = D(x_n) h + O(h^2)

The term O(h^2) denotes a quantity which is bounded by a constant times h^2; we say the term being bounded is of order h^2. Thus

Y(x_n) − y_h(x_n) ≈ D(x_n) h     (4)

for smaller values of h. The function D(x) satisfies a particular differential equation, but it can seldom be found in practice since it depends on the solution Y(x). Instead we use (4) to justify using Richardson extrapolation to estimate the error.
RICHARDSON EXTRAPOLATION

Consider solving the initial value problem

y' = f(x, y),   x_0 ≤ x ≤ b,   y(x_0) = Y_0

with Euler's method, and suppose we do it twice, using stepsizes of h and 2h. Denote the respective numerical solutions by y_h(x_n) and y_{2h}(x_n). Then from (4) above, and at a generic node point x,

Y(x) − y_h(x) ≈ D(x) h
Y(x) − y_{2h}(x) ≈ D(x) (2h)

Multiply the first equation by 2 and then subtract the second equation. This yields

Y(x) − 2y_h(x) + y_{2h}(x) ≈ 0
Y(x) ≈ 2y_h(x) − y_{2h}(x)

This last formula is called "Richardson's extrapolation formula" for Euler's method. We can also use it to estimate the error:

Y(x) − y_h(x) ≈ [2y_h(x) − y_{2h}(x)] − y_h(x) = y_h(x) − y_{2h}(x)

The formula

Y(x) − y_h(x) ≈ y_h(x) − y_{2h}(x)

is called "Richardson's error estimation formula" for Euler's method.
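Richardson's error estimate only requires two runs of the solver. A minimal sketch follows, assuming the euler() routine from the earlier sketch in these notes (again Python/NumPy, with names of our choosing):

```python
import numpy as np

# Richardson's error estimation formula for Euler's method:
#   Y(x) - y_h(x) is approximately y_h(x) - y_2h(x)
f = lambda x, y: -y + np.cos(x) - np.sin(x)      # example used in these notes

x_h,  y_h  = euler(f, 0.0, 2.0, 5.0, 0.1)
x_2h, y_2h = euler(f, 0.0, 2.0, 5.0, 0.2)

shared_x = x_h[::2]                  # the two grids share every other node
estimate = y_h[::2] - y_2h           # Richardson error estimate for y_h
actual   = (np.exp(-shared_x) + np.cos(shared_x)) - y_h[::2]
for xi, e1, e2 in zip(shared_x[::10], actual[::10], estimate[::10]):
    print(f"x = {xi:4.1f}   actual {e1: .6f}   estimate {e2: .6f}")
```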

EXAMPLE: Recall the last example, of solving

y' = −y + cos x − sin x,   0 ≤ x ≤ 5,   y(0) = 2

whose true solution is Y(x) = e^{−x} + cos x. We do so with stepsizes h = 0.1 and 2h = 0.2.

 x_n   y_h(x_n)        Y(x_n) − y_h(x_n)   y_h(x_n) − y_{2h}(x_n)
 0      2.00000000      0.000000            0.000000
 1      0.91533345     −0.007152           −0.007593
 2     −0.28535798      0.004546            0.004420
 3     −0.97071646      0.030511            0.031741
 4     −0.67539538      0.040067            0.041539
 5      0.27163709      0.018763            0.018820
NUMERICAL STABILITY; IMPLICIT METHODS

When solving the initial value problem

Y'(x) = f(x, Y(x)),   x_0 ≤ x ≤ b
Y(x_0) = Y_0

we know that small changes in the initial data Y_0 will result in small changes in the solution of the differential equation. More precisely, consider the perturbed problem

Y_ε'(x) = f(x, Y_ε(x)),   x_0 ≤ x ≤ b
Y_ε(x_0) = Y_0 + ε

Then assuming f(x, z) and ∂f(x, z)/∂z are continuous for x_0 ≤ x ≤ b, −∞ < z < ∞, we have

max_{x_0 ≤ x ≤ b} |Y_ε(x) − Y(x)| ≤ c |ε|

for some constant c > 0. We would like our numerical methods to have a similar property.
Consider the Euler method

y_{n+1} = y_n + h f(x_n, y_n),   n = 0, 1, ...
y_0 = Y_0

and then consider the perturbed problem

y_{n+1}^ε = y_n^ε + h f(x_n, y_n^ε),   n = 0, 1, ...
y_0^ε = Y_0 + ε

We can show the following:

max_{x_0 ≤ x_n ≤ b} |y_n^ε − y_n| ≤ ĉ |ε|

for some constant ĉ > 0 and for all sufficiently small values of the stepsize h.

This implies that Euler's method is stable, and in the same manner as was true for the original differential equation problem.

The general idea of stability for a numerical method is essentially that given above for Euler's method.
There is a general theory for numerical methods for solving the initial value problem

Y'(x) = f(x, Y(x)),   x_0 ≤ x ≤ b
Y(x_0) = Y_0

If the truncation error in a numerical method has order 2 or greater, then the numerical method is stable if and only if it is a convergent numerical method.

Note that Euler's method has a truncation error of order 2.

The numerical methods studied in this chapter are both stable and convergent.

All of these general results on stability and convergence are valid if the stepsize h is sufficiently small. What is meant by this?

Rather than studying this for a general problem, we restrict our interest to the model problem

Y'(x) = λ Y(x),   x ≥ 0,   Y(0) = 1

The analysis for this problem is generally applicable to the more general differential equation problem.

The true solution is Y(x) = e^{λx}. When λ < 0, or λ is complex with Real(λ) < 0, we have Y(x) → 0 as x → ∞. We would like the same to be true for the numerical solution of the model problem.

We begin by studying Euler's method applied to the model problem.
EULER'S METHOD AND THE MODEL PROBLEM

Apply Euler's method to the model problem:

y_{n+1} = y_n + h λ y_n = (1 + hλ) y_n,   n = 0, 1, ...

with y_0 = 1. By induction,

y_n = (1 + hλ)^n,   n ≥ 0

The numerical solution y_n → 0 as n → ∞ if and only if

|1 + hλ| < 1

In the case λ is real and negative, this is equivalent to

−2 < hλ < 0

If |λ| is quite large, then h must be correspondingly small. Usually the truncation error does not require such a small value of h; it is needed only for stability.
NUMERICAL EXAMPLE: Consider the problem

Y'(x) = λY(x) + (1 − λ) cos(x) − (1 + λ) sin(x),   Y(0) = 1

The true solution is Y(x) = sin(x) + cos(x). We use Euler's method, and numerical results for varying λ and h are given below. The truncation error T_{n+1} = (h^2/2) Y''(ξ_n) does not depend on λ, but the numerical solution does.

 λ     x   Error: h = 0.5   Error: h = 0.1   Error: h = 0.01
 −1    1   2.46E−1          4.32E−2          4.22E−3
       3   2.66E−2          6.78E−3          7.22E−4
       5   2.72E−1          4.91E−2          4.81E−3
 −10   1   3.98E−1          6.99E−3          6.99E−4
       3   1.11E+2          3.86E−3          3.64E−4
       5   2.83E+4          3.78E−3          3.97E−4
 −50   1   3.26E+0          1.06E+3          1.39E−4
       3   1.08E+6          1.17E+15         8.25E−5
       5   3.59E+11         1.28E+27         7.00E−5
ABSOLUTE STABILITY

The set of values of hλ for which the numerical solution y_n → 0 as n → ∞ is called the region of absolute stability of the numerical method. We allow λ to be complex, restricting it with Real(λ) < 0.

With Euler's method, this region is the set of all complex numbers z = hλ for which

|1 + z| < 1

or equivalently,

|z − (−1)| < 1

This is a circle of radius one in the complex plane, centered at the complex number −1 + 0i.

If a numerical method has no restriction on hλ (with Real(λ) < 0) in order to have y_n → 0 as n → ∞, we say the numerical method is A-stable.
THE BACKWARD EULER METHOD

Expand the function Y(x) as a linear Taylor polynomial about x_{n+1}:

Y(x) = Y(x_{n+1}) + (x − x_{n+1}) Y'(x_{n+1}) + (1/2)(x − x_{n+1})^2 Y''(ξ_n)

with ξ_n between x and x_{n+1}. Let x = x_n, solve for Y(x_{n+1}), and use the differential equation to evaluate Y'(x_{n+1}):

Y(x_{n+1}) = Y(x_n) + h Y'(x_{n+1}) − (1/2) h^2 Y''(ξ_n)
           = Y(x_n) + h f(x_{n+1}, Y(x_{n+1})) − (1/2) h^2 Y''(ξ_n)

with ξ_n between x_n and x_{n+1}. The backward Euler method is obtained by dropping the truncation error:

y_{n+1} = y_n + h f(x_{n+1}, y_{n+1}),   n = 0, 1, ...
y_0 = Y_0

The truncation error is essentially of the same size as for Euler's method, but of opposite sign.
Apply the backward Euler method to the model problem

Y'(x) = λY(x),   x ≥ 0,   Y(0) = 1

This yields

y_{n+1} = y_n + hλ y_{n+1}
y_{n+1} = [1/(1 − hλ)] y_n,   n = 0, 1, ...

with y_0 = 1. By induction,

y_n = [1/(1 − hλ)]^n,   n = 0, 1, ...

We want to know when y_n → 0 as n → ∞. This will be true if

|1/(1 − hλ)| < 1

The hypothesis that λ < 0 or Real(λ) < 0 is sufficient to show this is true, regardless of the size of the stepsize h. Thus the backward Euler method is an A-stable numerical method.
NUMERICAL EXAMPLE

We apply the backward Euler method to the problem

Y'(x) = λY(x) + (1 − λ) cos(x) − (1 + λ) sin(x),   Y(0) = 1

The true solution is Y(x) = sin(x) + cos(x). We give numerical results with varying λ and with h = 0.5. The truncation error T_{n+1} = −(h^2/2) Y''(ξ_n) does not depend on λ, but the numerical solution does. But in contrast to Euler's method, there are no stability problems with the numerical solution.

 x    Error: λ = −1   Error: λ = −10   Error: λ = −50
 2    2.08E−1          1.97E−2          3.60E−3
 4    1.63E−1          3.35E−2          6.94E−3
 6    7.04E−2          8.19E−3          2.18E−3
 8    2.22E−1          2.67E−2          5.13E−3
 10   1.14E−1          3.04E−2          6.45E−3
SOLVING THE BACKWARD EULER METHOD

For a general differential equation, we must solve

y_{n+1} = y_n + h f(x_{n+1}, y_{n+1})     (1)

for each n. In most cases, this is a rootfinding problem for the equation

z = y_n + h f(x_{n+1}, z)     (2)

with the root z = y_{n+1}. Numerical methods of the form (1) for solving differential equations are called implicit methods.

Methods in which y_{n+1} is given explicitly are called explicit methods. Euler's method is an explicit method.

Fixed point iteration is often used to solve (2):

y_{n+1}^{(k+1)} = y_n + h f(x_{n+1}, y_{n+1}^{(k)}),   k = 0, 1, ...     (3)

For an initial guess, we use y_{n+1}^{(0)} = y_n or something even better. The iteration error satisfies

y_{n+1} − y_{n+1}^{(k+1)} ≈ h [∂f(x_{n+1}, y_{n+1})/∂y] (y_{n+1} − y_{n+1}^{(k)})     (4)

We have convergence if

|h ∂f(x_{n+1}, y_{n+1})/∂y| < 1     (5)

which is true if h is sufficiently small. If this is too restrictive on h, then another rootfinding method must be used to solve (2).
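A minimal sketch of backward Euler with the fixed point iteration (3) follows (Python; the function name and the fixed iteration count are ours). Note that the example uses λ = −1 so that condition (5) holds with h = 0.5; for the stiff values of λ used later, fixed point iteration would not converge and a Newton-type solver would be needed.

```python
import math

def backward_euler(f, x0, y0, b, h, fixed_point_iters=10):
    """Backward Euler, solving (2) at each step by fixed point iteration (3)."""
    N = int(round((b - x0) / h))
    xs, ys = [x0], [y0]
    for n in range(N):
        x_next = xs[-1] + h
        z = ys[-1]                          # initial guess y_{n+1}^(0) = y_n
        for _ in range(fixed_point_iters):
            z = ys[-1] + h * f(x_next, z)   # iteration (3)
        xs.append(x_next)
        ys.append(z)
    return xs, ys

# y' = lam*y + (1 - lam) cos x - (1 + lam) sin x, true solution sin x + cos x
lam = -1.0
f = lambda x, y: lam*y + (1 - lam)*math.cos(x) - (1 + lam)*math.sin(x)
xs, ys = backward_euler(f, 0.0, 1.0, 10.0, 0.5)
print(ys[-1], math.sin(10.0) + math.cos(10.0))   # numerical vs true value at x = 10
```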
THE TRAPEZOIDAL METHOD

The backward Euler method is stable, but still is lacking in accuracy. A similar but more accurate numerical method is the trapezoidal method:

y_{n+1} = y_n + (h/2) [f(x_n, y_n) + f(x_{n+1}, y_{n+1})],   n = 0, 1, ...     (6)

It is derived by applying the simple trapezoidal numerical integration rule to the equation

Y(x_{n+1}) = Y(x_n) + ∫_{x_n}^{x_{n+1}} f(t, Y(t)) dt

obtaining

Y(x_{n+1}) = Y(x_n) + (h/2) [f(x_n, Y(x_n)) + f(x_{n+1}, Y(x_{n+1}))] − (h^3/12) Y'''(ξ_n)

The method (6) results from dropping the truncation error term.
As with the backward Euler method, the equation (6) is a nonlinear equation with root y_{n+1}. Again, fixed point iteration can be used to solve it:

y_{n+1}^{(j+1)} = y_n + (h/2) [f(x_n, y_n) + f(x_{n+1}, y_{n+1}^{(j)})]

for j = 0, 1, 2, .... The iteration will converge if

(h/2) |∂f(x_{n+1}, y_{n+1})/∂z| < 1

provided also that y_{n+1}^{(0)} is a sufficiently good initial guess. Often we use

y_{n+1}^{(0)} = y_n + h f(x_n, y_n)

or

y_{n+1}^{(0)} = y_n + (h/2) [3f(x_n, y_n) − f(x_{n−1}, y_{n−1})]

These latter formulas are called predictor formulas. The first is simply Euler's method. The second is an Adams-Bashforth method of order 2, and it is an explicit 2-step method. Other rootfinding methods are used for more difficult problems.
CONVERGENCE: We can show that for all sufficiently small values of h,

max_{x_0 ≤ x_n ≤ b} |Y(x_n) − y_n| ≤ c h^2 max_{x_0 ≤ x ≤ b} |Y'''(x)|

The constant c depends on the Lipschitz constant K for f(x, z):

K = max_{x_0 ≤ x ≤ b, −∞ < z < ∞} |∂f(x, z)/∂z|

Thus the trapezoidal method converges more rapidly than either the Euler method or the backward Euler method.
STABILITY: The trapezoidal method is absolutely stable. Apply the method to the problem

Y'(x) = λY(x),   x ≥ 0,   Y(0) = 1

Then

y_{n+1} = y_n + (h/2) [λ y_n + λ y_{n+1}]

with y_0 = 1. Solving for y_{n+1},

y_{n+1} = [ (1 + hλ/2) / (1 − hλ/2) ] y_n,   n ≥ 0

By induction,

y_n = [ (1 + hλ/2) / (1 − hλ/2) ]^n,   n ≥ 0

Then for λ real and negative, and also for λ complex with Real(λ) < 0, we can use this formula to show y_n → 0 as n → ∞. This shows the trapezoidal method is absolutely stable.
NUMERICAL EXAMPLE: We apply the trapezoidal method to the problem

Y'(x) = λY(x) + (1 − λ) cos(x) − (1 + λ) sin(x),   Y(0) = 1

The true solution is Y(x) = sin(x) + cos(x). We give numerical results with varying λ and with h = 0.5. The truncation error T_{n+1} = −(h^3/12) Y'''(ξ_n) does not depend on λ, but the numerical solution does. As with the backward Euler method, there are no stability problems with the numerical solution. Moreover, it converges more rapidly than does the backward Euler method.

 x    Error: λ = −1   Error: λ = −10   Error: λ = −50
 2    1.13E−2          2.78E−3          7.91E−4
 4    1.43E−2          8.91E−5          8.91E−5
 6    2.02E−2          2.77E−3          4.72E−4
 8    2.86E−3          2.22E−3          5.11E−4
 10   1.79E−2          9.23E−4          1.56E−4
STIFF PROBLEMS: Linear equations

Y'(x) = λY(x) + g(x)

with Real(λ) < 0 and |Real(λ)| large are called stiff differential equations. For general equations, the role of λ is replaced by

∂f(x, Y(x))/∂z

With many numerical methods, the truncation error for these equations may be satisfactorily small with not too small a value of h. But the large size of |Real(λ)| may force h to be smaller, so that hλ lies in the stability region of the numerical method.

For stiff differential equations, one must use a numerical method with a large region of absolute stability, or else h must be chosen very small.

The backward Euler method and the trapezoidal method are very desirable because their stability region contains all hλ with Real(λ) < 0. The backward Euler method and the trapezoidal rule have convergence orders 1 and 2, respectively.
The iteration method for calculating y_{n+1} in

y_{n+1} = y_n + h f(x_{n+1}, y_{n+1})

or

y_{n+1} = y_n + (h/2) [f(x_n, y_n) + f(x_{n+1}, y_{n+1})]

must be something like Newton's method, something whose convergence is not affected by the size of hλ.
EXAMPLE OF A ONE-STEP METHOD

Consider solving

y' = y cos x,   y(0) = 1

Imagine writing a Taylor series for the solution Y(x), say initially about x = 0. Then

Y(h) = Y(0) + h Y'(0) + (h^2/2) Y''(0) + (h^3/6) Y'''(0) + ⋯

We can calculate Y'(0) = Y(0) cos(0) = 1. How do we calculate Y''(0) and higher order derivatives?

Y'(x) = Y(x) cos(x)
Y''(x) = −Y(x) sin(x) + Y'(x) cos(x)
Y'''(x) = −Y(x) cos(x) − 2Y'(x) sin(x) + Y''(x) cos(x)

Then Y(0) = 1, Y'(0) = 1, and

Y''(0) = −Y(0) sin(0) + Y'(0) cos(0) = 1
Y'''(0) = −Y(0) cos(0) − 2Y'(0) sin(0) + Y''(0) cos(0) = 0

Thus

Y(h) = Y(0) + h Y'(0) + (h^2/2) Y''(0) + (h^3/6) Y'''(0) + ⋯
     = 1 + h + h^2/2 + ⋯

We can generate as many terms as desired, obtaining added accuracy as we do so. In this particular case, the true solution is Y(x) = exp(sin x). Thus

Y(h) = 1 + h + h^2/2 − h^4/8 + ⋯

We can truncate the series after a particular order, then continue with the same process to generate approximations to Y(2h), Y(3h), .... Letting x_n = nh and using the order 2 Taylor approximation, we have

Y(x_{n+1}) = Y(x_n) + h Y'(x_n) + (h^2/2) Y''(x_n) + (h^3/6) Y'''(ξ_n)

with x_n ≤ ξ_n ≤ x_{n+1}. Drop the truncation error term, and then define

y_{n+1} = y_n + h y_n' + (h^2/2) y_n'',   n ≥ 0

with

y_n' = y_n cos(x_n)
y_n'' = −y_n sin(x_n) + y_n' cos(x_n)

We give a numerical example of computing the numerical solution with Taylor series methods of orders 2, 3, and 4. For a Taylor series of degree r, the global error will be O(h^r). The numerical example output is given in a separate file.
A 4th-ORDER EXAMPLE

Consider solving

y' = −y,   y(0) = 1

whose true solution is Y(x) = e^{−x}. Differentiating the equation

Y'(x) = −Y(x)

we obtain

Y'' = −Y' = Y,   Y''' = −Y,   Y^(4) = Y

Then expanding Y(x_n + h) in a Taylor series,

Y(x_{n+1}) = Y_n + h Y_n' + (h^2/2) Y_n'' + (h^3/6) Y_n''' + (h^4/24) Y_n^(4) + (h^5/120) Y^(5)(ξ_n)

Dropping the truncation error, we have the numerical method

y_{n+1} = y_n + h y_n' + (h^2/2) y_n'' + (h^3/6) y_n''' + (h^4/24) y_n^(4)
        = [1 − h + h^2/2 − h^3/6 + h^4/24] y_n

with y_0 = 1. By induction,

y_n = [1 − h + h^2/2 − h^3/6 + h^4/24]^n,   n ≥ 0

Recall that

e^{−h} = 1 − h + h^2/2 − h^3/6 + h^4/24 − (h^5/120) e^{−ξ}

with 0 < ξ < h. Then

y_n = [e^{−h} + (h^5/120) e^{−ξ}]^n
    = e^{−nh} [1 + (h^5/120) e^{h−ξ}]^n
    ≈ e^{−x_n} [1 + (x_n h^4/120) e^{h−ξ}]

Thus

Y(x_n) − y_n = e^{−x_n} · O(h^4)
We can introduce the Taylor series method for the general problem

y' = f(x, y),   y(x_0) = Y_0

Simply imitate what was done above for the particular problem y' = y cos x.

In general,

Y'(x) = f(x, Y(x))
Y''(x) = f_x(x, Y(x)) + f_y(x, Y(x)) Y'(x)
       = f_x(x, Y(x)) + f_y(x, Y(x)) f(x, Y(x))
Y'''(x) = f_xx + 2f_xy f + f_yy f^2 + f_y f_x + f_y^2 f

and we can continue in this manner. Thus we can calculate derivatives of any order for Y(x), and then we can define Taylor series methods of any desired order.

This used to be considered much too arduous a task for practical problems, because everything had to be done by hand. But with symbolic programs such as Mathematica and Maple, Taylor series can be considered a serious framework for numerical methods. Programs that implement this in an automatic way, with varying order and stepsize, are available.
RUNGE-KUTTA METHODS

Nonetheless, most researchers still consider Taylor series methods to be too expensive for most practical problems (a point contested by others). This leads us to look for other one-step methods which imitate the Taylor series methods, without the necessity of calculating the higher order derivatives. These are called Runge-Kutta methods. There are a number of ways in which one can approach Runge-Kutta methods, and I will describe a fairly classical approach.

We begin by considering explicit Runge-Kutta methods of order 2. We want to write

y_{n+1} = y_n + h F(x_n, y_n, h; f)

with F(x_n, y_n, h; f) some carefully chosen approximation to f(x, y) on the interval [x_n, x_{n+1}]. In particular, write

F(x, y, h; f) = γ_1 f(x, y) + γ_2 f(x + αh, y + βh f(x, y))

This is some kind of "average" derivative. Intuitively, we should restrict α so that 0 ≤ α ≤ 1. In addition to this, how should the constants γ_1, γ_2, α, β be chosen?

Introduce the truncation error

T_n(Y) = Y(x + h) − [Y(x) + h F(x, Y(x), h; f)]

and choose the constants to make T_n(Y) = O(h^p) with p as large as possible.

By choosing

α = β = 1/(2γ_2),   γ_1 = 1 − γ_2

we can show

T_n(Y) = O(h^3)
EXAMPLES

y_{n+1} = y_n + h F(x_n, y_n, h; f)
F(x, y, h; f) = γ_1 f(x, y) + γ_2 f(x + αh, y + βh f(x, y))

Case γ_2 = 1/2. This leads to the trapezoidal Runge-Kutta method:

y_{n+1} = y_n + (h/2) [f(x_n, y_n) + f(x_n + h, y_n + h f(x_n, y_n))]     (1)

Case γ_2 = 1. This leads to the midpoint Runge-Kutta method:

y_{n+1} = y_n + h f(x_n + h/2, y_n + (h/2) f(x_n, y_n))     (2)

We can derive other second order formulas by choosing other values for γ_2.

We illustrate these two methods by solving

Y'(x) = −Y(x) + 2 cos x,   Y(0) = 1

with true solution Y(x) = sin x + cos x. Observe the rate at which the error decreases when h is halved.
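A minimal sketch of the trapezoidal Runge-Kutta method (1), in Python/NumPy with names of our choosing, applied to the same test problem:

```python
import numpy as np

def rk2_trapezoidal(f, x0, y0, b, h):
    """The trapezoidal Runge-Kutta method (1), an explicit 2-stage method."""
    N = int(round((b - x0) / h))
    xs = x0 + h * np.arange(N + 1)
    ys = np.empty(N + 1)
    ys[0] = y0
    for n in range(N):
        k1 = f(xs[n], ys[n])
        k2 = f(xs[n] + h, ys[n] + h * k1)
        ys[n + 1] = ys[n] + (h / 2.0) * (k1 + k2)
    return xs, ys

f = lambda x, y: -y + 2.0 * np.cos(x)
xs, ys = rk2_trapezoidal(f, 0.0, 1.0, 10.0, 0.1)
print(np.sin(xs[-1]) + np.cos(xs[-1]) - ys[-1])   # error at x = 10, cf. the table below
```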
NUMERICAL EXAMPLE. Solve

Y'(x) = −Y(x) + 2 cos x,   Y(0) = 1

The true solution is Y(x) = sin x + cos x. We give numerical results for the Runge-Kutta method in (1) with stepsizes h = 0.1 and 0.05.

 h      x      y_h(x)          Error
 0.1    2.0     0.491215673     1.93E−3
        4.0    −1.407898629    −2.55E−3
        6.0     0.680696723     5.81E−5
        8.0     0.841376339     2.48E−3
        10.0   −1.380966579    −2.13E−3
 0.05   2.0     0.492682499     4.68E−4
        4.0    −1.409821234    −6.25E−4
        6.0     0.680734664     2.01E−5
        8.0     0.843254396     6.04E−4
        10.0   −1.382569379    −5.23E−4

Observe the ratio by which the error decreases when h is halved, for each fixed x. This is consistent with an error formula

Y(x_n) − y_h(x_n) = O(h^2)
FOUR-STAGE FORMULAS

We have just studied 2-stage formulas. To obtain a higher rate of convergence, we must use more derivative evaluations. A 4-stage formula looks like

y_{n+1} = y_n + h F(x_n, y_n, h; f)
F(x, y, h; f) = γ_1 V_1 + γ_2 V_2 + γ_3 V_3 + γ_4 V_4

V_1 = f(x, y)
V_2 = f(x + α_2 h, y + β_{2,1} h V_1)
V_3 = f(x + α_3 h, y + β_{3,1} h V_1 + β_{3,2} h V_2)
V_4 = f(x + α_4 h, y + h (β_{4,1} V_1 + β_{4,2} V_2 + β_{4,3} V_3))

Again it can be analyzed by expanding the truncation error

T_n(Y) = Y(x + h) − [Y(x) + h F(x, Y(x), h; f)]

in powers of h. We attempt to choose the unknown coefficients so as to force T_n(Y) = O(h^p) with p as large as possible. In the above case, this can be done so that p = 5. The algebra becomes very complicated, and we omit it here.

The truncation error indicates the new error introduced in each step of the numerical method. The total or global error in this case will be of size O(h^4).

The classical 4th-order formula follows:

y_{n+1} = y_n + (h/6) [V_1 + 2V_2 + 2V_3 + V_4]

V_1 = f(x_n, y_n)
V_2 = f(x_n + h/2, y_n + (h/2) V_1)
V_3 = f(x_n + h/2, y_n + (h/2) V_2)
V_4 = f(x_n + h, y_n + h V_3)

The even more accurate 4th-order Runge-Kutta-Fehlberg formula is given in the text, along with a numerical example.
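The classical 4th-order formula translates directly into code. A minimal sketch (Python/NumPy; names ours), with a check that halving h reduces the error by roughly a factor of 16, as O(h^4) predicts:

```python
import numpy as np

def rk4_classical(f, x0, y0, b, h):
    """The classical 4th-order Runge-Kutta formula given above."""
    N = int(round((b - x0) / h))
    xs = x0 + h * np.arange(N + 1)
    ys = np.empty(N + 1)
    ys[0] = y0
    for n in range(N):
        V1 = f(xs[n], ys[n])
        V2 = f(xs[n] + h / 2, ys[n] + (h / 2) * V1)
        V3 = f(xs[n] + h / 2, ys[n] + (h / 2) * V2)
        V4 = f(xs[n] + h, ys[n] + h * V3)
        ys[n + 1] = ys[n] + (h / 6.0) * (V1 + 2 * V2 + 2 * V3 + V4)
    return xs, ys

f = lambda x, y: -y + 2.0 * np.cos(x)       # test problem from these notes
for h in (0.2, 0.1):
    xs, ys = rk4_classical(f, 0.0, 1.0, 10.0, h)
    print(h, abs(np.sin(10.0) + np.cos(10.0) - ys[-1]))
```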
ERROR DISCUSSION

If a Runge-Kutta method satisfies

T_n(Y) = O(h^p)

with p ≥ 2, then it can be shown that

|Y(x_n) − y_n| ≤ c h^{p−1},   x_0 ≤ x_n ≤ b

when solving the initial value problem

y' = f(x, y),   x_0 ≤ x ≤ b,   y(x_0) = Y_0

We can also go further and show that

Y(x_n) − y_h(x_n) = D(x_n) h^{p−1} + O(h^p),   x_0 ≤ x_n ≤ b

This can then be used to justify Richardson extrapolation.

For example, if T_n(Y) = O(h^3), then Richardson's error estimate is

Y(x_n) − y_h(x_n) ≈ (1/3) [y_h(x_n) − y_{2h}(x_n)]
MULTISTEP METHODS

All of the methods we have studied until now are examples of one-step methods. These use one past value of the numerical solution, call it y_n, in order to compute a new value y_{n+1}. Multistep methods use more than one past value, and these are often more efficient than the earlier methods. For example,

y_{n+1} = y_n + (h/2) [3f(x_n, y_n) − f(x_{n−1}, y_{n−1})],   n ≥ 1     (1)

is an explicit, stable, second order method with truncation error of size O(h^3). We derive it below.

Multistep methods can be derived in a number of ways, but the most popular methods are derived most easily using numerical interpolation and numerical integration.

Integrate the equation

Y'(x) = f(x, Y(x))     (2)

over the interval [x_n, x_{n+1}]. This yields

Y(x_{n+1}) = Y(x_n) + ∫_{x_n}^{x_{n+1}} f(x, Y(x)) dx     (3)

Other intervals [x_{n−p}, x_{n+1}], p > 0, can also be used; but the most popular multistep methods are obtained with the choice [x_n, x_{n+1}].
Approximate the integral by replacing the integrand with a polynomial interpolant of it. For example, the linear interpolant of a function g(x) based on interpolation at {x_{n−1}, x_n} is given by

P_1(x) = [ (x_n − x) g(x_{n−1}) + (x − x_{n−1}) g(x_n) ] / h

Integrate this over the interval [x_n, x_{n+1}], obtaining

∫_{x_n}^{x_{n+1}} g(x) dx ≈ ∫_{x_n}^{x_{n+1}} P_1(x) dx = (3h/2) g(x_n) − (h/2) g(x_{n−1})

By using methods similar to those used in Section 5.2, it can be shown that

∫_{x_n}^{x_{n+1}} g(x) dx = (3h/2) g(x_n) − (h/2) g(x_{n−1}) + (5/12) h^3 g''(ξ_n)

for some ξ_n, x_{n−1} ≤ ξ_n ≤ x_{n+1}. Using this in (3) with g(x) = Y'(x) = f(x, Y(x)) yields

Y(x_{n+1}) = Y(x_n) + (h/2) [3f(x_n, Y(x_n)) − f(x_{n−1}, Y(x_{n−1}))] + (5/12) h^3 Y'''(ξ_n)

This leads to the method in (1) when the truncation error term is dropped:

y_{n+1} = y_n + (h/2) [3f(x_n, y_n) − f(x_{n−1}, y_{n−1})],   n ≥ 1

This is a two-step method (a second order recurrence relation), and it requires values for both y_0 and y_1 before proceeding to find y_n for n ≥ 2. A value for y_1 must be obtained by some other method.
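The sketch below (Python/NumPy; function names ours) implements the two-step method (1), generating the needed starting value y_1 with one step of the trapezoidal Runge-Kutta method; any other second-order starter would do as well.

```python
import numpy as np

def ab2(f, x0, y0, b, h):
    """The 2-step Adams-Bashforth method (1); y_1 is generated here by
    one step of the trapezoidal Runge-Kutta method."""
    N = int(round((b - x0) / h))
    xs = x0 + h * np.arange(N + 1)
    ys = np.empty(N + 1)
    ys[0] = y0
    k1 = f(xs[0], ys[0])                       # one RK2 step supplies y_1
    k2 = f(xs[0] + h, ys[0] + h * k1)
    ys[1] = ys[0] + (h / 2.0) * (k1 + k2)
    for n in range(1, N):
        ys[n + 1] = ys[n] + (h / 2.0) * (3.0 * f(xs[n], ys[n])
                                         - f(xs[n - 1], ys[n - 1]))
    return xs, ys

f = lambda x, y: -y + 2.0 * np.cos(x)          # test problem from these notes
xs, ys = ab2(f, 0.0, 1.0, 10.0, 0.05)
print(np.sin(10.0) + np.cos(10.0) - ys[-1])    # error at x = 10, cf. the table below
```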
NUMERICAL EXAMPLE. Consider solving

Y'(x) = −Y(x) + 2 cos x,   Y(0) = 1

The true solution is Y(x) = sin x + cos x. We give numerical results for the method in (1) with stepsizes h = 0.05 and 2h = 0.1.

 x    y_h(x)      Y(x) − y_{2h}(x)   Y(x) − y_h(x)   Ratio
2 0.492597 2.13E-3 5.53E-4 3.9
4 -1.411170 2.98E-3 7.24E-4 4.1
6 -0.675004 -3.91E-3 -9.88E-4 4.0
8 0.843737 3.68E-4 1.21E-4 3.0
10 -1.383983 3.61E-3 8.90E-4 4.1

This is consistent with an error formula

Y(x_n) − y_h(x_n) = O(h^2)
with a constant of proportionality that depends on xn.

Compare the error to that for a second order Runge-


Kutta method.
It can be shown that if

Y(x_0) − y_0 = O(h^2)   [usually zero]
Y(x_1) − y_1 = O(h^2)

with standard assumptions on the differentiability of f and Y, then we have

max_{x_0 ≤ x_n ≤ b} |Y(x_n) − y_h(x_n)| ≤ c h^2

for some constant c ≥ 0.

Moreover, if

Y(x_0) − y_0 ≈ c_0 h^2
Y(x_1) − y_1 ≈ c_1 h^2

for constants c_0, c_1, then

Y(x_n) − y_h(x_n) = D(x_n) h^2 + O(h^3)

with D(x) a continuous function for x_0 ≤ x ≤ b. From this, we also have the Richardson error estimate

Y(x_n) − y_h(x_n) ≈ (1/3) [y_h(x_n) − y_{2h}(x_n)]
NUMERICAL EXAMPLE (continuation)

Continue with solving

Y'(x) = −Y(x) + 2 cos x,   Y(0) = 1

using method (1). The true solution is Y(x) = sin x + cos x. We use stepsizes h = 0.05 and 2h = 0.1.

 x    y_h(x)       Y(x) − y_h(x)   (1/3)[y_h(x) − y_{2h}(x)]
 2     0.492597     5.53E−4          5.26E−4
 4    −1.411170     7.24E−4          7.52E−4
 6    −0.675004    −9.88E−4         −9.73E−4
 8     0.843737     1.21E−4          8.21E−5
 10   −1.383983     8.90E−4          9.08E−4
ADAMS-BASHFORTH METHODS

Recall the integrated formula

Y(x_{n+1}) = Y(x_n) + ∫_{x_n}^{x_{n+1}} f(x, Y(x)) dx     (4)

Approximate the integrand using a polynomial interpolant of degree q. In particular, do the interpolation using the q + 1 points x_n, x_{n−1}, ..., x_{n−q}.

Denote the interpolant by P_q(x):

P_q(x) = Σ_{j=0}^{q} L_{j,n}(x) f(x_{n−j}, Y(x_{n−j})) = Σ_{j=0}^{q} L_{j,n}(x) Y'(x_{n−j})

with L_{j,n}(x), 0 ≤ j ≤ q, denoting the Lagrange interpolation basis functions using the points x_n, ..., x_{n−q}. Substituting P_q(x) into (4) and integrating, we obtain a formula

Y(x_{n+1}) = Y(x_n) + h Σ_{j=0}^{q} w_{j,q} f(x_{n−j}, Y(x_{n−j})) + E_n     (5)

h w_{j,q} = ∫_{x_n}^{x_{n+1}} L_{j,n}(x) dx

and E_n denotes the error for this numerical integration over [x_n, x_{n+1}]. Dropping E_n, we obtain a way to approximate Y(x_{n+1}) using values at past points x_n, x_{n−1}, ..., x_{n−q}:

y_{n+1} = y_n + h Σ_{j=0}^{q} w_{j,q} f(x_{n−j}, y_{n−j})     (6)

This is called an Adams-Bashforth method. It is a (q + 1)-step explicit method, and its truncation error is of size O(h^{q+2}).

We give several of these numerical methods and their corresponding truncation error terms E_n.
q = 0:

yn+1 = yn + hf (xn; yn) [Euler's method]


1
En = h2Y 00( n)
2
q = 1: Use the notation yk0 = f (xk ; yk ).
hh 0 0
i
yn+1 = yn + 3 yn y n 1 ; n 1
2
5 3 000
En = h Y ( n)
12
q = 2:
h 0 0 0
yn+1 = yn + [23yn 16yn 1 + 5 yn 2 ]
12
3 4 (4)
En = h Y ( n)
8
q = 3:
h 0 0 0 0
yn+1 = yn + [55yn 59yn 1 + 37yn 2 9 yn 3]
24
251 5 (5)
En = h Y ( n)
720
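As an illustration, here is a minimal Python sketch of the q = 3 (fourth order) formula above; the routine name and the assumption that the four starting values are supplied externally (e.g. from a Runge-Kutta method) are not part of the text.

import numpy as np

def ab4(f, x0, y_start, h, n_steps):
    # Fourth order Adams-Bashforth; y_start holds the starting values y0, y1, y2, y3.
    x = x0 + h * np.arange(n_steps + 1)
    y = np.zeros(n_steps + 1)
    y[:4] = y_start
    for n in range(3, n_steps):
        y[n + 1] = y[n] + h / 24.0 * ( 55.0 * f(x[n],     y[n])
                                      - 59.0 * f(x[n - 1], y[n - 1])
                                      + 37.0 * f(x[n - 2], y[n - 2])
                                      -  9.0 * f(x[n - 3], y[n - 3]))
    return x, y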
CONVERGENCE. Note first that to compute yq+1 using (6)
we require the q + 1 initial values y0, y1, ..., yq.
Then we can continue using (6) to compute yn for
n ≥ q + 1. The initial values y1, ..., yq must be
computed by some other method, perhaps a Runge-Kutta
method or perhaps a lower order Adams-Bashforth method.
We can prove the following as regards the convergence.
Let

η(h) = max_{0 ≤ j ≤ q} |Y(xj) - yj|
τ(h) = max_{xq ≤ xn ≤ b} |En|

Then there is an h0 > 0 such that for 0 < h ≤ h0,
the numerical method of (6) is computable, and

|Y(xn) - yn| ≤ c [ η(h) + (1/h) τ(h) ]      (7)

for some constant c > 0. From the examples, we see
that

τ(h) ≤ d h^{q+2} max_{x0 ≤ x ≤ b} |Y^(q+2)(x)|

and this is true for general q ≥ 0.
STABILITY. All Adams-Bashforth methods possess
the basic type of numerical stability associated with
the earlier methods studied (Euler, backward Euler,
trapezoidal).

Consider solving

yn+1 = yn + h Σ_{j=0}^{q} wj,q f(xn-j, yn-j)

with initial values

yk = Y(xk),   k = 0, 1, ..., q

Then perturb these initial values and solve

zn+1 = zn + h Σ_{j=0}^{q} wj,q f(xn-j, zn-j)

with

|zk - yk| ≤ ε,   k = 0, 1, ..., q

for all sufficiently small values of h, say 0 < h ≤ h0.
Then there is a constant c > 0 with

max_{xq ≤ xn ≤ b} |zn - yn| ≤ c ε,   0 < h ≤ h0      (8)
ADAMS-MOULTON METHODS

These methods are derived in the same manner as the
Adams-Bashforth methods. We begin with

Y(xn+1) = Y(xn) + ∫_{xn}^{xn+1} f(x, Y(x)) dx

Approximate the integrand using a polynomial interpolant
of degree q. In particular, do the interpolation
using the q + 1 points {xn+1, xn, ..., xn-q+1}.
Denote the interpolant by P̂q(x):

P̂q(x) = Σ_{j=-1}^{q-1} L̂j,n(x) f(xn-j, Y(xn-j))
       = Σ_{j=-1}^{q-1} L̂j,n(x) Y'(xn-j)

with L̂j,n(x), -1 ≤ j ≤ q-1, denoting the Lagrange
interpolation basis functions using the points
{xn+1, ..., xn-q+1}.
Substituting P̂q(x) into the integral and integrating,
we obtain a formula

Y(xn+1) = Y(xn) + h Σ_{j=-1}^{q-1} vj,q f(xn-j, Y(xn-j)) + Ên

h vj,q = ∫_{xn}^{xn+1} L̂j,n(x) dx

and Ên denotes the error for this numerical integration
over [xn, xn+1]. Dropping Ên, we obtain a way
to approximate Y(xn+1) using values at the points
xn+1, ..., xn-q+1:

yn+1 = yn + h Σ_{j=-1}^{q-1} vj,q f(xn-j, yn-j)

This is called an Adams-Moulton method. It is a q-step
implicit method, and its truncation error is of size
O(h^{q+2}).
We give several of these numerical methods and their
corresponding truncation error terms Ên.
Recall the notation y'k = f(xk, yk), k ≥ 0.

q = 0: It is the backward Euler method,

yn+1 = yn + h y'n+1
Ên = -(1/2) h^2 Y''(ξn)

q = 1: It is the trapezoidal method,

yn+1 = yn + (h/2) [y'n+1 + y'n]
Ên = -(1/12) h^3 Y'''(ξn)

q = 2:

yn+1 = yn + (h/12) [5 y'n+1 + 8 y'n - y'n-1]
Ên = -(1/24) h^4 Y^(4)(ξn)

q = 3:

yn+1 = yn + (h/24) [9 y'n+1 + 19 y'n - 5 y'n-1 + y'n-2]
Ên = -(19/720) h^5 Y^(5)(ξn)
CONVERGENCE and STABILITY. The convergence
and stability properties of Adams-Bashforth methods
extend to Adams-Moulton methods; in particular, the
results given in (7) and (8) carry over.

From earlier, the backward Euler method and the
trapezoidal method are A-stable methods. This is not
true of higher order Adams-Moulton methods. But
such methods do have very desirable stability and convergence
properties, important enough to justify the
cost of solving the implicit equation at each step.
As earlier, we often use fixed point iteration to solve
the implicit equation. For the example with q = 2,
we use

y^{(k+1)}_{n+1} = yn + (h/12) [5 f(xn+1, y^{(k)}_{n+1}) + 8 y'n - y'n-1]

for k = 0, 1, ... The initial guess y^{(0)}_{n+1} is often obtained
by using an Adams-Bashforth method of a comparable
order (q = 1 or q = 2 in this case).
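A minimal sketch of one such step for the q = 2 Adams-Moulton formula, using the second order Adams-Bashforth value as the initial guess; the function name, the fixed iteration count, and the array layout are illustrative assumptions.

def am2_step(f, x, y, yp, n, h, n_iter=2):
    # One Adams-Moulton (q = 2) step by fixed point iteration.
    # y[k], yp[k] hold y_k and y'_k = f(x_k, y_k) for k <= n; x holds the nodes.
    # Predictor: second order Adams-Bashforth value as the initial guess y_{n+1}^{(0)}.
    y_new = y[n] + h / 2.0 * (3.0 * yp[n] - yp[n - 1])
    for _ in range(n_iter):
        # Corrector:
        #   y_{n+1}^{(k+1)} = y_n + (h/12) [5 f(x_{n+1}, y_{n+1}^{(k)}) + 8 y'_n - y'_{n-1}]
        y_new = y[n] + h / 12.0 * (5.0 * f(x[n + 1], y_new) + 8.0 * yp[n] - yp[n - 1])
    return y_new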
GENERAL REMARKS

Most large scale packages for solving differential equations
to high accuracy are based on using Adams-Bashforth
and Adams-Moulton methods, usually varying
both the stepsize h and the order q + 1 of the
method.

Matlab contains such a code, called ode113.
SYSTEMS OF ODES

Consider the pendulum shown below. Assume the rod
is of negligible mass, that the pendulum is of mass m,
and that the rod is of length l. Assume the pendulum
moves in the plane shown, and assume there is no
friction in the motion about its pivot point. Let θ(x)
denote the position of the pendulum about the vertical
line through the pivot, with θ measured in radians and x
measured in units of time. Then Newton's second
law implies

m l θ''(x) = -m g sin(θ(x))
Introduce Y1(x) = θ(x) and Y2(x) = θ'(x). The
function Y2(x) is called the angular velocity. We can
now write

Y1'(x) = Y2(x),                  Y1(0) = θ(0)
Y2'(x) = -(g/l) sin(Y1(x)),      Y2(0) = θ'(0)
This is a simultaneous system of two differential equations
in two unknowns.
We often write this in vector form. Introduce

Y(x) = [Y1(x), Y2(x)]^T

Then

Y'(x) = [ Y2(x), -(g/l) sin(Y1(x)) ]^T

Y(0) = Y0 = [Y1(0), Y2(0)]^T = [θ(0), θ'(0)]^T
Introduce

f(x, z) = [ z2, -(g/l) sin(z1) ]^T,   z = [z1, z2]^T

Then our differential equation problem

Y'(x) = [ Y2(x), -(g/l) sin(Y1(x)) ]^T,   Y(0) = Y0 = [θ(0), θ'(0)]^T

can be written in the familiar form

Y'(x) = f(x, Y(x)),   Y(0) = Y0      (1)
We can convert any higher order differential equation
into a system of first order differential equations, and
we can write such a system in the vector form (1).
Lotka-Volterra predator-prey model.

Y1' = A Y1 [1 - B Y2],   Y1(0) = Y1,0
Y2' = C Y2 [D Y1 - 1],   Y2(0) = Y2,0      (2)

with A, B, C, D > 0. Here x denotes time, Y1(x) is the
number of prey (e.g., rabbits) at time x, and Y2(x)
the number of predators (e.g., foxes). If there is only a
single type of predator and a single type of prey, then
this model is often a good approximation of reality.
Again write

Y(x) = [Y1(x), Y2(x)]^T

and define

f(x, z) = [ A z1 (1 - B z2), C z2 (D z1 - 1) ]^T,   z = [z1, z2]^T

although there is no explicit dependence on x. Then
system (2) can be written as

Y'(x) = f(x, Y(x)),   Y(0) = Y0
GENERAL SYSTEMS OF ODES

An initial value problem for a system of m differential
equations has the form

Y1'(x) = f1(x, Y1(x), ..., Ym(x)),   Y1(x0) = Y1,0
  ...
Ym'(x) = fm(x, Y1(x), ..., Ym(x)),   Ym(x0) = Ym,0      (3)
Introduce

Y(x) = [Y1(x), ..., Ym(x)]^T,   Y0 = [Y1,0, ..., Ym,0]^T

f(x, z) = [ f1(x, z1, ..., zm), ..., fm(x, z1, ..., zm) ]^T

Then (3) can be written as

Y'(x) = f(x, Y(x)),   Y(x0) = Y0
LINEAR SYSTEMS

Of special interest are systems of the form

Y'(x) = A Y(x) + G(x),   Y(0) = Y0      (4)

with A a square matrix of order m and G(x) a column
vector of length m with functions Gi(x) as components.
Using the notation introduced for writing
systems,

f(x, z) = A z + G(x),   z ∈ R^m

This equation plays the same role in studying systems of
ODEs that the model equation

y' = λ y + g(x)

plays in studying a single differential equation.
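For a linear system, the right-hand side in the general notation is simply a matrix-vector product plus a vector; a two-line Python sketch (the routine name is an illustrative assumption):

import numpy as np

def linear_rhs(A, G):
    # Returns f(x, z) = A z + G(x) for the linear system (4);
    # A is an m x m array and G(x) returns a length-m array.
    return lambda x, z: A @ z + G(x)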
EULER'S METHOD FOR SYSTEMS

Consider

Y'(x) = f(x, Y(x)),   Y(0) = Y0

to be a system of two equations

Y1'(x) = f1(x, Y1(x), Y2(x)),   Y1(0) = Y1,0
Y2'(x) = f2(x, Y1(x), Y2(x)),   Y2(0) = Y2,0      (5)

Denote its solution by [Y1(x), Y2(x)].

Following the earlier derivations for Euler's method,
we can use Taylor's theorem to obtain

Y1(xn+1) = Y1(xn) + h f1(xn, Y1(xn), Y2(xn)) + (h^2/2) Y1''(ξ1,n)
Y2(xn+1) = Y2(xn) + h f2(xn, Y1(xn), Y2(xn)) + (h^2/2) Y2''(ξ2,n)      (6)

Dropping the remainder terms, we obtain Euler's method
for problem (5),

y1,n+1 = y1,n + h f1(xn, y1,n, y2,n),   y1,0 = Y1,0
y2,n+1 = y2,n + h f2(xn, y1,n, y2,n),   y2,0 = Y2,0

for n = 0, 1, 2, ...
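In vector form this is a one-line change from the scalar case. A minimal Python sketch follows, using the pendulum system from earlier as the right-hand side; the routine names and the parameter values (g, the length, the initial angle) are illustrative assumptions.

import numpy as np

def euler_system(f, x0, Y0, h, n_steps):
    # Euler's method for Y'(x) = f(x, Y(x)) with Y a vector.
    x = x0 + h * np.arange(n_steps + 1)
    Y = np.zeros((n_steps + 1, len(Y0)))
    Y[0] = Y0
    for n in range(n_steps):
        Y[n + 1] = Y[n] + h * f(x[n], Y[n])
    return x, Y

def pendulum(x, z, g=9.81, ell=1.0):
    # Right-hand side of the pendulum system: z = [theta, theta'].
    return np.array([z[1], -(g / ell) * np.sin(z[0])])

x, Y = euler_system(pendulum, 0.0, np.array([0.5, 0.0]), 0.01, 1000)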
ERROR ANALYSIS

If Y1(x), Y2(x) are twice continuously differentiable,
and if the functions f1(x, z1, z2) and f2(x, z1, z2) are
sufficiently differentiable, then it can be shown that

max_{x0 ≤ xn ≤ b} |Y1(xn) - y1,n| ≤ c h
max_{x0 ≤ xn ≤ b} |Y2(xn) - y2,n| ≤ c h      (7)

for a suitable choice of c ≥ 0.
The theory depends on generalizations of the proof
used with Euler's method for a single equation. One
needs to assume that there is a constant K > 0 such
that

||f(x, z) - f(x, w)||_∞ ≤ K ||z - w||_∞      (8)

for x0 ≤ x ≤ b and z, w ∈ R^2. Recall the definition of
the norm || · ||_∞ from Chapter 6.
The role of ∂f(x, z)/∂z in the single variable theory
is replaced by the Jacobian matrix

F(x, z) = [ ∂f1(x, z1, z2)/∂z1   ∂f1(x, z1, z2)/∂z2
            ∂f2(x, z1, z2)/∂z1   ∂f2(x, z1, z2)/∂z2 ]      (9)

It is possible to show that

K = max_{x0 ≤ x ≤ b, z ∈ R^2} ||F(x, z)||_∞

is suitable for showing (8).
All of this work generalizes to problems of any order
m ≥ 2. Then we require

||f(x, z) - f(x, w)||_∞ ≤ K ||z - w||_∞      (10)

with x0 ≤ x ≤ b and z, w ∈ R^m. The choice of K is
often obtained using

K = max_{x0 ≤ x ≤ b, z ∈ R^m} ||F(x, z)||_∞

where F(x, z) is the m × m generalization of (9).
The Euler method in all cases can be written in the
same form, regardless of the dimension:

yn+1 = yn + h f(xn, yn),   n ≥ 0

with y0 = Y0.
It can be shown that if (10) is satisfied, and if Y(x)
is twice continuously differentiable on [x0, b], then

max_{x0 ≤ xn ≤ b} ||Y(xn) - yn||_∞ ≤ c h      (11)

for some c ≥ 0 and for all sufficiently small values of h.
In addition, we can show there is a vector function
D(x) for which

Y(x) - yh(x) = D(x) h + O(h^2),   x = x0, x1, ..., b

Here yh(x) shows the dependence of the numerical solution
on h, with yh(x) = yn for x = x0 + nh. This
justifies the use of Richardson extrapolation, leading to

Y(x) - yh(x) = yh(x) - y2h(x) + O(h^2)
NUMERICAL EXAMPLE. Consider solving the initial
value problem

Y''' + 3Y'' + 3Y' + Y = -4 sin(x)
Y(0) = Y'(0) = 1,   Y''(0) = -1      (12)

Reformulate it as

Y1' = Y2,                                 Y1(0) = 1
Y2' = Y3,                                 Y2(0) = 1
Y3' = -Y1 - 3Y2 - 3Y3 - 4 sin(x),         Y3(0) = -1      (13)

The solution of (12) is Y(x) = cos(x) + sin(x), and
the solution of (13) can be generated from it using
Y1(x) = Y(x).
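A sketch of the corresponding right-hand side for (13), in the form expected by the vector Euler routine sketched above; the function name is illustrative, and the signs follow the reconstruction of (12)-(13) given here.

import numpy as np

def f13(x, z):
    # z = [Y1, Y2, Y3]; right-hand side of system (13).
    return np.array([z[1],
                     z[2],
                     -z[0] - 3.0 * z[1] - 3.0 * z[2] - 4.0 * np.sin(x)])

Y0 = np.array([1.0, 1.0, -1.0])   # Y(0), Y'(0), Y''(0)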
The results for Y1(x) = sin(x) + cos(x) are given in the
following tables, for stepsizes 2h = 0.1 and h = 0.05.
The Richardson error estimate is quite accurate.

x      Y(x)        Y(x) - y2h(x)    Y(x) - yh(x)    Ratio
2       0.49315     8.78E-2          4.25E-2         2.1
4      -1.41045     1.39E-1          6.86E-2         2.0
6       0.68075     5.19E-2          2.49E-2         2.1
8       0.84386     1.56E-1          7.56E-2         2.1
10     -1.38309     8.39E-2          4.14E-2         2.0

x      Y(x)        Y(x) - yh(x)     yh(x) - y2h(x)
2       0.49315     4.25E-2          4.53E-2
4      -1.41045     6.86E-2          7.05E-2
6       0.68075     2.49E-2          2.70E-2
8       0.84386     7.56E-2          7.99E-2
10     -1.38309     4.14E-2          4.25E-2
OTHER METHODS

Other numerical methods apply to systems in the same
straightforward manner. By using the vector form

Y'(x) = f(x, Y(x)),   Y(0) = Y0      (14)

for a system, there is no apparent change in the numerical
method. For example, the Runge-Kutta method

yn+1 = yn + (h/2) [f(xn, yn) + f(xn+1, yn + h f(xn, yn))],   n ≥ 0

for solving a single differential equation generalizes, with
y and f interpreted as vectors, to

yn+1 = yn + (h/2) [f(xn, yn) + f(xn+1, yn + h f(xn, yn))],   n ≥ 0

for solving (14). This can then be decomposed into components
if needed. For a system of order 2, we have

yj,n+1 = yj,n + (h/2) [ fj(xn, y1,n, y2,n)
          + fj(xn+1, y1,n + h f1(xn, y1,n, y2,n), y2,n + h f2(xn, y1,n, y2,n)) ]

for n ≥ 0 and j = 1, 2.
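The same step written directly in vector form, as a minimal Python sketch with illustrative names; any right-hand side coded like the examples above can be passed in.

import numpy as np

def rk2_system(f, x0, Y0, h, n_steps):
    # Trapezoidal (second order) Runge-Kutta method applied to the vector system.
    x = x0 + h * np.arange(n_steps + 1)
    Y = np.zeros((n_steps + 1, len(Y0)))
    Y[0] = Y0
    for n in range(n_steps):
        k1 = f(x[n], Y[n])
        k2 = f(x[n + 1], Y[n] + h * k1)
        Y[n + 1] = Y[n] + h / 2.0 * (k1 + k2)
    return x, Y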
TWO-POINT BVP

Consider the two-point boundary value problem for a
second-order linear equation:

Y''(x) = p(x) Y'(x) + q(x) Y(x) + r(x),   a ≤ x ≤ b
Y(a) = g1,   Y(b) = g2

Assume the given functions p, q and r are continuous
on [a, b]. Unlike the initial value problem for this equation,
which always has a unique solution, the theory of
the two-point boundary value problem is more complicated.
We will assume the problem has a unique
smooth solution Y; a sufficient condition for this is
q(x) > 0 for x ∈ [a, b].

In general, we need to depend on numerical methods
to solve the problem.
FINITE DIFFERENCE METHOD

We derive a finite difference scheme for the two-point
boundary value problem in three steps.

Step 1. Discretize the interval [a, b].

Let N be a positive integer, and divide the interval
[a, b] into N equal parts:

[a, b] = [x0, x1] ∪ [x1, x2] ∪ ... ∪ [xN-1, xN]

Let h = (b - a)/N, called the stepsize or gridsize. The points
xi = a + i h, 0 ≤ i ≤ N, are the grid (or node) points.

We use the notation pi = p(xi), qi = q(xi), ri = r(xi),
0 ≤ i ≤ N. For 0 ≤ i ≤ N, yi is the numerical
approximation of Yi = Y(xi).
Step 2. Discretize the differential equation at the interior
node points x1, ..., xN-1.

Recall

Y'(xi) = [Yi+1 - Yi-1] / (2h) + O(h^2)
Y''(xi) = [Yi+1 - 2 Yi + Yi-1] / h^2 + O(h^2)

Then the differential equation at x = xi becomes

[Yi+1 - 2 Yi + Yi-1] / h^2 = pi [Yi+1 - Yi-1] / (2h) + qi Yi + ri + O(h^2)

Dropping the O(h^2) term and replacing Yi by yi, we obtain
the difference equations

[yi+1 - 2 yi + yi-1] / h^2 = pi [yi+1 - yi-1] / (2h) + qi yi + ri

for 1 ≤ i ≤ N-1. So

-(1 + (h/2) pi) yi-1 + (2 + h^2 qi) yi + ((h/2) pi - 1) yi+1 = -h^2 ri,   1 ≤ i ≤ N-1
Step 3. Treatment of boundary conditions.

Use

y0 = g1,   yN = g2

Then the difference equation with i = 1 becomes

(2 + h^2 q1) y1 + ((h/2) p1 - 1) y2 = -h^2 r1 + (1 + (h/2) p1) g1

and that with i = N-1 becomes

-(1 + (h/2) pN-1) yN-2 + (2 + h^2 qN-1) yN-1 = -h^2 rN-1 + (1 - (h/2) pN-1) g2
Finally, the finite difference system is

A y = b

where the unknown numerical solution vector is

y = [y1, ..., yN-1]^T

the right-hand side vector is

b = [ -h^2 r1 + (1 + (h/2) p1) g1,  -h^2 r2,  ...,  -h^2 rN-2,  -h^2 rN-1 + (1 - (h/2) pN-1) g2 ]^T

and the coefficient matrix A is tridiagonal.
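A minimal Python sketch of assembling and solving this system; the routine name is illustrative, and a dense solve (numpy.linalg.solve) is used only for brevity where a tridiagonal or banded solver (e.g. scipy.linalg.solve_banded) would normally be preferred.

import numpy as np

def fd_bvp(p, q, r, a, b, g1, g2, N):
    # Solve Y'' = p(x) Y' + q(x) Y + r(x), Y(a) = g1, Y(b) = g2,
    # with the centered-difference scheme on N equal subintervals.
    # p, q, r are assumed to accept numpy arrays of node values.
    h = (b - a) / N
    x = a + h * np.arange(N + 1)
    pi, qi, ri = p(x[1:N]), q(x[1:N]), r(x[1:N])   # values at the interior nodes

    A = np.zeros((N - 1, N - 1))
    rhs = -h**2 * ri
    for i in range(N - 1):                 # row i corresponds to node x_{i+1}
        A[i, i] = 2.0 + h**2 * qi[i]
        if i > 0:
            A[i, i - 1] = -(1.0 + h / 2.0 * pi[i])
        if i < N - 2:
            A[i, i + 1] = h / 2.0 * pi[i] - 1.0
    rhs[0]  += (1.0 + h / 2.0 * pi[0]) * g1
    rhs[-1] += (1.0 - h / 2.0 * pi[-1]) * g2

    y = np.zeros(N + 1)
    y[0], y[N] = g1, g2
    y[1:N] = np.linalg.solve(A, rhs)
    return x, y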
THEORETICAL RESULTS

Suppose the true solution Y(x) has several continuous
derivatives. For the finite difference scheme, we have
the following results.

1. The scheme is second-order accurate,

max_{0 ≤ i ≤ N} |Y(xi) - yi| = O(h^2)

2. There is an asymptotic error expansion

Y(xi) - yh(xi) = h^2 D(xi) + O(h^4)

for some function D(x) independent of h.

Define the Richardson extrapolation

ỹh(xi) = [4 yh(xi) - y2h(xi)] / 3

Then

Y(xi) - ỹh(xi) = O(h^4)

i.e., without much additional effort, we obtain a fourth-order
approximate solution.
Actually we can have more terms in the asymptotic error
expansion

Y(xi) - yh(xi) = h^2 D1(xi) + h^4 D2(xi) + O(h^6)

for some functions D1(x) and D2(x) independent of h.

We can then perform a further step of extrapolation,

Y(xi) - [16 ỹh(xi) - ỹ2h(xi)] / 15 = O(h^6)

to get even higher order convergence.
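A short sketch of these two extrapolation steps, assuming the arrays yh, y2h, y4h hold numerical solutions restricted to the nodes of the coarsest grid (the names are illustrative):

# yh, y2h, y4h: numerical solutions with stepsizes h, 2h, 4h,
# all sampled at the common (coarsest-grid) nodes.
ytilde_h  = (4.0 * yh  - y2h) / 3.0                 # O(h^4) accurate
ytilde_2h = (4.0 * y2h - y4h) / 3.0
y_extrap  = (16.0 * ytilde_h - ytilde_2h) / 15.0    # O(h^6) accurate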
EXAMPLE. Use the finite difference method to solve
the boundary value problem

Y'' = -[2x/(1 + x^2)] Y' + Y + 2/(1 + x^2) - log(1 + x^2),   0 ≤ x ≤ 1
Y(0) = 0
Y(1) = log(2)

The true solution is Y(x) = log(1 + x^2).
Numerical errors Y(x) - yh(x):

x      h = 1/20     h = 1/40     R     h = 1/80     R
0.1    5.10E-5      1.27E-5      4.0   3.18E-6      4.0
0.2    7.84E-5      1.96E-5      4.0   4.90E-6      4.0
0.3    8.64E-5      2.16E-5      4.0   5.40E-6      4.0
0.4    8.08E-5      2.02E-5      4.0   5.05E-6      4.0
0.5    6.73E-5      1.68E-5      4.0   4.21E-6      4.0
0.6    5.08E-5      1.27E-5      4.0   3.17E-6      4.0
0.7    3.44E-5      8.60E-6      4.0   2.15E-6      4.0
0.8    2.00E-5      5.01E-6      4.0   1.25E-6      4.0
0.9    8.50E-6      2.13E-6      4.0   5.32E-7      4.0
The column marked "R" next to the column of the
solution errors for a stepsize h consists of the ratios
of the solution errors for the stepsize 2h to those for
the stepsize h. We clearly observe an error reduction
by a factor of around 4 when the stepsize is halved,
indicating second order convergence of the method.
The next table gives the extrapolation errors for solving
the boundary value problem, showing the accuracy
improvement by the extrapolation.
x      h = 1/40     h = 1/80     R
0.1    9.23E-09     5.76E-10     16.01
0.2    1.04E-08     6.53E-10     15.99
0.3    6.60E-09     4.14E-10     15.96
0.4    1.18E-09     7.57E-11     15.64
0.5    3.31E-09     2.05E-10     16.14
0.6    5.76E-09     3.59E-10     16.07
0.7    6.12E-09     3.81E-10     16.04
0.8    4.88E-09     3.04E-10     16.03
0.9    2.67E-09     1.67E-10     16.02
DIFFERENCE SCHEME FOR GENERAL EQUATION

Difference schemes for solving boundary value problems
of more general equations can be derived similarly.
As an example, consider

Y'' = f(x, Y, Y')

At an interior node point xi, the differential equation
can be approximated by the difference equation

[yi+1 - 2 yi + yi-1] / h^2 = f(xi, yi, [yi+1 - yi-1] / (2h))
TREATMENT OF OTHER BOUNDARY CONDITIONS

Boundary conditions involving the derivative of the
unknown need to be discretized carefully.

Consider the following boundary condition at x = b:

Y'(b) + k Y(b) = g2

If we use the discrete boundary condition

[yN - yN-1] / h + k yN = g2

then the difference solution will have first-order accuracy
only, even though the difference equations at
the interior nodes are second-order. To maintain second-order
accuracy, we need a second-order treatment of the
derivative term Y'(b); e.g., since

Y'(b) = [3 YN - 4 YN-1 + YN-2] / (2h) + O(h^2)

we can approximate the boundary condition by

[3 yN - 4 yN-1 + yN-2] / (2h) + k yN = g2
[Figure: surface plot of the true solution over the x-axis and y-axis.]
[Figure: surface plots of the numerical solution and the error, Solution: h = 0.0625, Error: h = 0.0625, over the x-axis and y-axis.]
[Figure: the space-time region for the heat equation u_t = a u_xx + f, 0 ≤ x ≤ L, with boundary conditions u(0,t) = g1(t), u(L,t) = g2(t) and initial condition u(x,0) = u0(x).]
[Figure: surface plot of the true solution over the x-axis and t-axis.]
[Figure: surface plots of the numerical solution and the error, Solution: ht = 0.0021, hx = 0.0833; Error: ht = 0.0021, hx = 0.0833, over the x-axis and t-axis.]
[Figure: surface plots of the numerical solution and the error, Solution: ht = 0.0200, hx = 0.1000; Error: ht = 0.0200, hx = 0.1000, over the x-axis and t-axis.]
[Figure: surface plots of the numerical solution and the error, Solution: ht = 0.0100, hx = 0.0500; Error: ht = 0.0100, hx = 0.0500, over the x-axis and t-axis.]
[Figure: surface plots of the numerical solution and the error, Solution: ht = 0.0100, hx = 0.0500; Error: ht = 0.0100, hx = 0.0500, over the x-axis and t-axis.]
[Figure: surface plot of the true solution over the x-axis and t-axis.]
[Figure: surface plots of the numerical solution and the error, Solution: ht = 0.0500, hx = 0.1571; Error: ht = 0.0500, hx = 0.1571, over the x-axis and t-axis.]