Error Analysis of
Numerical Differentiation
Dr. Harvey J. Stein
Head, Quantitative Finance R&D
Bloomberg L.P.
22 June 2005
Revision: 1.13
1. Overview
Although one could try to take this limit numerically, this is much work. More commonly, one chooses a small value of h = h0, and tries to verify that the approximation

    f'(x) ≈ f'_r(x) = (f(x + h0) - f(x)) / h0

is sufficiently close to the desired derivative.
But, once we're not taking the limit, we run into questions, such as:

What value of h0?

Why f'_r, and not

    f'_l(x) = (f(x) - f(x - h0)) / h0

or

    f'_c(x) = (f(x + h0) - f(x - h0)) / (2 h0)?
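These three estimators are easy to sketch in code. A minimal sketch (Python; the function names are mine, with math.sin standing in for f):

```python
import math

def right_diff(f, x, h):
    # f'_r(x): one sided (right) difference
    return (f(x + h) - f(x)) / h

def left_diff(f, x, h):
    # f'_l(x): one sided (left) difference
    return (f(x) - f(x - h)) / h

def central_diff(f, x, h):
    # f'_c(x): centered difference
    return (f(x + h) - f(x - h)) / (2 * h)

# Compare against the known derivative of sin at 1, cos(1) ~ 0.5403.
h0 = 1e-5
print(right_diff(math.sin, 1.0, h0))
print(central_diff(math.sin, 1.0, h0))
```

Even at this single point the centered estimate is several orders of magnitude closer to cos(1) than the one sided ones, which is the subject of the rest of these notes.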
[Figure: Normal CDF; central derivative with h = 10^-12; central derivative with h = 10^-13]
As can be seen in the above graph, h0 = 10^-13 exhibits some noise (which, I'm afraid, is barely visible on this slide).
[Figure: the normal CDF under extreme magnification, where it appears as a step function]
Doubles by the IEEE standard are 64 bits long with a 53 bit mantissa. For any given exponent, one can only represent 2^52 different positive values (one bit is the sign bit). 2^52 ≈ 10^16, so we only get to use at most 16 decimal digits to represent a mantissa. In particular, our computers think that 1 + 10^-17 = 1. This discreteness affects the input value, the output value, and all intermediate computations.
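This is easy to verify directly; a quick sketch:

```python
import sys

# Adding 1e-17 to 1 is lost entirely: the doubles adjacent to 1 are
# about 2^-52 apart, so anything under half that spacing rounds away.
print(1.0 + 1e-17 == 1.0)   # True
print(1.0 + 1e-15 == 1.0)   # False

# The 53 bit mantissa and the 2^-52 spacing at 1 are visible in float_info.
print(sys.float_info.mant_dig)   # 53
print(sys.float_info.epsilon)    # 2^-52
```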
Inspecting the above graph, we see that a stepsize of 10^-16 for x values around 1 will give either zero or a ridiculously high value for the derivative. 10^-15 will also give noisy derivatives, depending on where the actual x + h0 and x - h0 lie on the step function that constitutes the normal CDF at this level of magnification.
There's another effect that's commonly discussed: the fact that (x + h) - (x - h) ≠ 2h because of roundoff error. This means that the denominator of our approximation shouldn't really be h. There are two ways to fix this. One is to use a power of 2 for h instead of a power of 10. The other is to set h = (x + h) - x. The latter requires a little bit of effort to prevent an optimizing compiler from reducing it to h = h.
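A sketch of the h = (x + h) - x adjustment (in Python, where no optimizing compiler interferes; the function name is mine):

```python
def adjusted_step(x, h):
    # Replace h with the exactly representable difference (x + h) - x,
    # so the denominator matches the spacing actually used in the numerator.
    return (x + h) - x

x, h = 1.0, 1e-4
ha = adjusted_step(x, h)
print(ha == h)             # typically False: h isn't exactly representable relative to x
print((x + ha) - x == ha)  # True: the adjusted step survives the round trip exactly
```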
But, I've only observed minor impact from such adjustments. For clarity, I'll ...
Continuing along, we see that 10^-6 works much better than 10^-8:
[Figure: normal - delta f, for h = 10^-8, 10^-7, and 10^-6]
Finally, we see that 10^-5 gives the best results, with 10^-4 being smooth, but giving a high bias.
[Figure: normal - delta f, for h = 10^-6, 10^-5, and 10^-4]
The best h0 for f(x) = x^3 ends up again being around 10^-5. Coincidence?
[Figure: deriv(x^3) error, for h = 10^-6, 10^-5, and 10^-4]
It appears to consist of noise plus some periodic error. The noise is from cancellation error, and the periodic component is from convexity error.
6. Convexity error
    f(x + h) = f(x) + f'(x) h + (f''(x)/2!) h^2 + (f'''(x)/3!) h^3 + ...

    f(x + h) - f(x - h) = 2 f'(x) h + 2 (f'''(x)/3!) h^3 + ...

    (f(x + h) - f(x - h)) / (2h) = f'(x) + (f'''(x)/3!) h^2 + ...

So the convexity error of the centered derivative is (f'''/3!) h^2.
This is one of the reasons why a centered derivative is favored over a one sided derivative: the f''h error term drops out, so the convexity error is smaller. Note also that the one sided derivative is the same as the two sided derivative computed at half the stepsize and shifted by half the stepsize, so in some sense the error in the one sided derivative is that it's estimating the derivative at the wrong x value.
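The different error orders are easy to see numerically. A small sketch (using exp, whose derivative is known; names are mine):

```python
import math

def one_sided(f, x, h):
    return (f(x + h) - f(x)) / h

def centered(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

# Halving h roughly halves the one sided error (O(h)) but quarters
# the centered error (O(h^2)).
x = 1.0
err_os = [abs(one_sided(math.exp, x, h) - math.e) for h in (1e-3, 5e-4)]
err_c = [abs(centered(math.exp, x, h) - math.e) for h in (1e-3, 5e-4)]
print(err_os[0] / err_os[1])  # ~2
print(err_c[0] / err_c[1])    # ~4
```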
7. Cancellation error
      f(x + h) = .335 + noise
    - (f(x - h) = .231 + noise)
      f(x + h) - f(x - h) = .104 + noise

However,

      f(x + h) = .335 + noise
    - (f(x - h) = .331 + noise)
      f(x + h) - f(x - h) = .004 + noise
In the first case, the difference yielded 3 significant digits. In the second case, the high order digits cancelled, leaving only 1 significant digit in the difference. This is cancellation error.
Clearly, cancellation error increases as h decreases. For h sufficiently small, f(x + h) = f(x - h) and the relative error becomes infinite.
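A small sketch of the vanishing difference (using exp; at h = 10^-17, x + h and x - h are the same double):

```python
import math

x = 1.0
for h in (1e-5, 1e-11, 1e-17):
    diff = math.exp(x + h) - math.exp(x - h)
    print(h, diff)

# At h = 1e-17, x + h == x - h == x, so the difference is exactly zero
# and the derivative estimate is 0: infinite relative error.
```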
In general, we can use relative errors to encode the number of significant digits in our computations. Let

    f~(x) = f(x) (1 + ε(x)),

where f~(x) is what we actually get when computing f(x). Here ε(x) is a random quantity that quantifies the relative error in calculating f(x). If we're accurate to machine precision, then |ε| ≈ 2^-53. If our calculation only yields 5 decimal digits of accuracy, then |ε| ≈ 10^-5.
    f~(x + h) - f~(x - h) = f(x + h) - f(x - h)
                            + ε(x + h) f(x + h) - ε(x - h) f(x - h)
                          ≈ f(x + h) - f(x - h) + ε(x) f(x)

where ε(x) in the last line collects the noise terms.
The cancellation error in the centered difference is therefore roughly ε(x) f(x) / (2h). Setting the convexity error equal to the cancellation error,

    (f'''(x)/3!) h^2 = ε f / (2h)

implies

    f''' h^3 - 3 ε f = 0

or

    h = cbrt(3 ε f / f''').
If our calculations are exact except for roundoff error, and f and f''' are around the same order of magnitude, then ε ≈ 2^-53 (52 accurate digits, base 2), which gives an optimal h of about 7 × 10^-6. This ties out with our above empirical work, and indicates that both functions are being computed with around full accuracy.
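The optimal h and its consequences can be checked numerically; a sketch, assuming ε = 2^-53 and f of the same order as f''':

```python
import math

def optimal_h(eps=2.0**-53):
    # h = cbrt(3 eps f / f'''), with f and f''' of comparable size.
    return (3.0 * eps) ** (1.0 / 3.0)

def centered(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

h_opt = optimal_h()
print(h_opt)  # ~6.9e-6

# The near-optimal h clearly beats a much larger one on exp at x = 1.
err = lambda h: abs(centered(math.exp, 1.0, h) - math.e)
print(err(h_opt), err(1e-2))
```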
For x^2, we know the 3rd derivative is zero. This means that the only error is from cancellation, which means that larger h should automatically be better. Graphs confirm this theory:
[Figure: deriv(x^2) error, for h = 10^-2, 10^-1, and 10^0]
Note that in finance our error is often on the order of a penny on a 100 value, a relative error of about 10^-5, which requires h0 ≈ 0.03, a rather large value!
We can approximate higher derivatives similarly:

    f''(x) ≈ (f(x + h) + f(x - h) - 2 f(x)) / h0^2

and

    f'''(x) ≈ (f(x + 2h) - 2 f(x + h) + 2 f(x - h) - f(x - 2h)) / (2 h0^3)

so that the second and third derivative approximations sample f with the same spacing as the first derivative calculation.
Graphing f'' and f''' as a function of step size shows that the second derivative is visibly poor for h0 = 10^-7 and that the third derivative is visibly poor for h0 = 10^-5:
[Figure: Normal CDF; derivative with h = 10^-5; 2nd derivative with h = 10^-7 and 10^-6; 3rd derivative with h = 10^-5 and 10^-4]
This at least indicates that something between 10^-6 and 10^-4 is called for when computing f'. Can we make this more precise? Maybe, but we haven't tried.
One might consider trying to integrate the derivative to see how close it comes back to the original function. Unfortunately this doesn't help, because the sum is effectively a telescoping series. If Δx = 2h and X = {x1, x1 + Δx, x1 + 2Δx, ..., x2}, then

    Σ_{x in X} ((f(x + h) - f(x - h)) / (2h)) Δx = f(x2 + h) - f(x1 - h).
In other words, the errors in the derivative from one point to the next cancel each other. This is clear because if a given point is too high, then the derivative to the left will be too large while on the right it will be too small.
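The telescoping is exact when Δx = 2h; a quick sketch (h chosen as a power of 2 so the grid arithmetic is exact):

```python
import math

def centered(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

f = math.sin
h = 0.0625            # a power of 2: x + h and x - h are exact on this grid
dx = 2 * h
xs = [1.0 + i * dx for i in range(11)]      # x1 = 1.0, ..., x2 = 2.25
integral = sum(centered(f, x, h) * dx for x in xs)
telescoped = f(xs[-1] + h) - f(xs[0] - h)
print(integral, telescoped)  # agree to roundoff
```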
Consider pricing a call option on a binomial lattice:

    C(S0) = e^{-rT} Σ_{j=0}^{N} (N choose j) p^j q^{N-j} max(S0 u^j d^{N-j} - K, 0)
          = e^{-rT} Σ_{j=j(S0)}^{N} (N choose j) (S0 (pu)^j (qd)^{N-j} - K p^j q^{N-j})

where p = (e^{r Δt} - d)/(u - d), q = 1 - p, and j(S0) is the lowest index for which the payoff is positive. Then

    dC/dS0 = e^{-rT} Σ_{j=j(S0)}^{N} (N choose j) (pu)^j (qd)^{N-j}.

The derivative is a step function, only changing value when j(S0) changes.
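To make the step-function delta concrete, here is a sketch of a CRR binomial call and its small-h centered difference (function and parameter names are mine; the 12 step, 30% vol, 3% rate setup matches the slides):

```python
import math

def crr_call(s0, k=100.0, t=1.0, vol=0.3, r=0.03, n=12):
    """European call priced on an n-step Cox-Ross-Rubinstein lattice."""
    dt = t / n
    u = math.exp(vol * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(r * dt) - d) / (u - d)
    q = 1.0 - p
    disc = math.exp(-r * t)
    return disc * sum(
        math.comb(n, j) * p**j * q**(n - j)
        * max(s0 * u**j * d**(n - j) - k, 0.0)
        for j in range(n + 1)
    )

def delta(s0, h):
    return (crr_call(s0 + h) - crr_call(s0 - h)) / (2 * h)

# With a tiny h the finite-difference delta is constant across wide
# ranges of s0: the lattice price is piecewise linear in s0, so dC/dS0
# is a step function.
print(delta(105.0, 1e-6), delta(110.0, 1e-6))
```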
This is rarely noticed when graphing the function, just like the errors in the derivative calculations weren't noticed in our initial graphs:
[Figure: Option value, 1 yr opt, 30% vol, 3% risk free rate, 12 step lattice]
A 12 step lattice gives us large piecewise linear sections. A 120 step lattice, while
increasing computation by a factor of 100, only decreases the sizes of the steps
by about a factor of 3.
10. Smoothing
When the calculation is a black box, we can't get inside to use the internals in the calculation. In this case, how can one compute a good derivative?
One trick is to use a large h. We still suffer convexity error, but it's swamped by the error from the piecewise linearity of the function. Picking h around 1 to 2x the support of the linear segments will do it.
11. H adjustment
Here we can see that with a 12 step lattice, we need to compute the derivative with h0 ≈ 17.
[Figure: dC/dS, 1 yr opt, 30% vol, 3% risk free rate, 12 step lattice; h = .01, 1, 10, 17]
The second derivative is also helped by using a larger stepsize, but still isn't especially good:
[Figure: 2nd deriv of 1 yr opt, 30% vol, 3% risk free rate, 12 step lattice; h = 1, 10, 17]
More commonly, people would use 120 levels for a 1 year stock option, but even
this requires a large value of h0 :
[Figure: dC/dS of 1 yr opt, 30% vol, 3% risk free rate, 120 step lattice; h = .01, 1, 5]
Second derivative:
[Figure: 2nd deriv of 1 yr opt, 30% vol, 3% risk free rate, 120 step lattice; h = 5]
For fun, let's take a look at what happens with 1200 levels, which is over 3 levels/day:
[Figure: dC/dS, 30% vol, 3% risk free rate, 1200 step lattice; h = .01, 2]
As you can see, we still have fairly large piecewise linear sections. We need to
make h0 around 2 to get reasonable derivative estimates.
Second derivative:
[Figure: 2nd deriv, 30% vol, 3% risk free rate, 1200 step lattice; h = 2]
Why did an h0 of 17 for 12 levels, 5 for 120 levels and 2 for 1200 levels work reasonably well? As mentioned before, the stepsize needed is roughly the lattice spacing. This is approximately 2 S0 σ sqrt(T/N), which is 17 for 12 steps/year, 5.5 for 120 steps/year, and 1.7 for 1200 steps/year.
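A sketch of the spacing estimate (the function name is mine):

```python
import math

def lattice_spacing(s0, vol, t, n):
    # Node spacing near s0 is roughly s0*(u - d) ~ 2*s0*vol*sqrt(t/n)
    # for a CRR lattice with n steps over t years.
    return 2.0 * s0 * vol * math.sqrt(t / n)

for n in (12, 120, 1200):
    print(n, lattice_spacing(100.0, 0.3, 1.0, n))
# roughly 17, 5.5, and 1.7, matching the step sizes found above
```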
Even for a dense lattice of 1200 levels, a much larger stepsize is required than is commonly recognized.
In fact, it's common to use monthly steps in a binomial lattice for long dated bonds, and a bump size of 10bp for modified duration and key rate duration with a one sided derivative.
Let's take a look at the behavior of this. We'll use a trinomial lattice, which gives better results than a binomial lattice.
First, consider the error in computing the change in a callable bond as a function
of the step size using a centered derivative.
[Figure: Centered derivatives; dC/dS at 10bp, 25bp, and 50bp shifts, vs curve shift]
A 25bp shift is bumpy, but looks fairly close to what it should be.
[Figure: dC/dS 25bp centered vs 10bp, 25bp, and 50bp one sided, vs curve shift]
The stepping in the one sided derivative for a given h0 is the same as that for
the centered derivative at h0 /2, but the convexity error is much worse.
Next, consider the key rate sensitivities. The bond isn't sensitive to the 3mo rate, so the 1st key rate sensitivity should be zero.
[Figure: key rate duration, k1, call sensitivity vs level; 10bp, 25bp, 50bp, one sided and centered]
It ends up being close to zero, but noisy, and pretty similar for all step sizes
and seemingly unaffected by whether we use a centered derivative or a one sided
derivative.
The second key rate sensitivities suffer from the piecewise nature of the calculation, both the centered ones as well as the one sided ones.
[Figure: key rate duration, k2, call sensitivity vs level; 10bp, 25bp, 50bp, one sided and centered]
Comparing the sum of the one sided key rates to the 25bp centered derivatives, we
see that the sum suffers both from the piecewise nature as well as the convexity,
and suffers worse than the one sided full sensitivity.
[Figure: Sum of key rates, one sided (10bp, 25bp, 50bp) vs dC/dS 25bp, vs curve shift]
The centered key rates are better, with the 25bp step size landing fairly close to
the 25bp derivative.
[Figure: Sum of key rates, centered (10bp, 25bp) vs dC/dS 25bp, vs curve shift]
Comparing the 50bp centered key rates to the 50bp difference derivative, we
see that the two are close, but are significantly different. This is because the
key rates interact with the piecewise nature differently than the full curve shift.
[Figure: Sum of key rates, centered, 50bp vs dC/dS 50bp, vs curve shift]
12. Filtering
A more sophisticated approach is to smooth our pricing function. Essentially, we'd like to filter out the high frequencies that come from the corners where the slope changes, leaving only the lower frequency data arising from the changing function values.
This amounts to computing the Fourier transform of the price function, multiplying by a function that decays to zero (to dampen out the high frequency noise), and transforming back, or

    Smooth f = F^{-1}(F(f) D)

where D is our damping function (or smoothing kernel), and F is the Fourier transform.
Continuing,

    Smooth f = F^{-1}(F(f) D) = F^{-1}(F(f * F^{-1}(D))) = f * F^{-1}(D),

where * denotes convolution. So, smoothing a function is the same as computing its convolution with the inverse transform of the smoothing kernel. Since (f * g)' = f' * g = f * g', smoothing the derivative can be done by convolving with the derivative of the inverse transform of the smoothing kernel. Finally, since the Fourier transform of a Gaussian PDF is a Gaussian (up to scaling), we can smooth by integrating against a Gaussian and its derivatives.
All that's left is to integrate a function times a Gaussian, which is best done by Gaussian quadrature.
Writing the smoothed function as

    (1/(σ sqrt(2π))) ∫ f(x0 - x) e^{-x^2/(2σ^2)} dx,

its derivative is

    -(1/(σ^3 sqrt(2π))) ∫ f(x0 - x) x e^{-x^2/(2σ^2)} dx
      = -(sqrt(2)/(σ sqrt(π))) ∫ f(x0 - sqrt(2) σ x) x e^{-x^2} dx
      ≈ -(sqrt(2)/(σ sqrt(π))) Σ_i w_i x_i f(x0 - sqrt(2) σ x_i)

where x_i are the Gaussian quadrature points and w_i are the associated weights.
The theory sounds beautiful, and looks like exactly what we need, but it doesn't live up to its promise in practice. Although I've used this method in the past, and it has applications in signal processing, I've been unable to make it perform better than a two point difference derivative. It seems that it works better in the random noise case than on piecewise linear functions.
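As a sketch of the quadrature version (5-point Gauss-Hermite nodes hardcoded; the rule is exact for polynomials, so x^2 is a convenient check; all names here are mine):

```python
import math

# 5 point Gauss-Hermite rule: integral of g(t) exp(-t^2) dt ~ sum w_i g(t_i),
# exact for polynomials up to degree 9.
GH5 = [(-2.0201828704560856, 0.019953242059045913),
       (-0.9585724646138185, 0.3936193231522412),
       (0.0, 0.9453087204829419),
       (0.9585724646138185, 0.3936193231522412),
       (2.0201828704560856, 0.019953242059045913)]

def smoothed_deriv(f, x0, sigma):
    # Convolution of f with the derivative of a Gaussian of width sigma:
    # -(sqrt(2)/(sigma sqrt(pi))) * integral f(x0 - sqrt(2) sigma t) t e^{-t^2} dt
    c = -math.sqrt(2.0) / (sigma * math.sqrt(math.pi))
    return c * sum(w * t * f(x0 - math.sqrt(2.0) * sigma * t) for t, w in GH5)

# Smoothing x^2 with a Gaussian gives x^2 + sigma^2, whose derivative is 2x.
print(smoothed_deriv(lambda x: x * x, 3.0, 1.0))  # ~6
```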
[Figure: first derivative, FFT smoothing vs difference derivative, vs stock price]
The 5 point FFT method yields similar results to the 2 point difference derivative.
Both look good by inspection.
But of course, the best way to check is to compare to a good reference. In this case, we'll compare the error relative to a difference derivative computed on the formula using a step size of 10^-5.
[Figure: FFT vs centered difference, 1st derivative errors (centered err; FFT err, sig=3; FFT err, sig=2), vs stock price]
The FFT method is hard pressed to do better than a well chosen step size.
It's hard to make either method produce both a smooth and accurate second derivative:
[Figure: FFT vs centered difference, 2nd derivative (formula h=10^-4; centered h=15; FFT h=6), vs stock price]
Both look equally poor, with the FFT method requiring twice the computational
effort.
Another trick is to take a complex step. Expanding,

    f(x + ih) = f(x) + f'(x) ih - (f''(x)/2) h^2 - (f'''(x)/3!) ih^3 + (f''''(x)/4!) h^4 + ...

so

    Im(f(x + ih)) / h = f'(x) - (f'''(x)/3!) h^2 + ...
3!
This has the same convexity error as the centered derivative, but doesnt directly
suffer from cancellation error, allowing one to reduce h to lower convexity error
without increasing cancellation error.
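A sketch of the complex step (using cmath.sin as a complex-analytic stand-in for f):

```python
import cmath
import math

def complex_step(f, x, h=1e-20):
    # Im(f(x + ih))/h: no subtraction, hence no cancellation error,
    # so h can be made absurdly small.
    return f(complex(x, h)).imag / h

print(complex_step(cmath.sin, 1.0))   # ~cos(1)

# A centered difference at the same h returns exactly 0, since
# 1 + 1e-20 and 1 - 1e-20 are both just 1.0 in double precision.
print((math.sin(1.0 + 1e-20) - math.sin(1.0 - 1e-20)) / 2e-20)
```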
While this approach can be useful in analytic methods, difficulties are encountered when trying to apply it in finance. It doesn't correct for correlated errors when the function is piecewise linear; it just does a very good job of returning the slope of the linear sections, yielding a step function for the derivative. It's also not as straightforward as it looks. One can't just change all references ...
Applying this to the lattice price

    C(S0) = e^{-rT} Σ_{j=j(S0)}^{N} (N choose j) (S0 (pu)^j (qd)^{N-j} - K p^j q^{N-j}),

we see that

    Im(C(S0 + ih)) / h = e^{-rT} Σ_{j=j(S0)}^{N} (N choose j) (pu)^j (qd)^{N-j}.

Up to roundoff error, the complex method gives the same results as the centered difference.
Unfortunately, the approach can't always be applied. In interest rate lattices, the values at the other nodes don't always correspond to a shift of the yield curve. In normal short rate models they do, but in log normal models they don't. In the latter case, to apply this approach, one would have to either adjust the derivative or settle for differentiating with respect to a different sort of curve move.
Nonetheless, where this method applies, it works quite well. We'll compare it to the best of fixed h0 selection. Consider Black-Scholes again with a 1 year option, 30% vol, 3% risk free rate, computed using a 120 step binomial lattice. Again, the first derivatives are visually fine:
[Figure: Centered difference vs lattice shift, 1st derivative (formula h=10^-5; centered h=5; lattice shift), vs stock price]
But the differences to a reference show that the shifted lattice approach is far
smoother and more accurate:
[Figure: Centered difference vs lattice shift, 1st derivative errors (centered err; lattice shift), vs stock price]
The results on the second derivative are more pronounced. The fixed h selection
is visibly poor, while the shifted lattice still looks quite good:
[Figure: Centered difference vs lattice shift, 2nd derivative (formula h=10^-4; centered h=15; lattice shift), vs stock price]
Looking at the errors for the lattice shift by itself, we can see the errors in the second derivative calculation are around 10^-4, which is about a 1.5% error in the second derivative.
[Figure: Lattice shift 2nd derivative error, vs stock price]
Surprisingly, this method yields reasonable results even with a monthly lattice. Here's the first derivative:
[Figure: Centered difference vs lattice shift, 1st derivative (formula h=10^-5; lattice shift), monthly lattice, vs stock price]
The shifted lattice performs well because it samples the price function at exactly the right points. At option expiration, on the up tree there's exactly one more node in the money and one fewer out of the money, and the rest get exactly the same value.
When the call option price is

    C(S0) = e^{-rT} Σ_{j=j(S0)}^{N} (N choose j) (S0 (pu)^j (qd)^{N-j} - K p^j q^{N-j}),

the up and down shifted prices are

    C(S0 u) = e^{-rT} Σ_{j=j(S0)-1}^{N} (N choose j) (S0 u (pu)^j (qd)^{N-j} - K p^j q^{N-j})

    C(S0 d) = e^{-rT} Σ_{j=j(S0)+1}^{N} (N choose j) (S0 d (pu)^j (qd)^{N-j} - K p^j q^{N-j})
Taking the difference and dividing by S0 u - S0 d gives

    (C(S0 u) - C(S0 d)) / (S0 u - S0 d)
      = e^{-rT} [ Σ_{j=j(S0)+1}^{N} (N choose j) (pu)^j (qd)^{N-j}
          + Σ_{j=j(S0)-1}^{j(S0)} (N choose j) (u (pu)^j (qd)^{N-j} - (K/S0) p^j q^{N-j}) / (u - d) ]

It's exactly the centered difference derivative for a small shift, plus a correction term that's a linear function of 1/S0. It's the correction term varying as a function of S0 while j(S0) remains fixed that makes up the appropriate correction for the derivative calculation.
It must be noted that despite the above graphs, optimizing h0 and the shifted lattice method are actually the same numerically. Picking h0 gives poorer results in the above tests predominantly because we're not picking a different h0 for each underlying. Fixing it for the entire computation is why it's not behaving nearly as well.
Setting h = S0 u - S0 would make the difference derivative quite close to the shifted lattice value. Using separate up and down shifts instead of doing a centered derivative would make them identical. But this is the same value at almost three times the computational effort.
For Monte Carlo pricing, one can similarly often find a pathwise derivative such that

    (d/dS0) E[X(S0)] = E[(d/dS0) X(S0)],

which again allows computing the derivative directly by Monte Carlo instead of taking the difference of two Monte Carlo price calculations.
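For instance, for a European call under Black-Scholes Monte Carlo, the pathwise derivative is dS_T/dS0 = S_T/S0 on in-the-money paths. A sketch (the parameters and names are mine; the reference Black-Scholes delta N(d1) for this setup is about 0.599):

```python
import math
import random

def pathwise_delta(s0, k, t, vol, r, n_paths, seed=12345):
    # d/dS0 E[e^{-rT} max(S_T - K, 0)] = E[e^{-rT} 1{S_T > K} S_T / S0],
    # differentiating under the expectation (valid a.e. for the call payoff).
    rng = random.Random(seed)
    disc = math.exp(-r * t)
    drift = (r - 0.5 * vol * vol) * t
    volt = vol * math.sqrt(t)
    acc = 0.0
    for _ in range(n_paths):
        st = s0 * math.exp(drift + volt * rng.gauss(0.0, 1.0))
        if st > k:
            acc += st / s0
    return disc * acc / n_paths

# Black-Scholes delta for these parameters is N(0.25) ~ 0.5987.
print(pathwise_delta(100.0, 100.0, 1.0, 0.3, 0.03, 200_000))
```

One derivative estimate per simulation, with no bump size to tune and no cancellation between two noisy prices.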
Differentiation under the integral can also be used when valuing options via FFT. The derivative can be computed by computing the FFT of the derivative of the characteristic function.
18. Summary
Approximating the derivative by a difference magnifies the error of the original function.
Small step sizes give huge errors due to cancellation error.
Large step sizes give huge errors due to convexity error.
Balancing convexity error and cancellation error requires unexpectedly large step sizes: as large as 10^-5 even when calculations are accurate to machine precision.
It's hard to judge accuracy without an accurate reference, but one can try to make do by graphing higher order derivatives with small stepsize.
Finite difference methods produce piecewise linear (or exponential) functions, which require extra care. Large step sizes are needed to produce reasonable results. We observed the need for step sizes of 17 for a 12 level binomial lattice, and 25 to 50 bp for a 12 level trinomial lattice. Hedges in practice could be way off.
Fixing this by increasing lattice density is computationally infeasible because ...