Error Analysis of
Numerical Differentiation
Dr. Harvey J. Stein
Head, Quantitative Finance R&D
Bloomberg L.P.
22 June 2005
Revision: 1.13
1. Overview
Although one could try to take this limit numerically, this is much work. More commonly, one chooses a small value of h = h0, and tries to verify that the approximation

    f'(x) ≈ f'_r(x) = (f(x + h0) - f(x)) / h0

is sufficiently close to the desired derivative.
But, once we're not taking the limit, we run into questions, such as:

What value of h0?

Why f'_r, and not

    f'_l(x) = (f(x) - f(x - h0)) / h0

or

    f'_c(x) = (f(x + h0) - f(x - h0)) / (2 h0)?
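These three estimators are easy to sketch in code. A minimal sketch (Python; the function names are mine, with math.sin standing in for f):

```python
import math

def right_diff(f, x, h):
    # f'_r(x): one sided (right) difference
    return (f(x + h) - f(x)) / h

def left_diff(f, x, h):
    # f'_l(x): one sided (left) difference
    return (f(x) - f(x - h)) / h

def central_diff(f, x, h):
    # f'_c(x): centered difference
    return (f(x + h) - f(x - h)) / (2 * h)

# Compare against the known derivative of sin at 1, cos(1) ~ 0.5403.
h0 = 1e-5
print(right_diff(math.sin, 1.0, h0))
print(central_diff(math.sin, 1.0, h0))
```

Even at this single point the centered estimate is several orders of magnitude closer to cos(1) than the one sided ones, which is the subject of the rest of these notes.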
[Figure: Normal CDF; central derivative with h = 10^-12; central derivative with h = 10^-13]
As can be seen in the above graph, h0 = 10^-13 exhibits some noise (which, I'm afraid, is barely visible on this slide).
[Figure: the normal CDF under extreme magnification, where it appears as a step function]
Doubles by the IEEE standard are 64 bits long with a 53 bit mantissa. For any given exponent, one can only represent 2^52 different positive values (one bit is the sign bit). 2^52 ≈ 10^16, so we only get to use at most 16 decimal digits to represent a mantissa. In particular, our computers think that 1 + 10^-17 = 1. This discreteness affects the input value, the output value, and all intermediate computations.
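This is easy to verify directly; a quick sketch:

```python
import sys

# Adding 1e-17 to 1 is lost entirely: the doubles adjacent to 1 are
# about 2^-52 apart, so anything under half that spacing rounds away.
print(1.0 + 1e-17 == 1.0)   # True
print(1.0 + 1e-15 == 1.0)   # False

# The 53 bit mantissa and the 2^-52 spacing at 1 are visible in float_info.
print(sys.float_info.mant_dig)   # 53
print(sys.float_info.epsilon)    # 2^-52
```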
Inspecting the above graph, we see that a stepsize of 10^-16 for x values around 1 will give either zero or a ridiculously high value for the derivative. 10^-15 will also give noisy derivatives, depending on where the actual x + h0 and x - h0 lie on the step function that constitutes the normal CDF at this level of magnification.
There's another effect that's commonly discussed: the fact that (x + h) - (x - h) ≠ 2h because of roundoff error. This means that the denominator of our approximation shouldn't really be h. There are two ways to fix this. One is to use a power of 2 for h instead of a power of 10. The other is to set h = (x + h) - x. The latter requires a little bit of effort to prevent an optimizing compiler from reducing it to h = h.
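A sketch of the h = (x + h) - x adjustment (in Python, where no optimizing compiler interferes; the function name is mine):

```python
def adjusted_step(x, h):
    # Replace h with the exactly representable difference (x + h) - x,
    # so the denominator matches the spacing actually used in the numerator.
    return (x + h) - x

x, h = 1.0, 1e-4
ha = adjusted_step(x, h)
print(ha == h)             # typically False: h isn't exactly representable relative to x
print((x + ha) - x == ha)  # True: the adjusted step survives the round trip exactly
```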
But, I've only observed minor impact from such adjustments. For clarity, I'll ...
Continuing along, we see that 10^-6 works much better than 10^-8:
[Figure: normal - delta f, for h = 10^-8, 10^-7, and 10^-6]
Finally, we see that 10^-5 gives the best results, with 10^-4 being smooth, but giving a high bias.
[Figure: normal - delta f, for h = 10^-6, 10^-5, and 10^-4]
The best h0 for f(x) = x^3 ends up again being around 10^-5. Coincidence?
[Figure: deriv(x^3) error, for h = 10^-6, 10^-5, and 10^-4]
It appears to consist of noise plus some periodic error. The noise is from cancellation error, and the periodic component is from convexity error.
6. Convexity error
    f(x + h) = f(x) + f'(x) h + (f''(x)/2!) h^2 + (f'''(x)/3!) h^3 + ...

    f(x + h) - f(x - h) = 2 f'(x) h + 2 (f'''(x)/3!) h^3 + ...

    (f(x + h) - f(x - h)) / (2h) = f'(x) + (f'''(x)/3!) h^2 + ...

So the convexity error of the centered derivative is (f'''/3!) h^2.
This is one of the reasons why a centered derivative is favored over a one sided derivative: the f''h error term drops out, so the convexity error is smaller. Note also that the one sided derivative is the same as the two sided derivative computed at half the stepsize and shifted by half the stepsize, so in some sense the error in the one sided derivative is that it's estimating the derivative at the wrong x value.
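The different error orders are easy to see numerically. A small sketch (using exp, whose derivative is known; names are mine):

```python
import math

def one_sided(f, x, h):
    return (f(x + h) - f(x)) / h

def centered(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

# Halving h roughly halves the one sided error (O(h)) but quarters
# the centered error (O(h^2)).
x = 1.0
err_os = [abs(one_sided(math.exp, x, h) - math.e) for h in (1e-3, 5e-4)]
err_c = [abs(centered(math.exp, x, h) - math.e) for h in (1e-3, 5e-4)]
print(err_os[0] / err_os[1])  # ~2
print(err_c[0] / err_c[1])    # ~4
```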
7. Cancellation error
      f(x + h) = .335 + noise
    - (f(x - h) = .231 + noise)
      f(x + h) - f(x - h) = .104 + noise

However,

      f(x + h) = .335 + noise
    - (f(x - h) = .331 + noise)
      f(x + h) - f(x - h) = .004 + noise
In the first case, the difference yielded 3 significant digits. In the second case, the high order digits cancelled, leaving only 1 significant digit in the difference. This is cancellation error.
Clearly, cancellation error increases as h decreases. For h sufficiently small, f(x + h) = f(x - h) and the relative error becomes infinite.
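A small sketch of the vanishing difference (using exp; at h = 10^-17, x + h and x - h are the same double):

```python
import math

x = 1.0
for h in (1e-5, 1e-11, 1e-17):
    diff = math.exp(x + h) - math.exp(x - h)
    print(h, diff)

# At h = 1e-17, x + h == x - h == x, so the difference is exactly zero
# and the derivative estimate is 0: infinite relative error.
```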
In general, we can use relative errors to encode the number of significant digits in our computations. Let

    f~(x) = f(x) (1 + ε(x)),

where f~(x) is what we actually get when computing f(x). Here ε(x) is a random quantity that quantifies the relative error in calculating f(x). If we're accurate to machine precision, then |ε| ≈ 2^-53. If our calculation only yields 5 decimal digits of accuracy, then |ε| ≈ 10^-5.
    f~(x + h) - f~(x - h) = f(x + h) - f(x - h)
                            + ε(x + h) f(x + h) - ε(x - h) f(x - h)
                          ≈ f(x + h) - f(x - h) + ε(x) f(x)

where ε(x) in the last line collects the noise terms.
The cancellation error in the centered difference is therefore roughly ε(x) f(x) / (2h). Setting the convexity error equal to the cancellation error,

    (f'''(x)/3!) h^2 = ε f / (2h)

implies

    f''' h^3 - 3 ε f = 0

or

    h = cbrt(3 ε f / f''').
If our calculations are exact except for roundoff error, and f and f''' are around the same order of magnitude, then ε ≈ 2^-53 (52 accurate digits, base 2), which gives an optimal h of about 7 × 10^-6. This ties out with our above empirical work, and indicates that both functions are being computed with around full accuracy.
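The optimal h and its consequences can be checked numerically; a sketch, assuming ε = 2^-53 and f of the same order as f''':

```python
import math

def optimal_h(eps=2.0**-53):
    # h = cbrt(3 eps f / f'''), with f and f''' of comparable size.
    return (3.0 * eps) ** (1.0 / 3.0)

def centered(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

h_opt = optimal_h()
print(h_opt)  # ~6.9e-6

# The near-optimal h clearly beats a much larger one on exp at x = 1.
err = lambda h: abs(centered(math.exp, 1.0, h) - math.e)
print(err(h_opt), err(1e-2))
```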
For x^2, we know the 3rd derivative is zero. This means that the only error is from cancellation, which means that larger h should automatically be better. Graphs confirm this theory:
[Figure: deriv(x^2) error, for h = 10^-2, 10^-1, and 10^0]
Note that in finance our error is often on the order of a penny on a 100 value, a relative error of about 10^-5, which requires h0 ≈ 0.03, a rather large value!
We can approximate higher derivatives similarly:

    f''(x) ≈ (f(x + h) + f(x - h) - 2 f(x)) / h0^2

and

    f'''(x) ≈ (f(x + 2h) - 2 f(x + h) + 2 f(x - h) - f(x - 2h)) / (2 h0^3)

so that the second and third derivative approximations sample f with the same spacing as the first derivative calculation.
Graphing f'' and f''' as a function of step size shows that the second derivative is visibly poor for h0 = 10^-7 and that the third derivative is visibly poor for h0 = 10^-5:
[Figure: Normal CDF; derivative with h = 10^-5; 2nd derivative with h = 10^-7 and 10^-6; 3rd derivative with h = 10^-5 and 10^-4]
This at least indicates that something between 10^-6 and 10^-4 is called for when computing f'. Can we make this more precise? Maybe, but we haven't tried.
One might consider trying to integrate the derivative to see how close it comes back to the original function. Unfortunately this doesn't help, because the sum is effectively a telescoping series. If Δx = 2h and X = {x1, x1 + Δx, x1 + 2Δx, ..., x2}, then

    Σ_{x in X} ((f(x + h) - f(x - h)) / (2h)) Δx = f(x2 + h) - f(x1 - h).
In other words, the errors in the derivative from one point to the next cancel each other. This is clear because if a given point is too high, then the derivative to the left will be too large while on the right it will be too small.
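The telescoping is exact when Δx = 2h; a quick sketch (h chosen as a power of 2 so the grid arithmetic is exact):

```python
import math

def centered(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

f = math.sin
h = 0.0625            # a power of 2: x + h and x - h are exact on this grid
dx = 2 * h
xs = [1.0 + i * dx for i in range(11)]      # x1 = 1.0, ..., x2 = 2.25
integral = sum(centered(f, x, h) * dx for x in xs)
telescoped = f(xs[-1] + h) - f(xs[0] - h)
print(integral, telescoped)  # agree to roundoff
```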
Consider pricing a call option on a binomial lattice:

    C(S0) = e^{-rT} Σ_{j=0}^{N} (N choose j) p^j q^{N-j} max(S0 u^j d^{N-j} - K, 0)
          = e^{-rT} Σ_{j=j(S0)}^{N} (N choose j) (S0 (pu)^j (qd)^{N-j} - K p^j q^{N-j})

where p = (e^{r Δt} - d)/(u - d), q = 1 - p, and j(S0) is the lowest index for which the payoff is positive. Then

    dC/dS0 = e^{-rT} Σ_{j=j(S0)}^{N} (N choose j) (pu)^j (qd)^{N-j}.

The derivative is a step function, only changing value when j(S0) changes.
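To make the step-function delta concrete, here is a sketch of a CRR binomial call and its small-h centered difference (function and parameter names are mine; the 12 step, 30% vol, 3% rate setup matches the slides):

```python
import math

def crr_call(s0, k=100.0, t=1.0, vol=0.3, r=0.03, n=12):
    """European call priced on an n-step Cox-Ross-Rubinstein lattice."""
    dt = t / n
    u = math.exp(vol * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(r * dt) - d) / (u - d)
    q = 1.0 - p
    disc = math.exp(-r * t)
    return disc * sum(
        math.comb(n, j) * p**j * q**(n - j)
        * max(s0 * u**j * d**(n - j) - k, 0.0)
        for j in range(n + 1)
    )

def delta(s0, h):
    return (crr_call(s0 + h) - crr_call(s0 - h)) / (2 * h)

# With a tiny h the finite-difference delta is constant across wide
# ranges of s0: the lattice price is piecewise linear in s0, so dC/dS0
# is a step function.
print(delta(105.0, 1e-6), delta(110.0, 1e-6))
```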
This is rarely noticed when graphing the function, just like the errors in the derivative calculations weren't noticed in our initial graphs:
[Figure: Option value, 1 yr opt, 30% vol, 3% risk free rate, 12 step lattice]
A 12 step lattice gives us large piecewise linear sections. A 120 step lattice, while
increasing computation by a factor of 100, only decreases the sizes of the steps
by about a factor of 3.
10. Smoothing
When the calculation is a black box, we can't get inside to use the internals in the calculation. In this case, how can one compute a good derivative?
One trick is to use a large h. We still suffer convexity error, but it's swamped by the error from the piecewise linearity of the function. Picking h around 1 to 2x the support of the linear segments will do it.
11. H adjustment
Here we can see that with a 12 step lattice, we need to compute the derivative with h0 ≈ 17.
[Figure: dC/dS, 1 yr opt, 30% vol, 3% risk free rate, 12 step lattice; h = .01, 1, 10, 17]
The second derivative is also helped by using a larger stepsize, but still isn't especially good:
[Figure: 2nd deriv of 1 yr opt, 30% vol, 3% risk free rate, 12 step lattice; h = 1, 10, 17]
More commonly, people would use 120 levels for a 1 year stock option, but even
this requires a large value of h0 :
[Figure: dC/dS of 1 yr opt, 30% vol, 3% risk free rate, 120 step lattice; h = .01, 1, 5]
Second derivative:
[Figure: 2nd deriv of 1 yr opt, 30% vol, 3% risk free rate, 120 step lattice; h = 5]
For fun, let's take a look at what happens with 1200 levels, which is over 3 levels/day:
[Figure: dC/dS, 30% vol, 3% risk free rate, 1200 step lattice; h = .01, 2]
As you can see, we still have fairly large piecewise linear sections. We need to
make h0 around 2 to get reasonable derivative estimates.
Second derivative:
[Figure: 2nd deriv, 30% vol, 3% risk free rate, 1200 step lattice; h = 2]
Why did an h0 of 17 for 12 levels, 5 for 120 levels and 2 for 1200 levels work reasonably well? As mentioned before, the stepsize needed is roughly the lattice spacing. This is approximately 2 S0 σ sqrt(T/N), which is 17 for 12 steps/year, 5.5 for 120 steps/year, and 1.7 for 1200 steps/year.
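A sketch of the spacing estimate (the function name is mine):

```python
import math

def lattice_spacing(s0, vol, t, n):
    # Node spacing near s0 is roughly s0*(u - d) ~ 2*s0*vol*sqrt(t/n)
    # for a CRR lattice with n steps over t years.
    return 2.0 * s0 * vol * math.sqrt(t / n)

for n in (12, 120, 1200):
    print(n, lattice_spacing(100.0, 0.3, 1.0, n))
# roughly 17, 5.5, and 1.7, matching the step sizes found above
```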
Even for a dense lattice of 1200 levels, a much larger stepsize is required than is commonly recognized.
In fact, it's common to use monthly steps in a binomial lattice for long dated bonds, and a bump size of 10bp for modified duration and key rate duration with a one sided derivative.
Let's take a look at the behavior of this. We'll use a trinomial lattice, which gives better results than a binomial lattice.
First, consider the error in computing the change in a callable bond as a function
of the step size using a centered derivative.
[Figure: Centered derivatives; dC/dS at 10bp, 25bp, and 50bp shifts, vs curve shift]
A 25bp shift is bumpy, but looks fairly close to what it should be.
[Figure: dC/dS 25bp centered vs 10bp, 25bp, and 50bp one sided, vs curve shift]
The stepping in the one sided derivative for a given h0 is the same as that for
the centered derivative at h0 /2, but the convexity error is much worse.
Next, consider the key rate sensitivities. The bond isn't sensitive to the 3mo rate, so the 1st key rate sensitivity should be zero.
[Figure: key rate duration, k1, call sensitivity vs level; 10bp, 25bp, 50bp, one sided and centered]
It ends up being close to zero, but noisy, and pretty similar for all step sizes
and seemingly unaffected by whether we use a centered derivative or a one sided
derivative.
The second key rate sensitivities suffer from the piecewise nature of the calculation, both the centered ones as well as the one sided ones.
[Figure: key rate duration, k2, call sensitivity vs level; 10bp, 25bp, 50bp, one sided and centered]
Comparing the sum of the one sided key rates to the 25bp centered derivatives, we
see that the sum suffers both from the piecewise nature as well as the convexity,
and suffers worse than the one sided full sensitivity.
[Figure: Sum of key rates, one sided (10bp, 25bp, 50bp) vs dC/dS 25bp, vs curve shift]
The centered key rates are better, with the 25bp step size landing fairly close to
the 25bp derivative.
[Figure: Sum of key rates, centered (10bp, 25bp) vs dC/dS 25bp, vs curve shift]
Comparing the 50bp centered key rates to the 50bp difference derivative, we
see that the two are close, but are significantly different. This is because the
key rates interact with the piecewise nature differently than the full curve shift.
[Figure: Sum of key rates, centered, 50bp vs dC/dS 50bp, vs curve shift]
12. Filtering
A more sophisticated approach is to smooth our pricing function. Essentially, we'd like to filter out the high frequencies that come from the corners where the slope changes, leaving only the lower frequency data arising from the changing function values.
This amounts to computing the Fourier transform of the price function, multiplying by a function that decays to zero (to dampen out the high frequency noise), and transforming back, or

    Smooth f = F^{-1}(F(f) D)

where D is our damping function (or smoothing kernel), and F is the Fourier transform.
Continuing,

    Smooth f = F^{-1}(F(f) D) = F^{-1}(F(f * F^{-1}(D))) = f * F^{-1}(D),

where * denotes convolution. So, smoothing a function is the same as computing its convolution with the inverse transform of the smoothing kernel. Since (f * g)' = f' * g = f * g', smoothing the derivative can be done by convolving with the derivative of the inverse transform of the smoothing kernel. Finally, since the Fourier transform of a Gaussian PDF is a Gaussian (up to scaling), we can smooth by integrating against a Gaussian and its derivatives.
All that's left is to integrate a function times a Gaussian, which is best done by Gaussian quadrature.
Writing the smoothed function as

    (1/(σ sqrt(2π))) ∫ f(x0 - x) e^{-x^2/(2σ^2)} dx,

its derivative is

    -(1/(σ^3 sqrt(2π))) ∫ f(x0 - x) x e^{-x^2/(2σ^2)} dx
      = -(sqrt(2)/(σ sqrt(π))) ∫ f(x0 - sqrt(2) σ x) x e^{-x^2} dx
      ≈ -(sqrt(2)/(σ sqrt(π))) Σ_i w_i x_i f(x0 - sqrt(2) σ x_i)

where x_i are the Gaussian quadrature points and w_i are the associated weights.
The theory sounds beautiful, and looks like exactly what we need, but it doesn't live up to its promise in practice. Although I've used this method in the past, and it has applications in signal processing, I've been unable to make it perform better than a two point difference derivative. It seems that it works better in the random noise case than on piecewise linear functions.
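As a sketch of the quadrature version (5-point Gauss-Hermite nodes hardcoded; the rule is exact for polynomials, so x^2 is a convenient check; all names here are mine):

```python
import math

# 5 point Gauss-Hermite rule: integral of g(t) exp(-t^2) dt ~ sum w_i g(t_i),
# exact for polynomials up to degree 9.
GH5 = [(-2.0201828704560856, 0.019953242059045913),
       (-0.9585724646138185, 0.3936193231522412),
       (0.0, 0.9453087204829419),
       (0.9585724646138185, 0.3936193231522412),
       (2.0201828704560856, 0.019953242059045913)]

def smoothed_deriv(f, x0, sigma):
    # Convolution of f with the derivative of a Gaussian of width sigma:
    # -(sqrt(2)/(sigma sqrt(pi))) * integral f(x0 - sqrt(2) sigma t) t e^{-t^2} dt
    c = -math.sqrt(2.0) / (sigma * math.sqrt(math.pi))
    return c * sum(w * t * f(x0 - math.sqrt(2.0) * sigma * t) for t, w in GH5)

# Smoothing x^2 with a Gaussian gives x^2 + sigma^2, whose derivative is 2x.
print(smoothed_deriv(lambda x: x * x, 3.0, 1.0))  # ~6
```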
[Figure: first derivative, FFT smoothing vs difference derivative, vs stock price]
The 5 point FFT method yields similar results to the 2 point difference derivative.
Both look good by inspection.
But of course, the best way to check is to compare to a good reference. In this case, we'll compare the error relative to a difference derivative computed on the formula using a step size of 10^-5.
[Figure: FFT vs centered difference, 1st derivative errors (centered err; FFT err, sig=3; FFT err, sig=2), vs stock price]
The FFT method is hard pressed to do better than a well chosen step size.
It's hard to make either method produce both a smooth and accurate second derivative:
[Figure: FFT vs centered difference, 2nd derivative (formula h=10^-4; centered h=15; FFT h=6), vs stock price]
Both look equally poor, with the FFT method requiring twice the computational
effort.
Another trick is to take a complex step. Expanding,

    f(x + ih) = f(x) + f'(x) ih - (f''(x)/2) h^2 - (f'''(x)/3!) ih^3 + (f''''(x)/4!) h^4 + ...

so

    Im(f(x + ih)) / h = f'(x) - (f'''(x)/3!) h^2 + ...
3!
This has the same convexity error as the centered derivative, but doesnt directly
suffer from cancellation error, allowing one to reduce h to lower convexity error
without increasing cancellation error.
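A sketch of the complex step (using cmath.sin as a complex-analytic stand-in for f):

```python
import cmath
import math

def complex_step(f, x, h=1e-20):
    # Im(f(x + ih))/h: no subtraction, hence no cancellation error,
    # so h can be made absurdly small.
    return f(complex(x, h)).imag / h

print(complex_step(cmath.sin, 1.0))   # ~cos(1)

# A centered difference at the same h returns exactly 0, since
# 1 + 1e-20 and 1 - 1e-20 are both just 1.0 in double precision.
print((math.sin(1.0 + 1e-20) - math.sin(1.0 - 1e-20)) / 2e-20)
```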
While this approach can be useful in analytic methods, difficulties are encountered when trying to apply it in finance. It doesn't correct for correlated errors when the function is piecewise linear; it just does a very good job of returning the slope of the linear sections, yielding a step function for the derivative. It's also not as straightforward as it looks. One can't just change all references ...
Applying this to the lattice price

    C(S0) = e^{-rT} Σ_{j=j(S0)}^{N} (N choose j) (S0 (pu)^j (qd)^{N-j} - K p^j q^{N-j}),

we see that

    Im(C(S0 + ih)) / h = e^{-rT} Σ_{j=j(S0)}^{N} (N choose j) (pu)^j (qd)^{N-j}.

Up to roundoff error, the complex method gives the same results as the centered difference.
Unfortunately, the approach can't always be applied. In interest rate lattices, the values at the other nodes don't always correspond to a shift of the yield curve. In normal short rate models they do, but in log normal models they don't. In the latter case, to apply this approach, one would have to either adjust the derivative or settle for differentiating with respect to a different sort of curve move.
Nonetheless, where this method applies, it works quite well. We'll compare it to the best of fixed h0 selection. Consider Black-Scholes again with a 1 year option, 30% vol, 3% risk free rate, computed using a 120 step binomial lattice. Again, the first derivatives are visually fine:
[Figure: Centered difference vs lattice shift, 1st derivative (formula h=10^-5; centered h=5; lattice shift), vs stock price]
But the differences to a reference show that the shifted lattice approach is far
smoother and more accurate:
[Figure: Centered difference vs lattice shift, 1st derivative errors (centered err; lattice shift), vs stock price]
The results on the second derivative are more pronounced. The fixed h selection
is visibly poor, while the shifted lattice still looks quite good:
[Figure: Centered difference vs lattice shift, 2nd derivative (formula h=10^-4; centered h=15; lattice shift), vs stock price]
Looking at the errors for the lattice shift by itself, we can see the errors in the second derivative calculation are around 10^-4, which is about a 1.5% error in the second derivative.
[Figure: Lattice shift 2nd derivative error, vs stock price]
Surprisingly, this method yields reasonable results even with a monthly lattice. Here's the first derivative:
[Figure: Centered difference vs lattice shift, 1st derivative (formula h=10^-5; lattice shift), monthly lattice, vs stock price]
The shifted lattice performs well because it samples the price function at exactly the right points. At option expiration, on the up tree there's exactly one more node in the money and one fewer out of the money, and the rest get exactly the same value.
When the call option price is

    C(S0) = e^{-rT} Σ_{j=j(S0)}^{N} (N choose j) (S0 (pu)^j (qd)^{N-j} - K p^j q^{N-j}),

the up and down shifted prices are

    C(S0 u) = e^{-rT} Σ_{j=j(S0)-1}^{N} (N choose j) (S0 u (pu)^j (qd)^{N-j} - K p^j q^{N-j})

    C(S0 d) = e^{-rT} Σ_{j=j(S0)+1}^{N} (N choose j) (S0 d (pu)^j (qd)^{N-j} - K p^j q^{N-j})
Taking the difference and dividing by S0 u - S0 d gives

    (C(S0 u) - C(S0 d)) / (S0 u - S0 d)
      = e^{-rT} [ Σ_{j=j(S0)+1}^{N} (N choose j) (pu)^j (qd)^{N-j}
          + Σ_{j=j(S0)-1}^{j(S0)} (N choose j) (u (pu)^j (qd)^{N-j} - (K/S0) p^j q^{N-j}) / (u - d) ]

It's exactly the centered difference derivative for a small shift, plus a correction term that's a linear function of 1/S0. It's the correction term varying as a function of S0 while j(S0) remains fixed that makes up the appropriate correction for the derivative calculation.
It must be noted that despite the above graphs, optimizing h0 and the shifted lattice method are actually the same numerically. Picking h0 gives poorer results in the above tests predominantly because we're not picking a different h0 for each underlying. Fixing it for the entire computation is why it's not behaving nearly as well.
Setting h = S0 u - S0 would make the difference derivative quite close to the shifted lattice value. Using separate up and down shifts instead of doing a centered derivative would make them identical. But this is the same value at almost three times the computational effort.
For Monte Carlo pricing, one can similarly often find a pathwise derivative such that

    (d/dS0) E[X(S0)] = E[(d/dS0) X(S0)],

which again allows computing the derivative directly by Monte Carlo instead of taking the difference of two Monte Carlo price calculations.
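For instance, for a European call under Black-Scholes Monte Carlo, the pathwise derivative is dS_T/dS0 = S_T/S0 on in-the-money paths. A sketch (the parameters and names are mine; the reference Black-Scholes delta N(d1) for this setup is about 0.599):

```python
import math
import random

def pathwise_delta(s0, k, t, vol, r, n_paths, seed=12345):
    # d/dS0 E[e^{-rT} max(S_T - K, 0)] = E[e^{-rT} 1{S_T > K} S_T / S0],
    # differentiating under the expectation (valid a.e. for the call payoff).
    rng = random.Random(seed)
    disc = math.exp(-r * t)
    drift = (r - 0.5 * vol * vol) * t
    volt = vol * math.sqrt(t)
    acc = 0.0
    for _ in range(n_paths):
        st = s0 * math.exp(drift + volt * rng.gauss(0.0, 1.0))
        if st > k:
            acc += st / s0
    return disc * acc / n_paths

# Black-Scholes delta for these parameters is N(0.25) ~ 0.5987.
print(pathwise_delta(100.0, 100.0, 1.0, 0.3, 0.03, 200_000))
```

One derivative estimate per simulation, with no bump size to tune and no cancellation between two noisy prices.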
Differentiation under the integral can also be used when valuing options via FFT. The derivative can be computed by computing the FFT of the derivative of the characteristic function.
18. Summary
Approximating the derivative by a difference magnifies the error of the original function.
Small step sizes give huge errors due to cancellation error.
Large step sizes give huge errors due to convexity error.
Balancing convexity error and cancellation error requires unexpectedly large step sizes: as large as 10^-5 even when calculations are accurate to machine precision.
It's hard to judge accuracy without an accurate reference, but one can try to make do by graphing higher order derivatives with small stepsize.
Finite difference methods produce piecewise linear (or exponential) functions, which require extra care. Large step sizes are needed to produce reasonable results. We observed the need for step sizes of 17 for a 12 level binomial lattice, and 25 to 50 bp for a 12 level trinomial lattice. Hedges in practice could be way off.
Fixing this by increasing lattice density is computationally infeasible because ...