
Introduction to Nonlinear Regression

Andreas Ruckstuhl
IDP Institut für Datenanalyse und Prozessdesign
ZHAW Zürcher Hochschule für Angewandte Wissenschaften
October 2010

Contents
1. The Nonlinear Regression Model 1
2. Methodology for Parameter Estimation 5
3. Approximate Tests and Confidence Intervals 8
4. More Precise Tests and Confidence Intervals 13
5. Profile t-Plot and Profile Traces 15
6. Parameter Transformations 17
7. Forecasts and Calibration 23
8. Closing Comments 27
A. Gauss-Newton Method 28

The author thanks Werner Stahel for his valuable comments.

E-Mail Address: Andreas.Ruckstuhl@zhaw.ch; Internet: http://www.idp.zhaw.ch


Goals
The nonlinear regression model block in the Weiterbildungslehrgang (WBL) in angewandter Statistik at the ETH Zurich should
1. introduce problems that are relevant to the fitting of nonlinear regression functions,
2. present graphical representations for assessing the quality of approximate confidence intervals, and
3. introduce some parts of the statistics software R that can help with solving concrete problems.
1. The Nonlinear Regression Model
a The Regression Model. Regression studies the relationship between a variable of interest Y and one or more explanatory or predictor variables x^(j). The general model is

    Y_i = h(x_i^(1), x_i^(2), ..., x_i^(m); θ_1, θ_2, ..., θ_p) + E_i .

Here, h is an appropriate function that depends on the explanatory variables and the parameters, which we summarize in the vectors x_i = [x_i^(1), x_i^(2), ..., x_i^(m)]^T and θ = [θ_1, θ_2, ..., θ_p]^T. The unstructured deviations from the function h are described by the random errors E_i. The normal distribution is assumed for the distribution of these random errors, so

    E_i ~ N(0, σ²), independent.
b The Linear Regression Model. In (multiple) linear regression, functions h are considered that are linear in the parameters θ_j,

    h(x_i^(1), x_i^(2), ..., x_i^(m); θ_1, θ_2, ..., θ_p) = θ_1 x̃_i^(1) + θ_2 x̃_i^(2) + ... + θ_p x̃_i^(p) ,

where the x̃^(j) can be arbitrary functions of the original explanatory variables x^(j). (Here the parameters are usually denoted as β_j instead of θ_j.)
c The Nonlinear Regression Model. In nonlinear regression, functions h are considered that cannot be written as linear in the parameters. Often such a function is derived from theory. In principle, there are unlimited possibilities for describing the deterministic part of the model. As we will see, this flexibility often means a greater effort to make statistical statements.
Example d Puromycin. The speed of an enzymatic reaction depends on the concentration of a substrate. According to the information from Bates and Watts (1988), it was examined how a treatment of the enzyme with an additional substance called Puromycin influences this reaction speed. The initial speed of the reaction is chosen as the variable of interest; it is measured via radioactivity. (The unit of the variable of interest is counts/min²; the number of registrations on a Geiger counter per time period measures the quantity of the substance present, and the reaction speed is proportional to the change per time unit.)
[Figure 1.d: Puromycin Example. (a) Data (treated and untreated enzyme) and (b) typical course of the regression function. Axes: Concentration vs. Velocity.]
The relationship of the variable of interest with the substrate concentration x (in ppm) is described via the Michaelis-Menten function

    h(x; θ) = θ_1 x / (θ_2 + x) .

An infinitely large substrate concentration (x → ∞) results in the asymptotic speed θ_1. It has been suggested that this quantity is influenced by the addition of Puromycin. The experiment is therefore carried out once with the enzyme treated with Puromycin and once with the untreated enzyme. Figure 1.d shows the result. In this section the data of the treated enzyme is used.
Example e Oxygen Consumption. To determine the biochemical oxygen consumption, river water samples were enriched with dissolved organic nutrients, with inorganic materials, and with dissolved oxygen, and were bottled in different bottles (Marske, 1967; see Bates and Watts (1988)). Each bottle was then inoculated with a mixed culture of microorganisms and sealed in a climate chamber with constant temperature. The bottles were periodically opened and their dissolved oxygen content was analyzed. From this the biochemical oxygen consumption [mg/l] was calculated. The model used to connect the cumulative biochemical oxygen consumption Y with the incubation time x is based on exponentially decaying growth, which leads to

    h(x; θ) = θ_1 (1 − e^(−θ_2 x)) .

Figure 1.e shows the data and the regression function to be applied.
Example f From Membrane Separation Technology (Rapold-Nydegger (1994)). The ratio of protonated to deprotonated carboxyl groups in the pores of cellulose membranes depends on the pH value x of the outer solution. The protonation of the carboxyl carbon atoms can be captured with 13C-NMR. We assume that the relationship can be written with the extended Henderson-Hasselbalch equation for polyelectrolytes

    log_10( (θ_1 − y) / (y − θ_2) ) = θ_3 + θ_4 x ,
[Figure 1.e: Oxygen consumption example. (a) Data and (b) typical shape of the regression function. Axes: Days vs. Oxygen Demand.]
[Figure 1.f: Membrane Separation Technology. (a) Data and (b) a typical shape of the regression function. Axes: x (= pH) vs. y (= chemical shift).]
where the unknown parameters are θ_1, θ_2 and θ_3 > 0 and θ_4 < 0. Solving for y leads to the model

    Y_i = h(x_i; θ) + E_i = (θ_1 + θ_2 · 10^(θ_3 + θ_4 x_i)) / (1 + 10^(θ_3 + θ_4 x_i)) + E_i .

The regression function h(x_i; θ) for a reasonably chosen θ is shown in Figure 1.f next to the data.
g A Few Further Examples of Nonlinear Regression Functions:
Hill model (enzyme kinetics): h(x_i; θ) = θ_1 x_i^(θ_3) / (θ_2 + x_i^(θ_3)).
For θ_3 = 1 this is also known as the Michaelis-Menten model (1.d).
Mitscherlich function (growth analysis): h(x_i; θ) = θ_1 + θ_2 exp(θ_3 x_i).
From kinetics (chemistry) we get the function

    h(x_i^(1), x_i^(2); θ) = exp(−θ_1 x_i^(1) exp(−θ_2 / x_i^(2))) .
Cobb-Douglas production function:

    h(x_i^(1), x_i^(2); θ) = θ_1 (x_i^(1))^(θ_2) (x_i^(2))^(θ_3) .

Since useful regression functions are often derived from the theory of the application area in question, a general overview of nonlinear regression functions is of limited benefit. A compilation of functions from publications can be found in Appendix 7 of Bates and Watts (1988).
h Linearizable Regression Functions. Some nonlinear regression functions can be linearized through transformation of the variable of interest and the explanatory variables.
For example, a power function

    h(x; θ) = θ_1 x^(θ_2)

can be transformed into a function that is linear in the parameters:

    ln(h(x; θ)) = ln(θ_1) + θ_2 ln(x) = β_0 + β_1 x̃ ,

where β_0 = ln(θ_1), β_1 = θ_2 and x̃ = ln(x). We call the regression function h linearizable if we can transform it into a function that is linear in the (unknown) parameters via transformations of the arguments and a monotone transformation of the result.
Here are some more linearizable functions (see also Daniel and Wood, 1980):

    h(x; θ) = 1/(θ_1 + θ_2 exp(−x))             ↔  1/h(x; θ) = θ_1 + θ_2 exp(−x)
    h(x; θ) = θ_1 x/(θ_2 + x)                   ↔  1/h(x; θ) = 1/θ_1 + (θ_2/θ_1)·(1/x)
    h(x; θ) = θ_1 x^(θ_2)                       ↔  ln(h(x; θ)) = ln(θ_1) + θ_2 ln(x)
    h(x; θ) = θ_1 exp(θ_2 g(x))                 ↔  ln(h(x; θ)) = ln(θ_1) + θ_2 g(x)
    h(x; θ) = exp(−θ_1 x^(1) exp(−θ_2/x^(2)))   ↔  ln(−ln(h(x; θ))) = ln(θ_1) + ln(x^(1)) − θ_2/x^(2)
    h(x; θ) = θ_1 (x^(1))^(θ_2) (x^(2))^(θ_3)   ↔  ln(h(x; θ)) = ln(θ_1) + θ_2 ln(x^(1)) + θ_3 ln(x^(2))

The last one is the Cobb-Douglas model from 1.g.
i The Statistically Complete Model. A linear regression with the linearized regression function in the example just referred to is based on the model

    ln(Y_i) = β_0 + β_1 x̃_i + E_i ,

where the random errors E_i all have the same normal distribution. If we back-transform this model, we get

    Y_i = θ_1 x^(θ_2) · Ẽ_i

with Ẽ_i = exp(E_i). The errors Ẽ_i, i = 1, ..., n, now contribute multiplicatively and are lognormally distributed! The assumptions about the random deviations are thus drastically different than for a model that is based directly on h,

    Y_i = θ_1 x^(θ_2) + E*_i

with random deviations E*_i that, as usual, contribute additively and have a specific normal distribution.
A linearization of the regression function is therefore advisable only if the assumptions about the random deviations can be better satisfied in this way: in our example, if the errors actually act multiplicatively rather than additively and are lognormally rather than normally distributed. These assumptions must be checked with residual analysis.
j * Note: In linear regression it has been shown that the variance can be stabilized with certain transformations (e.g. log, square root). If this is not possible, in certain circumstances one can also perform a weighted linear regression. The process is analogous in nonlinear regression.
k The introductory examples so far: We have spoken almost exclusively of regression functions that depend on only one original variable. This was primarily because it was possible to fully illustrate the model graphically. The ensuing theory also works well for regression functions h(x; θ) that depend on several explanatory variables x = [x^(1), x^(2), ..., x^(m)].
2. Methodology for Parameter Estimation
a The Principle of Least Squares. To get estimates for the parameters θ = [θ_1, θ_2, ..., θ_p]^T, one applies, as in linear regression, the principle of least squares. The sum of the squared deviations

    S(θ) := Σ_{i=1}^{n} ( y_i − η_i(θ) )²    with    η_i(θ) := h(x_i; θ)

should thus be minimized. The notation in which h(x_i; θ) is replaced by η_i(θ) is reasonable because [x_i, y_i] is given by the measurement or observation of the data and only the parameters θ remain to be determined.
Unfortunately, the minimum of the sum of squares and thus the estimate cannot be given explicitly as in linear regression; iterative numerical procedures are needed. The basic ideas behind the most common algorithm are sketched out below. They also form the basis for the easiest way to derive tests and confidence intervals.
b Geometric Illustration. The observed values Y = [Y_1, Y_2, ..., Y_n]^T determine a point in n-dimensional space. The same holds for the model values η(θ) = [η_1(θ), η_2(θ), ..., η_n(θ)]^T for given θ.
Take note! The usual geometric representation of data, which is standard in, for example, multivariate statistics, considers the observations, given by m variables x^(j), j = 1, 2, ..., m, as points in m-dimensional space. Here, though, we consider the Y- and η-values of all n observations as points in n-dimensional space.
Unfortunately our intuition stops at three dimensions, and thus with three observations. So, we try it for a situation limited in this way, first for simple linear regression.
As stated, the observed values Y = [Y_1, Y_2, Y_3]^T determine a point in 3-dimensional space. For given parameters β_0 = 5 and β_1 = 1 we can calculate the model values η_i(β) = β_0 + β_1 x_i and represent the corresponding vector η(β) = β_0 1 + β_1 x as a point. We now ask where all the points lie that can be reached by varying the parameters. These are the possible linear combinations of the two vectors 1 and x and thus form the plane spanned by 1 and x. In estimating the parameters according to the principle of least squares, geometrically speaking, the squared distance between Y and η(β) is minimized. So, we seek the point on the plane that has the smallest distance to Y. This is also called the projection of Y onto the plane. The parameter values that correspond to this projected point η̂ are then the estimated parameter values β̂ = [β̂_0, β̂_1]^T.
Now a nonlinear function, e.g. h(x; θ) = θ_1 exp(1 − θ_2 x), is to be fitted to the same three observations. We can again ask ourselves where all the points η(θ) lie that can be reached through variation of the parameters θ_1 and θ_2. They lie on a two-dimensional curved surface (called the model surface in the following) in three-dimensional space. The estimation problem again consists of finding the point η̂ on the model surface that lies nearest to Y. The parameter values that correspond to this point are then the estimated parameter values θ̂ = [θ̂_1, θ̂_2]^T.
c Solution Approach for the Minimization Problem. The main idea of the usual algorithm for minimizing the sum of squared deviations (see 2.a) is as follows: if a preliminary best value θ^(ℓ) exists, we approximate the model surface by the plane that touches the surface at the point η(θ^(ℓ)) = h(x; θ^(ℓ)). Now we seek the point on this plane that lies closest to Y. This amounts to the estimation in a linear regression problem. The new point lies on the plane, but not on the surface corresponding to the nonlinear problem. However, it determines a parameter vector θ^(ℓ+1), and with this we go into the next round of the iteration.
d Linear Approximation. To determine the approximating plane, we need the partial derivatives

    A_i^(j)(θ) := ∂η_i(θ) / ∂θ_j ,

which we can collect in an n × p matrix A(θ). The approximation of the model surface η(θ) by the tangent plane at a parameter value θ* is

    η_i(θ) ≈ η_i(θ*) + A_i^(1)(θ*)·(θ_1 − θ*_1) + ... + A_i^(p)(θ*)·(θ_p − θ*_p)

or, in matrix notation,

    η(θ) ≈ η(θ*) + A(θ*)·(θ − θ*) .

If we now add back in the random error, we get a linear regression model

    Ỹ = A(θ*)·β + E

with the "preliminary residuals" Ỹ_i = Y_i − η_i(θ*) as variable of interest, the columns of A(θ*) as regressors, and the coefficients β_j = θ_j − θ*_j (a model without intercept β_0).
e Gauss-Newton Algorithm. The Gauss-Newton algorithm starts with an initial value θ^(0) for θ, solves the just-introduced linear regression problem for θ* = θ^(0) to find a correction β, and from this obtains an improved value θ^(1) = θ^(0) + β. For this improved value, the approximated model is calculated again; that is, the preliminary residuals Y − η(θ^(1)) and the partial derivatives A(θ^(1)) are determined, and this gives us θ^(2). This iteration step is repeated until the correction β becomes negligible. (Further details can be found in Appendix A.)
It cannot be guaranteed that this procedure actually finds the minimum of the sum of squares. The chances are better the more closely the p-dimensional model surface can be locally approximated by a p-dimensional plane at the minimum θ̂ = (θ̂_1, ..., θ̂_p)^T and the closer the start value θ^(0) is to the solution being sought.
* Algorithms conveniently determine the derivative matrix A numerically. In more complex problems the numerical approximation can be insufficient and cause convergence problems. It is then advantageous if expressions for the partial derivatives can be derived analytically. With these the derivative matrix can be determined reliably and the procedure is more likely to converge (see also Chapter 6).
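The following is a minimal R sketch of one such iteration step for the Michaelis-Menten model, again using the built-in Puromycin data as a stand-in; the function gn_step and the chosen start value are illustrative assumptions, not part of the original text.

## One Gauss-Newton step for h(x; theta) = theta1 * x / (theta2 + x),
## illustrated with the treated-enzyme part of R's built-in Puromycin data.
d <- subset(Puromycin, state == "treated")
x <- d$conc; y <- d$rate

gn_step <- function(theta) {
  eta  <- theta[1] * x / (theta[2] + x)            # eta_i(theta)
  A    <- cbind(x / (theta[2] + x),                # d eta_i / d theta1
                -theta[1] * x / (theta[2] + x)^2)  # d eta_i / d theta2
  beta <- qr.solve(A, y - eta)                     # linear least squares correction
  theta + beta                                     # improved parameter value
}

theta <- c(205, 0.08)                 # start value theta^(0)
for (k in 1:5) theta <- gn_step(theta)
theta                                 # approaches the least squares estimate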
f Initial Values. An iterative procedure requires a starting value in order to be applicable at all. Good starting values help the iterative procedure to find a solution more quickly and more reliably. Some possibilities for obtaining them more or less easily are briefly presented here.
g Initial Value from Prior Knowledge. As already noted in the introduction, nonlinear models are often based on theoretical considerations from the application area in question. Existing prior knowledge from similar experiments can be used to obtain an initial value. To be sure that the chosen start value fits, it is advisable to plot the regression function h(x; θ) for various possible starting values θ = θ^(0) together with the data (e.g., as in Figure 2.h, right).
h Start Values via Linearizable Regression Functions. Often, because of the distribution of the errors, one has to stay with the nonlinear form of a model with a linearizable regression function. The linearized model can nevertheless deliver starting values.
In the Puromycin example the regression function is linearizable: the reciprocal values of the two variables fulfill

    ỹ = 1/y ≈ 1/h(x; θ) = 1/θ_1 + (θ_2/θ_1)·(1/x) = β_0 + β_1 x̃ .

The least squares solution for this modified problem is β̂ = [β̂_0, β̂_1]^T = (0.00511, 0.000247)^T (Figure 2.h (a)). This gives the initial values

    θ_1^(0) = 1/β̂_0 = 196 ,    θ_2^(0) = β̂_1/β̂_0 = 0.048 .
[Figure 2.h: Puromycin Example. Left: regression line in the linearized problem (1/velocity vs. 1/concentration). Right: regression function h(x; θ) for the initial values θ = θ^(0) and for the least squares estimate θ = θ̂ (velocity vs. concentration).]
i Initial Values via Geometric Meaning of the Parameters. It is often helpful to consider the geometrical features of the regression function.
In the Puromycin example we can thus arrive at an initial value in another, instructive way: θ_1 is the y-value for x = ∞. Since the regression function is monotonically increasing, we can use the maximal y_i-value or a visually determined asymptotic value θ_1^(0) = 207 as initial value for θ_1. The parameter θ_2 is the x-value at which y reaches half of the asymptotic value θ_1. This gives θ_2^(0) = 0.06.
The initial values thus result from the geometrical meaning of the parameters and a coarse determination of the corresponding aspects of a curve fitted by eye.
Example j Membrane Separation Technology. In the membrane separation example we let x → ∞, so h(x; θ) → θ_1 (since θ_4 < 0); for x → −∞, h(x; θ) → θ_2. Figure 1.f(a) together with the data shows θ_1 ≈ 163.7 and θ_2 ≈ 159.5. Once we know θ_1 and θ_2, we can linearize the regression function through

    ỹ := log_10( (θ_1^(0) − y) / (y − θ_2^(0)) ) = θ_3 + θ_4 x .

We speak of a conditionally linearizable function. The linear regression leads to the initial values θ_3^(0) = 1.83 and θ_4^(0) = −0.36.
With these initial values the algorithm converges to the solution θ̂_1 = 163.7, θ̂_2 = 159.8, θ̂_3 = 2.675 and θ̂_4 = −0.512. The functions h(·; θ^(0)) and h(·; θ̂) are shown in Figure 2.j(b).
* The property of conditional linearity of a function can also be useful for developing an algorithm
specially suited for this situation (see e.g. Bates and Watts, 1988).
3. Approximate Tests and Confidence Intervals
a The estimator θ̂ gives the value of θ that fits the data optimally. We now ask which parameter values θ are compatible with the observations. The confidence region is the set of all these values. For an individual parameter θ_j the confidence region is the confidence interval.
[Figure 2.j: Membrane Separation Technology Example. (a) Regression line used for determining the initial values for θ_3 and θ_4 (ỹ vs. x (= pH)). (b) Regression function h(x; θ) for the initial value θ = θ^(0) and for the least squares estimate θ = θ̂ (y (= shift) vs. x (= pH)).]
The results that now follow are based on the fact that the estimator θ̂ is asymptotically multivariate normally distributed. For an individual parameter this leads to a z-test and the corresponding confidence interval; for several parameters the corresponding chi-square test applies and gives elliptical confidence regions.
b The asymptotic properties of the estimator can be derived from the linear approximation. The problem of nonlinear regression is indeed approximately equal to the linear regression problem mentioned in 2.d,

    Ỹ = A(θ*)·β + E ,

if the parameter vector θ* used for the linearization lies near the solution. If the estimation procedure has converged (i.e. θ* = θ̂), then β̂ = 0; otherwise this would not be the solution. The standard errors of the coefficients β̂ and, more generally, the covariance matrix of β̂ then correspond approximately to the corresponding values for θ̂.
* A bit more precisely: The standard errors characterize the uncertainties that are generated by the random fluctuations in the data. The available data have led to the estimate θ̂. If the data were somewhat different, then θ̂ would still be approximately correct, and we accept that it is good enough for the linearization. The estimate of β for the new data set would then lie about as far from the estimate for the available data as corresponds to the distribution of the parameters in the linearized problem.
c Asymptotic Distribution of the Least Squares Estimator. From these considerations it follows that, asymptotically, the least squares estimator θ̂ is normally distributed (and consistent), so

    θ̂ ≈ N( θ, V(θ) )   asymptotically,

with asymptotic covariance matrix V(θ) = σ² (A(θ)^T A(θ))^(-1), where A(θ) is the n × p matrix of the partial derivatives (see 2.d).
To determine the covariance matrix V(θ) explicitly, A(θ) is calculated at the point θ̂ instead of the unknown point θ, and the usual estimator is plugged in for the error variance σ²:

    V̂(θ̂) = σ̂² ( A(θ̂)^T A(θ̂) )^(-1)    with    σ̂² = S(θ̂)/(n − p) = (1/(n − p)) Σ_{i=1}^{n} ( y_i − η_i(θ̂) )² .

With this, the distribution of the estimated parameters is approximately determined, from which, as in linear regression, standard errors and confidence intervals can be derived, or confidence ellipses (or ellipsoids) if several parameters are considered at once.
The denominator n − p in σ̂² is introduced in linear regression to make the estimator unbiased. Tests and confidence intervals are then determined not with the normal and chi-square distributions but with the t and F distributions, which take into account that the estimation of σ² causes additional random fluctuation. Even though the distributions are no longer exact, the approximations become more exact if we do the same in nonlinear regression. Asymptotically the difference goes to zero.
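As a minimal sketch, these quantities can be read off an nls fit in R; this continues the hypothetical Puromycin fit object `fit` from the sketch in 2.h and is an addition to the original text.

## Approximate covariance matrix, standard errors and Wald-type intervals.
V  <- vcov(fit)              # sigma-hat^2 * (A^T A)^{-1} evaluated at theta-hat
se <- sqrt(diag(V))          # approximate standard errors
cbind(estimate = coef(fit), se = se)
confint.default(fit)         # intervals based on the linear approximation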
Example d Membrane Separation Technology. Table 3.d shows a computer output for the membrane separation example. The parameter estimates are in the column "Estimate", followed by the estimated approximate standard errors and the test statistics ("t value"), which are approximately t_{n−p} distributed. In the last row the estimated standard deviation σ̂ of the random errors E_i is given.
From this output, as in linear regression, the confidence intervals for the parameters can be determined: the approximate 95% confidence interval for the parameter θ_1 is

    163.706 ± q_{0.975}(t_35) · 0.1262 = 163.706 ± 0.256 ,

where q_{0.975}(t_35) denotes the 0.975 quantile of the t distribution with 35 degrees of freedom.
Formula: delta ~ (T1 + T2 * 10^(T3 + T4 * pH)) / (10^(T3 + T4 * pH) + 1)

Parameters:
     Estimate Std. Error  t value Pr(>|t|)
T1   163.7056     0.1262 1297.256  < 2e-16
T2   159.7846     0.1594 1002.194  < 2e-16
T3     2.6751     0.3813    7.015 3.65e-08
T4    -0.5119     0.0703   -7.281 1.66e-08

Residual standard error: 0.2931 on 35 degrees of freedom
Number of iterations to convergence: 7
Achieved convergence tolerance: 5.517e-06

Table 3.d: Membrane Separation Technology Example: R summary of the fit.
Example e Puromycin. To check the influence of treating the enzyme with Puromycin on the postulated relationship (1.d), a general model for the data with and without treatment can be formulated as follows:

    Y_i = (θ_1 + θ_3 z_i) x_i / (θ_2 + θ_4 z_i + x_i) + E_i ,

where z is the indicator variable for the treatment (z_i = 1 if treated, z_i = 0 otherwise).
Table 3.e shows that the parameter θ_4 is not significantly different from 0 at the 5% level, since the p-value of 0.167 is larger than the level (5%). However, the treatment has a clear influence, which is expressed through θ_3; the 95% confidence interval covers the region 52.398 ± 9.5513 · 2.09 = [32.4, 72.4] (the value 2.09 corresponds to the 0.975 quantile of the t_19 distribution).
Formula: velocity ~ (T1 + T3 * (treated == T)) * conc / (T2 + T4 * (treated == T) + conc)

Parameters:
     Estimate Std. Error t value Pr(>|t|)
T1    160.280      6.896  23.242 2.04e-15
T2      0.048      0.008   5.761 1.50e-05
T3     52.404      9.551   5.487 2.71e-05
T4      0.016      0.011   1.436    0.167

Residual standard error: 10.4 on 19 degrees of freedom
Number of iterations to convergence: 6
Achieved convergence tolerance: 4.267e-06

Table 3.e: R summary of the fit for the Puromycin example.
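A minimal R sketch of the fit summarized in Table 3.e, using the built-in Puromycin data; the start values and the object name fit2 are assumptions for this illustration.

## Both enzymes in one model, with an indicator for the treated samples.
fit2 <- nls(rate ~ (T1 + T3 * (state == "treated")) * conc /
                   (T2 + T4 * (state == "treated") + conc),
            data  = Puromycin,
            start = c(T1 = 160, T2 = 0.05, T3 = 50, T4 = 0.01))
summary(fit2)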
f Confidence Intervals for Function Values. Besides the parameters, the function value h(x_0; θ) for a given x_0 is often of interest. In linear regression the function value h(x_0; β) = x_0^T β =: η_0 is estimated by η̂_0 = x_0^T β̂, and the corresponding estimated (1 − α) confidence interval is

    η̂_0 ± q_{1−α/2}(t_{n−p}) · se(η̂_0)    with    se(η̂_0) = σ̂ √( x_0^T (X^T X)^(-1) x_0 ) .
With analogous considerations and asymptotic approximations we can specify confidence intervals for the function value h(x_0; θ) of a nonlinear h. If the function η_0(θ) := h(x_0; θ) is linearly approximated at the point θ̂, we get

    η_0(θ) ≈ η_0(θ̂) + a_0^T (θ − θ̂)    with    a_0 = ∂h(x_0; θ)/∂θ .

(If x_0 is equal to an observed x_i, then a_0 equals the corresponding row of the matrix A from 2.d.) The confidence interval for the function value η_0(θ) := h(x_0; θ) is then approximately

    η_0(θ̂) ± q_{1−α/2}(t_{n−p}) · se( η_0(θ̂) )    with    se( η_0(θ̂) ) = σ̂ √( a_0^T ( A(θ̂)^T A(θ̂) )^(-1) a_0 ) .

In this formula the unknown values are again replaced by their estimates.
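A minimal sketch of this computation in R, continuing the hypothetical Puromycin fit `fit` from 2.h; the point x0 and the hand-written gradient a0 are assumptions made for this model.

## Approximate 95% confidence interval for h(x0; theta) at x0 = 0.4 ppm.
x0   <- 0.4
th   <- coef(fit)
eta0 <- th["T1"] * x0 / (th["T2"] + x0)
a0   <- c(x0 / (th["T2"] + x0),                    # d h / d theta1
          -th["T1"] * x0 / (th["T2"] + x0)^2)      # d h / d theta2
se0  <- sqrt(drop(t(a0) %*% vcov(fit) %*% a0))     # se of the fitted value
eta0 + c(-1, 1) * qt(0.975, df.residual(fit)) * se0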
[Figure 3.g: Left: confidence band for an estimated line in a linear problem (log(PCB concentration) vs. Years^(1/3)). Right: confidence band for the estimated curve h(x; θ) in the oxygen consumption example (Oxygen Demand vs. Days).]
g Confidence Band. The expression for the (1 − α) confidence interval for η_0(θ) := h(x_0; θ) also holds for arbitrary x_0. As in linear regression, it is natural to represent the limits of these intervals as a confidence band that is a function of x_0, as Figure 3.g illustrates.
Confidence bands for linear and nonlinear regression functions behave differently: for linear functions the confidence band is narrowest at the center of gravity of the explanatory variables and gets gradually wider further out (see Figure 3.g, left). In the nonlinear case, the band can behave in many ways. Because the functions in the Puromycin and oxygen consumption examples must pass through zero, the interval shrinks to a point there. Both models have a horizontal asymptote, and therefore the band reaches a constant width for large x (see Figure 3.g, right).
h Prediction Interval. The confidence band just considered indicates where the ideal function values h(x), and thus the expected values of Y for given x, lie. The question of the region in which a future observation Y_0 for given x_0 will lie is not answered by this. However, this is often more interesting than the question of the ideal function value; for example, we would like to know in which region the measured value of oxygen consumption would lie for an incubation time of 6 days.
Such a statement is a prediction about a random variable and differs in principle from a confidence interval, which says something about a parameter, i.e. a fixed but unknown number. Corresponding to the question posed, we call the region we are now seeking a prediction interval or forecast interval. More about this in Chapter 7.
i Variable Selection. In nonlinear regression, unlike linear regression, variable selection is not an important topic, because
- a variable does not correspond to each parameter, so usually the number of parameters is different from the number of variables, and
- there are seldom problems where we need to clarify whether an explanatory variable is necessary or not: the model is derived from the subject-matter theory.
However, there is sometimes the reasonable question of whether a subset of the parameters in the nonlinear regression model can appropriately describe the data (see the Puromycin example).
4. More Precise Tests and Confidence Intervals
a The quality of the approximate confidence region depends strongly on the quality of the linear approximation. The convergence properties of the optimization algorithms are also influenced by the quality of the linear approximation. With somewhat larger computational effort, the linearity can be checked graphically and, at the same time, more precise confidence intervals can be obtained.
b F-Test for Model Comparison. To test the null hypothesis θ = θ* for the whole parameter vector, or θ_j = θ*_j for an individual component, we can use an F-test for model comparison as in linear regression. Here, we compare the sum of squares S(θ*) that arises under the null hypothesis with the sum of squares S(θ̂). (For n → ∞ the F-test is the same as the so-called likelihood ratio test, and the sum of squares is, up to a constant, equal to the log-likelihood.)
Now we consider the null hypothesis θ = θ* for the whole parameter vector. The test statistic is

    T = ((n − p)/p) · ( S(θ*) − S(θ̂) ) / S(θ̂) ,

which approximately follows an F_{p, n−p} distribution.
From this we get the confidence region

    { θ | S(θ) ≤ S(θ̂) · ( 1 + p/(n−p) · q ) } ,

where q = q_{1−α}(F_{p, n−p}) is the (1 − α) quantile of the F distribution with p and n − p degrees of freedom.
In linear regression we get exactly the same confidence region if we use the (multivariate) normal distribution of the estimator β̂. In the nonlinear case the results are different. The region based on the F-test is not based on the linear approximation in 2.d and is thus (much) more exact.
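As a minimal sketch of such a model comparison in R, two nested nls fits can be compared with anova(); this reuses the hypothetical object fit2 (full model, 3.e) and a reduced model fitted here, and is an addition to the original text.

## F-test for model comparison: does the treatment change the parameters?
## Reduced model (no treatment effect) vs. the full model fit2 from 3.e.
fit_null <- nls(rate ~ T1 * conc / (T2 + conc), data = Puromycin,
                start = c(T1 = 190, T2 = 0.06))
anova(fit_null, fit2)    # compares S(theta*) with S(theta-hat)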
c Exact Confidence Regions for p = 2. If p = 2, we can find the exact confidence region by calculating S(θ) on a grid of θ values and determining the borders of the region through interpolation, as is familiar from contour plots. Figure 4.c shows these contours together with the elliptical regions that result from the linear approximation, for the Puromycin example (left) and the oxygen consumption example (right).
For p > 2 there are no contour plots. In the next chapter we introduce graphical tools that also work in higher dimensions. They depend on the following concepts.
[Figure 4.c: Nominal 80% and 95% likelihood contours and the confidence ellipses from the asymptotic approximation. The + denotes the least squares solution. In the Puromycin example (left) the agreement is good; in the oxygen consumption example (right) it is bad. Axes: θ_1 (horizontal) vs. θ_2 (vertical).]
d F-Test for Individual Parameters. We want to check whether an individual parameter θ_k can be equal to a certain value θ*_k. Such a null hypothesis makes no statement about the other parameters. The model that corresponds to the null hypothesis and fits the data best is determined for fixed θ_k = θ*_k through a least squares estimation of the remaining parameters: S(θ_1, ..., θ*_k, ..., θ_p) is minimized with respect to the θ_j, j ≠ k. We denote the minimum by S̃_k and the values θ_j that lead to it by θ̃_j. Both depend on θ*_k, so we write S̃_k(θ*_k) and θ̃_j(θ*_k).
The F-test statistic for the test θ_k = θ*_k is

    T̃_k = (n − p) · ( S̃_k(θ*_k) − S(θ̂) ) / S(θ̂) .

It has an (approximate) F_{1, n−p} distribution.
We get a confidence interval from this by numerically solving the equation T̃_k = q_{0.95}(F_{1, n−p}) for θ*_k. It has one solution that is smaller than θ̂_k and one that is larger.
e t-Test via F-Test. In linear regression and in the previous chapter we calculated tests and confidence intervals from a test value that follows a t distribution (t-test for the coefficients). Is this a different test?
It turns out that the test statistic of the t-test in linear regression turns into the test statistic of the F-test if we square it; the two tests are equivalent. In nonlinear regression, the F-test is not equivalent to the t-test discussed in the last chapter (3.d). However, we can transform the F-test into a t-test that is more precise than the one from the last chapter:
From the test statistic of the F-test we take the square root and attach the sign of θ̂_k − θ*_k,

    T_k(θ*_k) := sign( θ̂_k − θ*_k ) · √( S̃_k(θ*_k) − S(θ̂) ) / σ̂ .
(sign(a) denotes the sign of a, and σ̂² = S(θ̂)/(n − p).) This test statistic is (approximately) t_{n−p} distributed.
In the linear regression model, T_k is, as mentioned, equal to the test statistic of the usual t-test,

    T_k(θ*_k) = ( θ̂_k − θ*_k ) / se( θ̂_k ) .
f Confidence Intervals for Function Values via F-Test. With this technique we can also determine confidence intervals for a function value at a point x_0. For this we reparameterize the original problem so that one parameter, say φ_1, represents the function value h(x_0; θ), and proceed as in 4.d.
5. Profile t-Plot and Profile Traces
a Profile t-Function and Profile t-Plot. The graphical tools for checking the linear approximation are based on the just-discussed t-test, which does not actually use this approximation. We consider the test statistic T_k (4.e) as a function of its argument θ_k and call this the profile t-function (in the last chapter the argument was denoted by θ*_k; now, for simplicity, we leave out the *). For linear regression we get, as is apparent from 4.e, a straight line, while for nonlinear regression the result is a monotonically increasing function. The graphical comparison of T_k(θ_k) with a straight line is the so-called profile t-plot. Instead of θ_k, it is common to use a standardized version

    δ_k(θ_k) := ( θ_k − θ̂_k ) / se( θ̂_k )

on the horizontal axis because of the linear approximation. The comparison line is then the diagonal, i.e. the line with slope 1 and intercept 0.
The more strongly the profile t-function is curved, the stronger the nonlinearity in a neighborhood of θ_k. Therefore, this representation shows how good the linear approximation is in a neighborhood of θ̂_k. (The statistically relevant neighborhood is approximately determined by |T_k(θ_k)| ≤ 2.5.) In Figure 5.a it is apparent that in the Puromycin example the nonlinearity is minimal, while in the oxygen consumption example it is large.
From the plot we can read off confidence intervals according to 4.e. For convenience, the probabilities P(T_k ≤ t) according to the t distribution are marked on the right vertical axis. In the oxygen consumption example this gives a confidence interval without an upper bound!
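A minimal R sketch of how such plots and the associated intervals can be obtained from an nls fit; it again uses the hypothetical Puromycin fit `fit` from 2.h rather than the examples' original code.

## Profile t-function, profile t-plots and profile-based intervals for an nls fit.
pr <- profile(fit)   # evaluates the profile t-function for each parameter
plot(pr)             # one profile plot per parameter (cf. Figure 5.a)
confint(pr)          # confidence intervals based on the profile t-function (cf. 5.b)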
Example b Membrane Separation Technology. As 5.a shows, the corresponding confidence intervals based on the profile t-function can be read off graphically from the profile t-plot. The R function confint(...) numerically calculates the desired confidence interval on the basis of the profile t-function. Table 5.b shows the corresponding R output for the membrane separation example. In this case, no large differences from the classical calculation method are apparent.
> confint(Mem.fit, level=0.95)
Waiting for profiling to be done...
2.5% 97.5%
T1 163.4661095 163.9623685
T2 159.3562568 160.0953953
T3 1.9262495 3.6406832
T4 -0.6881818 -0.3797545
Table 5.b: Membrane separation technology example: R output for the confidence intervals based on the profile t-function.
c Likelihood Profile Traces. The likelihood profile traces are another useful tool. Here the estimated parameters θ̃_j, j ≠ k, at fixed θ_k (see 4.d) are considered as functions θ̃_j^(k)(θ_k) of θ_k.
The graphical representation of these functions would fill a whole matrix of diagrams, but without the diagonal. It is worthwhile to combine the "opposite" diagrams of this matrix: over the plot of θ̃_j^(k)(θ_k) we superimpose θ̃_k^(j)(θ_j) in mirrored form, so that the axes have the same meaning for both functions.
Figure 5.c shows these diagrams for our two examples. Additionally, contours of confidence regions for [θ_1, θ_2] are shown. We see that the profile traces cross the contours at the points of contact of the horizontal and vertical tangents.
The representation shows not only the nonlinearities, but also holds useful clues for how the parameters influence each other. To understand this, we consider the case of a linear regression function. The profile traces in the individual diagrams then consist of two straight lines that cross at the point [θ̂_1, θ̂_2]. If we standardize the parameters by using δ_k(θ_k) from 5.a, one can show that the slope of the trace θ̃_j^(k)(θ_k) is equal to the correlation coefficient c_kj of the estimated coefficients θ̂_j and θ̂_k.

[Figure 5.a: Profile t-plot for the first parameter in each of the Puromycin and oxygen consumption examples. The dashed lines show the applied linear approximation and the dotted line the construction of the 99% confidence interval with the help of T_1(θ_1).]

[Figure 5.c: Likelihood profile traces for the Puromycin and oxygen consumption examples, with 80% and 95% confidence regions (gray curves). Axes: θ_1 vs. θ_2.]
The "reverse" trace θ̃_k^(j)(θ_j) then has, with respect to the horizontal axis, a slope of 1/c_kj. The angle that the two lines enclose is thus a monotone function of this correlation. It therefore measures the collinearity between the two predictor variables. If the correlation between the parameter estimates is zero, the traces are parallel to each other.
For a nonlinear regression function, both traces are curved. The angle between them still shows how strongly the two parameters θ_j and θ_k are interrelated, and thus how strongly their estimates are correlated.
Example d Membrane Separation Technology. All profile t-plots and profile traces can be assembled into a triangular matrix of diagrams, as Figure 5.d shows for the membrane separation technology example.
Most profile traces are strongly curved, i.e. the regression function tends to strong nonlinearity around the estimated parameter values. Even though the profile traces for θ_3 and θ_4 are straight lines, a further problem is apparent: the profile traces lie on top of each other! This means that the parameters θ_3 and θ_4 are extremely strongly collinear. Parameter θ_2 is also collinear with θ_3 and θ_4, although more weakly.
e * Good Approximation of Two-Dimensional Likelihood Contours. The profile traces can be used to construct very accurate approximations of two-dimensional projections of the likelihood contours (see Bates and Watts, 1988). Their calculation is computationally less demanding than that of the corresponding exact likelihood contours.
6. Parameter Transformations
a Parameter transformations are primarily used to improve the linear approximation and thereby the convergence behavior and the quality of the confidence intervals.
It should be expressly noted that parameter transformations, unlike transformations of the variable of interest (see 1.h), do not change the statistical part of the model. They are therefore not helpful if the assumptions about the distribution of the random deviations are violated. It is the quality of the linear approximation and the statistical statements based on it that change.

[Figure 5.d: Profile t-plots and profile traces for the example from membrane separation technology. The + in the profile t-plots denotes the least squares solution.]

Often the transformed parameters are very difficult to interpret for the application, and the important questions often concern individual original parameters. Despite this, we can work with transformations: we derive more accurate confidence regions for the transformed parameters and back-transform these to get results for the original parameters.
b Restricted Parameter Regions. Often the permissible region of a parameter is restricted, e.g. because the regression function is only defined for positive values of a parameter. Usually, such a constraint is ignored to begin with and we wait to see whether and where the algorithm converges. According to experience, the parameter estimation will end up in a reasonable range if the model describes the data well and the data give enough information for determining the parameter.
Sometimes, though, problems occur in the course of the computation, especially if the parameter value that best fits the data lies near the edge of the permissible range. The simplest way to deal with such problems is via transformation of the parameter.
Examples:
- The parameter θ should be positive. Via the transformation φ = ln(θ), i.e. θ = exp(φ), θ is positive for all possible values φ ∈ R (a small R sketch follows after this list):

    h(x; θ)  →  h(x; exp(φ)) .

- The parameter θ should lie in the interval (a, b). With the logistic transformation θ = a + (b − a)/(1 + exp(−φ)), θ can, for arbitrary φ ∈ R, only take values in (a, b).
- In the model

    h(x; θ) = θ_1 exp(−θ_2 x) + θ_3 exp(−θ_4 x)

  with θ_2, θ_4 > 0, the parameter pairs (θ_1, θ_2) and (θ_3, θ_4) are interchangeable, i.e. h(x; θ) does not change. This can create uncomfortable optimization problems, because, for one thing, the solution is not unique. The constraint 0 < θ_2 < θ_4, which ensures uniqueness, is achieved via the transformations θ_2 = exp(φ_2) and θ_4 = exp(φ_2)·(1 + exp(φ_4)). The function becomes

    h(x; (θ_1, φ_2, θ_3, φ_4)) = θ_1 exp(−exp(φ_2)·x) + θ_3 exp(−exp(φ_2)·(1 + exp(φ_4))·x) .
c Parameter Transformation for Collinearity. A simultaneous transformation of the variables and the parameters can be helpful to reduce collinearity of the partial derivative vectors. For example, the model h(x; θ) = θ_1 exp(θ_2 x) has the derivatives

    ∂h/∂θ_1 = exp(θ_2 x) ,    ∂h/∂θ_2 = θ_1 x exp(θ_2 x) .

If all x values are positive, the two vectors

    a_1 := ( exp(θ_2 x_1), ..., exp(θ_2 x_n) )^T
    a_2 := ( θ_1 x_1 exp(θ_2 x_1), ..., θ_1 x_n exp(θ_2 x_n) )^T

tend to show disturbing collinearity. This collinearity can be avoided through centering: the model can be written as h(x; θ) = θ_1 exp(θ_2 (x − x_0 + x_0)). With the reparameterization φ_1 := θ_1 exp(θ_2 x_0) and φ_2 := θ_2 we get

    h(x; φ) = φ_1 exp( φ_2 (x − x_0) ) .

The derivative vectors become approximately orthogonal if the mean value of the x_i is chosen for x_0.

Original parameters:
        θ1      θ2      θ3
θ2   -0.256
θ3   -0.434   0.771
θ4    0.515  -0.708  -0.989

Transformed parameters (θ3 replaced by θ̃3):
        θ1      θ2      θ̃3
θ2   -0.256
θ̃3    0.323   0.679
θ4    0.515  -0.708  -0.312

Table 6.d: Correlation matrices for the membrane separation technology example, for the original parameters and for the transformed parameters with θ̃3.
Example d Membrane Separation Technology. In this example it is apparent from the approximate correlation matrix (Table 6.d, first part) that the parameters θ_3 and θ_4 are strongly correlated. (We had already found this in 5.d from the profile traces.)
If the model is reparameterized to

    Y_i = ( θ_1 + θ_2 · 10^(θ̃_3 + θ_4 (x_i − med(x_j))) ) / ( 1 + 10^(θ̃_3 + θ_4 (x_i − med(x_j))) ) + E_i ,    i = 1, ..., n ,

with θ̃_3 := θ_3 + θ_4 med(x_j), where med(x_j) denotes the median of the x values, an improvement is achieved (Table 6.d, second part).
e Reparameterization. In Chapter 5 we presented tools for graphically evaluating the linear approximation. If the approximation is found to be inadequate, we would like to improve it. An appropriate reparameterization can contribute to this. So, for example, for the model

    h(x; θ) = θ_1 θ_3 (x_2 − x_3) / ( 1 + θ_2 x_1 + θ_3 x_2 + θ_4 x_3 )

the reparameterization

    h(x; φ) = (x_2 − x_3) / ( φ_1 + φ_2 x_1 + φ_3 x_2 + φ_4 x_3 )

is recommended by Ratkowsky (1985). (Also see the exercises.)
Example f Membrane Separation Technology. The parameter transformation given in 6.d leads to a satisfactory result as far as the correlation is concerned. If we look at the likelihood contours or the profile t-plots and the profile traces, however, the parameterization is still not satisfactory.
An intensive search for further improvements leads to the following transformations, for which the profile traces turn out satisfactorily (Figure 6.f):

    θ̃_1 := ( θ_1 + θ_2 · 10^(θ̃_3) ) / ( 10^(θ̃_3) + 1 ) ,
    θ̃_2 := log_10( (θ_1 − θ_2) / (10^(θ̃_3) + 1) · 10^(θ̃_3) ) ,
    θ̃_3 := θ_3 + θ_4 med(x_j) ,
    θ̃_4 := 10^(θ_4) .

The model is then

    Y_i = θ̃_1 + 10^(θ̃_2) · ( 1 − θ̃_4^(x_i − med(x_j)) ) / ( 1 + 10^(θ̃_3) · θ̃_4^(x_i − med(x_j)) ) + E_i ,

and we get the result shown in Table 6.f.
[Figure 6.f: Profile t-plots and profile traces for the membrane separation technology example after the parameter transformations given above.]
g More Successful Reparameterization. It is apparent that a successful reparameterization depends on the data set, because the nonlinearities and the correlations between estimated parameters depend on the estimated value θ̂ itself. Therefore, no generally valid recipe can be given, which often makes the search for appropriate reparameterizations very tiresome.
h Failure of Gaussian Error Propagation. Even if a parameter transformation helps us deal with difficulties in the convergence behavior of the algorithm or in the quality of the confidence intervals, the original parameters often have a physical interpretation. We take the simple transformation example φ = ln(θ) from 6.b. Fitting the model delivers an estimate φ̂ with estimated standard error σ̂_φ̂. An obvious estimate for θ is then θ̂ = exp(φ̂).
Formula: delta ~ TT1 + 10^TT2 * (1 - TT4^pHR)/(1 + 10^TT3 * TT4^pHR)

Parameters:
     Estimate Std. Error  t value Pr(>|t|)
TT1 161.60008    0.07389 2187.122  < 2e-16
TT2   0.32336    0.03133   10.322 3.67e-12
TT3   0.06437    0.05951    1.082    0.287
TT4   0.30767    0.04981    6.177 4.51e-07

Residual standard error: 0.2931 on 35 degrees of freedom

Correlation of Parameter Estimates:
      TT1   TT2   TT3
TT2 -0.56
TT3 -0.77  0.64
TT4  0.15  0.35 -0.31

Number of iterations to convergence: 5
Achieved convergence tolerance: 9.838e-06

Table 6.f: Membrane separation technology example: R summary of the fit after the parameter transformation.
The standard error for θ̂ can be determined with the help of the Gaussian error propagation law (see Stahel, 2000, Section 6.10):

    σ̂²_θ̂ ≈ ( ∂exp(φ)/∂φ |_(φ=φ̂) )² · σ̂²_φ̂ = ( exp(φ̂) )² · σ̂²_φ̂ ,    i.e.    σ̂_θ̂ ≈ exp(φ̂) · σ̂_φ̂ .
From this we get the approximate 95% confidence interval for θ:

    exp(φ̂) ± σ̂_θ̂ · q_{0.975}(t_{n−p}) = exp(φ̂) · ( 1 ± σ̂_φ̂ · q_{0.975}(t_{n−p}) ) .

The Gaussian error propagation law is based on the linearization of the transformation function g(·), here concretely exp(·). We carried out the parameter transformation because the quality of the confidence intervals left a lot to be desired; unfortunately, this linearization negates what was achieved, and we are back where we were before the transformation.
The correct approach in such cases is to determine the confidence interval as presented in Chapter 4. If this is impossible for whatever reason, we can fall back on the following approximation.
i Alternative Procedure. Compared to 6.h, we get a better approximation of the confidence interval via the interval that consists of all values θ for which ln(θ) lies in the interval φ̂ ± σ̂_φ̂ · q_{0.975}(t_df). Generally formulated: let g be the transformation of φ to θ = g(φ). Then

    { θ : g^(-1)(θ) ∈ [ φ̂ − σ̂_φ̂ q_{0.975}(t_df) , φ̂ + σ̂_φ̂ q_{0.975}(t_df) ] }

is an approximate 95% confidence interval for θ. If g^(-1)(·) is strictly monotonically increasing, this interval is identical to

    [ g( φ̂ − σ̂_φ̂ q_{0.975}(t_df) ) , g( φ̂ + σ̂_φ̂ q_{0.975}(t_df) ) ] .

This procedure also ensures that, unlike the Gaussian error propagation law, the confidence interval lies entirely in the region that is predetermined for the parameter. It is thus impossible that, for example, the confidence interval for θ = exp(φ), unlike the interval from 6.h, contains negative values.
As already noted, the approach just discussed should only be used when the way via the F-test of Chapter 4 is not feasible. The Gaussian error propagation law should not be used in this context.
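A one-line numerical sketch of this back-transformation, with purely hypothetical numbers chosen for illustration:

## Approximate 95% interval for theta = exp(phi) by transforming the interval
## limits for phi (hypothetical values for phi-hat, its standard error and df).
phi_hat <- 1.2; se_phi <- 0.2; df <- 35
q <- qt(0.975, df)
exp(phi_hat + c(-1, 1) * q * se_phi)   # stays positive, unlike the interval in 6.h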
7. Forecasts and Calibration
Forecasts
a Besides the question of which parameters are plausible with respect to the given data, the question of the range in which future observations will lie is often of central interest. The difference between these two questions was already discussed in 3.h. In this chapter we want to address, among other things, the second question. We assume that the parameter θ is estimated from the data with the least squares method. (If you want, you can call this data set the training data set.) What can we now, with the help of our model, say about a future observation Y_0 at a given point x_0?
b Cress Example. The concentration of an agrochemical substance in soil samples can be studied through the growth behavior of a certain type of cress (nasturtium). Six measurements of the variable of interest Y were made on each of 7 soil samples with predetermined (or measured with the greatest possible precision) concentrations x. In each case we act as though the x values have no measurement error. The variable of interest is the weight of the cress per unit area after 3 weeks. A "logit-log" model is used to describe the relationship between concentration and weight:

    h(x; θ) = θ_1                                     if x = 0
    h(x; θ) = θ_1 / ( 1 + exp( θ_2 + θ_3 ln(x) ) )    if x > 0 .

(The data and the function h(·) are graphically represented in Figure 7.b.) We can now ask which weight values are expected for a concentration of, e.g., x_0 = 3.
[Figure 7.b: Cress example. Left: representation of the data (Weight vs. Concentration). Right: a typical shape of the applied regression function.]
c Approximate Forecast Intervals. We can estimate the expected value E(Y_0) = h(x_0; θ) of the variable of interest Y at the point x_0 by η̂_0 := h(x_0; θ̂). We also want to get an interval in which a future observation will lie with high probability, so we have to take into account not only the scatter of the estimate η̂_0 but also the random error E_0. Analogous to linear regression, an at least approximate (1 − α) forecast interval is given by

    η̂_0 ± q_{1−α/2}(t_{n−p}) · √( σ̂² + ( se(η̂_0) )² ) .

The calculation of se(η̂_0) is found in 3.f.
* Derivation. The random variable Y_0 is the value of the variable of interest for an observation with predictor value x_0. Since we do not know the true curve (actually only the parameters), we have no choice but to study the deviations of the observations from the estimated curve,

    R_0 = Y_0 − h(x_0; θ̂) = ( Y_0 − h(x_0; θ) ) − ( h(x_0; θ̂) − h(x_0; θ) ) .

Even if θ is unknown, we know the distribution of the expressions in the big parentheses: both are normally distributed random variables and they are independent, because the first depends only on the future observation Y_0 and the second only on the observations Y_1, ..., Y_n that led to the estimated curve. Both have expected value 0; the variances add up to

    var(R_0) ≈ σ² + σ² a_0^T ( A^T A )^(-1) a_0 .

The forecast interval described above follows by replacing the unknown values with their estimates.
d Forecast Versus Confidence Intervals. If the sample size n of the training data set is very large, the estimated forecast variance σ̂²_V = σ̂² + (se(η̂_0))² is dominated by the error variance σ̂². This means that the uncertainty in the forecast is then determined primarily by the observation error. The second summand in the expression for σ̂²_V reflects the uncertainty caused by the estimation of θ.
It is therefore clear that the forecast interval is wider than the confidence interval for the expected value, since the random scatter of the observation must be taken into account. The endpoints of such intervals for all x_0 values in a chosen region are shown in Figure 7.i (left), bound together as a band.
e * Quality of the Approximation. The determination of the forecast interval in 7.c is based on the same approximation as used in Chapter 3. The quality of the approximation can again be checked graphically as in Chapter 5.
f Interpretation of the Forecast Band. The interpretation of the forecast band, as shown in Figure 7.i, is not entirely simple. From the derivation it holds that

    P( V_0(x_0) ≤ Y_0 ≤ V_1(x_0) ) = 0.95 ,

where V_0(x_0) is the lower and V_1(x_0) the upper boundary of the prediction interval for h(x_0). However, if we want to make a prediction about more than one future observation, then the number of observations falling in the forecast interval is not binomially distributed with π = 0.95. The events that the individual future observations fall within the band are not independent; they depend on each other through the random boundaries V_0 and V_1. If, for example, the estimate of σ randomly turns out too small, the band is too narrow for all future observations, and too many observations will lie outside the band.
Calibration
g The actual goal of the experiment in the cress example is to estimate the concentration of the agrochemical substance from the weight of the cress. Consequently, we would like to use the regression relationship in the "wrong" direction, and problems of inference arise from this. Despite that, such a procedure is often desired, in order to calibrate a measurement method or to estimate the result of a more expensive measurement method from a cheaper one. The regression curve in this context is often called a calibration curve. Another keyword for finding this topic is inverse regression.
We would like to present here a simple method that gives a usable result if simplifying assumptions apply.
h Procedure under Simplifying Assumptions. We assume that the x values have no measurement error. This is achieved in the example if the concentrations of the agrochemical substance are determined very carefully. For several such soil samples with concentrations as different as possible we carry out several independent measurements of the value Y. This gives a training data set, with which the unknown parameters can be estimated and the standard errors of the estimates determined.
Now, if the value y_0 is read off for a sample to be measured, it is obvious to determine the corresponding x_0 value as

    x̂_0 = h^(-1)( y_0; θ̂ ) ,

where h^(-1) denotes the inverse function of h. This procedure is, however, only correct if h(·) is monotonically increasing or decreasing. This condition is usually fulfilled in calibration problems.
[Figure 7.i: Cress example. Left: confidence band for the estimated regression curve (dashed) and forecast band (solid). Right: schematic representation of how a calibration interval is determined, at the points y_0 = 650 and y_0 = 350. The resulting intervals are [0.4, 1.22] and [1.73, 4.34], respectively.]
i Accuracy of the Obtained Values. At the end of this calculation the question arises of how accurate this value really is. The problem appears similar to the prediction problem at the start of this chapter. Unlike the prediction problem, however, here y_0 is observed and the corresponding value x_0 must be estimated.
The answer can be formulated as follows: we view x_0 as a parameter for which we want a confidence interval. Such an interval arises (as always) from a test. We take as null hypothesis x_H = x_0. As we have seen in 7.c, Y_0 lies with probability 0.95 in the forecast interval

    η̂_0 ± q_{0.975}(t_{n−p}) · √( σ̂² + ( se(η̂_0) )² ) ,

where η̂_0 is a compact notation for h(x_0; θ̂). This interval therefore forms an acceptance interval for the value Y_0 (which here plays the role of a test statistic) under the null hypothesis x_0 = x_H. Figure 7.i shows the prediction intervals for all possible x_0 in a chosen range for the cress example.
j Illustration. Figure 7.i (right) illustrates the thought process for the cress example: measured values y_0 are compatible with parameter values x_0 in the sense of the test if the point [x_0, y_0] lies between the curves shown. In the figure we can thus determine without difficulty the set of x_0 values that are compatible with the observation y_0. They form the shown interval, which can also be described as the set

    { x : | y_0 − h(x; θ̂) | ≤ q_{1−α/2}(t_{n−p}) · √( σ̂² + ( se( h(x; θ̂) ) )² ) } .

This interval is the desired confidence interval (or calibration interval) for x_0.
If we have $m$ values for determining $y_o$, we apply the above method to $\bar{y}_o = \sum_{j=1}^{m} y_{oj}/m$ and obtain
$$\Big\{ x : \big|\bar{y}_o - h\langle x, \widehat{\theta}\rangle\big| \le q^{t_{n-p}}_{0.975}\,\sqrt{\widehat{\sigma}^2/m + \big(\mathrm{se}\langle h\langle x, \widehat{\theta}\rangle\rangle\big)^2} \Big\}\,.$$
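A minimal sketch (under the same hypothetical setup) of how this set could be determined numerically: the calibration interval is read off as the range of grid values $x$ whose forecast interval contains the observed $y_o$, reusing the hypothetical forecast.interval function from the sketch above and assuming that the compatible set is indeed an interval.

## Minimal sketch: calibration interval as the set of x values whose forecast
## interval contains y0 (grid search over the design region).
calibration.interval <- function(fit, y0, xgrid = seq(0, 5, length.out = 501)) {
  ok <- sapply(xgrid, function(x) {
    fi <- forecast.interval(fit, x)
    y0 >= fi["lower"] && y0 <= fi["upper"]
  })
  range(xgrid[ok])
}

## Hypothetical use: calibration.interval(fit, y0 = 650)   # cf. Figure 7.i (right)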
k In this chapter, one of many possibilities for determining the calibration interval was presented. The diversity of methods exists because their derivation is based on some strongly simplifying assumptions.
8. Closing Comments
a Reason for the Difficulty in the Oxygen Consumption Example. Why do we have so much difficulty with the oxygen consumption example? Consider Figure 1.e and recall that the parameter $\theta_1$ represents the expected oxygen consumption for infinite incubation time. It is then clear that $\theta_1$ is difficult to estimate, because the horizontal asymptote through the data is poorly determined. If we had more observations with longer incubation times, we could avoid the difficulties with the quality of the confidence intervals of $\theta$.
In nonlinear models, too, a good (statistical) experimental design is essential. The information content of the data is determined by the choice of the experimental conditions, and no (statistical) procedure can deliver information that is not contained in the data.
b Bootstrap. For some time the bootstrap has also been used for determining confidence, prediction, and calibration intervals. See, e.g., Huet, Bouvier, Gruet and Jolivet (1996), where the case of non-constant variance (heteroskedastic models) is also discussed. It is also worth taking a look at the book by Carroll and Ruppert (1988).
c Correlated Errors. In this document it is always assumed that the errors $E_i$ are independent. Like linear regression analysis, nonlinear regression can also be extended to include correlated errors and random effects (see 8.d for software hints).
d Statistics Programs. Today most statistics packages contain a procedure that can calculate asymptotic confidence intervals for the parameters. In principle it is then possible to calculate t-profiles and profile traces, because they are also based on the fitting of nonlinear models, although with one fewer parameter.
In both implementations of the statistics language S, S-Plus and R, the function nls is available, which is based on the work of Bates and Watts (1988). The library nlme contains S functions that can fit nonlinear regression models with correlated errors (gnls) and random effects (nlme). These implementations are based on the book Mixed-Effects Models in S and S-PLUS by Pinheiro and Bates (2000).
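As a brief illustration (not taken from the text), a typical R call sequence might look as follows; the data frame d, the model formula, and the starting values are hypothetical.

## Minimal sketch: fit a two-parameter nonlinear model and inspect
## profile-based confidence intervals.
library(MASS)       # provides the profile-based confint() method for nls fits
fit <- nls(y ~ th1 * x / (th2 + x), data = d, start = c(th1 = 200, th2 = 0.1))

summary(fit)        # parameter estimates, standard errors, sigma-hat
confint(fit)        # profile-based confidence intervals
plot(profile(fit))  # profile t-plots and profile traces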
e Literature Notes. This document is based mainly on the book by Bates and Watts (1988). A mathematical discussion of the statistical and numerical methods in nonlinear regression can be found in Seber and Wild (1989). The book by Ratkowsky (1989) enumerates many possible nonlinear functions $h\langle\cdot\rangle$ that occur primarily in biological applications.
A short introduction to this topic in connection with the statistics program R can be found in Venables and Ripley (2002) and in Ritz and Streibig (2008).
A. Gauss-Newton Method
a The Optimization Problem. To determine the least squares estimator we must minimize the sum of squares
$$S(\theta) := \sum_{i=1}^{n} \big(y_i - \eta_i\langle\theta\rangle\big)^2 .$$
Unfortunately this cannot be carried out explicitly as in linear regression. With local linear approximations of the regression function, however, the difficulties can be overcome.
b Solution Procedure. The procedure is set out in four steps. We consider the $(j+1)$-th iteration of the procedure. To simplify, we assume that the regression function $h\langle\cdot\rangle$ has only one unknown parameter $\theta$. The solution from the previous iteration is denoted by $\theta^{(j)}$; $\theta^{(0)}$ is then the notation for the initial value.
1. Approximation: The regression function $h\langle x_i, \theta\rangle = \eta_i\langle\theta\rangle$ with the one-dimensional parameter $\theta$ is approximated at the point $\theta^{(j)}$ by a line:
$$\eta_i\langle\theta\rangle \approx \eta_i\langle\theta^{(j)}\rangle + \frac{\partial \eta_i\langle\theta\rangle}{\partial\theta}\bigg|_{\theta^{(j)}} (\theta - \theta^{(j)}) = \eta_i\langle\theta^{(j)}\rangle + a_i\langle\theta^{(j)}\rangle\,(\theta - \theta^{(j)})\,.$$
* For a multidimensional $\theta$ the approximation is by a hyperplane:
$$h\langle x_i, \theta\rangle = \eta_i\langle\theta\rangle \approx \eta_i\langle\theta^{(j)}\rangle + a^{(1)}_i\langle\theta^{(j)}\rangle\,(\theta_1 - \theta^{(j)}_1) + \ldots + a^{(p)}_i\langle\theta^{(j)}\rangle\,(\theta_p - \theta^{(j)}_p)\,,$$
where
$$a^{(k)}_i\langle\theta\rangle := \frac{\partial\, h\langle x_i, \theta\rangle}{\partial\theta_k}\,, \quad k = 1, \ldots, p\,.$$
With vectors and matrices the above equation can be written as
$$\eta\langle\theta\rangle \approx \eta\langle\theta^{(j)}\rangle + \mathbf{A}^{(j)}\,(\theta - \theta^{(j)})\,.$$
Here the $(n \times p)$ derivative matrix $\mathbf{A}^{(j)}$ consists, in the $j$-th iteration, of the elements $\{a_{ik} = a^{(k)}_i\langle\theta\rangle\}$ evaluated at the point $\theta = \theta^{(j)}$.
2. Local Linear Model: We now assume that the approximation in the first step holds exactly for the true model. For the residuals we then get
$$r^{(j+1)}_i = y_i - \big\{\eta_i\langle\theta^{(j)}\rangle + a_i\langle\theta^{(j)}\rangle\,(\theta - \theta^{(j)})\big\} = \tilde{y}^{(j+1)}_i - a_i\langle\theta^{(j)}\rangle\,\beta^{(j+1)}$$
with $\tilde{y}^{(j+1)}_i := y_i - \eta_i\langle\theta^{(j)}\rangle$ and $\beta^{(j+1)} := \theta - \theta^{(j)}$.
* For a multidimensional $\theta$ it holds that
$$r^{(j+1)} = y - \big\{\eta\langle\theta^{(j)}\rangle + \mathbf{A}^{(j)}\,(\theta - \theta^{(j)})\big\} = \tilde{y}^{(j+1)} - \mathbf{A}^{(j)}\,\beta^{(j+1)}$$
with $\tilde{y}^{(j+1)} := y - \eta\langle\theta^{(j)}\rangle$ and $\beta^{(j+1)} := \theta - \theta^{(j)}$.
3. Least Squares Estimation in the Locally Linear Model: To find the best-fitting $\beta^{(j+1)}$ for the data, we minimize the sum of squared residuals $\sum_{i=1}^{n} (r^{(j+1)}_i)^2$. This is the usual linear least squares problem with $\tilde{y}^{(j+1)}_i$ in place of $y_i$, $a_i\langle\theta^{(j)}\rangle$ in place of $x_i$, and $\beta^{(j+1)}$ in place of the slope (the line goes through the origin). The solution to this problem, $\widehat{\beta}^{(j+1)}$, gives the best solution on the approximated line.
* To find the best-fitting $\beta^{(j+1)}$ in the multidimensional case, we minimize the sum of squared residuals $\|r^{(j+1)}\|^2$. This is the usual linear least squares problem with $\tilde{y}^{(j+1)}$ in place of $y$, $\mathbf{A}^{(j)}$ in place of $X$, and $\beta^{(j+1)}$ in place of $\beta$. The solution of this problem gives a $\widehat{\beta}^{(j+1)}$ such that the point $\eta\langle\theta^{(j+1)}\rangle$ with $\theta^{(j+1)} = \theta^{(j)} + \widehat{\beta}^{(j+1)}$ lies nearer to $y$ than $\eta\langle\theta^{(j)}\rangle$.
4. Iteration: With $\theta^{(j+1)} = \theta^{(j)} + \widehat{\beta}^{(j+1)}$ we now return to step 1 and repeat steps 1, 2, and 3 until this procedure converges. The converged solution minimizes $S\langle\theta\rangle$ and thus corresponds to the desired estimate $\widehat{\theta}$.
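To summarize the four steps, here is a minimal R sketch of the plain Gauss-Newton iteration (illustrative only; nls uses a refined and safeguarded version of this idea). The regression function h, the data x and y, and the starting value are supplied by the user; the derivative matrix is approximated by finite differences, and all names are hypothetical.

## Minimal sketch of the Gauss-Newton iteration described above.
## h(x, theta) is the regression function, theta0 the initial value theta^(0).
gauss.newton <- function(h, x, y, theta0, tol = 1e-8, maxit = 50) {
  theta <- theta0
  for (j in seq_len(maxit)) {
    eta <- h(x, theta)
    ## Step 1: derivative matrix A^(j) by finite differences (n x p)
    A <- matrix(unlist(lapply(seq_along(theta), function(k) {
      eps <- 1e-6 * max(1, abs(theta[k]))
      th1 <- theta; th1[k] <- th1[k] + eps
      (h(x, th1) - eta) / eps
    })), ncol = length(theta))
    ## Steps 2 and 3: linear least squares for the increment beta^(j+1)
    ytilde <- y - eta
    beta   <- qr.solve(A, ytilde)
    ## Step 4: update and check convergence
    theta <- theta + beta
    if (sqrt(sum(beta^2)) < tol * (sqrt(sum(theta^2)) + tol)) break
  }
  theta
}

## Hypothetical use with an assumed two-parameter function:
## h <- function(x, th) th[1] * x / (th[2] + x)
## theta.hat <- gauss.newton(h, x, y, theta0 = c(200, 0.1))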
c Further Details. This minimization procedure, which is known as the Gauss-Newton method, can be further refined. There are, however, also other minimization methods available. What exactly "converged" should mean and how convergence can be achieved must be addressed in more detail; details about these technical questions can be found in, e.g., Bates and Watts (1988). For us it is of primary importance to see that iterative procedures must be applied to solve the optimization problem from 2.a.
References
Bates, D. M. and Watts, D. G. (1988). Nonlinear Regression Analysis & Its Applica-
tions, John Wiley & Sons.
Carroll, R. and Ruppert, D. (1988). Transformation and Weighting in Regression,
Wiley, New York.
Daniel, C. and Wood, F. S. (1980). Fitting Equations to Data, John Wiley & Sons,
New York.
Huet, S., Bouvier, A., Gruet, M.-A. and Jolivet, E. (1996). Statistical Tools for Non-
linear Regression: A Practical Guide with S-Plus Examples, Springer-Verlag, New
York.
Pinheiro, J. C. and Bates, D. M. (2000). Mixed-Effects Models in S and S-PLUS, Statistics and Computing, Springer.
Rapold-Nydegger, I. (1994). Untersuchungen zum Diffusionsverhalten von Anionen in carboxylierten Cellulosemembranen, PhD thesis, ETH Zurich.
Ratkowsky, D. A. (1985). A statistically suitable general formulation for modelling catalytic chemical reactions, Chemical Engineering Science 40(9): 1623-1628.
Ratkowsky, D. A. (1989). Handbook of Nonlinear Regression Models, Marcel Dekker,
New York.
Ritz, C. and Streibig, J. C. (2008). Nonlinear Regression with R, Use R!, Springer
Verlag.
Seber, G. and Wild, C. (1989). Nonlinear regression, Wiley, New York.
Stahel, W. (2000). Statistische Datenanalyse: Eine Einführung für Naturwissenschaftler, 3. Auflage edn, Vieweg, Braunschweig/Wiesbaden.
Venables, W. N. and Ripley, B. (2002). Modern Applied Statistics with S, Statistics
and Computing, fourth edn, Springer-Verlag, New York.