
ECON 5350 Class Notes

Nonlinear Regression Models

1 Introduction

In this section, we examine regression models that are nonlinear in the parameters and give a brief overview

of methods to estimate such models.

2 Nonlinear Regression Models

The general form of the nonlinear regression model is

$$y_i = h(x_i, \beta, \varepsilon_i), \qquad (1)$$

which is more commonly written in a form with an additive error term,

$$y_i = h(x_i, \beta) + \varepsilon_i. \qquad (2)$$

Below are two examples.

1. $h(x_i, \beta, \varepsilon_i) = \beta_0 x_{1i}^{\beta_1} x_{2i}^{\beta_2} \exp(\varepsilon_i)$. This is an intrinsically linear model because by taking natural logarithms we get a model that is linear in the parameters, $\ln(y_i) = \ln(\beta_0) + \beta_1 \ln(x_{1i}) + \beta_2 \ln(x_{2i}) + \varepsilon_i$. This can be estimated with standard linear procedures such as OLS (a short sketch follows this list).

2. $h(x_i, \beta) = \beta_0 x_{1i}^{\beta_1} x_{2i}^{\beta_2}$. Since the error term in (2) is additive, there is no transformation that will produce a linear model. This is an intrinsically nonlinear model (i.e., the relevant first-order conditions are nonlinear in the parameters). Below we consider two methods for estimating such a model: linearizing the underlying regression model and nonlinear optimization of the objective function.
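As a quick illustration of the intrinsically linear case in example 1, here is a minimal sketch of estimation by OLS after a log transformation. The simulated data, parameter values, and variable names are purely illustrative assumptions, not part of the notes.

```python
import numpy as np

# Minimal sketch: estimate the intrinsically linear model by OLS after a log
# transformation. The data are simulated here purely for illustration.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.uniform(1, 5, n), rng.uniform(1, 5, n)
beta0, beta1, beta2 = 2.0, 0.4, 0.6
y = beta0 * x1**beta1 * x2**beta2 * np.exp(rng.normal(0, 0.1, n))

# Linear-in-parameters form: ln(y) = ln(b0) + b1*ln(x1) + b2*ln(x2) + e
X = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print("ln(b0), b1, b2 estimates:", coef)   # exp(coef[0]) recovers b0
```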

2.1 Linearized Regression Model and the Gauss-Newton Algorithm

Consider a first-order Taylor series approximation of the regression model around $\beta^0$:

$$y_i = h(x_i, \beta) + \varepsilon_i \approx h(x_i, \beta^0) + g(x_i, \beta^0)(\beta - \beta^0) + \varepsilon_i$$

where

$$g(x_i, \beta^0) = \left(\left.\frac{\partial h}{\partial \beta_1}\right|_{\beta=\beta^0}, \ldots, \left.\frac{\partial h}{\partial \beta_k}\right|_{\beta=\beta^0}\right).$$

Collecting terms and rearranging gives

$$Y^0 = X^0 \beta + \varepsilon^0$$

where

$$Y^0 \equiv Y - h(X, \beta^0) + g(X, \beta^0)\beta^0, \qquad X^0 \equiv g(X, \beta^0).$$

The matrix $X^0$ is called the pseudoregressor matrix. Note also that $\varepsilon^0$ will include higher-order approximation errors.

2.1.1 Gauss-Newton Algorithm

Given an initial value for $\beta^0$, we can estimate $\beta$ with the following iterative LS procedure:

$$
\begin{aligned}
b_{t+1} &= [X^0(b_t)' X^0(b_t)]^{-1} [X^0(b_t)' Y^0(b_t)] \\
        &= [X^0(b_t)' X^0(b_t)]^{-1} [X^0(b_t)'(X^0(b_t) b_t + e_t^0)] \\
        &= b_t + [X^0(b_t)' X^0(b_t)]^{-1} X^0(b_t)' e_t^0 \\
        &= b_t + \lambda_t W_t g_t
\end{aligned}
$$

where $W_t = [2 X^0(b_t)' X^0(b_t)]^{-1}$, $\lambda_t = 1$, and $g_t = 2 X^0(b_t)' e_t^0$. The iterations continue until the difference between $b_{t+1}$ and $b_t$ is sufficiently small. This is called the Gauss-Newton algorithm. Interpretations for $W_t$, $\lambda_t$, and $g_t$ will be given below. A consistent estimator of $\sigma^2$ is

$$s^2 = \frac{1}{n-k}\sum_{i=1}^n (y_i - h(x_i, b))^2.$$
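Here is a minimal sketch of the Gauss-Newton iteration applied to the intrinsically nonlinear model from example 2, $y_i = \beta_0 x_{1i}^{\beta_1} x_{2i}^{\beta_2} + \varepsilon_i$. The simulated data, starting values, and tolerance are illustrative assumptions.

```python
import numpy as np

# Gauss-Newton sketch for y = b0 * x1**b1 * x2**b2 + e (illustrative data).
rng = np.random.default_rng(1)
n = 200
x1, x2 = rng.uniform(1, 5, n), rng.uniform(1, 5, n)
beta = np.array([2.0, 0.4, 0.6])
y = beta[0] * x1**beta[1] * x2**beta[2] + rng.normal(0, 0.2, n)

def h(b):
    return b[0] * x1**b[1] * x2**b[2]

def pseudoreg(b):
    # X0 = g(X, b): derivatives of h with respect to (b0, b1, b2)
    f = x1**b[1] * x2**b[2]
    return np.column_stack([f, b[0] * f * np.log(x1), b[0] * f * np.log(x2)])

b = np.array([1.0, 1.0, 1.0])          # starting value b_0
for _ in range(100):
    e0 = y - h(b)                      # current residuals e_t^0
    X0 = pseudoreg(b)
    step = np.linalg.solve(X0.T @ X0, X0.T @ e0)
    b = b + step                       # b_{t+1} = b_t + (X0'X0)^{-1} X0'e0
    if np.max(np.abs(step)) < 1e-8:
        break

s2 = np.sum((y - h(b))**2) / (n - 3)   # s^2 = S(b)/(n-k)
se = np.sqrt(np.diag(s2 * np.linalg.inv(pseudoreg(b).T @ pseudoreg(b))))
print("NLS estimates:", b, "std. errors:", se)
```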

2.1.2 Properties of the NLS Estimator

Only asymptotic results are available for this estimator. Assuming that the pseudoregressors are well-behaved (i.e., $\text{plim}\,\frac{1}{n} X^{0\prime} X^0 = Q^0$, a finite positive definite matrix), we can apply the CLT to show that

$$b \overset{asy}{\sim} N\!\left[\beta, \frac{\sigma^2}{n}(Q^0)^{-1}\right],$$

where the estimate of $\frac{\sigma^2}{n}(Q^0)^{-1}$ is $s^2 (X^{0\prime} X^0)^{-1}$.

2.1.3 Notes

1. Depending on the initial value, $b_0$, the Gauss-Newton algorithm can lead to a local (as opposed to global) minimum or head off on a divergent path.

2. The standard $R^2$ formula may produce a goodness-of-fit value outside the interval $[0, 1]$.

3. Extensions of the J test are available that allow one to test nonlinear versus linear models.

4. Hypothesis testing is only valid asymptotically.

2.2 Hypothesis Testing

Consider testing the hypothesis $H_0: R(\beta) = q$. Below are four tests that are asymptotically equivalent.

2.2.1 Asymptotic F test.

Begin by letting $S(b) = (Y - h(X,b))'(Y - h(X,b))$ be the sum of squared residuals evaluated at the unrestricted NLS estimate. Also, let $S(b_*)$ be the corresponding measure evaluated at the restricted estimate. The standard F formula gives

$$F = \frac{[S(b_*) - S(b)]/J}{S(b)/(n-k)}.$$

Under the null hypothesis, $JF \overset{asy}{\sim} \chi^2(J)$.
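A small numeric sketch of this statistic follows. The sums of squared residuals, sample size, and number of restrictions are made-up numbers for illustration only.

```python
from scipy.stats import chi2

# Illustrative numbers, not from the notes: unrestricted and restricted SSRs.
S_unres, S_res = 41.2, 47.9
n, k, J = 100, 4, 2                      # observations, parameters, restrictions

F = ((S_res - S_unres) / J) / (S_unres / (n - k))
print("F =", F, "  J*F =", J * F, "  p-value =", 1 - chi2.cdf(J * F, J))
```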

2.2.2 Wald Test.

The nonlinear counterpart to the Wald statistic is

$$W = [R(b) - q]'\,[C \hat{V} C']^{-1}\,[R(b) - q] \overset{asy}{\sim} \chi^2(J)$$

where $\hat{V} = \hat{\sigma}^2 (X^{0\prime} X^0)^{-1}$, $C = \partial R(b)/\partial b'$, and $\hat{\sigma}^2 = S(b)/n$.
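A hypothetical Wald calculation for a single restriction (constant returns to scale in the Cobb-Douglas model of Section 4) is sketched below; the estimates and covariance matrix are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2

# Illustrative Wald test of constant returns to scale, R(b) = b2 + b3 - 1 = 0.
b = np.array([1.1, 0.62, 0.41])                    # NLS estimates (b1, b2, b3)
V_hat = np.diag([0.04, 0.003, 0.002])              # estimated covariance matrix
R_b, q = b[1] + b[2], 1.0
C = np.array([[0.0, 1.0, 1.0]])                    # C = dR(b)/db'

W = float((R_b - q)**2 / (C @ V_hat @ C.T))
print("W =", W, " p-value =", 1 - chi2.cdf(W, df=1))
```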

2.2.3 Likelihood Ratio Test.

Assume $\varepsilon \sim N(0, \sigma^2 I)$. The likelihood ratio statistic is

$$LR = -2[\ln(L_*) - \ln(L)] \overset{asy}{\sim} \chi^2(J)$$

where $\ln(L)$ and $\ln(L_*)$ are the unrestricted and restricted log-likelihood values, respectively.

2.2.4 Lagrange Multiplier Test.

The LM statistic is based solely on the restricted model. Occasionally, imposing the restriction $R(\beta) = q$ changes an intrinsically nonlinear model into an intrinsically linear one. The LM statistic is

$$
LM = \frac{e_*' X^0 [X^{0\prime} X^0]^{-1} X^{0\prime} e_*}{S(b_*)/n}
   = \frac{e_*' X^0 \tilde{b}}{\widetilde{TSS}/n}
   = n\,\frac{\widetilde{ESS}}{\widetilde{TSS}} = n \tilde{R}^2 \overset{asy}{\sim} \chi^2(J)
$$

where $e_* = Y - h(X, b_*)$, $X^0 = g(X, b_*)$, $\tilde{b}$ is the estimated coefficient vector from regressing $e_*$ on $X^0$, and $\tilde{R}^2$ is the coefficient of determination from that regression.
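A sketch of the $n\tilde{R}^2$ computation follows. The inputs `e_star` and `X0` would come from the restricted fit; the synthetic arrays in the usage lines are illustrative only.

```python
import numpy as np
from scipy.stats import chi2

def lm_test(e_star, X0, J):
    """LM = n * (uncentered R^2) from regressing the restricted residuals
    e_star on the pseudoregressors X0 = g(X, b_star)."""
    n = len(e_star)
    b_tilde = np.linalg.solve(X0.T @ X0, X0.T @ e_star)
    ess = e_star @ X0 @ b_tilde              # explained sum of squares
    tss = e_star @ e_star                    # uncentered total sum of squares
    LM = n * ess / tss
    return LM, 1 - chi2.cdf(LM, J)

# Illustrative call with synthetic inputs (in practice e_star and X0 come
# from the restricted NLS fit).
rng = np.random.default_rng(2)
X0_demo = rng.normal(size=(100, 3))
e_demo = X0_demo @ np.array([0.0, 0.1, 0.0]) + rng.normal(size=100)
print(lm_test(e_demo, X0_demo, J=1))
```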

3 Brief Overview of Nonlinear Optimization Techniques

An alternative method for estimating the parameters of equation (2) is to apply nonlinear optimization techniques directly to the first-order conditions. Consider the NLS problem of minimizing

$$S(b) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - h(x_i, b))^2.$$

The first-order conditions produce

$$\frac{\partial S(b)}{\partial b} = -2 \sum_{i=1}^n (y_i - h(x_i, b)) \frac{\partial h(x_i, b)}{\partial b} = 0, \qquad (3)$$

which is generally nonlinear in the parameters and does not have a nice closed-form, analytical solution. The methods outlined below can be used to solve the set of equations (3).

3.1 Introduction

Consider the function $f(\theta) = a + b\theta + c\theta^2$. The first-order condition for minimization is

$$\frac{df(\theta)}{d\theta} = b + 2c\theta = 0 \implies \theta = -\frac{b}{2c}.$$

This is considered a linear optimization problem even though the objective function is nonlinear in the parameters. Alternatively, consider the objective function $f(\theta) = a + b\theta^2 + c\ln(\theta)$. The first-order condition for minimization is

$$\frac{df(\theta)}{d\theta} = 2b\theta + \frac{c}{\theta} = 0.$$

This is considered a nonlinear optimization problem.

Here is a general outline of how to solve the nonlinear optimization problem. Let $\theta$ be the parameter vector, $\Delta$ the directional vector, and $\lambda$ the step length.

Procedure.

1. Specify $\theta_0$ and $\Delta_0$.

2. Determine $\lambda_t$.

3. Compute $\theta_{t+1} = \theta_t + \lambda_t \Delta_t$.

4. Convergence criterion satisfied?

   Yes → Exit.

   No → Update $t = t + 1$, compute $\Delta_t$, and return to #2.
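A generic version of this loop is sketched below; the quadratic objective, steepest-ascent direction, and fixed step length are placeholders chosen for illustration, not part of the notes.

```python
import numpy as np

# Generic skeleton of the iterative procedure above.
def G(theta):                      # objective to maximize (illustrative)
    return -np.sum((theta - 1.0) ** 2)

def grad(theta):                   # gradient of G
    return -2.0 * (theta - 1.0)

theta = np.zeros(2)                # step 1: specify theta_0
lam = 0.1                          # step 2: step length (held fixed here)
for t in range(1000):
    delta = grad(theta)            # directional vector (steepest ascent)
    theta_new = theta + lam * delta                 # step 3
    if np.max(np.abs(theta_new - theta)) < 1e-10:   # step 4: convergence check
        theta = theta_new
        break
    theta = theta_new
print("maximizer:", theta, " G:", G(theta))
```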

There are two general types of nonlinear optimization algorithms: those that do not involve derivatives and those that do.

3.2 Derivative-Free Methods

Derivative-free algorithms are used when the number of parameters is small, analytical derivatives are difficult to calculate, or seed values are needed for other algorithms.

1. Grid search. This is a trial-and-error method that is typically not feasible for more than two parameters. It can be a useful means to find starting values for other algorithms (a short sketch follows this list).

2. Direct search methods. Using the iterative algorithm $\theta_{t+1} = \theta_t + \lambda_t \Delta_t$, a search is performed in $m$ directions: $\Delta_1, \ldots, \Delta_m$. $\lambda_t$ is chosen to ensure that $G(\theta_{t+1}) > G(\theta_t)$.

3. Other methods. The simplex algorithm and simulated annealing are examples of other derivative-free methods.
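Here is a minimal two-parameter grid-search sketch, e.g. to generate starting values for Gauss-Newton. The objective, grid ranges, and simulated data are illustrative assumptions.

```python
import numpy as np
from itertools import product

# Minimal two-parameter grid search over the sum of squared residuals.
rng = np.random.default_rng(3)
x = rng.uniform(1, 5, 50)
y = 2.0 * x**0.5 + rng.normal(0, 0.1, 50)

def S(b0, b1):                       # sum of squared residuals
    return np.sum((y - b0 * x**b1) ** 2)

grid0 = np.linspace(0.5, 4.0, 36)
grid1 = np.linspace(0.1, 1.5, 29)
best = min(product(grid0, grid1), key=lambda b: S(*b))
print("grid-search starting value:", best)
```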

3.3 Gradient Methods

The goal is to choose a directional vector $\Delta_t$ to go uphill (for a max) and an appropriate step length $\lambda_t$. Too big a step may overshoot the max and too small a step may be inefficient. (See Figures 5.3 and 5.4, attached.) With this in mind, consider choosing $\Delta_t$ such that the objective function increases (i.e., $G(\theta_{t+1}) > G(\theta_t)$). The relevant derivative is

$$\frac{dG(\theta_t + \lambda_t \Delta_t)}{d\lambda_t} = g_t' \Delta_t$$
where $g_t = dG(\theta_{t+1})/d\theta_{t+1}$. If we let $\Delta_t = W_t g_t$, where $W_t$ is a positive definite matrix, then we know that

$$\frac{dG(\theta_t + \lambda_t \Delta_t)}{d\lambda_t} = g_t' W_t g_t \geq 0.$$

As a result, almost all algorithms take the general form

$$\theta_{t+1} = \theta_t + \lambda_t W_t g_t$$

where $\lambda_t$ is the step length, $W_t$ is a weighting matrix, and $g_t$ is the gradient. The Gauss-Newton algorithm above could be written in this general form. Here are examples of some other algorithms.

1. Steepest Ascent.

   - $W_t = I$, so $\Delta_t = g_t$.

   - An optimal line search produces $\lambda_t = -g'g/(g'Hg)$.

   - Therefore, the algorithm is $\theta_{t+1} = \theta_t - [g_t'g_t/(g_t'H_t g_t)]\,g_t$.

   - This method has the drawbacks that (a) it can be slow to converge, especially on long narrow ridges, and (b) $H$ can be difficult to calculate.

2. Newton's Method (aka Newton-Raphson).

   - Newton's method can be motivated by taking a Taylor series approximation (around $\theta_0$) of the gradient and setting it equal to zero. This gives $g(\theta_t) \approx g(\theta_0) + H(\theta_0)[\theta_t - \theta_0] = 0$. Rearranging produces $\theta_t = \theta_0 - H^{-1}(\theta_0)\,g(\theta_0)$.

   - Therefore, $W_t = -H^{-1}$ and $\lambda_t = 1$.

   - Very popular and works well in many settings.

   - The Hessian can be difficult to calculate, or it may be positive definite (rather than negative definite, as needed for a maximum) far from the optimum.

   - Newton's method will reach the optimum in one step if $G(\theta)$ is quadratic. (A code sketch of Newton's method follows this list.)
3. Quadratic Hill Climbing.

1
Wt = (H( t ) I) , where > 0 is chosen to ensure that Wt is positive de…nite.

4. Davidon-Fletcher-Powell (DFP).

   - $W_{t+1} = W_t + E_t$, where $E_t$ is a positive definite matrix,

$$E_t = \frac{(\theta_t - \theta_{t-1})(\theta_t - \theta_{t-1})'}{(\theta_t - \theta_{t-1})'(g_t - g_{t-1})} + \frac{W_t (g_t - g_{t-1})(g_t - g_{t-1})' W_t}{(g_t - g_{t-1})' W_t (g_t - g_{t-1})}.$$

   - Notice that no second derivatives (i.e., $H(\theta_t)$) are required.
   - Choose $W_0 = I$.

5. Method of Scoring.

   - $W_t = \left[-E\!\left(\dfrac{\partial^2 \ln L}{\partial\theta\,\partial\theta'}\right)\right]^{-1}$.

6. BHHH or Outer Product of the Gradients.

   - $W_t = \left[\sum_{i=1}^n g_i(\theta_t)\,g_i(\theta_t)'\right]^{-1}$, where $g_i$ is the gradient of observation $i$'s log-likelihood contribution, is an estimate of $-H^{-1}(\theta_t) = -\left(\dfrac{\partial^2 \ln L}{\partial\theta\,\partial\theta'}\right)^{-1}$.

   - $W_t$ is always positive definite.

   - Only requires first derivatives.
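Here is a sketch of Newton's method written in the general $\theta_{t+1} = \theta_t + \lambda_t W_t g_t$ form with $W_t = -H^{-1}$ and $\lambda_t = 1$. The concave quadratic objective is an illustrative assumption, chosen so the one-step convergence property mentioned above is visible.

```python
import numpy as np

# Newton's method as theta_{t+1} = theta_t + W_t g_t with W_t = -H^{-1}.
def grad(th):
    # gradient of G(th) = -(th1 - 1)^2 - 2*(th2 + 0.5)^2 (illustrative objective)
    return np.array([-2.0 * (th[0] - 1.0), -4.0 * (th[1] + 0.5)])

def hess(th):
    return np.array([[-2.0, 0.0], [0.0, -4.0]])

theta = np.array([5.0, 5.0])
for _ in range(50):
    g, H = grad(theta), hess(theta)
    step = -np.linalg.solve(H, g)          # W_t g_t with W_t = -H^{-1}
    theta = theta + step                    # lambda_t = 1
    if np.max(np.abs(step)) < 1e-10:
        break
print("maximizer:", theta)                  # one step suffices: G is quadratic
```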

Notes.

1. Nonlinear optimization with constraints. There are several options, such as forming a Lagrangian function, substituting the constraint directly into the objective, or imposing arbitrary penalties in the objective function.

2. Assessing convergence.

   - The usual choice of convergence criterion is the change in $G$ or in $\theta$.

   - Sometimes these methods can be sensitive to the scaling of the function.

   - Belsley suggests using $g' H^{-1} g$ as the criterion, which removes the units of measurement.

3. The biggest problem in nonlinear optimization is making sure the solution is a global, as opposed to local, optimum. The above methods work well for globally concave (convex) functions.
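One possible scale-free convergence check in the spirit of the criterion mentioned above is sketched here; using $-H$ (so the quadratic form is positive for a maximization problem) and the tolerance value are assumptions of this sketch.

```python
import numpy as np

# Scale-free convergence check based on a g' H^{-1} g style criterion.
def converged(g, H, tol=1e-8):
    """g: gradient vector, H: Hessian at the current iterate (maximization)."""
    return float(g @ np.linalg.solve(-H, g)) < tol
```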

3.4 Examples of Newton’s Method

Here are two numerical examples of Newton’s method.

1. Example #1. A sample of data ($n = 20$) was generated from the intrinsically nonlinear regression model

$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_2^2 x_{3t} + \varepsilon_t,$$

   where $\beta_1 = \beta_2 = 1$. The objective is to minimize the function

$$G(\beta_1, \beta_2) = (y - h(\beta_1, \beta_2))'(y - h(\beta_1, \beta_2))$$

   where $h(\beta_1, \beta_2) = \beta_1 + \beta_2 x_{2t} + \beta_2^2 x_{3t}$. See Figure B.2 and Table B.3 (attached) to see how Newton's method performs for three different initial values.

2. Example #2. The objective is to minimize

$$G(\theta) = \theta^3 - 3\theta^2 + 5.$$

   The gradient and Hessian are given by

$$g(\theta) = 3\theta^2 - 6\theta = 3\theta(\theta - 2), \qquad H(\theta) = 6\theta - 6 = 6(\theta - 1).$$

   Substituting these into Newton's algorithm gives

$$\theta_{t+1} = \theta_t - \frac{3\theta_t(\theta_t - 2)}{6(\theta_t - 1)} = \theta_t - \frac{\theta_t(\theta_t - 2)}{2(\theta_t - 1)}.$$

   Now consider two different starting values, $\theta_0 = 1.5$ and $\theta_0 = 0.5$.

   Starting value $\theta_0 = 1.5$:
   $\theta_1 = 1.5 - \frac{(1.5)(-0.5)}{2(0.5)} = 2.25$;
   $\theta_2 = 2.25 - \frac{(2.25)(0.25)}{2(1.25)} = 2.025$;
   $\theta_3 = 2.025 - \frac{(2.025)(0.025)}{2(1.025)} = 2.0003$.

   Starting value $\theta_0 = 0.5$:
   $\theta_1 = 0.5 - \frac{(0.5)(-1.5)}{2(-0.5)} = -0.25$;
   $\theta_2 = -0.25 - \frac{(-0.25)(-2.25)}{2(-1.25)} = -0.025$;
   $\theta_3 = -0.025 - \frac{(-0.025)(-2.025)}{2(-1.025)} = -0.0003$.

This example highlights the fact that, at least for objective functions that are not globally concave (or convex), the choice of starting values is an important aspect of nonlinear optimization. Here the first starting value converges to the local minimum at $\theta = 2$, while the second converges to the local maximum at $\theta = 0$. (A code sketch reproducing these iterations follows.)
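The iterations above can be reproduced with a few lines of code, using the Newton update derived for this example.

```python
# Newton iterations for Example #2: minimize G(theta) = theta^3 - 3*theta^2 + 5.
def newton_path(theta0, steps=3):
    path = [theta0]
    theta = theta0
    for _ in range(steps):
        theta = theta - theta * (theta - 2) / (2 * (theta - 1))  # Newton update
        path.append(theta)
    return path

print(newton_path(1.5))   # approaches 2 (the local minimum)
print(newton_path(0.5))   # approaches 0 (the local maximum)
```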

4 MATLAB Example

In this example, we are going to estimate the parameters of an intrinsically nonlinear Cobb-Douglas production function,

$$Q_t = \beta_1 L_t^{\beta_2} K_t^{\beta_3} + \varepsilon_t,$$

using Gauss-Newton and Newton's method, as well as test for constant returns to scale (see MATLAB example 14).

For Gauss-Newton, the relevant gradient (pseudoregressor) vector is

$$g(X, \beta) = \left\{ L_t^{\beta_2} K_t^{\beta_3},\ \ \ln(L_t)\,\beta_1 L_t^{\beta_2} K_t^{\beta_3},\ \ \ln(K_t)\,\beta_1 L_t^{\beta_2} K_t^{\beta_3} \right\}.$$

For Newton's method, the relevant gradient and Hessian matrices are

$$g(b) = -2\sum_{i=1}^n e_i \left(\frac{\partial h(x_i, b)}{\partial b}\right), \qquad H(b) = \frac{\partial g(b)}{\partial b'}.$$
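The following Python sketch mirrors the Gauss-Newton estimation and the constant-returns-to-scale Wald check; it is not the course's MATLAB example 14, and the simulated data, starting values, and tolerance are assumptions made for illustration.

```python
import numpy as np

# Gauss-Newton sketch for Q = b1 * L**b2 * K**b3 + e with simulated data.
rng = np.random.default_rng(4)
T = 100
L, K = rng.uniform(1, 10, T), rng.uniform(1, 10, T)
Q = 1.2 * L**0.55 * K**0.45 + rng.normal(0, 0.3, T)

def h(b):
    return b[0] * L**b[1] * K**b[2]

def g(b):                                   # pseudoregressors g(X, b)
    f = L**b[1] * K**b[2]
    return np.column_stack([f, np.log(L) * b[0] * f, np.log(K) * b[0] * f])

b = np.array([1.0, 0.5, 0.5])
for _ in range(200):
    X0, e0 = g(b), Q - h(b)
    step = np.linalg.solve(X0.T @ X0, X0.T @ e0)
    b = b + step
    if np.max(np.abs(step)) < 1e-9:
        break

s2 = np.sum((Q - h(b))**2) / (T - 3)
V = s2 * np.linalg.inv(g(b).T @ g(b))
# Wald check of constant returns to scale, R(b) = b2 + b3 - 1 = 0
C = np.array([[0.0, 1.0, 1.0]])
W = float((b[1] + b[2] - 1.0)**2 / (C @ V @ C.T))
print("estimates:", b, " Wald CRS statistic:", W)
```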
