Preface to Second Edition
The topical content of the first Theory and Practice of Econometrics and the
pace of developments in econometrics over the last four years have made it
necessary to rewrite and restructure the first edition completely. We have
added chapters concerned with asymptotic distribution theory, Bayesian infer-
ence, time series, and simultaneous equation statistical models. The distributed
lag chapters have been rewritten within a time series context, and the chapters
on nonlinear estimation and on estimation and hypothesis testing involving
nonsample information have been extensively rewritten to take account of
recent research results. In the other chapters, some results that are mainly of
historical interest have been deleted and relevant new research results have
been included.
Given the state of computer development and availability, we have deleted
the Monte Carlo sample data contained in the first volume. Those who do not
have computer facilities to develop Monte Carlo sample data for the various
statistical models considered in this edition are referred to the instructor's
manual for the book Introduction to the Theory and Practice of Econometrics,
published by Wiley in 1982.
The research and instructional objectives and uses of the book are the same
as those enumerated in the first edition. Chapters 2 through 7 are concerned
with the theoretical basis for statistical inference in economics. The remainder
of the book is concerned with using these tools to cope with the stochastic,
dynamic, and simultaneous characteristics of economic data and determining
the statistical consequences of changing (1) the statistical models, (2) the
amount and type of information used, and (3) the measure of performance
employed. For introductory material relating to the various chapters contained
herein, we refer the interested reader to our Introduction to the Theory and
Practice of Econometrics and the first edition of this book.
The criticisms and suggestions we received from colleagues and students
who used the original book and/or read early versions of the new and revised
chapters were very helpful in eliminating some errors and in determining the
contents and organization of this volume. We cannot here acknowledge each
individual contribution, but we do indicate our debt to these readers and to each
we offer our sincere thanks.
For a book of this nature a skilled technical typist who has patience and who
can type rapidly while emphasizing accuracy and consistency is a necessity.
Dixie Trinkle not only possesses these talents, but her interest in the book and
willingness to do repeated versions of each chapter contributed much to its
quality and style. Partial support for this work was provided by a National
Science Foundation grant.
George G. Judge
William E. Griffiths
R. Carter Hill
Helmut Lütkepohl
Tsoung-Chao Lee
April 1984
Preface to First Edition
This book is concerned with the problems facing those who wish to learn from
finite samples of economic data. Over the last half century, econometricians
have, in conjunction with theoretical and applied statisticians, developed a
varied and sometimes curious set of methods and procedures that serve as a
basis for this learning process. In the theory and practice of econometrics, the
economic and statistical models, the method of estimation and inference, and
the data are all interdependent links in the search and discovery process.
Therefore, economists who apply quantitative methods face many choices as
they attempt to give empirical content to their theoretical constructs. Many of
these alternatives are interrelated, and the choices actually taken condition the
inferential reach of the results and their usefulness in a decision context.
As an information base to aid in the choice process, two types of economet-
rics books appear in the literature. Texts on econometric theory examine an
array of alternative statistical models and, under prescribed sets of conditions,
evaluate the sampling properties of a number of estimators and test statistics.
Alternatively, applied econometric texts specify and statistically analyze a
range of economic relationships such as supply, investment, consumption, and
production functions and give examples of how the empirical results may be
used in practice. Both types of books provide valuable information for the
student and the researcher. In many cases, however, traditional books fail to
identify a range of alternatives by being either too model- or too case-specific or
fail to aid the student or researcher in coping with a particular sequence of
questions or choices. Consequently, a basis may not be provided for choosing
the statistical specification and estimation procedure that is most appropriate
for the problem at hand. Within this context, this book is directed to specifying
an up-to-date set of procedures that researchers may use to identify and miti-
gate the impact of econometric problems arising from the use of passively
generated data and incomplete information concerning the appropriate eco-
nomic and statistical model.
Our objective of improving the efficiency of the learning and quantitative
research choice process is addressed in the following way:
The book can serve as (1) a text for theoretical and applied courses in econo-
metrics, (2) a handbook for those who are in one way or another searching for
quantitative economic knowledge, and (3) a reference book for undergraduate
George G. Judge
William E. Griffiths
R. Carter Hill
Tsoung-Chao Lee
April 1979
Contents
Chapter 1 Introduction
1.1. The Search Process
1.2. The Nonexperimental Model-building Restriction
1.3. Objectives of the Book
1.4. Organization of the Book
1.5. Econometric Thought
1.6. References
Chapter 12 Disturbance-related Sets of Regression Equations
12.1. Sets of Equations with Contemporaneously Correlated Disturbances
12.1.1. Estimation
12.1.1a. Stein-Like Estimators
12.1.2. Hypothesis Testing
12.1.2a. Linear Restrictions on the Coefficient Vector
12.1.2b. Testing for a Diagonal Covariance Matrix
12.1.3. Goodness of Fit Measures
12.1.4. Bayesian Estimation
12.1.5. Extensions of the Model
12.2. Sets of Equations with Unequal Numbers of Observations
12.3. Sets of Equations with First-order Autoregressive Disturbances
12.3.1. Estimation when R Is Diagonal
12.3.2. Estimation with Nondiagonal R
12.3.3. Finite Sample Properties
12.3.4. Hypothesis Testing
12.3.5. Extensions of the AR(1) Error Model
12.4. Recommendations
12.5. Estimation with a Singular Disturbance Covariance Matrix
12.5.1. Some Illustrations
12.5.2. The Restricted Generalized Inverse Estimator
12.5.3. Concluding Remarks
Introduction
Given a basis for organizing knowledge, and given the usefulness or indeed
the necessity of quantitative economic knowledge, the next important question
involves how to go about searching for it. If all economists had IQs that were
300 standard deviations above the mean, this question might be of limited
importance. However, many of us do not fit into this category, and the question
of how to get on the efficient search turnpike is a real one.
[Flow diagram: real-world phenomena are carried by a process of abstraction into an axiom or postulate system; the rules of logic yield conclusions (theorems and lemmas), which a process of interpretation or realization converts into real-world conclusions; an experiment yields observations whose statistical interpretation leads to conclusions; each stage is subject to evaluation.]
Economic theory, through a formal deductive system, provides the basis for
experimental abstraction and the experimental design, but society in most
cases carries out the experiment, possibly using its own design. Therefore, the
economic researcher observes the outcome of society's experiment or perfor-
mance but has little or no impact on the experimental design and the observa-
tions generated. This means that the economist is often asked to estimate the
impact of a change in the mechanism that produces the data when there is
limited or no possibility of producing the data beforehand in a laboratory exper-
4 Introduction
iment. Thus, by the passive nature of the data, economic researchers are, to a
large extent, restricted in their knowledge search to the process of nonexperi-
mental model building. As shown in the preceding flow diagram, we start with
the observations and the experimental design, but the experiment is outside of
the researcher's control.
Unfortunately, postulation typically provides many admissible economic and
thus statistical models that do not contradict our perceived knowledge of hu-
man behavior or the economic and institutional processes through which these
data are generated. Therefore, in most econometric work, there is uncertainty
as to the economic and statistical model that is used for estimation and infer-
ence purposes and the sampling model that actually was the basis for the data
generation.
When econometric models are correctly specified, statistical theory provides
well-defined procedures for obtaining point and interval estimates and evaluat-
ing the performance of various linear and usually unbiased estimators. How-
ever, uncertainties usually exist about the correct underlying economic and
statistical models, such as the variables that should appear in the design matrix,
the algebraic form of the relation or the type of nonlinearities that should be
considered, the dynamic or lag structure between the economic variables, and
the stochastic assumptions underlying the statistical model. In many cases,
these uncertainties lead to models that are incorrectly specified and thus invali-
date the sampling results that are traditionally claimed and generate estimators
whose sampling performance is unknown. Consequently, to a large extent,
econometrics is concerned with efficient procedures for identifying or mitigat-
ing the impacts of model misspecification.
The nonexperimental restriction also means that in addition to being efficient
in using the data that are available, one must, in many cases, use nonsample
information in order to adequately support the parameter space. Thus, much of
econometrics is concerned with how to sort out and make use of this nonsam-
ple information in conjunction with the sample observations. In fact, much of
the success in handling many of the problems in this book springs from how
effectively this prior information is used. Therefore, in econometric work, as
one goes from the conceptual model to the observed data and from prior infor-
mation to the estimated relations, many interrelated questions face the re-
searcher.
Given the plethora of admissible economic and statistical models and the non-
experimental or passively generated nature of economic data, the general ob-
jective of this book is to provide the student and the researcher with the
statistical consequences of a range of models and rules and to present an array
of estimating procedures that are appropriate and effective in providing infor-
mation that may be "useful" for a range of economic decision problems and
data sets. Emphasis is directed to understanding the statistical consequences
Introduction 5
involved in the search for quantitative knowledge and not to mechanics and
technicalities.
In particular, the purposes of this book are as follows.
1.6 REFERENCES
SAMPLING THEORY AND BAYESIAN APPROACHES TO INFERENCE
In the following three chapters, we review the sampling theory and Bayesian
approaches to inference when the underlying sampling process can be de-
scribed by a variant of the traditional linear statistical model. Basic concepts
and definitions underlying statistical decision theory are presented and the
statistical implications of (1) using conventional hypothesis testing (estimation)
procedures, (2) combining sample and nonsample information, and (3) the non-
traditional family of Stein-rule estimators are explored and their sampling per-
formances evaluated under a squared error loss measure. Finally, the concepts
underlying the Bayesian approach to inference are introduced and applied in
the linear statistical model case. Some matrix and distribution theorems that
are useful in formulating and evaluating the statistical models in Part 1 are
presented in Appendix A at the end of the book.
Chapter 2
The Classical Inference Approach for the General Linear Model
In many cases we have limited knowledge about the relationships among eco-
nomic variables. Facing this situation, one way to gain knowledge is to experi-
ment. If we visualize the observed values of economic variables as the out-
comes of experiments, then we must have a theoretical framework on which to
base the experimental design or to act as a basis for modeling the data genera-
tion or the sampling process. Within this context consider the following linear
statistical model

$$y = X\beta + e \qquad (2.1.1)$$
where y is a (T x 1) vector of observations, X is a (T x K) matrix, β is a (K x 1)
vector of unknown parameters, and e is a real random vector with mean E[e] = 0 and covari-
ance matrix E[ee'] = σ²I_T, where the scalar σ² is unknown and I_T is a Tth-
order identity matrix. We denote this specification for the real random vector e
as e ~ (0, σ²I_T).
The (T x K) design matrix X is assumed to be composed of K fixed (explana-
tory) variables. In an experimental context these variables would be considered
treatment or control variables and their design would be specified by the re-
searcher. In economics the experiment is usually designed and carried out by
society, and the researcher is a passive observer in the data generation process.
The assumption that X is a matrix of fixed variables means that for all the
possible situations that might take place under repeated sampling, the matrix X
would take on the same values. However, the observable vector y would vary
from sample to sample. In reality, usually only one sample vector y is actually
observed. In this situation, if both X and y are thought of as random variables,
then the repetition of X implies that it has a singular distribution. Alternatively,
y may be thought of as a random vector with a distribution that is conditional on
the matrix of fixed variables X.
Other than the functional form specified in (2.1.1) it is assumed that only
sample information is used and that no other information is either available or
introduced about the unknown parameter vector β and the unknown scalar σ².
Furthermore, the design matrix is assumed to be correctly specified, and the
outcome vector y and the treatment variables in the design matrix X are as-
sumed to be observable and measured without error. In the sections to come
we will consider more general data generation schemes and thus change the
stochastic assumptions that are made relative to the distribution of the real
random vector e.
For the statistical model (2.1.1), the observed values of the random variables y
contain all of the information available about the unknown β vector and the
unknown scalar σ². Consequently, the point estimation problem is one of find-
ing a suitable function of the observed values of the random variables y, given
the design matrix X, that will yield, in a repeated sample sense, the "best"
estimator of the unknown parameters. If, in estimating the unknown coefficient
vector β, we restrict ourselves to the class of rules that are linear functions of y,
we may write the general linear estimator or rule as

$$b_0 = Ay \qquad (2.1.2)$$
where A is some (K x T) matrix. If we choose the rule that
makes the sum of the squared errors of the disturbance vector e a minimum, we
would then minimize the quadratic form

$$S = (y - X\beta)'(y - X\beta) \qquad (2.1.3)$$

Setting the derivative of S with respect to β, evaluated at b, equal to zero gives

$$\frac{1}{2}\frac{\partial S}{\partial \beta}\bigg|_{b} = X'Xb - X'y = 0 \qquad (2.1.4)$$

This linear rule (a linear function of y) is appropriately known as the least squares
estimator and means that one candidate for A is A = (X'X)⁻¹X'.
Given this covariance result, the Gauss-Markov theorem provides proof that
out of the class of linear unbiased rules for the statistical model (2.1.1) the least
squares estimator is best, where best is defined in terms of minimum variance.
That is, any arbitrary linear unbiased estimator b₀ will have a covariance matrix
σ²(X'X)⁻¹ + A, where A is a positive semidefinite matrix. Therefore, among rules
of the linear form (2.1.2), the least squares estimator is equal to or better in terms
of sampling precision than all others in its class. This is a beautiful result, which
does much to explain the popularity of the least squares rule.
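To make the least squares rule concrete, the following short numpy sketch (our illustration, not part of the original text; the function name, data, and parameter values are hypothetical) computes b by solving the normal equations (2.1.4).

```python
import numpy as np

def least_squares(X, y):
    """Least squares rule b = (X'X)^{-1} X'y, computed by solving
    the normal equations X'X b = X'y rather than inverting X'X."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical usage: a small design matrix with an intercept column
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(25), rng.uniform(0, 2, (25, 2))])
beta = np.array([10.0, 0.4, 0.6])          # illustrative "true" parameters
y = X @ beta + rng.normal(0, 0.25, 25)
print(least_squares(X, y))                 # close to beta
```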
2.1.1b An Estimator of σ²
Having obtained a point estimator for the unknown coefficient vector β, we
now turn to a basis for estimating σ², the variance of the elements of the
random vectors y and e. The quadratic form (2.1.3) that was used in obtaining a
least squares estimator of β does not contain σ², and therefore does not
provide a basis for its estimation.
From our stochastic assumptions regarding the error vector e we know that
E[e'e] = Tσ². Thus it would seem that if we had a sample of observations for
the random error vector y - Xβ = e, we could use this sample as a basis for
estimating σ². Unfortunately, since β is unknown and unobservable, the ran-
dom vector e is unobservable. If we are to estimate σ² on the basis of informa-
tion that we have, this will involve the observed random y vector and its least
squares counterpart ŷ = Xb. Thus the vector of least squares residuals ê = y -
Xb provides a least squares sample analogue of the vector of unobservable
errors e. Since
$$\hat{y} = Xb = X(X'X)^{-1}X'y \qquad (2.1.8)$$

then

$$\hat{e} = y - \hat{y} = [I_T - X(X'X)^{-1}X']y = My$$

and

$$MM' = MM = M^2 = M = [I_T - X(X'X)^{-1}X'][I_T - X(X'X)^{-1}X'] = [I_T - X(X'X)^{-1}X']$$
and thus M is an idempotent matrix. If we use the quadratic form ê'ê instead of e'e
as a basis for estimating the scalar σ², then its expectation can be written as

$$E[\hat{e}'\hat{e}] = E[e'Me] = E[\operatorname{tr}(e'Me)] = E[\operatorname{tr}(Mee')]
= \operatorname{tr}\{M E[ee']\} = \sigma^2 \operatorname{tr}[I_T - X(X'X)^{-1}X']
= \sigma^2[\operatorname{tr}(I_T) - \operatorname{tr}(X(X'X)^{-1}X')] = \sigma^2[\operatorname{tr}(I_T) - \operatorname{tr}(I_K)]
= \sigma^2(T - K) \qquad (2.1.12)$$

Consequently,

$$E\left[\frac{\hat{e}'\hat{e}}{T - K}\right] = \left(\frac{1}{T - K}\right)\sigma^2(T - K) = \sigma^2 \qquad (2.1.13)$$
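The unbiasedness result (2.1.13) is easy to verify by simulation. The following sketch is our addition; the sample size, design, true parameters, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K, sigma2 = 20, 3, 0.0625
X = np.column_stack([np.ones(T), rng.uniform(0.5, 2.5, (T, 2))])
beta = np.array([10.0, 0.4, 0.6])            # arbitrary illustrative values

draws = np.empty(5000)
for i in range(draws.size):
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), T)
    b = np.linalg.solve(X.T @ X, X.T @ y)    # least squares estimate
    e_hat = y - X @ b                        # least squares residuals
    draws[i] = e_hat @ e_hat / (T - K)       # unbiased variance estimator

print(draws.mean())   # approximately sigma2 = 0.0625, as (2.1.13) asserts
```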
Once a distributional assumption is made about the random vector e, the maxi-
mum likelihood principle provides another basis for estimating the unknown
parameters β and σ². Under the maximum likelihood principle the observations
y are the only elements in the sample space that are relevant to problems of
decision or inference concerning the unknown parameters. Thus, in making
decisions about the unknown parameters after y is observed, all relevant sam-
ple information is contained in the likelihood function. Under the maximum
likelihood criterion, the parameter estimates are chosen so as to maximize the
likelihood function, which for normally distributed errors yields

$$\tilde{\beta} = (X'X)^{-1}X'y \qquad (2.1.16)$$

and

$$\tilde{\sigma}^2 = \frac{(y - X\tilde{\beta})'(y - X\tilde{\beta})}{T} \qquad (2.1.17)$$
and thus σ̃² is biased in finite samples. However, the estimator [T/(T - K)]σ̃² = σ̂²
is, as we know, an unbiased estimator. Since σ̃²[T/(T - K)] is a quadratic form in
terms of the normal random vector e (see Section 7.1.3c in Judge et al., 1982, or
Appendix A), the quadratic form

$$\frac{e'(I_T - X(X'X)^{-1}X')e}{\sigma^2} = (T - K)\frac{\hat{\sigma}^2}{\sigma^2} \qquad (2.1.19)$$

has a χ²₍T₋K₎ distribution, and therefore

$$E[\hat{\sigma}^2] = \frac{\sigma^2}{T - K}E[\chi^2_{(T-K)}] = \sigma^2 \qquad (2.1.20)$$
and

$$\operatorname{var}(\hat{\sigma}^2) = \frac{2\sigma^4}{T - K} \qquad (2.1.21)$$

Likewise,

$$E[\tilde{\sigma}^2] = \frac{T - K}{T}\sigma^2 \qquad (2.1.22a)$$

and

$$\operatorname{var}(\tilde{\sigma}^2) = \frac{2(T - K)\sigma^4}{T^2} \qquad (2.1.22b)$$
Thus σ̃² has a bias -(K/T)σ² and a variance which, in finite samples, is smaller
than the variance of the unbiased estimator σ̂². If one visualizes a trade-off
between bias and precision, interest might focus on a minimum mean squared
error (MSE) estimator of σ², that is, the estimator

$$\bar{\sigma}^2 = \frac{\hat{e}'\hat{e}}{T - K + 2} \qquad (2.1.23a)$$

with mean

$$E[\bar{\sigma}^2] = \frac{T - K}{T - K + 2}\sigma^2 \qquad (2.1.23b)$$

and mean squared error

$$E[(\bar{\sigma}^2 - \sigma^2)^2] = \frac{2\sigma^4}{T - K + 2} \qquad (2.1.23c)$$

The mean squared error of σ̄² is always less than the variance (MSE) of σ̂².
Finally, the optimum divisor of the quadratic form to get an unbiased estimator
is (T - K), and in this class of estimators σ̂² is the minimum variance unbiased
estimator of σ².
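The bias-precision trade-off among the divisors T, T - K, and T - K + 2 can likewise be illustrated by simulation. In this sketch (our addition, with arbitrary illustrative settings) the divisor T - K gives an unbiased estimator while T - K + 2 attains the smallest mean squared error.

```python
import numpy as np

rng = np.random.default_rng(2)
T, K, sigma2 = 20, 3, 1.0
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
beta = np.zeros(K)

ssr = np.empty(20000)
for i in range(ssr.size):
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), T)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e_hat = y - X @ b
    ssr[i] = e_hat @ e_hat                  # quadratic form e-hat'e-hat

for divisor, label in [(T, "ML (T)"), (T - K, "unbiased (T-K)"),
                       (T - K + 2, "min MSE (T-K+2)")]:
    est = ssr / divisor
    print(f"{label:16s} bias {est.mean() - sigma2:+.4f} "
          f"MSE {((est - sigma2) ** 2).mean():.4f}")
```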
The Cramér-Rao inequality provides a lower bound for the covariance matrix of unbiased estimators. With θ = (β', σ²)', the information matrix is

$$I(\theta) = -E\left[\frac{\partial^2 \ln \ell(\theta|y,X)}{\partial\theta\,\partial\theta'}\right] \qquad (2.1.24)$$

For the normal linear statistical model this becomes

$$I(\theta) = -E\begin{bmatrix} -\dfrac{1}{\sigma^2}X'X & -\dfrac{1}{\sigma^4}(X'y - X'X\beta) \\ -\dfrac{1}{\sigma^4}(X'y - X'X\beta)' & \dfrac{T}{2\sigma^4} - \dfrac{1}{\sigma^6}(y - X\beta)'(y - X\beta) \end{bmatrix} = \begin{bmatrix} \dfrac{1}{\sigma^2}X'X & 0 \\ 0 & \dfrac{T}{2\sigma^4} \end{bmatrix} \qquad (2.1.25)$$

and

$$I(\theta)^{-1} = \begin{bmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0 & \dfrac{2\sigma^4}{T} \end{bmatrix} \qquad (2.1.26)$$
Comparing (2.1.26) with the sampling results obtained earlier,

$$\operatorname{cov}(b) = \sigma^2(X'X)^{-1} \quad\text{and}\quad \operatorname{var}(\hat{\sigma}^2) = \frac{2\sigma^4}{T - K} > \frac{2\sigma^4}{T} \qquad (2.1.27)$$

and this means that the Cramér-Rao lower bound is attained by the covariance
of β̃ = b but not for the unbiased estimator σ̂². This gives us a stronger result
than the Gauss-Markov theorem because it means that the maximum likelihood
estimator of β is best in a larger class of estimators that does not include the
linearity restriction. Also, although the variance of σ̂² does not attain the lower
bound, an unbiased estimator of σ² with variance lower than 2σ⁴/(T - K) does
not exist.
Since b is a linear function of the normal random vector y,

$$b = \beta + (X'X)^{-1}X'e = \beta + v, \qquad v \sim N(0, \sigma^2(X'X)^{-1}) \qquad (2.1.28)$$

so that

$$b \sim N(\beta, \sigma^2(X'X)^{-1}) \qquad (2.1.29)$$

For a known (J x K) matrix R of rank J,

$$Rb = R\beta + Rv = R\beta + v_1 \qquad (2.1.30)$$

with

$$v_1 \sim N(0, \sigma^2 R(X'X)^{-1}R') \qquad (2.1.31)$$
Consequently,

$$v_1'[\sigma^2 R(X'X)^{-1}R']^{-1}v_1 = (Rb - R\beta)'[R(X'X)^{-1}R']^{-1}(Rb - R\beta)/\sigma^2 \qquad (2.1.32a)$$

Therefore

$$(b - \beta)'R'[R(X'X)^{-1}R']^{-1}R(b - \beta)/\sigma^2
= \frac{e'X(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}X'e}{\sigma^2} \sim \chi^2_{(J)} \qquad (2.1.32b)$$

In addition,

$$\frac{(T - K)\hat{\sigma}^2}{\sigma^2} = \frac{\hat{e}'\hat{e}}{\sigma^2} \sim \chi^2_{(T-K)} \qquad (2.1.33)$$

This result in conjunction with (2.1.32b), and the fact that these two quadratic
forms are independent, can be used to construct t and F statistics for interval
estimation and hypothesis testing.
In particular, for a single restriction (J = 1),

$$t = \frac{R(b - \beta)}{\hat{\sigma}[R(X'X)^{-1}R']^{1/2}} \sim t_{(T-K)} \qquad (2.1.34)$$

so that

$$\Pr\left\{-t_{[(T-K),\alpha/2]} \le \frac{R(b - \beta)}{\hat{\sigma}[R(X'X)^{-1}R']^{1/2}} \le t_{[(T-K),\alpha/2]}\right\} = 1 - \alpha \qquad (2.1.35a)$$

and an interval estimator for Rβ is

$$Rb \pm t_{[(T-K),\alpha/2]}\,\hat{\sigma}[R(X'X)^{-1}R']^{1/2} \qquad (2.1.35b)$$

For J restrictions,

$$\frac{(b - \beta)'R'[R(X'X)^{-1}R']^{-1}R(b - \beta)}{J\hat{\sigma}^2} \sim F_{(J,T-K)} \qquad (2.1.36)$$
Similarly, from (2.1.33),

$$\Pr\left\{\chi^2_{(T-K,\alpha/2)} \le \frac{(T - K)\hat{\sigma}^2}{\sigma^2} \le \chi^2_{(T-K,1-\alpha/2)}\right\} = 1 - \alpha \qquad (2.1.37a)$$

or

$$\frac{(T - K)\hat{\sigma}^2}{\chi^2_{(T-K,1-\alpha/2)}} \le \sigma^2 \le \frac{(T - K)\hat{\sigma}^2}{\chi^2_{(T-K,\alpha/2)}} \qquad (2.1.37c)$$

and in a repeated sampling sense the set of upper and lower bounds computed
from the samples will contain σ² 100(1 - α) percent of the time.
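As an illustration of the interval estimators (2.1.35) and (2.1.37), the following sketch (our addition; the function name is hypothetical, and R is taken to be a single restriction vector) uses scipy for the t and chi-square critical values.

```python
import numpy as np
from scipy import stats

def interval_estimates(X, y, R, alpha=0.05):
    """(1 - alpha) intervals for a single linear combination R'beta,
    via (2.1.35), and for sigma^2, via (2.1.37)."""
    T, K = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e_hat = y - X @ b
    s2 = e_hat @ e_hat / (T - K)                    # unbiased sigma-hat^2
    se = np.sqrt(s2 * (R @ np.linalg.solve(X.T @ X, R)))
    t_c = stats.t.ppf(1 - alpha / 2, T - K)
    beta_interval = (R @ b - t_c * se, R @ b + t_c * se)
    chi_lo = stats.chi2.ppf(alpha / 2, T - K)
    chi_hi = stats.chi2.ppf(1 - alpha / 2, T - K)
    sigma2_interval = ((T - K) * s2 / chi_hi, (T - K) * s2 / chi_lo)
    return beta_interval, sigma2_interval
```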
To construct tests of the general linear hypothesis H₀: Rβ = r, consider again the statistical model

$$y = X\beta + e$$

where

$$e \sim N(0, \sigma^2 I_T) \qquad (2.1.38a)$$

and

$$Rb = R\beta + v_1, \qquad R\beta = r, \qquad v_1 \sim N(0, \sigma^2 R(X'X)^{-1}R') \qquad (2.1.38b)$$
Under the hypothesis Rβ = r, the statistic (2.1.34) becomes

$$\gamma_1 = \frac{Rb - R\beta}{\hat{\sigma}(R(X'X)^{-1}R')^{1/2}} = \frac{Rb - r}{\hat{\sigma}(R(X'X)^{-1}R')^{1/2}} \qquad (2.1.39)$$

An equivalent route is the likelihood ratio

$$\lambda = \frac{\max[\ell(\beta,\sigma^2|y,X,R\beta = r)]}{\max[\ell(\beta,\sigma^2|y,X)]} = \frac{\hat{\ell}}{\tilde{\ell}} \qquad (2.1.40)$$

where ℓ̂ and ℓ̃ are the maximum values of the likelihood under H₀ and H₁. This
ratio leads to the quadratic forms

$$\frac{(b - \beta)'R'[R(X'X)^{-1}R']^{-1}R(b - \beta)}{\sigma^2} \qquad (2.1.41)$$

and

$$\frac{(y - Xb)'(y - Xb)}{\sigma^2} = \frac{e'(I - X(X'X)^{-1}X')e}{\sigma^2} = (T - K)\frac{\hat{\sigma}^2}{\sigma^2} \qquad (2.1.42)$$

and in turn to the test statistic

$$\lambda_1 = \frac{(Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r)}{J\hat{\sigma}^2} \qquad (2.1.43)$$

which under H₀ has an F distribution. The hypothesis is
rejected if the value of the test statistic λ₁ is greater than some critical value c,
which is determined for a given significance level α for the central F distribu-
tion with J and (T - K) degrees of freedom.
In general, what we look for in a statistical test is one that has a low probabil-
ity of a Type I error when the null is true, has a low probability of a Type II
error when the null is false (large power), and is robust to misspecifications of
the statistical model.
It is interesting to note that in the context of (2.1.38) we may write the
Lagrangian for the restricted likelihood as ln ℓ(β, σ²|y, X) + 2(r' - β'R')λ, and this
leads to optimum values of the Lagrange multipliers; the associated quadratic
form (2.1.44) is distributed as a χ²₍J₎.
Hypothesis testing frameworks are considered again in Chapters 3 and 5 and
are used in one form or another in most of the chapters to follow. To complete
this discussion let us note that for testing H₀: σ² = σ₀², the ratio (T - K)σ̂²/σ₀²
is distributed as a central χ²₍T₋K₎ when the hypothesis is correct.
In statistical model (2.1.1) it is assumed that the (T x K) design matrix X has
rank K and thus X'X is a (K x K) positive definite symmetric matrix. Conse-
quently, using the matrix theorems discussed in Section 7.1.3c of Judge et al.
(1982) and Appendix A of this book, there exists a positive definite symmetric
matrix S¹ᐟ² such that S⁻¹ᐟ²X'XS⁻¹ᐟ² = I_K, where S¹ᐟ²S¹ᐟ² = S = X'X = C'ΛC, Λ =
CSC', and S⁻¹ᐟ² = C'Λ⁻¹ᐟ²C. Here C is an orthogonal matrix of order (K x K)
and Λ is a diagonal matrix whose diagonal elements are the characteristic
roots of X'X. Therefore, we may rewrite the statistical model (2.1.1) in equiva-
lent form as

$$y = XS^{-1/2}S^{1/2}\beta + e = Z\theta + e \qquad (2.1.46)$$

where θ = S¹ᐟ²β, Z = XS⁻¹ᐟ², and hence Z'Z = I_K. The mean and covariance of
the random vectors e and y remain unchanged. This transformation is known as
a canonical reduction, and in this case the specification has been reparameter-
ized from the β space to the θ space.
Note that if we reparameterize the statistical model as in (2.1.46), the least
squares estimator of θ is

$$\hat{\theta} = (Z'Z)^{-1}Z'y = Z'y \qquad (2.1.48)$$

Equivalently, premultiplying (2.1.46) by Z' gives

$$Z'y = \theta + Z'e$$

or

$$z = \theta + w \qquad (2.1.49)$$
where z has a K-variate normal distribution with mean vector θ and nonsingular
covariance matrix σ²I_K, which we may, without loss of generality, set equal
to I_K if we wish to assume σ² known.
This specification is used from time to time in the chapters ahead, when for
expository purposes it is more convenient to work in the θ rather than the β
space. In some cases this reparameterization will not cause a loss of generality.
Up to this point we have made the simplifying assumption that the elements of
the random vector e were independent and identically distributed random vari-
ables. Although this assumption may describe a sampling process consistent
with much economic data, it is inappropriate for situations involving such
problems as heteroscedasticity, autocorrelation, and equation-error-related sets
of regressions. These and other violations of the scalar identity covariance
assumption are discussed in detail in Chapters 8, 11, 12, and 13 and Chapters
10, 11, 14, 15, and 16 of Judge et al. (1982). In this chapter we consider the case
in which

$$E[ee'] = \sigma^2\Psi$$

where Ψ is a known real positive definite symmetric matrix. Thus we consider
the statistical model

$$y = X\beta + e \qquad (2.2.1)$$
$$Py = PX\beta + Pe \qquad (2.2.2)$$

or

$$y^* = X^*\beta + e^* \qquad (2.2.3)$$

where P is a known (T x T) matrix such that P'P = Ψ⁻¹. Since (2.2.2) is equivalent to (2.1.1), all
the results of previous sections hold, and β̂ = (X'Ψ⁻¹X)⁻¹X'Ψ⁻¹y ~ (β, σ²(X'Ψ⁻¹X)⁻¹) and is, in fact, a
best linear unbiased estimator.
In particular,

$$\hat{\beta} = (X'\Psi^{-1}X)^{-1}X'\Psi^{-1}y \qquad (2.2.6a)$$

and

$$\hat{\sigma}^2 = \frac{(y - X\hat{\beta})'\Psi^{-1}(y - X\hat{\beta})}{T - K} \qquad (2.2.6b)$$

Furthermore,

$$R\hat{\beta} = R\beta + v, \qquad v \sim N(0, \sigma^2 R(X'\Psi^{-1}X)^{-1}R') \qquad (2.2.7)$$

and statistics analogous to (2.1.36) and (2.1.37), given in (2.2.8) and (2.2.9),
may be used as a basis for hypothesis testing and interval estimation.
If we incorrectly specify the sampling process and assume E[ee'] = σ²I_T,
when it is in fact E[ee'] = σ²Ψ ≠ σ²I_T, and use the least squares rule b =
(X'X)⁻¹X'y, we face the following consequences: b remains unbiased, but it is no
longer best linear unbiased, its covariance matrix is σ²(X'X)⁻¹X'ΨX(X'X)⁻¹ rather
than σ²(X'X)⁻¹, and the conventional estimator σ̂²(X'X)⁻¹ gives a biased picture of
its sampling variability.
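Given these consequences, the generalized least squares remedy can be implemented through the transformation (2.2.2). The sketch below is our addition; it assumes Ψ is positive definite and uses a Cholesky factor to construct a matrix P with P'P = Ψ⁻¹.

```python
import numpy as np

def aitken_gls(X, y, Psi):
    """Generalized least squares via the transformation of (2.2.2):
    with Psi = L L' (Cholesky), P = L^{-1} satisfies P'P = Psi^{-1},
    and least squares on (Py, PX) reproduces the Aitken estimator."""
    L = np.linalg.cholesky(Psi)
    X_star = np.linalg.solve(L, X)       # PX
    y_star = np.linalg.solve(L, y)       # Py
    return np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)
```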
The statistical model in which Ψ ≠ I_T can be extended to sets of linear
statistical models that are error related. In this context we may, following the
original idea of Zellner (1962), write a set of M linear statistical models as

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & X_M \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_M \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_M \end{bmatrix} \qquad (2.2.10)$$

or compactly as

$$y = X\beta + e \qquad (2.2.11)$$

with

$$E[ee'] = \begin{bmatrix} \sigma_{11}I_T & \sigma_{12}I_T & \cdots & \sigma_{1M}I_T \\ \sigma_{21}I_T & \sigma_{22}I_T & \cdots & \sigma_{2M}I_T \\ \vdots & & & \vdots \\ \sigma_{M1}I_T & \sigma_{M2}I_T & \cdots & \sigma_{MM}I_T \end{bmatrix} = \Sigma \otimes I_T = \Phi \qquad (2.2.12)$$
Since

$$\Phi^{-1} = \Sigma^{-1} \otimes I_T \qquad (2.2.13)$$

the generalized least squares rule gives

$$\hat{\beta} = [X'(\Sigma^{-1} \otimes I_T)X]^{-1}X'(\Sigma^{-1} \otimes I_T)y \qquad (2.2.14)$$

which is the Aitken or generalized least squares estimator, or for normal errors
the maximum likelihood estimator, for the set of linear statistical models. When
the design matrices are identical, that is, X₁ = X₂ = ⋯ = X_M, or when the
covariance matrix (2.2.12) is diagonal, the Aitken estimator applied to (2.2.11)
and the least squares estimator applied to each equation separately yield identi-
cal results.
The sampling characteristics of β̂ are identical to those discussed above.
Thus β̂ ~ N(β, [X'(Σ⁻¹ ⊗ I_T)X]⁻¹) and, with known covariance Σ and true linear
hypothesis Rβ = r, the quadratic form

$$(R\hat{\beta} - r)'[R(X'(\Sigma^{-1} \otimes I_T)X)^{-1}R']^{-1}(R\hat{\beta} - r) \sim \chi^2_{(J)} \qquad (2.2.15)$$
Turning now to measures of fit for the estimated equation, write the observed y vector as

$$y = Xb + \hat{e} = \hat{y} + \hat{e} \qquad (2.3.1)$$

so that

$$y'y = \hat{y}'\hat{y} + \hat{e}'\hat{e} + 2\hat{y}'\hat{e} = \hat{y}'\hat{y} + \hat{e}'\hat{e} \qquad (2.3.2)$$

The second equality in Equation 2.3.2 uses the result X'ê = X'(y - Xb) = 0. If, in
addition, X contains a constant term, then j'ê = 0, where j is a (T x 1) vector of ones, and Equation 2.3.2 becomes

$$(y - \bar{y}j)'(y - \bar{y}j) = (\hat{y} - \bar{y}j)'(\hat{y} - \bar{y}j) + \hat{e}'\hat{e} \qquad (2.3.3)$$

where (ŷ - ȳj)'(ŷ - ȳj) is the variation in y explained by
the estimated equation, and ê'ê = (y - ŷ)'(y - ŷ) is the residual sum of squares
or variation in y not explained by the regression. The coefficient of multiple
determination is defined as

$$R^2 = 1 - \frac{\hat{e}'\hat{e}}{(y - \bar{y}j)'(y - \bar{y}j)} \qquad (2.3.4)$$

and, because of the decomposition in (2.3.3), it lies between zero and one and
may be interpreted as a measure of the proportion of the variation in y ex-
plained by the estimated equation.
Also, R² is equal to the square of the correlation coefficient between y and ŷ
and thus can be viewed as a measure of the model's predictive ability over the
sample period. In this context

$$R^2 = \frac{[(y - \bar{y}j)'(\hat{y} - \bar{y}j)]^2}{(y - \bar{y}j)'(y - \bar{y}j)(\hat{y} - \bar{y}j)'(\hat{y} - \bar{y}j)} \qquad (2.3.5)$$

Furthermore, for testing the null hypothesis that all coefficients other than the intercept are zero, one may use

$$F = \frac{R^2}{1 - R^2} \cdot \frac{T - K}{K - 1} \qquad (2.3.6)$$
The acceptance of this null hypothesis implies that there is insufficient
evidence to suggest that any of the explanatory variables influences y. Note
that the F value is monotonically related to R².
Because the addition of new explanatory variables will always increase R², an
alternative coefficient, the "adjusted R²," is frequently used as a goodness of
fit measure. It is given by

$$\bar{R}^2 = 1 - \frac{\hat{e}'\hat{e}/(T - K)}{(y - \bar{y}j)'(y - \bar{y}j)/(T - 1)} \qquad (2.3.7)$$
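The goodness-of-fit quantities (2.3.4), (2.3.6), and (2.3.7) are straightforward to compute; the following sketch is our addition and assumes X contains a constant term.

```python
import numpy as np

def fit_measures(X, y):
    """R^2 of (2.3.4), adjusted R^2 of (2.3.7), and the F statistic of
    (2.3.6); X is assumed to contain a constant term."""
    T, K = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e_hat = y - X @ b
    total = np.sum((y - y.mean()) ** 2)       # (y - j ybar)'(y - j ybar)
    r2 = 1.0 - (e_hat @ e_hat) / total
    r2_adj = 1.0 - (e_hat @ e_hat / (T - K)) / (total / (T - 1))
    f_stat = (r2 / (1.0 - r2)) * ((T - K) / (K - 1))
    return r2, r2_adj, f_stat
```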
When X does not contain a constant term, one alternative is to define

$$R^2 = 1 - \frac{\hat{e}'\hat{e}}{y'y} \qquad (2.3.9)$$

This value will be between zero and one and, for testing the hypothesis β = 0, it
will be related to the F statistic in a similar way to (2.3.6), though the degrees of
freedom will be different.
A second alternative is to define R² as the squared correlation between y and
ŷ, as it appears in the last expression in Equation 2.3.5. Defined in this way, R²
will be a measure of the predictive ability of the equation over the sample
period and it will be between zero and one, but it will not be equal to the
definition in (2.3.9).
In the single-equation model y = Xβ + e, where E[e] = 0 and E[ee'] = σ²Ψ ≠ σ²I_T,
the potential for misusing R² is high, particularly if one reports, without ques-
tion, the number given by most computer programs. If we define ê and ŷ from
the equation

$$y = X\hat{\beta} + \hat{e} = \hat{y} + \hat{e} \qquad (2.3.10)$$

where β̂ = (X'Ψ⁻¹X)⁻¹X'Ψ⁻¹y, the first equality in (2.3.2) will still hold with these
new definitions and so we have

$$y'y = \hat{y}'\hat{y} + \hat{e}'\hat{e} + 2\hat{y}'\hat{e} \qquad (2.3.11)$$

However, in this case the last term will not disappear even if X contains a
constant. This is because X'ê = X'y - X'Xβ̂ ≠ 0, the normal equations being,
in this case, X'Ψ⁻¹y - X'Ψ⁻¹Xβ̂ = 0.
Thus, as was the case for model 2.2.1 when there was no constant term, the
measure

$$R^2 = 1 - \frac{\hat{e}'\hat{e}}{(y - \bar{y}j)'(y - \bar{y}j)} \qquad (2.3.12)$$
will have a range from -∞ to 1, and so it will be difficult to interpret it as the
proportion of variation in y explained by the regression. It does have one
possible advantage over another measure (mentioned below), namely that we
are measuring the variation in y in its original units, not in terms of a trans-
formed y or a "weighted variation in y."
The above is not the number reported by many computer programs. A large
number of programs would apply least squares to the transformed model

$$y^* = X^*\beta + e^* \qquad (2.3.13)$$

where y* = Py, X* = PX, e* = Pe, and P is such that P'P = Ψ⁻¹. Then, if the
formula in (2.3.12) is used on the transformed observations, the resulting R²
definition is

$$R^2 = 1 - \frac{\hat{e}^{*\prime}\hat{e}^*}{(y^* - \bar{y}^*j)'(y^* - \bar{y}^*j)} = 1 - \frac{\hat{e}'\Psi^{-1}\hat{e}}{(Py - jj'Py/T)'(Py - jj'Py/T)} \qquad (2.3.14)$$

where ê* = y* - X*β̂ and ȳ* = Σᵀₜ₌₁ y*ₜ/T. There are two main problems with
this measure. First, if the transformed matrix X* does not contain a constant,
R² will not lie between zero and one. Second, we are now measuring the
explained variation in a transformed y. The first of these problems can be
overcome by measuring the weighted variation in y around a weighted mean.
This weighted mean is defined as

$$\bar{y}_w^* = \frac{j'\Psi^{-1}y}{j'\Psi^{-1}j} \qquad (2.3.15)$$

and the resulting measure is

$$R^2 = 1 - \frac{\hat{e}'\Psi^{-1}\hat{e}}{(y - j\bar{y}_w^*)'\Psi^{-1}(y - j\bar{y}_w^*)} \qquad (2.3.16)$$
This value will lie between zero and one and can be interpreted as the propor-
tion of weighted variation in y explained by the regression. Also, as in the
scalar covariance matrix case, this value will be monotonically related to the F
statistic used to test the null hypothesis that all coefficients except the intercept
are zero. The expressions in (2.3.14) and (2.3.16) will be equal if X* has a
constant column in exactly the same position as X.
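A sketch of the weighted measure (2.3.16), using the weighted mean (2.3.15), follows; this is our addition, and the function name and the use of an explicit inverse are for clarity only.

```python
import numpy as np

def weighted_r2(X, y, Psi):
    """Weighted R^2 of (2.3.16), using the weighted mean of (2.3.15);
    X is assumed to contain an intercept column."""
    Psi_inv = np.linalg.inv(Psi)
    beta_hat = np.linalg.solve(X.T @ Psi_inv @ X, X.T @ Psi_inv @ y)
    e_hat = y - X @ beta_hat
    j = np.ones_like(y)
    ybar_w = (j @ Psi_inv @ y) / (j @ Psi_inv @ j)   # weighted mean
    dev = y - j * ybar_w
    return 1.0 - (e_hat @ Psi_inv @ e_hat) / (dev @ Psi_inv @ dev)
```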
The relevant question for the applied worker is: which of the alternative R²'s
is the "best" one to report? We do not have any strong recommendations on
this point. However, we do feel that the method of calculation should be
mentioned and that it is better not to report an R² at all than to report an
ambiguous number.
2.4 SUMMARY
The statistical models and the classical statistical inference we have reviewed
provide much of the basis for applied econometric work. However, these
special case statistical models are not sufficient as a basis for modeling many
economic data generation processes and the sampling theory approach to infer-
ence is not in many cases an efficient way of learning from economic data. In
order to cope with the special characteristics of economic data we can (1)
change and enrich the range of statistical models, (2) change the amount of
information used, and (3) change the measure of performance or the basis for
inference. In order to more effectively learn from economic data, we consider
the following possibilities in the chapters ahead.
variables, the functional form of the relationship, the way the variables
should be scaled, and whether the model should be specified as linear or
nonlinear in the parameters. Seldom, in actual practice, are these statisti-
cal model ingredients known, and thus some criterion must be used as a
basis for choosing among the alternatives. In Chapter 21 several alterna-
tive search criteria are considered and the applicability of each is as-
sessed.
In the specification of the statistical models we have assumed that the
design matrix was of full column rank. With economic data in many cases
this assumption is not fulfilled. Therefore, in Chapter 22 the statistical
implications of collinearity are identified and procedures for mitigating its
impact are evaluated.
The above chapters result because for many applied problems (1) the tradi-
tional statistical models must be changed or extended, (2) the amount and types
of information available vary, and (3) the appropriate criterion for estimator
performance depends on the problem considered.
2.5 EXERCISES
Let

(2.5.1)

where the design matrix is

          x₁      x₂      x₃
          1     0.693   0.693
          1     1.733   0.693
          1     0.693   1.386
          1     1.733   1.386
          1     0.693   1.792
          1     2.340   0.693
          1     1.733   1.792
          1     2.340   1.386
          1     2.340   1.792
X =       1     0.693   0.693        (2.5.2)
          1     0.693   1.386
          1     1.733   0.693
          1     1.733   1.386
          1     0.693   1.792
          1     2.340   0.693
          1     1.733   1.792
          1     2.340   1.386
          1     2.340   1.792
          1     1.733   1.386
          1     0.693   0.693
2.5.1a Individual Exercises
Exercise 2.1
Using the least squares rule and five of the samples, compute estimates of β₁,
β₂, β₃, and σ² for each sample. Compute the estimated covariance matrix for
each sample and compare it with the true covariance matrix σ²(X'X)⁻¹.

Exercise 2.2
Using the true parameters as the hypothesis, compute t-test statistics for β₂ and
β₃ for each sample in Exercise 2.1 and interpret. Repeat the tests under the
hypotheses that β₂ = 0 and β₃ = 0.

Exercise 2.3
Using the true parameter value of σ² as the hypothesis, that is, σ² = 0.0625, test
the compatibility of the estimates of σ² obtained in Exercise 2.1 and the hypoth-
esis.

Exercise 2.4
Using the true parameters for β₁, β₂, and β₃ as the hypothesis, compute the F-
test statistic for each of the five data samples and interpret. Repeat the test
under the hypothesis β₁ = β₂ = β₃ = 0 and interpret.
Exercise 2.5
Choose a significance level and develop and interpret interval estimates for β₁,
β₂, β₃, and σ² for each of the five data samples.

Exercise 2.6
Compute the mean of b₁, b₂, b₃, and σ̂² for the five samples and compare these
averages with the true parameters. Compute the mean variance for each of the
b's for the five samples and compare with the true variances.

Exercise 2.7
Divide the observations for each sample and the rows of the design matrix in
half and from the 5 samples you have used create 10 samples of size 10.
Compute least squares estimates of the β's and σ². Compare these estimates
with the estimates from the 5 samples (Exercise 2.1) and especially note their
relative sampling variability as the sample size decreases.

Exercise 2.8
Compute the average values for the estimates of β₁, β₂, β₃, and σ² for all 100
samples and compare these averages with the true parameters. Compute the
average variances of the estimates of the β's and σ² for all 100 samples and
compare these averages with the true variances. Compute the average total
squared loss

$$\sum_{i=1}^{100} (b_i - \beta)'(b_i - \beta)/100$$

Exercise 2.9
Construct an empirical frequency distribution of all 100 samples for the estimates
of β₁, β₂, β₃, and σ² and interpret.

Exercise 2.10
Construct an empirical distribution of all 100 samples for the t- and F-test
statistics. Construct and interpret the empirical distribution for the random
variable σ̂ᵢ²/σ², for i = 1, 2, . . . , 100.

Exercise 2.11
Use the Kolmogorov-Smirnov D statistic or some other appropriate statistic to
test the agreement between the empirical and theoretical (normal, t, χ², and F)
distributions of the estimates and test statistics.

Exercise 2.12
Using the design matrix (2.5.2) and a random number generator on the com-
puter you have access to, for a uniform distribution with mean zero and vari-
ance σ² = 0.0625, generate 100 samples of size 20, estimate the parameters, and
compare these results with those obtained in Exercises 2.8 and 2.10.
For the remaining exercises, consider the statistical model

$$y = X\beta + e \qquad (2.5.3)$$

where X is given in (2.5.2) and e is a normal random vector with mean zero and
covariance matrix

$$E[ee'] = \sigma^2\Psi = \sigma^2\begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\ \rho & 1 & \rho & \cdots & \rho^{T-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{T-3} \\ \vdots & & & & \vdots \\ \rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1 \end{bmatrix} \qquad (2.5.4)$$

with σ² = 1.0 and ρ = 0.8. Using the experimental design given by (2.5.2),
(2.5.3), and (2.5.4), develop 100 samples of data of size 20.
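One way to generate such samples (our sketch, not the book's data; the true β values are hypothetical since the body of (2.5.1) is not reproduced above) is to scale standard normal draws by a Cholesky factor of the covariance matrix (2.5.4).

```python
import numpy as np

rng = np.random.default_rng(3)
T, rho, sigma2 = 20, 0.8, 1.0

X = np.array([
    [1, 0.693, 0.693], [1, 1.733, 0.693], [1, 0.693, 1.386], [1, 1.733, 1.386],
    [1, 0.693, 1.792], [1, 2.340, 0.693], [1, 1.733, 1.792], [1, 2.340, 1.386],
    [1, 2.340, 1.792], [1, 0.693, 0.693], [1, 0.693, 1.386], [1, 1.733, 0.693],
    [1, 1.733, 1.386], [1, 0.693, 1.792], [1, 2.340, 0.693], [1, 1.733, 1.792],
    [1, 2.340, 1.386], [1, 2.340, 1.792], [1, 1.733, 1.386], [1, 0.693, 0.693],
])                                        # the design matrix of (2.5.2)

# AR(1) covariance of (2.5.4): element (t, s) is sigma^2 * rho^{|t-s|}
idx = np.arange(T)
Psi = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])
L = np.linalg.cholesky(Psi)               # e = Lz, z ~ N(0, I) => cov(e) = Psi

beta = np.array([1.0, 1.0, 1.0])          # hypothetical illustrative values
samples = [X @ beta + L @ rng.standard_normal(T) for _ in range(100)]
```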
Exercise 2.13
Calculate the covariance matrices for the least squares estimator and the gener-
alized least squares estimator and comment on the results.

Exercise 2.14
Calculate the mean of the least squares error variance estimator.

Exercise 2.15
Select five samples from the Monte Carlo data and for each sample compute
values for the following estimators.

Exercise 2.16
In Exercise 2.15 your computer program is likely to have provided estimated
covariance matrices for b and β̂. Using the element β₂ as an example, compare
the various estimates of var(b₂) and var(β̂₂) with each other and with the
relevant values from Exercise 2.13.
Exercise 2.17
Use the results in Exercises 2.15 and 2.16 to calculate t values that would be
used to test hypotheses about β₂, including the null hypothesis β₂ = 0.4.
Since we have two estimators and five samples this involves, for each hypothe-
sis, the calculation of 10 t values. Keeping in mind the validity of each null
hypothesis, comment on the results.

Exercise 2.18
Repeat Exercise 2.15 for all 100 samples and use the results to estimate the
means of b, β̂, and the two variance estimators, as well as the mean square error
matrices of b and β̂. Are all these estimates reasonably close to their population
counterparts?

Exercise 2.19
Use all 100 samples to estimate the means of the variance estimators var(b₂)
and var(β̂₂). Comment on the results and their implications for testing hypothe-
ses about β₂.

Exercise 2.20
Repeat Exercise 2.17 for all 100 samples and calculate the proportion of Type I
and Type II errors that are obtained when a 5% significance level is used.
Comment.

Exercise 2.21
Construct empirical distributions for b₂, β̂₂, and the two t statistics used in
Exercise 2.20 to test the null hypothesis β₂ = 0.4. Comment.
2.6 REFERENCES
Chapter 3
Statistical Decision Theory and Biased Estimation

[Chapter overview diagram: the general linear statistical model, alternative estimators, and compatibility tests (Sections 3.4.1 to 3.4.3).]
Concern with the implications of scarcity has led economists over time to
develop a general theory of choice. Within this context Marschak once re-
marked, "Information is useful if it helps us make the right decisions." Conse-
quently, as economists, a primary reason for interest in efficient procedures for
learning from data is to gain information so that we may (1) better understand
economic processes and institutions structurally and quantitatively and (2) use
this descriptive information prescriptively as a basis for action or choice. Given
these descriptive and prescriptive objectives, it would seem that a decision
framework, based on the analysis of losses due to incorrect decisions (parame-
ter estimates), can and should be used.
From the standpoint of classical statistics, where inferences are made with-
out regard to the uses to which they are to be put, any two estimators may have
the same precision matrix, but if one is biased while the other is unbiased, then
one would choose the unbiased rule, which is right on average. Alternatively, if
two rules (estimators) are unbiased but one has a smaller sampling variation
than the other in the sense of (2.1.7), then the more precise estimator would be
chosen. The problem, of course, with the classical properties approach to
gauging estimator performance comes when one estimator is biased but has a
Statistical decision theory, as the name implies, is concerned with the problem
of making decisions based on statistical knowledge. Since the knowledge is
statistical in nature it involves uncertainties that we will consider to be un-
known parameter values. The objective is to combine the sample information
with other relevant information in order to make the "best" decision. These
other types of nonsample information include (1) knowledge of the possible
consequences of the decisions that can be expressed as the loss that would be
incurred, over the range of the parameter space, if a particular decision rule is
used, and (2) prior or nonsample information that reflects knowledge other than
that derived from the statistical investigation.
A problem in decision theory has three ingredients: (1) the parameter space
ℬ that reflects the possible states of nature relative to the unknown parameter
vector β, (2) a set of all possible decisions or actions 𝒜, with particular actions a,
and (3) a loss function L(β,a) defined for all (β,a) ∈ (ℬ x 𝒜).
When we experiment to obtain information about β, the design is such that
the observations y are distributed according to some probability distribution
that has β as the unknown parameter vector. In this context, the loss incurred
depends on the outcome of an observable random variable y through the func-
tion a used to assign an action for a given y. As noted above, the distribution of
y depends on β, the true state of nature. The function a(y) is called a decision
rule and the loss associated with this rule is denoted as L(β,a(y)).
Since the T-dimensional sample vector y is random, it is important for deci-
sion purposes to consider the average losses involving actions taken under the
various outcomes for y. In other words, since the estimator a(y) = β̂ is random,
as it depends on y, the loss incurred in not knowing the parameter vector will
also be random. The average loss or the expected value of the loss function
L(β,β̂), where β is the true parameter, is called a risk function ρ(β,a(y)) =
ρ(β,β̂) and is designated by

$$\rho(\beta,\hat{\beta}) = E[L(\beta,\hat{\beta})] \qquad (3.1.1)$$
Given a framework for defining loss or risk associated with each action and
state of nature, the central problem of decision theory is how to choose a
decision rule or, alternatively, how to choose among the alternative decision
rules. Intuitively, it would seem that we should choose a rule that makes the
risk of not knowing β as small as possible; that is, as a general criterion we
should choose a rule that minimizes the risk for each value of β in ℬ. In
general, there is no one rule with minimum risk for all β in ℬ. This means that
typically there is no estimator β̂₀ such that for every other estimator β̂,
ρ(β,β̂₀) ≤ ρ(β,β̂) for every β. This proposition is expressed graphically in
Figure 3.1 for the parameter space λ = β'β, where each element of the β vector
may take on values between -∞ and ∞. In Figure 3.1, over the range of the
parameter space from the origin to λ = β'β = a, the estimator β̂₂ has the lowest
risk. At point a the risk functions for β̂₂ and β̂₃ cross, and β̂₃ has the smallest
risk over the remaining range of the parameter space.
The situation represented by the alternative risk functions in Figure 3.1 is
typical of what happens in practice. Consequently, in order to choose among
alternative rules, some basis for preference must be introduced.

[Figure 3.1 Risk functions for four alternative estimators; horizontal axis: parameter space λ = β'β.]
If ρ(β,β̂) < ρ(β,β̂₀) for some β and ρ(β,β̂) ≤ ρ(β,β̂₀) for all β, the estimator β̂
is said to strictly dominate β̂₀. Under this definition the estimator β̂₂ in Figure
3.1 is said to strictly dominate β̂₁. However, note that β̂₂ does not dominate β̂₃
and β̂₄ does not dominate β̂₂.
An estimator is called admissible if it is not strictly dominated by any other
estimator. We may say that an estimator β̂₁ is inadmissible if there is another
estimator, β̂₂, such that ρ(β,β̂₂) ≤ ρ(β,β̂₁) for all β and, for some β, the strict
inequality holds. These results of course imply that an inadmissible estimator
or rule should not be used. Unfortunately, there is usually a large class of
admissible estimators for any one problem.
As we noted in the sampling theory approach, one important way of limiting
the class of estimators or rules is to require that the estimator be unbiased and
thus require that E[β̂] = β. Furthermore, because of tractability, the require-
ment that β̂ be a linear function of y is sometimes imposed, and in fact many of
the estimators used in econometric work are restricted to the linear unbiased
class. Unfortunately, as we will see in the chapters to come, requiring this
property sometimes results in inadmissible estimators. The question of why
one should be interested in a property of unbiasedness that requires absolute
accuracy on the average has been asked many times. One might question
whether it has been properly answered. Certainly, the typical response, "It is
conventional," is not an adequate reply.
Another restriction sometimes used in estimator choice, which tries to pro-
tect against the worst state of nature, is the minimax criterion. Under this
criterion an estimator β̂₀ is said to be minimax, within the class of estimators D,
if the maximum risk of β̂₀ is equal to or less than that for all other estimators β̂,
that is,

$$\max_{\beta} \rho(\beta,\hat{\beta}_0) \le \max_{\beta} \rho(\beta,\hat{\beta})$$
Another frequently used basis for choosing among estimators is the Bayes
criterion. We noted that many admissible estimators have risks that are supe-
rior for different values of β. In the Bayesian approach, the objective is to make
proper use of nonsample or prior information g(β) about β. Since the prior
g(β) supposedly reflects which β's are the likely ones to appear, it seems
reasonable to weight the risk functions ρ(β,β̂) by g(β) and average, that is,
take the expectation over β. In using the Bayes principle, a multivariate
distribution on the β space is assigned with probability density function g(β).
Under this specification the Bayes solution is to choose the estimator that
minimizes the average risk with respect to the prior density g(β); that is, the
estimator β̂_g is said to be Bayes if

$$E_\beta[\rho(\beta,\hat{\beta}_g)] \le E_\beta[\rho(\beta,\hat{\beta})] \qquad (3.1.4)$$

for all β̂ in D.
Thus far we have not discussed the specific form of the loss or risk function that
is to be used in evaluating estimator performance. Although there are many
alternatives, for expository purposes we will at this point discuss only the
squared error loss and risk matrix (generalized mean square error) measures.
Suppose b is the least
squares estimator; then under the squared error loss criterion the least squares
estimator has risk

$$\rho(\beta,b) = E[(b - \beta)'(b - \beta)] = \sigma^2 \operatorname{tr}(X'X)^{-1} \qquad (3.1.8)$$
Furthermore, writing the unweighted, squared error loss risk function in the
θ parameter space as

$$\rho(\theta,\hat{\theta}) = E[(\hat{\theta} - \theta)'(\hat{\theta} - \theta)] = E[(S^{1/2}b - S^{1/2}\beta)'(S^{1/2}b - S^{1/2}\beta)]
= E[(b - \beta)'S^{1/2}S^{1/2}(b - \beta)] = E[(b - \beta)'S(b - \beta)] \qquad (3.1.9)$$

yields a weighted loss function in the β space with weight matrix X'X = S.
Alternatively, an unweighted, squared error loss risk function in the β space
results in the following weighted risk function in the θ space:

$$\rho(\beta,b) = E[(\hat{\theta} - \theta)'S^{-1}(\hat{\theta} - \theta)] \qquad (3.1.10)$$

More generally, for any estimator β̂ the mean square error matrix is

$$\operatorname{MSE}(\hat{\beta}) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'] = \operatorname{cov}(\hat{\beta}) + (E[\hat{\beta}] - \beta)(E[\hat{\beta}] - \beta)' \qquad (3.1.11)$$

which in words is the covariance matrix for β̂, plus the bias squared matrix for
β̂. The diagonal elements of (3.1.11) are the mean square errors for each ele-
ment of β̂, and the trace of the mean square error matrix is equal to the squared
error loss criterion of (3.1.6) when W = I; that is,

$$E[(\hat{\beta} - \beta)'(\hat{\beta} - \beta)] = \operatorname{tr}\{\operatorname{MSE}(\hat{\beta})\} \qquad (3.1.13)$$
(3.1.13)
where A is a positive semidefinite matrix and therefore "I' A'Y :> 0 for any K-
dimensional real vector "I =1= 0 and for all p.
The risk matrix for the least squares estimator is
Since the least squares estimator is unbiased, it has a null bias matrix. If in
(3.1.13) we let Il = band Ilo be all other linear unbiased estimators, the conclu-
sion of the Gauss-Markov theorem is stated.
By making use ofthese criteria, the least squares or generalized least squares
estimators of Chapter 2 are best linear unbiased.
To give the decision theoretic concepts an operational flavor, let us use the
orthonormal K mean version of the linear statistical model, that is, y = Zθ + e,
where Z'Z = I_K and e ~ N(0, σ²I_T). The maximum likelihood estimator that
uses only sample information is θ̂ = Z'y and θ̂ ~ N(θ, σ²I_K). Suppose, based on
this sampling process, we want to choose a linear estimator θ̃ = Ay that mini-
mizes expected squared error loss, which is

$$\rho(\theta,\tilde{\theta}) = E[L(\theta,\tilde{\theta})] = E[(\tilde{\theta} - \theta)'(\tilde{\theta} - \theta)] \qquad (3.1.15)$$

Given this measure of performance, one criterion is to find, for the linear
estimator θ̃ = Ay, the matrix A which minimizes expected squared error loss
or risk. That is, find the matrix A that minimizes

$$E[(\tilde{\theta} - \theta)'(\tilde{\theta} - \theta)] = E[((\tilde{\theta} - E[\tilde{\theta}]) + (E[\tilde{\theta}] - \theta))'((\tilde{\theta} - E[\tilde{\theta}]) + (E[\tilde{\theta}] - \theta))]
= E[(\tilde{\theta} - E[\tilde{\theta}])'(\tilde{\theta} - E[\tilde{\theta}])] + (E[\tilde{\theta}] - \theta)'(E[\tilde{\theta}] - \theta) \qquad (3.1.16)$$

which is the sum of the variances of the elements of θ̃, plus the sum of the
squared biases for each element of θ̃. For θ̃ = Ay and y = Zθ + e, the risk with
the appropriate substitutions in (3.1.16) becomes

$$E[(\tilde{\theta} - \theta)'(\tilde{\theta} - \theta)] = \sigma^2 \operatorname{tr}(AA') + \theta'(AZ - I_K)'(AZ - I_K)\theta \qquad (3.1.17)$$

where σ² tr AA' is the sum of the diagonal elements of the covariance matrix
σ²AA' for θ̃, which, given the choice of A, is a fixed number. The only way to
minimize (3.1.16) or (3.1.17) for all θ is to choose A such that the second
expression on the right-hand side of (3.1.17) is as small as possible.
One way of looking at the problem is to assume that nature or some opponent
chooses the θ vector and that we must choose A such that we can protect
against an "unfavorable" outcome for nature's choice of θ. To guard against a
large possible loss, we could choose A such that AZ = I_K and thus make the
last expression on the right-hand side of (3.1.17) equal to zero. This suggests
that A = Z' and leads to the unbiased linear (least squares) estimator θ̂ = Z'y
with covariance σ²(Z'Z)⁻¹ and risk E[(θ̂ - θ)'(θ̂ - θ)] = σ² tr(Z'Z)⁻¹ = σ²K.
Consequently, under a squared error loss criterion, the choice of a linear rule
that minimizes the maximum expected loss (minimax decision rule) leads to an
unbiased linear decision rule that is identical to the least squares rule θ̂ that
minimizes the quadratic form (y - Zθ)'(y - Zθ).
Continuing with the use of the orthonormal K mean statistical model, let us
consider the development of the risk functions for a family of estimators
θ̂_c(θ̂) = cθ̂, where c is a scalar and θ̂ is the maximum likelihood estimator.
Under a squared error loss measure, the family of risk functions may be speci-
fied as

$$E[L(\theta,\hat{\theta}_c(\hat{\theta}))] = E[(c\hat{\theta} - \theta)'(c\hat{\theta} - \theta)]
= E[(c\hat{\theta} - E[c\hat{\theta}])'(c\hat{\theta} - E[c\hat{\theta}])] + E[(E[c\hat{\theta}] - \theta)'(E[c\hat{\theta}] - \theta)]
= E[(c\hat{\theta} - c\theta)'(c\hat{\theta} - c\theta)] + (c\theta - \theta)'(c\theta - \theta)
= c^2 E[(\hat{\theta} - \theta)'(\hat{\theta} - \theta)] + (c - 1)^2\theta'\theta
= c^2\sigma^2 K + (c - 1)^2\theta'\theta \qquad (3.1.18)$$
First consider the risk function for the estimator when c = 1. In this case the
risk over all θ is ρ(θ,θ̂₁(θ̂)) = σ²K, the risk of the least squares estimator, which
as shown in (3.1.17) is a minimax decision rule and, as we will show later in this
chapter, when K ≤ 2 this estimator is admissible. The risk function for the
estimator θ̂ is given in Figure 3.2.
Now consider the case of c > 1. In this case

$$\rho(\theta,\hat{\theta}_c(\hat{\theta})) = c^2\sigma^2 K + (c - 1)^2\theta'\theta > \sigma^2 K \quad\text{for all } \theta \qquad (3.1.19)$$

Therefore, the least squares estimator strictly dominates the family of estima-
tors defined when c > 1 and thus demonstrates their inadmissibility.
Next consider the family of estimators when 0 < c < 1. For this range of c,
all of the risk functions cross somewhere in the parameter space and no one
choice of c dominates any other, including the outcome of c = 1. The character-
istics of these risk functions are in general similar to those shown in Figure 3.2
when c = 0.6. As in the case when c = 1, if K ≤ 2 all of the estimators with
0 < c < 1 are admissible. The risks for the estimators in this range are unbounded.
Next let us consider the trivial estimator when c = 0 and thus cθ̂ = 0. This
means that regardless of the outcomes for the random vector y we choose θ̂ to
be a null vector. This estimator has a risk of zero at the origin. It is equal to θ'θ
otherwise and therefore has an unbounded risk. The estimator is admissible, and
this indicates that although admissibility may be a gem of great value for deci-
sion rules, it does not assure us that the estimator is a meaningful one. A graph
of the risk function when c = 0 is given in Figure 3.2.
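The risk family (3.1.18) can be tabulated directly; the following sketch (our addition, with arbitrary σ² and K) shows the crossings of the risk functions for c = 0, 0.6, and 1 described above.

```python
import numpy as np

def risk(c, theta_sq, sigma2=1.0, K=4):
    """Risk (3.1.18) of the rule c * theta_hat at a point theta'theta."""
    return c ** 2 * sigma2 * K + (1.0 - c) ** 2 * theta_sq

for theta_sq in (0.0, 2.0, 8.0, 32.0):    # points along the theta'theta axis
    risks = {c: risk(c, theta_sq) for c in (0.0, 0.6, 1.0)}
    print(theta_sq, risks)
# Near the origin c = 0 has the smallest risk; for large theta'theta the
# least squares choice c = 1 wins, so the risk functions cross.
```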
When considering a decision rule such as θ̂_c(θ̂) = cθ̂, it seems appropriate to
ask if there is a particular value of c that minimizes risk over the range of the
parameter space. To investigate this question for the risk function (3.1.18), we
may find the value of c which minimizes risk by solving

$$\frac{\partial}{\partial c}[c^2\sigma^2 K + (c - 1)^2\theta'\theta] = 2c\sigma^2 K + 2(c - 1)\theta'\theta = 0$$

which yields c = θ'θ/(θ'θ + σ²K), a value that depends on the unknown θ'θ.

[Figure 3.2 Characteristics of the risk functions for the estimators θ̂_c(θ̂) when c = 0, c = 0.6, and c = 1; horizontal axis: parameter space θ'θ.]
Under the Bayes criterion the weighted or average risk is

$$E_\theta[\rho(\theta,\tilde{\theta})] = \int \rho(\theta,\tilde{\theta})g(\theta)\,d\theta \qquad (3.1.22)$$

where g(θ) represents the density of the random variables over which the
expectation is being taken. For the orthonormal statistical model used in the
above examples, let the prior information be represented by the joint density
g(θ) ~ N(0, τ²I_K). In this case the θ̂_c(θ̂) decision rules have Bayes risks

$$E_\theta[c^2\sigma^2 K + (c - 1)^2\theta'\theta] = c^2\sigma^2 K + (c - 1)^2\tau^2 K \qquad (3.1.23)$$

In order to find the c that minimizes the average risk, that is, the c that makes
(3.1.23) a minimum, we solve

$$\frac{\partial}{\partial c}[c^2\sigma^2 K + (c - 1)^2\tau^2 K] = 2c\sigma^2 K + 2(c - 1)\tau^2 K = 0 \qquad (3.1.24)$$

which yields

$$c = \frac{\tau^2}{\tau^2 + \sigma^2} \qquad (3.1.25)$$
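A quick simulation (our addition; the values of K, σ², τ², and the seed are arbitrary) confirms that c = τ²/(τ² + σ²) minimizes the average risk (3.1.23).

```python
import numpy as np

rng = np.random.default_rng(4)
K, sigma2, tau2, reps = 5, 1.0, 4.0, 20000
c_star = tau2 / (tau2 + sigma2)           # the optimal shrinkage of (3.1.25)

avg_loss = {c: 0.0 for c in (c_star, 0.6, 1.0)}
for _ in range(reps):
    theta = rng.normal(0.0, np.sqrt(tau2), K)                # theta ~ g(theta)
    theta_hat = theta + rng.normal(0.0, np.sqrt(sigma2), K)  # ML estimator
    for c in avg_loss:
        avg_loss[c] += np.sum((c * theta_hat - theta) ** 2) / reps

print(avg_loss)   # smallest average loss at c = c_star = 0.8
```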
In the sections and chapters to follow, the basic definitions and concepts relat-
ing to statistical decision theory are used to gauge the sampling performance of
a range of traditional and nontraditional estimators that use various forms of
nonsample information. The Bayesian approach to inference will be discussed
in some detail in Chapter 4. Those interested in a more complete discussion of
statistical decision theory should see Berger (1980), Ferguson (1967), and
DeGroot (1970).
As noted earlier, two types of nonsample information may supplement sample
knowledge. One type deals with knowledge of the possible consequences of the
decisions, and the other is knowledge other than that derived from the statistical
investigation. In this section we analyze, within a decision theoretic context, the
statistical implications of using both sample and various kinds of nonsample
information.
There may be instances in applied work when the investigator has exact infor-
mation on a particular parameter or linear combination of parameters. For
example, in estimating a log-linear production function, information may be
available that the firm is operating under the condition of constant returns to
scale. Alternatively, in estimating a demand relation, information may be avail-
able from consumer theory on the homogeneity condition or information con-
cerning the income response coefficient may be available from previous empiri-
cal work.
If information of this type is available, it may be stated in the form of the
following set of linear relations or linear equality restrictions:

$$R\beta = r \qquad (3.2.1)$$

For example, with restrictions β₁ = k₁, β₁ + β₂ + ⋯ + β_K = 1, and β₂ - β₃ = 0,

$$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 1 & \cdots & 1 \\ 0 & 1 & -1 & 0 & \cdots\ 0 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{bmatrix} = \begin{bmatrix} k_1 \\ 1 \\ 0 \end{bmatrix} \qquad (3.2.2)$$

Alternatively, if the first j coefficients are known to equal the vector k_j,

$$[I_j \quad 0_{(K-j)}]\,\beta = k_j \qquad (3.2.3)$$
The restricted least squares estimator is obtained by minimizing the sum of squared errors (y - Xβ)'(y - Xβ)
subject to Rβ = r, which yields

$$b^* = b + (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r - Rb)$$

which differs from the unrestricted least squares estimator by a linear function
of the vector r - Rb. The restricted least squares random vector b* has mean

$$E[b^*] = \beta + (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r - R\beta)$$
If the restrictions are incorrect and E[r - Rb] = r - Rβ = δ ≠ 0, then the restricted
least squares estimator is biased and has the mean square error or risk matrix

$$\operatorname{MSE}(b^*) = \operatorname{cov}(b^*) + (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}\delta\delta'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1} \qquad (3.2.8)$$

which in words is equal to the covariance matrix for the restricted least squares
estimator plus the bias matrix resulting from the products of the bias vector.
It is apparent from the restricted least squares covariance matrix (3.2.7) and
the mean square error matrix (3.2.8) that if the nonsample information is cor-
rect, then using it in conjunction with the sample information will lead to an
unbiased estimator that has a precision matrix superior to the unrestricted least
squares estimator.
The covariance matrix of the restricted least squares estimator is a null matrix in this case because each of the parameters of the β vector is constrained and thus does not vary from sample to sample. Therefore, the risk of the restricted least squares estimator is equal to the sum of squares of the bias or restriction errors, δ'δ. The covariance matrix of the least squares estimator for the orthonormal statistical model is σ²I_K, and thus the risk under squared error loss is

ρ(β, b) = E[(b − β)'(b − β)] = σ² tr(I_K) = σ²K

a scalar that depends on the sample information and is unrelated to the information and/or errors contained in the system of linear equality relations.
Since the least squares risk is σ²K for every δ'δ, and the restricted least squares risk is δ'δ, the two risks are equal when

σ²K = δ'δ    (3.2.12)

or

δ'δ/σ² = K

or

δ'δ/2σ² = K/2

When the nonsample information in the form of restrictions is correct, the risk gain from using the restricted least squares estimator is σ²K. These results are shown in Figure 3.3. As indicated in the figure, under a squared error loss measure of performance the restricted estimator can be very good or very bad (precisely wrong, with unbounded risk). Its performance relative to the least squares estimator depends on the quality of the nonsample information. If the quality of the nonsample information is such that δ'δ/2σ² < K/2, then ρ(β, b*) < ρ(β, b) and the restricted least squares estimator is the clear choice. However, if the restrictions are such that δ'δ/2σ² > K/2, then the restricted estimator is inferior to the least squares estimator over an infinite range of the parameter space.
Figure 3.3 Risk functions for the least squares and restricted least squares estimators (risk plotted against the restriction error δ'δ/2σ²).
Under the general mean squared error criterion we evaluate the difference between the risk matrix (covariance) for the unrestricted orthonormal least squares estimator, σ²I_K, and that of the restricted least squares estimator, δδ', and determine the conditions under which the difference in risk matrices yields a positive semidefinite matrix. Thus we consider the characteristics of the difference matrix

Δ = σ²I_K − δδ'    (3.2.13)

Therefore, under the general mean squared error or risk matrix criterion, the restricted least squares estimator b* will be superior to the unrestricted least squares estimator b if, for all (K × 1) vectors t ≠ 0,

t'(σ²I_K − δδ')t ≥ 0    (3.2.14)

or

t'(I_K − δδ'/σ²)t ≥ 0    (3.2.15)

The maximum of t'δδ't/σ²t't taken over all (K × 1) nonzero vectors t is δ'δ/σ² (Rao, 1973, p. 60). This means that the difference between the two risk matrices (3.2.13) is positive semidefinite if and only if

δ'δ/σ² ≤ 1    (3.2.16)

or

δ'δ/2σ² ≤ 1/2    (3.2.17)
In other words, if the prior information error structure is such that δ'δ/2σ² ≤ 1/2, the restricted least squares estimator for the orthonormal statistical model is superior under general mean squared error to the least squares estimator. It should be noted that this is a much stronger condition for estimator superiority than the squared error loss criterion of (3.2.9) and (3.2.11), that is, δ'δ/2σ² ≤ 1/2 versus δ'δ/2σ² ≤ K/2. As developed in Judge and Bock (1978, p. 29), for general X and R design matrices under the risk matrix criterion, the restricted least squares rule is superior to the unrestricted least squares rule if

δ'[R(X'X)⁻¹R']⁻¹δ/σ² ≤ 1

or

δ'[R(X'X)⁻¹R']⁻¹δ/2σ² ≤ 1/2    (3.2.18)

This is the very reasonable or intuitive result that if the restrictions are "close to being correct," then the false, but not very false, restrictions are adding information that improves on least squares. The sample analogue of the quantity on the left of (3.2.18) is the statistic (Rb − r)'[R(X'X)⁻¹R']⁻¹(Rb − r)/Jσ̂², which reappears below in testing the compatibility of the sample and nonsample information.
3.2.2 Stochastic Nonsample Information

There are many situations in applied work where assuming exact prior information is not appropriate, or in fact this type of information does not exist. If uncertainty exists about the prior information specifications, one alternative is to make use of stochastic restrictions or hypotheses of the following form:

r = Rβ + v    (3.2.19)
Let us suppose that prior information is available on the first two elements of β. We may feel, for example, that almost certainly, say with 95% probability, β₁ lies between 0 and 1 and β₂ lies between 3/4 and 5/4. If we use the "two times σ rule," the range of β₁ is 1/2 ± 2(1/4) and that of β₂ is 1 ± 2(1/8), and, assuming for the moment that σ² = 1, we can write, for (3.2.19),

r = [1/2; 1],   R = [1 0 0 ⋯ 0; 0 1 0 ⋯ 0],   v = [v₁; v₂],   Ω = [1/16 0; 0 1/64]

Combining the sample information y = Xβ + e with the stochastic prior information (3.2.19) gives the stacked statistical model

[y; r] = [X; R]β + [e; v]    (3.2.20a)

where [e', v']' is a multivariate normal random vector with mean vector and covariance matrix

E[e; v] = [0; 0],   E{[e; v][e', v']} = σ²[I_T 0; 0 Ω]    (3.2.20b)

Applying generalized (Aitken) least squares to (3.2.20a) yields the stochastic restricted (mixed) estimator

β̂ = [X'X + R'Ω⁻¹R]⁻¹[X'y + R'Ω⁻¹r]    (3.2.21)

which has mean

E[β̂] = β + [X'X + R'Ω⁻¹R]⁻¹R'Ω⁻¹δ    (3.2.22)

and covariance matrix

E[(β̂ − E β̂)(β̂ − E β̂)'] = σ²[X'X + R'Ω⁻¹R]⁻¹    (3.2.23)
If the stochastic restrictions are unbiased, that is, E[r − Rβ] = E[v] = δ = 0, the stochastic restricted estimator is unbiased, and the Aitken estimator (3.2.21) is best linear unbiased within the class of estimators making use of the sample information y and the stochastic prior information r.
One difficulty with estimator (3.2.21) is that σ² is unknown. As it is written, (3.2.21) does not depend explicitly on σ², which cancels out. However, because the covariance matrix of v is assumed to be σ²Ω, knowledge of σ² is required for proper specification of the prior information. To deal with this problem, Theil (1963) suggested that σ² be replaced by an unbiased estimator based on the sample data, that is, σ̂² = (y − Xb)'(y − Xb)/(T − K). Now σ̂² is stochastic, but it differs from σ² to order 1/√T in probability when the random vector e is normally distributed. Therefore, if σ² is replaced by σ̂², the estimator is asymptotically unbiased if the stochastic prior information is correct on the average, and it has the asymptotic moment matrix given in (3.2.23). Theil generalized the model (3.2.20a) to allow for the possibility that the prior judgments on Rβ are positively correlated with the corresponding elements of the sample estimator.
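As a computational aside, the mixed estimator with Theil's substitution of σ̂² for σ² might be sketched as follows; here the prior specification (R, r) and the matrix Psi playing the role of the covariance of v are assumed inputs, and the function name is illustrative only.

```python
import numpy as np

def theil_mixed(X, y, R, r, Psi):
    """Mixed estimation in the spirit of (3.2.21), treating the covariance of v
    as a known matrix Psi and replacing sigma^2 by the unbiased sample estimate."""
    T, K = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = (y - X @ b) @ (y - X @ b) / (T - K)     # sigma_hat^2
    Pi = np.linalg.inv(Psi)
    A = X.T @ X / s2 + R.T @ Pi @ R              # combined sample and prior precision
    beta_hat = np.linalg.solve(A, X.T @ y / s2 + R.T @ Pi @ r)
    return beta_hat, np.linalg.inv(A)            # estimate and asymptotic moment matrix
```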
The sampling properties of the estimator are more easily seen if, following Judge and Bock (1978), we rewrite the statistical model (3.2.20a) in equivalent form and derive the mean square error matrix of the stochastic restricted estimator (equations 3.2.24 through 3.2.27). For the stochastic restricted estimator to dominate under the risk matrix criterion, the difference between the least squares risk matrix and the mean square error matrix (3.2.24) must be positive semidefinite (3.2.28). In order for the risk matrix of the stochastic restricted estimator to be superior to the least squares alternative, σ²(RS⁻¹R' + Ω) − δδ' must be positive semidefinite, or, following Judge and Bock (1978),

δ'(RS⁻¹R' + Ω)⁻¹δ/2σ² ≤ 1/2    (3.2.30)

where S = X'X. Note the similarity between the above condition and that for the restricted least squares estimator (3.2.18).
Judge and Bock (1978) have shown that under the squared error loss criterion ρ(β̂, β) = E[(β̂ − β)'(β̂ − β)] the stochastic restricted estimator is superior (smaller risk) to the least squares estimator if the condition (3.2.31), which involves the matrix S⁻², is satisfied. If a weighted risk function is used where the weight matrix is X'X = S, the condition is the same except that S⁻² changes to S⁻¹.
Therefore, under all of the criteria, the superiority of the stochastic restricted estimator depends on the average specification error in the linear stochastic hypotheses. When the linear stochastic restrictions are correct, that is, δ = 0, the stochastic restricted estimator is superior to its least squares counterpart, provided that Ω is known.
One weakness of the mixed estimator lies in assuming that the random vector v, representing the uncertainty of the prior information, has a zero mean vector, an assumption necessary for the estimator to be unbiased. Since the frequency and subjective interpretations of probability are different, the argument that prior judgments are equivalent to prior unbiased estimates seems unsatisfactory. As shown by Swamy and Mehta, the estimator (3.2.21) may be less efficient than the least squares estimator based only on sample data if the stochastic prior information (3.2.19) is misspecified. The requirement that the prior covariance matrix Ω be known must in most cases be unreasonably demanding. In addition, the combination of fixed β and random r and v in (3.2.19) does not fit any Bayesian axiom system, and it would appear that no set of principles has been set down that would justify this specification.
To check the prior specification we may test the hypothesis E[r] = Rβ + δ against

E[r] = Rβ

where the random variable r is a normally distributed vector, with mean vector Rβ + δ and covariance matrix σ²Ω, that is independent of the least squares estimator b. The stochastic prior and sample information provide two separate estimates of Rβ, that is, r and Rb, and Theil (1963) has proposed that we test the compatibility of the two estimates by the test statistic

u₁ = (r − Rb)'(RS⁻¹R' + Ω)⁻¹(r − Rb)/σ²    (3.2.33a)

which, when σ² is known and δ = 0, has a central chi-square distribution with J degrees of freedom. If δ ≠ 0, then u₁ is distributed as a noncentral chi-square with noncentrality parameter λ = δ'(RS⁻¹R' + Ω)⁻¹δ/2σ². If σ² is unknown and replaced by σ̂², the resulting test statistic has the corresponding asymptotic distribution.
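A sketch of the compatibility statistic (3.2.33a) for the known-σ² case, assuming numpy and scipy are available; the inputs mirror the notation above and the function name is illustrative.

```python
import numpy as np
from scipy import stats

def compatibility_test(X, y, R, r, Omega, sigma2):
    """Chi-square compatibility test of r against Rb, along the lines of (3.2.33a)."""
    S_inv = np.linalg.inv(X.T @ X)
    b = S_inv @ X.T @ y
    d = r - R @ b
    V = R @ S_inv @ R.T + Omega                  # covariance of (r - Rb), up to sigma^2
    u1 = d @ np.linalg.solve(V, d) / sigma2
    J = R.shape[0]
    return u1, stats.chi2.sf(u1, df=J)           # statistic and p-value under delta = 0
```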
3.2.3 Inequality-Restricted
Nonsample Information
In this section we recognize that in applied work there exists in many cases
nonsample information concerning the nonnegativity or nonpositivity of a pa-
rameter or linear combination of parameters, or that a parameter or linear
combination of parameters lies between certain upper and lower bounds or that
functions are monotonic, convex or quasi convex. Production functions, utility
functions, behavioral coefficients, and marginal productivities are obvious ex-
amples. When information of this form is available, it can be represented within
the context of this section by the following system of linear inequality con-
straints:
Rβ ≥ r    (3.2.34a)

or

Rβ − δ = r    (3.2.34b)

where δ ≥ 0. The prior information design matrix R and the real vector r can accommodate individual or linear combinations of inequality restrictions. For a mixed system such as r₁ ≤ R₁β ≤ r₂, we may let
R = [R₁; −R₁]   and   r = [r₁; −r₂]

The inequality-restricted least squares estimator b⁺ may be obtained by minimizing (y − Xβ)'(y − Xβ) subject to Rβ ≥ r, a quadratic programming problem whose Kuhn-Tucker conditions include

X'Xb⁺ − X'y − R'λ⁺ = 0    (3.2.38a)

and

Rb⁺ − r ≥ 0    (3.2.38b)

along with

λ⁺ ≥ 0    (3.2.38c)

and

λ⁺'(Rb⁺ − r) = 0    (3.2.38d)
Solving (3.2.38a) for b⁺ gives

b⁺ = (X'X)⁻¹X'y + (X'X)⁻¹R'λ⁺ = b + (X'X)⁻¹R'λ⁺    (3.2.39)

If all of the restrictions are redundant and Rb ≥ r, all of the elements of λ⁺ are zero, and from (3.2.39) the inequality-restricted solution reduces to the least squares solution. Alternatively, if all of the restrictions are binding, all of the elements of λ⁺ are positive and (3.2.39) reduces to

b⁺ = b + (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(r − Rb) = b*
the restricted least squares estimator. These results mean that for J independent restrictions, the inequality-restricted estimator is defined by a choice rule that, for each sample of data, selects among at most 2^J different restricted and unrestricted estimators. For example, if the restrictions are of the lower- and upper-bound type, that is, r₀ ≤ R₁β ≤ r₁, then J = 2 and there are at most 2² = 4 possible solutions, of which three are feasible:

1. Neither bound is binding and b⁺ = b.
2. The lower bound is binding and R₁b⁺ = r₀.
3. The upper bound is binding and R₁b⁺ = r₁.
The fourth case, that both constraints are binding, is excluded as a solution in this situation, since it is not possible unless r₀ = R₁β = r₁. If the two bounds coincide, r₀ = r₁, then there will be only one constraint and the solution will be either the restricted or the unrestricted estimator. If we assume that r₀ ≠ r₁, then, as Klemm and Sposito have suggested, the solution can be expressed in the closed form

b⁺ = b + (X'X)⁻¹R₁'(λ₁⁺ − λ₂⁺)

The solution for the Lagrange multipliers is λ₁⁺ = λ₂⁺ = 0 for case 1; λ₁⁺ = (r₀ − R₁b)[R₁(X'X)⁻¹R₁']⁻¹ > 0 and λ₂⁺ = 0 for case 2; and λ₁⁺ = 0 and λ₂⁺ = (R₁b − r₁)[R₁(X'X)⁻¹R₁']⁻¹ > 0 for case 3.
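For the lower- and upper-bound case just described, the three-case solution can be coded directly. The sketch below treats a single bounded linear combination c'β, with c, r₀, and r₁ supplied by the user; it is an illustration of the case logic, not a general quadratic programming routine.

```python
import numpy as np

def bounded_ils(X, y, c, r0, r1):
    """Inequality-restricted least squares for r0 <= c'beta <= r1 (one bounded
    linear combination), using the three feasible cases described above."""
    S_inv = np.linalg.inv(X.T @ X)
    b = S_inv @ X.T @ y
    g = c @ b                                    # c'b
    q = c @ S_inv @ c                            # c'(X'X)^{-1}c
    h = S_inv @ c / q                            # adjustment direction
    if g < r0:
        return b + h * (r0 - g)                  # case 2: lower bound binding
    if g > r1:
        return b + h * (r1 - g)                  # case 3: upper bound binding
    return b                                     # case 1: neither bound binding
```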
where H is defined in the obvious way and the mean square error or risk matrix is

E[(b* − β)(b* − β)'] = σ²(S⁻¹ − Hc'S⁻¹) + Hδδ'H'

For purposes of analysis it is important to note that S^(−1/2)c(c'S⁻¹c)⁻¹c'S^(−1/2) is a symmetric idempotent matrix of rank one and that an orthonormal matrix Q exists such that QS^(−1/2)c(c'S⁻¹c)⁻¹c'S^(−1/2)Q' has a one in the first diagonal position and zeros elsewhere, and

y = XS^(−1/2)Q'QS^(1/2)β + e = Zθ + e    (3.2.41)

where θ = QS^(1/2)β, Z = XS^(−1/2)Q', Z'Z = QS^(−1/2)X'XS^(−1/2)Q' = QQ' = I_K, and the inequality restriction becomes a restriction θ₁ ≥ r₀ on the first element of θ (3.2.42).
two different estimators: one restricted and the other unrestricted. In the θ space, the inequality-restricted estimator θ⁺ may be expressed as

θ⁺ = [I₍₋∞,r₀₎(w₁)r₀ + I₍r₀,∞₎(w₁)w₁ ; w₍K₋₁₎]    (3.2.43)

where w = Z'y is the least squares (maximum likelihood) estimator of θ, w₍K₋₁₎ contains its last K − 1 elements, and I₍a,b₎(w₁) is an indicator function that takes the value one when w₁ falls in the interval (a, b) and is zero otherwise. Using I₍r₀,∞₎(w₁) = 1 − I₍₋∞,r₀₎(w₁), the estimator in (3.2.43) may be rewritten as (3.2.44), and transforming back to the original parameter space gives

b⁺ = b − I₍₋∞,r₎(c'b)[S⁻¹c(c'S⁻¹c)⁻¹c'b − S⁻¹c(c'S⁻¹c)⁻¹r]    (3.2.45)
The mean of the inequality-restricted estimator follows from the truncated normal moments of w₁ (equations 3.2.47 and 3.2.48), where δ₀ = r₀ − θ₁ denotes the restriction specification error. Consequently, for −∞ < δ₀ ≤ 0, as δ₀ → −∞, E[θ₁⁺] → θ₁ and the estimator is asymptotically unbiased. Alternatively, as δ₀ → 0 the bias approaches that given in (3.2.49). Finally, for 0 ≤ δ₀ < ∞, as δ₀ → ∞, which implies r₀ → ∞, E[θ₁⁺] is asymptotic to the line f(δ₀) = θ₁ + δ₀, or, in other words, asymptotic to the mean of the equality-restricted estimator θ*. The characteristics of the bias function are shown in Figure 3.4.
In evaluating the sampling performance of the inequality-restricted estimator b⁺, we use the weighted risk function

ρ(b⁺, β) = E[(b⁺ − β)'W(b⁺ − β)]    (3.2.50)
Figure 3.4 Mean of the inequality-restricted estimator as a function of the restriction hypothesis specification error δ₀/σ: ____, inequality estimator; _._, pretest estimator.
where, in line with the partitioned parameter space, the weight matrix W is partitioned conformably with θ₁ and θ₍K₋₁₎, with a₁ the leading element. Evaluating the risk term by term yields an expression (3.2.52) involving the truncated moments E[I₍₋∞,δ₀σ⁻¹₎(u₁)u₁], E[I₍₋∞,δ₀σ⁻¹₎(u₁)u₁²], and E[I₍₋∞,δ₀σ⁻¹₎(u₁)]δ₀²a₁, where u₁ is a standard normal random variable.
As δ₀ → −∞ or as δ₀ → 0, the terms δ₀²P(χ² ≥ δ₀²/σ²) and P(χ² ≥ δ₀²/σ²) in (3.2.53) approach 0 and 1, respectively. Consequently, ρ(b⁺, β) approaches ρ(b, β) = σ²tr(S⁻¹W) or σ²tr(S⁻¹W) − (σ²/2)a₁, the risks of the unrestricted and inequality-restricted estimators, respectively. When δ₀ → ∞, ρ(b⁺, β) in (3.2.54) becomes asymptotic to the line f(δ₀) = σ²tr(S⁻¹W) − (σ² − δ₀²)a₁, which is the risk of the general equality-restricted estimator.
Figure 3.5 Risk functions for the maximum likelihood, equality-restricted, inequality-restricted, and inequality-pretest estimators (risk plotted against δ₀/σ, with the maximum likelihood risk standardized to one).
If the direction of the inequality is correct (i.e., δ₀ < 0), the risk is, over this range of the parameter space, equal to or less than the maximum likelihood risk σ²tr(S⁻¹W). Alternatively, if the direction of the inequality is incorrect (i.e., δ₀ > 0), then the inequality estimator risk is uniformly inferior to the equality estimator risk, and as δ₀ goes to infinity the risk function is unbounded.
Risk functions for the equality-restricted b*, inequality-restricted b+, and
maximum likelihood b estimators are given in Figure 3.5. In plotting the risks
the functions have been standardized so that the maximum likelihood risk is
one. These results indicate that when the direction of the inequality is correct,
the inequality-restricted estimator offers a viable and, under squared error loss,
superior rule to the conventional estimators using only sample information.
Assume that the elements of b are normally and independently distributed, that is, bᵢ ~ N(βᵢ, σ²), and consider the following null and alternative hypotheses about the coefficient βᵢ:

H₀: βᵢ ≥ rᵢ
Hₐ: βᵢ < rᵢ    (3.2.55)

A natural test statistic is

uᵢ = (bᵢ − rᵢ)/σ    (3.2.56)

which is distributed as a normal random variable with mean (βᵢ − rᵢ)/σ and variance 1. If it is assumed that δᵢ = rᵢ − βᵢ = 0, then uᵢ is a standard normal (0, 1) random variable, and the test structure could be formulated in terms of δᵢ, with H₀: δᵢ ≤ 0 and Hₐ: δᵢ > 0. Consequently, by using test statistic (3.2.56) we make use of the following test mechanism.
1. Reject the hypothesis if (bᵢ − rᵢ)/σ = uᵢ < cᵢ and use the maximum likelihood estimator bᵢ, where cᵢ is the critical value of the test from the standard normal table.
2. Accept the hypothesis if uᵢ = (bᵢ − rᵢ)/σ ≥ cᵢ and use the inequality-restricted estimator bᵢ⁺ = I₍₋∞,rᵢ₎(bᵢ)rᵢ + I₍rᵢ,∞₎(bᵢ)bᵢ.

By accepting the null hypothesis H₀, we take bᵢ⁺ as the estimate of βᵢ, and by rejecting H₀, the maximum likelihood estimate bᵢ is used. If the variance σ² is unknown, the central t distribution with (T − K) degrees of freedom is used as a basis for determining the critical value cᵢ for a given α level of the one-sided likelihood ratio test.
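The two-step mechanism above may be sketched as follows for a single coefficient with σ known; the significance level and function name are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def inequality_pretest_scalar(b_i, r_i, sigma, alpha=0.05):
    """One-sided pretest of H0: beta_i >= r_i with sigma known.
    Reject H0 -> maximum likelihood b_i; accept H0 -> inequality estimator."""
    u_i = (b_i - r_i) / sigma
    c_i = stats.norm.ppf(alpha)        # lower-tail critical value (negative)
    if u_i < c_i:
        return b_i                     # H0 rejected: use the ML estimate
    return max(b_i, r_i)               # H0 accepted: b_i+ = max(b_i, r_i)
```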
If two or more inequality hypotheses are involved, a more satisfactory approach is to test the entire set of inequality hypotheses simultaneously. In this context, a sequence of one-at-a-time tests leaves unknown the α level of the joint result.
As an illustration of the problem of joint hypotheses, let us follow Yancey, Judge, and Bock (1981) and suppose that the null hypothesis is H₀: β₁ ≥ 0 and β₂ ≥ 0 and the alternative hypothesis Hₐ is that at least one βᵢ < 0. Since we are working with the orthonormal linear statistical model, the remaining K − 2 elements of β can be ignored. With unknown variance the likelihood ratio test is to reject H₀ if the t statistics t₁ and t₂ fall in the critical region shown in Figure 3.6.

Figure 3.6 Acceptance and rejection regions for the joint test of two inequality hypotheses in the (t₁, t₂) plane.

Consequently, given α for the joint test, the critical value c² is determined such that

α = (1/2)P[F₍₁,T₋K₎ ≥ c²] + (1/4)P[F₍₂,T₋K₎ ≥ c²/2]

If the variance is known, the t's and F's are replaced by standard normal and χ² random variables.
One-sided confidence regions can be constructed that are counterparts to the inequality hypotheses, using the same distribution theory used for hypothesis testing. For example, if β is a (2 × 1) vector, b = (0.4, 0.7)', and σ = 1.1, from a sample of size 18, a one-sided 95% confidence region with c²/2 = 2.46 is the union of the regions β₁ ≥ b₁ − cσ = −1.33 for β₂ ≥ b₂; β₂ ≥ b₂ − cσ = −1.03 for β₁ ≥ b₁; and (β₁ − 0.4)² + (β₂ − 0.7)² ≤ 2.98 when β₁ < b₁ and β₂ < b₂. If Figure 3.6 were relabeled with (b₁, b₂) as the origin, t₁ as β₁, and t₂ as β₂, the complement of the critical region would be the one-sided confidence region just described. As Yancey, Judge, and Bock (1981) note, the inequality hypothesis testing results can be generalized to the case involving joint testing of three or more hypotheses. Also, the test extends to include combinations of inequality and equality hypotheses.
3.3 PRETEST ESTIMATORS

In applied work when there is uncertainty about the proper equality restrictions, hypotheses are specified in the form of (3.2.1), and one proceeds by making tests of the compatibility of the sample information b and the linear hypotheses Rβ = r. That is, questions such as the inclusion or exclusion of a variable or the choice of functional form are introduced in the form of general linear restrictions or hypotheses, and then conventional test procedures are used for decision or choice purposes.
To see the sampling implications of this test mechanism, let us again, for expository purposes, consider the orthonormal statistical model where X'X = I_K, a hypothesis design matrix of the form R = I_K, and the likelihood ratio test procedures of Section 2.1.3b of Chapter 2. To test the null hypothesis β = r against the alternative β ≠ r, it is conventional to use the test statistic

u = (b − r)'(b − r)/Kσ̂²    (3.3.1)
By accepting the null hypothesis, we take the restricted least squares estimator b* as our estimate of β; by rejecting the null hypothesis β − r = δ = 0, we use the unrestricted least squares estimator b. Thus the estimator that results is dependent on a preliminary test of significance, and the estimator used by an applied worker is therefore of the following form:

β̂ = b* if u < c;   β̂ = b if u ≥ c    (3.3.2)

Alternatively, the estimator may be written as

β̂ = I₍₀,c₎(u)b* + I₍c,∞₎(u)b
  = I₍₀,c₎(u)b* + [1 − I₍₀,c₎(u)]b
  = b − I₍₀,c₎(u)(b − b*) = b − I₍₀,c₎(u)(b − r)    (3.3.3)

where I₍₀,c₎(u) and I₍c,∞₎(u) are indicator functions that take the values I₍₀,c₎(u) = 1 and I₍c,∞₎(u) = 0 if the argument u falls within the interval zero to c, and I₍₀,c₎(u) = 0 and I₍c,∞₎(u) = 1 when u ≥ c. Therefore, in a repeated sampling context, the
data, the linear hypotheses, and the level of significance all determine the combination of the two estimators that is chosen on the average. From an applied standpoint, one thing that becomes apparent from the above conditions is the impact of the level of significance on the outcome for the pretest estimator. If the level of significance α is equal to zero, then the critical value c → ∞, so that

β̂ = b − I₍₀,∞₎(u)(b − r) = r    (3.3.4)

and the restricted estimator is always chosen. Alternatively, if α = 1, then c = 0,

β̂ = b − I₍₀,₀₎(u)(b − r) = b    (3.3.5)

and the least squares estimator is always chosen. Therefore the choice of α, which is usually made in a cavalier way in applied work, has a crucial role to play in determining the sampling performance of the pretest estimator.
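The role of α can be illustrated by simulation. The following sketch estimates the empirical risk of the pretest estimator (3.3.3) in the orthonormal model by Monte Carlo; the sample size, replication count, and seed are arbitrary choices.

```python
import numpy as np
from scipy import stats

def pretest_risk(beta, r, sigma2=1.0, T=30, alpha=0.05, reps=5000, seed=1):
    """Monte Carlo risk of the pretest estimator (3.3.3) in the orthonormal model."""
    K = beta.size
    rng = np.random.default_rng(seed)
    c = stats.f.ppf(1 - alpha, K, T - K)                      # critical value of the test
    loss = 0.0
    for _ in range(reps):
        b = beta + rng.normal(scale=np.sqrt(sigma2), size=K)  # b ~ N(beta, sigma^2 I)
        s2 = sigma2 * rng.chisquare(T - K) / (T - K)          # sigma_hat^2
        u = (b - r) @ (b - r) / (K * s2)
        est = r if u < c else b                               # rule (3.3.2), with b* = r
        loss += (est - beta) @ (est - beta)
    return loss / reps

# below sigma^2 K at the origin; near sigma^2 K when the hypothesis error is large
print(pretest_risk(np.zeros(4), np.zeros(4)), pretest_risk(np.full(4, 3.0), np.zeros(4)))
```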
The risk of the pretest estimator may be written compactly as

ρ(β̂, β) = σ²K − σ²Kl(2) + δ'δ[2l(2) − l(4)]    (3.3.6)

where l(2) and l(4) denote probabilities involving ratios of chi-square random variables and 1 > l(2) > l(4) > 0. From the risk function (3.3.6) the following results are clear.
1. If the restrictions are correct and δ = 0, the risk of the pretest estimator is σ²K[1 − l(2)], where 1 > 1 − l(2) > 0 for 0 < c < ∞. Therefore, the pretest estimator risk is less than that of the least squares estimator at the origin, and the decrease in risk depends on the level of significance α and, correspondingly, the critical value of the test c.
2. As the hypothesis error β − r = δ, and thus δ'δ/2σ², increases and approaches infinity, l(2), l(4), and the bias term δ'δ[2l(2) − l(4)] approach zero. Therefore, the risk of the pretest estimator approaches σ²K, the risk of the unrestricted least squares estimator.
3. As the hypothesis error grows, the risk of the pretest estimator increases, obtains a maximum after crossing the risk of the least squares estimator, and then monotonically decreases to approach σ²K, the risk of the least squares estimator.
4. The pretest estimator risk function defined on the δ'δ/2σ² parameter space crosses the risk function of the least squares estimator within the bounds K/4 ≤ δ'δ/2σ² ≤ K/2.
Figure 3.7 Risk functions for the least squares and restricted least squares estimators and typical risk function for the pretest estimator (plotted against the hypothesis error λ = δ'δ/2σ²).
Figure 3.8 Impact of the choice of the critical value of the test on the value of the pretest risk function (plotted against the hypothesis error λ = δ'δ/2σ²).
In applied problems one seldom knows the hypothesis errors and thus the correct λ. Consequently, the choice of the estimator is unresolved.
Let us note again that the form of the pretest estimator in (3.3.3) implies, for evaluation purposes, the probabilities of ratios of random variables l(2) and l(4) being less than a constant that depends on the critical value of the test c or α, the level of statistical significance. Thus, as α → 0, the probabilities l(·) → 1 and the risk of the pretest estimator approaches that of the restricted least squares estimator b*. In contrast, as α → 1, and thus l(·) approaches zero, the risk of the pretest estimator approaches that of the least squares estimator b. The impact of the choice of c or α, which has a crucial effect on the performance of the pretest estimator, is portrayed in Figure 3.8.
where α is the significance level of the test and F₍K,T₋K,λ₎ denotes a noncentral F random variable with noncentrality parameter λ, evaluated, for example, at λ = K/4 or λ = K/2. This test mechanism again leads to a preliminary test estimator that has the general characteristics of the traditional pretest estimator that was previously discussed.
Figure 3.9 Pretest estimator risk functions for alternative critical values, plotted against the hypothesis error λ.
Let us consider again the testing of inequality hypotheses that was considered in Section 3.2.3d. For expository purposes let us again consider the orthonormal statistical model, where bᵢ ~ N(βᵢ, σ²), the null hypothesis is H₀: βᵢ ≥ rᵢ, and the test statistic is uᵢ = (bᵢ − rᵢ)/σ. We noted in Section 3.2.3d that by accepting the null hypothesis H₀ we use bᵢ⁺, the inequality estimator of βᵢ, and if we reject H₀ the maximum likelihood estimator bᵢ is used. Consequently, when a preliminary test of the inequality hypothesis is made and a decision is taken based on the data at hand, the following pretest estimator results:
bᵢ⁺⁺ = I₍₋∞,cᵢ₎(uᵢ)bᵢ + I₍cᵢ,∞₎(uᵢ)bᵢ⁺    (3.3.9)

For any given critical value cᵢ, with −∞ < cᵢ < 0, if rᵢ − βᵢ = δᵢ = 0, then E[bᵢ⁺⁺] = βᵢ + (σ/√(2π))[1 − P(χ²₍₂₎ ≥ cᵢ²)], and consequently the estimator has a positive bias. However, as δᵢ → −∞, E[bᵢ⁺⁺] → βᵢ and the pretest estimator is unbiased.
1. If cᵢ = 0 and δᵢ = 0, then E[bᵢ⁺⁺] = βᵢ.
2. If cᵢ → −∞ and δᵢ = 0, then E[bᵢ⁺⁺] = βᵢ + σ/√(2π).
3. If cᵢ → −∞ and δᵢ → −∞, then E[bᵢ⁺⁺] = βᵢ.
When 0 ≤ δᵢ < ∞, which means that −∞ < cᵢ < 0, since rᵢ is actually greater than βᵢ and dᵢ = cᵢ + δᵢ/σ, the mean of the inequality pretest estimator is

E[bᵢ⁺⁺] = βᵢ + (σ/√(2π)){P[χ²₍₂₎ ≥ δᵢ²/σ²] − P[χ²₍₂₎ ≥ dᵢ²]}
        − (δᵢ/2){P[χ²₍₁₎ ≥ δᵢ²/σ²] + P[χ²₍₁₎ ≥ dᵢ²]}    (3.3.10)

or

E[bᵢ⁺⁺] = βᵢ − (δᵢ/2){P[χ²₍₁₎ ≥ δᵢ²/σ²] − P[χ²₍₁₎ ≥ dᵢ²]}
        + (σ/√(2π)){P[χ²₍₂₎ ≥ δᵢ²/σ²] − P[χ²₍₂₎ ≥ dᵢ²]}    (3.3.11)
For a fixed cᵢ < 0, if δᵢ → ∞, then E[bᵢ⁺⁺] → βᵢ. The unbiased outcome for δᵢ → ∞ results because the hypothesis (inequality restriction) is rejected all of the time and the maximum likelihood estimate is used for each sample.

For a given cᵢ, the bias as a function of δᵢ is given in Figure 3.4. These results imply that for any cᵢ < 0 the bias of the inequality pretest estimator in the δᵢ/σ space is equal to or less than that of the inequality-restricted estimator.
Figure 3.10 Risk functions for the inequality pretest and traditional pretest estimators, using critical values for the .05 significance level (plotted against δ²/σ²; - - -, exact restricted estimator).
The risk expression (3.3.12) holds for −∞ < cᵢ < 0 and −∞ < δᵢ < 0. When δᵢ → 0, the risk of the pretest estimator is equal to or less than that of the maximum likelihood estimator. As the critical value cᵢ → 0, ρ(bᵢ⁺⁺, βᵢ) → σ², the maximum likelihood risk. As cᵢ → −∞ and δᵢ → 0, ρ(bᵢ⁺⁺, βᵢ) → σ²/2, the risk of the inequality-restricted estimator. Alternatively, for any cᵢ, as δᵢ → −∞, ρ(bᵢ⁺⁺, βᵢ) → σ², the maximum likelihood risk.
Next, consider the case when 0 ≤ δᵢ < ∞ and cᵢ + δᵢ/σ ≤ 0. If we use the lemmas developed by Judge and Yancey (1978), the pretest risk may be rewritten as (3.3.13), an expression involving the terms

−(δᵢ²/2){P[χ²₍₁₎ ≥ δᵢ²/σ²] + P[χ²₍₁₎ ≥ dᵢ²]}    (3.3.13)
When δᵢ = 0 and −∞ < cᵢ ≤ 0, ρ(bᵢ⁺⁺, βᵢ) = (σ²/2)[1 + P(χ²₍₃₎ ≥ cᵢ²)], which is in agreement with the result of (3.3.12). As cᵢ → 0, ρ(bᵢ⁺⁺, βᵢ) → σ², the maximum likelihood risk.
For any fixed δᵢ, when cᵢ → −∞, the pretest risk approaches the corresponding inequality-restricted estimator risk. When −∞ < cᵢ ≤ 0 and cᵢ + δᵢ/σ > 0, the risk function for the pretest estimator contains the additional terms

+ (δᵢ²/2){P[χ²₍₃₎ ≥ δᵢ²/σ²] − P[χ²₍₃₎ ≥ dᵢ²]}    (3.3.15)
Since P[χ²₍₁₎ ≥ dᵢ²] ≥ P[χ²₍₁₎ ≥ δᵢ²/σ²], the inequality pretest risk is less than the maximum likelihood risk when δᵢ² < σ², greater when δᵢ² > σ², and equal when δᵢ² = σ². As δᵢ → ∞, ρ(bᵢ⁺⁺, βᵢ) → σ², the maximum likelihood risk. The characteristics of the inequality pretest risk function are portrayed in Figure 3.10.
In summary:

1. For any critical value c < 0, the bias of the inequality-restricted pretest estimator in the δ/σ space is always equal to or less than that of the inequality-restricted estimator.
2. For any critical value c < 0, when the direction of the inequality restriction is correct, the risk of the inequality-restricted pretest estimator when σ² is known is equal to or less than the risk of the maximum likelihood estimator for every value of the δ/σ parameter space. As c → 0, the risk of the inequality-restricted pretest estimator approaches the maximum likelihood risk. As c → −∞, the pretest estimator risk approaches the corresponding inequality-restricted estimator risk.
3. If the direction of the linear hypothesis is correct (δ < 0), the risk of the inequality-restricted pretest estimator as δ → −∞ approaches that of the equality-restricted or maximum likelihood estimator as c approaches −∞ or zero, respectively.
3.4 STEIN RULES

For the orthonormal statistical model with K ≥ 3 and σ² = 1, the James and Stein estimator is

β* = (1 − a/b'b)b    (3.4.1)
when theorems in Judge and Bock (1978, p. 322) are used in the evaluation. This result means that ρ(β, β*) < ρ(β, b) for all β if 0 < a < 2(K − 2). Differentiating (3.4.3) with respect to a shows that the value of a that minimizes the risk of β* is a = K − 2. Therefore, the optimal (minimum risk) James and Stein estimator is

β* = [1 − (K − 2)/b'b]b    (3.4.4)

and this rule demonstrates the inadmissibility of the maximum likelihood estimator. When β = 0, or λ = β'β/2 = 0, the risk of the James and Stein estimator is 2. The risk of β* increases to the risk of the maximum likelihood estimator (ρ(β, b) = K) as β'β → ∞. Thus for values of β close to the origin the gain in risk is considerable. The characteristics of the risk function for the James and Stein estimator are depicted in Figure 3.11.
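A minimal sketch of the optimal James and Stein rule (3.4.4) and a Monte Carlo check of its risk behavior, under the assumptions of this section (orthonormal model, σ² = 1); the dimension and replication settings are arbitrary.

```python
import numpy as np

def james_stein(b):
    """Optimal James-Stein rule (3.4.4): beta* = [1 - (K - 2)/b'b] b, sigma^2 = 1."""
    return (1.0 - (b.size - 2) / (b @ b)) * b

# empirical risk: approximately 2 at the origin, approaching K as beta'beta grows
rng = np.random.default_rng(0)
K, reps = 10, 20000
for scale in (0.0, 1.0, 5.0):
    beta = np.full(K, scale)
    loss = sum(np.sum((james_stein(beta + rng.normal(size=K)) - beta) ** 2)
               for _ in range(reps))
    print(scale, loss / reps)
```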
The James and Stein estimator (3.4.4) shrinks the maximum likelihood estimates toward a null mean vector. A more general formulation, which makes the arbitrary origin of the above case more explicit, considers a mean vector β₀ = r = b* and an estimator of the form

β̂* = [1 − a/(b − r)'(b − r)](b − r) + r

This estimator has bias and risk characteristics consistent with the conventional James and Stein estimator, that is, a risk of 2 when λ = δ'δ/2 = 0 and a risk of K as λ = δ'δ/2 → ∞, where δ = r − β.
Figure 3.11 The risk characteristics of the Stein rule and least squares rule (risk ρ(β, σ²; β*) plotted against λ = δ'δ/2σ²; in the unknown-variance case the Stein-rule risk at the origin is K − [(T − K)/(T − K + 2)](K − 2)).
To see that the Stein rule is but another type of pretest estimator for combining the unrestricted and restricted least squares estimators, remember that s/σ² = (T − K)σ̂²/σ² is distributed as a central χ²₍T₋K₎ random variable and that (b − r)'(b − r)/σ² is distributed as a noncentral χ²₍K,λ₎ random variable. Therefore, if we divide each of these independent chi-square random variables by its corresponding degrees of freedom, we can write the generalized Stein-rule estimator with unknown variance as

β̂ = (1 − a₁/u)(b − r) + r    (3.4.8)

where u = (b − r)'(b − r)/Kσ̂². Therefore, the Stein rules specified in (3.4.8) and (3.4.9) provide, under a squared error loss measure, a superior way of using the test statistic u to combine the least squares and restricted least squares estimators.
If the test statistic u = a₁, the Stein rule becomes the restricted least squares estimator. As a₁/u → 0 or, in other words, as the test statistic becomes large relative to the scalar a₁, the Stein rule approaches the least squares estimator. The ratio a₁/u determines how much the least squares estimator is adjusted toward the restricted (hypothesis) estimator. However, contrary to the traditional pretest estimator (3.3.3), the James and Stein rule is minimax.

Although the James and Stein rule is minimax, it is not admissible. Therefore, alternative superior rules exist. We begin our search for a better rule by noting that in (3.4.9), when u < a₁, the Stein rule changes the sign of the least squares estimates b. Some may agree on adjusting the magnitudes of the least squares estimates, but a change of sign is a somewhat different matter. As it turns out, the estimator
β̂** = I₍a₁,∞₎(u)(1 − a₁/u)(b − r) + r
    = [1 − min(1, a₁/u)](b − r) + r    (3.4.10)

uniformly improves on the James and Stein rule and is minimax when 0 < a₁ < 2(K − 2). Therefore, if a₁ < u < ∞, the Stein-rule estimator is used, and if u ≤ a₁, the restricted least squares (hypothesis) estimator is used. Consequently, if r = 0, either 0 or the Stein-rule estimate is used; therefore, the estimator has been called the Stein positive-rule estimator. Thus we have another way of combining prior and sample information, and this positive-rule estimator confirms the inadmissibility of the James and Stein rules and demonstrates a superior alternative. Unlike the James and Stein rules, there is no single value of a₁ that is optimal. The proof of these propositions is given in Judge and Bock (1978, pp. 181-197).
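The positive-rule modification is a one-line change from (3.4.8). In the sketch below, u is the test statistic u = (b − r)'(b − r)/Kσ̂² computed elsewhere, and a₁ is the chosen shrinkage constant; both are assumed inputs.

```python
import numpy as np

def positive_rule(b, r, a1, u):
    """Stein positive-rule estimator (3.4.10); u is the test statistic and a1
    the shrinkage constant with 0 < a1 < 2(K - 2)."""
    shrink = 1.0 - min(1.0, a1 / u)     # zero whenever u <= a1: the sign never reverses
    return shrink * (b - r) + r
```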
The implications of the Stein-rule results were neatly summarized in the
following remarks by Lindley in his discussion of a paper by Efron and Morris
(1973):
. . . the result of Stein's undermines the most important practical technique in statistics. . . . The next time you do an analysis of variance or fit a regression surface (a line is all right!) remember you are, for sure, using an unsound procedure. Unsound, that is, by the criteria you have presumably been using to justify the procedure. . . .
The results for Sections 3.4.1, 3.4.2, and 3.4.3 have been extended to include a general design matrix X, and these and related results on a minimax-admissible estimator are given by Judge and Bock (1978). In general, the minimax Stein result holds and only the conditions change.
One disadvantage of the Stein-type estimators is that much of the risk gain occurs near the origin, that is, when the shrink vector r is correct. To remedy this deficiency, Efron and Morris (1972) proposed a "limited translation" estimator, and Stein (1981) proposed an estimator where the shrinkage is determined on a component-by-component basis. For an evaluation of these estimators see Bohrer and Yancey (1984) and Dey and Berger (1983). For an empirical Bayes interpretation of the Stein-like estimators see Section 4.3.3.
Since the Stein rule combines the least squares and restricted least squares estimators, the next step is to compare these risk results with the traditional pretest estimator rule

β̂ = I₍₀,c₎(u)r + I₍c,∞₎(u)b = r + I₍c,∞₎(u)(b − r)    (3.4.11)

which uses the likelihood ratio test statistic and the level of the test α, or the critical value c, to determine the mixture of the two estimators. Since the Stein-rule estimator is uniformly superior to the least squares estimator, it seems intuitive that if in (3.4.11) we replace the least squares estimator by the James and Stein or Stein positive-rule estimator, that is, respecify the pretest estimator as

β̂ = I₍₀,c₎(u)r + I₍c,∞₎(u)β̂*
  = I₍₀,c₎(u)r + I₍c,∞₎(u)[(1 − a₁/u)(b − r) + r]
  = r + I₍c,∞₎(u)(1 − a₁/u)(b − r)    (3.4.12)

then this pretest estimator might be superior to the traditional pretest estimator. This is, in fact, the modified Stein-rule estimator proposed by Sclove, Morris, and Radhakrishnan (1972), and the theorems proved by them verify that the Stein pretest estimator (3.4.12) dominates (is uniformly superior to) the traditional pretest estimator (3.4.11). If 0 ≤ c ≤ a₁, the Stein-rule pretest estimator is a minimax estimator. Also, when c > a₁, the Stein-rule pretest estimator can no longer be a minimax substitute for the pretest (3.4.11); however, for comparable c values for both estimators the Stein rule continues to dominate under a squared error loss measure. These results are shown in Figure 3.12.
The problem of the optimal choice of the level of the test α, and thus the critical value c, still remains. However, it should be noted that by adding the Stein-rule estimators we have changed the minimum risk function reflected by Figure 3.9. Since the Stein-rule family, under squared error loss and the conditions noted above, dominates its traditional counterparts, the minimum risk function is now one that optimally combines b, b*, and β*, as reflected in Figure 3.13.

Figure 3.12 Risk functions for the traditional and Stein-rule pretest estimators.
Consider the orthonormal statistical model where the unknown coefficient vector β has dimension K and the variance σ² is known and equal to one. Assume that nonsample information is available on one of the parameters, say the first,
Figure 3.13 Minimum risk function that optimally combines b, b*, and β* (plotted against the specification error λ = δ²/σ²).
in the form of a single inequality restriction, β₁ ≥ 0. Under this setting, the inequality maximum likelihood estimator may be written as in (3.4.13), and the corresponding inequality Stein estimator is

bₛ⁺ = I₍₋∞,₀₎(b₁)[0, (1 − c₁/b₍K₋₁₎'b₍K₋₁₎)b₍K₋₁₎']' + I₍₀,∞₎(b₁)(1 − c₂/b'b)[b₁, b₍K₋₁₎']'

where b₍K₋₁₎ denotes the last K − 1 elements of b, and c₁ and c₂ are shrinkage constants.
From (3.4.16), the risk difference between the maximum likelihood inequality-restricted estimator and the Stein inequality-restricted estimator is given in (3.4.18).
Figure 3.14 Risk functions for the conventional, Stein, and inequality-restricted estimators.
The right-hand side of (3.4.18) is negative if 0 ≤ c₁ < 2(K − 3) and 0 < c₂ < 2(K − 2). The fact that under these conditions the right-hand side of (3.4.18) is negative implies that the Stein inequality estimator bₛ⁺ dominates, or is uniformly superior to, the inequality maximum likelihood restricted estimator b⁺ and therefore demonstrates its inadmissibility. The risk characteristics of the maximum likelihood estimator b, the inequality estimator b⁺, the James and Stein and positive-rule estimators β* and β**, and the inequality James and Stein and positive-rule estimators bₛ⁺ and bₛ⁺⁺ are shown in Figure 3.14. If the direction of the inequality restriction is correct, the inequality Stein estimators are risk superior to their conventional Stein counterparts.
The results of the preceding subsections suggest that, under a squared error loss measure, the Stein rules are uniformly superior to traditional equality and inequality maximum likelihood and pretest estimators. They can be recommended on the grounds of simplicity, generalizability, efficiency, and robustness.

In many cases, the choice between estimators rests on the trade-off between bias and precision. In the world of Stein rules, the data determine the compromise between these two extremes.
3.5 SUMMARY
Although the least squares estimator is minimax, Stein (1955) showed, under conditions normally fulfilled in practice, that there were other minimax estimators. James and Stein (1961) exhibited an estimator that, under squared error loss, dominates the least squares estimator and thus demonstrates its inadmissibility. This result means that the unbiased least squares rule may have an inferior mean square error when compared with some biased estimators.
Another difficulty with the conventional estimator arises when the investigator has information about the unknown parameters that is over and above that contained in the sample observations. Assuming this fortunate event, we have in this chapter reviewed three alternative ways to take account of this nonsample information for purposes of estimation and inference.
We first considered nonsample information about the unknown parameters in the form of equality constraints and noted the gain in precision when this information was combined with the sample information. We also noted that if this information was incorrect, the estimators were biased, and we therefore raised questions concerning the trade-off of bias for variance. Under a squared error loss measure, we investigated the sampling performance of the equality-restricted estimator and the traditional pretest estimator that use linear equality hypotheses and noted their unsatisfactory performance over a large range of the parameter space. In addition, we noted that the Stein pretest estimator dominated the traditional pretest estimator and thus demonstrated its inadmissibility.
Alternatively, the nonsample information may be available in the form of
linear stochastic restrictions. In this case, there is a sampling theory estimator
available, but this estimator has many of the characteristics of the equality-
restricted estimator and pretest estimator. In addition, one weakness of the
stochastic prior information estimator lies in assuming that the random vector
representing the uncertainty of the prior information has a mean of zero, that is,
the prior information is unbiased. Additional assumptions regarding a known
covariance matrix of the stochastic prior information seem unreasonably de-
manding.
Inequality restrictions offer another alternative for representing prior infor-
mation. If the direction of the inequality restriction(s) is correct, the inequality-
restricted estimator under squared error loss is uniformly superior to the maxi-
mum likelihood estimator. Also, when the direction of the inequality restriction
is correct, the inequality-restricted pretest estimator is superior to the tradi-
tional pretest estimator over practically the entire parameter space. If the direc-
tion of the inequality is incorrect, as the specification error grows, the risk of
the inequality-restricted estimator approaches that of the restricted least
squares estimator. If the direction of the inequality is incorrect, the inequality-
restricted pretest estimator is dominated by the traditional pretest estimator.
Therefore, if the applied worker is sure of the direction of the inequality, for
example, the parameters are nonnegative or nonpositive, then the inequality
estimator not only performs well relative to other alternatives but is uniformly
superior to the least squares estimator, which is used by most applied workers,
and the James and Stein estimator, which, under a squared error loss measure,
should be used by most applied workers. Inequality pretest estimators are risk
superior to the maximum likelihood estimator if the direction of the restriction
is correct. If the direction is not correct the risk performance is much like that
of the equality pretest estimator. Stein-like inequality estimators dominate the
traditional inequality estimator under squared error loss, and when the direc-
tion of the restriction is correct they are risk superior to the Stein estimators.
These and other attributes give the inequality estimation and hypothesis testing
framework great appeal in applied work in economics.
3.6 EXERCISES
Using the statistical model and the sample observations from the sampling
experiment described in Section 2.5.1 of Chapter 2, complete the following
exercises involving equality-restricted, stochastic-restricted, inequality-re-
stricted, and Stein-rule estimators.
Exercise 3.1
Develop restricted least squares estimates of the β parameter vector, using 100 samples of data, a restricted least squares estimator, and the restriction

Rβ = [0 1 1][β₁, β₂, β₃]' = 1
Develop the mean vector and covariance matrix for the 100 samples and compare the sampling results with those obtained using an unrestricted least squares estimator. Under a squared error loss measure, develop the risk for the restricted least squares estimates from the 100 samples and compare this empirical risk with the corresponding unrestricted least squares result.
Exercise 3.2
Consider the restriction in Exercise 3.1 as a general linear hypothesis and develop one hundred estimates of the β vector by using a pretest estimator with the α = .05 level of significance. Under a squared error loss measure, determine the empirical risk for the pretest estimator and compare it with that of the unrestricted and restricted least squares estimators. Change the restriction or general linear hypothesis to Rβ = β₂ + β₃ = 10.0 and develop the empirical risks for the unrestricted, restricted, and pretest estimators. Discuss the results.
Exercise 3.3
Using the stochastic restriction Rβ + v = β₂ + β₃ + v = 1, where v is assumed to be normal with mean zero and variance σ²/4, obtain 100 estimates of the β vector by using the stochastic restricted estimator under known and unknown variance σ². Develop the mean vector and covariance matrix and contrast them with the results from the unrestricted and restricted least squares estimators.
Exercise 3.4
Using the inequality restriction Rβ = β₂ + β₃ ≥ 1 and the corresponding inequality estimator, estimate the β vector for the 100 samples and compare and contrast the results with those from the unrestricted and restricted least squares estimators.
Exercise 3.5
Assuming σ² known and using the general linear inequality hypothesis β₂ + β₃ ≥ 1 and the corresponding pretest estimator (α = .05), develop pretest estimates for the 100 samples. Compute the corresponding empirical risk and compare and contrast it with the empirical risk for the traditional pretest estimator of Exercise 3.2.
Exercise 3.6
For the design matrix given in Section 2.5.1 of Chapter 2 there exists a square nonsingular matrix P that will transform the positive definite matrix X'X into an identity matrix; that is, by using a procedure such as the Cholesky decomposition there is a P matrix such that PX'XP' = I₃. Verify that

P = [  .223607    0         0
      −.513617   .331131    0
      −.585486  −.028953   .498237 ]

and consider the transformed model

y = Xβ + e = XPP⁻¹β + e = Zθ + e

where θ = P⁻¹β.
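In practice a matrix such as P can be obtained from a Cholesky factorization. The sketch below is one way to do this with numpy; whether the transformed design is taken as XP or XP' is a matter of convention, and here Z = XP' is used so that Z'Z = I follows from P(X'X)P' = I.

```python
import numpy as np

def orthonormalize(X):
    """Return P with P (X'X) P' = I, via the Cholesky factorization X'X = L L'."""
    L = np.linalg.cholesky(X.T @ X)     # lower triangular factor
    P = np.linalg.inv(L)                # P (X'X) P' = L^{-1} L L' (L')^{-1} = I
    Z = X @ P.T                         # transformed design with Z'Z = I
    return P, Z
```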
Given the transformed design matrix Z, use the least squares rule to obtain 100 estimates of the θ vector and compute the empirical risk for the least squares rule under squared error loss. Specify a James and Stein estimator for this orthonormal model, assuming σ² known and equal to 0.0625, which shrinks to the null vector; compute estimates of the parameter vector θ for the 100 samples and the corresponding empirical risk under squared error loss and compare with the least squares risk. Repeat the estimation process shrinking to the true parameter vector, compute the risk, and make the relevant risk comparisons.
Exercise 3.7
Within the context of Exercise 3.6, specify alternative positive Stein-rule estimators, compute the alternative estimates of the θ vector for the 100 samples, develop the corresponding empirical risks, and compare and contrast them with those of Exercise 3.6.
3.7 REFERENCES
The Bayesian
Approach to Inference
The posterior density function is obtained from the relationship

h(θ, y) = f(y|θ)g(θ) = g(θ|y)f(y)    (4.1.1)

where h is the joint density function for θ and y, g denotes a density function for θ, and f denotes a density function for y. Obviously, f and g sometimes denote marginal distributions and sometimes denote conditional distributions, and hence they are not used consistently to denote the same functional form. Nevertheless, in each case the meaning should be clear.

Rearranging (4.1.1) yields

g(θ|y) = f(y|θ)g(θ)/f(y)    (4.1.2)

This expression is known as Bayes' theorem. The posterior density function for θ is g(θ|y), since it summarizes all the information about θ after the sample y has been observed, and g(θ) is the prior density for θ, summarizing the nonsample information about θ. If we recognize that, with respect to θ, f(y) can be regarded as a constant and that f(y|θ) can be written as the likelihood function ℓ(θ|y), then (4.1.2) becomes

g(θ|y) ∝ ℓ(θ|y)g(θ)    (4.1.3)
Consider the linear statistical model

y = Xβ + e    (4.1.4)

where e is a T-dimensional random vector with the properties e ~ N(0, σ²I), and β and σ are unknown parameters about which we seek information. In the framework of (4.1.3), θ = (β', σ)'.

Interval estimation is treated in Section 4.6, predictive densities f(ỹ|y) in Section 4.5, and posterior odds p(H₀|y)/p(H₁|y) for hypothesis testing are discussed in Section 4.7; concluding remarks are given in Section 4.8. These contents are summarized in Table 4.1.
The likelihood function that summarizes the sample information plays a basic role in Bayesian inference. From Chapter 2 it is given by

f(y|β, σ) = (2πσ²)^(−T/2) exp[−(y − Xβ)'(y − Xβ)/2σ²]    (4.1.6)

If we drop irrelevant constants and write this function in terms of the jointly sufficient statistics (b', σ̂²), where b = (X'X)⁻¹X'y, σ̂² = (y − Xb)'(y − Xb)/v, and v = T − K, we obtain

ℓ(β, σ|y) ∝ σ^(−T) exp{−[vσ̂² + (β − b)'X'X(β − b)]/2σ²}    (4.1.7)
The choice of a prior distribution depends on the opinions and knowledge of the
researcher, and so it is impossible to choose a distribution that will necessarily
be appropriate under all circumstances. The approach usually taken, therefore,
is to consider broad classes of prior distributions that are likely to be useful in a
variety of situations. In this section we discuss use of a noninformative prior,
and, later in the chapter (Section 4.3), we consider classes of informative
priors.
The most common noninformative prior for β and σ is

g(β, σ) ∝ σ⁻¹    (4.2.1)

It is obtained by assuming each βₖ and σ are a priori independent, with the marginal densities given by g(βₖ) ∝ constant and g(σ) ∝ σ⁻¹. We then have g(β) ∝ constant and g(β, σ) = g(β)g(σ) ∝ σ⁻¹.

The density g(σ) ∝ σ⁻¹ is obtained by taking g(ln σ) = constant for −∞ < ln σ < ∞ and transforming to g(σ). Roughly speaking, g(βₖ) and g(ln σ) are noninformative priors because they are uniform densities and do not suggest that some values of the parameters are more likely than other values. However, it is natural to ask why, for example, g(ln σ) = constant was chosen instead of g(σ) = constant, and whether there are any general rules that can be applied to specify noninformative priors. For details of such information we refer the reader to Jeffreys (1967), Zellner (1971, 1977), Box and Tiao (1973), Villegas (1977), and Berger (1980). Briefly, there are a number of desirable characteristics that can be used to guide the choice.
Priors such as (4.2.1) do not integrate to a finite number, and density functions with this property are termed improper. Improper priors are often used for representing ignorance. The fact that they are not integrable is of no consequence for making posterior inferences, provided that they lead (via Bayes' theorem) to posterior densities that are proper. Such is generally the case, although two exceptions can be found in Tiao and Tan (1965) and Griffiths et al. (1979).
One interpretation of noninformative priors like (4.2.1) is that they are locally uniform [Box and Tiao (1973)]. That is, they are assumed to hold over the
region of the parameter space for which the likelihood function is appreciable,
but outside this region they taper off to zero, and are therefore proper. This
interpretation goes part way to overcoming some objections that have been
raised concerning improper priors, but, pragmatically, it makes little difference
whether the parameter range is finite or infinite; the resulting posterior distribu-
tion is essentially the same.
Three main reasons seem to have been put forward for using noninformative
priors. First, there may be circumstances where we have very little information
on the parameters (our knowledge is "vague" or "diffuse"), and we would like
a prior distribution that reflects this lack of information. Second, although we
might have substantial prior information, there may be occasions where it is
considered more objective to present a posterior distribution that reflects only
sample information, instead of "biasing" it with our own prior views. Finally,
it is frequently very difficult to formulate an appropriate informative prior
distribution, and, despite the consequential suboptimality, an easy way out of
the difficulty is to use a noninformative prior.
Combining the likelihood in (4.1.7) with the noninformative prior g(β, σ) ∝ σ⁻¹ leads to the joint posterior density

g(β, σ|y) ∝ g(β, σ)ℓ(β, σ|y) ∝ σ^(−(T+1)) exp{−[vσ̂² + (β − b)'X'X(β − b)]/2σ²}    (4.2.2)

This joint density function summarizes all our postsample information about β and σ. If interest centers on only one or two parameters, however, it is likely to be a rather cumbersome way of presenting the information, and so it is useful to consider various marginal and conditional posterior densities. Also, an examination of the marginal and conditional densities provides some insights into the nature of the joint density in (4.2.2). With these points in mind we rewrite (4.2.2) as

g(β, σ|y) = g(β|σ, y)g(σ|y)    (4.2.3)

where

g(β|σ, y) = (2π)^(−K/2)σ^(−K)|X'X|^(1/2) exp[−(β − b)'X'X(β − b)/2σ²]    (4.2.4)

and

g(σ|y) = [2/Γ(v/2)](vσ̂²/2)^(v/2)σ^(−(v+1)) exp(−vσ̂²/2σ²)    (4.2.5)
Equation 4.2.4 shows that the conditional posterior density function for β given σ is multivariate normal with mean vector b and covariance matrix σ²(X'X)⁻¹. We are able to recognize this fact by examining the terms that contain β in the exponent of Equation 4.2.2. That part of the density function that determines its functional form (exp[−(β − b)'X'X(β − b)/2σ²] in this case) is called the kernel of the density function, while the other part [(2π)^(−K/2)σ^(−K)|X'X|^(1/2)] is the normalizing constant, introduced to make the function integrate to one. We find the normalizing constant by using known properties of the multivariate normal distribution. Note that, because the density in (4.2.4) is conditional on σ, the normalizing constant includes the term σ^(−K). In (4.2.5) part of the term σ^(−(v+1)) compensates for this introduction.

If, at the outset, we had assumed that σ was known, then our prior would simply be g(β) ∝ constant, and the posterior density for β would be the multivariate normal in (4.2.4). Marginal posterior densities for the individual elements in β would be the univariate normal densities derived from (4.2.4).
The marginal posterior density function for σ given in (4.2.5) is in the form of an inverted-gamma density function with parameters v and σ̂², and it is recognizable through its kernel σ^(−(v+1)) exp(−vσ̂²/2σ²). The term [2/Γ(v/2)](vσ̂²/2)^(v/2) is the normalizing constant. See Zellner (1971) for details. Because of the decomposition in (4.2.3) through (4.2.5), the joint density function g(β, σ|y) is often referred to as a normal-gamma distribution. The term "normal-gamma" is actually more in accordance with other presentations [e.g., DeGroot (1970)] where the analysis is in terms of a precision parameter h = 1/σ², and the marginal distribution for h is a gamma (not inverted) distribution.
In practice σ is unknown, and so a posterior distribution for β that is more relevant than (4.2.4) is the marginal one obtained by integrating out the nuisance parameter σ. That is,

g(β|y) = ∫₀^∞ g(β, σ|y) dσ = ∫₀^∞ g(β|σ, y)g(σ|y) dσ    (4.2.6)

The second line in (4.2.6) shows that the marginal posterior density g(β|y) can be viewed as a weighted average of the conditional densities g(β|σ, y) with weights given by the marginal density for σ, g(σ|y). The integral in (4.2.6) is solved by using properties of the gamma function [see, e.g., Judge et al. (1982, p. 86)]. From the kernel of g(β|y) given in the last line of (4.2.6) we can recognize that this density is a multivariate t with mean b, covariance matrix [v/(v − 2)]σ̂²(X'X)⁻¹, and degrees of freedom parameter v = T − K. See Judge et al. (1982, p. 219) for the normalizing constant and properties of the multivariate t.
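The posterior quantities just described are simple functions of the least squares results. A minimal sketch, assuming numpy and the notation of this section (the returned covariance requires v > 2):

```python
import numpy as np

def posterior_noninformative(X, y):
    """Posterior for the normal linear model under g(beta, sigma) proportional
    to 1/sigma: g(beta|y) is multivariate t with mean b and dof v = T - K."""
    T, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    v = T - K
    s2 = (y - X @ b) @ (y - X @ b) / v
    cov = v / (v - 2) * s2 * XtX_inv    # posterior covariance of beta (v > 2)
    return b, s2, v, cov
```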
The marginal posterior density for a single element of β, say β₁, is the univariate t density

g(β₁|y) ∝ [1 + (1/v)(β₁ − b₁)²/(σ̂²a₁₁)]^(−(1+v)/2)    (4.2.7)

where a₁₁ is the first diagonal element of (X'X)⁻¹.
4.3.1 Specification of a
Natural Conjugate Prior
In principle it does not matter what kind of density function has been used to frame prior information on (β', σ); it still can be combined with the likelihood
function to form a posterior density function. Thus, in principle, we can choose
the type of prior density function that best reflects our prior information. In
practice, however, a number of different types of density functions can usually represent our prior information equally well, and, in terms of mathematical
convenience, some of these will combine more easily with the likelihood func-
tion than others. Given these circumstances it is desirable to choose a prior
distribution from a family of density functions that combines conveniently with
the likelihood function and that is rich in the sense that it can represent a wide
variety of prior opinions. In many instances natural conjugate priors have these
properties. They are algebraically convenient because a natural conjugate prior
leads to a posterior density with the same functional form, and hence the
posterior density from one sample will be the natural conjugate prior for a
future sample; and they are usually sufficiently flexible to represent a diversity
of prior opinions. [There are some cases, however, where results can be quite
sensitive to specification of the informative prior. See, e.g., Berger (1980,
p. 85).]
To obtain the natural conjugate prior for a set of parameters, we first write the likelihood function in terms of its sufficient statistics and then note its type of functional form when viewed as a function of the unknown parameters. The likelihood function for the model we are considering was written in terms of its sufficient statistics in Equation 4.1.7. When viewed as a function of β and σ, ℓ(β, σ|y) is in the form of a normal-gamma function (see Section 4.2) because it can be written as

ℓ(β, σ|y) ∝ {σ^(−K) exp[−(β − b)'X'X(β − b)/2σ²]}{σ^(−v) exp(−vσ̂²/2σ²)}    (4.3.1)

where the first factor, a multivariate normal kernel in β, is

σ^(−K) exp[−(β − b)'X'X(β − b)/2σ²]    (4.3.2)

and the second, an inverted-gamma kernel in σ, is

σ^(−v) exp(−vσ̂²/2σ²)    (4.3.3)
This suggests that the natural conjugate prior distribution for (β', σ) is one where the conditional distribution of β given σ is multivariate normal and the marginal distribution for σ is an inverted-gamma distribution. Thus for our informative prior we take

g(β|σ) ∝ σ^(−K) exp[−(β − β̄)'A(β − β̄)/2σ²]    (4.3.4)

and

g(σ) = [2/Γ(v̄/2)](v̄s̄²/2)^(v̄/2)σ^(−(v̄+1)) exp(−v̄s̄²/2σ²)    (4.3.5)

with

mode(σ) = [v̄/(v̄ + 1)]^(1/2) s̄    (4.3.6)
In (4.3.4) and (4.3.5) prior parameters take the place of the sufficient statistics
that appear in (4.3.2) and (4.3.3), respectively. Also, for notation, a single bar
over a symbol denotes a prior parameter, and, in later sections, a double bar
denotes a posterior parameter.
Combining (4.3.4) and (4.3.5) and omitting the constants of proportionality gives the normal-gamma joint prior density

g(β, σ) ∝ σ^(−(K+v̄+1)) exp{−[v̄s̄² + (β − β̄)'A(β − β̄)]/2σ²}    (4.3.7)

The marginal prior density for β is obtained by integrating σ out of (4.3.7) and is the multivariate-t density

g(β) ∝ [1 + (1/v̄)(β − β̄)'(A/s̄²)(β − β̄)]^(−(K+v̄)/2)    (4.3.8)
Using Bayes' theorem (4.1.5) to combine the natural conjugate prior (4.3.7)
with the likelihood function (4.1.6) yields the joint posterior density function

$$g(\beta,\sigma \mid y) \propto g(\beta,\sigma)\,\ell(\beta,\sigma \mid y) \qquad (4.3.9a)$$

$$= \sigma^{-T-K-\bar{\nu}-1}\exp\left\{-\frac{1}{2\sigma^2}\left[\bar{\nu}\bar{s}^2 + (\beta - \bar{\beta})'A(\beta - \bar{\beta}) + (y - X\beta)'(y - X\beta)\right]\right\} \qquad (4.3.9b)$$
where

$$\bar{\bar{\beta}} = (A + X'X)^{-1}(A\bar{\beta} + X'Xb) \qquad (4.3.10)$$

and

$$\bar{\bar{\nu}}\bar{\bar{s}}^2 = \bar{\nu}\bar{s}^2 + (y - Xb)'(y - Xb) + (b - \bar{\beta})'\left[A^{-1} + (X'X)^{-1}\right]^{-1}(b - \bar{\beta}) \qquad (4.3.11)$$

and

$$\bar{\bar{\nu}} = T + \bar{\nu} \qquad (4.3.12)$$

Using these definitions and a decomposition similar to that used for deriving
(4.1.7) from (4.1.6), we can write (4.3.9c) in normal-gamma form, and integrating
out σ gives the multivariate-t marginal posterior density

$$g(\beta \mid y) \propto \left[1 + \frac{1}{\bar{\bar{\nu}}}(\beta - \bar{\bar{\beta}})'\frac{(A + X'X)}{\bar{\bar{s}}^2}(\beta - \bar{\bar{\beta}})\right]^{-(K+\bar{\bar{\nu}})/2} \qquad (4.3.14)$$
The marginal posterior density for a single element from β, say β₁, will have
a univariate-t distribution with density function

$$g(\beta_1 \mid y) \propto \left[1 + \frac{(\beta_1 - \bar{\bar{\beta}}_1)^2}{\bar{\bar{\nu}}\bar{\bar{s}}^2 c_{11}}\right]^{-(\bar{\bar{\nu}}+1)/2}$$

where β̿₁ is the first element in β̿, and c₁₁ is the first diagonal element in (A +
X'X)⁻¹.
These posterior density functions represent our current state of knowledge
(prior and sample) about the parameters and, if desired, they can be used to
obtain point or interval estimates or to test hypotheses. Note from Equation
4.3.10 that the posterior mean β̿ can be regarded as a matrix weighted average
of the prior mean β̄ and the sampling theory estimator b, with weights given by
the inverses of the conditional covariance matrices σ²A⁻¹ and σ²(X'X)⁻¹.
The properties of matrix-weighted averages have been studied by Chamberlain
and Leamer (1976) and Leamer (1978, 1982). One property is that, except
under special and unlikely circumstances, it will not be true that β̿ lies between
b and β̄, in the sense that each element in β̿ lies between the corresponding
elements from the other two vectors. This result is in contrast to the case of a
simple weighted average (see Section 4.3.4). A simple illustrative example can
be found in Judge et al. (1982, p. 230). For given values of b, β̄, and X'X,
Chamberlain and Leamer provide an ellipsoid bound that will contain β̿ regardless
of the value of the prior covariance matrix σ²A⁻¹. Leamer (1982) extends
this result by deriving ellipsoid bounds for cases where σ²A⁻¹ can be bounded
from above and/or below. These ellipsoid bounds are useful because specification
of A is often difficult, and because it is useful to be able to present posterior
information that is relevant for a wide range of prior distributions.
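The following sketch (not from the text; the data, prior mean, and prior precision matrix are all hypothetical) illustrates this matrix-weighted averaging numerically:

```python
# Sketch: the natural conjugate posterior mean as a matrix-weighted average,
# beta'' = (A + X'X)^{-1} (A*beta_bar + X'X*b). All values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
T = 50
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=2.0, size=T)

b = np.linalg.solve(X.T @ X, X.T @ y)        # least squares estimator
beta_bar = np.array([0.5, 1.5])              # prior mean (assumed)
A = np.array([[4.0, 3.0], [3.0, 4.0]])       # prior precision matrix (assumed)

# posterior mean: matrix-weighted average of beta_bar and b
post_mean = np.linalg.solve(A + X.T @ X, A @ beta_bar + X.T @ X @ b)
print(b, post_mean)
# With a nondiagonal A, individual elements of the posterior mean need not
# lie between the corresponding elements of beta_bar and b.
```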
Another point worth noting is that using a natural conjugate prior is equivalent
to assuming that we have information from some previous hypothetical
sample. If this hypothetical sample information is combined with our current
sample information, the posterior density given above is obtained.
A special case that yields a simple weighted average is the g prior obtained by
setting A = g₀X'X, so that the prior information is

$$g(\beta \mid \sigma) \propto \sigma^{-K}\exp\left[-\frac{g_0}{2\sigma^2}(\beta - \bar{\beta})'X'X(\beta - \bar{\beta})\right] \qquad (4.3.16)$$

and

$$g(\sigma) \propto \sigma^{-1} \qquad (4.3.17)$$

Combining this prior with the likelihood function gives the joint posterior density

$$g(\beta,\sigma \mid y) \propto \sigma^{-(T+K+1)}\exp\left\{-\frac{1}{2\sigma^2}\left[\bar{\bar{\nu}}\bar{\bar{s}}^2 + (1+g_0)(\beta - \bar{\bar{\beta}})'X'X(\beta - \bar{\bar{\beta}})\right]\right\} \qquad (4.3.18)$$

where ν̿ = T,

$$\bar{\bar{\nu}}\bar{\bar{s}}^2 = (y - Xb)'(y - Xb) + \frac{g_0}{1+g_0}(b - \bar{\beta})'X'X(b - \bar{\beta}) \qquad (4.3.19)$$

and

$$\bar{\bar{\beta}} = \frac{g_0\bar{\beta} + b}{1 + g_0} \qquad (4.3.20)$$
Thus, the posterior mean for β becomes a simple weighted average of the prior
mean β̄ and the least squares estimate b, with the relative weights determined
by g₀. Marginal posterior densities for β and σ can be obtained from (4.3.18)
along the lines described in Section 4.3.2. See Zellner (1982a) for further results
and for an evaluation of the sampling properties of the posterior mean.
An alternative specification is to take β and σ to be a priori independent, with
a multivariate-t prior

$$g(\beta) \propto \left[1 + \frac{1}{\nu^*}(\beta - \bar{\beta})'A(\beta - \bar{\beta})\right]^{-(K+\nu^*)/2} \qquad (4.3.21)$$

and the inverted-gamma prior

$$g(\sigma) \propto \frac{1}{\sigma^{\bar{\nu}+1}}\exp\left[-\frac{\bar{\nu}\bar{s}^2}{2\sigma^2}\right] \qquad (4.3.22)$$

Combining these densities with the likelihood function and integrating out σ
yields a posterior density for β that is the product of two t kernels, the "poly-t"
density

$$g(\beta \mid y) \propto \left[1 + \frac{1}{\nu^*}(\beta - \bar{\beta})'A(\beta - \bar{\beta})\right]^{-(K+\nu^*)/2}\left[\bar{\nu}\bar{s}^2 + (y - X\beta)'(y - X\beta)\right]^{-(T+\bar{\nu})/2} \qquad (4.3.23)$$
It is relatively easy to conceptualize prior information in terms of the means
and variances of the elements in β. It is much more difficult, however, even for
the trained statistician, to conceptualize prior information in terms of the covariances
of the elements in β. Also, although it might be possible to get some idea about
likely values for σ because of their implications for the prior covariance matrix
cov(β|σ), it is not immediately obvious how to proceed.
These facts have led to research into the assessment or elicitation of prior
density functions. We will not give details of the various suggestions, but we
note that one possibility is to use elicited percentiles, as mentioned above,
while another is to assess what is known as the prior predictive density for y for
a number of settings of the explanatory variables and to use these to derive
parameters such as β̄, A, ν̄, and s̄. Methods for combining information from
both approaches have also been suggested. For details we refer the reader to
Dickey (1980), Kadane (1980), Kadane et al. (1980), Winkler (1967, 1977, 1980),
and Zellner (1972).
4.4 POINT ESTIMATION

The posterior density function for a parameter summarizes all the information
that a researcher has about that parameter; thus, once the equation of the
posterior density has been derived this could be regarded as a reasonable point
at which to conclude a study. Under some circumstances, however, a researcher
needs to make a decision on just one value (or point estimate) of a
parameter, and in this situation, we need some procedure for determining
which value is best. In the Bayesian framework this procedure consists of (1)
assuming that there is a loss function, L(θ,θ̂), that describes the consequences
of basing a decision on the point estimate θ̂ when θ is the true parameter vector,
and (2) choosing that value of θ̂ that minimizes expected loss where the expectation
is taken with respect to the posterior density function for θ.
Thus, the Bayesian estimator for θ is that value of θ̂ which minimizes

$$E_{\theta \mid y}\left[L(\theta,\hat{\theta})\right] = \int L(\theta,\hat{\theta})\,g(\theta \mid y)\,d\theta$$
For the quadratic loss function L(θ,θ̂) = (θ̂ − θ)'Q(θ̂ − θ),
where Q is a positive definite matrix, it can be readily shown [e.g., Judge et al.
(1982, p. 232)] that the posterior mean E[θ|y] is the optimal Bayesian point
estimate. In terms of the coefficient vector β for the general linear model y =
Xβ + e where E[e] = 0 and E[ee'] = σ²I, use of a natural conjugate prior led to
the posterior mean β̿ = (A + X'X)⁻¹(Aβ̄ + X'Xb); with a diffuse prior the
posterior mean was the least squares estimator b = (X'X)⁻¹X'y. For σ and σ²
the posterior means when a natural conjugate prior is used are the corresponding
moments of the inverted-gamma posterior density.
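As a quick numerical check of the optimality of the posterior mean under quadratic loss, the following sketch minimizes Monte Carlo expected loss over a grid, using a hypothetical normal stand-in for the posterior:

```python
# Sketch: with quadratic loss L = (theta_hat - theta)^2, expected posterior
# loss is minimized at the posterior mean. Illustrative posterior draws only.
import numpy as np

rng = np.random.default_rng(1)
draws = rng.normal(loc=2.0, scale=0.5, size=100_000)  # stand-in posterior

def expected_loss(theta_hat):
    # Monte Carlo estimate of E[(theta_hat - theta)^2 | y]
    return np.mean((theta_hat - draws) ** 2)

grid = np.linspace(1.0, 3.0, 201)
losses = [expected_loss(th) for th in grid]
print(grid[int(np.argmin(losses))], draws.mean())  # both close to 2.0
```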
Let us now compare the Bayesian approach of obtaining estimators that minimize
expected posterior loss with the decision theory approach employed
within a sampling theory framework. This latter approach was discussed in
Chapter 3. Reiterating briefly, in the sampling theory approach, we first assume,
as we did for the Bayesian approach, that we have a loss function L(θ,θ̂)
that reflects the consequences of choosing θ̂ when θ is the real parameter
vector. Then we assume that a desirable estimator is one that minimizes expected
loss where the expectation is taken with respect to y. That is, we would
like to choose an estimator where average loss is a minimum, with the average
calculated from the losses incurred in repeated samples. This average loss is
known as the risk function and is given by

$$\rho(\theta,\hat{\theta}) = E_{y \mid \theta}\left[L(\theta,\hat{\theta})\right] = \int L(\theta,\hat{\theta}(y))\,f(y \mid \theta)\,dy$$

Because the risk function depends on the unknown θ, one strategy is to choose
the estimator that minimizes average risk, with the prior density g(θ) used as
the weighting function:

$$E_{\theta}\left[\rho(\theta,\hat{\theta})\right] = \int \rho(\theta,\hat{\theta})\,g(\theta)\,d\theta \qquad (4.4.4a)$$
To see the relationship between this estimator (which is also called a Bayesian
estimator) and one that minimizes expected posterior loss, we rewrite (4.4.4a)
as

$$E_{\theta}\left[\rho(\theta,\hat{\theta})\right] = \int f(y)\left[\int L(\theta,\hat{\theta})\,g(\theta \mid y)\,d\theta\right]dy \qquad (4.4.4b)$$

where we have used the fact that f(y|θ)g(θ) = g(θ|y)f(y) and we have changed
the order of integration. The function in (4.4.4b) will be minimized (with respect
to θ̂) when the integral within the square brackets is minimized. This integral is
equal to expected posterior loss, and hence the two estimators are identical,
providing the step from (4.4.4a) to (4.4.4b) is legitimate.
Giles and Rayner show that MSE(β̿) − MSE(b) is negative semidefinite if
and only if the condition in

(4.4.6)

holds. This inequality will always hold if the prior information is unbiased in the sense
that β̄ = β. When β̄ ≠ β, whether or not (4.4.6) holds will depend on the
magnitude of the prior bias relative to the precision of the sample and prior
information.
The properties of Bayesian and sampling theory estimators have also been
compared for a number of other models. See, for example, Fomby and Guilkey
(1978), Surekha and Griffiths (1984), and Zellner (1971, Ch. 9).
Zellner (1978) considers minimum expected loss (MELO) estimation of the
reciprocal θ₁ = 1/β₁ and the ratio θ₂ = β₁/β₂. For the reciprocal, the MELO
estimator can be written as

$$\hat{\theta}_1^* = \frac{1}{E[\beta_1]}\cdot\frac{1}{1 + \text{var}[\beta_1]/E[\beta_1]^2} \qquad (4.4.7)$$

where, again, E[·], var[·], and cov[·,·] refer to the posterior moments.
We now consider some results relating to θ̂₁* and θ̂₂* for the special case
where the diffuse prior g(β,σ) ∝ σ⁻¹ is employed in conjunction with the
normal general linear model y = Xβ + e. In this case the posterior moments in
(4.4.7) and (4.4.8) are E[βᵢ] = bᵢ, var[βᵢ] = νσ̂²aᵢᵢ/(ν − 2) and cov[βᵢ,βⱼ] =
νσ̂²aᵢⱼ/(ν − 2), where bᵢ is the ith element of b and aᵢⱼ is the (i,j)th element of
(X'X)⁻¹. Many of the following results also hold under more general circumstances
[see Zellner (1978)].
1. The maximum likelihood (ML) estimators for θ₁ and θ₂ are given by 1/b₁
and b₁/b₂, respectively. Thus, the estimators θ̂₁* and θ̂₂* can be viewed as the
ML estimators multiplied by a factor that depends on the relative second-order
posterior moments. In the case of the reciprocal (Equation 4.4.7), the additional
factor lies between zero and 1 and so can be regarded as a "shrinking"
factor. If the coefficient of variation of β₁ is small, and hence β₁ = 0 is an
unlikely value, the shrinking factor will be close to 1, and θ̂₁* will be close to
1/b₁. When the coefficient of variation is large and small values of β₁ are likely,
θ̂₁* and 1/b₁ can be quite different. In the ratio case the relationship between θ̂₂*
and b₁/b₂ varies depending on cov[β₁,β₂]/(E[β₁]E[β₂]), as well as on the coefficient
of variation of β₂. See Zellner (1978) for details.
2. The posterior means for 1/β₁ and β₁/β₂ do not exist and so optimal point
estimates for these quantities do not exist relative to the quadratic loss function
L = (θ̂ − θ)². Analogously, the ML estimators 1/b₁ and b₁/b₂ do not possess
finite sample moments and have infinite risk relative to quadratic loss. [In the
sampling theory approach, inference about 1/β₁, for example, is usually based
on the limiting distribution of √T(b₁⁻¹ − β₁⁻¹), which is normal with mean zero
and variance σ²β₁⁻⁴a₁₁. See Section 5.3.1.]
3. The estimators θ̂₁* and θ̂₂* possess at least finite second order moments
and hence have finite risk relative to quadratic loss functions. Asymptotically,
θ̂₁* and θ̂₂* have the same distribution as the ML estimators.
4. In some situations the posterior distributions of θ₁ and θ₂ (and the distributions
of the ML estimators 1/b₁ and b₁/b₂) will have more than one pronounced
mode. This could occur, for example, in small sample situations
where the posterior coefficients of variation of β₁ and β₂ are large. Under these
circumstances the estimators θ̂₁* and θ̂₂* are undesirable because it is highly
likely that the MELO estimate will lie between the two posterior modes. For
some alternative strategies in this case see Zellner (1978, p. 156).
Further properties and details of the estimators θ̂₁* and θ̂₂*, and their extension
to simultaneous equations models, can be found in Zellner (1978), Zellner
and Park (1979), and Park (1982). See also Chapter 15. Zaman (1981) discusses
estimation of a reciprocal with respect to other loss functions and prior distributions.
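A minimal sketch of the reciprocal case (the posterior moment values below are hypothetical) compares the MELO estimate of (4.4.7) with the ML estimate 1/b₁:

```python
# Sketch of the MELO estimator (4.4.7) for theta_1 = 1/beta_1 under a diffuse
# prior; posterior moments follow the text: var[beta_1] = nu*s2*a11/(nu - 2).
import numpy as np

b1 = 0.8                         # least squares estimate (hypothetical)
nu, sighat2, a11 = 20, 1.5, 0.04
post_var = nu * sighat2 * a11 / (nu - 2)

shrink = 1.0 / (1.0 + post_var / b1**2)   # shrinking factor in (4.4.7)
theta1_melo = (1.0 / b1) * shrink
print(1.0 / b1, theta1_melo)
# When the posterior coefficient of variation of beta_1 is small, the
# shrinking factor is near 1 and the two estimates nearly coincide.
```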
Another sampling theory estimator that can be given a Bayesian interpretation
is the ridge regression estimator

$$\hat{\beta}_R = (X'X + kI)^{-1}X'y \qquad (4.4.9)$$

where k is a value that is chosen such that the coefficient estimates have
"stabilized." See Chapter 22 for details. Suppose, in the Bayesian approach,
we specify the natural conjugate prior (β|σ) ~ N(β̄, σ²A⁻¹) and take the special
case where β̄ = 0 and A⁻¹ = k⁻¹I. It is clear that this prior leads to the posterior
mean β̿ given in Equation (4.4.9). Thus the ridge regression estimator can be
viewed as a Bayesian estimator under quadratic loss, with the prior for β such
that each of its elements is (a priori) independent and identically distributed
with zero mean. It is difficult to envisage an economic environment where such
a prior is relevant. Lindley and Smith (1972) suggest an educational example of
possible relevance, but they too are generally skeptical of the appropriateness
of the assumption.
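The equivalence is easy to verify numerically; the following sketch (with hypothetical data and an arbitrary k) checks that the ridge estimate equals the conjugate posterior mean with β̄ = 0 and A = kI:

```python
# Sketch: ridge estimator (X'X + kI)^{-1} X'y equals the conjugate posterior
# mean when beta_bar = 0 and A = kI. Illustrative data only.
import numpy as np

rng = np.random.default_rng(2)
T, K, k = 40, 3, 5.0
X = rng.normal(size=(T, K))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=T)

ridge = np.linalg.solve(X.T @ X + k * np.eye(K), X.T @ y)
b = np.linalg.solve(X.T @ X, X.T @ y)
# posterior mean (A + X'X)^{-1}(A*0 + X'X*b) with A = kI
bayes = np.linalg.solve(k * np.eye(K) + X.T @ X, X.T @ X @ b)
print(np.allclose(ridge, bayes))  # True
```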
In Chapter 3, sampling theory pretest estimators of the form

$$\hat{\beta} = \begin{cases} b^* & \text{if } u < c \\ b & \text{if } u > c \end{cases} \qquad (4.4.10)$$

were considered, where b* is the restricted estimator under the null hypothesis
H₀: β = r (so that b* = r); b = (X'X)⁻¹X'y is the unrestricted estimator under the
alternative hypothesis H₁: β ≠ r; and u = (b − r)'X'X(b − r)/(Kσ̂²) is the F₍K,T−K₎
statistic used to test the hypothesis β = r. Let us consider an analogous Bayesian
estimator.
In Section 4.7 we will see that, in the Bayesian framework, uncertainty
concerning the hypotheses is handled by calculating what is known as the
posterior odds ratio, which is equal to the ratio of the posterior probabilities on
each hypothesis. When a point estimate is required and alternative hypotheses
lead to different point estimates, an optimal Bayesian point estimate is obtained
by minimizing expected loss averaged over the hypotheses, with the posterior
probabilities used as weights. That is, for two hypotheses H₀ and H₁, the
Bayesian "pretest" estimator is that value of β̂ that minimizes

$$P(H_0 \mid y)\,E_{\beta \mid y,H_0}\left[L(\beta,\hat{\beta})\right] + P(H_1 \mid y)\,E_{\beta \mid y,H_1}\left[L(\beta,\hat{\beta})\right]$$

where P(H₀|y) and P(H₁|y) are the posterior probabilities. With quadratic loss,
where the posterior means are optimal, the minimizing value for β̂ is a weighted
average of the posterior means, or

$$\hat{\beta}^* = P(H_0 \mid y)\,E[\beta \mid y,H_0] + P(H_1 \mid y)\,E[\beta \mid y,H_1]$$
In the context of the hypotheses H₀: β = r and H₁: β ≠ r, there is a set of
reasonable prior assumptions such that E[β|H₀] = b* and E[β|H₁] = b. Taking
these values as the posterior means and noting that P(H₁|y) = 1 − P(H₀|y), β̂*
becomes

$$\hat{\beta}^* = P(H_0 \mid y)\,b^* + \left[1 - P(H_0 \mid y)\right]b = \left[1 - P(H_0 \mid y)\right](b - b^*) + b^* \qquad (4.4.13)$$
A major distinction between this Bayesian "pretest" estimator and the sampling
theory pretest estimator in (4.4.10) is that the Bayesian estimator is a
continuous function of the data, while the sampling theory estimator is discontinuous
at c. For any given sample β̂ will either be b* or b, whereas β̂* will be a
weighted average of both b* and b. The greater the posterior probability
P(H₀|y), the greater will be the weight on the restricted estimator b*.
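The contrast can be made concrete with a minimal sketch; the function names and all numerical values below are hypothetical, and P(H₀|y) is simply treated as given:

```python
# Sketch contrasting the discontinuous sampling theory pretest rule (4.4.10)
# with the continuous Bayesian weighted average (4.4.13).
import numpy as np

def sampling_pretest(b, b_star, u, c):
    # all-or-nothing: restricted estimator if the F statistic u is below c
    return b_star if u < c else b

def bayes_pretest(b, b_star, p_h0):
    # smooth compromise: weights given by the posterior probabilities
    return p_h0 * b_star + (1.0 - p_h0) * b

b, b_star = np.array([1.2, 0.7]), np.array([1.0, 1.0])
print(sampling_pretest(b, b_star, u=2.5, c=3.0))
print(bayes_pretest(b, b_star, p_h0=0.6))
```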
A sampling theory estimator that is also a continuous function of the data and
that is similar in form to (4.4.13) is the Stein rule estimator. From (3.4.8) this
estimator is

(4.4.14)

To relate this estimator to a Bayesian one, consider the special case where
X'X = I, σ² = 1, and the prior for β is N(0, τ²I). In this case the posterior mean
for β is

$$\hat{\beta} = (\tau^{-2}I + I)^{-1}b = \left(\frac{\tau^2}{1+\tau^2}\right)b = \left(1 - \frac{1}{1+\tau^2}\right)b \qquad (4.4.15)$$

Under these assumptions the marginal distribution of b is N(0, (1 + τ²)I), and it
can be shown that

$$E\left[\frac{K-2}{b'b}\right] = \frac{1}{1+\tau^2} \qquad (4.4.16)$$
From this result it seems reasonable to estimate the factor 1/(1 + τ²) with
(K − 2)/b'b, in which case (4.4.15) becomes

$$\hat{\beta}_1 = \left(1 - \frac{K-2}{b'b}\right)b \qquad (4.4.17)$$
This estimator is precisely the James and Stein estimator that was developed in
Section 3.4.4.
In Equation 4.4.15, it is clear that 1/(1 + τ²) < 1. However, it is feasible for
its estimator, (K − 2)/b'b, to be greater than 1. Consequently, it seems reasonable
to truncate the estimator of 1/(1 + τ²) at unity, which yields the positive-part rule

$$\hat{\beta}_2 = \left(1 - \min\left\{1, \frac{K-2}{b'b}\right\}\right)b \qquad (4.4.18)$$
A related estimator that limits the extent to which any individual coefficient is
shrunk is

$$\hat{\beta}_k^T = \left(1 - \frac{K-2}{b'b}\min\left\{1, \frac{d\,b'b}{b_k^2(K-2)}\right\}\right)b_k, \qquad k = 1, 2, \ldots, K \qquad (4.4.19)$$
Berger (1982) suggests the compromise estimator

$$\hat{\beta}^B = \left(1 - \min\left\{\frac{K-2}{b'b}, \frac{1}{1+\tau^2}\right\}\right)b \qquad (4.4.20)$$
This estimator is equal to the Bayes estimator when the prior information is
supported by the data in the sense that b'b is small, and is equal to the James
and Stein estimator if the data appear to contradict the prior information.
Because it is robust from the Bayesian point of view, and it performs favorably
relative to sampling theory estimators, it is likely to be a useful compromise for
both the Bayesian and sampling theory camps. See Berger (1982) for more
details.
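The following sketch (all numbers hypothetical) computes the estimators of (4.4.15), (4.4.17), (4.4.18), and (4.4.20) side by side under the stated special-case assumptions:

```python
# Sketch of the estimators (4.4.15)-(4.4.20) in the special case X'X = I,
# sigma^2 = 1, prior beta ~ N(0, tau^2 I). Values are illustrative.
import numpy as np

b = np.array([0.6, -0.4, 0.9, 0.2, -0.7])   # hypothetical LS estimates
K, tau2 = b.size, 2.0
bb = b @ b

bayes = (1 - 1/(1 + tau2)) * b                       # (4.4.15)
james_stein = (1 - (K - 2)/bb) * b                   # (4.4.17)
positive_part = (1 - min(1.0, (K - 2)/bb)) * b       # (4.4.18)
robust = (1 - min((K - 2)/bb, 1/(1 + tau2))) * b     # (4.4.20)
print(bayes, james_stein, positive_part, robust, sep="\n")
```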
We conclude this section with some general comments and references to
related work.
1. The reader should not conclude from this section that point estimates are
the usual goal of a research investigation. Point estimates can be valuable
but they are generally a very inadequate means of reporting results. It is
much more useful to report complete posterior distributions. This point has
been emphasized on several occasions by Box and Tiao, and Zellner.
2. Many of the ad hoc sampling theory approaches to model selection and
point estimation can be viewed within a Bayesian framework [Leamer
(1978)].
3. In the Bayesian approach, uncertainty could exist with respect to the prior
distribution, the likelihood function, and the loss function. These uncer-
tainties have led to consideration of robust Bayesian techniques. For fur-
ther details, see Berger (1980, 1982) and Ramsay and Novick (1980).
4. In the above subsection we alluded to the empirical Bayes approach to
point estimation without discussing it in detail. For details and further
references, see Deely and Lindley (1981) and Morris (1983).
5. One way of overcoming uncertainty about the values of prior parameters is
to assign a further prior distribution to these parameters. The parameters of
this additional prior distribution are often known as hyperparameters. For
some work in this area see Lindley and Smith (1972), Smith (1973), Trivedi
(1980), and Goel and DeGroot (1981).
4.5 PREDICTION

Suppose that a set of T̄ future observations ȳ is generated by the process

$$\bar{y} = \bar{X}\beta + \bar{e} \qquad (4.5.1)$$

where ē ~ N(0, σ²I). A similar process is assumed to have generated the previous
observations y = Xβ + e, and ē and e are assumed to be independent.
That is, E[ēe'] = 0.
The main objective in the Bayesian approach to prediction is to derive a
predictive density function f(ȳ|y) that does not depend on the unknown coefficients
β and σ, and that contains all the information about ȳ, given knowledge
of the past observations y. Information on the future values ȳ can be presented
solely in terms of the complete predictive density, or, alternatively, point predictions
can be obtained by setting up a loss function and minimizing expected
loss with respect to f(ȳ|y).
The derivation of f(ȳ|y) involves two main steps. In the first step the joint
density function h(ȳ,β,σ|y) is found by multiplying the likelihood function for
the future observations, f(ȳ|β,σ,y), by the posterior density for the parameters,
g(β,σ|y). That is,

$$h(\bar{y},\beta,\sigma \mid y) = f(\bar{y} \mid \beta,\sigma,y)\,g(\beta,\sigma \mid y) \qquad (4.5.2)$$

To specify the likelihood function we note that, since ē and e are independent,
f(ȳ|β,σ,y) = f(ȳ|β,σ), and this density is given by

$$f(\bar{y} \mid \beta,\sigma) \propto \sigma^{-\bar{T}}\exp\left[-\frac{1}{2\sigma^2}(\bar{y} - \bar{X}\beta)'(\bar{y} - \bar{X}\beta)\right] \qquad (4.5.3)$$

In the second step β and σ are integrated out of h(ȳ,β,σ|y) to obtain the
predictive density f(ȳ|y), which, for the diffuse prior of Section 4.2, is a
multivariate t density (Equation 4.5.6) with mean X̄b and covariance matrix

$$\Sigma_{\bar{y} \mid y} = \frac{\nu\hat{\sigma}^2}{\nu - 2}\,C^{-1} = \frac{\nu\hat{\sigma}^2}{\nu - 2}\left(I + \bar{X}(X'X)^{-1}\bar{X}'\right) \qquad (4.5.7)$$
[See Judge et al. (1982, p. 144) for the analogous sampling theory result.] If
desired, properties of the multivariate-t distribution can be used to obtain the
marginal predictive densities for one or more of the elements in ȳ. Also, similar
principles can be applied to establish predictive densities for other types of
priors [see, for example, Dickey (1975)].
If a proper informative prior is used, it is also possible to derive the predictive
density for a sample of observations when no previous sample is available.
Predictive densities of this type are frequently used in the calculation of posterior
odds (see Section 4.7). As an example, if we employ the natural conjugate
prior in (4.3.7) together with the likelihood function

$$f(y \mid \beta,\sigma) \propto \sigma^{-T}\exp\left[-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right] \qquad (4.5.9)$$

then, along similar lines to those used above, it can be shown that the predictive
density f(y) is multivariate t (Equation 4.5.10).
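A minimal sketch of the two-step logic, assuming the diffuse prior of Section 4.2 and simulated data, approximates the predictive density by drawing (β, σ) from the posterior and then drawing ȳ:

```python
# Sketch: Monte Carlo approximation of f(ybar|y) by averaging the conditional
# density over posterior draws of (beta, sigma). Diffuse prior assumed.
import numpy as np

rng = np.random.default_rng(3)
T = 30
x = rng.normal(size=T)
y = 1.0 + 2.0 * x + rng.normal(size=T)
X = np.column_stack([np.ones(T), x])

b = np.linalg.solve(X.T @ X, X.T @ y)
nu = T - 2
sighat2 = (y - X @ b) @ (y - X @ b) / nu
XtX_inv = np.linalg.inv(X.T @ X)

# posterior draws: nu*sighat2/sigma^2 ~ chi2(nu), then beta|sigma ~ normal
sig2 = nu * sighat2 / rng.chisquare(nu, size=5000)
betas = b + (rng.normal(size=(5000, 2)) @ np.linalg.cholesky(XtX_inv).T) \
        * np.sqrt(sig2)[:, None]

xbar_row = np.array([1.0, 0.5])            # future regressor setting (assumed)
ydraws = betas @ xbar_row + rng.normal(size=5000) * np.sqrt(sig2)
print(ydraws.mean(), xbar_row @ b)         # predictive mean is close to xbar'b
```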
4.6 INTERVAL ESTIMATION

A Bayesian interval estimate (a, d) for a parameter θ with posterior probability
content 1 − α satisfies

$$P(a < \theta < d \mid y) = \int_a^d g(\theta \mid y)\,d\theta = 1 - \alpha \qquad (4.6.1)$$

Values a and d that satisfy (4.6.1) will not be unique, however, and so we need
some criterion for choosing between alternative intervals. One possibility is to
choose the most likely region by insisting that the value of the posterior density
function for every point inside the interval is greater than that for every point
outside the interval, which implies g(a|y) = g(d|y). An interval with this property
is known as the highest posterior density (HPD) interval, and, if the posterior
density is unimodal, it is equivalent to finding the shortest interval (minimizing
d − a) such that (4.6.1) holds.
As an example consider the posterior density function for β₁ given in Equation
4.2.7. In this case t = (β₁ − b₁)/(σ̂√a₁₁) has a univariate t distribution with ν
degrees of freedom, a₁₁ being the first diagonal element of (X'X)⁻¹. Thus, if t₍α/2₎
is the point such that P(t > t₍α/2₎) = α/2, then

$$P\left(b_1 - t_{(\alpha/2)}\hat{\sigma}\sqrt{a_{11}} < \beta_1 < b_1 + t_{(\alpha/2)}\hat{\sigma}\sqrt{a_{11}}\right) = 1 - \alpha \qquad (4.6.2)$$

gives an HPD interval for β₁. Similarly, because νσ̂²/σ² has a χ²₍ν₎ posterior
distribution,

$$P\left(\chi^2_{(1-\alpha/2)} < \frac{\nu\hat{\sigma}^2}{\sigma^2} < \chi^2_{(\alpha/2)}\right) = 1 - \alpha \qquad (4.6.3)$$

where P(χ² > χ²₍α/2₎) = α/2, and P(χ² > χ²₍₁₋α/2₎) = 1 − α/2. Rearranging (4.6.3)
yields P(a < σ < d) = 1 − α where a = (νσ̂²/χ²₍α/2₎)^{1/2} and d = (νσ̂²/χ²₍₁₋α/2₎)^{1/2}.
See Judge et al. (1982, Chapter 8) for numerical examples of both exact and
approximate HPD intervals for σ.
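The interval in (4.6.2) is a one-line computation; a sketch with hypothetical values:

```python
# Sketch: the 1 - alpha HPD interval for beta_1 under a diffuse prior, using
# the univariate t posterior of (4.2.7). All inputs are illustrative.
import numpy as np
from scipy import stats

b1, sighat, a11, nu, alpha = 2.0, 1.5, 0.04, 20, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=nu)
half_width = t_crit * sighat * np.sqrt(a11)
print(b1 - half_width, b1 + half_width)   # symmetric interval, hence HPD
```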
It is worth pointing out that, if we had derived the posterior density functions
and HPD intervals for, for example, σ², σ⁻², or log σ, instead of for σ, the
appropriate transformations of these intervals would not yield exactly the HPD
interval for σ. However, the HPD interval for log σ will correspond after transformation
to that of log σ². See Box and Tiao (1973) for details and for tables that can
be used to construct HPD intervals for log σ².
We have restricted the discussion to one-dimensional HPD intervals. It is
also possible to consider HPD intervals (regions) in more than one dimension.
See Box and Tiao (1973) for details and for some examples.
One method for testing hypotheses within the Bayesian framework is to accept
or reject a point null hypothesis depending on whether or not the specified
value under H₀ lies within an HPD interval with a preassigned probability
content. For example, in terms of the HPD interval for β₁ given above, we would accept
a null hypothesis of the form H₀: β₁ = β₁⁰ if β₁⁰ lies in the interval (b₁ −
t₍α/2₎σ̂√a₁₁, b₁ + t₍α/2₎σ̂√a₁₁); otherwise we would reject H₀.
This method for testing hypotheses is obviously similar to the sampling the-
ory approach of rejecting a null hypothesis when the hypothesized value for a
parameter falls outside a confidence interval. When a diffuse prior and the
model of this chapter are employed, the results (but not the interpretations) are
identical. There are other models, however, where the results are not identical.
See, for example, an application by Tsurumi (1977). Furthermore, this method
of testing hypotheses is usually suggested only for circumstances where a
diffuse or vague prior is employed, and one of the hypotheses is a single point
[Lindley (1965), Zellner (1971, Ch. 10)]. When a proper prior is employed, or
both hypotheses are composite, hypothesis testing is generally carried out
within a posterior odds framework. It is to this topic that we now turn.
4.7 POSTERIOR ODDS

In the Bayesian framework, the relative plausibility of two hypotheses is
summarized by the posterior odds ratio, which is denoted by K₀₁ = P(H₀|y)/P(H₁|y). This ratio summarizes all the
evidence (prior and sample) in favor of one hypothesis relative to another; its
calculation is frequently regarded as the endpoint of any study of competing
hypotheses. In contrast to the usual sampling theory approach, it is possible to
give the odds in favor of one hypothesis relative to another, and, although
possible, as in the nonnested model procedures of Chapter 21, it is not necessary
to accept or reject each hypothesis. Thus, as Zellner (1971, Chapter 10)
points out, it may be more appropriate to describe the Bayesian procedure as
"comparing" rather than "testing" hypotheses. If a decision to accept or reject
must be made, it can be handled by using a loss function that expresses the
consequences of making the wrong decision and by minimizing expected loss
where the expectation is with respect to the posterior probabilities on each
hypothesis.
Let us turn to the problem of deriving an expression for the posterior odds
K₀₁ = P(H₀|y)/P(H₁|y). If the two hypotheses are inequalities such as H₀: θ ≥ c
and H₁: θ < c, where θ is a parameter and c is a constant, then K₀₁ can easily be
obtained from the posterior density function g(θ|y). Specifically,

$$P(H_0 \mid y) = \int_c^{\infty} g(\theta \mid y)\,d\theta \qquad \text{and} \qquad P(H_1 \mid y) = \int_{-\infty}^{c} g(\theta \mid y)\,d\theta$$

This approach can be readily extended to regions of more than one dimension,
such as the linear inequality hypotheses H₀: Rβ > r and H₁: Rβ ≤ r, in the
general linear model y = Xβ + e. Also, if a proper prior for θ (or β) is specified,
then the prior odds P(H₀)/P(H₁) can be found in a similar way. With an improper
prior the prior odds will be indeterminate, but the posterior odds can still be
calculated. See Zellner (1979) for further details, examples, and references.
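For inequality hypotheses the computation reduces to a single tail probability. A sketch, assuming the univariate t posterior of (4.2.7) and hypothetical values:

```python
# Sketch: posterior odds K01 for H0: beta_1 >= c versus H1: beta_1 < c,
# using the t posterior of (4.2.7). Illustrative numbers only.
import numpy as np
from scipy import stats

b1, sighat, a11, nu, c = 2.0, 1.5, 0.04, 20, 1.5
t_val = (c - b1) / (sighat * np.sqrt(a11))
p_h1 = stats.t.cdf(t_val, df=nu)    # P(beta_1 < c | y)
p_h0 = 1 - p_h1
print(p_h0 / p_h1)                  # posterior odds in favor of H0
```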
For other types of hypotheses, such as comparing a "sharp" hypothesis H₀:
Rβ = r with the alternative H₁: Rβ ≠ r, or comparing alternative nonnested
regression models, we proceed as follows. Consider the two hypotheses

$$H_0:\; y = Z\gamma + e_0 \qquad (4.7.1)$$

and

$$H_1:\; y = X\beta + e \qquad (4.7.2)$$

Thus

$$P(H_0 \mid y) = \frac{P(H_0)}{f(y)}\int\!\!\int g(\gamma,\sigma_0 \mid H_0)\,f(y \mid \gamma,\sigma_0,H_0)\,d\sigma_0\,d\gamma = \frac{P(H_0)f(y \mid H_0)}{f(y)} \qquad (4.7.4)$$

and

$$P(H_1 \mid y) = \frac{P(H_1)}{f(y)}\int\!\!\int g(\beta,\sigma \mid H_1)\,f(y \mid \beta,\sigma,H_1)\,d\sigma\,d\beta = \frac{P(H_1)f(y \mid H_1)}{f(y)} \qquad (4.7.5)$$

so that the posterior odds ratio is

$$K_{01} = \frac{P(H_0)}{P(H_1)}\cdot\frac{\int\!\int g(\gamma,\sigma_0 \mid H_0)\,f(y \mid \gamma,\sigma_0,H_0)\,d\sigma_0\,d\gamma}{\int\!\int g(\beta,\sigma \mid H_1)\,f(y \mid \beta,\sigma,H_1)\,d\sigma\,d\beta} \qquad (4.7.6)$$
Equation 4.7.6 shows, in Zellner's (1971) words, "that the posterior odds are
equal to the prior odds P(H₀)/P(H₁) times the ratio of averaged likelihoods
with the prior densities g(γ,σ₀|H₀) and g(β,σ|H₁) serving as the weighting
functions. This contrasts with the usual likelihood-ratio testing procedure
which involves taking the ratio of maximized likelihood functions. . . ."
We usually assume that g(γ,σ₀|H₀) and g(β,σ|H₁) are proper priors; otherwise
the weighting functions in (4.7.6) will not integrate to one and the interpretation
just given is questionable. Furthermore, certain ambiguities can arise
[see Leamer (1978, p. 111) for examples]. In many cases, however, it is reasonable
to specify an improper prior for those parameters with identical assumptions
under both hypotheses. For example, K₀₁ will not be adversely affected
by the assumptions σ = σ₀ and g(σ|H₀) = g(σ|H₁) ∝ σ⁻¹, even though this
means that f(y|H₀) and f(y|H₁) will be improper.
The integrals (or averaged likelihood functions) in the numerator and denominator
of (4.7.6) are the predictive densities f(y|H₀) and f(y|H₁), and their ratio
f(y|H₀)/f(y|H₁) is known as the "Bayes factor." Assuming the two hypotheses
H₀ and H₁ are exhaustive, the unconditional predictive density f(y) can be
obtained by averaging over the two hypotheses. That is, f(y) = f(y|H₀)·P(H₀)
+ f(y|H₁)·P(H₁). Also, it is worth noting that when there are more than two
competing hypotheses, it is still possible to calculate the posterior odds for any
pair of hypotheses.
Before turning to some special cases of the posterior odds ratio in (4.7.6) it is
instructive to examine the sampling theory approach from a Bayesian viewpoint
[Zellner (1971, p. 295)]. Let L(H₀,H₁) denote the loss incurred if H₁ is
selected and H₀ is the true state, and let L(H₁,H₀) denote the loss incurred if H₀
is selected and H₁ is the true state. Assuming that the loss incurred from
choosing a correct state is zero, H₀ will be the minimum expected loss choice
when

$$K_{01} = \frac{P(H_0 \mid y)}{P(H_1 \mid y)} > \frac{L(H_1,H_0)}{L(H_0,H_1)}$$
We now consider the posterior odds in (4.7.6) for the special case where
g(γ,σ₀|H₀) and g(β,σ|H₁) are natural conjugate priors. The predictive density
from a natural conjugate prior was given in (4.5.10). When we view this expression
as conditional on H₁ and include the normalizing constant it becomes
(4.7.10), where ν̄, s̄, β̄, and A are parameters from the natural conjugate prior. Letting

$$k_1 = \frac{\Gamma[(\bar{\nu} + T)/2]\,(\bar{\nu}\bar{s}^2)^{\bar{\nu}/2}}{\Gamma(\bar{\nu}/2)\,\pi^{T/2}} \qquad (4.7.11)$$

and defining the analogous quantities in (4.7.13) and (4.7.14), (4.7.10) becomes
(4.7.15). The corresponding predictive density under H₀ is (4.7.16), where
ν̄₀, s̄₀, B, and γ̄ are the parameters of the natural conjugate prior for
(γ,σ₀); γ̂ = (Z'Z)⁻¹Z'y, ν₀σ̂₀² = (y − Zγ̂)'(y − Zγ̂), ν₀ = T − K₀, Q₀ =
(γ̂ − γ̄)'[B⁻¹ + (Z'Z)⁻¹]⁻¹(γ̂ − γ̄), and k₀ = Γ[(ν̄₀ + T)/2](ν̄₀s̄₀²)^{ν̄₀/2}/[Γ(ν̄₀/2)π^{T/2}].
The posterior odds ratio can then be written as the prior odds multiplied by the
ratio of these two predictive densities, and it has the following properties.
1. The greater the prior odds P(H₀)/P(H₁), the greater will be the posterior
odds.
2. The term |B|/|B + Z'Z| and its counterpart in the denominator are measures
of the precision of the prior information on the coefficients relative to the
precision of the posterior information. Other things equal, these terms mean
that we will favor the hypothesis with more prior information.
3. Goodness-of-fit considerations are given by the residual sums of squares
terms ν₀σ̂₀² and νσ̂², and these have the expected effect on the posterior
odds.
4. The terms Q₀ and Q₁ are measures of the compatibility of the prior and
sample information. The further b deviates from β̄, for example, the
greater will be Q₁ and, other things equal, the greater will be the posterior
odds in favor of H₀.
5. The relative effect of the prior information on σ and σ₀ depends on the ratio
k₀/k₁, as well as the last bracketed terms in both the numerator and denominator.
in favor of H₀. Another property, and one that may be surprising, is that, for a
given F value, the posterior odds in favor of H₀ are an increasing function of the
sample size T. This result is in contrast to the fixed significance level sampling
theory situation where the magnitude of F (or its right-hand tail probability) is
the only measure of support for H₁. Although in some circumstances it might
be desirable, a criticism of the fixed significance level sampling theory approach
is that it treats Type I and Type II errors asymmetrically. As T increases the
probability of a Type I error remains constant while the probability of a Type II
error (for a given β) declines. For both errors to be treated symmetrically the
significance level needs to be a decreasing function of sample size. The posterior
odds formulation treats both errors symmetrically, and this is reflected by
the fact that, for a fixed F value, K₀₁ is an increasing function of T. This
phenomenon is known as Lindley's paradox. For details see Lindley (1957),
Zellner (1971, p. 304), Leamer (1978, p. 105), and Judge et al. (1982, p. 101).
With respect to the parameter τ, we note that as τ⁻¹ → 0, K₀₁ → 1. This is a
special case of the result A⁻¹ → 0 that was discussed earlier. The consequences
of τ → 0 (or A → 0) will be discussed briefly in Section 4.7.2.
A second special case of (4.7.18), which leads to a simplification similar to
(4.7.19), is that obtained from the g prior discussed in Section 4.3.4 [Zellner
(1982a)]. When we set A = g₀X'X, Equation 4.7.18 simplifies accordingly. A
related approach, based on a Cauchy prior for β, yields posterior
odds for comparing H₀: β = 0 with H₁: β ≠ 0. See Zellner and Siow (1980)
for details.
Posterior odds for other combinations of alternative priors, models and hy-
potheses have been derived and applied by various authors. For details we
refer the reader to Bernardo (1980), Dickey (1975), Gaver and Geisel (1974),
Leamer (1978), Lempers (1971), Rossi (1980), Zellner (1971, 1979, 1982b),
Zellner and Geisel (1970), Zellner and Richard (1973), and Zellner and Williams
(1973); as well as the references cited by these authors.
4.8 SUMMARY
4.9 EXERCISES
Exercise 4.1
The following results are useful in Section 4.7. Prove that
(a) |I + XA⁻¹X'| = |A + X'X| · |A⁻¹|
(b) [A⁻¹ + (X'X)⁻¹]⁻¹ = X'X − X'X(A + X'X)⁻¹X'X
(c) (I + XA⁻¹X')⁻¹ = I − X(A + X'X)⁻¹X'
(d) (y − Xb)'(y − Xb) + (b − β̄)'[A⁻¹ + (X'X)⁻¹]⁻¹(b − β̄)
    = (y − Xβ̄)'(I + XA⁻¹X')⁻¹(y − Xβ̄)
Exercise 4.2
Consider the general linear model y = Xβ + e with normally distributed first-order
autoregressive errors e_t = ρe_{t−1} + v_t, and the usual definitions and assumptions.
(a) If β, ρ, and σ_v are treated as a priori independent with each marginal prior
density proportional to the square root of the determinant of the relevant
partition of the information matrix, find the (approximate) joint prior density.
(b) Outline the steps you would go through to derive the predictive posterior
density function g(ȳ|y). What form is the resulting distribution?
Exercise 4.3
Suppose that y₁, y₂, . . . , y_T is a random sample from a Poisson distribution
with mean λ, and that the prior distribution for λ is a gamma distribution.
Find the posterior density for λ. What does this result tell you about the prior
for λ?
Exercise 4.4
For the prior g(σ) given in Equation 4.3.5, show that
(a) E[σ] = {Γ[(ν̄ − 1)/2]/Γ(ν̄/2)} (ν̄/2)^{1/2} s̄
(b) the mode for σ is [ν̄/(ν̄ + 1)]^{1/2} s̄
Exercise 4.5
Find the sampling theory mean and mean square error matrix for β̿ =
(A + X'X)⁻¹(Aβ̄ + X'Xb). Show that MSE(b) − MSE(β̿) is positive semidefinite
if and only if (4.4.6) holds.
Exercise 4.6
Consider the model y = xβ + e where E[ee'] = σ²I, β is a scalar,
y = (0 −1 1 2 5 1 1 −2 −2 −6)'
and
x = (0 1 2 3 4 0 −1 −2 −3 −4)'
Also, suppose that your prior information about β and σ can be summarized via
the densities
g(β|σ) ~ N(2, σ²)
g(σ) ∝ σ⁻³ exp{−1/σ²}
(a) What are the values of ν̄ and s̄² in the prior density for σ?
(b) Write down the joint prior density g(β,σ).
(c) What is the least squares estimate for β?
(d) What is the conditional posterior density g(β|σ,y)? Give its mean and
variance.
(e) Give the marginal posterior densities for β and σ. Include the normalizing
constants. What are E[β|y] and var[β|y]?
(f) Assuming prior odds of unity, find the posterior odds in favor of H₀: β = 1
against the alternative, H₁: β ≠ 1. [Assume that g(σ|H₀) = g(σ|H₁).]
Exercise 4.7
Given that (νσ̂²/σ²) has a χ²₍ν₎ distribution, show that the density function for σ̂ is
that given in Equation 4.2.5.
Exercise 4.8
Consider the posterior density function g(β,σ|y) derived from a natural conjugate
prior. See Section 4.3.2. Show that if β̄ = 0 and A = g₀X'X, then the
expressions for g(β,σ|y), ν̿s̿², and β̿ are given by Equations 4.3.18, 4.3.19, and
4.3.20, respectively.
Exercise 4.9
Show that the "poly-t" density in (4.3.23) is obtained by combining (4.3.21) and
(4.3.22) with the likelihood function, and integrating out σ.
Exercise 4.10
Show that (a) the MELO estimator θ̂₁* in (4.4.7) is optimal with respect to the
loss function L = (θ₁ − θ̂₁)²/θ₁²; and (b) the MELO estimator θ̂₂* in (4.4.8) is
optimal with respect to the loss function L = β₂²(θ₂ − θ̂₂)².
Exercise 4.11
Derive the predictive density f(ȳ|y) given in Equation 4.5.6.
4.10 REFERENCES
INFERENCE IN GENERAL
STATISTICAL MODELS
AND TIME SERIES
In the next three chapters we leave the conventional worlds discussed in Part I
and now consider tools for analyzing situations when (1) it is not possible to
derive exact sampling distributions, (2) the statistical models are nonlinear, and
(3) the economic data can be modeled within a time series discrete stochastic
process context. In each case the basic concepts and definitions are developed
and the statistical implications are evaluated. Numerical optimization proce-
dures useful in analyzing nonlinear statistical models are discussed in Appendix
B at the end of the book. Multivariate time series models are considered in
Part V.
Chapter 5
Some Asymptotic Theory and Other General Results for the Linear Statistical Model
In this section we are concerned with certain aspects of the limiting behavior of
the least squares (LS) estimator b_T in the general linear model y = Xβ + e. For
convenience we will drop the subscript T and simply denote the LS estimator
by b. The notation introduced in Chapters 2 to 4 still holds. We assume that X
is nonstochastic and that the elements of e are independent and identically
distributed with zero mean and variance σ². Since

$$b = \beta + (X'X)^{-1}X'e \qquad (5.1.1)$$
and we wish to make statements about how, and in what sense, b converges to
β as T → ∞, it is clear that we must investigate the convergence properties of
X'X and X'e. With this point in mind we first consider X'X. A common assumption
is that

$$\lim_{T\to\infty} \frac{X'X}{T} = Q \qquad (5.1.2)$$

where Q is a finite and nonsingular matrix. This assumption can fail. For
example, if the model contains a constant and a linear time trend, so that
x_{t1} = 1 and x_{t2} = t, then

$$X'X = \begin{bmatrix} T & T(T+1)/2 \\ T(T+1)/2 & T(T+1)(2T+1)/6 \end{bmatrix} \qquad (5.1.3)$$
From (5.1.3) it is clear that lim (X'X/T) is not finite, and so, in this case, the
assumption that Q is finite does not hold. However, in Section 5.3.5, which is
concerned with "asymptotically uncooperative regressors," most of the asymptotic
properties in this chapter can be derived under less restrictive assumptions
about X. This fact is important because, in economics, our explanatory
variables are frequently trended.
The assumption that Q is nonsingular is more critical. For an example where
this condition is violated [Dhrymes (1978)], let x_{t1} = 1 for all t and x_{t2} = λᵗ,
where |λ| < 1. Then

$$X'X = \begin{bmatrix} T & \dfrac{\lambda - \lambda^{T+1}}{1-\lambda} \\[2mm] \dfrac{\lambda - \lambda^{T+1}}{1-\lambda} & \dfrac{\lambda^2 - \lambda^{2(T+1)}}{1-\lambda^2} \end{bmatrix} \qquad (5.1.4)$$

and

$$\lim_{T\to\infty} \frac{X'X}{T} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \qquad (5.1.5)$$

Thus, although X'X is nonsingular for all finite T, X'X/T is singular in the limit,
and so (5.1.2) is violated.
5.1.2 The Order of Magnitude of a Sequence
The assumption that the elements of T⁻¹X'X converge
to finite limits implies that T⁻¹X'X is bounded (in the sense that the
sequences of the elements within T⁻¹X'X are bounded), and under these circumstances
we say that X'X is at most of order T. We write it as

$$X'X = O(T) \qquad (5.1.6)$$

More formally, we say that the sequence {a_T} is at most of order T^k, and write it
as a_T = O(T^k), if there exists a real number N such that

$$|T^{-k}a_T| \le N \qquad (5.1.7)$$

for all T. This definition extends to vectors and matrices by applying the condition
to every element in the relevant vector or matrix. Note that, although
(5.1.2) implies (5.1.6), the converse is not necessarily true.
A related concept and its associated notation is introduced in the following
definition. A sequence {a_T} is of smaller order than T^k, and is written as a_T =
o(T^k), if

$$\lim_{T\to\infty} T^{-k}a_T = 0 \qquad (5.1.8)$$

For example, if (5.1.2) holds, and we extend the above definition to matrices,
then X'X = o(T²) because lim T⁻²X'X = lim T⁻¹Q = 0. Also, note that if a_T =
O(T^k), then a_T = o(T^{k+δ}) where δ > 0.
It is frequently useful to employ results from the algebra associated with
orders of magnitudes of sequences, and so we will briefly state some of these
results. Consider the sequences {a_T} and {b_T} such that a_T = O(T^k) and b_T =
O(T^j). The following results hold:

a_T b_T = O(T^{k+j})
|a_T|^s = O(T^{ks}),  s > 0
a_T + b_T = O(max{T^k, T^j})
Similar concepts apply to sequences of random variables. A sequence of
random variables {z_T} converges in probability to the random variable z if, for
every ε > 0,

$$\lim_{T\to\infty} P\left[|z_T - z| > \varepsilon\right] = 0 \qquad (5.1.9)$$

In this case we write

$$\text{plim}\; z_T = z \qquad (5.1.10)$$

or, alternatively, as

$$z_T \xrightarrow{p} z \qquad (5.1.11)$$

A stronger concept is that of almost sure convergence, which requires

$$P\left(\lim_{T\to\infty} |z_T - z| > \varepsilon\right) = 0 \qquad (5.1.12)$$

and we write this result as

$$z_T \xrightarrow{a.s.} z \qquad (5.1.13)$$
(5.1.14)
A sequence of random variables {z_T} is said to be at most of order in probability
T^k, written

$$z_T = O_p(T^k) \qquad (5.1.15)$$

if, for every ε > 0, there exists a real number N such that

$$P\left[|T^{-k}z_T| > N\right] \le \varepsilon \qquad (5.1.16)$$

for all T. Also, we say that {z_T} is of smaller order in probability than T^k and
write

$$z_T = o_p(T^k) \qquad (5.1.17)$$

if

$$\text{plim}\; T^{-k}z_T = 0$$

These definitions also hold for vectors and matrices by applying the conditions
to every element in the relevant vector or matrix. The concepts in (5.1.15) and
(5.1.17) can be related by noting that if z_T = O_p(T^k), then z_T = o_p(T^j) if j > k.
The algebra for O_p and o_p is identical to that given for O and o in Section 5.1.2.
For an example let us return to the least squares estimator b =
β + (X'X)⁻¹X'e and investigate the order in probability of the term X'e. To
simplify the discussion we will assume there is only one explanatory variable
whose observations are contained in the vector x, and therefore, we are interested
in the order in probability of the scalar x'e. It is straightforward to extend
the analysis to each element of the vector X'e.
If (5.1.2) holds, that is, lim T⁻¹x'x = q is a finite nonzero constant, then, as
described in (5.1.6), x'x = O(T), and there exists a number N₁ such that

$$T^{-1}x'x \le N_1 \qquad (5.1.19)$$

for all T. Now, x'e has mean zero and variance σ²x'x, and Chebyshev's inequality
states that, for a zero-mean random variable z and any constant N > 0,

$$P\left[|z| > N\right] \le \frac{E[z^2]}{N^2} \qquad (5.1.20)$$

Applying this result to x'e and using (5.1.19), we have, for any N₂ > 0,

$$P\left[|x'e| > N_2T^{1/2}\right] \le \frac{\sigma^2 x'x}{N_2^2 T} \qquad (5.1.21)$$

or

$$P\left[|T^{-1/2}x'e| > N_2\right] \le \frac{\sigma^2 N_1}{N_2^2} \qquad (5.1.22)$$

If, for a given ε > 0, we choose N₂ such that σ²N₁/N₂² < ε, that is, N₂ >
σ(N₁/ε)^{1/2}, we can write

$$P\left[|T^{-1/2}x'e| > N_2\right] \le \varepsilon \qquad (5.1.23)$$

for all T, so that

$$x'e = O_p(T^{1/2}) \qquad (5.1.24)$$

and, consequently,

$$\text{plim}\; T^{-1}x'e = 0 \qquad (5.1.25)$$
This last result will be useful for establishing the property of consistency for the
least squares estimator b, and for the least squares variance estimator σ̂² =
(y − Xb)'(y − Xb)/(T − K). Before turning to these problems it is worth stating
the result in (5.1.24) in more general terms [Fuller (1976)]: if {z_T} is a sequence
of random variables such that E[z_T²] = O(T^{2k}), then z_T = O_p(T^k).
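A small simulation (illustrative only) makes these orders of magnitude visible:

```python
# Sketch: x'e = Op(T^{1/2}), so T^{-1/2} x'e stays stochastically bounded
# while T^{-1} x'e collapses to zero as T grows.
import numpy as np

rng = np.random.default_rng(4)
for T in (100, 10_000, 1_000_000):
    x = np.ones(T)                       # a regressor with x'x = T
    e = rng.normal(size=T)
    s = x @ e
    print(T, s / np.sqrt(T), s / T)      # first stays O(1), second -> 0
```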
5.1.5 Consistency

Writing the least squares estimator as

$$b = \beta + \left(\frac{X'X}{T}\right)^{-1}\frac{X'e}{T}$$

we have

$$\text{plim}\; b = \beta + \text{plim}\left[\left(\frac{X'X}{T}\right)^{-1}\frac{X'e}{T}\right]$$
$$= \beta + \text{plim}\left(\frac{X'X}{T}\right)^{-1}\text{plim}\left(\frac{X'e}{T}\right) \qquad \text{from Slutsky's theorem}$$
$$= \beta + \left(\lim\frac{X'X}{T}\right)^{-1}\text{plim}\left(\frac{X'e}{T}\right)$$
$$= \beta + Q^{-1}\cdot 0 \qquad \text{from (5.1.2) and (5.1.26)}$$
$$= \beta \qquad (5.1.27)$$

In going from plim (X'X/T)⁻¹ to (lim X'X/T)⁻¹, we have used Slutsky's theorem
and the fact that the probability limit of a nonstochastic sequence is simply
equal to its (nonprobabilistic) limit.
To show that the least squares variance estimator is consistent we first state
Khinchine's theorem. If z₁, z₂, . . . , z_T are independent and identically distributed
(i.i.d.) random variables with finite mean μ, then plim T⁻¹Σ_{t=1}^T z_t = μ.
An important feature of this theorem is that it does not require the existence of
moments of higher order than the mean. Now consider the i.i.d. random variables
e₁², e₂², . . . , e_T², which have mean E[e_t²] = σ². From Khinchine's theorem
it follows that

$$\text{plim}\; T^{-1}\sum_{t=1}^{T}e_t^2 = \sigma^2 \qquad (5.1.28)$$

The variance estimator can be written as

$$\hat{\sigma}^2 = \frac{1}{T-K}\,e'\left(I - X(X'X)^{-1}X'\right)e$$

Then, using Slutsky's theorem and the fact that plim [T/(T − K)] = 1,

$$\text{plim}\;\hat{\sigma}^2 = \text{plim}\left[\frac{e'e}{T} - \frac{e'X}{T}\left(\frac{X'X}{T}\right)^{-1}\frac{X'e}{T}\right] = \sigma^2 - 0\cdot Q^{-1}\cdot 0 = \sigma^2 \qquad (5.1.30)$$

where the second to last equality is based on (5.1.2), (5.1.26), and (5.1.28).
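A minimal simulation sketch of these consistency results (all parameter values hypothetical):

```python
# Sketch: Monte Carlo check of (5.1.27) and (5.1.30): b and sigma-hat^2
# approach beta and sigma^2 as T grows.
import numpy as np

rng = np.random.default_rng(5)
beta, sigma = np.array([1.0, 0.5]), 2.0
for T in (50, 500, 50_000):
    X = np.column_stack([np.ones(T), rng.normal(size=T)])
    y = X @ beta + rng.normal(scale=sigma, size=T)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    sighat2 = (y - X @ b) @ (y - X @ b) / (T - 2)
    print(T, b, sighat2)                 # converges to beta and sigma^2 = 4
```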
Convergence in probability to a constant can also be established by showing
that the bias and variance both vanish in the limit, since, for a K-dimensional
random vector z_T with limit c,

$$E\left[(z_T - c)'(z_T - c)\right] = \sum_{i=1}^{K}\text{bias}^2(z_{iT}) + \sum_{i=1}^{K}\text{var}(z_{iT}) \qquad (5.1.32)$$

For the least squares estimator b, which is unbiased, we have

$$\lim \sigma^2(X'X)^{-1} = \lim\frac{\sigma^2}{T}\left(\lim\frac{X'X}{T}\right)^{-1} = 0\cdot Q^{-1} = 0$$

so that b also converges to β in quadratic mean. Similarly, for the maximum
likelihood variance estimator σ̃² = (y − Xb)'(y − Xb)/T we have, under normality,

$$\text{bias}(\tilde{\sigma}^2) = E[\tilde{\sigma}^2] - \sigma^2 = -\frac{K}{T}\sigma^2 \to 0 \quad \text{as } T \to \infty$$

and

$$\text{var}(\tilde{\sigma}^2) = \frac{2(T-K)\sigma^4}{T^2} \to 0 \quad \text{as } T \to \infty$$
When the error vector e is not normally distributed, interval estimates and
hypothesis tests for β must be based on what is known as the asymptotic
distribution of the least squares estimator. We will take up this question in
Section 5.3. In this section, to demonstrate the need to consider the asymptotic
distribution, we will briefly state some of the finite sample properties of b when
e is nonnormal. We retain the assumption that E[ee'] = σ²I, with σ² finite.
When the variance does not exist, the robust estimation procedures discussed
in Chapter 20 are likely to be more suitable.
The existence of nonnormal errors has the following implications for the
finite sample properties of least squares statistics.
1. Instead of being minimum variance from within the class of all unbiased
estimators, the least squares estimator b is only minimum variance from
within the class of linear unbiased estimators. More efficient nonlinear
unbiased estimators may exist.
2. The variance estimator σ̂² = (y − Xb)'(y − Xb)/(T − K) is still unbiased but
is no longer minimum variance from within the class of all unbiased estimators
for σ².
3. The respective distributions of b and (T − K)σ̂²/σ² are no longer normal
and χ²₍T−K₎.
4. The statistic

$$\lambda = \frac{(Rb - r)'\left[R(X'X)^{-1}R'\right]^{-1}(Rb - r)}{J\hat{\sigma}^2} \qquad (5.2.1)$$

no longer has the F₍J,T−K₎ distribution, and it is therefore no longer valid (in
finite samples) for testing a set of J linear restrictions given by Rβ = r.
It is sometimes argued that, because each element of the error vector e is made
up of the sum of a large number of individual influences, then from a central
limit theorem (discussed later in this section) it is reasonable to assume that e
is normally distributed. This argument may be valid under some circumstances,
but it would be reassuring if there were some justification for test statistics such
as (5.2.1) even when e is not normally distributed. Thus, in this section we
investigate how large sample (asymptotic) theory can be used to make inferences
about β, without invoking the normality assumption. We begin by introducing
the concept of convergence in distribution followed by a statement of
the relevant central limit theorems.
A sequence of random vectors {z_T}, with corresponding distribution functions
{F_T}, is said to converge in distribution to the random vector z with distribution
function F(z) if, and only if, at all continuity points of F, and for every ε > 0,
there exists a T₀ such that |F_T(z) − F(z)| < ε for all T > T₀. Convergence in
probability implies convergence in distribution, but the converse is not necessarily
true. To appreciate this point note that, if z_T →p z, then, for large T, the
probability that z_T differs from z is very small, and so we would expect them to
have approximately the same distribution. However, it is possible for completely
unrelated (e.g., independent) random variables to have approximately
the same distribution. Thus, if z_T →d z, then z_T and z will have approximately
the same distribution, but any realization of z_T need not have a bearing on a
realization of z.
When z = θ is a vector of constants the converse does hold. That is, it is also
true that

$$z_T \xrightarrow{d} \theta \quad \text{implies} \quad \text{plim}\; z_T = \theta$$
As an example, let ȳ_T denote the sample mean of T i.i.d. random variables with
mean μ ≠ 0 and finite variance. Then, using the scalar version of a theorem that
is given in Section 5.3.4, √T(ȳ_T⁻¹ − μ⁻¹) has a limiting normal distribution.
Thus the mean of the asymptotic distribution of ȳ_T⁻¹ is μ⁻¹, but lim E[ȳ_T⁻¹] ≠ μ⁻¹
because it can be shown that E[ȳ_T⁻¹] does not exist. Note that this example also
demonstrates that an estimator can be consistent (plim ȳ_T⁻¹ = μ⁻¹), without its
bias and variance going to zero as T → ∞ (E[ȳ_T⁻¹] and var[ȳ_T⁻¹] do not exist). For
further examples see Greenberg and Webster (1983, pp. 57-60).
There are certain conditions under which the limits of the moments of a
sequence of random variables are equal to moments of the limiting distribution.
In particular, in terms of a scalar random variable z_T, if z_T →d z, and if E[z_T^m],
E[z^m], and lim E[z_T^m] are all finite, then for all n < m, lim E[z_T^n] = E[z^n]. See, for
example, Rao (1965, p. 100).
One particular expectation that always exists, and that is often useful for
finding the limiting distribution of a random variable, is the characteristic function.
For a scalar random variable it is defined as

$$\psi_{z_T}(t) = E\left[\exp(itz_T)\right]$$
Rather than considering T⁻¹X'e, from the above result it seems reasonable to consider the
limiting distribution of T^{−1/2}X'e. We follow this strategy in the next subsection,
where we indicate how a central limit theorem can be used to establish the limiting
distribution of T^{−1/2}X'e, and, from this result, we derive the limiting distribution
of √T(b − β).
This theorem is not directly relevant for us, however, because each of the
elements in the vector X'e is not a sum of T identically distributed random
variables. For example, the kth element of X'e is

$$\sum_{t=1}^{T} x_{tk}e_t$$

which is the sum of T independent random variables with zero means, but with
different variances x²₁ₖσ², x²₂ₖσ², . . . , x²_Tₖσ².
To accommodate this situation we need a more general central limit theorem,
and one possibility is the following multivariate version of a theorem given by
Greenberg and Webster (1983).
If z₁, z₂, . . . , z_T are independently distributed random vectors with mean
vector μ and covariance matrices V₁, V₂, . . . , V_T, and the third moments of
the z_t exist, then

$$T^{-1/2}\sum_{t=1}^{T}(z_t - \mu) \xrightarrow{d} N(0, \Phi) \qquad (5.3.4)$$

where

$$\Phi = \lim T^{-1}\sum_{t=1}^{T} V_t \qquad (5.3.5)$$

In our case z_t = x_t e_t, where x_t' is the tth row of X; this vector has mean zero
and covariance matrix V_t = σ²x_t x_t', so that

$$\Phi = \lim T^{-1}\sum_{t=1}^{T}\sigma^2 x_t x_t' = \sigma^2\lim T^{-1}X'X = \sigma^2 Q \qquad (5.3.6)$$
where the limit assumption about T⁻¹X'X in (5.1.2) has been used. For the third
moments of the z_t to exist we need to assume that the third moment of e_t exists
and that the elements x_tk are uniformly bounded. Then it follows that

$$T^{-1/2}\sum_{t=1}^{T}x_t e_t = T^{-1/2}X'e \xrightarrow{d} N(0, \sigma^2 Q) \qquad (5.3.7)$$

Assuming that the third moment of e_t exists was a convenient way of establishing
(5.3.7), but it is a stricter condition than is necessary. A condition that is
both necessary and sufficient for the sum of T independent but not identically
distributed random variables to have a normal limiting distribution is given by
the Lindeberg-Feller central limit theorem. See, for example, Theil (1971),
Dhrymes (1974), Schmidt (1976), and Greenberg and Webster (1983). Schmidt
demonstrates how the Lindeberg-Feller condition can be used to establish
(5.3.7), while Theil uses the characteristic function approach to establish
(5.3.7) directly.
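The following sketch (illustrative only, with a bounded regressor and uniform errors) checks the variance implied by (5.3.7) for a single regressor:

```python
# Sketch: the limiting normality of T^{-1/2} x'e in (5.3.7) with nonnormal
# (uniform) disturbances, checked by simulation.
import numpy as np

rng = np.random.default_rng(6)
T, R = 200, 5000
x = np.linspace(-1, 1, T)                   # bounded regressor
q = (x @ x) / T                             # sample analogue of Q
stats_ = np.empty(R)
for r in range(R):
    e = rng.uniform(-1, 1, size=T)          # var = 1/3, third moment exists
    stats_[r] = (x @ e) / np.sqrt(T)
print(stats_.var(), (1/3) * q)              # close: asymptotic var is sigma^2 Q
```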
To move from the limiting distribution of T^{−1/2}X'e to that of √T(b − β) we state
the following theorem. If the random vector z_T converges in distribution to z
and the random matrix A_T converges in probability to a matrix of constants C,
then

$$A_T z_T \xrightarrow{d} Cz \qquad (5.3.8)$$

Applying this theorem to

$$\sqrt{T}(b - \beta) = \left(\frac{X'X}{T}\right)^{-1}\frac{X'e}{\sqrt{T}} \qquad (5.3.9)$$

yields

$$\sqrt{T}(b - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1}) \qquad (5.3.10)$$

Note that, in (5.3.9), we use the result plim (X'X/T) = Q. This result holds
because we are, in fact, making the stronger assumption that X is nonstochastic
and that lim (X'X/T) = Q. This point should also be kept in mind in what
follows.
The next step is to investigate how the asymptotic distribution in (5.3.10) can
be used to make inferences about β. For this purpose the following theorem is
useful.
If g is a continuous vector function and z_T →d z, then

$$g(z_T) \xrightarrow{d} g(z) \qquad (5.3.11)$$

Applying this theorem to (5.3.10) with g a linear function gives, for a known
(J × K) matrix R of rank J,

$$\sqrt{T}(Rb - R\beta) \xrightarrow{d} N(0, \sigma^2 RQ^{-1}R') \qquad (5.3.12)$$

and

$$T(Rb - R\beta)'\left[\sigma^2 RQ^{-1}R'\right]^{-1}(Rb - R\beta) \xrightarrow{d} \chi^2_{(J)} \qquad (5.3.13)$$

To make these results operational we need consistent estimators for Q and σ² in
(5.3.13). For this purpose we use (X'X/T) and σ̂², respectively. Strictly speaking,
X'X/T is not an "estimator" for Q because X is nonstochastic, but, as
mentioned above, it does satisfy the required condition that plim (X'X/T) = Q.
Thus, we have

$$\lambda_1 = \frac{(Rb - R\beta)'\left[R(X'X)^{-1}R'\right]^{-1}(Rb - R\beta)}{\hat{\sigma}^2} \xrightarrow{d} \chi^2_{(J)} \qquad (5.3.14)$$

and this statistic can be used to construct interval estimates and test hypotheses
about Rβ. When J = 1 a result that is equivalent to (5.3.14) is

$$w = \frac{Rb - R\beta}{\hat{\sigma}\sqrt{R(X'X)^{-1}R'}} \xrightarrow{d} N(0,1) \qquad (5.3.15)$$
The statistic w is the familiar "t statistic" that is used for interval estimates
when e is normally distributed. However, in this case, the finite sample distribution
of w is unknown, and its large sample (approximate) distribution is a
normal distribution. Consequently, from a theoretical point of view, we should
employ the normal rather than the t distribution when e is not normally distributed.
From a practical standpoint, however, it is quite possible that the t
distribution is a better approximation in finite samples, and in large samples it
makes no difference because the t distribution converges to the standard normal
distribution.
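A sketch of (5.3.15) in action, assuming hypothetical data with nonnormal errors:

```python
# Sketch: the statistic w of (5.3.15) for a single restriction R*beta = r,
# judged against the standard normal. Illustrative data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
T = 120
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 0.0]) + (rng.exponential(size=T) - 1.0)  # nonnormal e

b = np.linalg.solve(X.T @ X, X.T @ y)
sighat2 = (y - X @ b) @ (y - X @ b) / (T - 2)
R, r = np.array([0.0, 1.0]), 0.0            # test beta_2 = 0
se = np.sqrt(sighat2 * R @ np.linalg.inv(X.T @ X) @ R)
w = (R @ b - r) / se
print(w, 2 * (1 - stats.norm.cdf(abs(w))))  # asymptotically valid p-value
```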
An analogous situation exists with the statistic λ₁ in (5.3.14) and the conventional
F statistic λ₁/J. A further result that can be established under the assumptions
of this section is the limiting distribution of the least squares variance estimator,

$$\sqrt{T}(\hat{\sigma}^2 - \sigma^2) \xrightarrow{d} N(0, \mu_4 - \sigma^4) \qquad (5.3.17)$$

where μ₄ = E[e_t⁴]. To make (5.3.17) operational we need a consistent estimator
for μ₄. We will not pursue this problem further; however, we note that an
appropriate next step would be to investigate the necessary additional assumptions
about X to show that μ̂₄ = T⁻¹Σ_{t=1}^T ê_t⁴ is consistent.
As an example, consider the model y_t = β₁ + β₂x_t + β₃y_{t−1} + e_t. This model
does not satisfy the assumptions of this section because the explanatory
variable y_{t−1} is stochastic. Nevertheless, we will discover in Section
5.4 that if some additional assumptions are made, the asymptotic results of this
section still hold. In particular, √T(b − β) →d N(0, σ²Q⁻¹), where Q is a matrix
that is consistently estimated by X'X/T, and the three columns in X are, respectively,
a column of ones, a column of the x_t's, and a column containing the
lagged values of the dependent variable. Now suppose, instead of being interested
in β, we are interested in α₀ = β₁/(1 − β₃), α₁ = β₂/(1 − β₃), and γ = 1 − β₃,
which we will write as g₁(β), g₂(β), and g₃(β), respectively. To make inferences
about these functions we
need the asymptotic distribution of g(b). The following theorem is useful in this
regard.
If √T(b − β) →d N(0, σ²Q⁻¹) and g is a J-dimensional vector function with at
least two continuous derivatives, then

$$\sqrt{T}\left[g(b) - g(\beta)\right] \xrightarrow{d} N\left(0,\; \sigma^2\left(\frac{\partial g'}{\partial\beta}\right)'Q^{-1}\left(\frac{\partial g'}{\partial\beta}\right)\right) \qquad (5.3.20)$$

where

$$\frac{\partial g'}{\partial\beta} = \begin{bmatrix} \partial g_1/\partial\beta_1 & \partial g_2/\partial\beta_1 & \cdots \\ \partial g_1/\partial\beta_2 & \partial g_2/\partial\beta_2 & \cdots \\ \vdots & \vdots & \\ \partial g_1/\partial\beta_K & \partial g_2/\partial\beta_K & \cdots \end{bmatrix} \qquad (5.3.21)$$

is the (K × J) matrix of first derivatives. For the example above,

$$\frac{\partial g'}{\partial\beta} = \begin{bmatrix} (1-\beta_3)^{-1} & 0 & 0 \\ 0 & (1-\beta_3)^{-1} & 0 \\ \beta_1(1-\beta_3)^{-2} & \beta_2(1-\beta_3)^{-2} & -1 \end{bmatrix} \qquad (5.3.22)$$
and the asymptotic covariance matrix for √T[g(b) − g(β)] is obtained by pre-
and postmultiplying σ²Q⁻¹ by the transpose of (5.3.22) and by (5.3.22), respectively. A χ²
statistic analogous to that in (5.3.13) can be constructed, and this statistic is
made operational by replacing Q⁻¹ by (X'X/T)⁻¹, σ² by σ̂², and by evaluating
∂g'/∂β at b. These substitutions do not alter the asymptotic distribution of the
statistic, which now can be written as

$$\left[g(b) - g(\beta)\right]'\left[F'(X'X)^{-1}F\right]^{-1}\left[g(b) - g(\beta)\right]/\hat{\sigma}^2 \xrightarrow{d} \chi^2_{(J)} \qquad (5.3.23)$$

where

$$F = \left.\frac{\partial g'}{\partial\beta}\right|_{\beta = b} \qquad (5.3.24)$$
Equation 5.3.23 can be used to test hypotheses about g(β), or, if we are interested
in a single element from g(β), we could use the normal distribution
counterpart of (5.3.23). For example, if we are interested in testing an hypothesis
about α₁ = g₂(β), we could use

$$\frac{g_2(b) - \alpha_1}{\hat{\sigma}\left[f_2'(X'X)^{-1}f_2\right]^{1/2}} \xrightarrow{d} N(0,1) \qquad (5.3.25)$$

where a_{ij} is the (i,j)th element of (X'X)⁻¹, and f₂ is the second column of F.
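A sketch of the delta-method calculation, with hypothetical estimates and a stand-in covariance matrix in place of σ̂²(X'X)⁻¹:

```python
# Sketch of (5.3.20): the asymptotic variance of g(b) via the derivative
# vector evaluated at b, for g(beta) = beta_1/(1 - beta_3) as in the text.
import numpy as np

def g(beta):
    return beta[0] / (1.0 - beta[2])

def grad_g(beta):
    # partial derivatives with respect to beta_1, beta_2, beta_3
    return np.array([1.0/(1.0 - beta[2]), 0.0, beta[0]/(1.0 - beta[2])**2])

b = np.array([2.0, 1.0, 0.5])               # hypothetical estimates
cov_b = 0.01 * np.eye(3)                    # stand-in for sighat2*(X'X)^{-1}
f = grad_g(b)
var_g = f @ cov_b @ f                       # asymptotic variance of g(b)
print(g(b), np.sqrt(var_g))
```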
An assumption that was made early in this chapter and that has been used
throughout is that lim(X'X/T) = Q is a finite nonsingular matrix. The purpose of
this section is to demonstrate that one aspect of this assumption, namely that the
elements of Q are finite, is not critical, and that the test statistics that we have
derived will still be valid under some less restrictive assumptions. Following
Schmidt (1976) we give these circumstances the title of asymptotically uncooperative
regressors.
We begin by considering the diagonal elements of X'X given by

$$d_{Tk}^2 = \sum_{t=1}^{T} x_{tk}^2, \qquad k = 1, 2, \ldots, K \qquad (5.3.26)$$

and we assume that lim d²_Tk = ∞. That is, the sum of squares of each regressor
approaches infinity with T. This is certainly a reasonable assumption for all
regressors that violate the assumption that lim X'X/T is finite, as well as those
for which lim X'X/T is finite but nonzero. It precludes regressors of the form λᵗ
where |λ| < 1.
If we let D_T = diag(d_T1, d_T2, . . . , d_TK), then, from (5.3.26), it can be shown
that [Schmidt (1976, p. 86)] the elements of

$$D_T^{-1}X'XD_T^{-1} \qquad (5.3.27)$$

are bounded, and we assume that this matrix converges to a limit C.
In addition, we will assume that C is nonsingular. These assumptions are sufficient
to establish that the least squares estimator is consistent. We know that b
has mean β and covariance matrix σ²(X'X)⁻¹, and, therefore, that b will be
consistent if lim(X'X)⁻¹ = 0. We have

$$\lim (X'X)^{-1} = \lim D_T^{-1}\left[D_T^{-1}X'XD_T^{-1}\right]^{-1}D_T^{-1} = 0\cdot C^{-1}\cdot 0 = 0$$

because lim D_T⁻¹ = 0 follows from lim d²_Tk = ∞.
Using the central limit theorem given by Equations 5.3.4 and 5.3.5, it can be
shown that

$$\sum_{t=1}^{T}D_T^{-1}x_t e_t = D_T^{-1}X'e \xrightarrow{d} N(0, \sigma^2 C) \qquad (5.3.30)$$

and hence that

$$D_T(b - \beta) \xrightarrow{d} N(0, \sigma^2 C^{-1}) \qquad (5.3.31)$$

The central limit theorem we have used assumes the existence of the third
moment of e_t, a much stronger condition than is required. For details of how to
prove the same result under less restrictive assumptions about the distribution
of e_t, see Anderson (1971, pp. 23-25). The assumptions we have made about X
are sometimes referred to as the Grenander conditions. They will be used in
later chapters and may be summarized as

1. lim d²_Tk = ∞, for k = 1, 2, . . . , K.
2. lim(x²_{T+1,k}/d²_Tk) = 0.
3. lim D_T⁻¹X'XD_T⁻¹ = C is finite and nonsingular.
From (5.3.31) we obtain

$$(b - \beta)'D_T C D_T(b - \beta)/\sigma^2 \xrightarrow{d} \chi^2_{(K)} \qquad (5.3.32)$$

Replacing C and σ² with the consistent estimators D_T⁻¹X'XD_T⁻¹ and σ̂², respectively,
yields (see Problem 5.1)

$$\frac{(b - \beta)'X'X(b - \beta)}{\hat{\sigma}^2} \xrightarrow{d} \chi^2_{(K)} \qquad (5.3.33)$$

If we had begun by considering the linear combinations Rβ, it would have also been
possible to show that the more general result holds, that is,

$$\lambda_1 = \frac{(Rb - R\beta)'\left[R(X'X)^{-1}R'\right]^{-1}(Rb - R\beta)}{\hat{\sigma}^2} \xrightarrow{d} \chi^2_{(J)} \qquad (5.3.34)$$

Thus, our traditional test statistics are valid when the assumption of finite
lim(X'X/T) is violated, but to prove their validity, instead of using the scalar
√T, we must choose an appropriate matrix function D_T to standardize the
distribution of (b − β).
In Section 5.3 we have been concerned with the problem of using a limiting distribution
as a method for carrying out approximate finite sample inference. We
saw that, under certain conditions, √T(b − β) →d N(0, σ²Q⁻¹), and that this result
leads us to test statistics such as λ₁ in (5.3.34). A common and imprecise
way of referring to the limiting distribution of √T(b − β) is to say that b is
asymptotically normal with mean β and covariance matrix σ²Q⁻¹/T. Alternatively,
we often replace Q⁻¹/T with (X'X)⁻¹ and say b is asymptotically normal
with mean β and covariance matrix σ²(X'X)⁻¹. Such usage is imprecise because,
strictly speaking, the limiting distribution of b is degenerate on the single
point β. It has covariance matrix zero. However, when making inferences
about β, with statistics such as λ₁, we do behave as if the covariance matrix for
b is σ²(X'X)⁻¹. As a consequence, σ²Q⁻¹/T (and sometimes σ²(X'X)⁻¹) is frequently
called the asymptotic covariance matrix of b. This term is particularly
ambiguous because it could also describe the limit of the finite sample covariance
matrix for b, that is, lim E[(b − lim E[b])(b − lim E[b])']. As we discussed
in Section 5.3.1, in general these two definitions need not be the same.
Throughout this book the term asymptotic covariance matrix will generally
refer to the approximate covariance matrix obtained from a limiting distribu-
tion, rather than the limit of a finite sample covariance matrix. Although there
are a number of instances where finite sample moments are approximated by
their limits, most finite sample approximations in econometrics are based on
limiting distributions. For details of other approximation methods, including
asymptotic expansions of moments, density functions, distribution functions
and characteristic functions, see Greenberg and Webster (1983), Phillips (1980,
1982, 1983), Rothenberg (1982), Sargan (1976) and Taylor (1983).
So far, one of the assumptions we have employed for the general linear model
y = Xβ + e is that the regressor matrix is nonstochastic. In economics, however,
we seldom have the luxury of setting the values of the explanatory
variables, and so it is important to consider how the results change when X is
stochastic.
In this section we assume that the regressor matrix X is stochastic but completely
independent of the disturbance vector e. Under these circumstances the
finite sample properties of Chapter 2 will still hold, providing we treat them as
conditional on X. To appreciate this point note that the properties in Chapter 2
were derived from the marginal distribution of e. When X is stochastic and we
wish to obtain results conditional on X, the properties need to be derived from
the conditional distribution of e|X. Since f(e) = f(e|X) when X and e are independent,
it follows that, conditional on X, E[b] = β and E[(b − β)(b − β)'] =
σ²(X'X)⁻¹; and if e is assumed to be normally distributed, the usual test statistics
also hold conditional on X.
We are more likely, however, to be interested in unconditional inferences
than in those restricted to the particular sample values of X. Unconditionally,
we can say that the least squares estimator b will still be unbiased and have
covariance matrix given by σ²E[(X'X)⁻¹], providing that the expectations
E[(X'X)⁻¹X'] and E[(X'X)⁻¹] exist. See Judge et al. (1982) or Schmidt (1976) for
details. Note, however, that unless something is known about the form of the
distribution of X, it may be difficult to establish whether these expectations
exist. Furthermore, because b is a nonlinear function of X and e, it will not
follow a normal distribution even if X and e are both normally distributed.
Consequently, our usual test statistics do not hold in finite samples.
What about asymptotic properties? First we need to replace the assumption
that lim(X'X/T) is finite and nonsingular with the stochastic assumption that

$$\text{plim}\left(\frac{X'X}{T}\right) = Q \qquad (5.4.1)$$

is finite and nonsingular, and that

$$\text{plim}\left(\frac{X'e}{T}\right) = 0 \qquad (5.4.2)$$

Also, if x_t is a (K × 1) vector equal to the tth row of X, and x₁, x₂, . . . , x_T
are independent identically distributed random vectors with E[x_t x_t'] = Q, then it
follows that x₁e₁, x₂e₂, . . . , x_Te_T are independent identically distributed random
vectors with

$$E[x_t e_t] = 0 \qquad \text{and} \qquad E\left[(x_t e_t)(x_t e_t)'\right] = \sigma^2 Q \qquad (5.4.3)$$

and the central limit theorem in (5.3.3) is relevant. Its application yields

$$T^{-1/2}\sum_{t=1}^{T}x_t e_t = T^{-1/2}X'e \xrightarrow{d} N(0, \sigma^2 Q) \qquad (5.4.4)$$
Then, (5.4.1), (5.4.2), and (5.4.4), and the various theorems in Section 5.3, can be used exactly as before to establish the asymptotic validity of our usual test statistics. We need to use the result that plim(X'X/T) = Q rather than lim(X'X/T) = Q, but this fact is of no consequence because in all our previous derivations the finite nonstochastic limit was a stronger condition than was required.
It should be emphasized that, to establish (5.4.4), we assumed that the x_t are independent and that both their first and second moments exist. These assumptions are sufficient to establish (5.4.3), and Khinchine's theorem can then be used to establish (5.4.1) and (5.4.2). However, as Schmidt (1976) notes, there may be circumstances where the existence of E[x_t] and E[x_t x_t'] are fairly strong assumptions. For example, if x_t = 1/w_t, where w_t is normally distributed, then x_t has no moments.
The assumption that the x_t are independent is not critical. Although the central limit theorems we have discussed are not directly relevant for serially correlated x_t's, it is still possible to prove that the assumptions (5.4.1) through (5.4.3) imply (5.4.4). See, for example, Pollock (1979, p. 335). Alternatively, if all the moments of e_t exist, (5.4.4) can be established using the Mann-Wald theorem, presented in the next section.
Finally, it is possible to show that the least squares estimator will also be the maximum likelihood estimator if the vector e is normally distributed, and if the distribution of X does not depend on β or σ². In terms of probability density functions the likelihood function can be written as

f(y, X) = f(y|X)f(X)

Thus, the log-likelihood function will contain the terms log f(y|X) and log f(X), and, because it does not contain β and σ², the term log f(X) will not enter into the maximization process. Furthermore, if e is normally distributed, then f(y|X) will be a normal distribution, and so the least squares estimator for β will also be the maximum likelihood estimator. Properties of maximum likelihood estimators are discussed in Section 5.6.
In this section we move from the model where the stochastic regressors are totally independent of the disturbance vector to the case where they are partially independent in the sense that the tth observation on the regressors is independent of the tth and succeeding values of the disturbance term, but it is not independent of past values of the disturbance term. The most common model with these properties is the linear model with lagged values of the dependent variable as explanatory variables. That is, a model where the tth observation is given by

y_t = Σ_{i=1}^p y_{t−i}γ_i + z_t'α + e_t  (5.4.5)

and where the z_t are assumed to be nonstochastic, β' = (γ₁, γ₂, . . . , γ_p, α') is a vector of unknown parameters, and the e_t are i.i.d. random variables with zero mean and variance σ². This model is discussed in more detail in Chapters 7 and 10, the case where α = 0 being considered in Chapter 7.
It is clear that we can write (5.4.5) in the form

y = Xβ + e  (5.4.6)

where X contains both the lagged values of y and the nonstochastic regressors, and has tth row given by

x_t' = (y_{t−1}, y_{t−2}, . . . , y_{t−p}, z_t')  (5.4.7)

Furthermore,

E[x_t e_t] = 0  (5.4.8)

but

E[x_t e_{t−s}] ≠ 0,  s = 1, 2, . . .  (5.4.9)

To illustrate that (5.4.9) is true, note that one of the components of x_t will be y_{t−1}, that y_{t−1} will be partly determined by e_{t−1}, and it therefore can be shown that E[y_{t−1}e_{t−1}] ≠ 0. Also, because y_{t−1} depends on y_{t−2}, which depends in turn on e_{t−2}, we can show that E[y_{t−1}e_{t−2}] ≠ 0; and this process can be continued for all past values of the disturbance.
The conditions in (5.4.8) and (5.4.9) define the partially independent stochastic regressor model and, in terms of the complete X matrix, they imply

E[X'e] ≠ 0  but  plim(X'e/T) = 0  (5.4.10)
Then

plim b = β  (5.4.12)

and

√T(b − β) →d N(0, σ²Q⁻¹)  (5.4.13)
Thus, the least squares estimator and our usual test procedures are valid asymptotically. Also, if e is normally distributed, it is possible to show (see Chapter 7 or 10) that b is the maximum likelihood estimator conditional on y₀, y₋₁, . . . , y₋ₚ₊₁. For details on how to obtain the unconditional maximum likelihood estimator see Chapter 10. Finally, it is possible, and more conventional, to give conditions on γ' = (γ₁, γ₂, . . . , γ_p) and z_t, which are sufficient for plim(X'X/T) to be finite and nonsingular. See Chapters 7 and/or 10 or Schmidt (1976, p. 97).
The asymptotic results above depend on the lack of contemporaneous correlation between the regressors and the disturbance. This breaks down in several important models, including:

1. The model in (5.4.5) with the additional assumption that the e_t are serially correlated (Chapter 10).
In all three cases, E[X'e] ≠ 0 and plim(X'e/T) ≠ 0, and so the least squares estimator is inconsistent. See Judge et al. (1982, p. 277) for some discussion.
An alternative, more general method of estimation that (under appropriate conditions) leads to consistent and asymptotically normal estimates is what is known as instrumental variable estimation. Many of the estimators that we later develop for the three models specified above can be viewed as special cases of instrumental variable estimators.
To introduce the instrumental variable estimator, we assume there exists a (T × K) matrix of (possibly stochastic) variables Z, which are correlated with X in the sense that

plim(Z'X/T) = Σ_zx  (5.4.14)

exists and is nonsingular, and uncorrelated with e in the sense that

E[z_t e_t] = 0,  t = 1, 2, . . . , T  (5.4.15)

where z_t' is the tth row of Z. We assume also that

plim(Z'Z/T) = Σ_zz  (5.4.16)

exists and is nonsingular, and that e_t has finite moments of every order. Under these conditions it is possible to show that the instrumental variables estimator defined as

β̂_IV = (Z'X)⁻¹Z'y  (5.4.17)

is consistent, with

√T(β̂_IV − β) →d N(0, σ²Σ_zx⁻¹ Σ_zz Σ_xz⁻¹)  (5.4.18)

where Σ_xz = Σ_zx'.
To prove (5.4.17) and (5.4.18) we first note that Z satisfies the conditions of the Mann-Wald theorem and so

plim(Z'e/T) = 0  (5.4.19)

and

T^(-1/2) Z'e →d N(0, σ²Σ_zz)  (5.4.20)
Consistency then follows since

plim β̂_IV = β + plim[(Z'X/T)⁻¹ (Z'e/T)]
          = β + Σ_zx⁻¹ · 0
          = β  (5.4.21)

Then, using (5.4.20), (5.4.14), and the theorem in (5.3.8), the result in (5.4.18) follows.
If we wish to construct interval estimates or test hypotheses about β, we can, along the lines of Section 5.3.3, use (5.4.18) to derive an asymptotic χ² statistic. In this statistic we need to replace Σ_zx, Σ_zz, and σ² by their respective consistent estimators (Z'X/T), (Z'Z/T), and σ̂² = (y − Xβ̂_IV)'(y − Xβ̂_IV)/T. It is straightforward to prove consistency of σ̂² providing we also assume plim(X'X/T) and plim(X'e/T) are finite. Finally, we note that β̂_IV is not necessarily efficient in an asymptotic sense because there will be many sets of instrumental variables that fulfill the requirement of being uncorrelated with the stochastic term and correlated with the stochastic regressors. For details on how to combine a larger set of instrumental variables efficiently, see Harvey (1981, p. 80); examples are given in Section 11.2.1 and Chapter 15.
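As an illustration of these formulas, the following minimal sketch computes β̂_IV in (5.4.17) and the estimated asymptotic covariance matrix implied by (5.4.18). The data-generating process, seed, and variable names are our own assumptions, chosen only to make the computation concrete.

```python
import numpy as np

# A minimal sketch of the instrumental variable estimator (5.4.17)-(5.4.18).
# Hypothetical DGP: one regressor x correlated with the disturbance e through
# a common component u, and a valid instrument z.
rng = np.random.default_rng(1984)
T = 500
z = rng.normal(size=(T, 1))                 # instrument
u = rng.normal(size=(T, 1))
e = rng.normal(size=(T, 1)) + u             # disturbance
x = z + u + rng.normal(size=(T, 1))         # regressor, correlated with e
X = np.hstack([np.ones((T, 1)), x])
Z = np.hstack([np.ones((T, 1)), z])
y = X @ np.array([[1.0], [0.5]]) + e

# b_IV = (Z'X)^{-1} Z'y, equation (5.4.17)
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

# Consistent estimates of sigma^2, Sigma_zx, and Sigma_zz
resid = y - X @ b_iv
sigma2 = float(resid.T @ resid) / T
Szx = Z.T @ X / T
Szz = Z.T @ Z / T

# Estimated asymptotic covariance: (sigma^2/T) Szx^{-1} Szz Sxz^{-1}
Szx_inv = np.linalg.inv(Szx)
avar = sigma2 * Szx_inv @ Szz @ Szx_inv.T / T
print(b_iv.ravel(), np.sqrt(np.diag(avar)))
```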
β̃ = (X'Ψ⁻¹X)⁻¹X'Ψ⁻¹y  (5.5.2)

is the minimum variance estimator within the class of all linear unbiased estimators. The difference between the covariance matrix of the least squares estimator b and that of β̃ is

Σ_b − Σ_β̃ = σ²(X'X)⁻¹X'ΨX(X'X)⁻¹ − σ²(X'Ψ⁻¹X)⁻¹ = σ²AΨA'  (5.5.4)

where A = (X'X)⁻¹X' − (X'Ψ⁻¹X)⁻¹X'Ψ⁻¹. The least squares residual vector is given by

ê = y − Xb = [I − X(X'X)⁻¹X']e  (5.5.6)

(5.5.7)

and

(5.5.8)

Since x_t remains fixed as T → ∞, from (5.5.7) and (5.5.8) we have that the bias and variance of (ê_t − e_t) approach zero as T → ∞, and therefore that plim(ê_t − e_t) = 0. It then follows that ê_t converges to e_t in probability, and hence also in distribution.
The finite sample behavior of the LS residual vector ê is summarized by its covariance matrix

E[êê'] = σ²[I − X(X'X)⁻¹X']Ψ[I − X(X'X)⁻¹X']  (5.5.10)
This covariance matrix will not be equal to σ²I even if Ψ = I. Consequently, test statistics for the null hypothesis H₀: Ψ = I (the e_t are independent and identically distributed) cannot be based on the assumption that, under H₀, the ê_t will be independent and identically distributed. This fact complicates the formulation of appropriate tests and has led to the development of alternative types of residual vectors, say ẽ, which have the property that E[ẽẽ'] = σ²I when E[ee'] = σ²I. Two such types of residuals are BLUS and recursive residuals, and we will describe them briefly in the following two subsections. Their possible use in testing for heteroscedasticity and autocorrelation is described in Sections 11.3.3 and 8.4.2c, respectively.
If we partition the model y = Xβ + e as

y₀ = X₀β + e₀,  y₁ = X₁β + e₁  (5.5.11)

where X₀ is (K × K), X₁ is [(T − K) × K], and y₀, y₁, e₀, and e₁ are conformable, then we find a LUS residual vector ẽ such that E[(ẽ − e₁)'(ẽ − e₁)] is a minimum. For a given partitioning this is the BLUS residual vector and it is "best" in the sense that E[(ẽ − e₁)'(ẽ − e₁)] is a minimum. A convenient way to calculate it is

ẽ = ê₁ − X₁X₀⁻¹ [Σ_{h=1}^H (d_h/(1 + d_h)) q_h q_h'] ê₀  (5.5.12)

where ê' = (ê₀', ê₁') is the least squares residual vector partitioned conformably, and d₁², d₂², . . . , d_K² and q₁, q₂, . . . , q_K are the characteristic roots and vectors, respectively, of X₀(X'X)⁻¹X₀'. The K roots of X₀(X'X)⁻¹X₀' will all be positive and less than or equal to one. Only the H roots which are strictly less than one, and their corresponding characteristic vectors, appear in (5.5.12). The number H is equal to the rank of X₁.
When using BLUS residuals, we need to decide on the appropriate partitioning in (5.5.11) or, if more than one partitioning is to be considered, the appropriate subset of the (T choose K) possibilities. See Theil (1971, p. 217) for some criteria upon which to partition.
Early references to the development of BLUS residuals can be found in Theil (1971, Chapter 5). For later developments and details of BLU residuals with a fixed covariance matrix and BAUS residuals (best augmented unbiased residuals with a scalar covariance matrix), see Abrahamse and Koerts (1971), Abrahamse and Louter (1971), Grossman and Styan (1972), Dubbelman, Abrahamse and Koerts (1972), Farebrother (1977), Dent and Styan (1978), and Dubbelman (1978).
An alternative set of residuals with a scalar covariance matrix are the recursive residuals

ẽ_j = (y_j − x_j'b_{j−1}) / (1 + x_j'S_{j−1}x_j)^(1/2),  j = K + 1, K + 2, . . . , T  (5.5.13)

where b_{j−1} is the least squares estimator computed from the first j − 1 observations and S_{j−1} = (X_{j−1}'X_{j−1})⁻¹ is the corresponding inverse moment matrix. These quantities can be updated recursively using

b_j = b_{j−1} + S_{j−1}x_j(y_j − x_j'b_{j−1}) / (1 + x_j'S_{j−1}x_j)  (5.5.14)

and

S_j = S_{j−1} − S_{j−1}x_j x_j'S_{j−1} / (1 + x_j'S_{j−1}x_j)  (5.5.15)
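A small sketch of the recursion in (5.5.13) through (5.5.15) follows; the simulated design, seed, and variable names are assumptions made only for illustration.

```python
import numpy as np

# A minimal sketch of recursive residuals, following (5.5.13)-(5.5.15).
# The simulated design below is hypothetical.
rng = np.random.default_rng(0)
T, K = 50, 3
X = np.hstack([np.ones((T, 1)), rng.normal(size=(T, K - 1))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=T)

# Initialize with the first K observations
S = np.linalg.inv(X[:K].T @ X[:K])           # S_{j-1} = (X'X)^{-1}
b = S @ X[:K].T @ y[:K]                      # b_{j-1}
recursive = []
for j in range(K, T):
    x = X[j]
    d = 1.0 + x @ S @ x
    recursive.append((y[j] - x @ b) / np.sqrt(d))   # (5.5.13)
    b = b + S @ x * (y[j] - x @ b) / d              # (5.5.14)
    S = S - np.outer(S @ x, S @ x) / d              # (5.5.15)

# Under H0: Psi = I the recursive residuals are i.i.d. with variance sigma^2
print(np.mean(recursive), np.var(recursive))
```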
When we have the model y = Xβ + e, with E[e] = 0, E[ee'] = σ²Ψ, and Ψ unknown, an alternative to the least squares estimator is the estimated generalized least squares (EGLS) estimator

β̂ = (X'Ψ̂⁻¹X)⁻¹X'Ψ̂⁻¹y  (5.5.16)

where Ψ̂ is an estimator of Ψ.

(5.5.17)

Two useful finite sample results are available. First, if Ψ̂ is an even function of e (i.e., Ψ̂(e) = Ψ̂(−e)), then it follows (Kakwani, 1967) that β̂ will be an unbiased estimator for β, providing that the mean of β̂ exists.
For the second result we define

ê = y − Xβ̂  (5.5.18)

and

σ̂² = ê'Ψ̂⁻¹ê / (T − K)  (5.5.19)

and let θ̂ be the estimator of θ (the unknown parameters in Ψ) obtained from the least squares residuals. Then, under not very restrictive conditions on the estimator θ̂, it is possible to show (Breusch, 1980) that the exact distributions of (β̂ − β)/σ, σ̂²/σ², θ̂, and ê/σ do not depend on the parameters β and σ². This result is an important one for the design of Monte Carlo experiments. It means that in such experiments, although we need to consider a number of points in the θ-parameter space, only one point in the parameter space for (β, σ²) needs to be considered.
For the asymptotic properties of β̂ we first investigate the asymptotic properties of the GLS estimator β̃ = (X'Ψ⁻¹X)⁻¹X'Ψ⁻¹y, and then give sufficient conditions for the EGLS and GLS estimators to have the same asymptotic distribution. Consistency and asymptotic normality of β̃ are established by placing the same conditions on the transformed variables PX (where P is such that PΨP' = I) as were placed on X in Sections 5.1 and 5.3. Thus, we assume that the elements in PX are bounded and that

lim_{T→∞} (X'Ψ⁻¹X/T) = V̄  (5.5.20)

is a finite nonsingular matrix. Along the lines of the earlier sections it then follows that β̃ is consistent and

√T(β̃ − β) →d N(0, σ²V̄⁻¹)  (5.5.21)
Turning to the EGLS estimator β̂, sufficient conditions for it to have the same asymptotic distribution as β̃ are

plim(X'Ψ̂⁻¹X/T) = plim(X'Ψ⁻¹X/T)  (5.5.22)

and

plim(X'Ψ̂⁻¹e/√T) = plim(X'Ψ⁻¹e/√T)  (5.5.23)

or, equivalently,

plim[X'(Ψ̂⁻¹ − Ψ⁻¹)X/T] = 0  (5.5.24)

and

plim[X'(Ψ̂⁻¹ − Ψ⁻¹)e/√T] = 0  (5.5.25)

Thus, (5.5.24) and (5.5.25) are a common way (e.g., Schmidt, 1976 and Theil, 1971) of stating sufficient conditions for

√T(β̂ − β) →d N(0, σ²V̄⁻¹)  (5.5.26)
In practice these conditions should be checked for any specific function Ψ(θ) and the corresponding estimator θ̂. Most of the cases we study are such that (5.5.24) and (5.5.25) hold if θ̂ is a consistent estimator for θ. However, as an example given by Schmidt (1976, p. 69) demonstrates, consistency of θ̂ will not, in general, be sufficient for (5.5.26) to hold. For a set of sufficient conditions which can be used as an alternative to (5.5.24) and (5.5.25), see Fuller and Battese (1973).
One reason for using an EGLS estimator is that, under appropriate conditions, it is asymptotically more efficient than the LS estimator. For further details see Exercise 5.4.
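To make the two-step logic of EGLS concrete, here is a minimal sketch for a simple heteroscedastic specification of Ψ(θ) with known diagonal form. The specification var(e_t) = σ² exp(θw_t), the data, and all names are illustrative assumptions, not taken from the text.

```python
import numpy as np

# A minimal EGLS sketch under an assumed heteroscedastic form
# var(e_t) = sigma^2 * exp(theta * w_t), so Psi(theta) is diagonal.
rng = np.random.default_rng(7)
T = 200
w = rng.uniform(size=T)
X = np.hstack([np.ones((T, 1)), rng.normal(size=(T, 1))])
theta_true = 1.5
e = rng.normal(size=T) * np.exp(0.5 * theta_true * w)
y = X @ np.array([2.0, -1.0]) + e

# Step 1: LS residuals, then estimate theta from log squared residuals
b_ls = np.linalg.lstsq(X, y, rcond=None)[0]
ehat = y - X @ b_ls
W = np.hstack([np.ones((T, 1)), w[:, None]])
theta_hat = np.linalg.lstsq(W, np.log(ehat**2), rcond=None)[0][1]

# Step 2: EGLS as in (5.5.16), with Psi-hat = diag(exp(theta_hat * w))
psi_inv = np.exp(-theta_hat * w)             # diagonal of Psi-hat^{-1}
XtPi = X.T * psi_inv                         # X' Psi-hat^{-1}
b_egls = np.linalg.solve(XtPi @ X, XtPi @ y)
print(b_ls, b_egls)
```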
The next task is to find a statistic that can be used for testing the hypothesis Rβ = r, where R is a (J × K) known matrix and r a (J × 1) known vector. Working in this direction, we note that, from (5.5.20) and (5.5.24), (X'Ψ̂⁻¹X/T) is a consistent estimator for V̄, and, if we add the additional condition that

plim[e'(Ψ̂⁻¹ − Ψ⁻¹)e/T] = 0  (5.5.27)

then it is possible to use this condition along with (5.5.24) and (5.5.25) to prove that plim(σ̂² − σ̃²) = 0, where σ̃² = (y − Xβ̃)'Ψ⁻¹(y − Xβ̃)/(T − K). It then follows that σ̂² is a consistent estimator for σ², and, using results on convergence in probability and distribution given in Section 5.3, we have

λ₁ = (Rβ̂ − Rβ)'[R(X'Ψ̂⁻¹X)⁻¹R']⁻¹(Rβ̂ − Rβ)/σ̂² →d χ²_(J)  (5.5.28)

This result can be used to test Rβ = r, or, alternatively, the argument presented in Section 5.3.3 can be used to justify using λ₂ = λ₁/J and the F(J, T−K) distribution. Also, if in (5.5.28) we assume that the null hypothesis is true and replace Rβ with r, then λ₁ is often referred to as the Wald test statistic. More general forms of this statistic will be discussed in Section 5.7.
5.6.1 Properties of Maximum Likelihood Estimators

The maximum likelihood estimator γ̂ is that value of γ for which L(γ) is a maximum. Under some regularity conditions stated below it is possible to show that:
γ̂ is consistent and asymptotically efficient, and

√T(γ̂ − γ) →d N(0, W)  (5.6.3)

where W = lim T[I(γ)]⁻¹ and I(γ) is the information matrix. A result that is ensured by the above conditions and that is useful in other contexts is the information matrix equality

−E[∂²L/∂γ∂γ'] = E[(∂L/∂γ)(∂L/∂γ')]  (5.6.4)

To estimate W, a consistent estimator of lim I(γ)/T is required. Four possibilities, each evaluated at γ = γ̂, are:

1. −(1/T)E[∂²L/∂γ∂γ']
2. (1/T)E[(∂L/∂γ)(∂L/∂γ')]  (5.6.6)
3. −(1/T)(∂²L/∂γ∂γ')  (5.6.7)
4. (1/T) Σ_{t=1}^T (∂L_t/∂γ)(∂L_t/∂γ'), where L_t is the log density of the tth observation
Estimators (1) and (2) are, of course, identical (see Equation 5.6.4), but, depending on the circumstances, one may be easier to compute than the other. The estimator in (2) has the advantage that only first rather than second derivatives are involved, but it has the disadvantage that the expectation of the first derivatives is often harder to find. Estimators (3) and (4) also require second and first derivatives, respectively, but not their expectations. Consequently, estimators (3) and (4) will often be easier to use than (1) or (2), and this will be particularly so if the computer program employed calculates numerical rather than analytical derivatives, and therefore avoids the problem of finding (and programming) explicit expressions for the derivatives. Which estimator yields the best finite sample approximation is an open question and is likely to depend on the model being studied. Also, in some cases additional conditions may be necessary to ensure the consistency of each estimator. Dhrymes (1974) has proved that (3) will always be consistent if f(z_t|γ) admits sufficient statistics; estimator (4) was suggested by Berndt, Hall, and Hausman (1974).
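The difference between the Hessian-based estimator (3) and the outer-product estimator (4) is easy to see in a model where both are available in closed form. The sketch below uses an exponential likelihood f(z|λ) = λe^(−λz); the example, seed, and names are our own assumptions, chosen only because the derivatives are simple.

```python
import numpy as np

# A minimal sketch comparing estimators (3) and (4) of lim I(gamma)/T for the
# exponential likelihood f(z|lam) = lam * exp(-lam * z).
rng = np.random.default_rng(42)
z = rng.exponential(scale=2.0, size=1000)   # true lam = 0.5
lam_hat = 1.0 / z.mean()                    # ML estimator of lam

# (3): minus the average second derivative of the log-likelihood at lam_hat;
# here d2L/dlam2 = -T/lam^2, so -(1/T) d2L/dlam2 = 1/lam_hat^2
est3 = 1.0 / lam_hat**2

# (4): average outer product of the per-observation scores (BHHH);
# here dL_t/dlam = 1/lam - z_t
scores = 1.0 / lam_hat - z
est4 = np.mean(scores**2)

# Both estimate the same limit; the estimated asymptotic variance of lam_hat
# is the inverse of either quantity divided by T.
print(est3, est4, lam_hat)
```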
One difficulty with the discussion so far is that, at the outset, we assumed the z_t were independent identically distributed random vectors. In econometrics this assumption is frequently violated. For example, in a heteroscedastic error model the observations will no longer be identically distributed, and in an autocorrelated error model the observations will no longer be independent. Fortunately, the results we have given do extend to these other cases, providing some additional assumptions are employed where necessary. With the exception of the next subsection, we defer details of other cases to the relevant chapters. Further discussion and examples can be found in Cox and Hinkley (1974, Ch. 9), Crowder (1976), and Pagan (1979).
As an example, consider the general linear model

y = Xβ + e,  e ~ N[0, σ²Ψ(θ)]  (5.6.9)

where the usual notation holds, and we have written Ψ(θ) to emphasize that the unknown disturbance covariance matrix Ψ depends on the (H × 1) vector of parameters θ. This model is clearly an example of one where the observations (the y_t's) are not independent and identically distributed. To find ML estimators, we write the log-likelihood function, apart from constants, as

L = −(T/2) ln σ² − (1/2) ln|Ψ| − (1/2σ²)(y − Xβ)'Ψ⁻¹(y − Xβ)  (5.6.10)
Holding θ (and hence Ψ) fixed, maximization of (5.6.10) with respect to β and σ² yields

β̃(θ) = [X'Ψ⁻¹(θ)X]⁻¹X'Ψ⁻¹(θ)y  (5.6.11)

and

σ̃²(θ) = [y − Xβ̃(θ)]'Ψ⁻¹(θ)[y − Xβ̃(θ)]/T  (5.6.12)

We have used the notation β̃(θ), σ̃²(θ), and Ψ(θ) to emphasize the dependence of these quantities on θ. Substituting (5.6.11) and (5.6.12) back into (5.6.10), and ignoring constants, yields the concentrated log-likelihood function

L(θ) = −(T/2) ln σ̃²(θ) − (1/2) ln|Ψ(θ)|  (5.6.13)

The maximum likelihood estimator for θ, say θ̂, is that value of θ for which L(θ) is a maximum, or, equivalently, that value of θ for which

S(θ) = |Ψ(θ)|^(1/T) [y − Xβ̃(θ)]'Ψ⁻¹(θ)[y − Xβ̃(θ)]  (5.6.14)

is a minimum. The ML estimators for β and σ² are then

β̂ = β̃(θ̂) = [X'Ψ⁻¹(θ̂)X]⁻¹X'Ψ⁻¹(θ̂)y  (5.6.15)

and

σ̂² = σ̃²(θ̂)  (5.6.16)
Thus, the ML estimator for β is of the same form as an EGLS estimator, but instead of using an estimate of Ψ based on least squares residuals, it uses an estimate of Ψ obtained by maximizing (5.6.13) with respect to θ. Maximization of (5.6.13) does not, in general, yield a closed form solution for θ, the first-order conditions being given by [see, for example, Magnus (1978)]

σ̂² tr[(∂Ψ⁻¹/∂θ_h)|_{θ̂} Ψ(θ̂)] = (y − Xβ̂)'(∂Ψ⁻¹/∂θ_h)|_{θ̂}(y − Xβ̂),  h = 1, 2, . . . , H  (5.6.17)
For the full parameter vector γ = (β', σ², θ')' the information matrix is

I(γ) = [ σ⁻²X'Ψ⁻¹X    0                       0
         0             (T/2)σ⁻⁴               (1/2)σ⁻²(vec Ψ⁻¹)'A
         0             (1/2)σ⁻²A'(vec Ψ⁻¹)    (1/2)A'(Ψ⁻¹ ⊗ Ψ⁻¹)A ]  (5.6.18)

where A = A(θ) = ∂vec[Ψ(θ)]/∂θ', and the operator vec(·) stacks the columns of a matrix as a vector.
Although Ψ ≠ I, and thus the elements in the vector y are not independent and identically distributed, under appropriate conditions [see Magnus (1978)] it is possible to show that γ̂ is consistent, that

√T(γ̂ − γ) →d N{0, lim T[I(γ)]⁻¹}  (5.6.19)

and that γ̂ will be asymptotically efficient in the sense described earlier. Details of test statistics based on (5.6.19) are given in the next section. For some specific examples where the quantities in I(γ) are evaluated, see Magnus (1978).
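The concentrated criterion S(θ) in (5.6.14) is straightforward to minimize numerically when θ is a scalar. The sketch below does so for AR(1) disturbances e_t = θe_{t−1} + v_t, for which Ψ⁻¹ is tridiagonal and |Ψ| = 1/(1 − θ²); the simulated data and the crude grid search are illustrative assumptions only.

```python
import numpy as np

# A minimal sketch of ML via the concentrated criterion S(theta), (5.6.14),
# for AR(1) disturbances. Psi^{-1}(theta) is tridiagonal with diagonal
# (1, 1+theta^2, ..., 1+theta^2, 1) and off-diagonal -theta.
rng = np.random.default_rng(3)
T = 120
X = np.hstack([np.ones((T, 1)), rng.normal(size=(T, 1))])
e = np.zeros(T)
for t in range(1, T):
    e[t] = 0.6 * e[t - 1] + rng.normal()
y = X @ np.array([1.0, 2.0]) + e

def S(theta):
    Pinv = np.eye(T) * (1 + theta**2)         # Psi^{-1}(theta)
    Pinv[0, 0] = Pinv[-1, -1] = 1.0
    i = np.arange(T - 1)
    Pinv[i, i + 1] = Pinv[i + 1, i] = -theta
    XtP = X.T @ Pinv
    beta = np.linalg.solve(XtP @ X, XtP @ y)  # beta-tilde(theta), (5.6.11)
    u = y - X @ beta
    det_psi = 1.0 / (1.0 - theta**2)          # |Psi(theta)|
    return det_psi ** (1.0 / T) * (u @ Pinv @ u)

grid = np.linspace(-0.95, 0.95, 381)
theta_hat = grid[np.argmin([S(th) for th in grid])]
print(theta_hat)
```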
In this section we focus attention on three alternative tests for testing hypotheses about the (K + H + 1)-dimensional vector γ = (β', σ², θ')' in the model

y = Xβ + e,  e ~ N[0, σ²Ψ(θ)]  (5.7.1)

The three tests are the Wald (W), likelihood ratio (LR), and Lagrange multiplier (LM) tests, and each is a possible procedure for testing a general null hypothesis of the form

H₀: g(γ) = 0  (5.7.2)

against the alternative

H₁: g(γ) ≠ 0  (5.7.3)

where g is a J-dimensional vector function.
To introduce the Wald test we note that, from (5.6.19) and a result analogous to (5.3.20),

√T[g(γ̂) − g(γ)] →d N{0, F' lim T[I(γ)]⁻¹ F}  (5.7.4)

where γ̂ is the unrestricted maximum likelihood estimator for γ, F is the [(K + H + 1) × J] matrix ∂g'(γ)/∂γ, and I(γ) is the information matrix. Then, assuming H₀ is true, we have

λ_W = g(γ̂)'{F̂'[I(γ̂)]⁻¹F̂}⁻¹g(γ̂) →d χ²_(J)  (5.7.5)

where F̂ is the matrix F evaluated at γ = γ̂, and I(γ̂) is a consistent estimator for I(γ) based on γ̂. The statistic λ_W, used in conjunction with a critical value from the χ²_(J) distribution, is known as the Wald test.
For the LM test we need the restricted ML estimator γ̂_R, which can be obtained by maximizing the constrained likelihood, or by solving the first order conditions from the Lagrangian function

L*(γ, η) = L(γ) + η'g(γ)  (5.7.7)

where η is a (J × 1) vector of Lagrange multipliers. These first order conditions imply

d_R = (∂L/∂γ)|_{γ̂_R} = −F_R η̂_R  (5.7.8)

where F_R is the matrix F evaluated at γ̂_R and η̂_R is the optimal solution for η. Furthermore, if the null hypothesis is true, we would expect γ̂_R to be close to γ̂ and d_R to be close to zero. This idea provides the underlying motivation for the test. Under suitable regularity conditions, it is possible to show that

(1/√T)(∂L/∂γ) →d N[0, lim I(γ)/T]  (5.7.9)

and hence that, when H₀ is true,

λ_LM = d_R'[I(γ̂_R)]⁻¹d_R →d χ²_(J)  (5.7.11)

where I(γ̂_R) is the information matrix evaluated at γ̂_R.
From (5.7.5) and (5.7.11) it is clear that the LM test is based only on the restricted ML estimator, while the Wald test is based only on the unrestricted ML estimator. The LR test uses both the restricted and unrestricted estimators and is based on the statistic

λ_LR = 2[L(γ̂) − L(γ̂_R)]  (5.7.12)

In general, λ_W, λ_LM, and λ_LR are asymptotically equivalent in the sense that they all have the same asymptotic distribution, and they all lead to tests that have the same asymptotic power characteristics. However, in finite samples the three statistics will not yield identical values, and so it is possible for the tests to yield conflicting conclusions. As Breusch and Pagan (1980) note, in the absence of any information on the finite sample properties of the tests, the choice between them is likely to depend on computational convenience. In particular, when the structure of the problem is such that the restricted ML estimates are easier to compute, λ_LM is likely to be computationally more convenient than the other two. If the unrestricted estimates are easier to compute, λ_W is likely to be preferable. Because it requires both restricted and unrestricted estimates, λ_LR is likely to be computationally more demanding than both λ_W and λ_LM.
The precise form of each statistic will depend on the model being examined and the hypothesis being tested. For example, special cases arise if the restrictions being tested are only on β, say of the linear form Rβ = r, or if a special structure for Ψ is assumed and the restrictions pertain only to the vector θ. In this regard, examples relating to the parameter vector θ are given in the chapters on heteroscedasticity, autocorrelation, disturbance related sets of regression equations, and models for combining time series and cross-sectional data. See also Breusch and Pagan (1980) and references therein. One general result on tests of the form H₀: g(θ) = 0 is worth noting at this point. Breusch (1980) shows that the exact distribution of the Wald, LR, and LM tests does not depend on the parameters (β, σ²) regardless of whether the hypothesis is correct or not. Consequently, if a Monte Carlo experiment is used to evaluate the finite sample powers of the three tests for testing hypotheses about θ, only different values of θ need to be considered in the experimental design. Changing β and σ² will not change the powers of the tests.
In the remainder of this section, we consider the special form of the three statistics for testing the set of linear restrictions H₀: Rβ = r against the alternative H₁: Rβ ≠ r. The results hold regardless of the structure of Ψ (providing our assumptions of this section still hold), and so are generally applicable for the models in many of the subsequent chapters. We begin by assuming σ² and Ψ are known. Under these conditions the statistics λ_W, λ_LM, and λ_LR are identical and can be written as

λ = (Rβ̃ − r)'[R(X'Ψ⁻¹X)⁻¹R']⁻¹(Rβ̃ − r)/σ²  (5.7.13a)

(5.7.13b)

= (e*'Ψ⁻¹e* − ê'Ψ⁻¹ê)/σ²  (5.7.13c)

where β̃ is the GLS estimator, ê = y − Xβ̃ is the unrestricted GLS residual vector, and e* is the corresponding restricted residual vector. When σ² and Ψ are unknown and must be estimated, the three statistics are no longer identical; the Wald and LM statistics take the forms

(5.7.14)

and

(5.7.15)

where ê₀* and ê₀ are the residuals from the restricted and unrestricted models, respectively, conditional on Ψ̂ = Ψ̂₀; and ê₁* and ê₁ are the residuals from the restricted and unrestricted models, respectively, conditional on Ψ̂ = Ψ̂₁.
In terms of these residual vectors the likelihood ratio statistic can be shown to be

λ_LR = T ln{|Ψ̂₀|^(1/T) ê₀*'Ψ̂₀⁻¹ê₀* / [|Ψ̂₁|^(1/T) ê₁'Ψ̂₁⁻¹ê₁]}  (5.7.16)
Thus, λ_LR can also be written in terms of residual sums of squares, but in this case the determinants |Ψ̂₀| and |Ψ̂₁| also appear. In λ_W and λ_LM the equivalent determinants cancel out because each statistic depends only on one covariance matrix estimator. Like λ_W and λ_LM, λ_LR has an asymptotic χ²_(J) distribution.
To put λ_LR in a form more comparable to λ_W and λ_LM we can transform it; it can then be shown that

λ_W ≥ λ_LR ≥ λ_LM  (5.7.18)

These inequalities imply that, although the three tests are asymptotically equivalent, their finite sample distributions will not be the same. In particular, rejection of H₀ can be favored by selecting λ_W a priori, while acceptance of H₀ can be favored by selecting λ_LM a priori. Furthermore, the finite sample sizes of the tests will differ. If, for example, a (large sample) 5% significance level is employed and the LR test correctly states the probability of a Type I error as .05, then the probability of a Type I error will be less than .05 for the LM test, and greater than .05 for the Wald test. This source of conflict has led to research into "correcting" the tests so that their sizes are comparable in finite samples. Evans and Savin (1982) consider the case where Ψ = I, and Rothenberg (1984) investigates the more general case where Ψ is unknown. Using Edgeworth-type expansions to derive approximate power functions for his size-corrected tests, Rothenberg shows that when J = 1 the tests will have approximately the same power, but for J > 1 the power functions cross, and so there is no obvious reason for preferring one test over another.
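The inequality in (5.7.18) is easy to observe numerically. The sketch below computes the three statistics for a linear hypothesis in a model with Ψ = I, using the standard residual-sum-of-squares forms of the statistics for that case (a choice consistent with, but not taken from, the text); the simulated data and names are assumptions.

```python
import numpy as np

# A minimal sketch illustrating lambda_W >= lambda_LR >= lambda_LM, (5.7.18),
# for H0: beta_3 = 0 in y = X beta + e with Psi = I.
rng = np.random.default_rng(5)
T, K = 40, 3
X = np.hstack([np.ones((T, 1)), rng.normal(size=(T, K - 1))])
y = X @ np.array([1.0, 0.3, 0.0]) + rng.normal(size=T)

# Unrestricted and restricted (last column dropped) residual sums of squares
e1 = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
Xr = X[:, :-1]
e0 = y - Xr @ np.linalg.lstsq(Xr, y, rcond=None)[0]
ssr1, ssr0 = e1 @ e1, e0 @ e0

lam_w = T * (ssr0 - ssr1) / ssr1
lam_lr = T * np.log(ssr0 / ssr1)
lam_lm = T * (ssr0 - ssr1) / ssr0
print(lam_w, lam_lr, lam_lm)    # the ordering lam_w >= lam_lr >= lam_lm holds
```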
One question we have not yet considered is the relationship between the three statistics and the traditional F statistic used to test the null hypothesis H₀: Rβ = r when Ψ is known but σ² is unknown. This statistic is given by

λ_F = (Rβ̃ − r)'[R(X'Ψ⁻¹X)⁻¹R']⁻¹(Rβ̃ − r)/(J σ̂²)

where σ̂² = (y − Xβ̃)'Ψ⁻¹(y − Xβ̃)/(T − K). Rather than using λ_W or λ_LM with a critical value from the χ²_(J) distribution, a common applied procedure is to use either (λ_LM/J) or (λ_W/J) in conjunction with a critical value from the F(J,T−K) distribution. This procedure is a common one because, when Ψ is unknown, both (λ_LM/J) and (λ_W/J) can be viewed as "estimates" of λ_F. Also, it can be argued that such a procedure is justified because F(J,T−K) converges to χ²_(J)/J as T → ∞. See, for example, Theil (1971, p. 402) and Section 5.3.
Further relevant work on the Wald, LR, and LM tests is that of Breusch (1980), Buse (1982), Gourieroux, Holly, and Monfort (1982), and Ullah and Zinde-Walsh (1982). Breusch shows that the exact distributions of λ_W, λ_LR, and λ_LM depend on (β, σ²) only through (Rβ − r)/σ and hence do not depend on (β, σ²) when the hypothesis is correct. This result is likely to be useful information for the design of Monte Carlo experiments which investigate the finite sample power of the tests. As a teaching device, Buse gives a diagrammatic explanation of the relationship between the three tests. Gourieroux et al. show how the tests can be modified when the alternative hypothesis is Rβ > r rather than Rβ = r; and Ullah and Zinde-Walsh examine the robustness of the statistics when the disturbances follow a multivariate t distribution.
5.8 SUMMARY
continuity points of F, and for every ε > 0, there exists a T₀ such that
5.8.22. If z₁, z₂, . . . , z_T are i.i.d. random vectors with mean μ and covariance matrix Σ_z, then

T^(-1/2) Σ_{t=1}^T (z_t − μ) →d N(0, Σ_z)
5.8.23. If z₁, z₂, . . . , z_T are independently distributed random vectors with mean μ and covariance matrices V₁, V₂, . . . , V_T, and the third moments of the z_t exist, then

T^(-1/2) Σ_{t=1}^T (z_t − μ) →d N(0, V̄)

where V̄ = lim T⁻¹ Σ_{t=1}^T V_t.
5.8.24. Mann and Wald theorem. If, in the general linear model y = Xβ + e, the (possibly) stochastic regressor matrix X and the disturbance vector e satisfy the conditions (1) plim(X'X/T) = Q is finite and nonsingular; (2) E[e] = 0, E[ee'] = σ²I, and e has finite moments of every order; and (3) E[x_t e_t] = 0 for t = 1, 2, . . . , T, then plim T⁻¹X'e = 0 and T^(-1/2)X'e →d N(0, σ²Q).
5.8.25. If z_T →d z and A_T →p C, where A_T is a random matrix and C a matrix of constants, then A_T z_T →d Cz.
5.8.26. If z_T →d z, then g(z_T) →d g(z), where g is a continuous function.
5.8.27. Let {z_T} and {x_T} be sequences of random variables with z_T →d z and x_T →p c, where c is a constant. Then

z_T x_T →d cz  if c ≠ 0
z_T x_T →p 0  if c = 0
(z_T/x_T) →d (z/c)  if c ≠ 0
Exercise 5.1
Prove that the least squares variance estimator σ̂² = (y − Xb)'(y − Xb)/(T − K) is consistent under the assumptions in Section 5.3.5.

Exercise 5.2
Consider a simple model with one explanatory variable that is a linear trend. That is,

y = xβ + e

where x' = (1, 2, . . . , T). Show that

x'e/T^(3/2) →d N(0, σ²/3)

and that

T^(3/2)(b − β)/(σ̂√3) →d N(0, 1)

(d) Show that the Grenander conditions hold for this model. (See Section 5.3.5.)
Exercise 5.3
Prove that the estimator σ̂² = (y − Xβ̂_IV)'(y − Xβ̂_IV)/T is consistent under the assumptions outlined in Section 5.4.3.

Exercise 5.4
Consider the linear model y = Xβ + e where E[ee'] = σ²Ψ, and where lim T⁻¹X'X = Q, lim T⁻¹X'ΨX = V₀, and lim T⁻¹X'Ψ⁻¹X = V̄ are all finite and nonsingular. Furthermore, assume that the result in Equation 5.5.26 holds. Show that Q⁻¹V₀Q⁻¹ − V̄⁻¹ is positive semidefinite, so that the EGLS estimator is asymptotically at least as efficient as the LS estimator.
Exercise 5.5
Given the conditions outlined in Section 5.5.3a, show that the estimator σ̂² = (y − Xβ̂)'Ψ̂⁻¹(y − Xβ̂)/(T − K) is a consistent estimator for σ².
Exercise 5.6
Derive the information matrix given in Equation 5.6.18.

Exercise 5.7
Prove the equivalence of Equations 5.7.13a, 5.7.13b, and 5.7.13c.

Exercise 5.8
Use data of your own choice to find LS estimates for β₁, β₂, and β₃ for the partial adjustment model given in Section 5.3.4. Find the asymptotic covariance matrix for (α₀, α₁, γ)' and construct asymptotic confidence intervals for these parameters.

Exercise 5.9
For a linear model and data of your own choice, split the sample into two groups and consider the corresponding model with

E[ee'] = [ σ₁²I   0
           0      σ₂²I ]
5.10 REFERENCES
Nonlinear Statistical Models

6.1 INTRODUCTION

Unfortunately, for many applied problems, these conditions are not always fulfilled and nonlinear specifications cannot be avoided. An example is the constant elasticity of substitution (CES) production function of the form

ln Q_t = β₀ + β₁ ln[β₂L_t^{β₃} + (1 − β₂)K_t^{β₃}] + e_t  (6.1.1)

with 0 < β₂ < 1, 0 ≠ β₃ < 1, and β₁ < 0 or β₁ > 1. In this model Q_t is the output, L_t the labor input, K_t the capital input, and e_t a random error. For the interpretation of the parameters see Chiang (1974, pp. 417-419).
The CES production function model is of the general form
the notation we will often writef,(~) for f(Xt,~) and as a more compact notation
for (6.1.2) we use
y = f(X,~) +e (6.1.3)
or
y = f(ll) +e (6.1.4)
where y = (Yt. Y2, . . . ,YT)', X' = (XI, X2, . ,XT), e = (el, e2,' .. ,eT)'
and f(X,~) = f(fJ) = [.fi(~), . . . ,fT(~)]/ It will be assumed in the following
that all required derivatives of f() exist and are continuous.
To estimate the unknown parameters, an objective function is specified and
an optimal value of the unknown coefficient vector is computed as the estimate.
Least squares (LS) and maximum likelihood (ML) estimation are considered in
Section 6.2. In general, a system of nonlinear normal equations has to be
solved to find the solution of the optimization problem, and a closed form
expression of the resulting estimator can usually not be given. A brief account
of the computational aspects of this problem is given in Section 6.3. The use of
nonsample information in the form of equality or inequality constraints is dis-
cussed in Section 6.4, and confidence regions and hypothesis testing are con-
sidered in Sections 6.5 and 6.6, respectively. A summary and some concluding
remarks are given in Section 6.7. The topics of this chapter are depicted in a
flow diagram in Table 6.1. Throughout we will give references for extensions of
the material discussed. General reference sources and discussions of nonlinear
regression and related topics are, for instance, Amemiya (1983), Bunke et al.
(1977), Bard (1974), and Goldfeld and Quandt (1972). For an introductory treat-
ment see Judge et al. (1982, Chapter 24).
6.2 ESTIMATION
Suppose the underlying model is of the form (6.1.4). In the following, the true parameter vector will be denoted by β*, whereas β is just some vector in the parameter space. Similarly, the true variance of the errors in (6.1.4) will be denoted by σ*². Thus, e ~ (0, σ*²I). Two estimation methods, least squares (LS) and maximum likelihood (ML), will be considered subsequently. Under our assumptions the ML estimator will be identical to the LS estimator.
As in the linear model case, the LS principle is intuitively appealing for estimating the parameters of the nonlinear statistical model (6.1.4). The nonlinear LS estimator b is chosen to minimize the sum of squared errors

S(β) = [y − f(β)]'[y − f(β)]  (6.2.1)

In the following,

Z(β) = ∂f/∂β'|_β =
[ ∂f₁/∂β₁|_β   . . .   ∂f₁/∂β_K|_β
  .                     .
  ∂f_T/∂β₁|_β  . . .   ∂f_T/∂β_K|_β ]  (6.2.3)
is the (T × K) matrix of first derivatives of f with respect to the parameters. For instance, for the CES production function model (6.1.1) the tth row of Z(β) is given by the (1 × 4) vector

[ 1,
  ln[β₂L_t^{β₃} + (1 − β₂)K_t^{β₃}],
  β₁(L_t^{β₃} − K_t^{β₃}) / [β₂L_t^{β₃} + (1 − β₂)K_t^{β₃}],
  β₁[ln(L_t)β₂L_t^{β₃} + ln(K_t)(1 − β₂)K_t^{β₃}] / [β₂L_t^{β₃} + (1 − β₂)K_t^{β₃}] ]  (6.2.4)
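As a computational check on (6.2.4), a small sketch follows that evaluates f_t(β) and the corresponding row of Z(β) for the CES model; the function names and the finite-difference comparison at the bottom are our own illustrative assumptions.

```python
import numpy as np

# A minimal sketch of f_t(beta) and the Z(beta) row in (6.2.4) for the CES
# model ln Q_t = b0 + b1*ln(b2*L**b3 + (1 - b2)*K**b3) + e_t.
def f_t(beta, L, K):
    b0, b1, b2, b3 = beta
    return b0 + b1 * np.log(b2 * L**b3 + (1 - b2) * K**b3)

def z_row(beta, L, K):
    b0, b1, b2, b3 = beta
    g = b2 * L**b3 + (1 - b2) * K**b3          # the inner CES aggregate
    return np.array([
        1.0,
        np.log(g),
        b1 * (L**b3 - K**b3) / g,
        b1 * (np.log(L) * b2 * L**b3 + np.log(K) * (1 - b2) * K**b3) / g,
    ])

beta = np.array([0.0, -0.5, 0.3, -2.0])        # the true values of Section 6.5
L, K = 0.4, 0.7
num = [(f_t(beta + h, L, K) - f_t(beta - h, L, K)) / 2e-6
       for h in 1e-6 * np.eye(4)]
print(z_row(beta, L, K), np.array(num))        # analytic vs numerical gradient
```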
6.2.1a Consistency

Consistency of b can be established by showing that

lim (1/T)[f(β*) − f(β)]'[f(β*) − f(β)]  (6.2.5)

exists and is nonzero for β ≠ β*. In other words, in the linear model f(β) = Xβ this requires that lim X'X/T exists and is nonsingular, which is one of the conditions for the consistency of the linear LS estimator given in Chapter 5.
Amemiya partitions the sum of squares function as

S(β)/T = (1/T)e'e + (2/T)[f(β*) − f(β)]'e + (1/T)[f(β*) − f(β)]'[f(β*) − f(β)]  (6.2.6)

where e = y − f(β*). The first term on the right-hand side is, of course, independent of the nonlinear specification (6.1.4), so that

plim (1/T)e'e = σ*²  (6.2.7)

Since, by Chebyshev's inequality,

P{|(1/T)[f(β*) − f(β)]'e| > ε} ≤ (σ*²/ε²T²)[f(β*) − f(β)]'[f(β*) − f(β)]  (6.2.8)

it follows that the plim of the second term on the right-hand side of (6.2.6) goes to zero by assumption 2. The third term of (6.2.6) is minimized for β = β* by assumption 3 above; thus (6.2.5) holds and we have established the consistency of b. In fact, it is possible to prove strong consistency of b, that is, b → β* almost surely.
For asymptotic normality we require, in addition, that

Q = lim (1/T)Z(β*)'Z(β*),  with elements indexed by i, j = 1, . . . , K,

exists and is nonsingular. Using a first order Taylor expansion of the gradient of the sum of squares function at β* gives

∂S/∂β|_b = ∂S/∂β|_{β*} + (∂²S/∂β∂β'|_β̄)(b − β*)  (6.2.9)

and, since ∂S/∂β|_b = 0,

√T(b − β*) = −[(1/T)(∂²S/∂β∂β'|_β̄)]⁻¹ [(1/√T)(∂S/∂β|_{β*})]  (6.2.10)

The second term of

(1/T)(∂²S/∂β∂β'|_β̄) = (1/T){2Z(β̄)'Z(β̄) − 2 Σ_{t=1}^T [y_t − f_t(β̄)](∂²f_t/∂β∂β'|_β̄)}  (6.2.12)

can be shown to vanish asymptotically and the first term converges to 2Q. Thus, in summary,

√T(b − β*) →d N(0, σ*²Q⁻¹)  (6.2.13)

so that, in practice, the distribution of b can be approximated using the covariance matrix σ̃²[Z(b)'Z(b)]⁻¹, where

σ̃² = S(b)/(T − K)  (6.2.15)

is a consistent estimator for the variance σ*² and is often preferred over

σ̂² = S(b)/T  (6.2.17)
To set up the likelihood function the distribution of the error terms has to be known. In practice, normality is assumed most often and we will only consider this case. More precisely, it will be assumed that

e ~ N(0, σ²I)  (6.2.20)

so that the likelihood function is

ℓ(β, σ²) = (2πσ²)^(-T/2) exp{−[y − f(β)]'[y − f(β)]/(2σ²)}  (6.2.21)

and the log-likelihood is

ln ℓ(β, σ²) = constant − (T/2) ln σ² − S(β)/(2σ²)  (6.2.22)
Maximizing (6.2.22) with respect to β and σ² yields the LS estimator b and σ̂² = S(b)/T. With δ = (β', σ²)' denoting the full parameter vector and δ̂ = (b', σ̂²)' its ML estimator, we have

√T(δ̂ − δ*) →d N{0, lim T[I(δ*)]⁻¹}  (6.2.23)

where

I(δ*) = −E[∂² ln ℓ/∂δ∂δ'|_{δ*}] =
[ σ*⁻²Z(β*)'Z(β*)   0
  0'                T/(2σ*⁴) ]  (6.2.24)

[see Judge et al. (1982, Chapter 24)]. The term lim T·I(δ*)⁻¹ is the Cramér-Rao lower bound, so that δ̂ is an asymptotically efficient estimator. The block-diagonal structure of (6.2.24) results, of course, from the special assumptions imposed on the considered model. Occasionally, models arise where I(δ*) is not block diagonal. For discussion of an example, see Theil and Laitinen (1979).
The asymptotic covariance matrix in (6.2.23) can be consistently estimated by T·I(δ̂)⁻¹. Alternative estimators that may provide a better approximation to the variance-covariance matrix of b are considered by Clarke (1980). Note that the ML estimator of σ² is σ̂² in (6.2.17), as in the linear model case. The ML estimator is a sufficient statistic if such a statistic exists. However, for nonlinear models such a statistic exists only under rather special conditions [see Hartley (1964)]. For a further discussion of the properties of the ML estimator see Goldfeld and Quandt (1972).
White (1982) considers the problem of misspecified models in the context of ML estimation, and Gallant and Holly (1980) discuss a situation where the true parameter vector may vary with the sample size. As T → ∞ the sequence of true parameter values is assumed to converge to a vector β*. Gallant and Holly give conditions for the ML estimator to have the usual desirable properties in this case. Their framework is actually much more general than the present discussion as they consider nonlinear simultaneous equations models. Burguete, Gallant, and Souza (1982) consider an even more general situation by allowing the parameter vector to be infinite dimensional. They also discuss various estimation procedures. Robust estimation in the context of nonlinear models is discussed by Bierens (1981) and Grossmann (1982), where further references can be found. In the next section we consider some issues related to the numerical computation of the LS estimates.
6.3 COMPUTATION OF
PARAMETER ESTIMATES
The numerical algorithms we consider are iterative, producing at each step a new approximation

β_{n+1} = β_n + h_n  (6.3.1)

where h_n is the step carried out in the parameter space from the point β_n that was obtained in the (n − 1)th iteration. The first obvious problem is to find a starting vector β₁. Clearly, it is desirable to choose β₁ as close to the optimum b as possible to avoid an unnecessary number of iterations. We will see shortly that using a consistent estimate as starting vector is advantageous. In some cases, methods are available to obtain such consistent initial estimates. For example, in estimating time series or distributed lag models consistent initial estimates can be derived (see Chapters 7 and 10).
For the general nonlinear model (6.1.2), Hartley and Booker (1965) described a possible method for finding a consistent initial estimator. They partition the sample into K nonoverlapping subsets of m observations each and compute the averages ȳᵢ, i = 1, 2, . . . , K. Accordingly, the f_t(β) are averaged, giving K functions f̄ᵢ(β). Then β₁ is chosen as solution of the system of K equations ȳᵢ = f̄ᵢ(β), i = 1, . . . , K. This, of course, is also a nonlinear system of equations that needs to be solved iteratively. For a discussion of the consistency conditions of this estimator see Jennrich (1969) and Amemiya (1983). Gallant (1975a) suggests using K representative sample values to compute an initial estimate. This procedure cannot claim consistency of the resulting β₁.
Using one of the gradient methods described in Appendix B, the step h_n is of the general form

h_n = −t_n P_n g_n  (6.3.2)

where g_n = ∂S/∂β|_{β_n} is the gradient of the objective function, P_n is a positive definite direction matrix, and t_n is the step length.
If the starting vector β₁ is a consistent estimator of β*, consider the estimator β₂ obtained after one iteration,

β₂ = β₁ − P₁(∂S/∂β|_{β₁})  (6.3.3)

It can be shown that, under suitable conditions, β₂ has the same asymptotic distribution as the optimizing vector b [Amemiya (1983); Fuller (1976, Chapter 5)]. From (6.3.3) we get

√T(β₂ − β*) = √T(β₁ − β*) − √T P₁(∂S/∂β|_{β₁})  (6.3.4)

Expanding the gradient about β*,

∂S/∂β|_{β₁} = ∂S/∂β|_{β*} + (∂²S/∂β∂β'|_β̄)(β₁ − β*)  (6.3.5)

where β̄ is on the line segment between β₁ and β*, and substituting in (6.3.4) shows that if

plim (1/T)P₁⁻¹ = plim (1/T)(∂²S/∂β∂β'|_{β*})  (6.3.7)

then the assumptions of Section 6.2.1 ensure that √T(β₂ − β*) and √T(b − β*) have the same asymptotic distribution [see (6.2.10)].
This discussion implies that many of the optimization procedures presented in Appendix B provide estimators asymptotically equivalent to the LS estimator if only one iteration is carried out. For instance, for the Newton-Raphson algorithm

P₁ = [∂²S/∂β∂β'|_{β₁}]⁻¹  (6.3.8)

and it follows from Section 6.2.1 that it satisfies (6.3.7). For the Gauss method

P₁ = [2Z(β₁)'Z(β₁)]⁻¹  (6.3.9)

and it is argued in Section 6.2.1 that, after multiplying (6.3.9) and (6.3.8) by T, the two resulting matrices are asymptotically equivalent [see (6.2.12)]. Consequently, starting with a consistent estimator β₁, the Newton-Raphson and the Gauss methods asymptotically provide an LS estimator in one iteration. On the other hand, for the steepest descent algorithm P₁ = I, so that one iteration of this method does not suffice to obtain an estimator asymptotically equivalent to b.
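The Gauss method lends itself to a compact sketch: combining (6.3.2) and (6.3.9) with t_n = 1 gives the step h_n = [Z(β_n)'Z(β_n)]⁻¹Z(β_n)'[y − f(β_n)]. The code below applies these steps to the CES model, with data simulated along the lines of the design described in Section 6.5; the starting value and stopping rule are illustrative assumptions, and in practice one would add the step-length control t_n.

```python
import numpy as np

# A minimal Gauss-Newton sketch for the CES model (6.1.1), with step
# h_n = (Z'Z)^{-1} Z'(y - f(beta_n)), i.e., (6.3.2) with (6.3.9) and t_n = 1.
rng = np.random.default_rng(30)
T = 30
L = rng.uniform(size=T)
Kc = rng.uniform(size=T)                       # capital input

def f_vec(b):
    g = b[2] * L**b[3] + (1 - b[2]) * Kc**b[3]
    return b[0] + b[1] * np.log(g)

def Z_mat(b):
    g = b[2] * L**b[3] + (1 - b[2]) * Kc**b[3]
    return np.column_stack([
        np.ones(T),
        np.log(g),
        b[1] * (L**b[3] - Kc**b[3]) / g,
        b[1] * (np.log(L) * b[2] * L**b[3]
                + np.log(Kc) * (1 - b[2]) * Kc**b[3]) / g,
    ])

y = f_vec(np.array([0.0, -0.5, 0.3, -2.0])) + 0.25 * rng.normal(size=T)

beta = np.array([0.05, -0.45, 0.35, -1.8])     # assumed starting vector
for _ in range(100):
    Z = Z_mat(beta)
    step = np.linalg.solve(Z.T @ Z, Z.T @ (y - f_vec(beta)))
    beta = beta + step
    if np.max(np.abs(step)) < 1e-8:
        break
print(beta)
```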
Suppose that the true parameters β* are known to fulfill equality constraints given in the form

q(β) = 0  (6.4.1)

For the CES example such a constraint is

q(β) = β₁ − 1/β₃  (6.4.2)
(6.4.3)

where

(6.4.4)

and

Σ* = σ*² lim [Z(β*)'Z(β*)/T]⁻¹  (6.4.5)

(6.4.6)
Note that this method increases the number of parameters in the objective function and thereby may increase the minimization difficulties. This problem can be avoided if the restrictions are given or can be written in the form

β = g(α)  (6.4.7)

where α is a parameter vector of dimension smaller than K. For the CES example with the constraint (6.4.2), we can write

g(α) = (α₀, α₁, α₂, 1/α₁)'  (6.4.8)

with α = (α₀, α₁, α₂)' = (β₀, β₁, β₂)'.
Denoting by â the LS estimator of α and by b* = g(â) the implied restricted estimator of β, and noting that, by the chain rule,

∂(f∘g)/∂α'|_α = [∂f/∂β'|_{g(α)}][∂g/∂α'|_α] = Z(g(α))[∂g/∂α'|_α]  (6.4.10)

we get

(6.4.11)

√T(b* − β*) →d N(0, σ*² lim [∂g/∂α'|_{α*}]{[∂g/∂α'|_{α*}]'[Z(β*)'Z(β*)/T][∂g/∂α'|_{α*}]}⁻¹[∂g/∂α'|_{α*}]')  (6.4.12)
Rothenberg (1973) discusses equality constraints of the above type in the context of maximum likelihood estimation.
If the nonsample information comes in the form of inequality constraints

q(β) ≥ 0  (6.4.13)

where q(β) is a function with values in the l-dimensional Euclidean space, this information can also be used to improve the LS estimator of β. The inequality sign "≥" between vectors means that all components of the left-hand vector are greater than or equal to the corresponding components of the right-hand vector. If the true parameter vector β* is in the interior of the constrained parameter space and if the constrained parameter space has the same dimension as ℬ, the inequality constrained LS estimator has the same asymptotic distribution as the unrestricted LS estimator b. This result is an immediate consequence of the discussion in Section 6.2.1 since, provided q(·) is continuous, (6.4.13) implies that the constrained parameter space, say ℛ, is compact if ℬ is compact, so that assumptions 1 to 4 are satisfied. The case where the true parameter vector is on the boundary of the restricted region is more difficult to deal with. Some results are given by Chant (1974). Even if the true parameter vector is in the interior of the restricted region it is possible to improve on the small sample performance of the LS estimator by taking the inequality restrictions into account. We will discuss this possibility now.
Suppose that the constrained parameter space ℛ is a convex subset of the unconstrained parameter space ℬ, which, for simplicity, is also assumed to be convex. Note that by assumption 1 in Section 6.2.1, ℬ is a bounded set. The assumption here is that in addition to the restrictions that are required to specify ℬ, further constraints are available to reduce the feasible parameter space to the convex set ℛ. For example, ℛ will be convex if q(β) is quasi convex, that is, for vectors β₁, β₂ ∈ ℬ the condition q(β₁) ≤ q(β₂) implies for 0 ≤ λ ≤ 1 that q(λβ₁ + (1 − λ)β₂) ≤ q(β₂), so that a linear function is quasi convex. As an alternative to the restricted LS estimator, for any positive semidefinite (K × K) matrix W, we can define an estimator b_W such that

(b_W − b)'W(b_W − b) = min_{β ∈ ℛ̄} (β − b)'W(β − b)  (6.4.14)

where ℛ̄ is the topological closure of the constrained parameter space. ℛ being convex guarantees that b_W is well defined. If b satisfies the constraints or is on the boundary of ℛ, that is, b ∈ ℛ̄, then b_W = b, whereas, for b outside of ℛ̄, the restricted estimator b_W will depend on W. It follows from results due to Rothenberg (1973) that
A Bayesian alternative is to maximize the posterior density p₀(β)ℓ(β), where ℓ(β) is the likelihood function and p₀(β) the prior density. If p₀(β) is continuous and p₀(β*) ≠ 0, the maximum posterior density estimator is asymptotically equivalent to the maximum likelihood estimator β̂. For a further discussion of Bayesian methods see Bard (1974, Chapter IV, Section C).
where βᵢ* and bᵢ are the ith coordinates of β* and b, respectively; z_{α/2} is the critical value of a normal distribution with mean zero and variance one; and Zⁱⁱ is the ith diagonal element of [Z(b)'Z(b)]⁻¹. Thus √(σ̃²Zⁱⁱ) is an estimate of the asymptotic standard deviation of bᵢ. If the sample size is small, z_{α/2} could be replaced by the critical value of a t distribution, since all results are only valid asymptotically and asymptotically the t distribution converges to the normal distribution. Also, σ̂² may be used in (6.5.1) instead of σ̃².
To illustrate we use the output, labor, and capital data given in Table 6.2. The labor and capital data are random samples from a uniform distribution on the interval (0,1), and ln Q_t is computed by adding an error e_t, generated from a normal distribution with mean zero and standard deviation .25, to −.5 ln(.3L_t⁻² + .7K_t⁻²) for each t = 1, 2, . . . , 30. Thus β₀* = 0, β₁* = −0.5, β₂* = .3, β₃* = −2, and σ*² = .0625. The LS estimates and 95% confidence intervals computed according to (6.5.1) are given in Table 6.3.
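The interval in (6.5.1) is simple to compute once b and [Z(b)'Z(b)]⁻¹ are available. The sketch below does so for the estimates produced by the Gauss-Newton sketch above (and so is meant to be run after it); the 1.96 critical value is the standard normal z_{.025}, and the concrete numbers will of course differ from those in Table 6.3.

```python
import numpy as np

# A minimal sketch of the asymptotic 95% intervals b_i +/- z*sqrt(s2 * Z^ii),
# following (6.5.1); beta, Z_mat, f_vec, and y come from the Gauss-Newton
# sketch above.
Z = Z_mat(beta)
resid = y - f_vec(beta)
s2 = resid @ resid / (len(y) - len(beta))      # sigma-tilde^2 of (6.2.15)
cov = s2 * np.linalg.inv(Z.T @ Z)              # estimated asy. covariance of b
half = 1.96 * np.sqrt(np.diag(cov))            # z_{.025} = 1.96
for i, (b_i, h) in enumerate(zip(beta, half)):
    print(f"beta_{i}: {b_i:.4f}  [{b_i - h:.4f}, {b_i + h:.4f}]")
```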
If the empirical investigator is not prepared to approximate the small sample distribution of b by its asymptotic distribution, conservative confidence intervals can be established using Chebyshev's inequality

Pr(|bᵢ − βᵢ*| ≥ σᵢ/√a) ≤ a  (6.5.2)

where bᵢ may be any estimator for βᵢ* and σᵢ is the standard deviation of bᵢ. Thus, if bᵢ is unbiased,

Pr(bᵢ − σᵢ/√a ≤ βᵢ* ≤ bᵢ + σᵢ/√a) ≥ 1 − a  (6.5.3)
[Table 6.2 Output, labor, and capital data: columns t, L_t, K_t, ln Q_t]
If bᵢ is biased, the corresponding interval

(bᵢ − σᵢ/√a, bᵢ + σᵢ/√a)  (6.5.4)

may be misleading. Hence we have to use a procedure to reduce the bias. Bard (1974, p. 187) explains a possible method, which is due to Quenouille. In the following section joint confidence regions for all the parameters β* are considered.
We will now assume that the model errors are normally distributed, that is, e ~ N(0, σ*²I). The following three statistics have been suggested to compute approximate joint confidence regions for β*:

F₁ = {[S(β) − S(b)]/K} / {S(b)/(T − K)}  (6.5.5)

(6.5.6)

(6.5.7)

All three statistics are asymptotically distributed as F(K,T−K) if β = β*, and thus approximate (1 − a)100% confidence regions are given by Fᵢ ≤ F_a(K,T−K), i = 1, 2, 3. While all three statistics are identical in the linear model case (see Chapter 2), this does not hold for the nonlinear model (6.1.2). The confidence contours provided by F₁ are in general not ellipsoids. Joint confidence regions for the production function example are depicted in Figure 6.1.
Using a second-order Taylor expansion,

S(β) − S(b) = (1/2)(β − b)'[∂²S/∂β∂β'|_b](β − b)  (6.5.9)
[Figure 6.1 Exact confidence contours for the CES production function parameters, with approximate confidence levels, for the parameters β₂* and β₃* (β₀ = .1245, β₁ = −.3363). Axes: β₂ (vertical) and β₃ (horizontal); contours include S(β) = 6.0.]

[Figure 6.2 Confidence contours for significance level α = .01 (β₀ = .1245, β₁ = −.3363). Key: confidence region from (6.5.6); confidence region from (6.5.5).]
The statistic F₂ provides confidence ellipsoids because, as argued in Section 6.2.1, the second term on the right-hand side of (6.2.12) becomes small as T → ∞. The statistic F₃ also provides confidence ellipsoids.
Since F₁ = F₂ = F₃ for linear models, it is plausible that the accuracy of the confidence regions will depend on the functional form of the model. In particular, if the deviations from linearity are only slight, the confidence regions can be expected to be good approximations to the actual confidence regions. This problem is discussed in detail by Beale (1960).
Another method to obtain confidence regions is suggested by Hartley (1964), who defines the statistic

F₄ = {[y − f(β)]'W(W'W)⁻¹W'[y − f(β)]/K} / {[y − f(β)]'[I − W(W'W)⁻¹W'][y − f(β)]/(T − K)}  (6.5.10)
(6.5.11)
[Bard (1974), Section 7-10]. Here b̄ may be any unbiased estimator for β* and Σ_b̄ is the variance-covariance matrix of b̄. The drawbacks of this procedure have already been discussed in Section 6.5.1.
The above statistics can also be used to compute joint confidence regions for subsets of the parameters and, of course, they can be used for hypothesis testing as discussed in the next section. In addition, the Wald, the Lagrange multiplier, and the likelihood ratio statistics presented in the following section can also be used to obtain confidence regions.
Throughout this section we will assume that the model errors are normally distributed, that is, e ~ N(0, σ*²I), and the assumptions of Section 6.2.1 are satisfied so that the LS and ML estimators are identical and have the properties stated in Section 6.2. In fact, for much of what follows e need not be normally distributed if the ML estimator is used and is assumed to have suitable properties [e.g., Gallant and Holly (1980)]. Moreover, some results hold for the LS estimator if the assumptions of Section 6.2.1 are satisfied and the distribution of e is unknown.
Suppose we wish to test

H₀: q(β*) = 0  against  H₁: q(β*) ≠ 0  (6.6.1)

(6.6.2)

(6.6.3)

One possible test statistic is

F₁ = {[S(b*) − S(b)]/J} / {S(b)/(T − K)}  (6.6.4)

where b* is the restricted LS estimator and J is the number of restrictions. Another possibility is to embed the restricted model in the more general specification

y_t = f_t(g(α)) + z_t'δ + e_t  (6.6.5)

If the restrictions for the parameters are correct, then δ should be zero. Therefore, H̃₀: δ = 0 is tested in the usual manner and H₀ is rejected if H̃₀ is rejected. Since under H₀ the model is linear, this test has a small-sample justification. Of course, a crucial problem is to find a regressor z_t giving good power against the alternative. The test can be generalized to cover the case where the model is nonlinear in the nuisance parameters [Gallant (1977)].
For the CES example the null hypothesis of interest can be written as

H₀: β₃* = 1/β₁*  or  q(β*) = β₁* − 1/β₃* = 0  (6.6.7)

Note, however, that we cannot test H₀: β₁ = 0, that is, q(β*) = β₁, since in this case β₀ and β₁ are no longer identifiable if H₀ is true [see Judge et al. (1982, Chapter 24)]. This violates an assumption on which the validity of the asymptotic properties of the estimator is based. We will assume in the following that q(β) satisfies the assumptions of Section 6.4.1; that is, q(β) is at least twice continuously differentiable, and [∂q/∂β'|_β] has full rank in a neighborhood of β*. Detailed formal conditions for model and constraints are given in Gallant and Holly (1980) and Rao (1973) [see also Chapter 5].
provided the restrictions are correct and thus q(β*) = 0 (see Chapter 5). As ML estimator of the variance-covariance matrix of b,

σ̂²[Z(b)'Z(b)]⁻¹  (6.6.10)

can be used in (6.6.9). A test based on λ_W rejects H₀ if λ_W > χ²_(J,a). Note that this test does not require the computation of the constrained ML estimator.
For the CES example with q(β) = β₁ − (1/β₃) we get [∂q/∂β'|_β] = [0, 1, 0, 1/β₃²], and, consequently, for the hypothesis q(β*) = 0 we have λ_W = .0018 < χ²_(1,.05) = 3.84. Thus we cannot reject the null hypothesis (6.6.7) at the .05 level.
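A sketch of this Wald computation, in the same Python setting as the earlier estimation sketches, follows; beta and cov are reused from the confidence interval sketch above, and the resulting number will of course differ from the .0018 reported in the text.

```python
import numpy as np

# A minimal sketch of the Wald test for H0: q(beta) = beta_1 - 1/beta_3 = 0,
# reusing beta and cov from the confidence interval sketch above. The critical
# value 3.84 is the .05 point of the chi-square(1) distribution.
q = beta[1] - 1.0 / beta[3]
dq = np.array([0.0, 1.0, 0.0, 1.0 / beta[3] ** 2])   # dq/dbeta evaluated at b
lam_w = q**2 / float(dq @ cov @ dq)
print(lam_w, "reject H0" if lam_w > 3.84 else "do not reject H0")
```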
The Lagrange multiplier test exploits the fact that, at the constrained ML estimator b*, the gradient of the log-likelihood should be close to zero when H₀ is true.

(6.6.11)

The constrained estimator b* and the Lagrange multiplier estimate η̃ satisfy the first order conditions

∂ ln ℓ/∂β|_{b*} + [∂q/∂β'|_{b*}]'η̃ = 0  (6.6.12)

and the statistic can be written as

λ_LM = [y − f(X, b*)]'Z(b*)[Z(b*)'Z(b*)]⁻¹Z(b*)'[y − f(X, b*)] / [S(b*)/T]  (6.6.13)

which is the Rao efficient score test statistic [Rao (1973), Section 6e.3]. This form is preferable to (6.6.11) if we incorporate our constraints directly in the likelihood function (see Section 6.4.1) and, consequently, we do not have to estimate η̃. For this test only the constrained parameter estimator is required. For the CES production function we computed the constrained estimates of β₀*, β₁*, and β₂*. The value of the test statistic λ_LM = .093 is, like the Wald test statistic, considerably lower than the critical value χ²_(1,.05).
For the likelihood ratio test, H₀ is rejected if

λ_LR = T ln(σ̂*²/σ̂²)  (6.6.14)

exceeds χ²_(J,a) for a prespecified significance level a [e.g., Rao (1973), Section 6e.3]. Under H₀, σ̂*² = S(b*)/T is the ML estimator of σ². Equivalently, (6.6.14) can be written as
6.7 SUMMARY
In this chapter we have extended the basic linear model of Chapter 2 by allowing the functional relationship between the dependent and the independent (exogenous) variables to be nonlinear in the parameters and the variables. The LS and ML principles can be used for estimating the parameters as in the linear model case. However, the small sample properties of the estimators can in general not be derived, and we have concentrated on asymptotic properties.
6.8 EXERCISES
Most of the results in this chapter are only valid for large sample sizes. In econometrics, however, we typically work with small or moderate samples. Therefore, the small-sample and asymptotic distributions of the estimators or test statistics may differ substantially in practice. Also, the values of the asymptotically equivalent test statistics given in Section 6.6 will usually differ in small samples. The reader is invited to explore the small-sample properties of some of the estimators and test statistics by solving the following exercises. The model under consideration is assumed to be

y_t = f(x_t, β) + e_t,  t = 1, 2, . . . , 20  (6.8.1)

where f(x_t, β) = β₀ + β₁(x_{t1}^{β₂} + β₃x_{t2}) and the e_t are independently, normally distributed errors with mean zero and variance σ² = 1. It is left to the reader to give an economic interpretation to this model. The true parameter values are β₀* = 5, β₁* = 2, β₂* = 2, and β₃* = .5; and x_{t1} = L_t and x_{t2} = K_t, t = 1, 2, . . . , 20. The values for L_t and K_t are given in Table 6.2. We give the gradient and Hessian matrix of f(x_t, β) and thus provide all necessary derivatives for constructing the gradient and Hessian of the sum of squared errors objective function.

∂f/∂β|_β = [ 1,  x_{t1}^{β₂} + β₃x_{t2},  β₁ ln(x_{t1})x_{t1}^{β₂},  β₁x_{t2} ]'  (6.8.2)

∂²f/∂β∂β'|_β =
[ 0   0                       0                             0
  0   0                       ln(x_{t1})x_{t1}^{β₂}         x_{t2}
  0   ln(x_{t1})x_{t1}^{β₂}   β₁[ln(x_{t1})]²x_{t1}^{β₂}    0
  0   x_{t2}                  0                             0 ]  (6.8.3)
Exercise 6.1
Generate a sample of 20 values of y for the model (6.8.1) and compute LS/ML estimates for β* and σ². Apply different starting values for the numerical optimization algorithm of your choice. Compute also the approximate covariance matrix of b.
Exercise 6.2
Choose a significance level a and construct confidence intervals for the estimates.
Exercise 6.3
Generate another sample of size 20, combine it with the one used in Exercise
6.1, and repeat the estimation with the sample of size 40. Compare the results
and interpret.
Exercise 6.4
Using a sample of size 20, compute LS estimates that satisfy
(a) b₁ = b₂
(b) b₁ = 1/b₃
(c) b₁ = b₂ and b₁ = 1/b₃
Exercise 6.5
Use the test statistics λ_W, λ_LM, and λ_LR, given in Section 6.6.2, to test the following null hypotheses:
(a) H₀: β₁* = β₂*.
(b) H₀: β₁* = 1/β₃*.
(c) H₀: β₁* = β₂* and β₁* = 1/β₃*.
Compare the values of the test statistics and interpret.
Exercise 6.6
Using the three F statistics (6.5.5) to (6.5.7), test the following hypotheses:
(a) H₀: β* = (5, 2, 2, .5)'.
(b) H₀: β* = (3, 1, 2, .5)'.
Compare the test values and interpret.
Exercise 6.7
Repeat all tests in Exercises 6.5 and 6.6 using a sample of size 40.
Exercise 6.8
Compute the average values of the estimates of β* and σ̂² from 100 samples and compare with the true values.
Exercise 6.9
Construct empirical frequency distributions for b and σ̂² and interpret.

Exercise 6.10
Use the Kolmogorov-Smirnov D statistic or some other appropriate statistic to test the agreement between the empirical and theoretical asymptotic distributions.

Exercise 6.11
Use the computed test statistics for the tests in Exercises 6.5 and 6.6 from 100 samples to perform a small-sample comparison of the asymptotically equivalent tests.

Exercise 6.12
Repeat Exercise 6.11 with 100 samples of size 40. Compare the results.
6.9 REFERENCES
Time Series
7.1 INTRODUCTION
Given a set of time series data y₁, y₂, . . . , y_T, we have assumed in the preceding chapters that each associated random variable y_t consists of a mean, say μ_t, and a zero mean random part e_t. The mean was usually assumed to be a function of some other variables, for instance, μ_t = x_t'β in the simple linear statistical model in Chapter 2, or μ_t = f(x_t, β) in the nonlinear statistical model of Chapter 6. This presumes that a theory is available suggesting what other variables summarized in x_t have an impact on the observed value of y_t. This assumption is not always realistic since sometimes no adequate theory is available or a particular researcher is not ready to accept the existing theories. In such a situation the given time series is often analyzed without relating it to other variables. In the case when forecasting is the objective such specifications have been used with some success. Also, in conventional econometric models an analysis of the residuals is often made in an attempt to improve estimator performance. Consequently, time series analysis is also relevant in the context of econometric modeling if the original sample consists of time series data, and thus, the residuals are also a time series. This possibility will be discussed in greater detail in some of the following chapters.
The chapter is organized as follows: In Section 7.2 a range of useful time series models is introduced, and the use of these models for forecasting is discussed in Section 7.3. Estimation and model specification are discussed in Sections 7.4 and 7.5, respectively. Spectral analysis, which is a rather different approach to time series analysis, is introduced in Section 7.6. Data transformations, such as trend and seasonal adjustment, are treated in Section 7.7. A summary and conclusions are given in Section 7.8. The content of this chapter is summarized in Table 7.1. For the reader without some basic knowledge of the subject, studying a more detailed introduction [e.g., Judge et al. (1982)] may be useful before reading this chapter. Much of the material introduced below will be useful in some of the later chapters.
Given a sample y₁, y₂, . . . , y_T, we assume that these observations are realizations of random variables also denoted by y₁, y₂, . . . , y_T, respectively. These random variables are a subsequence of an infinite sequence y_t, t = 0, ±1, ±2, . . . , which is called a stochastic process or, more precisely, a discrete stochastic process since the time index t assumes integer values only. This general framework is, for example, useful to investigate the post sample development
TABLE 7.1 TIME SERIES TOPICS

Time series
- Stochastic processes (Section 7.2)
  - Autoregressive processes (Section 7.2.1)
  - Moving average processes (Section 7.2.2)
  - ARIMA processes (Section 7.2.3)
  - Seasonal ARIMA processes (Section 7.2.4)
- Time domain analysis
  - Forecasting (Section 7.3)
  - Estimation (Section 7.4)
  - Specification (Section 7.5)
    - Box-Jenkins approach (Section 7.5.1)
    - AR order selection (Section 7.5.2)
- Frequency domain analysis (Section 7.6)
- Data transformations (Section 7.7)
  - Box-Cox transformation (Section 7.7.1)
  - Seasonal and trend adjustment (Section 7.7.2)
Much of what follows rests on the assumption of stationarity. A stochastic process y_t is called stationary if

1. E[y_t] = μ for all t.
2. E[y_t²] < ∞ for all t.
3. cov(y_t, y_{t+k}) = γ_k for all t and k.

The first condition says that all the random variables have the same finite mean
μ and 3 implies that the covariance between y_t and y_{t+k} depends only on k but
not on the particular time point t. Of course, for k = 0 this condition implies
that the variances var(y_t) are constant (= γ_0) for all t. From 2 it follows that all
variances and covariances are finite.
Note that this kind of stationarity is sometimes called covariance stationar-
ity or weak stationarity, and a stochastic process is called strictly stationary if
the multivariate distribution of (y_t, …, y_{t+k}) is identical to that of the time
shifted set (y_s, …, y_{s+k}) for all t, s, and k. Since the normal distribution is
completely determined by its first and second moments, a stationary normally
distributed process is strictly stationary. Such a process is often called a Gaus-
sian process.
Examples of stationary processes have been discussed in previous chapters.
For instance, if, as assumed in Chapter 2, the y_t are independent and identically
distributed with zero mean, the process is called white noise. Clearly, such a
process is stationary and, in fact, strictly stationary.
The foregoing discussion implies that the covariances between the elements
of a stochastic process are an important device to describe the structure of such
a process. In general, the covariance between y_t and y_{t+k}, E[(y_t − E[y_t])(y_{t+k} −
E[y_{t+k}])], is called autocovariance since it measures the linear relationship
between members of a single stochastic process. For a stationary stochastic
process with mean μ the covariance structure is completely described by the
sequence

γ_k = E[(y_t − μ)(y_{t+k} − μ)],  k = 0, 1, 2, …  (7.2.1)
The standardized autocovariances are the autocorrelations

ρ_k = γ_k/γ_0,  k = 1, 2, …  (7.2.2)

7.2.1 Autoregressive Processes

An important class of stochastic processes is the class of autoregressive (AR) processes. A process y_t is called an autoregressive process of order p, abbreviated AR(p), if

y_t = θ_1y_{t−1} + θ_2y_{t−2} + … + θ_py_{t−p} + v_t  (7.2.3)

or

(1 − θ_1L − θ_2L² − … − θ_pL^p)y_t = v_t  (7.2.4)

where L is the lag operator defined by L^iy_t = y_{t−i}.
In (7.2.3) and (7.2.4) v_t is white noise as defined above and we assume σ_v² =
var(v_t). An AR process may or may not be stationary. Consider, for instance,
the special case

y_t = y_{t−1} + v_t  (7.2.5)

Obviously,

y_t = y_{t−2} + v_{t−1} + v_t = y_{t−3} + v_{t−2} + v_{t−1} + v_t = … = y_{t−n} + v_{t−n+1} + … + v_t

for any positive integer n. Since the v_t are uncorrelated the variance of their
sum is the sum of their variances and hence,

var(y_t) = var(y_{t−n}) + nσ_v²
which cannot be finite since nσ_v² → ∞ as n → ∞. This violates a requirement for
stationarity.
It can be shown that (7.2.3) is stationary with mean zero if the solutions of

1 − θ_1z − θ_2z² − … − θ_pz^p = 0  (7.2.7)

are all outside the complex unit circle. That is, if z = z_1 + iz_2 solves (7.2.7), the
modulus |z| = √(z_1² + z_2²) > 1. For instance, for p = 1 and thus y_t = θ_1y_{t−1} + v_t,
the stationarity condition amounts to requiring |θ_1| < 1, since 1 − θ_1z = 0 implies
z = 1/θ_1, and for this to exceed 1 in modulus, θ_1 has to be less than 1 in absolute value.
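As an illustration, the following minimal Python sketch (an addition under stated assumptions, not part of the original text) checks the stationarity condition by computing the roots of the AR polynomial in (7.2.7) numerically; the function name is_stationary and the example coefficient values are hypothetical.

```python
# A small sketch of the check in (7.2.7): an AR(p) process is stationary if
# all roots of 1 - theta_1 z - ... - theta_p z^p lie outside the unit circle.
import numpy as np

def is_stationary(theta):
    # numpy.roots expects coefficients from the highest power downwards:
    # -theta_p z^p - ... - theta_1 z + 1
    coeffs = [-t for t in reversed(theta)] + [1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_stationary([0.5]))        # True:  z = 1/0.5 = 2 lies outside the circle
print(is_stationary([1.0]))        # False: random walk, root on the unit circle
print(is_stationary([0.5, 0.6]))   # False: one root lies inside the circle
```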
Let us now assume that the AR process (7.2.3) fulfills the stationarity condi-
tion. An elegant way to determine the autocorrelations ρ_k of this process is to
left-multiply (7.2.3) by y_{t−k}, take expectations and divide by the variance γ_0 of
y_t. That is,

E[y_{t−k}y_t]/γ_0 = θ_1E[y_{t−k}y_{t−1}]/γ_0 + … + θ_pE[y_{t−k}y_{t−p}]/γ_0

or

ρ_k = θ_1ρ_{k−1} + θ_2ρ_{k−2} + … + θ_pρ_{k−p},  k > 0

These difference equations are known as Yule-Walker equations. The general
solution is

ρ_j = A_1z_1^{−j} + A_2z_2^{−j} + … + A_pz_p^{−j}  for j > 0  (7.2.10)

where the z_i are the roots of (7.2.7) and the A_i are constants; since all roots of a
stationary process exceed one in modulus, ρ_j tapers off as j grows. Furthermore,
for fixed k, solving the system

ρ_j = φ_{k1}ρ_{j−1} + … + φ_{kk}ρ_{j−k}  (j = 1, 2, …, k)  (7.2.11)

yields values φ_{k1}, …, φ_{kk}, the last of which is the kth partial autocorrelation
coefficient. Of course, φ_{kk} = 0 for k > p.
Thus, the autocorrelations of a stationary AR(p) process taper off whereas
the partial autocorrelations have a cutoff point at lag p.
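To make these two patterns concrete, the following minimal Python sketch (an addition, not from the original text) computes the theoretical autocorrelations from the Yule-Walker equations and the partial autocorrelations from (7.2.11); the function names and the example AR(2) coefficients are hypothetical.

```python
import numpy as np

def ar_autocorrelations(theta, kmax):
    """Autocorrelations of a stationary AR(p) via the Yule-Walker
    difference equations rho_k = sum_i theta_i * rho_{k-i}."""
    p = len(theta)
    # Solve the first p equations for rho_1, ..., rho_p, using rho_0 = 1
    # and the symmetry rho_{-k} = rho_k.
    A = np.zeros((p, p))
    b = np.zeros(p)
    for k in range(1, p + 1):
        b[k - 1] = theta[k - 1]            # theta_k * rho_0 moved to the RHS
        for i in range(1, p + 1):
            if i != k:
                A[k - 1, abs(k - i) - 1] += theta[i - 1]
    rho = np.concatenate(([1.0], np.linalg.solve(np.eye(p) - A, b)))
    for k in range(p + 1, kmax + 1):       # recurse for higher lags
        rho = np.append(rho, sum(theta[i] * rho[k - i - 1] for i in range(p)))
    return rho[: kmax + 1]

def partial_autocorrelations(rho, kmax):
    """kth partial autocorrelation = last coefficient phi_kk of the
    order-k Yule-Walker system (7.2.11)."""
    pacf = []
    for k in range(1, kmax + 1):
        R = np.array([[rho[abs(i - j)] for j in range(k)] for i in range(k)])
        pacf.append(np.linalg.solve(R, rho[1 : k + 1])[-1])
    return np.array(pacf)

# Example: AR(2) with theta = (0.5, 0.3); phi_kk is zero for k > 2.
rho = ar_autocorrelations([0.5, 0.3], 10)
print(partial_autocorrelations(rho, 5))
```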
So far we have only discussed finite-order AR processes. It is not difficult to
generalize the concept by allowing the order to be infinite. That is, we may
consider

y_t = θ_1y_{t−1} + θ_2y_{t−2} + … + v_t = Σ_{i=1}^∞ θ_iy_{t−i} + v_t  (7.2.12)

where the infinite sum is defined as a limit in mean square, that is,

E[(Σ_{i=1}^n θ_iy_{t−i} + v_t − y_t)²] → 0  as n → ∞  (7.2.13)

7.2.2 Moving Average Processes

Another basic class of processes is obtained by representing y_t as a finite weighted sum of current and past white noise variables,

y_t = v_t + α_1v_{t−1} + … + α_qv_{t−q}
    = (1 + α_1L + … + α_qL^q)v_t  (7.2.14)
where v_t is again white noise with variance σ_v² < ∞. To indicate the order of this
MA process we use the symbol MA(q) analogous to the AR case. The mean of
the MA(q) process (7.2.14), assuming E[v_t] = 0, is zero, and with α_0 = 1 the autocovariances are

γ_k = E[y_ty_{t+k}] = E[Σ_{i=0}^q Σ_{j=0}^q α_iα_jv_{t−i}v_{t+k−j}] = Σ_{i=0}^q Σ_{j=0}^q α_iα_jE[v_{t−i}v_{t+k−j}]

    = σ_v² Σ_{i=0}^{q−k} α_iα_{i+k}  for k = 0, 1, …, q
    = 0                      for k > q  (7.2.15)

In particular, the variance is

γ_0 = σ_v²(1 + α_1² + α_2² + … + α_q²)  (7.2.16)
Since these quantities are finite for all finite parameter values, a finite-order MA process is always stationary. The order may again be allowed to go to infinity,

y_t = v_t + α_1v_{t−1} + α_2v_{t−2} + …  (7.2.17)

provided the variance

γ_0 = σ_v²(1 + α_1² + α_2² + …)  (7.2.18)

is finite.
A process that can be predicted without error from its own past is called deterministic; for instance, a process y_t = a, where a is a single random variable, takes the same
value for all t, since a is independent of t, and thus y_t is determined for all t. A
process is called nondeterministic if it does not contain any deterministic com-
ponents.
Wold's decomposition theorem implies that a stationary AR process has an
MA representation. For instance, the MA representation of an AR(1) process
y_t = θ_1y_{t−1} + v_t, |θ_1| < 1, is easy to find. Successive substitution for y_{t−1}, y_{t−2}, etc.,
gives

y_t = θ_1(θ_1y_{t−2} + v_{t−1}) + v_t = …
    = v_t + θ_1v_{t−1} + θ_1²v_{t−2} + …
    = (1 + θ_1L + θ_1²L² + …)v_t = [1/(1 − θ_1L)]v_t  (7.2.19)
Conversely, for the MA(1) process y_t = (1 + α_1L)v_t with |α_1| < 1 we have

v_t = (1 + α_1L)^{−1}y_t

or

v_t = y_t − α_1y_{t−1} + α_1²y_{t−2} − …

or

y_t = α_1y_{t−1} − α_1²y_{t−2} + α_1³y_{t−3} − … + v_t
Like AR operators, not all MA operators have an inverse; that is, not every
MA process can be written as an AR process. To guarantee invertibility of an
MA(q) the solutions of the equation

1 + α_1z + α_2z² + … + α_qz^q = 0  (7.2.20)

have to be outside the complex unit circle. Obviously, this invertibility condi-
tion is similar to the stationarity condition for AR processes given in (7.2.7). It
is worth noting that, for any noninvertible MA process for which (7.2.20) has
no roots on the unit circle, there exists an equivalent invertible MA process
7.2.3 ARIMA Processes

Combining the AR and MA classes gives the class of autoregressive moving average (ARMA) processes,

θ_p(L)y_t = α_q(L)v_t  (7.2.21)

In practice many time series become well described by a stationary ARMA process only after differencing d times, so that the original series satisfies

θ_p(L)(1 − L)^d y_t = α_q(L)v_t  (7.2.25)

where v_t is as usual zero mean white noise. Such a process is called an autore-
gressive-integrated-moving average process or briefly ARIMA(p,d,q) if θ_p(L)
and α_q(L) have degree p and q, respectively.
7.2.4 Seasonal ARIMA Processes

If the data show a strong seasonal pattern, this indicates a high correlation
between values observed during the same season in consecutive years. For
example, for monthly data the random variables attached to the January figures of
consecutive years may have a high correlation whereas the June figures may
not be correlated with the January figures. In such a case, modeling the data
generation process by an ARIMA process of the type considered above may
require an unnecessarily large number of parameters. To cope with this prob-
lem Box and Jenkins (1976) suggest using multiplicative seasonal ARIMA
models. For a series with seasonal period s, their general form is

θ_p(L)Θ_P(L^s)(1 − L)^d(1 − L^s)^D y_t = α_q(L)A_Q(L^s)v_t  (7.2.26)

where Θ_P(·) and A_Q(·) are polynomials of degree P and Q, respectively.

Two remarks on the limitations of these model classes are in order. First, all of the processes considered so far are linear. Nonlinear time series models
have also been considered in the context of economic data analysis. References
on nonlinear time series models include Granger and Andersen (1978), Engle
(1982), Nicholls and Quinn (1982), and Priestley (1980, 1981, Chapter 11). The
latter contains a number of further references on the topic.
Second, the only nonstationarities we have allowed for arise from unit roots
in the AR operator. This is clearly restrictive since all roots in and on the
complex unit circle cause nonstationarities. O. D. Anderson and de Gooijer
(1982) consider AR processes with roots on the unit circle that are not equal to
one. Other nonstationarities can arise from structural changes at a particular
point in time resulting in a different covariance structure before and after that
time point. Alternatively, the covariance structure may evolve smoothly
through time or be subject to stochastic changes. Publications on nonstationary
time series models include Granger and Newbold (1977, Section 9.4) and Priestley
(1965, 1981, Chapter 11).
7.3 FORECASTING

If the generation process of a time series is known, this process can be used to
predict future values of the variable of interest. Usually an attempt is made to
derive forecasts that are optimal in some sense. Therefore, a loss function is set
up that depends on the forecasting error that we make in predicting future
values of the variable of interest. A fairly common loss function is the expected
squared forecasting error.
More formally, the objective is to investigate the properties of the random
variables y_{T+h} when realizations of the y_t are only available for t ≤ T. Let us
denote the forecast h time periods ahead by y_T(h). From the above discussion it
follows that the minimization of the expected quadratic forecasting error
E[(y_{T+h} − y_T(h))²] is a meaningful objective. Note that a quadratic cost function
implies that the costs are equal for over- and understating the future values of
the random variable by the same amount. This is certainly not always realistic.
However, in many situations of practical relevance, the mean squared fore-
casting error can still be justified as a cost function. For a more detailed discus-
sion concerning the choice of a cost function the reader is referred to Granger
and Newbold (1977, Section 4.2). In the following we will give forecasting
formulas for ARIMA processes.
Let us assume that y_t is a stationary ARMA process with MA representation

y_t = v_t + α_1v_{t−1} + α_2v_{t−2} + … = Σ_{i=0}^∞ α_iv_{t−i},  α_0 = 1

As mentioned above, the objective is to find y_T(h) = f(y_T, y_{T−1}, …) such that

E[(y_{T+h} − y_T(h))²] ≤ E[(y_{T+h} − a)²]

for any other forecast a. For practical reasons we will require f to be a linear function and,
rather than using the information set {y_t | t ≤ T}, we can equivalently use {v_t | t ≤
T} since y_t is assumed stationary. The equivalence of the two information sets
relies on the assumption that they consist of an infinite number of past values so
that, for example, for an AR(1) process y_t = θ_1y_{t−1} + v_t, the tth value can be
computed as v_t = y_t − θ_1y_{t−1}.
Of course, in practice only finite samples are available. However, if the sample
size is sufficiently large, the assumption of an infinite information set is of little
consequence as long as the MA operator has no roots close to the unit circle.
For a more detailed discussion see Fuller (1976, Section 2.9).
Suppose that ỹ_T(h) is any linear h-step forecast based on {v_t | t ≤ T}, say

ỹ_T(h) = Σ_{i=0}^∞ c_iv_{T−i}  (7.3.3)

Its expected squared forecasting error is

E[(y_{T+h} − ỹ_T(h))²] = σ_v²[Σ_{i=0}^{h−1} α_i² + Σ_{i=0}^∞ (α_{h+i} − c_i)²]  (7.3.4)

which is minimized by choosing c_i = α_{h+i}. Thus the optimal linear forecast is

y_T(h) = Σ_{i=0}^∞ α_{h+i}v_{T−i}  (7.3.5)

with mean squared forecasting error

E[(y_{T+h} − y_T(h))²] = σ_v² Σ_{i=0}^{h−1} α_i²  (7.3.6)
If the process is Gaussian, the optimal linear forecast y_T(h) coincides with the conditional expectation E[y_{T+h} | y_t, t ≤ T],
which implies that it has minimum mean squared error from within the class of
all predictors, not just those that are linear. However, this result does not hold
in general. Fuller (1976, Chapter 2) gives the following example of an AR(1)
process y_t = θy_{t−1} + v_t, where |θ| < 1,

v_t = e_t,                 t = 0, ±2, ±4, …
v_t = (e_{t−1}² − 1)/√2,   t = ±1, ±3, …

and the e_t are independent N(0, 1) random variables. It is easy to see that v_t is
uncorrelated but not independent white noise. The optimal linear one-step
forecast of y_{T+1} is y_T(1) = θy_T [see (7.3.5)]. However, assuming T is even, a
better forecast is

ỹ_T(1) = θy_T + (v_T² − 1)/√2  (7.3.7)
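The following minimal Python sketch (an addition, not from the original text) evaluates the forecasting formulas (7.3.5) and (7.3.6) for the AR(1) special case, in which α_i = θ^i; the function name and the numerical values are hypothetical.

```python
# h-step forecasts for the AR(1) process y_t = theta*y_{t-1} + v_t:
# forecast = theta^h * y_T, MSE = sigma_v^2 * (1 + theta^2 + ... + theta^{2(h-1)}).
import numpy as np

def ar1_forecast(y_T, theta, sigma_v2, h):
    forecast = theta**h * y_T
    mse = sigma_v2 * sum(theta ** (2 * i) for i in range(h))
    return forecast, mse

y_T, theta, sigma_v2 = 1.5, 0.8, 1.0
for h in (1, 2, 5):
    f, m = ar1_forecast(y_T, theta, sigma_v2, h)
    print(f"h={h}: forecast={f:.3f}, MSE={m:.3f}")
```

Note that the forecast decays toward the process mean of zero as h grows, while the MSE approaches the process variance, as the formulas imply.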
7.4 ESTIMATION OF ARMA PROCESSES

Usually the orders as well as the parameters of an ARMA process are unknown
and have to be determined from the available data. In this section we assume
that the orders are known so that only the parameters need to be estimated. In
the next chapter various special processes will be assumed to represent the
error term of a regression model, and for these processes the estimation
problem is treated in detail in the more general framework of regression models
with autocorrelated errors. Therefore, we will only give a brief account of
estimating ARMA processes in this section.
Assuming that the y_t are generated by a stationary Gaussian ARMA(p,q)
process (7.2.21), the likelihood function can be shown to be of the form

ℓ = h(θ, α, σ_v²) exp{−(1/2σ_v²) Σ_{t=−∞}^T Ê[v_t]²}  (7.4.1)

[e.g., Newbold (1974), Granger and Newbold (1977)], where h(·) is a function of
the parameters that does not depend on the data and Ê[v_t] is the conditional
expected value of v_t given the sample y_1, y_2, …, y_T. The function h(·) is
negligible for large T, and thus minimizing the sum of squares

S = Σ_{t=−∞}^T Ê[v_t]²  (7.4.2)

gives estimates with the same asymptotic properties as the ML estimates. For
the pure AR(p) process

y_t = θ_1y_{t−1} + … + θ_py_{t−p} + v_t  (7.4.3)

we get

Ê[v_t] = y_t − θ_1y_{t−1} − … − θ_py_{t−p}  (7.4.4)

for T ≥ t > p. Replacing all presample values by their expected value of zero,
the conditional sum of squares given these presample values becomes

S_c = Σ_{t=1}^T (y_t − θ_1y_{t−1} − … − θ_py_{t−p})²  (7.4.5)

Alternatively, the summation may be started at t = p + 1,

S_c = Σ_{t=p+1}^T (y_t − θ_1y_{t−1} − … − θ_py_{t−p})²  (7.4.6)
That is, y_1, y_2, …, y_p are regarded as presample values. Both (7.4.5) and
(7.4.6) allow us to use a linear least squares routine for estimating a pure AR
model if only large sample properties of the estimators are of interest.
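As an illustration of this point, here is a minimal Python sketch (an addition, not from the original text) of conditional LS in the spirit of (7.4.6): y_1, …, y_p are treated as presample values and the θ_i are obtained from a regression of y_t on its own lags; the function name and the simulated AR(1) example are hypothetical.

```python
import numpy as np

def fit_ar_cls(y, p):
    """Conditional LS for an AR(p): regress y_t on y_{t-1}, ..., y_{t-p}."""
    Y = y[p:]                                                     # y_{p+1}, ..., y_T
    X = np.column_stack([y[p - i: -i] for i in range(1, p + 1)])  # lags 1..p
    theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ theta
    sigma2 = resid @ resid / len(Y)                               # residual variance
    return theta, sigma2

rng = np.random.default_rng(1)
y = np.zeros(200)
for t in range(1, 200):                   # simulate an AR(1) with theta = 0.7
    y[t] = 0.7 * y[t - 1] + rng.standard_normal()
theta_hat, s2 = fit_ar_cls(y, 1)
print(theta_hat, s2)
```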
If the mean μ of the y_t is nonzero, then the sample mean can be subtracted
from the data before estimation of the remaining parameters. Alternatively, the
y_t's in (7.4.3) can be replaced by y_t − μ, giving a regression equation

y_t = ν + θ_1y_{t−1} + … + θ_py_{t−p} + v_t  (7.4.7)

where

ν = μ(1 − θ_1 − … − θ_p)  (7.4.8)

For MA processes the estimation problem is computationally more demanding. Consider the MA(1) process

y_t = v_t + α_1v_{t−1}  (7.4.9)

For an invertible process, |α_1| < 1, the white noise variables can be written in terms of current and past observations and the initial condition v_0,

v_t = y_t − α_1y_{t−1} + α_1²y_{t−2} − … + (−α_1)^{t−1}y_1 + (−α_1)^t v_0  (7.4.10)

Replacing all unknown initial conditions in (7.4.10) by zero we find the condi-
tional LS estimate of α_1 by minimizing

S_c = Σ_{t=1}^T [y_t − α_1y_{t−1} + α_1²y_{t−2} − … + (−α_1)^{t−1}y_1]²  (7.4.11)

which, in contrast to the AR case, is a nonlinear function of the parameter.
Initial estimates for the nonlinear optimization can be obtained from the pure AR representation. For example, for an ARMA(1,1) process y_t = θ_1y_{t−1} + v_t + α_1v_{t−1},

(1 + φ_1L + φ_2L² + …)y_t = [(1 − θ_1L)/(1 + α_1L)]y_t
    = (1 + (−θ_1 − α_1)L + (α_1² + θ_1α_1)L² + …)y_t = v_t

so that φ_1 = −θ_1 − α_1 and φ_2 = α_1² + θ_1α_1. These two equations can be solved
for α_1 and θ_1 and thus, replacing φ_1 and φ_2 by their initial estimates, we get
initial estimates for α_1 and θ_1.
For a stationary, invertible Gaussian process the ML estimates are consis-
tent, asymptotically efficient, and normally distributed. The inverse of the infor-
mation matrix

I(υ) = −E[∂² ln ℓ/∂υ∂υ']  (7.4.12)

where υ denotes the vector of all parameters, provides the asymptotic covariance matrix of the ML estimators.
7.5 SPECIFICATION
OF ARIMA PROCESSES
(7.5.2)
3. For an AR(p) the partial autocorrelations φ_kk = 0 for k > p and the auto-
correlations taper off.
4. If neither the autocorrelations nor the partial autocorrelations have a cutoff
point, an ARMA model may be adequate. The AR and the MA degree have
to be inferred from the particular pattern of the autocorrelations and partial
autocorrelations.
Also, if a seasonal ARIMA model is adequate this has to be inferred from the
autocorrelations and partial autocorrelations. However, the specification of a
tentative ARIMA model by visually inspecting the estimates of these quantities
requires some experience and a more detailed exposition and examples can be
found in Box and Jenkins (1976), O. D. Anderson (1976), Judge et al. (1982,
Chapter 25), and Nelson (1973).
Once the orders of the tentative model are specified, its parameters can be
estimated as discussed in Section 7.4. Finally, the adequacy of the model may
be checked, for example, by analyzing the residuals or by overfitting the
obtained model. For an overview of possible tests for model adequacy in the
present context see Newbold (1981a). If the model is rejected at the checking
stage, the model building cycle has to be restarted with a new identification
round.
The Box-Jenkins procedure involves a subjective element at the identifica-
tion stage that can be advantageous since it permits nonsample information to
be taken into account that may exclude a range of models for a particular time
series. On the other hand, this feature makes the procedure relatively difficult
to apply and there has been a search for methods that allow an automatic fitting
of ARIMA models or at least aid in the choice of an adequate process for a
given time series. The corner method [Beguin, Gourieroux, and Monfort
(1980)] and the S-array and R-array procedures [Gray, Kelley, and McIntire
(1978), Woodward and Gray (1981)] belong to this class. In the following sec-
tion we present some other methods for the special case of fitting pure AR
models. Most of the methods have been extended to ARMA model building.
7.5.2 AR Order Selection

Suppose the process of interest is the AR process

y_t = θ_1y_{t−1} + … + θ_py_{t−p} + v_t  (7.5.4)

where v_t is white noise as before and it is assumed that some upper bound for p,
say m, is given for a particular time series under study. In the following sections
we present some methods and criteria for estimating p. Some of the criteria are
listed in Table 7.2 and are further discussed in Chapter 21 in the context of
selecting the regressors for a linear model.
TABLE 7.2 CRITERIA FOR ESTIMATING THE ORDER OF AN AR PROCESS

FPE(k) = [(T + k)/(T − k)]σ̂_k²

FPE^β(k) = {[1 + k/T^β]/[1 − k/T]}σ̂_k²,  0 < β < 1

FPE_α(k) = (1 + αk/T)σ̂_k²,  α > 0

AIC(k) = ln σ̂_k² + 2k/T

SC(k) = ln σ̂_k² + k ln T/T

CAT(k) = (1/T) Σ_{j=1}^k σ̄_j^{−2} − σ̄_k^{−2},  where σ̄_j² = [T/(T − j)]σ̂_j²

Φ(k) = ln σ̂_k² + 2kc_T ln ln T/T,  lim sup_{T→∞} c_T > 1

ST(k) = (T + 2k)σ̂_k²
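To show how such criteria are used in practice, the following minimal Python sketch (an addition, not from the original text) fits AR(k) models by conditional LS for k = 1, …, m and selects the order minimizing AIC and SC; the function name is hypothetical and σ̂_k² is approximated by the residual sum of squares divided by T.

```python
import numpy as np

def select_ar_order(y, m):
    """Return the AR orders chosen by AIC and SC from Table 7.2."""
    T = len(y)
    aic, sc = {}, {}
    for k in range(1, m + 1):
        Y = y[k:]
        X = np.column_stack([y[k - i: -i] for i in range(1, k + 1)])
        theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        resid = Y - X @ theta
        sigma2 = resid @ resid / T          # approximate ML residual variance
        aic[k] = np.log(sigma2) + 2 * k / T
        sc[k] = np.log(sigma2) + k * np.log(T) / T
    return min(aic, key=aic.get), min(sc, key=sc.get)

rng = np.random.default_rng(2)
y = np.zeros(300)
for t in range(2, 300):                     # simulate an AR(2)
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.standard_normal()
print(select_ar_order(y, 6))                # both should usually pick p = 2
```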
7.5.2a Sequential Testing

Given the upper bound m for the AR order, a sequence of hypotheses

H_1: θ_m = 0
H_2: θ_m = θ_{m−1} = 0
etc.  (7.5.5)

can be tested. The likelihood ratio test statistic

λ_i = T(ln σ̂_{m−i}² − ln σ̂_m²)  (7.5.6)

which is asymptotically χ²(i) distributed, can be used to test H_i. Here σ̂_k²
is the ML estimate of the residual variance when an AR(k) is fitted to the data, that is,

σ̂_k² = (1/T)(sum of squared errors of an AR(k))  (7.5.7)
Alternatively, the statistic

Q_i = T Σ_{k=m−i+1}^m φ̂_kk²  (7.5.8)

which is asymptotically χ²(i) distributed under the null hypothesis [T. W. Ander-
son (1971, p. 221)], could be used. Here φ̂_kk is an estimate of the kth partial
autocorrelation computed as the LS estimate of the kth coefficient of an AR(k)
model.
If any of the hypotheses in (7.5.5) is wrong, then all succeeding ones are also
false, and the testing procedure terminates if one of the hypotheses is rejected.
The estimated value of p will depend on the significance level chosen. There-
fore it is important to note that the Type I error of the kth test in a sequence of
tests will differ from the significance level of the kth individual test. Following
T. W. Anderson (1971, Section 3.2.2) the kth test in the sequence has a Type I
error

ε_k = 1 − (1 − η_1)(1 − η_2) ⋯ (1 − η_k)  if k = 2, …, m  (7.5.9)

where η_k is the significance level of the kth individual test. For further discus-
sion of sequential testing see also Hannan (1970).
Note that using a testing procedure to determine the AR order implies that
the resulting parameter estimators are actually pretest estimators. We pointed
out in Chapter 3 that the sampling properties of pretest estimators are unknown
in general. However, in the present case, asymptotic properties are all we can
derive anyway, and the asymptotic properties of the parameter estimators are
not affected if the AR order is estimated consistently. Therefore, it is of interest
that Pötscher (1982) gives conditions for consistent estimation of the AR order
using Lagrange multiplier tests. In fact, Pötscher discusses testing procedures
in the more general context of ARMA model fitting.
7.5.2b FPE

Since forecasting is the ultimate objective of many time series analyses it seems
natural to choose p to minimize the prediction error. Based on this idea, Akaike
(1969, 1970a) developed his final prediction error (FPE) criterion given by

FPE(k) = [(T + k)/(T − k)]σ̂_k²  (7.5.10)

The order estimate is the value of k that minimizes this criterion. Modifications include

FPE^β(k) = {[1 + k/T^β]/[1 − k/T]}σ̂_k²,  0 < β < 1  (7.5.12)
and the criterion of McClave (1975) and Bhansali and Downham (1977), who investigate

FPE_α(k) = (1 + αk/T)σ̂_k²,  α > 0  (7.5.13)

Increasing α reduces the probability of fitting too high an order. For α < 1 the
asymptotic probability of overfitting is substantial. The two criteria (7.5.12) and
(7.5.13) are closely related.
An extension of the FPE criterion to the case of ARMA model fitting is given
by Howrey (1978).
7.5.2c AIC

Another criterion proposed by Akaike (1973, 1974a) for AR order estimation is
based on an extension of the maximum likelihood principle. Suppose we have a
random sample y_1, …, y_T from a population characterized by a pdf g(y) that
is known to be contained in a class of densities {f(y|θ)}. As a measure of the
distance between g(·) and a member f(·|θ) of this class consider the Kullback-Leibler information

I = ∫ ln[g(y)/f(y|θ)]g(y) dy

This quantity is greater than zero unless g(y) = f(y|θ) almost everywhere.
Thus, minimizing I is a meaningful objective. Obviously, this is a generalization
of the ML principle, where the class of densities is constrained to a set of the
type {f(y|θ) | θ ∈ ℝⁿ}, that is, the considered densities differ only through their
parameter vector and not its dimension.
instead of (7.5.17) is reasonable. This result carries over to the more general
case where the sample consists of realizations of an AR process. Assuming a
Gaussian process, Akaike's information criterion (AIC) reduces to

AIC(k) = ln σ̂_k² + 2k/T  (7.5.19)
7.5.2d SC

Using Bayesian arguments, Schwarz (1978) derives yet another criterion. As-
suming an a priori probability for the true order being k, for k ≤ m, and an a priori
conditional distribution of the parameters given that the AR(k) is the true model,
Schwarz suggests choosing the a posteriori most probable model. This proce-
dure leads to minimizing the order estimation criterion

SC(k) = ln σ̂_k² + k ln T/T  (7.5.22)

in order to determine the order of an AR process. Later we will see that the
resulting estimator for the AR order is consistent. For a further discussion of
this criterion see Chapter 21.
7.5.2e CAT

Another criterion for estimating the order of an AR process was suggested by
Parzen (1974). It is assumed that the process of interest is as given in (7.5.4)
with possibly infinite order p and interest centers on the transfer function

θ(z) = 1 − θ_1z − θ_2z² − …  (7.5.23)

In this situation, Parzen proposes to choose the AR order p for a given sample
of size T such that the overall mean square error of θ̂_p(z) as an estimate of θ(z) is
minimized. As an estimate of the MSE he suggests the criterion autoregressive
transfer function (CAT)

CAT(k) = (1/T) Σ_{j=1}^k σ̄_j^{−2} − σ̄_k^{−2}  (7.5.24)

where

σ̄_j² = [T/(T − j)]σ̂_j²  (7.5.25)
A further criterion is

Φ(k) = ln σ̂_k² + 2kc_T ln ln T/T,  lim sup_{T→∞} c_T > 1  (7.5.27)

Finally, suppose the process of interest is an AR process of infinite order,

y_t = Σ_{i=1}^∞ θ_iy_{t−i} + v_t  (7.5.29)

and the order of the fitted process is allowed to grow with the sample size at a rate m_T satisfying

m_T → ∞,  m_T/T → 0  as T → ∞  (7.5.30)

Shibata (1980) shows that in this situation estimating the AR order p by minimizing
ST(k) = (T + 2k)σ̂_k² over k ≤ m_T has desirable asymptotic prediction properties.
7.6 SPECTRAL ANALYSIS

Using the models introduced in Section 7.2 to represent the generation process
of a given time series can be regarded as an attempt to summarize the autoco-
variance structure of the data generation process in a convenient form. In the
following we discuss an alternative way to summarize the autocovariance char-
acteristics of a stochastic process. The spectrum or spectral density will be
introduced for that purpose and we will try to provide an intuitive basis for
some of the concepts underlying spectral analysis. A similar exposition can be
found, for example, in Granger and Newbold (1977) and Harvey (1981). Spectral
methods can be used in estimating regression models with autocorrelated dis-
turbances as discussed in Chapter 8 and will also be useful in estimating distrib-
uted lag relationships in Chapter 10. Furthermore, spectral methods are used
when multiple time series are discussed in Chapter 16.
In Figure 7.1 the graph of a time series is depicted that might well appear to be
based on some real-life data, yet it is generated as a sum of m terms of the
general form

a_j cos(ω_jt + h_j)  (7.6.1)
This term represents a periodic function with amplitude a_j, frequency ω_j (num-
ber of cycles per time unit, measured in radians), phase h_j, and wavelength 2π/ω_j.
Suppose that all the a_j, j = 1, …, m, are independent random variables with
zero mean, E[a_j] = 0, and the h_j, j = 1, …, m, are independent random
variables uniformly distributed over the interval (−π, π). Furthermore, we as-
sume independence of a_k and h_j for all k, j. The h_j are confined to the interval
(−π, π) since cos(ω_jt + 2π + h_j) = cos(ω_jt + h_j). Also, we may assume that ω_j is
in the interval (−π, π) since cos[(ω_j + 2π)t + h_j] = cos(ω_jt + h_j) for integer t.
To investigate the properties of stochastic processes consisting of cosine
terms of the form (7.6.1) we begin with the simplest case

y_t = a cos(ωt + h)  (7.6.2)
[Figure 7.1 Graph of a generated time series.]
The mean of this process is zero, since a and h are independent and E[a] = 0, and the autocovariances are

γ_k = E[y_ty_{t+k}] = E[a²]E[cos(ωt + h)cos(ω(t + k) + h)]
    = σ_a² E[½ cos(2ωt + ωk + 2h) + ½ cos ωk]
    = σ_a² [∫_{−π}^{π} (1/2π)·½ cos(2ωt + ωk + 2h) dh + ½ cos ωk ∫_{−π}^{π} (1/2π) dh]
    = (σ_a²/2) cos ωk  (7.6.4)

since the first integral is zero as we are integrating over two complete cosine waves. In
(7.6.4), σ_a² = var(a). For k = 0 we get the variance of y_t,

σ_y² = σ_a²/2  (7.6.5)
More generally, consider a sum of m such terms,

y_t = Σ_{j=1}^m a_j cos(ω_jt + h_j)  (7.6.6)

Using the identity cos(ω_jt + h_j) = cos(ω_jt)cos h_j − sin(ω_jt)sin h_j, this process can be written as

y_t = Σ_{j=1}^m [cos(ω_jt)a_j cos h_j − sin(ω_jt)a_j sin h_j]
    = Σ_{j=1}^m cos(ω_jt)u_j − Σ_{j=1}^m sin(ω_jt)v_j  (7.6.8)
where u_j = a_j cos h_j and v_j = a_j sin h_j are random variables with the following
properties:

E[u_j] = E[v_j] = 0  and  var(u_j) = var(v_j) = σ_j²/2  (7.6.9a)
E[u_ju_k] = 0  for j ≠ k  (7.6.9b)
E[v_jv_k] = 0  for j ≠ k  (7.6.9c)
E[u_jv_k] = 0  for all j, k  (7.6.9d)

where σ_j² = var(a_j). Properties (7.6.9b) and (7.6.9c) follow since a_j and a_k are independent for j ≠ k,
and (7.6.9d) follows by the same argument if j ≠ k and, for j = k, by noting that

E[u_jv_j] = E[a_j²]E[cos h_j sin h_j] = σ_j² ∫_{−π}^{π} (1/2π) cos h sin h dh
         = (σ_j²/2π)[½ sin²h]_{−π}^{π} = 0  (7.6.10)
The generalization of (7.6.8) to an infinite number of frequencies ω ∈ [−π, π] is

y_t = ∫_{−π}^{π} cos ωt du(ω) − ∫_{−π}^{π} sin ωt dv(ω)  (7.6.11)

where the random variables du(ω) and dv(ω) have zero mean and satisfy

E[du(ω)du(λ)] = 0,  if ω ≠ λ
E[dv(ω)dv(λ)] = 0,  if ω ≠ λ
E[du(ω)dv(λ)] = 0,  for all ω, λ  (7.6.12)

It can be shown that any discrete stationary stochastic process can be repre-
sented in the form (7.6.11). Since this process potentially involves an infinite
number of random variables, namely du(ω), dv(ω), ω ∈ [−π, π], knowledge of
the past values of the process is in general not sufficient to predict future values
without error. Thus, the process y_t in (7.6.11) is in general not deterministic. In
fact, it may be purely nondeterministic and can have the same covariance
structure as the AR, MA, and ARMA processes considered in Section 7.2.
Before we investigate this possibility we will introduce a notational simplifi-
cation. Defining dz(ω) = du(ω) + i dv(ω) and noting that cos ωt + i sin ωt = e^{iωt},
(7.6.11) can be written as

y_t = ∫_{−π}^{π} e^{iωt} dz(ω)  (7.6.13)

where, in analogy to (7.6.12),

E[dz(ω)dz̄(λ)] = 0,  if ω ≠ λ  (7.6.14)

where the overbar denotes complex conjugation. The last term is, of course,
the covariance of dz(ω) and dz(λ) since for complex random variables x, y the
covariance is defined to be E[(x − E[x])(ȳ − E[ȳ])].
This notation makes it easy to evaluate the autocovariances of y_t. Of course, using (7.6.14),

γ_k = E[y_ty_{t+k}] = ∫_{−π}^{π} e^{iωk} E[dz(ω)dz̄(ω)]  (7.6.16)
If the autocovariances are absolutely summable,

Σ_{k=−∞}^{∞} |γ_k| < ∞  (7.6.17)

there exists a complex valued function f(ω) defined on the interval [−π, π] such
that

f(ω) = (1/2π) Σ_{k=−∞}^{∞} γ_k e^{−iωk}  (7.6.18)

and

γ_k = ∫_{−π}^{π} e^{iωk} f(ω) dω  (7.6.20)

Thus, if the γ_k satisfy the condition (7.6.17) (for instance, if y_t is a stationary
ARMA process) then there exists a function h(ω), given by (7.6.18), such that

γ_k = ∫_{−π}^{π} e^{iωk} h(ω) dω

and hence,

γ_0 = var(y_t) = ∫_{−π}^{π} h(ω) dω

This formula shows that h(ω)dω can be interpreted as the contribution of the
frequencies (ω, ω + dω) to the total variance of the process.
It follows from the symmetry of the autocovariances (γ_k = γ_{−k}) that h(ω) is a
real valued function since, by (7.6.18),

h(ω) = (1/2π) Σ_{k=−∞}^{∞} γ_k e^{−iωk} = (1/2π)(γ_0 + 2 Σ_{k=1}^{∞} γ_k cos ωk)  (7.6.24)
As a first example consider a white noise process v_t. All autocovariances other than γ_0 are zero, so that the spectral density is constant,

f_v(ω) = (1/2π)γ_0 = σ_v²/2π  (7.6.25)

As another example consider the MA(1) process y_t = v_t + αv_{t−1} with σ_v² = 1, for which

γ_k = 1 + α²  if k = 0
    = α       if k = ±1
    = 0       otherwise  (7.6.26)

Consequently, by (7.6.24),

f_y(ω) = (1/2π)(1 + α² + 2α cos ω)  (7.6.27)
Next consider the AR(1) process

y_t = θy_{t−1} + v_t  (7.6.28)

with autocorrelations ρ_k = θ^k, so that

γ_k = θ^kγ_0 = θ^kσ_y²,  k = 0, 1, 2, …  (7.6.29)

[Figure 7.2 Spectral densities of MA(1) processes with α = .5 and α = −.5.]

and thus,

f_y(ω) = (σ_y²/2π)(1 + Σ_{k=1}^{∞} θ^ke^{−iωk} + Σ_{k=1}^{∞} θ^ke^{iωk})

       = (σ_y²/2π)[1 + θe^{−iω}/(1 − θe^{−iω}) + θe^{iω}/(1 − θe^{iω})]

       = (σ_y²/2π)(1 − θ²)/(1 + θ² − 2θ cos ω)  (7.6.30)

This function is depicted in Figure 7.3 for θ = .5 and θ = −.5. The following
result provides a convenient tool for deriving the spectral density of more
general stochastic processes.
Suppose

y_t = a(L)x_t  (7.6.31)

where a(L) = Σ_j a_jL^j is a lag polynomial with absolutely summable coefficients
and x_t is a stationary process with spectral density f_x(ω). Then the spectral
density of y_t is

f_y(ω) = |a(e^{−iω})|²f_x(ω) = a(e^{−iω})a(e^{iω})f_x(ω)  (7.6.32)
[Figure 7.3 Spectral densities of AR(1) processes with θ = .5 and θ = −.5.]
Since the spectral density of a white noise process v_t is f_v(ω) = σ_v²/2π, this result
implies that the spectrum of a stationary ARMA(p,q) process θ(L)y_t = α(L)v_t, or
y_t = θ^{−1}(L)α(L)v_t, is

f_y(ω) = (σ_v²/2π)|α(e^{−iω})|²/|θ(e^{−iω})|²  (7.6.33)
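The following minimal Python sketch (an addition, not from the original text) evaluates (7.6.33) on a grid of frequencies; the function name arma_spectrum and the example parameter values are hypothetical.

```python
# Spectral density of a stationary ARMA(p,q) process per (7.6.33), with
# theta(L) = 1 - theta_1 L - ... - theta_p L^p and
# alpha(L) = 1 + alpha_1 L + ... + alpha_q L^q.
import numpy as np

def arma_spectrum(theta, alpha, sigma_v2, omegas):
    z = np.exp(-1j * omegas)
    th = 1 - sum(t * z ** (i + 1) for i, t in enumerate(theta))
    al = 1 + sum(a * z ** (i + 1) for i, a in enumerate(alpha))
    return sigma_v2 / (2 * np.pi) * np.abs(al) ** 2 / np.abs(th) ** 2

omegas = np.linspace(-np.pi, np.pi, 201)
f = arma_spectrum([0.5], [], 1.0, omegas)   # AR(1) with theta = .5
print(f.max(), f[np.abs(omegas).argmin()])  # peak at omega = 0 when theta > 0
```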
These examples suggest possible uses of the spectrum. For instance, if the
residuals of an econometric model or a time series model are hypothesized to
be white noise, the spectrum of the residual process should be flat and thus can
be used to check the white-noise hypothesis. Furthermore, in Section 7.5 we
have mentioned that the specification of an ARMA process can be quite tricky,
since the orders have to be determined from the data. Thus, it may be difficult
in practice to represent the correlation structure of a data generating process by
an ARMA model. It will be shown in the following that the spectrum can be
estimated without a preliminary specification of a particular process.
An obvious estimator of the spectral density is obtained by replacing the autocovariances in (7.6.24) by estimates,

ĥ(ω) = (1/2π)(γ̂_0 + 2 Σ_{k=1}^{∞} γ̂_k cos ωk)  (7.6.34)
Having only T observations y_1, …, y_T we can only estimate γ_0, …, γ_{T−1}.
As estimates one can use either

c_k = (1/T) Σ_{t=1}^{T−k} (y_t − ȳ)(y_{t+k} − ȳ)  (7.6.35)

or

c̃_k = [1/(T − k)] Σ_{t=1}^{T−k} (y_t − ȳ)(y_{t+k} − ȳ)  (7.6.36)

where ȳ is the sample mean. Using, for instance, the c_k, the resulting estimator is

ĥ(ω) = (1/2π)(c_0 + 2 Σ_{k=1}^{T−1} c_k cos ωk)  (7.6.37)
Unfortunately, this estimate is not consistent as its variance does not shrink
to zero with growing sample size. This is essentially a consequence of estimat-
ing too many parameters from the given data. For example, c_{T−1} will always be
computed using only two observations regardless of the size of T.
To overcome this problem various estimators of the spectral density have
been proposed. Their general form is

ĥ(ω) = (1/2π)(λ_0c_0 + 2 Σ_{k=1}^{M_T} λ_kc_k cos ωk)  (7.6.38)

where the sequence λ_0, λ_1, …, λ_{M_T} is called a lag window and M_T < T is the
truncation point. For example, for the Parzen window

λ_k = 1 − 6(k/M_T)² + 6(k/M_T)³  for 0 ≤ k ≤ M_T/2
    = 2(1 − k/M_T)³              for M_T/2 < k ≤ M_T  (7.6.39)

This sequence implies that the weight given to c_k decreases with increasing k,
and thus the estimates based on fewer observations are less important in the
sum in (7.6.38). The estimator based on the Parzen window is consis-
tent provided M_T is chosen such that M_T → ∞ and M_T/T → 0 as T → ∞. Various
other "windows" are suggested in the literature. Many of them are discussed in
Hannan (1970, Chapter 5, Section 4), Priestley (1981, Section 6.2), and Neave
(1972). Estimated spectral densities of a number of time series are depicted in
Nerlove (1964, 1972), and Granger (1966) discusses the "typical spectral shape"
of economic time series.
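To fix ideas, here is a minimal Python sketch (an addition, not from the original text) of the windowed estimator (7.6.38) with the Parzen weights (7.6.39), applied to simulated white noise, for which the estimated spectrum should be approximately flat; the function names are hypothetical.

```python
import numpy as np

def parzen_weight(k, M):
    u = k / M
    if u <= 0.5:
        return 1 - 6 * u**2 + 6 * u**3
    return 2 * (1 - u) ** 3

def spectral_estimate(y, M, omegas):
    T = len(y)
    ybar = y.mean()
    # autocovariance estimates c_k as in (7.6.35)
    c = [np.sum((y[: T - k] - ybar) * (y[k:] - ybar)) / T for k in range(M + 1)]
    lam = [parzen_weight(k, M) for k in range(M + 1)]
    return np.array([
        (lam[0] * c[0]
         + 2 * sum(lam[k] * c[k] * np.cos(w * k) for k in range(1, M + 1)))
        / (2 * np.pi)
        for w in omegas
    ])

rng = np.random.default_rng(3)
v = rng.standard_normal(400)               # white noise
omegas = np.linspace(0, np.pi, 50)
print(spectral_estimate(v, M=20, omegas=omegas).round(3))
```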
So far, we have presented the spectrum as a tool to gain knowledge about
certain features of a stochastic process, more precisely, about its frequency
structure. For many economic time series, however, it may be difficult to interpret the
estimated spectrum in such terms.
7.7 DATA TRANSFORMATIONS

Time series are often subjected to transformations before being analyzed. Usu-
ally transformations or adjustments are carried out if the investigator feels that
a particular model for the data generation process is more appropriately fitted
after the data have been transformed in a certain way. In the following para-
graphs, we will first briefly introduce the Box-Cox transformations and then
discuss trend removal and seasonal adjustment.

7.7.1 The Box-Cox Transformation

The Box-Cox class of transformations replaces y_t by

y_t^{(λ)} = (y_t^λ − 1)/λ  if λ ≠ 0
        = ln y_t         if λ = 0  (7.7.1)
where λ is a number between zero and one. Transformations of this type are
often used to stabilize the variance of a time series. The boundary cases corre-
spond to a logarithmic transformation (λ = 0) and a simple mean reduction
(λ = 1). The choice of λ can be based on the data; that is, the optimal λ may be
estimated from the data. The estimation can be carried out simultaneously with
other parameters in a completely specified time series model for y_t. A discus-
sion of practical experience with the Box-Cox transformation in the context of
time series analysis is given by Nelson and Granger (1979) and Poirier (1980)
and a discussion of the Box-Cox transformation for the variables of a regres-
sion model is given in Chapter 20. A general discussion of instantaneous data
transformations is given by Granger and Newbold (1976).
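A minimal Python sketch of the transformation (7.7.1), assuming a strictly positive series, may make the boundary cases concrete; the function name is hypothetical.

```python
# Box-Cox transformation of (7.7.1); lambda = 0 gives the logarithm as the
# limiting case, and lambda = 1 gives a simple level shift y - 1.
import numpy as np

def box_cox(y, lam):
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y**lam - 1) / lam

y = np.array([1.0, 2.0, 4.0, 8.0])
print(box_cox(y, 0.0))   # logarithmic transformation
print(box_cox(y, 0.5))
print(box_cox(y, 1.0))   # mean reduction: y - 1
```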
7.7.2 Trend and Seasonal Adjustment

Many economic time series have a trend and a seasonal component and we
have seen in Section 7.2 how to take care of these components in the context of
ARIMA models. However, it cannot be expected that every trend can be
adequately dealt with by differencing or that all possible seasonal components
can be satisfactorily captured in a seasonal ARIMA model. Furthermore, build-
ing a univariate time series model is not always the ultimate objective;
sometimes the investigator may wish to remove the trend or seasonal compo-
nent before a data set is used in some other analysis, or the analyst is merely
interested in the trend or seasonal component. Therefore procedures have been
introduced for trend removal and seasonal adjustment. Many of these proce-
dures rest on the assumption that the time series y_t consists of three compo-
nents: the trend τ_t, the seasonal s_t, and the irregular part z_t. A simple additive
structure

y_t = τ_t + s_t + z_t  (7.7.2)

or a multiplicative structure

y_t = τ_t s_t z_t  (7.7.3)
are the most popular models. For the latter case, a logarithmic transformation
can be used to obtain an additive model. Various different specifications and
methods have been proposed for modeling the three components. Recent con-
tributions include Nerlove, Grether, and Carvalho (1979), Schlicht (1981), Har-
vey (1982), and Hillmer and Tiao (1982).
A major problem in separating the three components results from the lack of
precise and generally acceptable definitions of the trend and the seasonal. The
trend is usually thought of as representing the long-term movements of the time
series, whereas the seasonal could be described as "reflecting the regular peri-
odic fluctuations which recur every year with about the same timing and with
the same intensity" [Kallek (1978, p. 5)]. It must be admitted that more precise
definitions have been proposed in the literature [e.g., Granger (1978)]. How-
ever, people who use time series often have a good idea of what they mean by a
seasonal or trend component without reference to a specific definition. Thus,
when confronted with a seasonally adjusted series they may be able to say,
based on criteria such as the autocorrelation function, the spectrum, or simply
the plot of the series, whether the adjustment procedure used has been success-
ful in removing the seasonal component or not. It is much more difficult, how-
ever, to make people specify what exactly they want an adjustment procedure
to remove, or in other words, what exactly the seasonal component is in a
particular series. The subjective nature of trend and seasonality explains the
existence of a substantial number of adjustment procedures. Before we intro-
duce the basic foundations of some of them, we will discuss a number of
desirable properties of such procedures in the following paragraphs.
1. Linearity. The seasonally adjusted series y_t^a, say, should be a linear func-
tion of the original series y_t. In other words,

y_t^a = Σ_j ν_jy_{t−j}  (7.7.4)

2. Idempotency. Adjusting an already adjusted series should leave it unchanged,

(y_t^a)^a = y_t^a  (7.7.5)

3. Orthogonality. The removed component should be orthogonal to the adjusted series,

E[y_t^a(y_s − y_s^a)] = 0  for all t, s  (7.7.6)

4. Symmetry. The weights in (7.7.4) should satisfy

ν_j = ν_{−j}  (7.7.7)

An adjustment based on regressing the series on seasonal dummy variables, for example, is linear, idempotent, orthogonal,
and symmetric. It does not respond rapidly to changes in the seasonal pattern,
however, and, therefore, may lead to quite unsatisfactory results.
Moving average methods are another rather popular tool for seasonal adjust-
ment. For quarterly data one may, for example, replace y_t by

(½y_{t−2} + y_{t−1} + y_t + y_{t+1} + ½y_{t+2})/4  (7.7.9)

Usually considerably more complicated moving averages are applied for sea-
sonal adjustment. These procedures are linear but in general not idempotent or
orthogonal. The stability and the response to changes in the seasonal structure
will depend on the length of the particular moving average used and on the
treatment of the first and last values of the series, which obviously cannot be
subjected to a symmetric moving average as in (7.7.9).
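A minimal Python sketch (an addition, not from the original text) of the centered moving average (7.7.9) illustrates how a quarterly seasonal pattern is averaged out; interior points only, since the first and last two observations cannot be covered by the symmetric average, and the function name and example series are hypothetical.

```python
import numpy as np

def centered_ma_quarterly(y):
    """Centered moving average (7.7.9) for quarterly data."""
    y = np.asarray(y, dtype=float)
    out = np.full(len(y), np.nan)          # boundary values left undefined
    for t in range(2, len(y) - 2):
        out[t] = (0.5 * y[t - 2] + y[t - 1] + y[t]
                  + y[t + 1] + 0.5 * y[t + 2]) / 4
    return out

y = np.array([10, 12, 14, 8] * 3, dtype=float)  # a purely seasonal pattern
print(centered_ma_quarterly(y))                  # interior values equal the mean, 11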
One of the most widely used seasonal adjustment procedures, the Census
X-11 method, is based on moving averages. However, outliers, that is, values
that are in some sense unusual compared with the rest of the series, are re-
moved at various stages. Therefore the procedure is in general not linear.
Despite much criticism of the method, its widespread use seems to suggest that
it removes from the data what many regard as being the seasonal component.
For a more detailed discussion see, for example, Shiskin, Young, and Mus-
grave (1967), Shiskin and Plewes (1978), Kallek (1978), Cleveland and Tiao
(1976), Dagum (1975, 1978), and Wallis (1981).
For modeling a trend, smooth curves such as polynomials, exponential, or
logarithmic curves are sometimes used. For these and other possible trend
curves see, for example, Granger (1980). The problem inherent in this approach
is, of course, the choice of the curve that is most suitable for a particular time
series under consideration.
Moving average procedures can also be used for trend removal. For in-
stance, for a nonseasonal series y_t the trend may be defined as a symmetric
moving average

τ_t = Σ_{j=−m}^{m} w_jy_{t+j}

with suitably chosen weights w_j.
Kendall (1976), Kendall, Stuart, and Ord (1983), and T. W. Anderson (1971)
provide a more detailed discussion of this approach. Various other methods for
trend and seasonal adjustment have been suggested and the reader is referred
to the papers in Zellner (1978) for a further discussion of the subject.
A number of practical problems remain. First, many published economic time series are
available only in seasonally adjusted form, and the problems of working with
adjusted data have been recog-
nized by time series analysts [e.g., Liu (1980), Cleveland and Grupe (1981), and
Cleveland, Devlin, and Terpenning (1982)].
Second, if a time series is observed monthly but interest centers on the
corresponding quarterly series, then the question arises whether it is preferable
to seasonally adjust the monthly series and then aggregate to obtain quarterly
data or aggregate prior to the adjustment. For a discussion of the aggregation
problem in the context of seasonal series see, for example, Geweke (1978), Wei
(1978), Lee and Wei (1979, 1980).
Third, there are series for which a pure additive or a pure multiplicative
model is not adequate and a mixture is preferable. This problem is discussed,
for instance, by Durbin and Murphy (1975) and Durbin and Kenny (1978).
7.8 SUMMARY
If time series data for a single variable are available, information on the
generation process of the data can be used to forecast future values of the
variable. In this chapter we introduced the class of ARIMA models to summa-
rize the correlation structure of the data generation process. If the realizations
of a variable are known to be generated by a particular ARIMA process, opti-
mal forecasts can be obtained using this process. The mean squared forecasting
error has been used for measuring the forecasting precision. Forecasting on the
basis of ARIMA models implies linear forecasting functions. Since an optimal
linear forecast may be inferior to a nonlinear forecast, some references to
recent research on nonlinear time series models have been given.
If a particular ARIMA process is identified as being the generation process of
a given set of data, estimation of the unknown parameters is theoretically
straightforward. However, optimizing the sum of squares or likelihood function
requires in general nonlinear optimization methods.
The specification of an adequate ARIMA process for a given set of data is
often a rather difficult exercise. The Box-Jenkins approach to solving this
problem involves inspecting the sample autocorrelations and partial autocorre-
lations. A proper interpretation of these quantities can be quite difficult and
various automatic specification methods have been proposed that are based on
optimizing criteria such as the asymptotic forecasting error or the entropy.
The spectrum or spectral density offers a rather different way to summarize
the correlation structure of a stochastic process. Using a spectral approach the
process can be thought of as being composed of sine and cosine waves with
frequencies between −π and π. The spectral density can be interpreted as a
function that relates the frequencies to their contribution to the variance of a
stationary stochastic process. This function completely characterizes the cor-
relation structure of a stationary process and estimation of the spectrum does
not require a preliminary specification of a particular candidate in the class of
considered processes.
Often the given data are transformed prior to an analysis. We have briefly
discussed the Box-Cox transformations as well as trend and seasonal adjust-
ment. The latter transformations are usually based on the idea that a time series
consists of a trend, a seasonal, and an irregular component.
7.9 EXERCISES
In this chapter, asymptotic properties of the estimators were given. The follow-
ing exercises contain some sampling experiments that can help to relate the
asymptotic theory to the small-sample properties of the estimators. The sam-
pling experiments will be based on the models
y_t = θy_{t−1} + v_t  (7.9.1)

and

y_t = v_t + αv_{t−1}  (7.9.2)
Exercise 7.1
Generate five samples of size 30 using (7.9.1) and (7.9.2) and estimate the first
five autocorrelations and partial autocorrelations. Compare the estimates to the
theoretical values.
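For this exercise, the following minimal Python sketch (an addition, not from the original text) generates samples from (7.9.1) and (7.9.2), assuming N(0, 1) white noise; the parameter values θ = 0.5 and α = 0.5 are illustrative placeholders, since the text does not fix them here.

```python
import numpy as np

def generate_samples(T, theta=0.5, alpha=0.5, n_samples=5, seed=0):
    rng = np.random.default_rng(seed)
    ar, ma = [], []
    for _ in range(n_samples):
        v = rng.standard_normal(T + 1)
        y_ar = np.zeros(T)
        for t in range(1, T):
            y_ar[t] = theta * y_ar[t - 1] + v[t]   # (7.9.1)
        y_ma = v[1:] + alpha * v[:-1]               # (7.9.2)
        ar.append(y_ar)
        ma.append(y_ma)
    return ar, ma

ar_samples, ma_samples = generate_samples(T=30)
print(len(ar_samples), ar_samples[0][:5])
```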
Exercise 7.2
Repeat Exercise 7.1 with samples of size 100. Compare the estimates to those
obtained in Exercise 7.1. What are the consequences for using the Box-Jenkins
model building procedure?
Exercise 7.3
Use the samples of Exercises 7.1 and 7.2 to estimate the parameters θ and α of
the models (7.9.1) and (7.9.2) by conditional least squares. Compare the esti-
mates to the actual parameter values and interpret.
Exercise 7.4
Use the samples of Exercise 7.2 and estimate the AR order of the models using
the sequential testing procedures of Section 7.5.2a.
Exercise 7.5
Repeat Exercise 7.4 using the estimation criteria listed in Table 7.2. Compare
the results and interpret.
Exercise 7.6
Use the Parzen window to estimate the spectral densities of the processes
(7.9.1) and (7.9.2) from the data generated in Exercise 7.2. Compare the esti-
mates to the theoretical spectral densities and interpret.
7.9.2 Class Exercises
We will assume now that at least 100 samples for each process and each sample
size (T = 30, T = 100) have been considered by the students of a class.
Exercise 7.7
Use the estimates obtained in Exercises 7.1 and 7.2 for each of the two pro-
cesses and each of the two sample sizes to construct empirical frequency
distributions for the estimated autocorrelation and partial autocorrelation coef-
ficients.
Exercise 7.8
Use the estimates of θ and α obtained in Exercise 7.3 to construct empirical
frequency distributions for sample size 30 and sample size 100. Interpret the
results.
Exercise 7.9
Use the estimates for the AR order obtained in Exercises 7.4 and 7.5, to
construct frequency distributions for each of the different criteria and interpret.
7.10 REFERENCES
DYNAMIC
SPECIFICATIONS
In this part of the book we return to the study of the general linear statistical
model y = Xβ + e and we examine the special problems and estimation proce-
dures that arise when using time series data. Economic behavior is seldom
instantaneous, and so, when using time series data, an important part of model
specification is the incorporation of appropriate lags. Lags can arise in a num-
ber of ways. In Chapter 8 we investigate estimation, prediction, and hypothesis
testing for models where each element in the disturbance vector e depends on
past values of itself or on current and past values of other disturbances. Such
models come under the general title of autocorrelation and involve an assump-
tion that e represents T (consecutive) realizations on a stochastic process (see
Chapter 7). Lags are also introduced if the X matrix contains lagged values of
one or more explanatory variables. In Chapter 9 we study finite distributed lag
models that are characterized by explanatory variables with a finite number of
lags, and that often involve restrictions on the coefficients of the lagged vari-
ables. When the number of lags is infinite, it is necessary to transform the
model so that only a finite number of parameters is involved. This can some-
times be achieved by permitting lagged values of the dependent variable to
appear as part of the X matrix. The special estimation problems that arise in
this case are studied in Chapter 10 under the title of infinite distributed lags.
Chapter 8
Autocorrelation
an autocorrelation matrix of the general form

Ψ = [1        ρ_1      ρ_2  …  ρ_{T−1}
     ρ_1      1        ρ_1  …  ρ_{T−2}
     ⋮                          ⋮
     ρ_{T−1}  ρ_{T−2}       …  1]

where ρ_s is the correlation between disturbances s periods apart. The most commonly assumed disturbance process is the first-order autoregressive [AR(1)] process

e_t = ρe_{t−1} + v_t  (8.1.2)
where, for stationarity, |ρ| < 1, and the v_t are random variables with E[v_t] = 0,
E[v_t²] = σ_v², and E[v_tv_s] = 0 for t ≠ s. Under (8.1.2) it can be shown [e.g., Judge
et al. (1982, p. 436)] that the autocorrelation coefficients are given by

ρ_s = ρ^s  (8.1.3)

and that the disturbance variance is

σ_e² = σ_v²/(1 − ρ²)  (8.1.4)

so that the covariance matrix E[ee'] = σ_e²Ψ has typical element σ_e²ρ^{|t−s|}.  (8.1.5)

The AR(1) process has traditionally received most of the attention, partly
because most data were annual data for which the AR(1) process was a reason-
able representation and because, without a computer, estimation of other possi-
ble processes is an onerous task.
able representation and because, without a computer, estimation of other possi-
ble processes is an onerous task. However, over the last decade, with the use
of more frequently observed data and the virtual elimination of computational
constraints, the other specifications outlined in Chapter 7 have been consid-
ered. These include autoregressive processes of finite order greater than one,
moving-average processes, and combined autoregressive moving-average pro-
cesses. They have the following form.
AR(p):  e_t = θ_1e_{t−1} + θ_2e_{t−2} + … + θ_pe_{t−p} + v_t  (8.1.6)

MA(q):  e_t = v_t + α_1v_{t−1} + … + α_qv_{t−q}  (8.1.7)

ARMA(p,q):  e_t = θ_1e_{t−1} + … + θ_pe_{t−p} + v_t + α_1v_{t−1} + … + α_qv_{t−q}  (8.1.8)

The θ_j and α_j are unknown parameters while the v_t are random disturbances
such that E[v_t] = 0, E[v_t²] = σ_v², and E[v_tv_s] = 0 for t ≠ s.
Processes of order higher than one are likely to be relevant when using
monthly or quarterly data. For example, with quarterly data a fourth- or eighth-
order process could be considered while, for monthly data, a twelfth-order
process might be reasonable. The choice among an AR, MA, or ARMA process
is often not straightforward. In some instances [Nicholls, Pagan, and Terrell
(1975)], theory suggests that a particular type of process would be appropriate
but, in most cases, any one of the three might be reasonable and the applied
worker should entertain this possibility. For details of the properties of the
various stochastic processes see Chapter 7 and Judge et al. (1982, Chapter 25).
The major distinction between the models considered in this chapter and
those in Chapter 7 is the presence of the term Xβ. In Chapter 7 we assumed
that an observable random variable y_t followed the assumptions of one of the
stochastic processes, and we were concerned with estimation of the parameters
of that stochastic process. In this chapter, however, we assume that the unob-
servable disturbances satisfy the assumptions of a particular stochastic pro-
cess, and estimation of the parameters of that process is of secondary impor-
tance relative to estimation of β. Generally, the models in Chapter 7 can be
viewed as special cases of those in this chapter, obtained by setting β = 0.
It is also useful to keep in mind that the models in this chapter are special
cases of a general model outlined in Chapter 10. In terms of one explanatory
variable the general model is (Equation 10.1.2)
y_t = [γ(L)/φ(L)]x_t + [α(L)/θ(L)]v_t  (8.1.9)

where γ(L), φ(L), α(L), and θ(L) are finite-order polynomials in the lag opera-
tor L. The special cases with which we are currently concerned are obtained by
setting γ(L) = γ_0, φ(L) = 1, and by considering different orders for the polyno-
mials α(L) and θ(L) in the disturbance specification e_t = [α(L)/θ(L)]v_t. We
assume throughout that X (or x_t in Equation 8.1.9) is nonstochastic. Note,
however, that lagged values of the dependent variable do become involved if
we multiply both sides of (8.1.9) by θ(L) (where θ(L) ≠ 1). The case where
lagged values of y_t occur through multiplication of both sides by φ(L) ≠ 1 is
treated in Chapter 10.
The plan of this chapter is as follows. First, in Section 8.1.1, we discuss the
consequences of using least squares (LS) when autocorrelation exists. Then,
under various alternative assumptions, other methods of estimation are re-
viewed in Section 8.2. In Sections 8.2.1 to 8.2.4 we consider generalized least
squares, nonlinear least squares, and maximum likelihood estimation of models
with autoregressive errors of various orders; in Sections 8.2.5 to 8.2.7 this is
extended to moving-average and autoregressive moving-average errors. Pre-
diction is discussed in Section 8.3, and in Section 8.4 we describe methods for
testing for various forms of autocorrelation. Some recommendations are given
and the properties of pretest estimators are discussed in Section 8.5. An over-
view of the contents of the chapter is given in Table 8.1.
Since least squares (LS) requires less computation and, through testing proce-
dures, we may not always be able to detect the presence of autocorrelation, it is
worth studying the consequences of using LS when Ψ (≠ I) results from an
autocorrelation process. There are two main consequences.

1. The LS estimator b = (X'X)⁻¹X'y will be unbiased, but it will not in gen-
eral be efficient.
2. The LS variance estimator σ̂²(X'X)⁻¹, with σ̂² given by (y − Xb)'(y − Xb)/
(T − K), will be a biased estimator of the true covariance matrix
σ²(X'X)⁻¹X'ΨX(X'X)⁻¹. Consequently, the usual LS test statistics will not
be valid [Judge et al. (1982, Section 10.6) and Section 5.5 of this text].
8.1.1a Efficiency
TABLE 8.1 OVERVIEW OF CHAPTER 8

Error models and estimators:
- AR(1) errors: generalized least squares; estimated generalized least squares; nonlinear least squares; maximum likelihood; Bayesian
- AR(2) errors: generalized least squares; estimated generalized least squares; nonlinear least squares; maximum likelihood
- Higher-order AR errors: generalized least squares; nonlinear least squares; maximum likelihood
- MA errors: nonlinear least squares; maximum likelihood; frequency domain techniques (Section 8.2.8)
- ARMA errors: nonlinear least squares; maximum likelihood

Tests

Predictors:
- Optimal prediction with AR errors (Section 8.3.1)
- Optimal prediction with MA errors (Section 8.3.2)
- Prediction MSEs (Section 8.3.3)
For the model with one explanatory variable that itself follows an AR(1) process
with parameter λ, while the errors follow the AR(1) process (8.1.2), the asymptotic
efficiency of LS relative to GLS is

(1 − ρ²)(1 − ρλ)/[(1 + ρ² − 2ρλ)(1 + ρλ)]  (8.1.10)

More generally, for an AR(1) error process the inverse of Ψ can be written as

Ψ⁻¹ = (1 + ρ²)I − 2ρΩ − ρ²C  (8.1.11)

where

2Ω = [0 1 0 … 0 0
      1 0 1 … 0 0
      0 1 0 … 0 0
      ⋮
      0 0 0 … 0 1
      0 0 0 … 1 0]  (8.1.12)

and C = diag(1, 0, …, 0, 1). If it is reasonable to omit C and assume that Ψ⁻¹
can be approximated by W⁻¹ = (1 + ρ²)I − 2ρΩ, then the efficiency question
can be framed in terms of the characteristic vectors of W. These are (q_1, q_2,
…, q_T) with the elements of q_j' = (q_{1j}, q_{2j}, …, q_{Tj}) given by

q_{tj} = [2/(T + 1)]^{1/2} sin[tjπ/(T + 1)]
For this special case the expectation of the LS variance estimator satisfies, approximately,

(T − 1)E[σ̂²]/σ² = T − (1 + ρλ)/(1 − ρλ)  (8.1.13)
If both ρ and λ are positive, the usual situation with economic time series, both
these expressions indicate that LS procedures will underestimate the vari-
ances, and the extent of this underestimation can be substantial. This can lead
to unjustified confidence in the reliability of the estimates.
Although this result comes from a very special case, only one explanatory
variable and the assumption that both the explanatory variable and the error
follow AR(1) processes with positive parameters, it seems to have led many
textbooks to conclude generally that LS variance estimates will be biased down-
ward in the presence of autocorrelation. This will not be true in general, as is
clear from the above when p and A have opposite signs. However, a more
thorough analysis by Nicholls and Pagan (1977), still using only one explana-
tory variable but with alternative assumptions about it and the error process,
indicates that understatement of the variances is the more likely situation,
provided we have positive autocorrelation. But, in addition, they note that
negative autocorrelation in the regressors and/or the disturbances could be
quite likely if the regressors are the result of first difference transformations.
For general X and an AR(1) process, some results on the bias have been
provided by Chipman (1965), Sathe and Vinod (1974), and Neudecker (1977a,
1978). For the case E[ee'] = σ_e²W, where W⁻¹ is the approximation to Ψ⁻¹
defined below Equation 8.1.12, Chipman shows that the condition X = QΓ,
where Q contains the characteristic vectors of W corresponding to the K small-
est characteristic roots, is sufficient for the variances to be underestimated by
LS procedures. Sathe and Vinod tabulate upper and lower bounds on the bias
for alternative T, K, and ρ while Neudecker, in a similar way, just considers the
component given by E[σ̂²].
One of the difficulties of using biased variance estimates is the possible
reversal of hypothesis test decisions. If the F statistic used to test a set of linear
restrictions on β, say Rβ = r, is based on LS estimates, then the decision to
accept or reject Rβ = r may be different than it would be if based on a test
statistic using an unbiased variance estimate. Granger and Newbold (1974)
illustrate this with some simulation experiments. However, a decision based on
an unbiased variance estimate can only reverse a least-squares-based decision
if the least squares F (or t in the case of one restriction) value falls within two
bounds. Kiviet (1980), correcting and extending earlier work by Vinod (1976),
has tabulated these bounds for a number of autocorrelation processes. Thus, if
one uses LS in an inappropriate situation, these tables can be used to assess the
likelihood of making an incorrect test decision.
However, if a calculated value falls outside the range of "possible test deci-
sion reversal," this does not necessarily mean that it is not worthwhile for the
applied worker to obtain a GLS estimate. The tables are constructed in terms of
"appropriate" and "inappropriate" tests based on b. A test based on the GLS
estimator β̂ is likely to be more powerful and could lead to a different decision.
Thus we conclude that there are frequently circumstances when autocorre-
lated disturbances exist and the LS estimator is relatively efficient. However,
the use of LS can lead to poor assessments of the reliability of the estimates,
and to invalid test statistics; it is therefore desirable to consider alternative
methods of estimation. These alternative methods are discussed in the follow-
ing section.
8.2 ESTIMATION

In this section we are concerned with estimation of the linear model y = Xβ +
e, where the earlier definitions hold and the elements of the vector e are auto-
correlated. Estimation under a number of alternative autocorrelation specifica-
tions will be considered; in each case the tth disturbance e_t will be some
function of an uncorrelated homoscedastic (white noise) disturbance v_t with
variance σ_v². The exact function and the matrix Ψ will depend on the case under
consideration but, in every case, we will write the covariance matrix for e as
E[ee'] = Φ = σ_v²Ψ.
If this covariance matrix is known, the generalized least squares (GLS) esti-
mator β̂ = (X'Ψ⁻¹X)⁻¹X'Ψ⁻¹y is best linear unbiased and is obtained by
minimizing

S(β) = (y − Xβ)'Ψ⁻¹(y − Xβ)  (8.2.1)
For ML estimation we assume, in addition, that e is normally distributed, and so the log likelihood is, apart from a constant,

L = −(T/2)ln σ_v² − ½ln|Ψ| − (1/2σ_v²)(y − Xβ)'Ψ⁻¹(y − Xβ)

Maximizing this function with respect to σ_v² yields

σ̂_v² = (y − Xβ)'Ψ⁻¹(y − Xβ)/T  (8.2.4)

and substituting this into L and ignoring constants gives the concentrated log
likelihood function
L(β, Ψ) = −(T/2)ln[(y − Xβ)'Ψ⁻¹(y − Xβ)] − ½ln|Ψ|  (8.2.5)

The maximum likelihood estimators for β and the unknown elements in Ψ are
those values that maximize (8.2.5). After some rearranging, we can show that
these values are equal to the values that minimize

(y − Xβ)'Ψ⁻¹(y − Xβ)·|Ψ|^{1/T}  (8.2.6)

and so the difference between the objective function for the ML estimates
(8.2.6), and that for the GLS estimator (8.2.1), is that the former contains the
Tth root of the determinant of Ψ.
It is sometimes advantageous to concentrate β out of the likelihood function.
Since β̂ = (X'Ψ⁻¹X)⁻¹X'Ψ⁻¹y is the ML estimator for β conditional on Ψ, ML
estimates for the unknown elements in Ψ can be found by minimizing

(y − Xβ̂)'Ψ⁻¹(y − Xβ̂)·|Ψ|^{1/T}  (8.2.7)

Substituting the ML estimate for Ψ into the expressions for β̂ and σ̂_v² gives the
unconditional ML estimates for β and σ_v², respectively.
In what follows we review estimation techniques for models with AR, MA,
and ARMA error processes. If, in each of the models, we set β = 0 so that y_t =
e_t, then the techniques are also relevant for the "pure" time series models
discussed in Chapter 7.
8.2.1 AR(1) Errors

By far the most popular autocorrelated error process assumed for the vector e
is the first-order autoregressive process. Under this assumption the model can
be written as

y_t = x_t'β + e_t  (8.2.8)

e_t = ρe_{t−1} + v_t  (8.2.9)

where E[v_t] = 0, E[v_t²] = σ_v², E[v_tv_s] = 0 for t ≠ s, |ρ| < 1, and x_t' is a (1 × K)
vector containing the tth observation on K explanatory variables. From (8.1.3)
and (8.1.4) the covariance matrix E[ee'] = Φ = σ²Ψ can be written as
Φ = σ²Ψ = [σ_v²/(1 − ρ²)] ×
[1         ρ         ρ²       …  ρ^{T−1}
 ρ         1         ρ        …  ρ^{T−2}
 ρ²        ρ         1        …  ρ^{T−3}
 ⋮
 ρ^{T−1}   ρ^{T−2}   ρ^{T−3}  …  1]  (8.2.10)
and its inverse is

Φ⁻¹ = (1/σ_v²)Ψ⁻¹ = (1/σ_v²) ×
[1    −ρ       0       …  0       0
 −ρ   1 + ρ²   −ρ      …  0       0
 0    −ρ       1 + ρ²  …  0       0
 ⋮
 0    0        0       …  1 + ρ²  −ρ
 0    0        0       …  −ρ      1]  (8.2.11)
The transformation matrix

P = [√(1 − ρ²)  0    0  …  0    0
     −ρ         1    0  …  0    0
     0          −ρ   1  …  0    0
     ⋮
     0          0    0  …  1    0
     0          0    0  …  −ρ   1]  (8.2.12)

is such that P'P = Ψ⁻¹. The first observation of the transformed model is given
by

√(1 − ρ²)y_1 = √(1 − ρ²)x_1'β + √(1 − ρ²)e_1  (8.2.13)

and the remaining observations are given by

y_t − ρy_{t−1} = (x_t − ρx_{t−1})'β + v_t,  t = 2, 3, …, T  (8.2.14)
If the first column of X is a vector of ones, the first column of X* will no longer
be constant. Its first element will be √(1 − ρ²), while the remaining
elements will be (1 − ρ).
The sum of squares, which is minimized by the GLS estimator, Equation
8.2.1, can be written as

(y* − X*β)'(y* − X*β) = e'P'Pe = [√(1 − ρ²)e_1]² + Σ_{t=2}^T (e_t − ρe_{t−1})²
                      = e_1*² + Σ_{t=2}^T v_t²  (8.2.15)
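The following minimal Python sketch (an addition, not from the original text) applies the transformation (8.2.13)-(8.2.14) and then ordinary LS, assuming ρ is known; the function name and the simulated example are hypothetical.

```python
import numpy as np

def prais_winsten_gls(y, X, rho):
    """GLS for AR(1) errors via the full transformation matrix P."""
    T = len(y)
    y_star = np.empty(T)
    X_star = np.empty_like(X, dtype=float)
    y_star[0] = np.sqrt(1 - rho**2) * y[0]     # first transformed observation
    X_star[0] = np.sqrt(1 - rho**2) * X[0]
    y_star[1:] = y[1:] - rho * y[:-1]          # quasi-differenced observations
    X_star[1:] = X[1:] - rho * X[:-1]
    beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return beta

rng = np.random.default_rng(4)
T, rho = 100, 0.7
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
e = np.zeros(T)
for t in range(1, T):
    e[t] = rho * e[t - 1] + rng.standard_normal()
y = X @ np.array([1.0, 2.0]) + e
print(prais_winsten_gls(y, X, rho))
```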
8.2.1b Estimators for ρ

To make the GLS estimator operational, an estimate of ρ is required. Several alternatives exist.

1. The Sample Correlation Coefficient. A natural estimator is the sample correlation between successive residuals,

r_1 = Σ_{t=2}^T ê_tê_{t−1}/Σ_{t=1}^T ê_t²  (8.2.16)

where the disturbances (e_t's), because they are unobservable, have been
replaced by the LS residuals ê_t = y_t − x_t'b, t = 1, 2, …, T. When the
variables are transformed using r_1 in place of ρ in the transformation ma-
trix P_0, the resulting approximate EGLS estimator for β is known as the
Cochrane-Orcutt (1949) two-step procedure. When both (8.2.13) and
(8.2.14) are used (the complete transformation matrix P), the estimator is
often termed the Prais-Winsten (1954) estimator.
We can also regard r_1 as an estimator of ρ in the linear regression ê_t =
ρê_{t−1} + v_t except, in this case, the summation in the denominator of (8.2.16)
would not include ê_T. Theil (1971) gives another modification of r_1, namely

r* = r_1(T − K)/(T − 1)  (8.2.17)
2. The Durbin-Watson Statistic. The Durbin-Watson statistic
d = Σ_{t=2}^T (ê_t − ê_{t−1})²/Σ_{t=1}^T ê_t²
is often used to test for the existence of autocorrelation (see Section 8.4), is
calculated by most LS computer programs, and can easily be modified to
yield an estimator for ρ that is approximately equal to r_1, namely

ρ̂ = 1 − ½d  (8.2.18)
3. The Theil-Nagar Modification. Theil and Nagar (1961) suggest the estimator

$$\rho^* = \frac{T^2(1 - d/2) + K^2}{T^2 - K^2} \qquad (8.2.19)$$
8.2.1c NLS Estimation

Alternatively, the model with AR(1) errors can be written in the nonlinear form

$$y_t = \rho y_{t-1} + x_t'\beta - \rho x_{t-1}'\beta + v_t = \rho y_{t-1} + \beta_1(1-\rho) + x_t^{0\prime}\beta^0 - \rho x_{t-1}^{0\prime}\beta^0 + v_t \qquad (8.2.20)$$

where x_t⁰ and β⁰ denote x_t and β with the constant term excluded, and the sum of squares over all T observations is

$$(y - X\beta)'P'P(y - X\beta) = e_1^{*2} + \sum_{t=2}^{T} v_t^2$$

Similarly, given an estimate for ρ, say ρ̂, and corresponding matrices P̂ and P̂₀ such that P̂'P̂ = Ψ̂⁻¹, we obtain the EGLS estimator β̂ = (X'Ψ̂⁻¹X)⁻¹X'Ψ̂⁻¹y by finding that β that minimizes (y - Xβ)'P̂'P̂(y - Xβ). Nonlinear least squares (NLS) estimates are obtained by simultaneously finding both β and ρ that minimize e₁*² + Σ_{t=2}^T v_t². In this case the estimator for β will still be of the same form, but the estimator for ρ need not correspond to any of those given in Section 8.2.1b.
Also, some slight modifications of NLS are often used; these differ in their treatment of the first observation. The first two are identical [Pagan (1974)] and, in effect, treat the first observation as fixed. Also, under the assumption of normality, they are often referred to as ML estimators [Kmenta (1971), Hildreth and Lu (1960), Cochrane and Orcutt (1949)]. However, since they are based on the joint density f(y₂, y₃, ..., y_T|y₁), they are only ML estimators conditional on y₁. The unconditional likelihood function based on all T observations, f(y₁, y₂, ..., y_T), includes e₁*² as well as an additional term that depends only on ρ. See Equations 8.2.22 to 8.2.29.

The third modification, setting e₀ = 0, fixes e₀ at its expectation and is equivalent to assuming that the first observation was not generated by an autoregressive process. All the procedures are asymptotically equivalent, and so the use of any particular one has usually been based on convenience rather than on statistical properties.
Whatever sum of squares is chosen, it can be minimized using either iterative or search techniques. One iterative technique for minimizing Σ_{t=2}^T v_t² [Cochrane and Orcutt (1949)] is to estimate ρ using r₁ in (8.2.16), obtain the corresponding β̂₀ (first observation treated as fixed), reestimate ρ by applying LS to (8.2.14) rewritten as

$$y_t - x_t'\hat\beta_0 = \rho(y_{t-1} - x_{t-1}'\hat\beta_0) + v_t \qquad (8.2.21)$$

and to continue in this fashion until convergence.
8.2.1d ML Estimation

Let us now consider maximum likelihood estimation under the assumption of normality. The likelihood function is given by

$$f(y_1, y_2, \ldots, y_T) = f(y_1)\prod_{t=2}^{T} f(y_t|y_{t-1}) \qquad (8.2.22)$$

where

$$f(y_1) = \left(\frac{1-\rho^2}{2\pi\sigma_v^2}\right)^{1/2}\exp\left\{-\frac{(1-\rho^2)(y_1 - x_1'\beta)^2}{2\sigma_v^2}\right\} \qquad (8.2.23)$$

and

$$f(y_t|y_{t-1}) = (2\pi\sigma_v^2)^{-1/2}\exp\left\{-\frac{\left[y_t - \rho y_{t-1} - (x_t - \rho x_{t-1})'\beta\right]^2}{2\sigma_v^2}\right\} \qquad (8.2.24)$$

where t = 2, 3, ..., T. Thus

$$f(y) = (2\pi\sigma_v^2)^{-T/2}(1-\rho^2)^{1/2}\exp\left\{-\frac{(y^* - X^*\beta)'(y^* - X^*\beta)}{2\sigma_v^2}\right\} \qquad (8.2.25)$$

and, ignoring the constant 2π, the log of the likelihood function is equal to

$$L = -\frac{T}{2}\ln\sigma_v^2 + \frac{1}{2}\ln(1-\rho^2) - \frac{1}{2\sigma_v^2}(y^* - X^*\beta)'(y^* - X^*\beta) \qquad (8.2.26)$$
The ML estimator for σ_v², conditional on β and ρ, is

$$\hat\sigma_v^2 = \frac{(y^* - X^*\beta)'(y^* - X^*\beta)}{T} \qquad (8.2.27)$$

and substituting this into (8.2.26) yields the concentrated log likelihood

$$L(\rho,\beta) = -\frac{T}{2}\ln(y^* - X^*\beta)'(y^* - X^*\beta) + \frac{1}{2}\ln(1-\rho^2) = -\frac{T}{2}\ln\left(e_1^{*2} + \sum_{t=2}^{T} v_t^2\right) + \frac{1}{2}\ln(1-\rho^2) \qquad (8.2.28)$$
(8.2.29)
As in the case of NLS estimates, they can be found using either an iterative or a search procedure. The ML estimator for β is of the EGLS type, but the ML estimate of ρ upon which it is based can be different from the NLS estimate because (8.2.28) contains the term ½ln(1 - ρ²) as well as the sum of squares term. However, when the sample size is large and ρ is not too close to 1, the sum of squares term will dominate the other and the two sets of estimates will essentially be the same. Because ½ln(1 - ρ²) does not depend on T, they are asymptotically equivalent.
Specific algorithms for maximizing (8.2.28) have been outlined by Hildreth and Dent (1974) and Beach and MacKinnon (1978a). The former first substitutes β̂ into L(ρ,β) so that it is only a function of ρ. It then uses a search procedure over ρ to locate the neighborhood of the optimum and, within this neighborhood, locates ρ̂ by using an iterative procedure that finds a smaller neighborhood with each iteration. The Beach-MacKinnon algorithm is a modification of the Cochrane-Orcutt iterative procedure to allow for the additional terms e₁*² and ½ln(1 - ρ²). They claim that it is computationally more efficient than using a search procedure.
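For readers who want to experiment, a simple grid search over the concentrated log likelihood (8.2.28) can be coded in a few lines. The helper names below are ours, and a finer local optimization could replace the coarse grid.

```python
import numpy as np

def concentrated_loglik(rho, y, X):
    """Concentrated log likelihood (8.2.28) for the AR(1) error model."""
    T = len(y)
    w = np.sqrt(1.0 - rho**2)
    y_s = np.concatenate(([w * y[0]], y[1:] - rho * y[:-1]))
    X_s = np.vstack((w * X[:1], X[1:] - rho * X[:-1]))
    beta = np.linalg.lstsq(X_s, y_s, rcond=None)[0]   # beta-hat conditional on rho
    rss = np.sum((y_s - X_s @ beta) ** 2)
    return -0.5 * T * np.log(rss) + 0.5 * np.log(1.0 - rho**2)

def ml_ar1_rho(y, X, grid=np.linspace(-0.99, 0.99, 199)):
    """Grid search over rho for the ML estimate."""
    vals = [concentrated_loglik(r, y, X) for r in grid]
    return grid[int(np.argmax(vals))]
```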
8.2.1e Properties

If the general conditions given in Section 5.5.2 hold, then

$$\sqrt{T}(\hat\beta - \beta) \xrightarrow{d} N(0,\ \sigma_v^2 V^{-1}) \qquad (8.2.30)$$

where β̂ can be any of the above estimators for β, and V = lim T⁻¹X'Ψ⁻¹X. Conditions more specific to the AR(1) process are given by Hildreth (1969), Theil (1971, pp. 405-407), and Magnus (1978). From (8.2.30) and the results in Chapter 5, approximate tests of hypotheses about β can be based on the assumption that β̂ is normally distributed with mean β and covariance matrix σ̂_v²(X'P̂'P̂X)⁻¹, where P̂ is based on a consistent estimator for ρ and

$$\hat\sigma_v^2 = \frac{(y - X\hat\beta)'\hat P'\hat P(y - X\hat\beta)}{T - K} \qquad (8.2.31)$$
Using the inverse of the information matrix, we can show that the ML estimators ρ̂ and σ̂_v² have respective asymptotic variances (1 - ρ²)/T and 2σ_v⁴/T, and that they are (asymptotically) independent as well as independent of the ML estimator β̂.
Because the alternative estimators for β all have the same asymptotic distribution, and their small-sample properties are difficult to derive analytically, any choice between estimators that is based on sampling properties has been based on Monte Carlo evidence.
Such evidence has been provided by Griliches and Rao (1969), Hildreth and
Lu (1969), Hildreth and Dent (1974), Beach and MacKinnon (1978a), Spitzer
(1979), Park and Mitchell (1980), and Harvey (1981a), but, as is frequently the
case with Monte Carlo studies, the evidence does not suggest a clear-cut
choice. Some of the contradictory conclusions are resolved by Taylor (1981)
who, using analytical approximations, shows that the conclusions of any Monte
Carlo study will depend heavily on the specification of X, and on whether the
results are conditional on a given X or unconditional with respect to the distri-
bution of a stochastic X. Despite the uncertainties that remain, it does seem clear that, when estimating β, it is desirable to include the initial transformed observation (Equation 8.2.13), and that the ML estimator seldom performs poorly relative to the other estimators.
The efficiency of various estimators for β will depend to a large extent on the accuracy with which ρ is estimated. The studies by Hildreth and Dent (1974), Beach and MacKinnon (1978a), and Griffiths and Beesley (1984) all found a substantial bias in the ML estimator for ρ. Consequently, it is possible that an adjusted ML estimator for ρ, such as the one suggested by Hildreth and Dent, may lead to a more efficient estimator for β.
Another related finding that has emerged from the Monte Carlo studies, particularly those of Park and Mitchell (1980) and Griffiths and Beesley (1984), is that σ̂_v²(X'Ψ̂⁻¹X)⁻¹ can be a very poor approximation to the finite sample covariance matrix for any β̂. This fact, and the fact that most estimators for β will be relatively efficient, mean that future research directed toward methods for more accurate assessment of the reliability of β̂ is likely to be much more profitable than further efficiency comparisons. For an evaluation of some efforts in this direction see Ullah et al. (1983) and Miyazaki and Griffiths (1984).
8.2.1f Bayesian Estimation

For a Bayesian analysis of the model with AR(1) errors, consider the prior density

$$g(\beta,\rho,\sigma_v) \propto \frac{1}{\sigma_v\sqrt{1-\rho^2}} \qquad (8.2.32)$$

which results from the application of Jeffreys' rule. See, for example, Fomby and Guilkey (1978). Combining this with the likelihood (Equation 8.2.25) yields the joint posterior density

$$g(\beta,\rho,\sigma_v|y) \propto \sigma_v^{-(T+1)}\exp\left\{-\frac{(y^* - X^*\beta)'(y^* - X^*\beta)}{2\sigma_v^2}\right\} \qquad (8.2.33)$$

Integrating σ_v out of (8.2.33) gives

$$g(\beta,\rho|y) \propto (\mathrm{RSS})^{-T/2}\left[1 + \frac{(\beta - \hat\beta)'X^{*\prime}X^*(\beta - \hat\beta)}{\mathrm{RSS}}\right]^{-T/2} \qquad (8.2.34)$$

where β̂ = (X*'X*)⁻¹X*'y* and RSS = (y* - X*β̂)'(y* - X*β̂). Note that both β̂ and RSS depend on ρ.
From (8.2.34) it is clear that the density g(β|ρ,y) is a multivariate t distribution with mean β̂. Thus, if ρ were known, the Bayesian estimator with quadratic loss is the GLS estimator. When ρ is unknown, the Bayesian estimator will be the mean of the density g(β|y), which is obtained by integrating ρ out of (8.2.34). Unfortunately, this integration is not analytically tractable, and we must resort to numerical methods.
To implement these methods, we first note that

$$E[\beta|y] = \int_{-1}^{1}\!\!\int \beta\, g(\beta,\rho|y)\, d\beta\, d\rho = \int_{-1}^{1} g(\rho|y)\int \beta\, g(\beta|\rho,y)\, d\beta\, d\rho = \int_{-1}^{1} \hat\beta\, g(\rho|y)\, d\rho \qquad (8.2.35)$$

where, after the first two equalities, the second integral is a K-dimensional one. The last equality indicates that the Bayesian estimator (with quadratic loss) is a weighted average of GLS estimators with weights given by the marginal posterior for ρ. Since it uses all values of ρ in this way, it is an intuitively more satisfying estimator than an EGLS estimator, which uses just one point estimate of ρ. The density g(ρ|y) is obtained by integrating β out of (8.2.34) and is given by

$$g(\rho|y) \propto (\mathrm{RSS})^{-(T-K)/2}|X^{*\prime}X^*|^{-1/2} \qquad (8.2.36)$$

Thus the posterior mean of the kth element of β can be written as

$$E[\beta_k|y] = \frac{\displaystyle\int_{-1}^{1} \hat\beta_k\,(\mathrm{RSS})^{-(T-K)/2}|X^{*\prime}X^*|^{-1/2}\, d\rho}{\displaystyle\int_{-1}^{1} (\mathrm{RSS})^{-(T-K)/2}|X^{*\prime}X^*|^{-1/2}\, d\rho} \qquad (8.2.37)$$
If one is interested in the shape of the posterior density g(β_k|y), as well as its mean, this can be analyzed by applying bivariate numerical integration to

$$g(\beta_k,\rho|y) \propto (\mathrm{RSS})^{-(T-K+1)/2}|X^{*\prime}X^*|^{-1/2}\, c_{kk}^{-1/2}\left[1 + \frac{(\beta_k - \hat\beta_k)^2}{c_{kk}\,(\mathrm{RSS})}\right]^{-(T-K+1)/2} \qquad (8.2.38)$$

where c_kk is the kth diagonal element of (X*'X*)⁻¹ and β̂_k is the kth element of β̂.
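A sketch of the univariate numerical integration described above is given below; it assumes a simple weighted average over a uniform grid for ρ (so the grid spacing cancels on normalization), and all function and variable names are our own.

```python
import numpy as np

def bayes_ar1_posterior_mean(y, X, grid=np.linspace(-0.995, 0.995, 399)):
    """Posterior mean E[beta|y] via numerical integration over rho,
    using g(rho|y) of (8.2.36) as the weight function."""
    T, K = X.shape
    log_w = np.empty(len(grid))
    betas = np.empty((len(grid), K))
    for i, rho in enumerate(grid):
        w = np.sqrt(1.0 - rho**2)
        y_s = np.concatenate(([w * y[0]], y[1:] - rho * y[:-1]))
        X_s = np.vstack((w * X[:1], X[1:] - rho * X[:-1]))
        XtX = X_s.T @ X_s
        beta = np.linalg.solve(XtX, X_s.T @ y_s)       # GLS estimate at this rho
        rss = np.sum((y_s - X_s @ beta) ** 2)
        log_w[i] = -0.5 * (T - K) * np.log(rss) - 0.5 * np.linalg.slogdet(XtX)[1]
        betas[i] = beta
    wts = np.exp(log_w - log_w.max())   # unnormalized posterior weights
    wts /= wts.sum()                    # uniform grid: spacing cancels
    return wts @ betas                  # weighted average of GLS estimates
```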
8.2.2 AR(2) Errors

Consider now the model with AR(2) errors,

$$y_t = x_t'\beta + e_t \qquad (8.2.39)$$

$$e_t = \theta_1 e_{t-1} + \theta_2 e_{t-2} + v_t \qquad (8.2.40)$$

where E[v_t] = 0, E[v_tv_s] = 0 for t ≠ s, and E[v_t²] = σ_v². This process will be stationary if θ₁ + θ₂ < 1, θ₂ - θ₁ < 1, and -1 < θ₂ < 1. The elements of the covariance matrix E[ee'] = Φ = σ_v²Ψ can be found from the variance

$$\sigma_e^2 = \frac{(1-\theta_2)\sigma_v^2}{(1+\theta_2)\left[(1-\theta_2)^2 - \theta_1^2\right]} \qquad (8.2.41)$$

and the autocorrelations

$$\rho_1 = \frac{\theta_1}{1-\theta_2} \qquad (8.2.42)$$

$$\rho_s = \theta_1\rho_{s-1} + \theta_2\rho_{s-2}, \qquad s \geq 2 \qquad (8.2.43)$$
and

$$\Psi^{-1} = \begin{bmatrix} 1 & -\theta_1 & -\theta_2 & 0 & \cdots & 0 & 0 \\ -\theta_1 & 1+\theta_1^2 & -\theta_1+\theta_1\theta_2 & -\theta_2 & \cdots & 0 & 0 \\ -\theta_2 & -\theta_1+\theta_1\theta_2 & 1+\theta_1^2+\theta_2^2 & -\theta_1+\theta_1\theta_2 & \cdots & 0 & 0 \\ 0 & -\theta_2 & -\theta_1+\theta_1\theta_2 & 1+\theta_1^2+\theta_2^2 & \cdots & 0 & 0 \\ \vdots & & & & \ddots & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1+\theta_1^2 & -\theta_1 \\ 0 & 0 & 0 & 0 & \cdots & -\theta_1 & 1 \end{bmatrix} \qquad (8.2.45)$$
The corresponding transformation matrix is

$$P = \begin{bmatrix} \sigma_v/\sigma_e & 0 & 0 & 0 & \cdots & 0 & 0 \\ -\rho_1\sqrt{1-\theta_2^2} & \sqrt{1-\theta_2^2} & 0 & 0 & \cdots & 0 & 0 \\ -\theta_2 & -\theta_1 & 1 & 0 & \cdots & 0 & 0 \\ 0 & -\theta_2 & -\theta_1 & 1 & \cdots & 0 & 0 \\ \vdots & & & & \ddots & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & -\theta_1 & 1 \end{bmatrix} \qquad (8.2.46)$$

where

$$\frac{\sigma_v}{\sigma_e} = \left\{\frac{(1+\theta_2)\left[(1-\theta_2)^2 - \theta_1^2\right]}{1-\theta_2}\right\}^{1/2} \qquad (8.2.47)$$

and ρ₁, in terms of θ₁ and θ₂, is given in Equation 8.2.42. Thus, if θ₁ and θ₂ were known, the GLS estimator β̂ = (X'Ψ⁻¹X)⁻¹X'Ψ⁻¹y can be obtained by applying LS to the transformed model y* = X*β + e*, where y* = Py, X* = PX, and e* = Pe. Its mean and covariance matrix are E[β̂] = β and Σ_β̂ = (X'Φ⁻¹X)⁻¹ = σ_v²(X*'X*)⁻¹.
One can gain more insight into the nature of the transformed model by writing out the individual elements of y* = X*β + e* and by examining e*'e*, the sum of squares being minimized. The transformed model is given by

$$(\sigma_v/\sigma_e)y_1 = (\sigma_v/\sigma_e)x_1'\beta + (\sigma_v/\sigma_e)e_1 \qquad (8.2.48)$$

$$\sqrt{1-\theta_2^2}\,(y_2 - \rho_1 y_1) = \sqrt{1-\theta_2^2}\,(x_2 - \rho_1 x_1)'\beta + \sqrt{1-\theta_2^2}\,(e_2 - \rho_1 e_1) \qquad (8.2.49)$$

$$y_t - \theta_1 y_{t-1} - \theta_2 y_{t-2} = (x_t - \theta_1 x_{t-1} - \theta_2 x_{t-2})'\beta + v_t, \qquad t = 3, 4, \ldots, T \qquad (8.2.50)$$

and the sum of squares is

$$e'P'Pe = \left[(\sigma_v/\sigma_e)e_1\right]^2 + \left[\sqrt{1-\theta_2^2}\,(e_2 - \rho_1 e_1)\right]^2 + \sum_{t=3}^{T}(e_t - \theta_1 e_{t-1} - \theta_2 e_{t-2})^2 = e_1^{*2} + e_2^{*2} + \sum_{t=3}^{T} v_t^2 \qquad (8.2.51)$$
As in the AR(1) case, we can obtain an approximate GLS estimator β̂₀ = (X₀*'X₀*)⁻¹X₀*'y₀*, where X₀* = P₀X, y₀* = P₀y, and P₀, of dimension [(T - 2) × T], is given by P with its first two rows deleted. This is the estimator obtained by ignoring e₁*² and e₂*² and minimizing Σ_{t=3}^T v_t². It effectively only uses (T - 2) observations.
Estimates of θ₁ and θ₂ can be based on the sample autocorrelations of the LS residuals,

$$r_s = \frac{\sum_{t=s+1}^{T}\hat e_t\hat e_{t-s}}{\sum_{t=1}^{T}\hat e_t^2}, \qquad s = 1, 2 \qquad (8.2.52)$$

which are estimators for ρ₁ and ρ₂. They can, however, contain considerable small-sample bias, an indication of which is given by Malinvaud (1970, p. 517). Estimators for θ₁ and θ₂ are obtained by solving (8.2.42) and (8.2.43) for θ₁ and θ₂ in terms of ρ₁ and ρ₂ and substituting the sample counterparts. This yields

$$\hat\theta_1 = \frac{r_1(1 - r_2)}{1 - r_1^2} \qquad (8.2.53)$$

$$\hat\theta_2 = \frac{r_2 - r_1^2}{1 - r_1^2} \qquad (8.2.54)$$
Alternatively, θ₁ and θ₂ can be estimated by applying LS to

$$\hat e_t = \theta_1\hat e_{t-1} + \theta_2\hat e_{t-2} + v_t, \qquad t = 3, 4, \ldots, T \qquad (8.2.55)$$

These two procedures are almost identical, differing only through the end terms of the summations of the squares and cross-products of the ê_t's.
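The moment-based estimators (8.2.52) to (8.2.54) translate directly into code; a minimal version (the helper name is our own) is:

```python
import numpy as np

def ar2_from_residuals(e):
    """Estimate theta_1 and theta_2 from LS residuals via (8.2.52)-(8.2.54)."""
    denom = e @ e
    r1 = (e[1:] @ e[:-1]) / denom
    r2 = (e[2:] @ e[:-2]) / denom
    theta1 = r1 * (1.0 - r2) / (1.0 - r1**2)
    theta2 = (r2 - r1**2) / (1.0 - r1**2)
    return theta1, theta2
```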
The values of θ₁ and θ₂ can be restricted to the stationary region, but such a procedure could still be computationally expensive.
8.2.2d ML Estimation

If, in addition to the earlier assumptions, we assume that e_t is normally distributed, we can follow a similar procedure to that outlined in Equations 8.2.22 to 8.2.29 for the AR(1) process. In this case

$$|\Psi^{-1}| = \frac{\sigma_v^2}{\sigma_e^2}(1-\theta_2^2) \qquad (8.2.56)$$

and the concentrated log likelihood, apart from constants, is

$$L(\theta_1,\theta_2,\beta) = -\frac{T}{2}\ln\left(e_1^{*2} + e_2^{*2} + \sum_{t=3}^{T} v_t^2\right) + \frac{1}{2}\ln\left[\frac{\sigma_v^2}{\sigma_e^2}(1-\theta_2^2)\right] \qquad (8.2.57)$$

Nonlinear LS estimates of θ₁, θ₂, and β will differ from the ML ones because of the presence of the second term in (8.2.57). Beach and MacKinnon (1978b) outline an efficient algorithm for maximizing (8.2.57) or, alternatively, we could use one of the procedures outlined in Appendix B. In the context of a distributed lag model with geometrically declining weights, Schmidt (1971) uses a search procedure, searching over θ₁, θ₂, and the "distributed lag parameter."
8.2.2e Properties

Under the conditions in Section 5.5.2 the above estimators for β will be asymptotically normally distributed with mean β and covariance matrix σ_v²(X'P'PX)⁻¹. To use this result to test hypotheses about β, we replace P by P̂ (which is based on one of the above estimators for θ₁ and θ₂) and σ_v² by σ̂_v² = (y - Xβ̂)'P̂'P̂(y - Xβ̂)/(T - K). There is little evidence on which of the above estimators performs best in small samples. However, conventional wisdom [see Beach and MacKinnon (1978b)] suggests the ML estimator is likely to be better provided that the model specification is correct. Also, if y₁ and y₂ are legitimate observations from the AR(2) process, minimizing e₁*² + e₂*² + Σ_{t=3}^T v_t² should be more efficient than minimizing Σ_{t=3}^T v_t².
In the previous section we outlined how the general linear model with AR(1) errors can be extended to that with AR(2) errors. If we have sufficient observations, this extension readily generalizes to errors that follow any finite-order AR process. However, the expression for the transformation matrix P becomes progressively more complicated. A general expression for Ψ⁻¹ is given by Wise (1955), while Fuller (1976, p. 423) defines the required P matrix. Pagan (1974) gives details of how the Gauss-Newton algorithm can be used to minimize Σ_{t=p+1}^T v_t², where p is the order of the process, and Pagan and Byron (1978) summarize the various modifications of NLS and ML estimation. See also Harvey (1981a, pp. 203-206).
Thomas and Wallis (1971) suggest that when quarterly data are being used, a fourth-order process may be appropriate. However, instead of a general fourth-order process, they suggest that only the disturbances in corresponding quarters of each year should be correlated. This leads to the specification

$$e_t = \rho e_{t-4} + v_t \qquad (8.2.58)$$

with covariance matrix

$$E[ee'] = \sigma_v^2\Psi_4 = \sigma_v^2(\Psi_1 \otimes I_4) \qquad (8.2.59)$$

where Ψ₁ is of dimension (T/4 × T/4) and has the same structure as Ψ obtained from an AR(1) process. See Equation 8.2.10. The structure of Ψ₄ is such that E[e_te_{t-s}] = 0 unless s is a multiple of 4, in which case it is σ_e²ρ^{s/4}. Also, σ_e² = σ_v²/(1 - ρ²).
The appropriate matrix for transforming the data is P₄ = P₁ ⊗ I₄, where P₁ is the ((T/4) × (T/4)) matrix defined by P in (8.2.12), and the transformed elements of y* = P₄y, for example, are y_t* = √(1 - ρ²) y_t for t = 1, 2, 3, 4 and y_t* = y_t - ρy_{t-4} for t = 5, 6, ..., T. The sum of squares is

$$e'P_4'P_4e = e^{*\prime}e^* = \sum_{t=1}^{4} e_t^{*2} + \sum_{t=5}^{T} v_t^2 \qquad (8.2.62)$$

where e_t* = √(1 - ρ²) e_t, t = 1, 2, 3, 4, and v_t = e_t - ρe_{t-4}, t = 5, 6, ..., T.
Under normality, the concentrated log likelihood corresponding to (8.2.28) is

$$L(\rho,\beta) = -\frac{T}{2}\ln(y^* - X^*\beta)'(y^* - X^*\beta) + 2\ln(1-\rho^2) \qquad (8.2.63)$$
8.2.5 MA(1) Errors

Consider now the general linear model with MA(1) errors,

$$y_t = x_t'\beta + e_t \qquad (8.2.64)$$

and

$$e_t = v_t + \alpha v_{t-1} \qquad (8.2.65)$$

where E[v_t] = 0, E[v_t²] = σ_v², E[v_tv_s] = 0 for t ≠ s, and we assume the process is invertible, that is, |α| < 1. The invertibility assumption means that the MA process can be alternatively written as an infinite order AR process, and it must hold if the ML and various approximate NLS estimators (described below) are to have the same asymptotic distribution. It should be kept in mind, however, that α = 1 is a legitimate MA model that may arise in practice when overdifferencing has taken place. For further details see Chapter 7, Davidson (1981), and Harvey (1981b, Chapter 5 and 1981c).
The covariance matrix for e is E[ee'] = σ_v²Ψ, where σ_e² = σ_v²(1 + α²); the autocorrelations are ρ₁ = α/(1 + α²) and ρ_s = 0 for s > 1; and

$$\Psi = \begin{bmatrix} 1+\alpha^2 & \alpha & 0 & \cdots & 0 & 0 \\ \alpha & 1+\alpha^2 & \alpha & \cdots & 0 & 0 \\ 0 & \alpha & 1+\alpha^2 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 1+\alpha^2 & \alpha \\ 0 & 0 & 0 & \cdots & \alpha & 1+\alpha^2 \end{bmatrix} \qquad (8.2.66)$$
For known α the GLS estimator, which minimizes (y - Xβ)'Ψ⁻¹(y - Xβ) = e'Ψ⁻¹e, is given by β̂ = (X'Ψ⁻¹X)⁻¹X'Ψ⁻¹y.
To apply GLS we require a transformation matrix P such that P'P = Ψ⁻¹. Define

$$a_s = 1 + \alpha^2 + \alpha^4 + \cdots + \alpha^{2s} \qquad (8.2.67)$$

and

$$Q = \begin{bmatrix} 1 & 0 & 0 & 0 & \cdots \\ c & a_1 & 0 & 0 & \cdots \\ c^2 & a_1c & a_2 & 0 & \cdots \\ c^3 & a_1c^2 & a_2c & a_3 & \cdots \\ \vdots & & & & \ddots \end{bmatrix} \qquad (8.2.68)$$

where c = -α. Then, with

$$\Lambda = \mathrm{diag}(a_0a_1,\ a_1a_2,\ \ldots,\ a_{T-1}a_T) \qquad (8.2.69)$$

it can be shown that the required transformation matrix is

$$P = \Lambda^{-1/2}Q \qquad (8.2.70)$$
The transformed observations y* = Py and X* = PX can be computed recursively; for the residuals e* = Pe, for example,

$$e_1^* = \left(\frac{a_0}{a_1}\right)^{1/2} e_1 \qquad (8.2.72)$$

$$e_t^* = \left(\frac{a_{t-1}}{a_t}\right)^{1/2} e_t - \alpha\left(\frac{a_{t-2}}{a_t}\right)^{1/2} e_{t-1}^*, \qquad t = 2, 3, \ldots, T \qquad (8.2.73)$$

Also, the determinant of Ψ is

$$|\Psi| = a_T \qquad (8.2.74)$$

and so ML estimators for β and α (see Equation 8.2.6) are obtained by numerically minimizing

$$S_L(\beta,\alpha) = |\Psi|^{1/T}e^{*\prime}e^* = a_T^{1/T}\sum_{t=1}^{T} e_t^{*2} \qquad (8.2.75)$$
8.2.5b Approximations
Instead of finding estimators that minimize S(β,α) = e*'e* or S_L(β,α) = |Ψ|^{1/T}e*'e*, approximate procedures with fewer computations are frequently used. In particular, a common alternative is to replace the transformed observations implied by (8.2.72) with

$$\tilde y_t = y_t - \alpha\tilde y_{t-1}, \qquad \tilde x_t = x_t - \alpha\tilde x_{t-1}, \qquad t = 1, 2, \ldots, T \qquad (8.2.76)$$

where ỹ₀ = 0 and x̃₀ = 0. This is equivalent to transforming the data with the matrix

$$P_0 = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ c & 1 & 0 & \cdots & 0 \\ c^2 & c & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ c^{T-1} & c^{T-2} & c^{T-3} & \cdots & 1 \end{bmatrix} \qquad (8.2.77)$$
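A sketch of conditional LS for the MA(1) model is given below, using the P₀ recursion of (8.2.76)-(8.2.77) with β concentrated out by LS for each trial α. The function names and the use of SciPy's bounded scalar minimizer are our own choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def _p0_transform(Z, alpha):
    """Apply the P_0 transformation of (8.2.77): z*_t = z_t - alpha * z*_{t-1}."""
    Z_s = np.array(Z, dtype=float)
    for t in range(1, len(Z_s)):
        Z_s[t] = Z_s[t] - alpha * Z_s[t - 1]
    return Z_s

def ma1_conditional_ls(y, X):
    """Conditional LS for the MA(1) error model: scalar search over alpha,
    with beta concentrated out by LS on the transformed data."""
    def rss(alpha):
        y_s, X_s = _p0_transform(y, alpha), _p0_transform(X, alpha)
        beta = np.linalg.lstsq(X_s, y_s, rcond=None)[0]
        return np.sum((y_s - X_s @ beta) ** 2)
    res = minimize_scalar(rss, bounds=(-0.999, 0.999), method="bounded")
    alpha = res.x
    y_s, X_s = _p0_transform(y, alpha), _p0_transform(X, alpha)
    return np.linalg.lstsq(X_s, y_s, rcond=None)[0], alpha
```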
8.2.5c Properties

In terms of asymptotic properties there is no basis for choice between the conditional and unconditional LS and ML estimators. Under certain conditions, given by Pierce (1971), these estimators will all have an asymptotic normal distribution, will be consistent, and will have a covariance matrix that can be estimated using the inverse of the matrix of second derivatives of the log likelihood function, or the approximation being used in the NLS or ML algorithm. Based on these estimates, approximate tests on the coefficients can be carried out in the usual way.
The small sample properties of the various estimators have been investigated using both Monte Carlo experiments [see, for example, Dent and Min (1978), Ansley and Newbold (1980), and Harvey's review (1981b, pp. 135-139)] and the analytical properties of the criterion functions [see, for example, Davidson (1981) and Osborn (1982)]. Most of these studies are concerned with the properties of estimators for α in the pure time series model where β = 0, rather than with the properties of various estimators for β. The main results can be summarized as follows.
1. The equivalent of the likelihood function, S_L, will always have a global minimum within the region -1 ≤ α ≤ 1 and necessarily has stationary points at α = ±1. For small sample sizes, one of these stationary points is frequently the global minimum. Thus, in any given study, there is a high probability that α̂ = ±1 will be the maximum likelihood estimator for α [Davidson (1981), Cryer and Ledolter (1981)], and this result is true even if the true value of |α| is not close to one (say less than .8). However, the probability is greater the closer |α| is to unity and the smaller the sample size. The results of Sargan and Bhargava (1983b) and Griffiths and Beesley (1984) confirm that the probability will also be high when the model has an X component (i.e., when β ≠ 0). In general the distribution of the ML estimator α̂ tends to be bimodal, with a well-behaved distribution in the interval (-1, 1) and spikes at ±1.
2. The conditional function S₁ will always have a minimum within the region -1 < α < 1, but the least squares functions S and S* do not necessarily possess unconstrained minima within the corresponding closed interval -1 ≤ α ≤ 1 [Davidson (1981)]. In terms of the expected values of the criterion functions, it can be shown [Osborn (1982)] that E[S₁] is the only expected function that has a stationary point at the true value of α, that E[S*] always has a minimum within the region -1 < α < 1, and that E[S] frequently has no unconstrained minimum within this region, particularly if T is small and |α| close to unity.
3. If we consider the values of α that minimize the expected criterion functions E[S], E[S*], and E[S_L], it can be shown [Osborn (1982)] that the conditional LS minimizing value will be between zero and the true value of α, while the unconditional LS minimizing value will be between the true value of α and the invertibility boundary (or it will be a constrained minimum on the boundary). These results are not surprising given the nature of the functions discussed in the first two points, and they go a long way toward explaining the results of many Monte Carlo studies. In particular, most Monte Carlo studies have found that unconditional LS leads to more (constrained) coefficient estimates at ±1 than does ML, which in turn gives more boundary estimates than conditional LS. Conditional ML has no boundary estimates. Consequently, when the various estimators are evaluated in terms of their mean square errors [Davidson (1981), Dent and Min (1978)], conditional ML and conditional LS are preferable for true values of α away from the invertibility boundary, while ML performs best for values of α approaching ±1. Unconditional LS is best for extreme cases near ±1 but otherwise it is worse, and it does not appear to be recommended by any authors.
4. If σ̂²_ML, σ̂²_ULS, and σ̂²_CLS represent the alternative estimators for σ_v² obtained from ML, unconditional LS, and conditional LS, respectively, then it can be shown that σ̂²_ML ≥ σ̂²_ULS and σ̂²_CLS ≥ σ̂²_ULS. Furthermore, on average we would expect σ̂²_CLS ≥ σ̂²_ML ≥ σ̂²_ULS. See Osborn (1982). Thus, it is reasonable to hypothesize that conditional LS will lead to overstatement of the variance, while unconditional LS will underestimate the variance.
5. Because of the prevalence of boundary estimates for α when the sample size is small, extreme caution must be exercised when making inferences about α.
A question that is not addressed in the above results is the relative performance of the various estimators in terms of efficiency of estimating β when the model contains an X component and β ≠ 0. In this case the unconditional LS and ML estimators for β will be of the form β̂ = (X'Ψ̂⁻¹X)⁻¹X'Ψ̂⁻¹y, while the conditional LS and ML estimators will be of the form β̂₀ = (X'P̂₀'P̂₀X)⁻¹X'P̂₀'P̂₀y. Since β̂₀ is based on an approximate transformation rather than the more efficient GLS transformation, there is likely to be a loss of efficiency from estimating β via one of the conditional procedures, and this loss of efficiency will not necessarily be reflected in the relative efficiencies of the various estimators for α. In view of this fact, and that in some circumstances with known α the approximation can be considerably worse than GLS [Balestra (1980), Park and Heikes (1983)], we can conclude that in terms of estimating β, unconditional LS and ML are likely to be preferable to the conditional estimators. Combining this conclusion with the results on relative efficiency of the various estimators for α (see point 3 above), it seems reasonable to recommend the unconditional ML estimator over the other three alternatives. One strategy that has not been investigated but that may prove even better is to use the fully transformed estimator β̂ = (X'Ψ̂⁻¹X)⁻¹X'Ψ̂⁻¹y, with Ψ̂ based on either a conditional LS or conditional ML estimate of α.
A simple alternative estimator for α can be obtained by substituting the sample autocorrelation r₁ for ρ₁ in ρ₁ = α/(1 + α²) and solving for the invertible root, which gives

$$\hat\alpha = \frac{1 - (1 - 4r_1^2)^{1/2}}{2r_1} \qquad (8.2.78)$$

Note that α̂ is only meaningful if |r₁| < .5. The invertibility condition requires |α| < 1, which in turn implies |ρ₁| < .5. This estimator is easy to construct but, because it ignores information contained in the other sample autocorrelations, it is inefficient relative to the NLS and ML estimators. Because of this inefficiency, in this section we have concentrated on the NLS and ML techniques. For further information on the estimator in (8.2.78) and other related results see Fuller (1976, pp. 343-356), Durbin (1959), McClave (1973), and Mentz (1977).
For a Bayesian analysis of the MA(1) error model, consider the prior density

$$g(\beta,\alpha,\sigma_v) \propto \frac{1}{\sigma_v}, \qquad |\alpha| < 1 \qquad (8.2.79)$$
This prior combines the commonly used diffuse one g(β,σ_v) ∝ 1/σ_v with the assumption that α is a priori independent of β and σ_v, with prior g(α) = ½, |α| < 1. One might object to calling the prior g(α) = ½ uninformative, in which case an alternative prior, such as one based on the square root of the determinant of the information matrix [e.g., Zellner (1971, p. 47)], could be used. The latter would add an additional term to the following results, but it would not change the essential features of the analysis.
Combining (8.2.79) with the likelihood and integrating out σ_v yields the joint posterior density

$$g(\beta,\alpha|y) \propto \left[\frac{1-\alpha^{2T+2}}{1-\alpha^2}\right]^{-1/2}(\mathrm{RSS})^{-T/2}\left[1 + \frac{(\beta - \hat\beta)'X^{*\prime}X^*(\beta - \hat\beta)}{\mathrm{RSS}}\right]^{-T/2} \qquad (8.2.80)$$
Integrating the elements of β other than β_k out of (8.2.80) yields

$$g(\beta_k,\alpha|y) \propto \left[\frac{1-\alpha^{2T+2}}{1-\alpha^2}\right]^{-1/2}(\mathrm{RSS})^{-(T-K+1)/2}|X^{*\prime}X^*|^{-1/2} c_{kk}^{-1/2}\left[1 + \frac{(\beta_k - \hat\beta_k)^2}{c_{kk}(\mathrm{RSS})}\right]^{-(T-K+1)/2} \qquad (8.2.81)$$

where c_kk is the kth diagonal element of (X*'X*)⁻¹ and β̂_k is the kth element of β̂. Information about g(β_k|y) can be obtained by applying bivariate numerical integration to (8.2.81).
To obtain information about α, we can integrate β_k out of (8.2.81) and then apply univariate numerical integration to

$$g(\alpha|y) \propto \left[\frac{1-\alpha^{2T+2}}{1-\alpha^2}\right]^{-1/2}(\mathrm{RSS})^{-(T-K)/2}|X^{*\prime}X^*|^{-1/2} \qquad (8.2.82)$$
8.2.6 MA(q) Errors

We now consider estimation of the parameters of the general linear model with qth order MA errors. The first-order model that we discussed in the previous section could, of course, also be considered within this framework. One difference is that, in the MA(1) case, it is possible to derive more explicit closed form expressions for the various criterion functions. The general model can be written as

$$y_t = x_t'\beta + e_t \qquad (8.2.83)$$

$$e_t = v_t + \sum_{j=1}^{q}\alpha_j v_{t-j} \qquad (8.2.84)$$

where E[v_t] = 0, E[v_t²] = σ_v², E[v_tv_s] = 0 for t ≠ s, and we assume the α_j's are such that the invertibility condition holds. See Chapter 7. The covariance matrix E[ee'] = Φ = σ_v²Ψ is such that σ_e² = (1 + α₁² + α₂² + ... + α_q²)σ_v², and the autocorrelations are given by

$$\rho_s = \frac{\alpha_s + \alpha_1\alpha_{s+1} + \cdots + \alpha_{q-s}\alpha_q}{1 + \alpha_1^2 + \cdots + \alpha_q^2}, \quad s = 1, 2, \ldots, q; \qquad \rho_s = 0, \quad s > q \qquad (8.2.85)$$
Following Pagan and Nicholls (1976), who were influenced by Phillips (1966), we can write (8.2.84) in matrix notation as

$$e = Mv + \bar M\bar v \qquad (8.2.86)$$

where v = (v₁, ..., v_T)', v̄ = (v₀, v₋₁, ..., v_{1-q})',

$$M_{(T\times T)} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ \alpha_1 & 1 & 0 & \cdots & 0 \\ \alpha_2 & \alpha_1 & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & \cdots & \alpha_q & \cdots & 1 \end{bmatrix} \qquad (8.2.87)$$

and

$$\bar M_{(T\times q)} = \begin{bmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_q \\ \alpha_2 & \alpha_3 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ \alpha_q & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \qquad (8.2.88)$$
From (8.2.86) the covariance matrix σ_v²Ψ can be written as

$$\sigma_v^2\Psi = \sigma_v^2(MM' + \bar M\bar M') \qquad (8.2.89)$$

and nonlinear LS estimates can be found by minimizing the sum of squares

$$S = v'v + \bar v'\bar v \qquad (8.2.90)$$

with respect to β, α, and v̄. It can be shown [Pagan and Nicholls (1976)] that this minimization problem is equivalent to minimizing e'Ψ⁻¹e with respect to β and α; and it is convenient computationally because the successive residuals in the sum of squares can be calculated recursively as

$$v_1 = y_1 - x_1'\beta - \sum_{j=1}^{q}\alpha_j\bar v_{1-j}$$

$$v_t = y_t - x_t'\beta - \sum_{j=1}^{t-1}\alpha_j v_{t-j} - \sum_{j=t}^{q}\alpha_j\bar v_{t-j}, \qquad t = 2, 3, \ldots, q$$

$$v_t = y_t - x_t'\beta - \sum_{j=1}^{q}\alpha_j v_{t-j}, \qquad t = q+1, q+2, \ldots, T \qquad (8.2.91)$$
so that the complete model is

$$y = X\beta + Mv + \bar M\bar v \qquad (8.2.92)$$
Substituting the ML estimator σ̂_v² = v'v/T into (8.2.94) yields a concentrated log likelihood function which, ignoring constants, is L = -(T/2) ln(v'v). Thus minimizing v'v with respect to β, α, and v̄ yields, conditional on v̄, ML estimators for β and α.
Instead of minimizing (8.2.90), exact NLS estimates can also be found by minimizing an alternative function that does not depend on the presample disturbances v̄ and that is based on a simplification of e'Ψ⁻¹e. To derive this simplification we note that, from (8.2.89),

$$\Psi = MM' + \bar M\bar M' = M(I_T + CC')M' \qquad (8.2.95)$$

$$\Psi^{-1} = (M')^{-1}\left[I_T - C(I_q + C'C)^{-1}C'\right]M^{-1} \qquad (8.2.96)$$

and hence

$$e'\Psi^{-1}e = \tilde e'\tilde e - \tilde e'C(I_q + C'C)^{-1}C'\tilde e \qquad (8.2.97)$$

where ẽ = M⁻¹e and C = M⁻¹M̄. For computing NLS estimates for β and α from (8.2.97) we note that, since M is a lower triangular matrix, ẽ = M⁻¹e and C = M⁻¹M̄ can be readily calculated recursively; in addition the inverse (I_q + C'C)⁻¹ is only of order q. The approximate procedure of setting v̄ = 0 and minimizing v'v in (8.2.90) is equivalent to ignoring the second term in (8.2.97) and minimizing ẽ'ẽ. It is often called conditional least squares.
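The computational convenience of (8.2.97) is easy to demonstrate. The sketch below (our own code, with M applied implicitly through its banded recursion rather than formed explicitly) evaluates the exact sum of squares e'Ψ⁻¹e while inverting only a (q × q) matrix.

```python
import numpy as np

def ma_q_exact_ss(e, alpha):
    """Exact e' Psi^{-1} e for MA(q) errors via (8.2.97).

    e     : vector of regression disturbances (or residuals), length T
    alpha : array (alpha_1, ..., alpha_q)"""
    q, T = len(alpha), len(e)

    def m_inv(B):
        # Solve M Z = B recursively, exploiting the banded lower-triangular M
        Z = np.array(B, dtype=float)
        for t in range(1, len(Z)):
            for j in range(1, min(q, t) + 1):
                Z[t] -= alpha[j - 1] * Z[t - j]
        return Z

    e_tilde = m_inv(e)                    # e-tilde = M^{-1} e
    Mbar = np.zeros((T, q))               # coefficients on presample v's (8.2.88)
    for t in range(q):
        Mbar[t, : q - t] = alpha[t:]      # row t+1: (alpha_{t+1}, ..., alpha_q, 0, ...)
    C = m_inv(Mbar)                       # C = M^{-1} M-bar
    A = np.eye(q) + C.T @ C               # only a (q x q) matrix to invert
    b = C.T @ e_tilde
    return e_tilde @ e_tilde - b @ np.linalg.solve(A, b)
```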
To derive another form of (8.2.97), which frequently appears in the literature and which is useful for illustrating the relationship between (8.2.90) and (8.2.97), we rewrite (8.2.89) as

(8.2.98)

and note that the minimum mean square error predictor for v̄ given e is

$$\hat{\bar v} = E[\bar v|e] = -(Z'Z)^{-1}Z'We \qquad (8.2.100)$$

The link between minimizing v̄'v̄ + v'v and e'Ψ⁻¹e can be made by replacing v̄ by v̄̂. This yields

(8.2.101)
(8.2.102)

Note that, if P = D^{-1/2}L⁻¹, this procedure is equivalent to finding P such that P'P = Ψ⁻¹. The Kalman filter algorithm [Harvey and Phillips (1979)] can be viewed as a systematic framework for the decomposition Ψ = LDL'.
For maximum likelihood estimation under the assumption of normally distributed disturbances, values of β and α that minimize |Ψ|^{1/T}e'Ψ⁻¹e need to be found. In this procedure the sum of squares function can be computed using any of the above methods (or an approximation), while the determinant |Ψ| can be calculated from

$$|\Psi| = |I_q + C'C| \qquad (8.2.103)$$
8.2.6b Properties
Much of the discussion on estimator properties for the MA(1) model also holds
for the general case, but many of the analytical results no longer necessarily
hold in small samples. The main conclusions can be summarized as follows.
1. Under appropriate conditions the alternative estimators will all have the
same desirable asymptotic properties as the maximum likelihood esti-
mator.
2. The likelihood function possesses at least two stationary points on the invertibility boundary, and hence, in small samples, it is quite likely that the ML estimates will occur at the boundary [Davidson (1981)].
3. Unconditional LS appears to lead to a large number of boundary estimates.
4. In Monte Carlo experiments using a model with no X component, condi-
tional LS and (unconditional) ML perform best, with ML being better for
parameter values close to the invertibility boundary [e.g., Ansley and
Newbold (1980)].
5. If we consider efficiency in the estimation of α, and the fact that an approximate procedure (such as conditional LS) may lead to an efficiency loss in the estimation of β, the preferred estimation technique is ML.
8.2.7 ARMA Errors

Consider now the general linear model y_t = x_t'β + e_t in which the errors follow the ARMA(p,q) process

$$e_t = \sum_{i=1}^{p}\theta_i e_{t-i} + v_t + \sum_{j=1}^{q}\alpha_j v_{t-j} \qquad (8.2.104)$$

where E[v_t] = 0, E[v_t²] = σ_v², E[v_tv_s] = 0 for t ≠ s, and β, θ = (θ₁, θ₂, ..., θ_p)', α = (α₁, α₂, ..., α_q)', and σ_v² are parameters to be estimated. We assume that the process is stationary and that the invertibility condition holds. Let the covariance matrix for e be given by E[ee'] = Φ = σ_v²Ψ. The structure of Ψ is a
rather complex one. See Box and Jenkins (1970, p. 74) for the general case. As an example, for the ARMA(1,1) case, e_t = θe_{t-1} + v_t + αv_{t-1}, the elements of Ψ are given by

$$\sigma_e^2 = \frac{\sigma_v^2(1 + \alpha^2 + 2\theta\alpha)}{1-\theta^2}, \qquad \rho_1 = \frac{(1+\theta\alpha)(\theta+\alpha)}{1+\alpha^2+2\theta\alpha}, \qquad \rho_s = \theta\rho_{s-1}, \quad s \geq 2 \qquad (8.2.105)$$
Given values for β, θ, and α, the disturbances v_t can be calculated recursively from

$$v_t = e_t - \sum_{i=1}^{p}\theta_i e_{t-i} - \sum_{j=1}^{q}\alpha_j v_{t-j} \qquad (8.2.106)$$

and e_t = y_t - x_t'β, provided that we are given values of ē' = (e₀, e₋₁, ..., e_{1-p}) and v̄' = (v₀, v₋₁, ..., v_{1-q}). Some asymptotically equivalent treatments of ē and v̄ [Pierce (1971)] are

1. Minimize Σ_{t=1}^T v_t² with ē and v̄ set equal to 0.
2. Minimize Σ_{t=p+1}^T v_t² with v_p, v_{p-1}, ..., v_{p-q+1} all set equal to zero; since the e_t for the early observations are then available, this permits calculation of the v_t for t = p+1, ..., T.
3. Minimize Σ_{t=1}^T v_t² with respect to β, θ, α, ē, and v̄, the last two vectors being treated as unknown parameters.
Alternatively, we can write

$$De = \bar D\bar e + Mv + \bar M\bar v = Mv + N\bar w \qquad (8.2.107)$$

where M and M̄ are as defined in Equations 8.2.87 and 8.2.88; D has the same structure as M except that the α_j's are replaced by -θ_i's and there are p (not q) of them; D̄ has the same structure as M̄ with the α_j's replaced by +θ_i's; N = (M̄, D̄); and w̄' = (v̄', ē'). If we use (8.2.107), the covariance matrix σ_v²Ψ can be written as

$$\sigma_v^2\Psi = \sigma_v^2 D^{-1}(MM' + N\Omega N')(D')^{-1} \qquad (8.2.108)$$

where E[w̄w̄'] = σ_v²Ω and w̄ and v are uncorrelated. The exact structure of Ω will depend on the order of the process; see Newbold (1974) for the ARMA(1,1) case.
Extending the work of Pagan and Nicholls (1976), we can show that minimizing

$$v'v + \bar w'\Omega^{-1}\bar w \qquad (8.2.109)$$

with respect to β, θ, α, and w̄ is equivalent to minimizing e'Ψ⁻¹e with respect to β, θ, and α. We still need to find a matrix H such that H'H = Ω⁻¹, but since Ω is only of order (p + q), this is a considerably easier task than finding P such that P'P = Ψ⁻¹. See Newbold (1974) for an example.
For an alternative procedure that eliminates the presample disturbances w̄ and involves factorization of a matrix that is only of order max(p,q), we proceed as we did for the MA case and derive a simplified expression for e'Ψ⁻¹e. [See Ali (1977), Ljung and Box (1979), Kohn (1981), and a correction to Ali's paper made by Ansley (1979).] Working in this direction, we first note that, if p ≥ q, we can write

$$N = (\bar M, \bar D) = \begin{bmatrix} \bar M^* & \bar D^* \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} N^* \\ 0 \end{bmatrix} \qquad (8.2.110)$$

where N* contains the first p rows of N, and hence

$$N\Omega N' = \begin{bmatrix} N^*\Omega N^{*\prime} & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} S \\ 0 \end{bmatrix}\begin{pmatrix} S' & 0 \end{pmatrix} = BB' \qquad (8.2.111)$$

where S is a (p × p) matrix such that SS' = N*ΩN*' and B' = (S', 0).
(8.2.114)
(8.2.116)
$$y^* = X^*\beta + e^* \qquad (8.2.117)$$

where y* = Zy, X* = ZX, e* = Ze, and Z is the (T × T) orthogonal matrix with elements

$$z_{ts} = \begin{cases} T^{-1/2}, & t = 1 \\ (2/T)^{1/2}\cos[\pi t(s-1)/T], & t = 2, 4, 6, \ldots, T-2 \ (\text{or } T-1 \text{ if } T \text{ is odd}) \\ (2/T)^{1/2}\sin[\pi(t-1)(s-1)/T], & t = 3, 5, 7, \ldots, T-1 \ (\text{or } T \text{ if } T \text{ is odd}) \\ T^{-1/2}(-1)^{s+1}, & t = T \ (\text{if } T \text{ is even}) \end{cases} \qquad (8.2.118)$$
estimator can be obtained via weighted least squares. See Chapter 11. When the f_e(ω_j) are unknown (which, of course, will usually be the case) consistent estimates can be conveniently found from the LS residuals obtained from the transformed model (8.2.117). See Harvey (1978, 1981b) for details. The EGLS estimator β̂ = (X*'Ω̂*⁻¹X*)⁻¹X*'Ω̂*⁻¹y*, with Ω̂* based on the estimates of f_e(ω_j), will be consistent and asymptotically efficient. The small-sample properties of this estimator, however, are not outstanding. This fact is not surprising since the estimator does not utilize information on the form of the autocorrelated error process. Relevant Monte Carlo evidence has been provided by Engle and Gardner (1976) and Gallant and Goebel (1976), who find that, in many circumstances, it is better to use a time domain estimator that makes an incorrect assumption about the form of autocorrelation than to employ the frequency domain estimator.
Regression in the frequency domain has been suggested for purposes other than correcting for autocorrelation. In particular, if E[ee'] = σ²I, then E[e*e*'] = σ²I, and a restricted estimator that omits transformed observations corresponding to some of the frequencies has been suggested. This technique is known as band spectrum regression. For further details as well as reasons for its implementation, see Engle (1974a, 1976) and Harvey (1978, 1981b).
8.3 PREDICTION

Frequently, the general linear model is used not only to obtain information about unknown parameters, but also to predict future values of the dependent variable. In this section we outline a framework for prediction and give "best linear unbiased predictors" for some specific cases.
The model can be written as
$$y = X\beta + e \qquad (8.3.1)$$

where E[e] = 0, E[ee'] = Φ = σ_v²Ψ, the usual definitions hold, and we emphasize that y is a (T × 1) vector of past observations on the dependent variable. In general, the elements in e are assumed to be autocorrelated.
The problem is to find an appropriate predictor for N future observations given by

$$\bar y = \bar X\beta + \bar e \qquad (8.3.2)$$

where the future disturbances ē have covariance matrix E[ēē'] = σ_v²Ψ̄ and are correlated with the sample disturbances through

$$E[e\bar e'] = \sigma_v^2 V \qquad (8.3.3)$$

If the variances and covariances are known, the best linear unbiased predictor (BLUP) for ȳ is [Goldberger (1962) and Theil (1971, p. 280)]

$$\hat{\bar y} = \bar X\hat\beta + V'\Psi^{-1}(y - X\hat\beta) \qquad (8.3.4)$$

where β̂ = (X'Ψ⁻¹X)⁻¹X'Ψ⁻¹y is the GLS estimator, and its mean square error matrix is

$$E\left[(\hat{\bar y} - \bar y)(\hat{\bar y} - \bar y)'\right] = \sigma_v^2\left[\bar\Psi - V'\Psi^{-1}V + (\bar X - V'\Psi^{-1}X)C(\bar X - V'\Psi^{-1}X)'\right] \qquad (8.3.5)$$

where C = (X'Ψ⁻¹X)⁻¹.

Note that the BLUP in (8.3.4) is made up of two components. The first, X̄β̂, is an estimate of the nonstochastic component of the future observations and the second, V'Ψ⁻¹(y - Xβ̂), is an estimate of the future disturbances. This latter estimate is based on the sample residuals and the covariance between the past and future disturbances.
8.3.1 AR Errors

For the AR(1) case, where e_t = ρe_{t-1} + v_t and Ψ is given by Equation 8.2.10, the tth element of ȳ̂ reduces to

$$\hat{\bar y}_{T+t} = \bar x_{T+t}'\hat\beta + \rho^t(y_T - x_T'\hat\beta) \qquad (8.3.6)$$

where x̄'_{T+t} is the tth row of X̄ and x_T' is the last row of X. In this case, the last residual in the sample is the only one that contains information about the future disturbances and, because of the term ρ^t, the importance of this information declines as we predict further into the future.
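In code, the AR(1) predictor (8.3.6) amounts to carrying the last residual forward with geometrically declining weight; a minimal sketch (the function name is our own) is:

```python
import numpy as np

def ar1_blup(y, X, X_future, beta_hat, rho):
    """Best linear unbiased predictions for AR(1) errors, Equation 8.3.6:
    yhat_{T+t} = xbar_{T+t}' beta + rho^t * (y_T - x_T' beta)."""
    e_T = y[-1] - X[-1] @ beta_hat            # last sample residual
    t = np.arange(1, len(X_future) + 1)
    return X_future @ beta_hat + rho**t * e_T
```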
For a model with AR(2) errors, e_t = θ₁e_{t-1} + θ₂e_{t-2} + v_t, the last two sample residuals contain relevant information and the BLUP for the (T + 1)th observation is given by

$$\hat{\bar y}_{T+1} = \bar x_{T+1}'\hat\beta + \theta_1\hat e_T + \theta_2\hat e_{T-1} \qquad (8.3.7)$$

where ê_t = y_t - x_t'β̂. For predictions more than one period into the future, the second component gets progressively more complicated, but we can simplify matters by defining ŷ_{T+t} = x̄'_{T+t}β̂ + ê_{T+t}, where the ê_{T+t} are computed recursively from ê_{T+t} = θ₁ê_{T+t-1} + θ₂ê_{T+t-2}, starting from the sample residuals ê_T and ê_{T-1}.
8.3.2 MA Errors

When we have a model with MA(1) errors, e_t = v_t + αv_{t-1}, the e_t's more than one period apart are uncorrelated, so the sample residuals do not contain any information about disturbances more than one period into the future. Consequently,

$$\hat{\bar y}_{T+t} = \bar x_{T+t}'\hat\beta \qquad \text{for } t > 1 \qquad (8.3.9)$$

In terms of (8.3.4) this occurs because only the first row in V' contains nonzero elements. However, for predicting e_{T+1}, all the sample residuals contain some information. Using the definitions for P, a_t, y*, and X* given in Section 8.2.5a, we have

$$\hat{\bar y}_{T+1} = \bar x_{T+1}'\hat\beta + \alpha\left(\frac{a_{T-1}}{a_T}\right)^{1/2}\hat e_T^{\,*} \qquad (8.3.10)$$

where ê_T* is the last sample residual from the transformed model. Note, from Equation 8.2.72, that it will be a function of all the untransformed residuals ê = y - Xβ̂.
An alternative procedure that approximates the BLUP and is less demanding computationally is to set v̂₀ = 0, then recursively obtain

$$\hat v_t = \hat e_t - \alpha\hat v_{t-1}, \qquad t = 1, 2, \ldots, T \qquad (8.3.11)$$

and use the predictor

$$\hat{\bar y}_{T+1} = \bar x_{T+1}'\hat\beta + \alpha\hat v_T \qquad (8.3.12)$$
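The approximate MA(1) predictor is equally simple to implement; the following sketch (names are our own) runs the recursion (8.3.11) and applies (8.3.12).

```python
import numpy as np

def ma1_approx_prediction(y, X, x_future, beta_hat, alpha):
    """Approximate BLUP for MA(1) errors: set vhat_0 = 0, run the recursion
    vhat_t = ehat_t - alpha * vhat_{t-1} of (8.3.11), then apply (8.3.12)."""
    e = y - X @ beta_hat
    v = 0.0
    for t in range(len(e)):
        v = e[t] - alpha * v
    return x_future @ beta_hat + alpha * v   # predictions beyond T+1 use x'beta only
```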
For MA errors of higher order and ARMA errors this approximation readily generalizes, and for most practical purposes it is sufficiently accurate.

(8.3.14)

We can view the last term in (8.3.14) as the MSE contribution attributable to estimation of ρ. The situation is not always this simple, however. For further details on the asymptotic MSE for predictors with estimated parameters, as well as some finite sample results and some results on misspecification, we refer the reader to Ansley and Newbold (1981), Baillie (1979, 1980), Bhansali (1981), Fuller and Hasza (1980, 1981), Phillips (1979), Spitzer and Baillie (1983), and Yamamoto (1976, 1979). Nicholls and Pagan (1982) build on work by Salkever (1976) and Fuller (1980) to show how predictions and their standard errors can be obtained with standard econometric computer packages.
8.4 TESTING FOR AUTOCORRELATION

One way of obtaining information about the type of error process is to calculate the sample autocorrelations of the LS residuals,

$$r_s = \frac{\sum_{t=s+1}^{T}\hat e_t\hat e_{t-s}}{\sum_{t=1}^{T}\hat e_t^2}$$

where the ê_t are the LS residuals ê_t = y_t - x_t'b, and then use these to make some judgment about the ρ_s, and hence the type of process. The legitimacy of such an approach depends on the condition that each ê_t converges in probability to the corresponding e_t. Unfortunately, in small samples the ê_t will be correlated even if the e_t are not; also, the r_s may exhibit substantial small-sample bias [Malinvaud (1970, p. 519)]. Nevertheless, if the sample is sufficiently large, they may yield some useful information.
An approximate expression for the variance of the r_s has been provided by Bartlett (1946) and, if ρ_s is essentially zero for lags s greater than some value p, this expression can be estimated using

$$\widehat{\mathrm{var}}(r_s) = \frac{1}{T}\left(1 + 2\sum_{j=1}^{p} r_j^2\right), \qquad s > p \qquad (8.4.1)$$
Under a null hypothesis of no autocorrelation, ρ_s = 0 and each r_s, s = 1, 2, ..., has a standard error of T^{-1/2}. Thus a reasonable first step would be to compare each r_s with T^{-1/2}. The largest s taken would depend on the type and number of observations. If we have at least 40 observations and quarterly data, it would seem reasonable to take values up to r₈. As a second step we could select p as the first lag at which a correlation was "significant" and then use (8.4.1) to test for significant correlations at lags longer than p. For example, if r₁ were significant, then r₂, r₃, ... could be tested using the standard error T^{-1/2}(1 + 2r₁²)^{1/2}.
This type of procedure is certainly not conclusive, and it is frequently not easy to discriminate between AR and MA processes. However, it may detect correlations that are often overlooked by standard regression tests such as the Durbin-Watson. If, based on the r_s, a number of autocorrelation assumptions seem reasonable, each of the resulting models could be estimated and we could attempt to discriminate between models on the basis of other criteria such as the significance of the θ_i's and/or α_j's, the value of the likelihood function, and the a priori validity of the coefficient estimates. See Pagan (1974) for two examples.
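A small helper (our own) for computing the r_s together with the standard errors implied by (8.4.1), taking p = s - 1 at each lag:

```python
import numpy as np

def residual_autocorrelations(e, max_lag=8):
    """Sample autocorrelations r_s of the LS residuals with Bartlett-type
    standard errors from (8.4.1), using p = s - 1 for each lag s."""
    T = len(e)
    denom = e @ e
    r = np.array([(e[s:] @ e[:-s]) / denom for s in range(1, max_lag + 1)])
    cum = np.concatenate(([0.0], np.cumsum(r[:-1] ** 2)))
    se = np.sqrt((1.0 + 2.0 * cum) / T)
    return r, se   # compare each r_s with, say, +/- 2 * se_s
```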
(8.4.2)

This raises the question of why identical LM test statistics are sometimes appropriate for quite different alternative hypotheses. In some finite sample power comparisons, Godfrey shows that the LM test performs favorably relative to the LR test, despite the fact that the alternative hypothesis for the LM test is less clearly defined.

If one of the above "search procedures" is used to choose a particular model, it must be kept in mind that reported standard errors, and so forth, are conditional on the chosen model being the "correct one." Unconditional variances are not readily derived and can be considerably higher. See the comments in Section 8.5.
An alternative way of testing for AR(1) errors is to note that the model can be written as

$$y_t = \rho y_{t-1} + x_t'\beta - \rho x_{t-1}'\beta + v_t \qquad (8.4.3)$$

This model can be considered a special case of the more general model

$$y_t = \gamma_1 y_{t-1} + x_t'\gamma_2 + x_{t-1}'\gamma_3 + v_t \qquad (8.4.4)$$

in which the "common factor" restriction γ₃ = -γ₁γ₂ has been imposed, so that a test of this restriction provides a test of the AR(1) specification. See Mizon (1977), Hendry and Mizon (1978), Mizon and Hendry (1980), and the discussion in Harvey (1981a). Kiviet (1981) outlines some dangers of the approach. The formulation of a general research strategy for dynamic models is discussed by Hendry and Richard (1982).
8.4.2a The Durbin-Watson Test

The Durbin-Watson statistic is

$$d = \frac{\sum_{t=2}^{T}(\hat e_t - \hat e_{t-1})^2}{\sum_{t=1}^{T}\hat e_t^2} = \frac{\hat e'A\hat e}{\hat e'\hat e} = \frac{e'MAMe}{e'Me} \qquad (8.4.5)$$

where ê_t is an element of ê = y - Xb, M = I - X(X'X)⁻¹X', and

$$A = \begin{bmatrix} 1 & -1 & 0 & \cdots & 0 & 0 \\ -1 & 2 & -1 & \cdots & 0 & 0 \\ 0 & -1 & 2 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 2 & -1 \\ 0 & 0 & 0 & \cdots & -1 & 1 \end{bmatrix} \qquad (8.4.6)$$
For testing Ho: P = 0 against the alternative HI: P > 0 the DW test rejects Ho if
d < d*, d* being the critical value of d for a specified significance level. If the
alternative hypothesis is one of negative autocorrelation, the critical region is
of the form d > d**. It is assumed that the disturbances are normally distributed, although it can be shown [King (1981a)] that some of the optimal properties of the DW test hold for a wider class of distributions.
A disadvantage of the DW test is that the distribution of d, and hence the critical value d* (or d**), depend on the X matrix in the model under consideration. Thus, tabulation of significance points which are relevant for all problems is impossible. Durbin and Watson partially overcame this difficulty by finding two other statistics, d_L and d_U, whose distributions do not depend on X, and which have the property d_L < d < d_U. For different significance levels, and given values of T and K, they used the distributions of d_L and d_U to tabulate critical values d_L* and d_U* [Durbin and Watson (1951)]; and they used the inequality relationships between d, d_L, and d_U to establish a bounds test. In the bounds test for H₀: ρ = 0 against the alternative H₁: ρ > 0, the null hypothesis is rejected if d < d_L*, it is accepted if d > d_U*, and the test is regarded as inconclusive if d_L* < d < d_U*. If the alternative hypothesis is H₁: ρ < 0, the null hypothesis is rejected if d > 4 - d_L*, it is accepted if d < 4 - d_U*, and the test is regarded as inconclusive if 4 - d_U* < d < 4 - d_L*.
In instances where d falls within the inconclusive region the exact critical value d* can be found numerically, providing appropriate computer software [e.g., SHAZAM, White (1978)] is available. Numerical methods for finding d* include the Imhof (1961) technique used by Koerts and Abrahamse (1969) and techniques given by Pan Jei-jian (1968) and L'Esperance, Chall, and Taylor (1976). If an appropriate computer program is not available, or the associated computational costs are too expensive, the a + bd_U approximation [Durbin and Watson (1971)] can be employed. This approximation is relatively cheap computationally and also quite accurate. The value d* is found from

$$d^* = a + b\,d_U^* \qquad (8.4.7)$$

where a and b are chosen such that E[d] = a + bE[d_U] and var[d] = b² var[d_U]. The quantities E[d_U] and var[d_U], for different values of T and K, are tabulated in Judge et al. (1982), while E[d] and var[d] can be calculated from
$$E[d] = \frac{P}{T-K} \qquad (8.4.8)$$

and

$$\mathrm{var}[d] = \frac{2}{(T-K)(T-K+2)}\left[Q - P\,E[d]\right] \qquad (8.4.9)$$

where P = tr(MA) and Q = tr[(MA)²].
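Given the tabulated moments and critical value of d_U, the a + bd_U approximation can be computed directly from the X matrix. A sketch follows; the function name is our own, and E[d_U], var[d_U], and d_U* are assumed to be supplied by the user from the tables.

```python
import numpy as np

def dw_critical_value(X, E_du, var_du, du_star):
    """Approximate DW critical value d* = a + b * d_U* (Equation 8.4.7),
    matching the first two moments of d under the null via (8.4.8)-(8.4.9)."""
    T, K = X.shape
    A = 2.0 * np.eye(T)
    A[0, 0] = A[-1, -1] = 1.0
    A -= np.diag(np.ones(T - 1), 1) + np.diag(np.ones(T - 1), -1)
    M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
    MA = M @ A
    P = np.trace(MA)                     # P = tr(MA)
    Q = np.trace(MA @ MA)                # Q = tr[(MA)^2]
    E_d = P / (T - K)
    var_d = 2.0 / ((T - K) * (T - K + 2)) * (Q - P * E_d)
    b = np.sqrt(var_d / var_du)
    a = E_d - b * E_du
    return a + b * du_star
```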
The critical value bounds originally tabulated by Durbin and Watson (1951) have been extended by Savin and White (1977), and also appear in most textbooks [e.g., Judge et al. (1982)]. These bounds are constructed under the assumption that X contains a constant term. It is also possible to derive critical value bounds, some of which are much narrower, for other types of X matrices. The X matrices for which additional critical value bounds have been published include regressions with no intercept [Farebrother (1980), King (1981b)], regressions with a trend and/or seasonal dummy variables [King (1981c)], regressions with a trend and/or monthly dummy variables [King (1983b)], and polynomial trend models [Bartels et al. (1982)].
The DW test has a number of optimal properties. In particular, it is approximately uniformly most powerful similar (UMPS), and approximately uniformly most powerful invariant (UMPI), if X can be written as a linear transformation of K of the characteristic vectors of A. A related statistic, suggested by King (1981a), is

$$d' = \frac{\hat e'A_0\hat e}{\hat e'\hat e} = d + \frac{\hat e_1^2 + \hat e_T^2}{\hat e'\hat e} \qquad (8.4.12)$$
where A₀ is identical to A, except that the top left and bottom right elements are equal to 2 instead of unity. A test with critical region d' < d'* is both an approximate UMPS and an approximate UMPI test for H₀: ρ = 0 against H₁: ρ > 0, providing X can be written as a linear transformation of K of the characteristic vectors of A₀. Furthermore, it is a LBI test in the neighborhood of ρ = 0 for any X matrix; and it is exactly related to the first-order sample autocorrelation coefficient, r = Σ_{t=2}^T ê_tê_{t-1}/Σ_{t=1}^T ê_t², through the expression d' = 2(1 - r). See King (1981a) for details. King tabulates critical values for a bounds test based on d' and, in an empirical investigation of the power of d' relative to that of d, he finds that d' is more powerful than d if (1) 0 < ρ < .5 and the alternative hypothesis is one of positive autocorrelation; and (2) -1 < ρ < 0 and the alternative hypothesis is one of negative autocorrelation. Thus, in King's words, "it appears that for the typical problem of testing for positive autoregressive disturbances, Durbin and Watson's choice of test statistic is correct. However, there are situations where it is preferable to use d' in place of d."
8.4.2b King's Locally Optimal Bounds Test and the Berenblut-Webb Test

Building on his earlier results described in the previous section, King (1982b) suggests a test that is locally most powerful invariant in the neighborhood of ρ = ρ₁, where ρ₁ is a prespecified central value of ρ under H₁: ρ > 0 (or H₁: ρ < 0). His test statistic is

(8.4.13)

See the references cited there for earlier work and for tabulation of significance points for the von Neumann ratio.
Further alternatives have been suggested by Abrahamse and Koerts (1971), Abrahamse and Louter (1971), Neudecker (1977b), Durbin (1970b), and Sims (1975). Under the assumption that the variables in X are "slowly changing over time," Abrahamse et al. recommend the residuals

$$v = L(L'ML)^{-1/2}L'My \qquad (8.4.14)$$
8.4.2d Testing for AR(1) Errors when X Contains a Lagged Dependent Variable

In the presence of a lagged dependent variable the DW statistic is likely to have reduced power and is biased toward 2 [Nerlove and Wallis (1966), Durbin (1970a)]. For these circumstances, Durbin (1970a) utilized the theory on the Lagrange multiplier test to develop the "h statistic"

$$h = r_1\sqrt{\frac{T}{1 - T\,\widehat{\mathrm{var}}(\hat c)}} \qquad (8.4.15)$$

where r₁ is the first-order sample autocorrelation of the LS residuals and var̂(ĉ) is the estimated variance of the LS coefficient of the lagged dependent variable y_{t-1}. Durbin (1970a) has also given a test for an AR(2) alternative when the null hypothesis is an AR(1) scheme. The statistic for the latter test does not require estimation of the AR(2) model and has a slight error, which is noted in Durbin's (1970a) paper.
Because the test based on Durbin's h statistic is an asymptotic one, a number
of Monte Carlo studies [Maddala and Rao (1973), Kenkel (1974, 1975, 1976),
Park (1975, 1976), Spencer (1975), McNown and Hunter (1980), and Inder
(1984)] have been undertaken to evaluate its small-sample performance. The
results from these studies are not in complete agreement, and there are in-
stances where the power of Durbin's h test is quite low, and where the finite
sample size of the test is quite different from its asymptotic size. However,
there does not yet seem to be an operational alternative that is clearly superior
to the h test.
For testing against a fourth-order process of the form (8.2.58), Wallis (1972) suggests the modified statistic

$$d_4 = \frac{\sum_{t=5}^{T}(\hat e_t - \hat e_{t-4})^2}{\sum_{t=1}^{T}\hat e_t^2} \qquad (8.4.16)$$

Like the Durbin-Watson test this is a bounds test, and Wallis tabulates two sets of bounds, one for when the equation contains seasonal dummy variables and one for when it does not. Further sets of critical values are provided by King and Giles (1977) and Giles and King (1978). For the case when y_{t-1} is one of the regressors, Wallis shows that d₄ will be biased toward 2, but not too greatly, and hence it still could be useful. He gives some examples where autocorrelation is detected by d₄ but not by d₁. An alternative test, similar in nature to that described in Section 8.4.2b, has been suggested by King (1983e).
For testing against an alternative of an autoregressive process of order no greater than some a priori level, say q, Vinod (1973) recommends using d₁, d₂, ..., d_q to test a sequence of hypotheses. If all the hypotheses are accepted, we conclude that autocorrelation does not exist. If the jth hypothesis is rejected, we stop and conclude that an autoregressive process of order at least j is appropriate. Tables of the bounds for d₁, d₂, d₃, and d₄, with and without dummies, are provided by Vinod.
A further test and the relevant bounds are given by Schmidt (1972). For testing the null hypothesis θ₁ = θ₂ = 0, when the alternative is e_t = θ₁e_{t-1} + θ₂e_{t-2} + v_t, he suggests using d₁ + d₂.
e"*, e. . . *
r(ao) = A,A
ee (8.4.17)
A
and so it is impossible to tabulate critical values that are relevant for all prob-
lems. To alleviate this problem King tabulates critical values which can be used
for bounds tests with r(.5) and r( - .5). Also, in some empirical power compari-
sons, he finds that the tests based on r( + .5) are always more powerful than the
DW test when lal > .3.
A test against higher-order alternatives, based on the Lagrange multiplier principle, considers the null hypothesis of no autocorrelation against the alternatives

$$e_t = \theta_1 e_{t-1} + \theta_2 e_{t-2} + \cdots + \theta_q e_{t-q} + v_t \qquad (8.4.18)$$

and

$$e_t = v_t + \alpha_1 v_{t-1} + \cdots + \alpha_q v_{t-q} \qquad (8.4.19)$$

Thus we are testing for the existence of either AR(q) or MA(q) errors. The statistic for both these alternative hypotheses is the same, and so while it may help to identify the order of the process it does not enable one to discriminate between the two types of autocorrelation. Godfrey and Breusch show that, under the null hypothesis, the statistic, which can be computed as T times the R² from the LS regression of ê_t on x_t and ê_{t-1}, ..., ê_{t-q}, has an asymptotic χ²_(q) distribution. (8.4.20)
This test is a convenient one because it is based only on least squares quantities. Fitts (1973) provides an alternative for the MA(1) case, but his test requires nonlinear estimation. Also of likely interest are test statistics which test the null hypothesis that an assumed autocorrelation structure is correct, against the alternative hypothesis that some higher-order process is appropriate. Godfrey (1978b) provides statistics for a number of these cases, and Sargan and Mehta (1983) outline a sequential testing procedure which can be used to select the order of an AR disturbance.
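The T·R² form of the statistic is straightforward to compute from LS quantities alone. A sketch follows; the function name and the zero-padding of the lagged residuals in the auxiliary regression are our own choices.

```python
import numpy as np
from scipy import stats

def lm_autocorrelation_test(y, X, q):
    """LM test against AR(q)/MA(q) errors: T * R^2 from the auxiliary LS
    regression of ehat_t on x_t and ehat_{t-1}, ..., ehat_{t-q};
    asymptotically chi-square(q) under the null of no autocorrelation."""
    T = len(y)
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    lags = np.column_stack([np.concatenate((np.zeros(s), e[:-s]))
                            for s in range(1, q + 1)])
    Z = np.column_stack((X, lags))
    u = e - Z @ np.linalg.lstsq(Z, e, rcond=None)[0]
    R2 = 1.0 - (u @ u) / (e @ e)
    lm = T * R2
    return lm, stats.chi2.sf(lm, q)   # statistic and asymptotic p-value
```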
8.5 RECOMMENDATIONS

Given the array of models, tests, and estimation techniques discussed in Sections 8.2 and 8.4, the applied worker will be concerned with which of these should be chosen and what research strategy should be followed. As always, it is impossible to give a blanket recommendation that is appropriate for all situations. However, based on reported results the following steps appear reasonable.
This strategy is designed to find a model that "best represents reality" and, given that the selected model is the "real one," it is also designed to estimate it in the most efficient manner. The problem with such an approach is that the reported coefficient estimates and their reliability are conditional on the final model being the "correct one." They ignore the fact that a number of other models have been rejected and that one of these could be "correct." Ideally, the reported reliability of the results should take this into account. Unfortunately, it is very difficult to study the unconditional sampling properties of estimates generated from search procedures or strategies such as the one suggested here, and such studies are in their infancy. Thus, given the current state of the art, the best bet appears to be to carry out a limited, but hopefully effective, search procedure, and to report both the procedure followed and the final set of estimates. It may also be useful to report estimates from two or more competing models. However, one should keep in mind the sense in which the results are suboptimal and also other possible strategies.
In connection with testing for AR(1) errors one possible alternative strategy is to use a Durbin-Watson significance level which is considerably higher than the conventional .05 level, say 0.35 or 0.4. Such a strategy is suggested by Monte Carlo evidence on the efficiency of pre-test estimators [Judge and Bock (1978, Chapter 7); Schmidt (1971); Fomby and Guilkey (1978); Peck (1976); Giles and King (1984)], and by the fact that when X is such that LS is very inefficient, the DW has low power [Chipman (1965)]. In terms of the effects of pretesting on hypothesis tests for β, the Monte Carlo support for a high significance level is not as strong [Nakamura and Nakamura (1978); Giles and King (1984)], but a level above 0.05 can produce some gains. Monte Carlo results on pretest estimators where the errors follow either an AR(1) or an MA(1) scheme have been provided by Griffiths and Beesley (1984).
We have assumed throughout that the autocorrelation process generating the disturbances is stationary. Newbold and Davies (1978) argue that nonstationary processes are quite likely and that tests of the hypothesis β = 0, which assume stationarity, are particularly unrealistic. They condemn applied econometricians who consider only an AR(1) scheme and give Monte Carlo results to show how, in the presence of an alternative process, this can lead to incorrect inferences. The question of nonstationary disturbances is worthy of future research. It is likely to depend heavily on the specification of X. The suggestion that alternatives to an AR(1) process should be considered is a good one, and it is one to which applied econometricians seem to be paying increasing attention.
Before concluding it is worthwhile mentioning some results obtained for
models containing errors that are both autocorrelated and heteroscedastic.
Harrison and McCabe (1975) provide Monte Carlo evidence, and Epps and
Epps (1977) give both asymptotic results and Monte Carlo evidence which
suggest that the Durbin-Watson and Geary tests are robust in the presence of
heteroscedasticity. Epps and Epps go on to consider the performance of the
Goldfeld-Quandt and Glejser tests for heteroscedasticity when both problems
are present. They find both tests to be invalid but that a useful approximate
procedure is to first test for~autocorrelation and, if necessary, correct for it
using a Cochrane-Orcutt transformation, and then use the transformed model
332 Dynamic Specifications
to test for heteroscedasticity. It should be pointed out that the usefulness of this
procedure was judged by the performance of the tests not by the efficiency of
the resulting estimators. Lagrange multiplier statistics for simultaneously test-
ing for both heteroscedasticity and autocorrelation (as well as normality and
functional form) have been considered by Jarque and Bera (1980) and Bera and
Jarque (1982). See also King and Evans (1983).
8.6 EXERCISES
Exercise 8.1
Calculate the covariance matrices and compute the relative efficiency of the LS estimator b, the GLS estimator β̂, and the approximate GLS estimator β̂₀ in the AR(1) error model

$$y_t = \beta_1 + \beta_2 t + e_t, \qquad e_t = \rho e_{t-1} + v_t, \qquad t = 1, 2, \ldots, 10$$

where the v_t are independent and identically distributed random variables with zero mean and unit variance. To reduce the computational load we note that

$$X'\Psi X = \begin{bmatrix} 150.8194 & 829.5064 \\ 829.5064 & 5020.2833 \end{bmatrix}$$
Exercise 8.2
Prove that minimizing SL in (8.2.6) is equivalent to maximizing the log-likeli-
hood in (8.2.5).
Exercise 8.3
Show that (8.2.34) is obtained when σ_v is integrated out of (8.2.33); and that (8.2.36) results from integrating β out of (8.2.34).
Exercise 8.4
For the AR(2) error model, prove that P'P = Ψ⁻¹, where P is the transformation matrix in (8.2.46) and Ψ⁻¹ is defined in (8.2.45).
Exercise 8.5
Prove that the transformed residuals given for the MA(1) error model in (8.2.72) are obtained from the transformation Pe, where P is defined in (8.2.70). Relate this transformation to that obtained using the Kalman filter [Harvey (1981b, p. 112)].
Exercise 8.6
Prove that, for the MA(q) error model, |Ψ| = |I_q + C'C|. See Equation 8.2.103.
Exercise 8.7
Prove that, in the ARMA error model, minimizing v'v + w̄'Ω⁻¹w̄ with respect to β, θ, α, and w̄ is equivalent to minimizing e'Ψ⁻¹e with respect to β, θ, and α. See Section 8.2.7 for the various definitions.
See Section 8.2.7 for the various definitions.
Exercise 8.8
Prove that, in the AR(1) error model, the general expression for the BLUP in (8.3.4) reduces to (8.3.6), and the expression for its MSE in (8.3.5) reduces to (8.3.13).
Exercise 8.9
Prove that the predictor in (8.3.4) is a BLUP and that its MSE is given by
(8.3.5).
Exercise 8.10
(a) Prove that the DW statistic can be written as

$$d = \frac{\sum_{j=1}^{T-K}\nu_j z_j^2}{\sum_{j=1}^{T-K} z_j^2} \qquad (8.6.1)$$

where the ν_j are the nonzero characteristic roots of MA and the z_j are independent standard normal random variables.

(b) Prove that, under the null hypothesis of no autocorrelation,

$$E[d] = \frac{1}{T-K}\sum_{j=1}^{T-K}\nu_j = \bar\nu \qquad (8.6.2)$$

and

$$\mathrm{var}[d] = \frac{2}{(T-K)(T-K+2)}\left[\sum_{j=1}^{T-K}\nu_j^2 - (T-K)\bar\nu^2\right] \qquad (8.6.3)$$

(c) Show that these expressions are equivalent to

$$E[d] = \frac{P}{T-K} \qquad (8.6.4)$$

and

$$\mathrm{var}[d] = \frac{2}{(T-K)(T-K+2)}\left[Q - P\cdot E[d]\right] \qquad (8.6.5)$$

where P = tr(MA) and Q = tr[(MA)²].
For the remaining problems generate 100 samples of y, each of size 21, from the model

$$y_t = \beta_1 + \beta_2 x_{t2} + \beta_3 x_{t3} + e_t \qquad (8.6.6)$$

where e_t = ρe_{t-1} + v_t and the v_t are independent normal random variables with zero mean and variance E[v_t²] = σ_v². Use the first 20 observations for estimation and the last one for prediction purposes. Some suggested parameter values are β₁ = 10, β₂ = 1, β₃ = 1, ρ = .8, and σ_v² = 6.4, and a suggested design matrix is
$$X = \begin{bmatrix}
1.00 & 14.53 & 16.74 \\
1.00 & 15.30 & 16.81 \\
1.00 & 15.92 & 19.50 \\
1.00 & 17.41 & 22.12 \\
1.00 & 18.37 & 22.34 \\
1.00 & 18.83 & 17.47 \\
1.00 & 18.84 & 20.24 \\
1.00 & 19.71 & 20.37 \\
1.00 & 20.01 & 12.71 \\
1.00 & 20.26 & 22.98 \\
1.00 & 20.77 & 19.33 \\
1.00 & 21.17 & 17.04 \\
1.00 & 21.34 & 16.74 \\
1.00 & 22.91 & 19.81 \\
1.00 & 22.96 & 31.92 \\
1.00 & 23.69 & 26.31 \\
1.00 & 24.82 & 25.93 \\
1.00 & 25.54 & 21.96 \\
1.00 & 25.63 & 24.05 \\
1.00 & 28.73 & 25.66
\end{bmatrix} \qquad (8.6.7)$$
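The data generation for these exercises can be sketched as follows. The function name and seed handling are our own choices, and the X supplied should be the 21-row matrix consisting of (8.6.7) plus the prediction row x̄'_{T+1} = (1, 20, 20) used in Exercise 8.15.

```python
import numpy as np

def generate_samples(X, beta=(10.0, 1.0, 1.0), rho=0.8, sig2_v=6.4,
                     n_samples=100, seed=0):
    """Generate Monte Carlo samples from (8.6.6) with AR(1) errors.

    e_1 is drawn from its stationary distribution, N(0, sig2_v / (1 - rho^2))."""
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    beta = np.asarray(beta)
    ys = np.empty((n_samples, T))
    for i in range(n_samples):
        v = rng.normal(0.0, np.sqrt(sig2_v), T)
        e = np.empty(T)
        e[0] = v[0] / np.sqrt(1.0 - rho**2)
        for t in range(1, T):
            e[t] = rho * e[t - 1] + v[t]
        ys[i] = X @ beta + e
    return ys   # use the first 20 observations for estimation, the last for prediction
```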
Exercise 8.11
Calculate the covariance matrices for the following estimators.
(a) The generalized least squares estimator β̂ = (X'Ψ⁻¹X)⁻¹X'Ψ⁻¹y.
(b) The approximate generalized least squares estimator (see Section 8.2.1a).
(c) The least squares estimator b.
(d) The "generalized least squares estimator" which incorrectly assumes the errors are MA(1) with α = 0.8.
Comment on the relative efficiency of the four estimators.
Exercise 8.12
Select five samples from the Monte Carlo data and, for each sample, compute values for the following estimators.
(a) The least squares estimator b.
(b) The estimator ρ* = 1 − d/2, where d is the Durbin-Watson statistic.
(c) The approximate estimated generalized least squares estimator for β that uses ρ*; call it β̂₀.
(d) The estimated generalized least squares estimator for β that uses ρ*; call it β̂.
(e) The maximum likelihood estimator for β under the assumption that the errors follow an MA(1) error process; call it β̃.
Compare the various estimates with each other and with the true parameter values.
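A sketch of estimators (a) to (d), assuming numpy. That β̂₀ uses the Cochrane-Orcutt transformation (first observation dropped) while β̂ retains the first observation via the Prais-Winsten transformation is an assumption consistent with Section 8.2.1a, not a statement of the text; the MA(1) ML estimator of part (e) requires a numerical optimizer and is omitted.

    import numpy as np

    def exercise_8_12(y, X):
        b = np.linalg.solve(X.T @ X, X.T @ y)            # (a) least squares
        e = y - X @ b
        d = np.sum(np.diff(e)**2) / np.sum(e**2)         # Durbin-Watson statistic
        rho_star = 1.0 - d / 2.0                         # (b)
        ys, Xs = y[1:] - rho_star * y[:-1], X[1:] - rho_star * X[:-1]
        beta_0 = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)   # (c) approximate EGLS
        w = np.sqrt(1.0 - rho_star**2)                   # (d) EGLS, first obs. kept
        yf, Xf = np.concatenate(([w * y[0]], ys)), np.vstack([w * X[:1], Xs])
        beta_hat = np.linalg.solve(Xf.T @ Xf, Xf.T @ yf)
        return b, rho_star, beta_0, beta_hat

For example, exercise_8_12(samples[0][:20], X) returns all four quantities for the first generated sample.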
Exercise 8.13
The computer programs used in Exercise 8.12 to obtain values of the estimators b, β̂₀, β̂, and β̃ are also likely to yield estimates of the covariance matrices of these estimators. In all four cases, however, the variance estimates could be biased. Explain why. Also, taking the parameter β₃ as an example, compare the various estimates of var(b₃), var(β̂₀₃), var(β̂₃), and var(β̃₃) with the relevant answers in Exercise 8.11.
Exercise 8.14
Use the results in Exercises 8.12 and 8.13 to calculate t values that would be used to test
(a) H₀: β₃ = 0 against H₁: β₃ ≠ 0
(b) H₀: β₃ = 1 against H₁: β₃ ≠ 1
Since we have four estimators and five samples this involves, for each hypothesis, the calculation of 20 t values. Keeping in mind the validity or otherwise of each null hypothesis, comment on the results.
Exercise 8.15
Given that x'_{T+1} = (1, 20, 20), for each of the five samples calculate values for the following two predictors:
(a) y*_{T+1} = x'_{T+1}β̂₀
(b) ŷ_{T+1} = x'_{T+1}β̂ + ρ*(y_T − x'_Tβ̂)
Compare these predictions with the realized values.
Exercise 8.16
Calculate the mean and variance of the DW statistic for the X matrix used to
generate your data. Use this information and the mean and variance of the
Durbin-Watson upper bound [see Table 6 in Judge et al. (1982)] to find an
approximate critical value for the Durbin-Watson test. Use a 5% significance
level.
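A sketch of this calculation, assuming (as in the usual development of the DW statistic) that the θ_j of (8.6.2) and (8.6.3) are the nonzero characteristic roots of MA, with M = I − X(X'X)⁻¹X' and A the tridiagonal matrix of the quadratic form Σ(e_t − e_{t−1})²; numpy is assumed.

    import numpy as np

    def dw_moments(X):
        T, K = X.shape
        A = 2.0 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)
        A[0, 0] = A[T - 1, T - 1] = 1.0    # e'Ae = sum of squared differences
        M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
        # MAM is symmetric and has the same nonzero eigenvalues as MA;
        # with an intercept in X it has exactly K zero roots
        theta = np.sort(np.linalg.eigvalsh(M @ A @ M))[K:]
        n = T - K
        Ed = theta.sum() / n                                           # (8.6.2)
        vard = 2.0 * ((theta**2).sum() - n * Ed**2) / (n * (n + 2))    # (8.6.3)
        return Ed, vard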
Exercise 8.17
Using the same five samples that were used in Exercises 8.12 to 8.13, an alternative hypothesis of positive autocorrelation, and a 5% significance level, test for autocorrelation using
(a) The Durbin-Watson test (use the critical value obtained in Exercise 8.16).
(b) King's locally optimal bounds test with ρ₁ = .5.
Exercise 8.18
Repeat Exercise 8.12 for all 100 samples and use the results to estimate the
mean and mean square error matrix for each estimator. For example, for b,
estimates of its mean and mean square error matrix are given, respectively, by
(1/100) Σ_{i=1}^{100} b(i)   and   (1/100) Σ_{i=1}^{100} [b(i) − β][b(i) − β]'
where b(i) is the least squares estimator from the ith sample. Use these results
and those of Exercise 8.11 to answer the following questions.
(a) Do any of the estimators appear to be biased?
(b) By comparing the estimated mean and mean square error of b with the
true ones, comment on the extent of the sampling error in the Monte Carlo
experiment.
(c) Is σ²(X'Ψ⁻¹X)⁻¹ a good approximation of the finite sample mean square error matrix of (a) β̂ and (b) β̂₀?
(d) What are the efficiency consequences of incorrectly assuming the errors follow an MA(1) error process?
Exercise 8.19
Use all 100 samples to estimate the means of the variance estimators var(b₃), var(β̂₀₃), var(β̂₃), and var(β̃₃) that were obtained in Exercise 8.13. Comment on the results, particularly with respect to your answers to Exercise 8.11.
Exercise 8.20
Repeat Exercise 8.14 for all 100 samples and calculate the proportion of Type I
and Type II errors obtained when a 5% significance level is used. Comment.
Exercise 8.21
Repeat Exercise 8.15 for all 100 samples and calculate

MSE(y*_{T+1}) = (1/100) Σ_{i=1}^{100} [y*_{T+1}(i) − y_{T+1}(i)]²

and

MSE(ŷ_{T+1}) = (1/100) Σ_{i=1}^{100} [ŷ_{T+1}(i) − y_{T+1}(i)]²

where i refers to the ith sample. Comment on the results and compare the value of MSE(ŷ_{T+1}) with the value obtained from the asymptotic expression in (8.3.14).
Exercise 8.22
Use all 100 samples to construct empirical distributions for ρ* and the ML estimator for α. Comment.
8.7 REFERENCES
Durbin, J. (1970b) "An Alternative to the Bounds Test for Testing for Serial
Correlation in Least Squares Regression," Econometrica, 38, 422-429.
Durbin, J. and G. S. Watson (1950) "Testing for Serial Correlation in Least
Squares Regression I," Biometrika, 37, 409-428.
Durbin, J. and G. S. Watson (1951) "Testing for Serial Correlation in Least
Squares Regression II," Biometrika, 38, 159-178.
Durbin, J. and G. S. Watson (1971) "Testing for Serial Correlation in Least
Squares Regression III," Biometrika, 58, 1-19.
Engle, R. F. (1974a) "Band Spectrum Regression," International Economic
Review, 15, 1-11.
Engle, R. F. (1974b) "Specifications of the Disturbance Term for Efficient
Estimation," Econometrica, 42, 135-146.
Engle, R. F. (1976) "Interpreting Spectral Analyses in Terms of Time Domain
Models," Annals of Economic and Social Measurement, 5,89-110.
Engle, R. F. and R. Gardner (1976) "Some Finite Sample Properties of Spectral
Estimators of a Linear Regression," Econometrica, 46, 149-166.
Epps, T. W. and M. L. Epps (1977) "The Robustness of Some Standard Tests
for Autocorrelation and Heteroskedasticity When Both Problems Are
Present," Econometrica, 45, 745-753.
Farebrother, R. W. (1980) "The Durbin-Watson Test for Serial Correlation
When There is No Intercept in the Regression," Econometrica, 48, 1553-
1563.
Fisher, G. R. (1970) "Quarterly Dividend Behavior," in K. Hilton and D. F.
Heathfield, eds., The Econometric Study of the United Kingdom, Macmillan,
London, 149-184.
Fitts, J. (1973) "Testing for Autocorrelation in the Autoregressive Moving
Average Error Model," Journal of Econometrics, 1, 363-376.
Fomby, T. B. and D. K. Guilkey (1978) "On Choosing the Optimal Level of
Significance for the Durbin-Watson Test and the Bayesian Alternative,"
Journal of Econometrics, 8, 203-214.
Fuller, W. A. (1976) Introduction to Statistical Time Series, Wiley, New York.
Fuller, W. A. (1980) "The Use of Indicator Variables in Computing Predic-
tions," Journal of Econometrics, 12,231-243.
Fuller, W. A. and D. P. Hasza (1980) "Predictors for the First-Order Autore-
gressive Process," Journal of Econometrics, 13, 139-157.
Fuller, W. A. and D. P. Hasza (1981) "Properties of Predictors for Autoregres-
sive Time Series," Journal of the American Statistical Association, 76, 155-
161.
Gallant, A. R. and J. J. Goebel (1976) "Nonlinear Regression With Autocorre-
lated Errors," Journal of the American Statistical Association, 71,
961-967.
Gastwirth, J. L. and M. R. Selwyn (1980) "The Robustness Properties of Two
Tests for Serial Correlation," Journal of the American Statistical Associa-
tion, 75, 138-141.
Geary, R. C. (1970) "Relative Efficiency of Count of Sign Changes for Assess-
ing Residual Autoregression in Least Squares Regression," Biometrika, 57,
123-127.
Kiviet, J. F. (1981) "On the Rigour of Some Specification Tests for Modelling
Dynamic Relationships," working paper, University of Amsterdam.
Kmenta, J. (1971) Elements of Econometrics, Macmillan, New York.
Koerts, J. and A. P. J. Abrahamse (1968) "On the Power of the BLUS Proce-
dure," Journal of the American Statistical Association, 63, 1227-1236.
Koerts, J. and A. P. J. Abrahamse (1969) On the Theory and Application of the
General Linear Model, Rotterdam University Press, Rotterdam.
Kohn, R. (1981) "A Note on an Alternative Derivation of the Likelihood of an
Autoregressive Moving Average Process," Economics Letters, 7, 233-236.
Kramer, W. (1980) "Finite Sample Efficiency of Ordinary Least Squares in the
Linear Regression Model with Autocorrelated Errors," Journal of the Amer-
ican Statistical Association, 75, 1005-1009.
Kramer, W. (1982) "Note on Estimating Linear Trend When Residuals are
Autocorrelated," Econometrica, 50, 1065-1067.
L'Esperance, W. L. and D. Taylor (1975) "The Power of Four Tests of Auto-
correlation in the Linear Regression Model," Journal of Econometrics, 3, 1-
21.
L'Esperance, W. L., D. Chall, and D. Taylor (1976) "An Algorithm for Deter-
mining the Distribution Function of the Durbin-Watson Statistic," Econo-
metrica, 44, 1325-1346.
Ljung, G. M. and G. E. P. Box (1979) "The Likelihood Function of Stationary
Autoregressive-Moving Average Models," Biometrika, 66, 265-270.
McClave, J. T. (1973) "On the Bias of Autoregressive Approximations to Mov-
ing Averages," Biometrika, 60,599-605.
McNown, R. F. and K. R. Hunter (1980) "A Test for Autocorrelation in
Models with Lagged Dependent Variables," The Review of Economics and
Statistics, 62, 313-317.
Maddala, G. S. (1977) Econometrics, McGraw-Hill, New York.
Maddala, G. S. and A. S. Rao (1973) "Tests for Serial Correlation in Regres-
sion Models with Lagged Dependent Variables and Serially Correlated Er-
rors," Econometrica, 41, 761-774.
Maeshiro, A. (1979) "On the Retention of the First Observations in Serial
Correlation Adjustment of Regression Models," International Economic Re-
view, 20, 259-265.
Magnus, J. R. (1978) "Maximum Likelihood Estimation of the GLS Model with
Unknown Parameters in the Disturbance Covariance Matrix," Journal of
Econometrics, 7, 281-312.
Malinvaud, E. (1970) Statistical Methods in Econometrics, North-Holland,
Amsterdam.
Mentz, R. P. (1977) "Estimation in the First-Order Moving Average Model
Through the Finite Autoregressive Approximation: Some Asymptotic
Results," Journal of Econometrics, 6,225-236.
Miyazaki, S. and W. E. Griffiths (1984) "The Properties of Some Covariance
Matrix Estimators in Linear Models with AR(1) Errors," Economics Letters,
14, 351-356.
Mizon, G. E. and D. F. Hendry (1980) "An Empirical Application and Monte Carlo Analysis of Tests of Dynamic Specification," Review of Economic Studies, 47, 21-45.
9.1 INTRODUCTION

Because of the way economic data are generated, time lags must often be included in the relations of an economic model.
Generally, if we consider only one dependent and one explanatory variable, we get a model of the form

y_t = Σ_{i=0}^{∞} β_i x_{t−i} + e_t   (9.1.1)

and we assume that the variable x at all times is independent of the stochastic process e_t. At this point, allowing an infinite number of lagged x_t in the model is a matter of convenience since it does not force us to specify a maximal lag at this stage. Also, we will see in Chapter 10 that there may be theoretical reasons for postulating an infinite lag length.
In order to keep the discussion simple we assume in this chapter that e_t is a normally distributed white noise process with zero mean. Assuming a specific process has implications for the estimation and interpretation of the relationship postulated in (9.1.1). Furthermore, it will be assumed that β_n = 0 for n > n*, so that (9.1.1) is a finite distributed lag model with lag length n*. More general types of distributed lag models will be considered in Chapter 10.
The organization of the present chapter is as follows: In the next section the
general problem of estimating lag weights and lag lengths without assuming any
constraints for the former is considered. It will be pointed out that constraining
the lag weights can improve the estimation accuracy, and various restricted lag
formulations will be discussed in Sections 9.3 to 9.7. A summary is given in
Section 9.8. A visual representation of this chapter is presented in Table 9.1.
The model

y_t = Σ_{i=0}^{n*} β_i x_{t−i} + e_t   (9.2.1)

will be discussed. The β_i are the unknown distributed lag coefficients or weights, the x_{t−i} are lagged values of the explanatory variable, n* (< ∞) is the lag length, and e_t is independent white noise. The x_t will be assumed to be nonstochastic for convenience. We will first repeat some estimation results when n* is
assumed to be known and then investigate joint estimation of the lag length n*
and the lag weights.
In matrix notation, (9.2.1) can be written as

y = Xβ + e   (9.2.2)
TABLE 9.1 FINITE DISTRIBUTED LAGS

Restricted least squares formulation (Section 9.3.1)
Misspecification (Section 9.3.2)
Known lag length (Section 9.3.3)
Unknown lag length (Section 9.3.4)
where

X = [ x₁    x₀       ⋯   x_{−n*+1}
      x₂    x₁       ⋯   x_{−n*+2}
      ⋮                  ⋮
      x_T   x_{T−1}  ⋯   x_{T−n*} ]   (9.2.3)

y' = (y₁, . . . , y_T), β' = (β₀, . . . , β_{n*}), and e' = (e₁, . . . , e_T). If the vector of random variables e is normally distributed with mean vector zero and covariance matrix σ²I, then the least squares estimator b = (X'X)⁻¹X'y is best unbiased with an N(β, σ²(X'X)⁻¹) distribution.
To investigate the asymptotic distribution of b we make the following assumptions:

1. d_T² = x₁² + x₂² + ⋯ + x_T²   (9.2.4a)

2. lim_{T→∞} x_T²/d_T² = 0   (9.2.4b)

3. R = lim_{T→∞} D_T⁻¹(X'X)D_T⁻¹ exists and is nonsingular   (9.2.4c)

Under these conditions the LS estimator is asymptotically normal,

D_T(b − β) → N(0, σ²R⁻¹) in distribution   (9.2.5)

(see Chapter 5). Now we are ready to investigate the estimation problem with unknown lag length.
If the lag length n* is unknown, the number of regressors to be used in the linear model (9.2.1) is unknown. Using n ≠ n* regressors implies biased (n < n*) or inefficient (n > n*) estimators for the lag weights. To estimate the lag length, the criteria for model selection discussed in Chapter 21 can be employed. Here we will only discuss some possible candidates reformulated for the present situation. For example, Akaike's (1973) information criterion assumes the form

AIC(n) = ln σ̂_n² + 2n/T   (9.2.6)
where 6"~ is the maximum likelihood estimator for (I'2 evaluated under the
assumption that n = n* and an estimate n(AIC) of n* is chosen so that AIC
assumes its minimum for n = n. That is, n = nT(AIC) is chosen such that
_ -2 T + n + I
PC(n) - (I' n T - n + t (9.2.8)
n In T
SC(n) = In 6"~ + T
(9.2.9)
n
L 6"j-2(T - j - I)
CAT(n) = ,-j~_O _ _ _ _ __
T2 (9.2.10)
(9.2.11)
and
n6"~T In T
BEC(n) = 6"~ + T - NT - 1 (9.2.12)
When the X matrix in (9.2.2) is formed with an assumed lag length n rather than n*, it becomes

X_n = [ x₁    x₀       ⋯   x_{−n+1}
        x₂    x₁       ⋯   x_{−n+2}
        ⋮                  ⋮
        x_T   x_{T−1}  ⋯   x_{T−n} ]   (9.2.13)
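A minimal sketch of lag-length selection with the AIC and SC criteria of (9.2.6) and (9.2.9), assuming numpy, one-dimensional arrays y and x, and a common estimation sample of size T − N_T so that the criteria are comparable across lag lengths (an assumption; the text does not spell out this detail):

    import numpy as np

    def choose_lag(y, x, N_T):
        T, results = len(y), []
        for n in range(N_T + 1):
            Xn = np.column_stack([x[N_T - i: T - i] for i in range(n + 1)])
            yn = y[N_T:]
            b = np.linalg.lstsq(Xn, yn, rcond=None)[0]
            sig2 = np.sum((yn - Xn @ b)**2) / len(yn)    # ML variance estimator
            aic = np.log(sig2) + 2.0 * n / len(yn)               # (9.2.6)
            sc = np.log(sig2) + n * np.log(len(yn)) / len(yn)    # (9.2.9)
            results.append((n, aic, sc))
        n_aic = min(results, key=lambda r: r[1])[0]
        n_sc = min(results, key=lambda r: r[2])[0]
        return n_aic, n_sc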
β_i = f(i) = { (n* + 1 − i)α   for i = 0, 1, . . . , n*
             { 0               otherwise              (9.2.15)

which depends only on the parameter α. This lag function was actually suggested by Fisher (1937). He regarded this as a computationally simple specification and a very rough approximation to the actual lag scheme.

[Figure 9.1: The arithmetic lag (9.2.15); the weights decline linearly from (n* + 1)α at i = 0.]

Another example
is due to DeLeeuw (1962). He used the inverted-V lag, which is defined by
β_i = f(i) = { (i + 1)α        for 0 ≤ i ≤ s
             { (n* − i + 1)α   for s + 1 ≤ i ≤ n*
             { 0               otherwise             (9.2.16)
(see Figure 9.2). We will not elaborate on these lag schemes because they are
relatively inflexible. Instead, we will concentrate on more general attempts to
overcome the multicollinearity problem in the following sections.
[Figure 9.2: The inverted-V lag (9.2.16).]
In the Almon procedure it is assumed that the lag weights can be represented by a polynomial of degree q < n*,

β_i = α₀ + α₁i + α₂i² + ⋯ + α_q i^q,   i = 0, 1, . . . , n*   (9.3.1)

or, in matrix notation,

β = Hα   (9.3.2)

where

H = [ 1   0    0     ⋯   0
      1   1    1     ⋯   1
      1   2    2²    ⋯   2^q
      ⋮
      1   n*   (n*)² ⋯   (n*)^q ]   (9.3.3)

Substituting (9.3.2) into (9.2.2) gives

y = XHα + e = Zα + e   (9.3.4)
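A sketch of the resulting restricted estimator, assuming numpy; X is the T × (n* + 1) matrix of current and lagged x's from (9.2.2):

    import numpy as np

    def almon_estimator(y, X, q):
        nstar = X.shape[1] - 1
        i = np.arange(nstar + 1)
        H = np.vander(i, q + 1, increasing=True)   # H[i, j] = i**j, as in (9.3.3)
        Z = X @ H                                  # transformed regressors of (9.3.4)
        a = np.linalg.solve(Z.T @ Z, Z.T @ y)      # LS estimator of alpha
        return H @ a                               # restricted estimator of beta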
The specification of the finite polynomial distributed lag presented above repre-
sents what Cooper (1972) calls the direct approach. A more explicit formulation
of the restricted least squares nature of the Almon model is useful when discussing its properties. The relation β = Hα implies that

α = (H'H)⁻¹H'β = H⁺β   (9.3.5)

and hence

0 = β − Hα
  = β − HH⁺β
  = (I − H(H'H)⁻¹H')β   (9.3.6)

(9.3.7)

where Q is a (q + 1) × (n* + 1) matrix of Lagrangian interpolation coefficients whose (k,m)th element is

(9.3.8)

(9.3.9)
In the analysis presented above it has been presumed that the Almon distrib-
uted lag model has been specified correctly, implying that the polynomial de-
gree q and lag length n* are known a priori. Unfortunately, these parameters
are seldom if ever known in applications. Several studies have investigated the
effects of misspecifying the lag length or polynomial degree in an Almon lag
model. See, for example, Frost (1975), Schmidt and Waud (1973), Trivedi
(1970), Teriisvirta (1976), Harper (1977), Griffiths and Kerrison (1978), Schmidt
and Sickles (1975), Trivedi and Pagan (1979), and Thomas (1977).
Studies of the consequences of applying incorrect restrictions to a linear
model show that [from Trivedi and Pagan (1979)]:
1. If the assumed polynomial degree is correct, but the assumed lag length is
greater than the true lag length, the polynomial distributed lag estimator
will generally be biased. It will definitely be biased if the difference between
the assumed and true lag length is greater than the degree of the approxi-
mating polynomial.
2. If the assumed polynomial degree is correct, then understating the true lag
length usually leads to bias in the polynomial distributed lag estimator.
3. If the assumed lag length is correct, but the assumed polynomial is of an
order higher than the true polynomial, the polynomial distributed lag esti-
mator is unbiased but inefficient.
4. If the assumed lag length is correct, but the assumed polynomial is of lower
order than the true polynomial, then the polynomial distributed lag estima-
tor is always biased.
These results have led researchers to adopt procedures using the sample data
to obtain information about the polynomial degree and/or lag length. Some of
these methods are reviewed below.
When the lag length n* is known, the nested nature of the restrictions associ-
ated with increasing the polynomial degree q may be used to construct likeli-
hood ratio tests for the "optimal" polynomial degree. Such procedures have
been suggested by Teräsvirta (1976), Godfrey and Poskitt (1975), Trivedi and Pagan (1979), and Hill and Johnson (1976). Since a polynomial of degree q may be imposed on the lag weights by specifying n* − q linear homogeneous restrictions of the form Rβ = 0 on the model y = Xβ + e, these restrictions may be viewed as hypotheses and tested under a variety of norms. If the linear hypotheses are Rβ = 0, the appropriate test statistic is

u = b'R'[R(X'X)⁻¹R']⁻¹Rb / [(n* − q)σ̂²]   (9.3.11)
Under alternative measures of goodness the test statistic u has different distri-
butions under the null hypothesis. The various norms, hypotheses, and test
procedures are summarized in Table 9.2.
A possible testing procedure, when the lag length n* is known, is to select a maximum polynomial degree that is felt to be greater than the true degree [if no notion of a maximum degree is held, one may choose a polynomial of degree n*, which corresponds to the unrestricted model (9.2.1)]; sequentially lower the polynomial degree by one, thus adding one restriction; and then choose as the best polynomial degree the one that produces the last acceptable hypothesis under the adopted norm and test procedure. A sketch of a single test of this kind appears after Table 9.2. When carrying out this sequential procedure, care should be taken with respect to the level of Type I error chosen for each individual test.
TABLE 9.2 LIKELIHOOD RATIO TEST PROCEDURES UNDER ALTERNATIVE NORMS FOR THE ALMON POLYNOMIAL DEGREE GIVEN LAG LENGTH

Norm 1: Squared error loss [Wallace (1972) and Yancey, Judge, and Bock (1973)]; β̂ is better than b iff tr MSE(β̂) ≤ tr MSE(b).
  Noncentrality parameter under the null hypothesis: λ ≤ tr{(X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹}/(2d_L), where d_L is the largest characteristic root of the expression under the trace operator.
  Test procedure: Calculate u and compare it to the critical value c = F(n* − q, T − n* − 1; λ, α), which must be calculated in each application. If u > c, reject H₀ and conclude that b is better than β̂.

Norm 2: Weighted squared error loss [Wallace (1972)]; β̂ is better than b iff E(Xβ̂ − Xβ)'(Xβ̂ − Xβ) ≤ E(Xb − Xβ)'(Xb − Xβ), that is, E(β̂ − β)'X'X(β̂ − β) ≤ E(b − β)'X'X(b − β).
  Noncentrality parameter under the null hypothesis: λ ≤ (n* − q)/2.
  Test procedure: Calculate u and compare it to the critical value c = F(n* − q, T − n* − 1; λ, α) tabulated in Goodnight and Wallace (1972). If u > c, reject H₀ and conclude that b is better than β̂.
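A sketch of a single test of this kind, assuming numpy and scipy. R is taken here as any full-rank matrix whose rows span the orthogonal complement of the column space of H, so that Rβ = 0 exactly when β lies on a polynomial of degree q; a central F critical value is used for simplicity, whereas the procedures of Table 9.2 call for noncentral critical values.

    import numpy as np
    from scipy.linalg import null_space
    from scipy.stats import f

    def poly_degree_test(y, X, q, level=0.05):
        T, cols = X.shape
        nstar = cols - 1
        H = np.vander(np.arange(nstar + 1), q + 1, increasing=True)
        R = null_space(H.T).T                       # (n* - q) x (n* + 1), R H = 0
        b = np.linalg.solve(X.T @ X, X.T @ y)
        sig2 = np.sum((y - X @ b)**2) / (T - cols)  # unbiased variance estimator
        XtXi = np.linalg.inv(X.T @ X)
        Rb = R @ b
        u = Rb @ np.linalg.solve(R @ XtXi @ R.T, Rb) / ((nstar - q) * sig2)  # (9.3.11)
        c = f.ppf(1.0 - level, nstar - q, T - nstar - 1)
        return u, c, u > c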
Trivedi and Pagan (1979) state that the significance level of the sequence of tests is approximately γ = 1 − ∏_j (1 − γ_j), where γ_j is the significance level of the jth individual test in the series. If all the γ_j are equal, the probability of rejecting the null hypothesis when it is true will rise as the order of the polynomial is reduced. Following the suggestion of Anderson (1971), one might make γ_j very small for high values of q so that, if a higher-order polynomial is not required, the probability of deciding on a high order is small, and make the γ_j larger for low values of q in such a way that the level of significance at lower polynomial degrees is acceptable. It should be noted that any of these procedures produce pretest estimators (see Chapter 3, Section 3.3) whose distributions and properties are currently unknown.
Amemiya and Morimune (1974), Schmidt and Sickles (1975), and Trivedi and
Pagan (1979) offer quite a different approach to the problem. They suggest
adopting a norm and estimating the value of loss under different model specifi-
cations. Amemiya and Morimune (1974) adopt the norm tr[MSE(rl)X'X],
where MSE (rl) = E(rl - (J)(rl - (J)'. It can be shown that this norm equals
the weighted squa!ed error loss function that is also interpretable as the mean
square error of X(J as a predictor of X(J. Schmidt and Sickles (1975) calculate
values of this norm using the unrestricted least squares estimates of (J and (J2 in
place of the true but unknown values. The polynomial degree could be chosen
by selecting the one that minimizes this value. However, there is no necessity
for adopting their ad hoc procedure when under the chosen norm straightfor-
ward mean square error tests can be carried out as shown above. Alternatively,
the model selection criteria discussed in Section 9.2.2 and/or Chapter 21 can be
used to estimate the polynomial degree.
What, then, is the best procedure for an applied worker faced with such a situation to follow from a sampling theory point of view? Available results offer very little comfort and apply only to investigators who are interested in point estimation, not interval estimation or hypothesis testing. Under a squared error loss function the modified pretest estimator (3.4.12) is known to dominate the traditional pretest estimator (3.4.11).
When the lag length n* and the polynomial degree q are unknown, the problems
discussed above are compounded. A possibility of utilizing a sequential testing
procedure for jointly estimating n* and q is noted by Hendry, Pagan, and Sargan
(1982). Suppose that the true lag length n* is known not to be greater than N.
Then an adequate polynomial degree can be determined conditional on the lag
length being N, using a sequential testing procedure as described in the pre-
vious section. Suppose Q is the polynomial degree corresponding to lag length
N. If n* is in fact less than N, then β_N = 0. In this case, reducing the lag length
to N - 1, we can also reduce the polynomial degree by one since we have
eliminated one root of the polynomial and the number of roots of a polynomial
corresponds to its degree. Thus, Q - 1 is the polynomial degree for lag length
N - 1, and so on. Hence a sequence of nested hypotheses can be set up as
follows:
H₁: q = Q − 1, n* = N − 1
H₂: q = Q − 2, n* = N − 2   (9.3.14)
⋮

These hypotheses can be tested sequentially, or the pair (n*, q) can be chosen with a suitable model specification criterion (see Chapter 21). Teräsvirta and Mellin
(1983) have compared various such criteria and procedures in this context in
a Monte Carlo study. Their overall conclusion is that the purpose of the analy-
sis should be taken into account when an estimation procedure is chosen.
However, in their study model selection criteria like the one proposed by
Schwarz (SC) and a procedure called TPEC did quite well in choosing the
correct lag length and polynomial degree and in producing relatively low mean
square errors. The idea underlying the TPEC procedure is to test, for all combi-
nations of lag lengths n and polynomial degrees q, whether imposing the corre-
sponding restrictions reduces a weighted squared error loss function as com-
pared to the LS estimator of the unrestricted model. The F statistic is evaluated
for any pair (n,q) and the most parsimonious model is chosen among the specifi-
cations with lowest F values. For more details the reader is referred to the
paper by Terasvirta and Mellin (1983).
As another means of shedding light on what might be appropriate values of
n* and q, Harper (1977) suggests using the specification error tests of Ramsey
(1969, 1974) and Ramsey and Gilbert (1972). Since incorrect selection of n* and
q results in a model with a disturbance that has a nonzero mean, the specification
error tests RESET and RASET can be adopted. These tests are designed to
detect a nonzero mean of the disturbances of a regression model. Further
discussions and modifications of these tests are given by Ramsey and Schmidt
(1976), Thursby and Schmidt (1977), and Thursby (1979).
Finally, although it is true that the above procedures will produce point
estimates for n *, q, and the corresponding lag weight vector, recall that the
exact small sample properties of the resulting estimators are unknown. It is by
no means clear that an unrestricted estimator necessarily produces inferior
estimates of the lag weights. Teräsvirta (1980) gives some general guidelines as to when the imposition of polynomial constraints may result in improved estimation accuracy. Briefly, if the error variance of the model is large, the sample size T is small, and the lag function is rather smooth, the use of a polynomial lag may be advantageous.
As a generalization Corradi and Gambetta (1976) and Poirier (1976) suggest that
rather than a polynomial lag one may wish to consider a spline lag. To illustrate
we discuss a cubic spline that is a set of cubic polynomials joined together
smoothly by requiring their first and second derivatives to be equal at the points
where they join. Such functions have a long history of use in curve fitting
applications because of their flexibility. The method presumes that the function
generating the weights can be approximated by the function

β_i = f(i) = g₁(i)h₁(i) + g₂(i)h₂(i) + g₃(i)h₃(i)   (9.4.1)

where h_j(i) is an indicator function that is one if the argument i is in the stated interval and zero otherwise, the i_j are points in the interval [0, n*], and the g_j(i) are cubic polynomials

g_j(i) = a_j i³ + b_j i² + c_j i + d_j   (9.4.2)
Although three cubics are presented here, there may be any number in gen-
eral and the intervals they cover may be varied at will, but intervals of equal
length are usually adopted. The restrictions required to join these cubics
smoothly are linear restrictions on the coefficients a_j, b_j, c_j, and d_j.

In a related approach, the polynomial restrictions on the lag weights may be treated as stochastic rather than exact (Shiller's smoothness prior) and written as

0 = Rβ + v   (9.5.1)
(9.5.2)
(9.5.3)
The value of k one adopts of course reflects the strength with which the exact polynomial restrictions are held. As k → ∞ the estimator β* approaches the corresponding Almon estimator, and as k → 0, β* approaches the ordinary least squares estimator b. Fomby (1979) has noted that the stochastic restrictions may be tested under the generalized mean square error norm for MSE improvement by employing the procedures suggested by Judge, Yancey, and Bock (1973) and Yancey, Judge, and Bock (1974). As with pretests involving exact restrictions, however, the pretest estimator thus produced is inferior to the least squares estimator under both risk matrix and squared error loss functions for a large range of the parameter space. (See Chapter 3.) For a generalization of Shiller's approach, see Corradi (1977).
Hamlen and Hamlen (1978) note that occasionally, when the Almon technique is being used, a polynomial of low degree is insufficient to represent the lag structure adequately. In such a case the multicollinearity problem may not be mitigated, and there is also a potential for roundoff errors, as there may be a substantial range between the smallest and the largest variable in the transformed regressor matrix Z in (9.3.4). To cope with these difficulties they propose the use of trigonometric functions. In this approach the lag weights are written as

β_j = β̄ + Σ_{k=1}^{q} (A_k sin θ_jk + B_k cos θ_jk)   for j = 1, 2, . . . , n*   (9.6.1)

where

θ_jk = [2π/(n* + 1)] · j · k   (9.6.2)
In matrix notation,

β = Hα   (9.6.3)

where

α' = (β̄, A₁, . . . , A_q, B₁, . . . , B_q)   (9.6.4)

and

H = [ 1   0        ⋯  0        1        ⋯  1
      1   sin θ₁₁  ⋯  sin θ₁q  cos θ₁₁  ⋯  cos θ₁q
      1   sin θ₂₁  ⋯  sin θ₂q  cos θ₂₁  ⋯  cos θ₂q
      ⋮ ]   (9.6.5)

Like the Almon lag, the parameters α may be estimated by substituting (9.6.3) into (9.2.2) to obtain

y = XHα + e = Zα + e   (9.6.6)

(9.6.7)
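A sketch of the construction of H and the resulting estimator, assuming numpy; in the first row of H (j = 0) the sine terms vanish and the cosine terms equal one.

    import numpy as np

    def trig_estimator(y, X, q):
        nstar = X.shape[1] - 1
        j = np.arange(nstar + 1)[:, None]
        k = np.arange(1, q + 1)[None, :]
        theta = 2.0 * np.pi * j * k / (nstar + 1)          # (9.6.2)
        H = np.hstack([np.ones((nstar + 1, 1)), np.sin(theta), np.cos(theta)])
        Z = X @ H                                          # (9.6.6)
        a = np.linalg.solve(Z.T @ Z, Z.T @ y)
        return H @ a                                       # lag weights beta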
Hatanaka and Wallace (1980) show that in some circumstances the first few moments

μ_j = Σ_{i=0}^{n*} i^j β_i,   j = 0, 1, . . .   (9.7.1)

of the lag distribution can be estimated more precisely than the lag weights β_i. Thus, if interest focuses on the long-run response μ₀ or the mean lag μ₁, say, it may be preferable to estimate these moments directly using

μ = Wβ   (9.7.2)

where of course Z = XW⁻¹. Under our standard assumptions for e the least squares estimator μ̂ = (Z'Z)⁻¹Z'y for μ is distributed as N(μ, σ²(Z'Z)⁻¹). Hatanaka and Wallace show that the variance of μ̂_i is less than that of μ̂_j if i < j. In particular, the long-run response μ₀ and the mean lag μ₁ are estimated most accurately. Clearly these "moments" are not the moments of a proper distribution since μ₀ = Σ_{i=0}^{n*} β_i may not equal one. However, the μ_i can be used to compute the moments of the distribution p_i = β_i/μ₀, i = 0, 1, . . . , n*. Let us denote the moments of this distribution by M₁, M₂, . . . ; that is, M_j = Σ_{i=0}^{n*} i^j p_i. Silver and Wallace (1980) suggest a method due to Pearson to identify a particular lag distribution on the basis of M₁, M₂, M₃, and M₄.
9.8 SUMMARY

The procedures presented in this chapter are to some extent of only pedagogical value. First, we have
assumed that there is just one explanatory variable. Most of the above proce-
dures can be extended to models with more than one lagged explanatory vari-
able. The techniques may, however, become more cumbersome and the men-
tioned problems will be compounded. Second, we have presumed that the lag
length is indeed finite and we will see in the next chapter that there are good
reasons for rejecting this assumption in many situations of practical relevance
as there are economic theories that lead to infinite lag distributions. Third, it
has been assumed that the error terms of the basic model (9.2.1) are indepen-
dent. This is a rather heroic assumption in a dynamic framework where inter-
temporal dependencies between the variables are explicitly postulated. Pagano
and Hartley (1981) demonstrate the potentially severe consequences of ignor-
ing residual autocorrelation in a distributed lag analysis. The independence
assumption will be relaxed in much of the next chapter. Fourth, it has been
assumed that the process X t is independent of the error process. In practice this
also is a rather severe constraint as it excludes situations with feedback be-
tween y and x. Dynamic models with feedback between the variables are con-
sidered in Chapter 16. Fifth, as in previous chapters the parameters of the
model are assumed time invariant. Models with time varying coefficients are
treated in Chapter 19.
9.9 EXERCISES
Exercise 9.1
Using five samples of data compute the least squares estimates of the coeffi-
cients for each of the five samples assuming that the lag length is n* = 4.
Compute the sample means and variances of these estimates and compare them
to their theoretical counterparts.
Exercise 9.2
Use AIC (9.2.6) and SC (9.2.9) and a maximum lag of NT = 6 to estimate the lag
length for the five samples of Exercise 9.1. Interpret your results.
Exercise 9.3
Using the five data samples and the true lag length n* = 4, impose the Almon
constraint that the lag coefficients fall on a polynomial of degree q = 2. Compute the restricted least squares estimates of the coefficients, their sample
means and variances, and compare them to their theoretical counterparts.
Exercise 9.4
Repeat Exercise 9.3 under the following three alternative hypotheses:
(a) n* = 4, q = 3
(b) n* = 6, q = 2
(c) n* = 3, q = 2
Exercise 9.5
Using the five samples of data from the previous exercises compute the least
squares estimates for the coefficients of a distributed lag model with lag length
n = 6 for each of the five samples, and compute the sample means and sample
variances. Impose a set of restrictions that implies a lag length n = 5 and a
polynomial of degree two. Compute the positive part rule discussed in Chapter
3 which combines these restrictions with the sample data for the five samples,
their sample means and variances.
Exercise 9.6
Using the specified statistical model and all 100 samples of data, compute the
least squares estimates of the coefficients assuming a lag length of n* = 4.
Compute the sample means and variances of the estimates and the squared
error loss.
Exercise 9.7
Estimate the lag length for all 100 samples using AIC and SC and a maximum
lag NT = 6. Construct a frequency distribution for each of the two criteria and
interpret.
Exercise 9.8
Using the 100 samples and assuming a lag length of n* = 4 and a polynomial lag
distribution of degree two, compute the appropriate restricted least squares
estimates, and their sample means, variances, and squared error loss.
Exercise 9.9
Using the 100 samples generated above, compute the least squares estimates of
a distributed lag model with lag length n = 6, their sample means, variances,
and squared error loss. Impose a set of restrictions on the model that implies a
lag length n = 5 and a polynomial lag distribution of degree two. Compute the
corresponding restricted least squares estimates, and their sample means, vari-
ances, and squared error loss. Compute the positive part rule of Chapter 3 that
combines these restrictions with the sample data for the 100 samples, and their
sample means, variances, and squared error loss.
9.10 REFERENCES
10.1 INTRODUCTION

In this chapter we consider infinite distributed lag models of the general form

y_t = β(L)x_t + e_t   (10.1.1)

and, in particular, the rational lag specification

y_t = [γ(L)/φ(L)] x_t + [α(L)/θ(L)] v_t   (10.1.2)

where γ(L), φ(L), α(L), and θ(L) are finite-order polynomials in the lag operator L, defined such that L^k x_t = x_{t−k} (see Section 7.2). The model (10.1.2) implies that β(L) = γ(L)/φ(L). For this expression to be meaningful φ(L) has to be such that the random variable φ⁻¹(L)γ(L)x_t is well defined. Therefore, we require that φ(L) satisfies the stationarity conditions (10.1.3) and (10.1.4). For instance, with α(L) = θ(L) = 1, (10.1.2) reduces to

y_t = [γ(L)/φ(L)] x_t + v_t   (10.1.5)
and it makes sense to regard (10.1.2) as a structural form equation that describes the response of y to changes in the exogenous variable x. Alternatively, (10.1.2) may be thought of as a reduced form equation corresponding to a structural form

A(L)y_t = B(L)x_t + C(L)v_t   (10.1.6)

where A(L) = φ(L)θ(L), B(L) = θ(L)γ(L), and C(L) = φ(L)α(L) are all finite-order operators.
Note that, as it stands, the model in (10.1.2) is not very useful for forecasting y_t unless something can be said about the future development of x_t. Where assumptions regarding the exogenous variable are necessary we will assume in the following that x_t is an ARMA process of the form

η(L)x_t = ψ(L)w_t   (10.1.8)

where w_t is white noise, independent of the process v_t, and η(L) and ψ(L) are finite-order invertible polynomials in the lag operator (see Chapter 7). The presumption of v_t and w_t being independent implies the type of exogeneity of x_t that is needed for the analysis of this chapter. While many of the results below remain valid without the exogeneity assumption, care has to be exercised in interpreting (10.1.2) if such an assumption is not made. For instance,

(10.1.9)
One of the most frequently used specifications is the geometric lag

β(L) = α(1 + λL + λ²L² + ⋯) = α/(1 − λL)   (10.1.10)

where 0 < λ < 1. That is, γ(L) = α and φ(L) = 1 − λL. The error term e_t can have different structures.

The Pascal lag was proposed and applied by Solow (1960) and has the form

β(L) = α(1 − λ)^m/(1 − λL)^m   (10.1.11)

where again 0 < λ < 1 and m is a positive integer. For m = 1 the lag structure reduces to a geometric lag with geometrically declining lag weights β_i. If m is greater than one, the lag weights increase first and then taper off. Such a lag shape is often regarded as plausible in practical applications.
Also the discounted polynomial lag proposed by Schmidt (1974a) can be regarded as an attempt to give more flexibility to the geometric lag specification. Its general form is

β(L) = Σ_{i=0}^{∞} λ^i p(i) L^i   (10.1.12)

where p(i) is a polynomial in i of degree r, or equivalently

β(L) = γ(L)/(1 − λL)^{r+1}   (10.1.13)

where γ(L) has degree r [see, e.g., Burt (1980), Lütkepohl (1982a)]. For r = 0 we have again the simple geometric lag. It will be seen below that the parameters of this model are relatively easy to estimate.
The general form of (10.1.2) with α(L) = θ(L) = 1 is often referred to as the
rational distributed lag model and was introduced in this context by Jorgenson
(1966) who also investigates the special version
β(L) = γ(L) / ∏_{j=1}^{r} [exp(ξ_j) − L]   (10.1.14)

Note that both the numerator and the denominator polynomial have the same degree.
Usually the above lag structures are regarded as being approximations to the
"true" lag structure f3(L). It was pointed out by Sims (1972) that such approxi-
mations have to be interpreted carefully. In particular, a lag structure capable
of uniformly approximating the true lag weights arbitrarily well will not neces-
sarily provide a long-run response arbitrarily close to the true response. Fur-
thermore, Sims gives an example where true and approximate lag weights are
only an ε-distance apart, but the standard error of forecast diverges to infinity
with growing sample size. Thus, tighter approximation properties than just a
uniform approximation of the lag coefficients are in general required to justify
the use of a particular lag model. Sims shows that the rational lags fulfill such
requirements under quite general conditions for the true lag structure and the
exogenous variable. A similar result for the discounted polynomial lag is given
by Schmidt and Mann (1977) and Lütkepohl (1980).
In the next section suggestions for choosing an adequate model for a given
set of economic variables are discussed. Estimation problems are considered in
Section 10.3 and checking the adequacy of a fitted model is considered in
Section 10.4. Some alternative infinite lag specifications are presented in Sec-
tion 10.5 and some problems related to the interpretation of the lag weights are
treated in Section 10.6. Conclusions are given in Section 10.7. A guide for this
chapter is presented in Table 10.1.
The major problem in working with a lag model of the type (10.1.2) is the specification of the operators γ(L), φ(L), α(L), and θ(L). In particular, selecting the orders r, m, p, and q is often a difficult exercise. Roughly, there are two approaches to specifying a lag model: (1) a theory-based approach and (2) a data-based approach. Both possibilities will be considered in turn in the following.
10.2.1 Theory-based Model Specification
Various theories lead to specific distributed lag models. Some examples will be
presented below to illustrate how economists have specified distributed lag
TABLE 10.1 INFINITE DISTRIBUTED LAG MODELS

Special cases: geometric lag; Pascal lag; rational lag; discounted polynomial lag.
Specification approaches: theory-based (Section 10.2.1); data-based (Section 10.2.2): Box-Jenkins (Section 10.2.2a), approximation of lag structure (Section 10.2.2b), common factor analysis (Section 10.2.2c).
Interpretation of the lag weights: omitted variables (Section 10.6.1); aggregation (Section 10.6.2); seasonality (Section 10.6.3).
models in the past. For other examples see Nerlove's (1972) survey and Just (1977).
One example is the adaptive expectations model, in which y_t depends on the expected level x_t* of the explanatory variable,

y_t = a₀ + a₁x_t* + e_t   (10.2.1)

The expectations x_t* are revised in proportion to the error associated with the previous level of expectations:

x_t* − x_{t−1}* = λ(x_{t−1} − x_{t−1}*)   (10.2.2)

where 0 < λ < 1 is the coefficient of expectations, x_{t−1} is the actual observable price, e_t is the stochastic term, and a₀ and a₁ are parameters. If we solve the above difference equation for x_t*, we have

x_t* = Σ_{i=0}^{∞} λ(1 − λ)^i x_{t−1−i}   (10.2.3)

which implies a geometrically declining distributed lag form for expected prices as a function of all past prices. If we substitute (10.2.3) into (10.2.1) we have the geometric distributed lag function

y_t = a₀ + a₁ Σ_{i=0}^{∞} λ(1 − λ)^i x_{t−1−i} + e_t   (10.2.4)
Koyck (1954) showed that an equation of the above form can be reduced by lagging it once, multiplying through by (1 − λ), and subtracting the result from the original equation, to yield the simple form

y_t = λa₀ + (1 − λ)y_{t−1} + a₁λx_{t−1} + e_t − (1 − λ)e_{t−1}   (10.2.5)

or

y_t = a₀ + [a₁λL/(1 − (1 − λ)L)] x_t + e_t   (10.2.6)

which is obviously a model of the type (10.1.2) (if the intercept term is neglected).

In the partial adjustment model it is hypothesized that the desired level y_t* is given by

y_t* = a₀ + a₁x_t + v_t   (10.2.7)

but only some fixed fraction (γ) of the desired adjustment is accomplished in one period:

y_t − y_{t−1} = γ(y_t* − y_{t−1})   (10.2.9)
Combining these two equations yields

y_t = a₀ + [a₁γ/(1 − (1 − γ)L)] x_t + [γ/(1 − (1 − γ)L)] v_t   (10.2.10)
Another rationalization starts from a desired level that is proportional to the explanatory variable,

y_t* = αx_t   (10.2.11)

The cost of not being in the static optimum in period t is assumed to be

(10.2.12)

Assuming an infinite planning horizon and a discount rate r, the individual has to solve the following optimization problem at time 0:

(10.2.16)
where

p_i = β_i / Σ_{j=0}^{∞} β_j   (10.2.18)

A maximum entropy (ME) lag distribution maximizes the entropy

H = − Σ_{i=0}^{∞} p_i ln p_i   (10.2.19)

It turns out that if the mean lag is known to be μ, then the ME distribution is

p_i = [1/(1 + μ)] [μ/(1 + μ)]^i   (10.2.20)
that is, a geometric distribution with λ = μ/(1 + μ). The corresponding lag model is

y_t = α(1 − λ) Σ_{i=0}^{∞} λ^i x_{t−i} + e_t = [α(1 − λ)/(1 − λL)] x_t + e_t   (10.2.21)
The above examples show that economic and other theories are usually only
able to specify a part of the model. The rest will have to be derived from the
data. Consequently, a joint theory- and data-based approach is called for. In
the next section we take the extreme viewpoint that the only prior information
about the relationship between x and Y is the exogeneity of x and the station-
arity conditions (10.1.3) and (10.1.4) are assumed to hold. Of course, these are
already quite strong assumptions in some sense, but no economic theories will
be used to justify a particular lag and tllOr structure. Viewing an economic
theory as a set of constraints to be observed in the statistical analysis, we will
more or less assume in the following section that such constraints do not exist.
If economic theory does not provide a model for the relationship between two variables x and y, statistical tools may be applied to fill the gap. That is, if the model (10.1.2) is hypothesized, the operators γ(L), φ(L), α(L), and θ(L) and, specifically, their orders have to be determined with statistical techniques. In the following we will discuss possibilities for determining the lag structure and the error structure of a general model like (10.1.1). We will first assume that the model has the special form (10.1.2).
z_t = [γ(L)/φ(L)] w_t + [η(L)α(L)/(ψ(L)θ(L))] v_t   (10.2.23)

(10.2.24)

(10.2.25)
The above outline is only meant to give a rough impression of what is in-
volved in a purely data-based specification of a model like (10.1.2). The suc-
cessful application of the above technique requires some experience and for
examples the reader may consult Box and Jenkins (1976) and Granger and
Newbold (1977).
An alternative specification scheme is proposed and discussed for example
by Haugh and Box (1977) and Granger and Newbold (1977). They "prewhiten" y_t and x_t; that is, they build a univariate ARMA model for each of the two variables first and then they construct a bivariate model for the two resulting
residual series. In a final step the univariate models are amalgamated with the
bivariate residual model. This approach can be generalized to the case of a
feedback relationship between X and y.
One possibility is to approximate the lag structure by a discounted polynomial lag

β(L) = Σ_{i=0}^{∞} λ^i p(i) L^i   (10.2.27)

where 0 < λ < 1 and p(i) = Σ_{j=0}^{r} p_j i^j. Such a model has the advantage that only the polynomial degree r has to be determined in order to specify the lag structure, instead of the two orders r and m in the case of a rational lag. We will see in the next section that varying the polynomial degree r comes down to varying the number of regressors in the regression model used for estimating the parameters. Therefore, the usual criteria for model selection such as AIC and PC (see Chapter 9 or Chapter 21) may be used to determine r.
If e_t is not white noise but a (nondeterministic) stationary stochastic process, it is known to have a possibly infinite-order moving average representation, and the model can be rewritten in the form (10.2.29) with operators θ(L) and ν(L). These operators may in turn be approximated by

Σ_{i=0}^{∞} λ^i p₁(i) L^i   and   Σ_{i=0}^{∞} λ^i p₂(i) L^i

respectively, where p₁(i) and p₂(i) are polynomials of finite but unknown degrees that have to be determined. After the model has been fitted, the lag coefficients β_i can be recovered from the fitted operators θ(L) and ν(L).
Another strategy for specifying the lag model (10.1.1) is to approximate the
infinite lag by a lag of finite length. For this purpose also the criteria discussed
in Chapter 9 can be used. Shibata (1981) derives asymptotic optimality condi-
tions for some of them. However, such a strategy may lead to unsatisfactory
results if the sample size is small since such models may involve a considerable
number of parameters. Below we will discuss a procedure for finding a more
parsimonious specification.
(10.2.31)
and v_t is white noise as usual, the following procedure is useful to reduce the number of parameters to be estimated. In (10.2.32) the ψ_i and ω_i are the roots of θ(L) and ν(L), respectively. If k of the ψ_i are equal to k of the ω_i, then θ(L) = ζ(L)θ̄(L) and ν(L) = ζ(L)ν̄(L), where ζ(L) consists of the k common factors of θ(L) and ν(L). Using this notation and assuming that ζ(L) is invertible, (10.2.31) can be rewritten as

(10.2.33)
10.2.3 Conclusions

To conclude this section, consider the simple dynamic model discussed by Hendry and Richard,

y_t = β₀x_t + β₁x_{t−1} + λy_{t−1} + v_t   (10.2.34)

where v_t is Gaussian white noise and x_t is exogenous. Hendry and Richard list nine special cases of (10.2.34) obtained by placing constraints on the parameters β₀, β₁, and λ. For example:

4. λ = 1 and β₀ = −β₁ imply a static linear regression model for the growth rates of the two variables.
5. For λ = 0, (10.2.34) is a finite distributed lag model with lag length 1.
6. Constraining β₁ to be zero results in a partial adjustment model (see Section 10.2.1b).
7. A simple regression model with AR(1) error term is implied by β₁/β₀ = −λ. This can be seen by rewriting (10.2.34) as y_t = β₀x_t + e_t with e_t = λe_{t−1} + v_t.
This example shows how "small" changes in a model can make a great
difference for its interpretation. If there is uncertainty as to which interpreta-
tion is appropriate in a particular case, a framework such as the above makes it
easy to use the data to settle the matter by testing constraints for the general
specification (10.2.34). Of course, sometimes there may not be sufficient infor-
mation in the data to obtain a definite model even if one starts from a specifica-
tion as simple as (10.2.34). This problem is aggravated if the starting point is a
more general specification involving more lags of x and Y like the model
(10.1.2), and the applied researcher should avoid interpreting the lack of evi-
dence against a specific model due to insufficient information as evidence in
favor of this model. In Section 10.3 we will assume that a model has been
specified so that we can proceed to parameter estimation.
In this section we consider estimation of the geometric lag model

y_t = α Σ_{i=0}^{∞} λ^i x_{t−i} + e_t = f_t(α,λ) + e_t,   t = 1, . . . , T   (10.3.1)

under alternative assumptions for e_t. Here f_t(α,λ) is defined in the obvious way and 0 < λ < 1.
10.3.1a Exact Maximum Likelihood When the Disturbances Are Not Autocorrelated
If the e_t are a Gaussian white noise process (that is, the e_t are independent and identically normally distributed with zero mean and variance σ_e²), it is easy to see that ML and LS (least squares) estimation are equivalent (see Chapter 6). Therefore, instead of maximizing the likelihood function, we can minimize

S(α,λ) = Σ_{t=1}^{T} [y_t − f_t(α,λ)]² = Σ_{t=1}^{T} [y_t − α Σ_{i=0}^{∞} λ^i x_{t−i}]²   (10.3.2)
An obvious problem results since the x_t are unknown for t ≤ 0. One way to proceed in this case was suggested by Klein (1958). The model (10.3.1) is rewritten as

y_t = α Σ_{i=0}^{t−1} λ^i x_{t−i} + λ^t α Σ_{i=0}^{∞} λ^i x_{−i} + e_t
    = α Σ_{i=0}^{t−1} λ^i x_{t−i} + λ^t η₀ + e_t   (10.3.3)

where the truncation remainder η₀ = α Σ_{i=0}^{∞} λ^i x_{−i} is treated as an additional unknown parameter.
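Since (10.3.3) is linear in α and η₀ for fixed λ, the minimization can be carried out by a grid search over λ. A minimal sketch, assuming numpy and a one-dimensional array x aligned with y:

    import numpy as np

    def klein_ls(y, x, grid=np.linspace(0.01, 0.99, 99)):
        T, best = len(y), None
        for lam in grid:
            # w_t = sum_{i=0}^{t-1} lam**i x_{t-i}, plus the regressor lam**t for eta0
            w = np.array([np.sum(lam**np.arange(t + 1) * x[t::-1]) for t in range(T)])
            Z = np.column_stack([w, lam**np.arange(1, T + 1)])
            coef = np.linalg.lstsq(Z, y, rcond=None)[0]
            ssr = np.sum((y - Z @ coef)**2)
            if best is None or ssr < best[0]:
                best = (ssr, lam, coef[0], coef[1])
        ssr, lam, alpha, eta0 = best
        return alpha, lam, eta0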
If the limit

lim_{T→∞} (1/T) Σ_{t=1}^{T} x_t²   (10.3.4)

exists and is finite, the ML estimators α̂, λ̂, and σ̂_e² are consistent, asymptotically efficient, and normally distributed; that is,

√T [(α̂, λ̂, σ̂_e²)' − (α, λ, σ_e²)'] → N(0, Ω) in distribution   (10.3.5)
where

(10.3.6)

with

∂f_t(α,λ)/∂α = Σ_{i=0}^{∞} λ^i x_{t−i}   and   ∂f_t(α,λ)/∂λ = α Σ_{i=1}^{∞} i λ^{i−1} x_{t−i}   (10.3.7)

An alternative is to compute the residuals recursively from

e_t = y_t − λy_{t−1} − αx_t + λe_{t−1}   (10.3.8)

and to minimize

S*(α,λ) = e'e   (10.3.9)

where e' = (e₂, . . . , e_T). This function can be evaluated recursively starting with e₂, if e₁ is assumed to be zero. Minimizing (10.3.9) is equivalent to ML estimation when y₁ is assumed fixed. Hence, the LS estimates will be asymptotically equivalent to the exact ML estimates.
The function S*(α,λ) can be minimized with respect to α and λ using an iterative optimization method. Employing the Gauss algorithm is one possibility. As discussed in Appendix B, the nth iteration has the general form

θ_{n+1} = θ_n − [(∂e/∂θ)'(∂e/∂θ)]⁻¹ (∂e/∂θ)' e   (10.3.10)

where θ_n = (α_n, λ_n)' and all derivatives are evaluated at θ_n. Thus the Gauss method requires only first partial derivatives, and these can also be determined recursively since, from (10.3.8),

∂e_t/∂α = −x_t + λ ∂e_{t−1}/∂α

and

∂e_t/∂λ = −y_{t−1} + e_{t−1} + λ ∂e_{t−1}/∂λ   (10.3.11)
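A sketch of one Gauss iteration built on the recursions (10.3.8) and (10.3.11), assuming numpy, e₁ = 0, and zero initial derivatives:

    import numpy as np

    def gauss_step(y, x, alpha, lam):
        T = len(y)
        e, de_da, de_dl = np.zeros(T), np.zeros(T), np.zeros(T)
        for t in range(1, T):
            e[t] = y[t] - lam * y[t - 1] - alpha * x[t] + lam * e[t - 1]  # (10.3.8)
            de_da[t] = -x[t] + lam * de_da[t - 1]                        # (10.3.11)
            de_dl[t] = -y[t - 1] + e[t - 1] + lam * de_dl[t - 1]
        J = np.column_stack([de_da, de_dl])[1:]
        step = np.linalg.solve(J.T @ J, J.T @ e[1:])                     # (10.3.10)
        return alpha - step[0], lam - step[1]

Iterating gauss_step until the parameter change is negligible minimizes S*(α,λ).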
Since the regressor y_{t−1} in the reformulated model y_t = λy_{t−1} + αx_t + u_t is correlated with the error term u_t = e_t − λe_{t−1}, it is known that the LS estimator of λ will be inconsistent [see Chapter 5, Judge et al. (1982, Chapter 27) or Doran and Griffiths (1978)]. However, the instrumental variable technique can be used to obtain consistent estimators.

To obtain useful instruments recall that an instrumental variable should be correlated with the associated regressor but uncorrelated with the error term. For example, as instrument for y_{t−1} we may use x_{t−1}, so that we get as instrumental variable estimator

(α̃, λ̃)' = (Z'X)⁻¹Z'y   (10.3.12)

where

y = (y₂, . . . , y_T)',   X = [ x₂   y₁            Z = [ x₂   x₁
                                x₃   y₂                  x₃   x₂
                                ⋮                        ⋮
                                x_T  y_{T−1} ],          x_T  x_{T−1} ]   (10.3.13)
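A sketch of (10.3.12) and (10.3.13), assuming numpy:

    import numpy as np

    def iv_geometric(y, x):
        yv = y[1:]
        Xm = np.column_stack([x[1:], y[:-1]])   # regressors (x_t, y_{t-1})
        Z = np.column_stack([x[1:], x[:-1]])    # instruments (x_t, x_{t-1})
        alpha, lam = np.linalg.solve(Z.T @ Xm, Z.T @ yv)
        return alpha, lam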
Suppose now that the errors follow an AR(1) process,

e_t = ρe_{t−1} + v_t   (10.3.14)

where −1 < ρ < 1 and v_t is Gaussian white noise with variance σ_v². This case can be treated as discussed in Chapter 8. Writing (10.3.1) as

y = f(α,λ) + e   (10.3.15)

and transforming with

P = [ √(1 − ρ²)   0   ⋯   0   0
      −ρ          1   ⋯   0   0
      0          −ρ   ⋯   0   0
      ⋮
      0           0   ⋯   1   0
      0           0   ⋯  −ρ   1 ]   (10.3.16)

gives

y* = f*(α,λ,ρ) + e*   (10.3.17)
where y* = Py, f*(α,λ,ρ) = Pf(α,λ), and e* = Pe. The sum of squares

S(α,λ,ρ) = [y* − f*(α,λ,ρ)]'[y* − f*(α,λ,ρ)]   (10.3.18)

can be minimized (see Section 8.2.1d). If the explanatory variables x_t satisfy the condition (10.3.4), the ML estimators are consistent, asymptotically efficient, and normally distributed,

√T [(α̂, λ̂, ρ̂)' − (α, λ, ρ)'] → N(0, Ω*) in distribution   (10.3.19)

where

σ̂_v² = [y* − f*(α̂,λ̂,ρ̂)]'[y* − f*(α̂,λ̂,ρ̂)] / T   (10.3.20)
and

Ω*⁻¹ = lim_{T→∞} block-diag( σ_v⁻² T⁻¹ [∂f/∂(α,λ)]' P'P [∂f/∂(α,λ)],  1/(1 − ρ²),  1/(2σ_v⁴) )   (10.3.21)
In either case f*(α,λ,ρ) contains the presample values of x_t, t ≤ 0, that are usually not available in practice. They can be replaced by zero or estimated via the truncation remainder term (see Section 10.3.1a). Again this term cannot be estimated consistently.
An alternative way to compute approximate ML estimates results from writing the model in lag operator notation as

y_t = αx_t/(1 − λL) + v_t/(1 − ρL)   (10.3.22)

and minimizing

Σ_{t=3}^{T} v_t² = Σ_{t=3}^{T} [y_t − (λ + ρ)y_{t−1} + λρy_{t−2} − αx_t + αρx_{t−1} + λv_{t−1}]²   (10.3.23)

which is equivalent to ML estimation when the first two observations, y₁ and y₂, are assumed to be fixed and v₂ = 0. If the Gauss algorithm is used to find the
minimum of (10.3.23), the nth iteration is

δ_{n+1} = δ_n − [(∂v/∂δ)'(∂v/∂δ)]⁻¹ (∂v/∂δ)' v   (10.3.24)

where δ_n = (α_n, λ_n, ρ_n)' is the parameter vector obtained in the previous iteration, v = (v₃, v₄, . . . , v_T)', and all derivatives are evaluated at δ_n and can be determined recursively as in Section 10.3.1b. If the starting vector δ₁ contains consistent estimates of α, λ, and ρ, asymptotically efficient estimates are obtained in one iteration of the Gauss algorithm; that is, δ₂ will be an asymptotically efficient estimate (see Chapter 6).
As in the case of independent disturbances, instrumental variable estimation of y_t = λy_{t−1} + αx_t + u_t, where u_t = e_t − λe_{t−1}, provides consistent estimates λ̃ and α̃ of λ and α (see Section 10.3.1c). These estimates can be used to compute residuals

ẽ_t = y_t − α̃ Σ_{i=0}^{t−1} λ̃^i x_{t−i}   (10.3.25)

from which a consistent estimate ρ̃ of ρ can be obtained. The estimates λ̃, α̃, and ρ̃ can be used as initial values for the Gauss or some other iterative algorithm. Note that the truncation remainder has been omitted in computing ẽ_t in (10.3.25). From an asymptotic point of view this simplification is of no consequence. However, in small samples it may be preferable to take account of the truncation remainder when computing the ẽ_t. For details see Amemiya and Fuller (1967), and for a different development of the two-step procedure see Hatanaka (1974).
A number of authors have discussed the above and other estimation proce-
dures for the geometric lag model. These include Koyck (1954), Cagan (1956),
Taylor and Wilson (1964), Wallis (1967), Oi (1969), Dhrymes (1969, 1971b),
Zellner and Park (1965), Harvey (1981), and Fomby and Guilkey (1983). More
details are provided in Judge et al. (1980), and a summary of the earlier work on
the subject is contained in Dhrymes (1971a).
A similar simplification as in (10.3.3) for the geometric lag can be obtained for the discounted polynomial lag in (10.1.12). To illustrate we assume that (10.1.1) has the form

y_t = Σ_{i=0}^{∞} λ^i p(i) x_{t−i} + e_t   (10.3.27)

where

p(i) = Σ_{j=0}^{r} p_j i^j   (10.3.28)

The model can then be rewritten as

y_t = Σ_{j=0}^{r} p_j s_{jt} + r_t + e_t   (10.3.29)

where

s_{jt} = Σ_{i=0}^{t−1} λ^i i^j x_{t−i},   j = 0, 1, . . . , r   (10.3.30)

and the truncation remainder has the form

r_t = λ^t Σ_{j=0}^{r} η_j t^j   (10.3.31)

The η_j contain the presample values of x and can be treated as further parameters in the model. Alternatively, the r_t can be replaced by zero without affecting the asymptotic properties of the estimators.

(10.3.32)
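A sketch of the regression (10.3.29) and (10.3.30) for fixed λ, with the truncation remainder r_t set to zero, assuming numpy; λ itself can be chosen by repeating the fit over a grid and keeping the value with the smallest residual sum of squares.

    import numpy as np

    def discounted_poly_ls(y, x, r, lam):
        T = len(y)
        S = np.zeros((T, r + 1))
        for t in range(T):
            i = np.arange(t + 1)                           # available lags 0, ..., t
            for j in range(r + 1):
                S[t, j] = np.sum(lam**i * i**j * x[t - i]) # s_jt of (10.3.30)
        p = np.linalg.lstsq(S, y, rcond=None)[0]           # estimates of p_0, ..., p_r
        return p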
If the error term in the original lag model (10.1.1) is not white noise, it is preferable to transform to the version (10.2.29) and then approximate the operators θ(L) and ν(L) as explained in Section 10.2.2b. A similar transformation as above can then be used to obtain a linear model with a finite set of parameters that can be estimated by linear least squares if the discount factor λ is fixed. For details see Lütkepohl (1982a).
Of course, the infinite lag operators θ(L) and ν(L) in (10.2.29) could also be approximated by finite lags. This possibility is considered by Sims (1971a) and Shibata (1981). As pointed out earlier, the disadvantage of this procedure is that a substantial number of parameters may be required to obtain an adequate approximation.
Various estimation procedures have been investigated for models other than the special cases mentioned above. For instance, Solow (1960), Chetty (1971), Maddala and Rao (1971), Schmidt (1973), and Guthrie (1976) consider
estimation of the Pascal distributed lag [see (10.1.11)] and maximum likelihood
procedures for rational distributed lag models with white noise error term are
considered by Dhrymes, Klein, and Steiglitz (1970) and Maddala and Rao
(1971). In Section 10.3.3 we will discuss the estimation of general models of the
type (10.1.2).
Multiplying (10.1.2) by φ(L)θ(L) gives φ(L)θ(L)y_t = θ(L)γ(L)x_t + φ(L)α(L)v_t, or

y_t = −(1/L)[φ(L)θ(L) − 1] y_{t−1} + θ(L)γ(L) x_t + φ(L)α(L) v_t
    = ψ₁y_{t−1} + ⋯ + ψ_{m+p}y_{t−(m+p)} + η₀x_t + ⋯ + η_{r+p}x_{t−(r+p)} + v_t + ξ₁v_{t−1} + ⋯   (10.3.33)

where the ψ_i, η_i, and ξ_i are determined by the coefficients of φ(L), θ(L), γ(L), and α(L). Assuming that v_t is Gaussian white noise and that, for h = max(m + p, r + p), the y₁, y₂, . . . , y_h are fixed and v_t = 0 for t ≤ h, then, instead of maximizing the likelihood function, it follows from (10.3.33) that we can equivalently minimize the sum of squares function

v'v = Σ_{t=h+1}^{T} [y_t − ψ₁y_{t−1} − ⋯ − ψ_{m+p}y_{t−(m+p)} − η₀x_t − ⋯ − η_{r+p}x_{t−(r+p)} − ξ₁v_{t−1} − ⋯]²
    = Σ_{t=h+1}^{T} [φ(L)θ(L)y_t − θ(L)γ(L)x_t − (1/L)(φ(L)α(L) − 1)v_{t−1}]²   (10.3.34)
As in the case of the simple geometric lag model, consistent starting values can be obtained by instrumental variable estimation. Premultiplying (10.1.2) by φ(L) gives the linear model

φ(L)y_t = γ(L)x_t + u_t   (10.3.35)

which is linear in the coefficients of φ(L) and γ(L), with lagged x's serving as instruments for the lagged y's. Using the resulting estimates φ̂(L) and γ̂(L), residuals

ê_t = y_t − [γ̂(L)/φ̂(L)] x_t   (10.3.36)

where presample values of x_t are set at zero, can be computed. These can be used to obtain initial estimates of α(L) and θ(L) by methods described in Chapter 7.
Algorithms especially designed for estimating the parameters of models like
(10.1.2) exist in many computing centers, and rather than trying to write one's
own routine it is usually preferable to make use of this service where available.
An alternative is to estimate the parameters in the frequency domain by minimizing an expression of the form

S = (1/2π) Σ_{j=−s+1}^{s} [f_y(ω_j) − 2β(ω_j)f_{xy}(ω_j) + |β(ω_j)|² f_x(ω_j)] / f_e(ω_j)   (10.3.37)

where f_x(ω) denotes the spectral density of the process x_t,

f_x(ω) = (1/2π) Σ_{k=−∞}^{∞} γ_k e^{−iωk}   (10.3.38)

(see Chapter 7). Similarly, the cross-spectral density of two jointly stationary processes x_t and y_t with cross-covariances γ_k(x,y) = cov(x_{t+k}, y_t) is

f_{xy}(ω) = (1/2π) Σ_{k=−∞}^{∞} γ_k(x,y) e^{−iωk}   (10.3.39)
The estimation of spectral densities is briefly discussed in Section 7.6 and will
not be repeated here. For a more detailed presentation of spectral methods see
Dhrymes (1971a). Sims (1974a) also discusses another (though inefficient) esti-
mation method due to Hannan (1963, 1965) and its equivalence to time-domain
estimation methods. For an application of spectral methods in a distributed lag
analysis see Cargill and Meyer (1972) and for a Monte Carlo comparison with
other procedures see Cargill and Meyer (1974).
For the estimators resulting from the procedures discussed in the foregoing, in general only large sample properties are known. To investigate the small sample properties, a number of Monte Carlo experiments have been carried out. Many of them are based on a geometric distributed lag model of the type (10.3.1).
10.4 CHECKING THE MODEL ADEQUACY

After a model has been fitted to a given set of data, some checks for the adequacy of the model should be carried out. This procedure is routine in a
static linear regression analysis where t values and F values as well as the
Durbin-Watson statistic are examined. Similar checks are useful for a dynamic
model. Clearly one should check the significance of individual parameters and,
if possible, reduce the model size in accordance with the principle of parsi-
mony. Also a residual analysis can be useful. However, for a dynamic model
something more should be done than just a test against first-order autocorrela-
tion like the Durbin-Watson test for nondynamic regression models.
Assuming that the fitted model has the general form (10.1.2) and the exoge-
nous variable x is generated by the ARM A process (10.1.8), Box and Jenkins
(1976) suggest to inspect the residual autocorrelations and the cross-correla-
tions between the residuals of (10.1.2) and (10.1.8). Estimating the residual
autocorrelations of (10.1.2) by
r_k(v) = Σ_{t=1}^{T−k} v_t v_{t+k} / Σ_{t=1}^{T} v_t²   (10.4.1)
and assuming that they are calculated from the true model errors, where the true parameters are assumed to be known, these quantities are approximately normally distributed with mean zero and variance $1/T$. Thus, a rough check of the model adequacy would be to compare the $r_k(v)$ with $\pm 2/\sqrt{T}$, the bounds of an approximate 95% confidence interval around zero. Also, the user of this check should look for possible patterns in the correlogram that may indicate nonwhite residuals.
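As an illustration, the following minimal Python sketch computes the residual autocorrelations of (10.4.1) and compares them with the $\pm 2/\sqrt{T}$ bounds. The residual series is simulated here purely for illustration; it is not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
v = rng.normal(size=T)            # illustrative stand-in for the residuals v_t

def residual_autocorrelations(v, max_lag):
    # r_k(v) = sum_{t=1}^{T-k} v_t v_{t+k} / sum_{t=1}^{T} v_t^2, as in (10.4.1)
    denom = np.sum(v**2)
    return np.array([np.sum(v[:T - k] * v[k:]) / denom
                     for k in range(1, max_lag + 1)])

bound = 2.0 / np.sqrt(T)          # approximate 95% bounds around zero
for k, rk in enumerate(residual_autocorrelations(v, 15), start=1):
    note = "outside" if abs(rk) > bound else "within"
    print(f"lag {k:2d}: r_k = {rk:+.3f}  ({note} the bounds +/-{bound:.3f})")
```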
It is clear, however, that this check of the model adequacy is very crude since in practice the autocorrelations are computed from the estimation residuals; that is, $r_k(\hat v)$, say, is used instead of $r_k(v)$. As a consequence these quantities will be correlated, and $1/T$ will be a poor approximation to their variance, at least for small $k$. In fact, $1/T$ may overstate the true asymptotic variance considerably for small $k$, and thus an uncritical comparison with $\pm 2/\sqrt{T}$ may give the wrong impression of a correctly specified model when it is actually inadequate. Therefore, it is preferable to use tests based on exact asymptotic distributions.
One candidate is a portmanteau test based on the statistic

$$Q = T\sum_{k=1}^{K}r_k^2(\hat v) \tag{10.4.3}$$

which has an asymptotic $\chi^2$ distribution under the null hypothesis of white noise residuals. For forecasting purposes, unknown future values of the variables are replaced by their optimal forecasts; that is, $\hat x_s = x_s$ for $s \le t$ and $\hat x_{t+h} = x_t(h)$, the optimal $h$-step forecast of $x_t$ (see Chapter 7), for $h \ge 1$, and similarly for $\hat y_s$. This formula can be used for recursively calculating the $y_t(h)$ starting with $h = 1$. A more detailed discussion of forecasting with dynamic models will be given in a more general framework in Chapter 16. It should be understood that the model specification has to be reexamined if the model fails to pass any of the adequacy checks.
10.5 ALTERNATIVE DISTRIBUTED LAG MODELS
In certain situations the use of the rational lag model including its special case versions is not satisfactory, for instance, if nonsample information is available that cannot be easily accounted for in the specification (10.1.2). For example, it is not easy to constrain the lag weights to be all nonnegative in the general rational lag model. Therefore, alternative lag distributions have been proposed that make it easier to allow for certain prior information. We will present two of these lag models below. It will be assumed throughout this section that the error term is Gaussian white noise. This assumption is imposed for ease of presentation. If, in practice, it is not satisfied, a model for the error term can be constructed based on the estimation residuals obtained from the procedures introduced below.
In the gamma distributed lag proposed by Tsurumi (1971), the $\beta$'s are assumed to fall on a multiple of the density of a gamma distribution defined by

$$f(x) = \frac{\gamma^s}{\Gamma(s)}x^{s-1}\exp(-\gamma x), \qquad x > 0,\ \gamma > 0,\ s > 0 \tag{10.5.1}$$

where $a$ and $s$ are parameters to be estimated. The distributed lag model, when this scheme is substituted (with $\gamma = 1$), becomes

$$y_t = a\sum_{i=1}^{\infty}i^{s-1}\exp(-i)x_{t-i} + e_t \tag{10.5.4}$$

Some gamma distributions are shown in Figure 10.1, where $\beta_0$ is zero for $s \neq 1$. They show intuitively plausible, unimodal lag schemes, which is the reason for using the gamma lag. For estimation, Tsurumi truncates the infinite sum at $m$ and writes

$$y_t = a\sum_{i=1}^{m}i^{s-1}\exp(-i)x_{t-i} + \eta_t + e_t \tag{10.5.5}$$

where

$$\eta_t = a\sum_{i=m+1}^{\infty}i^{s-1}\exp(-i)x_{t-i} \tag{10.5.6}$$
Schmidt (1974b) suggests replacing the exponent $s - 1$ with $\delta/(1 - \delta)$, where $0 \le \delta < 1$. Thus the lag parameter specification would be

$$\beta_i = a(i + 1)^{\delta/(1-\delta)}\lambda^i \tag{10.5.10}$$

with parameters $0 \le \delta < 1$ and $0 \le \lambda < 1$. The above specification reduces to the geometric lag model when $\delta = 0$.

In estimation, the distributed lag function is written as

$$y_t = az_t + \eta_t + e_t \tag{10.5.11}$$

where

$$z_t = \sum_{i=0}^{t-1}(\beta_i/a)x_{t-i} \tag{10.5.12}$$

$$\eta_t = \sum_{i=t}^{\infty}\beta_i x_{t-i}$$

and $\beta_i$ is specified in (10.5.10). Note that $\eta_t$ is different from, but consistent with, the "truncation remainder" of (10.3.3). Schmidt acknowledges the time-dependent nature of $\eta_t$ but argues that it is asymptotically negligible and that its omission will not affect the asymptotic properties of the resulting estimates. If $\eta_t$ is dropped from the equation and the $e_t$ are assumed to be identically and independently normally distributed, the maximum likelihood estimates of $\delta$, $\lambda$, and thus $a$ are also the least squares estimates, since the estimation of $a$ is linear, given $\delta$ and $\lambda$ (hence $z_t$). Schmidt suggests searching over $\delta$ and $\lambda$ and then picking the values that minimize the sum of squared errors. For the asymptotic variances of the estimates, the inverse of the information matrix can be used. The information matrix is given by Schmidt (1974b).
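A minimal Python sketch of this grid search follows. The weight form $\beta_i = a(i+1)^{\delta/(1-\delta)}\lambda^i$ is the assumption stated above for the unreproduced specification (10.5.10), and the data are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 150
x = rng.normal(size=T)

def z_series(delta, lam, x):
    # z_t = sum_{i=0}^{t} (beta_i / a) x_{t-i}, cf. (10.5.12)
    T = len(x)
    i = np.arange(T)
    w = (i + 1.0)**(delta / (1.0 - delta)) * lam**i
    return np.array([w[:t + 1] @ x[t::-1] for t in range(T)])

# illustrative data with a = 0.5, delta = 0.4, lambda = 0.7
y = 0.5 * z_series(0.4, 0.7, x) + rng.normal(scale=0.5, size=T)

best = None
for delta in np.arange(0.0, 0.95, 0.05):
    for lam in np.arange(0.05, 0.95, 0.05):
        z = z_series(delta, lam, x)
        a_hat = (z @ y) / (z @ z)        # LS for a, linear given (delta, lambda)
        sse = np.sum((y - a_hat * z)**2)
        if best is None or sse < best[0]:
            best = (sse, delta, lam, a_hat)

print("SSE %.2f at delta=%.2f lambda=%.2f a=%.3f" % best)
```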
A theoretical justification for using the gamma lag is provided by Theil and Fiebig (1981). They note that this distribution arises as a maximum entropy distribution (see Section 10.2.1e). Theil and Stern mentioned the gamma distribution as a plausible lag distribution as early as 1960 [Theil and Stern (1960)]. Although the gamma lag is quite flexible, there are many intuitively appealing lag structures that cannot be approximated very well by a member of this class of lag schemes. Viewing this problem, Lütkepohl (1981) proposes a class of infinite lag models that avoid sign changes in the lag weights and can approximate any possible lag structure to any desired degree of accuracy. We may call
this family of lag models exponential lag models, since the lag weights are of the form $\beta_i = a\exp p(i)$, where

$$p(i) = \sum_{k=1}^{m}p_ki^k \tag{10.5.15}$$

The model is then

$$y_t = a\sum_{i=0}^{\infty}[\exp p(i)]x_{t-i} + e_t, \qquad t = 1, \ldots, T \tag{10.5.16}$$
For estimation, the parameters are chosen to minimize the sum of squares

$$S = (\mathbf{y} - \mathbf{f})'(\mathbf{y} - \mathbf{f}) \tag{10.5.17}$$

where

$$\mathbf{f} = \Bigl[a\sum_{i=0}^{T_0}[\exp p(i)]x_{1-i},\ \ldots,\ a\sum_{i=0}^{T_0+T-1}[\exp p(i)]x_{T-i}\Bigr]' \tag{10.5.18}$$

and $T_0$ is some positive number determining how many presample values of $x$ are used. Since $\exp p(i)$ approaches zero as $i$ goes to infinity, it is clear that the impact of the initial conditions declines with increasing $T_0$, which, on the other hand, reduces the degrees of freedom. Thus, the
choice of $T_0$ will depend on the size of the available sample. It should be clear, however, that $T_0$ has no impact on the asymptotic properties of the estimators, and asymptotic properties are all we can derive since the considered model is nonlinear in the parameters.
Alternatively, the impact of the presample values of $x$ can be mitigated by replacing them by "back forecasts" [see Box and Jenkins (1976, Section 6.4.3)]. For a stationary process $x_t$ the presample values could simply be replaced by the mean of the observed $x_t$. The minimization of (10.5.17) can be accomplished by using a nonlinear least squares algorithm (see Appendix B). A crucial problem in setting up an exponential lag model is the choice of the polynomial order $m$. Lütkepohl suggests using model selection criteria or sequential testing procedures for this purpose.
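The following Python sketch illustrates nonlinear least squares estimation of (10.5.15) and (10.5.16), truncating the sum at $T_0$ lags and replacing presample values of $x$ by the sample mean as suggested above. The data, $m = 2$, and $T_0 = 30$ are illustrative choices, not values from the text.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(2)
T, T0, m = 200, 30, 2
x = rng.normal(size=T)
x_ext = np.concatenate([np.full(T0, x.mean()), x])   # presample values

def fitted(params):
    a, p = params[0], params[1:]
    i = np.arange(T0 + 1)
    w = a * np.exp(sum(p[k] * i**(k + 1) for k in range(m)))   # a*exp(p(i))
    # f_t = sum_i w_i x_{t-i}, drawing presample terms from x_ext
    return np.array([w @ x_ext[t + T0 - np.arange(T0 + 1)] for t in range(T)])

# illustrative unimodal lag curve: p(i) = 0.8 i - 0.2 i^2
y = fitted([1.0, 0.8, -0.2]) + rng.normal(scale=0.3, size=T)

# bound the leading coefficient of p to be nonpositive so the weights die out
res = least_squares(lambda th: y - fitted(th), x0=[0.5, 0.1, -0.05],
                    bounds=([-10, -2, -2], [10, 2, 0.0]))
print("estimates (a, p1, p2):", np.round(res.x, 3))
```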
It is worth noting that one of the earliest suggested infinite lag distributions is the lognormal distribution [see Fisher (1937)]. This lag distribution seems to have attracted little attention from later authors although it also has an intuitively plausible unimodal shape.

In closing this section we remind the reader that after having fitted any of the above lag schemes, it is imperative that the model adequacy be checked. Some of the techniques described in Section 10.4 lend themselves to this purpose.
10.6 PROBLEMS IN INTERPRETING DISTRIBUTED LAG MODELS

As shown in Section 10.2.1 there are some economic theories that lead to specific lag distributions. However, Nerlove's (1972) observation that current empirical research on lags in economic behavior is not soundly based on economic theory and that the theoretical research on distributed lags is not very strongly empirically oriented seems still valid. Therefore, lag distributions are often based purely on the data or on ad hoc assumptions rather than on sound economic theory. There are some problems that have to be recognized when empirical lag distributions are interpreted. Some of these problems will be addressed in the following discussion.

10.6.1 Omitted Variables
If "important" variables are neglected in the model, the lag distribution cannot
be interpreted as the response function of y to a unit increase in X in general.
Since allowance is made for an autocorrelated error term, this does not mean
that the model is incorrectly specified. However, it is important to note that
omitting variables may induce feedback between X and y so that X is no longer
exogenous in a model like (10.1.1). This violates one of our basic assumptions
and can lead to problems in the course of specifying a model.
On the other hand, omitting important variables may also induce a one-way causal bivariate system where feedback actually exists between the variables. This situation is particularly troublesome since it means that the bivariate empirical model implies exogeneity of $x$ in the bivariate system, whereas in the actual system, involving other variables as well, fluctuations in $y$ have an impact on $x$. An example of this problem occurring in practice is given by Lütkepohl (1982b). We emphasize, however, that if the generation process of the bivariate system is correctly specified, it is still useful for forecasting purposes even though superior forecasts may be obtained from a higher dimensional system. It will be easier to treat the problem of omitted variables in a more general multiple time series setting, and we will return to this problem in Chapter 16, where the related problem of measurement errors is also considered [see also Grether and Maddala (1973)].
10.6.3 Seasonality
When quarterly data or monthly data are used, the dependent variable in a distributed lag model may exhibit a pronounced seasonal pattern that cannot be captured by constant lag coefficients. One possibility is to allow the lag weights to vary over the seasons, as in specifications (10.6.4) to (10.6.6), of the form

$$\beta_{it} = c_i + s_{1i}D_{1,t-i} + s_{2i}D_{2,t-i} + s_{3i}D_{3,t-i}$$

where $D_1$, $D_2$, and $D_3$ are seasonal dummy variables, and the $c$'s and $s$'s are parameters. If the above seasonally varying, distributed lag weights are substituted into a general finite lag function

$$y_t = \sum_{i=0}^{n}\beta_ix_{t-i} + e_t \tag{10.6.7}$$
we obtain

$$y_t = \sum_{i=0}^{n}c_ix_{t-i} + \sum_{j=1}^{3}\sum_{i=0}^{n}s_{ji}D_{j,t-i}x_{t-i} + e_t \tag{10.6.8}$$

Pesando (1972) estimated the relationship (10.6.8) by constraining the $c$'s and $s$'s to lie along low-order polynomials, thus employing four sets of Almon variables (Chapter 9) in actual regressions.
10.7 SUMMARY

In this chapter we have discussed rather general infinite distributed lag models as well as a number of special cases. A summary of the considered models is given in Table 10.2. We have seen how rational distributed lag models with ARMA error term can arise from economic theories, and we have sketched methods for specifying such models if no theory is available from which to derive a precise specification. In many cases it will be necessary to combine economic theory and sample information to obtain a reasonable model since some of the theories leading to distributed lag models are based on ad hoc assumptions that will not necessarily provide an acceptable description of the data generation process.
Once a model is specified, its parameters have to be estimated. We have mainly considered maximum likelihood methods and approximate maximum likelihood methods. For a general rational distributed lag model with ARMA errors, these methods lead to a nonlinear optimization problem. Available software can be used for its solution. In special cases the applied researcher may take advantage of the simplifications noted in Section 10.3.
After having fitted a particular model, its adequacy needs to be checked, and if inadequacies are detected, modifications have to be made. This is an important step in all econometric analyses. Whereas in static linear regression models a test against first-order residual autocorrelation by using the usual Durbin-Watson statistic may be sufficient, tests against higher-order autocorrelation should accompany dynamic modeling. Some possible tests have been indicated in Section 10.4.
If nonsample information is available that cannot easily be incorporated in
the general rational lag model, it may be useful to select the lag model from
some other class of lag distributions. In Section 10.5 we have presented the
gamma lag and the exponential lag for that purpose. The former can approxi-
mate a wide range of unimodal distributed lags with nonnegative or nonpositive
lag coefficients, whereas the latter can approximate any reasonable distributed
lag function without sign changes in the lag coefficients to any desired degree of
accuracy.
If the specification of a lag model is based on sample information, the inter-
pretation of the lag coefficients is usually difficult. Some sources for such
problems are discussed in Section 10.6. In particular, the impact of omitted
variables, temporal aggregation, and seasonality have been considered.
TABLE 10.2 ALTERNATIVE INFINITE DISTRIBUTED LAG MODELS

Rational: $\beta(L) = \gamma(L)/\phi(L)$, where $\gamma(L)$ and $\phi(L)$ are polynomials in the lag operator $L$ of order $r$ and $m$, respectively
10.8 EXERCISES

The exercises below are based on the geometric lag model

$$y_t = \alpha_0 + \alpha\sum_{i=0}^{\infty}\lambda^ix_{t-i} + e_t \tag{10.8.1}$$

where $0 < \lambda < 1$, $e_t = \rho e_{t-1} + v_t$, $-1 < \rho < 1$, and $v_t$ is zero mean Gaussian white noise with variance $\sigma_v^2$. Monte Carlo data should be generated using the transformed model

$$y_t = \alpha_0(1 - \lambda) + \lambda y_{t-1} + \alpha x_t + e_t - \lambda e_{t-1}$$

with $\alpha_0 = 10$, $\alpha = 0.3$, $\lambda = 0.8$, $\rho = 0.6$, $y_0 = 53.3$, $e_0 = 2.84$, and $v_t \sim N(0,4)$. One hundred fixed values for the explanatory variable $x_t$ are listed in Table 10.3.

Exercise 10.1
Generate a sample of size 100. Use the first 25 values to compute instrumental variable estimates for $\alpha_0$, $\alpha$, and $\lambda$ with $x_{t-1}$ as instrument for $y_{t-1}$ (see Section 10.3.1c).
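A minimal Python sketch of the data generation and the instrumental variable step of Exercise 10.1 follows. The fixed $x_t$ values of Table 10.3 are not reproduced here, so a random series stands in for them.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 100
alpha0, alpha, lam, rho = 10.0, 0.3, 0.8, 0.6
x = rng.normal(loc=5.0, scale=2.0, size=T + 1)       # stand-in for Table 10.3

y_prev, e_prev = 53.3, 2.84                          # y_0 and e_0
y, e = np.empty(T), np.empty(T)
for t in range(T):
    v = rng.normal(scale=2.0)                        # v_t ~ N(0, 4)
    e[t] = rho * e_prev + v
    # transformed model: y_t = a0(1-lam) + lam y_{t-1} + a x_t + e_t - lam e_{t-1}
    y[t] = alpha0 * (1 - lam) + lam * y_prev + alpha * x[t + 1] + e[t] - lam * e_prev
    y_prev, e_prev = y[t], e[t]

# Exercise 10.1: first 25 observations, x_{t-1} as instrument for y_{t-1}
n = 25
W = np.column_stack([np.ones(n - 1), y[:n - 1], x[2:n + 1]])   # (1, y_{t-1}, x_t)
Z = np.column_stack([np.ones(n - 1), x[1:n], x[2:n + 1]])      # (1, x_{t-1}, x_t)
delta = np.linalg.solve(Z.T @ W, Z.T @ y[1:n])       # IV estimates
lam_hat = delta[1]
print("alpha0 =", round(delta[0] / (1 - lam_hat), 3),
      "alpha =", round(delta[2], 3), "lambda =", round(lam_hat, 3))
```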
Exercise 10.2
Repeat Exercise 10.1 with a sample of size 50 and a sample of size 100.
Exercise 10.3
Repeat Exercises 10.1 and 10.2 using $x_{t-2}$ as an instrument instead of $x_{t-1}$.
Exercise 10.4
Compute consistent estimates of $\rho$ using the instrumental variable estimates for $\alpha_0$, $\alpha$, and $\lambda$ obtained in Exercises 10.1 and 10.2.
Exercise 10.5
Making use of the results of Exercises 10.1, 10.2, and 10.4, improve the estimates with the two-step Gauss method.
Exercise 10.6
Assume that the $e_t$ are not autocorrelated and use the first 25 sample values of your generated data to compute Klein's ML estimates for $\alpha_0$, $\alpha$, and $\lambda$.
Exercise 10.7
Generate 100 samples of size 100 and repeat Exercises 10.1 and 10.2. Compute the means, variances, and mean square errors (MSE's) of the resulting estimators and construct frequency distributions.
Exercise 10.8
Making use of the results of Exercise 10.7, compute two-step Gauss estimators
for all 100 samples and all three different sample sizes. Compute the means,
variances, and MSE's of the estimates, and construct frequency distributions.
Exercise 10.9
Repeat Exercise 10.6 with all 100 samples of Exercise 10.7.
Exercise 10.10
Using results from Exercises 10.7-10.9, construct a table comparing the alternative estimators based on samples of different sizes. Rank the estimators in terms of MSE and in terms of their variances. What conclusions can you draw?
10.9 REFERENCES
SOME ALTERNATIVE COVARIANCE STRUCTURES

Chapter 11

Heteroscedasticity

11.1 EXISTENCE OF HETEROSCEDASTICITY AND THE USE OF LS AND GLS
For the general linear model $y = X\beta + e$, $E[e] = 0$, $E[ee'] = \Sigma = \sigma^2\Psi$, heteroscedasticity exists if the diagonal elements of $\Sigma$ are not all identical. If, in addition, $e$ is free from autocorrelation, $\Sigma$ can be written as a diagonal matrix with the $t$th diagonal element given by $\sigma_t^2$. This assumption is likely to be a realistic one when using cross-sectional data on a number of microeconomic units such as firms or households. A common example is the estimation of household expenditure functions. In this case $y_t$ represents expenditure by the $t$th household on some commodity group such as food, and the explanatory variables include total expenditure (or income) and household size. It is usually postulated that expenditure on some commodity is more easily explained by conventional variables for households with low incomes than it is for households with high incomes. Thus, if $x_t'$ represents a row of $X$ corresponding to the $t$th observation, the variance of $y_t$ around $x_t'\beta$ is expected to be low for low incomes and high for high incomes, and this is introduced formally by relating $\sigma_t^2$ to the relevant $x_{kt}$ or, more generally, to some function of $x_t$. For example, see Prais and Houthakker (1955), and the discussion in Theil (1971, p. 245).
Heteroscedasticity also naturally arises (1) when the observations are based on average data, and (2) in a number of "random coefficient" models. To illustrate the first case, let $y_{it}$ be the yield of a particular crop on the $i$th acre of the $t$th farm and let $x_{1it}$ and $x_{2it}$, respectively, be the amount of labor and capital used on the $i$th acre of the $t$th farm. Suppose that it is reasonable to assume a production function either linear in the original or logarithms of the original units, say

$$y_{it} = \beta_1 + \beta_2x_{1it} + \beta_3x_{2it} + e_{it} \tag{11.1.1}$$

where $E[ee'] = \sigma^2I_N$, $e' = (e_1', e_2', \ldots, e_T')$, $e_t' = (e_{1t}, e_{2t}, \ldots, e_{N_tt})$, $N_t$ is the number of acres on the $t$th farm, and $N = \sum_{t=1}^{T}N_t$. Also, suppose that the only data available on $y$, $x_1$, and $x_2$ are average farm data. This being the case, we average (11.1.1) over $i$ and obtain

$$\bar y_t = \beta_1 + \beta_2\bar x_{1t} + \beta_3\bar x_{2t} + \bar e_t \tag{11.1.2}$$

where $\bar y_t = N_t^{-1}\sum_{i=1}^{N_t}y_{it}$ and the other averages are similarly defined. Now $E[\bar e_t] = 0$ and

$$\text{var}(\bar e_t) = N_t^{-2}\sum_{i=1}^{N_t}E[e_{it}^2] = \sigma^2/N_t \tag{11.1.3}$$

where the second equality is based on the fact that the $e_{it}$ are uncorrelated and have constant variance. Thus, if estimation is based on (11.1.2), we have the problem of a heteroscedastic disturbance, where $\sigma_t^2 = \sigma^2/N_t$. However, we would expect the $N_t$ to be known, which means that

$$\Psi = \text{diag}(N_1^{-1}, N_2^{-1}, \ldots, N_T^{-1}) \tag{11.1.4}$$

is known and that generalized least squares (GLS) can be applied directly.
As another example of a heteroscedastic error model, consider the random coefficient model introduced by Hildreth and Houck (1968). In this case

$$y_t = \sum_{k=1}^{K}\beta_{kt}x_{kt}, \qquad \beta_{kt} = \bar\beta_k + v_{kt} \tag{11.1.5}$$

so that $y_t = \sum_{k=1}^{K}\bar\beta_kx_{kt} + e_t$ with $e_t = \sum_{k=1}^{K}v_{kt}x_{kt}$, and, if the $v_{kt}$ are uncorrelated with zero means and variances $\alpha_k$, $\sigma_t^2 = \sum_{k=1}^{K}\alpha_kx_{kt}^2$. Thus each coefficient, $\beta_{kt}$, is assumed to be a random variable with mean $\bar\beta_k$, and estimation of the means $\bar\beta' = (\bar\beta_1, \bar\beta_2, \ldots, \bar\beta_K)$ can be carried out in the framework of the general linear model with heteroscedasticity. The estimated generalized least squares (EGLS) estimator requires estimation of the $\alpha_k$; this is discussed in Section 11.2.4.
Before examining some specific heteroscedastic structures, we shall discuss the use of least squares (LS) and GLS under the general assumption that

$$y_t = x_t'\beta + e_t, \qquad E[e_t] = 0, \quad E[e_te_s] = 0\ (t \neq s), \quad E[e_t^2] = \sigma_t^2 \tag{11.1.6}$$

The GLS estimator of $\beta$ is

$$\hat\beta = \Bigl(\sum_{t=1}^{T}\sigma_t^{-2}x_tx_t'\Bigr)^{-1}\sum_{t=1}^{T}\sigma_t^{-2}x_ty_t = \Bigl(\sum_{t=1}^{T}x_t^*x_t^{*\prime}\Bigr)^{-1}\sum_{t=1}^{T}x_t^*y_t^* = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* \tag{11.1.7}$$

where $x_t^* = x_t/\sigma_t$ and $y_t^* = y_t/\sigma_t$, and its covariance matrix is

$$(X^{*\prime}X^*)^{-1} = \Bigl(\sum_{t=1}^{T}\sigma_t^{-2}x_tx_t'\Bigr)^{-1} \tag{11.1.8}$$

The LS estimator $b = (X'X)^{-1}X'y$ remains unbiased, but its covariance matrix,

$$(X'X)^{-1}X'\Sigma X(X'X)^{-1} \tag{11.1.9}$$

exceeds that of the GLS estimator by a positive semidefinite matrix. When $\sigma_t^2 = \sigma^2$, this difference is zero, and if $\sigma_t^2 = \sigma^2x_t^2$ (in the model $y_t = \beta x_t + e_t$) it is given by $[\sigma^2/(\sum x_t^2)^2][\sum x_t^4 - (\sum x_t^2)^2/T]$, indicating, as one would suspect, that the greater the variation in $x_t^2$, the greater the loss in efficiency. The bias in the LS variance estimator for the model $y_t = \beta x_t + e_t$, $E[e_t^2] = \sigma_t^2$, depends on the degree of correlation between $\sigma_t^2$ and $x_t^2$. When this correlation is positive, as is frequently the case, the sampling variance of $b$ will be underestimated. When there is no correlation, there will be no bias.
further examples see Theil (1951), Prais (1953), Geary (1966), Goldfeld and
Quandt (1972), and Sathe and Vinod (1974). Some asymptotic properties of b
are considered in Section 11.2.1.
The choice of an appropriate variance estimation method and an estimator for $\beta$ depends upon what further assumptions are made about the variances $\sigma_t^2$. A number of assumptions, and the possible techniques that can be used under each assumption, are reviewed in Section 11.2. In Section 11.3 tests for heteroscedasticity are discussed, and in Section 11.4 we make some recommendations on testing, modeling, and estimation.
11.2 ESTIMATION UNDER ALTERNATIVE HETEROSCEDASTIC SPECIFICATIONS

11.2.1 No Restrictions on the Variance of Each $y_t$

Consider the model

$$y = X\beta + e, \qquad E[e] = 0, \qquad E[ee'] = \Phi = \text{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_T^2) \tag{11.2.1}$$

Let a matrix $A$ with each of its elements squared be denoted by $\dot A$. Thus $\dot\sigma' = (\sigma_1^2, \sigma_2^2, \ldots, \sigma_T^2)$ contains the nonzero elements of $\Phi$ and $\dot{\hat e}$ contains the squares of the LS residuals $\hat e = y - Xb$. If there are no restrictions placed on the $\sigma_t^2$, there is only one observation corresponding to each variance, and so one should not be too optimistic about obtaining reliable estimates. However, following Rao (1970), it is possible to derive an estimator for $\dot\sigma$ that is a "minimum norm quadratic unbiased estimator" (MINQUE). The quadratic form $y'Ay$ is said to be a MINQUE of the linear function $\sum d_t\sigma_t^2$ if the Euclidean norm of $A$, which is equal to $(\text{tr}[AA])^{1/2}$, is a minimum subject to the conditions $AX = 0$ and $\sum a_{tt}\sigma_t^2 = \sum d_t\sigma_t^2$. [See Rao (1970) for details and for reasons why such a criterion might be desirable.] The MINQUE of $\dot\sigma$ is given by

$$\dot{\hat\sigma} = \dot M^{-1}\dot{\hat e} \tag{11.2.2}$$

where $M = I - X(X'X)^{-1}X'$.
Alternative restricted specifications for the variances, such as the grouped form $\sigma_i^2I_{T_i}$, are taken up in Sections 11.2.2, 11.2.4a, and 11.2.6. Although the $\sigma_t^2$ cannot be estimated consistently without such restrictions, the asymptotic distribution of the LS estimator $b$ can still be characterized. Under general conditions,

$$\sqrt{T}(b - \beta)\ \xrightarrow{d}\ N(0,\ Q^{-1}VQ^{-1}) \tag{11.2.5}$$

where $Q = \lim T^{-1}X'X$ and $V = \lim T^{-1}X'\Phi X$. The result in (11.2.5) can be used to test hypotheses and construct interval estimates for $\beta$, providing that
we can find a consistent estimator for V. Under appropriate conditions [Eicker
(1967), White (1980), Nicholls and Pagan (1983)], it can be shown that such an
estimator is given by
$$\hat V = T^{-1}X'\hat\Phi X = T^{-1}\sum_{t=1}^{T}\hat e_t^2x_tx_t' \tag{11.2.6}$$

where $\hat e_t = y_t - x_t'b$, and $\hat\Phi$ is a diagonal matrix with $t$th diagonal element given by $\hat e_t^2$. Eicker considers the conditions for nonstochastic $X$; White allows for the possibility of stochastic but independent $x_t$; and Nicholls and Pagan consider the situation where $x_t$ can contain lagged values of $y_t$. See also Hsieh (1983). The results of White and Nicholls and Pagan are stated in terms of almost sure convergence (strong consistency).
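A minimal Python sketch of the estimator in (11.2.5) and (11.2.6), the familiar "sandwich" covariance estimate for $b$, follows; the data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 500
X = np.column_stack([np.ones(T), rng.normal(size=T)])
sigma = 0.5 + 0.5 * np.abs(X[:, 1])                  # heteroscedastic errors
y = X @ np.array([1.0, 2.0]) + sigma * rng.normal(size=T)

b = np.linalg.solve(X.T @ X, X.T @ y)                # LS estimator
ehat = y - X @ b
Q_hat = X.T @ X / T                                  # estimate of Q
V_hat = (X * ehat[:, None]**2).T @ X / T             # T^{-1} sum e_t^2 x_t x_t'
# asymptotic covariance of b: T^{-1} Q^{-1} V Q^{-1}
cov_b = np.linalg.inv(Q_hat) @ V_hat @ np.linalg.inv(Q_hat) / T
print("heteroscedasticity-consistent std. errors:", np.sqrt(np.diag(cov_b)))
```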
Thus, although it is impossible to obtain consistent estimators for the $\sigma_t^2$ without further restrictions on these variances, it is nevertheless possible to obtain a consistent estimator of the limit of the weighted average $T^{-1}X'\Phi X = T^{-1}\sum_{t=1}^{T}\sigma_t^2x_tx_t'$. In addition, as an alternative to $\hat V$, a consistent estimator for $V$ can be found by replacing $\hat e_t^2$ in (11.2.6) with the $t$th element of the MINQUE estimator $\dot{\hat\sigma}$ given in Equation (11.2.2).
From (11.2.5), (11.2.6), and the consistency of $\hat V$, asymptotically valid Wald statistics and interval estimates for $\beta$ can be constructed [Equations (11.2.7) through (11.2.10)], where $\hat\Phi$ is defined below (11.2.6). White (1982) and Cragg (1983) provide conditions under which asymptotic efficiency gains over LS are available from the estimator

$$\hat\beta^* = [X'Z(Z'\hat\Phi Z)^{-1}Z'X]^{-1}X'Z(Z'\hat\Phi Z)^{-1}Z'y$$

where $Z$ is a matrix of auxiliary variables. (White actually considers a more general model where $X$ can contain stochastic regressors that are not independent of $e$, and where $\hat\Phi$ is based on the squares of the residuals from an instrumental variables estimator.) Noting that the GLS estimator $\hat\beta = (X'\Phi^{-1}X)^{-1}X'\Phi^{-1}y$ can be viewed as a regression of $\Phi^{-1/2}y$ on $\Phi^{-1/2}X$, Cragg points out that the estimator $\hat\beta^*$ can be viewed as an instrumental variables estimator in which the instruments for $\Phi^{-1/2}X$ are its values predicted by its regression on $\Phi^{1/2}Z$. That is, the instruments are the values of $\Phi^{1/2}Z(Z'\Phi Z)^{-1}Z'X$ arising from regressions of the form $\Phi^{-1/2}X = \Phi^{1/2}Z\Pi + U$.
Cragg goes on to show that the greatest gain in (asymptotic) efficiency over the LS estimator is likely to occur if the columns of $Z$ are chosen to include those variables in $x_t \otimes x_t$ not already included in $X$, as well as any other exogenous variables likely to influence the variances. This result suggests that the number of variables in $Z$ should be large. However, based on small sample evidence from a Monte Carlo experiment, Cragg recommends that the number of auxiliary variables should not be too large. His Monte Carlo experiment also showed that the gain in small sample efficiency from using $\hat\beta^*$ instead of $b$ can be substantial, but it did uncover some problems with small sample bias in the covariance estimators $[X'Z(Z'\hat\Phi Z)^{-1}Z'X]^{-1}$ and $(X'X)^{-1}X'\hat\Phi X(X'X)^{-1}$. Replacing the diagonal elements of $\hat\Phi$ with the MINQUE estimator $\dot{\hat\sigma}$ seemed to reduce this bias, but Cragg still concludes that, for his simple versions of the Wald statistics in (11.2.7) and (11.2.12), the $\chi^2$-distribution is not a good finite sample approximation. One question not yet investigated, but relevant for the applied worker, is whether the White-Cragg estimator is more efficient than an EGLS estimator that makes an incorrect assumption about the form of heteroscedasticity.
11.2.2 Variances Constant Within Subgroups of Observations

Suppose the sample can be partitioned into $m$ groups, with the variance constant within each group, so that the model can be written as

$$\begin{bmatrix}y_1\\ y_2\\ \vdots\\ y_m\end{bmatrix} = \begin{bmatrix}X_1\\ X_2\\ \vdots\\ X_m\end{bmatrix}\beta + \begin{bmatrix}e_1\\ e_2\\ \vdots\\ e_m\end{bmatrix} \tag{11.2.13}$$

where $y_i$ is $(T_i \times 1)$, $X_i$ is $(T_i \times K)$, $e_i$ is $(T_i \times 1)$, $E[e_ie_i'] = \sigma_i^2I$, and $E[e_ie_j'] = 0$, $i \neq j$. Equation (11.2.13) can be viewed as a set of regression equations where the coefficient vector for each equation is restricted to be the same and where there is no correlation between the disturbances in different equations (Section 12.1). Alternatively, each of the $X_i$ matrices may be a different set of replications of the explanatory variables, in which case all rows in a given $X_i$ would be identical, thus providing a natural way of grouping the data. Other possibilities of groups with constant variances are observations grouped according to geographical region or, with time series data, observations before and after some critical point in time.

For (11.2.13) the GLS estimator for $\beta$ is given by

$$\hat\beta = \Bigl(\sum_{i=1}^{m}\sigma_i^{-2}X_i'X_i\Bigr)^{-1}\sum_{i=1}^{m}\sigma_i^{-2}X_i'y_i \tag{11.2.14}$$

The variances can be estimated from LS regressions on the separate groups, $b_i = (X_i'X_i)^{-1}X_i'y_i$, as

$$\hat\sigma_i^2 = (T_i - K)^{-1}(y_i - X_ib_i)'(y_i - X_ib_i) \tag{11.2.15}$$

and substituting these estimates into (11.2.14) yields the EGLS estimator

$$\hat{\hat\beta} = \Bigl(\sum_{i=1}^{m}\hat\sigma_i^{-2}X_i'X_i\Bigr)^{-1}\sum_{i=1}^{m}\hat\sigma_i^{-2}X_i'y_i \tag{11.2.16}$$
The properties of this estimator are considered by Taylor (1977). Some finite sample results on the moments of $\hat{\hat\beta}$, approximate for the general case and exact for $m = 2$, have been obtained by Taylor (1977, 1978). These results indicate that, in terms of mean square error, EGLS dominates LS over a large range of the $(\sigma_1^2/\sigma_2^2)$ parameter space, including rather small samples with a moderate degree of heteroscedasticity. The relative efficiency of $\hat{\hat\beta}$ declines as $m$ increases. Also, for $m = 2$, EGLS is almost as efficient as GLS, indicating that there is little to gain from a perfect knowledge of the variances. Swamy and Mehta (1979) also give some results for the $m = 2$ case. They propose a new estimator and suggest that one should choose between $b_1$, $b_2$, $b$, $\hat{\hat\beta}$, and their estimator, with the choice depending on (1) the relative magnitudes of $\sigma_1^2$ and $\sigma_2^2$, (2) whether or not $X_2'X_2 - X_1'X_1$ is positive definite, and (3) the relative magnitudes of $T_1$, $T_2$, and $K$.
If the rows for a given $X_i$ are identical, (11.2.15) reduces to

$$\hat\sigma_i^2 = (T_i - 1)^{-1}\sum_{j=1}^{T_i}(y_{ij} - \bar y_i)^2$$

where $y_{ij}$ is the $j$th element of $y_i$, $\bar y_i = \sum_{j=1}^{T_i}y_{ij}/T_i$, and, in this case, we only need the condition $T_i > 1$.
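The following Python sketch carries out EGLS for the grouped model with $m = 2$: the group variances are estimated from separate LS regressions as in (11.2.15) and substituted into (11.2.16). Group sizes and data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(5)
beta = np.array([1.0, 0.5])
groups = []
for T_i, sig_i in [(40, 1.0), (60, 3.0)]:
    X_i = np.column_stack([np.ones(T_i), rng.normal(size=T_i)])
    y_i = X_i @ beta + sig_i * rng.normal(size=T_i)
    groups.append((y_i, X_i))

K = 2
lhs, rhs = np.zeros((K, K)), np.zeros(K)
for y_i, X_i in groups:
    b_i = np.linalg.solve(X_i.T @ X_i, X_i.T @ y_i)          # separate LS
    resid = y_i - X_i @ b_i
    s2_i = resid @ resid / (len(y_i) - K)                    # (11.2.15)
    lhs += X_i.T @ X_i / s2_i
    rhs += X_i.T @ y_i / s2_i
beta_egls = np.linalg.solve(lhs, rhs)                        # (11.2.16)
print("EGLS estimate:", np.round(beta_egls, 3))
```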
In (11.2.15) no allowance is made for the fact that each $b_i$ is an estimate of the same $\beta$. If we incorporate the additional information given by this restriction, it is likely that the resulting $\hat\sigma_i^2$'s will be more efficient, and, although there will be no difference asymptotically, this may lead to a $\hat{\hat\beta}$ that is more efficient in finite samples. Estimation techniques that use all the information include the maximum likelihood estimator [Kmenta (1971, pp. 265-266), Hartley and Jayatillake (1973), Oberhofer and Kmenta (1974), Magnus (1978)], MINQUE and its modifications [Rao (1970, 1972), Rao and Subrahmaniam (1971), Brown (1978)], and the Bayesian approach [Drèze (1977)]. For example, in an iterative procedure to obtain ML estimates, first-round estimates of the $\sigma_i^2$ can be obtained from

$$\hat\sigma_i^2 = T_i^{-1}(y_i - X_ib)'(y_i - X_ib) \tag{11.2.17}$$

where $b = (X'X)^{-1}X'y$ is the LS estimator applied to the whole sample. These variance estimates are used to obtain an EGLS estimator, which then replaces $b$ in (11.2.17) to give a second round of variance estimates. This process is continued until the estimates converge.
If, in the context of this model, we wish to test for homoscedasticity where $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_m^2 = \sigma^2$, the likelihood ratio test [Kmenta (1971, p. 268); Maddala (1977, p. 263)] or Bartlett's test (Section 11.3.3) could be used. When $m = 2$, the usual $F$ test for equality of variances can be employed.
In the $m = 2$ case a pretest estimator based on this $F$ test, with statistic $F_0 = \hat\sigma_2^2/\hat\sigma_1^2$, can be defined as

$$\hat\beta^* = \begin{cases}b & \text{if } c_1 < F_0 < c_2\\ \hat{\hat\beta} & \text{otherwise}\end{cases} \tag{11.2.18}$$

where the hypothesis of equal variances is accepted if $c_1 < F_0 < c_2$, and rejected otherwise. The EGLS estimator $\hat{\hat\beta}$ is defined in (11.2.16). Under a squared error loss measure, the characteristics of the risk functions for the EGLS estimator $\hat{\hat\beta}$, the LS estimator $b$, and the pretest estimator $\hat\beta^*$, relative to the GLS estimator with known variances $\hat\beta$, are given in Figure 11.1 for various values of $\lambda = \sigma_1^2/\sigma_2^2$. Mandy (1984) has investigated the performance of the inequality pretest estimator when $H_0$: $\sigma_1^2 = \sigma_2^2$ and $H_a$: $\sigma_1^2 > \sigma_2^2$, and the characteristics of this pretest risk function are also illustrated in Figure 11.1. No estimator dominates over the whole range of the parameter space. However, unless $\lambda$ is close to one, performance of the EGLS estimator is superior to that of the pretest estimator $\hat\beta^*$.
Figure 11.1 Risk characteristics of pretest and conventional estimators, double log scale.
Again the problem of an optimum level of significance arises and is unresolved. Yancey, Judge, and Miyazaki (1984), utilizing appropriate Stein estimators, have developed estimators that are uniformly superior to the pretest and EGLS estimators.
11.2.3 The Standard Deviation as a Linear Function of Exogenous Variables

When economic data are being used, replications of a particular $x_t$ are rare, and it is often difficult to decide how to group the data to obtain constant variance within each group. An alternative approach is to relate $\sigma_t$ explicitly to $x_t$. The models considered in this and the remaining subsections of Section 11.2 fall into this framework. In this subsection we assume that $\sigma_t$ is a linear function of a set of explanatory variables. Hence we are considering the model

$$y_t = x_t'\beta + e_t, \qquad t = 1, 2, \ldots, T \tag{11.2.19}$$

where

$$E[e_t] = 0, \qquad E[e_te_s] = 0\ (t \neq s), \qquad \sigma_t = (E[e_t^2])^{1/2} = z_t'\alpha \tag{11.2.20}$$

and $z_t' = (1, z_{2t}, \ldots, z_{St})$ is a vector of exogenous variables. The GLS estimator of $\beta$ is

$$\hat\beta = \Bigl(\sum_{t=1}^{T}(z_t'\alpha)^{-2}x_tx_t'\Bigr)^{-1}\sum_{t=1}^{T}(z_t'\alpha)^{-2}x_ty_t \tag{11.2.21}$$

To make this estimator feasible, $\alpha$ must be estimated. Since

$$E|e_t| = c\sigma_t = cz_t'\alpha \tag{11.2.22}$$

where $c$ is a constant, one possibility is to apply LS to the absolute values of the LS residuals,

$$\tilde\alpha = (Z'Z)^{-1}Z'|\hat e| \tag{11.2.23}$$

where $Z' = (z_1, z_2, \ldots, z_T)$ and $|\hat e| = (|\hat e_1|, |\hat e_2|, \ldots, |\hat e_T|)'$. Since the $|\hat e_t|$ are themselves heteroscedastic, a GLS analogue of (11.2.23), say

$$\hat\alpha = (Z'\tilde\Phi^{-1}Z)^{-1}Z'\tilde\Phi^{-1}|\hat e| \tag{11.2.24}$$

with $\tilde\Phi$ diagonal with elements $(z_t'\tilde\alpha)^2$, can also be used. The fact that we have estimated $c\alpha$ instead of $\alpha$ is of no consequence for estimation of $\beta$ because the EGLS estimator

$$\hat{\hat\beta} = \Bigl(\sum_{t=1}^{T}(z_t'\tilde\alpha)^{-2}x_tx_t'\Bigr)^{-1}\sum_{t=1}^{T}(z_t'\tilde\alpha)^{-2}x_ty_t \tag{11.2.25}$$

is unchanged when the estimate of $\alpha$ is multiplied by a constant.
The properties of $\tilde\alpha$ and $\hat\alpha$ have not been fully investigated, but it seems reasonable to expect that under appropriate conditions both $\tilde\alpha$ and $\hat\alpha$ will be asymptotically normal with asymptotic covariance matrices

$$\Sigma_{\tilde\alpha} = \frac{1 - c^2}{c^2}(Z'Z)^{-1}Z'\Phi Z(Z'Z)^{-1} \tag{11.2.26}$$

and $\Sigma_{\hat\alpha} = [(1 - c^2)/c^2](Z'\Phi^{-1}Z)^{-1}$; and so, in this sense, $\hat\alpha$ will be more efficient. Thus, although there is no difference asymptotically, it seems reasonable to expect an EGLS estimator for $\beta$ based on $\hat\alpha$ to be more efficient than the corresponding one based on $\tilde\alpha$. The estimator $\hat\alpha$ was suggested by Harvey (1974) while, for the purpose of testing for homoscedasticity, $\tilde\alpha$ was suggested by Glejser (1969). Now we indicate how some asymptotic tests can be carried out.
The $e_t$ will be homoscedastic if $\alpha^* = 0$, where $\alpha^* = (\alpha_2, \alpha_3, \ldots, \alpha_S)'$ and $\alpha' = (\alpha_1, \alpha^{*\prime})$. Under this null hypothesis, $\tilde\alpha$ and $\hat\alpha$ are asymptotically normal, and $\sigma_1^2$ can be consistently estimated from the LS residuals, say $\hat\sigma_1^2 = \sum_{t=1}^{T}\hat e_t^2/T$. If $\tilde\alpha$ and $\hat\alpha$ are approximately normal, then under the null hypothesis the statistic $g$, constructed from $\hat\alpha^*$ in (11.2.28), and

$$g^* = \Bigl(\frac{c^2}{1 - c^2}\Bigr)\frac{\tilde\alpha^{*\prime}D^{-1}\tilde\alpha^*}{\hat\sigma_1^2} \tag{11.2.29}$$

will each have an approximate $\chi^2_{(S-1)}$ distribution, where $D$ is the matrix $(Z'Z)^{-1}$ with its first row and column deleted and, if the $e_t$ are normally distributed, $c = (2/\pi)^{1/2}$. The null hypothesis $\alpha^* = 0$ is rejected if the chosen statistic ($g$ or $g^*$) falls in the appropriate critical region. The statistic $g^*$ is likely to be computationally more convenient than $g$ because $c^2\tilde\alpha^{*\prime}D^{-1}\tilde\alpha^*$ is equal to the regression sum of squares of the regression of $|\hat e|$ on $Z$ [Goldberger (1964, p. 176)]. For a related testing procedure see Glejser (1969).
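A Python sketch of the test based on $g^*$ follows, using the computational shortcut just noted: with $c^2 = 2/\pi$ under normality, $g^*$ equals the regression sum of squares from the regression of $|\hat e|$ on $Z$, divided by $(1 - c^2)\hat\sigma_1^2$. Data are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
T = 200
z = np.abs(rng.normal(size=T)) + 1.0
Z = np.column_stack([np.ones(T), z])                 # S = 2 columns
X = Z.copy()
sigma = 0.5 + 0.3 * z                                # std. dev. linear in z
y = X @ np.array([1.0, 1.0]) + sigma * rng.normal(size=T)

b = np.linalg.lstsq(X, y, rcond=None)[0]
abs_e = np.abs(y - X @ b)
sigma1_sq = np.mean((y - X @ b)**2)                  # estimate of sigma_1^2

coef = np.linalg.lstsq(Z, abs_e, rcond=None)[0]
ess = np.sum((Z @ coef - abs_e.mean())**2)           # regression sum of squares
c2 = 2.0 / np.pi
g_star = ess / ((1.0 - c2) * sigma1_sq)
print(f"g* = {g_star:.2f}, p-value = {stats.chi2.sf(g_star, df=Z.shape[1]-1):.4f}")
```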
Maximum likelihood estimation provides another method for estimating $\alpha$ and $\beta$. If the $e_t$ are normally distributed, the logarithm of the likelihood function is, apart from a constant,

$$L = -\sum_{t=1}^{T}\ln z_t'\alpha - \frac{1}{2}\sum_{t=1}^{T}\frac{(y_t - x_t'\beta)^2}{(z_t'\alpha)^2} \tag{11.2.30}$$

Setting first derivatives of $L$ (with respect to $\alpha$ and $\beta$) equal to zero yields equations nonlinear in $\alpha$ and $\beta$, so that nonlinear methods (Appendix B) are required to find the ML estimates. For an example, see Rutemiller and Bowers (1968) who, for their application, used the method of scoring. An asymptotic test for heteroscedasticity can be based on the likelihood ratio test statistic $u = 2(L_1 - L_0)$, where $L_1$ is (11.2.30) evaluated at the ML estimates of $\alpha$ and $\beta$ and $L_0$ is the same function evaluated at $\hat\sigma_1^2 = T^{-1}(y - Xb)'(y - Xb)$, $\alpha^* = 0$, and $b = (X'X)^{-1}X'y$. The latter are ML estimates under the assumption $\alpha^* = 0$. Under the null hypothesis $\alpha^* = 0$, $u$ will have an approximate $\chi^2_{(S-1)}$ distribution.
Maximum likelihood estimation may be computationally difficult, and so, as a further alternative, Harvey (1974) suggests an estimator based on one iteration of the method of scoring that has the same asymptotic distribution as the ML estimator.
11.2.4 The Variance as a Linear Function of Exogenous Variables

Another heteroscedastic error model that has received some attention in the literature [Hildreth and Houck (1968), Theil (1971), Goldfeld and Quandt (1972), Froehlich (1973), Harvey (1974), Raj (1975), Amemiya (1977)] occurs when $\sigma_t^2$ is assumed to be a linear function of a set of exogenous variables. In this case we have

$$\sigma_t^2 = z_t'\alpha \tag{11.2.33}$$

and the squared LS residuals can be written as

$$\hat e_t^2 = z_t'\alpha + v_t \tag{11.2.34}$$

Applying LS to (11.2.34) yields

$$\hat\alpha = \Bigl(\sum_{t=1}^{T}z_tz_t'\Bigr)^{-1}\sum_{t=1}^{T}z_t\hat e_t^2 = (Z'Z)^{-1}Z'\dot{\hat e} \tag{11.2.35}$$

where $Z' = (z_1, z_2, \ldots, z_T)$ and $\dot{\hat e} = (\hat e_1^2, \hat e_2^2, \ldots, \hat e_T^2)'$. However, $v_t$ in
(11.2.34) has a nonzero mean and suffers from both heteroscedasticity and
autocorrelation. Thus $\hat\alpha$ will be biased and, relative to a GLS estimator for $\alpha$, inefficient. The bias can readily be demonstrated by noting that

$$\hat e = Me \tag{11.2.36}$$

where $M = I - X(X'X)^{-1}X'$, so that

$$E[\dot{\hat e}] = \dot M\dot\sigma \tag{11.2.37}$$

and

$$\dot\sigma = Z\alpha \tag{11.2.38}$$

Hence

$$\dot{\hat e} = \dot MZ\alpha + w \tag{11.2.39}$$

where $w = \dot{\hat e} - E[\dot{\hat e}]$ has zero mean; since $E[\hat\alpha] = (Z'Z)^{-1}Z'\dot MZ\alpha \neq \alpha$, the estimator in (11.2.35) is biased. A GLS-type estimator that allows for the nonconstant variance of $v_t$ is

$$\tilde\alpha = \Bigl(\sum_{t=1}^{T}(z_t'\hat\alpha)^{-2}z_tz_t'\Bigr)^{-1}\sum_{t=1}^{T}(z_t'\hat\alpha)^{-2}z_t\hat e_t^2 \tag{11.2.40}$$

where $(z_t'\hat\alpha)^2$ has been used to estimate the variance of $v_t$. This estimator will be asymptotically more efficient than $\hat\alpha$ in (11.2.35).
For Equation (11.2.39), if the $e_t$ are normally distributed, $w$ will have an exact covariance matrix given by $2\dot Q$, where $Q = M\Phi M$, and application of GLS gives [Raj (1975)]

$$\tilde\alpha^{(1)} = [(\dot MZ)'\dot{\hat Q}^{-1}\dot MZ]^{-1}(\dot MZ)'\dot{\hat Q}^{-1}\dot{\hat e} \tag{11.2.41}$$

where $\dot Q$ has been replaced by $\dot{\hat Q}$, obtained from $\hat Q = M\hat\Phi M$ with $\hat\Phi$ having $t$th diagonal element $z_t'\tilde\alpha$. Theil (1971, p. 624), based on Theil and Mennes (1959), suggests ignoring the off-diagonal elements of $\dot Q$. This yields an estimator that is simpler and is asymptotically equivalent to $\tilde\alpha^{(1)}$. Also, the estimators $\tilde\alpha$ and $\tilde\alpha^{(1)}$ can be regarded as special cases of nonlinear estimators investigated within a more general framework by Jobson and Fuller (1980).
Although the GLS estimators $\tilde\alpha$ and $\tilde\alpha^{(1)}$ are asymptotically more efficient than their LS counterparts, they do not lead to EGLS estimators for $\beta$ that are asymptotically more efficient. Under appropriate conditions [Hildreth and Houck (1968), Amemiya (1977)], the EGLS estimators for $\beta$ will all have the same asymptotic distribution, with asymptotic covariance matrix

$$(X'\Phi^{-1}X)^{-1} \tag{11.2.42}$$

A test for homoscedasticity ($\alpha^* = 0$) can be based on the statistic

$$g = \frac{\hat\alpha^{*\prime}D^{-1}\hat\alpha^*}{2\hat\sigma^4} \tag{11.2.43}$$

which has an approximate $\chi^2_{(S-1)}$ distribution under the null hypothesis. If the $e_t$ are normally distributed, the logarithm of the likelihood function is, apart from a constant,

$$L = -\frac{1}{2}\sum_{t=1}^{T}\ln z_t'\alpha - \frac{1}{2}\sum_{t=1}^{T}\frac{(y_t - x_t'\beta)^2}{z_t'\alpha} \tag{11.2.44}$$

Both Harvey (1974) and Amemiya (1977) use the information matrix from this function to demonstrate the asymptotic efficiency of $\tilde\alpha$ under normality.
11.2.5 The Variance Proportional to the Square of the Mean

Consider the model

$$y_t = x_t'\beta + e_t, \qquad t = 1, 2, \ldots, T \tag{11.2.45}$$

where

$$\sigma_t^2 = E[e_t^2] = \sigma^2(x_t'\beta)^2 \tag{11.2.46}$$

This can be regarded as a special case of the model $\sigma_t^2 = (z_t'\alpha)^2$ [Equation (11.2.20)], where $z_t = x_t$ and $\alpha$ is proportional to $\beta$.

The GLS estimator $\hat\beta = (X'\Phi^{-1}X)^{-1}X'\Phi^{-1}y$ is not a feasible one because $\Phi$ depends on $\beta$. However, one can first use the LS estimator $b = (X'X)^{-1}X'y$, and then use the resulting estimate in an EGLS estimator,

$$\hat{\hat\beta} = \Bigl(\sum_{t=1}^{T}(x_t'b)^{-2}x_tx_t'\Bigr)^{-1}\sum_{t=1}^{T}(x_t'b)^{-2}x_ty_t \tag{11.2.47}$$

A "convenient" estimator of its covariance matrix is

$$\hat\Sigma_{\hat{\hat\beta}} = \hat\sigma^2\Bigl(\sum_{t=1}^{T}(x_t'b)^{-2}x_tx_t'\Bigr)^{-1} \tag{11.2.48}$$

where

$$\hat\sigma^2 = (T - K)^{-1}\sum_{t=1}^{T}[(x_t'b)^{-2}(y_t - x_t'\hat{\hat\beta})^2] \tag{11.2.49}$$

We say "convenient" because this would be the covariance matrix given by the transformed least squares computer run used to obtain $\hat{\hat\beta}$. Asymptotically, there is no difference; but, in finite samples, we may be able to obtain a better estimator for $\Sigma_{\hat{\hat\beta}}$ by substituting $\hat{\hat\beta}$ for $b$ in (11.2.48) and (11.2.49). This also raises the question of using the above results in an iterative fashion. Specifically, in (11.2.47) we could replace $x_t'b$ with $x_t'\hat{\hat\beta}$, use the new estimator for $\beta$ to form another estimate of $\sigma_t^2$, and so on. See Prais (1953), Prais and Aitchison (1954), and Kakwani and Gupta (1967).
As an alternative to (11.2.47) we can make some specific distributional assumptions about $y$ and use numerical methods to obtain ML estimates. If $y$ has a multivariate normal distribution, then, in contrast to the other models studied, the ML estimator for $\beta$ will not be of the "EGLS type." The dependence of $\Phi$ on $\beta$ implies that both the sum of squares function $(y - X\beta)'\Phi^{-1}(y - X\beta)$, as well as the complete log-likelihood function, are no longer simple quadratic functions of $\beta$. For each of three cases, $y_t$ with normal, lognormal, and gamma distributions, Amemiya (1973) gives the logarithm of the likelihood function and its first- and second-order partial derivatives. Based on the Cramér-Rao lower bound, he demonstrates that the EGLS estimator in (11.2.47) is asymptotically inefficient when $y_t$ is normally or lognormally distributed, but efficient when $y_t$ has the gamma distribution. The asymptotic inefficiency of $\hat{\hat\beta}$ under the normality assumption is confirmed within a more general context by Jobson and Fuller (1980), who go on to suggest a nonlinear least squares estimator that is asymptotically equivalent to the ML estimator. Their estimator for $\beta$ is one that utilizes information in both $y$ and the least squares residuals. Specifically, it is a weighted nonlinear least squares procedure that is applied jointly to the two equations
$$y_t = x_t'\beta + e_t \tag{11.2.50}$$

and

$$\hat e_t^2 = \sigma^2(x_t'\beta)^2 + u_t \tag{11.2.51}$$
Further results presented by Carroll and Ruppert (1982) show that, although the EGLS estimator is asymptotically not as efficient as the Jobson-Fuller estimator, it is more robust with respect to misspecification of the variance function.

The use of testing procedures to discriminate between the normal, lognormal, and gamma distributions has been illustrated by Amemiya (1973) and Surekha and Griffiths (1982). Amemiya uses a classical nonnested hypothesis test, while Surekha and Griffiths calculate the posterior odds within a Bayesian framework. Test statistics used to test for heteroscedasticity of the form $\sigma_t^2 = \sigma^2(x_t'\beta)^2$, without specification of the form of distribution, have been studied by Bickel (1978), Carroll and Ruppert (1981), and Hammerstrom (1981).

The more general variance specification, where $\sigma_t^2 = \sigma^2(x_t'\beta)^p$ and $p$ is an unknown parameter to be estimated, has been considered by Box and Hill (1974) and Battese and Bonyhady (1981). Assuming normality, Battese and Bonyhady use maximum likelihood methods to estimate the model for a number of household expenditure functions, and they demonstrate how a test for homoscedasticity ($p = 0$), or a test for $p = 2$, can be based on the asymptotic normality of the ML estimates and the Cramér-Rao lower bound.
11.2.6 Multiplicative Heteroscedasticity

Another heteroscedastic error model that has appeared in the literature is one called "multiplicative heteroscedasticity" [Harvey (1976)]. The assumptions of this model are

$$y_t = x_t'\beta + e_t, \qquad t = 1, 2, \ldots, T \tag{11.2.52}$$

$$E[e_t] = 0, \qquad E[e_te_s] = 0, \quad t \neq s \tag{11.2.53}$$

$$\sigma_t^2 = E[e_t^2] = \exp(z_t'\alpha) \tag{11.2.54}$$

where $z_t' = (1, z_{2t}, \ldots, z_{St})$. Taking logarithms of the squared LS residuals suggests the regression

$$\ln\hat e_t^2 = z_t'\alpha + v_t \tag{11.2.55}$$

and applying LS to (11.2.55) yields

$$\hat\alpha = (Z'Z)^{-1}Z'q, \qquad q = (\ln\hat e_1^2, \ldots, \ln\hat e_T^2)' \tag{11.2.56}$$

A problem with this estimator is that the $v_t$ have nonzero means and are heteroscedastic and autocorrelated. However, if the $e_t$ are normally distributed and if $\hat e_t$ converges in distribution to $e_t$, then, asymptotically, the $v_t$ will be independent with mean and variance given by [Harvey (1976)]

$$E[v_t] = -1.2704, \qquad \text{var}[v_t] = 4.9348 \tag{11.2.57}$$

This implies that $\hat\alpha_1$, the first element in $\hat\alpha$, will be inconsistent. However, this is of no consequence for the EGLS estimator

$$\hat{\hat\beta} = \Bigl(\sum_{t=1}^{T}\exp(-z_t'\hat\alpha)x_tx_t'\Bigr)^{-1}\sum_{t=1}^{T}\exp(-z_t'\hat\alpha)x_ty_t \tag{11.2.58}$$

since a bias in the intercept simply rescales all the weights by the same factor. To test for homoscedasticity, $\alpha^* = (\alpha_2, \ldots, \alpha_S)' = 0$, the statistic

$$g = \frac{\hat\alpha^{*\prime}D^{-1}\hat\alpha^*}{4.9348} \tag{11.2.60}$$

will have an approximate $\chi^2_{(S-1)}$ distribution, where $D$ is equal to $(Z'Z)^{-1}$ with its first row and column deleted and $\hat\alpha^{*\prime}D^{-1}\hat\alpha^*$ is equal to the regression sum of squares from (11.2.56).
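A minimal Python sketch of the two-step procedure just described follows: $\ln\hat e_t^2$ is regressed on $z_t$ as in (11.2.55) and (11.2.56), the intercept is corrected for $E[v_t] = -1.2704$, and the EGLS estimator (11.2.58) is formed with weights $\exp(-z_t'\hat\alpha)$. Data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 400
z2 = rng.normal(size=T)
Z = np.column_stack([np.ones(T), z2])
alpha_true = np.array([-0.5, 0.8])
sigma = np.exp(0.5 * Z @ alpha_true)                 # sigma_t^2 = exp(z_t'alpha)
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([2.0, 1.0]) + sigma * rng.normal(size=T)

b = np.linalg.lstsq(X, y, rcond=None)[0]             # first-step LS
e2 = (y - X @ b)**2
alpha_hat = np.linalg.lstsq(Z, np.log(e2), rcond=None)[0]
alpha_hat[0] += 1.2704                               # intercept correction for E[v_t]
w = np.exp(-(Z @ alpha_hat))                         # proportional to 1/sigma_t^2
Xw = X * w[:, None]
beta_egls = np.linalg.solve(Xw.T @ X, Xw.T @ y)      # EGLS, cf. (11.2.58)
print("alpha:", np.round(alpha_hat, 3), "beta:", np.round(beta_egls, 3))
```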
The log of the likelihood function is, apart from a constant,

$$L = -\frac{1}{2}\sum_{t=1}^{T}z_t'\alpha - \frac{1}{2}\sum_{t=1}^{T}[\exp(-z_t'\alpha)(y_t - x_t'\beta)^2] \tag{11.2.61}$$
and, based on the inverse of the information matrix, the asymptotic covariance matrices for the ML estimators $\tilde\beta$ and $\tilde\alpha$ can be derived as $\Sigma_{\tilde\beta} = (X'\Phi^{-1}X)^{-1}$ and $\Sigma_{\tilde\alpha} = 2(Z'Z)^{-1}$. This verifies that $\hat{\hat\beta}$ is asymptotically efficient but indicates that $\hat\alpha$ is not. Thus we may wish to improve on $\hat\alpha$, hoping that, in finite samples, this will lead to a better estimator for $\beta$. As an alternative to ML estimation, Harvey (1976) provides an estimator based on one iteration of the method of scoring. It has the same asymptotic distribution as $\tilde\alpha$ and is given by

$$\check\alpha = \hat\alpha + d + 0.2807(Z'Z)^{-1}\sum_{t=1}^{T}[z_t\exp(-z_t'\hat\alpha)\hat e_t^2] \tag{11.2.62}$$

where $d$ is an $(S \times 1)$ vector with the first element equal to 0.2704 and the remaining elements equal to zero.
To test $\alpha^* = 0$, Equation (11.2.60) can be used with $\hat\alpha^*$ replaced by $\check\alpha^*$ or $\tilde\alpha^*$, and 4.9348 replaced by 2. This test will be asymptotically more powerful than the one based on $\hat\alpha^*$. Alternatively, we can use the likelihood ratio test which, in this case, can be conveniently written as [Harvey (1976)]

$$u = T\ln s^2 - \sum_{t=1}^{T}z_t'\tilde\alpha \tag{11.2.63}$$

where $s^2$ is the ML estimate of the error variance under homoscedasticity.
11.2.7 Autoregressive Conditional Heteroscedasticity

In the autoregressive conditional heteroscedastic (ARCH) model of Engle (1982), we have

$$y_t = x_t'\beta + e_t \tag{11.2.64}$$

where

$$e_t = v_t(\alpha_0 + \alpha_1e_{t-1}^2)^{1/2} \tag{11.2.65}$$

the $v_t$ are independent normal random variables with zero mean and unit variance, and the parameters $(\alpha_0, \alpha_1)$ are assumed to satisfy $\alpha_0 > 0$ and $\alpha_1 \ge 0$. Under these assumptions the conditional means and variances for $e_t$ and $y_t$ are

$$E[e_t|e_{t-1}] = 0, \qquad \text{var}(e_t|e_{t-1}) = \alpha_0 + \alpha_1e_{t-1}^2 \tag{11.2.66}$$

$$E[y_t|y_{t-1}] = x_t'\beta, \qquad \text{var}(y_t|y_{t-1}) = \alpha_0 + \alpha_1(y_{t-1} - x_{t-1}'\beta)^2 \tag{11.2.67}$$

and, furthermore, both $(e_t|e_{t-1})$ and $(y_t|y_{t-1})$ are normally distributed. However, the marginal (unconditional) densities for $e_t$ and $y_t$ will not be normal. The unconditional means are $E[e_t] = 0$ and $E[y_t] = x_t'\beta$, and, providing $0 \le \alpha_1 < 1$, the unconditional variances can be derived as

$$\text{var}(e_t) = E[e_t^2] = \frac{\alpha_0}{1 - \alpha_1} \tag{11.2.68}$$
It can also be shown that the $e_t$ are uncorrelated so that we have $E[e] = 0$ and $E[ee'] = \sigma^2I$, where $\sigma^2 = \alpha_0/(1 - \alpha_1)$. Thus, it follows that the LS estimator $b = (X'X)^{-1}X'y$ is best linear unbiased, and that the usual estimator for the covariance matrix for $b$ is valid. (We are assuming that $X$ does not contain lagged values of $y$.) However, by using maximum likelihood techniques it is possible to find a nonlinear estimator that is asymptotically more efficient than LS.

Conditional on an initial observation $y_0$, the likelihood function for $\alpha = (\alpha_0, \alpha_1)'$ and $\beta$ can be written as $f(y) = \prod_{t=1}^{T}f(y_t|y_{t-1})$. Using this result, and ignoring a constant factor, the log-likelihood function can be written as

$$L = -\frac{1}{2}\sum_{t}\ln[\alpha_0 + \alpha_1(y_{t-1} - x_{t-1}'\beta)^2] - \frac{1}{2}\sum_{t}\frac{(y_t - x_t'\beta)^2}{\alpha_0 + \alpha_1(y_{t-1} - x_{t-1}'\beta)^2} \tag{11.2.69}$$

It is clear from this function that, although $(y_t|y_{t-1})$ is normally distributed, the vector $y$ does not possess a multivariate normal distribution. A maximum likelihood estimator for $\alpha$ and $\beta$ can be found by numerically maximizing (11.2.69), or, alternatively, one can use an asymptotically equivalent estimator that is based on the scoring algorithm and that can be found using most least squares computer programs [Engle (1982)].
The scoring steps involve auxiliary least squares regressions based on the quantities $h_t = \hat\alpha_0 + \hat\alpha_1\hat e_{t-1}^2$, $z_t = (1, \hat e_{t-1}^2)'$, and the scaled variables $z_t^* = z_t/h_t$ [Equations (11.2.70) to (11.2.73)]. A test against ARCH errors is a test of the hypothesis $H_0$: $\alpha_1 = 0$; an asymptotically valid Lagrange multiplier version of this test can be based on $TR^2$ from the LS regression of $\hat e_t^2$ on $z_t$. For more details of the ARCH model and its estimation, as well as an application, see Engle (1982).
11.2.8 Heteroscedasticity and Functional Form

Sometimes apparent heteroscedasticity is tied to the choice of functional form. Consider, for example, the constant elasticity model

$$y_t = \beta_1x_{2t}^{\beta_2}x_{3t}^{\beta_3}e^{u_t} \tag{11.2.78}$$

where $u_t \sim N(0, \sigma^2)$. From the properties of the lognormal distribution, the variances of $e^{u_t}$ and $y_t$ are $\text{var}(e^{u_t}) = e^{\sigma^2}(e^{\sigma^2} - 1)$ and $\text{var}(y_t) = (\beta_1x_{2t}^{\beta_2}x_{3t}^{\beta_3})^2e^{\sigma^2}(e^{\sigma^2} - 1)$, so that $y_t$ is heteroscedastic. Taking logarithms gives

$$\ln y_t = x_t'\beta + u_t \tag{11.2.79}$$

where $x_t' = (1, \ln x_{2t}, \ln x_{3t})$, $\beta' = (\ln\beta_1, \beta_2, \beta_3)$, and $\ln y_t$ is both homoscedastic and a linear function of $\beta$. Thus LS is the appropriate technique, and this is true because we assumed that $u_t$ is homoscedastic, and that we have a constant elasticity model.

Suppose that we wish to retain the assumption of a heteroscedastic $y$ that follows the lognormal distribution, but instead we assume that the slopes are constant, or at least that they do not vary systematically with $x_t$ and $y_t$. In this case the model becomes

$$y_t = (x_t'\beta)e^{u_t} \tag{11.2.80}$$

where $u_t \sim N(0, \sigma^2)$ and $x_t' = (1, x_{2t}, x_{3t})$. Taking logs of (11.2.80) gives

$$\ln y_t = \ln(x_t'\beta) + u_t \tag{11.2.81}$$

Because of the multiplicative disturbance, the slopes $\partial y_t/\partial x_{kt} = \beta_ke^{u_t}$ in this model are random variables. However, unlike the constant elasticity model, they do not vary systematically with $x_t$ and $y_t$.
Alternatively, if $y$ is a normally distributed random vector, we could specify

$$y_t = x_t'\beta + u_t \tag{11.2.82}$$

with the variance of $u_t$ modeled directly by one of the specifications discussed earlier in this section.

11.3 TESTING FOR HETEROSCEDASTICITY

Often we have no precise prior knowledge of how the variances vary over observations, and we may wish to test for heteroscedasticity without specifying precisely the form it should take. Some tests that have been suggested for this purpose are summarized below. In addition to (or instead of) these tests we could, of course, use one of the tests given in Section 11.2. In this case, since each test is designed against a specific alternative, we would hope that the test being used is robust under alternative heteroscedastic specifications.
The first general test we consider is relevant for alternative hypotheses of the form

$$\sigma_t^2 = h(z_t'\alpha) \tag{11.3.1}$$

where $h$ is some unspecified function and $z_t' = (1, z_{2t}, \ldots, z_{St})$. Breusch and Pagan (1979) show that the Lagrange multiplier statistic for testing homoscedasticity against (11.3.1) does not depend on $h$ and can be written as

$$\eta = \frac{q'Z(Z'Z)^{-1}Z'q}{2\hat\sigma^4} \tag{11.3.2}$$

where $q = (\hat e_1^2 - \hat\sigma^2, \ldots, \hat e_T^2 - \hat\sigma^2)'$ and $\hat\sigma^2 = T^{-1}\sum_t\hat e_t^2$. A variant replaces the denominator with a direct estimate of the variance of the $\hat e_t^2$,

$$\eta^* = \frac{q'Z(Z'Z)^{-1}Z'q}{T^{-1}\sum_t(\hat e_t^2 - \hat\sigma^2)^2} \tag{11.3.4}$$

The denominator in $\eta^*$ can be viewed as an estimator for the variance of $\hat e_t^2$. When $H_0$ is true and the $e_t$ are normally distributed, both statistics have an approximate $\chi^2_{(S-1)}$ distribution.
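A minimal Python sketch of the statistic $\eta$ in (11.3.2) follows; the data and the single variable in $z_t$ are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
T = 250
z = rng.uniform(1.0, 4.0, size=T)
Z = np.column_stack([np.ones(T), z])                 # S = 2
X = Z.copy()
y = X @ np.array([1.0, 1.0]) + np.sqrt(0.2 + 0.4 * z) * rng.normal(size=T)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e2 = (y - X @ b)**2
s2 = e2.mean()                                       # sigma_hat^2
q = e2 - s2                                          # elements of q
eta = q @ Z @ np.linalg.inv(Z.T @ Z) @ Z.T @ q / (2.0 * s2**2)
print(f"eta = {eta:.2f}, p-value = {stats.chi2.sf(eta, df=1):.4f}")
```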
Bartlett's test provides another possibility when the observations can be divided into $m$ groups, as in Section 11.2.2. It compares the pooled variance estimate with the within-group estimates, based on the quantities

$$T_is_i^2 = \sum_{j=1}^{T_i}(y_{ij} - \bar y_i)^2, \qquad Ts^2 = \sum_{i=1}^{m}T_is_i^2$$

or, in the modified form that corrects for degrees of freedom,

$$(T_i - 1)\hat\sigma_i^2 = \sum_{j=1}^{T_i}(y_{ij} - \bar y_i)^2 \qquad\text{and}\qquad (T - m)\hat\sigma^2 = \sum_{i=1}^{m}(T_i - 1)\hat\sigma_i^2$$

In Ramsey's version of the test, a single regression based on all $T$ observations, with

$$(T - K)s^2 = \sum_{j=1}^{T-K}\hat e_j^2 \qquad\text{and group sizes } v_i \text{ satisfying } \sum_{i=1}^{m}v_i = T - K$$

has been utilized, and the independence of the $s_i^2$ has been ensured by using BLUS residuals. If one is testing for heteroscedasticity in general, the choice of groups is completely arbitrary. Ramsey suggests $m = 3$ as a compromise between sufficient degrees of freedom and sufficient observations in each group. Some Monte Carlo evidence on the power of the test has been provided by Ramsey and Gilbert (1972).
Another possible test for heteroscedasticity is the one outlined by Goldfeld and Quandt (1965, 1972). As with the previous test, its performance will be best when it is possible to order the observations according to nondecreasing variance. That is, $\sigma_t^2 \ge \sigma_{t-1}^2$, $t = 2, 3, \ldots, T$. The steps for implementing the test are:

1. Order the observations according to nondecreasing variance.
2. Omit $r$ central observations.
3. Run separate LS regressions on the first and last $(T - r)/2$ observations, obtaining residual sums of squares $S_1$ and $S_2$, respectively.
4. Reject homoscedasticity for large values of $F_0 = S_2/S_1$, which, under the null hypothesis, has an $F$ distribution with $[(T - r)/2 - K, (T - r)/2 - K]$ degrees of freedom.

It is not essential that the two regressions be based on the same number of observations. If the number of observations does differ, then the degrees of freedom must be changed accordingly.

In contrast to the previous tests, this one is based on an exact finite sample distribution. It overcomes the problem of the lack of independence of LS residuals by running two separate regressions. Some central observations are omitted so that, under heteroscedasticity, residuals in the numerator correspond to relatively large variances and those in the denominator to relatively small variances. The omitted observations correspond to intermediate-sized variances. The optimum value of $r$ is not obvious. Large values are likely to increase the power of the test through an increase in the value of the $F$ statistic but decrease the power through a reduction in degrees of freedom. Goldfeld and Quandt (1965) suggest $r = 8$ for $T = 30$, and $r = 16$ for $T = 60$. However, in their later work (1972, Chapter 3) they use $r = 4$ for $T = 30$.
Instead of running two separate regressions to obtain independent sums of squares for the numerator and denominator of the $F$ statistic, it is possible to use either BLUS residuals [Theil (1971, pp. 196-218)] or recursive residuals [Harvey and Phillips (1974)] from one regression. In both these cases the $(T - K)$ residuals are divided into two approximately equal groups and an $F$ statistic similar to that used for the Goldfeld-Quandt test is calculated. Assuming that $(T - K)$ is even, this procedure leads to a statistic that, under homoscedasticity, has an $F_{[(T-K)/2,(T-K)/2]}$ distribution.
Szroeter (1978) introduces a class of tests that includes those in Section 11.3.3 as special cases. He assumes that the $e_t$ are normally distributed and, as before, that the observations have been ordered such that $\sigma_t^2 \ge \sigma_{t-1}^2$, $t = 2, 3, \ldots, T$. The class is described by the test statistic

$$\bar h = \frac{\sum_th_t\hat e_t^2}{\sum_t\hat e_t^2} \tag{11.3.9}$$

where the $h_t$ are nondecreasing constants, and the null hypothesis of homoscedasticity is rejected for values of

$$\bar h > c \tag{11.3.10}$$

where $c$ is an appropriate critical value. One convenient member of the class is based on

$$h_t = 2\Bigl[1 - \cos\frac{\pi(t - K)}{T - K + 1}\Bigr], \qquad t = K + 1, K + 2, \ldots, T \tag{11.3.11}$$

It is then possible to construct a bounds test where we reject the null hypothesis if $\bar h > 4 - d_L$, accept it if $\bar h < 4 - d_U$, and otherwise treat the test as inconclusive. The values $d_L$ and $d_U$ are the Durbin-Watson lower and upper critical values, respectively, corresponding to $T + 1$ observations, $K + 1$ regressors, and the required significance level [Durbin and Watson (1951)]. The problem of the inconclusive region has been investigated further by Harrison (1980) and King (1981). Harrison demonstrates that the probability of obtaining an inconclusive result is quite high, and, to avoid the computational demands of finding the exact critical value via the Imhof procedure, he suggests a relatively simple approximation method. King notes that a narrower inconclusive region can be obtained by recognizing that most regression models contain a constant term. For this case he derives and tabulates new values $h_L^*$ and $h_U^*$, which, in the context of the test described above, lead to critical value bounds of $4 - h_L^*$ and $4 - h_U^*$. Some examples are presented to demonstrate how the probability of obtaining a value in the narrower inconclusive region $(4 - h_U^*, 4 - h_L^*)$ can be considerably lower than that for the region $(4 - d_U, 4 - d_L)$.
Harrison and McCabe (1979) suggest another member of the Szroeter class of tests that uses LS residuals and that sets $h_t = 1$ for $t = 1, 2, \ldots, m$ and $h_t = 0$ for $t = m + 1, m + 2, \ldots, T$, with $m$ approximately equal to $T/2$. Because this set of $h_t$ is nonincreasing rather than nondecreasing, the critical region in this case is of the form $\bar h < c$. Using existing $F$ distribution tables, Harrison and McCabe indicate how critical values for a corresponding bounds test can be evaluated. They also suggest an approximation for the inconclusive region and carry out some small sample power comparisons.
A further bounds test that uses LS residuals and that is another member of the Szroeter class is that advocated by King (1982). He shows that setting $h_t = t$ leads to a test that is approximately locally best invariant under quite general conditions. To write the test in a form more convenient for the derivation of critical value bounds, he assumes the observations have been ordered according to nonincreasing rather than nondecreasing variances. This implies the critical region is of the form $\bar h < c$. Also, he finds it is further convenient to write this critical region in the equivalent form

$$s = \frac{\sum_{t=1}^{T}\hat e_t^2(t - 1)/T}{\sum_{t=1}^{T}\hat e_t^2} < c^* \tag{11.3.13}$$

When exact critical values or bounds are not available, an asymptotic version of the test can be used. Under the null hypothesis,

$$Q = \frac{T(\bar h - \tilde h)}{\Bigl[2\sum_{t=1}^{T}(h_t - \tilde h)^2\Bigr]^{1/2}} \sim N(0, 1) \tag{11.3.14}$$

where $\tilde h = \sum_{t=1}^{T}h_t/T$. As a convenient and useful choice for $h_t$, Szroeter suggests $h_t = t$. King's (1982) results add support to this choice. In this case (11.3.14) simplifies to

$$Q = \Bigl(\frac{6T}{T^2 - 1}\Bigr)^{1/2}\Bigl(\bar h - \frac{T + 1}{2}\Bigr) \tag{11.3.15}$$

where $\bar h = \sum_t\hat e_t^2t/\sum_t\hat e_t^2$. Monte Carlo results on the finite sample power of this test have been provided by Surekha and Griffiths (1983). In the models they consider the test outperforms both the Breusch-Pagan and Goldfeld-Quandt tests.
For the applied worker the relevant question is which member of Szroeter's class of tests performs best. Unfortunately, there is no one member that will be best under all circumstances. The relative power of each test will depend on the alternative hypothesis. In particular, it will depend on how the $\sigma_t^2$ vary, the very question we are trying to answer. However, if we take into account the available power comparisons and computational convenience, the choice $h_t = t$ appears to be a reasonable one.
A test based on the fact that heteroscedasticity leads to a least squares covariance matrix estimator that is inconsistent has been suggested by White (1980). To introduce this test let $\hat e = (\hat e_1, \hat e_2, \ldots, \hat e_T)'$ be the vector of least squares residuals and let the least squares variance estimator be $\hat\sigma^2 = (T - K)^{-1}\hat e'\hat e$. When homoscedasticity exists, $\hat V = T^{-1}\sum_t\hat e_t^2x_tx_t'$ and $\tilde V = T^{-1}\hat\sigma^2X'X$ will both be consistent estimators of the same matrix. However, under heteroscedasticity, these estimators will tend to diverge, unless the heteroscedasticity is of a special type that is independent of $X$. [See White (1980, pp. 326-327) for a formal statement of the conditions under which heteroscedasticity does not lead to a divergence in $\hat V$ and $\tilde V$.] It is therefore possible to base a test on whether or not the difference $\hat V - \tilde V$ is statistically significant. The simplest version of a test statistic for this purpose is given by $TR^2$, where $R^2$ is the squared multiple correlation coefficient of the regression of $\hat e_t^2$ on the variables in $x_t \otimes x_t$, with redundant variables deleted and a constant term included even if one does not appear in $x_t$. Under the null hypothesis $TR^2$ has an asymptotic $\chi^2$-distribution with degrees of freedom equal to the number of explanatory variables in the above auxiliary regression, excluding the constant. If there are no redundancies in $x_t \otimes x_t$ other than the obvious ones, and $x_t$ includes a constant, this number will be $K(K + 1)/2 - 1$. See White (1980) for details and for sufficient conditions for the asymptotic results to hold.
As White notes, this test could be regarded as a general one for misspecification as it is also likely to pick up other specification errors such as a misspecified mean function $E[y_t] = x_t'\beta$, or correlation between $x_t$ and $e_t$ in a stochastic regressor model. However, if the researcher feels relatively confident about the absence of these latter problems, the test can be regarded as one designed to detect heteroscedasticity. Note that the test is similar in form to other tests suggested in this section, particularly the Lagrange-multiplier test statistic. Hsieh (1983) has modified White's test for errors that can be both heteroscedastic and heterokurtic, and he has examined the small sample power of the tests in a Monte Carlo experiment. Extensions of the test for more complex models have been discussed by White (1982) and Cragg (1982).
Goldfeld and Quandt (1965, 1972) suggest a "peak test" that does not depend on the assumption that the $e_t$ are normally distributed. It involves comparing the absolute value of each residual with the values of preceding residuals. One that is greater constitutes a "peak," and the test is based on whether the number of peaks is significant. A probability distribution for the number of peaks under the null hypothesis of homoscedasticity is given in Goldfeld and Quandt (1972, pp. 120-122). This distribution is valid asymptotically for LS residuals and valid in finite samples for BLUS or recursive residuals, the latter having been advocated by Hedayat and Robson (1970). It is also possible to derive tests based on the rank correlation coefficient [Johnston (1972, p. 219), Horn (1981)], and on regression quantiles [Koenker and Bassett (1982)].
11.4 RECOMMENDATIONS

In this section we make some recommendations for the researcher who wishes to estimate a function and allow for the possibility of heteroscedasticity. The questions of interest are how to test for heteroscedasticity, how to model the heteroscedasticity if it appears to be present, and how to estimate the chosen model. At the outset we must realize that there is no well-established "best way" to test for and model heteroscedasticity. The relative merits of alternative procedures will depend on the unknown underlying model and unknown parameter values. The best we can do is to suggest what we consider to be reasonable strategies to follow.
As Harrison (1980) and other authors have noted, testing for heteroscedasticity has been much less popular than testing for autocorrelation. Nevertheless, there are many occasions, particularly when using cross-sectional data, where the assumption of heteroscedastic errors is a reasonable one. Furthermore, the consequences of ignoring heteroscedasticity can be severe. It is desirable, therefore, to carry out routine testing for heteroscedasticity, just as the Durbin-Watson statistic is used for routine testing for autocorrelation. Toward this end, it would not be difficult to incorporate a member of Szroeter's class of tests (such as $h_t = t$) into standard computer packages. In addition, it is worth pointing out that tests such as the Goldfeld-Quandt test, the Lagrange-multiplier test, and White's test can be implemented using existing packages without much additional effort. With respect to choice of a test, when it is possible to order the observations according to nondecreasing variance, one of Szroeter's tests with increasing $h_t$'s is likely to be a good choice. If the variance depends on more than one explanatory variable and the variables do not always move in the same direction, it is impossible to rank the observations according to nondecreasing variance. In this case the Breusch-Pagan test, or alternatively the White test, could be used. Finally, if we have no a priori notions about how the variance might change, but we still wish to test for heteroscedasticity, any one of a number of tests (e.g., Bartlett, Goldfeld-Quandt) could be used. However, we should not be too optimistic about their power.
When testing for heteroscedasticity it should be kept in mind that a signifi-
cant test statistic may be an indication of heteroscedasticity, but it could also be
an indication of some other type of misspecification such as an omitted variable
or an incorrect functional form. Available evidence [White (1980), Thursby
(1982)] suggests that this will certainly be true for White's test and the Gold-
feld-Quandt test. A testing strategy designed to discriminate between misspe-
cification and heteroscedasticity has been suggested by Thursby (1982).
If testing suggests that heteroscedasticity is present, we need to choose an
appropriate model. If we have strong a priori notions about the form of het-
eroscedasticity, the model should be chosen accordingly. Where uncertainty
about model specification exists, a diagnostic check can be made using a test
suggested by Godfrey (1979). However, unless the number of observations is large, it may be asking too much of the data to discriminate between models such as $\sigma_t^2 = (z_t'\alpha)^2$, $\sigma_t^2 = z_t'\alpha$, and $\sigma_t^2 = \exp\{z_t'\alpha\}$, even when $z_t$ is specified correctly. Many may work equally well, in which case the choice could be based on estimation convenience. In a Monte Carlo study using the variance specifications $\sigma_t^2 = (z_t'\alpha)^2$ and $\sigma_t^2 = \exp\{z_t'\alpha\}$, Surekha and Griffiths (1984) found that the efficiency of EGLS estimators for $\beta$ depended more on the choice of estimator and the sample size than it did on specification of the correct variance structure. Their results also indicate that the EGLS estimator for $\beta$ based on (11.2.62) can perform poorly in small samples (T = 20); and other studies [e.g., Griffiths (1971)] have shown that the specification $\sigma_t^2 = z_t'\alpha$ can lead to poor estimators for $\beta$ if negative variance estimates are retained. In summary, a choice between variance structures such as $\sigma_t^2 = z_t'\alpha$, $\sigma_t^2 = (z_t'\alpha)^2$, and $\sigma_t^2 = \exp\{z_t'\alpha\}$ is not likely to be very important provided estimators with poor properties are avoided. There are, of course, other variance structures, such as $\sigma_t^2 = \sigma^2(x_t'\beta)^2$ and the ARCH model, which could be chosen if a researcher regards them as more appropriate. In the absence of any prior knowledge concerning the form of heteroscedasticity, asymptotically invalid test statistics can be avoided by using White's (1980) least squares covariance matrix estimator.
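For reference, White's estimator replaces the conventional formula $\sigma^2(X'X)^{-1}$ with $(X'X)^{-1}X'\,\mathrm{diag}(\hat{e}_1^2, \ldots, \hat{e}_T^2)\,X(X'X)^{-1}$, where the $\hat{e}_t$ are least squares residuals. A minimal sketch in Python (an illustrative addition; the names are ours):

    import numpy as np

    def white_cov(X, y):
        # Least squares coefficients with White's (1980) heteroscedasticity-
        # consistent covariance matrix estimator.
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ X.T @ y
        e = y - X @ b                     # LS residuals
        meat = (X * e[:, None]**2).T @ X  # X' diag(e_t^2) X
        V = XtX_inv @ meat @ XtX_inv
        return b, V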
Unless we have strong a priori reasons for believing that heteroscedasticity does (or does not) exist, the Monte Carlo results of Goldfeld and Quandt (1972) and Surekha and Griffiths (1984) suggest that it is reasonable to use a pretest estimator. That is, the conventional procedure of testing for heteroscedasticity and using LS if the test is negative and an alternative EGLS (or ML) estimator otherwise appears justified. For estimation of a specific model, most recommended estimators for $\beta$ have identical asymptotic properties. Thus, apart from the estimators mentioned above that have exhibited some poor small sample properties, there is no clear basis for choice. It seems reasonable to use any EGLS estimator for $\beta$ that is based on asymptotically efficient estimates of the variance parameters, or to use maximum likelihood estimation.
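A sketch of one such pretest strategy follows. It is illustrative only: it assumes the multiplicative specification $\sigma_t^2 = \exp\{z_t'\alpha\}$ for the EGLS step, uses the Breusch-Pagan statistic for the pretest, and all names are ours. Z is assumed to contain a constant column.

    import numpy as np
    from scipy import stats

    def pretest_estimator(X, y, Z, level=0.05):
        # LS if the Breusch-Pagan test accepts homoscedasticity; otherwise
        # EGLS assuming sigma_t^2 = exp(z_t' alpha).
        T = y.shape[0]
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ b
        # Breusch-Pagan LM statistic: half the explained sum of squares from
        # regressing e_t^2 / (e'e/T) on z_t.
        g = e**2 / (e @ e / T)
        fitted = Z @ np.linalg.lstsq(Z, g, rcond=None)[0]
        lm = 0.5 * np.sum((fitted - g.mean())**2)
        df = Z.shape[1] - 1               # Z includes a constant column
        if lm < stats.chi2.ppf(1 - level, df):
            return b                      # pretest accepts homoscedasticity
        # EGLS step: alpha from the regression of log squared residuals on z_t
        a = np.linalg.lstsq(Z, np.log(e**2), rcond=None)[0]
        w = np.exp(-(Z @ a))              # weights proportional to 1/sigma_t^2
        Xw = X * w[:, None]
        return np.linalg.solve(Xw.T @ X, Xw.T @ y)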
11.5 EXERCISES
Exercise 11.1
Given that the disturbance vector e is normally distributed, show that the covariance matrix for w in Equation 11.2.39 is given by $2\bar{Q}$, where $\bar{Q}$ is the matrix $Q = M\Phi M$ with each of its elements squared, and $M$ and $\Phi$ are defined in Section 11.2.4.
Exercise 11.2
Derive the information matrix for the parameters of the model in Section 11.2.4, under the assumption of normality. Use the result to show that the estimator $\hat{\alpha}$ in (11.2.40) is asymptotically more efficient than the estimator in (11.2.35).

Exercise 11.3

Prove that the statistic g in (11.2.43) is identical to the Lagrange-multiplier statistic $\eta$ in (11.3.2).
Exercise 11.4
Derive the information matrix for the parameters of the model in Equation 11.2.46, for the normal, lognormal, and gamma distributions. Under what circumstances is EGLS asymptotically efficient?
Exercise 11.5
Consider the model in Equations 11.2.52 and 11.2.53 and assume that the $e_t$ are normally distributed. Write down the likelihood function for $(\beta', \alpha')$ and derive the information matrix. Show that the estimator of $\alpha$ in Equation 11.2.56 is asymptotically inefficient relative to the maximum likelihood estimator, but that the same is not true for the estimated generalized least squares estimator $\hat{\beta}$ defined in Equation 11.2.58.
Exercise 11.6
Prove that the Goldfeld-Quandt test (Section 11.3.3) is a member of Szroeter's
class of tests (Section 11.3.4).
Exercise 11.7
Prove that (11.3.14) reduces to (11.3.15) when $h_t = t$.
For the remaining exercises, generate 100 samples of y, each of size 20, from
the model
(11.5.1)
where the $e_t$'s are normally and independently distributed with $E[e_t] = 0$, $E[e_t^2] = \sigma_t^2$, and

(11.5.2)

Some suggested parameter values are $\beta_1 = 10$, $\beta_2 = 1$, $\beta_3 = 1$, $\alpha_1 = -2$, and $\alpha_2 = 0.25$; for a possible X matrix where the observations have been ordered
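Because the model (11.5.1), the variance function (11.5.2), and the X matrix are not reproduced here, the following data-generating sketch rests on illustrative assumptions: a three-regressor linear model and the multiplicative form $\sigma_t^2 = \exp(\alpha_1 + \alpha_2 x_{t2})$, with the suggested parameter values.

    import numpy as np

    rng = np.random.default_rng(1984)
    T, NSAMP = 20, 100
    beta = np.array([10.0, 1.0, 1.0])
    alpha = np.array([-2.0, 0.25])

    # Hypothetical regressors; the text's X matrix is not reproduced here.
    x2 = np.sort(rng.uniform(0, 20, T))   # ordered observations
    x3 = rng.uniform(0, 10, T)
    X = np.column_stack([np.ones(T), x2, x3])

    sigma = np.sqrt(np.exp(alpha[0] + alpha[1] * x2))  # assumed form of (11.5.2)
    samples = [X @ beta + sigma * rng.standard_normal(T) for _ in range(NSAMP)]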
Exercise 11.8
Calculate the covariance matrices for the generalized least squares (GLS) esti-
mator
Exercise 11.9
Select five samples from those generated, and, for each sample, compute val-
ues for the following estimators:
1. The LS estimator b.
2. The estimator

$$\hat{\alpha}_{(2)} = \hat{\alpha}_{(1)} + d + 0.2807\left[\sum_{t=1}^{T} z_t z_t'\right]^{-1}\sum_{t=1}^{T} z_t \exp(-z_t'\hat{\alpha}_{(1)})\hat{e}_t^2$$
Exercise 11.10

With the exception of b, all the estimators in Exercise 11.9 were correctly chosen in the sense that they are intended for a model with multiplicative heteroscedasticity, and this type of heteroscedasticity does exist. For each of the same five samples calculate values for the following estimators that incorrectly assume the existence of alternative types of heteroscedasticity.
1. The estimator
where
Exercise 11.11

Use the estimates obtained from b and $\hat{\beta}_{(i)}$, $i = 1, 2, 3, 4$, to calculate 25 "$t$-values" that, in each case, would be used to test the null hypothesis $\beta_3 = 1.0$. Comment on the values.
Exercise 11.12
For each of the five samples compute values for
Exercise 11.13

Define $\hat{\beta}_{(5)}$ and $\hat{\beta}_{(6)}$ as the following estimators:

$$\hat{\beta}_{(5)} = \begin{cases} b & \text{if the BP test accepted homoscedasticity} \\ \hat{\beta}_{(2)} & \text{if the BP test rejected homoscedasticity} \end{cases}$$

$$\hat{\beta}_{(6)} = \begin{cases} b & \text{if Szroeter's test accepted homoscedasticity} \\ \hat{\beta}_{(2)} & \text{if Szroeter's test rejected homoscedasticity} \end{cases}$$

For each of the five samples write down the values for $\hat{\beta}_{(5)}$ and $\hat{\beta}_{(6)}$.
Exercise 11.14

Let $\hat{\beta}_{i(j)}$ be the estimate obtained from the $i$th sample using the estimator $\hat{\beta}_{(j)}$. From the 100 samples generated, calculate estimates of the means and mean square error matrices of all the estimators for $\beta$ mentioned in the earlier problems. That is, find

$$\frac{1}{100}\sum_{i=1}^{100}\hat{\beta}_{i(j)} \qquad\text{and}\qquad \frac{1}{100}\sum_{i=1}^{100}(\hat{\beta}_{i(j)} - \beta)(\hat{\beta}_{i(j)} - \beta)'$$

for $j = 1, 2, \ldots, 6$, and repeat the process for the least squares estimator b.
Exercise 11.15

Along the lines of Exercise 11.14 estimate the means and mean square error matrices for $\hat{\alpha}_{(1)}$ and $\hat{\alpha}_{(2)}$.
Exercise 11.16

The following questions should be answered using the numerical results obtained in Exercises 11.14 and 11.15.
Exercise 11.17

Using all 100 samples, construct empirical distributions for the t statistics obtained in Exercise 11.11. Comment on the shape of these distributions and the proportion of times the hypothesis $\beta_3 = 1.0$ was rejected at the 5% significance level. In terms of accurate interval estimation of $\beta_3$
Exercise 11.18

Based on the results from the 100 samples, which test for heteroscedasticity, the Breusch-Pagan or the Szroeter test, is more powerful?
11.6 REFERENCES
Disturbance-Related Sets
of Regression Equations
[Chapter overview: for a set of M regression equations, i = 1, 2, . . . , M, the chapter treats contemporaneous correlation only, $E[e_ie_j'] = \sigma_{ij}I$ (Section 12.1), covering estimation (Section 12.1.1), hypothesis testing (Section 12.1.2), goodness of fit (Section 12.1.3), and Bayesian estimation (Section 12.1.4); unequal numbers of observations in each equation (Section 12.2); contemporaneous correlation combined with AR(1) errors, $e_{(t)} = Re_{(t-1)} + v_{(t)}$ (Section 12.3), covering estimation with diagonal R (Section 12.3.1), estimation with nondiagonal R (Section 12.3.2), and hypothesis testing (Section 12.3.4); and estimation with a singular disturbance covariance matrix (Section 12.5).]
The linear statistical model studied in the earlier chapters can be extended to
the case where we have M such models

$$y_i = X_i\beta_i + e_i, \qquad i = 1, 2, \ldots, M \qquad (12.1.1)$$

or, in stacked form,

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & X_M \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_M \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_M \end{bmatrix} \qquad (12.1.2)$$
or, alternatively,
$$y = X\beta + e \qquad (12.1.3)$$

where the definitions of y, X, $\beta$, and e are obvious from (12.1.2) and where their dimensions are respectively $(MT \times 1)$, $(MT \times K)$, $(K \times 1)$, and $(MT \times 1)$ with $K = \sum_{i=1}^{M} K_i$. In this section it will be assumed that $E[e_i] = 0$ and $E[e_ie_j'] = \sigma_{ij}I_T$, and hence that the covariance matrix of the joint disturbance vector is given by

$$E[ee'] = \Sigma \otimes I_T$$

where $\Sigma = [\sigma_{ij}]$ is the $(M \times M)$ contemporaneous covariance matrix.
12.1.1 Estimation
When the system in (12.1.2) is viewed as the single equation (12.1.3), we can estimate $\beta$ and hence all the $\beta_i$ via generalized least squares (GLS). If X is of rank K and $\Sigma$ is known and of rank M, the GLS estimator exists and is given by

$$\hat{\beta} = [X'(\Sigma^{-1}\otimes I_T)X]^{-1}X'(\Sigma^{-1}\otimes I_T)y = \begin{bmatrix} \sigma^{11}X_1'X_1 & \sigma^{12}X_1'X_2 & \cdots & \sigma^{1M}X_1'X_M \\ \sigma^{12}X_2'X_1 & \sigma^{22}X_2'X_2 & \cdots & \sigma^{2M}X_2'X_M \\ \vdots & & & \vdots \\ \sigma^{1M}X_M'X_1 & \sigma^{2M}X_M'X_2 & \cdots & \sigma^{MM}X_M'X_M \end{bmatrix}^{-1}\begin{bmatrix} \sum_{i=1}^{M}\sigma^{1i}X_1'y_i \\ \sum_{i=1}^{M}\sigma^{2i}X_2'y_i \\ \vdots \\ \sum_{i=1}^{M}\sigma^{Mi}X_M'y_i \end{bmatrix} \qquad (12.1.7)$$
where $\sigma^{ij}$ is the (i,j)th element of $\Sigma^{-1}$. Within the class of all estimators that are unbiased and linear functions of y, this estimator is minimum variance and, if y is normally distributed, it is the maximum likelihood estimator and is minimum variance within the class of all unbiased estimators. It has mean $E[\hat{\beta}] = \beta$ and covariance matrix $[X'(\Sigma^{-1}\otimes I_T)X]^{-1}$.
If interest centers on one equation, say the ith, and only estimators that are a function of $y_i$ are considered, then the least squares (LS) estimator $b_i = (X_i'X_i)^{-1}X_i'y_i$ is the minimum variance, linear unbiased estimator. However, we can improve on this estimator by considering a wider class, namely, linear unbiased estimators that are a function of y. Within this class $\hat{\beta}_i$, the ith vector component of $\hat{\beta}$, is better than $b_i$ because it allows for the correlation between $e_i$ and the other disturbance vectors, and because it uses information on explanatory variables that are included in the system but excluded from the ith equation. The possible gain in efficiency obtained by jointly considering all the equations led Zellner (1962) to give (12.1.2) the title "a set of seemingly unrelated regression equations."
Zellner notes that if $\sigma_{ij} = 0$ for $i \ne j$ or if $X_1 = X_2 = \cdots = X_M$, the estimators $b_i$ and $\hat{\beta}_i$ will be identical, and so there will be no gain in efficiency. Also, the efficiency gain tends to be higher when the explanatory variables in different equations are not highly correlated but the disturbance terms corresponding to different equations are highly correlated. Conditions that are both necessary and sufficient for $b_i = \hat{\beta}_i$ for all equations are given by Dwivedi and Srivastava (1978).
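As an illustration of the computations, the sketch below implements Zellner's estimator in its estimated (EGLS) form: equation-by-equation LS residuals give an estimate of $\Sigma$, and the blocks of (12.1.7) are then assembled with $\Sigma^{-1}$ replaced by its estimate. The code and its names are ours, not part of the original text.

    import numpy as np

    def sur_fgls(X_list, y_list):
        # Zellner's EGLS for M seemingly unrelated regressions, each with
        # the same number of observations T.
        M = len(X_list)
        T = y_list[0].shape[0]
        # Step 1: equation-by-equation LS residuals and Sigma estimate
        E = np.column_stack([y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
                             for X, y in zip(X_list, y_list)])
        S_inv = np.linalg.inv(E.T @ E / T)
        # Step 2: assemble and solve the blockwise normal equations of (12.1.7)
        K = [X.shape[1] for X in X_list]
        ofs = np.concatenate([[0], np.cumsum(K)])
        A = np.zeros((sum(K), sum(K)))
        c = np.zeros(sum(K))
        for i in range(M):
            for j in range(M):
                A[ofs[i]:ofs[i+1], ofs[j]:ofs[j+1]] = S_inv[i, j] * X_list[i].T @ X_list[j]
            c[ofs[i]:ofs[i+1]] = sum(S_inv[i, j] * X_list[i].T @ y_list[j]
                                     for j in range(M))
        return np.linalg.solve(A, c)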
In addition, there are some conditions under which GLS applied to a subset
of the system of equations yields an estimator identical to the corresponding
subvector of the GLS estimator applied to the complete system [Schmidt
(1978)]. If there is a subset of H equations in which some of the explanatory
variables do not appear, but these variables, along with all other explanatory
variables in the system, appear in the remaining M - H equations, then the
GLS estimator from the H equations is identical to the appropriate subvector of
the GLS estimator of the whole system. As a simple example, consider the M = 2 case studied by Revankar (1974). In this case $y = X\beta + e$ can be written as

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} \qquad (12.1.8)$$
$i, j = 1, 2, \ldots, M \qquad (12.1.10)$

(12.1.11)
where $\hat{\beta}' = (\hat{\beta}_1', \hat{\beta}_2', \ldots, \hat{\beta}_M')$; these can be used to form a new estimator for $\Sigma$,

(12.1.12)

low. Also, the iterative maximum likelihood estimator is not uniformly better than $\hat{\theta}$.
(12.1.13)

where $\hat{\theta}$ is the GLS estimator for $\theta$. Under a squared error loss measure, this estimator is a minimax estimator and will dominate the GLS (maximum likelihood) estimator if $0 < a < 2(K - 2)$, with optimum value $a = K - 2$. The positive-rule counterpart of the above estimator will dominate both the Berger and maximum likelihood (GLS) estimators.
Also, if we assume the disturbance covariance matrix is $\sigma^2\Sigma_0 \otimes I_T$, where $\Sigma_0$ is a known nonsingular matrix, $\sigma^2$ is an unknown positive constant, and s is an estimate of $\sigma^2$ such that $s/\sigma^2$ has a chi-square distribution with $MT - K$ degrees of freedom and is independent of $\hat{\theta}$, a corresponding extended Stein-rule estimator is

(12.1.14)

This estimator and its positive-part counterpart will be minimax and dominate the GLS (maximum likelihood) estimator under squared error loss if $0 < a < 2(K - 2)/(MT - K + 2)$.
To generalize the results for the model $y = X\theta + e$, let us assume that $\Sigma$ is an unknown positive definite diagonal matrix with diagonal elements $\sigma_{ii}$. Let $v = (v_1, v_2, \ldots, v_M)'$ be a random vector independent of $\hat{\theta}$, where $v_i/\sigma_{ii}$ has a chi-square distribution with $(T - K_i)$ degrees of freedom, independently for $i = 1, 2, \ldots, M$. Further define a $(K \times K)$ block diagonal matrix W with blocks $W_i = (v_i/(T - K_i - 2))[X_i'X_i]^{-1}$. Also let

Given these specifications and definitions, the Berger and Bock (1976)-James and Stein estimator for the set of regression equations is

(12.1.15)

(12.1.16)
Thus, the test procedure can be viewed as a test of the significance of the increase in the residual sum of squares from the imposition of linear constraints. If, in (12.1.19), $\Sigma$ is replaced by $\hat{\Sigma}$ and $\hat{\beta}$ and $\hat{\beta}^*$ are modified accordingly, the resulting expression is equal to g given in (12.1.17).
In addition, rather than use the expression in (12.1.18), it is often easier to obtain the restricted estimator $\hat{\beta}^*$ by rewriting the model (12.1.2) to incorporate the restrictions. For example, if $K_1 = K_2 = \cdots = K_M = k$, and we wish to impose the restrictions $\beta_1 = \beta_2 = \cdots = \beta_M$, (12.1.2) can be written as

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_M \end{bmatrix}\beta_1 + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_M \end{bmatrix} \qquad (12.1.20)$$

or, equivalently, as

$$y = X_*\beta_1 + e \qquad (12.1.21)$$

where $X_*$ is of order $(MT \times k)$. The M vector components of $\hat{\beta}^*$ will all be the same, and one of them can be obtained by applying GLS to (12.1.21). This yields
Discussion of the restricted estimator and the model in (12.1.21) raises the question of whether to estimate $\Sigma$ using LS residuals from the restricted or the unrestricted model. In (12.1.10) we suggested using the unrestricted residuals
Then, it can be shown that the Wald statistic for testing $H_0$: $R\beta = r$ is

Both these statistics have an asymptotic $\chi^2_{(J)}$ distribution. For the special case where $X_1 = X_2 = \cdots = X_M$, the statistics can be written in terms of the restricted and unrestricted estimates of $\Sigma$ ($\hat{\Sigma}^*$ and $\hat{\Sigma}$, respectively), and (12.1.25) becomes

(12.1.26)

$$\lambda_W = T\,\mathrm{tr}(\hat{\Sigma}^*\hat{\Sigma}^{-1}) - TM \qquad (12.1.27)$$

(12.1.28)

and it is possible to show that $\lambda_W > \lambda_{LR} > \lambda_{LM}$. Thus, although the three test statistics have the same asymptotic distribution, their finite sample distributions will not be the same. Rejection of $H_0$ can be favored by selecting $\lambda_W$ a priori, while acceptance of $H_0$ can be favored by selecting $\lambda_{LM}$ a priori. See Berndt and Savin (1977) for further details.
One difficulty with using a test procedure based on one of the above asymp-
totic test statistics is that it may not perform well in small samples. Some
evidence of particularly bad small sample performance in the context of testing
for homogeneity and symmetry in demand systems has been provided by
Laitinen (1978), Meisner (1979), and Bera, Byron, and Jarque (1981). For the
special case where $X_1 = X_2 = \cdots = X_M$ and the constraints are such that $R = I_M \otimes a'$ (identical constraints on the coefficients of each equation), Laitinen shows that the exact distribution of g is the Hotelling $T^2$-distribution. [See, also, Malinvaud (1980, p. 230).] The critical values for the $T^2$-distribution are
considerably higher than the corresponding ones for an asymptotic X2-distribu-
tion, implying that the latter test will tend to reject a correct null hypothesis
more frequently than it should. In the more general situation of testing symme-
try restrictions, an exact finite sample distribution is no longer available. How-
ever, using Monte Carlo methods, Meisner (1979) demonstrates that the prob-
lem of an excessive number of Type I errors still remains. Bera et al. (1981)
show that even though the problem is not nearly as severe if the LM test is used
instead of the Wald test, Ho is still rejected more frequently than it should be.
For some related results on the small sample efficiency of the restricted estima-
tor and the accuracy of its estimated covariance matrix, see Fiebig and Theil
(1983).
One possible procedure that may lead to a test with better finite sample properties is to use a statistic analogous to the F-statistic used to test a set of linear restrictions in the single equation case. For known $\Sigma$ this statistic is given by

$$\lambda_F = \frac{(r - R\hat{\beta})'(RCR')^{-1}(r - R\hat{\beta})/J}{(y - X\hat{\beta})'(\Sigma^{-1}\otimes I)(y - X\hat{\beta})/(MT - K)} = \frac{g/J}{T\,\mathrm{tr}[S\Sigma^{-1}]/(MT - K)} \sim F_{(J,\,MT-K)} \qquad (12.1.29)$$

To test the diagonality of $\Sigma$, that is, $H_0$: $\sigma_{ij} = 0$ for all $i \ne j$, the Lagrange multiplier statistic is

$$\lambda_{LM} = T\sum_{i=2}^{M}\sum_{j=1}^{i-1} r_{ij}^2 \qquad (12.1.30)$$

where

$$r_{ij}^2 = \frac{\tilde{\sigma}_{ij}^2}{\tilde{\sigma}_{ii}\tilde{\sigma}_{jj}} \qquad (12.1.31)$$

and $\tilde{\sigma}_{ij} = (y_i - X_ib_i)'(y_j - X_jb_j)/T$. Under $H_0$, $\lambda_{LM}$ has an asymptotic $\chi^2_{[M(M-1)/2]}$ distribution. Note that $M(M - 1)/2$ is half the number of off-diagonal elements in $\Sigma$.
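A sketch of the computation (an illustrative addition; the names are ours):

    import numpy as np

    def lm_diagonality(X_list, y_list):
        # Lagrange multiplier statistic for H0: Sigma diagonal, based on the
        # squared contemporaneous correlations of the LS residuals.
        T = y_list[0].shape[0]
        E = np.column_stack([y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
                             for X, y in zip(X_list, y_list)])
        S = E.T @ E / T                               # sigma~_ij
        d = np.sqrt(np.diag(S))
        r2 = (S / np.outer(d, d))**2                  # r_ij squared
        M = S.shape[0]
        return T * np.sum(r2[np.triu_indices(M, k=1)])  # chi-square, M(M-1)/2 df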
For the two equation case Kariya (1981b) shows that the test with critical region $r_{12} > c$ is locally best invariant for one-sided alternatives of the form $H_1$: $\sigma_{12} > 0$. He also derives the finite sample null distribution for $r_{12}$.
The likelihood ratio test can also be used to test the diagonality of $\Sigma$. It is possible to show that, apart from a constant, the maximized log-likelihood function is given by $(-T/2)\ln|\hat{\Sigma}|$, and thus the likelihood ratio statistic can be written as

(12.1.32)

(12.1.33)

where $e = y - Xb$ are the least squares residuals and $D_T = I_T - jj'/T$, with $j = (1, 1, \ldots, 1)'$. The matrix $D_T$ transforms a given $y_i$ from its original observations into deviations around its mean. Note that $D_T$ is idempotent, and $[(I_M \otimes D_T)y]'(I_M \otimes D_T)y = y'(I_M \otimes D_T)y$. If $R_i^2$ is the coefficient of determination for the ith equation, it can be shown that [Dhrymes (1974)]
(12.1.34)
and thus this definition is a weighted average of the coefficients for each equa-
tion with the weight of the ith equation given by that equation's proportion of
the total dependent variable variation. Thus it has a certain amount of intuitive
appeal. However, it is based on least squares residuals, not those of the efficient GLS estimator. Also, as Dhrymes points out, the ratio $R^2/(1 - R^2)$ cannot be expressed solely as a function of the observations and hence cannot be used to test the overall significance of the relationships, unless $\Sigma = \sigma^2I$. The first problem can be overcome by using the GLS residuals $\hat{e} = y - X\hat{\beta}$ in place of e in (12.1.33) but, as was the case with the single equation GLS approach, $R^2$ can then be negative.
Both problems can be overcome by using McElroy's (1977a) multiequation analog of Buse's (1973) result. This is

$$R_*^2 = 1 - \frac{\hat{e}'(\Sigma^{-1}\otimes I_T)\hat{e}}{y'(\Sigma^{-1}\otimes D_T)y} \qquad (12.1.35)$$
Note that the denominator in the second term can be written as $y'(\Sigma^{-1}\otimes D_T)y = y'(I\otimes D_T)'(\Sigma^{-1}\otimes I)(I\otimes D_T)y$; therefore, the variation for a given equation is measured around the unweighted mean of y for that equation. The measure in (12.1.35) will be between zero and one, and it is monotonically related to the F statistic

$$F = \frac{R_*^2}{1 - R_*^2}\cdot\frac{MT - \sum_iK_i}{\sum_iK_i - M} \qquad (12.1.36)$$

which is used to test the null hypothesis that all coefficients in the system are zero, with the exception of the intercepts in each equation. See Equation 12.1.29 for a general expression for the F test used to test any set of linear restrictions. However, one could still object to (12.1.35) on the grounds that it gives the proportion of explained variation in the transformed y, not y in its original units. This is a legitimate objection but, as was the case with single-equation GLS, it is impossible to get a single measure that satisfies all the criteria.
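For illustration, with the GLS residuals and the dependent variables arranged in (T × M) matrices, (12.1.35) can be computed as follows (an illustrative sketch; the names are ours):

    import numpy as np

    def mcelroy_r2(E, Y, S_inv):
        # McElroy's R*^2 in (12.1.35). E is the (T x M) matrix of GLS
        # residuals, Y the (T x M) matrix of dependent variables, and S_inv
        # the inverse of the estimated contemporaneous covariance matrix.
        Yd = Y - Y.mean(axis=0)              # D_T y_i: deviations from means
        num = np.trace(S_inv @ (E.T @ E))    # e'(Sigma^-1 (x) I)e
        den = np.trace(S_inv @ (Yd.T @ Yd))  # y'(Sigma^-1 (x) D_T)y
        return 1.0 - num / den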
For the case where $X_1 = X_2 = \cdots = X_M$ (or if these matrices are not equal,
the system is estimated in its "unrestricted form"), some further measures
have been provided by Hooper (1959), Glahn (1969), and Carter and Nagar
(1977). See Dhrymes (1974, pp. 240-263) for a discussion on the relationship
between canonical correlations, coefficients of vector correlation and aliena-
tion, and Hooper's measure. McElroy (1977a) points out the differences be-
tween the measures of Hooper, Glahn, and the one in (12.1.35).
If $E[ee'] = \Omega \ne \Sigma \otimes I$, as occurs when the elements in each $e_i$ are autocorrelated, the definition in (12.1.35) is no longer appropriate. Buse (1979) has extended his earlier work to allow for this case.
12.1.4 Bayesian Estimation
When M = 1, (12.1.38) reduces to $g(\sigma_{11}) \propto \sigma_{11}^{-1}$, and, in addition, (12.1.38) can be viewed as the limiting form of an informative prior that is in the form of a Wishart density. For further details and justification see Zellner (1971, Ch. 8) and references therein.

Combining (12.1.37) and (12.1.38) via Bayes' Theorem yields the joint posterior density

residuals, say $\hat{\Sigma}$, then the mean becomes the Zellner EGLS estimator $\hat{\beta} = [X'(\hat{\Sigma}^{-1}\otimes I)X]^{-1}X'(\hat{\Sigma}^{-1}\otimes I)y$.
Unfortunately, it is difficult to further analyze the marginal posterior densities for $\beta$ and $\Sigma$ unless we make additional assumptions about X or $\Sigma$, or use an approximation. One special case that does lead to tractable results is the traditional multivariate regression model where $X_1 = X_2 = \cdots = X_M = X$ (say). In this case $X = I_M \otimes X$, $\hat{\beta} = b$, and it is possible to write

where

and

$$S = (Y - XB)'(Y - XB)$$
and
(12.1.46)
There have been a number of other studies and extensions of the seemingly unrelated regressions model. These include the effects of misspecification [Rao (1974), Srivastava and Srivastava (1983)], the inclusion of random coefficients [Singh and Ullah (1974)], nonlinear equations with related disturbances [Gallant (1975)], the inclusion of error components [Avery (1977), Baltagi (1980), Prucha (1984), Verbon (1980)], the use of biased estimators with lower mean square error [Zellner and Vandaele (1971), Srivastava (1973)], estimation and inference for heteroscedastic systems of equations [Duncan (1983)], and estimation under inequality variance-covariance restrictions [Zellner (1979)].
12.2 UNEQUAL NUMBERS OF OBSERVATIONS IN EACH EQUATION

If one were investigating investment functions for a number of firms, for example,
it would not be surprising to find that the available data on different firms
correspond to different time periods. Thus, in this section, we allow for the
possibility that the available number of observations is different for different
equations, and we investigate the implications for the estimation and hypothe-
sis testing results in Section 12.1.
There are two main consequences. The GLS estimator of the vector contain-
ing the coefficients from all the equations can still be obtained in a straightfor-
ward manner, but it does not reduce to the same expression obtained when the
numbers of observations are equal. Second, the choice of an estimator for the
disturbance covariance becomes a problem. Following Schmidt (1977), we will
illustrate these facts with a system of two seemingly unrelated regressions
(12.2.1)
$$y = X\beta + e \qquad (12.2.2)$$
(12.2.3)
This, in turn, implies that

$$E[ee'] = \Omega = \begin{bmatrix} \sigma_{11}I_T & \sigma_{12}I_T & 0 \\ \sigma_{12}I_T & \sigma_{22}I_T & 0 \\ 0 & 0 & \sigma_{22}I_N \end{bmatrix} \qquad (12.2.4)$$
and the GLS estimator is

Thus, although the GLS estimator is readily attainable, it does not reduce to the expression given in (12.1.7). To obtain the corresponding expression, we can partition $X_2$ and $y_2$ as

$$X_2 = \begin{bmatrix} X_2^* \\ X_2^0 \end{bmatrix} \qquad\text{and}\qquad y_2 = \begin{bmatrix} y_2^* \\ y_2^0 \end{bmatrix} \qquad (12.2.6)$$

where $\sigma^{ij}$ is the (i,j)th element of $\Sigma^{-1}$. If the additional N observations $(X_2^0, y_2^0)$ are ignored, this estimator becomes identical to $\hat{\beta}$ in (12.1.7).
When $\Sigma$ is unknown, the more usual case, $\hat{\beta}$ is unobtainable, so that an estimator for $\Sigma$ must be found. Five possible estimators are discussed and evaluated in a Monte Carlo experiment by Schmidt (1977). To describe these estimators, we let $\hat{e}_1$ and $\hat{e}_2$ be the least squares residuals from the first and second equations, respectively, and let $\hat{e}_2' = (\hat{e}_2^{*\prime}, \hat{e}_2^{0\prime})$ be partitioned conformably with $y_2$. Also, define $S_{11} = \hat{e}_1'\hat{e}_1/T$, $S_{12} = \hat{e}_1'\hat{e}_2^*/T$, $S_{22}^* = \hat{e}_2^{*\prime}\hat{e}_2^*/T$, $S_{22}^0 = \hat{e}_2^{0\prime}\hat{e}_2^0/N$, and $S_{22} = \hat{e}_2'\hat{e}_2/(T + N)$.
The five estimators are as follows.
These estimators differ in the extent to which they utilize the additional N observations on $y_2$. The first ignores the extra observations; the second utilizes them in the estimation of $\sigma_{22}$, but this can lead to an estimate of $\Sigma$ that is not positive definite; the third estimates both $\sigma_{12}$ and $\sigma_{22}$ with the extra observations; and the fourth and fifth use all the observations to estimate all the components in $\Sigma$.

All these estimators of $\Sigma$ are consistent and lead to EGLS estimators of $\beta$ that have the same asymptotic distribution as the GLS estimator. However, because the estimators in (4) and (5) use all the observations, in finite samples we might expect that they would lead to EGLS estimators for $\beta$ that are better than those obtained from variance estimators (1) to (3). This point was investigated by Schmidt in his Monte Carlo study. Some of his general conclusions were:
1. As N increases for fixed T, the efficiency of the various estimators does not
necessarily increase and the relative performance of estimators that use all
the observations is not always better.
2. When $\sigma_{12}$ is high, 0.925 in the experiment, the maximum likelihood estimator is best, but its superiority decreases with decreasing sample size.
3. For T = 10 and $\sigma_{12}$ anything but high, the maximum likelihood estimator is particularly bad.
4. The second variance estimation procedure can yield an estimate of $\Sigma$ that is singular, and the resulting EGLS estimator for $\beta$ may not possess finite moments.

Although there seemed to be no clear advantages from using the extra observations (except when $\sigma_{12}$ is high), Schmidt recommends using procedure (4) because the results from a single Monte Carlo study are necessarily limited and it is an intuitively more pleasing estimator. Also, there were no occasions when it did substantially worse than any of the other estimators.
The same problem, for a more general model, has been studied from the
Bayesian point of view by Swamy and Mehta (1975). They investigated the
model outlined in the next section, where the disturbances are assumed to
follow a first-order autoregressive process.
(12.3.1)
$$y = X\beta + e \qquad (12.3.2)$$

where the disturbance vector follows the first-order autoregressive process

$$e_{(t)} = Re_{(t-1)} + v_{(t)} \qquad (12.3.3)$$

with

$$R = \begin{bmatrix} \rho_{11} & \rho_{12} & \cdots & \rho_{1M} \\ \rho_{21} & \rho_{22} & \cdots & \rho_{2M} \\ \vdots & & & \vdots \\ \rho_{M1} & \rho_{M2} & \cdots & \rho_{MM} \end{bmatrix} \qquad (12.3.4)$$

where the $v_{(t)}$ are independent identically distributed random vectors with mean zero, $E[v_{(t)}] = 0$, and covariance matrix

$$E[v_{(t)}v_{(t)}'] = \Sigma \qquad (12.3.5)$$

That is, $E[v_{it}] = 0$ and $E[v_{it}v_{js}] = \sigma_{ij}$ for t = s and zero otherwise. The AR(1) representation in (12.3.3) and (12.3.4) extends the single equation case by allowing the current disturbance for a given equation to depend on the previous period's disturbance in all equations. It is often referred to as a vector autoregressive model to distinguish it from the simpler case where R is diagonal and each $e_{it}$ depends only on the previous disturbance in the ith equation, that is, $e_{it} = \rho_{ii}e_{i,t-1} + v_{it}$. The general process in (12.3.4) will be stationary if the roots of $|I - Rz| = 0$ are greater than one in absolute value.
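Equivalently, the process is stationary when every eigenvalue of R is less than one in modulus, which is easy to check numerically (an illustrative sketch):

    import numpy as np

    def is_stationary(R):
        # e(t) = R e(t-1) + v(t) is stationary when all eigenvalues of R lie
        # inside the unit circle, i.e., the roots of |I - Rz| = 0 all exceed
        # one in absolute value.
        return bool(np.all(np.abs(np.linalg.eigvals(R)) < 1.0))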
Note the notational distinction between the vector definitions $e_{(t)}$ and $v_{(t)}$ and those given by $e_i$ and $v_i$. The $(M \times 1)$ vector $e_{(t)}$ contains the tth disturbances from all equations, while the $(T \times 1)$ vector $e_i$ contains all the disturbances for the ith equation; $v_{(t)}$ and $v_i$ are defined in a similar fashion. The assumptions we have made about $v_{(t)}$ could also be written as

$$E[v_iv_j'] = \sigma_{ij}I_T \qquad\text{and}\qquad E[vv'] = \Sigma \otimes I_T \qquad (12.3.6)$$
$$V_0 = RV_0R' + \Sigma \qquad (12.3.7)$$

Solving this equation for the elements of $V_0$ gives [Guilkey and Schmidt (1973)]

Then, again using (12.3.4), the other elements of $\Omega$ can be found from

$$E[ee'] = \Omega = \qquad (12.3.10)$$

then the diagonal elements of a given $\Omega_{ij}$ are identical and equal to the (i,j)th element of $V_0$; the matrix $V_1$ provides the elements adjacent to the diagonal in each $\Omega_{ij}$; $V_2$ provides the elements two positions from the diagonal, and so on. Note that $\Omega_{ij}$ is not necessarily symmetric unless i = j. For details of the structure of these matrices in the case where R is diagonal see Exercise 12.4.
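For illustration, rather than solving (12.3.7) element by element, the equation can be vectorized as $\mathrm{vec}(V_0) = (I - R \otimes R)^{-1}\mathrm{vec}(\Sigma)$, a form convenient for computation (an illustrative sketch; the names are ours):

    import numpy as np

    def solve_v0(R, Sigma):
        # Solve V0 = R V0 R' + Sigma for the stationary covariance of the
        # disturbances, using column-stacking vec: vec(R V0 R') = (R (x) R)vec(V0).
        M = R.shape[0]
        A = np.eye(M * M) - np.kron(R, R)
        v0 = np.linalg.solve(A, Sigma.flatten(order="F"))
        return v0.reshape((M, M), order="F")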
We turn now to the problem of estimating $\beta$, and we begin by considering the special case where R is diagonal and where, for the moment, R and $\Sigma$ are assumed known. A diagonal R implies that (12.3.4) can be written as

so that, in this case, the value of the disturbance $e_{it}$ does not depend on the lagged disturbances in other equations.
$$\hat{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y \qquad (12.3.12)$$

This estimator is minimum variance within the class of all linear unbiased estimators and has covariance matrix $(X'\Omega^{-1}X)^{-1}$. However, in practice MT could be quite large and, in general, it is easier to transform the observations so that the GLS estimator $\hat{\beta}$ can be obtained by applying the GLS-SUR estimator of Section 12.1 directly to the transformed observations. To do this, we need an $(MT \times MT)$ transformation matrix P such that

This implies that $\Omega^{-1} = P'(\Sigma^{-1}\otimes I)P$ and that Equation 12.3.12 can be written as

(12.3.14)
$$P = \begin{bmatrix} P_{11} & 0 & \cdots & 0 \\ P_{21} & P_{22} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ P_{M1} & P_{M2} & \cdots & P_{MM} \end{bmatrix} \quad (MT \times MT) \qquad (12.3.15)$$

where

$$P_{ii} = \begin{bmatrix} a_{ii} & 0 & 0 & \cdots & 0 \\ -\rho_{ii} & 1 & 0 & \cdots & 0 \\ 0 & -\rho_{ii} & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \quad (T \times T), \qquad i = 1, 2, \ldots, M \qquad (12.3.16)$$

and

$$P_{ij} = \begin{bmatrix} a_{ij} & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \quad (T \times T), \qquad i \ne j \qquad (12.3.17)$$
$i = 1, 2, \ldots, M \qquad (12.3.19)$

and

For all observations except the first, this transformation is the familiar one used for a single equation with AR(1) errors. This is clear from Equation 12.3.20, where each regression equation is transformed separately using the autoregressive parameter for the equation in question. However, from Equation 12.3.19, we can see that the transformed first observation for a given equation depends on the first observations in other equations.
To demonstrate that the above transformation satisfies the condition $P\Omega P' = \Sigma \otimes I$, we need to show that the transformed disturbance vector $e^* = Pe$ is such that $E[e^*e^{*\prime}] = \Sigma \otimes I$, or, equivalently, that $E[e^*_{(t)}e^{*\prime}_{(t)}] = \Sigma$ and $E[e^*_{(t)}e^{*\prime}_{(s)}] = 0$ for $t \ne s$, where $e^*_{(t)} = (e^*_{1t}, e^*_{2t}, \ldots, e^*_{Mt})'$. Since

for $t = 2, 3, \ldots, T \qquad (12.3.21)$

and the $v_{(t)}$ possess the required properties, it is clear that the transformation is satisfactory for t > 1. To prove that it is also satisfactory for t = 1 we note that $e^*_{(1)} = Ae_{(1)}$ and, therefore,

(12.3.22)
$$P_{0ii} = \begin{bmatrix} -\rho_{ii} & 1 & 0 & \cdots & 0 \\ 0 & -\rho_{ii} & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \quad ((T-1) \times T) \qquad (12.3.23)$$

Thus each equation can be transformed separately with its own $P_{0ii}$; the transformed variables become $X_{0i}^* = P_{0ii}X_i$ and $y_{0i}^* = P_{0ii}y_i$; and the approximate GLS estimator is given by

(12.3.24)
For known $\sigma_{ij}$ and $\rho_{ii}$ this estimator has covariance matrix given by

When $\sigma_{ij}$ and $\rho_{ii}$ are unknown and are replaced by estimates $\hat{\sigma}_{ij}$ and $\hat{\rho}_{ii}$, the GLS estimator $\hat{\beta}$ becomes an estimated generalized least squares (EGLS) estimator $\tilde{\beta}$, and the estimator $\hat{\beta}_0$ becomes an approximate EGLS estimator $\tilde{\beta}_0$. Both these new estimators ($\tilde{\beta}$ and $\tilde{\beta}_0$) have the same asymptotic properties and, in finite samples, $\tilde{\beta}$ is not necessarily more efficient than $\tilde{\beta}_0$.

To estimate $\beta$ when the $\sigma_{ij}$ and $\rho_{ii}$ are unknown, the following procedure can be used.

1. Apply LS separately to each equation and estimate each $\rho_{ii}$ from the corresponding residuals. That is,
$i = 1, 2, \ldots, M \qquad (12.3.25)$

where, in contrast to Equations 12.3.19 and 12.3.20, the "*" will now denote observations transformed with estimated values of $\rho_{ii}$ and/or $a_{ij}$. Note that we have only transformed $M(T - 1)$ observations.
3. Using, for each equation, the appropriate (T - 1) transformed observations, re-estimate the equations separately and estimate the $\sigma_{ij}$ from the residuals of these estimated equations. The residuals are given by

$$\hat{v}_i = y_{0i}^* - X_{0i}^*b_i^* = (I_{T-1} - X_{0i}^*(X_{0i}^{*\prime}X_{0i}^*)^{-1}X_{0i}^{*\prime})y_{0i}^* \qquad (12.3.27)$$

where $y_{0i}^* = \hat{P}_{0ii}y_i$ and $X_{0i}^* = \hat{P}_{0ii}X_i$; and the estimated covariances by

$$\hat{\sigma}_{ij} = \qquad (12.3.28)$$
4. Find $\hat{A}$, the matrix A (Equation 12.3.18) with each $\rho_{ii}$ and $a_{ij}$ replaced by their corresponding estimates $\hat{\rho}_{ii}$ and $\hat{a}_{ij}$. Based on $\hat{A}$, transform the initial observations according to Equation 12.3.19 and combine these M transformed observations with the other $M(T - 1)$ transformed observations that were obtained in step 2. If $\hat{P}$ denotes the complete $(MT \times MT)$ transformation matrix formed using $\hat{A}$ and $\hat{P}_{0ii}$, $i = 1, 2, \ldots, M$, the complete set of transformed observations is now given by $X^* = \hat{P}X$ and $y^* = \hat{P}y$.
5. Apply the SUR estimation technique to the transformed observations to
obtain the EGLS estimator
Another possible variation is, in step 1, to apply the SUR estimator to the untransformed data and to use the resulting residuals, instead of those from least squares, to estimate the $\rho_{ii}$. In addition, if nonlinear least squares and
maximum likelihood estimation are considered, this leads to some further alter-
native estimators, which differ depending on the function being minimized and
on the treatment of the initial observations. See, for example, Kmenta and
Gilbert (1970) and Magnus (1978).
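An illustrative sketch of steps 1 to 3 and 5 for the diagonal R case follows. It omits the initial-observation transformation of step 4, so it computes the approximate estimator rather than the full EGLS estimator, and it reuses the sur_fgls function from the sketch in Section 12.1.1; the code and its names are ours.

    import numpy as np

    def sur_ar1_egls(X_list, y_list):
        # Step 1: estimate each rho_ii from LS residuals; step 2: apply the
        # AR(1) transformation to the last T-1 observations of each equation;
        # steps 3 and 5: apply the SUR (EGLS) estimator to the transformed data.
        Xs, ys = [], []
        for X, y in zip(X_list, y_list):
            e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
            rho = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])
            Xs.append(X[1:] - rho * X[:-1])   # P_0ii X_i
            ys.append(y[1:] - rho * y[:-1])   # P_0ii y_i
        return sur_fgls(Xs, ys)               # defined in the earlier sketch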
We turn now to the more general case where R is not necessarily diagonal. In this case a suitable transformation matrix P such that $P\Omega P' = \Sigma \otimes I$ is given by

$$P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1M} \\ P_{21} & P_{22} & \cdots & P_{2M} \\ \vdots & & & \vdots \\ P_{M1} & P_{M2} & \cdots & P_{MM} \end{bmatrix} \qquad (12.3.32)$$

where

$$P_{ii} = \begin{bmatrix} a_{ii} & 0 & 0 & \cdots & 0 & 0 \\ -\rho_{ii} & 1 & 0 & \cdots & 0 & 0 \\ \vdots & & \ddots & & & \vdots \\ 0 & 0 & 0 & \cdots & -\rho_{ii} & 1 \end{bmatrix} \quad (T \times T) \qquad (12.3.33)$$

and

$$P_{ij} = \begin{bmatrix} a_{ij} & 0 & 0 & \cdots & 0 & 0 \\ -\rho_{ij} & 0 & 0 & \cdots & 0 & 0 \\ \vdots & & \ddots & & & \vdots \\ 0 & 0 & 0 & \cdots & -\rho_{ij} & 0 \end{bmatrix} \quad (T \times T), \qquad i \ne j \qquad (12.3.34)$$
$$A = [a_{ij}] \quad (M \times M) \qquad (12.3.35)$$

$$y_{i1}^* = \sum_{j=1}^{M} a_{ij}y_{j1}, \qquad i = 1, 2, \ldots, M \qquad (12.3.36)$$
and
The transformation for the first observation from each equation is similar in form to that for the diagonal R case, and it involves both the $\rho_{ij}$ and the $\sigma_{ij}$. For the remaining observations the diagonal R transformation is extended by including the previous period's observations from all equations; it depends only on the $\rho_{ij}$. Thus, if the first row in each of the submatrices $P_{ij}$, $i, j = 1, 2, \ldots, M$, is ignored, the transformation matrix depends only on the $\rho_{ij}$ and is relatively straightforward. To include the first rows, we need to know both the $\sigma_{ij}$ and $\rho_{ij}$; from these values we need to derive first $V_0$ and then A.
When the complete transformation matrix is employed we obtain the GLS estimator

(12.3.39)

where $X_0^* = P_0X$, $y_0^* = P_0y$, and $P_0$ is the $[M(T - 1) \times MT]$ matrix obtained by deleting the first row of each $P_{ij}$.

If R and $\Sigma$ are known, $\hat{\beta}_0$ is less efficient than $\hat{\beta}$, but it is much more convenient. If $\hat{\beta}_0$ is used, the matrices $V_0$ and A do not have to be derived, and the transformation depends only on the elements of R, not on those of $\Sigma$.
When R and $\Sigma$ are unknown and replaced by consistent estimates, the resulting estimators, which we denote by $\tilde{\beta}$ and $\tilde{\beta}_0$, are asymptotically equivalent. As in the previous section, there are a number of possible estimators for $\Sigma$ and R. The estimation steps that follow are those suggested by Guilkey and Schmidt (1973), modified to include the initial observations.
$i = 1, 2, \ldots, M \qquad (12.3.40)$

2. For each equation regress $\hat{e}_{it}$ on $(\hat{e}_{1,t-1}, \hat{e}_{2,t-1}, \ldots, \hat{e}_{M,t-1})$. This yields estimates $\hat{\rho}_{i1}, \hat{\rho}_{i2}, \ldots, \hat{\rho}_{iM}$, $i = 1, 2, \ldots, M$, or, in matrix notation, $\hat{R}$.

3. Let $\hat{P}_0$ be the matrix $P_0$ with the $\rho_{ij}$ replaced by the estimates obtained in step 2, and obtain the transformed variables $X_0^* = \hat{P}_0X$, $y_0^* = \hat{P}_0y$.

4. Regress $y_0^*$ on $X_0^*$ and calculate the residuals $(\hat{v}_{12}, \hat{v}_{13}, \ldots, \hat{v}_{1T}, \hat{v}_{22}, \ldots, \hat{v}_{2T}, \ldots, \hat{v}_{M2}, \ldots, \hat{v}_{MT})$ from this regression.

5. Estimate $\Sigma$ by $\hat{\Sigma}$ where its elements are given by

(12.3.41)
(12.3.43)
(12.3.44)
The estimators for $\beta$, with or without the modifications, are all consistent and asymptotically efficient, with an asymptotic covariance matrix that is consistently estimated by $[X^{*\prime}(\hat{\Sigma}^{-1}\otimes I_T)X^*]^{-1}$ or $[X_0^{*\prime}(\hat{\Sigma}^{-1}\otimes I_{T-1})X_0^*]^{-1}$.
An alternative estimation technique is that of maximum likelihood. Under
the assumption of normally distributed disturbances, Beach and MacKinnon
(1979) advocate maximum likelihood estimation that retains the initial observa-
tions and that, through the Jacobian term, constrains the parameter estimates
to lie within the stationary region. However, as they point out, except in a very
special case it is impossible to concentrate any parameters out of the likelihood
function, and numerical maximization of the function with respect to all the
parameters may prove to be computationally difficult.
(12.3.45)
(12.3.47)
$$u = i'W^{-1}i \qquad (12.3.49)$$

Under the null hypothesis that the nondiagonal elements of R are zero, $\hat{u}^*$ converges in distribution to a $\chi^2_{[M(M-1)]}$ random variable.

In a Monte Carlo experiment that estimated the finite sample power of the above four tests (and one other one), Guilkey (1974) found that all the tests were satisfactory and equally good for $T \ge 50$ but rather poor for T = 20.
For the case where $X_1 = X_2 = \cdots = X_M$, Szroeter (1978) has provided exact tests for testing for first- or higher-order vector autoregressive errors. His higher-order specification is not the general one

$$e_{(t)} = \sum_{i=1}^{p} R_ie_{(t-i)} + v_{(t)}$$

where the $e_{(t-i)}$ and $v_{(t)}$ are $(M \times 1)$ vectors and the $R_i$ are $(M \times M)$ matrices, but rather a multivariate generalization of the specification considered by Wallis (1972), namely, $e_{(t)} = R_pe_{(t-p)} + v_{(t)}$. The finite sample power of these tests has not been evaluated, but they seem worthy of consideration. If the $X_i$'s are not identical, the tests could be used providing they were based on "unrestricted residuals," those obtained by including every explanatory variable in every equation.
Noting that the Durbin-Watson test is a useful device for detecting a number
of forms of misspecification in single equation models (e.g., omitted variables,
12.4 RECOMMENDATIONS
12.5 ESTIMATION WITH A SINGULAR DISTURBANCE COVARIANCE MATRIX

In linear regression, we often assume that the covariance matrix of the distur-
bance term is nonsingular, in which case the generalized least squares proce-
dure is applicable. However, there are cases when the disturbance covariance
matrix cannot be assumed nonsingular. An example is the estimation of the
parameters of the linear expenditure system [Stone (1954)]:

$i = 1, 2, \ldots, M \qquad (12.5.1)$

where $y_i$ is the T-vector of observations on the expenditure for the ith commodity, m is the T-vector of observations on total expenditure (income), $p_i$ the T-vector of prices for the ith commodity, and $e_i$ the T-vector of unobserved random disturbances. The $\gamma_i$ and $\beta_i$ are unknown scalar parameters to be estimated. The parameter $\gamma_i$ is often interpreted as the subsistence level of expenditure and $\beta_i$ is the budget share after fulfilling the subsistence expenditures. Thus the budget share $\beta_i$ is subject to the constraint $\sum_{i=1}^{M}\beta_i = 1$. The constraint on the $\beta_i$'s and the fact that total expenditure m is the sum of the $y_i$'s imply that $\sum_{i=1}^{M}e_i = 0$. Hence the disturbances are linearly dependent. If one tries to estimate the M equations jointly using the seemingly unrelated regression technique discussed above, one will be faced with a singular disturbance covariance matrix that cannot be inverted. One will note that the complete disturbance
$t = 1, 2, \ldots, T, \qquad i = 1, 2, \ldots, M \qquad (12.5.2)$

where $p_{it}$ is the true proportion for the ith category in period t and the $e_{it}$ are assumed to be independently distributed, each with a multinomial distribution with mean zero, variances $p_{it}(1 - p_{it})/N_t$ and covariances $-p_{it}p_{jt}/N_t$. The total number of cases for each t is denoted by $N_t = \sum_i n_{it}$.

Assume now that the true proportions are related to explanatory variables, at least over a range, by a relationship that is linear in the parameters such that

$p_i = X_i\beta_i, \qquad i = 1, 2, \ldots, M \qquad (12.5.3)$
(12.5.4)

$$y = X\beta + e \qquad (12.5.5)$$

where $y' = (y_1', y_2', \ldots, y_M')$, $\beta' = (\beta_1', \beta_2', \ldots, \beta_M')$, $e' = (e_1', e_2', \ldots, e_M')$, and X is a block diagonal matrix with the $X_i$'s on the diagonal. Since $\sum_ip_{it} = 1$, the covariance matrix of e is singular. This may be seen from the tth cross-sectional covariance matrix of e, that is,

$$E[e_{(t)}e_{(t)}'] = \frac{1}{N_t}\begin{bmatrix} p_{1t}(1-p_{1t}) & -p_{1t}p_{2t} & \cdots & -p_{1t}p_{Mt} \\ -p_{2t}p_{1t} & p_{2t}(1-p_{2t}) & \cdots & -p_{2t}p_{Mt} \\ \vdots & & & \vdots \\ -p_{Mt}p_{1t} & -p_{Mt}p_{2t} & \cdots & p_{Mt}(1-p_{Mt}) \end{bmatrix} \qquad (12.5.6)$$

The covariance matrix $E[e_{(t)}e_{(t)}']$ is singular, since the sum of any row or column is zero. The total covariance matrix $E[ee']$ is $(MT \times MT)$ in dimension, but the rank is only $(M - 1)T$. Also, note that if $X_1 = X_2 = \cdots = X_M$, the equations in
(12.5.5) are also linearly dependent. This dependence can be removed by delet-
ing one of the M equations [Lee, Judge, and Zellner (1968)].
However, because the linear dependence of disturbances may imply linear
constraints on the parameter vector, deleting equations is not always a desir-
able procedure. Unless the implied parameter restrictions are imposed or ful-
filled automatically, discarding information will result in estimates that are
inefficient. In the following section, we will discuss the procedure of utilizing
the seemingly redundant information.
Another situation in which the singular disturbance covariance matrix could occur is the case of precise observations. Assume that $y_1$ is a vector of observations with noise or errors and $y_2$ is a vector of precise observations that have no errors. In this case, the model can be partitioned as

(12.5.7)

$$y = X\beta + e \qquad (12.5.8)$$

$$y = X\beta + e \qquad (12.5.10)$$
where E[e] = 0 and $E[ee'] = \Psi$, which is of order $(T \times T)$ and singular with rank g < T. The covariance matrix $\Psi$ does not have an inverse and generalized least squares cannot be applied. Since $\Psi$ is symmetric and positive semidefinite of order T, with rank g, it has T - g zero characteristic roots and g positive characteristic roots. Let F be a $(T \times g)$ matrix whose columns are characteristic vectors of $\Psi$ corresponding to the positive roots, and G be a $(T \times (T - g))$ matrix whose columns are characteristic vectors of $\Psi$ corresponding to the zero roots. Then the augmented matrix V = (F G) is orthogonal, that is,

$$V'V = \begin{bmatrix} F'F & F'G \\ G'F & G'G \end{bmatrix} = \begin{bmatrix} I_g & 0 \\ 0 & I_{T-g} \end{bmatrix} = I_T \qquad (12.5.11)$$

$$V'\Psi V = \Lambda = \begin{bmatrix} \Lambda_g & 0 \\ 0 & 0 \end{bmatrix} \qquad (12.5.12)$$

where $\Lambda_g$ is a diagonal matrix of the positive characteristic roots $\lambda_i$, $i = 1, 2, \ldots, g$. Thus,
if we transformed the model (12.5.10) by premultiplying by V', we have

which is equivalent to

$$\begin{bmatrix} F'y \\ G'y \end{bmatrix} = \begin{bmatrix} F'X \\ G'X \end{bmatrix}\beta + \begin{bmatrix} F'e \\ G'e \end{bmatrix} \qquad (12.5.14)$$

where

$$\hat{\beta} = CX'F\Lambda_g^{-1}F'y \qquad (12.5.19)$$

$$C = (X'F\Lambda_g^{-1}F'X)^{-1} \qquad (12.5.20)$$

It is assumed that F'X has full column rank K so that the inverse in (12.5.20) exists. Also note that $F\Lambda_g^{-1}F'$ is the generalized inverse of $\Psi$ ($= F\Lambda_gF'$) and may be denoted by $\Psi^+$. The generalized inverse $\Psi^+$ fulfills the following four conditions: $\Psi\Psi^+\Psi = \Psi$; $\Psi^+\Psi\Psi^+ = \Psi^+$; $(\Psi\Psi^+)' = \Psi\Psi^+$; and $(\Psi^+\Psi)' = \Psi^+\Psi$.
There are cases when G'X = 0 and the constraint on the parameters in (12.5.16) vanishes. In this case, the restricted estimator $\beta^*$ is the same as the unrestricted estimator $\hat{\beta}$. There are also cases when $G'X \ne 0$ but the unrestricted estimator $\hat{\beta}$ automatically fulfills the restrictions (12.5.16). In this case the last factor of (12.5.18) is zero and the adjustment factor for $\beta^*$ vanishes, making $\beta^*$ identically equal to $\hat{\beta}$. For an example, see the generalized inverse method of estimating transition probabilities [Lee, Judge, and Zellner (1977, Appendix A)].
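As an illustration, the unrestricted estimator in (12.5.19) and (12.5.20) can be computed directly from the eigendecomposition of $\Psi$. The sketch below (the names are ours) returns only $\hat{\beta}$; the restricted estimator $\beta^*$ would additionally impose the constraints implied by G'X.

    import numpy as np

    def ginv_gls(X, y, Psi, tol=1e-10):
        # beta = (X'F L^{-1} F'X)^{-1} X'F L^{-1} F'y, where F holds the
        # eigenvectors of Psi belonging to its positive roots L.
        lam, V = np.linalg.eigh(Psi)     # Psi symmetric positive semidefinite
        keep = lam > tol                 # positive characteristic roots
        F, L = V[:, keep], lam[keep]
        XF = F.T @ X                     # F'X, assumed of full column rank
        yF = F.T @ y
        C = np.linalg.inv(XF.T @ (XF / L[:, None]))
        return C @ XF.T @ (yF / L)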
A simple example of a singular covariance matrix is that obtained from a two-equation system of linear probability models (12.5.4). The tth cross-section disturbance covariance matrix is a $(2 \times 2)$ version of (12.5.6), namely,

$$E[e_{(t)}e_{(t)}'] = \frac{1}{N_t}\begin{bmatrix} p_{1t}p_{2t} & -p_{1t}p_{2t} \\ -p_{1t}p_{2t} & p_{1t}p_{2t} \end{bmatrix}$$

where $p_{1t} + p_{2t} = 1$. The characteristic roots are $2p_{1t}p_{2t}/N_t$ and 0. The associated characteristic vectors when normalized are

$$F = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix} \qquad (12.5.28)$$

and the generalized inverse is

$$\Psi_t^+ = F\Lambda_g^{-1}F' = N_t\begin{bmatrix} \dfrac{1}{4p_{1t}p_{2t}} & \dfrac{-1}{4p_{1t}p_{2t}} \\ \dfrac{-1}{4p_{1t}p_{2t}} & \dfrac{1}{4p_{1t}p_{2t}} \end{bmatrix} \qquad (12.5.30)$$
In the previous sections it was shown that a singular covariance matrix will occur if (1) we combine equations that are linearly dependent; (2) some observations are free of errors; or (3) some a priori restrictions on parameters are treated as additional observations. The orthogonal transformation procedure shows that the generalized inverse of the singular covariance matrix may be used in place of the ordinary inverse in the restricted least squares formula. The restriction is the linear combination of equations, the identity relating the precise observations, or the extraneous parameter restrictions. It can be shown, as Theil (1971, p. 281) suggests, that if the singular covariance matrix is due to linear dependency among equations, as an alternative to dealing with the whole system, the dependent equations may be deleted. In this case, for the subsystem obtained after deleting the redundant equations, the condensed covariance matrix will be nonsingular and the regular procedure is applicable [Lee, Judge, and Zellner (1977, Section A.6)]. However, if the singular covariance matrix results from one of the last two cases, the equations relating to the precise observations become parameter restrictions and should not be discarded. If the number of precise observations, say J, exceeds the number of unknown parameters, K, then there are at least (J - K) precise observations that are repeated. Even when J < K, some precise observations may be repetitive. Repetitive, precise observations duplicate parameter restrictions and may be discarded without changing the restricted generalized inverse estimator. For examples of deleting equations, see Kendall and Stuart (1958, pp. 355-356); Lee, Judge, and Zellner (1968, 1977); Parks (1971); and Theil (1971, pp. 275, 281). The deletion of redundant equations is in the spirit of preventing an
12.6 EXERCISES
Exercise 12.1
Prove that the GLS estimator $\hat{\beta} = [X'(\Sigma^{-1}\otimes I)X]^{-1}X'(\Sigma^{-1}\otimes I)y$ is identical to the LS estimator $b = (X'X)^{-1}X'y$ when:

1. $X_1 = X_2 = \cdots = X_M = X$.
2. $\Sigma$ is diagonal.
Exercise 12.2
Prove that the two expressions for g in Equations 12.1.16 and 12.1.19 are identical.

Exercise 12.3

Derive the expressions for $\lambda_W$ and $\lambda_{LM}$ given in the last lines of Equations 12.1.24 and 12.1.25.
Exercise 12.4

Consider the SUR model with AR(1) errors studied in Section 12.3 where the matrix R is diagonal and given by

$$R = \mathrm{diag}(\rho_1, \rho_2, \ldots, \rho_M)$$

1. Show that

$$V_0 = \begin{bmatrix} \dfrac{\sigma_{11}}{1-\rho_1^2} & \dfrac{\sigma_{12}}{1-\rho_1\rho_2} & \cdots & \dfrac{\sigma_{1M}}{1-\rho_1\rho_M} \\ \dfrac{\sigma_{21}}{1-\rho_2\rho_1} & \dfrac{\sigma_{22}}{1-\rho_2^2} & \cdots & \dfrac{\sigma_{2M}}{1-\rho_2\rho_M} \\ \vdots & & & \vdots \\ \dfrac{\sigma_{M1}}{1-\rho_M\rho_1} & \dfrac{\sigma_{M2}}{1-\rho_M\rho_2} & \cdots & \dfrac{\sigma_{MM}}{1-\rho_M^2} \end{bmatrix}$$

and that

$$E[e_ie_j'] = \frac{\sigma_{ij}}{1-\rho_i\rho_j}\begin{bmatrix} 1 & \rho_j & \cdots & \rho_j^{T-1} \\ \rho_i & 1 & \cdots & \rho_j^{T-2} \\ \vdots & & \ddots & \vdots \\ \rho_i^{T-1} & \rho_i^{T-2} & \cdots & 1 \end{bmatrix}$$

2. For the case where M = 2, show that the matrix A satisfies the equation $\Sigma = AV_0A'$ when it has the following elements:

$$a_{12} = 0$$

$$a_{21} = \frac{\sigma_{12}\sqrt{1-\rho_1^2}}{\sigma_{11}}\left(1 - \left\{\frac{(\sigma_{11}\sigma_{22} - \sigma_{12}^2)(1-\rho_1^2)(1-\rho_2^2)}{\sigma_{11}\sigma_{22}(1-\rho_1\rho_2)^2 - \sigma_{12}^2(1-\rho_1^2)(1-\rho_2^2)}\right\}^{1/2}\right)$$
Exercise 12.5
Show, for the SUR model with contemporaneous correlation only, that the maximized log-likelihood function is, apart from a constant, equal to $(-T/2)\ln|\hat{\Sigma}|$.
Exercise 12.6
Consider the Bayesian approach in Section 12.1.4. Prove that the marginal posteriors $g(\Sigma|y)$ and $g(\beta|y)$ are given by (12.1.41) and (12.1.42), respectively, and that they can be written as (12.1.44) and (12.1.45) when $X_1 = X_2 = \cdots = X_M = X$.
Exercise 12.7
Consider a SUR model with AR(1) errors where
To give experience in estimation and hypothesis testing and more insight into
the sampling properties of various estimators, we recommend generating some
Monte Carlo data from a set of regression equations, and using these data to
answer the exercises in this section. As one possible model for generating the
data we suggest
(12.6.1)
where
and $(v_{1t}, v_{2t})'$ is a normal random vector with mean zero and covariance matrix
Exercise 12.8
Show that the autoregressive process described in Equations 12.6.3 and 12.6.4
is stationary.
Exercise 12.9
Calculate the contemporaneous disturbance covariance matrix
Exercise 12.10
Use A, obtained in Exercise 12.9, to find the transformation matrix P which is such that $P\Omega P' = \Sigma \otimes I_T$, where $\Omega = E[ee']$. Use P to obtain the covariance matrix for the generalized least squares (GLS) estimator of $\beta$, say $\hat{\beta}_G$.
Exercise 12.11
If one incorrectly aSsumes that the model is less complicated than that in
Equations 12.6.1-12.6.4, or does not transform the observations for t = 1, an
estimator less efficient than the GLS will be obtained. Given that the model in
Equations 12.6.1-12.6.4 is the correct one, find algebraic expressions and nu-
merical values for the covariance matrices of the following estimators.
(12.6.5)
R = [0.8 . . .] (12.6.6)

(the contemporaneous covariance matrix for $(e_{1t}, e_{2t})'$ and the transformation matrix for the initial observations, say $V_0$ and A, respectively.)

3. The approximate GLS estimator obtained by assuming R = diagonal(0.8, -0.1),

where $P_0$ is the $(38 \times 40)$ matrix obtained by deleting the first and twenty-first rows of P.

4. The approximate GLS estimator obtained by correctly assuming R is given by Equation 12.6.4,

where $P_0$ is the $(38 \times 40)$ matrix obtained by deleting the first and twenty-first rows of the matrix P that was obtained in Exercise 12.10.
Exercise 12.12
Using as a definition of efficiency the "weak mean squared error" criterion $E[(\hat{\beta} - \beta)'(\hat{\beta} - \beta)]$, rank the estimators $\hat{\beta}_G$, $\hat{\beta}_Z$, $\hat{\beta}_P$, $\hat{\beta}_{P0}$, and $\hat{\beta}_{G0}$ according to their relative efficiency. Comment on this ranking.
Exercises 12.13-12.16
Each student selects five samples of y and calculates values for a number of
estimators and test statistics for each of these samples. Exercises 12.17 to 12.19
use the combined set of answers from all students.
Exercise 12.13
Select five samples and for each of these samples compute values for the estimated GLS estimators $\tilde{\beta}_Z$, $\tilde{\beta}_{P0}$, $\tilde{\beta}_P$, $\tilde{\beta}_{G0}$, and $\tilde{\beta}_G$. These are given by the estimators in Exercises 12.10 and 12.11 with the $\sigma_{ij}$ and/or $\rho_{ij}$ replaced by estimates. Compare these values with each other and the true parameter values.
Exercise 12.14
For each of the 25 estimates obtained in Exercise 12.13 calculate "t-test statistics" for the null hypothesis $\beta_{12} = 0.04$. In each case assume that the estimator used was appropriate for the model so that the asymptotic covariance matrices are estimated, respectively, by
Exercise 12.15
For each of the five samples selected in Exercise 12.13, use the restricted counterpart of the estimator $\tilde{\beta}_{G0}$ to estimate $\beta$ under the set of restrictions $\beta_{12} = \beta_{22}$, $\beta_{13} = \beta_{23}$. Under the null hypothesis that this set of restrictions is true, calculate the $\chi^2$-test statistics obtained by appropriately modifying Equation 12.1.19.
Exercise 12.16
Again, for each of the five samples,
Exercise 12.17
Let $\tilde{\beta}_{G(i)}$ be the estimate from the ith sample obtained using the estimator $\tilde{\beta}_G$. Calculate $\sum_{i=1}^{50}(\tilde{\beta}_{G(i)} - \beta)(\tilde{\beta}_{G(i)} - \beta)'/50$, which is an estimate of the mean square error matrix for $\tilde{\beta}_G$. Repeat this process for the estimators $\tilde{\beta}_Z$, $\tilde{\beta}_{P0}$, $\tilde{\beta}_P$, and $\tilde{\beta}_{G0}$.
Exercise 12.18
Compare the results of Exercise 12.17 with those in Exercises 12.10 and 12.11.
Do the results suggest that the ranking obtained in Exercise 12.12 is still appro-
priate when the disturbance covariance matrix and autoregressive parameters
are unknown?
Exercise 12.19
Using all 50 samples, construct empirical distributions for
Giving due consideration to the model that generated the data, and the truth or
otherwise of each of the null hypotheses, comment on the shape of each of the
above distributions and on the proportion of times each null hypothesis was
rejected at the 5% level of significance.
12.7 REFERENCES
Doran, H. E., and W. E. Griffiths (1983) "On the Relative Efficiency of Estima-
tors Which Include the Initial Observations in the Estimation of Seemingly
Unrelated Regressions with First-Order Autoregressive Disturbances,"
Journal of Econometrics, 23, 165-191.
Dreze, J. (1977) "Bayesian Regression Analysis Using Poly-t Densities," Jour-
nal of Econometrics, 6, 329-354.
Dufour, J.-M. (1982) "Generalized Chow Tests for Structural Change: A Coor-
dinate Free Approach," International Economic Review, 23, 565-575.
Duncan, G. M. (1983) "Estimation and Inference for Heteroscedastic Systems
of Equations," International Economic Review, 24, 559-566.
Dwivedi, T. D., and V. K. Srivastava (1978) "Optimality of Least Squares in
the Seemingly Unrelated Regression Equations Model," Journal of Econo-
metrics, 7, 391-395.
Fiebig, D. G. and H. Theil (1983) "The Two Perils of Symmetry-Constrained
Estimation of Demand Systems," Economics Letters, 13, 105-111.
Fisher, F. M. (1970) "Tests of Equality Between Sets of Coefficients in Two
Linear Regressions: An Expository Note," Econometrica, 38, 361-366.
Gallant, A. R. (1975) "Seemingly Unrelated Nonlinear Regressions," Journal
of Econometrics, 3, 35-50.
Glahn, H. (1969) "Some Relationships Derived from Canonical Correlation
Theory," Econometrica, 37, 252-256.
Graybill, F. A. (1969) Introduction to Matrices with Applications in Statistics,
Wadsworth, Belmont, Calif.
Guilkey, D. K. (1974) "Alternative Tests for a First-Order Vector Autoregres-
sive Error Specification," Journal of Econometrics, 2, 95-104.
Guilkey, D. K. (1975) "A Test for the Presence of First-Order Vector Autore-
gressive Errors When Lagged Endogenous Variables are Present," Econo-
metrica, 43, 711-718.
Guilkey, D. K., and P. Schmidt (1973) "Estimation of Seemingly Unrelated
Regressions with Vector Autoregressive Errors," Journal of American Sta-
tistical Association, 68, 642-647.
Hall, A. D. (1977) "Further Finite Sample Results in the Context of Two
Seemingly Unrelated Regression Equations," Australian National Univer-
sity Working Papers in Economics and Econometrics No. 39, Canberra,
Australia.
Hall, A. D. (1982) "The Relative Efficiency of Time and Frequency Domain
Estimates in SUR Systems," Journal of Statistical Computation and Simu-
lation, 16, 81-96.
Harvey, A. C. (1982) "A Test of Misspecification for Systems of Equations,"
Discussion Paper No. A31, London School of Economics Econometrics Pro-
gramme, London, England.
Hendry, D. F. (1971) "Maximum Likelihood Estimation of Systems of Simulta-
neous Regression Equations with Errors Generated by a Vector Autoregres-
sive Process," International Economic Review, 12, 257-272.
Hooper, J. W. (1959) "Simultaneous Equations and Canonical Correlation
Theory," Econometrica, 27, 245-256.
Schmidt, P., and P. Sickles (1977) "Some Further Evidence on the Use of the
Chow Test Under Heteroskedasticity," Econometrica, 45, 1293-1298.
Singh, B., and A. Ullah (1974) "Estimation of Seemingly Unrelated Regres-
sions with Random Coefficients," Journal of the American Statistical Asso-
ciation, 69, 191-195.
Smith, P. J., and S. C. Choi (1982) "Simple Tests to Compare Two Dependent
Regression Lines," Technometrics, 24, 123-126.
Spencer, D. E. (1979) "Estimation of a Dynamic System of Seemingly Unre-
lated Regressions with Autoregressive Disturbances," Journal of Economet-
rics, 10, 227-241.
Srivastava, V. K. (1970) "The Efficiency of Estimating Seemingly Unrelated
Regression Equations," Annals of the Institute of Statistical Mathematics,
22, 483-493.
Srivastava, V. K. (1973) "The Efficiency of an Improved Method of Estimating
Seemingly Unrelated Regression Equations," Journal of Econometrics, 1,
341-350.
Srivastava, V. K., and T. D. Dwivedi (1979) "Estimation of Seemingly Unrelated Regression Equations: A Brief Survey," Journal of Econometrics, 10, 15-32.
Srivastava, S. K., and V. K. Srivastava (1983) "Estimation of the Seemingly Unrelated Regression Equation Model Under Specification Error," Biometrika, forthcoming.
Stone, R. (1954) "Linear Expenditure Systems and Demand Analysis: An Application to the Pattern of British Demand," The Economic Journal, 64, 511-527.
Swamy, P. A. V. B., and J. S. Mehta (1975) "On Bayesian Estimation of Seemingly Unrelated Regressions When Some Observations are Missing," Journal of Econometrics, 3, 157-169.
Szroeter, J. (1978) "Generalized Variance-Ratio Tests for Serial Correlation in
Multivariate Regression Models," Journal of Econometrics, 8, 47-60.
Telser, L. G. (1964) "Iterative Estimation of a Set of Linear Regression Equations," Journal of the American Statistical Association, 59, 845-862.
Theil, H. (1971) Principles of Econometrics, Wiley, New York.
Toyoda, T. (1974) "Use of the Chow Test Under Heteroscedasticity," Econometrica, 42, 601-608.
Verbon, H. A. A. (1980) "Testing for Heteroscedasticity in a Model of Seemingly Unrelated Regression Equations with Variance Components (SUREVC)," Economics Letters, 5, 149-153.
Wallis, K. F. (1972) "Testing for Fourth-Order Autocorrelation in Quarterly
Regression Equations," Econometrica, 40, 617-636.
Wang, J. H. K., M. Hidiroglou and W. A. Fuller (1980) "Estimation of Seemingly Unrelated Regressions with Lagged Dependent Variables and Autocorrelated Errors," Journal of Statistical Computation and Simulation, 10, 133-146.
Zellner, A. (1962) "An Efficient Method of Estimating Seemingly Unrelated
13.1 INTRODUCTION
1. All coefficients are constant,

$$y_{it} = \beta_1 + \sum_{k=2}^{K}\beta_kx_{kit} + e_{it} \qquad (13.1.2)$$

2. Slope coefficients are constant and the intercept varies over individuals,

$$y_{it} = \beta_{1i} + \sum_{k=2}^{K}\beta_kx_{kit} + e_{it} \qquad (13.1.3)$$

3. Slope coefficients are constant and the intercept varies over individuals and time,

$$y_{it} = \beta_{1it} + \sum_{k=2}^{K}\beta_kx_{kit} + e_{it} \qquad (13.1.4)$$

4. All coefficients vary over individuals,

$$y_{it} = \beta_{1i} + \sum_{k=2}^{K}\beta_{ki}x_{kit} + e_{it} \qquad (13.1.5)$$

5. All coefficients vary over individuals and time,

$$y_{it} = \beta_{1it} + \sum_{k=2}^{K}\beta_{kit}x_{kit} + e_{it} \qquad (13.1.6)$$
[Overview chart: the general model $y_{it} = \beta_{1it} + \sum_{k=2}^{K}\beta_{kit}x_{kit} + e_{it}$ is specialized in three directions: all coefficients constant with $e_{it}$ heteroscedastic and autocorrelated (Section 13.2); slope coefficients constant with a variable intercept, either over individuals only, $\beta_{1it} = \beta_{1i}$, giving the dummy variable model if the $\beta_{1i}$ are fixed and the error components model if random (Section 13.3), or over individuals and time, $\beta_{1it} = \beta_1 + \mu_i + \lambda_t$ (Section 13.4); and variable slope coefficients, either over individuals, $\beta_{kit} = \beta_{ki}$ (Section 13.5), or over individuals and time, $\beta_{kit} = \beta_k + \mu_{ki} + \lambda_{kt}$ (Section 13.6). In each case the question arises whether the variation is fixed or random (Sections 13.3.3, 13.4.3, and 13.5.2).]
518 Some Alternative Covariance Structures
Pj p,T I
T-'
E[e,ej]
ai} p, Pi - (13.2.2)
- p,p,
p,T - I p,T 2
i = I, 2, . . . ,N (13.2.3)
where E[u't] = 0, E[u'tujt] = ai, and E[u'tuj,] = for t oF s. Thus this model
assumes that (1) the coefficients are the same for all individuals, (2) the distur-
bance vector for a given individual follows a first-order autoregressive process,
(3) the variance of the disturbance can be different for different individuals, and
(4) the disturbances for different individuals are contemporaneously correlated.
In many circumstances these are likely to be a reasonable set of assumptions.
Perhaps the most restrictive is that the coefficients are constant for all individ-
uals. However, the more general case where the coefficients are not necessarily
constant over individuals has already been extensively discussed in Section
12.3. In fact, the results of that section can be used to estimate the restricted
model in Equation 13.2.1 and to test the hypothesis that the coefficients are the
same for all individuals. See Sections 12.1.2a and 12.3 and also Kmenta (1971).
If the first-order autoregressive scheme is regarded as too restrictive, alterna-
tive assumptions discussed in Chapter 12, such as first-order vector autoregres-
sive errors or moving-average errors, can be made.
Inference in Models That Combine Time Series and Cross-Sectional Data 519
One of the most simple models used for combining time series and cross-
sectional data is one where a varying intercept term is assumed to capture
differences in behavior over individuals ar: j where the slope coefficients are
assumed to be constant. This model can be written as
K
(=\,2,. ,T
(\3.3.1)
where f3li = [31 + Ili is the intercept for the ith individual, [31 is the "mean
intercept," and III represents the difference from this mean for the ith individ-
ual. The appropriate estimation procedure for (13.3.1) depends upon whether
the III are assumed to be random or fixed. If the III are fixed (13.3.1) is the
dummy variable or covariance model while if the III are random it is an error
components model. These two models are discussed in Sections 13.3.\ and
\3.3.2, respectively, and in Section 13.3.3 we discuss the relevant issues for
choice between the two models. Further discussion can be found, for example,
in Maddala (1971), Nerlove (l97laL Swamy (1971), Mundlak (I 978a), and
Hausman and Taylor (1981). An introductory discussion may be found in Judge
et al. (1982, pp. 477-487).
(13.3.3)
520 Some Alternative Covariance Structures
where y' = (y;, yL . . . ,YN), X; = (X:. I , X;2, . . . , X:N ), e' = (e;, e2, . . . ,
er), and PI = (fJll, fJ12' . , fJIN)'. To illustrate the dummy variables, note
that, for N = 3,
1 1 1
1
1 1
y = [. (IN-I)
JNT 0'
. X., ]( P.,0) + e
JT (\3.3.4)
1 1 1 1 1 1 1 1 1 '
1 1
1 1
(13.3.5)
Least squares estimates of (P; , P:') can be found from (13.3.3) or (13.3.4) and,
of course, the results will be identical. If(l3.3.4) is used and 8i , i = 1,2 . . . . ,
N are the least squares estimates of the Oi, then the least squares estimates of
the fJli are given by hli = 81 + 8i+ I, i = 1, 2, . . . , N - 1 and hlN = 81.
Estimates of the J-ti are given by
(13.3.6)
(13.3.7)
where the kth mean is Xki. = 2.;= I XkilIT. The transformation on y is similar.
If Equation 13.3.1 is averaged over time and the result subtracted from
(13.3.1), we have
(13.3.8)
where Yi. = 2.;= I YiJT and bs can be viewed as the estimator obtained when least
squares is applied to (13.3.8). This estimator only requires a matrix inversion of
order K'; its covariance matrix is given by (T;(X;(IN DT)Xs)-I. Because it
utilizes the variation of the variables within each group or individual, it is often
known as the "within estimator." Furthermore, it plays an important role in
the error components model of the next section.
If bs is found from (13.3.8) rather than from (13.3.3) or (13.3.4), the least
squares estimates of the intercepts can be found from
K
b li = Yi - 2: bkXki.
k=2 (13.3.9)
For testing hypotheses about the coefficients, the usual least squares pro-
cedures are appropriate. Of particular interest is the hypothesis f311 = f312 =
. . . = f3IN = (3" and this can be tested using the conventional F test that
compares the restricted and unrestricted residual sums of squares. It is worth
noting that a joint test such as this one is preferable to individual t tests on the
coefficients of each dummy variable. If one follows the procedure of dropping
dummy variables where t tests are insignificant, two different parameteriza-
tions of the same problem can lead to different dummy variables being omitted.
In this section we are again interested in using Equation 13.3.1 to model differ-
ences in individual behavior, but instead of assuming that the /-ti are fixed
parameters, we assume that they are random variables with E[/-ti] = 0, E[/-tT] =
(T~, and E[/-ti /-tj] = for i -=F j. The /-ti and eil are assumed to be uncorrelated and,
522 Some Alternative Covariance Structures
(13.3.10)
(13.3.11)
The assumption that the lLi are random variables implies that the N individ-
uals can be regarded as a random sample from some larger population, and it
also implies that the lLi and Xi are uncorrelated. This point will be taken up in
Section 13.3.3.
Also, from (13.3.10) we can argue that our classification of models is, to some
~xtent, arbitrary. We are regarding the error components model as one with a
random intercept. It could also be regarded as one where all coefficients are
constant and the disturbance covariance matrix is of the form given by
(13.3.11). Alternatively, the intercept in the constant coefficient model in Sec-
tion 13.2 could be redefined to include the disturbance and, as such, it would
vary over time and individuals.
It is interesting to compare the error components covariance matrix with that
ofthe model studied in Section 13.2. In Equation 13.3.11 the covariance matrix
is identical for all individuals. Disturbances in different time periods for the
same individual are correlated, but this correlation is constant over time and it
is identical for all individuals. In Equation 13.2.2 the correlation declines as the
disturbances become further apart in time and it can be different for different
individuals.
y = Xp + f.1 @ h + e (13.3.12)
where X' = (X;, XL . .. ,XN), f.1 = (ILl, 1L2, .. ,ILN)', y' = (y;, yz. . .. ,
YN), and e' = (e;, ez, . . . ,eN). The covariance matrix is block diagonal and
given by
If cr~ andcr;are known, the generalized least squares (GLS) estimator for 11
is best linear unbiased and is given by
(13.3.14)
If we partition this estimator as ~' = (~I' ~;), then it is possible to show that
~s is a matrix weighted average ofthe within estimator bs , which was described
in Section 13.3.1, and another estimator which is known as the "between
Inference in Models That Combine Time Series and Cross-Sectional Data 523
(13.3.15)
where O"I = TO"~ + 0";. Then, if this expression and the partitions X = (jNT,X s )
and Xi = (jT,X'i) are substituted into (13.3.14), we obtain, after some algebra,
where
. ., . .,
_I JTJT _ JNTJNT
r.:>.
Q1 - N 0; T NT (13.3.17)
X21. - X2 .. , ,XKI. - XK ..
- -
, XKN. - XK ..
and Xk .. = L~ I LT= I Xkitl NT. Thus QI , when multiplied by X or y, has the effect of
calculating the means for each individual (Xki.), expressing the individual means
in terms of deviations from the overall mean, and repeating each of these N
observations T times.
The between estimator is given by
(13.3.18)
(13.3.19)
which is Equation 13.3.1 averaged over time. Note that (3; only utilizes the
variation between individuals. Now Ils can be written as
(13.3.20)
~, = Y., - 2: {3kh
k~2 (13.3.21)
The expression for Ps in Equation 13.3.20 is an informative one but, from the
point of view of the applied worker, it may not be a very convenient one for
calculation. To calculate Pvia a standard least squares computer package, we
need a transformation matrix P such that P'P = c<l>-' where c is any scalar.
Fuller and Battese (1973) suggest the transformation P = IN Pi, where
Pi = h - (I _0-,o-e) jdT
T (13.3.22)
PiPi = o-;<I>i-' and hence P'P = 0-;<1>-'. Multi r1 ying both sides of(13.3.12) by P
yields
K
fli = ( o-~)
0-, jT(Yi -
-2
A
XiJl)
(13.3.24)
Inference in Models That Combine Time Series and Cross-Sectional Data 525
and it can be viewed as a proportion of the GLS residual allocated to fli' the
precise proportion depending upon the relative variances (J"~ and (J";. It is best
linear unbiased in the sense that E[fli - ILi] = 0, and that E[(fli - ILi )2] is smaller
than the prediction error variance of any other predictor that is unbiased and a
linear function of y. The expectation is taken with respect to repeated sampling
over both time and individuals.
Conditional on ILi' fli is a biased predictor because E[fldILi] =1= ILi. The bias, as
well as a "best constrained predictor" that restricts the mean squared bias to
be less than a predetermined value, have been considered by Battese and Fuller
(1982).
Also, it is interesting to note that fli is that value of ILi obtained when the
quadratic function
is minimized with respect to 13 and the ILi. This can be regarded as an extension
of the Theil and Goldberger (1960) mixed estimation procedure [Lee and Grif-
fiths (1979)].
e*'e*
a-T = ..,..-,----
N-K (13.3.26)
and
(13.3.27)
where e* = QIY - QIXsJ3.t are the residuals from the estimator 131 and e =
(IN DT)y - (IN DT )Xsb s are the residuals from the estimator bs . An
estimator for the variance of ILi is
(13.3.28)
This estimator has the disadvantage that a-~ can be negative. If this occurs, we
can set the negative estimate to zero or, as Maddala (1971) suggests, it may be
526 Some Alternative Covariance Structures
an indication that time effects, which are discussed in Section 13.4, have been
incorrectly omitted. Assuming that (F~ = 0 leads to the least squares estimator.
A number of other variance estimators has been suggested. These include
estimators based on least squares residuals [Wallace and Hussain (1969), esti-
mators based only on the residuals e [Amemiya (1971))' the "fitting of con-
stants" method [Henderson (1953), Fuller and Battese (1973), MINQUE esti-
mation [Rao (1970, 1972)] and maximum likelihood estimation [Amemiya
(1971), Nerlove (1971a), Maddala (1971)]. When any of these variance esti-
mates are used in place of the real variances in the GLS estimator for p, we
have an estimated generalized least squares (EGLS) estimator. Under appro-
priate conditions these EGLS estimators will possess the same asymptotic
distribution as the GLS estimator. See, for example, Swamy (1971) and Fuller
and Battese (1973).
The finite sample properties of various EGLS estimators as well as the be-
tween and within estimators have been investigated analytically by Swamy and
Mehta (1979) and Taylor (1980), and in Monte Carlo studies by Arora (1973)
and Maddala and Mount (1973). Taylor concludes that (i) the EGLS estimator
is more efficient than the within estimator for all but the fewest degrees of
freedom and its variance is never more than 17% above the Cramer-Rao bound,
(ii) the asymptotic approximation to the variance of the EGLS estimator is
similarly within 17% of the true variance but remains significantly smaller for
moderately large sample sizes, and (iii) more efficient estimators for the vari-
ance components do not necessarily yield more efficient EGLS estimators.
Swamy and Mehta introduce some new estimators and outline criteria that one
could use to choose between least squares, estimated generalized least squares,
the within estimator, the between estimator, and their new estimators.
_ NT [e'(lN jrjr)e _ ]2
ALM-2(T_l) e'e 1 (13.3.29)
(13.3.30)
where fJ-i was first assumed to be fixed (Section 13.3.1) and then random (Sec-
tion 13.3.2). If the choice between these two assumptions is clear, then the
estimation procedure could be chosen accordingly. However, if the choice is
not an obvious one, it is useful to consider the problem in terms of whether or
not the fJ-i are correlated with the Xi, and if so, the possible nature of the
dependence of fJ-i on Xi [Mundlak (1978a), Chamberlain (1978, 1979, 1983),
Hausman and Taylor (1981)].
Mundlak (1978a) suggests that the fJ-i can always be considered random, but,
when using the dummy variable estimator, our inference is conditional on the
fJ-i in the sample. When the error components model is used, specific assump-
tions about the distribution of the fJ-i are made, and, as a result, unconditional
inference is possible. Because the conditional inference implied by the dummy
variable estimator does not make any specific assumptions about the distribu-
tion of the fJ-i' it can be used for a wider range of problems. However, if the
restrictive distributional assumption of the error components model is correct,
then, as expected, using this additional information leads to a more efficient
estimator.
The relevant question then is whether it is reasonable to assume that the fJ-i
are i.i.d. (O,<T~) or whether they are a consequence of some other random
process. Mundlak, Chamberlain, and others argue that it is more reasonable to
assume that fJ-i and X are correlated and hence that E[fJ-i] will not be constant
but some function of Xi. For example, if we are considering a production
function where i refers to the ith firm, and the ith firm knows its value fJ-i' then a
simple profit maximization assumption will imply that Xi depends on fJ-i. If fJ-i
and Xi are in fact correlated, the error components model is similar to an
omitted variable misspecification and its GLS estimator, as well as the LS
estimator b = (X' X)-I X'y, will be biased. However, conditional on the fJ-i in the
sample, the dummy variable estimator will be best linear unbiased.
Thus, it would appear that a reasonable prescription is to use the error
components model if the fJ-i ~ i.i.d. (O,<T~) assumption is a reasonable one and
N is sufficiently large for reliable estimation of <T~; otherwise, particularly
when fJ-i and Xi are correlated, or N is small, use the dummy variable (within)
estimator. However, as noted by Chamberlain (1978) and Hausman and Taylor
(1981), there are two possibly undesirable features of the within estimator when
fJ-i and Xi are correlated. First, if Xi contains some time invariant measurable
variables, then it is impossible to separate these variables from the dummy
variables. They will be eliminated when the within estimator b s is obtained.
This problem does not arise if the error components model is used, but, in this
case, we would be faced with the problem of biased and inconsistent estimates.
528 Some Alternative Covariance Structures
The second problem with using the within estimator is that it may be less
efficient than alternative consistent estimators that exploit information on the
relationship between fJ-i and Xi, and that do not ignore sample variation across
individuals.
For the case where there are no time invariant variables in Xi, Mundlak
(1978a) shows that there will be no efficiency loss from using the within estima-
tor if E(fJ-;) is linearly related to Xi = (X2i., X3i., . . . ,XKiJ'; in this case the
unconditional GLS estimator is identical to the dummy variable estimator.
However, as Lee (1978a) and others have pointed out, this is a fairly special
case. The fJ-i may be related to some other variables that are correlated with Xi,
they may be related to only some of the components in Xi, or the relationship
may be a nonlinear one. In these cases the two estimators are not identical.
Some alternative estimation techniques that consider these aspects are those of
Chamberlain (1978, 1983), Chamberlain and Griliches (1975) and Hausman and
Taylor (1981). Chamberlain (1978) and Chamberlain and Griliches (1975) relate
the individual effects to a set of exogenous variables in a factor analytic model.
Hausman and Taylor (1981) do not assume any specification for the fJ-i' but use
knowledge about which vectors in Xi are correlated with the fJ-i to develop an
efficient instrumental variables estimator. They show that Mundlak's (l978a)
specification where the GLS and within estimators are identical can be consid-
ered a special case.
A completely different approach particularly designed for small T and large
N is that outlined by Chamberlain (1983). He shows that if fJ-i and Xi are
correlated, then this fact will be reflected in the values of the coefficients in a
linear regression of Yit on all leads and lags of the explanatory variables. This
essentially involves a multivariate specification with T dependent variables, KT
explanatory variables, and N observations on each of the variables. Chamber-
lain suggests that it is too restrictive to assume that E(fJ-i!X;) is a linear func-
tion, and he goes on to formulate inference procedures based on the estimation
of a minimum mean square error linear predictor. The estimates obtained can
be used first to test whether fJ-i and Xi are correlated, and, secondly, to test
whether the coefficients obey certain restrictions that are implied by the specifi-
cation. Evidence that the restrictions do not hold is suggestive of some type of
specification error, such as random coefficients [Chamberlain (1982)], omitted
variables, or serial correlation in the eit. See also, Chamberlain (1979).
The foregoing discussion suggests that it would be convenient to have a
statistical test to test the hypothesis that the fJ-i and Xi are uncorrelated. In
addition to the Chamberlain procedure, tests for this purpose have been devel-
oped by Hausman (1978) and Pudney (1978) and considered further by Haus-
man and Taylor (1981). Under the null hypothesis that the error components
model is the correct specification, Hausman shows that
(13.3.31)
is the covariance matrix for the generalized least squares estimator p,. The
unknown variances in p" Mo, and M, can be replaced by appropriately chosen
estimates without affecting the asymptotic distribution of m. If the null hypoth-
esis is true, then, asymptotically, b s and rl, differ only through sampling error.
However, if /J-i and Xi are correlated, Ps and bs could differ widely and it is
hoped that this will be reflected in the test. Rejection of the null hypothesis
does suggest that bs is the more appropriate estimator, but Hausman recom-
mends carefully checking the specification because errors in variables prob-
lems, if present, may invalidate the dummy variable estimates.
Finally, we note that an approach for integrating the fixed and random effects
into a combined model, which is a special case of a switching regressions
model, and which is particularly relevant when a number of the individual
effects are likely to be equal, has been outlined by Mundlak and Yahav (1981).
The analysis of the previous section can be extended so that the intercept, in
addition to containing a component that is constant over time and varies from
individual to individual, also contains a component that varies over time and is
constant over individuals. This extended model is
K
t=1,2, . . . ,T (13.4.1)
with intercept f3lit = /31 + ILi + At. In a given time period, the time effects, At,
represent the influence of factors that are common to all individuals.
Our discussion in this section runs parallel to that of Section 13.3. If the ILi
and At are fixed, Equation 13.4.1 is a dummy variable model, and if they are
random it is an error components model. These two models are discussed in
Sections 13.4.1 and 13.4.2, respectively, and, to help the applied worker
choose between the two models, a statistical test is given in Section 13.4.3. For
further discussion see Wallace and Hussain (1969), Swamy (1971), NerIove
(197Ib), Swamy and Arora (1972), and Mundlak (1978a).
When ILi and At are treated as fixed parameters, one of the ILi and one of the At
are redundant and some restrictions such as 2.iILi = 0 and 2. t At = 0 need to be
imposed. To estimate (13.4.1) using least squares, we need to reparameterize
the equation so that the restrictions are incorporated. One such reparameter-
ization is to define f3li = /31 + ILi' i = 1,2, . . . ,N and Ai = At - AT, t = 1,2,
. . . , T - I, and write the model for the ith individual as
_ . + (Ir-I)
Yi - f3IiJT 0' A* + X'ilJs + ei
(13.4.2)
where A* = (Ai, At, . . . ,At-I )'. With all observations included, this becomes
Inference in Models That Combine Time Series and Cross-Sectional Data 531
(13.4.3)
where 131 = (/311, /312, ... ,/3IN)', y = (y;, yz, ... ,YN)', X; = (X; I , X;2,
. . . ,X;N), and e' = (e;, e2, . . . ,eN). The matrix Xs contains observations
on all the explanatory variables except for the constant. It is assumed that
E[e] = 0 and E[ee'] = cr;INT' and hence that the least squares estimator for the
[(N + T - 1 + K') x 1] parameter vector (13;, X*', 13;)' is best linear unbiased.
Before giving an expression for the least squares estimator, it is worthwhile
illustrating the dummy variables. For N = 3,
'
.
IN ( )]
/r-I
0'
1 1 1 1
1 1 1 1
1 1 ... 1 1
(13.4.4)
1 1 1
1 1 1
1 0 1 0 ... 1 0
(13.4.5)
The matrix
. -, .. , . -,
= I - I fV\JTJT _ JNJN fV\ I + JNTJNT
Q NT N 0; T N 0; T NT (13.4.6)
532 Some Alternative Covariance Structures
is idempotent and transforms the observations on X, and y such that QXs, for
example, is an (NT x K') matrix with the typical element given by
(13.4.7)
where
and
(13.4.8)
(13.4.9)
and
(13.4.10)
where Vit is defined in an obvious way. The estimator b, in Equation 13.4.5 can
be viewed as the least squares estimator obtained from Equation 13.4.11. It can
also be interpreted as the GLS estimator obtained from (13.4.11), where QQ =
Q is the covariance matrix of the disturbance and Q is the generalized inverse
of itself [Swamy and Arora (1972)]. The covariance matrix for b s is
or;(X;QXs)-I.
If Jls is estimated using (13.4.5), the remaining parameters in the model can
be estimated from
and
K
hi = Y. - 2: Xk .. bk
k~2 (13.4.14)
Instead of assuming that the individual and time effects are fixed parameters, in
this section we assume they are random variables such that E[J.ti] = 0, E[A(] =
0, E[J.t7] = tT~, E[A7] = tTL E[J.ti J.tj] = 0 for i 40 j, E[A(Ao5] = 0 for t 40 s, and A( , J.ti
and eit are uncorrelated for all i and t. Under these assumptions the observa-
tions on the ith individual are given by
(13.4.15)
where A = (AI, A2, . . . , Ar)' and the other symbols have their earlier defini-
tions. In particular, Xi = (jr,X"i) includes a constant term. The covariance
matrix for the composite disturbance is
and the covariance between disturbance vectors for two different individuals is
where J.1 = (J.tl, J.t2, . . . , J.tN)' and the complete disturbance covariance ma-
trix is
Unlike <I> in Section 13.3.2 this covariance matrix is not block diagonal. The
disturbances corresponding to different individuals are contemporaneously
correlated.
(13.4.21)
where Q and QI, are defined in Equations 13.4.6 and 13.3.17, respectively,
I ,
_ JNJN 0. I _ JNTJNT
Q2 - N \2$; T NT (13.4.22)
I
_ JNTJNT
Q3 - NT (13.4.23)
O"T = 0"; + TO"~, O"~ = 0"; + NO"~ and O"~ = 0"; + TO"~ + NO"~. Partitioning
(13.4.20) and using (13.4.21) yields, after some algebra,
and
Al N-1/2jfv T-1/2jr
A2 N-1/2jfv C I
A3 C2 T- I12jr
A4 C2 C I (13.4.25)
where C I and C 2 are such that CIC; = Ir-I' C;C I = Dr, C2 C2 = IN-I, and
C ZC 2 = D N The characteristic roots of <t> are fTt fTL fTT. and fT; with multiplic-
ity 1, (T - 1), (N - 1), and (N - 1)( T - 1). Swamy and Arora (1972) demon-
strate how A 2, A 3, and A4 lead to the estimators (3~, (3;, and bs , respectively.
The inclusion of A I in any of these transformations permits estimation of the
intercept term.
(13.4.26)
and
(13.4.27)
Prediction of the lLi gives information on the future behavior of individuals and
so it is likely to be of more interest than the prediction of At, which gives
information on past realizations. It can be shown that these predictors are
obtained by minimizing an extended' 'residual sum of squares function" similar
to that given in Equation 13.3.25. Also, further results and some computational
suggestions can be found in Henderson (1975).
536 Some Alternative Covariance Structures
e*'e*
0-1 = -::-::--=
N-K (13.4.28,
(13.4.29)
and
(13.4.30)
where e* = QI y - QI X,(l; are the residuals from the estimator (l;, eO = Q2Y -
Q2X2(l~ are the residuals from (l~ and e = Qy - QX,b.l are the residuals from bs .
Some alternative variance estimators are those based only on least squares
residuals [Wallace and Hussain (1969)], those based on the dummy variable
residuals [Amemiya (1971)]' and those derived from the fitting of constants
method [Fuller and Battese (1974)].
Substitution of a set of variance estimates into ~ leads to an estimated gener-
alized least squares (EGLS) estimator, ~, and the properties of various EGLS
estimators have been examined by Wallace and Hussain (1969), Swamy and
Arora (1972), Fuller and Battese (1974) and Kelejian and Stephan (1983). Wal-
lace and Hussain show that the dummy variable estimator is asymptotically
efficient relative to the GLS estimator while Swamy and Arora, in terms of
finite sample properties, indicate the conditions under which ~, with variance
estimators 0-1, o-~, and 0-;, is likely to be better than the least squares and
dummy variable estimators. Fuller and Battese consider a less restrictive set of
assumptions and give the conditions under which an EGLS estimator is unbi-
ased and asymptotically equivalent to the GLS estimator. Further aspects of
the asymptotic distribution of the GLS and EGLS estimators are considered by
Kelejian and Stephan. Baltagi (1981) carries out an extensive Monte Carlo
~xperiment where he examines the finite sample properties of EGLS estima-
tors, the performance of tests for the stability of cross-sectional regressions
over time (and time series regressions over individuals), the performances of
the Hausman specification test and the Lagrange multiplier test (see Sections
13.4.2d and 13.4.3), and the frequency of negative variance estimates.
residual sums of squares. Another alternative is the Lagrange multiplier test for
the hypothesis (J"~ = (J"~ = O. Breusch and Pagan (1980) show that, under the
null hypothesis
The problem of choice between the fixed and random effects models can be
examined within the same framework as that considered in Section 13.3.3 for
the model with individual effects only. Specifically, it can be argued that JLi and
At can always be regarded as random components, but when we treat them as
fixed and use the dummy variable estimator our inference is conditional on the
JLi and At in the sample. Such a procedure can always be employed, irrespective
of the properties of JLi and At, and, in particular, irrespective of whether or not
the JLi and At are correlated with the explanatory variables. The error compo-
nents GLS estimator will suffer from omitted variable bias if the JLi and At are
correlated with the explanatory variables, and it will not in general be efficient
unless the required distributional assumptions for JLi and At hold. On the other
hand, when these more restrictive assumptions do hold, the error components
GLS estimator will be more efficient than the dummy variable estimator, and
our inference is not conditional on the JLi and At in the sample.
Another possibility is to pursue a mixed approach where the At (say) are
treated as fixed and the JLi are treated as random variables. Such an approach is
likely to be particularly useful if T is small and if the researcher wishes to follow
some of the more general specifications for JLi and eit mentioned in Sections
13.3.3 and 13.3.4. Note that if the dummy variables for time are included within
X, then the model becomes one with individual effects only, and it can be
handled within the framework of Section 13.3 provided T is not too large.
A null hypothesis of no correlation between the effects and the explanatory
variables can be tested by comparing the dummy variable estimator with an
EGLS estimator. Following Hausman (1978), if the null hypothesis is true,
(13.4.32)
is the covariance matrix for the GLS estimator ~s. In (13.4.32) unknown vari-
ances can be replaced by appropriate estimates without changing the asymp-
totic distribution.
It is possible that different behavior over individuals will be reflected not only
in a different intercept but also in different slope coefficients. When this is the
case, our model can be written as
t = 1, 2, . . . , T
(13.5.1)
where Xlit = 1 and, in contrast to the previous two sections, we are no longer
treating the constant term differently from the other explanatory variables. Our
assumptions imply that the response of the dependent variable Yit to an explana-
Inference in Models That Combine Time Series and Cross-Sectional Data 539
tory variable Xkit is different for different individuals but, for a given individual,
it is constant over time. When the response coefficients 13ki are fixed parame-
ters, Equation 13.5.1 can be viewed as the "seemingly unrelated regressions
model" and when the 13ki are random parameters, it is equivalent to the
"Swamy random coefficient model." The seemingly unrelated regressions
model was discussed extensively in Chapter 12, and so we will not discuss it
further at this point. We begin with the Swamy random coefficient model in
Section 13.5.1, and in Section 13.5.2 we discuss the problem of choice between
the fixed and random assumptions.
When the (K x 1) response vector for each individual, l3i' can be regarded as a
random vector drawn from a probability distribution with mean 13 and covari-
ance matrix Ll, the model in (13.5.1) becomes the random coefficient model
introduced by Swamy (1970, 1971). For the ith individual this model can be
written as
I = 1, 2, . . . ,N (13.5.2)
with E[ .... iJ = 0, E[ ....i ....fJ = Ll, and E[ ....i ....J] = 0 for i =F- j.
For the seemingly unrelated regressions model studied in Chapter 12 a num-
ber of assumptions about ei, of varying degrees of complexity, were consid-
ered. Any of these could also be used for this model but, in general, we will
consider the simple one, where E[eiefJ = (J"i;! and E[eieJ] = 0 for i =F- j. This
implies that the disturbances across individuals are heteroscedastic but uncor-
related and, if the l3i were fixed parameters, the least squares estimators b i =
(XiXi)-IXiYi would be best linear unbiased.
For this model we are interested in estimating the mean coefficient vector 13,
predicting each individual vector l3i' estimating the variances upon which the
generalized least squares (GLS) estimator for 13 and the best linear unbiased
predictor (BLUP) for l3i depend, and testing the hypothesis that Ll = O. Each of
these items will be considered in turn.
where y' = (yi, yL . . . ,YN), X' = (Xi, XL . .. ,XN), fL' = (fLi, fLz, . . . ,
fLN), e' = (e;' eL . . . ,eN), and
Z=
The covariance matrix for the composite disturbance is <I> = E[(ZfL + e)(ZfL +
e)'] and is block diagonal with ith diagonal block given by
The GLS estimator for Ii has its usual properties and is given by
(13.5.6)
where
(13.5.7)
and
(13.5.8)
is the least squares estimator (predictor) of (Ji. The last equality in (13.5.8) is
based on a matrix result that can be found, for example, in Rao (1965, p. 29); it
shows that the GLS estimator is a matrix weighted average of the estimators b i ,
with weights inversely proportional to their covariance matrices. Also, the last
expression in Equation 13.5.8 only requires a matrix inversion of order K, and
so, if T is large, it is a convenient one for computational purposes. The covari-
ance matrix for ~ is
(13.5.9)
(13.5.10)
is the BLUP for Jli [Lee and Griffiths (1979)]. This predictor is unbiased in the
sense that E[~i - Jl;] = 0, where the expectation is an unconditional one, and it
is best in the sense that V* - V is nonnegative definite, where V is the covari-
ance matrix of the prediction error (~i - Jli) and V* is the covariance matrix of
the prediction error from any other linear unbiased predictor.
The predictor ~i can be viewed as an estimate of the mean, ~, plus a predic-
tor for lJ.i' where the latter is given by a weighted proportion of the GLS
residual vector Yi - XiJl. Also, using the same matrix result mentioned earlier
[Rao (1965, p. 29)], it can be written as
(13.5.11)
with respect to Jl and the Jli [Lee and Griffiths (1979)]. Finally, it is interesting
to note that the GLS estimator for the mean coefficient vector is a simple
average of the BLUPs. That is, ~ = 2.~1 ~JN.
(13.5.13)
542 Some Alternative Covariance Structures
and
(13.5.14)
where
(13.5.15)
The estimated generalized least squares (EGLS) estimator for p that uses these
variance estimates is, under certain conditions [Swamy (1970)], consistent and
asymptotically efficient. An estimate of the asymptotic covariance matrix is
given by substituting a-ii and .:i into Equation 13.5.9. Conditions under which
the EGLS estimator for p has a finite mean and is unbiased are given by Rao
(1982).
One difficulty with the estimator .:i is that it may not be nonnegative definite.
Swamy (1971) discusses this problem at length and suggests methods for adjust-
ing it or constraining it such that it is nonnegative definite. This will, of course,
destroy the unbiasedness property. However, unbiasedness should not be sa-
cred, at least not as sacred as nonnegative definite covariance matrices, and
so the easiest solution seems to be to use the natural sample quantity Ii =
Sb/(N - 1) as an estimator. This estimator will be nonnegative definite and
consistent. In finite samples, which variance estimator leads to the more effi-
cient EGLS estimator for p is an open question. Under the assumption of
normally distributed disturbances one can also use the maximum likelihood
technique to estimate the parameters. This is outlined by Swamy (1971, 1973)
and, for a more general model, by Rosenberg (I973b).
If this null hypothesis is true, the individual coefficient vectors are not random
and they are all identical to the mean. In addition, the joint model in Equation
13.5.4 becomes a heteroscedastic error one, where the variances are constant
within subgroups of observations. Swamy (1970) suggests the statistic
(13.5.16)
Inference in Models That Combine Time Series and Cross-Sectional Data 543
where
and
i =1= j (13.5.18)
where
1
Pi
(13.5.19)
T-l pT-2 1
p, i
Estimators for d, the U"ij, and the Pi are given by Swamy (1974), and these can
be used in an EGLS estimator for ~. One possible problem with the EGLS
estimator is that it requires a matrix inversion of order NT. A convenient
transformation that enables its calculation on standard computer packages does
not seem to be available.
Other extensions include instrumental variable estimation when X contains
544 Some Alternative Covariance Structures
lagged values of the dependent variable [Swamy (1974), estimation when the
covariance matrix could be singular [Rosenberg (1973b), and attempts to pro-
duce estimators that may be biased but which have lower mean square error
[Swamy (1973, 1974)]. Applications can be found, for example, in Swamy
(1971), Feige and Swamy (1974), and Hendricks et al. (1979).
The question concerning whether the (3i should be assumed fixed and different
and the seemingly unrelated regressions framework used, or whether they
should be assumed random and different and the random coefficient framework
used, has been considered by Mundlak (1978b). Extending his work on the
error components model [Mundlak (1978a), he suggests that, in both cases, the
coefficients can be regarded as random but, in the first case, our inference is
conditional on the coefficients in the sample. The important consideration is
likely to be whether the variable coefficients are correlated with the explana-
tory variables. If they are, then the assumptions of the Swamy random coeffi-
cient model are unreasonable and the GLS estimator of the mean coefficient
vector will be biased. In these circumstances the fixed coefficient model can be
used. If the variable coefficients are not correlated with the explanatory vari-
ables, then it might be reasonable to assume that they are random drawings from
a probability distribution and the Swamy random coefficient model could be
used. This model uses the additional information provided by the "random
assumption," and so, if the assumption is reasonable, the estimates should be
more efficient.
Wherever possible, any dependence of the coefficients on the explanatory
variables should be modeled explicitly, and, as an example, Mundlak (l978b)
considers estimation of the special case where E[(3;] is linearly related to the
mean vector of the explanatory variables for the ith individual.
The "heterogeneity bias" that arises when least squares is applied to a model
with random coefficients correlated with the explanatory variables has been
considered further by Chamberlain (1982). He demonstrates that, as in the case
of the error components model, heterogeneity bias is reflected by a full set of
lags and leads in the regression of Yi/ on Xkil, Xki2,. . . ,XkiT; conversely, under
certain conditions, evidence of a contemporaneous relationship only is indica-
tive of no heterogeneity bias. Based on the minimum mean square error linear
predictor of Yi/ given Xkil, Xki2, . . ,XkiT, Chamberlain suggests an estimation
technique which is particularly useful when T is small and N is large. If the
coefficient estimates tend to exhibit a certain structure, then heterogeneity bias
can be attributed to the correlation between the explanatory variables and a
random intercept only. On the other hand, no evidence of this structure is an
indication of correlation between the explanatory variables and all the random
coefficients.
Under the assumptions that E[ eief] = IT;I and E[ eie;] = 0 for i 0/= j, a test for
the null hypothesis that the variable coefficients and the explanatory variables
are uncorrelated has been provided by Pudney (1978). It is based on the sample
Inference in Models That Combine Time Series and Cross-Sectional Data 545
covariance between the b i = (Xl X;)-'X; Yi and the means of the explanatory
variables for each individual, xl = j ;X)T, where iT = (1, 1, . . . , 1)'. The
sample covariance is given by
(13.5.20)
(13.5.21)
(13.5.22)
Just as it was possible to extend the error components model with individual
effects to the Swamy random coefficient model where all coefficients could
vary over individuals, it is possible to extend the error components model, with
individual and time effects, so that all coefficients have a component specific to
an individual and a component specific to a given time period. This model can
be written as
Yit = L
k~l
(ih + fJ-ki + Akt)Xkit + eit, 1=1,2,. ,N
t = 1,2, . . . , T (13.6.1)
546 Some Alternative Covariance Structures
where f3kil = 13k + ILki + Akl measures the response of the dependent variable to
the kth explanatory_variable for individual i in time period t. Each coefficient
consists of a mean 13k. an individual specific component ILki and a time specific
component Akl'
As in the previous models, it is possible to assume that the ILki and Akl are
either fixed parameters to be estimated or random variables. We will generally
assume that they are random variables but, when convenient, we will indicate
how they could be estimated as fixed parameters. Also, as Pudney (1978) points
out, sometimes it may be reasonable and practical to assume that one of the
components is fixed and the other random. For example, if T is small and N is
large, and both components are random, any estimate of the variance of Akl is
likely to be unreliable. In such circumstances it would be preferable to treat the
Akl as fixed, include appropriate dummy variables, and thus use inference con-
ditional on the Akl in the sample. If, in Equation 13.6.1, the time effects are
removed by including dummy variables in the set of explanatory variables, the
model becomes identical to the Swamy random coefficient model.
Before outlining how the parameters in Equation 13.6.1 can be estimated, we
briefly mention some other models where the coefficients vary over time and
individuals.
One alternative to (13.6.1) is the model
We assume XliI = 1 and so, in this case, the disturbance elil replaces the distur-
bance eil of the earlier model. In addition, the time effects Akl have been re-
placed by a random component ekil which is not restricted to be the same for all
individuals in a given time period. The estimation of this model has been
discussed by Swamy and Mehta (1975, 1977); Pudney (1978); and Singh and
Ullah (1974). The most general set of random assumptions are those adopted
by Swamy and Mehta. In addition to all the disturbances having zero mean,
they assume that
, {d0
E[f.tif.tj] =
if i = j.
otherwise (13.6.3)
and
, {d
0
E[ eil ej,] =
ii if i = j and t = s
otherwise (13.6.4)
where f.ti = (ILIi, IL2i' . . ,ILKi)' and eil = (elit, e2il, . , eKil)', and both d
and the d ii are not restricted to be diagonal. Pudney assumes that d ii = de for
i = I, 2, . . . ,N while Singh and Ullah rewrite the model as
Inference in Models That Combine Time Series and Cross-Sectional Data 547
treat the 13ki as fixed parameters, and assume that E[eitelt] = Ilij with Ilij di-
agonal.
The above models all assume that the random coefficients vary around a
constant mean. Models where the coefficients also change systematically over
time have been analyzed by Rosenberg (l973c), Johnson and Rausser (1975),
Harvey (1978), and Liu and Hanssens (1981).
Estimation and hypothesis testing for the model in Equation 13.6.1 have been
discussed by Hsiao (1974, 1975); in this section, we outline some of his proce-
dures. A special case of the Hsiao model where random coefficients are only
associated with time invariant and individual invariant variables has been stud-
ied by Wansbeek and Kapteyn (1978, 1979, 1982).
For the ith individual, Equation \3.6.1 can be written as
-
Y =
I xa-
It' + XII. + Z"
1,-1 ll\. + e" (13.6.6)
Zi
(T x TK)
and X;t = (Xlii, X2it, . . . , XKit). Hsiao assumes that E[e;] = 0, E[f.ti] = 0,
E[A t ] = 0, E[eief] = U';I, and E[eieJ] = 0 for i 0/= j; E[f.tif.til = Il and E[f.tif.tJ] =
o for i 0/= j; and E[AtA;] = A and E[AtA;] = 0 for t 0/= s. Furthermore, f.ti' At.
and ei are all uncorrelated, and A and Il are diagonal with diagonal elements
given by CXk and lh, respectively.
Rewriting (13.6.6) to include all NT observations yields
Y = X~ + Zf.t + iA + e (13.6.7)
where y' = (Y;, Yz, . . . ,YN), X' = (X;, X z, . . . ,XN), Z is block diagonal
with Xi as the ith diagonal block, i ' = (i;, i z, . . . , iN). f.t' = (f.t;, f.tz,
') ,ande' = ("
. . . , f.tN ')
el, ez, . . . ,eN'
548 Some Alternative Covariance Structures
If fl. and A. are regarded as fixed parameters, (3, fl., and A. can be estimated by
applying least squares to (13.6.7), provided that we reparameterize to eliminate
redundant parameters, and provided that NT is sufficiently large. The matrix
[X, Z, Z] is of dimension (NT x (T + N + l)K), and of rank (T + N - l)K, and
so 2K parameters are redundant. For estimation a convenient approach would
be to drop (IL IN, 1L2N, . . . ,IL KN, AIT, A2T, . . . ,A KT), eliminate the corre-
sponding columns in Z and Z, and redefine the other parameters accordingly.
Note that we are assuming NT > (T + N - l)K.
where C = [Z'Z + (IT A~I)]~I and G = [Z'Z + (IN d~I)]~I. In this ex-
pression the largest order of inversion is reduced to max {NK,TK}, but this
could still be quite large. Except for a special case [Wansbeek and Kapteyn
(1982)], a convenient transformation matrix P such that P' P = c<l>~ I, where cis
a scalar, does not seem to be available.
If one is interested in predicting the random components associated with
each individual, then, following Lee and Griffiths (1979), the predictor
(13.6.9)
Yil = 2:
k~1
{3ki X kil + Vii
(13.6.10)
where {3ki = 13k + ILki and Vii = ~f=1 AklXkit + eil' The variance of ViI is
K
Oit = E[vtl] = 2:
k~1
XLak + (T~
(13.6.11)
(13.6.12)
where.:t is Xi with each of its elements squared and a = (al + (T~, a2, . . . ,
aK)' .
Let ei = Yi - Xibi be the residual vector from the least squares estimator, bi =
(Xf XJ-IXfYi, applied to the ith cross section and let ~i be the squares of these
residuals. Then,
(13.6.13)
E[~] = Fa (13.6.14)
where the et are the residuals obtained by applying least squares separately to
each time period. From a, ~, and a-; we can derive estimates for 81 and al.
Hsiao (1974) gives sufficient conditions for the consistency and asymptotic
efficiency of the estimated generalized least squares estimator for p which
depends on the above variance estimates, and he also indicates how one can
test for the constancy of coefficients over time or individuals. Some of Hsiao's
asymptotic results have been corrected and extended by Kelejian and Stephan
(1983).
550 Some Alternative Covariance Structures
Various refinements of the above technique are available and are discussed in
Chapters 11 and 19. One problem is that some ofthe variance estimates may be
negative and, in this case, the easiest solution is to change the negative esti-
mates to zero.
In empirical work the Hsiao model does not seem to have gained the same
widespread acceptance that some of the other models have achieved. This may
be because the estimation procedures are not easily handled on standard com-
puter packages, because the assumptions are not considered realistic, or be-
cause we seldom have data sets where both Nand T are sufficiently large. It
would be preferable to relax the assumption that Ll and A are diagonal and, as
Hsiao notes, in principle this can be handled within the same estimation frame-
work. However, it would be computationally more difficult and the problem of
A and ~ being nonnegative definite would be magnified. Also, if there are a
sufficient number of time series observations, it might be preferable to drop the
Akr and model the "time effects" with an autocorrel~ted eit [see Swamy (1974)].
13.7 RECOMMENDATIONS
if N is small, it is unlikely that fT~ for the error components model or Afor
the Swamy random coefficient model will be very reliable. Consequently,
the estimated generalized least squares estimators for the slope coeffi-
cients are also likely to be unreliable and, for estimation, it may be better to
treat the coefficients as fixed even when the random assumption is reason-
able.
At various points in the chapter we gave a number of tests that could be used
to help us choose between models. In general, for model choice, we recom-
mend that applied workers combine ajudicious use of some of these tests with
their a priori answers to the above four questions.
13.8 EXERCISES
Exercise 13.1
Prove that the matrices Q, QI, Q2, and Q3 are idempotent. See Equations
13.3.17, 13.4.6, 13.4.22, and 13.4.23 for the definitions.
Exercise 13.2
Use results on the partitioned inverse of a matrix to derive the estimators given
in Equations 13.3.7, 13.3.16, 13.4.5, and 13.4.24.
Exercise 13.3
Show that the inverse of <l>i = <T~jd~ + <T;/r is
Exercise 13.4
Show that P; Pi = <T;<I>i I, where <l>i- I is given in Exercise 13.3,
Pi = Ir - ajd~/T
Exercise 13.5
Let A = IN D T Show that (Section 13.3.2c):
1. e = Ay - AXsbs = (A - AXs(X;AXs)-IX;A)e.
2. A has rank N(T - 1).
3. A - AX,(X;AXs)-IX;A is idempotent of rank N(T - 1) - K I.
Exercise 13.6
In Chapter 8 (Equation 8.3.4) we showed that a general expression for a best
linear unbiased predictor is
See Equations 8.3.2 and 8.3.3 for the relevant definitions. By appropriately
defining y, X, and V for each case, derive the BLUP predictors given in Sec-
tions 13.3.2b, 13.4.2b, 13.5.1b, and 13.6.1a.
Exercise 13.7
Consider the following error components model,
-
Yj/ = f31 + f32 Xj/ + f-tj + ej/, i = 1,2,3, 1 = 1,2,3,4
where the assumptions of Section 13.3.2 hold and we have the following data:
1= 1 i = 2 i = 3
Y x Y x Y X
Exercise 13.8 ,
Prove that the estimator Ll (Equation 13.5.14) is unbiased.
For the remaining problems we recommend generating Monte Carlo data con-
sisting of 50 samples from an error components model with individual effects.
Specifically, we suggest
1=1,2,3,4
Yj/ = f31 + f32 Xjt + f-tj + ej/,
1=1,2, . . . ,10 (13.8.1)
where f31 = 10 and f32 = 1; the f-tj are i.i.d. N(O, 14); and the ej/ are i.i.d. N(O, 16).
Inference in Models That Combine Time Series and Cross-Sectional Data 553
Exercise 13.9
Calculate the variance of the following estimators for f32:
1. The LS estimator h2
2. The GLS estimator {32'
3. The dummy variable estimator h 2
Exercise 13.11
-
For the same five samples compute values for the estimators h2' and h2' and f32,
.
where ~2 is the EGLS estimator for f32 that uses the variance estimates obtained
in Exercise 13.10. Note that hl may already have been calculated in Exercise
13.10. Compare the estimates with each other and the true parameter values.
Exercise 13.12
For each of the five samples,
Exercise 13.13
For the same five samples compute estimates for ILl, ILl, IL3, and IL4 from
Exercise 13 .14
Incorrectly assume that the data are generated from the Swamy random coeffi-
cient model with the assumptions employed in Section 13.5.1. For each of five
samples, find ~, defined in Equations 13.5.14, and .i = Sb/(N - 1), where Sb is
defined in (13.5.15). Comment on the results and note how many times ~ is not
nonnegative definite.
Exercise 13.15
For
__
the same five samples compute ,,-
values for the two88
EGLS estimators
== for
({3., {32)' which use, respectively, ~ and ~. Call them ({3., {32)' and ({3., {32)' and
comment on the values.
Exercise 13.16
- .
Use the 50 samples to estimate the mean square errors (MSEs) of the estima-
tors b 2 , b 2 , an_d {32. With respect to the answers in Exercise 13.9 are the MSE
estimates of b 2 and b 2 accurate? Comment on, and suggest reasons for, the
relative efficiencies of b2 , b 2 , ~2.
Exercise 13.17
Use the 50 samples to estimate the MSEs of the prediction errors for the two
predictors of fJ-. , fJ-2, fJ-3, and fJ-4 given in Exercise 13.3. Comment on the relative
efficiency.
Exercise 13.18
Using all 50 samples, comment on the proportion of times each of the null
hypotheses in Exercise 13.12 were rejected.
Exercise 13.19
~rom the results of all 50 samples, estimate the MSEs of the estimators i32 and
i32 given in Exercise 13.15. Based on the results and those in Exercise 13.16:
13.9 REFERENCES
IN FERENCE IN
SIMULTANEOUS
EQUATION MODELS
Specification and
Identification in
Si mu Itaneous
Equations Models
14.1 INTRODUCTION
ables, and what this implies for the statistical model and our basis for parameter
estimation and inference.
Up to this point we have lived in the simple world of one-way causality. That
is, we have assumed that the explanatory variables in the design matrix have an
impact on the values of the left-hand side random variable y, but there is no
feedback in the other direction. While this model is appropriate for describ-
ing the data generation process in many circumstances, it is inconsistent with
many others. In particular, it is usually inappropriate when modeling the gener-
ation of price and quantity data in one or more markets, since we know that the
values of these variables are determined jointly or simultaneously by supply
and demand. Another example arises if we consider a set of macroeconomic
variables whose values are jointly determined by the equilibrium of a set of
macroeconomic behavioral relations. The specification of simultaneous eco-
nomic models is well documented in the economic and econometric literature
and for a discussion of and motivation for the use of simultaneous equations
models, the reader is encouraged to read Haavelmo (1943), Girshick and
Haavelmo (1953), Marschak (1953), and Koopmans (1953). For an introductory
discussion of the simultaneous equation statistical model and the problem of
identification, see Chapter 12 in Judge et al. (1982).
The remainder of this chapter is devoted to formulating the simultaneous
equation model as well as discussing basic issues. In particular, in Section 14.2
the statistical model is presented; the consequences of using ordinary least
squares (LS) as an estimation rule for an equation in a simultaneous equation
system are developed in Section 14.3. Section 14.4 contains a discussion of the
reduced form of a simultaneous equation system and the use of LS for its
estimation. In Section 14.5 the concept of identification is presented and Sec-
tion 14.6 contains concluding remarks.
To introduce the idea of simultaneity we must pay special attention to how the
economic variables are classified. Basically, each equation in the model may
contain the following types of variables:
Endogenous, or jointly determined variables, have outcome values deter-
mined through joint interaction with other variables within the system. Exam-
ples, within the context of a partial or general equilibrium system, include such
variables as price, consumption, production, and income.
Exogenous variables affect the outcome of the endogenous variables, but
their values are determined outside the system. The exogenous variables thus
are assumed to condition the outcome values of the endogenous variables but
are not reciprocally affected because no feedback relation is assumed. Exam-
ples in this category are weather-related variables, such as rainfall and temper-
ature, and the world price of a commodity for a country model that involves
only a minor part of the production or consumption of the commodity. Lagged
Specification and Identification in Simultaneous Equations Models 565
1
Consequences of LS Consistent estimation of
estimation of reduced form parameters
structural parameters (Section 14.4)
(Section 14.3) I
Derivation of structural
parameters from reduced
form parameters
Identification
(Section 14.5)
1
I.
BasIc
1
Identification of
1
Identification of
1
General
concepts the linear systems that are methods
(Section 14.5.1) model linear in parameters of
(Section 14.5.2) but nonlinear in identification
variables (Section 14.5.4)
(Section 14.5.3)
abIes by the (T x 1) vectors YI, Y2, . . . ,YM; the K exogenous and predeter-
mined variables by the (T x 1) vectors XI. X2, . . . ,XK; and the M random
error variables by the (T x 1) vectors eJ, e2, . . . , eM. A general linear
statistical model reflecting the M equations that represent the relationships
among the jointly endogenous variables, the exogenous and predetermined
variables, and the random errors, may be stated as
YI 'YIM + Y2'Y2M +
(14.2.1)
where the 'Y's and the f3's are the structural parameters of the system that are
unknown and are thus to be estimated from the data. In matrix notation the
linear statistical model may be written compactly as
Yf + XB + E = 0 (14.2.2)
and
are the sample values ofthe jointly dependent and the predetermined variables,
respectively, and
eTM (T x M)
568 Inference in Simultaneous Equation Models
is the matrix of unobservable values of the random error vectors. The matrix
f31 J
f321
B=
f3KM (K x M)
and
for I = I, 2, . . . , M (14.2.9)
and
or, compactly, as
E = ~ Ir (14.2.12)
X'X
lim - T =
T~oo
+xx (14.2.13)
(14.2.14)
is finite for k = I, 2, . . . .
(14.2.15)
(14.2.16)
570 Inference in Simultaneous Equation Models
are greater than one in absolute value. This assumption is motivated in greater
detail in Chapter 16.
Two important consequences of these assumptions, that will be used later,
are that
plim(E'EIT) = ~ (14.2.17)
and
plim(X'E/T) = o. (14.2.18)
Assumptions 3 and 4 also imply that plim X'XIT is finite and nonsingular, even
under the presence of lagged endogenous variables. For a proof of (14.2.18),
see Schmidt (1976, p. 123).
The least squares estimator is biased and inconsistent for the parameters of a
structural equation in a simultaneous equation system. To see this, consider the
ith equation in (14.2.1) written as
yr; + XB i + ei = 0 (14.3.1)
Some elements of Γ_i and B_i are generally known to be zero; also it is customary to select one endogenous variable to appear on the left-hand side of the equation. This is called selection of a normalization rule and is achieved by setting one coefficient, say γ_ii, to the value −1. Then, with rearrangement if necessary,

$$y_i = Y_i\gamma_i + Y_i^*\gamma_i^* + X_i\beta_i + X_i^*\beta_i^* + e_i = Y_i\gamma_i + X_i\beta_i + e_i$$

since γ_i* = 0 and β_i* = 0, or compactly,

$$y_i = Z_i\delta_i + e_i \qquad (14.3.2)$$

where

$$\Gamma_i = \begin{bmatrix} -1 \\ \gamma_i \\ 0 \end{bmatrix}, \qquad B_i = \begin{bmatrix} \beta_i \\ 0 \end{bmatrix}, \qquad \delta_i = \begin{bmatrix} \gamma_i \\ \beta_i \end{bmatrix}$$

$$Y = [y_i \; Y_i \; Y_i^*], \qquad X = [X_i \; X_i^*], \qquad Z_i = (Y_i \; X_i)$$
The least squares estimator of δ_i is

$$\hat\delta_i = (Z_i'Z_i)^{-1}Z_i'y_i = \delta_i + (Z_i'Z_i)^{-1}Z_i'e_i \qquad (14.3.3)$$

It has expectation

$$E[\hat\delta_i] = \delta_i + E[(Z_i'Z_i)^{-1}Z_i'e_i] \qquad (14.3.4)$$

The last term in (14.3.4) does not vanish because Z_i contains endogenous variables that are jointly determined with y_i and thus not independent of e_i. Consequently, E[$\hat\delta_i$] ≠ δ_i and $\hat\delta_i$ is biased.
Furthermore,

$$\text{plim}\,\hat\delta_i = \delta_i + \text{plim}(Z_i'Z_i/T)^{-1}\,\text{plim}(Z_i'e_i/T)$$

and the last term, plim(Z_i'e_i/T), does not converge to the zero vector because Z_i = (Y_i X_i) contains Y_i, which is not independent of the error vector e_i, a result that obtains no matter how large the sample size. Therefore, the use of the least squares rule on a structural equation containing two or more endogenous variables will yield biased and inconsistent parameter estimators.
The reduced form errors are related to the structural errors by

$$v_t' = -e_t'\Gamma^{-1} \qquad (14.4.4)$$

Since the vectors e_t have mean E(e_t) = 0 and covariance matrix E(e_te_t') = Σ, it follows that E(v_t) = 0 and

$$E(v_tv_t') = (\Gamma^{-1})'\Sigma\Gamma^{-1} = \Omega \qquad (14.4.6)$$

If X contains no lagged endogenous variables, then E(X'V) = E(−X'EΓ⁻¹) = 0, so that E(X'v_i) = 0. If X contains lagged endogenous variables, then

$$\text{plim}(X'V/T) = \text{plim}(-(X'E/T)\Gamma^{-1}) = 0$$
[see (14.2.18)]. In the former case, using standard proofs, the least squares estimator of the reduced form parameters for the ith equation,

$$\hat\pi_i = (X'X)^{-1}X'y_i \qquad (14.4.7)$$

or, for all M equations jointly,

$$\hat\Pi = (X'X)^{-1}X'Y \qquad (14.4.8)$$

is unbiased and consistent, and in the latter case it is consistent. Note that joint
estimation of the M reduced form equations in a seemingly unrelated regression
format does not lead to an increase in efficiency in this case, even though
E(v_iv_j') = ω_ijI, since the matrix of explanatory variables for each equation (i.e., X) is identically the same.
We conclude that while OLS is not a consistent way to estimate the struc-
tural parameters, the reduced form parameters can be consistently estimated
by least squares. This result is useful for several reasons. First, in cases where
only predictions are desired, the estimated reduced form coefficients provide a
basis for making those predictions. Specifically, if Xo is a (To x K) matrix of
values for the predetermined variables for which predictions of the endogenous
variables are desired, then

$$\hat Y_0 = X_0\hat\Pi \qquad (14.4.9)$$

Second, the estimated reduced form coefficients provide a basis for deriving estimates of the structural parameters, since the reduced form and structural parameters are related by

$$\Pi = -B\Gamma^{-1} \qquad (14.4.10)$$
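In contrast to the structural equations, the reduced form is estimated consistently by least squares, and (14.4.8)-(14.4.9) are immediate to compute. The sketch below (same hypothetical system and illustrative values as in the earlier sketch) estimates Π̂ = (X'X)⁻¹X'Y and forms predictions X₀Π̂.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, beta, alpha = 0.5, 1.0, 0.4
d = 1.0 - gamma * alpha
T = 100_000
x = rng.normal(size=T)
e1, e2 = rng.normal(size=T), rng.normal(size=T)
y1 = (beta * x + e1 + gamma * e2) / d
y2 = (alpha * beta * x + alpha * e1 + e2) / d

X = x.reshape(-1, 1)                        # (T x K) predetermined variables
Y = np.column_stack([y1, y2])               # (T x M) endogenous variables
Pi_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # (14.4.8)
print("Pi_hat =", Pi_hat.ravel())           # approx [beta/d, alpha*beta/d]

X0 = np.array([[1.5]])                      # new values of predetermined vars
print("Y0_hat =", X0 @ Pi_hat)              # predictions, as in (14.4.9)
```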
14.5 IDENTIFICATION
14.5.1 Basics
Certain concepts can be stated that will apply not only to the model developed
earlier in this chapter, but also to more general models that will appear later on.
Let y be a vector of observable random variables. A structure S is a complete specification of the probability density function of y, say f(y). The set of all possible structures, 𝒮, is called a model. The identification problem then is that of making judgments about S, given 𝒮 and the observations y. To make matters more precise, assume y is generated by the parametric probability density f(y|S) = f(y|α), where α is a K-dimensional real vector. The function f is assumed known but α is not. Hence a structure is described by a point in ℝ^K and a model by a set of points in ℝ^K.
In general it is said that two structures S = α and S* = α* are observationally equivalent if f(y|α) = f(y|α*) for all y. Here, however, we will view two structures as observationally equivalent if they produce identical first and second moments of f. For the normal distribution, recall that all information is contained in the first and second moments. Furthermore, the structure S⁰ = α⁰ in 𝒮 is globally identified if there is no other feasible α that is observationally equivalent.

Since a set of structures is simply a subset of ℝ^K, it is possible that there may be a number of observationally equivalent structures, but that they are isolated from one another. Consequently, a structure S⁰ = α⁰ is locally identified if there exists an open neighborhood containing α⁰ such that no other α in the open neighborhood is observationally equivalent to α⁰.
14.5.2 Identification of the Linear Model

Applying these notions to the simultaneous equations model, write the tth observation of the system as

$$y_t'\Gamma + x_t'B + e_t' = 0 \qquad (14.5.1)$$

where y_t', x_t', and e_t' are the tth rows of Y, X, and E, respectively. Suppose e_t has density p(e_t|Σ); then the joint density of e₁, e₂, . . . , e_T is

$$\prod_{t=1}^T p(e_t|\Sigma) \qquad (14.5.2)$$

and the joint density of the observations is

$$p(y_1, \ldots, y_T|\Gamma, B, \Sigma, X) = \|\Gamma\|^T \prod_{t=1}^T p(e_t|\Sigma) = \|\Gamma\|^T \prod_{t=1}^T p(-y_t'\Gamma - x_t'B \mid \Sigma) \qquad (14.5.3)$$

where ‖Γ‖ denotes the absolute value of the determinant of Γ.
Now consider the transformed structure obtained by postmultiplying (14.5.1) by a nonsingular (M × M) matrix F,

$$y_t'\Gamma F + x_t'BF + e_t'F = 0 \qquad (14.5.4)$$

Since w_t' = e_t'F, the density of w_t is p(w_t|F'ΣF) (14.5.5), and the joint density of the observations under the transformed structure is

$$\|\Gamma F\|^T \prod_{t=1}^T p(w_t|F'\Sigma F) = \|\Gamma\|^T\|F\|^T\|F\|^{-T} \prod_{t=1}^T p(e_t|\Sigma) = \|\Gamma\|^T \prod_{t=1}^T p(e_t|\Sigma) \qquad (14.5.6)$$
the joint density, which is identical to (14.5.3), determined from the original structure. Hence (14.5.1) and (14.5.4) are observationally equivalent. Note that by letting F = Γ⁻¹, one observationally equivalent structure is the reduced form of the simultaneous equation system. Consequently, all structures that could be created by postmultiplying the original structure by a nonsingular matrix have the same reduced form.
Given that we are defining observational equivalence by the first two moments of p, the conditions for observational equivalence can be stated as follows: Two structures S = (Γ, B, Σ) and S* = (Γ*, B*, Σ*) are observationally equivalent if and only if there exists a nonsingular matrix F such that

$$\Gamma^* = \Gamma F, \qquad B^* = BF, \qquad \text{and} \qquad \Sigma^* = F'\Sigma F$$
An admissible transformation F leaves the ith equation unchanged, apart from a change of scale, only if its ith column f_i has the form

$$f_i = c_{ii}\,j_i = (0 \;\; \cdots \;\; 0 \;\; c_{ii} \;\; 0 \;\; \cdots \;\; 0)'$$

That is, the ith column of F must be a scalar multiple of the (M × 1) unit vector j_i whose elements are all zero except a one in the ith row. Note that when Γ and B are postmultiplied by such an F, their ith columns are changed only by a scalar multiple.
Extending this definition to the entire system, we have that the system is
identified if and only if all the equations are identified, or equivalently, that the
admissible transformation matrix is diagonal. In the instance where normalization rules are imposed, such as γ_ii = −1, the only nonzero element of the ith column must be 1.
To determine conditions under which the ith equation is identified, let
$$A = \begin{bmatrix} \Gamma \\ B \end{bmatrix}$$

so that a_i, the ith column of A, contains the parameters of the ith equation. Let R_i and r_i be J × (M + K) and (J × 1) matrices of constants, respectively, such that all prior information about the ith equation of the model, including the normalization rule, may be written as

$$R_ia_i = r_i \qquad (14.5.7)$$

For example, if the a priori information is that γ_{1i} = 0, γ_{2i} + γ_{3i} = 1, and γ_{ii} = −1, then (14.5.7) is

$$\begin{bmatrix} 1 & 0 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 1 & \cdots & 0 \end{bmatrix} a_i = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix} \qquad (14.5.8)$$

where the single 1 in the third row appears in the ith position.
If (14.5.7) is the only restriction available, then we can make the following observation: We know that the ith column of AF is Af_i, where f_i is the ith column of F. Since F is admissible, the resulting structural coefficients satisfy the restrictions R_i(Af_i) = r_i. However, for the ith equation to be identified, f_i should be equal to j_i, the unit vector whose elements are zero except a one in the ith row. Indeed, j_i is a solution of R_iAf_i = r_i because R_iAj_i = R_ia_i = r_i by (14.5.7). Therefore, for (R_iA)f_i = r_i to have the unique solution f_i = j_i, it is required that Rank(R_iA) = M. Now we can formally state the following theorem:

THEOREM
A necessary and sufficient condition for the identification of the ith equation is that Rank(R_iA) = M.
Note that if the only restrictions are zero coefficient restrictions, then r_i = 0 and the rank condition reduces to Rank(R_iA) = M − 1, so that the homogeneous equation R_iAf_i = 0 has a unique solution up to scalar multiplication. By normalization the ith equation is identified. A necessary but not sufficient condition to achieve the rank condition can be stated in the following corollary:

COROLLARY
A necessary condition for the identification of the ith equation is that the number of restrictions be at least M − 1, that is, J ≥ M − 1 (the order condition).
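Checking the rank and order conditions is a mechanical computation once R_i and A are written down. A minimal sketch for a hypothetical three-equation model with zero restrictions (so r_i = 0 and the required rank is M − 1) might look as follows; all parameter values are illustrative.

```python
import numpy as np

# Hypothetical structure: M = 3 equations, K = 2 predetermined variables.
Gamma = np.array([[-1.0, 0.3, 0.0],
                  [0.5, -1.0, 0.2],
                  [0.0, 0.4, -1.0]])
B = np.array([[1.0, 0.0, 0.3],
              [0.0, 0.7, 0.0]])
A = np.vstack([Gamma, B])            # A = (Gamma' B')' is (M + K) x M
M = Gamma.shape[1]

# Zero restrictions on equation 1: gamma_31 = 0 and beta_21 = 0, written
# as R1 a1 = 0 with unit-vector rows picking those elements of a1.
R1 = np.zeros((2, A.shape[0]))
R1[0, 2] = 1.0                       # row selecting gamma_31
R1[1, 4] = 1.0                       # row selecting beta_21
print("order condition (J >= M - 1):", R1.shape[0] >= M - 1)
print("rank condition (rank(R1 A) == M - 1):",
      np.linalg.matrix_rank(R1 @ A) == M - 1)
```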
It should be noted that restrictions on the system coefficients are not the only
type of information that can be used to identify equations. It is possible that
restrictions on the contemporaneous covariance matrix Σ can help reduce the number of admissible transformation matrices F. Let σ_i denote the ith column of Σ, which contains the variance of the ith equation error and the covariances of the ith equation error and the other equation errors. Also let the g linear restrictions on σ_i be

$$H_i\sigma_i = h_i \qquad (14.5.9)$$

For the transformed structure the same restrictions must hold,

$$H_i\sigma_i^* = h_i \qquad (14.5.10)$$

where σ_i* is the ith column of F'ΣF. Since the ith column of F'ΣF is F'Σf_i, the restrictions become

$$H_iF'\Sigma f_i = h_i \qquad (14.5.11)$$

Combining (14.5.11) with the coefficient restrictions R_iAf_i = r_i gives

$$\begin{bmatrix} R_iA \\ H_iF'\Sigma \end{bmatrix} f_i = \begin{bmatrix} r_i \\ h_i \end{bmatrix} \qquad (14.5.12)$$
Thus, for f_i to have the unique solution f_i = j_i, where j_i is the unit vector, it is necessary that

$$\text{Rank}\left\{\begin{bmatrix} R_i & 0 \\ 0 & H_iF' \end{bmatrix}\begin{bmatrix} A \\ \Sigma \end{bmatrix}\right\} = M \qquad (14.5.13)$$

Since the above conditions must hold for all admissible F, including F = I, we can generalize the rank and order conditions as follows:

$$\text{Rank}\begin{bmatrix} R_iA \\ H_i\Sigma \end{bmatrix} = M \qquad (14.5.14)$$
Note that the generalized rank condition is necessary but not sufficient. For sufficiency, it is required that the unique solution of f_i in (14.5.12) be j_i. This requires that H_iF'Σj_i = h_i hold for every admissible F, in addition to the already known result R_iAj_i = r_i.
To form a useful sufficiency condition, we consider the case where the only restrictions on Σ are zero restrictions, so that h_i = 0 and H_i consists of only unit vectors. We also rearrange all the structural equations so that the ith equation is the first equation, and the equations whose errors are uncorrelated with the first equation error are arranged to appear at the end of the system. In this way, H_i becomes H_i = (0 I), and Σ may be partitioned according to H_i into

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \qquad (14.5.15)$$

where the first row of Σ₁₂ and the first column of Σ₂₁ contain all zeros. The admissible F matrix is also partitioned into

$$F = \begin{bmatrix} F_{11} & F_{12} \\ F_{21} & F_{22} \end{bmatrix} \qquad (14.5.16)$$

The condition H_iF'Σj₁ = 0 then becomes

$$(0 \;\; I)\begin{bmatrix} F_{11}' & F_{21}' \\ F_{12}' & F_{22}' \end{bmatrix}\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} j_1 = 0 \qquad (14.5.17)$$

or

$$(F_{12}'\Sigma_{11} + F_{22}'\Sigma_{21})\,j_1 = 0 \qquad (14.5.18)$$
For this result to hold, the first column of F₁₂'Σ₁₁ + F₂₂'Σ₂₁ must be zero. In view of the partitioning, the first column of Σ₂₁ is zero, and hence we require the first column of F₁₂'Σ₁₁ to be zero. (This condition alone is another necessary condition for identification.) The first column of F₁₂'Σ₁₁ will be zero when F₁₂ = 0. In words, F₁₂ = 0 means that every equation whose error is uncorrelated with the first equation error is identified with respect to every equation whose error is correlated with the first equation error. Therefore, we can state a sufficiency theorem as follows:
THEOREM
A sufficient condition for the ith equation to be identified is that the generalized rank condition (14.5.14) holds and that every equation whose error is uncorrelated with the ith equation error is identified with respect to every equation whose error is correlated with the ith equation error.

COROLLARY
A sufficient condition for the identification of the ith equation is that the generalized rank condition (14.5.14) holds and that the ith equation error is uncorrelated with the errors of all the remaining equations, each of which is identified.

Note that if all restrictions are homogeneous, so that r_i = 0 and h_i = 0, then the rank of

$$\begin{bmatrix} R_i & 0 \\ 0 & H_i \end{bmatrix}\begin{bmatrix} A \\ \Sigma \end{bmatrix}$$

must be M − 1, and the order condition requires J + g ≥ M − 1. When there are no restrictions on the covariances, then g = 0, and the rank and order conditions reduce to those of R_iA and R_i as previously stated.
An example of zero restrictions on covariances is a recursive system, in which Γ is a triangular matrix and Σ is a diagonal matrix. In this case, for every structural equation there are M − 1 zero covariances; therefore, g = M − 1 and J ≥ 1 (at least the normalization). Thus g + J ≥ M and the generalized rank condition holds for every equation. Also, because Γ is a triangular matrix, the M − 1 zero restrictions on the coefficients in Γ₁ identify the first equation, which together with the generalized rank condition recursively identifies the subsequent equations.
Generalizing the idea of recursiveness, Hausman and Taylor (1983) show
that, within the context of a single equation, covariance restrictions aid identifi-
cation if and only if they imply that a set of endogenous variables is predeter-
mined in the equation of interest. Although they do not establish necessary and
sufficient conditions for identification within the context of the complete sys-
tem, they show that if the system of equations is identifiable as a whole,
covariance restrictions cause residuals to behave as instruments in the sense
that the full system estimator can be treated as an instrumental variable estima-
tor where the instruments include residuals uncorrelated with the equation
error due to the covariance restrictions.
The identification of linear functions of the structural parameters may be analyzed as follows. The reduced form relation Π = −BΓ⁻¹ implies

$$\Pi\Gamma + B = (\Pi \;\; I_K)\begin{bmatrix} \Gamma \\ B \end{bmatrix} = 0 \qquad (14.5.19)$$

or, vectorizing,

$$[I_M \otimes (\Pi \;\; I_K)]\,d = 0 \qquad (14.5.20)$$

or

$$\Phi d = 0 \qquad (14.5.21)$$

where d = vec(A) and Φ is the (MK × (M² + MK)) matrix on the left-hand side of (14.5.20). Now assume that all prior information, including normalization rules and possible cross-equation parameter constraints, may be written

$$Rd = r \qquad (14.5.22)$$

Stacking (14.5.21) and (14.5.22) gives the complete set of conditions on d,

$$Wd = \begin{bmatrix} \Phi \\ R \end{bmatrix} d = \begin{bmatrix} 0 \\ r \end{bmatrix} \qquad (14.5.24)$$

so that d is identified if and only if W has full column rank M² + MK (14.5.25). Since the null space of Φ is spanned by the columns of Q̄ = I_M ⊗ (I_M −Π')', the rank of W may be written

$$\text{Rank}(W) = MK + \text{Rank}(R\bar Q) \qquad (14.5.26)$$

so that d is identified if and only if Rank(RQ̄) = M². More generally, a linear parametric function c'd is identified if and only if

$$\text{Rank}\begin{bmatrix} W \\ c' \end{bmatrix} = \text{Rank}(W) \qquad (14.5.27)$$

For a proof, see Richmond (1974). The implication of (14.5.27) is that although an individual parameter is not identified when Rank(RQ̄) < M², the parametric function c'd may be identified if the condition (14.5.27) holds.
14.5.3 Identification of Models Linear in the Parameters but Nonlinear in the Variables

Simultaneous equations models that are linear in the parameters and nonlinear in the variables are frequently used in economics. The nonlinearities often arise because endogenous variables enter different equations in different forms. For example, an endogenous variable may appear in one equation while its natural logarithm, square, or reciprocal appears in another. In such models, identification rules developed for the simultaneous equations model that is linear in the parameters and variables do not apply, so it is important to determine appropriate identification conditions.

One approach is to employ the general local identification concepts for the
general nonlinear model of Rothenberg (1971) or Bowden (1973). These are
discussed in Section 14.5.4. Their conditions depend on the likelihood function,
and thus require the specification of a particular distribution for the errors.
Conditions for identification in nonlinear models under weaker distributional
assumptions about the errors have been developed by Fisher (1961, 1965,
1966), Kelejian (1971), and Brown (1983). Brown extends the work of Fisher
and develops necessary and sufficient conditions for identification in models
nonlinear in the variables. We will briefly present these conditions.
Write the tth observation for a simultaneous equation model that is nonlinear
in the variables as (where for notational convenience the t subscript is sup-
pressed and the model has been transposed)
$$A'q(y, x) = e \qquad (14.5.28)$$

where A is a matrix of unknown parameters and q(y, x) is an (N × 1) vector of known functions of the endogenous and exogenous variables. Suppose the prior restrictions on a_i, the ith column of A, are homogeneous,

$$R_ia_i = 0 \qquad (14.5.29)$$

Also define q̄(x) = E[q(G(e, x), x)|x], where G(e, x) denotes the reduced form solution for y and the expectation is taken with respect to the distribution of e given the true structural parameter values. Then let q̄ = q̄(x⁰) for any choice of x⁰.
Under the assumptions stated above, the ith equation is identified if and only if the rank of the partitioned matrix (q̄ ⋮ Q' ⋮ R_i') is N − 1. For the proof of this result see Brown (1983, p. 182). While the application of this condition is straightforward, determination of q̄(x) is conceivably difficult. Brown shows, however, that if the intercept is unrestricted, then an equivalent necessary and sufficient condition is that the rank of (Q' ⋮ R_i') be N − 2. Brown also presents an alternative form of the identification condition that relates his work to Fisher's, and shows why Fisher's results are sufficient but not necessary for identification. He presents a simple example as well.
14.5.4 General Methods of Identification

Rothenberg's (1971) conditions for local identification are based on the information matrix

$$I(\alpha) = -E\left(\frac{\partial^2 \ln f}{\partial\alpha\,\partial\alpha'}\right)$$

Suppose the prior information takes the form of constraints ψ(α) = 0, with Jacobian matrix Ψ(α) = ∂ψ(α)/∂α', and define

$$V(\alpha) = \begin{bmatrix} I(\alpha) \\ \Psi(\alpha) \end{bmatrix} \qquad (14.5.33)$$

THEOREM
Suppose α⁰ ∈ ℝ^K is a regular point of both Ψ(α) and V(α). Then α⁰ is locally identifiable if and only if V(α⁰) has full column rank K.
The meaning of "regular" in this context is that there exists an open neigh-
borhood of aO where the matrix (or matrices) in question has (have) constant
rank. Global identification results are more difficult to obtain and the reader is
referred to Rothenberg for the appropriate theorems.
An important special case of the above result arises when there exist reduced form parameters; that is, when the probability density of y depends on α only through an r-dimensional reduced form parameter vector π. That is, there exist r known continuously differentiable functions π_i = h_i(α), i = 1, . . . , r, and a function f*(y|π) such that f(y|α) = f*(y|h(α)) = f*(y|π) for all y and α. Also, let 𝒜** be the image of 𝒜* under the mapping h; every π in 𝒜** is assumed to be identified. Letting H(α) = ∂h(α)/∂α', define

$$W(\alpha) = \begin{bmatrix} H(\alpha) \\ \Psi(\alpha) \end{bmatrix} \qquad (14.5.34)$$
If the functions h_i and ψ_i are linear, then α⁰ is globally identified if and only if the constant matrix W has rank K. Otherwise, if α⁰ is a regular point of W(α), then α⁰ is locally identified if and only if W(α⁰) has rank K. Again, global results are more difficult to obtain and the reader is referred to Rothenberg's paper. Rothenberg also specializes the results to the usual simultaneous equations models and derives conditions equivalent to those stated earlier in this chapter.
14.6 SUMMARY
In this chapter, we have specified the simultaneous equations model and stated
the usual assumptions associated with it. We showed that a single equation
from a simultaneous equations system cannot be estimated consistently using
LS. The reason is that the presence of an endogenous variable on the right-
hand side of the equation ensures that there will be contemporaneous, nonvan-
ishing correlation between the error term and the set of explanatory variables,
even as the sample size grows toward infinity. The reduced form equations, however, can be consistently estimated by LS because current endogenous variables do not appear on the right-hand side of those equations.
The notion of the identification problem was presented and necessary and
sufficient conditions for identification stated, both for a specific equation and
the whole system of equations. The complexities that arise when a system is
nonlinear, either in the variables or in the parameters, were also briefly dis-
cussed. The reader should be reminded that the identification problem is logi-
cally prior to any problems of estimation. If a structural parameter is not
identified, then no estimation technique exists that will allow its consistent
estimation. It should also be noted that the identification problem forces one to
recognize and deal with the limits of sample information alone. The theoretical
underpinning of the statistical model must be rich enough to satisfy the condi-
tions for identification before progress in learning about the structure can be
made. This interdependence between economic theory and statistical inference
is one of the fascinating aspects of simultaneous equations models.
For those interested in the historical development of the identification litera-
ture, the following references may be of interest. The rank and order condi-
tions, first stated as conditions on a submatrix ofthe reduced form coefficients,
stem from Koopmans (1949), Koopmans and Reiersol (1950), and Wald (1950).
Fisher (1959, 1963) generalizes these results and extended them to nonlinear
586 Inference in Simultaneous Equation Models
systems [Fisher (1961, 1965)]. Fisher (1966) summarizes the identification liter-
ature. Wegge (1965) and Rothenberg (1971) examine the identification of whole
systems, the former using the Jacobian matrix and the latter using the informa-
tion matrix. Following these works, Bowden (1973) shows the equivalence of
the identification problem to a nonlinear programming problem and Richmond
(1974) provides conditions for the identification of parametric functions. Kelly
(1971, 1975) provides results on the identification of ratios of parameters and
the use of cross-equation constraints. Gabrielsen (1978) and Deistler and Seifert (1978) independently investigate the relationship between identification and consistent estimation, and obtain conditions under which identifiability and consistency are equivalent almost everywhere. Hausman and Taylor (1983)
establish from a recursive system that an equation is identified if jointly deter-
mined endogenous variables become predetermined in the equation under con-
sideration. Sargan (1983) discusses cases where an econometric model linear in
the variables is identified, but where the estimators are not asymptotically
normally distributed. See also Hausman (1983) for a good summary of identifi-
cation.
Special problems of identification occur where autocorrelation or measurement error is present. If equation errors are autocorrelated, then the equation
error and lagged endogenous variables cannot be considered predetermined.
See Hatanaka (1975), Deistler (1976, 1978), and Deistler and Schrader (1979)
or Chapter 16 for discussions of identifiability and autocorrelated errors. The
effects of measurement error on identification can be found in Hsiao (1976,
1977), Geraci (1976), Aigner, Hsiao, Kapteyn, and Wansbeek (1981), and Hsiao
(1982).
One can approach identification and estimation from a Bayesian perspective.
As stated above, identification deals with the amount of nonsample information
required to combine with sample data to permit consistent estimation of struc-
tural parameters. In the context of the Bayesian approach the identification
problem may be viewed as whether the posterior distribution of the relevant
parameters has a unique mode, so that the set of parameters in question may
be uniquely estimated. See Zellner (1971), Kadane (1974), and Drèze (1974) for
a discussion of the Bayesian approach.
14.7 EXERCISES
Exercise 14.1
Find the reduced form equations from the following structural equations.
Exercise 14.2
Let a_i be the ith column of A. Define vec(A) = (a₁' a₂' ⋯ a_n')', where A has n columns.
Exercise 14.3
For the system YΓ + XB + E = 0 with reduced form equation Y = XΠ + V, show that Π = −BΓ⁻¹ and V = −EΓ⁻¹.
Exercise 14.4
Consider the model
Exercise 14.5
For the two-equation system in Exercise 14.1, suppose that the a priori restric-
tions on parameters are
Examine the order and rank conditions for each equation and the system as a
whole.
Exercise 14.6
Consider the following three-equation system:
Exercise 14.7
Examine the identification of each equation in the following model nonlinear in
variables and linear in parameters:
14.8 REFERENCES
15.1 INTRODUCTION
As noted in the previous chapter, economic data are often viewed as being
passively generated by a system of equations that are dynamic, simultaneous
and stochastic. In this chapter we consider the problems of estimation and
inference in the context of a structural equation system, assuming sufficient a
priori information exists to identify the system. We first review traditional
estimation procedures for both single equations within the system and the
system as a whole. We then present general classes of estimators that contain
the traditional estimators as special cases. While the large sample properties of
the traditional estimators are well known, it is only fairly recently that the small
sample properties have begun to be studied analytically. Some of this literature
is reviewed in Section 15.4. Procedures for testing hypotheses about structural
parameters are presented in Section 15.5, as well as tests for the orthogonality
between the disturbances and explanatory variables in a regression equation.
Following these discussions of basic material, sections are presented on effi-
cient estimation of reduced form parameters, methods for nonlinear systems,
Bayesian methods, and disequilibrium models. The structure of the chapter is
summarized in Table 15.1.
15.2 TRADITIONAL METHODS OF ESTIMATION

This section deals with the problem of estimating the parameters of the model

$$Y\Gamma + XB + E = 0 \qquad (15.2.1)$$

[Table 15.1, the chapter outline, lists among its entries: traditional estimation methods (Section 15.2.2), FIML (Section 15.2.2b), efficient estimation of reduced form parameters (Section 15.6), single equation methods for nonlinear systems (Section 15.7.1), and disequilibrium models (Section 15.9).]
Each structural equation, after imposing the normalization rule and the zero restrictions, may be written as

$$y_i = Z_i\delta_i + e_i, \qquad i = 1, 2, \ldots, M \qquad (15.2.3)$$

where Z_i = (Y_i X_i) and δ_i = (γ_i' β_i')', corresponding to the ith columns of Γ and B partitioned as Γ_i = (−1 γ_i' 0')' and B_i = (β_i' 0')' as in Chapter 14.
If all M equations of (15.2.3) are considered jointly, the statistical model can be written as the set of equations

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} Z_1 & & & \\ & Z_2 & & \\ & & \ddots & \\ & & & Z_M \end{bmatrix}\begin{bmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_M \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_M \end{bmatrix} \qquad (15.2.4)$$

or compactly

$$y = Z\delta + e \qquad (15.2.5)$$

where

$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix}, \quad Z = \text{diag}(Z_1, \ldots, Z_M), \quad \delta = \begin{bmatrix} \delta_1 \\ \vdots \\ \delta_M \end{bmatrix}, \quad e = \begin{bmatrix} e_1 \\ \vdots \\ e_M \end{bmatrix} \qquad (15.2.6)$$
One way to estimate δ_i recovers the structural parameters from the estimated reduced form via the relation ΠΓ_i = −B_i. Substituting Π̂ = (X'X)⁻¹X'Y and rearranging yields

$$X'y_i = X'Z_i\delta_i \qquad (15.2.7)$$

Solving (15.2.7) for δ_i, providing X'Z_i is square and nonsingular, yields the indirect least squares (ILS) estimator of δ_i,

$$\hat\delta_i(\text{ILS}) = (X'Z_i)^{-1}X'y_i \qquad (15.2.8)$$

The asymptotic covariance matrix of √T(δ̂_i(ILS) − δ_i) is

$$\sigma_{ii}\,\Phi_{Z_iX}^{-1}\Phi_{xx}\Phi_{XZ_i}^{-1} \qquad (15.2.9)$$

where Φ_{Z_iX} = plim(Z_i'X/T) = Φ_{XZ_i}' and Φ_{xx} = plim(X'X/T). When the ith equation is overidentified, X'Z_i is not square, and a generalized estimator is

$$\tilde\delta_i = (Z_i'XX'Z_i)^{-1}Z_i'XX'y_i$$

Provided that identification of the ith equation holds as well as some other fairly general conditions [see Dhrymes (1974, Section 4.5)],
$$\sqrt{T}(\tilde\delta_i - \delta_i) \xrightarrow{d} N\left(0,\; \sigma_{ii}(\Phi_{Z_iX}\Phi_{XZ_i})^{-1}\Phi_{Z_iX}\Phi_{xx}\Phi_{XZ_i}(\Phi_{Z_iX}\Phi_{XZ_i})^{-1}\right) \qquad (15.2.13)$$

For the just identified case, the limiting covariance matrix (15.2.13) is the same as (15.2.9) because Φ_{Z_iX} will be a square matrix. In finite samples, the unknown covariance matrix of δ̃_i can be estimated by replacing the Φ matrices by their sample counterparts.
A closely related estimator replaces Z_i by its projection on the predetermined variables,

$$\hat Z_i = [\hat Y_i \; X_i] = [X\hat\Pi_i \; X_i] = [X(X'X)^{-1}X'Y_i \; X(X'X)^{-1}X'X_i] = X(X'X)^{-1}X'Z_i \qquad (15.2.14)$$

The resulting two stage least squares (2SLS) estimator is

$$\hat\delta_i = [Z_i'X(X'X)^{-1}X'Z_i]^{-1}Z_i'X(X'X)^{-1}X'y_i = (\hat Z_i'\hat Z_i)^{-1}\hat Z_i'y_i = (\hat Z_i'Z_i)^{-1}\hat Z_i'y_i$$
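As a computational matter, (15.2.14) reduces 2SLS to two least squares fits. The following minimal sketch (the function name and interface are illustrative) computes δ̂_i for one equation.

```python
import numpy as np

def two_sls(y_i, Y_i, X_i, X):
    """2SLS for y_i = Y_i gamma_i + X_i beta_i + e_i = Z_i delta_i + e_i,
    where X is the full (T x K) matrix of predetermined variables.
    Following (15.2.14), Z_i is replaced by its projection on X."""
    Z = np.column_stack([Y_i, X_i])
    PX = X @ np.linalg.solve(X.T @ X, X.T)  # X(X'X)^{-1}X'
    Z_hat = PX @ Z
    # (Z_hat'Z)^{-1} Z_hat'y_i, numerically equal to (Z_hat'Z_hat)^{-1}Z_hat'y_i
    return np.linalg.solve(Z_hat.T @ Z, Z_hat.T @ y_i)
```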
This estimator is also obtained if the least squares rule is applied to the statistical model y_i = Ẑ_iδ_i + e_i*, in which Z_i is replaced by Ẑ_i.

The limited information maximum likelihood (LIML) estimator of δ_i is derived under the assumption that the errors are normally distributed. The likelihood function is

$$\ell(\delta_\bullet, \Sigma_\bullet \mid y_\bullet) = (2\pi)^{-m_iT/2}\,|\Sigma_\bullet \otimes I_T|^{-1/2}\,\|\Gamma_\bullet\|^T \exp[-\tfrac{1}{2}(y_\bullet - Z_\bullet\delta_\bullet)'(\Sigma_\bullet^{-1} \otimes I_T)(y_\bullet - Z_\bullet\delta_\bullet)]$$

where the subscript dot denotes the subsystem consisting of the ith structural equation

$$y_i = Z_i\delta_i + e_i$$

together with the unrestricted reduced form equations for the included right-hand-side endogenous variables,

$$y_R = (I \otimes X)\pi_R + v_R$$

where Z_i = (Y_i X_i), y_R = vec(Y_i), π_R = vec(Π_i), and v_R = vec(V_i). This results because the restrictions imposed on the reduced form of y_i and y_R are exactly the LIML restrictions. The subsystem has a seemingly unrelated regression format; denoting the blocks of the inverse of the contemporaneous covariance matrix of the errors (e_{ti}, v_{ti}') by

$$\begin{bmatrix} a & b' \\ b & c \end{bmatrix}$$

the estimator is

$$\begin{bmatrix} \hat\delta_i(\text{LIML}) \\ \hat\pi_R(\text{LIML}) \end{bmatrix} = \begin{bmatrix} Z_i'(a \otimes I)Z_i & Z_i'(b' \otimes X) \\ (b \otimes X')Z_i & c \otimes X'X \end{bmatrix}^{-1}\begin{bmatrix} Z_i'(a \otimes I) & Z_i'(b' \otimes I) \\ b \otimes X' & c \otimes X' \end{bmatrix}\begin{bmatrix} y_i \\ y_R \end{bmatrix}$$
To use the information in the complete system, premultiply each equation of (15.2.3) by X' to obtain

$$\begin{bmatrix} X'y_1 \\ X'y_2 \\ \vdots \\ X'y_M \end{bmatrix} = \begin{bmatrix} X'Z_1 & & & \\ & X'Z_2 & & \\ & & \ddots & \\ & & & X'Z_M \end{bmatrix}\begin{bmatrix} \delta_1 \\ \vdots \\ \delta_M \end{bmatrix} + \begin{bmatrix} X'e_1 \\ X'e_2 \\ \vdots \\ X'e_M \end{bmatrix} \qquad (15.2.16)$$

or compactly as

$$(I \otimes X')y = (I \otimes X')Z\delta + (I \otimes X')e$$

where

$$Z = \text{diag}(Z_1, Z_2, \ldots, Z_M) \text{ is a } \left[TM \times \textstyle\sum_{i=1}^M (m_i - 1 + k_i)\right] \text{ matrix}$$

y = (y₁' y₂' ⋯ y_M')' is a (TM × 1) vector, δ = (δ₁' δ₂' ⋯ δ_M')' is a [Σᵢ(mᵢ − 1 + kᵢ) × 1] vector, and e = (e₁' e₂' ⋯ e_M')' is a (TM × 1) vector.
The covariance matrix of (I ⊗ X')e can be taken to be, in large samples and under conditions given in Dhrymes (1974, pp. 183-184), Σ ⊗ E(X'X), where Σ is an unknown (M × M) matrix. Assuming that (X'X/T) converges to a nonstochastic limit, X'X/T will be a consistent estimator for T⁻¹E[X'X]. Thus for large samples, we may use Σ ⊗ X'X in place of Σ ⊗ E[X'X] in estimation. The GLS (SUR) estimator of δ is

$$\hat\delta^* = \{Z'[\Sigma^{-1} \otimes X(X'X)^{-1}X']Z\}^{-1}Z'[\Sigma^{-1} \otimes X(X'X)^{-1}X']y \qquad (15.2.18)$$
Since Σ is unknown, its elements are estimated by σ̂_ij = (y_i − Z_iδ̂_i)'(y_j − Z_jδ̂_j)/T_ij, where δ̂_i is the 2SLS estimator and T_ij may be calculated as T_ij = T or T_ij = [(T − m_i + 1 − k_i)(T − m_j + 1 − k_j)]^{1/2}. The resulting feasible three stage least squares (3SLS) estimator may be calculated in three stages: in the first stage, each endogenous variable y_i is regressed on all the predetermined variables to form the fitted values in Ẑ_i; the second stage yields the 2SLS estimates and residuals, from which Σ̂ is computed; and the third stage applies the GLS formula (15.2.18).
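A compact sketch of the three stages, assuming T_ij = T and using only the formulas above (the interface is illustrative), is:

```python
import numpy as np

def three_sls(ys, Zs, X):
    """Feasible 3SLS for equations y_i = Z_i delta_i + e_i (lists ys, Zs),
    with X the (T x K) matrix of all predetermined variables."""
    T = X.shape[0]
    PX = X @ np.linalg.solve(X.T @ X, X.T)          # X(X'X)^{-1}X'
    # Stages 1-2: 2SLS equation by equation, then Sigma_hat from residuals.
    deltas = [np.linalg.solve((PX @ Z).T @ Z, (PX @ Z).T @ y)
              for y, Z in zip(ys, Zs)]
    U = np.column_stack([y - Z @ d for y, Z, d in zip(ys, Zs, deltas)])
    S_inv = np.linalg.inv(U.T @ U / T)              # Sigma_hat^{-1}
    # Stage 3: joint GLS as in (15.2.18).
    M = len(ys)
    lhs = np.block([[S_inv[i, j] * (Zs[i].T @ PX @ Zs[j]) for j in range(M)]
                    for i in range(M)])
    rhs = np.concatenate(
        [sum(S_inv[i, j] * (Zs[i].T @ PX @ ys[j]) for j in range(M))
         for i in range(M)])
    return np.linalg.solve(lhs, rhs)
```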
Under the assumption of normally distributed errors, full information maximum likelihood (FIML) estimation of the complete system

$$y = Z\delta + e \qquad (15.2.20)$$

is based on the joint density of the observations

$$f(y) = (2\pi)^{-MT/2}|\Sigma \otimes I_T|^{-1/2}\|\Gamma\|^T \exp[-\tfrac{1}{2}(y - Z\delta)'(\Sigma^{-1} \otimes I_T)(y - Z\delta)] \qquad (15.2.22)$$

with log likelihood function

$$\ln \ell(\delta, \Sigma|y, Z) = -\frac{MT}{2}\ln(2\pi) - \frac{T}{2}\ln|\Sigma| + T\ln\|\Gamma\| - \tfrac{1}{2}(y - Z\delta)'(\Sigma^{-1} \otimes I_T)(y - Z\delta)$$
15.3 GENERAL CLASSES OF ESTIMATORS

In this section we will discuss briefly the relative merits of the estimators of the parameters of the structural equations. To do so, we identify the estimators with their characteristics within a certain family so that related estimators can be compared. The special characteristics that we will consider in this section are the values of k in the k-class estimators, the instrumental variables used, and the numerical solution algorithms used in approximating the full information maximum likelihood estimator.
15.3.1 A Family of k-Class Estimators

The k-class estimator of the parameters of the ith structural equation is defined as

$$\hat\delta_i(k) = \begin{bmatrix} Y_i'Y_i - k\hat V_i'\hat V_i & Y_i'X_i \\ X_i'Y_i & X_i'X_i \end{bmatrix}^{-1}\begin{bmatrix} (Y_i - k\hat V_i)' \\ X_i' \end{bmatrix} y_i$$

where V̂_i = [I − X(X'X)⁻¹X']Y_i is the matrix of least squares residuals from the regression of Y_i on X.
It can be shown that ordinary and two-stage least squares are members of the k-class with k chosen equal to 0 and 1, respectively. The k-class estimator is consistent if plim k = 1, and it has the same asymptotic covariance matrix as 2SLS if plim √T(k − 1) = 0. See Schmidt (1976, pp. 167-169) for a proof. Since OLS corresponds to the choice k = 0, it is not consistent, as noted in the previous chapter.
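The whole family can be computed from one set of normal equations; the sketch below (illustrative interface) returns the k-class estimate for any nonstochastic k, so k = 0 reproduces OLS and k = 1 reproduces 2SLS.

```python
import numpy as np

def k_class(y_i, Y_i, X_i, X, k):
    """k-class estimate of (gamma_i, beta_i); k = 0 is OLS, k = 1 is 2SLS."""
    # Least squares residuals of Y_i on all predetermined variables X.
    V = Y_i - X @ np.linalg.solve(X.T @ X, X.T @ Y_i)
    lhs = np.block([[Y_i.T @ Y_i - k * (V.T @ V), Y_i.T @ X_i],
                    [X_i.T @ Y_i,                 X_i.T @ X_i]])
    rhs = np.concatenate([(Y_i - k * V).T @ y_i, X_i.T @ y_i])
    return np.linalg.solve(lhs, rhs)
```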
Another well-known member of the k-class estimators is the limited information maximum likelihood estimator, the value of k being equal to the smallest root l of the characteristic equation

$$\det(W_i - lW) = 0$$

where

$$W_i = Y_i^{0\prime}M_iY_i^0, \qquad M_i = I - X_i(X_i'X_i)^{-1}X_i'$$

$$W = Y_i^{0\prime}MY_i^0, \qquad M = I - X(X'X)^{-1}X'$$

and Y_i⁰ = (y_i Y_i) contains the endogenous variables included in the ith equation.
For the proof see, for example, Goldberger (1964, pp. 338-344) or Schmidt (1976, pp. 169-200). To examine the consistency of the LIML estimator, we make use of the result of Anderson and Rubin (1950) that the asymptotic distribution of T(l − 1) is χ² with (k_i* − m_i + 1) degrees of freedom, where k_i* = K − k_i is the number of excluded predetermined variables. The asymptotic expectation of √T(l − 1) is (k_i* − m_i + 1)/√T and the asymptotic variance of √T(l − 1) is 2(k_i* − m_i + 1)/T. Since both the mean and variance go to zero as T approaches infinity, the asymptotic distribution of √T(l − 1) is degenerate at 0. Therefore, we have plim √T(l − 1) = 0, which implies that the LIML estimator is consistent and has the same asymptotic covariance matrix as the 2SLS estimator.
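The LIML value of k is itself a small computation: the smallest root of det(W_i − lW) = 0 is a generalized eigenvalue. A minimal sketch under the notation above (illustrative interface):

```python
import numpy as np

def liml_root(y_i, Y_i, X_i, X):
    """Smallest root l of det(W_i - l W) = 0; using k = l in the k-class
    normal equations gives the LIML estimator."""
    T = len(y_i)
    Y0 = np.column_stack([y_i, Y_i])                        # (y_i  Y_i)
    M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
    Mi = np.eye(T) - X_i @ np.linalg.solve(X_i.T @ X_i, X_i.T)
    W_i, W = Y0.T @ Mi @ Y0, Y0.T @ M @ Y0
    # Roots of det(W_i - l W) = 0 are the eigenvalues of W^{-1} W_i.
    return np.min(np.linalg.eigvals(np.linalg.solve(W, W_i)).real)
```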
The k-class estimator may be generalized by allowing different values k₁ and k₂ in the two places where k appears; this generalization makes the k-class estimator a special case (k₁ = k₂ = k). If k₁ = k₂ = 1 it is the 2SLS estimator. Another special case is Theil's (1961) h-class of estimators, where k₁ = 1 − h² and k₂ = 1 − h. The double k-class estimator is consistent and has the same asymptotic covariance matrix as that of the 2SLS estimator if plim √T(k₁ − 1) = plim √T(k₂ − 1) = 0. Nagar (1959, 1962) works out some approximate small sample properties of the k-class and double k-class estimators under the assumptions that k₁ and k₂ are nonstochastic, the disturbances are normal and serially independent, and all predetermined variables are nonstochastic and exogenous. For more discussion of the small sample properties, see Section 15.4.
Generalizations of the double k-class estimator to full information simultaneous system estimation are attempted by Roy and Srivastava (1966), Srivastava (1971), Savin (1973), and Scharf (1976). Realizing that the k-class normal equations are obtained from the 2SLS normal equations if the matrix X(X'X)⁻¹X' is replaced by I − k_iM with M = I − X(X'X)⁻¹X', Scharf (1976) replaces the matrices X(X'X)⁻¹X' occurring in each submatrix of the 3SLS normal equations by I − k_ijM to define the K-matrix-class estimators. When all k_ij are equal to k, the K-matrix-class can be easily expressed in 3SLS form with X(X'X)⁻¹X' replaced by I − kM. If each k_ij has the property that plim √T(k_ij − 1) = 0 for all i, j = 1, 2, . . . , M, then the K-matrix-class and 3SLS estimators have the same asymptotic distribution. Scharf (1976) has proved that the FIML estimator is an estimator of the K-matrix-class with (M × M) matrix K given by

$$K = (k_{ij}) = (w_{ij}/s_{ij})$$
15.3.2 A Family of Instrumental Variable Estimators

Consider estimation of the complete system y = Zδ + e by the instrumental variable (IV) estimator

$$\hat\delta(\text{IV}) = (W'Z)^{-1}W'y$$

where W is a matrix of instruments that has the same dimension and rank as Z, and the elements of W are uncorrelated with e so that

$$\text{plim}(W'e/T) = 0 \qquad (15.3.6)$$
Given the definition of the instrumental variable estimator and its asymptotic
properties, we can now give the well-known estimators discussed in previous
sections an instrumental variable interpretation. Based on the amount of infor-
mation employed in forming the instrumental variables, we will be able to
compare alternative estimators in terms of their asymptotic efficiency.
We will start with the OLS estimator of the structural parameter vector δ, which corresponds to the choice W = Z and is inconsistent because plim(Z'e/T) ≠ 0. The 2SLS estimator corresponds to the instruments W' = Z'[I ⊗ X(X'X)⁻¹X']. For 3SLS, the instrument matrix

$$W' = Z'[\hat\Sigma^{-1} \otimes X(X'X)^{-1}X']$$

is also asymptotically uncorrelated with e, and plim(W'Z/T) exists and is assumed nonsingular. Thus, the 3SLS estimator is consistent, and the asymptotic covariance matrix of √T(δ̂(3SLS) − δ) is

$$\left\{\text{plim}\,T^{-1}Z'[\Sigma^{-1} \otimes X(X'X)^{-1}X']Z\right\}^{-1}$$
It can be shown that the difference between the covariance matrix of the 2SLS
estimator and the 3SLS estimator is a positive semidefinite symmetric matrix.
The efficiency of 3SLS over 2SLS is seen from the amount of information contained in the instrumental variables Z'[Σ̂⁻¹ ⊗ X(X'X)⁻¹X'] as compared with Z'[I ⊗ X(X'X)⁻¹X'] used in 2SLS. The 2SLS estimator ignores the information in Σ⁻¹.
Hausman (1975) has given an interesting instrumental variable interpretation of FIML. Hausman shows that, by maximizing the log likelihood function L(Γ, B, Σ), the FIML estimator δ̂(FI) of the unknown elements of δ is an instrumental variable estimator with instruments

$$\bar W' = \bar Z'(\hat S \otimes I_T)^{-1}$$

where Z̄ is constructed from the restricted reduced form, the a priori restrictions of zero and normalized coefficients having been imposed, and Ŝ is an estimator of Σ. The estimator δ̂(FI) is the instrumental variable estimator that fulfills the assumptions of orthogonality, plim(W̄'e/T) = 0, and the existence and nonsingularity of the second order moment matrices. Note that both Z̄ and Ŝ depend on Γ̂ and B̂, which are elements of δ̂. Therefore, δ̂(FI) could be obtained by an iterative process

$$\hat\delta_{(k+1)} = (\bar W_{(k)}'Z)^{-1}\bar W_{(k)}'y$$

where k and k + 1 denote iteration numbers. The limit of the iterative process, if it converges, δ̂(FI), is the FIML estimator with estimated covariance matrix [Z̄*'(S* ⊗ I)⁻¹Z̄*]⁻¹, since asymptotically √T(δ̂(FI) − δ) ~ N(0, V⁻¹), where V = lim[−E(T⁻¹∂²L/∂δ∂δ')], and Z̄* and S* are computed using δ̂(FI).
By an examination of the difference in instruments between Z'[Σ̂⁻¹ ⊗ X(X'X)⁻¹X'] and Z̄'(Ŝ ⊗ I)⁻¹, we can state that the FIML estimator uses all a priori restrictions in constructing the instruments, while 3SLS uses an unrestricted estimator in constructing the instruments. In other words, in FIML, zero restrictions have been imposed to construct Z̄, while 3SLS does not impose the known parameter constraints. In short, the difference between FIML and 3SLS lies in the latter not making complete use of the identifying restrictions in constructing the instruments, and in the different estimators of the covariance matrix Σ. The two
methods are equivalent when each structural equation is just identified since
the instruments are identical. The 3SLS estimator and the FIML estimator
converge to the same limiting distribution since both are consistent and asymp-
totically normal and have covariance matrices that converge asymptotically to
the inverse of the information matrix.
Since iterative procedures are used in computing the FIML estimator, the
convergence properties are not clear. For a test of the method and certain
proposed alternatives, see Hausman (1974). Alternative ways of constructing
instruments for FIML are also available in Lyttkens (1970), Dhrymes (1971),
and Brundy and Jorgenson (1971). These alternative estimators are shown by
Hausman (1975) to be particular cases of the FIML iteration discussed above.
To derive these estimators, write the system as

$$WA + E = 0$$

where W = (Y X), A = (Γ' B')', and E = (e₁ e₂ ⋯ e_M). Taking the derivatives of L with respect to the unrestricted elements of Γ, B, and Σ⁻¹ and setting them to zero, we obtain, respectively,

$$[T(\Gamma')^{-1} - Y'WA\Sigma^{-1}]^u = 0$$

$$(-X'WA\Sigma^{-1})^u = 0$$

and

$$T\Sigma - A'W'WA = 0, \qquad \text{that is,} \qquad \Sigma = A'(W'W/T)A$$

where the superscript u denotes selection of the elements corresponding to the unrestricted coefficients, and the relations

$$Y = X\Pi + V, \qquad \Pi = -B\Gamma^{-1}, \qquad Q = (\Pi \;\; I), \qquad W = (Y \;\; X) = XQ + (V \;\; 0)$$

are used. In solving for Γ, B, and Σ, alternative solution algorithms are available for either implementing or approximating FIML. The following algorithm approximating FIML is well defined:

Step 1. Use any consistent estimators Γ₍₁₎ and B₍₁₎ of Γ and B to obtain Π₍₁₎ = −B₍₁₎Γ₍₁₎⁻¹.
Step 2. Compute Q₍₁₎ = (Π₍₁₎ I).
Step 3. Compute Σ₍₂₎ = A₍₂₎'(W'W/T)A₍₂₎.
Step 4. Solve for A₍₃₎ from [Σ₍₂₎⁻¹A₍₃₎'(W'X/T)Q₍₁₎]^u = 0.
In the above steps, the subscripts in parentheses denote that different estimators may be used in each expression, where the number shows the order in which the estimators are often obtained. The equation [Σ⁻¹A'(W'X/T)Q]^u = 0 is linear in A and constitutes a generating equation for estimators, in that variations in the choice of Q and Σ generate almost all known estimators for A.

The matrix [Σ⁻¹A'(W'X/T)Q]^u is the result of the first derivatives of the concentrated likelihood function L(A|Σ). Thus the second derivatives of L(A|Σ) with respect to the elements of A will give the information matrix. To determine these derivatives, it is easier to consider d = vec(A) = (a₁' a₂' ⋯ a_M')', retain the unrestricted coefficients packed together to obtain δ, and then take the derivative with respect to δ. With the vectorization, the matrix in question can be written

$$(\Sigma^{-1} \otimes Q'X'W)^u\,\delta$$

where again we remind the reader that u denotes choosing only the elements corresponding to the unrestricted coefficients δ. Differentiating (Σ⁻¹ ⊗ Q'X'W)^uδ with respect to δ and dropping the asymptotically negligible terms yields, for the covariance matrix of the FIML estimator of δ,

$$\{[\Sigma^{-1} \otimes Q'X'XQ]^u\}^{-1}$$

because plim(V'X/T) = 0.
For the generation and discussion of other estimators, the reader is referred
to Hendry (1976), Hendry and Harrison (1974), Maasoumi and Phillips (1982),
Hendry (1982), Anderson (1984), and Prucha and Kelejian (1984).
15.4 SMALL SAMPLE PROPERTIES OF ESTIMATORS

In Section 15.3, where general classes of estimators were discussed, the comparisons of alternative estimators were primarily based on similarities in their asymptotic properties. In other words, the estimators were compared when the sample size was assumed large. For example, the distributions of the FIML estimator and the 3SLS estimator are known to converge to the same limiting distribution. However, in small samples their behavior may be very different. In this section, we will concentrate on the small sample properties of simultaneous equations estimators.
During the 1950s and 1960s, studies of small sample properties of estimators
were often performed by Monte Carlo experiments. For a classified bibliogra-
phy of such Monte Carlo experiments over the period 1948-1972, see Sowey
(1973). The sampling experiments are often performed by setting up a structural
model with specified values of structural coefficients and specified distributions
of equation errors. The values of the predetermined variables are also speci-
fied, and the values of the jointly dependent variables obtained through the
reduced form equations in conjunction with the reduced form disturbances that
may be generated by a random number generator of a digital computer. Many
samples of different sizes are generated and used to estimate the structural
parameters with different estimators. The resultant distributions of the sample
estimates are summarized and compared. Although past Monte Carlo studies
have provided insights into the comparative merits of statistical procedures,
there are limitations on the generalization that may be made from these experi-
ments. The results may be specific to the set of parameter values assumed in
the simulation and may be subject to sampling variations. Monte Carlo experiments neither permit a systematic analysis of the relationship between parameter values and the corresponding distributions, nor provide a way of revealing whether moments of the exact distribution exist; however, they may give a very strong indication of such nonexistence. Recent efforts to cope with these issues
are discussed in Hendry and Harrison (1974), Hendry and Mizon (1980), and
Dempster, Schatzoff, and Wermuth (1977).
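The design just described is easy to mimic. The sketch below (all design values are illustrative, not from any cited study) holds the structure and the predetermined variable fixed, draws repeated samples through the reduced form, and compares the sampling distributions of OLS and the just-identified IV estimator of γ in y₁ = γy₂ + e₁; since the just-identified estimator possesses no finite moments, the median is reported alongside the mean.

```python
import numpy as np

rng = np.random.default_rng(42)
gamma, alpha, beta, T, reps = 0.5, 0.4, 1.0, 50, 5000
d = 1.0 - gamma * alpha
x = rng.normal(size=T)            # predetermined values fixed across samples

ols, iv = np.empty(reps), np.empty(reps)
for r in range(reps):
    e1, e2 = rng.normal(size=T), rng.normal(size=T)
    # Reduced form of: y1 = gamma*y2 + e1,  y2 = alpha*y1 + beta*x + e2.
    y1 = (gamma * beta * x + e1 + gamma * e2) / d
    y2 = (beta * x + alpha * e1 + e2) / d
    ols[r] = (y2 @ y1) / (y2 @ y2)          # LS of y1 on y2
    iv[r] = (x @ y1) / (x @ y2)             # IV with instrument x

for name, draws in (("OLS", ols), ("IV ", iv)):
    print(f"{name} mean={draws.mean():7.3f}  median={np.median(draws):7.3f}"
          f"  (true gamma={gamma})")
```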
Recent studies of small sample properties have concentrated on analytical
work. Although there are pioneering works by Nagar (1959), Basmann (1960,
1961), and Bergstrom (1962), most of the results appear in the literature after
1970. In the past 10 to 15 years, although we have seen many achievements,
results are available only for certain special cases because of the analytical
complexities involved in carrying through the necessary derivations. Recent
contributions by Basmann (1974), Phillips (1980a), Mariano (1982), Anderson
(1982, 1984), Greenberg and Webster (1983), Phillips (1983) and Taylor (1983)
summarize many of these results. Most of the results are about the 2SLS estimator of an equation containing two included endogenous variables, written as

$$y_2 = \gamma y_1 + X_1\beta + e$$
Moments of the 2SLS estimator of such an equation exist up to a certain order, and those of higher order do not. An interesting special case arises when the number of excluded exogenous variables is k_i* = K − k_i = 1, which corresponds to the case where the equation with two included endogenous variables is just identified. In this case, the moments do not exist for the 2SLS estimator. Sawa's (1969) results for the density functions of the OLS and 2SLS estimators are complicated in mathematical structure, and it is difficult to deduce any definite conclusions concerning small sample properties. Therefore, Sawa evaluates the density functions numerically for γ = 0.6, σ₁₁ = σ₂₂ = 1, and for several values of T, K, and τ² (which is T times the square of the reduced form coefficient for y₁), and examines the resulting graphs.
Both Richardson (1968) and Sawa (1969) have independently confirmed Basmann's (1961) conjecture that the moments of the 2SLS estimator of an equation in a simultaneous system of linear stochastic equations exist if the order is less than k_i* − m_i + 1, where m_i and k_i* are, respectively, the numbers of endogenous variables included in and predetermined variables excluded from the equation being estimated. Also, in the case of two included endogenous variables, Mariano and Sawa (1972) derive the exact probability density function of the LIML estimator and show that, for arbitrary values of the parameters in the model, the LIML estimator does not possess moments of any order. Mariano (1972) further proves, for the general case with an arbitrary number of included endogenous variables, that even moments of the 2SLS estimator exist if and only if the order is less than k_i* − m_i + 1, and that even moments of the OLS estimator exist if and only if the order is less than T − k_i − m_i + 1.
The exact distributions derived for the 2SLS and LIML estimators are too complicated to provide any basis for a comparison of the two estimators. Hoping that approximations to the distribution functions of these two estimators will provide more tractable expressions for comparison, Mariano (1973a) approximates the distribution function of the 2SLS estimator up to terms whose order of magnitude is 1/√T, where T is the sample size. Anderson and Sawa (1973) also give an asymptotic expansion of the distribution of the k-class estimates in terms of an Edgeworth or Gram-Charlier series. The density of the approximate distribution is a normal density multiplied by a polynomial. The first correction term to the normal distribution involves a cubic divided by the square root of the sample size, and other correction terms involve polynomials of higher degree divided by higher powers of the square root of the sample size. The approach of Anderson and Sawa (1973) permits expression of the distribu-
tion of the 2SLS and OLS estimates in terms of the doubly noncentral F
distribution. Anderson and Sawa (1973) also give an asymptotic expansion of
the cumulative density function of the 2SLS estimate in terms of the noncen-
trality parameter for a fixed number of excluded exogenous variables. Their
expansion agrees with that given by Sargan and Mikhail (1971). Mariano (1973b) also approximates the distribution of the k-class estimators, extending the method he used for the 2SLS (Mariano, 1973a). Anderson (1974) makes an asymptotic expansion of the distribution of the LIML estimate for the two included endogenous variables case.
Anderson and Sawa (1979) complete their study of the distribution function
of 2SLS by giving extensive numerical tables of the exact cumulative density
function of the 2SLS estimator and examining the accuracy of the approximate
distributions based on the asymptotic expansion in the noncentrality parame-
ter. The distribution of the estimator depends on the values taken by the noncen-
trality parameter and a standardization of the structural coefficients, as well as the number of the excluded exogenous variables. Anderson, Kunitomo, and Sawa (1982) present corresponding tables for the LIML estimator. The distribution of this estimator depends on the same factors that affect the distribution of the 2SLS estimator plus the number of degrees of freedom in the estimator of the covariance matrix of the reduced form. They also show that the LIML estimator approaches normality much faster than the 2SLS estimator, and LIML is strongly preferred to 2SLS. Holly and Phillips (1979) point out that
approximations based on the Edgeworth expansion are sometimes unsatisfac-
tory in the tail area and propose the use of the saddle point approximation.
Their numerical computations suggest that the new approximation performs
very well relative to Edgeworth approximations, particularly when the degree
of overidentification in the equation is large.
Kinal (1980) summarizes the conditions for the existence of moments of k-class estimators for the special case of two included endogenous variables as follows: the rth-order moment of the k-class estimator, of the coefficient of an endogenous variable of an equation in a simultaneous system of linear stochastic equations, exists for nonstochastic k if and only if r < N, where N = T − k_i − m_i + 1 for 0 ≤ k < 1 and N = k_i* − m_i + 1 for k = 1.
15.5 INFERENCE
Several kinds of tests have been proposed for the estimated coefficients of
equations in a system of simultaneous linear equations. In this section, we will
present selected test procedures for parameter significance, overidentification
and the orthogonality assumption.
As we discussed in Section 15.4, for small sample sizes the probability distribu-
tions of estimators in a system of simultaneous equations are unknown except
for a few highly special cases. Consequently, tests of parameter significance are
usually based on the asymptotic distributions of the estimators.
Let the general linear hypothesis be H₀: Rδ = r, where R is (J × n), r is (J × 1), and n = Σᵢ(mᵢ + kᵢ). To demonstrate how a test of these constraints can be derived, we use the 3SLS estimator δ*. A test statistic for the 2SLS estimator can be obtained analogously. Since T^{1/2}(δ* − δ) is asymptotically distributed N(0, Σ_{δ*}), where Σ_{δ*} = plim{T⁻¹Z'[Σ⁻¹ ⊗ X(X'X)⁻¹X']Z}⁻¹, it follows that under H₀

$$T^{1/2}(R\delta^* - r) \xrightarrow{d} N(0,\, R\Sigma_{\delta^*}R') \qquad (15.5.1)$$

and

$$T(R\delta^* - r)'(R\Sigma_{\delta^*}R')^{-1}(R\delta^* - r) \xrightarrow{d} \chi^2_{(J)} \qquad (15.5.2)$$

Replacing Σ_{δ*} by the consistent estimator Σ̂_{δ*} = {T⁻¹Z'[Σ̂⁻¹ ⊗ X(X'X)⁻¹X']Z}⁻¹ leaves the limiting distribution unchanged (15.5.4). For a single coefficient the statistic reduces to (15.5.5), or

$$\frac{\delta_i^* - \delta_i}{\sqrt{\hat\Sigma_{\delta^*,ii}/T}} \xrightarrow{d} N(0, 1) \qquad (15.5.6)$$

Thus asymptotic tests on individual parameters can be carried out in finite samples using (15.5.6) with the estimated standard error of δᵢ* and critical values from the standard normal or "t" distribution, since they are the same asymptotically.
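Once an estimator and its estimated asymptotic covariance matrix are available, the Wald statistic for H₀: Rδ = r is a one-line computation. A minimal sketch (illustrative interface; cov_hat denotes the estimated covariance matrix of the estimator itself, i.e., already divided by T):

```python
import numpy as np

def wald_test(delta_hat, cov_hat, R, r):
    """Wald statistic for H0: R delta = r; asymptotically chi-square with
    J = R.shape[0] degrees of freedom under H0."""
    diff = R @ delta_hat - r
    return diff @ np.linalg.solve(R @ cov_hat @ R.T, diff)

# Example usage: test that the second of three parameters is zero.
# stat = wald_test(d_hat, V_hat, np.array([[0.0, 1.0, 0.0]]), np.zeros(1))
```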
The validity of the overidentifying restrictions is of particular interest, and tests are available for the overidentifying restrictions.

A well-known overidentification test is Anderson and Rubin's (1949) likelihood ratio test. The essential idea of the test is the following: If all the a priori restrictions on parameters are correct, then the ratio of the likelihood functions for the restricted and unrestricted models should be close to one. If the ratio deviates from one too much in a statistical sense, then the hypotheses, here the overidentifying restrictions, are concluded not to agree with the sample data. Since l is the minimum value of the ratio of the two residual variances γ_i⁰'W_iγ_i⁰/γ_i⁰'Wγ_i⁰, the likelihood ratio test statistic turns out to be based on the smallest root l of det(W_i − lW) = 0, and T(l − 1) is asymptotically χ² with degrees of freedom equal to the number of overidentifying restrictions.
An alternative, due to Byron, expresses the overidentifying restrictions as constraints h(π) = 0 on the reduced form parameters. The statistic

$$q = T\,h(\hat\pi)'(H'\hat\Psi H)^{-1}h(\hat\pi) \qquad (15.5.10)$$

which is the Wald test statistic, is asymptotically χ²_{(J)}, where J is the number of restrictions that overidentify the structural equation. Byron shows that under the null hypothesis, the Wald test is asymptotically equivalent to a likelihood ratio test.
Wegge (1978) treats an overidentified model as an exactly identified model that is subject to overidentifying restrictions. Consider the just identified ith structural equation, with the overidentifying restrictions written as λ = 0 (15.5.11)-(15.5.12). Since the unrestricted estimate λ̂ is asymptotically normal (15.5.13), the quadratic form λ̂'(X̄_i'M*X̄_i)λ̂/σ̂_ii may be used as a test statistic, where σ̂_ii is the 2SLS estimate of σ_ii and the null hypothesis is λ = 0. For a comparison of alternative test procedures, see Hwang (1980b).
Another issue of interest is whether the orthogonality assumption between the disturbances and the explanatory variables holds (15.5.14). Hausman's test may be computed from the augmented regression

$$y = Z\delta + \hat Z\alpha + e \qquad (15.5.15)$$

where Ẑ contains the fitted values of the right-hand-side endogenous variables; the orthogonality hypothesis is rejected when the estimate of α, with covariance matrix given by (15.5.16)-(15.5.17), is significantly different from zero. Both Hausman's IV test statistic and Durbin's (1954) test statistic are also identical to Wu's (1973) two other test statistics, depending on the estimator used for the nuisance parameter of the variance of the equation error. None of these tests, however, are true Lagrange multiplier or Wald tests, since they are not based on maximum likelihood estimates.
15.6 EFFICIENT ESTIMATION OF REDUCED FORM PARAMETERS

The estimation of reduced form equations was discussed earlier for the purpose of using the results to consistently estimate structural parameters. For certain applications of estimated econometric models, such as forecasting, interest centers on the reduced form of the model rather than on the structural form. In such cases it appears desirable to estimate the reduced form by the most efficient means available.
In Chapter 14 we showed that the direct application of OLS to the reduced form equations results in consistent estimates of the reduced form
parameters. The method is computationally simple, but the estimates are not
efficient since the method does not take account of overidentifying restrictions.
In other words, each endogenous variable is regressed on all predetermined
variables disregarding the fact that restrictions may relate some reduced form
coefficients. Therefore, in the following discussion, we will explore alternative
methods of estimating the reduced form parameters.
A well-known method is to derive the estimator for the reduced form coeffi-
cients from the known consistent estimators of structural parameters. The
result is known as the derived reduced form. This method incorporates all
overidentifying restrictions on all structural coefficients, but the computations
require use of consistent estimators of the structural coefficients. In addition,
the covariance matrix of the reduced form estimators must also be derived from
the already very complicated covariance matrix of the consistent estimates of the structural parameters. Moreover, the covariances of reduced form estimates and of forecasts are available only to order 1/T, where T is the sample size.
Besides, accumulation of rounding errors in computing the derived covariance
matrix for a large model may become a significant problem. See Schmidt (1976,
pp. 236-247) for a summary of the properties of derived reduced form parame-
ter estimates.
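Computationally the derived reduced form itself is a single matrix expression. A minimal sketch (the inputs are assumed to be consistent structural estimates with the normalization and zero restrictions already imposed):

```python
import numpy as np

def derived_reduced_form(Gamma_hat, B_hat):
    """Derived reduced form coefficients Pi = -B Gamma^{-1}, computed from
    consistent structural estimates (e.g., 2SLS, 3SLS, or FIML)."""
    return -B_hat @ np.linalg.inv(Gamma_hat)
```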
Alternatively, direct estimation of reduced forms can be improved by incor-
porating a priori restrictions on the reduced form coefficients. The procedure is
feasible if the a priori restrictions on reduced form coefficients are known in
advance. However, such a priori restrictions on reduced forms implied by the
restrictions on structural parameters are often hard to derive. This is particu-
larly true when the model is large. Besides, if the restrictions are not correct,
the resulting estimators are biased.
Kakwani and Court (1972) propose the use of overidentifying information on
the coefficients of one structural equation at a time. Recall that a structural
equation may be written with its right-hand-side endogenous variables replaced through their reduced form equations,

$$y_i = (X\Pi_i \;\; X_i)\delta_i + v_i = XH_i\delta_i + v_i, \qquad H_i = \left(\Pi_i \;\; \begin{bmatrix} I \\ 0 \end{bmatrix}\right)$$

Comparing the coefficients with its own reduced form equation y_i = Xπ_i + v_i, we have π_i = H_iδ_i. Kakwani and Court (1972) define the partially restricted reduced form estimator π̂_i* as

$$\hat\pi_i^* = \hat H_i\hat\delta_i \qquad (15.6.2)$$

where Ĥ_i is formed from the unrestricted least squares estimate Π̂_i (15.6.3) and

$$\hat\delta_i = (\hat H_i'X'X\hat H_i)^{-1}\hat H_i'X'y_i \qquad (15.6.4)$$

Kakwani and Court (1972) showed that the partially restricted reduced form estimates can be regarded as a first approximation to the derived reduced form coefficients. An iterative procedure is thus proposed. The initially estimated Π (built from the π̂_i*) implies a new Ĥ_i, which may be used to obtain new estimates δ̂_i and π̂_i*, and so on.
Court considers joint estimation of a set of structural equations

$$y_s = Z_s\delta_s + e_s$$

where the subscript s refers to the fact that not all M equations may be present. Similarly, several of the reduced form equations y_j = Xπ_j + v_j can be jointly written as

$$y_R = (I_R \otimes X)\pi_R + v_R$$

where again the subscript R implies that not all of the reduced form equations may be present. If these two sets of equations are written jointly as

$$\begin{bmatrix} y_s \\ y_R \end{bmatrix} = \begin{bmatrix} Z_s & 0 \\ 0 & I_R \otimes X \end{bmatrix}\begin{bmatrix} \delta_s \\ \pi_R \end{bmatrix} + \begin{bmatrix} e_s \\ v_R \end{bmatrix} \qquad (15.6.6)$$
and then estimated by 3SLS, Court shows that the resulting estimates of the
structural parameters are unaffected by the presence of the reduced form equa-
tions, but the estimators of the reduced form parameters are more efficient than
the estimators of the reduced form parameters by least squares. He also shows
that if Ys and YR are complete and contain all the endogenous variables in the
system, the resulting estimators of the structural parameters are simply the
3SLS estimates and the estimators of the reduced form parameters are the
derived reduced form estimators from the 3SLS estimators of the structural
parameters. One convenience of this approach is a formula for the covariance
matrix of the derived reduced form parameter estimators that appears easier to
compute than that of Goldberger, Nagar, and Odeh (1961).
Maasoumi (1978) proposes to combine the restricted 3SLS and the unrestricted least squares estimators of the reduced form coefficients into a Stein-like estimator

$$\hat\Pi_{MS} = \lambda\,\hat\Pi_{3SLS} + (1 - \lambda)\,\hat\Pi_{OLS} \qquad (15.6.7)$$

where Π̂_3SLS = −B̂Γ̂⁻¹ with B̂ and Γ̂ obtained from 3SLS, Π̂_OLS = (X'X)⁻¹X'Y, and the value of λ is

$$\lambda = 1 \;\text{ if }\; \phi^* < c, \qquad 0 \le \lambda < 1 \;\text{ if }\; \phi^* > c \qquad (15.6.8)$$

depending on (1) the chosen critical value c of the test of the validity of the parameter constraints (and specification) on the reduced form parameters, and (2) the value of the test statistic φ* (15.6.9), in which I_{(c,∞)}(φ*) = 1 if the argument φ* falls within (c, ∞) and zero otherwise. The combined OLS-3SLS estimator of Π can be seen to be the same as that derived from 3SLS when the model is correctly specified, but is closer to the direct OLS estimates of Π when the specification is less reliable.
Maasoumi (1978) shows that whereas the restricted (derived) 3SLS and 2SLS reduced form estimates possess no finite moments, the modified Stein-like reduced form estimator Π̂_MS has finite moments up to order (T − M − K). For a discussion of the existence of moments, see Section 15.4. He argues that the difference between Π̂_MS and Π̂_3SLS is asymptotically negligible, and Monte Carlo experiments support the above observations [Maasoumi (1977)].
15.7 METHODS FOR NONLINEAR SYSTEMS

15.7.1 Single Equation Methods

Write the ith structural equation of a system that may be nonlinear in the variables and/or the parameters as

$$y_{ti} = f_{ti}(z_{ti}, \delta_i) + e_{ti} \qquad (15.7.1)$$

where y_{ti} is a scalar random variable, e_{ti} is a scalar random variable with zero mean and constant variance σ_ii, z_{ti} is a [(m_i + k_i) × 1] vector of m_i endogenous variables and k_i exogenous variables, δ_i is a (g × 1) vector of unknown parameters, and f_{ti} is a nonlinear function of both z_{ti} and δ_i having continuous first and second derivatives with respect to δ_i.
The nonlinear two stage least squares estimator (NL2SLS) of δ_i is defined by Amemiya (1974a) as the value of δ_i that minimizes

$$S(\delta_i) = (y_i - f_i)'D(D'D)^{-1}D'(y_i - f_i) \qquad (15.7.2)$$

where e_i, y_i, and f_i are (T × 1) vectors of elements e_{ti}, y_{ti}, and f_{ti}(z_{ti}, δ_i), respectively, and D is a (T × K) matrix of certain constants with rank K. The matrix D may consist of low-order polynomials of all exogenous variables of the system, as in Kelejian's (1971) model that is nonlinear in only the variables, or just all the exogenous variables of the system, as in the usual linear model or that of Zellner, Huang, and Chau (1965), which is nonlinear only in the parameters. Under the assumption that the elements of e_i are independent, plus others concerning the limiting behavior of the derivatives of f_i and the matrix D, Amemiya shows that the NL2SLS estimator δ̂_i converges in probability to the true value of δ_i, δ_i*, and √T(δ̂_i − δ_i*) converges in distribution to

$$N\left\{0,\; \sigma_{ii}\left[\text{plim}\,\frac{1}{T}\left.\frac{\partial f_i'}{\partial\delta_i}\right|_{\delta_i^*}D(D'D)^{-1}D'\left.\frac{\partial f_i}{\partial\delta_i'}\right|_{\delta_i^*}\right]^{-1}\right\} \qquad (15.7.3)$$
For the proof, see Amemiya (1974a). Minimization of the criterion function (15.7.2) can be accomplished using the Gauss-Newton iterative procedure with iterations

$$\hat\delta_i(n) = \hat\delta_i(n-1) + \left[\frac{\partial f_i'}{\partial\delta_i}D(D'D)^{-1}D'\frac{\partial f_i}{\partial\delta_i'}\right]^{-1}\frac{\partial f_i'}{\partial\delta_i}D(D'D)^{-1}D'(y_i - f_i) \qquad (15.7.4)$$

where f_i and ∂f_i/∂δ_i' on the right-hand side are evaluated at δ̂_i(n − 1), the estimate obtained in the (n − 1)th iteration (see Appendix B).
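A direct implementation of the Gauss-Newton iteration (15.7.4) needs only the residual function, its Jacobian, and the instrument matrix D. A minimal sketch, with illustrative names and a simple convergence check:

```python
import numpy as np

def nl2sls(y, f, jac, delta0, D, max_iter=100, tol=1e-10):
    """Gauss-Newton iterations for the NL2SLS criterion
    (y - f(delta))' D (D'D)^{-1} D' (y - f(delta)).

    f(delta) returns the (T,) vector f_i(z, delta); jac(delta) returns the
    (T x g) matrix of derivatives of f with respect to delta."""
    PD = D @ np.linalg.solve(D.T @ D, D.T)      # projection on columns of D
    delta = np.asarray(delta0, dtype=float)
    for _ in range(max_iter):
        u = y - f(delta)
        J = jac(delta)
        step = np.linalg.solve(J.T @ PD @ J, J.T @ PD @ u)
        delta = delta + step
        if np.max(np.abs(step)) < tol:
            break
    return delta
```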
The consequences of several different choices of the matrix D are examined
by Amemiya (1975), who considers the special case where both simultaneity
and nonlinearity appear only in the ith equation. That is,
$$y_i = f_i(Y_i, X_i, \delta_i) + e_i \qquad (15.7.5)$$

with the remainder of the system summarized by the linear reduced form

$$Y_i = X\Pi_i + V_i \qquad (15.7.6)$$

where Y_i and V_i are [T × (m_i − 1)] matrices of random variables and Π_i is a [k_i × (m_i − 1)] matrix of unknown parameters [Amemiya (1975)]. It is additionally assumed that the rows of (e_i V_i) are multivariate normal with mean zero and covariance matrix

$$\begin{bmatrix} \sigma_{ii} & \sigma_i' \\ \sigma_i & \Omega_i \end{bmatrix} \qquad (15.7.7)$$

Among the estimators compared is the best nonlinear two stage least squares estimator (BNL2S), whose asymptotic covariance matrix is V_B = σ_ii(plim T⁻¹H)⁻¹ with

$$H = E\left(\frac{\partial f_i'}{\partial\delta_i}\right)X(X'X)^{-1}X'E\left(\frac{\partial f_i}{\partial\delta_i'}\right) \qquad (15.7.8)$$

and the simple estimator (SNL2S) obtained by taking D = X. Amemiya shows that V_B ≤ V_{2S}. However, SNL2S may be practically more useful than BNL2S because E(∂f_i/∂δ_i') does not always exist or may be difficult to find. Moreover, SNL2S is consistent with the spirit of limited information estimation whereas BNL2S is not. In the linear case, both SNL2S and BNL2S reduce to the usual 2SLS estimator.
To obtain the nonlinear limited information maximum likelihood estimator (NLLI), we write the log likelihood function in terms of δ_i, Π_i, and the joint covariance matrix Σ of (e_i V_i), where Σ is estimated by

$$\hat\Sigma = S/T, \qquad S = \begin{bmatrix} e_i'e_i & e_i'V_i \\ V_i'e_i & V_i'V_i \end{bmatrix}$$

Inserting Σ̂ = S/T into L**, we obtain the concentrated likelihood function, in which

$$\hat\Pi_i = (X'AX)^{-1}X'AY_i$$

for an appropriately defined matrix A. The NLLI estimator of δ_i, denoted by δ̂_i(LI), is the value of δ_i that maximizes the concentrated likelihood. The asymptotic covariance matrix of δ̂_i(LI) is V_{LI} = σ_ii(plim T⁻¹G)⁻¹, where

$$G = \frac{\partial f_i'}{\partial\delta_i}[I - V_i(V_i'V_i)^{-1}V_i']\frac{\partial f_i}{\partial\delta_i'}$$

It can also be shown that V_{LI} ≤ V_M ≤ V_B ≤ V_{2S} in the nonlinear case in general. In the linear case, they have the same asymptotic variance, and the estimators NLLI, MNL2S, and BNL2S are identical for every sample. Thus, NLLI and MNL2S are recommended by Amemiya (1975) in the limited information context. For a computational simplification of NLLI, see Raduchel (1978).
For a system method, all of the M structural equations are considered simulta-
neously in estimation. A system of M nonlinear structural equations can be
written as
1=1,2,. ,M
t = 1, 2, . ,T (15.7.10)
afti)
det I11 I = det (-a. ~ 0
Yo
where σ^{ij} is the (i,j)th element of Σ^{-1} and Σ_{tij} denotes a triple summation. The logarithmic likelihood function may be written

L* = −(T/2) ln|Σ| + Σ_{t=1}^{T} ln|det(∂f_t/∂y_t')| − (1/2) Σ_{t=1}^{T} f_t'Σ^{-1}f_t

where f_t = (f_t1  f_t2  · · ·  f_tM)', and a constant is dropped. Equating the partial
derivatives of L* with respect to Σ to zero, we obtain

Σ̂ = T^{-1} Σ_{t=1}^{T} f_t f_t'    (15.7.11)

Under general conditions the resulting ML estimator θ̂ is consistent and √T(θ̂ − θ*) converges in distribution to

N[0, −(plim T^{-1} ∂²L/∂θ∂θ' |_{θ*})^{-1}]

where θ* is the true parameter vector.
Computation of the maximum likelihood estimator can be done iteratively. The gradient class of iteration methods may be represented as

θ̂_(n) = θ̂_(n−1) + A (∂L/∂θ)|_{θ̂(n−1)}

where θ̂_(1) is an initial estimator and A is some positive definite matrix that may be stochastic. Amemiya (1977) proposes iterations

(15.7.13)

where

G = diag(G_1, G_2, . . . , G_M),    F = (f_1  f_2  · · ·  f_M)

and

Σ̂ = T^{-1}F'F
Amemiya has shown that the asymptotic distribution of the second round esti-
mator in the above iteration depends on the asymptotic distribution of the initial
estimator and is not asymptotically efficient. Although (15.7.13) may not be a
good method of iteration, it serves to demonstrate a certain similarity between
the maximum likelihood estimator and the nonlinear three stage least squares
estimator (NL3S) that is discussed below.
Extending the result of NL2S described in Section 15.7.1, Jorgenson and Laffont (1974) defined the NL3S estimator as the value of θ that minimizes

f(θ)'[Σ̂^{-1} ⊗ X(X'X)^{-1}X']f(θ)

where Σ̂ is some consistent estimator of Σ, and X is a matrix of exogenous variables that may not coincide with the exogenous variables that appear originally as the arguments of f. The asymptotic variance-covariance matrix is given by (15.7.14) below, with A = Σ̂^{-1} ⊗ X(X'X)^{-1}X'.
Amemiya (1977) further extends the definition of NL3S to be the value of θ that minimizes f'Af, where A could take any of three forms characterized through the matrices

S_1 = A^{-1/2}E(∂f/∂θ'),    S_2 = A^{-1}E(∂f/∂θ'),    S_3 = E(∂f/∂θ')

with, for example, A_2 = S_2(S_2'AS_2)^{-1}S_2'. The asymptotic variance-covariance matrix of the resulting estimator is

[plim T^{-1} (∂f'/∂θ)|_{θ*} A (∂f/∂θ')|_{θ*}]^{-1}    (15.7.14)

Here Ḡ = diag(Ḡ_1, Ḡ_2, . . . , Ḡ_M) with Ḡ_i = E[G_i] enters the best choice (BNL3S). The resulting NL3S is asymptotically less efficient than the BNL3S but is much more practical.
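To make the NL3S criterion concrete, the sketch below evaluates and minimizes f(θ)'[Σ̂^{-1} ⊗ X(X'X)^{-1}X']f(θ) numerically. It is illustrative only: scipy's general-purpose BFGS routine stands in for whatever iterative scheme one prefers, and the residual function f is assumed to be supplied by the user.

```python
import numpy as np
from scipy.optimize import minimize

def nl3sls(f, theta0, X, sigma_hat):
    """Minimize the Jorgenson-Laffont NL3S criterion.

    f         : callable, f(theta) returns a (T, M) matrix whose columns are
                the residual vectors f_1, ..., f_M of the M equations
    X         : (T, K) matrix of instruments (exogenous variables)
    sigma_hat : (M, M) consistent estimate of Sigma
    """
    Px = X @ np.linalg.solve(X.T @ X, X.T)   # X(X'X)^{-1}X'
    S_inv = np.linalg.inv(sigma_hat)

    def criterion(theta):
        F = f(theta)
        # f'[Sigma^{-1} (x) P_x]f = trace(Sigma^{-1} F' P_x F), which avoids
        # forming the (TM x TM) Kronecker product explicitly
        return np.trace(S_inv @ F.T @ Px @ F)

    return minimize(criterion, theta0, method="BFGS").x
```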
The NL3S has been extended to the estimation of the parameters of a system
of simultaneous, nonlinear, implicit equations in the recent literature. Within
this context, it is not assumed that one endogenous variable in each structural equation can be written explicitly in terms of the remaining variables of the system, and nonlinear parametric restrictions across equations are
permitted. Interested readers should refer to Gallant (1977), Gallant and
Jorgenson (1979), and Gallant and Holly (1980).
For the case of nonlinearity in the parameters, a nonlinear 3SLS and a
nonlinear full information instrumental variables estimator are proposed by
Hausman (1975). A system of M simultaneous equations with nonlinearity only in the parameters can be written as

YΓ(α) + XB(α) = E

or

y_i = Y_iγ_i(α) + X_iβ_i(α) + e_i,    i = 1, 2, . . . , M

where Γ(α) and B(α) denote matrices, and γ_i(α) and β_i(α) denote vectors, with each element a function of a (g × 1) parameter vector α. The system can be written compactly as

y = Zδ(α) + e

where δ(α) = (δ_1'(α)  δ_2'(α)  · · ·  δ_M'(α))' and δ_i(α) = (γ_i'(α)  β_i'(α))'. An example of
such a system is a linear simultaneous equations model with autoregressive
errors. Other examples are partial adjustment or distributed lag models con-
taining a desired stock that is a function of structural parameters. The log likelihood function of the elements of (α, Σ) is
and Z(∂δ(α)/∂α')|_{α^{(k)}} is a (TM × g) matrix of derivatives evaluated at the kth estimate of α. The elements of W^{(k)} are

Z = diag(Z_1, Z_2, . . . , Z_M)

Z_i = [X(B(α)Γ(α)^{-1})_i  X_i]

and
Repeat steps 2 through 4 until α^{(k)} converges. The result is the NLFIML estimator α̂_{NLFIML}. A consistent estimate of the covariance matrix of α̂_{NLFIML} is [(Z∂δ(α)/∂α')'(S ⊗ I_T)^{-1}(Z∂δ(α)/∂α')]^{-1}.
By analogy, a nonlinear 3SLS or NL3SLS estimator α̂ may be defined as

with instruments

where S is a consistent estimate of Σ. The NL3SLS and FIML estimators are asymptotically equivalent. A consistent estimator of the asymptotic covariance matrix of α̂ is
In the sampling theory approach, the a priori restrictions are imposed in exact
form, through equalities or inequalities with known coefficients, and certain
parameters are left completely unrestricted. However, the a priori information
provided by economic theory is often not of such an exact nature and might be
better reflected in probabilistic statements. The Bayesian approach allows the
flexibility of stochastic restrictions reflecting the imprecise nature of the prior
information. In estimation, we assume that the parameters of the equations to
be estimated are identified in the sense that the posterior pdf of the parameters
is unique. In the following section we will discuss the single equation, limited
information, and full system Bayesian analysis. For a discussion of Bayesian
inference the reader should review Chapter 4.
Using the notation defined in Section 15.2, consider the ith structural equation of an M-equation model

y_i = Y_iγ_i + X_iβ_i + e_i    (15.8.1)

where the exclusion restrictions (the coefficients γ_i and β_i of the omitted variables set to zero) are the identifying restrictions. The reduced form equations for Y_i are

Y_i = XΠ_i + V_i    (15.8.2)

and for y_i it is

y_i = XΠ_iγ_i + X_iβ_i + v_i    (15.8.3)

where v_i = e_i + V_iγ_i. Thus for given Π_i, say Π_i = Π̂_i = (X'X)^{-1}X'Y_i, (15.8.3) is in the form of a multiple regression model. The Bayesian approach of Chapter 4 is directly applicable.
A result following the approach taken by Dreze (1968) and Zellner (1971) is summarized in the following. Under the assumption that the rows of (v_i V_i) are independent and N(0, Ω_i), the covariance matrix of (e_i V_i), or Ω_i, can be formed and the likelihood function of γ_i, β_i, Π_i, and Ω_i, given y_i and Y_i, can be constructed as

where

(15.8.5)

where p_i = p_i(γ_i, β_i) is the prior pdf for γ_i and β_i. Combining the likelihood and prior pdf we obtain the posterior pdf
By integrating the pdf for γ_i, β_i, Π_i, and Ω_i with respect to the elements of Ω_i, the resulting marginal posterior pdf for γ_i, β_i, and Π_i has the form of a generalized Student t pdf. If Π_i is further integrated out, the resulting marginal posterior pdf for δ_i has the form

where p_i is the prior pdf of γ_i and β_i, a_ii = (v̂_i − V̂_iγ_i)'(v̂_i − V̂_iγ_i), (v̂_i V̂_i) = (y_i Y_i) − X(π̂_i Π̂_i), H = Ẑ_i'Ẑ_i/a_ii, Ẑ_i = (Ŷ_i X_i), and δ̂_i is identical to the 2SLS estimator. If the prior pdf of γ_i and β_i is diffuse, then p_i is a constant and the marginal posterior pdf for δ_i would be in the form of a multivariate t centered at the quantity δ̂_i were it not for the fact that a_ii depends on γ_i. The posterior mean of δ_i is
E(δ_i|y_i, Y_i) = [ E(Π_i'X'XΠ_i)   E(Π_i')X'X_i ]^{-1} [ E(Π_i'X'Xπ_i) ] + remainder    (15.8.8)
                 [ X_i'X E(Π_i)    X_i'X_i      ]       [ X_i'X E(π_i)  ]
If Π_i is given, the conditional posterior pdf for γ_i and β_i is in the multivariate Student t form with mean vector given by

(15.8.9)

which is just the 2SLS estimate. Furthermore, given γ_i, the conditional posterior pdf for β_i is in the multivariate Student t form, which when integrated over β_i yields the marginal posterior pdf of γ_i:

(15.8.10)
In the previous section, we dealt with the posterior pdf for the parameters of a single equation, using just the identifying prior information for those parameters. In this section, we will deal with a joint posterior pdf for the parameters of all structural equations, incorporating the prior identifying information for all parameters of an M-equation system.
If we write the model in the notation of Section 15.2 as

YΓ + XB + E = 0
then, under the assumption that the rows of E are normally and independently distributed, each with zero mean vector and (M × M) covariance matrix Σ, the likelihood function is

(15.8.12)

Zellner (1971) finds that the marginal posterior pdf for Γ and B, after integrating (15.8.13) with respect to the elements of Σ, is

p(Γ, B | data, prior assumptions) ∝ p_1(Γ, B)|Γ'Ω̂Γ|^{T/2} / |(YΓ + XB)'(YΓ + XB)|^{T/2}    (15.8.16)

(15.8.17)

where ŷ = XΠ̂', Z̄ = diag(Z̄_1, Z̄_2, . . . , Z̄_M), Z̄_i = (Ȳ_i X_i), and Ȳ_i = XΠ̂_i. The above results are analogous to the large sample covariance matrix estimator. For details, see Zellner (1971).
In Sections 15.8.1 and 15.8.2 we have presented the highlights of the basic
Bayesian analysis for a system of simultaneous equations utilizing a diffuse
prior pdf for the structural parameters Γ, B, and Σ. For a detailed analysis of
special models and the use of some informative prior pdf's, Monte Carlo exper-
iments, and so on, the reader is referred to Chetty (1968), Harkema (1971),
Zellner (1971), Morales (1971), Richard (1973), Rothenberg (1974), Kaufman
(1974), Zellner and Vandaele (1975), Dreze (1976), van Dijk and Kloek (1977),
and Kloek and van Dijk (1978). In the following section, we will discuss the
further use of a posterior distribution in formulating an estimator.
It is well known that many widely used estimators such as 2SLS, LIML, 3SLS,
FIML, and so on, can fail to possess finite moments as discussed in Sawa
(1972, 1973), Bergstrom (1962), Hatanaka (1973), and others. See also Section
15.4. As a consequence those estimators can have infinite risk relative to qua-
dratic and other loss functions. Further, means of posterior distributions based
on diffuse or natural conjugate prior distributions, often fail to exist. In these
cases, the usual Bayesian point estimate using the posterior mean is not avail-
able.
This section deals with parameter estimators that minimize the posterior
expectation of generalized quadratic loss functions for structural coefficients of
linear structural models. Parameter estimators that minimize posterior ex-
pected loss, called MELO estimators, have at least a finite second moment and
thus finite risk relative to quadratic and other loss functions. We first consider a
single structural equation and then go on to consider joint estimation of param-
eters of a set of structural equations.
As we discussed in Section 15.6, the ith structural equation can be written as

y_i = XΠ_iγ_i + X_iβ_i + v_i
    = (XΠ_i  X_i)δ_i + v_i
    = Z̄_iδ_i + v_i    (15.8.18)

(15.8.19)

If δ̂_i is an estimate of δ_i, let ê = Xπ_i − Z̄_iδ̂_i, and we define the loss function as

(15.8.20)

where Xπ_i = Z̄_iδ_i is used. Thus, the loss function (15.8.20) is quadratic in δ_i − δ̂_i with a positive definite symmetric matrix Z̄_i'Z̄_i as the weight matrix.
Given a posterior pdf for (π_i, Π_i) that possesses finite first and second moments, the value of δ_i, say δ̄_i, that minimizes the posterior expected loss

(15.8.21)

is

δ̄_i = (EZ̄_i'Z̄_i)^{-1}EZ̄_i'Xπ_i
    = [ E(Π_i'X'XΠ_i)   E(Π_i')X'X_i ]^{-1} [ E(Π_i'X'Xπ_i) ]    (15.8.22)
      [ X_i'X E(Π_i)    X_i'X_i      ]       [ X_i'X E(π_i)  ]
The above result is quite general in that any posterior pdf may be employed in evaluating the posterior expectation. In a specific case where a diffuse prior pdf for Π and Ω is used, or

(15.8.23)

the marginal posterior pdf for Π is in the following matrix Student t form (see Zellner, 1971, p. 229):

(15.8.24)

where

S̄ = (Y − XΠ̂)'(Y − XΠ̂)/(v − 2)

with submatrices s̄_11, s̄_12, and so on.
or

w = Z̄δ    (15.8.25)

L = (w − Z̄δ̂)'Q(w − Z̄δ̂)
  = (δ − δ̂)'Z̄'QZ̄(δ − δ̂)    (15.8.26)

where Q is a positive definite symmetric matrix that may have elements that are parameters with unknown values.
Given a posterior pdf for the reduced form coefficients and elements of Q, the posterior expectation E(L) can be minimized to find the general MELO estimator of δ to be

(15.8.27)

where the entry in the first pair of curly braces is a typical submatrix, i, j = 1, 2, . . . , M, and in the second pair, a typical subvector, ŷ_i = Xπ̂_i, ω^{ij} is the (i,j)th element of Ω^{-1}, Ω_{ij} is a submatrix of Ω that is equal to the sampling covariance between corresponding rows of V_i = Y_i − XΠ_i and V_j = Y_j − XΠ_j, and ω_{ij} is a vector of sampling covariances between elements of v_i = y_i − Xπ_i and corresponding rows of V_i = Y_i − XΠ_i. For the properties of MELO estimators, the reader is referred to Zellner (1978), Zellner and Park (1979), and Park (1982).
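As an illustration of the single-equation result (15.8.22), the posterior expectations can be evaluated by averaging over draws from the posterior of (π_i, Π_i). The sketch below assumes such draws are available (for example, from a matrix-t posterior); the function and argument names are ours, not the text's.

```python
import numpy as np

def melo_single_equation(X, Xi, pi_draws, Pi_draws):
    """MELO estimate (15.8.22) from posterior draws of (pi_i, Pi_i).

    X        : (T, K) matrix of all exogenous variables
    Xi       : (T, k_i) included exogenous variables of equation i
    pi_draws : (R, K) posterior draws of pi_i
    Pi_draws : (R, K, m) posterior draws of Pi_i
    Returns the MELO estimate of delta_i = (gamma_i', beta_i')'.
    """
    R = pi_draws.shape[0]
    m, ki = Pi_draws.shape[2], Xi.shape[1]
    A = np.zeros((m + ki, m + ki))      # posterior mean of Zbar'Zbar
    b = np.zeros(m + ki)                # posterior mean of Zbar'X pi_i
    for r in range(R):
        Pi, pi = Pi_draws[r], pi_draws[r]
        Zbar = np.hstack([X @ Pi, Xi])  # Zbar_i = (X Pi_i, X_i)
        A += Zbar.T @ Zbar / R
        b += Zbar.T @ (X @ pi) / R
    return np.linalg.solve(A, b)
```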
d_t = x_t'α + u_t    (15.9.1)
s_t = z_t'β + v_t    (15.9.2)
q_t = min(d_t, s_t)    (15.9.3)
Δp_t = γ(d_t − s_t)    (15.9.4)

(15.9.5)
Similarly, in the periods with falling prices, there will be excess supply and thus
the observed quantity will equal the demand. Consequently, the demand func-
tion (15.9.1) can be estimated using the observed quantity as the dependent variable. In this case, equation (15.9.4) can be used to write (15.9.2) as

(15.9.6)
(15.9.7a)

where

g_t = Δp_t  if Δp_t > 0,  and  g_t = 0  otherwise    (15.9.7b)

and

(15.9.8a)

where

h_t = Δp_t  if Δp_t < 0,  and  h_t = 0  otherwise    (15.9.8b)
With the model (15.9.7) and (15.9.8), the parameters can be consistently estimated by first regressing g_t and h_t on all the exogenous variables, x_t and z_t, in order to obtain their calculated values ĝ_t and ĥ_t, and then, in the second stage, regressing q_t on x_t and ĝ_t for (15.9.7) and regressing q_t on z_t and ĥ_t for (15.9.8). The 2SLS estimator is consistent but not asymptotically efficient in this model because no restriction is imposed to force the same γ to appear in both equations and, furthermore, ĝ_t and ĥ_t are not, strictly speaking, linear functions of the exogenous variables.
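The two-stage procedure just described is easy to implement. The following sketch assumes the definitions g_t = Δp_t for Δp_t > 0 (zero otherwise) and h_t = Δp_t for Δp_t < 0 (zero otherwise), as reconstructed above; all names are illustrative.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients (X'X)^{-1}X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def disequilibrium_2sls(q, dp, x, z):
    """Two-stage estimation of (15.9.7) and (15.9.8).

    q  : (T,) observed quantities
    dp : (T,) price changes delta p_t
    x  : (T, k_x) demand-side exogenous variables
    z  : (T, k_z) supply-side exogenous variables
    """
    g = np.where(dp > 0, dp, 0.0)        # g_t = dp_t if dp_t > 0, else 0
    h = dp - g                           # h_t = dp_t if dp_t < 0, else 0
    W = np.hstack([x, z])                # all exogenous variables
    g_hat = W @ ols(W, g)                # first-stage fitted values
    h_hat = W @ ols(W, h)
    demand_coefs = ols(np.column_stack([x, g_hat]), q)  # alpha and g coefficient
    supply_coefs = ols(np.column_stack([z, h_hat]), q)  # beta and h coefficient
    return demand_coefs, supply_coefs
```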
Amemiya (1974b) proposes the following iterative method of obtaining the maximum likelihood estimator. Since in period A when d_t > s_t, the conditional density of Δp_t given q_t is N(γ(q_t − z_t'β), γ²σ_v²), and in period B when s_t > d_t, the conditional density of Δp_t given q_t is N(γ(q_t − x_t'α), γ²σ_u²), the log likelihood function is
Taking the derivatives with respect to the parameters and setting them to zero yields a set of conditions that are to be solved jointly for the parameters. Amemiya (1974b) shows that the equations for α and β are the same as the LS estimator of α and β given γ applied to (15.9.7a) and (15.9.8a), respectively. The equations for σ_u² and σ_v² are the residual sums of squares of equations (15.9.7a) and (15.9.8a), given γ, divided by T as for the usual ML estimators, and the equation for γ is

Tγ + (1/σ_v²) Σ_A (q_t − (1/γ)Δp_t − z_t'β)Δp_t − (1/σ_u²) Σ_B (q_t + (1/γ)Δp_t − x_t'α)Δp_t = 0    (15.9.10)
which, after multiplication by γ, is a quadratic function in γ. Thus, the parameter estimates can be solved for by using the following iterative procedure (a schematic implementation follows the steps below):
Step 1 Use the 2SLS estimates of α, β, σ_u², and σ_v² as the initial estimates.
Step 2 Substitute the estimates α̂, β̂, σ̂_u², and σ̂_v² into (15.9.10) and solve for the positive root of γ, γ̂.
Step 3 Use γ̂ in (15.9.7a) and (15.9.8a) to obtain least squares estimates of α, β, σ_u², and σ_v².
15.10 SUMMARY
In this chapter, conventional methods of estimation including ILS, 2SLS, 3SLS, LIML, and FIML
estimators are presented. The methods for nonlinear models are heavily based
on Amemiya's work and the Bayesian methods are basically Zellner's results.
A recent advancement in the MELO approach, which makes use of the poste-
rior pdf in evaluating the expectation of a loss function, is included as a part of
Bayesian results. Reduced form parameter estimation methods that incorpo-
rate overidentifying restrictions are also discussed. In comparing the relative
merits of estimators, some characteristics of estimators within a certain family
are identified. The special characteristics that we considered are the value of k
in the k-class family, the construction of instrumental variables, and the numer-
ical solution algorithm used in approximating the full information maximum
likelihood estimator. The comparisons of sampling properties are mainly of an
asymptotic nature. On the small sample properties of estimators, a literature
review is given and the highlights of the research regarding the derivation of
exact densities and/or distributions, the existence of moments, and their ap-
proximations, are discussed. With respect to statistical inference, we have
discussed three types of hypothesis tests, namely tests of parameter signifi-
cance, overidentification, and the orthogonality assumption. Finally, a special
model of a single market disequilibrium is discussed with its special economet-
ric problems. There are many other interesting topics related to simultaneous
equations in recent advancements of econometrics, that are not included in this
volume. Some of the neglected topics are two-stage least absolute deviation
estimators, an indirect least squares estimator using the Moore-Penrose inverse
for overidentified equations, the extension of 3SLS for a singular covariance
matrix, tests of serial correlation, heteroscedasticity and normality for simulta-
neous equation errors, problems in simultaneous forecasts, undersized and
incomplete samples and the use of auxiliary regression, and treatments of
vector autoregressive errors, and systems of simultaneous differential equations in the time domain or in the frequency domain. Models with truncated normal errors, binary exogenous variables and the use of switching regimes, and binary endogenous variables are discussed in Chapter 18, as are multiple logit and generalized probit models. Some of the other topics are also discussed in later
chapters. Other topics related to simultaneous equations, such as multiple time
series and the rational expectations hypothesis, are discussed in Chapter 16.
15.11 EXERCISES
Exercise 15.1
Let ỹ = Z̃δ + ẽ, where Z̃ = PZ, ẽ = Pe, ỹ = Py, and P = (I_M ⊗ X(X'X)^{-1}X'). Show that δ̂ = (Z̃'Z̃)^{-1}Z̃'ỹ is the 2SLS estimator.
Exercise 15.2
Let y = Zδ + e, Z̃ = P*Z, and P* = (Σ̂^{-1} ⊗ X(X'X)^{-1}X'). Show that δ̂* = (Z̃'Z)^{-1}Z̃'y is the 3SLS estimator.
Exercise 15.3
Show that when an equation is just identified, its 2SLS estimator is identical to
the indirect least squares estimator.
Exercise 15.4
Show that if all structural equations are just identified, the 3SLS and 2SLS
estimators have the same asymptotic covariance matrix.
Exercise 15.5
Let Z_i = (Y_i X_i), Ẑ_i = (Ŷ_i X_i), Ŷ_i = X(X'X)^{-1}X'Y_i, y = Zδ + e, Z = diag(Z_1, Z_2, . . . , Z_M), and Ẑ = diag(Ẑ_1, Ẑ_2, . . . , Ẑ_M). Show that Ẑ = (I_M ⊗ X(X'X)^{-1}X')Z and ŷ = (I_M ⊗ X(X'X)^{-1}X')y.
Exercise 15.6
Let y = Zδ + e, with estimated error covariance (Σ̂ ⊗ I_T), where Z̃ = diag(Z̃_1, Z̃_2, . . . , Z̃_M), Z̃_i = (Ỹ_i X_i), Ỹ_i = XΠ̂_i, Π̂_i = (X'X)^{-1}X'Y_i, and ỹ = XΠ̂. Show that δ̂* = (Z̃'(Σ̂^{-1} ⊗ I_T)Z̃)^{-1}Z̃'(Σ̂^{-1} ⊗ I_T)y is the 3SLS estimator of δ in y = Zδ + e.
Exercise 15.7
Show that minimizing ℓ = (γ_i'W_iγ_i)/(γ_i'Wγ_i) is equivalent to finding the smallest (characteristic) root of |W_i − ℓW| = 0, where W_i = Y_i'[I − X_i(X_i'X_i)^{-1}X_i']Y_i and W = Y_i'[I − X(X'X)^{-1}X']Y_i.
Exercise 15.8
Since the reduced form equation of y_i can be expressed as y_i = Y_iγ_i + X_iβ_i + e_i = (XΠ_i  X_i)δ + v_i, where v_i = e_i + V_iγ_i, Y_i = XΠ_i + V_i, and δ = (γ_i'  β_i')', show that the structural parameter δ can be expressed as a function of the reduced form coefficients π_i and Π_i as follows:
Exercise 15.9
Let W_i = (Y_i − kV̂_i  X_i), where V̂_i = Y_i − Ŷ_i, and Ŷ_i = X(X'X)^{-1}X'Y_i. Show that δ̂_i = (W_i'Z_i)^{-1}W_i'y_i is the k-class estimator for δ_i in y_i = Z_iδ_i + e_i.
Exercise 15.10
In Hausman's instrumental variable interpretation of FIML and 3SLS, show
that when all the structural equations are just identified, the instruments are
identical.
Exercise 15.11
In Hendry's derivation of the estimator generation equation, show that
(Γ̂^{-1}  0) = Σ^{-1}A'W'W/T holds given that T(Γ')^{-1} − Y'WAΣ^{-1} = 0 and −X'WAΣ^{-1} = 0.
Exercise 15.12
Consider the model YΓ + XB + E = 0. Let

Γ = [  −1     0.2     0 ]
    [ −10     −1      2 ]
    [ 2.5      0     −1 ]

and, equivalently to the assumed Σ,

Ω = [   4     1.26  −1.52 ]
    [ 1.26      1   −0.69 ]
    [ −1.52  −0.69     9  ]

Consistent with the covariance structure Σ, the reduced form equation errors v_1, v_2, and v_3 are simulated and a sample is listed in Table 15.2 together with the fixed values of x_1, x_2, x_3, x_4, and x_5.
1. Find the reduced form equations with the particular assumed structural
coefficients.
2. Calculate the systematic part of Y, using the fixed value of X.
3. Generate a sample of data Y by adding the systematic part of Y and the
reduced form equation errors listed in Table 15.2.
4. Generate your own samples of data Y by adding to the systematic part of Y
different sets of equation errors that are consistent with your own assumed
values of Σ and Ω. (Hint: See p. 810 in Judge et al. (1982) for the method
of generating multivariate normal random variables.)
Exercise 15.13
Using the data generated from Exercise 15.12, calculate the following estimators for each equation: (a) OLS, (b) 2SLS, (c) 3SLS, (d) LIML, and (e) FIML.
Exercise 15.14
Consider the following disequilibrium model:
d_t = α_0 + α_1 p_{t−1} + α_2 y_t + u_t
s_t = β_0 + β_1 p_{t−1} + β_2 w_t + v_t
q_t = min(d_t, s_t)
p_t = p_{t−1} + γ(d_t − s_t)

where u_t and v_t are equation errors, the α's, β's, and γ are parameters to be estimated, and the values of t, y_t, w_t, u_t, and v_t are given in Table 15.3.
1. Using the given information, generate the data for p_t, d_t, s_t, and q_t.
2. Using your own equation errors u_t and v_t that are consistent with your own assumptions, generate the data for p_t, d_t, s_t, and q_t.
Exercise 15.15
Pretending that d_t and s_t are not observed and using the data q_t and p_t generated in Exercise 15.14 together with the fixed values of y_t and w_t, estimate the parameters α's, β's, and γ by the (a) 2SLS procedure and (b) iterative maxi-
mum likelihood procedure.
Exercise 15.16
Using the following data on q_t and p_t (Table 15.4) together with y_t and w_t of Table 15.3, estimate the parameters of the disequilibrium model of Exercise 15.14 by (a) 2SLS and (b) the iterative maximum likelihood procedure.
TABLE 15.4  OBSERVED VALUES OF p_t AND q_t

t     p_t     q_t
1    39.11   11.33
2    36.03   21.98
3    33.43   32.53
4    31.67   40.65
5    30.30   47.50
6    29.26   52.07
7    28.42   58.54
8    28.13   65.12
9    27.92   69.51
10   28.03   71.64
11   28.44   68.32
12   28.78   73.68
13   29.34   70.85
14   29.95   74.64
15   30.62   74.55
16   31.37   74.81
17   32.07   80.28
18   32.87   78.52
19   33.66   81.36
20   34.50   82.78
15.12 REFERENCES
Nagar, A. L. and S. Sahay (1978) "The Bias and Mean Squared Error of
Forecasts From Partially Restricted Reduced Forms," Journal of Economet-
rics, 7, 227-244.
Nakamura, A. and M. Nakamura (1981) "On the Relationships Among Several
Specification Error Tests Presented by Durbin, Wu, and Hausman," Econo-
metrica, 49, 1583-1588.
Pagan, A. (1979) "Some Consequences of Viewing LIML as an Iterated Aitken
Estimator," Economics Letters, 3, 369-372.
Park, S. B. (1982) "Some Sampling Properties of Minimum Expected Loss
(MELO) Estimators of Structural Coefficients," Journal of Econometrics,
18, 295-311.
Phillips, P. C. B. (1980a) "Finite Sample Theory and the Distributions of Alter-
native Estimators of the Marginal Propensity to Consume," Review of Eco-
nomic Studies, 47, 183-224.
Phillips, P. C. B. (1980b) "The Exact Distribution of Instrumental Variable
Estimators in an Equation Containing n + 1 Endogenous Variables,"
Econometrica, 48, 861-878.
Phillips, P. C. B. (1983) "Exact Small Sample Theory in The Simultaneous
Equation Model," Chapter 8 of Handbook of Econometrics, Vol. I Edited by
Z. Griliches and M. D. Intriligator, North-Holland, Amsterdam, 451-516.
Prucha, I. R. and H. H. Kelejian (1984) "The Structure of Simultaneous Equation Estimators: A Generalization Towards Nonnormal Disturbances," Econometrica, 52, 721-736.
Quandt, R. E. (1978) "Tests of Equilibrium vs. Disequilibrium Hypotheses,"
International Economic Review, 19, 435-452.
Quandt, R. E. and H. S. Rosen (1978) "Estimation of a Disequilibrium Aggre-
gate Labor Market," Review of Economics and Statistics, 60, 371-379.
Raduchel, W. J. (1978) "A Note on Non-Linear Limited-Information Maxi-
mum-Likelihood," Journal of Econometrics, 7, 119-122.
Richard, J. F. (1973) Posterior and Predictive Densities for Simultaneous
Equation Models, Springer-Verlag, Berlin.
Richardson, D. H. (1968) "The Exact Distribution of a Structural Coefficient
Estimator," Journal of the American Statistical Association, 63, 1214-1226.
Richardson, D. H. and D. M. Wu (1971) "A Note on the Comparison of
Ordinary and Two-Stage Least Squares," Econometrica, 39, 973-982.
Rothenberg, T. J. (1974) "Bayesian Analysis of Simultaneous Equations
Models," Section 9.3 of Studies in Bayesian Econometrics and Statistics,
S. E. Fienberg and A. Zellner, eds., 405-424.
Roy, A. R. and V. K. Srivastava (1966) "Generalized Double K-Class Estima-
tors," Journal of the Indian Statistical Association, 4, 38-46.
Sant, D. (1978) "Partially Restricted Reduced Forms: Asymptotic Relative
Efficiency," International Economic Review, 19, 739-748.
Sargan, J. D. (1976) "Econometric Estimators and the Edgeworth Approxima-
tion," Econometrica, 44, 421-448.
Sargan, J. D. and W. M. Mikhail (1971) "A General Approximation to the Distribution of Instrumental Variable Estimates," Econometrica, 39, 131-169.
Sawa, T. (1969) "The Exact Sampling Distribution of Ordinary Least Squares and Two Stage Least Squares Estimators," Journal of the American Statistical Association, 64, 923-937.
Sawa, T. (1972) "Finite-Sample Properties of the k-Class Estimators," Econo-
metrica, 40, 653-680.
Sawa, T. (1973) "Almost Unbiased Estimator in Simultaneous Equations Sys-
tems," International Economic Review, 14, 97-106.
Savin, N. E. (1973) "Systems k-Class Estimators," Econometrica, 41, 1125-
1136.
Scharf, W. (1976) "K-Matrix Class Estimators and the Full Information Maxi-
mum-Likelihood Estimator as a Special Case," Journal of Econometrics, 4,
41-50.
Schmidt, P. (1976) Econometrics, Marcel Dekker, New York.
Sowey, E. R. (1973) "A Classified Bibliography of Monte Carlo Studies in
Econometrics," Journal of Econometrics, 1, 377-395.
Spencer, D. E. and K. N. Berk (1981) "A Limited Information Specification
Test," Econometrica, 49, 1079-1085.
Srivastava, V. K. (1971) "Three Stage Least-Squares Generalized Double K-
Class Estimators: A Mathematical Relationship," International Economic
Review, 12, 312-316.
Swamy, P. and J. Mehta (1980) "On the Existence of Moments of Partially
Restricted Reduced Form Coefficients," Journal of Econometrics, 14, 183-194.
Swamy, P. and J. Mehta (1981) "On the Existence of Moments of Partially
Restricted Reduced Form Estimators: A Comment," Journal of Economet-
rics, 17, 389-392.
Taylor, W. E. (1983) "On the Relevance of Finite Sample Distribution The-
ory," Econometric Reviews, 2, 1-139.
Theil, H. (1958) Economic Forecasts and Policy, North-Holland, Amsterdam
(second edition, 1961).
Van Dijk, H. K. and T. Kloek (1977) "Predictive Moments of Simultaneous
Econometric Models, A Bayesian Approach," in New Developments in the
Applications of Bayesian Methods, A. Aykac and C. Brumat, eds., North-
Holland, Amsterdam.
Wegge, L. L. (1978) "Constrained Indirect Least Squares Estimators," Econo-
metrica, 46, 435-499.
Wu, D. (1973) "Alternative Tests of Independence Between Stochastic Regres-
sors and Disturbances," Econometrica, 41, 733-750.
Wu, D. (1974) "Alternative Tests of Independence Between Stochastic Regres-
sors and Disturbances: Finite Sample Results," Econometrica, 42, 529-546.
Zellner, A. (1971) An Introduction to Bayesian Inference in Econometrics,
Wiley, New York.
Zellner, A. (1978) "Estimation of Functions of Population Means and Regres-
sion Coefficients Including Structural Coefficients: A Minimum Expected
Loss (MELO) Approach," Journal of Econometrics, 8, 127-158.
Zellner, A., D. S. Huang and L. C. Chau (1965) "Further Analysis of the Short
Run Consumption Function with Emphasis on the Role of Liquid Assets,"
Econometrica, 33, 571-581.
[Chapter overview diagram: Estimation (Section 16.5), divided into time domain estimation (Section 16.5.1) and frequency domain estimation (Section 16.5.3).]
16.1 STATIONARY VECTOR STOCHASTIC PROCESSES
Obviously, these conditions are completely analogous to those given for the
univariate case in Chapter 7. Also, we define strict stationarity in the same way
as in Chapter 7, and a vector stochastic process is called Gaussian if it is
stationary and all finite subsequences Yt, . . . ,Yt+h have a joint multivariate
normal distribution. We now discuss some examples of stationary stochastic
processes.
where

Θ_n = [ θ_{11,n}  · · ·  θ_{1k,n} ]
      [    ⋮                ⋮    ] ,    n = 1, . . . , p
      [ θ_{k1,n}  · · ·  θ_{kk,n} ]

A_n = [ a_{11,n}  · · ·  a_{1k,n} ]
      [    ⋮                ⋮    ] ,    n = 1, . . . , q
      [ a_{k1,n}  · · ·  a_{kk,n} ]
(16.1.3)
(16.1.5)
where
and
and L is the lag operator defined such that L^n y_t = y_{t−n} (see Chapter 7). The vector ARMA process (16.1.5) is stationary if
where M(L) = Θ_p(L)^{-1}A_q(L) and the coefficient matrices M_i can be evaluated using the following recursions:

M_0 = I
M_1 = A_1 − Θ_1
M_2 = A_2 − Θ_2 − Θ_1M_1
  ⋮
M_i = −Σ_{j=1}^{min(i,p)} Θ_j M_{i−j},    i > q    (16.1.8)
h > 0    (16.1.9)

(16.1.10)

(16.1.11)

(16.1.12)
Since both |Θ_p(L)| and the adjoint operator Θ_p^a(L) are finite-order operators, (16.1.12) is another finite order ARMA representation of y_t; that is, (16.1.12) represents the same autocovariance structure as (16.1.5). Interestingly, |Θ_p(L)| is a (1 × 1) operator and thus, the same AR operator is applied to each component of y_t in (16.1.12). This result has received considerable attention in the literature and has been used in practical multiple time series analyses [e.g., Zellner and Palm (1974), Wallis (1977), Chan and Wallis (1978)].
The problem of nonuniqueness of the ARMA representation of a stochastic
process-that is, the problem that two different vector ARMA models may
represent processes with identical covariance structure-corresponds of
course to the identification problem discussed in Chapter 14 in the context of
simultaneous equations models. In the multiple time series context this prob-
lem was raised by Quenouille (1957) in his pioneering book on the subject.
Subsequently, conditions for the uniqueness of ARMA representations have
been developed [e.g., Hannan (1969, 1970, 1976, 1979), Deistler and Hannan
(1981)]. The following conditions are due to Hannan [see Priestley (1981, Sec-
tion 10.5) and Granger and Newbold (1977, Section 7.3)].
Suppose the model is of the general form (16.1.5) and (16.1.6) and (16.1.10) are satisfied. Assume further that Θ_p(L) and A_q(L) have no common left divisors; that is,
1. Θ_p(L) is upper triangular and deg θ_{ii}(L) > deg θ_{ij}(L), i = 1, . . . , k and j = 1, . . . , k, where deg θ(L) denotes the degree of the operator θ(L).
2. The AR operator Θ_p(L) is a scalar [(1 × 1) matrix] and the pth coefficient Θ_p ≠ 0.
3. Out of the class of all equivalent ARMA models choose q to be minimal and then choose p to be minimal. Then if the matrix [Θ_p  A_q] has full rank, uniqueness of the representation is guaranteed.
and Σ̄(h) is the MSE matrix of any other linear h-step ahead predictor at time T. Similar to the univariate case it can be shown that

and so on. If the infinite AR representation (16.1.11) is given, the optimal h-step forecast is

(16.1.19)
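For a finite-order AR process, the optimal h-step forecast can be computed recursively by replacing unknown future values with their forecasts. A minimal sketch, assuming the representation y_t = Θ_1 y_{t−1} + · · · + Θ_p y_{t−p} + v_t (our naming, not the text's):

```python
import numpy as np

def ar_forecast(y, theta, h):
    """Recursive h-step ahead forecasts for a vector AR(p) process.

    y     : (T, k) array of observations (most recent last)
    theta : list of (k, k) coefficient matrices Theta_1, ..., Theta_p
    """
    p = len(theta)
    hist = [y[-j] for j in range(1, p + 1)]   # y_T, y_{T-1}, ..., y_{T-p+1}
    forecasts = []
    for _ in range(h):
        fc = sum(theta[j] @ hist[j] for j in range(p))
        forecasts.append(fc)
        hist = [fc] + hist[:-1]               # unknown values replaced by forecasts
    return np.array(forecasts)
```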
16.2 DYNAMIC
SIMULTANEOUS EQUATIONS MODELS
where v_1t and v_2t are assumed to be independent white noise processes. Equation 16.2.2 obviously looks very much like the systems of simultaneous equations discussed in Chapters 14 and 15, the difference being that the lagged endogenous and exogenous variables are expressed by means of matrix polynomials in the lag operator and the error terms e_t = A_11(L)v_1t are serially correlated. Models of the form (16.2.2) are sometimes called ARMAX models. There has been some discussion of the exogeneity of variables, and the above is but one possible use of the term. We will return to this topic in Section 16.2.3.
Written without the lag operator, (16.2.2) is

where it has been used that Θ_11(L) and Θ_12(L) do not necessarily have the same order. In this case these operators are assumed to have the orders p and r, respectively. Equation 16.2.4 is a structural form of a system of simultaneous equations. The corresponding reduced form is
(16.2.6)
[e.g., Zellner and Palm (1974), Prothero and Wallis (1976), Wallis (1977), Har-
vey (1981, Chapter 9)]. In this form the infinite matrix polynomial
Θ(L) = Σ_{j=0}^{∞} Θ_j L^j = −Θ_11(L)^{-1}Θ_12(L)
16.2.1a An Example
Zellner and Palm (1974) postulate the following dynamic model for quarterly
U.S. data:
c_t = β + γ(L)y_t + e_1t    (16.2.7a)
s_t = δ + θ(L)(c_t + i_t) + e_2t    (16.2.7b)
y_t = c_t + i_t − s_t    (16.2.7c)
In matrix notation the system consisting of (16.2.7a) and the equation obtained by substituting (16.2.7b) into (16.2.7c) can be solved as

[ c_t ]   [  1    −γ_0 ]^{-1} ( [ β  ]   [ γ_1y_{t−1} + · · · + γ_py_{t−p} ]   [  0   ]        [  e_1t ] )
[ y_t ] = [ −θ_0    1  ]      ( [ −δ ] + [ θ_1c_{t−1} + · · · + θ_rc_{t−r} ] + [ θ(L) ] i_t +  [ −e_2t ] )    (16.2.10)

The final form is obtained from (16.2.10) as

[ c_t ]                        [  1    γ(1) ] [ β  ]                        [ γ(L)θ(L) ]
[ y_t ] = [1/(1 − θ(1)γ(1))]   [ θ(1)    1  ] [ −δ ] + [1/(1 − γ(L)θ(L))]   [   θ(L)   ] i_t + u_t    (16.2.11)

where

u_t = [1/(1 − γ(L)θ(L))] [  1    γ(L) ] [  e_1t ]
                         [ θ(L)    1  ] [ −e_2t ]
where θ_{11,1}(L) and z_1t are (g_1 × 1), θ_{11,2}(L) and z_2t are (g_2 × 1), θ_{12,1}(L) and x_1t are (l_1 × 1), and θ_{12,2}(L) and x_2t are (l_2 × 1), the variables z_2t and x_2t are said to be excluded if θ_{11,2}(L) and θ_{12,2}(L) are zero. Partitioning the [(g_1 + g_2) × (g_1 + g_2)] matrix Θ_11(L) and the [(g_1 + g_2) × (l_1 + l_2)] matrix Θ_12(L) conformably, we get

[Θ_11(L)  Θ_12(L)] = [ θ_{11,1}'(L)   0'            θ_{12,1}'(L)   0'          ]    (16.2.13)
                     [ Θ_{11,1}(L)   Θ_{11,2}(L)   Θ_{12,1}(L)   Θ_{12,2}(L) ]
In this case a generalization of the rank condition states that the first equation of the model is identified if and only if the matrix [Θ_{11,2}(L)  Θ_{12,2}(L)] has rank g_1 + g_2 − 1, where the rank is defined to be the order of the largest submatrix whose determinant does not vanish identically for all L. The order condition states that for the first equation to be identified the number of excluded exogenous variables, l_2, has to be greater than or equal to g_1 − 1, the number of included endogenous variables minus one.
Note that without any normalization (16.2.12) is still not unique even if the rank condition is satisfied, since multiplication by any polynomial η(L) will result in an equivalent structure. However, uniqueness can be ensured by requiring that no further cancellation is possible and that the zero order coefficient of the first endogenous variable is 1.
Occasionally exclusion restrictions of the type discussed by Hatanaka will
not suffice to decide the identifiability of an equation. An example is Equation
16.2.8b of the example model. Other conditions have been discussed by Deist-
ler (1976, 1978), Deistler and Schrader (1979), and others. For the case of
white noise errors e t see also Chapter 14.
(16.2.14)

Using (16.2.3)

where z_1t represents quantity, z_2t is the market price in the tth period, z*_2t is the unobservable market price expected to prevail during the tth period on the basis of information available through the (t − 1)st period, x_t is income, and e_1t and e_2t are stochastic disturbance terms. All variables are measured as deviations from equilibrium values.
Equation 16.2.18a can be rewritten as

(16.2.19)

Using

(16.2.20)

Assuming that the disturbances e_1t and e_2t are nonautocorrelated and independent of Ω_{t−1}, and have zero means, we obtain the rational expectation

(16.2.21)

where the conditional expected value is the optimal 1-step ahead predictor of the exogenous variable. If we assume that Ω_{t−1} contains only z_1s, z_2s, and x_s,

(16.2.22)

(16.2.23)
where

z*_2t = [β_2/(1 − α_1)] Σ_{l≥0} δ^l x_{t−1−l}
Further substitution of the above equation into (16.2.18a) and (16.2.19) results in

and

A good overview of this area is provided by Lucas and Sargent (1981), where a very good introduction to the topic and many further references are also given.
(16.2.24)

where Ω_t\{x_s|s ≤ t} denotes the set of elements of Ω_t that are not in {x_s|s ≤ t}, and x_t causes z_t instantaneously if

(16.2.25)

If (16.2.24) is not true, z_t is not caused by x_t, and if (16.2.25) does not hold, then z_t is not caused instantaneously by x_t. If x_t causes z_t and z_t causes x_t, then y_t' = (z_t', x_t') is a feedback system.
Assuming y_t' = (z_t', x_t') is generated by the ARMA process (16.2.1) with invertible operators A_11(L) and A_22(L), the process has an AR representation

where v_1t and v_2t are independent white noise processes. Here transformations similar to those in Section 16.1.1 have been used. Using (16.1.18) the optimal 1-step ahead forecasts of z_t and x_t are

(16.2.27)

and

(16.2.28)

That is, we assume z_t and x_t contain all relevant information in the universe. In other words, the exogenous variables are not Granger-caused by the endogenous variables.
More generally, it can be shown that for a Gaussian process (z_t', x_t')' with conformably partitioned MA representation,

[ z_t ]   [ v_1t ]   [ M_{11,1}  M_{12,1} ] [ v_{1,t−1} ]
[ x_t ] = [ v_2t ] + [ M_{21,1}  M_{22,1} ] [ v_{2,t−1} ] + · · ·    (16.2.31)
forecasts." This is appropriate if (z_t', x_t') is assumed Gaussian but may clearly be restrictive for non-Gaussian processes. Third, one could question the mean square forecasting error as a measure of forecasting accuracy. As mentioned earlier, in addition to these more technical problems, the fundamental objection has been raised against the concept of Granger-causality that reference is made only to the concept of "predictability" and not to the concept of "cause and effect" in an acceptable philosophical sense. To demonstrate the problems involved more clearly, a simple example may be helpful.
Consider the two-dimensional Gaussian AR(1) process

[ z_t ]   [ θ_11    0   ] [ z_{t−1} ]   [ v_1t ]                  [ σ_1²   σ_12 ]
[ x_t ] = [  0     θ_22 ] [ x_{t−1} ] + [ v_2t ],    Σ_v =        [ σ_12   σ_2² ]    (16.2.32)
By the foregoing discussion we know that x_t does not cause z_t and z_t does not cause x_t in Granger's sense. However, the process can be written equivalently as

where λ is a constant, u_1t = v_1t − λv_2t, and u_2t = v_2t [see (16.1.15)]. Clearly, if this is the appropriate characterization of the relationship between z_t and x_t, then variations in x_t have an effect on z_t although, as we have seen above, z_t is not Granger-caused by x_t. This example also shows the possible arbitrariness of multipliers if economic theory is not capable of providing a unique model form.
Further discussions of Granger-causality include Zellner (1979), Caines and
Chan (1975, 1976), Pierce (1977), Price (1979), Skoog (1977), Tjøstheim (1981),
Caines and Sethi (1979), Newbold (1982), Florens and Mouchart (1982a,b),
Chamberlain (1982) and Schwert (1979). Geweke, Meese, and Dent (1983) dis-
cuss and compare various tests for causality.
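Operationally, many such tests are regression based: z_t is regressed on its own lags with and without lags of x_t, and the lags of x_t are tested jointly. The following sketch of such an F test is illustrative only and is not any one of the specific tests discussed in the references above.

```python
import numpy as np
from scipy import stats

def granger_causality_test(z, x, p):
    """F test of the null hypothesis that x does not Granger-cause z.

    z, x : (T,) univariate series
    p    : lag length
    """
    T = len(z)
    y = z[p:]                                            # dependent variable
    lags_z = np.column_stack([z[p - j:T - j] for j in range(1, p + 1)])
    lags_x = np.column_stack([x[p - j:T - j] for j in range(1, p + 1)])
    ones = np.ones((T - p, 1))
    X_r = np.hstack([ones, lags_z])                      # restricted model
    X_u = np.hstack([ones, lags_z, lags_x])              # unrestricted model
    rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    rss_r, rss_u = rss(X_r), rss(X_u)
    df1, df2 = p, T - p - X_u.shape[1]
    F = ((rss_r - rss_u) / df1) / (rss_u / df2)
    p_value = 1 - stats.f.cdf(F, df1, df2)
    return F, p_value
```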
Engle, Hendry, and Richard (1983) emphasize that the concepts of Granger
noncausality and exogeneity serve different purposes. They characterize a
variable as weakly exogenous if inference for a set of parameters can be made
conditionally on that variable without loss of information. Thus, this notion of
exogeneity is not explicitly based on a variable's importance for prediction. If a
variable is weakly exogenous and is not Granger-caused by any of the endoge-
nous variables, then it is called strongly exogenous by Engle, Hendry, and
Richard.
16.3 AGGREGATION OF
VECTOR ARMA PROCESSES
Most economic data are aggregates. Examples are not only macroeconomic variables like GNP, but also microeconomic data like the monthly sales figure of a department store, which is the sum of all individual items sold during a month. Furthermore, the yearly sales figure is the sum of all monthly sales
figures. These examples contain two types of aggregation: contemporaneous
aggregation and temporal aggregation. Very often both types of aggregation
arise simultaneously and it seems desirable to analyze the consequences
jointly. Nevertheless, we will first consider contemporaneous aggregation and
then show how the results can be extended to variables subject to both types of
aggregation. This section is mainly based on Lütkepohl (1984a, 1984b), where further references are also given. We will only consider aggregation of discrete time processes and refer the reader to Sims (1971) and Geweke (1978b) for
results on temporal aggregation of continuous time processes.
z_t = [ z_1t ] = [ 1 · · · 1   0 · · · 0 ] y_t = Fy_t    (16.3.1)
      [ z_2t ]   [ 0 · · · 0   1 · · · 1 ]
                   l times    (k − l) times

p* ≤ kp    (16.3.2)
and
This result follows from the representation (16.1.12) and the fact that a linear
transformation of a moving average process is a moving average process of at
most the same order. As shown in Section 16.1.1, Zt has various different finite
order ARMA representations. The above result means that the orders of one of
these representations are bounded by the quantities in (16.3.2) and (16.3.3). For
tighter bounds see Lütkepohl (1984a). The process z_t = Fy_t is stationary if y_t is
stationary.
ỹ_τ = [ y_{(τ−1)m+1} ]        ṽ_τ = [ v_{(τ−1)m+1} ]
      [ y_{(τ−1)m+2} ]              [ v_{(τ−1)m+2} ]
      [      ⋮       ]              [      ⋮       ]
      [ y_{(τ−1)m+m} ]              [ v_{(τ−1)m+m} ]

with block-banded AR and MA coefficient matrices of dimensions (Pmk × mk) and (Qmk × mk), built up from I_k, −Θ_1, . . . , −Θ_p and I_k, A_1, . . . , A_q,    (16.3.6)
respectively. Here
and
(16.3.9)
where, for example, for an ARMA(2,1) process and m = 3,

ỹ_τ = [ y_{(τ−1)3+1} ]        ṽ_τ = [ v_{(τ−1)3+1} ]
      [ y_{(τ−1)3+2} ]              [ v_{(τ−1)3+2} ]
      [    y_{3τ}    ]              [    v_{3τ}    ]

Θ_0* = [ I_k   0    0  ]      Θ_1* = [ 0   Θ_2  Θ_1 ]
       [ Θ_1  I_k   0  ]             [ 0    0   Θ_2 ]
       [ Θ_2  Θ_1  I_k ]             [ 0    0    0  ]

A_0* = [ I_k   0    0  ]      A_1* = [ 0    0   A_1 ]
       [ A_1  I_k   0  ]             [ 0    0    0  ]
       [  0   A_1  I_k ]             [ 0    0    0  ]
The macro process ỹ_τ can be used to investigate temporal aggregates of y_t. For example, if y_t is a vector of flow variables and m consecutive vectors are added starting with t = 1, we get a process

where the definition of F is obvious. That is, the resulting process is a linear transformation of ỹ_τ. Similarly, if y_t consists of stock variables and observations are only available every mth period, we get a process

z̃_τ = [0  · · ·  0  I_k]ỹ_τ = Fỹ_τ,    F of dimension (k × mk)    (16.3.12)

and if the variables are averaged over m periods,

ȳ_τ = (1/m)[I_k  I_k  · · ·  I_k]ỹ_τ,    F of dimension (k × mk)    (16.3.13)

Of course, this framework permits us to treat stock and flow variables simultaneously in one system. More generally, to jointly investigate temporal and contemporaneous aggregation problems we can consider the process ȳ_τ = Fỹ_τ, where F is a suitable (l × mk) matrix.
Given this framework, the results of Section 16.3.1a can be used to investigate the properties of the aggregated process ȳ_τ. If y_t is an ARMA(p,q) process, ỹ_τ is the corresponding macro process for a given m, and F ≠ 0 is an (l × mk) matrix, ȳ_τ = Fỹ_τ can be shown to have an ARMA(p*, q*) representation with

p* ≤ kp    (16.3.14)

and

q* ≤ kp + Q    (16.3.15)

q* ≤ kp    (16.3.16)
Using an aggregated model also has implications for the forecasts obtained for the involved variables. Assuming that the model has vector ARMA form we will again begin with the consequences of contemporaneous aggregation. In the remainder of this section we assume that y_t is a k-dimensional stationary
is then used as a forecast for z_{t+h}. This forecast, denoted by ẑ_t(h), can be shown to be optimal given the information set {y_s|s ≤ t}. However, often it is difficult in practice to derive the process y_t from the data, and univariate ARMA models may be constructed for the individual components of y_t. These ARMA processes can be used to forecast the individual components of y_t, giving a predictor that we denote by ỹ_t(h). This forecast in turn can be used to form

z̃_t(h) = Fỹ_t(h)

As a third alternative, if only aggregated data and the aggregated process z_t are available, an optimal linear predictor z̄_t(h) based on {z_s|s ≤ t} can be evaluated. To compare the performance of the three forecasts ẑ_t(h), z̃_t(h), and z̄_t(h), we denote the corresponding mean square error matrices by

and

Σ̄(h) = E[z̄_t(h) − z_{t+h}][z̄_t(h) − z_{t+h}]'    (16.3.21)
As mentioned above, ẑ_t(h) is the best linear forecast. In particular, Σ̃(h) − Σ̂(h) and Σ̄(h) − Σ̂(h) are positive semidefinite for h = 1, 2, . . . . However, it is not possible in general to give a ranking of z̄_t(h) and z̃_t(h). In other words, it may be better to forecast the aggregated process directly rather than forecasting the individual series and aggregating the individual forecasts. Of course, if the individual components of y_t are independent, forecasting them individually results in ỹ_t(h) = ŷ_t(h), so that in this case z̃_t(h) is in general preferable to z̄_t(h). Lütkepohl (1984a) gives conditions for equality of ẑ_t(h), z̃_t(h), and z̄_t(h), and further discussions of the above results include Kohn (1982), Tiao and Guttman (1980), and Wei and Abraham (1981).
Using again the macro process ~T corresponding to Yr and a pre specified
positive integer m, it is easy to extend the above results to the case where both
temporal and contemporaneous aggregation are present. Theoretically, fore-
casting the basic disaggregate process and then aggregating the forecasts turns
out to be best.
Note that the above results also have interesting implications for comparing
forecasts from econometric models and single series ARMA models.
16.4 INTERPRETATION OF
MULTIPLE TIME SERIES MODELS
One objective of building multiple time series models is to better understand the
interrelationships within the considered set of variables Yt. For this purpose
Sargent and Sims (1977) and Sims (1980) suggest tracing the system's response
to so-called typical shocks. For simplicity we assume that y_t is a zero-mean stationary vector ARMA process with MA representation

(16.4.1)

Suppose that the system has been in equilibrium up to time t = 0; that is, y_t = 0 for t < 0. Then a typical shock is a unit increase in one of the variables at t = 0, for instance y_0 = (1, 0, . . . , 0)'. Equivalently, this can be expressed by requiring v_t = 0, t < 0, and v_0 = (1, 0, . . . , 0)'. Tracing out the system's response to such a shock over time gives the first columns of M_1, M_2, and so on. Similarly, if y_0 = (0, 1, 0, . . . , 0)', we get the second columns of the MA coefficient matrices, and so on. Thus, the MA coefficient matrices represent
the MA coefficients are "dynamic multipliers" of the system. Note that the
definition of exogeneity in Section 16.2 ensures that an exogenous variable
does not react to a shock in any of the endogenous variables. This is easily seen
by considering the MA representation of (16.2.1).
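Numerically, the responses can be traced by running the recursion forward from the equilibrium with a single nonzero shock. A minimal sketch for the pure AR case (MA terms omitted for simplicity; names are ours):

```python
import numpy as np

def impulse_responses(theta, shock, n_periods):
    """Trace the response of a vector AR process to a one-time shock.

    theta : list of (k, k) AR coefficient matrices
    shock : (k,) initial shock, e.g. (1, 0, ..., 0)'
    Returns the sequence y_0, y_1, ..., i.e. the columns of M_0, M_1, ...
    selected by the shock vector.
    """
    k = len(shock)
    p = len(theta)
    path = [np.asarray(shock, dtype=float)]          # y_0 = v_0 = shock
    for t in range(1, n_periods + 1):
        y_t = np.zeros(k)
        for j in range(1, min(t, p) + 1):
            y_t += theta[j - 1] @ path[t - j]        # v_t = 0 for t > 0
        path.append(y_t)
    return np.array(path)
```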
Note that MA representations other than (16.4.1) exist and thus the multipli-
ers are not unique (see Sections 16.1.1 and 16.2). To ensure uniqueness eco-
nomic theory has to supply a priori restrictions on the model structure. In
addition, there are other obstacles that render the interpretation of the multipli-
ers difficult. In particular, we need to assume that all "important" variables are included in the system, that there are no measurement errors, and that prior transformations such as seasonal adjustment do not distort the actual structure. These problems are discussed in the following sections.
Assume that y_t consists of two subvectors y_1t and y_2t of dimensions k_1 and k_2, respectively, and v_t is partitioned accordingly, so that
(16.4.2)

(16.4.3)

y_1t = Σ_{i=0}^{∞} [M_{11,i}  M_{12,i}] v_{t−i}    (16.4.4)

y_1t = Σ_{i=0}^{∞} Φ_i u_{t−i},    (16.4.5)
y*_t = y_t + z_t    (16.4.6)
(16.4.7)

This stochastic process may be singular-that is, some components of z_t, and thus w_t, may be identically zero since some of the components of y_t may be measured without error. Substituting for y_t and z_t in (16.4.6), we get

where u_t is vector white noise. In general Ψ(L) ≠ M(L), so that this process does not allow conclusions concerning the structure of y_t, which is often the process of interest. Newbold (1978) has shown that a one-way causal system may turn into a feedback system due to measurement error.
Occasionally a researcher or a user of the model for the considered variables
may really be interested in y* rather than in the y variables. For instance, close
to an election a politician may want to have a small measured unemployment
rate whatever the true unemployment rate may be. In that case the goal may be
to control the system contaminated by measurement error.
16.4.3 Seasonality
If the observed series is the sum of a nonseasonal component d_t and a seasonal component s_t,

y_t = d_t + s_t    (16.4.10)

we obviously have the same situation as in the previous section. Again, without further information it is not possible to deduce the MA structure of d_t or s_t from the MA representation of y_t.
Since interest often seems to focus on d_t, seasonally adjusted data have been used in many analyses of multiple time series. Commonly the seasonal adjustment procedure is applied to the univariate time series separately. Hence, a rather special structure of y_t is implicitly assumed. For instance, adjusting the data by a linear filter, say

(16.4.11)

d̂_t = y_t^a = a(L)y_t    (16.4.12)

is assumed, where

a(L) = diag(a_1(L), . . . , a_k(L))

and y_t^a = (y_{1t}^a, . . . , y_{kt}^a)'. To be appropriate this adjustment clearly requires a very special structure of y_t.
The impact of seasonal adjustment in a multiple time series context has been studied by Sims (1974) and Wallis (1974, 1978). In practice the MA structure of the seasonally adjusted data may have little to do with the MA structure of d_t, so that seasonal adjustment can be another source of misinterpretation of a multiple time series model.
In addition to omitted variables, measurement errors, and seasonality, we
have seen in Section 16.3 that aggregation can lead to erroneous interpretations
of multiple time series models. In the light of all these problems one might well
argue that such models are useless for structural analyses, at least if the model
specification is based primarily on sample information. (See also the comments
at the beginning of Section 16.6 below.) Even if one does not share this extreme
viewpoint, it is useful to be aware of these potential problems when conclusions
for the structure of an economic system are drawn from multiple time series
models.
16.5 ESTIMATION OF
VECTOR ARMA PROCESSES
or, compactly,
(16.5.2)
−ln ℓ = (Tk/2) ln 2π + (T/2) ln|Σ_v| + (1/2) Σ_{t=1}^{T} v_t' Σ_v^{-1} v_t    (16.5.3)
We will assume that the y_t and v_t with t < 0 are zero since this will not affect the asymptotic properties of the resulting estimators. Instead of setting the presample values at zero, we could treat the first p observations as fixed presample values and set v_{p+1−q} = · · · = v_p = 0. If the sample mean of the time series is nonzero, we assume that it has been subtracted from the data previously.
Since the v_t in (16.5.3) are unobserved, they have to be expressed in terms of the observed y_t, and we use the relation

(16.5.4)
[see (16.1.11)]. Thus, the parameters will enter nonlinearly in (16.5.3) and a nonlinear, iterative optimization routine like the Newton-Raphson method is required for the minimization of −ln ℓ (see Appendix B). Assuming that θ is a vector containing all the parameters except the elements of Σ_v, the nth iteration of the Newton-Raphson algorithm has the form

θ_{n+1} = θ_n − (∂²ln ℓ/∂θ∂θ' |_{θ=θ_n})^{-1} (∂ln ℓ/∂θ |_{θ=θ_n})    (16.5.5)
where θ_n denotes the parameter vector obtained in the nth iteration. The derivatives on the right-hand side of (16.5.5) are given in Anderson's article, and the choice of starting values θ_1 for the iteration is discussed in Section 16.5.3. As noted in Chapter 6, if θ_1 is a consistent estimator of θ, the method will provide an efficient estimator in one iteration under general conditions.
Replacing the matrix of second derivatives in (16.5.5) by the information
matrix
(16.5.6)
results in the method of scoring (see Appendix B). The elements of I(θ) are also given by Anderson (1980). The inverse information matrix I(θ̂)^{-1}, where θ̂ is the ML estimate, can be used as an estimate of the asymptotic covariance matrix of θ̂. Other nonlinear optimization algorithms that could be used to minimize (16.5.3) are given in Appendix B.
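In the absence of analytical derivatives, the Newton-Raphson step (16.5.5) can be approximated with numerical differentiation. The sketch below is a generic minimizer of −ln ℓ under that assumption; it is not Anderson's algorithm, and for serious work the analytical derivatives or the scoring variant would be preferred.

```python
import numpy as np

def newton_raphson(neg_loglik, theta0, max_iter=100, tol=1e-8, eps=1e-5):
    """Newton-Raphson minimization of -ln(likelihood), as in (16.5.5),
    with numerical gradient and Hessian.
    """
    theta = np.asarray(theta0, dtype=float)
    n = len(theta)

    def grad(t):
        g = np.zeros(n)
        for i in range(n):
            e = np.zeros(n); e[i] = eps
            g[i] = (neg_loglik(t + e) - neg_loglik(t - e)) / (2 * eps)
        return g

    def hess(t):
        H = np.zeros((n, n))
        g0 = grad(t)
        for i in range(n):
            e = np.zeros(n); e[i] = eps
            H[:, i] = (grad(t + e) - g0) / eps
        return 0.5 * (H + H.T)                      # symmetrize

    for _ in range(max_iter):
        step = np.linalg.solve(hess(theta), grad(theta))
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta
```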
The asymptotic normality of θ̂ holds under more general conditions than those stated above. In particular, the v_t need not be normally distributed. In that case the procedure is a pseudo-ML procedure. Also, there may be constraints on the parameters. Such constraints are sometimes provided by economic theory as we have seen earlier. Furthermore, the system may contain nonstochastic exogenous variables; that is, y_t may be generated by an ARMAX process. For a discussion of these and other generalizations, see, for example, Dunsmuir and Hannan (1976), Deistler, Dunsmuir and Hannan (1978), Hannan (1979), Hannan, Dunsmuir, and Deistler (1980), and Hillmer and Tiao (1979).
There are various special processes for which the estimation problem is compu-
tationally easier than may be suggested by the above discussion, and we will
consider some of these cases in more detail since they are of practical impor-
tance.
(16.5.7)
(16.5.8)
(16.5.9)
[see (16.2.2)], then a transformation of the above type is not necessarily desirable if interest centers on the structural form parameters contained in Θ_11(L) and Θ_12(L). However, if the e_t are not autocorrelated, then the system (16.5.9) has the form of the models considered in Chapter 15 with predetermined variables z_{t−1}, z_{t−2}, . . . and x_t, x_{t−1}, . . . . Such a system can be estimated using the techniques presented in Chapter 15. Estimation techniques especially designed for rational expectations models (see Section 16.2.2) are considered by Wallis (1980), Chow (1980), Hansen and Sargent (1980), Pagan (1984), Wickens (1982), Cumby, Huizinga, and Obstfeld (1983), and others.
16.5.2b AP Processes
Another subclass of ARMA processes that can be estimated efficiently by LS is
the class of autoregressive-discounted-polynomial (AP) processes. These
processes are discussed in this context by Lütkepohl (1982c) and are general-
izations of the discounted polynomial lag models (see Chapter 10). The idea is
to impose some kind of smoothness constraint on the AR operator, since fitting
finite order AR processes has the disadvantage of picking up distortions that do
not reflect the actual underlying structure but are merely due to the randomness
in the sample.
The general form of an AP process is

(16.5.10)

where

(16.5.11)
y_{i,t} − s_{i0,t} = Σ_{l=i+1}^{k} ρ_{il,0}(s_{l0,t} − y_{l,t}) + Σ_{l=1}^{k} Σ_{j=1}^{v_{il}} ρ_{il,j} s_{lj,t} − r_{i,t} + v_{i,t}    (16.5.13)

where

s_{lj,t} = −Σ_{n=1}^{t−1} λ^n n^j y_{l,t−n}

r_{i,t} = Σ_{j=0}^{v_i} η_{i,j} λ^t t^j

and

v_i = max{v_{il} | l = 1, 2, . . . , k}
The η_{i,j} are independent of t and thus, for a fixed λ, (16.5.13) is a linear regression equation with parameters ρ_{il,j} and η_{i,j}, which can be estimated by LS. The η_{i,j} contain the presample values y_t, t < 0, and cannot be estimated consistently since the r_{i,t} approach zero rapidly as t goes to infinity. Replacing these terms by zero-that is, assuming y_t = 0 for t < 0-does not affect the asymptotic properties of the estimators of the ρ_{il,j}, in which we are primarily interested. If y_t is Gaussian, the LS estimators of the ρ_{il,j} are consistent, asymptotically efficient, and normally distributed. The AR coefficients can be obtained via (16.5.11) and it is easy to see that the coefficients of the Φ_n are linearly related to the ρ_{il,j} if λ is fixed, so that the asymptotic distribution of the AR coefficients is also normal.
(16.5.14)
(see Section 7.6). Analogously the spectral density for a stationary vector
stochastic process Yt with autocorrelation matrices
(16.5.15)
is defined as
Γ_y(h) = ∫_{−π}^{π} e^{iωh} f_y(ω) dω    (16.5.17)

(16.5.18)
Rather than maximizing the likelihood, Dunsmuir and Hannan (1976) show that

(16.5.19)

may be maximized, where

P(ω) = (1/2π) Σ_{h=−(T−1)}^{T−1} C_y(h) e^{−iωh}    (16.5.20)

is the periodogram,

C_y(h) = T^{-1} Σ_{t=1}^{T−h} y_{t+h} y_t'    (16.5.21)

and C_y(h) = C_y(−h)' for h < 0. Maximization of (16.5.19) can have computational advantages due to the special structure of the matrix of second derivatives, although it is of course also a nonlinear optimization problem.
(16.5.22)

Γ_y(−h) = Σ_{j=1}^{p} Θ_j Γ_y(j − h)    (16.5.23)

(16.5.24)

which in turn can be used to estimate the spectral density of the error process

(16.5.25)

by

(16.5.26)

where the C_e(h) are defined analogously to (16.5.21). The actual spectral density of e_t is

(16.5.27)

and the estimate f̂_e(ω) is such that it can be factored accordingly in order to obtain initial estimates of the A_j, j = 1, . . . , q, and Σ_v.
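For the pure AR part, the sample analogue of the Yule-Walker-type relations (16.5.23) gives initial estimates in closed form. A minimal sketch, assuming mean-adjusted data; names are illustrative.

```python
import numpy as np

def sample_autocov(y, h):
    """C_y(h) = T^{-1} sum_t y_{t+h} y_t' for mean-adjusted data, per (16.5.21)."""
    T = y.shape[0]
    if h < 0:
        return sample_autocov(y, -h).T
    return (y[h:].T @ y[:T - h]) / T

def yule_walker_var(y, p):
    """Initial AR estimates Theta_1, ..., Theta_p from the sample analogue of
    the Yule-Walker relations, plus the implied estimate of Sigma_v.
    """
    k = y.shape[1]
    # block (j, h) of G is C_y(h - j); target stacks C_y(1), ..., C_y(p)
    G = np.block([[sample_autocov(y, h - j) for h in range(1, p + 1)]
                  for j in range(1, p + 1)])
    target = np.hstack([sample_autocov(y, h) for h in range(1, p + 1)])
    R = target @ np.linalg.inv(G)                   # [Theta_1, ..., Theta_p]
    thetas = [R[:, i * k:(i + 1) * k] for i in range(p)]
    sigma_v = sample_autocov(y, 0) - sum(
        thetas[j] @ sample_autocov(y, -(j + 1)) for j in range(p))
    return thetas, sigma_v
```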
The small sample behavior of the different methods is in general unknown and therefore we cannot give well-founded advice as to which method to use for a particular data set. Also, little seems to be known about the best treatment of the initial conditions terms. Furthermore, in general we do not know the asymptotic distribution of the parameter estimates for nonstationary and noninvertible ARMA processes. In fact, the full likelihood function cannot even be maximized if the AR operator has a unit root [see Deistler, Dunsmuir, and Hannan (1978)]. Also, it would be useful to know the consequences of estimating a misspecified model. Clearly these problems are sufficiently important to deserve further investigation.
So far we have assumed that the model is completely specified and merely
involves some unknown parameters that can be estimated from the data. Al-
though there are economic theories that result in dynamic econometric specifi-
cations, Sims (1980) has persuasively argued that economic theory is not likely
to provide a completely and uniquely specified model. The following are his
main objections to identifying restrictions in macroeconometric models:
(1) Restrictions are often not based on economic theory but rather on the "econo-
metrician's version of psychological and sociological theory. " Economic theory
often tells us only "that any variable which appears on the right-hand side of one
of these equations belongs in principle on the right-hand side of all of them."
[Sims (1980, p. 3).]
(2) In dynamic models with lagged endogenous variables and temporally correlated
error terms where the exact lag lengths are not known a priori, the conditions for
exact or overidentification require to distinguish between lagged dependent and
strictly exogenous variables. However, sometimes variables are "treated as
exogenous only because seriously explaining them would require an extensive
modeling effort in areas away from the main interests of the model-builders."
[Sims (1980, pp. 5,6).]
(3) Including expectations variables in an econometric model may create identifica-
tion problems since the expected future values must have "a rich enough pattern
of variation to identify the parameters of the structure. This will ordinarily
require restrictions on the form of serial correlation in the exogenous variables."
[Sims (1980, p. 7).]
16.6.1 AR Processes
(16.6.1)
and the sample mean is assumed to be zero or previously subtracted from the
data. Of course, in practice many economic time series have trends and cannot
686 Inference in Simultaneous Equation Models
(16.6.2)
(16.6.3)
. . (T + j)k ILjl
FPE(j) = T _ j
A
(16.6.4)
so that
(16.6.6)
(16.6.7)
k T A)-1 (T _T jk LjA)-I]
CAT(j) = trace [ T ~ T _ sk Ls
j (
- (16.6.8)
(16.6.9)
leads to a consistent estimator p if and only if limr->oo sup Cr > 1. Quinn sug-
gests using Cr = k 2 so that we get a criterion
A 2jk 2 In In T
HQ(j) = InlLjl + T
(16.6.10)
Furthermore, if T :> 8, p(AIC) :> p(SC) and, for all T, p(HQ) :> p(SC). Overall
Liitkepohl's results clearly favor the SC criterion for small and moderate sam-
ples. In his Monte Carlo studies SC chooses the correct order most often and
the resulting estimated AR processes provide the best forecasts. Some of the
above criteria have been extended for estimating the order of a vector ARMA
process [e.g., Hannan (1981)].
Determining P and estimating the resulting AR process without any zero
constraints will often lead to a wastefully parameterized model in practice and
methods have been proposed to reduce the number of parameters in an AR
model. We will sketch one of them in the following.
(16.6.12)
where
1,)=1,2 (16.6.13)
(16.6.14)
and VI is Gaussian white noise with covariance matrix Lv. We assume that YI is
stationary. Defining P = max {Pij Ii, j = 1, 2}, YI is of course a bivariate AR(p)
process. However, some of the Pij may be less than p, which implies that the
corresponding 8ij.Pij+l, . . . ,8ij,p = O.
Allowing the 8i}(L), i, j = 1, 2, to have different degrees can lead to a sub-
stantial reduction of the number of parameters in the final AR model. However,
instead of only P we now have to determine all four degrees Pi}' The criteria
presented in the previous section can be employed for that purpose. Hsiao
suggests choosing Pi} such that FPE is minimized-that is,
(16.6.16)
Multiple Time Series and Systems of Dynamic Simultaneous Equations 689
T +I+n -J
FPE(I,n) = T _ I _ n (r(/,n)
(16.6.17)
where
i
-J(I ,n ) -_ T
cr "-2
T
~ VIr (16.6.18)
and the ulrare the LS residuals when (16.6.16) is fitted withpII = landpI2 = n.
To find the minimum of (16.6.17) over all/,n between 0 and m requires that
(m + 1)2 regressions be run or, if k variables are involved, (m + 1)k regressions
have to be fitted to specify only one of the equations. Clearly this may require a
quite substantial amount of computing time and Hsiao suggests the following
abbreviated procedure for a bivariate process.
where the dash indicates that the second variable, Y2r, is not included.
In other words, the optimal univariate AR model for Ylr is determined.
Step 2 Choose PI2 such that
This procedure will not necessarily lead to the minimum FPE over all combi-
nations (I,n), I,n = 0, 1, . . . , m. Therefore, Hsiao suggests overfitting and
underfitting of the final model and checking the significance of extra parameters
or deleted parameters by likelihood ratio tests. For the second equation the
procedure is analogous, and, of course, other criteria could be used in place of
FPE. Currently it is not clear which criterion is preferable in small samples.
A similar procedure could be applied to fit an AP model to a given set of data.
From (16.5.13) the first estimation equation for a bivariate model is known to be
VII V12
(16.6.22)
where all symbols are as defined in (16.5.13). To specify this equation of the AP
model, Vll and VI2 have to be determined. Obviously, as in the above AR
690 Inference in Simultaneous Equation Models
model, varying these degrees comes down to varying the number of regressors
in the regression equation (16.6.22) if A is fixed (see Section 16.5.2b). Similarly
we can proceed for the second equation. For details and applications see
Liitkepohl (l982b,c).
While the above procedures are relatively easy to carry out, they may result
in nonparsimonious specifications which in turn may have adverse effects on
their forecasting performance. Therefore, choosing a model from the larger
ARM A class is often desirable.
Different procedures have been proposed for vector ARMA model building and
it is beyond the scope of this chapter to offer a detailed account of each. We
will only give brief sketches of some of the methods in the following.
vl(L) I/JI(L)
Ylt = (31 (L) Y2t + WI (L) Vlt (16.6.23)
v2(L) I/J2(L)
Y2t = (32(L) Ylt + w2(L) V2t (16.6.24)
where Vt = (Vlt, V2t)' is a bivariate white noise process and the operators vj(L),
(3j(L), I/Jj(L), and wj(L),j = 1,2, are of finite order. To determine the orders of
these operators Granger and Newbold (1977) suggest to build univariate ARM A
models for Ylt and Y2t first, say
j = I, 2 (16.6.25)
where the Wjt are white noise and ()j(L) and Otj(L), j = I, 2, are finite order
operators. The models (16.6.25) can be specified using the Box-Jenkins ap-
proach presented in Chapter 7.
Although the Wit are univariate white noise processes, Wt = (Wit, W2t)' will in
general not be a bivariate white noise process. Therefore, a bivariate model,
say
is built in the next step. Using the fact that the Wjt are univariate white noise
processes makes it possible to infer the orders of the 'j(L), <ML),j = 1,2, from
the cross-correlations between WIt and W2s. t ,S = 0, + 1, +2,. . .. Finally, the
orders of the /J-j(L) and 8j (L) are inferred from the residual correlations of the
resulting models and the model (16.6.26)-(16.6.27) is amalgamated with
(16.6.25) to obtain a tentative bivariate specification for Yt. After having been
estimated by some efficient method the adequacy of this model has to be
checked and, if necessary, modifications have to be made. Since model check-
ing is an important step in all modeling procedures, we will mention some
possible checks in Section 16.6.3.
Granger and Newbold admit that their approach, although theoretically feasi-
ble, will be of little use in practice if more than three variables are involved. A
similar model building strategy is proposed and applied by Haugh and Box
(1977) and Jenkins (1979) and a slightly different procedure is discussed by
Wallis (1977) and Chan and Wallis (1978). These latter authors also "pre-
whiten" the single series first; that is, they build univariate ARM A models first,
and make use of the representation (16.1.12) of a vector ARMA model. Also
Zellner and Palm (1974) start from univariate models and then gradually com-
plicate the multivariate model structure. They explicitly make use of informa-
tion from economic theory. Furthermore, common factor analysis can be used
to specify multivariate ARM A models (see Chapter 10).
After a tentative model has been specified by any of the above procedures,
checks of its adequacy have to be carried out. As in a univariate analysis the
significance of the individual parameters should be tested and insignificant
parameters should be deleted in accordance with the principle of parsimony.
692 Inference in Simultaneous Equation Models
Also overfitting and testing the significance of the extra parameters is a useful
check. Suitable tests for this purpose have been proposed and discussed by
Quenouille (1957) and Hosking (1981a) among others.
Furthermore, a residual analysis can be based on a multivariate version of
the portmanteau test [Chitturi (1974), Hosking (1980, 1981b), Li and McLeod
(1981)]. A special case of this test is discussed in Section 10.4. For the present
case the appropriate statistic is
K
Q= T 2: tr(CiCo1c/Co
/~ 1
l
) (16.6.28)
where T is the sample size, K is small relative to T, and C/ is the lth covariance
matrix of the residuals,
1 T
C/ = - L.J
"""VIV,,'I _/
T I~l+ 1 (16.6.29)
and the VI are the estimation residuals. Under the null hypothesis that the model
is a k-dimensional ARMA(p,q) process, the statistic Q is asymptotically dis-
tributed as X2 with k2(K - p - q) degrees offreedom. A modification of Q with
possibly superior small sample behavior is noted by Hosking (1980). For further
details and references on model checking see Newbold (1981).
None of the above procedures is suitable for building multiple time series
models comparable in size to large scale econometric models. Currently we
cannot recommend a particular procedure for universal use. If the objective is
forecasting, then it seems logical to judge a model and thus the corresponding
modeling procedure by its forecasting performance, and it would be desirable
to compare different models for a set of economic data on this basis. It seems
fair to say that multivariate time series model building for economic data is still
in its infancy and there is much opportunity for further research in this area.
16.7 SUMMARY
linear multiple time series models at this point. However, nonlinear models
have proved their value in univariate analyses and eventually (perhaps in the
remote future) they may also become a useful tool in multivariate analyses. In
summary, it does not require much expertise in the business of forecasting to
predict that the analysis of mUltiple economic time series will remain an active
field of research over the years to come.
16.8 EXERCISES
Yt = [Zt]
Xt
= [a'Y f3][Zt-t] + Vt
D Xt-t
(16.8.1)
One hundred samples of size 100 should be generated using a random number
generator. A procedure to generate bivariate normal random variables is de-
scribed in Judge et al. (1982, p. 810).
Exercise 16.1
Use five samples of Monte Carlo data to estimate the parameters a, f3, 'Y, and D
by applying least squares to each of the two equations in (16.8.1) separately.
Exercise 16.2
U sing the five samples of Exercise 16.1, reestimate the parameters by the
seemingly unrelated regression method.
Exercise 16.3
Transform the model (16.8.1) so that the disturbance terms of the two equations
are independent and estimate the parameters of the transformed model by LS
using the five samples of data from Exercise 16.1.
Exercise 16.4
Perform tests to investigate the causal structure of the system Yt using the five
sets of Monte Carlo data from Exercise 16.1.
Exercise 16.5
I
!
Assume that in (16.8.1) Lv is diagonal and 'Y = o. Determine the reduced form
and the final form of the model.
Multiple Time Series and Systems of Dynamic Simultaneous Equations 695
Exercise 16.6
Determine the moving average representation of the model (16.8.1) and derive
an ARMA representation of the univariate subprocesses ZI and XI'
Exercise 16.7
U sing the five samples from Exercise 16.1, compute the sample autocorrela-
tions. Try to determine an adequate model for YI based on these quantities.
Exercise 16.8
Use the AIC and the SC criteria of Section 16.6.1 and a maximum order of six
and estimate the order of an AR model for YI for the five different samples of
Exercise 16.1. Interpret your results.
Exercise 16.9
U sing the five samples of Monte Carlo data from Exercise 16.1, apply Hsiao's
method as described in Section 16.6.1b to specify an AR model for YI' Assume
the maximum lag to be four.
Exercise 16.1 0
Repeat Exercises 16.1 and 16.2 with tOO bivariate samples of size 100. Compute
the means, variances, and mean square errors (MSE's) of the resulting estima-
tors, construct frequency distributions, and interpret.
Exercise 16.11
Repeat Exercise 16.3 with all 100 samples. Compute the means, variances, and
MSE's of the resulting estimators and construct frequency distributions.
Exercise 16.12
Make use of all 100 samples generated from the model (16.8.1) and repeat
Exercise 16.8. Construct frequency distributions for the AR orders estimated
by AIC and SC and interpret.
Exercise 16.13
Repeat Exercise 16.9 with all 100 samples of size tOO. Construct frequency
distributions for the estimated orders and interpret.
16.9 REFERENCES
Akaike, H. (1974) "A New Look at the Statistical Model Identification," IEEE
Transactions on Automatic Control, AC-19, 716-723.
Anderson, T. W. (1980) "Maximum Likelihood Estimation for Vector Autore-
gressive Moving Average Models," in D. R. Brillinger and G. C. Tiao, eds.,
Directions in Time Series, Institute of Mathematical Statistics, 49-59.
Baillie, R. T. (1979) "Asymptotic Prediction Mean Squared Error for Vector
Autoregressive Models," Biometrika, 66, 675-678.
Baillie, R. T. (1981) "Prediction from the Dynamic Simultaneous Equation
Model with Vector Autoregressive Errors," Econometrica, 49, 1331-1337.
Barth, J. R. and J. T. Bennett (1974) "The Role of Money in the Canadian
Economy: An Empirical Test," Canadian Journal of Economics, 7, 306-
311.
Caines, P. E. and C. W. Chan (1975) "Feedback Between Stationary Stochas-
tic Processes," IEEE Transactions on Automatic Control, AC-20, 498-508.
Caines, P. E. and C. W. Chan (1976) "Estimation, Identification and Feed-
back," in R. K. Mehra and D. G. Lainiotis, eds., System Identification:
Advances and Case Studies, Academic, New York.
Caines, P. E. and S. P. Sethi (1979) "Recursiveness, Causality and Feedback,"
IEEE Transactions on Automatic Control, AC-24, 113-115.
Chamberlain, G. (1982) "The General Equivalence of Granger and Sims Cau-
sality," Econometrica, 50, 569-581.
Chan, W. Y. T. and K. F. Wallis (1978) "Multiple Time Series Modeling:
Another Look at the Mink-Muskrat Interaction," Applied Statistics, 27,
168-175.
Chitturi, R. V. (1974) "Distribution of Residual Autocorrelations in Multiple
Autoregressive Schemes," Journal of the American Statistical Association,
69, 928-934.
Chow, G. C. (1980) "Estimation of Rational Expectations Models," Journal of
Economic Dynamics and Control, 2, 47-59.
Ciccolo, J. H., Jr. (1978) "Money, Equity Values, and Income: Tests for Ex-
ogeneity," Journal of Money, Credit, and Banking, 10,46-64.
Cumby, R. E., J. Huizinga, and M. Obstfeld (1983) "Two-Step Two-Stage
Least Squares Estimation in Models with Rational Expectations," Journal of
Econometrics, 21, 333-355.
Deistler, M. (1976) "The Identifiability of Linear Econometric Models with
Autocorrelated Errors," International Economic Review, 17,26-45.
Deistler, M. (1978) "The Structural Identifiability of Linear Models with Auto-
correlated Errors in the Case of Cross-Equation Restrictions," Journal of
Econometrics, 8, 23-31.
Deistler, M., W. T. M. Dunsmuir, and E. J. Hannan (1978) "Vector Linear
Time Series Models: Corrections and Extensions," Advances in Applied
Probability, 10, 360-372.
Deistler, M. and E. J. Hannan (1981) "Some Properties of the Parameterization
of ARMA Systems with Unknown Order," Journal of Multivariate Analysis,
11, 474-484.
Multiple Time Series and Systems of Dynamic Simultaneous Equations .697
FURTHER MODEL
EXTENSIONS
Chapter 17
Unobservable Variables
17.1 UNOBSERVABLE
VARIABLES IN ECONOMETRIC MODELS
is accomplished in one period. Thus the errors in variables problem is not only
inseparable from the problem of unobservable variables, but is also related to
distributed lag models, such as those discussed in Chapters 9 and 10.
In some cases in empirical analysis, the variables we measure are not really
what we want to measure. For example, observed test scores may be used as a
proxy for years of education and consequently the proxy variables may be
subject to large random measurement errors.
Even for the observable variables, the data may be subject to a variety of
errors. Errors may be introduced by the wording of the survey questionnaires.
Words such as weak or strong may imply different things to different respon-
dents. Griliches (1974) notes that these errors arise because (1) in economics
the data producers and data analyzers are separate, (2) there is fuzziness about
what it is we would like to observe, and (3) the phenomena we are trying to
measure are complex.
In the following sections, the inadequacy of the least squares rule, when
applied to a statistical model containing errors in variables, is evaluated and
the dilemma of using proxy variables versus ignoring the unobservable vari-
ables is discussed. The classical model in which both the dependent and inde-
pendent variables are subject to error is demonstrated to be unidentified due to
insufficient information. For identification, additional information is required.
For a single equation model, information regarding error variances may be used
to identify the equation and these error variances can often be estimated by the
use of repeated observations. When additional outcome indicators are available
or if unobservable variables can be specified as a function of observable explan-
atory variables (causes), mUltiple equation models result. In multiple equa-
tions, alternative procedures of estimation are available and factor analysis and
path analysis are of special interest. Finally, we also discuss simultaneous
equation models that involve errors in the exogenous variables and models that
are dynamic in nature. An overview of this chapter is given in Table 17.1.
The problem of using least squares on a model that contains errors in vari-
ables is that the equation error is no longer independent of the explanatory
variables that are measured with errors. To see the problem, consider the
following model:
X = X* + U (17.2.2)
Unobservable Variables 707
Inadequacy of OLS
(Section 17.2)
~ Using proxy variables
Dilemmaof~
(Section 17.3) Deleting unobservable variables
Repeated observations
(Section 17.5.2)
Multiple indicators
Unobservable
(Section 17.6.1)
variables in
econometric
Multiple causes
models----i
(Section 17.6.2)
(Section 17.1)
Multiple causes multiple
Multiple equations indicators model
(Section 17.6) (Section 17.6.3)
Instrumental variables
(Section 17.6.5)
Dynamic models
(Section 17.8)
y = Z"I + XIl + e
(17.2.3)
= Wo + e
e ~ v - Ull (17.2.4)
708 Further Model Extensions
(17.2.5)
is biased because e and W, which contains X, are not independent. Making use
of procedures and assumptions outlined in Chapter 5, we obtain, for large
samples
where plim(V' VIT) = +uu is assumed to exist. Thus the least squares estimator
8 is also inconsistent and the inconsistency is
plim(8 - 8) = plim(W'W)-ITT-IW'(v - VJl) (17.2.6b)
= [ -+~uJl] (17.2.8)
(17.2.9)
j
Unobservable Variables 709
plim(t3 - (3) =
(17.2.10)
(17.2.11)
where R denotes the multiple correlation coefficient for the regression model
with the true variables and R * denotes the multiple correlation coefficient for
the regression model containing errors in the variables. Since the F statistic is
proportional to R2/(1 - R2), or
R2 T - K
F(K-I.T-K) = 1 - R2 K - 1 (17.2.12)
x = x* + u (17.3.2)
we are confronted with the choice of either using the years of schooling as a
710 Further Model Extensions
proxy variable in place of the unobservable variable, years of education; that is,
Y = Zy + x{3 + e (17.3.3)
or simply omitting the unobservable variable (x*) and using instead the result-
ing misspecified equation
Y = Zy +e (17.3.4)
Since both (17.3.3) and (17.3.4) are misspecified equations of (17.3.1) together
with (17.3.2), the estimators of yare both biased and inconsistent [Theil
(1957)]. The question then is, "Which course of action will produce estimators
with less bias or inconsistency?"
McCallum (1972) and Wickens (1972) independently show, assuming that u is
independent of e, x*, and Z, that the bias of the estimator
(17.3.5)
is always greater than the bias of the estimator 'YE obtained from
Z'X]-I[Z'y] (17.3.6)
x'x x'y
The inconsistency of is is
(17.3.7)
provided that ~zz = plim(Z' ZIT) and ~zx* = plim(Z'x*IT) exist and the former is
nonsingular. On the other hand, the inconsistency of 'YEis
(17.3.8)
where CTuu =plim u'ulT, 8 = CT/I/I + CTx*x*[ 1 - (~~*~~I~Z-t*ICTx*x*)], and CTx*x* = plim
x*'x*IT. It can be shown that ICT x*x.181 < 1. Consequently the inconsistency of iE
is always smaller than that of i".
Thus, if our major concern is inconsistency, it is better to use even a poor
proxy than to omit the unobservable variable. However, the above conclusion
is not without reservation. First, the relationship between the proxy variable (x)
and the unobservable variable (x*) is not necessarily in the form of errors in
variables (17.3.2). If the proxy variable is a linear function of the unobservable
variable with intercept and slope parameter(s) [Aigner (1974a)], and also influ-
enced by other variables [Maddala (1977, p. 160)], or the proxy variable is in the
form of the dummy variable [Maddala (1977, p. 161)], the situation will be
different. Second, the bias comparisons of McCallum (1972) and Wickens
(1972) are based on asymptotic properties. They do not provide sufficient evi-
Unobservable Variables 711
dence for a choice among estimators in small samples. If we have large sam-
ples, the bias of the estimator for the model that includes the proxy variable can
be eliminated asymptotically if we can find an appropriate instrument and use
the instrumental variable method. However, the estimator is inefficient.
Other criteria may be used to compare the estimators. If we use the variance
criterion in choosing the estimator, the variance of -YE may be larger than the
variance ofYs. Aigner (1974a) compares both estimators in terms of the mean
square error (MSE), and as discussed in Chapter 3 this measure is the sum of
the variance and the squared bias. He finds that in terms of the MSE criterion
YE does not dominate Ys' Without loss of generality, he used a simple case of
one other explanatory variable to establish the condition MSE('9E) > MSE(ys)
if and only if
(1 - (1 - ItT)p~*) It
_ > p2
(1 - (1 - It)p~.)2 T - zx (17.3.9)
where It = CTuu/(CTx*x* + CTuu ) and p~* denotes the squ~red simple (population)
correlation coefficient between z and x*.
The condition (17.3.9) depends on the values of T and the unknown parame-
ters p~., and It. If, for example, p~* = 0.3, T = 10, and It = 0.9, the condition
holds and MSE('9E) > MSE(ys)' In this case, omitting the unobservable vari-
able x* from the regression is the optimal choice. However, the map of the
"contours of dominance" based on the above condition (17.3.9) shows that the
area in favor of Ys in terms of MSE is relatively small. When It is small (relative
to the above numerical example) and p~* and T are large, then the use of a
proxy variable in a multiple regression of yon z and x will be favorable in terms
ofMSE.
In summary, with the assumptions of McCallum (1972) and Wickens (1972),
it is true that the use of the proxy variable leads to a smaller inconsistency.
Aigner's (1974a) findings also broadly support the use of a proxy. However, in
cases when the proxy is not simply in the form of errors in variables, other
techniques may be preferable. This is particularly true when the proxy is a
function of its own unobservable variable and other additional variables that
may be subject to errors. In the following sections, we discuss other techniques
of estimating equations with unobservable variables that do not make use of
proxy variables.
y = x*{3 +v (17.4.1)
and
x = x* + u (17.4.2)
712 Further Model Extensions
T
In (f3,(J'~,(J'~,x*ly,x) = const. - "2 (In (J'~ + In (J'~)
- (x - x *)'(x - x *)/2(J'~
- (y - x*f3)'(y - x*f3)/2(J'~ (17.4.4)
Taking the partial derivatives ofln( ) or (17.4.4) with respect to f3, (J'~, (J'~, and x* ,
setting the derivation equal to zero and simplifying gives
and
provided that (J'~ and (J'~ are not zero. Solving equations (17.4.5) through (17.4.8)
jointly, we obtain an undesirable relation
-2 - f3- 2 -2
(J'v - (J'" (17.4.9)
It is undesirable because in large samples, the result in (17.4.9) implies f32 = (J'~/
(J'~, which is unreasonable and not true. The ratio (J';/(J'~ as an estimator of f32
was first deduced by Lindley (1947) and is considered an unacceptable solution
by some researchers, for example, Kendall and Stuart (1961, p. 384) and John-
ston (1963, p. 152). The failure of the maximum likelihood method to yield a
unique estimate is, in this case, the result of insufficient information. In other
words, the model is not identified.
Unobservable Variables 713
To see the lack of information, we note that elements of x and yare normally
distributed with zero mean vector and covariance matrix
(17.4.10)
and
(17.4.13)
respectively. The model implies three equations relating the structural parame-
0';. ,
ters {3, O'~, O'~, and to population moments of the observable variables, or,
in practice, to their sample counterparts:
a--2x* + (J"-2_
u - mxx (17.4.14)
2 + -2
{3-'-0"- x* (17.4.15)
(J" v == myy
- -2
{30' x* = m xy (17.4.16)
These three equations are not sufficient to determine the four parameters, and
the model is underidentified.
The basic concept of identification is the same as that of a system of simulta-
neous equations. In the latter, the problem of identification is to derive rand B
from rrr = - B, while the former is to derive the parameters {3, O'~, and 0';, 0';*
from (17.4.10). Just as in a system of simultaneous equations, for identification,
more information regarding the parameter restrictions is required. In the fol-
lowing sections, we discuss the types of parameter restrictions that will help
identify the parameters. Parameter restrictions therefore become an important
part of the model specification.
Within the context of the previous section, in this section we consider the
identification of the equation
y = {3x + e (17.5.1)
information regarding the parameters {3, ~, o-~, and 0-;., or extraneous infor-
mation in the form of repeated observations of x and y.
Recall that for the model (17.5.1), the parameters {3, ~,o-~, and 0-;. are to be
derived from the equations (17.4.14) through (17.4.16) plus additional informa-
tion. The additional information may involve one or more of the following
relations:
1. ~ = CI
2. ~ = C2
3. ~ = A~
4. o-i. = y~
where CI, C2, A, and yare known constants. If any of the above information is
used together with Equations (17.4.14)-(17.4.16), then there are four equations
and four unknowns and we have a complete system. However, the system is
not linear but is instead a third-order system. Therefore, there could be as many
as six different solutions since (17.4.15) is third order and (17.4.16) is second
order. By imposing nonnegativity conditions on o-~, o-~, and 0-;., some infeasi-
ble solutions can be ruled out.
If either 0-;' is known to be CI, or o-~ is known to be C2, then estimates of the
parameters can be obtained from (17.4.14) through (17.4.16), but the estimates
may be infeasible (negative variance) or feasible but inefficient. The maximum
likelihood estimator in this case is demonstrated by Birch (1964) and its sam-
pling properties are discussed by Fuller (\978). For the case when both 0-;' and
o-~ are known at the same time, the results are developed in Birch (1964) and a
summary of the results is available in Judge et at. (1980).
The case when the ratio o-~/~ = A is known is well documented in the
literature. In this case the system of nonlinear equations reduces to
(17.5.2)
The estimator i3 is one of the roots of (17.5.2) with algebraic sign that is consis-
tent with the sign of m XY ' since i3 measures the slope of the relation between the
true values y* and x*. The other parameter estimators are a-;. = In,J{3, a-~ =
mxx - m xy l{3, and a-~ = a-~/A. For the asymptotic properties, see Fuller (\978).
In another case, the ratio 0-;.10-;
or the ratio o-~/o-; may be known, where
0-; 0-;. 0-;'.
= + 0-;.10-;,
If the ratio which is often called the measure of reliability,
is known, the biased least squares estimator b = In,Jlnu may be transformed to
yield an estimator
(17.5.3)
Unobservable Variables 715
The variance of S, and the test statistic that may be used for inference purposes
are developed by Fuller (1978).
(17.5.6)
provided that Tn > n. Both of the estimators & and cT~ are unbiased.
A generalization of (17.5.5) is the addition of explanatory Zj variables, which
leads to the specification
J = 1, 2, . . . ,n (17.5.8)
Y = Z8 +v (17.5.9)
716 Further Model Extensions
where
v = (v; Vn
I ) I
,
8 = (a ' y')',
and
Z=
(17.5.10)
x = x* + u (17.5.12)
where [u] = 0, [uu '] = 0";'1, [uv ' ] = 0, and u is independent of Z. The
information in 07.5.12) provides the relation
which can be incorporated with the following relation obtained from (17.5.8),
(17.5.14)
(17.5.15)
Unobservable Variables 717
to obtain estimates of the parameters. Since the population variances <T;, 0-;,
and covariance <Txy can be estimated by sample counterparts m xx , m yy , and m xy ,
respectively, and crv can be estimated from (17.5.11), the three unknowns ~"
C?U, and {3 can be estimated from the three relations (17.5.13) through (17.5.15).
Further, if repeated observations are also available on x so that
x = x* + u ,
~ ~
j = 1, 2, . . . ,n (17.5.16)
where Xj, x*, and Uj are all (T x 1) vectors, then clearly there is overidentifica-
tion, since <T~ and <T;' can be estimated from (17.5.16) and both (17.5.14) and
(17.5.15) determine {3.
For an overidentified situation, the maximum likelihood estimator provides a
solution. Examples are Villegas (1961), Barnett (1970), and Dolby and Freeman
(1975). For a treatment of repeated observations within the context of analysis
of variance, see Tukey (1951) and Madansky (1959), and for a summary, see
Judge et al. (1980).
YJ = (3x*
J
+ VJ' ) = 1, 2, . . . ,n (17.6.1)
For the multiple equation model (17.6.1), the errors Vj do not necessarily come
from the same distribution. In this case, the underlying stochastic assumptions
are that each Vj and x* are independent and
(17.6.2)
(17.6.3)
and
(17.6.4)
718 Further Model Extensions
YI = f3lx* + VI
Y2 = f32X* + V2
Y3 = f33 X* + V3 (17.6.5)
where the upper triangle elements are omitted for simplicity. There are six
relations and six unknowns. The model is just identified. If a fourth indicator is
available, it would add two additional parameters f34 and (T~4 but there will be
four additional relations relating population moments of the observables to
parameters. The model is then overidentified.
In general, the multiple indicator model can be written as
Y = Jlx* + V (17.6.7)
E(x*v) = 0 (17.6.8)
E(vv') = n diagonal (17.6.9)
There are M(M + 1)/2 distinct elements in ~ = JlW + n, that when equated
with sample second moments, provide M(M + 1)/2 estimating equations. There
are only 2M parameters in Jl and n. Thus if M > 3, there will be more equations
than unknown parameters and overidentification results. There will be many
estimators that are consistent but may not be fully efficient. Efficient estimators
of Jl and n may be obtained by the ML method.
Unobservable Variables 719
Under the normality assumption, the log likelihood function with constants
dropped may be written as
For a detailed derivation, see Goldberger (1974). Although Goldberger does not
derive the variance for the ML estimator, the ML estimator is consistent and
efficient.
The multiple indicators model can be viewed as a special case of the factor
analysis model that has long been used in psychometrics. For details of factor
analysis, see Lawley and Maxwell (1971) and Harman (1976).
y = f3x* + v (17.6.16)
720 Further Model Extensions
with
x* = a'x (17.6.17)
where
n = al3' (17.6.20)
(17.6.21)
where
It can be shown (Goldberger, 1974) that the solution value for 13 is the charac-
teristic vector of QS~I corresponding to the root A; that is,
where
Q = Y'X(X'X)-IX' Y (17.6.26)
S = (Y - XP)'(Y - XP) (17.6.27)
P = (X'X)-IX' Y (17.6.28)
Further, it can be shown that 13 is the first principal component of QS-I since
A = 13' S-113 corresponds to the largest characteristic root of QS-I. The solution
value for a is shown by Goldberger (1974) to be
a = A-lpS-113 (17.6.29)
For a special case of only two indicators, Zellner (1970) and Goldberger
(1972a) develop generalized least squares and maximum likelihood algorithms
under alternative assumptions about the errors of the two indicators. Zellner
(1970) considers the following model that involves ancillary variables:
where XI is a (T x 1) column vector of ones and the 7T'S are additional unknown
parameters. An example of this model can be found in Crockett (1960). In
Crockett's model y is a vector of observed consumption levels, x* is permanent
income, and the z's are such variables as house value, educational attainment,
age, and so forth.
In the above formulation the number of parameters in the model is no longer
dependent on the sample size, and the T unknown incidental parameters x* are
now replaced by the K-dimensional unknown coefficient vector 71". The model
was first considered by Zellner (1970) as a system of simultaneous equations.
Indeed, substitution of (17.6.32) into (17.6.30) and (17.6.31) results in the two
equation model
(17.6.33)
X = XI7TI + Z27T2 + ... + ZK7TK + U (17.6.34)
where, for simplicity, all the variables are measured as deviations from their
means so that the intercepts a and 7TI may be eliminated, Z = (Z2 Z3 . . . z,.J is a
722 Further Model Extensions
[T X (K - I)] matrix,
11'x =
(17.6.37)
In the second stage y is regressed on x(=Zfrx ) to obtain one of the two stage
least squares estimators denoted by /3(00) as
(17.6.38)
The use of infinity in /3(00) is related to the variance ratio A. = cr~/cr~ and will
become apparent in what follows. On the other hand, if we define
(17.6.39)
y = Z11'y + V (17.6.40)
X = Z11'y/{3 + u (17.6.41)
In this case, the two-stage least squares procedure is reversed. In the first stage
y is regressed on Z to obtain
(17.6.42)
(17.6.44)
Thus there are two 2SLS estimators, /3(00) and /3(0). The former may be inter-
Unobservable Variables 723
1 1
S = 2 (x - ZTrx)'(x - ZTrx) + 2 (y - ZTrx(3)'(y - ZTrx(3)
CF u CF" (17.6.45)
(17.6.46)
, Syy
A=-
Sxx (17.6.48)
with
Syy = T-l(y - Zfry)'(y - Zfry) (17.6.49)
Sxx = T-l(X - Zfrx)'(x - Zfrx) (17.6.50)
(17.6.51)
Noting that the weighted sum of squares (17.6.45) is only a part of a log
likelihood function, Goldberger (1972a) suggests that Trx , Try, CF~, and CF~ be
estimated jointly from the likelihood function. The parameter (3 may be esti-
mated from (17.6.39) after Trx and Try are estimated. To show the symmetry
Goldberger writes the model in the reduced form as
y = ZTry + V (17.6.52)
X = ZTrx + U (17.6.53)
724 Further Model Extensions
where (u' v')' is normally distributed with mean zero and variance 0
(17.6.54)
f(u,v) = (27T)-TI01-1/2
1 , 1 , ]
. exp [ - 20"2 (y - Z1Ty) (y - Z1Ty) - 20"2 (x - Z1Tx) (x - Z1Tx)
v u
(17.6.55)
which, when viewed as a function of the parameters 1Tx , 1Ty , O"~, and O"~, given y
and x, is the likelihood function. The log likelihood is
(17.6.56)
Note that the last two terms of (17.6.56) contain the weighted sum of squares
(17.6.45). If 0 is known, then maximizing (17.6.56) is equivalent to minimizing
(17.6.45). If n is unknown, (17.6.56) or eo
is maximized by taking the deriva-
tives of In e() with respect to O"~ and O"~, setting them to zero, and solving for
O"~ and O"~ to obtain the ML estimators 6"~ and 6"~. Then 6"~ and 6"~ are inserted
back into (17.6.55) to obtain the concentrated likelihood function, which in turn
is maximized with respect to 1Tx and 1Ty . Since the first step produces
Goldberger suggests an iterative procedure to obtain 6"~, 6"~, Trx , and Try. The
generalized least squares estimators frx and fry, as proposed by Zellner, are
used as initial estimates in order to compute (17.6.57) and (17.6.58), or, equiva-
lently, (17.6.49) and (17.6.50), then (17.6.48) and hence the new generalized
least squares estimator (17.6.47). Then, (17.6.56) is maximized, given 6"~ and
6"~, to obtain Trx and Try. The procedure is repeated until the estimates con-
verge.
It should be noted at this point that for this particular model, Pagan (1984)
has shown that the maximum likelihood estimator and the 2SLS estimator of {3
happen to have the same asymptotic distribution. It turns out that, in this case,
the 2SLS estimator is asymptotically efficient and no asymptotic efficiency
gains are available by switching from the 2SLS estimator to a full maximum
Unobservable Variables 725
(17.6.59)
where 0 < (T~ < 00, 0 < A. < 00, -00 < 7Tj < 00, i = 1, 2,. ., K, and PI(A.) and
P2({3) are of unspecified form. In (17.6.59) we have assumed that {3, A., In {T~, and
the elements of 1Tx are independently distributed and the pdf's for In (T~ and the
elements of 1Tx are uniform. The multiplication of (17.6.59) and (17.6.56) results
in the posterior pdf for the parameters {3, 1Tx, A., and {T~. Numerical integration
techniques required to compute the joint and marginal posterior pdf's were
discussed in Chapter 4.
A generalization of the basic mUltiple causes model (17.6.16) is to include
some exogenous variables Xl that directly affect the indicators:
= TIlxl + TI2X2 + V
= TI'x + v (17.6.62)
where TI2 = (la', TI = (lli TI2)'. The previous procedure of solving (17.6.19)
and (17.6.20) for a and (l is still applicable.
When a stochastic error is added to the equation of multiple causes, the follow- .
ing model can be specified:
y = (lx* + u (17.6.63)
x* = a'x + e (17.6.64)
y = Il(a'x + e) + u
= Ila'x + Ile + u
= IT'x + v (17.6.65)
where
f
R = 1 + fS + Q
f fg h
A= 1 + f + (1 + f)2 + f
f = 11'0- 111, g = 11'0- 150- 111,
5 = (Y - XP)'(Y - XP)
P = (X'X)-IX' Y
Q = P'X'XP
u~
x
z
bx
... x*
! byx
jI Y
w/ V/
Figure 17.1 A path diagram.
(17.6.67)
(\7.6.68)
(17.6.69)
We assume that U t is uncorrelated with xi and also with Yt and Zt, E[xiu t ] =
E[Ztut] = E[Ytut] = 0, and E[ZtWt] = Elxiv t ] = E[xtv t ] = E[ztvt] = O. As a
consequence, E[u t v t ] = E[u t w t ] = E[v t w t ] = O. The above model is a special
case of the model in Section 17.6.3.
When the structural coefficients byx and bxz are adjusted by population stan-
dard deviations, they are called by Wright path coefficients and are denoted by
Pyx and Pxz., respectively, that is, Pyx = bvxu)u" and PXl = bxzu)ux . If the data
are standardized, then U x = U y = U z = 1, b yx = Pyx and b xz = Pxz. Here we
assume that all variables are standardized so that
and
where Pry and Pre denote population correlation between X t and Yt, and X t and Zt,
respectively.
If we multiply equation (17.6.69) by Zt, and take expectations, we obtain
(\7.6.70)
or
Pxz = b xz (17.6.71)
Thus, the sample simple correlation rxz provides an unbiased estimator of bxz
728 Further Model Extensions
(17.6.72)
or
( 17.6.74)
or
Therefore, the greater the variance in the errors (T~, the greater the bias in using
r,r as an estimator of h r , . Note also that the use of rcvlro., as an estimator of h", is
actually the same as the instrumental variable method, where the instrument
variable is 2,. Other possible instrumental variables are discussed in the follow-
ing section.
For other structural equation models and methods, see Hauser and Gold-
berger (1971), Goldberger (I972a,b), Goldberger and Duncan (1973), Duncan
(1975), and Bagozzi (1980).
y = {3x* + v (17.6.76)
x = x* + D (17.6.77)
x* = az + w (17.6.78)
In estimating f3, we combine the first two equations (17.6.76) and (17.6.77) to
obtain
y=f3x+e (17.6.81)
is consistent since
=f3 (17.6.83)
x = az + (w + u) (17.6.84)
a= (Z'Z)-IZ'X (17.6.85)
is consistent.
A difficulty in using the instrumental variable method is the choice of an
appropriate instrument that is correlated with the explanatory variable and
uncorrelated with the error term. A variable that is likely to satisfy these two
conditions is the discrete grouping variable. In other words, we classify the
730 Further Model Extensions
observations according to whether they fall into certain discrete groups and
treat this classification as a discrete-valued variable. Several grouping instru-
ments are available. If a two group classification is appropriate, then a binary
variable
Yf + X*B + E = 0 (17.7.1)
x = X* + V (17.7.2)
plim(X*'EIT) = 0 (17.7.3)
plim(X*' VIT) = 0 (17.7.4)
Y = -X*Bf- 1 - Ef- I
= x*n* + V (17.7.7)
(17.7.10)
which provides the relation connecting the observable moments and coeffi-
cients <I> and II, to the parameters, II* and 8. This relation, in conjunction with
prior restrictions of zeros in e and the overidentifying restrictions on II*, may
be sufficient to identify II*, which in turn, identifies the structural parameters.
In general, overidentifying restrictions on the reduced-form coefficients can be
used to identify measurement error variances if the error variances are properly
located. However, operational rules for assessing the identifiability of parame-
ters in a simultaneous equation model with unobservable variables, remain to
be worked out.
For the subject of identification involving errors in exogenous variables, see
Anderson and Hurwicz (1949), Wiley (1973), Geraci (1973), and Aigner, Hsiao,
Kapteyn, and Wansbeck (1983). For estimation, see Chernoff and Rubin (1953),
Sargan (1958), Joreskog (1973), and Geraci (1983). In the presence of errors in
variables, Geraci finds conditions under which simultaneous equations can be
analyzed and estimated on a recursive equation by equation basis. Examples of
models involving errors in exogenous variables are given by Goldberger (1972b,
1974), Duncan et al. (1968), Hauser (1972), and Duncan and Featherman (1972).
Yt = xi{3 + V t (17.8.1)
Xt = xi + U t (17.8.2)
where the number in the parentheses indicates the order of lags in the variable
of the second subscript. With the extraneous information of E[xi xi- t1 =1= 0, /3 is
identified by the following relation:
(Tyx(1) = E[Ytxt-l]
= E[(x t /3 + V t - u t /3)x t- t1
= /3E[x t x t- t1
= /3(TxA1) (17.8.4)
Therefore,
(17.8.5)
which can be estimated by the ratio of the sample moments myx (1) and m xA1) of
(Tyx(1) and (TxA1), respectively, where
(17.8.8)
is the instrumental variable estimator, the instrument being Xt-l . This is equiva-
lent to specifying a dynamic equation
xi = CUt-l + Wt (17.8.9)
(17.8.10)
X t* -- Y*t-l (17.8.11)
in addition to the first two equations (17.8.1) and (17.8.2), then the covariance
between lagged variables
(17.8.12)
(17.8.13)
734 Further Model Extensions
Yf + X*B + E = 0 (17.8.14)
where
x = X* + U (17.8.15)
is known to be nonsingular, where X'" 1and X _I denote the matrices with lagged
values of elements in X* and X, respectively, then the following equation
= plim(X'-1 XIT)II*
= +x-I xII* (17.8.17)
(17.8.18)
Therefore, the reduced form coefficients II* can be estimated by the sample
covariances of Y and X _I , and X and X _I . This is equivalent to using X -I as an
instrumental variable in regressing Yon X, or
Y = XII* + W (17.8.19)
(17.8.20)
17.9 SUMMARY
17.10 EXERCISES
Some data for exercises in this section are generated from the following basic
model:
y7 + {3xi +
= a YZt
- * + Ut
Xt-Xt
Yt-Yt- *+ Vt (17.10.1)
where a = 8.0, {3 = 0.6, and Y = 0.03; yi and xi, the unobservable variables,
and Zt, the observable variable, measured without errors, are given in Table
17.2. Withtheassumptionsthatu t ~ N(O,l)andv t ~ N(0,9),forall t, a sample
of X t and Yt of size 20 has been generated and is listed in Table 17.2.
Data in Table 17.3 are generated from the model
yi = a + {3xi
Xt = xi + U t
Yt = yi + V t (17.10.2)
Exercise 1 7.1
Using 20 observations of x, y, and z in Table 17.2, compute the least squares
estimates of a, {3, and Y in Yt = a + {3x t + YZt + Vt, where the true variable xi is
replaced by the observable proxy variable Xt. Compare the results to their
theoretical counterparts.
Unobservable Variables 739
Exercise 1 7.2
U sing the observations in Table 17.2, compute the least squares estimates a
and Y in Yt = a + YZt + e t , where the true unobservable variable xi is omitted.
Compare the results to their theoretical counterparts.
Exercise 17.3
Assuming that we know that (T~ = 1 and using the observations in Table 17.2,
estimate the parameters a and f3 of the model (17.10.1).
Exercise 1 7.4
Repeat 17.3 but assume that (T~ = 9 is known.
Exercise 17.5
Repeat I7.3 but assume that (T~/(J";' = 9 is known.
Exercise 1 7.6
Repeat 17.3 but assume that both (T~ = I and (T~ = 9 are known.
740 Further Model Extensions
YI XI Y2 X2 Y3 X3 Y4 X4
Exercise 1 7.7
Using 20 observations from Table I7. 3 and Xt-I as an instrument, estimate the
parameters x and f3 in the model (17.10.2).
Exercise 1 7.8
Repeat 17.7 but use Wald's two-group method.
Exercise 17.9
Repeat 17.7 but use the three-group method.
Exercise 1 7.1 0
Repeat 17.7 but use the ranks as an instrument.
Exercise 17.11
Treat the four samples in Table 17.3 as repeated observations of the same
unobservable variable listed in Table 17.4. Using the method of repeated obser-
vations described in Section 17.5.2, estimate the parameters of the model
(17.10.2).
A second indicator Y2t is assumed observable, where Y2t = x2 + f32X; + U2t
and V2t - N(0,CT~2)' Four samples of indicator values that are generated with
X2 = -0.2, f32 = 7, and U2t - N(0,4) are given by Table 17.5.
TABLE 17.4 TRUE
UNOBSERVABLE
VARIABLES, y* AND
x*
y* x*
26.97 18.86
25.38 17.09
24.62 16.25
22.55 13.95
22.15 13.50
22.78 14.20
23.90 15.44
25.50 17.22
25.90 17.67
24.73 16.37
25.80 17.56
26.64 18.49
28.76 20.84
31.18 23.53
31.86 24.29
32.36 24.85
31.78 24.20
31.16 23.51
29.91 22.12
30.59 22.88
Exercise 17.12
U sing any of the four samples of YI in Table 17.3 and any of the four samples of
Y2 in Table 17.5, estimate the parameters in the following model:
YI = alJ + f3 IX * + VI
Y2 = a2j + f32X* + V2
X = x* +u (17.10.3)
where j = (1 1 1)'.
Exercise 1 7.13
Repeat Exercise 17.12 but use the factor analysis approach described in Section
17.6.2.
TABLE 17.6
ADDITIONAL CAUSES
Z2
305.80 207.82
305.29 206.89
308.45 208.72
313.88 211.57
323.93 218.12
323.05 217.77
321.61 217.22
312.66 211.85
309.80 210.09
308.40 208.72
305.47 207.17
301.18 204.62
289.14 197.37
261.86 180.08
248.15 171.20
244.20 168.75
254.03 175.09
258.29 177.70
243.83 167.59
241.94 166.59
Unobservable Variables 743
Exercise 1 7.14
Using 20 observations of any of the four samples of YI and XI in Table 17.3 and
the two causes ZI and Z2 in Table 17.6, estimate the parameters of the model
YI = a + f3xi + Vt
XI = xi + u,
xi = 7To + 7TI ZIt + 7T2Z2t (17.10.4)
Exercise 1 7.1 5
Repeat 17.14 but use the generalized least squares procedure described 10
Section 17.6.3.
Exercise 17.16
Repeat 17.14 but use the maximum likelihood procedure described in Section
17.6.3.
17.11 REFERENCES
Qual itative
and Limited
Dependent
Variable Models
18.1 INTRODUCTION
Social scientists are concerned in general with the problem of explaining and
predicting individual behavior. Within this context economists examine indi-
vidual choice behavior in a wide variety of settings. Often the choices can be
considered to be selections from a continuum of alternatives, as illustrated by
the conventional theory of the household and firm that deals with, among other
things, "how much" to consume or produce of a certain primary, intermediate,
or final commodity. Economic theory provides models of individual behavior in
these circumstances, and standard procedures allow statistical inferences
about "average" population behavior given a random sample of data from a
population of individuals. Increasingly, however, as microdata become more
widely available, researchers are faced with situations in which the choice
alternatives are limited in number-that is, the alternatives are discrete or
"quanta!." Examples of situations where such choices arise include efforts at
modeling occupation choice, the labor force participation decision, voting be-
havior, and housing choice.
Statistical analysis of general popUlation choice behavior is complicated by
the fact that such behavior must be described in probabilistic terms. That is,
models describing choices from a limited number of alternatives attempt to
relate the conditional probability of a particular choice being made to various
explanatory factors that include the attributes of the alternatives as well as the
characteristics of the decision makers. Several well-known problems arise
when usual regression procedures are applied to such models and a substantial
body of literature exists suggesting alternative methods for dealing with these
problems. Our purpose is (1) to point out the difficulties in applying the usual
techniques to such models, (2) summarize proposed solutions that are applica-
ble under alternative circumstances, and (3) note some advantages and disad-
vantages of the various methods proposed in (2). These and related issues are
discussed in Sections IS.2 (for binary choice models) and IS.3 (for multinomial
choice models).
Qualitative and Limited Dependent Variable Models 753
Pr(Yi = 1) = Pr(V iI > ViO) = Pr[(eiO - eil) < (al - ao) + (Zil - ZiO)'O
+ W!("I1 - "10)]
= F(xi~),
where xi = (1, (Zil - ZiO)', wi), W = al - ao), 0', ("II - "10)'), and F is the
cumulative distribution function (CDF) of (eiO - eil). Note that if an individual
TABLE 18.1 QUALITATIVE AND LIMITED DEPENDENT VARIABLE MODELS
Multinomial logit
I(seetion 18.3.1)
Introduction
(Section IS.1)
L Multinomial
choice
Multinomial probit
(Section 18.3,2)
(Section 18.3)
Evaluation of multinomial choice
models (Section 18.3.3)
characteristic has the same effect on expected utility, then the corresponding
elements of 'Yo and 'Yt are equal and that variable falls out of the model. Also,
the presence of an intercept implies that the choices have effects on utility apart
from their attributes. The kind of choice model one obtains depends on the
choice of F. The most common choices in economic applications are:
where Yi is a random variable that takes the value I if the ith individual goes to
college and zero otherwise, Xi is the ith individual's income, ei is a random
disturbance, and a and {3 are unknown parameters. If Pi is the probability that
Yi = 1 and (I - Pi) is the probability that Yi = 0, then E[y;] = Pi = a + {3Xi if
E[e;] = 0, and (18.2.1) is called the linear probability model.
Qualitative and Limited Dependent Variable Models 757
p = Xf3 + e (18.2.2)
758 Further Model Extensions
(18.2.3)
P I = p. + e
I I , 1=1, . . . ,T ( 18.2.4)
Consequently the random variable ei has mean zero and variance Pi(l - P;)/ni'
which is the ith diagonal value of 1. If the Pi are not known, a feasible GLS
estimator is
(18.2.5)
and
(18.2.9)
and
where T)N is an (N x 1) vector of one's. In practice, only those Xi'S that will
produce extreme values for Pi need to be considered since, if the extreme
values of the predicted Pi are within the range between 0 and I, then the
predicted value of Pi for all other values of Xi is also within the specified
interval. Economic data often show an increasing or decreasing trend. The
extreme values of Pi often correspond to the smallest and the largest values of
Xi. Thus, by selecting possible binding restrictions, the number of restrictions
can be reduced to a manageable size. Combining the restrictions (18.2.7)
through (18.2.10), we can write
(18.2. I 1)
760 Further Model Extensions
where
(18.2.12)
and
(18.2.17)
(18.2.18)
The system of equations (18.2.14) through (18.2.16) is linear and can be solved
by a simplex algorithm with a modification that incorporates side conditions
(18.2.17) and the nonnegativity conditions (18.2.18). For such a modification,
see Wolfe (1959). For similar applications, see Lee, Judge, and Zellner (1968)
and Lee and Judge (1972).
It is interesting to note that the inequality-restricted least squares estimator
includes the feasible, unrestricted least squares estimator (18.2.3). In other
Qualitative and Limited Dependent Variable Models 761
words, if the estimator j3 will forecast p within the unit interval, then AI =
A2 = 0 and the inequality-restricted least squares estimator from (18.2.14) is
also the unconstrained estimator:
= (X'<Il-IX)-IX'<Il-l p = j3 (18.2.19)
If we assume that the larger the value of the index Ii = xf(3, the greater the
probability that the event E in question will occur, we can think of a monotonic
relationship between the value of Ii and the probability of E occurring. Under
these assumptions the "true" probability function would have the characteris-
tic shape of a cumulative distribution function (CDF). Theoretical arguments
for the use of particular CDFs have appeared recently and will be examined in
Section 18.3, but in most economic applications choices seem to have
only casual justification. The two most widely used CDFs are the normal and
the logistic, with the associated analyses called probit and logit, respectively.
One argument for adopting the normal CDF is as follows: Each individual
makes a choice, between E and not-E, by comparing Ii to some critical value of
the random index 1*, which reflects individual tastes. If there are many inde-
pendent factors determining the critical level for each individual, the central
limit theorem may be used to justify assuming that 1* is a normally distributed
random variable. If an individual chooses E only if Ii > 1*, the conditional
probability of event E occurring, given Ii, is
where F(') is the normal CDF evaluated at the value of the argument. The
logistic CDF
1
F(x'U) - ...,------,----,--
1 + exp(-xiJ3), -00 < xiJ3 < 00
ip - (18.2.21)
is used because of its close approximation to the normal CDF [see Cox (1970,
p. 28) and Amemiya (1981, p. 1487)] and its numerical simplicity. Alternative
estimation procedures exist for these models, the alternatives depending on the
number of repeated observations (ni) and hence how precise Pi is as an estima-
tor for Pi. These two cases will now be treated.
(18.2.22)
(18.2.23)
(18.2.24)
usmg
dF-I(P i ) 1
dP i f[F-I(P i )]
wheref() is the value ofthe standard normal density evaluated at its argument.
Alternatively,
(18.2.25)
(18.2.26)
If (18.2.25) is written as
v = X(3 +u (18.2.27)
(18.2.28)
1
Pr (Ell;) = Pi = 1 + exp( - I;) (18.2.29)
This formulation has the property that the odds ratio is a loglinear function of
x'(3, and is given by
,a ei
In [ 1 -Pi Pi ] .= Xi P + P i (1 - P;) (18.2.30)
(18.2.31)
764 Further Model Extensions
(18.2.32)
In [ Pi ]
1 - Pi
The feasible logit estimator J3 is obtained by replacing I> with a consistent
estimator. One consistent estimator for Pi is the sample proportion. Others can
be obtained from the linear probability model or least squares applied to
(18.2.30).
Other CDFs have been suggested by Zellner and Lee (1965, p. 386) and
Nerlove and Press (1973, p. 15), but probit and logit are the most popular in
economic applications. An extensive list of applications is given by McFadden
(1976, pp. 382-390), Hensher and Johnson (1981), and Amemiya (1981).
I with probability Pi
Yi = { 0 with probability 1 - Pi (18.2.33)
r
e - IT p
_
i~ I
V
I
'(1
-.1 _ PI. )I-V'
-,
( 18.2.34)
The logit or probit model arises when Pi is specified as the logistic or normal
CDF evaluated at x; (J. If F(x; (J) denotes either of the CDF's evaluated at x; (J,
then the likelihood function for both models is
r
e = IT [F(x: (J)]Yill
i~ I
- F(x: (J)l'Yi
(18.2.35)
r
In e = 2, {Yi In[F(x!(J)]
i~ I
+ (1 - Y;) In[1 - F(x!(J)]}
(18.2.36)
Qualitative and Limited Dependent Variable Models 765
Whether F() is chosen to be the standard normal or logistic CDF, the first-
order conditions for a maximum will be nonlinear, so maximum likelihood
estimates must be obtained numerically. From Chapter 6 and Appendix B, we
know that the Newton-Raphson iterative procedure for maximizing a nonlinear
objective function leads to the recursive relation
\beta_{n+1} = \beta_n - \left[\frac{\partial^2 \ln \ell}{\partial \beta\, \partial \beta'}\right]^{-1}_{\beta = \beta_n} \left[\frac{\partial \ln \ell}{\partial \beta}\right]_{\beta = \beta_n} \qquad (18.2.37)
where \beta_n is the nth round estimate, and the matrix of second partials and the
gradient vector are evaluated at the nth round estimate. This is convenient
since, under the usual regularity conditions, the maximum likelihood estimator,
say \hat{\beta}, is consistent and asymptotically distributed as

N\left( \beta,\; -\left[\frac{\partial^2 \ln \ell}{\partial \beta\, \partial \beta'}\right]^{-1}_{\beta = \hat{\beta}} \right)

An asymptotically equivalent alternative is to replace the matrix of second
partials by its expected value,

\beta_{n+1} = \beta_n - \left\{ E\left[\frac{\partial^2 \ln \ell}{\partial \beta\, \partial \beta'}\right] \right\}^{-1}_{\beta = \beta_n} \left[\frac{\partial \ln \ell}{\partial \beta}\right]_{\beta = \beta_n}

where

E\left[\frac{\partial^2 \ln \ell}{\partial \beta\, \partial \beta'}\right] = -\sum_{i=1}^{T} \frac{[f(x_i'\beta)]^2}{F(x_i'\beta)[1 - F(x_i'\beta)]}\, x_i x_i'

and we use as the asymptotic covariance matrix -\{E[\partial^2 \ln \ell / \partial\beta\,\partial\beta']\}^{-1}. The
reader should confirm that this is the asymptotic covariance matrix of the
feasible GLS estimator presented in Section 18.2.2.
Thus, in order to carry out optimization the first and second derivatives of
the log-likelihood function are required. While derivatives could be approxi-
mated numerically, there is no reason to do so for the probit and logit models
since they are quite tractable analytically.
For the probit model, where F and f are the standard normal CDF and pdf,
and writing F_i = F(x_i'\beta) and f_i = f(x_i'\beta),

\frac{\partial \ln \ell}{\partial \beta} = \sum_{i=1}^{T} \left[ y_i \frac{f_i}{F_i} - (1 - y_i) \frac{f_i}{1 - F_i} \right] x_i \qquad (18.2.38)

and

\frac{\partial^2 \ln \ell}{\partial \beta\, \partial \beta'} = -\sum_{i=1}^{T} f_i \left[ y_i \frac{f_i + x_i'\beta\, F_i}{F_i^2} + (1 - y_i) \frac{f_i - x_i'\beta\,(1 - F_i)}{(1 - F_i)^2} \right] x_i x_i' \qquad (18.2.39)
For the logit model, where F and f are the logistic CDF and pdf,
\frac{\partial \ln \ell}{\partial \beta} = \sum_{i=1}^{T} y_i \frac{1}{1 + \exp(x_i'\beta)}\, x_i - \sum_{i=1}^{T} (1 - y_i) \frac{1}{1 + \exp(-x_i'\beta)}\, x_i = \sum_{i=1}^{T} [y_i F(-x_i'\beta) - (1 - y_i)F(x_i'\beta)]\, x_i \qquad (18.2.40)

and

\frac{\partial^2 \ln \ell}{\partial \beta\, \partial \beta'} = -\sum_{i=1}^{T} \frac{\exp(-x_i'\beta)}{[1 + \exp(-x_i'\beta)]^2}\, x_i x_i' = -\sum_{i=1}^{T} f(x_i'\beta)\, x_i x_i' \qquad (18.2.41)
Using these derivatives and the recursive relation (18.2.37), maximum likelihood
estimates can be obtained given some initial estimate \hat{\beta}_1. For probit and
logit models the choice of the initial estimates does not matter, since it can be
shown [see Dhrymes (1978, pp. 344-347)] that, for both of these models, the
matrix of second partials \partial^2 \ln \ell / \partial\beta\,\partial\beta' is negative definite for all values of \beta.
Consequently, the Newton-Raphson procedure will converge, ultimately, to
the unique maximum likelihood estimates regardless of the initial estimates.
Computationally, of course, the choice does matter since the better the initial
estimates the fewer iterations must be carried out to attain the maximum of the
likelihood function. While several alternatives for initial estimates exist, one
can simply use the least squares estimates of \beta obtained by regressing y_i on the
explanatory variables.
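To make the iterations concrete, the following is a minimal sketch of the Newton-Raphson recursion (18.2.37) for the logit model, using the derivatives (18.2.40) and (18.2.41). The function and variable names are illustrative rather than taken from the text, and the data are assumed to be held in NumPy arrays.

```python
import numpy as np

def logit_mle(X, y, tol=1e-8, max_iter=50):
    """Newton-Raphson iterations (18.2.37) for the logit log-likelihood.

    X : (T, K) design matrix; y : (T,) vector of 0/1 outcomes.
    Starts from least squares estimates, as suggested in the text.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # initial estimates
    for _ in range(max_iter):
        F = 1.0 / (1.0 + np.exp(-X @ beta))      # logistic CDF at x_i'beta
        f = F * (1.0 - F)                        # logistic pdf at x_i'beta
        grad = X.T @ (y - F)                     # gradient, (18.2.40)
        hess = -(X.T * f) @ X                    # Hessian, (18.2.41); negative definite
        step = np.linalg.solve(hess, grad)
        beta = beta - step                       # recursion (18.2.37)
        if np.max(np.abs(step)) < tol:
            break
    cov = np.linalg.inv(-hess)                   # asymptotic covariance estimate
    return beta, cov
```

Because the log-likelihood is globally concave, the routine converges to the unique maximum from any starting value; better starting values simply reduce the number of iterations.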
When evaluating binary choice models, care must be taken with regard to
several points. First, estimated coefficients do not indicate the increase in the
probability of the event occurring given a one-unit increase in the corresponding
independent variable. Rather, the coefficients reflect the effect of a change
in an independent variable upon F^{-1}(P_i) for the probit model and upon
ln[P_i/(1 - P_i)] for the logit model. In both cases the amount of the increase in the
probability depends upon the original probability and thus upon the initial
values of all the independent variables and their coefficients. This is true since
P_i = F(x_i'\beta) and \partial P_i/\partial x_{ij} = f(x_i'\beta)\,\beta_j, where f(·) is the pdf associated with
F(·). Thus, while the sign of the coefficient does indicate the direction of the
change, the magnitude depends upon f(x_i'\beta), which, of course, reflects the
steepness of the CDF at x_i'\beta. Naturally, the steeper the CDF, the greater the
impact of a change in the value of an explanatory variable will be.
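As a short illustration of this point, here is a sketch that evaluates \partial P_i/\partial x_{ij} = f(x_i'\beta)\,\beta_j at a representative observation (for example, the sample means); the function name and arguments are illustrative.

```python
import numpy as np
from scipy.stats import norm

def marginal_effects(beta, x, model="probit"):
    """dP/dx_j = f(x'beta) * beta_j, evaluated at a representative x."""
    index = x @ beta
    if model == "probit":
        f = norm.pdf(index)                 # standard normal pdf
    else:                                   # logit
        F = 1.0 / (1.0 + np.exp(-index))
        f = F * (1.0 - F)                   # logistic pdf
    return f * beta                         # one effect per coefficient
```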
Second, usual individual or joint hypothesis tests about coefficients and confi-
dence intervals can be constructed from the estimate of the asymptotic covari-
ance matrix using either the Wald or LR statistics, and relying on the asymp-
totic normality of the EGLS or ML estimator being used.
A test of the hypothesis H_0: \beta_2 = \beta_3 = \cdots = \beta_K = 0 can be easily carried out
using the likelihood ratio procedure. If n is the number of successes (y_i = 1)
observed in the T observations, then for both the logit and probit models the
maximum value of the log-likelihood function under the null hypothesis H_0 is

\ln \ell(\hat\omega) = n \ln\left(\frac{n}{T}\right) + (T - n)\ln\left(\frac{T - n}{T}\right)

and the likelihood ratio statistic

-2[\ln \ell(\hat\omega) - \ln \ell(\hat\Omega)] \qquad (18.2.42)

has a \chi^2_{(K-1)} distribution, where \ln \ell(\hat\Omega) is the value of the log-likelihood function
evaluated at the maximum likelihood estimate. Acceptance of this hypothesis would, of course, imply that
none of the explanatory variables has any effect on the probability of E occurring.
In that case the probability that y_i = 1 is estimated by \hat{P}_i = n/T, which is
simply the sample proportion.
Investigators are also frequently interested in a scalar measure of model
performance, such as R² is for the linear model. Amemiya (1981, pp. 1502-
1507) suggests several measures that can be used. Two popular ones are the
value of the chi-square statistic in (18.2.42) and the pseudo-R², defined as

\rho^2 = 1 - \frac{\ln \ell(\hat\Omega)}{\ln \ell(\hat\omega)}

This measure is 1 when the model is a perfect predictor, in the sense that \hat{P}_i =
F(x_i'\hat\beta) = 1 when y_i = 1 and \hat{P}_i = 0 when y_i = 0, and is 0 when \ln \ell(\hat\Omega) = \ln \ell(\hat\omega).
Between these limits the value of \rho^2 has no obvious intuitive meaning. However,
Hauser (1978) shows that \rho^2 can be given meaning in an information
theoretic context. Specifically, \rho^2 measures the percent of the "uncertainty" in
the data explained by the empirical results. This concept is discussed more
fully in Section 18.3.3.
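The following sketch combines the likelihood ratio statistic (18.2.42) and the pseudo-R²; it assumes the maximized log-likelihood of the fitted model has already been computed (for example, by the Newton-Raphson routine above), and the names used are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def lr_test_and_pseudo_r2(lnl_fitted, y, K):
    """LR test of H0: beta_2 = ... = beta_K = 0, and the pseudo-R2.

    lnl_fitted : maximized log-likelihood of the full model.
    y : (T,) vector of 0/1 outcomes; K : number of coefficients.
    Assumes 0 < sum(y) < T so the restricted maximum is well defined.
    """
    T, n = len(y), int(np.sum(y))
    # Restricted maximum under H0: n ln(n/T) + (T - n) ln((T - n)/T)
    lnl_null = n * np.log(n / T) + (T - n) * np.log((T - n) / T)
    lr = -2.0 * (lnl_null - lnl_fitted)          # (18.2.42), chi-square, K-1 df
    p_value = chi2.sf(lr, K - 1)
    pseudo_r2 = 1.0 - lnl_fitted / lnl_null
    return lr, p_value, pseudo_r2
```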
A typical quantal choice model considers subjects who face J > 2 alternatives
and must choose one of these. In this sense these models are generalizations of
those presented in Section 18.2. Let Yij be a binary variable that takes the value
one if the jth alternative, j = 1, . . . ,J, is chosen and zero otherwise. Let
Pij = Pr[Yij = 1]. Then
\sum_{j=1}^{J} y_{ij} = \sum_{j=1}^{J} P_{ij} = 1

and the likelihood function for a sample of T independent individuals is

\ell = \prod_{i=1}^{T} P_{i1}^{y_{i1}} P_{i2}^{y_{i2}} \cdots P_{iJ}^{y_{iJ}} \qquad (18.3.1)

In general the set of alternatives could differ across individuals, so that J could
be indexed by i, but we will not pursue that generalization. This
model is made into a behavioral one by relating the selection probabilities to
attributes of the alternatives in the choice set and the attributes of the individ-
uals making the choices.
As in Section 18.2, such models can be motivated by assuming that individ-
uals maximize utility, and that the utility that the ith individual derives from the
choice of the jth alternative can be represented as
U_{ij} = x_{ij}'\beta + e_{ij} \qquad (18.3.2)
where xij is a vector of variables representing the attributes of the jth choice to
the ith individual, f3 is a vector of unknown parameters, and eij is a random
disturbance. The presence of the random disturbance eij can be rationalized as
reflecting intrinsically random choice behavior, and measurement and/or speci-
fication error. In particular it may reflect unobserved attributes of the alterna-
tives.
Having specified a utility function, each individual is assumed to make selections
that maximize their utility. The probability that the first alternative is
chosen is

P_{i1} = \Pr[U_{i1} > U_{i2}, \ldots, U_{i1} > U_{iJ}] \qquad (18.3.3)

Similar expressions hold for all other P_{ij}'s, and the P_{ij}'s become well-defined
probabilities once a joint density function is chosen for the e_{ij}. It is convenient
to look at (18.3.3) in differenced form as

P_{i1} = \Pr[e_{ij} - e_{i1} < (x_{i1} - x_{ij})'\beta, \; j = 2, \ldots, J] \qquad (18.3.4)
This transformation reduces by one the number of integrals that must be evalu-
ated in order to determine the Pij's. It also explains why distributions that are
closed under subtraction, or produce convenient distributions under subtrac-
tion, are popular candidates for the joint density of the eij.
In the two sections that follow, the logit and probit formulations, respec-
tively, will be presented. The logit model has been extensively considered by
McFadden (1974, 1976), who discusses the discrete choice theory implied by
the logit specification and develops appropriate estimation and inference proce-
dures. The logit model rests upon a very strong behavioral assumption, the
independence of irrelevant alternatives. The implications of this assumption
will be investigated in Section 18.3.1. The probit model has been considered by
Albright, Lerman, and Manski (1977) and Hausman and Wise (1978), and repre-
sents an attempt to avoid the strong assumption upon which logit analysis
depends.
18.3.1 Multinomial Logit
The multinomial logit model follows from the assumption that the e_{ij} in (18.3.2)
are independently and identically distributed with Weibull [Johnson and Kotz
(1970)] density functions. The corresponding cumulative distribution functions are
of the form

F(e_{ij}) = \exp(-e^{-e_{ij}})
McFadden (1974) has shown that a necessary and sufficient condition for the
random utility model (18.3.2) with independent and identically distributed er-
rors to yield the logit model is that the errors have Weibull distributions. The
difference between any two random variables with this distribution has a logis-
tic distribution function, giving the multinomial logit model. The probabilities
arising from this model can be expressed as
P_{ij} = \frac{\exp(x_{ij}'\beta)}{\sum_{j=1}^{J} \exp(x_{ij}'\beta)} \qquad (18.3.5)
The odds that the ith individual chooses alternative one rather than alternative
two are

\frac{P_{i1}}{P_{i2}} = \frac{\exp(x_{i1}'\beta)}{\exp(x_{i2}'\beta)}

since the denominators of (18.3.5) divide out. Thus, for this model the odds of a
particular choice are unaffected by the presence of additional alternatives. This
property is called the independence of irrelevant alternatives and can represent
a serious weakness in the logit model. Suppose, for example, members of a
population, when offered a choice between a pony and a bicycle, choose the
pony in two-thirds of the cases. If an additional alternative is made available,
say an additional bicycle just like the first except of a different color, then one
would still expect two-thirds of the population to choose the pony and the
remaining one-third to split their choices among the bicycles according to their
color preference. In the logit model, however, the proportion choosing the
pony must fall to one-half if the odds relative to either bicycle is to remain two-
to-one in favor of the pony. This illustrates the point that when two or more of
the J alternatives are close substitutes, the conditional logit model may not
produce reasonable results. As Albright, Lerman, and Manski (1977) note, this
feature is a consequence of assuming the errors e_{ij} are independent. On the
other hand, there are many circumstances where the alternatives are distinct
enough for this feature of the logit model not to be a negative factor.
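The pony-and-bicycle example can be reproduced numerically from (18.3.5). Below is a small sketch in which the utility indexes are chosen (hypothetically) so that the pony is preferred to a single bicycle at odds of two to one; adding an identical bicycle mechanically drives the pony share to one-half, exactly the independence-of-irrelevant-alternatives effect described above.

```python
import numpy as np

def mnl_probs(V):
    """Multinomial logit probabilities (18.3.5) from utility indexes V_j."""
    e = np.exp(V - V.max())        # subtract the maximum for numerical stability
    return e / e.sum()

v_pony, v_bike = np.log(2.0), 0.0  # chosen so pony : bicycle odds are 2 : 1
print(mnl_probs(np.array([v_pony, v_bike])))          # [2/3, 1/3]
# An identical second bicycle preserves the 2:1 odds against each bicycle,
# so the pony share falls to one-half.
print(mnl_probs(np.array([v_pony, v_bike, v_bike])))  # [1/2, 1/4, 1/4]
```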
Second, in this formulation none of the K variables represented in x_{ij} can be
constant across all alternatives, since then the associated parameter would not be
identified. For example, consider again the odds of choosing alternative one
rather than alternative two,

\frac{P_{i1}}{P_{i2}} = \frac{\exp(x_{i1}'\beta)}{\exp(x_{i2}'\beta)} = \exp[(x_{i1} - x_{i2})'\beta]
If any corresponding elements of x_{i1} and x_{i2} are equal, the associated variable
has no influence on the odds. If this is the case for all alternatives, then the
variable in question does not contribute to the explanation of why one alterna-
tive is chosen over another and its parameter cannot be estimated. Intuitively,
if the parameter vector !l is to remain constant across all alternatives, only
factors that change from alternative to alternative can help explain why one is
chosen rather than another. Consequently, variables like age, sex, income, or
race that are constant across alternatives provide no information about the
choice process given this model. Variables that would provide information
about the choices made include the cost of a particular alternative to each
individual (for instance the cost of transportation by car, bus, train or taxi to a
particular destination) or the return to an individual from each available alter-
native. These factors vary across alternatives for each individual.
Unfortunately, economists rarely have data that varies over both the individ-
ual in the sample and over each alternative faced by an individual. Such data
are both costly to collect and difficult to characterize. The data economists
usually have access to varies across individuals but not across alternatives, so
that for any individual x_{i1} = x_{i2} = \cdots = x_{iJ} = x_i. Given this circumstance, the
model must be modified in some way before it can be used to characterize
choice behavior. One possible modification is to allow the explanatory variables
to have differential impacts upon the odds of choosing one alternative
rather than another. That is, the coefficient vector is made alternative-specific,
and the selection probabilities are given by

P_{ij} = \frac{\exp(x_{ij}'\beta_j)}{\sum_{j=1}^{J} \exp(x_{ij}'\beta_j)} \qquad (18.3.6)
The odds of choosing alternative k rather than alternative one are then

\frac{P_{ik}}{P_{i1}} = \frac{\exp(x_{ik}'\beta_k)}{\exp(x_{i1}'\beta_1)} = \exp(x_{ik}'\beta_k - x_{i1}'\beta_1), \qquad k = 2, \ldots, J \qquad (18.3.7)

If the vectors x_{ik} and x_{i1} contain variables that are constant across alternatives,
then x_{ik} = x_{i1} = x_i for all k = 2, \ldots, J, and (18.3.7) becomes

\frac{P_{ik}}{P_{i1}} = \exp[x_i'(\beta_k - \beta_1)] \qquad (18.3.8)

Consequently only differences of the coefficient vectors are identified, and \beta_1
may be normalized to zero. The selection probabilities are then

P_{i1} = \frac{1}{1 + \sum_{j=2}^{J} \exp(x_i'\beta_j)}

P_{ij} = \frac{\exp(x_i'\beta_j)}{1 + \sum_{j=2}^{J} \exp(x_i'\beta_j)}, \qquad j = 2, \ldots, J \qquad (18.3.9)
Estimation of the parameters of the multiple logit model with the selection
probabilities given by (18.3.5) or (18.3.9) can be carried out by using maximum
likelihood procedures. The appropriate likelihood function is obtained by sub-
stituting the relevant expression for Pi) into (18.3.1). See McFadden (1974) for
details concerning estimation when probabilities are given by (18.3.5) and
Schmidt and Strauss (1975) for the case represented in (18.3.9). In both cases
the likelihood function is concave in the unknown parameters and any conver-
gent numerical optimization algorithm discussed in Appendix B can be used.
The expressions for first and second derivatives are given in the cited refer-
ences.
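As an illustration of the estimation problem, the sketch below maximizes the log of the likelihood obtained by substituting (18.3.9) into (18.3.1), using a general-purpose quasi-Newton routine; because the log-likelihood is concave, any convergent algorithm reaches the maximum. The function names and the data layout are illustrative, not from the cited references.

```python
import numpy as np
from scipy.optimize import minimize

def mnl_loglik(b_flat, X, Y, J):
    """Log-likelihood for the logit model (18.3.9), beta_1 normalized to zero.

    X : (T, K) individual characteristics; Y : (T,) chosen alternative in 0..J-1.
    b_flat : flattened (J-1, K) array of alternative-specific coefficients.
    """
    T, K = X.shape
    B = b_flat.reshape(J - 1, K)
    V = np.hstack([np.zeros((T, 1)), X @ B.T])   # index 0 for alternative one
    V -= V.max(axis=1, keepdims=True)            # numerical stability
    logP = V - np.log(np.exp(V).sum(axis=1, keepdims=True))
    return logP[np.arange(T), Y].sum()

def fit_mnl(X, Y, J):
    K = X.shape[1]
    start = np.zeros((J - 1) * K)
    res = minimize(lambda b: -mnl_loglik(b, X, Y, J), start, method="BFGS")
    return res.x.reshape(J - 1, K)
```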
18.3.2 Multinomial Probit
The multinomial probit model overcomes some of the weaknesses of the logit
formulation by permitting the disturbances eij in (18.3.2) to be correlated and by
allowing tastes to vary across individuals in the population. The correlation of
the e_{ij}'s arises as a consequence of their embodiment of unobserved attributes
and individual effects. For two similar choices, j and k, it is reasonable to
expect e_{ij} and e_{ik} to be similar also, and thus not independent. It follows that the
random utility function is specified as

U_{ij} = x_{ij}'\beta_i + e_{ij} \qquad (18.3.10)

where the taste parameters \beta_i = \beta + v_i vary randomly across individuals, so that

U_{ij} = x_{ij}'\beta + (x_{ij}'v_i + e_{ij}) \qquad (18.3.11)

In this model the disturbances x_{ij}'v_i + e_{ij} are not independent because of the
relation between the e_{ij}'s and also because for two choices j and k the errors
will have the common random element v_i.
The random utility model is made complete by assuming that the random
parameters \beta_i and the random errors e_{ij} are multivariate normal and independent
of one another. This assumption poses some practical problems since
(18.3.4) now represents (J - 1) integrals of the pdf of a multivariate normal
random vector. Several versions of the multinomial probit model, and the
computational problems, are discussed by Albright, Lerman, and Manski
(1977), Hausman and Wise (1978), and Daganzo (1979).
18.3.3 Evaluation of
Multinomial Choice Models
With respect to evaluation of results from multinomial choice models, one topic
of interest is measuring the effects of changes of the independent variables
upon the selection probabilities. This is a straightforward generalization of the
results in Section 18.2.3. The difficulties in using the relevant partial derivatives
are considered by Crawford and Pollak (1982). They obtain bounding values for
the partials and their standard errors. They also carefully define two notions of
"order" in qualitative response models, and give necessary and sufficient con-
ditions for the presence of each.
Procedures currently used to validate quantal choice models also involve
attempts to measure the "goodness" of the predictions of the model. This is
not straightforward, since the statistical model predicts conditional probabili-
ties that must be compared to actual choices. Current measures of model
goodness include:
2. A comparison of the actual share in the sample for each alternative with the
predicted share \sum_{i=1}^{T} \hat{P}_{ij}/T, or the root mean square percent error, allows an
evaluation of different model specifications.
3. The log likelihood chi-square test, where the null model, with all coeffi-
cients equal to zero, implies that all alternatives are equally likely.
4. A pseudo-R², the likelihood ratio index

\rho^2 = 1 - \frac{L(\hat\beta)}{L(\hat\beta_H)}

where L(\hat\beta) is the log likelihood of the unconstrained model and L(\hat\beta_H) is
the log likelihood of the model defined by the null hypothesis. This measure
is zero when L(\hat\beta) = L(\hat\beta_H), and when the model is a perfect predictor
\rho^2 = 1.
The informational content of a message that an event y, which had prior
probability P, has occurred is

I(P) = \ln(1/P)

which decreases from infinity (infinite surprise when P = 0) to zero (no information
when P = 1 and y was certain to occur).
Prior to receiving any message about y, the informational content of a mes-
sage is known to be either /(P) or /(1 - P), depending upon whether the mes-
sage states that y occurred or that it did not. The expected informational con-
tent of the message, or entropy, then is
H = P\,I(P) + (1 - P)\,I(1 - P) = P \ln\left(\frac{1}{P}\right) + (1 - P)\ln\left(\frac{1}{1 - P}\right)

This measure can be extended to several events, say y_1, \ldots, y_J, with probabilities
P_1, \ldots, P_J, which sum to one, meaning that it is known with certainty
that one of the events will occur. The expected information in this case is

H = \sum_{j=1}^{J} P_j I(P_j) = \sum_{j=1}^{J} P_j \ln\left(\frac{1}{P_j}\right) \qquad (18.3.12)
Now suppose a message transforms the probability of the event y from a prior
value P to a posterior value Q, so that the information content of the message
is ln(Q/P). This measure is zero when P = Q, which implies that the message did not
change the odds in favor of y and thus provided no information. Its value is
ln(1/P) when Q = 1, which is the case when the message indicates that y has
occurred. The value is negative when Q < P, which corresponds to the case
when the message indicates y is less likely and then y does occur, implying that
the message provided negative or bad information. In the general case, when J
mutually exclusive events are considered, the information on each event is
ln(Q_j/P_j) and the probability of y_j is Q_j, so that the expected information is

I(Q{:}P) = \sum_{j=1}^{J} Q_j \ln\left(\frac{Q_j}{P_j}\right) \qquad (18.3.14)

where P' = (P_1, \ldots, P_J) and Q' = (Q_1, \ldots, Q_J). Note that this measure
may be infinitely large if some Q_j is positive when the corresponding P_j was
zero, and is zero if P_j = Q_j for all j. Such a message would imply that y_j,
previously considered to have zero chance, now has a positive probability, and
represents an infinite informational increase. Also, as should be expected, if the
message states that one outcome, y_j, has probability one, then the expected
information is ln(1/P_j), as in the simple case.
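The entropy (18.3.12) and the expected information (18.3.14) are easily computed; here is a minimal sketch with illustrative names, using the usual convention that terms with zero probability contribute nothing.

```python
import numpy as np

def entropy(P):
    """Entropy (18.3.12): sum_j P_j ln(1/P_j), with 0 * ln(1/0) taken as 0."""
    P = np.asarray(P, dtype=float)
    nz = P > 0
    return float(np.sum(P[nz] * np.log(1.0 / P[nz])))

def expected_information(Q, P):
    """Expected information (18.3.14): sum_j Q_j ln(Q_j / P_j).

    Infinite if some Q_j > 0 where the prior P_j = 0; zero if Q = P.
    """
    Q, P = np.asarray(Q, float), np.asarray(P, float)
    if np.any((Q > 0) & (P == 0)):
        return np.inf
    nz = Q > 0
    return float(np.sum(Q[nz] * np.log(Q[nz] / P[nz])))

print(entropy([0.5, 0.5]))                         # ln 2, the maximum for J = 2
print(expected_information([1, 0], [0.25, 0.75]))  # ln(1/0.25), message that y_1 occurred
```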
With these definitions and ideas the problem of evaluating quantal choice
models can be considered. Let y' = (y_1, \ldots, y_J) be the set of alternatives faced by
the ith individual, who is observed to have characteristics x_i. A multinomial
choice model estimates the probabilities \hat{P}_{ij} = \Pr(y_j|x_i); we wish to compare
these estimates with the actual choices made, represented by \delta_{ij}, which is unity
if individual i chooses alternative j and zero otherwise. If we let \Pr(y_j|x_i) = Q_{ij},
the probability of alternative j's being chosen by the ith individual given the
"message" x_i, and \Pr(y_j) = P_j, the prior probability of the jth alternative being
chosen, then the informational content of the model (18.3.14) becomes

EI(y;X) = \sum_{x_i} \Pr(x_i) \sum_{j=1}^{J} \Pr(y_j|x_i) \ln\frac{\Pr(y_j|x_i)}{\Pr(y_j)} \qquad (18.3.15)
where X is the set of sample observations, and \Pr(y_j, x_i) is the joint probability
of an observation x_i and an event y_j. The uncertainty after the observations x_i
are available is

H(y|X) = \sum_{x_i} \Pr(x_i) \sum_{j=1}^{J} \Pr(y_j|x_i) \ln\frac{1}{\Pr(y_j|x_i)}

where \Pr(x_i) in most cases will be 1/T, since observations will not in general be
repeated. The distribution \Pr(y_j) could be any prior distribution, but two particularly
useful specifications are \Pr(y_j) = 1/J, the equally likely case, or to set
\Pr(y_j) equal to the known population or sample share of each of the alternatives.
Then, within the information theoretic framework, the following procedures
may be used to evaluate a model of qualitative choice:
1. Theil (1971, p. 645) suggests comparing the sample shares of each alternative
with their predicted shares \sum_{i=1}^{T} \hat{P}_{ij}/T. If q_1, \ldots, q_J are the observed shares
and p_1, \ldots, p_J the predicted shares, then (18.3.14) is an inverse measure of
the accuracy of these predictions; zero when p_j = q_j for all alternatives and
positive otherwise. The expected information is called the information inaccuracy
of the forecast and may be regarded as a measure of lack of fit.
2. A measure of the "empirical information" provided by the model can be
computed by evaluating

I(y;X) = \frac{1}{T} \sum_i \sum_j \delta_{ij} \ln\frac{\hat{P}_{ij}}{\Pr(y_j)}

where \Pr(y_j) is the selected prior distribution of the alternatives and \hat{P}_{ij} is a
predicted choice probability from the model. The empirical information can be
compared to the total uncertainty in the system (18.3.12), which has maximum
value ln J under the assumption of equally likely alternatives. Hauser (1978)
has shown that this ratio is a generalization of the pseudo-R2 measure in that if
the null hypothesis or prior distribution does not depend on i, then the two are
numerically equal. Thus the pseudo-R2 may be interpreted as the percent of
uncertainty explained by the empirical results.
3. Alternatively, we could compare the empirical information to the ex-
pected information from the model (18.3.15). Hauser (1978) states that a test for
the adequacy of the probabilistic model could be carried out by testing the
closeness of the empirical information to the expected information. He argues
that I(y;X) is asymptotically normally distributed with mean EI(y;X) and
variance

V(y;X) = \frac{1}{T^2} \sum_{i=1}^{T} \left\{ \sum_{j=1}^{J} \Pr(y_j|x_i)\left[\ln\frac{\Pr(y_j|x_i)}{\Pr(y_j)}\right]^2 - \left[\sum_{j=1}^{J} \Pr(y_j|x_i)\ln\frac{\Pr(y_j|x_i)}{\Pr(y_j)}\right]^2 \right\}
These measures, along with the chi-square test, provide useful, consistent,
and interpretable ways of evaluating the probabilistic model for quantal choice.
Also, regression diagnostic techniques have recently been extended to logistic
models by Pregibon (1981) and graphical diagnostic procedures are suggested
by Landwehr et al. (1984). As usual, none of these validation procedures
should be used as a vehicle for model specification. Although pretest estimates
have not been examined for these models, it is clear that the resulting estima-
tors may not have the expected distributions under such procedures.
As formulated above, the multinomial probit model is less restrictive than the
multinomial logit model, but the generality is gained at considerable computational
expense. The multinomial logit model assumes that the parameters of the
random utility function are constant across individuals in the population and
that the random utilities are independent, even for alternatives that are similar.
The multinomial probit relaxes both of these assumptions. By adopting a random
utility function of the form (18.3.10), taste parameters are assumed to vary
normally across the population of decisionmakers and utilities can be correlated.
Hausman and Wise (1978) assume the random parameters \beta_i are uncorrelated
across i and that the e_{ij} are uncorrelated across i and j. Thus U_{ij} and U_{ik}
are correlated because the same \beta's appear for all the alternatives. The computational
procedure they adopt is feasible for a model with up to five alternatives.
They compare their general model, and a special case that assumes the
variances of the random coefficients to be zero (called independent probit) to
multinomial logit. They conclude that in their example the results of logit and
independent probit are similar, and their general model fits best and produces
significantly different estimates and forecasts concerning a new alternative.
Albright, Lerman and Manski's (1977) model is more general and they consider
alternative procedures for evaluating the integral of the normal pdf. Their pro-
cedure can handle models with up to ten alternatives. They applied their model
to the Hausman and Wise data and concluded that the logit and probit results
did not differ by much. Furthermore, they could not obtain accurate estimates
of the covariance matrix of \beta_i, the increase in the value of the log likelihood
using probit rather than logit did not seem substantial relative to the decrease in
degrees of freedom, and the probit model required much more computer time
per iteration than did the logit model. Thus while they demonstrated the feasi-
bility of the multinomial probit model, they could not justify its additional cost,
at least for the data set they used.
In addition to literature on the choice between multinomial logit and probit,
additional work has appeared on a variety of related topics. McFadden (1977,
1981) introduces a nonindependent logit model that eliminates some of the
weaknesses of the multinomial logit model described above. Misspecification in
the context of choice models has been considered by several authors. Lee
(1982) considers the effect of omitting a relevant variable from a multinomial
logit model. Ruud (1983) considers the effects of applying maximum likelihood
techniques to a multinomial choice model when the distribution has been misspecified.
Parks (1980) extends the procedure of Amemiya and Nold (1975) to
allow for an additional disturbance to reflect misspecification of the random
utility function in the context of a sample with repeated observations. Fomby
and Pearce (1982) present standard errors for the selection probabilities, partial
derivatives, and the expected response in the multinomial logit model. Zellner
(1983) presents a Bayesian analysis of a simple multinomial logit model.
18.4 ANALYSIS OF
CENSORED AND TRUNCATED SAMPLES
In the two previous sections of this chapter we have considered discrete choice
models. They explain the behavior of variables that represent the choice of one
of a finite number of alternatives by a decisionmaker, and consequently are of
the multinomial form. A more general way to view these models is as regres-
sion models in which a dependent variable can only be observed in a limited
range or in a limited way. In the binary probit model, for example, we only
observe whether or not a certain threshold is crossed by an unobservable index
variable. In this section we consider other models in this framework. In partic-
ular we explore the case of a dependent variable that is continuous, but not
completely observable. For example, the dependent variable, and, perhaps,
some independent variables as well, is observed and recorded only if the depen-
dent variable is positive. The dependent variable is thus continuous, but ob-
served only on a limited range.
A unifying framework for the current literature on models with limited de-
pendent variables is to consider them as models with missing data, as is done by
Heckman (1976, 1979) and Hartley (1976). Samples with missing observations
arise in different ways, and the biases that result and the statistical procedures
required to deal with them depend upon why the observations are missing.
Such situations involving either censored or truncated samples are discussed
by Kendall and Stuart (1973, p. 541).
A censored sample is one in which some observations on the dependent
variable corresponding to known sets of independent variables are not observ-
able. For example, suppose shots are fired under controlled conditions at a
vertical circular target of fixed radius R. The distance of the shot from the
center, the random variable in question, will be unobservable if the shot misses
the target entirely. The observable range of the random variable will be limited
to the range [O,R] and no observations can be recorded for some experimental
trials. A familiar example in economics is Tobin's (1958) model in which a
consumer durable is purchased if a consumer's desire is high enough, where a
measure of that desire is provided by the dollar amount spent by the purchaser.
On the other hand, no measure is obtained if no purchase is made so that the
sample is censored at zero, and those observations are incomplete in that no
value is available for the dependent variables.
A truncated sample results when knowledge of the independent variables is
available only when the dependent variable is also observed. Thus, in the
shooting range example, the experimental setting is recorded only if the trial
results in the target's being hit, resulting in some trials in which neither depen-
dent nor independent variables are observed. One must simply assume, then,
that the values of the random variable occurring come from a distribution that
is truncated at R. Kendall and Stuart (1973, p. 541) note that censoring is a
property of the sample while truncation is a property of the distribution. In the
consumer durable example, a truncated sample would occur if customer characteristics
were recorded only for those consumers who actually made a purchase.

To begin, consider the model

y_t = \begin{cases} x_t'\beta + e_t & \text{if } x_t'\beta + e_t > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (18.4.1)
Furthermore, assume that, of T observations, the last s y_t's are zero, but y_t,
t = 1, \ldots, T - s, is observable. Thus the sample is censored, and the model is often
called the Tobit model. In this case the regression function can be written

E(y_t \mid y_t > 0) = x_t'\beta + \sigma\lambda_t, \qquad t = 1, \ldots, T - s \qquad (18.4.2)

where

\lambda_t = \frac{f(x_t'\beta/\sigma)}{F(x_t'\beta/\sigma)} \qquad (18.4.3)

and f(·) and F(·) are, respectively, the density and CDF of a standard normal
random variable evaluated at the argument. Thus, the regression function can
be written

y_t = x_t'\beta + \sigma\lambda_t + u_t, \qquad t = 1, \ldots, T - s \qquad (18.4.4)

where u_t is a disturbance with zero conditional mean.
The difficulty with least squares is that it omits the second term on the right-
hand side of (18.4.4). The least squares estimator of Il is biased and inconsis-
tent, using either the entire sample or the subsample of complete observations.
Greene (1981a) and Goldberger (1981) consider in more detail the properties of
the OLS estimators in Tobit models when the explanatory variables are multi-
variate normal.
Two-step and maximum likelihood estimation procedures are available for
the censored sample problem. A simple two-step procedure can be carried out
as follows: Let Y_t be a random variable that is 1 when y_t is observed and zero
otherwise. Then the likelihood function for the sample is

\ell = \prod_{t=1}^{T} [\Pr(e_t \le -x_t'\beta)]^{1 - Y_t} [\Pr(e_t > -x_t'\beta)]^{Y_t} = \prod_{t=1}^{T} \left[F\!\left(\frac{x_t'\beta}{\sigma}\right)\right]^{Y_t} \left\{1 - F\!\left(\frac{x_t'\beta}{\sigma}\right)\right\}^{1 - Y_t} \qquad (18.4.5)
since F(-t) = 1 - F(t) for the normal distribution. This is the likelihood
function for probit estimation of a model in which \Pr(Y_t = 1) = F(x_t'\beta/\sigma).
Thus the first step of the two-step procedure is to estimate a probit model
where the dependent variable is 1 or zero depending on whether y_t is observed
or not. This provides a consistent estimator of \beta/\sigma, which can be used to
provide consistent estimators of x_t'\beta/\sigma and hence of \lambda_t. The consistent estimator of \lambda_t is
then inserted into (18.4.4), and the second step of the two-step procedure is the
application of least squares to the resulting equation. The estimator of \beta produced
by this process is consistent and asymptotically normally distributed.
See Heckman (1976, 1979) and Greene (1981b) for the proofs, expressions for
the asymptotic covariance matrix, and other generalizations.
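A minimal sketch of the two steps follows, under the assumption that a probit ML routine (for example, a Newton-Raphson routine like the one in Section 18.2) is available as the argument probit_fit; all names here are illustrative.

```python
import numpy as np
from scipy.stats import norm

def heckman_two_step(X, y, probit_fit):
    """Two-step estimation for the censored (Tobit) model (18.4.1)-(18.4.4).

    X : (T, K) regressors for all T observations; y : (T,) with zeros where censored.
    probit_fit : function (X, indicator) -> estimate of beta/sigma (assumed supplied).
    """
    observed = y > 0
    # Step 1: probit on the observability indicator estimates beta/sigma,
    # and hence lambda_t = f(x_t'beta/sigma) / F(x_t'beta/sigma) as in (18.4.3).
    gamma = probit_fit(X, observed.astype(float))
    index = X[observed] @ gamma
    lam = norm.pdf(index) / norm.cdf(index)      # inverse Mills ratio
    # Step 2: least squares of y on (X, lambda) over the complete observations,
    # as in (18.4.4); the coefficient on lambda estimates sigma.
    Z = np.column_stack([X[observed], lam])
    coef = np.linalg.lstsq(Z, y[observed], rcond=None)[0]
    return coef[:-1], coef[-1]                   # (beta_hat, sigma_hat)
```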
Amemiya (1973) considers maximum likelihood estimation of the parameters
of (18.4.1) under the assumption that the errors e_t are independent and N(0,\sigma^2).
Since the likelihood function is highly nonlinear, we will, following Amemiya,
seek a consistent estimator with which to begin an iterative search process. The
availability of a consistent initial estimator will also allow use of the linearized
maximum likelihood estimator, which is known to have the same asymptotic
distribution as the maximum likelihood estimator if the initial estimator is consistent.

Let S be an s-element subset of the T integers 1, 2, \ldots, T such that y_t = 0 for
t in S. Let \bar{S} be the set of T - s elements such that y_t > 0 for t in \bar{S}. The
likelihood function is defined as

\ell = \prod_{t \in S} G(-x_t'\beta) \prod_{t \in \bar{S}} g(y_t - x_t'\beta)

where G(·) and g(·) are the CDF and pdf of a N(0,\sigma^2) random variable, respectively.
The likelihood function can be simplified since G(-x_t'\beta) = 1 - F(x_t'\beta/\sigma),
where F is the standard normal CDF.
The normal equations are highly nonlinear and thus a root must be obtained
numerically. Furthermore, because of the nature of the likelihood function
usual theorems about the asymptotic normality and consistency of the maxi-
mum likelihood estimator do not hold. Amemiya (1973) shows that a root \hat\theta of
the normal equations \partial \ln \ell / \partial\theta = 0, where \theta = (\beta', \sigma^2)', is consistent and

\sqrt{T}(\hat\theta - \theta) \xrightarrow{d} N\left(0, -\left[\frac{\partial^2 Q(\theta)}{\partial\theta\,\partial\theta'}\right]^{-1}\right)

where

\frac{\partial^2 Q(\theta)}{\partial\theta\,\partial\theta'} = \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \begin{bmatrix} a_t x_t x_t' & b_t x_t \\ b_t x_t' & c_t \end{bmatrix} \qquad (18.4.6)

and

a_t = \frac{1}{\sigma^2}\left( z_t f_t - \frac{f_t^2}{1 - F_t} - F_t \right)

b_t = \frac{1}{2\sigma^3}\left( z_t^2 f_t + f_t - \frac{z_t f_t^2}{1 - F_t} \right)

c_t = -\frac{1}{4\sigma^4}\left( z_t^3 f_t + z_t f_t - \frac{z_t^2 f_t^2}{1 - F_t} - 2F_t \right)

where z_t = x_t'\beta/\sigma, F(x_t'\beta/\sigma) = F_t, and f(x_t'\beta/\sigma) = f_t. This means that the
sampling density of \hat\theta may be approximated by

N\left(\theta, \left\{-T\left[\frac{\partial^2 Q(\theta)}{\partial\theta\,\partial\theta'}\right]_{\theta=\hat\theta}\right\}^{-1}\right)

where the limit sign in (18.4.6) is removed.
Since the normal equations are nonlinear, their solution may be obtained by
an iterative process. The method of Newton gives the second round estimate

(18.4.7)

Expressions for the necessary partial derivatives are given in Amemiya (1973).
In order for the second round estimator to be consistent, a consistent initial
estimator must be used. Amemiya shows that the instrumental variables estimator
is consistent and may serve this purpose. For a truncated sample, in which
observations are available only when y_t > 0, the log-likelihood function is,
apart from a constant,

\ln \ell = -\sum_{t=1}^{T-s} \ln F_t - \frac{T - s}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} \sum_{t=1}^{T-s} (y_t - \beta'x_t)^2
As with the probit and logit models, the estimated coefficients in a Tobit model
must be interpreted with some care. McDonald and Moffitt (1980) explore the
interpretation and use of the estimated Tobit coefficients. They show that

Ey_t = F(z_t)\,Ey_t^* \qquad (18.4.8)

where z_t = x_t'\beta/\sigma, Ey_t^* is the expected value of the dependent variable y_t given
that y_t > 0, and F(·) is the cumulative normal distribution function. Furthermore,

\frac{\partial Ey_t}{\partial x_{tj}} = F(z_t)\frac{\partial Ey_t^*}{\partial x_{tj}} + Ey_t^*\frac{\partial F(z_t)}{\partial x_{tj}}

Thus, the total change in Ey_t in (18.4.8) is disaggregated into two parts: first, the
change in y for those above the limit, weighted by the probability of being above
the limit; and second, the change in the probability of being above the limit,
weighted by the expected value of y_t given that it is above the limit. These
values, of course, depend upon the parameter estimates for \beta and \sigma^2 as well as
the values of the explanatory variables x_t. For the purposes of reporting the
results, one might choose x_t to be the mean values of the explanatory variables.
Also note that the coefficient \beta_j is not \partial Ey_t/\partial x_{tj}. See the McDonald and Moffitt
paper for some examples of the use of this decomposition.
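The decomposition is easy to compute once \hat\beta and \hat\sigma are in hand. The sketch below evaluates both parts at a representative observation x (for example, the sample means), using the standard normal results E(y_t | y_t > 0) = x'\beta + \sigma\lambda(z) with \lambda(z) = f(z)/F(z) and \lambda'(z) = -z\lambda - \lambda^2; names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def mcdonald_moffitt(beta, sigma, x):
    """Decomposition of dEy/dx_j in (18.4.8) at a representative x."""
    z = x @ beta / sigma
    F, f = norm.cdf(z), norm.pdf(z)
    lam = f / F                                  # inverse Mills ratio
    # Change for those above the limit: beta_j * (1 - z*lam - lam**2)
    d_cond = beta * (1.0 - z * lam - lam**2)
    Ey_cond = x @ beta + sigma * lam             # E[y | y > 0]
    d_prob = Ey_cond * (f / sigma) * beta        # Ey* dF/dx_j
    total = F * d_cond + d_prob                  # algebraically equals F(z)*beta_j
    return total, F * d_cond, d_prob
```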
In the past several years a substantial amount of work has appeared on
alternative estimation procedures for models with qualitative and limited de-
pendent variables as well as the effects of and tests for specification errors in
such models. First, Olsen (1978) and Pratt (1981) have shown the concavity of
the log likelihood function for the Tobit model, so that if an iteration procedure
converges, it converges to a consistent and asymptotically normal estimate.
Fair (1977) has suggested a more efficient iteration procedure than the usual
method of Newton. Greene (1981a, 1983) suggests "corrected" OLS estimates
that are close to the ML estimates under the assumption that the explanatory
variables are multivariate normal. Powell (1981) offers a least absolute devia-
tion estimator that is consistent and does not depend on the distribution of the
errors. Paarsch (1984) evaluates Powell's estimator, Heckman's two-step esti-
mator, the ML estimator and OLS under various distributional, censoring and
sample size assumptions. Nelson (1984) studies the efficiency of Heckman's
two-step estimator relative to the ML estimator. These investigations show
that ML can be substantially more efficient than the two-step estimator.
The sensitivity of ML estimation to errors in the specification of the error
distribution in Tobit models has begun to be documented. Goldberger (1980),
Chung and Goldberger (1984) and Arabmazer and Schmidt (1982) show and
discuss the inconsistency of the MLE when the errors are not normal. Bera,
Jarque, and Lee (1982) define a Lagrange multiplier test that can be used to test
for normality in probit as well as truncated and censored sample models. Nel-
son (1981) offers a Hausman-like test for misspecification in the Tobit model.
Hurd (1979) and Arabmazer and Schmidt (1981) study the effect of heterosce-
dasticity in the estimation of truncated and censored models, respectively.
Heteroscedasticity makes the usual ML estimator inconsistent in both cases.
Consider the simultaneous equations system

Y\Gamma + XB + E = 0 \qquad (18.5.1)
where the systems notation is defined in Chapter 14. In contrast to Chapter 14,
however, where all endogenous variables were assumed to be completely ob-
servable, continuous variables, here we are going to permit some of the endoge-
nous variables to be limited or even unobservable. Specifically, assume that
0 \le M_1 \le M_2 \le M_3 \le M and the following:
1. The first M_1 variables in Y, y_1, \ldots, y_{M_1}, are observable, continuous
variables.
2. The next M_2 - M_1 variables, y_{M_1+1}, \ldots, y_{M_2}, are limited in the sense that
they can be observed only if they are positive. (Tobit structure)
3. The next M_3 - M_2 variables, y_{M_2+1}, \ldots, y_{M_3}, are unobservable latent
variables. However, corresponding to each is an observable binary indicator
variable that takes the value unity or zero depending on whether the
unobservable variable is positive or nonpositive. (Probit structure)
4. The final M - M_3 variables are censored by a subset of the variables y_{M_2+1},
\ldots, y_{M_3}. That is, the variables y_{M_3+1}, \ldots, y_M are each associated with
a specific latent variable and are observable themselves only when the
associated latent variable is positive. (Censored structure)
For the model (18.5.1), in the presence of these limited endogenous vari-
ables, maximum likelihood methods are conceptually possible but complicated
in practice. As an alternative we consider consistent two-stage techniques.
Each of these relies on consistent estimation of the reduced form in the first
stage. For (18.5.1), under usual completeness assumptions, the reduced form is
Y = X\Pi + V \qquad (18.5.2)

where \Pi = -B\Gamma^{-1} and V = -E\Gamma^{-1}. The reduced form equations for specific
endogenous variables can be written as

y_i = X\pi_i + v_i \qquad (18.5.3)

The parameters \pi_i can be estimated consistently using probit, Tobit, and so on,
depending upon the specific form of y_i.
In the second stage the structural parameters are estimated. Let the ith
structural equation be

y_i = Y_i\gamma_i + X_i\beta_i + e_i \qquad (18.5.4)

where the partitions correspond to the LHS endogenous variable (y_i), the RHS
endogenous variables (Y_i), and the excluded endogenous variables (Y_i^*) in the
ith equation. Note that Y_i = X\Pi_i + V_i. Substituting for Y_i in (18.5.4), we obtain

y_i = X\Pi_i\gamma_i + X_i\beta_i + v_i \qquad (18.5.6)

where the relation v_i = V_i\gamma_i + e_i was used, since V\Gamma_i = -E\Gamma^{-1}\Gamma_i = -e_i and V\Gamma_i
= -v_i + V_i\gamma_i. If X\hat\Pi_i\gamma_i is added to and subtracted from (18.5.6), we obtain

y_i = X\hat\Pi_i\gamma_i + X_i\beta_i + [v_i + X(\Pi_i - \hat\Pi_i)\gamma_i]
In order to make some interpretive comments, consider the following two-equation
model:

(18.5.10)

Note that in this specification the relevant endogenous variables are y_1 and the
unobservable y_2^*.
On the other hand, suppose the relevant structural equation system is

(18.5.11)

In this case it is the observed, limited variable that appears on the RHS of
(18.5.11a), not the continuous, but unobservable, latent variable y_2^* as in
(18.5.10). Thus the difference between these two models is whether y_2 or y_2^*
should appear on the RHS of Equation 18.5.11a. The answer should certainly
depend on the economic interpretation in any particular application. There is a
considerable difference in the estimation procedure for the two models. The
reason for the difference is that the system (18.5.11) does not have a single set
of reduced form equations, and certain conditions must be imposed on the
parameters in order for the model to be internally consistent. See Schmidt
(1981) for a discussion of the restrictions that are imposed on the parameters of
(18.5.11) and for the restrictions that arise in other forms of the model. Sickles
and Schmidt (1978) discuss the estimation of (18.5.11). Also see Heckman
(1978) for a model in which both the latent variable and its observable counterpart
appear in the equation. This permits an endogenous structural shift to
occur. Lee (1981) notes that Amemiya's principle can be applied to obtain more
efficient parameter estimates than the two-stage procedure suggested by Heckman.
A generalization of Heckman's model is provided by Lee (1978, 1979).
His model is a switching simultaneous equations model that uses sample separation
information. All structural parameters are allowed to vary depending on
the value of the latent variable. Lee (1981) provides two estimators that are
more efficient than those based on Amemiya's principle in this case. See Lee
et al. (1980) for derivation of the asymptotic covariance matrices of two-stage
probit and Tobit estimators.
18.6 SUMMARY
In this chapter we have considered models where the dependent variable is not
a continuously observable variable. These models fall into two primary catego-
ries: those where the dependent variable is qualitative, and takes values (usu-
ally) reflecting choices made by an economic unit, and those where the depen-
dent variable is continuous, but not completely observable for one reason or
another. As the literature survey in this chapter should indicate, there is sub-
stantial research activity in this area. Much has been discussed here, but cer-
tainly not all of it. The reader is referred to the excellent surveys by Amemiya
(1981, 1984), Maddala (1983), Manski and McFadden (1981), and Hensher and
Johnson (1981).
For an investigator modeling situations in which decision makers choose
18.7 EXERCISES
Exercise 18.1
Verify the Taylor's series expansion (18.2.24) for the probit model.
Exercise 18.2
Verify (18.2.38)-(18.2.41).
Exercise 18.3
Given the likelihood function (18.3.1) for the multinomial choice model and
choice probabilities (18.3.9), derive the explicit expression for the Newton-
Raphson iterations.
Exercise 18.4
Using the X_1 matrix defined in Section 2.5.1, let the design matrix for a probit
model be X = (X_1' X_1' X_1' X_1')', so that X is (80 × 3), and let the parameter vector
be \beta' = (0, .6, -.6). Then the probability that y_i = 1 is

P_i = \int_{-\infty}^{x_i'\beta} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt
Construct five samples of size 80. The values of y_i should be assigned according
to the values P_i. One way to do this is to draw a uniform random number u in
the [0,1] interval and assign the value y_i = 1 if u is in [0, P_i], and let y_i = 0
otherwise.
For each of the five samples of size 80 constructed from the probit model,
construct maximum likelihood estimates of the probit and logit models.
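A sketch of this data-generation step follows. Since the X_1 matrix of Section 2.5.1 is not reproduced here, the sketch substitutes a hypothetical stand-in design; only the mechanics of assigning y_i from P_i follow the exercise.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1984)
# Hypothetical stand-in for the (20 x 3) X1 matrix of Section 2.5.1.
X1 = np.column_stack([np.ones(20), rng.uniform(0, 20, size=(20, 2))])
X = np.vstack([X1, X1, X1, X1])                  # (80 x 3) design matrix
beta = np.array([0.0, 0.6, -0.6])

P = norm.cdf(X @ beta)                           # Pr(y_i = 1)
# Five samples: y_i = 1 if a uniform draw falls in [0, P_i].
samples = [(rng.uniform(size=80) <= P).astype(int) for _ in range(5)]
```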
Exercise 18.5
For each of the five samples, compute and compare the values of the partial
derivatives of Pi with respect to each of the explanatory variables, evaluated at
the mean.
Exercise 18.6
Construct a 95% confidence interval for f32 using each of the five samples.
Exercise 18.7
Repeat Exercises 18.4, 18.5, and 18.6, but generate the observations from the
logit specification. That is, let
P_i = \frac{1}{1 + \exp(-x_i'\beta)}, \qquad i = 1, \ldots, 80
Exercise 18.8
Set all the nonpositive values of y to zero in the five samples generated above.
Obtain LS estimates of the parameters and compare them to the ML estimates
using the censored sample model, Heckman's two-step estimates, and to Ame-
miya's two-step estimates.
Exercise 18.9
Repeat Exercise 18.8 except discard the complete observation for nonpositive
values of y. Compare the LS estimates to the ML estimates and Amemiya's
two-step estimates.
Exercise 18.10
Consider 30 samples from the probit model as constructed in Exercise 18.4 as
repeated observations. Construct EGLS estimates based on the linear probabil-
ity, probit, and logit models.
Exercise 18.11
Construct 100 samples as in Exercise 18.8 and compare the empirical densities
of the alternative estimators of the unknown parameters.
18.8 REFERENCES
Olsen, R. (1982) "Distributional Tests for Selectivity Bias and a More Robust
Likelihood Estimator," International Economic Review, 23, 223-240.
Paarsch, H. J. (1984) "A Monte Carlo Comparison of Estimators for Censored
Regression Models," Journal of Econometrics, 24, 197-214.
Parks, R. (1980) "On the Estimation of Multinomial Logit Models From Rela-
tive Frequency Data," Journal of Econometrics, 13, 293-303.
Powell, J. (1981) "Least Absolute Deviations Estimation for Censored and
Truncated Regression Models," Technical Report No. 356, The Economics
Series, Institute for Mathematical Studies in the Social Sciences, Stanford
University, Stanford, California.
Pratt, J. (1981) "Concavity of the Log Likelihood," Journal of the American
Statistical Association, 76, 103-106.
Pregibon, D. (1981) "Logistic Regression Diagnostics," Annals of Statistics, 9,
705-724.
Press, S. and S. Wilson (1978) "Choosing Between Logistic Regression and
Discriminant Analysis," Journal of the American Statistical Association,
73, 699-705.
Robinson, P. (1982) "On the Asymptotic Properties of Estimators of Models
Containing Limited Dependent Variables," Econometrica, 50, 27-41.
Ruud, P. (1983) "Sufficient Conditions for the Consistency of Maximum Likeli-
hood Estimation Despite Misspecification of Distribution in Multinomial Dis-
crete Choice Models," Econometrica, 51, 225-228.
Schmidt, P. (1980) "Constraints on the Parameters in Simultaneous Tobit and
Probit Models," in Structural Analysis of Discrete Data with Econometric
Applications, C. Manski and D. McFadden, eds., MIT Press, Cambridge,
Mass.
Schmidt, P. and R. Strauss (1975) "The Prediction of Occupation Using Multi-
ple Logit Models," International Economic Review, 16, 471-486.
Sickles, R. and P. Schmidt (1978) "Simultaneous Equations Models with Trun-
cated Dependent Variables: A Simultaneous Tobit Model," Journal of Eco-
nomics and Business, 31, 11-21.
Smith, V. and C. Cicchetti (1975) "Regression Analysis with Dichotomous
Dependent Variables," paper presented at the Third World Congress of the
Econometric Society, Toronto, Canada.
Stapleton, D. and D. Young (1981) "Censored Normal Regression with Mea-
surement Error in the Dependent Variable," Discussion Paper No. 81-30,
Department of Economics, University of British Columbia.
Theil, H. (1967) Economics and Information Theory, Rand McNally, Chicago,
and North-Holland, Amsterdam.
Theil, H. (1970) "On the Estimation of Relationships Involving Qualitative
Variables," American Journal of Sociology, 76, 103-154.
Theil, H. (1971) Principles of Econometrics, Wiley, New York.
Tobin, J. (1958) "Estimation of Relationships for Limited Dependent Vari-
ables," Econometrica, 26, 24-36.
Wales, T. and A. Woodland (1980) "Sample Selectivity and Estimation of
Labor Supply Functions," International Economic Review, 21, 437-468.
19.1 INTRODUCTION
Throughout this book we discuss problems that economists and other nonex-
perimental scientists face as a result of their use of data that has not been
generated by controlled experiments. In this chapter we consider another as-
pect of this problem-namely, the assumption of fixed or constant parameters.
Those who work with classical linear models usually assume that the economic
structure generating the sample observations remains constant. That is, there
exists not only a single parameter vector relating the dependent and indepen-
dent variables, but also a constant set of error process parameters and a single
functional form. Unfortunately, the microparameter systems along with their
aggregate counterparts for economic processes and institutions are not con-
stant, and given that our data are generated by uncontrolled and often unob-
servable experiments, it is frequently argued that the traditional assumption of
fixed coefficients is a poor one.
For example, when using cross-sectional data on micro-units, such as firms
or households, it is unlikely that the response to a change in an explanatory
variable will be the same for all micro-units. Similarly, when using time series
data, it is often difficult to explicitly model a changing economic environment
and, in these circumstances, the response coefficients are likely to change over
time. These considerations have led to the development of a number of sto-
chastic or variable parameter models.
Varying parameter models can be classified into three types. First, the pa-
rameters can vary across subsets of observations within the sample but be
nonstochastic. Examples of such models are discussed in Sections 19.2 and
19.3 and include a general systematically varying parameter model, seasonality
models, and a variety of "switching regression" models where the sample
observations are generated by two (or more) distinct regimes.
A second class of models is where the parameters are stochastic, and can be
thought of as being generated by a stationary stochastic process. We have seen
examples of such models in Chapter 13 where the Swamy and Hsiao random
coefficient models were used to combine time series and cross-sectional data.
In Sections 19.4 and 19.5 two more models of this type are presented, the
Hildreth-Houck random coefficient model and its generalization, the return to
normality model.
[Table 19.1 fragment — random parameters from a stationary process: Hildreth-Houck model (Section 19.4); return to normality model (Section 19.5)]
Finally, the third class of models consists of those where the stochastic
parameters are generated by a process that is not stationary. The Cooley-
Prescott model is of this type and is presented in Section 19.6. A visual over-
view of the chapter is provided in Table 19.1.
Consider the linear model

y_{it} = x_{it}'\beta_{it} + e_{it}, \qquad i = 1, \ldots, N; \; t = 1, \ldots, T \qquad (19.2.1)

where y_{it} is the ith cross-sectional observation on the dependent variable in the
tth time period; x_{it} is a (K × 1) nonstochastic vector of observations on explanatory
variables; \beta_{it} is a (K × 1) coefficient vector, possibly unique to the ith
cross section and tth time period; and the e_{it} are normally and independently
distributed random disturbances with zero means and variances \sigma^2. This formulation
of the linear regression model allows the response coefficients for the
explanatory variables to differ for each cross-sectional unit and each time
period. The difficulty with this model is that there are KNT + 1 parameters to
be estimated with only NT observations. Additional information must be introduced
that places some structure on how the coefficients vary across observations
if reasonable estimation procedures are to be developed.
Following Belsley (1973a), let the nonsample information be

\beta_{it} = Z_{it}\gamma \qquad (19.2.2)

where Z_{it} is a (K × M) matrix of variables that "explain" the variation in the \beta_{it}
across observations and \gamma is an (M × 1) vector of associated coefficients.
It is unlikely that each element of \beta_{it} will be the same linear function of a set
of explanatory variables. Equation 19.2.2 is sufficiently general to cover such
circumstances if, in Z_{it}, zeros are included in the appropriate places. In general,
the matrix of explanatory variables Z_{it} may contain (1) functions of variables
already included in x_{it}, implying that (19.2.1) is not linear in the original explanatory
variables; (2) functions of other variables that do not appear in x_{it}; or (3)
qualitative variables that may be stochastic or nonstochastic, implying the
existence of separate regression regimes. With these alternative specifications
for Z_{it}, the model in (19.2.1) and (19.2.2) includes a number of special models
that have appeared in the literature. The following sections identify some of
these models and discuss the consequences of their adoption.
Estimation of the model (19.2.1)-(19.2.2) is carried out straightforwardly.
Equation 19.2.2 is combined with (19.2.1) to produce

y_{it} = x_{it}'Z_{it}\gamma + e_{it} \qquad (19.2.3)

and one may then apply the least squares estimator and the maximum likelihood
estimator for this reformulated statistical model.
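Operationally, (19.2.3) amounts to forming the transformed regressor vector x_{it}'Z_{it} for each observation and regressing y on it. Here is a minimal sketch, assuming the observations are stacked into arrays and the Z_{it} matrices are stored as a three-dimensional array; names are illustrative.

```python
import numpy as np

def svp_least_squares(y, X, Z):
    """Least squares for the reformulated model (19.2.3).

    y : (n,) stacked observations; X : (n, K) regressors;
    Z : (n, K, M) array, with Z[n] linking beta_n to gamma via beta_n = Z[n] @ gamma.
    """
    W = np.einsum('nk,nkm->nm', X, Z)    # row n is x_n' Z_n, an (M,) vector
    gamma = np.linalg.lstsq(W, y, rcond=None)[0]
    return gamma
```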
The models in Section 19.2 allow response coefficients to be different for each
observation. Often this completely general formulation can be usefully modi-
fied to allow the regression coefficients to be constant within subsets or parti-
tions of the observations, but to be different across partitions. These models
may be thought of as situations where the Zit matrix in (19.2.2) contains qualita-
tive variables that "sort out" observations into different partitions. Examples
include models with dummy variables, seasonality models, and the piecewise
regression models such as those developed by Hinkley (1971); McGee and
Carlton (1970); Gallant and Fuller (1973); Poirier (1973, 1976); Quandt (1958);
and Goldfeld and Quandt (1973a, 1973b). Analysis of these models re-empha-
sizes the close relationship between the "varying parameter" and "pooling"
models.
y_1 = X_1\beta_1 + e_1 \qquad (19.3.1a)
y_2 = X_2\beta_2 + e_2 \qquad (19.3.1b)
y_3 = X_3\beta_3 + e_3 \qquad (19.3.1c)
y_4 = X_4\beta_4 + e_4 \qquad (19.3.1d)

These equations may be combined as

\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} X_1 & 0 & 0 & 0 \\ 0 & X_2 & 0 & 0 \\ 0 & 0 & X_3 & 0 \\ 0 & 0 & 0 & X_4 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{bmatrix} \qquad (19.3.2)
or more compactly as
Y = Z-y + w (19.3.3)
If w ~ N(0, \sigma^2 I), the least squares rule would be applied to estimate the parameters
of the quarterly equations, and the general hypothesis

R\gamma = \begin{bmatrix} I & -I & 0 & 0 \\ 0 & I & -I & 0 \\ 0 & 0 & I & -I \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix} = 0 \qquad (19.3.4)

could be used to test the statistical significance of the seasonal effects on all or
some combination of the parameters. In (19.3.3), if the quarterly equations are
error related and w ~ N(0, \Phi), where \Phi is some positive definite, symmetric
matrix, the coefficient vector \gamma may be efficiently estimated by using the Aitken
generalized least squares rule, where the covariance \Phi is estimated. See Chapter
12. In this case the standard Zellner type of aggregation tests could be
applied to the quarterly parameters.
If the parameters of the treatment variables do not have a seasonal pattern
and the seasonal effect is only exhibited in the intercept parameter for each of
the quarterly equations, (19.3.3) may be reformulated as

\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} x_{11} & 0 & 0 & 0 & X_{12} \\ 0 & x_{21} & 0 & 0 & X_{22} \\ 0 & 0 & x_{31} & 0 & X_{32} \\ 0 & 0 & 0 & x_{41} & X_{42} \end{bmatrix} \begin{bmatrix} \beta_{11} \\ \beta_{21} \\ \beta_{31} \\ \beta_{41} \\ \beta_2 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{bmatrix} \qquad (19.3.5)

where \beta_2 is a (K - 1)-dimensional parameter vector and \beta_{11}, \beta_{21}, \beta_{31}, and \beta_{41}
are parameters for the quarterly equations corresponding to the intercept variables
x_{11}, x_{21}, x_{31}, and x_{41}. Seasonal intercept parameter variability could be
Models that use dummy variables imply the presence of identifiable parameter
"regimes" that hold for partitions of the entire sample. Although dummy vari-
able models are easily extended to situations where both time series and cross-
sectional data are available, we will, for expository purposes and without loss
of generality, consider only time series models. This corresponds to setting
N = 1 in (19.2.1). Only two partitions will be considered because extension to a
greater number of partitions follows directly. If the observations for which the
different parameter regimes hold are readily identifiable, then the total of T
observations may be split into two groups of TI and T2 observations, where
TI + T2 = T. These groupings do not necessarily have to contain observations
that are sequential in time; thus this formulation is a generalization of the model
in Section 19.3.1. Let us write the two regimes as

y_t = \begin{cases} x_t'\beta_1 + e_{1t} & \text{if } t \in \{T_1\} \\ x_t'\beta_2 + e_{2t} & \text{if } t \in \{T_2\} \end{cases} \qquad (19.3.6)

(19.3.7)

(19.3.8)
or
(19.3.9)
The imposition of these parameter restrictions implies that the two regression
functions will join at the point to.
Poirier (1973, 1976) has considered a piecewise regression function called a
cubic spline function, whose pieces join more smoothly than in the model
considered above. Cubic splines have been used extensively by physical scien-
tists as approximating functions. In reality cubic splines are cubic polynomials
in a single independent variable, which are joined together smoothly at known
points. The smoothness restnctions are such that at the point where the cubic
polynomials meet, their first and second derivatives are also equal; thus the
transition from one regime to another does not occur abruptly. In particular, let
the regression function be

y_t = g_1(t)\,I(t \le t_0) + g_2(t)\,I(t > t_0) + e_t \qquad (19.3.10)

where I(·) are indicator functions that take the value one if the argument is in
the stated interval and zero otherwise. The g_i(t), i = 1, 2, are cubic polynomials
of the form

g_i(t) = a_i + b_i t + c_i t^2 + d_i t^3

In the cubic spline literature the point t_0, where the two cubic functions meet, is
referred to as the "knot" point. The functions g_i(t) obey the following smoothness
(derivative) restrictions:

g_1(t_0) = g_2(t_0), \qquad g_1'(t_0) = g_2'(t_0), \qquad g_1''(t_0) = g_2''(t_0) \qquad (19.3.11)
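A convenient way to impose the restrictions (19.3.11) in estimation is the truncated-power reparameterization: adding the single regressor (t - t_0)^3_+ to an ordinary cubic makes the value, first, and second derivatives of the two pieces agree at the knot automatically. The sketch below assumes the knot t_0 is known; names are illustrative.

```python
import numpy as np

def cubic_spline_fit(t, y, t0):
    """Least squares cubic spline with one knot at t0.

    Writing y_t = a + b t + c t^2 + d t^3 + e (t - t0)^3_+ + error imposes the
    smoothness restrictions (19.3.11) at the knot by construction.
    """
    plus = np.where(t > t0, (t - t0) ** 3, 0.0)   # truncated cubic term
    W = np.column_stack([np.ones_like(t), t, t**2, t**3, plus])
    coef = np.linalg.lstsq(W, y, rcond=None)[0]
    return coef
```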
The models above have presumed that the point(s) of structural change were
known. If this is not the case, then the unknown point(s) where the regimes
switch, become unknowns and thus are parameters to be estimated. This prob-
lem has been surveyed by Goldfeld and Quandt (1973b). Given the model
in (19.3.6), they assume ell ~ N(O,O"i) and e2t ~ N(O,O"~) and in general that
(3f,O"T) *- (32'0"~). The choice between the two regimes is then assumed either
deterministic, where some observed variable is compared to some unknown
threshold, or stochastic and dependent upon unknown probabilities.
In the deterministic case with switching at observation t₀, the likelihood function is

$$
\ell(\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \sigma_1^2, \sigma_2^2, t_0) = (2\pi)^{-T/2}\sigma_1^{-t_0}\sigma_2^{-(T-t_0)} \exp\left\{ -\frac{1}{2\sigma_1^2}\sum_{t=1}^{t_0}(y_t - \mathbf{x}_t'\boldsymbol{\beta}_1)^2 - \frac{1}{2\sigma_2^2}\sum_{t=t_0+1}^{T}(y_t - \mathbf{x}_t'\boldsymbol{\beta}_2)^2 \right\} \qquad (19.3.12)
$$
The estimate of t₀ chosen is the value that maximizes the likelihood function. A
likelihood ratio test for the hypothesis of no switching can then be carried out
by comparing the value of the likelihood function (19.3.12) to that for a single
regression over the entire sample. Again this search process has pretest estima-
tor sampling consequences.
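For a given data set, the maximization of (19.3.12) over t₀ amounts to a one-dimensional grid search, fitting least squares on each side of every candidate switch point. A minimal sketch, in which the function and variable names are illustrative only:

```python
import numpy as np

def switch_loglik(y, X, t0):
    """Concentrated log likelihood of (19.3.12) for candidate switch point
    t0: least squares within each regime, ML variance estimates."""
    def segment(ys, Xs):
        b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        resid = ys - Xs @ b
        s2 = resid @ resid / ys.size
        return -0.5 * ys.size * (np.log(2 * np.pi * s2) + 1.0)
    return segment(y[:t0], X[:t0]) + segment(y[t0:], X[t0:])

def estimate_switch_point(y, X):
    """Grid search over t0, leaving at least K + 1 observations per regime."""
    K = X.shape[1]
    candidates = range(K + 1, y.size - K)
    return max(candidates, key=lambda t0: switch_loglik(y, X, t0))
```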
A detection procedure similar in spirit has been suggested by Brown, Durbin, and Evans (1975). They investigate the constancy of regression relations over time by considering functions of recursive residuals generated by moving regressions. Comparison of the standardized sum and the standardized sum of squares of these residuals to approximate confidence bounds provides evidence of regression stability or instability, and of approximately where any structural change may have taken place.
An alternative test for shifts in slope parameters where the shift point is
unknown has been suggested by Farley and Hinich (1970) and Farley, Hinich,
and McGuire (1975). They approximate a discrete shift in the slope at an un-
known point by a continuous linear shift using the model

$$ y_t = \mathbf{x}_t'\boldsymbol{\beta} + \mathbf{z}_t'\boldsymbol{\delta} + e_t $$

where

$$ \mathbf{z}_t = t\,\mathbf{x}_t \qquad (19.3.13) $$

The test for constant slopes is then the usual likelihood ratio test of the hypothesis that δ = 0. Farley et al. (1975) provide Monte Carlo evidence on the performance of their test.

Under the stochastic choice alternative, the model may be written as

$$ y_t = (1 - D_t)\mathbf{x}_t'\boldsymbol{\beta}_1 + D_t\,\mathbf{x}_t'\boldsymbol{\beta}_2 + e_t \qquad (19.3.14) $$

where β₁, β₂, σ₁², σ₂², and the D_t's must be estimated. To make the problem
tractable, the D_t's may be approximated by a continuous function. One possible approximation is to use the probit function

$$ D_t = \int_{-\infty}^{\mathbf{z}_t'\boldsymbol{\gamma}} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{u^2}{2\sigma^2}\right) du $$

so that the log likelihood function is

$$ L = -\frac{T}{2}\ln 2\pi - \frac{1}{2}\sum_{t=1}^{T}\ln\left[\sigma_1^2(1-D_t)^2 + \sigma_2^2 D_t^2\right] - \frac{1}{2}\sum_{t=1}^{T}\frac{\{y_t - \mathbf{x}_t'[\boldsymbol{\beta}_1(1-D_t) + \boldsymbol{\beta}_2 D_t]\}^2}{\sigma_1^2(1-D_t)^2 + \sigma_2^2 D_t^2} \qquad (19.3.15) $$

and maximization of (19.3.15) with respect to the unknown parameters can be carried out. Again the sampling properties of the outcome generated by using this rule are unknown.
A further stochastic alternative specifies that regime 1 is chosen with probability λ, so that the density of y_t is

$$ g(y_t|\mathbf{x}_t) = \lambda f_1(y_t|\mathbf{x}_t) + (1 - \lambda) f_2(y_t|\mathbf{x}_t) \qquad (19.3.16) $$

The log likelihood function L = Σ_{t=1}^T ln g(y_t|x_t) may then be maximized with respect to the β's, σ²'s, and λ. The reader might want to note the relationship between this formulation and the nonnested hypothesis testing problem of Section 21.8. An immediate extension of this alternative would be to make λ a function of exogenous variables.
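A minimal sketch of maximizing the likelihood implied by (19.3.16), assuming normal densities in both regimes and generic reparameterizations to keep σ₁, σ₂ positive and λ in (0, 1); names and starting values are illustrative only:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik(params, y, X):
    """Negative log likelihood for the mixture density (19.3.16) with
    normal errors in each regime; lam is the probability of regime 1."""
    K = X.shape[1]
    b1, b2 = params[:K], params[K:2 * K]
    s1, s2 = np.exp(params[2 * K]), np.exp(params[2 * K + 1])  # sigma > 0
    lam = 1.0 / (1.0 + np.exp(-params[2 * K + 2]))             # 0 < lam < 1
    g = lam * norm.pdf(y, X @ b1, s1) + (1 - lam) * norm.pdf(y, X @ b2, s2)
    return -np.sum(np.log(g + 1e-300))

# Usage sketch: start both regimes from the least squares estimates.
# b0 = np.linalg.lstsq(X, y, rcond=None)[0]
# x0 = np.concatenate([b0, b0 * 1.1, [0.0, 0.0, 0.0]])
# result = minimize(neg_loglik, x0, args=(y, X), method="BFGS")
```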
Another modification of the stochastic alternatives is to allow the probability λ of choosing the first regime in time period t to depend upon the state of the system in the previous trial. Specifically, Goldfeld and Quandt (1973a) introduce a Markov model in which transition probabilities are explicitly introduced. The transition probabilities may be considered fixed, or nonstationary and functions of exogenous variables. In each case the likelihood function is formed and maximized with respect to all relevant parameters. The reader is referred to Goldfeld and Quandt (1973a) for further details. Also see Tishler and Zang (1979), who develop approximations to the likelihood function that are computationally simple, and Swamy and Mehta (1975) for Bayesian and other
generalizations. Lee and Porter (1984) suggest a model for which some imper-
fect sample separation information is available.
Consider the linear model (19.2.2) where t = 1, and is thus suppressed, the parameters are assumed stochastic, so β_i = Z_iγ + v_i, and Z_i = I_K and γ = β̄. The resulting model is

$$ y_i = \mathbf{x}_i'\boldsymbol{\beta}_i \qquad (19.4.1) $$

where

$$ \boldsymbol{\beta}_i = \bar{\boldsymbol{\beta}} + \mathbf{v}_i \qquad (19.4.2) $$
This is the random coefficient model of Hildreth and Houck (1968). The vector
β_i' = (β_{1i}, β_{2i}, . . . , β_{Ki}) contains the actual (random) response coefficients for the ith individual, β̄' = (β̄₁, . . . , β̄_K) is a vector of nonstochastic mean response coefficients, and v_i' = (v_{1i}, v_{2i}, . . . , v_{Ki}) is a vector of random disturbances. We assume E[v_i] = 0, E[v_i v_i'] = Λ, and E[v_i v_j'] = 0 if i ≠ j. Note that a separate equation disturbance in (19.4.1) could not be distinguished from v_{1i} if the first variable is an intercept, and thus it is ignored.
This type of model is often reasonable when we have cross-sectional data on
a number of micro-units. In such a case we are likely to be interested in
estimating the mean coefficients, 13, the actual coefficients, f3i, and the covari-
ance matrix A.
To consider estimation, combine (19.4.1) and (19.4.2) and rewrite as
$$ y_i = \mathbf{x}_i'\bar{\boldsymbol{\beta}} + e_i \qquad (19.4.3) $$

where x_i' = (1, x_{2i}, . . . , x_{Ki}), e_i = x_i'v_i, E[e_i] = 0, and σ_i² = E[e_i²] = x_i'Λx_i.
Then, if Λ is known, the generalized least squares estimator for β̄,

$$ \hat{\bar{\boldsymbol{\beta}}} = \left( \sum_{i=1}^{N} \sigma_i^{-2}\, \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \sum_{i=1}^{N} \sigma_i^{-2}\, \mathbf{x}_i y_i \qquad (19.4.4) $$

and the predictor for the individual coefficient vector,

$$ \hat{\boldsymbol{\beta}}_i = \hat{\bar{\boldsymbol{\beta}}} + \Lambda \mathbf{x}_i (\mathbf{x}_i' \Lambda \mathbf{x}_i)^{-1} (y_i - \mathbf{x}_i' \hat{\bar{\boldsymbol{\beta}}}) \qquad (19.4.5) $$
are best linear unbiased [Griffiths (1972), Swamy and Mehta (1975), Lee and Griffiths (1979)]. The predictor is unbiased in the sense that E[β̂_i − β_i] = 0, and best in the sense that the covariance matrix of the prediction error of any other linear unbiased predictor exceeds the covariance matrix of (β̂_i − β_i) by a nonnegative definite matrix.
When Λ is unknown, it may be estimated from the least squares residuals. Collecting the distinct elements of Λ in a vector α, the squared residuals satisfy

$$ E[\dot{\mathbf{e}}] = F\boldsymbol{\alpha} \qquad (19.4.8) $$

where ė contains the squares of the least squares residuals, and F = ṀZ with Ṁ containing the squares of the elements of M = I − X(X'X)⁻¹X'. The estimator α̂ = (F'F)⁻¹F'ė is unbiased, but because it is a least squares estimator, its elements will not be constrained.
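A sketch of the estimator α̂ = (F'F)⁻¹F'ė for the diagonal-Λ case follows; under that assumption Z contains the squared elements of X. The truncation of negative estimates at zero in the last line corresponds to one of the Hildreth-Houck suggestions discussed next. Names are illustrative.

```python
import numpy as np

def hildreth_houck_alpha(y, X):
    """Unbiased estimator alpha_hat = (F'F)^{-1} F' e_dot of the coefficient
    variances in the Hildreth-Houck model, assuming a diagonal Lambda.

    e_dot holds squared least squares residuals; F = M_dot Z, where M_dot
    contains the squared elements of M = I - X (X'X)^{-1} X' and, for
    diagonal Lambda, Z contains the squared elements of X."""
    T = X.shape[0]
    M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
    e_dot = (M @ y) ** 2               # squared LS residuals
    F = (M ** 2) @ (X ** 2)            # M_dot Z
    alpha, *_ = np.linalg.lstsq(F, e_dot, rcond=None)
    return np.clip(alpha, 0.0, None)   # set negative estimates to zero
```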
For the case where Λ is diagonal, Hildreth and Houck (1968) suggest changing negative estimates to zero or using the quadratic programming estimator obtained by minimizing (ė − Fα)'(ė − Fα) subject to α ≥ 0. Unfortunately, when Λ is not diagonal, the constraints become nonlinear, so that there is no obvious, more general form for the quadratic programming estimator. For example, when K = 3, the constraints can be written as
$$ a_{11} > 0, \qquad \begin{vmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{vmatrix} > 0, \qquad \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{12} & a_{22} & a_{23} \\ a_{13} & a_{23} & a_{33} \end{vmatrix} > 0 \qquad (19.4.9) $$
Given estimates β̂_i of the individual coefficient vectors, an estimator for Λ can also be derived from the sum of cross products Σ_{i=1}^N (β̂_i − β̄̂)(β̂_i − β̄̂)'. Also see Srivastava et al. (1981), who consider using a mixed estimation procedure to estimate the coefficient variances.
In addition to estimation, the applied worker is likely to be interested in
testing for randomness in the coefficients. The type of randomness discussed in
this section leads to a heteroscedastic error model, and so, for testing purposes,
the tests discussed in Section 11.3 are relevant. In particular, the Breusch-
Pagan (1979) test (Section 11.3.1) is likely to be a satisfactory one.
Additional references that relate to the Hildreth-Houck model include Raj et al. (1980), who obtain a finite sample approximation to the moment matrix of the limiting distribution when the errors are normal and the disturbances are assumed small. For Bayesian treatments of the Hildreth-Houck model and a generalization, see Griffiths et al. (1979) and Liu (1981).
19.5 THE RETURN TO NORMALITY MODEL

The return to normality model is specifically suited to use with time series
data where the assumption that the coefficients in the regression model are
constant over time is not a reasonable one. The parameters in this model are
dynamic and this represents a generalization of the random coefficient model
previously discussed in Section 19.4. Models with dynamic parameters can be
further classified into those where the parameters follow a stationary stochastic
process about a fixed but unknown mean, which is the topic of this section, and
those where the parameters follow a stochastic process that is nonstationary.
The nonstationary type parameter process is illustrated in Section 19.6 with the
discussion of the Cooley-Prescott (1976) model.
The model may be written as

$$ y_t = \mathbf{x}_t'\boldsymbol{\beta}_t, \qquad t = 1, \ldots, T \qquad (19.5.1a) $$

$$ \boldsymbol{\beta}_t - \bar{\boldsymbol{\beta}} = \Phi(\boldsymbol{\beta}_{t-1} - \bar{\boldsymbol{\beta}}) + \mathbf{e}_t \qquad (19.5.1b) $$

or, equivalently,

$$ y_t = \mathbf{x}_t'\bar{\boldsymbol{\beta}} + v_t \qquad (19.5.2a) $$

where

$$ v_t = \mathbf{x}_t'(\boldsymbol{\beta}_t - \bar{\boldsymbol{\beta}}) \qquad (19.5.2b) $$

The model (19.5.1) can be put in state space form with transition equation

$$ \boldsymbol{\alpha}_t = \begin{bmatrix} \bar{\boldsymbol{\beta}}_t \\ \boldsymbol{\delta}_t \end{bmatrix} = \begin{bmatrix} I_K & 0 \\ 0 & \Phi \end{bmatrix} \begin{bmatrix} \bar{\boldsymbol{\beta}}_{t-1} \\ \boldsymbol{\delta}_{t-1} \end{bmatrix} + \begin{bmatrix} 0 \\ \mathbf{e}_t \end{bmatrix} \qquad (19.5.3) $$

where β̄_t = β̄ for all t and δ_t = β_t − β̄. The associated measurement equation is

$$ y_t = [\mathbf{x}_t' \;\; \mathbf{x}_t']\,\boldsymbol{\alpha}_t $$
Consequently, all that is needed to carry out the Kalman filter iterations, and obtain the GLS estimator of β̄ conditional on Φ and Q, is a set of starting values. Harvey and Phillips (1982, p. 311) suggest using the first K observations to obtain initial estimates of α_t and its covariance matrix. Finally, full maximum
likelihood estimation would involve maximizing the concentrated log likelihood
function with respect to the unknown elements of <I> and Q. A numerical optimi-
zation procedure would require initial estimates of these unknown parameters.
Harvey and Phillips (1982, pp. 314-315), under the assumption that Φ and Q are diagonal, with elements φ₁, . . . , φ_K and q₁, q₂, . . . , q_K, respectively, note that the variances and first-order autocovariances of the least squares residuals depend on these parameters through the regressors. This suggests that starting estimates of diag(Φ) and diag(Q) could be obtained by computing the least squares residuals ê₁, . . . , ê_T and regressing ê_t² on x²_{t1}, x²_{t2}, . . . , x²_{tK}, and ê_tê_{t−1} on x_{t1}x_{t−1,1}, . . . , x_{tK}x_{t−1,K}. Given the consistency of these initial estimates, an estimated GLS estimator of β̄ is then computed by a
single pass of the Kalman filter. In Monte Carlo experiments, Harvey and
Phillips found a clear mean square error gain from using ML over OLS or the two-step EGLS estimator, and the EGLS estimator provided substantial improvement over OLS.
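The starting-value regressions just described are easy to implement. The sketch below assumes, as in the text, that Φ and Q are diagonal; the recovery of φ_k and q_k from the two auxiliary regressions is one crude possibility, and all names are hypothetical.

```python
import numpy as np

def harvey_phillips_start(y, X):
    """Starting values for diag(Phi) and diag(Q), via the auxiliary
    regressions described in the text.

    Regressing squared LS residuals on squared regressors estimates
    c_k (roughly, the stationary variance of the kth coefficient
    deviation); regressing e_t * e_{t-1} on x_tk * x_{t-1,k} estimates
    phi_k * c_k. The ratio gives a crude phi_k, and q_k = c_k(1 - phi_k^2)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b                                   # LS residuals
    c, *_ = np.linalg.lstsq(X ** 2, e ** 2, rcond=None)
    d, *_ = np.linalg.lstsq(X[1:] * X[:-1], e[1:] * e[:-1], rcond=None)
    phi = np.clip(d / np.where(np.abs(c) > 1e-12, c, 1e-12), -0.99, 0.99)
    q = np.clip(c * (1.0 - phi ** 2), 0.0, None)
    return phi, q
```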
19.5.3 Generalizations
To this point we have reviewed models where parameter shifts have been given
considerable structure. In this section variable parameter models are presented
that place a less restrictive structure on the parameter variation. In particular,
we consider the random walk model of Cooley and Prescott (1973, 1976). In this
model the parameters vary from one time period to another on the basis
of a nonstationary probabilistic scheme. Similar models are discussed by
Sarris (1973), Cooper (1973), Sant (1977), Belsley (1973a), and Rausser and
Mundlak (1978).
19.6 THE COOLEY-PRESCOTT MODEL

Cooley and Prescott consider the time series regression model
$$ y_t = \mathbf{x}_t'\boldsymbol{\beta}_t, \qquad t = 1, \ldots, T \qquad (19.6.1) $$

where the parameter vector is subject both to transitory changes and to persistent "drift" in the parameter values. These sources of variation are modeled as

$$ \boldsymbol{\beta}_t = \boldsymbol{\beta}_t^p + \mathbf{u}_t $$
$$ \boldsymbol{\beta}_t^p = \boldsymbol{\beta}_{t-1}^p + \mathbf{v}_t \qquad (19.6.2) $$
where β_t^p is the permanent component of the parameter vector. The terms u_t and v_t are independent, normal random vectors with mean vectors zero and covariance matrices E(u_tu_t') = (1 − γ)σ²Σ_u and E(v_tv_t') = γσ²Σ_v. The matrices Σ_u and Σ_v are assumed to be known up to a scale factor and normalized so that the element corresponding to the intercept is unity. That is, the first regressor is
the constant term and the transitory component of the corresponding parame-
ter's variation plays the role of the additive disturbance in the regression equa-
tion. Note that the parameterization adopted is such that γ reflects the relative magnitudes of the permanent and transitory changes. If γ is close to 1, then the permanent changes are large relative to transitory ones.
Straightforward maximum likelihood estimation of σ², γ, and the permanent components of the β_t is not possible, since the process generating the parameters is not stationary. However, by considering the value of the parameter
process at a particular point as the parameter vector of interest, a well-defined
likelihood function can be constructed.
Cooley and Prescott focus on the value of the parameter process one period
past the sample, which is convenient for forecasting. Then, by repeated substi-
tutions,
$$ \boldsymbol{\beta}_t^p = \boldsymbol{\beta}_{T+1} - \sum_{j=t+1}^{T+1} \mathbf{v}_j \qquad (19.6.3) $$

and so

$$ \boldsymbol{\beta}_t = \boldsymbol{\beta}_{T+1} - \sum_{j=t+1}^{T+1} \mathbf{v}_j + \mathbf{u}_t \qquad (19.6.4) $$

The model may then be written

$$ y_t = \mathbf{x}_t'\boldsymbol{\beta}_{T+1} + e_t, \qquad t = 1, \ldots, T \qquad (19.6.5) $$

where

$$ e_t = \mathbf{x}_t'\mathbf{u}_t - \mathbf{x}_t'\sum_{j=t+1}^{T+1} \mathbf{v}_j $$
The covariance matrix of e = (e₁, . . . , e_T)' is σ²Ω(γ) with Ω(γ) = (1 − γ)R + γQ, where R is a diagonal matrix with elements r_ii = x_i'Σ_u x_i and Q is a matrix with elements

$$ q_{ij} = \min(T - i + 1,\; T - j + 1)\,\mathbf{x}_i'\Sigma_v \mathbf{x}_j $$

Writing the model as

$$ \mathbf{y} = X\boldsymbol{\beta}_{T+1} + \mathbf{e} \qquad (19.6.6) $$

the log likelihood of the observations is

$$ L = -\frac{T}{2}\ln 2\pi - \frac{T}{2}\ln \sigma^2 - \frac{1}{2}\ln|\Omega(\gamma)| - \frac{1}{2\sigma^2}(\mathbf{y} - X\boldsymbol{\beta}_{T+1})'\Omega(\gamma)^{-1}(\mathbf{y} - X\boldsymbol{\beta}_{T+1}) \qquad (19.6.7) $$
Maximizing (19.6.7) with respect to β_{T+1} and σ², we obtain the conditional estimators

$$ \hat{\boldsymbol{\beta}}_{T+1}(\gamma) = [X'\Omega(\gamma)^{-1}X]^{-1}X'\Omega(\gamma)^{-1}\mathbf{y} $$

and

$$ \hat{\sigma}^2(\gamma) = \frac{1}{T}[\mathbf{y} - X\hat{\boldsymbol{\beta}}_{T+1}(\gamma)]'\Omega(\gamma)^{-1}[\mathbf{y} - X\hat{\boldsymbol{\beta}}_{T+1}(\gamma)] \qquad (19.6.8) $$

Substituting these estimators back into (19.6.7) yields the concentrated log likelihood function

$$ L = -\frac{T}{2}\ln 2\pi - \frac{T}{2}\ln \hat{\sigma}^2(\gamma) - \frac{1}{2}\ln|\Omega(\gamma)| - \frac{T}{2} = -\frac{T}{2}(\ln 2\pi + 1) - \frac{T}{2}\ln \hat{\sigma}^2(\gamma) - \frac{1}{2}\ln|\Omega(\gamma)| \qquad (19.6.9) $$
which can be maximized by a search over γ in the unit interval. The asymptotic variance of the resulting estimator γ̂ is

$$ \operatorname{var}\sqrt{T}(\hat{\gamma} - \gamma) = 2\bigg/\left[\frac{1}{T}\sum_i \left(\frac{d_i - 1}{d_i(\gamma)}\right)^2 - \left(\frac{1}{T}\sum_i \frac{d_i - 1}{d_i(\gamma)}\right)^2\right] $$
where d_i(γ) = (1 − γ) + γd_i and the d_i are the characteristic roots of Q*, with q*_ij = q_ij/√(r_ii r_jj). The matrix Q* arises from a transformation of the model suggested by Cooley and Prescott (1976, pp. 172-173), which reduces the computational burden of estimation. The reader is referred to the Cooley-Prescott paper for that discussion.
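A direct, if computationally naive, implementation of (19.6.5) through (19.6.9) is a grid search over γ, evaluating the concentrated log likelihood at each point. The sketch below builds R and Q from assumed Σ_u and Σ_v and does not use the eigenvalue transformation of Cooley and Prescott; all names are illustrative.

```python
import numpy as np

def build_R_Q(X, Su, Sv):
    """R and Q as defined in the text: r_ii = x_i' Su x_i (diagonal) and
    q_ij = min(T - i + 1, T - j + 1) * x_i' Sv x_j."""
    T = X.shape[0]
    R = np.diag(np.einsum("ti,ij,tj->t", X, Su, X))
    w = np.arange(T, 0, -1)                 # T - i + 1 for i = 1..T
    Q = np.minimum.outer(w, w) * (X @ Sv @ X.T)
    return R, Q

def concentrated_loglik(gamma, y, X, R, Q):
    """Concentrated log likelihood (19.6.9) with Omega = (1 - g)R + gQ."""
    T = y.size
    Omega = (1.0 - gamma) * R + gamma * Q
    Oinv = np.linalg.inv(Omega)
    beta = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)  # GLS given gamma
    u = y - X @ beta
    s2 = u @ Oinv @ u / T
    _, logdet = np.linalg.slogdet(Omega)
    return -0.5 * T * (np.log(2 * np.pi) + 1) - 0.5 * T * np.log(s2) \
           - 0.5 * logdet

def estimate_gamma(y, X, Su, Sv, grid=np.linspace(0.01, 0.99, 99)):
    """Grid search over gamma in (0, 1)."""
    R, Q = build_R_Q(X, Su, Sv)
    return max(grid, key=lambda g: concentrated_loglik(g, y, X, R, Q))
```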
Although estimation and interpretation of the Cooley-Prescott model are relatively straightforward, its application is not without difficulties. The specification of Σ_u and Σ_v must come from theoretical considerations. In particular, their specification presumes the ability to specify the relative variability of the parameters.
19.7 SUMMARY
In this chapter, we have reviewed a variety of statistical models that have been
developed for situations where the coefficients of the general linear model are
assumed to vary across observations. Justification for the use of varying pa-
rameter models usually follows one of two lines. First in importance is the
situation when the coefficients of an otherwise properly specified relationship
are different for some subsets of the available sample; that is, the sample data
cannot be pooled. Estimation with a model that does not take this into account
would produce results that do not accurately represent the existing economic
structure and do not serve as a good basis for forecasting. The exact conse-
quences of ignoring the parameter variation would, of course, depend on the
nature and degree of the misspecification. The other justification for the use of a
model with varying parameters is that econometric models are necessarily
abstractions from and simplifications of reality. Adoption of the classical linear
model may then imply misspecifications that cause the coefficients of the model
to apparently vary across the sample even though the true underlying structure
is not changing. Some examples of this are the following:
1. If important explanatory variables are excluded from the model and if they
are related to included variables, the effects of the included variables on the
dependent variable can be expected to vary across the sample. As a mini-
mum, omitted variables can be expected to produce variations in the inter-
cept. In other words, the effects of omitted variables are (incorrectly)
attributed to variables that do appear in the model.
2. Economists often use proxy variables that only partially reflect the eco-
nomic effects they represent. If the relationship between the proxy and its
true counterpart is not constant across observations, the coefficient of the
proxy variable will not be constant.
3. When aggregate data are used, the potential for coefficient variation is
present. Coefficients in the aggregate equation will remain constant as long
as the relationship between microunits remains constant.
4. Coefficients of a linear model can vary across the sample if the true rela-
tionship is nonlinear and observations fall outside the narrow range where
the linear approximation is acceptable.
19.8 EXERCISES
The data in Table 19.2 were generated from a linear statistical model with a
systematically varying parameter. In particular, the five samples of y's shown
are generated using the statistical model y = Xβ + e = 10x₁ + β₂x₂ + 0.6x₃ + e, where x₁, x₂, and x₃ are given in Section 2.5.1, e ~ N(0, I), and the coefficient β₂ varies with the trend variable t, t = 1, . . . , 20, as
Exercise 19.1
Obtain the least squares estimates of the model y = X~ + e for the five samples
ignoring the presence of the varying parameter. Examine the residuals for each
of the samples.
Exercise 19.2
Obtain estimates of the coefficients in the correctly specified model for the five
samples.
Exercise 19.3
Test for the presence of a systematically varying parameter in the correctly specified model by testing the hypothesis that γ₁ = γ₂ = 0.
The data in Table 19.3 were generated from a linear model with time-varying
parameters. In particular, the five samples of y's shown were generated using
the statistical model y_t = β_{1t}x_{t1} + β_{2t}x_{t2} + β_{3t}x_{t3} + e_t, where x₁, x₂, and x₃ are given in Section 2.5.1, e ~ N(0, I), and the coefficients β₁, β₂, β₃ vary following the Cooley-Prescott formulation (Section 19.6) with γ = 0.8, Σ_u = Σ_v = I, and the initial values β₀' = (10, 0.4, 0.6).
Exercise 19.4
For the five samples in Table 19.3 compute the usual least squares coefficient estimates. Obtain the predicted conditional mean of the dependent variable for observation (T + 1), where x_{T+1}' = (1, 1.5, 2).
Exercise 19.5
For the five samples in Table 19.3 use the estimation procedure suggested by Cooley and Prescott and described in Section 19.6 to estimate β_{T+1}, the set of parameters generated by the indicated process for the (T + 1) time period. Refer to the Cooley-Prescott paper for a transformation that eases the computational burden. Use the estimated coefficients to predict E[y_{T+1}] using x_{T+1} from Exercise 19.4. Compare these forecasts to the least squares forecasts.
Exercise 19.6
For each of the five samples given in Table 19.3 test the hypothesis that γ = 0 using the variance of the asymptotic distribution of γ̂ given in Section 19.6.
Exercise 19.7
Repeat the methodology described to generate the data for Exercise 19.4 to generate 100 samples of data. For each sample compute forecasts of E[y_{T+1}] using the least squares and Cooley-Prescott estimates of β_{T+1}. Compare the empirical mean square errors of prediction for the 100 samples for each of the estimators.
19.9 REFERENCES
Smith, F. and D. Cook (1980) "Straight Lines with a Change Point: A Bayesian
Analysis of Some Renal Transplant Data," Applied Statistics, 29, 180-189.
Srivastava, V., G. Mishva, and A. Chaturvedi (1981) "Estimation of Linear Regression Model with Random Coefficients Ensuring Almost Non-Negativity of Variance Estimators," Biometrical Journal, 23, 3-8.
Swamy, P. and P. Tinsley (1980) "Linear Prediction and Estimation Methods
for Regression Models with Stationary Stochastic Coefficients," Journal of
Econometrics, 12, 103-142.
Swamy, P. A. V. B. and J. S. Mehta (1975) "Bayesian and Non-Bayesian
Analysis of Switching Regressions and of Random Coefficient Regression
Models," Journal of the American Statistical Association, 70, 593-602.
Tishler, A. and I. Zang (1979) "A Switching Regression Method Using Inequality Conditions," Journal of Econometrics, 11, 259-274.
Tsurumi, H. (1982) "A Bayesian and Maximum Likelihood Analysis of a Grad-
ual Switching Regression in a Simultaneous Equation Framework," Journal
of Econometrics, 19, 165-182.
Wallis, K. F. (1974) "Seasonal Adjustment and Relations Between Variables,"
Journal of the American Statistical Association, 69, 18-31.
Zellner, A., ed. (1979) Seasonal Analysis of Economic Time Series, U.S. Gov-
ernment Printing Office, Washington, D.C.
Chapter 20
Nonnormal Disturbances
$$ \mathbf{y} = X\boldsymbol{\beta} + \mathbf{e} \qquad (20.1.1) $$

where (1) the usual definitions hold, (2) X is nonstochastic of rank K, (3) lim_{T→∞} T⁻¹X'X is a finite nonsingular matrix, and (4) the random vector e is such that E[e] = 0 and E[ee'] = σ²I, with σ² finite.

[Chapter overview figure: consequences of nonnormality (Section 20.2); tests for normality; estimation under some nonnormal distributions; M-estimators (Section 20.4.1); L-estimators (Section 20.4.2); additive or multiplicative disturbances and functional form (Section 20.5).]

If, in addition, e is normally distributed, then, among other results:
3. The respective distributions for b and (T − K)σ̂²/σ² are normal and χ²_{(T−K)}, and, furthermore, they are independent.
4. The F test on a set of linear restrictions Rβ = r, and t tests on the individual coefficients, are justified in finite samples.
Two of the above results-namely, that (i) b is best linear unbiased, and (ii)
the conventional tests are asymptotically justified in the sense that they have
asymptotically correct size-have been used to justify the use of the least
squares estimator under conditions of nonnormality. Most econometric text-
books [e.g., Malinvaud (1970, p. 100)] use this argument, either explicitly or implicitly. However, Koenker (1982) argues that neither of the points is very
compelling. He notes that the class of linear estimators is computationally
appealing, but it may be drastically restrictive. Also, the power of the asymp-
totic tests can be extremely sensitive to the hypothesized error distribution.
The least squares estimator does not satisfy a robustness requirement that the
distribution of an estimator or test statistic be altered only slightly when the
distribution of the error term is altered only slightly [see Koenker (1982) for
details]. These facts, coupled with the fact that arguments justifying the exis-
tence of normally distributed errors are often weak, have led Koenker and
others to advocate use of robust estimation techniques (Section 20.4). Such
techniques will be particularly desirable if the error distribution has infinite
variance, and we now discuss some aspects of this situation.
There is a large body of literature [e.g., Mandelbrot (1963a and b, 1966, 1967, 1969) and Fama (1963, 1965, 1970)] which suggests that many economic data series, particularly prices in financial and commodity markets, are well represented by a class of distributions with infinite variance. An example is the Pareto distribution, with density f(e) = c(e − e₀)^{−α−1}, where c, e₀, and α are constants and where the variance does not exist for α < 2. Thus it is natural to
ask whether, in the context of the general linear model, one should assume that
the disturbance comes from a distribution with finite variance. It is frequently
argued that the disturbance is made up of the sum of a large number of separate
influences and, from a central limit theorem, that the distribution of this sum
will approach normality. However, as Koenker (1982) notes, the necessary
Lindeberg condition may not be met, and, as Bartels (1977) points out, there
are other limit theorems which are just as likely to be relevant when consider-
ing the sum of a number of components in a regression disturbance; some of
these theorems lead to nonnormal stable distributions characterized by infinite
variance. Thus, it is worthwhile considering the consequences of an infinite
error variance for our usual least squares estimation techniques.
An infinite variance distribution has "fat tails," which implies that large
values or "outliers" will be relatively frequent. Because the least squares
technique minimizes squared deviations, it places a relatively heavy weight on
outliers, and their presence can lead to estimates that are extremely sensitive.
Thus, in repeated samples, least squares estimates will vary more than in the
finite variance case. Also, we will be unable to reliably estimate the variance of
the disturbance or the least squares estimates. If the variance does not exist, it
is obviously impossible to obtain a meaningful variance estimator and the least
squares estimator will not possess its usual minimum variance property. This in
turn implies that the conventional F and t tests on the coefficients could be very
misleading.
Malinvaud (1970, p. 308) points out that, in practice, one can always assume
that the distribution of the disturbances is bounded and this will lead to a finite
variance. However, he also points out that this will not overcome the problem.
The relatively large number of outliers will still lead to variance estimates that
are unstable in repeated samples and, in general, the estimates will behave as if
the variance were infinite.
The possibility of nonnormal disturbances in general, and infinite variance
disturbances in particular, has led to the development of alternative estimation
techniques which, relative to least squares, place less weight on outliers. As
mentioned above, these techniques come under the general heading of robust
estimation and are discussed in Section 20.4.
A well-known test for normality is based on the Shapiro-Wilk statistic

$$ W = \frac{\left[\sum_{i=1}^{h} a_{in}\left(\tilde{e}_{(n-i+1)} - \tilde{e}_{(i)}\right)\right]^2}{\sum_{j=1}^{n}(\tilde{e}_j - \bar{e})^2} \qquad (20.2.1) $$

where:

1. n = T − K.
2. ẽ_(1) ≤ ẽ_(2) ≤ . . . ≤ ẽ_(n) are the ordered BLUS residuals.
3. ē = Σ_{j=1}^n ẽ_j/n.
4. h = n/2 or (n − 1)/2 according to whether n is even or odd.
5. The coefficients a_in are tabulated by Shapiro and Wilk (1965).

The null hypothesis is that the disturbances are normally distributed and the critical region is given by low values of W. Significance points are provided by Shapiro and Wilk (1965).
It is interesting to note that although the test assumes that the observations are independent, both Huang and Bolch (1974) and Ramsey (1974) report on Monte Carlo studies where the least squares residual vector ê led to a more powerful test than that obtained using the BLUS residual vector ẽ.
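In practice the W test is usually applied through a library routine that tabulates the a_in coefficients internally. A minimal sketch using least squares residuals (which, per the Monte Carlo evidence just cited, may even be preferable to BLUS residuals):

```python
import numpy as np
from scipy import stats

def shapiro_on_residuals(y, X):
    """Shapiro-Wilk test applied to least squares residuals.

    A small p-value (low W) points toward nonnormal disturbances.
    Treating regression residuals as an independent sample is an
    approximation, as noted in the text."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    W, p = stats.shapiro(e)
    return W, p
```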
A leading example of a model with nonnormal disturbances is the frontier production function

$$ y_t = \beta_1 + \sum_{i=2}^{K} \beta_i x_{it} + e_t, \qquad t = 1, \ldots, T \qquad (20.3.1) $$

where y is the log of maximum output, given the x_i's, the logs of the various inputs. Since observed production must lie inside this frontier, every disturbance must be negative, that is, e_t ≤ 0.
If we assume that the e_t are independently and identically distributed random variables with mean μ < 0 and finite variance σ², then the model may be estimated by least squares (LS). For i ≠ 1 the estimates of the β_i's will be best linear unbiased. The estimator for β₁ will be biased, but it will be unbiased for β₁ + μ. Furthermore, while the LS estimates will not be normal in small samples, they will be asymptotically normal, so the usual tests for the β_i will be asymptotically valid. However, this method does not allow us to obtain a reliable estimate for β₁, nor can the frontier itself be estimated.
If one is willing to assume a distribution for e_t, the parameters of the model can be estimated by maximum likelihood (ML). Schmidt (1976b) notes that one may conveniently assume either that the e_t's have a half-normal distribution

$$ f(e_t) = \frac{2}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{e_t^2}{2\sigma^2}\right), \qquad e_t \le 0 \qquad (20.3.2) $$

or that they follow an exponential distribution.
20.4 ROBUST ESTIMATION

Useful references on robust estimation include the papers by Huber (1972, 1973, 1977), Bickel (1976), Koenker (1982), and Dielman and Pfaffenberger (1982), the books by Mosteller and Tukey (1977), Bierens (1981)
and Huber (1981), the special journal issue edited by Hogg (1977), and the
papers by Koenker and Bassett (1978, 1982a), McKean and Hettmansperger
(1978), and Joiner and Hall (1983). In this section we concentrate on two
classes of robust estimators-namely, M-estimators (Section 20.4.1) and L-
estimators (Section 20.4.2). A third classification where estimators are based
on a ranking of the residuals in linear models is known as R-estimators. For
details and additional references in this area, we refer the reader to Koenker
(1982), Jaeckel (1972), McKean and Hettmansperger (1978), and Sievers (1978).
Other work that is not described or referred to here but is worthy of mention is
that of Kadiyala (1972), Hogg and Randles (1975), Chambers and Heathcote
(1975, 1978), Krasker (1980), and Moberg et al. (1980). Gilstein and Leamer
(1983) suggest an approach where the set of maximum likelihood estimators for
a given set of error distributions is identified. Maximum likelihood estimation
of density functions that include the normal as a special case has been investi-
gated by Zeckhauser and Thompson (1970) and Goldfeld and Quandt (1981).
Finally, it is important to note that when James and Stein (1961) considered
the problem of the simultaneous estimation of location parameters under qua-
dratic loss that we discussed in Chapter 3, they also showed the assumption of
normality was unnecessary and suggested an estimator of the form δ(β̂) = {1 − b/(a + β̂'β̂)}β̂, which is better than the maximum likelihood estimator β̂ if
a and b are suitably chosen. Since the values a and b were not explicitly
determined it was left to Shinozaki (1984) to demonstrate explicit estimators
which dominate the best invariant estimator when the coordinates of the best
invariant estimator are independently, identically, and symmetrically distrib-
uted. Ullah, Srivastava, and Chandra (1983) consider a class of shrinkage esti-
mators and study their sampling properties when the disturbances are small and
possess moments of the fourth order. Alternatively, Stein (1981) proposed a
coordinate-wise limited translation estimator to improve on the conventional
Stein type estimator in the case of possibly heavy-tailed prior distributions.
Dey and Berger (1983) analyzed this estimator for a number of heavy-tailed
distributions, noted its risk performance and proposed an operational adaptive
version. Judge, Yancey, and Miyazaki (1984) analyzed the sampling perfor-
mance of this family of Stein-like estimators under multivariate t and indepen-
dent t error distributions with 1, 3, and 5 degrees of freedom and found the risk performance of a variety of Stein estimators over the range of the parameter space to be superior to some of the conventional robust estimators to be
discussed in this section.
20.4.1 M-Estimators
Consider the linear statistical model

$$ y_t = \mathbf{x}_t'\boldsymbol{\beta} + e_t, \qquad t = 1, 2, \ldots, T \qquad (20.4.1) $$

The least squares estimator satisfies the normal equations

$$ \sum_{t=1}^{T} \mathbf{x}_t(y_t - \mathbf{x}_t'\boldsymbol{\beta}) = \mathbf{0} \qquad (20.4.2) $$

and minimizes the criterion

$$ \sum_{t=1}^{T} (y_t - \mathbf{x}_t'\boldsymbol{\beta})^2 \qquad (20.4.3) $$

An M-estimator is defined as the minimizer of a more general criterion

$$ \sum_{t=1}^{T} \rho(y_t - \mathbf{x}_t'\boldsymbol{\beta}) \qquad (20.4.4) $$

or, equivalently, as a solution of

$$ \sum_{t=1}^{T} \mathbf{x}_t\,\psi(y_t - \mathbf{x}_t'\boldsymbol{\beta}) = \mathbf{0} \qquad (20.4.5) $$
where, if ρ(z) is a strictly convex function, ρ'(z) = ψ(z). In the least squares case, ρ(y_t − x_t'β) = ½(y_t − x_t'β)² and ψ(y_t − x_t'β) = y_t − x_t'β. However, for robust estimation we typically choose ρ and ψ so that outliers are weighted less heavily than they are in the least squares situation. One set of functions that has been popularized by Huber, and that possesses some desirable minimax properties in the presence of a "contaminated" normal distribution [see Huber (1981) for details], is

$$ \rho(z) = \begin{cases} \tfrac{1}{2}z^2 & \text{if } |z| \le k \\ k|z| - \tfrac{1}{2}k^2 & \text{if } |z| > k \end{cases} \qquad (20.4.6) $$
and

$$ \psi(y_t - \mathbf{x}_t'\boldsymbol{\beta}) = \max\{-k, \min(k,\, y_t - \mathbf{x}_t'\boldsymbol{\beta})\} = \begin{cases} -k & \text{if } y_t - \mathbf{x}_t'\boldsymbol{\beta} < -k \\ y_t - \mathbf{x}_t'\boldsymbol{\beta} & \text{if } |y_t - \mathbf{x}_t'\boldsymbol{\beta}| \le k \\ k & \text{if } y_t - \mathbf{x}_t'\boldsymbol{\beta} > k \end{cases} \qquad (20.4.7) $$
where k is a preassigned multiple of a robust measure of dispersion of the e_t. From (20.4.6) we see that this scheme treats "small" residuals in the traditional way, by minimizing their sum of squares. However, so that outliers do not have an undue influence on the estimation procedure, the sum of the absolute values of "large" residuals is minimized. In terms of the normal equations in (20.4.5), the scheme implies that small residuals are treated in the usual manner, but residuals greater than k in absolute value are replaced by ±k. See Equation 20.4.7.
Let us introduce the measure of dispersion (or the scale factor) more explicitly. If k = cσ, where σ is a measure of dispersion and c is a constant usually taken as about 1.5, then (20.4.7) can be written as

$$ \psi(z) = \max\{-c, \min(c, z)\}, \qquad z = (y_t - \mathbf{x}_t'\boldsymbol{\beta})/\sigma \qquad (20.4.8) $$

and the estimator is defined by the equations

$$ \sum_{t=1}^{T} \mathbf{x}_t\,\psi\!\left(\frac{y_t - \mathbf{x}_t'\boldsymbol{\beta}}{\sigma}\right) = \mathbf{0} \qquad (20.4.9) $$

and

$$ \sum_{t=1}^{T} \chi\!\left(\frac{y_t - \mathbf{x}_t'\boldsymbol{\beta}}{\sigma}\right) = a \qquad (20.4.10) $$

where

$$ a = (T - K)\,E_\Phi\,\chi \qquad (20.4.11) $$
E_Φ is the expectation with respect to Φ, and Φ is the distribution function for a standard normal random variable. The choice of a in (20.4.11) is mainly for convenience. It leads to the classical estimates for the least squares choice ρ(z) = ½z², and it ensures the scale estimate will be consistent if the errors are normally distributed.
Let us examine the nature of (20.4.9) and (20.4.10) for the special case where ψ is defined by (20.4.8). Also, we will investigate how to solve the equations for this particular case. Working in this direction we define what are often termed the "Winsorized residuals"

$$ e_t^* = \max\{-c\sigma, \min(c\sigma,\; y_t - \mathbf{x}_t'\boldsymbol{\beta})\} \qquad (20.4.12) $$

That is, the residuals with absolute values greater than cσ are replaced by ±cσ. Then, for ψ defined in (20.4.8), Equation 20.4.9 can be written as
$$ \sum_{t=1}^{T} \mathbf{x}_t e_t^* = \mathbf{0} \qquad (20.4.13) $$

To rewrite (20.4.10) we first note that, in this case, χ(z) = ψ(z)², and so (20.4.10) becomes

$$ \sum_{t=1}^{T} \left(\frac{e_t^*}{\sigma}\right)^2 = (T - K)\gamma, \qquad \gamma = E_\Phi\,\psi(z)^2 \qquad (20.4.14) $$

In matrix notation these equations are

$$ X'\mathbf{e}^* = \mathbf{0} \qquad (20.4.15) $$

and

$$ \sigma^2 = \frac{\mathbf{e}^{*\prime}\mathbf{e}^*}{(T - K)\gamma} \qquad (20.4.16) $$
where e* and X (and for future reference y) have obvious vector and matrix definitions. In the classical least squares case γ = 1 and e* = y − Xβ, and thus the least squares solutions are obtained immediately from (20.4.15) and (20.4.16). In our case, however, e* depends on both β and σ through the Winsorizing of residuals, and an iterative procedure is needed to solve (20.4.15) and (20.4.16).
One algorithm is as follows. We begin with some initial estimates β̂(0) and σ̂(0), which could be least squares estimates or, preferably, those obtained from l₁-estimation. See Andrews (1974), Hinich and Talwar (1975), and Harvey (1977) for some discussion on initial estimates. Using β̂(0) and σ̂(0), a set of Winsorized residuals e*(0) is calculated from (20.4.12). A second scale estimate σ̂²(1) = e*(0)'e*(0)/[(T − K)γ] can then be calculated. To obtain a second estimate of β we define y(1) = Xβ̂(0) + e*(0) and take β̂(1) as the solution obtained when e* in X'e* = 0 is replaced by y(1) − Xβ̂(1); that is, we solve

$$ \hat{\boldsymbol{\beta}}^{(1)} = (X'X)^{-1}X'\mathbf{y}^{(1)} $$

The estimates β̂(1) and σ̂(1) can be used to obtain a new set of Winsorized residuals e*(1), and the process continues until convergence. For details of convergence properties, see Huber (1981, Ch. 7) and references therein.
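The iteration just described is compact enough to state in full. The sketch below uses the median absolute deviation for the initial scale and computes γ = E_Φψ(z)² in closed form for the Huber ψ truncated at c; function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def huber_m_estimate(y, X, c=1.5, tol=1e-8, max_iter=200):
    """Iterative solution of (20.4.15) and (20.4.16) by repeated Winsorizing."""
    T, K = X.shape
    # gamma = E_Phi[psi(z)^2] = E[min(|Z|, c)^2] for Z ~ N(0, 1)
    gamma = (2.0 * norm.cdf(c) - 1.0 - 2.0 * c * norm.pdf(c)) \
            + 2.0 * c**2 * (1.0 - norm.cdf(c))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # LS start
    e = y - X @ beta
    sigma = np.median(np.abs(e - np.median(e))) / 0.6745  # MAD start
    for _ in range(max_iter):
        e_star = np.clip(y - X @ beta, -c * sigma, c * sigma)  # (20.4.12)
        sigma = np.sqrt(e_star @ e_star / ((T - K) * gamma))   # (20.4.16)
        y_adj = X @ beta + e_star                              # y(1)
        beta_new, *_ = np.linalg.lstsq(X, y_adj, rcond=None)   # (20.4.15)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new, sigma
        beta = beta_new
    return beta, sigma
```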
Another possible algorithm is an iterative weighted least squares scheme. Specifically, at the mth iteration we minimize Σ_{t=1}^T w_t^{(m)}(y_t − x_t'β^{(m)})², where

$$ w_t^{(m)} = \begin{cases} 1 & \text{if } |\hat{e}_t^{(m-1)}| \le c\hat{\sigma}^{(m-1)} \\[4pt] \dfrac{c\hat{\sigma}^{(m-1)}}{|\hat{e}_t^{(m-1)}|} & \text{otherwise} \end{cases} \qquad (20.4.19) $$

Under appropriate regularity conditions, √T(β̂ − β) has a limiting normal distribution with covariance matrix

$$ \sigma^2(\psi, F)\left(\lim_{T\to\infty} T^{-1}X'X\right)^{-1} \qquad (20.4.20) $$

where σ²(ψ, F) = E_F ψ(e_t)²/[E_F ψ'(e_t)]². For the special case where ψ is defined by (20.4.8), Huber (1971) recommends estimating the asymptotic covariance of β̂ with

$$ \hat{\Sigma}_{\hat{\beta}} = g\,\frac{\mathbf{e}^{*\prime}\mathbf{e}^*}{T - K}\,(X'X)^{-1} \qquad (20.4.21) $$

where g is a correction factor.
20.4.2 L-Estimators

In the simple location model, L-estimators are linear combinations of sample order statistics. Examples include the sample median ξ̂_{0.5} (20.4.22); the Gastwirth estimator 0.3ξ̂_{1/3} + 0.4ξ̂_{0.5} + 0.3ξ̂_{2/3} (20.4.23), where ξ̂_θ is the θth sample quantile; and the trimmed mean, that is, the sample mean of all observations lying between two quantiles, say ξ̂_α and ξ̂_{1−α}.
When we consider the more general linear model, our usual concept of order statistics is no longer adequate because what constitutes an appropriate ordering depends on the vector β. One possible approach is to obtain the residuals from a preliminary fit (such as least squares or least absolute values) and to estimate the quantiles of the error distribution on the basis of these residuals. A more appealing approach, however, is that suggested by Koenker and Bassett (1978). They begin by noting that, in the location model, the θth sample quantile can be equivalently defined as any solution to the minimization problem

$$ \min_{\beta} \left[ \sum_{\{t:\,y_t \ge \beta\}} \theta\,|y_t - \beta| + \sum_{\{t:\,y_t < \beta\}} (1 - \theta)\,|y_t - \beta| \right] \qquad (20.4.24) $$
This definition extends readily to the more general case. Specifically, if we have the linear model

$$ y_t = \mathbf{x}_t'\boldsymbol{\beta} + e_t \qquad (20.4.25) $$

where the e_t are independent and identically distributed with distribution function F, which is symmetric around zero, and the other notation is identical to that used previously, then the θth regression quantile (0 < θ < 1) is defined as any solution to the minimization problem

$$ \min_{\boldsymbol{\beta}} \left[ \sum_{\{t:\,y_t \ge \mathbf{x}_t'\boldsymbol{\beta}\}} \theta\,|y_t - \mathbf{x}_t'\boldsymbol{\beta}| + \sum_{\{t:\,y_t < \mathbf{x}_t'\boldsymbol{\beta}\}} (1 - \theta)\,|y_t - \mathbf{x}_t'\boldsymbol{\beta}| \right] \qquad (20.4.26) $$

(20.4.27)

where:

(20.4.28)
20.4.2a l₁-Estimation

As mentioned above, when θ = ½ the minimization problem in (20.4.26) is equivalent to finding that β which minimizes Σ_t|y_t − x_t'β|. This estimator, β*(0.5), is sometimes referred to as the l₁-estimator, since it is a special case of the l_p-estimator that minimizes Σ_t|y_t − x_t'β|^p. It has also been called the least absolute value (LAV) estimator, the least absolute residual (LAR) estimator, the least absolute error (LAE) estimator, and the minimum absolute deviation (MAD) estimator. See Dielman and Pfaffenberger (1982) for a review of computational algorithms, small sample (Monte Carlo) properties, and asymptotic properties. Examples of computational algorithms are the linear programming algorithms suggested by Spyropoulos et al. (1973), Abdelmalek (1974), and Bassett and Koenker (1982), and the iterative weighted least squares algorithm suggested by Schlossmacher (1973) and Fair (1974).
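The linear programming formulation is direct: writing y = Xβ + u − v with u, v ≥ 0, the l₁-estimator minimizes Σ(u_t + v_t). A minimal sketch using a generic LP solver follows; replacing the unit costs on u and v by θ and (1 − θ) yields the θth regression quantile of (20.4.26).

```python
import numpy as np
from scipy.optimize import linprog

def lav_estimate(y, X):
    """l1 (least absolute value) estimator via linear programming.

    Decompose beta = b_pos - b_neg and the residual as u - v with all
    parts nonnegative; at the optimum u_t + v_t = |y_t - x_t' beta|."""
    T, K = X.shape
    cost = np.concatenate([np.zeros(2 * K), np.ones(2 * T)])
    A_eq = np.hstack([X, -X, np.eye(T), -np.eye(T)])
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:K] - res.x[K:2 * K]
```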
With respect to the statistical properties of β*(0.5), we first note [see Blattberg and Sargent (1971)] that, if the disturbances follow a two-tailed exponential distribution with density function

$$ f(e_t) = \frac{1}{2\phi}\exp\left(-\frac{|e_t|}{\phi}\right) \qquad (20.4.29) $$

then β*(0.5) is the maximum likelihood estimator. More generally, √T[β*(0.5) − β] has a limiting normal distribution with covariance matrix

$$ [2f(0)]^{-2}\left(\lim_{T\to\infty} T^{-1}X'X\right)^{-1} \qquad (20.4.30) $$
where f(0) is the value of the density at the median [Bassett and Koenker (1978)]. The term [2f(0)]⁻² is the asymptotic variance of the sample median from samples with distribution function F. Thus, the l₁-estimator will be more efficient than the least squares estimator for all error distributions where the median is superior to the mean as an estimator of location. This class of error distributions includes the Cauchy, the two-tailed exponential, and a number of other distributions where "outliers" are prevalent.
It is straightforward to use (20.4.30) to formulate a Wald statistic for testing hypotheses about β, providing that we can find a consistent estimator for f(0). One such estimator is [Cox and Hinkley (1974, p. 470)]

$$ \hat{f}(0) = \frac{r - s}{T[\hat{e}_{(r)} - \hat{e}_{(s)}]} \qquad (20.4.31) $$
Likelihood ratio and Lagrange multiplier tests based on l₁-estimation are also available. The likelihood ratio statistic takes the form

$$ LR = \frac{2(S_R - S_U)}{\hat{\tau}}, \qquad \hat{\tau} = [2\hat{f}(0)]^{-1} \qquad (20.4.32) $$

where S_U and S_R are the sums of the absolute values of the residuals in the unrestricted and restricted models, respectively. For the hypothesis β₂ = 0 in the partitioning y_t = x_{t1}'β₁ + x_{t2}'β₂ + e_t the LM statistic is

$$ LM = \mathbf{g}'D\mathbf{g} \qquad (20.4.33) $$

where g = Σ_t x_{t2} sign(y_t − x_{t1}'β̃₁), β̃₁ is the restricted l₁-estimator for β₁, and D is the second diagonal block of (X'X)⁻¹ corresponding to the above partitioning of x_t'β. Under the null hypothesis both statistics have limiting χ²_(J) distributions,
where J is the number of restrictions. The LM test has the advantage that it does not require estimation of f(0). Koenker and Bassett (1982b) compare the asymptotic efficiency of the tests with that of the more conventional ones based on least squares residuals, and suggest some finite sample correction factors. As expected, the tests based on l₁-estimation are more powerful for heavy-tailed distributions.
Before turning to more general functions of regression quantiles, we note that Amemiya (1982) and Powell (1983) have extended l₁-methods to simultaneous equation models.
An L-estimator for the linear model can be formed as a weighted average of regression quantiles,

$$ \hat{\boldsymbol{\beta}}(\pi) = \sum_{i=1}^{m} \pi(\theta_i)\,\boldsymbol{\beta}^*(\theta_i) \qquad (20.4.34) $$

(20.4.35)

A consistent estimator of the density ordinate is

$$ \hat{f}(\theta) = \frac{r - s}{T[\hat{e}_{(r)} - \hat{e}_{(s)}]} \qquad (20.4.36) $$

where r = [Tθ] + v, s = [Tθ] − v, and ê_(1), ê_(2), . . . , ê_(T) are the ordered l₁-residuals. For another possible method of estimating the density function ordinates f̂(θ), see Koenker and Bassett (1982a).
For some statistical models the dependent variable may not be normally distrib-
uted, but there may exist a transformation such that the transformed observa-
tions are normally distributed. For example, consider the nonlinear model

$$ y_t = \exp(\mathbf{x}_t'\boldsymbol{\beta} + e_t), \qquad t = 1, 2, \ldots, T \qquad (20.5.1) $$

for which the transformed variable ln y_t is linear in β. More generally, Box and Cox (1964) define the family of transformations

$$ y_t^{(\lambda)} = \begin{cases} \dfrac{y_t^\lambda - 1}{\lambda} & \text{if } \lambda \ne 0 \\[4pt] \ln y_t & \text{if } \lambda = 0 \end{cases} \qquad (20.5.2) $$

and consider the model

$$ y_t^{(\lambda)} = \mathbf{x}_t'\boldsymbol{\beta} + e_t \qquad (20.5.3) $$

where the e_t are i.i.d. N(0, σ²). Thus they assume that there exists a transforma-
tion of the dependent variable, of the form given in (20.5.3), such that the
transformed dependent variable:
1. Is normally distributed.
2. Is homoscedastic.
3. Has an expectation that is linear in β.
(20.5.4)
(20.5.5)
and
(20.5.7)
A problem with the above approaches is that they assume that the transfor-
mation simultaneously yields the appropriate functional form and disturbances
that are approximately normally distributed and homoscedastic. Such an as-
sumption may be unreasonable. Zarembka (1974) has shown that when the e_t are heteroscedastic, the estimated λ will be biased in the direction required for the transformed dependent variable to be more nearly homoscedastic. To over-
come this problem maximum likelihood estimation of a model that allows for
both a flexible functional form and heteroscedasticity has been considered by
Egy and Lahiri (1979). See also Gaudry and Dagenais (1979) and Seaks and
Layson (1983).
Alternative computational methods for obtaining maximum likelihood esti-
mates for the models in (20.5.3), (20.5.4), and (20.5.7) have been reviewed by
Spitzer (1982a, 1982b). Considering, for example, the model in (20.5.3), we can
write the log of the likelihood function as

$$ L(\boldsymbol{\beta}, \sigma^2, \lambda) = -\frac{T}{2}\ln 2\pi - \frac{T}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\left(\mathbf{y}^{(\lambda)} - X\boldsymbol{\beta}\right)'\left(\mathbf{y}^{(\lambda)} - X\boldsymbol{\beta}\right) + (\lambda - 1)\sum_{t=1}^{T}\ln y_t \qquad (20.5.8) $$

where y^{(λ)} = (y₁^{(λ)}, y₂^{(λ)}, . . . , y_T^{(λ)})' and X' = (x₁, x₂, . . . , x_T). This function
can be maximized directly using one of the algorithms outlined in Appendix B,
we can obtain the concentrated likelihood function that depends only on A and
carry out a search procedure over A, or we can employ one of the other
methods suggested by Spitzer (1982a, 1982b). Substituting the conditional ML estimators β̂(λ) = (X'X)⁻¹X'y^{(λ)} and σ̂²(λ) = [y^{(λ)} − Xβ̂(λ)]'[y^{(λ)} − Xβ̂(λ)]/T into (20.5.8) gives, apart from a constant, the concentrated likelihood function

$$ L(\lambda) = -\frac{T}{2}\ln \hat{\sigma}^2(\lambda) + (\lambda - 1)\sum_{t=1}^{T}\ln y_t \qquad (20.5.9) $$
An alternative is to rescale each observation by dividing by the sample geometric mean of the y_t's. This transformation has the effect of eliminating the second term in (20.5.9), so that maximizing the likelihood becomes equivalent to minimizing the transformed residual sum of squares. See Spitzer (1982a) for details of the transformation and the relationship between the transformed and untransformed parameters.
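Since (20.5.9) depends on the single parameter λ, a grid search is often the simplest computational method. A minimal sketch, assuming y is strictly positive; all names are illustrative:

```python
import numpy as np

def boxcox_concentrated_loglik(lam, y, X):
    """Concentrated log likelihood (20.5.9), up to a constant."""
    T = y.size
    y_lam = np.log(y) if np.isclose(lam, 0.0) else (y**lam - 1.0) / lam
    b, *_ = np.linalg.lstsq(X, y_lam, rcond=None)  # conditional ML for beta
    u = y_lam - X @ b
    s2 = u @ u / T                                 # conditional ML for sigma^2
    return -0.5 * T * np.log(s2) + (lam - 1.0) * np.log(y).sum()

def boxcox_search(y, X, grid=np.linspace(-2.0, 2.0, 401)):
    """Grid search over lambda."""
    return max(grid, key=lambda lam: boxcox_concentrated_loglik(lam, y, X))
```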
A potential danger from using a standard least squares computer package is
that the standard errors provided by the program will be taken as a proper
reflection of the reliability of the parameter estimates. These standard errors
will, however, be conditional on λ, and will underestimate the more relevant unconditional ones. Details of the transformation necessary to obtain unconditional standard errors are given by Spitzer (1982a). Because of the model implications of λ = 0 and λ = 1, a confidence interval for λ is frequently of interest. In the usual way, such a confidence interval can be constructed using the asymptotic normality of the maximum likelihood estimator λ̂ and its standard
error.
Empirical applications of the Box-Cox transformation include the work of
Zarembka (1968, 1974); White (1972); Spitzer (1976); Kau and Lee (1976); and
Chang (1977); and two extensions are its use in simultaneous equations models
[Spitzer (1977)] and the introduction of first-order autoregressive errors [Savin
and White (1978)]. Spitzer (1978) has provided some Monte Carlo evidence on
the small-sample properties of the ML estimates, and Hill, Johnson, and Kau
(1977) suggest, in terms of response variation characteristics, that the Box-Cox
transformation model is similar to a class of systematically varying parameter
models. Because it makes the model restrictions more explicit, they prefer this
latter class of models. Further properties of estimators based on the Box-Cox
transformation have been investigated by Bickel and Doksum (1981) [see also
Box and Cox (1982)], Carroll (1982) and Doksum and Wong (1983).
20.6 A MULTIPLICATIVE OR
AN ADDITIVE DISTURBANCE?
Hazell and Scandizzo (1975, 1977) and Turnovsky (1976), among others, have
pointed out some of the different implications of multiplicative and additive
disturbances in stochastic economic relationships. This question concerning
the type of disturbance is somewhat related to heteroscedasticity, functional
form, and the Box-Cox transformation, and so this is a convenient place to
discuss certain aspects of it.
We assume that the applied worker is faced with choosing between the two
alternatives
$$ y_t = f(\mathbf{x}_t, \boldsymbol{\beta}) + e_t \qquad (20.6.1) $$

and

$$ y_t = f(\mathbf{x}_t, \boldsymbol{\beta})\cdot v_t \qquad (20.6.2) $$

where y_t, x_t, and β have the usual definitions, f is not necessarily linear, the e_t in Equation 20.6.1 are assumed to be i.i.d. (0, σ_e²), and the v_t in Equation 20.6.2 are assumed to be i.i.d. (μ, σ_v²). The error in (20.6.1) is given the title additive
disturbance while in (20.6.2) it is known as a multiplicative disturbance. The
distinction is important, for example, when dealing with stochastic supply func-
tions, where it has implications for the welfare effects of price stabilization
policies [Hazell and Scandizzo (1975), Turnovsky (1976)].
From a statistical standpoint the real difference between Equations (20.6.1) and (20.6.2) seems to be whether or not y_t is heteroscedastic with a variance that depends on x_t. To illustrate this, we define a disturbance e*_t = v_t − μ and rewrite (20.6.2) as

$$ y_t = f(\mathbf{x}_t, \boldsymbol{\beta})\mu + f(\mathbf{x}_t, \boldsymbol{\beta})e_t^* \qquad (20.6.3) $$

so that y_t has mean f(x_t, β)μ and a variance proportional to f(x_t, β)². Consider, for example, the choice between

$$ \ln y_t = \ln\beta_1 + \beta_2 \ln x_{2t} + \beta_3 \ln x_{3t} + e_t \quad\text{and}\quad y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + e_t \qquad (20.6.4) $$

where in each case the e_t are assumed to be i.i.d. N(0, σ²). This is a choice between a constant elasticity model with heteroscedastic y_t and a constant slope model with homoscedastic y_t. If the decisions about modeling heteroscedasticity and functional form were separated, which seems more appropriate, two other alternatives would have to be considered, namely a constant elasticity model with homoscedastic y_t, such as y_t = β₁x_{2t}^{β₂}x_{3t}^{β₃} + e_t, and a constant slope model with heteroscedastic y_t, such as y_t = (β₁ + β₂x_{2t} + β₃x_{3t})·exp{e_t}. Another obvious example of where
the two decisions are not separated is when the Box-Cox transformation,
discussed in Section 20.5, is employed. As was pointed out in that section,
the model assumes that the transformation simultaneously yields the appropri-
ate functional form and homoscedastic errors.
Thus we argue that the choice between a multiplicative and an additive disturbance is really a choice between assuming that y_t is heteroscedastic or homoscedastic, and this choice can be made using the framework described in Chapter 11. However, the multiplicative-additive distinction does have an advantage. It helps to separate explicitly the decisions concerning how to model heteroscedasticity and how to model functional form.
Under the assumption that the functional form is known, Leech (1975) dem-
onstrates how the Box-Cox transformation can be used to try and discriminate
between the additive specification in (20.6.1) when e t is normally distributed
and the multiplicative specification in (20.6.2) when v_t is lognormally distrib-
uted. Given the above discussion this is essentially a test designed to discrimi-
nate between the assumptions:
$$ y_t = f(\mathbf{x}_t, \boldsymbol{\beta}) + e_t \qquad (20.6.5) $$

and

$$ \ln y_t = \ln f(\mathbf{x}_t, \boldsymbol{\beta}) + \eta_t \qquad (20.6.6) $$

where the e_t are i.i.d. N(0, σ_e²) and the η_t are i.i.d. N(0, σ_η²). A more general model, which includes these two as special cases, is

$$ y_t^{(\lambda)} = [f(\mathbf{x}_t, \boldsymbol{\beta})]^{(\lambda)} + e_t \qquad (20.6.7) $$

where

$$ y_t^{(\lambda)} = \begin{cases} \dfrac{y_t^\lambda - 1}{\lambda} & \text{if } \lambda \ne 0 \\[4pt] \ln y_t & \text{if } \lambda = 0 \end{cases} $$
A test of the multiplicative specification (λ = 0) can then be based on the statistic

$$ U = T\left[\ln S(\hat{\boldsymbol{\beta}}, 0) - \ln S(\hat{\boldsymbol{\beta}}, \hat{\lambda})\right] + 2\hat{\lambda}\sum_{t=1}^{T}\ln y_t \qquad (20.6.8) $$

where:

1. The log likelihood function is L = const − (T/2)ln σ² + (λ − 1)Σ_t ln y_t − S(β, λ)/(2σ²).
2. β̂ and λ̂ are the maximum likelihood estimates.
3. S(β, λ) = Σ_{t=1}^T {y_t^{(λ)} − [f(x_t, β)]^{(λ)}}².
20.7 RECOMMENDATIONS
1. If one has a priori information about the likely form of a nonnormal distri-
bution, then, because of its known desirable properties, maximum likeli-
hood estimation should be used.
2. If the error distribution is not likely to be one where outliers occur, use
least squares; otherwise use one of the robust estimation techniques. In
particular, the trimmed least squares estimator or the l₁-estimator seem to
be attractive alternatives.
3. When observations on the dependent variable come from a skewed distri-
bution the Box-Cox transformation may be useful. However, because it
also makes assumptions about the functional form and the homoscedastic-
ity of the transformed observations, it should be used cautiously.
4. When a researcher is choosing an appropriate functional form, this deci-
sion should not be made independently of the decision concerning how to
model the stochastic term.
20.8 EXERCISES
For the problems below we recommend generating 100 samples of Monte Carlo
data from the model

$$ y_t = \beta_1 + \beta_2 x_{t2} + \beta_3 x_{t3} + e_t \qquad (20.8.1) $$

where β₁ = 10, β₂ = β₃ = 1, the e_t are independent identically distributed drawings from a Cauchy distribution (t distribution with one degree of freedom), and t = 1, 2, . . . , 20. Choose any convenient design for x_{t2} and x_{t3}.
Exercise 20.1
Choose five samples and, for each of the samples, find values for the following
estimators:
Compare the estimates with each other and the true parameter values.
Exercise 20.2
For the same five samples, use what in each case would be the conventional covariance matrix estimator to estimate the covariance matrices of the four estimators in Exercise 20.1. Comment.
Exercise 20.3
Using the same four estimators and the same five samples, calculate "t statistics" that would be used to test (a) H₀: β₂ = 0 against H₁: β₂ ≠ 0, and (b) H₀: β₃ = 1 against H₁: β₃ ≠ 1. Comment on the outcomes of these tests.
Exercise 20.4
Using all 100 samples calculate the mean and mean square error matrices for
each of the estimators in Exercise 20.1. Comment on the relative efficiency.
Exercise 20.5
Use all 100 samples to calculate the averages of the estimated covariance
matrices found in Exercise 20.2. Are these averages good estimates of the mean
square error matrices found in Exercise 20.4?
Exercise 20.6
Consider the tests in Exercise 20.3. Use all 100 samples to calculate the propor-
tion of Type I (Part b) and Type II (Part a) errors. Also, construct empirical
distributions of the t statistics. From these results comment on the validity of
the tests and their relative power.
20.9 REFERENCES

Chapter 21

On Selecting the Set of Regressors

21.1 INTRODUCTION
Much of the literature concerned with estimation and inference from a sample
of economic data deals with a situation when the statistical model is correctly
specified. Consequently, as discussed in Chapter 2 and in much of econometric
practice, it is customary to assume that the parameterized linear statistical
model used for purposes of inference is consistent with the sampling process
from which the sample observations were generated. In this happy event,
statistical theory provides techniques for obtaining point and interval estima-
tors of the population parameters and for hypothesis testing. In practice, how-
ever, the possibilities for model misspecification are numerous and false statis-
tical models are most likely the rule rather than the exception.
In many of the previous chapters various types of uncertainty relative to the
specification of the statistical model were recognized and procedures were
suggested for identifying the misspecification and/or mitigating its statistical
impact. Among the various possible sources of false statistical models, the omission of relevant explanatory variables or the inclusion of extraneous explanatory variables are the most likely and pervasive. Consequently, in this
chapter, within the context of nonexperimental model building and the linear
statistical model framework introduced in Chapter 2, we consider the statistical
consequences of and the possibilities for dealing with uncertainty concerning
the appropriate column dimension of the design matrix X. Variable selection
procedures are considered within a broad context that encompasses both dis-
crimination among the nested alternatives and the testing of separate models.
Although we are concerned with both nested and nonnested models, the main
focus is on the nested variety.
Nonexperimental model building, or the selection of the statistical model that is consistent with the sampling process whereby the data are generated, is an old and important problem in econometrics. Over the years various criteria,
search processes, empirical rules, and testing mechanisms have been proposed
as aids in the choice process. Many of these procedures are discussed in recent
literature survey articles by Gaver and Geisel (1974), Hocking (1976), Thomp-
son (1978), Amemiya (1980), MacKinnon (1983), Sawyer (1980), McAleer
(1984), a specification search book by Leamer (1978), and special issues of the
Journal of Econometrics by Maddala (1981) and White (1982, 1983). Most
model search procedures recognize the importance of economic theory in narrowing the set of alternative models.
Assume there exists a general model represented by the density f(y|β), which is the true sampling model underlying the observed values of the random vector y. Further assume there are (J − 1) alternative statistical models f₁(y|β₁), f₂(y|β₂), . . . , f_{J−1}(y|β_{J−1}) that are also candidates for explaining the random vector y. Let us further assume that all of the J alternative models can be
derived by imposing appropriate restrictions on the parameter vector 13, the
parameter vector for the general (true) model; that is, the J statistical models
make up a nested set of alternatives for describing the sampling process under-
lying y. Although these assumptions are restrictive they are general enough to
contain some important model selection problems both in a sampling theory
and Bayesian context.
TABLE 21.1 AN OVERVIEW OF SOME MODEL SELECTION PROCEDURES

Nested statistical models (Section 21.2.2): the F test statistic and pretest rules; the mean square error norm and pretest rules; the coefficient of multiple determination and the adjusted coefficient of multiple determination; information criteria (AIC and BIC); Amemiya's criterion PC; the Mallows criterion C_p; the Jeffreys-Bayes posterior odds ratio; the Bayes criterion; the posterior probability criterion; and the positive Stein rule with a zero null hypothesis test.

Nonnested statistical models: the procedures of Cox; Atkinson; Quandt; Pesaran and Deaton; Chow; Davidson and MacKinnon; and Fisher and McAleer.
The problem considered in this section is posed by Efron and Morris (1975,
p. 318) as follows:

". . . The statistician wants to estimate the parameters of a linear model that are known to lie in a high dimensional parameter space H₁; but he suspects that they may lie close to a specified lower dimensional parameter space H₀ ⊂ H₁. Then, estimates unbiased for every parameter vector in H₁ may have large variance, while estimates restricted to H₀ have smaller variance but possibly large bias . . . ."
To see the underlying basis for this conclusion for expository purposes let us
consider the regressor choice problem within the context of a general two-
decision setting and make use of the parameterized linear statistical model
$$ \mathbf{y} = X\boldsymbol{\beta} + \mathbf{e} = X_1\boldsymbol{\beta}_1 + X_2\boldsymbol{\beta}_2 + \mathbf{e} \qquad (21.2.1) $$

with least squares estimators

$$ \mathbf{b} = (X'X)^{-1}X'\mathbf{y} \qquad\text{and}\qquad \hat{\sigma}^2 = (\mathbf{y} - X\mathbf{b})'(\mathbf{y} - X\mathbf{b})/(T - K) $$
If the statistical model (21.2.1) is correct, that is, if the column dimension of the design matrix X is K, then b and σ̂² are minimum variance unbiased estimators. The maximum likelihood estimator b is a normally distributed random vector with mean vector β and covariance E[(b − β)(b − β)'] = σ²(X'X)⁻¹. In regard to the unbiased estimator σ̂², the random variable (T − K)σ̂²/σ² is distributed as a central χ² random variable with (T − K) degrees of freedom.
Consider only the X₁ subset of X and assume that the X₂ set of variables appears in the equation with a zero coefficient vector. Thus if we assume the X₂ variables are extraneous and are omitted from the statistical model, then we may represent this information by the set of restrictions

$$ R\boldsymbol{\beta} = \begin{bmatrix} 0 & I_{K_2} \end{bmatrix}\begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} = \boldsymbol{\beta}_2 = \mathbf{0} = \mathbf{r} \qquad (21.2.4) $$
Given these restrictions, the restricted least squares estimator is

$$ \mathbf{b}^* = \mathbf{b} - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\mathbf{b} - \mathbf{r}) \qquad (21.2.5a) $$

If the elements of β₂ are restricted to zero, the restricted least squares estimator for this special case becomes

$$ \mathbf{b}_1^* = (X_1'X_1)^{-1}X_1'\mathbf{y}, \qquad \mathbf{b}_2^* = \mathbf{0} \qquad (21.2.5b) $$
We recognize the fact that some or possibly all of the elements of β₂ may not be zero by letting Rβ − r = δ, a vector of parameter specification errors of dimension K₂. If the restrictions are correct, which for our special case implies β₂ = 0, then δ is a null vector. Within this context the mean of the restricted estimator b* is

$$ E[\mathbf{b}^*] = \boldsymbol{\beta} - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}\boldsymbol{\delta} = \begin{bmatrix} \boldsymbol{\beta}_1 + (X_1'X_1)^{-1}X_1'X_2\boldsymbol{\beta}_2 \\ \mathbf{0} \end{bmatrix} \qquad (21.2.6) $$
The covariance matrix of the restricted estimator is

$$ \operatorname{cov}(\mathbf{b}^*) = \sigma^2(X'X)^{-1} - \sigma^2(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1} \qquad (21.2.7a) $$

and for the special case (21.2.4),

$$ \operatorname{cov}(\mathbf{b}_1^*) = \sigma^2(X_1'X_1)^{-1} \qquad (21.2.7b) $$

The second expression on the right-hand side of (21.2.7a) is a positive semidefinite matrix. Also, the difference between the covariance matrices of the unrestricted and restricted estimators is cov(b) − cov(b*) = Δ, where Δ is a positive semidefinite matrix. This implies, among other things, that the covariance matrix (21.2.7a) or (21.2.7b), for the estimator involving restriction (21.2.4), has diagonal elements (sampling variabilities or measures of precision) that are equal to or less than the corresponding unrestricted elements of the least squares estimator b for the statistical model (21.2.1).
An estimator of σ² that uses the restricted least squares estimator b* is

$$ \hat{\sigma}^{*2} = \frac{(\mathbf{y} - X_1\mathbf{b}_1^*)'(\mathbf{y} - X_1\mathbf{b}_1^*)}{T - K_1} \qquad (21.2.9) $$

with expectation

$$ E[\hat{\sigma}^{*2}] = \sigma^2 + \frac{\boldsymbol{\beta}_2'X_2'M_1X_2\boldsymbol{\beta}_2}{T - K_1}, \qquad M_1 = I - X_1(X_1'X_1)^{-1}X_1' \qquad (21.2.10) $$

From these results we may conclude the following:
1. The estimator b₁* is biased unless β₂ = 0 or X₁ and X₂ are orthogonal.
2. Whether or not β₂ = 0, the unrestricted least squares estimator of β₁, when both X₁ and X₂ are included, has lower precision than the estimator b₁* from the subset model; that is, the sampling variability of b₁* is equal to or less than the sampling variability of b₁ from the unrestricted model.
3. Unless β₂ = 0, the subset (restricted) model estimator σ̂*² has a positive bias.
4. If β₂ ≠ 0, then either b₁* is biased or σ̂² is biased or both are biased.
The above conclusions cast the incentive for correct model choice in terms of
bias and/or variance. The mean square error measure, which in the case of
biased estimators is a composite of the two, provides a basis for reflecting a
trade-off between bias and precision [Section 3.1.3 and Judge and Bock (1978,
pp. 28, 29)].
[Table: the states of the world are the true model y = X₁β₁ + X₂β₂ + e with β₂ ≠ 0, or the model with β₂ = 0; the investigator's actions are to fit either the complete model or the subset model.]

Within the context of Section 21.2.1, if β₂ ≠ 0 and thus δ ≠ 0, then b* is a biased estimator with mean square error (MSE)

$$ \operatorname{MSE}(\mathbf{b}^*) = \operatorname{cov}(\mathbf{b}^*) + (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}\boldsymbol{\delta}\boldsymbol{\delta}'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1} \qquad (21.2.11) $$
which is the sum of the covariance matrix (21.2.7a) and the bias matrix. Of
course, if the specification error 0 = 0, the restricted estimator is unbiased and
(21.2.11) is equal to (21.2.7b).
In particular, the mean square error matrix for the restricted estimator b₁* may be compared with that of b₁; from (21.2.6) and (21.2.7),

$$ \operatorname{MSE}(\mathbf{b}_1) - \operatorname{MSE}(\mathbf{b}_1^*) = (X_1'X_1)^{-1}X_1'X_2\left\{\sigma^2[X_2'M_1X_2]^{-1} - \boldsymbol{\beta}_2\boldsymbol{\beta}_2'\right\}X_2'X_1(X_1'X_1)^{-1} $$

The expression inside { } involves the covariance matrix for b₂ and the bias matrix for
b₁*. Under squared error loss the risk of the restricted estimator is

$$ \rho(\mathbf{b}^*, \boldsymbol{\beta}) = \sigma^2\operatorname{tr}\left[(X'X)^{-1} - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}\right] + \operatorname{tr}\left\{(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}\boldsymbol{\delta}\boldsymbol{\delta}'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}\right\} \qquad (21.2.16) $$
for the restricted or subset estimator. Judge and Bock [(1978), pp. 29-33]
present the conditions under squared error loss for which the risk of the re-
stricted subset estimator b* is less than the risk of the maximum likelihood
estimator b. As one special case, consider the conditional mean forecasting
problem concerned with estimating $X\beta$, the mean values of $y$ at the sample
points X. The risk of the maximum likelihood estimator Xb under squared error
loss is
where $\sigma^2K_1$ is the risk penalty for the included variables (complexity) and the second term is the risk penalty for misspecification (incorrect variable selection). Consequently,
if
(21.2.20)
and one would select the subset model $X_1$ over the complete model involving $X_1, X_2$ if the expected loss (21.2.18) is less than the expected loss (21.2.17). However, in (21.2.18) the specification error $\delta$ depends on $\beta_2$, which is unknown. Therefore, for variable selection purposes the rule is not operational.
The search for an operational rule is the subject of the next few sections.
In this section we specify, discuss, and compare a range of sampling theory and
Bayesian decision rules that have been proposed for coping with the nested
model selection problem. In general, most of these rules rely on discrimination criteria, such as maximizing some modified $R^2$, and each measures how well the models fit the data after some adjustment for parsimony.
where $\bar{y}$ is a vector each of whose elements is $\sum_t y_t/T$, and $\hat{y} = Xb$. Given this decomposition of the sample variation of $y$ about its mean, the coefficient of determination may be defined as

$R^2 = 1 - \frac{(y - \hat{y})'(y - \hat{y})}{(y - \bar{y})'(y - \bar{y})}$
As a basis for model choice, the R2 measure has an obvious fault; it can be
increased by increasing the number of explanatory variables, that is, by in-
creasing the column dimension of the design matrix. To take this characteristic
of the measure into account, Theil (1971, p. 178) proposed a corrected coefficient of multiple determination $\bar{R}^2$, which uses unbiased estimators of the respective variances. For the subset model involving the design matrix $X_1$, the corrected or adjusted $R^2$ may be defined as
$\bar{R}_1^2 = 1 - \frac{(y - X_1b_1^*)'(y - X_1b_1^*)/(T - K_1)}{(y - \bar{y})'(y - \bar{y})/(T - 1)} = 1 - \left(\frac{T - 1}{T - K_1}\right)(1 - R_1^2) \qquad (21.2.22)$
where $T$ is the sample size, $K_1$ is the number of variables included, and $K_1 < K$.
Given the measure, Theil's recommendation is to seek that subset of explana-
tory variables that maximizes the corrected or adjusted R2. Such a rule, he
states, will lead us to the right model "on the average." However, the expecta-
tion property alone is no guarantee of high power. Also, as Pesaran (1974)
notes, the R2 criterion need not be the most powerful of the criteria involving
the quadratic form of the residuals that have as their property: The expected
value is minimized by the true model. Perhaps the most uncomfortable aspect
of both the R2 and adjusted R2 measures is that they do not include a consider-
ation of the losses associated with choosing an incorrect model; that is, they do
not consider within a decision context the purpose for which the model is to be
used. With the goal of eliminating this deficiency, we now turn to two criteria
that are based on the mean square error measure.
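As a computational aside, Theil's corrected $R^2$ in (21.2.22) reduces to a few lines of numpy; the function name and interface are ours.

    import numpy as np

    def adjusted_r2(y, X1):
        # Theil's corrected R^2 (21.2.22) for the subset model with design X1
        T, K1 = X1.shape
        b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
        sse = np.sum((y - X1 @ b1) ** 2)             # residual sum of squares
        sst = np.sum((y - y.mean()) ** 2)            # total variation about the mean
        return 1.0 - (sse / (T - K1)) / (sst / (T - 1))

Theil's rule is then to choose the subset of regressors for which this quantity is largest.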
where the last term on the right-hand side is the sum of squares of the bias and follows from the fact that $b_1^* = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'e$, since $y = X_1\beta_1 + X_2\beta_2 + e$. Since (21.2.23) contains unknown parameters, one way to proceed is to use an estimate for the unknown parameters and, based on the estimate, choose the model with the smallest estimated risk.
$= K_1 + \frac{(\mathrm{bias})^2}{\sigma^2} \qquad (21.2.24)$
Noting that
(21.2.26)

(21.2.27)
where $\hat{\sigma}^2 = (y - Xb)'(y - Xb)/(T - K)$. When the subset model has small bias, then $\hat{\sigma}_1^2$ is approximately equal to $\hat{\sigma}^2$, and $C_p$ is approximately equal to $K_1$. Therefore, in the $(C_p, K_1)$ space, $C_p$ values with small bias will tend to cluster about a 45° line, where $C_p = K_1$. In using the $C_p$ criterion, a procedure sometimes recommended is to obtain a $C_p$ value for all of the possible $2^K$ subsets of models and choose the model in which $C_p$ is approximately equal to $K_1$. No rule such as minimizing $C_p$ is advised in applied work, but the implication is that variable sets with $C_p$ less than $K_1$ are thought to have smaller prediction error.
If, alternatively, in (21.2.23) a consistent estimator of $\beta_2$ is chosen, we are led to a consistent estimator of the risk and to selection rules such as those proposed by Allen (1971) and discussed by Leamer (1983).
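Since (21.2.27) is not reproduced legibly above, the sketch below uses the operational form commonly attributed to Mallows, $C_p = \mathrm{SSE}_1/\hat{\sigma}^2 - (T - 2K_1)$, with $\hat{\sigma}^2$ taken from the complete model; treat the exact correspondence with (21.2.27) as an assumption.

    import numpy as np

    def mallows_cp(y, X_full, X1):
        # C_p for the subset model X1; sigma^2 is estimated from the complete model
        T, K = X_full.shape
        K1 = X1.shape[1]
        b = np.linalg.lstsq(X_full, y, rcond=None)[0]
        b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
        sigma2_hat = np.sum((y - X_full @ b) ** 2) / (T - K)
        sse1 = np.sum((y - X1 @ b1) ** 2)
        return sse1 / sigma2_hat - (T - 2 * K1)

In the $(C_p, K_1)$ plot described above, subsets with values near the 45° line $C_p = K_1$ are the candidates with small estimated bias.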
(21.2.29)
and involves the unknown parameters $\beta$ and $\sigma^2$. Amemiya then regards $x_0$ as a random vector that satisfies the condition $E[x_0x_0'] = (1/T)X'X$. Under this assumption we obtain, by substitution in (21.2.29),

(21.2.30)

and Amemiya calls this the unconditional mean square prediction error. Unfortunately, (21.2.30) still contains unknown parameters. In typical econometric form, $\sigma^2$ is replaced by its unbiased estimator, and $\beta_2$ is made equal to a null (zero) vector. Since $\beta_2$ is assumed zero, the corresponding unbiased estimator of $\sigma^2$ is
This sequence of assumptions yields what Amemiya calls the prediction crite-
rion (PC)
(21.2.31)
$\mathrm{PC} = \left(\frac{T + K_1}{T - K_1}\right)(1 - R_1^2)\,\frac{\mathrm{SST}}{T} \qquad (21.2.32)$
If we compare this with (21.2.22), it is evident that the PC has a higher penalty
for adding variables than Theil's adjusted R2.
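Using the identity $(1 - R_1^2)\,\mathrm{SST} = \mathrm{SSE}_1$, the prediction criterion (21.2.32) can be computed directly; the sketch below assumes that reading, and the names are ours.

    import numpy as np

    def prediction_criterion(y, X1):
        # Amemiya's PC (21.2.32): ((T + K1)/(T - K1)) (1 - R1^2) SST / T
        T, K1 = X1.shape
        b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
        sse1 = np.sum((y - X1 @ b1) ** 2)            # equals (1 - R1^2) * SST
        return ((T + K1) / (T - K1)) * sse1 / T

The heavier penalty relative to (21.2.22) comes from the factor $(T + K_1)/(T - K_1)$, which grows faster in $K_1$ than $(T - 1)/(T - K_1)$.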
If we do not assume that $\beta_2 = 0$ in (21.2.30) but instead estimate the bias component from the relation
(21.2.33)
(21.2.34)
(21.2.35)
1. Under the general mean square error criterion $E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)']$, in order for the subset of regressors to be preferred over the complete set of regressors, it is necessary that

(21.2.36)

2. Under the squared error loss criterion $E[(\hat{\beta} - \beta)'(\hat{\beta} - \beta)]$, in order for the subset of regressors to have lower risk than the complete set of regressors, it is necessary that

(21.2.37)
(21.2.38)
and
$\frac{(Rb)'[R(X'X)^{-1}R']^{-1}(Rb)}{K_2\hat{\sigma}^2} \le 1 \qquad (21.2.39)$
where the left-hand ratios of (21.2.38) and (21.2.39) are F random variables with $K_2$ and $(T - K)$ degrees of freedom. According to these criteria, the variables in the set $K_2$ would be deleted if, depending on the criterion, the F test statistic was less than or equal to $1/K_2$ or, the weaker requirement, less than or equal to 1. The commonly suggested rule of thumb of deleting variables whose t statistics are less than 1 in absolute value has a basis in (21.2.39), which requires the t statistics associated with the $K_2$ regressors to be less than 1 in magnitude.
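The F statistic behind these deletion rules is a ratio of residual sums of squares, as in (21.2.40); a minimal sketch (names are ours):

    import numpy as np

    def f_delete(y, X_full, K2):
        # F statistic of (21.2.40) for excluding the last K2 regressors
        T, K = X_full.shape
        X1 = X_full[:, : K - K2]
        sse = np.sum((y - X_full @ np.linalg.lstsq(X_full, y, rcond=None)[0]) ** 2)  # y'My
        sse1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)         # y'M1y
        return (sse1 / sse - 1.0) * (T - K) / K2

    # delete the K2 variables if F <= 1/K2 (general MSE criterion) or,
    # under the weaker requirement, F <= 1 (squared error loss criterion)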
By noting that

$F = \left(\frac{y'M_1y}{y'My} - 1\right)\frac{T - K}{K_2} \qquad (21.2.40)$
we may establish the relationship between the Theil adjusted $R^2$, the Amemiya PC, and the Mallows $C_p$ criteria. Following Amemiya (1980), in order to make bilateral comparisons between models containing $K$ and $K - K_2 = K_1$ variables in terms of the F random variable, we have

$F = \frac{T - K}{K_2}\left(\frac{1 - R_{K_1}^2}{1 - R_K^2}\right) - \frac{T - K}{K_2} < 1 \qquad (21.2.41)$
$K_2F_{(K_2,T-K)} = \frac{y'M_1y}{\hat{\sigma}^2} - (T - K) \qquad (21.2.42)$
Then
(21.2.43)
(21.2.45)
Therefore, $C_p \le K_1$ implies that $F_{(K_2,T-K)} \le 1$, and the hypothesis that $\beta_2 = 0$ is accepted when $F_{(K_2,T-K)} < 1$. Note that the critical value of $F_{(K_2,T-K)}$ is smaller than that implied by the traditional $\alpha = 0.05$ significance level.
3. For the prediction criterion (PC)
(21.2.46)
21.2.3f A Comment
Other selection criteria are, of course, possible, and an exhaustive list of these
rules is included in Hocking's (1976) review article. All are functions of the
equation error sum of squares and thus are related. Since all the procedures
lead in reality to a pretest estimator in one form or another, they all lead to
inadmissible estimators whose exact sampling properties are not known. Con-
sequently, inference relative to the estimators that result from any of these
search processes should be viewed with caution by the applied researcher.
In closing this section, perhaps we should mention two papers where the model selection problem is in a way regarded as part of the estimation procedure; that is, the properties of the estimators are derived. Geweke and Meese
(1981) consider the problem of estimation when the correct specification of the
statistical model is unknown but it is known to be one of an infinite sequence of
nested alternatives. Procedures for model selection are set forth that provide
choices that lead to the correct model with unit probability asymptotically.
Shibata (1981) considers the optimal selection of regression variables in a model
involving infinitely many parameters or where the number of variables in-
creases with sample size. His criterion is the same as that used by Breiman and
Freedman (1983) and his method is asymptotically equivalent to the Mallows
and Akaike methods.
In assessing the validity of the statistical model, one type of diagnosis often
employed is to use a hypothesis test that compares the null model with a
generalized alternative. Rejection of a hypothesis based on the test statistic(s)
suggests incompatibility of the data and the model and is taken as evidence of
variable misspecification. Within this testing framework the Wald, likelihood
ratio and Lagrange multiplier tests are traditionally used. These tests, which
were discussed in Chapter 5, have the nice property of being asymptotically
locally most powerful unbiased. They are designed to test a set of parametric
restrictions and are in contrast to the Hausman (1978) specification test, which
is, as Holly (1982) notes, designed to test the implications of an hypothesis in
terms of inconsistency.
In most cases hypothesis and specification testing are done with an eye
toward estimation. This means that the final estimation stage proceeds condi-
tionally on the hypothesis (statistical model) that has been selected. This pro-
cess, which makes the model and thus the estimation procedure dependent on
the test based on the data at hand, leads to what we referred to in Chapter 3 as
pretest estimators. This testing mechanism leads to estimators that are inad-
missible under a squared error loss measure. Little is known of the sampling
properties of many of these pretest estimators and even less is known of the
sampling properties of higher order sequential test procedures-that is, re-
peated tests that typically use the same set of data. Therefore, it would appear that such sequential testing procedures should be used with caution.
21.2.5 Information Criteria for Model Selection
$\mathrm{AIC} = -\frac{2}{T}\ln \ell(b_1\,|\,y) + \frac{2K_1}{T} \qquad (21.2.47)$
and under this criterion, (21.2.47) is to be minimized among all of the possible linear hypotheses $R\beta = 0$. For the statistical model (21.2.1), this criterion
reduces to
(21.2.48)
(21.2.49)
Since in reality $\sigma^2$ is not known, if $\sigma^2$ is replaced by an estimate the question becomes which estimate to use. If $\hat{\sigma}^2 = y'(I - X(X'X)^{-1}X')y/(T - K)$, then, as Amemiya notes, (21.2.49) is the same as the Mallows $C_p$ criterion defined in (21.2.35). Alternatively, if $\tilde{\sigma}^2 = y'M_1y/(T - K)$ is used, then (21.2.49) is equal to the Amemiya PC defined in (21.2.31). In these cases, of course, AIC suffers all the defects of $C_p$ and PC.
Chow (1981) has noted that since in general $\hat{\sigma}_1^2 > \sigma^2$, an adjustment should be made in Akaike's formula. If this is done, the Chow rule turns out to favor the small model more than Akaike's rule.
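Because (21.2.48) is garbled above, the following sketch uses the familiar reduced form of AIC for the normal linear model, $\ln(\mathrm{SSE}_1/T) + 2K_1/T$, which drops constants common to all models; its exact agreement with (21.2.48) is an assumption on our part.

    import numpy as np

    def aic(y, X1):
        # reduced-form AIC for a normal linear model with K1 regressors
        T, K1 = X1.shape
        sse1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)
        return np.log(sse1 / T) + 2.0 * K1 / T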
Leamer (1983) questions whether the estimation problem inherent in the information criterion implies a model selection problem. He observes that the fundamental reason why the information criterion does not imply anything especially different from maximum likelihood methods is that it uses the scoring rule underlying maximum likelihood estimation.
$\mathrm{BIC} = \ln y'M_1y + 2\left[(K_1 + 2)\frac{\omega^2}{\sigma^{*2}} - \left(\frac{\omega^2}{\sigma^{*2}}\right)^2\right] \qquad (21.2.50)$
and thus the pseudo-true model closely approximates the true model, then (21.2.50) becomes the AIC. However, if $\hat{\sigma}^2$ is used to estimate $\omega^2$ in BIC (21.2.50), then $\hat{\omega}^2/\hat{\sigma}^{*2} \le 1$ and, relative to AIC, a model with fewer explanatory variables and larger $y'M_1y$ is preferred to a competing model with a larger number of variables and smaller squared errors. In a two-model choice problem, the rule that emerges is to compute BIC for each model and choose the model with the minimum outcome, where $\omega^2$ is replaced by $\hat{\sigma}^2$ from the higher-dimensional model.
As Sawa (1978) shows, these criteria can be related to a decision rule based on the conventional F statistic, and when this happens for "normal" values of $K_1$, the implied $\alpha$ significance level varies for AIC from 30% to 16% as the degrees of freedom increase and, for BIC, from 10% to 16%. This stands in contrast to the conventional pretest level of 5% or 1%, which is relatively parsimonious toward the inclusion of additional variables, a procedure that has severe risk consequences when the hypotheses are wrong.
is largest. Thus the Schwarz criterion differs from the Akaike criterion in that the dimension of the model is multiplied by $\frac{1}{2}\ln T$. For the number of observations usually found with economic data, the Schwarz criterion favors a lower-dimensional model than the Akaike criterion. As $T$ grows, the difference between the two criteria grows. Schwarz notes that under his formulation of the problem, Akaike's criterion cannot be asymptotically optimal. Using a posterior probability criterion, Akaike (1978) attempted to derive the choice rule (21.2.52) to approximate the Schwarz rule. Chow (1981) rightly points out that there is no need to justify the information criterion in terms of the posterior probability criterion, since they are designed to answer different questions. Within this context let us turn to the Bayesian approach to statistical model choice.
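In the same reduced form as the AIC sketch earlier, the Schwarz criterion replaces the constant 2 in the dimension penalty with $\ln T$; again, this operational form is our assumption, not a quotation of Schwarz's equation.

    import numpy as np

    def schwarz(y, X1):
        # reduced-form Schwarz criterion: penalty ln(T) per parameter
        T, K1 = X1.shape
        sse1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)
        return np.log(sse1 / T) + K1 * np.log(T) / T

For any $T > e^2$ (about eight observations) the penalty exceeds AIC's, which is why the Schwarz criterion favors the lower-dimensional model.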
(21.2.54)
where $\pi_0/\pi_A$ is the prior odds ratio, $P(\beta_1)$ and $P(\beta_1, \beta_2)$ are prior distributions under $H_0$ and $H_A$, and the $\ell_i$ are likelihood functions under the two hypotheses. Under "reasonable" assumptions about the prior distributions, Zellner (1978) approximates the posterior odds ratio as
Thus, among other things, the posterior odds ratio depends on the prior odds ratio, the compatibility of the prior distributions and maximum likelihood estimates for $\beta_1$ and $\beta$, a ratio of the determinants of the two design matrices, and a ratio of the likelihoods, and therefore a goodness-of-fit consideration. Acting to minimize expected loss in a symmetric loss structure, we choose $H_0$ if $K_{0A} > 1$ and $H_A$ if $K_{0A} < 1$; in other words, we choose the hypothesis with the higher posterior probability. Note again that the posterior odds are equal to the prior odds times the ratio of the averaged likelihoods weighted by the prior probability density functions. In large samples the posterior odds are the conventional likelihood ratio multiplied by the prior odds. It is interesting to note that for the classical normal linear statistical model (21.2.1) and the corresponding hypotheses, if the priors are chosen to be natural conjugate (multivariate normal and inverted gamma), the posterior odds reflect the relative sizes of the error quadratic form for each hypothesis. If the priors are diffuse and the prior odds equal, the quadratic form comparisons dominate the posterior odds.
In order to compare the posterior odds ratio to the Akaike criterion, Zellner
(1978) rewrites (21.2.55) as
and this criterion reduces to the rule: choose $H_0$ if $-2\ln K_{0A} < 0$ and choose $H_A$ if $-2\ln K_{0A} > 0$.
Alternatively, for comparison purposes, Zellner (1978) writes the Akaike information criterion as
(21.2.57)
While somewhat similar, the two criteria obviously differ as to how prior information and prior odds are used, and they are equal only under some very strong conditions.
As a further basis for seeing how the various criteria are interrelated, let us note that for AIC (21.2.57), if $\hat{\sigma}^2$ replaces the unknown $\sigma^2$, the result is the Mallows $C_p$ criterion. Also, if $\tilde{\sigma}^2$ replaces the unknown $\sigma^2$ in (21.2.57), we have the Amemiya prediction criterion (PC). Therefore, a link between these criteria and the Jeffreys-Bayes posterior odds ratio is provided. It should also be clear that Bayesians evaluate inference procedures in terms of posterior expected losses, while sampling theorists are concerned with sampling properties or risk functions.
Continue to consider the problem of the investigator who suspects that the data generation process may be described by a high-dimensional parameter space $K$ but who has good reason to believe some of the variables are not necessary, since they have little or no effect on the outcome of $y$. Therefore, the conjecture is that the data process can be adequately modeled by a lower-dimensional parameter space $K_1$. We know from Chapter 3 and the preceding sections of this chapter that including extraneous variables in the design matrix or excluding important explanatory variables from the design matrix conditions the precision and bias with which the parameter vector $\beta$ is estimated. The maximum likelihood (least squares) rule for the complete model involving $K$ parameters and the restricted maximum likelihood (least squares) procedure, which excludes some of the extraneous and other variables and thus reduces the dimension of the parameter space to $K_1$, are end points on the bias-variance continuum. In this section we leave the traditional estimator and pretest estimation worlds and consider the bias-variance choice within a Stein-rule context, where the data are used to determine the compromise between bias and variance and the appropriate dimension or transformation of the parameter vector.
$c = \frac{\beta'X'X\beta}{\beta'X'X\beta + \sigma^2\operatorname{tr}[X(X'X)^{-1}X']} \qquad (21.2.60)$
depends upon the unknown parameter vector $\beta$ and the scalar $\sigma^2$. If, after making the proper substitutions, we replace the unknown $\beta$ and $\sigma^2$ in (21.2.60) with their unbiased estimators, then
(21.2.61)
This result looks suspiciously like the general form of the Stein-rule adjustment
for the maximum likelihood estimator that was discussed in Chapter 3 and
suggests that perhaps some member of the Stein-rule family might be of use in
the estimation-model selection problem. Use of the Stein-rule estimator in this
context is particularly appealing, since we know the sampling properties of this
estimator and know little about the sampling properties of the criteria or deci-
sion rules discussed in the previous sections of this chapter or for that matter
the estimator suggested in (21.2.61).
$\theta^* = \left(1 - \frac{c^*}{u}\right)(\hat{\theta} - \theta_0) + \theta_0 = \left[1 - \frac{c^*(T - K)\hat{\sigma}^2}{(\hat{\theta} - \theta_0)'(\hat{\theta} - \theta_0)}\right](\hat{\theta} - \theta_0) + \theta_0 \qquad (21.2.64)$

and, for the zero null vector $\theta_0 = 0$,

$\theta^*_{(\theta_0 = 0)} = \left(1 - \frac{c^*}{u}\right)\hat{\theta} \qquad (21.2.65)$
One important thing to note in (21.2.65) is that, except for the case when $u = c^*$, although the parameters are transformed, the dimension of the parameter space is maintained at $K$. In the case $u = (T - K)(K - 2)/[K(T - K + 2)] = c^*$, all elements of the $K$-dimensional parameter vector are set equal to $\theta_0 = 0$.
These results are interesting, since we now have an estimator (21.2.64) and
(21.2.65) that is a function of the maximum likelihood estimator but has risk
equal to or less than the maximum likelihood estimator over the whole parame-
ter space. If we use a Stein-rule that shrinks to a null vector when the extrane-
ous variables are included, under loss (21.2.63), the risk function is
$\frac{\rho(\theta^*_{(\theta_0=0)}, \theta)}{\sigma^2} = K - (K - 2)^2\,\frac{T - K}{T - K + 2}\,E\!\left[\frac{1}{\chi^2_{(K,\lambda)}}\right] \qquad (21.2.66)$
(21.2.67)
A variant of the Stein-rule that does exclude variables for a range of values of $u$, or in other words, values of the F statistic, is the Baranchik (1964) positive-part Stein-rule estimator. For the formulation resulting in (21.2.65), involving a zero null hypothesis for the complete parameter vector, the positive-rule estimator is
(21.2.68)
where $c_0 = (K - 2)(T - K)/[K(T - K + 2)]$ and $c_0 \le c^* \le 2c_0$, the parameter vector $\theta$ should be estimated by $\theta_0 = 0$. The
positive-part Stein rule produces an estimator that, under squared error loss, dominates the corresponding maximum likelihood and James and Stein estimators. In addition, for comparable levels of the test, the positive Stein-rule dominates the conventional pretest estimator.
The implied statistical level of significance for the F random variable, when using the positive Stein-rule, is given in Table 21.3 for various $K$ and $T - K$. As is evident from Table 21.3, the above criterion and the corresponding rules require critical values of the F statistic that result in higher significance levels $\alpha$ than those recommended by many of the informal criteria discussed in the previous sections. This means, on the average, that a larger subset of variables would be retained with the Stein-rule criterion than with, for example, the information, Amemiya, and Mallows-type criteria.
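As an illustration, here is a minimal sketch of a positive-part Stein-type rule shrinking the least squares vector toward the null vector; since (21.2.68) is not reproduced legibly above, the form follows (21.2.69) with the factor truncated at zero, and the choice $a = (K - 2)/(T - K + 2)$ is one standard value that we assume here.

    import numpy as np

    def positive_part_stein(y, X, a=None):
        # b+ = max(0, 1 - a (T - K) sigma2_hat / (b'X'Xb)) * b
        T, K = X.shape
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        sigma2_hat = np.sum((y - X @ b) ** 2) / (T - K)
        if a is None:
            a = (K - 2) / (T - K + 2)       # assumed choice of the shrinkage constant
        shrink = 1.0 - a * (T - K) * sigma2_hat / (b @ X.T @ X @ b)
        return max(shrink, 0.0) * b         # all coefficients set to zero when shrink < 0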
$\hat{\beta} = \left[1 - \frac{a(T - K)\hat{\sigma}^2}{b'X'Xb}\right]b \qquad (21.2.69)$

$\hat{\beta} = \left[1 - \frac{a(T - K)\hat{\sigma}^2}{(r - Rb)'[R(X'X)^{-1}R']^{-1}(r - Rb)}\right](b - b^*) + b^* \qquad (21.2.70)$
(21.2.71)
where $a^* = a(T - K)J^{-1}$, dominates the James and Stein estimator (21.2.70) if $0 \le a^* \le 2(T - K)\{\operatorname{tr}[R(X'X)^{-1}R']d_L^{-1} - 2\}/[J(T - K + 2)]$ and $\operatorname{tr}[R(X'X)^{-1}R'] \ge 2d_L$, or
(21.2.73)
As in the orthonormal case in the $\theta$ (prediction) space, where $\operatorname{tr}[R(X'X)^{-1}R']/d_L \ge 2$ implies that $K_2 \ge 2$, we are provided with a basis for choosing a model in the $K$-dimensional space. The rule is: (1) if $u_1 < a^*$, use $b^*$; (2) if $u_1 > a^*$, use the $K$-dimensional space where the estimator (21.2.71) becomes operational (i.e., the James and Stein estimator) and transform all the elements in the $K$-dimensional vector. The case when $\operatorname{tr}[R(X'X)^{-1}R']d_L^{-1} < 2$ will be discussed in Chapter 22. If we assume that the characteristic root condition (21.2.73) is fulfilled, the critical value of the $F_{(J,T-K)}$ test statistic is
and test whether the null can predict the performance of the alternative. Al-
though there are differences among researchers as to the appropriate way to
test separate families of hypotheses [see, e.g., Atkinson (1970) and Quandt
(1974)], the initial results on this subject are based on the work of Cox (1961,
1962). Pesaran (1974) extended the work of Cox to include the regression model
and considered the autocorrelated disturbance case, and Pesaran and Deaton
(1978) have extended the Pesaran results so that one does not have to assume
linearity of the models and competing systems of nonlinear equations can be
handled. Atkinson (1970) proposes a procedure that nests the nonnested
models into a general model. Dastoor (1983) notes an inequality between the
Cox and Atkinson statistics and along with Fisher and McAleer (1981) modifies
the Cox test for linear regressions. For model specification testing Davidson
and MacKinnon (1981) have proposed procedures that are linearized versions
of the Cox test and closely related to the nonnested hypothesis tests of Pesaran
and Deaton (1978). Pesaran (1982) has compared the local power of several of
the tests noted above. A bibliography of papers concerning separate families of
hypotheses is given by Pereira (1977b). A special issue of the Journal of Econo-
metrics edited by White (1983) contains a set of articles concerned with inter-
pretation, extension, and application of the Cox approach to nonnested hypoth-
esis testing.
In one interesting paper in White's special issue, Gourieroux, Monfort, and Trognon (1983) present a testing approach that is applicable to both nested and nonnested models. Their procedure rests on the notion of a Sawa-type pseudo-true value, and the statistic proposed is similar to the Wald statistics used in the case of nested models.
In the nonnested hypothesis case, we have the following scenario: a data
system and a set of alternative hypotheses that are, by assumption, nonnested
and thus unranked as to generality. Each of the models (hypotheses) Hi is
considered equally likely. Tests of each pair of hypotheses are made, and in
each case, a check is made to see if the performance of Hi against the data is
consistent with the truth of H j . One outcome, of course, is that both hypothe-
ses may be rejected.
Suppose for expository purposes that we have two hypotheses $H_1$ and $H_2$ about the random vector $y$. Under $H_1$ the random vector $y$ is assumed to be generated by a density function $f_1(y|X, \beta_1, \sigma_1^2)$, where $(\beta_1, \sigma_1^2) \in \Omega_1$, and under $H_2$ the random vector $y$ is assumed to be reflected by the density function $f_2(y|Z, \beta_2, \sigma_2^2)$, where $(\beta_2, \sigma_2^2) \in \Omega_2$. The union of $\Omega_1$ and $\Omega_2$ is assumed not equal to $\Omega_1$ or $\Omega_2$, and, under this framework, we do not consider approximating one hypothesis by another. Within the linear hypothesis framework for the linear statistical model, the above two nonnested hypotheses become
(21.3.1)
(21.3.2)
where $e_2 \sim N(0, \sigma_2^2I_T)$, and $\beta_1$ and $\beta_2$ are $K_1$- and $K_2$-dimensional parameter vectors. Thus attention is focused on specification tests for separate models, and following Pesaran (1982) we assume that transformations on $Z$ result in $Z = [Z_1 \;\; XA]$, where $Z_1$ cannot be obtained from $X$ by the imposition of linear parametric restrictions, and $Z_1$ is a $(T \times J)$ matrix, where $J$ is the number of nonoverlapping variables. In this context the general model that includes both hypotheses is
$c_{12} = \frac{T}{2}\ln\frac{\hat{\sigma}_{21}^2}{\hat{\sigma}_2^2} \qquad (21.3.4)$

where $\hat{\sigma}_{21}^2$ can be interpreted as the residual variance of $H_2$ predicted by $H_1$. In the expression, a caret denotes maximum likelihood estimators, and $\sigma_{21}^2$ denotes the probability limit of $\hat{\sigma}_2^2$ under hypothesis $H_1$.
As Cox has shown, if $H_1$ holds, $c_{12}$ will be asymptotically normally distributed with mean zero and variance $V(c_{12})$. At this point, it should be noted that the small-sample distribution of the test statistic $c_{12}$ depends on unknown parameters and thus cannot be derived. However, if $[I - Z(Z'Z)^{-1}Z']X = M_2X = 0$, then the models are nested, and for this case an exact test exists.
(21.3.5)
Next we consider a linearized version of the Cox test statistic that has been suggested by Davidson and MacKinnon (1981) and is very much related to Ramsey's (1974) tests. As one of the alternative tests proposed, Davidson and MacKinnon consider embedding the alternatives in a general model that uses a mixing parameter $\lambda$ in the combined statistical model

Under the hypothesis $H_2$, Davidson and MacKinnon replace the unknown $Z\beta_2$ with its estimate $Z\hat{\beta}_2 = (I - M_2)y$. Therefore, (21.3.6) may be rewritten as

$y = X\beta_1(1 - \lambda) + \lambda(Z\hat{\beta}_2) + e = X\beta_1(1 - \lambda) + \lambda\hat{y}_2 + e \qquad (21.3.7)$
(21.3.8)
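A minimal sketch of the J-type procedure built on (21.3.7): fit $H_2$, append its fitted values to the $H_1$ regressors, and examine the t statistic on the mixing coefficient $\lambda$ (all names are ours).

    import numpy as np

    def j_test(y, X, Z):
        # under H1, the coefficient on yhat2 (i.e., lambda) should be zero
        yhat2 = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # fitted values under H2
        W = np.column_stack([X, yhat2])                    # augmented regression (21.3.7)
        T, K = W.shape
        g = np.linalg.lstsq(W, y, rcond=None)[0]
        resid = y - W @ g
        sigma2 = resid @ resid / (T - K)
        cov = sigma2 * np.linalg.inv(W.T @ W)
        return g[-1] / np.sqrt(cov[-1, -1])                # approximately N(0,1) under H1

Reversing the roles of the two models gives the companion test of $H_2$ against $H_1$; as with the Cox statistic, both hypotheses may be rejected.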
Economic theory has much to offer the researcher in describing how economic
data are generated, how economic variables are classified, which economic
variables should be considered, and the direction of possible relationships be-
tween the variables. Unfortunately, economic theory, which treats most vari-
ables symmetrically, is of little help when one reaches the stage in quantitative
and
These specifications suggest that many alternative functional forms exist for
reflecting the relationship between the treatment variables and the outcome
variable y. Given this array of algebraic alternatives, the central econometric
problem is how best to use a sample of data so that we can discriminate
between the alternative possible functional forms. It is conventional to bundle
this choice problem within a hypothesis-testing framework. For example,
power series representations, such as the parabolic specification given in (21.4.1), can be represented in a nested model context, since the linear formulation $y = X_1\beta_1 + X_2\beta_2 + e$ is a subset of the more general quadratic model. Consequently, the statistical consequences of the choice of a functional form can be evaluated in an omitted variable context, that is, as the contrast between the least squares result under the restriction $\beta_3 = 0$ and the unrestricted model (21.4.1). Unfortunately, a test of the hypothesis that the model is linear in terms of $x_2$, versus the alternative that it may be represented by some polynomial in $x_2$, leads to a pretest estimator and thus has all the negative statistical consequences discussed in Chapter 3 and in the previous sections of this chapter. So the question is not whether we can represent our variables in terms of a polynomial or exponential function, but rather how we can use the data to learn something about the correct functional form. Here again, the positive Stein rule may be a useful alternative to conventional hypothesis testing and ad hoc rules.
In the area of consumer and production theories a variety of functional
forms, such as the translog, the generalized Leontief, and the generalized
Cobb-Douglas, have recently been introduced. Many of these are of the flexi-
ble type, in the sense that a priori they do not constrain the relevant elasticities
of substitution. These functions provide a local approximation to an arbitrary,
twice differentiable cost or direct or indirect utility or production function.
Building on this work Appelbaum (1979) uses the Box-Cox (1964) transforma-
tion function to demonstrate new generalized forms that can be polynomials,
quadratic in logarithms, or various mixed combinations of both. Gallant (1981)
goes outside of the Taylor series approximation traditionally used with flexible
functional forms and proposes a Fourier flexible form in an attempt to find an
indirect utility function whose derived expenditure system will adequately ap-
proximate those resulting from a broad class of utility functions. Guilkey, Knox
and Sickles (1983) consider the translog, the generalized Leontief and the gen-
eralized Cobb-Douglas and analyze the ability of estimating models derived
from these functional forms to track a known technology over a range of
observations. They conclude the translog form provides a dependable approxi-
mation to reality, provided reality is not too complex. They were not able to
turn up a flexible functional form more reliable than the translog. Although
these generalizations enrich the family of specification alternatives, the ques-
tion of how to choose among the functional forms remains.
One alternative in specifying and selecting a functional form is to use the Box
and Cox (1964) transformation, which includes linearity as a special case. For a
single treatment variable specification, we can, following Section 20.5 of Chapter 20, write the Box and Cox transformation as

$\frac{y_t^\lambda - 1}{\lambda} = \beta_1 + \beta_2\left(\frac{x_{2t}^\lambda - 1}{\lambda}\right) + e_t \qquad (21.4.4)$
(21.4.6)
cal models, one of which explains the level of a variable up to an additive error
and the other explains its logarithm up to an additive error. The variables that
appear in the design matrix may be linear or nonlinear. In their paper, they
develop a test based on the Cox (1961, 1962) procedure for testing separate
families of hypotheses; that is, they develop two test statistics that are, if
correct, asymptotically distributed as $N(0, 1)$ random variables. Their work is
therefore related to the nonnested models hypothesis procedures discussed in
Section 21.3. They conclude that the large-sample properties of the test are
closely approximated in small samples. Their results also imply that if the
investigator knows that one or the other model is correct, that is, in (21.4.4),
either $\lambda = 0$ or $\lambda = 1$, then the likelihood ratio criterion proposed by Sargan
(1964) is a useful basis for discriminating between the two models. However, in
case neither model is true, the Cox test has the possibility of rejecting both
models, which is the correct decision. Building on the work of Andrews (1971),
and Aneuryn-Evans and Deaton (1980), Godfrey and Wickens (1981) develop a
basis for testing the adequacy of the linear and log-linear forms against the
more general Box-Cox regression model. Their approach, which treats the
linear and log-linear forms as special cases of the general model, leads to a
Lagrange multiplier test that is easy to compute and that can as one outcome
reject both models. An informative discussion of testing linear and log-linear
models may be found in McAleer (1984).
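A sketch of how models of the (21.4.4) type can be compared by maximizing the Box-Cox log-likelihood over $\lambda$; the concentrated likelihood with a Jacobian term is the standard device, the transformation is applied here only to $y$ for simplicity (an assumption), and $y$ must be strictly positive.

    import numpy as np

    def box_cox(y, lam):
        # (y^lam - 1)/lam, with the logarithmic limit at lam = 0
        return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

    def box_cox_loglik(y, X, lam):
        # concentrated log-likelihood for y(lam) = X beta + e with normal errors
        T = len(y)
        z = box_cox(y, lam)
        resid = z - X @ np.linalg.lstsq(X, z, rcond=None)[0]
        sigma2 = resid @ resid / T
        # the second term is the Jacobian of the transformation of y
        return -(T / 2) * np.log(sigma2) + (lam - 1.0) * np.sum(np.log(y))

    # grid search: lam = 1 corresponds to the linear model, lam = 0 to the log-linear
    # lam_hat = max(np.linspace(-1.0, 2.0, 61), key=lambda l: box_cox_loglik(y, X, l))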
21.5 SUMMARY
Perhaps within both the nested and nonnested worlds, a requirement for introducing any new criterion or test should be that its sampling properties be demonstrated under alternative loss structures. This requirement would profoundly curtail the current and prospective criterion sets.
21.6 EXERCISES
where $e$ is a normal random vector with mean vector zero and variance $\sigma^2 = 0.0625$. Assume the $(20 \times 6)$ design matrix is
         x1     x2     x3      x4      x5      x6

          1   0.693  0.693   0.610   1.327  -2.947
          1   1.733  0.693  -1.714   2.413   0.788
          1   0.693  1.386   0.082   3.728  -0.813
          1   1.733  1.386  -0.776  -0.757   1.968
          1   0.693  1.792   1.182  -0.819   3.106
          1   2.340  0.693   3.681   2.013  -3.176
          1   1.733  1.792  -1.307   0.464  -2.407
          1   2.340  1.386   0.440  -2.493   0.136
          1   2.340  1.792   1.395  -1.637   3.427
    X =   1   0.693  0.693   0.281   0.504   0.687        (21.6.2)
          1   0.693  1.386  -1.929  -0.344  -2.609
          1   1.733  0.693   1.985  -0.212  -0.741
          1   1.733  1.386  -2.500  -0.875   0.264
          1   0.693  1.792   0.335   2.137   0.631
          1   2.340  0.693   0.464   1.550  -1.169
          1   1.733  1.792   4.110  -1.664   1.585
          1   2.340  1.386  -2.254  -0.285  -0.131
          1   2.340  1.792  -1.906   1.162   0.903
          1   1.733  1.386   0.125   0.399  -1.838
          1   0.693  0.693  -2.935  -2.600  -2.634
Using the statistical model described by (21.6.1) and (21.6.2) and the sampling process described in Chapter 2, use Monte Carlo procedures to generate 100 y samples of size 20.
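A sketch of the generation step, assuming X holds the (20 x 6) matrix of (21.6.2) and beta the coefficient vector of (21.6.1), which is not reproduced above; both are placeholders to be filled in by the reader.

    import numpy as np

    rng = np.random.default_rng(1234)
    T, n_samples, sigma = 20, 100, 0.25             # sigma^2 = 0.0625
    # X: the (20 x 6) design matrix of (21.6.2); beta: the vector given in (21.6.1)
    # Y = X @ beta[:, None] + sigma * rng.standard_normal((T, n_samples))
    # each column of Y is then one Monte Carlo sample of size 20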
Choose five samples of data and for each sample do the following:
Exercise 21.1
Select a final model using the R2 criterion.
Exercise 21.2
Select a final model using the Cp criterion.
Exercise 21.3
Select a final model using the PC criterion.
Exercise 21.4
Select a final model using the AIC criterion.
Exercise 21.5
Select a final model using the BIC criterion.
Exercise 21.6
Select a final model by the general Stein positive rule when the coefficients of $x_4$, $x_5$, and $x_6$ are hypothesized to be zero, that is, $R\beta = [0 \;\; I_3]\beta = 0$.
Exercise 21.7
Using the conventional and J tests for nonnested models, discriminate between the two models that involve $x_1$, $x_2$, and $x_3$, and $x_4$, $x_5$, and $x_6$.
Exercise 21.8
Compare and contrast your model choice results for both the nested and non-
nested cases.
Using the 100 samples of data previously generated involving the design matrix (21.6.2), an error variance of 0.0625, and the risk function $(\hat{\beta} - \beta)'X'X(\hat{\beta} - \beta)$, where $\hat{\beta}$ is the relevant estimator, develop empirical risk functions associated with the following variable selection criteria over the specification error parameter space $\beta'\beta$ and discuss:
Exercise 21.9
The Cp criterion.
Exercise 21.10
The PC criterion.
Exercise 21.11
The AIC criterion.
Exercise 21.12
The positive Stein-rule criterion under the hypothesis $R\beta = 0$.
Exercise 21.13
The traditional pretest estimator under the hypothesis $R\beta = 0$.
21.7 REFERENCES
Multicollinearity
22.1 INTRODUCTION
Once detected, the best and obvious solution to the problem is to obtain and
incorporate more information. This additional information may be reflected in
the form of new data, a priori restrictions based on theoretical relations, prior
statistical information in the form of previous statistical estimates of some of
the coefficients and/or subjective information. Unfortunately, the possibilities
for "solving" the problems caused by multicollinearity by these procedures are
all too limited.
For the researcher unable to obtain more information, procedures have been
developed that offer hope of culling more information from the sample and
producing correspondingly more precise parameter estimates. These proce-
dures include, for example, principal components regression, ridge regression,
and various variable selection procedures (Chapter 21). Unfortunately, if one
considers the sampling properties of the resulting estimators, these procedures
are in general unsatisfactory. Nontraditional estimators, which have been re-
cently developed, offer point estimates of parameters superior to those pro-
vided by traditional procedures under a variety of loss functions. Although
these estimators do not provide solutions to all statistical problems, as their
sampling distributions are as yet unknown, they do in many cases provide
important risk gains. The magnitude of those gains may not be substantial,
however, when data are collinear.
In this chapter the dimensions of the multicollinearity problem are discussed,
and procedures that may be used for its identification and mitigation are sum-
marized and evaluated. A visual outline of the chapter is given in Table 22.1.
The statistical consequences of multicollinearity are specified in Section 22.2.
Methods that have been proposed for detecting the presence and nature of
multicollinearity are reviewed in Section 22.3 and possible methods for mitigat-
ing the effects of collinearity are discussed in Sections 22.4 to 22.6. In Section
22.7 we consider multicollinearity in the context of the stochastic regressor
model. All other sections assume that the design matrix is nonstochastic and
[Table 22.1: Outline of the chapter. Introduction (Section 22.1); statistical consequences of multicollinearity (Section 22.2); detection of multicollinearity (Section 22.3); solutions (Sections 22.4 to 22.6); the multicollinearity problem when regressors are stochastic (Section 22.7)]
the results are thus conditional. Finally, a summary and concluding comments are provided in Section 22.8.
$y = X\beta + e \qquad (22.2.1)$
(22.2.2)
holds, where the $c_i$ are constants not all equal to zero and $x_i$ is the $i$th column of $X$. Thus the observation matrix $[y \;\; X]$ obeys a second linear relation besides (22.2.1), the one we are interested in. The presence of linear relations like (22.2.2) implies that certain real linear parametric functions, such as
where $\Lambda$ is a diagonal matrix whose elements are $\lambda_1, \ldots, \lambda_K$, the characteristic roots of $X'X$. If $X$ has rank $K - J$, then $X'X$ has $J$ characteristic roots that are zero, and $J$ columns of $XP$ are zero; say the last $J$. It follows that the last $J$ components of $\theta$ are simply eliminated from the model. Therefore, $\theta_{K-J+1}, \theta_{K-J+2}, \ldots, \theta_K$ cannot be estimated from the available observations $X$. On the other hand, $\theta_1, \ldots, \theta_{K-J}$, or any linear combination of these, can be estimated. Therefore, we can estimate $w'\beta$ if and only if $w'\beta$ transforms into a linear combination of $\theta_1, \ldots, \theta_{K-J}$. Since

we can estimate $w'\beta$ if and only if the last $J$ components of $P'w$ are zero. This result is of practical importance, since there are numerous computer programs that can be used to obtain $P$ and $\Lambda$.
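The computation amounts to a single eigendecomposition; a sketch of the estimability check just described (numpy; the function name and tolerance are ours):

    import numpy as np

    def estimable(w, X, tol=1e-10):
        # w'beta is estimable iff the components of P'w corresponding to
        # (near-)zero characteristic roots of X'X all vanish
        lam, P = np.linalg.eigh(X.T @ X)    # roots in ascending order, P orthonormal
        c = P.T @ w
        near_zero = lam < tol * lam.max()
        return bool(np.all(np.abs(c[near_zero]) < tol))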
This analysis not only allows us to determine which functions $w'\beta$ are estimable, but also which of these functions can be estimated relatively precisely and which only imprecisely. If $p$ is a normalized characteristic vector of $X'X$ corresponding to a nonzero characteristic root $\lambda$, then $p'\beta$ is estimable, and its minimum variance linear unbiased estimator is $p'b^*$, where $b^*$ is any solution of the normal equations $X'Xb^* = X'y$. The variance of $p'b^*$ is given by $\sigma^2/\lambda$, since
(22.2.7)
where the $k_i$ are constants and the $p_i$ are characteristic vectors corresponding to the nonzero roots $\lambda_1, \ldots, \lambda_{K-J}$, the variance of the estimator $w'b^*$ is

$\operatorname{var}(w'b^*) = \sigma^2\left(\frac{k_1^2}{\lambda_1} + \frac{k_2^2}{\lambda_2} + \cdots + \frac{k_{K-J}^2}{\lambda_{K-J}}\right) \qquad (22.2.8)$
This expression summarizes what we know about the precision with which we may estimate an estimable function. Consequently, the precision depends on the error variance $\sigma^2$, the magnitudes of the constants $k_i$, and the magnitudes of the nonzero characteristic roots $\lambda_i$. As noted by Silvey (1969, p. 542), the linear combination $w'\beta$, if estimable, will be relatively less precisely estimated if $w$, as written in (22.2.7), has large weights $k_i$ attached to characteristic vectors that correspond to the small characteristic roots; the smaller those roots, the larger the estimator's variance will be. This approach may, of course, be extended to the case where there are no exact linear combinations among the columns of $X$. Then all functions of the form $w'\beta = (k_1p_1 + \cdots + k_Kp_K)'\beta$ are estimable, the minimum variance unbiased estimator is $w'b$, where $b = (X'X)^{-1}X'y$ is the least squares estimator, and the corresponding variance is

$\operatorname{var}(w'b) = \sigma^2\sum_{i=1}^{K}\frac{k_i^2}{\lambda_i} \qquad (22.2.9)$
The only question here is about the variance of the estimator, and the same conclusions that were made about (22.2.8) hold. Expression (22.2.9) is also useful in explaining why multicollinearity does not necessarily produce a model that is a poor predictor. The predictor for the conditional mean of the dependent variable, corresponding to the set of independent variables in the $t$th row of $X$, is $x_t'b$, where $x_t'$ is the $t$th row of $X$. The variance of the predictor, by the next to last expression in (22.2.9), is
$\operatorname{var}(x_t'b) = \sigma^2x_t'\left[\sum_{i=1}^{K}\lambda_i^{-1}p_ip_i'\right]x_t \qquad (22.2.10)$
If, for instance, $\lambda_K$ is small, then $x_t'p_K$ will be small and the effect of the small $\lambda_K$ will be canceled out. Thus predictors will not necessarily be imprecise because of multicollinearity, as long as the values of the independent variables for which a prediction is desired obey the same near exact restrictions as the $X$ matrix.
$X'X = \sum_{i=1}^{K}\lambda_ip_ip_i'$

[Table: decomposition of $\operatorname{var}(b_1), \operatorname{var}(b_2), \ldots, \operatorname{var}(b_K)$ by characteristic root]
Given that poor sample design is a basic problem, the most obvious solution is to obtain additional sample data. Silvey (1969) has considered the problem of what values of the independent variables are optimal, in some sense, if one new observation is to be taken. Suppose, in the model $y = X\beta + e$, that there is one exact linear dependency. Then there is one vector $p$ such that $X'Xp = 0$, which means that $p'\beta$ is not estimable, nor is any linear function $w'\beta$ for which $w$ has a nonzero component in the direction of $p$. Since $X'Xp = 0$ if and only if $Xp = 0$, the multicollinearity problem exists if every row of $X$ is orthogonal to $p$. Thus, to get rid of the multicollinearity, we must choose a new observation that is not orthogonal to $p$. An obvious choice is to choose an observation in the direction of $p$ itself, say $x_{T+1} = lp$, where $l$ is a scalar.
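A sketch of Silvey's prescription: the candidate row for the next observation is proportional to the characteristic vector associated with the smallest root of $X'X$ (the function name and scale argument are ours).

    import numpy as np

    def next_observation(X, scale=1.0):
        # direction in which estimation is currently worst
        lam, P = np.linalg.eigh(X.T @ X)    # eigh returns roots in ascending order
        return scale * P[:, 0]              # candidate row x_{T+1} = l p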
A similar analysis holds true if $X'X$ has no zero roots, but some that are small. Suppose that $X'X$ has one small root $\lambda$ corresponding to the characteristic vector $p$, which defines the direction in which estimation is imprecise. Then we take an additional observation $y_{T+1}$ at values $x_{T+1} = lp$. The model for the complete set of observations is
or

$y_* = X_*\beta + e_*$

Then, writing $A = X'X$ and $z = x_{T+1}$, the gain in precision from the additional observation is

$\operatorname{var}(w'b_T) - \operatorname{var}(w'b_{T+1}) = \frac{\sigma^2w'A^{-1}zz'A^{-1}w}{1 + z'A^{-1}z}$
The effects of imposing exact linear constraints upon the parameters of a linear
regression model have been discussed in Chapter 3, Section 3.2.1. The restric-
tions imposed presumably describe some physical constraint on the variables
involved and are the product of a theory relating the variables. Alternatively,
we may simply observe relations like $Xp \approx 0$ and use those sample-specific
results to place restrictions on the parameter space. This procedure will be
discussed in Section 22.5.1 on principal components regression.
One effect of using exact parameter restrictions is to reduce the sampling
variability of the estimators, a desirable end given multicollinear data. The
imposition of binding constraints, even if incorrect, may reduce the mean
square error of the estimator although incorrect restrictions produce biased
parameter estimators.
Exact linear restrictions may be employed in case of both extreme and
near-extreme multicollinearity. Exact restrictions "work" by reducing the di-
mensionality of the parameter space, one dimension for each independent lin-
ear constraint. Thus, by adding K-rank(X) restrictions in the extreme multicol-
linearity situation, all parametric functions $w'\beta$ become estimable. One still
incurs bias, however, if the restrictions are incorrect.
When multicollinearity is present researchers often try a great number of
model specifications until they obtain one that "works." The variable selection
procedures described in Chapter 21 represent ways of evaluating exact exclu-
sion restrictions. As noted in Chapter 21, these sequential procedures produce
estimators whose sampling properties are largely unknown. This is, of course,
$r = \beta + v$
The ridge estimator of Section 22.5 results when $r = 0$. This may be viewed as taking the conservative position that an explanatory variable is assumed to have no effect on the dependent variable in the absence of contrary information. The appropriateness of this rationalization is questionable when all coefficients are shrunk toward zero. The parameter $k$ reflects the confidence with which this stochastic prior information is held. As $k \to 0$, the prior information becomes less important relative to the sample information, and as $k \to \infty$ the restrictions become more strongly held. Thus one approach to selecting $k$ would be either to test the compatibility of the stochastic prior information implied by a specific $k$ using Theil's (1963) compatibility statistic, or to transform the stochastic restrictions into nonstochastic ones via the transformation in
Judge, Yancey, and Bock (1973) and apply the MSE tests of Toro-Vizcarrondo
and Wallace (1968) or Wallace (1972) as suggested by Fomby and Johnson
(1977). Either of these procedures, of course, produces pretest estimators
whose properties are not in general favorable. Furthermore, if the pretest esti-
mator based on exact restrictions is adopted, it is known that there is a superior
estimator, namely, the modified positive part estimator proposed by Sclove,
Morris, and Radhakrishnan (1972). See Chapter 3, Section 3.4.4, for a discus-
sion of this estimator.
Linear inequality restrictions are discussed in Chapter 3, Section 3.2.3. Their use can improve the precision of estimation when near-extreme multicollinearity is present, but it cannot, in the extreme multicollinearity situation, make it possible to estimate a function that otherwise cannot be estimated, because inexact restrictions may not affect the dimensionality of the parameter space. Presumably, however, inequality constraints on parameters involved in an estimable function
improve the precision of estimation of those parametric functions. The conse-
quences of incorrect inequality constraints and pretesting are discussed in
Chapter 3, Section 3.2.3.
which is the usual ridge estimator, where $k = \sigma^2/\sigma_\beta^2$. Given that $k$ represents a
ratio of variances, they go on to suggest an iterative Bayes approach to the
choice of k.
Greenberg (1975); Hill, Fomby, and Johnson (1977); Johnson, Reimer, and
Rothrock (1973); Lott (1973); McCallum (1970); and Massey (1975). Principal
components regression is a method of inspecting the sample data or design
matrix for directions of variability and using this information to reduce the
dimensionality of the estimation problem. The reduction in dimensionality is
achieved by imposing exact linear constraints that are sample specific but have
certain maximum variance properties [see Greenberg (1975) and Fomby, Hill,
and Johnson (1978)] that make their use attractive.
Let the model under consideration be
$y = X\beta + e \qquad (22.5.1)$
(22.5.3)
where $X[P_1 \;\; P_2] = [Z_1 \;\; Z_2]$. Properties of $\hat{\theta}_1 = (Z_1'Z_1)^{-1}Z_1'y$, the LS estimator of $\theta_1$ with $Z_2$ omitted from Equation 22.5.3, are easily obtained. Specifically, $\hat{\theta}_1$ is unbiased, due to the orthogonality of $Z_1$ and $Z_2$, and has covariance matrix $\sigma^2(Z_1'Z_1)^{-1}$. The principal components estimator is obtained by an inverse linear transformation. Since $\beta = P\theta = P_1\theta_1 + P_2\theta_2$, omitting the components in $Z_2$ means that $\theta_2$ has implicitly been set equal to zero. Hence $P_2\theta_2 = 0$ and the principal components estimator of $\beta$ is
where $\hat{\theta}^* = (\hat{\theta}_1', 0')'$, with $0$ a null vector of conformable dimension. The properties of $\beta^*$ follow straightforwardly from its equivalence to the restricted least squares estimator obtained by estimating (22.5.1) subject to $P_2'\beta = 0$. Thus the principal components estimator, like all restricted least squares estimators, is known to have smaller sampling variance than the least squares estimator $b$, but is biased unless the restrictions $P_2'\beta = 0$ are true.
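A sketch of the principal components estimator just described, deleting the $J$ components associated with the smallest characteristic roots ($J$ is the user's choice; the names are ours).

    import numpy as np

    def pc_estimator(y, X, J):
        # deleting the last J components implicitly imposes P2' beta = 0
        lam, P = np.linalg.eigh(X.T @ X)    # roots in ascending order
        P1 = P[:, J:]                       # directions with the K - J largest roots
        Z1 = X @ P1
        theta1 = np.linalg.lstsq(Z1, y, rcond=None)[0]
        return P1 @ theta1                  # beta* = P1 theta1, omitted part set to zero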
Identifying the restricted least squares estimator equivalent of the principal
components estimator has several advantages. First, one is reminded that data
reduction techniques impose restrictions on and sometimes reduce the dimen-
sion of the parameter space. Second, explicit recognition of the restrictions
permits evaluation of their theoretical implications. However, as argued in
Section 22.3, it is very difficult to interpret the restrictions when more than one
near exact dependency is present.
Judge and Bock (1978, Chapter 10, Section 3). Under certain conditions these
estimators are known to dominate the maximum likelihood estimator under
squared error loss. More will be said about this possibility in Section 22.6.
For an application illustrating this case, see Fomby and Hill (1978b).
Since the publication of the original papers on ridge regression by Hoerl and
Kennard (1970a, 1970b), there have been a large number of papers written on
the subject. Because of the difficulty of obtaining analytical results for this type
of estimator, Monte Carlo experiments have been carried out to investigate the
properties of the resulting estimators. Hoerl and Kennard (1979) offer a survey
article with an annotated bibliography. Vinod (1978), Vinod and Ullah (1981),
Draper and Van Nostrand (1979) and Judge and Bock (1983) also survey the
ridge regression literature.
Ridge regression was originally suggested as a procedure for investigating the
sensitivity of least squares estimates based on data exhibiting near-extreme
multicollinearity, where small perturbations in the data may produce large
changes in the magnitudes of the estimated coefficients. Also discussed, how-
ever, is the hope that the ridge regression estimator is an "improved estima-
tor," in the sense that it may have smaller risk than the conventional least squares estimator.
In this section we discuss the ridge estimators and their properties. The
literature devoted to the selection of a member from the ridge family for use is
reviewed, and recent results on the minimax properties of ridge regression are presented. These studies suggest that the conditions on ridge regression that
guarantee its mean square error improvement over least squares may not be
fulfilled when the data are extremely, or even moderately, ill-conditioned. In
fact, there may be a basic conflict between estimators that are minimax and
those that exhibit stability in the face of ill-conditioned data. These aspects of
the problem will also be investigated.
where P is the matrix whose columns are the orthonormal characteristic vec-
tors of X' X and D is a diagonal matrix of constants d i > O. If the constants di are
all equal and take the value di = k, the generalized ridge estimator reduces to
the ordinary ridge estimator
Note that the LS estimator is a member of this family with $k = 0$, and the principal components estimator of Section 22.5.1 can be viewed as (22.5.4) with $d_1 = \cdots = d_{K-J} = 0$ and $d_{K-J+1} = \cdots = d_K = \infty$.
One property of the least squares estimator b that is frequently noted in the
ridge regression literature is
$E(b'b) = \beta'\beta + \sigma^2\operatorname{tr}(X'X)^{-1} > \beta'\beta + \frac{\sigma^2}{\lambda_K}$
where $\lambda_K$ is the minimum characteristic root of $X'X$. Thus, where the data are ill-conditioned and $\lambda_K$ is small, the expected squared length of the least squares coefficient vector is greater than the squared length of the true coefficient vector, and the smaller $\lambda_K$, the greater the difference. Brook and Moore (1980) give a valid evaluation of the usual statement that the least squares estimator is longer than the true parameter vector. Hoerl (1962, 1964) suggested that to control the coefficient inflation and general instability associated with the use of the least squares estimator, one could use the ordinary ridge estimator $b^*(k)$. The properties of $b^*(k)$ include:
1. $b^*(k)$ minimizes the sum of squared residuals on the sphere centered at the origin whose radius is the length of $b^*(k)$. That is, for a given sum of squared residuals, it is the coefficient vector with minimum length.
2. The sum of squared residuals is an increasing function of $k$.
3. $b^*(k)'b^*(k) < b'b$, and $b^*(k)'b^*(k) \to 0$ as $k \to \infty$.
4. The ratio of the largest characteristic root of the matrix $(X'X + kI)$ to the smallest root is $(\lambda_1 + k)/(\lambda_K + k)$, where $\lambda_1 > \lambda_2 > \cdots > \lambda_K$ are the ordered roots of $X'X$, and is a decreasing function of $k$. The square root of this ratio is called the condition number of $X$. See Belsley et al. (1980, pp. 100-104). The condition number is often taken as a measure of data ill-conditioning and is related to the stability of the coefficient estimates.
5. The mean square error of the ridge estimator is given by

(22.5.6)

where the first term on the RHS of (22.5.6) is the sum of the variances of the estimators and the second term is the total squared bias introduced by using $b^*(k)$ rather than $b$. The total variance is a monotonically decreasing function of $k$, and the squared bias is a monotonically increasing function of $k$.
7. $\lim_{\beta'\beta\to\infty}\mathrm{MSE}[b^*(k)] = \infty$; thus for fixed $k$, the estimator $b^*(k)$ is not minimax and in fact has unbounded risk.
8. There always exists a $k > 0$ such that $b^*(k)$ has a smaller MSE than $b$ [Hoerl and Kennard (1970a)].
Hoerl and Kennard (1970a) show that under squared error loss, a sufficient condition for property 8 is that $k < \sigma^2/\theta_{\max}^2$, where $\theta_{\max}$ is the largest element of the vector $\theta = P'\beta$ ($P$ is a matrix of orthonormal characteristic vectors of $X'X$). Theobald (1974) has generalized this result to weighted squared error loss and has shown that a sufficient condition is $k < 2\sigma^2/\beta'\beta$. Property 8 is, of course, the property of the ridge family that has provided hope for some that ridge regression can improve upon the least squares estimator.
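For reference, the ordinary ridge computation itself is a one-line modification of least squares; the $k$ supplied is the user's choice, and nothing in this sketch resolves the selection problem discussed next.

    import numpy as np

    def ridge(y, X, k):
        # ordinary ridge estimator b*(k) = (X'X + k I)^{-1} X'y
        K = X.shape[1]
        return np.linalg.solve(X.T @ X + k * np.eye(K), X.T @ y)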
However, $b^*(k)$ improves upon $b$ only for a limited range of the parameter space, and the region of improvement depends upon the unknown parameters $\beta$ and $\sigma^2$. Consequently, as a practical matter, property 8 is useless. The hope has been, however, that one could use the data to help select a $k$ value that will improve upon least squares.
The Ridge Trace  The ridge trace is a two-dimensional plot of values of the ridge coefficient estimates $b^*(k)$ and the residual sum of squares for a number of
values of k. One curve or trace is made for each coefficient. Hoerl and Kennard
(1970b) suggest selecting a value of k on the basis of criteria like stability of the
estimated coefficients as k increases, reasonable signs, coefficient magnitudes,
sum of squared residuals and the maximum variance inflation factor, that is, the
maximum diagonal element of (X' X)-I, that Marquardt and Snee (1975) claim
to be the best single measure of data conditioning. While Marquardt and Snee
say that in practice the value of k is not hard to select using the ridge trace,
others, noted below, have expressed greater concern.
Brown and Beattie (1975, p. 27) specify the following decision rule that attempts to formalize use of the ridge trace: "Select a value of k at that point where the last ridge estimate attains its maximum absolute magnitude after having attained its 'ultimate' sign, where 'ultimate' sign is defined as being the sign at, say, k = 0.9."
Vinod (1976) offers a rescaled horizontal axis, the m-scale, which compresses the $[0, \infty)$ range of $k$ to the interval $[0, K]$ for $m$, where $m$ is defined by $m = K - \sum_i \lambda_i/(\lambda_i + k)$, and the regressors are assumed standardized. He also attempts to quantify the concept of coefficient stability using an index of stability of relative magnitudes (ISRM), which is zero for orthogonal data. Vinod suggests selecting a value of $m$ for which ISRM is smallest, which also defines a nonstochastic value of $k$, since ISRM depends only on $X$. Also see Vinod and Ullah (1981, pp. 180-183) and Friedmann (1982) for interpretations of ISRM.
Others such as Conniffe and Stone (1973) have criticized the use of the ridge
trace. To make a long academic dialogue short, however, we can summarize
the usefulness of the ridge trace, scaled or unscaled, by simply saying that it is
an art form. There may be those who can learn from the data using the ridge
trace, but determination of k by visual inspection leads to estimates whose
properties cannot be determined. In particular, we have no guarantee of the
superiority of these estimators to the least squares estimator in terms of MSE.
The rules that have been proposed to quantify concepts associated with the
ridge trace are also analytically void. The ISRM criterion, although nonsto-
chastic, does not guarantee that k falls in the region where the corresponding
ridge estimator would have lower MSE than the LS estimator. Any nonsto-
chastic k produces an estimator that is admissible; but although inadmissibility
is bad, admissibility by itself is not necessarily a valuable property. Thus the
use of the ridge trace as a way of obtaining improved estimators may be re-
jected. It can illustrate the sensitivity of coefficients to small data perturba-
tions, but that is all.
The HKB Estimator Hoerl, Kennard, and Baldwin (1975) suggest the ridge estimator b*(k_HKB) with

$$k_{HKB} = \frac{K\hat\sigma^2}{b'b}$$

where k_HKB is the sample analogue of Kσ²/θ'θ, which is the harmonic mean of
the optimal generalized ridge elements d_i = σ²/θ_i². Hoerl and Kennard (1976)
consider iterative estimation of the biasing parameter. Hoerl, Kennard, and
Baldwin (1975) and Lawless and Wang (1976) produced Monte Carlo results
that show the ridge estimator using kHKB superior to least squares.
The Lawless-Wang Estimator Lawless and Wang (1976) suggest use of the ridge
estimator b*(k_LW), where

$$k_{LW} = \frac{K\hat\sigma^2}{b'X'Xb}$$

They carry out Monte Carlo experiments that support the conclusion that b*(k_LW)
performs substantially better than the HKB or LS rules against squared error
loss or squared error of prediction loss.
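A minimal sketch of these two data-based choices of k, using the formulas given above (X and y are assumed given; the function names are illustrative):

```python
# Compute the HKB and Lawless-Wang ridge estimates:
# k_HKB = K*s2/(b'b), k_LW = K*s2/(b'X'Xb), plugged into
# b*(k) = (X'X + kI)^{-1} X'y.
import numpy as np

def ridge(X, y, k):
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(K), X.T @ y)

def hkb_and_lw_estimates(X, y):
    T, K = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]      # least squares estimate
    s2 = np.sum((y - X @ b) ** 2) / (T - K)       # unbiased sigma^2 estimate
    k_hkb = K * s2 / (b @ b)
    k_lw = K * s2 / (b @ X.T @ X @ b)
    return ridge(X, y, k_hkb), ridge(X, y, k_lw)
```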
which means that the adaptive ridge estimator has the same expected squared
length as the true parameter vector. They carry out a Monte Carlo experiment
and conclude that none of the ridge-type estimators they consider is better than
LS in all cases.
Another approach is to minimize estimated expected squared error loss and expected squared error of prediction loss. Although the sampling properties of the resulting estimator are unknown, the authors
carry out Monte Carlo experiments, which suggest that this estimator is more
conservative in departing from least squares than other adaptive procedures.
They also report results when the estimated expected loss functions are mini-
mized subject to general inequality restrictions. The reader is referred to their
paper for the results of those numerical experiments. For a more detailed
survey of these and other adaptive generalized ridge rules, see Vinod and Ullah
(1981, Chapter 8). Our conclusions about these adaptive generalized ridge rules
are that they do not guarantee risk improvement over LS for all parameter
values and thus their use is based on unspecified nonsample information.
As noted above, however, there is an increasing amount of research on
determining analytic expressions for the risk of shrinkage estimators, like the
adaptive ordinary and generalized ridge estimators, whose shrinkage is de-
termined by stochastic factors. For adaptive generalized ridge rules, examples
are Vinod (1978), Vinod, Ullah, and Kadiyala (1981), Ullah and Vinod (1979),
Casella (1977), and Strawderman (1978). We will consider Strawderman's
results in more detail.
Hoerl and Kennard (1970a) and others have studied generalized ridge rules of
the form [X'X + kB]⁻¹X'y, where B is a symmetric, positive definite matrix.
This rule can be re-expressed as [(X'X)(I + k(X'X)⁻¹B)]⁻¹X'y or [I + kC]⁻¹b,
where C = (X'X)⁻¹B. Strawderman (1978) considers rules of this form and
explicitly allows the factor k to be a function of the data, that is, k(y). He then
writes the ridge rule as

$$\delta(b,s) = [I + k(y)C]^{-1}b$$
His goal is to produce estimators within this class that are minimax and exhibit
some of the stability properties of traditional ridge rules.
Strawderman's procedure is to consider the problem in reduced form-
namely, that of estimating the mean vector of a multivariate normal random
vector. By doing so, the techniques of proof developed by those who have
considered the problem of obtaining minimax estimators for the multivariate
normal mean become applicable. Strawderman's results relating to ridge rules
can be summarized as follows: Let y = Xβ + e, where all the usual assumptions
of the classical, normal linear regression model hold. If loss is measured
by

$$L(\hat\beta,\beta) = (\hat\beta - \beta)'Q(\hat\beta - \beta)/\sigma^2$$

then the rule

$$\delta(b,s) = \left[I + \frac{asQ^{-1}X'X}{b'X'Xb + gs + h}\right]^{-1} b \qquad (22.5.7)$$

is minimax, where

$$s = (y - Xb)'(y - Xb), \qquad 0 \le a \le \frac{2(K-2)}{(T-K+2)}\,\frac{1}{\lambda_{\max}[Q^{-1}X'X]}$$

h > 0 and g > 2K/(T − K + 2). Also, λ_max[·] is the largest characteristic root of
the matrix argument. This corresponds to the general form with k(y) =
as/(b'X'Xb + gs + h) and C = Q⁻¹X'X.
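A minimal sketch of the Strawderman rule (22.5.7), with the shrinkage constant a set at its upper minimax bound; the choices of Q and h are assumptions supplied by the user, subject to the stated conditions:

```python
# Strawderman adaptive generalized ridge rule (22.5.7):
# delta(b,s) = [I + a*s*Q^{-1}X'X/(b'X'Xb + g*s + h)]^{-1} b
import numpy as np

def strawderman(X, y, Q, h=1.0):
    T, K = X.shape
    XtX = X.T @ X
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    s = np.sum((y - X @ b) ** 2)
    g = 2 * K / (T - K + 2) + 1e-6                # g > 2K/(T-K+2)
    W = np.linalg.solve(Q, XtX)                   # C = Q^{-1} X'X
    lam_max = np.max(np.real(np.linalg.eigvals(W)))
    a = 2 * (K - 2) / ((T - K + 2) * lam_max)     # upper minimax bound on a
    k = a * s / (b @ XtX @ b + g * s + h)         # data-based k(y)
    return np.linalg.solve(np.eye(K) + k * W, b)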
It is interesting that the usual ridge rule

$$\delta(b,s) = [X'X + k(y)I]^{-1}X'y$$

is minimax when the matrix of the loss function is Q = (X'X)² rather than the
more usual cases Q = I (squared error of estimation loss) or Q = X'X (squared
error of prediction loss). The consequences of this fact are quite important if we
relate it back to the multicollinearity problem. Recall that the motivation for
ridge regression, or in fact any sort of biased estimation procedure, is to gain
improved information about the unknown parameters of the linear regression
model by using an estimator that has smaller risk, over the range of the parame-
ter space, than the conventional least squares estimator. This consideration is
especially important, of course, when a lack of precision due to poor implicit
sample design is a problem. The class of shrink estimators, those that improve
on least squares by shrinking the least square estimators toward some specified
point in the parameter space, includes both the ridge estimators and the general
Stein-like minimax estimators to be considered in Section 22.6. One important
question that remains is: "How does the choice of Q affect the shrink estima-
tors?" Furthermore, given the previously mentioned instability ofleast squares
estimates based on multicollinear data, is it possible to obtain shrink estimators
that are more stable than least squares estimates?
To investigate the questions raised above, consider the model in its canonical
form

$$y = X\beta + e = XPP'\beta + e = Z\theta + e$$

where Z = XP and θ = P'β. In this reduced form the Strawderman estimator (22.5.7) becomes

$$\delta(b,s) = P\,\delta^*(\hat\theta,s) = P\left[I + \frac{asP'Q^{-1}P\Lambda}{\hat\theta'\Lambda\hat\theta + gs + h}\right]^{-1}\hat\theta$$
where θ̂ = (Z'Z)⁻¹Z'y = Λ⁻¹Z'y and for minimaxity g > 2K/(T − K + 2), h > 0,
and

$$0 \le a \le \frac{2(K-2)}{(T-K+2)}\,\frac{1}{\lambda_{\max}[Q^{-1}P\Lambda P']}$$
How the choice of Q and multicollinearity affect the properties of the ridge
estimator can be seen by examining the following cases:
1. Let Q = I. Then the risk function is ρ(β*, β) = (θ* − θ)'(θ* − θ)/σ²,
and the loss function penalizes errors in all components equally, regardless of
how well they may be estimated, given the sample design. The Strawderman
estimator becomes

$$\delta(b,s) = P\left[I + \frac{as\Lambda}{\hat\theta'\Lambda\hat\theta + gs + h}\right]^{-1}\hat\theta$$
If we recall that cov(θ̂) = σ²Λ⁻¹, it is clear that this estimator shrinks the
components θ̂ᵢ with the smallest variances the most. If (22.5.7) is rewritten as

$$\delta(b,s) = \left[X'X + \frac{asX'XQ^{-1}X'X}{b'X'Xb + gs + h}\right]^{-1}X'y \qquad (22.5.8)$$

we can examine the influence of this type of shrinkage on the stability of the
estimator by comparing the condition number of the design matrix of the least
squares estimator, X'X, with that of the design matrix in (22.5.8). If Q = I, the design
matrix is [X'X + k(y)(X'X)²]. The condition number of this matrix is [λ₁ + λ₁²k(y)]/
[λ_K + λ_K²k(y)]. This ratio is greater than the condition number λ₁/λ_K of X'X,
which implies that this ridge estimator is less stable than the least squares
estimator. This is not a desirable result if we are concerned with the effects of
collinearity upon the stability of the estimator.
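A small numerical check of this comparison; the characteristic roots and the value of k are illustrative assumptions:

```python
# For Q = I the shrinkage design matrix is X'X + k(X'X)^2, with condition
# number [l1 + l1^2 k]/[lK + lK^2 k], which exceeds l1/lK.
import numpy as np

lam = np.array([10.0, 1.0, 0.01])   # assumed characteristic roots of X'X
k = 0.5
print(lam[0] / lam[-1])                                        # cond(X'X) = 1000
print((lam[0] + k * lam[0]**2) / (lam[-1] + k * lam[-1]**2))   # approx. 5970, larger
```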
Another consideration is how much the resulting minimax rule will be differ-
ent from the conventional estimator. For minimaxity,
$$0 \le a \le \frac{2(K-2)}{(T-K+2)}\,\frac{1}{\lambda_{\max}[Q^{-1}X'X]}$$
Factors that affect the width of the interval containing the constant a are
important, since that determines the amount of shrinkage allowable for the
estimator to remain minimax. Furthermore, it is of interest to examine the
effects of multicollinearity upon the bound of the interval that contains a. Since
Q = I and λ_max[X'X] = λ₁, the range of a becomes the interval

$$\left[0,\ \frac{2(K-2)}{(T-K+2)}\,\frac{1}{\lambda_1}\right]$$

2. Let Q = X'X. Then the Strawderman estimator becomes

$$\delta(b,s) = P\left[I + \frac{asI}{\hat\theta'\Lambda\hat\theta + gs + h}\right]^{-1}\hat\theta$$
All components θ̂ᵢ are shrunk by the same fraction. The design matrix becomes
[X'X + k(y)X'X], which has condition number [λ₁ + k(y)λ₁]/[λ_K + k(y)λ_K],
which equals λ₁/λ_K, the condition number of X'X. Thus this type of shrinkage
has no effect on the stability of the resulting estimates under this criterion.
For this loss function the upper bound on the shrinkage constant a becomes
2(K − 2)/(T − K + 2). Therefore, only the number of regressors and the number of
observations affect the maximum minimax amount of shrinkage, and not the
multicollinearity in the data. Since this loss function is appropriate for situations
in which prediction is the primary concern of the analysis, this is not a
surprising result.
3. Let Q = (X'X)². The risk function is ρ(β*, β) = (θ* − θ)'Λ²(θ* − θ)/σ²,
and losses on components with small variances are magnified more than in the
case where Q = X'X. The Strawderman estimator becomes

$$\delta(b,s) = P\left[I + \frac{as\Lambda^{-1}}{\hat\theta'\Lambda\hat\theta + gs + h}\right]^{-1}\hat\theta$$

and now components with large variances are shrunk more than components
with small variances. Casella (1977) terms this the "ridge property," which
serves to distinguish this class of estimators from other shrinkage estimators.
The design matrix is [X'X + k(y)I], which has condition number [λ₁ + k(y)]/[λ_K
+ k(y)], which is less than λ₁/λ_K, the condition number of X'X. Thus this choice
of Q ensures the improved stability of the estimators as measured on this
criterion.
The upper bound on the shrinkage constant a is now
2(K- 2) 1 2(K - 2)
(T - K + 2) Amax[(X ' X)-I] - T - K + 2 AK
where λ_K is the smallest root of X'X. The smaller λ_K, that is, the more severe
the multicollinearity, the smaller the amount of minimax shrinkage allowed.
Therefore, although this estimator is minimax and improves the condition num-
ber of the design matrix, the more severe the collinearity the less different this
estimator will be from the conventional estimator. This reaffirms the comment
of Efron and Morris (1973, p. 93) that there is "tension" between minimaxity
and the stability of estimated coefficients.
The work of Stein and its extensions have been discussed in Chapter 3 as a
superior method of incorporating prior information when its validity is uncer-
tain. In this section we consider a general class of minimax estimators dis-
cussed by Judge and Bock (1978) as a way of obtaining more precise informa-
tion about unknown population parameters when the implicit design is poor. In the
usual regression case where σ² is unknown, Judge and Bock (1978, p. 234)
propose a class of rules

$$\delta(b,s) = [I - h(b'Bb/s)C]\,b \qquad (22.6.1)$$

where B and C are square matrices and h is a differentiable function. This rule
is minimax for β against

$$L(\delta,\beta) = \frac{1}{\sigma^2}(\delta - \beta)'Q(\delta - \beta)$$
In the general estimator (22.6.1) let Q = C = I, B = X'X, and h(u) = a/u. The
resulting estimator is

$$\delta_1(b,s) = \left[I - \frac{as}{b'X'Xb}\right]b = [I - k(y)I]\,b \qquad (22.6.2)$$

and it is minimax provided

$$0 \le a \le \frac{2}{(T-K+2)}\left[\lambda_K\operatorname{tr}\Lambda^{-1} - 2\right]$$
Note that the width of this interval, determining the amount of shrinking that is
allowed for minimaxity, is positively related to K and negatively related to T.
The remaining factor [λ_K tr Λ⁻¹ − 2] must be positive and is positively related to
the width of the interval. Since this factor depends upon the characteristic roots
of the design matrix, and therefore on whatever patterns of multicollinearity
are present, it is important to have an idea of what characteristic root spectra
lead to the satisfaction of this condition. Our first intuition, that increasingly
severe multicollinearity implies an increasingly small λ_K and therefore a limited
range of improvement of the Stein-rule estimator over least squares, is not
completely accurate. Recall that each small characteristic root corresponds to a
nearly exact linear relationship among the columns of the X matrix. The condition

$$\lambda_K\operatorname{tr}\Lambda^{-1} > 2$$

will fail in general when there is one, but not more than one, small characteristic
root. If there are two or more small characteristic roots, implying two or
more near-exact linear dependencies among the columns of X, then the condition
may be satisfied.
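A minimal sketch of the rule (22.6.2) together with a check of this minimax condition; the choice of a at half its upper bound is an illustrative assumption:

```python
# James-Stein-type rule (22.6.2): delta(b,s) = [1 - a*s/(b'X'Xb)] b,
# applicable only when lK * tr(Lambda^{-1}) > 2.
import numpy as np

def stein_rule(X, y):
    T, K = X.shape
    XtX = X.T @ X
    lam = np.linalg.eigvalsh(XtX)          # ascending, lam[0] = lambda_K
    factor = lam[0] * np.sum(1.0 / lam) - 2.0
    if factor <= 0.0:
        raise ValueError("minimax condition lK * tr(Lambda^-1) > 2 fails")
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    s = np.sum((y - X @ b) ** 2)
    a = factor / (T - K + 2)               # half of the upper bound 2*factor/(T-K+2)
    return (1.0 - a * s / (b @ XtX @ b)) * b
```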
Fomby and Hill (1978a), Trenkler (1984), and Oman (1982) have investigated
this condition for several patterned X'X matrices that could be thought to
characterize multicollinear data. For such cases the minimax bound takes the general form

$$0 \le a \le \frac{2}{(T-K+2)}\left[\frac{\operatorname{tr}[(RS^{-1}R')^{-1}RS^{-1}QS^{-1}R']}{\lambda_{\max}[(RS^{-1}R')^{-1}RS^{-1}QS^{-1}R']} - 2\right]$$
Members of the class (22.6.1) that do shrink with the ridge property are illustrated
in the following cases:
1. Let C = (X'X)⁻¹, B = X'X, and h(u) = a/u. The resulting estimator is

$$\delta(b,s) = \left[I - \frac{as(X'X)^{-1}}{b'X'Xb}\right]b = P\left[I - \frac{as\Lambda^{-1}}{\hat\theta'\Lambda\hat\theta}\right]\hat\theta \qquad (22.6.3)$$
As is clear from examining (22.6.3), this estimator shrinks most those components
θ̂ᵢ that have the largest variances σ²/λᵢ. The minimax interval is

$$0 \le a \le \frac{2}{(T-K+2)}\left[\lambda_K^2\operatorname{tr}\Lambda^{-2} - 2\right] \qquad (22.6.4)$$

and the condition for minimaxity is

$$\operatorname{tr}\Lambda^{-2} > \frac{2}{\lambda_K^2} \qquad (22.6.5)$$
The same general remarks about the satisfaction of (22.6.5) hold as discussed in
Section 22.6.1. In particular, for the equicorrelation example considered there,
(22.6.5) is satisfied for K ≥ 3 regardless of the value of z. Again the interval in
(22.6.4) decreases in width as z gets larger. As for the stability of the estimator
in (22.6.3), the condition number of the implied design matrix can be examined
as before.
2. For a second member of the class the minimax interval is

$$0 \le a \le \frac{2\lambda_K[\lambda_K\operatorname{tr}\Lambda^{-1} - 2]}{T-K+2}$$

Again, the smaller λ_K, the less difference there will be between this estimator
and the least squares estimator despite the fact that it is minimax and is more
stable on the condition number criterion.
3. To obtain an estimator with more desirable properties, but which is
slightly different in form from the estimators presented above, let Q = (X'X)²,
C = (X'X)⁻¹, B = I, and h(u) = a/u. This estimator has the form

$$\delta(b,s) = \left[I - \frac{as(X'X)^{-1}}{b'b}\right]b = P\left[I - \frac{as\Lambda^{-1}}{\hat\theta'\hat\theta}\right]\hat\theta$$

and is minimax for

$$0 \le a \le \frac{2(K-2)}{(T-K+2)}$$
These estimators offer properties that are desirable when data are multicol-
linear. What remains to investigate is the extent of risk gains among the various
rules and the circumstances when one is preferred to another.
$$y = X\beta + e$$

where y is a (T × 1) vector of observations on the dependent variable, X is a
(T × K) stochastic matrix; that is, the tth row x'_t is a (1 × K) random vector,
β is a (K × 1) vector of unknown parameters, and e is a (T × 1) stochastic
error vector that is distributed independently of X so that E(e|X) = 0 and
E(ee'|X) = σ²I. The properties of this model are discussed in Chapter 5.
Since we are presuming that our data are generated by independent drawings
from the distribution of X and e, it is clear that as experimenters we no longer
have a carefully controlled experimental design to work with. Thus it is possible
to draw a sample of T observations on the independent variables that has
poor sample design. We may analyze this sample in one of two ways. First, we
could analyze the sample design as if X were nonstochastic, with all results
conditional on the values of the sample actually drawn. Multicollinearity can
then be properly analyzed as a feature of the sample, not the population, and all
the comments of previous sections apply. This is usually the position taken.
On the other hand, we may wish to know whether the independent variables
are linearly related in the population itself.
22.8 SUMMARY
In this chapter the problem of multicollinearity has been considered and its
statistical consequences noted. The presence of multicollinearity implies poor
implicit sample design and means that not all parameters in the model can be
estimated precisely. Which linear combinations of parameters will be estimated
relatively imprecisely using least squares with a given sample can be deter-
mined by examining the characteristic roots and vectors of the X'X matrix. If
the linear combinations of parameters on which interest centers are not af-
fected by the multicollinearity, then one can proceed using classical methods.
If, however, the parameters of interest cannot be estimated precisely with the
given sample using classical methods, then other alternatives should be con-
sidered.
One possibility for improving the precision of parameter estimation is to
introduce nonsample information that can be combined with the existing sam-
ple information. The statistical consequences of this approach may be summa-
rized as follows: Over the region of the parameter space where the nonsample
information is "close" to being correct, the estimator that combines sample
and other information has a lower risk than the conventional estimator. If the
information is uncertain, however, there is a potential cost-namely, that the
estimator that combines both types of information may not be better than the
conventional estimator. Because of this fact, investigators have adopted the
practice of performing statistical tests of the compatibility of the sample and
prior information and making a decision whether or not to pool based upon the
outcome of that test. The resulting pretest estimator protects against large
losses due to the incorporation of erroneous information. It is superior to the
conventional estimator over only a portion of the parameter space. Further-
more, the sampling distributions of these pretest estimators are not presently
known and thus the basis for inference is limited.
If the statistical consequences of pretesting are ignored, then it is possible to
make empirical results "look good" by locating a specification of the model
that fits the data well in the sense that the "t values" are high. Such exercises
do not generally represent positive contributions to knowledge. As with other
statistical theories based on the sampling theory approach to estimation and inference,
a particular empirical result cannot be judged either good or bad. What people
typically mean when they say that the results "look good" is that they agree
with prior notions. This, however, does not necessarily imply correctness.
Another suggested route to improving the precision of estimators is to in-
spect the (nonstochastic) design matrix X to determine what information it
contains and then incorporate that information into the estimation process.
Examples of these ad hoc techniques include principal components regression,
ridge regression, and various variable selection procedures. These procedures
as usually presented have little or no conceptual basis and lead to estimators
that are inferior to the conventional estimator over a large range of the parame-
ter space. Consequently, their use is not recommended. Recently, Strawder-
man (1978) has produced a ridge-like estimator, which under certain conditions
is superior to the maximum likelihood estimator over the entire parameter
space. To achieve improvement in risk and estimator stability, an unusually
weighted loss function must be adopted. It is possible, however, to obtain risk
improvement over least squares under a wide range of loss functions. Sampling
distributions are not available for these estimators, and thus neither hypothesis
tests nor confidence interval statements can be made.
Finally, Stein-like estimators provide an alternative route to improved preci-
sion of estimation. These estimators can be thought of as superior ways of
combining sample and uncertain nonsample information. They have the prop-
erty that regardless of the correctness of the nonsample information, if certain
conditions are met, they dominate the conventional estimator under a wide
range of loss functions. Once more, however, these results are useful only
where point estimates are sought because the sampling distributions of these
estimators are not yet known. Furthermore, the extent of the risk gain can
diminish substantially when certain forms of multicollinearity are present be-
cause of the limited shrinkage permitted by the minimaxity condition on the
shrinkage constant. A generalized coordinate-wise Stein (1981) estimator may
remedy this deficiency.
Given the alternatives noted above, what solid guidance may be given to the
investigators who feel that their samples may not be rich enough in information
to precisely estimate specific linear parametric functions?
First, in general, investigators who have only vaguely defined objectives
when carrying out empirical work are more likely to be troubled by poor sample
design problems than investigators who have more limited or narrowly defined
objectives. Multicollinear data do not imply that one should despair immedi-
ately, since in some cases linear combinations of parameters may permit esti-
mation with acceptable sampling precision. What should be done is to assess
whether or not key parametric functions can be estimated relatively precisely.
If the data will not allow this, then other sources of information must be sought.
Second, adoption of estimation procedures that have low risk relative to least
squares over only a narrow range of the parameter space should be avoided in
general. This includes a vast array of ad hoc data search procedures and pretest
estimators outlined in this chapter and Chapter 21.
Third, if the sample does not contain sufficient information for precise pa-
rameter estimation, then those who wish to test hypotheses or make confidence
interval statements are in a difficult position. Most of the procedures available
to improve the precision and/or stability of parameter estimates, such as the
Stein-rule family of estimators, have unknown sampling distributions. Finally,
for those who are only interested in point estimates, Stein-like alternatives are
superior to least squares. The degree of improvement in estimator precision
and stability, however, depends on the loss function adopted, the number of
regressors, the nature of the multicollinearity, and the type and quality of both
the available sample and prior information.
Finally, what are some suggestions for research in this area? First and foremost
is the need to continue efforts to develop estimators, perhaps in the
Stein family, that provide risk improvements over LS and that are robust to
data ill-conditioning. Part of this work, of course, is careful investigation of
potential risk gains provided by current estimators under varying conditions of
collinearity. A second avenue of research that has yet to receive much atten-
tion is the effect of near-exact linear dependencies on the finite sampling preci-
sion of estimators in nonlinear models like Tobit, Probit, and so on. These
questions become increasingly important given the growing use of nonlinear
models in applied work.
22.9 EXERCISES
Exercise 22.1
For the model outlined above, and using the data in Table 22.4, compute the
least squares estimates of the coefficients, their means, and their sample vari-
ances. Compare the sample variances with their theoretical counterparts.
Exercise 22.2
For the given design matrix compute and evaluate each of the multicollinearity
"detection" methods outlined in Section 22.3.
Table 22.4 Five Samples of y Values
y1 y2 y3 y4 y5
16.64 15.81 15.48 16.97 15.97
17.77 18.76 19.38 18.35 18.94
16.98 17.31 16.70 18.92 15.38
22.40 20.02 20.73 22.41 22.44
19.19 16.51 17.79 18.58 19.33
22.03 21.88 19.72 20.70 21.75
22.22 22.87 20.96 23.66 23.03
20.85 22.01 20.77 21.14 22.29
22.68 22.49 23.05 21.50 23.33
16.24 14.29 14.52 14.87 14.71
17.85 17.17 16.63 17.15 17.77
18.95 21.98 19.97 20.37 19.99
21.12 20.76 22.29 21.72 19.98
19.85 19.45 19.24 18.29 20.27
21.34 23.66 22.17 20.85 21.08
22.74 20.99 21.37 19.15 19.58
24.36 24.18 21.34 24.10 24.14
24.70 25.19 23.71 24.81 23.27
18.92 20.59 20.81 21.00 20.17
15.31 13.12 16.08 15.58 13.16
Exercise 22.3
Generate 100 samples of size 20 for the statistical model outlined above. Compute
the least squares estimates, their sample means, variances, and squared
error loss, Σᵢ₌₁¹⁰⁰ (bᵢ − β)'(bᵢ − β)/100. Construct empirical frequency distributions
of the least squares estimates of the coefficients and the unbiased estimate
of the error variance. Compare these distributions to the true distributions.
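A minimal sketch of this Monte Carlo design; since the chapter's actual model specification and Table 22.4 design matrix are given earlier in the chapter, the X matrix and β used here are placeholder assumptions:

```python
# 100 samples of size 20; least squares each time; average squared error loss.
import numpy as np

rng = np.random.default_rng(1984)
T, K, n_rep = 20, 5, 100
X = rng.uniform(size=(T, K))        # placeholder design matrix
beta = np.ones(K)                   # placeholder true parameters
estimates, losses = [], []
for _ in range(n_rep):
    y = X @ beta + rng.normal(size=T)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    estimates.append(b)
    losses.append((b - beta) @ (b - beta))
print(np.mean(estimates, axis=0), np.mean(losses))
```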
Exercise 22.4
Assume that exact prior information of the form β₂ + β₃ = β₄ is available.
Compute the restricted least squares estimates for the five samples of data from
Exercise 22.1 and compare these outcomes with the corresponding least
squares results. See Section 3.1.
Exercise 22.5
Assume stochastic prior information of the form Rβ + v = β₂ + β₃ − β₄ + v =
0, where v is assumed to be a normal random variable with mean zero and
variance σ_v². Using this information in conjunction with the sample data, estimate
the parameter vector β and compare it with the least squares and restricted
least squares results. See Section 3.2.
Exercise 22.6
Use the inequality restricted estimator and the inequality restriction Rβ = β₂ +
β₃ − β₄ > 0 to estimate the unknown β parameters for the five samples.
Compare these results with those of the least squares, restricted least squares,
and stochastically restricted least squares estimators. See Section 3.2.
Exercise 22.7
Apply principal components regression (Section 22.5.1), deleting in turn one,
two, and three principal components, to obtain parameter estimates for the five
samples. Compute the sample means of the principal components estimates
and their sample variance.
Exercise 22.8
Compute the restricted least squares estimator for the 100 samples constructed
as outlined above, using the exact prior information β₂ + β₃ = β₄. Compute the
sample means and variances of the resulting estimates and compare them to
their theoretical counterparts. Compute the squared error loss of the estimates
and compare it to that for the least squares estimator.
Exercise 22.9
Compute the stochastically restricted estimator based on the prior information
Rβ + v = β₂ + β₃ − β₄ + v = 0, where v is a normal random variable with mean
zero and variance σ_v², for 100 samples. Compute the sample means and vari-
ances of the estimates and compare with their theoretical counterparts. Com-
pute the squared error loss of these estimates and compare them with the least
squares estimates.
Exercise 22.10
Use the inequality restricted estimator and the prior information that Rβ = β₂
+ β₃ − β₄ > 0 to estimate the model parameters for 100 samples. Compute the
sample means, variances, and squared error loss of these estimates and com-
pare them to the least squares results.
Exercise 22.11
Apply principal components regression, deleting in turn one, two, and three
components, to 100 samples. Compute the sample means and variances of
these estimates and compare them with the least squares results.
Exercise 22.12
Adopt one of the rules outlined in Section 22.5.2b for selecting the biasing
parameter k used in ridge regression and compute the ridge estimates for five
samples. Compute the sample means and variances of the estimated coeffi-
cients and compare them with the least squares results.
Exercise 22.13
Use Strawderman's adaptive generalized ridge estimator (Section 22.5.2c) to
estimate the model's coefficients for five samples. Compute the means and
variances of these estimates and compare them to the least squares results.
Exercise 22.14
Use the sample data from Exercise 22.3 and Strawderman's adaptive generalized
ridge estimator to estimate the model's coefficients for 100 samples. Com-
pute the sample means, variances, and squared error loss for these estimates
and compare them to the least squares results.
Exercise 22.15
Use the James and Stein-type minimax estimator (22.6.2) to estimate the
model's parameters for five samples. Compute the sample means and variances
of these estimates and compare them to the least squares results.
Exercise 22.16
Use the sample data from Exercise 22.3 and the James and Stein-type minimax
estimator (22.6.2) to estimate the model's parameters for 100 samples. Com-
pute the sample means, variances, and squared error loss of these estimates
and compare them to the least squares results.
Exercise 22.17
Contrast the results for the various estimators and diagnostic checks and dis-
cuss the best way to deal with this model and the corresponding data.
22.10 REFERENCES

Appendix A
Some Matrix and Distribution Theorems
For reference purposes some of the matrix and distribution theorems used in
derivations in this book and elsewhere are summarized in this Appendix. For a
more complete treatment see linear algebra or linear statistical model books
such as Introduction to Matrices with Applications to Statistics and The Theory
and Application of the Linear Model, both by F. Graybill and Linear Statistical
Inference by C. R. Rao.
A.1.1
A.1.2
Also
A.1.3
A.2.1
A.2.2
The two K-dimensional vectors y and z are said to be orthogonal if y'z = 0. If in
addition y'y = 1 and z'z = 1, they are said to be orthonormal. For a (T × T)
orthogonal matrix C, C'C = CC' = I, and, if C is orthogonal, then C' is
orthogonal.
A.2.3
A.2.4
A.2.5
A.2.6
A.2.7
A.2.8
$$A = Q\Lambda Q', \qquad A^{1/2} = Q\Lambda^{1/2}Q'$$

where Λ is diagonal.
A.2.9
A.2.10
A.2.11
The identity matrix is the only nonsingular idempotent matrix, and a symmetric
idempotent matrix is positive semidefinite.
A.2.12
A.2.13
If A and B are two symmetric matrices, a necessary and sufficient condition for
an orthogonal matrix C to exist such that C'AC = Λ and C'BC = M, where Λ
and M are diagonal, is that A and B commute, that is, AB = BA.
A.3.1
If y is a normal random variable with mean θ and variance σ², that is, N(θ, σ²),
then z = ay is a normal random variable with mean aθ and variance a²σ², for
a ≠ 0. If y is a (T × 1) normally distributed random vector with mean vector θ
and covariance σ²I_T, and A is a (K × T) matrix of rank K < T, then z = Ay is a
normal random vector with mean Aθ and covariance σ²AA'.
A.3.2
A.3.3
A.3.4
A.3.5
If y_t, t = 1, . . . , T, are independent random variables with mean μ and variance σ², then, when T is large,
Σ_t y_t/T has approximately the distribution N(μ, σ²/T).
A.4.1
A.4.2
A.4.3
A.4.4
If y is a (T × 1) normal random vector N(0, I_T), then the quadratic form y'y is
distributed as a χ²_(T) random variable.
A.4.5
A.4.6
A.4.7
A.4.8
$$u = \frac{y'Ay/J\sigma^2}{y'By/K\sigma^2}$$
A.4.9
A.4.10
A.4.11
If the random vectors y and z are distributed N(θ, σ²I_T) and N(0, σ²I_T), respectively,
and A and B are symmetric idempotent matrices of rank K and J,
respectively, where BA = 0, then the ratio u = [(y'Ay/σ²)/(z'Bz/σ²)](J/K) is
distributed as a noncentral F(K, J, λ) random variable with λ = θ'Aθ/2σ².
A.4.12
A.4.13
A.4.14
A.4.15
A.4.16
The reciprocal of the central χ² random variable with r degrees of freedom has
expected value E[(χ²_r)⁻¹] = 1/(r − 2), for r > 2.
A.4.17
The central χ² random variable with r degrees of freedom has variance 2r.
A.4.18
A.4.19
A.4.20
$$E[(\chi^2_{(k,\lambda)})^{-1}] = \sum_{j=0}^{\infty}\frac{e^{-\lambda/2}(\lambda/2)^j}{j!\,(k+2j-2)}, \qquad k \ge 3$$

with analogous confluent hypergeometric expressions for higher-order inverse
moments of noncentral χ² and F random variables. For "A Simple Form for the
Inverse Moments of Non-Central χ² and F Random Variables and for Certain
Confluent Hypergeometric Functions," see Bock, Judge, and Yancey (1984),
Journal of Econometrics.
1. (A ⊗ B)(C ⊗ D) = AC ⊗ BD.
2. (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹.
A.5.2
where A₁₁ and A₂₂ are square matrices. The partitioned inverse of A
A.5.3
A.5.4
For any matrix A, there exists a unique matrix A⁺ that satisfies the following
four conditions:
1. AA⁺A = A
2. A⁺AA⁺ = A⁺
3. (A⁺A)' = A⁺A
4. (AA⁺)' = AA⁺
A.5.5
A.5.6
A.5.7
A.5.8
A.5.9
A.6.1
5. La = a.
6. L⁻¹L = LL⁻¹ = I.
A.6.2
A.7.1
A.7.2
A.7.3
A.7.4
A.7.5
A.8 DETERMINANTS
A.8.1
A.8.2
A.8.3
Appendix B
Numerical Optimization Methods

B.1 INTRODUCTION
Most of the minimization methods discussed in the following are iterative and
follow the general scheme depicted in Figure B.1. In this approach we try to
Figure B.1 Flow diagram for iterative optimization methods: specify an initial parameter vector θ₁ and an initial step direction; determine the step length; compute θ_{n+1} = θ_n + t_n; compute a new step direction; stop when the termination criteria are fulfilled.
find a sequence θ₁, θ₂, . . . , θ_N of vectors in the parameter space such that θ_N
minimizes H(θ) approximately. Starting with some initial vector θ₁, each of the
following elements in the sequence is based on the preceding one in that we add
a vector t_n, called a step, to θ_n in order to determine θ_{n+1}. Thus

$$\theta_{n+1} = \theta_n + t_n \qquad (B.2.1)$$

Although this might not be the shortest way to the minimum of H(θ), we usually
require H(θ_n) > H(θ_{n+1}). A step that meets this condition is called acceptable.
Ideally, the procedure should terminate when no further reduction of the
objective function can be obtained. For practical purposes the iteration stops,
for example, if for a prespecified small ε > 0:

1. ‖θ_{n+1} − θ_n‖ < ε.
2. |H(θ_n) − H(θ_{n+1})| < ε.
3. ‖γ_n‖ < ε, where γ_n is the gradient of H(θ) at θ_n.
4. A prespecified maximum number of iterations is reached.
5. A prespecified maximum number of function evaluations is reached.

Usually, we do not wish to rely on a single one of these criteria, since (1) does
not guarantee a termination if, for example, the minimum is not unique, and (2)
and (3) may never occur if no minimum exists or the algorithm used does not
approach the existing minimum for some reason. Conditions (4) and (5) are
useful to stop a search in a region of the parameter space far away from a
minimum. Different starting values may be tried to locate the minimum in a
reasonable number of iterations. Let us now turn to a discussion of some of the
possibilities in choosing a step.
possibilities in choosing a step.
In order that a step t_n = t_nδ_n in direction δ_n reduce the objective function, the difference

$$H(\theta_n + t_n\delta_n) - H(\theta_n) \qquad (B.2.2)$$

has to be less than zero. Abbreviating the gradient of the objective function as

$$\gamma_n = \left.\frac{\partial H}{\partial\theta}\right|_{\theta_n} \qquad (B.2.3)$$

a possible choice of direction is

$$\delta_n = -P_n\gamma_n \qquad (B.2.4)$$

where P_n is any positive definite matrix, that is, γ'P_nγ > 0 for all vectors γ ≠ 0
and, therefore, γ'_nδ_n = −γ'_nP_nγ_n < 0 if γ_n ≠ 0. For γ_n = 0, we can hope that we
have reached the trough. As the general form of an iteration we get from (B.2.1)

$$\theta_{n+1} = \theta_n - t_nP_n\gamma_n \qquad (B.2.5)$$
Various procedures for determining the length of the step toward the trough are proposed in the literature. In some algorithms the step length is
determined simultaneously with the direction, whereas other methods specify
the direction only and allow various possibilities for the step length selection.
These can range from using the first trial of t that fulfills (B.2.2) to an optimiza-
tion of the step length in the given direction. For a description of some possi-
ble procedures, see Flanagan, Vitale, and Mendelsohn (1969). Determination of
the optimal step length in each iteration is likely to decrease the number of
steps to the trough but increases the computational cost for each iteration. Bard
(1970) and Flanagan, Vitale, and Mendelsohn (1969) found that for their test
problems it did not pay to use the most expensive procedures for the step
length selection and according to Dennis (1973, p. 161) efforts to improve the
gradient methods focus on modifications of the direction rather than the step
length.
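A minimal sketch of the general gradient iteration (B.2.5), here with P_n = I (steepest descent) and a simple step-halving rule so that each step is acceptable; the objective function and starting point are illustrative assumptions:

```python
import numpy as np

def gradient_descent(H, grad, theta, max_iter=200, eps=1e-8):
    for _ in range(max_iter):
        gamma = grad(theta)
        if np.linalg.norm(gamma) < eps:        # termination criterion (3)
            break
        t = 1.0
        while H(theta - t * gamma) >= H(theta) and t > 1e-12:
            t *= 0.5                           # halve t until the step is acceptable
        theta = theta - t * gamma
    return theta

# Example: minimize H(theta) = (theta1 - 1)^2 + 4*(theta2 + 2)^2
H = lambda th: (th[0] - 1) ** 2 + 4 * (th[1] + 2) ** 2
grad = lambda th: np.array([2 * (th[0] - 1), 8 * (th[1] + 2)])
print(gradient_descent(H, grad, np.zeros(2)))  # approx. [1, -2]
```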
Usually, the gradient methods are differentiated by the step direction or,
since almost all use the basic formula (B.2.5), by the direction matrix P_n. In the
following section we will discuss possible choices, some of which are summarized
in Table B.1.
Table B.1 Direction Matrices of Some Gradient Methods

Method | Direction matrix P_n | Section
Newton-Raphson | 𝓗_n⁻¹ | B.2.3
Davidon-Fletcher-Powell | P_{n−1} + δ_{n−1}δ'_{n−1}/[δ'_{n−1}(γ_n − γ_{n−1})] − P_{n−1}(γ_n − γ_{n−1})(γ_n − γ_{n−1})'P_{n−1}/[(γ_n − γ_{n−1})'P_{n−1}(γ_n − γ_{n−1})] | B.2.4
Gauss | [Z(θ_n)'Z(θ_n)]⁻¹ | B.2.5
In the sequel the Hessian of H(θ) at θ_n will be denoted by 𝓗_n. To see why this
direction matrix is chosen, we approximate H(θ) at θ_n by its Taylor series
expansion up to the quadratic terms:

$$H(\theta) \approx H(\theta_n) + \gamma_n'(\theta - \theta_n) + \tfrac{1}{2}(\theta - \theta_n)'\mathcal{H}_n(\theta - \theta_n) \qquad (B.2.8)$$

Setting the first derivative of this approximation equal to zero gives

$$\gamma_n + \mathcal{H}_n(\theta - \theta_n) = 0 \qquad (B.2.9)$$

or

$$\theta = \theta_n - \mathcal{H}_n^{-1}\gamma_n \qquad (B.2.10)$$
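A minimal sketch of the Newton-Raphson step (B.2.10), applied to the same illustrative quadratic as above:

```python
import numpy as np

def newton_raphson(grad, hess, theta, max_iter=50, eps=1e-10):
    for _ in range(max_iter):
        gamma = grad(theta)
        if np.linalg.norm(gamma) < eps:
            break
        theta = theta - np.linalg.solve(hess(theta), gamma)  # Newton step
    return theta

grad = lambda th: np.array([2 * (th[0] - 1), 8 * (th[1] + 2)])
hess = lambda th: np.array([[2.0, 0.0], [0.0, 8.0]])
print(newton_raphson(grad, hess, np.zeros(2)))  # [1, -2] in one step
```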
To illustrate, consider the model

$$y_t = \theta_1^* + \theta_2^* x_{t2} + (\theta_2^*)^2 x_{t3} + e_t, \qquad t = 1, \ldots, 20 \qquad (B.2.11)$$

or, in vector notation,

$$y = f(\theta^*) + e \qquad (B.2.12)$$

where y = (y₁, y₂, . . . , y₂₀)', θ* = (θ₁*, θ₂*)' is the true parameter vector,
e = (e₁, e₂, . . . , e₂₀)', and

$$f(\theta) = \begin{bmatrix} \theta_1 + \theta_2 x_{12} + \theta_2^2 x_{13} \\ \theta_1 + \theta_2 x_{22} + \theta_2^2 x_{23} \\ \vdots \\ \theta_1 + \theta_2 x_{20,2} + \theta_2^2 x_{20,3} \end{bmatrix}$$

The data used for this example are given in Table B.2 (columns t, y_t, x_{t2}, x_{t3}). The x_{t2} and x_{t3} are
pseudo-random numbers from a uniform distribution on the unit interval, and
the y_t are computed by adding a normal pseudo-random number to θ₁*
+ θ₂*x_{t2} + (θ₂*)²x_{t3}, where the true parameters were chosen to be θ₁* = θ₂* = 1.
Loci in the parameter space with constant H(θ) are depicted in Figure B.2.
Using the Newton-Raphson iterations (B.2.10) and three different starting values, we obtained the results given in Table B.3.
The algorithm terminates at two different points in the parameter space. This
outcome is not surprising since H(θ) has two different local minima (see Figure
B.2). The example shows that an optimization algorithm does not necessarily
converge to the global minimum. We will return to this problem later on.
Figure B.2 Loci of constant objective function values for the example model, showing two local minima.
Some algorithms approximate the inverse of the Hessian of the objective function
in each iteration by adding a correction matrix to the approximate inverse
of the Hessian used in the most recent step,

$$P_{n+1} = P_n + M_n \qquad (B.2.14)$$

If H(θ) were quadratic with constant Hessian 𝓗, the gradients would satisfy

$$\gamma_{n+1} - \gamma_n = \mathcal{H}(\theta_{n+1} - \theta_n) \qquad (B.2.15)$$

which leads to

$$\mathcal{H}^{-1}(\gamma_{n+1} - \gamma_n) = \theta_{n+1} - \theta_n \qquad (B.2.16)$$

and thus to the requirement

$$P_{n+1}(\gamma_{n+1} - \gamma_n) = \theta_{n+1} - \theta_n \qquad (B.2.17)$$

A correction matrix that meets this requirement is

$$M_n = \frac{(\delta_n - P_n\eta_n)(\delta_n - P_n\eta_n)'}{\eta_n'(\delta_n - P_n\eta_n)}, \qquad \eta_n = \gamma_{n+1} - \gamma_n, \quad \delta_n = \theta_{n+1} - \theta_n \qquad (B.2.18)$$

This is the only symmetric matrix of rank one which meets the requirements of
(B.2.17) [Broyden (1965, 1967)]. The resulting algorithm is therefore called the
rank one correction (ROC) method. This choice does not necessarily lead to an
acceptable step, since P_n + M_n may not be positive definite.
Another famous member of this family of algorithms is the Davidon-
Fletcher-Powell (DFP) method [Davidon (1959), Fletcher and Powell (1963)]
for which

$$M_n = \frac{\delta_n\delta_n'}{\delta_n'\eta_n} - \frac{P_n\eta_n\eta_n'P_n}{\eta_n'P_n\eta_n}$$

As shown by Fletcher and Powell (1963), if the step length t_n for each step δ_n =
−t_nP_nγ_n is selected so as to minimize H(θ_n + δ_n) for the given θ_n, P_n, and γ_n,
then P_{n+1} = P_n + M_n will always be positive definite. Therefore, choosing P_n as
the step direction matrix in the nth iteration guarantees an acceptable step. The identity
matrix I_K can be used as P₁. In practice, however, the necessary computations
may not be precise enough to obtain a positive definite P_n in each iteration
[Bard (1968)]. Nevertheless, the DFP method is very popular and has proved
very efficient in many applications.
Clearly, the choice of the step size is crucial in this algorithm, and many
procedures have been developed to efficiently minimize the objective function
in the direction - Pn'Yn in the nth iteration. Some methods are discussed in
Himmelblau (1972a, Section 2.6) and Dixon (1972).
Many other possible formulae have been suggested for the correction matrix
Mn. Some are presented in Himmelblau (1972a, Table 3.5-1), and Huang (1970)
has given a classification of many of them. See also Huang and Levy (1970),
Dennis and More (1977), and Brodlie (1977).
$$\mathcal{H}(\theta) = 2Z(\theta)'Z(\theta) - 2\sum_{t=1}^{T}[y_t - f_t(\theta)]\frac{\partial^2 f_t(\theta)}{\partial\theta\,\partial\theta'} \qquad (B.2.22)$$

The Gauss method neglects the second term and uses the direction matrix

$$P_n = [Z(\theta_n)'Z(\theta_n)]^{-1} \qquad (B.2.23)$$

and, since γ_n = −2Z(θ_n)'[y − f(θ_n)], we get with step length 1,

$$\theta_{n+1} = \theta_n + [Z(\theta_n)'Z(\theta_n)]^{-1}Z(\theta_n)'[y - f(\theta_n)] \qquad (B.2.24)$$

$$\phantom{\theta_{n+1}} = [Z(\theta_n)'Z(\theta_n)]^{-1}Z(\theta_n)'\tilde{y}_n \qquad (B.2.25)$$

where

$$\tilde{y}_n = Z(\theta_n)\theta_n + y - f(\theta_n) \qquad (B.2.26)$$
This shows that the Gauss algorithm can be viewed as a sequence of linear
regressions. In each step we compute the LS estimator for a linear approxima-
tion of the nonlinear model. It is sometimes useful to write (B.2.24a) as
where
ae 18
aO' = -Z(O)
a2 In 1 ]-1
Pn = - [E ao ao' 8 n (B.2.27)
where is the likelihood function. Thus, using the negative log likelihood
function as the objective function, the Hessian is approximated by its expected
value. Assuming independently, identically, normally distributed errors and
deleting the variance σ² in the usual way from the minimization procedure,
[Z(θ_n)'Z(θ_n)]⁻¹ can be used as direction matrix instead of (B.2.27). For a gener-
alization of this method see Berndt, Hall, Hall, and Hausman (1974).
The Gauss method is also applicable for objective functions different from
(B.2.21) [see Bard (1974, Sections 5-9 to 5-11)]. The simplicity of this algorithm
and its good local convergence properties [Dennis (1973)] make it an
attractive method where applicable. However, in general, P_n is singular and
thus not positive definite if θ_n is not close to θ_min, which minimizes H(θ), and, if
the linear approximation of the nonlinear model is poor or the residuals are
large, convergence can be slow. For the example model (B.2.11) iterations of
the Gauss algorithm are given in Table B.4.
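A minimal sketch of the Gauss iterations (B.2.24) for the example model; since Table B.2 is not reproduced, the data are simulated under assumed parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
x2, x3 = rng.uniform(size=20), rng.uniform(size=20)
y = 1 + x2 + x3 + rng.normal(scale=0.1, size=20)   # true theta = (1, 1)

def f(th):                                          # f_t = th1 + th2*x2 + th2^2*x3
    return th[0] + th[1] * x2 + th[1] ** 2 * x3

def Z(th):                                          # T x 2 matrix of derivatives
    return np.column_stack([np.ones_like(x2), x2 + 2 * th[1] * x3])

theta = np.array([0.5, 2.0])                        # starting values
for _ in range(20):
    Zn = Z(theta)
    theta = theta + np.linalg.lstsq(Zn, y - f(theta), rcond=None)[0]
print(theta)                                        # close to (1, 1)
```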
Since the performance of the Gauss method depends on the size of the residuals
in a particular model, or more precisely on

$$\sum_{t=1}^{T}[y_t - f_t(\theta)]\frac{\partial^2 f_t(\theta)}{\partial\theta\,\partial\theta'} \qquad (B.2.28)$$
Brown and Dennis (1971) suggest the possibility of combining the Gauss algorithm
with a quasi-Newton method. Instead of the inverse Hessian of the
objective function, we approximate the Hessian of each f_t(θ) iteratively; that is, we
choose matrices F_{t,n} that approximate ∂²f_t(θ)/∂θ∂θ' at θ_n and update them by
quasi-Newton-type correction formulas [Brown and Dennis (1971)].
The direction matrix for this algorithm is

$$P_n = \left\{Z(\theta_n)'Z(\theta_n) - \sum_{t=1}^{T}[y_t - f_t(\theta_n)]F_{t,n}\right\}^{-1} \qquad (B.2.35)$$
As one alternative the unit matrix could be substituted for F_{t,1} for all t. For
another possible choice see Dennis (1973, p. 173). Brown and Dennis (1971)
found a good performance of this algorithm in a comparison with other meth-
ods. However, the computer storage requirements for this procedure are rela-
tively high.
B.2.7 Modifications

If the Hessian, or its approximation, is not positive definite, it can be
modified rather than its inverse. The new direction matrix is then obtained by
replacing the characteristic roots of the Hessian by suitably bounded values;
see, for example, the proposals of Greenstadt (1967) and of Bard (1974, p. 93).
Here 0⁻¹ is defined to be infinity and ε, μ, and η are
suitable positive constants. In these proposals a positive lower bound for the
absolute value of the characteristic roots is given, since in practical computations
a very small number is treated as zero.
This algorithm was suggested by Fletcher and Reeves (1964) and is based on an
idea that is different from the other gradient methods. Suppose the objective
function H(θ) is quadratic and thus the Hessian 𝓗 is constant. Two directions δ
and δ̄ in the K-dimensional θ-space are called conjugate if δ'𝓗δ̄ = 0. For a
given δ ≠ 0 there are K − 1 conjugate directions in the parameter space.
Starting at an initial vector θ₁ with a step in the steepest descent direction and
then performing steps in all conjugate directions will lead to the global minimum,
provided that H(θ) is minimized in the given direction at each stage, that
is, the optimal step length has to be chosen in each iteration. It turns out that
the conjugate directions can be computed inductively by the formula

$$\delta_{n+1} = -\gamma_n + \frac{\gamma_n'\gamma_n}{\gamma_{n-1}'\gamma_{n-1}}\,\delta_n \qquad (B.2.42)$$
Some algorithms are designed for problems that can be set up so that, in the optimum,

$$\theta = F(\theta) \qquad (B.2.43)$$

and iterations are computed as

$$\theta_{n+1} = F(\theta_n) \qquad (B.2.44)$$

where F(θ) = [F₁(θ), . . . , F_K(θ)]' and the optimum is reached when θ_{n+1} =
θ_n. In other words, the algorithms determine a fixed point of the function F(·).
If the original problem is formulated in this way, the iteration (B.2.44) can be
used directly. This procedure is called the Jacobi method. It does not converge in
general.
The Gauss-Seidel algorithm follows similar ideas. It is applicable if the original
problem is set up so that, in the optimum,

$$\theta_i = F_i(\theta), \qquad i = 1, 2, \ldots, K \qquad (B.2.45)$$

In this case iterations similar to (B.2.44) can be carried out. For further details
and references see Quandt (1983).
One disadvantage of the gradient methods is that they require analytic computation
of at least first partial derivatives to find the step direction. Actually, it is
possible to let the computer remove this burden from the user, since approximately

$$\left.\frac{\partial H}{\partial\theta_i}\right|_{\theta_n} \approx \frac{H(\theta_n + \varepsilon e_i) - H(\theta_n)}{\varepsilon} \qquad (B.2.47)$$

for small ε > 0, where e_i is the ith unit vector.
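A minimal sketch of the forward-difference approximation (B.2.47); the objective function is an illustrative assumption:

```python
import numpy as np

def num_grad(H, theta, eps=1e-6):
    gamma = np.zeros_like(theta, dtype=float)
    for i in range(len(theta)):
        step = np.zeros_like(theta, dtype=float)
        step[i] = eps
        gamma[i] = (H(theta + step) - H(theta)) / eps   # (B.2.47)
    return gamma

H = lambda th: (th[0] - 1) ** 2 + 4 * (th[1] + 2) ** 2
print(num_grad(H, np.zeros(2)))   # close to the analytic gradient [-2, 16]
```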
The step length t_n is chosen such that H(θ_{n+1}) < H(θ_n). The methods differ in
how they select step length and direction and how and when they apply a new
set of direction vectors. An algorithm suggested by Powell (1964) has proved
quite successful in some applications. As in the conjugate gradient method, a
search is performed along conjugate directions. For details see Powell's article
or Himmelblau (1972a, Chapter 4), where other derivative free methods are
The simplex method of Nelder and Mead works with K + 1 points θ₁, . . . , θ_{K+1}
in the parameter space, the vertices of a simplex. In each iteration the vertex θ_m
at which the objective function is greatest,

$$H(\theta_m) = \max_{j=1,2,\ldots,K+1} H(\theta_j)$$

is determined. Then θ_m is replaced by θ̄, say, a point on the ray from θ_m through the centroid of
the remaining points. The procedure is repeated, always replacing the vertex
that leads to a maximum value of the objective function, as illustrated in Figure
B.3. Usually, modifications of this basic step will be necessary in order to
guarantee satisfactory progress in locating the minimum of H(θ). A detailed
description of the algorithm is given by Nelder and Mead (1965); for some
modifications see Parkinson and Hutchinson (1972a, 1972b). The method seems
to be robust and has been successfully applied for problems with a high-dimen-
sional parameter space. Computer programs for the simplex method and also
for Powell's algorithm are given in Himmelblau (1972a, Appendix B).
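The simplex method is also available in modern standard libraries; a minimal usage sketch with SciPy, on the illustrative quadratic used above:

```python
from scipy.optimize import minimize

H = lambda th: (th[0] - 1) ** 2 + 4 * (th[1] + 2) ** 2
res = minimize(H, x0=[0.0, 0.0], method="Nelder-Mead")
print(res.x)   # approx. [1, -2]
```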
Since for applied researchers the actual performance of an algorithm for their
particular problems is of major interest, we have not given theoretical conver-
gence results. These can be found in the references [e.g., Ortega and Rhein-
boldt (1970) and Dennis (1977)]. Rather, we compare the methods on the basis
of some performance comparisons reported in the literature. It should be clear
that the wide variety of possible objective functions does not permit an evalua-
tion that is correct for any given problem and also the differences in the evalua-
tion criteria used in different studies hamper generally valid statements. Criteria
such as (1) robustness, measured, for example, by the number of times the
algorithm under consideration has converged to the minimum for a set of trial
problems, (2) precision in the solution, (3) the number of function evaluations,
and (4) execution time are often used [Himmelblau (1972b)].
Bard (1970) found that for his test problems Gauss-type methods including
the Marquardt algorithm were superior to quasi-Newton procedures, and Him-
melblau's (1972b) results seem to indicate that quasi-Newton methods perform
better than the conjugate gradient and derivative-free algorithms. This ranking,
however, depends on the procedure used for step length selection. In Him-
melblau's study, Powell's direct search method could compete with some vari-
able metric algorithms. Box (1966) reports similar results. Comparing direct
search methods, Parkinson and Hutchinson (1972a) found that a modification of
the simplex algorithm has some advantages in terms of robustness and com-
puter storage space over Powell's method for test problems with high-dimen-
sional parameter spaces. Although the conjugate gradient method appears to be
less efficient than quasi-Newton procedures in most comparisons, Sargent and
Sebastian (1972) mention its good performance for some functions involving
many parameters. As we have seen, this algorithm needs less storage space
than most other gradient methods.
Summarizing, there seems to be evidence that the Gauss-type methods are
preferable if applicable, but a general ranking of other algorithms appears to be
difficult and possibly not too useful for the applied worker. A combination of
Gauss and quasi-Newton methods might be the first choice if the residuals are
expected to be relatively large. However, this procedure is more costly in
terms of computer storage space than other algorithms. An application of the
Newton-Raphson algorithm or a modification thereof is often hampered by the
inconvenience of providing analytic second partial derivatives. To find the
optimal algorithm on the first trial requires some experience, and even then it
will be difficult or impossible. Bard (1974, pp. 116-117) remarks "that the state
of the art of nonlinear optimization is such that one cannot as yet write a
computer program that will produce the correct answer to every parameter
estimation problem in a single computer run." This, however, should not dis-
courage the applied researcher from using the described procedures, since even
for a few trials the cost will usually be in acceptable limits if the number of
parameters is relatively small. Moreover, the application of the methods is very
easy if the available computer packages are used. For a discussion of computer
related problems see Murray (1972b).
Generally, all possible simplifications of the minimization procedure should
be utilized. A reparameterization of ill-conditioned objective functions, for
instance, can sometimes simplify the minimization problem. An example is
given in Draper and Smith (1981, Chapter 10). Unfortunately, the difficulty of
finding a good reparameterization is one limitation of this method. If the model
under consideration is linear in some parameters, a nonlinear procedure can
sometimes be combined profitably with the methods to solve a linear estimation
problem. At each stage of the iteration procedure the optimal values of the
linear parameters are computed conditional on the given values for the nonlin-
ear parameters.
Finally, we remind the reader that the described algorithms usually converge
only to local minima. To find the global minimum, it can be helpful to apply
different methods and starting values (see Section B.2.3). The reader is referred
to Quandt (1983) and Dixon and Szego (1975) for a discussion of more sophisti-
cated, global optimization techniques.
In Section B.2 we have assumed that a priori any point in the parameter space
qualifies as a solution of the optimization problem. In practice, nonsample
information about the parameters is often available in the form of equality
and/or inequality constraints. Knowledge of this kind can help to find good starting
values for an optimization algorithm and also reduces the region in which we
have to search for the minimum of the objective function. Moreover, it can aid
in obtaining more precise parameter estimates.
B.3.1 Equality Constraints

Suppose that the true parameters are known to fulfill equality constraints given
in the form

$$q(\theta) = 0 \qquad (B.3.1)$$
B.3.1a Reparameterization
Sometimes equality restrictions can be introduced by reducing the dimension of
the parameter space. For instance, if θ = (θ₁, θ₂)' and a constraint is θ₂ = θ₁²,
then we can define

$$\theta = g(\theta_1) = (\theta_1, \theta_1^2)'$$

and minimize H_R(θ₁) = H(g(θ₁)) with respect to θ₁ only. Generally, if the restrictions
can be expressed in the form

$$\theta = g(\alpha) \qquad (B.3.2)$$
But notice that this method increases the number of parameters in the objective
function and thereby may increase the minimization difficulties, whereas a
reduction of the parameter space as described above is likely to reduce the
computational burden. Furthermore, ∂H_R/∂ν = 0 is only a necessary condition
for a constrained minimum of H(θ).
For more details and other possible penalty functions, see Fletcher (1977) and
Lootsma (1972b).
If the constraints are simple enough to permit an easy computation of second
partial derivatives of A(θ), Gauss-type methods can still be applied to minimize
the modified objective function if H(θ) has the form discussed in Section B.2.5.
The direction matrix for the nth iteration becomes

$$\tilde{P}_n = \left[P_n^{-1} + \left.\frac{\partial^2 A(\theta)}{\partial\theta\,\partial\theta'}\right|_{\theta_n}\right]^{-1} \qquad (B.3.6)$$

where P_n is the direction matrix used in the Gauss or modified Gauss algorithm.
Using this specification the λ's can be given fixed nonzero values in a first
round and H_R(θ, λ) is minimized with respect to θ without constraints. The
resulting θ vector can be used in (B.3.3) to determine a new λ that is used as a
fixed vector in (B.3.7) in the next round. This procedure is repeated until
convergence. For details see Fletcher (1975) and Powell (1978). Powell (1982b)
recommends the method in conjunction with the conjugate gradient algorithm
when the number of parameters is large and the number of restrictions is small.
B.3.2 Inequality Constraints

Suppose now that constraints for the parameters are given in the form of
inequalities

$$q_j(\theta) \ge 0, \qquad j = 1, \ldots, J \qquad (B.3.8)$$

One possibility is to add to the objective function a barrier function such as

$$B_n(\theta) = c_n\sum_{j=1}^{J}\frac{1}{q_j(\theta)} \qquad (B.3.9)$$

[Carroll (1961)] or

$$B_n(\theta) = -c_n\sum_{j=1}^{J}\ln q_j(\theta) \qquad (B.3.10)$$

[Lootsma (1967)], where the c_n are small positive constants with c₁, c₂, c₃,
. . . approaching zero. In order to find the minimum of the original objective
function in the feasible region

$$\{\theta\ |\ q_j(\theta) \ge 0,\ j = 1, \ldots, J\} \qquad (B.3.11)$$

a sequence of modified objective functions H(θ) + B_n(θ) is minimized. More
generally, barrier functions of the form

$$B(\theta) = c\sum_{j=1}^{J}p[q_j(\theta)] \qquad (B.3.12)$$

may be used. Barrier methods of this kind confine the parame-
ters to the interior of the feasible region, and problems can arise if the minimum
is obtained on or near the boundary. In this case an exterior point method may
be appropriate. A penalty function of the general form

$$A(\theta) = d\sum_{j=1}^{J}\psi[q_j(\theta)] \qquad (B.3.13)$$

can be added to the objective function, where d is a positive constant and ψ
vanishes inside the feasible region and is positive outside it. A possible choice is

$$\psi[q_j(\theta)] = (\min[0,\ q_j(\theta)])^2 \qquad (B.3.15)$$
If we add A(θ) to H(θ), the penalty for leaving the feasible region grows with d.
Thus an increasing series of d-values may have to be tried in order to obtain a
feasible minimum of the objective function. Note that the Hessian of the new
objective function does not exist at the boundary of the feasible region. For
more details about the use of barrier and penalty functions, as well as other
methods and references, see Lootsma (1972b), Davies and Swann (1969),
Powell (1969), Fiacco and McCormick (1968), Fletcher (1977), and Murray
(1969). The Gauss method can be modified to allow for a penalty or barrier
function term in the objective function as in (B.3.6).
Sometimes inequality constraints can be eliminated by a parameter transformation.
For instance, if θ_i is restricted to be greater than zero, it can be replaced
by e^{η_i}, where η_i is unrestricted. Since we are dealing with nonlinear models anyway, such a transformation
may not increase the computational burden. Other possible transformations
are listed in Box (1966) and Powell (1972).
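A minimal sketch of such a transformation; the objective function is an illustrative assumption:

```python
# Enforce theta > 0 by optimizing over eta with theta = exp(eta).
import numpy as np
from scipy.optimize import minimize

H = lambda th: (th[0] - 0.5) ** 2            # toy objective, theta > 0 required
res = minimize(lambda eta: H(np.exp(eta)), x0=[0.0], method="Nelder-Mead")
print(np.exp(res.x))                         # approx. [0.5], automatically positive
```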
Sometimes both equality and inequality constraints for the parameter vector θ
are present. In this case the methods discussed in Sections B.3.1 and B.3.2 can
be combined. Suppose the restrictions are given as

$$q_1(\theta) = 0 \qquad \text{and} \qquad q_2(\theta) \ge 0 \qquad (B.3.16)$$

where q₁(·) is (J₁ × 1) and q₂(·) is (J₂ × 1). Combining (B.3.4) and (B.3.15) we
get a new objective function,

$$H_R(\theta) = H(\theta) + d\left\{q_1(\theta)'q_1(\theta) + \sum_{j=1}^{J_2}(\min[0,\ q_{2j}(\theta)])^2\right\} \qquad (B.3.17)$$
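A minimal sketch of minimizing (B.3.17) with an increasing sequence of d values; the objective function and constraints are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

H = lambda th: (th[0] - 1) ** 2 + (th[1] - 2) ** 2
q1 = lambda th: np.array([th[0] + th[1] - 2.0])    # equality: q1(theta) = 0
q2 = lambda th: np.array([th[0]])                  # inequality: q2(theta) >= 0

def H_R(th, d):                                    # penalized objective (B.3.17)
    return H(th) + d * (q1(th) @ q1(th) + np.sum(np.minimum(0.0, q2(th)) ** 2))

th = np.zeros(2)
for d in [1.0, 10.0, 100.0, 1000.0]:               # tighten the penalty
    th = minimize(lambda t: H_R(t, d), th, method="Nelder-Mead").x
print(th)                                          # approaches (0.5, 1.5)
```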
Standard step length selection procedures, which are in some cases not adequate in the presence of penalty or barrier functions, are usually
applied in ready-for-use computer programs. It is possible that a long trial step
is carried out beyond the barrier at the boundary of the feasible region. This can
cause trouble especially if the original objective function is not defined outside
the feasible region. Also, the use of the simple projection method together with
a quasi-Newton method cannot be recommended, since frequent interruptions
of the unconstrained iteration process hinder a successful approximation of the
Hessian.
Consequently, Gauss-type algorithms should be used in the presence of in-
equality constraints, if possible. In a first-round minimization the projection
method can be applied to guarantee a feasible estimate. If the minimization
procedure terminates at the boundary of the feasible region, other methods
should be tried. If the structure of the objective function prohibits the use of
Gauss-type methods, then a quasi-Newton method with a special unidimen-
sional search procedure [Lasdon, Fox, and Ratner (1973)] or an algorithm
designed particularly for the minimization of objective functions with penalty
or barrier function terms could be applied [Lasdon (1972)].
In general it is desirable to utilize any possible simplifications that may for
instance result from a reparameterization of the objective function. Also, if the
objective functions or restrictions have a particular form, simplifications may
be possible. For a discussion of linear constraints see Powell (1982b) and the
references given in that article. Of course, at the extreme end, where the
normal equations and restrictions are linear, the above methods are not re-
quired (see Chapters 2 and 3).
B.4 REFERENCES
tion Problems," in D. Jacobs, ed., The State of the Art in Numerical Analy-
sis, Academic, London, 365-407.
Fletcher, R. and M. J. D. Powell (1963) "A Rapidly Convergent Descent
Method for Minimization," The Computer Journal, 6, 163-168.
Fletcher, R. and C. M. Reeves (1964) "Function Minimization by Conjugate
Gradients," The Computer Journal, 7, 149-154.
Goldfeld, S. M. and R. E. Quandt (1972) Nonlinear Methods in Econometrics,
North-Holland, Amsterdam.
Goldfeld, S. M., R. E. Quandt, and H. F. Trotter (1966) "Maximization by
Quadratic Hill-Climbing," Econometrica, 34, 541-551.
Greenstadt, J. (1967) "On the Relative Efficiencies of Gradient Methods,"
Mathematics of Computation, 21, 360-367.
Himmelblau, D. M. (1972a) Applied Nonlinear Programming, McGraw-Hill,
New York.
Himmelblau, D. M. (1972b) "A Uniform Evaluation of Unconstrained Optimi-
zation Techniques," in F. R. Lootsma, ed., Numerical Methods for Non-
linear Optimization, Academic, London, 69-97.
Hooke, R. and T. A. Jeeves (1961) "Direct Search Solution of Numerical and
Statistical Problems," Journal of the Association for Computing Machinery,
8, 212-229.
Huang, H. Y. (1970) "A Unified Approach to Quadratically Convergent Al-
gorithms for Function Minimization," Journal of Optimization Theory and
Applications, 5, 405-423.
Huang, H. Y. and A. V. Levy (1970) "Numerical Experiments on Quadrati-
cally Convergent Algorithms for Function Minimization," Journal of Opti-
mization Theory and Applications, 6, 269-282.
Jennrich, R. I. and P. F. Sampson (1968) "Application of Stepwise Regression
to Non-Linear Estimation," Technometrics, 10, 63-72.
Judge, G. G., R. C. Hill, W. E. Griffiths, H. Lütkepohl, and T. C. Lee (1982)
Introduction to the Theory and Practice of Econometrics, Wiley, New York.
Künzi, H. P. and W. Oettli (1969) Nichtlineare Optimierung: Neuere Verfahren,
Bibliographie, Lecture Notes in Operations Research and Mathematical Systems,
16, Springer, Berlin.
Lasdon, L. S. (1972) "An Efficient Algorithm for Minimizing Barrier and Penalty
Functions," Mathematical Programming, 2, 65-106.
Lasdon, L. S., R. L. Fox, and M. W. Ratner (1973) "An Efficient One-Dimen-
sional Search Procedure for Barrier Functions," Mathematical Program-
ming, 4, 279-296.
Lootsma, F. R. (1967) "Logarithmic Programming: A Method of Solving Nonlinear-Programming
Problems," Philips Research Reports, 22, 329-344.
Lootsma, F. R., ed. (1972a) Numerical Methods for Non-linear Optimization,
Academic, London.
Lootsma, F. R. (1972b) "A Survey of Methods for Solving Constrained Mini-
mization Problems via Unconstrained Minimization," in F. R. Lootsma, ed.,
Numerical Methods for Non-linear Optimization, Academic, London, 313-
347.
Appendix C
The Kalman Filter

C.1 INTRODUCTION
Kalman Filters are used in different parts of this book. They provide a rather
general framework in which to handle a range of different problems. In the
following discussion we will summarize the basic formulas. More details can be
found, for example, in Kalman (1960), Kalman and Bucy (1961), Hannan
(1970), Harvey (1981, 1982), Priestley (1981), Anderson and Moore (1979), and
Meinhold and Singpurwalla (1983).
Suppose interest centers on a (K x 1) vector γ_t of state variables that are
generated by the process

\gamma_t = \Phi_t \gamma_{t-1} + c_t + v_t, \qquad t = 1, 2, \ldots, T \qquad (C.1.1)

where Φ_t is a (K x K) matrix, c_t is a (K x 1) vector, and v_t is (K x 1) vector
white noise with E[v_t] = 0 and E[v_t v_t'] = W_t, and we assume that the v_t are
uncorrelated with the initial state vector γ_0. Note that the above definition of
vector white noise differs from the definition given in Chapter 16 in that the
covariance matrix may depend on t. Equation C.1.1 is called the transition
equation.

Suppose that γ_t cannot be observed directly. Instead the (N x 1) vector

y_t = Z_t \gamma_t + s_t, \qquad t = 1, 2, \ldots, T \qquad (C.1.2)

is observed, where Z_t is an (N x K) matrix and s_t is (N x 1) vector white noise
with covariance matrix S_t, uncorrelated with the v_t and with γ_0. Equation
C.1.2 is called the measurement equation. As an example, consider the
ARMA(2,1) process

y_t = \theta_1 y_{t-1} + \theta_2 y_{t-2} + v_t + a v_{t-1} \qquad (C.1.3)

which can be written as

\begin{bmatrix} y_t \\ \theta_2 y_{t-1} + a v_t \end{bmatrix} =
\begin{bmatrix} \theta_1 & 1 \\ \theta_2 & 0 \end{bmatrix}
\begin{bmatrix} y_{t-1} \\ \theta_2 y_{t-2} + a v_{t-1} \end{bmatrix} +
\begin{bmatrix} 1 \\ a \end{bmatrix} v_t \qquad (C.1.4)

y_t = \begin{bmatrix} 1 & 0 \end{bmatrix}
\begin{bmatrix} y_t \\ \theta_2 y_{t-1} + a v_t \end{bmatrix} \qquad (C.1.5)

Hence (C.1.4) and (C.1.5) constitute a state space form with

\gamma_t = \begin{bmatrix} y_t \\ \theta_2 y_{t-1} + a v_t \end{bmatrix}, \quad
\Phi_t = \begin{bmatrix} \theta_1 & 1 \\ \theta_2 & 0 \end{bmatrix}, \quad
c_t = 0, \quad
W_t = \sigma_v^2 \begin{bmatrix} 1 \\ a \end{bmatrix} \begin{bmatrix} 1 & a \end{bmatrix}, \quad
Z_t = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad
S_t = 0

where σ_v^2 is the variance of v_t. More generally, the ARMA(K, K - 1) process

y_t = \theta_1 y_{t-1} + \cdots + \theta_K y_{t-K} + v_t + a_1 v_{t-1} + \cdots + a_{K-1} v_{t-K+1} \qquad (C.1.6)

can be written in state space form as

\gamma_t = \begin{bmatrix}
\theta_1 & 1 & 0 & \cdots & 0 \\
\theta_2 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \ddots & \vdots \\
\theta_{K-1} & 0 & 0 & \cdots & 1 \\
\theta_K & 0 & 0 & \cdots & 0
\end{bmatrix} \gamma_{t-1} +
\begin{bmatrix} 1 \\ a_1 \\ \vdots \\ a_{K-1} \end{bmatrix} v_t \qquad (C.1.7a)

where the transition matrix is (K x K) and the state vector γ_t and the noise
coefficient vector are (K x 1), together with the measurement equation

y_t = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix} \gamma_t \qquad (C.1.7b)

where the row vector is (1 x K).
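To make the correspondence concrete, the following minimal sketch (illustrative parameter values assumed; not code from the text) builds the state space matrices of (C.1.4) and (C.1.5) and simulates the transition equation to verify that the first state component reproduces y_t:

```python
import numpy as np

# State space matrices for the ARMA(2,1) process in (C.1.4)-(C.1.5),
# with illustrative parameter values (assumed, not from the text).
theta1, theta2, a, sigma2 = 0.5, -0.3, 0.4, 1.0

Phi = np.array([[theta1, 1.0],
                [theta2, 0.0]])          # (K x K) transition matrix
R = np.array([[1.0], [a]])               # noise enters the state as R * v_t
W = sigma2 * (R @ R.T)                   # covariance of the transition noise
Z = np.array([[1.0, 0.0]])               # (1 x K) measurement matrix; S_t = 0

# Simulate the transition equation (starting from a zero state and zero
# presample disturbances) and read off y_t via the measurement equation:
rng = np.random.default_rng(0)
gamma = np.zeros((2, 1))
for _ in range(5):
    v = rng.normal(scale=np.sqrt(sigma2))
    gamma = Phi @ gamma + R * v          # transition equation (C.1.4)
    print((Z @ gamma).item())            # the observed y_t, as in (C.1.5)
```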
For the general state space model (C.1.1) and (C.1.2), the Kalman Filter
consists of the prediction equations

\gamma_{t|t-1} = \Phi_t \gamma_{t-1|t-1} + c_t \qquad (C.1.8a)

and

\Omega_{t|t-1} = \Phi_t \Omega_{t-1|t-1} \Phi_t' + W_t, \qquad t = 1, \ldots, T \qquad (C.1.8b)

where γ_{t|t-1} is the optimal linear prediction of γ_t given y_{t-1}, . . . , y_1,
γ_{t-1|t-1} is the optimal linear estimator of γ_{t-1} based on the same information,
and Ω_{t|t-1} and Ω_{t-1|t-1} are the corresponding mean square error matrices.
When y_t becomes available the updating equations

\gamma_{t|t} = \gamma_{t|t-1} + \Omega_{t|t-1} Z_t' F_t^{-1} e_t \qquad (C.1.9a)

and

\Omega_{t|t} = \Omega_{t|t-1} - \Omega_{t|t-1} Z_t' F_t^{-1} Z_t \Omega_{t|t-1}, \qquad t = 1, \ldots, T \qquad (C.1.9b)

can be applied, where

F_t = Z_t \Omega_{t|t-1} Z_t' + S_t \qquad (C.1.10)

is the covariance matrix of the one-step-ahead prediction error

e_t = y_t - Z_t \gamma_{t|t-1} \qquad (C.1.11)
The recursions in (C.1.8) and (C.1.9) are called the Kalman Filter and can be
solved if y_1, . . . , y_T, γ_0, Ω_0, Φ_t, c_t, W_t, Z_t, and S_t are given.

Working backwards, the Kalman Filter can also be used for smoothing.
Since the estimator γ_{T|T} uses all the sample information, the recursion is
started at the end of the sample period using the smoothing equations

\gamma_{t|T} = \gamma_{t|t} + C_t (\gamma_{t+1|T} - \gamma_{t+1|t}) \qquad (C.1.12a)

and

\Omega_{t|T} = \Omega_{t|t} + C_t (\Omega_{t+1|T} - \Omega_{t+1|t}) C_t', \qquad t = T - 1, \ldots, 1 \qquad (C.1.12b)

where

C_t = \Omega_{t|t} \Phi_{t+1}' \Omega_{t+1|t}^{-1}

Smoothing is sometimes called signal extraction, and the above recursions pro-
vide an optimal solution to this problem.
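The forward recursions (C.1.8) through (C.1.11) are straightforward to implement. The following sketch (a minimal version assuming time-invariant Φ, c, W, Z, and S; the time-varying case merely indexes these matrices by t) returns the filtered estimates γ_{t|t}:

```python
import numpy as np

def kalman_filter(y, Phi, c, W, Z, S, gamma0, Omega0):
    """Forward recursions (C.1.8)-(C.1.11) for time-invariant system matrices."""
    gamma, Omega = gamma0, Omega0
    estimates = []
    for yt in y:
        # prediction equations (C.1.8a) and (C.1.8b)
        gamma_p = Phi @ gamma + c
        Omega_p = Phi @ Omega @ Phi.T + W
        # one-step-ahead prediction error (C.1.11) and its covariance (C.1.10)
        e = yt - Z @ gamma_p
        F = Z @ Omega_p @ Z.T + S
        # updating equations (C.1.9a) and (C.1.9b)
        gain = Omega_p @ Z.T @ np.linalg.inv(F)
        gamma = gamma_p + gain @ e
        Omega = Omega_p - gain @ Z @ Omega_p
        estimates.append(gamma)
    return estimates
```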
C.2 AN APPLICATION

One important use of the Kalman Filter is the evaluation of the log likelihood
function of a model that can be cast in state space form. If the y_t are jointly
normally distributed, the log likelihood function of the sample is

\ln \ell = -\frac{KT}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_y| - \frac{1}{2} (y - \mu)' \Sigma_y^{-1} (y - \mu) \qquad (C.2.1)

where y' = (y_1', y_2', . . . , y_T') and μ and Σ_y are the mean and covariance matrix
of y. Usually μ and Σ_y depend on unknown parameters that are to be estimated.
Since Σ_y is a (KT x KT) matrix, its inversion may be rather expensive if the
sample size and/or the dimension of y_t is large. Therefore, if the model for y_t can
be written in state space form, from a computational point of view, the follow-
ing equivalent form of (C.2.1) may be easier to handle:

\ln \ell = -\frac{KT}{2} \ln 2\pi - \frac{1}{2} \sum_{t=1}^{T} \ln |F_t| - \frac{1}{2} \sum_{t=1}^{T} e_t' F_t^{-1} e_t \qquad (C.2.2)

where F_t and e_t are as defined in (C.1.10) and (C.1.11), respectively. This form
of the log likelihood function follows by noting that the joint density of the
sample can be written as the product of conditional densities

f(y_1, \ldots, y_T) = f(y_1) f(y_2 | y_1) \cdots f(y_T | y_{T-1}, \ldots, y_1)

where the conditional distribution of y_t given y_1, . . . , y_{t-1} is normal with
mean Z_t γ_{t|t-1} and covariance matrix F_t.
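Since the filter delivers e_t and F_t as by-products, (C.2.2) can be evaluated in a single forward pass. A minimal sketch, under the same time-invariant assumption as the filter sketch above:

```python
import numpy as np

def log_likelihood(y, Phi, c, W, Z, S, gamma0, Omega0):
    """Evaluate (C.2.2) by one pass of the recursions (C.1.8)-(C.1.11)."""
    N = Z.shape[0]                 # dimension of y_t (the K in C.2.2)
    gamma, Omega = gamma0, Omega0
    loglik = -0.5 * N * len(y) * np.log(2.0 * np.pi)
    for yt in y:
        gamma_p = Phi @ gamma + c                   # (C.1.8a)
        Omega_p = Phi @ Omega @ Phi.T + W           # (C.1.8b)
        e = yt - Z @ gamma_p                        # (C.1.11)
        F = Z @ Omega_p @ Z.T + S                   # (C.1.10)
        Finv = np.linalg.inv(F)
        loglik -= 0.5 * (np.log(np.linalg.det(F)) + (e.T @ Finv @ e).item())
        gain = Omega_p @ Z.T @ Finv                 # (C.1.9a) and (C.1.9b)
        gamma = gamma_p + gain @ e
        Omega = Omega_p - gain @ Z @ Omega_p
    return loglik
```

Maximizing this function numerically over the unknown parameters, for example with the quasi-Newton methods of Appendix B, yields maximum likelihood estimates without inverting the (KT x KT) matrix in (C.2.1).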
C.3 REFERENCES

Anderson, B. D. O. and J. B. Moore (1979) Optimal Filtering, Prentice-Hall,
  Englewood Cliffs, NJ.
Hannan, E. J. (1970) Multiple Time Series, Wiley, New York.
Harvey, A. C. (1981) Time Series Models, Philip Allan, Oxford.
Kalman, R. E. (1960) "A New Approach to Linear Filtering and Prediction
  Problems," Journal of Basic Engineering, 82, 35-45.
Kalman, R. E. and R. S. Bucy (1961) "New Results in Linear Filtering and
  Prediction Theory," Journal of Basic Engineering, 83, 95-108.
Meinhold, R. J. and N. D. Singpurwalla (1983) "Understanding the Kalman
  Filter," The American Statistician, 37, 123-127.
Priestley, M. B. (1981) Spectral Analysis and Time Series, Academic, London.