
Single and Multiple Equation GMM: What do we do in the linear model when orthogonality no longer holds?

Motivation: In our previous linear model, the most important assumption we made was orthogonality between the error term and the regressors
(i.e. strict exogeneity or predetermined regressors), without which the OLS estimator is not even consistent for the desired parameter
(i.e. from our model y_i = x_i'β + ε_i): Endogeneity Bias!
Since in economics the orthogonality condition is often not satisfied, we develop methods here to deal with endogenous regressors,
called Generalized Method of Moments (GMM), which includes OLS as a special case.


Single Equation GMM: Here we relax the assumptions even further and take away the predetermined-regressor assumption. Instead, we have
orthogonality conditions from instruments.
I. Assumptions:
3.1 Linearity: The data we observe come from underlying RVs {y_i (1x1), x_i (dx1)} with
(this is the equation we want to estimate)
y_i = x_i'β + ε_i (i = 1, 2, ..., n)
3.2 Ergodic Stationarity:
Let z_i be an M-dimensional vector of instruments, and let w_i be the unique and nonconstant elements of (y_i, x_i, z_i).
{wi} is jointly stationary and ergodic.
3.3 Orthogonality Condition
All the M variables in zi are predetermined in the sense that they are all orthogonal to the current error term:
E(z_im ε_i) = 0 for all i and m = 1, ..., M, i.e. E[z_i (y_i - x_i'β)] = 0, i.e. E(g_i) = 0, where g_i = z_i (y_i - x_i'β) = z_i ε_i
Note on Moment Conditions: E[z_i(Mx1) (y_i - x_i'β)(1x1)](Mx1) = 0. These are the M moment conditions.
Note on Instruments vs. Regressors: Even though we denote regressors by x_i and instruments by z_i, this does not mean that
they cannot share the same variables. Regressors that are predetermined are instruments, and regressors that are not
predetermined are endogenous regressors.
Note on 1 as an Instrument: Typically we will include 1 as an instrument, so E(ε_i) = 0!
3.4 Rank Condition for Identification: Guarantees there's a unique solution to the system of equations (footnote 1)
The M x d matrix E(z_i x_i') is of full column rank (equivalently, E(x_i z_i') is of full row rank), with M >= d (# of equations >= # of unknowns).
We denote this matrix by Σ_zx.
3.5 Martingale Difference with Finite Second Moments: Assumption for Asymptotic Normality
g_i = z_i ε_i is a martingale difference sequence with finite second moments; g_i is the sequence of moment conditions!
{g_i} is a martingale difference sequence (so E(g_i) = 0 with E(g_i | g_i-1, g_i-2, ..., g_1) = 0 for i >= 2), i.e. no serial correlation in g_i.
The M x M matrix of second moments E(g_i g_i') is nonsingular.
So,
√n g_bar = (1/√n) Σ_{i=1}^n g_i →d N(0, E(g_i g_i')), and S ≡ Avar(g_bar) = E(g_i g_i'),
by 3.2 (g_i is ergodic stationary) and the Ergodic Stationary Martingale Differences CLT, where g_bar = (1/n) Σ_{i=1}^n g_i.

Note: Again, if z_i includes a constant term, then ε_i is an MDS, so there is no autocorrelation in ε_i.


Note: The same 4 comments as in 2.5 apply here (see footnote 2).
Addl:
3.6 Finite Fourth Moments: (For consistent estimation of S)
E[(x_ik z_ij)^2] exists and is finite for all k = 1, ..., d and j = 1, 2, ..., M.
3.7 Conditional Homoskedasticity:
E(ε_i^2 | z_i) = σ^2
Footnote 1:
This is called the rank condition for identification for the following reason (the condition guarantees a unique solution; see the derivation below).
We can rewrite the moment/orthogonality condition as a system of M simultaneous equations:
E[g(w_i; β)] = 0 (Mx1), where g(w_i; b) = z_i (y_i - x_i'b), w_i is the unique and nonconstant elements of (y_i, x_i, z_i), and β is the coefficient vector.
The moment condition means that the true coefficient vector β is a solution to this system of M simultaneous equations. Assumptions 3.1 - 3.3 guarantee that
a solution to the moment conditions exists; the coefficient vector (or the equation) is identified if the solution is unique. A necessary
and sufficient condition for a unique solution to this system of simultaneous equations is that Σ_zx = E(z_i x_i') has full column rank.
Derivation: We want β to be the unique solution to E[g(w_i; b)] = 0.
β is unique iff for all b ≠ β, E_P[g(w_i; b)] ≠ E_P[g(w_i; β)], where P is the underlying distribution that generated the data.
E_P[z_i (y_i - x_i'b)] ≠ E_P[z_i (y_i - x_i'β)]  iff  E_P[z_i x_i'b] ≠ E_P[z_i x_i'β]  iff  E_P[z_i x_i'](Mxd) (b - β) ≠ 0,
which holds for every b ≠ β iff E_P[z_i x_i'] = Σ_zx has full column rank (with M >= d).
If not, then there exists a nonzero vector (b - β) in Ker(E_P[z_i x_i']): the columns are linearly dependent, so there is a nontrivial linear
combination (b - β) such that E_P[z_i x_i'](b - β) = 0 (the solution is not unique!).
Order Condition for Identification: M >= d
A necessary condition (embedded in the derivation above) is that M >= d (# of equations >= # of unknowns): the order condition for identification.
We can interpret this in 3 ways: # predetermined variables >= # regressors, or # orthogonality conditions >= # parameters, or # equations >= # unknowns.
If the order condition is not satisfied, then the equation (or the parameter vector) is not identified.
We say that the equation is
1. Overidentified if the rank condition is satisfied and M > d
2. Exactly identified / just identified if the rank condition is satisfied and M = d
3. Underidentified / not identified if the rank condition is not satisfied (in particular if M < d)
Footnote 2:
IID is a special case: if the data are iid, we only need the Lindeberg-Levy CLT.
If the instruments include a constant, then the error term is a martingale difference sequence (and a fortiori serially uncorrelated).
The assumption is hard to interpret directly, so we interpret an easier (sufficient) condition: E(ε_i | ε_i-1, ε_i-2, ..., ε_1, z_i, z_i-1, ..., z_1) = 0.
Besides being a martingale difference sequence (and therefore serially uncorrelated), the error term is orthogonal not only to the current but also to the past instruments.
Since g_i g_i' = ε_i^2 z_i z_i', S = E(g_i g_i') is a matrix of fourth moments, so consistent estimation of S will require a fourth-moment assumption (assumption 3.6).

II. What do Assumptions Imply about Properties of Instruments?


1. Orthogonality condition: instruments are orthogonal to the errors (and uncorrelated with them if 1 is included as an instrument, so that E(ε_i) = 0)
2. Rank condition: (non-constant) instruments are correlated with the endogenous variables (footnote 3)
3. Instruments uncorrelated with
Note on Orthogonality vs. Covariance:
We can always think of orthogonality between error and instrument as zero covariance, provided that one of them is de-meaned:
Cov(x, e) = Cov(x, e - E(e)) = E(x(e - E(e))), and similarly we can rewrite the model as y = b_0 + E(e) + b_1 x + [e - E(e)].
III. Using 1 as an instrument: 2 Assumptions Made
It is common practice to use 1 as an instrument; in doing so, however, we are making 2 important assumptions:
1. E(ε_i) = 0 (this comes from the orthogonality condition)
2. ε_i is an MDS, so there is no serial correlation in ε_i

Footnote 3:
Suppose the regressors are (1, x_i, z_i)' with x_i endogenous, and the instruments are (1, z_2i, z_i)'.
Then the moment condition involves the matrix A = E[(1, z_2i, z_i)' (1, x_i, z_i)], whose entries are built from E(x_i), E(z_i), E(z_2i), E(z_2i x_i), E(z_2i z_i), E(z_i x_i), E(z_i^2).
If cov(z_2i, x_i) = 0, then the first column of A times E(x_i) equals the second column of A, so A is not of full column rank and the rank condition fails.
Hence the rank condition requires the (non-constant) instruments to be correlated with the endogenous regressors.

IV. Generalized Method of Moments Defined: We'll show that the IV estimator is a special GMM estimator (the exactly identified case).
1. General Setup: The true parameter β is defined as the solution to the moment conditions:
β solves E[g(W_i, b)] = 0, equivalently β = argmin_b E[g(W_i, b)]' W E[g(W_i, b)] for some p.d. weighting matrix W
(equivalently, β = argmax_b -E[g(W_i, b)]' W E[g(W_i, b)]).
By the Analogy Principle, b_GMM = argmin_b g_n(W_i, b)' W_n g_n(W_i, b),
where the sample moment conditions are defined to be g_n(W_i; b) = (1/n) Σ_{i=1}^n g(W_i, b).

2. Applied to Linear Model: Our model is y_i = x_i'β + ε_i with the moment condition E[z_i (y_i - x_i'β)] = 0.
o Expression for the sample moment condition in the linear model:
g_n(W_i, b) = (1/n) Σ_{i=1}^n z_i (y_i - x_i'b) = (1/n) Σ_{i=1}^n z_i y_i - [(1/n) Σ_{i=1}^n z_i(Mx1) x_i'(1xd)] b = s_zy(Mx1) - S_zx(Mxd) b(dx1)

o Method of Moments: If the equation is exactly identified (M = d), there exists a unique b such that g_n(W_i, b) = 0. If Σ_zx is
invertible, then for sufficiently large n, S_zx →p Σ_zx by the ergodic theorem, and S_zx is invertible with probability 1.
So, with a large sample the system of simultaneous equations has the unique solution given by
the MM estimator:
b_IV = S_zx^(-1) s_zy = [(1/n) Σ_{i=1}^n z_i x_i']^(-1) (1/n) Σ_{i=1}^n z_i y_i (the IV estimator with z_i as instruments, defined for M = d) (footnote 4)
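A minimal numerical sketch of this just-identified MM/IV estimator (the data and function names are hypothetical, not from the notes; numpy assumed):

import numpy as np

# y: (n,) outcomes, X: (n, d) regressors, Z: (n, M) instruments with M = d
def iv_estimator(y, X, Z):
    n = len(y)
    S_zx = Z.T @ X / n              # sample analogue of E(z_i x_i')
    s_zy = Z.T @ y / n              # sample analogue of E(z_i y_i)
    return np.linalg.solve(S_zx, s_zy)   # b_IV = S_zx^(-1) s_zy

With Z = X (all regressors predetermined) this reduces to the OLS normal equations, which is the point made in footnote 4.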

o General Method of Moments:
If the equation is overidentified (M > d), there generally does not exist a b such that g_n(W_i, b) = 0 exactly, so we choose b to
minimize the distance of g_n(W_i, b) from 0, i.e. minimize g_n(W_i, b)' W_n g_n(W_i, b) (for some choice of p.d. weighting matrix W_n that converges in
probability to some p.d. W) (footnote 5):
b_GMM = argmin_{b in R^d} g_n(W_i, b)' W_n(MxM) g_n(W_i, b)

Note: If the equation is just identified, then regardless of the weighting matrix, the GMM estimator = the IV estimator numerically
3. GMM Estimator and Sampling Error
o GMM Estimator:
b_GMM = (S_zx' W_n S_zx)^(-1) S_zx' W_n s_zy (footnote 6)
(By 3.2 and 3.4, S_zx has full column rank for sufficiently large n with probability 1, so S_zx' W_n S_zx is invertible) (footnote 7)
If M = d, then S_zx is an M x M invertible square matrix and the GMM estimator reduces to the IV estimator regardless of W_n! (footnote 8)
(Invertibility follows from the full column rank assumption.) Also note that this means OLS is just an IV estimator!
o Sampling Error: Our model is y_i = x_i'β + ε_i, so
b_GMM - β = (S_zx' W_n S_zx)^(-1) S_zx' W_n g_bar, where g_bar = (1/n) Σ z_i ε_i (footnote 9)

Footnote 4: The IV estimator is defined for the EXACTLY IDENTIFIED case (i.e. the case where there are as many instruments as regressors). If z_i = x_i, i.e. all the regressors
are predetermined (orthogonal to the contemporaneous error term), then this boils down to the OLS estimator. So OLS is a special case of the MM estimator, and IV and OLS are both
special cases of GMM.
Footnote 5: The quadratic form g_n(W_i, b)' W_n g_n(W_i, b) gives us a 1x1 real number over which we can define the minimization problem. Otherwise it would be
impossible to minimize over the M-dimensional vector of moment conditions g_n(W_i, b).
Footnote 6 (FOC): b_GMM = argmin_b (1/2) g_n(W_i, b)' W_n g_n(W_i, b) = argmin_b (1/2) (s_zy - S_zx b)' W_n (s_zy - S_zx b).
Assuming interiority, the FOC is d/db [(1/2)(s_zy - S_zx b)' W_n (s_zy - S_zx b)] = -S_zx' W_n (s_zy - S_zx b) = 0, so S_zx' W_n s_zy = S_zx' W_n S_zx b.
By Assumptions 3.2 and 3.4, S_zx is of full column rank for sufficiently large n with probability 1, so S_zx' W_n S_zx is invertible (W_n p.d.), and
b_GMM = (S_zx' W_n S_zx)^(-1) S_zx' W_n s_zy.
Footnote 7: Claim: if A (m x n, m >= n) has full column rank and W (m x m) is p.d., then A'WA is invertible.
Proof: Suppose not. Then there exists a nonzero n x 1 vector c s.t. c'A'WAc = 0. Let d = Ac; d is nonzero (since A has full column rank, no
nontrivial linear combination of its columns gives the zero vector), and d'Wd = 0. This contradicts the assumption that W is p.d.
Footnote 8: If M = d, S_zx is square and invertible, so
b_GMM = (S_zx' W_n S_zx)^(-1) S_zx' W_n s_zy = S_zx^(-1) W_n^(-1) (S_zx')^(-1) S_zx' W_n s_zy = S_zx^(-1) s_zy = b_IV,
since (S_zx' W_n S_zx)^(-1) = S_zx^(-1) W_n^(-1) (S_zx')^(-1) when S_zx is square and invertible.

V. Large Sample Properties of GMM and Efficient GMM


(We established these results more generally before using the multivariate mean value theorem. Here we can be more specific because
we impose linearity: the moment condition is linear in b. Also note that we index the estimator by the weighting matrix W_n, with n denoting the sample size.)
1. Consistency (footnote 10): Under Assumptions 3.1 - 3.4, b_GMM(W_n) →p β.
2. Asymptotic Normality (footnote 11): Under Assumptions 3.1 - 3.5,
√n (b_GMM(W_n) - β) →d N(0, V),
where V = Avar(b_GMM(W_n)) = (Σ_zx' W Σ_zx)^(-1) Σ_zx' W S W Σ_zx (Σ_zx' W Σ_zx)^(-1)
= (E(z_i x_i')' W E(z_i x_i'))^(-1) E(z_i x_i')' W E(g_i g_i') W E(z_i x_i') (E(z_i x_i')' W E(z_i x_i'))^(-1), where W = plim W_n.
3. Consistent Estimate of Avar(b_GMM(W_n)) (footnote 12):
Suppose there exists a consistent estimator S_hat of S(MxM) = E(g_i g_i'). Then, under 3.2, Avar(b_GMM(W_n)) is consistently estimated by
Avar_hat(b_GMM) = (S_zx' W_n S_zx)^(-1) S_zx' W_n S_hat W_n S_zx (S_zx' W_n S_zx)^(-1).
4. Consistency of s^2 (the estimate of the variance of the true error is consistent) (footnote 13):
For any consistent estimator b(W_n) of β, define ε_hat_i ≡ y_i - x_i' b(W_n). Under 3.1, 3.2, and assuming E(x_i x_i') exists and is finite,
(1/n) Σ_{i=1}^n ε_hat_i^2 →p E(ε_i^2), provided E(ε_i^2) exists and is finite.
5. Consistent estimation of S: We've assumed S_hat exists thus far; how do we obtain a consistent estimator of S from the sample (y, X, Z)?
Suppose the coefficient estimate b used for calculating the residuals is consistent for β, and suppose S = E(g_i g_i') exists and is
finite. Then, under Assumptions 3.1, 3.2, and 3.6, S_hat = (1/n) Σ_{i=1}^n ε_hat_i^2 z_i z_i' is consistent for S. (footnote 14)
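A small computational sketch of this heteroskedasticity-robust S_hat (hypothetical names, numpy assumed; b is any consistent preliminary estimate):

import numpy as np

def S_hat(y, X, Z, b):
    e = y - X @ b                             # residuals from a consistent estimate b
    # (1/n) sum_i e_i^2 z_i z_i'  (finite fourth moments, assumption 3.6, needed for consistency)
    return (Z * e[:, None]**2).T @ Z / len(y)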

Footnotes 9 and 10 (sampling error and consistency): By 3.1,
b_GMM = (S_zx' W_n S_zx)^(-1) S_zx' W_n s_zy = (S_zx' W_n S_zx)^(-1) S_zx' W_n (1/n) Σ z_i (x_i'β + ε_i)
= β + (S_zx' W_n S_zx)^(-1) S_zx' W_n (1/n) Σ z_i ε_i = β + (S_zx' W_n S_zx)^(-1) S_zx' W_n g_bar.
Since g_bar = (1/n) Σ z_i ε_i →p E(z_i ε_i) = E(g_i) = 0 by the ergodic theorem and 3.3, and S_zx →p Σ_zx, W_n →p W,
the sampling error converges in probability to 0, so b_GMM →p β.
Footnote 11 (asymptotic normality): Continuing from above,
√n (b_GMM - β) = (S_zx' W_n S_zx)^(-1) S_zx' W_n √n g_bar.
√n g_bar = (1/√n) Σ z_i ε_i →d N(0, E(g_i g_i')) by the Ergodic Stationary Martingale Differences CLT; S_zx →p Σ_zx by the ergodic theorem; W_n →p W by construction.
So √n (b_GMM - β) →d N(0, (Σ_zx' W Σ_zx)^(-1) Σ_zx' W E(g_i g_i') W Σ_zx (Σ_zx' W Σ_zx)^(-1)) by the CMT and Slutsky's theorem.
Footnote 12: This follows from the above with standard asymptotic tools.
Footnote 13: This proof is very similar to 3(d) from the previous notes.
ε_hat_i = y_i - x_i'b = ε_i - x_i'(b - β), so ε_hat_i^2 = ε_i^2 - 2 ε_i x_i'(b - β) + (b - β)' x_i x_i' (b - β), and
(1/n) Σ ε_hat_i^2 = (1/n) Σ ε_i^2 - 2(b - β)' s_xε + (b - β)' S_xx (b - β), where s_xε = (1/n) Σ x_i ε_i and S_xx = (1/n) Σ x_i x_i'.
2(b - β)' s_xε →p 0 since s_xε →p some finite vector and b - β →p 0;
(b - β)' S_xx (b - β) →p 0 since b - β →p 0 and S_xx →p Σ_xx, which is finite by assumption.
Footnote 14: The proof is similar to footnote 13: multiply by z_i z_i' inside the sums and proceed the same way (this is where the fourth-moment assumption 3.6 is needed).

VI. Efficient GMM: How do we choose W_n to minimize Avar(b_GMM(W_n))? Use the inverse of the variance of the moment conditions.
1. The efficient weighting matrix is W* = S^(-1) = E(g_i g_i')^(-1) (in practice W_n = S_hat^(-1) with S_hat →p S).
For any weighting matrix W, Avar(b_GMM(W)) >= Avar(b_GMM(S^(-1))) = (Σ_zx' S^(-1) Σ_zx)^(-1) = (E(z_i x_i')' E(g_i g_i')^(-1) E(z_i x_i'))^(-1).
The efficient GMM estimator is b_GMM(S_hat^(-1)) = (S_zx' S_hat^(-1) S_zx)^(-1) S_zx' S_hat^(-1) s_zy.
2. Large Sample Properties of the Efficient GMM Estimator
From the above, efficient GMM is consistent and asymptotically normal, with asymptotic variance and consistent variance estimate
Avar(b_GMM(S_hat^(-1))) = (Σ_zx' S^(-1) Σ_zx)^(-1), Avar_hat(b_GMM(S_hat^(-1))) = (S_zx' S_hat^(-1) S_zx)^(-1).

3. Hypothesis Testing: Robust t-Ratio and Wald Statistic
From below (Section VII), the formulas for the robust t and Wald statistics become:
t_l = (b(S_hat^(-1))_l - β_l_bar) / Robust SE_l, where Robust SE_l = sqrt((1/n) [(S_zx' S_hat^(-1) S_zx)^(-1)]_ll),
W = n a(b(S_hat^(-1)))' [A(b(S_hat^(-1))) (S_zx' S_hat^(-1) S_zx)^(-1) A(b(S_hat^(-1)))']^(-1) a(b(S_hat^(-1))).

4. 2-Step Efficient GMM Procedure: How do we construct an efficient GMM estimator?
A. Pick some arbitrary weighting matrix (e.g. I) that converges in probability to a symmetric p.d. W, and obtain a preliminary consistent GMM
estimator b_GMM = (S_zx' W_n S_zx)^(-1) S_zx' W_n s_zy, which we will use to construct the optimal weighting matrix.
Note: Usually we set W_n = S_zz^(-1). Then b_GMM(S_zz^(-1)) = (S_zx' S_zz^(-1) S_zx)^(-1) S_zx' S_zz^(-1) s_zy. This is 2SLS.
Then, using the preliminary estimator b_GMM(S_zz^(-1)), we construct S_hat^(-1), where S_hat = (1/n) Σ_{i=1}^n ε_hat_i^2 z_i z_i'.
B. In the second step, the efficient GMM estimator is obtained as:
b_eff_GMM = (S_zx' S_hat^(-1) S_zx)^(-1) S_zx' S_hat^(-1) s_zy, with Avar_hat(b_eff_GMM) = (S_zx' S_hat^(-1) S_zx)^(-1).
Or in matrix notation: b(S_hat^(-1)) = [X'Z(Z'B_hat Z)^(-1) Z'X]^(-1) X'Z(Z'B_hat Z)^(-1) Z'y, where B_hat is the n x n diagonal matrix with ε_hat_i^2 on the diagonal
(in 2SLS, B_hat = σ_hat^2 I_n, and the scalar cancels).
Note: With this notation we can see that the efficient GMM is a GLS-type estimator!
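A sketch of the two-step procedure just described (hypothetical names; numpy assumed; standard errors would be sqrt(diag(avar)/n)):

import numpy as np

def two_step_gmm(y, X, Z):
    n = len(y)
    S_zx, s_zy = Z.T @ X / n, Z.T @ y / n
    # Step A: preliminary GMM with W_n = S_zz^(-1), i.e. 2SLS
    W1 = np.linalg.inv(Z.T @ Z / n)
    b1 = np.linalg.solve(S_zx.T @ W1 @ S_zx, S_zx.T @ W1 @ s_zy)
    # Build S_hat from the step-A residuals
    e = y - X @ b1
    S_hat = (Z * e[:, None]**2).T @ Z / n
    # Step B: efficient GMM with W_n = S_hat^(-1)
    W2 = np.linalg.inv(S_hat)
    b2 = np.linalg.solve(S_zx.T @ W2 @ S_zx, S_zx.T @ W2 @ s_zy)
    avar = np.linalg.inv(S_zx.T @ W2 @ S_zx)    # estimated Avar of the efficient estimator
    return b2, avar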

5. Note on Small Sample Properties
The efficient GMM estimator uses S_hat^(-1), a function of estimated fourth moments, as the weighting matrix. Generally, it takes a
substantially larger sample size to estimate fourth moments reliably (compared to first and second moments). Therefore, the efficient
GMM estimator has poorer small-sample properties than GMM estimators that do not use fourth moments in W_n.
The equally weighted GMM estimator with W_n = I generally outperforms the efficient GMM in terms of bias and variance in finite
samples!

VII. Hypothesis Testing


Prop: Robust t-Ratio and Wald Statistic (Testing Linear and Nonlinear Restrictions)
Suppose Assumptions 3.1 - 3.5 hold, and suppose there is available a consistent estimate S_hat of S.
Then, by the above, Avar_hat(b(W_n)) = (S_zx' W_n S_zx)^(-1) S_zx' W_n S_hat W_n S_zx (S_zx' W_n S_zx)^(-1).
And:
(a) Under the null hypothesis H0: β_k = β_k_bar,
t_k = √n (b_k(W_n) - β_k_bar) / sqrt(Avar_hat(b(W_n))_kk) = (b_k - β_k_bar) / sqrt((1/n) Avar_hat(b(W_n))_kk) = (b_k - β_k_bar) / Robust SE(b_k) →d N(0, 1) (footnote 15)
This t-ratio is the robust t-ratio because it uses a standard error that is robust to conditionally heteroskedastic errors.
(b) Under the null hypothesis H0: R(#r x d) β(dx1) = r(#r x 1), where R is a #r x d matrix with full row rank, #r <= d
(#r is the number of restrictions on β):
W = n (Rb(W_n) - r)' [R Avar_hat(b(W_n)) R']^(-1) (Rb(W_n) - r) →d χ²(#r) (footnote 16)
(c) Under a null hypothesis with #a restrictions (footnote 17):
H0: a(β) = 0 for some #a-dimensional vector-valued function a with continuous first derivatives,
such that A(β) (the #a x d matrix of first derivatives of a, evaluated at β) has full row rank.
Then
W = n a(b(W_n))' [A(b(W_n)) Avar_hat(b(W_n)) A(b(W_n))']^(-1) a(b(W_n)) →d χ²(#a) (footnote 18)
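A sketch of the robust Wald test in (b) for linear restrictions H0: Rβ = r (hypothetical inputs; numpy assumed, scipy only for the p-value):

import numpy as np
from scipy import stats

def wald_test(b, avar_hat, R, r, n):
    # W = n (Rb - r)' [R Avar_hat R']^(-1) (Rb - r)  ->d  chi2(#r)
    diff = R @ b - r
    W = n * diff @ np.linalg.solve(R @ avar_hat @ R.T, diff)
    return W, 1 - stats.chi2.cdf(W, df=R.shape[0])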

Hypothesis Testing by the Likelihood Ratio Principle

Footnote 15: By V.2 above, √n (b_k - β_k) →d N(0, Avar(b_k)); by V.3 above, Avar_hat(b_k) →p Avar(b_k).
Therefore, by Slutsky, √n (b_k - β_k) / sqrt(Avar_hat(b_k)) →d N(0, 1).
Footnote 16: We can rewrite W = (√n (Rb - r))' [R Avar_hat(b) R']^(-1) (√n (Rb - r)) ≡ c_n' Q_n^(-1) c_n.
Under the null, Rβ = r, so √n (Rb - r) = R √n (b - β).
From the asymptotic normality result above, √n (b - β) →d N(0, Avar(b)), so R √n (b - β) →d N(0, R Avar(b) R').
From the consistency of Avar_hat(b), R Avar_hat(b) R' →p R Avar(b) R'.
Recall: if x (m x 1) ~ N_m(μ, Σ), then (x - μ)' Σ^(-1) (x - μ) ~ χ²(m).
Here c_n →d c ~ N(0, R Avar(b) R'), so c_n' Q_n^(-1) c_n →d c' Q^(-1) c ~ χ²(#r).
Footnote 17: The full row rank condition is there so that the hypothesis is well-defined. This is the generalization of the requirement for linear restrictions Rβ = r that R be of full row rank.
Footnote 18: Under the null, a(β) = 0, so √n a(b) = √n (a(b) - a(β)).
By the asymptotic normality result above, √n (b - β) →d N(0, Avar(b)); by the Delta Method, √n (a(b) - a(β)) →d c ~ N(0, A(β) Avar(b) A(β)').
The rest follows as in the proof of (b) above.

VIII.
1. Test for Overidentifying Restrictions: If the equation is exactly identified, then it is possible to choose b* such that
g_n(b*) = 0 and J(b*, W_n) = n g_n(b*)' W_n g_n(b*) = 0 (b* is the IV estimator). If the equation is overidentified, then the distance
cannot be set to 0 exactly (there are more moment conditions than parameters), though we expect the minimized distance to
be close to 0. If we choose the efficient weighting matrix W_n such that plim W_n = S^(-1), then the minimized distance is asymptotically chi-squared.
Hansen's Test of Overidentifying Restrictions:
Suppose there is available a consistent estimator S_hat of S (= E(g_i g_i')). Under Assumptions 3.1 - 3.5,
J(b_GMM(S_hat^(-1)), S_hat^(-1)) = n g_n(b_GMM(S_hat^(-1)))' S_hat^(-1) g_n(b_GMM(S_hat^(-1))) →d χ²(M - d)

Note:
This says that the objective function evaluated at the estimator, i.e. the minimum distance, is asymptotically chi-squared.
This is a specification test, testing whether all the restrictions of the model (i.e. 3.1 3.5) are satisfied. Given a large enough sample,
if the J statistic is surprisingly large, then either the orthogonality condition (3.3) or the other assumptions (or both) are likely to be
false.
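A sketch of the J statistic computed at the efficient GMM estimate (hypothetical names, following the formula above; numpy assumed):

import numpy as np

def hansen_J(y, X, Z, b_eff, S_hat):
    n = len(y)
    g_n = Z.T @ (y - X @ b_eff) / n               # sample moments at the efficient estimate
    return n * g_n @ np.linalg.solve(S_hat, g_n)  # ->d chi2(M - d) under 3.1 - 3.5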
2. Testing a Subset of Orthogonality Conditions (Newey; Eichenbaum, Hansen, and Singleton)
Suppose Assumptions 3.1 - 3.5 hold. Let z_i1 be a subvector of z_i, and strengthen Assumption 3.4 by requiring that the rank condition
for identification be satisfied for z_i1 alone (so E(z_i1 x_i') is of full column rank). Then, for any consistent estimators S_hat of S and S_hat_11 of S_11,
C ≡ J - J_1 →d χ²(M - M_1), where M = #z_i (dimension of z_i) and M_1 = #z_i1 (dimension of z_i1),
J = n g_n(b*)' S_hat^(-1) g_n(b*), J_1 = n g_1n(b_1*)' S_hat_11^(-1) g_1n(b_1*),
where g_n(b) (Mx1) is partitioned into g_1n(b) (M_1 x 1) and g_2n(b) (M_2 x 1), and S (MxM) is partitioned conformably into blocks S_11, S_12, S_21, S_22.

IX. Implications of Conditional Homoskedasticity: Assumption 3.7, E(ε_i^2 | z_i) = σ^2.
A. S = E(g_i g_i') = E[ε_i^2 z_i z_i'] = σ^2 Σ_zz (Σ_zz = E[z_i z_i']) (footnote 19)
(As in Chapter 2, this decomposition has several implications.)
S nonsingular (by 3.5) together with S = σ^2 Σ_zz implies σ^2 ≠ 0 and Σ_zz nonsingular.
A consistent estimator of S is: S_hat = σ_hat^2 (1/n) Σ_{i=1}^n z_i z_i' = σ_hat^2 S_zz, where σ_hat^2 is some consistent estimator of σ^2.
(By ergodic stationarity, S_hat →p S; we don't need the fourth-moment assumption!)
B. Efficient GMM becomes 2SLS:
GMM estimator under conditional homoskedasticity:
Setting S_hat = σ_hat^2 S_zz, the efficient GMM estimator becomes
b_GMM((σ_hat^2 S_zz)^(-1)) = (S_zx' (σ_hat^2 S_zz)^(-1) S_zx)^(-1) S_zx' (σ_hat^2 S_zz)^(-1) s_zy = (S_zx' S_zz^(-1) S_zx)^(-1) S_zx' S_zz^(-1) s_zy = b_GMM(S_zz^(-1)) = b_2SLS
(it does not depend on σ_hat^2!) (footnote 20)

X. 2SLS: 2SLS is a special case of GMM, i.e. GMM with the particular weighting matrix (σ_hat^2 S_zz)^(-1).
It is also the efficient GMM estimator obtained under conditional homoskedasticity.
A. Alternative Derivations of 2SLS: 2SLS as an IV estimator and 2SLS as 2 regressions.
Let X (nxd) be the matrix with rows x_i', Z (nxM) the matrix with rows z_i', and y (nx1) the vector with elements y_i. Then
b_2SLS = (S_zx' S_zz^(-1) S_zx)^(-1) S_zx' S_zz^(-1) s_zy
= (X'Z(Z'Z)^(-1)Z'X)^(-1) X'Z(Z'Z)^(-1)Z'y = (X'PX)^(-1) X'Py, where P = Z(Z'Z)^(-1)Z',
and Avar_hat(b_2SLS) = n σ_hat^2 [X'Z(Z'Z)^(-1)Z'X]^(-1) = n σ_hat^2 [X'PX]^(-1), where σ_hat^2 = (1/n) ε_hat'ε_hat and ε_hat = y - X b_2SLS.
t statistic: t_l = (b_2SLS,l - β_l_bar) / sqrt(σ_hat^2 [(X'PX)^(-1)]_ll) →d N(0, 1)
Wald statistic: W = a(b_2SLS)' [A(b_2SLS) (X'Z(Z'Z)^(-1)Z'X)^(-1) A(b_2SLS)']^(-1) a(b_2SLS) / σ_hat^2 →d χ²(#a)
J statistic / Sargan's statistic: J(b_2SLS, (σ_hat^2 S_zz)^(-1)) = ε_hat' P ε_hat / σ_hat^2 →d χ²(M - d)
Footnote 19: S = E(g_i g_i') = E[(z_i ε_i)(z_i ε_i)'] = E[ε_i^2 z_i z_i'] = E[E(ε_i^2 z_i z_i' | z_i)] = E[z_i z_i' E(ε_i^2 | z_i)] = σ^2 E[z_i z_i'] = σ^2 Σ_zz.
Footnote 20: In the efficient 2-step GMM estimation, the first step is to obtain a consistent estimator of S. Under conditional homoskedasticity, we don't need to perform the first step,
since the assumption immediately gives the consistent estimator S_hat = σ_hat^2 (1/n) Σ z_i z_i' = σ_hat^2 S_zz. So the second-step estimator collapses to the GMM estimator with S_zz^(-1) as
the weighting matrix. This estimator is called the 2SLS estimator because it can be computed by 2 OLS regressions.

B. 2 Interpretations of 2SLS: b_2SLS = (X'PX)^(-1) X'Py
(i) 2SLS as an IV Estimator
Let x_hat_i (dx1) be a vector of d instruments for the d regressors x_i, generated from z_i (Mx1) as follows:
a. The k-th instrument is the fitted value from regressing the k-th regressor x_ik on z_i:
x_hat_k = Z(Z'Z)^(-1)Z'x_k, so X_hat = [Z(Z'Z)^(-1)Z'x_1, ..., Z(Z'Z)^(-1)Z'x_d] = Z(Z'Z)^(-1)Z'X = PX.
(Verify that these are indeed instruments: uncorrelated with the errors and correlated with the endogenous variables.)
b. Recall, if X_hat (nxd) is the data matrix of d instruments for the d regressors X (nxd), then the IV estimator is
b_IV = [(1/n) Σ x_hat_i x_i']^(-1) (1/n) Σ x_hat_i y_i = (X_hat'X)^(-1) X_hat'y.
Here, with X_hat = PX, the IV estimator is (X'P'X)^(-1) X'P'y = (X'PX)^(-1) X'Py = b_2SLS.
(ii) 2SLS as 2 Regressions
Since P is symmetric and idempotent, b_2SLS = (X'PX)^(-1) X'Py = (X'P'PX)^(-1) X'P'Py = (X_hat'X_hat)^(-1) X_hat'y.
We can obtain this estimator by 2 OLS estimations:
a. Regress X on Z to obtain the fitted values X_hat = PX, where P = Z(Z'Z)^(-1)Z'.
Note: We only need to regress the endogenous variables on the instruments.
Any part of x_i that is predetermined is itself an instrument, so projecting it onto the column space of Z gives the same variable back.
b. Regress y on X_hat: i.e. obtain b_2SLS = (X_hat'X_hat)^(-1) X_hat'y.
Note: OLS packages running the second stage return 2SLS SEs based on the residual vector y - X_hat b_2SLS.
This is NOT the same as y - X b_2SLS (the estimated residual of interest). Therefore the estimated asymptotic
variance from the second stage cannot be used for statistical inference.
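A sketch of 2SLS computed directly as (X'PX)^(-1) X'Py, with σ_hat^2 built from the correct residuals y - X b_2SLS (hypothetical names; numpy assumed):

import numpy as np

def two_sls(y, X, Z):
    n = len(y)
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)           # projection onto the column space of Z
    b = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)   # b_2SLS
    e = y - X @ b                                   # NOT y - (PX) b: use the original regressors
    sigma2 = e @ e / n
    avar = n * sigma2 * np.linalg.inv(X.T @ P @ X)  # estimated Avar; SEs = sqrt(diag(avar) / n)
    return b, avar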
C. Asymptotic Properties of 2SLS: These results follow from the fact that 2SLS is a special case of GMM with W_n = S_zz^(-1).
a. Consistency (footnote 21): Under Assumptions 3.1 - 3.4, the 2SLS estimator b_2SLS = (S_zx' S_zz^(-1) S_zx)^(-1) S_zx' S_zz^(-1) s_zy is consistent.
b. Asymptotic Normality: If we add Assumption 3.5 to 3.1 - 3.4, then the 2SLS estimator is asymptotically normal:
√n (b_GMM(S_zz^(-1)) - β) →d N(0, V),
where V = Avar(b_GMM(S_zz^(-1))) = (Σ_zx' Σ_zz^(-1) Σ_zx)^(-1) Σ_zx' Σ_zz^(-1) S Σ_zz^(-1) Σ_zx (Σ_zx' Σ_zz^(-1) Σ_zx)^(-1)
= (E(z_i x_i')' E(z_i z_i')^(-1) E(z_i x_i'))^(-1) E(z_i x_i')' E(z_i z_i')^(-1) E(g_i g_i') E(z_i z_i')^(-1) E(z_i x_i') (E(z_i x_i')' E(z_i z_i')^(-1) E(z_i x_i'))^(-1).
Footnote 21: Since the 2SLS estimator is a special case of GMM, consistency follows from the general case;
S_zz = (1/n) Σ z_i z_i' →p E(z_i z_i') by the ergodic theorem (since the instruments are ergodic stationary).
c. Conditional Homoskedasticity: If we add Assumption 3.7 to 3.1 - 3.5, then 2SLS is the efficient GMM estimator, with asymptotic variance
Avar(b_2SLS) = Avar(b_eff_GMM(S^(-1))) = (under conditional homoskedasticity) = σ^2 (Σ_zx' Σ_zz^(-1) Σ_zx)^(-1).
A natural, consistent estimator of the asymptotic variance is
Avar_hat(b_2SLS) = σ_hat^2 (S_zx' S_zz^(-1) S_zx)^(-1) = n σ_hat^2 [X'Z(Z'Z)^(-1)Z'X]^(-1) = n σ_hat^2 [X'P_Z X]^(-1),
where σ_hat^2 = (1/n) Σ (y_i - x_i'b_2SLS)^2.
d. The t statistic converges in distribution to a standard normal:
t_l = (b_2SLS,l - β_l_bar) / Robust SE(b_2SLS)_l →d N(0, 1), with Robust SE(b_2SLS)_l = sqrt((1/n) Avar_hat(b_2SLS)_ll).
e. The Wald statistic converges in distribution to χ²(#r) (#r = the number of restrictions):
W = a(b_2SLS)' [A(b_2SLS) (X'Z(Z'Z)^(-1)Z'X)^(-1) A(b_2SLS)']^(-1) a(b_2SLS) / σ_hat^2 →d χ²(#r)
f. The Sargan statistic converges in distribution to χ²(M - d):
Sargan's statistic = ε_hat' P ε_hat / σ_hat^2 →d χ²(M - d)

D. Note on Small Sample Properties of 2SLS
There is research showing that if the R² in the first-stage regression is low, we should suspect that the large-sample
approximation to the finite-sample distribution of the 2SLS estimator is poor.

E. When Regressors Are Predetermined and Errors Are Conditionally Homoskedastic: Efficient GMM = OLS!
When all regressors are predetermined (z_i = x_i) and the errors are conditionally homoskedastic, the objective function (J statistic) for the
efficient GMM estimator / 2SLS is
J(b, (σ_hat^2 S_zz)^(-1)) = n g_n(b)' (σ_hat^2 S_zz)^(-1) g_n(b) = (y - Xb)' P (y - Xb) / σ_hat^2
= [y'Py - 2b'X'Py + b'X'PXb] / σ_hat^2
= [y'Py - 2b'X'y + b'X'Xb] / σ_hat^2, since P = P' and PX = X when x_i = z_i (the regressors are the instruments)
= [(y - Xb)'(y - Xb) - (y'y - y'Py)] / σ_hat^2
= [(y - Xb)'(y - Xb) - (y - y_hat)'(y - y_hat)] / σ_hat^2, where y_hat ≡ Py.
Since the last term does not depend on b, minimizing J amounts to minimizing SSR = (y - Xb)'(y - Xb).
Implications:
i. The efficient GMM estimator is OLS (this is true as long as z_i = x_i, i.e. the regressors are predetermined).
ii. The restricted efficient GMM estimator subject to the constraints of the null hypothesis is restricted OLS (whose objective
function is not J but the SSR).
iii. The Wald statistic, which is numerically equal to the LR statistic, can be calculated as the difference in SSR with and without the
imposition of the null, divided by σ_hat^2. (This confirms the derivation in 2.6 that the LR principle can deliver the Wald statistic.)
(Note: This is why we can fit OLS into the GMM framework, treating the x's as instruments.)
F. Limited Information Maximum Likelihood Estimator (LIML): This is the ML counterpart of 2SLS.
They are both k-class estimators; 2SLS is a k-class estimator with k = 1 (p. 541). So when the equation is just identified,
LIML = 2SLS numerically.

Multiple Equation GMM


Background: Having seen how to estimate one equation via GMM, we can now estimate a system of multiple equations as well. Going from
single-equation to multiple-equation GMM is easy because the multiple-equation GMM estimator can be expressed as a single-equation GMM estimator by
suitably specifying the matrices and vectors comprising the single-equation GMM formula.
Summary: Under conditional homoskedasticity, multiple-equation GMM reduces to the full-information instrumental variables efficient
estimator (FIVE), which reduces to 3SLS if the set of instruments is common to all equations. If we further assume that all regressors are
predetermined, 3SLS reduces to seemingly unrelated regressions (SUR), which in turn reduces to the multivariate regression model when all the
equations have the same regressors.
I.

Assumptions: There are M equations, each of which is a linear equation like the one in GMM
4.1 Linearity: There are M linear equations:
y_im = x_im' δ_m + ε_im (m = 1, 2, ..., M; i = 1, 2, ..., n) (this is the system of equations we want to estimate)
(x_im is the d_m-dimensional vector of regressors, δ_m is the coefficient vector, and ε_im is the unobservable error term for the m-th equation).
Note on interequation correlation and cross-equation restrictions (footnote 22).
Cross-equation restrictions often occur in panel data models, where the same relationship can be estimated for different points in
time (footnote 23).

4.2 Ergodic Stationarity:


Let w_i be the unique and nonconstant elements of (y_i1, ..., y_iM, x_i1, ..., x_iM, z_i1, ..., z_iM). {w_i} is jointly stationary and ergodic.
Note: This is stronger than assuming ergodic stationarity for each of the M equations in the system.
That {y_im, z_im, x_im} is stationary and ergodic for each m does not imply that the whole system {w_i}, i.e. the union of the
individual processes, is jointly stationary and ergodic (footnote 24).
4.3 Orthogonality Conditions: The conditions for the M-equation system are just the collection of the conditions for each equation.
For each equation m, the K_m variables in z_im are predetermined in the sense that they are all orthogonal to the current error term:
E(z_im ε_im) = 0 for all i and m = 1, 2, ..., M, i.e. E(g_i) = 0, where g_i is the stacked (Σ_m K_m) x 1 vector
g_i = [z_i1 (y_i1 - x_i1'δ_1); ...; z_iM (y_iM - x_iM'δ_M)] = [z_i1 ε_i1; ...; z_iM ε_iM].
Note on Cross Orthogonalities: The model assumes no cross orthogonalities, e.g. z_i1 and ε_i2 do not have to be orthogonal.
However, if a variable is included in both z_i1 and z_i2 (a shared instrument), then 4.3 implies that the variable is orthogonal to both ε_i1
and ε_i2.

Footnote 22: The model makes no assumptions about the interequation (or contemporaneous) correlation between the errors (ε_i1, ..., ε_iM). Also, there are no a priori restrictions on the coefficients
from different equations: the model assumes no cross-equation restrictions on the coefficients.
Example: Suppose we want to estimate the wage equation (a la Griliches) and add to it an equation for KWW (score on the Knowledge of the World of Work test):
LogWage_i = δ_1 + β_1 S_i + γ_1 IQ_i + π EXPR_i + ε_i1
KWW_i = δ_2 + β_2 S_i + γ_2 IQ_i + ε_i2
x_i1 = (1, S_i, IQ_i, EXPR_i)', x_i2 = (1, S_i, IQ_i)', δ_1 = (δ_1, β_1, γ_1, π)', δ_2 = (δ_2, β_2, γ_2)'.
Here, ε_i1 and ε_i2 can be correlated. Correlation arises if, for example, there is an unobservable individual characteristic that affects both the wage rate and the test score.
There are no cross-equation restrictions such as β_1 = β_2.
Footnote 23: Panel data example: Suppose we have data for the log-wage exercise (a la Griliches) in 1969 and 1980. We can estimate 2 wage equations:
LW69_i = δ_1 + β_1 S69_i + γ_1 IQ_i + π_1 EXPR69_i + ε_i1
LW80_i = δ_2 + β_2 S80_i + γ_2 IQ_i + π_2 EXPR80_i + ε_i2
In this setup, S69_i (education in 1969) and S80_i (education in 1980) are 2 different variables. One possible set of cross-equation restrictions is that the coefficients remain unchanged
through time (i.e. the same effects): δ_1 = δ_2, β_1 = β_2, γ_1 = γ_2, π_1 = π_2.
Footnote 24: Example of element-wise vs. joint stationarity:
Let {e_i} (i = 1, 2, ...) be a scalar i.i.d. process. Create a 2-dimensional process {z_i} from it by defining z_i1 = e_i and z_i2 = e_1.
Note: The first component is an iid sequence; the second is a constant sequence (maximum serial dependence). Both are
stationary processes.
Here, the scalar processes {z_i1} and {z_i2} are each stationary. However, the vector process {z_i} is not jointly stationary, because the joint
distribution of z_1 = (e_1, e_1) differs from that of z_2 = (e_2, e_1).

4.4 Rank Condition for Identification: Guarantees there's a unique solution to the system of equations (footnote 25)
For each equation m (= 1, 2, ..., M), the K_m x d_m matrix E(z_im x_im') is of full column rank (equivalently, E(x_im z_im') is of full row rank),
with K_m >= d_m (# of equations >= # of unknowns) for all m.
4.5 g_i is a Martingale Difference with Finite Second Moments: Assumption for Asymptotic Normality
{g_i} is a joint martingale difference sequence with finite second moments
(so E(g_i) = 0 with E(g_i | g_i-1, g_i-2, ..., g_1) = 0 for i >= 2, i.e. no serial correlation in g_i).
The (Σ_m K_m) x (Σ_m K_m) matrix of cross moments S = E(g_i g_i'), whose (m, h) block is E(ε_im ε_ih z_im z_ih'), is nonsingular.
Note: This is stronger than assuming that g_im = z_im ε_im is an MDS in each equation m.
Add'l:
4.6 Finite Fourth Moments: (For consistent estimation of S)
E[(z_imk x_ihj)^2] exists and is finite for all k = 1, ..., K_m, j = 1, ..., d_h, and m, h = 1, 2, ..., M, where z_imk is the k-th element of z_im and x_ihj is
the j-th element of x_ih.
4.7 Conditional Homoskedasticity: Constant Cross Moments
E(ε_im ε_ih | z_im, z_ih) = σ_mh for all m, h = 1, 2, ..., M; or compactly, E(ε_i ε_i' | Z_i) = Σ.
Note on Complete System of Simultaneous Equations: the Complete system adds more assumptions to our model, assumptions which are
unnecessary for development of ME GMM. They are covered in 8.5.

II.
Multiple-Equation GMM Defined: This is the same as single-equation GMM but with re-defined (stacked) matrices.
1. General Setup
The parameter of interest δ = (δ_1', ..., δ_M')' is defined implicitly as the solution to the moment conditions
E[g_i(w_i, δ)] = 0 ((Σ_m K_m) x 1), i.e. E[z_im (y_im - x_im'δ_m)] = 0 for m = 1, ..., M,
which can be written compactly as
E(Z_i y_i) - E(Z_i X_i) δ = σ_zy - Σ_zx δ = 0,
where σ_zy = E(Z_i y_i) is the stacked vector with m-th block E(z_im y_im) (K_m x 1), and Σ_zx = E(Z_i X_i) is the block-diagonal matrix whose m-th diagonal block is E(z_im x_im') (K_m x d_m).
Here Z_i is the (Σ_m K_m) x M block-diagonal matrix with m-th diagonal block z_im, y_i = (y_i1, ..., y_iM)', and X_i is the M x (Σ_m d_m) block-diagonal matrix with m-th diagonal block x_im'.
Sample Analogue:
g_n(w_i, δ) = s_zy - S_zx δ = 0,
where s_zy is the stacked vector with m-th block (1/n) Σ_{i=1}^n z_im y_im, and S_zx is the block-diagonal matrix whose m-th diagonal block is (1/n) Σ_{i=1}^n z_im x_im'.

Footnote 25: We can uniquely determine all the coefficient vectors δ_1, ..., δ_M iff each coefficient vector δ_m is uniquely determined, which occurs iff Assumption 3.4 holds for each equation. The
rank condition is simple here because there are no cross-equation restrictions; when the coefficients are assumed to be the same across all equations we will have a different
identification condition.

4 Special Features of M.E. GMM: We substitute these into the single-equation GMM formulas and get the same results!
i. s_zy is a stacked vector.
ii. S_zx is a block-diagonal matrix.
iii. By ii, W_n is a (Σ_m K_m) x (Σ_m K_m) matrix.
iv. g_bar = (1/n) Σ_{i=1}^n g_i = g_n(δ) is a stacked vector with m-th block (1/n) Σ_{i=1}^n z_im ε_im.
2. Applied to Linear Model: Our model is y_im = x_im'δ_m + ε_im with the moment conditions E[z_im (y_im - x_im'δ_m)] = 0.
o Expression for the sample moment condition in the linear model:
g_n(W_i, δ) = s_zy - S_zx δ
o Method of Moments: If each equation is exactly identified (K_m = d_m for all m), there exists a unique δ such that g_n(W_i, δ) = 0. If Σ_zx is
invertible, then for sufficiently large n, S_zx →p Σ_zx by the ergodic theorem, and S_zx is invertible with probability 1.
So, with a large sample the system of simultaneous equations has the unique solution given by
the MM estimator: δ_hat_IV = S_zx^(-1) s_zy (the IV estimator with the z_im as instruments, defined when K_m = d_m for all m) (footnote 26)
o General Method of Moments:
If the system is overidentified (K_m > d_m for some m), there generally does not exist a δ such that g_n(W_i, δ) = 0 exactly, so we choose δ_hat(W_n) to
minimize g_n(W_i, δ)' W_n g_n(W_i, δ) (for some choice of p.d. weighting matrix W_n that converges in
probability to some p.d. W) (footnote 27).
Population version: δ = argmin_b [E(g_i(W_i, b))]' W [E(g_i(W_i, b))] = argmin_b [E(Z_i(y_i - X_i b))]' W [E(Z_i(y_i - X_i b))],
so δ = [E(Z_i X_i)' W E(Z_i X_i)]^(-1) E(Z_i X_i)' W E(Z_i y_i).
Sample version: δ_hat_GMM = (S_zx' W_n S_zx)^(-1) S_zx' W_n s_zy.
Footnote 26: The IV estimator is defined for the EXACTLY IDENTIFIED case (as many instruments as regressors in each equation). If z_im = x_im, i.e. all the regressors
are predetermined (orthogonal to the contemporaneous error term), then this boils down to equation-by-equation OLS. So OLS is a special case of the MM estimator, and IV and OLS are both
special cases of GMM.
Footnote 27: The quadratic form g_n(W_i, δ)' W_n g_n(W_i, δ) gives us a 1x1 real number over which we can define the minimization problem. Otherwise it would be
impossible to minimize over the (Σ_m K_m)-dimensional vector of moment conditions g_n(W_i, δ).

XI.
ME GMM Estimator and Sampling Error
o GMM Estimator:
δ_hat_GMM(W_n) = (S_zx' W_n S_zx)^(-1) S_zx' W_n s_zy (footnote 28)
(By 4.2 and 4.4, S_zx has full column rank for sufficiently large n with probability 1, so S_zx' W_n S_zx is invertible.)
Note: If K_m = d_m for all m, then S_zx is a square invertible matrix and the GMM estimator reduces to the IV estimator δ_hat_IV = S_zx^(-1) s_zy regardless of W_n.
o Sampling Error:
δ_hat_GMM(W_n) - δ = (S_zx' W_n S_zx)^(-1) S_zx' W_n g_bar, where g_bar = (1/n) Σ_{i=1}^n g_i.
Footnote 28: Written out in blocks, with S_zx block diagonal (m-th block (1/n) Σ_i z_im x_im', of size K_m x d_m), s_zy stacked (m-th block (1/n) Σ_i z_im y_im), and W_n partitioned into blocks W_mh (K_m x K_h),
the (m, h) block of S_zx' W_n S_zx is [(1/n) Σ_i x_im z_im'] W_mh [(1/n) Σ_i z_ih x_ih'],
and the m-th block of S_zx' W_n s_zy is Σ_h [(1/n) Σ_i x_im z_im'] W_mh [(1/n) Σ_i z_ih y_ih].
(This uses the usual block-matrix multiplication rules: for a block-diagonal A, the (m, h) block of A'BA is A_mm' B_mh A_hh, and the m-th block of A'Bc is Σ_h A_mm' B_mh c_h.)
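A sketch of how the stacked s_zy, the block-diagonal S_zx, and the ME GMM estimator can be built (hypothetical data layout: per-equation arrays; numpy and scipy assumed):

import numpy as np
from scipy.linalg import block_diag

def me_gmm(ys, Xs, Zs, W):
    # ys[m]: (n,), Xs[m]: (n, d_m), Zs[m]: (n, K_m); W: (sum K_m, sum K_m) weighting matrix
    n = len(ys[0])
    S_zx = block_diag(*[Z.T @ X / n for Z, X in zip(Zs, Xs)])
    s_zy = np.concatenate([Z.T @ y / n for Z, y in zip(Zs, ys)])
    A = S_zx.T @ W @ S_zx
    return np.linalg.solve(A, S_zx.T @ W @ s_zy)   # stacked (delta_1', ..., delta_M')'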

III.

Large Sample Theory: As seen above, all the theory/formulas for multiple-equation GMM are a matter of substituting the newly
defined matrices into the single-equation formulas. (See the summary below.)

GMM Summary
IV.
Population orthogonality conditions: E[g(w_i, δ)] = E(Z_i y_i) - E(Z_i X_i) δ = σ_zy - Σ_zx δ = 0
Sample analogue of the orthogonality conditions: g_n(w_i, δ) = s_zy - S_zx δ = 0
IV estimator (# moment conditions = # parameters): δ_hat_IV = S_zx^(-1) s_zy, since S_zx is square and invertible
GMM estimator (# moment conditions > # parameters, so we can't set g_n exactly to 0): δ_hat_GMM(W_n) = (S_zx' W_n S_zx)^(-1) S_zx' W_n s_zy
GMM estimator's sampling error: δ_hat_GMM(W_n) - δ = (S_zx' W_n S_zx)^(-1) S_zx' W_n g_bar
GMM estimator's asymptotic variance (under asymptotic normality): Avar(δ_hat_GMM(W_n)) = (Σ_zx' W Σ_zx)^(-1) Σ_zx' W S W Σ_zx (Σ_zx' W Σ_zx)^(-1)
Consistent estimator of the GMM estimator's asymptotic variance: (S_zx' W_n S_zx)^(-1) S_zx' W_n S_hat W_n S_zx (S_zx' W_n S_zx)^(-1)
Optimal/efficient weighting matrix: S_hat^(-1) s.t. S_hat^(-1) →p S^(-1) = E(g_i g_i')^(-1)
Efficient GMM estimator: δ_hat(S_hat^(-1)) = (S_zx' S_hat^(-1) S_zx)^(-1) S_zx' S_hat^(-1) s_zy
Asymptotic (minimum) variance of the efficient/optimal GMM: Avar(δ_hat(S_hat^(-1))) = (Σ_zx' S^(-1) Σ_zx)^(-1) = (Σ_zx' E(g_i g_i')^(-1) Σ_zx)^(-1)
Consistent estimate of the minimum variance: (S_zx' S_hat^(-1) S_zx)^(-1)
J statistic (objective function evaluated at the optimal weighting matrix): J(δ_hat(S_hat^(-1)), S_hat^(-1)) = n g_n(δ_hat(S_hat^(-1)))' S_hat^(-1) g_n(δ_hat(S_hat^(-1)))

Definitions and Properties of Single-Equation vs. Multiple-Equation Estimators:
g_i:
  Single-equation GMM: z_i(Mx1) ε_i
  Multiple-equation GMM: g_i = [z_i1 ε_i1; ...; z_iM ε_iM] ((Σ_m K_m) x 1)
Coefficient vector:
  Single: β (dx1)
  Multiple: δ = (δ_1', ..., δ_M')' ((Σ_m d_m) x 1)
s_zy:
  Single: (1/n) Σ z_i y_i
  Multiple: stacked vector with m-th block (1/n) Σ z_im y_im
S_zx:
  Single: (1/n) Σ z_i x_i'
  Multiple: block-diagonal matrix with m-th block (1/n) Σ z_im x_im'
Σ_zx:
  Single: E(z_i x_i')
  Multiple: block-diagonal matrix with m-th block E(z_im x_im')
Size of W:
  Single: M x M
  Multiple: (Σ_m K_m) x (Σ_m K_m)
S = Avar(g_bar) = E(g_i g_i'):
  Single: E(ε_i^2 z_i z_i')
  Multiple: matrix with (m, h) block E(ε_im ε_ih z_im z_ih')
S_hat (consistent estimator of S):
  Single: (1/n) Σ ε_hat_i^2 z_i z_i'
  Multiple: matrix with (m, h) block (1/n) Σ ε_hat_im ε_hat_ih z_im z_ih'
Estimator consistent under:
  Single: 3.1 - 3.4
  Multiple: 4.1 - 4.4
Estimator asymptotically normal under:
  Single: 3.1 - 3.4 plus 3.5
  Multiple: 4.1 - 4.4 plus 4.5
S_hat →p S under:
  Single: 3.1, 3.2, 3.6, E(g_i g_i') finite
  Multiple: 4.1, 4.2, 4.6, E(g_i g_i') finite
D.F. of the J statistic:
  Single: M - d
  Multiple: Σ_m K_m - Σ_m d_m

V.
Single-Equation vs. Multiple-Equation Estimation
A. Equation-by-Equation GMM vs. Joint Estimation: An alternative to joint estimation is to apply single-equation GMM separately
to each equation.
The equation-by-equation estimator is a particular M.E. GMM estimator with a particular choice of W_n:
estimate the M equations via single-equation GMM such that the weighting matrix for the m-th equation is W_mm (K_m x K_m).
Then, if we stack the equation-by-equation GMM estimators, the result can be written as the M.E. GMM estimator whose weighting matrix is the block-diagonal
matrix W with m-th diagonal block W_mm (this works because S_zx and W are both block diagonal).
When are they equivalent?
i. If all equations are just identified, then equation-by-equation GMM and multiple-equation GMM are numerically the
same and equal to the IV estimator (regardless of the weighting matrix!).
ii. If at least one equation is overidentified but the equations are unrelated in the sense that E(ε_im ε_ih z_im z_ih') = 0 for all
m ≠ h, then efficient equation-by-equation GMM and efficient multiple-equation GMM are asymptotically
equivalent, in that √n (δ_hat(W_hat) - δ_hat(S_hat^(-1))) →p 0.
Joint Estimation Can Be Hazardous
Except for the above cases, joint estimation is asymptotically more efficient since it takes advantage of cross-equation
correlations. Even if you are interested in estimating one particular equation, you can generally gain asymptotic efficiency by
combining it with some other equations (though there are special cases where joint estimation entails no efficiency gain even
if the added equations are not unrelated).
Caveats:
i. Small-sample properties: the small-sample properties of the estimator of the equation of interest might be better without joint estimation.
ii. Misspecification: The asymptotic results presume that the model is correctly specified, i.e. the model assumptions are satisfied.
If the model is misspecified, neither single-equation GMM nor multiple-equation GMM is guaranteed to be even
consistent. Furthermore, the chances of misspecification increase as you add equations to the system. (See p. 272 for an example.)

VI. Special Cases of Multiple-Equation GMM under Conditional Homoskedasticity: FIVE, 3SLS, and SUR
0. S under conditional homoskedasticity:
S = E(g_i g_i') has (m, h) block E(ε_im ε_ih z_im z_ih') = σ_mh E(z_im z_ih') (footnote 29); compactly, S = E(Z_i Σ Z_i') with Σ = E(ε_i ε_i').
1. Full-Information Instrumental Variables Efficient (FIVE) Estimator
Simplification of S_hat: This estimator exploits the above structure (avoiding estimated fourth moments) by using the consistent estimator S_hat of S whose (m, h) block is
σ_hat_mh (1/n) Σ_{i=1}^n z_im z_ih', where σ_hat_mh = (1/n) Σ_{i=1}^n ε_hat_im ε_hat_ih and ε_hat_im = y_im - x_im' δ_tilde_m for some consistent estimator δ_tilde_m of δ_m (footnote 30).
(δ_tilde_m is usually the 2SLS estimator for equation m; so for the cross moments we need two sets of 2SLS residuals.)
Then our M.E. efficient GMM estimator simplifies to
δ_hat_FIVE(S_hat^(-1)) = (S_zx' S_hat^(-1) S_zx)^(-1) S_zx' S_hat^(-1) s_zy (the 1/n's cancel).
Large Sample Properties of FIVE: (these follow from the large-sample properties of M.E. GMM estimators)
Suppose Assumptions 4.1 - 4.5 and 4.7 hold. Suppose further that E(z_ih z_im') exists and is finite for all m, h = 1, 2, ..., M. Let S_hat and δ_hat_FIVE be
defined as above. Then,
(a) S_hat →p S
(b) δ_hat_FIVE(S_hat^(-1)) is consistent, asymptotically normal, and efficient, with Avar(δ_hat_FIVE) = (Σ_zx' S^(-1) Σ_zx)^(-1)
(c) The estimated asymptotic variance Avar_hat(δ_hat_FIVE) = (S_zx' S_hat^(-1) S_zx)^(-1) is consistent for Avar(δ_hat_FIVE)
(d) Sargan's statistic / J statistic: J(δ_hat_FIVE, S_hat^(-1)) = n g_n(δ_hat_FIVE)' S_hat^(-1) g_n(δ_hat_FIVE) →d χ²(Σ_m K_m - Σ_m d_m)
Footnote 29: The (m, h) block of E(g_i g_i') is E(ε_im ε_ih z_im z_ih') = E[E(ε_im ε_ih | z_im, z_ih) z_im z_ih'] by the Law of Iterated Expectations and linearity of conditional expectations
= E(σ_mh z_im z_ih') by conditional homoskedasticity = σ_mh E(z_im z_ih').
Footnote 30: By Assumptions 4.1 and 4.2 and the existence of E(z_ih z_im'), σ_hat_mh →p σ_mh (Prop. 4.1, p. 269). By (joint) ergodic stationarity, (1/n) Σ z_im z_ih' →p E(z_im z_ih'), which exists and is finite by
assumption. Therefore S_hat →p S without a finite fourth-moment assumption.
2.

Three-Stage Least Squares (3SLS): When the set of instruments is same across equations, FIVE can be simplified to 3SLS
Simplification of gi, S, and S : If zi ( = zi1 = zi2 = zi3 = = ziM) is the common set of instruments (for all M equations) with

dimension K, then gi, S, and S can be written compactly using the Kronecker product 31 as follows:
z i1 i1 z i i1

=
gi =
since instruments are common = i zi


z iM iM z i iM
MKx1

1M E ( z i1 z iM )
E ( i1 iM z i1 z iM ) 11 E ( z i1 z i1 )
E ( i1 i1 z i1 z i1 )

S = E(g i g i ' ) =

=
E ( iM i1 z iM z i1 )
E ( iM iM z iM z iM ) M 1 E ( z iM z i1 )
MM E ( z iM z iM )
1M E ( z i z i )
11 E ( z i z i )
by common instruments
=

M 1 E ( z i z i )
MM E ( z i z i )
1M
i1
11

= E ( z i z i ) MK MK where

= E ( i i ' ) for i =
iM
M 1
MM
S-1 = -1 E(zi zi)-1
1
S =
n

i =1

1
S 1 = 1
n

i =1

zi zi '

= n 1 (Z ' Z )1

1M

1
= n
MM

With above simplifications, our M. E. GMM estimator

( ) (

' 1
3SLS = FIVE S 1 = S ZX
S S ZX

31

11

where =
M 1

1
1
z i z i ' = Z ' Z = ( Z ' Z )

n
n

a11

Recall: For general matrices A =


aM 1

(
= X ' (

)
X ' (

' 1
S ZX
S sZY = X ' 1 Z ( Z ' Z )1 Z ' X

a1N
b11

(MxN)
and
B
=

bK1
aMN

PZ X

(
P )Y

33

b1L
a11

(KxL),
a
=

(Mx1), b =
aM 1
bKL

a1N B
b11
a11
a11B

the Kronecker product is defined as: A B =


,
a

b
=

bN 1
aM 1
aM 1B
aMN B
MKxNL
Useful Properties:
(A B)(C D) = AC BD (provided that A and C are conformable and B and D are conformable)

X ' 1 Z ( Z ' Z ) 1 Z ' Y

b11

(Nx1)
bN1

a11b

aM 1b
MNx1

(A B) = A B

(A B)-1 = A-1 B-1


32

Just as before,

mh

1
n

im ih

'
m (m, h = 1, 2,..., M ) for some consistent estimator m of m
, im yim xim

i =1

m is usually the 2SLS estimator for equation m. So for the cross moments we need 2 2SLS estimators)

33

This is bc

'
i i

i =1

32
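A quick numerical check of the Kronecker-product identities in footnote 31 (numpy's kron; random matrices with conformable dimensions, purely illustrative):

import numpy as np

A, B = np.random.rand(3, 3), np.random.rand(4, 4)
C, D = np.random.rand(3, 2), np.random.rand(4, 5)

assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))   # (A⊗B)(C⊗D) = AC⊗BD
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))                     # (A⊗B)' = A'⊗B'
assert np.allclose(np.linalg.inv(np.kron(A, B)),
                   np.kron(np.linalg.inv(A), np.linalg.inv(B)))            # (A⊗B)^(-1) = A^(-1)⊗B^(-1)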

Large Sample Properties of 3SLS: (see the analytical exercises for Chapter 4)
Suppose Assumptions 4.1 - 4.5 and 4.7 hold, and suppose there exists a common set of instruments for all equations (z_im = z_i).
Suppose further that E(z_i z_i') exists and is finite. Let Σ_hat be the MxM matrix of estimated error cross moments
calculated from the 2SLS residuals. Then,
(a) δ_hat_3SLS = δ_hat_FIVE(S_hat^(-1)) is consistent, asymptotically normal, and efficient,
with Avar(δ_hat_3SLS) = (Σ_zx' S^(-1) Σ_zx)^(-1)
(b) The estimated asymptotic variance Avar_hat(δ_hat_3SLS) = n [X'(Σ_hat^(-1) ⊗ Z(Z'Z)^(-1)Z')X]^(-1) is consistent for Avar(δ_hat_3SLS) (see 4.5.17)
(c) Sargan's statistic / J statistic: J(δ_hat_3SLS, S_hat^(-1)) = n g_n(δ_hat_3SLS)' S_hat^(-1) g_n(δ_hat_3SLS) →d χ²(MK - Σ_m d_m)
3.
Seemingly Unrelated Regressions (SUR): Suppose that, in addition to common instruments, z_i = the union of (x_i1, ..., x_iM) (all regressors are instruments!).
The SUR cross-orthogonality condition is equivalent to E(x_im ε_ih) = 0 for all m, h = 1, 2, ..., M.
Predetermined regressors satisfy the cross orthogonalities: the regressors are not only predetermined in their own equation (E(x_im ε_im) = 0), but
also predetermined in the other equations (so the regressors of any equation are valid instruments for all equations). This cross orthogonality is what makes the SUR simplification possible.
With the above (S_hat and g_n still defined as in 3SLS) we get that (because the X_m are in the column space of P_Z)
δ_hat_SUR = δ_hat_3SLS = δ_hat_FIVE(S_hat^(-1)) = (S_zx' S_hat^(-1) S_zx)^(-1) S_zx' S_hat^(-1) s_zy = [X'(Σ_hat^(-1) ⊗ I_n)X]^(-1) X'(Σ_hat^(-1) ⊗ I_n)Y (footnote 34),
where
S_zx = (1/n) Σ_i Z_i X_i is block diagonal with m-th block (1/n) Σ_i z_i x_im', i.e. S_zx = (1/n)(I_M ⊗ Z)'X, and s_zy = (1/n)(I_M ⊗ Z)'Y,
Z = [z_1'; ...; z_n'] (n x K), X = block-diagonal(X_1, ..., X_M) (nM x Σ_m d_m) with X_m = [x_1m'; ...; x_nm'] (n x d_m, the data matrix of the m-th equation's regressors),
Y = (y_1', ..., y_M')' (nM x 1), with y_m = (y_1m, ..., y_nm)' (the data vector of the m-th equation's dependent variable).

Large Sample Properties of SUR: (see the analytical exercises for Chapter 4)
Suppose Assumptions 4.1 - 4.5 and 4.7 hold, suppose there exists a common set of instruments for all equations (z_im = z_i),
and suppose z_i = the union of (x_i1, ..., x_iM). Suppose further that E(z_i z_i') exists and is finite. Let Σ_hat be the MxM
matrix of estimated error cross moments calculated from the OLS residuals. Then,
(a) δ_hat_SUR = δ_hat_3SLS = δ_hat_FIVE(S_hat^(-1)) is consistent, asymptotically normal, and efficient,
with Avar(δ_hat_SUR) = (Σ_zx' S^(-1) Σ_zx)^(-1)
(b) The estimated asymptotic variance Avar_hat(δ_hat_SUR) = n [X'(Σ_hat^(-1) ⊗ I_n)X]^(-1) is consistent for Avar(δ_hat_SUR) (see 4.5.17)
(c) Sargan's statistic / J statistic: J(δ_hat_SUR, S_hat^(-1)) = n g_n(δ_hat_SUR)' S_hat^(-1) g_n(δ_hat_SUR) →d χ²(MK - Σ_m d_m)
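A sketch of SUR as feasible GLS with Σ_hat from equation-by-equation OLS residuals (hypothetical data layout; numpy and scipy assumed):

import numpy as np
from scipy.linalg import block_diag

def sur(ys, Xs):
    # ys[m]: (n,), Xs[m]: (n, d_m); the regressors are the instruments here
    n = len(ys[0])
    b_ols = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in zip(Xs, ys)]
    E = np.column_stack([y - X @ b for y, X, b in zip(ys, Xs, b_ols)])
    Sigma_inv = np.linalg.inv(E.T @ E / n)
    Xbig = block_diag(*Xs)
    ybig = np.concatenate(ys)
    A = np.kron(Sigma_inv, np.eye(n))           # Sigma_hat^(-1) ⊗ I_n
    return np.linalg.solve(Xbig.T @ A @ Xbig, Xbig.T @ A @ ybig)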

3SLS vs. SUR:
The SUR estimator is a special case of 3SLS. In 3SLS, the initial (consistent) estimator used to calculate Σ_hat is 2SLS. But we know
that 2SLS when the regressors are a subset of the instrument set (i.e. when the regressors are predetermined) is OLS (Ch. 3.8, Review
Question 7). So for SUR the initial estimator is the OLS estimator.
SUR vs. OLS:
Since the regressors are predetermined, the system can also be estimated by equation-by-equation OLS.
Then why SUR over OLS?
The relationship between SUR and equation-by-equation OLS is strictly analogous to the relationship between M.E.
GMM and equation-by-equation GMM: under conditional homoskedasticity, the efficient M.E. GMM estimator is FIVE, which equals
SUR under the cross-orthogonality condition. On the other hand, under conditional homoskedasticity, efficient single-equation
GMM is 2SLS, which is OLS under cross orthogonality, since cross orthogonality implies that the regressors are predetermined.
2 cases to consider:
(i) Each equation is just identified:
Since the common instrument set is the union of all the regressors, this is possible only if the regressors are the same for all
equations, i.e. x_im = z_i for all m: the multivariate regression model, estimated by equation-by-equation OLS!
(ii) At least one equation is overidentified:
Then SUR is more efficient than equation-by-equation OLS, unless the equations are unrelated to each other in the
sense that E(ε_im ε_ih x_im x_ih') = σ_mh E(x_im x_ih') = 0 (the first equality holds by conditional homoskedasticity) for all m ≠ h (recall, in
this case M.E. GMM is asymptotically equivalent to single-equation GMM).
Since E(x_im x_ih') is assumed to be nonzero (by the rank condition), σ_mh E(x_im x_ih') = 0 iff σ_mh = 0 (the covariance of the error terms is 0).
Therefore, SUR is more efficient than OLS if σ_mh ≠ 0 for some pair (m, h), and they are asymptotically
equivalent if σ_mh = 0 for all m ≠ h.

Footnote 34:
S_zx' S_hat^(-1) S_zx = (1/n) X'(Σ_hat^(-1) ⊗ Z(Z'Z)^(-1)Z')X = (1/n) X'(Σ_hat^(-1) ⊗ P_Z)X.
Writing this out in blocks, X'(Σ_hat^(-1) ⊗ P_Z)X has (m, h) block σ_hat^(mh) X_m' P_Z X_h (σ_hat^(mh) is the (m, h) element of Σ_hat^(-1))
= σ_hat^(mh) X_m' P_Z' P_Z X_h, since P_Z is symmetric and idempotent,
= σ_hat^(mh) X_m' X_h, since each X_m is in the column space of P_Z (Z spans the union of the x's, so P_Z X_m = X_m).
Hence X'(Σ_hat^(-1) ⊗ P_Z)X = X'(Σ_hat^(-1) ⊗ I_n)X.
Similarly, S_zx' S_hat^(-1) s_zy = (1/n) X'(Σ_hat^(-1) ⊗ Z(Z'Z)^(-1)Z')Y = (1/n) X'(Σ_hat^(-1) ⊗ I_n)Y,
so the (1/n)'s cancel and δ_hat_SUR = [X'(Σ_hat^(-1) ⊗ I_n)X]^(-1) X'(Σ_hat^(-1) ⊗ I_n)Y.

4.
Multivariate Regression: Suppose that, in addition to common instruments and z_i = the union of (x_i1, ..., x_iM), all equations are just identified.
This condition implies x_im = z_i for all m (the same regressors in all equations, all of them predetermined) (footnote 35).
Simplification of X: with this assumption, X = I_M ⊗ Z.
With the above (S_hat and g_n still defined the same), we get (footnote 36)
δ_hat_MVR = δ_hat_SUR = δ_hat_3SLS = δ_hat_FIVE(S_hat^(-1)) = (S_zx' S_hat^(-1) S_zx)^(-1) S_zx' S_hat^(-1) s_zy = [I_M ⊗ (Z'Z)^(-1)Z'] Y.
Multivariate Regression Estimator = Equation-by-Equation OLS
Recall, when all equations are just identified, M.E. GMM and efficient single-equation GMM are numerically equal to the IV
estimator, and since the regressors are predetermined, the GMM estimator of the multivariate regression model is just equation-by-equation OLS.
Multivariate Regression Interpretation of SUR: We can think of the SUR model as a multivariate regression model with a priori
exclusion restrictions (i.e. a system with the same regressors but with restrictions setting certain coefficients to zero).
Multiple-Equation GMM with Common Coefficients: We can modify the M.E. GMM model to allow the restriction that the coefficient vector is the same across equations. We get the random effects (RE) estimator and pooled OLS.
Background:
In many applications (e.g. panel data) we deal with a special case of the M.E. GMM model where the number of regressors is the same
across equations and the coefficients are the same. How do we apply M.E. GMM while imposing the common-coefficient restriction?
Random Effects Estimator and Pooled OLS.
Footnote 35: This is true because z_i instruments all M equations. If x_im were a proper subset of z_i for some m, then dim(z_i) > dim(x_im) and the m-th equation would be overidentified.
Footnote 36: X'(Σ_hat^(-1) ⊗ I_n)X = (I_M ⊗ Z)'(Σ_hat^(-1) ⊗ I_n)(I_M ⊗ Z) = (Σ_hat^(-1) ⊗ Z')(I_M ⊗ Z) = Σ_hat^(-1) ⊗ Z'Z,
and X'(Σ_hat^(-1) ⊗ I_n)Y = (Σ_hat^(-1) ⊗ Z')Y, so
δ_hat_MVR = (Σ_hat^(-1) ⊗ Z'Z)^(-1)(Σ_hat^(-1) ⊗ Z')Y = (Σ_hat ⊗ (Z'Z)^(-1))(Σ_hat^(-1) ⊗ Z')Y = [I_M ⊗ (Z'Z)^(-1)Z'] Y.

Relationship between M.E. Estimators

(Notation: S_ZX, s_ZY, Ŝ as before; P_Z = Z(Z'Z)^{-1}Z'; Σ̂ is the M x M matrix with (m,h) element σ̂_mh = (1/n) ∑_i ε̂_im ε̂_ih.)

Multiple Equation GMM
- The Model: Assumptions 4.1-4.6
- S = E(g_i g_i'): block matrix with (m,h) block E(ε_im ε_ih z_im z_ih')
- Ŝ: (m,h) block (1/n) ∑_i ε̂_im ε̂_ih z_im z_ih'
- Estimator: δ̂ = (S_ZX' Ŝ^{-1} S_ZX)^{-1} S_ZX' Ŝ^{-1} s_ZY
- Estimated Avar(δ̂): n (S_ZX' Ŝ^{-1} S_ZX)^{-1}, with Ŝ defined above

FIVE
- The Model: Assumptions 4.1-4.5, 4.7; E(x_im x_ih') finite
  (Note: 4.6, finite fourth moments, is not needed under conditional homoskedasticity 4.7)
- S: (m,h) block σ_mh E(z_im z_ih')
- Ŝ: (m,h) block σ̂_mh (1/n) ∑_i z_im z_ih', with σ̂_mh computed from equation-by-equation 2SLS residuals
- Estimator: δ̂_FIVE = (S_ZX' Ŝ^{-1} S_ZX)^{-1} S_ZX' Ŝ^{-1} s_ZY
- Estimated Avar: n (S_ZX' Ŝ^{-1} S_ZX)^{-1}, with Ŝ defined above

3SLS
- The Model: Assumptions 4.1-4.5, 4.7; E(x_im x_ih') finite; z_im = z_i for all m (same set of instruments across equations)
- S = Σ ⊗ E(z_i z_i')   (MK x MK, with (m,h) block σ_mh E(z_i z_i'))
- Ŝ = Σ̂ ⊗ (Z'Z/n), with Σ̂ computed from 2SLS residuals
- Estimator: δ̂_3SLS = (S_ZX' Ŝ^{-1} S_ZX)^{-1} S_ZX' Ŝ^{-1} s_ZY = [X'(Σ̂^{-1} ⊗ P_Z)X]^{-1} X'(Σ̂^{-1} ⊗ P_Z)Y
- Estimated Avar: n [X'(Σ̂^{-1} ⊗ Z(Z'Z)^{-1}Z')X]^{-1}

SUR
- The Model: Assumptions 4.1-4.5, 4.7; E(x_im x_ih') finite; z_im = z_i for all m; z_i = union_m x_im
  (This implies the cross-equation orthogonality E(x_im ε_ih) = 0, m, h = 1, 2, ..., M)
- S = Σ ⊗ E(z_i z_i')   (MK x MK)
- Ŝ = Σ̂ ⊗ (Z'Z/n), with Σ̂ computed from OLS residuals
- Estimator: δ̂_SUR = [X'(Σ̂^{-1} ⊗ I_n)X]^{-1} X'(Σ̂^{-1} ⊗ I_n)Y
- Estimated Avar: n [X'(Σ̂^{-1} ⊗ I_n)X]^{-1}

Multivariate Regression
- The Model: Assumptions 4.1-4.5, 4.7; E(x_im x_ih') finite; x_im = z_i for all m
- S, Ŝ: Irrelevant (the estimator does not depend on Σ̂)
- Estimator: δ̂_MVR = [I_M ⊗ (Z'Z)^{-1}Z']Y = Equation-by-Equation OLS
- Estimated Avar: equation-by-equation OLS formula

Note (stacked data matrices):

Z = [z_1' ; ... ; z_n'] = [z_11 ... z_1K ; ... ; z_n1 ... z_nK]   (n x K)

X_m = [x_1m' ; ... ; x_nm']   (n x D_m),    X = blockdiag(X_1 (n x D_1), ..., X_M (n x D_M))   (nM x ∑_m D_m)

y_m = (y_1m, ..., y_nm)'   (n x 1),    Y = [y_1 ; ... ; y_M]   (nM x 1)

ε_m = (ε_1m, ..., ε_nm)'   (n x 1),    ε = [ε_1 ; ... ; ε_M]   (nM x 1)

δ_m   (D_m x 1),    δ = [δ_1 ; ... ; δ_M]   ((∑_m D_m) x 1)
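To make the stacking concrete, here is a minimal sketch (numpy only, simulated data; the DGP, dimensions, and all names are illustrative assumptions) that builds these stacked objects and evaluates the 3SLS and SUR formulas from the table above:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, M = 400, 4, 2
Z = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])   # common instruments z_i

# Illustrative DGP: each equation has its own regressors built from columns of z_i plus noise
X1 = np.column_stack([np.ones(n), Z[:, 1] + 0.5 * rng.normal(size=n)])
X2 = np.column_stack([np.ones(n), Z[:, 2] + 0.5 * rng.normal(size=n), Z[:, 3]])
y1 = X1 @ np.array([1.0, 2.0]) + rng.normal(size=n)
y2 = X2 @ np.array([-0.5, 1.0, 0.7]) + rng.normal(size=n)

# Stacked objects as in the Note: Y is nM x 1, X is block-diagonal nM x (D_1 + D_2)
Y = np.concatenate([y1, y2])
D1, D2 = X1.shape[1], X2.shape[1]
X = np.zeros((n * M, D1 + D2))
X[:n, :D1], X[n:, D1:] = X1, X2

P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)           # projection onto span(Z)

def kron_gls(W, Sigma, X, Y):
    # delta-hat = [X'(Sigma^{-1} kron W) X]^{-1} X'(Sigma^{-1} kron W) Y
    A = np.kron(np.linalg.inv(Sigma), W)
    return np.linalg.solve(X.T @ A @ X, X.T @ A @ Y)

def two_sls(Xm, ym):
    # equation-by-equation 2SLS: (X_m' P_Z X_m)^{-1} X_m' P_Z y_m
    return np.linalg.solve(Xm.T @ P_Z @ Xm, Xm.T @ P_Z @ ym)

# 3SLS: Sigma-hat from 2SLS residuals, weighting with P_Z
E2 = np.column_stack([y1 - X1 @ two_sls(X1, y1), y2 - X2 @ two_sls(X2, y2)])
delta_3sls = kron_gls(P_Z, E2.T @ E2 / n, X, Y)

# SUR: Sigma-hat from OLS residuals, weighting with I_n
E0 = np.column_stack([y1 - X1 @ np.linalg.lstsq(X1, y1, rcond=None)[0],
                      y2 - X2 @ np.linalg.lstsq(X2, y2, rcond=None)[0]])
delta_sur = kron_gls(np.eye(n), E0.T @ E0 / n, X, Y)

print(delta_3sls.round(2), delta_sur.round(2))    # both near the true coefficients here
```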

VII. Simultaneous Equations, FIML (ML Counterpart to 3SLS) and LIML (ML Counterpart to 2SLS)
Background: Given that we're going to estimate a simultaneous equations system (with the same instruments across all equations) via maximum
likelihood, we will assume iid observations and normality. But first, we need to complete the system (# of endogenous variables = # of
equations)

1.

Recall, there are M linear equations we want to estimate:

y_im = x_im'δ_m + ε_im   (m = 1, 2, ..., M; i = 1, 2, ..., n)
Common instruments will be used across all equations (this is the 3SLS assumption)

Full Information Maximum Likelihood (ML estimation of 3SLS)


Complete System of Simultaneous Equations
Setup: In order for the M equations of the system to form a complete system of simultaneous equations, we need:
A. # Endogenous Variables = #Equations: This implies that we can write the M-Equation system in structural form
Γ_0 (M x M) y_t (M x 1) + B_0 (M x K) x_t (K x 1) = ε_t (M x 1)
where yt is the vector of endogenous variables in the system and xt is the vector of exogenous variables in the system 37
Note: It is not possible to estimate an incomplete system by FIML unless we complete it by adding appropriate equations
This is in contrast to GMM. An incomplete system can be estimated via 3SLS (or M.E. GMM if we dont assume
conditional homoskedasticity), as long as the rank condition for identification is satisfied.
Note: To complete the system, we can add equations involving the instruments (relating an endogenous variable to the instruments).

B. The square matrix Γ_0 (M x M) is nonsingular: This implies that the structural equation can be solved for the endogenous variables:
y_t (M x 1) = -Γ_0^{-1} B_0 x_t + Γ_0^{-1} ε_t = Π_0' (M x K) x_t (K x 1) + v_t

Regression Form is a Multivariate Regression Model


Def: Reduced Form is: y_t (M x 1) = Π_0' (M x K) x_t (K x 1) + v_t

Def: The coefficients in Π_0' (M x K) are called reduced-form coefficients
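For concreteness, a tiny numerical sketch of going from structural form to reduced form (the parameter values are made up):

```python
import numpy as np

# Illustrative structural parameters for M = 2 endogenous and K = 3 exogenous variables
Gamma0 = np.array([[1.0, -0.5],
                   [0.2,  1.0]])                 # must be nonsingular
B0 = np.array([[-1.0, 0.3,  0.0],
               [ 0.5, 0.0, -0.7]])

# Reduced-form coefficients: Pi_0' = -Gamma_0^{-1} B_0 (M x K), with v_t = Gamma_0^{-1} eps_t
Pi0_T = -np.linalg.solve(Gamma0, B0)
print(Pi0_T)
```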

Predetermined Condition is satisfied


Since we have put all the predetermined variables in xt, then by assumption (of predetermined), E(xt vt) = 0.
Therefore, we have a (ME GMM) Multivariate Regression Model (same regressors/instruments for all equations)

Estimating System of Simultaneous Equations by FIML


2 Assumptions:
A. The structural error vector ε_t is jointly normal conditional on x_t: ε_t | x_t ~ N_M(0, Σ_0)
B. {y_t, x_t} iid (to strengthen Assumptions 4.2 and 4.5)
From here, we obtain:
C. Distribution of the reduced-form error: v_t | x_t = Γ_0^{-1} ε_t | x_t ~ N(0, Γ_0^{-1} Σ_0 (Γ_0^{-1})')
Distribution of the endogenous y_t: y_t | x_t ~ N(-Γ_0^{-1} B_0 x_t, Γ_0^{-1} Σ_0 (Γ_0^{-1})')

D. Log-likelihood function for the sample (i.e. our objective function) (pg 531-532 Hayashi):

Q_n(δ, Σ) = -(M/2) log(2π) + log|det Γ| - (1/2) log(det Σ) - (1/2n) ∑_{t=1}^n (Γ y_t + B x_t)' Σ^{-1} (Γ y_t + B x_t)

Therefore, the FIML estimate of (δ_0, Σ_0) (i.e. the coefficients in the B_0 and Γ_0 matrices and the variance of the errors) is the (δ̂, Σ̂) that maximizes the objective function.
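A minimal sketch of evaluating this objective for a candidate (Γ, B, Σ) (hypothetical numpy code; the maximization over the free parameters, e.g. with scipy.optimize, is omitted):

```python
import numpy as np

def fiml_objective(Gamma, B, Sigma, Y, X):
    """Average log-likelihood Q_n for the complete system Gamma y_t + B x_t = eps_t.

    Y: (n, M) with rows y_t'; X: (n, K) with rows x_t'; Sigma: (M, M) positive definite.
    """
    n, M = Y.shape
    U = Y @ Gamma.T + X @ B.T                                   # rows are (Gamma y_t + B x_t)'
    Sigma_inv = np.linalg.inv(Sigma)
    quad = np.einsum('ij,jk,ik->', U, Sigma_inv, U) / (2 * n)   # (1/2n) sum_t u_t' Sigma^{-1} u_t
    _, log_abs_det_Gamma = np.linalg.slogdet(Gamma)             # log |det Gamma|
    _, log_det_Sigma = np.linalg.slogdet(Sigma)                 # log det Sigma
    return -(M / 2) * np.log(2 * np.pi) + log_abs_det_Gamma - 0.5 * log_det_Sigma - quad
```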

37 How does this relate to instruments? If the system is not complete, we can add equations that involve instruments we take to be true, which are orthogonal to the error terms (see the LIML example later).

Properties of FIML
A. Identification of the FIML estimator
The identification condition is equivalent to the rank condition being satisfied for each equation:
E(z_t x_tm') is of full column rank for all m = 1, 2, ..., M
B. Invariance: Since FIML is an ML estimator, the invariance property holds (see HW2 and HW3 on why useful)
C. Asymptotic Properties of FIML
Consider the M-equation system: y_im = x_im'δ_m + ε_im (m = 1, 2, ..., M; i = 1, 2, ..., n)
Let δ_0 be the stacked vector collecting all the coefficients in the M-equation system.
Suppose the following assumptions are satisfied:
A1: The rank condition for identification is satisfied: E(z_t x_tm') is of full column rank for all m = 1, 2, ..., M
A2: E(z_t z_t') is nonsingular
A3: The M-equation system can be written as a complete system of simultaneous equations with Γ_0 nonsingular
A4: ε_t | x_t ~ N(0, Σ_0) for some Σ_0 p.d.
A5: {y_t, x_t} iid
A6: The parameter space for (δ_0, Σ_0) is compact with the true parameter vector in the interior
Then,
(a) The FIML estimator (δ̂, Σ̂) which maximizes the objective function is consistent and asymptotically normal
(b) The asymptotic variance of δ̂ is the same as that of 3SLS. A consistent estimator of the asymptotic variance is
n [X'(Σ̂^{-1} ⊗ Z(Z'Z)^{-1}Z')X]^{-1}   (same as 3SLS)

(c) The likelihood ratio statistic for testing the overidentifying restrictions is asymptotically χ² with MK - ∑_{m=1}^M D_m degrees of freedom

Furthermore, these asymptotic results hold even without the normality assumption.

2.

Limited Information Maximum Likelihood (ML Estimation of 2SLS)
LIML vs. 3SLS: They are asymptotically equivalent. However, LIML has an invariance property that 3SLS (and 2SLS) don't.
Setup: The difference here is that we are only estimating 1 equation, instead of a whole system, and we have an endogeneity problem.
The only trick here is that we need to complete the system by adding 1 more equation relating the endogenous variable to
the set of predetermined variables (just take something you know to be true).

Example: Completing the System for LIML


e.g. Suppose the population model of interest is Y_1 = α_0 + α_1 Y_2 + ε and we want to estimate (α_0, α_1),
where Y_1 and Y_2 are endogenous. In order to complete the system, suppose we KNOW the following relationship:
Y_2 = π_0 + π_1'Z_(m x 1) + u
Then, we can write the structural form of the system as:

Y_1 = α_0 + α_1 Y_2 + ε        i.e.   Y_1 - α_1 Y_2 = α_0 + ε
Y_2 = π_0 + π_1'Z + u          i.e.   Y_2 = π_0 + π_1'Z + u

[ 1  -α_1 ] [ Y_1 ]   =   [ α_0  0'_(1 x m) ] [ 1 ]   +   [ ε ]
[ 0    1  ] [ Y_2 ]       [ π_0  π_1'       ] [ Z ]       [ u ]
   (2 x 2)    (2 x 1)        (2 x (1+m))     ((1+m) x 1)   (2 x 1)

Solving for the endogenous variables:

[ Y_1 ]   =   [ 1  -α_1 ]^{-1} [ α_0  0'  ] [ 1 ]   +   [ 1  -α_1 ]^{-1} [ ε ]
[ Y_2 ]       [ 0    1  ]      [ π_0  π_1'] [ Z ]       [ 0    1  ]      [ u ]

LIML vs. 2SLS


They're both k-class estimators. 2SLS is a k-class estimator with k = 1 (p. 541). So when the equation is just identified (in which case the LIML k is 1), LIML = 2SLS numerically.
LIML and 2SLS have the same asymptotic distribution, so we cannot prefer one over the other on asymptotic grounds.
In finite samples, LIML has the invariance property while 2SLS does not (which makes LIML more desirable).
Other literature suggests that LIML should be preferred over 2SLS in finite samples.

Example of LIML (271A PS2 Empirical Ex #2): Single Equation System, Completed by Equation Given
Suppose the population model of interest is:
Y_1 = α_0 + α_1 Y_2 + ε
We suspect endogeneity, and we complete the system with: Y_2 = π_0 + π_1 Z + u

such that E[(ε, u)' (1, Z)] = 0, (ε, u)' | Z ~ N(0, Σ) where Σ = [σ_11  σ_12 ; σ_21  σ_22], and we observe an iid sample of (Y_1, Y_2, Z)
Derive LIML Estimator (Use Invariance Property!):

Y_1 = α_0 + α_1 Y_2 + ε = α_0 + α_1(π_0 + π_1 Z + u) + ε = (α_0 + α_1 π_0) + α_1 π_1 Z + (α_1 u + ε) = γ_0 + γ_1 Z + v
E(Zv) = E(Z(α_1 u + ε)) = α_1 E(Zu) + E(Zε) = 0

So, the orthogonality condition holds, E(ZZ') is assumed invertible, and we have iid data and Gaussian errors; therefore we can estimate this
equation consistently by OLS. But OLS here is the same as the MLE estimate (under the Gaussian errors assumption).
Similarly, we can estimate the second equation consistently via OLS to obtain its MLE estimates (under Gaussian errors).
We obtain MLE estimates γ̂_0, γ̂_1, π̂_0, π̂_1 from OLS.

We also know γ_1 = α_1 π_1, so by the invariance property of MLE, α̂_1 = γ̂_1 / π̂_1
Similarly, γ_0 = α_0 + α_1 π_0, so by the invariance property of MLE, α̂_0 = γ̂_0 - α̂_1 π̂_0
(α̂_0, α̂_1) are the LIML estimators!
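A minimal numerical sketch of this recipe (simulated data; all parameter values and names are made-up assumptions, not from the problem set):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
a0, a1, p0, p1 = 1.0, 2.0, 0.5, 1.5                  # true (alpha_0, alpha_1, pi_0, pi_1)
Z = rng.normal(size=n)
eps, u = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n).T
Y2 = p0 + p1 * Z + u                                 # completing equation
Y1 = a0 + a1 * Y2 + eps                              # equation of interest (Y2 endogenous: u, eps correlated)

def ols(y, x):
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]      # (intercept, slope)

g0, g1 = ols(Y1, Z)                                  # reduced form: Y1 = gamma_0 + gamma_1 Z + v, by OLS (= MLE)
p0_hat, p1_hat = ols(Y2, Z)                          # second equation by OLS (= MLE)

a1_liml = g1 / p1_hat                                # invariance: alpha_1 = gamma_1 / pi_1
a0_liml = g0 - a1_liml * p0_hat                      # invariance: alpha_0 = gamma_0 - alpha_1 pi_0
print(a0_liml, a1_liml)                              # close to (1.0, 2.0); OLS of Y1 on Y2 would be biased
```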

Example of FIML (271A PS3 #5): Multiple (2) Equation System, Completed by Equation Given
The structural model is:
Y_1 = γ_12 Y_2 + γ_13 Y_3 + β_13 Z_3 + β_14 Z_4 + u_1
Y_2 = γ_22 Y_1 + β_21 Z_1 + u_2
Y_3 = β_31 Z_1 + β_32 Z_2 + β_33 Z_3 + β_34 Z_4 + u_3
where Z_1 = 1, E(u_s) = 0 for s = 1, 2, 3 and E(Z_j u_s) = 0 for j = 1, ..., 4, s = 1, 2, 3.
Assume in addition that β_13 + β_14 = 1.

1.

3 Cases of Endogeneity (Simultaneity, Errors in Variables/Measurement Error, Omitted Variables), Examples.


A. Simultaneity Example: (Working's) Simultaneous Equations Model for Market Equilibrium
Setup: The true relationship between demand and supply of coffee is modeled as follows
Demand Equation: q_i^d = α_0 + α_1 p_i + u_i   (u_i represents factors that influence coffee demand other than price)
Supply Equation:  q_i^s = β_0 + β_1 p_i + v_i   (v_i represents factors that influence coffee supply other than price)
Market Equilibrium: q_i^d = q_i^s
Note: We assume E(u_i) = 0 and E(v_i) = 0 (if not, include the nonzero means in the intercepts)
Endogeneity: Here, the regressor p_i is endogenous/not predetermined, i.e. not orthogonal to the (contemporaneous) error
term, and therefore does not satisfy the orthogonality conditions
E(p_i u_i) = 0 (equivalently, Cov(p_i, u_i) = 0) and E(p_i v_i) = 0 (equivalently, Cov(p_i, v_i) = 0).
The endogeneity in this example arises from the fact that price is a function of both error terms u_i and v_i, which is a
result of market equilibrium 38. Therefore, Cov(p_i, u_i) = 0 and Cov(p_i, v_i) = 0 iff Var(u_i) = 0 and Var(v_i) = 0 respectively.
Not possible (except in the extreme case when, for example, there are no other factors that shift demand, so u_i = 0)!
Problem with Endogeneity: Endogeneity Bias
When we regress observed quantity on a constant and price, we estimate neither the demand nor the supply curve, because price
is endogenous in both equations. Recall that the OLS estimator is consistent for the least squares projection coefficients: in
this case, the least squares projection of (true) qi on a constant and (true) pi gives a coefficient of pi given by Cov(pi,qi)/Var(pi)
Suppose we observe {qi, pi} and we regress qi on a constant and pi, what is it that we estimate?
OLS estimate of the price coefficient α_1 (from the demand equation):

β̂ →P Cov(p_i, q_i)/Var(p_i) = Cov(p_i, α_0 + α_1 p_i + u_i)/Var(p_i) = [α_1 Var(p_i) + Cov(u_i, p_i)]/Var(p_i) = α_1 + Cov(u_i, p_i)/Var(p_i)

Asymptotic Bias = Cov(u_i, p_i)/Var(p_i)

OLS estimate of the price coefficient β_1 (from the supply equation):

β̂ →P Cov(p_i, q_i)/Var(p_i) = Cov(p_i, β_0 + β_1 p_i + v_i)/Var(p_i) = [β_1 Var(p_i) + Cov(v_i, p_i)]/Var(p_i) = β_1 + Cov(v_i, p_i)/Var(p_i)

Asymptotic Bias = Cov(v_i, p_i)/Var(p_i)

Since Cov(p_i, u_i) ≠ 0 and Cov(p_i, v_i) ≠ 0, endogeneity bias (simultaneous equations bias / simultaneity bias) exists (because the
regressor and the error term are related to each other through a system of simultaneous equations).
So, the OLS estimator is not consistent for either α_1 or β_1.
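A quick simulation of this bias (a sketch; the parameter values are made up, and the covariance expression from footnote 38 is checked numerically):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
a0, a1 = 10.0, -1.0            # demand: q = a0 + a1*p + u  (alpha_1 < 0)
b0, b1 = 2.0, 1.5              # supply: q = b0 + b1*p + v  (beta_1 > 0)
u = rng.normal(scale=1.0, size=n)
v = rng.normal(scale=0.5, size=n)

# Market equilibrium: solve a0 + a1*p + u = b0 + b1*p + v for the price
p = ((b0 - a0) + (v - u)) / (a1 - b1)
q = a0 + a1 * p + u

slope_ols = np.cov(p, q)[0, 1] / np.var(p, ddof=1)   # what regressing q on p estimates
bias = np.cov(p, u)[0, 1] / np.var(p, ddof=1)        # sample analogue of Cov(u,p)/Var(p)
print(slope_ols, a1 + bias)                          # identical; neither equals a1 = -1 nor b1 = 1.5
```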
Solution: Instrumental Variables and Two-Stage Least Squares
The reason neither the demand curve nor the supply curve can be consistently estimated is that we cannot infer from the data whether an
observed change in price and quantity is due to a shift in demand or in supply. Therefore, we might be able to estimate the
demand/supply curve if some of the factors that shift the supply/demand curve are observable.
Def: A predetermined variable (predetermined in the system) that is correlated with the endogenous regressor is called an
instrumental variable or instrument. Sometimes we call it a valid instrument to emphasize that the correlation with
the endogenous regressor is not 0.

Observable Supply Shifters:


Given appropriate observable supply shifters (instruments), we can estimate demand and supply!
Suppose the supply shifter v_i can be divided into an observable factor x_i and an unobservable factor η_i with Cov(x_i, η_i) = 0 39
38 To see the endogeneity, treat the 3 equations as a system of simultaneous equations and solve for p_i and q_i:
q_i^d = q_i^s, so α_0 + α_1 p_i + u_i = β_0 + β_1 p_i + v_i, so (α_1 - β_1) p_i = (β_0 - α_0) + (v_i - u_i), so p_i = (β_0 - α_0)/(α_1 - β_1) + (v_i - u_i)/(α_1 - β_1)
So, Cov(p_i, u_i) = Cov( (β_0 - α_0)/(α_1 - β_1) + (v_i - u_i)/(α_1 - β_1), u_i ) = [1/(α_1 - β_1)] Cov(v_i - u_i, u_i) = [1/(α_1 - β_1)](Cov(v_i, u_i) - Var(u_i)) = -Var(u_i)/(α_1 - β_1)   (since Cov(v_i, u_i) = 0 by assumption)
Cov(p_i, v_i) = Cov( (β_0 - α_0)/(α_1 - β_1) + (v_i - u_i)/(α_1 - β_1), v_i ) = [1/(α_1 - β_1)] Cov(v_i - u_i, v_i) = [1/(α_1 - β_1)](Var(v_i) - Cov(v_i, u_i)) = Var(v_i)/(α_1 - β_1)   (since Cov(v_i, u_i) = 0 by assumption)

Supply Equation: q_i^s = β_0 + β_1 p_i + β_2 x_i + η_i

Suppose further that the observed supply shifter xi is predetermined in the demand equation, i.e. uncorrelated with the error term ui
(e.g. think of x_i as the temperature in coffee-growing regions). If the temperature (x_i) is uncorrelated with the unobserved factors that
shift demand (ui), i.e. temperature (xi) is an instrument (for the demand equation), it would be possible to extract from observed price
movements a component that is related to the temperature (i.e. the observed supply shifter) but uncorrelated with the demand shifter.
Then, we can estimate the demand curve by examining the relationship between coffee consumption and that component of price.

IV Estimator for α_1: We derive the IV estimator for α_1 below

We can re-express price as:
q_i^d = q_i^s, so α_0 + α_1 p_i + u_i = β_0 + β_1 p_i + β_2 x_i + η_i, so (α_1 - β_1) p_i = (β_0 - α_0) + β_2 x_i + (η_i - u_i)
p_i = (β_0 - α_0)/(α_1 - β_1) + [β_2/(α_1 - β_1)] x_i + (η_i - u_i)/(α_1 - β_1)

Cov(p_i, x_i) = Cov( (β_0 - α_0)/(α_1 - β_1) + [β_2/(α_1 - β_1)] x_i + (η_i - u_i)/(α_1 - β_1), x_i )
             = [β_2/(α_1 - β_1)] Var(x_i) + [1/(α_1 - β_1)] (Cov(η_i, x_i) - Cov(u_i, x_i))
             = [β_2/(α_1 - β_1)] Var(x_i)   since Cov(η_i, x_i) = 0 by construction and Cov(u_i, x_i) = 0 by assumption
             ≠ 0   (so x_i is a valid instrument)

With a valid instrument, we can estimate the price coefficient α_1 of the demand curve consistently:
Cov(q_i, x_i) = Cov(α_0 + α_1 p_i + u_i, x_i) = α_1 Cov(p_i, x_i) + Cov(u_i, x_i) = α_1 Cov(p_i, x_i)   since Cov(u_i, x_i) = 0 by assumption

α_1 = Cov(q_i, x_i) / Cov(p_i, x_i)

If we observe an iid sample (q_i, p_i, x_i), then by the analogy principle, the natural (consistent) estimator is
(we say the endogenous regressor p_i is instrumented by x_i):

α̂_1,IV = (sample covariance between x_i and q_i) / (sample covariance between x_i and p_i) = ∑_i (x_i - x̄)(q_i - q̄) / ∑_i (x_i - x̄)(p_i - p̄)

Substituting q_i = α_0 + α_1 p_i + u_i and using the sample-covariance identity (footnote 40),

α̂_1,IV = α_1 + ∑_i (x_i - x̄)(u_i - ū) / ∑_i (x_i - x̄)(p_i - p̄) →P α_1 + Cov(x_i, u_i)/Cov(x_i, p_i) = α_1

The IV estimator is consistent for α_1
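Continuing the coffee-market simulation from above, a sketch of this covariance-ratio IV estimator (parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
a0, a1 = 10.0, -1.0                 # demand: q = a0 + a1*p + u
b0, b1, b2 = 2.0, 1.5, 0.8          # supply: q = b0 + b1*p + b2*x + eta
u = rng.normal(size=n)              # unobserved demand shifter
x = rng.normal(size=n)              # observed supply shifter (instrument), uncorrelated with u
eta = rng.normal(size=n)            # unobserved supply shifter, uncorrelated with x

p = ((b0 - a0) + b2 * x + (eta - u)) / (a1 - b1)         # equilibrium price
q = a0 + a1 * p + u

a1_ols = np.cov(p, q)[0, 1] / np.var(p, ddof=1)          # inconsistent for a1
a1_iv = np.cov(x, q)[0, 1] / np.cov(x, p)[0, 1]          # analogy-principle IV estimator
print(a1_ols, a1_iv)                                     # a1_iv is close to a1 = -1, a1_ols is not
```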


An instrumental variable is one that is correlated with the independent variable but not with the error term. The estimator is

β̂_IV = (∑_i z_i x_i)^{-1} ∑_i z_i y_i = β + (∑_i z_i x_i)^{-1} ∑_i z_i ε_i

When z and ε are uncorrelated, the final term vanishes in the limit, providing a consistent estimator. Note that when x is
uncorrelated with the error term, x is itself an instrument. In that case the OLS estimator is a type of IV estimator.
The approach above generalizes in a straightforward way to a regression with multiple explanatory variables. Suppose
X is the T x K matrix of explanatory variables resulting from T observations on K variables. Let Z be a T x K matrix of
instruments. Then β̂_IV = (Z'X)^{-1} Z'Y.

39 This decomposition is always possible by the projection theorem: v_i can be expressed as its projection onto the space spanned by (1, x_i) plus the orthogonal complement (remember, v_i includes all factors that affect supply, so by definition it has at least as many dimensions as x_i). i.e. If the least squares projection of v_i on a constant and x_i is E*(v_i | 1, x_i) = γ_0 + γ_2 x_i, define η_i = v_i - γ_0 - γ_2 x_i. By definition, η_i is orthogonal to x_i and E(η_i) = 0, so η_i and x_i are uncorrelated. Substituting this into the original supply equation and combining the intercept terms, we get the resulting expression.
40 Recall, the sample covariance (up to the 1/n factor) can be expressed as
∑_i (x_i - x̄)(y_i - ȳ) = ∑_i (x_i y_i - x̄ y_i - x_i ȳ + x̄ ȳ) = ∑_i x_i y_i - x̄ ∑_i y_i - ȳ ∑_i x_i + n x̄ ȳ = ∑_i x_i y_i - n x̄ ȳ - n x̄ ȳ + n x̄ ȳ = ∑_i x_i y_i - n x̄ ȳ
Here, (1/n) ∑_i (x_i - x̄)(u_i - ū) →P Cov(x_i, u_i) = 0, so the last term in the expression for α̂_1,IV vanishes in the limit.

One computational method often used for implementing the technique is two-stage least squares (2SLS). One
advantage of this approach is that it can efficiently combine information from multiple instruments for over-identified
regressions, where there are fewer covariates than instruments. Under the 2SLS approach, in a first stage, each
endogenous covariate (predictor variable) is regressed on all valid instruments, including the full set of exogenous
covariates in the main regression. Since the instruments are exogenous, these approximations of the endogenous
covariates will not be correlated with the error term. So, intuitively they provide a way to analyze the relationship
between the outcome variable and the endogenous covariates. In the second stage, the regression of interest is estimated
as usual, except that each endogenous covariate is replaced with its approximation estimated in the first stage.
The slope estimator thus obtained is consistent. A small correction must be made to the sum-of-squared residuals in the
second-stage fitted model in order that the associated standard errors be computed correctly.
Stage 1: X̂ = Z(Z'Z)^{-1}Z'X = P_Z X
Stage 2: β̂_2SLS = (X̂'X̂)^{-1} X̂'Y = (X'P_Z X)^{-1} X'P_Z Y
Mathematically, this estimator is identical to the single-stage estimator presented above when the number of
instruments is the same as the number of covariates.
Two-Stage Least Squares (2SLS) Estimator for α_1: This is another procedure for consistently estimating α_1, named thusly
because the procedure consists of running 2 least squares (OLS) regressions.
First Stage: The endogenous regressor p_i is regressed on a constant and the predetermined variable x_i to obtain fitted values p̂_i.
(OLS coeff. for x_i is sample cov between p_i and x_i / sample variance of x_i).
Second Stage: Regress the dependent variable q_i on a constant and p̂_i.
(OLS coeff. for p̂_i is sample cov between p̂_i and q_i / sample variance of p̂_i).
The 2nd stage estimates the equation (bracketed term is the error): q_i = α_0 + α_1 p̂_i + [u_i + α_1 (p_i - p̂_i)]

The 2SLS Estimator is consistent: Why???


Here, the IV and 2SLS estimators are numerically the same (generally this will be true; see later).
Generally, 2SLS estimator can be written as an IV estimator with an appropriate choice of instruments, and the IV estimator
is a particular GMM estimator.
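A sketch of the two-stage computation on the same simulated coffee market, checking that it matches the covariance-ratio IV estimator in this just-identified case (names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
a0, a1 = 10.0, -1.0
b0, b1, b2 = 2.0, 1.5, 0.8
u, x, eta = rng.normal(size=(3, n))
p = ((b0 - a0) + b2 * x + (eta - u)) / (a1 - b1)
q = a0 + a1 * p + u

# First stage: regress p on a constant and x, form fitted values p_hat
W1 = np.column_stack([np.ones(n), x])
p_hat = W1 @ np.linalg.lstsq(W1, p, rcond=None)[0]

# Second stage: regress q on a constant and p_hat
W2 = np.column_stack([np.ones(n), p_hat])
a0_2sls, a1_2sls = np.linalg.lstsq(W2, q, rcond=None)[0]

# Just-identified case: numerically identical to the covariance-ratio IV estimator
a1_iv = np.cov(x, q)[0, 1] / np.cov(x, p)[0, 1]
print(a1_2sls, a1_iv)      # equal up to floating-point error, both near a1 = -1
```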

B. Errors-in-Variables/Measurement Error Example:


This is the phenomenon that an otherwise predetermined regressor becomes endogenous when measured with error
C.
The point is, if you have endogeneity problems and you can find instruments that are not correlated with the error term (i.e. meet the
orthogonality condition) but are correlated with the endogenous regressors (and correlated with the dependent variable only through the
endogenous regressors), then you can find a consistent estimator for the underlying parameters of interest.
