1 Introduction
Module 4 covers the following topics:

- matrix representation of the linear regression model, both simple and multiple
- derivation of the covariance matrices and standard errors of the estimators needed for statistical inference (calculation of confidence intervals and hypothesis tests)
The purpose of all this is to develop the formal statistical machinery of the
multiple regression model to clear the way for discussion of substantive and
practical aspects of multiple regression in Module 5. This material is in part
foundational: matrices are almost indispensable for advanced methods beyond
regression analysis. Readings for Module 4 are ALSM5e ???; ALSM4e Chapter
5 and part of Chapter 6 (pp. 225-236).
2 Matrices
Matrix notation is a set of conventions for representing tables of numbers and the operations performed on them.
2.1 Matrices
A matrix is a rectangular table of numbers. Examples are

\[
\mathbf{A} = \begin{bmatrix} 2.1 & 1.7 \\ 4.0 & 2.3 \\ 5.0 & 0.3 \\ 4.7 & 1.5 \end{bmatrix} \qquad
\mathbf{B} = \begin{bmatrix} 1.0 & .32 & .55 \\ .32 & 1.0 & .61 \\ .55 & .61 & 1.0 \end{bmatrix} \qquad
\mathbf{C} = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \\ c_{31} & c_{32} \end{bmatrix} \qquad
\mathbf{D} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4 \end{bmatrix}
\]

The dimensions of a matrix are the number of rows followed by the number of columns: A is 4×2, B is 3×3, C is 3×2, and D is 3×3. An element of a matrix is identified by its row index followed by its column index, so the element of C in row i and column j is identified as c_ij.

A matrix with as many rows as columns, like B and D, is a square matrix. B and D are also symmetric matrices: their elements are mirror images across the leading or main diagonal. D is furthermore a diagonal matrix: all entries are zero except those on the main diagonal. A diagonal matrix with all 1s on the main diagonal is called an identity matrix, and a matrix consisting entirely of zeros is called a null matrix; like other matrices they are denoted by bold capital letters. Examples of an identity matrix and a null matrix are

\[
\mathbf{I} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad
\mathbf{0} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}
\]
The transpose of the matrix A, denoted A′, is obtained by interchanging each row with the same-numbered column, so that the first row of A becomes the first column of A′, the second row becomes the second column, and so on. (Formally, if A = [a_ij] then A′ = [a_ji].) (I find it useful to visualize the transpose operation with a special hand movement which I will demonstrate in class.) An alternative notation found for the transpose of A is A^T. The transpose of a 4×2 matrix is 2×4, because rows and columns are interchanged. A symmetric matrix is equal to its transpose, so that A′ = A.
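As an illustration, here is a minimal sketch in Python with numpy (an assumption for illustration; it is not the package used in class) showing how these definitions translate into code:

```python
import numpy as np

# the 4x2 matrix A from the examples above
A = np.array([[2.1, 1.7],
              [4.0, 2.3],
              [5.0, 0.3],
              [4.7, 1.5]])
print(A.shape)    # (4, 2): number of rows, then number of columns
print(A.T.shape)  # (2, 4): the transpose interchanges rows and columns

I = np.eye(3)         # 3x3 identity matrix
Z = np.zeros((3, 2))  # 3x2 null matrix

# a symmetric matrix is equal to its transpose
B = np.array([[1.0, .32, .55],
              [.32, 1.0, .61],
              [.55, .61, 1.0]])
print(np.array_equal(B, B.T))  # True
```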
A matrix consisting of a single column is a column vector; a matrix consisting of a single row is a row vector. Vectors are denoted by bold lowercase letters, and a row vector is written as the transpose of a column vector, e.g. b′. Single numbers are viewed as 1×1 matrices and called scalars. (When used in matrix expressions, scalars are denoted by ordinary lowercase or Greek letters such as λ (lambda).) Examples are

\[
\mathbf{a} = \begin{bmatrix} 1.2 \\ 3 \\ 1.7 \end{bmatrix} \qquad
\mathbf{b}' = \begin{bmatrix} 1.1 & .5 & 2 \end{bmatrix} \qquad
c = 22.6 \qquad \lambda = .05
\]
Two matrices of the same dimensions are added by adding the corresponding elements; one matrix is subtracted from another by reversing the sign of its elements and then adding the corresponding elements. Two column vectors of the same length, or two row vectors of the same length, may be added or subtracted in the same way. Examples of addition and subtraction are shown below.
\[
\mathbf{A} = \begin{bmatrix} 0 & 4 \\ 2 & 5 \\ 7 & 1 \end{bmatrix} \qquad
\mathbf{B} = \begin{bmatrix} 1 & 3 \\ 3 & -2 \\ 5 & 0 \end{bmatrix} \qquad
\mathbf{A} + \mathbf{B} = \begin{bmatrix} 1 & 7 \\ 5 & 3 \\ 12 & 1 \end{bmatrix} \qquad
\mathbf{A} - \mathbf{B} = \begin{bmatrix} -1 & 1 \\ -1 & 7 \\ 2 & 1 \end{bmatrix}
\]
The inner product a′b of a row vector a′ and a column vector b with the same number of elements is the scalar obtained by summing the products of the corresponding elements:

\[
\mathbf{a}'\mathbf{b} = \begin{bmatrix} 2 & 4 & 1 & 3 \end{bmatrix}
\begin{bmatrix} 1 \\ 3 \\ 5 \\ 2 \end{bmatrix}
= (2)(1) + (4)(3) + (1)(5) + (3)(2) = 2 + 12 + 5 + 6 = 25
\]
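A one-line numpy check of the inner product above (illustrative only):

```python
import numpy as np

a = np.array([2, 4, 1, 3])
b = np.array([1, 3, 5, 2])
print(a @ b)  # 25: sum of the products of corresponding elements
```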
Matrix multiplication builds on the inner product. In the product AB, the matrix B is premultiplied by the matrix A (equivalently, A is postmultiplied by B), and element (i, j) of AB is the inner product of row i of A with column j of B. The product AB therefore has as many rows as the first matrix (A) and as many columns as the second matrix (B). For multiplication to be possible the rows of the first matrix (A) and the columns of the second matrix (B) must be of equal length. Thus the second dimension (number of columns) of A must equal the first dimension (number of rows) of B: a 3×2 matrix can be postmultiplied by a 2×5 matrix (the product being of dimensions 3×5), a 1×5 row vector by a 5×1 column vector (the product being a 1×1 scalar), or a 5×1 column vector by a 1×5 row vector (the product being a 5×5 matrix). In each case the middle numbers have to be equal, and the dimensions of the product are given by the outside dimensions. For example:

\[
\mathbf{A} = \begin{bmatrix} 0 & 4 \\ 2 & 5 \\ 7 & 1 \end{bmatrix} \qquad
\mathbf{B} = \begin{bmatrix} 3 & 4 \\ 2 & 5 \end{bmatrix}
\]

\[
\begin{aligned}
AB_{11} &= (0)(3) + (4)(2) = 8 &\qquad AB_{12} &= (0)(4) + (4)(5) = 20 \\
AB_{21} &= (2)(3) + (5)(2) = 16 &\qquad AB_{22} &= (2)(4) + (5)(5) = 33 \\
AB_{31} &= (7)(3) + (1)(2) = 23 &\qquad AB_{32} &= (7)(4) + (1)(5) = 33
\end{aligned}
\]

\[
\mathbf{AB} = \begin{bmatrix} 8 & 20 \\ 16 & 33 \\ 23 & 33 \end{bmatrix}
\]
Multiplication extends to a series of matrices such as WXYZ, provided adjacent dimensions match: if W, X, Y, and Z are of dimensions 4×2, 2×3, 3×7, and 7×5 respectively, then the product WXYZ exists and is of dimensions 4×5 (given by the outermost dimensions in the series). The principle holds for vectors as well, since a vector is simply a matrix with one dimension equal to 1.
Multiplying a matrix by an identity matrix leaves the matrix unchanged. Premultiplying a matrix by a diagonal matrix D rescales the rows of the matrix by the corresponding diagonal elements of D; postmultiplying by D rescales the columns. Finally, for any matrix X the products X′X and XX′ always exist, and for a column vector x the product x′x is a scalar.
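A short numpy sketch of these multiplication rules (illustrative; the matrices are the ones from the example above):

```python
import numpy as np

A = np.array([[0, 4],
              [2, 5],
              [7, 1]])   # 3x2
B = np.array([[3, 4],
              [2, 5]])   # 2x2

print(A @ B)             # the 3x2 product computed element by element above

I2 = np.eye(2)
print(np.allclose(A @ I2, A))  # True: multiplying by I leaves A unchanged

D = np.diag([10.0, 1.0, 0.1])
print(D @ A)                     # premultiplying: rows of A rescaled by 10, 1, .1
print(A @ np.diag([10.0, 1.0]))  # postmultiplying: columns of A rescaled

print((A.T @ A).shape)   # (2, 2): X'X always exists
print((A @ A.T).shape)   # (3, 3): XX' always exists
```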
Matrix notation gives a compact representation of a system of linear equations. Consider the system of two equations in the two unknowns b1 and b2

\[
\begin{aligned}
2b_1 + 4b_2 &= 20 \\
3b_1 + b_2 &= 10
\end{aligned}
\]

Collecting the coefficients in a matrix A, the unknowns in a vector b′ = [b1 b2], and the constants in a vector c′ = [20 10], the system can be written as

\[
\begin{bmatrix} 2 & 4 \\ 3 & 1 \end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix} =
\begin{bmatrix} 20 \\ 10 \end{bmatrix}
\]

or

Ab = c
Note how compact the matrix representation is: the very same expression Ab = c can represent two equations with two unknowns as well as 1000 equations with 1000 unknowns. Solving a system of equations requires the inverse of a matrix (see next).
The inverse of a scalar x is its reciprocal 1/x or x^{-1}, and dividing a by x is the same as multiplying a by the reciprocal of x: a(1/x) = a/x. For matrices there is no direct equivalent of division, but the matrix inverse is the equivalent of the reciprocal; multiplying by the inverse is the matrix equivalent of dividing.
The inverse of a square matrix A, denoted A^{-1}, is the matrix such that A^{-1}A = AA^{-1} = I (just as (1/a)a = 1 for scalars). An example is

\[
\mathbf{A} = \begin{bmatrix} 2 & 4 \\ 3 & 1 \end{bmatrix} \qquad
\mathbf{A}^{-1} = \begin{bmatrix} -.1 & .4 \\ .3 & -.2 \end{bmatrix}
\]

One can verify by carrying out the multiplication that AA^{-1} = I.

Not every square matrix has an inverse. When the columns (or rows) of a matrix are linearly dependent, i.e. some of them are predictable from others, it does not have an inverse (see Linear dependence & rank below). A matrix that has no inverse is called singular.
The inverse of A can be used to solve the system of equations Ab = c, if A is non-singular, by premultiplying both sides of the equation by A^{-1}, like this:

\[
\begin{aligned}
\mathbf{Ab} &= \mathbf{c} \\
\mathbf{A}^{-1}\mathbf{Ab} &= \mathbf{A}^{-1}\mathbf{c} \\
\mathbf{Ib} &= \mathbf{A}^{-1}\mathbf{c} \\
\mathbf{b} &= \mathbf{A}^{-1}\mathbf{c}
\end{aligned}
\]

For example, the system of equations Ab = c above is solved by forming the product A^{-1}c, i.e.

\[
\begin{bmatrix} -.1 & .4 \\ .3 & -.2 \end{bmatrix}
\begin{bmatrix} 20 \\ 10 \end{bmatrix} =
\begin{bmatrix} 2 \\ 4 \end{bmatrix}
\]

so the solution is b1 = 2 and b2 = 4.
Computing the inverse of a matrix tends in general to be a large computational task. Let the computer do it. You can always check whether the result it has given you is correct by carrying out the multiplication AA^{-1}, which should equal I.
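A numpy version of this example (illustrative), including the AA^{-1} = I check:

```python
import numpy as np

A = np.array([[2.0, 4.0],
              [3.0, 1.0]])
c = np.array([20.0, 10.0])

Ainv = np.linalg.inv(A)
print(Ainv)                               # [[-0.1  0.4]
                                          #  [ 0.3 -0.2]]
print(np.allclose(A @ Ainv, np.eye(2)))   # True: AA^{-1} = I

print(Ainv @ c)                # [2. 4.]: the solution b
print(np.linalg.solve(A, c))   # same solution, without forming the inverse
```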
Some properties of the inverse are:

1. (A′)^{-1} = (A^{-1})′, i.e. the inverse of the transpose is the transpose of the inverse;
2. (A^{-1})^{-1} = A, i.e. the inverse of the inverse is the original matrix;
3. the inverse of a symmetric matrix is also symmetric.

In a few special cases matrix inversion does not require extensive computations:

1. The inverse of an identity matrix is itself, i.e. I^{-1} = I.
2. The inverse of a 2×2 matrix is given by

\[
\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} =
\frac{1}{ad - bc}
\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}
\]

i.e. interchange a and d, change the sign of b and c, and multiply the matrix by the reciprocal of the determinant ad − bc.

Two properties of combined operations are:

1. (ABCD)′ = D′C′B′A′, i.e. the transpose of a product of matrices is the product of the transposes taken in reverse order;
2. (ABCD)^{-1} = D^{-1}C^{-1}B^{-1}A^{-1}, i.e. the inverse of a product of matrices is the product of the inverses taken in reverse order.

For the second property to hold all the matrices have to be square, of the same order, and non-singular (otherwise multiplication would not be possible and/or the necessary inverses would not exist).
3 Linear Dependence & Rank

Consider the 5×4 matrix

\[
\mathbf{X} = \begin{bmatrix}
1 & 2 & 1 & 0 \\
1 & 4 & 1 & 0 \\
1 & 3 & 1 & 0 \\
1 & 7 & 0 & 1 \\
1 & 4 & 0 & 1
\end{bmatrix}
\]

Note that the first column is the sum of the last two columns.
Denote the columns of an r×c matrix by c1, c2, ..., cc. The columns are linearly dependent when one can find c scalars λ1, λ2, ..., λc, not all zero, such that

\[
\lambda_1 \mathbf{c}_1 + \lambda_2 \mathbf{c}_2 + \cdots + \lambda_c \mathbf{c}_c = \mathbf{0}
\]

If this relation only holds when all the λ's are zero, the columns are linearly independent. The rank of a matrix is the maximum number of linearly independent columns.

Q - Find 4 λ's, not all zero, showing that the columns of X above are linearly dependent. (Hint: try 1, 0, −1, −1.) What is the rank of X?
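A numpy check (illustrative) that the columns of X are linearly dependent and that the rank is less than the number of columns:

```python
import numpy as np

X = np.array([[1, 2, 1, 0],
              [1, 4, 1, 0],
              [1, 3, 1, 0],
              [1, 7, 0, 1],
              [1, 4, 0, 1]])

# lambdas (1, 0, -1, -1): column 1 minus columns 3 and 4 is the zero vector
lam = np.array([1, 0, -1, -1])
print(X @ lam)                    # [0 0 0 0 0]

print(np.linalg.matrix_rank(X))   # 3: only 3 linearly independent columns

# as a consequence X'X is singular, so (X'X)^{-1} does not exist
print(np.linalg.matrix_rank(X.T @ X))  # 3, less than 4
```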
4 Random Vectors
A random vector is a vector whose elements are random variables. For example, in the simple regression model one can define the n×1 vector ε containing the errors for each observation, so that ε′ = [ε1 ε2 ⋯ εn]. (Note that the vector is given by its transpose to save vertical space in the text.)

The expectation E{ε} of a random vector is the vector of the expectations of the random variables constituting its elements. For example, in the simple regression model

E{ε}′ = [E{ε1} E{ε2} ⋯ E{εn}]

One assumption of the regression model is that the expectation of the distribution of the error for each observation is zero. This is the same as assuming that the expected value of the random vector ε is the null vector, i.e. E{ε} = 0. (See ALSM A.11 and A.12.)
The variance-covariance matrix σ²{ε} of a random vector ε is defined as the expectation E{(ε − E{ε})(ε − E{ε})′}; it consists of the variances of the elements on the diagonal and the covariances of pairs of elements off the diagonal. For example the covariance matrix of an n×1 random vector ε is the n×n matrix

\[
\sigma^2\{\boldsymbol{\varepsilon}\} = \begin{bmatrix}
\sigma^2\{\varepsilon_1\} & \sigma\{\varepsilon_1, \varepsilon_2\} & \cdots & \sigma\{\varepsilon_1, \varepsilon_n\} \\
\sigma\{\varepsilon_2, \varepsilon_1\} & \sigma^2\{\varepsilon_2\} & \cdots & \sigma\{\varepsilon_2, \varepsilon_n\} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma\{\varepsilon_n, \varepsilon_1\} & \sigma\{\varepsilon_n, \varepsilon_2\} & \cdots & \sigma^2\{\varepsilon_n\}
\end{bmatrix}
\]

In the regression model the errors are assumed uncorrelated across observations, with the same variance σ² for each, so that the covariance matrix becomes

\[
\sigma^2\{\boldsymbol{\varepsilon}\} = \begin{bmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{bmatrix}
\]

which, since E{ε} = 0, can be written compactly as

σ²{ε} = E{εε′} = σ²I
5 The Regression Model in Matrix Form

So far we have represented the simple linear regression model as a generic equation

\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
\]

that is supposed to hold for each observation i = 1, ..., n. To write out the model corresponding to each observation in the data set one would have to write

\[
\begin{aligned}
y_1 &= \beta_0 + \beta_1 x_1 + \varepsilon_1 \\
y_2 &= \beta_0 + \beta_1 x_2 + \varepsilon_2 \\
&\;\;\vdots \\
y_n &= \beta_0 + \beta_1 x_n + \varepsilon_n
\end{aligned}
\]

as many times as there are cases.
Matrices provide a more compact representation of the model. One defines

\[
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \qquad
\mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \qquad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \qquad
\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\]

The matrices y, X and ε have as many rows as there are observations in the data set. The regression model for the entire data set (i.e. the equivalent of the n equations above) is then

y = Xβ + ε
The intercept β0 is treated as the coefficient of the constant term, a variable equal to 1 for every observation i = 1, ..., n, represented by the column of 1s in X; this is what allows collecting all the terms on the right-hand side of the equation into the single product Xβ. The same notation extends to the multiple regression model with p − 1 independent variables. (The reason for setting the index of the last independent variable to p − 1 is that, counting the constant, the number of columns of X is then p − 1 + 1 = p, a nice simple number.) One defines y and ε as before, and

\[
\mathbf{X} = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p-1} \\
1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p-1}
\end{bmatrix} \qquad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}
\]

Then the regression model for the entire data set can be written

y = Xβ + ε
which is precisely the same as for the simple linear regression model. The only differences between the simple and multiple regression models are the dimensions of X (now n×p) and of β (now p×1). The assumptions on the errors ε also carry over: the weak set assumes

E{ε} = 0 and σ²{ε} = E{εε′} = σ²I

while the stronger set assumes in addition that the errors are normally distributed. It follows that the random vector y has expectation E{y} = Xβ and covariance matrix σ²{y} = σ²I.
5.4 OLS Estimation

The OLS estimates of the regression coefficients minimize the sum of squared errors

\[
Q = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i,1} - \cdots - \beta_{p-1} x_{i,p-1})^2
\]
It can be shown with calculus (ALSM5e ??; ALSM4e pp. 201-202) that the OLS estimator of β is the vector

b′ = [b0 b1 ⋯ bp−1]

that satisfies the normal equations

X′Xb = X′y
Q - What are the dimensions of X′X? Of X′y? (The normal equations are a system of p equations with p unknowns.)

The normal equations are solved for b by pre-multiplying both sides by (X′X)^{-1}:

\[
\begin{aligned}
\mathbf{X}'\mathbf{X}\mathbf{b} &= \mathbf{X}'\mathbf{y} \\
(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\mathbf{b} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \\
\mathbf{I}\mathbf{b} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \\
\mathbf{b} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
\end{aligned}
\]

The relation b = (X′X)^{-1}X′y is the fundamental formula of OLS.
Q - What are the dimensions of b? Of (X′X)^{-1}X′? The Gauss-Markov theorem says that b is BLUE; can you tell from the formula for b why there is an L in BLUE?
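A minimal numpy sketch of the fundamental formula, using made-up data (illustrative only, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up data: n = 20 observations, p - 1 = 2 independent variables
n = 20
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3 + 2 * x1 - 1 * x2 + rng.normal(scale=0.5, size=n)

# design matrix: a column of 1s (the constant) plus the x's, so p = 3
X = np.column_stack([np.ones(n), x1, x2])

# the fundamental formula of OLS: b = (X'X)^{-1} X'y
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(b)   # close to the true coefficients (3, 2, -1)

# the library's least-squares routine gives the same answer
print(np.linalg.lstsq(X, y, rcond=None)[0])
```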
5.4.1 Fitted Values

The fitted value of Y for observation i = 1, ⋯, n is Ŷi. The n×1 vector of fitted values, Ŷ′ = [Ŷ1 Ŷ2 ⋯ Ŷn], is calculated as

ŷ = Xb

Q - Why does Xb represent the fitted (predicted, estimated) values of Y?
5.4.2 The Hat Matrix

Replacing b in ŷ = Xb by the fundamental formula gives

ŷ = X(X′X)^{-1}X′y

or equivalently

ŷ = Hy

where

H = X(X′X)^{-1}X′

The matrix H is called the hat matrix.
(Q - Why would that be?) H is square, of dimension n×n (the dimensions are determined by the number of observations on y). Thus H transforms the observed y into the fitted ŷ. H is symmetric and idempotent, i.e., HH = H; geometrically, it represents the projection of y onto the space spanned by the columns of X. H is very important in outlier diagnostics, as discussed in Module 10. (To anticipate, the diagonal elements h_ii of H measure the leverage of observation i, i.e. the degree to which the observed value y_i affects its own fitted value ŷ_i.)
Q - Verify that H is idempotent by forming the product HH, replacing H by its expression in terms of X.
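A numpy check of these properties of H (an illustrative sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([3.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# hat matrix: H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(H.shape)                # (20, 20): n x n
print(np.allclose(H @ H, H))  # True: H is idempotent
print(np.allclose(H, H.T))    # True: H is symmetric

# Hy equals the fitted values Xb
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(np.allclose(H @ y, X @ b))  # True

# the diagonal elements h_ii (the leverages) sum to p
print(np.isclose(np.trace(H), X.shape[1]))  # True: trace(H) = p = 3
```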
5.4.3 Residuals

The residual e_i for observation i is estimated as e_i = y_i − ŷ_i, the same as in simple regression. The n×1 vector of residuals, e′ = [e1 e2 ⋯ en], is given by

\[
\begin{aligned}
\mathbf{e} &= \mathbf{y} - \hat{\mathbf{y}} \\
\mathbf{e} &= \mathbf{y} - \mathbf{Hy} \\
\mathbf{e} &= (\mathbf{I} - \mathbf{H})\mathbf{y}
\end{aligned}
\]
Q - How does the last expression follow?

Like H, the matrix I − H is n×n, symmetric and idempotent. Furthermore H(I − H) = 0. (Q - Can you see why?)
5.5 Analysis of Variance

The analysis of variance is the same as in the simple regression model, except for the degrees of freedom corresponding to the number of independent variables (n − p in multiple regression):
Source       Sum of Squares           df       Mean Squares
Regression   SSR = Σ(ŷi − ȳ)²         p − 1    MSR = SSR/(p − 1)
Error        SSE = Σ(yi − ŷi)²        n − p    MSE = SSE/(n − p)
Total        SSTO = Σ(yi − ȳ)²        n − 1    Var(Y) = SSTO/(n − 1)
5.5.1 Decomposition and Degrees of Freedom

The decomposition of the total sum of squares is based on the identity

(yi − ȳ) = (ŷi − ȳ) + (yi − ŷi)

which expresses, for each observation, the deviation of yi from the mean ȳ as the sum of the deviation of the fitted value from the mean plus the deviation of yi from the fitted value. (The identity is tautological, as one can see by removing the parentheses on the right-hand side; then ŷi and −ŷi cancel out.) The degrees of freedom are:

- for SSTO, df = (n − 1): one df is lost estimating the mean ȳ;
- for SSR, df = (p − 1): the p estimated coefficients minus one df for the constant;
- for SSE, df = (n − p): p df are lost estimating the coefficients (including the constant) from which the ŷi, and thus the ei, are calculated.
The mean squares are the sums of squares divided by their df. (Note that SSTO/(n − 1) = S², the variance of y.)

5.5.2 Sums of Squares as Quadratic Forms
Sums of squares in matrix form are shown in Table 5.5.2. Matrix notation for SSR and SSTO uses an n×1 vector u (for unity) containing all 1s. The product y′uu′y is equal to (Σyi)², because

(y′u)(u′y) = (Σyi)(Σyi) = (Σyi)²

so that (1/n)y′uu′y = (Σyi)²/n is the correction for the mean appearing in both SSR and SSTO.
Table 5.5.2 shows that the sums of squares can all be represented in the form y′Ay, where A is a symmetric n×n matrix.

Table 5.5.2
        Matrix Notation (Quadratic Form)    df
SSR     y′[H − (1/n)uu′]y                   p − 1
SSE     y′(I − H)y                          n − p
SSTO    y′[I − (1/n)uu′]y                   n − 1

An expression of the form

\[
\mathbf{y}'\mathbf{A}\mathbf{y} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} y_i y_j
\qquad \text{where } a_{ij} = a_{ji}
\]

is called a quadratic form; y′Ay is 1×1 (a scalar), and A is called the matrix of the quadratic form. Each of the sums of squares is thus a quadratic form in the observations y. (In the underlying geometry, quadratic forms represent distances in n-dimensional space.)
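A numpy verification (illustrative, with made-up data) that the quadratic forms in Table 5.5.2 reproduce the sums of squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([3.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
u = np.ones(n)
J = np.outer(u, u)      # uu': an n x n matrix of 1s
yhat = H @ y

SSR  = y @ (H - J / n) @ y
SSE  = y @ (np.eye(n) - H) @ y
SSTO = y @ (np.eye(n) - J / n) @ y

print(np.isclose(SSR,  np.sum((yhat - y.mean()) ** 2)))  # True
print(np.isclose(SSE,  np.sum((y - yhat) ** 2)))         # True
print(np.isclose(SSTO, np.sum((y - y.mean()) ** 2)))     # True
print(np.isclose(SSTO, SSR + SSE))                       # True
```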
5.5.3 Distributions of the Sums of Squares

It can be shown that:

- if z is distributed N(0,1), then z² is distributed as χ²(1) (chi-square with 1 df);
- the sum of k independent variables each distributed as χ²(1) is distributed as χ²(k) (chi-square with k df);
- thus if y is a vector of independent N(0,1) variables, then SSR, SSE, and SSTO are distributed as χ²(p − 1), χ²(n − p), and χ²(n − 1), respectively;
- the ratio of two independent chi-square variables, each divided by its df, is distributed as F(df1, df2), where df1 and df2 are the df of the numerator and the denominator, respectively; thus for example F = MSR/MSE = (SSR/(p − 1))/(SSE/(n − p)) is distributed as F(p − 1, n − p).
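For reference, the tail probability of an observed F ratio can be computed numerically; a sketch using scipy (an assumption for illustration; the course software has its own tables and functions):

```python
from scipy import stats

# hypothetical observed F = MSR/MSE with p - 1 = 2 and n - p = 17 df
F = 25.4
print(stats.f.sf(F, 2, 17))  # survival function: P(F(2, 17) > 25.4)
```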
5.6 Standard Errors

The estimates b, ŷh, and e are all functions of the observed values of the dependent variable (y), which are functions of the random errors (ε). Thus these estimates are themselves random variables. In order to carry out statistical inference (to do hypothesis tests or calculate confidence intervals) one needs the standard errors of b, of ŷh (the predicted value of y for a combination xh of the independent variables), of e, and others. In each case the standard error of an estimate is the square root of the variance of the estimate, which is the corresponding element in the diagonal of the covariance matrix of the estimate(s).

5.6.1 Covariance Matrix of a Linear Function of y

The estimates b, ŷh, and e are all linear functions of the observed y, i.e. each can be written in the form Ay for a suitable matrix of constants A. The covariance matrix of Ay can be obtained by the theorem

σ²{Ay} = A σ²{y} A′

where y is a random vector and A is a matrix of constants.
5.6.2 Covariance Matrix of b

Since b = (X′X)^{-1}X′y = Ay with A = (X′X)^{-1}X′, the theorem gives

\[
\begin{aligned}
\sigma^2\{\mathbf{b}\} &= \sigma^2\{\mathbf{Ay}\} \\
&= \mathbf{A}\,\sigma^2\{\mathbf{y}\}\,\mathbf{A}' \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\sigma^2\mathbf{I}\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}
\end{aligned}
\]

so that

σ²{b} = σ²(X′X)^{-1}

Then the estimated covariance matrix of b is obtained by replacing σ² with its estimate MSE, as

s²{b} = MSE (X′X)^{-1}

s²{b} is a p×p matrix. The estimated variance s²{b0} is in position (1,1), s²{b1} in position (2,2), and so on, so that the diagonal of s²{b} contains the estimated variances (and thus, taking square roots, the standard errors) of the elements of b.
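Continuing the illustrative numpy sketch, the standard errors of the coefficients are the square roots of the diagonal of MSE·(X′X)^{-1}:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([3.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

p = X.shape[1]
XtXi = np.linalg.inv(X.T @ X)
b = XtXi @ X.T @ y

e = y - X @ b
MSE = e @ e / (n - p)            # SSE / (n - p)

cov_b = MSE * XtXi               # s^2{b} = MSE (X'X)^{-1}, a p x p matrix
se_b = np.sqrt(np.diag(cov_b))   # standard errors of b0, b1, b2
print(se_b)
```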
5.6.3 Variance of a Predicted Value

Let

x′h = [1 x_{h,1} x_{h,2} ⋯ x_{h,p−1}]

be a row vector containing a combination of values of the independent variables (preceded by a 1 for the constant). The case h may correspond to one of the observations in the data set on which the regression model is estimated, in which case x′h represents a row of X, or to a new combination of values. The mean response E{yh} is estimated as

ŷh = x′h b

Note that ŷh = x′h b is a linear function of b. Furthermore ŷh is a scalar, so its covariance matrix σ²{ŷh} reduces to the single variance of ŷh. Applying the theorem on linear functions to ŷh = x′h b:

\[
\begin{aligned}
\sigma^2\{\hat{y}_h\} &= \mathbf{x}_h' \, \sigma^2\{\mathbf{b}\} \, \mathbf{x}_h \\
&= \mathbf{x}_h' \, \sigma^2 (\mathbf{X}'\mathbf{X})^{-1} \mathbf{x}_h \\
&= \sigma^2 \, \mathbf{x}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_h
\end{aligned}
\]

Then the estimated variance of ŷh is given by

s²{ŷh} = MSE(x′h(X′X)^{-1}xh)

and the estimated standard error of ŷh by

s{ŷh} = √(MSE(x′h(X′X)^{-1}xh))

Q - What are the dimensions of x′h(X′X)^{-1}xh? Hint: it is a quadratic form.
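In the same numpy sketch, the predicted value and its standard error for a hypothetical new case xh:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([3.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

XtXi = np.linalg.inv(X.T @ X)
b = XtXi @ X.T @ y
e = y - X @ b
MSE = e @ e / (n - X.shape[1])

xh = np.array([1.0, 0.5, -1.0])  # hypothetical case: constant, x1, x2

yh_hat = xh @ b                          # x_h' b
se_yh = np.sqrt(MSE * (xh @ XtXi @ xh))  # sqrt(MSE * x_h'(X'X)^{-1} x_h)
print(yh_hat, se_yh)
```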
5.6.4 Covariance Matrix of the Residuals

The covariance matrix σ²{e} of the residuals e is derived in the following steps, using e = (I − H)y together with the symmetry and idempotency of I − H:

\[
\begin{aligned}
\mathbf{e} &= (\mathbf{I} - \mathbf{H})\mathbf{y} \\
\sigma^2\{\mathbf{e}\} &= (\mathbf{I} - \mathbf{H})\,\sigma^2\{\mathbf{y}\}\,(\mathbf{I} - \mathbf{H})' \\
&= \sigma^2 (\mathbf{I} - \mathbf{H})\mathbf{I}(\mathbf{I} - \mathbf{H}) \\
&= \sigma^2 (\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H}) \\
&= \sigma^2 (\mathbf{I} - \mathbf{H})
\end{aligned}
\]

so that

σ²{e} = σ²(I − H)

Then the estimated covariance matrix of e is given by

s²{e} = MSE(I − H)

The diagonal of this n×n matrix contains the estimated variances of the elements of e.
5.6.5 The F Test

The ratio F = MSR/MSE provides a screening test of the overall significance of the regression. As mentioned above in connection with quadratic forms, F = MSR/MSE is distributed as F(p − 1, n − p), where (p − 1) and (n − p) are the df of SSR and SSE, respectively, from the ANOVA table. The expectations of MSR and MSE are discussed in ALSM5e p. ?; ALSM4e p. 229.
5.7.2 Matrix Computations

The following commands (reproduced from the class software session) carry out the key matrix computations of this module. Reading them line by line: the data set craft is loaded; the matrix x is formed from the variables size and season; m(9,1,1) appears to create a 9×1 column of 1s (the constant term), which is joined to x with the concatenation operator ||; y is formed from the variable clerks; sho displays x and y; and the last line computes xpxi = (X′X)^{-1} using the transpose (trp) and inverse (inv) functions.

use craft
mat x = craft(;size season)
mat const = m(9,1,1)
mat x = const||x
mat y = craft(; clerks)
sho x y
mat xpxi = inv(trp(x)*x)