This is page i
Printer: Opaque this
© 2000 Jianqing Fan and Runze Li
ALL RIGHTS RESERVED
Contents

2 One Way Analysis of Variance                                      17
  2.4 Individual Contrasts . . . . . . . . . . . . . . . . . . . 26
  2.5 Methods of Multiple Comparisons . . . . . . . . . . . 30
      2.5.1 Bonferroni Method . . . . . . . . . . . . . . . . 30
      2.5.2 Scheffé Method . . . . . . . . . . . . . . . . . . 32
      2.5.3 Tukey Method . . . . . . . . . . . . . . . . . . 33
  2.6 Model Diagnostics . . . . . . . . . . . . . . . . . . . . 35
      2.6.1 Outlier Detection . . . . . . . . . . . . . . . . . 36
      2.6.2 Testing of Normality . . . . . . . . . . . . . . . 38
      2.6.3 Equal Variance . . . . . . . . . . . . . . . . . . 40
  2.7 S-plus Codes for Example 2.1 . . . . . . . . . . . . . . 42

3 Two-way Layout Models                                             47

4 Analysis of Covariance                                            73
  4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 73
  4.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
  4.3 Least Squares Estimates . . . . . . . . . . . . . . . . . 77
  4.4 Analysis of Covariance . . . . . . . . . . . . . . . . . . 78
  4.5 Treatment Contrasts and Confidence Intervals . . . . . 83
  4.6 S-plus Codes for Example 4.1 . . . . . . . . . . . . . . 85

5 Random and Mixed Effects Models                                   87
  5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 87
  5.2 Random Effects One Way Model . . . . . . . . . . . . 90
      5.2.1 Estimation of σ² and σ_T² . . . . . . . . . . . . 91
      5.2.2 Testing Equality of Treatment Effects . . . . . 93
  5.3 Mixed Effects Models and BLUP . . . . . . . . . . . . 94
  5.4 Restricted Maximum Likelihood Estimator (REML) . 97
  5.5 Estimation of Parameters in Mixed Effects Models . . 100
  5.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 101

6 Generalized Linear Models                                        107
  6.3 Maximum Likelihood Methods . . . . . . . . . . . . . 119
      6.3.1 Maximum Likelihood Estimate and Estimated
            Standard Errors . . . . . . . . . . . . . . . . . 119
      6.3.2 Computation . . . . . . . . . . . . . . . . . . . 124
  6.4 Deviance and Residuals . . . . . . . . . . . . . . . . . 127
      6.4.1 Deviance . . . . . . . . . . . . . . . . . . . . . 127
      6.4.2 Analysis of Deviance . . . . . . . . . . . . . . . 128
      6.4.3 Deviance Residuals . . . . . . . . . . . . . . . . 130
      6.4.4 Pearson Residuals . . . . . . . . . . . . . . . . 132
      6.4.5 Anscombe Residuals . . . . . . . . . . . . . . . 133
  6.5 Comparison with Response Transform Models . . . . . 134
  6.6 S-plus Codes . . . . . . . . . . . . . . . . . . . . . . . 135
      6.6.1 S-plus Codes for Example 6.9 . . . . . . . . . . 135
      6.6.2 S-plus Codes for Example 6.10 . . . . . . . . . . 137

References                                                         139

Author Index                                                       141

Subject Index                                                      143
1
Overview of Linear Models
Y = β₀ + β₁X + ε,    (1.1)

Y = β₀ + β₁X₁ + ⋯ + βₚXₚ + ε.    (1.2)

m(x) ≡ E(Y | X = x) = β₀ + β₁x₁ + ⋯ + βₚxₚ,

is linear in the β's. For model (1.1), the linearity of the regression function m(x) in x is not always granted. One may try polynomial regression:

m(x) = β₀ + β₁x + ⋯ + βₚxᵖ.

Polynomial regression is a linear model, as it is linear in the β's, though nonlinear in x.

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ε.    (1.3)

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄S₁ + β₅S₂ + β₆S₃ + ε.    (1.4)

This model yields different intercepts for different seasons. The coefficient β₄ is the difference between the intercept for spring and the intercept for winter.

As an extension of model (1.4), one may consider the following model, having different intercepts and slopes for different seasons:

m(x) = β₀ᵢ + β₁ᵢx₁ + β₂ᵢx₂ + β₃ᵢx₃.
Consider the linear model Yᵢ = Σⱼ₌₁ᵖ Xᵢⱼβⱼ + εᵢ, i = 1, …, n. The least squares estimate minimizes

S(β) = Σᵢ₌₁ⁿ (Yᵢ − Σⱼ₌₁ᵖ Xᵢⱼβⱼ)².    (1.6)

In matrix notation, write

y = (Y₁, …, Yₙ)ᵀ,  β = (β₁, …, βₚ)ᵀ,  ε = (ε₁, …, εₙ)ᵀ,

and let X be the n × p matrix whose (i, j) element is Xᵢⱼ, with rows (Xᵢ₁, …, Xᵢₚ).
y = Xβ + ε.    (1.7)
The matrix X is known as the design matrix and is of crucial importance to the whole theory of linear regression analysis. The S(β) in (1.6) indeed is

S(β) = ‖y − Xβ‖² = (y − Xβ)ᵀ(y − Xβ).

Differentiating S(β) with respect to β, we obtain the normal equations

Xᵀy = XᵀXβ,    (1.8)

which are a fundamental starting point for the analysis of linear models. If XᵀX is invertible, equation (1.8) yields an unbiased estimator of β,

β̂ = (XᵀX)⁻¹Xᵀy,

as

E(β̂ | X) = (XᵀX)⁻¹Xᵀ(Xβ) = β.
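As a quick sketch (not from the book), the normal equations (1.8) can be solved directly in the simple linear regression case; the data below are made up so that the fit is exact.

```python
# Sketch: solve the normal equations X^T y = X^T X beta for simple
# linear regression Y = b0 + b1*X + error, written out by hand.
def ols_fit(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    # The 2x2 normal equations:
    #   n*b0  + sx*b1  = sy
    #   sx*b0 + sxx*b1 = sxy
    det = n * sxx - sx * sx   # det(X^T X); nonzero unless x is constant
    b1 = (n * sxy - sx * sy) / det
    b0 = (sy - sx * b1) / n
    return b0, b1

# Data generated exactly from y = 1 + 2x, so the fit recovers (1, 2).
b0, b1 = ols_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```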
H₀: Cβ = h versus H₁: Cβ ≠ h,

where C is a constant matrix and h is a constant vector. Under H₀, the method of least squares minimizes S(β) subject to the constraint Cβ = h. Let RSS₀ and RSS₁ denote the residual sums of squares under H₀ and under the full model, respectively. This leads to the decomposition

RSS₀ = (RSS₀ − RSS₁) + RSS₁.    (1.9)

When the εᵢ are independent and identically distributed N(0, σ²), the left-hand side of (1.9) has a σ²χ²_{n−p+q} distribution, where q is the rank of C, and the second term on the right-hand side has a σ²χ²_{n−p} distribution, when H₀ is true. Furthermore,
F = [(RSS₀ − RSS₁)/q] / [RSS₁/(n − p)].

Source               Sum of Squares       D.F.       Mean Square
Regression (full)    SSR₁ = Σ(Ŷᵢ − Ȳ)²    p − 1      SSR₁/(p − 1)
Regression (H₀)      SSR₀                 p − 1 − q  SSR₀/(p − 1 − q)
Difference           RSS₀ − RSS₁          q          (RSS₀ − RSS₁)/q
Residual             RSS₁                 n − p      RSS₁/(n − p)
Total                Σ(Yᵢ − Ȳ)²           n − 1
y = Xβ + ε.    (1.10)
When the functional form of the probability distribution of the error term is specified, estimates of the regression coefficients and the unknown parameter σ² can be obtained by the method of maximum likelihood. The method of maximum likelihood was first proposed by C. F. Gauss in 1821. However, the principle is usually credited to R. A. Fisher, who first studied the properties of the approach (see Fisher, 1922). Essentially, the method of maximum likelihood consists of finding those values of the unknown parameters which are "most likely" to have produced the sampled data. Consider the joint density of the sampled data as a function of the parameters for fixed data (xᵢ, Yᵢ), i = 1, …, n. This function is called the likelihood function and is denoted by ℓ(β, σ). The maximum likelihood estimator is the maximizer (β̂, σ̂) of the likelihood function ℓ(β, σ). For normal errors the log-likelihood is, up to an additive constant,

log ℓ(β, σ) = −n log σ − ‖y − Xβ‖²/(2σ²),

so maximizing over β again amounts to minimizing ‖y − Xβ‖², and the maximizer over σ² is

σ̂² = RSS₁/n,

which is a biased estimator for σ².
Consider again testing

H₀: Cβ = h versus H₁: Cβ ≠ h.

The likelihood ratio test compares the likelihood maximized freely with the likelihood maximized subject to Cβ = h, where the constrained variance estimate is σ̂₀² = RSS₀/n.
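A small sketch (not from the book) of the bias just mentioned: the MLE RSS/n is always smaller than the unbiased estimator RSS/(n − p); the residuals below are made up for illustration.

```python
def variance_estimates(residuals, p):
    """Return (MLE estimate RSS/n, unbiased estimate RSS/(n - p))."""
    rss = sum(r * r for r in residuals)
    n = len(residuals)
    return rss / n, rss / (n - p)

# Toy residuals from a fit with p = 2 parameters (made up):
# RSS = 2.5 and n = 5, so the MLE is 0.5 and the unbiased version 2.5/3.
mle, unbiased = variance_estimates([0.5, -0.5, 1.0, -1.0, 0.0], p=2)
```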
Yᵢ = Σⱼ₌₁ᵖ Xᵢⱼβⱼ + εᵢ,  i = 1, …, n,    (1.11)

where the errors εᵢ are independent with mean 0. In many situations it may not be reasonable to assume that the variances of the errors are the same for all levels xᵢ of the independent variables. However, we may be able to characterize the dependence of var(εᵢ) on xᵢ at least up to a multiplicative constant; that is, var(εᵢ) = σ²vᵢ. The estimation of the regression coefficients in model (1.11) could be done by using the least squares estimator of Section 1.2 for equal error variances. These estimators are still unbiased and consistent for model (1.11) with unequal error variances, but they no longer have minimum variance.
Let Yᵢ* = vᵢ^{−1/2}Yᵢ, Xᵢⱼ* = vᵢ^{−1/2}Xᵢⱼ, εᵢ* = vᵢ^{−1/2}εᵢ. Then

Yᵢ* = Σⱼ₌₁ᵖ Xᵢⱼ*βⱼ + εᵢ*,    (1.12)

and the transformed errors have common variance σ². Applying least squares to (1.12) amounts to minimizing

‖y* − X*β‖² = (y − Xβ)ᵀW(y − Xβ),

where W = diag(v₁⁻¹, …, vₙ⁻¹) is the weight matrix. It follows that the above weighted least squares estimate is the BLUE for β.
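A minimal sketch of weighted least squares (one covariate, no intercept; hypothetical data): each squared residual is weighted by 1/vᵢ, matching the weight matrix W above.

```python
def wls_fit(x, y, v):
    """Weighted least squares for Y = beta*X + eps with var(eps_i) =
    sigma^2 * v_i: minimize sum((y_i - beta*x_i)^2 / v_i)."""
    num = sum(xi * yi / vi for xi, yi, vi in zip(x, y, v))
    den = sum(xi * xi / vi for xi, vi in zip(x, v))
    return num / den

# Exact data y = 4x: any positive weights recover beta = 4.
beta = wls_fit([1.0, 2.0, 3.0], [4.0, 8.0, 12.0], [1.0, 4.0, 9.0])
```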
In general, assume that

var(ε) = σ²Σ

for a known positive definite matrix Σ. Transforming by Σ^{−1/2} gives

y* = X*β + ε*,    (1.13)

with y* = Σ^{−1/2}y, X* = Σ^{−1/2}X and ε* = Σ^{−1/2}ε. Thus the generalized least squares estimate based on model (1.13) is β̂ = (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹y.
1.5 An Example

In this section, we illustrate the theory of linear regression models by analyzing the environmental data set introduced in Section 1.1. Throughout this book, S-plus will be used to analyze all examples. Readers may refer to Chambers and Hastie (1993) for details.

Example 1.1 (Continued) The scatter plots of the response and the three air pollutants are depicted in Figure 1.1. From Figure 1.1, it seems that there exists a linear relationship between the covariates NO₂ and dust. S-plus codes and outputs are displayed at the end of this section. From the outputs, we have the estimated coefficients and their estimated standard errors, listed in Table 1.2.
TABLE 1.2. Estimated coefficients and standard errors for Example 1.1

              Value     SE        t-value    p-value
(Intercept)                       42.5049    0.0000
SO₂                               −2.7673    0.0058
NO₂                                3.2654    0.0011
dust          0.1121    0.1165     0.9626    0.3361
TABLE 1.3. ANOVA table for Example 1.1

         Df    SS         MS          F-value
SO₂      1     336        335.62      0.14386
NO₂      1     88725      88725.33    38.03038
dust     1     2162       2161.81     0.92662
Error    726   1693766    2333.01
FIGURE 1.1. Scatter plots of admissions and the three air pollutants (SO₂, NO₂, dust).
TABLE 1.4. ANOVA table for Example 1.1 with covariates entered in the order SO₂, dust, NO₂

         Df    SS         MS          F-value
SO₂      1     336        335.62      0.14386
dust     1     66011      66010.66    28.29418
NO₂      1     24876      24876.48    10.66282
Error    726   1693766    2333.01
The dust term is not significant in the model. This might be due to the linear relationship between the covariates NO₂ and dust. The multiple R² is only 0.0511. This implies that the three linear terms of the air pollutants do not significantly reduce the sum of squares. It suggests that more complicated models, such as model (1.6), should be considered. In fact, several authors have studied this data set using nonparametric regression techniques; see Fan and Zhang (1999) and Cai, Fan and Li (2000).
Appendix: S-plus codes for Example 1.1
             Value     Std. Error   t value   Pr(>|t|)
(Intercept)                         42.5049     0.0000
so2                                 -2.7673     0.0058
no2                                  3.2654     0.0011
dust         0.1121    0.1165        0.9626     0.3361

            Df   Sum of Sq    Mean Sq    F Value    Pr(F)
so2          1         336     335.62    0.14386
no2          1       88725   88725.33   38.03038
dust         1        2162    2161.81    0.92662
Residuals  726     1693766    2333.01
2
One Way Analysis of Variance

In many scientific studies, many factors may have an impact on the outcome of an experiment. Often one factor is of primary importance, and the experimenters want to investigate its effect on the response variable while the other factors are held fixed. In this chapter, we introduce the analysis of one-factor experiments, which is fundamental to the analysis of experiments.
Yᵢⱼ = μ + τᵢ + εᵢⱼ,  j = 1, …, nᵢ,  i = 1, …, I,    (2.1)

where I is the number of treatments and nᵢ is the number of observations to be taken on the ith treatment. In this model, μ is called the grand mean, τᵢ the treatment effect, and εᵢⱼ the individual variation, having mean 0. An alternative way of writing model (2.1) is to replace μ + τᵢ by μᵢ, so that the model becomes

Yᵢⱼ = μᵢ + εᵢⱼ,  j = 1, …, nᵢ,  i = 1, …, I.    (2.2)

The μᵢ is the mean of the ith group. Model (2.1) contains I + 1 unknown parameters, while there are only I unknown parameters in model (2.2). In fact, the I + 1 unknown parameters μ, τ₁, …, τ_I in model (2.1) are not identifiable. This means that there is no unique estimate of the I + 1 parameters in model (2.1). On the other hand, many experimenters prefer model (2.1) because μ and τᵢ have their own physical interpretations. The parameter μ is a constant, and is usually referred to as the mean effect. The parameter τᵢ represents the positive or negative deviation of the response from the mean effect when the ith treatment is observed. This deviation is called the "effect" of the ith treatment on the response.
To make the I + 1 parameters in model (2.1) estimable, statisticians usually impose a constraint on the model. In many statistics textbooks, it is assumed that

Σᵢ₌₁ᴵ nᵢτᵢ = 0.

Under this constraint,

μ = (1/n) Σᵢ₌₁ᴵ nᵢμᵢ ≡ μ̄·, where n = Σᵢ₌₁ᴵ nᵢ;

here, as usual, the subscript "·" stands for "average", and

τᵢ = μᵢ − μ̄·.

Under this assumption, τᵢ is the difference of the ith treatment effect from the average treatment effect.
Different packages for statistical data analysis impose different constraints on the model. For example, in S-plus it is assumed (with the appropriate option) that τ₁ = 0. Thus μ is the effect of the first treatment, and τᵢ = μᵢ − μ₁ is the difference between the ith treatment and the first treatment. In SAS it is assumed that τ_I = 0, so the constant μ is the effect of the I-th treatment, and τᵢ = μᵢ − μ_I is the difference between the ith treatment and the I-th treatment.
Yᵢⱼ = μ + τᵢ + εᵢⱼ,  j = 1, …, nᵢ,  i = 1, …, I,    (2.3)
where the εᵢⱼ's are independent and identically distributed N(0, σ²). Denote n = Σᵢ₌₁ᴵ nᵢ. Here it is not required that n₁ = ⋯ = n_I. This implies that we are studying an unbalanced one way ANOVA model, and balanced one way ANOVA models are special cases thereof. Using matrix notation, the model can be rewritten as

y = Xβ + ε,

where
y = (Y₁₁, Y₁₂, …, Y₁ₙ₁, Y₂₁, …, Y_{I n_I})ᵀ,
ε = (ε₁₁, ε₁₂, …, ε_{I n_I})ᵀ,
β = (μ, τ₁, …, τ_I)ᵀ,

and X is the n × (I + 1) design matrix

        ⎛ 1_{n₁}   1_{n₁}   0        ⋯   0       ⎞
        ⎜ 1_{n₂}   0        1_{n₂}   ⋯   0       ⎟
    X = ⎜   ⋮        ⋮        ⋮           ⋮      ⎟
        ⎝ 1_{n_I}  0        0        ⋯   1_{n_I} ⎠

where 1_m denotes the m × 1 vector of ones: the first column corresponds to μ, and the (1 + i)th column indicates the observations receiving the ith treatment.
In this model, the design matrix X is not of full rank, so the matrix XᵀX is not invertible, and the least squares estimator of β cannot be obtained via computing β̂ = (XᵀX)⁻¹Xᵀy. To obtain the least squares estimate, we minimize

Σᵢ₌₁ᴵ Σⱼ₌₁^{nᵢ} (Yᵢⱼ − μ − τᵢ)².    (2.4)
Setting the partial derivatives of (2.4) to zero gives the normal equations

Ȳ·· − μ̂ − Σᵢ₌₁ᴵ (nᵢ/n) τ̂ᵢ = 0,    (2.5)
Ȳᵢ· − μ̂ − τ̂ᵢ = 0,  i = 1, …, I,    (2.6)

where Ȳ·· = (1/n) Σᵢ₌₁ᴵ Σⱼ₌₁^{nᵢ} Yᵢⱼ. Among these I + 1 equations, there are only I independent equations. Statisticians usually impose a constraint on the one way ANOVA model. A commonly used constraint is

Σᵢ₌₁ᴵ nᵢτᵢ = 0.

Under this constraint,

μ̂ = Ȳ··  and  τ̂ᵢ = Ȳᵢ· − μ̂.

In S-plus, it is assumed that τ₁ = 0. Under this assumption, μ̂ = Ȳ₁· and τ̂ᵢ = Ȳᵢ· − Ȳ₁·, i = 2, …, I.
Though the parameters μ, τ₁, …, τ_I are not identifiable, many parameters, such as τ₁ − τ₂, are estimable without any constraints. Such a parameter is called a contrast. More generally, a contrast has the form

θ = Σᵢ₌₁ᴵ cᵢτᵢ
with Σᵢ₌₁ᴵ cᵢ = 0. Its least squares estimate is

θ̂ = Σᵢ₌₁ᴵ cᵢτ̂ᵢ = Σᵢ₌₁ᴵ cᵢȲᵢ·.

By simple algebra, the residual sum of squares is

RSS = Σᵢ₌₁ᴵ Σⱼ₌₁^{nᵢ} (Yᵢⱼ − μ̂ − τ̂ᵢ)² = Σᵢ₌₁ᴵ Σⱼ₌₁^{nᵢ} (Yᵢⱼ − Ȳᵢ·)²,

and RSS ~ σ²χ²_{n−I}. Hence

σ̂² = RSS/(n − I)

is an unbiased estimator of σ².
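The estimates above can be sketched in a few lines (data made up for illustration): the group means give the fitted values, and RSS/(n − I) gives the unbiased variance estimate.

```python
def one_way_fit(groups):
    """groups: list of lists of observations.  Returns (group means,
    RSS = within-group sum of squared deviations, unbiased sigma^2)."""
    means = [sum(g) / len(g) for g in groups]
    rss = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    n = sum(len(g) for g in groups)
    I = len(groups)
    return means, rss, rss / (n - I)

# Two toy groups: means (2, 3), RSS = 2 + 2 = 4, sigma^2 = 4/(5 - 2).
means, rss, s2 = one_way_fit([[1.0, 2.0, 3.0], [2.0, 4.0]])
```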
Example 2.1 (Female Labor Supply) This data set from East Germany was collected around 1994. It consists of the hourly earnings of 607 women. It is of interest to investigate the relationship between hourly earnings and education level. The data set contains several related measurements, but here we take the hourly earnings as the response variable and define the educational level (eduf) as a factor with four levels, A–D, according to years of schooling.

FIGURE 2.1. Hourly earnings, and their mean, median and variance, by education level (eduf).
The default contrast in S-plus is the Helmert polynomial. To compute the cell means easily, we set the contrast to be "treatment", in which it is assumed that τ₁ = 0. Part of the output is

(Intercept)      edufB      edufC      edufD
   14.07384  -3.328302  -4.959358  -5.938258

From the output, we can compute the cell means: for example, Ȳ₁· = 14.07 and Ȳ₂· = 14.07 − 3.33 = 10.75, and so on. Table 2.1 gives a summary of this data set.
TABLE 2.1. Summary of the female labor supply data

Level   nᵢ    Mean of earnings   Variance of earnings
A       70    14.07              26.507
B       207   10.75              11.177
C       200   9.11               23.459
D       130   8.14               6.248
If we impose the sum-to-zero constraint Σᵢ₌₁⁴ τᵢ = 0, then we have

μ̂ = (1/4) Σᵢ₌₁⁴ Ȳᵢ· = 10.52,  τ̂₁ = Ȳ₁· − μ̂ = 3.55,
τ̂₂ = Ȳ₂· − μ̂ = 0.23,  τ̂₃ = Ȳ₃· − μ̂ = −1.40,
τ̂₄ = Ȳ₄· − μ̂ = −2.38.
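These numbers can be checked directly from the (rounded) group means of Table 2.1 under the sum-to-zero constraint:

```python
# Check of the Example 2.1 arithmetic: mu is the unweighted average of
# the group means, and each tau is the deviation from mu.
means = {"A": 14.07, "B": 10.75, "C": 9.11, "D": 8.14}

mu_hat = sum(means.values()) / len(means)            # about 10.52
tau_hat = {k: m - mu_hat for k, m in means.items()}  # sums to zero
```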
Consider testing the hypothesis of no treatment effect,

H₀: τ₂ − τ₁ = ⋯ = τ_I − τ₁ = 0.

Under H₀, the model becomes

Yᵢⱼ = μ + εᵢⱼ.

Therefore the least squares estimate for μ is

μ̂ = (1/n) Σᵢ₌₁ᴵ Σⱼ₌₁^{nᵢ} Yᵢⱼ = (1/n) Σᵢ₌₁ᴵ nᵢȲᵢ· = Ȳ··.
The total sum of squares decomposes as

Σᵢ Σⱼ (Yᵢⱼ − Ȳ··)² = Σᵢ Σⱼ (Yᵢⱼ − Ȳᵢ·)² + Σᵢ nᵢ(Ȳᵢ· − Ȳ··)².

Denote

SSE = Σᵢ Σⱼ (Yᵢⱼ − Ȳᵢ·)²,

the sum of squared errors. Then

RSS₀ − RSS₁ = Σᵢ nᵢ(Ȳᵢ· − Ȳ··)² ≡ SST,

the treatment sum of squares. The F-test rejects H₀ when

F = [SST/(I − 1)] / [SSE/(n − I)] > F_{I−1,n−I}(1 − α).
Source      Sum of Squares        D.F.    Mean Squares    F value
Treatment   SST                   I − 1   SST/(I − 1)     [SST/(I−1)]/[SSE/(n−I)]
Error       SSE                   n − I   SSE/(n − I)
Total       Σᵢ Σⱼ (Yᵢⱼ − Ȳ··)²    n − 1
Source       SS        D.F.   MS      F value   p-value
Education    1884.8    3      628.3   39.44     0
Residuals    9605.6    603    15.9
Total        11490.4   606
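The F value in the table above can be recomputed from its sums of squares and degrees of freedom:

```python
# Recompute the F statistic of the Example 2.1 ANOVA table.
ss_treatment, df_treatment = 1884.8, 3
ss_error, df_error = 9605.6, 603

ms_treatment = ss_treatment / df_treatment   # about 628.3
ms_error = ss_error / df_error               # about 15.9
f_value = ms_treatment / ms_error            # about 39.44
```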
The least squares estimate of a contrast θ = Σᵢ cᵢτᵢ is θ̂ = Σᵢ₌₁ᴵ cᵢȲᵢ·. Since Σᵢ cᵢ = 0,

E θ̂ = Σᵢ₌₁ᴵ cᵢ(μ + τᵢ) = Σᵢ₌₁ᴵ cᵢτᵢ = θ,

so θ̂ is unbiased, and

var(θ̂) = Σᵢ₌₁ᴵ cᵢ² var(Ȳᵢ·) = Σᵢ₌₁ᴵ (cᵢ²/nᵢ)σ².

Moreover, under normal errors,

θ̂ ~ N(θ, σ² Σᵢ₌₁ᴵ cᵢ²/nᵢ),

and θ̂ is independent of σ̂².
i=1
i=1
For any linear unbiased estimator of ,
b
b =
we have
E b =
I X
X
i=1 j
I X
X
i=1 j
dij Yij ;
dij ( + i ):
dij ) +
I X
X
i=1
dij )i =
I
X
i=1
ci i :
28
P
P
P
This implies that Ii=1 j dij = 0 and ci = j dij . Thus c2i =
P
P
( j dij )2 , so c2i ni j d2ij by arithmetic-geometric inequality. Thus
XX
X
var( ) =
d2ij 2 (c2i =ni )2 = var(b):
i=1 j
i=1
This completes the proof of Part (i).
b
(ii) Since b is a linear combination of Yij , if "ij is normally distributed, so is b. Thus
b N (; 2
I
X
i=1
c2i =ni )
Since cov(Ȳᵢ·, Yᵢⱼ − Ȳᵢ·) = 0, θ̂ is independent of σ̂², and

(θ̂ − θ) / (σ̂ √(Σᵢ cᵢ²/nᵢ)) ~ t_{n−I}.

A 100(1 − α)% confidence interval for θ is therefore

θ̂ ± t_{n−I}(1 − α/2) σ̂ √(Σᵢ cᵢ²/nᵢ).
For Example 2.1, consider the contrasts θ₁ = τ₁ − τ₂ and θ₂ = τ₁ − (τ₂ + τ₃ + τ₄)/3. Their estimates are

θ̂₁ = 3.33  and  θ̂₂ = 4.74,

and the residual variance estimate is σ̂² = 15.9, i.e. σ̂ = 3.99. The estimated standard error of θ̂₂ is

SE(θ̂₂) = 3.99 √{(1/9)(1/207 + 1/200 + 1/130) + 1/70} = 0.508.

The resulting 95% confidence intervals are

τ₁ − τ₂ ∈ 3.33 ± 1.08  and  θ₂ ∈ 4.74 ± 1.00.
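The standard error 0.508 can be verified directly from the contrast coefficients and group sizes:

```python
import math

# SE of the contrast tau_1 - (tau_2 + tau_3 + tau_4)/3 in Example 2.1:
# sigma_hat * sqrt(sum(c_i^2 / n_i)).
sigma_hat = 3.99
c = [1.0, -1.0 / 3, -1.0 / 3, -1.0 / 3]
n = [70, 207, 200, 130]

se = sigma_hat * math.sqrt(sum(ci * ci / ni for ci, ni in zip(c, n)))
half_width = 1.96 * se   # large-sample 95% half-width, about 1.00
```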
Let S₁, …, Sₘ be confidence intervals for contrasts θ₁, …, θₘ with individual levels 1 − α₁, …, 1 − αₘ. Then

P{∩ⱼ₌₁ᵐ (θⱼ ∈ Sⱼ)} = 1 − P{∪ⱼ₌₁ᵐ (θⱼ ∉ Sⱼ)} ≥ 1 − Σⱼ₌₁ᵐ P{θⱼ ∉ Sⱼ} = 1 − Σⱼ₌₁ᵐ αⱼ.

Choose the αⱼ such that Σⱼ₌₁ᵐ αⱼ = α. Then with probability at least 1 − α, the following events occur simultaneously:

θ₁ ∈ S₁, …, θₘ ∈ Sₘ.

Therefore S₁, …, Sₘ may be regarded as a 100(1 − α)% simultaneous confidence set. In practice, one often takes α₁ = ⋯ = αₘ = α/m. Therefore the larger m is, the wider the simultaneous confidence intervals are. The probability inequality demonstrates that the Bonferroni method is conservative, in that the intervals can be unnecessarily wide. Indeed, if the intervals were independent,

P{θ₁ ∈ S₁, …, θₘ ∈ Sₘ} = Πⱼ₌₁ᵐ (1 − αⱼ) ≥ 1 − Σⱼ₌₁ᵐ αⱼ.
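A tiny numerical check of the two bounds (m = 5 intervals at overall level 0.95):

```python
# Bonferroni vs. the independence bound: 1 - sum(alpha_j) is never
# larger than prod(1 - alpha_j), so Bonferroni is conservative.
m, alpha = 5, 0.05
alpha_j = alpha / m                     # equal split, 0.01 each

bonferroni_bound = 1 - m * alpha_j      # 0.95
independent_coverage = (1 - alpha_j) ** m
```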
Write

θ̂ − θ = Σᵢ₌₁ᴵ cᵢ(Ȳᵢ· − μᵢ),

as Σᵢ cᵢ = 0. Denote τ̂ᵢ = Ȳᵢ· − Ȳ·· and τᵢ = μᵢ − μ̄·. Then

θ̂ − θ = Σᵢ₌₁ᴵ cᵢ(τ̂ᵢ − τᵢ).

By the Cauchy–Schwarz inequality,

|θ̂ − θ| ≤ √(Σᵢ₌₁ᴵ cᵢ²/nᵢ) √(Σᵢ₌₁ᴵ nᵢ(τ̂ᵢ − τᵢ)²),

so that

|θ̂ − θ| / √(σ̂² Σᵢ₌₁ᴵ cᵢ²/nᵢ) ≤ √(Σᵢ₌₁ᴵ nᵢ(τ̂ᵢ − τᵢ)²) / σ̂.

Note that Σᵢ₌₁ᴵ nᵢ(τ̂ᵢ − τᵢ)² = RSS₀ − RSS₁. Thus

[Σᵢ₌₁ᴵ nᵢ(τ̂ᵢ − τᵢ)²/(I − 1)] / σ̂² ~ F_{I−1,n−I},

and hence, simultaneously for all contrasts, with probability 1 − α,

|θ̂ − θ| / √(σ̂² Σᵢ₌₁ᴵ cᵢ²/nᵢ) ≤ √((I − 1) F_{I−1,n−I}(1 − α)).
For Example 2.1, I = 4 and n − I = 603, so (I − 1)F_{3,603}(0.95) ≈ χ²₃(0.95) = 7.815, and with probability 0.95, simultaneously for all contrasts,

|Σᵢ₌₁⁴ cᵢ(Ȳᵢ· − μᵢ)| / √(σ̂² Σᵢ cᵢ²/nᵢ) ≤ √7.815 = 2.796.
Consider a balanced design with n₁ = ⋯ = n_I = n. Denote

Q = [maxᵢ(Ȳᵢ· − μᵢ) − minᵢ(Ȳᵢ· − μᵢ)] / (σ̂/√n),

the studentized range. Then for every pair (i, j),

|(Ȳᵢ· − Ȳⱼ·) − (μᵢ − μⱼ)| / (σ̂√2/√n) ≤ Q/√2;

namely, with q_{I,n−I}(1 − α) the upper α quantile of Q, simultaneously for all pairs,

μᵢ − μⱼ ∈ {Ȳᵢ· − Ȳⱼ· ± σ̂ q_{I,n−I}(1 − α)/√n}.

The above idea readily applies to unbalanced designs. The distribution of Q will then depend on n₁, …, n_I, but it is independent of the unknown parameters. See Hochberg and Tamhane (1987) for details. Critical values of q_{I,n−I} can easily be simulated by the Monte Carlo method. Most statistical software packages, such as S-plus and SAS, provide the upper percentiles of Q. A table of upper percentiles of Q can also be found in Dean and Voss (1999).
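The Monte Carlo idea can be sketched as follows (a hypothetical simulation, not S-plus's algorithm; σ is taken as known, so this approximates the large-df quantile rather than q_{I,n−I}):

```python
import random

# Monte Carlo sketch: estimate the 95th percentile of
# Q = (max_i Ybar_i - min_i Ybar_i) / (sigma/sqrt(n))
# for I balanced groups of standard normal data.
def simulate_q(I, n, reps, seed=0):
    rng = random.Random(seed)
    qs = []
    for _ in range(reps):
        means = [sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
                 for _ in range(I)]
        qs.append((max(means) - min(means)) * n ** 0.5)
    qs.sort()
    return qs[int(0.95 * reps)]

q95 = simulate_q(I=4, n=5, reps=2000)   # near the tabulated value 3.63
```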
Consider again the one way ANOVA model

Yᵢⱼ = μ + τᵢ + εᵢⱼ,

where the εᵢⱼ are independent and identically distributed N(0, σ²). The estimates and tests derived in the previous sections were obtained as if the model and assumptions were correct. In many practical problems the assumptions are in doubt. Therefore, we have to check the model assumptions after obtaining the estimates. Thus, when we fit a data set by the one way ANOVA model, we have to check the following three assumptions:

1. {εᵢⱼ}ⱼ₌₁^{nᵢ} are a random sample; in other words, there are no outliers among the data.
The residuals are

ε̂ᵢⱼ = Yᵢⱼ − Ȳᵢ·,  j = 1, …, nᵢ.
Figure 2.2 depicts the scatter plot of the residuals against the level i for the data set studied in Example 2.1. It can be seen from Figure 2.2 that there is an outlier in this data set.

FIGURE 2.2. Scatter plot of residuals against the education levels: 1 stands for A, 2 for B, and so on.
The scatter plot of residuals against the index is used to check the independence assumption. Figure 2.4 depicts the scatter plot of residuals against the index. We did not find any unusual pattern except a few outliers.
FIGURE 2.4. Scatter plot of residuals against the index.
When the sample size is large, we may directly plot the ε̂ᵢⱼ against the quantiles of the standard normal distribution. Figure 2.5 depicts the normal Q-Q plot of the residuals in Example 2.1.

When the sample size in each group is large, we may estimate the density of the Y's for each group by using the nonparametric kernel estimate

f̂ᵢ,ₕ(x) = (1/(nᵢh)) Σⱼ₌₁^{nᵢ} K((Yᵢⱼ − x)/h),

where K may be chosen as the Gaussian density and the bandwidth h may be taken as 1.06 σ̂ nᵢ^{−1/5}. An estimated density function of the residuals in Example 2.1 is depicted in Figure 2.6.
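The kernel estimate can be sketched in a few lines (made-up data; the sample standard deviation plays the role of σ̂ in the bandwidth rule):

```python
import math

# Gaussian-kernel density estimate with rule-of-thumb bandwidth
# h = 1.06 * s * n^(-1/5), as in the text.
def kde(data, x):
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))
    h = 1.06 * s * n ** (-0.2)
    return sum(math.exp(-0.5 * ((v - x) / h) ** 2) /
               math.sqrt(2 * math.pi) for v in data) / (n * h)

data = [-1.2, -0.4, 0.0, 0.3, 1.1, 1.8]
f_hat = kde(data, 0.0)   # density estimate at x = 0
```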
FIGURE 2.5. Normal Q-Q plot of the residuals in Example 2.1.

FIGURE 2.6. Estimated density of the residuals in Example 2.1.
Consider the model

Yᵢⱼ = μ + τᵢ + εᵢⱼ

with var(εᵢⱼ) = σᵢ². Testing equal variance is equivalent to testing the hypothesis

H₀: σ₁² = σ₂² = ⋯ = σ_I²

based on the assumption that the Yᵢⱼ are independent and distributed N(μᵢ, σᵢ²). A likelihood ratio test can be constructed for testing the null hypothesis. It is easy to show that the maximum likelihood estimate for (μᵢ, σᵢ²) is (Ȳᵢ·, σ̂ᵢ²) with σ̂ᵢ² = Σⱼ(Yᵢⱼ − Ȳᵢ·)²/nᵢ. Under the null hypothesis H₀, the maximum likelihood estimate for σ² is

σ̂² = (1/n) Σᵢ₌₁ᴵ Σⱼ₌₁^{nᵢ} (Yᵢⱼ − Ȳᵢ·)².

The likelihood ratio statistic is

T = Σᵢ₌₁ᴵ nᵢ log(σ̂²/σ̂ᵢ²).

For Example 2.1,

T = 70 log(15.930/26.507) + 207 log(15.930/11.177)
    + 200 log(15.930/23.459) + 130 log(15.930/6.248) = 81.747.
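The statistic can be recomputed from the rounded group sizes and variances of Table 2.1 (the rounding moves the result slightly away from 81.747):

```python
import math

# Likelihood ratio statistic T = sum(n_i * log(s2 / s2_i)) for the
# equal-variance test in Example 2.1.
n_i = [70, 207, 200, 130]
s2_i = [26.507, 11.177, 23.459, 6.248]
s2 = 15.930   # pooled MLE of sigma^2 under H0

T = sum(n * math.log(s2 / s) for n, s in zip(n_i, s2_i))
```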
Coefficients:
 (Intercept)     edufB     edufC     edufD
    14.07384 -3.328302 -4.959358 -5.938258

Degrees of freedom: 607 total; 603 residual
Residual standard error: 3.991209
>
> options(contrasts=c("contr.sum"))
> lm(earning~eduf)
Call:
lm(formula = earning ~ eduf)
Coefficients:
 (Intercept)    eduf1     eduf2     eduf3
    10.51736 3.556479 0.2281777 -1.402878
Degrees of freedom: 607 total; 603 residual
Residual standard error: 3.991209
>
## To get an ANOVA table ##
> aov.labor <- aov(earning~eduf)
> summary(aov.labor)
Df Sum of Sq Mean Sq F Value Pr(F)
eduf 3 1884.844 628.2813 39.44076
0
Residuals 603 9605.637 15.9297
>
> ## To make a histogram, density and Q-Q plot ##
> postscript("d:/rli/book/figsa/ch2fig2.ps",width=4,
+
height=5,horizontal=F,pointsize=8)
> residuals <- resid(aov.labor)
> edua <- rep(1,length(edu))
> edua[edu<16]<-2
> edua[edu<13]<-3
> edua[edu<12]<-4
> plot(edua,residuals,xlab='Factor: Education Level',
+ ylab='Residual')
> dev.off()
graphsheet
2
>
> postscript("d:/rli/book/figsa/ch2fig3.ps",width=4,
3
Two-way Layout Models
The two treatment factors are the brand of battery (Name brand versus Store brand) and its duty (Alkaline versus Heavy duty), each at two levels. The response variable is the life per unit cost (min/dollar). The data, extracted from Dean and Voss (1999), are listed in Table 3.1.

This example is a balanced design, as the number of observations in each cell is the same. When the numbers of observations in the cells are not the same, the corresponding experiment is called an unbalanced design.
FIGURE 3.1. Mean life per unit cost by brand (Name, Store) and duty (Alkaline, Heavy duty).
One may examine whether or not the interaction effect is statistically significant. To this end, we have to fit the data by a two-way layout model. A two-way complete model is defined as

Yᵢⱼₖ = μ + αᵢ + βⱼ + γᵢⱼ + εᵢⱼₖ,    (3.1)
The main effect model is

Yᵢⱼₖ = μ + αᵢ + βⱼ + εᵢⱼₖ,    (3.2)

where the εᵢⱼₖ's are independent and identically distributed N(0, σ²). The main effect model is also called the two-way additive model, since the effect of a treatment combination on the response is modeled as the sum of the individual effects of the two factors. If an additive model is used when the factors really do interact, then inferences on the main effects can be very misleading. Consequently, if the experimenter does not have reasonable knowledge about the interaction, then the two-way complete model should be used.
(iii) The variance estimate is

σ̂² = Σᵢ Σⱼ Σₖ (Yᵢⱼₖ − Ȳᵢⱼ·)² / (n − IJ),

and Σᵢ Σⱼ Σₖ (Yᵢⱼₖ − Ȳᵢⱼ·)² ~ σ²χ²_{n−IJ}, where n = Σᵢ Σⱼ nᵢⱼ. An interaction contrast Σᵢ Σⱼ dᵢⱼγᵢⱼ, with Σᵢ dᵢⱼ = 0 for j = 1, …, J and Σⱼ dᵢⱼ = 0 for i = 1, …, I, is estimated by Σᵢ Σⱼ dᵢⱼȲᵢⱼ·, and a 100(1 − α)% confidence interval for it is

Σᵢ Σⱼ dᵢⱼȲᵢⱼ· ± t_{n−IJ}(1 − α/2) σ̂ √(Σᵢ Σⱼ dᵢⱼ²/nᵢⱼ).
(iv) When interaction exists, the main effect is hard to define. For convenience, we define the main effect as

αᵢ* = ᾱᵢ· = αᵢ + γ̄ᵢ·

for i = 1, …, I. A main effect contrast in A is

Σᵢ₌₁ᴵ cᵢαᵢ*,  with Σᵢ₌₁ᴵ cᵢ = 0,

which is estimable.
For the battery experiment, consider the interaction contrast

θ₁ = ½{(γ₁₁ − γ₁₂) − (γ₂₁ − γ₂₂)}.

Its least squares estimate is θ̂₁ = 113.25, with estimated standard error

SE(θ̂₁) = σ̂/2 = 24.33.

The 95% confidence interval for the interaction contrast is 113.25 ± 2.1788 × 24.33, which excludes zero; thus for

H₀: θ₁ = 0 versus H₁: θ₁ ≠ 0,

the interaction is significant. The duty contrast (Alkaline − Heavy duty) is

θ₂ = β₁* − β₂*, with βⱼ* = βⱼ + γ̄·ⱼ.

Its estimate is θ̂₂ = 251, with

SE(θ̂₂) = σ̂/2 = 24.33.

Thus a 95% confidence interval for the duty contrast is 251 ± 2.1788 × 24.33. For testing

H₀: θ₂ = 0 versus H₁: θ₂ > 0,

the observed t-statistic is

t_obs = 251/24.33 = 10.32.

Similarly, the brand contrast θ₃ = α₁* − α₂* has estimate θ̂₃ = −176.50 with

SE(θ̂₃) = σ̂/2 = 24.33.

For testing H₀: θ₃ = −150, the observed t-statistic is

t_obs = (−176.50 + 150)/24.33 = −1.089.
The Scheffé method gives simultaneous confidence intervals for all contrasts Σᵢ Σⱼ dᵢⱼμᵢⱼ with Σᵢ Σⱼ dᵢⱼ = 0:

Σᵢ Σⱼ dᵢⱼȲᵢⱼ· ± √((IJ − 1) F_{IJ−1,n−IJ}(1 − α)) σ̂ √(Σᵢ Σⱼ dᵢⱼ²/nᵢⱼ).

When I and J are moderate, the interval can be very wide. For example, with I = 3, J = 4 and α = 5%, the multiplier

√((IJ − 1) F_{IJ−1,n−IJ}(1 − α)) = √(11 F_{11,n−12}(0.95)) ≈ 4.44

when n is large. So the interval can be very wide. In most practical applications, we may only be interested in some of the contrasts. In this case, we can construct shorter intervals. For example, if we want to construct m linear contrasts Σᵢ Σⱼ dᵢⱼγᵢⱼ, the multiplier can be replaced by the Bonferroni multiplier w_B = t_{n−IJ}(1 − α/(2m)).
For main effect contrasts, the simultaneous confidence intervals take the form

Σᵢ cᵢȲᵢ·· ± w σ̂ √(Σᵢ cᵢ²/(JK)),

where the multiplier w is one of

w_S = √((I − 1) F_{I−1,n−IJ}(1 − α)),
w_T = q_{I,n−IJ}(1 − α)/√2,
w_B = t_{n−IJ}(1 − α/(2m)).
The hypothesis of no interaction is

H₀^{AB}: γᵢₖ − γᵢₗ = γⱼₖ − γⱼₗ for all i ≠ j, k ≠ l.

Under the null hypothesis H₀^{AB}, the model becomes the main effect model

Yᵢⱼₖ = μ + αᵢ + βⱼ + εᵢⱼₖ.

Therefore the sum of squares to be minimized is

Σᵢ₌₁ᴵ Σⱼ₌₁ᴶ Σₖ₌₁^{nᵢⱼ} (Yᵢⱼₖ − μ − αᵢ − βⱼ)².

Setting its partial derivatives to zero gives the normal equations

Σᵢ Σⱼ Σₖ (Yᵢⱼₖ − μ̂ − α̂ᵢ − β̂ⱼ) = 0,
Σⱼ Σₖ (Yᵢⱼₖ − μ̂ − α̂ᵢ − β̂ⱼ) = 0,  i = 1, …, I,
Σᵢ Σₖ (Yᵢⱼₖ − μ̂ − α̂ᵢ − β̂ⱼ) = 0,  j = 1, …, J.
Denote nᵢT = Σⱼ nᵢⱼ and nTⱼ = Σᵢ nᵢⱼ. The normal equations become

Ȳ··· − μ̂ − (1/n) Σᵢ nᵢT α̂ᵢ − (1/n) Σⱼ nTⱼ β̂ⱼ = 0,
Ȳᵢ·· − μ̂ − α̂ᵢ − (1/nᵢT) Σⱼ nᵢⱼ β̂ⱼ = 0,
Ȳ·ⱼ· − μ̂ − (1/nTⱼ) Σᵢ nᵢⱼ α̂ᵢ − β̂ⱼ = 0.

Under the constraints

Σᵢ nᵢT α̂ᵢ = 0  and  Σⱼ nTⱼ β̂ⱼ = 0,

we obtain μ̂ = Ȳ···, and the α̂ᵢ and β̂ⱼ can then be solved. Hence the residual sum of squares under the null hypothesis is

RSS₀ = Σᵢ Σⱼ Σₖ (Yᵢⱼₖ − μ̂ − α̂ᵢ − β̂ⱼ)² ≥ RSS₁.
For a balanced design (nᵢⱼ ≡ K), the solutions are

α̂ᵢ = Ȳᵢ·· − Ȳ···,  β̂ⱼ = Ȳ·ⱼ· − Ȳ···.

Consequently, the interaction sum of squares is

SS_{AB} = Σᵢ,ⱼ,ₖ (Yᵢⱼₖ − Ȳᵢ·· − Ȳ·ⱼ· + Ȳ···)² − Σᵢ,ⱼ,ₖ (Yᵢⱼₖ − Ȳᵢⱼ·)²
        = K Σᵢ,ⱼ (Ȳᵢⱼ· − Ȳᵢ·· − Ȳ·ⱼ· + Ȳ···)².
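The decomposition just derived can be checked numerically on a tiny balanced layout (I = J = K = 2, made-up data):

```python
# Balanced two-way layout: verify SStotal = SSA + SSB + SSAB + SSE.
Y = {(0, 0, 0): 3.0, (0, 0, 1): 5.0, (0, 1, 0): 6.0, (0, 1, 1): 8.0,
     (1, 0, 0): 2.0, (1, 0, 1): 4.0, (1, 1, 0): 9.0, (1, 1, 1): 11.0}
I = J = K = 2

grand = sum(Y.values()) / (I * J * K)
yi = [sum(Y[i, j, k] for j in range(J) for k in range(K)) / (J * K)
      for i in range(I)]
yj = [sum(Y[i, j, k] for i in range(I) for k in range(K)) / (I * K)
      for j in range(J)]
yij = {(i, j): sum(Y[i, j, k] for k in range(K)) / K
       for i in range(I) for j in range(J)}

ssa = J * K * sum((m - grand) ** 2 for m in yi)
ssb = I * K * sum((m - grand) ** 2 for m in yj)
ssab = K * sum((yij[i, j] - yi[i] - yj[j] + grand) ** 2
               for i in range(I) for j in range(J))
sse = sum((Y[i, j, k] - yij[i, j]) ** 2 for (i, j, k) in Y)
sstotal = sum((v - grand) ** 2 for v in Y.values())
```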
Define the main effects

αᵢ* = ᾱᵢ· = αᵢ + γ̄ᵢ·,  i = 1, …, I.

Now we consider testing hypotheses concerning the main effects,

H₀^A: α₁* = α₂* = ⋯ = α_I*.

For the two-way complete model

Yᵢⱼₖ = μ + αᵢ + βⱼ + γᵢⱼ + εᵢⱼₖ

with the constraints, H₀^A is equivalent to

H₀: αᵢ* = 0, for i = 1, 2, …, I,

since Σᵢ nᵢT αᵢ* = 0.
Under H₀^A, the residual sum of squares is Σᵢ,ⱼ,ₖ (Yᵢⱼₖ − Ȳᵢⱼ· + Ȳᵢ·· − Ȳ···)². Hence the sum of squares due to factor A is

SS_A = Σᵢ,ⱼ,ₖ (Yᵢⱼₖ − Ȳᵢⱼ· + Ȳᵢ·· − Ȳ···)² − Σᵢ,ⱼ,ₖ (Yᵢⱼₖ − Ȳᵢⱼ·)² = JK Σᵢ (Ȳᵢ·· − Ȳ···)².
Similarly, for

H₀^B: β₁* = β₂* = ⋯ = β_J*,

the sum of squares due to factor B is

SS_B = IK Σⱼ (Ȳ·ⱼ· − Ȳ···)².

The total sum of squares decomposes as

Σᵢ,ⱼ,ₖ (Yᵢⱼₖ − Ȳ···)² = Σᵢ,ⱼ,ₖ {(Yᵢⱼₖ − Ȳᵢⱼ·) + (Ȳᵢⱼ· − Ȳᵢ·· − Ȳ·ⱼ· + Ȳ···)
                        + (Ȳᵢ·· − Ȳ···) + (Ȳ·ⱼ· − Ȳ···)}².
Source        D.F.            SS      MS                      F value
Factor A      I − 1           SS_A    SS_A/(I − 1)            MS_A/MS_E
Factor B      J − 1           SS_B    SS_B/(J − 1)            MS_B/MS_E
Interactions  (I − 1)(J − 1)  SS_AB   SS_AB/[(I − 1)(J − 1)]  MS_AB/MS_E
Error         n − IJ          SS_E    SS_E/(n − IJ)
Total         n − 1
Total
Source
D.F.
Brand
Duty
Interactions
Error
1
1
1
12
Total
15
SS
MS
Experiments
F
p
From Table 3.3, the p-values indicate that all of the main eects
and interactions are highly statistically signicant.
The total sum of squares can be decomposed sequentially:

Σᵢ,ⱼ,ₖ (Yᵢⱼₖ − Ȳ···)² = RSS(μ)
  = [RSS(μ) − RSS(μ, A)] + [RSS(μ, A) − RSS(μ, A, B)]
    + [RSS(μ, A, B) − RSS(μ, A, B, A·B)] + RSS(μ, A, B, A·B)
  = SS(A) + SS(B|A) + SS(A·B|A, B) + SS_E.

The first term SS(A) is the sum of squares due to factor A; the second, SS(B|A), is the reduction in the sum of squares due to factor B after adjusting for factor A. Changing the order of the factors gives

Σᵢ,ⱼ,ₖ (Yᵢⱼₖ − Ȳ···)² = RSS(μ) = SS(B) + SS(A|B) + SS(A·B|A, B) + SS_E.
First the data set was fitted by a two-way layout model. From the S-plus outputs, the estimated contrasts are as follows:

(Intercept)      EB      EC      ED
      14.40   -1.19   -0.68
TABLE 3.4. ANOVA table for Example 3.2

Source       D.F.   SS         MS        F        p
Child        1      4.057      4.058     0.257    0.613
Education    3      1936.489   645.496   40.852   0.000
Interaction  3      85.213     28.404    1.798    0.146
Residuals    599    9464.721   15.8009
Total        606

The p-value for the interaction is greater than 0.05. This suggests fitting the data set with a main effect model, studied in the next section.
TABLE 3.5. ANOVA table for Example 3.2

Source       D.F.   SS         MS        F        p
Eduf         3      1884.844   628.281   39.762   0.000
Child        1      55.703     55.703    3.525    0.061
Interaction  3      85.213     28.404    1.798    0.146
Residuals    599    9464.721   15.8009
Total        606
In this section, we consider the main effect model with balanced design

Yᵢⱼₖ = μ + αᵢ + βⱼ + εᵢⱼₖ,

where i = 1, …, I, j = 1, …, J and k = 1, …, K. With the constraints Σᵢ αᵢ = 0 and Σⱼ βⱼ = 0, the least squares estimates for the parameters in the model are

μ̂ = Ȳ···,  α̂ᵢ = Ȳᵢ·· − Ȳ···,  β̂ⱼ = Ȳ·ⱼ· − Ȳ···.
So for an estimable main effect contrast θ = Σᵢ cᵢαᵢ with Σᵢ cᵢ = 0, the estimate is

θ̂ = Σᵢ cᵢα̂ᵢ = Σᵢ cᵢȲᵢ··,  with var(Σᵢ cᵢα̂ᵢ) = σ² Σᵢ cᵢ²/(JK).

The residual sum of squares of the additive model is

RSS = Σᵢ,ⱼ,ₖ (Yᵢⱼₖ − Ȳᵢ·· − Ȳ·ⱼ· + Ȳ···)²
    = Σᵢ,ⱼ,ₖ (Yᵢⱼₖ − Ȳᵢⱼ·)² + Σᵢ,ⱼ,ₖ (Ȳᵢⱼ· − Ȳᵢ·· − Ȳ·ⱼ· + Ȳ···)²
    = SS_E + SS_{AB},
and

RSS ~ σ²χ²_{n−I−J+1},

so that

σ̂² = Σᵢ,ⱼ,ₖ (Yᵢⱼₖ − Ȳᵢ·· − Ȳ·ⱼ· + Ȳ···)² / (n − I − J + 1)

is an unbiased estimator of σ².
Simultaneous confidence intervals for main effect contrasts take the form Σᵢ cᵢȲᵢ·· ± w σ̂ √(Σᵢ cᵢ²/(JK)), with multiplier

w_B = t_{n−I−J+1}(1 − α/(2m)),
w_S = √((I − 1) F_{I−1,n−I−J+1}(1 − α)),
w_T = q_{I,n−I−J+1}(1 − α)/√2.
Consider testing

H₀^B: β₁ = ⋯ = β_J = 0.

Under H₀^B the model becomes

Yᵢⱼₖ = μ + αᵢ + εᵢⱼₖ,

which is a one-way layout model. Hence

RSS₀ = Σᵢ,ⱼ,ₖ (Yᵢⱼₖ − Ȳᵢ··)²,

and

RSS₀ − RSS₁ = IK Σⱼ (Ȳ·ⱼ· − Ȳ···)² = SS_B.

Similarly, the sum of squares due to factor A is SS_A = JK Σᵢ (Ȳᵢ·· − Ȳ···)².
Source     D.F.           SS     MS                    F value
Factor A   I − 1          SS_A   SS_A/(I − 1)          MS_A/MS_E
Factor B   J − 1          SS_B   SS_B/(J − 1)          MS_B/MS_E
Error      n − I − J + 1  SS_E   SS_E/(n − I − J + 1)
Total      n − 1
Example 3.2 (Continued) As mentioned in Section 3.4, the interaction between child and education is not significant. So we may fit the data set by the main effect model

earning ~ child + education.

The estimated contrasts are displayed as follows:

(Intercept)   child   edufB   edufC   edufD
     14.575  -0.675  -3.373  -4.958  -6.258

The two-way ANOVA table is given in Table 3.7.
TABLE 3.7. ANOVA table for Example 3.2

Source      D.F.   SS         MS        F        p
child       1      4.057      4.058     0.256    0.613
eduf        3      1936.489   645.496   40.690   0.000
Residuals   602    9549.934   15.864
Total       606
Table 3.8 shows that the decomposition of the sums of squares depends on the order in which the covariates are added to the model.

TABLE 3.8. ANOVA table for Example 3.2 with the order of the covariates reversed

Source      D.F.   SS         MS        F        p
eduf        3      1884.844   628.281   39.605   0.000
child       1      55.703     55.703    3.511    0.061
Residuals   602    9549.934   15.864
Total       606
            Mean Sq   F Value      Pr(F)
eduf       628.2813  39.76245  0.0000000
child       55.7030   3.52531  0.0609242
eduf:child  28.4042   1.79764  0.1464194
Residuals   15.8009
>lm(earning~child+eduf)
Call:
lm(formula = earning ~ child + eduf)
Coefficients:
(Intercept)
child
edufB
edufC
edufD
14.57527 -0.6749966 -3.373208 -4.957911 -6.257954
Degrees of freedom: 607 total; 602 residual
Residual standard error: 3.982923
>summary(aov(earning~child+eduf))
            Mean Sq   F Value       Pr(F)
eduf       628.2813  39.60502  0.00000000
child       55.7030   3.51135  0.06143388
Residuals   15.8637
4
Analysis of Covariance

4.1 Introduction

In the previous two chapters, it was assumed that the sources of variation were mainly due to one or more treatment factors. If confounding or nuisance factors are expected to be a major source of variation, they should be taken into account in the design and analysis of the experiment. For example, in the female labor supply data, in addition to the education level, which affects the income, gender, age and job prestige should also affect the income. To control or adjust for this prior variability, we may make the groups resemble each other with respect to gender, age and job prestige, eliminating the confounding factors at the design stage of the experiment. An alternative approach is to adjust for them by linear models at the analysis stage of the experiment.
4.2 Models
Consider an experiment conducted as a completely randomized design to compare the eects of the levels of treatments on a response
variable Y. Suppose that the response is also affected by a few nuisance covariates, whose value x can be measured during or prior to the experiment. Further, suppose that there is a linear relationship between E(Y) and x, which can be examined by a scatter plot for each level of the factors. A comparison of the effects of the treatments can then be made by comparing the mean responses at any common value of x.

The model that allows this type of analysis is the analysis of covariance model. For simplicity, assume that there are one factor and p covariates. Then the analysis of covariance model is

Yᵢₖ = μ + τᵢ + Σⱼ₌₁ᵖ βⱼxᵢⱼₖ + εᵢₖ,    (4.1)
k = 1, …, Kᵢ,  i = 1, …, I,

where the εᵢₖ are independent and identically distributed N(0, σ²).

In this model, the effect of the ith treatment is modeled as τᵢ as usual. If there is more than one treatment factor, then it could be replaced by main effects and interactions. The value of the covariates on the kth time that treatment i is observed is written as xᵢₖ, and the linear relationship between the response and the covariates is modeled as xᵢₖᵀβ, as in a regression model. It is important for the analysis that the treatments do not affect the values xᵢₖ of the covariates; otherwise, comparison of the treatment effects at a common x-value would not be meaningful.
A common alternative form of the analysis of covariance model is

Yᵢₖ = μ + τᵢ + Σⱼ₌₁ᵖ βⱼ(xᵢⱼₖ − x̄ⱼ) + εᵢₖ,
A more flexible model allows different slopes in different treatment groups:

Yᵢₖ = μ + τᵢ + Σⱼ₌₁ᵖ βᵢⱼxᵢⱼₖ + εᵢₖ.

This model will reduce the modeling bias, but will result in wider confidence intervals as it uses more parameters, compared with model (4.1). If there is no significant lack of fit of the model, then plots of the residuals versus run order, predicted values, and normal scores can be used to assess the assumptions of independence, equal variance, and normality of the random error terms.
4.3 Least Squares Estimates

With the covariates centered within each treatment, model (4.1) can be written as
$$Y_{ik} = \mu_i + \sum_{j=1}^{p} \beta_j (x_{ijk} - \bar{x}_{ij}) + \varepsilon_{ik}, \tag{4.2}$$
where $\mu_i = \mu + \tau_i + \sum_{j=1}^{p} \beta_j \bar{x}_{ij}$. In matrix notation,
$$y = A\mu + X\beta + \varepsilon,$$
where A is an $n \times I$ design matrix associated with a one way ANOVA model and X is an $n \times p$ design matrix associated with $(x_{ijk} - \bar{x}_{ij})$. Since X is centered, $A^T X = 0$. Thus the least squares estimates of $\mu_i$ and $\beta$ are
$$\hat{\mu}_i = \bar{Y}_{i\cdot}, \qquad \hat{\beta} = (X^T X)^{-1} X^T y.$$
To obtain the least squares estimates of $\mu$ and the $\tau_i$'s, one has to solve the equations
$$\hat{\mu} + \hat{\tau}_i = \bar{Y}_{i\cdot} - \sum_{j=1}^{p} \hat{\beta}_j \bar{x}_{ij}. \tag{4.3}$$
This is the same as the normal equations in a one way ANOVA model if we treat the term on the right-hand side of (4.3) as a "cell mean". It is clear that there are I + 1 parameters, but only I equations. Therefore there is no unique solution for $\hat{\mu}$ and the $\hat{\tau}_i$'s, and, as in the one way ANOVA model, a constraint must be imposed.
Under the constraint $\sum_{i=1}^{I} \tau_i = 0$, for example,
$$\hat{\mu} = \frac{1}{I} \sum_{i=1}^{I} \Big( \bar{Y}_{i\cdot} - \sum_{j=1}^{p} \hat{\beta}_j \bar{x}_{ij} \Big), \qquad \hat{\tau}_i = \bar{Y}_{i\cdot} - \sum_{j=1}^{p} \hat{\beta}_j \bar{x}_{ij} - \hat{\mu}.$$
Different statistical packages may impose different constraints on the $\tau_i$'s. For example, SAS assumes that $\tau_I = 0$.
4.4 Analysis of Covariance

In this section, we study the analysis of covariance test of
$$H_0: \tau_1 = \cdots = \tau_I = 0$$
against the alternative hypothesis $H_A$ that at least two of the $\tau_i$ differ. For simplicity, we still assume that there are one factor and p covariates $X_1, \ldots, X_p$. Denote by SSE $(= RSS_1(T, \beta))$ the residual sum of squares under the full model, i.e.
$$SSE = RSS_1(T, \beta) = \sum_{i,k} \Big( y_{ik} - \hat{\mu} - \hat{\tau}_i - \sum_{j=1}^{p} \hat{\beta}_j x_{ijk} \Big)^2,$$
and let $RSS_0$ be the residual sum of squares under the null hypothesis,
$$RSS_0(\beta) = \sum_{i,k} \Big( y_{ik} - \hat{\mu}^0 - \sum_{j=1}^{p} \hat{\beta}_j^0 x_{ijk} \Big)^2,$$
where $\hat{\mu}^0$ and $\hat{\beta}^0$ are the maximum likelihood estimates under $H_0$. Then, by the general linear model theory, $SSE \sim \sigma^2 \chi^2_{n-p-I}$, and the test statistic is
$$F = \frac{SS_T/(I-1)}{SSE/(n-p-I)}, \qquad SS_T = RSS_0 - SSE,$$
which follows the $F_{I-1,\,n-p-I}$ distribution under $H_0$; we reject $H_0$ when $F > F_{I-1,\,n-p-I}(1-\alpha)$.

The significance of the covariates can be examined by testing
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0.$$
In this case, the model becomes the one way layout model and
$$RSS_0(T) = \sum_{i,k} (Y_{ik} - \bar{Y}_i)^2 \sim \sigma^2 \chi^2_{n-I}.$$
The corresponding test statistic is
$$F = \frac{SS(\beta|T)/p}{SSE/(n-p-I)}, \qquad SS(\beta|T) = RSS_0(T) - SSE.$$
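As a quick sketch of this F test, the following simulated example (all data and parameter values are hypothetical) fits the full ANCOVA model and the reduced model by least squares and forms F from the two residual sums of squares:

```python
import numpy as np

rng = np.random.default_rng(0)
I, K, p = 3, 20, 2                         # 3 treatments, 20 obs each, 2 covariates
n = I * K
treat = np.repeat(np.arange(I), K)
x = rng.normal(size=(n, p))
tau = np.array([0.0, 1.0, 2.0])            # treatment effects (hypothetical)
y = 1.0 + tau[treat] + x @ np.array([0.5, -0.3]) + rng.normal(size=n)

A = np.eye(I)[treat]                       # one-way design matrix (cell-means coding)
X_full = np.hstack([A, x])                 # full ANCOVA model, as in (4.1)
X_null = np.hstack([np.ones((n, 1)), x])   # model under H0: all tau_i equal

rss = lambda M: np.sum((y - M @ np.linalg.lstsq(M, y, rcond=None)[0]) ** 2)
RSS1, RSS0 = rss(X_full), rss(X_null)
F = ((RSS0 - RSS1) / (I - 1)) / (RSS1 / (n - p - I))
print(round(F, 2))
```

With strong simulated treatment effects, F lands far above the $F_{I-1,\,n-p-I}$ critical value.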
TABLE 4.1. Analysis of Covariance for p Linear Covariates

Source      D.F.     SS             MS                  F value
Treatment   I-1      $SS_T$         $SS_T/(I-1)$        $MS_T/MSE$
Covariate   p        $SS(\beta|T)$  $SS(\beta|T)/p$     $MS(\beta|T)/MSE$
Error       n-I-p    $SSE$          $SSE/(n-I-p)$
Total       n-1
Since the parameters in the treatment and the covariates are not orthogonal, $SS(\beta|T)$ equals the sum of squares reduction due to the covariates given the contribution of the treatments. In fact, one can decompose the total sum of squares in a more detailed form:
$$RSS(\mu) = \{RSS(\mu) - RSS(A)\} + \{RSS(A) - RSS(A, X_1)\} + \cdots + \{RSS(A, X_1, \ldots, X_{p-1}) - RSS(A, X_1, \ldots, X_p)\} + RSS(A, X_1, \ldots, X_p) = SS_T + SS_\beta + SSE.$$
Interpretation: $RSS(A, X_1, \ldots, X_{k-1}) - RSS(A, X_1, \ldots, X_k) \sim \sigma^2 \chi^2_1$, which holds if the model $Y \sim A + X_1 + \cdots + X_{k-1}$ is correct, is the sum of squares reduction due to $X_k$ given the contributions already in $A, X_1, \ldots, X_{k-1}$.
Consider the fit earning ~ eduf + age + prestige:

Source     D.F.    SS          F value    p-value
eduf       3       1884.844    43.808     0.0000
age        1       32.268      2.250      0.1341
prestige   1       954.093     66.530     0.0000
Error      601     8619.276
This shows that age is not significant given eduf. The sum of squares due to age and prestige is 32.268 + 954.093 = 977.361, with 2 degrees of freedom. For testing $H_0: \beta_1 = \beta_2 = 0$,
$$F = \frac{977.361/2}{8619.276/601} \approx 34.07.$$
The same model fitted with the terms in a different order gives:

Source     D.F.    SS          F value    p-value
prestige   1       2394.842    166.986    0.0000
eduf       3       461.556     10.852     7e-07
age        1       14.806      1.0324     0.31
Error      601     8619.276
For testing $H_0: \tau_1 = \cdots = \tau_4 = 0$ in the model
$$\text{earning}_{ik} = \mu + \tau_i + \beta \,\text{prestige}_{ik} + \varepsilon_{ik},$$
the p-value is $7 \times 10^{-7}$.
Even though the sum of squares admits different decompositions, the model is still the same and the estimated coefficients are the same. The estimated coefficients and their standard errors are summarized in Table 4.3, based on the S-plus output. Note that S-plus assumes $\tau_1 = 0$.
TABLE 4.4. Estimated Contrasts and Standard Errors

              t-value   p-value
(Intercept)   5.6430    0.0000
prestige      8.1564    0.0000
edufB        -4.3465    0.0000
edufC        -5.0205    0.0000
edufD        -5.5470    0.0000
age           1.0161    0.31
4.5 Treatment Contrasts and Confidence Intervals

A treatment contrast $\psi = \sum_{i=1}^{I} c_i \tau_i$ (with $\sum_i c_i = 0$) is estimated by
$$\hat{\psi} = \sum_{i=1}^{I} c_i \Big( \bar{Y}_{i\cdot} - \sum_{j=1}^{p} \hat{\beta}_j \bar{x}_{ij\cdot} \Big) = \sum_{i=1}^{I} c_i \hat{\mu}_i - \sum_{i=1}^{I} \sum_{j=1}^{p} c_i \hat{\beta}_j \bar{x}_{ij\cdot},$$
with variance
$$\mathrm{var}(\hat{\psi}) = \sigma^2 \{ c^T (A^T A)^{-1} c + c^T \bar{X} (X^T X)^{-1} \bar{X}^T c \},$$
where $\bar{X}$ is the $I \times p$ matrix with entries $\bar{x}_{ij\cdot}$. It follows that
$$\frac{\hat{\psi} - \psi}{\widehat{SE}(\hat{\psi})} \sim t_{n-I-p},$$
so a $(1-\alpha)$ confidence interval for $\psi$ is
$$\hat{\psi} \pm t_{n-I-p}(1 - \alpha/2) \, \widehat{SE}(\hat{\psi}).$$
Simultaneous confidence intervals for m contrasts $\sum_i c_i \tau_i$ take the form
$$\sum_{i=1}^{I} c_i \hat{\tau}_i \pm w \, \widehat{SE}(\hat{\psi}),$$
where $w = t_{n-I-p}(1 - \alpha/(2m))$ for the Bonferroni method and $w = \sqrt{(I-1) F_{I-1,\,n-I-p}(1-\alpha)}$ for the Scheffé method.
TABLE 4.6. Estimated Contrasts and Their Standard Errors: correlations of the estimates

           (Intercept)  prestige   EdufB    EdufC    EdufD
prestige    -0.6256
EdufB       -0.4702     0.2235
EdufC       -0.6205     0.4131    0.7504
EdufD       -0.3965     0.4684    0.6802   0.7086
Age         -0.6440    -0.0590    0.0036   0.0738  -0.2573
-0.2573
b2 ) = varf(b3
b1 ) (b2
b1 )g:
Thus
c =
SE
0:58092 + 0:53732
2 is 0:5811 1:96
The coecient0w1
B for all pairwise contrasts is t601 (1 0:05=12) =
4
2:647 as m = @ A = 6, and the coecient for constructing all
2
The corresponding S-plus anova output for the fit earning ~ eduf + age + prestige is:

           Mean Sq     F Value     Pr(F)
eduf       628.2813    43.80844    0.0000000
age         32.2683     2.24999    0.1341406
prestige   954.0929    66.52645    0.0000000
Error       14.3416
5
Mixed Effects Models

5.1 Introduction

So far, the treatment effects in the models we have studied were treated as fixed. That is, the levels of the treatment factors were specifically chosen. We have tested hypotheses about, and calculated confidence intervals for, comparisons of the effects of these particular treatment factor levels. These treatment effects are known as fixed effects, and models that contain only fixed effects are called fixed effects models.
In some situations, when the number of levels of a treatment factor is large, the levels that are used in the experiment can be regarded as random variables. Such treatment factor effects are called random effects, and the corresponding models are called random effects models. Furthermore, models containing both fixed effects and random effects are called mixed effects models. These models were introduced by Henderson (1950).
Example 5.1 This example is extracted from Dean and Voss (1999). Suppose that a manufacturer of canned tomato soup wishes to reduce the variability in the thickness of the soup. Suppose that the most likely causes of the variability are the quality of the corn flour (cornstarch) received from the supplier and the actions of the machine operators. Let us consider two different scenarios:
Scenario 1: The machine operators are highly skilled and have been with the company for a long time. Thus, the most likely cause of variability is the quality of the corn flour delivered to the company. The treatment factor is corn flour, and its possible levels are all the possible batches of corn flour that the supplier could deliver. Theoretically, this is an infinite population of batches. We are interested not only in the batches of corn flour that have currently been delivered, but also in all those that might be delivered in the future. If we assume that the batches delivered to the company are a random sample from all batches that could be delivered, and if we take a random sample of delivered batches to be observed in the experiment, then the effect of the corn flour on the thickness is a random effect and can be modeled by a random variable.
Scenario 2: It is known that the quality of the corn flour is extremely consistent, so the most likely cause of variability is the different actions of the machine operators. The company is large and machine operators change quite frequently. Consequently, those that are available to take part in the experiment are only a small sample of all operators employed by the company at present or that might be employed in the future. If we can assume that the operators available for the experiment are representative of the population, then we can assume that they are similar to a random sample from a very large population of possible operators, present and future. Since we would like to know about the variability of the entire population, we model the effects of the operators as random variables, and call them random effects.
Example 5.2 In the last chapter, the earning data were fitted by the fixed effect model
earning ~ eduf + prestige.
A mixed effect alternative allows each subject its own intercept and slope,
earning ~ eduf + $\beta_{i0}$ + $\beta_{i1}$ prestige$_i$,
where $(\beta_{i0}, \beta_{i1}) \sim N(0, D)$. Compared with the fixed effect model, the mixed effect model is far more flexible. It allows different slopes and intercepts, and hence reduces modeling bias substantially.
Compared with the fixed effect model with different slopes and intercepts,
earning ~ eduf + $\gamma_{i0}$ + $\gamma_{i1}$ prestige$_i$,
the latter is more flexible, but it has 2n + 3 + 1 independent parameters. As a result, the parameters cannot be estimated with good accuracy; indeed, they are not estimable for this example. On the other hand, the mixed effect model allows varying coefficients, but these coefficients are restricted to be a realization from a normal distribution. The total number of parameters is 2 + 3 + 1 + 3 = 9. In conclusion, the random effects model enhances the flexibility of the model without introducing many parameters.
5.2 Random Effects One Way Model

The random effects one way model is
$$Y_{ij} = \mu + T_i + \varepsilon_{ij}, \qquad j = 1, \ldots, n_i, \; i = 1, \ldots, I, \tag{5.1}$$
where the $T_i \sim N(0, \sigma_T^2)$ and $\varepsilon_{ij} \sim N(0, \sigma^2)$ are all independent.
5.2.1 Estimation of $\sigma^2$ and $\sigma_T^2$

The two components $\sigma_T^2$ and $\sigma^2$ of the variance of $Y_{ij}$ are known as variance components, and observations on the same treatment are positively correlated, with correlation coefficient $\rho = \sigma_T^2/(\sigma^2 + \sigma_T^2)$. It is obvious that the maximum likelihood estimate of $\mu$ is $\hat{\mu} = \bar{Y}_{\cdot\cdot}$, but the maximum likelihood estimates of $\sigma_T^2$ and $\sigma^2$ are complicated. In the fixed effects one way ANOVA model, $\sigma^2$ is estimated by
$$\hat{\sigma}^2 = \frac{1}{n - I} \sum_{i=1}^{I} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2.$$
The random effects one way model is very similar to the fixed effects one way ANOVA model, so a natural question is whether $\hat{\sigma}^2$ remains a good estimate in the random effects one way model. We now show that $\hat{\sigma}^2$ is an unbiased estimator of $\sigma^2$. In fact, since $Y_{ij} \sim N(\mu, \sigma^2 + \sigma_T^2)$, $EY_{ij}^2 = \mu^2 + \sigma^2 + \sigma_T^2$, and as $\bar{Y}_i = \mu + T_i + \bar{\varepsilon}_i$, $E\bar{Y}_i^2 = \mu^2 + \sigma_T^2 + \sigma^2/n_i$. Hence
$$E[SSE] = E\Big\{ \sum_{i=1}^{I} \sum_{j=1}^{n_i} Y_{ij}^2 - \sum_{i=1}^{I} n_i \bar{Y}_i^2 \Big\} = (n - I)\sigma^2.$$
Similarly,
$$\bar{Y}_{\cdot\cdot} = \mu + \frac{1}{n} \sum_i n_i T_i + \bar{\varepsilon}_{\cdot\cdot}, \qquad E[\bar{Y}_{\cdot\cdot}^2] = \mu^2 + \frac{\sum_i n_i^2}{n^2} \sigma_T^2 + \frac{\sigma^2}{n}.$$
Thus
$$E[SST] = E\Big\{ \sum_{i=1}^{I} n_i \bar{Y}_i^2 - n \bar{Y}_{\cdot\cdot}^2 \Big\} = \Big( n - \frac{\sum_i n_i^2}{n} \Big) \sigma_T^2 + (I - 1)\sigma^2.$$
Denote
$$c = \frac{n - \sum_i n_i^2/n}{I - 1},$$
which equals J in the balanced design with $n_i = J$ for $i = 1, \ldots, I$. Thus an unbiased estimator for $\sigma_T^2$ is
$$\hat{\sigma}_T^2 = c^{-1} \Big( \frac{SST}{I - 1} - \frac{SSE}{n - I} \Big).$$
5.2.2 Testing Equality of Treatment Effects

Consider testing
$$H_0: \sigma_T^2 = 0 \quad \text{versus} \quad H_1: \sigma_T^2 > 0,$$
or, more generally,
$$H_0: \sigma_T^2 \le \gamma \sigma^2 \quad \text{versus} \quad H_1: \sigma_T^2 > \gamma \sigma^2 \tag{5.2}$$
for a given constant $\gamma \ge 0$. In the balanced design, $SST \sim (c\sigma_T^2 + \sigma^2)\chi^2_{I-1}$ independently of $SSE \sim \sigma^2 \chi^2_{n-I}$, so under $H_0: \sigma_T^2 = 0$ the ratio $F = MST/MSE$ follows the $F_{I-1,\,n-I}$ distribution. Inverting this pivot also yields a $(1-\alpha)$ confidence interval for the ratio $\sigma_T^2/\sigma^2$:
$$c^{-1} \Big\{ \frac{F}{F_{I-1,n-I}(1 - \alpha/2)} - 1 \Big\} \;\le\; \frac{\sigma_T^2}{\sigma^2} \;\le\; c^{-1} \Big\{ \frac{F}{F_{I-1,n-I}(\alpha/2)} - 1 \Big\}.$$
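The estimators and the interval above can be sketched on simulated balanced data (all parameter values below are hypothetical choices):

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(1)
I, J = 6, 8                               # balanced one-way layout, so c = J
sigma_T, sigma = 1.5, 1.0                 # true standard deviations (hypothetical)
T = rng.normal(0, sigma_T, size=I)
y = T[:, None] + rng.normal(0, sigma, size=(I, J))  # Y_ij = mu + T_i + eps_ij, mu = 0

ybar_i, ybar = y.mean(axis=1), y.mean()
SSE = ((y - ybar_i[:, None]) ** 2).sum()
SST = J * ((ybar_i - ybar) ** 2).sum()
MST, MSE = SST / (I - 1), SSE / (I * J - I)

sigma2_hat = MSE                          # unbiased for sigma^2
sigmaT2_hat = (MST - MSE) / J             # unbiased for sigma_T^2 (c = J)
F = MST / MSE                             # tests H0: sigma_T^2 = 0

# 95% confidence interval for the ratio sigma_T^2 / sigma^2
lo = (F / f_dist.ppf(0.975, I - 1, I * J - I) - 1) / J
hi = (F / f_dist.ppf(0.025, I - 1, I * J - I) - 1) / J
print(round(sigma2_hat, 2), round(sigmaT2_hat, 2), round(F, 2), (round(lo, 2), round(hi, 2)))
```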
5.3 Mixed Effects Models and BLUP

The general mixed effects model is
$$y = X\beta + Zu + \varepsilon,$$
where y consists of the n responses, X and Z are $n \times p$ and $n \times q$ design matrices, $\beta$ is the $p \times 1$ vector of unknown fixed effects, and u is the $q \times 1$ vector of random effects. Here it is assumed that $Eu = 0$, $\mathrm{var}(u) = \sigma^2 G$ and $\mathrm{var}(\varepsilon) = \sigma^2 R$. Further, we assume that u and $\varepsilon$ are uncorrelated. In this model, we allow the random error $\varepsilon$ to be heteroscedastic. All of the preceding examples can be expressed in this form. The natural questions here are how to estimate $\beta$ and how to predict u.
Henderson (1950) assumed that u and y have a joint normal distribution, with G and R known for the moment. Then the joint density is
$$f(y, u) = f(y|u) f(u) = (2\pi\sigma^2)^{-n/2} |R|^{-1/2} \exp\Big\{ -\frac{1}{2\sigma^2} (y - X\beta - Zu)^T R^{-1} (y - X\beta - Zu) \Big\} \times (2\pi\sigma^2)^{-q/2} |G|^{-1/2} \exp\Big( -\frac{1}{2\sigma^2} u^T G^{-1} u \Big).$$
Maximizing over $\beta$ and u is therefore equivalent to minimizing the penalized least squares criterion
$$(y - X\beta - Zu)^T R^{-1} (y - X\beta - Zu) + u^T G^{-1} u,$$
whereas the fixed effect model minimizes only $(y - X\beta - Zu)^T R^{-1} (y - X\beta - Zu)$.
Compared with the fixed effect model, we shrink u towards zero and hence change the estimate of $\beta$ as well. The corresponding normal equations are as follows:
$$X^T R^{-1} (y - X\hat{\beta} - Z\hat{u}) = 0,$$
$$Z^T R^{-1} (y - X\hat{\beta} - Z\hat{u}) - G^{-1}\hat{u} = 0.$$
Using the identity
$$(R + ZGZ^T)^{-1} = R^{-1} - R^{-1} Z (Z^T R^{-1} Z + G^{-1})^{-1} Z^T R^{-1}, \tag{5.4}$$
one obtains
$$X^T (R + ZGZ^T)^{-1} X \hat{\beta} = X^T (R + ZGZ^T)^{-1} y.$$
This implies that
$$\hat{u} = G Z^T (ZGZ^T + R)^{-1} (y - X\hat{\beta}).$$
Note that $E(u|y) = G Z^T (ZGZ^T + R)^{-1} (y - X\beta)$. Hence $\hat{u}$ can be regarded as a plug-in version of the best linear predictor. As a result, it can be shown that $\hat{\beta}$ is BLUE and $\hat{u}$ is the best linear unbiased predictor (BLUP). In fact, for any linear unbiased estimator $a^T y$ of $b^T \beta + c^T u$, that is, any a with
$$E(a^T y) = E(b^T \beta + c^T u) \quad \text{for all } \beta,$$
we have
$$\mathrm{var}(a^T y) \ge \mathrm{var}(b^T \hat{\beta} + c^T \hat{u}).$$
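A small numerical check, on simulated data, that Henderson's normal equations and the closed-form expressions above agree (the dimensions and variance values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 40, 2, 5
X = rng.normal(size=(n, p))
Z = np.eye(q)[rng.integers(0, q, size=n)]   # group indicators as random-effect design
G = 2.0 * np.eye(q)                          # var(u) proportional to G
R = np.eye(n)                                # var(eps) proportional to R
beta_true = np.array([1.0, -1.0])
y = X @ beta_true + Z @ rng.normal(0, np.sqrt(2.0), q) + rng.normal(size=n)

V = R + Z @ G @ Z.T
Vinv = np.linalg.inv(V)

# generalized least squares (BLUE) for beta, then the BLUP for u
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
u_hat = G @ Z.T @ Vinv @ (y - X @ beta_hat)

# Henderson's mixed model equations give the same answer
Rinv = np.linalg.inv(R)
lhs = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
                [Z.T @ Rinv @ X, Z.T @ Rinv @ Z + np.linalg.inv(G)]])
rhs = np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y])
sol = np.linalg.solve(lhs, rhs)
print(np.allclose(sol[:p], beta_hat), np.allclose(sol[p:], u_hat))
```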
As an illustration, consider again the random effects one way model
$$Y_{ij} = \mu + T_i + \varepsilon_{ij}, \qquad j = 1, \ldots, J_i, \; i = 1, \ldots, I,$$
where $T_i \sim N(0, \sigma_T^2)$ and $\varepsilon_{ij} \sim N(0, \sigma^2)$, with the $T_i$'s and $\varepsilon_{ij}$'s independent. For this model, we have
$$\bar{Y}_i = \mu + T_i + \bar{\varepsilon}_i,$$
where $\bar{\varepsilon}_i \sim N(0, \sigma^2/J_i)$. Thus it can be derived by some straightforward algebra that
$$\hat{\mu} = \frac{\sum_i \dfrac{J_i \bar{Y}_i}{\sigma^2 + J_i \sigma_T^2}}{\sum_i \dfrac{J_i}{\sigma^2 + J_i \sigma_T^2}}, \qquad \hat{T}_i = \frac{J_i \sigma_T^2}{J_i \sigma_T^2 + \sigma^2} (\bar{Y}_i - \hat{\mu}).$$
In the balanced case ($J_i = J$), these reduce to
$$\hat{\mu} = \bar{Y}_{\cdot\cdot}, \qquad \hat{T}_i = \frac{J \sigma_T^2}{J \sigma_T^2 + \sigma^2} (\bar{Y}_i - \bar{Y}_{\cdot\cdot}).$$
5.4 Restricted Maximum Likelihood Estimator (REML)

Write the model as
$$y = X\beta + e,$$
where $\mathrm{var}(e) = V(\theta)$. The parameter $\theta$ has to satisfy certain constraints in order that $V(\theta) > 0$. For a given $\theta$, when y is normal, the log-likelihood is
$$\ell(\beta, \theta) = -\frac{n}{2} \log 2\pi - \frac{1}{2} \log |V(\theta)| - \frac{1}{2} (y - X\beta)^T V^{-1}(\theta) (y - X\beta).$$
Profiling out $\beta$ gives
$$\ell_p(\theta) = \max_\beta \ell(\beta, \theta) = -\frac{1}{2} \log |V(\theta)| - \frac{1}{2} (y - X\hat{\beta}(\theta))^T V^{-1}(\theta) (y - X\hat{\beta}(\theta)) - \frac{n}{2} \log 2\pi,$$
where $\hat{\beta}(\theta)$ is the generalized least squares estimate. Maximum likelihood ignores the loss of degrees of freedom from estimating $\beta$, and REML corrects for this. In the simplest case $V(\theta) = \sigma^2 I_n$, we have $RSS = y^T (I - X(X^T X)^{-1} X^T) y$, so $\hat{\sigma}^2_{MLE} = \frac{1}{n} RSS$, while the REML estimator is
$$\hat{\sigma}^2_{REML} = \frac{RSS}{n - p}.$$
To see the bias of maximum likelihood clearly, consider
$$Y_{ij} = \mu_i + \varepsilon_{ij}, \qquad j = 1, \ldots, J, \; i = 1, \ldots, n,$$
where the $\varepsilon_{ij}$ are independent and identically distributed $N(0, \theta)$. Here $p = n$, and profiling out the $\mu_i$'s gives $\hat{\mu}_i = \bar{Y}_i$ and $RSS = \sum_i \sum_j (Y_{ij} - \bar{Y}_i)^2$, so $\hat{\theta}_{MLE} = RSS/(nJ)$. Since $E[RSS] = n(J-1)\theta$,
$$E \hat{\theta}_{MLE} = \frac{n(J-1)}{nJ}\, \theta = \Big(1 - \frac{1}{J}\Big)\theta,$$
which is badly biased when J is small. In contrast,
$$\hat{\theta}_{REML} = \frac{RSS}{n(J-1)}$$
is unbiased.
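A short simulation illustrating this bias (the group count, J and θ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n_grp, J, theta = 50, 2, 4.0     # many group means, tiny J: MLE bias is large
reps = 2000
mle, reml = [], []
for _ in range(reps):
    y = rng.normal(0, np.sqrt(theta), size=(n_grp, J))
    rss = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum()
    mle.append(rss / (n_grp * J))          # divides by nJ: biased downward
    reml.append(rss / (n_grp * (J - 1)))   # divides by n(J-1): unbiased
print(round(np.mean(mle), 2), round(np.mean(reml), 2))
```

With J = 2, the MLE averages near $(1 - 1/J)\theta = 2$, while REML averages near the true $\theta = 4$.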
Example 5.6 (Random effects one way model) Consider the random effects one way model (5.1). The variance component estimates can be written as
$$\hat{\sigma}^2 = MSE, \qquad \hat{\sigma}_T^2 = \frac{MST - MSE}{c}.$$
For comparing two nested mixed effect models, one uses the likelihood ratio statistic, rejecting the smaller model when
$$T > \chi^2_d(1 - \alpha),$$
where d is the difference in the number of parameters between the two models.
5.6 Examples

In this section, we illustrate the analysis of mixed effect models with the Female Labor Supply data. The data set was fitted by the model
$$\text{earning}_i = \mu + \beta_0 \,\text{prestige}_i + \sum_{j=1}^{4} \beta_j I(\text{eduf} = j) + A_i + B_i \,\text{prestige}_i + \varepsilon_i, \tag{5.5}$$
where $\mu$ and the $\beta_j$'s are fixed effects, and $A_i$ and $B_i$ are random effects which are normally distributed with mean 0 and covariance matrix D. From the S-plus output, the value of the log-likelihood at the maximum likelihood estimate is
$$\ell_{R1} = -1614.309.$$
Of interest here is to test $H_0: B_i = 0$. Thus consider the reduced model
$$\text{earning}_i = \mu + \beta_0 \,\text{prestige}_i + \sum_{j=1}^{4} \beta_j I(\text{eduf} = j) + A_i + \varepsilon_i. \tag{5.6}$$
Its maximized log-likelihood is $\ell_{R2} = -1671.3$. The likelihood ratio test statistic is
$$T = 2\{\ell_{R1} - \ell_{R2}\} = 113.91.$$
The likelihood ratio statistic has a $\chi^2$ distribution with 2 degrees of freedom (the two models differ by 2 parameters). Thus the corresponding p-value is essentially 0, so we reject the null hypothesis, and model (5.5) should be used.
The estimated fixed effects are displayed as follows:

                 Value    Approx. Std.Error   z ratio(C)
(Intercept)      6.982    0.8143               8.581
prestige         0.120    0.0140               8.518
edufB - edufA   -1.924    0.569               -3.379
edufC - edufA   -2.421    0.570               -4.250
edufD - edufA   -2.891    0.593               -4.873
The fitted model produces the following residual summary:

Standardized Population-Average Residuals:
       Min          Q1          Med          Q3       Max
-0.6722075  -0.1226388  -0.01158055   0.1101916   4.00266

Number of Observations: 607
Number of Clusters: 607
> anova(fit1,fit2)
Response: earning
fit1
fixed: (Intercept), prestige, edufB, edufC, edufD
random: (Intercept), prestige
block: list(1:2)
covariance structure: unstructured
serial correlation structure: identity
variance function: identity
fit2
fixed: (Intercept), prestige, edufB, edufC, edufD
random: (Intercept)
block: list(1:1)
covariance structure: unstructured
serial correlation structure: identity
variance function: identity
     Model Df    AIC    BIC  Loglik    Test Lik.Ratio P value
fit1     1  9 3246.6 3286.3 -1614.3
fit2     2  7 3356.5 3387.4 -1671.3 1 vs. 2    113.91       0
6
Introduction to Generalized Linear Models

6.1 Introduction

In many applications, response variables can be either discrete or continuous. Let us examine some motivating examples first.
A binary response Y is modeled through
$$P\{Y = 1 | X_1 = x_1, \ldots, X_p = x_p\} = p(x_1, \ldots, x_p).$$
This is equivalent to assuming that the conditional distribution of Y given the covariates is Bernoulli$\{p(x_1, \ldots, x_p)\}$.
Hodgkin   Non-Hodgkin
  396        375
  568        375
 1212        752
  171        208
  554        151
 1104        116
  257        736
  435        192
  295        315
  397       1252
  288        675
 1004        700
  431        440
  795        771
 1621        688
 1378        426
  902        410
  958        979
 1283        377
 2415        503
Recall the simple linear model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where $\varepsilon_i \sim N(0, \sigma^2)$. For this model, we have already looked at two of the three components: the conditional distribution of Y given X and the linear structure. The third component in this simple model is in fact the identity link. Let us look at all three more closely.
Define the regression function
$$m(x_1, \ldots, x_p) = E(Y | X_1 = x_1, \ldots, X_p = x_p),$$
so that
$$Y = m(X_1, \ldots, X_p) + \varepsilon \quad \text{with} \quad E(\varepsilon | x) = 0.$$
The linear structure assumes
$$m(x_1, \ldots, x_p) = \beta_1 x_1 + \cdots + \beta_p x_p,$$
which leads to the linear model
$$Y = \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon.$$
This is the most crucial part of the assumption and needs to be verified. As the parameters $\beta_1, \ldots, \beta_p$ vary from $-\infty$ to $+\infty$, so does the linear function $\beta_1 x_1 + \cdots + \beta_p x_p$. Thus the minimum requirement is that $\eta = m(x_1, \ldots, x_p)$ has range $(-\infty, +\infty)$. This condition is not satisfied for the mean functions in Examples 6.1 and 6.2. In this situation, we transform the mean $\mu$ by a function g and model the transformed mean by a linear model:
$$g(\mu) = \sum_{j=1}^{p} \beta_j x_j, \qquad \text{i.e.} \qquad \mu = m(x_1, \ldots, x_p) = g^{-1}(\eta) = g^{-1}\Big(\sum_{j=1}^{p} \beta_j x_j\Big).$$
Any strictly increasing distribution function can be used to construct a link function for the Bernoulli model. In fact, suppose an unobserved variable z follows the linear model
$$z = \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon,$$
where $\varepsilon$ has a continuous distribution function, and only the sign of z is observed, $Y = I(z > 0)$. Then
$$(Y | X_1, \ldots, X_p) \sim \text{Bernoulli}\{p(x_1, \ldots, x_p)\}$$
with
$$p(x_1, \ldots, x_p) = F(\alpha + \beta_1 x_1 + \cdots + \beta_p x_p)$$
for a distribution function F. Absorbing the threshold into the intercept, we may take $\alpha = 0$. Hence
$$p(x_1, \ldots, x_p) = F(\beta_1 x_1 + \cdots + \beta_p x_p).$$
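This latent-variable construction can be illustrated by simulation: with standard normal errors, the conditional success probability matches $\Phi(\beta x)$ (a sketch; the single covariate and the value of β are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
n = 200_000
x = rng.normal(size=n)
beta = 0.8                              # hypothetical slope
z = beta * x + rng.normal(size=n)       # latent linear model with N(0, 1) error
y = (z > 0).astype(int)                 # we observe only the sign of z

# empirical P(Y = 1 | x near x0) versus F(beta * x0) with F the standard normal cdf
x0 = 0.5
in_bin = np.abs(x - x0) < 0.05
p_emp = y[in_bin].mean()
p_model = norm.cdf(beta * x0)
print(round(p_emp, 2), round(p_model, 2))
```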
(a) If $\varepsilon$ follows the logistic distribution, then
$$p(x_1, \ldots, x_p) = \frac{\exp(\beta_1 x_1 + \cdots + \beta_p x_p)}{1 + \exp(\beta_1 x_1 + \cdots + \beta_p x_p)},$$
which is equivalent to the logit link
$$g(\mu) = \log \frac{\mu}{1 - \mu} = \beta_1 x_1 + \cdots + \beta_p x_p.$$
(b) If $\varepsilon$ is normally distributed $N(0, \sigma^2)$, then
$$p(x_1, \ldots, x_p) = \Phi\Big( \frac{\beta_1 x_1 + \cdots + \beta_p x_p}{\sigma} \Big) = \Phi(\tilde{\beta}_1 x_1 + \cdots + \tilde{\beta}_p x_p),$$
which corresponds to the probit link $g(\mu) = \Phi^{-1}(\mu)$.
Example 6.4 When responses are counts, we often assume that the conditional distribution of the response is Poisson with mean $\lambda(x_1, \ldots, x_p)$. The log-link gives
$$\eta = \log(\lambda) = \beta_1 x_1 + \cdots + \beta_p x_p,$$
and hence
$$\lambda(x_1, \ldots, x_p) = \exp(\beta_1 x_1 + \cdots + \beta_p x_p).$$
In fact, the log-link is the canonical link of Poisson regression. Therefore the Poisson regression model is referred to as the log-linear model in some literature.
The exponential family has densities of the form
$$f(y; \theta, \phi) = \exp\Big\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \Big\} \tag{6.1}$$
for some known functions $a(\cdot)$, $b(\cdot)$ and $c(\cdot, \cdot)$. The parameter $\theta$ is called the canonical parameter and $\phi$ is called a dispersion parameter. The exponential family includes many commonly used distributions, such as the normal, binomial, Poisson and gamma distributions. We now illustrate model (6.1) by some useful examples.
If $Y | X = x$ has a binomial distribution $\text{Bin}(m, p)$, then
$$P(Y = y | X = x) = \binom{m}{y} p^y (1 - p)^{m - y} = \exp\Big\{ y \log \frac{p}{1 - p} + m \log(1 - p) + \log \binom{m}{y} \Big\}.$$
Here $\theta = \log \frac{p}{1 - p}$ is the canonical parameter, $b(\theta) = -m \log(1 - p) = m \log(1 + e^\theta)$, and $c(y, \phi) = \log \binom{m}{y}$ with $\phi = 1$. This model is commonly used for situations with a binary response.
For the Poisson distribution,
$$P(Y = y | X = x) = \frac{\lambda^y \exp(-\lambda)}{y!} = \exp(y \log \lambda - \lambda - \log y!),$$
so that $\theta = \log \lambda$, $b(\theta) = \lambda = e^\theta$ and $c(y, \phi) = -\log y!$ with $\phi = 1$. Similarly, the gamma distribution can be written in the form (6.1) with $\theta = -1/\mu$ and $b(\theta) = -\log(-\theta)$; this model is useful for continuous responses with a constant coefficient of variation.
For the canonical exponential family, we have the following moment properties.

Theorem 6.1 If the distribution of Y belongs to the canonical exponential family (6.1), then
(i) $\mu = E(Y) = b'(\theta)$;
(ii) $\mathrm{var}(Y) = a(\phi) b''(\theta)$.

Proof: Note that
$$\int f(y; \theta, \phi) \, d\nu(y) = 1.$$
Differentiating with respect to $\theta$,
$$\int \frac{\partial f(y; \theta, \phi)}{\partial \theta} \, d\nu(y) = \int \frac{\partial \log f(y; \theta, \phi)}{\partial \theta} f(y; \theta, \phi) \, d\nu(y) = 0,$$
that is, $E \ell'(\theta) = 0$. This is Bartlett's first identity, which underlies the consistency of the maximum likelihood estimate of $\theta$.
Differentiating again with respect to $\theta$,
$$\int \frac{\partial^2 \log f(y; \theta, \phi)}{\partial \theta^2} f(y; \theta, \phi) \, d\nu(y) + \int \Big( \frac{\partial \log f(y; \theta, \phi)}{\partial \theta} \Big)^2 f(y; \theta, \phi) \, d\nu(y) = 0.$$
From (6.1), $\ell'(\theta) = \{y - b'(\theta)\}/a(\phi)$. That is,
$$E \ell'(\theta) = E\{Y - b'(\theta)\}/a(\phi) = 0,$$
which gives (i). Similarly,
$$E\{-b''(\theta)/a(\phi)\} + E\{Y - b'(\theta)\}^2/a(\phi)^2 = 0,$$
hence $\mathrm{var}(Y) = a(\phi) b''(\theta)$, which gives (ii).
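The two identities of Theorem 6.1 can be checked numerically for the Poisson family, where $b(\theta) = \exp(\theta)$ and $a(\phi) = 1$ (the sample size and value of θ below are arbitrary):

```python
import numpy as np

# Poisson in canonical form: b(theta) = exp(theta), a(phi) = 1
theta = 0.7
lam = np.exp(theta)                  # mean implied by b'(theta)

h = 1e-5
b = np.exp
b1 = (b(theta + h) - b(theta - h)) / (2 * h)               # numerical b'(theta)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2   # numerical b''(theta)

rng = np.random.default_rng(4)
y = rng.poisson(lam, size=200_000)
# b'(theta) should match the sample mean, b''(theta) the sample variance
print(round(b1, 3), round(y.mean(), 3), round(b2, 3), round(y.var(), 3))
```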
The canonical links for common distributions are summarized below.

Distribution   $b(\theta)$           Canonical Link                    Range of $\theta$
Normal         $\theta^2/2$          $g(\mu) = \mu$                    $(-\infty, +\infty)$
Poisson        $\exp(\theta)$        $g(\mu) = \log \mu$               $(-\infty, +\infty)$
Bernoulli      $\log(1 + e^\theta)$  $g(\mu) = \log\{\mu/(1-\mu)\}$    $(-\infty, +\infty)$
Gamma          $-\log(-\theta)$      $g(\mu) = -1/\mu$                 $(-\infty, 0)$
Theorem 6.2 If the conditional distribution of Y given X = x belongs to the canonical exponential family (6.1), then it holds that
(i) the canonical link function is strictly increasing;
(ii) the likelihood $\ell(\theta)$ using the canonical parameter is strictly concave.

Proof: Note that
$$\frac{dg(\mu)}{d\mu} = \frac{1}{b''(\theta)}.$$
Since $\mathrm{var}(Y) = a(\phi) b''(\theta) > 0$, we have $b''(\theta) > 0$. Thus $g(\mu)$ is strictly increasing. As to (ii), note that
$$\ell''(\theta) = -\frac{b''(\theta)}{a(\phi)} < 0.$$
6.3 Maximum Likelihood Methods

With covariates, the canonical parameters are tied to the linear predictor through
$$g(\mu_i) = x_i^T \beta.$$
Hence $(\theta_1, \ldots, \theta_n)$ lie on only a p-dimensional manifold, called $\Theta$. To accommodate heteroscedasticity, we will assume that
$$a_i(\phi) = \frac{\phi}{w_i}.$$
Let $\ell_n(\theta, \phi; y, x)$ denote the log-likelihood of the sample, and write
$$\ell(\beta) = \ell_n(\theta, \phi; y, x)$$
for the model $\theta_i = (b')^{-1}(g^{-1}(x_i^T \beta)) = h(x_i^T \beta)$. Then the maximum likelihood estimate solves the likelihood equations
$$\frac{\partial \ell(\beta)}{\partial \beta} = \ell'(\hat{\beta}) = 0.$$
By Taylor's expansion around the true parameter $\beta_0$,
$$0 = \ell'(\hat{\beta}) \approx \ell'(\beta_0) + \ell''(\beta_0)(\hat{\beta} - \beta_0),$$
so that
$$\hat{\beta} - \beta_0 \approx -\{\ell''(\beta_0)\}^{-1} \ell'(\beta_0), \tag{6.2}$$
with
$$\ell'(\beta_0) = \sum_{i=1}^{n} \frac{y_i - \mu_i}{a_i(\phi)}\, h'(x_i^T \beta_0)\, x_i.$$
Conditioning on the covariates,
$$\mathrm{var}\{\ell'(\beta_0) | x\} = \sum_{i=1}^{n} \{h'(x_i^T \beta_0)\}^2 V(\mu_i)\, x_i x_i^T.$$
This can be estimated by the substitution method; denote the resulting estimator by
$$\hat{\Sigma} = \widehat{\mathrm{var}}\{\ell'(\beta_0) | x\}.$$
Combining this with (6.2), the estimated covariance matrix of $\hat{\beta}$ is
$$\{-\ell''(\hat{\beta})\}^{-1} \hat{\Sigma} \{-\ell''(\hat{\beta})\}^{-1},$$
which is the corresponding sandwich formula, also referred to as a robust estimator of the variance-covariance matrix. Hence, one obtains the estimated covariance matrix, standard errors and correlations.
From Table 6.3, it seems that only the variable Start appears significant. Writing
$$\text{logit}\{p(x_1, x_2, x_3)\} = \log \frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3,$$
it follows that each unit increase in a covariate multiplies the odds $p/(1-p)$ by the factor $\exp(\beta_j)$.
Example 6.10 (Wave-soldering data) In this example, we apply the Poisson log-linear regression model to the wave-soldering data, which have been analyzed in Chambers and Hastie (1993) using several models. In 1988, an experiment was designed and implemented at one of AT&T's factories to investigate alternatives in the "wave-soldering" procedure for mounting electronic components on printed circuit boards, varying several factors relevant to the engineering of wave-soldering. The response, measured by eye, is a count of solder skips. There are many ways to parameterize the problem. As in the ANOVA, we report estimable contrasts and their estimated standard errors in Table 6.4.
TABLE 6.4. Estimated Contrasts and Standard Errors for the Wave-Soldering Data

              Value     Std. Error   t value
(Intercept)   0.7357    0.0295       24.9548
Opening.L    -1.3389    0.0379      -35.3286
Opening.Q     0.5619    0.0420       13.3778
Solder       -0.7776    0.0273      -28.4742
Mask1         0.2141    0.0377        5.6761
Mask2         0.3294    0.0165       19.9285
Mask3         0.3307    0.0089       36.9702
PadType1      0.0550    0.0332        1.6570
PadType2      0.1058    0.0173        6.1034
PadType3     -0.1049    0.0152       -6.9157
PadType4     -0.1229    0.0136       -9.0317
PadType5      0.0131    0.0089        1.4780
PadType6     -0.0466    0.0088       -5.2752
PadType7     -0.0076    0.0070       -1.0871
PadType8     -0.1355    0.0106      -12.7861
PadType9     -0.0283    0.0066       -4.3096
Panel1        0.1668    0.0210        7.9305
Panel2        0.0292    0.0117        2.4876
6.3.2 Computation

The computational method for generalized linear models relies on iteratively reweighted least squares (IRLS). Suppose that we want to solve the likelihood equation
$$\ell'(\hat{\beta}) = 0.$$
Given an initial value $\beta_0$, by Taylor's expansion,
$$0 = \ell'(\hat{\beta}) \approx \ell'(\beta_0) + \ell''(\beta_0)(\hat{\beta} - \beta_0),$$
hence
$$\hat{\beta} \approx \beta_0 - \{\ell''(\beta_0)\}^{-1} \ell'(\beta_0).$$
Recall that the log-likelihood is
$$\ell_n(\beta) = \sum_{i=1}^{n} \frac{w_i \{y_i \theta_i - b(\theta_i)\}}{\phi} + \sum_{i=1}^{n} c(y_i, \phi),$$
with
$$\theta_i = (b')^{-1}(g^{-1}(x_i^T \beta)) = h(x_i^T \beta).$$
Differentiating,
$$\frac{\partial \ell_n(\beta)}{\partial \beta_j} = \sum_i \frac{y_i - \mu_i}{a_i(\phi)}\, h'(x_i^T \beta)\, x_{ij}.$$
In vector notation,
$$\frac{\partial \ell_n(\beta)}{\partial \beta} = \sum_i \frac{y_i - \mu_i}{a_i(\phi)}\, h'(x_i^T \beta)\, x_i = X^T W (y - \mu),$$
where W is a diagonal weight matrix, and the Fisher information matrix is
$$I_n = X^T W X.$$
Hence, the Fisher scoring method updates $\beta_0$ to
$$\beta_1 = (X^T W X)^{-1} X^T W z,$$
where $z = X\beta_0 + W^{-1}(y - \mu)$ is the adjusted response, computed at $\beta_0$. The algorithm is:
Step 1. Compute $\mu$, W and z at the current value $\beta_0$.
Step 2. Obtain $\beta_1 = (X^T W X)^{-1} X^T W z$ by weighted least squares.
Step 3. Regarding $\beta_1$ as the initial value, iterate between Step 1 and Step 2 until convergence.
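The steps above can be sketched for the Poisson log-linear model, where the canonical link makes the weights $W = \mathrm{diag}(\mu_i)$ and $z = \eta + (y - \mu)/\mu$ (simulated data; a sketch of the algorithm, not any package's implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, 0.3, -0.2])           # hypothetical coefficients
y = rng.poisson(np.exp(X @ beta_true))

# IRLS for the Poisson log-linear model (canonical link, a(phi) = 1)
beta = np.zeros(p)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)
    W = mu                                  # weights 1 / {g'(mu)^2 V(mu)} = mu
    z = eta + (y - mu) / mu                 # adjusted response
    WX = X * W[:, None]
    beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)   # weighted least squares step
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

score = X.T @ (y - np.exp(X @ beta))        # likelihood equations: X'(y - mu) = 0
print(np.round(beta, 2), np.max(np.abs(score)) < 1e-6)
```

At convergence the score vector vanishes, confirming that the fixed point of IRLS solves the likelihood equations.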
6.4 Deviance and Residuals

Write the log-likelihood as
$$\ell_n(\theta, \phi; y, x) = \sum_{i=1}^{n} \frac{w_i \{y_i \theta_i - b(\theta_i)\}}{\phi} + \sum_{i=1}^{n} c(y_i, \phi).$$
It turns out that the unrestricted (saturated) maximizer is $\tilde{\theta}_i = (b')^{-1}(y_i)$. The difference between the saturated and fitted log-likelihoods measures the lack-of-fit due to the model restriction, called the deviance and denoted by $D(y, \hat{\mu})$:
$$2 \sum_{i=1}^{n} w_i \{ y_i(\tilde{\theta}_i - \hat{\theta}_i) - b(\tilde{\theta}_i) + b(\hat{\theta}_i) \}/\phi = D(y, \hat{\mu})/\phi.$$
For the normal distribution, $\theta = \mu$ and $b(\theta) = \theta^2/2$, so
$$D(y, \hat{\mu}) = 2 \sum_{i=1}^{n} w_i \Big\{ y_i (y_i - \hat{y}_i) - \frac{y_i^2}{2} + \frac{\hat{y}_i^2}{2} \Big\} = \sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2.$$
Distribution   Deviance (with $w_i = 1$)
Normal         $\sum_i (y_i - \hat{\mu}_i)^2$
Binomial       $2 \sum_i \{ y_i \log(y_i/\hat{\mu}_i) + (n_i - y_i) \log[(n_i - y_i)/(n_i - \hat{\mu}_i)] \}$
Poisson        $2 \sum_i \{ y_i \log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i) \}$
Gamma          $2 \sum_i \{ -\log(y_i/\hat{\mu}_i) + (y_i - \hat{\mu}_i)/\hat{\mu}_i \}$
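The Poisson entry of the table can be verified against the definition of the deviance as twice the log-likelihood gap from the saturated model (an intercept-only fit on simulated counts):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(6)
y = rng.poisson(3.0, size=100)
mu_hat = np.full_like(y, y.mean(), dtype=float)   # fitted values, intercept-only model

# table formula for the Poisson deviance (a term with y_i = 0 contributes 2*mu_hat_i)
with np.errstate(divide="ignore", invalid="ignore"):
    term = np.where(y > 0, y * np.log(y / mu_hat), 0.0)
D_formula = 2 * np.sum(term - (y - mu_hat))

# deviance as twice the log-likelihood gap from the saturated model
ll_sat = poisson.logpmf(y, np.maximum(y, 1e-12)).sum()
ll_fit = poisson.logpmf(y, mu_hat).sum()
D_like = 2 * (ll_sat - ll_fit)
print(abs(D_formula - D_like) < 1e-6)
```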
The analysis of deviance compares nested models by referring the deviance reduction to a $\chi^2$ distribution with degrees of freedom $\dim(\Theta) - \dim(\Theta_0)$.

Example 6.9 (Continued) Table 6.6 shows the deviance decomposition. From this table, one can see that age and number are not significant individually. For testing the significance of number and age simultaneously, we take the deviance reduction 3.5357 + 3.1565 = 6.6922 with 2 degrees of freedom.
In general, the deviance can be decomposed into individual contributions:
$$D(y, \hat{\mu}) = \sum_{i=1}^{n} w_i d_i^2.$$
Define the deviance residual as
$$r_{D,i} = d_i \, \mathrm{sgn}(y_i - \hat{y}_i).$$
Let
$$\hat{\mu}_i = g^{-1}(x_i^T \hat{\beta}).$$
Then the Pearson residual is defined as
$$\hat{r}_{P,i} = \frac{Y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}.$$
The Pearson residual is the same as the ordinary residual for normally distributed errors, and is skewed for non-normally distributed responses. The Pearson goodness-of-fit statistic is
$$X^2 = \sum_{i=1}^{n} \hat{r}_{P,i}^2.$$
The residuals defined above play a role similar to that of the usual residuals.
The Anscombe residual is based on the transformation
$$A(z) = \int_{a}^{z} \frac{d\mu}{V^{1/3}(\mu)},$$
where a is the lower limit of the range of $\mu$. The residual is
$$r_A = \frac{A(y) - A(\hat{\mu})}{V(\hat{\mu})^{1/6}}.$$
For the Poisson distribution, $V(\mu) = \mu$, and
$$A(z) = \int_{0}^{z} \mu^{-1/3} \, d\mu = \frac{3}{2} z^{2/3}.$$
Hence
$$r_A = \frac{3}{2} \cdot \frac{y^{2/3} - \hat{\mu}^{2/3}}{\hat{\mu}^{1/6}},$$
whereas the Pearson residual is
$$r_P = \frac{Y - \hat{\mu}}{\hat{\mu}^{1/2}}.$$
The Anscombe residuals are based on $y_i^{2/3} - \mu_i^{2/3}$, which by itself does not stabilize the variance; the denominator $\hat{\mu}^{1/6}$ is needed. For large $\mu_i$,
$$y_i^{2/3} - \mu_i^{2/3} = \{\mu_i + (y_i - \mu_i)\}^{2/3} - \mu_i^{2/3} \approx \frac{2}{3} \mu_i^{-1/3} (y_i - \mu_i) \approx N\Big(0, \frac{4}{9} \mu_i^{1/3}\Big),$$
so that dividing by $\mu_i^{1/6}$ (and rescaling by $3/2$) yields approximately standard normal residuals.
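A quick simulation of the stabilization argument above, checking that the scaled Anscombe residuals have standard deviation near 1 across several Poisson means (the chosen means are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
stds = []
for lam in (5.0, 25.0, 100.0):
    y = rng.poisson(lam, size=200_000)
    # Anscombe residual for Poisson: (3/2) (y^(2/3) - mu^(2/3)) / mu^(1/6)
    r_A = 1.5 * (y ** (2 / 3) - lam ** (2 / 3)) / lam ** (1 / 6)
    stds.append(float(r_A.std()))
print([round(s, 3) for s in stds])
```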
The S-plus output for the logistic regression fit (Table 6.3) is:

              Value         Std. Error    t value
(Intercept)  -2.03693225    1.44918287    -1.405573
Age           0.01093048    0.00644419     1.696175
Start        -0.20651000    0.06768504    -3.051043
Number        0.41060098    0.22478659     1.826626