2 Multiple Linear Regression
Hector Lemus
Spring 2016
Example
Independent variables:
1. BMI
2. Age
3. Smoking history:
0 = Nonsmoker
1 = Current or Previous Smoker
Hypothetical Example
Age      Drug Group
         Active    Placebo
<60      -10       -2
≥60       -1       -2
Hypothetical Example
Independent variables:
1. Age
2. Drug group (Active/Placebo)
3. Interaction term (to be discussed)
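To see why an interaction term might be needed in this example, compare the drug effect within each age group. The sketch below assumes the four table entries are mean responses per cell (the slides do not say what they measure) and that the second age row means 60 and over; the dictionary keys and the helper name `drug_effect` are illustrative only.

```python
# Cell values from the hypothetical example (assumed to be mean responses).
cells = {
    ("<60", "active"): -10,
    ("<60", "placebo"): -2,
    (">=60", "active"): -1,
    (">=60", "placebo"): -2,
}

def drug_effect(age_group):
    """Active minus placebo within one age group."""
    return cells[(age_group, "active")] - cells[(age_group, "placebo")]

effect_young = drug_effect("<60")   # -10 - (-2)
effect_old = drug_effect(">=60")    # -1 - (-2)

# If the drug effect were the same at every age, this difference would be 0.
# A non-zero difference is what an interaction term in the model captures.
interaction = effect_old - effect_young

print(effect_young, effect_old, interaction)
```

Under these assumptions the drug effect differs between the two age groups, which is the situation an Age by Drug group interaction term is meant to describe.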
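Model:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + E = \beta_0 + \sum_{i=1}^{k} \beta_i X_i + E$$
Regression function (the model without the error term E):
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$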
4. The variance of Y is the same for any fixed combination of the Xs.
$$\mu_{Y \mid X_1, X_2, \ldots, X_k} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$
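Let the predicted value be
$$\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \cdots + \hat\beta_k X_{ki}$$
Find the estimated parameters $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ that minimize
$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \cdots - \hat\beta_k X_{ki})^2$$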
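The least-squares criterion can be sketched in a few lines of plain Python. This is a minimal illustration, not the method SAS uses internally: it solves the normal equations by Gauss-Jordan elimination, and the data, the function names (`fit_least_squares`, `predict`, `sum_squared_error`) and the perturbation size are all made up for the example.

```python
def fit_least_squares(X, y):
    """Return [b0, b1, ..., bk] minimizing sum((y_i - yhat_i)^2).

    X is a list of rows [x_1i, ..., x_ki]; the intercept column is added here.
    Solves the normal equations (A'A) b = A'y by Gauss-Jordan elimination.
    """
    A = [[1.0] + [float(v) for v in row] for row in X]
    p = len(A[0])
    # Build the augmented matrix [A'A | A'y].
    M = []
    for i in range(p):
        row = [sum(r[i] * r[j] for r in A) for j in range(p)]
        row.append(sum(r[i] * yi for r, yi in zip(A, y)))
        M.append(row)
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry into place.
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(p):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] for i in range(p)]

def predict(b, row):
    """Predicted value b0 + b1*x1 + ... + bk*xk for one observation."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], row))

def sum_squared_error(X, y, b):
    """The quantity being minimized: sum of squared residuals."""
    return sum((yi - predict(b, row)) ** 2 for row, yi in zip(X, y))

# Hypothetical data with two predictors (not from the slides).
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [9.1, 7.9, 19.2, 17.8, 29.0, 28.1]

b_hat = fit_least_squares(X, y)
best = sum_squared_error(X, y, b_hat)
nudged = [b_hat[0] + 0.1] + b_hat[1:]   # move the intercept slightly
print(b_hat, best, sum_squared_error(X, y, nudged))
```

Any change to the fitted coefficients, such as the nudged intercept here, gives a sum of squared residuals at least as large as the fitted one.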
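$$SSY = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
$$SS_{Reg} = SSY - SSE = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad SS_{Res} = SSE$$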
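Source       df           SS           MS                           F
Regression   k            SSY - SSE    MSReg = (SSY - SSE) / k      F = MSReg / MSRes
Residual     n - k - 1    SSE          MSRes = SSE / (n - k - 1)
Total        n - 1        SSY
where $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ and $SSY = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$.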
Coefficient of Determination
Proportion of variability of Y that can be explained by the model
$$R^2 = \frac{SSY - SSE}{SSY}, \qquad 0 \le R^2 \le 1$$
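The ANOVA quantities and $R^2$ follow mechanically from SSY, SSE, n and k. The sketch below shows that arithmetic, using the totals reported in the SAS output for the SBP example in these slides; the function name `anova_summary` and its dictionary keys are illustrative choices, not part of any library.

```python
def anova_summary(ssy, sse, n, k):
    """ANOVA quantities for a model with k predictors and n observations."""
    ss_reg = ssy - sse
    ms_reg = ss_reg / k
    ms_res = sse / (n - k - 1)
    return {
        "SSReg": ss_reg,
        "df_reg": k,
        "df_res": n - k - 1,
        "df_total": n - 1,
        "MSReg": ms_reg,
        "MSRes": ms_res,
        "F": ms_reg / ms_res,
        "R2": ss_reg / ssy,
    }

# Corrected Total SS and Error SS from the SBP example (n = 32, k = 3).
summary = anova_summary(ssy=6425.96875, sse=1536.14305, n=32, k=3)
print(round(summary["F"], 2), round(summary["R2"], 4))
```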
Taken together, does the set of BMI, Age and Smoking History explain a
significant amount of the variability of SBP?
Given the other variables in the model, does the addition of one variable explain a significant amount of the variability of Y?
Evaluate the relationship between one independent variable and Y after controlling
(adjusting) for the other variables in the model.
Given that Age and Smoking History are in the model, what is the relationship
between BMI and SBP?
Given a set of variables in the model, does the addition of another set of variables explain a significant amount of the variability of Y?
For example:
Suppose we test the association between BMI and SBP after adjusting for
Age and Smoking History
$H_0: \beta_1 = 0$
The concepts of nested (full and reduced) models will apply to all of the
tests that we discuss.
Full model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + E$
Reduced model: $Y = \beta_0 + E$
$$F = \frac{MS_{Reg}}{MS_{Res}}$$
Reject $H_0$ if $F > F_{k,\,n-k-1,\,1-\alpha}$.
$F_{k,\,n-k-1,\,1-\alpha}$: the $100(1-\alpha)$th percentile of the F distribution with $k$ and $n-k-1$ degrees of freedom, where $\alpha$ is our chosen level of significance.
SBP Example
Determine whether BMI, age and smoking history taken together account
for a significant amount of the variability of SBP.
Y: SBP, X1: BMI, X2: Age, X3: Smoking History
n = 32 subjects, k = 3
Full model:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + E$
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
Reduced model:
$Y = \beta_0 + E$
SAS Output
The REG Procedure
Model: MODEL1
Dependent Variable: SBP Systolic Blood Pressure (mmHg)

Number of Observations Read    32
Number of Observations Used    32

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        4889.82570     1629.94190      29.71    <.0001
Error              28        1536.14305       54.86225
Corrected Total    31        6425.96875

Root MSE            7.40691    R-Square    0.7609
Dependent Mean    144.53125    Adj R-Sq    0.7353
Coeff Var           5.12478

Critical values for comparison:
At $\alpha = 0.05$, $F_{3,\,28,\,0.95} = 2.95$
At $\alpha = 0.01$, $F_{3,\,28,\,0.99} = 4.57$
At $\alpha = 0.001$, $F_{3,\,28,\,0.999} = 7.19$
Reject H0 and conclude that taken together the 3 variables account for a
significant amount of the variability of SBP.
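The decision above is a comparison of the observed F with each tabulated critical value. A small sketch of that comparison, using only the numbers quoted on the slide (the variable names are illustrative):

```python
# Critical values of F with 3 and 28 df quoted on the slide, keyed by alpha.
critical_values = {0.05: 2.95, 0.01: 4.57, 0.001: 7.19}
f_observed = 29.71  # F Value from the SAS ANOVA table

for alpha, f_crit in sorted(critical_values.items(), reverse=True):
    decision = "reject H0" if f_observed > f_crit else "fail to reject H0"
    print(f"alpha = {alpha}: F = {f_observed} vs critical value {f_crit} -> {decision}")

# Significance levels at which H0 is rejected.
rejected_at = [alpha for alpha, f_crit in critical_values.items() if f_observed > f_crit]
```

Because the observed F exceeds even the 0.001-level critical value, the p-value is below 0.001, which agrees with the `<.0001` reported by SAS.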
Source                      df    SS
Regression   X1              1    3537.95
             X2 | X1         1     582.65
             X3 | X1, X2     1     769.23
Residual                    28    1536.14
SS(X1)
Since, technically, X2 and X3 are not in the model, pool their terms with the residual.
SSRes = 1536.14 + 582.65 + 769.23 = 2888.02
dfRes = 28 + 1 + 1 = 30
Model: $Y = \beta_0 + \beta_1 X_1 + E$
$$F = \frac{3537.95 / 1}{2888.02 / 30} = \frac{3537.95}{96.27} = 36.75$$
SS(X2|X1)
The extra sum of squares explained by adding Age to the model given BMI already in the model.
Pooled error term:
SSRes = 1536.14 + 769.23 = 2305.37
dfRes = 28 + 1 = 29
Full: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + E$
Reduced: $Y = \beta_0 + \beta_1 X_1 + E$
$$F = \frac{582.65 / 1}{2305.37 / 29} = \frac{582.65}{79.50} = 7.33$$
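SS(X3|X1, X2)
Full: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + E$
Reduced: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + E$
$H_0: \beta_3 = 0$ [Smoking history is not associated with SBP after adjusting for BMI and Age.]
$$F = \frac{769.23 / 1}{1536.14 / 28} = \frac{769.23}{54.86} = 14.02$$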
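The three sequential tests share one pattern: each term's sum of squares is divided by an error mean square that pools the full-model residual with every term entered later. The sketch below reproduces that arithmetic from the sums of squares in the table; the function name `sequential_f_tests` is hypothetical, and each term is assumed to carry 1 degree of freedom, as in this example.

```python
def sequential_f_tests(seq_ss, ss_res_full, df_res_full):
    """F statistic for each term in order of entry.

    seq_ss lists SS(X1), SS(X2|X1), ... (1 df each). Terms entered after
    the one being tested are pooled into the error term.
    """
    results = []
    for j, ss in enumerate(seq_ss):
        later_terms = seq_ss[j + 1:]
        pooled_ss = ss_res_full + sum(later_terms)
        pooled_df = df_res_full + len(later_terms)
        results.append(ss / (pooled_ss / pooled_df))
    return results

# SS(X1), SS(X2|X1), SS(X3|X1,X2) and the residual from the SBP example.
f_values = sequential_f_tests([3537.95, 582.65, 769.23], ss_res_full=1536.14, df_res_full=28)
print([round(f, 2) for f in f_values])
```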
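Full model:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \beta^* X^* + E$
H0: The addition of X* to the model does not explain a significant amount of the variability of Y in the presence of $X_1, X_2, \ldots, X_p$.
$H_0: \beta^* = 0$
Reduced model:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + E$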
To construct the partial F test, you need the extra sum of squares for X*.
Denote:
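$$SS(X^* \mid X_1, X_2, \ldots, X_p) = RegSS(X_1, X_2, \ldots, X_p, X^*) - RegSS(X_1, X_2, \ldots, X_p) = RegSS(\text{Full}) - RegSS(\text{Reduced})$$
So,
$$F(X^* \mid X_1, \ldots, X_p) = \frac{SS(X^* \mid X_1, \ldots, X_p)}{MS_{Res}(\text{Full})}, \qquad MS_{Res}(\text{Full}) = \frac{SS_{Res}(\text{Full})}{n - p - 2}$$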
Example 1
Test whether smoking history is related to SBP after controlling for Age and
BMI.
Full model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + E$
$H_0: \beta_3 = 0$
Example 2
Test the relationship of BMI to SBP controlling for Age and Smoking history.
$H_0: \beta_1 = 0$
Full: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + E$
Reduced: $Y = \beta_0 + \beta_2 X_2 + \beta_3 X_3 + E$
We know that SS(X1, X2, X3) = 4889.83 from the SAS Output.
However, we would have to find SS(X2, X3) by fitting a model with only X2
and X3 in it.
Example 2 (cont.)
SS(X1|X2, X3) = 4889.83 - 4689.69 = 200.14
This is the marginal sum of squares; SAS can provide this information.
F(X1|X2, X3) = 200.14/54.86 = 3.65
$F_{1,\,28,\,0.90} = 2.89$
$F_{1,\,28,\,0.95} = 4.20$
Since 2.89 < 3.65 < 4.20, the p-value lies between 0.05 and 0.10.
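The partial F calculation for Example 2 can be written out as a short sketch. The function name `partial_f` and the p-value bracketing logic are illustrative; the regression sums of squares, the full-model MSRes and the two critical values are the ones quoted in the slides.

```python
def partial_f(reg_ss_full, reg_ss_reduced, ms_res_full, df_added=1):
    """Extra sum of squares and partial F for the variable(s) added to the reduced model."""
    extra_ss = reg_ss_full - reg_ss_reduced
    return extra_ss, (extra_ss / df_added) / ms_res_full

# RegSS(X1, X2, X3), RegSS(X2, X3) and MSRes(Full) from the SBP example.
extra_ss, f_stat = partial_f(4889.83, 4689.69, 54.86)

# Bracket the p-value with the F(1, 28) percentiles given on the slide.
f_90, f_95 = 2.89, 4.20
if f_stat > f_95:
    conclusion = "p < 0.05"
elif f_stat > f_90:
    conclusion = "0.05 < p < 0.10"
else:
    conclusion = "p > 0.10"

print(round(extra_ss, 2), round(f_stat, 2), conclusion)
```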
A T-test Equivalent
An equivalent test to the Partial F test.
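Full model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \beta^* X^* + E$
Test: $H_0: \beta^* = 0$
Could use $F(X^* \mid X_1, X_2, \ldots, X_p)$ or, equivalently,
$$T = \frac{\hat\beta^*}{s_{\hat\beta^*}}$$
where $\hat\beta^*$ is the estimated regression parameter and $s_{\hat\beta^*}$ is its estimated standard error.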
Example 2 (again)
Relationship of BMI to SBP adjusting for Age and Smoking History.
Parameter Estimates
Variable     Label              DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    Intercept           1              45.10319          10.76488       4.19      0.0003
BMI          Body Mass Index     1               1.22225           0.63993       1.91      0.0664
AGE          Age (years)         1               1.21271           0.32382       3.75      0.0008
SMK          Smoking History     1               9.94557           2.65606       3.74      0.0008

$$T = \frac{1.2223}{0.6399} = 1.91, \qquad p\text{-value} = 0.066$$
$F = T^2 = (1.91)^2 = 3.65$
Type I SS               Type II SS
SS(X1)                  SS(X1 | X2, X3)
SS(X2 | X1)             SS(X2 | X1, X3)
SS(X3 | X1, X2)         SS(X3 | X1, X2)
With the exception of the last test, these tests are not equivalent.
Parameter Estimates
Variable     Label              DF    Estimate    Std Error    t Value    Pr > |t|    Type I SS     Type II SS
Intercept    Intercept           1    45.10319     10.76488       4.19      0.0003    668457         963.09739
BMI          Body Mass Index     1     1.22225      0.63993       1.91      0.0664    3537.94574     200.14147
AGE          Age (years)         1     1.21271      0.32382       3.75      0.0008    582.64651      769.45920
SMK          Smoking History     1     9.94557      2.65606       3.74      0.0008    769.23345      769.23345
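The equivalence of the t-test and the partial F test can be checked numerically for every row: the squared t statistic should match the Type II SS divided by the full-model MSRes. The sketch below uses the estimates, standard errors and Type II SS from the SAS output, with MSRes = 54.86225 taken from the ANOVA table; the dictionary layout is just a convenient assumption.

```python
ms_res = 54.86225  # Mean Square for Error from the SAS ANOVA table

# (estimate, standard error, Type II SS) for each term in the SAS output.
rows = {
    "Intercept": (45.10319, 10.76488, 963.09739),
    "BMI": (1.22225, 0.63993, 200.14147),
    "AGE": (1.21271, 0.32382, 769.45920),
    "SMK": (9.94557, 2.65606, 769.23345),
}

comparison = {}
for name, (estimate, std_error, type2_ss) in rows.items():
    t_value = estimate / std_error
    f_partial = type2_ss / ms_res      # partial F with 1 numerator df
    comparison[name] = (t_value ** 2, f_partial)
    print(f"{name:9s} t = {t_value:.2f}  t^2 = {t_value ** 2:.2f}  Type II SS / MSRes = {f_partial:.2f}")
```

The two columns agree up to rounding in the reported estimates, which is why the p-value from the t-test for BMI (0.066) can be read as the p-value of the partial F test.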
MLR Table
Multiple Linear Regression of Systolic Blood Pressure versus selected characteristics (n = 32)
Characteristic         Estimated Coefficient    95% CI        p-value
BMI (kg/m²)            1.2                      -0.1, 2.5     0.066
Age (5 yr interval)    6.1                      2.7, 9.4      <0.001
Smoking History        9.9                      4.5, 15.4     <0.001
$R^2 = 0.76$
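The entries of this table can be rebuilt from the SAS parameter estimates. The sketch below assumes the interval column holds 95% confidence intervals computed as estimate ± t × SE with 28 residual degrees of freedom; the multiplier 2.048 (the 0.975 quantile of a t distribution with 28 df) is not stated in the slides and is supplied here as an assumption. Age is rescaled to a 5-year step, as the row label indicates. The function name `summarize` is hypothetical.

```python
T_CRIT = 2.048  # assumed: 0.975 quantile of t with 28 df (not given in the slides)

# (estimate, standard error, units per reported step) from the SAS output.
terms = {
    "BMI (kg/m2)": (1.22225, 0.63993, 1),
    "Age (5 yr interval)": (1.21271, 0.32382, 5),
    "Smoking History": (9.94557, 2.65606, 1),
}

def summarize(estimate, std_error, scale=1, t_crit=T_CRIT):
    """Return (coefficient, lower, upper), rescaled and rounded to 1 decimal."""
    coef = scale * estimate
    half_width = t_crit * scale * std_error
    return round(coef, 1), round(coef - half_width, 1), round(coef + half_width, 1)

table = {name: summarize(*values) for name, values in terms.items()}
for name, (coef, lower, upper) in table.items():
    print(f"{name:20s} {coef:5.1f}   {lower}, {upper}")
```

Under these assumptions the computed values match the table, which supports reading the interval column as 95% confidence limits.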
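Full model:
$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta^*_{p+1} X^*_{p+1} + \cdots + \beta^*_k X^*_k + E$
H0: The addition of $X^*_{p+1}, \ldots, X^*_k$ to the model does not explain a significant amount of the variability of Y in the presence of $X_1, X_2, \ldots, X_p$.
H0: The set of $X^*_{p+1}, \ldots, X^*_k$ is not significantly related to Y controlling for $X_1, X_2, \ldots, X_p$.
$H_0: \beta^*_{p+1} = \cdots = \beta^*_k = 0$
Reduced model:
$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + E$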
Need the extra sum of squares from adding $X^*_{p+1}, \ldots, X^*_k$ to the model.
Denote:
$$SS(X^*_{p+1}, \ldots, X^*_k \mid X_1, X_2, \ldots, X_p) = RegSS(\text{Full}) - RegSS(\text{Reduced})$$
So,
$$F(X^*_{p+1}, \ldots, X^*_k \mid X_1, \ldots, X_p) = \frac{SS(X^*_{p+1}, \ldots, X^*_k \mid X_1, \ldots, X_p) / (k - p)}{MS_{Res}(\text{Full})}$$
Reject $H_0$ if $F(X^*_{p+1}, \ldots, X^*_k \mid X_1, X_2, \ldots, X_p) > F_{k-p,\,n-k-1,\,1-\alpha}$.
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + E$
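Source         SS         df    MS
Reduced model (3 predictors)
Regression     4889.83     3    1629.94
Residual       1536.14    28      54.86
Full model (6 predictors)
Regression     5092.83     6     848.80
Residual       1333.14    25      53.33

$$F = \frac{(5092.83 - 4889.83) / 3}{53.33} = \frac{203.00 / 3}{53.33} = 1.27$$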
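The same calculation can be driven directly from the two ANOVA tables. In the sketch below the number of added variables is derived from the drop in residual degrees of freedom; the dictionary keys and the function name `multiple_partial_f` are illustrative, and the model with residual df 28 is taken to be the reduced one.

```python
# Regression SS, residual SS and residual df from the two ANOVA tables.
reduced = {"reg_ss": 4889.83, "res_ss": 1536.14, "res_df": 28}
full = {"reg_ss": 5092.83, "res_ss": 1333.14, "res_df": 25}

def multiple_partial_f(full, reduced):
    """Return (df_added, extra_ss, F) for the variables added to the reduced model."""
    df_added = reduced["res_df"] - full["res_df"]       # k - p
    extra_ss = full["reg_ss"] - reduced["reg_ss"]       # RegSS(Full) - RegSS(Reduced)
    ms_res_full = full["res_ss"] / full["res_df"]
    return df_added, extra_ss, (extra_ss / df_added) / ms_res_full

df_added, extra_ss, f_stat = multiple_partial_f(full, reduced)
print(df_added, round(extra_ss, 2), round(f_stat, 2))

# Both models must share the same total sum of squares.
total_reduced = reduced["reg_ss"] + reduced["res_ss"]
total_full = full["reg_ss"] + full["res_ss"]
```

The observed F would then be compared with the $F_{k-p,\,n-k-1,\,1-\alpha}$ percentile, here with 3 and 25 degrees of freedom.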
Constructing Extra SS
Suppose we have:
SS(X1)
SS(X2|X1)
SS(X3|X1, X2)
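One thing that can be built from these three sequential sums of squares is the extra sum of squares for a group of variables, because sequential sums of squares add up. The sketch below is an illustration under that standard property, using the SBP values from earlier slides; the grouped test for X2 and X3 given X1 is chosen here as an example and is not a result stated in the slides.

```python
# Sequential sums of squares from the SBP example (1 df each).
seq_ss = {"X1": 3537.95, "X2|X1": 582.65, "X3|X1,X2": 769.23}
ms_res_full = 1536.14 / 28   # residual mean square of the 3-variable model

# SS(X2, X3 | X1) = SS(X2 | X1) + SS(X3 | X1, X2)
ss_x2_x3_given_x1 = seq_ss["X2|X1"] + seq_ss["X3|X1,X2"]

# Summing all three recovers the full regression sum of squares SS(X1, X2, X3).
reg_ss_full = sum(seq_ss.values())

# Multiple partial F for adding X2 and X3 together (2 numerator df).
f_x2_x3 = (ss_x2_x3_given_x1 / 2) / ms_res_full

print(round(ss_x2_x3_given_x1, 2), round(reg_ss_full, 2), round(f_x2_x3, 2))
```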