
ANOVA/ANCOVA Review

YIK LUN, KEI


2016/11/12

One-Way ANOVA (one categorical variable)


Reduced model: empty, because there is no relationship between Y and C
Full model: Y = β0 + β1·C

F-test: F = (SS(regression)/df_regression) / (SS(residual)/df_residual)

H0: β1 = 0 (there is no relationship between Y and C)
Ha: β1 ≠ 0 (there is some relationship between Y and C)
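To make the decomposition concrete, here is a minimal pure-Python sketch (mine, not from the original notes) that computes the one-way ANOVA F statistic from raw group data. Applied to the migraine pain data used in the three-level example below, it reproduces the F value R reports (≈ 11.906).

```python
# One-way ANOVA F statistic computed from first principles (illustrative sketch).
def one_way_anova_f(groups):
    all_y = [y for g in groups for y in g]
    n, k = len(all_y), len(groups)
    grand = sum(all_y) / n                      # grand mean
    means = [sum(g) / len(g) for g in groups]   # group means
    # SS(regression): between-group sum of squares, df = k - 1
    ss_reg = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    # SS(residual): within-group sum of squares, df = n - k
    ss_res = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ss_reg / (k - 1)) / (ss_res / (n - k))

# Migraine pain scores by drug, taken from the example below
A = [4, 5, 4, 3, 2, 4, 3, 4, 4]
B = [6, 8, 4, 5, 4, 6, 5, 8, 6]
C = [6, 7, 6, 6, 7, 5, 6, 5, 5]
print(round(one_way_anova_f([A, B, C]), 3))  # 11.906
```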
Example (two-level categorical variable)
nyc <- read.csv("http://www.stat.tamu.edu/~sheather/book/docs/datasets/nyc.csv",header=T)
attach(nyc);East<-factor(East)
Full <- lm(Price ~ East)
anova(Full)
## Analysis of Variance Table
## 
## Response: Price
##            Df  Sum Sq Mean Sq F value  Pr(>F)  
## East        1   502.3  502.31  5.9906 0.01542 *
## Residuals 166 13919.2   83.85                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is 0.01542, which is less than 0.05, we reject H0: β1 = 0. Therefore, there is a relationship between Price and East: the mean price when East = 0 differs from the mean price when East = 1.

Example (three-level categorical variable)


pain = c(4, 5, 4, 3, 2, 4, 3, 4, 4, 6, 8, 4, 5, 4, 6, 5, 8, 6, 6, 7, 6, 6, 7, 5, 6, 5, 5)
drug = c(rep("A",9), rep("B",9), rep("C",9));migraine = data.frame(pain,drug)
Full <- lm(pain ~ drug, data = migraine);anova(Full)

## Analysis of Variance Table
## 
## Response: pain
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## drug       2 28.222 14.1111  11.906 0.0002559 ***
## Residuals 24 28.444  1.1852                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

pairwise.t.test(pain, drug, p.adjust="bonferroni")


## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  pain and drug 
## 
##   A       B      
## B 0.00119 -      
## C 0.00068 1.00000
## 
## P value adjustment method: bonferroni

results = aov(pain ~ drug, data=migraine);TukeyHSD(results, conf.level = 0.95)


##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = pain ~ drug, data = migraine)
## 
## $drug
##          diff        lwr      upr     p adj
## B-A 2.1111111  0.8295028 3.392719 0.0011107
## C-A 2.2222222  0.9406139 3.503831 0.0006453
## C-B 0.1111111 -1.1704972 1.392719 0.9745173

print(model.tables(results, "means"), digits = 3)


## Tables of means
## Grand mean
##          
## 5.111111 
## 
##  drug 
## drug
##    A    B    C 
## 3.67 5.78 5.89

Since the p-value is 0.0003, which is less than 0.05, we reject H0: β1 = β2 = 0. Therefore, there is a relationship between pain and drug: at least one of the three population means differs from the others. In the pairwise t-test, the adjusted p-values for A-B and A-C are 0.00119 and 0.00068 respectively, so the population mean of A differs from that of B, and likewise from that of C. On the contrary, the p-value for B-C is 1, so we fail to reject the null that there is no difference in population means between B and C. The Tukey intervals tell the same story: with 95% family-wise confidence, the difference in population means between B and A lies in (0.83, 3.39), between C and A in (0.94, 3.50), and between C and B in (-1.17, 1.39); only the last interval contains 0.
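The Bonferroni adjustment that pairwise.t.test applies is easy to sketch: each raw p-value is multiplied by the number of comparisons m and capped at 1, which is why the B-C comparison reports exactly 1.00000. A minimal Python illustration (the raw p-values here are hypothetical, not the ones R computed):

```python
# Bonferroni adjustment: multiply each raw p-value by the number of
# comparisons m and cap at 1 (what p.adjust = "bonferroni" does in R).
def bonferroni(pvals):
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

# Hypothetical raw p-values for three pairwise comparisons:
raw = [0.0004, 0.0002, 0.6]
print(bonferroni(raw))  # approximately [0.0012, 0.0006, 1.0]
```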

Two-Way ANOVA (two categorical variables C1 and C2)


Reduced model: Y = β0 + β1·X
Full model: Y = β0 + β1·X + β2·C + β3·X·C

H0: β2 = β3 = 0
Ha: β2 ≠ 0 or β3 ≠ 0

partial F-test: F = ((RSS(reduced) − RSS(full)) / (df_reduced − df_full)) / (RSS(full) / df_full)

df_full = n − (p + 1) and df_reduced = n − (p − k + 1),
where p = number of predictors in the full model and k = number of predictors omitted.
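As a sanity check on the formula, this short Python sketch (mine, not the author's) computes the partial F statistic directly from the residual sums of squares and residual degrees of freedom that anova() reports. Plugging in the values from the births comparison below reproduces F ≈ 43.04.

```python
# Partial F-test for nested linear models, computed from the RSS and
# residual df of each fit (the quantities anova() prints in R).
def partial_f(rss_reduced, df_reduced, rss_full, df_full):
    extra_ss = rss_reduced - rss_full   # improvement from the added terms
    extra_df = df_reduced - df_full     # number of added parameters
    return (extra_ss / extra_df) / (rss_full / df_full)

# Values from the births example: reduced = pounds ~ smoke,
# full = pounds ~ smoke + premie + smoke*premie
print(round(partial_f(394.30, 198, 273.97, 196), 2))  # 43.04
```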
Example (two categorical variables)
births <- read.delim("/Users/air/Desktop/ncbirth.txt",header=T);attach(births)
Full <- lm(pounds ~ smoke + premie + smoke * premie,data = births)
Reduced <- lm(pounds ~ smoke,data = births)
anova(Reduced,Full)
## Analysis of Variance Table
## 
## Model 1: pounds ~ smoke
## Model 2: pounds ~ smoke + premie + smoke * premie
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    198 394.30                                  
## 2    196 273.97  2    120.33 43.043 3.189e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model <- aov(pounds ~ smoke + premie + smoke * premie,data = births)


print(model.tables(model, "means"), digits = 3)
## Tables of means
## Grand mean
##         
## 7.21615 
## 
##  smoke 
## smoke
##    0    1 
## 7.32 6.63 
## 
##  premie 
## premie
##    0    1 
## 7.54 5.33 
## 
##  smoke:premie 
##      premie
## smoke 0    1   
##     0 7.64 5.44
##     1 6.94 4.69

summary(Full)
## 
## Call:
## lm(formula = pounds ~ smoke + premie + smoke * premie, data = births)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.810 -0.756  0.174  0.744  2.364 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.63596    0.09785  78.040  < 2e-16 ***
## smoke        -0.69236    0.25590  -2.706  0.00742 ** 
## premie       -2.19476    0.25590  -8.577 2.94e-15 ***
## smoke:premie -0.05884    0.68618  -0.086  0.93175    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.182 on 196 degrees of freedom
## Multiple R-squared:  0.3249, Adjusted R-squared:  0.3146 
## F-statistic: 31.45 on 3 and 196 DF,  p-value: < 2.2e-16

interaction.plot(premie, smoke, pounds, type = "b", col = c(1:2),
                 main = "Interaction plot", xlab = "premie", ylab = "Mean of birthweight")

[Figure: "Interaction plot" — mean of birthweight (roughly 5.0 to 7.5) plotted against premie (0, 1), with separate lines for smoke = 0 and smoke = 1; the two lines are nearly parallel.]

Since the p-value for the partial F-test is very small, we reject H0: β2 = β3 = 0, i.e. that premie and the interaction term are jointly insignificant. In other words, the full model is preferred, and at least one category has a different population mean. From the table of means, the means for smoke = 0 vs. smoke = 1 and for premie = 0 vs. premie = 1 are clearly different, but the cell means differ little from what the additive effects alone would predict, so the interaction is not significant. This is confirmed by the regression summary of the full model, where smoke:premie has p = 0.93. The interaction plot shows the same thing: the means for smoke and premie do differ, but the two lines are roughly parallel and do not cross, again indicating that the interaction term is not significant.

The final model should therefore be pounds = 7.64 − 0.70·smoke − 2.20·premie.
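A quick way to see why the additive model suffices: its fitted values nearly reproduce the observed cell means. A small Python sketch of mine, using the coefficients reported in the summary above; the point is the close agreement with the cell-mean table (e.g. 4.73 fitted vs. 4.69 observed for smoking mothers of premature babies):

```python
# Fitted mean birthweight from the additive final model,
# pounds = 7.6372 - 0.7005*smoke - 2.2029*premie (coefficients from the summary above)
def predict_pounds(smoke, premie):
    return 7.6372 - 0.7005 * smoke - 2.2029 * premie

for smoke in (0, 1):
    for premie in (0, 1):
        print(smoke, premie, round(predict_pounds(smoke, premie), 2))
# Compare with the observed cell means 7.64, 5.44, 6.94, 4.69: the additive
# fit is close in every cell, consistent with the insignificant interaction.
```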


Final <- lm(pounds ~ smoke + premie,data = births)
summary(Final)
## 
## Call:
## lm(formula = pounds ~ smoke + premie, data = births)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8537 -0.7572  0.1728  0.7428  2.3628 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.6372     0.0966  79.058  < 2e-16 ***
## smoke        -0.7005     0.2368  -2.958  0.00348 ** 
## premie       -2.2029     0.2368  -9.301  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.179 on 197 degrees of freedom
## Multiple R-squared:  0.3249, Adjusted R-squared:  0.3181 
## F-statistic: 47.41 on 2 and 197 DF,  p-value: < 2.2e-16

anova(Reduced,Final)
## Analysis of Variance Table
## 
## Model 1: pounds ~ smoke
## Model 2: pounds ~ smoke + premie
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    198 394.30                                  
## 2    197 273.98  1    120.32 86.515 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANCOVA (one continuous variable and one categorical variable)


Example (two-level categorical variable)
Full <- lm(Price ~ Food + East + Food:East, data = nyc)
Reduced <- lm(Price ~ Food, data = nyc)
anova(Reduced,Full)
## Analysis of Variance Table
## 
## Model 1: Price ~ Food
## Model 2: Price ~ Food + East + Food:East
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    166 8751.2                           
## 2    164 8620.9  2    130.36 1.2399 0.2921

Since the p-value is 0.2921, which is greater than 0.05, we fail to reject H0: β2 = β3 = 0. Therefore, the reduced model is preferred over the full model: East and the Food:East interaction add nothing once Food is in the model.

Example (three-level categorical variable with a continuous covariate)


cracker <- read.table("/Users/air/Desktop/cracker.txt",header=TRUE)
attach(cracker);treat <- factor(treat)
Full <- lm(sales ~ treat + x + treat:x)
Reduced1 <- lm(sales ~ x)
Reduced2 <- lm(sales ~ treat + x)
anova(Reduced1,Reduced2,Full)

## Analysis of Variance Table
## 
## Model 1: sales ~ x
## Model 2: sales ~ treat + x
## Model 3: sales ~ treat + x + treat:x
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     13 455.72                                   
## 2     11  38.57  2    417.15 59.5536 6.457e-06 ***
## 3      9  31.52  2      7.05  1.0065    0.4032    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(Reduced2)
## 
## Call:
## lm(formula = sales ~ treat + x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4348 -1.2739 -0.3362  1.6710  2.4869 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.3534     2.5230   6.878 2.66e-05 ***
## treat2       -5.0754     1.2290  -4.130  0.00167 ** 
## treat3      -12.9768     1.2056 -10.764 3.53e-07 ***
## x             0.8986     0.1026   8.759 2.73e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.873 on 11 degrees of freedom
## Multiple R-squared:  0.9403, Adjusted R-squared:  0.9241 
## F-statistic: 57.78 on 3 and 11 DF,  p-value: 5.082e-07

According to the ANOVA table comparing the three nested models, model 2 is preferred to models 1 and 3: x and treat are significant, while the interaction is not.

Final Model: sales = 17.35 + 0.90·x − 5.08·treat2 − 12.98·treat3
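Because treat has three levels, R encodes it with two dummy variables (treat2 and treat3), with level 1 as the baseline. A small Python sketch of mine of the final fitted equation, using the coefficients from the summary above, makes the coding explicit:

```python
# Final ANCOVA fit: sales = 17.3534 + 0.8986*x - 5.0754*treat2 - 12.9768*treat3.
# treat is dummy-coded with level "1" as the baseline category.
COEF = {"1": 0.0, "2": -5.0754, "3": -12.9768}

def predict_sales(treat, x):
    return 17.3534 + 0.8986 * x + COEF[treat]

# Three stores with the same covariate value x = 10 under each treatment:
for t in ("1", "2", "3"):
    print(t, round(predict_sales(t, 10), 2))
```

At any fixed x, the fitted difference between two treatments is simply the difference of their dummy coefficients; e.g. treat 3 is predicted to sell 12.98 fewer units than treat 1 at the same x.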


results = aov(sales ~ treat + x);TukeyHSD(results, conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = sales ~ treat + x)
## 
## $treat
##      diff        lwr        upr     p adj
## 2-1  -2.2  -5.398655  0.9986549 0.1968712
## 3-1 -11.0 -14.198655 -7.8013451 0.0000042
## 3-2  -8.8 -11.998655 -5.6013451 0.0000358

pairwise.t.test(sales, treat, p.adjust="bonferroni")


## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  sales and treat 
## 
##   1     2    
## 2 1.000 -    
## 3 0.015 0.053
## 
## P value adjustment method: bonferroni

According to the pairwise t-test and TukeyHSD, the difference between treat 1 and treat 3, as well as the difference between treat 2 and treat 3, is significant. However, the difference between treat 1 and treat 2 is not significant, since its p-value is greater than 0.05.
