
Biostatistics SDS 328M Lab Section

Lab 11: Linear Modeling - ANOVA

Introduction
An ANOVA (analysis of variance) is used to analyze the effect of predictor(s) on a
numerical response variable and can determine whether the mean response differs across
levels of the predictor variable(s). A one-way, or single-factor, ANOVA tests a single
explanatory variable, while a multi-factor ANOVA tests the effect of multiple
explanatory variables.

Objectives of this lab

Conduct single-factor and multi-factor ANOVA analyses.


Verify the assumptions for ANOVA models.

New R Functions and Tools


Function/Tool   What it Does
aov()           Defines an ANOVA model object
TukeyHSD()      Runs post-hoc pairwise comparisons using the Tukey adjustment
drop1()         Provides correct sums of squares for a multi-factor ANOVA
plotmeans()     Contained in the gplots library; creates an interaction plot with standard error bars
legend()        Adds a legend to a plot

Dataset
This week we will examine a dataset from a survey which looked at factors contributing
to birth weight. The data contains information about the mother as well as the child's
weight, and is called lab11_birthwt.csv on Canvas.

Questions of Interest
In this lab, we will conduct ANOVA models to understand the relationships between
potential underlying factors of birth weight in infants:
Does the mean birth weight of infants vary across different races?
What about while also considering if the mother smokes or not?
Does the effect of race interact with the effect of smoking?

Copyright 2016 Dept. of Statistics and Data Sciences, UT Austin


1. Reading the Data Set into R


The first thing we will do with every data set is to determine what kind of variables we
are working with, as well as their names. We will do this using the head() function:
> head(birthwt)
  low age lwt  race smoke ptl ht  ui         ftv  bwt
1  no  19 182 black    no  no no yes        none 2523
2  no  33 155 other    no  no no  no two or more 2551
3  no  20 105 white   yes  no no  no         one 2557
4  no  21 108 white   yes  no no yes two or more 2594
5  no  18 107 white   yes  no no yes        none 2600
6  no  21 124 other    no  no no  no        none 2622

Variables in the Data Set


low:    Was the baby underweight or not? (yes/no)
age:    The mother's age when the child was born.
lwt:    The mother's weight (lbs) at her last menstruation.
race:   Race of the mother (white/black/other).
smoke:  Did the mother smoke during pregnancy? (yes/no)
ptl:    Has the mother had previous premature labors? (yes/no)
ht:     Does the mother have a history of hypertension? (yes/no)
ui:     Was uterine irritability present during pregnancy? (yes/no)
ftv:    The number of doctor visits before birth (none/one/two or more).
bwt:    The child's weight (g).

2. Conduct a One-Way ANOVA


We can use a one-way ANOVA to answer the question: Does the mean birth weight of
infants vary across different races?
2.1 Build ANOVA Model
First we'll build an ANOVA model object using the aov() function. Notice that, similar
to the lm() objects we saw last week, we'll use the "y as a function of x" notation:
# Define the ANOVA model object
> my_model <- aov(birthwt$bwt ~ birthwt$race)
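As a runnable sketch of this step, the built-in PlantGrowth data (plant weights across three treatment groups, shipped with base R) stands in for the lab's birthwt data, which lives on Canvas rather than in R itself:

```r
# One-way ANOVA on a built-in stand-in dataset: does mean plant
# weight differ across the three treatment groups?
data(PlantGrowth)

# The data= argument avoids repeating the data frame name with $
pg_model <- aov(weight ~ group, data = PlantGrowth)
summary(pg_model)
```

The `data =` form is equivalent to the `birthwt$bwt ~ birthwt$race` style used above, just with less repetition.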

2.2 Check ANOVA Assumptions


First, we will verify that the response variable, bwt, follows a normal distribution for all
three race groups. We can quickly visualize these three distributions with a grouped
boxplot:

> # Plot a boxplot of birth weights for each race


> boxplot(birthwt$bwt ~ birthwt$race, ylab='Birth Weight (g)',
main="Birth Weights by Mother's Race Group")

All three of the distributions are fairly symmetric, so we pass the normality assumption.
If any of the groups showed an obvious skew, we could subset the data and look at
histograms or Q-Q plots to investigate further. Just like with t-tests, we can try to
transform our numeric variable if it violates the normality assumption.
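A minimal sketch of such a closer look, again using PlantGrowth as a stand-in (for the lab data you would subset birthwt$bwt by race instead):

```r
# Pull out one group's values and inspect its distribution directly
ctrl <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]

hist(ctrl, main = "Control Group", xlab = "Weight")
qqnorm(ctrl)
qqline(ctrl)   # points falling near this line suggest approximate normality
```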
We can see in the boxplots above that the three groups have a similar spread of values,
but we can formally check the equal variance assumption with Levene's test, like we did
in the two-sample t-test lab. Make sure you have installed the car package:
> library(car)
> leveneTest(bwt~race,data=birthwt)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   2  0.4684 0.6267
      186

With a p-value > 0.05, we can say that this data meets the assumption of equal variances.
2.3 State Null and Alternative Hypotheses
H0: The mean infant birth weight is the same in all three race groups.
HA: At least one mean infant birth weight is different from the others.

2.4 Conduct and Interpret ANOVA Output


To view the ANOVA table for our model, we can use the function summary():
> summary(my_model)
              Df   Sum Sq Mean Sq F value  Pr(>F)
birthwt$race   2  5015725 2507863   4.913 0.00834 **
Residuals    186 94953931  510505
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We have evidence that the mean birth weight of at least one race is different from another
(F = 4.91, df = (2, 186), p < 0.05). To find out which races differ from each other, we'll
need to run a post-hoc analysis.
2.5 Conduct and Interpret Tukey Post-Hoc Pairwise Comparisons
We can use the TukeyHSD() function to automatically run all of the pairwise t-tests from
our ANOVA model, using an adjustment to protect against inflated Type I error:
> TukeyHSD(my_model)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = birthwt$bwt ~ birthwt$race)

$`birthwt$race`
                 diff         lwr      upr     p adj
other-black  85.59127 -304.452081 475.6346 0.8624372
white-black 383.02644    9.816581 756.2363 0.0428037
white-other 297.43517   28.705095 566.1652 0.0260124

This function returns the p-values using the Tukey adjustment for multiple comparisons.
By comparing the p adj values in the last column to 0.05, we see that there is a
significant difference in mean birth weight between babies with white mothers and black
mothers (p = 0.04), and also between babies with white mothers and mothers of other
races (p = 0.03).
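The same post-hoc step can be tried end-to-end on the PlantGrowth stand-in from earlier:

```r
# Tukey post-hoc comparisons: one row per pair of groups,
# with confidence intervals and adjusted p-values
pg_model <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(pg_model)
```

Reading the `p adj` column the same way, only the pairs with adjusted p-values below 0.05 differ significantly.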

3. Conduct a Multi-Factor ANOVA


A multi-factor ANOVA tests the effect of multiple explanatory variables on a single
numeric response variable. So we can run a single model to test if the mean birth weight
differs by race and whether or not the mother smokes. We will add another explanatory
variable, smoke, into the model we just analyzed with a plus sign (+).
> multi_model <- aov(birthwt$bwt ~ birthwt$race + birthwt$smoke)


3.1 Check Model Assumptions


Since we've already confirmed that the factor race meets both the normality and equal
variance assumptions, we now only need to confirm the same for the smoke factor:
> boxplot(birthwt$bwt ~ birthwt$smoke, ylab='Birth Weight (g)',
xlab='Mother Smokes?', main='Birth Weights for Smokers and Non-Smokers')

The boxplots show no major deviations from normality or equal variance, so both factors
in this ANOVA meet the necessary assumptions.
3.2 State Null and Alternative Hypotheses
In a multi-factor ANOVA, we must have a set of hypotheses for each explanatory
variable:
Hypothesis Set 1:
H0: While accounting for smoking during pregnancy, there is no difference in
mean birth weight across the three race groups.
HA: While accounting for smoking during pregnancy, there is a difference in
mean birth weight across the three race groups.
Hypothesis Set 2:
H0: While accounting for race, there is no difference in mean birth weight across
smoking groups.
HA: While accounting for race, there is a difference in mean birth weight across
smoking groups.

3.3 Conduct and Interpret Multi-Factor ANOVA Output


Here we need to correct for an unfortunate default in R's aov() objects. If we simply ask
for summary() like we did in the one-factor case, we get sequential (Type I) sums of
squares, which depend on the order the terms were entered, giving the wrong results for
unbalanced data. So we'll use the drop1() function, which tests each term as if it were
entered last and produces the correct ANOVA table:
> drop1(multi_model,~.,test='F')
Single term deletions

Model:
birthwt$bwt ~ birthwt$race + birthwt$smoke
              Df Sum of Sq      RSS    AIC F value    Pr(>F)
<none>                      90573666 2480.1
birthwt$race   2   7368153 97941819 2490.9  7.5249 0.0007213 ***
birthwt$smoke  1   4380265 94953931 2487.0  8.9468 0.0031582 **

While controlling for race, smoking during pregnancy does impact mean birth weight (F
= 8.95, p < 0.05). Also, while controlling for smoking status, mean birth weight does
differ based on the mother's race (F = 7.52, p < 0.05).
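The full additive workflow can be reproduced with another built-in stand-in, warpbreaks (breaks in yarn by wool type and tension level):

```r
# Two-factor additive ANOVA with correct sums of squares via drop1()
wb_model <- aov(breaks ~ wool + tension, data = warpbreaks)
drop1(wb_model, ~., test = "F")   # each term tested as if entered last
```

Each row of the resulting table answers one hypothesis set, just as with race and smoke above.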

4. Testing for an Interaction Effect


To determine whether or not the effect of race differs for mothers who smoke versus
those who don't, we can add an interaction effect to the model. We don't need to
confirm any additional assumptions, but we will need a third set of hypotheses that
refers specifically to the interaction between our two explanatory variables:
Hypothesis Set 3:
H0: There is no interaction between race and smoking on mean birth weight.
HA: There is an interaction between race and smoking on mean birth weight.
To include the interaction term in our model, we change the plus sign (+) between the
two explanatory variables to an asterisk (*):
> int_model <- aov(birthwt$bwt ~ birthwt$race*birthwt$smoke)
> drop1(int_model,~.,test='F')
Single term deletions

Model:
birthwt$bwt ~ birthwt$race * birthwt$smoke
                           Df Sum of Sq      RSS    AIC F value    Pr(>F)
<none>                                  84790962 2471.6
birthwt$race                2  11635383 96426345 2491.9 12.5560 7.764e-06 ***
birthwt$smoke               1    756002 85546963 2471.3  1.6316  0.203095
birthwt$race:birthwt$smoke  2   5782704 90573666 2480.1  6.2403  0.002389 **


We see a significant interaction between race and smoking on birth weight (F = 6.24, p <
0.05). Note that the main effect of smoke is no longer significant. However, in the
presence of a significant interaction effect, we don't usually interpret the main effects
of the two variables.
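The same interaction workflow runs on the warpbreaks stand-in from before:

```r
# The * crosses wool and tension, so drop1() also tests
# the wool:tension interaction term
int_wb <- aov(breaks ~ wool * tension, data = warpbreaks)
drop1(int_wb, ~., test = "F")
```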
4.1 Graph an Interaction Plot
We can display the interaction between these two variables with an interaction plot. The
gplots package contains the plotmeans() function, which will plot the means of our
explanatory factors, as well as error bars for the 95% confidence interval for each mean.
First, we'll subset the data by one of the explanatory variables (it's easiest to do this for
the one with the fewest levels). Then, we'll plot the means of the other explanatory
variable and add a legend to tell which color corresponds to which smoking group:
> install.packages('gplots')
> library(gplots)
# Divide data into subgroups based on smoking status
> births_no <- birthwt[birthwt$smoke=='no',]
> births_yes <- birthwt[birthwt$smoke=='yes',]
# Interaction plot for two categorical explanatory variables
> plotmeans(bwt~race, data=births_no, n.label=F, col='black',
barcol='black', pch=20, ylim=c(2000,3700),
main='Interaction Plot', ylab='Birth Weights (g)')
> plotmeans(bwt~race, data=births_yes, n.label=F, add=T, col='red',
barcol='red', pch=20)
> legend('topright', inset=0.01, title='Smoke?', c('No','Yes'),
fill=c('black','red'), cex=.8)
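If the gplots package is unavailable, base R's interaction.plot() (from the stats package) draws the same kind of plot, without the error bars; warpbreaks again stands in for the lab data:

```r
# Base-R interaction plot: one line per wool type, showing how the
# mean of the response changes across tension levels
with(warpbreaks,
     interaction.plot(tension, wool, breaks,
                      col = c("black", "red"), lty = 1,
                      ylab = "Mean breaks", xlab = "Tension"))
```

Crossing or strongly non-parallel lines are the visual signature of an interaction.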
