
Biostatistics SDS 328M Lab Section

Lab 11: Linear Modeling - ANOVA

Introduction
An ANOVA (analysis of variance) is used to analyze the effect of predictor(s) on a
numerical response variable and can determine whether the mean response differs across
levels of the predictor variable(s). A one-way, or single-factor, ANOVA tests a single
explanatory variable, while a multi-factor ANOVA tests the effect of multiple
explanatory variables.

Objectives of this lab

Conduct single-factor and multi-factor ANOVA analyses.


Verify the assumptions for ANOVA models.

New R Functions and Tools


Function/Tool   What it Does
aov()           Defines an ANOVA model object
TukeyHSD()      Runs post-hoc pairwise comparisons using the Tukey adjustment
drop1()         Provides correct sums of squares for a multi-factor ANOVA
plotmeans()     Contained in the gplots library; creates an interaction plot with standard error bars
legend()        Adds a legend to a plot

Dataset
This week we will examine a dataset from a survey which looked at factors contributing
to birth weight. The data contains information about the mother as well as the child's
weight, and is called lab11_birthwt.csv on Canvas.

Questions of Interest
In this lab, we will conduct ANOVA models to understand the relationships between
potential underlying factors of birth weight in infants:
Does the mean birth weight of infants vary across different races?
What about while also considering if the mother smokes or not?
Does the effect of race interact with the effect of smoking?

Copyright 2016 Dept. of Statistics and Data Sciences, UT Austin


1. Reading the Data Set into R


The first thing we will do with every data set is to determine what kind of variables we
are working with, as well as their names. We will do this using the head() function:
> head(birthwt)
  low age lwt  race smoke ptl ht  ui         ftv  bwt
1  no  19 182 black    no  no no yes        none 2523
2  no  33 155 other    no  no no  no two or more 2551
3  no  20 105 white   yes  no no  no         one 2557
4  no  21 108 white   yes  no no yes two or more 2594
5  no  18 107 white   yes  no no yes        none 2600
6  no  21 124 other    no  no no  no        none 2622

Variables in the Data Set


low:    Was the baby underweight or not? (yes/no)
age:    The mother's age when the child was born.
lwt:    The mother's weight (lbs) at her last menstruation.
race:   Race of the mother (white/black/other).
smoke:  Did the mother smoke during pregnancy? (yes/no)
ptl:    Has the mother had previous premature labors? (yes/no)
ht:     Does the mother have a history of hypertension? (yes/no)
ui:     Was uterine irritability present during pregnancy? (yes/no)
ftv:    The number of doctor visits before birth (none/one/two or more).
bwt:    The child's weight (g).

2. Conduct a One-Way ANOVA


We can use a one-way ANOVA to answer the question: Does the mean birth weight of
infants vary across different races?
2.1 Build ANOVA Model
First we'll build an ANOVA model object using the aov() function. Notice that, similar
to the lm() objects we saw last week, we'll use the "y as a function of x" notation:
# Define the ANOVA model object
> my_model <- aov(birthwt$bwt ~ birthwt$race)
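As a runnable sketch of this step, the built-in PlantGrowth data (plant weights across three treatment groups, shipped with base R) stands in for the lab's birthwt data, which lives on Canvas rather than in R itself:

```r
# One-way ANOVA on a built-in stand-in dataset: does mean plant
# weight differ across the three treatment groups?
data(PlantGrowth)

# The data= argument avoids repeating the data frame name with $
pg_model <- aov(weight ~ group, data = PlantGrowth)
summary(pg_model)
```

The `data =` form is equivalent to the `birthwt$bwt ~ birthwt$race` style used above, just with less repetition.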

2.2 Check ANOVA Assumptions


First, we will verify that the response variable, bwt, follows a normal distribution for all
three race groups. We can quickly visualize these three distributions with a grouped
boxplot:

> # Plot a boxplot of birth weights for each race


> boxplot(birthwt$bwt ~ birthwt$race, ylab='Birth Weight (g)',
main="Birth Weights by Mother's Race Group")

All three of the distributions are fairly symmetric, so we pass the normality assumption.
If any of the groups showed an obvious skew, we could subset the data and look at
histograms or Q-Q plots to investigate further. Just like with t-tests, we can try to
transform our numeric variable if it violates the normality assumption.
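A minimal sketch of such a closer look, again using PlantGrowth as a stand-in (for the lab data you would subset birthwt$bwt by race instead):

```r
# Pull out one group's values and inspect its distribution directly
ctrl <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]

hist(ctrl, main = "Control Group", xlab = "Weight")
qqnorm(ctrl)
qqline(ctrl)   # points falling near this line suggest approximate normality
```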
We can see in the boxplots above that the three groups have a similar spread of values,
but we can formally check the equal variance assumption with Levene's test, like we did
in the two-sample t-test lab. Make sure you have installed the car package:
> library(car)
> leveneTest(bwt~race,data=birthwt)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   2  0.4684 0.6267
      186

With a p-value > 0.05, we can say that this data meets the assumption of equal variances.
2.3 State Null and Alternative Hypotheses
H0: The mean infant birth weight is the same in all three race groups.
HA: At least one mean infant birth weight is different from the others.

2.4 Conduct and Interpret ANOVA Output


To view the ANOVA table for our model, we can use the function summary():
> summary(my_model)
              Df   Sum Sq Mean Sq F value  Pr(>F)
birthwt$race   2  5015725 2507863   4.913 0.00834 **
Residuals    186 94953931  510505
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We have evidence that the mean birth weight of at least one race is different from another
(F = 4.91, df = (2, 186), p < 0.05). To find out which races differ from each other, we'll
need to run a post-hoc analysis.
2.5 Conduct and Interpret Tukey Post-Hoc Pairwise Comparisons
We can use the TukeyHSD() function to automatically run all of the pairwise t-tests from
our ANOVA model, using an adjustment to protect against inflated Type I error:
> TukeyHSD(my_model)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = birthwt$bwt ~ birthwt$race)

$`birthwt$race`
                 diff         lwr      upr     p adj
other-black  85.59127 -304.452081 475.6346 0.8624372
white-black 383.02644    9.816581 756.2363 0.0428037
white-other 297.43517   28.705095 566.1652 0.0260124

This function returns the p-values using the Tukey adjustment for multiple comparisons.
By comparing the p adj values in the last column to 0.05, we see that there is a
significant difference in mean birth weight between babies with white mothers and black
mothers (p = 0.04), and also between babies with white mothers and mothers of other
races (p = 0.03).
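The same post-hoc step can be tried end-to-end on the PlantGrowth stand-in from earlier:

```r
# Tukey post-hoc comparisons: one row per pair of groups,
# with confidence intervals and adjusted p-values
pg_model <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(pg_model)
```

Reading the `p adj` column the same way, only the pairs with adjusted p-values below 0.05 differ significantly.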

3. Conduct a Multi-Factor ANOVA


A multi-factor ANOVA tests the effect of multiple explanatory variables on a single
numeric response variable. So we can run a single model to test if the mean birth weight
differs by race and whether or not the mother smokes. We will add another explanatory
variable, smoke, into the model we just analyzed with a plus sign (+).
> multi_model <- aov(birthwt$bwt ~ birthwt$race + birthwt$smoke)


3.1 Check Model Assumptions


Since we've already confirmed that the factor race meets both the normality and equal
variance assumptions, we now only need to confirm the same for the smoke factor:
> boxplot(birthwt$bwt ~ birthwt$smoke, ylab='Birth Weight (g)',
xlab='Mother Smokes?', main='Birth Weights for Smokers and Non-Smokers')

The boxplots show no major deviations from normality or equal variance, so both factors
in this ANOVA meet the necessary assumptions.
3.2 State Null and Alternative Hypotheses
In a multi-factor ANOVA, we must have a set of hypotheses for each explanatory
variable:
Hypothesis Set 1:
H0: While accounting for smoking during pregnancy, there is no difference in
mean birth weight across the three race groups.
HA: While accounting for smoking during pregnancy, there is a difference in
mean birth weight across the three race groups.
Hypothesis Set 2:
H0: While accounting for race, there is no difference in mean birth weight across
smoking groups.
HA: While accounting for race, there is a difference in mean birth weight across
smoking groups.

3.3 Conduct and Interpret Multi-Factor ANOVA Output


Here we need to correct for an unfortunate default in R's aov() objects. If we simply ask
for summary() like we did in the one-factor case, we get sequential (Type I) sums of
squares, which depend on the order the terms were entered, giving the wrong results for
unbalanced data. So we'll use the drop1() function, which tests each term as if it were
entered last and produces the correct ANOVA table:
> drop1(multi_model,~.,test='F')
Single term deletions

Model:
birthwt$bwt ~ birthwt$race + birthwt$smoke
              Df Sum of Sq      RSS    AIC F value    Pr(>F)
<none>                      90573666 2480.1
birthwt$race   2   7368153 97941819 2490.9  7.5249 0.0007213 ***
birthwt$smoke  1   4380265 94953931 2487.0  8.9468 0.0031582 **

While controlling for race, smoking during pregnancy does impact mean birth weight (F
= 8.95, p < 0.05). Also, while controlling for smoking status, mean birth weight does
differ based on the mother's race (F = 7.52, p < 0.05).
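The full additive workflow can be reproduced with another built-in stand-in, warpbreaks (breaks in yarn by wool type and tension level):

```r
# Two-factor additive ANOVA with correct sums of squares via drop1()
wb_model <- aov(breaks ~ wool + tension, data = warpbreaks)
drop1(wb_model, ~., test = "F")   # each term tested as if entered last
```

Each row of the resulting table answers one hypothesis set, just as with race and smoke above.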

4. Testing for an Interaction Effect


To determine whether or not the effect of race differs for mothers who smoke versus
those who don't, we can add an interaction effect to the model. We don't need to
confirm any additional assumptions, but we will need a third set of hypotheses that
refers specifically to the interaction between our two explanatory variables:
Hypothesis Set 3:
H0: There is no interaction between race and smoking on mean birth weight.
HA: There is an interaction between race and smoking on mean birth weight.
To include the interaction term in our model, we change the plus sign (+) between the
two explanatory variables to an asterisk (*):
> int_model <- aov(birthwt$bwt ~ birthwt$race*birthwt$smoke)
> drop1(int_model,~.,test='F')
Single term deletions

Model:
birthwt$bwt ~ birthwt$race * birthwt$smoke
                           Df Sum of Sq      RSS    AIC F value    Pr(>F)
<none>                                  84790962 2471.6
birthwt$race                2  11635383 96426345 2491.9 12.5560 7.764e-06 ***
birthwt$smoke               1    756002 85546963 2471.3  1.6316  0.203095
birthwt$race:birthwt$smoke  2   5782704 90573666 2480.1  6.2403  0.002389 **


We see a significant interaction between race and smoking on birth weight (F = 6.24, p <
0.05). Note that the main effect of smoke is no longer significant. However, in the
presence of a significant interaction effect, we don't usually interpret the main effects
of the two variables.
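The same interaction workflow runs on the warpbreaks stand-in from before:

```r
# The * crosses wool and tension, so drop1() also tests
# the wool:tension interaction term
int_wb <- aov(breaks ~ wool * tension, data = warpbreaks)
drop1(int_wb, ~., test = "F")
```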
4.1 Graph an Interaction Plot
We can display the interaction between these two variables with an interaction plot. The
gplots package contains the plotmeans() function, which will plot the means of our
explanatory factors, as well as error bars for the 95% confidence interval for each mean.
First, we'll subset the data by one of the explanatory variables (it's easiest to do this for
the one with the fewest levels). Then, we'll plot the means of the other explanatory
variable and add a legend to tell which color corresponds to which smoking group:
> install.packages('gplots')
> library(gplots)
# Divide data into subgroups based on smoking status
> births_no <- birthwt[birthwt$smoke=='no',]
> births_yes <- birthwt[birthwt$smoke=='yes',]
# Interaction plot for two categorical explanatory variables
> plotmeans(bwt~race, data=births_no, n.label=F, col='black',
barcol='black', pch=20, ylim=c(2000,3700),
main='Interaction Plot', ylab='Birth Weights (g)')
> plotmeans(bwt~race, data=births_yes, n.label=F, add=T, col='red',
barcol='red', pch=20)
> legend('topright', inset=0.01, title='Smoke?', c('No','Yes'),
fill=c('black','red'), cex=.8)
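If the gplots package is unavailable, base R's interaction.plot() (from the stats package) draws the same kind of plot, without the error bars; warpbreaks again stands in for the lab data:

```r
# Base-R interaction plot: one line per wool type, showing how the
# mean of the response changes across tension levels
with(warpbreaks,
     interaction.plot(tension, wool, breaks,
                      col = c("black", "red"), lty = 1,
                      ylab = "Mean breaks", xlab = "Tension"))
```

Crossing or strongly non-parallel lines are the visual signature of an interaction.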
