
STAT 431 - Practice Term Test # 2

Question 1 [20 marks]

School administrators want to study the attendance behaviours of high school students. For each
of 316 students the following variables are available:
• daysabs: The number of days the student was absent from school (the response variable).
• daysatt: The number of days the student attended school.
• id: An identification number assigned to each student.
• male: An indicator variable of the student’s gender (1=male, 0=female).
• math: The student’s standardized test score in mathematics (a continuous variable).
• langarts: The student’s standardized test score in language arts (a continuous variable).
Here is the data for the first five students:
id    male  math       langarts  daysatt  daysabs
1001  1     56.988830  42.45086  73       4
1002  1     37.094160  46.82059  73       4
1003  0     32.275460  43.56657  76       2
1004  0     29.056720  43.56657  74       3
1005  0      6.748048  27.24847  73       3

We will consider modelling the data as a time-homogeneous Poisson process. The R code and output
from fitting a log-linear model appear on page 13.

(a) Explain why we include an offset term in log-linear models for Poisson processes. What would
you choose as an appropriate offset term for these data? What would be the implications of
omitting the offset term? [4 marks]

(b) Consider the main effects model

log μ_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + offset

where x_{i1} = I[male = 1] is the indicator of male gender and x_{i2} = math is the math score for
subject i. Assume the appropriate offset has been included. The fitted version of this model
is given in the R output. Conduct a Wald-based test of the null hypothesis that the rate of
absenteeism for males is half that of females. Be sure to carefully state the null and alternative
hypotheses in terms of the regression parameters, and give the formula of the test statistic and its
asymptotic distribution under the null hypothesis. What is the conclusion of the test? [6 marks]
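
As a rough sketch of the arithmetic in (b): a male-to-female rate ratio of one half corresponds to β_1 = log(0.5), so a Wald statistic compares the fitted coefficient with that value. The variable names below are illustrative and not part of the original; the estimate and standard error are copied from the R output for Question 1.

```r
# Wald test of H0: beta_1 = log(0.5) versus Ha: beta_1 != log(0.5).
# Estimate and standard error are copied from the summary(model1) output.
beta1_hat  <- -0.328198
se_beta1   <- 0.047546
beta1_null <- log(0.5)

# Under H0 the statistic is asymptotically standard normal.
z_stat  <- (beta1_hat - beta1_null) / se_beta1
p_value <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)

cat("z =", round(z_stat, 3), " two-sided p-value =", signif(p_value, 3), "\n")
```

A large value of |z| relative to N(0, 1) quantiles is evidence against the hypothesis that the male rate is half the female rate.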

(c) Based on the main effects model in (b) estimate the expected number of days a female student
with a math score of 50 would be absent from school over the course of a full school year of
200 days. Provide a 95% confidence interval for your estimate. [6 marks]
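
One possible way to organise the calculation in (c) is sketched below. It assumes, loudly, that the offset hidden as `???` in the R output is the log of each student's exposure measured in days, so that the exponentiated linear predictor is a daily absence rate; the source does not confirm this. The names `x0`, `V` and `days` are illustrative. The interval is built on the log scale with the delta method and then back-transformed.

```r
# ASSUMPTION: the offset is log(exposure in days), so exp(x'beta) is a rate
# per day. The source shows the offset only as "???".
beta_hat <- c(-1.848700, -0.328198, -0.013447)   # (Intercept), male, math

# Covariance matrix copied from summary(model1)$cov.unscaled
V <- matrix(c( 4.540809e-03, -1.085616e-03, -7.887789e-05,
              -1.085616e-03,  2.260645e-03,  3.573382e-06,
              -7.887789e-05,  3.573382e-06,  1.719111e-06),
            nrow = 3, byrow = TRUE)

x0   <- c(1, 0, 50)   # intercept, female (male = 0), math = 50
days <- 200           # length of the school year

# Point estimate and standard error of the linear predictor
eta_hat <- sum(x0 * beta_hat)
se_eta  <- sqrt(drop(t(x0) %*% V %*% x0))

# 95% interval on the log scale, then back-transform and scale by exposure
ci_eta        <- eta_hat + c(-1, 1) * qnorm(0.975) * se_eta
expected_days <- days * exp(eta_hat)
ci_days       <- days * exp(ci_eta)

cat("Expected days absent:", round(expected_days, 2), "\n")
cat("95% CI:", round(ci_days, 2), "\n")
```

Working on the log scale keeps the interval limits positive, which a symmetric interval around the expected count would not guarantee.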

(d) The models in this question have been based on the underlying assumption of a time-homogeneous
Poisson process. Derive an expression for the deviance statistic and state its asymptotic
distribution. Define any notation you introduce. [4 marks]

Question 2 [20 marks]

The data below arise from a study investigating the relation between cigarette smoking behaviour
(Factor C), hypertension status (Factor H) and proteinurea (Factor P - the presence of protein in
urine) among expectant mothers.

                    Hypertension: Yes     Hypertension: No
                    Proteinurea           Proteinurea
Cigarettes/day      Yes      No           Yes      No
0                   439    1740           294    5132
1-19                195     811           244    3625
>20                  31     154            51     658

The R code and output from fitting several different log-linear models appear on page 14.

(a) Below is a table of residual deviances for a number of log-linear models fit to the above data.
Use an appropriate sequence of deviance tests for nested models to show that Model 3 provides
the most reasonable fit to the data. Be sure to indicate the degrees of freedom of the χ²
distributions used in your tests.
You may continue your solution on the opposite side of this page. [5 marks]

Model   Form            Residual Deviance
1       (CHP)                  0.00
2       (CH, CP, HP)           5.78
3       (CH, HP)               6.85
4       (CH, CP)             501.63
5       (CP, HP)             118.47
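
The nested-model comparisons asked for in (a) could be tabulated roughly as follows. The residual degrees of freedom are not given in the table; they are obtained here by counting parameters for a 3 x 2 x 2 table (12 cells), which agrees with the 4 degrees of freedom reported for Model 3 in the R output. The helper `lr_test` is a hypothetical name introduced for illustration.

```r
# Residual deviances copied from the table above
dev <- c(m1 = 0.00, m2 = 5.78, m3 = 6.85, m4 = 501.63, m5 = 118.47)

# Residual df = 12 cells minus the number of free parameters in each model
df <- c(m1 = 0, m2 = 2, m3 = 4, m4 = 3, m5 = 3)

# Deviance (likelihood ratio) test of a reduced model against a fuller one
lr_test <- function(reduced, full) {
  stat  <- unname(dev[reduced] - dev[full])
  dfree <- unname(df[reduced] - df[full])
  c(stat = stat, df = dfree, p = pchisq(stat, dfree, lower.tail = FALSE))
}

tests <- rbind(
  "2 vs 1" = lr_test("m2", "m1"),  # drop the 3-way interaction
  "3 vs 2" = lr_test("m3", "m2"),  # drop CP
  "4 vs 2" = lr_test("m4", "m2"),  # drop HP
  "5 vs 2" = lr_test("m5", "m2")   # drop CH
)
print(round(tests, 4))
```

Only the comparisons in which the reduced model is a special case of the fuller one are included; Models 3, 4 and 5 are not nested in each other, so each is compared with Model 2.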

(b) Give the expression for the log-linear model corresponding to Model 3. Define any notation
you introduce and include appropriate constraints. [3 marks]
(c) Give an expression for the null hypothesis represented by Model 3 using appropriate
probabilities (for example π_{ijk}, π_{ij|k}, π_{i·k}, etc.). Interpret this hypothesis in words as a particular
type of independence between the factors. [3 marks]
(d) Based on Model 3 find an expression for the relative probability of an expectant mother
with hypertension but without proteinurea, smoking > 20 cigarettes/day versus being a
non-smoker. Use the fitted model to give an estimate of this relative probability. [4 marks]
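
A minimal sketch of the estimate in (d), assuming the treatment coding shown in the R output for Question 2, where level 1 of each factor (non-smoker, hypertension Yes, Factor P Yes) is the baseline. Under Model 3 the Factor P terms cancel in the ratio of the two expected counts and the CH interaction is zero at the baseline hypertension level, so the ratio reduces to the exponentiated `cigft3` main effect.

```r
# cigft3 coefficient copied from summary(model3)
beta_cig3 <- -2.46627

# Ratio of fitted means for cig level 3 versus level 1,
# with hypertension = Yes (baseline) and Factor P = No
rel_prob <- exp(beta_cig3)

cat("Estimated relative probability:", round(rel_prob, 4), "\n")
```

The value matches the `cigft3` entry of `exp(model3$coeff)` in the output.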
(e) Based on Model 3 find an expression for the odds ratio for having proteinurea for expectant
mothers with hypertension versus those without hypertension controlling for level of cigarette
smoking. Use the fitted model to give an estimate and 95% confidence interval for the odds
ratio. [5 marks]
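
For (e), a hedged sketch under the same coding assumption (level 1 = Yes for both hypertension and Factor P): the odds of Factor P = Yes for hypertension Yes versus hypertension No work out to the exponentiated `hypft2:proft2` coefficient, and Model 3 has no CP term, so the ratio is the same at every smoking level. A Wald interval is formed on the log scale and then exponentiated. Variable names are illustrative.

```r
# hypft2:proft2 estimate and standard error copied from summary(model3)
beta_hp <- 1.36856
se_hp   <- 0.06064

# Point estimate and 95% Wald interval for the odds ratio
or_hat <- exp(beta_hp)
or_ci  <- exp(beta_hp + c(-1, 1) * qnorm(0.975) * se_hp)

cat("Odds ratio:", round(or_hat, 3), "\n")
cat("95% CI:", round(or_ci, 3), "\n")
```

As a sanity check, the same ratio computed from the observed table collapsed over smoking, (665/2705)/(589/9415), is about 3.93.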

R output for Question 1

> # Fit the main effects model
> # (appropriate offset has been included)
> model1 <- glm(daysabs~male+math+offset(???),
                family=poisson(link=log),data=attendance)
> summary(model1)

Call:
glm(formula = daysabs ~ male + math + offset(???),
family = poisson(link = log), data = attendance)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.1506  -2.6354  -0.9966   0.8403  13.3481

Coefficients:
             Estimate Std. Error
(Intercept) -1.848700   0.067386
male        -0.328198   0.047546
math        -0.013447   0.001311
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2526.3  on 315  degrees of freedom
Residual deviance: 2380.3  on 313  degrees of freedom
AIC: 3247.7

Number of Fisher Scoring iterations: 6

> exp(model1$coeff)
(Intercept)        male        math
  0.1574417   0.7202201   0.9866431

> summary(model1)$cov.unscaled
              (Intercept)          male          math
(Intercept)  4.540809e-03 -1.085616e-03 -7.887789e-05
male        -1.085616e-03  2.260645e-03  3.573382e-06
math        -7.887789e-05  3.573382e-06  1.719111e-06

R output for Question 2

> # Enter the data from the 3-way contingency table
> cig <- c(1,1,1,1,2,2,2,2,3,3,3,3)
> hyp <- c(1,1,2,2,1,1,2,2,1,1,2,2)
> pro <- c(1,2,1,2,1,2,1,2,1,2,1,2)
> y <- c(439,1740,294,5132,195,811,244,3625,31,154,51,658)
> data <- data.frame(cbind(cig,hyp,pro,y))
> data
   cig hyp pro    y
1    1   1   1  439
2    1   1   2 1740
3    1   2   1  294
4    1   2   2 5132
5    2   1   1  195
6    2   1   2  811
7    2   2   1  244
8    2   2   2 3625
9    3   1   1   31
10   3   1   2  154
11   3   2   1   51
12   3   2   2  658

> # Make variables into factors
> data$cigf <- factor(data$cig)
> data$cigft <- C(data$cigf,treatment)
> data$hypf <- factor(data$hyp)
> data$hypft <- C(data$hypf,treatment)
> data$prof <- factor(data$pro)
> data$proft <- C(data$prof,treatment)
> attach(data)

> # Fit a variety of log-linear models to the data

> # Fit saturated model
> model1 <- glm(y~cigft*hypft*proft, family=poisson)
> model1$deviance
[1] -7.507328e-13

> # Fit model with main effects and 2-way interactions
> model2 <- glm(y~cigft*hypft+cigft*proft+hypft*proft,
                family=poisson)
> model2$deviance
[1] 5.782559

> # Fit models with two 2-way interaction terms
> model3 <- glm(y~cigft*hypft+hypft*proft, family=poisson)
> model3$deviance
[1] 6.845915

> model4 <- glm(y~cigft*hypft+cigft*proft, family=poisson)
> model4$deviance
[1] 501.6307

> model5 <- glm(y~cigft*proft+hypft*proft, family=poisson)
> model5$deviance
[1] 118.4714

> # Choose model 3
> summary(model3)

Call:
glm(formula = y ~ cigft * hypft + hypft * proft, family = poisson)

Deviance Residuals:
      1        2        3        4        5        6
 0.4335  -0.2158  -1.4442   0.3560  -0.2501   0.1235
      7        8        9       10       11       12
 1.0615  -0.2688  -0.9357   0.4491   1.3841  -0.3592

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    6.06374    0.04082 148.563  < 2e-16 ***
cigft2        -0.77288    0.03812 -20.276  < 2e-16 ***
cigft3        -2.46627    0.07658 -32.206  < 2e-16 ***
hypft2        -0.29710    0.05872  -5.060 4.20e-07 ***
proft2         1.40307    0.04328  32.416  < 2e-16 ***
cigft2:hypft2  0.43468    0.04354   9.983  < 2e-16 ***
cigft3:hypft2  0.43116    0.08637   4.992 5.97e-07 ***
hypft2:proft2  1.36856    0.06064  22.568  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 20397.2675  on 11  degrees of freedom
Residual deviance:     6.8459  on  4  degrees of freedom
AIC: 117.24

Number of Fisher Scoring iterations: 3

> exp(model3$coeff)
  (Intercept)        cigft2        cigft3        hypft2
 429.98071217    0.46167967    0.08490133    0.74297196
       proft2 cigft2:hypft2 cigft3:hypft2 hypft2:proft2
   4.06766917    1.54446542    1.53904723    3.92970008

> summary(model3)$cov.unscaled
                (Intercept)        cigft2        cigft3
(Intercept)    0.0016659488 -4.589261e-04 -4.589261e-04
cigft2        -0.0004589261  1.452962e-03  4.589261e-04
cigft3        -0.0004589261  4.589261e-04  5.864309e-03
hypft2        -0.0016659488  4.589261e-04  4.589261e-04
proft2        -0.0015037586 -1.948211e-19  6.028989e-19
cigft2:hypft2  0.0004589261 -1.452962e-03 -4.589261e-04
cigft3:hypft2  0.0004589261 -4.589261e-04 -5.864309e-03
hypft2:proft2  0.0015037586 -2.050039e-19 -3.870912e-18
                     hypft2        proft2 cigft2:hypft2
(Intercept)   -0.0016659488 -1.503759e-03  4.589261e-04
cigft2         0.0004589261 -1.948211e-19 -1.452962e-03
cigft3         0.0004589261  6.028989e-19 -4.589261e-04
hypft2         0.0034480632  1.503759e-03 -6.432238e-04
proft2         0.0015037586  1.873444e-03  3.335303e-19
cigft2:hypft2 -0.0006432238  3.335303e-19  1.895724e-03
cigft3:hypft2 -0.0006432238 -8.617005e-20  6.432238e-04
hypft2:proft2 -0.0032015352 -1.873444e-03  1.167887e-20
              cigft3:hypft2 hypft2:proft2
(Intercept)    4.589261e-04  1.503759e-03
cigft2        -4.589261e-04 -2.050039e-19
cigft3        -5.864309e-03 -3.870912e-18
hypft2        -6.432238e-04 -3.201535e-03
proft2        -8.617005e-20 -1.873444e-03
cigft2:hypft2  6.432238e-04  1.167887e-20
cigft3:hypft2  7.459042e-03  3.385072e-18
hypft2:proft2  3.385072e-18  3.677434e-03
