Case Study
21.08.2014
Housekeeping
- My contact details
  Office: 303S.265
  Email: aj.lee@auckland.ac.nz or lee@stat.auckland.ac.nz
- Office hours: Tuesday 10:30-12:00, Thursday 10:30-12:00
  I am happy to talk to students at any time if I am not too busy.
Aims of today's lecture
Evaporation data
Evaporation data: Aims of the analysis
Evaporation data: The variables
A data frame with 46 observations on the following 11 variables:
avst, minst, maxst, avat, minat, maxat, avh, minh, maxh, wind and evap.
The Modelling Cycle
Choose Model
Fit Model
USE MODEL
The Modelling Cycle: Our plan of attack
- Graphical check
  - Suitable for regression
  - Gross outliers
- Preliminary fit
- Outlier check
Step 1: Plots
- Preliminary plots
Step 1: Pairs plot (using gclus-package)
[Figure: pairs plot of avst, minst, maxst, avat, minat, maxat, avh, minh, maxh, wind and evap.]
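A pairs plot of this kind can be sketched in base R. This is only an illustration under stated assumptions: evap.df is not reproduced in these notes, so the built-in swiss data set stands in for it, and base pairs() is used in place of the gclus package named in the slide title.

```r
# Stand-in data: the built-in 'swiss' data set (47 rows, 6 numeric columns).
# Assumption: swap in evap.df (with response evap) to mirror the lecture.
dat <- swiss

# Move the response to the last column so it fills the bottom row and the
# right-hand column of the plot, where evap sits in the lecture's figure.
dat <- dat[, c(setdiff(names(dat), "Fertility"), "Fertility")]

# Scatterplot matrix with a smoother in each panel to make curvature visible.
pairs(dat, panel = panel.smooth, gap = 0.3)

# Pairwise correlations with the response, as a numeric companion to the plot.
cor_with_response <- round(cor(dat)[, "Fertility"], 2)
print(cor_with_response)
```

Things to look for are the same as in the lecture: roughly linear relationships with the response, strongly related explanatory variables, and points far from the bulk of the data.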
Step 1: Points to note
- No obvious outliers.
The Modelling Cycle: Our plan of attack
- Graphical check
- Preliminary fit
  - Fit a model of response against all explanatory variables
  - Check diagnostic plots
  - Investigate extreme outliers
  - Look at VIFs
- Model selection
- Outlier check
Step 2: Preliminary fit
Call:
lm(formula = evap ~ ., data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -54.074877 130.720826 -0.414 0.68164
avst 2.231782 1.003882 2.223 0.03276 *
minst 0.204854 1.104523 0.185 0.85393
maxst -0.742580 0.349609 -2.124 0.04081 *
avat 0.501055 0.568964 0.881 0.38452
minat 0.304126 0.788877 0.386 0.70219
maxat 0.092187 0.218054 0.423 0.67505
avh 1.109858 1.133126 0.979 0.33407
minh 0.751405 0.487749 1.541 0.13242
maxh -0.556292 0.161602 -3.442 0.00151 **
wind 0.008918 0.009167 0.973 0.33733
---
Residual standard error: 6.508 on 35 degrees of freedom
Multiple R-squared: 0.8463, Adjusted R-squared: 0.8023
F-statistic: 19.27 on 10 and 35 DF, p-value: 2.073e-11
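The preliminary fit uses the formula shorthand response ~ ., which regresses the response on every other column. A minimal sketch of the same pattern, assuming the built-in swiss data as a stand-in for evap.df (the name fit_full is illustrative):

```r
# Stand-in for lm(evap ~ ., data = evap.df): regress Fertility on all
# other columns of the built-in 'swiss' data set.
fit_full <- lm(Fertility ~ ., data = swiss)

# Coefficient table, residual standard error, R-squared and overall F test.
print(summary(fit_full))

# The four default diagnostic plots for an lm fit, arranged 2 x 2.
op <- par(mfrow = c(2, 2))
plot(fit_full)
par(op)
```

With ten explanatory variables and 46 observations, the evaporation fit leaves 35 residual degrees of freedom; the stand-in fit has five explanatory variables and 47 observations, so it leaves 41.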
Step 2: Diagnostic plots
[Figure: diagnostic plots for the preliminary fit, with panels showing residuals, standardized residuals and Cook's distance. Labelled points include 31, 33 and 41.]
Removing point 31 and comparing
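A comparison of this kind can be sketched as follows. It is an illustration only: the built-in swiss data stands in for evap.df, the point dropped is whichever has the largest Cook's distance in that stand-in fit (in the lecture it is point 31), and the names fit_all, fit_drop and worst are hypothetical.

```r
# Fit to all observations (stand-in data, not evap.df).
fit_all <- lm(Fertility ~ ., data = swiss)

# Index of the observation with the largest Cook's distance.
worst <- which.max(cooks.distance(fit_all))

# Refit with that single observation left out.
fit_drop <- lm(Fertility ~ ., data = swiss[-worst, ])

# Coefficients side by side, plus the change caused by the deletion.
comparison <- cbind(all = coef(fit_all), dropped = coef(fit_drop))
comparison <- cbind(comparison,
                    change = comparison[, "dropped"] - comparison[, "all"])
print(round(comparison, 3))
```

If the changes are small relative to the standard errors, the point is influential on paper but does not alter the conclusions.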
Variance Inflation Factors
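The variance inflation factor of the j-th explanatory variable is 1 / (1 - R_j^2), where R_j^2 comes from regressing that variable on all the others; equivalently, the VIFs are the diagonal of the inverse correlation matrix of the explanatory variables. A minimal base-R sketch, assuming the built-in swiss data as a stand-in for evap.df:

```r
# Explanatory variables only (stand-in data; drop the response column).
X <- swiss[, setdiff(names(swiss), "Fertility")]

# VIFs as the diagonal of the inverse correlation matrix.
vif <- diag(solve(cor(X)))
names(vif) <- names(X)
print(round(vif, 2))

# Cross-check one VIF against its definition 1 / (1 - R^2), where R^2 is
# from regressing that variable on the remaining explanatory variables.
r2_exam <- summary(lm(Examination ~ ., data = X))$r.squared
vif_exam_direct <- 1 / (1 - r2_exam)
print(c(from_inverse = vif[["Examination"]], from_regression = vif_exam_direct))
```

A commonly quoted rule of thumb treats VIFs above about 10 as a sign of serious collinearity, though the threshold is a convention rather than a test.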
Findings
- Point 31 has quite a high Cook's distance, but removing it does not change the regression much.
The Modelling Cycle: Our plan of attack
- Graphical check
- Preliminary fit
- Model selection
  - Using all possible regressions (APR) to identify suitable models
  - Look at summary stats
  - Check diagnostic plots
- Outlier check
Step 3: Model selection using APR
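[Figure: Cp plot. Vertical axis: Cp; horizontal axis: number of variables. Points are labelled by the indices of the variables in each model, including 6,9; 6,9,10; 1,3,6,9; 1,3,6,8,9; 1,3,6,8,9,10; 1,3,4,7,8,9,10; 1,3,4,6,7,8,9,10; 1,3,4,5,6,7,8,9,10 and 1,2,3,4,5,6,7,8,9,10.]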
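All possible regressions can be sketched in base R by fitting every non-empty subset of explanatory variables and recording Mallows' Cp = RSS_p / s^2 - n + 2p, where s^2 is the residual variance of the full model and p counts coefficients including the intercept. This is an illustration under assumptions, not the routine used in the lecture: the built-in swiss data stands in for evap.df, and the names results and best are hypothetical.

```r
dat <- swiss                      # stand-in for evap.df
response <- "Fertility"           # stand-in for evap
preds <- setdiff(names(dat), response)
n <- nrow(dat)

# Residual variance of the full model, used as s^2 in Cp.
fit_full <- lm(reformulate(preds, response), data = dat)
s2_full <- sum(resid(fit_full)^2) / df.residual(fit_full)

# Every non-empty subset of the explanatory variables.
subsets <- unlist(
  lapply(seq_along(preds), function(k) combn(preds, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each subset and record Cp, AIC and BIC.
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response), data = dat)
  p <- length(coef(fit))          # coefficients, intercept included
  data.frame(model = paste(vars, collapse = " + "),
             size = length(vars),
             Cp = sum(resid(fit)^2) / s2_full - n + 2 * p,
             AIC = AIC(fit),
             BIC = BIC(fit),
             stringsAsFactors = FALSE)
}))

# Best model of each size according to Cp.
best <- do.call(rbind, lapply(split(results, results$size),
                              function(d) d[which.min(d$Cp), ]))
print(best, row.names = FALSE)
```

By construction the full model has Cp equal to its number of coefficients, so smaller models are judged by how close they get to the line Cp = p. With ten explanatory variables, as in the evaporation data, the same loop fits 1023 models.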
Model suggestions for different criteria
Suggested models
size  avst  minst  maxst  avat  minat  maxat  avh  minh  maxh  wind
   1     0      0      0     0      0      0    0     0     1     0
   2     0      0      0     0      0      1    0     0     1     0
   3     0      0      0     0      0      1    0     0     1     1
   4     1      0      1     0      0      1    0     0     1     0
   5     1      0      1     0      0      1    0     1     1     0
   6     1      0      1     0      0      1    0     1     1     1
   7     1      0      1     1      0      0    1     1     1     1
   8     1      0      1     1      0      1    1     1     1     1
   9     1      0      1     1      1      1    1     1     1     1
  10     1      1      1     1      1      1    1     1     1     1
Suggested models
- CV favours the model evap ~ maxat + maxh + wind
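Cross-validation compares models by how well they predict observations left out of the fit. The sketch below is a plain K-fold version written for illustration; it is not necessarily the procedure behind the slide, the built-in swiss data stands in for evap.df, and cv_error is a hypothetical helper.

```r
# K-fold cross-validated mean squared prediction error for an lm formula.
cv_error <- function(formula, data, K = 10, seed = 1) {
  set.seed(seed)                                   # same folds for every model
  folds <- sample(rep(seq_len(K), length.out = nrow(data)))
  response <- all.vars(formula)[1]                 # name of the response column
  sq_err <- numeric(nrow(data))
  for (k in seq_len(K)) {
    held_out <- folds == k
    fit <- lm(formula, data = data[!held_out, ])   # fit without fold k
    pred <- predict(fit, newdata = data[held_out, ])
    sq_err[held_out] <- (data[held_out, response] - pred)^2
  }
  mean(sq_err)
}

# Compare a small candidate model with the full model (stand-in data).
cv_small <- cv_error(Fertility ~ Education + Catholic, swiss)
cv_full  <- cv_error(Fertility ~ ., swiss)
print(c(small = cv_small, full = cv_full))
```

The model with the smaller cross-validated error is preferred; fixing the seed keeps the fold assignment identical across candidate models, so the comparison is like for like.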
Page 22/52
CV selected model: Summary
Call:
lm(formula = evap ~ maxat + maxh + wind, data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 123.901800 24.624411 5.032 9.60e-06 ***
maxat 0.222768 0.059113 3.769 0.000506 ***
maxh -0.342915 0.042776 -8.016 5.31e-10 ***
wind 0.015998 0.007197 2.223 0.031664 *
---
Residual standard error: 6.69 on 42 degrees of freedom
Multiple R-squared: 0.805, Adjusted R-squared: 0.7911
F-statistic: 57.8 on 3 and 42 DF, p-value: 5.834e-15
CV selected model: Diagnostic plot
[Figure: diagnostic plots for the CV selected model, with panels showing residuals, standardized residuals and Cook's distance. Labelled points include 8, 33 and 41.]
BIC selected model: Summary
Call:
lm(formula = evap ~ avst + maxst + maxat + maxh,
data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.30530 45.65538 1.321 0.19387
avst 1.29035 0.60287 2.140 0.03832 *
maxst -0.56355 0.18237 -3.090 0.00359 **
maxat 0.42601 0.09389 4.538 4.9e-05 ***
maxh -0.30734 0.05160 -5.956 5.0e-07 ***
---
Residual standard error: 6.433 on 41 degrees of freedom
Multiple R-squared: 0.824, Adjusted R-squared: 0.8069
F-statistic: 48 on 4 and 41 DF, p-value: 6.089e-15
BIC selected model: Diagnostic plot
[Figure: diagnostic plots for the BIC selected model, with panels showing residuals, standardized residuals and Cook's distance. Labelled points include 33 and 41.]
AIC selected model: Summary
Call:
lm(formula = evap ~ avst + maxst + maxat + minh + maxh,
data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 69.5476 45.2608 1.537 0.132265
avst 2.1304 0.8000 2.663 0.011104 *
maxst -0.6857 0.1955 -3.507 0.001136 **
maxat 0.2908 0.1265 2.299 0.026802 *
minh 0.6021 0.3852 1.563 0.125960
maxh -0.4712 0.1165 -4.045 0.000232 ***
---
Residual standard error: 6.323 on 40 degrees of freedom
Multiple R-squared: 0.8342, Adjusted R-squared: 0.8134
F-statistic: 40.24 on 5 and 40 DF, p-value: 1.411e-14
AIC selected model: Diagnostic plot
[Figure: diagnostic plots for the AIC selected model, with panels showing residuals, standardized residuals and Cook's distance. Labelled points include 33, 38 and 41.]
The Modelling Cycle: Our plan of attack
- Graphical check
- Preliminary fit
- Model selection
- Outlier check
Step 4: Checking for signal in residuals
[Figure: two residual plots side by side (horizontal axes from 0 to 50). Labelled points include 8, 33 and 41.]
Transformations for Response?
[Figure: Box-Cox plots for the CV model and for the BIC model. Vertical axis: log-likelihood; horizontal axis: power from -2 to 2; a 95% interval is marked on each.]
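A Box-Cox profile like the ones above can be sketched with boxcox() from the MASS package, which ships with standard R installations. This is a swapped-in illustration, not necessarily the function behind the slide, and the built-in swiss data stands in for evap.df.

```r
# Stand-in linear model; Box-Cox needs a strictly positive response.
fit <- lm(Fertility ~ ., data = swiss)

# Profile log-likelihood over powers from -2 to 2 (also draws the plot).
bc <- MASS::boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = TRUE)

# Power with the highest log-likelihood.
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: powers whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum.
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])
print(c(lambda_hat = lambda_hat, lower = ci[1], upper = ci[2]))
```

If the interval contains 1, there is little evidence that transforming the response would help; an interval around 0 would point to a log transformation.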
Transformations for Variables: Cross Validation
[Figure: fitted smooth terms for the CV model, s(maxat, 1.22), s(maxh, 3.22) and s(wind, 1), plotted against maxat, maxh and wind respectively.]
Transformations for Variables: Cross Validation
Call:
lm(formula = evap ~ maxat + poly(maxh, 3) + wind, data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.041666 10.932152 -0.736 0.46627
maxat 0.201854 0.058888 3.428 0.00142 **
poly(maxh, 3)1 -69.247897 8.132711 -8.515 1.61e-10 ***
poly(maxh, 3)2 3.167952 6.631289 0.478 0.63544
poly(maxh, 3)3 15.969100 6.550355 2.438 0.01931 *
wind 0.015351 0.006896 2.226 0.03170 *
---
Residual standard error: 6.369 on 40 degrees of freedom
Multiple R-squared: 0.8317, Adjusted R-squared: 0.8107
F-statistic: 39.54 on 5 and 40 DF, p-value: 1.876e-14
Transformations for Variables: Cross Validation
[Figure: diagnostic plots for the CV model with the cubic term in maxh, with panels showing residuals, standardized residuals and Cook's distance. Labelled points include 6, 8, 33 and 41.]
Transformations for Variables: BIC
[Figure: fitted smooth terms for the BIC model, s(avst, 3.76), s(maxst, 1), s(maxat, 1) and s(maxh, 3.28), plotted against avst, maxst, maxat and maxh respectively.]
Transformations for Variables: BIC
Call:
lm(formula = evap ~ poly(avst, 3) + maxst + maxat
+ poly(maxh, 3), data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 61.32493 23.91563 2.564 0.01453 *
poly(avst, 3)1 57.17955 20.51578 2.787 0.00834 **
poly(avst, 3)2 21.68382 8.41793 2.576 0.01413 *
poly(avst, 3)3 14.24292 7.31494 1.947 0.05914 .
maxst -0.72595 0.15901 -4.565 5.35e-05 ***
maxat 0.52135 0.09295 5.609 2.13e-06 ***
poly(maxh, 3)1 -64.45204 9.89038 -6.517 1.26e-07 ***
poly(maxh, 3)2 -10.10864 8.01830 -1.261 0.21531
poly(maxh, 3)3 15.92728 6.37772 2.497 0.01709 *
---
Residual standard error: 5.346 on 37 degrees of freedom
Multiple R-squared: 0.8903, Adjusted R-squared: 0.8666
F-statistic: 37.55 on 8 and 37 DF, p-value: 1.794e-15
Transformations for Variables: BIC
Call:
lm(formula = evap ~ poly(avst, 2) + maxst + maxat
+ poly(maxh, 3), data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.24568 24.69042 2.643 0.011885 *
poly(avst, 2)1 52.57309 21.11410 2.490 0.017267 *
poly(avst, 2)2 25.04976 8.53573 2.935 0.005637 **
maxst -0.65890 0.16084 -4.097 0.000212 ***
maxat 0.43969 0.08595 5.116 9.24e-06 ***
poly(maxh, 3)1 -71.74260 9.48446 -7.564 4.30e-09 ***
poly(maxh, 3)2 -13.57454 8.10027 -1.676 0.101985
poly(maxh, 3)3 10.75983 6.00852 1.791 0.081300 .
---
Residual standard error: 5.539 on 38 degrees of freedom
Multiple R-squared: 0.8791, Adjusted R-squared: 0.8568
F-statistic: 39.47 on 7 and 38 DF, p-value: 1.599e-15
Transformations for Variables: BIC
Call:
lm(formula = evap ~ poly(avst, 2) + maxst + maxat + maxh,
data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 190.07107 28.88199 6.581 7.22e-08 ***
poly(avst, 2)1 51.96140 22.44731 2.315 0.025838 *
poly(avst, 2)2 18.49093 6.22026 2.973 0.004980 **
maxst -0.65449 0.16987 -3.853 0.000413 ***
maxat 0.48866 0.08857 5.517 2.25e-06 ***
maxh -0.33992 0.04853 -7.004 1.85e-08 ***
---
Residual standard error: 5.894 on 40 degrees of freedom
Multiple R-squared: 0.8559, Adjusted R-squared: 0.8378
F-statistic: 47.5 on 5 and 40 DF, p-value: 8.841e-16
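Whether a polynomial term earns its place can also be checked with a nested-model F test, which complements reading the individual t values in the summaries above. A minimal sketch, assuming the built-in swiss data as a stand-in for evap.df; the choice of variables and the names fit_lin and fit_cub are illustrative.

```r
# Model with a linear term only in the variable of interest (stand-in data).
fit_lin <- lm(Fertility ~ Agriculture + Education + Catholic, data = swiss)

# Same model with the linear term replaced by a cubic orthogonal polynomial.
fit_cub <- lm(Fertility ~ Agriculture + Education + poly(Catholic, 3),
              data = swiss)

# F test of the two extra polynomial coefficients (quadratic and cubic).
cmp <- anova(fit_lin, fit_cub)
print(cmp)
p_value <- cmp[2, "Pr(>F)"]
```

Because poly() builds orthogonal polynomials by default, its coefficients are not on the scale of the original variable; they are useful for testing which degree is needed, not for reading off slopes.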
Transformations for Variables: BIC
[Figure: diagnostic plots for the BIC transformed model, with panels showing residuals, standardized residuals and Cook's distance. Labelled points include 7, 33 and 41.]
[Figure: Residuals vs Fitted plots, left for the CV model and right for the BIC transformed model. Labelled points include 33 and 41.]
Comparisons
[Figure: two plots of standardized residuals, one annotated "WBtest: p=0.1" and the other "WBtest: p=0". Labelled points include 8, 33 and 41.]
Step 4: Summary
The Modelling Cycle: Our plan of attack
- Graphical check
- Preliminary fit
- Outlier check
  - Check for outliers
  - Test effects of removing outliers
Step 5: Diagnostics for outliers for CV model
[Figure: index plots of influence measures for the CV model (horizontal axes: observation number, 0 to 50): DFBETAS for each coefficient, DFFITS, ABS(COV RATIO-1), Cook's D and Hats. Labelled points include 2, 6, 7, 8, 33, 40 and 41.]
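The same influence measures are available in base R through influence.measures(), which returns DFBETAS for each coefficient, DFFITS, the covariance ratio, Cook's distance and the hat values. A minimal sketch, assuming the built-in swiss data as a stand-in for evap.df; it tabulates the measures rather than reproducing the lecture's plots.

```r
# Stand-in linear model (not one of the evaporation models).
fit <- lm(Fertility ~ ., data = swiss)

# One row per observation; columns dfb.*, dffit, cov.r, cook.d and hat.
infl <- influence.measures(fit)
print(head(round(infl$infmat, 3)))

# Observations flagged as potentially influential on at least one measure,
# using the default cut-offs built into influence.measures().
flagged <- which(apply(infl$is.inf, 1, any))
print(flagged)

# Index plot of Cook's distance, one of the panels in the figure above.
plot(cooks.distance(fit), type = "h", xlab = "Observation number",
     ylab = "Cook's D")
```

Flagged points are candidates for the next step, refitting without them and checking whether the conclusions change.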
CV summary
Diagnostics for outliers for BIC model
[Figure: index plots of influence measures for the BIC model (horizontal axes: observation number, 0 to 50): DFBETAS for each coefficient, DFFITS, ABS(COV RATIO-1), Cook's D and Hats. Labelled points include 2, 6, 7, 32, 33, 38 and 41.]
BIC summary
The Modelling Cycle: Our plan of attack
- Graphical check
- Preliminary fit
- Outlier check
Step 6: Interpretation (BIC model)
Step 6: Interpretation of negative maxst coefficient
[Figure, left: evap.df$evap plotted against evap.df$avst, with each point labelled by its maxst group (1 to 6). Right: conditioning plot of evap against maxst, given avst.]
Note: in the left-hand picture, we have coded the range of maxst into groups, with 1 being the lowest and 6 the highest. For fixed avst, the evaporation goes down as maxst increases.
Step 6: Prediction
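Prediction from a fitted model can be sketched with predict(), which returns either a confidence interval for the mean response or a wider prediction interval for a single new observation. This is an illustration only: the built-in swiss data stands in for evap.df, and the model and the new values are hypothetical.

```r
# Stand-in model with two explanatory variables.
fit <- lm(Fertility ~ Education + Catholic, data = swiss)

# Hypothetical new observation at which to predict.
new_obs <- data.frame(Education = 10, Catholic = 50)

# 95% confidence interval for the mean response at these values.
conf_int <- predict(fit, newdata = new_obs, interval = "confidence",
                    level = 0.95)

# 95% prediction interval for one new response at these values.
pred_int <- predict(fit, newdata = new_obs, interval = "prediction",
                    level = 0.95)

print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```

Both intervals share the same point prediction; the prediction interval is wider because it adds the variability of an individual observation to the uncertainty in the fitted mean.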
Step 6: Summary