Variance
12.08.2014
Office hours
- Lecturers (email addresses are @auckland.ac.nz)
Name            Office    Email     Day, time
Steffen Klaere  303.219   s.klaere  Wed, 13:00-15:00
Alan Lee        303S.265  aj.lee    Tue, 10:30-12:00
                                    Thu, 10:30-12:00
http://statweb.stanford.edu/~tibs/ElemStatLearn/
R-hint of the day
Independence: ...
Outliers: ...
GAM plots and splines
... which variable we need to transform and how.
- The shape could be a polynomial...
[Figure: GAM plot of s(diameter,2.69) against diameter]
GAM plots and splines
... which variable we need to transform and how.
- The shape could be a polynomial...
- ...or a locally smoothed function.
[Figure: GAM plot of s(tensile,5.42) against tensile]
GAM plots and splines
... which variable we need to transform and how.
- The shape could be a polynomial...
- ...or a locally smoothed function.
- But be aware of the distribution of the observations
[Figure: GAM plot of s(medianIncome,3.03) against medianIncome]
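A GAM plot of this kind can be produced along the following lines. This is only a sketch: the mgcv package and the built-in `trees` data are assumptions standing in for the course's own software and data sets, and `k = 6` is an arbitrary cap on the flexibility of each smooth.

```r
# Sketch only: mgcv and the built-in `trees` data are assumptions,
# not necessarily what was used to draw the plots in these slides.
library(mgcv)

# One smooth term per explanatory variable; k caps the basis size.
tree_gam <- gam(Volume ~ s(Girth, k = 6) + s(Height, k = 6), data = trees)

# One panel per smooth. The rug marks where the observations lie, so
# that regions with few points are not over-interpreted.
plot(tree_gam, pages = 1, residuals = TRUE, rug = TRUE)

# Effective degrees of freedom of each smooth: the second number in a
# y-axis label such as s(diameter,2.69).
edf <- summary(tree_gam)$s.table[, "edf"]
print(edf)
```

An effective degrees of freedom value close to 1 suggests a straight line is adequate; larger values point towards a polynomial or spline term.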
GAM plots and splines
Call:
lm(formula = abloss ~ poly(tensile, 4) + hardness,
data = rubber.df)
--
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)         615.4012    29.2893  21.011  < 2e-16 ***
poly(tensile, 4)1  -264.4043    24.6171 -10.741 1.20e-10 ***
poly(tensile, 4)2    23.6269    24.8947   0.949 0.352043
poly(tensile, 4)3   119.9408    24.1991   4.956 4.64e-05 ***
poly(tensile, 4)4   -91.6965    23.2722  -3.940 0.000613 ***
hardness             -6.2614     0.4124 -15.182 8.35e-14 ***
---
Residual standard error: 23.25 on 24 degrees of freedom
Multiple R-squared: 0.9423, Adjusted R-squared: 0.9303
F-statistic: 78.46 on 5 and 24 DF, p-value: 4.504e-14
GAM plots and splines
Call:
lm(formula = abloss ~ bs(tensile, df = 4, degree = 3) + hardness,
data = rubber.df)
--
Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                  612.1556    43.0348  14.225 3.43e-13 ***
bs(tensile,df=4,degree=3)1   195.5549    40.6339   4.813 6.69e-05 ***
bs(tensile,df=4,degree=3)2  -148.3497    38.6717  -3.836 0.000796 ***
bs(tensile,df=4,degree=3)3   -24.2971    37.7010  -0.644 0.525385
bs(tensile,df=4,degree=3)4   -61.0593    25.4829  -2.396 0.024720 *
hardness                      -6.1914     0.4139 -14.959 1.15e-13 ***
---
Residual standard error: 23.36 on 24 degrees of freedom
Multiple R-squared: 0.9418, Adjusted R-squared: 0.9297
F-statistic: 77.7 on 5 and 24 DF, p-value: 5.021e-14
GAM plots and splines
Call:
lm(formula = abloss ~ bs(tensile, knots = 180) + hardness,
data = rubber.df)
--
Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                 614.273     43.043  14.271 3.19e-13 ***
bs(tensile, knots = 180)1   196.594     41.116   4.781 7.24e-05 ***
bs(tensile, knots = 180)2  -161.450     38.676  -4.174 0.000339 ***
bs(tensile, knots = 180)3   -21.134     37.451  -0.564 0.577774
bs(tensile, knots = 180)4   -65.016     25.485  -2.551 0.017529 *
hardness                     -6.191      0.415 -14.917 1.23e-13 ***
---
Residual standard error: 23.42 on 24 degrees of freedom
Multiple R-squared: 0.9415, Adjusted R-squared: 0.9293
F-statistic: 77.26 on 5 and 24 DF, p-value: 5.351e-14
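The two spline fits above are nearly identical, and the sketch below suggests why. With `df = 4` and `degree = 3`, `bs()` has room for exactly one interior knot and places it at the median of the variable, so it differs from `knots = 180` only in where that single knot sits. The data set `MASS::Rubber` (columns `loss`, `hard`, `tens`) is an assumption standing in for `rubber.df`.

```r
library(splines)  # ships with R
library(MASS)     # assumption: MASS::Rubber stands in for rubber.df

x <- Rubber$tens

# df = 4 with a cubic basis leaves df - degree = 1 interior knot,
# which bs() puts at the median of x.
basis_df <- bs(x, df = 4, degree = 3)
knot_df <- unname(attr(basis_df, "knots"))
print(knot_df)
print(median(x))

# Specifying that knot by hand therefore gives the same fitted values.
fit_df <- lm(loss ~ bs(tens, df = 4, degree = 3) + hard, data = Rubber)
fit_knot <- lm(loss ~ bs(tens, knots = median(tens)) + hard, data = Rubber)
same_fit <- isTRUE(all.equal(unname(fitted(fit_df)),
                             unname(fitted(fit_knot))))
print(same_fit)
```

Choosing the knot by hand, as in `knots = 180`, is useful when a plot or subject knowledge suggests where the relationship changes shape.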
Aims of today's lecture
- Theory
- Ladder of powers
- Polynomials
$$\text{volume} = \frac{\pi}{3} \times \text{height} \times \frac{\text{diameter}^2}{4}$$
Call:
lm(formula = log(volume) ~ log(height) + log(diameter),
data = cherry.df)
---
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.70492    0.88190  -1.933   0.0634 .
log(height)    1.11712    0.20444   5.464 7.81e-06 ***
log(diameter)  1.98265    0.07501  26.432  < 2e-16 ***
---
Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16
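Taking logs of the cone formula gives log(volume) = log(pi/12) + 1 x log(height) + 2 x log(diameter), so theory predicts slopes of 1 and 2. The estimates above, 1.117 (s.e. 0.204) and 1.983 (s.e. 0.075), are both within one standard error of those values. The check can be sketched as follows, assuming that the built-in `trees` data (where `Girth` is the diameter) can stand in for `cherry.df`; a difference in measurement units would only shift the intercept.

```r
# Assumption: the built-in `trees` data stands in for cherry.df.
# Girth is the tree diameter; different units change only the intercept.
cone_fit <- lm(log(Volume) ~ log(Height) + log(Girth), data = trees)

slopes <- coef(cone_fit)[c("log(Height)", "log(Girth)")]
print(round(slopes, 3))

# Do the 95% confidence intervals cover the theoretical slopes 1 and 2?
theory <- c(1, 2)
ci <- confint(cone_fit)[c("log(Height)", "log(Girth)"), ]
covers <- ci[, 1] <= theory & theory <= ci[, 2]
print(covers)
```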
Using theory: Cherry trees
[Figure: two residual plots (y-axis: Residuals); labelled points include 15, 16, 18 and 31]
Refitting model: Tyre abrasion data
> lm(abloss~hardness+poly(tensile,4),data=rubber.df)
- Usually a lot of trial and error is involved
- We have succeeded when
  - R² improves
  - Residual plots show no pattern
Call:
lm(formula = abloss ~ hardness + poly(tensile, 5),
data = rubber.df)
---
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)         615.3617    29.8178  20.637 2.44e-16 ***
hardness             -6.2608     0.4199 -14.911 2.59e-13 ***
poly(tensile, 5)1  -264.3933    25.0612 -10.550 2.76e-10 ***
poly(tensile, 5)2    23.6148    25.3437   0.932 0.361129
poly(tensile, 5)3   119.9500    24.6356   4.869 6.46e-05 ***
poly(tensile, 5)4   -91.6951    23.6920  -3.870 0.000776 ***
poly(tensile, 5)5     9.3811    23.6684   0.396 0.695495
---
Residual standard error: 23.67 on 23 degrees of freedom
Multiple R-squared: 0.9427, Adjusted R-squared: 0.9278
F-statistic: 63.11 on 6 and 23 DF, p-value: 3.931e-13
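Going from degree 4 to degree 5 raises the multiple R² only from 0.9423 to 0.9427, while the adjusted R² drops from 0.9303 to 0.9278 and the fifth-order term has p = 0.695. A comparison of this kind might be scripted as below; `MASS::Rubber` (columns `loss`, `hard`, `tens`) is an assumption standing in for `rubber.df`.

```r
library(MASS)  # assumption: MASS::Rubber stands in for rubber.df

fit4 <- lm(loss ~ hard + poly(tens, 4), data = Rubber)
fit5 <- lm(loss ~ hard + poly(tens, 5), data = Rubber)

# Multiple R-squared can never decrease when a term is added, so it is
# compared here with the adjusted version, which penalises extra terms.
r2 <- c(deg4 = summary(fit4)$r.squared,
        deg5 = summary(fit5)$r.squared)
adj_r2 <- c(deg4 = summary(fit4)$adj.r.squared,
            deg5 = summary(fit5)$adj.r.squared)
print(rbind(r2, adj_r2))

# F-test for the extra fifth-order term (nested models).
print(anova(fit4, fit5))
```

Because the models are nested, the multiple R² alone cannot decide between them; the adjusted R² and the F-test are the more informative summaries.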
Ladder of powers
- In practice this means that the scatter does not depend on the explanatory variables or the mean of the response.
- This means the big residuals happen when the fitted values are big.
- Variables are ...
- Fit model
> lm(educ ~ percap + under18 + urban,data=educ.df)
Pairs plot: Education expenditure data
[Figure: pairs plot of educ, urban, percap and under18; correlations shown in the panels include 0.32, 0.61 and 0.63]
Call:
lm(formula = educ ~ urban + percap + under18, data = educ.df)
---
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -555.92562  123.46634  -4.503 4.56e-05 ***
urban         -0.00476    0.05174  -0.092    0.927
percap         0.07236    0.01165   6.211 1.40e-07 ***
under18        1.55134    0.31545   4.918 1.16e-05 ***
---
Residual standard error: 40.53 on 46 degrees of freedom
Multiple R-squared: 0.5902, Adjusted R-squared: 0.5634
F-statistic: 22.08 on 3 and 46 DF, p-value: 5.271e-09
Basic fit, outlier out
Call:
lm(formula = educ ~ urban + percap + under18, data = educ.df,
subset = -50)
---
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430  132.61422  -2.097 0.041664 *
urban          0.06624    0.04966   1.334 0.188948
percap         0.04827    0.01220   3.958 0.000266 ***
under18        0.88983    0.33159   2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07
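Dropping observation 50 changes the estimates considerably (for example, the coefficient of under18 falls from 1.55 to 0.89). The mechanics of such a refit can be sketched on simulated data; everything below (the data, the true line with intercept 1 and slope 2, the planted outlier) is made up for illustration and is not the education data.

```r
# Simulated illustration: 49 well-behaved points plus one planted
# outlier in row 50. None of this is the education expenditure data.
set.seed(1)
n <- 50
x <- c(runif(n - 1, 0, 10), 30)   # row 50 has an extreme x value
y <- 1 + 2 * x + rnorm(n)
y[n] <- 0                          # ...and lies far below the line
sim <- data.frame(x = x, y = y)

full_fit <- lm(y ~ x, data = sim)

# Cook's distance flags the observation that moves the fit the most.
worst <- unname(which.max(cooks.distance(full_fit)))
print(worst)

# Refit without it, using the same subset = -i idiom as above.
refit <- lm(y ~ x, data = sim, subset = -worst)
print(rbind(full = coef(full_fit), refit = coef(refit)))
```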
Residual analysis
> par(mfrow=c(1,2))
> plot(educ50.lm,which=c(1,2))
[Figure: residuals against fitted values, and normal Q-Q plot of the standardized residuals; labelled points: 7, 10, 15]
[Figure: two Scale-Location plots (y-axis: Standardized residuals); labelled points: 7, 10, 14, 15]
$$Y^p = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
[Figure: log-likelihood plotted against the power (from -2 to 2), with the 95% interval marked]
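A log-likelihood curve over the power p can be obtained with `boxcox()` from the MASS package (which calls the power lambda). The sketch below uses simulated data in which the square root of the response is linear in x by construction, so the curve should peak near p = 0.5; the data and all names are made up for illustration.

```r
library(MASS)

# Simulated data: sqrt(y) is linear in x, so the "right" power is 0.5.
set.seed(42)
n <- 200
x <- runif(n, 1, 10)
y <- (2 + 3 * x + rnorm(n))^2
sim <- data.frame(x = x, y = y)

# y = TRUE stores the response in the fit, which boxcox() needs.
fit <- lm(y ~ x, data = sim, y = TRUE)

# Profile log-likelihood on a grid of powers, without plotting.
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
best_p <- bc$x[which.max(bc$y)]
print(best_p)

# The same call with plotting draws the curve and a 95% interval.
boxcox(fit, lambda = seq(-2, 2, by = 0.05))
```

In practice one usually picks a convenient power inside the 95% interval (such as 0.5, 0 for the log, or -1) rather than the exact maximiser.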
Weighted least squares
[Figure: two panels with y-axes Squared residuals and Log std. errors]
Call:
lm(formula = educ ~ urban + percap + under18,
data = educ.df, subset = -50)
--
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430  132.61422  -2.097 0.041664 *
urban          0.06624    0.04966   1.334 0.188948
percap         0.04827    0.01220   3.958 0.000266 ***
under18        0.88983    0.33159   2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07
Weighted Model
Call:
lm(formula = educ ~ urban + percap + under18,
data = educ.df[-50,], weights = 1/vars)
--
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -270.29363  102.61073  -2.634   0.0115 *
urban          0.01197    0.04030   0.297   0.7677
percap         0.05850    0.01027   5.694 8.88e-07 ***
under18        0.82384    0.27234   3.025   0.0041 **
---
Residual standard error: 1.019 on 45 degrees of freedom
Multiple R-squared: 0.629, Adjusted R-squared: 0.6043
F-statistic: 25.43 on 3 and 45 DF, p-value: 8.944e-10
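The slides do not show how `vars` was computed. One common approach, sketched here on simulated data, is to model the squared residuals of the unweighted fit and use the reciprocals of the fitted variances as weights. The data, the log-log form of the variance model and all names below are assumptions made for illustration.

```r
# Simulated data whose scatter grows with x (not the education data).
set.seed(7)
n <- 200
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)
sim <- data.frame(x = x, y = y)

# Step 1: ordinary (unweighted) least squares.
ols <- lm(y ~ x, data = sim)

# Step 2: model the log squared residuals, then back-transform to get
# a variance estimate for every observation. The form of this
# auxiliary model is an assumption, not the method used in the slides.
aux <- lm(log(residuals(ols)^2) ~ log(x), data = sim)
vars <- exp(fitted(aux))

# Step 3: refit with weights inversely proportional to the variances.
wls <- lm(y ~ x, data = sim, weights = 1 / vars)
print(rbind(ols = coef(ols), wls = coef(wls)))
```

Only the relative sizes of the weights affect the coefficient estimates. If the weights are close to the true reciprocal variances, the residual standard error of the weighted fit should be near 1, which is consistent with the value 1.019 reported above.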
Diagnostic steps
Independence: ...
Outliers: ...
Normality
[Figure: four normal Q-Q plots (y-axis: Sample Quantiles)]
> qqnorm(residuals(cherry.cone))
> WB.test(cherry.cone)
WB test statistic = 0.983
p = 0.36
Since the p-value is large, there is no evidence against normality.
[Figure: normal Q-Q plot of the residuals of cherry.cone (Sample Quantiles against Theoretical Quantiles)]
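`WB.test` is not part of base R. Where it is unavailable, a similar check can be sketched with the Shapiro-Wilk test from the stats package, which is a different test swapped in here, together with the squared correlation of the points in the Q-Q plot. The built-in `trees` data is an assumption standing in for `cherry.df`, and the exact definition used by `WB.test` is not shown in the source.

```r
# Assumption: the built-in `trees` data stands in for cherry.df.
cone_fit <- lm(log(Volume) ~ log(Height) + log(Girth), data = trees)
res <- residuals(cone_fit)

# Normal Q-Q plot with a reference line.
qqnorm(res)
qqline(res)

# Shapiro-Wilk test (a substitute for WB.test, not the same test).
sw <- shapiro.test(res)
print(sw)

# Squared correlation between the ordered residuals and approximate
# normal scores: values near 1 mean the Q-Q plot is close to straight.
n <- length(res)
qq_r2 <- cor(sort(res), qnorm(ppoints(n, a = 3/8)))^2
print(qq_r2)
```

As with the WB test, a large p-value gives no evidence against normality; it does not prove that the errors are normal.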
Remedies for Non-normality
- The idea: the model fits poorly on the original scale but fits well on the transformed scale.
Outliers: ...
Independence: ...
http://xkcd.com/552/