
STATS 330: Lecture 10

Variance

12.08.2014
Office hours

Lecturers:

  Name            Office    Email (@auckland.ac.nz)  Hours
  Steffen Klaere  303.219   s.klaere                 Wed 13:00-15:00
  Alan Lee        303S.265  aj.lee                   Tue 10:30-12:00,
                                                     Thu 10:30-12:00

Tutors (Room 303.326):

  Name           Email (@aucklanduni.ac.nz)  Hours
  Savannah Post  spos008                     Mon 10:00-12:00, Thu 14:30-15:30
  Leshun Xu      lxu472                      Tue 13:00-14:00, Wed 13:00-14:00,
                                             Thu 13:00-14:00
  Hongbin Guo    hguo033                     Tue 11:00-12:00, Wed 14:00-16:00,
                                             Thu 10:00-11:00, Fri 11:00-12:00
Book suggestion

eries in Statistics Springer Series in Statistics

Hastie Tibshirani Friedman


Trevor Hastie
e Robert Tibshirani Jerome Friedman
ents of Statictical Learning Robert Tibshirani
Jerome Friedman
de there has been an explosion in computation and information tech-
come vast amounts of data in a variety of fields such as medicine, biolo-
keting. The challenge of understanding these data has led to the devel-
in the field of statistics, and spawned new areas such as data mining,
d bioinformatics. Many of these tools have common underpinnings but
The Elements of
Statistical Learning
with different terminology. This book describes the important ideas in
The Elements of Statistical Learning

mon conceptual framework. While the approach is statistical, the


pts rather than mathematics. Many examples are given, with a liberal
. It should be a valuable resource for statisticians and anyone interested
ence or industry. The books coverage is broad, from supervised learning
pervised learning. The many topics include neural networks, support
sification trees and boostingthe first comprehensive treatment of this
Data Mining, Inference, and Prediction
on features many topics not covered in the original, including graphical
sts, ensemble methods, least angle regression & path algorithms for the
atrix factorization, and spectral clustering. There is also a chapter on
ata (p bigger than n), including multiple testing and false discovery rates.

t Tibshirani, and Jerome Friedman are professors of statistics at


They are prominent researchers in this area: Hastie and Tibshirani Second Edition
d additive models and wrote a popular book of that title. Hastie co-
he statistical modeling software and environment in R/S-PLUS and
rves and surfaces. Tibshirani proposed the lasso and is co-author of the
troduction to the Bootstrap. Friedman is the co-inventor of many data-
ng CART, MARS, projection pursuit and gradient boosting.

http://statweb.stanford.edu/~tibs/ElemStatLearn/
R-hint of the day

Saving a plot to a PDF file:

> pdf("name.pdf", width=8, height=8)   # open a PDF graphics device
> par(mfrow=c(2,2))                    # 2 x 2 grid of plots
> plot(mymodel.lm, which=c(1,2,3,5))   # four standard diagnostic plots
> dev.off()                            # close the device to write the file
Diagnostic steps

We test for <> using <>:

Planarity: Residuals vs. fitted values, added variable plots,
           GAM plots

Constant Variance: ...

Normality of Errors: ...

Independence: ...

Outliers: ...
GAM plots and splines

- We use GAM plots to decide which variables we need to
  transform, and how.

- The shape could be a polynomial...

- ...or a locally smoothed function.

- But be aware of the distribution of the observations.

[GAM plots of the fitted smooths: s(diameter, 2.69) against
 diameter, s(tensile, 5.42) against tensile, and
 s(medianIncome, 3.03) against medianIncome]
GAM plots and splines

Call:
lm(formula = abloss ~ poly(tensile, 4) + hardness,
data = rubber.df)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 615.4012 29.2893 21.011 < 2e-16 ***
poly(tensile, 4)1 -264.4043 24.6171 -10.741 1.20e-10 ***
poly(tensile, 4)2 23.6269 24.8947 0.949 0.352043
poly(tensile, 4)3 119.9408 24.1991 4.956 4.64e-05 ***
poly(tensile, 4)4 -91.6965 23.2722 -3.940 0.000613 ***
hardness -6.2614 0.4124 -15.182 8.35e-14 ***
---
Residual standard error: 23.25 on 24 degrees of freedom
Multiple R-squared: 0.9423, Adjusted R-squared: 0.9303
F-statistic: 78.46 on 5 and 24 DF, p-value: 4.504e-14
GAM plots and splines

Call:
lm(formula = abloss ~ bs(tensile, df = 4, degree = 3) + hardness,
data = rubber.df)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 612.1556 43.0348 14.225 3.43e-13 ***
bs(tensile,df=4,degree=3)1 195.5549 40.6339 4.813 6.69e-05 ***
bs(tensile,df=4,degree=3)2 -148.3497 38.6717 -3.836 0.000796 ***
bs(tensile,df=4,degree=3)3 -24.2971 37.7010 -0.644 0.525385
bs(tensile,df=4,degree=3)4 -61.0593 25.4829 -2.396 0.024720 *
hardness -6.1914 0.4139 -14.959 1.15e-13 ***
---
Residual standard error: 23.36 on 24 degrees of freedom
Multiple R-squared: 0.9418, Adjusted R-squared: 0.9297
F-statistic: 77.7 on 5 and 24 DF, p-value: 5.021e-14
GAM plots and splines

Call:
lm(formula = abloss ~ bs(tensile, knots = 180) + hardness,
data = rubber.df)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 614.273 43.043 14.271 3.19e-13 ***
bs(tensile, knots = 180)1 196.594 41.116 4.781 7.24e-05 ***
bs(tensile, knots = 180)2 -161.450 38.676 -4.174 0.000339 ***
bs(tensile, knots = 180)3 -21.134 37.451 -0.564 0.577774
bs(tensile, knots = 180)4 -65.016 25.485 -2.551 0.017529 *
hardness -6.191 0.415 -14.917 1.23e-13 ***
---
Residual standard error: 23.42 on 24 degrees of freedom
Multiple R-squared: 0.9415, Adjusted R-squared: 0.9293
F-statistic: 77.26 on 5 and 24 DF, p-value: 5.351e-14
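
To reproduce the spline fits above, note that bs() comes from the
splines package, which ships with R, so it must be loaded first;
for example:

> library(splines)
> rubber.bs.lm <- lm(abloss ~ bs(tensile, df=4, degree=3) + hardness,
+                    data=rubber.df)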
Aims of today's lecture

- To describe some more remedies for non-planar data.

- To look at diagnostics and remedies for non-constant scatter.

- To recapitulate and discuss tests for normality.


Remedies for non-planar data

- Last time we looked at diagnostics for non-planar data.

- We discussed what to do if the diagnostics indicate a problem.

- The short answer was: we transform, so that the model fits the
  transformed data.

- How to choose a transformation?
  - Theory
  - Ladder of powers
  - Polynomials

- We illustrate with a few examples.


Using theory: Cherry trees

- As we have observed, a tree trunk is a bit like a cone, i.e.
  volume, height and diameter are related by

      volume = (π/12) * diameter^2 * height

  so that

      log(volume) = log(π/12) + 2 log(diameter) + log(height)

- So a linear regression using the logged variables should work!

- In fact, R^2 increases from 94.8% for the untransformed model
  to 97.7%.

- Also, the fitted intercept b0 = -1.70 is closer to the cone's
  log(π/12) ≈ -1.34 than to a cylinder's log(π/4) ≈ -0.24.
Using theory: Cherry trees

Call:
lm(formula = log(volume) ~ log(height) + log(diameter),
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.70492 0.88190 -1.933 0.0634 .
log(height) 1.11712 0.20444 5.464 7.81e-06 ***
log(diameter) 1.98265 0.07501 26.432 < 2e-16 ***
---
Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16
Using theory: Cherry trees

[Residuals vs Fitted plots: left for lm(volume ~ height + diameter),
 right for lm(log(volume) ~ log(height) + log(diameter))]
Using GAM plots: Tyre abrasion data

> rubber.gam <- gam(abloss ~ s(hardness) + s(tensile),
+                   data=rubber.df)
> plot(rubber.gam, page=1)

[GAM plot: s(tensile, 5.42) against tensile]
Refitting model: Tyre abrasion data

- The GAM curve for tensile looks like a polynomial, so fit a
  polynomial:

  > lm(abloss ~ hardness + poly(tensile,4), data=rubber.df)

- Usually a lot of trial and error is involved.

- We have succeeded when
  - R^2 improves, and
  - the residual plots show no pattern.

- A 4th degree polynomial works well for the rubber data: R^2
  improves from 84% to 94%.
Why 4th degree? Tyre abrasion data

Call:
lm(formula = abloss ~ hardness + poly(tensile, 5),
data = rubber.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 615.3617 29.8178 20.637 2.44e-16 ***
hardness -6.2608 0.4199 -14.911 2.59e-13 ***
poly(tensile, 5)1 -264.3933 25.0612 -10.550 2.76e-10 ***
poly(tensile, 5)2 23.6148 25.3437 0.932 0.361129
poly(tensile, 5)3 119.9500 24.6356 4.869 6.46e-05 ***
poly(tensile, 5)4 -91.6951 23.6920 -3.870 0.000776 ***
poly(tensile, 5)5 9.3811 23.6684 0.396 0.695495
---
Residual standard error: 23.67 on 23 degrees of freedom
Multiple R-squared: 0.9427, Adjusted R-squared: 0.9278
F-statistic: 63.11 on 6 and 23 DF, p-value: 3.931e-13

The 5th-order term is not significant (p = 0.695), so degree 4
suffices.
Ladder of powers

- Rather than fit polynomials in some independent variables,
  guided by GAM plots, we can transform the response using the
  ladder of powers.

- That is, use y^p as the response rather than y, for some
  power p.

- Choose p either by trial and error using R^2, or use a Box-Cox
  plot (see later in this lecture). A sketch of the
  trial-and-error approach follows.
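
A minimal sketch of the trial-and-error approach, using the
education expenditure data introduced later in this lecture
(assumes educ.df is loaded); note that R^2 values computed on
different response scales are only a rough guide:

> for (p in c(-1, -0.5, 0.5, 1, 2)) {
+   fit <- lm(I(educ^p) ~ urban + percap + under18, data=educ.df)
+   cat("p =", p, " R^2 =", round(summary(fit)$r.squared, 3), "\n")
+ }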
Checking for equal scatter

- The model specifies that the scatter about the regression
  plane is uniform.

- In practice this means that the scatter does not depend on the
  explanatory variables or on the mean of the response.

- All tests and confidence intervals rely on this!


Checking for equal scatter

- Scatter is measured by the size of the residuals.

- A common problem is that the scatter increases as the mean
  response increases.

- This means the big residuals occur when the fitted values are
  big.

- Recognise this by a "funnel" effect in the residuals versus
  fitted values plot.
Example: Education expenditure data

- Data for 50 states of the USA.

- Variables are
    educ:    per capita expenditure on education (response)
    percap:  per capita income
    under18: number of residents per 1000 under 18
    urban:   number of residents per 1000 in urban areas

- Fit the model
  > lm(educ ~ percap + under18 + urban, data=educ.df)
Pairs plot: Education expenditure data

[Pairs plot of educ, urban, percap and under18, with pairwise
 correlations shown in the lower panels]
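
A hedged sketch of how such a plot can be produced; the course's
R330 package is assumed here to provide pairs20x(), which adds the
correlations, while base R's pairs() gives the plain scatterplot
matrix:

> library(R330)   # course package, assumed to provide pairs20x()
> pairs20x(educ.df)
> # or, with base R only:
> pairs(educ.df[, c("educ", "urban", "percap", "under18")])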


Outlier alert! Education expenditure data
Basic fit, outlier in

Call:
lm(formula = educ ~ urban + percap + under18, data = educ.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -555.92562 123.46634 -4.503 4.56e-05 ***
urban -0.00476 0.05174 -0.092 0.927
percap 0.07236 0.01165 6.211 1.40e-07 ***
under18 1.55134 0.31545 4.918 1.16e-05 ***
---
Residual standard error: 40.53 on 46 degrees of freedom
Multiple R-squared: 0.5902, Adjusted R-squared: 0.5634
F-statistic: 22.08 on 3 and 46 DF, p-value: 5.271e-09
Basic fit, outlier out

Call:
lm(formula = educ ~ urban + percap + under18, data = educ.df,
subset = -50)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430 132.61422 -2.097 0.041664 *
urban 0.06624 0.04966 1.334 0.188948
percap 0.04827 0.01220 3.958 0.000266 ***
under18 0.88983 0.33159 2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07
Residual analysis

> par(mfrow=c(1,2))
> plot(educ50.lm, which=c(1,2))

[Residuals vs Fitted and Normal Q-Q plots for educ50.lm;
 observations 7, 10 and 15 are flagged in both panels]


Remedies

Either: transform the response (ladder of powers),

Or: estimate the variances of the observations and use weighted
    least squares.
Transforming the response

> tr.educ50.lm <- lm(1/educ ~ urban + percap + under18,
+                    data=educ.df[-50,])

[Scale-Location plots: left for the untransformed model
 educ50.lm, right for the reciprocal model tr.educ50.lm]


What power to choose?

- How did we know to use reciprocals?

- Think of a more general model

      Y^p = β0 + β1 x1 + ... + βk xk,

  where p is some power.

- Then estimate p from the data using a Box-Cox plot.
Box-Cox plots

> library(MASS)
> boxcox(educ ~ urban + percap + under18,
+        data=educ.df[-50,])

[Box-Cox plot: profile log-likelihood against the power, over the
 range -2 to 2, with a 95% confidence interval]
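
The power maximising the profile log-likelihood can also be read
off numerically; a minimal sketch (boxcox() returns the grid of
powers x and their log-likelihoods y):

> bc <- boxcox(educ ~ urban + percap + under18,
+              data=educ.df[-50,], plotit=FALSE)
> bc$x[which.max(bc$y)]   # power with the highest log-likelihood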


Weighted least squares

- Observations need to have constant variance.

- If the i-th observation has variance v_i * σ^2, then we can get
  a valid test by using weighted least squares, minimising the
  weighted residual sum of squares

      wRSS = Σ_{i=1}^{n} r_i^2 / v_i

  rather than the ordinary residual sum of squares

      RSS = Σ_{i=1}^{n} r_i^2.

- We need to know the variance weights v_i.


Finding the weights

1. Plot the squared residuals versus the fitted values.

2. Smooth the plot.

3. Estimate the variance of an observation by its smoothed
   squared residual.

4. The weight is the reciprocal of the smoothed squared residual.

Rationale: the variance is a function of the mean. A plain-R
sketch of this recipe follows; the course's funnel() function
(next slide) automates it.
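
A minimal sketch of steps 1-4, assuming educ50.lm is the fitted
model from earlier (the lowess smoother is an arbitrary choice
here):

> r2 <- residuals(educ50.lm)^2           # step 1: squared residuals
> sm <- lowess(fitted(educ50.lm), r2)    # step 2: smooth vs fitted values
> vars <- approx(sm$x, sm$y,             # step 3: smoothed squared residual
+                xout=fitted(educ50.lm))$y  #      as the variance estimate
> wts <- 1/vars                          # step 4: weights = reciprocals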


Doing it in R

> vars <- funnel(educ50.lm)
Slope: 1.723989

[Funnel plot diagnostics: left, log std. errors against log means;
 right, squared residuals against fitted values with the smooth
 overlaid]


Recall model fit after outlier removal

Call:
lm(formula = educ ~ urban + percap + under18,
data = educ.df, subset = -50)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430 132.61422 -2.097 0.041664 *
urban 0.06624 0.04966 1.334 0.188948
percap 0.04827 0.01220 3.958 0.000266 ***
under18 0.88983 0.33159 2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07
Weighted Model

Call:
lm(formula = educ ~ urban + percap + under18,
data = educ.df[-50,], weights = 1/vars)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -270.29363 102.61073 -2.634 0.0115 *
urban 0.01197 0.04030 0.297 0.7677
percap 0.05850 0.01027 5.694 8.88e-07 ***
under18 0.82384 0.27234 3.025 0.0041 **
---
Residual standard error: 1.019 on 45 degrees of freedom
Multiple R-squared: 0.629, Adjusted R-squared: 0.6043
F-statistic: 25.43 on 3 and 45 DF, p-value: 8.944e-10
Diagnostic steps

We test for <> using <>:

Planarity: Residuals vs. fitted values, added variable plots,
           GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Normality of Errors: ...

Independence: ...

Outliers: ...
Normality

- Another assumption in the regression model is that the errors
  are normally distributed.

- This is not so crucial, but it can be important if the errors
  have a long-tailed distribution, since long tails imply several
  apparent outliers.

- The normality assumption is important for prediction.


Detecting non-normality

Standard diagnostic:
> qqnorm(residuals(xyz.lm))

[Four normal Q-Q plots of samples from: a normal distribution, a
 right-skewed distribution, a symmetric short-tailed distribution,
 and a symmetric long-tailed distribution]


The Weisberg-Bingham test

- Test statistic: WB is the squared correlation of the points in
  the normal Q-Q plot; it measures how straight the plot is.

- WB lies between 0 and 1; values close to 1 indicate normality.

- The R function WB.test calculates the WB statistic and computes
  a p-value for the null hypothesis that the sample is normal.

- The WB test is a variant of the Shapiro-Wilk test. A sketch of
  the statistic itself is given below.
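
A minimal sketch of the statistic, assuming mymodel.lm is a fitted
lm object; WB.test itself comes from the course's R330 package,
and the Blom-type plotting positions below are an assumed (though
common) choice:

> r <- residuals(mymodel.lm)
> n <- length(r)
> q <- qnorm((1:n - 0.375)/(n + 0.25))   # normal plotting positions
> WB <- cor(sort(r), q)^2                # squared correlation of the Q-Q plot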


Example: Residuals of cherry trees

> qqnorm(residuals(cherry.cone))
> WB.test(cherry.cone)
WB test statistic = 0.983
p = 0.36

[Normal Q-Q plot: residuals for the cherry cone model]

Since the p-value is large, there is no evidence against
normality.
Remedies for non-normality

- The standard remedy is to transform the response using a power
  transformation.

- The idea is that on the original scale the model does not fit
  well, but on the transformed scale it does.

- The power is obtained by means of a Box-Cox plot.

- That is, we assume that for some power p, the response y^p
  follows the regression model; the plot is a graphical way of
  estimating the power p.

- Technically, it is a plot of the profile likelihood.


Diagnostic steps

We test for <> using <>:

Planarity: Residuals vs. fitted values, added variable plots,
           GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Normality of Errors: Q-Q plots, Weisberg-Bingham test (and many
                     more)

Outliers: ...

Independence: ...
http://xkcd.com/552/
