
STATS 330: Lecture 10

Variance

12.08.2014
Office hours

Lecturers:

  Name            Office    Email (@auckland.ac.nz)  Hours
  Steffen Klaere  303.219   s.klaere                 Wed 13:00-15:00
  Alan Lee        303S.265  aj.lee                   Tue 10:30-12:00,
                                                     Thu 10:30-12:00

Tutors (Room 303.326):

  Name           Email (@aucklanduni.ac.nz)  Hours
  Savannah Post  spos008                     Mon 10:00-12:00, Thu 14:30-15:30
  Leshun Xu      lxu472                      Tue 13:00-14:00, Wed 13:00-14:00,
                                             Thu 13:00-14:00
  Hongbin Guo    hguo033                     Tue 11:00-12:00, Wed 14:00-16:00,
                                             Thu 10:00-11:00, Fri 11:00-12:00
Book suggestion

eries in Statistics Springer Series in Statistics

Hastie Tibshirani Friedman


Trevor Hastie
e Robert Tibshirani Jerome Friedman
ents of Statictical Learning Robert Tibshirani
Jerome Friedman
de there has been an explosion in computation and information tech-
come vast amounts of data in a variety of fields such as medicine, biolo-
keting. The challenge of understanding these data has led to the devel-
in the field of statistics, and spawned new areas such as data mining,
d bioinformatics. Many of these tools have common underpinnings but
The Elements of
Statistical Learning
with different terminology. This book describes the important ideas in
The Elements of Statistical Learning

mon conceptual framework. While the approach is statistical, the


pts rather than mathematics. Many examples are given, with a liberal
. It should be a valuable resource for statisticians and anyone interested
ence or industry. The books coverage is broad, from supervised learning
pervised learning. The many topics include neural networks, support
sification trees and boostingthe first comprehensive treatment of this
Data Mining, Inference, and Prediction
on features many topics not covered in the original, including graphical
sts, ensemble methods, least angle regression & path algorithms for the
atrix factorization, and spectral clustering. There is also a chapter on
ata (p bigger than n), including multiple testing and false discovery rates.

t Tibshirani, and Jerome Friedman are professors of statistics at


They are prominent researchers in this area: Hastie and Tibshirani Second Edition
d additive models and wrote a popular book of that title. Hastie co-
he statistical modeling software and environment in R/S-PLUS and
rves and surfaces. Tibshirani proposed the lasso and is co-author of the
troduction to the Bootstrap. Friedman is the co-inventor of many data-
ng CART, MARS, projection pursuit and gradient boosting.

http://statweb.stanford.edu/~tibs/ElemStatLearn/
R-hint of the day

Saving a plot to a PDF file:

> pdf("name.pdf", width=8, height=8)   # open a PDF graphics device
> par(mfrow=c(2,2))                    # 2 x 2 grid of plots
> plot(mymodel.lm, which=c(1,2,3,5))   # four standard diagnostic plots
> dev.off()                            # close the device to write the file
Diagnostic steps

We test for <> using <>:

Planarity: Residuals vs. fitted values, added variable plots,
           GAM plots

Constant Variance: ...

Normality of Errors: ...

Independence: ...

Outliers: ...
GAM plots and splines

- We use GAM plots to decide which variables we need to
  transform, and how.

- The shape could be a polynomial...

- ...or a locally smoothed function.

- But be aware of the distribution of the observations.

[GAM plots of the fitted smooths: s(diameter, 2.69) against
 diameter, s(tensile, 5.42) against tensile, and
 s(medianIncome, 3.03) against medianIncome]
GAM plots and splines

Call:
lm(formula = abloss ~ poly(tensile, 4) + hardness,
data = rubber.df)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 615.4012 29.2893 21.011 < 2e-16 ***
poly(tensile, 4)1 -264.4043 24.6171 -10.741 1.20e-10 ***
poly(tensile, 4)2 23.6269 24.8947 0.949 0.352043
poly(tensile, 4)3 119.9408 24.1991 4.956 4.64e-05 ***
poly(tensile, 4)4 -91.6965 23.2722 -3.940 0.000613 ***
hardness -6.2614 0.4124 -15.182 8.35e-14 ***
---
Residual standard error: 23.25 on 24 degrees of freedom
Multiple R-squared: 0.9423, Adjusted R-squared: 0.9303
F-statistic: 78.46 on 5 and 24 DF, p-value: 4.504e-14
GAM plots and splines

Call:
lm(formula = abloss ~ bs(tensile, df = 4, degree = 3) + hardness,
data = rubber.df)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 612.1556 43.0348 14.225 3.43e-13 ***
bs(tensile,df=4,degree=3)1 195.5549 40.6339 4.813 6.69e-05 ***
bs(tensile,df=4,degree=3)2 -148.3497 38.6717 -3.836 0.000796 ***
bs(tensile,df=4,degree=3)3 -24.2971 37.7010 -0.644 0.525385
bs(tensile,df=4,degree=3)4 -61.0593 25.4829 -2.396 0.024720 *
hardness -6.1914 0.4139 -14.959 1.15e-13 ***
---
Residual standard error: 23.36 on 24 degrees of freedom
Multiple R-squared: 0.9418, Adjusted R-squared: 0.9297
F-statistic: 77.7 on 5 and 24 DF, p-value: 5.021e-14
GAM plots and splines

Call:
lm(formula = abloss ~ bs(tensile, knots = 180) + hardness,
data = rubber.df)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 614.273 43.043 14.271 3.19e-13 ***
bs(tensile, knots = 180)1 196.594 41.116 4.781 7.24e-05 ***
bs(tensile, knots = 180)2 -161.450 38.676 -4.174 0.000339 ***
bs(tensile, knots = 180)3 -21.134 37.451 -0.564 0.577774
bs(tensile, knots = 180)4 -65.016 25.485 -2.551 0.017529 *
hardness -6.191 0.415 -14.917 1.23e-13 ***
---
Residual standard error: 23.42 on 24 degrees of freedom
Multiple R-squared: 0.9415, Adjusted R-squared: 0.9293
F-statistic: 77.26 on 5 and 24 DF, p-value: 5.351e-14
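
To reproduce the spline fits above, note that bs() comes from the
splines package, which ships with R, so it must be loaded first;
for example:

> library(splines)
> rubber.bs.lm <- lm(abloss ~ bs(tensile, df=4, degree=3) + hardness,
+                    data=rubber.df)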
Aims of today's lecture

- To describe some more remedies for non-planar data.

- To look at diagnostics and remedies for non-constant scatter.

- To recapitulate and discuss tests for normality.


Remedies for non-planar data

- Last time we looked at diagnostics for non-planar data.

- We discussed what to do if the diagnostics indicate a problem.

- The short answer was: we transform, so that the model fits the
  transformed data.

- How to choose a transformation?
  - Theory
  - Ladder of powers
  - Polynomials

- We illustrate with a few examples.


Using theory: Cherry trees

- As we have observed, a tree trunk is a bit like a cone, i.e.
  volume, height and diameter are related by

      volume = (π/12) * diameter^2 * height

  so that

      log(volume) = log(π/12) + 2 log(diameter) + log(height)

- So a linear regression using the logged variables should work!

- In fact, R^2 increases from 94.8% for the untransformed model
  to 97.7%.

- Also, the fitted intercept b0 = -1.70 is closer to the cone's
  log(π/12) ≈ -1.34 than to a cylinder's log(π/4) ≈ -0.24.
Using theory: Cherry trees

Call:
lm(formula = log(volume) ~ log(height) + log(diameter),
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.70492 0.88190 -1.933 0.0634 .
log(height) 1.11712 0.20444 5.464 7.81e-06 ***
log(diameter) 1.98265 0.07501 26.432 < 2e-16 ***
---
Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16
Using theory: Cherry trees

[Residuals vs Fitted plots: left for lm(volume ~ height + diameter),
 right for lm(log(volume) ~ log(height) + log(diameter))]
Using GAM plots: Tyre abrasion data

> rubber.gam <- gam(abloss ~ s(hardness) + s(tensile),
+                   data=rubber.df)
> plot(rubber.gam, page=1)

[GAM plot: s(tensile, 5.42) against tensile]
Refitting model: Tyre abrasion data

- The GAM curve for tensile looks like a polynomial, so fit a
  polynomial:

  > lm(abloss ~ hardness + poly(tensile,4), data=rubber.df)

- Usually a lot of trial and error is involved.

- We have succeeded when
  - R^2 improves, and
  - the residual plots show no pattern.

- A 4th degree polynomial works well for the rubber data: R^2
  improves from 84% to 94%.
Why 4th degree? Tyre abrasion data

Call:
lm(formula = abloss ~ hardness + poly(tensile, 5),
data = rubber.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 615.3617 29.8178 20.637 2.44e-16 ***
hardness -6.2608 0.4199 -14.911 2.59e-13 ***
poly(tensile, 5)1 -264.3933 25.0612 -10.550 2.76e-10 ***
poly(tensile, 5)2 23.6148 25.3437 0.932 0.361129
poly(tensile, 5)3 119.9500 24.6356 4.869 6.46e-05 ***
poly(tensile, 5)4 -91.6951 23.6920 -3.870 0.000776 ***
poly(tensile, 5)5 9.3811 23.6684 0.396 0.695495
---
Residual standard error: 23.67 on 23 degrees of freedom
Multiple R-squared: 0.9427, Adjusted R-squared: 0.9278
F-statistic: 63.11 on 6 and 23 DF, p-value: 3.931e-13

The 5th-order term is not significant (p = 0.695), so degree 4
suffices.
Ladder of powers

- Rather than fit polynomials in some independent variables,
  guided by GAM plots, we can transform the response using the
  ladder of powers.

- That is, use y^p as the response rather than y, for some
  power p.

- Choose p either by trial and error using R^2, or use a Box-Cox
  plot (see later in this lecture). A sketch of the
  trial-and-error approach follows.
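
A minimal sketch of the trial-and-error approach, using the
education expenditure data introduced later in this lecture
(assumes educ.df is loaded); note that R^2 values computed on
different response scales are only a rough guide:

> for (p in c(-1, -0.5, 0.5, 1, 2)) {
+   fit <- lm(I(educ^p) ~ urban + percap + under18, data=educ.df)
+   cat("p =", p, " R^2 =", round(summary(fit)$r.squared, 3), "\n")
+ }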
Checking for equal scatter

- The model specifies that the scatter about the regression
  plane is uniform.

- In practice this means that the scatter does not depend on the
  explanatory variables or on the mean of the response.

- All tests and confidence intervals rely on this!


Checking for equal scatter

- Scatter is measured by the size of the residuals.

- A common problem is that the scatter increases as the mean
  response increases.

- This means the big residuals occur when the fitted values are
  big.

- Recognise this by a "funnel" effect in the residuals versus
  fitted values plot.
Example: Education expenditure data

- Data for 50 states of the USA.

- Variables are
    educ:    per capita expenditure on education (response)
    percap:  per capita income
    under18: number of residents per 1000 under 18
    urban:   number of residents per 1000 in urban areas

- Fit the model
  > lm(educ ~ percap + under18 + urban, data=educ.df)
Pairs plot: Education expenditure data

[Pairs plot of educ, urban, percap and under18, with pairwise
 correlations shown in the lower panels]
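
A hedged sketch of how such a plot can be produced; the course's
R330 package is assumed here to provide pairs20x(), which adds the
correlations, while base R's pairs() gives the plain scatterplot
matrix:

> library(R330)   # course package, assumed to provide pairs20x()
> pairs20x(educ.df)
> # or, with base R only:
> pairs(educ.df[, c("educ", "urban", "percap", "under18")])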


Outlier alert! Education expenditure data
Basic fit, outlier in

Call:
lm(formula = educ ~ urban + percap + under18, data = educ.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -555.92562 123.46634 -4.503 4.56e-05 ***
urban -0.00476 0.05174 -0.092 0.927
percap 0.07236 0.01165 6.211 1.40e-07 ***
under18 1.55134 0.31545 4.918 1.16e-05 ***
---
Residual standard error: 40.53 on 46 degrees of freedom
Multiple R-squared: 0.5902, Adjusted R-squared: 0.5634
F-statistic: 22.08 on 3 and 46 DF, p-value: 5.271e-09
Basic fit, outlier out

Call:
lm(formula = educ ~ urban + percap + under18, data = educ.df,
subset = -50)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430 132.61422 -2.097 0.041664 *
urban 0.06624 0.04966 1.334 0.188948
percap 0.04827 0.01220 3.958 0.000266 ***
under18 0.88983 0.33159 2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07
Residual analysis

> par(mfrow=c(1,2))
> plot(educ50.lm, which=c(1,2))

[Residuals vs Fitted and Normal Q-Q plots for educ50.lm;
 observations 7, 10 and 15 are flagged in both panels]


Remedies

Either: transform the response (ladder of powers),

Or: estimate the variances of the observations and use weighted
    least squares.
Transforming the response

> tr.educ50.lm <- lm(1/educ ~ urban + percap + under18,
+                    data=educ.df[-50,])

[Scale-Location plots: left for the untransformed model
 educ50.lm, right for the reciprocal model tr.educ50.lm]


What power to choose?

- How did we know to use reciprocals?

- Think of a more general model

      Y^p = β0 + β1 x1 + ... + βk xk,

  where p is some power.

- Then estimate p from the data using a Box-Cox plot.
Box-Cox plots

> library(MASS)
> boxcox(educ ~ urban + percap + under18,
+        data=educ.df[-50,])

[Box-Cox plot: profile log-likelihood against the power, over the
 range -2 to 2, with a 95% confidence interval]
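
The power maximising the profile log-likelihood can also be read
off numerically; a minimal sketch (boxcox() returns the grid of
powers x and their log-likelihoods y):

> bc <- boxcox(educ ~ urban + percap + under18,
+              data=educ.df[-50,], plotit=FALSE)
> bc$x[which.max(bc$y)]   # power with the highest log-likelihood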


Weighted least squares

- Observations need to have constant variance.

- If the i-th observation has variance v_i * σ^2, then we can get
  a valid test by using weighted least squares, minimising the
  weighted residual sum of squares

      wRSS = Σ_{i=1}^{n} r_i^2 / v_i

  rather than the ordinary residual sum of squares

      RSS = Σ_{i=1}^{n} r_i^2.

- We need to know the variance weights v_i.


Finding the weights

1. Plot the squared residuals versus the fitted values.

2. Smooth the plot.

3. Estimate the variance of an observation by its smoothed
   squared residual.

4. The weight is the reciprocal of the smoothed squared residual.

Rationale: the variance is a function of the mean. A plain-R
sketch of this recipe follows; the course's funnel() function
(next slide) automates it.
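
A minimal sketch of steps 1-4, assuming educ50.lm is the fitted
model from earlier (the lowess smoother is an arbitrary choice
here):

> r2 <- residuals(educ50.lm)^2           # step 1: squared residuals
> sm <- lowess(fitted(educ50.lm), r2)    # step 2: smooth vs fitted values
> vars <- approx(sm$x, sm$y,             # step 3: smoothed squared residual
+                xout=fitted(educ50.lm))$y  #      as the variance estimate
> wts <- 1/vars                          # step 4: weights = reciprocals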


Doing it in R

> vars <- funnel(educ50.lm)
Slope: 1.723989

[Funnel plot diagnostics: left, log std. errors against log means;
 right, squared residuals against fitted values with the smooth
 overlaid]


Recall model fit after outlier removal

Call:
lm(formula = educ ~ urban + percap + under18,
data = educ.df, subset = -50)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430 132.61422 -2.097 0.041664 *
urban 0.06624 0.04966 1.334 0.188948
percap 0.04827 0.01220 3.958 0.000266 ***
under18 0.88983 0.33159 2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07
Weighted Model

Call:
lm(formula = educ ~ urban + percap + under18,
data = educ.df[-50,], weights = 1/vars)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -270.29363 102.61073 -2.634 0.0115 *
urban 0.01197 0.04030 0.297 0.7677
percap 0.05850 0.01027 5.694 8.88e-07 ***
under18 0.82384 0.27234 3.025 0.0041 **
---
Residual standard error: 1.019 on 45 degrees of freedom
Multiple R-squared: 0.629, Adjusted R-squared: 0.6043
F-statistic: 25.43 on 3 and 45 DF, p-value: 8.944e-10
Diagnostic steps

We test for <> using <>:

Planarity: Residuals vs. fitted values, added variable plots,
           GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Normality of Errors: ...

Independence: ...

Outliers: ...
Normality

- Another assumption in the regression model is that the errors
  are normally distributed.

- This is not so crucial, but it can be important if the errors
  have a long-tailed distribution, since long tails imply several
  apparent outliers.

- The normality assumption is important for prediction.


Detecting non-normality

Standard diagnostic:
> qqnorm(residuals(xyz.lm))

[Four normal Q-Q plots of samples from: a normal distribution, a
 right-skewed distribution, a symmetric short-tailed distribution,
 and a symmetric long-tailed distribution]


The Weisberg-Bingham test

- Test statistic: WB is the squared correlation of the points in
  the normal Q-Q plot; it measures how straight the plot is.

- WB lies between 0 and 1; values close to 1 indicate normality.

- The R function WB.test calculates the WB statistic and computes
  a p-value for the null hypothesis that the sample is normal.

- The WB test is a variant of the Shapiro-Wilk test. A sketch of
  the statistic itself is given below.
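
A minimal sketch of the statistic, assuming mymodel.lm is a fitted
lm object; WB.test itself comes from the course's R330 package,
and the Blom-type plotting positions below are an assumed (though
common) choice:

> r <- residuals(mymodel.lm)
> n <- length(r)
> q <- qnorm((1:n - 0.375)/(n + 0.25))   # normal plotting positions
> WB <- cor(sort(r), q)^2                # squared correlation of the Q-Q plot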


Example: Residuals of cherry trees

> qqnorm(residuals(cherry.cone))
> WB.test(cherry.cone)
WB test statistic = 0.983
p = 0.36

[Normal Q-Q plot: residuals for the cherry cone model]

Since the p-value is large, there is no evidence against
normality.
Remedies for non-normality

- The standard remedy is to transform the response using a power
  transformation.

- The idea is that on the original scale the model does not fit
  well, but on the transformed scale it does.

- The power is obtained by means of a Box-Cox plot.

- That is, we assume that for some power p, the response y^p
  follows the regression model; the plot is a graphical way of
  estimating the power p.

- Technically, it is a plot of the profile likelihood.


Diagnostic steps

We test for <> using <>:

Planarity: Residuals vs. fitted values, added variable plots,
           GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Normality of Errors: Q-Q plots, Weisberg-Bingham test (and many
                     more)

Outliers: ...

Independence: ...
http://xkcd.com/552/
