
STATS 330: Lecture 9-12

Diagnostics

6.08.2015
Aims of the next four lectures

- To give you an overview of the modelling cycle.
- To have a detailed discussion of diagnostic procedures.
The modelling cycle

- We have seen that the regression model describes rather specialised forms of data:
  - Data are planar;
  - Scatter is uniform over the plane.
- We have looked at some plots that help us decide if the data are suitable for regression modelling:
  - pairs
  - reg3d
  - coplot
Residual analysis

- Another approach is to fit the model and examine the residuals.
- If the model is appropriate, the residuals have no pattern.
- A pattern in the residuals usually indicates that the model is not appropriate.
- If this is the case we have two options:
  1. Select another form of model, e.g. non-linear regression;
  2. Transform the data so that the regression model fits the transformed data.
The Modelling Cycle

PLOTS and THEORY -> Choose Model -> Fit Model -> Examine Residuals
  - Bad fit: Transform, then refit the model.
  - Good fit: USE MODEL.
What constitutes a bad fit?

- Non-planar data: Seriously affects meaning and accuracy of estimated coefficients.
- Outliers in the data: Seriously affects meaning and accuracy of estimated coefficients.
- Non-constant scatter: Affects standard error of estimate.
- Errors not independent: Affects standard error of estimate.
- Errors not normally distributed: Affects standard error of estimate.
Diagnostic steps

We test for <> using <>:


Planarity: ...

Constant Variance: ...

Outliers: ...

Independence: ...

Normality of Errors: ...


Detecting non-planar data

- We can diagnose non-planar data (non-linearity) by fitting the model, and then
  - plotting residuals versus fitted values;
  - plotting residuals against explanatory variables;
  - fitting additive models.
- In each case, a curved plot indicates non-planar data.

Plotting residuals vs. fitted values

> data(cherry.df)
> cherry.lm <- lm(volume~diameter+height,data=cherry.df)
> plot(cherry.lm,which=1)

which=1: selects the plot of residuals vs. fitted values
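The same idea applies to the second diagnostic, residuals against the explanatory variables. A minimal sketch using the fit above (curvature in either panel again suggests non-planar data):

> par(mfrow=c(1,2))
> plot(cherry.df$diameter, residuals(cherry.lm),
+      xlab="diameter", ylab="residuals")    # residuals vs. diameter
> abline(h=0, lty=2)
> plot(cherry.df$height, residuals(cherry.lm),
+      xlab="height", ylab="residuals")      # residuals vs. height
> abline(h=0, lty=2)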


Plotting residuals vs. fitted values

[Figure: "Residuals vs Fitted" plot for lm(volume ~ height + diameter); observations 2, 18 and 31 are labelled.]
Additive models

- These are models of the form

      Y = g_1(x_1) + g_2(x_2) + ... + g_k(x_k) + ε

  where g_1, ..., g_k are transformations.
- Fitted using the gam function in R (mgcv package).
- The transformations are estimated by the software.
- Use the fitted transformations to suggest good transformations.

Example: Cherry trees

> library(mgcv)
> cherry.gam <- gam(volume~s(diameter)+s(height),
+ data=cherry.df)
> plot(cherry.gam,residuals=T,pages=1)
Example: Cherry trees

[Figure: GAM partial-effect plots with partial residuals; left panel s(diameter,2.69) against diameter, right panel s(height,1) against height.]
Fitting polynomials

- To fit a model y = β_0 + β_1 x + β_2 x², use

      y~poly(x,2)

- To fit a model y = β_0 + β_1 x + β_2 x² + β_3 x³, use

      y~poly(x,3)

  etc.
Orthogonal polynomials

- The model fitted by y~poly(x,2) is of the form

      Y = β_0 + β_1 p_1(x) + β_2 p_2(x)

  where

      p_1: polynomial of degree 1, i.e. of the form a_0 + a_1 x
      p_2: polynomial of degree 2, i.e. of the form b_0 + b_1 x + b_2 x²

- p_1, p_2 are chosen to be uncorrelated (best possible estimation).
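A quick numerical check of this (a sketch using the cherry data; poly() constructs the basis so that its columns are orthogonal):

> P <- poly(cherry.df$diameter, 2)   # orthogonal polynomial basis of degree 2
> round(cor(P[,1], P[,2]), 10)       # essentially zero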


Adding a quadratic term: Cherry trees

Call:
lm(formula = volume ~ poly(diameter, 2) + height,
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.56553 6.72218 0.233 0.817603
poly(diameter, 2)1 80.25223 3.07346 26.111 < 2e-16 ***
poly(diameter, 2)2 15.39923 2.63157 5.852 3.13e-06 ***
height 0.37639 0.08823 4.266 0.000218 ***
---
Residual standard error: 2.625 on 27 degrees of freedom
Multiple R-squared: 0.9771, Adjusted R-squared: 0.9745
F-statistic: 383.2 on 3 and 27 DF, p-value: < 2.2e-16
Quadratic equation

Call:
lm(formula = volume ~ diameter + I(diameter^2) + height,
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.92041 10.07911 -0.984 0.333729
diameter -2.88508 1.30985 -2.203 0.036343 *
I(diameter^2) 0.26862 0.04590 5.852 3.13e-06 ***
height 0.37639 0.08823 4.266 0.000218 ***
---
Residual standard error: 2.625 on 27 degrees of freedom
Multiple R-squared: 0.9771, Adjusted R-squared: 0.9745
F-statistic: 383.2 on 3 and 27 DF, p-value: < 2.2e-16
Quadratic equation

volume ≈ -9.92 - 2.89·diameter + 0.27·diameter² + 0.38·height

[Figure: fitted quadratic surface of volume against diameter and height.]
Splines

- An alternative to polynomials is splines: these are piecewise cubics, which join smoothly at knots.
- They give a more flexible fit to the data.
- The fit at one point is not affected by values at distant points, unlike polynomials.
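The cherry spline fit on the slides below uses bs() from the splines package with a vector knot.points that is not shown. A minimal sketch, assuming three interior knots at the quartiles of diameter (the actual knots used in the slides may differ):

> library(splines)
> knot.points <- quantile(cherry.df$diameter, c(0.25, 0.5, 0.75))  # assumed knots
> cherry.bs <- lm(volume ~ bs(diameter, knots=knot.points) + height,
+                 data=cherry.df)
> summary(cherry.bs)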
Example with 4 knots

[Figure: example spline with 4 knots; y plotted against x on [0, 1].]
Cherry splines

Call:
lm(formula = volume ~ bs(diameter, knots = knot.points) + height,
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.3679 7.4856 -2.187 0.03921 *
bs(diameter, knots = knot.points)1 0.1941 7.9374 0.024 0.98070
bs(diameter, knots = knot.points)2 5.5744 3.1704 1.758 0.09201 .
bs(diameter, knots = knot.points)3 10.7976 3.9798 2.713 0.01240 *
bs(diameter, knots = knot.points)4 31.4053 5.5545 5.654 9.35e-06 ***
bs(diameter, knots = knot.points)5 42.2665 6.1297 6.895 4.97e-07 ***
bs(diameter, knots = knot.points)6 58.6454 4.2781 13.708 1.49e-12 ***
height 0.3970 0.1050 3.780 0.00097 ***
---
Residual standard error: 2.8 on 23 degrees of freedom
Multiple R-squared: 0.9778, Adjusted R-squared: 0.971
F-statistic: 144.4 on 7 and 23 DF, p-value: < 2.2e-16
Cherry splines

[Figure: basis functions for quadratic splines, plotted for x on [0, 1].]
Cherry splines

[Figure: Volume against Diameter for the cherry data, with the polynomial and spline fits overlaid.]
Example: Tyre abrasion data

- Data collected in an experiment to study the abrasion resistance of tyres.
- Variables are:
    hardness: hardness of the rubber;
    tensile: tensile strength of the rubber;
    abloss: abrasion loss, the amount of rubber worn away in a standard test (response).
Tyre abrasion data

Call:
lm(formula = abloss ~ hardness + tensile, data = rubber.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 885.1611 61.7516 14.334 3.84e-14 ***
hardness -6.5708 0.5832 -11.267 1.03e-11 ***
tensile -1.3743 0.1943 -7.073 1.32e-07 ***
---
Residual standard error: 36.49 on 27 degrees of freedom
Multiple R-squared: 0.8402, Adjusted R-squared: 0.8284
F-statistic: 71 on 2 and 27 DF, p-value: 1.767e-11
Tyre abrasion data

- We will use this example to illustrate all the methods discussed so far for checking whether the data are planar, i.e. scattered about a flat regression plane:
  - Pairs plot
  - Conditional plot (coplot)
  - Residuals vs. fitted values plot
  - Fitting GAMs
Pairs plot: tensile vs. hardness non-linear

[Figure: pairs plot of hardness, tensile and abloss; absolute correlations shown are 0.30, 0.74 and 0.30.]
Coplot: suggestion of non-planarity

[Figure: coplot of abloss against tensile, given hardness.]
Residuals vs. fitted values: weak suggestion of non-planarity

[Figure: "Residuals vs Fitted" plot for the rubber model; observations 10, 22 and 29 are labelled.]
GAMs: quite strong indication of non-planarity

hardness looks okay, but tensile needs transformation.

[Figure: GAM partial-effect plots s(tensile,5.42) against tensile and s(hardness,1) against hardness.]
Fitting a fourth degree polynomial

> rubber.poly <- lm(abloss~hardness+tensile+I(tensile^2)
+                   +I(tensile^3)+I(tensile^4),data=rubber.df)
> summary(rubber.poly)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.862e+04 4.177e+03 -4.458 0.000165 ***
hardness -6.261e+00 4.124e-01 -15.182 8.35e-14 ***
tensile 4.414e+02 9.836e+01 4.487 0.000153 ***
I(tensile^2) -3.693e+00 8.546e-01 -4.321 0.000233 ***
I(tensile^3) 1.342e-02 3.246e-03 4.133 0.000377 ***
I(tensile^4) -1.794e-05 4.553e-06 -3.940 0.000613 ***
---
Residual standard error: 23.25 on 24 degrees of freedom
Multiple R-squared: 0.9423, Adjusted R-squared: 0.9303
F-statistic: 78.46 on 5 and 24 DF, p-value: 4.504e-14
Fitting splines

> rubber.bs <- lm(abloss~hardness+bs(tensile,df=4),
+                 data=rubber.df)
> summary(rubber.bs)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 612.1556 43.0348 14.225 3.43e-13 ***
hardness -6.1914 0.4139 -14.959 1.15e-13 ***
bs(tensile, df = 4)1 195.5549 40.6339 4.813 6.69e-05 ***
bs(tensile, df = 4)2 -148.3497 38.6717 -3.836 0.000796 ***
bs(tensile, df = 4)3 -24.2971 37.7010 -0.644 0.525385
bs(tensile, df = 4)4 -61.0593 25.4829 -2.396 0.024720 *
---
Residual standard error: 23.36 on 24 degrees of freedom
Multiple R-squared: 0.9418, Adjusted R-squared: 0.9297
F-statistic: 77.7 on 5 and 24 DF, p-value: 5.021e-14
GAM plots and splines

- We use GAM plots to decide which variable we need to transform and how.
- The shape could be a polynomial...

  [Figure: s(diameter,2.69) against diameter for the cherry data.]

- ...or a locally smoothed function.

  [Figure: s(tensile,5.42) against tensile for the rubber data.]

- But be aware of the distribution of the observations.

  [Figure: s(medianIncome,3.03) against medianIncome, illustrating this point.]
Diagnostic steps

We test for <> using <>:


Planarity: Residuals vs. fitted values, Residuals vs. covariates,
added variable plots, GAM plots

Constant Variance: ...

Outliers: ...

Independence: ...

Normality of Errors: ...


Remedies for non-planar data

- So far we have discussed how to look for non-planarity, and stated that the remedy is transformation.
- How do we choose a transformation when non-planarity is indicated?
  - Theory
  - Ladder of powers
  - Polynomials
- We illustrate with a few examples.


Using theory: Cherry trees

- As we have observed, a tree trunk is a bit like a cone, i.e. volume, height and diameter are related by

      volume ≈ (π/3) × (diameter/2)² × height = (π/12) × diameter² × height

  so that

      log(volume) = log(π/12) + 2 log(diameter) + log(height)

- So a linear regression using the logged variables should work!
- In fact, R² increases from 94.8% for the untransformed model to 97.7%.
- Also, the intercept b_0 = -1.70 is closer to the value implied by a cone, log(π/12) ≈ -1.34, than by a cylinder, log(π/4) ≈ -0.24.
Using theory: Cherry trees

Call:
lm(formula = log(volume) ~ log(height) + log(diameter),
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.70492 0.88190 -1.933 0.0634 .
log(height) 1.11712 0.20444 5.464 7.81e-06 ***
log(diameter) 1.98265 0.07501 26.432 < 2e-16 ***
---
Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16
Using theory: Cherry trees

[Figure: "Residuals vs Fitted" plots for lm(volume ~ height + diameter) (left) and lm(log(volume) ~ log(height) + log(diameter)) (right), shown side by side for comparison.]
Using GAM plots: Tyre abrasion data

> rubber.gam <- gam(abloss~s(hardness)+s(tensile),
+                   data=rubber.df)
> plot(rubber.gam,pages=1)

[Figure: GAM partial-effect plots s(tensile,5.42) against tensile and s(hardness,1) against hardness.]
Refitting model: Tyre abrasion data

- The GAM curve for tensile looks like a polynomial, so fit a polynomial:

  > lm(abloss~hardness+poly(tensile,4),data=rubber.df)

- Usually a lot of trial and error is involved.
- We have succeeded when
  - R² improves;
  - residual plots show no pattern.
- A 4th degree polynomial works well for the rubber data: R² improves from 84% to 94%.
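A minimal sketch of that check, comparing R-squared before and after the polynomial transformation (fits as in the earlier slides):

> rubber.lm   <- lm(abloss ~ hardness + tensile, data=rubber.df)
> rubber.poly <- lm(abloss ~ hardness + poly(tensile,4), data=rubber.df)
> c(before = summary(rubber.lm)$r.squared,     # about 0.84
+   after  = summary(rubber.poly)$r.squared)   # about 0.94
> plot(rubber.poly, which=1)                   # check that the pattern has gone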
Why 4th degree? Tyre abrasion data

Call:
lm(formula = abloss ~ hardness + poly(tensile, 5),
data = rubber.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 615.3617 29.8178 20.637 2.44e-16 ***
hardness -6.2608 0.4199 -14.911 2.59e-13 ***
poly(tensile, 5)1 -264.3933 25.0612 -10.550 2.76e-10 ***
poly(tensile, 5)2 23.6148 25.3437 0.932 0.361129
poly(tensile, 5)3 119.9500 24.6356 4.869 6.46e-05 ***
poly(tensile, 5)4 -91.6951 23.6920 -3.870 0.000776 ***
poly(tensile, 5)5 9.3811 23.6684 0.396 0.695495
---
Residual standard error: 23.67 on 23 degrees of freedom
Multiple R-squared: 0.9427, Adjusted R-squared: 0.9278
F-statistic: 63.11 on 6 and 23 DF, p-value: 3.931e-13
Ladder of powers

- Rather than fit polynomials in some explanatory variables, guided by GAM plots, we can transform the response using the ladder of powers,
- i.e. use y^p as the response rather than y, for some power p.
- Choose p either by trial and error using R², or use a Box-Cox plot (see later in the diagnostics cycle). A sketch of the trial-and-error approach follows below.
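A minimal sketch for the rubber data (the candidate powers are an arbitrary selection from the ladder, chosen purely for illustration):

> powers <- c(-1, -1/2, 1/3, 1/2, 1, 2)
> r2 <- sapply(powers, function(p) {
+   fit <- lm(I(abloss^p) ~ hardness + tensile, data=rubber.df)
+   summary(fit)$r.squared
+ })
> data.frame(power=powers, R.squared=round(r2, 3))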
https://xkcd.com/833/
Checking for equal scatter

- The model specifies that the scatter about the regression plane is uniform.
- In practice this means that the scatter does not depend on the explanatory variables or on the mean of the response.
- All tests and confidence intervals rely on this!


Checking for equal scatter

- Scatter is measured by the size of the residuals.
- A common problem is that the scatter increases as the mean response increases.
- This means that the big residuals occur when the fitted values are big.
- We recognise this by a funnel effect in the residuals vs. fitted values plot.
Example: Education expenditure data

- Data for the 50 states of the USA.
- Variables are:
    educ: per capita expenditure on education (response);
    percap: per capita income;
    under18: number of residents per 1000 under 18;
    urban: number of residents per 1000 in urban areas.
- Fit the model
  > lm(educ ~ percap + under18 + urban,data=educ.df)
Pairs plot: Education expenditure data

[Figure: pairs plot of educ, urban, percap and under18, with pairwise correlations (roughly 0.27-0.63 in absolute value) shown in the lower panels.]


Outlier alert! Education expenditure data
Basic fit, outlier in

Call:
lm(formula = educ ~ urban + percap + under18, data = educ.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -555.92562 123.46634 -4.503 4.56e-05 ***
urban -0.00476 0.05174 -0.092 0.927
percap 0.07236 0.01165 6.211 1.40e-07 ***
under18 1.55134 0.31545 4.918 1.16e-05 ***
---
Residual standard error: 40.53 on 46 degrees of freedom
Multiple R-squared: 0.5902, Adjusted R-squared: 0.5634
F-statistic: 22.08 on 3 and 46 DF, p-value: 5.271e-09
Basic fit, outlier out

Call:
lm(formula = educ ~ urban + percap + under18, data = educ.df,
subset = -50)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430 132.61422 -2.097 0.041664 *
urban 0.06624 0.04966 1.334 0.188948
percap 0.04827 0.01220 3.958 0.000266 ***
under18 0.88983 0.33159 2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07
Residual analysis
> par(mfrow=c(1,2))
> plot(educ50.lm,which=c(1,2))

[Figure: "Residuals vs Fitted" and "Normal Q-Q" plots for educ50.lm (the fit excluding observation 50); observations 7, 10 and 15 are labelled.]


Remedies

Either: Transform the response (ladder of powers);
Or: Estimate the variances of the observations and use weighted least squares.
Transforming the response
> tr.educ50.lm <- lm(1/educ~urban+percap
+                    +under18,data=educ.df[-50,])

[Figure: "Scale-Location" plots before (left) and after (right) the reciprocal transformation of educ; observations 7, 10, 14 and 15 are labelled.]


What power to choose?

- How did we know to use reciprocals?
- Think of a more general model

      Y^p = β_0 + β_1 x_1 + ... + β_k x_k + ε

  where p is some power.
- We then estimate p from the data using a Box-Cox plot.
Box-Cox plots
> library(MASS)
> boxcox(educ~urban+percap+under18,
+        data=educ.df[-50,])

95%
5
logLikelihood

0
5
10

2 1 0 1 2


Uses for Box-Cox plots

Transformation of the response according to a Box-Cox plot may fix:

- non-normality of residuals;
- unequal variances in the response;
- non-planarity of the model;
- the need to transform covariates indicated by GAM plots.

Always be aware that a description of the effect on the response is usually only straightforward for the original response (means, additive effects) or a log-transformed response (medians, multiplicative effects).
Weighted least squares

- Observations need to have constant variance.
- If the i-th observation has variance v_i σ², then we can get a valid test by using weighted least squares, minimising the sum of the weighted squared residuals

      wRSS = Σ_{i=1}^{n} r_i² / v_i

  rather than the sum of squared residuals

      RSS = Σ_{i=1}^{n} r_i².

- We need to know the variance weights v_i.


Finding the weights

1. Plot the squared residuals versus the fitted values.
2. Smooth the plot.
3. Estimate the variance of an observation by the smoothed squared residual.
4. The weight is the reciprocal of the smoothed squared residual.

Rationale: the variance is a function of the mean. A sketch of this recipe in base R is given below.
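A minimal sketch of steps 1-4, using lowess for the smoothing (the funnel() function from the R330 package, used on the next slide, packages this kind of calculation):

> fv <- fitted(educ50.lm)                        # fitted values
> sq <- residuals(educ50.lm)^2                   # squared residuals (step 1)
> sm <- lowess(fv, sq)                           # smooth the plot (step 2)
> vars <- pmax(approx(sm$x, sm$y, xout=fv)$y,    # smoothed squared residuals
+              1e-8)                             #   = estimated variances (step 3)
> educ.wls <- lm(educ ~ urban + percap + under18,
+                data=educ.df[-50,], weights=1/vars)   # weights = reciprocals (step 4)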


Doing it in R
> vars <- funnel(educ50.lm)
Slope: 1.723989
[Figure: funnel() diagnostic plots - log standard errors against log means, and squared residuals against fitted values.]


Recall model fit after outlier removal

Call:
lm(formula = educ ~ urban + percap + under18,
data = educ.df, subset = -50)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430 132.61422 -2.097 0.041664 *
urban 0.06624 0.04966 1.334 0.188948
percap 0.04827 0.01220 3.958 0.000266 ***
under18 0.88983 0.33159 2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07
Weighted Model

Call:
lm(formula = educ ~ urban + percap + under18,
data = educ.df[-50,], weights = 1/vars)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -270.29363 102.61073 -2.634 0.0115 *
urban 0.01197 0.04030 0.297 0.7677
percap 0.05850 0.01027 5.694 8.88e-07 ***
under18 0.82384 0.27234 3.025 0.0041 **
---
Residual standard error: 1.019 on 45 degrees of freedom
Multiple R-squared: 0.629, Adjusted R-squared: 0.6043
F-statistic: 25.43 on 3 and 45 DF, p-value: 8.944e-10
Diagnostic steps

We test for <> using <>:


Planarity: Residuals vs. fitted values, Added variable plots,
GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Independence: ...

Outliers: ...

Normality of Errors: ...


http://xkcd.com/674/
Normality

- Another assumption in the regression model is that the errors are normally distributed.
- This is not so crucial, but it can be important if the errors have a long-tailed distribution, since this will imply there are several outliers.
- The normality assumption is important for prediction.


Detecting non-normality
> qqnorm(residuals(xyz.lm))

[Figure: Normal Q-Q plots of residuals for four cases - normal, right-skewed, symmetric short-tailed, and symmetric long-tailed distributions.]

The Weisberg-Bingham test

- Test statistic: WB is the squared correlation of the points in the normal Q-Q plot; it measures how straight the plot is.
- WB lies between 0 and 1. Values close to 1 indicate normality.
- The R function WB.test calculates the WB statistic and computes the p-value for the test with null hypothesis that the sample is normal.
- The WB test is a variant of the Shapiro-Wilk test.
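A rough sketch of the statistic itself (the exact plotting positions used by WB.test in the R330 package may differ; Blom's positions are assumed here):

> wb.stat <- function(fit) {
+   r <- sort(residuals(fit))
+   n <- length(r)
+   q <- qnorm((1:n - 0.375)/(n + 0.25))   # assumed normal plotting positions
+   cor(r, q)^2                            # squared correlation of the Q-Q plot
+ }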


Example: Residuals of cherry trees

> qqnorm(residuals(cherry.cone))
> WB.test(cherry.cone)

WB test statistic = 0.983
p = 0.36

Since the p-value is large, there is no evidence against normality.

[Figure: normal Q-Q plot of the residuals for the cherry cone model.]
Remedies for Non-normality

- The standard remedy is to transform the response using a power transformation.
- The idea is that on the original scale the model does not fit well, but on the transformed scale it does.
- The power is obtained by means of a Box-Cox plot.
- The idea is to assume that, for some power p, the response y^p follows the regression model. The plot is a graphical way of estimating the power p.
- Technically, it is a plot of the profile likelihood.


Diagnostic steps

We test for <> using <>:


Planarity: Residuals vs. fitted values, Added variable plots,
GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Independence: ...

Outliers: ...

Normality of Errors: QQ plots, Weisberg-Bingham test (and many more)
https://xkcd.com/242/
Outliers and high-leverage points

- An outlier is a point that has a larger or smaller y value than the model would suggest.
  - It can be due to a genuine large error ε;
  - It can be caused by typographical errors in recording the data.
- A high-leverage point is a point with extreme values of the explanatory variables.
Outliers
- The effect of an outlier depends on whether it is also a high-leverage point.
- A high-leverage outlier
  - can attract the fitted plane, distorting the fit, sometimes extremely;
  - in extreme cases may not have a big residual;
  - in extreme cases can increase R².
- A low-leverage outlier
  - does not distort the fit to the same extent;
  - usually has a big residual;
  - inflates standard errors and decreases R².
Outliers

[Figure: four scatterplots of y against x illustrating (1) no high-leverage points and no outliers, (2) a low-leverage outlier with a big residual, (3) a high-leverage outlier, and (4) a high-leverage point that is not an outlier.]
Example: The education data (without urban)

[Figure: 3D scatterplot of educ against percap and under18, with one point flagged as a high-leverage point.]
An outlier too?

[Figure: left - per capita income against number of residents per 1000 under 18, with observation 50 standing apart; right - residuals(educ.lm) against predict(educ.lm), where the residual for observation 50 is somewhat extreme.]


Measuring leverage

Fitted values ŷ are related to the response y by the equation

      ŷ = H y,   where H = X (XᵀX)⁻¹ Xᵀ,

so that

      ŷ_i = h_i1 y_1 + ... + h_ii y_i + ... + h_in y_n.

- The h_ij depend on the explanatory variables X; the h_ii are called the hat matrix diagonals (HMDs) and measure the influence y_i has on ŷ_i.
- The HMDs also reflect the distance between x_i and the average x̄, i.e. they measure how extreme an observation's x-values are.
Interpreting the HMDs

- Each HMD lies between 0 and 1.
- The average HMD is (k + 1)/n.
- An HMD larger than 3(k + 1)/n is considered extreme. A small sketch computing these quantities by hand follows.
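A minimal sketch, assuming educ.lm is the two-covariate education fit used on the next slide (hatvalues(educ.lm) gives the same diagonals directly):

> X <- model.matrix(educ.lm)               # design matrix, including the intercept
> H <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix
> hmd <- diag(H)                           # hat matrix diagonals (HMDs)
> k <- ncol(X) - 1; n <- nrow(X)
> mean(hmd)                                # equals (k + 1)/n
> which(hmd > 3*(k + 1)/n)                 # points flagged as high-leverage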


Example: The education data (without urban)
> hatvalues(educ.lm)

[Figure: hat matrix values for educ.lm plotted for all 50 states, with reference lines at (k + 1)/n = 3/50 and 3(k + 1)/n = 9/50; observation 50 lies well above the 9/50 cut-off.]


Studentised residuals

- How can we recognise a big residual? How big is big?
- The actual size depends on the units in which the y-variable is measured, so we need to standardise them.
- We can divide by their standard deviations.
- The variance of a typical residual e_i is

      Var(e_i) = (1 - h_ii) σ²,

  where h_ii is the i-th diagonal entry of the hat matrix H.


Studentised residuals

- Internally studentised (called "standardised" in R):

      e*_i = e_i / sqrt((1 - h_ii) s²)

  where s² is the usual estimate of the residual variance σ².
- Externally studentised (called "studentised" in R):

      e*_i = e_i / sqrt((1 - h_ii) s_i²)

  where s_i² is the estimate of σ² after deleting the i-th data point.


Studentised residuals

- How big is big?
- The internally studentised residuals are approximately standard normally distributed if the model is OK and there are no outliers.
- The externally studentised residuals have a t-distribution.
- Thus, studentised residuals should be between -2 and 2 with approximately 95% probability.
Studentised residuals: Calculating it in R

#Load the MASS library


> library(MASS)
# internally studentised (standardised in R)
> stdres(educ.lm)[50]
50
3.089699
# externally studentised (studentized in R)
> studres(educ.lm)[50]
50
3.424107
What does studentised mean?
Recognising outliers

- If a point is a low-leverage outlier, the residual will usually be large, so a large residual and a low HMD indicates an outlier.
- If a point is a high-leverage outlier, then a large error will usually cause a large residual.
- However, in extreme cases, a high-leverage outlier may not have a very big residual, depending on how much the point attracts the fitted plane. Thus, if a point has a large HMD and the residual is not particularly big, we cannot always tell whether the point is an outlier or not.
High-leverage outlier

[Figure: left - y against x with the fitted line and the true line; right - the corresponding "Residuals vs Fitted" plot with observations 1, 13, 19 and 26 labelled.]
Leverage-residual plots
> plot(educ.lm,which=5)
[Figure: "Residuals vs Leverage" plot for lm(educ ~ percap + under18), with Cook's distance contours at 0.5 and 1; observations 7, 45 and 50 are labelled.]
Interpreting LR plots

[Schematic: the leverage-residual plot divides into regions - small standardised residuals at low leverage are OK; large standardised residuals (positive or negative) at low leverage are low-leverage outliers; large standardised residuals at high leverage are high-leverage outliers; small residuals at high leverage are potential high-leverage outliers.]
No big studentised residuals, no big HMDs

[Figure: Example 1 - plot of y vs. x with fitted and true lines, and the corresponding "Residuals vs Leverage" plot; no points exceed the leverage cut-off 3(k+1)/n = 0.2 and no standardised residuals are large.]
One big studentised residual, no big HMDs

[Figure: Example 2 - plot of y vs. x with fitted and true lines, and the corresponding "Residuals vs Leverage" plot; observation 31 has a large standardised residual but low leverage.]
No big studentised residuals, one big HMD

[Figure: Example 3 - plot of y vs. x with fitted and true lines, and the corresponding "Residuals vs Leverage" plot; observation 31 has high leverage (above 3(k+1)/n = 0.2) but its standardised residual is not large.]
One big studentised residual, one big HMD

[Figure: Example 4 - plot of y vs. x with fitted and true lines, and the corresponding "Residuals vs Leverage" plot; observation 31 has both high leverage and a large standardised residual, with a large Cook's distance.]
Four big studentised residuals, one big HMD

[Figure: Example 5 - plot of y vs. x with fitted and true lines, and the corresponding "Residuals vs Leverage" plot; observations 8, 13, 26 and 31 have large standardised residuals, and observation 31 also has high leverage.]
HMD Summary

- Hat matrix diagonals
  - measure the effect of a point on its fitted value;
  - measure how outlying the x-values are (how high-leverage a point is);
  - are always between 0 and 1, with bigger values indicating higher leverage;
  - points with HMDs more than 3(k + 1)/n are considered high-leverage.
Influential points

- How can we tell if a high-leverage point/outlier is affecting the regression?
- By deleting the point and refitting the regression: a large change in the coefficients means the point is affecting the regression.
- Such points are called influential points.
- We do not want our analysis to be driven by one or two points!
Leave one out measures

- We can calculate a variety of measures by leaving out each data point in turn, and looking at the change in key regression quantities such as:
  - coefficients;
  - fitted values;
  - standard errors.
- We discuss each in turn.


Example: Education data

              With point 50    Without point 50
    Const        -557.451          -298.714
    percap          0.072             0.059
    under18         1.555             0.933


Coefficient measures: DFBETAS

DFBETAS: standardised difference in coefficients

      dfbetas = (b_j - b_j[i]) / se(b_j)

where b_j[i] is the estimate of the j-th coefficient with the i-th point deleted.

Problematic when |dfbetas| > 1. This is the criterion coded into R.
Coefficient measures: DFFITS

DFFITS: standardised difference in fitted values

      dffits = (ŷ_j - ŷ_j[i]) / se(ŷ_j)

Problematic when

      |dffits| > 3 sqrt((k + 1) / (n - k - 1)).
Coefficient measures: Covariance Ratio and Cook's D

Cov Ratio: measures the change in the standard errors of the estimated coefficients.

Problematic when the Cov Ratio is greater than 1 + 3(k + 1)/n or smaller than 1 - 3(k + 1)/n.

Cook's D: measures the overall change in the coefficients.

Problematic when greater than qf(.5, k+1, n-k-1) (the median of the F-distribution), roughly 1 in most cases.
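These measures can also be obtained individually with base R functions. A minimal sketch applying the cut-offs above to the education fit (educ.lm as before):

> k <- length(coef(educ.lm)) - 1; n <- nobs(educ.lm)
> db  <- dfbetas(educ.lm);  dfts <- dffits(educ.lm)
> cvr <- covratio(educ.lm); cd   <- cooks.distance(educ.lm)
> which(apply(abs(db), 1, max) > 1)                  # any |dfbetas| > 1
> which(abs(dfts) > 3*sqrt((k + 1)/(n - k - 1)))     # |dffits| cut-off
> which(abs(cvr - 1) > 3*(k + 1)/n)                  # cov ratio outside limits
> which(cd > qf(0.5, k + 1, n - k - 1))              # Cook's D above the F median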
Coefficient measures in R

Influence measures of
lm(formula = educ ~ percap + under18, data = educ.df) :

     dfb.1_   dfb.prcp  dfb.un18    dffit  cov.r     cook.d    hat inf
1    0.0120   -0.01794  -0.00588   0.0233  1.121   1.84e-04 0.0494
...
10   0.0638   -0.16792  -0.02222  -0.3631  0.803   4.05e-02 0.0257   *
...
44   0.0229    0.00298  -0.02948  -0.0340  1.283   3.94e-04 0.1690   *
...
50  -2.3688    1.50181   2.23393   2.4733  0.821   1.66e+00 0.3429   *
---
Cut-off values for each measure:
     1.0000    1.00000   1.00000   0.7579  0.82-1.18  8.00e-01 0.18
Plotting influence
# There will be seven plots
par(mfrow=c(2,4))
# Plot the measures using R330
influenceplots(educ.lm)
[Figure: index plots of the influence measures for educ.lm - dfb.1_, dfb.prcp, dfb.un18, DFFITS, |COV RATIO - 1|, Cook's D and hat values against observation number; observation 50 stands out on almost every measure (observations 10 and 44 also appear on some).]


Remedies for outliers

- Correct typographical errors in the data.
- Delete a small number of points and refit (we do not want the fitted regression to be determined by one or two influential points).
- Report the existence of outliers separately: they are often of scientific interest.
- Do not delete too many points.


Diagnostic steps

We test for <> using <>:


Planarity: Residuals vs. fitted values, Added variable plots,
GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Independence: ...

Outliers: Leverage-Residual plots, influence measures

Normality of Errors: QQ plots, Weisberg-Bingham test (and many more)
https://xkcd.com/934/
Independence

- One of the regression assumptions is that the errors are independent.
- Data that are collected sequentially over time often have errors that are not independent.
- If the independence assumption does not hold, then the standard errors will be wrong and the tests and confidence intervals will be unreliable.
- We need to be able to detect lack of independence.


Types of dependence

- If large positive errors have a tendency to follow large positive errors, and large negative errors a tendency to follow large negative errors, we say the data have positive autocorrelation.
- If large positive errors have a tendency to follow large negative errors, and large negative errors a tendency to follow large positive errors, we say the data have negative autocorrelation.
Diagnostics: Positive Autocorrelation

If the errors are positively autocorrelated:

- Plotting the residuals against time will show long runs of positive and negative residuals.
- Plotting residuals against the previous residual (i.e. e_i vs. e_{i-1}) will show a positive trend.
- A correlogram of the residuals will show positive spikes, gradually decaying.
Diagnostics: Negative Autocorrelation

If the errors are negatively autocorrelated:

- Plotting the residuals against time will show alternating positive and negative residuals.
- Plotting residuals against the previous residual (i.e. e_i vs. e_{i-1}) will show a negative trend.
- A correlogram of the residuals will show alternating positive and negative spikes, gradually decaying.
Residuals against time

res <- residuals(lm.obj)
plot(1:length(res), res, xlab="time", ylab="residuals",
     type="b")
abline(h=0, lty=2)
Residuals against time: Plot

[Figure: residuals plotted against time for three simulated series with autocorrelation 0.9, 0.0 and -0.9.]
Residuals against their predecessor

res <- residuals(lm.obj)
n <- length(res)
plot.res <- res[-1]   # element 1 has no predecessor
prev.res <- res[-n]   # last residual has no successor
plot(prev.res, plot.res, xlab="previous residual",
     ylab="residual")
Residuals against their predecessor: Plot

[Figure: residuals plotted against the previous residual for the three simulated series with autocorrelation 0.9, 0.0 and -0.9.]


Correlogram

acf(residuals(lm.obj))

- The autocorrelation function (acf, also called the correlogram) investigates the correlation between a residual and another residual k time units apart.
- This is also called the lag k autocorrelation.
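For a single lag, roughly the same quantity can be computed by hand (a sketch; acf() computes all lags at once and uses a slightly different divisor):

> res <- residuals(lm.obj)
> n <- length(res)
> cor(res[-1], res[-n])   # approximately the lag-1 autocorrelation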


Correlogram: Plot

[Figure: correlograms (ACF against lag, 0-20) for the three simulated series with autocorrelation 0.9, 0.0 and -0.9.]
Remedies

- Time-series modelling.
- Spatial correlation can also be modelled (correlation matrices).
- Mammal data: use independent contrasts to mask ancestral correlation.
- If the dependency is justifiably random, permute the observations.


Diagnostic steps

We test for <> using <>:


Planarity: Residuals vs. fitted values, Added variable plots,
GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Independence: Correlogram

Outliers: Leverage-Residual plots, influence measures

Normality of Errors: QQ plots, Weisberg-Bingham test (and many more)
https://xkcd.com/915/
1976 Chateau Margaux
NZ$ 250-800

1976 Chateau Latour
NZ$ 370-900

1976 Chateau Lafite Rothschild
NZ$ 550-1,300
Case study: The wine data

- Data on 27 vintages of Bordeaux wines.
- Variables are:
    year: 1952-1980;
    price: in 1980 US$, converted to an index with 1961 = 100;
    temp: average temperature during the growing season (°C);
    h.rain: total rainfall during the harvest period, mm;
    w.rain: total rainfall over the preceding winter, mm.
- Data are part of the R330 package, and also available on the course website.


Bordeaux wines: Their price

- Bordeaux wines are an iconic luxury consumer good. Many consider these to be the best wines in the world.
- The quality and the price depend on the vintage (i.e. the year the wines are made).
- The prices are in 1980 US$, in index form with 1961 = 100.
Bordeaux wines: Their price

[Figure: Bordeaux wine price index 1952-1980, plotted against year.]
Bordeaux wines: Pairs plot

[Figure: pairs plot of year, temp, h.rain, w.rain and price, with pairwise correlations shown in the lower panels (price with year 0.45, temp 0.59, h.rain 0.45 and w.rain 0.23, in absolute value).]


Bordeaux wines: Preliminary analysis

Call:
lm(formula = price ~ year + temp + h.rain + w.rain,
data = wine.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1305.52761 597.31137 2.186 0.03977 *
year -0.82055 0.29140 -2.816 0.01007 *
temp 19.25337 3.92945 4.900 6.72e-05 ***
h.rain -0.10121 0.03297 -3.070 0.00561 **
w.rain 0.05704 0.01975 2.889 0.00853 **
---
Residual standard error: 11.69 on 22 degrees of freedom
Multiple R-squared: 0.7369, Adjusted R-squared: 0.6891
F-statistic: 15.41 on 4 and 22 DF, p-value: 3.806e-06
Bordeaux wines: Diagnostic plots

[Figure: the four standard lm diagnostic plots ("Residuals vs Fitted", "Normal Q-Q", "Scale-Location", "Residuals vs Leverage") for the preliminary wine model; observations 6, 8 and 19 are labelled.]


Bordeaux wines: Checking normality

> qqnorm(residuals(wine.lm))
> WB.test(wine.lm)

WB test statistic = 0.957
p = 0.03

[Figure: normal Q-Q plot of the Bordeaux residuals.]
Bordeaux wines: Box-Cox routine

[Figure: Box-Cox profile log-likelihood for the wine model, plotted over powers -2 to 2 with the 95% confidence interval marked; the indicated power is -1/3.]


Bordeaux wines: Transform and refit

- Use y^(-1/3) as the response (reciprocal cube root).
- Has the fit improved?
  - Are the errors more normal? (normal plot and Weisberg-Bingham test)
  - Has R² increased?
- Would further transformations improve normality?
  - A Box-Cox maximum at p = 1 for the transformed response means the fit cannot be improved by further transformation.
Bordeaux wines: Re-checking normality

> qqnorm(residuals(wine.upd))
> WB.test(wine.upd)

WB test statistic = 0.988
p = 0.65

[Figure: normal Q-Q plot of the residuals from the transformed model.]
Bordeaux wines: Re-Checking normality

Call:
lm(formula = price^(-1/3) ~ ., data = wine.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.666e+00 1.613e+00 -2.273 0.03317 *
year 2.639e-03 7.870e-04 3.353 0.00288 **
temp -7.051e-02 1.061e-02 -6.644 1.11e-06 ***
h.rain 4.423e-04 8.905e-05 4.967 5.71e-05 ***
w.rain -1.157e-04 5.333e-05 -2.170 0.04110 *
---
Residual standard error: 0.03156 on 22 degrees of freedom
Multiple R-squared: 0.8331, Adjusted R-squared: 0.8028
F-statistic: 27.46 on 4 and 22 DF, p-value: 2.841e-08
Bordeaux wines: Box-Cox revisited

[Figure: Box-Cox profile log-likelihood for the transformed response, plotted over powers -2 to 2 with the 95% confidence interval marked; the indicated power is 1.]


Bordeaux wines: Conclusions on Normality

- The transformation has been spectacularly successful in improving the fit!
- What about other aspects of the fit?
  - Residuals vs. fitted values?
  - Pairs plot
  - Added variable plots
Bordeaux wines: Diagnostic plots revisited

[Figure: the four standard lm diagnostic plots for the transformed model; observations 6, 20, 23 and 24 are labelled. ALL GOOD.]
Bordeaux wines: GAM plots

> library(mgcv)
> plot(gam(price^(-1/3)~temp+s(h.rain)+s(w.rain)+year,
data=wine.df))
[Figure: GAM partial-effect plots s(h.rain,1) against h.rain and s(w.rain,4.87) against w.rain for the transformed wine model.]
Bordeaux wines: Polynomial fit
Call:
lm(formula = price^(-1/3) ~ temp + h.rain + year
+ poly(w.rain, 4), data = wine.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.974e+00 1.532e+00 -1.942 0.06715 .
temp -7.478e-02 1.048e-02 -7.137 8.75e-07 ***
h.rain 4.869e-04 8.662e-05 5.622 2.02e-05 ***
year 2.284e-03 7.459e-04 3.062 0.00642 **
poly(w.rain, 4)1 -7.561e-02 3.263e-02 -2.317 0.03180 *
poly(w.rain, 4)2 4.469e-02 3.294e-02 1.357 0.19079
poly(w.rain, 4)3 -2.153e-02 2.945e-02 -0.731 0.47374
poly(w.rain, 4)4 6.130e-02 2.956e-02 2.074 0.05194 .
---
Residual standard error: 0.02931 on 19 degrees of freedom
Multiple R-squared: 0.8757, Adjusted R-squared: 0.8299
F-statistic: 19.13 on 7 and 19 DF, p-value: 2.352e-07
Bordeaux wines: Final fit

Call:
lm(formula = price^(-1/3) ~ ., data = wine.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.666e+00 1.613e+00 -2.273 0.03317 *
year 2.639e-03 7.870e-04 3.353 0.00288 **
temp -7.051e-02 1.061e-02 -6.644 1.11e-06 ***
h.rain 4.423e-04 8.905e-05 4.967 5.71e-05 ***
w.rain -1.157e-04 5.333e-05 -2.170 0.04110 *
---
Residual standard error: 0.03156 on 22 degrees of freedom
Multiple R-squared: 0.8331, Adjusted R-squared: 0.8028
F-statistic: 27.46 on 4 and 22 DF, p-value: 2.841e-08
Bordeaux wines: Influential points

[Figure: index plots of the influence measures for the final wine model (dfbetas for each coefficient, DFFITS, |COV RATIO - 1|, Cook's D, hat values) against observation number.]


Bordeaux wines: Conclusions

- The model using price^(-1/3) as the response fits very well.
- Use this model for prediction and for understanding relationships:
  - Coefficient of year is positive, so the transformed response increases with year (i.e. older vintages are more valuable).
  - Coefficient of temp is negative, so high temperatures decrease the transformed response (i.e. increase price).
  - Coefficient of h.rain is positive, so high harvest rain increases the transformed response (i.e. decreases price).
  - Coefficient of w.rain is negative, so high winter rain decreases the transformed response (i.e. increases price).
- A sketch of back-transforming predictions to the price scale follows below.
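A minimal sketch of using the model for prediction and undoing the transformation (the covariate values are made up purely for illustration, and wine.upd is assumed to be the price^(-1/3) fit):

> new.vintage <- data.frame(year=1975, temp=17,
+                           h.rain=100, w.rain=600)    # hypothetical vintage
> pred.t <- predict(wine.upd, newdata=new.vintage)     # on the price^(-1/3) scale
> pred.t^(-3)                                          # back on the price index scale

As noted on the "Uses for Box-Cox plots" slide, the back-transformed value describes a typical price rather than the mean price.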
Statistical outliers

http://xkcd.com/539/
