
STATS 330: Lecture 9-12

Diagnostics

6.08.2015
Aims of the next four lectures

- To give you an overview of the modelling cycle.
- To have a detailed discussion of diagnostic procedures.
The modelling cycle

- We have seen that the regression model describes rather specialised forms of data:
  - Data are planar;
  - Scatter is uniform over the plane.
- We have looked at some plots that help us decide if the data are suitable for regression modelling:
  - pairs
  - reg3d
  - coplot
Residual analysis

- Another approach is to fit the model and examine the residuals.
- If the model is appropriate, the residuals have no pattern.
- A pattern in the residuals usually indicates that the model is not appropriate.
- If this is the case we have two options:
  1. Select another form of model, e.g. non-linear regression;
  2. Transform the data so that the regression model fits the transformed data.
The Modelling Cycle

PLOTS and THEORY -> Choose Model -> Fit Model -> Examine Residuals
  - Bad fit: Transform, then refit the model.
  - Good fit: USE MODEL.
What constitutes a bad fit?

- Non-planar data: Seriously affects meaning and accuracy of estimated coefficients.
- Outliers in the data: Seriously affects meaning and accuracy of estimated coefficients.
- Non-constant scatter: Affects standard error of estimate.
- Errors not independent: Affects standard error of estimate.
- Errors not normally distributed: Affects standard error of estimate.
Diagnostic steps

We test for <> using <>:


Planarity: ...

Constant Variance: ...

Outliers: ...

Independence: ...

Normality of Errors: ...


Detecting non-planar data

- We can diagnose non-planar data (non-linearity) by fitting the model, and then
  - plotting residuals versus fitted values;
  - plotting residuals against explanatory variables;
  - fitting additive models.
- In each case, a curved plot indicates non-planar data.

Plotting residuals vs. fitted values

> data(cherry.df)
> cherry.lm <- lm(volume~diameter+height,data=cherry.df)
> plot(cherry.lm,which=1)

which=1: selects the plot of residuals vs. fitted values
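The same idea applies to the second diagnostic, residuals against the explanatory variables. A minimal sketch using the fit above (curvature in either panel again suggests non-planar data):

> par(mfrow=c(1,2))
> plot(cherry.df$diameter, residuals(cherry.lm),
+      xlab="diameter", ylab="residuals")    # residuals vs. diameter
> abline(h=0, lty=2)
> plot(cherry.df$height, residuals(cherry.lm),
+      xlab="height", ylab="residuals")      # residuals vs. height
> abline(h=0, lty=2)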


Plotting residuals vs. fitted values

[Figure: "Residuals vs Fitted" plot for lm(volume ~ height + diameter); observations 2, 18 and 31 are labelled.]
Additive models

- These are models of the form

      Y = g_1(x_1) + g_2(x_2) + ... + g_k(x_k) + ε

  where g_1, ..., g_k are transformations.
- Fitted using the gam function in R (mgcv package).
- The transformations are estimated by the software.
- Use the fitted transformations to suggest good transformations.

Example: Cherry trees

> library(mgcv)
> cherry.gam <- gam(volume~s(diameter)+s(height),
+ data=cherry.df)
> plot(cherry.gam,residuals=T,pages=1)
Example: Cherry trees

[Figure: GAM partial-effect plots with partial residuals; left panel s(diameter,2.69) against diameter, right panel s(height,1) against height.]
Fitting polynomials

- To fit a model y = β_0 + β_1 x + β_2 x², use

      y~poly(x,2)

- To fit a model y = β_0 + β_1 x + β_2 x² + β_3 x³, use

      y~poly(x,3)

  etc.
Orthogonal polynomials

- The model fitted by y~poly(x,2) is of the form

      Y = β_0 + β_1 p_1(x) + β_2 p_2(x)

  where

      p_1: polynomial of degree 1, i.e. of the form a_0 + a_1 x
      p_2: polynomial of degree 2, i.e. of the form b_0 + b_1 x + b_2 x²

- p_1, p_2 are chosen to be uncorrelated (best possible estimation).
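A quick numerical check of this (a sketch using the cherry data; poly() constructs the basis so that its columns are orthogonal):

> P <- poly(cherry.df$diameter, 2)   # orthogonal polynomial basis of degree 2
> round(cor(P[,1], P[,2]), 10)       # essentially zero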


Adding a quadratic term: Cherry trees

Call:
lm(formula = volume ~ poly(diameter, 2) + height,
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.56553 6.72218 0.233 0.817603
poly(diameter, 2)1 80.25223 3.07346 26.111 < 2e-16 ***
poly(diameter, 2)2 15.39923 2.63157 5.852 3.13e-06 ***
height 0.37639 0.08823 4.266 0.000218 ***
---
Residual standard error: 2.625 on 27 degrees of freedom
Multiple R-squared: 0.9771, Adjusted R-squared: 0.9745
F-statistic: 383.2 on 3 and 27 DF, p-value: < 2.2e-16
Quadratic equation

Call:
lm(formula = volume ~ diameter + I(diameter^2) + height,
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.92041 10.07911 -0.984 0.333729
diameter -2.88508 1.30985 -2.203 0.036343 *
I(diameter^2) 0.26862 0.04590 5.852 3.13e-06 ***
height 0.37639 0.08823 4.266 0.000218 ***
---
Residual standard error: 2.625 on 27 degrees of freedom
Multiple R-squared: 0.9771, Adjusted R-squared: 0.9745
F-statistic: 383.2 on 3 and 27 DF, p-value: < 2.2e-16
Quadratic equation

volume ≈ -9.92 - 2.89·diameter + 0.27·diameter² + 0.38·height

[Figure: fitted quadratic surface of volume against diameter and height.]
Splines

- An alternative to polynomials is splines: these are piecewise cubics, which join smoothly at knots.
- They give a more flexible fit to the data.
- The fit at one point is not affected by values at distant points, unlike polynomials.
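The cherry spline fit on the slides below uses bs() from the splines package with a vector knot.points that is not shown. A minimal sketch, assuming three interior knots at the quartiles of diameter (the actual knots used in the slides may differ):

> library(splines)
> knot.points <- quantile(cherry.df$diameter, c(0.25, 0.5, 0.75))  # assumed knots
> cherry.bs <- lm(volume ~ bs(diameter, knots=knot.points) + height,
+                 data=cherry.df)
> summary(cherry.bs)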
Example with 4 knots

[Figure: example spline with 4 knots; y plotted against x on [0, 1].]
Cherry splines

Call:
lm(formula = volume ~ bs(diameter, knots = knot.points) + height,
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.3679 7.4856 -2.187 0.03921 *
bs(diameter, knots = knot.points)1 0.1941 7.9374 0.024 0.98070
bs(diameter, knots = knot.points)2 5.5744 3.1704 1.758 0.09201 .
bs(diameter, knots = knot.points)3 10.7976 3.9798 2.713 0.01240 *
bs(diameter, knots = knot.points)4 31.4053 5.5545 5.654 9.35e-06 ***
bs(diameter, knots = knot.points)5 42.2665 6.1297 6.895 4.97e-07 ***
bs(diameter, knots = knot.points)6 58.6454 4.2781 13.708 1.49e-12 ***
height 0.3970 0.1050 3.780 0.00097 ***
---
Residual standard error: 2.8 on 23 degrees of freedom
Multiple R-squared: 0.9778, Adjusted R-squared: 0.971
F-statistic: 144.4 on 7 and 23 DF, p-value: < 2.2e-16
Cherry splines

[Figure: basis functions for quadratic splines, plotted for x on [0, 1].]
Cherry splines

[Figure: Volume against Diameter for the cherry data, with the polynomial and spline fits overlaid.]
Example: Tyre abrasion data

- Data collected in an experiment to study the abrasion resistance of tyres.
- Variables are:
    hardness: hardness of the rubber;
    tensile: tensile strength of the rubber;
    abloss: abrasion loss, the amount of rubber worn away in a standard test (response).
Tyre abrasion data

Call:
lm(formula = abloss ~ hardness + tensile, data = rubber.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 885.1611 61.7516 14.334 3.84e-14 ***
hardness -6.5708 0.5832 -11.267 1.03e-11 ***
tensile -1.3743 0.1943 -7.073 1.32e-07 ***
---
Residual standard error: 36.49 on 27 degrees of freedom
Multiple R-squared: 0.8402, Adjusted R-squared: 0.8284
F-statistic: 71 on 2 and 27 DF, p-value: 1.767e-11
Tyre abrasion data

- We will use this example to illustrate all the methods discussed so far for checking whether the data are planar, i.e. scattered about a flat regression plane:
  - Pairs plot
  - Conditional plot (coplot)
  - Residuals vs. fitted values plot
  - Fitting GAMs
Pairs plot: tensile vs. hardness non-linear

[Figure: pairs plot of hardness, tensile and abloss; absolute correlations shown are 0.30, 0.74 and 0.30.]
Coplot: suggestion of non-planarity

[Figure: coplot of abloss against tensile, given hardness.]
Residuals vs. fitted values: weak suggestion of non-planarity

[Figure: "Residuals vs Fitted" plot for the rubber model; observations 10, 22 and 29 are labelled.]
GAMs: quite strong indication of non-planarity

hardness looks okay, but tensile needs transformation.

[Figure: GAM partial-effect plots s(tensile,5.42) against tensile and s(hardness,1) against hardness.]
Fitting a fourth degree polynomial

> rubber.poly <- lm(abloss~hardness+tensile+I(tensile^2)
+                   +I(tensile^3)+I(tensile^4),data=rubber.df)
> summary(rubber.poly)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.862e+04 4.177e+03 -4.458 0.000165 ***
hardness -6.261e+00 4.124e-01 -15.182 8.35e-14 ***
tensile 4.414e+02 9.836e+01 4.487 0.000153 ***
I(tensile^2) -3.693e+00 8.546e-01 -4.321 0.000233 ***
I(tensile^3) 1.342e-02 3.246e-03 4.133 0.000377 ***
I(tensile^4) -1.794e-05 4.553e-06 -3.940 0.000613 ***
---
Residual standard error: 23.25 on 24 degrees of freedom
Multiple R-squared: 0.9423, Adjusted R-squared: 0.9303
F-statistic: 78.46 on 5 and 24 DF, p-value: 4.504e-14
Fitting splines

> rubber.bs <- lm(abloss~hardness+bs(tensile,df=4),
+                 data=rubber.df)
> summary(rubber.bs)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 612.1556 43.0348 14.225 3.43e-13 ***
hardness -6.1914 0.4139 -14.959 1.15e-13 ***
bs(tensile, df = 4)1 195.5549 40.6339 4.813 6.69e-05 ***
bs(tensile, df = 4)2 -148.3497 38.6717 -3.836 0.000796 ***
bs(tensile, df = 4)3 -24.2971 37.7010 -0.644 0.525385
bs(tensile, df = 4)4 -61.0593 25.4829 -2.396 0.024720 *
---
Residual standard error: 23.36 on 24 degrees of freedom
Multiple R-squared: 0.9418, Adjusted R-squared: 0.9297
F-statistic: 77.7 on 5 and 24 DF, p-value: 5.021e-14
GAM plots and splines

- We use GAM plots to decide which variable we need to transform and how.
- The shape could be a polynomial...

  [Figure: s(diameter,2.69) against diameter for the cherry data.]

- ...or a locally smoothed function.

  [Figure: s(tensile,5.42) against tensile for the rubber data.]

- But be aware of the distribution of the observations.

  [Figure: s(medianIncome,3.03) against medianIncome, illustrating this point.]
Diagnostic steps

We test for <> using <>:


Planarity: Residuals vs. fitted values, Residuals vs. covariates,
added variable plots, GAM plots

Constant Variance: ...

Outliers: ...

Independence: ...

Normality of Errors: ...


Remedies for non-planar data

- So far we have discussed how to look for non-planarity, and stated that the remedy is transformation.
- How do we choose a transformation when non-planarity is indicated?
  - Theory
  - Ladder of powers
  - Polynomials
- We illustrate with a few examples.


Using theory: Cherry trees

- As we have observed, a tree trunk is a bit like a cone, i.e. volume, height and diameter are related by

      volume ≈ (π/3) × (diameter/2)² × height = (π/12) × diameter² × height

  so that

      log(volume) = log(π/12) + 2 log(diameter) + log(height)

- So a linear regression using the logged variables should work!
- In fact, R² increases from 94.8% for the untransformed model to 97.7%.
- Also, the intercept b_0 = -1.70 is closer to the value implied by a cone, log(π/12) ≈ -1.34, than by a cylinder, log(π/4) ≈ -0.24.
Using theory: Cherry trees

Call:
lm(formula = log(volume) ~ log(height) + log(diameter),
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.70492 0.88190 -1.933 0.0634 .
log(height) 1.11712 0.20444 5.464 7.81e-06 ***
log(diameter) 1.98265 0.07501 26.432 < 2e-16 ***
---
Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16
Using theory: Cherry trees

[Figure: "Residuals vs Fitted" plots for lm(volume ~ height + diameter) (left) and lm(log(volume) ~ log(height) + log(diameter)) (right), shown side by side for comparison.]
Using GAM plots: Tyre abrasion data

> rubber.gam <- gam(abloss~s(hardness)+s(tensile),
+                   data=rubber.df)
> plot(rubber.gam,pages=1)

[Figure: GAM partial-effect plots s(tensile,5.42) against tensile and s(hardness,1) against hardness.]
Refitting model: Tyre abrasion data

- The GAM curve for tensile looks like a polynomial, so fit a polynomial:

  > lm(abloss~hardness+poly(tensile,4),data=rubber.df)

- Usually a lot of trial and error is involved.
- We have succeeded when
  - R² improves;
  - residual plots show no pattern.
- A 4th degree polynomial works well for the rubber data: R² improves from 84% to 94%.
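A minimal sketch of that check, comparing R-squared before and after the polynomial transformation (fits as in the earlier slides):

> rubber.lm   <- lm(abloss ~ hardness + tensile, data=rubber.df)
> rubber.poly <- lm(abloss ~ hardness + poly(tensile,4), data=rubber.df)
> c(before = summary(rubber.lm)$r.squared,     # about 0.84
+   after  = summary(rubber.poly)$r.squared)   # about 0.94
> plot(rubber.poly, which=1)                   # check that the pattern has gone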
Why 4th degree? Tyre abrasion data

Call:
lm(formula = abloss ~ hardness + poly(tensile, 5),
data = rubber.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 615.3617 29.8178 20.637 2.44e-16 ***
hardness -6.2608 0.4199 -14.911 2.59e-13 ***
poly(tensile, 5)1 -264.3933 25.0612 -10.550 2.76e-10 ***
poly(tensile, 5)2 23.6148 25.3437 0.932 0.361129
poly(tensile, 5)3 119.9500 24.6356 4.869 6.46e-05 ***
poly(tensile, 5)4 -91.6951 23.6920 -3.870 0.000776 ***
poly(tensile, 5)5 9.3811 23.6684 0.396 0.695495
---
Residual standard error: 23.67 on 23 degrees of freedom
Multiple R-squared: 0.9427, Adjusted R-squared: 0.9278
F-statistic: 63.11 on 6 and 23 DF, p-value: 3.931e-13
Ladder of powers

- Rather than fit polynomials in some explanatory variables, guided by GAM plots, we can transform the response using the ladder of powers,
- i.e. use y^p as the response rather than y, for some power p.
- Choose p either by trial and error using R², or use a Box-Cox plot (see later in the diagnostics cycle). A sketch of the trial-and-error approach follows below.
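A minimal sketch for the rubber data (the candidate powers are an arbitrary selection from the ladder, chosen purely for illustration):

> powers <- c(-1, -1/2, 1/3, 1/2, 1, 2)
> r2 <- sapply(powers, function(p) {
+   fit <- lm(I(abloss^p) ~ hardness + tensile, data=rubber.df)
+   summary(fit)$r.squared
+ })
> data.frame(power=powers, R.squared=round(r2, 3))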
https://xkcd.com/833/
Checking for equal scatter

- The model specifies that the scatter about the regression plane is uniform.
- In practice this means that the scatter does not depend on the explanatory variables or on the mean of the response.
- All tests and confidence intervals rely on this!


Checking for equal scatter

- Scatter is measured by the size of the residuals.
- A common problem is that the scatter increases as the mean response increases.
- This means that the big residuals occur when the fitted values are big.
- We recognise this by a funnel effect in the residuals vs. fitted values plot.
Example: Education expenditure data

- Data for the 50 states of the USA.
- Variables are:
    educ: per capita expenditure on education (response);
    percap: per capita income;
    under18: number of residents per 1000 under 18;
    urban: number of residents per 1000 in urban areas.
- Fit the model
  > lm(educ ~ percap + under18 + urban,data=educ.df)
Pairs plot: Education expenditure data

[Figure: pairs plot of educ, urban, percap and under18, with pairwise correlations (roughly 0.27-0.63 in absolute value) shown in the lower panels.]


Outlier alert! Education expenditure data
Basic fit, outlier in

Call:
lm(formula = educ ~ urban + percap + under18, data = educ.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -555.92562 123.46634 -4.503 4.56e-05 ***
urban -0.00476 0.05174 -0.092 0.927
percap 0.07236 0.01165 6.211 1.40e-07 ***
under18 1.55134 0.31545 4.918 1.16e-05 ***
---
Residual standard error: 40.53 on 46 degrees of freedom
Multiple R-squared: 0.5902, Adjusted R-squared: 0.5634
F-statistic: 22.08 on 3 and 46 DF, p-value: 5.271e-09
Basic fit, outlier out

Call:
lm(formula = educ ~ urban + percap + under18, data = educ.df,
subset = -50)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430 132.61422 -2.097 0.041664 *
urban 0.06624 0.04966 1.334 0.188948
percap 0.04827 0.01220 3.958 0.000266 ***
under18 0.88983 0.33159 2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07
Residual analysis
> par(mfrow=c(1,2))
> plot(educ50.lm,which=c(1,2))

[Figure: "Residuals vs Fitted" and "Normal Q-Q" plots for educ50.lm (the fit excluding observation 50); observations 7, 10 and 15 are labelled.]


Remedies

Either: Transform the response (ladder of powers);
Or: Estimate the variances of the observations and use weighted least squares.
Transforming the response
> tr.educ50.lm <- lm(1/educ~urban+percap
+                    +under18,data=educ.df[-50,])

[Figure: "Scale-Location" plots before (left) and after (right) the reciprocal transformation of educ; observations 7, 10, 14 and 15 are labelled.]


What power to choose?

- How did we know to use reciprocals?
- Think of a more general model

      Y^p = β_0 + β_1 x_1 + ... + β_k x_k + ε

  where p is some power.
- We then estimate p from the data using a Box-Cox plot.
Box-Cox plots
> library(MASS)
> boxcox(educ~urban+percap+under18,
+        data=educ.df[-50,])

95%
5
logLikelihood

0
5
10

2 1 0 1 2


Uses for Box-Cox plots

Transformation of the response according to a Box-Cox plot may fix:

- non-normality of residuals;
- unequal variances in the response;
- non-planarity of the model;
- the need to transform covariates indicated by GAM plots.

Always be aware that a description of the effect on the response is usually only straightforward for the original response (means, additive effects) or a log-transformed response (medians, multiplicative effects).
Weighted least squares

- Observations need to have constant variance.
- If the i-th observation has variance v_i σ², then we can get a valid test by using weighted least squares, minimising the sum of the weighted squared residuals

      wRSS = Σ_{i=1}^{n} r_i² / v_i

  rather than the sum of squared residuals

      RSS = Σ_{i=1}^{n} r_i².

- We need to know the variance weights v_i.


Finding the weights

1. Plot the squared residuals versus the fitted values.
2. Smooth the plot.
3. Estimate the variance of an observation by the smoothed squared residual.
4. The weight is the reciprocal of the smoothed squared residual.

Rationale: the variance is a function of the mean. A sketch of this recipe in base R is given below.
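A minimal sketch of steps 1-4, using lowess for the smoothing (the funnel() function from the R330 package, used on the next slide, packages this kind of calculation):

> fv <- fitted(educ50.lm)                        # fitted values
> sq <- residuals(educ50.lm)^2                   # squared residuals (step 1)
> sm <- lowess(fv, sq)                           # smooth the plot (step 2)
> vars <- pmax(approx(sm$x, sm$y, xout=fv)$y,    # smoothed squared residuals
+              1e-8)                             #   = estimated variances (step 3)
> educ.wls <- lm(educ ~ urban + percap + under18,
+                data=educ.df[-50,], weights=1/vars)   # weights = reciprocals (step 4)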


Doing it in R
> vars <- funnel(educ50.lm)
Slope: 1.723989
[Figure: funnel() diagnostic plots - log standard errors against log means, and squared residuals against fitted values.]


Recall model fit after outlier removal

Call:
lm(formula = educ ~ urban + percap + under18,
data = educ.df, subset = -50)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430 132.61422 -2.097 0.041664 *
urban 0.06624 0.04966 1.334 0.188948
percap 0.04827 0.01220 3.958 0.000266 ***
under18 0.88983 0.33159 2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07
Weighted Model

Call:
lm(formula = educ ~ urban + percap + under18,
data = educ.df[-50,], weights = 1/vars)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -270.29363 102.61073 -2.634 0.0115 *
urban 0.01197 0.04030 0.297 0.7677
percap 0.05850 0.01027 5.694 8.88e-07 ***
under18 0.82384 0.27234 3.025 0.0041 **
---
Residual standard error: 1.019 on 45 degrees of freedom
Multiple R-squared: 0.629, Adjusted R-squared: 0.6043
F-statistic: 25.43 on 3 and 45 DF, p-value: 8.944e-10
Diagnostic steps

We test for <> using <>:


Planarity: Residuals vs. fitted values, Added variable plots,
GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Independence: ...

Outliers: ...

Normality of Errors: ...


http://xkcd.com/674/
Normality

- Another assumption in the regression model is that the errors are normally distributed.
- This is not so crucial, but it can be important if the errors have a long-tailed distribution, since this will imply there are several outliers.
- The normality assumption is important for prediction.


Detecting non-normality
> qqnorm(residuals(xyz.lm))

[Figure: Normal Q-Q plots of residuals for four cases - normal, right-skewed, symmetric short-tailed, and symmetric long-tailed distributions.]

The Weisberg-Bingham test

- Test statistic: WB is the squared correlation of the points in the normal Q-Q plot; it measures how straight the plot is.
- WB lies between 0 and 1. Values close to 1 indicate normality.
- The R function WB.test calculates the WB statistic and computes the p-value for the test with null hypothesis that the sample is normal.
- The WB test is a variant of the Shapiro-Wilk test.
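A rough sketch of the statistic itself (the exact plotting positions used by WB.test in the R330 package may differ; Blom's positions are assumed here):

> wb.stat <- function(fit) {
+   r <- sort(residuals(fit))
+   n <- length(r)
+   q <- qnorm((1:n - 0.375)/(n + 0.25))   # assumed normal plotting positions
+   cor(r, q)^2                            # squared correlation of the Q-Q plot
+ }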


Example: Residuals of cherry trees

> qqnorm(residuals(cherry.cone))
> WB.test(cherry.cone)

WB test statistic = 0.983
p = 0.36

Since the p-value is large, there is no evidence against normality.

[Figure: normal Q-Q plot of the residuals for the cherry cone model.]
Remedies for Non-normality

- The standard remedy is to transform the response using a power transformation.
- The idea is that on the original scale the model does not fit well, but on the transformed scale it does.
- The power is obtained by means of a Box-Cox plot.
- The idea is to assume that, for some power p, the response y^p follows the regression model. The plot is a graphical way of estimating the power p.
- Technically, it is a plot of the profile likelihood.


Diagnostic steps

We test for <> using <>:


Planarity: Residuals vs. fitted values, Added variable plots,
GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Independence: ...

Outliers: ...

Normality of Errors: QQ plots, Weisberg-Bingham test (and many more)
https://xkcd.com/242/
Outliers and high-leverage points

- An outlier is a point that has a larger or smaller y value than the model would suggest.
  - It can be due to a genuine large error ε;
  - It can be caused by typographical errors in recording the data.
- A high-leverage point is a point with extreme values of the explanatory variables.
Outliers
- The effect of an outlier depends on whether it is also a high-leverage point.
- A high-leverage outlier
  - can attract the fitted plane, distorting the fit, sometimes extremely;
  - in extreme cases may not have a big residual;
  - in extreme cases can increase R².
- A low-leverage outlier
  - does not distort the fit to the same extent;
  - usually has a big residual;
  - inflates standard errors and decreases R².
Outliers

[Figure: four scatterplots of y against x illustrating (1) no high-leverage points and no outliers, (2) a low-leverage outlier with a big residual, (3) a high-leverage outlier, and (4) a high-leverage point that is not an outlier.]
Example: The education data (without urban)

[Figure: 3D scatterplot of educ against percap and under18, with one point flagged as a high-leverage point.]
An outlier too?

[Figure: left - per capita income against number of residents per 1000 under 18, with observation 50 standing apart; right - residuals(educ.lm) against predict(educ.lm), where the residual for observation 50 is somewhat extreme.]


Measuring leverage

Fitted values ŷ are related to the response y by the equation

      ŷ = H y,   where H = X (XᵀX)⁻¹ Xᵀ,

so that

      ŷ_i = h_i1 y_1 + ... + h_ii y_i + ... + h_in y_n.

- The h_ij depend on the explanatory variables X; the h_ii are called the hat matrix diagonals (HMDs) and measure the influence y_i has on ŷ_i.
- The HMDs also reflect the distance between x_i and the average x̄, i.e. they measure how extreme an observation's x-values are.
Interpreting the HMDs

- Each HMD lies between 0 and 1.
- The average HMD is (k + 1)/n.
- An HMD larger than 3(k + 1)/n is considered extreme. A small sketch computing these quantities by hand follows.
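A minimal sketch, assuming educ.lm is the two-covariate education fit used on the next slide (hatvalues(educ.lm) gives the same diagonals directly):

> X <- model.matrix(educ.lm)               # design matrix, including the intercept
> H <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix
> hmd <- diag(H)                           # hat matrix diagonals (HMDs)
> k <- ncol(X) - 1; n <- nrow(X)
> mean(hmd)                                # equals (k + 1)/n
> which(hmd > 3*(k + 1)/n)                 # points flagged as high-leverage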


Example: The education data (without urban)
> hatvalues(educ.lm)

[Figure: hat matrix values for educ.lm plotted for all 50 states, with reference lines at (k + 1)/n = 3/50 and 3(k + 1)/n = 9/50; observation 50 lies well above the 9/50 cut-off.]


Studentised residuals

- How can we recognise a big residual? How big is big?
- The actual size depends on the units in which the y-variable is measured, so we need to standardise them.
- We can divide by their standard deviations.
- The variance of a typical residual e_i is

      Var(e_i) = (1 - h_ii) σ²,

  where h_ii is the i-th diagonal entry of the hat matrix H.


Studentised residuals

- Internally studentised (called "standardised" in R):

      e*_i = e_i / sqrt((1 - h_ii) s²)

  where s² is the usual estimate of the residual variance σ².
- Externally studentised (called "studentised" in R):

      e*_i = e_i / sqrt((1 - h_ii) s_i²)

  where s_i² is the estimate of σ² after deleting the i-th data point.


Studentised residuals

- How big is big?
- The internally studentised residuals are approximately standard normally distributed if the model is OK and there are no outliers.
- The externally studentised residuals have a t-distribution.
- Thus, studentised residuals should be between -2 and 2 with approximately 95% probability.
Studentised residuals: Calculating it in R

#Load the MASS library


> library(MASS)
# internally studentised (standardised in R)
> stdres(educ.lm)[50]
50
3.089699
# externally studentised (studentized in R)
> studres(educ.lm)[50]
50
3.424107
What does studentised mean?
Recognising outliers

- If a point is a low-leverage outlier, the residual will usually be large, so a large residual and a low HMD indicates an outlier.
- If a point is a high-leverage outlier, then a large error will usually cause a large residual.
- However, in extreme cases, a high-leverage outlier may not have a very big residual, depending on how much the point attracts the fitted plane. Thus, if a point has a large HMD and the residual is not particularly big, we cannot always tell whether the point is an outlier or not.
High-leverage outlier

[Figure: left - y against x with the fitted line and the true line; right - the corresponding "Residuals vs Fitted" plot with observations 1, 13, 19 and 26 labelled.]
Leverage-residual plots
> plot(educ.lm,which=5)
[Figure: "Residuals vs Leverage" plot for lm(educ ~ percap + under18), with Cook's distance contours at 0.5 and 1; observations 7, 45 and 50 are labelled.]
Interpreting LR plots

[Schematic: the leverage-residual plot divides into regions - small standardised residuals at low leverage are OK; large standardised residuals (positive or negative) at low leverage are low-leverage outliers; large standardised residuals at high leverage are high-leverage outliers; small residuals at high leverage are potential high-leverage outliers.]
No big studentised residuals, no big HMDs

[Figure: Example 1 - plot of y vs. x with fitted and true lines, and the corresponding "Residuals vs Leverage" plot; no points exceed the leverage cut-off 3(k+1)/n = 0.2 and no standardised residuals are large.]
One big studentised residual, no big HMDs

[Figure: Example 2 - plot of y vs. x with fitted and true lines, and the corresponding "Residuals vs Leverage" plot; observation 31 has a large standardised residual but low leverage.]
No big studentised residuals, one big HMD

[Figure: Example 3 - plot of y vs. x with fitted and true lines, and the corresponding "Residuals vs Leverage" plot; observation 31 has high leverage (above 3(k+1)/n = 0.2) but its standardised residual is not large.]
One big studentised residual, one big HMD

[Figure: Example 4 - plot of y vs. x with fitted and true lines, and the corresponding "Residuals vs Leverage" plot; observation 31 has both high leverage and a large standardised residual, with a large Cook's distance.]
Four big studentised residuals, one big HMD

[Figure: Example 5 - plot of y vs. x with fitted and true lines, and the corresponding "Residuals vs Leverage" plot; observations 8, 13, 26 and 31 have large standardised residuals, and observation 31 also has high leverage.]
HMD Summary

- Hat matrix diagonals
  - measure the effect of a point on its fitted value;
  - measure how outlying the x-values are (how high-leverage a point is);
  - are always between 0 and 1, with bigger values indicating higher leverage;
  - points with HMDs more than 3(k + 1)/n are considered high-leverage.
Influential points

- How can we tell if a high-leverage point/outlier is affecting the regression?
- By deleting the point and refitting the regression: a large change in the coefficients means the point is affecting the regression.
- Such points are called influential points.
- We do not want our analysis to be driven by one or two points!
Leave one out measures

- We can calculate a variety of measures by leaving out each data point in turn, and looking at the change in key regression quantities such as:
  - coefficients;
  - fitted values;
  - standard errors.
- We discuss each in turn.


Example: Education data

              With point 50    Without point 50
    Const        -557.451          -298.714
    percap          0.072             0.059
    under18         1.555             0.933


Coefficient measures: DFBETAS

DFBETAS: standardised difference in coefficients

      dfbetas = (b_j - b_j[i]) / se(b_j)

where b_j[i] is the estimate of the j-th coefficient with the i-th point deleted.

Problematic when |dfbetas| > 1. This is the criterion coded into R.
Coefficient measures: DFFITS

DFFITS: standardised difference in fitted values

      dffits = (ŷ_j - ŷ_j[i]) / se(ŷ_j)

Problematic when

      |dffits| > 3 sqrt((k + 1) / (n - k - 1)).
Coefficient measures: Covariance Ratio and Cook's D

Cov Ratio: measures the change in the standard errors of the estimated coefficients.

Problematic when the Cov Ratio is greater than 1 + 3(k + 1)/n or smaller than 1 - 3(k + 1)/n.

Cook's D: measures the overall change in the coefficients.

Problematic when greater than qf(.5, k+1, n-k-1) (the median of the F-distribution), roughly 1 in most cases.
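These measures can also be obtained individually with base R functions. A minimal sketch applying the cut-offs above to the education fit (educ.lm as before):

> k <- length(coef(educ.lm)) - 1; n <- nobs(educ.lm)
> db  <- dfbetas(educ.lm);  dfts <- dffits(educ.lm)
> cvr <- covratio(educ.lm); cd   <- cooks.distance(educ.lm)
> which(apply(abs(db), 1, max) > 1)                  # any |dfbetas| > 1
> which(abs(dfts) > 3*sqrt((k + 1)/(n - k - 1)))     # |dffits| cut-off
> which(abs(cvr - 1) > 3*(k + 1)/n)                  # cov ratio outside limits
> which(cd > qf(0.5, k + 1, n - k - 1))              # Cook's D above the F median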
Coefficient measures in R

Influence measures of
lm(formula = educ ~ percap + under18, data = educ.df) :

     dfb.1_   dfb.prcp  dfb.un18    dffit  cov.r     cook.d    hat inf
1    0.0120   -0.01794  -0.00588   0.0233  1.121   1.84e-04 0.0494
...
10   0.0638   -0.16792  -0.02222  -0.3631  0.803   4.05e-02 0.0257   *
...
44   0.0229    0.00298  -0.02948  -0.0340  1.283   3.94e-04 0.1690   *
...
50  -2.3688    1.50181   2.23393   2.4733  0.821   1.66e+00 0.3429   *
---
Cut-off values for each measure:
     1.0000    1.00000   1.00000   0.7579  0.82-1.18  8.00e-01 0.18
Plotting influence
# There will be seven plots
par(mfrow=c(2,4))
# Plot the measures using R330
influenceplots(educ.lm)
[Figure: index plots of the influence measures for educ.lm - dfb.1_, dfb.prcp, dfb.un18, DFFITS, |COV RATIO - 1|, Cook's D and hat values against observation number; observation 50 stands out on almost every measure (observations 10 and 44 also appear on some).]


Remedies for outliers

- Correct typographical errors in the data.
- Delete a small number of points and refit (we do not want the fitted regression to be determined by one or two influential points).
- Report the existence of outliers separately: they are often of scientific interest.
- Do not delete too many points.


Diagnostic steps

We test for <> using <>:


Planarity: Residuals vs. fitted values, Added variable plots,
GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Independence: ...

Outliers: Leverage-Residual plots, influence measures

Normality of Errors: QQ plots, Weisberg-Bingham test (and many more)
https://xkcd.com/934/
Independence

- One of the regression assumptions is that the errors are independent.
- Data that are collected sequentially over time often have errors that are not independent.
- If the independence assumption does not hold, then the standard errors will be wrong and the tests and confidence intervals will be unreliable.
- We need to be able to detect lack of independence.


Types of dependence

- If large positive errors have a tendency to follow large positive errors, and large negative errors a tendency to follow large negative errors, we say the data have positive autocorrelation.
- If large positive errors have a tendency to follow large negative errors, and large negative errors a tendency to follow large positive errors, we say the data have negative autocorrelation.
Diagnostics: Positive Autocorrelation

If the errors are positively autocorrelated:

- Plotting the residuals against time will show long runs of positive and negative residuals.
- Plotting residuals against the previous residual (i.e. e_i vs. e_{i-1}) will show a positive trend.
- A correlogram of the residuals will show positive spikes, gradually decaying.
Diagnostics: Negative Autocorrelation

If the errors are negatively autocorrelated:

- Plotting the residuals against time will show alternating positive and negative residuals.
- Plotting residuals against the previous residual (i.e. e_i vs. e_{i-1}) will show a negative trend.
- A correlogram of the residuals will show alternating positive and negative spikes, gradually decaying.
Residuals against time

res <- residuals(lm.obj)
plot(1:length(res), res, xlab="time", ylab="residuals",
     type="b")
abline(h=0, lty=2)
Residuals against time: Plot

[Figure: residuals plotted against time for three simulated series with autocorrelation 0.9, 0.0 and -0.9.]
Residuals against their predecessor

res <- residuals(lm.obj)
n <- length(res)
plot.res <- res[-1]   # element 1 has no predecessor
prev.res <- res[-n]   # last residual has no successor
plot(prev.res, plot.res, xlab="previous residual",
     ylab="residual")
Residuals against their predecessor: Plot

[Figure: residuals plotted against the previous residual for the three simulated series with autocorrelation 0.9, 0.0 and -0.9.]


Correlogram

acf(residuals(lm.obj))

- The autocorrelation function (acf, also called the correlogram) investigates the correlation between a residual and another residual k time units apart.
- This is also called the lag k autocorrelation.
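For a single lag, roughly the same quantity can be computed by hand (a sketch; acf() computes all lags at once and uses a slightly different divisor):

> res <- residuals(lm.obj)
> n <- length(res)
> cor(res[-1], res[-n])   # approximately the lag-1 autocorrelation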


Correlogram: Plot

[Figure: correlograms (ACF against lag, 0-20) for the three simulated series with autocorrelation 0.9, 0.0 and -0.9.]
Remedies

- Time-series modelling.
- Spatial correlation can also be modelled (correlation matrices).
- Mammal data: use independent contrasts to mask ancestral correlation.
- If the dependency is justifiably random, permute the observations.


Diagnostic steps

We test for <> using <>:


Planarity: Residuals vs. fitted values, Added variable plots,
GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Independence: Correlogram

Outliers: Leverage-Residual plots, influence measures

Normality of Errors: QQ plots, Weisberg-Bingham test (and many more)
https://xkcd.com/915/
1976 Chateau Margaux
NZ$ 250-800

1976 Chateau Latour
NZ$ 370-900

1976 Chateau Lafite Rothschild
NZ$ 550-1,300
Case study: The wine data

- Data on 27 vintages of Bordeaux wines.
- Variables are:
    year: 1952-1980;
    price: in 1980 US$, converted to an index with 1961 = 100;
    temp: average temperature during the growing season (°C);
    h.rain: total rainfall during the harvest period, mm;
    w.rain: total rainfall over the preceding winter, mm.
- Data are part of the R330 package, and also available on the course website.


Bordeaux wines: Their price

- Bordeaux wines are an iconic luxury consumer good. Many consider these to be the best wines in the world.
- The quality and the price depend on the vintage (i.e. the year the wines are made).
- The prices are in 1980 US$, in index form with 1961 = 100.
Bordeaux wines: Their price

[Figure: Bordeaux wine price index 1952-1980, plotted against year.]
Bordeaux wines: Pairs plot

[Figure: pairs plot of year, temp, h.rain, w.rain and price, with pairwise correlations shown in the lower panels (price with year 0.45, temp 0.59, h.rain 0.45 and w.rain 0.23, in absolute value).]


Bordeaux wines: Preliminary analysis

Call:
lm(formula = price ~ year + temp + h.rain + w.rain,
data = wine.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1305.52761 597.31137 2.186 0.03977 *
year -0.82055 0.29140 -2.816 0.01007 *
temp 19.25337 3.92945 4.900 6.72e-05 ***
h.rain -0.10121 0.03297 -3.070 0.00561 **
w.rain 0.05704 0.01975 2.889 0.00853 **
---
Residual standard error: 11.69 on 22 degrees of freedom
Multiple R-squared: 0.7369, Adjusted R-squared: 0.6891
F-statistic: 15.41 on 4 and 22 DF, p-value: 3.806e-06
Bordeaux wines: Diagnostic plots

[Figure: the four standard lm diagnostic plots ("Residuals vs Fitted", "Normal Q-Q", "Scale-Location", "Residuals vs Leverage") for the preliminary wine model; observations 6, 8 and 19 are labelled.]


Bordeaux wines: Checking normality

> qqnorm(residuals(wine.lm))
> WB.test(wine.lm)

WB test statistic = 0.957
p = 0.03

[Figure: normal Q-Q plot of the Bordeaux residuals.]
Bordeaux wines: Box-Cox routine

[Figure: Box-Cox profile log-likelihood for the wine model, plotted over powers -2 to 2 with the 95% confidence interval marked; the indicated power is -1/3.]


Bordeaux wines: Transform and refit

- Use y^(-1/3) as the response (reciprocal cube root).
- Has the fit improved?
  - Are the errors more normal? (normal plot and Weisberg-Bingham test)
  - Has R² increased?
- Would further transformations improve normality?
  - A Box-Cox maximum at p = 1 for the transformed response means the fit cannot be improved by further transformation.
Bordeaux wines: Re-checking normality

> qqnorm(residuals(wine.upd))
> WB.test(wine.upd)

WB test statistic = 0.988
p = 0.65

[Figure: normal Q-Q plot of the residuals from the transformed model.]
Bordeaux wines: Re-Checking normality

Call:
lm(formula = price^(-1/3) ~ ., data = wine.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.666e+00 1.613e+00 -2.273 0.03317 *
year 2.639e-03 7.870e-04 3.353 0.00288 **
temp -7.051e-02 1.061e-02 -6.644 1.11e-06 ***
h.rain 4.423e-04 8.905e-05 4.967 5.71e-05 ***
w.rain -1.157e-04 5.333e-05 -2.170 0.04110 *
---
Residual standard error: 0.03156 on 22 degrees of freedom
Multiple R-squared: 0.8331, Adjusted R-squared: 0.8028
F-statistic: 27.46 on 4 and 22 DF, p-value: 2.841e-08
Bordeaux wines: Box-Cox revisited

[Figure: Box-Cox profile log-likelihood for the transformed response, plotted over powers -2 to 2 with the 95% confidence interval marked; the indicated power is 1.]


Bordeaux wines: Conclusions on Normality

- The transformation has been spectacularly successful in improving the fit!
- What about other aspects of the fit?
  - Residuals vs. fitted values?
  - Pairs plot
  - Added variable plots
Bordeaux wines: Diagnostic plots revisited

[Figure: the four standard lm diagnostic plots for the transformed model; observations 6, 20, 23 and 24 are labelled. ALL GOOD.]
Bordeaux wines: GAM plots

> library(mgcv)
> plot(gam(price^(-1/3)~temp+s(h.rain)+s(w.rain)+year,
data=wine.df))
[Figure: GAM partial-effect plots s(h.rain,1) against h.rain and s(w.rain,4.87) against w.rain for the transformed wine model.]
Bordeaux wines: Polynomial fit
Call:
lm(formula = price^(-1/3) ~ temp + h.rain + year
+ poly(w.rain, 4), data = wine.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.974e+00 1.532e+00 -1.942 0.06715 .
temp -7.478e-02 1.048e-02 -7.137 8.75e-07 ***
h.rain 4.869e-04 8.662e-05 5.622 2.02e-05 ***
year 2.284e-03 7.459e-04 3.062 0.00642 **
poly(w.rain, 4)1 -7.561e-02 3.263e-02 -2.317 0.03180 *
poly(w.rain, 4)2 4.469e-02 3.294e-02 1.357 0.19079
poly(w.rain, 4)3 -2.153e-02 2.945e-02 -0.731 0.47374
poly(w.rain, 4)4 6.130e-02 2.956e-02 2.074 0.05194 .
---
Residual standard error: 0.02931 on 19 degrees of freedom
Multiple R-squared: 0.8757, Adjusted R-squared: 0.8299
F-statistic: 19.13 on 7 and 19 DF, p-value: 2.352e-07
Bordeaux wines: Final fit

Call:
lm(formula = price^(-1/3) ~ ., data = wine.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.666e+00 1.613e+00 -2.273 0.03317 *
year 2.639e-03 7.870e-04 3.353 0.00288 **
temp -7.051e-02 1.061e-02 -6.644 1.11e-06 ***
h.rain 4.423e-04 8.905e-05 4.967 5.71e-05 ***
w.rain -1.157e-04 5.333e-05 -2.170 0.04110 *
---
Residual standard error: 0.03156 on 22 degrees of freedom
Multiple R-squared: 0.8331, Adjusted R-squared: 0.8028
F-statistic: 27.46 on 4 and 22 DF, p-value: 2.841e-08
Bordeaux wines: Influential points

[Figure: index plots of the influence measures for the final wine model (dfbetas for each coefficient, DFFITS, |COV RATIO - 1|, Cook's D, hat values) against observation number.]


Bordeaux wines: Conclusions

- The model using price^(-1/3) as the response fits very well.
- Use this model for prediction and for understanding relationships:
  - Coefficient of year is positive, so the transformed response increases with year (i.e. older vintages are more valuable).
  - Coefficient of temp is negative, so high temperatures decrease the transformed response (i.e. increase price).
  - Coefficient of h.rain is positive, so high harvest rain increases the transformed response (i.e. decreases price).
  - Coefficient of w.rain is negative, so high winter rain decreases the transformed response (i.e. increases price).
- A sketch of back-transforming predictions to the price scale follows below.
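A minimal sketch of using the model for prediction and undoing the transformation (the covariate values are made up purely for illustration, and wine.upd is assumed to be the price^(-1/3) fit):

> new.vintage <- data.frame(year=1975, temp=17,
+                           h.rain=100, w.rain=600)    # hypothetical vintage
> pred.t <- predict(wine.upd, newdata=new.vintage)     # on the price^(-1/3) scale
> pred.t^(-3)                                          # back on the price index scale

As noted on the "Uses for Box-Cox plots" slide, the back-transformed value describes a typical price rather than the mean price.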
Statistical outliers

http://xkcd.com/539/
