
Inference for Regression

Simple linear regression


Statistical model for linear regression
Estimating the regression parameters
Confidence interval for regression parameters
Significance test for the slope
Confidence interval for μ_y
Prediction intervals
The data in a scatterplot are a random sample from a population that may exhibit a linear relationship between x and y. A different sample gives a different plot.

Now we want to describe the population mean response μ_y as a function of the explanatory variable x: μ_y = β₀ + β₁x.

And to assess whether the observed relationship is statistically
significant (not entirely explained by chance events due to
random sampling).
[Scatterplot: House Price ($1000s) vs. Square Feet]
Statistical model for linear regression
In the population, the linear regression equation is μ_y = β₀ + β₁x.

Sample data then fit the model:

Data = fit + residual
y_i = (β₀ + β₁x_i) + ε_i

where the ε_i are independent and Normally distributed N(0, σ).

Linear regression assumes equal variance of y (σ is the same for all values of x).

The intercept β₀, the slope β₁, and the standard deviation σ of y are the unknown parameters of the regression model. We rely on the random sample data to provide unbiased estimates of these parameters.

The value of ŷ from the least-squares regression line is really a prediction of the mean value of y (μ_y) for a given value of x.

The least-squares regression line (ŷ = b₀ + b₁x) obtained from sample data is the best estimate of the true population regression line (μ_y = β₀ + β₁x).

ŷ: unbiased estimate for the mean response μ_y
b₀: unbiased estimate for the intercept β₀
b₁: unbiased estimate for the slope β₁
Estimating the parameters
The regression standard error, s_e, for n sample data points is calculated from the residuals (y_i − ŷ_i):

s_e = √( Σ residual² / (n − 2) ) = √( Σ (y_i − ŷ_i)² / (n − 2) )

s_e is an unbiased estimate of the regression standard deviation σ.

The population standard deviation σ for y at any given value of x represents the spread of the normal distribution of the ε_i around the mean μ_y.
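The calculation above can be sketched directly; the small data set here is made up for illustration.

```python
# Sketch: compute the regression standard error s_e from the residuals,
# using s_e = sqrt( sum(residual^2) / (n - 2) ). Data are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])

n = x.size
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                 # fitted values
residuals = y - y_hat               # y_i - y_hat_i
s_e = np.sqrt(np.sum(residuals**2) / (n - 2))
```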
Conditions for inference

The observations are independent.
The relationship is indeed linear.
The standard deviation of y, σ, is the same for all values of x.
The response y varies normally around its mean.
Using residual plots to check for regression validity

The residuals (y − ŷ) give useful information about the contribution of individual data points to the overall pattern of scatter. We view the residuals in a residual plot.

If residuals are scattered randomly around 0 with uniform variation, it indicates that the data fit a linear model, have normally distributed residuals for each value of x, and constant standard deviation σ.

Residuals are randomly scattered: good!
Curved pattern: the relationship is not linear.
Change in variability across the plot: σ is not equal for all values of x.
What is the relationship between the average speed a car is driven and its fuel efficiency?

We plot fuel efficiency (in miles per gallon, MPG) against average speed (in miles per hour, MPH) for a random sample of 60 cars. The relationship is curved. When speed is log transformed (log of miles per hour, LOGMPH) the new scatterplot shows a positive, linear relationship.
Residual plot: The spread of the residuals is reasonably random, with no clear pattern, so the relationship is indeed linear. But we see one low residual (3.8, −4) and one potentially influential point (2.5, 0.5).

Normal quantile plot for residuals: The plot is fairly straight, supporting the assumption of normally distributed residuals.

Data okay for inference.
Confidence interval for regression parameters
Estimating the regression parameters β₀, β₁ is a case of one-sample inference with unknown population variance, so we rely on the t distribution, with n − 2 degrees of freedom.

A level C confidence interval for the slope β₁ is centered on b₁, with a margin of error proportional to the standard error of the least-squares slope:

b₁ ± t* SE(b₁)

A level C confidence interval for the intercept β₀ is centered on b₀, with a margin of error proportional to the standard error of the least-squares intercept:

b₀ ± t* SE(b₀)

t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.
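A small sketch of the slope interval b₁ ± t* SE(b₁), with t* taken from the t(n − 2) distribution; the data are made up for illustration.

```python
# Sketch: 95% confidence interval for the slope, b1 ± t* SE(b1),
# with t* from the t(n-2) distribution. Data are made up.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

n = x.size
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s_e = np.sqrt(np.sum(resid**2) / (n - 2))          # regression standard error
se_b1 = s_e / np.sqrt(np.sum((x - x.mean())**2))   # SE of the slope

t_star = stats.t.ppf(0.975, df=n - 2)   # area 0.95 between -t* and +t*
ci_low, ci_high = b1 - t_star * se_b1, b1 + t_star * se_b1
```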
The standard error of the regression coefficients

To estimate the parameters of the regression, we calculate the standard errors for the estimated regression coefficients.

The standard error of the least-squares slope b₁ is:

SE(b₁) = s_e / √( Σ (x_i − x̄)² ) = s_e / ( s_x √(n − 1) )

The standard error of the intercept b₀ is:

SE(b₀) = s_e √( 1/n + x̄² / Σ (x_i − x̄)² )
Significance test for the slope
We can test the hypothesis H₀: β₁ = 0 versus a one- or two-sided alternative.

We calculate t = b₁ / SE(b₁), which has the t(n − 2) distribution, to find the p-value of the test.

Note: Software typically provides two-sided p-values.
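The test statistic and its two-sided p-value can be computed by hand and checked against library output; the data here are simulated purely for illustration.

```python
# Sketch: t = b1 / SE(b1) with a two-sided p-value from t(n-2),
# checked against scipy.stats.linregress. Data are made up (simulated).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.arange(20.0)
y = 1.0 + 0.3 * x + rng.normal(0, 1.0, size=x.size)

n = x.size
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
s_e = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))
se_b1 = s_e / np.sqrt(np.sum((x - x.mean())**2))

t_stat = b1 / se_b1
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # software default

res = stats.linregress(x, y)   # same test, done by the library
```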
Testing the hypothesis of no relationship

We may look for evidence of a significant relationship between variables x and y in the population from which our data were drawn. For that, we can test the hypothesis that the regression slope parameter is equal to zero:

H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0

Because the least-squares slope satisfies b₁ = r (s_y / s_x), testing H₀: β₁ = 0 also allows us to test the hypothesis of no correlation between x and y in the population.
Confidence interval for μ_y

Using inference, we can also calculate a confidence interval for the population mean μ_y of all responses y when x takes the value x* (within the range of data tested).

This interval is centered on ŷ, the unbiased estimate of μ_y. The true value of the population mean μ_y at a given value of x will indeed be within our confidence interval in C% of all intervals calculated from many different random samples.

The level C confidence interval for the mean response μ_y at a given value x* of x is:

ŷ ± t*(n − 2) SE(μ̂)

A separate confidence interval is calculated for μ_y along all the values that x takes. Graphically, the series of confidence intervals is shown as a continuous band on either side of ŷ.

t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.

[Figure: 95% confidence interval for μ_y]
Inference for prediction

One use of regression is for predicting the value of y, ŷ, for any value of x within the range of data tested: ŷ = b₀ + b₁x. But the regression equation depends on the particular sample drawn, so more reliable predictions require statistical inference.

To estimate an individual response y for a given value of x, we use a prediction interval. If we randomly sampled many times, there would be many different values of y obtained for a particular x, following N(0, σ) around the mean response μ_y.

The level C prediction interval for a single observation on y when x takes the value x* is:

ŷ ± t*(n − 2) SE(ŷ)

Graphically, the series of prediction intervals is shown as a continuous band on either side of ŷ.

t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.

[Figure: 95% prediction interval for y]
To estimate or predict future responses, we calculate the following standard errors.

The standard error of the mean response μ_y is:

SE(μ̂) = √( SE²(b₁)(x* − x̄)² + s_e²/n ) = s_e √( 1/n + (x* − x̄)² / Σ (x_i − x̄)² )

The standard error for predicting an individual response y is:

SE(ŷ) = √( SE²(b₁)(x* − x̄)² + s_e²/n + s_e² ) = s_e √( 1 + 1/n + (x* − x̄)² / Σ (x_i − x̄)² )
Confidence intervals are intervals constructed about the predicted value of y, at a given level of x, that are used to measure the accuracy of the mean response of all the individuals in the population.

Prediction intervals are intervals constructed about the predicted value of y that are used to measure the accuracy of a single individual's predicted value.

A confidence interval represents an inference on a parameter and is an interval that is intended to cover the value of the parameter; a prediction interval, on the other hand, is a statement about the possible value of a random variable.

Unusual and Extraordinary Observations
© 2011 Pearson Education, Inc.

In regression, an outlier can stand out in two ways. It can have:
1) a large residual, or
2) a large distance from x̄ (a high-leverage point).

A high-leverage point is influential if omitting it gives a regression model with a very different slope.
Examples:
Not high-leverage, large residual: not very influential.
High-leverage, small residual: not very influential.
High-leverage, medium residual: very influential (omitting the red point will change the slope dramatically!)
What should you do with a high-leverage point?
Sometimes, these points are important. They can
indicate that the underlying relationship is in fact
nonlinear.
Other times, they simply do not belong with the
rest of the data and ought to be omitted.
When in doubt, create and report two models: one with
the outlier and one without.
Influential points do not necessarily have high residuals!
So, use scatterplots rather than residual plots to identify
high-leverage outliers.
(Residual plots work well of course for identifying high-
residual outliers.)

