Introduction to Probability and Statistics
Twelfth Edition
Robert J. Beaver, Barbara M. Beaver, William Mendenhall
Chapter 12
Linear Regression and
Correlation
Some graphic screen captures from Seeing Statistics
Some images © 2001-2006 www.arttoday.com
Introduction
In Chapter 11, we used ANOVA to investigate the
effect of various factor-level combinations
(treatments) on a response x.
Our objective was to see whether the treatment
means were different.
In Chapters 12 and 13, we investigate a response y
which is affected by various independent variables, xi.
Our objective is to use the information provided by
the xi to predict the value of y.
Copyright 2006 Brooks/Cole
A division of Thomson Learning, Inc.
Example
Let y be a student's college achievement,
measured by his/her GPA. This might be a
function of several variables:
x1 = rank in high school class
x2 = high school's overall rating
x3 = high school GPA
x4 = SAT scores
We want to predict y using knowledge of x1, x2,
x3 and x4.
Example
Let y be the monthly sales revenue for a
company. This might be a function of several
variables:
x1 = advertising expenditure
x2 = time of year
x3 = state of economy
x4 = size of inventory
We want to predict y using knowledge of x1, x2,
x3 and x4.
Some Questions
Which of the independent variables are
useful and which are not?
How could we create a prediction equation
to allow us to predict y using knowledge of
x1, x2, x3, etc.?
How good is this prediction?
We start with the simplest case, in which the
response y is a function of a single independent
variable, x.
The Method of Least Squares
We choose our estimates a and b to estimate the
best-fitting line ŷ = a + bx so that the vertical
distances of the points from the line are minimized.
That is, choose a and b to minimize

SSE = Σ(y − ŷ)² = Σ(y − a − bx)²

Calculate the sums of squares:

Sxx = Σx² − (Σx)²/n
Syy = Σy² − (Σy)²/n
Sxy = Σxy − (Σx)(Σy)/n

Best-fitting line: ŷ = a + bx, where

b = Sxy/Sxx  and  a = ȳ − b x̄
Example
The table shows the math achievement test scores
for a random sample of n = 10 college freshmen,
along with their final calculus grades.

Student            1   2   3   4   5   6   7   8   9  10
Math test, x      39  43  21  64  57  47  28  75  34  52
Calculus grade, y 65  78  52  82  92  89  73  98  56  75

Use your calculator to find the sums and sums of squares:

Σx = 460    Σy = 760
Σx² = 23634    Σy² = 59816
Σxy = 36854
x̄ = 46    ȳ = 76
Example
Sxx = 23634 − (460)²/10 = 2474
Syy = 59816 − (760)²/10 = 2056
Sxy = 36854 − (460)(760)/10 = 1894

b = 1894/2474 = .76556  and  a = 76 − .76556(46) = 40.78

Best-fitting line: ŷ = 40.78 + .77x
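The fit above can be reproduced with a short Python sketch (not from the text), using only the data in the table:

```python
# Sketch (not from the text): least squares fit for the
# calculus-grade data shown in the table.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)

Sx, Sy = sum(x), sum(y)                                   # 460, 760
Sxx = sum(xi * xi for xi in x) - Sx ** 2 / n              # 2474.0
Syy = sum(yi * yi for yi in y) - Sy ** 2 / n              # 2056.0
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - Sx * Sy / n  # 1894.0

b = Sxy / Sxx              # slope, ~0.76556
a = Sy / n - b * Sx / n    # intercept, ~40.78
print(f"y-hat = {a:.2f} + {b:.5f}x")
```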
The Analysis of Variance
Total SS = Syy = Σ(y − ȳ)² = 2056

SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741

SSE = Total SS − SSR = 2056 − 1449.9741 = 606.0259
Mean Squares
MSR = SSR/(1)
MSE = SSE/(n − 2)

Source      df      SS        MS            F
Regression  1       SSR       SSR/(1)       MSR/MSE
Error       n − 2   SSE       SSE/(n − 2)
Total       n − 1   Total SS

For the calculus grade data:

SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741
SSE = Total SS − SSR = 2056 − 1449.9741 = 606.0259

Source      df   SS          MS
Regression  1    1449.9741   1449.9741
Error       8    606.0259    75.7532
Total       9    2056.0000

These mean squares are used to test H0: β = 0 versus Ha: β ≠ 0.
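The ANOVA quantities in the table can be checked with a minimal Python sketch (not from the text):

```python
# Sketch (not from the text): ANOVA quantities for the calculus-grade
# example, from the sums of squares computed earlier.
Sxx, Syy, Sxy = 2474.0, 2056.0, 1894.0
n = 10

total_ss = Syy
ssr = Sxy ** 2 / Sxx        # ~1449.9741
sse = total_ss - ssr        # ~606.0259
msr = ssr / 1               # regression mean square
mse = sse / (n - 2)         # ~75.7532
f_stat = msr / mse          # ~19.14, as in the Minitab output
print(f"SSR={ssr:.4f}, SSE={sse:.4f}, MSE={mse:.4f}, F={f_stat:.2f}")
```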
Testing the Usefulness of the Model
The test statistic is

t = b / √(MSE/Sxx)

which has a t distribution with df = n − 2; or use a confidence interval:

b ± t(α/2) √(MSE/Sxx)
For the calculus grades data:

t = b / √(MSE/Sxx) = .76556 / √(75.7532/2474) = 4.38

Reject H0 when |t| > 2.306. Since t = 4.38 falls into
the rejection region, H0 is rejected.
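This t test can be sketched in a few lines of Python (not from the text), reusing the quantities computed for the example:

```python
import math

# Sketch (not from the text): t test for H0: beta = 0 using the
# quantities computed for the calculus-grade example.
b, mse, Sxx = 0.76556, 75.7532, 2474.0

t = b / math.sqrt(mse / Sxx)   # ~4.38
t_crit = 2.306                 # t(.025) with 8 df, from a t table
print("reject H0" if abs(t) > t_crit else "fail to reject H0")
```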
The F Test
You can test the overall usefulness of the
model using an F test. If the model is
useful, MSR will be large compared to
the unexplained variation, MSE.

To test H0: β = 0 (the model is not useful in predicting y):

Test statistic: F = MSR/MSE
Reject H0 if F > Fα with 1 and n − 2 df.

This test is exactly equivalent to the t-test, with t² = F.
Minitab Output

R-Sq = 70.5%    R-Sq(adj) = 66.8%

Analysis of Variance
Source          DF   SS       MS       F       P
Regression       1   1450.0   1450.0   19.14   0.002
Residual Error   8    606.0     75.8
Total            9   2056.0

The output shows the least squares regression line, the
regression coefficients a and b, MSE, and the F test of
H0: β = 0. Note that t² = F.
Coefficient of determination:

r² = (Sxy)² / (Sxx Syy) = SSR / Total SS
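For the calculus-grade example, this formula gives the R-Sq value reported by Minitab; a one-line Python check (not from the text):

```python
# Sketch (not from the text): coefficient of determination for the
# calculus-grade example.
Sxx, Syy, Sxy = 2474.0, 2056.0, 1894.0

r2 = Sxy ** 2 / (Sxx * Syy)   # same as SSR / Total SS, ~0.705
print(f"r^2 = {r2:.1%}")      # matches R-Sq = 70.5% in the Minitab output
```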
Interpreting a Significant Regression
Some Cautions
Extrapolation: predicting values of y outside the
range of x values used to fit the line.
Checking the
Regression Assumptions
Remember that the results of a regression
analysis are only valid when the necessary
assumptions have been satisfied.
1. The relationship between x and y is linear,
given by y = α + βx + ε.
2. The random error terms ε are independent and,
for any value of x, have a normal distribution
with mean 0 and variance σ².
Diagnostic Tools
We use the same diagnostic tools used in
Chapter 11 to check the normality
assumption and the assumption of equal
variances.
1. Normal probability plot of residuals
2. Plot of residuals versus fit or
residuals versus variables
Residuals
The residual error is the leftover
variation in each data point after the
variation explained by the regression model
has been removed.
Residual = yᵢ − ŷᵢ  or  yᵢ − a − bxᵢ

If all assumptions have been met, these
residuals should be normal, with mean 0
and variance σ².
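A minimal Python sketch (not from the text) that computes the residuals for the calculus-grade example, using the fitted coefficients from the earlier slides:

```python
# Sketch (not from the text): residuals y_i - (a + b*x_i) for the
# calculus-grade data, using the fitted coefficients from the example.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
a, b = 40.78424, 0.76556   # intercept and slope from the example

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print([round(r, 2) for r in residuals])
# A least squares fit forces the residuals to sum to (essentially) zero.
print(round(sum(residuals), 6))
```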
If the normality assumption is valid, the
plot should resemble a straight line,
sloping upward to the right.
If not, you will often see the pattern fail
in the tails of the graph.

[Figure: Normal probability plot of the residuals (response is y); percent versus residual]
If the equal variance assumption is valid,
the plot should appear as a random
scatter around the zero center line.
If not, you will see a pattern in the
residuals.

[Figure: Residuals versus the fitted values (response is y)]
Estimation and Prediction
To estimate the average value of y when x = x₀:

ŷ ± t(α/2) √( MSE (1/n + (x₀ − x̄)²/Sxx) )

To predict a particular value of y when x = x₀:

ŷ ± t(α/2) √( MSE (1 + 1/n + (x₀ − x̄)²/Sxx) )

where ŷ = a + bx₀.

For the calculus grades data, estimate the average grade when x₀ = 50.

Calculate ŷ = 40.78424 + .76556(50) = 79.06

ŷ ± 2.306 √( 75.7532 (1/10 + (50 − 46)²/2474) )
= 79.06 ± 6.55, or 72.51 to 85.61.
To predict a particular student's grade when x₀ = 50:

Calculate ŷ = 40.78424 + .76556(50) = 79.06

ŷ ± 2.306 √( 75.7532 (1 + 1/10 + (50 − 46)²/2474) )
= 79.06 ± 21.11, or 57.95 to 100.17.

Notice how much wider this interval is!
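Both intervals can be reproduced with a short Python sketch (not from the text):

```python
import math

# Sketch (not from the text): 95% confidence and prediction intervals
# at x0 = 50 for the calculus-grade example.
a, b = 40.78424, 0.76556
mse, Sxx, n, xbar = 75.7532, 2474.0, 10, 46.0
t_crit = 2.306                                   # t(.025) with 8 df
x0 = 50

y_hat = a + b * x0                               # ~79.06
core = 1 / n + (x0 - xbar) ** 2 / Sxx
ci_half = t_crit * math.sqrt(mse * core)         # ~6.55
pi_half = t_crit * math.sqrt(mse * (1 + core))   # ~21.11
print(f"95% CI: ({y_hat - ci_half:.2f}, {y_hat + ci_half:.2f})")
print(f"95% PI: ({y_hat - pi_half:.2f}, {y_hat + pi_half:.2f})")
```

The prediction interval's extra "1 +" term accounts for the variation of a single observation around its mean, which is why it is so much wider.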
Minitab Output
95.0% PI: (57.95, 100.17)

[Figure: Fitted line plot with the regression line, 95% CI bands, and 95% PI bands; S = 8.70363, R-Sq = 70.5%, R-Sq(adj) = 66.8%]

In the plot, the green prediction bands are always
wider than the red confidence bands.
Correlation Analysis
Example
The table shows the heights and weights of
n = 10 randomly selected college football
players.

Player     1   2   3   4   5   6   7   8   9  10
Height, x  73  71  75  72  72  75  67  69  71  69
Weight, y

Use your calculator to find the sums and sums of squares:

Sxy = 328    Sxx = 60.4    Syy = 2610

r = 328 / √((60.4)(2610)) = .8261
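The value of r follows directly from the sums of squares; a one-line Python check (not from the text):

```python
import math

# Sketch (not from the text): correlation coefficient for the
# football-player data, from the sums of squares on the slide.
Sxy, Sxx, Syy = 328.0, 60.4, 2610.0

r = Sxy / math.sqrt(Sxx * Syy)   # ~0.8261
print(f"r = {r:.4f}")
```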
Football Players

[Figure: Scatterplot of Weight vs Height]

r = .8261: strong positive correlation.
As the player's height increases, so does his weight.
Use the Exploring Correlation applet to
explore some correlation patterns:

r = 0: no correlation
r = .931: strong positive correlation
r = 1: linear relationship
r = −.67: weaker negative correlation
Inference using r
To test H0: ρ = 0 versus Ha: ρ ≠ 0

Test statistic: t = r √( (n − 2) / (1 − r²) )

Reject H0 if t > t(α/2) or t < −t(α/2) with n − 2 df.

This test is exactly equivalent to the t-test for the slope β.
Example
r = .8261, n = 10

Test statistic:

t = r √( (n − 2) / (1 − r²) ) = .8261 √( 8 / (1 − .8261²) ) = 4.15
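A minimal Python sketch (not from the text) of this computation:

```python
import math

# Sketch (not from the text): t statistic for H0: rho = 0
# with r = .8261 and n = 10, as in the football-player example.
r, n = 0.8261, 10

t = r * math.sqrt((n - 2) / (1 - r ** 2))   # ~4.15
print(f"t = {t:.2f}")
```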
Key Concepts
I. A Linear Probabilistic Model
1. When the data exhibit a linear relationship, the
appropriate model is y = α + βx + ε.
2. The random error ε has a normal distribution with
mean 0 and variance σ².
II. Method of Least Squares
1. Estimates a and b, for α and β, are chosen to
minimize SSE, the sum of the squared deviations
about the regression line, ŷ = a + bx.
2. The least squares estimates are b = Sxy/Sxx and
a = ȳ − b x̄.
Key Concepts
III. Analysis of Variance
1. Total SS = SSR + SSE, where Total SS = Syy and
SSR = (Sxy)²/Sxx.
2. The best estimate of σ² is MSE = SSE/(n − 2).
IV. Testing, Estimation, and Prediction
1. A test for the significance of the linear regression,
H0: β = 0, can be implemented using one of two test
statistics:

t = b / √(MSE/Sxx)   or   F = MSR/MSE
Key Concepts
2. The strength of the relationship between x and y is
measured by the coefficient of determination,
R² = SSR/Total SS.
3. Confidence intervals can be constructed for the slope β.
4. Confidence intervals can be constructed for the average
value of y when x = x₀.
5. Prediction intervals can be constructed for a particular
value of y when x = x₀.
Key Concepts
V. Correlation Analysis
1. Use the correlation coefficient to measure the
relationship between x and y when both variables are
random:
r = Sxy / √(Sxx Syy)