Introduction to Probability and Statistics
Twelfth Edition
Robert J. Beaver, Barbara M. Beaver, William Mendenhall
Chapter 12
Linear Regression and
Correlation
Some graphic screen captures from Seeing Statistics
Some images © 2001-2006 www.arttoday.com
Introduction
In Chapter 11, we used ANOVA to investigate the
effect of various factor-level combinations
(treatments) on a response x.
Our objective was to see whether the treatment
means were different.
In Chapters 12 and 13, we investigate a response y
which is affected by various independent variables, xi.
Our objective is to use the information provided by
the xi to predict the value of y.
Copyright 2006 Brooks/Cole
A division of Thomson Learning, Inc.
Example
Let y be a student's college achievement,
measured by his/her GPA. This might be a
function of several variables:
x1 = rank in high school class
x2 = high school's overall rating
x3 = high school GPA
x4 = SAT scores
We want to predict y using knowledge of x1, x2,
x3 and x4.
Example
Let y be the monthly sales revenue for a
company. This might be a function of several
variables:
x1 = advertising expenditure
x2 = time of year
x3 = state of economy
x4 = size of inventory
We want to predict y using knowledge of x1, x2,
x3 and x4.
Some Questions
Which of the independent variables are
useful and which are not?
How could we create a prediction equation
to allow us to predict y using knowledge of
x1, x2, x3, etc.?
How good is this prediction?
We start with the simplest case, in which the
response y is a function of a single independent
variable, x.
The Method of Least Squares
We choose our estimates a and b to estimate the
best-fitting line ŷ = a + bx so that the vertical
distances of the points from the line are minimized.
That is, choose a and b to minimize

SSE = Σ(y − ŷ)² = Σ(y − a − bx)²

Calculate the sums of squares:

Sxx = Σx² − (Σx)²/n
Syy = Σy² − (Σy)²/n
Sxy = Σxy − (Σx)(Σy)/n

Best-fitting line: ŷ = a + bx, where

b = Sxy/Sxx  and  a = ȳ − b x̄
Example
The table shows the math achievement test scores
for a random sample of n = 10 college freshmen,
along with their final calculus grades.

Student            1   2   3   4   5   6   7   8   9  10
Math test, x      39  43  21  64  57  47  28  75  34  52
Calculus grade, y 65  78  52  82  92  89  73  98  56  75

Use your calculator to find the sums and sums of squares:

Σx = 460    Σy = 760
Σx² = 23634    Σy² = 59816
Σxy = 36854
x̄ = 46    ȳ = 76
Example
Sxx = 23634 − (460)²/10 = 2474
Syy = 59816 − (760)²/10 = 2056
Sxy = 36854 − (460)(760)/10 = 1894

b = 1894/2474 = .76556  and  a = 76 − .76556(46) = 40.78

Best-fitting line: ŷ = 40.78 + .77x
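The fit above can be reproduced with a short Python sketch (not from the text), using only the data in the table:

```python
# Sketch (not from the text): least squares fit for the
# calculus-grade data shown in the table.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)

Sx, Sy = sum(x), sum(y)                                   # 460, 760
Sxx = sum(xi * xi for xi in x) - Sx ** 2 / n              # 2474.0
Syy = sum(yi * yi for yi in y) - Sy ** 2 / n              # 2056.0
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - Sx * Sy / n  # 1894.0

b = Sxy / Sxx              # slope, ~0.76556
a = Sy / n - b * Sx / n    # intercept, ~40.78
print(f"y-hat = {a:.2f} + {b:.5f}x")
```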
The Analysis of Variance
Total SS = Syy = Σ(y − ȳ)² = 2056

SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741

SSE = Total SS − SSR = 2056 − 1449.9741 = 606.0259
Mean Squares
MSR = SSR/(1)
MSE = SSE/(n − 2)

Source      df      SS        MS            F
Regression  1       SSR       SSR/(1)       MSR/MSE
Error       n − 2   SSE       SSE/(n − 2)
Total       n − 1   Total SS

For the calculus grade data:

SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741
SSE = Total SS − SSR = 2056 − 1449.9741 = 606.0259

Source      df   SS          MS
Regression  1    1449.9741   1449.9741
Error       8    606.0259    75.7532
Total       9    2056.0000

These mean squares are used to test H0: β = 0 versus Ha: β ≠ 0.
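The ANOVA quantities in the table can be checked with a minimal Python sketch (not from the text):

```python
# Sketch (not from the text): ANOVA quantities for the calculus-grade
# example, from the sums of squares computed earlier.
Sxx, Syy, Sxy = 2474.0, 2056.0, 1894.0
n = 10

total_ss = Syy
ssr = Sxy ** 2 / Sxx        # ~1449.9741
sse = total_ss - ssr        # ~606.0259
msr = ssr / 1               # regression mean square
mse = sse / (n - 2)         # ~75.7532
f_stat = msr / mse          # ~19.14, as in the Minitab output
print(f"SSR={ssr:.4f}, SSE={sse:.4f}, MSE={mse:.4f}, F={f_stat:.2f}")
```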
Testing the Usefulness of the Model
The test statistic is

t = b / √(MSE/Sxx)

which has a t distribution with df = n − 2; or use a confidence interval:

b ± t(α/2) √(MSE/Sxx)
For the calculus grades data:

t = b / √(MSE/Sxx) = .76556 / √(75.7532/2474) = 4.38

Reject H0 when |t| > 2.306. Since t = 4.38 falls into
the rejection region, H0 is rejected.
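This t test can be sketched in a few lines of Python (not from the text), reusing the quantities computed for the example:

```python
import math

# Sketch (not from the text): t test for H0: beta = 0 using the
# quantities computed for the calculus-grade example.
b, mse, Sxx = 0.76556, 75.7532, 2474.0

t = b / math.sqrt(mse / Sxx)   # ~4.38
t_crit = 2.306                 # t(.025) with 8 df, from a t table
print("reject H0" if abs(t) > t_crit else "fail to reject H0")
```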
The F Test
You can test the overall usefulness of the
model using an F test. If the model is
useful, MSR will be large compared to
the unexplained variation, MSE.

To test H0: β = 0 (the model is not useful in predicting y):

Test statistic: F = MSR/MSE
Reject H0 if F > Fα with 1 and n − 2 df.

This test is exactly equivalent to the t-test, with t² = F.
Minitab Output

R-Sq = 70.5%    R-Sq(adj) = 66.8%

Analysis of Variance
Source          DF   SS       MS       F       P
Regression       1   1450.0   1450.0   19.14   0.002
Residual Error   8    606.0     75.8
Total            9   2056.0

The output shows the least squares regression line, the
regression coefficients a and b, MSE, and the F test of
H0: β = 0. Note that t² = F.
Coefficient of determination:

r² = (Sxy)² / (Sxx Syy) = SSR / Total SS
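For the calculus-grade example, this formula gives the R-Sq value reported by Minitab; a one-line Python check (not from the text):

```python
# Sketch (not from the text): coefficient of determination for the
# calculus-grade example.
Sxx, Syy, Sxy = 2474.0, 2056.0, 1894.0

r2 = Sxy ** 2 / (Sxx * Syy)   # same as SSR / Total SS, ~0.705
print(f"r^2 = {r2:.1%}")      # matches R-Sq = 70.5% in the Minitab output
```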
Interpreting a Significant Regression
Some Cautions
Extrapolation: predicting values of y outside the
range of x values used to fit the line.
Checking the
Regression Assumptions
Remember that the results of a regression
analysis are only valid when the necessary
assumptions have been satisfied.
1. The relationship between x and y is linear,
given by y = α + βx + ε.
2. The random error terms ε are independent and,
for any value of x, have a normal distribution
with mean 0 and variance σ².
Diagnostic Tools
We use the same diagnostic tools used in
Chapter 11 to check the normality
assumption and the assumption of equal
variances.
1. Normal probability plot of residuals
2. Plot of residuals versus fit or
residuals versus variables
Residuals
The residual error is the leftover
variation in each data point after the
variation explained by the regression model
has been removed.
Residual = yᵢ − ŷᵢ  or  yᵢ − a − bxᵢ

If all assumptions have been met, these
residuals should be normal, with mean 0
and variance σ².
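A minimal Python sketch (not from the text) that computes the residuals for the calculus-grade example, using the fitted coefficients from the earlier slides:

```python
# Sketch (not from the text): residuals y_i - (a + b*x_i) for the
# calculus-grade data, using the fitted coefficients from the example.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
a, b = 40.78424, 0.76556   # intercept and slope from the example

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print([round(r, 2) for r in residuals])
# A least squares fit forces the residuals to sum to (essentially) zero.
print(round(sum(residuals), 6))
```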
If the normality assumption is valid, the
plot should resemble a straight line,
sloping upward to the right.
If not, you will often see the pattern fail
in the tails of the graph.

[Figure: Normal probability plot of the residuals (response is y); percent versus residual]
If the equal variance assumption is valid,
the plot should appear as a random
scatter around the zero center line.
If not, you will see a pattern in the
residuals.

[Figure: Residuals versus the fitted values (response is y)]
Estimation and Prediction
To estimate the average value of y when x = x₀:

ŷ ± t(α/2) √( MSE (1/n + (x₀ − x̄)²/Sxx) )

To predict a particular value of y when x = x₀:

ŷ ± t(α/2) √( MSE (1 + 1/n + (x₀ − x̄)²/Sxx) )

where ŷ = a + bx₀.

For the calculus grades data, estimate the average grade when x₀ = 50.

Calculate ŷ = 40.78424 + .76556(50) = 79.06

ŷ ± 2.306 √( 75.7532 (1/10 + (50 − 46)²/2474) )
= 79.06 ± 6.55, or 72.51 to 85.61.
To predict a particular student's grade when x₀ = 50:

Calculate ŷ = 40.78424 + .76556(50) = 79.06

ŷ ± 2.306 √( 75.7532 (1 + 1/10 + (50 − 46)²/2474) )
= 79.06 ± 21.11, or 57.95 to 100.17.

Notice how much wider this interval is!
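Both intervals can be reproduced with a short Python sketch (not from the text):

```python
import math

# Sketch (not from the text): 95% confidence and prediction intervals
# at x0 = 50 for the calculus-grade example.
a, b = 40.78424, 0.76556
mse, Sxx, n, xbar = 75.7532, 2474.0, 10, 46.0
t_crit = 2.306                                   # t(.025) with 8 df
x0 = 50

y_hat = a + b * x0                               # ~79.06
core = 1 / n + (x0 - xbar) ** 2 / Sxx
ci_half = t_crit * math.sqrt(mse * core)         # ~6.55
pi_half = t_crit * math.sqrt(mse * (1 + core))   # ~21.11
print(f"95% CI: ({y_hat - ci_half:.2f}, {y_hat + ci_half:.2f})")
print(f"95% PI: ({y_hat - pi_half:.2f}, {y_hat + pi_half:.2f})")
```

The prediction interval's extra "1 +" term accounts for the variation of a single observation around its mean, which is why it is so much wider.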
Minitab Output
95.0% PI: (57.95, 100.17)

[Figure: Fitted line plot with the regression line, 95% CI bands, and 95% PI bands; S = 8.70363, R-Sq = 70.5%, R-Sq(adj) = 66.8%]

In the plot, the green prediction bands are always
wider than the red confidence bands.
Correlation Analysis
Example
The table shows the heights and weights of
n = 10 randomly selected college football
players.

Player     1   2   3   4   5   6   7   8   9  10
Height, x  73  71  75  72  72  75  67  69  71  69
Weight, y

Use your calculator to find the sums and sums of squares:

Sxy = 328    Sxx = 60.4    Syy = 2610

r = 328 / √((60.4)(2610)) = .8261
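The value of r follows directly from the sums of squares; a one-line Python check (not from the text):

```python
import math

# Sketch (not from the text): correlation coefficient for the
# football-player data, from the sums of squares on the slide.
Sxy, Sxx, Syy = 328.0, 60.4, 2610.0

r = Sxy / math.sqrt(Sxx * Syy)   # ~0.8261
print(f"r = {r:.4f}")
```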
Football Players

[Figure: Scatterplot of Weight vs Height]

r = .8261: strong positive correlation.
As the player's height increases, so does his weight.
Use the Exploring Correlation applet to
explore some correlation patterns:

r = 0: no correlation
r = .931: strong positive correlation
r = 1: linear relationship
r = −.67: weaker negative correlation
Inference using r
To test H0: ρ = 0 versus Ha: ρ ≠ 0

Test statistic: t = r √( (n − 2) / (1 − r²) )

Reject H0 if t > t(α/2) or t < −t(α/2) with n − 2 df.

This test is exactly equivalent to the t-test for the slope β.
Example
r = .8261, n = 10

Test statistic:

t = r √( (n − 2) / (1 − r²) ) = .8261 √( 8 / (1 − .8261²) ) = 4.15
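A minimal Python sketch (not from the text) of this computation:

```python
import math

# Sketch (not from the text): t statistic for H0: rho = 0
# with r = .8261 and n = 10, as in the football-player example.
r, n = 0.8261, 10

t = r * math.sqrt((n - 2) / (1 - r ** 2))   # ~4.15
print(f"t = {t:.2f}")
```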
Key Concepts
I. A Linear Probabilistic Model
1. When the data exhibit a linear relationship, the
appropriate model is y = α + βx + ε.
2. The random error ε has a normal distribution with
mean 0 and variance σ².
II. Method of Least Squares
1. Estimates a and b, for α and β, are chosen to
minimize SSE, the sum of the squared deviations
about the regression line, ŷ = a + bx.
2. The least squares estimates are b = Sxy/Sxx and
a = ȳ − b x̄.
Key Concepts
III. Analysis of Variance
1. Total SS = SSR + SSE, where Total SS = Syy and
SSR = (Sxy)²/Sxx.
2. The best estimate of σ² is MSE = SSE/(n − 2).
IV. Testing, Estimation, and Prediction
1. A test for the significance of the linear regression,
H0: β = 0, can be implemented using one of two test
statistics:

t = b / √(MSE/Sxx)   or   F = MSR/MSE
Key Concepts
2. The strength of the relationship between x and y is
measured by the coefficient of determination,
R² = SSR/Total SS.
3. Confidence intervals can be constructed for the slope β.
4. Confidence intervals can be constructed for the average
value of y when x = x₀.
5. Prediction intervals can be constructed for a particular
value of y when x = x₀.
Key Concepts
V. Correlation Analysis
1. Use the correlation coefficient to measure the
relationship between x and y when both variables are
random:
r = Sxy / √(Sxx Syy)