You are on page 1of 56

Linear Regression and

Correlation

Chapter 13

McGraw-Hill/Irwin ©The McGraw-Hill Companies, Inc. 2008


GOALS

z Understand and interpret the terms dependent and


independent variable
variable.
z Calculate and interpret the coefficient of correlation,
the coefficient of determination,, and the standard
error of estimate.
z Conduct a test of hypothesis
y to determine whether
the coefficient of correlation in the population is zero.
z Calculate the least squares regression line.
z Construct and interpret confidence and prediction
intervals for the dependent variable.

2
R
Regression
i A Analysis
l i - Introduction
I t d ti

z Recall in Chapter 4 the idea of showing the


relationship between two variables with a scatter
di
diagram was iintroduced.
t d d
z In that case we showed that, as the age of the buyer
increased,, the amount spent
p for the vehicle also
increased.
z In this chapter we carry this idea further. Numerical
measures to express the strength of relationship
between two variables are developed.
z In addition, an equation is used to express the
relationship.
l ti hi bbetween
t variables,
i bl allowing
ll i us tto
estimate one variable on the basis of another.

3
R
Regression
i A Analysis
l i - Uses
U

Some examples.
z Is there a relationship
p between the amount Healthtex
spends per month on advertising and its sales in the
month?
z Can we base an estimate of the cost to heat a home
in January on the number of square feet in the
home?
z Is there a relationship between the miles per gallon
achieved by large pickup trucks and the size of the
engine?
z Is there a relationship between the number of hours
that students studied for an exam and the score
earned?

4
Correlation Analysis

z Correlation Analysis is the study of the


relationship between variables. It is also
defined as group of techniques to measure
the association between two variables.
z A Scatter Diagram is a chart that portrays
the relationship between the two variables. It
is the usual first step in correlations analysis
– The Dependent Variable is the variable being
predicted
di t d or estimated.
ti t d
– The Independent Variable provides the basis for
estimation It is the predictor variable
estimation. variable.

5
R
Regression
i E Example
l

The sales manager of Copier Sales


of America, which has a large
sales force throughout the
United States and Canada,
wants to determine whether
there is a relationship between
the number of sales calls made
in a month and the number of
copiers sold that month. The
manager selects a random
sample of 10 representatives
and determines the number of
sales calls each representative
p
made last month and the
number of copiers sold.

6
S tt Di
Scatter Diagram

7
Correlation, r
The Coefficient of Correlation

The Coefficient of Correlation (r) is a measure of the


strength of the relationship between two variables
variables. It
requires interval or ratio-scaled data.
z It can range from -1.00 to 1.00.
z Values of -1.00 or 1.00 indicate perfect and strong
correlation.
z Values close to 0.0
0 0 indicate weak correlation.
correlation
z Negative values indicate an inverse relationship and
positive values indicate a direct relationship.

8
P f t Correlation
Perfect C l ti

9
Mi it b Scatter
Minitab S tt Pl Plots
t

10
Correlation Coefficient - Interpretation

11
C
Correlation
l ti Coefficient
C ffi i t - Formula
F l

12
Coefficient of Determination

The coefficient of determination (r2) is the


proportion of the total variation in the
dependent variable (Y) that is explained or
accounted for by the variation in the
independent variable (X). It is the square of
the coefficient of correlation
correlation.
z It ranges from 0 to 1.
z It does
d nott give
i any information
i f ti on the
th
direction of the relationship between the
variables.
i bl
13
C
Correlation
l ti Coefficient
C ffi i t - Example
E l

Using the Copier Sales of


America data which a
scatterplot was
developed earlier,
compute the correlation
coefficient and
coefficient of
determination.

14
C
Correlation
l ti Coefficient
C ffi i t - Example
E l

15
Correlation Coefficient – Excel Example

16
C
Correlation
l ti Coefficient
C ffi i t - Example
E l

How do we interpret a correlation of 0.759?


First it is positive
First, positive, so we see there is a direct relationship between
the number of sales calls and the number of copiers sold. The value
of 0.759 is fairly close to 1.00, so we conclude that the association
is strong.
strong

However, does this mean that more sales calls cause more sales?
No, we have not demonstrated cause and effect here, only that the
two variables—sales calls and copiers sold—are related.
17
Coefficient of Determination (r2) - Example

•The coefficient of determination, r2 ,is 0.576,


f
found
d by
b (0
(0.759)
759)2

•This
This is a proportion or a percent; we can say that
57.6 percent of the variation in the number of
copiers sold is explained, or accounted for, by the
variation in the number of sales calls.

18
Testing
g the Significance
g of
the Correlation Coefficient

H0: ρ = 0 (the correlation in the population is 0)


H1: ρ ≠ 0 (the correlation in the population is not 0)
j
Reject H0 if:
t > tα/2,n-2 or t < -tα/2,n-2

19
Testing the Significance of
the Correlation Coefficient - Example

H0: ρ = 0 (the correlation in the population is 0)


H1: ρ ≠ 0 (the correlation in the population is not 0)
Reject H0 if:
t > tα/2,n-2
α/2 n 2 or t < -tα/2,n-2
α/2 n 2
t > t0.025,8 or t < -t0.025,8
t > 2.306 or t < -2.306

20
Testing the Significance of
the Correlation Coefficient - Example

The computed t (3.297) is within the rejection region, therefore, we will reject H0. This means
tthe
e correlation
co e at o in tthe
e popu
population
at o is
s not
ot zero.
e o From
o ap practical
act ca sta
standpoint,
dpo t, itt indicates
d cates to tthe
e
sales manager that there is correlation with respect to the number of sales calls made
and the number of copiers sold in the population of salespeople.
21
Mi it b
Minitab

22
Li
Linear Regression
R i Model
M d l

23
C
Computing
ti ththe Slope
Sl off the
th Line
Li

24
C
Computing
ti ththe Y-Intercept
YI t t

25
Regression Analysis

In regression analysis we use the independent variable


(X) to estimate the dependent
p variable (Y)).
z The relationship between the variables is linear.
z Both variables must be at least interval scale.
z The least squares criterion is used to determine the
equation.

26
Regression Analysis – Least Squares
Principle

z The least squares principle is used to


bt i a and
obtain d b.
z The equations
q to determine a and b
are:
n ( Σ X Y ) − ( Σ X )( Σ Y )
b=
n(ΣX 2 ) − (Σ X )2
ΣY ΣX
a= −b
n n

27
Illustration of the Least Squares
Regression Principle

28
R
Regression
i E Equation
ti - Example
E l

Recall the example involving


Copier Sales of America. The
sales manager gathered
information on the number of
sales calls made and the
number of copiers sold for a
random sample of 10 sales
representatives. Use the least
squares method to determine a
linear equation to express the
relationship between the two
variables.
What is the expected number of
copiers sold by a representative
who made 20 calls?

29
Finding the Regression Equation - Example

The regression equation is :


^
Y = a + bX
^
Y = 18 .9476 + 1 .1842 X
^
Y = 18 .9476 + 1 .1842 ( 20 )
^
Y = 42 .6316
30
C
Computing
ti ththe Estimates
E ti t off Y

Step 1 – Using the regression equation, substitute the


value of each X to solve for the estimated sales

Tom Keller Soni Jones


^ ^
Y = 18.9476 + 1.1842 X Y = 18.9476 + 1.1842 X
^ ^
Y = 18.9476 + 1.1842(20) Y = 18.9476 + 1.1842(30)
^ ^
Y = 42.6316 Y = 54.4736
31
Plotting the Estimated and the Actual Y’s

32
The Standard Error of Estimate

z The standard error of estimate measures the


scatter or dispersion,
scatter, dispersion of the observed values
around the line of regression
z The formulas that are used to compute
p the
standard error:

ΣY 2 − aΣY − bΣXY
^
Σ (Y − Y ) 2
s y.x = s y. x =
n−2 n−2

33
Standard Error of the Estimate - Example

Recall the example involving


Copier Sales of America.
Th sales
The l manager
determined the least
squares regression
equation is given below
below.
Determine the standard error
of estimate as a measure
off how
h wellll th
the values
l fit
the regression line.
^ ^

Y = 18.9476 + 1.1842 X Σ (Y − Y ) 2
s y.x =
n−2
784 .211
= = 9.901
10 − 2

34
Graphical Illustration of the Differences between Actual ^
Y – Estimated Y (Y − Y )

35
Standard Error of the Estimate - Excel

36
Assumptions
p Underlying
y g Linear
Regression
For each value of X, there is a group of Y values, and these
z Y values are normally distributed. The means of these normal
distributions of Y values all lie on the straight line of regression.
z The standard deviations of these normal distributions are equal.
z The Y values are statistically independent
independent. This means that in
the selection of a sample, the Y values chosen for a particular X
value do not depend on the Y values for any other X values.

37
Confidence Interval and Prediction
Interval Estimates of Y

•A confidence interval reports the mean value of Y


f a given
for i X
X.
•A prediction interval reports the range of values
of Y for a particular value of X.
X

38
Confidence Interval Estimate - Example

We return to the Copier Sales of America


illustration Determine a 95 percent confidence
illustration.
interval for all sales representatives who make
25 calls.
calls

39
Confidence Interval Estimate - Example

Step 1 – Compute the point estimate of Y


In other words
words, determine the number of copiers we expect a sales
representative to sell if he or she makes 25 calls.

The regression equation is :


^
Y = 18.9476 + 1.1842 X
^
Y = 18.9476 + 1.1842(25)
^

40
Y = 48.5526
Confidence Interval Estimate - Example

Step 2 – Find the value of t


z To find the t value,
value we need to first know the number
of degrees of freedom. In this case the degrees of
freedom is n - 2 = 10 – 2 = 8.
z We set the confidence level at 95 percent. To find
the value of t, move down the left-hand column of
Appendi B
Appendix B.2
2 to 8 degrees of freedom
freedom, then mo
move e
across to the column with the 95 percent level of
confidence.
z The value of t is 2.306.
41
Confidence Interval Estimate - Example

42
Confidence Interval Estimate - Example

Step 4 – Use the formula above by substituting the numbers computed


i previous
in i slides
lid

Thus, the 95 percent confidence interval for the average sales of all
sales representatives who make 25 calls is from 40.9170 up to
56.1882 copiers.
43
Prediction Interval Estimate - Example

We return to the Copier


p Sales of America
illustration. Determine a 95 percent
prediction interval for Sheila Baker,, a West
p
Coast sales representative who made 25
calls.

44
Prediction Interval Estimate - Example

Step 1 – Compute the point estimate of Y


In other words, determine the number of copiers we
expect a sales representative to sell if he or she
makes 25 calls.

The regression equation is :


^
Y = 18.9476 + 1.1842 X
^
Y = 18.9476 + 1.1842(25)
^

45
Y = 48.5526
Prediction Interval Estimate - Example

Step 2 – Using the information computed


earlier in the confidence interval estimation
example, use the formula above.

If Sh
Sheila
il BBaker
k makes
k 25
2 sales
l calls,
ll the
h number
b off copiers
i she
h
will sell will be between about 24 and 73 copiers.
46
Confidence and Prediction Intervals –
Minitab Illustration

47
T
Transforming
f i Data
D t

z The coefficient of correlation describes the


strength of the linear relationship between
two variables. It could be that two variables
are closelyy related,, but there relationship
p is
not linear.
z Be cautious when yyou are interpreting
p g the
coefficient of correlation. A value of r may
indicate there is no linear relationship, but it
could be there is a relationship of some other
nonlinear or curvilinear form.

48
T
Transforming
f i Data
D t - Example
E l

On the right is a listing of 22 professional


golfers, the number of events in
which they participated, the amount
off their
th i winnings,
i i andd th
their
i mean
score for the 2004 season. In golf,
the objective is to play 18 holes in
the least number of strokes. So, we
would expect that those golfers with
the lower mean scores would have
the larger winnings. To put it another
way, score and winnings should be
inverselyy related. In 2004 Tiger
g
Woods played in 19 events, earned
$5,365,472, and had a mean score
per round of 69.04. Fred Couples
played in 16 events, earned
$1 396 109 and had a mean score
$1,396,109,
per round of 70.92. The data for the
22 golfers follows.

49
S tt
Scatterplot
l t off Golf
G lf Data
D t

z The correlation between the


variables Winnings and
S
Score is
i 0.782.
0 782 This
Thi iis a
fairly strong inverse
relationship.
z However, when we plot the
data on a scatter diagram
the relationship does not
appear to be linear; it does
not seem to follow a straight
line.
line

50
What can we do to explore other (nonlinear)
relationships?
One possibility is to transform one of the
example, instead of using Y as
variables For example
variables.
the dependent variable, we might use its log,
reciprocal,
p , square,
q , or square
q root. Another
possibility is to transform the independent
variable in the same way. There are other
transformations, but these are the most
common.

51
T
Transforming
f i Data
D t - Example
E l

In the golf winnings


example,
p , changing
g g the
scale of the dependent
variable is effective. We
determine the log of each
golfer’s winnings and
then find the correlation
between the log of
winnings and score. That
is, we find the log to the
base 10 of Tiger Woods’
earnings of $5,365,472,
which
hich is 6.72961.
6 72961
52
S tt Pl
Scatter Plott off Transformed
T f dY

53
Linear Regression Using the
Transformed Y

54
Using the Transformed Equation for
Estimation

Based on the regression equation, a golfer with


a mean score of 70 could expect to earn:

•The value 6.4372


6 4372 is the log to the base 10 of winnings.
winnings
•The antilog of 6.4372 is 2.736
•So a g
golfer that had a mean score of 70 could expect
p to
earn $2,736,528.
55
End of Chapter 13

56

You might also like