You are on page 1of 45

Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.

1
Chapter 17
Simple Linear Regression
and Correlation
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.2
Linear Regression Analysis
Regression analysis is used to predict the value of one
variable (the dependent variable) on the basis of other
variables (the independent variables).

Dependent variable: denoted Y
Independent variables: denoted X
1
, X
2
, , X
k
If we only have ONE independent variable, the model is





which is referred to as simple linear regression. We would be
interested in estimating
0
and
1
from the data we collect.
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.3
Linear Regression Analysis

Variables:
X = Independent Variable (we provide this)
Y = Dependent Variable (we observe this)
Parameters:

0
= Y-Intercept

1
= Slope
~ Normal Random Variable (

= 0,

= ???) [Noise]
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.4
Effect of Larger Values of


House size
House
Price
25K$
Same square footage, but different price points
(e.g. dcor options, cabinet upgrades, lot location)
Lower vs. Higher
Variability
House Price = 25,000 + 75(Size) +
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.5
Theoretical Linear Model
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.6
1. Building the Model Collect Data
Test 2 Grade =
0
+1*(Test 1 Grade)

From Data:
Estimate
0

Estimate
1
Estimate



Student Test 1 Test 2
1 50 32
2 51 33
3 52 34
4 53 35
5 54 36
6 55 37
7 56 39
8 57 40
9 58 41
10 59 42
11 60 43
12 61 44
13 62 46
14 63 47
15 64 48
16 65 49
17 66 50
18 67 51
19 68 53
20 69 54
21 70 55
22 71 56
23 72 57
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.7
Linear Regression Analysis
Plot of Fitted Model
Test 1
T
e
s
t

2
40 50 60 70 80 90 100
0
20
40
60
80
100
Plot of Fitted Model
60 70 80 90 100
Test B1
42
52
62
72
82
92
T
e
s
t

B
2
Plot of Fitted Model
Test B1
T
e
s
t

B
2
50 60 70 80 90 100
50
60
70
80
90
100
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.8
Correlation Analysis -1 < < 1
If we are interested only in determining whether a
relationship exists, we employ correlation analysis.
Example: Students height and weight.


Plot of Height vs Weight
100 140 180 220 260
Weight
4.6
5
5.4
5.8
6.2
6.6
7
H
e
i
g
h
t
Plot of Height vs Weight
100 140 180 220 260
Weight
5.3
5.6
5.9
6.2
6.5
6.8
H
e
i
g
h
t
Plot of Height vs Weight
100 140 180 220 260
Weight
5.4
5.8
6.2
6.6
7
H
e
i
g
h
t
Plot of Height vs Weight
100 140 180 220 260
Weight
5
5.4
5.8
6.2
6.6
H
e
i
g
h
t
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.9
Correlation Analysis -1 < < 1
If the correlation coefficient is close to +1 that means you
have a strong positive relationship.

If the correlation coefficient is close to -1 that means you
have a strong negative relationship.

If the correlation coefficient is close to 0 that means you
have no correlation.

WE HAVE THE ABILITY TO TEST THE HYPOTHESIS
H
0
: = 0
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.10
Regression: Model Types X=size of house, Y=cost of house
Deterministic Model: an equation or set of equations that
allow us to fully determine the value of the dependent
variable from the values of the independent variables.
y = $25,000 + (75$/ft
2
)(x)
Area of a circle: A = *r
2

Probabilistic Model: a method used to capture the
randomness that is part of a real-life process.
y = 25,000 + 75x +
E.g. do all houses of the same size (measured in square feet)
sell for exactly the same price?
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.11
Simple Linear Regression Model
Meaning of and
> 0 [positive slope] < 0 [negative slope]

y
x
run
rise
=slope (=rise/run)
=y-intercept
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.12
Which line has the best fit to the data?
?
?
?
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.13
Estimating the Coefficients
In much the same way we base estimates of on , we
estimate with b
0
and with b
1
, the y-intercept and slope
(respectively) of the least squares or regression line given
by:


(This is an application of the least squares method and it
produces a straight line that minimizes the sum of the
squared differences between the points and the line)
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.14
Least Squares Line
these differences are
called residuals or
errors
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.15
Least Squares Line[sure glad we have computers now!]
The coefficients b
1
and b
0
for the
least squares line


are calculated as:
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.16
Data
Statistics
Information
Data Points:
x y
1 6
2 1
3 9
4 5
5 17
6 12
y = .934 + 2.114x
Least Squares Line See if you can estimate Y-intercept and slope from this data
Recall
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.17
Least Squares Line See if you can estimate Y-intercept and slope from this data
X Y X - Xbar Y - Ybar (X-Xbar)*(Y-Ybar) (X - Xbar)
2
1 6 -2.500 -2.333 5.833 6.250
2 1 -1.500 -7.333 11.000 2.250
3 9 -0.500 0.667 -0.333 0.250
4 5 0.500 -3.333 -1.667 0.250
5 17 1.500 8.667 13.000 2.250
6 12 2.500 3.667 9.167 6.250
Sum = 21 50 0.000 0.000 37.000 17.500
Xbar = 3.500
Ybar = 8.333
s
xy
= 7.400 37.00/(6-1)
s
x
2
= 3.500 17.5/(6-1)
b
1
= 2.114 7.4/3.5
b
0
= 0.933 8.33 - 2.114*3.50
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.18
Excel: Data Analysis - Regression
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.7007
R Square 0.4910 The proportion of the variation in the variable Y that can be explained by your regression model
Adjusted R Square 0.3637
Standard Error 4.5029 Will use later
Observations 6
ANOVA
df SS MS F Significance F Same as p-value
Regression 1 78.22857143 78.22857143 3.858149366 0.120968388 H0: Regression Model is "NO Good"
Residual 4 81.1047619 20.27619048
Total 5 159.3333333
Coefficients Standard Error t Stat P-value
Intercept 0.933333333 4.19198025 0.222647359 0.834716871
X Variable 1 2.114285714 1.076401159 1.96421724 0.120968388 H0:
1
= 0
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.19
Excel: Plotted Regression Model You will need to play around
with this to get the plot to look Good

X Variable 1 Line Fit Plot
0
5
10
15
20
0 1 2 3 4 5 6 7
X Variable 1
Y
Y
Predicted Y
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.20
Required Conditions
For these regression methods to be valid the following four
conditions for the error variable ( ) must be met:
The probability distribution of is normal.
The mean of the distribution is 0; that is, E( ) = 0.
The standard deviation of is , which is a constant
regardless of the value of x.
The value of associated with any particular value of y is
independent of associated with any other value of y.

Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.21
Assessing the Model
The least squares method will always produce a straight line,
even if there is no relationship between the variables, or if
the relationship is something other than linear.

Hence, in addition to determining the coefficients of the least
squares line, we need to assess it to see how well it fits the
data. Well see these evaluation methods now. Theyre based
on the what is called sum of squares for errors (SSE).
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.22
Sum of Squares for Error (SSE another thing to calculate)
The sum of squares for error is calculated as:



and is used in the calculation of the standard error of
estimate:


If is zero, all the points fall on the regression line.
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.23
Standard Error
If is small, the fit is excellent and the linear model should
be used for forecasting. If is large, the model is poor
But what is small and what is large?
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.24
Standard Error
Judge the value of by comparing it to the sample mean of
the dependent variable ( ).

In this example,
= .3265 and
= 14.841

so (relatively speaking) it appears to be small, hence our
linear regression model of car price as a function of
odometer reading is good.
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.25
Testing the SlopeExcel output does this for you.
If no linear relationship exists between the two variables, we
would expect the regression line to be horizontal, that is, to
have a slope of zero.

We want to see if there is a linear relationship, i.e. we want
to see if the slope ( ) is something other than zero. Our
research hypothesis becomes:
H
1
: 0
Thus the null hypothesis becomes:
H
0
: = 0
Already discussed!
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.26
Testing the Slope
We can implement this test statistic to try our hypotheses:
H0:
1
= 0

where is the standard deviation of b
1
, defined as:


If the error variable ( ) is normally distributed, the test
statistic has a Student t-distribution with n2 degrees of
freedom. The rejection region depends on whether or not
were doing a one- or two- tail test (two-tail test is most
typical).
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.27
Example 17.4
Test to determine if the slope is significantly different from
0 (at 5% significance level)

We want to test:
H
1
: 0
H
0
: = 0
(if the null hypothesis is true, no linear relationship exists)
The rejection region is:


OR check the p-value.

Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.28
Example 17.4
We can compute t manually or refer to our Excel output




We see that the t statistic for
odometer (i.e. the slope, b
1
) is 13.49
which is greater than t
Critical
= 1.984. We also note that the
p-value is 0.000.
There is overwhelming evidence to infer that a linear
relationship between odometer reading and price exists.
COMPUTE
Compare
p-value
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.29
Testing the Slope
We can also estimate (to some level of confidence) and
interval for the slope parameter, .
Recall that your estimate for is b
1
.
The confidence interval estimator is given as:


Hence:

That is, we estimate that the slope coefficient lies between
.0768 and .0570
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.30
Coefficient of Determination
Tests thus far have shown if a linear relationship exists; it is
also useful to measure the strength of the relationship. This
is done by calculating the coefficient of determination R
2
.



The coefficient of determination is the square of the
coefficient of correlation (r), hence R
2
= (r)
2

r will be computed shortly and this is true for models
with only 1 indepenent variable
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.31
Coefficient of Determination
R
2
has a value of .6483. This means 64.83% of the variation
in the auction selling prices (y) is explained by your
regression model. The remaining 35.17% is unexplained,
i.e. due to error.
Unlike the value of a test statistic, the coefficient of
determination does not have a critical value that enables us
to draw conclusions.
In general the higher the value of R
2
, the better the model
fits the data.
R
2
= 1: Perfect match between the line and the data points.
R
2
= 0: There are no linear relationship between x and y.
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.32
Remember Excels Output
An analysis of variance (ANOVA) table for the
simple linear regression model can be give by:
Source
degrees
of
freedom
Sums of
Squares
Mean
Squares
F-Statistic
Regression 1 SSR
MSR =
SSR/1
F=MSR/MSE
Error n2 SSE
MSE =
SSE/(n2)
Total n1
Variation
in y (SST)
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.33
Using the Regression Equation
We could use our regression equation:
y = 17.250 .0669x

to predict the selling price of a car with 40 (40,000) miles on
it:

y = 17.250 .0669x = 17.250 .0669(40) = 14, 574

We call this value ($14,574) a point prediction (estimate).
Chances are though the actual selling price will be different,
hence we can estimate the selling price in terms of a
confidence interval.
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.34
Prediction Interval
The prediction interval is used when we want to predict one
particular value of the dependent variable, given a specific
value of the independent variable:





(x
g
is the given value of x were interested in)
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.35
Confidence Interval Estimator for Mean of Y
The confidence interval estimate for the expected value of y
(Mean of Y) is used when we want to predict an interval we
are pretty sure contains the true regression line . In this
case, we are estimating the mean of y given a value of x:



(Technically this formula is used for infinitely large
populations. However, we can interpret our problem as
attempting to determine the average selling price of all Ford
Tauruses, all with 40,000 miles on the odometer)
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.36
Whats the Difference?
Prediction Interval
Confidence Interval
1 no 1
Used to estimate the value of
one value of y (at given x)
Used to estimate the mean
value of y (at given x)
The confidence interval estimate of the expected value of y will be narrower than
the prediction interval for the same given value of x and confidence level. This is
because there is less error in estimating a mean value as opposed to predicting an
individual value.
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.37
Regression Diagnostics
There are three conditions that are required in order to
perform a regression analysis. These are:
The error variable must be normally distributed,
The error variable must have a constant variance, &
The errors must be independent of each other.

How can we diagnose violations of these conditions?
Residual Analysis, that is, examine the differences
between the actual data points and those predicted by the
linear equation
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.38
Nonnormality
We can take the residuals and put them into a histogram to
visually check for normality







were looking for a bell shaped histogram with the mean
close to zero [our old test for normality].
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.39
Heteroscedasticity
When the requirement of a constant variance is violated, we
have a condition of heteroscedasticity.






We can diagnose heteroscedasticity by plotting the residual
against the predicted y.

Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.40
Heteroscedasticity
If the variance of the error variable ( ) is not constant, then
we have heteroscedasticity. Heres the plot of the residual
against the predicted value of y:
there doesnt appear to be a
change in the spread of the
plotted points, therefore no
heteroscedasticity

Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.41
Nonindependence of the Error Variable
If we were to observe the auction price of cars every week
for, say, a year, that would constitute a time series.

When the data are time series, the errors often are correlated.
Error terms that are correlated over time are said to be
autocorrelated or serially correlated.

We can often detect autocorrelation by graphing the
residuals against the time periods. If a pattern emerges, it is
likely that the independence requirement is violated.
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.42
Nonindependence of the Error Variable
Patterns in the appearance of the residuals over time
indicates that autocorrelation exists:
Note the runs of positive residuals,
replaced by runs of negative residuals
Note the oscillating behavior of the
residuals around zero.
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.43
Outliers Problem worked earlier
An outlier is an observation that is unusually small or
unusually large.
E.g. our used car example had odometer readings from 19.1
to 49.2 thousand miles. Suppose we have a value of only
5,000 miles (i.e. a car driven by an old person only on
Sundays ) this point is an outlier.
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.44
Outliers
Possible reasons for the existence of outliers include:
There was an error in recording the value
The point should not have been included in the sample
* Perhaps the observation is indeed valid.

Outliers can be easily identified from a scatter plot.

If the absolute value of the standard residual is > 2, we
suspect the point may be an outlier and investigate further.

They need to be dealt with since they can easily influence
the least squares line
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc. 17.45
Procedure for Regression Diagnostics
1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model.
3. Draw the scatter diagram to determine whether a linear
model appears to be appropriate. Identify possible
outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions
6. Assess the models fit.
7. If the model fits the data, use the regression equation to
predict a particular value of the dependent variable
and/or estimate its mean.

You might also like