
Simple Linear Regression

Introduction
In this chapter we examine the relationship between
two interval variables via a mathematical equation.
The motivation for using the technique:
Forecast the value of a dependent variable (y) from
the values of independent variables (x1, x2, ..., xk).
Analyze the specific relationships between the
independent variables and the dependent variable.

The Model
The model has a deterministic and a probabilistic component.
[Figure: House cost plotted against house size. Most lots sell for $25,000, and building a house costs about $75 per square foot, so House cost = 25,000 + 75(Size).]

The Model
However, house costs vary even among same-size houses!
Since cost behaves unpredictably, we add a random component.

[Figure: The same plot, with house costs scattered around the line House cost = 25,000 + 75(Size).]

The Model
The first-order linear model:

y = β0 + β1x + ε

y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line
ε = error variable

β0 and β1 are unknown population parameters and are
therefore estimated from the data.

[Figure: The line y = β0 + β1x; β0 is the y-intercept, and the slope β1 equals Rise/Run.]
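To make the deterministic and random components concrete, here is a minimal simulation sketch in Python of the house-cost example above (the intercept 25,000 and slope 75 come from the slides; the error standard deviation and the range of house sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = 25_000, 75    # deterministic component: cost = 25,000 + 75 * size
sigma = 10_000               # assumed standard deviation of the error term (illustrative)

size = rng.uniform(1_000, 3_000, size=200)    # house sizes in square feet (illustrative)
epsilon = rng.normal(0, sigma, size=200)      # random component
cost = beta0 + beta1 * size + epsilon         # first-order linear model y = b0 + b1*x + error

print(cost[:3].round(0))    # same-size houses get different costs because of the error term
```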

Estimating the Coefficients


The estimates are determined by
drawing a sample from the population of interest,
calculating sample statistics, and
producing a straight line that cuts into the data.

Question: What should be considered a good line?

The Least Squares (Regression) Line


A good line is one that minimizes
the sum of squared differences between the
points and the line.

The Least Squares (Regression) Line


Let us compare two lines drawn through the points (1,2), (2,4), (3,1.5), and (4,3.2).
The first line has predicted values 1, 2, 3, and 4 at x = 1, 2, 3, 4; the second line is horizontal at y = 2.5.

Sum of squared differences (first line) = (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89
Sum of squared differences (horizontal line) = (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99

The smaller the sum of squared differences,
the better the fit of the line to the data.
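A quick sketch of this comparison in Python (the points are taken from the slide; the first line is taken to be y = x, consistent with the predicted values 1, 2, 3, 4 implied by the differences above):

```python
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(predict, pts):
    """Sum of squared differences between observed y and the line's prediction."""
    return sum((y - predict(x)) ** 2 for x, y in pts)

line1 = lambda x: x      # the tilted line: predicts 1, 2, 3, 4
line2 = lambda x: 2.5    # the horizontal line at y = 2.5

print(round(sum_sq_diff(line1, points), 2))   # 7.89
print(round(sum_sq_diff(line2, points), 2))   # 3.99 -> the horizontal line fits better here
```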

The Estimated Coefficients


To calculate the estimates of the line
coefficients that minimize the differences
between the data points and the line, use
the formulas:

b1 = cov(X, Y) / sx²

b0 = ȳ - b1 x̄

The regression equation that estimates
the equation of the first-order linear model
is:

ŷ = b0 + b1x
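A minimal Python sketch of these two formulas (the small x and y arrays are illustrative, not data from the textbook):

```python
import numpy as np

def least_squares_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    cov_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance of x and y
    s2_x = np.var(x, ddof=1)              # sample variance of x
    b1 = cov_xy / s2_x
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0])        # illustrative data
y = np.array([2.0, 4.0, 1.5, 3.2])
b0, b1 = least_squares_line(x, y)
print(round(b0, 3), round(b1, 3))
```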

The Simple Linear Regression Line


Example 17.2 (Xm17-02)
A car dealer wants to find
the relationship between
the odometer reading and
the selling price of used cars.
A random sample of 100 cars
is selected, and the data
recorded.
Find the regression line.

The independent variable x is the odometer reading;
the dependent variable y is the selling price.

The Simple Linear Regression Line


Solution
Solving by hand: calculate a number of statistics.

x̄ = 36,009.45;    sx² = Σ(xi - x̄)² / (n - 1) = 43,528,690

ȳ = 14,822.82;    cov(X, Y) = Σ(xi - x̄)(yi - ȳ) / (n - 1) = -2,712,511

where n = 100.

b1 = cov(X, Y) / sx² = -2,712,511 / 43,528,690 = -.06232

b0 = ȳ - b1 x̄ = 14,822.82 - (-.06232)(36,009.45) = 17,067

ŷ = b0 + b1x = 17,067 - .0623x

The Simple Linear Regression Line


Solution continued
Using the computer (Xm17-02)
Tools > Data Analysis > Regression >
[Shade the y range and the x range] > OK

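Outside Excel, comparable output can be produced with a regression library. Below is a sketch in Python using statsmodels; the odometer and price arrays are simulated stand-ins for the Xm17-02 worksheet, not the actual data:

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in for the Xm17-02 data: 100 odometer readings and prices
rng = np.random.default_rng(1)
odometer = rng.uniform(19_000, 50_000, size=100)
price = 17_067 - 0.0623 * odometer + rng.normal(0, 303, size=100)

X = sm.add_constant(odometer)    # add the intercept column
model = sm.OLS(price, X).fit()
print(model.params)              # intercept and slope, comparable to b0 and b1
print(model.rsquared)            # comparable to Excel's R Square
# model.summary() prints a table comparable to Excel's regression output
```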

The Simple Linear Regression Line


Xm17-02

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.8063
R Square              0.6501
Adjusted R Square     0.6466
Standard Error        303.1
Observations          100

ŷ = 17,067 - .0623x

ANOVA
              df    SS          MS          F         Significance F
Regression     1    16734111    16734111    182.11    0.0000
Residual      98    9005450     91892
Total         99    25739561

              Coefficients    Standard Error    t Stat    P-value
Intercept     17067           169               100.97    0.0000
Odometer      -0.0623         0.0046            -13.49    0.0000

Interpreting the Linear Regression Equation

[Figure: Odometer Line Fit Plot - Price against Odometer with the fitted line ŷ = 17,067 - .0623x; there are no data near Odometer = 0.]

The intercept is b0 = $17,067. Do not interpret the intercept as the
price of cars that have not been driven, because there are no data
in that region.

The slope is b1 = -.0623. For each additional mile on the odometer,
the price decreases by an average of $0.0623 (6.23 cents).

Error Variable: Required Conditions


The error ε is a critical part of the regression model.
Four requirements involving the distribution of ε must
be satisfied.

The probability distribution of ε is normal.
The mean of ε is zero: E(ε) = 0.
The standard deviation of ε is σε, which is constant for all values of x.
The errors associated with different values of y are all
independent.

The Normality of ε

[Figure: At x1, x2, and x3 the distribution of y is a normal curve centered on the line, with means E(y|x1) = β0 + β1x1, E(y|x2) = β0 + β1x2, and E(y|x3) = β0 + β1x3. The standard deviation remains constant, but the mean value changes with x.]

From the first three assumptions we have:
y is normally distributed with mean
E(y) = β0 + β1x and a constant standard
deviation σε.

Assessing the Model


The least squares method will produce a
regression line whether or not there is a linear
relationship between x and y.
Consequently, it is important to assess how well
the linear model fits the data.
Several methods are used to assess the model.
All are based on the sum of squares for errors,
SSE.

Sum of Squares for Errors


This is the sum of squared differences between the points and
the regression line.
It can serve as a measure of how well the line fits the
data. SSE is defined by

SSE = Σ (yi - ŷi)²,  i = 1, ..., n.

A shortcut formula:

SSE = (n - 1) [ sy² - cov(X, Y)² / sx² ]
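A small Python check that the shortcut formula agrees with the direct definition (illustrative data; ŷ comes from the least squares line):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])    # illustrative data
y = np.array([2.0, 4.0, 1.5, 3.2])
n = len(x)

cov_xy = np.cov(x, y, ddof=1)[0, 1]
s2_x, s2_y = np.var(x, ddof=1), np.var(y, ddof=1)
b1 = cov_xy / s2_x
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
sse_direct = np.sum((y - y_hat) ** 2)                  # definition of SSE
sse_shortcut = (n - 1) * (s2_y - cov_xy**2 / s2_x)     # shortcut formula
print(round(sse_direct, 4), round(sse_shortcut, 4))    # the two values agree
```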

Standard Error of Estimate


The mean error is equal to zero.
If σε is small, the errors tend to be close to zero
(close to the mean error), and the model fits the
data well.
Therefore, we can use σε as a measure of the
suitability of using a linear model.
An estimator of σε is given by sε, the standard error of estimate:

sε = √( SSE / (n - 2) )

Standard Error of Estimate, Example

Example 17.3
Calculate the standard error of estimate for Example 17.2,
and describe what it tells you about the model fit.

Solution

sy² = Σ(yi - ȳ)² / (n - 1) = 259,996    (calculated earlier)

SSE = (n - 1) [ sy² - cov(X, Y)² / sx² ]
    = 99 [ 259,996 - (-2,712,511)² / 43,528,690 ] = 9,005,450

sε = √( SSE / (n - 2) ) = √( 9,005,450 / 98 ) = 303.13

It is hard to assess the model based on sε alone, even when
compared with the mean value of y:
sε = 303.1 while ȳ = 14,823.
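The same arithmetic as a Python sketch (all numbers taken from Examples 17.2 and 17.3; small rounding differences are expected):

```python
from math import sqrt

n = 100
s2_y = 259_996         # sample variance of y
cov_xy = -2_712_511    # sample covariance of x and y
s2_x = 43_528_690      # sample variance of x

sse = (n - 1) * (s2_y - cov_xy**2 / s2_x)    # shortcut formula for SSE
s_e = sqrt(sse / (n - 2))                    # standard error of estimate

print(round(sse))       # about 9,005,450
print(round(s_e, 2))    # about 303.1
```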

Testing the slope


When no linear relationship exists between two
variables, the regression line should be horizontal.

Linear relationship: different inputs (x) yield
different outputs (y); the slope is not equal to zero.

No linear relationship: different inputs (x) yield
the same output (y); the slope is equal to zero.

Testing the Slope


We can draw inferences about β1 from b1 by testing
H0: β1 = 0
H1: β1 ≠ 0 (or < 0, or > 0)

The test statistic is

t = (b1 - β1) / sb1

where sb1 = sε / √( (n - 1) sx² ) is the standard error of b1.

If the error variable is normally distributed, the statistic
has a Student t distribution with d.f. = n - 2.

Testing the Slope, Example

Example 17.4
Test to determine whether there is enough evidence
to infer that there is a linear relationship between the
car auction price and the odometer reading for all
three-year-old Tauruses in Example 17.2.
Use α = 5%.

Testing the Slope, Example

Solving by hand
To compute t we need the values of b1 and sb1.

b1 = -.0623

sb1 = sε / √( (n - 1) sx² ) = 303.1 / √( (99)(43,528,690) ) = .00462

t = (b1 - β1) / sb1 = (-.0623 - 0) / .00462 = -13.49

The rejection region is t > t.025 or t < -t.025 with d.f. = n - 2 = 98.
Approximately, t.025 = 1.984.
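The test in Python, using the summary statistics quoted above:

```python
from math import sqrt

n = 100
b1 = -0.0623
s_e = 303.1          # standard error of estimate
s2_x = 43_528_690    # sample variance of x
t_crit = 1.984       # approximate t.025 with 98 degrees of freedom

s_b1 = s_e / sqrt((n - 1) * s2_x)    # standard error of b1
t_stat = (b1 - 0) / s_b1             # test statistic for H0: beta1 = 0

print(round(s_b1, 5), round(t_stat, 2))
# |t| = 13.49 far exceeds 1.984, so H0 is rejected
```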

Testing the Slope, Example

Using the computer (Xm17-02)

Price     Odometer
14636     37388
14122     44758
14016     45833
15590     30862
15568     31705
14718     34010
14470     45854
15690     19057
15072     40149
14802     40237
15190     32359
14660     43533
15612     32744
15610     34470
14634     37720
14632     41350
15740     24469
(excerpt of the 100 observations)

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.8063
R Square              0.6501
Adjusted R Square     0.6466
Standard Error        303.1
Observations          100

ANOVA
              df    SS          MS          F         Significance F
Regression     1    16734111    16734111    182.11    0.0000
Residual      98    9005450     91892
Total         99    25739561

              Coefficients    Standard Error    t Stat    P-value
Intercept     17067           169               100.97    0.0000
Odometer      -0.0623         0.0046            -13.49    0.0000

There is overwhelming evidence to infer
that the odometer reading affects the
auction selling price.

Coefficient of determination
To measure the strength of the linear relationship we
use the coefficient of determination.

R² = cov(X, Y)² / ( sx² sy² )

or

R² = 1 - SSE / Σ(yi - ȳ)²

Coefficient of determination
To understand the significance of this coefficient, note that the
overall variability in y is explained in part by the regression
model and remains, in part, unexplained (the error).

Coefficient of determination
Two data points (x1, y1) and (x2, y2)
of a certain sample are shown.

Total variation in y =
Variation explained by the regression line
+ Unexplained variation (error):

(y1 - ȳ)² + (y2 - ȳ)²  =  (ŷ1 - ȳ)² + (ŷ2 - ȳ)²  +  (y1 - ŷ1)² + (y2 - ŷ2)²

That is, Variation in y = SSR + SSE.

Coefficient of determination
R² measures the proportion of the variation in y
that is explained by the variation in x:

R² = 1 - SSE / Σ(yi - ȳ)²  =  [ Σ(yi - ȳ)² - SSE ] / Σ(yi - ȳ)²  =  SSR / Σ(yi - ȳ)²

R² takes on any value between zero and one.
R² = 1: perfect match between the line and the data points.
R² = 0: there is no linear relationship between x and y.

Coefficient of determination, Example

Example 17.5
Find the coefficient of determination for Example 17.2;
what does this statistic tell you about the model?

Solution
Solving by hand:

R² = cov(X, Y)² / ( sx² sy² ) = (-2,712,511)² / [ (43,528,690)(259,996) ] = .6501
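The computation in Python, using the statistics from Example 17.2; both forms of the formula give the same value:

```python
n = 100
cov_xy = -2_712_511    # sample covariance of x and y
s2_x = 43_528_690      # sample variance of x
s2_y = 259_996         # sample variance of y

r2_from_cov = cov_xy**2 / (s2_x * s2_y)       # cov(X,Y)^2 / (sx^2 * sy^2)

sse = (n - 1) * (s2_y - cov_xy**2 / s2_x)     # shortcut SSE from earlier
total_ss = (n - 1) * s2_y                     # sum of (yi - ybar)^2
r2_from_sse = 1 - sse / total_ss              # 1 - SSE / total variation

print(round(r2_from_cov, 4), round(r2_from_sse, 4))    # both about 0.6501
```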

Coefficient of determination
Using the computer
From the regression output we have:

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.8063
R Square              0.6501
Adjusted R Square     0.6466
Standard Error        303.1
Observations          100

ANOVA
              df    SS          MS          F         Significance F
Regression     1    16734111    16734111    182.11    0.0000
Residual      98    9005450     91892
Total         99    25739561

              Coefficients    Standard Error    t Stat    P-value
Intercept     17067           169               100.97    0.0000
Odometer      -0.0623         0.0046            -13.49    0.0000

65% of the variation in the auction
selling price is explained by the
variation in odometer reading. The
rest (35%) remains unexplained by
this model.

17.6 Finance Application: Market Model


One of the most important applications of linear
regression is the market model.
It is assumed that the rate of return on a stock (R) is
linearly related to the rate of return on the overall
market:

R = β0 + β1Rm + ε

where R is the rate of return on a particular stock and
Rm is the rate of return on some major stock index.

The beta coefficient (β1) measures how sensitive the stock's rate
of return is to changes in the level of the overall market.
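A minimal sketch of estimating a stock's beta in Python; the monthly return series below are simulated placeholders, not the Xm17-06 data:

```python
import numpy as np

rng = np.random.default_rng(7)
market_ret = rng.normal(0.01, 0.04, size=60)                     # 60 monthly market returns
stock_ret = 0.01 + 0.9 * market_ret + rng.normal(0, 0.06, 60)    # stock follows the market model

# Least squares slope = cov(stock, market) / var(market) = the stock's beta
beta = np.cov(market_ret, stock_ret, ddof=1)[0, 1] / np.var(market_ret, ddof=1)
alpha = stock_ret.mean() - beta * market_ret.mean()
print(round(alpha, 4), round(beta, 4))    # estimates of the intercept and the beta coefficient
```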

The Market Model, Example

Example 17.6 (Xm17-06)
Estimate the market model for Nortel, a stock traded
on the Toronto Stock Exchange (TSE).
The data consist of monthly percentage returns for
Nortel and monthly percentage returns for all the
stocks on the TSE.

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.5601
R Square              0.3137
Adjusted R Square     0.3019
Standard Error        0.0631
Observations          60

ANOVA
              df    SS          MS          F        Significance F
Regression     1    0.10563     0.10563     26.51    0.0000
Residual      58    0.231105    0.003985
Total         59    0.336734

              Coefficients    Standard Error    t Stat    P-value
Intercept     0.0128          0.0082            1.56      0.1245
TSE           0.8877          0.1724            5.15      0.0000

The slope coefficient (.8877) is a measure of the stock's
market-related risk: in this sample, for each 1% increase in the
TSE return, the average increase in Nortel's return is .8877%.

R Square (.3137) is a measure of the total market-related risk
embedded in the Nortel stock: specifically, 31.37% of the variation
in Nortel's return is explained by the variation in the TSE's
returns.

Using the Regression Equation


Before using the regression model, we need to
assess how well it fits the data.
If we are satisfied with how well the model fits
the data, we can use it to predict the values of y.
To make a prediction we use
Point prediction, and
Interval prediction

Point Prediction
Example 17.7
Predict the selling price of a three-year-old Taurus
with 40,000 miles on the odometer (Example 17.2).
A point prediction:

ŷ = 17,067 - .0623x = 17,067 - .0623(40,000) = 14,575

It is predicted that a car with 40,000 miles on the odometer
would sell for $14,575.
How close is this prediction to the real price?
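A quick check of the arithmetic in Python:

```python
b0, b1 = 17_067, -0.0623
y_hat = b0 + b1 * 40_000    # point prediction at x = 40,000 miles
print(y_hat)                # 14575.0
```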

Interval Estimates
Two intervals can be used to discover how closely the
predicted value will match the true value of y.
Prediction interval predicts y for a given value of x,
Confidence interval estimates the average y for a given x.

The prediction interval:

ŷ ± t(α/2) sε √( 1 + 1/n + (xg - x̄)² / ( (n - 1) sx² ) )

The confidence interval:

ŷ ± t(α/2) sε √( 1/n + (xg - x̄)² / ( (n - 1) sx² ) )

Interval Estimates, Example
Example 17.7 - continued
Provide an interval estimate for the bidding price on
a Ford Taurus with 40,000 miles on the odometer.
Two types of predictions are required:
A prediction for a specific car
An estimate for the average price per car


Interval Estimates, Example

Solution
A prediction interval provides the price estimate for a
single car:

ŷ ± t(α/2) sε √( 1 + 1/n + (xg - x̄)² / ( (n - 1) sx² ) )

= [17,067 - .0623(40,000)] ± 1.984(303.1) √( 1 + 1/100 + (40,000 - 36,009)² / ( (100 - 1)(43,528,690) ) )

= 14,575 ± 605

(t.025,98 ≈ 1.984)

Interval Estimates, Example

Solution continued
A confidence interval provides the estimate of the
mean price per car for a Ford Taurus with 40,000
miles on the odometer.

The 95% confidence interval:

ŷ ± t(α/2) sε √( 1/n + (xg - x̄)² / ( (n - 1) sx² ) )

= [17,067 - .0623(40,000)] ± 1.984(303.1) √( 1/100 + (40,000 - 36,009)² / ( (100 - 1)(43,528,690) ) )

= 14,575 ± 70
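Both intervals in Python, using the statistics from Example 17.2 and the rounded critical value 1.984 from the slides:

```python
from math import sqrt

n = 100
b0, b1 = 17_067, -0.0623
x_bar, s2_x = 36_009.45, 43_528_690
s_e = 303.1
t_crit = 1.984    # approximate t.025 with 98 degrees of freedom
x_g = 40_000      # the given odometer reading

y_hat = b0 + b1 * x_g
core = (x_g - x_bar) ** 2 / ((n - 1) * s2_x)

pred_half = t_crit * s_e * sqrt(1 + 1 / n + core)    # prediction interval half-width
conf_half = t_crit * s_e * sqrt(1 / n + core)        # confidence interval half-width

print(round(y_hat), round(pred_half), round(conf_half))    # 14575, 605, 70
```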

The effect of the given xg on the length of the interval

As xg moves away from x̄ the interval becomes
longer. That is, the shortest interval is found at x̄.

ŷ = b0 + b1xg

ŷ ± t(α/2) sε √( 1/n + (xg - x̄)² / ( (n - 1) sx² ) )

The effect of the given xg on the length of the interval

As xg moves away from x̄ the interval becomes
longer. That is, the shortest interval is found at x̄.

ŷ = b0 + b1xg

For xg = x̄ + 1 or xg = x̄ - 1, (xg - x̄)² = 1², so both intervals are

ŷ ± t(α/2) sε √( 1/n + 1² / ( (n - 1) sx² ) )

The effect of the given xg on the length of the interval

As xg moves away from x̄ the interval becomes longer. That is, the shortest
interval is found at x̄.

ŷ = b0 + b1xg

For xg = x̄ + 2 or xg = x̄ - 2, (xg - x̄)² = 2², so both intervals are

ŷ ± t(α/2) sε √( 1/n + 2² / ( (n - 1) sx² ) )

which is wider than the interval at xg = x̄ ± 1.

Coefficient of Correlation
The coefficient of correlation is used to measure the
strength of association between two variables.
The coefficient values range between -1 and 1.
If r = -1 (negative association) or r = +1 (positive
association), every point falls on the regression line.
If r = 0, there is no linear pattern.

The coefficient can be used to test for a linear
relationship between two variables.

Testing the coefficient of correlation


To test the coefficient of correlation for a linear
relationship between X and Y:
X and Y must be observational, and
X and Y must be bivariate normally distributed.

[Figure: The bivariate normal distribution of X and Y.]

Testing the coefficient of correlation


When no linear relationship exists between the two
variables, ρ = 0.
The hypotheses are:
H0: ρ = 0
H1: ρ ≠ 0

The test statistic is:

t = r √( (n - 2) / (1 - r²) )

where r is the sample coefficient of correlation,
calculated by r = cov(X, Y) / ( sx sy ).

The statistic has a Student t distribution with d.f. = n - 2,
provided the variables are bivariate normally distributed.

Testing the Coefficient of correlation


Foreign Index Funds (Index)
A certain investor prefers to invest in index
mutual funds constructed by buying a wide
assortment of stocks.
The investor decides to avoid investing in a
Japanese index fund if it is strongly correlated with
an American index fund that he owns.
From the data shown in Index.xls, should he avoid
the investment in the Japanese index fund?

Testing the Coefficient of Correlation, Example

Solution
Problem objective: analyze the relationship between two
interval variables.
The two variables are observational (the return for
each fund was not controlled).
We are interested in whether there is a linear
relationship between the two variables; thus, we
need to test the coefficient of correlation.

Testing the Coefficient of Correlation, Example

Solution continued
The hypotheses:
H0: ρ = 0
H1: ρ ≠ 0

The sample coefficient of correlation:
cov(X, Y) = .001279;  sx = .0509;  sy = .0512
r = cov(X, Y) / ( sx sy ) = .491

Solving by hand, the rejection region is
|t| > tα/2,n-2 = t.025,59-2 ≈ 2.000.

The value of the t statistic is

t = r √( (n - 2) / (1 - r²) ) = 4.26

Conclusion: there is sufficient evidence at α = 5% to infer that
there is a linear relationship between the two variables.
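The same test in Python, using the sample statistics quoted above:

```python
from math import sqrt

n = 59
cov_xy, s_x, s_y = 0.001279, 0.0509, 0.0512

r = cov_xy / (s_x * s_y)                   # sample coefficient of correlation
t_stat = r * sqrt((n - 2) / (1 - r**2))    # test statistic for H0: rho = 0

print(round(r, 3), round(t_stat, 2))
# r is about .491 and t about 4.25 (the slide reports 4.26 from the unrounded r);
# |t| > 2.000, so H0 is rejected
```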

Testing the Coefficient of Correlation, Example

Excel solution (Index)

                    US Index    Japanese Index
US Index            1
Japanese Index      0.4911      1
