
Simple Linear Regression

Introduction
In this chapter we examine the relationship between
two interval variables via a mathematical equation.
The motivation for using the technique:
Forecast the value of a dependent variable (y) from
the values of independent variables (x1, x2, ..., xk).
Analyze the specific relationships between the
independent variables and the dependent variable.

The Model
The model has a deterministic and a probabilistic component.
[Figure: House cost plotted against house size. Most lots sell for $25,000, and building a house costs about $75 per square foot, so House cost = 25,000 + 75(Size).]

The Model
However, house costs vary even among same-size houses!
Since cost behaves unpredictably, we add a random component.

[Figure: The same plot, with house costs scattered around the line House cost = 25,000 + 75(Size).]

The Model
The first-order linear model:

y = β0 + β1x + ε

y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line
ε = error variable

β0 and β1 are unknown population parameters and are
therefore estimated from the data.

[Figure: The line y = β0 + β1x; β0 is the y-intercept, and the slope β1 equals Rise/Run.]
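To make the deterministic and random components concrete, here is a minimal simulation sketch in Python of the house-cost example above (the intercept 25,000 and slope 75 come from the slides; the error standard deviation and the range of house sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = 25_000, 75    # deterministic component: cost = 25,000 + 75 * size
sigma = 10_000               # assumed standard deviation of the error term (illustrative)

size = rng.uniform(1_000, 3_000, size=200)    # house sizes in square feet (illustrative)
epsilon = rng.normal(0, sigma, size=200)      # random component
cost = beta0 + beta1 * size + epsilon         # first-order linear model y = b0 + b1*x + error

print(cost[:3].round(0))    # same-size houses get different costs because of the error term
```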

Estimating the Coefficients


The estimates are determined by
drawing a sample from the population of interest,
calculating sample statistics, and
producing a straight line that cuts into the data.

Question: What should be considered a good line?

The Least Squares (Regression) Line


A good line is one that minimizes
the sum of squared differences between the
points and the line.

The Least Squares (Regression) Line


Let us compare two lines drawn through the points (1,2), (2,4), (3,1.5), and (4,3.2).
The first line has predicted values 1, 2, 3, and 4 at x = 1, 2, 3, 4; the second line is horizontal at y = 2.5.

Sum of squared differences (first line) = (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89
Sum of squared differences (horizontal line) = (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99

The smaller the sum of squared differences,
the better the fit of the line to the data.
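A quick sketch of this comparison in Python (the points are taken from the slide; the first line is taken to be y = x, consistent with the predicted values 1, 2, 3, 4 implied by the differences above):

```python
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(predict, pts):
    """Sum of squared differences between observed y and the line's prediction."""
    return sum((y - predict(x)) ** 2 for x, y in pts)

line1 = lambda x: x      # the tilted line: predicts 1, 2, 3, 4
line2 = lambda x: 2.5    # the horizontal line at y = 2.5

print(round(sum_sq_diff(line1, points), 2))   # 7.89
print(round(sum_sq_diff(line2, points), 2))   # 3.99 -> the horizontal line fits better here
```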

The Estimated Coefficients


To calculate the estimates of the line
coefficients that minimize the differences
between the data points and the line, use
the formulas:

b1 = cov(X, Y) / sx²

b0 = ȳ - b1 x̄

The regression equation that estimates
the equation of the first-order linear model
is:

ŷ = b0 + b1x
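A minimal Python sketch of these two formulas (the small x and y arrays are illustrative, not data from the textbook):

```python
import numpy as np

def least_squares_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    cov_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance of x and y
    s2_x = np.var(x, ddof=1)              # sample variance of x
    b1 = cov_xy / s2_x
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0])        # illustrative data
y = np.array([2.0, 4.0, 1.5, 3.2])
b0, b1 = least_squares_line(x, y)
print(round(b0, 3), round(b1, 3))
```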

The Simple Linear Regression Line


Example 17.2 (Xm17-02)
A car dealer wants to find
the relationship between
the odometer reading and
the selling price of used cars.
A random sample of 100 cars
is selected, and the data
recorded.
Find the regression line.

The independent variable x is the odometer reading;
the dependent variable y is the selling price.

The Simple Linear Regression Line


Solution
Solving by hand: calculate a number of statistics.

x̄ = 36,009.45;    sx² = Σ(xi - x̄)² / (n - 1) = 43,528,690

ȳ = 14,822.82;    cov(X, Y) = Σ(xi - x̄)(yi - ȳ) / (n - 1) = -2,712,511

where n = 100.

b1 = cov(X, Y) / sx² = -2,712,511 / 43,528,690 = -.06232

b0 = ȳ - b1 x̄ = 14,822.82 - (-.06232)(36,009.45) = 17,067

ŷ = b0 + b1x = 17,067 - .0623x

The Simple Linear Regression Line


Solution continued
Using the computer (Xm17-02)
Tools > Data Analysis > Regression >
[Shade the y range and the x range] > OK

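Outside Excel, comparable output can be produced with a regression library. Below is a sketch in Python using statsmodels; the odometer and price arrays are simulated stand-ins for the Xm17-02 worksheet, not the actual data:

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in for the Xm17-02 data: 100 odometer readings and prices
rng = np.random.default_rng(1)
odometer = rng.uniform(19_000, 50_000, size=100)
price = 17_067 - 0.0623 * odometer + rng.normal(0, 303, size=100)

X = sm.add_constant(odometer)    # add the intercept column
model = sm.OLS(price, X).fit()
print(model.params)              # intercept and slope, comparable to b0 and b1
print(model.rsquared)            # comparable to Excel's R Square
# model.summary() prints a table comparable to Excel's regression output
```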

The Simple Linear Regression Line


Xm17-02

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.8063
R Square              0.6501
Adjusted R Square     0.6466
Standard Error        303.1
Observations          100

ŷ = 17,067 - .0623x

ANOVA
              df    SS          MS          F         Significance F
Regression     1    16734111    16734111    182.11    0.0000
Residual      98    9005450     91892
Total         99    25739561

              Coefficients    Standard Error    t Stat    P-value
Intercept     17067           169               100.97    0.0000
Odometer      -0.0623         0.0046            -13.49    0.0000

Interpreting the Linear Regression Equation

[Figure: Odometer Line Fit Plot - Price against Odometer with the fitted line ŷ = 17,067 - .0623x; there are no data near Odometer = 0.]

The intercept is b0 = $17,067. Do not interpret the intercept as the
price of cars that have not been driven, because there are no data
in that region.

The slope is b1 = -.0623. For each additional mile on the odometer,
the price decreases by an average of $0.0623 (6.23 cents).

Error Variable: Required Conditions


The error ε is a critical part of the regression model.
Four requirements involving the distribution of ε must
be satisfied.

The probability distribution of ε is normal.
The mean of ε is zero: E(ε) = 0.
The standard deviation of ε is σε, which is constant for all values of x.
The errors associated with different values of y are all
independent.

The Normality of ε

[Figure: At x1, x2, and x3 the distribution of y is a normal curve centered on the line, with means E(y|x1) = β0 + β1x1, E(y|x2) = β0 + β1x2, and E(y|x3) = β0 + β1x3. The standard deviation remains constant, but the mean value changes with x.]

From the first three assumptions we have:
y is normally distributed with mean
E(y) = β0 + β1x and a constant standard
deviation σε.

Assessing the Model


The least squares method will produce a
regression line whether or not there is a linear
relationship between x and y.
Consequently, it is important to assess how well
the linear model fits the data.
Several methods are used to assess the model.
All are based on the sum of squares for errors,
SSE.

Sum of Squares for Errors


This is the sum of squared differences between the points and
the regression line.
It can serve as a measure of how well the line fits the
data. SSE is defined by

SSE = Σ (yi - ŷi)²,  i = 1, ..., n.

A shortcut formula:

SSE = (n - 1) [ sy² - cov(X, Y)² / sx² ]
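A small Python check that the shortcut formula agrees with the direct definition (illustrative data; ŷ comes from the least squares line):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])    # illustrative data
y = np.array([2.0, 4.0, 1.5, 3.2])
n = len(x)

cov_xy = np.cov(x, y, ddof=1)[0, 1]
s2_x, s2_y = np.var(x, ddof=1), np.var(y, ddof=1)
b1 = cov_xy / s2_x
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
sse_direct = np.sum((y - y_hat) ** 2)                  # definition of SSE
sse_shortcut = (n - 1) * (s2_y - cov_xy**2 / s2_x)     # shortcut formula
print(round(sse_direct, 4), round(sse_shortcut, 4))    # the two values agree
```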

Standard Error of Estimate


The mean error is equal to zero.
If σε is small, the errors tend to be close to zero
(close to the mean error), and the model fits the
data well.
Therefore, we can use σε as a measure of the
suitability of using a linear model.
An estimator of σε is given by sε, the standard error of estimate:

sε = √( SSE / (n - 2) )

Standard Error of Estimate, Example

Example 17.3
Calculate the standard error of estimate for Example 17.2,
and describe what it tells you about the model fit.

Solution

sy² = Σ(yi - ȳ)² / (n - 1) = 259,996    (calculated earlier)

SSE = (n - 1) [ sy² - cov(X, Y)² / sx² ]
    = 99 [ 259,996 - (-2,712,511)² / 43,528,690 ] = 9,005,450

sε = √( SSE / (n - 2) ) = √( 9,005,450 / 98 ) = 303.13

It is hard to assess the model based on sε alone, even when
compared with the mean value of y:
sε = 303.1 while ȳ = 14,823.
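The same arithmetic as a Python sketch (all numbers taken from Examples 17.2 and 17.3; small rounding differences are expected):

```python
from math import sqrt

n = 100
s2_y = 259_996         # sample variance of y
cov_xy = -2_712_511    # sample covariance of x and y
s2_x = 43_528_690      # sample variance of x

sse = (n - 1) * (s2_y - cov_xy**2 / s2_x)    # shortcut formula for SSE
s_e = sqrt(sse / (n - 2))                    # standard error of estimate

print(round(sse))       # about 9,005,450
print(round(s_e, 2))    # about 303.1
```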

Testing the slope


When no linear relationship exists between two
variables, the regression line should be horizontal.

Linear relationship: different inputs (x) yield
different outputs (y); the slope is not equal to zero.

No linear relationship: different inputs (x) yield
the same output (y); the slope is equal to zero.

Testing the Slope


We can draw inferences about β1 from b1 by testing
H0: β1 = 0
H1: β1 ≠ 0 (or < 0, or > 0)

The test statistic is

t = (b1 - β1) / sb1

where sb1 = sε / √( (n - 1) sx² ) is the standard error of b1.

If the error variable is normally distributed, the statistic
has a Student t distribution with d.f. = n - 2.

Testing the Slope, Example

Example 17.4
Test to determine whether there is enough evidence
to infer that there is a linear relationship between the
car auction price and the odometer reading for all
three-year-old Tauruses in Example 17.2.
Use α = 5%.

Testing the Slope, Example

Solving by hand
To compute t we need the values of b1 and sb1.

b1 = -.0623

sb1 = sε / √( (n - 1) sx² ) = 303.1 / √( (99)(43,528,690) ) = .00462

t = (b1 - β1) / sb1 = (-.0623 - 0) / .00462 = -13.49

The rejection region is t > t.025 or t < -t.025 with d.f. = n - 2 = 98.
Approximately, t.025 = 1.984.
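The test in Python, using the summary statistics quoted above:

```python
from math import sqrt

n = 100
b1 = -0.0623
s_e = 303.1          # standard error of estimate
s2_x = 43_528_690    # sample variance of x
t_crit = 1.984       # approximate t.025 with 98 degrees of freedom

s_b1 = s_e / sqrt((n - 1) * s2_x)    # standard error of b1
t_stat = (b1 - 0) / s_b1             # test statistic for H0: beta1 = 0

print(round(s_b1, 5), round(t_stat, 2))
# |t| = 13.49 far exceeds 1.984, so H0 is rejected
```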

Testing the Slope, Example

Using the computer (Xm17-02)

Price     Odometer
14636     37388
14122     44758
14016     45833
15590     30862
15568     31705
14718     34010
14470     45854
15690     19057
15072     40149
14802     40237
15190     32359
14660     43533
15612     32744
15610     34470
14634     37720
14632     41350
15740     24469
(excerpt of the 100 observations)

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.8063
R Square              0.6501
Adjusted R Square     0.6466
Standard Error        303.1
Observations          100

ANOVA
              df    SS          MS          F         Significance F
Regression     1    16734111    16734111    182.11    0.0000
Residual      98    9005450     91892
Total         99    25739561

              Coefficients    Standard Error    t Stat    P-value
Intercept     17067           169               100.97    0.0000
Odometer      -0.0623         0.0046            -13.49    0.0000

There is overwhelming evidence to infer
that the odometer reading affects the
auction selling price.

Coefficient of determination
To measure the strength of the linear relationship we
use the coefficient of determination.

R² = cov(X, Y)² / ( sx² sy² )

or

R² = 1 - SSE / Σ(yi - ȳ)²

Coefficient of determination
To understand the significance of this coefficient, note that the
overall variability in y is explained in part by the regression
model and remains, in part, unexplained (the error).

Coefficient of determination
Two data points (x1, y1) and (x2, y2)
of a certain sample are shown.

Total variation in y =
Variation explained by the regression line
+ Unexplained variation (error):

(y1 - ȳ)² + (y2 - ȳ)²  =  (ŷ1 - ȳ)² + (ŷ2 - ȳ)²  +  (y1 - ŷ1)² + (y2 - ŷ2)²

That is, Variation in y = SSR + SSE.

Coefficient of determination
R² measures the proportion of the variation in y
that is explained by the variation in x:

R² = 1 - SSE / Σ(yi - ȳ)²  =  [ Σ(yi - ȳ)² - SSE ] / Σ(yi - ȳ)²  =  SSR / Σ(yi - ȳ)²

R² takes on any value between zero and one.
R² = 1: perfect match between the line and the data points.
R² = 0: there is no linear relationship between x and y.

Coefficient of determination, Example

Example 17.5
Find the coefficient of determination for Example 17.2;
what does this statistic tell you about the model?

Solution
Solving by hand:

R² = cov(X, Y)² / ( sx² sy² ) = (-2,712,511)² / [ (43,528,690)(259,996) ] = .6501
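The computation in Python, using the statistics from Example 17.2; both forms of the formula give the same value:

```python
n = 100
cov_xy = -2_712_511    # sample covariance of x and y
s2_x = 43_528_690      # sample variance of x
s2_y = 259_996         # sample variance of y

r2_from_cov = cov_xy**2 / (s2_x * s2_y)       # cov(X,Y)^2 / (sx^2 * sy^2)

sse = (n - 1) * (s2_y - cov_xy**2 / s2_x)     # shortcut SSE from earlier
total_ss = (n - 1) * s2_y                     # sum of (yi - ybar)^2
r2_from_sse = 1 - sse / total_ss              # 1 - SSE / total variation

print(round(r2_from_cov, 4), round(r2_from_sse, 4))    # both about 0.6501
```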

Coefficient of determination
Using the computer
From the regression output we have:

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.8063
R Square              0.6501
Adjusted R Square     0.6466
Standard Error        303.1
Observations          100

ANOVA
              df    SS          MS          F         Significance F
Regression     1    16734111    16734111    182.11    0.0000
Residual      98    9005450     91892
Total         99    25739561

              Coefficients    Standard Error    t Stat    P-value
Intercept     17067           169               100.97    0.0000
Odometer      -0.0623         0.0046            -13.49    0.0000

65% of the variation in the auction
selling price is explained by the
variation in odometer reading. The
rest (35%) remains unexplained by
this model.

17.6 Finance Application: Market Model


One of the most important applications of linear
regression is the market model.
It is assumed that the rate of return on a stock (R) is
linearly related to the rate of return on the overall
market:

R = β0 + β1Rm + ε

where R is the rate of return on a particular stock and
Rm is the rate of return on some major stock index.

The beta coefficient (β1) measures how sensitive the stock's rate
of return is to changes in the level of the overall market.
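A minimal sketch of estimating a stock's beta in Python; the monthly return series below are simulated placeholders, not the Xm17-06 data:

```python
import numpy as np

rng = np.random.default_rng(7)
market_ret = rng.normal(0.01, 0.04, size=60)                     # 60 monthly market returns
stock_ret = 0.01 + 0.9 * market_ret + rng.normal(0, 0.06, 60)    # stock follows the market model

# Least squares slope = cov(stock, market) / var(market) = the stock's beta
beta = np.cov(market_ret, stock_ret, ddof=1)[0, 1] / np.var(market_ret, ddof=1)
alpha = stock_ret.mean() - beta * market_ret.mean()
print(round(alpha, 4), round(beta, 4))    # estimates of the intercept and the beta coefficient
```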

The Market Model, Example

Example 17.6 (Xm17-06)
Estimate the market model for Nortel, a stock traded
on the Toronto Stock Exchange (TSE).
The data consist of monthly percentage returns for
Nortel and monthly percentage returns for all the
stocks on the TSE.

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.5601
R Square              0.3137
Adjusted R Square     0.3019
Standard Error        0.0631
Observations          60

ANOVA
              df    SS          MS          F        Significance F
Regression     1    0.10563     0.10563     26.51    0.0000
Residual      58    0.231105    0.003985
Total         59    0.336734

              Coefficients    Standard Error    t Stat    P-value
Intercept     0.0128          0.0082            1.56      0.1245
TSE           0.8877          0.1724            5.15      0.0000

The slope coefficient (.8877) is a measure of the stock's
market-related risk: in this sample, for each 1% increase in the
TSE return, the average increase in Nortel's return is .8877%.

R Square (.3137) is a measure of the total market-related risk
embedded in the Nortel stock: specifically, 31.37% of the variation
in Nortel's return is explained by the variation in the TSE's
returns.

Using the Regression Equation


Before using the regression model, we need to
assess how well it fits the data.
If we are satisfied with how well the model fits
the data, we can use it to predict the values of y.
To make a prediction we use
Point prediction, and
Interval prediction

Point Prediction
Example 17.7
Predict the selling price of a three-year-old Taurus
with 40,000 miles on the odometer (Example 17.2).
A point prediction:

ŷ = 17,067 - .0623x = 17,067 - .0623(40,000) = 14,575

It is predicted that a car with 40,000 miles on the odometer
would sell for $14,575.
How close is this prediction to the real price?
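A quick check of the arithmetic in Python:

```python
b0, b1 = 17_067, -0.0623
y_hat = b0 + b1 * 40_000    # point prediction at x = 40,000 miles
print(y_hat)                # 14575.0
```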

Interval Estimates
Two intervals can be used to discover how closely the
predicted value will match the true value of y.
Prediction interval predicts y for a given value of x,
Confidence interval estimates the average y for a given x.

The prediction interval:

ŷ ± t(α/2) sε √( 1 + 1/n + (xg - x̄)² / ( (n - 1) sx² ) )

The confidence interval:

ŷ ± t(α/2) sε √( 1/n + (xg - x̄)² / ( (n - 1) sx² ) )

Interval Estimates, Example
Example 17.7 - continued
Provide an interval estimate for the bidding price on
a Ford Taurus with 40,000 miles on the odometer.
Two types of predictions are required:
A prediction for a specific car
An estimate for the average price per car


Interval Estimates, Example

Solution
A prediction interval provides the price estimate for a
single car:

ŷ ± t(α/2) sε √( 1 + 1/n + (xg - x̄)² / ( (n - 1) sx² ) )

= [17,067 - .0623(40,000)] ± 1.984(303.1) √( 1 + 1/100 + (40,000 - 36,009)² / ( (100 - 1)(43,528,690) ) )

= 14,575 ± 605

(t.025,98 ≈ 1.984)

Interval Estimates, Example

Solution continued
A confidence interval provides the estimate of the
mean price per car for a Ford Taurus with 40,000
miles on the odometer.

The 95% confidence interval:

ŷ ± t(α/2) sε √( 1/n + (xg - x̄)² / ( (n - 1) sx² ) )

= [17,067 - .0623(40,000)] ± 1.984(303.1) √( 1/100 + (40,000 - 36,009)² / ( (100 - 1)(43,528,690) ) )

= 14,575 ± 70
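Both intervals in Python, using the statistics from Example 17.2 and the rounded critical value 1.984 from the slides:

```python
from math import sqrt

n = 100
b0, b1 = 17_067, -0.0623
x_bar, s2_x = 36_009.45, 43_528_690
s_e = 303.1
t_crit = 1.984    # approximate t.025 with 98 degrees of freedom
x_g = 40_000      # the given odometer reading

y_hat = b0 + b1 * x_g
core = (x_g - x_bar) ** 2 / ((n - 1) * s2_x)

pred_half = t_crit * s_e * sqrt(1 + 1 / n + core)    # prediction interval half-width
conf_half = t_crit * s_e * sqrt(1 / n + core)        # confidence interval half-width

print(round(y_hat), round(pred_half), round(conf_half))    # 14575, 605, 70
```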

The effect of the given xg on the length of the interval

As xg moves away from x̄ the interval becomes
longer. That is, the shortest interval is found at x̄.

ŷ = b0 + b1xg

ŷ ± t(α/2) sε √( 1/n + (xg - x̄)² / ( (n - 1) sx² ) )

The effect of the given xg on the length of the interval

As xg moves away from x̄ the interval becomes
longer. That is, the shortest interval is found at x̄.

ŷ = b0 + b1xg

For xg = x̄ + 1 or xg = x̄ - 1, (xg - x̄)² = 1², so both intervals are

ŷ ± t(α/2) sε √( 1/n + 1² / ( (n - 1) sx² ) )

The effect of the given xg on the length of the interval

As xg moves away from x̄ the interval becomes longer. That is, the shortest
interval is found at x̄.

ŷ = b0 + b1xg

For xg = x̄ + 2 or xg = x̄ - 2, (xg - x̄)² = 2², so both intervals are

ŷ ± t(α/2) sε √( 1/n + 2² / ( (n - 1) sx² ) )

which is wider than the interval at xg = x̄ ± 1.

Coefficient of Correlation
The coefficient of correlation is used to measure the
strength of association between two variables.
The coefficient values range between -1 and 1.
If r = -1 (negative association) or r = +1 (positive
association), every point falls on the regression line.
If r = 0, there is no linear pattern.

The coefficient can be used to test for a linear
relationship between two variables.

Testing the coefficient of correlation


To test the coefficient of correlation for a linear
relationship between X and Y:
X and Y must be observational, and
X and Y must be bivariate normally distributed.

[Figure: The bivariate normal distribution of X and Y.]

Testing the coefficient of correlation


When no linear relationship exists between the two
variables, ρ = 0.
The hypotheses are:
H0: ρ = 0
H1: ρ ≠ 0

The test statistic is:

t = r √( (n - 2) / (1 - r²) )

where r is the sample coefficient of correlation,
calculated by r = cov(X, Y) / ( sx sy ).

The statistic has a Student t distribution with d.f. = n - 2,
provided the variables are bivariate normally distributed.

Testing the Coefficient of correlation


Foreign Index Funds (Index)
A certain investor prefers to invest in index
mutual funds constructed by buying a wide
assortment of stocks.
The investor decides to avoid investing in a
Japanese index fund if it is strongly correlated with
an American index fund that he owns.
From the data shown in Index.xls, should he avoid
the investment in the Japanese index fund?

Testing the Coefficient of Correlation, Example

Solution
Problem objective: analyze the relationship between two
interval variables.
The two variables are observational (the return for
each fund was not controlled).
We are interested in whether there is a linear
relationship between the two variables; thus, we
need to test the coefficient of correlation.

Testing the Coefficient of Correlation, Example

Solution continued
The hypotheses:
H0: ρ = 0
H1: ρ ≠ 0

The sample coefficient of correlation:
cov(X, Y) = .001279;  sx = .0509;  sy = .0512
r = cov(X, Y) / ( sx sy ) = .491

Solving by hand, the rejection region is
|t| > tα/2,n-2 = t.025,59-2 ≈ 2.000.

The value of the t statistic is

t = r √( (n - 2) / (1 - r²) ) = 4.26

Conclusion: there is sufficient evidence at α = 5% to infer that
there is a linear relationship between the two variables.
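The same test in Python, using the sample statistics quoted above:

```python
from math import sqrt

n = 59
cov_xy, s_x, s_y = 0.001279, 0.0509, 0.0512

r = cov_xy / (s_x * s_y)                   # sample coefficient of correlation
t_stat = r * sqrt((n - 2) / (1 - r**2))    # test statistic for H0: rho = 0

print(round(r, 3), round(t_stat, 2))
# r is about .491 and t about 4.25 (the slide reports 4.26 from the unrounded r);
# |t| > 2.000, so H0 is rejected
```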

Testing the Coefficient of Correlation, Example

Excel solution (Index)

                    US Index    Japanese Index
US Index            1
Japanese Index      0.4911      1
