You are on page 1of 62

Chapter 13

Simple Linear Regression

David Chow
Nov 2014

Learning Objectives
Understand the simple linear regression model
Model assumptions
Meaning of the coefficients b0 and b1

Predict the value of a dependent variable using


regression results
Make inferences about the coefficients
Estimate mean values and predict individual values

Chap 13-2
2

Overview

This PPT
Linear regression model
Estimate regression line
Measures of variation
Residual analysis
Statistical inference

Testing & interval estimate

Excel file
Eg: House price

Scatter plot
Reg: standard output
Reg: residuals

Data for review question


Answer

What We Know
1.

Equation of a straight line

2.

Scatter plot

Linear / non-linear relationship

Strong / weak relationship

Correlation: ____ relationship

3.

between two variables

Correlation doesnt imply casual relationship

4.

Eg: No. of firefighters and damage (in $)

Hypothesis testing

Chap 13-4

Regression Analysis
Dependent variable (y): the variable we wish to
predict or explain
Independent variable (x): the variable used to
predict or explain y

Regression analysis is used to:

Explain the impact of changes in x on the dependent variable y

Predict the value of y based on the value of x(s)

Linear Regression:

Only one x

Linear relationship between x and y

Chap 13-5

Examples
Y=___, X=___
Useful tip:
A visual check BEFORE running
regression is always helpful!

Linear Regression Model

Linear Regression Model


Population
Y-intercept

Dependent
Variable

Population
Slope
Coefficient

Independent
Variable

Yi 0 1Xi i
Linear component

Random
Error term

Random error component

Assumptions
1. X and Y are linearly related
2. 0 and 1 are population parameters
3. Given an X-value (say, Xi), Y= Yi is random, because of
4. Assume E()=0, so that E(Y)=____
Chap 13-8

Linear Regression Model


Y

Yi 0 1Xi i

Population Regression
line: E(Y) = 0 +1 X

Observed Value
of Y for Xi

i
Predicted Value
of Y for Xi
Intercept = 0

Slope = 1
Random Error
for this Xi value

Xi

X
Chap 13-9

Model Assumptions: LINE

Linearity

Independence of errors

Error values are statistically independent

Normality of error

The relationship between X and Y is linear

Error values are normally distributed for any given value of X

Equal variance (also called homoscedasticity)

The probability distribution of the errors has constant variance

Assumption on error terms:


~ N (0, 2)
See the appendix for a graphical presentation
Chap 13-10

Estimate the Regression


Line and Predict Y Values

Est. Regression Equation


(Prediction Line)
The regression equation provides an
estimate of the population regression line
Estimated (or
predicted) Y value
for observation i

Estimate of
intercept

Estimate of slope

Yi b0 b1Xi

Value of X for
observation i

Chap 13-12

Least Squares Method


(The Best Fitted Line)
b0 and b1 are obtained by finding the values of
that minimize the sum of the squared
differences between Y and Y :
2
2

min (Yi Yi ) min (Yi (b0 b1Xi ))

The computations (to find b0 and b1) are done by Excel

Formulae for b0 and b1 are in the appendix

Chap 13-13

Interpreting b0 and b1

b0 is the estimated mean value of


Y when the value of X is zero

b1 is the estimated change in the


mean value of Y as a result of a
one-unit increase in X

Chap 13-14

Eg: House Price


(in Excel File)

A real estate agent wishes to examine the relationship


between the selling price of a home and its size
(measured in square feet)

A random sample of 10 houses is selected


Dependent var (Y) = house price (in $1000s)
Independent var (X) = square feet

Chap 13-15

Eg: House Price Data & Scatter Plot


House Price in
$1000s (Y)

Square Feet
(X)

245

1400

312

1600

279

1700

308

1875

House Price ($1000s)

450
400

350
300
250
200
150
100

50
0
0

500

1000

1500

2000

2500

3000

Square Feet

199

1100

219

1550

405

2350

324

2450

319

1425

255

1700
Chap 13-16

Using Excel
1. Choose Data

2. Choose Data Analysis


3. Choose Regression

4. Input data range and output options


Chap 13-17

Using Excel: Regression Output


Regression Statistics
Multiple R

0.76211

R Square

0.58082

Adjusted R Square

0.52842

Standard Error

The regression equation is:


houseprice 98.24833 0.10977 (square feet)

41.33032

Observations

10

ANOVA
df

SS

MS

F
11.0848

Regression

18934.9348

18934.9348

Residual

13665.5652

1708.1957

Total

32600.5000

Coefficients
Intercept

Square Feet

Standard Error

t Stat

P-value

Significance F
0.01039

Lower 95%

Upper 95%

98.24833

58.03348

1.69296

0.12892

-35.57720

232.07386

0.10977

0.03297

3.32938

0.01039

0.03374

0.18580

Chap 13-18

Eg: House Price

House Price ($1000s)

House price model: Scatter Plot and Prediction Line

Intercept
= 98.248

450
400
350
300
250
200
150

Slope
= 0.10977

100
50
0
0

500

1000

1500

2000

2500

3000

Square Feet

houseprice 98.24833 0.10977 (square feet)


Chap 13-19

Eg: Interpreting bo
houseprice 98.24833 0.10977 (square feet)

bo is the estimated mean value of Y when the value of X


is zero (if X = 0 is in the range of observed X values)

Because a house cannot have a square footage of 0, bo


has no practical application

Generally speaking, intercept


bo is not our focus in regression

Chap 13-20

Eg: Interpreting b1
houseprice 98.24833 0.10977 (square feet)

b1 estimates the change in the mean value of Y


as a result of a one-unit increase in X

Here, b1 = 0.10977 tells us that the mean value of a house


increases by 0.10977($1000) = $109.77, on average, for each
additional square foot of size

Chap 13-21

Eg: Making Predictions


Predict the price for a house with 2000 sq feet:

house price 98.24833 0.10977 (sq.ft.)


98.24833 0.10977(2000)
317.78
The predicted price for a house with 2000
square feet is 317.78($1,000s) = $317,780

Chap 13-22

Making Predictions

General rule:
Predict only within the relevant range of Xs

House Price ($1000s)

Relevant range for


interpolation
450
400
350
300
250
200
150
100
50
0
0

500

1000

1500

2000

Square Feet

2500

3000

Do not try to extrapolate


beyond the range of
observed Xs
Chap 13-23

Measures of Variation
and r2

Measures of Variation

Total variation is made up of two parts:

SST
Total Sum of
Squares

SSR
Regression Sum
of Squares

SST ( Yi Y)2

Y)2
SSR ( Y
i

SSE
Error Sum of
Squares

)2
SSE ( Yi Y
i

where:

= Mean value of the dependent variable

Yi = Observed value of the dependent variable

Yi = Predicted value of Y for the given Xi value

Chap 13-25

Measures of Variation

SST = total sum of squares

Measures the variation of the Yi values around their


mean Y

SSR = regression sum of squares (Explained Variation)

(Total Variation)

Variation attributable to the relationship between X


and Y

SSE = error sum of squares (Unexplained Variation)

Variation in Y attributable to factors other than X

Chap 13-26

Measures of Variation
Y
Yi

SSE = (Yi - Yi )2

SST = (Yi - Y)2


_
SSR = (Yi - Y)2

_
Y

Xi

_
Y

X
Chap 13-27

Coefficient of Determination, r2

The coefficient of determination is the portion of the


total variation in Y that is explained by variation in X

The coefficient of determination is also called rsquared and is denoted as r2

SSR regression sum of squares


r

SST
total sum of squares
2

0 r 1
2

Chap 13-28

Examples of r2
Y
0 < r2 < 1

Weaker linear relationships


between X and Y:
Some but not all of the variation in
Y is explained by variation in X

Q: Graph for r2 = 1?

X
Chap 13-29

Measures of Variation and r2 in Excel


House Price Again
SSR 18934.9348
r

0.58082
SST 32600.5000
2

Regression Statistics
Multiple R

0.76211

R Square

0.58082

Adjusted R Square

0.52842

Standard Error

58.08% of the variation in house prices


is explained by variation in sq feet

41.33032

Observations

10

ANOVA
df

SS

MS

F
11.0848

Regression

18934.9348

18934.9348

Residual

13665.5652

1708.1957

Total

32600.5000

Coefficients
Intercept
Square Feet

Standard Error

t Stat

P-value

Significance F
0.01039

Lower 95%

Upper 95%

98.24833

58.03348

1.69296

0.12892

-35.57720

232.07386

0.10977

0.03297

3.32938

0.01039

0.03374

0.18580

Chap 13-30

Standard Error of Estimate

The standard deviation of the variation of observations


around the regression line is estimated by
n

SSE
SYX

n2

(Yi Yi ) 2

i 1

n2

where
SSE = error sum of squares
n = sample size

Chap 13-31

Standard Error of Estimate in Excel


House Price Again
Regression Statistics
Multiple R

0.76211

R Square

0.58082

Adjusted R Square

0.52842

Standard Error

SYX 41.33032

41.33032

Observations

10

ANOVA
df

SS

MS

F
11.0848

Regression

18934.9348

18934.9348

Residual

13665.5652

1708.1957

Total

32600.5000

Coefficients
Intercept
Square Feet

Standard Error

t Stat

P-value

Significance F
0.01039

Lower 95%

Upper 95%

98.24833

58.03348

1.69296

0.12892

-35.57720

232.07386

0.10977

0.03297

3.32938

0.01039

0.03374

0.18580

Chap 13-32

Comparing Standard Errors


SYX is a measure of the variation of observed
Y values from the regression line
Y

large SYX

small SYX

SYX carries the same unit as Y


It should be judged relative to the size of the Y values in the
sample data
Eg: SYX = $41.33K is moderately small relative to house
prices in the $200K - $400K range
Chap 13-33

Example: Salary Data


Years of employment and salary ($1000) in 7 subjects:
Years
6.6
7.4
8.8
9.7
10.5
10.7
11.8

Salary
32
42
52
61
62
66
65

Residual Analysis
(Autocorrelation is NOT covered)

Residual Analysis

The residual for observation i, ei, is the difference


between its observed and predicted value

ei Yi Y
i

Recall the regression assumptions

L: X and Y linearly related?

I: Errors statistically independent?

N: Errors normally distributed?

E: Errors have constant variance (homoscedasticity)?

A residual plot (residuals vs X) is very useful in


checking the assumptions
Chap 13-36

L: Linearity
Y

Not Linear

residuals

residuals

Linear
Chap 13-37

I: Independence

Not Independent

residuals

residuals

residuals

Independent

Chap 13-38

N: Normality

N: Are the errors normally distributed?


Many ways to check for normality

Stem-and-Leaf
Histogram
Normal Probability Plot, etc.

My recommendation is always ___

Chap 13-39

E: Equal Variance
Y

Non-constant variance

residuals

residuals

Constant variance
Chap 13-40

A Residual Plot by Excel


House Price Again
House Price Model Residual Plot
RESIDUAL OUTPUT

60

Residuals

40

Residuals

Predicted
House Price

80

251.92316

-6.923162

273.87671

38.12329

284.85348

-5.853484

304.06284

3.937162

-40

218.99284

-19.99284

-60

268.38832

-49.38832

356.20251

48.79749

367.17929

-43.17929

254.6674

64.33264

10

284.85348

-29.85348

20
0
-20

1000

2000

3000

Square Feet

Key: Any pattern in the residual plot?


NOTE: Autocorrelation is NOT covered
in this course
Chap 13-41

Testing for Significance


and Interval Estimate

Inferences About the Slope

The standard error of the regression slope


coefficient (b1) is estimated by
S YX
Sb1

SSX

S YX
2
(X

X
)
i

where:

Sb1

= Estimate of the standard error of the slope

S YX

SSE = Standard error of the estimate


n2
Chap 13-43

t Test for the Slope

Typical question: X and Y linearly related?

H0: 1 = 0
H1: 1 0

(no linear relationship)


(linear relationship does exist)

Test statistic

t STAT

b1 1
Sb

d.f. n 2

where:
b1 = regression slope
coefficient
1 = hypothesized slope
Sb1 = standard
error of the slope
Chap 13-44

Inferences About the Slope in Excel


House Price Again
Want to know if ____
H0: 1 = 0; H1: 1 0
Coefficients

Intercept
Square Feet

Standard Error

t Stat

P-value

98.24833

58.03348

1.69296

0.12892

0.10977

0.03297

3.32938

0.01039

b1

Sb1
t STAT

Next, how to test if slope


equals a particular value?

b1 1
Sb

0.10977 0
3.32938
0.03297

Chap 13-45

Inferences About the Slope in Excel


House Price Again
Test Statistic: tSTAT = 3.329
Critical values?

a 0.05
d.f. 10 2 8

a/2=.025

Reject H0

a/2=.025

Do not reject H0

-t/2
-2.3060

Decision: Reject H0

Reject H0

t/2
2.3060

3.329

Chap 13-46

Inferences About the Slope in Excel


House Price Again

The p-value Approach


Coefficients
Intercept
Square Feet

Standard Error

t Stat

P-value

98.24833

58.03348

1.69296

0.12892

0.10977

0.03297

3.32938

0.01039

p-value

Decision: Reject H0, since p-value <

Chap 13-47

F Test for Model Significance


H0: 1 = 0;

F Test statistic:

where

H1: 1 0

MSR
FSTAT
MSE
MSR

SSR
k

MSE

SSE
n k 1

FSTAT follows an F distribution


Numerator d.f. = k
Denominator d.f. = n-k-1
k = the number of independent variables in the regression model
Chap 13-48

F-Test for Model Significance


House Price Again
Regression Statistics
Multiple R

0.76211

R Square

0.58082

Adjusted R Square

0.52842

Standard Error

FSTAT

MSR 18934.9348

11.0848
MSE 1708.1957

41.33032

Observations

10

With 1 and 8 degrees


of freedom

p-value for
the F-Test

ANOVA
df

SS

MS

F
11.0848

Regression

18934.9348

18934.9348

Residual

13665.5652

1708.1957

Total

32600.5000

Significance F
0.01039

Chap 13-49

F Test for Significance


Test Statistic:

H0: 1 = 0 H1: 1 0
a = .05
df1= 1
df2 = 8

FSTAT

MSR
11.08
MSE

Decision:

Critical Value:

Reject H0 at a = 0.05

Fa = 5.32
a = .05

Do not
reject H0

Reject H0

F.05 = 5.32
Chap 13-50

Interval Estimate for the Slope


House Price Again

b1 t / 2S b

d.f. = n - 2

Excel Printout
Intercept
Square Feet

Coefficients

Standard Error

t Stat

P-value

98.24833
0.10977

Lower 95%

Upper 95%

58.03348

1.69296

0.12892

-35.57720

232.07386

0.03297

3.32938

0.01039

0.03374

0.18580

At 95% level of confidence, the confidence interval for


the slope is (0.0337, 0.1858)
we are 95% confident that the average impact on sales
price is between $33.74 and $185.80 per sq foot of size
Same conclusion as t-test?
Chap 13-51

t Test for a Correlation Coefficient

Hypotheses
H0: = 0
H1: 0

(no correlation between X and Y)


(correlation exists)

Test statistic

t STAT

r -
1 r2
n2

(with n 2 degrees of freedom)

where
r r 2 if b1 0
r r 2 if b1 0
Chap 13-52

t-test For A Correlation Coefficient


Is there evidence of a linear relationship between square
feet and house price at the .05 level of significance?

H0: = 0
H1: 0

(No correlation)
(correlation exists)

a =.05 , df = 10 - 2 = 8
t STAT

r
1 r2
n2

.762 0
1 .7622
10 2

3.329

Critical Value =
Decision:
Chap 13-53

Estimating Mean Values and


Predicting Individual Values
Goal: Form intervals around Y to express
uncertainty about the value of Y for a given Xi
Confidence
Interval for
the mean of
Y, given Xi

Y = b0+b1Xi

Prediction Interval
for an individual Y,
given Xi

Xi

X
Chap 13-54

Suggested Procedures for a


Regression Analysis
1.

2.
3.

Start with a scatter plot to observe possible


relationship
Run regression
Perform residual analysis

Plot the residuals vs. X to check for assumptions such as


linearity & homoscedasticity
Use a histogram to see if the residuals are normally
distributed

Chap 13-55

Suggested Procedures for a


Regression Analysis
4.

5.

6.

If there is violation of any assumption, use alternative


methods or models
If there is no evidence of assumption violation, then
test for the significance of the regression coefficients
and construct confidence intervals and prediction
intervals
Avoid making predictions or forecasts outside the
relevant range

Chap 13-56

Review Questions 1-3


1. In a regression analysis, the error term is a random
variable with an expected value of ____
2. In a regression analysis, if r2 = 1, SST = ____
3. A regression analysis between sales (in $1000) and
advertising (in $100) resulted in the following equation:
Y-hat = 500 + 4X
a) How to interpret b1?
b) Based on the above equation, if advertising is $10,000,
the point estimate for sales (in dollars) is ____

Review Question 4
4. Shown below is a portion of an Excel printout for a regression
analysis relating Y (quantity demanded) and X (unit price).
ANOVA
Regression
Residual
Total

Intercept
X

df
1
46
47

SS
5048.818
3132.661
8181.479

Coefficients
80.390
-2.137

Standard Error
3.102
0.248

a) Perform a t test to determine if Y and X are related. Let = 0.05


b) Compute R2 and interpret its meaning

Review Question 5:
Restaurant Ratings
A regression model is developed to relate the price per person
and the sum of the ratings for food, decor, and service
1. Find r2
2. Calculate and Interpret the regression coefficients
3. At the 0.05 level of significance, is there evidence of a linear
relationship between the price per person and the summated
rating?
4. Construct a 95% confidence interval for the slope

5. Evaluate the usefulness of this model

Appendix

Figure: Regression Model Assumptions

Least Squares Method


(Formulae for b0 and b1)

b1

( x x )( y y )

(x x )
i

b0 y b1 x

You might also like