L11 Regression

Chapter 13
Simple Linear Regression
David Chow
Nov 2014
Learning Objectives
Understand the simple linear regression model
Model assumptions
Meaning of the coefficients b0 and b1
Predict the value of a dependent variable using

regression results
Make inferences about the coefficients
Estimate mean values and predict individual values
Chap 13-2
2
Overview
This PPT
Linear regression model
Estimate regression line
Measures of variation
Residual analysis
Statistical inference
Testing & interval estimate
Excel file
Eg: House price
Scatter plot
Reg: standard output
Reg: residuals
Data for review question

Answer
What We Know
1.
Equation of a straight line
2.
Scatter plot
Linear / non-linear relationship
Strong / weak relationship
Correlation: ____ relationship
3.
between two variables
Correlation doesnt imply casual relationship
4.
Eg: No. of firefighters and damage (in $)
Hypothesis testing
Chap 13-4
Regression Analysis
Dependent variable (y): the variable we wish to
predict or explain
Independent variable (x): the variable used to
predict or explain y
Regression analysis is used to:
Explain the impact of changes in x on the dependent variable y
Predict the value of y based on the value of x(s)
Linear Regression:
Only one x
Linear relationship between x and y
Chap 13-5
Examples
Y=___, X=___
Useful tip:
A visual check BEFORE running
regression is always helpful!
Linear Regression Model

Population
Y-intercept
Dependent
Variable
Population
Slope
Coefficient
Independent
Variable
Yi 0 1Xi i
Linear component
Random
Error term
Random error component
Assumptions
1. X and Y are linearly related
2. 0 and 1 are population parameters
3. Given an X-value (say, Xi), Y= Yi is random, because of
4. Assume E()=0, so that E(Y)=____
Chap 13-8

Y
Yi 0 1Xi i
Population Regression
line: E(Y) = 0 +1 X
Observed Value
of Y for Xi
i
Predicted Value
of Y for Xi
Intercept = 0
Slope = 1
Random Error
for this Xi value
Xi
X
Chap 13-9
Model Assumptions: LINE
Linearity
Independence of errors
Error values are statistically independent
Normality of error
The relationship between X and Y is linear
Error values are normally distributed for any given value of X
Equal variance (also called homoscedasticity)
The probability distribution of the errors has constant variance
Assumption on error terms:

~ N (0, 2)
See the appendix for a graphical presentation
Chap 13-10
Estimate the Regression

Line and Predict Y Values
Est. Regression Equation

(Prediction Line)
The regression equation provides an
estimate of the population regression line
Estimated (or
predicted) Y value
for observation i
Estimate of
intercept
Estimate of slope
Yi b0 b1Xi
Value of X for
observation i
Chap 13-12
Least Squares Method

(The Best Fitted Line)
b0 and b1 are obtained by finding the values of
that minimize the sum of the squared
differences between Y and Y :
2
2
min (Yi Yi ) min (Yi (b0 b1Xi ))
The computations (to find b0 and b1) are done by Excel
Formulae for b0 and b1 are in the appendix
Chap 13-13
Interpreting b0 and b1
b0 is the estimated mean value of

Y when the value of X is zero
b1 is the estimated change in the

mean value of Y as a result of a
one-unit increase in X
Chap 13-14
Eg: House Price

(in Excel File)
A real estate agent wishes to examine the relationship

between the selling price of a home and its size
(measured in square feet)
A random sample of 10 houses is selected

Dependent var (Y) = house price (in $1000s)
Independent var (X) = square feet
Chap 13-15
Eg: House Price Data & Scatter Plot

House Price in
$1000s (Y)
Square Feet
(X)
245
1400
312
1600
279
1700
308
1875
House Price ($1000s)
450
400
350
300
250
200
150
100
50
0
0
500
1000
1500
2000
2500
3000
Square Feet
199
1100
219
1550
405
2350
324
2450
319
1425
255
1700
Chap 13-16
Using Excel
1. Choose Data
2. Choose Data Analysis

3. Choose Regression
4. Input data range and output options

Chap 13-17
Using Excel: Regression Output

Regression Statistics
Multiple R
0.76211
R Square
0.58082
Adjusted R Square
0.52842
Standard Error
The regression equation is:

houseprice 98.24833 0.10977 (square feet)
41.33032
Observations
10
ANOVA
df
SS
MS
F
11.0848
Regression
18934.9348
18934.9348
Residual
13665.5652
1708.1957
Total
32600.5000
Coefficients
Intercept
Square Feet
Standard Error
t Stat
P-value
Significance F
0.01039
Lower 95%
Upper 95%
98.24833
58.03348
1.69296
0.12892
-35.57720
232.07386
0.10977
0.03297
3.32938
0.01039
0.03374
0.18580
Chap 13-18
Eg: House Price
House price model: Scatter Plot and Prediction Line
Intercept
= 98.248
450
400
350
300
250
200
150
Slope
= 0.10977
100
50
0
0
500
1000
1500
2000
2500
3000
Square Feet

Chap 13-19
Eg: Interpreting bo
bo is the estimated mean value of Y when the value of X

is zero (if X = 0 is in the range of observed X values)
Because a house cannot have a square footage of 0, bo

has no practical application
Generally speaking, intercept

bo is not our focus in regression
Chap 13-20
Eg: Interpreting b1
b1 estimates the change in the mean value of Y

as a result of a one-unit increase in X
Here, b1 = 0.10977 tells us that the mean value of a house

increases by 0.10977($1000) = $109.77, on average, for each
additional square foot of size
Chap 13-21
Eg: Making Predictions

Predict the price for a house with 2000 sq feet:
house price 98.24833 0.10977 (sq.ft.)

98.24833 0.10977(2000)
317.78
The predicted price for a house with 2000
square feet is 317.78($1,000s) = $317,780
Chap 13-22
Making Predictions
General rule:
Predict only within the relevant range of Xs
Relevant range for

interpolation
450
400
350
300
250
200
150
100
50
0
0
500
1000
1500
2000
Square Feet
2500
3000
Do not try to extrapolate

beyond the range of
observed Xs
Chap 13-23
Measures of Variation
and r2
Total variation is made up of two parts:
SST
Total Sum of
Squares
SSR
Regression Sum
of Squares
SST ( Yi Y)2
Y)2
SSR ( Y
i
SSE
Error Sum of
Squares
)2
SSE ( Yi Y
i
where:
= Mean value of the dependent variable
Yi = Observed value of the dependent variable
Yi = Predicted value of Y for the given Xi value
Chap 13-25
SST = total sum of squares
Measures the variation of the Yi values around their

mean Y
SSR = regression sum of squares (Explained Variation)
(Total Variation)
Variation attributable to the relationship between X

and Y
SSE = error sum of squares (Unexplained Variation)
Variation in Y attributable to factors other than X
Chap 13-26
Y
Yi
SSE = (Yi - Yi )2
SST = (Yi - Y)2

_
SSR = (Yi - Y)2
_
Y
Xi
_
Y
X
Chap 13-27
Coefficient of Determination, r2
The coefficient of determination is the portion of the

total variation in Y that is explained by variation in X
The coefficient of determination is also called rsquared and is denoted as r2
SSR regression sum of squares

r
SST
total sum of squares
2
0 r 1
2
Chap 13-28
Examples of r2
Y
0 < r2 < 1
Weaker linear relationships

between X and Y:
Some but not all of the variation in
Y is explained by variation in X
Q: Graph for r2 = 1?
X
Chap 13-29
Measures of Variation and r2 in Excel

House Price Again
SSR 18934.9348
r
0.58082
SST 32600.5000
2
Multiple R
0.76211
R Square
0.58082
Adjusted R Square
0.52842
Standard Error
58.08% of the variation in house prices

is explained by variation in sq feet
41.33032
Observations
10
ANOVA
df
SS
MS
F
11.0848
Regression
18934.9348
18934.9348
Residual
13665.5652
1708.1957
Total
32600.5000
Coefficients
Intercept
Square Feet
Standard Error
t Stat
P-value
Significance F
0.01039
Lower 95%
Upper 95%
98.24833
58.03348
1.69296
0.12892
-35.57720
232.07386
0.10977
0.03297
3.32938
0.01039
0.03374
0.18580
Chap 13-30
Standard Error of Estimate
The standard deviation of the variation of observations

around the regression line is estimated by
n
SSE
SYX
n2
(Yi Yi ) 2
i 1
n2
where
SSE = error sum of squares
n = sample size
Chap 13-31
Standard Error of Estimate in Excel

House Price Again
Multiple R
0.76211
R Square
0.58082
Adjusted R Square
0.52842
Standard Error
SYX 41.33032
41.33032
Observations
10
ANOVA
df
SS
MS
F
11.0848
Regression
18934.9348
18934.9348
Residual
13665.5652
1708.1957
Total
32600.5000
Coefficients
Intercept
Square Feet
Standard Error
t Stat
P-value
Significance F
0.01039
Lower 95%
Upper 95%
98.24833
58.03348
1.69296
0.12892
-35.57720
232.07386
0.10977
0.03297
3.32938
0.01039
0.03374
0.18580
Chap 13-32
Comparing Standard Errors

SYX is a measure of the variation of observed
Y values from the regression line
Y
large SYX
small SYX
SYX carries the same unit as Y

It should be judged relative to the size of the Y values in the
sample data
Eg: SYX = $41.33K is moderately small relative to house
prices in the $200K - $400K range
Chap 13-33
Example: Salary Data

Years of employment and salary ($1000) in 7 subjects:
Years
6.6
7.4
8.8
9.7
10.5
10.7
11.8
Salary
32
42
52
61
62
66
65
Residual Analysis
(Autocorrelation is NOT covered)
Residual Analysis
The residual for observation i, ei, is the difference

between its observed and predicted value
ei Yi Y
i
Recall the regression assumptions
L: X and Y linearly related?
I: Errors statistically independent?
N: Errors normally distributed?
E: Errors have constant variance (homoscedasticity)?
A residual plot (residuals vs X) is very useful in

checking the assumptions
Chap 13-36
L: Linearity
Y
Not Linear
residuals
residuals
Linear
Chap 13-37
I: Independence
Not Independent
residuals
residuals
residuals
Independent
Chap 13-38
N: Normality
N: Are the errors normally distributed?

Many ways to check for normality
Stem-and-Leaf
Histogram
Normal Probability Plot, etc.
My recommendation is always ___
Chap 13-39
E: Equal Variance
Y
Non-constant variance
residuals
residuals
Constant variance
Chap 13-40
A Residual Plot by Excel

House Price Again
House Price Model Residual Plot
RESIDUAL OUTPUT
60
Residuals
40
Residuals
Predicted
House Price
80
251.92316
-6.923162
273.87671
38.12329
284.85348
-5.853484
304.06284
3.937162
-40
218.99284
-19.99284
-60
268.38832
-49.38832
356.20251
48.79749
367.17929
-43.17929
254.6674
64.33264
10
284.85348
-29.85348
20
0
-20
1000
2000
3000
Square Feet
Key: Any pattern in the residual plot?

NOTE: Autocorrelation is NOT covered
in this course
Chap 13-41
Testing for Significance

and Interval Estimate
Inferences About the Slope
The standard error of the regression slope

coefficient (b1) is estimated by
S YX
Sb1
SSX
S YX
2
(X
X
)
i
where:
Sb1
= Estimate of the standard error of the slope
S YX
SSE = Standard error of the estimate

n2
Chap 13-43
t Test for the Slope
Typical question: X and Y linearly related?
H0: 1 = 0
H1: 1 0
(no linear relationship)

(linear relationship does exist)
Test statistic
t STAT
b1 1
Sb
d.f. n 2
where:
b1 = regression slope
coefficient
1 = hypothesized slope
Sb1 = standard
error of the slope
Chap 13-44
Inferences About the Slope in Excel

House Price Again
Want to know if ____
H0: 1 = 0; H1: 1 0
Coefficients
Intercept
Square Feet
Standard Error
t Stat
P-value
98.24833
58.03348
1.69296
0.12892
0.10977
0.03297
3.32938
0.01039
b1
Sb1
t STAT
Next, how to test if slope

equals a particular value?
b1 1
Sb
0.10977 0
3.32938
0.03297
Chap 13-45

House Price Again
Test Statistic: tSTAT = 3.329
Critical values?
a 0.05
d.f. 10 2 8
a/2=.025
Reject H0
a/2=.025
Do not reject H0
-t/2
-2.3060
Decision: Reject H0
Reject H0
t/2
2.3060
3.329
Chap 13-46

House Price Again
The p-value Approach

Coefficients
Intercept
Square Feet
Standard Error
t Stat
P-value
98.24833
58.03348
1.69296
0.12892
0.10977
0.03297
3.32938
0.01039
p-value
Decision: Reject H0, since p-value <
Chap 13-47
F Test for Model Significance

H0: 1 = 0;
F Test statistic:
where
H1: 1 0
MSR
FSTAT
MSE
MSR
SSR
k
MSE
SSE
n k 1
FSTAT follows an F distribution

Numerator d.f. = k
Denominator d.f. = n-k-1
k = the number of independent variables in the regression model
Chap 13-48
F-Test for Model Significance

House Price Again
Multiple R
0.76211
R Square
0.58082
Adjusted R Square
0.52842
Standard Error
FSTAT
MSR 18934.9348
11.0848
MSE 1708.1957
41.33032
Observations
10
With 1 and 8 degrees

of freedom
p-value for
the F-Test
ANOVA
df
SS
MS
F
11.0848
Regression
18934.9348
18934.9348
Residual
13665.5652
1708.1957
Total
32600.5000
Significance F
0.01039
Chap 13-49
F Test for Significance

Test Statistic:
H0: 1 = 0 H1: 1 0
a = .05
df1= 1
df2 = 8
FSTAT
MSR
11.08
MSE
Decision:
Critical Value:
Reject H0 at a = 0.05
Fa = 5.32
a = .05
Do not
reject H0
Reject H0
F.05 = 5.32
Chap 13-50
Interval Estimate for the Slope

House Price Again
b1 t / 2S b
d.f. = n - 2
Excel Printout
Intercept
Square Feet
Coefficients
Standard Error
t Stat
P-value
98.24833
0.10977
Lower 95%
Upper 95%
58.03348
1.69296
0.12892
-35.57720
232.07386
0.03297
3.32938
0.01039
0.03374
0.18580
At 95% level of confidence, the confidence interval for

the slope is (0.0337, 0.1858)
we are 95% confident that the average impact on sales
price is between $33.74 and $185.80 per sq foot of size
Same conclusion as t-test?
Chap 13-51
t Test for a Correlation Coefficient
Hypotheses
H0: = 0
H1: 0
(no correlation between X and Y)

(correlation exists)
Test statistic
t STAT
r -
1 r2
n2
(with n 2 degrees of freedom)
where
r r 2 if b1 0
r r 2 if b1 0
Chap 13-52
t-test For A Correlation Coefficient

Is there evidence of a linear relationship between square
feet and house price at the .05 level of significance?
H0: = 0
H1: 0
(No correlation)
(correlation exists)
a =.05 , df = 10 - 2 = 8
t STAT
r
1 r2
n2
.762 0
1 .7622
10 2
3.329
Critical Value =
Decision:
Chap 13-53
Estimating Mean Values and

Predicting Individual Values
Goal: Form intervals around Y to express
uncertainty about the value of Y for a given Xi
Confidence
Interval for
the mean of
Y, given Xi
Y = b0+b1Xi
Prediction Interval
for an individual Y,
given Xi
Xi
X
Chap 13-54
Suggested Procedures for a

Regression Analysis
1.
2.
3.
Start with a scatter plot to observe possible

relationship
Run regression
Perform residual analysis
Plot the residuals vs. X to check for assumptions such as

linearity & homoscedasticity
Use a histogram to see if the residuals are normally
distributed
Chap 13-55
Suggested Procedures for a

Regression Analysis
4.
5.
6.
If there is violation of any assumption, use alternative

methods or models
If there is no evidence of assumption violation, then
test for the significance of the regression coefficients
and construct confidence intervals and prediction
intervals
Avoid making predictions or forecasts outside the
relevant range
Chap 13-56
Review Questions 1-3

1. In a regression analysis, the error term is a random
variable with an expected value of ____
2. In a regression analysis, if r2 = 1, SST = ____
3. A regression analysis between sales (in $1000) and
advertising (in $100) resulted in the following equation:
Y-hat = 500 + 4X
a) How to interpret b1?
b) Based on the above equation, if advertising is $10,000,
the point estimate for sales (in dollars) is ____
Review Question 4
4. Shown below is a portion of an Excel printout for a regression
analysis relating Y (quantity demanded) and X (unit price).
ANOVA
Regression
Residual
Total
Intercept
X
df
1
46
47
SS
5048.818
3132.661
8181.479
Coefficients
80.390
-2.137
Standard Error
3.102
0.248
a) Perform a t test to determine if Y and X are related. Let = 0.05

b) Compute R2 and interpret its meaning
Review Question 5:
Restaurant Ratings
A regression model is developed to relate the price per person
and the sum of the ratings for food, decor, and service
1. Find r2
2. Calculate and Interpret the regression coefficients
3. At the 0.05 level of significance, is there evidence of a linear
relationship between the price per person and the summated
rating?
4. Construct a 95% confidence interval for the slope
5. Evaluate the usefulness of this model
Appendix
Figure: Regression Model Assumptions
Least Squares Method

(Formulae for b0 and b1)
b1
( x x )( y y )
(x x )
i
b0 y b1 x

L11 Regression

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

L11 Regression

Uploaded by

Copyright:

Available Formats

Chapter 13

Simple Linear Regression

Predict the value of a dependent variable using

Testing & interval estimate

Data for review question

Equation of a straight line

Linear / non-linear relationship

Strong / weak relationship

Correlation: ____ relationship

between two variables

Correlation doesnt imply casual relationship

Eg: No. of firefighters and damage (in $)

Regression analysis is used to:

Explain the impact of changes in x on the dependent variable y

Predict the value of y based on the value of x(s)

Linear relationship between x and y

Linear Regression Model

Linear Regression Model

Random error component

Linear Regression Model

Model Assumptions: LINE

Error values are statistically independent

The relationship between X and Y is linear

Error values are normally distributed for any given value of X

Equal variance (also called homoscedasticity)

The probability distribution of the errors has constant variance

Assumption on error terms:

Estimate the Regression

Est. Regression Equation

Least Squares Method

min (Yi Yi ) min (Yi (b0 b1Xi ))

The computations (to find b0 and b1) are done by Excel

Formulae for b0 and b1 are in the appendix

b0 is the estimated mean value of

b1 is the estimated change in the

Eg: House Price

A real estate agent wishes to examine the relationship

A random sample of 10 houses is selected

Eg: House Price Data & Scatter Plot

House Price ($1000s)

2. Choose Data Analysis

4. Input data range and output options

Using Excel: Regression Output

The regression equation is:

Eg: House Price

House Price ($1000s)

House price model: Scatter Plot and Prediction Line

houseprice 98.24833 0.10977 (square feet)

bo is the estimated mean value of Y when the value of X

Because a house cannot have a square footage of 0, bo

Generally speaking, intercept

b1 estimates the change in the mean value of Y

Here, b1 = 0.10977 tells us that the mean value of a house

Eg: Making Predictions

house price 98.24833 0.10977 (sq.ft.)

House Price ($1000s)

Relevant range for

Do not try to extrapolate

Total variation is made up of two parts:

= Mean value of the dependent variable

Yi = Observed value of the dependent variable

Yi = Predicted value of Y for the given Xi value

SST = total sum of squares

Measures the variation of the Yi values around their