Learning Objectives
In this chapter, you learn:
- How to use regression analysis to predict the value of a dependent variable based on an independent variable
- The meaning of the regression coefficients b0 and b1
- How to evaluate the assumptions of regression analysis and know what to do if the assumptions are violated
- How to make inferences about the slope and correlation coefficient
- How to estimate mean values and predict individual values
Dependent variable: the variable we wish to predict or explain
Independent variable: the variable used to explain the dependent variable
Types of Relationships

[Scatter plots illustrating linear and curvilinear relationships, strong and weak relationships, and no relationship between X and Y]
Simple Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where $\beta_0$ is the population Y intercept, $\beta_1$ is the population slope, $\beta_0 + \beta_1 X_i$ is the linear component, and $\varepsilon_i$ is the random error for the $X_i$ value.

The estimated simple linear regression equation (the prediction line) is

$$\hat{Y}_i = b_0 + b_1 X_i$$

where $b_0$ estimates the intercept $\beta_0$ and $b_1$ estimates the slope $\beta_1$.
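For concreteness, here is a minimal Python sketch of computing the least-squares estimates b0 and b1 by the usual formulas; the data values are illustrative placeholders, not the chapter's actual sample.

```python
import numpy as np

# Illustrative (X, Y) data; not the chapter's actual house-price sample.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])

# Least-squares estimates:
#   b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x   # predicted values: Y_hat_i = b0 + b1 * X_i
print(b0, b1)
```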
Graphical Presentation
House price model: scatter plot
[Scatter plot: house price ($1000s) vs. square feet for the 10 sampled houses]
Excel Output
Regression Statistics:
Multiple R = 0.76211
R Square = 0.58082
Adjusted R Square = 0.52842
Standard Error = 41.33032
Observations = 10

ANOVA:
Regression:  df = 1,  SS = 18934.9348,  MS = 18934.9348,  F = 11.0848,  Significance F = 0.01039
Residual:    df = 8,  SS = 13665.5652,  MS = 1708.1957
Total:       df = 9,  SS = 32600.5000
Graphical Presentation
House price model: scatter plot and regression line
[Scatter plot with fitted regression line: house price ($1000s) vs. square feet]

Slope = b1 = 0.10977
Intercept = b0 = 98.248

so the estimated regression equation is: house price = 98.24833 + 0.10977 (square feet)
b0 is the estimated average value of Y when X = 0 (provided X = 0 is in the range of observed X values). Here, no houses had 0 square feet, so b0 = 98.24833 simply indicates that, for houses within the observed range of sizes, $98,248.33 is the portion of the house price not explained by square footage.
Measures of Variation
Total variation is made up of two parts:

$$SST = SSR + SSE$$

Total sum of squares, regression sum of squares, and error sum of squares:

$$SST = \sum (Y_i - \bar{Y})^2 \qquad SSR = \sum (\hat{Y}_i - \bar{Y})^2 \qquad SSE = \sum (Y_i - \hat{Y}_i)^2$$

where $\bar{Y}$ is the mean value of the dependent variable, $Y_i$ is the observed value of the dependent variable, and $\hat{Y}_i$ is the predicted value of Y for the given $X_i$ value.
Measures of Variation
SST = total sum of squares: measures the variation of the $Y_i$ values around their mean $\bar{Y}$
SSR = regression sum of squares: the explained variation attributable to the linear relationship between X and Y
SSE = error sum of squares: the variation attributable to factors other than the linear relationship between X and Y
Measures of Variation
(continued)
[Diagram: for an observation $(X_i, Y_i)$, the deviation $Y_i - \bar{Y}$ (SST) splits into the explained part $\hat{Y}_i - \bar{Y}$ (SSR) and the unexplained part $Y_i - \hat{Y}_i$ (SSE)]
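To make the decomposition concrete, here is a short Python sketch; the data arrays are illustrative placeholders, since the individual observations are not listed in this chapter.

```python
import numpy as np

# Illustrative data; substitute the actual sample values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.8, 3.9, 4.1, 5.3, 5.4])

b1, b0 = np.polyfit(x, y, 1)              # least-squares slope and intercept
y_hat = b0 + b1 * x                       # predicted values

sst = np.sum((y - y.mean()) ** 2)         # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)     # regression (explained) sum of squares
sse = np.sum((y - y_hat) ** 2)            # error (unexplained) sum of squares

print(np.isclose(sst, ssr + sse))         # verifies SST = SSR + SSE
```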
Coefficient of Determination, r2
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. It is also called r-squared and is denoted $r^2$.
$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$

note: $0 \le r^2 \le 1$

[Scatter plots illustrating the extremes: a perfect linear relationship between X and Y, where $r^2 = 1$, and no linear relationship, where $r^2 = 0$]
Excel Output

From the Excel output, $r^2$ = R Square = 0.58082: about 58.08% of the variation in house prices is explained by variation in square footage.
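As a worked check, using the sums of squares from the ANOVA table:

$$r^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} = 0.58082$$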
Standard Error of Estimate

The standard error of the estimate, $S_{YX}$, measures the variation of the observations around the regression line:

$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}}$$
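For the house price model, using SSE from the ANOVA table above:

$$S_{YX} = \sqrt{\frac{13665.5652}{10-2}} = \sqrt{1708.1957} = 41.33$$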
Excel Output

From the Excel output, $S_{YX}$ = Standard Error = 41.33032.

[Scatter plots comparing a regression with a small $S_{YX}$ (points close to the line) to one with a large $S_{YX}$ (points widely scattered around the line)]
The magnitude of $S_{YX}$ should always be judged relative to the size of the Y values in the sample data; for example, $S_{YX}$ = $41.33K is moderately small relative to house prices in the $200K to $300K range.
Assumptions of Regression
Use the acronym LINE:
Linearity: the underlying relationship between X and Y is linear
Independence of errors: error values are statistically independent
Normality of error: error values ($\varepsilon$) are normally distributed for any given value of X
Equal variance (homoscedasticity): the variance of the errors is the same for all values of X
Residual Analysis
The residual for observation i, $e_i = Y_i - \hat{Y}_i$, is the difference between its observed and predicted value.
Check the assumptions of regression by examining the residuals:
- Examine the residuals for the linearity assumption
- Evaluate the independence assumption
- Evaluate the normal distribution assumption
- Examine the residuals for constant variance at all levels of X (homoscedasticity)
[Residual plots: residuals vs. X showing a curved pattern (not linear) vs. a random scatter (linear); residuals plotted in observation order for checking independence; a histogram of residuals for checking normality; and residuals vs. X showing non-constant variance vs. constant variance]
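A minimal sketch of how these residual checks might be produced in Python; the data arrays are hypothetical placeholders for the sample values.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data; substitute the actual sample values.
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)

b1, b0 = np.polyfit(x, y, 1)          # least-squares slope and intercept
residuals = y - (b0 + b1 * x)         # e_i = Y_i - Y_hat_i

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(x, residuals)         # a pattern here suggests non-linearity or unequal variance
axes[0].axhline(0)
axes[0].set(xlabel="X", ylabel="residual", title="Residuals vs. X")
axes[1].plot(residuals, marker="o")   # trends over observation order suggest dependence
axes[1].set(xlabel="observation order", title="Residuals in sequence")
axes[2].hist(residuals)               # a roughly bell-shaped histogram supports normality
axes[2].set(xlabel="residual", title="Histogram of residuals")
plt.tight_layout()
plt.show()
```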
Autocorrelation
Autocorrelation is correlation of the errors (residuals) over time
[Time (t) residual plot: the residuals plotted against time show a cyclical pattern, not a random scatter]

Cyclical patterns are a sign of positive autocorrelation, which violates the regression assumption that residuals are random and independent.
The Durbin-Watson statistic is used to test for autocorrelation:

$$D = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$

- The possible range is $0 \le D \le 4$
- D should be close to 2 if H0 (no autocorrelation) is true
- D less than 2 may signal positive autocorrelation; D greater than 2 may signal negative autocorrelation

To test for positive autocorrelation, compare D with the lower and upper critical values $d_L$ and $d_U$ from the Durbin-Watson table: reject H0 (positive autocorrelation exists) if $D < d_L$, do not reject H0 if $D > d_U$, and the test is inconclusive if $d_L \le D \le d_U$.
Is there autocorrelation?
[Time-series plots: Sales vs. Time for the sample periods, shown with the fitted linear trend line]
Computing the Durbin-Watson statistic from the residuals of the fitted trend line gives

$$D = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} = 1.00494$$
Here, n = 25 and there is k = 1 independent variable. Using the Durbin-Watson table, $d_L$ = 1.29 and $d_U$ = 1.45.
Decision: since D = 1.00494 < $d_L$ = 1.29, reject H0 and conclude that significant positive autocorrelation exists.
Therefore the linear model is not an appropriate model to forecast these sales.
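A small Python sketch of this calculation, assuming the residuals have already been obtained from the fitted trend line; the residual values below are placeholders. (statsmodels also provides statsmodels.stats.stattools.durbin_watson for the same computation.)

```python
import numpy as np

def durbin_watson(residuals: np.ndarray) -> float:
    """Durbin-Watson statistic: sum of squared successive differences of the
    residuals divided by the sum of squared residuals."""
    diff = np.diff(residuals)                 # e_i - e_{i-1}, for i = 2..n
    return float(np.sum(diff ** 2) / np.sum(residuals ** 2))

# Placeholder residuals; in practice use e_i = Y_i - Y_hat_i from the fitted model.
rng = np.random.default_rng(0)
e = rng.normal(size=25)
print(durbin_watson(e))   # values well below 2 suggest positive autocorrelation
```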
The standard error of the regression slope coefficient $b_1$ is estimated by

$$S_{b_1} = \frac{S_{YX}}{\sqrt{\sum (X_i - \bar{X})^2}}$$
Excel Output

From the Excel output, the standard error of the slope is $S_{b_1}$ = 0.03297.

[Diagrams comparing a regression with a small $S_{b_1}$ to one with a large $S_{b_1}$]
Inference About the Slope: t Test

Null and alternative hypotheses:
H0: $\beta_1$ = 0 (no linear relationship)
H1: $\beta_1 \ne$ 0 (linear relationship does exist)

Test statistic:

$$t = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad d.f. = n - 2$$

where $b_1$ = regression slope coefficient, $\beta_1$ = hypothesized slope, and $S_{b_1}$ = standard error of the slope.
(continued)
The estimated slope of this model is $b_1$ = 0.1098. Does square footage of the house affect its sales price?

From the Excel output:

              Standard Error   t Stat    P-value
Intercept         58.03348     1.69296   0.12892
Square Feet        0.03297     3.32938   0.01039
(continued)

From the output, $b_1$ = 0.10977 and $S_{b_1}$ = 0.03297, so

$$t = \frac{b_1 - 0}{S_{b_1}} = \frac{0.10977}{0.03297} = 3.32938$$

with P-value = 0.01039.
Decision rule: at $\alpha$ = 0.05 with d.f. = n - 2 = 8, reject H0 if t < -$t_{\alpha/2}$ = -2.3060 or t > $t_{\alpha/2}$ = 2.3060.
Since t = 3.329 > 2.3060, reject H0.
Conclusion: there is sufficient evidence that square footage affects house price.
(continued) P-value approach
This is a two-tail test, so the p-value is P(t > 3.329) + P(t < -3.329) = 0.01039 (for 8 d.f.).
Decision: since the p-value = 0.01039 < $\alpha$ = 0.05, reject H0.
Conclusion: there is sufficient evidence that square footage affects house price.
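As a quick check, the two-tail p-value can be reproduced with scipy, using the t statistic and degrees of freedom from the output above:

```python
from scipy import stats

t_stat = 3.32938   # t statistic for the Square Feet slope
df = 8             # n - 2 = 10 - 2
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-tail p-value
print(round(p_value, 5))                    # approximately 0.01039
```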
F Test for Significance

$$F = \frac{MSR}{MSE}, \qquad MSR = \frac{SSR}{k}, \qquad MSE = \frac{SSE}{n - k - 1}$$

where F follows an F distribution with k numerator and (n - k - 1) denominator degrees of freedom (k = the number of independent variables in the regression model).
Excel Output

From the ANOVA output: MSR = 18934.9348, MSE = 1708.1957, F = 11.0848, and Significance F = 0.01039, with 1 and 8 degrees of freedom.
(continued)

Test statistic:

$$F = \frac{MSR}{MSE} = \frac{18934.9348}{1708.1957} = 11.08$$

Decision rule: with 1 numerator and 8 denominator degrees of freedom, the critical value is $F_{0.05}$ = 5.32. Since F = 11.08 > 5.32, reject H0.
Conclusion: there is sufficient evidence that square footage affects house price.
Confidence Interval Estimate for the Slope

$$b_1 \pm t_{n-2} S_{b_1}, \qquad d.f. = n - 2$$

Excel printout for house prices:

              Coefficients   Standard Error   t Stat    P-value
Intercept        98.24833        58.03348     1.69296   0.12892
Square Feet       0.10977         0.03297     3.32938   0.01039
At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858).
Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.70 and $185.80 per square foot of house size.
This 95% confidence interval does not include 0. Conclusion: there is a significant relationship between house price and square feet at the 0.05 level of significance.
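As a worked check using the printout values, with $t_{8}$ = 2.3060:

$$b_1 \pm t_{n-2} S_{b_1} = 0.10977 \pm (2.3060)(0.03297) = 0.10977 \pm 0.07603 \;\Rightarrow\; (0.0337,\ 0.1858)$$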
t Test for a Correlation Coefficient

$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}}, \qquad d.f. = n - 2$$

For the house price example, testing H0: $\rho$ = 0 with r = .762 and n = 10:

$$t = \frac{.762 - 0}{\sqrt{\dfrac{1 - .762^2}{10 - 2}}} = 3.329$$
At $\alpha$ = 0.05 with d.f. = 8, the critical values are $\pm t_{\alpha/2}$ = $\pm$2.3060. Since t = 3.329 > 2.3060, reject H0: there is evidence of a linear association between square feet and house price at the 0.05 level of significance.
Confidence Interval for the Mean of Y, Given X

Example: find the 95% confidence interval for the mean price of 2,000-square-foot houses.

The predicted price is $\hat{Y}_i$ = 317.85 ($1,000s).

The confidence interval estimate for the mean value of Y given $X = X_i$ is

$$\hat{Y} \pm t_{n-2} S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}}$$
The confidence interval endpoints are 280.66 and 354.90, or from $280,660 to $354,900
Prediction Interval for an Individual Y, Given X

Example: find the 95% prediction interval for an individual house with 2,000 square feet.

The predicted price is $\hat{Y}_i$ = 317.85 ($1,000s).

The prediction interval estimate for an individual value of Y given $X = X_i$ is

$$\hat{Y} \pm t_{n-2} S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}}$$
The prediction interval endpoints are 215.50 and 420.07, or from $215,500 to $420,070
[Diagram comparing the confidence interval estimate for the mean of Y given $X = X_i$ with the wider prediction interval estimate for an individual Y given $X = X_i$]
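A minimal sketch of how both intervals could be obtained in Python with statsmodels; the data arrays are hypothetical placeholders for the 10 sampled houses, which are not listed in this chapter.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical (square feet, price in $1000s) data; substitute the actual sample.
sqft = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)

X = sm.add_constant(sqft)                 # adds the intercept column
model = sm.OLS(price, X).fit()

# Intervals at X = 2000 square feet
new_X = sm.add_constant(np.array([2000.0]), has_constant="add")
pred = model.get_prediction(new_X)
frame = pred.summary_frame(alpha=0.05)

print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]])   # CI for the mean of Y
print(frame[["obs_ci_lower", "obs_ci_upper"]])             # prediction interval for an individual Y
```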
Pitfalls of Regression Analysis
- Lacking an awareness of the assumptions underlying least-squares regression
- Not knowing how to evaluate the assumptions
- Not knowing the alternatives to least-squares regression if a particular assumption is violated
- Using a regression model without knowledge of the subject matter
- Extrapolating outside the relevant range
Chapter Summary
- Introduced types of regression models
- Reviewed assumptions of regression and correlation
- Discussed determining the simple linear regression equation
- Described measures of variation
- Discussed residual analysis
- Addressed measuring autocorrelation
Chapter Summary
(continued)
- Described inference about the slope
- Discussed correlation: measuring the strength of the association
- Addressed estimation of mean values and prediction of individual values
- Discussed possible pitfalls in regression and recommended strategies to avoid them
Thank You