
The Simple Linear Regression Model
MDA course 6
Purpose of Regression and Correlation Analysis
• Regression analysis is used primarily for prediction: a statistical model predicts the values of a dependent (response) variable from the values of at least one independent (explanatory) variable.
• Correlation analysis is used to measure the strength of the association between numerical variables.
The Scatter Diagram
Plot of all $(X_i, Y_i)$ pairs.
[Figure: example scatter diagram.]
Types of Regression Models
[Figure: four scatter-plot panels: Positive Linear Relationship; Relationship NOT Linear; Negative Linear Relationship; No Relationship.]
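Simple Linear Regression Model
• The relationship between the variables is a linear function.
• The straight line that best fits the data.

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where $Y_i$ is the dependent (response) variable, $\beta_0$ is the Y intercept, $\beta_1$ is the slope, $X_i$ is the independent (explanatory) variable, and $\varepsilon_i$ is the random error.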
Error Variable: Required Conditions
• The error $\varepsilon$ is a critical part of the regression model.
• Four requirements involving the distribution of $\varepsilon$ must be satisfied:
  • The probability distribution of $\varepsilon$ is normal.
  • The mean of $\varepsilon$ is zero: $E(\varepsilon) = 0$.
  • The standard deviation of $\varepsilon$ is $\sigma_\varepsilon$ for all values of x.
  • The errors associated with different values of y are all independent.
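Population Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

$$\mu_{Y|X} = \beta_0 + \beta_1 X_i$$

[Figure: the population regression line $\mu_{Y|X}$ plotted against X, with observed values of Y scattered around it; $\varepsilon_i$ = random error, the vertical distance between an observed value and the line.]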
Sample Linear Regression Model

$$\hat{Y}_i = b_0 + b_1 X_i$$

$\hat{Y}_i$ = predicted value of Y for observation i
$X_i$ = value of X for observation i
$b_0$ = sample Y intercept, used as an estimate of the population $\beta_0$
$b_1$ = sample slope, used as an estimate of the population $\beta_1$
To calculate the estimates of the coefficients that minimize the sum of squared differences between the data points and the line, use the formulas:

$$b_1 = \frac{\operatorname{cov}(X, Y)}{s_x^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

The regression equation that estimates the equation of the first-order linear model is:

$$\hat{y} = b_0 + b_1 x$$
REGRESSION COEFFICIENTS
• To calculate the estimates of the coefficients that minimize the sum of squared differences between the data points and the line, use the formulas (least squares method):

$$b_1 = \frac{n \sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)}{n \sum X_i^2 - \left(\sum X_i\right)^2} \quad \text{and} \quad b_0 = \bar{Y} - b_1 \bar{X}$$

• Excel offers several approaches to regression, including trendlines, regression functions, and the regression analysis tool.
Simple Linear Regression Equation: Example
You wish to examine the relationship between the square footage of produce stores and their annual sales. Sample data for 7 stores were obtained. Find the equation of the straight line that fits the data best.

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Scatter Diagram Example
[Figure: scatter diagram of Annual Sales ($000) against Square Feet for the 7 stores.]
Excel Output
[Figure: Excel regression output.]
Equation for the Best Straight Line

$$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$$

From Excel Printout:

               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657
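As a rough check of the least-squares formulas given earlier, the following Python sketch fits the seven-store data both ways: with the summation form and with the covariance form. The function names `least_squares` and `least_squares_cov` are illustrative only; they are not part of the course material or of Excel.

```python
# Store data from the example (square feet, annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def least_squares(x, y):
    """Return (b0, b1) using the summation form of the least-squares formulas."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


def least_squares_cov(x, y):
    """Return (b0, b1) using b1 = cov(X, Y) / s_x^2 (sample covariance and variance)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    b1 = cov_xy / var_x
    return y_bar - b1 * x_bar, b1


b0, b1 = least_squares(square_feet, annual_sales)
b0_cov, b1_cov = least_squares_cov(square_feet, annual_sales)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Both routes should agree with each other and, to rounding, with the coefficients in the Excel printout.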
Graph of the Best Straight Line
[Figure: scatter diagram of Annual Sales ($000) against Square Feet with the fitted regression line.]
Interpreting the Results

$$\hat{Y}_i = 1636.415 + 1.487 X_i$$

The slope of 1.487 means that for each increase of one unit in X, Y is estimated to increase by 1.487 units.

Because sales are measured in thousands of dollars, the model predicts that expected annual sales increase by about $1,487 for each additional square foot of store size.
Inferences about the Slope: t Test
• t Test for a Population Slope: is there a linear relationship between X and Y?
• Null and Alternative Hypotheses
  $H_0: \beta_1 = 0$ (No Linear Relationship)
  $H_1: \beta_1 \neq 0$ (Linear Relationship)
• Test Statistic:

$$t = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad \text{where } S_{b_1} = \frac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$

and df = n - 2
Standard Error of Estimate

$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$$

The standard deviation of the variation of observations around the regression line.
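To make the definition concrete, here is a minimal Python sketch that computes the standard error of estimate for the seven-store data from the example. The helper names `fit_line` and `standard_error_of_estimate` are assumptions made for illustration.

```python
import math

# Store data from the example (square feet, annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 (helper name is illustrative)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def standard_error_of_estimate(x, y):
    """S_YX = sqrt(SSE / (n - 2)), where SSE sums the squared residuals."""
    b0, b1 = fit_line(x, y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))


s_yx = standard_error_of_estimate(square_feet, annual_sales)
print(f"S_YX = {s_yx:.2f}")
```

The divisor is n - 2 rather than n because two coefficients, the intercept and the slope, are estimated from the same data.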
Example: Produce Stores

Data for 7 Stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Regression Model Obtained:

$$\hat{Y}_i = 1636.415 + 1.487 X_i$$

The slope of this model is 1.487. Is there a linear relationship between the square footage of a store and its annual sales?
Inferences about the Slope: t Test Example

$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
$\alpha = .05$
df = 7 - 2 = 5
Critical Value(s): t = ±2.5706 (.025 in each tail); reject $H_0$ if t < -2.5706 or t > 2.5706.

Test Statistic, from Excel Printout:

               t Stat      P-value
Intercept      3.6244333   0.0151488
X Variable 1   9.009944    0.0002812

Decision: Reject $H_0$, since t = 9.009944 > 2.5706.
Conclusion: There is evidence of a linear relationship.
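The test statistic in the printout can be rebuilt from the raw data. The sketch below is one way to do it in plain Python, assuming the seven-store data from the example; the function name `slope_t_test` is hypothetical, and the critical value is simply the 2.5706 quoted in the example rather than something the code derives.

```python
import math

# Store data from the example (square feet, annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

# Two-tailed critical value for alpha = .05 and df = 5, as quoted in the example.
T_CRITICAL = 2.5706


def slope_t_test(x, y):
    """Return (b1, s_b1, t) for the test of H0: beta1 = 0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))      # standard error of estimate
    s_b1 = s_yx / math.sqrt(sxx)         # standard error of the slope
    return b1, s_b1, (b1 - 0.0) / s_b1   # hypothesized slope is 0 under H0


b1, s_b1, t_stat = slope_t_test(square_feet, annual_sales)
reject_h0 = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.4f}, reject H0: {reject_h0}")
```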
Inferences about the Slope: Confidence Interval Example

Confidence Interval Estimate of the Slope

$$b_1 \pm t_{n-2} S_{b_1}$$

Excel Printout for Produce Stores

               Lower 95%    Upper 95%
Intercept      475.810926   2797.01853
X Variable 1   1.06249037   1.91077694

At the 95% level of confidence, the confidence interval for the slope is (1.062, 1.911), which does not include 0.
Conclusion: There is a significant linear relationship between annual sales and the size of the store.
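The interval can also be recovered from numbers already shown in the Excel printouts. This short Python sketch assumes only the slope, its t Stat, and the critical value 2.5706; because the t Stat tests a hypothesized slope of 0, dividing the slope by it gives the standard error of the slope.

```python
# Values copied from the Excel printouts for the produce stores example.
b1 = 1.486633657      # slope coefficient
t_stat = 9.009944     # t Stat for the slope (tests H0: beta1 = 0)
T_CRITICAL = 2.5706   # t value for 95% confidence with df = 5

# Under H0 the statistic is t = b1 / S_b1, so the standard error can be recovered.
s_b1 = b1 / t_stat
margin = T_CRITICAL * s_b1
lower, upper = b1 - margin, b1 + margin

# The slope is significant at the 5% level when the interval excludes 0.
slope_is_significant = not (lower <= 0.0 <= upper)
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```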
Measures of Variation: The Sum of Squares
SST = Total Sum of Squares
• measures the variation of the $Y_i$ values around their mean $\bar{Y}$
SSR = Regression Sum of Squares
• explained variation attributable to the relationship between X and Y
SSE = Error Sum of Squares
• variation attributable to factors other than the relationship between X and Y
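Measures of Variation: The Sum of Squares

$$SST = \sum (Y_i - \bar{Y})^2$$

$$SSR = \sum (\hat{Y}_i - \bar{Y})^2$$

$$SSE = \sum (Y_i - \hat{Y}_i)^2$$

[Figure: at a given $X_i$, the deviation of $Y_i$ from $\bar{Y}$ (SST) is split into the part explained by the regression line (SSR) and the remaining error (SSE).]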
Measures of Variation: The Sum of Squares: Example

Excel Output for Produce Stores

             df   SS
Regression   1    30380456.12   (SSR)
Residual     5    1871199.595   (SSE)
Total        6    32251655.71   (SST)
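The three sums of squares can be computed directly from their definitions. The following Python sketch does so for the seven-store data and checks that SST = SSR + SSE; the function name `sums_of_squares` is an illustrative choice.

```python
# Store data from the example (square feet, annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the least-squares line fitted to (x, y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]                      # fitted values
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation
    return sst, ssr, sse


sst, ssr, sse = sums_of_squares(square_feet, annual_sales)
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
```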


Testing the validity of the model
• We pose the question: is there at least one independent variable linearly related to the dependent variable?
• To answer the question we test the hypotheses
  $H_0: \beta_1 = 0$
  $H_1$: At least one $\beta_i$ is not equal to 0 (in simple linear regression, $\beta_1 \neq 0$)
• If at least one $\beta_i$ is not equal to zero, the model is valid.

ANOVA - Summary Table

Source of Variation   Degrees of Freedom   Sum of Squares    Mean Square (Variance)   F Test Statistic
Explained (Factor)    k - 1                SSR               MSR = SSR/(k - 1)        F = MSR/MSE
Within (Error)        n - k                SSE               MSE = SSE/(n - k)
Total                 n - 1                SST = SSR + SSE

Here k is the number of estimated coefficients (k = 2 in simple linear regression: $b_0$ and $b_1$).

• To test these hypotheses we perform an analysis of variance procedure.
• The F test: construct the F statistic

$$F = \frac{MSR}{MSE}, \qquad MSR = \frac{SSR}{k-1}, \qquad MSE = \frac{SSE}{n-k}$$

• [Variation in y] = SSR + SSE. A large F results from a large SSR: much of the variation in y is then explained by the regression model, the null hypothesis should be rejected, and the model is judged valid.
• Rejection region: $F > F_{\alpha,\,k-1,\,n-k}$
• Required conditions must be satisfied.
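As an illustration of the F test, the sketch below plugs the sums of squares from the Excel ANOVA output for the produce stores into the formulas above. The critical value 6.61 is an assumption taken from a standard F table for (1, 5) degrees of freedom at the 5% level; it does not appear in the slides.

```python
# Sums of squares copied from the Excel ANOVA output for the produce stores.
SSR = 30380456.12
SSE = 1871199.595
n = 7   # observations
k = 2   # estimated coefficients (b0 and b1), giving df = k - 1 and n - k

msr = SSR / (k - 1)   # mean square for regression
mse = SSE / (n - k)   # mean square error
f_stat = msr / mse

# Assumption: 6.61 is the tabulated F value for alpha = .05 with (1, 5) degrees
# of freedom. It is not given in the slides; it is close to 2.5706 ** 2.
F_CRITICAL = 6.61
model_is_valid = f_stat > F_CRITICAL
print(f"F = {f_stat:.3f}, reject H0: {model_is_valid}")
```

With a single independent variable the F statistic equals the square of the slope's t statistic, so the F test and the t test lead to the same decision.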
The Coefficient of Determination

$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$

Measures the proportion of variation in Y that is explained by the independent variable X in the regression model.
Coefficients of Determination ($r^2$) and Correlation (r)
[Figure: four panels, each showing data around the fitted line $\hat{Y}_i = b_0 + b_1 X_i$: (1) $r^2 = 1$, $r = +1$; (2) $r^2 = 1$, $r = -1$; (3) $r^2 = .8$, $r = +0.9$; (4) $r^2 = 0$, $r = 0$.]
Measures of Variation: Example
Excel Output for Produce Stores

Regression Statistics
Multiple R          0.9705572
R Square            0.94198129    ($r^2$ = .94)
Adjusted R Square   0.93037754
Standard Error      611.751517    ($S_{YX}$)
Observations        7

94% of the variation in annual sales can be explained by the variability in the size of the store as measured by square footage.
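The regression statistics can be traced back to the sums of squares. The Python sketch below uses the SSR and SST values from the Excel ANOVA output; the adjusted $r^2$ formula, $1 - (1 - r^2)(n - 1)/(n - k)$ with k = 2 coefficients, is the standard one and is an assumption here, since the slides do not state it.

```python
import math

# Values copied from the Excel output for the produce stores.
SSR = 30380456.12
SST = 32251655.71
SLOPE = 1.486633657   # sign of the slope gives the sign of r
n = 7                 # observations
k = 2                 # estimated coefficients (b0 and b1)

r_squared = SSR / SST
# r takes the sign of the slope; here the slope is positive.
r = math.copysign(math.sqrt(r_squared), SLOPE)
# Assumed standard formula for adjusted r^2 (not stated in the slides).
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k)

print(f"r^2 = {r_squared:.4f}, r = {r:.4f}, adjusted r^2 = {adjusted_r_squared:.4f}")
```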
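Estimation of Predicted Values

Confidence Interval Estimate for $\mu_{Y|X}$, the Mean of Y Given a Particular $X_i$

$$\hat{Y}_i \pm t_{n-2} \, S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$

• $S_{YX}$ is the standard error of the estimate.
• $t_{n-2}$ is the t value from the table with df = n - 2.
• The size of the interval varies according to the distance of $X_i$ from the mean, $\bar{X}$.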
Estimation of Predicted Values

Confidence Interval Estimate for an Individual Response $Y_i$ at a Particular $X_i$

$$\hat{Y}_i \pm t_{n-2} \, S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$

The additional 1 under the square root makes this interval wider than the interval for the mean of Y.
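Interval Estimates for Different Values of X
[Figure: around the regression line, the confidence interval for the mean of Y and the wider confidence interval for an individual $Y_i$; both are narrowest at $\bar{X}$ and widen as a given X moves away from it.]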
Example: Produce Stores

Data for 7 Stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Predict the annual sales for a store with 2,000 square feet.

Regression Model Obtained:

$$\hat{Y}_i = 1636.415 + 1.487 X_i$$
Estimation of Predicted Values: Example

Confidence Interval Estimate for $\mu_{Y|X}$

Find the 95% confidence interval for the average annual sales for stores of 2,000 square feet.

Predicted Sales: $\hat{Y}_i = 1636.415 + 1.487 X_i = 4610.42$ ($000)
$\bar{X} = 2350.29$, $S_{YX} = 611.75$, $t_{n-2} = t_5 = 2.5706$

$$\hat{Y}_i \pm t_{n-2} \, S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.42 \pm 612.66$$

(Confidence interval for mean Y)
Estimation of Predicted Values: Example

Confidence Interval Estimate for Individual Y

Find the 95% confidence interval for annual sales of one particular store of 2,000 square feet.

Predicted Sales: $\hat{Y}_i = 1636.415 + 1.487 X_i = 4610.42$ ($000)
$\bar{X} = 2350.29$, $S_{YX} = 611.75$, $t_{n-2} = t_5 = 2.5706$

$$\hat{Y}_i \pm t_{n-2} \, S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.42 \pm 1687.69$$

(Confidence interval for individual Y)
