
DMBA: Statistics

Lecture 2: Simple Linear Regression


Least Squares, SLR properties, Inference,
and Forecasting

Carlos Carvalho
The University of Texas McCombs School of Business
mccombs.utexas.edu/faculty/carlos.carvalho/teaching

Today's Plan

1. The Least Squares Criterion

2. The Simple Linear Regression Model

3. Estimation for the SLR Model
   - sampling distributions
   - confidence intervals
   - hypothesis testing

Linear Prediction

   Ŷi = b0 + b1 Xi

b0 is the intercept and b1 is the slope.

We find b0 and b1 using Least Squares.

The Least Squares Criterion

The formulas for b0 and b1 that minimize the least squares criterion are:

   b1 = corr(X, Y) × (sY / sX)

   b0 = Ȳ − b1 X̄

where

   sY = √( Σi=1..n (Yi − Ȳ)² )   and   sX = √( Σi=1..n (Xi − X̄)² )
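
As a quick check of these formulas, here is a minimal R sketch; the data are made-up numbers standing in for the housing example (not the course data), and the result should match R's built-in lm():

# Made-up housing-style data: size (1000s of sq. ft.) and price ($1000s).
x <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)
y <- c(70, 95, 110, 125, 145, 160)

b1 <- cor(x, y) * sd(y) / sd(x)   # slope: correlation times sY/sX
b0 <- mean(y) - b1 * mean(x)      # intercept: line goes through the means

c(b0 = b0, b1 = b1)
coef(lm(y ~ x))                   # R's least squares: same answer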

Review: Covariance and Correlation

Covariance measures the direction and strength of the linear
relationship between the variables Y and X:

   Cov(Y, X) = Σi=1..n (Yi − Ȳ)(Xi − X̄) / (n − 1)

The direction is given by the sign of the covariance.

Correlation and Covariance

Correlation is the standardized covariance:

   corr(X, Y) = cov(X, Y) / √(var(X) var(Y)) = cov(X, Y) / (sd(X) sd(Y))

The correlation is scale invariant and the units of measurement
don't matter: it is always true that −1 ≤ corr(X, Y) ≤ 1.

This gives the direction (− or +) and strength (0 → 1)
of the linear relationship between X and Y.

Correlation

[Figure: four scatterplots illustrating corr = .5, corr = 1, corr = .8, and corr = −.8]

Correlation

Correlation only measures linear relationships:
corr(X, Y) = 0 does not mean the variables are not related!

[Figure: two scatterplots of strongly nonlinear relationships, one with corr = 0.72 and one with corr = 0.01]

Also be careful with influential observations.

Back to Least Squares

1. Intercept:

      Ŷ = b0 + b1 X,   b0 = Ȳ − b1 X̄

   The point (X̄, Ȳ) is on the regression line!
   Least squares finds the point of means and rotates the line
   through that point until it gets the right slope.

2. Slope:

      b1 = corr(X, Y) × (sY / sX)

   So the right slope is the correlation coefficient times a scaling
   factor that ensures the proper units for b1.

More on Least Squares

From now on, the terms "fitted values" (Ŷi) and "residuals" (ei) refer
to those obtained from the least squares line.

The fitted values and residuals have some special properties. Let's
look at the housing data analysis to figure out what these
properties are...

The Fitted Values and X

[Figure: fitted values plotted against X for the housing data; corr(y.hat, x) = 1]

The Residuals and X

[Figure: residuals plotted against X for the housing data; mean(e) = 0 and corr(e, x) = 0]
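
A minimal R sketch verifying these two properties, on the same made-up housing-style data as before:

x <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)
y <- c(70, 95, 110, 125, 145, 160)
fit <- lm(y ~ x)

cor(fitted(fit), x)     # exactly 1 (the fitted values are linear in x)
mean(residuals(fit))    # 0, up to rounding error
cor(residuals(fit), x)  # 0, up to rounding error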

Why?

What is the intuition for the relationship between Ŷ, e, and X?

Let's consider some "crazy" alternative line:

[Figure: housing data with the LS line (38.9 + 35.4 X) and the crazy line (10 + 50 X)]

Fitted Values and Residuals

This is a bad fit! We are underestimating the value of small houses
and overestimating the value of big houses.

[Figure: residuals from the crazy line plotted against X; mean(e) = 1.8 and corr(e, x) = −0.7]

Clearly, we have left some predictive ability on the table!

Fitted Values and Residuals

As long as the correlation between e and X is non-zero, we could
always adjust our prediction rule to do better.

We need to exploit all of the predictive power in the X values and
put this into Ŷ, leaving no "X-ness" in the residuals.

In summary, Y = Ŷ + e, where:

- Ŷ is made from X; corr(X, Ŷ) = 1.
- e is unrelated to X; corr(X, e) = 0.

Another way to derive things

The intercept: setting the mean residual to zero,

   (1/n) Σi=1..n ei = (1/n) Σi=1..n (Yi − b0 − b1 Xi) = 0

   Ȳ − b0 − b1 X̄ = 0

   b0 = Ȳ − b1 X̄

Another way to derive things

The slope: setting corr(e, X) = 0,

   Σi=1..n ei (Xi − X̄) = 0

   Σi=1..n (Yi − b0 − b1 Xi)(Xi − X̄) = Σi=1..n (Yi − Ȳ − b1(Xi − X̄))(Xi − X̄) = 0

so that

   b1 = Σi=1..n (Xi − X̄)(Yi − Ȳ) / Σi=1..n (Xi − X̄)²  =  rxy (sy / sx)

Decomposing the Variance

How well does the least squares line explain variation in Y?

Since Ŷ and e are uncorrelated (i.e., cov(Ŷ, e) = 0),

   var(Y) = var(Ŷ + e) = var(Ŷ) + var(e)

This leads to

   Σi=1..n (Yi − Ȳ)² = Σi=1..n (Ŷi − Ȳ)² + Σi=1..n ei²

Decomposing the Variance: ANOVA Tables

SST = Σ(Yi − Ȳ)²: total variation in Y.
SSR = Σ(Ŷi − Ȳ)²: variation in Y explained by the regression line.
SSE = Σ ei²: variation in Y that is left unexplained.

SSR = SST ⇒ perfect fit.

Be careful of similar acronyms; e.g., some texts use SSR for the residual SS.

Decomposing the Variance: ANOVA Tables

[Slide shows the ANOVA table from the regression output]

A Goodness of Fit Measure: R²

The coefficient of determination, denoted by R², measures goodness of fit:

   R² = SSR/SST = 1 − SSE/SST

0 < R² < 1.

The closer R² is to 1, the better the fit.

A Goodness of Fit Measure: R²

An interesting fact: R² = r²xy (i.e., R² is the squared sample correlation).

   R² = Σi=1..n (Ŷi − Ȳ)² / Σi=1..n (Yi − Ȳ)²
      = Σi=1..n (b0 + b1 Xi − b0 − b1 X̄)² / Σi=1..n (Yi − Ȳ)²
      = b1² Σi=1..n (Xi − X̄)² / Σi=1..n (Yi − Ȳ)²
      = b1² sx² / sy²  =  r²xy

No surprise: the higher the sample correlation between X and Y,
the better you are doing in your regression.
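
A short R sketch, on the same made-up data, confirming the variance decomposition and the fact that R² = r²xy:

x <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)
y <- c(70, 95, 110, 125, 145, 160)
fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)             # total variation
SSR <- sum((fitted(fit) - mean(y))^2)   # explained variation
SSE <- sum(residuals(fit)^2)            # unexplained variation

all.equal(SST, SSR + SSE)               # the decomposition holds
SSR / SST                               # R-squared...
cor(x, y)^2                             # ...equals the squared correlation
summary(fit)$r.squared                  # and matches R's output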

Back to the House Data

[Slide shows the regression output for the housing data]

Prediction and the Modelling Goal

A prediction rule is any function f where you input X and it
outputs Ŷ as a predicted response at X.

The least squares line is a prediction rule:

   Ŷ = f(X) = b0 + b1 X

Prediction and the Modelling Goal

Ŷ is not going to be a perfect prediction.
We need to devise a notion of forecast accuracy.

Prediction and the Modelling Goal

There are two things that we want to know:

- What value of Y can we expect for a given X?
- How sure are we about this forecast? Or, how different could
  Y be from what we expect?

Our goal is to measure the accuracy of our forecasts, or how much
uncertainty there is in the forecast. One method is to specify a
range of Y values that are likely, given an X value.

Prediction Interval: a probable range for Y values given X.

Prediction and the Modelling Goal

Key Insight: To construct a prediction interval, we will have to
assess the likely range of residual values corresponding to a Y value
that has not yet been observed!

We will build a probability model (e.g., normal distribution).
Then we can say something like "with 95% probability the
residual will be no less than −$28,000 and no larger than $28,000."

We must also acknowledge that the fitted line may be fooled by
particular realizations of the residuals.

The Simple Linear Regression Model

The power of statistical inference comes from the ability to make
precise statements about the accuracy of the forecasts.

In order to do this we must invest in a probability model.

Simple Linear Regression Model:

   Y = β0 + β1 X + ε,   ε ~ N(0, σ²)

The error term ε is independent, "idiosyncratic" noise.

Independent Normal Additive Error

Why do we have ε ~ N(0, σ²)?

- E[ε] = 0 ⇒ E[Y | X] = β0 + β1 X
  (E[Y | X] is the conditional expectation of Y given X).
- Many things are close to Normal (central limit theorem).
- MLE estimates for the β's are the same as the LS b's.
- It works! This is a very robust model for the world.

We can think of β0 + β1 X as the "true" regression line.

The Regression Model and our House Data

Think of E[Y | X] as the average price of houses with size X:
some houses could have a higher than expected value, some lower,
and the true line tells us what to expect on average.

The error term ε represents the influence of factors other than X.

Conditional Distributions

The conditional distribution for Y given X is Normal:

   Y | X ~ N(β0 + β1 X, σ²)

σ controls dispersion:

[Figure: conditional distributions of Y at several values of X, centered on the true line]

Conditional vs Marginal Distributions

More on the conditional distribution:

   Y | X ~ N(E[Y | X], var(Y | X))

- The mean is E[Y | X] = E[β0 + β1 X + ε] = β0 + β1 X.
- The variance is var(β0 + β1 X + ε) = var(ε) = σ².

Remember our sliced boxplots:

- σ² < var(Y) if X and Y are related.

Prediction Intervals with the True Model

You are told (without looking at the data) that

   β0 = 40;  β1 = 45;  σ = 10

and you are asked to predict the price of a 1500 square foot house.

What do you know about Y from the model?

   Y = 40 + 45(1.5) + ε = 107.5 + ε

Thus our prediction for price is

   Y ~ N(107.5, 10²)

Prediction Intervals with the True Model

The model says that the mean value of a 1500 sq. ft. house is
$107,500 and that deviations from the mean are within about $20,000 (2σ).

We are 95% sure that

   −20 < ε < 20

   $87,500 < Y < $127,500

In general, the 95% Prediction Interval is PI = β0 + β1 X ± 2σ.

Summary of Simple Linear Regression

Assume that all observations are drawn from our regression model
and that the errors on those observations are independent.

The model is

   Yi = β0 + β1 Xi + εi

where ε is independent and identically distributed N(0, σ²).

The SLR has 3 basic parameters:

- β0, β1 (the linear pattern)
- σ (variation around the line)

Key Characteristics of Linear Regression Model

- The mean of Y is linear in X.
- Error terms (deviations from the line) are normally distributed
  (very few deviations are more than 2 sd away from the regression mean).
- Error terms have constant variance.

Break

Back in 15 minutes...


Recall: Estimation for the SLR Model

SLR assumes every observation in the dataset was generated by the model:

   Yi = β0 + β1 Xi + εi

This is a model for the conditional distribution of Y given X.

We use Least Squares to estimate β0 and β1:

   β̂1 = b1 = Σi=1..n (Xi − X̄)(Yi − Ȳ) / Σi=1..n (Xi − X̄)²

   β̂0 = b0 = Ȳ − b1 X̄

Estimation for the SLR Model


Estimation of Error Variance

Recall that εi ~ iid N(0, σ²), and that σ drives the width of the
prediction intervals:

   σ² = var(εi) = E[(εi − E[εi])²] = E[εi²]

A sensible strategy would be to estimate the average squared error
with the sample average of the squared residuals:

   σ̂² = (1/n) Σi=1..n ei²

Estimation of Error Variance

However, this is not an unbiased estimator of σ². We have to alter
the denominator slightly:

   s² = (1/(n − 2)) Σi=1..n ei² = SSE/(n − 2)

(2 is the number of regression coefficients; i.e., b0 and b1).

We have n − 2 degrees of freedom because 2 have been used up
in the estimation of b0 and b1.

We usually use s = √(SSE/(n − 2)), which is in the same units as Y.
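
A sketch of this computation in R (made-up data); summary(fit)$sigma is R's name for the same s:

x <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)
y <- c(70, 95, 110, 125, 145, 160)
fit <- lm(y ~ x)
n <- length(y)

SSE <- sum(residuals(fit)^2)
s <- sqrt(SSE / (n - 2))    # divide by n - 2, not n
s
summary(fit)$sigma          # R's "residual standard error": the same s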

Degrees of Freedom

Degrees of Freedom is the number of times you get to observe
useful information about the variance you're trying to estimate.

For example, consider SST = Σi=1..n (Yi − Ȳ)²:

- If n = 1, Ȳ = Y1 and SST = 0: since Y1 is used up
  estimating the mean, we haven't observed any variability!
- For n > 1, we've only had n − 1 chances for deviation from
  the mean, and we estimate sy² = SST/(n − 1).

In regression with p coefficients (e.g., p = 2 in SLR), you only get
n − p real observations of variability ⇒ DoF = n − p.

Estimation of Error Variance

Where is s in the Excel output?

[Slide shows the Excel regression output with the "standard error" value highlighted]

Remember: whenever you see "standard error", read it as estimated
standard deviation; σ is the standard deviation.

Sampling Distribution of Least Squares Estimates

How much do our estimates depend on the particular random
sample that we happen to observe? Imagine:

- Randomly draw different samples of the same size.
- For each sample, compute the estimates b0, b1, and s.

If the estimates don't vary much from sample to sample, then it
doesn't matter which sample you happen to observe.
If the estimates do vary a lot, then it matters which sample you
happen to observe.

Sampling Distribution of Least Squares Estimates

[Figures: LS lines fit to many different random samples, for n = 5 and n = 50]

Sampling Distribution of Least Squares Estimates

The LS lines are much closer to the true line when n = 50.
For n = 5, some lines are close, others aren't:
we need to get lucky.

Review: Sampling Distribution of Sample Mean

Step back for a moment and consider the mean for an iid sample
of n observations of a random variable {X1, ..., Xn}.

Suppose that E(Xi) = μ and var(Xi) = σ². Then

   E(X̄) = (1/n) Σ E(Xi) = μ

   var(X̄) = var( (1/n) Σ Xi ) = (1/n²) Σ var(Xi) = σ²/n

If X is normal, then X̄ ~ N(μ, σ²/n).
If X is not normal, we have the central limit theorem (more in a minute)!

Oracle vs SAP Example (understanding variation)

[Slides show the Oracle vs. SAP return-on-equity data]

Oracle vs SAP

Do you really believe that SAP affects ROE?


How else could we look at this question?


Central Limit Theorem

The simple CLT states that for iid random variables X with mean μ
and variance σ², the distribution of the sample mean becomes normal
as the number of observations, n, gets large.

That is, X̄n ~ N(μ, σ²/n), and sample averages tend to be normally
distributed in large samples.

Central Limit Theorem

Exponential random variables don't look very normal:

[Figure: the exponential density f(X), with E[X] = 1 and var(X) = 1]

Central Limit Theorem

[Histogram: 1000 means from n = 2 samples]

apply(matrix(rexp(2 * 1000), ncol = 1000), 2, mean)

Central Limit Theorem

[Histogram: 1000 means from n = 5 samples]

apply(matrix(rexp(5 * 1000), ncol = 1000), 2, mean)

Central Limit Theorem

[Histogram: 1000 means from n = 10 samples]

apply(matrix(rexp(10 * 1000), ncol = 1000), 2, mean)

Central Limit Theorem

[Histogram: 1000 means from n = 100 samples]

apply(matrix(rexp(100 * 1000), ncol = 1000), 2, mean)

Central Limit Theorem

[Histogram: 1000 means from n = 1000 samples]

apply(matrix(rexp(1000 * 1000), ncol = 1000), 2, mean)
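
A self-contained R sketch that reproduces histograms like the ones above; rexp() draws have mean 1 and variance 1, and the seed and plot layout are my own choices:

set.seed(1)                       # for reproducibility
par(mfrow = c(2, 3))              # one panel per sample size
for (n in c(2, 5, 10, 100, 1000)) {
  xbar <- apply(matrix(rexp(n * 1000), ncol = 1000), 2, mean)
  hist(xbar, xlab = "sample mean",
       main = paste("1000 means from n =", n, "samples"))
}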

Sampling Distribution of b1

The sampling distribution of b1 describes how the estimator b1 = β̂1
varies over different samples with the X values fixed.

It turns out that b1 is normally distributed: b1 ~ N(β1, σb1²).

- b1 is unbiased: E[b1] = β1.
- The sampling sd σb1 determines the precision of b1.

Sampling Distribution of b1

Can we intuit what should be in the formula for var(b1)?

- How should σ figure in the formula?
- What about n?
- Anything else?

   var(b1) = σ² / Σ(Xi − X̄)² = σ² / ((n − 1) sx²)

Three factors: sample size (n), error variance (σ²), and X-spread (sx).
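
A Monte Carlo sketch of this sampling distribution, using the true parameters from the earlier slides (β0 = 40, β1 = 45, σ = 10) and a small fixed set of X values chosen for illustration:

set.seed(1)
x <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)            # X values held fixed
b1.draws <- replicate(10000, {
  y <- 40 + 45 * x + rnorm(length(x), sd = 10)  # new errors each sample
  coef(lm(y ~ x))[2]                            # re-estimate the slope
})
mean(b1.draws)                        # close to 45: unbiased
var(b1.draws)                         # close to the formula:
10^2 / ((length(x) - 1) * var(x))     # sigma^2 / ((n - 1) sx^2)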

Sampling Distribution of b0

The intercept is also normal and unbiased: b0 ~ N(β0, σb0²), where

   σb0² = var(b0) = σ² ( 1/n + X̄² / ((n − 1) sx²) )

What is the intuition here?

The Importance of Understanding Variation

When estimating a quantity, it is vital to develop a notion of the
precision of the estimation; for example:

- estimate the slope of the regression line
- estimate the value of a house given its size
- estimate the expected return on a portfolio
- estimate the value of a brand name
- estimate the damages from patent infringement

Why is this important?
We are making decisions based on estimates, and these may be
very sensitive to the accuracy of the estimates!

The Importance of Understanding Variation

An example from everyday life:

- When building a house, we can estimate a required piece of
  wood to within 1/4 inch.
- When building a fine cabinet, the estimates may have to be
  accurate to 1/16 or even 1/32 of an inch.

The standard deviations of the least squares estimators of the
slope and intercept give a precise measurement of the accuracy of
the estimator.

However, these formulas aren't especially practical,
since they involve the unknown quantity σ.

Estimated Variance

We estimate these sampling variances with sample standard deviations:

   sb1 = √( s² / ((n − 1) sx²) )

   sb0 = √( s² ( 1/n + X̄² / ((n − 1) sx²) ) )

Recall that s = √( Σ ei² / (n − 2) ) is the estimator for σ.

Hence, sb1 = σ̂b1 and sb0 = σ̂b0 are the estimated coefficient sds.
A high level of info/precision/accuracy means small sb values.
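
A sketch computing these standard errors by hand in R (made-up data) and checking them against the summary output:

x <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)
y <- c(70, 95, 110, 125, 145, 160)
fit <- lm(y ~ x)
n <- length(y)

s   <- sqrt(sum(residuals(fit)^2) / (n - 2))               # estimate of sigma
sb1 <- sqrt(s^2 / ((n - 1) * var(x)))
sb0 <- sqrt(s^2 * (1/n + mean(x)^2 / ((n - 1) * var(x))))

c(sb0 = sb0, sb1 = sb1)
summary(fit)$coefficients[, "Std. Error"]                  # should agree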

Normal and Student's t

Recall what Student discovered:

If θ̂ ~ N(θ, σ²), but you estimate σ² by s² based on n − p
degrees of freedom, then θ̂ ~ tn−p(θ, s²).

For example:

- Ȳ ~ tn−1(μ, sy²/n)
- b0 ~ tn−2(β0, sb0²) and b1 ~ tn−2(β1, sb1²)

The t distribution is just a fat-tailed version of the normal. As
n − p → ∞, the tails get skinny and the t becomes normal.

Standardized Normal and Student's t

We'll also usually standardize things:

   (bj − βj)/σbj ~ N(0, 1)   ⇒   (bj − βj)/sbj ~ tn−2(0, 1)

We use Z ~ N(0, 1) and Zn−p ~ tn−p(0, 1) to
represent standard random variables.

Notice that the t and normal distributions depend upon assumed
values for βj: this forms the basis for confidence intervals,
hypothesis testing, and p-values.

Testing and Confidence Intervals (in 3 slides)

Suppose Zn−p is distributed tn−p(0, 1). A centered interval is

   P(−tn−p,α/2 < Zn−p < tn−p,α/2) = 1 − α

Confidence Intervals

Since bj ~ tn−p(βj, sbj),

   1 − α = P(−tn−p,α/2 < (bj − βj)/sbj < tn−p,α/2)
         = P(bj − tn−p,α/2 sbj < βj < bj + tn−p,α/2 sbj)

Thus (1 − α)·100% of the time, βj is within the Confidence Interval:

   bj ± tn−p,α/2 sbj
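
A sketch of a 95% confidence interval for β1 in R (made-up data); confint() gives the same interval:

x <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)
y <- c(70, 95, 110, 125, 145, 160)
fit <- lm(y ~ x)
n <- length(y)

b1    <- coef(fit)["x"]
sb1   <- summary(fit)$coefficients["x", "Std. Error"]
tcrit <- qt(1 - 0.05/2, df = n - 2)   # t_{n-2, alpha/2} for alpha = .05

b1 + c(-1, 1) * tcrit * sb1           # interval by hand
confint(fit, "x", level = 0.95)       # same interval from R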

Testing

Similarly, suppose that assuming bj ~ tn−p(βj, sbj) for our sample
bj leads to (recall Zn−p ~ tn−p(0, 1))

   P(Zn−p < −|bj − βj|/sbj) + P(Zn−p > |bj − βj|/sbj) = p

Then the p-value is p = 2 P(Zn−p > |bj − βj|/sbj).

You do this calculation for βj = βj⁰, an assumed null/"safe" value,
and only reject βj⁰ if the p-value is too small (e.g., < 1/20).

In regression, βj⁰ = 0 almost always.

More Detail... Confidence Intervals

Why should we care about Confidence Intervals?

- The confidence interval captures the amount of information in
  the data about the parameter.
- The center of the interval tells you what your estimate is.
- The length of the interval tells you how sure you are about
  your estimate.

More Detail... Testing

Suppose that we are interested in the slope parameter, β1.
For example, is there any evidence in the data to support the
existence of a relationship between X and Y?

We can rephrase this in terms of competing hypotheses:

H0: β1 = 0. Null/safe; implies no effect, and we ignore X.
H1: β1 ≠ 0. Alternative; leads us to our best guess β̂1 = b1.

Hypothesis Testing

If we want statistical support for a certain claim about the data,
we want that claim to be the alternative hypothesis.

Our hypothesis test will either reject or not reject the null
hypothesis (the default if our claim is not true).

If the hypothesis test rejects the null hypothesis, we have
statistical support for our claim!

Hypothesis Testing

We use bj for our test about βj:

- Reject H0 when bj is far from βj⁰ (usually 0).
- Assume H0 when bj is close to βj⁰.

An obvious tactic is to look at the difference bj − βj⁰.
But this measure doesn't take into account the uncertainty in
estimating bj: what we really care about is how many standard
deviations bj is away from βj⁰.

Hypothesis Testing

The t-statistic for this test is

   zbj = (bj − βj⁰)/sbj  =  bj/sbj for βj⁰ = 0.

If H0 is true, this should be distributed zbj ~ tn−p(0, 1).

- Small |zbj| leaves us happy with the null βj⁰.
- Large |zbj| (i.e., greater than about 2) should get us worried!

Hypothesis Testing

We assess the size of zbj with the p-value:

   p = P(|Zn−p| > |zbj|) = 2 P(Zn−p > |zbj|)

(once again, Zn−p ~ tn−p(0, 1)).

[Figure: a t density with 8 df, tail areas beyond ±z shaded; p-value = 0.05]

Hypothesis Testing

The p-value is the probability, assuming that the null hypothesis is
true, of seeing something more extreme
(further from the null) than what we have observed.

You can think of 1 minus the p-value as a measure of distance
between the data and the null hypothesis. In other words, 1 − p is
the strength of evidence against the null.

Hypothesis Testing

The formal 2-step approach to hypothesis testing:

1. Pick the significance level α (often 1/20 = 0.05),
   our acceptable risk (probability) of rejecting a true
   null hypothesis (we call this a type 1 error).
   This α plays the same role as in CIs.

2. Calculate the p-value, and reject H0 if p-value < α
   (in favor of our best alternative guess; e.g., β̂j = bj).
   If p-value > α, continue working under the null assumptions.

This is equivalent to having the rejection region |zbj| > tn−p,α/2.
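
A sketch of the whole test for H0: β1 = 0 in R (made-up data); the t-statistic and p-value should match the summary(fit) table:

x <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)
y <- c(70, 95, 110, 125, 145, 160)
fit <- lm(y ~ x)
n <- length(y)

b1  <- coef(fit)["x"]
sb1 <- summary(fit)$coefficients["x", "Std. Error"]
z   <- unname(b1 / sb1)                # t-statistic for H0: beta1 = 0
p   <- 2 * pt(-abs(z), df = n - 2)     # two-sided p-value
c(t.stat = z, p.value = p)             # compare with summary(fit)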

Example: Hypothesis Testing

Consider again a CAPM regression for the Windsor fund.
Does Windsor have a non-zero intercept?
(i.e., does it make/lose money independent of the market?)

H0: β0 = 0, and there is no "free money".
H1: β0 ≠ 0, and Windsor makes (or loses) money regardless of the market.

Example: Hypothesis Testing (Windsor Fund)

[Slide shows the Windsor regression output, with b0 and sb0 highlighted]

It turns out that we reject the null at α = .05 (p-value = .0105).
Thus Windsor does have an "alpha" over the market.

Example: Hypothesis Testing

Looking at the slope, this is a very rare case where the null
hypothesis is not zero:

H0: β1 = 1. Windsor is just the market (+ alpha).
H1: β1 ≠ 1, and Windsor softens or exaggerates market moves.

We are asking whether or not Windsor moves in a different way
than the market (e.g., is it more conservative?).

Now,

   t = (b1 − 1)/sb1 = −0.0643/0.0291 = −2.205

   |t| > tn−2,α/2 = t178,0.025 = 1.96

⇒ Reject H0 at the 5% level.

Forecasting

The conditional forecasting problem: given the covariate value Xf and
sample data {Xi, Yi}, i = 1..n, predict the "future" observation Yf.

The solution is to use our LS fitted value: Ŷf = b0 + b1 Xf.

This is the easy bit. The hard (and very important!) part of
forecasting is assessing uncertainty about our predictions.

Forecasting

If we use Ŷf, our prediction error is

   ef = Yf − Ŷf = Yf − b0 − b1 Xf

Forecasting

This can get quite complicated! A simple strategy is to build the
following (1 − α)·100% prediction interval:

   b0 + b1 Xf ± tn−2,α/2 s

A large predictive error variance (high uncertainty) comes from:

- Large σ (i.e., large s).
- Small n (not enough data).
- Small sx (not enough observed spread in the covariates).
- A large difference between Xf and X̄.

Just remember that you are uncertain about b0 and b1!
Reasonably inflating the uncertainty in the interval above is always
a good idea... as always, this is problem dependent.
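
A sketch comparing this simple interval with R's exact prediction interval, which also accounts for the uncertainty in b0 and b1 (made-up data; Xf = 2.2 is an arbitrary choice):

x <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)
y <- c(70, 95, 110, 125, 145, 160)
fit <- lm(y ~ x)
n <- length(y)

Xf <- 2.2                                  # a new house size to predict at
s  <- summary(fit)$sigma
yf <- unname(coef(fit)[1] + coef(fit)[2] * Xf)

yf + c(-1, 1) * qt(0.975, df = n - 2) * s  # simple interval from the slide
predict(fit, newdata = data.frame(x = Xf),
        interval = "prediction")           # exact interval (slightly wider)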

Forecasting

For Xf far from our X̄, the space between the lines is magnified...

[Figure: prediction bands widening as Xf moves away from X̄]

Glossary and Equations

- Ŷi = b0 + b1 Xi is the i-th fitted value.
- ei = Yi − Ŷi is the i-th residual.
- s: standard error of the regression residuals (the estimate of σ):

     s² = (1/(n − 2)) Σ ei²

- sbj: standard error of a regression coefficient:

     sb1 = √( s² / ((n − 1) sx²) )

     sb0 = √( s² ( 1/n + X̄² / ((n − 1) sx²) ) )

Glossary and Equations

- α is the significance level (the probability of a type 1 error).
- tn−p,α/2 is the value such that, for Zn−p ~ tn−p(0, 1),
  P(Zn−p > tn−p,α/2) = P(Zn−p < −tn−p,α/2) = α/2.
- zbj ~ tn−p(0, 1) is the standardized coefficient t-value:

     zbj = (bj − βj⁰)/sbj   (= bj/sbj most often)

- The (1 − α)·100% confidence interval for βj is bj ± tn−p,α/2 sbj.
