
1. Descriptive statistics  2. Foundations of inferential statistics  3. Estimation and confidence intervals  4. Testing statistical hypotheses  5. Regression analysis

5.1 Correlation  5.2 Simple linear regression  5.3 Multiple regression

All these notions can be extended to the case with multiple predictors...

193 / 221

Veronika Czellar HEC Paris

Statistics


Example: we can use two predictors for Intel: S&P500 and inflation.

[Scatter plots of Intel returns against SP500 returns and against Inflation.]


5.3 Multiple regression


5.3.1 Multiple regression equation
We extend the regression theory to $k$ explanatory variables.

Definition: multiple linear regression equation
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where $x_{i1}, \ldots, x_{ik}$ are observable variables; $\beta_0, \beta_1, \ldots, \beta_k$ are fixed and unknown parameters; $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. $N(0, \sigma^2)$; $\sigma > 0$ is a fixed and unknown parameter.
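The model can be made concrete by simulating from it. A minimal sketch with numpy; the choice k = 2 and all parameter values are invented for illustration:

```python
import numpy as np

# Minimal sketch: simulate n observations from the multiple regression
# model with k = 2 predictors. All parameter values are hypothetical.
rng = np.random.default_rng(0)
n, k = 100, 2
beta = np.array([0.5, 1.0, -2.0])     # beta_0, beta_1, beta_2 (invented)
sigma = 0.3                           # invented error standard deviation

x = rng.normal(size=(n, k))           # observable variables x_i1, x_i2
eps = rng.normal(0.0, sigma, size=n)  # i.i.d. N(0, sigma^2) errors
y = beta[0] + x @ beta[1:] + eps      # y_i = beta_0 + beta_1 x_i1 + beta_2 x_i2 + eps_i
```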


Definitions
The least squares (LS) estimators of $\beta_0, \ldots, \beta_k$ are
$$(\hat{\beta}_0, \ldots, \hat{\beta}_k) = \arg\min_{\beta_0, \ldots, \beta_k} \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}) \right)^2.$$

Remark: explicit formulas for these estimators are available . . .


...but they require a matrix form of the regression model, $Y = X\beta + \varepsilon$, with
$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
The LS estimator of $\beta$ minimizes $(Y - X\beta)^T (Y - X\beta)$ and is $\hat{\beta} = (X^T X)^{-1} X^T Y$. No need to learn this slide by heart; we will use Excel to estimate the parameters.
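The slides use Excel, but the matrix formula is easy to check on simulated data. A sketch with numpy (the data are invented; `np.linalg.solve` on the normal equations is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

# Sketch: LS estimator beta_hat = (X^T X)^{-1} X^T Y on simulated data.
rng = np.random.default_rng(1)
n, k = 200, 2
beta_true = np.array([1.0, 2.0, -0.5])        # invented true parameters

Z = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), Z])          # design matrix with intercept column
y = X @ beta_true + rng.normal(0, 0.1, size=n)

# Solve X^T X beta = X^T y rather than inverting X^T X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With this much data and little noise, `beta_hat` recovers `beta_true` closely.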


Regressing Intel on S&P500 and inflation: [Excel regression output not shown]



The R Square in the Excel output is higher than with the single predictor S&P500. Question: what does R Square mean in multiple regression?


5.3.2 Evaluating a multiple regression equation


Definition
The coefficient of multiple determination $R^2$ is defined by
$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{S_{yy}},$$
where $S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $\hat{y}_i$ are the fitted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik}$.

Proposition
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{S_{yy}}.$$
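The definition and the proposition can be verified numerically: both expressions for R² agree on a fitted model. A sketch on invented data:

```python
import numpy as np

# Sketch: compute R^2 from its definition (explained / total variation)
# and via R^2 = 1 - SSE/Syy; the two must coincide. Data are invented.
rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.2, 1.0, 0.5]) + rng.normal(0, 0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

Syy = np.sum((y - y.mean()) ** 2)
r2_def = np.sum((y_hat - y.mean()) ** 2) / Syy   # definition
r2_alt = 1.0 - np.sum((y - y_hat) ** 2) / Syy    # proposition
```

The agreement relies on the model including an intercept, which makes the variation decomposition exact.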


Properties of the coecient of multiple determination


- It ranges from 0 to 1. A value near 0 indicates little linear association between the set of independent variables and the dependent variable; a value near 1 means a strong association.
- $R^2$ cannot go down when an extra predictor is added to the model, and it will generally increase.
- $R^2$ can almost always be made very close to 1 by using a model with $k$ quite close to $n$, even if many of the predictors contribute only marginally to the variation in $y$.


Definition
The adjusted $R^2$ is defined by
$$\text{Adjusted } R^2 = 1 - \frac{n-1}{n-k-1} \cdot \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{S_{yy}}.$$

Properties of the adjusted $R^2$:
- the adjusted $R^2$ penalizes the addition of extraneous predictors to the model;
- the adjusted $R^2$ is smaller than $R^2$.
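The penalty is visible numerically: on the same fit, the adjusted R² is strictly below the plain R². A sketch on invented data:

```python
import numpy as np

# Sketch: R^2 vs adjusted R^2 = 1 - (n-1)/(n-k-1) * SSE/Syy.
# The data and coefficients are invented; two predictors are pure noise.
rng = np.random.default_rng(3)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.0, 1.0, 0.0, 0.0]) + rng.normal(0, 1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sse = np.sum((y - X @ beta_hat) ** 2)
syy = np.sum((y - y.mean()) ** 2)

r2 = 1.0 - sse / syy
adj_r2 = 1.0 - (n - 1) / (n - k - 1) * sse / syy  # penalized version
```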


Question: high values of $R^2$ suggest that the model fit is a useful one. But how large should this value be before we draw this conclusion?


5.3.3 Testing the global utility of the multiple regression


$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_a$: at least one among $\beta_1, \ldots, \beta_k$ is not zero.

Model utility F test:
$$F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)} \overset{H_0}{\sim} F(k,\, n-k-1).$$
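The F statistic and its p-value can be computed by hand from R². A sketch on invented data, using `scipy.stats.f.sf` for the upper tail of the F(k, n−k−1) distribution:

```python
import numpy as np
from scipy import stats

# Sketch of the model utility F test; the data are simulated,
# not the Intel series from the slides.
rng = np.random.default_rng(4)
n, k = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.1, 0.8, -0.6]) + rng.normal(0, 0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
syy = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - np.sum((y - X @ beta_hat) ** 2) / syy

f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
p_value = stats.f.sf(f_stat, k, n - k - 1)   # P(F(k, n-k-1) > f_stat)
```

A small p-value leads to rejecting H0, i.e. the model as a whole is useful.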


Model utility F test for the Intel example with two predictors: [Excel output not shown]


Warning: if the F test results in the rejection of $H_0$, it does not mean that all predictors are useful.


5.3.4 Evaluating individual regression coefficients


The t tests can be extended to the multivariate case. For any given $j \in \{0, 1, \ldots, k\}$, we can test $H_0: \beta_j = 0$ against $H_a: \beta_j \neq 0$ using a t test:
$$T_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \overset{H_0}{\sim} t_{n-k-1},$$
where


- $SE(\hat{\beta}_j)$ is the standard error of the coefficient $\hat{\beta}_j$ (and has the matrix form $\hat{\sigma}\sqrt{[(X^T X)^{-1}]_{jj}}$);
- $\hat{\sigma} = \sqrt{\frac{1}{n-k-1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$ is called the multiple standard error of estimate.
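These quantities are what Excel reports in the coefficient table; a sketch of computing them directly, on invented data:

```python
import numpy as np

# Sketch: standard errors SE(beta_hat_j) = sigma_hat * sqrt([(X^T X)^{-1}]_jj)
# and t statistics T_j = beta_hat_j / SE(beta_hat_j). Data are invented;
# the second predictor has no true effect.
rng = np.random.default_rng(5)
n, k = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.0, 1.5, 0.0]) + rng.normal(0, 1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - k - 1))  # multiple standard error of estimate

xtx_inv = np.linalg.inv(X.T @ X)
se = sigma_hat * np.sqrt(np.diag(xtx_inv))
t_stats = beta_hat / se    # each T_j ~ t(n-k-1) under H0: beta_j = 0
```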

Example: do an individual test of each independent variable for the Intel regression with two predictors. Which variable would you consider eliminating? Use the 0.05 significance level.


Remark: if there is more than one nonsignificant variable, we should delete only one variable at a time. Each time we delete a variable, we need to rerun the regression and check the remaining variables. This method is called the backward stepwise regression method.
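The delete-one-refit loop can be sketched in code. This is a hedged illustration, not the slides' procedure verbatim: the data are invented, the 0.05 threshold is the one used in the example above, and the predictor `x3` is pure noise that the loop should eliminate:

```python
import numpy as np
from scipy import stats

# Sketch of backward stepwise regression: repeatedly drop the least
# significant predictor (largest p-value, if above 0.05) and refit.
rng = np.random.default_rng(6)
n = 120
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)                       # irrelevant predictor
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0, 0.5, size=n)

names = ["x1", "x2", "x3"]
cols = [x1, x2, x3]
while True:
    X = np.column_stack([np.ones(n)] + cols)
    k = len(cols)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - k - 1))
    se = sigma_hat * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))
    pvals = 2 * stats.t.sf(np.abs(beta_hat / se), n - k - 1)
    worst = int(np.argmax(pvals[1:]))         # ignore the intercept
    if pvals[1:][worst] < 0.05:
        break                                 # all remaining predictors significant
    names.pop(worst)                          # delete one variable at a time...
    cols.pop(worst)                           # ...then rerun the regression
```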


5.3.5 Transformed variables


We can also include transformed variables or mixtures of variables in a multiple regression model.

Example: global warming is the increase in the average temperature of Earth's near-surface air and oceans since the mid-20th century and its projected continuation. It is well known that climate change is influenced by human CO2 emissions. Global CO2 emissions totalled 31.1 billion tons in 2009, 37 percent above those in 1990. Global data (GlobalAirpollution.txt) for more than 65 countries was released in August 2010 and is available on the CERINA Plan website (and on the course website as well).


Example continued: we would like to investigate the impact of GDP per capita and population growth on the increase of CO2 emissions.
- Year2008: emissions of CO2 in 2008 (in million tons, $y_{i,2008}$)
- Year2009: emissions of CO2 in 2009 (in million tons, $y_{i,2009}$)
- GDP2009realgrowth: GDP real growth rate (in %, $x_{i,1}$)
- PopGrowth2009: population growth rate (in %, $x_{i,2}$)
- SquareGDP2009: squared GDP2009realgrowth ($x_{i,1}^2$)
- SquarePopGrowth2009: squared PopGrowth2009 ($x_{i,2}^2$)

Fit the following model:
$$\frac{y_{i,2009}}{y_{i,2008}} = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,1}^2 + \beta_4 x_{i,2}^2 + \varepsilon_i, \qquad i = 1, \ldots, 65.$$
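Transformed columns enter the design matrix like any other predictor. A sketch of fitting such a model; the numbers are simulated stand-ins, not the CERINA CO2 data:

```python
import numpy as np

# Sketch: a model with squared (transformed) predictors,
# ratio = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + eps. Data invented.
rng = np.random.default_rng(7)
n = 65
gdp_growth = rng.normal(1.0, 3.0, size=n)   # stand-in for GDP2009realgrowth
pop_growth = rng.normal(1.0, 1.0, size=n)   # stand-in for PopGrowth2009

# The squared terms are just extra columns of the design matrix.
X = np.column_stack([np.ones(n), gdp_growth, pop_growth,
                     gdp_growth ** 2, pop_growth ** 2])
true_beta = np.array([1.0, 0.01, 0.02, 0.001, -0.005])   # invented
ratio = X @ true_beta + rng.normal(0, 0.02, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ ratio)
```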


5.3.6 Dummy variables


We can also include a dummy variable as a predictor, which takes the values 0 or 1 to indicate the absence or presence of some categorical effect. Example: CEO salaries (see NorthwestCEOsalaries.txt on course website).
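A dummy shifts the intercept for the group it marks, and its coefficient estimates that shift. A sketch on invented data (not the CEO salary file):

```python
import numpy as np

# Sketch: a 0/1 dummy in a regression; its coefficient is the group shift.
# Groups and parameter values are invented.
rng = np.random.default_rng(9)
n = 100
d = (rng.random(n) < 0.5).astype(float)        # dummy: 1 if effect present
x = rng.normal(size=n)
y = 1.0 + 3.0 * d + 0.5 * x + rng.normal(0, 0.2, size=n)

X = np.column_stack([np.ones(n), d, x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # beta_hat[1] estimates the shift
```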


5.3.7 Qualitative variables


A categorical (or qualitative) variable is a predictor that takes a finite number $d$ of possible values. Only $d - 1$ categories are added to the regression model. Example: prices of LCD televisions (see LCD.txt on course website, and exercise 5.12).
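Encoding a d-level variable as d − 1 dummy columns (one level is the omitted baseline) can be sketched as follows; the category names are made up:

```python
import numpy as np

# Sketch: a qualitative predictor with d = 3 levels becomes d - 1 = 2
# dummy columns; "small" is the omitted baseline category.
categories = np.array(["small", "medium", "large", "medium", "small", "large"])
levels = ["small", "medium", "large"]

dummies = np.column_stack([(categories == lvl).astype(float)
                           for lvl in levels[1:]])   # drop the first level
```

Keeping all d dummies alongside an intercept would make the design matrix singular, which is why one category is dropped.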


5.3.8 Interaction variables


In some cases, it can be useful to add interaction terms, which are products of at least two variables. Example: CEO salaries. The product of the woman dummy and sales is an interaction term.
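Building an interaction column is just an elementwise product. A sketch with invented numbers (not the CEO data):

```python
import numpy as np

# Sketch: interaction term = dummy * numeric predictor, as in the
# woman-dummy x sales example. All values are invented.
woman = np.array([1.0, 0.0, 1.0, 0.0])        # hypothetical dummy
sales = np.array([10.0, 20.0, 30.0, 40.0])    # hypothetical sales figures

interaction = woman * sales                    # nonzero only where woman == 1
X = np.column_stack([np.ones(4), woman, sales, interaction])
```

Its coefficient lets the slope of sales differ between the two groups.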


Model assumptions in multiple regression can be verified in the same way as in simple linear regression (see 5.2.8).

However, there is an additional requirement in multiple regression: predictors should not be correlated...


5.3.9 Multicollinearity

Multicollinearity exists when independent variables are correlated. Several clues indicate problems with multicollinearity:
- An independent variable known to be an important predictor ends up being not significant.
- A regression coefficient that should have a positive sign turns out to be negative, or vice versa.
- When an independent variable is added or removed, there is a drastic change in the values of the remaining coefficients.
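A standard numerical diagnostic, not defined on these slides, is the variance inflation factor: VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the others. A sketch on invented data where two predictors are nearly collinear:

```python
import numpy as np

# Hedged sketch: variance inflation factors as a multicollinearity check.
# x2 is built to be nearly collinear with x1; x3 is independent.
rng = np.random.default_rng(8)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.1, size=n)
x3 = rng.normal(size=n)
Z = np.column_stack([x1, x2, x3])

def vif(Z, j):
    """VIF of column j: regress it on the other columns, invert 1 - R^2."""
    others = np.column_stack([np.ones(len(Z)), np.delete(Z, j, axis=1)])
    beta = np.linalg.solve(others.T @ others, others.T @ Z[:, j])
    resid = Z[:, j] - others @ beta
    r2 = 1.0 - resid @ resid / np.sum((Z[:, j] - Z[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

vifs = [vif(Z, j) for j in range(Z.shape[1])]
```

Values far above 1 (a common rule of thumb flags VIF > 10) point at the collinear pair, while the independent predictor stays near 1.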


For further details about linear regression, see:
- Kutner, Nachtsheim, Neter and Li (2005), Applied Linear Statistical Models, 5th ed., McGraw-Hill;
- Fox (2008), Applied Regression Analysis and Generalized Linear Models, 2nd ed., Sage Publications.


Thank you... Merci... Danke... Grazie... Gracias... Spasibo... Köszönöm
