
1. Descriptive statistics  2. Foundations of inferential statistics  3. Estimation and confidence intervals  4. Testing statistical hypotheses  5. Regression analysis

5.1 Correlation  5.2 Simple linear regression  5.3 Multiple regression

All these notions can be extended to the case with multiple predictors...

193 / 221

Veronika Czellar HEC Paris

Statistics


Example: we can use two predictors for Intel: S&P500 and inflation.

[Scatter plots of Intel returns against SP500 returns and against Inflation.]


5.3 Multiple regression


5.3.1 Multiple regression equation
We extend the regression theory to $k$ explanatory variables.

Definition: multiple linear regression equation
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where $x_{i1}, \ldots, x_{ik}$ are observable variables; $\beta_0, \beta_1, \ldots, \beta_k$ are fixed and unknown parameters; $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. $N(0, \sigma^2)$; $\sigma > 0$ is a fixed and unknown parameter.
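The model can be made concrete by simulating from it. A minimal sketch with numpy; the choice k = 2 and all parameter values are invented for illustration:

```python
import numpy as np

# Minimal sketch: simulate n observations from the multiple regression
# model with k = 2 predictors. All parameter values are hypothetical.
rng = np.random.default_rng(0)
n, k = 100, 2
beta = np.array([0.5, 1.0, -2.0])     # beta_0, beta_1, beta_2 (invented)
sigma = 0.3                           # invented error standard deviation

x = rng.normal(size=(n, k))           # observable variables x_i1, x_i2
eps = rng.normal(0.0, sigma, size=n)  # i.i.d. N(0, sigma^2) errors
y = beta[0] + x @ beta[1:] + eps      # y_i = beta_0 + beta_1 x_i1 + beta_2 x_i2 + eps_i
```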


Definitions
The least squares (LS) estimators of $\beta_0, \ldots, \beta_k$ are
$$(\hat{\beta}_0, \ldots, \hat{\beta}_k) = \arg\min_{\beta_0, \ldots, \beta_k} \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}) \right)^2.$$

Remark: explicit formulas for these estimators are available . . .


...but they require a matrix form of the regression model, $Y = X\beta + \varepsilon$, with
$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
The LS estimator of $\beta$ minimizes $(Y - X\beta)^T (Y - X\beta)$ and is $\hat{\beta} = (X^T X)^{-1} X^T Y$. No need to learn this slide by heart; we will use Excel to estimate the parameters.
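The slides use Excel, but the matrix formula is easy to check on simulated data. A sketch with numpy (the data are invented; `np.linalg.solve` on the normal equations is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

# Sketch: LS estimator beta_hat = (X^T X)^{-1} X^T Y on simulated data.
rng = np.random.default_rng(1)
n, k = 200, 2
beta_true = np.array([1.0, 2.0, -0.5])        # invented true parameters

Z = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), Z])          # design matrix with intercept column
y = X @ beta_true + rng.normal(0, 0.1, size=n)

# Solve X^T X beta = X^T y rather than inverting X^T X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With this much data and little noise, `beta_hat` recovers `beta_true` closely.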


Regressing Intel on S&P500 and inflation: [Excel regression output not shown]



The R Square in the Excel output is higher than with the single predictor S&P500. Question: what does R Square mean in multiple regression?


5.3.2 Evaluating a multiple regression equation


Definition
The coefficient of multiple determination $R^2$ is defined by
$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{S_{yy}},$$
where $S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $\hat{y}_i$ are the fitted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik}$.

Proposition
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{S_{yy}}.$$
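The definition and the proposition can be verified numerically: both expressions for R² agree on a fitted model. A sketch on invented data:

```python
import numpy as np

# Sketch: compute R^2 from its definition (explained / total variation)
# and via R^2 = 1 - SSE/Syy; the two must coincide. Data are invented.
rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.2, 1.0, 0.5]) + rng.normal(0, 0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

Syy = np.sum((y - y.mean()) ** 2)
r2_def = np.sum((y_hat - y.mean()) ** 2) / Syy   # definition
r2_alt = 1.0 - np.sum((y - y_hat) ** 2) / Syy    # proposition
```

The agreement relies on the model including an intercept, which makes the variation decomposition exact.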


Properties of the coecient of multiple determination


- It ranges from 0 to 1. A value near 0 indicates little linear association between the set of independent variables and the dependent variable; a value near 1 means a strong association.
- $R^2$ cannot go down when an extra predictor is added to the model, and it will generally increase.
- $R^2$ can almost always be made very close to 1 by using a model with $k$ quite close to $n$, even if many of the predictors contribute only marginally to the variation in $y$.


Definition
The adjusted $R^2$ is defined by
$$\text{Adjusted } R^2 = 1 - \frac{n-1}{n-k-1} \cdot \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{S_{yy}}.$$

Properties of the adjusted $R^2$:
- the adjusted $R^2$ penalizes the addition of extraneous predictors to the model;
- the adjusted $R^2$ is smaller than $R^2$.
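The penalty is visible numerically: on the same fit, the adjusted R² is strictly below the plain R². A sketch on invented data:

```python
import numpy as np

# Sketch: R^2 vs adjusted R^2 = 1 - (n-1)/(n-k-1) * SSE/Syy.
# The data and coefficients are invented; two predictors are pure noise.
rng = np.random.default_rng(3)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.0, 1.0, 0.0, 0.0]) + rng.normal(0, 1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sse = np.sum((y - X @ beta_hat) ** 2)
syy = np.sum((y - y.mean()) ** 2)

r2 = 1.0 - sse / syy
adj_r2 = 1.0 - (n - 1) / (n - k - 1) * sse / syy  # penalized version
```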


Question: high values of $R^2$ suggest that the model fit is a useful one. But how large should this value be before we draw this conclusion?


5.3.3 Testing the global utility of the multiple regression


$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_a$: at least one among $\beta_1, \ldots, \beta_k$ is not zero.

Model utility F test:
$$F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)} \overset{H_0}{\sim} F(k,\, n-k-1).$$
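The F statistic and its p-value can be computed by hand from R². A sketch on invented data, using `scipy.stats.f.sf` for the upper tail of the F(k, n−k−1) distribution:

```python
import numpy as np
from scipy import stats

# Sketch of the model utility F test; the data are simulated,
# not the Intel series from the slides.
rng = np.random.default_rng(4)
n, k = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.1, 0.8, -0.6]) + rng.normal(0, 0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
syy = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - np.sum((y - X @ beta_hat) ** 2) / syy

f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
p_value = stats.f.sf(f_stat, k, n - k - 1)   # P(F(k, n-k-1) > f_stat)
```

A small p-value leads to rejecting H0, i.e. the model as a whole is useful.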


Model utility F test for the Intel example with two predictors: [Excel output not shown]


Warning: if the F test results in the rejection of $H_0$, it does not mean that all predictors are useful.


5.3.4 Evaluating individual regression coefficients


The t tests can be extended to the multivariate case. For any given $j \in \{0, 1, \ldots, k\}$, we can test $H_0: \beta_j = 0$ against $H_a: \beta_j \neq 0$ using a t test:
$$T_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \overset{H_0}{\sim} t_{n-k-1},$$
where


- $SE(\hat{\beta}_j)$ is the standard error of the coefficient $\hat{\beta}_j$ (and has the matrix form $\hat{\sigma}\sqrt{[(X^T X)^{-1}]_{jj}}$);
- $\hat{\sigma} = \sqrt{\frac{1}{n-k-1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$ is called the multiple standard error of estimate.
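These quantities are what Excel reports in the coefficient table; a sketch of computing them directly, on invented data:

```python
import numpy as np

# Sketch: standard errors SE(beta_hat_j) = sigma_hat * sqrt([(X^T X)^{-1}]_jj)
# and t statistics T_j = beta_hat_j / SE(beta_hat_j). Data are invented;
# the second predictor has no true effect.
rng = np.random.default_rng(5)
n, k = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.0, 1.5, 0.0]) + rng.normal(0, 1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - k - 1))  # multiple standard error of estimate

xtx_inv = np.linalg.inv(X.T @ X)
se = sigma_hat * np.sqrt(np.diag(xtx_inv))
t_stats = beta_hat / se    # each T_j ~ t(n-k-1) under H0: beta_j = 0
```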

Example: do an individual test of each independent variable for the Intel regression with two predictors. Which variable would you consider eliminating? Use the 0.05 significance level.


Remark: if there is more than one nonsignificant variable, we should delete only one variable at a time. Each time we delete a variable, we need to rerun the regression and check the remaining variables. This method is called the backward stepwise regression method.
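The delete-one-refit loop can be sketched in code. This is a hedged illustration, not the slides' procedure verbatim: the data are invented, the 0.05 threshold is the one used in the example above, and the predictor `x3` is pure noise that the loop should eliminate:

```python
import numpy as np
from scipy import stats

# Sketch of backward stepwise regression: repeatedly drop the least
# significant predictor (largest p-value, if above 0.05) and refit.
rng = np.random.default_rng(6)
n = 120
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)                       # irrelevant predictor
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0, 0.5, size=n)

names = ["x1", "x2", "x3"]
cols = [x1, x2, x3]
while True:
    X = np.column_stack([np.ones(n)] + cols)
    k = len(cols)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - k - 1))
    se = sigma_hat * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))
    pvals = 2 * stats.t.sf(np.abs(beta_hat / se), n - k - 1)
    worst = int(np.argmax(pvals[1:]))         # ignore the intercept
    if pvals[1:][worst] < 0.05:
        break                                 # all remaining predictors significant
    names.pop(worst)                          # delete one variable at a time...
    cols.pop(worst)                           # ...then rerun the regression
```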


5.3.5 Transformed variables


We can also include transformed variables or mixtures of variables in a multiple regression model.

Example: global warming is the increase in the average temperature of Earth's near-surface air and oceans since the mid-20th century and its projected continuation. It is well known that climate change is influenced by human CO2 emissions. Global CO2 emissions totalled 31.1 billion tons in 2009, 37 percent above those in 1990. Global data (GlobalAirpollution.txt) for more than 65 countries was released in August 2010 and is available on the CERINA Plan website (and on the course website as well).


Example continued: we would like to investigate the impact of GDP per capita and population growth on the increase of CO2 emissions.
- Year2008: emissions of CO2 in 2008 (in million tons, $y_{i,2008}$)
- Year2009: emissions of CO2 in 2009 (in million tons, $y_{i,2009}$)
- GDP2009realgrowth: GDP real growth rate (in %, $x_{i,1}$)
- PopGrowth2009: population growth rate (in %, $x_{i,2}$)
- SquareGDP2009: squared GDP2009realgrowth ($x_{i,1}^2$)
- SquarePopGrowth2009: squared PopGrowth2009 ($x_{i,2}^2$)

Fit the following model:
$$\frac{y_{i,2009}}{y_{i,2008}} = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,1}^2 + \beta_4 x_{i,2}^2 + \varepsilon_i, \qquad i = 1, \ldots, 65.$$
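Transformed columns enter the design matrix like any other predictor. A sketch of fitting such a model; the numbers are simulated stand-ins, not the CERINA CO2 data:

```python
import numpy as np

# Sketch: a model with squared (transformed) predictors,
# ratio = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + eps. Data invented.
rng = np.random.default_rng(7)
n = 65
gdp_growth = rng.normal(1.0, 3.0, size=n)   # stand-in for GDP2009realgrowth
pop_growth = rng.normal(1.0, 1.0, size=n)   # stand-in for PopGrowth2009

# The squared terms are just extra columns of the design matrix.
X = np.column_stack([np.ones(n), gdp_growth, pop_growth,
                     gdp_growth ** 2, pop_growth ** 2])
true_beta = np.array([1.0, 0.01, 0.02, 0.001, -0.005])   # invented
ratio = X @ true_beta + rng.normal(0, 0.02, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ ratio)
```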


5.3.6 Dummy variables


We can also include a dummy variable as a predictor, which takes the values 0 or 1 to indicate the absence or presence of some categorical effect. Example: CEO salaries (see NorthwestCEOsalaries.txt on course website).
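A dummy shifts the intercept for the group it marks, and its coefficient estimates that shift. A sketch on invented data (not the CEO salary file):

```python
import numpy as np

# Sketch: a 0/1 dummy in a regression; its coefficient is the group shift.
# Groups and parameter values are invented.
rng = np.random.default_rng(9)
n = 100
d = (rng.random(n) < 0.5).astype(float)        # dummy: 1 if effect present
x = rng.normal(size=n)
y = 1.0 + 3.0 * d + 0.5 * x + rng.normal(0, 0.2, size=n)

X = np.column_stack([np.ones(n), d, x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # beta_hat[1] estimates the shift
```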


5.3.7 Qualitative variables


A categorical (or qualitative) variable is a predictor that takes a finite number $d$ of possible values. Only $d - 1$ categories are added to the regression model. Example: prices of LCD televisions (see LCD.txt on course website, and exercise 5.12).
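Encoding a d-level variable as d − 1 dummy columns (one level is the omitted baseline) can be sketched as follows; the category names are made up:

```python
import numpy as np

# Sketch: a qualitative predictor with d = 3 levels becomes d - 1 = 2
# dummy columns; "small" is the omitted baseline category.
categories = np.array(["small", "medium", "large", "medium", "small", "large"])
levels = ["small", "medium", "large"]

dummies = np.column_stack([(categories == lvl).astype(float)
                           for lvl in levels[1:]])   # drop the first level
```

Keeping all d dummies alongside an intercept would make the design matrix singular, which is why one category is dropped.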


5.3.8 Interaction variables


In some cases, it can be useful to add interaction terms, which are products of at least two variables. Example: CEO salaries. The product of the woman dummy and sales is an interaction term.
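Building an interaction column is just an elementwise product. A sketch with invented numbers (not the CEO data):

```python
import numpy as np

# Sketch: interaction term = dummy * numeric predictor, as in the
# woman-dummy x sales example. All values are invented.
woman = np.array([1.0, 0.0, 1.0, 0.0])        # hypothetical dummy
sales = np.array([10.0, 20.0, 30.0, 40.0])    # hypothetical sales figures

interaction = woman * sales                    # nonzero only where woman == 1
X = np.column_stack([np.ones(4), woman, sales, interaction])
```

Its coefficient lets the slope of sales differ between the two groups.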


Model assumptions in multiple regression can be verified in the same way as in simple linear regression (see 5.2.8).

However, there is an additional requirement in multiple regression: predictors should not be correlated...


5.3.9 Multicollinearity

Multicollinearity exists when independent variables are correlated. Several clues indicate problems with multicollinearity:
- An independent variable known to be an important predictor ends up being not significant.
- A regression coefficient that should have a positive sign turns out to be negative, or vice versa.
- When an independent variable is added or removed, there is a drastic change in the values of the remaining coefficients.
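A standard numerical diagnostic, not defined on these slides, is the variance inflation factor: VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the others. A sketch on invented data where two predictors are nearly collinear:

```python
import numpy as np

# Hedged sketch: variance inflation factors as a multicollinearity check.
# x2 is built to be nearly collinear with x1; x3 is independent.
rng = np.random.default_rng(8)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.1, size=n)
x3 = rng.normal(size=n)
Z = np.column_stack([x1, x2, x3])

def vif(Z, j):
    """VIF of column j: regress it on the other columns, invert 1 - R^2."""
    others = np.column_stack([np.ones(len(Z)), np.delete(Z, j, axis=1)])
    beta = np.linalg.solve(others.T @ others, others.T @ Z[:, j])
    resid = Z[:, j] - others @ beta
    r2 = 1.0 - resid @ resid / np.sum((Z[:, j] - Z[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

vifs = [vif(Z, j) for j in range(Z.shape[1])]
```

Values far above 1 (a common rule of thumb flags VIF > 10) point at the collinear pair, while the independent predictor stays near 1.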


For further details about linear regression, see:
- Kutner, Nachtsheim, Neter and Li (2005), Applied Linear Statistical Models, 5th ed., McGraw-Hill;
- Fox (2008), Applied Regression Analysis and Generalized Linear Models, 2nd ed., Sage Publications.


Thank you... Merci... Danke... Grazie... Gracias... Spasibo... Köszönöm
