
Multiple Linear Regression

Sasadhar Bera, IIM Ranchi

Multiple Linear Regression Model


Multiple linear regression involves one dependent variable and more
than one independent variable. The equation that describes the
multiple linear regression model is given below:

y = β0 + β1 x1 + β2 x2 + . . . + βk xk + ε

y is the dependent variable and x1, x2, . . ., xk are the independent
variables. These independent variables are used to predict the
dependent variable.

β0, β1, β2, . . ., βk are the (k+1) unknown regression coefficients
(also called model parameters). These regression coefficients are
estimated from the observed sample data.

The term ε (pronounced "epsilon") is the random error.

Data for Multiple Regression


Suppose that n observations are collected on the response variable
(y) and the k independent variables present in the regression model.
i = 1, 2, . . ., n ;  j = 1, 2, . . ., k

  y     x1     x2     .    xj     .    xk
  y1    x11    x12    .    x1j    .    x1k
  y2    x21    x22    .    x2j    .    x2k
  .     .      .           .           .
  yi    xi1    xi2    .    xij    .    xik
  .     .      .           .           .
  yn    xn1    xn2    .    xnj    .    xnk


Scalar Notation: Multiple Linear Regression


Suppose that n observations are collected on the response variable
(y) and the k independent variables present in the regression model.
The scalar notation of the regression model:

yi = β0 + β1 xi1 + β2 xi2 + . . . + βj xij + . . . + βk xik + εi

i = 1, 2, . . ., n ;  j = 1, 2, . . ., k

n = total number of observations
k = number of independent variables
βj's are the model parameters.

Matrix Notation: Multiple Linear Regression


Suppose that n observations are collected on the response variable
(y) and the k independent variables present in the regression model.
The matrix notation of the regression model:

y(n×1) = X(n×(k+1)) β((k+1)×1) + ε(n×1)

n = total number of observations, k = total number of regressor
variables, and β is the vector of model parameters.

      | y1 |          | 1  x11 . x1j . x1k |          | β0 |          | ε1 |
      | .  |          | .  .     .     .   |          | .  |          | .  |
  y = | yi |      X = | 1  xi1 . xij . xik |      β = | βj |      ε = | εi |
      | .  |          | .  .     .     .   |          | .  |          | .  |
      | yn |          | 1  xn1 . xnj . xnk |          | βk |          | εn |
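The matrix setup can be sketched with NumPy. This is a minimal illustration, not part of the original material: the data values (n = 6 observations, k = 2 regressors) are made up.

```python
import numpy as np

# Made-up data for illustration only: n = 6 observations, k = 2 regressors.
y = np.array([4.0, 5.1, 6.9, 8.2, 9.0, 11.1])         # response vector, shape (n,)
regressors = np.array([[1.0, 2.0],
                       [2.0, 1.0],
                       [3.0, 3.0],
                       [4.0, 2.0],
                       [5.0, 4.0],
                       [6.0, 3.0]])                    # x1, x2 columns, shape (n, k)

n, k = regressors.shape
X = np.column_stack([np.ones(n), regressors])          # prepend the column of 1s
print(X.shape)                                         # (n, k + 1) = (6, 3)
```

The leading column of ones corresponds to the intercept β0, which is why X has k + 1 columns.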


Model Parameter Estimation


The error in a regression model is the difference between the actual
and the predicted value. It may be positive or negative.

The error is also known as the residual. The value predicted by the
regression equation is called the fitted value, or fit.

The sum of squared differences between the actual and predicted
values is known as the sum of squares of error. The least squares
method minimizes the sum of squares of error to find the best
fitting plane.

It is to be noted that the regressor variables in a linear
regression model are non-random; that is, their values are fixed.

Model Parameter Estimation (Contd.)


In matrix notation, the regression equation is:

y = Xβ + ε

Using the least squares method, we want the estimate β̂ of β that
minimizes

L = Σi=1..n εi² = (y − Xβ)ᵀ (y − Xβ)

The least squares estimator must satisfy:

∂L/∂β = −2 Xᵀ y + 2 XᵀX β = 0

which gives β̂ = (XᵀX)⁻¹ Xᵀ y, the estimated model parameters.

The fitted regression line: ŷ = X β̂
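The closed-form least squares solution can be sketched in NumPy on made-up data (these values are not from the text):

```python
import numpy as np

# Sketch of beta_hat = (X^T X)^(-1) X^T y on made-up data (n = 6, k = 2).
y = np.array([4.0, 5.1, 6.9, 8.2, 9.0, 11.1])
X = np.column_stack([np.ones(6),
                     np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),   # x1
                     np.array([2.0, 1.0, 3.0, 2.0, 4.0, 3.0])])  # x2

# np.linalg.solve gives the same result as inv(X.T @ X) @ X.T @ y
# without forming the explicit inverse (better numerical behavior).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # shape (k + 1,)
y_fit = X @ beta_hat                           # fitted values y_hat = X beta_hat
```

A useful check of the normal equations: the residual vector y − ŷ is orthogonal to every column of X.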



Estimated Residual and Standard Error


For the ith observation (xi), the predicted value, or fit, is:

ŷi = xiᵀ β̂

The error in the fit is called the residual:

ei = yi − ŷi

Mean Square Error = MSE = ( Σi=1..n ei² ) / (n − k − 1)

where n is the total number of observations and k is the number of
regressors.

Standard error (SE) of estimate = σ̂ = √MSE

Var(β̂) = σ̂² (XᵀX)⁻¹
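These quantities can be sketched in NumPy; the data below are made up purely for illustration:

```python
import numpy as np

# Residuals, MSE, standard error, and Var(beta_hat) on made-up data
# (n = 6 observations, k = 2 regressors).
y = np.array([4.0, 5.1, 6.9, 8.2, 9.0, 11.1])
X = np.column_stack([np.ones(6),
                     np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
                     np.array([2.0, 1.0, 3.0, 2.0, 4.0, 3.0])])
n, p = X.shape                                  # p = k + 1
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

e = y - X @ beta_hat                            # residuals e_i = y_i - y_hat_i
mse = (e @ e) / (n - p)                         # divide by n - k - 1
se = np.sqrt(mse)                               # standard error of estimate
cov_beta = mse * np.linalg.inv(X.T @ X)         # Var(beta_hat) = sigma_hat^2 (X^T X)^-1
```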

Testing Significance of Regression Model


The test for significance of regression is a test to
determine if there is a linear relationship between the
response variable and regressor variables.
H0 : β1 = β2 = . . . = βk = 0
H1 : at least one βj is not zero

The test procedure involves an analysis of variance (ANOVA)
partitioning of the total sum of squares into a sum of squares due
to regression and a sum of squares due to error (or residual).

Total number of model parameters = p = number of regression
coefficients = (k+1)

Testing Significance of Regression Model (Contd.)


ANOVA table:

Source of Variation    DF         SS     MS                    FCal
Regression             k          SSR    MSR = SSR / k         MSR / MSE
Residual error         n − k − 1  SSE    MSE = SSE / (n−k−1)
Total                  n − 1      TSS

TSS = Σi=1..n (yi − ȳ)²

SSR = β̂ᵀ Xᵀ y − ( Σi=1..n yi )² / n

SSE = Σi=1..n (yi − ŷi)² = yᵀ y − β̂ᵀ Xᵀ y

TSS = SSR + SSE
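The ANOVA decomposition can be verified numerically; the sketch below uses made-up data (n = 6, k = 2), not values from the text:

```python
import numpy as np

# ANOVA decomposition TSS = SSR + SSE and the F statistic.
y = np.array([4.0, 5.1, 6.9, 8.2, 9.0, 11.1])
X = np.column_stack([np.ones(6),
                     np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
                     np.array([2.0, 1.0, 3.0, 2.0, 4.0, 3.0])])
n, p = X.shape
k = p - 1
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

tss = np.sum((y - y.mean()) ** 2)               # total sum of squares
sse = y @ y - beta_hat @ (X.T @ y)              # y'y - beta_hat' X' y
ssr = beta_hat @ (X.T @ y) - y.sum() ** 2 / n   # beta_hat' X' y - (sum y)^2 / n
f_cal = (ssr / k) / (sse / (n - k - 1))         # MSR / MSE
```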


Significance Test of Individual Regression Coefficient

Adding an unimportant variable to the model can actually increase
the mean square error, thereby decreasing the usefulness of the
model.

The hypothesis for testing the significance of any individual
regression coefficient, say βj, is:

H0: βj = 0
H1: βj ≠ 0

Test statistic: Tcal = β̂j / √( σ̂² Cjj )

where σ̂² is the mean square error (MSE) and Cjj is the jth diagonal
element of (XᵀX)⁻¹. Reject H0 if |Tcal| > t(α/2, n − k − 1).
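The coefficient t statistics can be sketched as follows; the data are made up for illustration (n = 6, k = 2, so n − k − 1 = 3 degrees of freedom):

```python
import numpy as np

# T_j = beta_hat_j / sqrt(sigma_hat^2 * C_jj) for each coefficient.
y = np.array([4.0, 5.1, 6.9, 8.2, 9.0, 11.1])
X = np.column_stack([np.ones(6),
                     np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
                     np.array([2.0, 1.0, 3.0, 2.0, 4.0, 3.0])])
n, p = X.shape
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
mse = (e @ e) / (n - p)                          # sigma_hat^2

C = np.linalg.inv(X.T @ X)
t_cal = beta_hat / np.sqrt(mse * np.diag(C))     # one T statistic per coefficient

# Reject H0: beta_j = 0 when |T_j| > t(alpha/2, n - k - 1); for
# alpha = 0.05 and 3 degrees of freedom the table value is about 3.182.
```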


Confidence Interval of Mean Response


In matrix notation, the regression equation: y = Xβ + ε,
where ε ~ Normal(0, σ²).

Consider a point x0 = [1, x01, x02, . . ., x0j, . . ., x0k]ᵀ.

Mean response: E(y) = E(Xβ) + E(ε) = Xβ + 0

μy|x0 = E(y | x0) = x0ᵀ β

The estimated mean response at x0 is ŷ0 = x0ᵀ β̂, with

var(ŷ0) = σ̂² x0ᵀ (XᵀX)⁻¹ x0

The 100(1 − α)% confidence interval of the mean response at x0:

ŷ0 ± t(α/2, n − p) √( σ̂² x0ᵀ (XᵀX)⁻¹ x0 )
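The interval can be sketched numerically. The data and the point x0 = [1, 3.5, 2.5] below are arbitrary choices for illustration, and the critical value 3.182 ≈ t(0.025, 3) is taken from a t table:

```python
import numpy as np

# 95% confidence interval for the mean response at a point x0.
y = np.array([4.0, 5.1, 6.9, 8.2, 9.0, 11.1])
X = np.column_stack([np.ones(6),
                     np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
                     np.array([2.0, 1.0, 3.0, 2.0, 4.0, 3.0])])
n, p = X.shape
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
mse = (e @ e) / (n - p)                          # sigma_hat^2

x0 = np.array([1.0, 3.5, 2.5])                   # [1, x01, x02], arbitrary point
y0_hat = x0 @ beta_hat                           # estimated mean response
t_crit = 3.182                                   # approx. t(0.025, n - p = 3)
half_width = t_crit * np.sqrt(mse * x0 @ np.linalg.inv(X.T @ X) @ x0)
lower, upper = y0_hat - half_width, y0_hat + half_width
```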

Coefficient of Multiple Determination


Coefficient of multiple determination:

R² = SSR / TSS = (TSS − SSE) / TSS = 1 − SSE / TSS

SSR : sum of squares due to regression
SSE : sum of squares due to error
TSS : total sum of squares

The coefficient of determination is the fraction of the variation of
the dependent variable explained by the regressor variables.

R² measures the goodness of the linear fit: the better the linear
fit, the closer R² is to 1.
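A short numerical sketch on made-up data (the equivalence of the two forms of R² is easy to check):

```python
import numpy as np

# R^2 = SSR / TSS = 1 - SSE / TSS on made-up data (n = 6, k = 2).
y = np.array([4.0, 5.1, 6.9, 8.2, 9.0, 11.1])
X = np.column_stack([np.ones(6),
                     np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
                     np.array([2.0, 1.0, 3.0, 2.0, 4.0, 3.0])])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

tss = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - X @ beta_hat) ** 2)
r2 = 1.0 - sse / tss                             # fraction of variation explained
```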

Coefficient of Multiple Determination (Contd.)


The major drawback of using the coefficient of multiple
determination (R²) is that adding a predictor variable to the model
will always increase R², regardless of whether the additional
variable is significant or not. To avoid this, regression model
builders prefer the adjusted R² statistic:

R²adj = 1 − [ SSE / (n − p) ] / [ TSS / (n − 1) ] = 1 − ((n − 1) / (n − p)) (1 − R²)

In general, the adjusted R² statistic will not necessarily increase
as variables are added to the model.

When R² and adjusted R² differ dramatically, there is a good chance
that non-significant terms have been included in the model.
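The adjusted R² formula can be sketched on made-up data; note that R²adj ≤ R² always holds, since (n − 1)/(n − p) ≥ 1:

```python
import numpy as np

# Adjusted R^2 = 1 - [SSE/(n-p)] / [TSS/(n-1)] on made-up data
# (n = 6 observations, p = k + 1 = 3 parameters).
y = np.array([4.0, 5.1, 6.9, 8.2, 9.0, 11.1])
X = np.column_stack([np.ones(6),
                     np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
                     np.array([2.0, 1.0, 3.0, 2.0, 4.0, 3.0])])
n, p = X.shape
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

tss = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - X @ beta_hat) ** 2)
r2 = 1.0 - sse / tss
r2_adj = 1.0 - (sse / (n - p)) / (tss / (n - 1))
```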