
# Multiple Linear Regression

## Multiple Linear Regression Model

Multiple linear regression involves one dependent variable and more than one independent variable. The equation that describes the multiple linear regression model is given below:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

Here $y$ is the dependent variable and $x_1, x_2, \ldots, x_k$ are the independent variables. These independent variables are used to predict the dependent variable.

$\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are the $(k+1)$ unknown regression coefficients (also called model parameters). These regression coefficients are estimated from observed sample data.
The term $\varepsilon$ (pronounced epsilon) is the random error.

## Data for Multiple Regression

Suppose that $n$ observations are collected on the response variable ($y$) and on the $k$ independent variables present in the regression model.
The data are arranged as follows, where $i = 1, 2, \ldots, n$ indexes observations and $j = 1, 2, \ldots, k$ indexes regressors:

| $y$ | $x_1$ | $x_2$ | $\cdots$ | $x_j$ | $\cdots$ | $x_k$ |
|---|---|---|---|---|---|---|
| $y_1$ | $x_{11}$ | $x_{12}$ | $\cdots$ | $x_{1j}$ | $\cdots$ | $x_{1k}$ |
| $y_2$ | $x_{21}$ | $x_{22}$ | $\cdots$ | $x_{2j}$ | $\cdots$ | $x_{2k}$ |
| $\vdots$ | $\vdots$ | $\vdots$ | | $\vdots$ | | $\vdots$ |
| $y_i$ | $x_{i1}$ | $x_{i2}$ | $\cdots$ | $x_{ij}$ | $\cdots$ | $x_{ik}$ |
| $\vdots$ | $\vdots$ | $\vdots$ | | $\vdots$ | | $\vdots$ |
| $y_n$ | $x_{n1}$ | $x_{n2}$ | $\cdots$ | $x_{nj}$ | $\cdots$ | $x_{nk}$ |

## Scalar Notation: Multiple Linear Regression

Suppose that $n$ observations are collected on the response variable ($y$) and on the $k$ independent variables present in the regression model.
In scalar notation, the regression model is:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_j x_{ij} + \cdots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, 2, \ldots, n; \quad j = 1, 2, \ldots, k$$

where $n$ is the total number of observations and $k$ is the number of independent variables.
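As a concrete illustration of this scalar form, the following minimal Python sketch generates a small synthetic dataset from the model; the coefficient values, the error standard deviation, and the use of NumPy are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 50, 2                         # number of observations and regressors (assumed)
beta = np.array([1.0, 2.0, -0.5])    # assumed true beta_0, beta_1, beta_2
sigma = 0.3                          # assumed standard deviation of the random error

x = rng.uniform(0, 10, size=(n, k))  # fixed regressor values x_ij
eps = rng.normal(0.0, sigma, size=n) # random errors epsilon_i

# y_i = beta_0 + beta_1 * x_i1 + beta_2 * x_i2 + epsilon_i
y = beta[0] + x @ beta[1:] + eps
print(y[:5])
```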

## Matrix Notation: Multiple Linear Regression

Suppose that $n$ observations are collected on the response variable ($y$) and on the $k$ independent variables present in the regression model. In matrix notation, the regression model is:

$$\mathbf{y}_{n \times 1} = \mathbf{X}_{n \times (k+1)} \, \boldsymbol{\beta}_{(k+1) \times 1} + \boldsymbol{\varepsilon}_{n \times 1}$$

where $n$ is the total number of observations, $k$ is the number of independent variables, and $\boldsymbol{\beta}$ is the vector of model parameters:

$$
\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix}, \qquad
\mathbf{X} = \begin{bmatrix}
1 & x_{11} & \cdots & x_{1j} & \cdots & x_{1k} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{i1} & \cdots & x_{ij} & \cdots & x_{ik} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{n1} & \cdots & x_{nj} & \cdots & x_{nk}
\end{bmatrix}, \qquad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_k \end{bmatrix}, \qquad
\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$
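A minimal sketch of how the response vector $\mathbf{y}$ and the design matrix $\mathbf{X}$ (with its leading column of ones for the intercept $\beta_0$) could be assembled, assuming NumPy; the data values are made up for illustration.

```python
import numpy as np

# Assumed example data: n = 5 observations, k = 2 regressors
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])       # response vector y, shape (n,)
x = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [3.0, 3.0],
              [4.0, 2.5],
              [5.0, 4.0]])                     # regressor values, shape (n, k)

# Design matrix X: a leading column of ones corresponds to the intercept beta_0
X = np.column_stack([np.ones(len(y)), x])      # shape (n, k+1)
print(X.shape)                                 # (5, 3)
```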

## Model Parameter Estimation

The error in a regression model is the difference between the actual and the predicted value; it may be positive or negative.
The error is also known as the residual, and the value predicted by the regression equation is called the fitted value (or fit).
The sum of squared differences between the actual and predicted values is known as the sum of squares of error. The least squares method minimizes this sum of squares to find the best-fitting plane.
Note that the regressor variables in the linear regression model are non-random; that is, their values are fixed.

## Model Parameter Estimation (Contd.)

In matrix notation, the regression equation is:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

Using the least squares method, we want an estimate $\hat{\boldsymbol{\beta}}$ that minimizes

$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

The least squares estimator must satisfy

$$\frac{\partial L}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^T \mathbf{y} + 2\mathbf{X}^T \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{0}
\quad \Longrightarrow \quad
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

For the $i$th observation, the fitted value is $\hat{y}_i = \mathbf{x}_i^T \hat{\boldsymbol{\beta}}$ and the residual is $e_i = y_i - \hat{y}_i$.

$$\hat{\sigma}^2 = \mathrm{MSE} = \frac{\sum_{i=1}^{n} e_i^2}{n - k - 1}, \qquad \text{where } k \text{ is the number of regressors}$$

Standard error (SE) of estimate $= \hat{\sigma} = \sqrt{\mathrm{MSE}}$

$$\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1}$$
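A sketch of these estimation formulas in NumPy, assuming a small illustrative dataset; it computes $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, the residuals, MSE, the standard error of estimate, and the estimated covariance matrix of $\hat{\boldsymbol{\beta}}$.

```python
import numpy as np

# Assumed example data: n = 6 observations, k = 2 regressors
y = np.array([3.0, 5.1, 6.9, 9.2, 10.8, 13.1])
x = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0],
              [4.0, 2.0], [5.0, 4.0], [6.0, 3.0]])
n, k = x.shape
X = np.column_stack([np.ones(n), x])           # design matrix with intercept column

# Least squares estimate: beta_hat = (X^T X)^{-1} X^T y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

y_hat = X @ beta_hat                           # fitted values
e = y - y_hat                                  # residuals
SSE = e @ e                                    # sum of squares of error
MSE = SSE / (n - k - 1)                        # estimate of sigma^2
se_estimate = np.sqrt(MSE)                     # standard error of estimate
cov_beta_hat = MSE * XtX_inv                   # estimated Var(beta_hat)
print(beta_hat, se_estimate)
```

In practice a numerically stabler routine such as `np.linalg.lstsq` is usually preferred over forming the explicit inverse; the inverse is used here only to mirror the formulas above.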

## Testing Significance of Regression Model

The test for significance of regression is a test to determine whether there is a linear relationship between the response variable and the regressor variables.

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \qquad \text{vs.} \qquad H_1: \text{at least one } \beta_j \neq 0$$

The test procedure involves an analysis of variance (ANOVA) partitioning of the total sum of squares into a sum of squares due to regression and a sum of squares due to error (residual).
Total number of model parameters $= p =$ number of regression coefficients $= k + 1$.


ANOVA table:

| Source of Variation | DF | SS | MS | $F_{cal}$ |
|---|---|---|---|---|
| Regression | $k$ | SSR | $\mathrm{MSR} = \mathrm{SSR}/k$ | $\mathrm{MSR}/\mathrm{MSE}$ |
| Residual error | $n-k-1$ | SSE | $\mathrm{MSE} = \mathrm{SSE}/(n-k-1)$ | |
| Total | $n-1$ | TSS | | |

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$$\mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\boldsymbol{\beta}}^T \mathbf{X}^T \mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$

$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \mathbf{y}^T \mathbf{y} - \hat{\boldsymbol{\beta}}^T \mathbf{X}^T \mathbf{y}$$

TSS = SSR + SSE

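A sketch of the ANOVA computation under the same assumptions (NumPy and SciPy, illustrative data): it forms TSS, SSE, SSR, the mean squares, the $F$ statistic, and its p-value.

```python
import numpy as np
from scipy import stats

# Assumed example data: n = 8 observations, k = 2 regressors
y = np.array([2.9, 5.2, 7.1, 8.8, 11.2, 12.9, 15.1, 17.0])
x = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.8], [4.0, 3.2],
              [5.0, 2.1], [6.0, 4.0], [7.0, 3.3], [8.0, 4.8]])
n, k = x.shape
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Sums of squares as defined above
TSS = np.sum((y - y.mean()) ** 2)
SSE = y @ y - beta_hat @ X.T @ y
SSR = TSS - SSE                                # also beta_hat @ X.T @ y - y.sum()**2 / n

MSR = SSR / k
MSE = SSE / (n - k - 1)
F_cal = MSR / MSE
p_value = stats.f.sf(F_cal, k, n - k - 1)      # P(F > F_cal) under H0
print(F_cal, p_value)
```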

## Significance Test of Individual Regression Coefficient
Adding an unimportant variable to the model can actually
increase the mean square error, thereby decreasing the
usefulness of the model.
The hypotheses for testing the significance of any individual regression coefficient, say $\beta_j$, are

$$H_0: \beta_j = 0 \qquad \text{vs.} \qquad H_1: \beta_j \neq 0$$

The test statistic is

$$T_{cal} = \frac{\hat{\beta}_j}{\sqrt{\hat{\sigma}^2 C_{jj}}} \sim t_{n-k-1}$$

where $\hat{\sigma}^2$ is the mean square error (MSE) and $C_{jj}$ is the $j$th diagonal element of $(\mathbf{X}^T\mathbf{X})^{-1}$. Reject $H_0$ if $|T_{cal}| > t_{\alpha/2,\, n-k-1}$.

## Confidence Interval of Mean Response

In matrix notation, the regression equation is:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \text{where } \varepsilon \sim \mathrm{Normal}(0, \sigma^2)$$

The mean response at a point $\mathbf{x}_0 = [1, x_{01}, x_{02}, \ldots, x_{0j}, \ldots, x_{0k}]^T$ follows from

$$E(\mathbf{y}) = E(\mathbf{X}\boldsymbol{\beta}) + E(\boldsymbol{\varepsilon}) = \mathbf{X}\boldsymbol{\beta} + \mathbf{0}$$

so the estimated mean response and its variance are

$$\hat{\mu}_{y|x_0} = E(y \mid \mathbf{x}_0) = \mathbf{x}_0^T \hat{\boldsymbol{\beta}}, \qquad
\mathrm{var}(\hat{y} \mid \mathbf{x}_0) = \hat{\sigma}^2 \, \mathbf{x}_0^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_0$$

and a $100(1-\alpha)\%$ confidence interval for the mean response is

$$\hat{\mu}_{y|x_0} \pm t_{\alpha/2,\, n-p} \sqrt{\hat{\sigma}^2 \, \mathbf{x}_0^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_0}$$
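A sketch of the confidence interval computation at an assumed point $\mathbf{x}_0$, using NumPy and SciPy with illustrative data.

```python
import numpy as np
from scipy import stats

# Assumed example data: n = 8 observations, k = 2 regressors
y = np.array([2.9, 5.2, 7.1, 8.8, 11.2, 12.9, 15.1, 17.0])
x = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.8], [4.0, 3.2],
              [5.0, 2.1], [6.0, 4.0], [7.0, 3.3], [8.0, 4.8]])
n, k = x.shape
p = k + 1
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
MSE = np.sum((y - X @ beta_hat) ** 2) / (n - p)

x0 = np.array([1.0, 4.5, 3.0])                  # assumed point [1, x01, x02]
mean_response = x0 @ beta_hat                   # estimated mean response at x0
se_mean = np.sqrt(MSE * x0 @ XtX_inv @ x0)      # sqrt of var(y_hat | x0)
t_crit = stats.t.ppf(0.975, n - p)              # 95% two-sided critical value
ci = (mean_response - t_crit * se_mean, mean_response + t_crit * se_mean)
print(mean_response, ci)
```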


## Coefficient of Multiple Determination

The coefficient of multiple determination is

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{TSS}} = \frac{\mathrm{TSS} - \mathrm{SSE}}{\mathrm{TSS}} = 1 - \frac{\mathrm{SSE}}{\mathrm{TSS}}$$

SSR: sum of squares due to regression
SSE: sum of squares due to error
TSS: total sum of squares

The coefficient of multiple determination is the fraction of the variation of the dependent variable explained by the regressor variables. $R^2$ measures the goodness of the linear fit: the better the linear fit, the closer $R^2$ is to 1.

## Coefficient of Multiple Determination (Contd.)

The major drawback of the coefficient of multiple determination ($R^2$) is that adding a predictor variable to the model always increases $R^2$, regardless of whether the additional variable is significant. To avoid such a situation, regression model builders prefer the adjusted $R^2$ statistic.

$$R^2_{adj} = 1 - \frac{\mathrm{SSE}/(n-p)}{\mathrm{TSS}/(n-1)} = 1 - \left(\frac{n-1}{n-p}\right)(1 - R^2)$$
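A sketch computing both $R^2$ and adjusted $R^2$ from the sums of squares, again with NumPy and illustrative data.

```python
import numpy as np

# Assumed example data: n = 8 observations, k = 2 regressors
y = np.array([2.9, 5.2, 7.1, 8.8, 11.2, 12.9, 15.1, 17.0])
x = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.8], [4.0, 3.2],
              [5.0, 2.1], [6.0, 4.0], [7.0, 3.3], [8.0, 4.8]])
n, k = x.shape
p = k + 1
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

SSE = np.sum((y - X @ beta_hat) ** 2)
TSS = np.sum((y - y.mean()) ** 2)

R2 = 1 - SSE / TSS
R2_adj = 1 - (SSE / (n - p)) / (TSS / (n - 1))  # equals 1 - (1 - R2) * (n - 1) / (n - p)
print(round(R2, 4), round(R2_adj, 4))
```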