
Applied Business Forecasting and Planning

Multiple Regression Analysis

Introduction

In simple linear regression we studied the relationship between one explanatory variable and one response variable.
Now, we look at situations where several explanatory variables work together to explain the response.

Introduction

Following our principles of data analysis, we look first at each variable separately, then at relationships among the variables.
We look at the distribution of each variable to be used in multiple regression to determine if there are any unusual patterns that may be important in building our regression analysis.

Multiple Regression

Example: In a study of direct operating cost, Y, for 67 branch offices of a consumer finance chain, four independent variables were considered:

X1: Average size of loan outstanding during the year,
X2: Average number of loans outstanding,
X3: Total number of new loan applications processed, and
X4: Office salary scale index.

The model for this example is

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

Formal Statement of the Model

General regression model:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

β0, β1, …, βk are parameters.
X1, X2, …, Xk are known constants.
ε, the error terms, are independent N(0, σ²).

Estimating the parameters of the model

The values of the regression parameters βi are not known. We estimate them from data.
As in the simple linear regression case, we use the least-squares method to fit a linear function

ŷ = b0 + b1X1 + b2X2 + … + bkXk

to the data.
The least-squares method chooses the b's that make the sum of squares of the residuals as small as possible.

Estimating the parameters of the model

The least-squares estimates are the values that minimize the quantity

Σi=1..n (yi − ŷi)²

Since the formulas for the least-squares estimates are complicated and hand calculation is out of the question, we are content to understand the least-squares principle and let software do the computations.
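As an illustration of letting software do the work, here is a minimal sketch in Python, assuming numpy is available; the data values are made up purely for illustration:

```python
import numpy as np

# Hypothetical data: 5 observations on k = 2 explanatory variables.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1])

# Add a column of ones so the intercept b0 is estimated too.
X_design = np.column_stack([np.ones(len(y)), X])

# Least squares: choose the b's that minimize the sum of squared residuals.
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ b
residuals = y - y_hat
print("b0, b1, b2:", np.round(b, 4))
print("sum of squared residuals:", np.sum(residuals**2))
```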

Estimating the parameters of the model

The estimate of βi is bi, and it indicates the change in the mean response per unit increase in Xi when the rest of the independent variables in the model are held constant.
The parameters βi are frequently called partial regression coefficients because they reflect the partial effect of one independent variable when the rest of the independent variables are included in the model and are held constant.

Estimating the parameters of the model

The observed variability of the responses about this fitted model is measured by the variance

s² = (1 / (n − k − 1)) Σi=1..n (yi − ŷi)²

and the regression standard error

s = √s²

Estimating the parameters of the model

In the model, σ² and σ measure the variability of the responses about the population regression equation.
It is natural to estimate σ² by s² and σ by s.

Analysis of Variance Table

The basic ideas of the regression ANOVA table are the same in simple and multiple regression.
The sum of squares decomposition and the associated degrees of freedom are:

Σ (yi − ȳ)²  =  Σ (ŷi − ȳ)²  +  Σ (yi − ŷi)²
    SST      =      SSR      +      SSE

df:   n − 1  =       k       +   (n − k − 1)

Analysis of Variance Table


Source       Sum of Squares   df          Mean Square             F-test
Regression   SSR              k           MSR = SSR/k             MSR/MSE
Error        SSE              n − k − 1   MSE = SSE/(n − k − 1)
Total        SST              n − 1
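To make the decomposition concrete, here is a small sketch that computes SST, SSR, SSE and the F ratio directly from a fitted model; the data and variable names are made up for illustration and are not from the slides:

```python
import numpy as np

# Small made-up data set: n = 6 observations, k = 2 explanatory variables.
X = np.array([[1., 2.], [2., 1.], [3., 4.], [4., 3.], [5., 6.], [6., 5.]])
y = np.array([3.0, 4.1, 7.0, 8.2, 10.9, 12.1])
n, k = X.shape

X_design = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ b

SST = np.sum((y - y.mean())**2)        # total sum of squares, df = n - 1
SSR = np.sum((y_hat - y.mean())**2)    # regression sum of squares, df = k
SSE = np.sum((y - y_hat)**2)           # error sum of squares, df = n - k - 1

MSR = SSR / k
MSE = SSE / (n - k - 1)
F = MSR / MSE
print(f"SST={SST:.3f}  SSR+SSE={SSR + SSE:.3f}  F={F:.3f}")
```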

F-test for the overall fit of the model

To test the statistical significance of the regression relation between the response variable y and the set of variables x1, …, xk, i.e., to choose between the alternatives:

H0: β1 = β2 = … = βk = 0
Ha: not all βi (i = 1, …, k) equal zero

we use the test statistic:

F = MSR / MSE

F-test for the overall fit of the model

The decision rule at significance level α is:

Reject H0 if F > F(α; k, n − k − 1)

where the critical value F(α; k, n − k − 1) can be found from an F-table.
The existence of a regression relation by itself does not assure that useful predictions can be made by using it.
Note that when k = 1, this test reduces to the F-test for testing in simple linear regression whether or not β1 = 0.
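If an F-table is not at hand, the critical value and p-value can be obtained from software. A minimal sketch, assuming scipy is available and using purely illustrative values of α, k, n, and an observed F statistic:

```python
from scipy import stats

# Illustrative values (not from the slides): alpha, k predictors, n observations.
alpha, k, n = 0.05, 4, 20
F_observed = 13.45                                   # hypothetical observed F

F_critical = stats.f.ppf(1 - alpha, k, n - k - 1)    # upper-tail critical value
p_value = stats.f.sf(F_observed, k, n - k - 1)       # P(F > F_observed)
print(f"critical F = {F_critical:.2f}, observed F = {F_observed}, p = {p_value:.2e}")
# Reject H0 when F_observed > F_critical (equivalently, when p < alpha).
```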

Interval Estimation of βi

For our regression model, we have:

(bi − βi) / s(bi) has a t-distribution with n − k − 1 degrees of freedom.

Therefore, an interval estimate for βi with 1 − α confidence coefficient is:

bi ± t(α/2; n − k − 1) s(bi)

where s(bi) is the estimated standard error of bi (in simple linear regression, s(b1) = √(MSE / Σ(x − x̄)²); in multiple regression the s(bi) are taken from the regression output).

Significance Tests for βi

To test:

H0: βi = 0
Ha: βi ≠ 0

we may use the test statistic:

t = bi / s(bi)

Reject H0 if t > t(α/2; n − k − 1) or t < −t(α/2; n − k − 1).
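The same calculation can be done in software. A small sketch, assuming scipy is available; the values of bi and s(bi) below are hypothetical:

```python
from scipy import stats

# Hypothetical coefficient estimate and its standard error, with k predictors and n observations.
b_i, s_bi = 0.44, 0.135
k, n, alpha = 4, 20, 0.05

t_stat = b_i / s_bi
t_crit = stats.t.ppf(1 - alpha / 2, n - k - 1)         # two-sided critical value
ci = (b_i - t_crit * s_bi, b_i + t_crit * s_bi)         # 1 - alpha confidence interval
p_value = 2 * stats.t.sf(abs(t_stat), n - k - 1)
print(f"t = {t_stat:.2f}, critical = ±{t_crit:.2f}, CI = ({ci[0]:.3f}, {ci[1]:.3f}), p = {p_value:.4f}")
```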

Multiple Regression Model Building

Often we have many explanatory variables, and our goal is to use these to explain the variation in the response variable.
A model using just a few of the variables often predicts about as well as the model using all the explanatory variables.

Multiple Regression Model Building

We may find that the reciprocal of a variable is a better choice than the variable itself, or that including the square of an explanatory variable improves prediction.
We may find that the effect of one explanatory variable depends upon the value of another explanatory variable. We account for this situation by including interaction terms.

Multiple Regression Model Building

The simplest way to construct an interaction term is to multiply the two explanatory variables together (a sketch follows below).
How can we find a good model?
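Before turning to that question, here is a minimal sketch of constructing and fitting the interaction term just described, assuming pandas and numpy are available; the column names and numbers are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with two explanatory variables and a response.
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0],
    "y":  [3.2, 3.8, 7.5, 7.9, 11.4],
})

# The simplest interaction term: the product of the two explanatory variables.
df["x1_x2"] = df["x1"] * df["x2"]

# Fit y on x1, x2, and the interaction by ordinary least squares.
X = np.column_stack([np.ones(len(df)), df[["x1", "x2", "x1_x2"]].to_numpy()])
b, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
print("b0, b1, b2, b3 (interaction):", np.round(b, 3))
```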

Selecting the Best Regression Equation

After a lengthy list of potentially useful independent variables has been compiled, some of the independent variables can be screened out. An independent variable
may not be fundamental to the problem,
may be subject to large measurement error, or
may effectively duplicate another independent variable in the list.

Selecting the Best Regression Equation

Once the investigator has tentatively decided upon the functional forms of the regression relations (linear, quadratic, etc.), the next step is to obtain the subset of explanatory variables (x) that best explains the variability in the response variable y.

Selecting the Best Regression Equation

An automatic search procedure that sequentially develops the subset of explanatory variables to be included in the regression model is called a stepwise procedure.
It was developed to economize on computational effort.
It ends with the identification of a single regression model as best.

Example: Sales Forecasting

Sales Forecasting

Multiple regression is a popular technique for predicting product sales with the help of other variables that are likely to have a bearing on sales.
Example
The growth of cable television has created vast new potential in the home entertainment business. The following table gives the values of several variables measured in a random sample of 20 local television stations which offer their programming to cable subscribers. A TV industry analyst wants to build a statistical model for predicting the number of subscribers that a cable station can expect.

Example: Sales Forecasting

Y = Number of cable subscribers (SUBSCRIB)
X1 = Advertising rate which the station charges local advertisers for one minute of prime time space (ADRATE)
X2 = Kilowatt power of the station's non-cable signal (KILOWATT, labeled SIGNAL in the regression output)
X3 = Number of families living in the station's area of dominant influence (ADI), a geographical division of radio and TV audiences (APIPOP)
X4 = Number of competing stations in the ADI (COMPETE)

Example: Sales Forecasting

The sample data are fitted by a multiple regression model using Excel.
The marginal t-test provides a way of choosing the variables for inclusion in the equation.
The fitted model is

SUBSCRIB = β0 + β1 ADRATE + β2 APIPOP + β3 COMPETE + β4 SIGNAL + ε

Example: Sales Forecasting

Excel summary output:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.884267744
R Square             0.781929444
Adjusted R Square    0.723777295
Standard Error       142.9354188
Observations         20

ANOVA
             df    SS            MS            F             Significance F
Regression   4     1098857.84    274714.4601   13.44626923   7.52E-05
Residual     15    306458.0092   20430.53395
Total        19    1405315.85

            Coefficients    Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   51.42007002     98.97458277      0.51952803     0.610973806   -159.539       262.3795
AD_Rate     -0.267196347    0.081055107      -3.296477624   0.004894126   -0.43996       -0.09443
Signal      -0.020105139    0.045184758      -0.444954014   0.662706578   -0.11641       0.076204
APIPOP      0.440333955     0.135200486      3.256896248    0.005307766   0.152161       0.728507
Compete     16.230071       26.47854322      0.61295181     0.549089662   -40.2076       72.66778

Example: Sales Forecasting

Do we need all four variables in the model?
Based on the partial t-tests, the variables SIGNAL and COMPETE are the least significant variables in our model.
Let's drop the least significant variables one at a time (a sketch of this backward-elimination loop follows).
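A minimal sketch of that loop, assuming statsmodels is available and the sample sits in a pandas data frame with the column names used on these slides; the data frame name `cable` is an assumption, not part of the slides:

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(y, X, alpha=0.05):
    """Drop the least significant predictor one at a time, using the marginal t-test."""
    cols = list(X.columns)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = fit.pvalues.drop("const")      # p-values of the predictors only
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:              # everything left is significant
            return fit
        cols.remove(worst)                     # e.g. drop SIGNAL first, then COMPETE
    return None

# Usage (assuming a data frame `cable` with the columns named as in the slides):
# fit = backward_eliminate(cable["SUBSCRIB"],
#                          cable[["ADRATE", "SIGNAL", "APIPOP", "COMPETE"]])
# print(fit.summary())
```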

Example: Sales Forecasting

Excel summary output:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.882638739
R Square             0.779051144
Adjusted R Square    0.737623233
Standard Error       139.3069743
Observations         20

ANOVA
             df    SS            MS           F             Significance F
Regression   3     1094812.92    364937.64    18.80498277   1.69966E-05
Residual     16    310502.9296   19406.4331
Total        19    1405315.85

            Coefficients    Standard Error   t Stat        P-value       Lower 95%       Upper 95%
Intercept   51.31610447     96.4618242       0.531983558   0.602046756   -153.1737817    255.806
AD_Rate     -0.259538026    0.077195983      -3.36206646   0.003965102   -0.423186162    -0.09589
APIPOP      0.433505145     0.130916687      3.311305499   0.004412929   0.15597423      0.711036
Compete     13.92154404     25.30614013      0.550125146   0.589831583   -39.72506442    67.56815

Example: Sales Forecasting

The variable COMPETE is the next variable to drop.

Example: Sales Forecasting

Excel summary output:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.8802681
R Square             0.774871928
Adjusted R Square    0.748386273
Standard Error       136.4197776
Observations         20

ANOVA
             df    SS            MS            F            Significance F
Regression   2     1088939.802   544469.901    29.2562866   3.13078E-06
Residual     17    316376.0474   18610.35573
Total        19    1405315.85

            Coefficients    Standard Error   t Stat         P-value       Lower 95%       Upper 95%
Intercept   96.28121395     50.16415506      1.919322948    0.07188916    -9.556049653    202.1184776
AD_Rate     -0.254280696    0.075014548      -3.389751739   0.003484198   -0.41254778     -0.096013612
APIPOP      0.495481252     0.065306012      7.587069489    7.45293E-07   0.357697418     0.633265086

Example: Sales Forecasting

All the variables in the model are statistically significant, therefore our final model is:

Final model:

SUBSCRIB = 96.28 − 0.25 ADRATE + 0.495 APIPOP

Interpreting the Final Model

What is the interpretation of the estimated parameters?
Is the association positive or negative?
Does this make sense intuitively, based on what the data represent?
What other variables could be confounders?
Are there other analyses that you might consider doing?
New questions raised?

Multicollinearity

In multiple regression analysis, one is often concerned with the nature and significance of the relations between the explanatory variables and the response variable.
Questions that are frequently asked are:
What is the relative importance of the effects of the different independent variables?
What is the magnitude of the effect of a given independent variable on the dependent variable?

Multicollinearity

Can any independent variable be dropped from the model because it has little or no effect on the dependent variable?
Should any independent variables not yet included in the model be considered for possible inclusion?
Simple answers can be given to these questions if
the independent variables in the model are uncorrelated among themselves, and
they are uncorrelated with any other independent variables that are related to the dependent variable but omitted from the model.

Multicollinearity

When the independent variables are correlated among themselves, multicollinearity (or collinearity) among them is said to exist.
In many non-experimental situations in business, economics, and the social and biological sciences, the independent variables tend to be correlated among themselves.
For example, in a regression of family food expenditures on the variables family income, family savings, and the age of the head of household, the explanatory variables will be correlated among themselves.

Multicollinearity

Further, the explanatory variables will also be correlated with other socioeconomic variables not included in the model that do affect family food expenditures, such as family size.

Multicollinearity
Some key problems that typically arise when the explanatory variables being considered for the regression model are highly correlated among themselves are:
1. Adding or deleting an explanatory variable changes the regression coefficients.
2. The estimated standard deviations of the regression coefficients become large when the explanatory variables in the regression model are highly correlated with each other.
3. The estimated regression coefficients individually may not be statistically significant even though a definite statistical relation exists between the response variable and the set of explanatory variables.

Multicollinearity Diagnostics

A formal method of detecting the presence of multicollinearity that is widely used is the Variance Inflation Factor (VIF).
It measures how much the variances of the estimated regression coefficients are inflated as compared to when the independent variables are not linearly related.

VIFj = 1 / (1 − Rj²),   j = 1, 2, …, k

Rj² is the coefficient of determination from the regression of the jth independent variable on the remaining k − 1 independent variables.
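A minimal sketch of computing VIFs directly from this definition, running each auxiliary regression by ordinary least squares; the data frame name `cable` and its column names are assumptions used only for the usage comment:

```python
import numpy as np
import pandas as pd

def vif(X: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on the other columns."""
    out = {}
    for col in X.columns:
        others = X.drop(columns=col)
        A = np.column_stack([np.ones(len(X)), others.to_numpy()])
        yj = X[col].to_numpy()
        b, *_ = np.linalg.lstsq(A, yj, rcond=None)
        resid = yj - A @ b
        r2 = 1 - (resid @ resid) / np.sum((yj - yj.mean())**2)
        out[col] = 1 / (1 - r2)
    return pd.Series(out)

# Usage (assuming a data frame `cable` holding the explanatory variables):
# print(vif(cable[["ADRATE", "SIGNAL", "APIPOP", "COMPETE"]]))
```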

Multicollinearity Diagnostics

A VIF near 1 suggests that multicollinearity is not a problem for that independent variable: its estimated coefficient and associated t value will not change much as the other independent variables are added to or deleted from the regression equation.
A VIF much greater than 1 indicates the presence of multicollinearity. A maximum VIF value in excess of 10 is often taken as an indication that multicollinearity may be unduly influencing the least squares estimates: the estimated coefficient attached to the variable is unstable, and its associated t statistic may change considerably as the other independent variables are added or deleted.

Multicollinearity Diagnostics

The simple correlation coefficients between all pairs of explanatory variables (i.e., X1, X2, …, Xk) are helpful in selecting appropriate explanatory variables for a regression model and are also useful for examining multicollinearity.
While a correlation very close to +1 or −1 does suggest multicollinearity, the absence of high correlations between pairs of explanatory variables does not guarantee that multicollinearity is absent (unless there are only two explanatory variables).

Example: Sales Forecasting

Pearson Correlation Coefficients, N = 20
Prob > |r| under H0: Rho = 0

           SUBSCRIB   ADRATE     KILOWATT   APIPOP    COMPETE
SUBSCRIB   1.00000    -0.02848   0.44762    0.90447   0.79832
                      0.9051     0.0478     <.0001    <.0001
ADRATE     -0.02848   1.00000    -0.01021   0.32512   0.34147
           0.9051                0.9659     0.1619    0.1406
KILOWATT   0.44762    -0.01021   1.00000    0.45303   0.46895
           0.0478     0.9659                0.0449    0.0370
APIPOP     0.90447    0.32512    0.45303    1.00000   0.87592
           <.0001     0.1619     0.0449               <.0001
COMPETE    0.79832    0.34147    0.46895    0.87592   1.00000
           <.0001     0.1406     0.0370     <.0001

Example: Sales Forecasting

SUBSCRIB = 51.42 − 0.27 ADRATE − 0.02 SIGNAL + 0.44 APIPOP + 16.23 COMPETE

SUBSCRIB = 51.32 − 0.26 ADRATE + 0.43 APIPOP + 13.92 COMPETE

SUBSCRIB = 96.28 − 0.25 ADRATE + 0.495 APIPOP

Example: Sales Forecasting

VIF calculation:
Fit the model

APIPOP = β0 + β1 SIGNAL + β2 ADRATE + β3 COMPETE + ε

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.878054
R Square             0.770978
Adjusted R Square    0.728036
Standard Error       264.3027
Observations         20

ANOVA
             df    SS        MS        F         Significance F
Regression   3     3762601   1254200   17.9541   2.25472E-05
Residual     16    1117695   69855.92
Total        19    4880295

            Coefficients   Standard Error   t Stat     P-value    Lower 95%       Upper 95%
Intercept   -472.685       139.7492         -3.38238   0.003799   -768.9402258    -176.43
Compete     159.8413       28.29157         5.649786   3.62E-05   99.86587622     219.8168
ADRATE      0.048173       0.149395         0.322455   0.751283   -0.268529713    0.364876
Signal      0.037937       0.083011         0.457012   0.653806   -0.138038952    0.213913

Example: Sales Forecasting

Fit the model

COMPETE = β0 + β1 ADRATE + β2 APIPOP + β3 SIGNAL + ε

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.882936
R Square             0.779575
Adjusted R Square    0.738246
Standard Error       1.34954
Observations         20

ANOVA
             df    SS         MS         F          Significance F
Regression   3     103.0599   34.35329   18.86239   1.66815E-05
Residual     16    29.14013   1.821258
Total        19    132.2

            Coefficients   Standard Error   t Stat     P-value    Lower 95%       Upper 95%
Intercept   3.10416        0.520589         5.96278    1.99E-05   2.000559786     4.20776
ADRATE      0.000491       0.000755         0.649331   0.525337   -0.001110874    0.002092
Signal      0.000334       0.000418         0.799258   0.435846   -0.000552489    0.001221
APIPOP      0.004167       0.000738         5.649786   3.62E-05   0.002603667     0.005731

Example: Sales Forecasting

Fit the model

SIGNAL = β0 + β1 ADRATE + β2 APIPOP + β3 COMPETE + ε

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.512244
R Square             0.262394
Adjusted R Square    0.124092
Standard Error       790.8387
Observations         20

ANOVA
             df    SS         MS         F          Significance F
Regression   3     3559789    1186596    1.897261   0.170774675
Residual     16    10006813   625425.8
Total        19    13566602

            Coefficients   Standard Error   t Stat     P-value    Lower 95%       Upper 95%
Intercept   5.171093       547.6089         0.009443   0.992582   -1155.707711    1166.05
APIPOP      0.339655       0.743207         0.457012   0.653806   -1.235874129    1.915184
Compete     114.8227       143.6617         0.799258   0.435846   -189.7263711    419.3718
ADRATE      -0.38091       0.438238         -0.86919   0.397593   -1.309935875    0.548109

Example: Sales Forecasting

Fit the model

ADRATE = β0 + β1 SIGNAL + β2 APIPOP + β3 COMPETE + ε

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.399084
R Square             0.159268
Adjusted R Square    0.001631
Standard Error       440.8588
Observations         20

ANOVA
             df    SS         MS         F          Significance F
Regression   3     589101.7   196367.2   1.010346   0.413876018
Residual     16    3109703    194356.5
Total        19    3698805

            Coefficients   Standard Error   t Stat     P-value    Lower 95%       Upper 95%
Intercept   253.7304       298.6063         0.849716   0.408018   -379.2865355    886.7474
Signal      -0.11837       0.136186         -0.86919   0.397593   -0.407073832    0.170329
APIPOP      0.134029       0.415653         0.322455   0.751283   -0.747116077    1.015175
Compete     52.3446        80.61309         0.649331   0.525337   -118.5474784    223.2367

Example: Sales Forecasting

VIF calculation results:

Variable   R-Squared   VIF
ADRATE     0.159268    1.19
COMPETE    0.779575    4.54
SIGNAL     0.262394    1.36
APIPOP     0.770978    4.36

There is no significant multicollinearity.

Qualitative Independent Variables

Many variables of interest in business, economics, and the social and biological sciences are not quantitative but qualitative.
Examples of qualitative variables are gender (male, female), purchase status (purchase, no purchase), and type of firm.
Qualitative variables can also be used in multiple regression.

Qualitative Independent Variables

An economist wished to relate the speed with which a particular insurance innovation is adopted (y) to the size of the insurance firm (x1) and the type of firm. The dependent variable is measured by the number of months elapsed between the time the first firm adopted the innovation and the time the given firm adopted the innovation. The first independent variable, size of the firm, is quantitative, and is measured by the amount of total assets of the firm. The second independent variable, type of firm, is qualitative and is composed of two classes: stock companies and mutual companies.

Indicator variables
Indicator, or dummy, variables are used to determine the relationship between qualitative independent variables and a dependent variable.
Indicator variables take on the values 0 and 1.
For the insurance innovation example, where the qualitative variable has two classes, we might define the indicator variable x2 as follows:

x2 = 1 if stock company
x2 = 0 otherwise

Indicator variables

A qualitative variable with c classes will be represented by c − 1 indicator variables.
A regression function with an indicator variable with two levels (c = 2) will yield two estimated lines.

Interpretation of Regression Coefficients

In our insurance innovation example, the regression model is:

y = β0 + β1x1 + β2x2 + ε

where:
x1 = size of firm
x2 = 1 if stock company, 0 otherwise

Interpretation of Regression Coefficients

To understand the meaning of the regression coefficients in this model, consider first the case of a mutual firm. For such a firm, x2 = 0 and we have:

ŷ = b0 + b1x1 + b2(0) = b0 + b1x1     (mutual firms)

For a stock firm, x2 = 1 and the response function is:

ŷ = b0 + b1x1 + b2(1) = (b0 + b2) + b1x1     (stock firms)

Interpretation of Regression Coefficients

The response function for the mutual firms is a straight line, with y intercept β0 and slope β1.
For stock firms, this also is a straight line, with the same slope β1 but with y intercept β0 + β2.
With reference to the insurance innovation example, the mean time elapsed before the innovation is adopted is a linear function of the size of firm (x1), with the same slope β1 for both types of firms.

Interpretation of Regression Coefficients

β2 indicates how much lower or higher the response function for stock firms is than the one for mutual firms.
β2 measures the differential effect of type of firm.
In general, β2 shows how much higher (lower) the mean response line is for the class coded 1 than the line for the class coded 0, for any level of x1.

Example: Insurance Innovation Adoption

Here is the data set for the insurance innovation example:

Months Elapsed   Size   Type of firm (x2)   Type
17               151    0                   Mutual
26               92     0                   Mutual
21               175    0                   Mutual
30               31     0                   Mutual
22               104    0                   Mutual
0                277    0                   Mutual
12               210    0                   Mutual
19               120    0                   Mutual
4                290    0                   Mutual
16               238    1                   Stock
28               164    1                   Stock
15               272    1                   Stock
11               295    1                   Stock
38               68     1                   Stock
31               85     1                   Stock
21               224    1                   Stock
20               166    1                   Stock
13               305    1                   Stock
30               124    1                   Stock
14               246    1                   Stock

Example: Insurance Innovation Adoption

Fitting the regression model

y = β0 + β1x1 + β2x2 + ε

where
x1 = size of firm
x2 = 1 if stock company, 0 otherwise

the fitted response function is:

ŷ = 33.87 − 0.1061x1 + 8.77x2

Example: Insurance Innovation Adoption


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.95993655
R Square             0.92147818
Adjusted R Square    0.91224031
Standard Error       2.78630562
Observations         20

ANOVA
             df    SS            MS         F          Significance F
Regression   2     1548.820517   774.4103   99.75016   4.04966E-10
Residual     17    131.979483    7.763499
Total        19    1680.8

               Coefficients   Standard Error   t Stat     P-value    Lower 95%       Upper 95%
Intercept      33.8698658     1.562588138      21.67549   8E-14      30.57308841     37.16664321
Size           -0.10608882    0.007799653      -13.6017   1.45E-10   -0.122544675    -0.089632969
type of firm   8.76797549     1.286421264      6.815789   3.01E-06   6.053860079     11.4820909

Example: Insurance Innovation Adoption

The fitted response function is:

ŷ = 33.87 − 0.1061x1 + 8.77x2

Stock firms' response function is:

ŷ = (33.87 + 8.77) − 0.1061x1

Mutual firms' response function is:

ŷ = 33.87 − 0.1061x1

Interpretation?
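A small sketch that refits this model with numpy, using the data table shown earlier (the 0/1 coding of x2 follows that table); it prints the two estimated response functions rather than assuming the exact coefficients above:

```python
import numpy as np

# Data from the insurance innovation table: y = months elapsed, x1 = size,
# x2 = 1 for stock companies, 0 for mutual companies.
y  = np.array([17, 26, 21, 30, 22, 0, 12, 19, 4, 16,
               28, 15, 11, 38, 31, 21, 20, 13, 30, 14], dtype=float)
x1 = np.array([151, 92, 175, 31, 104, 277, 210, 120, 290, 238,
               164, 272, 295, 68, 85, 224, 166, 305, 124, 246], dtype=float)
x2 = np.array([0]*9 + [1]*11, dtype=float)

X = np.column_stack([np.ones(len(y)), x1, x2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"mutual firms: y_hat = {b0:.2f} + ({b1:.4f}) x1")
print(f"stock  firms: y_hat = {b0 + b2:.2f} + ({b1:.4f}) x1")
```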

Accounting for Seasonality in a Multiple Regression Model

Seasonal patterns are not easily accounted for by the typical causal variables that we use in regression analysis.
An indicator variable can be used effectively to account for seasonality in our time series data.
The number of seasonal indicator variables to use depends on the data.
If we have p periods in our data series, we cannot use more than p − 1 seasonal indicator variables.

Example: Private Housing Starts (PHS)

Housing starts in the United States are measured in thousands of units. These data are plotted for 1990Q1 through 1999Q4. There are typically few housing starts during the first quarter of the year (January, February, March); there is usually a big increase in the second quarter (April, May, June), followed by some decline in the third quarter (July, August, September), and a further decline in the fourth quarter (October, November, December).

Example: Private Housing Starts (PHS)

[Figure: Private Housing Starts (PHS) in Thousands of Units, plotted quarterly from 1990 to 1999; "1" marks the first quarter of each year.]

Example: Private Housing Starts (PHS)

To account for and measure this seasonality in a regression model, we will use three dummy variables: Q2 for the second quarter, Q3 for the third quarter, and Q4 for the fourth quarter. These will be coded as follows (a short sketch of building them in pandas follows this list):
Q2 = 1 for all second quarters and zero otherwise.
Q3 = 1 for all third quarters and zero otherwise.
Q4 = 1 for all fourth quarters and zero otherwise.
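A minimal sketch of building these indicator variables in pandas from a quarterly date index; the data frame layout is an assumption for illustration, not the slides' spreadsheet:

```python
import pandas as pd

# Quarterly index; in the slides the data run from 1990Q1 to 1999Q4.
periods = pd.period_range("1990Q1", "1999Q4", freq="Q")
df = pd.DataFrame({"PERIOD": periods})

# Q2, Q3, Q4 are 1 in their own quarter and 0 otherwise; Q1 is the base quarter.
for q in (2, 3, 4):
    df[f"Q{q}"] = (df["PERIOD"].dt.quarter == q).astype(int)

print(df.head(8))
```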

Example: Private Housing Starts (PHS)

Data for private housing starts (PHS), the mortgage rate (MR), and these seasonal indicator variables are shown in the following slide.
Examine the data carefully to verify your understanding of the coding for Q2, Q3, Q4.
Since we have assigned dummy variables for the second, third, and fourth quarters, the first quarter is the base quarter for our regression model.
Note that any quarter could be used as the base, with indicator variables to adjust for differences in the other quarters.

Example: Private Housing Starts (PHS)


PERIOD      PHS     MR        Q2   Q3   Q4
31-Mar-90   217     10.1202   0    0    0
30-Jun-90   271.3   10.3372   1    0    0
30-Sep-90   233     10.1033   0    1    0
31-Dec-90   173.6   9.9547    0    0    1
31-Mar-91   146.7   9.5008    0    0    0
30-Jun-91   254.1   9.5265    1    0    0
30-Sep-91   239.8   9.2755    0    1    0
31-Dec-91   199.8   8.6882    0    0    1
31-Mar-92   218.5   8.7098    0    0    0
30-Jun-92   296.4   8.6782    1    0    0
30-Sep-92   276.4   8.0085    0    1    0
31-Dec-92   238.8   8.2052    0    0    1
31-Mar-93   213.2   7.7332    0    0    0
30-Jun-93   323.7   7.4515    1    0    0
30-Sep-93   309.3   7.0778    0    1    0
31-Dec-93   279.4   7.0537    0    0    1
31-Mar-94   252.6   7.2958    0    0    0
30-Jun-94   354.2   8.4370    1    0    0
30-Sep-94   325.7   8.5882    0    1    0
31-Dec-94   265.9   9.0977    0    0    1
31-Mar-95   214.2   8.8123    0    0    0
30-Jun-95   296.7   7.9470    1    0    0
30-Sep-95   308.2   7.7012    0    1    0
31-Dec-95   257.2   7.3508    0    0    1
31-Mar-96   240     7.2430    0    0    0
30-Jun-96   344.5   8.1050    1    0    0
30-Sep-96   324     8.1590    0    1    0
31-Dec-96   252.4   7.7102    0    0    1
31-Mar-97   237.8   7.7905    0    0    0
30-Jun-97   324.5   7.9255    1    0    0
30-Sep-97   314.6   7.4692    0    1    0
31-Dec-97   256.8   7.1980    0    0    1
31-Mar-98   258.4   7.0547    0    0    0
30-Jun-98   360.4   7.0938    1    0    0
30-Sep-98   348     6.8657    0    1    0
31-Dec-98   304.6   6.7633    0    0    1
31-Mar-99   294.1   6.8805    0    0    0
30-Jun-99   377.1   7.2037    1    0    0
30-Sep-99   355.6   7.7990    0    1    0
31-Dec-99   308.1   7.8338    0    0    1

Example: Private Housing Starts (PHS)

The regression model for private housing starts (PHS) is:

PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + ε

In this model we expect b1 to have a negative sign, and we would expect b2, b3, and b4 all to have positive signs. Why?
Regression results for this model are shown in the next slide.

Example: Private Housing Starts (PHS)


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.885398221
R Square             0.78393001
Adjusted R Square    0.759236296
Standard Error       26.4498851
Observations         40

ANOVA
             df    SS            MS            F             Significance F
Regression   4     88837.93624   22209.48406   31.74613731   3.33637E-11
Residual     35    24485.87476   699.5964217
Total        39    113323.811

            Coefficients    Standard Error   t Stat         P-value       Lower 95%       Upper 95%
Intercept   473.0650749     35.54169837      13.31014264    2.93931E-15   400.9115031     545.2186467
MR          -30.04838192    4.257226391      -7.058206249   3.21421E-08   -38.69102153    -21.40574231
Q2          95.74106935     11.84748487      8.081130334    1.6292E-09    71.689367       119.7927717
Q3          73.92904763     11.82881519      6.249911462    3.62313E-07   49.91524679     97.94284847
Q4          20.54778131     11.84139803      1.73524961     0.091495355   -3.491564078    44.5871267

Example: Private Housing Starts (PHS)

Use the prediction equation to make a forecast for the fourth quarter of 1999.
Prediction equation:

PHS = 473.06 − 30.05(MR) + 95.74(Q2) + 73.93(Q3) + 20.55(Q4)
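As a quick check of how the equation is used, here is the arithmetic for the fourth quarter of 1999, taking MR = 7.8338 from the data table; the small script is just a convenience for the calculation:

```python
# Forecast for 1999Q4 with the fitted model: Q4 = 1, Q2 = Q3 = 0.
MR, Q2, Q3, Q4 = 7.8338, 0, 0, 1
PHS_hat = 473.06 - 30.05 * MR + 95.74 * Q2 + 73.93 * Q3 + 20.55 * Q4
print(round(PHS_hat, 1))  # roughly 258, versus an actual 1999Q4 value of 308.1
```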

Example: Private Housing Starts (PHS)

[Figure: Private Housing Starts (PHS) with a Simple Regression Forecast (PHSF1) and a Multiple Regression Forecast (PHSF2), in Thousands of Units, plotted quarterly from 1990 to 1998.]

Regression Diagnostics and Residual Analysis

It is important to check the adequacy of the model before it becomes part of the decision making process.
Residual plots can be used to check the model assumptions.
It is important to study outlying observations to decide whether they should be retained or eliminated.
If retained, decide whether their influence should be reduced in the fitting process or whether the regression function should be revised.

Time Series Data and the Problem of Serial Correlation

In the regression models we assume that the errors εi are independent.
In business and economics, many regression applications involve time series data.
For such data, the assumption of uncorrelated or independent error terms is often not appropriate.

Problems of Serial Correlation

If the error terms in the regression model are autocorrelated, the use of ordinary least squares procedures has a number of important consequences:
MSE underestimates the variance of the error terms.
The confidence intervals and tests using the t and F distributions are no longer strictly applicable.
The standard errors of the regression coefficients underestimate the variability of the estimated regression coefficients. Spurious regression can result.

First order serial correlation

The error term in the current period is directly related to the error term in the previous time period.
Let the subscript t represent time; then the simple linear regression model is:

yt = β0 + β1xt + εt,   where εt = ρεt−1 + νt

ρ = the parameter that measures the correlation between adjacent error terms
νt = normally distributed error terms with mean zero and variance σ²

Example

The effect of positive serial correlation in a simple linear regression model:
Misleading forecasts of future y values.
The standard error of the estimate, s_y.x, will underestimate the variability of the y's about the true regression line.
Strong autocorrelation can make two unrelated variables appear to be related.

Durbin-Watson Test for Serial Correlation

Recall the first-order serial correlation model:

yt = β0 + β1xt + εt,   εt = ρεt−1 + νt

The hypotheses to be tested are:

H0: ρ = 0
Ha: ρ > 0

The alternative hypothesis is ρ > 0 since business and economic time series tend to show positive correlation.

Durbin-Watson Test for Serial Correlation

The Durbin-Watson statistic is defined as

DW = Σt=2..n (et − et−1)² / Σt=1..n et²

where
et = yt − ŷt, the residual for time period t
et−1 = yt−1 − ŷt−1, the residual for time period t − 1

Durbin-Watson Test for Serial Correlation

The autocorrelation coefficient ρ can be estimated by the lag 1 residual autocorrelation r1(e):

r1(e) = Σt=2..n et et−1 / Σt=1..n et²

and it can be shown that

DW ≈ 2(1 − r1(e))
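A minimal sketch of computing DW from a residual series and checking the DW ≈ 2(1 − r1) relation; the residual values below are made up for illustration:

```python
import numpy as np

def durbin_watson(residuals) -> float:
    """DW = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e)**2) / np.sum(e**2)

# Quick check of the relation DW ~ 2(1 - r1) on an invented residual series.
e = np.array([0.5, 0.4, 0.45, 0.2, 0.1, -0.1, -0.3, -0.2, -0.4, -0.35])
r1 = np.sum(e[1:] * e[:-1]) / np.sum(e**2)
print(round(durbin_watson(e), 3), round(2 * (1 - r1), 3))
```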

Durbin-Watson Test for Serial Correlation

Since −1 < r1(e) < 1, then 0 < DW < 4.
If r1(e) = 0, then DW = 2 (there is no correlation).
If r1(e) > 0, then DW < 2 (positive correlation).
If r1(e) < 0, then DW > 2 (negative correlation).

Durbin-Watson Test for Serial Correlation

Decision rule:
If DW > U, do not reject H0.
If DW < L, reject H0.
If L ≤ DW ≤ U, the test is inconclusive.

The critical upper (U) and lower (L) bounds can be found in the Durbin-Watson table of your textbook.
To use this table you need to know the significance level (α), the number of independent variables in the model (k), and the sample size (n).

Example

The Blaisdell Company wished to predict its sales by using industry sales as a predictor variable. The following table gives seasonally adjusted quarterly data on company sales and industry sales for the period 1983-1987.

Example
Year   Quarter   t    CompSale   InduSale
1983   1         1    20.96      127.3
1983   2         2    21.4       130
1983   3         3    21.96      132.7
1983   4         4    21.52      129.4
1984   1         5    22.39      135
1984   2         6    22.76      137.1
1984   3         7    23.48      141.2
1984   4         8    23.66      142.8
1985   1         9    24.1       145.5
1985   2         10   24.01      145.3
1985   3         11   24.54      148.3
1985   4         12   24.3       146.4
1986   1         13   25         150.2
1986   2         14   25.64      153.1
1986   3         15   26.36      157.3
1986   4         16   26.98      160.7
1987   1         17   27.52      164.2
1987   2         18   27.78      165.6
1987   3         19   28.24      168.7
1987   4         20   28.78      171.7

Example

[Figure: Blaisdell Company Example, scatter plot of Company Sales ($ millions) against Industry Sales ($ millions).]

Example

The scatter plot suggests that a linear regression model is appropriate.
The least squares method was used to fit a regression line to the data.
The residuals were plotted against the fitted values.
The plot shows that the residuals are consistently above or below the fitted value for extended periods.

Example

Example

To confirm this graphic diagnosis we will use the Durbin-Watson test for:

H0: ρ = 0
Ha: ρ > 0

The test statistic is:

DW = Σt=2..n (et − et−1)² / Σt=1..n et²

Example
Year   Qtr   t    y (CompSale)   x (InduSale)   et         et − et−1   (et − et−1)²   et²
1983   1     1    20.96          127.3          -0.02605                              0.000679
1983   2     2    21.4           130            -0.06202   -0.03596    0.001293       0.003846
1983   3     3    21.96          132.7          0.022021   0.084036    0.007062       0.000485
1983   4     4    21.52          129.4          0.163754   0.141733    0.020088       0.026815
1984   1     5    22.39          135            0.04657    -0.11718    0.013732       0.002169
1984   2     6    22.76          137.1          0.046377   -0.00019    3.76E-08       0.002151
1984   3     7    23.48          141.2          0.043617   -0.00276    7.61E-06       0.001902
1984   4     8    23.66          142.8          -0.05844   -0.10205    0.010415       0.003415
1985   1     9    24.1           145.5          -0.0944    -0.03596    0.001293       0.008911
1985   2     10   24.01          145.3          -0.14914   -0.05474    0.002997       0.022243
1985   3     11   24.54          148.3          -0.14799   0.001152    1.33E-06       0.021901
1985   4     12   24.3           146.4          -0.05305   0.094937    0.009013       0.002815
1986   1     13   25             150.2          -0.02293   0.030125    0.000908       0.000526
1986   2     14   25.64          153.1          0.105852   0.12878     0.016584       0.011205
1986   3     15   26.36          157.3          0.085464   -0.02039    0.000416       0.007304
1986   4     16   26.98          160.7          0.106102   0.020638    0.000426       0.011258
1987   1     17   27.52          164.2          0.029112   -0.07699    0.005927       0.000848
1987   2     18   27.78          165.6          0.042316   0.013204    0.000174       0.001791
1987   3     19   28.24          168.7          -0.04416   -0.08648    0.007478       0.00195
1987   4     20   28.78          171.7          -0.03301   0.011152    0.000124       0.00109
Sum                                                                    0.097941       0.133302

Example
DW = 0.09794 / 0.13330 = 0.735

Using the Durbin-Watson table of your textbook, for k = 1 and n = 20, and using α = 0.01, we find U = 1.15 and L = 0.95.
Since DW = 0.735 falls below L = 0.95, we reject the null hypothesis and conclude that the error terms are positively autocorrelated.

Remedial Measures for Serial Correlation

Addition of one or more independent variables to the regression model.
One major cause of autocorrelated error terms is the omission from the model of one or more key variables that have time-ordered effects on the dependent variable.
Use transformed variables.
The regression model is specified in terms of changes rather than levels.

Extensions of the Multiple Regression Model

In some situations, nonlinear terms may be needed as independent variables in a regression analysis.
Business or economic logic may suggest that non-linearity is expected.
A graphic display of the data may be helpful in determining whether non-linearity is present.
One common economic cause for non-linearity is diminishing returns.
For example, the effect of advertising on sales may diminish as increased advertising is used.

Extensions of the Multiple Regression Model

Some common forms of nonlinear functions are:

Y = β0 + β1X + β2X²
Y = β0 + β1X + β2X² + β3X³
Y = β0 + β1(1/X)
Y = e^(β0) X^(β1)
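A small sketch of fitting the first of these forms (a quadratic term) by ordinary least squares; the advertising/sales numbers are invented to illustrate diminishing returns:

```python
import numpy as np

# Hypothetical data showing diminishing returns of advertising (X) on sales (Y).
X = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
Y = np.array([3.0, 5.5, 7.2, 8.4, 9.1, 9.5, 9.7, 9.8])

# Include X and X^2 as separate columns: Y = b0 + b1*X + b2*X^2.
design = np.column_stack([np.ones(len(X)), X, X**2])
b0, b1, b2 = np.linalg.lstsq(design, Y, rcond=None)[0]
print(f"Y_hat = {b0:.2f} + {b1:.2f} X + ({b2:.2f}) X^2")  # b2 < 0 signals diminishing returns
```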

Extensions of the Multiple Regression Model

To illustrate the use and interpretation of a non-linear term, we return to the problem of developing a forecasting model for private housing starts (PHS).
So far we have looked at the following model

PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + ε

where MR is the mortgage rate and Q2, Q3, and Q4 are indicator variables for quarters 2, 3, and 4.

Example: Private Housing Start

First we add real disposable personal income per capita (DPI) as an independent variable. Our new model for this data set is:

PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + β5(DPI) + ε

Regression results for this model are shown in the next slide.

Example: Private Housing Start


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.943791346
R Square             0.890742104
Adjusted R Square    0.874187878
Standard Error       19.05542121
Observations         39

ANOVA
             df    SS            MS         F          Significance F
Regression   5     97690.01942   19538      53.80753   6.51194E-15
Residual     33    11982.59955   363.1091
Total        38    109672.619

            Coefficients    Standard Error   t Stat     P-value    Lower 95%       Upper 95%
Intercept   -31.06403714    105.1938477      -0.2953    0.769613   -245.0826992    182.9546249
MR          -20.1992545     4.124906847      -4.8969    2.5E-05    -28.59144723    -11.80706176
Q2          97.03478074     8.900711541      10.90191   1.78E-12   78.9261326      115.1434289
Q3          75.40017073     8.827185877      8.541813   7.17E-10   57.44111179     93.35922967
Q4          20.35306822     8.83373887       2.304015   0.027657   2.380677107     38.32545934
DPI         0.022407799     0.004356973      5.142974   1.21E-05   0.013543464     0.031272134

Example: Private Housing Start

The prediction model is

PHS = −31.06 − 20.19(MR) + 97.03(Q2) + 75.40(Q3) + 20.35(Q4) + 0.02(DPI)

In comparison with the previous model, we see that the R-squared has improved: it has changed from 78% to 89%.
The standard error of the estimate has decreased from 26.45 for the previous model to 19.05 for the new model.

Example: Private Housing Start

The value of the DW test has changed from 0.88 for the previous model to 0.78 for the new model.
At the 5% level the critical values for the DW test, from the Durbin-Watson table, for k = 5 and n = 39 are L = 1.22 and U = 1.79.
Since the value of the DW test is smaller than L = 1.22, we reject the null hypothesis H0: ρ = 0.
This implies that there is serial correlation in both models; the assumption of the independence of the error terms is not valid.

Example: Private Housing Start

The plot of PHS against DPI shows a curvilinear relation.
Next we introduce a nonlinear term into the regression.
The square of disposable personal income per capita (DPI2) is included in the regression model.

[Figure: Private Housing Starts and Disposable Personal Income, scatter plot of PHS against DPI.]

Example: Private Housing Start

We also add the dependent variable, lagged one quarter (LPHS), as an independent variable in order to help reduce serial correlation.
The third model that we fit to our data set is:

PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + β5(DPI) + β6(DPI²) + β7(LPHS) + ε

Regression results for this model are shown in the next slide.

Example: Private Housing Start


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.97778626
R Square             0.956065971
Adjusted R Square    0.946145384
Standard Error       12.46719572
Observations         39

ANOVA
             df    SS            MS            F          Significance F
Regression   7     104854.2589   14979.17985   96.37191   3.07085E-19
Residual     31    4818.360042   155.4309691
Total        38    109672.619

              Coefficients    Standard Error   t Stat         P-value    Lower 95%       Upper 95%
Intercept     716.5926532     1017.664989      0.704153784    0.486593   -1358.949934    2792.13524
MR            -13.65521724    3.093504134      -4.414158396   0.000114   -19.96446404    -7.345970448
Q2            106.9813297     6.069780998      17.62523718    1.04E-17   94.60192287     119.3607366
Q3            27.72122303     9.111432565      3.042465916    0.004748   9.138323433     46.30412262
Q4            -13.37855186    7.653050858      -1.748133144   0.09034    -28.98706069    2.22995698
DPI           -0.060399279    0.104412354      -0.578468704   0.567127   -0.273349798    0.15255124
DPI SQUARED   0.000335974     0.000536397      0.626354647    0.535668   -0.000758014    0.001429963
LPHS          0.655786939     0.097265424      6.742241114    1.51E-07   0.457412689     0.854161189

Example: Private Housing Start

The inclusion of DPI2 and lagged PHS has increased the R-squared to 96%.
The standard error of the estimate has decreased to 12.47.
The value of the DW test has increased to 2.32, which is greater than U = 1.79; this rules out positive serial correlation.
You see that the third model worked best for this data set.
The following slide gives the data set.

Example: Private Housing Start


PERIOD      PHS     MR        LPHS    Q2   Q3   Q4   DPI     DPI SQUARED
30-Jun-90   271.3   10.3372   217     1    0    0    18063   1,631,359.85
30-Sep-90   233     10.1033   271.3   0    1    0    18031   1,625,584.81
31-Dec-90   173.6   9.9547    233     0    0    1    17856   1,594,183.68
31-Mar-91   146.7   9.5008    173.6   0    0    0    17748   1,574,957.52
30-Jun-91   254.1   9.5265    146.7   1    0    0    17861   1,595,076.61
30-Sep-91   239.8   9.2755    254.1   0    1    0    17816   1,587,049.28
31-Dec-91   199.8   8.6882    239.8   0    0    1    17811   1,586,158.61
31-Mar-92   218.5   8.7098    199.8   0    0    0    18000   1,620,000.00
30-Jun-92   296.4   8.6782    218.5   1    0    0    18085   1,635,336.13
30-Sep-92   276.4   8.0085    296.4   0    1    0    18036   1,626,486.48
31-Dec-92   238.8   8.2052    276.4   0    0    1    18330   1,679,944.50
31-Mar-93   213.2   7.7332    238.8   0    0    0    17975   1,615,503.13
30-Jun-93   323.7   7.4515    213.2   1    0    0    18247   1,664,765.05
30-Sep-93   309.3   7.0778    323.7   0    1    0    18246   1,664,582.58
31-Dec-93   279.4   7.0537    309.3   0    0    1    18413   1,695,192.85
31-Mar-94   252.6   7.2958    279.4   0    0    0    18154   1,647,838.58
30-Jun-94   354.2   8.4370    252.6   1    0    0    18409   1,694,456.41
30-Sep-94   325.7   8.5882    354.2   0    1    0    18493   1,709,955.25
31-Dec-94   265.9   9.0977    325.7   0    0    1    18667   1,742,284.45
31-Mar-95   214.2   8.8123    265.9   0    0    0    18834   1,773,597.78
30-Jun-95   296.7   7.9470    214.2   1    0    0    18798   1,766,824.02
30-Sep-95   308.2   7.7012    296.7   0    1    0    18871   1,780,573.21
31-Dec-95   257.2   7.3508    308.2   0    0    1    18942   1,793,996.82
31-Mar-96   240     7.2430    257.2   0    0    0    19071   1,818,515.21
30-Jun-96   344.5   8.1050    240     1    0    0    19081   1,820,422.81
30-Sep-96   324     8.1590    344.5   0    1    0    19161   1,835,719.61
31-Dec-96   252.4   7.7102    324     0    0    1    19152   1,833,995.52
31-Mar-97   237.8   7.7905    252.4   0    0    0    19331   1,868,437.81
30-Jun-97   324.5   7.9255    237.8   1    0    0    19315   1,865,346.13
30-Sep-97   314.6   7.4692    324.5   0    1    0    19385   1,878,891.13
31-Dec-97   256.8   7.1980    314.6   0    0    1    19478   1,896,962.42
31-Mar-98   258.4   7.0547    256.8   0    0    0    19632   1,927,077.12
30-Jun-98   360.4   7.0938    258.4   1    0    0    19719   1,944,194.81
30-Sep-98   348     6.8657    360.4   0    1    0    19905   1,980,963.41
31-Dec-98   304.6   6.7633    348     0    0    1    20194   2,038,980.00
31-Mar-99   294.1   6.8805    304.6   0    0    0    20377   2,076,010.87
30-Jun-99   377.1   7.2037    294.1   1    0    0    20472   2,095,440.74
30-Sep-99   355.6   7.7990    377.1   0    1    0    20756   2,153,982.23
31-Dec-99   308.1   7.8338    355.6   0    0    1    21124   2,231,020.37
