
Business Statistics, 5th ed.

by Ken Black
Chapter 16

Building Multiple Regression Models

PowerPoint presentations prepared by Lloyd Jaisingh, Morehead State University

Learning Objectives
- Analyze and interpret nonlinear variables in multiple regression analysis.
- Understand the role of qualitative variables and how to use them in multiple regression analysis.
- Learn how to build and evaluate multiple regression models.
- Learn how to detect influential observations in regression analysis.

General Linear Regression Model


Y = β0 + β1X1 + β2X2 + β3X3 + . . . + βkXk + ε

where:
Y = the value of the dependent (response) variable
β0 = the regression constant
β1 = the partial regression coefficient of independent variable 1
β2 = the partial regression coefficient of independent variable 2
βk = the partial regression coefficient of independent variable k
k = the number of independent variables
ε = the error of prediction
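The partial regression coefficients of this model can be estimated by ordinary least squares. A minimal sketch with numpy (the function name and the toy data are ours, not from the text):

```python
import numpy as np

def fit_ols(X, y):
    """Estimate b0, b1, ..., bk for Y = b0 + b1*X1 + ... + bk*Xk + error
    by ordinary least squares. Illustrative sketch only."""
    A = np.column_stack([np.ones(len(y)), X])  # prepend the constant column
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coefs

# Tiny made-up data that satisfies y = 1 + 2*x1 + 3*x2 exactly
x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1 + 2 * x[:, 0] + 3 * x[:, 1]
print(fit_ols(x, y))  # ≈ [1. 2. 3.]
```

With noise-free data the recovered coefficients match the generating constants exactly, which makes the mechanics easy to see before turning to real data.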

Nonlinear Models: Mathematical Transformation

First-order with two independent variables:
Y = β0 + β1X1 + β2X2 + ε

Second-order with one independent variable:
Y = β0 + β1X1 + β2X1² + ε

Second-order with an interaction term:
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε

Second-order with two independent variables:
Y = β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 + ε

Sales Data and Scatter Plot for 13 Manufacturing Companies


Manufacturer   Sales ($1,000,000)   Number of Manufacturing Representatives
1                2.1                 2
2                3.6                 1
3                6.2                 2
4               10.4                 3
5               22.8                 4
6               35.6                 4
7               57.1                 5
8               83.5                 5
9              109.4                 6
10             128.6                 7
11             196.8                 8
12             280.0                10
13             462.3                11

[Scatter plot: Sales vs. Number of Representatives]

Excel Simple Linear Regression Output for the Manufacturing Example


Regression Statistics
Multiple R          0.933
R Square            0.870
Adjusted R Square   0.858
Standard Error      51.10
Observations        13

                 Coefficients   Standard Error   t Stat   P-value
Intercept        -107.03        28.737           -3.72    0.003
Number of Reps   41.026         4.779            8.58     0.000

ANOVA
             df   SS       MS       F       Significance F
Regression   1    192395   192395   73.69   0.000
Residual     11   28721    2611
Total        12   221117

Manufacturing Data with Newly Created Variable


Sales Manufacturer ($1,000,000) 1 2.1 2 3.6 3 6.2 4 10.4 5 22.8 6 35.6 7 57.1 8 83.5 9 109.4 10 128.6 11 196.8 12 280.0 13 462.3 Number of Mgfr Reps X1 2 1 2 3 4 4 5 5 6 7 8 10 11 (No. Mgfr Reps)2 X2 = (X1)2 4 1 4 9 16 16 25 25 36 49 64 100 121

Scatter Plots Using Original and Transformed Data


[Scatter plots: Sales vs. Number of Representatives (original data) and Sales vs. Number of Mfg. Reps. Squared (transformed data)]

Computer Output for Quadratic Model to Predict Sales


Regression Statistics
Multiple R          0.986
R Square            0.973
Adjusted R Square   0.967
Standard Error      24.593
Observations        13

            Coefficients   Standard Error   t Stat   P-value
Intercept   18.067         24.673           0.73     0.481
MfgrRp      -15.723        9.5450           -1.65    0.131
MfgrRpSq    4.750          0.776            6.12     0.000

ANOVA
             df   SS       MS       F        Significance F
Regression   2    215069   107534   177.79   0.000
Residual     10   6048     605
Total        12   221117
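The quadratic fit can be reproduced from the 13-company table with a short least-squares computation. A sketch with numpy (not the textbook's spreadsheet output; variable names are ours):

```python
import numpy as np

# The 13 manufacturers' data from the table above
reps = np.array([2, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 10, 11], dtype=float)
sales = np.array([2.1, 3.6, 6.2, 10.4, 22.8, 35.6, 57.1, 83.5,
                  109.4, 128.6, 196.8, 280.0, 462.3])

# Quadratic model: Sales = b0 + b1*Reps + b2*Reps**2
A = np.column_stack([np.ones_like(reps), reps, reps ** 2])
b, *_ = np.linalg.lstsq(A, sales, rcond=None)

resid = sales - A @ b
sse = resid @ resid
sst = (sales - sales.mean()) @ (sales - sales.mean())
r_sq = 1 - sse / sst
print(b.round(3), round(r_sq, 3))  # coefficients near (18.07, -15.72, 4.75), R² near 0.973
```

Squaring the predictor is done inside the design matrix here; creating an explicit X2 = X1² column, as the data table does, gives the same fit.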

Tukey's Four-Quadrant Approach

[Diagram: Tukey's four-quadrant approach for straightening a curved scatter plot. Depending on the direction of the bulge, move X up the ladder of powers (toward X², X³) or down (toward log X, -1/X), and/or move Y up (toward Y², Y³) or down (toward log Y, -1/Y).]

Prices of Three Stocks over a 15-Month Period

Month   Stock 1   Stock 2   Stock 3
1       41        36        35
2       39        36        35
3       38        38        32
4       45        51        41
5       41        52        39
6       43        55        55
7       47        57        52
8       49        58        54
9       41        62        65
10      35        70        77
11      36        72        75
12      39        74        74
13      33        83        81
14      28        101       92
15      31        107       91

Regression Models for the Three Stocks

First-order with two independent variables:
Y = β0 + β1X1 + β2X2

where: Y = price of stock 1, X1 = price of stock 2, X2 = price of stock 3

Second-order with an interaction term:
Y = β0 + β1X1 + β2X2 + β3X1X2

where: Y = price of stock 1, X1 = price of stock 2, X2 = price of stock 3, and X1X2 = the interaction term

Regression for Three Stocks: First-order, Two Independent Variables


The regression equation is
Stock 1 = 50.9 - 0.119 Stock 2 - 0.071 Stock 3

Predictor   Coef      StDev    T       P
Constant    50.855    3.791    13.41   0.000
Stock 2     -0.1190   0.1931   -0.62   0.549
Stock 3     -0.0708   0.1990   -0.36   0.728

S = 4.570   R-Sq = 47.2%   R-Sq(adj) = 38.4%

Analysis of Variance
Source       DF   SS       MS       F      P
Regression   2    224.29   112.15   5.37   0.022
Error        12   250.64   20.89
Total        14   474.93

Regression for Three Stocks: Second-order With an Interaction Term


The regression equation is
Stock 1 = 12.0 + 0.879 Stock 2 + 0.220 Stock 3 - 0.00998 Inter

Predictor   Coef        StDev      T       P
Constant    12.046      9.312      1.29    0.222
Stock 2     0.8788      0.2619     3.36    0.006
Stock 3     0.2205      0.1435     1.54    0.153
Inter       -0.009985   0.002314   -4.31   0.001

S = 2.909   R-Sq = 80.4%   R-Sq(adj) = 75.1%

Analysis of Variance
Source       DF   SS       MS       F       P
Regression   3    381.85   127.28   15.04   0.000
Error        11   93.09    8.46
Total        14   474.93

Nonlinear Regression Models: Model Transformation

Exponential model:
Y = b0 b1^X

Taking logarithms of both sides gives a model that is linear in X:
log Y = log b0 + X log b1

that is, Y′ = b0′ + b1′X, where Y′ = log Y, b0′ = log b0, and b1′ = log b1.
Data Set for Model Transformation Example


ORIGINAL DATA (Y, X) AND TRANSFORMED DATA (LOG Y, X)

Company   Y       X     LOG Y
1         2580    1.2   3.41162
2         11942   2.6   4.077077
3         9845    2.2   3.993216
4         27800   3.2   4.444045
5         18926   2.9   4.277059
6         4800    1.5   3.681241
7         14550   2.7   4.162863

Y = Sales ($ million/year)
X = Advertising ($ million/year)

Regression Output for Model Transformation Example

Regression Statistics
Multiple R          0.990
R Square            0.980
Adjusted R Square   0.977
Standard Error      0.054
Observations        7

            Coefficients   Standard Error   t Stat   P-value
Intercept   2.9003         0.0729           39.80    0.000
X           0.4751         0.0300           15.82    0.000

ANOVA
             df   SS       MS       F        Significance F
Regression   1    0.7392   0.7392   250.36   0.000
Residual     5    0.0148   0.0030
Total        6    0.7540

Prediction with the Transformed Model

The fitted model is:
log Ŷ = log b0 + X log b1 = 2.900364 + 0.475127 X

For X = 2:
log Ŷ = 2.900364 + 2(0.475127) = 3.850618
Ŷ = antilog(3.850618) = 7089.5

Equivalently, converting back to the multiplicative form Y = b0 b1^X:
b0 = antilog(log b0) = antilog(2.900364) = 794.99427
b1 = antilog(log b1) = antilog(0.475127) = 2.986256

For X = 2: Ŷ = 794.99427(2.986256)² = 7089.5
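The antilog arithmetic can be checked in a few lines, using base-10 logs to match the LOG Y column. A sketch (the function name is ours):

```python
# Fitted log model from the output above: log10(Y-hat) = 2.900364 + 0.475127*X
B0_LOG, B1_LOG = 2.900364, 0.475127

def predict_sales(x):
    """Predict Y by evaluating the log model and taking the antilog (10**...)."""
    return 10 ** (B0_LOG + B1_LOG * x)

print(round(predict_sales(2), 1))  # ≈ 7089.5, within rounding

# Equivalent multiplicative form Y = b0 * b1**X
b0, b1 = 10 ** B0_LOG, 10 ** B1_LOG  # ≈ 794.994 and 2.986256
print(round(b0 * b1 ** 2, 1))
```

Both routes, antilogging the predicted log value or multiplying out b0·b1², give the same prediction, since they are algebraic rearrangements of one another.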

Indicator (Dummy) Variables


Qualitative (categorical) variables:
- The number of dummy variables needed for a qualitative variable is the number of categories less one (c - 1, where c is the number of categories).
- For dichotomous variables, such as gender, only one dummy variable is needed: there are two categories (female and male), so c = 2 and c - 1 = 1.
- Example: "Your office is located in which region of the country? ___Northeast ___Midwest ___South ___West" has four categories, so the number of dummy variables = c - 1 = 4 - 1 = 3.
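As a minimal illustration of the c - 1 rule, the four-region question can be coded with three indicators. A sketch (the helper name and the choice of Northeast as the reference category are ours):

```python
# Four categories -> c - 1 = 3 dummy variables.
# The omitted category (Northeast here) is the reference level:
# it is represented by all three dummies being 0.
REGIONS = ["Northeast", "Midwest", "South", "West"]

def region_dummies(region):
    """Return the three 0/1 indicators for Midwest, South, West."""
    return [1 if region == r else 0 for r in REGIONS[1:]]

print(region_dummies("South"))      # [0, 1, 0]
print(region_dummies("Northeast"))  # [0, 0, 0]
```

Using all c dummies instead of c - 1 would make the columns sum to the constant column, producing perfect multicollinearity; dropping one category avoids that.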

Data for the Monthly Salary Example


Observation   Monthly Salary ($1000)   Age (10 Years)   Gender (1=Male, 0=Female)
1             1.548                    3.2              1
2             1.629                    3.8              1
3             1.011                    2.7              0
4             1.229                    3.4              0
5             1.746                    3.6              1
6             1.528                    4.1              1
7             1.018                    3.8              0
8             1.190                    3.4              0
9             1.551                    3.3              1
10            0.985                    3.2              0
11            1.610                    3.5              1
12            1.432                    2.9              1
13            1.215                    3.3              0
14            0.990                    2.8              0
15            1.585                    3.5              1

Regression Output for the Monthly Salary Example


The regression equation is
Salary = 0.732 + 0.111 Age + 0.459 Gender

Predictor   Coef      StDev     T      P
Constant    0.7321    0.2356    3.11   0.009
Age         0.11122   0.07208   1.54   0.149
Gender      0.45868   0.05346   8.58   0.000

S = 0.09679   R-Sq = 89.0%   R-Sq(adj) = 87.2%

Analysis of Variance
Source       DF   SS        MS        F       P
Regression   2    0.90949   0.45474   48.54   0.000
Error        12   0.11242   0.00937
Total        14   1.02191
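Because the gender dummy enters the model like any other predictor, the coefficients can be reproduced directly from the data table. A numpy sketch (not the original software run):

```python
import numpy as np

# Monthly salary data from the table above (salary in $1000, age in 10-year units)
salary = np.array([1.548, 1.629, 1.011, 1.229, 1.746, 1.528, 1.018, 1.190,
                   1.551, 0.985, 1.610, 1.432, 1.215, 0.990, 1.585])
age = np.array([3.2, 3.8, 2.7, 3.4, 3.6, 4.1, 3.8, 3.4,
                3.3, 3.2, 3.5, 2.9, 3.3, 2.8, 3.5])
gender = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1], dtype=float)

# Salary = b0 + b1*Age + b2*Gender
A = np.column_stack([np.ones_like(age), age, gender])
coefs, *_ = np.linalg.lstsq(A, salary, rcond=None)
print(coefs.round(3))  # ≈ [0.732, 0.111, 0.459]
```

The gender coefficient, about 0.459, is the estimated shift in monthly salary ($1000s) for males at any given age: the dummy moves the intercept, not the slope.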

Regression Model Depicted with Males and Females Separated


[Chart: fitted salary vs. age, with separate parallel lines for males and females]

Data for Multiple Regression to Predict Crude Oil Production


Y  = World Crude Oil Production
X1 = U.S. Energy Consumption
X2 = U.S. Nuclear Generation
X3 = U.S. Coal Production
X4 = U.S. Dry Gas Production
X5 = U.S. Fuel Rate for Autos

Y      X1     X2      X3       X4     X5
55.7   74.3   83.5    598.6    21.7   13.30
55.7   72.5   114.0   610.0    20.7   13.42
52.8   70.5   172.5   654.6    19.2   13.52
57.3   74.4   191.1   684.9    19.1   13.53
59.7   76.3   250.9   697.2    19.2   13.80
60.2   78.1   276.4   670.2    19.1   14.04
62.7   78.9   255.2   781.1    19.7   14.41
59.6   76.0   251.1   829.7    19.4   15.46
56.1   74.0   272.7   823.8    19.2   15.94
53.5   70.8   282.8   838.1    17.8   16.65
53.3   70.5   293.7   782.1    16.1   17.14
54.5   74.1   327.6   895.9    17.5   17.83
54.0   74.0   383.7   883.6    16.5   18.20
56.2   74.3   414.0   890.3    16.1   18.27
56.7   76.9   455.3   918.8    16.6   19.20
58.7   80.2   527.0   950.3    17.1   19.87
59.9   81.3   529.4   980.7    17.3   20.31
60.6   81.3   576.9   1029.1   17.8   21.02
60.2   81.1   612.6   996.0    17.7   21.69
60.2   82.1   618.8   997.5    17.8   21.68
60.6   83.9   610.3   945.4    18.2   21.04
60.9   85.6   640.4   1033.5   18.9   21.48

Model-Building: Search Procedures


- All Possible Regressions
- Stepwise Regression
- Forward Selection
- Backward Elimination

All Possible Regressions with Five Independent Variables


Single predictor:  X1; X2; X3; X4; X5
Two predictors:    X1,X2; X1,X3; X1,X4; X1,X5; X2,X3; X2,X4; X2,X5; X3,X4; X3,X5; X4,X5
Three predictors:  X1,X2,X3; X1,X2,X4; X1,X2,X5; X1,X3,X4; X1,X3,X5; X1,X4,X5; X2,X3,X4; X2,X3,X5; X2,X4,X5; X3,X4,X5
Four predictors:   X1,X2,X3,X4; X1,X2,X3,X5; X1,X2,X4,X5; X1,X3,X4,X5; X2,X3,X4,X5
Five predictors:   X1,X2,X3,X4,X5
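The list above enumerates every nonempty subset of the five predictors, 2⁵ - 1 = 31 candidate models in all. The count is easy to verify (a sketch, not from the text):

```python
from itertools import combinations

predictors = ["X1", "X2", "X3", "X4", "X5"]
# All nonempty subsets: sizes 1 through 5
models = [c for r in range(1, len(predictors) + 1)
          for c in combinations(predictors, r)]

# 5 + 10 + 10 + 5 + 1 = 31 candidate models
print(len(models))  # 31
print(models[:3])   # [('X1',), ('X2',), ('X3',)]
```

The exponential growth of this count (2^k - 1 for k predictors) is why the search procedures below exist: fitting every subset quickly becomes impractical as k grows.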

Stepwise Regression
- Perform k simple regressions and select the best as the initial model.
- Evaluate each variable not in the model; if none meets the criterion, stop.
- Add the best variable to the model; reevaluate previously entered variables and drop any that are no longer significant.
- Return to the previous step.

Forward Selection
- Like stepwise regression, except variables are not reevaluated after entering the model.

Backward Elimination
- Start with the full model (all k predictors).
- If all predictors are significant, stop.
- Otherwise, eliminate the most nonsignificant predictor and return to the previous step.

Stepwise: Step 1 - Simple Regression Results for Each Independent Variable


Dependent Variable   Independent Variable   t-Ratio   R²
Y                    X1                     11.77     85.2%
Y                    X2                     4.43      45.0%
Y                    X3                     3.91      38.9%
Y                    X4                     1.08      4.6%
Y                    X5                     3.54      34.2%

MINITAB Stepwise Output


Stepwise Regression
F-to-Enter: 4.00   F-to-Remove: 4.00

Response is Crude Oil Production on 5 predictors, with N = 26

Step         1        2
Constant     13.075   7.140

EnerCons     0.580    0.772
T-Value      11.77    11.91
P-Value      0.000    0.000

FuelRate              -0.52
T-Value               -3.75
P-Value               0.001

S            1.52     1.22
R-Sq         85.24    90.83

Multicollinearity
Condition that occurs when two or more of the independent variables of a multiple regression model are highly correlated. Consequences:
- It is difficult to interpret the estimates of the regression coefficients.
- t values for the regression coefficients are inordinately small.
- Standard deviations of the regression coefficients are overestimated.
- The sign of a predictor variable's coefficient may be the opposite of what is expected.

Correlations among Oil Production Predictor Variables


                     Energy Consumption   Nuclear   Coal     Dry Gas   Fuel Rate
Energy Consumption   1                    0.856     0.791    0.057     0.796
Nuclear              0.856                1         0.952    -0.404    0.972
Coal                 0.791                0.952     1        -0.448    0.968
Dry Gas              0.057                -0.404    -0.448   1         -0.423
Fuel Rate            0.796                0.972     0.968    -0.423    1
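One common screen for multicollinearity of this kind is the variance inflation factor, VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. A sketch with made-up collinear data (the function name and the synthetic variables are ours; the oil data are not re-entered here):

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R²) from
    regressing X[:, j] on the remaining columns plus a constant."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    b, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ b
    centered = X[:, j] - X[:, j].mean()
    r_sq = 1 - resid @ resid / (centered @ centered)
    return 1.0 / (1.0 - r_sq)

# Synthetic demo: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.05, size=50)
x3 = rng.normal(size=50)
X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 1) for j in range(3)])  # first two huge, third near 1
```

A common rule of thumb treats VIF values above about 10 as a sign of serious multicollinearity; in the correlation matrix above, the near-perfect pairwise correlations among nuclear generation, coal production, and fuel rate would produce very large VIFs.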

Copyright 2008 John Wiley & Sons, Inc. All rights reserved. Reproduction or translation of this work beyond that permitted in section 117 of the 1976 United States Copyright Act without express permission of the copyright owner is unlawful. Request for further information should be addressed to the Permissions Department, John Wiley & Sons, Inc. The purchaser may make back-up copies for his/her own use only and not for distribution or resale. The Publisher assumes no responsibility for errors, omissions, or damages caused by the use of these programs or from the use of the information herein.
