
Data Analysis

Univariate Analysis

Bivariate Analysis

Multivariate Analysis

Three Types of Analysis

We can classify analysis into three types:

1. Univariate, involving a single variable at a time,

2. Bivariate, involving two variables at a time, and

3. Multivariate, involving three or more variables simultaneously.

Revision: Application Areas: Correlation


1. Correlation and regression are generally performed together. Correlation analysis is applied to measure the degree of association between two sets of quantitative data. The correlation coefficient measures this association; its value ranges from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).

2. For example, how are sales of product A correlated with sales of product B? Or, how is the advertising expenditure correlated with other promotional expenditure? Or, are daily ice cream sales correlated with daily maximum temperature?

3. Correlation does not necessarily mean there is a causal effect. Given any two strings of numbers, there will be some correlation between them. It does not imply that one variable is causing a change in the other, or is dependent upon the other.

4. Correlation is usually followed by regression analysis in many applications.
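As a minimal sketch of how such a coefficient might be computed, the following Python function calculates the Pearson correlation for the ice cream example mentioned above. The daily figures are hypothetical, invented purely for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical daily figures: maximum temperature (deg C) and ice cream sales (units)
temps = [28, 30, 33, 25, 35, 31, 27]
sales = [410, 450, 520, 380, 560, 470, 400]

r = pearson_r(temps, sales)  # close to +1: strong positive association
```

A value of r near +1 here would suggest a strong positive association, but, as noted above, it would not by itself establish that temperature causes the sales.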

Application Areas: Regression


1. The main objective of regression analysis is to explain the variation in one variable (called the dependent variable), based on the variation in one or more other variables (called the independent variables).

2. Typical application areas include explaining variations in sales of a product based on advertising expenses, or number of salespeople, or number of sales offices, or on all of these variables together.

3. If there is only one dependent variable and one independent variable is used to explain the variation in it, then the model is known as a simple regression model.

4. If multiple independent variables are used to explain the variation in a dependent variable, it is called a multiple regression model.

5. Even though the form of the regression equation could be either linear or non-linear, we will limit our discussion to linear (straight-line) models.

The general (linear) regression model is of the type

Y = b0 + b1x1 + b2x2 + ... + bnxn

(or Y = a + b1x1 + b2x2 + ... + bnxn)

where

Y is the dependent variable,

x1, x2, ..., xn are the independent variables expected to be related to Y and expected to explain or predict Y, and

b1, b2, ..., bn are the coefficients of the respective independent variables, which will be estimated from the data.

Purposes of Regression Analysis

To establish the relationship between a dependent variable (outcome) and a set of independent (explanatory) variables

To identify the relative importance of the different independent (explanatory) variables on the outcome
To make predictions

Steps of Regression Analysis

Step 1: Construct a regression model
Step 2: Estimate the regression and interpret the results
Step 3: Conduct diagnostic analysis of the results
Step 4: Change the original regression model if necessary
Step 5: Make predictions

DATA (INPUT / OUTPUT)


1. Input data on Y and each of the x variables is required to do a regression analysis. This data is fed into a statistical package to perform the regression.

2. The output consists of the b coefficients for all the independent variables in the model. It also gives the results of a t test for the significance of each variable in the model, and the results of the F test for the model as a whole.

3. Assuming the model is statistically significant at the desired confidence level (usually 90 or 95 percent), the coefficient of determination, or R2, of the model is an important part of the output. The R2 value is the proportion (or percentage) of the total variance in Y explained by all the independent variables in the regression equation.

Requirements for applying Multiple regression analysis

1. The variables used (independent and dependent) are assumed to be either interval scaled or ratio scaled.

2. Nominally scaled variables can be used as independent variables in a regression model, with dummy variable coding.

3. If the dependent variable happens to be nominally scaled, discriminant analysis should be used instead of regression.

4. The dependent variable must be metric; the independent variables may be metric or dummy-coded.
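To illustrate point 2, here is a minimal sketch of dummy variable coding in Python, using a hypothetical nominal variable "region". One category is held out as the baseline, so the dummies do not sum to a constant column (which would duplicate the intercept).

```python
# Dummy-code a nominal variable into 0/1 indicator columns, with one
# fewer dummy than categories, to avoid perfect multicollinearity.
def dummy_code(values, baseline):
    levels = sorted(set(values) - {baseline})
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

# Hypothetical nominal variable for five observations
regions = ["North", "South", "East", "South", "North"]
dummies = dummy_code(regions, baseline="North")
# dummies == {"East": [0, 0, 1, 0, 0], "South": [0, 1, 0, 1, 0]}
```

The resulting 0/1 columns can then be entered into the regression like any metric independent variable; each dummy coefficient is interpreted relative to the baseline category.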

Worked Example: Problem

A manufacturer and marketer of electric motors would like to build a regression model consisting of five or six independent variables, to predict sales. Past data has been collected for 15 sales territories, on Sales and six different independent variables. Build a regression model and recommend whether or not it should be used by the company.

The data are for a particular year, in different sales territories in which the company operates, and the variables on which data are collected are as follows:

Dependent Variable:
Y = sales in Rs. lakhs in the territory

Independent Variables:
X1 = market potential in the territory (in Rs. lakhs)
X2 = number of dealers of the company in the territory
X3 = number of salespeople in the territory
X4 = index of competitor activity in the territory on a 5-point scale (1 = low, 5 = high level of activity by competitors)
X5 = number of service people in the territory
X6 = number of existing customers in the territory

The data file follows:

Territory   SALES   POTENTL   DEALERS   PEOPLE   COMPET   SERVICE   CUSTOM
    1          5       25         1        6        5        2        20
    2         60      150        12       30        4        5        50
    3         20       45         5       15        3        2        25
    4         11       30         2       10        3        2        20
    5         45       75        12       20        2        4        30
    6          6       10         3        8        2        3        16
    7         15       29         5       18        4        5        30
    8         22       43         7       16        3        6        40
    9         29       70         4       15        2        5        39
   10          3       40         1        6        5        2         5
   11         16       40         4       11        4        2        17
   12          8       25         2        9        3        3        10
   13         18       32         7       14        3        4        31
   14         23       73        10       10        4        3        43
   15         81      150        15       35        4        7        70

Regression
We will first run a regression model of the following form, entering all six x variables in the model:

Y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6    ... Equation 1

(or Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6    ... Equation 1)

and determine the values of b0 (or a), b1, b2, b3, b4, b5, and b6.
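As a sketch of what the statistical package does internally, the following Python code (assuming NumPy is available) fits Equation 1 to the worked-example data by ordinary least squares and computes R2:

```python
import numpy as np

# Data for the 15 territories (from the worked example above)
sales   = [5, 60, 20, 11, 45, 6, 15, 22, 29, 3, 16, 8, 18, 23, 81]
potentl = [25, 150, 45, 30, 75, 10, 29, 43, 70, 40, 40, 25, 32, 73, 150]
dealers = [1, 12, 5, 2, 12, 3, 5, 7, 4, 1, 4, 2, 7, 10, 15]
people  = [6, 30, 15, 10, 20, 8, 18, 16, 15, 6, 11, 9, 14, 10, 35]
compet  = [5, 4, 3, 3, 2, 2, 4, 3, 2, 5, 4, 3, 3, 4, 4]
service = [2, 5, 2, 2, 4, 3, 5, 6, 5, 2, 2, 3, 4, 3, 7]
custom  = [20, 50, 25, 20, 30, 16, 30, 40, 39, 5, 17, 10, 31, 43, 70]

y = np.array(sales, dtype=float)
# Design matrix: a column of 1s for the intercept b0, then the six x variables
X = np.column_stack([np.ones(15), potentl, dealers, people,
                     compet, service, custom])

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)  # b0, b1, ..., b6

# R-square: proportion of total variance in y explained by the model
resid = y - X @ coefs
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
```

If the data has been entered correctly, the estimated coefficients should reproduce the B column of the package output below, up to rounding.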

MULTIPLE REGRESSION RESULTS:

All independent variables were entered in one block


Dependent Variable:   SALES
Multiple R:           .988531605
Multiple R-Square:    .977194734
Adjusted R-Square:    .960090784
Number of cases:      15

The ANOVA Table


STAT. MULTIPLE REGRESS.  Analysis of Variance; Depen. Var: SALES (regdata1.sta)

Effect     Sums of Squares   df   Mean Squares      F       p-level
Regress.       6609.484       6     1101.581     57.13269   .000004
Residual        154.249       8       19.281
Total          6763.733

From the analysis of variance table, the last column indicates the p-level to be 0.000004. This indicates that the model is statistically significant at a confidence level of (1 - 0.000004) * 100, or 99.9996 percent.

STAT. MULTIPLE REGRESS.  Regression Summary for Dependent Variable: SALES
R= .98853160  R2= .97719473  Adjusted R2= .96009078
F(6,8)=57.133  p< .00000  Std. Error of Estimate: 4.3910

N=15          BETA     St.Err.         B     St.Err.      t(8)    p-level
                       of BETA               of B
Intercept                          -3.1729   5.813394   -.54581   .600084
POTENTL     .439073    .144411      .22685    .074611   3.04044   .016052
DEALERS     .164315    .126591      .81938    .631266   1.29800   .230457
PEOPLE      .413967    .158646     1.09104    .418122   2.60937   .031161
COMPET     -.084871    .060074    -1.89270   1.339712  -1.41276   .195427
SERVICE    -.040806    .116511     -.54925   1.568233   -.35024   .735204
CUSTOM      .050490    .149302      .06594    .095002    .33817   .743935

Column 4 of the table, titled B, lists all the coefficients for the model. These are: a (intercept) = -3.17298, b1 = .22685, b2 = .81938, b3 = 1.09104, b4 = -1.89270, b5 = -0.54925, b6 = 0.06594.

Substituting these values of a, b1, b2, ..., b6 in Equation 1, we can write the equation (rounding off all coefficients to two decimals) as

Sales = -3.17 + .23 (potential) + .82 (dealers) + 1.09 (salespeople) - 1.89 (competitor activity) - 0.55 (service people) + 0.07 (existing customers)
[Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6    ... Equation 1]

The estimated change in sales for every unit increase or decrease in an independent variable is given by the coefficient of that variable. For instance, if the number of salespeople is increased by 1, sales, in Rs. lakhs, are estimated to increase by 1.09, if all other variables are unchanged. Similarly, if 1 more dealer is added, sales are expected to increase by 0.82 lakh, if all other variables remain unchanged.

The SERVICE variable does not make much intuitive sense: according to its coefficient of -0.55, increasing the number of service people would decrease sales. Looking at the individual variable t tests, however, we find that the coefficient of SERVICE is statistically not significant (p-level 0.735204). Therefore, the coefficient for SERVICE should not be used in interpreting the regression, as it may lead to wrong conclusions. Strictly speaking, only two variables, market potential (POTENTL) and number of salespeople (PEOPLE), are statistically significant at the 90 percent confidence level, since their p-levels are less than 0.10. One should therefore consider dropping the non-significant variables and re-running the regression.

Different modes of entering independent variables in the model:

1. Enter
2. Forward stepwise regression
3. Backward stepwise regression
4. Stepwise regression

The final model


Sales = -10.6164 + .2433 (POTENTL) + 1.4244 (PEOPLE)    ... Equation 3

Predictions: If the potential in a territory were Rs. 50 lakhs, and the territory had 6 salespeople, then expected sales, using the above equation, would be

-10.6164 + .2433(50) + 1.4244(6) = 10.095 lakhs.

Similarly, we could use this model to make predictions regarding sales in any territory for which the potential and number of salespeople were known.
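Equation 3 can be sketched as a small Python helper for making such predictions:

```python
def predict_sales(potential, salespeople):
    """Expected sales (Rs. lakhs) from the final model, Equation 3."""
    return -10.6164 + 0.2433 * potential + 1.4244 * salespeople

# The worked prediction: Rs. 50 lakhs potential, 6 salespeople
pred = predict_sales(50, 6)  # -10.6164 + 12.165 + 8.5464 = 10.095 lakhs
```

Note that such predictions are only reliable for territories whose potential and salespeople counts lie within the range of the data used to estimate the model.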

Recommended usage

1. It is recommended that for serious decision-making, there should be a-priori knowledge of the variables that are likely to affect Y, and only such variables should be used in the regression analysis.

2. For exploratory research, a trial-and-error approach may be used.

3. It is also recommended that unless the model itself is significant at the desired confidence level (as evidenced by the F test results printed out for the model), the R2 value should not be interpreted.

Multicollinearity and how to tackle it

Multicollinearity: interrelationship among the various independent variables. It is essential to verify whether the independent variables are highly correlated with each other. If they are, this may indicate that they are not truly independent of each other, and we may be able to use only one or two of them to predict the dependent variable. Independent variables that are highly correlated with each other should not be included in the model together.
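A simple way to screen for multicollinearity is to inspect the correlation matrix of the independent variables. The sketch below (assuming NumPy, and using an illustrative cutoff of 0.8) does this for the worked-example data:

```python
import numpy as np

# Independent variables from the worked example, one row per variable
X = np.array([
    [25, 150, 45, 30, 75, 10, 29, 43, 70, 40, 40, 25, 32, 73, 150],  # POTENTL
    [1, 12, 5, 2, 12, 3, 5, 7, 4, 1, 4, 2, 7, 10, 15],               # DEALERS
    [6, 30, 15, 10, 20, 8, 18, 16, 15, 6, 11, 9, 14, 10, 35],        # PEOPLE
    [5, 4, 3, 3, 2, 2, 4, 3, 2, 5, 4, 3, 3, 4, 4],                   # COMPET
    [2, 5, 2, 2, 4, 3, 5, 6, 5, 2, 2, 3, 4, 3, 7],                   # SERVICE
    [20, 50, 25, 20, 30, 16, 30, 40, 39, 5, 17, 10, 31, 43, 70],     # CUSTOM
], dtype=float)

names = ["POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]
corr = np.corrcoef(X)  # 6x6 matrix of pairwise correlations

# Flag pairs of independent variables whose correlation exceeds the cutoff
high = [(names[i], names[j], round(corr[i, j], 2))
        for i in range(6) for j in range(i + 1, 6)
        if abs(corr[i, j]) > 0.8]
```

Any pairs flagged in `high` would be candidates for dropping one variable of the pair, as recommended above. The 0.8 cutoff is a common rule of thumb, not a fixed standard.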
