
Lecture Notes

Introductory Notes on Multivariate Analysis

Prepared by Prof. Prithvi Yadav for the course ADA

2004 Indian Institute of Management, Rajendra Nagar, Indore


Objectives
Overview of Multivariate Statistics

Utility of multivariate techniques

Survey of multivariate dependency techniques

Survey of multivariate interdependency techniques

Concept of Optimal Scaling


Index
1. Introduction to Multivariate Analysis
   1.1 Impact of the Computer Revolution
   1.2 Concept of Relationships: univariate, bivariate, multivariate; the multivariate database; axiom: any event probably has multiple causes
2. Statistical vs Operational Significance
   2.1 Statistical Significance
   2.2 Operational Significance
   2.3 Statistical Significance and Proportion of Variance Explained
3. Factors that inhibit multivariate analysis techniques:
   limitations of the human mind; time & cost of data gathering; historical limitations of multivariate analytic techniques; problems with the inflation of alpha; inadequate analytic engines; mathematical & statistical concepts which seem too complex; factionalism in the social sciences between empiricists and theorists
4. Experiment-wide error rates and the inflation of alpha
5. Factors encouraging the use of multivariate techniques:
   inexpensive & user-friendly computers & statistical software; empirical emphasis on multivariate techniques in the field; expectation of multivariate design by journal editors; increasing availability of automated operational databases; increasing availability of automated archival databases
6. Definition of multivariate analysis:
   involves the simultaneous analysis of the relationships among 3 or more variables, across multiple subjects; capable of discovering relationships which could not be revealed by a series of bivariate statistical tests; includes both multivariate & multivariable techniques
7. Multimeasurement models:
   interdependence vs dependence statistical techniques; a study of multiple variables is not necessarily a multivariate study; univariate vs multivariate normality; multivariate statistics are not limited to those which assume multivariate normality
8. Factors which give rise to different multivariate techniques:
   the number of independent & dependent variables; whether the research question is one of dependence or interdependence

   scale of measurement of the independent & dependent variables (nominal, ordinal, interval, or ratio; metric or nonmetric, i.e. continuous or categorical; fixed or random); analytic technique used to estimate the model (e.g. OLS, maximum likelihood estimation)
9. Sampling of multivariate techniques:
   9.1 Dependency methods:
       9.1.1 Multiple linear regression (MLR)
       9.1.2 Discriminant analysis (DA)
       9.1.3 Multivariate analysis of variance (MANOVA) and multivariate analysis of covariance (MANCOVA)
       9.1.4 Logistic regression (LOGIT)
       9.1.5 Canonical correlation (CC)
       9.1.6 Conjoint analysis
       Also: structural equation modeling (LISREL)*, probit*, path analysis*, two-stage least-squares regression*, loglinear analysis*, weighted least-squares regression*, survival analysis*
   9.2 Interdependency methods:
       9.2.1 Principal components analysis
       9.2.2 Factor analysis (FA)
       9.2.3 Cluster analysis
       9.2.4 Multidimensional scaling
       9.2.5 Linear & non-linear techniques
10. Concept of Optimal Scaling
    10.1 Why use Optimal Scaling
    10.2 Optimal scaling level & measurement level
    10.3 CATREG (Categorical Regression Analysis)
    10.4 CATPCA (Categorical Principal Components Analysis)
    10.5 Correspondence Analysis
    10.6 Homogeneity Analysis
11. Readings

Appendix
    i. Multivariate Techniques by Data & Variable Type
    ii. Overview of Multivariate Analysis

* will not be discussed

1. Introduction to Multivariate Analysis

With the growth of computer technology in recent years, remarkable advances have been made in the analysis of psychological, sociological, and other types of behavioral data. Computers have made it possible to analyze large quantities of complex data with relative ease. At the same time, the ability to conceptualize data analysis has also advanced. Equally important has been the expanded understanding and application of a group of analytical statistical techniques known as multivariate analysis.

Multivariate data occur in all branches of science. Almost all data collected by today's researchers can be classified as multivariate data. For example, a marketing researcher might be interested in identifying characteristics of individuals that would enable the researcher to determine whether a certain individual is likely to purchase a specific product. A social scientist might be interested in studying the relationships between teenage girls' dating behaviors and their fathers' attitudes. Each of these endeavors involves multivariate data.

To begin a discussion of multivariate data analysis methods, the concept of an experimental unit must be defined. An experimental unit is any object or item that can be measured or evaluated in some way. Measuring and evaluating experimental units is a principal activity of most researchers. Examples of experimental units include people, animals, insects, fields, plots of land, companies, trees, wheat kernels, and countries. Multivariate data result whenever a researcher measures or evaluates more than one attribute or characteristic of each experimental unit. These attributes or characteristics are usually called variables by statisticians.

Any researcher who examines only two-variable relationships and avoids multivariate analysis is ignoring powerful tools that can provide potentially very useful information. As one researcher states: "For the purposes of any applied field, most of our tools are, or should be, multivariate. One is pushed to a conclusion that unless a problem is treated as a multivariate problem, it is treated superficially."

According to Hardyck and Petrinovich: "Multivariate analysis methods will predominate in the future and will result in drastic changes in the manner in which research workers think about problems and how they design their research. These methods make it possible to ask specific and precise questions of considerable complexity in natural settings. This makes it possible to conduct theoretically significant research and to evaluate the effects of naturally occurring parametric variations in the context in which they normally occur. In this way, the natural correlations among the manifold influences on behaviour can be preserved and separate effects of these influences can be studied statistically without causing a typical isolation of either individuals or variables."

1.1 Impact of the Computer Revolution

Widespread application of computers (first mainframe computers and, more recently, microcomputers) to process large, complex data banks has spurred the use of multivariate statistical methods. Today a number of prepackaged computer programs are available for multivariate data analysis, and others are being developed. In fact, many researchers have appeared who realistically call themselves data analysts

instead of statisticians or (in the vernacular) "quantitative types." These data analysts have contributed substantially to the increase in the number of journal articles using multivariate statistical techniques. Even for people with strong quantitative training, the availability of prepackaged programs for multivariate analysis has facilitated the complex manipulation of the data matrices that long hampered the growth of multivariate techniques. With several major universities already requiring entering students to purchase their own microcomputers before matriculating, students and professors will soon be analyzing multivariate data routinely for decisions of various kinds in diverse fields. Some of the prepackaged programs designed for mainframe computers (e.g., the SPSS and SAS packages) are now available in a form suitable for microcomputers, and more will soon be available.

1.2 Concept of Relationships

1. Univariate Analysis
   The distribution of a single variable Y.

2. Bivariate Relationship
   Y = f (X)

3. Multivariate Relationship
   Y = f (X1, X2, X3, ..., Xk)
   (Y1, Y2, Y3, ..., Yj) = f (X1, X2, X3, ..., Xk)

4. Multivariate Database (N x k)
   A data matrix of N subjects (rows S1, S2, S3, ..., Sn) by k variables (columns X1, X2, X3, ..., Xk), each cell holding one subject's score on one variable.


5. Thoughts on Causality
   In most cases, it is better to theorize that an event has multiple causes or correlates, as opposed to a single cause or correlate.
   Bivariate theory: a single X produces Y.
   Multivariate theory: several variables (X1, X2, X3, X4) jointly produce Y.

2. Statistical vs Operational Significance


2.1 Statistical Significance
The confidence we have that two or more variables are related, however small the relationship may be. The minimum standard for rejecting the null hypothesis and affirming that a relationship exists is p ≤ 0.05. This means that we are 95% confident that a relationship exists, with a 5% chance of being wrong, i.e. of making a Type I error.

2.2 Operational Significance



Concerns the magnitude of the relationship. The fact that two variables are significantly related (p ≤ 0.05) is no indication of the magnitude of the relationship. Nor does statistical significance indicate the nature of the relationship: causal, correlative, spurious, etc.

2.3 Statistical Significance and the Proportion of Variance Explained
Consider the following results:
Y = f (X), N = 52, ryx = +0.273
With df = (N - 2) = (52 - 2) = 50, the correlation is statistically significant (p < 0.05).
ryx^2 = (+0.273)^2 = 0.0745, the coefficient of determination: only 7.45% of the variance in Y is explained.
(1 - ryx^2) = (1.0 - 0.0745) = 0.9255, the coefficient of non-determination: 92.55% of the variance remains unexplained.
On a pie chart, the explained slice would span only (360)(0.0745) = 26.82 degrees.
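These quantities are easy to verify; a minimal sketch in Python using the sample values above:

    # Proportion of variance explained for r_yx = +0.273, N = 52
    r = 0.273
    r2 = r ** 2                       # coefficient of determination
    non_det = 1.0 - r2                # coefficient of non-determination
    print(f"explained:   {r2:.4f}  ({100 * r2:.2f}% of the variance in Y)")
    print(f"unexplained: {non_det:.4f}  ({100 * non_det:.2f}%)")
    print(f"pie-chart slice for the explained portion: {360 * r2:.1f} degrees")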

3. Factors Which Have Inhibited Multivariate Research
   Limitations of the human mind in contending with multiple causes
   Time & cost of gathering multivariate data
   Historical limitations in the development of multivariate analytic techniques
   Problems with the inflation of alpha when using the same bivariate technique repeatedly
   Time and cost of analyzing multivariate data before the advent of the computer
   For many, the mathematical & statistical concepts seem too complex
   Factionalism within the social sciences between empiricists and theorists


4. Inflation of Alpha (α)
Given a study in which the means of k = 5 groups are to be compared using multiple t-tests:
The total number of comparisons is C = [k(k - 1)] / 2 = [5(5 - 1)] / 2 = 10.
The experiment-wide error rate (αEW), given α = 0.05, is
αEW = 1 - (1 - α)^C = 1 - (1 - 0.05)^10 = 0.40.
The probability of making a Type I error is therefore 0.40, not 0.05.
Inflation of α = [(0.40 - 0.05) / 0.05](100) = 700%.
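A minimal sketch in Python reproducing this calculation:

    # Inflation of alpha when k group means are compared with multiple t-tests
    k = 5                             # number of groups
    alpha = 0.05                      # per-comparison Type I error rate
    C = k * (k - 1) // 2              # number of pairwise comparisons = 10
    alpha_ew = 1 - (1 - alpha) ** C   # experiment-wide error rate
    inflation = (alpha_ew - alpha) / alpha * 100
    print(f"C = {C}, experiment-wide alpha = {alpha_ew:.2f}")   # about 0.40
    print(f"inflation of alpha = {inflation:.0f}%")             # about 700%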

5. Factors Encouraging Use of Multivariate Techniques


   Availability of relatively inexpensive personal computers and user-friendly statistical software
   Empirical emphasis on multivariate research in the field
   Expectation of multivariate designs by journal editors
   Increasing availability of automated operational databases
   Increasing availability of automated archival databases via the internet

6. Definition of Multivariate Analysis


Multivariate analysis is not easy to define. Broadly speaking, it refers to all statistical methods that simultaneously analyze multiple measurements on each individual or object under investigation. Any simultaneous analysis of more than two variables can be loosely considered multivariate analysis. As such, multivariate techniques are extensions of univariate analysis (analysis of single-variable distributions) and bivariate analysis (cross-classification, correlation, and simple regression used to analyze two variables).

One reason for the difficulty of defining multivariate analysis is that the term "multivariate" is not used consistently in the literature. To some researchers, multivariate simply means examining relationships among more than two variables. Others use the term only for problems where all of the variables are assumed to have a multivariate normal distribution. However, to be considered truly multivariate, all of the variables must be random variables that are interrelated in such ways that their different effects cannot meaningfully be interpreted separately.

Some authors state that the purpose of multivariate analysis is to measure, explain, and/or predict the degree of relationship among variates (weighted combinations of variables). Thus, the multivariate character lies in the multiple variates (multiple combinations of variables), not only in the number of variables or observations.

7. Some Confusion about the Meaning of the Term Multivariate Analysis


A study that analyzes more than two variables is not necessarily a multivariate study. For example, a survey with 20 questions in which the answers to all the questions are cross-tabulated by gender is not a multivariate test.

Some define multivariate analysis as employing only statistical techniques that assume that the variables in question are multivariate normally distributed. This is a very limiting definition. Some use the term multivariate to describe a scale that produces a composite score from multiple measures on an individual.

A risk assessment score on a probationer based upon current offense, prior record, social/educational history, etc. A better term for this might be a multi-measurement model.

Univariate vs Multivariate Normality
Univariate normality concerns the distribution of each variable taken alone; multivariate normality additionally requires that every linear combination of the variables be normally distributed, so the former does not guarantee the latter.



8. Factors Giving Rise to Different Multivariate Techniques


The number of independent (X) & dependent (Y) variables:
   Y = f (X1, X2, X3)
   (Y1, Y2, Y3) = f (X1, X2, X3, X4)

Whether the research question is one of
   Dependency: Y = f (X1, X2, X3, ..., Xk), or
   Interdependency: the mutual associations among X1, X2, X3, ..., Xk themselves, with no variable singled out as dependent.

Multivariate analyses are often concerned with finding relationships among (1) the response variables, (2) the experimental units, and (3) both response variables and experimental units.

One might say that relationships exist among the response variables when several of the variables really are measuring a common entity. For example, suppose one gives tests to third-graders in reading, spelling, arithmetic, and science. Individual students may tend to get high scores, medium scores, or low scores in all four areas. If this did happen, then these tests would be related to one another. In such a case, the common thing that these tests may be measuring might be overall intelligence.

Relationships might exist between the experimental units if some of them are similar to each other. For example, suppose breakfast cereals are evaluated for their nutritional content. One might measure the grams of fat, protein, carbohydrates, and sodium in each cereal. Cereals would be related to each other if they tended to be similar with respect to the amounts of fat, protein, carbohydrates, and sodium in a single serving of each cereal. One might expect sweetened cereals to be related to each other and high-fiber cereals to be related to each other. One might also expect sweetened cereals to be much different from high-fiber cereals.


Many multivariate techniques tend to be exploratory in nature rather than confirmatory. That is, many multivariate methods tend to motivate hypotheses rather than test them. Consider a situation in which a researcher may have 50 variables measured on more than 2000 experimental units. Traditional statistical methods usually require that a researcher state some hypotheses, collect some data, and then use these data to either substantiate or repudiate the hypotheses. An alternative situation that often exists is a case in which a researcher has a large amount of data available and wonders whether there might be valuable information in these data. Multivariate techniques are often useful for exploring data in an attempt to learn whether there is worthwhile and valuable information contained in these data.

Variable- and Individual-Directed Techniques
One fundamental distinction between multivariate methods is that some are classified as variable-directed techniques, while others are classified as individual-directed techniques. Variable-directed techniques are primarily concerned with relationships that might exist among the response variables being measured. Some examples of this type of technique are analyses performed on correlation matrices, principal components analysis, factor analysis, regression analysis, and canonical correlation analysis. Individual-directed techniques are primarily concerned with relationships that might exist among the experimental units and/or individuals being measured. Some examples of this type of technique are discriminant analysis, cluster analysis, and multivariate analysis of variance (MANOVA).

Creating New Variables
We quite often find it useful to create new variables for each experimental unit so they can be compared to each other more easily. Many multivariate methods help researchers create new variables that have desirable properties. Some of the multivariate techniques that create new variables are principal components analysis, factor analysis, canonical correlation analysis, canonical discriminant analysis, and canonical variates analysis.

Scales of measurement of the variables:
   Nonmetric or categorical, e.g. gender
   Metric or continuous, e.g. age
   Fixed: a metric variable broken into categories, e.g. income groups
   Random: levels of a metric variable determined by random selection

Analytic technique used to estimate the model, for example:
   Ordinary least squares (OLS)
   Maximum likelihood estimation
   Iterative proportional fitting, etc.


9. Sampling of Various Multivariate Techniques


9.1 Dependency Methods
   Multiple Linear Regression
   Multiple Discriminant Analysis
   Multivariate Analysis of Variance
   Multivariate Analysis of Covariance
   Canonical Correlation
   Logistic Regression
   Conjoint Analysis
   Structural Equation Modeling
   Probit
   Path Analysis
   2-Stage Least Squares Regression
   Loglinear Analysis
   Weighted Least Squares Regression
   Nonlinear Estimation
   Survival Analysis

9.2 Interdependency Methods
   Factor Analysis
   Cluster Analysis
   Multidimensional Scaling
   Correspondence Analysis
   Reliability Analysis

Differences Among Multivariate Dependency Techniques
i. Dependency techniques involve prediction of one or more dependent variables (DVs) from two or more independent variables (IVs).
ii. The scale of measurement of the DV(s) can be either metric or nonmetric.
iii. The scale(s) of measurement among the IVs can be metric, nonmetric, or a mixture of the two.
iv. The algorithm used to estimate the parameters of the statistical model may be ordinary least squares (OLS), maximum likelihood estimation (MLE), etc.


9.1 Dependency Methods


9.1.1 Multiple Linear Regression
Multiple regression is the method of analysis that is appropriate when the research problem involves a single metric dependent variable presumed to be related to one or more metric independent variables. The objective of multiple regression analysis is to predict the changes in the dependent variable in response to changes in the several independent variables. This objective is most often achieved through the statistical rule of least squares. Whenever the researcher is interested in predicting the level of the dependent variable, multiple regression is useful. For example, monthly expenditures on dining out (dependent variable) might be predicted from information regarding a family's income, size, and the age of the head of the household (independent variables). Similarly, the researcher might attempt to predict a company's sales from information on its expenditures for advertising, the number of salespeople, and the number of stores selling its products.

Q. To what extent can a metric dependent variable Y be predicted or explained by 2 or more metric and/or nonmetric variables Xk?

Y = a + b1X1 + b2X2 + ... + bkXk + e

where
a = the regression constant
bk = the regression coefficient: the expected change in Y for a unit change in Xk
e = the error in the model
R^2 = the proportion of variance in Y explained by the predictor variables
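As a minimal sketch of how such a model can be estimated by least squares (using NumPy, with a small invented data set for the dining-out example; all numbers are hypothetical):

    import numpy as np

    # Hypothetical data: monthly dining-out expenditure (Y) predicted from
    # family income, family size, and age of the head of household.
    X = np.array([[42.0, 3, 45], [55.0, 4, 39], [30.0, 2, 52],
                  [61.0, 5, 41], [48.0, 3, 36], [35.0, 4, 58]])
    y = np.array([310.0, 410.0, 205.0, 470.0, 360.0, 240.0])

    # Add a column of ones so the model includes the regression constant a.
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)   # [a, b1, b2, b3]

    y_hat = X1 @ coef
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1 - ss_res / ss_tot    # proportion of variance explained
    print(coef, r_squared)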

9.1.2 Discriminant Analysis
If the single dependent variable is dichotomous (e.g., male-female) or multichotomous (e.g., high-medium-low), and therefore nonmetric, the multivariate technique of multiple discriminant analysis (MDA) is appropriate. Discriminant analysis is useful in situations where the total sample can be divided into groups based on a dependent variable that has several known classes. The primary objectives of multiple discriminant analysis are to understand group differences and to predict the likelihood that an entity (individual or object) will belong to a particular class or group based on several metric independent variables.

A company specializing in credit cards would certainly like to be able to classify credit card applicants into two groups of individuals: (1) individuals who are good credit risks and (2) individuals who are bad credit risks. Individuals deemed to be good credit risks would then be offered credit cards, while those deemed to be bad risks would not. To help make this determination, the credit card company might consider several demographic characteristics that can be measured on each individual. For example, the company may consider educational level, salary, indebtedness, and past credit history as possible predictors of creditworthiness. The company would then attempt to use this information to help decide whether an applicant for a credit card should be approved. The multivariate method that would help the company classify applicants into one of the two credit risk groups is called discriminant analysis.

Discriminant analysis (DA) is primarily used to classify individuals or experimental units into two or more uniquely defined populations. To develop a discriminant rule for classifying experimental units into one of several possible categories, the researcher must have a random sample of experimental units from each possible classification group. DA then provides methods that allow researchers to build rules that can be used to classify other experimental units into one of the classification groups. In the credit card example, a discriminant rule is constructed using demographic data from individuals known to be good credit risks and similar data from individuals known to be bad credit risks. New applicants for credit cards are then classified into one of the two risk groups using the resulting rule.

Q. Can 2 or more metric and/or nonmetric independent variables predict or explain membership in two or more categories of a nonmetric dependent variable?

Z = a + W1X1 + W2X2 + ... + WnXn

This equation is called a discriminant function, where
Z = the discriminant score indicating category membership in Y
a = a constant
Wn = the discriminant weight for variable Xn
Xn = an independent variable

Example: To what extent can age, account balance, and education predict credit risk: very good (1), good (2), or bad (3)?

Results of the Discriminant Analysis
Discriminant function: Z = a + W1(age) + W2(balance) + W3(education)
Z = -6.04 + 0.24(age) + 0.25(balance) - 0.04(education)

Q. What is the discriminant score for an 18-year-old with an account balance of 2 (i.e., Rs 20,000) and 12 years of education?
Z = -6.04 + 0.24(18) + 0.25(2) - 0.04(12) = -1.7

Discriminant analysis calculates the Bayesian probabilities of a case with a discriminant score of Z = -1.7 being in each group. The two highest probabilities were good, p = 0.6725, and bad, p = 0.2490. The group with the highest probability is good risk; therefore this case is predicted to be a good risk.

Classification Table / Confusion Table

Observed       Predicted VG   Predicted G   Predicted Bad   % Correct
VGood (32)          26             5              1          81.25%
Good (21)            3            16              2          76.19%
Bad (17)             6             8              3          17.64%
Total (70)          35            29              6          64.29%
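A minimal sketch reproducing this scoring step (the coefficients are those of the worked example above; the group probabilities themselves would come from the discriminant procedure):

    # Discriminant score for the worked example:
    # Z = -6.04 + 0.24(age) + 0.25(balance) - 0.04(education)
    def discriminant_score(age, balance, education):
        return -6.04 + 0.24 * age + 0.25 * balance - 0.04 * education

    z = discriminant_score(age=18, balance=2, education=12)
    print(round(z, 2))   # -1.7, classified "good" (highest Bayesian probability)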

9.1.3 Multivariate Analysis of Variance
Multivariate analysis of variance (MANOVA) is a multivariate generalization of (univariate) analysis of variance (ANOVA), which is a technique used to compare the means of several populations on a single measured variable. When several variables are measured on each experimental unit, you could produce an ANOVA on each measured variable, one variable at a time. For example, if 25 variables are measured, a researcher could produce 25 separate analyses, one for each variable. However, this is not wise, although a majority of experiments are still analyzed using one-variable-at-a-time analyses.

Statisticians raise two main objections to individual analyses for each measured variable. One objection is that the populations being compared may be different on some variables but not on others, and a researcher often finds it confusing to determine which populations are really different and which are similar. Multivariate analysis of variance can help researchers compare several populations by considering all of the measured variables simultaneously.

A second objection is that there is inadequate protection against making Type I errors when performing one-variable-at-a-time analyses. Recall that a Type I error occurs whenever a true hypothesis is rejected. The more variables a researcher analyzes, the more likely it is that at least one of the variables will give rise to statistical significance. As the number of variables being analyzed increases, the probability of finding at least one of these analyses statistically significant (i.e. producing a p value of less than 0.05) approaches one. Certainly, the large risk of making Type I errors should be a concern for experimenters. A researcher should want to be confident

when claiming that two or more populations have different means with respect to a measured variable and that his or her claim would not be contradicted by other experimenters conducting similar analyses on similar data sets. A MANOVA should be performed whenever two or more different populations are being compared on a large number of measured response variables. If a MANOVA shows significant differences between population means, then the researcher can be confident that real differences actually exist. In this case, it is reasonable to consider one-variable-at-a-time analyses to see where the differences actually occur. If the MANOVA does not reveal any significant differences between population means, then the researcher must use extreme caution when interpreting one-variable-at-a-time analyses. Such analyses may be identifying nothing more than false positives.

The difference between ANOVA and MANOVA is in the number of dependent variables:
ANOVA: only one metric DV, but there may be one or more nonmetric IVs.
MANOVA: two or more metric DVs, but there may be one or more nonmetric IVs.
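The second objection above is easy to demonstrate by simulation. A minimal sketch (assuming NumPy and SciPy are available) draws two groups from identical populations on 25 variables, so every "significant" univariate t-test is a Type I error, and counts how often at least one such test turns up:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_experiments, n_vars, n_per_group = 1000, 25, 20
    false_alarms = 0
    for _ in range(n_experiments):
        # Both groups come from the SAME population, so any "significant"
        # univariate result is a Type I error.
        g1 = rng.normal(size=(n_per_group, n_vars))
        g2 = rng.normal(size=(n_per_group, n_vars))
        pvals = stats.ttest_ind(g1, g2).pvalue   # one t-test per variable
        if (pvals < 0.05).any():
            false_alarms += 1
    # Roughly 1 - 0.95**25 = 0.72, far above the nominal 0.05.
    print(false_alarms / n_experiments)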

9.1.4 Logistic Regression (Logit)
Logistic regression is often used to model the probability that an experimental unit falls into a particular group, based on information measured on the experimental unit. Such models can be used for discrimination purposes. In the credit card example described previously, one can model the probability that an individual with certain demographic characteristics will be a good credit risk. After developing this model, it can be used to predict the probability that a new applicant will fall into a certain risk group.

Q. How well can two or more metric and/or nonmetric IVs predict membership in a binary DV?

Prob(event) = 1 / (1 + e^-Z), where e = 2.71828 and Z = f (X1, X2, X3, ..., Xk)

Consider a binary DV in which one category is coded 0 and the other category 1. The event predicted is the category coded 1.

Example: To what extent do age, gender (m = 0, f = 1), and account balance predict type of credit risk: good (0) or bad (1)?

Results of the Logistic Regression Analysis

Prob(event) = 1 / (1 + e^-Z)

Probability of a bad risk:
Z = a + b1(age) + b2(gender) + b3(balance)
Z = -4.93 + 0.12(age) + 1.70(gender) + 0.41(balance)

Q. What is the probability of a 26-year-old female with an account balance of 2 (i.e., Rs 20,000) being predicted a bad risk?
Z = -4.93 + 0.12(26) + 1.70(1) + 0.41(2) = 0.71
1 / (1 + e^-Z) = 1 / [1 + (1 / e^Z)] = 1 / [1 + (1 / 2.71828^0.71)] = 0.67, the probability of a bad risk.

Q. What is the probability of a good risk? P = (1 - 0.67) = 0.33.
Gender and account balance are significant predictors (p < 0.05); age is not a significant predictor (p = 0.106).
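A minimal sketch of this probability calculation:

    import math

    # Z = -4.93 + 0.12(age) + 1.70(gender) + 0.41(balance), from the example
    def prob_bad_risk(age, gender, balance):
        z = -4.93 + 0.12 * age + 1.70 * gender + 0.41 * balance
        return 1.0 / (1.0 + math.exp(-z))   # logistic function

    p_bad = prob_bad_risk(age=26, gender=1, balance=2)   # female = 1
    print(round(p_bad, 2), round(1 - p_bad, 2))          # 0.67 bad, 0.33 good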

Classification Table

Observed      Predicted Good   Predicted Bad   % Correct
Good (37)          30               7           81.08%
Bad (33)            9              24           72.73%
Total (70)         39              31           77.14%

9.1.5 Canonical Analysis

Canonical variates analysis (CVA) is a method that creates new variables in conjunction with multivariate analysis of variance. These new variables are useful for helping researchers determine where the major differences among the population means occur when the populations are being compared on many different variables, by using all of the measured variables simultaneously. Occasionally, the canonical variates may suggest important differences that might otherwise be missed.

Canonical Correlation Analysis
Canonical correlation analysis can be viewed as a logical extension of multiple regression analysis. Recall that multiple regression analysis involves a single metric dependent variable and several metric independent variables. With canonical analysis the objective is to correlate simultaneously several metric dependent variables and several metric independent variables. Whereas multiple regression involves a single dependent variable, canonical correlation involves multiple dependent variables. The underlying principle is to develop a linear combination of each set of variables (both independent and dependent) in a manner that maximizes the correlation between the two sets. Stated differently, the procedure involves obtaining a set of weights for the dependent and independent variables that provides the maximum simple correlation between the set of dependent variables and the set of independent variables.

The assignment of variables into these two groups must always be motivated by the nature of the response variables and never by an inspection of the data. For example, a legitimate assignment would be one in which the variables in one group are easy to obtain and inexpensive to measure, while the variables in the other group are hard to obtain or expensive to measure. Another would be measurements on fathers versus measurements on their daughters. A researcher might want to compare fathers' attitudes with their daughters' dating behaviors. When several different variables have been measured on the fathers and several others measured on the daughters, canonical correlation analysis might be used to identify new variables that summarize any relationships that might exist between these two sets of family members.

One basic question that canonical correlation analysis is expected to answer is whether the variables in one group can be used to predict the variables in the other group. When they can, canonical correlation attempts to summarize the relationships between the two sets of variables by creating new variables from each of the two groups of original variables.
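As a minimal sketch of the computation (not a full canonical correlation program): standardize both sets, whiten each, and take the singular values of the resulting cross-correlation matrix. The fathers/daughters data here are invented:

    import numpy as np

    def canonical_correlations(X, Y):
        """Canonical correlations between two sets of metric variables."""
        X = (X - X.mean(axis=0)) / X.std(axis=0)
        Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
        n = len(X)
        Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

        def inv_sqrt(S):                     # inverse square root of a matrix
            w, V = np.linalg.eigh(S)
            return V @ np.diag(w ** -0.5) @ V.T

        K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
        return np.linalg.svd(K, compute_uv=False)

    rng = np.random.default_rng(0)
    fathers = rng.normal(size=(100, 3))       # hypothetical attitude scores
    daughters = 0.5 * fathers[:, :2] + rng.normal(size=(100, 2))
    print(canonical_correlations(fathers, daughters))   # one per pair of variates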


9.1.6 Conjoint Analysis
Conjoint analysis is concerned with understanding how people make choices between products or services, or combinations of products and services, so that businesses can design new products or services that better meet customers' underlying needs. Although it has been a mainstream research technique only for the last 10 years or so, conjoint analysis has been found to be an extremely powerful way of capturing what really drives customers to buy one product over another and what customers really value. A key benefit is the ability to produce dynamic market models that enable companies to test what steps they would need to take to improve their market share, or how competitors' behavior will affect their customers. It has become one of the most widely used quantitative methods in marketing research. It is used to measure the perceived values of specific product features, to learn how demand for a particular product or service is related to price, and to forecast what the likely acceptance of a product would be if brought to market.

Conjoint analysis is a type of experiment done by market researchers. It enables the researcher to understand consumer preferences or ratings of existing or possible products in terms of product attributes and their levels. The purpose of conducting a conjoint experiment is to learn the relative importance of product attributes, as well as what the most preferred attribute levels are. When done well, conjoint analysis helps the researcher understand the existing and desired product. The researcher can simulate the market share of preference of existing or possible products, even if the particular combination of factor levels that comprises the "product" was not explicitly judged by the subjects.

A number of approaches exist for doing conjoint analysis. In full-profile conjoint analysis, a product card consists of one level setting for each attribute under consideration. The set of such cards can be all possible combinations of attribute levels (one combination per card) or some fraction thereof. Often, the researcher is only interested in presenting a fraction, because all possible combinations represent too many product alternatives to judge without concern about fatigue and the reliability of the subject data. Fractional factorial designs require the aid of a computer; the ORTHOPLAN procedure can generate such designs. The researcher often presents the choice alternatives as a set of physical cards that the subjects then sort in order of preference; the PLANCARDS procedure is a utility for generating such cards. Full-profile conjoint data can be analyzed by way of ordinary least squares regression; the CONJOINT procedure is a specially tailored version of regression.
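As a minimal sketch of the full-profile idea (not the SPSS CONJOINT procedure itself): dummy-code the attribute levels of each card and regress the preference ratings on them by ordinary least squares; the coefficients are the part-worth utilities. Cards, attributes, and ratings are all invented:

    import numpy as np

    # Eight invented product cards: brand (A=0, B=1), price (low=0, high=1),
    # pack size (small=0, large=1) -- a full 2x2x2 factorial design.
    cards = np.array([[b, p, s] for b in (0, 1) for p in (0, 1) for s in (0, 1)])
    ratings = np.array([8, 9, 5, 6, 7, 8, 4, 5], dtype=float)  # one per card

    design = np.column_stack([np.ones(len(cards)), cards])
    partworths, *_ = np.linalg.lstsq(design, ratings, rcond=None)
    print(partworths)   # intercept, then part-worths of brand B, high price, large pack

    # For two-level attributes, relative importance ~ |part-worth| range.
    importance = np.abs(partworths[1:])
    print(importance / importance.sum())   # price dominates in this invented data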


9.2 Interdependence Techniques


9.2.1 Principal Components Analysis
When a researcher is beginning to think about analyzing a new data set, several questions about the data should be considered: (i) Are there any aspects of the data that are strange or unusual? (ii) Can the data be assumed to be distributed normally? (iii) Are there any abnormalities in the data? (iv) Are there outliers in the data? Experimental units whose measured variables seem inconsistent with measurements on other experimental units are usually called outliers.

By far the most important reason for performing a principal components analysis (PCA) is to use it as a tool for screening multivariate data. New variables, called principal component scores, can be created. These new variables can be used as input for graphing and plotting programs, and an examination of the resulting graphical displays will often reveal abnormalities in the data that you are planning to analyze. For example, plots of principal component scores can help identify outliers in the data when they exist. In addition, the principal component scores can be analyzed individually to see whether distributional assumptions, such as normality of the variables and independence of the experimental units, hold. Such assumptions are often required for certain kinds of statistical analyses of the data to be valid.

Principal components analysis uses a mathematical procedure that transforms a set of correlated response variables into a new set of uncorrelated variables that are called principal components. Principal components analysis can be performed on either a sample variance-covariance matrix or a correlation matrix. The type of matrix that is best often depends on the variables being measured. Occasionally, but not often, the newly created variables are interpretable. One cannot always expect to be able to interpret the newly created variables; in fact, it is considered a bonus when the principal component variables can actually be interpreted. When using PCA to screen a multivariate data set, you do not need to be able to interpret the principal components, because PCA is extremely useful regardless of whether the new variables can be interpreted.

Principal components analysis is quite helpful to researchers who want to partition experimental units into subgroups so that similar experimental units belong to the same subgroup. In this case, principal component scores can be used as input to clustering programs. This often increases the effectiveness of the clustering programs while reducing the cost of using such programs. Furthermore, the principal component scores can and should always be used to help validate the results of clustering programs.
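A minimal sketch of PCA on a correlation matrix, used in the screening role described above (data invented, with one outlier planted):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 4))         # four response variables
    data[:, 1] += 0.8 * data[:, 0]           # introduce correlation
    data[0] = [6.0, 6.0, -6.0, 6.0]          # plant one outlier to find

    Z = (data - data.mean(axis=0)) / data.std(axis=0)   # standardize
    R = np.corrcoef(Z, rowvar=False)                    # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]                   # largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    scores = Z @ eigvecs                     # principal component scores
    print(eigvals / eigvals.sum())           # share of variance per component
    print(int(np.argmax((scores ** 2).sum(axis=1))))    # most extreme case: row 0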


9.2.2 Factor Analysis
Factor analysis, including variations such as component analysis and common factor analysis, is a statistical approach that can be used to analyze interrelationships among a large number of variables and to explain these variables in terms of their common underlying dimensions (factors). The statistical approach involves finding a way of condensing the information contained in a number of original variables into a smaller set of dimensions (factors) with a minimum loss of information.

Factor analysis (FA) is a technique that is often used to create new variables that summarize all of the information that might be available in the original variables. For example, consider once again giving tests to third-graders in reading, spelling, arithmetic, and science, whereby individual students may tend to get high scores, medium scores, or low scores in all four areas. If this really does happen, then one might say that these test results are being explained by some underlying characteristic or factor that is common to all four tests. In this example, it might be reasonable to assume that such an underlying characteristic is overall intelligence.

Factor analysis is also used to study relationships that might exist among the measured variables in a data set. Similar to PCA, FA is a variable-directed technique. One basic objective of FA is to determine whether the response variables exhibit patterns of relationships with each other, such that the variables can be partitioned into subsets, with the variables in a subset highly correlated with each other and variables in different subsets having low correlations with each other. Thus, FA is often used to study the correlation structure of the variables in a data set.

One similarity between FA and PCA is that FA can also be used to create new variables that are uncorrelated with each other. Such variables are called factor scores. One advantage that FA seems to have over PCA when new variables are being created is that the new variables created by FA are generally much easier to interpret than those created by PCA. If a researcher wants to create a smaller set of new variables that are interpretable and that summarize most of the information in the measured variables, then FA should be given serious consideration.

9.2.3 Cluster Analysis
Cluster analysis (CA) is similar to discriminant analysis in that it is used to classify individuals or experimental units into uniquely defined subgroups. Discriminant analysis can be used when a researcher has previously obtained random samples from each of the uniquely defined subgroups; cluster analysis deals with classification problems when it is not known beforehand from which subgroups observations originate.

Cluster analysis is an analytical technique that can be used to develop meaningful subgroups of individuals or objects. Specifically, the objective is to classify a sample of entities (individuals or objects) into a small number of mutually exclusive groups based on the similarities among the entities. Unlike discriminant analysis, the groups are not predefined; instead, the technique is used to identify the groups. Cluster analysis usually involves at least two steps. The first is the measurement of some form of similarity or association between the entities in order to determine how many groups really exist in the sample. The second step is to profile the persons or variables in order to determine their composition. This step may be accomplished by applying discriminant analysis to the groups identified by the cluster technique.
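A minimal sketch of the two-step idea using hierarchical clustering (SciPy; the cereal measurements are invented):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    # Hypothetical cereal data: grams of fat, protein, carbohydrate, sodium.
    sweetened = rng.normal([1, 2, 30, 150], 2, size=(10, 4))
    high_fiber = rng.normal([2, 5, 15, 80], 2, size=(10, 4))
    cereals = np.vstack([sweetened, high_fiber])

    # Step 1: measure similarity and form groups (Ward's method here).
    tree = linkage(cereals, method="ward")
    groups = fcluster(tree, t=2, criterion="maxclust")   # ask for 2 clusters
    print(groups)   # step 2 would profile/validate these groups, e.g. with DA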


9.2.4 Multidimensional Scaling
In multidimensional scaling, the objective is to transform consumer judgements of similarity or preference (e.g., preference for stores or brands) into distances represented in a multidimensional space. If objects A and B are judged by the respondents as being most similar compared with all other possible pairs of objects, multidimensional scaling techniques will position objects A and B in such a way that the distance between them in multidimensional space is smaller than that between any other two objects. A related scaling technique, conjoint analysis, is concerned with the joint effect of two or more independent variables on the ordering of a single dependent variable. It permits the development of stronger measurement scales by transforming rank-order responses into metric effects. As will be seen, metric and nonmetric multidimensional scaling techniques produce similar-appearing results. The fundamental difference is that metric multidimensional scaling involves a preliminary transformation of the nonmetric data into metric data, followed by positioning of the objects using the transformed data.

The Analysis of Data and of Similarities
The starting point for most techniques of analysis is the data matrix. This matrix contains properties in the head column, units in the front column, and scores in the body. As far as content is concerned, multidimensional scaling is especially applied to preference data. In marketing research, for example, a number of products are presented to a sample of people, and the individuals have to give their preference for each combination of two products. The input of the analysis is then the matrix of two-by-two similarities between the products. Such similarities were also the starting point in the investigation of the furnishing of the living room; therefore, the analysis chosen in this research was multidimensional scaling (MDS). All two-by-two associations were calculated for 49 characteristics of the living rooms. Remember, for example, the association of the Persian carpet and the parquet floor. These associations formed the input of the MDS analysis.

We must, however, qualify our observations, for the distinction between the input of multidimensional scaling and of other latent structure analyses is in fact relative. The matrix of bivariate associations can be used as a starting point in factor analysis as well as cluster analysis. Think of the correlations between the indicators of marital adjustment, e.g., between staying at home and having outside activities together, or between happiness in marriage and getting on each other's nerves. There is still a difference in research strategy. When the characteristics are presented to the respondents in pairs and preferences are asked, the data already have the form of similarities in the observation phase; in this case, MDS is generally chosen. When, on the other hand, scores on separate characteristics are observed, we in fact perform two analyses: a first analysis results in the matrix of similarities, and the latter is used as the input of a second latent structure analysis. In this case, either factor analysis or cluster analysis is generally chosen.
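A minimal sketch of classical (metric) MDS, which double-centers a matrix of squared distances and takes an eigendecomposition to recover coordinates (the dissimilarities are invented):

    import numpy as np

    def classical_mds(D, dims=2):
        """Coordinates from a symmetric matrix of pairwise distances D."""
        n = len(D)
        J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
        B = -0.5 * J @ (D ** 2) @ J                # double-centered matrix
        eigvals, eigvecs = np.linalg.eigh(B)
        idx = np.argsort(eigvals)[::-1][:dims]     # largest eigenvalues
        return eigvecs[:, idx] * np.sqrt(eigvals[idx])

    # Invented dissimilarities among four brands (0 = identical).
    D = np.array([[0.0, 1.0, 4.0, 3.0],
                  [1.0, 0.0, 4.2, 3.1],
                  [4.0, 4.2, 0.0, 1.5],
                  [3.0, 3.1, 1.5, 0.0]])
    print(classical_mds(D))   # similar brands end up close together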


9.2.5 Linear and Non-linear Techniques of Analysis
Regression, factor analysis, canonical correlation analysis, and other classical techniques are metric and linear. We know that a non-metric variant has been developed for each metric technique; the pioneering work of Leo Goodman and the Gifi group has been praised in this respect. There is really no need to make a distinction between the adjectives non-metric and non-linear, for these non-metric techniques are also non-linear. This is implied in the expression "loglinear model," because the model only becomes linear after taking the logarithm. The title of Gifi's book, Non-linear Multivariate Analysis, also speaks for itself.

Linearity is an old sore in statistics. In applications of classical regression analysis it became clear all too often that plots of the data seldom follow the pattern of a nice straight line or a flat plane. To understand what is meant here, we only have to think of the exponential functions applied by the Club of Rome in the analysis of increasing environmental pollution. In addition to regression, the same holds of course for discriminant analysis, factor analysis, and other classical techniques. A linearity test is, therefore, always an obligation.

A linear function is naturally handy, as it is easy to calculate and to interpret. One can therefore consider, in cases of non-linearity, performing certain transformations on the data in such a way as to obtain linearity. Taking the logarithm is an example of such a transformation. It is, however, also possible to fit a non-linear function, which will be a quadratic function, a function of the third degree, or, in general, a polynomial of the n-th degree, depending on the number of curves detected in the scatterplot: one, two, or n respectively. It has become possible for nearly all existing techniques to fit such a non-linear function. Our conclusion is that, especially in recent decades, a non-linear analogue has been developed for almost all multivariate analyses, so that for each technique we can make a sub-classification into the linear and the non-linear versions.
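A minimal sketch of both remedies, transformation to linearity versus fitting a polynomial directly (NumPy; the exponential-growth data are invented):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 5, 50)
    y = np.exp(0.8 * x) * rng.lognormal(sigma=0.05, size=50)  # exponential data

    # Option 1: transform to obtain linearity -- log(y) is linear in x.
    slope, intercept = np.polyfit(x, np.log(y), 1)

    # Option 2: fit a non-linear (polynomial) function directly.
    quadratic = np.polyfit(x, y, 2)   # degree chosen from curves in the plot

    print(slope, intercept)           # roughly 0.8 and 0: the loglinear fit
    print(quadratic)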


10. Optimal Scaling


10.1 Why Use Optimal Scaling?
Categorical data are often found in marketing research, survey research, and research in the social and behavioral sciences. In fact, many researchers deal almost exclusively with categorical data. Categorical data are typically summarized in contingency tables, and the analysis of tabular data requires a set of statistical models different from the usual correlation- and regression-based approaches used for quantitative data.

Traditional analysis of two-way tables consists of displaying cell counts along with one or more sets of percentages. If the data in the table represent a sample, the chi-square statistic might be computed along with one or more measures of association. Multi-way tables are handled with some difficulty, since the view of the data is influenced by which variable is the row variable, which variable is the column variable, and which variables are control variables. Traditional methods do not work well for three or more variables, because all statistics that might be produced are conditional statistics, which do not in general capture the interrelationships among the variables.

Researchers have developed loglinear models as a comprehensive way of dealing with two-way and multiway tables. "Loglinear models" is an umbrella term for several different models: models for the log-frequency counts in a two-way or multiway table, logit models for log-odds when one categorical variable is dependent and there are one or more categorical predictor variables, association models for the log-odds ratios in two-way tables, and many other special-purpose models. Loglinear models have a number of advantages. They are comprehensive models that apply to tables of arbitrary complexity. They provide goodness-of-fit statistics, so that model-building can be undertaken until a suitable model is found. They provide parameter estimates and standard errors. However, loglinear models also have a number of drawbacks. If the sample size is too small, the chi-square statistic on which the models are based is suspect. If the sample size is too large, it is difficult to arrive at a parsimonious model, and it can be difficult to discriminate between competing models that appear to fit the data. As the number of variables and the number of values per variable go up, models with more parameters are needed, and in practice researchers have had some difficulty interpreting the parameter estimates.

Optimal scaling is a technique that can be used instead of, or as a complement to, loglinear models. Optimal scaling extends traditional loglinear analyses by incorporating variables at mixed levels. Nonlinear relationships are described by relaxing the metric assumptions of the variables. Rather than interpreting parameter estimates, interpretation is often based on graphical displays in which similar variables or categories are positioned close to each other.

The simplest form of optimal scaling is correspondence analysis for a two-way table. If the two-way table portrays two variables that are associated (not independent), correspondence analysis assigns scores to the categories of the row and column variables in such a way as to account for as much of the association between the two variables as possible. Depending on the dimensionality of the table, correspondence analysis assigns one or more sets of scores to each variable. Conventionally, row and column categories are displayed in two-dimensional plots defined by pairs of these scores. Using correspondence analysis and the plots it produces, you can learn the following: within a variable, categories that are similar or different; within a variable, categories that might be collapsed; across variables, categories that go together; what category a user-missing category most resembles; and what the optimal correlation is between the row and the column variable.
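A minimal sketch of simple correspondence analysis for a two-way table: the category scores come from a singular value decomposition of the table's standardized residuals (the contingency table is invented):

    import numpy as np

    # Invented two-way table: rows = brands, columns = age groups.
    N = np.array([[20, 10, 5],
                  [10, 25, 15],
                  [5, 15, 30]], dtype=float)

    P = N / N.sum()                      # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)  # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, sv, Vt = np.linalg.svd(S)

    row_scores = U[:, :2] * sv[:2] / np.sqrt(r)[:, None]   # principal coordinates
    col_scores = Vt.T[:, :2] * sv[:2] / np.sqrt(c)[:, None]
    print(sv[:2] ** 2)    # inertia accounted for by the first two dimensions
    print(row_scores)
    print(col_scores)     # plotted together, similar categories lie close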
____________________________________________________________________________________________________ Introductory Notes on Multivariate Analysis :Prof Prithvi Yadav, I I M , Indore. @June, 2004 - 28 -

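The sketch below, in plain numpy, contrasts a linear treatment of a U-shaped predictor with a nominal (dummy-coded) treatment. It is not the actual CATREG alternating-least-squares algorithm; the data, and the use of indicator columns as a stand-in for category quantification, are illustrative assumptions only:

    import numpy as np

    # Hypothetical data: the response is high for both low and high
    # values of the predictor (a U-shaped relationship).
    x = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
    y = np.array([9, 8, 4, 5, 1, 2, 5, 4, 8, 9], dtype=float)

    def r_squared(design, y):
        # R-squared of an ordinary least-squares fit.
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        return 1 - resid.var() / y.var()

    # Treating x as numerical: a single slope for a linear trend.
    linear = np.column_stack([np.ones_like(y), x])
    print("linear R^2 :", round(r_squared(linear, y), 3))   # near 0

    # Treating x as nominal: one indicator column per category
    # (first category as reference), mimicking quantification.
    nominal = np.column_stack(
        [np.ones_like(y)] + [(x == c).astype(float) for c in np.unique(x)[1:]])
    print("nominal R^2:", round(r_squared(nominal, y), 3))  # near 1

The linear fit yields an R-squared of essentially zero because a single slope cannot bend, while the nominal fit recovers the category means and an R-squared near one; this is the essence of why quantifying categories helps.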
In addition to revealing nonlinear relationships between variables, nonlinear transformation of the predictors usually reduces the dependencies among them. If you compare the eigenvalues of the correlation matrix of the raw predictors with the eigenvalues of the correlation matrix of the optimally scaled predictors, the latter set will usually be less variable than the former. In other words, in CATREG, optimal scaling makes the larger eigenvalues of the predictor correlation matrix smaller and the smaller eigenvalues larger.

Categorical regression with optimal scaling is equivalent to optimal scaling canonical correlation analysis (OVERALS) with two sets, one of which contains only one variable. In the latter technique, similarity of sets is derived by comparing each set to an unknown variable that lies somewhere between all of the sets. In categorical regression, the similarity of the transformed response and the linear combination of transformed predictors is assessed directly.

10.4 Nonlinear Principal Components Analysis (CATPCA)

PRINCALS is an acronym for principal components analysis via alternating least squares. Standard principal components analysis is a statistical technique that linearly transforms an original set of variables into a substantially smaller set of uncorrelated variables representing as much of the information in the original set as possible. Its goal is to reduce the dimensionality of the original data set while accounting for as much of the variation as possible in the original set of variables. Objects in the analysis receive component scores, and plots of the component scores reveal patterns among the objects and can expose unusual objects in the data. A compact sketch of this standard, linear case appears below.
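The following sketch performs a standard (linear) principal components analysis via the singular value decomposition of column-centered data. The simulated data set is an assumption made for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical numerical data: 100 objects, 5 correlated variables
    # generated from 2 latent dimensions plus noise.
    latent = rng.normal(size=(100, 2))
    data = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))

    # Standard PCA via the SVD of the centered data matrix.
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)

    explained = s**2 / np.sum(s**2)
    print("proportion of variance per component:", np.round(explained, 3))

    # Component scores for each object in the first two dimensions;
    # plotting these reveals patterns and unusual objects.
    scores = u[:, :2] * s[:2]

PRINCALS extends this scheme by alternating between such a decomposition and the re-estimation of category quantifications for the nominal and ordinal variables, which is where the "alternating least squares" in its name comes from.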
Standard principal components analysis assumes that all variables in the analysis are measured at the numerical level and that relationships between pairs of variables are linear. Nonlinear principal components analysis, also known as optimal scaling principal components analysis, extends this methodology so that you can perform principal components analysis with any mix of nominal, ordinal, and numerical scaling levels. The aim is still to account for as much variation in the data as possible, given the specified dimensionality of the analysis. For nominal and ordinal variables, the program computes optimal scale values for the categories.

An important application of PRINCALS is the examination of preference data, in which respondents rank or rate a number of items with respect to preference. In the usual SPSS data configuration, rows are individuals, columns are measurements for the items, and the scores across rows are preference scores (on a 0 to 10 scale, for example), making the data row-conditional. For preference data, you may want to treat the individuals as variables. Using the TRANSPOSE procedure, you can transpose the data so that the raters become the variables, with all variables declared ordinal. There is no objection to using more variables than objects in PRINCALS.

More generally, an optimal scaling principal components analysis of a set of ordinal scales is an alternative to computing the correlations between the scales and analyzing them with a standard principal components or factor analysis approach. Research has shown that naive use of the usual Pearson correlation coefficient as a measure of association for ordinal data can lead to nontrivial bias in the estimation of the correlations.

If all variables are declared numerical, PRINCALS produces an analysis equivalent to a standard principal components analysis (as obtained, for instance, through the Factor Analysis procedure); each approach has its own benefits. If all variables are declared multiple nominal, PRINCALS produces an analysis equivalent to a homogeneity analysis run on the same variables. Thus, optimal scaling principal components analysis can be seen as a type of homogeneity analysis in which some of the variables are declared ordinal or numerical.

Nonlinear Canonical Correlation Analysis (OVERALS)

Nonlinear canonical correlation analysis (OVERALS), or canonical correlation analysis with optimal scaling, is the most general of the five procedures in the optimal scaling family. It performs nonlinear canonical correlation analysis on two or more sets of variables. The goal of canonical correlation analysis is to analyze the relationships between sets of variables, rather than between the variables themselves as in principal components analysis. In standard canonical correlation analysis there are two sets of numerical variables: for example, one set might be demographic background items on a group of respondents, while the second set might be their responses to a set of attitude items. Standard canonical correlation analysis is a statistical technique that finds a linear combination of one set of variables and a linear combination of a second set of variables that are maximally correlated. Given this first pair of linear combinations, canonical correlation analysis can find subsequent independent pairs, referred to as canonical variables, up to a maximum number equal to the number of variables in the smaller set.
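A brief sketch of the standard two-set, numerical case is given below using scikit-learn's CCA estimator. The simulated data and the choice of library are assumptions for illustration (scikit-learn fits the model iteratively, but the resulting canonical correlations agree closely with the classical solution):

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(1)
    # Hypothetical sets: 3 demographic and 4 attitude variables on 200
    # respondents, sharing one underlying dimension.
    shared = rng.normal(size=(200, 1))
    demo = np.hstack([shared, rng.normal(size=(200, 2))])
    att = np.hstack([shared + 0.3 * rng.normal(size=(200, 1)),
                     rng.normal(size=(200, 3))])

    cca = CCA(n_components=2).fit(demo, att)
    u, v = cca.transform(demo, att)

    # Correlation between each pair of canonical variables.
    for k in range(2):
        r = np.corrcoef(u[:, k], v[:, k])[0, 1]
        print(f"canonical correlation {k + 1}: {r:.3f}")

With data generated this way, the first canonical correlation should come out high and the second near zero, reflecting the single shared dimension.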
Optimal scaling canonical correlation analysis extends the standard analysis in several ways. First, there can be two or more sets of variables, so you are not restricted to two sets, as in most popular implementations of canonical correlation analysis. Second, the scaling levels in the analysis can be any mix of nominal, ordinal, and numerical. Third, optimal scaling canonical correlation analysis determines the similarity among the sets by simultaneously comparing the canonical variables from each set to a compromise set of scores assigned to the objects.

If there are two sets of variables in the analysis and all variables are defined to be numerical, optimal scaling canonical correlation analysis is equivalent to standard canonical correlation analysis. Although SPSS does not have a dedicated canonical correlation analysis procedure, many of the relevant statistics can be obtained from multivariate analysis of variance. If there are two or more sets of variables with only one variable per set, optimal scaling canonical correlation analysis is equivalent to optimal scaling principal components analysis. If all variables in a one-variable-per-set analysis are multiple nominal, it is equivalent to homogeneity analysis. If there are two sets of variables, one of which contains only one variable, it is equivalent to categorical regression with optimal scaling.

Optimal scaling canonical correlation analysis has various other applications. If you have two sets of variables and one of the sets contains a nominal variable declared as single nominal, the results can be interpreted in a fashion similar to regression analysis. If you consider that variable to be multiple nominal, the analysis is an alternative to discriminant analysis. Grouping the variables into more than two sets provides a variety of further ways to analyze your data.

10.5 Correspondence Analysis

The Correspondence Analysis procedure is a very general program for making biplots of correspondence tables, using either chi-square distances, as in standard correspondence analysis, or Euclidean distances, for more general biplots. The procedure also offers the ability to constrain categories to have equal scores, a useful option for imposing an ordering on the categories, and the ability to fit supplementary points into the space defined by the active points.

In a correspondence table, the row and column variables are assumed to represent unordered categories; therefore, the nominal optimal scaling level is used. Both variables are inspected for their nominal information only. That is, the only consideration is the fact that some objects are in the same category while others are not; nothing is assumed about the distance or order between categories of the same variable.

One specific use of correspondence analysis is the analysis of a two-way contingency table. The SPSS Crosstabs procedure can also be used to analyze contingency tables, but correspondence analysis provides a graphic summary in the form of plots that show the relationships between the categories of the two variables. If a table has r active rows and c active columns, the number of dimensions in the correspondence analysis solution is min(r, c) - 1. In other words, you could perfectly represent the row and column categories of a contingency table in a space of min(r, c) - 1 dimensions. Practically speaking, however, you would like to represent the row and column categories of a two-way table in a low-dimensional space, say two dimensions, for the obvious reason that two-dimensional plots are comprehensible while multidimensional spatial representations usually are not. When fewer than the maximum number of possible dimensions is used, the statistics produced in the analysis describe how well the row and column categories are represented in the low-dimensional space. Provided that the quality of representation of the two-dimensional solution is good, you can examine plots of the row points and the column points to learn which categories of the row variable are similar, which categories of the column variable are similar, and which row and column categories are similar to each other.

Independence is a common focus in contingency table analyses. However, even in small tables, detecting the cause of departures from independence may be difficult. The utility of correspondence analysis lies in displaying such patterns for two-way tables of any size. If there is an association between the row and column variables, that is, if the chi-square value is significant, correspondence analysis may help reveal the nature of the relationship.
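The core computation can be sketched compactly: standard correspondence analysis amounts to a singular value decomposition of the table's standardized residuals from independence. The table below is invented for illustration, and the code follows the usual textbook formulation rather than the SPSS procedure itself:

    import numpy as np

    # Hypothetical 4x3 contingency table (rows x columns).
    N = np.array([[30, 10,  5],
                  [10, 40, 10],
                  [ 5, 10, 35],
                  [20, 15, 15]], dtype=float)

    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses

    # Standardized residuals from the independence model.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)

    # Principal coordinates of the row and column categories.
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]

    # Each squared singular value is the inertia of one dimension;
    # at most min(r, c) - 1 dimensions carry nonzero inertia.
    print("inertia per dimension:", np.round(sv**2, 4))
    print("row coordinates (2-D):\n", np.round(row_coords[:, :2], 3))
    print("column coordinates (2-D):\n", np.round(col_coords[:, :2], 3))

Plotting the first two columns of row_coords and col_coords together yields the familiar correspondence map in which nearby categories are similar.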
Homogeneity Analysis (HOMALS)

HOMALS is an acronym for homogeneity analysis via alternating least squares. The input for homogeneity analysis, also known as multiple correspondence analysis, is the usual rectangular data matrix, in which the rows represent subjects (or, more generically, objects) and the columns represent variables; there may be two or more variables in the analysis. As in correspondence analysis, all variables in a homogeneity analysis are inspected for their nominal information only: the analysis considers only the fact that some objects are in the same category while others are not, and nothing is assumed about the distance or order between categories of the same variable.

Whereas correspondence analysis is limited to two variables, homogeneity analysis can be thought of as the analysis of a multiway contingency table (with more than two variables). Multiway contingency tables can also be analyzed with the SPSS Crosstabs procedure, but Crosstabs gives separate summary statistics for each category of each control variable. With homogeneity analysis, it is often possible to summarize the relationships among all the variables in a single two-dimensional plot.

For a one-dimensional solution, homogeneity analysis assigns optimal scale values (category quantifications) to each category of each variable in such a way that, overall, the categories have maximum spread on average. For a two-dimensional solution, homogeneity analysis finds a second set of quantifications of the categories of each variable, unrelated to the first set, again attempting to maximize spread, and so on. Because the categories of a variable receive as many scorings as there are dimensions, the variables in the analysis are said to be multiple nominal in optimal scaling level.

Homogeneity analysis also assigns scores to the objects in the analysis in such a way that the category quantifications are the averages, or centroids, of the object scores of the objects in each category. The output includes plots of the category quantifications and the object scores. By design, homogeneity analysis tries to produce a solution in which objects within the same category are plotted close together and objects in different categories are plotted far apart, for all variables in the analysis. The plots have the property that each object is as close as possible to the category points of the categories that apply to it. In this way, the categories divide the objects into homogeneous subgroups (one reason for the name homogeneity analysis). Variables are considered homogeneous when they classify objects in the same categories into the same subgroups.
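Because homogeneity analysis coincides with multiple correspondence analysis, a compact way to sketch it is as a correspondence analysis of the indicator (dummy) matrix. The data and the SVD shortcut below are illustrative assumptions; the actual HOMALS procedure reaches an equivalent solution (up to normalization conventions) by alternating least squares:

    import numpy as np

    # Hypothetical categorical data matrix: 6 objects, 3 nominal variables.
    data = np.array([["a", "x", "low"],
                     ["a", "x", "low"],
                     ["b", "y", "high"],
                     ["b", "y", "high"],
                     ["a", "y", "low"],
                     ["b", "x", "high"]])

    # Indicator matrix: one 0/1 column per category of each variable.
    blocks = []
    for j in range(data.shape[1]):
        cats = np.unique(data[:, j])
        blocks.append((data[:, j][:, None] == cats[None, :]).astype(float))
    Z = np.hstack(blocks)

    # Homogeneity analysis as correspondence analysis of the indicators.
    P = Z / Z.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)

    object_scores = (U * sv)[:, :2] / np.sqrt(r)[:, None]       # per object
    quantifications = (Vt.T * sv)[:, :2] / np.sqrt(c)[:, None]  # per category
    print("object scores:\n", np.round(object_scores, 3))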
If homogeneity analysis is applied to two variables, the results are not completely identical to those produced by correspondence analysis, although both are appropriate when suitably interpreted. In the two-variable situation, correspondence analysis produces unique output summarizing the fit and the quality of representation of the solution, including stability information; correspondence analysis is therefore probably preferable to homogeneity analysis in the two-variable case in most circumstances. Another difference between the two procedures concerns their input: the input to homogeneity analysis is a data matrix in which the rows are objects and the columns are variables, while the input to correspondence analysis can be the same data matrix, a general proximity matrix, or a joint contingency table, an aggregated matrix in which both the rows and the columns represent categories of variables.

Homogeneity analysis can be thought of as principal components analysis of nominal data at multiple optimal scaling levels. If the variables in the analysis are assumed to be at the numerical level (that is, linear associations between the variables are assumed), then standard principal components analysis is appropriate.

An advanced use of homogeneity analysis is to replace the original category values with the optimal scale values from the first dimension and then perform a secondary multivariate analysis. The Factor Analysis procedure produces a first principal component that is equivalent to the first dimension of homogeneity analysis: the component scores in the first dimension equal the object scores, and the squared component loadings equal the discrimination measures. The second homogeneity analysis dimension, however, is not equal to the second dimension of factor analysis. Since homogeneity analysis replaces the category labels with numerical scale values, many procedures that require interval-level (numerical) data can be applied after a homogeneity analysis. The same is true of nonlinear principal components analysis, nonlinear canonical correlation analysis, and categorical regression.
11. Readings

Hair, J.F., Anderson, R.E., Tatham, R.L. and Black, W.C. (1998) Multivariate Data Analysis with Readings, 5th Edition, Upper Saddle River, New Jersey: Prentice-Hall International. Chapter 1.

Gofton, L.R. and Ness, M.R. (1997) Business Marketing Research, London: Kogan Page. Chapter 8.

Heenan, D.A. and Addleman, R.B. (1976) "Quantitative Techniques for Today's Decision Makers", Harvard Business Review, May-June, pp. 32-62.

Hooley, G.J. (1980) "The Multivariate Jungle: The Academic's Playground but the Manager's Minefield", European Journal of Marketing, Vol. 14, No. 1, pp. 379-386.

Johnson, Dallas E. (1998) Applied Multivariate Methods for Data Analysis. Duxbury Press, An International Thomson Publishing Company, Washington.

Ness, M.R. (1997) "Multivariate Analysis in Marketing Research", in: Padberg, D., Ritson, C. and Albisu, L.M. (Eds), Agro-Food Marketing, Wallingford, Oxfordshire: CAB International. Chapter 12, pp. 253-278.

Sheth, J. (1971) "The Multivariate Revolution in Marketing Research", Journal of Marketing, Vol. 35, January, pp. 13-19.

SPSS Categories (1998). Marketing Department, SPSS Inc., 444 North Michigan Avenue, Chicago.

Tacq, Jacques (1997) Multivariate Analysis Techniques in Social Science Research: From Problem to Analysis. London: Sage Publications.
Appendix (i) Selection of Multivariate Technique by Data & Variable Type

                           Dependent Variable
  Independent Variable     Metric                               Non Metric
  -------------------------------------------------------------------------------------------------
  Metric                   Multiple Regression Analysis;        Multiple Discriminant Analysis (MDA);
                           Canonical Correlation Analysis       Canonical Correlation with Dummy Variables
  Non Metric               Multivariate Analysis of Variance    Canonical Correlation with Dummy Variables
Appendix (ii) Overview of Multivariate Techniques

Multivariate Methods
  Dependence Methods
    One Dependent Variable
      Metric: Multiple Regression; Conjoint Analysis
      Non Metric: Discriminant Analysis
    Many Dependent Variables
      Metric: Multivariate ANOVA; Canonical Analysis
      Non Metric: Canonical Analysis with Dummy Variables
    Many Independent & Many Dependent Variables
      Metric: Canonical Analysis
  Interdependence Methods
    Metric: Factor Analysis; Cluster Analysis; Metric MDS
    Non Metric: Non-Metric MDS