Notes on Overview of SPSS Environment

Prepared by Prof Prithvi Yadav for the course ADA

2004 Indian Institute of Management, Rajendra Nagar, Indore

Notes on Overview of SPSS Environment. Prepared by Prof Prithvi Yadav, IIM Indore, © June 2004

What is SPSS?
SPSS is a suite of computer programs which has been developed over many years. The original SPSS and SPSSx were available only on mainframe computers. The Windows PC version is now one of the most widely used programs of its type in the world. The most recent version is number 8, but it operates only under Windows 95. As Windows 95 is not yet widely used, this book is based on SPSS version 6 or 6.1. All users will have the Base system of whichever version of SPSS they have purchased. In addition, one can purchase extra modules which provide additional analytic procedures, and some of the more commonly used ones are described in this book. Even the Base system consists of a large set of programs, but the user does not need to know much about the actual SPSS programs; the important thing is learning to drive rather than learning how the car works! It is, however, important to understand the general structure of the package and of the files that are used and created by it.

Like earlier versions, SPSS for Windows consists of a number of components. First, there are the programs making up the package itself; these read the data, carry out the analysis, and produce a file of the results. The normal user needs to know little about these programs, just as a driver needs to know little about the structure of the internal combustion engine or the physical characteristics of a differential. Second, there are the numbers that the user wants analysed, which have to be entered into a data window and saved as a data file. Third, there are the commands which tell the package which analyses the user wants performed on the data. Fourth, there are the results of the analysis. Entering the data, providing the instructions on which analyses to perform, and examining the output can all be carried out on screen, with the data, commands and results available in separate windows at the same time. Essentially, SPSS itself sits on the hard disk.
In order to use it you must provide it with data to be analysed; the data are entered into a table presented on the screen and then stored in a data file with the filename extension .SAV. When you want the data to be analysed, you have to tell SPSS which analysis you want done, by issuing commands. The commands can be entered by selecting from the Window menus, and they can be pasted into a syntax window and stored in a syntax file. It is not essential to save one's commands in a file, but I strongly urge you to do so. You can tell SPSS to print each procedure or command it is following in the output file as it does it; this is extremely useful when you are analysing a complex set of data. When SPSS for Windows runs, it either reacts to the commands selected from the menus directly and applies them to the data in the table, or it reads the commands from the syntax window and responds by applying them to the data.


When it is running, SPSS for Windows creates two output files. One holds the results of the analysis it has performed and is put into a window on the screen initially entitled !Output1. You will almost always want to save this file, and it will be saved with the filename extension .LST. The other output file is named SPSS.JNL and records a list of the commands which SPSS carries out. Every time you use SPSS for Windows, the record of the commands you use is added to the end of the .JNL file, so over a series of sessions it can become very lengthy. You can turn off the process of recording the journal file, or you can ask the system to record only the .JNL file for the current session, overwriting any previous version. You can also have the .JNL file stored on a floppy disk rather than the hard disk. To do any of these, use Edit/Preferences; the way to achieve these alterations to the way the .JNL file is saved will be clear once you have experience of using SPSS for Windows.

Naming files
SPSS will automatically give data files the extension .SAV, output files the extension .LST, syntax files the extension .SPS and chart files of graphs the extension .CHT. The way file names are structured in Windows 3.1 is a hang-over from an operating system known as Microsoft DOS. In DOS, a filename could consist of a maximum of eight characters followed by a full-stop and then a three-character extension, and this pattern also applies to filenames in SPSS if you are using Windows 3.1. (As Windows 95 supports long filenames, these can be used if you have SPSS version 7.) You will find in practice that with SPSS version 6 the restrictions on filenames are a real nuisance, because you will be doing a number of analyses on a set of data and it is very easy to lose track of all the files you create on the way. I have found it helpful to save the files created in each SPSS session using names which indicate the date when they were created.
So, for example, if you are working on September 15th, save the syntax file with the name sep15.sps and the output file with the name sep15.lst. Then, when you come back later and have to find a particular file on a crowded disk, at least you know that the syntax stored in sep15.sps produced the output stored in sep15.lst. In addition, you will know that the files named sep15.sps and sep15.lst are a pair, and different from those named sep20.sps and sep20.lst. Another worthwhile tip is to keep all the files referring to one set of data in one directory, separate from the files referring to another set of data. One note of warning: do ensure that you keep the correct filename extensions for the type of file you are saving. So all syntax files must have the extension .sps, all data files the extension .sav and all output files the extension .lst. If you fail to do this you will have a chaotic situation, have great difficulty finding the files you want, and may well lose vital files altogether.

Essential terminology for all SPSS users
You need to appreciate some of the terminology that is used when explaining how the package operates.


THE CASE
When you approach SPSS, you have some data to be analysed, in the form of responses or scores from a number of different respondents. A respondent may be a person or an organization such as a hospital ward or a school. (You might be dealing with the record of the number of patients treated over a given period, or the exam successes of each of 200 schools, for example.) Each respondent is known as a case, and the results from one case (respondent) form one line in the data file. In SPSS for Windows, the data from one case forms one row in a table.

VARIABLES AND LEVELS
You will have a number of items of data from each case, such as the respondent's age, sex, income, score on an intelligence test, number of heart attacks, etc. Each of these is a score on a variable (age, sex, income, etc.). Each variable has to have a name, which cannot be more than eight characters long and must not contain a space. (So you could name a variable intell, but not intelligence, since the full word has more than eight letters. And you cannot call a variable score 1, as that contains a space; you would have to use score1 or score_1 as the name.) In SPSS for Windows, variables are automatically named var0001, var0002, etc. until you rename them (which you should always do). When you tell SPSS to analyse the data from the data file, you have to tell it which variables to analyse, by indicating their names. So if you have a variable which indicates the respondent's gender, you might name this variable sex. Then, when you want to analyse the responses on this variable (for example, to find out how many respondents were male and how many female), you have to tell SPSS to analyse the variable sex, i.e. you use the name that has been given to that variable. It is important to be clear about the difference between variables and levels of a variable.
The variable is whatever aspect of the respondent you have measured: age, sex, number of times admitted to hospital, intention to vote for a particular party, etc. The levels are the number of alternative values that the score on the variable can take. For example, there are two levels of the variable sex: male and female. Age can have many levels; if you record the age in years, it can vary from 0 to about 105, so there would be 106 levels. Usually, age is put into categories such as 0-20, 21-40, 41-60 and over 60, and in this particular case this gives four levels of the age variable. (It is quite simple to enter the actual ages into SPSS and then have the program code the values into a smaller number of categories, using the RECODE procedure.)

SYSTEM-MISSING AND USER-DEFINED MISSING VALUES
You need to appreciate the concept of the system-missing value, and how it differs from a user-defined missing value. When you enter data into the Data Editor table, if you leave empty one of the cells in a column or a row that contains data, the empty cell will be filled with a full-stop (period). This cell will be detected as containing no data, and SPSS for Windows will give it the system-missing value. So the system-missing value is automatically inserted when a number is expected but none is provided by the data. But when some respondents have not answered all the questions asked, or have failed to provide a measure on one of the variables, it is sensible (for reasons that will become clear later) to record a no-response by entering a particular number. So one


might record male as 1, female as 2, and then use 3 to indicate that the person failed to indicate their sex. The value of 3 on the variable sex would then be a user-defined missing value. Of course, one has to tell SPSS that this value represents no response; how to do this is explained later. When choosing a number to represent missing data, it is essential to use a number that cannot be a genuine value for that variable. If one wanted to define a missing value for age, for example, one would use a number that could not possibly be genuine, such as -1 or 150.

Case Number
When entering data from a set of respondents, it is always worth inserting a variable that represents the identification number of the respondent. You can then readily find the data for any respondent you need, and check the entries against the original record of the responses. This identification number has to be entered as a score on a variable, just like any other. But SPSS also assigns its own identification number to each case, numbering each case sequentially as it reads the data file. This is the $casenum variable, which you may see listed when you ask for information about the variables in the data file. The first case in the data file, which may have any user-defined identification number you wish, will have a $casenum of 1, the next one will have $casenum 2, and so on. Do not rely on $casenum for identifying cases in the data file, as it can change if you put the cases into a different order. Use an ID number variable as well!

Procedures
When SPSS analyses the data, it applies a procedure: for example, the part of SPSS which calculates a correlation coefficient is one procedure, and the part which reports the average score on a variable is another.
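Two of the ideas above, collapsing a variable into categories (what the RECODE procedure does) and honouring a user-defined missing value, can be sketched in plain Python; the ages and the missing-value code 999 here are invented for illustration:

```python
MISSING = 999  # hypothetical user-defined missing value for age

def recode_age(age):
    # Collapse age in years into the four categories used above;
    # the user-missing code stays missing rather than being recoded.
    if age == MISSING:
        return None
    if age <= 20:
        return "0-20"
    if age <= 40:
        return "21-40"
    if age <= 60:
        return "41-60"
    return "over 60"

ages = [15, 34, 59, 72, 999, 41]
categories = [recode_age(a) for a in ages]
```

The point of the sketch is simply that the missing code must be tested for before any genuine recoding, which is exactly why a missing code that could be a real value would be disastrous.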
What you need to run SPSS for Windows
Assuming SPSS for Windows is installed on your hard disk, there are six things you need to know in order to use it:
1. How to get into the SPSS program;
2. How to create and save a file of data to be analysed;
3. Which analyses you want done;
4. How to obtain the commands (procedures) which do those analyses;
5. How to get SPSS to apply the commands to the data file;
6. How to save the contents of the output (.LST) file on floppy disks.


Data Files
Spreadsheets created with Lotus 1-2-3 and Excel
Database files created with dBASE and various SQL formats
Tab-delimited and other types of ASCII text files
Data files in SPSS format created on other operating systems
SYSTAT data files

Opening a Data file


In addition to files saved in SPSS format, you can open Excel, Lotus 1-2-3, dBASE and tab-delimited files without converting the files to an intermediate format or entering data definition information.

Data file types:
SPSS            (*.sav)
SPSS/PC+        (*.sys)
SYSTAT
SPSS Portable   (*.por)
Excel           (*.xls)
Lotus 1-2-3     (*.wk1, *.wks, *.wk3)
SYLK            (*.slk)
dBASE           (*.dbf), dBASE II, III and IV formats
Tab-delimited   (*.dat)
Fixed ASCII     (*.dat)

How the Data Editor reads Excel 5 or later files
Data type and width: the data type and width for each variable are determined by the data type and width in the Excel file.
Blank cells: blank cells are converted to the system-missing value.
Variable names: if you read the first row of the Excel file as variable names, names longer than 8 characters are truncated.

How the Data Editor reads older Excel files and other spreadsheet files
Data type and width: the data type and width for each variable are determined by the data type and width in the file. If the first data cell in the column is blank, the global default data type for the spreadsheet (numeric) is used.
Blank cells: for numeric variables, blank cells are converted to the system-missing value, indicated by a period. For string variables, blanks are treated as valid string values.
Variable names: if you read the first row of the file as variable names, names longer than 8 characters are truncated. If you do not read variable names from the spreadsheet, the column letters (A, B, C, ...) are used as variable names.


How the Data Editor reads dBASE files
Field names are automatically translated to variable names. If the first eight characters of a field name do not produce a unique name, the field is dropped. Colons used in dBASE field names are translated to underscores. Records marked for deletion but not actually purged are included; SPSS creates a new string variable, D_R, which contains an asterisk for cases marked for deletion.
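The field-name translation just described can be sketched in a few lines of Python; the field names here are invented for illustration, and the drop-a-field-on-duplicate rule is not shown:

```python
def dbase_to_spss_name(field):
    # Colons in dBASE field names become underscores, and the result
    # is cut to SPSS's eight-character variable-name limit.
    return field.replace(":", "_")[:8]

names = [dbase_to_spss_name(f) for f in ["CUST:ID", "PURCHASEDATE"]]
```

Note how "PURCHASEDATE" loses its distinguishing tail; this is why two long field names sharing their first eight characters cannot both survive the translation.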
Reading Text Data Files

Tab-delimited files
Space-delimited files
Comma-delimited files
Fixed-field format files

For delimited files, you can also specify other characters as delimiters between values, and you can specify multiple delimiters.

How are variables arranged? To read data properly, the Text Wizard needs to know how to determine where the data value for one variable ends and the data value for the next variable begins.
Delimited: spaces, commas, tabs, or other characters are used to separate variables. The variables are recorded in the same order for each case, but not necessarily in the same column locations.
Fixed width: each variable is recorded in the same column location on the same record for each case in the data file.
Are variable names included at the top of your file?
How are cases arranged? Either each line represents a case, or a specific number of variables represents a case. You can also choose how many cases to import.

Text Wizard data formatting options:
Do Not Import: omit the selected variables from the imported data file.
Numeric: valid values include numbers, a leading plus or minus sign, and a decimal indicator.
String: valid values include virtually any keyboard characters and embedded blanks.
Date/Time: valid values include dates of general format.
Dollar: valid values are numbers with an optional leading dollar sign and optional commas as thousands separators.
Comma: valid values include numbers that use a period as a decimal indicator and commas as thousands separators.
Dot: valid values include numbers that use a comma as a decimal indicator and periods as thousands separators.
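As a rough illustration of what the Text Wizard does with a delimited file, here is a minimal Python sketch using the standard csv module; the variable names and values are made up:

```python
import csv
import io

# A small tab-delimited "file" with variable names in the first row.
text = "id\tage\tsex\n1\t34\t1\n2\t51\t2\n"

reader = csv.reader(io.StringIO(text), delimiter="\t")
rows = list(reader)
names = rows[0]    # first line read as variable names
cases = rows[1:]   # each remaining line is one case
```

The delimiter argument is the whole story for the "Delimited" option: change it to "," or " " and the same code reads comma- or space-delimited data.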


Overview of Data Analysis in SPSS


Data Screening: It is very unusual for real data to arrive without problems. In large studies, the process of data screening can consume considerably more time and effort than the primary analysis of interest. The first step is to identify recording or data entry errors and to examine how well the data meet the assumptions of the intended analysis. This section suggests some steps you can take to screen your data. Of course, exactly what to do depends on the size of your study, its intended goals, and the problems encountered.

Identifying outliers and rogue values. The first step in cleaning data is usually to find values outside the reasonable range for a variable and to determine whether they are real outliers or errors. Use Frequencies to count the occurrence of each unique value (that is, when the variable does not have hundreds of unique values, as Social Security numbers do). You may find typos or unexpected values and codes. Also, look for missing values that appear as valid values. For quantitative variables, use histograms in the Frequencies or Explore procedure, and boxplots and stem-and-leaf diagrams in the Explore procedure. Notice the information on outliers in the Explore plots. For large data sets, scan the minimum and maximum values displayed in procedures like Descriptives and Means. You might find codes that are outside your coding scheme, or codes for missing values (like 999) that are being treated as data. Use Case Summaries to list data. You can choose a grouping variable to list the cases by category. You might also find it useful to sort your data by a variable of interest before listing. By default, only the first 100 cases in your file are listed; you can raise or lower that limit. To list just selected cases, first use Select Cases from the Data menu. You might compute standardized scores and select cases with values larger than 3.
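That last suggestion is easy to sketch with Python's standard library; the ages below are invented, with 999 standing in for a stray missing-value code that slipped through as real data:

```python
import statistics

# Hypothetical ages with one rogue value (a missing-value code of 999
# accidentally treated as real data).
ages = [23, 25, 31, 28, 24, 27, 30, 26, 29, 22,
        28, 25, 27, 24, 26, 30, 23, 29, 28, 31, 999]

mean = statistics.mean(ages)
sd = statistics.stdev(ages)
zscores = [(a - mean) / sd for a in ages]

# Flag cases whose standardized score exceeds 3 in absolute value.
outliers = [a for a, z in zip(ages, zscores) if abs(z) > 3]
```

One caveat the sketch also illustrates: a single extreme value inflates the standard deviation it is judged against, so in very small samples even a gross error may not reach |z| > 3.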

Often, some outliers are better identified when you study two or more variables together. For categorical data, crosstabulations may reveal unlikely or undesirable combinations, such as a person who never visited your store but gave it a rating. Use bivariate scatterplots (from the Graphs menu) to reveal unusual combinations of values in numerical data. Consider a scatterplot matrix to display combinations of values among multiple variables. The Mahalanobis distance and leverage statistics in Regression are useful for identifying outliers among a set of quantitative variables.

Assessing distributional assumptions. Data distributions may not be as advertised; often, they are not normal and probably not symmetric. Using Regression to predict
one variable from another will yield poor results if the variables are highly skewed. By trying a transformation such as logpop = LG10(populatn), you may find that the logged values remedy the problem. To check distributions, you can:
Use Frequencies or the Graphs menu to generate histograms with normal curves overlaid.
Use the Explore procedure, or P-P plots from the Graphs menu, to generate normal probability plots. You can also use probability plots to compare the distribution of a variable to a number of standard distributions besides the normal.
For larger data sets, compare the values of the mean, 5% trimmed mean, and median. If they differ markedly, the distribution is skewed.
As a formal test of normality, try the Kolmogorov-Smirnov test or the Shapiro-Wilk test in Explore.

If you plan an analysis that compares means of groups, you may encounter more problems. For example, if you plan to use body weight in an analysis of variance, the probabilities in the output may be distorted if the distribution within groups differs widely from normal, or if the spread of the distributions across the groups varies greatly (that is, the assumption of equal variances is violated). Use boxplots to identify skewed distributions and vastly different spreads across the groups; boxplots show, for instance, how a log transformation improves the within-group distributions of a variable like population. Use the following strategies to detect and deal with distribution problems across groups:
Perform Levene's test of homogeneous (or equal) variances in Explore or One-Way ANOVA.
Use the suggestion for a power transformation to stabilize cell variances provided with the spread-versus-level plot in Explore.
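The effect of an LG10-style transformation on a skewed variable can be sketched with Python's standard library; the population figures here are invented:

```python
import math
import statistics

# Hypothetical right-skewed population figures (in thousands).
populatn = [120, 340, 560, 900, 1500, 2600, 4800, 9000, 21000]

logpop = [math.log10(p) for p in populatn]  # the LG10 transformation

# For a skewed variable the mean sits far from the median;
# after logging, the two are much closer together.
raw_gap = statistics.mean(populatn) - statistics.median(populatn)
log_gap = statistics.mean(logpop) - statistics.median(logpop)
```

This is exactly the mean-versus-median check suggested above, applied before and after the transformation.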

Descriptive Statistics
Descriptive statistics may be all you need for your current study, or they may be an early step in exploring and understanding a new set of data. Before deciding what you want to describe (the location or center of the distribution, its spread, and so on), you should consider what types of variables are present. That is, what do you know about the values of the variables?
Unordered categories. Examples of unordered categories include the variable region, with codes 1 to 6 representing Africa, Latin America, and so on, or the string variable religion, with values Buddhist, Catholic, Muslim, and others.
Ordered categories. Examples of ordered categories include the seven categories of the variable polviews, which range from Extremely liberal to Extremely conservative.

Counts. The number of new cases of AIDS or the number of children per family are examples of values that are counts.
Measurements. Values measured on a scale of equal units, such as height in inches or centimeters, are measurements. These values are continuous: you could record the age of a subject as 35 years or as 35.453729 (including days, hours, minutes, and seconds). For many statistical purposes, counts are treated as measured variables.
Arithmetic calculations like averages and differences make sense for both measurements and counts, but not for the codes of unordered categorical variables. Numeric variables are called quantitative variables if it makes sense to do arithmetic on their values. The most common statistical descriptors are appropriate for quantitative variables. In particular, means and standard deviations are appropriate for quantitative variables that follow a normal distribution. Often, however, real data do not meet this assumption of normality, because the distribution is skewed or contains outliers, gaps, or other problems.

Descriptive Statistics for Normally Distributed Data
The statistics in Table 1.1 assume at least a quantitative variable with a symmetric distribution. Because these statistics can be very misleading for distributions that depart widely from the normal, use graphics wherever possible to display the distribution. The Groups column in this and the following tables indicates whether the statistic is computed for the sample as a whole (Sample), for subgroups of cases within the sample determined by a single grouping variable (Groups), or for subgroups determined by combinations of grouping variables (Crossed). (Statistical texts often call grouping variables factors, and their values levels; in that vocabulary, crossed means for all combinations of levels of multiple factors.)
You can also use the Split File feature on the Data menu of the Data Editor to stratify the cells further, although with Split File you cannot get statistics across the sample as a whole.

Table 1.1 (Descriptive Statistics for Normally Distributed Data) covers the procedures Frequencies, Descriptives, Explore, Case Summaries, Means, One-Sample T Test, Independent-Samples T Test, Paired-Samples T Test, One-Way ANOVA, GLM Univariate, Correlations, Regression, Nonparametric Tests, Discriminant and Factor. For each procedure it indicates the Groups setting (Sample, Group, Two-Sample or Crossed) and which of the following statistics are available: mean, standard deviation, standard error, variance, skewness, kurtosis, and 95% confidence interval.


Always use graphics to ensure that the variables you are summarizing with these statistics have approximately normal distributions:
Histogram with normal curve, from Frequencies
Stem-and-leaf plot, from Explore
Boxplot, from Explore
Other statistics of interest are z scores and means ordered by size, both available from the Descriptives procedure.

Descriptive Statistics for Any Quantitative Variable or Numeric Variable with Ordered Values
The statistics in Table 1.2 can be used to describe any quantitative variable, whether or not its distribution is normal, and they may be useful descriptors for values that code ordered categories (for example, 1 = strongly disagree, 2 = disagree, ..., 5 = strongly agree).

Table 1.2 (Descriptive Statistics, Normality Not Required) covers the procedures Frequencies, Descriptives, Explore, Case Summaries, Means, One-Way ANOVA and Nonparametric Tests. For each procedure it indicates the Groups setting (Sample, Group or Crossed) and which of the following statistics are available: median, minimum, maximum, range, percentiles, quartiles, cumulative percentage, and sum.

Graphics useful for understanding these distributions are the same as those for normally distributed numeric data: histograms, stem-and-leaf plots, and boxplots. The Explore procedure offers several robust estimators and other aids to understanding distributions that may deviate from the normal:
5% trimmed mean
M-estimators
Tukey's hinges
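Of these, the 5% trimmed mean is simple enough to sketch directly. This Python version is a simplified stand-in for what Explore reports: it drops the lowest and highest 5% of the ordered values before averaging.

```python
import statistics

def trimmed_mean(values, proportion=0.05):
    # Sort, drop the lowest and highest 5% of cases, average the rest.
    v = sorted(values)
    k = int(len(v) * proportion)
    return statistics.mean(v[k:len(v) - k] if k else v)

# Twenty well-behaved scores plus one wild value.
scores = list(range(1, 21)) + [500]
plain = statistics.mean(scores)
robust = trimmed_mean(scores)
```

The ordinary mean is dragged far upward by the single wild value, while the trimmed mean stays close to the bulk of the data, which is the sense in which it is "robust".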

In Explore, you can request the five largest and five smallest values within each group, along with their respective case labels.

Descriptive Statistics for Categories
Frequency counts and percentages are useful for describing numeric and string variables with unordered categories. Frequencies reports counts, percentages and valid percentages for the sample; Crosstabs reports them for crossed groups.


The Case Processing Summary that accompanies each output table also contains counts and percentages of valid and missing cases. The most useful graphic for studying the distribution of variables with unordered categories is the bar chart, available from the Frequencies procedure or the Graphs menu.

Tests for Comparing Means
The T Test and GLM Univariate procedures test hypotheses about means of quantitative variables. The purpose is to draw conclusions about population parameters based on statistics observed in the sample. These tests are available from the Compare Means and General Linear Model menus. When the data come from markedly non-normal distributions, a non-parametric test may be more appropriate. Rather than using the data as recorded, several of these tests use ranks (SPSS converts your data into ranks as part of the computations). Beware, however, of using non-parametric procedures to rescue bad data. If you have data that violate the distributional assumptions for a t test or an analysis of variance, you should consider transformations before retreating to non-parametric tests. While the non-parametric test statistics drop the assumption of normality, they do have assumptions similar to their parametric counterparts. For example, the Mann-Whitney test assumes the distributions have the same shape. Also, if the populations do in fact differ, a non-parametric procedure may require a larger sample to demonstrate it than a normal-theory test would require.

T Tests for One, Paired, and Two Samples
SPSS provides three types of t tests for comparing means, depending on what you are comparing. Use the Independent-Samples T Test procedure to test whether the mean of a single variable for subjects in one group differs from that in another group. For example, does the average cholesterol level for a treatment group differ from that for a control group? The Mann-Whitney rank sum test is the non-parametric analog of the two-sample t test. It is used to test whether two samples come from identically distributed populations; that is, that there is no shift in the center of location (not the mean, because the distribution might be skewed). The test is not completely distribution-free, because it assumes that the populations have the same shape.
Thus, the groups may differ with respect to center of location, but they should have the same variability and skewness. Other non-parametric tests for two independent samples are the Moses test of extreme reactions, the Kolmogorov-Smirnov test, and the Wald-Wolfowitz runs test. Use the Paired-Samples T Test procedure (also known as a dependent t test) to test whether the mean of casewise differences between two variables differs from 0. A typical study design for this test could include a before and an after measure for each subject; the before and after measures are stored as separate variables. As non-parametric analogs to the paired t test, SPSS provides the sign test and the Wilcoxon signed-rank test. For each pair of observations, the sign test uses only the
direction of the differences (positive or negative), while the Wilcoxon signed-rank test begins by ranking the differences without considering the signs, then restores the sign to each rank, and finally sums the ranks separately for the positive and negative differences. Use the One-Sample T Test procedure to test whether the mean of a single variable differs from a hypothesized value. If the average IQ in your country is supposedly 100 and the average IQ for a sample of your co-workers is 127.5, use One-Sample T Test to see if you can conclude that your co-workers are smarter than the average person.

One-Way and Univariate Analysis of Variance
Analysis of variance is an extension of the two-sample t test to more than two groups. This analysis examines the variability among the sample means relative to the spread of the observations within each group. The null hypothesis is that the samples of values come from populations with equal means. For a one-way analysis of variance (one-way ANOVA), groups or cells are defined using the levels of a single grouping factor that has two or more levels. In GLM Univariate analysis of variance, cells are defined using the cross-classification of two or more factors. For example, if study subjects are grouped by gender (male, female) and city (Los Angeles, Chicago, New York), six cells are formed: LA males, LA females, Chicago males, Chicago females, NY males, and NY females. The total variation in the dependent measure is separated into components for gender, city, and the interaction between the two. The SPSS Base system provides three procedures for analysis of variance:

Analysis of Variance Procedures
Means           One-way ANOVA table, test of linearity, eta
One-Way ANOVA   ANOVA table, post hoc range tests and pairwise multiple comparisons, contrasts to test relations among cell means
GLM Univariate  Factorial ANOVA table, covariance

In some situations, a covariate (or, in the language of regression, an independent variable) may add additional variability to the measure under study (the dependent variable). An analysis of covariance adjusts for or removes the variability in the dependent variable due to the covariate. For example, if cholesterol is the measure studied for people in treatment and control groups, age might be a useful covariate for subjects of varying ages. This is because cholesterol is known to increase with age; using it as a covariate therefore removes unwanted variability. The Kruskal-Wallis test is the non-parametric analog of a one-way ANOVA. It is just like the Mann-Whitney test for two independent samples, except that it sums the ranks for each of k groups. SPSS also provides a median test where, for each group, the number of cases with values larger than the overall median and the number less than
or equal to the median form a two-way frequency table. The Friedman test is a nonparametric extension of the paired t test to more than two variables.
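The two nonparametric tests just described can also be sketched in Python's scipy.stats; this is an illustration with hypothetical data, not SPSS output.

```python
from scipy import stats

# Kruskal-Wallis: nonparametric one-way ANOVA computed on ranks
# across k independent groups (hypothetical data).
group_a = [12, 15, 11, 14]
group_b = [22, 25, 21, 24]
group_c = [13, 16, 12, 15]
h_stat, kw_p = stats.kruskal(group_a, group_b, group_c)

# Friedman: nonparametric extension of the paired t test to three
# repeated measurements on the same five cases (hypothetical data).
before = [10, 12, 11, 14, 13]
during = [14, 16, 15, 18, 17]
after = [11, 13, 12, 15, 14]
chi2, fr_p = stats.friedmanchisquare(before, during, after)
```

Both tests work on ranks rather than raw values, so they do not require the normality assumption of their parametric counterparts.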

Testing Relationships :
In selecting a statistic to measure the relation among variables, you need to identify what types of variables you are investigating. If the values are categories, you will find an appropriate measure in the Crosstabs procedure. If the values are from a quantitative distribution that can be considered normal, you may want to use a linear model in regression or a Pearson correlation in the Bivariate Correlations procedure. If normality is too strong an assumption to make, you might consider the Spearman correlation.

Measures of Association for Categorical Variables :
For two-way tables of frequency counts formed by crossing two categorical variables, Crosstabs offers 22 tests of significance and measures of association. Each is appropriate for a particular table structure (rows by columns), and a few assume that the categories are ordered.

Table Structure                  Test
2 x 2                            Pearson chi-square, likelihood-ratio chi-square,
                                 Fisher's exact test, Yates' corrected chi-square,
                                 McNemar's test, relative risk, and the odds ratio
R x C                            Pearson and likelihood-ratio chi-square, phi,
                                 Cramer's V, contingency coefficient, symmetric and
                                 asymmetric lambdas, Goodman and Kruskal's tau, and
                                 the uncertainty coefficient (the last three are
                                 predictive measures)
R x C with ordered categories    Gamma, Spearman's rho, Kendall's tau-b and tau-c,
                                 and Somers' d (a predictive measure)
R x R                            Cohen's kappa measure of agreement
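Two of the 2 x 2 measures from the table can be sketched with scipy as an illustration; the counts below are hypothetical, and note that scipy applies Yates' continuity correction by default for 2 x 2 tables.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: treatment vs. control by improved / not improved.
table = np.array([[30, 10],
                  [15, 25]])

# Pearson chi-square (with Yates' correction, scipy's default for 2x2).
chi2, p, dof, expected = stats.chi2_contingency(table)

# Fisher's exact test, which also reports the odds ratio.
odds_ratio, fisher_p = stats.fisher_exact(table)
```

Here the odds ratio is (30 * 25) / (10 * 15) = 5.0, and both p-values indicate that improvement is associated with group membership.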

Correlation and Regression for Quantitative Variables :

A correlation coefficient is a measure of the linear relationship between two quantitative variables; a simple regression is another method for the same problem. A correlation matrix displays statistics for many variables pair by pair, while a multiple regression characterizes the linear relationship between one variable and a set of variables. Pearson correlations are available in the Bivariate Correlations, Partial Correlations, Linear Regression, and Crosstabs procedures. If you want to test whether the statistic differs from zero (i.e., test the null hypothesis that there is no linear relationship between the two variables), the data should follow a normal distribution. When the data do not follow a normal distribution, the Spearman correlation is available in the Bivariate Correlations and Crosstabs procedures.
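The Pearson/Spearman distinction and the link between correlation and simple regression can be illustrated with hypothetical data in scipy (again, a sketch, not SPSS syntax).

```python
from scipy import stats

# Hypothetical data with an approximately linear, strictly monotone relation.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

r, r_p = stats.pearsonr(x, y)       # linear association
rho, rho_p = stats.spearmanr(x, y)  # rank-based, no normality assumption

# Simple regression: the same linear relation expressed as slope/intercept.
result = stats.linregress(x, y)
```

Because y increases strictly with x, the Spearman correlation is exactly 1; the Pearson correlation is slightly below 1 because the relation is not perfectly linear.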

Identifying Groups :
Cluster analysis, discriminant analysis, and factor analysis are useful for identifying groups. Cluster analysis is a multivariate procedure for detecting groups in data. In both the k-means and hierarchical procedures, the clusters can be groups of cases; the hierarchical procedure can also be used to form groups of variables rather than cases. Clustering is a good technique to use when you suspect that the data may not be homogeneous and you want to see whether distinct groups exist, or when you want to classify the data into groups. In other words, you can begin with no knowledge of group membership.

Classification can also be a goal of discriminant analysis. For this procedure, however, you begin with cases in known groups, and the analysis finds linear combinations of the variables that best characterize the differences among the groups (these functions can then be used to classify new cases). Variables can be entered into the function in a stepwise manner; thus, a subset of variables that maximizes group differences is identified.

Factor analysis is appropriate for quantitative variables whose correlations you want to study. You can study the correlations of a large number of variables by grouping the variables into factors: the variables within each factor are more highly correlated with one another than with variables in other factors. You can also interpret each factor according to the meaning of its variables, and summarize many variables with a few factors. The scores for the factors can be used as data for t tests, regression, and so on.
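The idea of detecting groups without prior knowledge of membership can be sketched with hierarchical clustering in scipy; the two well-separated clouds of points below are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical cases: two visibly separated clouds in two dimensions.
data = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                 [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# Build the hierarchical cluster tree (Ward's method), then cut it
# into two groups without having specified any membership in advance.
Z = linkage(data, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
```

The resulting labels assign the first three cases to one cluster and the last three to the other, recovering the group structure from the data alone.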
