
SYSTAT 8.0 Statistics

For more information about SYSTAT software products, please visit our WWW site at http://www.spss.com or contact:

SPSS Science Marketing Department
SPSS Inc.
233 South Wacker Drive, 11th Floor
Chicago, IL 60606-6307
Tel: (312) 651-3000
Fax: (312) 651-3668
SYSTAT is a registered trademark and the other product names are the trademarks of
SPSS Inc. for its proprietary computer software. No material describing such software
may be produced or distributed without the written permission of the owners of the
trademark and license rights in the software and the copyrights in the published
materials.
The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use,
duplication, or disclosure by the Government is subject to restrictions as set forth in
subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at
52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th
Floor, Chicago, IL 60606-6307.
General notice: Other product names mentioned herein are used for identification
purposes only and may be trademarks of their respective companies.
Windows is a registered trademark of Microsoft Corporation.
ImageStream Graphics & Presentation Filters, copyright 1991-1997 by INSO
Corporation. All Rights Reserved.
ImageStream Graphics Filters is a registered trademark and ImageStream is a trademark
of INSO Corporation.
For using GSLIB:
Copyright 1996, The Board of Trustees of the Leland Stanford Junior University. All
rights reserved.
The programs in GSLIB are distributed in the hope that they will be useful, but
WITHOUT ANY WARRANTY. No author or distributor accepts responsibility to
anyone for the consequences of using them or for whether they serve any particular
purpose or work at all, unless he says so in writing. Everyone is granted permission to
copy, modify and redistribute the programs in GSLIB, but only under the condition that
this notice and the above copyright notice remain intact.
For using Kernel statistics code:
Ken Clarkson wrote this. Copyright 1995 by AT&T.
Permission to use, copy, modify, and distribute this software for any purpose without
fee is hereby granted, provided that this entire notice is included in all copies of any
software which is or includes a copy or modification of this software and in all copies
of the supporting documentation for such software.
THIS SOFTWARE IS BEING PROVIDED AS IS, WITHOUT ANY EXPRESS OR IMPLIED
WARRANTY. IN PARTICULAR, NEITHER THE AUTHORS NOR AT&T MAKE ANY
REPRESENTATION OR WARRANTY OF ANY KIND CONCERNING THE MERCHANTABILITY
OF THIS SOFTWARE OR ITS FITNESS FOR ANY PARTICULAR PURPOSE.
SYSTAT 8.0 Statistics
Copyright 1998 by SPSS Inc.
All rights reserved.
Printed in the United States of America.
No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the publisher.
1 2 3 4 5 6 7 8 9 0 03 02 01 00 99 98
ISBN 1-56827-222-7


Preface
Release 8.0 completes the transformation I had sought in selling SYSTAT to SPSS.
At the time of the acquisition, SYSTAT was known as a powerful but quirky package.
It appealed to a particular kind of intensely loyal and venturesome researcher.
SYSTAT needed to reach new users, however, without abandoning its historical core.
SPSS had the resources and the will to make this possible. Jack Noonan, President of
SPSS, and Joel York, Director of SPSS Science, have given SYSTAT the continuing
support it has needed. I want to thank the SYSTAT team (Greg Staky, Mike Pechnyo,
Sasha Khalileev, Lou Ross, Scott Sipiora, Keith Kroeger, Rick Marcantonio, Joe
Granda, and Ray Ku) and all the others at SPSS who have helped as well.
We redesigned SYSTAT while maintaining its agility and the compatibility of its
commands. SYSTAT 8.0 has a truly state-of-the art interface: new data and graphics
editing, output document formatting, and native 32-bit performance. Probably the
simplest way to express this change from a personal point of view is to say that in the
early 1980s, I wrote SYSTAT to fit the needs of my own research; the SYSTAT team
has now produced the kind of system I need in my work today.
If you agree, tell your friends and colleagues. SYSTAT's biggest promoter has always
been word-of-mouth. If your friends use SPSS, tell them they can run both packages at
the same time, reading and writing the same files, without exhausting system resources.
If your friends use SAS, tell them how familiar SYSTAT's architecture will be to them,
how easy it is to use, and how well it complements the capabilities of SAS, particularly
in graphics. And if your friends haven't looked at SYSTAT for a while, tell them they
won't believe what's in it now. We got a boost when people saw all of the value added
in Release 7.0. This time, we're ready for a party!
Leland Wilkinson
Sr. Vice President, SYSTAT Products
SPSS Inc.
Contents
1 Introduction to Statistics 1
Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Know Your Batch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Sum, Mean, and Standard Deviation . . . . . . . . . . . . . . . . . . . 3
Stem-and-Leaf Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
The Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Standardizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Inferential Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
What Is a Population? . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Picking a Simple Random Sample . . . . . . . . . . . . . . . . . . . . 8
Specifying a Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Estimating a Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Checking Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . 14
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Bootstrapping and Sampling 17
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Bootstrapping in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Bootstrap Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . . 20
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Example 1
Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Example 2
Spearman Rank Correlation . . . . . . . . . . . . . . . . . . . . . . . 24
Example 3
Confidence Interval on a Median. . . . . . . . . . . . . . . . . . . . 25
Example 4
Canonical Correlations: Using Text Output . . . . . . . . . . . . . . 26
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Classification and
Regression Trees 31
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
The Basic Tree Model . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Categorical or Quantitative Predictors. . . . . . . . . . . . . . . . . 35
Regression Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Classification Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Stopping Rules, Pruning, and Cross-Validation . . . . . . . . . . . . 37
Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Classification and Regression Trees in SYSTAT . . . . . . . . . . . . . . 40
Trees Main Dialog Box. . . . . . . . . . . . . . . . . . . . . . . . . . 40
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Example 1
Classification Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Example 2
Regression Tree with Box Plots. . . . . . . . . . . . . . . . . . . . . 46
Example 3
Regression Tree with Dit Plots . . . . . . . . . . . . . . . . . . . . . 48
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50
4 Cluster Analysis 53
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . .54
Types of Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . .54
Correlations and Distances. . . . . . . . . . . . . . . . . . . . . . . .55
Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . .56
Partitioning via K-Means . . . . . . . . . . . . . . . . . . . . . . . . .60
Additive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .62
Cluster Analysis in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . .64
Hierarchical Clustering Main Dialog Box . . . . . . . . . . . . . . . .64
K-Means Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . . .67
Additive Trees Main Dialog Box . . . . . . . . . . . . . . . . . . . . .68
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . .69
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . .70
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71
Example 1
K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . .71
Example 2
Hierarchical Clustering: Clustering Cases . . . . . . . . . . . . . . .76
Example 3
Hierarchical Clustering: Clustering Variables . . . . . . . . . . . . .79
Example 4
Hierarchical Clustering: Clustering Variables and Cases. . . . . . .80
Example 5
Hierarchical Clustering: Distance Matrix Input . . . . . . . . . . . .82
Example 6
Additive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .83
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .85
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .85
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .85
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .85
5 Conjoint Analysis 87
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Additive Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Multiplicative Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Computing Table Margins Based on an Additive Model . . . . . . . 91
Applied Conjoint Analysis . . . . . . . . . . . . . . . . . . . . . . . . 92
Conjoint Analysis in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . 93
Conjoint Analysis Main Dialog Box . . . . . . . . . . . . . . . . . . . 93
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Example 1
Choice Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Example 2
Word Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Example 3
Box-Cox Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Example 4
Employment Discrimination . . . . . . . . . . . . . . . . . . . . . . 107
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6 Correlations, Similarities, and
Distance Measures 115
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
The Scatterplot Matrix (SPLOM) . . . . . . . . . . . . . . . . . . . 117
The Pearson Correlation Coefficient . . . . . . . . . . . . . . . . . 117
Other Measures of Association. . . . . . . . . . . . . . . . . . . . 119
Transposed Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Hadi Robust Outlier Detection. . . . . . . . . . . . . . . . . . . . . 123
Correlations in SYSTAT. . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Correlations Main Dialog Box . . . . . . . . . . . . . . . . . . . . . 124
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 129
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Example 1
Pearson Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Example 2
Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Example 3
Missing Data: Pairwise Deletion. . . . . . . . . . . . . . . . . . . . 134
Example 4
Missing Data: EM Estimation. . . . . . . . . . . . . . . . . . . . . . 135
Example 5
Probabilities Associated with Correlations. . . . . . . . . . . . . . 137
Example 6
Hadi Robust Outlier Detection . . . . . . . . . . . . . . . . . . . . . 140
Example 7
Spearman Correlations . . . . . . . . . . . . . . . . . . . . . . . . . 143
Example 8
S2 and S3 Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . 143
Example 9
Tetrachoric Correlation . . . . . . . . . . . . . . . . . . . . . . . . . 145
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7 Correspondence Analysis 149
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
The Simple Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
The Multiple Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Correspondence Analysis in SYSTAT. . . . . . . . . . . . . . . . . . . . 151
Correspondence Analysis Main Dialog Box . . . . . . . . . . . . . 151
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 152
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Example 1
Correspondence Analysis (Simple) . . . . . . . . . . . . . . . . . . 153
Example 2
Multiple Correspondence Analysis. . . . . . . . . . . . . . . . . . 155
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8 Crosstabulation 159
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Making Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Significance Tests and Measures of Association . . . . . . . . . 162
Crosstabulations in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . 168
One-Way Frequency Tables Main Dialog Box . . . . . . . . . . . 168
Two-Way Frequency Tables Main Dialog Box . . . . . . . . . . . 169
Multiway Frequency Tables Main Dialog Box . . . . . . . . . . . 172
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 173
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Example 1
One-Way Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Example 2
Two-Way Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Example 3
Frequency Input. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Example 4
Missing Category Codes . . . . . . . . . . . . . . . . . . . . . . . . 180
Example 5
Percentages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Example 6
Multiway Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Example 7
Two-Way Table Statistics . . . . . . . . . . . . . . . . . . . . . . . 188
Example 8
Two-Way Table Statistics (Long Results). . . . . . . . . . . . . . . 190
Example 9
Odds Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Example 10
Fisher's Exact Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Example 11
Cochran's Test of Linear Trend . . . . . . . . . . . . . . . . . . . 196
Example 12
Tables with Ordered Categories . . . . . . . . . . . . . . . . . . . . 198
Example 13
McNemar's Test of Symmetry . . . . . . . . . . . . . . . . . . . . 199
Example 14
Confidence Intervals for One-Way Table Percentages . . . . . . . 201
Example 15
Mantel-Haenszel Test . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
9 Descriptive Statistics 207
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Location. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Spread. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
The Normal Distribution. . . . . . . . . . . . . . . . . . . . . . . . . 209
Non-Normal Shape . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Subpopulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Descriptive Statistics in SYSTAT . . . . . . . . . . . . . . . . . . . . . . 213
Basic Statistics Main Dialog Box . . . . . . . . . . . . . . . . . . . 213
Stem Main Dialog Box. . . . . . . . . . . . . . . . . . . . . . . . . . 215
Cronbach Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . . 216
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 217
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Example 1
Basic Statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Example 2
Saving Basic Statistics: One Statistic and
One Grouping Variable . . . . . . . . . . . . . . . . . . . . . . . . . 219
Example 3
Saving Basic Statistics:
Multiple Statistics and Grouping Variables . . . . . . . . . . . . . 220
Example 4
Stem-and-Leaf Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
10 Design of Experiments 229
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Design of Experiments in SYSTAT . . . . . . . . . . . . . . . . . . . . . 231
Design of Experiments Main Dialog Box. . . . . . . . . . . . . . . 231
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 237
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Example 1
Complete Factorial Designs . . . . . . . . . . . . . . . . . . . . . . 237
Example 2
Box and Hunter Fractional Factorial Design. . . . . . . . . . . . . 238
Example 3
Latin Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Example 4
Taguchi Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Example 5
Plackett-Burman Design. . . . . . . . . . . . . . . . . . . . . . . . 242
Example 6
Box and Behnken Design . . . . . . . . . . . . . . . . . . . . . . . 243
Example 7
Mixture Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
11 Discriminant Analysis 245
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Linear Discriminant Model . . . . . . . . . . . . . . . . . . . . . . . 246
Discriminant Analysis in SYSTAT . . . . . . . . . . . . . . . . . . . . . . 253
Discriminant Analysis Main Dialog Box. . . . . . . . . . . . . . . . 253
Discriminant Analysis Statistics . . . . . . . . . . . . . . . . . . . . 256
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 258
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Example 1
Complete Estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Example 2
Automatic Forward Stepping. . . . . . . . . . . . . . . . . . . . . . 263
Example 3
Automatic Backward Stepping. . . . . . . . . . . . . . . . . . . . . 268
Example 4
Interactive Stepping. . . . . . . . . . . . . . . . . . . . . . . . . . . 276
Example 5
Contrasts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Example 6
Quadratic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Example 7
Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
12 Factor Analysis 297
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
A Principal Component . . . . . . . . . . . . . . . . . . . . . . . . . 298
Factor Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Principal Components versus Factor Analysis . . . . . . . . . . . . 304
Applications and Caveats. . . . . . . . . . . . . . . . . . . . . . . . 304
Factor Analysis in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . 305
Factor Analysis Main Dialog Box . . . . . . . . . . . . . . . . . . . 305
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 309
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
Example 1
Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . 311
Example 2
Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Example 3
Iterated Principal Axis . . . . . . . . . . . . . . . . . . . . . . . . . 318
Example 4
Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
Example 5
Factor Analysis Using a Covariance Matrix . . . . . . . . . . . . . 324
Example 6
Factor Analysis Using a Rectangular File . . . . . . . . . . . . . . 327
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
13 Linear Models 335
Simple Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
Equation for a Line . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
Estimation and Inference . . . . . . . . . . . . . . . . . . . . . . . 339
Standard Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
Multiple Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 342
Regression Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . 343
Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
Using an SSCP, a Covariance, or a
Correlation Matrix as Input . . . . . . . . . . . . . . . . . . . . . . . . . 351
Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
Effects Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
Means Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
Multigroup ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
Factorial ANOVA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Data Screening and Assumptions . . . . . . . . . . . . . . . . . . . 358
Levene Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
Pairwise Mean Comparisons. . . . . . . . . . . . . . . . . . . . . . 359
Linear and Quadratic Contrasts . . . . . . . . . . . . . . . . . . . . 360
Repeated Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
Assumptions in Repeated Measures . . . . . . . . . . . . . . . . . 364
Issues in Repeated Measures Analysis. . . . . . . . . . . . . . . . 365
Types of Sums of Squares . . . . . . . . . . . . . . . . . . . . . . . . . . 366
SYSTAT's Sums of Squares . . . . . . . . . . . . . . . . . . . . . . 367
14 Linear Models I:
Linear Regression 369
Linear Regression in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . 370
Regression Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . 370
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 373
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
Example 1
Simple Linear Regression. . . . . . . . . . . . . . . . . . . . . . . . 374
Example 2
Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
Example 3
Residuals and Diagnostics for Simple Linear Regression . . . . . 380
Example 4
Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . 383
Example 5
Automatic Stepwise Regression. . . . . . . . . . . . . . . . . . . . 387
Example 6
Interactive Stepwise Regression . . . . . . . . . . . . . . . . . . 390
Example 7
Testing whether a Single Coefficient Equals Zero . . . . . . . . . 394
Example 8
Testing whether Multiple Coefficients Equal Zero . . . . . . . . . 396
Example 9
Testing Nonzero Null Hypotheses . . . . . . . . . . . . . . . . . . 397
Example 10
Regression with Ecological or Grouped Data . . . . . . . . . . . . 398
Example 11
Regression without the Constant . . . . . . . . . . . . . . . . . . . 399
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
15 Linear Models II:
Analysis of Variance 401
Analysis of Variance in SYSTAT . . . . . . . . . . . . . . . . . . . . . . 402
ANOVA: Estimate Model . . . . . . . . . . . . . . . . . . . . . . . . 402
ANOVA: Hypothesis Test . . . . . . . . . . . . . . . . . . . . . . . 404
Repeated Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 406
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 408
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
Example 1
One-Way ANOVA. . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
Example 2
ANOVA Assumptions and Contrasts . . . . . . . . . . . . . . . . . 412
Example 3
Two-Way ANOVA. . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
Example 4
Single-Degree-of-Freedom Designs . . . . . . . . . . . . . . . . . 427
Example 5
Mixed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
Example 6
Separate Variance Hypothesis Tests . . . . . . . . . . . . . . . . 431
Example 7
Analysis of Covariance. . . . . . . . . . . . . . . . . . . . . . . . . 432
Example 8
One-Way Repeated Measures . . . . . . . . . . . . . . . . . . . . . 434
Example 9
Repeated Measures ANOVA for One Grouping Factor and
One Within Factor with Ordered Levels. . . . . . . . . . . . . . . . 440
Example 10
Repeated Measures ANOVA for Two Grouping Factors and
One Within Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
Example 11
Repeated Measures ANOVA for Two Trial Factors . . . . . . . . . 445
Example 12
Repeated Measures Analysis of Covariance. . . . . . . . . . . . . 448
Example 13
Multivariate Analysis of Variance . . . . . . . . . . . . . . . . . . . 450
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
16 Linear Models III:
General Linear Models 457
General Linear Models in SYSTAT . . . . . . . . . . . . . . . . . . . . . 458
Model Estimation (in GLM) . . . . . . . . . . . . . . . . . . . . . . . 458
Pairwise Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . 463
Hypothesis Tests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 471
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
Example 1
One-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
Example 2
Randomized Block Designs. . . . . . . . . . . . . . . . . . . . . . . 480
Example 3
Incomplete Block Designs . . . . . . . . . . . . . . . . . . . . . . . 480
Example 4
Fractional Factorial Designs . . . . . . . . . . . . . . . . . . . . . . 482
Example 5
Nested Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
Example 6
Split Plot Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
Example 7
Latin Square Designs. . . . . . . . . . . . . . . . . . . . . . . . . . 488
Example 8
Crossover and Changeover Designs . . . . . . . . . . . . . . . . . 490
Example 9
Missing Cells Designs (the Means Model) . . . . . . . . . . . . . 494
Example 10
Covariance Alternatives to Repeated Measures . . . . . . . . . . 502
Example 11
Weighting Means. . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
Example 12
Hotelling's T-Square . . . . . . . . . . . . . . . . . . . . . . . . . . 505
Example 13
Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 506
Example 14
Principal Components Analysis (Within Groups) . . . . . . . . . . 510
Example 15
Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . 514
Example 16
Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
Example 17
Partial Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
17 Logistic Regression 517
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
Binary Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
Multinomial Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
Conditional Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
Discrete Choice Logit . . . . . . . . . . . . . . . . . . . . . . . . . 522
Stepwise Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
Logistic Regression in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . 525
Estimate Model Main Dialog Box . . . . . . . . . . . . . . . . . . . 525
Deciles of Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
Simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 533
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
Example 1
Binary Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
Example 2
Binary Logit with Multiple Predictors . . . . . . . . . . . . . . . . . 536
Example 3
Binary Logit with Interactions . . . . . . . . . . . . . . . . . . . . . 537
Example 4
Deciles of Risk and Model Diagnostics . . . . . . . . . . . . . . . . 542
Example 5
Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
Example 6
Multinomial Logit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
Example 7
Conditional Logistic Regression . . . . . . . . . . . . . . . . . . . . 556
Example 8
Discrete Choice Models . . . . . . . . . . . . . . . . . . . . . . . . 558
Example 9
By-Choice Data Format . . . . . . . . . . . . . . . . . . . . . . . . . 565
Example 10
Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 567
Example 11
Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
Example 12
Quasi-Maximum Likelihood. . . . . . . . . . . . . . . . . . . . . . . 574
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
18 Loglinear Models 585
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
Fitting a Loglinear Model . . . . . . . . . . . . . . . . . . . . . . . 588
Loglinear Models in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . 589
Loglinear Model Main Dialog Box . . . . . . . . . . . . . . . . . . 589
Frequency Tables (Tabulate) . . . . . . . . . . . . . . . . . . . . . 593
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 594
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
Example 1
Loglinear Modeling of a Four-Way Table . . . . . . . . . . . . . . 595
Example 2
Screening Effects. . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
Example 3
Structural Zeros. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
Example 4
Tables without Analyses. . . . . . . . . . . . . . . . . . . . . . . . 612
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
19 Multidimensional Scaling 615
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
Assumptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
Collecting Dissimilarity Data . . . . . . . . . . . . . . . . . . . . . 617
Scaling Dissimilarities . . . . . . . . . . . . . . . . . . . . . . . . . 618
Multidimensional Scaling in SYSTAT . . . . . . . . . . . . . . . . . . . 619
Multidimensional Scaling Main Dialog Box . . . . . . . . . . . . . 619
Multidimensional Scaling Configuration. . . . . . . . . . . . . . . 622
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 623
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
Example 1
Kruskal Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
Example 2
Guttman Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . 626
Example 3
Individual Differences Multidimensional Scaling . . . . . . . . . . 628
Example 4
Nonmetric Unfolding . . . . . . . . . . . . . . . . . . . . . . . . . . 632
Example 5
Power Scaling Ratio Data. . . . . . . . . . . . . . . . . . . . . . . . 636
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640
20 Nonlinear Models 643
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
Modeling the Dose-Response Function. . . . . . . . . . . . . . . . 644
Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647
Model Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
Nonlinear Models in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . 652
Nonlinear Model Specification. . . . . . . . . . . . . . . . . . . . . 652
Loss Functions for Analytic Function Minimization . . . . . . . . . 660
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 661
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662
Example 1
Nonlinear Model with Three Parameters. . . . . . . . . . . . . . . 662
Example 2
Confidence Curves and Regions . . . . . . . . . . . . . . . . . . . . 665
Example 3
Fixing Parameters and Evaluating Fit . . . . . . . . . . . . . . . . . 668
Example 4
Functions of Parameters. . . . . . . . . . . . . . . . . . . . . . . . 671
Example 5
Contouring the Loss Function . . . . . . . . . . . . . . . . . . . . . 673
Example 6
Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . 675
Example 7
Iteratively Reweighted Least Squares for Logistic Models . . . . 676
Example 8
Robust Estimation (Measures of Location) . . . . . . . . . . . . . 678
Example 9
Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
Example 10
Piecewise Regression . . . . . . . . . . . . . . . . . . . . . . . . . 686
Example 11
Kinetic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 688
Example 12
Minimizing an Analytic Function . . . . . . . . . . . . . . . . . . . 689
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
21 Nonparametric Statistics 693
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 694
Rank (Ordinal) Data. . . . . . . . . . . . . . . . . . . . . . . . . . . 694
Categorical (Nominal) Data . . . . . . . . . . . . . . . . . . . . . . 694
Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695
Nonparametric Statistics for
Independent Samples in SYSTAT . . . . . . . . . . . . . . . . . . . . . 695
Kruskal-Wallis Main Dialog Box . . . . . . . . . . . . . . . . . . . 695
Two-Sample Kolmogorov-Smirnov Main Dialog Box. . . . . . . . 696
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 697
Nonparametric Statistics for
Related Variables in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . 697
Sign Tests Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . 697
Wilcoxon Signed-Rank Test Main Dialog Box . . . . . . . . . . . . 698
Friedman Tests Main Dialog Box . . . . . . . . . . . . . . . . . . . 698
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 699
Nonparametric Statistics for
Single Samples in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . . 699
One-Sample Kolmogorov-Smirnov Main Dialog Box . . . . . . . . 699
Wald-Wolfowitz Runs Main Dialog Box. . . . . . . . . . . . . . . . 701
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
Example 1
Kruskal-Wallis Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
Example 2
Mann-Whitney Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 704
Example 3
Two-Sample Kolmogorov-Smirnov Test . . . . . . . . . . . . . . . 705
Example 4
Sign Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705
Example 5
Wilcoxon Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
Example 6
Sign and Wilcoxon Tests for Multiple Variables . . . . . . . . . . . 707
Example 7
Friedman Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708
Example 8
One-Sample Kolmogorov-Smirnov Test. . . . . . . . . . . . . . . . 710
Example 9
Wald-Wolfowitz Runs Test . . . . . . . . . . . . . . . . . . . . . . . 712
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714
22 Partially Ordered Scalogram
Analysis with Coordinates 715
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 715
Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
POSAC in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
POSAC Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . 718
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 719
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
Example 1
Scalogram Analysis: A Perfect Fit . . . . . . . . . . . . . . . . . . 720
Example 2
Binary Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
Example 3
Multiple Categories. . . . . . . . . . . . . . . . . . . . . . . . . . . 724
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728
23 Path Analysis (RAMONA) 729
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 729
The Path Diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . 729
RAMONA's Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
Path Analysis in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . . 739
RAMONA Model Main Dialog Box . . . . . . . . . . . . . . . . . . 739
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 745
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
Example 1
Path Analysis Basics . . . . . . . . . . . . . . . . . . . . . . . . . . 746
Example 2
Path Analysis with a Restart File. . . . . . . . . . . . . . . . . . . . 751
Example 3
Path Analysis Using Rectangular Input . . . . . . . . . . . . . . . . 764
Example 4
Path Analysis and Standard Errors . . . . . . . . . . . . . . . . . . 771
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 781
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 781
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 786
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
24 Perceptual Mapping 789
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 789
Preference Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . 790
Biplots and MDPREF. . . . . . . . . . . . . . . . . . . . . . . . . . . 794
Procrustes Rotations . . . . . . . . . . . . . . . . . . . . . . . . . . 794
Perceptual Mapping in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . 795
Perceptual Mapping Main Dialog Box . . . . . . . . . . . . . . . . 795
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 797
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 797
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
Example 1
Vector Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
Example 2
Circle Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 799
Example 3
Internal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800
Example 4
Procrustes Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
25 Probit Analysis 807
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 807
Interpreting the Results . . . . . . . . . . . . . . . . . . . . . . . . 808
Probit Analysis in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . 808
Probit Analysis Main Dialog Box . . . . . . . . . . . . . . . . . . . 808
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 810
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 811
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 811
Example 1
Probit Analysis (Simple Model) . . . . . . . . . . . . . . . . . . . . 811
Example 2
Probit Analysis with Interactions . . . . . . . . . . . . . . . . . . . 813
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
26 Set and Canonical Correlation 817
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 817
Sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
Partialing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
Measures of Association Between Sets. . . . . . . . . . . . . . . 819
R²Y,X Proportion of Generalized Variance . . . . . . . . . . . . . 819
T²Y,X and P²Y,X Proportions of Additive Variance . . . . . . . . . 820
Interpretations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
Types of Association between Sets . . . . . . . . . . . . . . . . . 822
Testing the Null Hypothesis . . . . . . . . . . . . . . . . . . . . . . 823
Estimates of the Population R²Y,X, T²Y,X, and P²Y,X . . . . . . . . 824
Set and Canonical Correlations in SYSTAT . . . . . . . . . . . . . . . . 825
Set and Canonical Correlations Main Dialog Box . . . . . . . . . . 825
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 827
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 827
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 828
Example 1
Canonical Correlations: Simple Model . . . . . . . . . . . . . . . 828
Example 2
Partial Set Correlation Model . . . . . . . . . . . . . . . . . . . . . 831
Example 3
Contingency Table Analysis . . . . . . . . . . . . . . . . . . . . . . 835
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 838
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 838
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 839
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 839
27 Signal Detection Analysis 841
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 841
Detection Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . 842
Signal Detection Analysis in SYSTAT. . . . . . . . . . . . . . . . . . . . 843
Signal Detection Analysis Main Dialog Box . . . . . . . . . . . . . 843
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 846
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 847
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 850
Example 1
Normal Distribution Model for Signal Detection . . . . . . . . . . . 850
Example 2
Nonparametric Model for Signal Detection . . . . . . . . . . . . . 855
Example 3
Logistic Model for Signal Detection . . . . . . . . . . . . . . . . . . 856
Example 4
Negative Exponential Model for Signal Detection. . . . . . . . . . 857
Example 5
Chi-Square Model for Signal Detection . . . . . . . . . . . . . . . 860
Example 6
Poisson Model for Signal Detection . . . . . . . . . . . . . . . . . 863
Example 7
Gamma Model for Signal Detection . . . . . . . . . . . . . . . . . 864
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 866
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 866
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 867
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 867
28 Spatial Statistics 869
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 869
The Basic Spatial Model . . . . . . . . . . . . . . . . . . . . . . . . 869
The Geostatistical Model . . . . . . . . . . . . . . . . . . . . . . . 871
Variogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 872
Variogram Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 873
Anisotropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 876
Simple Kriging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 877
Ordinary Kriging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 877
Universal Kriging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 877
Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 878
Point Processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 878
Spatial Statistics in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . 882
Spatial Statistics Main Dialog Box . . . . . . . . . . . . . . . . . . 882
Using Commands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 889
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
Example 1
Kriging (Ordinary) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
Example 2
Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 898
Example 3
Point Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 899
Example 4
Unusual Distances. . . . . . . . . . . . . . . . . . . . . . . . . . . . 904
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 906
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 906
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 906
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 907
29 Survival Analysis 909
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 909
Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 911
Parametric Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 914
Survival Analysis in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . 917
Survival Analysis Main Dialog Box . . . . . . . . . . . . . . . . . . 917
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 923
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 924
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 925
Example 1
Life Tables: The Kaplan-Meier Estimator . . . . . . . . . . . . . . . 925
Example 2
Actuarial Life Tables. . . . . . . . . . . . . . . . . . . . . . . . . . . 928
Example 3
Stratified Kaplan-Meier Estimation . . . . . . . . . . . . . . . . . . 929
Example 4
Turnbull Estimation: K-M for Interval-Censored Data . . . . . . . . 933
Example 5
Cox Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 936
Example 6
Stratified Cox Regression. . . . . . . . . . . . . . . . . . . . . . . . 938
Example 7
Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 943
Example 8
The Weibull Model for Fully Parametric Analysis . . . . . . . . . . 945
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 948
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 948
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 949
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 956
30 T Tests 959
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 959
Degrees of Freedom . . . . . . . . . . . . . . . . . . . . . . . . . . 962
The T Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 962
Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 964
Assumptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 964
T Tests in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965
Two-Sample T Test Main Dialog Box . . . . . . . . . . . . . . . . 965
Paired T Test Main Dialog Box . . . . . . . . . . . . . . . . . . . . 965
One-Sample T Test Main Dialog Box. . . . . . . . . . . . . . . . . 966
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 967
Usage Considerations (T Tests) . . . . . . . . . . . . . . . . . . . . 968
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 969
Example 1
Two-Sample T Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 969
Example 2
Bonferroni and Dunn-Sidak Adjustments . . . . . . . . . . . . . . 971
Example 3
T Test Assumptions. . . . . . . . . . . . . . . . . . . . . . . . . . . 973
Example 4
Paired T Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 975
Example 5
One-Sample T Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 977
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 978
31 Test Item Analysis 979
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 980
Classical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 981
Latent Trait Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 982
Test Item Analysis in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . 983
Classical Test Item Analysis Main Dialog Box . . . . . . . . . . . 983
Logistic Test Item Analysis Main Dialog Box . . . . . . . . . . . . 984
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 985
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . 985
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 989
Example 1
Classical Test Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 989
Example 2
Logistic Model (One Parameter) . . . . . . . . . . . . . . . . . . . . 990
Example 3
Logistic Model (Two Parameter). . . . . . . . . . . . . . . . . . . . 993
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 995
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 996
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 997
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 997
32 Time Series 999
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 1000
Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1000
ARIMA Modeling and Forecasting. . . . . . . . . . . . . . . . . . 1003
Seasonal Decomposition and Adjustment . . . . . . . . . . . . . 1013
Exponential Smoothing . . . . . . . . . . . . . . . . . . . . . . . . 1014
Fourier Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1015
Graphical Displays for Time Series in SYSTAT . . . . . . . . . . . . . 1016
T-Plot Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . 1016
Time Main Dialog Box. . . . . . . . . . . . . . . . . . . . . . . . . 1017
ACF Plot Main Dialog Box. . . . . . . . . . . . . . . . . . . . . . . 1018
PACF Plot Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . 1018
CCF Plot Main Dialog Box. . . . . . . . . . . . . . . . . . . . . . . 1019
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . 1020
Transformations of Time Series in SYSTAT . . . . . . . . . . . . . . . 1020
Transformations Main Dialog Box . . . . . . . . . . . . . . . . . . 1020
Clear Series. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1021
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . 1022
Smoothing a Time Series in SYSTAT. . . . . . . . . . . . . . . . . . . . 1022
Smooth Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . . 1022
LOWESS Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . 1023
Exponential Smoothing Main Dialog Box . . . . . . . . . . . . . . 1024
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1025
Seasonal Adjustments in SYSTAT . . . . . . . . . . . . . . . . . . . . . 1025
Seasonal Adjustment Main Dialog Box . . . . . . . . . . . . . . . 1025
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1026
ARIMA Models in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . 1026
ARIMA Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . 1026
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1027
Fourier Models in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . 1028
Fourier Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . 1028
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1029
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1029
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1030
Example 1
Time Series Plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1030
Example 2
Autocorrelation Plot . . . . . . . . . . . . . . . . . . . . . . . . . . 1031
Example 3
Partial Autocorrelation Plot . . . . . . . . . . . . . . . . . . . . . . 1032
Example 4
Cross-Correlation Plot . . . . . . . . . . . . . . . . . . . . . . . . . 1033
Example 5
Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1035
Example 6
Moving Averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1038
Example 7
Smoothing (A 4253H Filter) . . . . . . . . . . . . . . . . . . . . . . 1039
Example 8
LOWESS Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . 1040
Example 9
Multiplicative Seasonal Factor . . . . . . . . . . . . . . . . . . . . 1042
Example 10
Multiplicative Seasonality with a Linear Trend . . . . . . . . . . . 1043
Example 11
ARIMA Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1047
Example 12
Fourier Modeling of Temperature . . . . . . . . . . . . . . . . . . 1054
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1056
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1056
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1057
33 Two-Stage Least Squares 1059
Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 1059
Two-Stage Least Squares Estimation . . . . . . . . . . . . . . . . 1059
Heteroskedasticity. . . . . . . . . . . . . . . . . . . . . . . . . . . 1060
Two-Stage Least Squares in SYSTAT . . . . . . . . . . . . . . . . . . 1061
Two-Stage Least Squares Main Dialog Box . . . . . . . . . . . . 1061
Using Command . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1063
Usage Considerations. . . . . . . . . . . . . . . . . . . . . . . . . 1063
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1064
Example 1
Heteroskedasticity-Consistent Standard Errors . . . . . . . . . . 1064
Example 2
Two-Stage Least Squares . . . . . . . . . . . . . . . . . . . . . . 1066
Example 3
Two-Stage Instrumental Variables . . . . . . . . . . . . . . . . . 1068
Example 4
Polynomially Distributed Lags . . . . . . . . . . . . . . . . . . . . 1069
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1070
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1070
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1070
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1071
Index 1073
Li st of Exampl es
Actuarial Life Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 928
Additive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Analysis of Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
ANOVA Assumptions and Contrasts . . . . . . . . . . . . . . . . . . . 412
ARIMA Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1047
Autocorrelation Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1031
Automatic Backward Stepping . . . . . . . . . . . . . . . . . . . . . . 268
Automatic Forward Stepping . . . . . . . . . . . . . . . . . . . . . . . . 263
Automatic Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . 387
Basic Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Binary Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
Binary Logit with Interactions . . . . . . . . . . . . . . . . . . . . . . . 537
Binary Logit with Multiple Predictors . . . . . . . . . . . . . . . . . . . 536
Binary Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
Bonferroni and Dunn-Sidak Adjustments . . . . . . . . . . . . . . . . . 971
Box and Behnken Design . . . . . . . . . . . . . . . . . . . . . . . . . . 243
Box and Hunter Fractional Factorial Design . . . . . . . . . . . . . . . 238
Box-Cox Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
By-Choice Data Format . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . 514
Canonical Correlations: Using Text Output . . . . . . . . . . . . . . . . . 26
Canonical Correlations: Simple Model . . . . . . . . . . . . . . . . . 828
Chi-Square Model for Signal Detection . . . . . . . . . . . . . . . . . . 860
Choice Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Circle Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 799
Classical Test Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 989
Classification Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Cochran's Test of Linear Trend . . . . . . . . . . . . . . . . . . . . . . 196
Complete Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Complete Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . 237
Conditional Logistic Regression . . . . . . . . . . . . . . . . . . . . . . 556
Confidence Curves and Regions . . . . . . . . . . . . . . . . . . . . . . 665
Confidence Interval on a Median . . . . . . . . . . . . . . . . . . . . . . 25
Confidence Intervals for One-Way Table Percentages . . . . . . . . . 201
Contingency Table Analysis . . . . . . . . . . . . . . . . . . . . . . . . 835
Contouring the Loss Function . . . . . . . . . . . . . . . . . . . . . . . 673
Contrasts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Correspondence Analysis (Simple) . . . . . . . . . . . . . . . . . . . . 153
Covariance Alternatives to Repeated Measures . . . . . . . . . . . . 502
Cox Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 936
Cross-Correlation Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . 1033
Crossover and Changeover Designs . . . . . . . . . . . . . . . . . . . 490
Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
Deciles of Risk and Model Diagnostics . . . . . . . . . . . . . . . . . . 542
Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1035
Discrete Choice Models . . . . . . . . . . . . . . . . . . . . . . . . . . 558
Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
Employment Discrimination . . . . . . . . . . . . . . . . . . . . . . . . 107
Factor Analysis Using a Covariance Matrix . . . . . . . . . . . . . . . 324
Factor Analysis Using a Rectangular File . . . . . . . . . . . . . . . . . 327
Fisher's Exact Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Fixing Parameters and Evaluating Fit . . . . . . . . . . . . . . . . . . . 668
Fourier Modeling of Temperature . . . . . . . . . . . . . . . . . . . . . 1054
Fractional Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . 482
Frequency Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Friedman Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708
Functions of Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 671
Gamma Model for Signal Detection . . . . . . . . . . . . . . . . . . . . 864
Guttman Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
Hadi Robust Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . 140
Heteroskedasticity-Consistent Standard Errors . . . . . . . . . . . . 1064
Hierarchical Clustering: Clustering Cases . . . . . . . . . . . . . . . . . .76
Hierarchical Clustering: Clustering Variables . . . . . . . . . . . . . . . .79
Hierarchical Clustering: Clustering Variables and Cases . . . . . . . . .80
Hierarchical Clustering: Distance Matrix Input . . . . . . . . . . . . . . .82
Hotelling's T-Square . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
Incomplete Block Designs . . . . . . . . . . . . . . . . . . . . . . . . . . 480
Individual Differences Multidimensional Scaling . . . . . . . . . . . . . 628
Interactive Stepping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
Interactive Stepwise Regression . . . . . . . . . . . . . . . . . . . . . 390
Internal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800
Iterated Principal Axis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
Iteratively Reweighted Least Squares for Logistic Models . . . . . . . 676
Kinetic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 688
K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71
Kriging (Ordinary) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
Kruskal Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
Kruskal-Wallis Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
Latin Square Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
Latin Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Life Tables: The Kaplan-Meier Estimator . . . . . . . . . . . . . . . . . 925
Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .21
Logistic Model (One-Parameter) . . . . . . . . . . . . . . . . . . . . . . 990
Logistic Model (Two-Parameter) . . . . . . . . . . . . . . . . . . . . . . 993
Logistic Model for Signal Detection . . . . . . . . . . . . . . . . . . . . 856
Loglinear Modeling of a Four-Way Table . . . . . . . . . . . . . . . . . 595
LOWESS Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1040
Mann-Whitney Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704
Mantel-Haenszel Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . 675
McNemars Test of Symmetry . . . . . . . . . . . . . . . . . . . . . . . 199
Minimizing an Analytic Function . . . . . . . . . . . . . . . . . . . . . . 689
Missing Category Codes . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Missing Cells Designs (the Means Model) . . . . . . . . . . . . . . . 494
Missing Data: EM Estimation . . . . . . . . . . . . . . . . . . . . . . . . 135
Missing Data: Pairwise Deletion . . . . . . . . . . . . . . . . . . . . . . 134
Mixed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
Mixture Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
Moving Averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1038
Multinomial Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
Multiple Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
Multiple Correspondence Analysis . . . . . . . . . . . . . . . . . . . . 155
Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . 383
Multiplicative Seasonal Factor . . . . . . . . . . . . . . . . . . . . . . 1042
Multiplicative Seasonality with a Linear Trend . . . . . . . . . . . . . 1043
Multivariate Analysis of Variance . . . . . . . . . . . . . . . . . . . . . 450
Multiway Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Negative Exponential Model for Signal Detection . . . . . . . . . . . 857
Nested Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
Nonlinear Model with Three Parameters . . . . . . . . . . . . . . . . . 662
Nonmetric Unfolding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
Nonparametric Model for Signal Detection . . . . . . . . . . . . . . . 855
Normal Distribution Model for Signal Detection . . . . . . . . . . . . . 850
Odds Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
One-Sample Kolmogorov-Smirnov Test . . . . . . . . . . . . . . . . . 710
One-Sample T Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 977
One-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409, 473
One-Way Repeated Measures . . . . . . . . . . . . . . . . . . . . . . . 434
One-Way Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Paired T Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 975
Partial Autocorrelation Plot . . . . . . . . . . . . . . . . . . . . . . . . 1032
Partial Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
Partial Set Correlation Model . . . . . . . . . . . . . . . . . . . . . . . . 831
Path Analysis and Standard Errors . . . . . . . . . . . . . . . . . . . . . 771
Path Analysis Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
Path Analysis Using Rectangular Input . . . . . . . . . . . . . . . . . . 764
Path Analysis with a Restart File . . . . . . . . . . . . . . . . . . . . . . 751
Pearson Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Percentages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Piecewise Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 686
Plackett-Burman Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Point Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 899
Poisson Model for Signal Detection . . . . . . . . . . . . . . . . . . . . 863
Polynomially Distributed Lags . . . . . . . . . . . . . . . . . . . . . . . 1069
Power Scaling Ratio Data . . . . . . . . . . . . . . . . . . . . . . . . . . 636
Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
Principal Components Analysis (Within Groups) . . . . . . . . . . . . . 510
Probabilities Associated with Correlations . . . . . . . . . . . . . . . . 137
Probit Analysis (Simple Model) . . . . . . . . . . . . . . . . . . . . . . 811
Probit Analysis with Interactions . . . . . . . . . . . . . . . . . . . . . . 813
Procrustes Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
Quadratic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
Quasi-Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 574
Randomized Block Designs . . . . . . . . . . . . . . . . . . . . . . . . . 480
Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
Regression Tree with Box Plots . . . . . . . . . . . . . . . . . . . . . . .46
Regression Tree with Dit Plots . . . . . . . . . . . . . . . . . . . . . . . .48
Regression with Ecological or Grouped Data . . . . . . . . . . . . . . 398
Regression without the Constant . . . . . . . . . . . . . . . . . . . . . 399
Repeated Measures Analysis of Covariance . . . . . . . . . . . . . . 448
Repeated Measures ANOVA for One Grouping Factor and
One Within Factor with Ordered Levels . . . . . . . . . . . . . . . . . . 440
Repeated Measures ANOVA for Two Grouping Factors and
One Within Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
Repeated Measures ANOVA for Two Trial Factors . . . . . . . . . . . 445
Residuals and Diagnostics for Simple Linear Regression . . . . . . . 380
Robust Estimation (Measures of Location) . . . . . . . . . . . . . . . 678
Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
S2 and S3 Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Saving Basic Statistics: Multiple Statistics and
Grouping Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Saving Basic Statistics: One Statistic and
One Grouping Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Scalogram Analysis: A Perfect Fit . . . . . . . . . . . . . . . . . . . . 720
Screening Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
Separate Variance Hypothesis Tests . . . . . . . . . . . . . . . . . . . 431
Sign and Wilcoxon Tests for Multiple Variables . . . . . . . . . . . . . 707
Sign Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705
Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . 374
Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 898
Single-Degree-of-Freedom Designs . . . . . . . . . . . . . . . . . . . 427
Smoothing (A 4253H Filter) . . . . . . . . . . . . . . . . . . . . . . . . . 1039
Spearman Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Spearman Rank Correlation . . . . . . . . . . . . . . . . . . . . . . . . . 24
Split Plot Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
Stem-and-Leaf Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 943
Stratified Cox Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 938
Stratified Kaplan-Meier Estimation . . . . . . . . . . . . . . . . . . . . . 929
Structural Zeros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
T Test Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 973
Tables with Ordered Categories . . . . . . . . . . . . . . . . . . . . . . 198
Tables without Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
Taguchi Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Testing Nonzero Null Hypotheses . . . . . . . . . . . . . . . . . . . . . 397
Testing whether a Single Coefficient Equals Zero . . . . . . . . . . . . 394
Testing whether Multiple Coefficients Equal Zero . . . . . . . . . . . . 396
Tetrachoric Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
The Weibull Model for Fully Parametric Analyses . . . . . . . . . . . . 945
Time Series Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1030
Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
Turnbull Estimation: K-M for Interval-Censored Data . . . . . . . . . . 933
Two-Sample Kolmogorov-Smirnov Test . . . . . . . . . . . . . . . . . . 705
Two-Sample T Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 969
Two-Stage Instrumental Variables . . . . . . . . . . . . . . . . . . . . 1068
Two-Stage Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . 1066
Two-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
Two-Way Table Statistics (Long Results) . . . . . . . . . . . . . . . . . 190
Two-Way Table Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Two-Way Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Unusual Distances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 904
Vector Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
Wald-Wolfowitz Runs Test . . . . . . . . . . . . . . . . . . . . . . . . . 712
Weighting Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
Wilcoxon Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
Word Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Chapter 1
Introduction to Statistics
Leland Wilkinson
Statistics and state have the same root. Statistics are the numbers of the state. More
generally, they are any numbers or symbols that formally summarize our observations
of the world. As we all know, summaries can mislead or elucidate. Statistics also
refers to the introductory course we all seem to hate in college. When taught well,
however, it is this course that teaches us how to use numbers to elucidate rather than
to mislead.
Statisticians specialize in many areas: probability, exploratory data analysis,
modeling, social policy, decision making, and others. While they may philosophically
disagree, statisticians nevertheless recognize at least two fundamental tasks:
description and inference. Description involves characterizing a batch of data in
simple but informative ways. Inference involves generalizing from a sample of data
to a larger population of possible data. Descriptive statistics help us to observe more
acutely, and inferential statistics help us to formulate and test hypotheses.
Any distinctions, such as this one between descriptive and inferential statistics, are
potentially misleading. Let's look at some examples, however, to see some
differences between these approaches.
Descriptive Statistics
Descriptive statistics may be single numerical summaries of a batch, such as an
average. Or, they may be more complex tables and graphs. What distinguishes
descriptive statistics is their reference to a given batch of data rather than to a more
general population or class. While there are exceptions, we usually examine
descriptive statistics to understand the structure of a batch. A closely related field is
called exploratory data analysis. Both exploratory and descriptive methods may lead
us to formulate laws or test hypotheses, but their focus is on the data at hand.
Consider, for example, the following batch. These are numbers of arrests by sex in
1985 for selected crimes in the United States. The source is the FBI Uniform Crime
Reports. What can we say about differences between the patterns of arrests of men and
women in the United States in 1985?
CRIME MALES FEMALES
murder 12904 1815
rape 28865 303
robbery 105401 8639
assault 211228 32926
burglary 326959 26753
larceny 744423 334053
auto 97835 10093
arson 13129 2003
battery 416735 75937
forgery 46286 23181
fraud 151773 111825
embezzle 5624 3184
vandal 181600 20192
weapons 134210 10970
vice 29584 67592
sex 74602 6108
drugs 562754 90038
gambling 21995 3879
family 35553 5086
dui 1208416 157131
drunk 726214 70573
disorderly 435198 99252
vagrancy 24592 3001
runaway 53808 72473

Know Your Batch

First, we must be careful in characterizing the batch. These statistics do not cover the
gamut of U.S. crimes. We left out curfew and loitering violations, for example. Not all
reported crimes are included in these statistics. Some false arrests may be included.
State laws vary on the definitions of some of these crimes. Agencies may modify arrest
statistics for political purposes. Know where your batch came from before you use it.
Sum, Mean, and Standard Deviation
Were there more male than female arrests for these crimes in 1985? The following
output shows us the answer:

MALES FEMALES
N of cases 24 24
Minimum 5624.000 303.000
Maximum 1208416.000 334053.000
Sum 5649688.000 1237007.000
Mean 235403.667 51541.958
Standard Dev 305947.056 74220.864

Males were arrested for 5,649,688 crimes (not 5,649,688 males; some may have been
arrested more than once). Females were arrested 1,237,007 times.
How about the average (mean) number of arrests for a crime? For males, this was
235,403 and for females, 51,542. Does the mean make any sense to you as a summary
statistic? Another statistic in the table, the standard deviation, measures how much
these numbers vary around the average. The standard deviation is the square root of the
average squared deviation of the observations from their mean. It, too, has problems in
this instance. First of all, both the mean and standard deviation should represent what
you could observe in your batch, on average: the mean number of fish in a pond, the
mean number of children in a classroom, the mean number of red blood cells per cubic
millimeter. Here, we would have to say, the mean "murder-rape-robbery-...-runaway"
type of crime. Second, even if the mean made sense descriptively, we might question
its use as a typical crime-arrest statistic. To see why, we need to examine the shape of
these numbers.
Stem-and-Leaf Plots
Let's look at a display that compresses these data a little less drastically. The stem-and-
leaf plot is like a tally. We pick a most significant digit or digits and tally the next digit
to the right. By using trailing digits instead of tally marks, we preserve extra digits in
the data. Notice the shape of the tally. There are mostly smaller numbers of arrests and
a few crimes (such as larceny and driving under the influence of alcohol) with larger
numbers of arrests. Another way of saying this is that the data are positively skewed
toward larger numbers for both males and females.
The Median
When data are skewed like this, the mean gets pulled from the center of the majority
of numbers toward the extreme with the few. A statistic that is not as sensitive to
extreme values is the median. The median is the value above which half the data fall.
More precisely, if you sort the data, the median is the middle value or the average of
the two middle values. Notice that for males the median is 101,618, and for females,
21,686. Both are considerably smaller than the means and more typical of the majority
of the numbers. This is why the median is often used for representing skewed data,
such as incomes, populations, or reaction times.
We still have the same representativeness problem that we had with the mean,
however. Even if the medians corresponded to real data values in this batch (which
they don't because there is an even number of observations), it would be hard to
characterize what they would represent.
Stem and Leaf Plot of variable: MALES, N = 24
Minimum: 5624.000
Lower hinge: 29224.500
Median: 101618.000
Upper hinge: 371847.000
Maximum: 1208416.000
0 H 011222234579
1 M 0358
2 1
3 H 2
4 13
5 6
6
7 24
* * * Outside Values * * *
12 0
Stem and Leaf Plot of variable: FEMALES, N = 24
Minimum: 303.000
Lower hinge: 4482.500
Median: 21686.500
Upper hinge: 74205.000
Maximum: 334053.000
0 H 00000000011
0 M 2223
0
0 H 6777
0 99
1 1
1
1 5
* * * Outside Values * * *
3 3
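If you want to verify these summaries outside SYSTAT, the computation is easy to sketch in Python (this is an illustration, not SYSTAT syntax); the list below is the MALES column from the arrest table above.

import statistics

# MALES column of the 1985 arrest batch shown earlier in this chapter
males = [12904, 28865, 105401, 211228, 326959, 744423, 97835, 13129,
         416735, 46286, 151773, 5624, 181600, 134210, 29584, 74602,
         562754, 21995, 35553, 1208416, 726214, 435198, 24592, 53808]

total = sum(males)                 # 5,649,688 arrests
mean = total / len(males)          # 235,403.667
sd = statistics.stdev(males)       # sample standard deviation (n - 1 denominator)
median = statistics.median(males)  # average of the 12th and 13th sorted values: 101,618
print(total, round(mean, 3), round(sd, 3), median)

Because the batch is skewed, the mean (about 235,404) sits far above the median (101,618), which is the point made in the text above.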
Sorting
Most people think of means, standard deviations, and medians as the primary
descriptive statistics. They are useful summary quantities when the observations
represent values of a single variable. We purposely chose an example where they are
less appropriate, however, even when they are easily computable. There are better
ways to reveal the patterns in these data. Let's look at sorting as a way of uncovering
structure.
I was talking once with an FBI agent who had helped to uncover the Chicago
machine's voting fraud scandal some years ago. He was a statistician, so I was curious
what statistical methods he used to prove the fraud. He replied, "We sorted the voter
registration tape alphabetically by last name. Then we looked for duplicate names and
addresses." Sorting is one of the most basic and powerful data analysis techniques. The
stem-and-leaf plot, for example, is a sorted display.
We can sort on any numerical or character variable. It depends on our goal. We
began this chapter with a question: Are there differences between the patterns of arrests
of men and women in the United States in 1985? How about sorting the male and
female arrests separately? If we do this, we will get a list of crimes in order of
decreasing frequency within sex.
MALES FEMALES
dui larceny
larceny dui
drunk fraud
drugs disorderly
disorderly drugs
battery battery
burglary runaway
assault drunk
vandal vice
fraud assault
weapons burglary
robbery forgery
auto vandal
sex weapons
runaway auto
forgery robbery
family sex
vice family
rape gambling
vagrancy embezzle
gambling vagrancy
arson arson
murder murder
embezzle rape
You might want to connect similar crimes with lines. The number of crossings would
indicate differences in ranks.
Standardizing
This ranking is influenced by prevalence. The most frequent crimes occur at the top of
the list in both groups. Comparisons within crimes are obscured by this influence. Men
committed almost 100 times as many rapes as women, for example, yet rape is near the
bottom of both lists. If we are interested in contrasting the sexes on patterns of crime
while holding prevalence constant, we must standardize the data. There are several
ways to do this. You may have heard of standardized test scores for aptitude tests.
These are usually produced by subtracting means and then dividing by standard
deviations. Another method is simply to divide by row or column totals. For the crime
data, we will divide by totals within rows (each crime). Doing so gives us the
proportion of each arresting crime committed by men or women. The total of these two
proportions will thus be 1.
Now, a contrast between men and women on this standardized value should reveal
variations in arrest patterns within crime type. By subtracting the female proportion
from the male, we will highlight primarily male crimes with positive values and female
crimes with negative. Next, sort these differences and plot them in a simple graph. The
following shows the result:
Now we can see clear contrasts between males and females in arrest patterns. The
predominantly aggressive crimes appear at the top of the list. Rape now appears where
it belongs: an aggressive, rather than sexual, crime. A few crimes dominated by
females are at the bottom.
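The standardizing step itself is simple to express outside SYSTAT. The following Python sketch (illustrative only, not SYSTAT syntax) uses four crimes from the table above to show how dividing by the row totals and differencing the two proportions pulls predominantly male crimes to the top and predominantly female crimes to the bottom.

# (crime, male arrests, female arrests) taken from the 1985 arrest table above
rows = [("rape", 28865, 303),
        ("dui", 1208416, 157131),
        ("vice", 29584, 67592),
        ("runaway", 53808, 72473)]

diffs = []
for crime, m, f in rows:
    total = m + f
    male_prop = m / total        # proportion of this crime's arrests that are male
    female_prop = f / total      # the two proportions sum to 1
    diffs.append((male_prop - female_prop, crime))

# Sort so predominantly male crimes come first and predominantly female crimes last
for diff, crime in sorted(diffs, reverse=True):
    print(f"{crime:8s} {diff:+.3f}")

With these four crimes, rape sorts to the top while vice and runaway fall to the bottom, matching the pattern described above.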
Inferential Statistics
We often want to do more than describe a particular sample. In order to generalize,
formulate a policy, or test a hypothesis, we need to make an inference. Making an
inference implies that we think a model describes a more general population from
which our data have been randomly sampled. Sometimes it is difficult to imagine a
population from which you have gathered data. A population can be all possible
voters, all possible replications of this experiment, or all possible moviegoers.
When you make inferences, you should have a population in mind.
What Is a Population?
We are going to use inferential methods to estimate the mean age of the unusual
population contained in the 1980 edition of Who's Who in America. We could enter all
73,500 ages into a SYSTAT file and compute the mean age exactly. If it were practical,
this would be the preferred method. Sometimes, however, a sampling estimate can be
more accurate than an entire census. For example, biases are introduced into large
censuses from refusals to comply, keypunch or coding errors, and other sources. In
8
Chapter 1
these cases, a carefully constructed random sample can yield less-biased information
about the population.
This is an unusual population because it is contained in a book and is therefore finite.
We are not about to estimate the mean age of the rich and famous. After all, Spy
magazine used to have a regular feature listing all of the famous people who are not in
Who's Who. And bogus listings may escape the careful fact checking of the Who's Who
research staff. When we get our estimate, we might be tempted to generalize beyond
the book, but we would be wrong to do so. For example, if a psychologist measures
opinions in a random sample from a class of college sophomores, his or her
conclusions should begin with the statement, "College sophomores at my university
think..." If the word "people" is substituted for "college sophomores," it is the
experimenter's responsibility to make clear that the sample is representative of the
larger group on all attributes that might affect the results.
Picking a Simple Random Sample
That our population is finite should cause us no problems as long as our sample is much
smaller than the population. Otherwise, we would have to use special techniques to
adjust for the bias it would cause. How do we choose a simple random sample from
a population? We use a method that ensures that every possible sample of a given size
has an equal chance of being chosen. The following methods are not random:
n Pick the first name on every tenth page (some names have no chance of being
chosen).
n Close your eyes, flip the pages of the book, and point to a name (Tversky and others
have done research that shows that humans cannot behave randomly).
n Randomly pick the first letter of the last name and randomly choose from the
names beginning with that letter (there are more names beginning with "C," for
example, than with "I").
The way to pick randomly from a book, file, or any finite population is to assign a
number to each name or case and then pick a sample of numbers randomly. You can
use SYSTAT to generate a random number between 1 and 73,500, for example, with
the expression:
1 + INT(73500*URN)
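The same idea is easy to carry out in other environments. Here is a Python sketch (not SYSTAT syntax) that numbers every case and draws case numbers at random; the population size of 73,500 and the sample size of 50 come from the discussion in this section.

import random

POPULATION_SIZE = 73500   # listings in the 1980 edition
SAMPLE_SIZE = 50

# random.sample draws without replacement, so every possible set of 50 case
# numbers has the same chance of being chosen -- a simple random sample.
case_numbers = random.sample(range(1, POPULATION_SIZE + 1), SAMPLE_SIZE)

# A single random case number, analogous to the SYSTAT expression above
one_case = random.randint(1, POPULATION_SIZE)
print(sorted(case_numbers)[:5], one_case)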
There are too many pages in Who's Who to use this method, however. As a short cut, I
randomly generated a page number and picked a name from the page using the random
number generator. This method should work well provided that each page has
approximately the same number of names (between 19 and 21 in this case). The sample
is shown below:
AGE SEX AGE SEX
60 male 38 female
74 male 44 male
39 female 49 male
78 male 62 male
66 male 76 female
63 male 51 male
45 male 51 male
56 male 75 male
65 male 65 female
51 male 41 male
52 male 67 male
59 male 50 male
67 male 55 male
48 male 45 male
36 female 49 male
34 female 58 male
68 male 47 male
50 male 55 male
51 male 67 male
47 male 58 male
81 male 76 male
56 male 70 male
49 male 69 male
58 male 46 male
58 male 60 male
Specifying a Model
To make an inference about age, we need to construct a model for our population:

a = α + ε

This model says that the age (a) of someone we pick from the book can be described
by an overall mean age (α) plus an amount of error (ε) specific to that person and due
to random factors that are too numerous and insignificant to describe systematically.
Notice that we use Greek letters to denote things that we cannot observe directly and
Roman letters for those that we do observe. Of the unobservables in the model, α is
called a parameter, and ε, a random variable. A parameter is a constant that helps to
describe a population. Parameters indicate how a model is an instance of a family of
models for similar populations. A random variable varies like the tossing of a coin.
There are two more parameters associated with the random variable ε but not
appearing in the model equation. One is its mean (με), which we have rigged to be 0,
and the other is its standard deviation (σε, or simply σ). Because a is simply the sum
of α (a constant) and ε (a random variable), its standard deviation is also σ.
In specifying this model, we assume the following:
n The model is true for every member of the population.
n The error, plus or minus, that helps determine one population member's age is
independent of (not predictable from) the error for other members.
n The errors in predicting all of the ages come from the same random distribution
with a mean of 0 and a standard deviation of σ.
Estimating a Model
Because we have not sampled the entire population, we cannot compute the parameter
values directly from the data. We have only a small sample from a much larger
population, so we can estimate the parameter values only by using some statistical
method on our sample data. When our three assumptions are appropriate, the sample
mean will be a good estimate of the population mean. Without going into all of the
details, the sample estimate will be, on average, close to the values of the mean in the
population.
We can use various methods in SYSTAT to estimate the mean. One way is to
specify our model using Linear Regression. Select AGE and add it to the Dependent
list. With commands:
REGRESSION
MODEL AGE=CONSTANT
This model says that AGE is a function of a constant value (α). The rest is error (ε).
Another method is to compute the mean from the Basic Statistics routines. The result
is shown below:

AGE
N OF CASES 50
MEAN 56.700
STANDARD DEV 11.620
STD. ERROR 1.643

Our best estimate of the mean age of people in Who's Who is 56.7 years.
Confidence Intervals
Our estimate seems reasonable, but it is not exactly correct. If we took more samples
of size 50 and computed estimates, how much would we expect them to vary? First, it
should be plain without any mathematics to see that the larger our sample, the closer
will be our sample estimate to the true value of α in the population. After all, if we
could sample the entire population, the estimates would be the true values. Even so, the
variation in sample estimates is a function only of the sample size and the variation of
the ages in the population. It does not depend on the size of the population (number of
people in the book). Specifically, the standard deviation of the sample mean is the
standard deviation of the population divided by the square root of the sample size. This
standard error of the mean is listed on the output above as 1.643. On average, we
would expect our sample estimates of the mean age to vary by plus or minus a little
more than one and a half years, assuming samples of size 50.
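As a quick arithmetic check, the standard error quoted above follows directly from the sample standard deviation and the sample size. Here is the computation as a small Python sketch (not SYSTAT output).

import math

sd = 11.620   # sample standard deviation of the 50 ages
n = 50        # sample size

standard_error = sd / math.sqrt(n)
print(round(standard_error, 3))   # 1.643, matching STD. ERROR in the output above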
If we knew the shape of the sampling distribution of mean age, we would be able to
complete our description of the accuracy of our estimate. There is an approximation
that works quite well, however. If the sample size is reasonably large (say, greater than
25), then the mean of a simple random sample is approximately normally distributed.
This is true even if the population distribution is not normal, provided the sample size
is large.
We now have enough information from our sample to construct a normal
approximation of the distribution of our sample mean. The following figure shows this
approximation to be centered at the sample estimate of 56.7 years. Its standard
deviation is taken from the standard error of the mean, 1.643 years.
We have drawn the graph so that the central area comprises 95% of all the area under
the curve (from about 53.5 to 59.9). From this normal approximation, we have built a
95% symmetric confidence interval that gives us a specific idea of the variability of
our estimate. If we did this entire procedure again (sample 50 names, compute the
mean and its standard error, and construct a 95% confidence interval using the normal
approximation), then we would expect that 95 intervals out of a hundred so
constructed would cover the real population mean age. Remember, population mean
age is not necessarily at the center of the interval that we just constructed, but we do
expect the interval to be close to it.
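The interval quoted above is just the sample mean plus and minus 1.96 standard errors, where 1.96 is the standard normal value that leaves 2.5% in each tail. A quick check in Python (not SYSTAT output):

mean = 56.7   # sample mean age
se = 1.643    # standard error of the mean

lower = mean - 1.96 * se
upper = mean + 1.96 * se
print(round(lower, 1), round(upper, 1))   # about 53.5 and 59.9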
Hypothesis Testing
From the sample mean and its standard error, we can also construct hypothesis tests on
the mean. Suppose that someone believed that the average age of those listed in Whos
Who is 61 years. After all, we might have picked an unusual sample just through the
luck of the draw. Let's say, for argument, that the population mean age is 61 and the
standard deviation is 11.62. How likely would it be to find a sample mean age of 56.7?
If it is very unlikely, then we would reject this null hypothesis that the population mean
is 61. Otherwise, we would fail to reject it.
There are several ways to represent an alternative hypothesis against this null
hypothesis. We could make a simple alternative value of 56.7 years. Usually, however,
we make the alternative composite; that is, it represents a range of possibilities that do
not include the value 61. Here is how it would look:
H0: μ = 61 (null hypothesis)
HA: μ ≠ 61 (alternative hypothesis)
We would reject the null hypothesis if our sample value for the mean were outside of
a set of values that a population value of 61 could plausibly generate. In this context,
"plausible" means more probable than a conventionally agreed-upon critical level for
our test. This value is usually 0.05. A result that would be expected to occur fewer than
five times in a hundred samples is considered significant and would be a basis for
rejecting our null hypothesis.
Constructing this hypothesis test is mathematically equivalent to sliding the normal
distribution in the above figure to center over 61. We then look at the sample value 56.7
to see if it is outside of the middle 95% of the area under the curve. If so, we reject the
null hypothesis.
The following t test output shows a p value (probability) of 0.012 for this test. Because
this value is lower than 0.05, we would reject the null hypothesis that the mean age is
61. This is equivalent to saying that the value of 61 does not appear in the 95%
confidence interval.
One-sample t test of AGE with 50 cases; Ho: Mean = 61.000

Mean = 56.700 95.00% CI = 53.398 to 60.002
SD = 11.620 t = -2.617
df = 49 Prob = 0.012
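The t statistic in this output is simply the distance between the sample mean and the hypothesized mean, measured in standard errors. The following Python sketch reproduces the computation; the tail probability uses SciPy's t distribution, which is an assumption of this sketch rather than something taken from the SYSTAT output.

import math
from scipy import stats   # assumed to be available; used only for the tail probability

mean, mu0, sd, n = 56.7, 61.0, 11.620, 50
se = sd / math.sqrt(n)

t = (mean - mu0) / se                   # about -2.617
p = 2 * stats.t.sf(abs(t), df=n - 1)    # two-tailed probability, about 0.012
print(round(t, 3), round(p, 3))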
The mathematical duality between confidence intervals and hypothesis testing may
lead you to wonder which is more useful. The answer is that it depends on the context.
Scientific journals usually follow a hypothesis testing model because their null
hypothesis value for an experiment is usually 0 and the scientist is attempting to reject
the hypothesis that nothing happened in the experiment. Any rejection is usually taken
to be interesting, even when the sample size is so large that even tiny differences from
0 will be detected.
Those involved in making decisions (epidemiologists, business people,
engineers) are often more interested in confidence intervals. They focus on the size
and credibility of an effect and care less whether it can be distinguished from 0. Some
statisticians, called Bayesians, go a step further and consider statistical decisions as a
form of betting. They use sample information to modify prior hypotheses. See Box and
Tiao (1973) or Berger (1985) for further information on Bayesian statistics.
Checking Assumptions
Now that we have finished our analyses, we should check some of the assumptions we
made in doing them. First, we should examine whether the data look normally
distributed. Although sample means will tend to be normally distributed even when the
population isn't, it helps to have a normally distributed population, especially when we
do not know the population standard deviation. The stem-and-leaf plot gives us a quick
idea:
Stem and leaf plot of variable: AGE , N = 50
Minimum: 34.000
Lower hinge: 49.000
Median: 56.000
Upper hinge: 66.000
Maximum: 81.000
3 4
3 689
4 14
4 H 556778999
5 0011112
5 M 556688889
6 0023
6 H 55677789
7 04
7 5668
8 1
There is another plot, called a dot histogram (dit) plot, which looks like a stem-and-leaf
plot. We can use different symbols to denote males and females in this plot, however,
to see if there are differences in these subgroups. Although there are not enough
females in the sample to be sure of a difference, it is nevertheless a good idea to
examine it. The dot histogram reveals four of the six females to be younger than
everyone else.
A better test of normality is to plot the sorted age values against the corresponding
values of a mathematical normal distribution. This is called a normal probability plot. If
the data are normally distributed, then the plotted values should fall approximately on a
straight line. Our data plot fairly straight. Again, different symbols are used for the
males and females. The four young females appear in the bottom left corner of the plot.
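If you want to construct a normal probability plot by hand, the pairing of sorted data values with normal quantiles can be sketched as follows (Python, not SYSTAT; SciPy is assumed for the inverse normal function). Plotting the pairs with any graphics tool reproduces the kind of display described above.

from scipy.stats import norm   # assumed available for the normal quantile function

def normal_scores(values):
    # Pair each sorted observation with the normal quantile for its plotting position
    n = len(values)
    ordered = sorted(values)
    # (i - 0.5)/n is one common choice of plotting position
    quantiles = [norm.ppf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))

ages = [34, 36, 38, 39, 41, 44, 45, 45, 46, 47]   # the ten youngest sampled ages, for illustration
for q, age in normal_scores(ages):
    print(f"{q:6.2f} {age}")
# If the data are roughly normal, these pairs fall near a straight line when plotted.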
Does this possible difference in ages by gender invalidate our results? No, but it
suggests that we might want to examine the gender differences further to see whether
or not they are significant.
References
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. 2nd ed. New York:
Springer Verlag.
Box, G. E. P. and Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading,
Mass.: Addison-Wesley.
Chapter 2
Bootstrapping and Sampling
Leland Wilkinson and Laszlo Engelman
Bootstrapping is not a module in SYSTAT. It is a procedure available in most modules
where appropriate. Bootstrapping is so important as a general statistical methodology,
however, that it deserves a separate chapter. SYSTAT handles bootstrapping as a
single option to the ESTIMATE command or its equivalent in each module. The
computations are handled without producing a scratch file of the bootstrapped
samples. This saves disk space and computer time. Bootstrap, jackknife, and other
samples are simply computed on-the-fly.
Statistical Background
Bootstrap (Efron and Tibshirani, 1993) is the most recent and most powerful of a
variety of strategies for producing estimates of parameters in samples taken from
unknown probability distributions. Efron and LePage (1992) summarize the problem
most succinctly. We have a set of real-valued observations x1, ..., xn independently
sampled from an unknown probability distribution F. We are interested in estimating
some parameter θ by using the information in the sample data with an estimator
θ̂ = t(x). Some measure of the estimate's accuracy is as important as the estimate
itself; we want a standard error of θ̂ and, even better, a confidence interval on the true
value θ.
Classical statistical methods provide a powerful way of handling this problem when
F is known and θ is simple, as when θ, for example, is the mean of the normal
distribution. Focusing on the standard error of the mean, we have:

se{x̄; F} = [σ²(F) / n]^(1/2)

Substituting the unbiased estimate

σ̂²(F) = Σ(xi − x̄)² / (n − 1)

for σ²(F), where the sum runs over i = 1, ..., n, we have:

se(x̄) = [Σ(xi − x̄)² / (n(n − 1))]^(1/2)

Parametric methods often work fairly well even when the distribution is contaminated
or only approximately known because the central limit theorem shows that sums of
independent random variables with finite variances tend to be normal in large samples
even when the variables themselves are not normal. But problems arise for estimates
more complicated than a mean: medians, sample correlation coefficients, or
eigenvalues, especially in small or medium-sized samples and even, in some cases, in
large samples.
Strategies for approaching this problem nonparametrically have involved using
the empirical distribution F̂ to obtain information needed for the standard error
estimate. One approach is Tukey's jackknife (Tukey, 1958), which is offered in
SAMPLE=JACKKNIFE. Tukey proposed computing n subsets of (x1, ..., xn), each
consisting of all of the cases except the ith deleted case (for i = 1, ..., n). He
produced standard errors as a function of the n estimates from these subsets.
Another approach has involved subsampling, usually via simple random samples.
This option is offered in SAMPLE=SIMPLE. A variety of researchers in the 1950s and
1960s explored these methods empirically (for example, Block, 1960; see Noreen,
1989, for others). This method amounts to a Monte Carlo study in which the sample is
treated as the population. It is also closely related to methodology for permutation tests
(Fisher, 1935; Dwass, 1957; Edginton, 1980).
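To make the leave-one-out idea concrete, here is a minimal Python sketch of a jackknife standard error (illustrative only; SYSTAT's SAMPLE=JACK option performs the corresponding computation internally, and the data below are placeholders, not from any SYSTAT file). It uses one common form of the jackknife variance.

import math

def jackknife_se(data, estimator):
    # Compute the estimate on each subset that omits the ith case
    n = len(data)
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = sum(leave_one_out) / n
    # One common form of the jackknife variance: (n - 1)/n times the sum of squared deviations
    var = (n - 1) / n * sum((est - mean_loo) ** 2 for est in leave_one_out)
    return math.sqrt(var)

sample = [2.0, 4.0, 4.5, 5.0, 7.5, 9.0]   # placeholder data
se = jackknife_se(sample, lambda xs: sum(xs) / len(xs))
print(round(se, 4))   # for the mean, this equals the usual s/sqrt(n)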
The bootstrap (Efron, 1979) has been the focus of most recent theoretical research.
F̂ is defined as:

F̂: probability 1/n on xi, for i = 1, 2, ..., n

Then, since

σ²(F̂) = Σ(xi − x̄)² / n

we have:

se{x̄; F̂} = [Σ(xi − x̄)² / n²]^(1/2)

The computer algorithm for getting the samples for generating F̂ is to sample from
(x1, ..., xn) with replacement. Efron and other researchers have shown that the general
procedure of generating samples and computing estimates yields data on which
we can make useful inferences. For example, instead of computing only θ̂ and its
standard error, we can do histograms, densities, order statistics (for symmetric and
asymmetric confidence intervals), and other computations on our estimates. In other
words, there is much to learn from the bootstrap sample distributions of the estimates
themselves.
There are some concerns, however. The naive bootstrap computed this way (with
SAMPLE=BOOT and STATS for computing means and standard deviations) is not
especially good for long-tailed distributions. It is also not suited for time-series or
stochastic data. See LePage and Billard (1992) for recent research on and solutions to
some of these problems. There are also several simple improvements to the naive
bootstrap. One is the pivot, or bootstrap-t, method, discussed in Efron and Tibshirani
(1993). This is especially useful for confidence intervals on the mean of an unknown
distribution. Efron (1982) discusses other applications. There are also refinements
based on correction for bias in the bootstrap sample itself (DiCiccio and Efron, 1996).
In general, however, the naive bootstrap can help you get better estimates of
standard errors and confidence intervals than many large-sample approximations, such
as Fisher's z transformation for Pearson correlations or Wald tests for coefficients in
nonlinear models. And in cases in which no good approximations are available (see
some of the examples below), the bootstrap is the only way to go.
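Before turning to the SYSTAT mechanics, here is the naive bootstrap written out as a short Python sketch (illustrative only; it is not what SYSTAT executes internally, and the data are placeholders): resample the cases with replacement, recompute the estimate each time, and summarize the bootstrap distribution with its standard deviation and percentiles.

import random
import statistics

def naive_bootstrap(data, estimator, n_samples=1000, seed=54321):
    # Draw n_samples bootstrap samples (with replacement) and return the estimates
    rng = random.Random(seed)
    n = len(data)
    return [estimator([rng.choice(data) for _ in range(n)]) for _ in range(n_samples)]

data = [12.1, 9.8, 11.4, 14.0, 10.2, 13.3, 9.1, 12.7]   # placeholder observations
boot = naive_bootstrap(data, statistics.mean)

se = statistics.stdev(boot)          # bootstrap standard error of the mean
boot.sort()
lower, upper = boot[24], boot[974]   # naive 95% percentile interval for 1,000 samples
print(round(se, 3), round(lower, 3), round(upper, 3))

The examples later in this chapter follow exactly this pattern, with SYSTAT doing the resampling and a few lines of BASIC doing the sorting and percentile extraction.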
Bootstrapping in SYSTAT
Bootstrap Main Dialog Box
No dialog box exists for performing bootstrapping; therefore, you must use SYSTAT's
command language. To do a bootstrap analysis, simply add the sample type to the
command that initiates model estimation (usually ESTIMATE).
Using Commands
The syntax is:

ESTIMATE / SAMPLE=BOOT(m,n)
           SIMPLE(m,n)
           JACK
The arguments m and n stand for the number of samples and the sample size of each
sample. The parameter n is optional and defaults to the number of cases in the file.
The BOOT option generates samples with replacement, SIMPLE generates samples
without replacement, and JACK generates a jackknife set.
Usage Considerations
Types of data. Bootstrapping works on procedures with rectangular data only.
Print options. It is best to set PRINT=NONE; otherwise, you will get 16 miles of output.
If you want to watch, however, set PRINT=LONG and have some fun.
Quick Graphs. Bootstrapping produces no Quick Graphs. You use the file of bootstrap
estimates and produce the graphs you want. See the examples.
Saving files. If you are doing this for more than entertainment (watching output fly by),
save your data into a file before you use the ESTIMATE / SAMPLE command. See the
examples.
BY groups. By all means. Are you a masochist?
Case frequencies. Yes, FREQ=<variable> works. This feature does not use extra
memory.
Case weights. Use case weighting if it is available in a specific module.
Examples
A few examples will serve to illustrate bootstrapping. They cover only a few of the
statistical modules, however. We will focus on the tools you can use to manipulate
output and get the summary statistics you need for bootstrap estimates.
Example 1
Linear Models
This example involves the famous Longley (1967) regression data. These real data
were collected by James Longley at the Bureau of Labor Statistics to test the limits of
regression software. The predictor variables in the data set are highly collinear, and
several coefficients of variation are extremely large. The input is:
USE LONGLEY
GLM
MODEL TOTAL=CONSTANT+DEFLATOR..TIME
SAVE BOOT / COEF
ESTIMATE / SAMPLE=BOOT(2500,16)
OUTPUT TEXT1
USE LONGLEY
MODEL TOTAL=CONSTANT+DEFLATOR..TIME
ESTIMATE
USE BOOT
STATS
STATS X(1..6)
OUTPUT *
BEGIN
DEN X(1..6) / NORM
DEN X(1..6)
END

Notice that we save the coefficients into the file BOOT. We request 2500 bootstrap
samples of size 16 (the number of cases in the file). Then we fit the Longley data with
a single regression to compare the result to our bootstrap. Finally, we use the bootstrap
file and compute basic statistics on the bootstrap estimated regression coefficients. The
OUTPUT command is used to save this part of the output to a file. We should not use it
earlier in the program unless we want to save the output for the 2500 regressions. To
view the bootstrap distributions, we create histograms on the coefficients to see their
distribution.
The resulting output is:
Variables in the SYSTAT Rectangular file are:
DEFLATOR GNP UNEMPLOY ARMFORCE POPULATN TIME
TOTAL
Dep Var: TOTAL N: 16 Multiple R: 0.998 Squared multiple R: 0.995
Adjusted squared multiple R: 0.992 Standard error of estimate: 304.854
Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)
CONSTANT -3482258.635 890420.384 0.0 . -3.911 0.004
DEFLATOR 15.062 84.915 0.046 0.007 0.177 0.863
GNP -0.036 0.033 -1.014 0.001 -1.070 0.313
UNEMPLOY -2.020 0.488 -0.538 0.030 -4.136 0.003
ARMFORCE -1.033 0.214 -0.205 0.279 -4.822 0.001
POPULATN -0.051 0.226 -0.101 0.003 -0.226 0.826
TIME 1829.151 455.478 2.480 0.001 4.016 0.003
Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 1.84172E+08 6 3.06954E+07 330.285 0.000
Residual 836424.056 9 92936.006
-------------------------------------------------------------------------------------------
---------------------------------------
Durbin-Watson D Statistic 2.559
First Order Autocorrelation -0.348
Variables in the SYSTAT Rectangular file are:
CONSTANT X(1..6)
X(1) X(2) X(3) X(4) X(5) X(6)
N of cases 2500 2500 2500 2500 2500 2499
Minimum -816.248 -0.846 -12.994 -8.864 -2.591 -5050.438
Maximum 1312.052 0.496 7.330 2.617 3142.235 12645.703
Mean 20.648 -0.049 -2.214 -1.118 1.295 1980.382
Standard Dev 128.301 0.064 0.903 0.480 62.845 980.870
Following is the plot of the results:

[Figure: histograms of the bootstrapped regression coefficients X(1) through X(6), shown with and without superimposed normal curves]
The bootstrapped standard errors are all larger than the normal-theory standard errors.
The most dramatically different are the ones for the POPULATN coefficient (62.845
versus 0.226). It is well known that multicollinearity leads to large standard errors for
regression coefficients, but the bootstrap makes this even clearer.
Normal curves have been superimposed on the histograms, showing that the
coefficients are not normally distributed. We have run a relatively large number of
samples (2500) to reveal these long-tailed distributions. Were these data to be analyzed
formally, it would take a huge number of samples to get useful standard errors.
Beaton, Rubin, and Barone (1976) used a randomization technique to highlight this
problem. They added a uniform random extra digit to Longley's data so that their data
sets rounded to Longley's values and found in a simulation that the variance of the
simulated coefficient estimates was larger in many cases than the miscalculated
solutions from the poorer designed regression programs.
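Readers working outside SYSTAT can reproduce the case-resampling scheme with a few lines of NumPy. The sketch below is illustrative only: the data are random placeholders shaped like the Longley problem (16 cases, 6 predictors plus a constant), not the LONGLEY file itself.

import numpy as np

def bootstrap_coefficients(X, y, n_samples=2500, seed=0):
    # Refit ordinary least squares on case samples drawn with replacement
    rng = np.random.default_rng(seed)
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])   # add the constant term
    coefs = []
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=n)    # sample case indices with replacement
        b, *_ = np.linalg.lstsq(Xc[idx], y[idx], rcond=None)
        coefs.append(b)
    return np.array(coefs)

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 6))   # placeholder predictors
y = rng.normal(size=16)        # placeholder response

coefs = bootstrap_coefficients(X, y)
print(coefs.mean(axis=0))          # bootstrap means of the coefficients
print(coefs.std(axis=0, ddof=1))   # bootstrap standard errors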
24
Chapter 2
Example 2
Spearman Rank Correlation
This example involves law school data from Efron and Tibshirani (1993). They use
these data to illustrate the usefulness of the bootstrap for calculating standard errors on
the Pearson correlation. Here are similar calculations for a 95% confidence interval
on the Spearman correlation.
The bootstrap estimates are saved into a temporary file. The file format is
CORRELATION, meaning that 1000 correlation matrices will be saved, stacked on top
of each other in the file. Consequently, we need BASIC to sift through and delete every
odd line (the diagonal of the matrix). We also have to remember to change the file type
to RECTANGULAR so that we can sort and do other things later. Another approach
would have been to use the rectangular form of the correlation output:
SPEARMAN LSAT*GPA
Next, we reuse the new file and sort the correlations. Finally, we print the nearest values
to the percentiles. Following is the input:
CORR
GRAPH NONE
USE LAW
RSEED=54321
SAVE TEMP
SPEARMAN LSAT GPA / SAMPLE=BOOT(1000,15)
BASIC
USE TEMP
TYPE=RECTANGULAR
IF CASE<>2*INT(CASE/2) THEN DELETE
SAVE BLAW
RUN
USE BLAW
SORT LSAT
IF CASE=975 THEN PRINT "95% CI Upper:",LSAT
IF CASE=25 THEN PRINT "95% CI Lower:",LSAT
OUTPUT TEXT2
RUN
OUTPUT *
DENSITY LSAT
Following is the output, our asymmetric confidence interval:
95% CI Lower: 0.476
95% CI Upper: 0.953
SYSTAT file created.
1000 cases and 2 variables processed.
BASIC statements cleared.
The histogram of the entire file shows the overall shape of the distribution. Notice its
asymmetry.
[Figure: histogram of the 1,000 bootstrapped Spearman correlations (LSAT), ranging from roughly 0.0 to 1.2; vertical axes show Count and Proportion per Bar]
Example 3
Confidence Interval on a Median
We will use the STATS module to compute a 95% confidence interval on the median
(Efron, 1979). The input is:
STATS
GRAPH NONE
USE OURWORLD
SAVE TEMP
STATS LIFE_EXP / MEDIAN,SAMPLE=BOOT(1000,57)
BASIC
USE TEMP
SAVE TEMP2
IF STATISTC$<>"Median" THEN DELETE
RUN
USE TEMP2
SORT LIFE_EXP
IF CASE=975 THEN PRINT "95% CI Upper:",LIFE_EXP
IF CASE=25 THEN PRINT "95% CI Lower:",LIFE_EXP
OUTPUT TEXT3
RUN
OUTPUT *
DENSITY LIFE_EXP
Following is the output:
95% CI Lower: 63.000
95% CI Upper: 71.000
SYSTAT file created.
1000 cases and 2 variables processed.
BASIC statements cleared.
Following is the histogram of the bootstrap sample medians:
[Figure: histogram of the 1,000 bootstrapped medians of LIFE_EXP, ranging from roughly 50 to 80; vertical axes show Count and Proportion per Bar]
Keep in mind that we are using the naive bootstrap method here, trusting the
unmodified distribution of the bootstrap sample to set percentiles. Looking at the
bootstrap histogram, we can see that the distribution is skewed and irregular. There are
improvements that can be made in these estimates. Also, we have to be careful about
how we interpret a confidence interval on a median.
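The naive percentile interval used in this example is easy to express outside SYSTAT. The Python sketch below bootstraps a median and reads off the 25th and 975th of 1,000 sorted estimates; the data are a made-up stand-in for LIFE_EXP, so treat it as an illustration of the method only.
import numpy as np

rng = np.random.default_rng(54321)
life_exp = rng.gamma(shape=9.0, scale=7.5, size=57)   # made-up stand-in for LIFE_EXP

B = 1000
medians = np.sort([np.median(rng.choice(life_exp, size=life_exp.size, replace=True))
                   for _ in range(B)])

# Naive 95% percentile interval: the 25th and 975th of the 1,000 ordered medians
print("95% CI Lower:", round(medians[24], 3))
print("95% CI Upper:", round(medians[974], 3))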
Example 4
Canonical Correlations: Using Text Output
Most statistics can be bootstrapped by saving into SYSTAT files, as shown in the
examples. Sometimes you may want to search through bootstrap output for a single
number and compute standard errors or graphs for that statistic. The following example
uses SETCOR to compute the distribution of the two canonical correlations relating the
species to measurements in the Fisher Iris data. The same correlations are computed in
the DISCRIM procedure. Following is the input:
SETCOR
USE IRIS
MODEL SPECIES=SEPALLEN..PETALWID
CATEGORY SPECIES
OUTPUT TEMP
ESTIMATE / SAMPLE=BOOT(500,150)
OUTPUT *
BASIC
GET TEMP
INPUT A$,B$
LET R1=.
LET R2=.
LET FOUND=.
IF A$="Canonical" AND B$="correlations" ,
THEN LET FOUND=CASE
IF LAG(FOUND,2)<>. THEN FOR
LET R1=VAL(A$)
LET R2=VAL(B$)
NEXT
IF R1=. AND R2=. THEN DELETE
SAVE CC
RUN
EXIT
USE CC
DENSITY R1 R2 / DIT
Notice how the BASIC program searches through the output file TEMP.DAT for the
words "Canonical correlations" at the beginning of a line. Two lines later, the actual
numbers are in the output, so we use the LAG function to check when we are at that
point after having located the string. Then we convert the printed values back to
numbers with the VAL() function. If you are concerned with precision, use a larger
format for the output. Finally, we delete unwanted rows and save the results into the
file CC. From that file, we plot the two canonical correlations. For fun, we do a dot
histogram (dit) plot.
Following is the graph:
Notice the stripes in the plot on the left. These reveal the three-digit rounding we
incurred by using the standard FORMAT=3.
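If you prefer to post-process the saved text output outside SYSTAT, the same search can be written as a short Python sketch: scan the file for lines that begin with the target words and read the numbers two lines later. The file name TEMP.DAT follows the example above; everything else here is illustrative.
# Collect the two canonical correlations that appear two lines after each
# "Canonical correlations" marker in the saved text output.
r1_values, r2_values = [], []
with open("TEMP.DAT") as f:
    lines = f.readlines()
for i, line in enumerate(lines):
    if line.lstrip().startswith("Canonical correlations") and i + 2 < len(lines):
        fields = lines[i + 2].split()
        if len(fields) >= 2:
            r1_values.append(float(fields[0]))
            r2_values.append(float(fields[1]))
print(len(r1_values), "bootstrap replications found")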
Computation
Computations are done by the respective statistical modules. Sampling is done on the
data.
Algorithms
Bootstrapping and other sampling are implemented via a one-pass algorithm that does
not use extra storage for the data. Samples are generated using the SYSTAT uniform
random number generator. It is always a good idea to reset the seed when running a
problem so that you can be certain where the random number generator started if it
becomes necessary to replicate your results.
Missing Data
Cases with missing data are handled by the specific module.
References
Beaton, A. E., Rubin, D. B., and Barone, J. L. (1976). The acceptability of regression
solutions: Another look at computational accuracy. Journal of the American Statistical
Association, 71, 158–168.
Block, J. (1960). On the number of significant findings to be expected by chance.
Psychometrika, 25, 369–380.
DiCiccio, T. J. and Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11,
189–228.
Dwass, M. (1957). Modified randomization sets for nonparametric hypotheses. Annals of
Mathematical Statistics, 29, 181–187.
Edgington, E. S. (1980). Randomization tests. New York: Marcel Dekker.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of
Statistics, 7, 1–26.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Vol. 38 of
CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia, Penn.:
SIAM.
Efron, B. and LePage, R. (1992). Introduction to bootstrap. In R. LePage and L. Billard
(eds.), Exploring the Limits of Bootstrap. New York: John Wiley & Sons, Inc.
Efron, B. and Tibshirani, R. J. (1993). An introduction to the bootstrap. New York:
Chapman & Hall.
Fisher, R. A. (1935). The design of experiments. London: Oliver & Boyd.
Longley, J. W. (1967). An appraisal of least squares for the electronic computer from the
point of view of the user. Journal of the American Statistical Association, 62, 819–841.
Noreen, E. W. (1989). Computer intensive methods for testing hypotheses: An introduction.
New York: John Wiley & Sons, Inc.
Tukey, J. W. (1958). Bias and confidence in not quite large samples. Annals of
Mathematical Statistics, 29, 614.
Chapter 3
Classification and Regression Trees
Leland Wilkinson
The TREES module computes classification and regression trees. Classification trees
include those models in which the dependent variable (the predicted variable) is
categorical. Regression trees include those in which it is continuous. Within these
types of trees, the TREES module can use categorical or continuous predictors,
depending on whether a CATEGORY statement includes some or all of the predictors.
For any of the models, a variety of loss functions is available. Each loss function
is expressed in terms of a goodness-of-fit statistic: the proportion of reduction in
error (PRE). For regression trees, this statistic is equivalent to the multiple R². Other
loss functions include the Gini index, twoing (Breiman et al., 1984), and the phi
coefficient.
TREES produces graphical trees called mobiles (Wilkinson, 1995). At the end of
each branch is a density display (box plot, dot plot, histogram, etc.) showing the
distribution of observations at that point. The branches balance (like a Calder mobile)
at each node so that the branch is level, given the number of observations at each end.
The physical analogy is most obvious for dot plots, in which the stacks of dots (one
for each observation) balance like marbles in bins.
TREES can also produce a SYSTAT BASIC program to code new observations
and predict the dependent variable. This program can be saved to a file and run from
the command window or submitted as a program file.
Statistical Background
Trees are directed graphs beginning with one node and branching to many. They are
fundamental to computer science (data structures), biology (classification),
psychology (decision theory), and many other fields. Classification and regression
trees are used for prediction. In the last two decades, they have become popular as
alternatives to regression, discriminant analysis, and other procedures based on
algebraic models. Tree-fitting methods have become so popular that several
commercial programs now compete for the attention of market researchers and others
looking for software.
Different commercial programs produce different results with the same data,
however. Worse, some programs provide no documentation or supporting materials to
explain their algorithms. The result is a marketplace of competing claims, jargon, and
misrepresentation. Reviews of these packages (for example, Levine, 1991; Simon,
1991) use words like "sorcerer," "magic formula," and "wizardry" to describe the
algorithms and express frustration at vendors' scant documentation. Some vendors, in
turn, have represented tree programs as state-of-the-art artificial intelligence
procedures capable of discovering hidden relationships and structures in databases.
Despite the marketing hyperbole, most of the now-popular tree-fitting algorithms
have been around for decades. The modern commercial packages are mainly
microcomputer ports (with attractive interfaces) of the mainframe programs that
originally implemented these algorithms. Warnings of abuse of these techniques are
not new either (for example, Einhorn, 1972; Bishop, Fienberg, and Holland, 1975).
Originally proposed as automatic procedures for detecting interactions among
variables, tree-fitting methods are actually closely related to classical cluster analysis
(Hartigan, 1975).
This introduction will attempt to sort out some of the differences between
algorithms and illustrate their use on real data. In addition, tree analyses will be
compared to discriminant analysis and regression.
The Basic Tree Model
The figure below shows a tree for predicting decisions by a medical school admissions
committee (Milstein et al., 1975). It was based on data for a sample of 727 applicants.
We selected a tree procedure for this analysis because it was easy to present the results
to the Yale Medical School admissions committee and because the tree model could
serve as a basis for structuring their discussions about admissions policy.
Notice that the values of the predicted variable (the committee's decision to reject
or interview) are at the bottom of the tree and the predictors (Medical College
Admissions Test and college grade point average) come into the system at each node
of the tree.
[Figure: tree for predicting admissions committee decisions (n = 727); the first split is on GRADE POINT AVERAGE at 3.47, with further splits on MCAT VERBAL (cutpoints 555 and 655) and MCAT QUANTITATIVE (cutpoint 535); terminal nodes are labeled REJECT or INTERVIEW, with the number of misclassified cases in parentheses]
The top node contains the entire sample. Each remaining node contains a subset of
the sample in the node directly above it. Furthermore, each node contains the sum of
the samples in the nodes connected to and directly below it. The tree thus splits
samples.
Each node can be thought of as a cluster of objects, or cases, that is to be split by
further branches in the tree. The numbers in parentheses below the terminal nodes
show how many cases are incorrectly classified by the tree. A similar tree data structure
is used for representing the results of single and complete linkage and other forms of
hierarchical cluster analysis (Hartigan, 1975). Tree prediction models add two
ingredients: the predictor and predicted variables labeling the nodes and branches.
The tree is binary because each node is split into only two subsamples. Classification
or regression trees do not have to be binary, but most are. Despite the marketing claims
of some vendors, nonbinary, or multibranch, trees are not superior to binary trees. Each
is a permutation of the other, as shown in the figure below.
[Figure: a three-way split of categories 1, 2, and 3 (left) and an equivalent binary tree that first splits off category 1 and then separates 2 from 3 (right)]
The tree on the left (ternary) is not more parsimonious than that on the right (binary).
Both trees have the same number of parameters, or split points, and any statistics
associated with the tree on the left can be converted trivially to fit the one on the right.
A computer program for scoring either tree (IF ... THEN ... ELSE) would look identical.
For display purposes, it is often convenient to collapse binary trees into multibranch
trees, but this is not necessary.
Some programs that do multibranch splits do not allow further splitting on a predictor
once it has been used. This has an appealing simplicity. However, it can lead to
unparsimonious trees. It is unnecessary to make this restriction before fitting a tree.
The figure below shows an example of this problem. The upper right tree classifies
objects on an attribute by splitting once on shape, once on fill, and again on shape. This
allows the algorithm to separate the objects into only four terminal nodes having
common values. The upper left tree splits on shape and then only on fill. By not
allowing any other splits on shape, the tree requires five terminal nodes to classify
correctly. This problem cannot be solved by splitting first on fill, as the lower left tree
shows. In general, restricting splits to only one branch for each predictor results in
more terminal nodes.
[Figure: trees classifying the same objects by splitting on shape and fill; allowing a second split on shape yields four terminal nodes, while restricting each predictor to a single split requires five]
Categorical or Quantitative Predictors
The predictor variables in the figure on p. 33 are quantitative, so splits are created by
determining cut points on a scale. If predictor variables are categorical, as in the figure
above, splits are made between categorical values. It is not necessary to categorize
predictors before computing trees. This is as dubious a practice as recoding data well-
suited for regression into categories in order to use chi-square tests. Those who
recommend this practice are turning silk purses into sows' ears. In fact, if variables are
categorized before doing tree computations, then poorer fits are likely to result.
Algorithms are available for mixed quantitative and categorical predictors, analogous
to analysis of covariance.
Regression Trees
Morgan and Sonquist (1963) proposed a simple method for fitting trees to predict a
quantitative variable. They called the method Automatic Interaction Detection
(AID). The algorithm performs stepwise splitting. It begins with a single cluster of
cases and searches a candidate set of predictor variables for a way to split the cluster
into two clusters. Each predictor is tested for splitting as follows: sort all the n cases on
the predictor and examine all n - 1 ways to split the cluster in two. For each possible
split, compute the within-cluster sum of squares about the mean of the cluster on the
dependent variable. Choose the best of the splits to represent the predictor's
contribution. Now do this for every other predictor. For the actual split, choose the
predictor and its cut point that yields the smallest overall within-cluster sum of squares.
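As a concrete sketch of this search, the Python function below sorts the cases on one quantitative predictor, tries all n - 1 cut points, and returns the cut that minimizes the total within-cluster sum of squares. It illustrates the idea only; it is not SYSTAT's implementation, and the data are invented.
import numpy as np

def best_split(x, y):
    """Return (cut_value, within_SS) for the best binary split on predictor x."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(xs)):              # n - 1 candidate splits
        left, right = ys[:i], ys[i:]
        ss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if ss < best[1]:
            best = ((xs[i - 1] + xs[i]) / 2.0, ss)   # cut halfway between adjacent values
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])
print(best_split(x, y))    # best cut is 6.5, separating the low and high groups of y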
Categorical predictors require a different approach. Since categories are unordered,
all possible splits between categories must be considered. For deciding on one split of
k categories into two groups, this means that 2^(k-1) - 1 possible splits must be considered.
Once a split is found, its suitability is measured on the same within-cluster sum of
squares as for a quantitative predictor.
Morgan and Sonquist called their algorithm AID because it naturally incorporates
interaction among predictors. Interaction is not correlation. It has to do, instead, with
conditional discrepancies. In the analysis of variance, interaction means that a trend
within one level of a variable is not parallel to a trend within another level of the same
variable. In the ANOVA model, interaction is represented by cross-products between
predictors. In the tree model, it is represented by branches from the same node that
have different splitting predictors further down the tree.
The figure below shows a tree without interactions on the left and with interactions
on the right. Because interaction trees are a natural by-product of the AID splitting
algorithm, Morgan and Sonquist called the procedure automatic. In fact, AID trees
without interactions are quite rare for real data, so the procedure is indeed automatic.
To search for interactions using stepwise regression or ANOVA linear modeling, we
would have to generate 2^p interactions among p predictors and compute partial
correlations for every one of them in order to decide which ones to include in our
formal model.
[Figure: two trees for the same data, one without interactions (every branch at a given level splits on the same predictor) and one with interactions (branches from the same node split on different predictors further down)]
Classification Trees
Regression trees parallel regression/ANOVA modeling, in which the dependent
variable is quantitative. Classification trees parallel discriminant analysis and algebraic
classification methods. Kass (1980) proposed a modification to AID called CHAID for
categorized dependent and independent variables. His algorithm incorporated a
sequential merge-and-split procedure based on a chi-square test statistic. Kass was
concerned about computation time (although this has since proved an unnecessary
worry), so he decided to settle for a suboptimal split on each predictor instead of
searching for all possible combinations of the categories. Kass's algorithm is like
sequential crosstabulation. For each predictor:
n Crosstabulate the m categories of the predictor with the k categories of the
dependent variable.
n Find the pair of categories of the predictor whose 2 × k subtable is least
significantly different on a chi-square test and merge these two categories.
n If the chi-square test statistic is not significant according to a preset critical value,
repeat this merging process for the selected predictor until no nonsignificant chi-
square is found for a subtable.
n Choose the predictor variable whose chi-square is the largest and split the sample
into l subsets, where l is the number of categories resulting from the merging
process on that predictor.
n Continue splitting, as with AID, until no significant chi-squares result.
The CHAID algorithm saves computer time, but it is not guaranteed to find the splits
that predict best at a given step. Only by searching all possible category subsets can we
do that. CHAID is also limited to categorical predictors, so it cannot be used for
quantitative or mixed categorical-quantitative models, as in the figure on p. 33.
Nevertheless, it is an effective way to search heuristically through rather large tables
quickly.
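The merging step can be sketched as follows in Python: for every pair of predictor categories, compute the Pearson chi-square statistic on the 2 × k subtable and merge the pair with the smallest statistic. Real CHAID repeats this until no nonsignificant subtable remains and applies an adjusted significance test, so this is only the core of one merge; the frequency table is hypothetical.
import numpy as np

def chi_square(table):
    """Pearson chi-square statistic for a two-way frequency table."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

def merge_most_similar(counts, labels):
    """Merge the pair of predictor categories whose 2 x k subtable is least discrepant."""
    best_pair, best_stat = None, np.inf
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            stat = chi_square(counts[[i, j]])
            if stat < best_stat:
                best_pair, best_stat = (i, j), stat
    i, j = best_pair
    merged = np.vstack([counts[i] + counts[j]] +
                       [counts[r] for r in range(len(labels)) if r not in (i, j)])
    merged_labels = [labels[i] + "+" + labels[j]] + \
                    [labels[r] for r in range(len(labels)) if r not in (i, j)]
    return merged, merged_labels, best_stat

# Hypothetical m x k table: 4 predictor categories by 3 outcome categories
counts = np.array([[20, 15, 5], [18, 16, 6], [5, 10, 25], [30, 8, 2]])
labels = ["a", "b", "c", "d"]
print(merge_most_similar(counts, labels))   # categories a and b are merged first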
Note: Within the computer science community, there is a categorical splitting literature
that often does not cite the statistical work and is, in turn, not frequently cited by
statisticians (although this has changed in recent years). Quinlan (1986, 1992), the best
known of these researchers, developed a set of algorithms based on information theory.
These methods, called ID3, iteratively build decision trees based on training samples
of attributes.
Stopping Rules, Pruning, and Cross-Validation
AID, CHAID, and other forward-sequential tree-fitting methods share a problem with
other tree-clustering methods: where do we stop? If we keep splitting, a tree will end
up with only one case, or object, at each terminal node. We need a method for
producing a smaller tree other than the exhaustive one. One way is to use stepwise
statistical tests, as in the F-to-enter or alpha-to-enter rule for forward stepwise
regression. We compute a test statistic (chi-square, F, etc.), choose a critical level for
the test (sometimes modifying it with the Bonferroni inequality), and stop splitting any
branch that fails to meet the test (see Wilkinson, 1979, for a review of this procedure
in forward selection regression).
Breiman et al. (1984) showed that this method tends to yield trees with too many
branches and can also fail to pursue branches that can add significantly to the overall
fit. They advocate, instead, pruning the tree. After computing an exhaustive tree, their
program eliminates nodes that do not contribute to the overall prediction. They add
another essential ingredient, however: the cost of complexity. This measure is similar
to other cost statistics, such as Mallows' Cp (Neter, Wasserman, and Kutner, 1985),
which add a penalty for increasing the number of parameters in a model. Breiman's
method is not like backward elimination stepwise regression. It resembles forward
stepwise regression with a cutting back on the final number of steps using a different
criterion than the F-to-enter. This method still cannot do as well as an exhaustive
search, which would be prohibitive for most practical problems.
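A minimal sketch of the cost-of-complexity idea: each candidate subtree is scored by its resubstitution error plus a penalty proportional to its number of terminal nodes, and the subtree with the smallest penalized score is kept for a given penalty. The error values and penalty below are made up for illustration.
def cost_complexity(error, n_terminal, alpha):
    """Penalized cost of a subtree: resubstitution error plus alpha per terminal node."""
    return error + alpha * n_terminal

# Hypothetical candidate subtrees: (within-sample error, number of terminal nodes)
subtrees = {"full": (10.0, 9), "pruned": (12.5, 5), "stump": (25.0, 2)}
alpha = 1.0
best = min(subtrees, key=lambda name: cost_complexity(*subtrees[name], alpha))
print(best)   # "pruned" wins here; a smaller alpha keeps the full tree, a larger one the stump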
Regardless of how a tree is pruned, it is important to cross-validate it. As with
stepwise regression, the prediction error for a tree applied to a new sample can be
considerably higher than for the training sample on which it was constructed.
Whenever possible, data should be reserved for cross-validation.
Loss Functions
Different loss functions are appropriate for different forms of data. TREES offers a
variety of functions that are scaled as proportional reduction in error (PRE) statistics.
This allows you to try different loss functions on a problem and compare their
predictive validity.
For regression trees, the most appropriate loss functions are least squares, trimmed
mean, and least absolute deviations. Least-squares loss yields the classic AID tree. At
each split, cases are classified so that the within-group sum of squares about the mean
of the group is as small as possible. The trimmed mean loss works the same way but
first trims 20% of outlying cases (10% at each extreme) in a splittable subset before
computing the mean and sum of squares. It can be useful when you expect outliers in
subgroups and don't want them to influence the split decisions. LAD loss computes
least absolute deviations about the mean rather than squares. It, too, gives less weight
to extreme cases in each potential group.
For classification trees, use the phi coefficient (the default), Gini index, or twoing.
The phi coefficient is χ²/n for a 2 × k table formed by the split on k categories of the
dependent variable. The Gini index is a variance estimate based on all comparisons of
possible pairs of values in a subgroup. Finally, twoing is a word coined by Breiman et
al. to describe splitting k categories as if it were a two-category splitting problem. For
more information about the effects of Gini and twoing on computations, see Breiman
et al. (1984).
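To make two of these loss functions concrete, the Python sketch below computes the least-squares (AID) loss for a regression split and the Gini index for a classification split, each expressed as a proportional reduction in error relative to the unsplit group. The scaling details differ from SYSTAT's output, so read this as an illustration of the definitions only.
import numpy as np
from collections import Counter

def within_ss(y):
    y = np.asarray(y, dtype=float)
    return ((y - y.mean()) ** 2).sum()

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def pre(parent_loss, child_losses):
    """Proportional reduction in error: 1 - (summed child loss / parent loss)."""
    return 1.0 - sum(child_losses) / parent_loss

# Regression split scored by least-squares loss
y = np.array([3.0, 4.0, 5.0, 20.0, 21.0, 22.0])
print(pre(within_ss(y), [within_ss(y[:3]), within_ss(y[3:])]))   # close to 1: a good split

# Classification split scored by the Gini index, weighted by group size
labels = ["a", "a", "a", "b", "b", "b"]
parent = gini(labels) * len(labels)
children = [gini(labels[:3]) * 3, gini(labels[3:]) * 3]
print(pre(parent, children))                                     # 1.0: the split is pure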
Geometry
Most discussions of trees versus other classifiers compare tree graphs and algebraic
equations. There is another graphic view of what a tree classifier does, however. If we
look at the cases embedded in the space of the predictor variables, we can ask how a
linear discriminant analysis partitions the cases and how a tree classifier partitions them.
The figure below shows how cases are split by a linear discriminant analysis. There
are three subgroups of cases in this example. The cutting planes are positioned
approximately halfway between each pair of group centroids. Their orientation is
determined by the discriminant analysis. With three predictors and four groups, there
are six cutting planes, although only four planes show in the figure. The fourth group
is assumed to be under the bottom plane in the figure. In general, if there are g groups,
the linear discriminant model cuts them with g(g - 1)/2 planes.
The figure below shows how a tree-fitting algorithm cuts the same data. Only the
nearest subgroup (dark spots) shows; the other three groups are hidden behind the rear
and bottom cutting planes. Notice that the cutting planes are parallel to the axes. While
this would seem to restrict the discrimination compared to the more flexible angles
allowed the discriminant planes, the tree model allows interactions between variables,
which do not appear in the ordinary linear discriminant model. Notice, for example,
that one plane splits on the X variable, but the second plane that splits on the Y variable
cuts only the values to the left of the X partition. The tree model can continue to cut any
of these subregions separately, unlike the discriminant model, which can cut only
globally and with g(g - 1)/2 planes. This is a mixed blessing, however, since tree
methods, as we have seen, can over-fit the data. It is critical to test them on new
samples.
Tree models are not usually related by authors to dimensional plots in this way, but
it is helpful to see that they have a geometric interpretation. Alternatively, we can
construct algebraic expressions for trees. They would require dummy variables for any
categorical predictors and interaction (or product) terms for every split whose
descendants (or lower nodes) did not involve the same variables on both sides.
Classification and Regression Trees in SYSTAT
Trees Main Dialog Box
To open the Trees dialog box, from the menus choose:
Statistics
Trees
Model selection and estimation are available in the main Trees dialog box:
Dependent. The variable you want to examine. The dependent variable should be a
continuous or categorical numeric variable (for example, INCOME).
Independent(s). Select one or more continuous or categorical variables (grouping
variables).
Expand Model. Adds all possible sums and differences of the predictors to the model.
Loss. Select a loss function from the drop-down list.
n Least squares. The least-squares loss (AID) minimizes the sum of the squared
deviations.
n Trimmed mean. The trimmed mean loss (TRIM) trims the extreme observations
(20%) prior to computing the mean.
n Least absolute deviations. The least absolute deviations loss (LAD).
n Phi coefficient. The phi coefficient loss computes the correlation between two
dichotomous variables.
n Gini index. The Gini index loss measures inequality or dispersion.
n Twoing. The twoing loss function.
Display nodes as. Select the type of density display. The following types are available:
n Box plot. Plot that uses boxes to show a distribution shape, central tendency, and
variability.
n Dit plot. Dot histogram. Produces a density display that looks similar to a histogram.
Unlike histograms, dot histograms represent every observation with a unique
symbol, so they are especially suited for small- to moderate-size samples of
continuous data.
n Dot plot. Plot that displays dots at the exact locations of data values.
n Jitter plot. Density plot that calculates the exact locations of the data values, but
jitters points randomly on a short vertical axis to keep points from colliding.
n Stripe. Places vertical lines at the location of data values along a horizontal data
scale and looks like supermarket bar codes.
n Text. Displays text output in the tree diagram including the mode, sample size, and
impurity value.
Stopping Criteria
The Stopping Criteria dialog box contains the parameters for controlling stopping.
Specify the criteria for splitting to stop.
Number of splits. Maximum number of splits.
Minimum proportion. Minimum proportion reduction in error for the tree allowed at any
split.
Split minimum. Minimum split value allowed at any node.
Minimum objects at end of trees. Minimum count allowed at any node.
Using Commands
After selecting a file with USE filename, continue with:
TREES
MODEL yvar = xvarlist / EXPAND
ESTIMATE / PMIN=d, SMIN=d, NMIN=n, NSPLIT=n,
LOSS=LSQ
TRIM
LAD
PHI
GINI
TWOING,
DENSITY=STRIPE
JITTER
DOT
DIT
BOX
Usage Considerations
Types of data. TREES uses rectangular data only.
Print options. The default output includes the splitting history and summary statistics.
PRINT=LONG adds a BASIC program for classifying new observations. You can cut
and paste this BASIC program into a text window and run it in the BASIC module to
classify new data on the same variables for cross-validation and prediction.
Quick Graphs. TREES produces a Quick Graph for the fitted tree. The nodes may
contain text describing split parameters or they may contain density graphs of the data
being split. A dashed line indicates that the split is not significant.
Saving files. TREES does not save files. Use the BASIC program under PRINT=LONG
to classify your data, compute residuals, etc., on old or new data.
BY groups. TREES analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ = <variable> increases the number of cases by the FREQ
variable.
Case weights. WEIGHT is not available in TREES.
Examples
The following examples illustrate the features of the TREES module. The first example
shows a classification tree for the Fisher-Anderson iris data set. The second example is
a regression tree on an example taken from Breiman et al. (1984), and the third is a
regression tree predicting the danger of a mammal being eaten by predators.
Example 1
Classification Tree
This example shows a classification tree analysis of the Fisher-Anderson iris data set
featured in Discriminant Analysis. We use the Gini loss function and display a
graphical tree, or mobile, with dot histograms, or dit plots. The input is:
USE IRIS
LAB SPECIES/1=SETOSA,2=VERSICOLOR,3=VIRGINICA
TREES
MODEL SPECIES=SEPALLEN,SEPALWID,PETALLEN,PETALWID
ESTIMATE/LOSS=GINI,DENSITY=DIT
Following is the output:
Variables in the SYSTAT Rectangular file are:
SPECIES SEPALLEN SEPALWID PETALLEN PETALWID
Split Variable PRE Improvement
1 PETALLEN 0.500 0.500
2 PETALWID 0.890 0.390
Fitting Method: Gini Index
Predicted variable: SPECIES
Minimum split index value: 0.050
Minimum improvement in PRE: 0.050
Maximum number of nodes allowed: 22
Minimum count allowed in each node: 5
The final tree contains 3 terminal nodes
Proportional reduction in error: 0.890
Node from Count Mode Impurity Split Var Cut Value Fit
1 0 150
2 1 50 SETOSA 0.0
3 1 100
4 3 54 VERSICOLOR 0.084
5 3 46 VIRGINICA 0.021
The PRE for the whole tree is 0.89 (similar to R² for a regression model), which is not
bad. Before exulting, however, we should keep in mind that while Fisher chose the iris
data set to demonstrate his discriminant model on real data, it is barely worthy of the
effort. We can classify the data almost perfectly by looking at a scatterplot of petal
length against petal width.
The unique SYSTAT display of the tree is called a mobile (Wilkinson, 1995). The
dit plots are ideal for illustrating how it works. Imagine each case is a marble in a box
at each node. The mobile simply balances all of the boxes. The reason for doing this is
that we can easily see splits that cut only a few cases out of a group. These nodes will
hang out conspicuously. It is fairly evident in the first split, for example, which cuts the
population into half as many cases on the right (petal length less than 3) as on the left.
This display has a second important characteristic that is different from other tree
displays. The mobile coordinates the polarity of the terminal nodes (red on color
displays) rather than the direction of the splits. This design has three consequences: we
can evaluate the distributions of the subgroups on a common scale, we can see the
direction of the splits on each splitting variable, and we can look at the distributions on
the terminal nodes from left to right to see how the whole sample is split on the
dependent variable.
The first consequence means that every box containing data is a miniature density
display of the subgroup's values on a common scale (same limits and same direction).
We don't need to drill down on the data in a subgroup to see its distribution. It is
immediately apparent in the tree. If you prefer box plots or other density displays,
simply use
DENSITY = BOX
or another density as an ESTIMATE option. Dit plots are most suitable for classification
trees, however; because they spike at the category values, they look like bar charts for
categorical data. For continuous data, dit plots look like histograms. Although they are
my favorite density display for this purpose, they can be time consuming to draw on
large samples, so box plots are the default graphical display. If you omit DENSITY
altogether, you will get a text summary inside each box.
The second consequence of ordering the splits according to the polarity of the
dependent (rather than the independent) variable is that the direction of the split can be
recognized immediately by looking at which side (left or right) the split is displayed
on. Notice that PETALLEN < 3.000 occurs on the left side of the first split. This means
that the relation between petal length and species (coded 1..3) is positive. The same is
true for petal width within the second split group because the split banner occurs on the
left. Banners on the right side of a split indicate a negative relationship between the
dependent variable and the splitting variable within the group being split, as in the next
example.
The third consequence of ordering the splits is that we can look at the terminal nodes
from left to right and see the consequences of the split in order. In the present example,
notice that the three species are ordered from left to right in the same order that they
are coded. You can change this ordering for a categorical variable with the CATEGORY
and ORDER commands. Adding labels, as we did here, makes the output more
interpretable.
Example 2
Regression Tree with Box Plots
This example shows a simple AID model. The data set is Boston housing prices, cited
in Belsley, Kuh, and Welsch (1980) and used in Breiman et al. (1984). We are
predicting median home values (MEDV) from a set of demographic variables. The
input is:
USE BOSTON
TREES
MODEL MEDV=CRIM..LSTAT
ESTIMATE/PMIN=.005,DENSITY=BOX
Following is the output:
Variables in the SYSTAT Rectangular file are:
CRIM ZN INDUS CHAS NOX RM
AGE DIS RAD TAX PTRATIO B
LSTAT MEDV
Split Variable PRE Improvement
1 RM 0.453 0.453
2 RM 0.524 0.072
3 LSTAT 0.696 0.171
4 PTRATIO 0.706 0.010
5 LSTAT 0.723 0.017
6 DIS 0.782 0.059
7 CRIM 0.809 0.027
8 NOX 0.815 0.006
Fitting Method: Least Squares
Predicted variable: MEDV
Minimum split index value: 0.050
Minimum improvement in PRE: 0.005
Maximum number of nodes allowed: 22
Minimum count allowed in each node: 5
The final tree contains 9 terminal nodes
Proportional reduction in error: 0.815
Node from Count Mean STD Split Var Cut Value Fit
1 0 506 22.533 9.197 RM 6.943 0.453
2 1 430 19.934 6.353 LSTAT 14.430 0.422
3 1 76 37.238 8.988 RM 7.454 0.505
4 3 46 32.113 6.497 LSTAT 11.660 0.382
5 3 30 45.097 6.156 PTRATIO 18.000 0.405
6 2 255 23.350 5.110 DIS 1.413 0.380
7 2 175 14.956 4.403 CRIM 7.023 0.337
8 5 25 46.820 3.768
9 5 5 36.480 8.841
10 4 41 33.500 4.594
11 4 5 20.740 9.080
12 6 5 45.580 9.883
13 6 250 22.905 3.866
14 7 101 17.138 3.392 NOX 0.538 0.227
15 7 74 11.978 3.857
16 14 24 20.021 3.067
17 14 77 16.239 2.975
The Quick Graph of the tree more clearly reveals the sample-size feature of the mobile
display. Notice that a number of the splits, because they separate out a few cases only,
are extremely unbalanced. This can be interpreted in two ways, depending on context.
On the one hand, it can mean that outliers are being separated so that subsequent splits
can be more powerful. On the other hand, it can mean that a split is wasted by focusing
on the outliers when further splits don't help to improve the prediction. The former
case appears to apply in our example. The first split separates out a few expensive
housing tracts (the median values have a positively skewed distribution for all tracts),
which makes subsequent splits more effective. The box plots in the terminal nodes are
narrow.
Example 3
Regression Tree with Dit Plots
This example involves predicting the danger of a mammal being eaten by predators
(Allison and Cicchetti, 1976). The predictors are hours of dreaming and nondreaming
sleep, gestational age, body weight, and brain weight. Although the danger index has
only five values, we are treating it as a quantitative variable with meaningful numerical
values. The input is:
USE SLEEP
TREES
MODEL DANGER=BODY_WT,BRAIN_WT,
SLO_SLEEP,DREAM_SLEEP,GESTATE
ESTIMATE / DENSITY=DIT
The resulting output is:
Variables in the SYSTAT Rectangular file are:
SPECIES$ BODY_WT BRAIN_WT SLO_SLEEP DREAM_SLEEP TOTAL_SLEEP
LIFE GESTATE PREDATION EXPOSURE DANGER
18 cases deleted due to missing data.
Split Variable PRE Improvement
1 DREAM_SLEEP 0.404 0.404
2 BRAIN_WT 0.479 0.074
3 SLO_SLEEP 0.547 0.068
Fitting Method: Least Squares
Predicted variable: DANGER
Minimum split index value: 0.050
Minimum improvement in PRE: 0.050
Maximum number of nodes allowed: 22
Minimum count allowed in each node: 5
The final tree contains 4 terminal nodes
Proportional reduction in error: 0.547
Node from Count Mean STD Split Var Cut Value Fit
1 0 44 2.659 1.380 DREAM_SLEEP 1.200 0.404
2 1 14 3.929 1.072 BRAIN_WT 58.000 0.408
3 1 30 2.067 1.081 SLO_SLEEP 12.800 0.164
4 2 6 3.167 1.169
5 2 8 4.500 0.535
6 3 23 2.304 1.105
7 3 7 1.286 0.488
The prediction is fairly good (PRE = 0.547). The Quick Graph of this tree illustrates
another feature of mobiles. The dots in each terminal node are assigned a separate
color. This way, we can follow their path up the tree each time they are merged. If the
prediction is perfect, the top density plot will have colored dots perfectly separated.
The extent to which the colors are mixed in the top plot is a visual indication of the
badness-of-fit of the model. The fairly good separation of colors for the sleep data is
quite clear on the computer screen or with color printing but less evident in a black-
and-white figure.
Computation
Computations are in double precision.
Algorithms
TREES uses algorithms from Breiman et al. (1984) for its splitting computations.
Missing Data
Missing data are eliminated from the calculation of the loss function for each split
separately.
References
Allison, T. and Cicchetti, D. (1976). Sleep in mammals: Ecological and constitutional
correlates. Science, 194, 732–734.
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression diagnostics: Identifying
influential data and sources of collinearity. New York: John Wiley & Sons, Inc.
Bishop, Y. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete multivariate analysis.
Cambridge, Mass.: MIT Press.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. I. (1984). Classification and
regression trees. Belmont, Calif.: Wadsworth.
Einhorn, H. (1972). Alchemy in the behavioral sciences. Public Opinion Quarterly, 3,
367–378.
Hartigan, J. A. (1975). Clustering algorithms. New York: John Wiley & Sons, Inc.
Kass, G. V. (1980). An exploratory technique for investigating large quantities of
categorical data. Applied Statistics, 29, 119–127.
Levine, M. (1991). Statistical analysis for the executive. Byte, 17, 183–184.
Milstein, R. M., Burrow, G. N., Wilkinson, L., and Kessen, W. (1975). Prediction of
screening decisions in a medical school admission process. Journal of Medical
Education, 51, 626–633.
Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a
proposal. Journal of the American Statistical Association, 58, 415–434.
Neter, J., Wasserman, W., and Kutner, M. (1985). Applied linear statistical models, 2nd ed.
Homewood, Ill.: Richard D. Irwin, Inc.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Quinlan, J. R. (1992). C4.5: Programs for machine learning. New York: Morgan
Kaufmann.
Simon, B. (1991). Knowledge seeker: Statistics for decision makers. PC Magazine
(January 29), 50.
Wilkinson, L. (1979). Tests of significance in stepwise regression. Psychological Bulletin,
86, 168–174.
Wilkinson, L. (1995). Mobiles. Department of Statistics, Northwestern University,
Evanston, Ill.
Chapter 4
Cluster Analysis
Leland Wilkinson, Laszlo Engelman, James Corter, and Mark Coward
SYSTAT provides a variety of cluster analysis methods on rectangular or symmetric
data matrices. Cluster analysis is a multivariate procedure for detecting natural
groupings in data. It resembles discriminant analysis in one respect: the researcher
seeks to classify a set of objects into subgroups although neither the number nor
members of the subgroups are known.
Cluster provides three procedures for clustering: Hierarchical Clustering, K-means,
and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical
linkage methods. The K-means Clustering procedure splits a set of objects into a
selected number of groups by maximizing between-cluster variation and minimizing
within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-
Tversky additive tree clustering.
Hierarchical Clustering clusters cases, variables, or both cases and variables
simultaneously; K-means clusters cases only; and Additive Trees clusters a similarity
or dissimilarity matrix. Eight distance metrics are available with Hierarchical
Clustering and K-means, including metrics for quantitative and frequency count data.
Hierarchical Clustering has six methods for linking clusters and displays the results
as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to
cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent
the values.
Statistical Background
Cluster analysis is a multivariate procedure for detecting groupings in data. The objects
in these groups may be:
n Cases (observations or rows of a rectangular data file). For example, if health
indicators (numbers of doctors, nurses, hospital beds, life expectancy, etc.) are
recorded for countries (cases), then developed nations may form a subgroup or
cluster separate from underdeveloped countries.
n Variables (characteristics or columns of the data). For example, if causes of death
(cancer, cardiovascular, lung disease, diabetes, accidents, etc.) are recorded for
each U.S. state (case), the results show that accidents are relatively independent of
the illnesses.
n Cases and variables (individual entries in the data matrix). For example, certain
wines are associated with good years of production. Other wines have other years
that are better.
Types of Clustering
Clusters may be of two sorts: overlapping or exclusive. Overlapping clusters allow the
same object to appear in more than one cluster. Exclusive clusters do not. All of the
methods implemented in SYSTAT are exclusive.
There are three approaches to producing exclusive clusters: hierarchical,
partitioned, and additive trees. Hierarchical clusters consist of clusters that completely
contain other clusters that completely contain other clusters, and so on. Partitioned
clusters contain no other clusters. Additive trees use a graphical representation in
which distances along branches reflect similarities among the objects.
The cluster literature is diverse and contains many descriptive synonyms:
hierarchical clustering (McQuitty, 1960; Johnson, 1967); single linkage clustering
(Sokal and Sneath, 1963), and joining (Hartigan, 1975). Output from hierarchical
methods can be represented as a tree (Hartigan, 1975) or a dendrogram (Sokal and
Sneath, 1963). (The linkage of each object or group of objects is shown as a joining of
branches in a tree. The root of the tree is the linkage of all clusters into one set, and
the ends of the branches lead to each separate object.)
Correlations and Distances
To produce clusters, we must be able to compute some measure of dissimilarity
between objects. Similar objects should appear in the same cluster, and dissimilar
objects, in different clusters. All of the methods available in CORR for producing
matrices of association can be used in cluster analysis, but each has different
implications for the clusters produced. Incidentally, CLUSTER converts correlations to
dissimilarities by negating them.
In general, the correlation measures (Pearson, Mu2, Spearman, Gamma, Tau) are
not influenced by differences in scales between objects. For example, correlations
between states using health statistics will not in general be affected by some states
having larger average numbers or variation in their numbers. Use correlations when
you want to measure the similarity in patterns across profiles regardless of overall
magnitude.
On the other hand, the other measures such as Euclidean and City (city-block
distance) are significantly affected by differences in scale. For health data, two states
will be judged to be different if they have differing overall incidences even when they
follow a common pattern. Generally, you should use the distance measures when
variables are measured on common scales.
Standardizing Data
Before you compute a dissimilarity measure, you may need to standardize your data
across the measured attributes. Standardizing puts measurements on a common scale.
In general, standardizing makes overall level and variation comparable across
measurements. Consider the following data:
If we are clustering the four cases (A through D), variable X4 will determine almost
entirely the dissimilarity between cases, whether we use correlations or distances. If we
are clustering the four variables, whichever correlation measure we use will adjust for
the larger mean and standard deviation on X4. Thus, we should probably standardize
OBJECT X1 X2 X3 X4
A 10 2 11 900
B 11 3 15 895
C 13 4 12 760
D 14 1 13 874
within columns if we are clustering rows and use a correlation measure if we are
clustering columns.
In the example below, case A will have a disproportionate influence if we are
clustering columns.
We should probably standardize within rows before clustering columns. This requires
transposing the data before standardization. If we are clustering rows, on the other
hand, we should use a correlation measure to adjust for the larger mean and standard
deviation of case A.
These are not immutable laws. The suggestions are only to make you realize that
scales can influence distance and correlation measures.
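In terms of computation, the two choices amount to z-scoring within columns before clustering rows, or z-scoring within rows (equivalently, transposing first) before clustering columns. The Python sketch below applies both to the first small table in this section; only numpy is assumed.
import numpy as np

# The first small table in this section: objects A-D measured on X1-X4
data = np.array([[10.0, 2.0, 11.0, 900.0],
                 [11.0, 3.0, 15.0, 895.0],
                 [13.0, 4.0, 12.0, 760.0],
                 [14.0, 1.0, 13.0, 874.0]])

# Standardize within columns (variables) before clustering rows (cases)
by_column = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Standardize within rows (cases) before clustering columns (variables);
# the same as transposing first and then standardizing within columns
by_row = ((data.T - data.mean(axis=1)) / data.std(axis=1, ddof=1)).T

print(by_column.std(axis=0, ddof=1))   # each column now has unit standard deviation
print(by_row.std(axis=1, ddof=1))      # each row now has unit standard deviation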
Hierarchical Clustering
To understand hierarchical clustering, it's best to look at an example. The following
data reflect various attributes of selected performance cars.
OBJECT X1 X2 X3 X4
A 410 311 613 514
B 1 3 2 4
C 10 11 12 10
D 12 13 13 11
ACCEL BRAKE SLALOM MPG SPEED NAME$
5.0 245 61.3 17.0 153 Porsche 911T
5.3 242 61.9 12.0 181 Testarossa
5.8 243 62.6 19.0 154 Corvette
7.0 267 57.8 14.5 145 Mercedes 560
7.6 271 59.8 21.0 124 Saab 9000
7.9 259 61.7 19.0 130 Toyota Supra
8.5 263 59.9 17.5 131 BMW 635
8.7 287 64.2 35.0 115 Civic CRX
9.3 258 64.1 24.5 129 Acura Legend
10.8 287 60.8 25.0 100 VW Fox GL
13.0 253 62.3 27.0 95 Chevy Nova
Cluster Displays
SYSTAT displays the output of hierarchical clustering in several ways. For joining
rows or columns, SYSTAT prints a tree. For matrix joining, it prints a shaded matrix.
Trees. A tree is printed with a unique ordering in which every branch is lined up such
that the most similar objects are closest to each other. If a perfect seriation (one-
dimensional ordering) exists in the data, the tree reproduces it. The algorithm for
ordering the tree is given in Gruvaeus and Wainer (1972). This ordering may differ
from that of trees printed by other clustering programs if they do not use a seriation
algorithm to determine how to order branches. The advantage of using seriation is most
apparent for single linkage clusterings.
If you join rows, the end branches of the tree are labeled with case numbers or
labels. If you join columns, the end branches of the tree are labeled with variable
names.
Direct display of a matrix. As an alternative to trees, SYSTAT can produce a shaded
display of the original data matrix in which rows and columns are permuted according
to an algorithm in Gruvaeus and Wainer (1972). Different characters represent the
magnitude of each number in the matrix (Ling, 1973). A legend showing the range of
data values that these characters represent appears with the display.
Cutpoints between these values and their associated characters are selected to
heighten contrast in the display. The method for increasing contrast is derived from
techniques used in computer pattern recognition, in which gray-scale histograms for
visual displays are modified to heighten contrast and enhance pattern detection. To
find these cutpoints, we sort the data and look for the largest gaps between adjacent
values. Tukey's gapping method (Wainer and Schacht, 1978) is used to determine how
many gaps (and associated characters) should be chosen to heighten contrast for a
given set of data. This procedure, time consuming for large matrices, is described in
detail in Wilkinson (1978).
If you have a course to grade and are looking for a way to find rational cutpoints in
the grade distribution, you might want to use this display to choose the cutpoints.
Cluster the matrix of numeric grades (n students by 1 grade) and let SYSTAT
choose the cutpoints. Only cutpoints asymptotically significant at the 0.05 level are
chosen. If no cutpoints are chosen in the display, give everyone an A, flunk them all,
or hand out numeric grades (unless you teach at Brown University or Hampshire
College).
Clustering Rows
First, let's look at possible clusters of the cars in the example. Since the variables are
on such different scales, we will standardize them before doing the clustering. This will
give acceleration comparable influence to braking, for example. Then we select
Pearson correlations as the basis for dissimilarity between cars. The result is:
[Figure: cluster tree of the 11 cars, ordered Porsche 911T, Testarossa, Corvette, Mercedes 560, Saab 9000, Toyota Supra, BMW 635, Civic CRX, Acura Legend, VW Fox GL, Chevy Nova; the horizontal axis shows joining distances from 0.0 to 0.6]
If you look at the correlation matrix for the cars, you will see how these clusters hang
together. Cars within the same cluster (for example, Corvette, Testarossa, Porsche)
generally correlate highly.
Porsche Testa Corv Merc Saab
Porsche 1.00
Testa 0.94 1.00
Corv 0.94 0.87 1.00
Merc 0.09 0.21 0.24 1.00
Saab 0.51 0.52 0.76 0.66 1.00
Toyota 0.24 0.43 0.40 0.38 0.68
BMW 0.32 0.10 0.56 0.85 0.63
Civic 0.50 0.73 0.39 0.52 0.26
Acura 0.05 0.10 0.30 0.98 0.77
VW 0.96 0.93 0.98 0.08 0.70
Chevy 0.73 0.70 0.49 0.53 0.13
Toyota BMW Civic Acura VW
Toyota 1.00
BMW 0.25 1.00
Civic 0.30 0.50 1.00
Acura 0.53 0.79 0.35 1.00
VW 0.35 0.39 0.55 0.16 1.00
Chevy 0.03 0.06 0.32 0.54 0.53
Clustering Columns
We can cluster the performance attributes of the cars more easily. Here, we do not need
to standardize within cars (by rows) because all of the values are comparable between
cars. Again, to give each variable comparable influence, we will use Pearson
correlations as the basis for the dissimilarities. The result based on the data
standardized by variable (column) is:
[Figure: cluster tree of the five variables, ordered ACCEL, BRAKE, SLALOM, MPG, SPEED; the horizontal axis shows joining distances from 0.0 to 1.2]
Clustering Rows and Columns
To cluster the rows and columns jointly, we should first standardize the variables to
give each of them comparable influence on the clustering of cars. Once we have
standardized the variables, we can use Euclidean distances because the scales are
comparable. We used single linkage to produce the following result:
[Figure: shaded display of the standardized data matrix, with cars (rows) and performance variables (columns) permuted so that similar rows and columns are adjacent; the fastest cars appear at the top]
This figure displays the standardized data matrix itself with rows and columns
permuted to reveal clustering and each data value replaced by one of three symbols.
Notice that the rows are ordered according to overall performance, with the fastest cars
at the top.
Matrix clustering is especially useful for displaying large correlation matrices. You
may want to cluster the correlation matrix this way and then use the ordering to
produce a scatterplot matrix that is organized by the multivariate structure.
Partitioning via K-Means
To produce partitioned clusters, you must decide in advance how many clusters you
want. K-means clustering searches for the best way to divide your objects into different
sections so that they are separated as well as possible. The procedure begins by picking
seed cases, one for each cluster, which are spread apart from the center of all of the
cases as much as possible. Then it assigns all cases to the nearest seed. Next, it attempts
to reassign each case to a different cluster in order to reduce the within-groups sum of
squares. This continues until the within-groups sum of squares can no longer be
reduced.
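A minimal Python sketch of the k-means idea just described: assign each case to its nearest center, recompute the cluster means, and repeat until the assignments stop changing. SYSTAT's seed selection and case-by-case reassignment differ in detail, so this is an outline of the logic, not the program's algorithm; the data are synthetic.
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()   # random seed cases
    assign = np.full(len(data), -1)
    for it in range(n_iter):
        # assign each case to the nearest center (squared Euclidean distance)
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                                        # assignments stable: done
        assign = new_assign
        for j in range(k):                               # recompute each cluster mean
            if np.any(assign == j):
                centers[j] = data[assign == j].mean(axis=0)
    return assign, centers

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=m, size=(10, 2)) for m in (0.0, 5.0, 10.0)])
labels, centers = kmeans(data, k=3)
print(np.bincount(labels))   # about 10 cases in each cluster for this well-separated example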
K-means clustering does not search through every possible partitioning of the data,
so it is possible that some other solution might have smaller within-groups sums of
squares. Nevertheless, it has performed relatively well on globular data separated in
several dimensions in Monte Carlo studies of cluster algorithms.
Because it focuses on reducing within-groups sums of squares, k-means clustering
is like a multivariate analysis of variance in which the groups are not known in advance.
The output includes analysis of variance statistics, although you should be cautious in
interpreting them. Remember, the program is looking for large F ratios in the first
place, so you should not be too impressed by large values.
Following is a three-group analysis of the car data. The clusters are similar to those
we found by joining. K-means clustering uses Euclidean distances instead of Pearson
correlations, so there are minor differences because of scaling. To keep the influences
of all variables comparable, we standardized the data before running the analysis.
Summary Statistics for 3 Clusters
Variable Between SS DF Within SS DF F-Ratio Prob
ACCEL 7.825 2 2.175 8 14.389 0.002
BRAKE 5.657 2 4.343 8 5.211 0.036
SLALOM 5.427 2 4.573 8 4.747 0.044
MPG 7.148 2 2.852 8 10.027 0.007
SPEED 7.677 2 2.323 8 13.220 0.003
-------------------------------------------------------------------------------
Cluster Number: 1
Members Statistics
Case Distance | Variable Minimum Mean Maximum St.Dev.
Mercedes 560 0.60 | ACCEL -0.45 -0.14 0.17 0.23
Saab 9000 0.31 | BRAKE -0.15 0.23 0.61 0.28
Toyota Supra 0.49 | SLALOM -1.95 -0.89 0.11 0.73
BMW 635 0.16 | MPG -1.01 -0.47 -0.01 0.37
| SPEED -0.34 0.00 0.50 0.31
-------------------------------------------------------------------------------
Cluster Number: 2
Members Statistics
Case Distance | Variable Minimum Mean Maximum St.Dev.
Civic CRX 0.81 | ACCEL 0.26 0.99 2.05 0.69
Acura Legend 0.67 | BRAKE -0.53 0.62 1.62 1.00
VW Fox GL 0.71 | SLALOM -0.37 0.72 1.43 0.74
Chevy Nova 0.76 | MPG 0.53 1.05 2.15 0.65
| SPEED -1.50 -0.91 -0.14 0.53
-------------------------------------------------------------------------------
Cluster Number: 3
Members Statistics
Case Distance | Variable Minimum Mean Maximum St.Dev.
Porsche 911T 0.25 | ACCEL -1.29 -1.13 -0.95 0.14
Testarossa 0.43 | BRAKE -1.22 -1.14 -1.03 0.08
Corvette 0.31 | SLALOM -0.10 0.23 0.59 0.28
| MPG -1.40 -0.78 -0.32 0.45
| SPEED 0.82 1.21 1.94 0.52
Additive Trees
Sattath and Tversky (1977) developed additive trees for modeling
similarity/dissimilarity data. Hierarchical clustering methods require objects in the
same cluster to have identical distances to each other. Moreover, these distances must
be smaller than the distances between clusters. These restrictions prove problematic for
similarity data, and as a result hierarchical clustering cannot fit such data well.
In contrast, additive trees use tree branch length to represent distances between
objects. Allowing the within-cluster distances to vary yields a tree diagram with
varying branch lengths. Objects within a cluster can be compared by focusing on the
horizontal distance along the branches connecting them. The additive tree for the car
data follows:
[Additive Tree: tree diagram for the car data, with leaves Porsche, Testa, Corv, Merc, Saab, Toyota, BMW, Civic, Acura, VW, and Chevy.]
The distances between nodes of the graph are:
Each object is a node in the graph. In this example, the first 11 nodes represent the cars.
Other graph nodes correspond to groupings of the objects. Here, the 12th node
represents Porsche and Testa.
The distance between any two nodes is the sum of the (horizontal) lengths between
them. The distance between Chevy and VW is 0.62 + 0.08 + 0.42 = 1.12. The
distance between Chevy and Civic is 0.62 + 0.08 + 0.71 = 1.41. Consequently,
Chevy is more similar to VW than to Civic.
Node Length Child
1 0.10 Porsche
2 0.49 Testa
3 0.14 Corv
4 0.52 Merc
5 0.19 Saab
6 0.13 Toyota
7 0.11 BMW
8 0.71 Civic
9 0.30 Acura
10 0.42 VW
11 0.62 Chevy
12 0.06 1,2
13 0.08 8,10
14 0.49 12,3
15 0.18 13,11
16 0.35 9,15
17 0.04 14,6
18 0.13 17,16
19 0.0 5,18
20 0.04 4,7
21 0.0 20,19
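To make the path-length arithmetic concrete, here is a small Python check (illustrative only, not SYSTAT output) that adds up the branch lengths from the node table above for Chevy, VW, and Civic.

# Branch lengths taken from the node table above; node 13 is the cluster {Civic, VW}.
length = {"Chevy": 0.62, "VW": 0.42, "Civic": 0.71, "node13": 0.08}

# The path from Chevy to VW runs over Chevy's branch, node 13's branch, and VW's branch.
dist_chevy_vw = length["Chevy"] + length["node13"] + length["VW"]
dist_chevy_civic = length["Chevy"] + length["node13"] + length["Civic"]
print(round(dist_chevy_vw, 2), round(dist_chevy_civic, 2))   # 1.12 and 1.41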
Cluster Analysis in SYSTAT
Hierarchical Clustering Main Dialog Box
Hierarchical clustering produces hierarchical clusters that are displayed in a tree.
Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins
by joining the two closest objects as a cluster and continues (in a stepwise manner)
joining an object with another object, an object with a cluster, or a cluster with another
cluster until all objects are combined into one cluster.
To obtain a hierarchical cluster analysis, from the menus choose:
Statistics
Classification
Hierarchical Clustering
You must select the elements of the data file to cluster (Join):
n Rows. Rows (cases) of the data matrix are clustered.
n Columns. Columns (variables) of the data matrix are clustered.
n Matrix. Rows and columns of the data matrix are clustered; they are permuted to
bring similar rows and columns next to one another.
Linkage allows you to specify the type of joining algorithm used to amalgamate
clusters (that is, define how distances between clusters are measured).
n Single. Single linkage defines the distance between two objects or clusters as the
distance between the two closest members of those clusters. This method tends to
produce long, stringy clusters. If you use a SYSTAT file that contains a similarity
or dissimilarity matrix, you get clustering via Johnson's min method.
n Complete. Complete linkage uses the most distant pair of objects in two clusters to
compute between-cluster distances. This method tends to produce compact,
globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT
file, you get Johnson's max method.
n Centroid. Centroid linkage uses the average value of all objects in a cluster (the
cluster centroid) as the reference point for distances to other objects or clusters.
n Average. Average linkage averages all distances between pairs of objects in
different clusters to decide how far apart they are.
n Median. Median linkage uses the median distances between pairs of objects in
different clusters to decide how far apart they are.
n Ward. Ward's method averages all distances between pairs of objects in different
clusters, with adjustments for covariances, to decide how far apart the clusters are.
For some data, the last four methods cannot produce a hierarchical tree with strictly
increasing amalgamation distances. In these cases, you may see stray branches that do
not connect to others. If this happens, you should consider Single or Complete linkage.
For more information on these problems, see Fisher and Van Ness (1971). These
reviewers concluded that these and other problems made Centroid, Average, Median,
and Ward (as well as k-means) inadmissible clustering procedures. In practice and in
Monte Carlo simulations, however, they sometimes perform better than Single and
Complete linkage, which Fisher and Van Ness considered admissible. Milligan
(1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation
of clustering algorithms. Consult his paper for further details.
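These linkage rules differ only in how they summarize the set of pairwise distances between two clusters. The following Python fragment (an illustrative sketch, not SYSTAT code; the two point sets are invented) makes the contrast explicit for single, complete, average, and centroid linkage.

import numpy as np

def pairwise_distances(A, B):
    # Euclidean distances between every member of cluster A and every member of cluster B.
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

A = np.array([[0.0, 0.0], [0.5, 0.2]])   # cluster A (invented points)
B = np.array([[3.0, 0.0], [4.0, 1.0]])   # cluster B (invented points)

d = pairwise_distances(A, B)
print("single  :", d.min())    # distance between the two closest members
print("complete:", d.max())    # distance between the most distant pair
print("average :", d.mean())   # mean of all between-cluster pairs
print("centroid:", np.sqrt(((A.mean(0) - B.mean(0)) ** 2).sum()))  # distance between centroids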
In addition, the following options can be specified:
Distance. Specifies the distance metric used to compare clusters.
Polar. Produces a polar (circular) cluster tree.
Save cluster identifier variable. Saves cluster identifiers to a SYSTAT file. You can
specify the number of clusters to identify for the saved file. If not specified, two
clusters are identified.
Clustering Distances
Both hierarchical clustering and k-means clustering allow you to select the type of
distance metric to use between objects. From the Distance drop-down list, you can
select the following (a brief computational sketch of two of these metrics appears after the list):
n Gamma. Distances are computed using 1 minus the Goodman-Kruskal gamma
correlation coefficient. Use this metric with rank order or ordinal scales. Missing
values are excluded from computations.
n Pearson. Distances are computed using 1 minus the Pearson product-moment
correlation coefficient for each pair of objects. Use this metric for quantitative
variables. Missing values are excluded from computations.
n RSquared. Distances are computed using 1 minus the square of the Pearson
product-moment correlation coefficient for each pair of objects. Use this metric
with quantitative variables. Missing values are excluded from computations.
n Euclidean. Clustering is computed using normalized Euclidean distance (root mean
squared distances). Use this metric with quantitative variables. Missing values are
excluded from computations.
n Minkowski. Clustering is computed using the pth root of the mean pth powered
distances of coordinates. Use this metric for quantitative variables. Missing values
are excluded from computations. Use the Power text box to specify the value of p.
n Chisquare. Distances are computed as the chi-square measure of independence of
rows and columns on 2-by-n frequency tables, formed by pairs of cases (or
variables). Use this metric when the data are counts of objects or events.
n Phisquare. Distances are computed as the phi-square (chi-square/total) measure on
2-by-n frequency tables, formed by pairs of cases (or variables). Use this metric
when the data are counts of objects or events.
n Percent (available for hierarchical clustering only). Clustering uses a distance
metric that is the percentage of comparisons of values resulting in disagreements
in two profiles. Use this metric with categorical or nominal scales.
n MW (available for k-means clustering only). Distances are computed as the
increment in within sum of squares of deviations, if the case (or variable) would
belong to a cluster. The case (or variable) is moved into the cluster that minimizes
the within sum of squares of deviations. Use this metric with quantitative variables.
Missing values are excluded from computations.
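Here is the brief sketch referred to above: how the Pearson and normalized Euclidean metrics are computed for a single pair of objects. It is written in Python rather than SYSTAT syntax, and the two profiles are invented purely for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0])   # profile of object 1 (invented)
y = np.array([2.0, 1.5, 4.0, 6.0])   # profile of object 2 (invented)

# Pearson distance: 1 minus the product-moment correlation of the two profiles.
r = np.corrcoef(x, y)[0, 1]
pearson_distance = 1.0 - r

# Normalized Euclidean distance: root mean squared difference across coordinates.
euclidean_distance = np.sqrt(((x - y) ** 2).mean())

print(round(pearson_distance, 4), round(euclidean_distance, 4))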
K-Means Main Dialog Box
K-means clustering splits a set of objects into a selected number of groups by
maximizing between-cluster variation relative to within-cluster variation. It is similar
to doing a one-way analysis of variance where the groups are unknown and the largest
F value is sought by reassigning members to each group.
K-means starts with one cluster and splits it into two clusters by picking the case
farthest from the center as a seed for a second cluster and assigning each case to the
nearest center. It continues splitting one of the clusters into two (and reassigning cases)
until a specified number of clusters are formed. K-means reassigns cases until the
within-groups sum of squares can no longer be reduced.
To obtain a k-means cluster analysis, from the menus choose:
Statistics
Classification
K-means Clustering
The following options can be specified:
Groups. Enter the number of desired clusters. If the number (Groups) of clusters is not
specified, two are computed (one split of the data).
Iterations. Enter the maximum number of iterations. If not stated, this maximum is 20.
Save identifier variable. Saves cluster identifiers to a SYSTAT file.
Distance. Specifies the distance metric used to compare clusters.
Additive Trees Main Dialog Box
Additive trees were developed by Sattath and Tversky (1977) for modeling
similarity/dissimilarity data, which are not fit well by hierarchical joining trees.
Hierarchical trees imply that all within-cluster distances are smaller than all between-
cluster distances and that within-cluster distances are equal. This so-called
"ultrametric" condition seldom applies to real similarity data from direct judgment.
Additive trees, on the other hand, represent similarities with a network model in the
shape of a tree. Distances between objects are represented by the lengths of the
branches connecting them in the tree.
To obtain additive trees, from the menus choose:
Statistics
Classification
Additive Tree Clustering
The following options can be specified:
Data. Raw data matrix.
Transformed. Data after transformation into distance-like measures.
Model. Model (tree) distances.
Residuals. Residuals matrix.
Notree. No tree graph is displayed.
Nonumbers. Objects in the tree graph are not numbered.
Nosubtract. Use of an additive constant. Additive Trees assumes interval-scaled data,
which implies complete freedom in choosing an additive constant, so it adds or
subtracts to exactly satisfy the triangle inequality. Use Nosubtract to allow strict
inequality and subtract no constant.
Height. Prints the distance of each node from the root.
Minvar. Combines the last few remaining clusters into the root node. Minvar requests
the program to search for the root that minimizes the variances of the distances from
the root to the leaves.
Using Commands
For the hierarchical tree method:
The distance metric is EUCLIDEAN, GAMMA, PEARSON, RSQUARED, MINKOWSKI,
CHISQUARE, PHISQUARE, or PERCENT. For MINKOWSKI, specify the root using
POWER=p.
The linkage methods include SINGLE, COMPLETE, CENTROID, AVERAGE,
MEDIAN, and WARD.
For the k-means splitting method:
The distance metric is EUCLIDEAN, GAMMA, PEARSON, RSQUARED, MINKOWSKI,
CHISQUARE, PHISQUARE, or MW. For MINKOWSKI, specify the root using POWER=p.
CLUSTER
USE filename
IDVAR var$
PRINT
SAVE filename / NUMBER=n DATA
JOIN varlist / POLAR DISTANCE=metric POWER=p
LINKAGE=method
CLUSTER
USE filename
IDVAR var$
PRINT
SAVE filename / NUMBER=n DATA
KMEANS varlist / NUMBER=n ITER=n DISTANCE=metric POWER=p
For additive trees:
Usage Considerations
Types of data. Hierarchical Clustering works on either rectangular SYSTAT files or files
containing a symmetric matrix, such as those produced with Correlations. K-Means
works only on rectangular SYSTAT files. Additive Trees works only on symmetric
(similarity or dissimilarity) matrices.
Print options. Using PRINT=LONG for Hierarchical Clustering yields an ASCII
representation of the tree diagram (instead of the Quick Graph). This option is useful
if you are joining more than 100 objects.
Quick Graphs. Cluster analysis includes Quick Graphs for each procedure. Hierarchical
Clustering and Additive Trees have tree diagrams. For each cluster, K-Means displays
a profile plot of the data and a display of the variable means and standard deviations.
To omit Quick Graphs, specify GRAPH NONE.
Saving files. CLUSTER saves cluster indices as a new variable.
BY groups. CLUSTER analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Labeling output. For Hierarchical Clustering and K-Means, be sure to consider using ID
Variable (on the Data menu) for labeling the output.
CLUSTER
USE filename
ADD varlist / DATA TRANSFORMED MODEL RESIDUALS
TREE NUMBERS NOSUBTRACT HEIGHT
MINVAR ROOT=n1,n2
Examples
Example 1
K-Means Clustering
The data in the file SUBWORLD are a subset of cases and variables from the
OURWORLD file:
URBAN      Percentage of the population living in cities
BIRTH_RT   Births per 1000 people
DEATH_RT   Deaths per 1000 people
B_TO_D     Ratio of births to deaths
BABYMORT   Infant deaths during the first year per 1000 live births
GDP_CAP    Gross domestic product per capita (in U.S. dollars)
LIFEEXPM   Years of life expectancy for males
LIFEEXPF   Years of life expectancy for females
EDUC       U.S. dollars spent per person on education
HEALTH     U.S. dollars spent per person on health
MIL        U.S. dollars spent per person on the military
LITERACY   Percentage of the population who can read

The distributions of the economic variables (GDP_CAP, EDUC, HEALTH, and MIL)
are skewed with long right tails, so these variables are analyzed in log units.
This example clusters countries (cases). The input follows; note that KMEANS must be specified last.
CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
KMEANS urban birth_rt death_rt babymort lifeexpm,
lifeexpf gdp_cap b_to_d literacy educ,
mil health / NUMBER=4
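As an aside before the SYSTAT output, the two preprocessing steps in this input, the base-10 log (L10) of the monetary variables and standardization to z scores, can be approximated in Python. This is an illustrative sketch, not SYSTAT code; the tiny DataFrame below is made up and merely stands in for the SUBWORLD variables.

import numpy as np
import pandas as pd

subworld = pd.DataFrame({                      # made-up fragment for illustration only
    "gdp_cap": [1000.0, 20000.0, 500.0],
    "educ":    [50.0, 800.0, 10.0],
    "mil":     [30.0, 400.0, 5.0],
    "health":  [40.0, 900.0, 8.0],
    "urban":   [40.0, 75.0, 20.0],
})

# LET (gdp_cap, educ, mil, health) = L10(@): base-10 logs of the skewed variables.
for col in ["gdp_cap", "educ", "mil", "health"]:
    subworld[col] = np.log10(subworld[col])

# STANDARDIZE / SD: convert every analysis variable to a z score.
z = (subworld - subworld.mean()) / subworld.std(ddof=1)
print(z.round(2))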
The resulting output is:
Distance metric is Euclidean distance

k-means splitting cases into 4 groups
Summary statistics for all cases
Variable Between SS df Within SS df F-ratio
URBAN 18.6065 3 9.3935 25 16.5065
BIRTH_RT 26.2041 3 2.7959 26 81.2260
DEATH_RT 23.6626 3 5.3374 26 38.4221
BABYMORT 26.0275 3 2.9725 26 75.8869
GDP_CAP 26.9585 3 2.0415 26 114.4464
EDUC 25.3712 3 3.6288 26 60.5932
HEALTH 24.9226 3 3.0774 25 67.4881
MIL 24.7870 3 3.2130 25 64.2893
LIFEEXPM 24.7502 3 4.2498 26 50.4730
LIFEEXPF 25.9270 3 3.0730 26 73.1215
LITERACY 24.8535 3 4.1465 26 51.9470
B_TO_D 22.2918 3 6.7082 26 28.7997
** TOTAL ** 294.3624 36 50.6376 309
-------------------------------------------------------------------------------
Cluster 1 of 4 contains 12 cases
Members Statistics
Case Distance | Variable Minimum Mean Maximum St.Dev.
Austria 0.28 | URBAN -0.17 0.60 1.59 0.54
Belgium 0.09 | BIRTH_RT -1.14 -0.93 -0.83 0.10
Denmark 0.19 | DEATH_RT -0.77 0.00 0.26 0.35
France 0.14 | BABYMORT -0.85 -0.81 -0.68 0.05
Switzerland 0.26 | GDP_CAP 0.33 1.01 1.28 0.26
UK 0.14 | EDUC 0.47 0.95 1.28 0.28
Italy 0.16 | HEALTH 0.52 0.99 1.31 0.23
Sweden 0.23 | MIL 0.28 0.81 1.11 0.25
WGermany 0.31 | LIFEEXPM 0.23 0.75 0.99 0.23
Poland 0.39 | LIFEEXPF 0.43 0.79 1.07 0.18
Czechoslov 0.26 | LITERACY 0.54 0.72 0.75 0.06
Canada 0.30 | B_TO_D -1.09 -0.91 -0.46 0.18
-------------------------------------------------------------------------------
Cluster 2 of 4 contains 5 cases
Members Statistics
Case Distance | Variable Minimum Mean Maximum St.Dev.
Ethiopia 0.40 | URBAN -2.01 -1.69 -1.29 0.30
Guinea 0.52 | BIRTH_RT 1.46 1.58 1.69 0.10
Somalia 0.38 | DEATH_RT 1.28 1.85 3.08 0.76
Afghanistan 0.38 | BABYMORT 1.38 1.88 2.41 0.44
Haiti 0.30 | GDP_CAP -2.00 -1.61 -1.27 0.30
| EDUC -2.41 -1.58 -1.10 0.51
| HEALTH -2.22 -1.64 -1.29 0.44
| MIL -1.76 -1.51 -1.37 0.17
| LIFEEXPM -2.78 -1.90 -1.38 0.56
| LIFEEXPF -2.47 -1.91 -1.48 0.45
| LITERACY -2.27 -1.83 -0.76 0.62
| B_TO_D -0.38 -0.02 0.25 0.26
-------------------------------------------------------------------------------
Cluster 3 of 4 contains 11 cases
Members Statistics
Case Distance | Variable Minimum Mean Maximum St.Dev.
Argentina 0.45 | URBAN -0.88 0.16 1.14 0.76
Brazil 0.32 | BIRTH_RT -0.60 0.07 0.92 0.49
Chile 0.40 | DEATH_RT -1.28 -0.70 0.00 0.42
Colombia 0.42 | BABYMORT -0.70 -0.06 0.55 0.47
Uruguay 0.61 | GDP_CAP -0.75 -0.38 0.04 0.28
Ecuador 0.36 | EDUC -0.89 -0.39 0.14 0.36
ElSalvador 0.52 | HEALTH -0.91 -0.47 0.28 0.38
Guatemala 0.65 | MIL -1.25 -0.59 0.37 0.49
Peru 0.37 | LIFEEXPM -0.63 0.06 0.77 0.49
Panama 0.51 | LIFEEXPF -0.57 0.04 0.61 0.44
Cuba 0.58 | LITERACY -0.94 0.20 0.73 0.51
| B_TO_D -0.65 0.63 1.68 0.76
-------------------------------------------------------------------------------
Cluster 4 of 4 contains 2 cases
Members Statistics
Case Distance | Variable Minimum Mean Maximum St.Dev.
Iraq 0.29 | URBAN -0.30 0.06 0.42 0.51
Libya 0.29 | BIRTH_RT 0.92 1.27 1.61 0.49
| DEATH_RT -0.77 -0.77 -0.77 0.0
| BABYMORT 0.44 0.47 0.51 0.05
| GDP_CAP -0.25 0.05 0.36 0.43
| EDUC -0.04 0.44 0.93 0.68
| HEALTH -0.51 -0.04 0.42 0.66
| MIL 1.34 1.40 1.46 0.08
| LIFEEXPM -0.09 -0.04 0.02 0.08
| LIFEEXPF -0.30 -0.21 -0.11 0.13
| LITERACY -0.94 -0.86 -0.77 0.12
| B_TO_D 1.61 2.01 2.42 0.57
[Quick Graph: Cluster Profile Plots. One panel per cluster (1-4); each panel plots the z scores of the cluster's cases (Index of Case) across URBAN, B_TO_D, DEATH_RT, LIFEEXPM, LITERACY, EDUC, MIL, HEALTH, LIFEEXPF, BABYMORT, BIRTH_RT, and GDP_CAP on a scale from -3 to 4.]
For each variable, cluster analysis compares the between-cluster mean square (Between
SS/df) to the within-cluster mean square (Within SS/df) and reports the F-ratio. However,
do not use these F ratios to test significance because the clusters are formed to characterize
differences. Instead, use these statistics to characterize relative discrimination. For
example, the log of gross domestic product (GDP_CAP) and BIRTH_RT are better
discriminators between countries than URBAN or DEATH_RT. For a good graphical view
of the separation of the clusters, you might rotate the data using the three variables with
the highest F ratios.
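For example, the printed F ratio for GDP_CAP can be reproduced directly from the Between SS, Within SS, and degrees of freedom in the summary table (a quick check in Python, not SYSTAT code):

# Values taken from the summary table above for GDP_CAP.
between_ss, between_df = 26.9585, 3
within_ss, within_df = 2.0415, 26

f_ratio = (between_ss / between_df) / (within_ss / within_df)
print(round(f_ratio, 4))   # about 114.45, matching the printed F-ratio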
Following the summary statistics, for each cluster, cluster analysis prints the
distance from each case (country) in the cluster to the center of the cluster. Descriptive
statistics for these countries appear on the right. For the first cluster, the standard scores
for LITERACY range from 0.54 to 0.75 with an average of 0.72. B_TO_D ranges from
-1.09 to -0.46. Thus, for these predominantly European countries, literacy is well
above the average for the sample and the birth-to-death ratio is below average. In
cluster 2, LITERACY ranges from -2.27 to -0.76 for these five countries, and B_TO_D
ranges from -0.38 to 0.25. Thus, the countries in cluster 2 have a lower literacy rate
and a greater potential for population growth than those in cluster 1. The fourth cluster
(Iraq and Libya) has an average birth-to-death ratio of 2.01, the highest among the four
clusters.
Cluster Profiles
The variables in this Quick Graph are ordered by their F ratios. In the top left plot, there
is one line for each country in cluster 1 that connects its z scores for each of the
variables. Zero marks the average for the complete sample. The lines for these 12
countries all follow a similar pattern: above average values for GDP_CAP, below for
BIRTH_RT, and so on. The lines in cluster 3 do not follow such a tight pattern.
Cluster Means
The variables in cluster means plots are ordered by the F ratios. The vertical lines under
each cluster number indicate the grand mean across all data. The mean within each
cluster is marked by a dot. The horizontal lines indicate one standard deviation above
or below the mean. The countries in cluster 1 have above average means of gross
domestic product, life expectancy, literacy, and urbanization, and spend considerable
money on health care and the military, while the means of their birth rates, infant
mortality rates, and birth-to-death ratios are low. The opposite is true for cluster 2.
Example 2
Hierarchical Clustering: Clustering Cases
This example uses the SUBWORLD data (see the k-means example for a description)
to cluster cases. The input is:
CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
JOIN urban birth_rt death_rt babymort lifeexpm,
lifeexpf gdp_cap b_to_d literacy educ mil health
The resulting output is:
Distance metric is Euclidean distance
Single linkage method (nearest neighbor)

Cluster and Cluster Were joined No. of members
containing containing at distance in new cluster
------------ ------------ ------------ --------------
WGermany Belgium 0.0869 2
WGermany Denmark 0.1109 3
WGermany UK 0.1127 4
Sweden WGermany 0.1275 5
Austria Sweden 0.1606 6
Austria France 0.1936 7
Austria Italy 0.1943 8
Austria Canada 0.2112 9
Uruguay Argentina 0.2154 2
Switzerland Austria 0.2364 10
Czechoslov Poland 0.2411 2
Switzerland Czechoslov 0.2595 12
Guatemala ElSalvador 0.3152 2
Guatemala Ecuador 0.3155 3
Uruguay Chile 0.3704 3
Cuba Uruguay 0.3739 4
Haiti Somalia 0.3974 2
Switzerland Cuba 0.4030 16
Guatemala Brazil 0.4172 4
Peru Guatemala 0.4210 5
Colombia Peru 0.4433 6
Ethiopia Haiti 0.4743 3
Panama Colombia 0.5160 7
Switzerland Panama 0.5560 23
Libya Iraq 0.5704 2
Afghanistan Guinea 0.5832 2
Ethiopia Afghanistan 0.5969 5
Switzerland Libya 0.8602 25
Switzerland Ethiopia 0.9080 30
The numerical results consist of the joining history. The countries at the top of the
panel are joined first at a distance of 0.087. The last entry represents the joining of the
largest two clusters to form one cluster of all 30 countries. Switzerland is in one of the
clusters and Ethiopia is in the other.
The clusters are best illustrated using a tree diagram. Because the example joins
rows (cases) and uses COUNTRY as an ID variable, the branches of the tree are labeled
with countries. If you join columns (variables), then variable names are used. The scale
for the joining distances is printed at the bottom. Notice that Iraq and Libya, which
form their own cluster as they did in the k-means example, are the second-to-last
cluster to link with others. They join with all the countries listed above them at a
distance of 0.583. Finally, at a distance of 0.908, the five countries at the bottom of the
display are added to form one large cluster.
Polar Dendrogram
Adding the POLAR option to JOIN yields a polar dendrogram.
Example 3
Hierarchical Clustering: Clustering Variables
This example joins columns (variables) instead of rows (cases) to see which variables
cluster together. The input is:
The resulting output is:
CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
JOIN urban birth_rt death_rt babymort lifeexpm,
lifeexpf gdp_cap b_to_d literacy,
educ mil health / COLS
Distance metric is Euclidean distance
Single linkage method (nearest neighbor)

Cluster and Cluster Were joined No. of members
containing containing at distance in new cluster
------------ ------------ ------------ --------------
LIFEEXPF LIFEEXPM 0.1444 2
HEALTH GDP_CAP 0.2390 2
EDUC HEALTH 0.2858 3
LIFEEXPF LITERACY 0.3789 3
BABYMORT BIRTH_RT 0.3859 2
EDUC LIFEEXPF 0.4438 6
MIL EDUC 0.4744 7
MIL URBAN 0.5414 8
B_TO_D BABYMORT 0.8320 3
B_TO_D DEATH_RT 0.8396 4
MIL B_TO_D 1.5377 12
The scale at the bottom of the tree for the distance (1 - r) ranges from 0.0 to 1.5. The
smallest distance is 0.011; thus, the correlation of LIFEEXPM with LIFEEXPF is
0.989.
Example 4
Hierarchical Clustering: Clustering Variables and Cases
To produce a shaded display of the original data matrix in which rows and columns are
permuted according to an algorithm in Gruvaeus and Wainer (1972), use the MATRIX
option. Different shadings or colors represent the magnitude of each number in the
matrix (Ling, 1973).
If you use the MATRIX option with Euclidean distance, be sure that the variables are
on comparable scales because both rows and columns of the matrix are clustered.
Joining a matrix containing inches of annual rainfall and annual growth of trees in feet,
for example, would split columns more by scales than by covariation. In cases like this,
you should standardize your data before joining.
The input is:
CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
JOIN urban birth_rt death_rt babymort lifeexpm,
lifeexpf gdp_cap b_to_d literacy educ,
mil health / MATRIX
The resulting output is:
[Permuted Data Matrix: the standardized data for the 30 countries (rows, ordered from Canada through Guinea) by URBAN, LITERACY, LIFEEXPF, LIFEEXPM, GDP_CAP, HEALTH, EDUC, MIL, DEATH_RT, BABYMORT, BIRTH_RT, and B_TO_D (columns), shaded on a scale from -3 to 4.]
This clustering reveals three groups of countries and two groups of variables. The
countries with more urban dwellers and literate citizens, longest life expectancies,
highest gross domestic product, and most expenditures on health care, education, and
the military are on the top left of the data matrix; countries with the highest rates of
death, infant mortality, birth, and population growth (see B_TO_D) are on the lower
right. You can also see that, consistent with the k-means and join examples, Iraq and
Libya spend much more on military, education, and health than their immediate
neighbors.
Example 5
Hierarchical Clustering: Distance Matrix Input
This example clusters a matrix of distances. The data, stored as a dissimilarity matrix
in the CITIES data file, are airline distances in hundreds of miles between 10 global
cities. The data are adapted from Hartigan (1975).
The input is:
Following is the output:
CLUSTER
USE cities
JOIN berlin bombay capetown chicago london,
montreal newyork paris sanfran seattle
Single linkage method (nearest neighbor)

Cluster and Cluster Were joined No. of members
containing containing at distance in new cluster
------------ ------------ ------------ --------------
PARIS LONDON 2.0000 2
NEWYORK MONTREAL 3.0000 2
BERLIN PARIS 5.0000 3
CHICAGO NEWYORK 7.0000 3
SEATTLE SANFRAN 7.0000 2
SEATTLE CHICAGO 17.0000 5
BERLIN SEATTLE 33.0000 8
BOMBAY BERLIN 39.0000 9
BOMBAY CAPETOWN 51.0000 10
The tree is printed in seriation order. Imagine a trip around the globe to these cities.
SYSTAT has identified the shortest path between cities. The itinerary begins at San
Francisco, leads to Seattle, Chicago, New York, and so on, and ends in Capetown.
Note that the CITIES data file contains the distances between the cities; SYSTAT
did not have to compute those distances. When you save the file, be sure to save it as
a dissimilarity matrix.
This example is used both to illustrate direct distance input and to give you an idea
of the kind of information contained in the order of the SYSTAT cluster tree. For
distance data, the seriation reveals shortest paths; for typical sample data, the seriation
is more likely to replicate in new samples so that you can recognize cluster structure.
Example 6
Additive Trees
This example uses the ROTHKOPF data file. The input is:
The output includes:
CLUSTER
USE rothkopf
ADD a .. z
Similarities linearly transformed into distances.
77.0000 needed to make distances positive.
104.0000 added to satisfy triangle inequality.
Checking 14950 quadruples.
Checking 1001 quadruples.
Checking 330 quadruples.
Checking 70 quadruples.
Checking 1 quadruples.

Stress formula 1 = 0.0609
Stress formula 2 = 0.3985
r(monotonic) squared = 0.8412
r-squared (p.v.a.f.) = 0.7880
Node Length Child

1 23.3958 A
2 15.3958 B
3 14.8125 C
4 13.3125 D
5 24.1250 E
6 34.8370 F
7 15.9167 G
8 27.8750 H
9 25.6042 I
10 19.8333 J
11 13.6875 K
12 28.6196 L
13 21.8125 M
14 22.1875 N
15 19.0833 O
16 14.1667 P
(SYSTAT also displays the raw data, as well as the model distances.)
17 18.9583 Q
18 21.4375 R
19 28.0000 S
20 23.8750 T
21 23.0000 U
22 27.1250 V
23 21.5625 W
24 14.6042 X
25 17.1875 Y
26 18.0417 Z
27 16.9432 1, 9
28 15.3804 2, 24
29 15.7159 3, 25
30 19.5833 4, 11
31 26.0625 5, 20
32 23.8426 7, 15
33 6.1136 8, 22
34 17.1750 10, 16
35 18.8068 13, 14
36 13.7841 17, 26
37 15.6630 18, 23
38 8.8864 19, 21
39 4.5625 27, 35
40 1.7000 29, 36
41 8.7995 33, 38
42 4.1797 39, 31
43 1.1232 12, 28
44 5.0491 34, 40
45 2.4670 42, 41
46 4.5849 30, 43
47 2.6155 32, 44
48 2.7303 6, 37
49 0.0 45, 48
50 3.8645 46, 47
51 0.0 50, 49
[Additive Tree: tree diagram for the 26 objects A through Z in the ROTHKOPF data.]
Computation
Algorithms
JOIN follows the standard hierarchical amalgamation method described in Hartigan
(1975). The algorithm in Gruvaeus and Wainer (1972) is used to order the tree.
KMEANS follows the algorithm described in Hartigan (1975). Modifications from
Hartigan and Wong (1979) improve speed. There is an important difference between
SYSTAT's KMEANS algorithm and that of Hartigan (or implementations of Hartigan's
in BMDP, SAS, and SPSS). In SYSTAT, seeds for new clusters are chosen by finding
the case farthest from the centroid of its cluster. In Hartigan's algorithm, seeds for
new clusters are chosen by splitting on the variable with largest variance.
Missing Data
In cluster analysis, all distances are computed with pairwise deletion of missing values.
Since missing data are excluded from distance calculations by pairwise deletion, they
do not directly influence clustering when you use the MATRIX option for JOIN. To use
the MATRIX display to analyze patterns of missing data, create a new file in which
missing values are recoded to 1, and all other values, to 0. Then use JOIN with MATRIX
to see whether missing values cluster together in a systematic pattern.
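A rough equivalent of that recoding step, written in Python rather than SYSTAT commands (and using a small made-up data set), looks like this:

import numpy as np
import pandas as pd

# A small made-up data set with some missing values.
data = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 2.0, 2.5]})

# Recode: 1 where a value is missing, 0 otherwise; the resulting indicator matrix
# can then be clustered (rows and columns) to look for systematic missingness.
missing_indicator = data.isna().astype(int)
print(missing_indicator)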
References
Campbell, D. T. and Fiske, D. W. (1959). Convergent and discriminant validation by the
multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Fisher, L. and Van Ness, J. W. (1971). Admissible clustering procedures. Biometrika, 58,
91-104.
Gower, J. C. (1967). A comparison of some methods of cluster analysis. Biometrics, 23,
623-637.
Gruvaeus, G. and Wainer, H. (1972). Two additions to hierarchical cluster analysis. The
British Journal of Mathematical and Statistical Psychology, 25, 200-206.
Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review,
139-150.
Hartigan, J. A. (1975). Clustering algorithms. New York: John Wiley & Sons, Inc.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-254.
Ling, R. F. (1973). A computer generated aid for cluster analysis. Communications of the
ACM, 16, 355-361.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate
observations. 5th Berkeley symposium on mathematics, statistics, and probability, Vol.
1, 281-298.
McQuitty, L. L. (1960). Hierarchical syndrome analysis. Educational and Psychological
Measurement, 20, 293-303.
Milligan, G. W. (1980). An examination of the effects of six types of error perturbation on
fifteen clustering algorithms. Psychometrika, 45, 325-342.
Sattath, S. and Tversky, A. (1977). Additive similarity trees. Psychometrika, 42, 319-345.
Sokal, R. R. and Michener, C. D. (1958). A statistical method for evaluating systematic
relationships. University of Kansas Science Bulletin, 38, 1409-1438.
Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. San Francisco:
W. H. Freeman and Company.
Wainer, H. and Schacht, S. (1978). Gapping. Psychometrika, 43, 203-212.
Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 58, 236-244.
Wilkinson, L. (1978). Permuting a matrix to a simple structure. Proceedings of the
American Statistical Association.
Chapter 5
Conjoint Analysis
Leland Wilkinson
Conjoint analysis fits metric and nonmetric conjoint measurement models to observed
data. It is designed to be a general additive model program using a simple
optimization procedure. As such, conjoint analysis can handle measurement models
not normally amenable to other specialized conjoint programs.
Statistical Background
Conjoint measurement (Luce and Tukey, 1964; Krantz, 1964; Luce, 1966; Tversky,
1967; Krantz and Tversky, 1971) is an axiomatic theory of measurement that defines
the conditions under which there exist measurement scales for two or more variables
that jointly define a common scale under an additive composition rule. This theory
became the basis for a group of related numerical techniques for fitting additive
models, called conjoint analysis (Green and Rao, 1971; Green, Carmone, and Wind,
1972; Green and DeSarbo, 1978; Green and Srinivasan, 1978, 1990; Louviere, 1988,
1994). For an interesting historical comment on Sir Ronald Fisher's "appropriate
scores" method for fitting additive models, see Heiser and Meulman (1995).
To see how conjoint analysis is based on additive models, we'll first graph an
additive table and then examine a multiplicative table to encounter one example of a
non-additive table. Then we'll consider the problem of computing margins of a
general table based on an additive model.
Additive Tables
The following is an additive table. Notice that any cell (in roman) is the sum of the
corresponding row and column marginal values (in italic).
        1    2    3
   4    5    6    7
   3    4    5    6
   2    3    4    5
   1    2    3    4

A common way to represent a two-way table like this is with a graph. I made a file
(PCONJ.SYD) containing all possible ordered pairs of the row and column indices. Then
I formed Y values by adding the indices:

USE PCONJ
LET Y=A+B
LINE Y*A/GROUP=B,OVERLAY

The following graph of the additive table shows a plot of Y (the values in the cells)
against A (rows) stratified by B (columns) in the legend. Notice that the lines are
parallel.
Since we really have a three-dimensional graph (Y*A*B), it is sometimes convenient
to represent a two-way table as a 3-D or contour plot rather than as a stratified line
graph. Following is the input to do so:

PLOT Y*A*B/SMOO=QUAD, CONTOUR,
     XMIN=0,XMAX=4,YMIN=0,YMAX=5,INDENT
The following contour plot of the additive table shows the result. Notice that the lines
in the contour plot are parallel for additive tables. Furthermore, although I used a
quadratic smoother, the contours are linear because I used a simple linear combination
of A and B to make Y.
Multiplicative Tables
Following is a multiplicative table. Notice that any cell is the product of the
corresponding marginal values. We commonly encounter these tables in cookbooks
(for sizing recipes) or in, well, multiplication tables. These tables are one instance of
two-way tables that are not additive.
        1    2    3
   4    4    8   12
   3    3    6    9
   2    2    4    6
   1    1    2    3
Let's look at a graph of this multiplicative table:

LET Y=A*B
LINE Y*A/GROUP=B,OVERLAY

Notice that the lines are not parallel.
And the following figure shows the contour plot for the multiplicative model. Notice,
again, that the contours are not parallel.
Multiplicative tables and graphs may be pleasing to look at, but they're not simple. We
all learned to add before multiplying. Scientists often simplify multiplicative functions
by logging them, since logs of products are sums of logs. This is also one of the reasons
we are told to be suspicious of fan-fold interactions (as in the line graph of the
multiplicative table) in the analysis of variance. If we can log the variables and remove
them (usually improving the residuals in the process), we should do so because it
leaves us with a simple linear model.
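That point is easy to check numerically. The following Python fragment (illustrative only, not SYSTAT code) logs the multiplicative table above and verifies that the result is exactly additive in the logged margins.

import numpy as np

a = np.arange(1, 5)                    # row margins 1..4
b = np.arange(1, 4)                    # column margins 1..3
multiplicative = np.outer(a, b)        # the multiplicative table
logged = np.log(multiplicative)        # log of a product is a sum of logs

# The logged table is additive: cell (i, j) equals log(a_i) + log(b_j).
reconstructed = np.log(a)[:, None] + np.log(b)[None, :]
print(np.allclose(logged, reconstructed))   # True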
Computing Table Margins Based on an Additive Model
If we believe in Occam's razor and assume that additive tables are generally preferable
to non-additive, we may want to fit additive models to a table of numbers before
accepting a more complex model. So far, we have been assuming that the marginal
indices are known. Testing for additivity is simply a matter of using these indices in a
formal model. What if the marginal indices are not known? All we have is a table of
numbers bordered by labeled categories. Can we find marginal values such that a linear
model based on these values would reproduce the table?
This is exactly what conjoint analysis does. Conjoint analysis originated in an
axiomatic approach to measurement (Luce and Tukey, 1964). An additive model
underlies a basic axiom of fundamental measurement: scale values of separate
measurements can be added to produce a joint measurement. This powerful property
allows us to say that for all measurements a and b we have made on a set of objects,
(a + b) > a and (a + b) > b, assuming that a and b are positive.
The following table is an example of such data. How do we find values for a_i and b_j
such that y_ij = a_i + b_j? Luce and Tukey devised rules for computing these values,
assuming that the cell values can be fit by the additive model.

          b1      b2      b3
   a4    1.38    2.07    2.48
   a3    1.10    1.79    2.20
   a2     .69    1.38    1.79
   a1     .00     .69    1.10

The following figure shows a solution. The values for a are a1 = 0, a2 = 0.69,
a3 = 1.10, and a4 = 1.38. The values for b are b1 = 0, b2 = 0.69, and b3 = 1.10.
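A short Python check (not SYSTAT code) confirms that these marginal values reproduce the table to rounding error. Incidentally, the solution values are, to two decimals, the natural logs of 1 through 4 and of 1 through 3, so this table is simply a logged version of a multiplicative table like the one graphed earlier.

import numpy as np

a = np.array([0.00, 0.69, 1.10, 1.38])   # a1..a4 from the solution
b = np.array([0.00, 0.69, 1.10])         # b1..b3 from the solution

table = np.array([[0.00, 0.69, 1.10],    # a1 row
                  [0.69, 1.38, 1.79],    # a2 row
                  [1.10, 1.79, 2.20],    # a3 row
                  [1.38, 2.07, 2.48]])   # a4 row

# Each cell should equal the sum of its row and column marginal values.
fitted = a[:, None] + b[None, :]
print(np.allclose(fitted, table, atol=0.01))   # True (to rounding)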
Applied Conjoint Analysis
In the last few decades, conjoint analysis has become popular, especially among market
researchers and some economists, for analyzing consumer preferences for goods based
on multiple attributes. Green and Srinivasan (1978, 1990), Crowe (1980), and Louviere
(1988) summarize this activity. The focus of most of these techniques has been on the
development of products with attributes ideally suited to consumer preferences.
Several trends in this area have been apparent.
First, psychometricians decided that the axiomatic approach was impractical for
large data sets and for data in which the conjoint measurement axioms were violated
or contained errors (for example, Emery and Barron, 1979). This trend was partly a
consequence of the development of numerical methods that could fit conjoint models
nonmetrically (Kruskal, 1965; Kruskal and Carmone, 1969; Srinivasan and Shocker,
1973; De Leeuw et al., 1976). Green and Srinivasan (1978) coined the term "conjoint
analysis" for the application of these numerical methods.
Second, applied researchers began to substitute linear methods (usually least-
squares linear regression or ANOVA) for nonmetric algorithms. The justification for
this was usually practical: the results appeared to be similar for all of the fitting
methods, so why not use the simple linear ones? Louviere (1988) articulates this
position, partly based on results from Green and Srinivasan (1978) and partly from his
own experience with real data sets. This argument is similar to one made by Weeks and
Bentler (1979), in which multidimensional scalings using a linear distance function
produced configurations almost indistinguishable from those using monotonic or
moderately nonlinear distance functions. This is a rather ad hoc conclusion, however,
and does not justify ignoring possible nonlinearities in the modeling process. We will
look at such a case in the examples.
Third, recent conjoint analysis applied methodology has moved toward designing
experiments rather than analyzing received ratings. Green and Srinivasan (1990) and
Louviere (1991) have pioneered this approach. Response surfaces for fractional
designs are analyzed to identify optimal combinations of product features. In
SYSTAT, this approach amounts to using DESIGN for setting up an experimental
design and then GLM for analyzing the results. With PRINT LONG, least-squares means
are produced for factorial designs. Otherwise, response surfaces can be plotted.
Fourth, discrete choice logistic regression has recently emerged as a rival to conjoint
analysis for modeling choice and preference behavior (Hensher and Johnson, 1981).
Steinberg (1992) describes the advantages and limitations of this approach. The LOGIT
procedure in SYSTAT offers this method.
Finally, a commercial industry supplying the practical tools for conjoint studies has
produced a variety of software packages. Oppewal (1995) reviews some of these. In
many cases, more effort is devoted to card decks and other stimulus materials
management than to the actual analysis of the models. CONJOINT in SYSTAT
represents the opposite end of the spectrum from these approaches. CONJOINT
presents methods for fitting these models that are inspired more by Luce and Tukey's
and Green and Rao's original theoretical formulations than by the practical
requirements of data collection. The primary goal of SYSTAT CONJOINT is to provide
tools for scaling small- to moderate-sized data sets in which additive models can
simplify the presentation of data. Metric and nonmetric loss functions are available for
exploring the effects of nonlinearity on scaling. The examples highlight this
distinction.
Conjoint Analysis in SYSTAT
Conjoint Analysis Main Dialog Box
To open the Conjoint Analysis dialog box, from the menus choose:
Statistics
Conjoint Analysis
Conjoint analyses are computed by specifying and then estimating a model.
Dependent(s). Select the variable(s) you want to examine. The dependent variable(s)
should be continuous numeric variables (for example, INCOME).
Independent(s). Select one or more continuous or categorical variables (grouping
variables).
Iterations. Enter the maximum number of iterations. If not stated, the maximum is 50.
Convergence. Enter the relative change in estimatesif all such changes are less than
the specified value, convergence is assumed.
Polarity. Enter the polarity of the preferences when doing preference mapping. If the
smaller number indicates the least and the higher number the most, select Positive. For
example, a questionnaire may include the question "please rate a list of movies," where
one star is the worst and five stars is the best. If the higher number indicates a lower
ranking and the lower number indicates a higher ranking, select Negative. For example,
a questionnaire may include the question "please rank your favorite sports team," where
1 is the best and 10 is the worst.
Loss. Specify a loss function to apply in model estimation:
n Stress. Conjoint analysis minimizes Kruskal's STRESS.
n Tau. Conjoint analysis maximizes Kendall's tau-b.
Regression. Specify the regression form:
n Monotonic. Regression function is monotonically increasing or decreasing. If
LOSS=STRESS, this is Kruskal's MONANOVA model.
n Linear. Regression function is ordinary linear regression.
n Log. Regression function is logarithmic.
n Power. Regression function is of the form y = ax^c. This is useful for Box-Cox
models.
Save file. Saves parameter estimates into filename.SYD.
Using Commands
To request a conjoint analysis:
Usage Considerations
Types of data. CONJOINT uses rectangular data only.
Print options. The output is standard for all print options.
Quick Graphs. Quick Graphs produced by CONJOINT are utility functions for each
predictor variable in the model.
Saving files. CONJOINT saves parameter estimates as one case into a file if you precede
ESTIMATE with SAVE.
BY groups. CONJOINT analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ=<variable> increases the number of cases by the FREQ
variable.
Case weights. WEIGHT is not available in CONJOINT.
CONJOINT
MODEL depvarlist = indvarlist
ESTIMATE / ITERATIONS=n CONVERGENCE=d ,
LOSS = STRESS
TAU ,
REGRESSION = MONOTONIC
LINEAR
LOG
POWER ,
POLARITY = POSITIVE
NEGATIVE
Examples
Example 1
Choice Data
The classical application of conjoint analysis is to product choice. The following
example from Green and Rao (1971) shows how to fit a nonmetric conjoint model to
some typical choice data. The input is:
Following is the output:
CONJOINT
USE BRANDS
MODEL RESPONSE=DESIGN$..GUARANT$
ESTIMATE / POLARITY=NEGATIVE
Iterative Conjoint Analysis

Monotonic Regression Model
Data are ranks
Loss function is Kruskal STRESS
Factors and Levels
DESIGN$
A
B
C
BRAND$
Bissell
Glory
K2R
PRICE
1.19
1.39
1.59
SEAL$
NO
YES
GUARANT$
NO
YES
Convergence Criterion: 0.000010
Maximum Iterations: 50

Iteration Loss Max parameter change

1 0.5389079 0.2641755
2 0.4476390 0.2711012
3 0.3170808 0.2482502
4 0.1746641 0.3290621
5 0.1285278 0.1702260
6 0.1050734 0.1906332
7 0.0877708 0.1261961
8 0.0591691 0.2336527
9 0.0407008 0.1665511
10 0.0166571 0.1448756
11 0.0101404 0.1399945
12 0.0058237 0.2048317
13 0.0013594 0.1900774
14 0.0006314 0.0345039
15 0.0001157 0.0466520
16 0.0000065 0.0192437
17 0.0000000 0.0155169
18 0.0000000 0.0032732
19 0.0000000 0.0000032
20 0.0000000 0.0000000

Parameter Estimates (Part Worths)
A B C Bissell Glory K2R
-0.331 0.400 0.209 -0.122 -0.226 -0.195
PRICE(1) PRICE(2) PRICE(3) NO YES NO
0.302 0.159 -0.429 -0.131 -0.102 -0.039
YES
0.504

Goodness of Fit (Kendall tau)
RESPONSE
1.000

Root-mean-square deleted goodness of fit values, i.e. fit when param(i)=0
A B C Bissell Glory K2R
0.856 0.699 0.935 0.922 0.843 0.856
PRICE(1) PRICE(2) PRICE(3) NO YES NO
0.778 0.922 0.817 0.948 0.974 0.987
YES
0.791
[Quick Graphs: part-worth (Measure) plots for DESIGN$, BRAND$, PRICE, SEAL$, and GUARANT$, and a Shepard diagram of Joint Score against Data.]
The fitting method chosen for this example is the default nonmetric loss using
Kruskal's STRESS statistic. This is the same method used in the MONANOVA
program (Kruskal and Carmone, 1969). Although the minimization algorithm differs
from that program, the result should be comparable.
The iterations converged to a perfect fit (LOSS = 0). That is, there exists a set of
parameter estimates such that their sums fit the observed data perfectly when Kendall's
tau-b is used to measure fit. This rarely occurs with real data.
The parameter estimates are scaled to have zero sum and unit sum of squares. There
is a single goodness-of-fit value for this example because there is one response.
The root-mean-square deleted goodness-of-fit values are the goodness of fit when
each respective parameter is set to zero. This serves as an informal test of sensitivity.
The lowest value for this example is for the B parameter, indicating that the estimate
for B cannot be changed without substantially affecting the overall goodness of fit.
The Shepard diagram displays the goodness of fit in a scatterplot. The Data axis
represents the observed data values. The Joint Score axis represents the values of the
combined parameter estimates. For example, if we have parameters a1, a2, a3 and b1,
b2, then every case measured on, say, a2 and b1 will be represented by a point in the
plot whose ordinate (y value) is a2 + b1. This example involves only one condition per
card or case, so that the Shepard diagram has no duplicate values on the y axis.
Conjoint analysis can easily handle duplicate measurements either with multiple
dependent variables (multiple subjects exposed to common stimuli) or with duplicate
values for the same subject (replications).
The fitted jagged line is the best fitting monotonic regression of these fitted values
on the observed data. See Chapter 19 for a similar diagram. And note carefully the
warnings about degenerate solutions and other problems.
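The construction behind the Shepard diagram can be mimicked in a few lines of Python. This is an illustrative sketch only; the part worths, cards, and ranks below are invented rather than taken from the output above, and scipy's kendalltau (which computes tau-b) stands in for CONJOINT's internal fit measure.

import numpy as np
from scipy.stats import kendalltau

# Invented part worths for a 2-factor design (levels a1, a2, a3 and b1, b2).
part_worth = {"a1": 0.60, "a2": 0.00, "a3": -0.60, "b1": 0.10, "b2": -0.10}

# Invented cards (level combinations) and observed preference ranks (1 = best).
cards = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"),
         ("a2", "b2"), ("a3", "b1"), ("a3", "b2")]
ranks = np.array([1, 2, 3, 4, 5, 6])

# Joint score for a card = sum of the part worths of its levels.
joint = np.array([part_worth[a] + part_worth[b] for a, b in cards])

tau, _ = kendalltau(ranks, joint)
print(round(tau, 3))   # -1.0: the joint score falls steadily as the rank number rises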
You may want to try this example with REGRESSION = LINEAR to see how the
results compare. The linear fit yields an almost perfect Pearson correlation. This also
means that GLM (MGLH) can produce nearly the same estimates:
The PRINT LONG statement causes GLM to print the least-squares estimates of the
marginal means that, for an additive model, are the parameters we seek. The GLM
parameter estimates will differ from the ones printed here only by a constant and
scaling parameter. Conjoint analysis always scales parameter estimates to have zero
sum and unit sum of squares. This way, they can be thought of as utilities over the
experimental domainsome negative, some positive.
GLM
MODEL RESPONSE = CONSTANT + DESIGN$..GUARANT$
CATEGORY DESIGN$..GUARANT$
PRINT LONG
ESTIMATE
Example 2
Word Frequency
The data set WORDS contains the most frequently used words in American English
(Carroll et al., 1971). Three measures have been added to the data. The first is the (most
likely) part of speech (PART$). The second is the number of letters (LETTERS) in the
word. The third is a measure of the meaning (MEANING$). This admittedly informal
measure represents the amount of harm done to comprehension (1 = a little, 4 = a lot)
by omitting the word from a sentence. While linguists may argue over these
classifications, they do reveal basic differences. Instead of using a measure of
frequency, we will work with the rank order itself to see if there is enough information
to fit a model. This time, we will maximize Kendall's tau-b directly.
Following is the input:
Following is the output:
USE WORDS
CONJOINT
LET RANK=CASE
MODEL RANK = LETTERS PART$ MEANING
ESTIMATE / LOSS=TAU,POLARITY=NEGATIVE
Iterative Conjoint Analysis

Monotonic Regression Model
Data are ranks
Loss function is 1-(1+tau)/2
Factors and Levels
LETTERS
1
2
3
4
PART$
adjective
adverb
conjunction
preposition
pronoun
verb
MEANING
1
2
3
Convergence Criterion: 0.000010
Maximum Iterations: 50

Iteration Loss Max parameter change

1 0.2042177 0.0955367
2 0.1988071 0.0911670
3 0.1897893 0.0708985
4 0.1861822 0.0308284
5 0.1843787 0.0259976
6 0.1825751 0.0131758
7 0.1825751 0.0000175
8 0.1825751 0.0000000

Parameter Estimates (Part Worths)
LETTERS(1) LETTERS(2) LETTERS(3) LETTERS(4) adjective adverb
0.154 0.174 -0.076 -0.270 -0.119 -0.273
conjunction preposition pronoun verb MEANING(1) MEANING(2)
-0.262 0.215 0.173 -0.162 0.749 -0.121
MEANING(3)
-0.182

Goodness of Fit (Kendall tau)
RANK
0.635

Root-mean-square deleted goodness of fit values, i.e. fit when param(i)=0
LETTERS(1) LETTERS(2) LETTERS(3) LETTERS(4) adjective adverb
0.628 0.610 0.635 0.606 0.635 0.617
conjunction preposition pronoun verb MEANING(1) MEANING(2)
0.602 0.613 0.610 0.631 0.494 0.617
MEANING(3)
0.610
The Shepard diagram reveals a slightly curvilinear relationship between the data and
the fitted values. We can parameterize that relationship by refitting the model as
follows:

ESTIMATE / REGRESSION=POWER,POLARITY=NEGATIVE

SYSTAT will then print Computed Exponent: 1.392. We will further examine this type
of power function in the Box-Cox example.
[Quick Graphs: Shepard diagram of Joint Score against Data and part-worth (Measure) plots for LETTERS, PART$, and MEANING.]
The output tells us that, in general, shorter words are higher on the list, adverbs are
lower, and prepositions are higher. Also, the most frequently occurring words are
generally the most disposable. These statements must be made in the context of the
model, however. To the extent that the separate statements are inaccurate when the
data are examined separately for each, the additive model is violated. This is another
way of saying that the additive model is appropriate when there are no interactions or
configural effects. Incidentally, when these data are analyzed with GLM using the
(inverse transformed) word frequencies themselves rather than rank order in the list,
the conclusions are substantially the same.
Example 3
Box-Cox Model
Box and Cox (1964) devised a maximum likelihood estimator for the exponent lambda
in the following model:

   E{ y^(lambda) } = X*beta

where X is a matrix of known values, beta is a vector of unknown parameters associated
with the transformed observations, and the residuals of the model are assumed to be
normally distributed and independent. The transformation itself is assumed to take the
following form:

   y^(lambda) = (y^lambda - 1) / lambda   if lambda is not equal to 0
   y^(lambda) = log(y)                    if lambda = 0

Following is a SYSTAT program (originally coded by Grant Blank) to compute the
Box-Cox exponent and its standard error. The comments document the program flow:

USE BOXCOX
REM First we need GLM to code dummy variables.
GLM
CATEGORY TREATMEN,POISON
MODEL Y=CONSTANT+TREATMEN+POISON
SAVE TEMP / MODEL
ESTIMATE
REM Now use STATS to compute geometric mean.
STATS
USE TEMP
SAVE GMEAN
LET LY=LOG(Y)
STATS LY / MEAN
REM Now duplicate the geometric mean for every case.
MERGE GMEAN(LY) TEMP (Y,X(1..5))
LET GMEAN=LAG(GMEAN)
IF CASE=1 THEN LET GMEAN=EXP(LY)
REM Now estimate the exponent, following Box&Cox
NONLIN
MODEL Y = B0 + B1*X(1) + B2*X(2) + B3*X(3) + B4*X(4) + B5*X(5)
LOSS = ((Y^POWER-1) /(POWER*GMEAN^(POWER-1))-ESTIMATE)^2
ESTIMATE

This program produces an estimate of -0.750 for lambda, with a 95% Wald confidence
interval of (-1.055, -0.445). This is in agreement with the results in the original paper.
Box and Cox recommend rounding the exponent to -1 because of its natural
interpretation (rate of dying from poison). In general, it is wise to round such
transformations to interpretable values such as ... -1, -0.5, 0, 0.5, 2 ... to facilitate the
interpretation of results.
The Box-Cox procedure is based on a specific model that assumes normality in the
transformed data and that focuses on the dependent variable. We might ask whether it
is worthwhile to examine transformations of this sort without assuming normality and
resorting to maximum likelihood for our answer. This is especially appropriate if our
general method is to find an "optimal" estimate of the exponent and then round it to
the nearest interpretable value based on a confidence interval. Indeed, two discussants
of the Box and Cox paper, John Hartigan and John Tukey, asked just that.
The conjoint model offers one approach to this question. Specifically, we can use a
power function relating the y data values to the predictor variables in our model and
see how it converges.
Following is the input:
USE BOXCOX
CONJOINT
MODEL Y=POISON TREATMEN
ESTIMATE / REGRESS=POWER
Following is the output:
Iterative Conjoint Analysis

Power Regression Model
Data are dissimilarities
Loss function is least squares
Factors and Levels
POISON
1
2
3
TREATMEN
1
2
3
4
Convergence Criterion: 0.000010
Maximum Iterations: 50

Iteration Loss Max parameter change

1 0.1977795 0.1024469
2 0.1661894 0.0530742
3 0.1594770 0.1473320
4 0.1571216 0.0973117
5 0.1562271 0.0156619
6 0.1559910 0.0193429
7 0.1559285 0.0149959
8 0.1559166 0.0034746
9 0.1559135 0.0024772
10 0.1559131 0.0016637
11 0.1559129 0.0005579
12 0.1559134 0.0004575
13 0.1559129 0.0000321
14 0.1559130 0.0000188
15 0.1559127 0.0000021

Computed Exponent: -1.015

Parameter Estimates (Part Worths)
POISON(1) POISON(2) POISON(3) TREATMEN(1) TREATMEN(2) TREATMEN(3)
-0.375 -0.138 0.634 0.423 -0.414 0.133
TREATMEN(4)
-0.264

Goodness of Fit (Pearson correlation)
Y
-0.919

Root-mean-square deleted goodness of fit values, i.e. fit when param(i)=0
POISON(1) POISON(2) POISON(3) TREATMEN(1) TREATMEN(2) TREATMEN(3)
0.872 0.912 0.785 0.866 0.868 0.914
TREATMEN(4)
0.898
On each iteration, CONJOINT transforms the observed (y) values by the current
estimate of the exponent, regresses them on the currently weighted X variables (using
the conjoint parameter estimates), and computes the loss from the residuals of that
regression. Over iterations, this loss is minimized and we get to view the final fit in the
plotted Shepard diagram.
The CONJOINT program produced an estimate of -1.015 for the exponent. Draper and
Hunter (1969) reanalyzed the poison data using several criteria suggested in the
discussion to Box and Cox's paper and elsewhere (minimizing interaction F ratio,
maximizing main-effects F ratios, and minimizing Levene's test for heterogeneity of
within-group variances). They found the best exponent to be in the neighborhood of -1.
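For reference, the scaled power transformation used in the LOSS statement of the Box-Cox program above can be written as a small Python function. This is an illustrative sketch, not SYSTAT code, and the response values below are made up.

import numpy as np

def boxcox_scaled(y, lam):
    # Box-Cox transformation scaled by the geometric mean, as in the LOSS statement:
    # (y**lam - 1) / (lam * gmean**(lam - 1)) for lam != 0, and gmean * log(y) for lam = 0.
    y = np.asarray(y, dtype=float)
    gmean = np.exp(np.log(y).mean())
    if lam == 0:
        return gmean * np.log(y)
    return (y ** lam - 1.0) / (lam * gmean ** (lam - 1.0))

y = np.array([0.2, 0.3, 0.45, 0.7, 1.1, 1.5])   # made-up positive response values
print(boxcox_scaled(y, -1.0).round(3))          # reciprocal-like scaling (a "rate")
print(boxcox_scaled(y, 0.0).round(3))           # log scaling, for comparison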
[Quick Graphs: Shepard diagram of Joint Score against Data and part-worth (Measure) plots for POISON and TREATMEN.]
Example 4
Employment Discrimination
The following table shows the mean salaries (SALNOW) of employees at a Chicago
bank. These data are from the BANK.SYD data set used in many SPSS manuals. The
bank was involved in a discrimination lawsuit, and the focus of our interest is whether
we can represent the salaries by a simple additive model. At the time these data were
collected, there were no black females with a graduate school education working at the
bank. The education variable records the highest level reached.
Let's regress beginning salary (SALBEG) and current salary (SALNOW) on the gender
and education data. To represent our model, we will code the categories with integers:
for gender/race, 1=black females, 2=white females, 3=black males, 4=white males; for
education, 1=high school, 2=college, 3=grad school. These codings order the salaries
for both racial/gender status and educational levels.
                 High School    College    Grad School
White Males           11735       16215          28251
Black Males           11513       13341          20472
White Females          9600       13612          11640
Black Females          8874       10278

Following is the input:
USE BANK
IF SEX=1 AND MINORITY=1 THEN LET GROUP=1
IF SEX=1 AND MINORITY=0 THEN LET GROUP=2
IF SEX=0 AND MINORITY=1 THEN LET GROUP=3
IF SEX=0 AND MINORITY=0 THEN LET GROUP=4
LET EDUC=1
IF EDLEVEL>12 THEN LET EDUC=2
IF EDLEVEL>16 THEN LET EDUC=3
LABEL GROUP / 1=Black_Females,2=White_Females,
3=Black_Males,4=White_Males
LABEL EDUC / 1=High_School,2=College,3=Grad_School
CONJOINT
MODEL SALBEG,SALNOW=GROUP EDUC
ESTIMATE / REGRESS=POWER
Following is the output:
Iterative Conjoint Analysis

Power Regression Model
Data are dissimilarities
Loss function is least squares
Factors and Levels
GROUP
Black_Female
White_Female
Black_Males
White_Males
EDUC
High_School
College
Grad_School
Convergence Criterion: 0.000010
Maximum Iterations: 50

Iteration Loss Max parameter change

1 0.3932757 0.0931128
2 0.3734472 0.2973392
3 0.3631769 0.2928259
4 0.3606965 0.1416823
5 0.3589525 0.0244544
6 0.3585654 0.0090515
7 0.3584647 0.0252027
8 0.3584328 0.0068830
9 0.3584239 0.0016764
10 0.3584233 0.0047662
11 0.3584215 0.0009750
12 0.3584225 0.0001914
13 0.3584253 0.0001697
14 0.3584253 0.0000182
15 0.3584231 0.0000021
16 0.3584189 0.0000004

Computed Exponent: -0.072

Parameter Estimates (Part Worths)
GROUP(1) GROUP(2) GROUP(3) GROUP(4) EDUC(1) EDUC(2)
-0.366 -0.200 -0.034 0.144 -0.356 -0.010
EDUC(3)
0.823

Goodness of Fit (Pearson correlation)
SALBEG SALNOW
0.815 0.787

Root-mean-square deleted goodness of fit values, i.e. fit when param(i)=0
GROUP(1) GROUP(2) GROUP(3) GROUP(4) EDUC(1) EDUC(2)
0.782 0.785 0.801 0.795 0.753 0.801
EDUC(3)
0.696
The computed exponent (-0.072) suggests that a log transformation would be
appropriate for fitting a parametric model. The two salary measurements (salary at time
of hire and at time of the study) perform similarly, although beginning salary shows a
slightly better fit to the additive model (0.815 versus 0.787). You can see the difference
in the two printed Shepard diagrams. The estimates of the parameters show clear
orderings in the categories.
Check for sensitivity of the parameter estimates by examining the root-mean-square
deleted goodness of fit values. The reported values are averages of the fits for both
SALBEG and SALNOW when the respective parameter is set to zero. Here we find that
the greatest change in goodness of fit corresponds to a change in the Grad School
parameter.
(Shepard diagrams of Joint Score plotted against Data for SALBEG and SALNOW, and part-worth plots of Measure against the GROUP and EDUC levels)
Transformed Additive Model
The transformed additive model removes the highly significant interaction for
SALNOW and almost removes it for SALBEG in these data. You can see this by
recoding the education and gender/race variables with the parameter estimates from the
conjoint analysis:
Following are the input and output:
IF GROUP=1 THEN LET G=-.365
IF GROUP=2 THEN LET G=-.2
IF GROUP=3 THEN LET G=-.033
IF GROUP=4 THEN LET G=.147
IF EDUC=1 THEN LET E=-.359
IF EDUC=2 THEN LET E=-.011
IF EDUC=3 THEN LET E=.822
LET LSALB=LOG(SALBEG)
LET LSALN=LOG(SALNOW)
GLM
MODEL LSALB,LSALN = CONSTANT+E+G+E*G
ESTIMATE
HYPOTHESIS
EFFECT=E*G
TEST
Number of cases processed: 474
Dependent variable means

LSALB LSALN
8.753 9.441

Regression coefficients B = (X'X)^-1 X'Y

LSALB LSALN

CONSTANT 8.829 9.531

E 0.576 0.653

G 0.723 0.722

E*G 0.558 0.351


Multiple correlations

LSALB LSALN
0.817 0.789

Squared multiple correlations

LSALB LSALN
0.667 0.622
Adjusted R^2 = 1-(1-R^2)*(N-1)/df, where N = 474, and df = 470
LSALB LSALN
0.665 0.620
------------------------------------------------------------------------------------
*** WARNING ***
Case 297 has large leverage (Leverage = 0.128)
Test for effect called: E*G


Univariate F Tests

Effect SS df MS F P

LSALB 0.275 1 0.275 6.596 0.011
Error 19.628 470 0.042

LSALN 0.109 1 0.109 1.818 0.178
Error 28.219 470 0.060


Multivariate Test Statistics

Wilks' Lambda = 0.986
F-Statistic = 3.447 df = 2, 469 Prob = 0.033

Pillai Trace = 0.014
F-Statistic = 3.447 df = 2, 469 Prob = 0.033

Hotelling-Lawley Trace = 0.015
F-Statistic = 3.447 df = 2, 469 Prob = 0.033
Ordered Scatterplots
Finally, let's use SYSTAT to produce scatterplots of beginning and current salary ordered by the conjoint coefficients. The SYSTAT code to do this can be found in the file CONJO4.SYC. The spacing of the scatterplots should tell the story.
The story is mainly in this graph: regardless of educational level, minorities and
women received lower salaries. There are a few exceptions to the general pattern, but
overall the bank had reason to settle the lawsuit.
Computation
All computations are in double precision.
Algorithms
CONJOINT uses a direct search optimization method to minimize the loss function.
This enables minimization of Kendall's tau. There is no guarantee that the program will
find the global minimum of tau, so it is wise to try several regression types and the
STRESS loss to be sure that they all reach approximately the same neighborhood.
Missing Data
Missing values are processed by omitting them from the loss function.
References
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26, 211–252.
Brogden, H. E. (1977). The Rasch model, the law of comparative judgment and additive conjoint measurement. Psychometrika, 42, 631–634.
Carmone, F. J., Green, P. E., and Jain, A. K. (1978). Robustness of conjoint analysis: Some Monte Carlo results. Journal of Marketing Research, 15, 300–303.
Carroll, J. B., Davies, P., and Richmond, B. (1971). The word frequency book. Boston, Mass.: Houghton, Mifflin.
Carroll, J. D. and Green, P. E. (1995). Psychometric methods in marketing research: Part I, conjoint analysis. Journal of Marketing Research, 32, 385–391.
Crowe, G. (1980). Conjoint measurements design considerations. PMRS Journal, 1, 8–13.
De Leeuw, J., Young, F. W., and Takane, Y. (1976). Additive structure in qualitative data: An alternating least squares method with optimal scaling features. Psychometrika, 41, 471–503.
Draper, N. R. and Hunter, W. G. (1969). Transformations: Some examples revisited. Technometrics, 11, 23–40.
Emery, D. R. and Barron, F. H. (1979). Axiomatic and numerical conjoint measurement: An evaluation of diagnostic efficacy. Psychometrika, 44, 195–210.
Green, P. E., Carmone, F. J., and Wind, Y. (1972). Subjective evaluation models and conjoint measurement. Behavioral Science, 17, 288–299.
Green, P. E. and DeSarbo, W. S. (1978). Additive decomposition of perceptions data via conjoint analysis. Journal of Consumer Research, 5, 58–65.
Green, P. E. and Rao, V. R. (1971). Conjoint measurement for quantifying judgmental data. Journal of Marketing Research, 8, 355–363.
Green, P. E. and Srinivasan, V. (1978). Conjoint analysis in consumer research: Issues and outlook. Journal of Consumer Research, 5, 103–123.
Green, P. E. and Srinivasan, V. (1990). Conjoint analysis in marketing: New developments with implications for research and practice. Journal of Marketing, 54, 3–19.
Heiser, W. J. and Meulman, J. J. (1995). Nonlinear methods for the analysis of homogeneity and heterogeneity. In W. J. Krzanowski (ed.), Recent advances in descriptive multivariate analysis, 51–89. Oxford: Clarendon Press.
Hensher, D. A. and Johnson, L. W. (1981). Applied discrete choice modeling. London: Croom Helm.
Krantz, D. H. (1964). Conjoint measurement: The Luce-Tukey axiomatization and some extensions. Journal of Mathematical Psychology, 1, 248–277.
Krantz, D. H. and Tversky, A. (1971). Conjoint measurement analysis of composition rules in psychology. Psychological Review, 78, 151–169.
Kruskal, J. B. (1965). Analysis of factorial experiments by estimating monotone transformations of the data. Journal of the Royal Statistical Society, Series B, 27, 251–263.
Kruskal, J. B. and Carmone, F. J. (1969). MONANOVA: A Fortran-IV program for monotone analysis of variance (non-metric analysis of factorial experiments). Behavioral Science, 14, 165–166.
Louviere, J. J. (1988). Analyzing decision making: Metric conjoint analysis. Newbury Park, Calif.: Sage Publications.
Louviere, J. J. (1991). Experimental choice analysis: Introduction and review. Journal of Business Research, 23, 291–297.
Louviere, J. J. (1994). Conjoint analysis. In R. Bagozzi (ed.), Handbook of Marketing Research, 223–259. Oxford: Blackwell Publishers.
Luce, R. D. (1966). Two extensions of conjoint measurement. Journal of Mathematical Psychology, 3, 348–370.
Luce, R. D. and Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1, 1–27.
Nygren, T. E. (1986). A two-stage algorithm for assessing violations of additivity via axiomatic and numerical conjoint analysis. Psychometrika, 51, 483–491.
Oppewal, H. (1995). A review of conjoint software. Journal of Retailing and Consumer Services, 2, 55–61.
Srinivasan, V. and Shocker, A. D. (1973). Linear programming techniques for multidimensional analysis of preference. Psychometrika, 38, 337–369.
Steinberg, D. (1992). Applications of logit models in market research. 1992 Sawtooth-SYSTAT Software Conference Proceedings, 405–424. Ketchum, Idaho: Sawtooth Software, Inc.
Tversky, A. (1967). A general theory of polynomial conjoint measurement. Journal of Mathematical Psychology, 4, 1–20.
Umesh, U. N. and Mishra, S. (1990). A Monte Carlo investigation of conjoint analysis index-of-fit: Goodness of fit, significance and power. Psychometrika, 55, 33–44.
Weeks, D. G. and Bentler, P. M. (1979). A comparison of linear and monotone multidimensional scaling models. Psychological Bulletin, 86, 349–354.
Chapter 6
Correlations, Similarities, and Distance Measures
Leland Wilkinson, Laszlo Engelman, and Rick Marcantonio
Correlations computes correlations and measures of similarity and distance. It prints
the resulting matrix and, if requested, saves it in a SYSTAT file for further analysis,
such as multidimensional scaling, cluster, or factor analysis.
For continuous data, Correlations provides the Pearson correlation, covariances,
and sums of squares of deviations from the mean and sums of cross-products of
deviations (SSCP). In addition to the usual probabilities, the Bonferroni and Dunn-
Sidak adjustments are available with Pearson correlations. If distances are desired,
Euclidean or city-block distances are available. Similarity measures for continuous
data include the Bray-Curtis coefficient and the QSK quantitative symmetric
coefficient (or Kulczynski measure).
For rank-order data, Correlations provides Goodman-Kruskal's gamma, Guttman's mu2, Spearman's rho, and Kendall's tau.
For binary data, Correlations provides S2, the positive matching dichotomy coefficient; S3, Jaccard's dichotomy coefficient; S4, the simple matching dichotomy coefficient; S5, Anderberg's dichotomy coefficient; and S6, Tanimoto's dichotomy
coefficient. When underlying distributions are assumed to be normal, the tetrachoric
correlation is available.
When data are missing, listwise and pairwise deletion methods are available for all
measures. An EM algorithm is an option for maximum likelihood estimates of
correlation, covariance, and cross-products of deviations matrices. For robust ML
estimates where outliers are downweighted, the user can specify the degrees of
freedom for the t distribution or contamination for a normal distribution. Correlations
includes a graphical display of the pattern of missing values. Little's MCAR test is
printed with the display. The EM algorithm also identifies cases with extreme
Mahalanobis distances.
Hadi's robust outlier detection and estimation procedure is an option for
correlations, covariances, and SSCP; cases identified as outliers by the procedure are
not used to compute estimates.
Statistical Background
SYSTAT computes many different measures of the strength of association between
variables. The most popular measure is the Pearson correlation, which is appropriate
for describing linear relationships between continuous variables. However, CORR
offers a variety of alternative measures of similarity and distance appropriate if the data
are not continuous.
Let's look at an example. The following data, from the CARS file, are taken from various issues of Car and Driver and Road & Track magazine. They are the car enthusiast's equivalent of Consumer Reports performance ratings. The cars rated
include some of the most expensive and exotic cars in the world (for example, Ferrari
Testarossa) as well as some of the least expensive but sporty cars (for example, Honda
Civic CRX). The attributes measured are 0–60 m.p.h. acceleration, braking distance in feet from 60–0 m.p.h., slalom times (speed over a twisty course), miles per gallon, and
top speed in miles per hour.
ACCEL BRAKE SLALOM MPG SPEED NAME$
5.0 245 61.3 17.0 153 Porsche 911T
5.3 242 61.9 12.0 181 Testarossa
5.8 243 62.6 19.0 154 Corvette
7.0 267 57.8 14.5 145 Mercedes 560
7.6 271 59.8 21.0 124 Saab 9000
7.9 259 61.7 19.0 130 Toyota Supra
8.5 263 59.9 17.5 131 BMW 635
8.7 287 64.2 35.0 115 Civic CRX
9.3 258 64.1 24.5 129 Acura Legend
10.8 287 60.8 25.0 100 VW Fox GL
13.0 253 62.3 27.0 95 Chevy Nova
The Scatterplot Matrix (SPLOM)
A convenient summary that shows the relationships between the performance variables
is to arrange them in a matrix. A matrix is a rectangular array. We can put any sort of
numbers in the cells of the matrix, but we will focus on measures of association. Before
doing that, however, lets examine a graphical matrix, the scatterplot matrix
(SPLOM).
(Scatterplot matrix of ACCEL, BRAKE, SLALOM, MPG, and SPEED, with histograms on the diagonal)
This matrix shows the histograms of each variable on the diagonal and the scatterplots
(x-y plots) of each variable against the others. For example, the scatterplot of
acceleration versus braking is at the top of the matrix. Since the matrix is symmetric,
only the bottom half is shown. In other words, the plot of acceleration versus braking
is the same as the transposed scatterplot of braking versus acceleration.
The Pearson Correlation Coefficient
Now, assume that we want a single number that summarizes how well we could predict
acceleration from braking using a straight line. For linear regression, we discuss how
we calculate such a line, but it is enough here to know that we are interested in drawing
a line through the area covered by the points in the scatterplot such that, on average,
the acceleration of a car could be predicted rather well by the value on the line
corresponding to its braking. The closer the points cluster around this line, the better
would be the prediction.
In addition, we want this number to represent simultaneously how well we can
predict braking from acceleration using a similar line. This symmetry we seek is
fundamental to all the measures available in CORR. It means that, whatever the scales
on which we measure our variables, the coefficient of association we compute will be
the same for either prediction. If this symmetry makes no sense for a certain data set,
then you probably should not be using CORR.
The most common measure of association is the Pearson correlation coefficient,
which varies between -1 and +1. A Pearson correlation of 0 indicates that neither of two variables can be predicted from the other by using a linear equation. A Pearson correlation of +1 indicates that one variable can be predicted perfectly by a positive linear function of the other, and vice versa. And a value of -1 indicates the same, except that the function has a negative sign for the slope of the line.
Following is the Pearson correlation matrix corresponding to this SPLOM:

Pearson Correlation Matrix
            ACCEL    BRAKE   SLALOM      MPG    SPEED
ACCEL       1.000
BRAKE       0.466    1.000
SLALOM      0.176   -0.097    1.000
MPG         0.651    0.622    0.597    1.000
SPEED      -0.908   -0.665   -0.115   -0.768    1.000
Number of Observations: 11
Try superimposing in your mind the correlation matrix on the SPLOM. The Pearson
correlation for acceleration versus braking is 0.466. This correlation is positive and
moderate in size. On the other hand, the correlation between acceleration and speed is
negative and quite large (-0.908). You can see in the lower left corner of the SPLOM
that the points cluster around a downward sloping line. In fact, all of the correlations
of speed with the other variables are negative, which makes sense since greater speed
implies greater performance. The same is true for slalom performance, but this is
clouded by the fact that some small but slower cars like the Honda Civic CRX are
extremely agile.
Keep in mind that the Pearson correlation measures linear predictability. Do not
assume that a Pearson correlation near 0 implies no relationship between variables.
Many nonlinear associations (U- and S-shaped curves, for example) can have Pearson
correlations of 0.
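A quick way to see this last point is to compute r for an exact U-shaped relationship. The short sketch below (plain Python with NumPy, not SYSTAT code) produces a Pearson correlation of essentially 0 even though y is completely determined by x:

import numpy as np

x = np.linspace(-3, 3, 101)
y = x**2                          # perfect, but nonlinear, dependence
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))                # 0.0 up to rounding: no linear association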
Other Measures of Association
CORR offers a variety of other association measures. There is not room here to discuss
all of them, but lets review some briefly.
Measures for Rank-Order Data
Several measures are available for rank-order data: Goodman-Kruskal's gamma, Guttman's mu2, Spearman's rho, and Kendall's tau. Each measures an aspect of rank-order association. The one closest to Pearson is the Spearman. Spearman's rho is simply a Pearson correlation computed on the same data after converting them to ranks. Goodman-Kruskal's gamma and Kendall's tau reflect the tendency for two cases to have similar orderings on two variables. However, the former focuses on cases which are not tied in rank orderings. If no ties exist, these two measures will be equal.
Following is the same matrix computed for Spearman's rho:

Matrix of Spearman Correlation Coefficients
            ACCEL    BRAKE   SLALOM      MPG    SPEED
ACCEL       1.000
BRAKE       0.501    1.000
SLALOM      0.245   -0.305    1.000
MPG         0.815    0.502    0.487    1.000
SPEED      -0.891   -0.651   -0.109   -0.884    1.000
Number of observations: 11
It is often useful to compute both a Spearman and Pearson matrix on the same data.
The absolute difference between the two can reveal unusual features. For example, the
greatest difference for our data is on the slalom-braking correlation. This is because the
Honda Civic CRX is so fast through the slalom, despite its inferior brakes, that it
attenuates the Pearson correlation between slalom and braking. The Spearman
correlation reduces its influence.
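Because Spearman's rho is simply the Pearson correlation of the ranks, it can be sketched in a few lines. The following Python fragment is only illustrative (it is not SYSTAT's implementation) and uses average ranks for ties:

import numpy as np
from scipy.stats import rankdata

def spearman_rho(x, y):
    # Spearman's rho: Pearson correlation computed on the ranks
    rx, ry = rankdata(x), rankdata(y)   # average ranks for ties
    return np.corrcoef(rx, ry)[0, 1]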
Dissimilarity and Distance Measures
These measures include the Bray-Curtis (BC) dissimilarity measure, the quantitative
symmetric dissimilarity coefficient, the Euclidean distance, and the city-block
distance.
Euclidean and city-block distance measures have been widely available in software
packages for many years; Bray-Curtis and QSK are less common. For each pair of variables,

Bray-Curtis = Σ_k |x_ik - x_jk| / Σ_k (x_ik + x_jk)

QSK = 1 - (1/2) Σ_k min(x_ik, x_jk) [ 1/(Σ_k x_ik) + 1/(Σ_k x_jk) ]

where i and j are variables and k is cases. After an extensive computer simulation study,
Faith, Minchin, and Belbin (1987) concluded that BC and QSK were effective as
robust measures in terms of both rank and linear correlation. The use of these
measures is similar to that for Correlations (Pearson, Covariance, and SSCP), except
the EM, Prob, Bonferroni, Dunn-Sidak, and Hadi options are not available.
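As a check on the formulas above, here is a direct, illustrative Python translation (not SYSTAT code); xi and xj are NumPy arrays of nonnegative values for two variables measured on the same cases:

import numpy as np

def bray_curtis(xi, xj):
    # sum of absolute differences over sum of values
    return np.abs(xi - xj).sum() / (xi + xj).sum()

def qsk(xi, xj):
    # quantitative symmetric (Kulczynski) dissimilarity
    m = np.minimum(xi, xj).sum()
    return 1.0 - 0.5 * m * (1.0 / xi.sum() + 1.0 / xj.sum())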
Measures for Binary Data
Correlations offers the following association measures for binary data: positive
matching dichotomy coefficients (S2), Jaccard's dichotomy coefficients (S3), simple matching dichotomy coefficients (S4), Anderberg's dichotomy coefficients (S5), Tanimoto's dichotomy coefficients (S6), and tetrachoric correlations.
Dichotomy coefficients. These coefficients relate variables whose values may represent
the presence or absence of an attribute or simply two values. They are documented in
Gower (1985). These coefficients were chosen for SYSTAT because they are metric
and produce symmetric positive semidefinite (Gramian) matrices, provided that you do
not use the pairwise deletion option. This makes them suitable for multidimensional
scaling and factoring as well as clustering. The following table shows how the
similarity coefficients are computed:
               x_j
             1       0
x_i    1     a       b       a + b
       0     c       d       c + d
             a + c   b + d

S2 = a / (a + b + c + d)             Proportion of pairs with both values present
S3 = a / (a + b + c)                 Proportion of pairs with both values present, given that at least one occurs
S4 = (a + d) / (a + b + c + d)       Proportion of pairs where the values of both variables agree
S5 = a / (a + 2(b + c))              S3 standardized by all possible patterns of agreement and disagreement
S6 = (a + d) / (a + 2(b + c) + d)    S4 standardized by all possible patterns of agreement and disagreement
When the absence of an attribute in both variables is deemed to convey no information,
d should not be included in the coefficient (see S3 and S5).
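The table translates directly into code. The sketch below (plain Python, offered only as an illustration of the definitions) counts a, b, c, and d for two 0/1 vectors and returns the five dichotomy coefficients:

import numpy as np

def dichotomy_coefficients(x, y):
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    a = np.sum(x & y)          # both present
    b = np.sum(x & ~y)         # present in x only
    c = np.sum(~x & y)         # present in y only
    d = np.sum(~x & ~y)        # both absent
    n = a + b + c + d
    return {"S2": a / n,
            "S3": a / (a + b + c),
            "S4": (a + d) / n,
            "S5": a / (a + 2 * (b + c)),
            "S6": (a + d) / (a + 2 * (b + c) + d)}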
Tetrachoric correlation. While the data for this measure are binary, they are assumed to
be a random sample from a bivariate normal distribution. For example, let's draw a
horizontal line and a vertical line on this bivariate normal distribution and count the
number of observations in each quadrant.
(Scatterplot of a bivariate normal sample divided by a vertical line at X0 and a horizontal line at Y0; the quadrant counts are 5 upper left, 19 upper right, 17 lower left, and 4 lower right, for 45 observations in all.)
A large proportion of the observations fall in the upper right and lower left quadrants
because the relationship is positive (the Pearson correlation is approximately 0.70).
Correspondingly, if there were a strong negative relationship, the points would
concentrate in the upper left and lower right quadrants. If the original observations are
no longer available but you do have the frequency counts for the four quadrants, try a
tetrachoric correlation.
The computations for the tetrachoric correlation begin by finding estimates of the
inverse cumulative marginal distributions:
z value for x0 = Φ⁻¹((17 + 5)/45)   and   z value for y0 = Φ⁻¹((17 + 4)/45)

and using these values as limits when integrating the bivariate normal density expressed in terms of ρ, the correlation, and then solving for ρ.
If you have the original data, don't bother dichotomizing them because the
tetrachoric correlation has an efficiency of 0.40 compared with the efficient Pearson
correlation estimate.
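For readers who want to see the mechanics, the following Python sketch (using SciPy; it is not SYSTAT's routine, and the solver bounds and parameterization are illustrative choices) recovers a tetrachoric correlation from the four quadrant counts by solving for the bivariate normal correlation whose lower-left quadrant probability matches the observed proportion:

import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm, multivariate_normal

def tetrachoric(ul, ur, ll, lr):
    # quadrant counts: upper-left, upper-right, lower-left, lower-right
    n = ul + ur + ll + lr
    h = norm.ppf((ul + ll) / n)      # threshold on x (proportion left of x0)
    k = norm.ppf((ll + lr) / n)      # threshold on y (proportion below y0)
    target = ll / n                  # observed P(X < x0, Y < y0)

    def f(rho):
        bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
        return bvn.cdf([h, k]) - target

    return brentq(f, -0.999, 0.999)

# For the quadrant counts in the figure above, tetrachoric(5, 19, 17, 4) is roughly 0.7.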
Transposed Data
You can use CORR to compute measures of association on the rows or columns of your
data. Simply transpose the data and then use CORR. This makes sense when you want
to assess similarity between rows. We might be interested in identifying similar cars
from our performance measures, for example. Recall that you cannot transpose a file
that contains character data.
When you compute association measures across rows, however, be sure that the
variables are on comparable scales. Otherwise, a single variable will influence most of
the association. With the cars data, braking and speed are so large that they would
almost uniquely determine the similarity between cars. Consequently, we standardized
the data before transposing them. That way, the correlations measure the similarities
comparably across attributes.
Following is the Pearson correlation matrix for our cars:
Hadi Robust Outlier Detection
Hadi robust outlier detection identifies specific cases as outliers (if there are any) and
then uses the acceptable cases to compute the requested measure in the usual way.
Following are the steps for this procedure (a simplified sketch in Python follows the list):
n Compute a robust covariance matrix by finding the median (instead of the mean) for each variable and using deviations from the median in the calculation of each covariance.
If the resulting matrix is singular, reconstruct another after inflating the smallest
eigenvalues by a small amount.
n Use this robust estimate of the covariance matrix to compute Mahalanobis
distances and then use the distance to rank the cases.
n Use the half of the sample with the lowest ranks to compute the usual covariance
matrix (that is, deviations from the mean).
n Use this covariance matrix to compute new distances for the complete sample and
rerank the cases.
Pearson Correlation Matrix
PORSCHE FERRARI CORVETTE MERCEDES SAAB
PORSCHE 1.000
FERRARI 0.940 1.000
CORVETTE 0.939 0.868 1.000
MERCEDES 0.093 0.212 -0.240 1.000
SAAB -0.506 -0.523 -0.760 0.664 1.000
TOYOTA 0.238 0.429 0.402 -0.379 -0.680
BMW -0.319 -0.095 -0.557 0.854 0.634
HONDA -0.504 -0.730 -0.393 -0.519 0.265
ACURA -0.046 -0.102 0.298 -0.978 -0.770
VW -0.962 -0.928 -0.980 0.079 0.704
CHEVY -0.731 -0.698 -0.491 -0.532 -0.131
TOYOTA BMW HONDA ACURA VW
TOYOTA 1.000
BMW -0.247 1.000
HONDA -0.298 -0.500 1.000
ACURA 0.533 -0.788 0.349 1.000
VW -0.353 0.391 0.552 -0.156 1.000
CHEVY -0.034 -0.064 0.320 0.536 0.525
CHEVY
CHEVY 1.000
Number of observations: 5
n After ranking, select the same number of cases with small ranks as before but add
the case with the next largest rank and repeat the process, each time updating the
covariance matrix, computing and sorting new distances, and increasing the
subsample size by one.
n Continue adding cases until the entering one exceeds an internal limit based on a
chi-square statistic (see Hadi, 1994). The cases remaining (not entered) are
identified as outliers.
n Use the cases that are not identified as outliers to compute the measure requested
in the usual way.
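The steps above can be condensed into a rough sketch. The Python fragment below is a simplified illustration of the forward-search idea only; it is not Hadi's published algorithm or SYSTAT's implementation, and in particular the chi-square cutoff and the handling of singular matrices are simplifications:

import numpy as np
from scipy.stats import chi2

def hadi_outliers(X, alpha=0.05):
    """Flag outliers with a simplified forward search (illustration only)."""
    X = np.asarray(X, float)
    n, p = X.shape

    def distances(center, cov):
        diff = X - center
        return np.einsum("ij,ij->i", diff @ np.linalg.pinv(cov), diff)

    # robust start: scatter of deviations from the medians
    med = np.median(X, axis=0)
    dev = X - med
    d = distances(med, dev.T @ dev / (n - 1))
    subset = np.argsort(d)[: (n + p + 1) // 2]

    cutoff = chi2.ppf(1 - alpha, p)        # simplified internal limit
    while len(subset) < n:
        mean = X[subset].mean(axis=0)
        cov = np.cov(X[subset].T)
        d = distances(mean, cov)
        order = np.argsort(d)
        candidate = order[len(subset)]     # next-closest case to enter
        if d[candidate] > cutoff:
            break
        subset = order[: len(subset) + 1]
    return sorted(set(range(n)) - set(subset))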
Correlations in SYSTAT
Correlations Main Dialog Box
To open the Correlations dialog box, from the menus choose:
Statistics
Correlations
Simple
Variables. Available only if One is selected for Sets. All selected variables are
correlated with all other variables in the list, producing a triangular correlation matrix.
Rows. Available only if Two is selected for Sets. Selected variables are correlated with
all column variables, producing a rectangular matrix.
Columns. Available only if Two is selected for Sets. Selected variables are correlated
with all row variables, producing a rectangular matrix.
Sets. One set creates a single, triangular correlation matrix of all variables in the
Variable(s) list. Two sets creates a rectangular matrix of variables in the Row(s) list
correlated with variables in the Column(s) list.
Listwise. Listwise deletion of missing data. Any case with missing data for any variable
in the list is excluded.
Pairwise. Pairwise deletion of missing data. Only cases with missing data for one of
the variables in the pair being correlated are excluded.
Save file. Saves the correlation matrix to a file.
Types. Type of data or measure. You can select from a variety of distance measures, as
well as measures for continuous data, rank-order data, and binary data.
Measures for Continuous Data
The following measures are available for continuous data:
n Pearson. Produces a matrix of Pearson product-moment correlation coefficients.
Pearson correlations vary between -1 and +1. A value of 0 indicates that neither of two variables can be predicted from the other by using a linear equation. A Pearson correlation of +1 or -1 indicates that one variable can be predicted perfectly by a linear function of the other.
n Covariance. Produces a covariance matrix.
n SSCP. Produces a sum of cross-products matrix. If the Pairwise option is chosen,
sums are weighted by N/n, where n is the count for a pair.
The Pearson, Covariance, and SSCP measures are related. The entries in an SSCP
matrix are sums of squares of deviations (from the mean) and sums of cross-products
of deviations. If you divide each entry by (n - 1), variances result from the sums of squares and covariances from the sums of cross-products. Divide each covariance by the product of the standard deviations (of the two variables) and the result is a correlation.
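These relationships are easy to verify numerically. The following sketch (illustrative Python, not SYSTAT output) starts from a data matrix, builds the SSCP matrix of deviations, divides by n - 1 to obtain covariances, and rescales by the standard deviations to obtain correlations:

import numpy as np

def sscp_cov_corr(X):
    X = np.asarray(X, float)
    n = X.shape[0]
    dev = X - X.mean(axis=0)
    sscp = dev.T @ dev                     # sums of squares and cross-products
    cov = sscp / (n - 1)                   # covariances
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)          # correlations
    return sscp, cov, corr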
Distance and Dissimilarity Measures
Correlations offers two dissimilarity measures and two distance measures:
n Bray-Curtis. Produces a matrix of dissimilarity measures for continuous data.
n QSK. Produces a matrix of symmetric dissimilarity coefficients. Also called the
Kulczynski measure.
n Euclidean. Produces a matrix of Euclidean distances normalized by the sample size.
n City. Produces a matrix of city-block, or first-power, distances (sum of absolute
discrepancies) normalized by the sample size.
Measures for Rank-Order Data
If your data are simply ranks of attributes, or if you want to see how well variables are
associated when you pay attention to rank ordering, you should consider the following
measures available for ranked data:
n Spearman. Produces a matrix of Spearman rank-order correlation coefficients. This
measure is a nonparametric version of the Pearson correlation coefficient, based on
the ranks of the data rather than the actual values.
n Gamma. Produces a matrix of Goodman-Kruskal's gamma coefficients.
n MU2. Produces a matrix of Guttman's mu2 monotonicity coefficients.
n Tau. Produces a matrix of Kendall's tau-b rank-order coefficients.
Measures for Binary Data
These coefficients relate variables assuming only two values. The dichotomy
coefficients work only for dichotomous data scored as 0 or 1.
The following measures are available for binary data:
n Positive matching (S2). Produces a matrix of positive matching dichotomy
coefficients.
n Jaccard (S3). Produces a matrix of Jaccard's dichotomy coefficients.
n Simple matching (S4). Produces a matrix of simple matching dichotomy
coefficients.
n Anderberg (S5). Produces a matrix of Anderberg's dichotomy coefficients.
n Tanimoto (S6). Produces a matrix of Tanimoto's dichotomy coefficients.
n Tetra. Produces a matrix of tetrachoric correlations.
Correlations Options
To specify options for correlations, click Options in the Correlations dialog box.
The following options are available:
Probabilities. Requests probability of each correlation coefficient to test that the
correlation is 0. Appropriate if you select only one correlation coefficient to test.
Bonferroni and Dunn-Sidak use adjusted probabilities. Available only for Pearson
product-moment correlations.
(EM) Estimation. Requests the EM algorithm to estimate Pearson correlation,
covariance, or SSCP matrices from data with missing values. Little's MCAR test is printed along with a graphical display of the pattern of missing values. For robust estimates
where outliers are downweighted, select Normal or t.
n Normal produces maximum likelihood estimates for a contaminated multivariate
normal sample. For the contaminated normal, SYSTAT assumes that the
distribution is a mixture of two normal distributions (same mean, different
variances) with a specified probability of contamination. The Probability value is
the probability of contamination (for example, 0.10), and Variance is the variance
of contamination. Downweighting for the normal model tends to be concentrated
in a few outlying cases.
n t produces maximum likelihood estimates for a t distribution, where df is the
degrees of freedom. Downweighting for the multivariate t model tends to be more
spread out than for the normal model. The degree of downweighting is inversely
related to the degrees of freedom.
Iterations. Specifies the maximum number of iterations for computing the estimates.
Convergence. Defines the convergence criterion. If the relative changes in the covariance entries are less than the specified value, convergence is assumed.
Hadi outlier identification and estimation. Requests the HADI multivariate outlier
detection algorithm to identify outliers and to compute the correlation, covariance,
or SSCP matrix from the remaining cases. Tolerance omits variables with a multiple
R-square value greater than (1 - n), where n is the specified tolerance value.
Using Commands
First, specify your data with USE filename. Then, type CORR and choose your measure
and type:
MEASURE is one of:
For PEARSON, COVARIANCE, and SSCP, the following options are available:
In addition, PEARSON offers BONF, DUNN, and PROB as options.
Full matrix MEASURE varlist / options
Portion of matrix MEASURE rowlist * collist / options
BC QSK EUCLIDEAN CITY SPEARMAN
GAMMA MU2 TAU TETRA S2
S3 S4 S5 S6 PEARSON
COVARIANCE SSCP
EM
T=df
NORMAL=n1,n2
ITER=n
CONV=n
HADI
TOL=n
Usage Considerations
Types of data. CORR uses rectangular data only.
Print options. With PRINT=LONG, SYSTAT prints the mean of each variable. In
addition, for EM estimation, SYSTAT prints an iteration history, missing value
patterns, Little's MCAR test, and mean estimates.
Quick Graphs. CORR includes a SPLOM (matrix of scatterplots) where the data in each
plot correspond to a value in the matrix.
Saving files. CORR saves the correlation matrix or other measure computed. SYSTAT
automatically defines the type of file as CORR, DISS, COVA, SSCP, SIMI, or RECT.
BY groups. CORR analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ=<variable> increases the number of cases by the FREQ
variable.
Case weights. WEIGHT is available in CORR.
Examples
Example 1
Pearson Correlations
This example uses data from the OURWORLD file that contains records (cases) for 57
countries. We are interested in correlations among variables recording the percentage
of the population living in cities, birth rate, gross domestic product per capita, dollars
expended per person for the military, ratio of birth rates to death rates, life expectancy
(in years) for males and females, percentage of the population who can read, and gross
national product per capita in 1986. The input is:
CORR
USE ourworld
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86
The output follows:
The correlations for all pairs of the nine variables are shown here. The bottom of the
output panel shows that the sample size is 49, but the data file has 57 countries. If a
country has one or more missing values, SYSTAT, by default, omits all of the data for
the case. This is called listwise deletion.
The Quick Graph is a matrix of scatterplots with one plot for each entry in the
correlation matrix and histograms of the variables on the diagonal. For example, the
plot of BIRTH_RT against URBAN is at the top left under the histogram for URBAN.
Pearson correlation matrix

URBAN BIRTH_RT GDP_CAP MIL B_TO_D LIFEEXPM LIFEEXPF
URBAN 1.000
BIRTH_RT -0.800 1.000
GDP_CAP 0.625 -0.762 1.000
MIL 0.597 -0.672 0.899 1.000
B_TO_D -0.307 0.511 -0.659 -0.607 1.000
LIFEEXPM 0.776 -0.922 0.664 0.582 -0.211 1.000
LIFEEXPF 0.801 -0.949 0.704 0.619 -0.265 0.989 1.000
LITERACY 0.800 -0.930 0.637 0.562 -0.274 0.911 0.935
GNP_86 0.592 -0.689 0.964 0.873 -0.560 0.633 0.665
LITERACY GNP_86
LITERACY 1.000
GNP_86 0.611 1.000
Number of observations: 49
(Scatterplot matrix of URBAN, BIRTH_RT, GDP_CAP, MIL, B_TO_D, LIFEEXPM, LIFEEXPF, LITERACY, and GNP_86, with histograms on the diagonal)
If linearity does not hold for your variables, your results may be meaningless. A
good way to assess linearity, the presence of outliers, and other anomalies is to
examine the plot for each pair of variables in the scatterplot matrix. The relationships
between GDP_CAP and BIRTH_RT, B_TO_D, LIFEEXPM, and LIFEEXPF do not
appear to be linear. Also, the points in the MIL versus GDP_CAP and GNP_86 versus
MIL displays clump in the lower left corner. It is not wise to use correlations for
describing these relations.
Altering the Format
The correlation matrix for this example wraps (the results for nine variables do not fit
in one panel). You can squeeze in more results by specifying a field width and the number of decimal places. For example, the same correlations printed in a field 6 characters wide are shown below. We request only 2 digits to the right of the decimal instead of 3.
(Using the command language, press F9 to retrieve the previous PEARSON statement
instead of retyping it.)
CORR
USE ourworld
FORM 6 2
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86
The output is:
Notice that while the top row of variable names is truncated to fit within the field
specification, the row names remain complete.
Pearson correlation matrix

URBAN BIRTH_ GDP_CA MIL B_TO_D LIFEEX LIFEEX LITERA GNP_86
URBAN 1.00
BIRTH_RT -0.80 1.00
GDP_CAP 0.62 -0.76 1.00
MIL 0.60 -0.67 0.90 1.00
B_TO_D -0.31 0.51 -0.66 -0.61 1.00
LIFEEXPM 0.78 -0.92 0.66 0.58 -0.21 1.00
LIFEEXPF 0.80 -0.95 0.70 0.62 -0.26 0.99 1.00
LITERACY 0.80 -0.93 0.64 0.56 -0.27 0.91 0.93 1.00
GNP_86 0.59 -0.69 0.96 0.87 -0.56 0.63 0.67 0.61 1.00

Number of observations: 49
Requesting a Portion of a Matrix
You can request that only a portion of the matrix be computed. The input follows:
CORR
USE ourworld
FORMAT
PEARSON lifeexpm lifeexpf literacy gnp_86 *,
urban birth_rt gdp_cap mil b_to_d
The resulting output is:
Pearson correlation matrix

URBAN BIRTH_RT GDP_CAP MIL B_TO_D
LIFEEXPM 0.776 -0.922 0.664 0.582 -0.211
LIFEEXPF 0.801 -0.949 0.704 0.619 -0.265
LITERACY 0.800 -0.930 0.637 0.562 -0.274
GNP_86 0.592 -0.689 0.964 0.873 -0.560

Number of observations: 49

These correlations correspond to the lower left corner of the first matrix.

Example 2
Transformations
If relationships between variables appear nonlinear, using a measure of linear association is not advised. Fortunately, transformations of the variables may yield linear relationships. You can then use the linear relation measures, but all conclusions regarding the relationships are relative to the transformed variables instead of the original variables.
In the Pearson correlations example, we observed nonlinear relationships involving GDP_CAP, MIL, and GNP_86. Here we log transform these variables and compare the resulting correlations to those for the untransformed variables. The input is:
CORR
USE ourworld
LET (gdp_cap,mil,gnp_86) = L10(@)
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86
Notice that we use SYSTAT's shortcut notation to make the transformation. Alternatively, you could use:
LET gdp_cap = L10(gdp_cap)
LET mil = L10(mil)
LET gnp_86 = L10(gnp_86)
The output follows:
Means
URBAN BIRTH_RT GDP_CAP MIL B_TO_D
52.8776 25.9592 3.3696 1.6954 2.8855
LIFEEXPM LIFEEXPF LITERACY GNP_86
65.4286 70.5714 74.7265 3.2791

Pearson correlation matrix

URBAN BIRTH_RT GDP_CAP MIL B_TO_D
URBAN 1.0000
BIRTH_RT -0.8002 1.0000
GDP_CAP 0.7636 -0.9189 1.0000
MIL 0.6801 -0.8013 0.8947 1.0000
B_TO_D -0.3074 0.5106 -0.5293 -0.5374 1.0000
LIFEEXPM 0.7756 -0.9218 0.8599 0.7267 -0.2113
LIFEEXPF 0.8011 -0.9488 0.8954 0.7634 -0.2648
LITERACY 0.7997 -0.9302 0.8337 0.7141 -0.2737
GNP_86 0.7747 -0.8786 0.9736 0.8773 -0.4411
LIFEEXPM LIFEEXPF LITERACY GNP_86
LIFEEXPM 1.0000
LIFEEXPF 0.9887 1.0000
LITERACY 0.9110 0.9350 1.0000
GNP_86 0.8610 0.8861 0.8404 1.0000

Number of observations: 49
(Scatterplot matrix of the same nine variables after log transforming GDP_CAP, MIL, and GNP_86)
In the scatterplot matrix, linearity has improved in the plots involving GDP_CAP, MIL,
and GNP_86. Look at the difference between the correlations before and after
transformation.

                gdp_cap vs.        mil vs.           gnp_86 vs.
Transformation  no       yes       no       yes      no       yes
urban           0.625    0.764     0.597    0.680    0.592    0.775
birth_rt        0.762    0.919     0.672    0.801    0.689    0.879
lifeexpm        0.664    0.860     0.582    0.727    0.633    0.861
lifeexpf        0.704    0.895     0.619    0.763    0.665    0.886
literacy        0.637    0.834     0.562    0.714    0.611    0.840
After log transforming the variables, linearity has improved in the plots, and many of
the correlations are stronger.
Example 3
Missing Data: Pairwise Deletion
To specify pairwise deletion, the input is:
USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86 / PAIR
The output is:
Means
URBAN BIRTH_RT GDP_CAP MIL B_TO_D
52.821 26.351 3.372 1.775 2.873
LIFEEXPM LIFEEXPF LITERACY GNP_86
65.088 70.123 73.563 3.293

Pearson correlation matrix

URBAN BIRTH_RT GDP_CAP MIL B_TO_D
URBAN 1.000
BIRTH_RT -0.781 1.000
GDP_CAP 0.778 -0.895 1.000
MIL 0.683 -0.687 0.857 1.000
B_TO_D -0.248 0.535 -0.472 -0.377 1.000
LIFEEXPM 0.796 -0.892 0.854 0.696 -0.172
LIFEEXPF 0.816 -0.924 0.891 0.721 -0.230
LITERACY 0.807 -0.930 0.832 0.646 -0.291
GNP_86 0.775 -0.881 0.974 0.881 -0.455
LIFEEXPM LIFEEXPF LITERACY GNP_86
LIFEEXPM 1.000
LIFEEXPF 0.989 1.000
LITERACY 0.911 0.937 1.000
GNP_86 0.863 0.888 0.842 1.000

Pairwise frequency table

URBAN BIRTH_RT GDP_CAP MIL B_TO_D
URBAN 56
BIRTH_RT 56 57
GDP_CAP 56 57 57
MIL 55 56 56 56
B_TO_D 56 57 57 56 57
LIFEEXPM 56 57 57 56 57
LIFEEXPF 56 57 57 56 57
LITERACY 56 57 57 56 57
GNP_86 49 50 50 50 50
LIFEEXPM LIFEEXPF LITERACY GNP_86
LIFEEXPM 57
LIFEEXPF 57 57
LITERACY 57 57 57
GNP_86 50 50 50 50

The sample size for each variable is reported as the diagonal of the pairwise frequency table; sample sizes for complete pairs of cases are reported off the diagonal. There are 57 countries in this sample; 56 reported the percentage living in cities (URBAN), and 50 reported the gross national product per capita in 1986 (GNP_86). There are 49 countries that have values for both URBAN and GNP_86.
The means are printed because we specified PRINT=LONG. Since pairwise deletion is requested, all available values are used to compute each mean; that is, these means are the same as those computed by the Statistics procedure.

Example 4
Missing Data: EM Estimation
This example uses the same variables used in the transformations example. To specify EM estimation, the input is:
CORR
USE ourworld
LET (gdp_cap,mil,gnp_86) = L10(@)
IDVAR = country$
GRAPH = NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm,
lifeexpf literacy gnp_86 / EM
The output follows:
SYSTAT prints missing-value patterns for the data. Forty-nine cases in the sample are
complete (an X is printed for each of the nine variables). Periods are inserted where
data are missing. The value of the first variable, URBAN, is missing for one case, while
the value of the last variable, GNP_86, is missing for six cases. The last row of the
pattern indicates that the values of the fourth variable, MIL, and the last variable,
GNP_86, are both missing for one case.
EM Algorithm Iteration Maximum Error -2*log(likelihood)
--------- ------------- ------------------
1 1.092328 24135.483249
2 1.023878 7625.491302
3 0.643113 6932.605472
4 0.666125 6691.458724
5 0.857590 6573.199525
6 2.718236 6538.852550
7 0.728468 6531.689766
8 0.196577 6530.369252
9 0.077590 6530.167056
10 0.034510 6530.159651
11 0.016278 6530.176410
12 0.007986 6530.190050
13 0.004050 6530.198695
14 0.002120 6530.203895
15 0.001145 6530.207008
16 0.000637 6530.208887

No.of Missing value patterns
Cases (X=nonmissing; .=missing)
49 XXXXXXXXX
1 .XXXXXXXX
6 XXXXXXXX.
1 XXX.XXXX.

Little MCAR test statistic: 35.757 df = 23 prob = 0.044

EM estimate of means
URBAN BIRTH_RT GDP_CAP MIL B_TO_D
53.152 26.351 3.372 1.754 2.873
LIFEEXPM LIFEEXPF LITERACY GNP_86
65.088 70.123 73.563 3.284

EM estimated correlation matrix

URBAN BIRTH_RT GDP_CAP MIL B_TO_D
URBAN 1.000
BIRTH_RT -0.782 1.000
GDP_CAP 0.779 -0.895 1.000
MIL 0.700 -0.697 0.863 1.000
B_TO_D -0.259 0.535 -0.472 -0.357 1.000
LIFEEXPM 0.796 -0.892 0.854 0.713 -0.172
LIFEEXPF 0.816 -0.924 0.891 0.738 -0.230
LITERACY 0.808 -0.930 0.832 0.668 -0.291
GNP_86 0.796 -0.831 0.968 0.874 -0.342
LIFEEXPM LIFEEXPF LITERACY GNP_86
LIFEEXPM 1.000
LIFEEXPF 0.989 1.000
LITERACY 0.911 0.937 1.000
GNP_86 0.863 0.885 0.828 1.000
Little's MCAR (missing completely at random) test has a probability of 0.044, indicating that (at the 0.05 level) we reject the hypothesis that the nine missing values are randomly missing. This test has limited power when the sample of incomplete cases is small, and it also offers no direct evidence on the validity of the MAR assumption.
Example 5
Probabilities Associated with Correlations
To request the usual (uncorrected) probabilities for a correlation matrix using pairwise
deletion:
USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86 / PAIR PROB
The output is:
The p values that are appropriate for making statements regarding one specific
correlation are shown here. By themselves, these values are not very informative.
These p values are pseudo-probabilities because they do not reflect the number of
correlations being tested. If pairwise deletion is used, the problem is even worse,
although many statistics packages print probabilities as if they meant something in this
case, too.
Bartlett Chi-square statistic: 815.067 df=36 Prob= 0.000
Matrix of Probabilities

URBAN BIRTH_RT GDP_CAP MIL B_TO_D
URBAN 0.0
BIRTH_RT 0.000 0.0
GDP_CAP 0.000 0.000 0.0
MIL 0.000 0.000 0.000 0.0
B_TO_D 0.065 0.000 0.000 0.004 0.0
LIFEEXPM 0.000 0.000 0.000 0.000 0.202
LIFEEXPF 0.000 0.000 0.000 0.000 0.085
LITERACY 0.000 0.000 0.000 0.000 0.028
GNP_86 0.000 0.000 0.000 0.000 0.001
LIFEEXPM LIFEEXPF LITERACY GNP_86
LIFEEXPM 0.0
LIFEEXPF 0.000 0.0
LITERACY 0.000 0.000 0.0
GNP_86 0.000 0.000 0.000 0.0
SYSTAT computes the Bartlett chi-square test whenever you request probabilities
for more than one correlation. This tests a global hypothesis concerning the
significance of all of the correlations in the matrix:

chi-square = -[(N - 1) - (2p + 5)/6] ln |R|
where N is the total sample size (or the smallest sample size for any pair in the matrix
if pairwise deletion is used), p is the number of variables, and |R| is the determinant of
the correlation matrix. This test is sensitive to non-normality, and the test statistic is
only asymptotically distributed (for large samples) as chi-square. Nevertheless, it can
serve as a guideline.
If the Bartlett test is not significant, dont even look at the significance of individual
correlations. In this example, the test is significant, which indicates that there may be
some real correlations among the variables. The Bartlett test is sensitive to non-
normality and can be used only as a guide. Even if the Bartlett test is significant, you
cannot accept the nominal p values as the true family probabilities associated with each
correlation.
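Both the Bartlett statistic and the Bonferroni adjustment discussed next are easy to compute by hand. The sketch below (plain Python, shown only to make the formula given above and the adjustment concrete; the N that SYSTAT uses under pairwise deletion may differ) takes a correlation matrix R and a sample size N:

import numpy as np
from scipy.stats import chi2

def bartlett_test(R, N):
    # chi-square = -[(N - 1) - (2p + 5)/6] * ln|R|,  df = p(p - 1)/2
    p = R.shape[0]
    stat = -((N - 1) - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return stat, df, chi2.sf(stat, df)

def bonferroni(pvalue, n_tests):
    # adjusted probability for one of n_tests correlations
    return min(1.0, pvalue * n_tests)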
Bonferroni Probabilities with Pairwise Deletion
Lets now examine the probabilities adjusted by the Bonferroni method that provides
protection for multiple tests. Remember that the log-transformed values from the
transformations example are still in effect. The input is:
USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86 / PAIR BONF
The output follows:
Compare these results with those for the 36 tests using uncorrected probabilities.
Notice that some correlations, such as those for B_TO_D with MIL, LITERACY, and
GNP_86, are no longer significant.
Bonferroni Probabilities for EM Estimates
You can request the Bonferroni adjusted probabilities for an EM estimated matrix by
specifying:
The probabilities follow:
Bartlett Chi-square statistic: 815.067 df=36 Prob= 0.000

Matrix of Bonferroni Probabilities

URBAN BIRTH_RT GDP_CAP MIL B_TO_D
URBAN 0.0
BIRTH_RT 0.000 0.0
GDP_CAP 0.000 0.000 0.0
MIL 0.000 0.000 0.000 0.0
B_TO_D 1.000 0.001 0.008 0.150 0.0
LIFEEXPM 0.000 0.000 0.000 0.000 1.000
LIFEEXPF 0.000 0.000 0.000 0.000 1.000
LITERACY 0.000 0.000 0.000 0.000 1.000
GNP_86 0.000 0.000 0.000 0.000 0.032
LIFEEXPM LIFEEXPF LITERACY GNP_86
LIFEEXPM 0.0
LIFEEXPF 0.000 0.0
LITERACY 0.000 0.000 0.0
GNP_86 0.000 0.000 0.000 0.0
USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86 / EM BONF
Bartlett Chi-square statistic: 821.288 df=36 Prob= 0.000

Matrix of Bonferroni Probabilities

URBAN BIRTH_RT GDP_CAP MIL B_TO_D
URBAN 0.0
BIRTH_RT 0.000 0.0
GDP_CAP 0.000 0.000 0.0
MIL 0.000 0.000 0.000 0.0
B_TO_D 1.000 0.001 0.008 0.248 0.0
LIFEEXPM 0.000 0.000 0.000 0.000 1.000
LIFEEXPF 0.000 0.000 0.000 0.000 1.000
LITERACY 0.000 0.000 0.000 0.000 1.000
GNP_86 0.000 0.000 0.000 0.000 0.537
LIFEEXPM LIFEEXPF LITERACY GNP_86
LIFEEXPM 0.000
LIFEEXPF 0.000 0.0
LITERACY 0.000 0.000 0.0
GNP_86 0.000 0.000 0.000 0.000
Example 6
Hadi Robust Outlier Detection
If only one or two variables have outliers among many well-behaved variables, the outliers may be masked. Let's look for outliers among four variables. The input is:
USE ourworld
CORR
LET (gdp_cap, mil) = L10(@)
GRAPH = NONE
PRINT = LONG
IDVAR = country$
PEARSON gdp_cap mil b_to_d literacy / HADI
PLOT GDP_CAP*B_TO_D*LITERACY / SPIKE XGRID YGRID AXES=BOOK,
SCALE=L SYMBOL=GROUP$ SIZE= 1.250 ,1.250 ,1.250
The output is:
These 15 outliers are identified:
Case Distance
------------ ------------
Venezuela 4.48653
CostaRica 4.55336
Senegal 4.66615
Sudan 4.74882
Ethiopia 4.82013
Pakistan 5.05827
Libya 5.10295
Haiti 5.44901
Bangladesh 5.47974
Yemen 5.84027
Gambia 5.84202
Iraq 5.84507
Guinea 6.12308
Somalia 6.18465
Mali 6.30091

Means of variables of non-outlying cases
GDP_CAP MIL B_TO_D LITERACY
3.634 1.967 2.533 88.183

HADI estimated correlation matrix

GDP_CAP MIL B_TO_D LITERACY
GDP_CAP 1.000
MIL 0.860 1.000
B_TO_D -0.839 -0.753 1.000
LITERACY 0.729 0.642 -0.698 1.000

Number of observations: 56
Fifteen countries are identified as outliers. We suspect that the sample may not be
homogeneous, so we request a plot labeled by GROUP$. Because PRINT=LONG is in effect and COUNTRY$ is specified as an ID variable, the country names appear in the output. The correlations at the end of the output are computed using only the cases that are not identified as outliers.
In the plot, we see that Islamic countries tend to fall between New World and
European countries with respect to birth-to-death ratio and have the lowest literacy.
European countries have the highest literacy and GDP_CAP values.
Stratifying the Analysis
We'll use Hadi for each of the three groups separately:
USE ourworld
CORR
LET (gdp_cap, mil) = L10(@)
GRAPH = NONE
PRINT = LONG
IDVAR = country$
BY group$
PEARSON gdp_cap mil b_to_d literacy / HADI
BY
(3-D spike plot of GDP_CAP against B_TO_D and LITERACY, with countries plotted as E for Europe, I for Islamic, and N for NewWorld)
For clarity, we edited the following output by moving the panels of means to the end:
When computations are done separately for each group, Portugal is the only outlier,
and the within-groups correlations differ markedly from group to group and from those
for the complete sample. By scanning the means, we also see that the centroids for the
three groups are quite different.
The following results are for:
GROUP$ = Europe

These 1 outliers are identified:
Case Distance
------------ ------------
Portugal 5.72050

HADI estimated correlation matrix

GDP_CAP MIL B_TO_D LITERACY
GDP_CAP 1.000
MIL 0.474 1.000
B_TO_D -0.092 -0.173 1.000
LITERACY 0.259 0.263 0.136 1.000

Number of observations: 20


The following results are for:
GROUP$ = Islamic

HADI estimated correlation matrix

GDP_CAP MIL B_TO_D LITERACY
GDP_CAP 1.000
MIL 0.877 1.000
B_TO_D 0.781 0.882 1.000
LITERACY 0.600 0.605 0.649 1.000

Number of observations: 15


The following results are for:
GROUP$ = NewWorld

HADI estimated correlation matrix

GDP_CAP MIL B_TO_D LITERACY
GDP_CAP 1.000
MIL 0.674 1.000
B_TO_D -0.246 -0.287 1.000
LITERACY 0.689 0.561 -0.045 1.000

Number of observations: 21
Means of variables of non-outlying cases (Europe)
GDP_CAP MIL B_TO_D LITERACY
4.059 2.404 1.260 98.316

Means of variables of non-outlying cases (Islamic)
GDP_CAP MIL B_TO_D LITERACY
2.764 1.400 3.547 36.733

Means of variables of non-outlying cases (NewWorld)
GDP_CAP MIL B_TO_D LITERACY
3.214 1.466 3.951 79.957
Example 7
Spearman Correlations
As an example, we request Spearman correlations for the same data used in the Pearson
correlation and Transformations examples. It is often useful to compute both a
Spearman and a Pearson matrix using the same data. The absolute difference between
the two can reveal unusual features such as outliers and highly skewed distributions.
The input is:
USE ourworld
CORR
GRAPH = NONE
SPEARMAN urban birth_rt gdp_cap mil b_to_d,
lifeexpm lifeexpf literacy gnp_86 / PAIR
The correlation matrix follows:
Spearman correlation matrix

URBAN BIRTH_RT GDP_CAP MIL B_TO_D
URBAN 1.000
BIRTH_RT -0.749 1.000
GDP_CAP 0.777 -0.874 1.000
MIL 0.678 -0.670 0.848 1.000
B_TO_D -0.381 0.689 -0.597 -0.498 1.000
LIFEEXPM 0.731 -0.856 0.834 0.633 -0.410
LIFEEXPF 0.771 -0.902 0.910 0.709 -0.501
LITERACY 0.760 -0.868 0.882 0.696 -0.576
GNP_86 0.767 -0.847 0.973 0.867 -0.543
LIFEEXPM LIFEEXPF LITERACY GNP_86
LIFEEXPM 1.000
LIFEEXPF 0.965 1.000
LITERACY 0.813 0.866 1.000
GNP_86 0.834 0.901 0.909 1.000

Note that many of these correlations are closer to the Pearson correlations for the log-transformed data than they are to the correlations for the raw data.

Example 8
S2 and S3 Coefficients
The choice among the binary S measures depends on what you want to state about your variables. In this example, we request S2 and S3 to study responses made by 256 subjects to a depression inventory (Afifi and Clark, 1984). These data are stored in the SURVEY2 data file that has one record for each respondent with answers to 20 questions about depression. Each subject was asked, for example, "Last week, did you cry less than 1 day (code 0), 1 to 2 days (code 1), 3 to 4 days (code 2), or 5 to 7 days (code 3)?" The distributions of the answers appear to be Poisson, so they are not
satisfactory for Pearson correlations. Here we dichotomize the behaviors or feelings as "Did it occur or did it not?" by using transformations of the form:
LET blue = blue <> 0
The result is true (1) when the behavior or feeling is present or false (0) when it is absent. We use SYSTAT's shortcut notation to do this for 7 of the 20 questions. For each pair of feelings or behaviors, S2 indicates the proportion of subjects with both, and S3 indicates the proportion of times both occurred given that one occurs. To perform this example:
USE survey2
CORR
LET (blue,depress,cry,sad,no_eat,getgoing,talkless) = @ <> 0
GRAPH = NONE
S2 blue depress cry sad no_eat getgoing talkless
S3 blue depress cry sad no_eat getgoing talkless
The matrices follow:
S2 (Russell and Rao) binary similarity coefficients

BLUE DEPRESS CRY SAD NO_EAT
BLUE 0.254
DEPRESS 0.207 0.422
CRY 0.090 0.113 0.133
SAD 0.188 0.313 0.117 0.391
NO_EAT 0.117 0.129 0.051 0.137 0.246
GETGOING 0.180 0.309 0.086 0.258 0.152
TALKLESS 0.117 0.156 0.059 0.145 0.098
GETGOING TALKLESS
GETGOING 0.520
TALKLESS 0.172 0.246

Number of observations: 256

> S3 blue depress cry sad no_eat getgoing talkless

S3 (Jaccard) binary similarity coefficients

BLUE DEPRESS CRY SAD NO_EAT
BLUE 1.000
DEPRESS 0.442 1.000
CRY 0.303 0.257 1.000
SAD 0.410 0.625 0.288 1.000
NO_EAT 0.306 0.239 0.155 0.273 1.000
GETGOING 0.303 0.488 0.152 0.395 0.248
TALKLESS 0.306 0.305 0.183 0.294 0.248
GETGOING TALKLESS
GETGOING 1.000
TALKLESS 0.289 1.000

Number of observations: 256
The frequencies for DEPRESS and SAD are:

                      SAD
                     1      0
   DEPRESS    1     80     20
              0     28    128

For S2, the result is 80/256 = 0.313; for S3, 80/128 = 0.625.
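Both coefficients can be computed directly from two 0/1 vectors. The Python sketch below (not SYSTAT code) builds vectors that reproduce the DEPRESS and SAD cell counts above and recovers the same two values.

import numpy as np

def s2_s3(a, b):
    # S2 (Russell and Rao) and S3 (Jaccard) similarity for two 0/1 vectors.
    a = np.asarray(a, bool)
    b = np.asarray(b, bool)
    both = np.sum(a & b)       # present in both
    either = np.sum(a | b)     # present in at least one
    return both / len(a), both / either

depress = np.r_[np.ones(100), np.zeros(156)]                        # 80 + 20 ones
sad = np.r_[np.ones(80), np.zeros(20), np.ones(28), np.zeros(128)]
print(s2_s3(depress, sad))     # (0.3125, 0.625), matching the table above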
Example 9
Tetrachoric Correlation
As an example, we use the bivariate normal data in the SYSTAT data file named
TETRA. The input is:
The output follows:
For our single pair of variables, the tetrachoric correlation is 0.81.
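For a quick check outside SYSTAT, one rough textbook approximation to the tetrachoric correlation of a 2 x 2 table with cells a, b, c, d is cos(pi / (1 + sqrt(ad/bc))). The Python sketch below uses hypothetical counts; it is only an approximation and is not the estimator SYSTAT uses.

import numpy as np

def tetra_approx(a, b, c, d):
    # Cosine approximation to the tetrachoric correlation (approximation only).
    return np.cos(np.pi / (1 + np.sqrt((a * d) / (b * c))))

print(tetra_approx(40, 10, 10, 40))   # hypothetical counts; about 0.81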
Computation
All computations are implemented in double precision.
Algorithms
The computational algorithms use provisional means, sums of squares, and cross-
products (Spicer, 1972).
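In other words, the means and the corrected sums of squares and cross-products are updated one case at a time rather than in a second pass over the data. A minimal Python sketch of such a provisional update (ours, not SYSTAT's code) looks like this:

def provisional_stats(pairs):
    # Running means, corrected sums of squares, and cross-products, updated case by case.
    n = 0
    mean_x = mean_y = 0.0
    ss_x = ss_y = sp_xy = 0.0
    for x, y in pairs:
        n += 1
        dx = x - mean_x
        dy = y - mean_y
        mean_x += dx / n
        mean_y += dy / n
        ss_x += dx * (x - mean_x)
        ss_y += dy * (y - mean_y)
        sp_xy += dx * (y - mean_y)
    return mean_x, mean_y, ss_x, ss_y, sp_xy

# The Pearson correlation is then sp_xy / (ss_x * ss_y) ** 0.5.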
USE tetra
FREQ = count
CORR
TETRA x y
Tetrachoric correlations

X Y
X 1.000
Y 0.810 1.000

Number of observations: 45
For the rank-order coefficients (Gamma, Mu2, Spearman, and Tau), keep in mind
that these are time consuming. Spearman requires sorting and ranking the data before
doing the same work done by Pearson. The Gamma and Mu2 items require
computations between all possible pairs of observations. Thus, their computing time is
combinatoric.
Missing Data
If you have missing data, CORR can handle them in three ways: listwise deletion,
pairwise deletion, and EM estimation. Listwise deletion is the default. If there are
missing data and pairwise deletion is used, SYSTAT displays a table of frequencies
between all possible pairs of variables after the correlation matrix.
Listwise and Pairwise Deletion
Listwise deletion removes from computations any observation with a value missing on
any variable included in the correlation matrix requested. That is, SYSTAT will not use
a case if one or more values is missing.
Pairwise deletion is listwise deletion done separately for every pair of selected
variables. In other words, counts, sums of squares, and sums of cross-products are
computed separately for every pair of variables in the file. With pairwise deletion, you
get the same correlation (covariance, etc.) for two variables containing missing data if
you select them alone or with other variables containing missing data. With listwise
deletion, correlations under these two circumstances may differ, depending on the
pattern of missing data among the other variables in the file.
Pairwise deletion takes considerably more computer time because the sums of
cross-products for each pair must be saved in a temporary disk file. If you use the
pairwise deletion to compute an SSCP matrix, the sums of squares and cross-products
are weighted by N/n, where N is the number of cases in the whole file and n is the
number of cases with nonmissing values in a given pair.
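The distinction is easy to demonstrate outside SYSTAT with pandas (not part of SYSTAT; the tiny data frame below is invented): DataFrame.corr() drops missing values pair by pair, while dropping incomplete rows first gives the listwise result.

import numpy as np
import pandas as pd

df = pd.DataFrame({"X": [1, 2, np.nan, 4, 5, 6],
                   "Y": [2, 1, 4, np.nan, 6, 5],
                   "Z": [1, np.nan, 3, 4, 5, 7]})

pairwise = df.corr()            # each pair uses all cases with both values present
listwise = df.dropna().corr()   # only complete cases enter any correlation
print(pairwise, listwise, sep="\n\n")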
Perhaps because it is so convenient, pairwise deletion is a popular method for
computing correlations on matrices with missing data. Many regression programs
include it as a standard method for computing regression estimates from a covariance
or correlation matrix.
Ironically, pairwise deletion is one of the worst ways to handle missing values. If as
few as 20% of the values in a data matrix are missing, it is not difficult to find two
correlations that were computed using substantially different subsets of the cases. In
such cases, it is common to encounter error messages that the matrix is singular in
regression programs and to get eigenvalues less than 0 in factor analysis.
But, more important, classical statistical analyses require complete cases. For
exploration, this restriction can be circumvented by identifying one or more variables
that are not needed, deleting them, and requesting the desired analysis; there should
be more complete cases for this smaller set of variables.
If you have missing values, you may want to compare results from pairwise deletion
with those from the EM method. Or, you may want to take the time to replace the
missing values in the raw data by examining similar cases or variables with nonmissing
values. One way to do this is to compute a regression equation to predict each variable
from nonmissing data on other variables and then use that equation to predict missing
values. For more information about missing values, see Little and Rubin (1987).
EM Estimation
Instead of pairwise deletion, many data analysts prefer to use an EM algorithm when
estimating correlations, covariances, or an SSCP matrix. EM uses the maximum
likelihood method to compute the estimates. This procedure defines a model for the
partially missing data and bases inferences on the likelihood under that model. Each
iteration consists of an E step and an M step. The E step finds the conditional
expectation of the missing data given the observed values and current estimates of
the parameters. These expectations are then substituted for the missing data. For the
M step, maximum likelihood estimation is performed as though the missing data had
been filled in. "Missing data" is enclosed in quotation marks because the missing
values are not being directly filled but, rather, functions of them are used in the log-
likelihood.
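As a concrete illustration, here is a minimal Python sketch of the E and M steps for the special case of two variables in which only the first has missing values (numpy is not part of SYSTAT, the function is ours, and SYSTAT's implementation handles the general multivariate case).

import numpy as np

def em_bivariate(x, y, iters=50):
    # EM estimates of means, variances, and covariance when some x are missing (MAR)
    # and y is completely observed. A sketch only.
    x, y = np.asarray(x, float), np.asarray(y, float)
    miss = np.isnan(x)
    mx, my = np.nanmean(x), np.mean(y)
    vx, vy = np.nanvar(x), np.var(y)
    cxy = np.cov(x[~miss], y[~miss])[0, 1]
    for _ in range(iters):
        # E step: conditional expectations of x and x**2 given y for the missing cells
        beta = cxy / vy
        resvar = vx - beta * cxy                       # Var(x | y)
        ex = np.where(miss, mx + beta * (y - my), x)
        ex2 = np.where(miss, ex ** 2 + resvar, x ** 2)
        exy = np.where(miss, ex * y, x * y)
        # M step: maximum likelihood updates from the completed sufficient statistics
        mx, my = ex.mean(), y.mean()
        vx = ex2.mean() - mx ** 2
        vy = np.mean(y ** 2) - my ** 2
        cxy = exy.mean() - mx * my
    return mx, my, vx, vy, cxy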
You should take care in assessing the pattern of how the values are missing. Given
variables X and Y (age and income, for example), is the probability of a response:
n Independent of the values of X and Y? That is, is the probability that income is
recorded the same for all people regardless of their ages or incomes? The recorded
or observed values of income form a random subsample of the true incomes for all
of the people in the sample. Little and Rubin call this pattern MCAR (Missing
Completely At Random).
n Dependent on X but not on Y? In this case, the probability that income is recorded
depends on the subject's age, so the probability varies by age but not by income
within that age group. This pattern is called MAR (Missing At Random).
n Dependent on Y and possibly X also? In this case, the probability that income is
present varies by the value of income within each age group. This is not an unusual
pattern for real-world applications.
To use the EM algorithm, your data should be MCAR or at least MAR.
References
Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.:
Lifetime Learning Publications.
Faith, D. P., Minchin, P., and Belbin, L. (1987). Compositional dissimilarity as a robust
measure of ecological distance. Vegetatio, 69, 57–68.
Goodman, L. A. and Kruskal, W. H. (1954). Measures of association for cross-
classification. Journal of the American Statistical Association, 49, 732–764.
Gower, J. C. (1985). Measures of similarity, dissimilarity, and distance. In Kotz, S. and
Johnson, N. L. Encyclopedia of Statistical Sciences, vol. 5. New York: John Wiley &
Sons, Inc.
Hadi, A. S. (1994). A modification of a method for the detection of outliers in multivariate
samples. Journal of the Royal Statistical Society, Series B, 56, No. 2.
Little, R. J. A. and Rubin, D. B. (1987). Statistical analyses with missing data. New York:
John Wiley & Sons, Inc.
Shye, S., ed. (1978). Theory construction and data analysis in the behavioral sciences. San
Francisco: Jossey-Bass, Inc.


Chapter 7
Correspondence Analysis
Leland Wilkinson
Correspondence analysis allows you to examine the relationship between categorical
variables graphically. It computes simple and multiple correspondence analysis for
two-way and multiway tables of categorical variables, respectively. Tables are
decomposed into row and column coordinates, which are displayed in a graph.
Categories that are similar to each other appear close to each other in the graphs.
Statistical Background
Correspondence analysis is a method for decomposing a table of data into row and
column coordinates that can be displayed graphically. With this technique, a two-way
table can be represented in a two-dimensional graph with points for rows and
columns. These coordinates are computed with a Singular Value Decomposition
(SVD), which factors a matrix into the product of three matrices: a collection of left
singular vectors, a matrix of singular values, and a collection of right singular
vectors. Greenacre (1984) is the most comprehensive reference. Hill (1974) and
Jobson (1992) cover the major topics more briefly.
The Simple Model
The simple correspondence analysis model decomposes a two-way table. This
decomposition begins with a matrix of standardized deviates, computed for each cell
in the table as follows:

$$z_{ij} = \frac{1}{\sqrt{N}} \cdot \frac{o_{ij} - e_{ij}}{\sqrt{e_{ij}}}$$

where N is the sum of the table counts $n_{ij}$ over all cells, $o_{ij}$ is the observed count
for cell ij, and $e_{ij}$ is the expected count for cell ij based on an independence model.
The second term in this equation is a cell's contribution to the $\chi^2$ test-for-independence
statistic. Thus, the sum of the squared $z_{ij}$ over all cells in the table is the same as
$\chi^2/N$. Finally, the row mass for row i is $n_{i.}/N$ and the column mass for column j
is $n_{.j}/N$.
The next step is to compute the matrix of cross-products from this matrix of
deviates:

$$S = Z'Z$$

This S matrix has $t = \min(r-1, c-1)$ nonzero eigenvalues, where r and c are the
row and column dimensions of the original table, respectively. The sum of these
eigenvalues is $\chi^2/N$ (which is termed total inertia). It is this matrix that is
decomposed as follows:

$$S = UDV'$$

where U is a matrix of row vectors, V is a matrix of column vectors, and D is a diagonal
matrix of the eigenvalues. The coordinates actually plotted are standardized from U
(for rows), so that

$$\frac{\chi^2}{N} = \sum_{i=1}^{r} \sum_{j=1}^{t} \frac{n_{i.}}{N}\, x_{ij}^2$$

The coordinates are similarly standardized from V (for columns).
The Multiple Model
The multiple correspondence model decomposes higher-way tables. Suppose we have
a multiway table of dimension $k_1$ by $k_2$ by $k_3$ by .... The multiple model begins with
an n by p matrix Z of dummy-coded profiles, where n = the total number of cases in the
table and $p = k_1 + k_2 + k_3 + \ldots$. This matrix is used to create a cross-products matrix:

$$S = Z'Z$$

which is rescaled and decomposed with a singular value decomposition, as before. See
Jobson (1992) for further information.
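The decomposition is easy to verify outside SYSTAT. The Python sketch below (numpy is not part of SYSTAT, and the function name is ours) forms the standardized deviates, takes their singular value decomposition, and rescales the singular vectors by the masses; the squared singular values are the eigenvalues whose sum is the total inertia.

import numpy as np

def simple_ca(table):
    # Simple correspondence analysis of a two-way table of counts (a sketch,
    # not SYSTAT's CORAN code).
    F = np.asarray(table, float)
    N = F.sum()
    r = F.sum(axis=1) / N                         # row masses
    c = F.sum(axis=0) / N                         # column masses
    E = N * np.outer(r, c)                        # expected counts under independence
    Z = (F - E) / np.sqrt(E) / np.sqrt(N)         # standardized deviates z_ij
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    eigenvalues = d ** 2                          # sum equals chi-square / N
    row_coords = (U / np.sqrt(r)[:, None]) * d    # standardized row coordinates
    col_coords = (Vt.T / np.sqrt(c)[:, None]) * d
    return eigenvalues, row_coords, col_coords

Applied to the two-way table of the first example below, the eigenvalues should match those that CORAN prints with the total inertia.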
Correspondence Analysis in SYSTAT
Correspondence Analysis Main Dialog Box
To open the Correspondence Analysis dialog box, from the menus choose:
Statistics
Data Reduction
Correspondence Analysis
A correspondence analysis is conducted by specifying a model and estimating it.
Dependent(s). Select the variable(s) you want to examine. The dependent variable(s)
should be categorical. To analyze a two-way table (simple correspondence analysis),
select a variable defining the rows. Selecting multiple dependent variables (and no
independent variables) yields a multiple correspondence model.
Independent(s). To analyze a two-way table, select a categorical variable defining the
columns of the table.
You can specify one of two methods for handling missing data:
n Pairwise deletion. Pairwise deletion examines each pair of variables and uses all
cases with both values present.
n Listwise deletion. Listwise deletion deletes any case with missing data for any
variable in the list.
Using Commands
First, specify your data with USE filename. For a simple correspondence analysis,
continue with:
For a multiple correspondence analysis:
If data are aggregated and there is a variable in the file representing frequency of
profiles, use FREQ to identify that variable.
Usage Considerations
Types of data. CORAN uses rectangular data only.
Print options. There are no print options.
Quick Graphs. Quick Graphs produced by CORAN are correspondence plots for the
simple or multiple models.
Saving files. CORAN does not save files.
BY groups. CORAN analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ=variable increases the number of cases by the FREQ variable.
Case weights. WEIGHT is not available in CORAN.
CORAN
MODEL depvar=indvar
ESTIMATE
CORAN
MODEL varlist
ESTIMATE
Examples
The examples begin with a simple correspondence analysis of a two-way table from
Greenacre (1984). This is followed by a multiple correspondence analysis example.
Example 1
Correspondence Analysis (Simple)
Here we illustrate a simple correspondence analysis model. The data comprise a
hypothetical smoking survey in a company (Greenacre, 1984). Notice that we use value
labels to describe the categories in the output and plot. The FREQ command codes the
cell frequencies. The input is:
The resulting output is:
USE SMOKE
LABEL STAFF / 1=Sr.Managers,2=Jr.Managers,3=Sr.Employees,
4=Jr.Employees,5=Secretaries
LABEL SMOKE / 1=None,2=Light,3=Moderate,4=Heavy
FREQ=FREQ
CORAN
MODEL STAFF=SMOKE
ESTIMATE
Variables in the SYSTAT Rectangular file are:
STAFF SMOKE FREQ

Case frequencies determined by value of variable FREQ.

Categorical values encountered during processing are:
STAFF (5 levels)
Sr.Managers, Jr.Managers, Sr.Employees, Jr.Employees, Secretaries
SMOKE (4 levels)
None, Light, Moderate, Heavy

Simple Correspondence Analysis
Chi-Square = 16.442.
Degrees of freedom = 12.
Probability = 0.172.

Factor Eigenvalue Percent Cum Pct
1 0.075 87.76 87.76 -----------------------------------
2 0.010 11.76 99.51 ----
3 0.000 .49 100.00

Sum 0.085 (Total Inertia)

Row Variable Coordinates

Name Mass Quality Inertia Factor 1 Factor 2
Sr.Managers 0.057 0.893 0.003 0.066 0.194
Jr.Managers 0.093 0.991 0.012 -0.259 0.243
Sr.Employees 0.264 1.000 0.038 0.381 0.011
Jr.Employees 0.456 1.000 0.026 -0.233 -0.058
Secretaries 0.130 0.999 0.006 0.201 -0.079
Row variable contributions to factors

Name Factor 1 Factor 2
Sr.Managers 0.003 0.214
Jr.Managers 0.084 0.551
Sr.Employees 0.512 0.003
Jr.Employees 0.331 0.152
Secretaries 0.070 0.081

Row variable squared correlations with factors

Name Factor 1 Factor 2
Sr.Managers 0.092 0.800
Jr.Managers 0.526 0.465
Sr.Employees 0.999 0.001
Jr.Employees 0.942 0.058
Secretaries 0.865 0.133

Column variable coordinates

Name Mass Quality Inertia Factor 1 Factor 2
None 0.316 1.000 0.049 0.393 0.030
Light 0.233 0.984 0.007 -0.099 -0.141
Moderate 0.321 0.983 0.013 -0.196 -0.007
Heavy 0.130 0.995 0.016 -0.294 0.198

Column variable contributions to factors

Name Factor 1 Factor 2
None 0.654 0.029
Light 0.031 0.463
Moderate 0.166 0.002
Heavy 0.150 0.506

Column variable squared correlations with factors

Name Factor 1 Factor 2
None 0.994 0.006
Light 0.327 0.657
Moderate 0.982 0.001
Heavy 0.684 0.310
EXPORT successfully completed.
For the simple correspondence model, CORAN prints the basic statistics and
eigenvalues of the decomposition. Next are the row and column coordinates, with
mass, quality, and inertia values. Mass equals the marginal total divided by the grand
total. Quality is a measure (between 0 and 1) of how well a row or column point is
represented by the first two factors. It is a proportion-of-variance statistic. See
Greenacre (1984) for further information. Inertia is a row's (or column's) contribution
to the total inertia. Contributions to the factors and squared correlations with the factors
are the last reported statistics.
Example 2
Multiple Correspondence Analysis
This example uses automobile accident data in Alberta, Canada, reprinted in Jobson
(1992). The categories are ordered with the ORDER command so that the output will
show them in increasing order of severity. The data are in tabular form, so we use the
FREQ command. The input is:
The resulting output is:
USE ACCIDENT
FREQ=FREQ
ORDER INJURY$ / SORT=None,Minimal,Minor,Major
ORDER DRIVER$ / SORT=Normal,Drunk
ORDER SEATBELT$ / SORT=Yes,No
CORAN
MODEL INJURY$,DRIVER$,SEATBELT$
ESTIMATE
Variables in the SYSTAT Rectangular file are:
SEATBELT$ IMPACT$ INJURY$ DRIVER$ FREQ

Case frequencies determined by value of variable FREQ.

Categorical values encountered during processing are:
INJURY$ (4 levels)
None, Minimal, Minor, Major
DRIVER$ (2 levels)
Normal, Drunk
SEATBELT$ (2 levels)
Yes, No

Multiple Correspondence Analysis

Factor Eigenvalue Percent Cum Pct
1 0.373 22.37 22.37 --------
2 0.334 20.02 42.39 --------
3 0.333 20.00 62.39 --------
4 0.325 19.50 81.89 -------
5 0.302 18.11 100.00 -------

Sum 1.667 (Total Inertia)

Variable Coordinates

Name Mass Quality Inertia Factor 1 Factor 2
None 0.303 0.351 0.031 0.189 0.008
Minimal 0.018 0.251 0.315 -1.523 -1.454
Minor 0.012 0.552 0.322 -2.134 3.294
Major 0.001 0.544 0.332 -3.962 -10.976
Normal 0.313 0.496 0.020 0.179 0.014
Drunk 0.020 0.496 0.313 -2.758 -0.211
Yes 0.053 0.279 0.280 1.143 -0.402
No 0.280 0.279 0.053 -0.217 0.076

Variable contributions to factors

Name Factor 1 Factor 2
None 0.029 0.000
Minimal 0.111 0.113
Minor 0.141 0.375
Major 0.056 0.478
Normal 0.027 0.000
Drunk 0.414 0.003
Yes 0.187 0.026
No 0.036 0.005

Variable squared correlations with factors

Name Factor 1 Factor 2
None 0.350 0.001
Minimal 0.131 0.120
Minor 0.163 0.389
Major 0.063 0.481
Normal 0.493 0.003
Drunk 0.493 0.003
Yes 0.249 0.031
No 0.249 0.031

Case coordinates

Name Factor 1 Factor 2
1 0.825 -0.219
2 -0.779 -0.349
3 -0.110 -1.063
4 -1.713 -1.193
5 -0.443 1.676
6 -2.047 1.547
7 -1.441 -6.558
8 -3.045 -6.687
9 0.825 -0.219
10 -0.779 -0.349
11 -0.110 -1.063
12 -1.713 -1.193
13 -0.443 1.676
14 -2.047 1.547
15 -1.441 -6.558
16 -3.045 -6.687
17 0.825 -0.219
18 -0.779 -0.349
19 -0.110 -1.063
20 -1.713 -1.193
21 -0.443 1.676
22 -2.047 1.547
23 -1.441 -6.558
24 0.825 -0.219
25 -0.779 -0.349
26 -0.110 -1.063
27 -1.713 -1.193
28 -0.443 1.676
29 -1.441 -6.558
30 -3.045 -6.687
31 0.082 0.057
32 -1.521 -0.073
33 -0.853 -0.787
34 -2.456 -0.916
35 -1.186 1.953
36 -2.790 1.823
37 -2.184 -6.281
38 -3.788 -6.411
39 0.082 0.057
40 -1.521 -0.073
41 -0.853 -0.787
42 -2.456 -0.916
43 -1.186 1.953
44 -2.790 1.823
45 -2.184 -6.281
46 -3.788 -6.411
47 0.082 0.057
48 -1.521 -0.073
49 -0.853 -0.787
50 -2.456 -0.916
51 -1.186 1.953
52 -2.790 1.823
53 -2.184 -6.281
54 -3.788 -6.411
55 0.082 0.057
56 -1.521 -0.073
57 -0.853 -0.787
58 -2.456 -0.916
59 -1.186 1.953
60 -2.790 1.823
61 -2.184 -6.281
62 -3.788 -6.411
EXPORT successfully completed.

This time, we get case coordinates instead of column coordinates. These are not
included in the following Quick Graph because the focus of the graph is on the tabular
variables and we don't want to clutter the display. If you want to plot case coordinates,
cut and paste them into the editor and plot them directly.
Following is the Quick Graph:
The graph reveals a principal axis of major versus minor injuries. This axis is related
to drunk driving and seat belt use.
Computation
All computations are in double precision.
Algorithms
CORAN uses a singular value decomposition of the cross-products matrix computed
from the data.
Missing Data
Cases with missing data are deleted from all analyses.
References
Greenacre, M. J. (1984). Theory and applications of correspondence analysis. New York:
Academic Press.
Hill, M. O. (1974). Correspondence analysis: A neglected multivariate method. Applied
Statistics, 23, 340–354.
Jobson, J. D. (1992). Applied multivariate data analysis, Vol. II: Categorical and
multivariate methods. New York: Springer-Verlag.


Chapter 8
Crosstabulation
When variables are categorical, frequency tables (crosstabulations) provide useful
summaries. For a report, you may need only the number or percentage of cases falling
in specified categories or cross-classifications. At times, you may require a test of
independence or a measure of association between two categorical variables. Or, you
may want to model relationships among two or more categorical variables by fitting
a loglinear model to the cell frequencies.
Both Crosstabs and Loglinear Model can make, analyze, and save frequency tables
that are formed by categorical variables (or table factors). The values of the factors
can be character or numeric. Both procedures form tables using data read from a
cases-by-variables rectangular file or recorded as frequencies (for example, from a
table in a report) with cell indices. In Crosstabs, you can request percentages of row
totals, column totals, or the total sample size.
Crosstabs (on the Statistics menu) provides three types of frequency tables:
One-way Frequency counts, percentages, and confidence intervals on cell pro-
portions for single table factors or categorical variables
Two-way Frequency counts, percentages, tests, and measures of association for
the crosstabulation of two factors
Multiway Frequency counts and percentages for series of two-way tables strati-
fied by all combinations of values of a third, fourth, etc., table factor
Statistical Background
Tables report results as counts or the number of cases falling in specific categories or
cross-classifications. Categories may be unordered (democrat, republican, and
independent), ordered (low, medium, and high), or formed by defining intervals on a
continuous variable like AGE (child, teen, adult, and elderly).
Making Tables
There are many formats for displaying tabular data. Let's examine basic layouts for
counts and percentages.
One-Way Tables
Here is an example of a table showing the number of people of each gender surveyed
about depression at UCLA in 1980.
The categorical variable producing this table is SEX$. Sometimes, you may define
categories as intervals of a continuous variable. Here is an example showing the 256
people broken down by age.
Two-Way Tables
A crosstabulation is a table that displays one cell for every combination of values on
two or more categorical variables. Here is a two-way table that crosses the gender and
age distributions of the tables above.
Female Male Total
+---------------+
| 152 104 | 256
+---------------+
18 to 30 30 to 45 46 to 60 Over 60 Total
+-------------------------------------+
| 79 80 64 33 | 256
+-------------------------------------+
Female Male Total
+-------------------+
18 to 30 | 49 30 | 79
30 to 45 | 48 32 | 80
46 to 60 | 38 26 | 64
Over 60 | 17 16 | 33
+-------------------+
Total 152 104 256
This crosstabulation shows relationships between age and gender, which were invisible
in the separate tables. Notice, for example, that the sample contains a large number of
females below the age of 46.
Standardizing Tables with Percentages
As with other statistical procedures such as Correlation, it sometimes helps to have
numbers standardized on a recognizable scale. Correlations vary between 1 and 1, for
example. A convenient scale for table counts is percentage, which varies between 0 and
100.
With tables, you must choose a facet on which to standardize: rows, columns, or
the total count in the table. For example, if we are interested in looking at the difference
between the genders within age groups, we might want to standardize by rows. Here is
that table:
Here we see that as age increases, the sample becomes more evenly dispersed across
the two genders.
On the other hand, if we are interested in the overall distribution of age for each
gender, we might want to standardize within columns:
For each gender, the oldest age group appears underrepresented.
Female Male Total N
+-------------------+
18 to 30 | 62.025 37.975 | 100.000 79
30 to 45 | 60.000 40.000 | 100.000 80
46 to 60 | 59.375 40.625 | 100.000 64
Over 60 | 51.515 48.485 | 100.000 33
+-------------------+
Total 59.375 40.625 100.000
N 152 104 256
Female Male Total N
+-------------------+
18 to 30 | 32.237 28.846 | 30.859 79
30 to 45 | 31.579 30.769 | 31.250 80
46 to 60 | 25.000 25.000 | 25.000 64
Over 60 | 11.184 15.385 | 12.891 33
+-------------------+
Total 100.000 100.000 100.000
N 152 104 256
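Outside SYSTAT, pandas (not part of SYSTAT) produces the same standardizations with the normalize argument of crosstab; the handful of records below is invented for illustration.

import pandas as pd

df = pd.DataFrame({"AGEGRP": ["18 to 30", "18 to 30", "30 to 45", "Over 60"],
                   "SEX": ["Female", "Male", "Female", "Female"]})

counts = pd.crosstab(df.AGEGRP, df.SEX)
row_pct = pd.crosstab(df.AGEGRP, df.SEX, normalize="index") * 100     # rows sum to 100
col_pct = pd.crosstab(df.AGEGRP, df.SEX, normalize="columns") * 100   # columns sum to 100
print(counts, row_pct, col_pct, sep="\n\n")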
Significance Tests and Measures of Association
After producing a table, you may want to consider a population model that accounts
for the structure you see in the observed table. You should have a population in mind
when you make such inferences. Many published statistical analyses of tables do not
explicitly deal with the sampling problem.
One-Way Tables
A model for these data might be that the proportion of the males and females is equal
in the population. The null hypothesis corresponding to the model is:
$$H: p_{males} = p_{females}$$
The sampling model for testing this hypothesis requires that a population contains
equal numbers of males and females and that each member of the population has an
equal chance of being chosen. After choosing each person, we identify the person as
male or female. There is no other category possible and one person cannot fit under
both categories (exhaustive and mutually exclusive).
There is an exact way to reject our null hypothesis (called a permutation test). We
can tally every possible sample of size 256 (including one with no females and one
with no males). Then we can sort our samples into two piles: samples in which there
are between 40.625% and 59.375% females and samples in which there are
not. If the latter pile is extremely small relative to the former, we can reject the null
hypothesis.
Needless to say, this would be a tedious undertaking, particularly on a
microcomputer. Fortunately, there is an approximation using a continuous probability
distribution that works quite well. First, we need to calculate the expected count of
males and females, respectively, in a sample of size 256 if p is 0.5. This is 128, or half
the sample N. Next, we subtract the observed counts from these expected counts,
square them, and divide by the expected:

$$\chi^2 = \frac{(152 - 128)^2}{128} + \frac{(104 - 128)^2}{128} = 9$$

If our assumptions about the population and the structure of the table are correct, then
this statistic will be distributed as a mathematical chi-square variable. We can look up
the area under the tail of the chi-square statistic beyond the sample value we calculate
and if this area is small (say, less than 0.05), we can reject the null hypothesis.
To look up the value, we need a degrees of freedom (df) value. This is the number
of independent values being added together to produce the chi-square. In our case, it is
1, since the observed proportion of men is simply 1 minus the observed proportion of
women. If there were three categories (men, women, other?), then the degrees of
freedom would be 2. Anyway, if you look up the value 9 with one degree of freedom
in your chi-square table, you will find that the probability of exceeding this value is
exceedingly small. Thus, we reject our null hypothesis that the proportion of males
equals the proportion of females in the population.
This chi-square approximation is good only for large samples. A popular rule of
thumb is that the expected counts should be greater than 5, although they should be
even greater if you want to be comfortable with your test. With our sample, the
difference between the approximation and the exact result is negligible. For both, the
probability is small.
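The same one-way test can be checked outside SYSTAT with scipy (not part of SYSTAT), which defaults to equal expected counts.

from scipy.stats import chisquare

stat, p = chisquare([152, 104])   # expected counts default to an even split (128, 128)
print(stat, p)                    # chi-square = 9.0, p is about 0.003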
Our hypothesis test has an associated confidence interval. You can use SYSTAT to
compute this interval on the population data. Here is the result:

95 percent approximate confidence intervals scaled as cell percents
Values for SEX$

             Female      Male
          +-------------------+
          |  66.150    47.687 |
          |  52.064    33.613 |
          +-------------------+

The lower limit for each gender is on the bottom; the upper limits are on the top. Notice
that these two intervals do not overlap.
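For comparison, a simple normal-approximation (Wald) interval for the proportion of females can be computed by hand; SYSTAT's approximate intervals are based on its own method, so its limits differ somewhat from this back-of-the-envelope version.

import numpy as np

n, females = 256, 152
p = females / n
half = 1.96 * np.sqrt(p * (1 - p) / n)
print(100 * (p - half), 100 * (p + half))   # roughly 53 to 65 percent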
Two-Way Tables
The most familiar test available for two-way tables is the Pearson chi-square test for
independence of table rows and columns. When the table has only two rows or two
columns, the chi-square test is also a test for equality of proportions. The concept of
interaction in a two-way frequency table is similar to the one in analysis of variance. It
is easiest to see in an example. Schachter (1959) randomly assigned 30 subjects to one
of two groups: High Anxiety (17 subjects), who were told that they would be
experiencing painful shocks, and Low Anxiety (13 subjects), who were told that they
would experience painless shocks. After the assignment, each subject was given the
choice of waiting alone or with the other subjects. The following tables illustrate two
possible outcomes of this study.
Notice in the table on the left that the number choosing to wait together relative to those
choosing to wait alone is similar for both High and Low Anxiety groups. In the table on
the right, however, more of the High Anxiety group chose to wait together.
We are interpreting these numbers relatively, so we should compute row
percentages to understand the differences better. Here are the same tables standardized
by rows:
Now we can see that the percentages are similar in the two rows in the table on the left
(No Interaction) and quite different in the table on the right (Interaction). A simple
graph reveals these differences even more strongly. In the following figure, the No
Interaction row percentages are plotted on the left.
No Interaction Interaction
WAIT WAIT
Alone Together Alone Together
ANXIETY
High 8 9 5 12
Low 6 7 9 4
No Interaction Interaction
WAIT WAIT
Alone Together Alone Together
ANXIETY
High 47.1 52.8 29.4 70.6
Low 46.1 53.8 69.2 30.8
Notice that the lines cross in the Interaction plot, showing that the rows differ. There
is almost complete overlap in the No Interaction plot.
Now, in the one-way table example above, we tested the hypothesis that the cell
proportions were equal in the population. We can test an analogous hypothesis in this
context: that each of the four cells contains 25 percent of the population. The problem
with this assumption is that we already know that Schachter randomly assigned more
people to the High Anxiety group. In other words, we should take the row marginal
percentages (or totals) as fixed when we determine what proportions to expect in the
cells from a random model.
Our No Interaction model is based on these fixed marginals. In fact, we can fix
either the row or column margins to compute a No Interaction model because the total
number of subjects is fixed at 30. You can verify that the row and column sums in the
above tables are the same.
Now we are ready to compute our chi-square test of interaction (often called a test
of independence) in the two-way table by using the No Interaction counts as expected
counts in our chi-square formula above. This time, our degrees of freedom are still 1
because the marginal counts are fixed. If you know the marginal counts, then one cell
count determines the remaining three. In general, the degrees of freedom for this test
are (rows - 1) times (columns - 1).
Here is the result of our chi-square test. The chi-square is 4.693, with a p of 0.03.
On this basis, we reject our No Interaction hypothesis.
Actually, we cheated. The program computed the expected counts from the observed
data. These are not exactly the ones we showed you in the No Interaction table. They
differ by rounding error in the first decimal place. You can compute them exactly. The
popular method is to multiply the total row count times the total column count
corresponding to a cell and divide by the total sample size. For the upper left cell, this
would be 17*14/30 = 7.93.
ANXIETY (rows) by WAIT$ (columns)

Alone Together Total
+-------------------+
High | 5.000 12.000 | 17.000
Low | 9.000 4.000 | 13.000
+-------------------+
Total 14.000 16.000 30.000


Test statistic Value df Prob
Pearson Chi-square 4.693 1.000 0.030
Likelihood ratio Chi-square 4.810 1.000 0.028
McNemar Symmetry Chi-square 0.429 1.000 0.513
Yates corrected Chi-square 3.229 1.000 0.072
Fisher exact test (two-tail) 0.063
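These values can be checked outside SYSTAT with scipy (not part of SYSTAT), which computes the Pearson, Yates, and Fisher results from the same 2 x 2 table.

from scipy.stats import chi2_contingency, fisher_exact

table = [[5, 12],    # High anxiety: alone, together
         [9, 4]]     # Low anxiety:  alone, together

pearson, p, df, expected = chi2_contingency(table, correction=False)  # about 4.69
yates = chi2_contingency(table, correction=True)[0]                   # about 3.23
odds, p_fisher = fisher_exact(table)                                  # two-tail p about 0.06
print(pearson, yates, p_fisher)
print(expected)   # expected counts; about 7.93 in the upper left cell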
There is one other interesting problem with these data. The chi-square is only an
approximation and it does not work well for small samples. Although these data meet
the minimum expected count of 5, they are nevertheless problematic. Look at the
Fisher's exact test result in the output above. Like our permutation test above, which
was so cumbersome for large data files, Fisher's test counts all possible outcomes
exactly, including the ones that produce interaction greater than what we observed. The
Fisher exact test p value is not significant (0.063). On this basis, we could not reject
the null hypothesis of no interaction, or independence.
Yates' chi-square test in the output is an attempt to adjust the Pearson chi-square
statistic for small samples. While it has come into disfavor for being unnecessarily
conservative in many instances, the Yates p value is nevertheless consistent with
Fisher's in this case (0.072). Likelihood-ratio chi-square is an alternative to the
Pearson chi-square and is used as a test statistic for loglinear models.
Selecting a Test or Measure
Other tests and measures are appropriate for specific table structures and also depend
on whether or not the categories of the factor are ordered. We use 2 x 2 to denote a
table with two rows and two columns, and r x c for a table with r rows and c columns.
The Pearson and likelihood-ratio chi-square statistics apply to r x c tables;
categories need not be ordered.
McNemar's test of symmetry is used for square r x r tables (the number of rows
equals the number of columns). This structure arises when the same subjects are
measured twice as in a paired comparisons t test (say before and after an event) or when
subjects are paired or matched (cases and controls). So the row and column categories
are the same, but they are measured at different times or circumstances (like the paired
t) or for different groups of subjects (cases and controls). This test ignores the counts
along the diagonal of the table and tests whether the counts in cells above the diagonal
differ from those below the diagonal. A significant result indicates a greater change in
one direction than another. (The counts along the diagonal are for subjects who did not
change.)
The table structure for Cohen's kappa looks like that of McNemar's in that the row
and column categories are the same. But here the focus shifts to the diagonal: Are the
counts along the diagonal significantly greater than those expected by chance alone?
Because each subject is classified or rated twice, kappa is a measure of interrater
agreement.
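Kappa itself is simply (observed agreement - chance agreement) / (1 - chance agreement). The Python sketch below (not SYSTAT code; the judge-by-judge counts are hypothetical) computes it from a square table.

import numpy as np

def cohen_kappa(table):
    # Kappa from a square agreement table (rows: rater 1, columns: rater 2).
    t = np.asarray(table, float)
    n = t.sum()
    po = np.trace(t) / n                                  # observed agreement
    pe = np.sum(t.sum(axis=0) * t.sum(axis=1)) / n ** 2   # agreement expected by chance
    return (po - pe) / (1 - pe)

print(cohen_kappa([[20, 5], [10, 65]]))   # hypothetical ratings; about 0.62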
Another difference between McNemar and Kappa is that the former is a test with
a chi-square statistic, degrees of freedom, and an associated p value, while the latter is
a measure. Its size is judged by using an asymptotic standard error to construct a t
statistic (that is, measure divided by standard error) to test whether kappa differs from
0. Values of kappa greater than 0.75 indicate strong agreement beyond chance,
between 0.40 and 0.79 means fair to good, and below 0.40 means poor agreement.
Phi, Cramér's V, and contingency are measures suitable for testing independence of
table factors as you would with Pearson's chi-square. They are designed for comparing
results of tables with different sample sizes. (Note that the expected value of the
Pearson chi-square is proportional to the total table size.) The three measures are scaled
differently, but all test the same null hypothesis. Use the probability printed with the
Pearson chi-square to test that these measures are zero. For tables with two rows and
two columns (a 2 x 2 table), phi and Cramér's V are the same.
Five of the measures for two-way tables are appropriate when both categorical
variables have ordered categories (always, sometimes, never or none, minimal,
moderate, severe). These are Goodman-Kruskal's gamma, Kendall's tau-b, Stuart's
tau-c, Spearman's rho, and Somers' d. The first three measures differ only in how ties
are treated; the fourth is like the usual Pearson correlation except that the rank order of
each value is used in the computations instead of the value itself. Somers' d is an
asymmetric measure: in SYSTAT, the column variable is considered to be the
dependent variable.
For 2 x 2 tables, Fisher's exact test (if n ≤ 50) and Yates' corrected chi-square are
also printed. When expected cell sizes are small in a 2 x 2 table (an expected value
less than 5), use Fisher's exact test as described above.
In larger contingency tables, we do not want to see any expected values less than 1.0
or more than 20% of the values less than 5. For large tables with too many small
expected values, there is no remedy except to combine categories or possibly omit a
category that has very few observations.
Yule's Q and Yule's Y measure dominance in a 2 x 2 table. If either off-diagonal
cell is 0, both statistics equal 1 (otherwise, they are less than 1). These statistics are 0
if and only if the chi-square statistic is 0. Therefore, the null hypothesis that the
measure is 0 can be tested by the chi-square test.
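In terms of the cell counts a, b, c, and d of a 2 x 2 table, Q = (ad - bc)/(ad + bc) and Y = (sqrt(ad) - sqrt(bc))/(sqrt(ad) + sqrt(bc)). The Python sketch below (not SYSTAT code) computes both for the anxiety-by-waiting table used earlier in this chapter.

import numpy as np

def yule_q_y(a, b, c, d):
    # Yule's Q and Y for the 2 x 2 table [[a, b], [c, d]].
    q = (a * d - b * c) / (a * d + b * c)
    y = (np.sqrt(a * d) - np.sqrt(b * c)) / (np.sqrt(a * d) + np.sqrt(b * c))
    return q, y

print(yule_q_y(5, 12, 9, 4))   # about -0.69 and -0.40 for the anxiety example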
Crosstabulations in SYSTAT
One-Way Frequency Tables Main Dialog Box
To open the One-Way Frequency Tables dialog box, from the menus choose:
Statistics
Crosstabs
One-way
One-way frequency tables provides frequency counts, percentages, tests, etc., for
single table factors or categorical variables.
n Tables. Tables can include frequency counts, percentages, and confidence
intervals. You can specify any confidence level between 0 and 1.
n Pearson chi-square. Tests the equality of the cell frequencies. This test assumes all
categories are equally likely.
n Options. You can include a category for cases with missing data. SYSTAT treats
this category in the same fashion as the other categories. In addition, you can
display output in a listing format instead of a tabular display. The listing includes
counts, cumulative counts, percentages, and cumulative percentages.
n Save last table as data file. Saves the table for the last variable in the Variable(s) list
as a SYSTAT data file.
Two-Way Frequency Tables Main Dialog Box
To open the Two-Way Frequency Tables dialog box, from the menus choose:
Statistics
Crosstabs
Two-way
Two-way frequency tables crosstabulate one or more categorical row variables with a
categorical column variable.
n Row variable(s). The variables displayed in the rows of the crosstabulation. Each
row variable is crosstabulated with the column variable.
n Column variable. The variable displayed in the columns of the crosstabulation. The
column variable is crosstabulated with each row variable.
n Tables. Tables can include frequency counts, percentages (row, column, or total),
expected counts, deviates (Observed-Expected), and standardized deviates
(Observed-Expected) / SQR (Expected).
n Options. You can include counts and percentages for cases with missing data. In
addition, you can display output in a listing format instead of a tabular display. The
listing includes counts, cumulative counts, percentages, and cumulative
percentages for each combination of row and column variable categories.
n Save last table as data file. Saves the crosstabulation of the column variable with
the last variable in the row variable(s) list as a SYSTAT data file. For each cell of
the table, SYSTAT saves a record with the cell frequency and the row and column
category values.
Two-Way Frequency Tables Statistics
A wide variety of statistics is available for testing the association between variables in
a crosstabulation. Each statistic is appropriate for a particular table structure (rows by
columns), and a few assume that categories are ordered (ordinal data).
Pearson chi-square. For tables with any number of rows and columns, tests for
independence of the row and column variables.
2 x 2 tables. For tables with two rows and two columns, available tests are:
n Yates' corrected chi-square. Adjusts the Pearson chi-square statistic for small
samples.
n Fisher's exact test. Counts all possible outcomes exactly. When the expected cell
sizes are small (less than 5), use it as an alternative to the Pearson chi-square.
n Odds ratio. A measure of association in which a value near 1 indicates no relation
between the variables.
n Yule's Q and Y. Measures of association in which values near -1 or +1 indicate a
strong relation. Values near 0 indicate no relation. Yule's Y is less sensitive to
differences in the margins of the table than Q.
2 x k tables. For tables with only two rows and any number of ordered column
categories (or vice versa), Cochrans test of linear trend is available to reveal whether
proportions increase (or decrease) linearly across the ordered categories.
r x r tables. For square tables, available tests include:
n McNemar's test for symmetry. Used for paired (or matched) variables. Tests whether
the counts above the table diagonal differ from those below the diagonal. Small
probability values indicate a greater change in one direction.
n Cohen's kappa. Commonly used to measure agreement between two judges rating
the same objects. Tests whether the diagonal counts are larger than expected.
Values of kappa greater than 0.75 indicate strong agreement beyond chance, values
between 0.40 and 0.79 indicate fair to good, and values below 0.40 indicate poor
agreement.
r x c tables, unordered levels. For tables with any number of rows or columns with no
assumed category order, available tests are:
n Phi. A chi-square based measure of association. Values may exceed 1.
n Cramér's V. A measure of association based on the chi-square. The value ranges
between 0 and 1, with 0 indicating independence between the row and column
variables and values close to 1 indicating dependence between the variables.
n Contingency coefficient. A measure of association based on the chi-square. Similar
to Cramér's V, but values of 1 cannot be attained.
n Uncertainty coefficient and Goodman-Kruskal's lambda. Measures of association that
indicate the proportional reduction in error when values of one variable are used to
predict values of the other variable. Values near 0 indicate that the row variable is
no help in predicting the column variable.
n Likelihood-ratio chi-square. An alternative to the Pearson chi-square, primarily
used as a test statistic for loglinear models.
r x c tables, ordered levels. For tables with any number of rows or columns in which
categories for both variables represent ordered levels (for example, low, medium,
high), available tests are:
n Spearman's rho. Similar to the Pearson correlation coefficient, but uses the ranks of
the data rather than the actual values.
n Goodman-Kruskal's gamma, Kendall's tau-b, and Stuart's tau-c. Measures of
association between two ordinal variables that range between -1 and +1, differing
only in the method of dealing with ties. Values close to 0 indicate little or no
relationship.
n Somers' d. An asymmetric measure of association between two ordinal variables
that ranges from -1 to +1. Values close to -1 or +1 indicate a strong relationship
between the two. The column variable is treated as the dependent variable.
Multiway Frequency Tables Main Dialog Box
Multiway frequency tables provide frequency counts and percentages for series of two-
way tables stratified by all combinations of values of a third, fourth, etc., table factor.
To open the Multiway Frequency Tables dialog box, from the menus choose:
Statistics
Crosstabs
Multiway
n Row variable. The variable displayed in the rows of the crosstabulation.
n Column variable. The variable displayed in the columns of the crosstabulation.
n Strata variable(s). If strata are separate, a separate crosstabulation is produced for
each value of each strata variable. If strata are crossed, a separate crosstabulation
is produced for each unique combination of strata variable values. For example, if
you have two strata variables, each with five categories, Separate will produce 10
tables and Crossed will produce 25 tables.
n Options. You can include counts and percentages for cases with missing data and
save the last table produced as a SYSTAT data file. In addition, you can display
output in a listing format, including percentages and cumulative percentages,
instead of a tabular display.
n Display. You can display frequencies, total percentages, row percentages, and
column percentages. Furthermore, you can use the Mantel-Haenszel test for 2 x 2
subtables to test for an association between two binary variables while controlling
for another variable (a computational sketch follows this list).
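For reference, the Mantel-Haenszel statistic pools the 2 x 2 subtables by comparing the observed count in one cell of each stratum with its expectation and variance under independence. The Python sketch below (not SYSTAT code; the two strata are hypothetical) shows the computation with the usual continuity correction.

def mantel_haenszel(tables):
    # Mantel-Haenszel chi-square for a list of 2 x 2 tables [[a, b], [c, d]].
    a_sum = e_sum = v_sum = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        a_sum += a
        e_sum += (a + b) * (a + c) / n
        v_sum += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    return (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum

print(mantel_haenszel([[[10, 20], [15, 40]], [[8, 12], [9, 25]]]))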
Using Commands
For one-way tables in XTAB, specify:
For two-way tables in XTAB, specify:
For multiway tables in XTAB, specify:
Usage Considerations
Types of data. There are two ways to organize data for tables:
n The usual cases-by-variables rectangular data file
n Cell counts with cell identifiers
XTAB
USE filename
PRINT / FREQ CHISQ LIST PERCENT ROWPCT COLPCT
TABULATE varlist / CONFI=n MISS
XTAB
USE filename
PRINT / FREQ CHISQ LRCHI YATES FISHER ODDS YULE COCHRAN,
MCKEM KAPPA PHI CRAMER CONT UNCE LAMBDA RHO GAMMA,
TAUB TAUC SOMERS EXPECT DEVI STAND LIST PERCENT,
ROWPCT COLPCT
TABULATE rowvar * colvar / MISS
XTAB
USE filename
PRINT / FREQ MANTEL LIST PERCENT ROWPCT COLPCT
TABULATE varlist * rowvar * colvar / MISS
For example, you may want to analyze the following table reflecting application results
by gender for business schools:
A cases-by-variables data file has the following form:
Instead of entering one case for each of the 685 applicants, you could use the second
method to enter four cases:
For this method, the cell counts in the third column are identified by designating
COUNT as a FREQUENCY variable.
Print options. Three levels of output are available. Statistics produced depend on the
dimensionality of the table. PRINT SHORT yields frequency tables for all tables and
Pearson chi-square for one-way and two-way tables. The MEDIUM length yields all
statistics appropriate for the dimensionality of a two-way or multiway table. LONG
adds expected cell values, deviates, and standardized deviates to the SHORT and
MEDIUM output.
Quick Graphs. Frequency tables produce no Quick Graphs.
Saving files. You can save the frequency counts to a file. For two-way tables, cell
values, deviates, and standardized deviates are also saved.
Admitted Denied
Male
420 90
Female
150 25
PERSON GENDER$ STATUS$
1 female admit
2 male deny
3 male admit
(etc.)
684 female deny
685 male admit
GENDER$ STATUS$ COUNT
male admit 420
male deny 90
female admit 150
female deny 25
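The two layouts carry the same information. A pandas sketch (pandas is not part of SYSTAT) makes the equivalence concrete by expanding the aggregated records back into case-level rows, which is what designating COUNT as a FREQUENCY variable accomplishes inside SYSTAT.

import pandas as pd

agg = pd.DataFrame({"GENDER": ["male", "male", "female", "female"],
                    "STATUS": ["admit", "deny", "admit", "deny"],
                    "COUNT":  [420, 90, 150, 25]})

# Repeat each record by its count to get the 685 case-level rows.
cases = agg.loc[agg.index.repeat(agg.COUNT), ["GENDER", "STATUS"]].reset_index(drop=True)
print(pd.crosstab(cases.GENDER, cases.STATUS))   # recovers the original cell counts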
BY groups. Use of a BY variable yields separate frequency tables (and corresponding
statistics) for each level of the BY variable.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. XTAB uses the FREQUENCY variable to duplicate cases. This is the
preferred method of input when the data are aggregated.
Case weights. WEIGHT is available for frequency tables.
Examples
Example 1
One-Way Tables
This example uses questionnaire data from a community survey (Afifi and Clark,
1984). The SURVEY2 data file includes a record (case) for each of the 256 subjects in
the sample. We request frequencies for gender, marital status, and religion. The values
of these variables are numbers, so we add character identifiers for the categories. The
input is:
If the words "male" and "female" were stored in the variable SEX$, you would omit LABEL
and tabulate SEX$ directly. If you omit LABEL and specify SEX, the numbers would
label the output.
n When using the Label dialog box, you can omit quotation marks around category
names. With commands, you can omit them if the name has no embedded blanks
or symbols (the name, however, is displayed in uppercase letters).
USE survey2
XTAB
LABEL sex / 1=Male, 2=Female
LABEL marital / 1=Never, 2=Married, 3=Divorced,
4=Separated
LABEL religion / 1=Protestant, 2=Catholic, 3=Jewish,
4=None, 6=Other
PRINT NONE / FREQ
TABULATE sex marital religion
The output follows:
In this sample of 256 subjects, 152 are females, 127 are married, and 133 are
Protestants.
List Layout
List layout produces an alternative layout for the same information. Percentages and
cumulative percentages are part of the display. The input is:
Frequencies
Values for SEX

Male Female Total
+---------------+
| 104 152 | 256
+---------------+



Frequencies
Values for MARITAL

Never Married Divorced Separated Total
+-----------------------------------------+
| 73 127 43 13 | 256
+-----------------------------------------+



Frequencies
Values for RELIGION

Protestant Catholic Jewish None Other Total
+--------------------------------------------------------+
| 133 46 23 52 2 | 256
+--------------------------------------------------------+
USE survey2
XTAB
LABEL sex / 1=Male, 2=Female
LABEL marital / 1=Never, 2=Married, 3=Divorced,
4=Separated
LABEL religion / 1=Protestant, 2=Catholic, 3=Jewish,
4=None, 6=Other
PRINT NONE / LIST
TABULATE sex marital religion
PRINT
You can also use TABULATE varlist / LIST as an alternative to PRINT NONE / LIST. The
output follows:
Almost 60% (59.4) are female, approximately 50% (49.6) are married, and more than
half (52%) are Protestants.
Example 2
Two-Way Tables
This example uses the SURVEY2 data to crosstabulate marital status against religion.
The input is:
The table follows:
Cum Cum
Count Count Pct Pct SEX
104. 104. 40.6 40.6 Male
152. 256. 59.4 100.0 Female

Cum Cum
Count Count Pct Pct MARITAL
73. 73. 28.5 28.5 Never
127. 200. 49.6 78.1 Married
43. 243. 16.8 94.9 Divorced
13. 256. 5.1 100.0 Separated

Cum Cum
Count Count Pct Pct RELIGION
133. 133. 52.0 52.0 Protestant
46. 179. 18.0 69.9 Catholic
23. 202. 9.0 78.9 Jewish
52. 254. 20.3 99.2 None
2. 256. .8 100.0 Other
USE survey2
XTAB
LABEL marital / 1=Never, 2=Married, 3=Divorced,
4=Separated
LABEL religion / 1=Protestant, 2=Catholic, 3=Jewish,
4=None, 6=Other
PRINT NONE / FREQ
TABULATE marital * religion
Frequencies
MARITAL (rows) by RELIGION (columns)

Protestant Catholic Jewish None Other Total
+--------------------------------------------------------+
Never | 29 16 8 20 0 | 73
Married | 75 21 11 19 1 | 127
Divorced | 21 6 3 13 0 | 43
Separated | 8 3 1 0 1 | 13
+--------------------------------------------------------+
Total 133 46 23 52 2 256
In the sample of 256 people, 73 never married. Of the people that have never married,
29 are Protestants (the cell in the upper left corner), and none are in the Other category
(their religion is not among the first four categories). The Totals (or marginals) along
the bottom row and down the far right column are the same as the values displayed for
one-way tables.
Omitting Sparse Categories
There are only two counts in the last column, and the counts in the last row are fairly
sparse. It is easy to omit rows and/or columns. You can:
n Omit the category codes from the LABEL request.
n Select cases to use.
Note that LABEL and SELECT remain in effect until you turn them off. If you request
several different tables, use SELECT to ensure that the same cases are used in all tables.
The subset of cases selected via LABEL applies only to those tables that use the
variables specified with LABEL. To turn off the LABEL specification for RELIGION, for
example, specify:
We continue from the last table, eliminating the last category codes for MARITAL and
RELIGION:
The table is:
LABEL religion
SELECT marital <> 4 AND religion <> 6
TABULATE marital * religion
SELECT
Frequencies
MARITAL (rows) by RELIGION (columns)

Protestant Catholic Jewish None Total
+---------------------------------------------+
Never | 29 16 8 20 | 73
Married | 75 21 11 19 | 126
Divorced | 21 6 3 13 | 43
+---------------------------------------------+
Total 125 43 22 52 242
List Layout
Following is the panel for marital status crossed with religious preference:
The listing is:
Example 3
Frequency Input
Crosstabs, like other SYSTAT procedures, reads cases-by-variables data from a
SYSTAT file. However, if you want to analyze a table from a report or a journal article,
you can enter the cell counts directly. This example uses counts from a four-way table
of a breast cancer study of 764 women. The data are from Morrison et al. (1973), cited
in Bishop, Fienberg, and Holland (1975). There is one record for each of the 72 cells
in the table, with the count (NUMBER) of women in the cell and codes or category
names to identify their age group (under 50, 50 to 69, and 70 or over), treatment center
(Tokyo, Boston, or Glamorgan), survival status (dead or alive), and tumor diagnosis
(minimal inflammation and benign, maximum inflammation and benign, minimal
inflammation and malignant, and maximum inflammation and malignant). This
example illustrates how to form a two-way table of AGE by CENTER$.
USE survey2
XTAB
LABEL marital / 1=Never, 2=Married, 3=Divorced
LABEL religion / 1=Protestant, 2=Catholic, 3=Jewish,
4=None
PRINT NONE / LIST
TABULATE marital * religion
PRINT
Cum Cum
Count Count Pct Pct MARITAL RELIGION
29. 29. 12.0 12.0 Never Protestant
16. 45. 6.6 18.6 Never Catholic
8. 53. 3.3 21.9 Never Jewish
20. 73. 8.3 30.2 Never None
75. 148. 31.0 61.2 Married Protestant
21. 169. 8.7 69.8 Married Catholic
11. 180. 4.5 74.4 Married Jewish
19. 199. 7.9 82.2 Married None
21. 220. 8.7 90.9 Divorced Protestant
6. 226. 2.5 93.4 Divorced Catholic
3. 229. 1.2 94.6 Divorced Jewish
13. 242. 5.4 100.0 Divorced None
The input is:
The resulting two-way table is:
Of the 764 women studied, 290 were treated in Tokyo. Of these women, 151 were in
the youngest age group, and 19 were in the 70 or over age group.
Example 4
Missing Category Codes
You can choose whether or not to include a separate category for missing codes. For
example, if some subjects did not check "male" or "female" on a form, there would be
three categories for SEX$: male, female, and blank (missing). By default, when values
of a table factor are missing, SYSTAT does not include a category for missing.
In the OURWORLD data file, some countries did not report the GNP to the United
Nations. In this example, we include a category for missing values, and we followed
this request with a table that omits the category for missing. The input follows:
USE cancer
XTAB
FREQ = number
LABEL age / 50=Under 50, 60=50 to 69, 70=70 & Over
TABULATE center$ * age
Frequencies
CENTER$ (rows) by AGE (columns)

Under 50 50 to 69 70 & Over Total
+-------------------------------+
Boston | 58 122 73 | 253
Glamorgn | 71 109 41 | 221
Tokyo | 151 120 19 | 290
+-------------------------------+
Total 280 351 133 764


Test statistic Value df Prob
Pearson Chi-square 74.039 4.000 0.000
USE ourworld
XTAB
TABULATE group$ * gnp$ / MISS
LABEL gnp$ / D=Developed, U=Emerging
TABULATE group$ * gnp$
The tables are:
List Layout
To create a listing of the counts in each cell of the table:
The output is:
Note that there is no entry for the empty cell.
Example 5
Percentages
Percentages are helpful for describing categorical variables and interpreting relations
between table factors. Crosstabs prints tables of percentages in the same layout as
described for frequency counts. That is, each frequency count is replaced by the
percentage. The percentages are:
n The total frequency in its row
Frequencies
GROUP$ (rows) by GNP$ (columns)

D U Total
+----------------------------+
Europe | 3 17 0 | 20
Islamic | 2 4 10 | 16
NewWorld | 1 15 5 | 21
+----------------------------+
Total 6 36 15 57

Frequencies
GROUP$ (rows) by GNP$ (columns)

Developed Emerging Total
+---------------------+
Europe | 17 0 | 17
Islamic | 4 10 | 14
NewWorld | 15 5 | 20
+---------------------+
Total 36 15 51
PRINT / LIST
TAB group$ * gnp$
PRINT
Cum Cum
Count Count Pct Pct GROUP$ GNP$
17. 17. 33.3 33.3 Europe Developed
4. 21. 7.8 41.2 Islamic Developed
10. 31. 19.6 60.8 Islamic Emerging
15. 46. 29.4 90.2 NewWorld Developed
5. 51. 9.8 100.0 NewWorld Emerging
n The total frequency in its column
n The total table frequency (or sample size)
In this example, we request all three percentages using the following input:
The output is:
USE ourworld
XTAB
LABEL gnp$ / D=Developed, U=Emerging
PRINT NONE / ROWP COLP PERCENT
TABULATE group$ * gnp$
Percents of total count
GROUP$ (rows) by GNP$ (columns)

Developed Emerging Total N
+---------------------+
Europe | 33.333 0.0 | 33.333 17
Islamic | 7.843 19.608 | 27.451 14
NewWorld | 29.412 9.804 | 39.216 20
+---------------------+
Total 70.588 29.412 100.000
N 36 15 51



Row percents
GROUP$ (rows) by GNP$ (columns)

Developed Emerging Total N
+---------------------+
Europe | 100.000 0.0 | 100.000 17
Islamic | 28.571 71.429 | 100.000 14
NewWorld | 75.000 25.000 | 100.000 20
+---------------------+
Total 70.588 29.412 100.000
N 36 15 51



Column percents
GROUP$ (rows) by GNP$ (columns)

Developed Emerging Total N
+---------------------+
Europe | 47.222 0.0 | 33.333 17
Islamic | 11.111 66.667 | 27.451 14
NewWorld | 41.667 33.333 | 39.216 20
+---------------------+
Total 100.000 100.000 100.000
N 36 15 51
Missing Categories
Notice how the row percentages change when we include a category for the missing
GNP:
The new table is:
Here we see that 62.5% of the Islamic nations are classified as emerging. However,
from the earlier table of row percentages, it might be better to say that among the
Islamic nations reporting the GNP, 71.43% are emerging.
Example 6
Multiway Tables
When you have three or more table factors, Crosstabs forms a series of two-way tables
stratified by all combinations of values of the third, fourth, and so on, table factors. The
order in which you choose the table factors determines the layout. Your input can be
the usual cases-by-variables data file or the cell counts with category values.
The input is:
PRINT NONE / ROWP
LABEL gnp$ / =Missing, D=Developed, U=Emerging
TABULATE group$ * gnp$
PRINT
Row percents
GROUP$ (rows) by GNP$ (columns)

MISSING Developed Emerging Total N
+-------------------------------+
Europe | 15.000 85.000 0.0 | 100.000 20
Islamic | 12.500 25.000 62.500 | 100.000 16
NewWorld | 4.762 71.429 23.810 | 100.000 21
+-------------------------------+
Total 10.526 63.158 26.316 100.000
N 6 36 15 57
USE cancer
XTAB
FREQ = number
LABEL age / 50=Under 50, 60=50 to 69,
70=70 & Over
ORDER center$ / SORT=none
ORDER tumor$ / SORT=MinBengn, MaxBengn, MinMalig,
MaxMalig
TABULATE survive$ * tumor$ * center$ * age
The last two factors selected (CENTER$ and AGE) define two-way tables. The levels
of the first two factors define the strata. After the table is run, we edited the output and
moved the four tables for SURVIVE$ = Dead next to those for Alive.
List Layout
To create a listing of the counts in each cell of the table:
The output follows:
Frequencies
CENTER$ (rows) by AGE (columns)
SURVIVE$ = Alive SURVIVE$ = Dead
TUMOR$ = MinBengn TUMOR$ = MinBengn

Under 50 50 to 69 70 & Over Total Under 50 50 to 69 70 & Over Total
+-------------------------------+ +-------------------------------+
Tokyo | 68 46 6 | 120 Tokyo | 7 9 3 | 19
Boston | 24 58 26 | 108 Boston | 7 20 18 | 45
Glamorgn | 20 39 11 | 70 Glamorgn | 7 12 7 | 26
+-------------------------------+ +-------------------------------+
Total 112 143 43 298 Total 21 41 28 90


SURVIVE$ = Alive SURVIVE$ = Dead
TUMOR$ = MaxBengn TUMOR$ = MaxBengn

Under 50 50 to 69 70 & Over Total Under 50 50 to 69 70 & Over Total
+-------------------------------+ +-------------------------------+
Tokyo | 9 5 1 | 15 Tokyo | 3 2 0 | 5
Boston | 0 3 1 | 4 Boston | 0 2 0 | 2
Glamorgn | 1 4 1 | 6 Glamorgn | 0 0 0 | 0
+-------------------------------+ +-------------------------------+
Total 10 12 3 25 Total 3 4 0 7


SURVIVE$ = Alive SURVIVE$ = Dead
TUMOR$ = MinMalig TUMOR$ = MinMalig

Under 50 50 to 69 70 & Over Total Under 50 50 to 69 70 & Over Total
+-------------------------------+ +-------------------------------+
Tokyo | 26 20 1 | 47 Tokyo | 9 9 2 | 20
Boston | 11 18 15 | 44 Boston | 6 8 9 | 23
Glamorgn | 16 27 12 | 55 Glamorgn | 16 14 3 | 33
+-------------------------------+ +-------------------------------+
Total 53 65 28 146 Total 31 31 14 76


SURVIVE$ = Alive SURVIVE$ = Dead
TUMOR$ = MaxMalig TUMOR$ = MaxMalig

Under 50 50 to 69 70 & Over Total Under 50 50 to 69 70 & Over Total
+-------------------------------+ +-------------------------------+
Tokyo | 25 18 5 | 48 Tokyo | 4 11 1 | 16
Boston | 4 10 1 | 15 Boston | 6 3 3 | 12
Glamorgn | 8 10 4 | 22 Glamorgn | 3 3 3 | 9
+-------------------------------+ +-------------------------------+
Total 37 38 10 85 Total 13 17 7 37
PRINT / LIST
TABULATE survive$ * center$ * age * tumor$
Case frequencies determined by value of variable NUMBER.

Cum Cum
Count Count Pct Pct SURVIVE$ CENTER$ AGE TUMOR$
68. 68. 8.9 8.9 Alive Tokyo Under 50 MinBengn
9. 77. 1.2 10.1 Alive Tokyo Under 50 MaxBengn
26. 103. 3.4 13.5 Alive Tokyo Under 50 MinMalig
25. 128. 3.3 16.8 Alive Tokyo Under 50 MaxMalig
46. 174. 6.0 22.8 Alive Tokyo 50 to 69 MinBengn
5. 179. .7 23.4 Alive Tokyo 50 to 69 MaxBengn
20. 199. 2.6 26.0 Alive Tokyo 50 to 69 MinMalig
18. 217. 2.4 28.4 Alive Tokyo 50 to 69 MaxMalig
6. 223. .8 29.2 Alive Tokyo 70 & Over MinBengn
1. 224. .1 29.3 Alive Tokyo 70 & Over MaxBengn
1. 225. .1 29.5 Alive Tokyo 70 & Over MinMalig
5. 230. .7 30.1 Alive Tokyo 70 & Over MaxMalig
24. 254. 3.1 33.2 Alive Boston Under 50 MinBengn
11. 265. 1.4 34.7 Alive Boston Under 50 MinMalig
4. 269. .5 35.2 Alive Boston Under 50 MaxMalig
58. 327. 7.6 42.8 Alive Boston 50 to 69 MinBengn
3. 330. .4 43.2 Alive Boston 50 to 69 MaxBengn
18. 348. 2.4 45.5 Alive Boston 50 to 69 MinMalig
10. 358. 1.3 46.9 Alive Boston 50 to 69 MaxMalig
26. 384. 3.4 50.3 Alive Boston 70 & Over MinBengn
1. 385. .1 50.4 Alive Boston 70 & Over MaxBengn
15. 400. 2.0 52.4 Alive Boston 70 & Over MinMalig
1. 401. .1 52.5 Alive Boston 70 & Over MaxMalig
20. 421. 2.6 55.1 Alive Glamorgn Under 50 MinBengn
1. 422. .1 55.2 Alive Glamorgn Under 50 MaxBengn
16. 438. 2.1 57.3 Alive Glamorgn Under 50 MinMalig
8. 446. 1.0 58.4 Alive Glamorgn Under 50 MaxMalig
39. 485. 5.1 63.5 Alive Glamorgn 50 to 69 MinBengn
4. 489. .5 64.0 Alive Glamorgn 50 to 69 MaxBengn
27. 516. 3.5 67.5 Alive Glamorgn 50 to 69 MinMalig
10. 526. 1.3 68.8 Alive Glamorgn 50 to 69 MaxMalig
11. 537. 1.4 70.3 Alive Glamorgn 70 & Over MinBengn
1. 538. .1 70.4 Alive Glamorgn 70 & Over MaxBengn
12. 550. 1.6 72.0 Alive Glamorgn 70 & Over MinMalig
4. 554. .5 72.5 Alive Glamorgn 70 & Over MaxMalig
7. 561. .9 73.4 Dead Tokyo Under 50 MinBengn
3. 564. .4 73.8 Dead Tokyo Under 50 MaxBengn
9. 573. 1.2 75.0 Dead Tokyo Under 50 MinMalig
4. 577. .5 75.5 Dead Tokyo Under 50 MaxMalig
9. 586. 1.2 76.7 Dead Tokyo 50 to 69 MinBengn
2. 588. .3 77.0 Dead Tokyo 50 to 69 MaxBengn
9. 597. 1.2 78.1 Dead Tokyo 50 to 69 MinMalig
11. 608. 1.4 79.6 Dead Tokyo 50 to 69 MaxMalig
3. 611. .4 80.0 Dead Tokyo 70 & Over MinBengn
2. 613. .3 80.2 Dead Tokyo 70 & Over MinMalig
1. 614. .1 80.4 Dead Tokyo 70 & Over MaxMalig
7. 621. .9 81.3 Dead Boston Under 50 MinBengn
6. 627. .8 82.1 Dead Boston Under 50 MinMalig
6. 633. .8 82.9 Dead Boston Under 50 MaxMalig
20. 653. 2.6 85.5 Dead Boston 50 to 69 MinBengn
2. 655. .3 85.7 Dead Boston 50 to 69 MaxBengn
8. 663. 1.0 86.8 Dead Boston 50 to 69 MinMalig
3. 666. .4 87.2 Dead Boston 50 to 69 MaxMalig
18. 684. 2.4 89.5 Dead Boston 70 & Over MinBengn
9. 693. 1.2 90.7 Dead Boston 70 & Over MinMalig
3. 696. .4 91.1 Dead Boston 70 & Over MaxMalig
7. 703. .9 92.0 Dead Glamorgn Under 50 MinBengn
16. 719. 2.1 94.1 Dead Glamorgn Under 50 MinMalig
3. 722. .4 94.5 Dead Glamorgn Under 50 MaxMalig
12. 734. 1.6 96.1 Dead Glamorgn 50 to 69 MinBengn
14. 748. 1.8 97.9 Dead Glamorgn 50 to 69 MinMalig
3. 751. .4 98.3 Dead Glamorgn 50 to 69 MaxMalig
7. 758. .9 99.2 Dead Glamorgn 70 & Over MinBengn
3. 761. .4 99.6 Dead Glamorgn 70 & Over MinMalig
3. 764. .4 100.0 Dead Glamorgn 70 & Over MaxMalig
The 35 cells for the women who survived are listed first (the cell for Boston women
under 50 years old with MaxBengn tumors is empty). In the Cum Pct column, we see
that these women make up 72.5% of the sample. Thus, 27.5% did not survive.
Percentages
While list layout provides percentages of the total table count, you might want others.
Here we specify COLPCT in Crosstabs to print the percentage surviving within each
age-by-center stratum. The input is:
The tables follow:
PRINT NONE / COLPCT
TABULATE age * center$ * survive$ * tumor$
PRINT
Column percents
SURVIVE$ (rows) by TUMOR$ (columns)
AGE = Under 50
CENTER$ = Tokyo

MinBengn MaxBengn MinMalig MaxMalig Total N
+-----------------------------------------+
Alive | 90.667 75.000 74.286 86.207 | 84.768 128
Dead | 9.333 25.000 25.714 13.793 | 15.232 23
+-----------------------------------------+
Total 100.000 100.000 100.000 100.000 100.000
N 75 12 35 29 151


AGE = Under 50
CENTER$ = Boston

MinBengn MaxBengn MinMalig MaxMalig Total N
+-----------------------------------------+
Alive | 77.419 0.0 64.706 40.000 | 67.241 39
Dead | 22.581 0.0 35.294 60.000 | 32.759 19
+-----------------------------------------+
Total 100.000 100.000 100.000 100.000 100.000
N 31 0 17 10 58


AGE = Under 50
CENTER$ = Glamorgn

MinBengn MaxBengn MinMalig MaxMalig Total N
+-----------------------------------------+
Alive | 74.074 100.000 50.000 72.727 | 63.380 45
Dead | 25.926 0.0 50.000 27.273 | 36.620 26
+-----------------------------------------+
Total 100.000 100.000 100.000 100.000 100.000
N 27 1 32 11 71


AGE = 50 to 69
CENTER$ = Tokyo

MinBengn MaxBengn MinMalig MaxMalig Total N
+-----------------------------------------+
Alive | 83.636 71.429 68.966 62.069 | 74.167 89
Dead | 16.364 28.571 31.034 37.931 | 25.833 31
+-----------------------------------------+
Total 100.000 100.000 100.000 100.000 100.000
N 55 7 29 29 120


AGE = 50 to 69
CENTER$ = Boston

MinBengn MaxBengn MinMalig MaxMalig Total N
+-----------------------------------------+
Alive | 74.359 60.000 69.231 76.923 | 72.951 89
Dead | 25.641 40.000 30.769 23.077 | 27.049 33
+-----------------------------------------+
Total 100.000 100.000 100.000 100.000 100.000
N 78 5 26 13 122


AGE = 50 to 69
CENTER$ = Glamorgn

MinBengn MaxBengn MinMalig MaxMalig Total N
+-----------------------------------------+
Alive | 76.471 100.000 65.854 76.923 | 73.394 80
Dead | 23.529 0.0 34.146 23.077 | 26.606 29
+-----------------------------------------+
Total 100.000 100.000 100.000 100.000 100.000
N 51 4 41 13 109


AGE = 70 & Over
CENTER$ = Tokyo

MinBengn MaxBengn MinMalig MaxMalig Total N
+-----------------------------------------+
Alive | 66.667 100.000 33.333 83.333 | 68.421 13
Dead | 33.333 0.0 66.667 16.667 | 31.579 6
+-----------------------------------------+
Total 100.000 100.000 100.000 100.000 100.000
N 9 1 3 6 19


AGE = 70 & Over
CENTER$ = Boston

MinBengn MaxBengn MinMalig MaxMalig Total N
+-----------------------------------------+
Alive | 59.091 100.000 62.500 25.000 | 58.904 43
Dead | 40.909 0.0 37.500 75.000 | 41.096 30
+-----------------------------------------+
Total 100.000 100.000 100.000 100.000 100.000
N 44 1 24 4 73

AGE = 70 & Over
CENTER$ = Glamorgn

MinBengn MaxBengn MinMalig MaxMalig Total N
+-----------------------------------------+
Alive | 61.111 100.000 80.000 57.143 | 68.293 28
Dead | 38.889 0.0 20.000 42.857 | 31.707 13
+-----------------------------------------+
Total 100.000 100.000 100.000 100.000 100.000
N 18 1 15 7 41
The percentage of women surviving for each age-by-center combination is reported in
the first row of each panel. In the marginal Total down the right column, we see that
the younger women treated in Tokyo have the best survival rate (84.77%). This is the
row total (128) divided by the total for the stratum (151).
Example 7
Two-Way Table Statistics
For the SURVEY2 data, you study the relationship between marital status and age. This
is a general table: while the categories for AGE are ordered, those for MARITAL are
not. The usual Pearson chi-square statistic is used to test the association between the
two factors. This statistic is the default for Crosstabs.
The data file is the usual cases-by-variables rectangular file with one record for each
person. We split the continuous variable AGE into four categories and add names such
as 30 to 45 for the output. There are too few separated people to tally, so here we
eliminate them and reorder the categories of MARITAL that remain. To supplement the
results, we request row percentages. The input is:
The output follows:
USE survey2
XTAB
LABEL age / .. 29=18 to 29, 30 .. 45=30 to 45,
46 .. 60=46 to 60, 60 .. =Over 60
LABEL marital / 2=Married, 3=Divorced, 1=Never
PRINT / ROWPCT
TABULATE age * marital
Frequencies
AGE (rows) by MARITAL (columns)

Married Divorced Never Total
+----------------------------+
18 to 29 | 17 5 53 | 75
30 to 45 | 48 21 9 | 78
46 to 60 | 39 12 8 | 59
Over 60 | 23 5 3 | 31
+----------------------------+
Total 127 43 73 243


Even though the chi-square statistic is highly significant (87.761; p value < 0.0005), in
the Row percentages table, you see that 70.67% of the youngest age group fall into the
never-married category. Many of these people may be too young to consider marriage.
Eliminating a Stratum
If you eliminate the subjects in the youngest group, is there an association between
marital status and age? To address this question, the input is:
The resulting output is:
Row percents
AGE (rows) by MARITAL (columns)

Married Divorced Never Total N
+----------------------------+
18 to 29 | 22.667 6.667 70.667 | 100.000 75
30 to 45 | 61.538 26.923 11.538 | 100.000 78
46 to 60 | 66.102 20.339 13.559 | 100.000 59
Over 60 | 74.194 16.129 9.677 | 100.000 31
+----------------------------+
Total 52.263 17.695 30.041 100.000
N 127 43 73 243


Test statistic Value df Prob
Pearson Chi-square 87.761 6.000 0.000
SELECT age > 29
PRINT / CHISQ PHI CRAMER CONT ROWPCT
TABULATE age * marital
SELECT
Frequencies
AGE (rows) by MARITAL (columns)

Married Divorced Never Total
+----------------------------+
30 to 45 | 48 21 9 | 78
46 to 60 | 39 12 8 | 59
Over 60 | 23 5 3 | 31
+----------------------------+
Total 110 38 20 168

Row percents
AGE (rows) by MARITAL (columns)

Married Divorced Never Total N
+----------------------------+
30 to 45 | 61.538 26.923 11.538 | 100.000 78
46 to 60 | 66.102 20.339 13.559 | 100.000 59
Over 60 | 74.194 16.129 9.677 | 100.000 31
+----------------------------+
Total 65.476 22.619 11.905 100.000
N 110 38 20 168

The proportion of married people is larger within the Over 60 group than for the 30 to
45 group: 74.19% of the former are married while 61.54% of the latter are married.
The youngest stratum has the most divorced people. However, you cannot say these
proportions differ significantly (chi-square = 2.173, p value = 0.704).
Example 8
Two-Way Table Statistics (Long Results)
This example illustrates LONG results and table input. It uses the AGE by CENTER$
table from the cancer study described in the frequency input example. The input is:
The output follows:
Test statistic Value df Prob
Pearson Chi-square 2.173 4.000 0.704

Coefficient Value Asymptotic Std Error
Phi 0.114
Cramer V 0.080
Contingency 0.113
USE cancer
XTAB
FREQ = number
PRINT LONG
LABEL age / 50=Under 50, 60=50 to 69, 70=70 & Over
TABULATE center$ * age
Frequencies
CENTER$ (rows) by AGE (columns)

Under 50 50 to 69 70 & Over Total
+-------------------------------+
Boston | 58 122 73 | 253
Glamorgn | 71 109 41 | 221
Tokyo | 151 120 19 | 290
+-------------------------------+
Total 280 351 133 764


Expected values
CENTER$ (rows) by AGE (columns)

Under 50 50 to 69 70 & Over
+-------------------------------+
Boston | 92.723 116.234 44.043 |
Glamorgn | 80.995 101.533 38.473 |
Tokyo | 106.283 133.233 50.484 |
+-------------------------------+


The null hypothesis for the Pearson chi-square test is that the table factors are
independent. You reject the hypothesis (chi-square = 74.039, p value < 0.0005). We are
concerned about the analysis of the full table with four factors in the cancer study
because we see an imbalance between AGE and study CENTER. The researchers in
Tokyo entered a much larger proportion of younger women than did the researchers in
the other cities.
Notice that with LONG, SYSTAT reports all statistics for an r × c table, including
those that are appropriate when both factors have ordered categories (gamma, tau-b,
tau-c, rho, and Spearman's rho).
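As a cross-check on the expected values and the chi-square reported above, here is a minimal sketch in plain Python (not SYSTAT code). Each expected count is the product of its row and column totals divided by the table total, and each standardized deviate is (observed - expected) / SQR(expected).

# Sketch: expected counts, standardized deviates, and the Pearson chi-square
# for the CENTER$ (rows) by AGE (columns) table.
from math import sqrt

table = [
    [58, 122, 73],     # Boston
    [71, 109, 41],     # Glamorgn
    [151, 120, 19],    # Tokyo
]

row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]
n = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        exp = row_tot[i] * col_tot[j] / n     # e.g., 253 * 280 / 764 = 92.723
        dev = (obs - exp) / sqrt(exp)         # standardized deviate
        chi2 += dev ** 2
print("Pearson chi-square = %.3f" % chi2)     # 74.039 on (3 - 1) * (3 - 1) = 4 df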
Example 9
Odds Ratios
For a 2 × 2 table with cell counts a, b, c, and d:
Standardized deviates: (Observed-Expected)/SQR(Expected)
CENTER$ (rows) by AGE (columns)

Under 50 50 to 69 70 & Over
+-------------------------------+
Boston | -3.606 0.535 4.363 |
Glamorgn | -1.111 0.741 0.407 |
Tokyo | 4.338 -1.146 -4.431 |
+-------------------------------+


Test statistic Value df Prob
Pearson Chi-square 74.039 4.000 0.000
Likelihood ratio Chi-square 76.963 4.000 0.000
McNemar Symmetry Chi-square 79.401 3.000 0.000

Coefficient Value Asymptotic Std Error
Phi 0.311
Cramer V 0.220
Contingency 0.297
Goodman-Kruskal Gamma -0.417 0.043
Kendall Tau-B -0.275 0.030
Stuart Tau-C -0.265 0.029
Cohen Kappa -0.113 0.022
Spearman Rho -0.305 0.033
Somers D (column dependent) -0.267 0.030
Lambda (column dependent) 0.075 0.038
Uncertainty (column dependent) 0.049 0.011
                    Exposure
                  yes       no
  Disease  yes     a         b
           no      c         d
where, if you designate the Disease yes people sick and the Disease no people well, the
odds ratio (or cross-product ratio) equals the odds that a sick person is exposed divided
by the odds that a well person is exposed, or:

odds ratio = (a / b) / (c / d) = (a * d) / (b * c)
If the odds for the sick and disease-free people are the same, the value of the odds ratio
is 1.0.
As an example, use the SURVEY2 file and study the association between gender and
depressive illness. Be careful to order your table factors so that your odds ratio is
constructed correctly (we use LABEL to do this). The input is:
The output is:
The odds that a female is depressed are 36 to 116, the odds for a male are 8 to 96, and
the odds ratio is 3.724. Thus, in this sample, females are almost four times more likely
to be depressed than males. But, does our sample estimate differ significantly from
1.0? Because the distribution of the odds ratio is very skewed, significance is
USE survey2
XTAB
LABEL casecont / 1=Depressed, 0=Normal
PRINT / FREQ ODDS
TABULATE sex$ * casecont
Frequencies
SEX$ (rows) by CASECONT (columns)

Depressed Normal Total
+---------------------+
Female | 36 116 | 152
Male | 8 96 | 104
+---------------------+
Total 44 212 256


Test statistic Value df Prob
Pearson Chi-square 11.095 1.000 0.001

Coefficient Value Asymptotic Std Error
Odds Ratio 3.724
Ln(Odds) 1.315 0.415
determined by examining Ln(Odds), the natural logarithm of the ratio, and the standard
error of the transformed ratio. Note the symmetry when ratios are transformed:

  Odds ratio     Ln(Odds)
      3            ln 3
      2            ln 2
      1              0
     1/2           -ln 2
     1/3           -ln 3
The value of Ln(Odds) here is 1.315 with a standard error of 0.415. Constructing an
approximate 95% confidence interval using the statistic plus or minus two times its
standard error:

1.315 ± 2 * 0.415  =  1.315 ± 0.830
results in:

0.485 < Ln(Odds) < 2.145
Because 0 is not included in the interval, Ln(Odds) differs significantly from 0, and the
odds ratio differs from 1.0.
Using the calculator to take antilogs of the limits. You can use SYSTAT's calculator to
take antilogs of the limits EXP(0.485) and EXP(2.145) and obtain a confidence interval
for the odds ratio:

exp(0.485) < odds ratio < exp(2.145)

1.624 < odds ratio < 8.542
That is, for the lower limit, type CALC EXP(0.485).
Notice that the proportion of females who are depressed is 0.2368 (from a table of
row percentages not displayed here) and the proportion of males is 0.0769, so you also
reject the hypothesis of equality of proportions (chi-square = 11.095, p value = 0.001).
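Here is a minimal sketch in plain Python (not SYSTAT code) that reproduces these figures, assuming the usual large-sample standard error of Ln(Odds), SQR(1/a + 1/b + 1/c + 1/d), which matches the 0.415 printed above.

# Sketch: odds ratio, Ln(Odds), and an approximate 95% confidence interval
# for the SEX$ by CASECONT table (a = 36, b = 116, c = 8, d = 96).
from math import log, exp, sqrt

a, b, c, d = 36, 116, 8, 96
odds_ratio = (a * d) / (b * c)            # 3.724
ln_odds = log(odds_ratio)                 # 1.315
se = sqrt(1/a + 1/b + 1/c + 1/d)          # large-sample SE of Ln(Odds); about 0.415

lo, hi = ln_odds - 2 * se, ln_odds + 2 * se
print("odds ratio    %.3f" % odds_ratio)
print("Ln(Odds)      %.3f  (SE %.3f)" % (ln_odds, se))
# Close to the 1.624 to 8.542 interval above, which uses the rounded limits 0.485 and 2.145.
print("95%% CI for odds ratio: %.3f to %.3f" % (exp(lo), exp(hi)))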
Example 10
Fisher's Exact Test
Let's say that you are interested in how salaries of female executives compare with
those of male executives at a particular firm. The accountant there will not give you
salaries in dollar figures but does tell you whether the executives' salaries are low or
high:
The sample size is very small. When a table has only two rows and two columns and
PRINT=MEDIUM is set as the length, SYSTAT reports results of five additional tests and
measures: Fisher's exact test, the odds ratio (and Ln(Odds)), Yates corrected chi-
square, and Yule's Q and Y. By setting PRINT=SHORT, you request three of these:
Fisher's exact test, the chi-square test, and Yates corrected chi-square. The input is:
The output follows:
Low High
Male
2 7
Female
5 1
USE salary
XTAB
FREQ = count
LABEL sex / 1=male, 2=female
LABEL earnings / 1=low, 2=high
PRINT / FISHER CHISQ YATES
TABULATE sex * earnings
Frequencies
SEX (rows) by EARNINGS (columns)

low high Total
+---------------+
male | 2 7 | 9
female | 5 1 | 6
+---------------+
Total 7 8 15



WARNING: More than one-fifth of fitted cells are sparse (frequency < 5).
Significance tests computed on this table are suspect.
Test statistic Value df Prob
Pearson Chi-square 5.402 1.000 0.020
Yates corrected Chi-square 3.225 1.000 0.073
Fisher exact test (two-tail) 0.041
Notice that SYSTAT warns you that the results are suspect because the counts in the
table are too low (sparse). Technically, the message states that more than one-fifth of
the cells have expected values (fitted values) of less than 5.
The p value for the Pearson chi-square (0.020) leads you to believe that SEX and
EARNINGS are not independent. But there is a warning about suspect results. This
warning applies to the Pearson chi-square test but not to Fisher's exact test. Fisher's
test counts all possible outcomes exactly, including the ones that produce an
interaction greater than what you observe. The Fisher exact test p value is also
significant. On this basis, you reject the null hypothesis of independence (no
interaction between SEX and EARNINGS).
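Here is a minimal sketch in plain Python (not SYSTAT's implementation) of the two-tailed Fisher exact test for this table. It enumerates the hypergeometric probabilities of all tables with the same margins and sums those that are no more probable than the observed table, reproducing the 0.041 shown above.

# Sketch: two-tailed Fisher exact test for the SEX by EARNINGS table
#   male:   low 2, high 7
#   female: low 5, high 1
from math import comb

a, b, c, d = 2, 7, 5, 1
row1, col1, n = a + b, a + c, a + b + c + d

def hyper_p(x):
    # Probability of x in the (male, low) cell with all margins fixed
    return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

p_obs = hyper_p(a)
x_min, x_max = max(0, row1 + col1 - n), min(row1, col1)
# Two-tailed p: sum over all tables whose probability does not exceed the observed one
p_two_tail = sum(hyper_p(x) for x in range(x_min, x_max + 1) if hyper_p(x) <= p_obs + 1e-12)
print("Fisher exact test (two-tail) p = %.3f" % p_two_tail)   # about 0.041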
Sensitivity
Results for small samples, however, can be fairly sensitive. One case can matter. What
if the accountant forgets one well-paid male executive?
The results of the Fisher exact test indicate that you cannot reject the null hypothesis
of independence. It is too bad that you do not have the actual salaries. Much
information is lost when a quantitative variable like salary is dichotomized into LOW
and HIGH.
What Is a Small Expected Value?
In larger contingency tables, you do not want to see any expected values less than 1.0
or more than 20% of the values less than 5. For large tables with too many small
expected values, there is no remedy but to combine categories or possibly omit a
category that has very few observations.
Frequencies
SEX (rows) by EARNINGS (columns)

low high Total
+---------------+
male | 2 6 | 8
female | 5 1 | 6
+---------------+
Total 7 7 14



WARNING: More than one-fifth of fitted cells are sparse (frequency < 5).
Significance tests computed on this table are suspect.
Test statistic Value df Prob
Pearson Chi-square 4.667 1.000 0.031
Yates corrected Chi-square 2.625 1.000 0.105
Fisher exact test (two-tail) 0.103
Example 11
Cochran's Test of Linear Trend
When one table factor is dichotomous and the other has three or more ordered
categories (for example, low, medium, and high), Cochran's test of linear trend is used
to test the null hypothesis that the slope of a regression line across the proportions is 0.
For example, in studying the relation of depression to education, you form this table
for the SURVEY2 data and plot the proportion depressed:
If you regress the proportions on scores 1, 2, 3, and 4 assigned by SYSTAT to the
ordered categories, you can test whether the slope is significant.
This is what we do in this example. We also explore the relation of depression to
health. The input is:
USE survey2
XTAB
LABEL casecont / 1=Depressed, 0=Normal
LABEL educatn / 1,2=Dropout, 3=HS grad, 4,5=College,
6,7=Degree +
LABEL healthy / 1=Excellent, 2=Good, 3,4=Fair/Poor
PRINT / FREQ COLPCT COCHRAN
TABULATE casecont * educatn
TABULATE casecont * healthy
The output is:
As the level of education increases, the proportion of depressed subjects decreases
(Cochran's Linear Trend = 7.681, df = 1, and Prob (p value) = 0.006). Of those not
graduating from high school (Dropout), 28% are depressed, and 4.55% of those with
advanced degrees are depressed. Notice that the Pearson chi-square is marginally
significant (p value = 0.049). It simply tests the hypothesis that the four proportions are
equal rather than decreasing linearly.
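As an arithmetic cross-check, here is a minimal sketch in plain Python (not SYSTAT code) of one standard form of the linear trend statistic, computed from the depression-by-education counts shown in the output below; it reproduces the 7.681 reported above.

# Sketch: Cochran's test of linear trend for CASECONT by EDUCATN.
# Depressed counts and column totals come from the frequency table below;
# the scores 1..4 are the ones SYSTAT assigns to the ordered categories.
depressed = [14, 18, 11, 1]
totals    = [50, 98, 86, 22]
scores    = [1, 2, 3, 4]

n = sum(totals)
p = sum(depressed) / n                                   # overall proportion depressed
xbar = sum(s * t for s, t in zip(scores, totals)) / n    # weighted mean score

num = sum(d * (s - xbar) for d, s in zip(depressed, scores)) ** 2
den = p * (1 - p) * sum(t * (s - xbar) ** 2 for t, s in zip(totals, scores))
print("Linear trend chi-square = %.3f  (1 df)" % (num / den))   # 7.681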
Frequencies
CASECONT (rows) by EDUCATN (columns)

Dropout HS grad College Degree + Total
+-----------------------------------------+
Depressed | 14 18 11 1 | 44
Normal | 36 80 75 21 | 212
+-----------------------------------------+
Total 50 98 86 22 256


Column percents
CASECONT (rows) by EDUCATN (columns)

Dropout HS grad College Degree + Total N
+-----------------------------------------+
Depressed | 28.000 18.367 12.791 4.545 | 17.187 44
Normal | 72.000 81.633 87.209 95.455 | 82.813 212
+-----------------------------------------+
Total 100.000 100.000 100.000 100.000 100.000
N 50 98 86 22 256


Test statistic Value df Prob
Pearson Chi-square 7.841 3.000 0.049
Cochrans Linear Trend 7.681 1.000 0.006
Frequencies
CASECONT (rows) by HEALTHY (columns)

Excellent Good Fair/Poor Total
+-------------------------------+
Depressed | 16 15 13 | 44
Normal | 105 78 29 | 212
+-------------------------------+
Total 121 93 42 256


Column percents
CASECONT (rows) by HEALTHY (columns)

Excellent Good Fair/Poor Total N
+-------------------------------+
Depressed | 13.223 16.129 30.952 | 17.187 44
Normal | 86.777 83.871 69.048 | 82.813 212
+-------------------------------+
Total 100.000 100.000 100.000 100.000
N 121 93 42 256


Test statistic Value df Prob
Pearson Chi-square 7.000 2.000 0.030
Cochrans Linear Trend 5.671 1.000 0.017
In contrast to education, the proportion of depressed subjects tends to increase
linearly as health deteriorates (p value = 0.017). Only 13% of those in excellent health
are depressed, whereas 31% of cases with fair or poor health report depression.
Example 12
Tables with Ordered Categories
In this example, we focus on statistics for studies in which both table factors have a few
ordered categories. For example, a teacher evaluating the activity level of
schoolchildren may feel that she can't score them from 1 to 20 but that she could
categorize the activity of each child as sedentary, normal, or hyperactive. Here you
study the relation of health status to age. If the category codes are character-valued, you
must indicate the correct ordering (as opposed to the default alphabetical ordering).
For Spearman's rho, instead of using actual data values, the indices of the categories
are used to compute the usual correlation. Gamma measures the probability of getting
like (as opposed to unlike) orders of values. Its numerator is identical to that of
Kendall's tau-b and Stuart's tau-c. The input is:
The output follows:
USE survey2
XTAB
LABEL healthy / 1=Excellent, 2=Good, 3,4=Fair/Poor
LABEL age / .. 29=18 to 29, 30 .. 45=30 to 45,
46 .. 60=46 to 60, 60 .. =Over 60
PRINT / FREQ ROWP GAMMA RHO
TABULATE healthy * age
Frequencies
HEALTHY (rows) by AGE (columns)

18 to 29 30 to 45 46 to 60 Over 60 Total
+-----------------------------------------+
Excellent | 43 48 25 5 | 121
Good | 30 23 24 16 | 93
Fair/Poor | 6 9 15 12 | 42
+-----------------------------------------+
Total 79 80 64 33 256



Row percents
HEALTHY (rows) by AGE (columns)

18 to 29 30 to 45 46 to 60 Over 60 Total N
+-----------------------------------------+
Excellent | 35.537 39.669 20.661 4.132 | 100.000 121
Good | 32.258 24.731 25.806 17.204 | 100.000 93
Fair/Poor | 14.286 21.429 35.714 28.571 | 100.000 42
+-----------------------------------------+
Total 30.859 31.250 25.000 12.891 100.000
N 79 80 64 33 256
Not surprisingly, as age increases, health status tends to deteriorate. In the table of row
percentages, notice that among those with EXCELLENT health, 4.13% are in the oldest
age group; in the GOOD category, 17.2% are in the oldest group; and in the
FAIR/POOR category, 28.57% are in the oldest group.
The value of gamma is 0.346; rho is 0.274. Here are confidence intervals (Value ±
2 * Asymptotic Std Error) for each statistic:

0.202 <= 0.346 <= 0.490
0.158 <= 0.274 <= 0.390
Because 0 is in neither interval, you conclude that there is an association between
health and age.
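Here is a minimal sketch in plain Python (not SYSTAT code) that reproduces the gamma value from concordant and discordant pairs and then forms the rough intervals above. The asymptotic standard errors (0.072 and 0.058) are taken from the SYSTAT output rather than recomputed.

# Sketch: Goodman-Kruskal gamma from concordant/discordant pairs,
# for the HEALTHY (rows) by AGE (columns) counts shown above.
table = [
    [43, 48, 25, 5],     # Excellent
    [30, 23, 24, 16],    # Good
    [6,  9,  15, 12],    # Fair/Poor
]
R, C = len(table), len(table[0])

concordant = discordant = 0
for i in range(R):
    for j in range(C):
        for k in range(R):
            for m in range(C):
                if k > i and m > j:
                    concordant += table[i][j] * table[k][m]
                if k > i and m < j:
                    discordant += table[i][j] * table[k][m]

gamma = (concordant - discordant) / (concordant + discordant)
print("gamma = %.3f" % gamma)                    # 0.346

# Rough intervals of the form Value +/- 2 * ASE, with the ASEs from the output
for name, value, ase in [("gamma", gamma, 0.072), ("rho", 0.274, 0.058)]:
    print("%s: %.3f to %.3f" % (name, value - 2 * ase, value + 2 * ase))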
Example 13
McNemar's Test of Symmetry
In November of 1993, the U.S. Congress approved the North American Free Trade
Agreement (NAFTA). Let's say that two months before the approval and before the
televised debate between Vice President Al Gore and businessman Ross Perot, political
pollsters queried a sample of 350 people, asking "Are you for, unsure, or against
NAFTA?" Immediately after the debate, the pollsters contacted the same people and
asked the question a second time. Here are the responses:
The pollsters wonder, "Is there a shift in opinion about NAFTA?" The study design for
the answer is similar to a paired t test: each subject has two responses. The row and
column categories of our table are the same variable measured at different points in time.
Test statistic Value df Prob
Pearson Chi-square 29.380 6.000 0.000

Coefficient Value Asymptotic Std Error
Goodman-Kruskal Gamma 0.346 0.072
Spearman Rho 0.274 0.058
                        After
                For    Unsure   Against
  Before For     51      22        28
         Unsure  46      18        27
         Against 52      49        57
The file NAFTA contains these data. To test for an opinion shift, the input is:
We use ORDER to ensure that the row and column categories are ordered the same. The
output follows:
The McNemar test of symmetry focuses on the counts in the off-diagonal cells (those
along the diagonal are not used in the computations). We are investigating the direction
of change in opinion. First, how many respondents became more negative about
NAFTA?
- Among those who initially responded For, 22 (6.29%) are now Unsure and 28
(8%) are now Against.
- Among those who were Unsure before the debate, 27 (7.71%) answered Against
afterwards.
USE nafta
XTAB
FREQ = count
ORDER before$ after$ / SORT=for,unsure,against
PRINT / FREQ MCNEMAR CHI PERCENT
TABULATE before$ * after$
Frequencies
BEFORE$ (rows) by AFTER$ (columns)

for unsure against Total
+-------------------------+
for | 51 22 28 | 101
unsure | 46 18 27 | 91
against | 52 49 57 | 158
+-------------------------+
Total 149 89 112 350


Percents of total count
BEFORE$ (rows) by AFTER$ (columns)

for unsure against Total N
+-------------------------+
for | 14.571 6.286 8.000 | 28.857 101
unsure | 13.143 5.143 7.714 | 26.000 91
against | 14.857 14.000 16.286 | 45.143 158
+-------------------------+
Total 42.571 25.429 32.000 100.000
N 149 89 112 350

Test statistic Value df Prob
Pearson Chi-square 11.473 4.000 0.022
McNemar Symmetry Chi-square 22.039 3.000 0.000
The three cells in the upper right contain counts for those who became more
unfavorable and comprise 22% (6.29 + 8.00 + 7.71) of the sample. The three cells in
the lower left contain counts for people who became more positive about NAFTA (46,
52, and 49) or 42% of the sample.
The null hypothesis for the McNemar test is that the changes in opinion are equal.
The chi-square statistic for this test is 22.039 with 3 df and p < 0.0005. You reject the
null hypothesis. The pro-NAFTA shift in opinion is significantly greater than the anti-
NAFTA shift.
You also clearly reject the null hypothesis that the row (BEFORE$) and column
(AFTER$) factors are independent (chi-square = 11.473; p = 0.022). However, a test
of independence does not answer your original question about change of opinion and
its direction.
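Here is a minimal sketch in plain Python (not SYSTAT code) of the symmetry chi-square, using the Bowker form of the statistic, which sums (n_ij - n_ji)^2 / (n_ij + n_ji) over the pairs of off-diagonal cells; it reproduces the 22.039 on 3 df reported above.

# Sketch: McNemar/Bowker test of symmetry for the BEFORE$ by AFTER$ table.
table = [
    [51, 22, 28],    # before: for
    [46, 18, 27],    # before: unsure
    [52, 49, 57],    # before: against
]

chi2, df = 0.0, 0
k = len(table)
for i in range(k):
    for j in range(i + 1, k):
        nij, nji = table[i][j], table[j][i]
        if nij + nji > 0:
            chi2 += (nij - nji) ** 2 / (nij + nji)
            df += 1
print("McNemar symmetry chi-square = %.3f on %d df" % (chi2, df))   # 22.039 on 3 df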
Example 14
Confidence Intervals for One-Way Table Percentages
If your data are binomially or multinomially distributed, you may want confidence
intervals on the cell proportions. SYSTAT's confidence intervals are based on an
approximation by Bailey (1980). Crosstabs uses that reference's approximation
number 6 with a continuity correction, which closely fits the real intervals for the
binomial on even small samples and performs well when population proportions are
near 0 or 1. The confidence intervals are scaled on a percentage scale for compatibility
with the other Crosstabs output.
Here is an example using data from Davis (1977) on the number of buses failing
after driving a given distance (1 of 10 distances). Print the percentages of the 191 buses
failing in each distance category to see the coverage of the intervals. The input follows:
USE buses
XTAB
FREQ = count
PRINT NONE / FREQ PERCENT
TABULATE distance / CONFI=.95
The resulting output is:
There are 6 buses in the first distance category; this is 3.14% of the 191 buses. The
confidence interval for this percentage ranges from 0.55 to 8.23%.
Example 15
Mantel-Haenszel Test
For any k × 2 × 2 table, if the output mode is MEDIUM or if you select the Mantel-
Haenszel test, SYSTAT produces the Mantel-Haenszel statistic without continuity
correction. This tests the association between two binary variables controlling for a
stratification variable. The Mantel-Haenszel test is often used to test the effectiveness
of a treatment on an outcome, to test the degree of association between the presence or
absence of a risk factor and the occurrence of a disease, or to compare two survival
distributions.
Frequencies
Values for DISTANCE

1 2 3 4 5 6 7 8 9 10 Total
+-------------------------------------------------------------+
| 6 11 16 25 34 46 33 16 2 2 | 191
+-------------------------------------------------------------+



Percents of total count
Values for DISTANCE

1 2 3 4 5 6 7
+---------------------------------------------------------+
| 3.141 5.759 8.377 13.089 17.801 24.084 17.277 |
+---------------------------------------------------------+


8 9 10 Total N
+-------------------------+
| 8.377 1.047 1.047 | 100.000 191
+-------------------------+



95 percent approximate confidence intervals scaled as cell percents
Values for DISTANCE

1 2 3 4 5 6 7
+---------------------------------------------------------+
| 8.234 11.875 15.259 20.996 26.447 33.420 25.852 |
| 0.548 1.903 3.552 6.905 10.560 15.737 10.142 |
+---------------------------------------------------------+


8 9 10
+-------------------------+
| 15.259 4.914 4.914 |
| 3.552 0.0 0.0 |
+-------------------------+
A study by Ansfield et al. (1977) examined the responses of two different groups of
patients (colon or rectum cancer and breast cancer) to two different treatments. Here
are the data as cell counts with category values, followed by the same counts arranged
as a two-way table for each type of cancer:
The odds ratio (cross-product ratio) for the first table is:

odds (biopsy positive, given treatment A) = 14 / 28
odds (biopsy positive, given treatment B) = 9 / 29

or

(14 / 28) / (9 / 29) = 1.6

Similarly, for the second table, the odds ratio is:

(16 / 32) / (7 / 45) = 3.2
If the odds for treatments A and B are identical, the ratios would both be 1.0. For these
data, the breast cancer patients on treatment A are 1.6 times more likely to have a
positive biopsy than patients on treatment B, while, for the colon-rectum patients, those on
treatment A are 3.2 times more likely to have a positive biopsy than those on treatment
B. But can you say these estimates differ significantly from 1.0? After adjusting for the
CANCER$ TREAT$ RESPONSE$ NUMBER
Colon-Rectum a Positive 16.000
Colon-Rectum b Positive 7.000
Colon-Rectum a Negative 32.000
Colon-Rectum b Negative 45.000
Breast a Positive 14.000
Breast b Positive 9.000
Breast a Negative 28.000
Breast b Negative 29.000
                    Breast Cancer             Colon-Rectum
               Positive   Negative        Positive   Negative
  Treatment A     14         28              16         32
  Treatment B      9         29               7         45
total frequency in each table, the Mantel-Haenszel statistic combines odds ratios across
tables. The input is:
The stratification variable (CANCER$) must be the first variable listed on TABULATE.
The output is:
SYSTAT prints a chi-square test for testing whether this combined estimate equals 1.0
(that is, that the odds for A and B are the same). The probability associated with this chi-square is
0.029, so you reject the hypothesis that the odds ratio is 1.0 and conclude that treatment
A is less effective: more patients on treatment A have positive biopsies after treatment
than patients on treatment B.
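The combined estimate can be reproduced with a minimal sketch in plain Python (not SYSTAT code), assuming the usual Mantel-Haenszel estimator, the ratio of sum(a*d/n) to sum(b*c/n) over the strata; it matches the 2.277 printed above.

# Sketch: Mantel-Haenszel combined odds ratio over the two strata.
strata = [
    # (a, b, c, d) = (A positive, A negative, B positive, B negative)
    (14, 28, 9, 29),    # Breast
    (16, 32, 7, 45),    # Colon-Rectum
]

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
print("Mantel-Haenszel odds ratio = %.3f" % (num / den))   # 2.277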
One assumption required for the Mantel-Haenszel chi-square test is that the odds
ratios are homogeneous across tables. For your example, the second odds ratio is twice
as large as the first. You can use loglinear models to test if a cancer-by-treatment
interaction is needed to fit the cells of the three-way table defined by cancer, treatment,
and response. The difference between this model and one without the interaction was
not significant (a chi-square of 0.36 with 1 df).
USE ansfield
XTAB
FREQ = number
ORDER response$ / SORT=Positive,Negative
PRINT / MANTEL
TABULATE cancer$ * treat$ * response$
Frequencies
TREAT$ (rows) by RESPONSE$ (columns)
CANCER$ = Breast

Positive Negative Total
+---------------------------+
a | 14 28 | 42
b | 9 29 | 38
+---------------------------+
Total 23 57 80


CANCER$ = Colon-Rectum

Positive Negative Total
+---------------------------+
a | 16 32 | 48
b | 7 45 | 52
+---------------------------+
Total 23 77 100


Test statistic Value df Prob
Mantel-Haenszel statistic = 2.277
Mantel-Haenszel Chi-square = 4.739 Probability = 0.029
Computation
All computations are in double precision.
References
Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.:
Lifetime Learning.
Ansfield, F., et al. (1977). A phase III study comparing the clinical utility of four regimens
of 5-fluorouracil. Cancer, 39, 34–40.
Bailey, B. J. R. (1980). Large sample simultaneous confidence intervals for the
multinomial probabilities based on transformations of the cell frequencies.
Technometrics, 22, 583–589.
Davis, D. J. (1977). An analysis of some failure data. Journal of the American Statistical
Association, 72, 113–150.
Fleiss, J. L. (1981). Statistical methods for rates and proportions. 2nd ed. New York: John
Wiley & Sons, Inc.
Morrison, A. S., Black, M. M., Lowe, C. R., MacMahon, B., and Yuasa, S. Y. (1973). Some
international differences in histology and survival in breast cancer. International
Journal of Cancer, 11, 261–267.
Chapter 9
Descriptive Statistics
Leland Wilkinson and Laszlo Engelman
There are many ways to describe data, although not all descriptors are appropriate for
a given sample. Means and standard deviations are useful for data that follow a normal
distribution, but are poor descriptors when the distribution is highly skewed or has
outliers, subgroups, or other anomalies. Some statistics, such as the mean and median,
describe the center of a distribution. These estimates are called measures of location.
Others, such as the standard deviation, describe the spread of the distribution.
Before deciding what you want to describe (location, spread, and so on), you
should consider what type of variables are present. Are the values of a variable
unordered categories, ordered categories, counts, or measurements?
For many statistical purposes, counts are treated as measured variables. Such
variables are called quantitative if it makes sense to do arithmetic on their values.
Means and standard deviations are appropriate for quantitative variables that follow a
normal distribution. Often, however, real data do not meet this assumption of
normality. A descriptive statistic is called robust if the calculations are insensitive to
violations of the assumption of normality. Robust measures include the median,
quartiles, frequency counts, and percentages.
Before requesting descriptive statistics, first scan graphical displays to see if the
shape of the distribution is symmetric, if there are outliers, and if the sample has
subpopulations. If the latter is true, then the sample is not homogeneous, and the
statistics should be calculated for each subgroup separately.
Descriptive Statistics offers the usual mean, standard deviation, and standard error
appropriate for data that follow a normal distribution. It also provides the median,
minimum, maximum, and range. A confidence interval for the mean and standard
errors for skewness and kurtosis can be requested. A stem-and-leaf plot is available
for assessing distributional shape and identifying outliers. Moreover, Descriptive
Statistics provides stratified analyses; that is, you can request results separately for
each level of a grouping variable (such as SEX$) or for each combination of levels of
two or more grouping variables.
Statistical Background
Descriptive statistics are numerical summaries of batches of numbers. Inevitably, these
summaries are misleading, because they mask details of the data. Without them,
however, we would be lost in particulars.
There are many ways to describe a batch of data. Not all are appropriate for every
batch, however. Let's look at the Who's Who data from Chapter 1 to see what this
means. First of all, here is a stem-and-leaf diagram of the ages of 50 randomly sampled
people from Who's Who. A stem-and-leaf diagram is a tally; it shows us the
distribution of the AGE values.
Notice that these data look fairly symmetric and lumpy in the middle. A natural way to
describe this type of distribution would be to report its center and the amount of spread.
Location
How do we describe the center, or central location of the distribution, on a scale? One
way is to pick the value above which half of the data values fall and, by implication,
below which half of the data values fall. This measure is called the median. For our
AGE data, the median age is 56 years. Another measure of location is the center of
gravity of the numbers. Think of turning the stem-and-leaf diagram on its side and
balancing it. The balance point would be the mean. For a batch of numbers, the mean
Stem and leaf plot of variable: AGE , N = 50
Minimum: 34.000
Lower hinge: 49.000
Median: 56.000
Upper hinge: 66.000
Maximum: 81.000
3 4
3 689
4 14
4 H 556778999
5 0011112
5 M 556688889
6 0023
6 H 55677789
7 04
7 5668
8 1
is computed by averaging the values. In our sample, the mean age is 56.7 years. It is
quite close to the median.
Spread
One way to measure spread is to take the difference between the largest and smallest
value in the data. This is called the range. For the age data, the range is 47 years.
Another measure, called the interquartile range or midrange, is the difference between
the values at the limits of the middle 50% of the data. For AGE, this is 17 years. (Using
the statistics at the top of the stem-and-leaf display, subtract the lower hinge from the
upper hinge.) Still another way to measure would be to compute the average variability
in the values. The standard deviation is the square root of the average squared
deviation of values from the mean. For the AGE variable, the standard deviation is
11.62. Following is some output from STATS:
The Normal Distribution
All of these measures of location and spread have their advantages and disadvantages,
but the mean and standard deviation are especially useful for describing data that
follow a normal distribution. The normal distribution is a mathematical curve with
only two parameters in its equation: the mean and standard deviation. As you recall
from Chapter 1, a parameter defines a family of mathematical functions, all of which
have the same general shape. Thus, if data come from a normal distribution, we can
describe them completely (except for random variation) with only a mean and standard
deviation.
Let's see how this works for our AGE data. Shown in the next figure is a histogram
of AGE with the normal curve superimposed. The location (center) of this curve is at
the mean age of the sample (56.7), and its spread is determined by the standard
deviation (11.62).
AGE
N of cases 50
Mean 56.700
Standard Dev 11.620
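Here is a minimal sketch in plain Python of the location and spread measures just described. The sample below is a small made-up set of ages, not the actual Who's Who values, which are not listed here.

# Sketch: location and spread measures for a small, hypothetical sample.
import statistics

ages = [34, 41, 47, 49, 52, 55, 56, 58, 61, 66, 72, 81]   # made-up values

mean   = statistics.mean(ages)
median = statistics.median(ages)
stdev  = statistics.stdev(ages)                  # sample standard deviation (n - 1)
rng    = max(ages) - min(ages)                   # range
q1, q2, q3 = statistics.quantiles(ages, n=4)     # quartiles
print("mean %.2f  median %.2f  sd %.2f" % (mean, median, stdev))
print("range %.2f  interquartile range %.2f" % (rng, q3 - q1))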
The fit of the curve to the data looks excellent. Let's examine the fit in more detail. For
a normal distribution, we would expect 68% of the observations to fall between one
standard deviation below the mean and one standard deviation above the mean (45.1 to
68.3 years). By counting values in the stem-and-leaf diagram, we find 34 cases, right on
target. This is not to say that every number follows a normal distribution exactly,
however. If we looked further, we would find that the tails of this distribution are
slightly shorter than those from a normal distribution, but not enough to worry.
Non-Normal Shape
Before you compute means and standard deviations on everything in sight, however,
let's take a look at some more data: the USDATA data. Following are histograms for
the first two variables, ACCIDENT and CARDIO:
[Figures: histogram of AGE with a normal curve superimposed (referenced above), and histograms of ACCIDENT and CARDIO; the axes show Count and Proportion per Bar versus each variable.]
Notice that the normal curves fit the distributions poorly. ACCIDENT is positively
skewed. That is, it has a long right tail. CARDIO, on the other hand, is negatively
skewed. It has a long left tail. The means (44.3 and 398.5) clearly do not fall in the
centers of the distributions. Furthermore, if you calculate the medians using the Stem
display, you will see that the mean for ACCIDENT is pulled away from the median
(41.9) toward the upper tail and the mean for CARDIO is pulled to the left of the
median (416.2).
In short, means and standard deviations are not good descriptors for non-normal
data. In these cases, you have two alternatives: either transform your data to look
normal, or find other descriptive statistics that characterize the data. If you log the
values of ACCIDENT, for example, the histogram looks quite normal. If you square the
values of CARDIO, the normal fit similarly improves.
If a transformation doesn't work, then you may be looking at data that come from a
different mathematical distribution or are mixtures of subpopulations (see below). The
probability plots in SYSTAT can help you identify certain mathematical distributions.
There is not room here to discuss parameters for more complex probability
distributions. Otherwise, you should turn to distribution-free summary statistics to
characterize your data: the median, range, minimum, maximum, midrange, quartiles,
and percentiles.
Subpopulations
Sometimes, distributions can look non-normal because they are mixtures of different
normal distributions. Let's look at the Fisher/Anderson IRIS flower measurements.
Following is a histogram of PETALLEN (petal length) smoothed by a normal curve:
[Figure: histogram of PETALLEN with a normal curve superimposed; the axes show Count and Proportion per Bar versus petal length.]
We forgot to notice that the petal length measurements involve three different flower
species. You can see one of them at the left. The other two are blended at the right.
Computing a mean and standard deviation on the mixed data is misleading.
The following box plot, split by species, shows how different the subpopulations are:
When there are such differences, you should compute basic statistics by group. If you
want to go on to test whether the differences in subpopulation means are significant,
use analysis of variance.
But first notice that the Setosa flowers (Group 1) have the shortest petals and the
smallest spread; while the Virginica flowers (Group 3) have the longest petals and
widest spread. That is, the size of the cell mean is related to the size of the cell standard
deviation. This violates the assumption of equal variances necessary for a valid
analysis of variance.
Here, we log transform the plot scale:
[Figures: box plots of PETALLEN by SPECIES, first in the original scale and then with a log-transformed scale.]
The spreads of the three distributions are now more similar. For the analysis, we should
log transform the data.
Descriptive Statistics in SYSTAT
Basic Statistics Main Dialog Box
To open the Basic Statistics main dialog box, from the menus choose:
Statistics
Descriptive Statistics
Basic Statistics
The following statistics are available:
- All Options. Calculate all available statistics.
- N. The number of nonmissing values for the variable.
- Minimum. The smallest nonmissing value.
- Maximum. The largest nonmissing value.
- Sum. The total of all nonmissing values of a variable.
- Mean. The arithmetic mean of a variable: the sum of the values divided by the
number of (nonmissing) values.
- SEM. The standard error of the mean is the standard deviation divided by the square
root of the sample size. It is the estimation error, or the average deviation of sample
means from the expected value of a variable.
- CI of Mean. Endpoints for the confidence interval of the mean. You can specify
confidence values between 0 and 1.
- Median. The median estimates the center of a distribution. If the data are sorted in
increasing order, the median is the value above which half of the values fall.
- SD. Standard deviation, a measure of spread, is the square root of the sum of the
squared deviations of the values from the mean divided by (n - 1).
- CV. The coefficient of variation is the standard deviation divided by the sample
mean.
- Range. The difference between the minimum and the maximum values.
- Variance. The mean of the squared deviations of values from the mean. (Variance
is the standard deviation squared).
- Skewness. A measure of the symmetry of a distribution about its mean. If skewness
is significantly nonzero, the distribution is asymmetric. A significant positive value
indicates a long right tail; a negative value, a long left tail. A skewness coefficient is
considered significant if the absolute value of SKEWNESS / SES is greater than 2
(a small sketch of this check follows the list).
- SES. The standard error of skewness, SQR(6/n).
- Kurtosis. A value of kurtosis significantly greater than 0 indicates that the variable
has longer tails than those for a normal distribution; less than 0 indicates that the
distribution is flatter than a normal distribution. A kurtosis coefficient is considered
significant if the absolute value of KURTOSIS / SEK is greater than 2.
- SEK. The standard error of kurtosis, SQR(24/n).
- Confidence. Confidence level for the confidence interval of the mean. Enter a value
between 0 and 1. (0.95 and 0.99 are typical values).
In addition, you can save the statistics to a data file.
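Here is a minimal sketch in plain Python of the rule of thumb described in the list above. The skewness and kurtosis values are hypothetical placeholders (in practice they come from the STATISTICS output); only the standard errors and the |value / SE| > 2 check are computed.

# Sketch: rule-of-thumb significance check for skewness and kurtosis.
from math import sqrt

def check(name, value, n, k):
    se = sqrt(k / n)                       # SES = SQR(6/n), SEK = SQR(24/n)
    flag = "significant" if abs(value / se) > 2 else "not significant"
    print("%s %.3f  SE %.3f  -> %s" % (name, value, se, flag))

n = 57                                     # e.g., the OURWORLD sample size
check("skewness", 1.20, n, 6)              # hypothetical skewness value
check("kurtosis", -0.30, n, 24)            # hypothetical kurtosis value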
Saving Basic Statistics to a File
If you are saving statistics to a file, you must select the format in which the statistics
are to be saved:
- Variables. Use with a By Groups variable to save selected statistics to a data file.
Each selected statistic is a case in the new data file (both the statistic and the
group(s) are identified). The file contains the variable STATISTIC$ identifying the
statistics.
- Aggregate. Saves aggregate statistics to a data file. For each By Groups category, a
record (case) in the new data file contains all requested statistics. Three characters
are appended to the first eight letters of the variable name to identify the statistics.
The first two characters identify the statistic. The third character represents the
order in which the variables are selected. The statistics correspond to the following
two-letter combinations:

N of cases   NU        Std. Error       SE
Minimum      MI        Std. Deviation   SD
Maximum      MA        Variance         VA
Range        RA        C.V.             CV
Sum          SU        Skewness         SK
Median       MD        SE Skewness      ES
Mean         ME        Kurtosis         KU
CI Upper     CU        SE Kurtosis      EK
CI Lower     CL
Stem Main Dialog Box
To open the Stem main dialog box, from the menus choose:
Statistics
Descriptive Statistics
Stem-and-Leaf
Stem creates a stem-and-leaf plot for one or more variables. The plot shows the
distribution of a variable graphically. In a stem-and-leaf plot, the digits of each number
are separated into a stem and a leaf. The stems are listed as a column on the left, and
the leaves for each stem are in a row on the right. Stem-and-leaf plots also list the
minimum, lower-hinge, median, upper-hinge, and maximum values of the sample.
Unlike histograms, stem-and-leaf plots show actual numeric values to the precision of
the leaves.
The stem-and-leaf plot is useful for assessing distributional shape and identifying
outliers. Values that are markedly different from the others in the sample are labeled
as outside values; that is, the value is more than 1.5 hspreads outside its hinge (the
hspread is the distance between the lower and upper hinges, or quartiles). Under
normality, this translates into roughly 2.7 standard deviations from the mean.
The following must be specified to obtain a stem-and-leaf plot:
- Variable(s). A separate stem-and-leaf plot is created for each selected variable.
In addition, you can indicate how many lines (stems) to include in the plot.
Cronbach Main Dialog Box
To open the Cronbach main dialog box, from the menus choose:
Statistics
Descriptive Statistics
Cronbach's Alpha
Cronbach computes Cronbach's alpha. This statistic is a lower bound for test reliability
and ranges in value from 0 to 1 (negative values can occur when items are negatively
correlated). Alpha can be viewed as the correlation between the items (variables)
selected and all other possible tests or scales (with the same number of items)
constructed to measure the characteristic of interest. The formula used to calculate
alpha is:
alpha = (k * avg(cov) / avg(var)) / (1 + (k - 1) * avg(cov) / avg(var))
where k is the number of items, avg(cov) is the average covariance among the items,
and avg(var) is the average variance. Note that alpha depends on both the number of
items and the correlations among them. Even when the average correlation is small, the
reliability coefficient can be large if the number of items is large.
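Here is a minimal sketch in plain Python of the formula above. The item scores are a small made-up example; the average covariance and average variance are computed directly from them (statistics.covariance requires Python 3.10 or later).

# Sketch: Cronbach's alpha from the average covariance and average variance.
import statistics

items = [
    [3, 4, 2, 5, 4, 3],    # item 1 scores for six respondents (hypothetical)
    [2, 4, 3, 5, 4, 2],    # item 2
    [3, 5, 2, 4, 5, 3],    # item 3
]
k = len(items)

variances = [statistics.variance(item) for item in items]
covariances = [statistics.covariance(items[i], items[j])
               for i in range(k) for j in range(i + 1, k)]

ratio = statistics.mean(covariances) / statistics.mean(variances)
alpha = (k * ratio) / (1 + (k - 1) * ratio)
print("Cronbach's alpha = %.3f" % alpha)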
The following must be specified to obtain Cronbach's alpha:
- Variable(s). To obtain Cronbach's alpha, at least two variables must be selected.
Using Commands
To generate descriptive statistics, choose your data by typing USE filename, and
continue with:
Usage Considerations
Types of data. STATS uses only numeric data.
Print options. The output is standard for all PRINT options.
Quick Graphs. STATS does not create Quick Graphs.
Saving files. STATS saves basic statistics as either records (cases) or as variables.
BY groups. STATS analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. STATS uses the FREQ variable, if present, to duplicate cases.
Case weights. STATS uses the WEIGHT variable, if present, to weight cases. However,
STEM is not affected by the WEIGHT variable.
STATISTICS
STEM varlist / LINES=n
CRONBACH varlist
SAVE / AG
STATISTICS varlist / ALL N MIN MAX SUM MEAN SEM CIM,
CONFI=n MEDIAN SD CV RANGE VARIANCE,
SKEWNESS SES KURTOSIS SEK
Examples
Example 1
Basic Statistics
This example uses the OURWORLD data file, containing one record for each of 57
countries, and requests the default set of statistics for BABYMORT (infant mortality),
GNP_86 (GNP per capita in 1986), LITERACY (percentage of the population who can
read), and POP_1990 (population, in millions, in 1990).
The Statistics procedure knows only that these are numeric variables; it does not
know if the mean and standard deviation are appropriate descriptors for their
distributions. In other examples, we learned that the distribution of infant mortality is
right-skewed and has distinct subpopulations, the GNP is missing for 12.3% of the
countries, the distribution of LITERACY is left-skewed and has distinct subgroups, and
a log transformation markedly improves the symmetry of the population values. This
example ignores those findings.
The input is:
Following is the output:
For each variable, SYSTAT prints the number of cases (N of cases) with data present.
Notice that the sample size for GNP_86 is 50, or 7 less than the total observations. For
each variable, Minimum is the smallest value and Maximum, the largest. Thus, the
lowest infant mortality rate is 5 deaths (per 1,000 live births), and the highest is 154
deaths. In a symmetric distribution, the mean and median are approximately the same.
The median for POP_1990 is 10.354 million people (see the stem-and-leaf plot
example). Here, the mean is 22.8 millionmore than double the median. This estimate
of the mean is quite sensitive to the extreme values in the right tail.
Standard Dev, or standard deviation, measures the spread of the values in each
distribution. When the data follow a normal distribution, we expect roughly 95% of the
values to fall within two standard deviations of the mean.
STATISTICS
USE ourworld
STATISTICS babymort gnp_86 literacy pop_1990
BABYMORT GNP_86 LITERACY POP_1990
N of cases 57 50 57 57
Minimum 5.0000 120.0000 11.6000 0.2627
Maximum 154.0000 17680.0000 100.0000 152.5051
Mean 48.1404 4310.8000 73.5632 22.8003
Standard Dev 47.2355 4905.8773 29.7646 30.3655
Example 2
Saving Basic Statistics: One Statistic and One Grouping Variable
For European, Islamic, and New World countries, we save the median infant mortality
rate, gross national product, literacy rate, and 1990 population using the OURWORLD
data file. The input is:
The text results that appear on the screen are shown below (they can also be sent to a
text file).
The MYSTATS data file (created in the SAVE step) is shown below:
Use a statement such as this to eliminate the sample size records:
STATISTICS
USE ourworld
BY group$
SAVE mystats
STATISTICS babymort gnp_86 literacy pop_1990 / MEDIAN
BY
The following results are for:
GROUP$ = Europe
BABYMORT GNP_86 LITERACY POP_1990
N of cases 20 18 20 20
Median 6.000 9610.000 99.000 10.462

GROUP$ = Islamic
BABYMORT GNP_86 LITERACY POP_1990
N of cases 16 12 16 16
Median 113.000 335.000 28.550 16.686

GROUP$ = NewWorld
BABYMORT GNP_86 LITERACY POP_1990
N of cases 21 20 21 21
Median 32.000 1275.000 85.600 7.241
Case GROUP$ STATISTIC$ BABYMORT GNP_86 LITERACY POP_1990
1 Europe N of cases 20 18 20 20
2 Europe Median 6 9610 99 10.462
3 Islamic N of cases 16 12 16 16
4 Islamic Median 113 335 28.550 16.686
5 NewWorld N of cases 21 20 21 21
6 NewWorld Median 32 1275 85.6 7.241
SELECT statistic$ <> "N of cases"
Example 3
Saving Basic Statistics: Multiple Statistics and Grouping Variables
If you want to save two or more statistics for each unique cross-classification of the
values of the grouping variables, SYSTAT can write the results in two ways:
n A separate record for each statistic. The values of a new variable named
STATISTIC$ identify the statistics.
n One record containing all the requested statistics. SYSTAT generates variable
names to label the results.
The first layout is the default; the second is obtained using:
SAVE filename / AG
As examples, we save the median, mean, and standard error of the mean for the cross-
classification of type of country with government for the OURWORLD data. The nine
cells for which we compute statistics are shown below (the number of countries is
displayed in each cell):
             Democracy   Military   One Party
Europe           16          0          4
Islamic           4          7          5
New World        12          6          3
Note the empty cell in the first row. We illustrate both file layouts: a separate record
for each statistic and one record for all results.
One record per statistic. The following commands are used to compute and save
statistics for the combinations of GROUP$ and GOV$ shown in the table above:
STATISTICS
USE ourworld
BY group$ gov$
SAVE mystats2
STATISTICS babymort gnp_86 literacy pop_1990 / MEDIAN MEAN SEM
BY
The MYSTATS2 file with 32 cases and seven variables is shown below:
The average infant mortality rate for European democratic nations is 6.875 (case 2),
while the median is 6.0 (case 4).
One record for all statistics. Instead of four records (cases) for each combination of
GROUP$ and GOV$, we specify AG (aggregate) to prompt SYSTAT to write one
record for each cell:
Case GROUP$ GOV$ STATISTC$ BABYMORT GNP_86 LITERACY POP_1990
1 Europe Democracy N of Cases 16.000 16.000 16.000 16.000
2 Europe Democracy Mean 6.875 9770.000 97.250 22.427
3 Europe Democracy Std. Error 0.547 1057.226 1.055 5.751
4 Europe Democracy Median 6.000 10005.000 99.000 9.969
5 Europe OneParty N of Cases 4.000 2.000 4.000 4.000
6 Europe OneParty Mean 11.500 2045.000 98.750 20.084
7 Europe OneParty Std. Error 1.708 25.000 0.250 6.036
8 Europe OneParty Median 12.000 2045.000 99.000 15.995
9 Islamic Democracy N of Cases 4.000 4.000 4.000 4.000
10 Islamic Democracy Mean 91.000 700.000 37.300 12.761
11 Islamic Democracy Std. Error 23.083 378.660 9.312 5.315
12 Islamic Democracy Median 97.000 370.000 29.550 12.612
13 Islamic OneParty N of Cases 5.000 3.000 5.000 5.000
14 Islamic OneParty Mean 109.800 1016.667 29.720 15.355
15 Islamic OneParty Std. Error 15.124 787.196 9.786 3.289
16 Islamic OneParty Median 116.000 280.000 18.000 15.862
17 Islamic Military N of Cases 7.000 5.000 7.000 7.000
18 Islamic Military Mean 110.857 458.000 37.886 51.444
19 Islamic Military Std. Error 11.801 180.039 7.779 18.678
20 Islamic Military Median 116.000 350.000 29.000 51.667
21 NewWorld Democracy N of Cases 12.000 12.000 12.000 12.000
22 NewWorld Democracy Mean 44.667 2894.167 85.800 26.490
23 NewWorld Democracy Std. Error 9.764 1085.810 3.143 11.926
24 NewWorld Democracy Median 35.000 1645.000 86.800 15.102
25 NewWorld OneParty N of Cases 3.000 2.000 3.000 3.000
26 NewWorld OneParty Mean 14.667 2995.000 90.500 4.441
27 NewWorld OneParty Std. Error 1.333 2155.000 8.251 3.153
28 NewWorld OneParty Median 16.000 2995.000 98.500 2.441
29 NewWorld Military N of Cases 6.000 6.000 6.000 6.000
30 NewWorld Military Mean 53.167 1045.000 63.000 6.886
31 NewWorld Military Std. Error 13.245 287.573 10.820 1.515
32 NewWorld Military Median 55.000 780.000 60.500 5.726
The MYSTATS3 file, with 8 cases and 18 variables, is shown below. (We separated them
into three panels and shortened the variable names):
Note that there are no European countries with Military governments, so no record is
written.
STATISTICS
USE ourworld
BY group$ gov$
SAVE mystats3 / AG
STATISTICS babymort gnp_86 literacy pop_1990 / MEDIAN MEAN SEM
BY
Case GROUP$ GOV$ NU1BABYM ME1BABYM SE1BABYM MD1BABYM
1 Europe Democracy 16 6.875 0.547 6.0
2 Europe OneParty 4 11.500 1.708 12.0
3 Islamic Democracy 4 91.000 23.083 97.0
4 Islamic OneParty 5 109.800 15.124 116.0
5 Islamic Military 7 110.857 11.801 116.0
6 NewWorld Democracy 12 44.667 9.764 35.0
7 NewWorld OneParty 3 14.667 1.333 16.0
8 NewWorld Military 6 53.167 13.245 55.0
NU2GNP_8 ME2GNP_8 SE2GNP_8 MD2GNP_8 NU3LITER ME3LITER
16 9770.000 1057.226 10005 16 97.250
2 2045.000 25.000 2045 4 98.750
4 700.000 378.660 370 4 37.300
3 1016.667 787.196 280 5 29.720
5 458.000 180.039 350 7 37.886
12 2894.167 1085.810 1645 12 85.800
2 2995.000 2155.000 2995 3 90.500
6 1045.000 287.573 780 6 63.000
SE3LITER MD3LITER NU4POP_1 ME4POP_1 SE4POP_1 MD4POP_1
1.055 99.0 16 22.427 5.751 9.969
0.250 99.0 4 20.084 6.036 15.995
9.312 29.5 4 12.761 5.315 12.612
9.786 18.0 5 15.355 3.289 15.862
7.779 29.0 7 51.444 18.678 51.667
3.143 86.8 12 26.490 11.926 15.102
8.251 98.5 3 4.441 3.153 2.441
10.820 60.5 6 6.886 1.515 5.726
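The two saved-file layouts shown in this example (one record per statistic versus one record per cell) have close analogues in other software. The sketch below is written in Python with pandas rather than SYSTAT, and the data frame, column names, and values are made up for illustration only.

import pandas as pd

df = pd.DataFrame({
    "group":    ["Europe", "Europe", "Islamic", "Islamic", "NewWorld", "NewWorld"],
    "gov":      ["Democracy", "Democracy", "Military", "Military", "OneParty", "OneParty"],
    "babymort": [6.0, 7.0, 116.0, 107.0, 16.0, 12.0],
})

stats = df.groupby(["group", "gov"])["babymort"].agg(["count", "mean", "median"])

wide = stats.reset_index()                      # one record per cell (like SAVE / AG)
long = (stats.stack()                           # one record per statistic
             .rename("value")
             .rename_axis(["group", "gov", "statistic"])
             .reset_index())

print(wide)
print(long)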
Example 4
Stem-and-Leaf Plot
We request robust statistics for BABYMORT (infant mortality), POP_1990 (1990
population in millions), and LITERACY (percentage of the population who can read)
from the OURWORLD data file. The input is:
The output follows:
STATISTICS
USE ourworld
STEM babymort pop_1990 literacy
Stem and Leaf Plot of variable: BABYMORT, N = 57
Minimum: 5.0000
Lower hinge: 7.0000
Median: 22.0000
Upper hinge: 74.0000
Maximum: 154.0000

0 H 5666666666677777
1 00123456668
2 M 227
3 028
4 9
5
6 11224779
7 H 4
8 77
9
10 77
11 066
12 559
13 6
14 07
15 4

Stem and Leaf Plot of variable: POP_1990, N = 57
Minimum: 0.2627
Lower hinge: 6.1421
Median: 10.3545
Upper hinge: 25.5665
Maximum: 152.5051

0 00122333444
0 H 5556667777788899
1 M 0000034
1 556789
2 14
2 H 56
3 23
3 79
4
4
5 1
* * * Outside Values * * *
5 6677
6 2
11 48
15 2

In a stem-and-leaf plot, the digits of each number are separated into a stem and a leaf.
The stems are listed as a column on the left, and the leaves for each stem are in a row
on the right. For infant mortality (BABYMORT), the Maximum number of babies who
die in their first year of life is 154 (out of 1,000 live births). Look for this value at the
bottom of the BABYMORT display. The stem for 154 is 15, and the leaf is 4. The
Minimum value for this variable is 5; its leaf is 5 with a stem of 0.
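The stem/leaf split just described is easy to reproduce. Here is a small Python sketch (not SYSTAT's algorithm) that uses the tens digit as the stem and the units digit as the leaf for a few made-up rates.

from collections import defaultdict

values = [5, 6, 22, 49, 74, 107, 154]
plot = defaultdict(list)
for v in sorted(values):
    stem, leaf = divmod(int(v), 10)    # 154 -> stem 15, leaf 4
    plot[stem].append(leaf)

for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>3} {leaves}")

Unlike SYSTAT, this sketch does not split or collapse stems and does not flag outside values; it only illustrates how each number is divided into a stem and a leaf.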
The median value of 22 is printed here as the Median in the top panel and marked
by an M in the plot. The hinges, marked by Hs in the plot, are 7 and 74 deaths, meaning
that 25% of the countries in our sample have a death rate of 7 or less, and another 25%
have a rate of 74 or higher. Furthermore, the gaps between 49 and 61 deaths and
between 87 and 107 indicate that the sample does not appear homogeneous.
Focusing on the second plot, the median population size is 10.354, or more than 10
million people. One-quarter of the countries have a population of 6.142 million or less.
The largest country (Brazil) has more than 152 million people. The largest stem for
POP_1990 is 15, like that for BABYMORT. This 15 comes from 152.505, so the 2 is
the leaf and the 0.505 is lost.
The plot for POP_1990 is very right-skewed. Notice that a real number line extends
from the minimum stem of 0 (0.623) to the stem of 5 for 51 million. The values below
Outside Values (stems of 5, 6, 11, and 15 with 8 leaves) do not fall along a number line,
so the right tail of this distribution extends further than one would think at first glance.
The median in the final plot indicates that half of the countries in our sample have
a literacy rate of 88% or better. The upper hinge is 99%, so more than one-quarter of
the countries have a rate of 99% or better. In the country with the lowest rate (Somalia),
only 11.6% of the people can read. The stem for 11.6 is 1 (the 10s digit), and the leaf
is 1 (the units digit). The 0.6 is not part of the display. For stem 10, there are two leaves
that are 0, so two countries have 100% literacy rates (Finland and Norway). Notice
the 11 countries (at the top of the plot) with very low rates. Is there a separate subgroup
here?
Stem and Leaf Plot of variable: LITERACY, N = 57
Minimum: 11.6000
Lower hinge: 55.0000
Median: 88.0000
Upper hinge: 99.0000
Maximum: 100.0000

1 1258
2 035689
3 1
4
5 H 002556
6 355
7 0446
8 M 03558
9 H 03344457888889999999999999
10 00
Transformations
Because the distribution of POP_1990 is very skewed, it may not be suited for analyses
based on normality. To find out, we transform the population values to log base 10 units
using the L10 function. The input is:
Following is the output:
For the untransformed values of the population, the stem-and-leaf plot identifies eight
outliers. Here, there is only one outlier. More important, however, is the fact that the
shape of the distribution for these transformed values is much more symmetric.
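The effect of the transformation can also be checked outside SYSTAT. This Python sketch applies a base-10 log (the counterpart of the L10 function) to a few made-up population values and compares a simple skewness measure before and after; the numbers are illustrative, not the OURWORLD data.

import numpy as np

pop = np.array([0.26, 3.1, 6.1, 10.4, 25.6, 51.7, 152.5])   # millions (made up)
logpop = np.log10(pop)

def skewness(x):
    # simple moment-based skewness
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

print("skewness before:", round(skewness(pop), 2))
print("skewness after: ", round(skewness(logpop), 2))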
STATISTICS
USE ourworld
LET logpop90=L10(pop_1990)
STEM logpop90
Stem and Leaf Plot of variable: LOGPOP90, N = 57
Minimum: -0.5806
Lower hinge: 0.7883
Median: 1.0151
Upper hinge: 1.4077
Maximum: 2.1833

-0 5
* * * Outside Values * * *
0 01
0 33
0 445
0 H 6667777
0 888888899999
1 M 00000111
1 2222233
1 H 445555
1 777777
1
2 001
2
Subpopulations
Here, we stratify the values of LITERACY for countries grouped as European, Islamic,
and New World. The input is:
STATISTICS
USE ourworld
BY group$
STEM babymort pop_1990 literacy
BY
The output follows:
The following results are for:
GROUP$ = Europe

Stem and Leaf Plot of variable: LITERACY, N = 20
Minimum: 83.0000
Lower hinge: 98.0000
Median: 99.0000
Upper hinge: 99.0000
Maximum: 100.0000

83 0
93 0
95 0
* * * Outside Values * * *
97 0
98 H 000
99 M 00000000000
100 00

The following results are for:
GROUP$ = Islamic

Stem and Leaf Plot of variable: LITERACY, N = 16
Minimum: 11.6000
Lower hinge: 19.0000
Median: 28.5500
Upper hinge: 53.5000
Maximum: 70.0000

1 H 1258
2 M 05689
3 1
4
5 H 0255
6 5
7 0

The following results are for:
GROUP$ = NewWorld

Stem and Leaf Plot of variable: LITERACY, N = 21
Minimum: 23.0000
Lower hinge: 74.0000
Median: 85.6000
Upper hinge: 94.0000
Maximum: 99.0000

2 3
* * * Outside Values * * *
5 0
5 6
6 3
6 5
7 H 44
7 6
8 0
8 M 558
9 H 03444
9 8899
The literacy rates for Europe and the Islamic nations do not even overlap. The rates
range from 83% to 100% for the Europeans and 11.6% to 70% for the Islamics. Earlier,
11 countries were identified that have rates of 31% or less. From these stratified results,
we learn that 10 of the countries are Islamic and 1 (Haiti) is from the New World. The
Haitian rate (23%) is identified as an outlier with respect to the values of the other New
World countries.
Computation
All computations are in double precision.
Algorithms
SYSTAT uses a one-pass provisional algorithm (Spicer, 1972). Wilkinson and Dallal
(1977) summarize the performance of this algorithm versus those used in several
statistical packages.
References
Spicer, C. C. (1972). Calculation of power sums of deviations about the mean. Applied
Statistics, 21, 226-227.
Wilkinson, L. and Dallal, G. E. (1977). Accuracy of sample moments calculations among
widely used statistical programs. The American Statistician, 31, 128-131.
Chapter 10
Design of Experiments
Herb Stenson
Design of Experiments generates design matrices for a variety of ANOVA and
mixture models. You can use Design of Experiments as an online library of
experimental designs, which can be saved to a SYSTAT file. You can run the
associated experiment, add the values of a dependent variable to the same file, and
analyze the experimental data by using General Linear Model (or another SYSTAT
statistical procedure).
Design of Experiments provides complete and incomplete factorial designs.
Complete factorial designs are simplest and have two or three levels of each factor.
Two-level designs can have two to seven factors, and three-level designs can have two
to five factors.
Incomplete designs offered by Design of Experiments include: Latin square
designs with 3 to 12 levels per factor; selected two-level designs described by Box,
Hunter, and Hunter (1978) with 3 to 11 factors and from 4 to 128 runs; 13 of the most
popular Taguchi (1987) designs; all of the Plackett and Burman (1946) two-level
designs with 4 to 100 runs; the 6 three-, five-, and seven-level designs described by
Plackett and Burman; and the set of 10 three-level designs described by Box and
Behnken (1960) in both their blocked and unblocked versions.
Four types of mixture models described by Cornell (1990) are available: Lattice,
Centroid, Axial, and Screen designs. The number of factors (components of a
mixture) can be as large as your computer's memory allows.
Any design can be replicated as many times as you want, and the runs can be
randomized.
Statistical Background
Modern quality control (or quality improvement, as it is often called) places an
emphasis on designing quality into products from the start, as well as monitoring
quality during production. In an industrial setting, researchers need information about
factors that influence the quality of a product. To get this information, they design
experiments that vary parameters of the products systematically in order to identify
critical factors and important values of those factors. Such research can be costly and
time-consuming, so experimental designs that provide the maximum amount of
information using the fewest number of runs (trials) are desirable. Such experiments
often have only two or three levels for each factor; there cannot be replications; and the
design may be an incomplete design, meaning that not all possible combinations of
factor levels are used in an experiment. A number of carefully constructed designs are
available for such experiments.
Standard two-level and three-level factorial designs can be generated by the Design
of Experiments procedure. In these designs, each factor has the same number of levels
(two or three), and all possible combinations of the factors are present (that is, the
design is completely crossed). To decrease the number of runs required, omit some
combinations of factors from the experiment. Design of Experiments can generate a
wide variety of such incomplete factorial designs.
The general approach used in constructing incomplete designs is to be sure that at
least the main effects of each factor can be tested. Parsimony is achieved by omitting
cells of the design in such a pattern that the main effects can be tested, but some or all
of the interactions among factors cannot. The number of testable interactions depends
on the particular design, ranging from none (for a Latin square design), to all (for a
complete factorial design). If some, but not all, interactions are included, the highest-
order interactions are usually omitted. Main effects of some factors are completely
confounded (aliased) with these high-order interactions. Thus, if the effect for a
confounded variable is significant, one cannot be sure whether the result is due to the
variable itself or to the interaction with which it is confounded. This is the price of
parsimony. In addition, in an incomplete design, the omission of cells is such that the
remaining cells still form an orthogonal (mutually independent) set, so that the tests of
the factors and interactions (if available) are statistically independent.
Various statisticians such as Plackett and Burman (1946), Box and Behnken (1960),
Box, Hunter, and Hunter (1978), and Taguchi (1987) have contributed to this effort. It
is their experimental designs that are presented here.
In industries such as the petroleum, chemical, or food industries, a special kind of
experiment is required that involves testing various mixtures of components to
determine the properties of the mixtures. A class of standard experimental designs
called mixture models has been developed to meet this need. In mixture models, each
row of the design matrix contains a list of the proportions in which each component
(column of the model) is present in the mixture represented by that row. Thus, the sum
of all the elements in a row must add up to 1. This creates a special degrees-of-freedom
problem that is handled properly by the General Linear Model procedures when the
Mixture model option is selected. Four broad classes of mixture models have been
identified by Cornell (1990). Each is available in Design of Experiments.
Design of Experiments in SYSTAT
Design of Experiments Main Dialog Box
To open the Design of Experiments dialog box, from the menus choose:
Statistics
Design of Experiments
Note: It is not necessary to have a data file open to use Design of Experiments.
Design type. Design of Experiments offers seven different design types: Factorial,
Box-Hunter, Latin, Taguchi, Plackett, Box-Behnken, and Mixture.
The following options are available with Design of Experiments:
Levels. For factorial, Latin, and mixture designs, this is the number of levels for the
factors.
Factors. For factorial, Box-Hunter, Box-Behnken, and lattice mixture designs, this is the
number of factors, or independent variables.
Runs. For Plackett and Box-Hunter designs, this is the number of runs.
Replications. For all designs except Box-Behnken and mixture, this is the number of
replications.
Mixture type. For mixture designs, you can specify a mixture type from the drop-down
list.
Taguchi type. For Taguchi designs, you can select a Taguchi type from the drop-down
list.
Save file. This option saves the design to a file.
Print Options. The following two options are available:
n Use letters for labels. Labels the design factors with letters instead of numbers.
n Print Latin square. For Latin square designs, you can print the Latin square.
Design Options. The following two options are available:
n Randomize. Randomizes the order of experimentation.
n Include blocking factor. For Box-Behnken designs, you can include a blocking
factor.
Factorial Designs
Factorial generates a complete factorial design with either two or three levels per
factor. In complete factorial designs, each factor has the same number of levels and all
possible combinations of factors are present. All main effects and interactions are
estimable in such a design. The number of runs required in a complete factorial design
is n^k, where k is the number of factors and n is the number of levels for each factor.
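The n^k count is simply the number of ways of combining the factor levels. The following Python sketch (illustrative only, not what Design of Experiments runs) enumerates the 2^3 = 8 runs of a two-level, three-factor design.

from itertools import product

levels = ["-", "+"]
factors = 3

for run, cells in enumerate(product(levels, repeat=factors), start=1):
    print(run, *cells)

# number of runs = len(levels) ** factors = 2 ** 3 = 8
# (the run order here differs from SYSTAT's printed layout)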
Box and Hunter Designs
As the number of factors (m) increases, the number of runs required for a complete
factorial design increases rapidly. For factors with two levels, the number of runs
necessary to estimate all main effects and interactions is 2^m.
Fractional factorial designs eliminate higher-order interactions in an effort to
estimate main effects (and lower-order interactions). The Box-Hunter design type
generates design variables for 27 fractional factorial designs for factors with two
levels. You can specify 3 to 11 factors. However, the number of runs must exceed the
number of factors by at least one and must be an even number between 4 and 128.
Notice that some combinations of runs and factors can result in complete (and perhaps
replicated) factorial designs as opposed to the incomplete designs. Using
PRINT=LONG, you can identify which main effects are confounded with which
interactions. These interactions are assumed to be negligible.
Latin Squares Designs
Latin square designs have three factors, all with the same number of levels. Because it
is assumed that there are no interactions, the structure of a Latin square design is often
laid out as a two-way design with the level of the third design factor displayed in each
cell. Here is a 6 x 6 Latin square given by Cochran and Cox (1957):
              Factor B
              1   2   3   4   5   6
          1   6   2   3   4   5   1
          2   2   6   5   3   1   4
Factor A  3   1   4   6   2   3   5
          4   4   1   2   5   6   3
          5   3   5   4   1   2   6
          6   5   3   1   6   4   2
With 3 six-level factors, there are 216 possible combinations or experiments. By using
this Latin square design, only 36 are needed. For example, the 2 in the lower right
corner of the diagram indicates an experiment using level 6 of factor A, level 6 of factor
B, and level 2 of factor C. The default output shows runs (cells) by design codes for the
factors. To obtain the Latin structure as displayed here, use the Print Latin square
option. Randomize randomly permutes the rows and columns of the Latin square. In
addition, for four-level designs, a random selection of one of four possible standard
squares is made prior to permutations. (For three-level designs, only one standard
square exists.)
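A cyclic construction is one simple way to produce a Latin square of any order. The short Python sketch below (not SYSTAT's method) builds one and, optionally, permutes its rows and columns in the spirit of the Randomize option.

import random

def latin_square(n, randomize=False):
    square = [[(i + j) % n + 1 for j in range(n)] for i in range(n)]
    if randomize:
        random.shuffle(square)                     # permute the rows
        cols = list(range(n))
        random.shuffle(cols)                       # permute the columns
        square = [[row[c] for c in cols] for row in square]
    return square

for row in latin_square(6):
    print(row)

Row and column permutations preserve the defining property that each level appears exactly once in every row and every column.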
Taguchi Designs
Taguchi designs allow for a maximum number of main effects to be estimated from a
minimum number of runs in the experiment while allowing for differences in the
number of factor levels. Taguchi creates 13 Taguchi designs; select the appropriate type
from the Taguchi type drop-down list.
Type   Runs   Factors   Levels
L4        4         3   2 each
L8        8         7   2 each
L9        9         4   3 each
L12      12        11   2 each
L16      16        15   2 each
LP16     16         5   4 each
L18      18     1 + 7   2, 3
L25      25         6   5 each
L27      27        23   3 each
L32      32        31   2 each
LP32     32     1 + 9   2, 4
L36      36   11 + 12   2, 3
L54      54    1 + 25   2, 3
(For the mixed designs, the Factors and Levels columns are paired; L18, for example, has 1 two-level factor and 7 three-level factors.)
Plackett-Burman Designs
Plackett-Burman designs are a special type of two-level design that maximizes the
number of unbiased main effect estimates obtained from as few runs as possible. In
general, for a Plackett-Burman design with n levels for each factor and r runs, the r/n
occurrences of each factor level for any factor are paired with r/(n x n) occurrences
of every level of any other factor.
For two-level designs with one fewer factors than runs, specify the number of runs
to be any multiple of 4 between 4 and 100. Designs are also available for factors with
three, five, and seven levels; specify the number of runs as follows for the respective
number of factors and levels:
Number of Runs   Factors   Levels
       9             4        3
      27            13        3
      81            40        3
      25             6        5
     125            31        5
      49             8        7
Box and Behnken Designs
Box and Behnken designs combine a two-level factorial design with an incomplete
block design. All factors in these designs have three levels. You can also include a
blocking factor. The blocking available for each number of factors is shown below:
Factors   With Block
   3      No blocking possible
   4      3 blocks of 9 cases
   5      2 blocks of 23 cases
   6      2 blocks of 27 cases
   7      2 blocks of 31 cases
   9      5 blocks of 26 cases
  10      2 blocks of 85 cases
  11      No blocking possible
  12      2 blocks of 102 cases
  16      6 blocks of 66 cases
Mixture Designs
Mixture designs allow you to estimate the optimal mixture of components. In these
designs, it is the proportions of factors, not their actual amounts, that matter. Because
of this, the total amount of the mixture is scaled to 1.0, making results easily
interpretable as proportions. Four types of mixture models are available:
n Lattice. Lattice designs allow you to specify the number of levels, that is, the number
of values that each component (factor) assumes, including 0 and 1. The selection of
levels has no effect for the other three types of designs available because the
number of factors determines the number of levels for each of them. As Cornell
(1990) points out, the vast majority of mixture research employs lattice models;
however, the other three types included here are useful in specific situations.
n Centroid. Centroid designs consist of every (non-empty) subset of the components,
but only with mixtures in which the components appear in equal proportions. Thus,
if we asked for a centroid design with four factors (components), the mixtures in
the model would consist of all permutations of the set (1,0,0,0), all permutations of
the set (1/2,1/2,0,0), all permutations of the set (1/3,1/3,1/3,0), and the set
(1/4,1/4,1/4,1/4). Thus, the number of distinct points is 1 less than 2 raised to the q
power, where q is the number of components. Centroid designs are useful for
investigating mixtures where incomplete mixtures (with at least one component
absent) are of primary importance. See Cornell (1990) for more information on
centroid designs. (See the sketch after this list.)
n Axial. In an axial design with m components, each run consists of at least (m - 1)
equal proportions of the components. These designs include: mixtures composed
of one component; mixtures composed of (m - 1) components in equal
proportions; and mixtures with equal proportions of all components. Thus, if we
asked for an axial design with four factors (components), the mixtures in the model
would consist of all permutations of the set (1,0,0,0), all permutations of the set
(5/8,1/8,1/8,1/8), all permutations of the set (0,1/3,1/3,1/3), and the set
(1/4,1/4,1/4,1/4). See Cornell (1990) for more information on axial designs.
n Screen. Screening designs are reduced axial designs, omitting the mixtures that
contain all but one component. Thus, if we asked for a screening design with four
factors (components), the mixtures in the model would consist of all permutations
of the set (1,0,0,0), all permutations of the set (5/8,1/8,1/8,1/8), and the set
(1/4,1/4,1/4,1/4). Screening designs enable you to single out unimportant
components from an array of many potential components. See Cornell (1990) for
more information on screening designs.
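As a concrete illustration of the centroid construction described above, here is a small Python sketch (not part of Design of Experiments): it lists every non-empty subset of q components mixed in equal proportions, giving 2^q - 1 distinct points before any replication or randomization.

from itertools import combinations

def centroid_points(q):
    points = []
    for size in range(1, q + 1):
        for subset in combinations(range(q), size):
            point = [0.0] * q
            for i in subset:
                point[i] = round(1.0 / size, 4)
            points.append(tuple(point))
    return points

for p in centroid_points(4):
    print(p)

print("number of points:", len(centroid_points(4)))   # 2**4 - 1 = 15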
Using Commands
With commands:
DESIGN
SAVE filename
FACTORIAL / FACTORS=n REPS=n LETTERS RAND LEVELS = 2 or 3
BOXHUNTER / FACTORS=n RUNS=n REPS=n LETTERS RAND
LATIN / LEVELS=n SQUARE REPS=n LETTERS RAND
TAGUCHI / TYPE=design REPS=n LETTERS RAND
PLACKETT / RUNS=n REPS=n LETTERS RAND
BOXBEHNKEN / FACTORS=n BLOCK LETTERS RAND
MIXTURE / TYPE=LATTICE FACTORS=n LEVELS=n RAND LETTERS
CENTROID
AXIAL
SCREEN
Usage Considerations
Types of data. No data file is needed to use Design of Experiments.
Print options. For Box-Hunter designs, using PRINT=LONG yields a listing of the
generators (confounded effects) for the design. For Taguchi designs, a table defining
the interaction is available.
Quick Graphs. No Quick Graphs are produced.
Saving files. The design can be saved to a file.
BY groups. Analysis by groups is not available.
Bootstrapping. Bootstrapping is not available in this procedure.
Case weights. Case weighting is not available in Design of Experiments.
Examples
Example 1
Complete Factorial Designs
The input for a (2 x 2 x 2) design is:
The output is:
DESIGN
SAVE factor
FACTORIAL / FACTORS=3 LEVELS=2
Full 2-Level Factorial Design: 8 Runs, 3 Factors

Factor
Run 1 2 3

1 - - -
2 + - -
3 - + -
4 + + -
5 - - +
6 + - +
7 - + +
8 + + +
The design matrix has been saved.
Replicates
The input for a (2 x 2 x 2) design with two replications is:
The output is:
Notice that the factors are labeled with letters instead of numbers.
Example 2
Box and Hunter Fractional Factorial Design
To generate a (2 x 2 x 2) fractional factorial, the input is:
DESIGN
SAVE factor
FACTORIAL / FACTORS=3 LEVELS=2 REPS=2 LETTERS
Full 2-Level Factorial Design: 16 Runs, 3 Factors

Factor
Run A B C

1 - - -
2 + - -
3 - + -
4 + + -
5 - - +
6 + - +
7 - + +
8 + + +
9 - - -
10 + - -
11 - + -
12 + + -
13 - - +
14 + - +
15 - + +
16 + + +
The design matrix has been saved.
DESIGN
SAVE boxhun1
BOXHUNTER / FACTORS=3
The resulting output is:
Aliases
For 7 two-level factors, the number of cells (runs) for a complete factorial is 2^7 = 128.
The following example shows the smallest fractional factorial for estimating main
effects. The design codes for the first three factors generate the last four. The input is:
The output is:
The main effect for factor 4 is confounded with the interaction between factors 1 and
2; the main effect for factor 5 is confounded with the interaction between factors 1 and
3; and so on.
Box-Hunter Fractional 2-Level Design: 4 Runs, 3 Factors, Resolution = 3

Factor
Run 1 2 3

1 - - +
2 + - -
3 - + -
4 + + +

Generators for the Requested Design

Factor 3 = 1x2.
The design matrix has been saved.
DESIGN
SAVE boxhun2
PRINT=LONG
BOXHUNTER / FACTORS=7 RUNS=8
Box-Hunter Fractional 2-Level Design: 8 Runs, 7 Factors, Resolution = 3

Factor
Run 1 2 3 4 5 6 7

1 - - - + + + -
2 + - - - - + +
3 - + - - + - +
4 + + - + - - -
5 - - + + - - +
6 + - + - + - -
7 - + + - - + -
8 + + + + + + +

Generators for the Requested Design

Factor 4 = 1x2.
Factor 5 = 1x3.
Factor 6 = 2x3.
Factor 7 = 1x2x3.
The design matrix has been saved.
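The generators listed above can be checked with a few lines of code. In this Python sketch (not the SYSTAT implementation), factors 4 through 7 are formed as element-wise products of the first three +/-1 columns, which is exactly the aliasing pattern reported.

from itertools import product

# full 2**3 design in factors 1-3; factors 4-7 follow from the generators
for f1, f2, f3 in product([-1, 1], repeat=3):
    f4 = f1 * f2          # Factor 4 = 1x2
    f5 = f1 * f3          # Factor 5 = 1x3
    f6 = f2 * f3          # Factor 6 = 2x3
    f7 = f1 * f2 * f3     # Factor 7 = 1x2x3
    print([f1, f2, f3, f4, f5, f6, f7])

The run order and sign labeling may differ from the printed design, but the relationships among the columns are the same.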
Example 3
Latin Squares
To generate a Latin square when each factor has four levels, the input is:
The output is:
If you don't elect to print the Latin square (by omitting SQUARE), the results are as
follows:
Permutations
To randomly assign the factors to the cells, the input is:
DESIGN
LATIN / LEVELS=4 SQUARE LETTERS
Latin-Square Design: 4 Factors, Each With 4 Levels

Factor 2
Factor 1 1 2 3 4

1 A B C D
2 B C D A
3 C D A B
4 D A B C
Latin-Square by Run: 16 Runs; 3 Factors, Each With 4 Levels

Factor
Run A B C

1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 2 1 2
6 2 2 3
7 2 3 4
8 2 4 1
9 3 1 3
10 3 2 4
11 3 3 1
12 3 4 2
13 4 1 4
14 4 2 1
15 4 3 2
16 4 4 3
DESIGN
SAVE latin2
LATIN / LEVELS=4 SQUARE LETTERS RAND
The resulting output is:
Example 4
Taguchi Design
To obtain a Taguchi L12 design with 11 factors, the input is:
The output is:
Design L16 with 15 Two-Level Factors plus Aliases
To obtain a Taguchi L16 design with 15 factors, the input is:
Latin-Square Design: 4 Factors, Each With 4 Levels

Factor 2
Factor 1 1 2 3 4

1 D C A B
2 C A B D
3 B D C A
4 A B D C
DESIGN
SAVE taguchi
TAGUCHI / TYPE=L12
Taguchi Design L12: 12 Runs; 11 Factors, Each With 2 Levels

Factor
Run 1 2 3 4 5 6 7 8 9 10 11

1 1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 2 2 2 2 2 2
3 1 1 2 2 2 1 1 1 2 2 2
4 1 2 1 2 2 1 2 2 1 1 2
5 1 2 2 1 2 2 1 2 1 2 1
6 1 2 2 2 1 2 2 1 2 1 1
7 2 1 2 2 1 1 2 2 1 2 1
8 2 1 2 1 2 2 2 1 1 1 2
9 2 1 1 2 2 2 1 2 2 1 1
10 2 2 2 1 1 1 1 2 2 1 2
11 2 2 1 2 1 2 1 1 1 2 2
12 2 2 1 1 2 1 2 1 2 2 1

Interaction Term(s) for Each Pair of Columns
DESIGN
SAVE taguch2
PRINT=LONG
TAGUCHI / TYPE=L16
The output is:
Example 5
Plackett-Burman Design
To generate a Plackett-Burman design consisting of 11 two-level factors, the input is:
Taguchi Design L16: 16 Runs; 15 Factors, Each With 2 Levels

Factor
Run 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
3 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2
4 1 1 1 2 2 2 2 2 2 2 2 1 1 1 1
5 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
6 1 2 2 1 1 2 2 2 2 1 1 2 2 1 1
7 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1
8 1 2 2 2 2 1 1 2 2 1 1 1 1 2 2
9 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
10 2 1 2 1 2 1 2 2 1 2 1 2 1 2 1
11 2 1 2 2 1 2 1 1 2 1 2 2 1 2 1
12 2 1 2 2 1 2 1 2 1 2 1 1 2 1 2
13 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1
14 2 2 1 1 2 2 1 2 1 1 2 2 1 1 2
15 2 2 1 2 1 1 2 1 2 2 1 2 1 1 2
16 2 2 1 2 1 1 2 2 1 1 2 1 2 2 1

Interaction Term(s) for Each Pair of Columns

Column
Column 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

1
2 3
3 2 1
4 5 6 7
5 4 7 6 1
6 7 4 5 2 3
7 6 5 4 3 2 1
8 9 10 11 12 13 14 15
9 8 11 10 13 12 15 14 1
10 11 8 9 14 15 12 13 2 3
11 10 9 8 15 14 13 12 3 2 1
12 13 14 15 8 9 10 11 4 5 6 7
13 12 15 14 9 8 11 10 5 4 7 6 1
14 15 12 13 10 11 8 9 6 7 4 5 2 3
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
DESIGN
SAVE plackett
PLACKETT / RUNS=12
The output follows:
Example 6
Box and Behnken Design
Each factor in this design has three levels. The input is:
The output is:
Plackett-Burman Design: 12 Runs; 11 Factors, Each With 2 Levels

Factor
Run 1 2 3 4 5 6 7 8 9 10 11

1 + + - + + + - - - + -
2 + - + + + - - - + - +
3 - + + + - - - + - + +
4 + + + - - - + - + + -
5 + + - - - + - + + - +
6 + - - - + - + + - + +
7 - - - + - + + - + + +
8 - - + - + + - + + + -
9 - + - + + - + + + - -
10 + - + + - + + + - - -
11 - + + - + + + - - - +
12 - - - - - - - - - - -
The design matrix has been saved.
DESIGN
SAVE boxbehn
BOXBEHNKEN / FACTORS=3
Box-Behnken Design: 15 Runs; 3 Factors, Each With 3 Levels

Factor
Run 1 2 3

1 - - 0
2 + - 0
3 - + 0
4 + + 0
5 - 0 -
6 + 0 -
7 - 0 +
8 + 0 +
9 0 - -
10 0 + -
11 0 - +
12 0 + +
13 0 0 0
14 0 0 0
15 0 0 0
Example 7
Mixture Design
We illustrate a lattice mixture design in which each of the three factors has five levels;
that is, each component of the mixture is 0%, 25%, 50%, 75%, or 100% of the mixture
for a given run, subject to the restriction that the sum of the percentages is 100. The
input is:
The output is:
After collecting your data, you may want to display it in a triangular scatterplot.
References
Box, G. E. P. and Behnken, D. W. (1960). Some new three level designs for the study of
quantitative variables. Technometrics, vol. 2, 4, 455-475.
Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for experimenters. New
York: John Wiley & Sons, Inc.
Cochran, W. G. and Cox, G. M. (1957). Experimental designs, 2nd ed. New York: John
Wiley & Sons, Inc.
Cornell, J. A. (1990). Experiments with mixtures. New York: John Wiley & Sons, Inc.
Plackett, R. L. and Burman, J. P. (1946). The design of optimum multifactor experiments.
Biometrika, vol. 33, 305-325.
Taguchi, G. (1987). System of experimental design (2 volumes). New York:
UNIPUB/Kraus International Publications.
DESIGN
MIXTURE / TYPE=LATTICE FACTORS=3 LEVELS=5
Lattice Mixture Design: 15 Runs; 3 Factors, Each With 5 Levels

Factor
Run 1 2 3

1 1.000 .000 .000
2 .000 1.000 .000
3 .000 .000 1.000
4 .750 .250 .000
5 .750 .000 .250
6 .000 .750 .250
7 .500 .500 .000
8 .500 .000 .500
9 .000 .500 .500
10 .250 .750 .000
11 .250 .000 .750
12 .000 .250 .750
13 .500 .250 .250
14 .250 .500 .250
15 .250 .250 .500
Chapter 11
Discriminant Analysis
Laszlo Engelman
Discriminant Analysis performs linear and quadratic discriminant analysis, providing
linear or quadratic functions of the variables that best separate cases into two or
more predefined groups. The variables in the linear function can be selected in a
forward or backward stepwise manner, either interactively by the user or
automatically by SYSTAT. For the latter, at each step, SYSTAT enters the variable that
contributes most to the separation of the groups (or removes the variable that is the
least useful).
The command language allows you to emphasize the difference between specific
groups; contrasts can be used to guide variable selection. Cases can be classified even
if they are not used in the computations.
Discriminant analysis is related to both multivariate analysis of variance and
multiple regression. The cases are grouped in cells like a one-way multivariate
analysis of variance and the predictor variables form an equation like that for multiple
regression. In discriminant analysis, Wilks' lambda, the same test statistic used in
multivariate ANOVA, is used to test the equality of group centroids. Discriminant
analysis can be used not only to test multivariate differences among groups, but also
to explore:
n Which variables are most useful for discriminating among groups
n If one subset of variables performs equally well as another
n Which groups are most alike and most different
Statistical Background
When we have categorical variables in a model, it is often because we are trying to
classify cases; that is, what group does someone or something belong to? For example,
we might want to know whether someone with a grade point average (GPA) of 3.5 and
an Advanced Psychology Test score of 600 is more like the group of graduate students
successfully completing a Ph.D. or more like the group that fails. Or, we might want to
know whether an object with a plastic handle and no concave surfaces is more like a
wrench or a screwdriver.
Once we attempt to classify, our attention turns from parameters (coefficients) in a
model to the consequences of classification. We now want to know what proportion of
subjects will be classified correctly and what proportion incorrectly. Discriminant
analysis is one method for answering these questions.
Linear Discriminant Model
If we know that our classifying variables are normally distributed within groups, we
can use a classification procedure called linear discriminant analysis (Fisher, 1936).
Before we present the method, however, we should warn you that the procedure
requires you to know that the groups share a common covariance matrix and you must
know what the covariance matrix values are. We have not found an example of
discriminant analysis in the social sciences where this was true. The most appropriate
applications we have found are in engineering, where a covariance matrix can be
deduced from physical measurements. Discriminant analysis is used, for example, in
automated vision systems for detecting objects on moving conveyer belts.
Why do we need to know the covariance matrix? We are going to use it to calculate
Mahalanobis distances (developed by the Indian statistician Prasanta C.
Mahalanobis). These distances are calculated between cases we want to classify and
the center of each group in a multidimensional space. The closer a case is to the center
of one group (relative to its distance to other groups), the more likely it is to be
classified as belonging to that group. The figure on p. 247 shows what we are doing.
The borders of this graph comprise the two predictors GPA and GRE. The two
hills are centered at the mean values of the two groups (No Ph.D. and Ph.D.). Most
of the data in each group are supposed to be under the highest part of each hill. The
hills, in other words, mathematically represent the concentration of data values in the
scatterplot beneath.
The shape of the hills was computed from a bivariate normal distribution using the
covariance matrix averaged within groups. We've plotted this figure this way to show
you that this model is like pie-in-the-sky if you use the information in the data below
to compute the shape of these hills. As you can see, there is a lot of smoothing of the
data going on, and if one or two data values in the scatterplot influence unduly the
shape of the hills above, you will have an unrepresentative model when you try to use
it on new samples.
How do we classify a new case into one group or another? Look at the figure again.
The new case could belong to one or the other group. It's more likely to belong to
the closer group, however. The simple way to find how far this case is from the center
of each group would be to take a direct walk from the new case to the center of each
group in the data plot.
Instead of walking in sample data space below, however, we must climb the hills of our
theoretical model above when using the normal classification model. In other words,
we will use our theoretical model to calculate distances. The covariance matrix we
used to draw the hills in the figure makes distances depend on the direction we are
heading. The distance to a group is thus proportional to the altitude (not the horizontal
distance) we must climb to get to the top of the corresponding hill.
Because these hills can be oblong in shape, it is possible to be quite far from the top
of the hill as the crow flies, yet have little altitude to cover in a climb. Conversely, it is
possible to be close to the center of the hill and have a steep climb to get to the top.
Discriminant analysis adjusts for the covariance that causes these eccentricities in hill
shape. That is why we need the covariance matrix in the first place.
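The following Python sketch (illustrative only, not SYSTAT's code) carries out the calculation just described: it measures the squared Mahalanobis distance from a case to each group centroid using a pooled within-group covariance matrix. The covariance values and group means are those printed in the output later in this section; the new case's GPA and GRE values are hypothetical.

import numpy as np

pooled_cov = np.array([[0.095,    1.543],
                       [1.543, 4512.409]])        # GPA, GRE (pooled within groups)
inv_cov = np.linalg.inv(pooled_cov)
centroids = {"No Ph.D.": np.array([4.423, 590.490]),
             "Ph.D.":    np.array([4.639, 643.448])}
case = np.array([4.55, 620.0])                    # hypothetical GPA and GRE

def mahalanobis_sq(x, mean, inv_cov):
    d = x - mean
    return float(d @ inv_cov @ d)

distances = {g: mahalanobis_sq(case, m, inv_cov) for g, m in centroids.items()}
print(distances)
print("classified as:", min(distances, key=distances.get))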
So much for the geometric representation. What do the numbers look like? Let's
look at how to set up the problem with SYSTAT. The input is:
The output is:
DISCRIM
USE ADMIT
PRINT LONG
MODEL PHD = GRE,GPA
ESTIMATE
Group frequencies
-----------------
1 2
Frequencies 51 29

Group means
-----------
GPA 4.423 4.639
GRE 590.490 643.448

Pooled within covariance matrix -- DF= 78
------------------------------------------------
GPA GRE
GPA 0.095
GRE 1.543 4512.409

Within correlation matrix
-------------------------
GPA GRE
GPA 1.000
GRE 0.075 1.000

Total covariance matrix -- DF= 79
------------------------------------------------
GPA GRE
GPA 0.104
GRE 4.201 5111.610

Total correlation matrix
------------------------
GPA GRE
GPA 1.000
GRE 0.182 1.000
Between groups F-matrix -- df = 2 77
----------------------------------------------
1 2
1 0.0
2 9.469 0.0

Wilks lambda
Lambda = 0.8026 df = 2 1 78
Approx. F= 9.4690 df = 2 77 prob = 0.0002

Classification functions
----------------------
1 2
Constant -133.910 -150.231
GPA 44.818 46.920
GRE 0.116 0.127


Classification matrix (cases in row categories classified into columns)
---------------------
1 2 %correct
1 38 13 75
2 7 22 76

Total 45 35 75

Jackknifed classification matrix
--------------------------------
1 2 %correct
1 37 14 73
2 7 22 76

Total 44 36 74

Eigen Canonical Cumulative proportion
values correlations of total dispersion
--------- ------------ ---------------------
0.246 0.444 1.000

Wilks lambda= 0.803
Approx.F= 9.469 DF= 2, 77 p-tail= 0.0002

Pillais trace= 0.197
Approx.F= 9.469 DF= 2, 77 p-tail= 0.0002

Lawley-Hotelling trace= 0.246
Approx.F= 9.469 DF= 2, 77 p-tail= 0.0002

Canonical discriminant functions
--------------------------------
1
Constant -15.882
GPA 2.064
GRE 0.011

Canonical discriminant functions -- standardized by within variances
--------------------------------------------------------------------
1
GPA 0.635
GRE 0.727

Canonical scores of group means
-------------------------------
1 -.369
2 .649
There's a lot to follow on this output. The counts and means per group are shown first.
Next comes the Pooled within covariance matrix, computed by averaging the separate-
group covariance matrices, weighting by group size. The Total covariance matrix
ignores the groups. It includes variation due to the group separation. These are the
same matrices found in the MANOVA output with PRINT=LONG. The Between groups
F-matrix shows the F value for testing the difference between each pair of groups on
all the variables (GPA and GRE). The Wilks' lambda is for the multivariate test of
dispersion among all the groups on all the variables, just as in MANOVA. Each case is
classified by our model into the group whose classification function yields the largest
score. Each function is like a regression equation. We compute the predicted value of
each equation for a case's values on GPA and GRE and classify the case into the group
whose function yields the largest value.
Next come the separate F statistics for each variable and the Classification matrix.
The goodness of classification is comparable to that for the PROBIT model. We did a
little worse with the No Ph.D. group and a little better with the Ph.D. The Jackknifed
classification matrix is an attempt to approximate cross-validation. It will tend to be
somewhat optimistic, however, because it uses only information from the current
sample, leaving out single cases to classify the remainder. There is no substitute for
trying the model on new data.
Finally, the program prints the same information produced in a MANOVA by
SYSTAT's MGLH (GLM and ANOVA). The multivariate test statistics show the
groups are significantly different on GPA and GRE taken together.
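To make the classification rule concrete, the sketch below (Python, not SYSTAT) scores a case with the classification functions printed above: constant plus coefficient times value for each group, with the case assigned to the group giving the larger score. The GPA of 3.5 and GRE of 600 echo the hypothetical applicant mentioned at the start of the chapter.

funcs = {
    "No Ph.D.": {"constant": -133.910, "GPA": 44.818, "GRE": 0.116},
    "Ph.D.":    {"constant": -150.231, "GPA": 46.920, "GRE": 0.127},
}
case = {"GPA": 3.5, "GRE": 600.0}

scores = {group: f["constant"] + f["GPA"] * case["GPA"] + f["GRE"] * case["GRE"]
          for group, f in funcs.items()}
print(scores)
print("classified as:", max(scores, key=scores.get))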
Linear Discriminant Function
We mentioned in the last section that the canonical coefficients are like a regression
equation for computing distances up the hills. Let's look more closely at these
coefficients. The following figure shows the plot underlying the surface in the last
figure. Superimposed at the top of the GRE axis are two normal distributions centered
at the means for the two groups. The standard deviations of these normal distributions
are computed within groups. The within-group standard deviation is the square root of
the diagonal GRE variance element of the residual covariance matrix (4512.409). The
same is done for GPA on the right, using square root of the within-groups variance
(0.095) for the standard deviation and the group means for centering the normals.
Either of these variables separates the groups somewhat. The diagonal line underlying
the two diagonal normal distributions represents a linear combination of these two
variables. It is computed using the canonical discriminant functions in the output.
These are the same as the canonical coefficients produced by MGLH. Before applying
these coefficients, the variables must be standardized by the within-group standard
deviations. Finally, the dashed line perpendicular to this diagonal cuts the observations
into two groups: those to the left and those to the right of the dashed line.
You can see that this new canonical variable and its perpendicular dashed line are
an orthogonal (right-angle-preserving) rotation of the original axes. The separation of
the two groups using normal distributions drawn on the rotated canonical variable is
slightly better than that for either variable alone. To classify on the linear discriminant
axis, make the mean on this new variable 0 (halfway between the two diagonal normal
curves). Then add a scale along the diagonal, running from negative to positive. If we
do this, then any observations with negative scores on this diagonal scale will be
classified into the No Ph.D. group (to the left of the dashed perpendicular bisector) and
those with positive scores into the Ph.D. (to the right). All Ys to the left of the dashed
line and Ns to the right are misclassifications. Try rotating these axes any other way
to get a better count of correctly classified cases (watch out for ties). The linear
discriminant function is the best rotation.
Using this linear discriminant function variable, we get the same classifications we
got with the Mahalanobis distance method. Before computers, this was the preferred
method for classifying because the computations are simpler.
We just use the equation:
F_z = 0.635 * Z_GPA + 0.727 * Z_GRE
The two Z variables are the raw scores minus the overall mean, divided by the within-
groups standard deviations. If F_z is less than 0, classify No Ph.D.; otherwise, classify
Ph.D.
As we mentioned, the Mahalanobis method and the linear discriminant function
method are equivalent. This is somewhat evident in the figure. The intersection of the
two hills is a straight line running from the northwest to the southeast corner in the
same orientation as the dashed line. Any point to the left of this line will be closer to
the top of the left hill, and any point to the right will be closer to the top of the right hill.
Prior Probabilities
Our sample contained fewer Ph.D.s than No Ph.D.s. If we want to use our discriminant
model to classify new cases and if we believe that this difference in sample sizes
reflects proportions in the population, then we can adjust our formula to favor No
Ph.D.s. In other words, we can make the prior probabilities (assuming we know
nothing about GRE and GPA scores) favor a No Ph.D. classification. We can do this by
adding the option
PRIORS = 0.625, 0.375
to the MODEL command. Do not be tempted to use this method as a way of improving
your classification table. If the probabilities you choose do not reflect real population
differences, then new samples will on average be classified worse. It would make sense
in our case because we happen to know that more people in our department tend to drop
out than stay for the Ph.D.
You might have guessed that the default setting is for prior probabilities to be equal
(both 0.5). In the last figure, this makes the dashed line run halfway between the means
of the two groups on the discriminant axis. By changing the priors, we move this
dashed line (the normal distributions stay in the same place).
Multiple Groups
The discriminant model generalizes to more than two groups. Imagine, for example,
three hills in the first figure. All the distances and classifications are computed in the
same manner. The posterior probabilities for classifying cases are computed by
comparing three distances rather than two.
The multiple group (canonical) discriminant model yields more than one
discriminant axis. For three groups, we get two sets of canonical discriminant
coefficients. For four groups, we get three. If we have fewer variables than groups, then
we get only as many sets as there are variables. The group classification function
coefficients are handy for classifying new cases with the multiple group model. Simply
multiply each coefficient times each variable and add in the constant. Then assign the
case to the group whose set yields the largest value.
Discriminant Analysis in SYSTAT
Discriminant Analysis Main Dialog Box
To open the Discriminant Analysis dialog box, from the menus choose:
Statistics
Classification
Discriminant Analysis...
The following options can be specified:
Quadratic. The Quadratic check box requests quadratic discriminant analysis. If not
selected, linear discriminant analysis is performed.
Save. For each case, Distances saves the Mahalanobis distances to each group centroid
and the posterior probability of the membership in each group. Scores saves the
canonical variable scores. Scores/Data and Distances/Data save scores and distances
along with the data.
Discriminant Analysis Options
SYSTAT includes several controls for stepwise model building and tolerance. To
access these options, click Options in the main dialog box.
The following can be specified:
Tolerance. The tolerance sets the matrix inversion tolerance limit. Tolerance = 0.001 is
the default.
Two estimation options are available:
n Complete. All variables are used in the model.
n Stepwise. Variables can be selected in a forward or backward stepwise manner,
either interactively by the user or automatically by SYSTAT.
If you select stepwise estimation, you can specify the direction in which the estimation
should proceed, whether SYSTAT should control variable entry and elimination, and
any desired criteria for variable entry and elimination.
n Backward. In backward stepping, all variables are entered, irrespective of their F-
to-enter values (if a variable fails the Tolerance limit, however, it is excluded). F-
to-remove and F-to-enter values are reported. When Backward is selected along
with Automatic, at each step, SYSTAT removes the variable with the lowest F-to-
remove value that passes the Remove limit of the F statistic (or reenters the
variable with the largest F-to-enter above the Remove limit of the F statistic).
n Forward. In forward stepping, variables are entered into the model one at a time. F-to-enter
values are reported for all candidate variables, and F-to-remove values are reported
for forced variables. When Forward is selected along with Automatic, at each step,
SYSTAT enters the variable with the highest F-to-enter that passes the Enter limit
of the F statistic (or removes the variable with the lowest F-to-remove below the
Remove limit of the F statistic).
n Automatic. SYSTAT enters or removes variables automatically. F-to-enter and F-
to-remove limits are used.
n Interactive. Variables are interactively removed from and/or added to the model at
each step. In the Command pane, type a STEP command to enter and remove
variables interactively.
STEP               One variable is entered into or removed from the model (based on the Enter and Remove limits of the F statistic).
STEP +             Variable with the largest F-to-enter is entered into the model (irrespective of the Enter limit of the F statistic).
STEP -             Variable with the smallest F-to-remove is removed from the model (irrespective of the Remove limit of the F statistic).
STEP c, e          Variables named c and e are stepped into/out of the model (irrespective of the Enter and Remove limits of the F statistic).
STEP 3, 5          Third and fifth variables are stepped into/out of the model (irrespective of the Enter and Remove limits of the F statistic).
STEP/NUMBER = 3    Three variables are entered into or removed from the model.
STOP               Stops the stepping and generates final output (classification matrices, eigenvalues, canonical variables, etc.).
Variables are added to or eliminated from the model based on one of two possible
criteria.
n Probability. Variables with probability (F-to-enter) smaller than the Enter
probability are entered into the model if Tolerance permits. The default Enter value
is 0.15. For highly correlated predictors, you may want to set Enter = 0.01.
Variables with probability (F-to-remove) larger than the Remove probability are
removed from the model. The default Remove value is 0.15.
n F-statistic. Variables with F-to-enter values larger than the Enter F value are
entered into the model if Tolerance permits. The default Enter value is 4. Variables
with F-to-remove values smaller than the Remove F value are removed from the
model. The default Remove value is 3.9.
You can also specify variables to include in the model, regardless of whether they meet
the criteria for entry into the model. In the Force text box, enter the number of
variables, in the order in which they appear in the Variables list, to force into the model
(for example, Force = 2 means include the first two variables on the Variables list in the
main dialog box). Force = 0 is the default.
Discriminant Analysis Statistics
You can select any desired output elements by clicking Statistics in the main dialog box.
All selected statistics will be displayed in the output. Depending on the specified length
of your output, you may also see additional statistics. By default, the print length is set
to Short (you will see all of the statistics on the Short Statistics list). To change the
length of your output, choose Options from the Edit menu. Select Short, Medium, or
Long from the Length drop-down list. Again, all selected statistics will be displayed in
the output, regardless of the print setting.
Short Statistics. Options for Short Statistics are FMatrix (between-groups F matrix),
FStats (F-to-enter/remove statistics), Eigen (eigenvalues and canonical correlation),
CMeans (canonical scores of group means), and Sum (summary panel).
Medium Statistics. Options for Medium Statistics are those for Short Statistics plus
Means (group frequencies and means), Wilks (Wilks' lambda and approximate F),
CFunc (discriminant functions), Traces (Lawley-Hotelling, Pillai, and Wilks'
traces), CDFunc (canonical discriminant functions), SCDFunc (standardized canonical
discriminant functions), Class (classification matrix), and JClass (Jackknifed
classification matrix).
Long Statistics. Options for Long Statistics are those for Medium Statistics plus WCov
(within covariance matrix), WCorr (within correlation matrix), TCov (total covariance
matrix), TCorr (total correlation matrix), GCov (groupwise covariance matrix), and
GCorr (groupwise correlation matrix).
Mahalanobis distances, posterior probabilities (Mahal), and canonical scores (CScore)
for each case must be specified individually.
Using Commands
Select your data by typing USE filename and continue as follows:
In addition to indicating a length for the PRINT output, you can select elements not
included in the output for the specified length. Elements for each length include:
MAHAL and CSCORE must be specified individually. No length specification includes
these statistics.
Basic DISCRIM
MODEL grpvar = varlist / QUADRATIC PRIORS=n1,n2,
CONTRAST [matrix]
PRINT / length element
SAVE / DATA SCORES DISTANCES
ESTIMATE / TOL=n
Stepwise (Instead of ESTIMATE, specify START)
START / FORWARD TOL=n ENTER=p REMOVE=p FENTER=n FREMOVE=n
FORCE=n
BACKWARD
STEP no argument or / NUMBER=n AUTO ENTER=p REMOVE=p
FENTER=n FREMOVE=n
+ or
- or
varlist or
nvari, nvarj,
(sequence of STEPs)
STOP
Length Element
SHORT FMATRIX FSTATS EIGEN CMEANS SUM CLASS JCLASS
MEDIUM MEANS WILKS CFUNC TRACES CDFUNC SCDFUNC
LONG WCOV WCOR TCOV TCOR GCOV GCOR
MAHAL and CSCORE must be specified individually. No length specification includes
these statistics.
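For example, following the pattern of the PRINT statements used in the examples later in this chapter, a request for the Medium set plus the individually specified case statistics might look like this (a sketch):
PRINT MEDIUM / MAHAL CSCORE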
Usage Considerations
Types of data. DISCRIM uses rectangular data only.
Print options. Print options allow the user to select panels of output to display, including
group means, variances, covariances, and correlations.
Quick Graphs. For two canonical variables, SYSTAT produces a canonical scores plot,
in which the axes are the canonical variables and the points are the canonical variable
scores. This plot includes confidence ellipses for each group. For analyses involving
more than two canonical variables, SYSTAT displays a SPLOM of the first three
canonical variables.
Saving files. You can save the Mahalanobis distances to each group centroid (with the
posterior probability of the membership in each group) or the canonical variable
scores.
BY groups. DISCRIM analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. DISCRIM uses a FREQ variable to increase the number of cases.
Case weights. You can weight each case in a discriminant analysis using a weight
variable. Use a binary weight variable coded 0 and 1 for cross-validation. Cases that
have a zero weight do not influence the estimation of the discriminant functions but are
classified into groups.
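For example, the cross-validation example at the end of this chapter builds such a binary weight from a uniform random number and then declares it as the weight variable:
LET case_use = URN < .65
WEIGHT = case_use
Cases with a weight of 0 are excluded from the estimation but are still classified.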
Examples
Example 1
Complete Estimation
In this example, we examine measurements made on 150 iris flowers: sepal length,
sepal width, petal length, and petal width (in centimeters). The data are from Fisher
(1936) and are grouped by species: Setosa, Versicolor, and Virginica (coded as 1, 2, and
3, respectively).
The goal of the discriminant analysis is to find a linear combination of the four
measures that best classifies or discriminates among the three species (groups of
flowers). Here is a SPLOM of the four measures with within-group bivariate
confidence ellipses and normal curves. The input is:
DISCRIM
USE iris
SPLOM sepallen..petalwid / HALF GROUP=species ELL,
      DENSITY=NORM OVERLAY
The plot follows:
[SPLOM of SEPALLEN, SEPALWID, PETALLEN, and PETALWID with within-group
confidence ellipses and normal curves, grouped by SPECIES (1, 2, 3)]
Let's see what a default analysis tells us about the separation of the groups and the
usefulness of the variables for the classification. The input is:
USE iris
LABEL species / 1=Setosa, 2=Versicolor, 3=Virginica
DISCRIM
MODEL species = sepallen .. petalwid
PRINT / MEANS
ESTIMATE
Note the shortcut notation (..) in the MODEL statement for listing consecutive variables
in the file (otherwise, simply list each variable name separated by a space).
The output follows:
Group frequencies
-----------------
Setosa Versicolor Virginica
Frequencies 50 50 50

Group means
-----------
SEPALLEN 5.0060 5.9360 6.5880
SEPALWID 3.4280 2.7700 2.9740
PETALLEN 1.4620 4.2600 5.5520
PETALWID 0.2460 1.3260 2.0260

Between groups F-matrix -- df = 4 144
----------------------------------------------
Setosa Versicolor Virginica
Setosa 0.0
Versicolor 550.1889 0.0
Virginica 1098.2738 105.3127 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
2 SEPALLEN 4.72 0.347993 |
3 SEPALWID 21.94 0.608859 |
4 PETALLEN 35.59 0.365126 |
5 PETALWID 24.90 0.649314 |

Classification matrix (cases in row categories classified into columns)
---------------------
Setosa Versicolo Virginica %correct
Setosa 50 0 0 100
Versicolor 0 48 2 96
Virginica 0 1 49 98

Total 50 49 51 98

Jackknifed classification matrix
--------------------------------
Setosa Versicolo Virginica %correct
Setosa 50 0 0 100
Versicolor 0 48 2 96
Virginica 0 1 49 98

Total 50 49 51 98

Eigen Canonical Cumulative proportion
values correlations of total dispersion
--------- ------------ ---------------------
32.192 0.985 0.991
0.285 0.471 1.000

Canonical scores of group means
-------------------------------
Setosa 7.608 .215
Versicolor -1.825 -.728
Virginica -5.783 .513
Group Frequencies
The Group frequencies panel shows the count of flowers within each group and the
means for each variable. If the group code or one or more measures are missing, the
case is not used in the analysis.
Between Groups F-Matrix
For each pair of groups, use these F statistics to test the equality of group means. These
values are proportional to distance measures and are computed from Mahalanobis D²
statistics. Thus, the centroids for Versicolor and Virginica are closest (105.3); those for
Setosa and Virginica (1098.3) are farthest apart. If you explore differences among
several pairs, don't use the probabilities associated with these F statistics as a test because
of the simultaneous inference problem. Compare the relative size of these values with the
distances between group means in the canonical variable plot.
F Statistics and Tolerance
Use F-to-remove statistics to determine the relative importance of variables included
in the model. The numerator degrees of freedom for each F is the number of groups
minus 1, and the denominator df is the (total sample size) - (number of groups) -
(number of variables in the model) + 1; for example, for these data, 3 - 1 and
150 - 3 - 4 + 1, or 2 and 144. Because you may be scanning F statistics for several
variables, do not use the probabilities from the usual F tables for a test. Here we conclude
that SEPALLEN is least helpful for discriminating among the species (F = 4.72).
Classification Tables
In the Classification matrix, each case is classified into the group where the value of
its classification function is largest. For Versicolor (row name), 48 flowers are
classified correctly and 2 are misclassified (classified as Virginica); 96% of the
Versicolor flowers are classified correctly. Overall, 98% of the flowers are classified
correctly (see the last row of the table). The results in the first table can be misleading
because we evaluated the classification rule using the same cases used to compute it.
They may provide an overly optimistic estimate of the rules success. The Jackknifed
classification matrix attempts to remedy the problem by using functions computed
from all of the data except the case being classified. The method of leaving out one case
at a time is called the jackknife and is one form of cross-validation.
For these data, the results are the same. If the percentage for correct classification is
considerably lower in the Jackknifed panel than in the first matrix, you may have too
many predictors in your model.
Eigenvalues, Canonical Correlations, Cumulative Proportion of
Total Dispersion, and Canonical Scores of Group Means
The first canonical variable is the linear combination of the variables that best
discriminates among the groups, the second canonical variable is orthogonal to the first
and is the next best combination of variables, and so on. For our data, the first
eigenvalue (32.2) is very large relative to the second, indicating that the first canonical
variable captures most of the difference among the groups; at the right of this panel,
notice that it accounts for more than 99% of the total dispersion of the groups.
The Canonical correlation between the first canonical variable and a set of two
dummy variables representing the groups is 0.985; the correlation between the second
canonical variable and the dummy variables is 0.471. (The number of dummy variables
is the number of groups minus 1.) Finally, the canonical variables are evaluated at the
group means. That is, in the canonical variable plot, the centroid for the Setosa flowers
is (7.608, 0.215), Versicolor is (-1.825, -0.728), and so on, where the first canonical
variable is the x coordinate and the second, the y coordinate.
Canonical Scores Plot
The axes of this Quick Graph are the first two canonical variables, and the points are
the canonical variable scores. The confidence ellipses are centered on the centroid of
each group. The Setosa flowers are well differentiated from the others. There is some
overlap between the other two groups. Look for outliers in these displays because they
can affect your analysis.
Example 2
Automatic Forward Stepping
Our problem for this example is to derive a rule for classifying countries as European,
Islamic, or New World. We know that strong correlations exist among the candidate
predictor variables, so we are curious about just which subset will be useful. Here are
the candidate predictors:
URBAN Percentage of the population living in cities
BIRTH_RT Births per 1000 people in 1990
DEATH_RT Deaths per 1000 people in 1990
B_TO_D Ratio of births to deaths in 1990
BABYMORT Infant deaths during the first year per 1000 live births
GDP_CAP Gross domestic product per capita (in U.S. dollars)
LIFEEXPM Years of life expectancy for males
LIFEEXPF Years of life expectancy for females
EDUC U.S. dollars spent per person on education in 1986
HEALTH U.S. dollars spent per person on health in 1986
MIL U.S. dollars spent per person on the military in 1986
LITERACY Percentage of the population who can read
Because the distributions of the economic variables are skewed with long right tails,
we log transform GDP_CAP and take the square root of EDUC, HEALTH, and MIL:
LET gdp_cap = L10(gdp_cap)
LET educ = SQR(educ)
LET health = SQR(health)
LET mil = SQR(mil)
Alternatively, you could also use shortcut notation to request the square root
transformations:
LET (educ, health, mil) = SQR(@)
We use automatic forward stepping in an effort to identify the best subset of predictors.
After stepping stops, you need to type STOP to ask SYSTAT to produce the summary
table, classification matrices, and information about canonical variables. The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy
PRINT / MEANS
START / FORWARD
STEP / AUTO
STOP
Notice that the initial results appear after START / FORWARD is specified. STEP / AUTO
and STOP are selected later, as indicated in the output that follows:
Group frequencies
-----------------
Europe Islamic NewWorld
Frequencies 19 15 21

Group means
-----------
URBAN 68.7895 30.0667 56.3810
BIRTH_RT 12.5789 42.7333 26.9524
DEATH_RT 10.1053 13.4000 7.4762
BABYMORT 7.8947 102.3333 42.8095
GDP_CAP 4.0431 2.7640 3.2139
EDUC 21.5275 6.4156 8.9619
HEALTH 21.9537 3.1937 6.8898
MIL 15.9751 7.5431 6.0903
B_TO_D 1.2658 3.5472 3.9509
LIFEEXPM 72.3684 54.4000 66.6190
LIFEEXPF 79.5263 57.1333 71.5714
LITERACY 97.5263 36.7333 79.9571

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
| 6 URBAN 23.20 1.000000
| 8 BIRTH_RT 103.50 1.000000
| 10 DEATH_RT 14.41 1.000000
| 12 BABYMORT 53.62 1.000000
| 16 GDP_CAP 59.12 1.000000
| 19 EDUC 27.12 1.000000
| 21 HEALTH 49.62 1.000000
| 23 MIL 19.30 1.000000
| 34 B_TO_D 31.54 1.000000
| 30 LIFEEXPM 37.08 1.000000
| 31 LIFEEXPF 50.30 1.000000
| 32 LITERACY 63.64 1.000000
Using commands, type STEP / AUTO.
**************** Step 1 -- Variable BIRTH_RT Entered ****************

Between groups F-matrix -- df = 1 52
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 206.5877 0.0
NewWorld 55.8562 59.0625 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
8 BIRTH_RT 103.50 1.000000 | 6 URBAN 1.26 0.724555
| 10 DEATH_RT 19.41 0.686118
| 12 BABYMORT 2.13 0.443802
| 16 GDP_CAP 4.56 0.581395
| 19 EDUC 5.12 0.831381
| 21 HEALTH 9.52 0.868614
| 23 MIL 8.55 0.907501
| 34 B_TO_D 14.94 0.987994
| 30 LIFEEXPM 4.31 0.437850
| 31 LIFEEXPF 3.58 0.371618
| 32 LITERACY 10.32 0.324635

**************** Step 2 -- Variable DEATH_RT Entered ****************

Between groups F-matrix -- df = 2 51
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 120.1297 0.0
NewWorld 59.7595 29.7661 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
8 BIRTH_RT 118.41 0.686118 | 6 URBAN 0.07 0.694384
10 DEATH_RT 19.41 0.686118 | 12 BABYMORT 1.83 0.279580
| 16 GDP_CAP 7.88 0.520784
| 19 EDUC 5.03 0.812622
| 21 HEALTH 6.47 0.864170
| 23 MIL 13.21 0.789555
| 34 B_TO_D 0.82 0.186108
| 30 LIFEEXPM 3.34 0.158185
| 31 LIFEEXPF 5.20 0.120507
| 32 LITERACY 2.22 0.265285

**************** Step 3 -- Variable MIL Entered ****************

Between groups F-matrix -- df = 3 50
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 80.7600 0.0
NewWorld 55.6502 24.6740 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
8 BIRTH_RT 77.85 0.683054 | 6 URBAN 3.87 0.509585
10 DEATH_RT 25.39 0.596945 | 12 BABYMORT 1.02 0.258829
23 MIL 13.21 0.789555 | 16 GDP_CAP 0.67 0.304330
| 19 EDUC 0.01 0.534243
| 21 HEALTH 1.24 0.652294
| 34 B_TO_D 0.81 0.186064
| 30 LIFEEXPM 0.28 0.135010
| 31 LIFEEXPF 1.34 0.091911
| 32 LITERACY 3.51 0.252509
When using commands, type STOP.
Variable F-to-enter Number of
entered or or variables Wilks Approx.
removed F-to-remove in model lambda F-value df1 df2 p-tail
------------ ----------- --------- ----------- ----------- ---- ----- ---------
BIRTH_RT 103.495 1 0.2008 103.4953 2 52 0.00000
DEATH_RT 19.406 2 0.1140 50.0200 4 102 0.00000
MIL 13.212 3 0.0746 44.3576 6 100 0.00000

Classification matrix (cases in row categories classified into columns)
---------------------
Europe Islamic NewWorld %correct
Europe 19 0 0 100
Islamic 0 13 2 87
NewWorld 2 2 17 81

Total 21 15 19 89

Jackknifed classification matrix
--------------------------------
Europe Islamic NewWorld %correct
Europe 19 0 0 100
Islamic 0 13 2 87
NewWorld 2 3 16 76

Total 21 16 18 87

Eigen Canonical Cumulative proportion
values correlations of total dispersion
--------- ------------ ---------------------
5.247 0.916 0.821
1.146 0.731 1.000

Canonical scores of group means
-------------------------------
Europe -2.938 .409
Islamic 2.481 1.243
NewWorld .886 -1.258
Canonical Scores Plot
[Canonical scores plot: FACTOR(1) versus FACTOR(2), with confidence ellipses for
GROUP = Europe, Islamic, and NewWorld]
From the panel of Group means, note that, on the average, the percentage of the
population living in cities (URBAN) is 68.8% in Europe, 30.1% in Islamic nations, and
56.4% in the New World. The LITERACY rates for these same groups are 97.5%,
36.7%, and 80.0%, respectively.
After the group means, you will find the F-to-enter statistics for each variable not
in the functions. When no variables are in the model, each F is the same as that for a
one-way analysis of variance. Thus, group differences are the strongest for BIRTH_RT
(F = 103.5) and weakest for DEATH_RT (F = 14.41). At later steps, each F
corresponds to the F for a one-way analysis of covariance where the covariates are the
variables already included.
At step 1, SYSTAT enters BIRTH_RT because its F-to-enter is largest in the last
panel and now displays the same F in the F-to-remove panel. BIRTH_RT is correlated
with several candidate variables, so notice how their F-to-enter values drop when
BIRTH_RT enters (for example, for GDP_CAP, from 59.1 to 4.6). DEATH_RT now
has the highest F-to-enter, so SYSTAT will enter it at step 2. From the between-groups
F-matrix, note that when BIRTH_RT is used alone, Europe and Islamic countries are
the groups that differ most (206.6), and Europe and the New World are the groups that
differ least (55.9).
After DEATH_RT enters, the F-to-enter for MIL (money spent per person on the
military) is largest, so SYSTAT enters it at step 3. The SYSTAT default limit for F-to-
enter values is 4.0. No variable has an F-to-enter above the limit, so the stepping stops.
Also, all F-to-remove values are greater than 3.9, so no variables are removed.
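If you want automatic stepping to continue further (or stop sooner), you can change these limits with the FENTER and FREMOVE options shown in the command summary; for example, a sketch that lowers both limits (the values are illustrative):
STEP / AUTO FENTER=2.0 FREMOVE=1.9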
The summary table contains one row for each variable moved into the model. The
F-to-enter (F-to-remove) is printed for each, along with Wilks' lambda and its
approximate F statistic, numerator and denominator degrees of freedom, and tail
probability.
After the summary table, SYSTAT prints the classification matrices. From the
biased estimate in the first matrix, our three-variable rule classifies 89% of the
countries correctly. For the jackknifed results, this percentage drops to 87%. All of the
European nations are classified correctly (100%), while almost one-fourth of the New
World countries are misclassified (two as Europe and three as Islamic). These
countries can be identified by using MAHAL; the posterior probability for each case
belonging to each group is printed. You will find, for example, that Canada is
misclassified as European and that Haiti and Bolivia are misclassified as Islamic.
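This panel is not part of the default output; as illustrated in the backward stepping example that follows, you can request it with:
PRINT / MAHAL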
If you focus on the canonical results, you notice that the first canonical variable
accounts for 82.1% of the dispersion, and in the Canonical scores of group means panel,
the groups are ordered from left to right: Europe, New World, and then Islamic. The
second canonical variable contrasts Islamic versus New World (1.243 versus -1.258).
In the canonical variable plot, the European nations (on the left) are well separated
from the other groups. The plus sign (+) next to the European confidence ellipse is
Canada. If you are unsure about which ellipse corresponds to what group, look at the
Canonical scores of group means.
Example 3
Automatic Backward Stepping
It is possible that classification rules for other subsets of the variables perform better
than that found using forward stepping, especially when there are correlations among
the variables. We try backward stepping. The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy
PRINT SHORT / CFUNC
IDVAR = country$
START / BACKWARD
STEP / AUTO
PRINT / TRACES CDFUNC SCDFUNC
STOP
Notice that we request STEP after an initial report and PRINT and STOP later.
The output follows:
Between groups F-matrix -- df = 12 41
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 25.3059 0.0
NewWorld 18.0596 7.3754 0.0

Classification functions
----------------------
Europe Islamic NewWorld
Constant -4408.4004 -4396.8904 -4408.5297
URBAN -2.4175 -2.3572 -2.2871
BIRTH_RT 41.9790 43.1675 43.1322
DEATH_RT 50.0202 48.1539 48.1950
BABYMORT 9.3190 9.3806 9.3461
GDP_CAP 243.6686 234.5165 237.0805
EDUC 2.0078 4.0450 3.4276
HEALTH -17.9706 -19.8527 -19.3068
MIL -9.8420 -10.1746 -10.6076
B_TO_D -59.6547 -62.2446 -61.8195
LIFEEXPM -9.8216 -9.1537 -9.4952
LIFEEXPF 93.5933 93.0934 93.4108
LITERACY 7.5909 7.5834 7.7178
Using commands, type STEP / AUTO.
(We omit the output for steps 2 through 6.)
Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
6 URBAN 2.17 0.436470 |
8 BIRTH_RT 2.01 0.059623 |
10 DEATH_RT 2.26 0.091463 |
12 BABYMORT 0.10 0.083993 |
16 GDP_CAP 0.62 0.143526 |
19 EDUC 6.12 0.065095 |
21 HEALTH 5.36 0.083198 |
23 MIL 7.11 0.323519 |
34 B_TO_D 0.55 0.136148 |
30 LIFEEXPM 0.26 0.036088 |
31 LIFEEXPF 0.07 0.012280 |
32 LITERACY 1.45 0.177756 |
**************** Step 1 -- Variable LIFEEXPF Removed ****************

Between groups F-matrix -- df = 11 42
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 28.2000 0.0
NewWorld 20.1693 8.2086 0.0

Classification functions
----------------------
Europe Islamic NewWorld
Constant -2135.2865 -2147.9924 -2144.2709
URBAN -0.8690 -0.8170 -0.7416
BIRTH_RT 20.1471 21.4523 21.3429
DEATH_RT 29.3876 27.6314 27.6026
BABYMORT 3.7505 3.8419 3.7885
GDP_CAP 292.1240 282.7130 285.4413
EDUC -3.8832 -1.8145 -2.4518
HEALTH -5.8347 -7.7816 -7.1945
MIL -6.9769 -7.3247 -7.7480
B_TO_D -13.7461 -16.5811 -16.0004
LIFEEXPM 32.7200 33.1607 32.9634
LITERACY 5.5340 5.5374 5.6648

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
6 URBAN 2.45 0.466202 | 31 LIFEEXPF 0.07 0.012280
8 BIRTH_RT 3.04 0.077495 |
10 DEATH_RT 2.45 0.100658 |
12 BABYMORT 0.41 0.140589 |
16 GDP_CAP 0.68 0.144854 |
19 EDUC 6.71 0.066537 |
21 HEALTH 6.78 0.092071 |
23 MIL 7.39 0.328943 |
34 B_TO_D 0.70 0.148030 |
30 LIFEEXPM 0.24 0.077817 |
32 LITERACY 1.48 0.185492 |
**************** Step 7 -- Variable URBAN Removed ****************

Between groups F-matrix -- df = 5 48
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 61.5899 0.0
NewWorld 40.9350 15.6004 0.0
Classification functions
----------------------
Europe Islamic NewWorld
Constant -22.4825 -38.4306 -17.6982
BIRTH_RT 0.3003 1.3372 0.9382
DEATH_RT 1.4220 0.6592 0.2591
EDUC -0.1787 1.3011 0.8506
HEALTH 0.7483 -0.8816 -0.3976
MIL 0.7537 0.4181 0.1794

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
8 BIRTH_RT 27.89 0.622699 | 6 URBAN 3.65 0.504724
10 DEATH_RT 15.51 0.583392 | 12 BABYMORT 1.12 0.243722
19 EDUC 5.20 0.083925 | 16 GDP_CAP 1.20 0.171233
21 HEALTH 6.67 0.102470 | 34 B_TO_D 1.24 0.180347
23 MIL 7.42 0.501019 | 30 LIFEEXPM 0.02 0.123573
| 31 LIFEEXPF 0.49 0.076049
| 32 LITERACY 3.42 0.250341
Variable F-to-enter Number of
entered or or variables Wilks Approx.
removed F-to-remove in model lambda F-value df1 df2 p-tail
------------ ----------- --------- ----------- ----------- ---- ----- ---------
LIFEEXPF 0.068 11 0.0405 15.1458 22 84 0.00000
LIFEEXPM 0.237 10 0.0410 16.9374 20 86 0.00000
BABYMORT 0.219 9 0.0414 19.1350 18 88 0.00000
B_TO_D 0.849 8 0.0430 21.4980 16 90 0.00000
GDP_CAP 1.429 7 0.0457 24.1542 14 92 0.00000
LITERACY 2.388 6 0.0505 27.0277 12 94 0.00000
URBAN 3.655 5 0.0583 30.1443 10 96 0.00000

Classification matrix (cases in row categories classified into columns)
---------------------
Europe Islamic NewWorld %correct
Europe 19 0 0 100
Islamic 0 13 2 87
NewWorld 1 2 18 86

Total 20 15 20 91

Jackknifed classification matrix
--------------------------------
Europe Islamic NewWorld %correct
Europe 19 0 0 100
Islamic 0 13 2 87
NewWorld 1 2 18 86

Total 20 15 20 91

Eigen Canonical Cumulative proportion
values correlations of total dispersion
--------- ------------ ---------------------
6.984 0.935 0.859
1.147 0.731 1.000
Using commands, type PRINT / TRACES CDFUNC SCDFUNC, then STOP.
Wilks lambda= 0.058
Approx.F= 30.144 df= 10, 96 p-tail= 0.0000

Pillais trace= 1.409
Approx.F= 23.360 df= 10, 98 p-tail= 0.0000

Lawley-Hotelling trace= 8.131
Approx.F= 38.215 df= 10, 94 p-tail= 0.0000

Canonical discriminant functions
--------------------------------
1 2
Constant -1.9836 -5.4022
URBAN . .
BIRTH_RT 0.1603 0.0414
DEATH_RT -0.1588 0.2771
BABYMORT . .
GDP_CAP . .
EDUC 0.2358 0.0063
HEALTH -0.2604 -0.0015
MIL -0.0736 0.1497
B_TO_D . .
LIFEEXPM . .
LIFEEXPF . .
LITERACY . .

Canonical discriminant functions -- standardized by within variances
--------------------------------------------------------------------
1 2
URBAN . .
BIRTH_RT 0.9737 0.2512
DEATH_RT -0.5188 0.9050
BABYMORT . .
GDP_CAP . .
EDUC 1.5574 0.0413
HEALTH -1.5572 -0.0091
MIL -0.3910 0.7952
B_TO_D . .
LIFEEXPM . .
LIFEEXPF . .
LITERACY . .

Canonical scores of group means
-------------------------------
Europe -3.389 .410
Islamic 2.864 1.243
NewWorld 1.020 -1.259
Before stepping starts, SYSTAT uses all candidate variables to compute classification
functions. The output includes the coefficients for these functions used to classify
cases into groups. A variable is omitted only if it fails the Tolerance limit. For each
case, SYSTAT computes three functions. The first is:
-4408.4 - 2.417*urban + 41.979*birth_rt + ... + 7.591*literacy
Each case is assigned to the group with the largest value.
Tolerance measures the correlation of a candidate variable with the variables
included in the model, and its values range from 0 to 1.0. If a variable is highly
correlated with one or more of the others, the value of Tolerance is very small and the
resulting estimates of the discriminant function coefficients may be very unstable. To
avoid a loss of accuracy in the matrix inversion computations, rarely should you set the
value of this limit to a lower value (the default is 0.001). LIFEEXPF, female life
expectancy, has a very low Tolerance value, so it may be redundant or highly
correlated with another variable or a linear combination of other variables. The
Tolerance value of LIFEEXPM, male life expectancy, is also low; these two measures
of life expectancy may be highly correlated with one another. Notice also that the value
for BIRTH_RT is very low (0.059623) and its F-to-remove value is 2.01; its F-to-enter
at step 0 in the forward stepping example was 103.5.
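A common way to express this relationship (not part of the SYSTAT output) is
tolerance = 1 - R²
where R² is the squared multiple correlation of the candidate variable with the variables already in the model, so a tolerance near 0 signals near-collinearity.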
At step 7, no variable has an F-to-remove value less than 3.9, so the stepping stops.
The final model found by backward stepping includes five variables: BIRTH_RT,
DEATH_RT, EDUC, HEALTH, and MIL. We are not happy, however, with the low
Tolerance values for two of these variables. The model found via automatic forward
stepping did not include EDUC or HEALTH (their F-to-enter statistics at step 3 are
0.01 and 1.24, respectively). URBAN and LITERACY appear more likely candidates,
but their F statistics are still less than 4.0.
In both classification matrices, 91% of the countries are classified correctly using
the five-variable discriminant functions. This is a slight improvement over the three-
variable model from the forward stepping example, where the percentages were 89%
for the first matrix and 87% for the jackknifed results. The improvement from 87% to
91% is because two New World countries are now classified correctly. We add two
variables and gain two correct classifications.
Wilks' lambda (or U statistic), a multivariate analysis of variance statistic that varies
between 0 and 1, tests the equality of group means for the variables in the discriminant
functions. Wilks' lambda is transformed to an approximate F statistic for comparison
with the F distribution. Here, the associated probability is less than 0.00005, indicating
a highly significant difference among the groups. The Lawley-Hotelling trace and its
F approximation are documented in Morrison (1976). When there are only two groups,
it and Wilks' lambda are equivalent. Pillai's trace and its F approximation are taken
from Pillai (1960).
The canonical discriminant functions list the coefficients of the canonical variables
computed first for the data as input and then for the standardized values. For the
unstandardized data, the first canonical variable is:
-1.984 + 0.160*birth_rt - 0.159*death_rt + 0.236*educ - 0.260*health - 0.074*mil
The coefficients are adjusted so that the overall mean of the corresponding scores is 0
and the pooled within-group variances are 1. After standardizing, the first canonical
variable is:
0.974*birth_rt - 0.519*death_rt + 1.557*educ - 1.557*health - 0.391*mil
Usually, one uses the latter set of coefficients to interpret what variables drive each
canonical variable. Here, EDUC and HEALTH, the variables with low tolerance values,
have the largest coefficients, and they appear to cancel one another. Also, in the final
model, the size of their F-to-remove values indicates they are the least useful variables
in the model. This indicates that we do not have an optimum set of variables. These two
variables contribute little alone, while together they enhance the separation of the
groups. This suggests that the difference between EDUC and HEALTH could be a
useful variable (for example, LET diff = educ - health). We did this, and the following is
the first canonical variable for standardized values (we omit the constant):
1.024*birth_rt - 0.539*death_rt - 0.480*mil + 0.553*diff
From the Canonical scores of group means for the first canonical variable, the groups
line up with Europe first, then New World in the middle, and Islamic on the right. In
the second dimension, DEATH_RT and MIL (military expenditures) appear to separate
Islamic and New World countries.
Mahalanobis Distances and Posterior Probabilities
Even if you have already specified PRINT=LONG, you must type PRINT / MAHAL to
obtain Mahalanobis distances. The output is:
Mahalanobis distance-square from group means and
Posterior probabilities for group membership
Priors = .333 .333 .333
Europe Islamic NewWorld

Europe
------------
Ireland 3.0 1.00 33.7 .00 13.6 .00
Austria 4.0 1.00 37.7 .00 19.8 .00
Belgium * .3 1.00 42.7 .00 26.0 .00
Denmark 9.1 1.00 37.6 .00 24.9 .00
Finland 2.1 1.00 40.5 .00 22.3 .00
France 2.3 1.00 45.5 .00 29.1 .00
Greece 5.7 1.00 48.6 .00 28.3 .00
Switzerland 11.9 1.00 71.7 .00 48.3 .00
Spain 3.6 1.00 42.8 .00 18.9 .00
UK 2.1 1.00 42.8 .00 29.9 .00
Italy .6 1.00 44.7 .00 23.0 .00
Sweden 4.3 1.00 51.7 .00 35.9 .00
Portugal 3.6 1.00 40.4 .00 18.8 .00
Netherlands 2.1 1.00 43.9 .00 24.2 .00
WGermany 6.0 1.00 65.8 .00 45.5 .00
Norway 5.3 1.00 38.5 .00 28.4 .00
Poland 2.7 .99 29.5 .00 12.5 .01
Hungary 4.4 1.00 39.8 .00 24.3 .00
EGermany 8.0 1.00 42.4 .00 31.9 .00
Czechoslov 1.8 1.00 40.9 .00 25.1 .00

Islamic
------------
Gambia 43.2 .00 2.9 1.00 15.3 .00
Iraq 71.3 .00 23.5 1.00 41.7 .00
Pakistan 38.7 .00 .5 .98 8.6 .02
Bangladesh 37.2 .00 2.0 .91 6.8 .09
Ethiopia 40.5 .00 1.1 .99 10.0 .01
Guinea 41.2 .00 8.0 1.00 24.1 .00
Malaysia --> 36.6 .00 7.7 .17 4.5 .83
Senegal 42.8 .00 .9 .98 9.1 .02
Mali 49.3 .00 5.5 1.00 23.5 .00
Libya 60.3 .00 15.6 1.00 30.1 .00
Somalia 50.0 .00 1.1 1.00 13.1 .00
Afghanistan * . . . . . .
Sudan 43.8 .00 .3 .99 10.1 .01
Turkey --> 25.0 .00 7.2 .05 1.5 .95
Algeria 43.1 .00 4.1 .79 6.7 .21
Yemen 57.4 .00 3.1 1.00 23.2 .00

For each case (up to 250 cases), the Mahalanobis distance squared (D²) is computed
to each group mean. The closer a case is to a particular mean, the more likely it belongs
to that group. The posterior probability for the distance of a case to a mean is the ratio
of EXP(-0.5 * D²) for the group divided by the sum of EXP(-0.5 * D²) for all groups
(prior probabilities, if specified, affect these computations).
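For example, using the printed (rounded) distances for Pakistan (38.7 to Europe, 0.5 to Islamic, and 8.6 to NewWorld) with equal priors:
P(Islamic) = EXP(-0.5 * 0.5) / [EXP(-0.5 * 38.7) + EXP(-0.5 * 0.5) + EXP(-0.5 * 8.6)] ≈ 0.98
which matches the .98 shown in the panel. Because the displayed distances are rounded, hand computation reproduces the printed probabilities only approximately.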
An arrow (-->) marks incorrectly classified cases, and an asterisk (*) flags cases with
missing values. New World countries Bolivia and Haiti are classified as Islamic, and
Canada is classified as Europe. Note that even though an asterisk marks Belgium,
results are printed; the value of the unused candidate variable URBAN is missing. No
results are printed for Afghanistan because MIL, a variable in the final model, is
missing.
You can identify cases with all large distances as outliers. A case can have a 1.0
probability of belonging to a particular group but still have a large distance. Look at
Iraq. It is correctly classified as Islamic, but its distance is 23.5. The distances in this
panel are distributed approximately as a chi-square with degrees of freedom equal to
the number of variables in the function.
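For example, with the five variables in this model, Iraq's distance of 23.5 lies beyond roughly the 99.9th percentile of a chi-square distribution with 5 degrees of freedom (about 20.5), which is why it stands out even though the case is classified correctly.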
NewWorld
------------
Argentina 11.5 .03 19.8 .00 4.4 .97
Barbados 16.4 .00 20.9 .00 4.7 1.00
Bolivia --> 27.7 .00 3.4 .56 3.8 .44
Brazil 27.4 .00 11.5 .00 .6 1.00
Canada --> 6.7 1.00 35.9 .00 19.3 .00
Chile 21.1 .00 15.7 .00 1.5 1.00
Colombia 35.2 .00 13.9 .00 1.9 1.00
CostaRica 34.8 .00 21.1 .00 5.5 1.00
Venezuela 41.2 .00 13.4 .01 4.6 .99
DominicanR. 26.0 .00 13.2 .00 1.3 1.00
Uruguay 13.6 .07 22.9 .00 8.6 .93
Ecuador 32.8 .00 8.6 .02 1.0 .98
ElSalvador 35.3 .00 7.5 .07 2.5 .93
Jamaica 25.6 .00 19.1 .00 1.9 1.00
Guatemala 37.6 .00 4.5 .33 3.1 .67
Haiti --> 37.9 .00 2.0 .99 10.6 .01
Honduras 39.8 .00 6.4 .27 4.5 .73
Trinidad 34.1 .00 11.4 .03 4.1 .97
Peru 20.2 .00 10.5 .02 2.4 .98
Panama 23.8 .00 16.5 .00 2.4 1.00
Cuba 12.0 .03 18.5 .00 5.1 .97
--> case misclassified
* case not used in computation
Example 4
Interactive Stepping
Automatic forward and backward stepping can produce different sets of predictor
variables, and still other subsets of the variables may perform equally well or possibly
better. Here we use interactive stepping to explore alternative sets of variables.
Using the OURWORLD data, let's say you decide not to include birth and death
rates in the model because the rates are changing rapidly for several nations (that is, we
omit these variables from the model). We also add the difference between EDUC and
HEALTH as a candidate variable.
SYSTAT provides several ways to specify which variables to move into (or out of)
the model. The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LET diffrnce = educ - health
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy diffrnce
PRINT SHORT / SCDFUNC
GRAPH=NONE
START / BACK
After interpreting these commands and printing the output below, SYSTAT waits for
us to enter STEP instructions.
Between groups F-matrix -- df = 12 41
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 25.3059 0.0
NewWorld 18.0596 7.3754 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
6 URBAN 2.17 0.436470 | 40 DIFFRNCE 0000000.00 0.000000
8 BIRTH_RT 2.01 0.059623 |
10 DEATH_RT 2.26 0.091463 |
12 BABYMORT 0.10 0.083993 |
16 GDP_CAP 0.62 0.143526 |
19 EDUC 6.12 0.065095 |
21 HEALTH 5.36 0.083198 |
23 MIL 7.11 0.323519 |
34 B_TO_D 0.55 0.136148 |
30 LIFEEXPM 0.26 0.036088 |
31 LIFEEXPF 0.07 0.012280 |
32 LITERACY 1.45 0.177756 |
A summary of the STEP arguments (variable numbers are visible in the output)
follows:
a. STEP birth_rt death_rt      Remove two variables
b. STEP lifeexpf               Remove one variable
c. STEP -                      Remove lifeexpm
d. STEP -                      Remove babymort
e. STEP -                      Remove urban
f. STEP -                      Remove gdp_cap
g. STEP educ health diffrnce   Remove educ and health; add diffrnce
h. STEP +                      Enter gdp_cap
   STOP
Notice that the seventh STEP specification (g) removes EDUC and HEALTH and
enters DIFFRNCE. Remember, after the last step, type STOP for the canonical variable
results and other summaries.
Steps 1 and 2
Input:
STEP birth_rt death_rt
Output:
**************** Step 1 -- Variable BIRTH_RT Removed ****************

Between groups F-matrix -- df = 11 42
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 26.3672 0.0
NewWorld 18.0391 8.2404 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
6 URBAN 2.64 0.437926 | 8 BIRTH_RT 2.01 0.059623
10 DEATH_RT 2.00 0.092765 | 40 DIFFRNCE 0000.00 0.000000
12 BABYMORT 0.14 0.091364 |
16 GDP_CAP 1.40 0.150944 |
19 EDUC 5.99 0.065824 |
21 HEALTH 4.24 0.090886 |
23 MIL 5.92 0.384992 |
34 B_TO_D 0.35 0.329976 |
30 LIFEEXPM 0.42 0.036548 |
31 LIFEEXPF 0.96 0.015962 |
32 LITERACY 1.79 0.292005 |

**************** Step 2 -- Variable DEATH_RT Removed ****************

Between groups F-matrix -- df = 10 43
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 27.8162 0.0
NewWorld 18.1733 9.2794 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
6 URBAN 2.20 0.452548 | 8 BIRTH_RT 1.75 0.060472
12 BABYMORT 0.23 0.108992 | 10 DEATH_RT 2.00 0.092765
16 GDP_CAP 1.14 0.153540 | 40 DIFFRNCE 0.00 0.000000
19 EDUC 6.52 0.065850 |
21 HEALTH 6.28 0.093470 |
23 MIL 6.69 0.385443 |
34 B_TO_D 6.48 0.651944 |
30 LIFEEXPM 0.51 0.036592 |
31 LIFEEXPF 0.28 0.019231 |
32 LITERACY 1.89 0.312350 |
Step 3
Input:
STEP lifeexpf
Output:
**************** Step 3 -- Variable LIFEEXPF Removed ****************

Between groups F-matrix -- df = 9 44
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 31.1645 0.0
NewWorld 20.4611 10.4752 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
6 URBAN 2.27 0.472161 | 8 BIRTH_RT 1.88 0.086049
12 BABYMORT 0.79 0.147553 | 10 DEATH_RT 1.31 0.111768
16 GDP_CAP 1.80 0.171189 | 31 LIFEEXPF 0.28 0.019231
19 EDUC 7.51 0.066995 | 40 DIFFRNCE 00000.00 0.000000
21 HEALTH 7.37 0.095626 |
23 MIL 6.88 0.389511 |
34 B_TO_D 6.49 0.683545 |
30 LIFEEXPM 0.28 0.151179 |
32 LITERACY 2.44 0.338715 |
Steps 4 through 7
Input:
STEP -
Output:
**************** Step 4 -- Variable LIFEEXPM Removed ****************

Between groups F-matrix -- df = 8 45
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 35.3422 0.0
NewWorld 23.3116 11.9720 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
6 URBAN 2.48 0.486188 | 8 BIRTH_RT 0.68 0.138508
12 BABYMORT 0.52 0.249802 | 10 DEATH_RT 1.38 0.182210
16 GDP_CAP 1.71 0.173599 | 30 LIFEEXPM 0.28 0.151179
19 EDUC 7.32 0.069441 | 31 LIFEEXPF 0.04 0.079455
21 HEALTH 7.18 0.099905 | 40 DIFFRNCE 000.00 0.000000
23 MIL 7.05 0.391379 |
34 B_TO_D 9.06 0.769167 |
32 LITERACY 2.40 0.346292 |
(We omit steps 5, 6, and 7. Each step corresponds to a STEP -.)
Steps 8, 9, and 10
Input:
STEP educ health diffrnce
Output:
**************** Step 8 -- Variable EDUC Removed ****************

Between groups F-matrix -- df = 4 49
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 49.9302 0.0
NewWorld 34.1490 20.8722 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
21 HEALTH 2.44 0.652730 | 6 URBAN 2.32 0.520120
23 MIL 6.67 0.601236 | 8 BIRTH_RT 3.24 0.248104
34 B_TO_D 16.14 0.887452 | 10 DEATH_RT 0.40 0.241846
32 LITERACY 33.24 0.761872 | 12 BABYMORT 2.09 0.326834
| 16 GDP_CAP 1.12 0.277122
| 19 EDUC 5.14 0.083616
| 30 LIFEEXPM 0.88 0.313546
| 31 LIFEEXPF 2.03 0.250043
| 40 DIFFRNCE 5.14 0.743192

**************** Step 9 -- Variable HEALTH Removed ****************

Between groups F-matrix -- df = 3 50
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 61.6708 0.0
NewWorld 41.4085 28.1939 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
23 MIL 14.70 0.771975 | 6 URBAN 2.55 0.523182
34 B_TO_D 27.09 0.914822 | 8 BIRTH_RT 3.91 0.248706
32 LITERACY 52.35 0.805675 | 10 DEATH_RT 0.42 0.241913
| 12 BABYMORT 3.11 0.337422
| 16 GDP_CAP 3.02 0.391015
| 19 EDUC 0.33 0.538428
| 21 HEALTH 2.44 0.652730
| 30 LIFEEXPM 1.58 0.327654
| 31 LIFEEXPF 3.33 0.269779
| 40 DIFFRNCE 6.98 0.772114

**************** Step 10 -- Variable DIFFRNCE Entered ****************

Between groups F-matrix -- df = 4 49
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 60.8974 0.0
NewWorld 38.7925 22.4751 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
23 MIL 16.65 0.683968 | 6 URBAN 2.50 0.522963
34 B_TO_D 13.97 0.900149 | 8 BIRTH_RT 3.89 0.246110
32 LITERACY 47.38 0.792219 | 10 DEATH_RT 0.41 0.241913
40 DIFFRNCE 6.98 0.772114 | 12 BABYMORT 3.26 0.333341
| 16 GDP_CAP 4.30 0.372308
| 19 EDUC 0.94 0.514966
| 21 HEALTH 0.94 0.628279
| 30 LIFEEXPM 0.98 0.326826
| 31 LIFEEXPF 2.40 0.269658
Step 11
Input:
STEP +
Output:
**************** Step 11 -- Variable GDP_CAP Entered ****************

Between groups F-matrix -- df = 5 48
----------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 57.5419 0.0
NewWorld 35.7426 18.6879 0.0

Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
16 GDP_CAP 4.30 0.372308 | 6 URBAN 2.72 0.513543
23 MIL 5.88 0.478530 | 8 BIRTH_RT 1.04 0.189556
34 B_TO_D 9.46 0.887953 | 10 DEATH_RT 1.00 0.215879
32 LITERACY 12.31 0.609614 | 12 BABYMORT 0.71 0.256567
40 DIFFRNCE 8.37 0.735173 | 19 EDUC 0.36 0.324618
| 21 HEALTH 0.36 0.396047
| 30 LIFEEXPM 0.04 0.259888
| 31 LIFEEXPF 0.24 0.180725
Final Model
Input:
STOP
Output:
Variable F-to-enter Number of
entered or or variables Wilks Approx.
removed F-to-remove in model lambda F-value df1 df2 p-tail
------------ ----------- --------- ----------- ----------- ---- ----- ---------
BIRTH_RT 2.011 11 0.0444 14.3085 22 84 0.00000
DEATH_RT 2.002 10 0.0486 15.2053 20 86 0.00000
LIFEEXPF 0.275 9 0.0492 17.1471 18 88 0.00000
LIFEEXPM 0.277 8 0.0498 19.5708 16 90 0.00000
BABYMORT 0.524 7 0.0510 22.5267 14 92 0.00000
URBAN 2.615 6 0.0568 25.0342 12 94 0.00000
GDP_CAP 3.583 5 0.0655 27.9210 10 96 0.00000
EDUC 5.143 4 0.0795 31.1990 8 98 0.00000
HEALTH 2.438 3 0.0874 39.7089 6 100 0.00000
DIFFRNCE 6.983 4 0.0680 34.7213 8 98 0.00000
GDP_CAP 4.299 5 0.0577 30.3710 10 96 0.00000

Classification matrix (cases in row categories classified into columns)
---------------------
Europe Islamic NewWorld %correct
Europe 19 0 0 100
Islamic 0 14 1 93
NewWorld 1 1 19 90

Total 20 15 20 95
Jackknifed classification matrix
--------------------------------
Europe Islamic NewWorld %correct
Europe 19 0 0 100
Islamic 0 12 3 80
NewWorld 1 3 17 81

Total 20 15 20 87

Eigen Canonical Cumulative proportion
values correlations of total dispersion
--------- ------------ ---------------------
6.319 0.929 0.822
1.369 0.760 1.000

Canonical discriminant functions -- standardized by within variances
--------------------------------------------------------------------
1 2
URBAN . .
BIRTH_RT . .
DEATH_RT . .
BABYMORT . .
GDP_CAP 0.6868 0.0377
EDUC . .
HEALTH . .
MIL 0.0676 0.8395
B_TO_D -0.4461 -0.5037
LIFEEXPM . .
LIFEEXPF . .
LITERACY 0.3903 -0.8573
DIFFRNCE -0.6378 -0.0291

Canonical scores of group means
-------------------------------
Europe 3.162 .535
Islamic -2.890 1.281
NewWorld -.796 -1.399
A summary of results for the models estimated by forward, backward, and interactive
stepping follows:
Model                                        % Correct (Class)   % Correct (Jackknife)
Forward (automatic)
1. BIRTH_RT DEATH_RT MIL 89 87
Backward (automatic)
2. BIRTH_RT DEATH_RT MIL EDUC HEALTH 91 91
Interactive (ignoring BIRTH_RT and DEATH_RT)
3. MIL B_TO_D LITERACY 84 84
4. MIL B_TO_D LITERACY EDUC HEALTH 91 89
5. MIL B_TO_D LITERACY DIFFRNCE 91 89
6. MIL B_TO_D LITERACY DIFFRNCE GDP_CAP 95 87
Notice that the largest difference between the two classification methods (95% versus
87%) occurs for the last model, which includes the most variables. A difference like
this one (8%) can indicate overfitting of correlated candidate variables. Since the
jackknifed results can still be overly optimistic, cross-validation should be considered.
Example 5
Contrasts
Contrasts are available with commands only. When you have specific hypotheses about
differences among particular groups, you can specify one or more contrasts to direct
the entry (or removal) of variables in the model.
According to the jackknifed classification results in the stepwise examples, the
European countries are always classified correctly (100% correct). All of the
misclassifications are New World countries classified as Islamic or vice versa. In order
to maximize the difference between the second (Islamic) and third groups (New
World), we specify contrast coefficients with commands:
CONTRAST [0 -1 1]
If we want to specify linear and quadratic contrasts across four groups, we could
specify:
CONTRAST [-3 -1 1 3; -1 1 1 -1]
or
CONTRAST [-3 -1 1 3
          -1 1 1 -1]
Here, we use the first contrast and request interactive forward stepping. The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy
CONTRAST [0 -1 1]
PRINT / SHORT
START / FORWARD
STEP literacy
STEP mil
STEP urban
STOP
After viewing the results, remember to cancel the contrast if you plan to do other
discriminant analyses:
CONTRAST / CLEAR
The output follows:
(We omit results for steps 1, 2, and 3.)
Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
| 6 URBAN 21.87 1.000000
| 8 BIRTH_RT 59.06 1.000000
| 10 DEATH_RT 28.79 1.000000
| 12 BABYMORT 44.12 1.000000
| 16 GDP_CAP 14.32 1.000000
| 19 EDUC 1.30 1.000000
| 21 HEALTH 3.34 1.000000
| 23 MIL 0.65 1.000000
| 34 B_TO_D 1.12 1.000000
| 30 LIFEEXPM 35.00 1.000000
| 31 LIFEEXPF 43.16 1.000000
| 32 LITERACY 64.84 1.000000

**************** Step 1 -- Variable LITERACY Entered ****************
Variable F-to-enter Number of
entered or or variables Wilks Approx.
removed F-to-remove in model lambda F-value df1 df2 p-tail
------------ ----------- --------- ----------- ----------- ---- ----- ---------
LITERACY 64.844 1 0.4450 64.8444 1 52 0.00000
MIL 9.963 2 0.3723 42.9917 2 51 0.00000
URBAN 2.953 3 0.3515 30.7433 3 50 0.00000

Classification matrix (cases in row categories classified into columns)
---------------------
Europe Islamic NewWorld %correct
Europe 18 0 1 95
Islamic 0 14 1 93
NewWorld 2 3 16 76

Total 20 17 18 87

Jackknifed classification matrix
--------------------------------
Europe Islamic NewWorld %correct
Europe 18 0 1 95
Islamic 0 14 1 93
NewWorld 2 3 16 76

Total 20 17 18 87

Eigen Canonical Cumulative proportion
values correlations of total dispersion
--------- ------------ ---------------------
1.845 0.805 1.000

Canonical scores of group means
-------------------------------
Europe .882
Islamic -2.397
NewWorld .914
Compare the F-to-enter values with those in the forward stepping example. The
statistics here indicate that for the economic variables (GDP_CAP, EDUC, HEALTH,
and MIL), differences between the second and third groups are much smaller than those
when European countries are included.
The Jackknifed classification matrix indicates that when LITERACY, MIL, and
URBAN are used, 87% of the countries are classified correctly. This is the same
percentage correct as in the forward stepping example for the model with BIRTH_RT,
DEATH_RT, and MIL. Here, however, one fewer Islamic country is misclassified, and
one European country is now classified incorrectly.
When you look at the canonical results, you see that because a single contrast has
one degree of freedom, only one dimension is defined; that is, there is only one
eigenvalue and one canonical variable.
Example 6
Quadratic Model
One of the assumptions necessary for linear discriminant analysis is equality of
covariance matrices. Within-group scatterplot matrices (SPLOMs) provide a picture
of how measures co-vary. Here we add 85% ellipses of concentration to enhance our
view of the bivariate relations. Since our sample sizes do not differ markedly (15 to 21
countries per group), the ellipses for each pair of variables should have approximately
the same shape and tilt across groups if the equality of covariance assumption holds.
The input is:
DISCRIM
USE ourworld
LET (educ, health, mil) = SQR(@)
STAND
SPLOM birth_rt death_rt educ health mil / HALF ROW=1,
GROUP=group$ ELL=.85 DENSITY=NORMAL
[Within-group SPLOM of BIRTH_RT, DEATH_RT, EDUC, HEALTH, and MIL with 85% ellipses
and normal curves, one panel each for Europe, Islamic, and NewWorld]
Because the length, width, and tilt of the ellipses for most pairs of variables vary
markedly across groups, the assumption of equal covariance matrices has not been met.
Fortunately, the quadratic model does not require equality of covariances. However,
it has a different problem: it requires a larger minimum sample size than that needed
for the linear model. For five variables, for example, the linear and quadratic models,
respectively, for each group are:
f = a + bx1 + cx2 + dx3 + ex4 + fx5
f = a + bx1 + cx2 + dx3 + ex4 + fx5 + gx1x2 + ... + px4x5 + qx1² + ... + ux5²
So the linear model has six parameters to estimate for each group, and the quadratic
has 21. These parameters aren't all independent, so we don't require as many as
3 × 21 cases for a quadratic fit.
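As a quick check of these counts: the linear function has 1 constant plus 5 linear coefficients, or 6; the quadratic function adds the 5 × 4 / 2 = 10 cross-product terms and 5 squared terms, giving 6 + 10 + 5 = 21.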
In this example, we fit a quadratic model using the subset of variables identified in
the backward stepping example. Following this, we examine results for the subset
identified in the interactive stepping example before EDUC and HEALTH are
removed. The input is:
DISCRIM
USE ourworld
LET (educ, health, mil) = SQR(@)
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = birth_rt death_rt educ health mil / QUAD
PRINT SHORT / GCOV WCOV GCOR CFUNC MAHAL
IDVAR = country$
ESTIMATE
MODEL group = educ health mil b_to_d literacy / QUAD
ESTIMATE
Output for the first model follows:
Pooled within covariance matrix -- df= 53
------------------------------------------------
BIRTH_RT DEATH_RT EDUC HEALTH MIL
BIRTH_RT 36.2044
DEATH_RT 10.8948 10.4790
EDUC -16.1749 -7.2497 42.8231
HEALTH -12.9261 -4.9333 36.5504 35.0939
MIL -9.6390 -7.7297 22.0789 16.9130 27.7095

Group Europe covariance matrix
------------------------------------
BIRTH_RT DEATH_RT EDUC HEALTH MIL
BIRTH_RT 1.7342
DEATH_RT 0.0184 1.8184
EDUC 2.0051 1.3359 47.1696
HEALTH 1.3943 -0.3625 44.2594 47.3538
MIL 0.8255 1.2689 15.2891 14.7387 15.7686

Group Europe correlation matrix
------------------------------------
BIRTH_RT DEATH_RT EDUC HEALTH MIL
BIRTH_RT 1.0000
DEATH_RT 0.0104 1.0000
EDUC 0.2217 0.1442 1.0000
HEALTH 0.1539 -0.0391 0.9365 1.0000
MIL 0.1579 0.2370 0.5606 0.5394 1.0000

Ln( Det(COV of group Europe) )= 8.67105970

Group Europe discriminant function coefficients
-------------------------------------------------
BIRTH_RT DEATH_RT EDUC HEALTH MIL
BIRTH_RT -0.1588
DEATH_RT -0.0487 -0.2038
EDUC 0.0498 0.1140 -0.0627
HEALTH -0.0408 -0.1196 0.1162 -0.0617
MIL 0.0104 0.0367 0.0011 0.0144 -0.0249
Constant 4.1354 4.3332 -1.6504 1.7008 -0.0468
Constant
Constant -51.1780

Group Islamic covariance matrix
------------------------------------
BIRTH_RT DEATH_RT EDUC HEALTH MIL
BIRTH_RT 48.6381
DEATH_RT 27.5429 25.5429
EDUC -19.8729 -20.3689 33.7508
HEALTH -10.9262 -10.6192 18.8309 10.8603
MIL -15.5902 -28.4991 36.6788 19.3235 66.0183

Group Islamic correlation matrix
------------------------------------
BIRTH_RT DEATH_RT EDUC HEALTH MIL
BIRTH_RT 1.0000
DEATH_RT 0.7814 1.0000
EDUC -0.4905 -0.6937 1.0000
HEALTH -0.4754 -0.6376 0.9836 1.0000
MIL -0.2751 -0.6940 0.7770 0.7217 1.0000

Ln( Det(COV of group Islamic) )= 10.34980794

Group Islamic discriminant function coefficients
-------------------------------------------------
BIRTH_RT DEATH_RT EDUC HEALTH MIL
BIRTH_RT -0.0236
DEATH_RT 0.0703 -0.0751
EDUC 0.0099 -0.0726 -0.3578
HEALTH -0.0424 0.1331 1.0933 -0.8951
MIL 0.0261 -0.0469 0.0485 -0.0360 -0.0190
Constant 0.9492 -0.5959 1.2818 -0.9994 -0.3956
Constant
Constant -20.4487

Group NewWorld covariance matrix
------------------------------------
BIRTH_RT DEATH_RT EDUC HEALTH MIL
BIRTH_RT 60.2476
DEATH_RT 9.5738 8.1619
EDUC -30.8573 -6.2226 45.0446
HEALTH -27.9303 -5.2955 41.6304 40.4104
MIL -15.4143 -1.7399 18.3092 17.2913 12.2372

Group NewWorld correlation matrix
------------------------------------
BIRTH_RT DEATH_RT EDUC HEALTH MIL
BIRTH_RT 1.0000
DEATH_RT 0.4317 1.0000
EDUC -0.5923 -0.3245 1.0000
HEALTH -0.5661 -0.2916 0.9758 1.0000
MIL -0.5677 -0.1741 0.7798 0.7776 1.0000

Ln( Det(COV of group NewWorld) )= 11.46371023

Group NewWorld discriminant function coefficients
-------------------------------------------------
BIRTH_RT DEATH_RT EDUC HEALTH MIL
BIRTH_RT -0.0077
DEATH_RT 0.0121 -0.0401
EDUC -0.0079 -0.0213 -0.1260
HEALTH 0.0040 0.0114 0.2418 -0.1331
MIL -0.0115 0.0196 0.0225 0.0210 -0.0580
Constant 0.4354 0.2643 0.8264 -0.6543 0.5229
Constant
Constant -13.3124

Ln( Det(Pooled covariance matrix) )= 13.05914566

Test for equality of covariance matrices
Chisquare= 139.5799 df= 30 prob= 0.0000

Between groups F-matrix -- df = 5 49
---------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 64.4526 0.0
NewWorld 43.1437 15.9199 0.0

Mahalanobis distance-square from group means and
Posterior probabilities for group membership
Priors = .333 .333 .333
Europe Islamic NewWorld
(We omit the eigenvalues, etc.)
Look at the quadratic function displayed at the beginning of this example. For our data,
the coefficients for the European group are:
a = -51.178, b = 4.135, c = 4.333, d = -1.650, e = 1.701, f = -0.047, g = -0.049, ...,
p = 0.014, q = -0.159, ..., and u = -0.025
or
f = -51.178 + 4.135*birth_rt + ... - 0.049*birth_rt*death_rt + ... - 0.159*birth_rt² + ... - 0.025*mil²
Similar functions exist for the other two groups.
(We omit the distances and probabilities for the Europe and Islamic groups.)

NewWorld
------------
Argentina 48.1 .00 45.2 .00 3.8 1.00
Barbados 31.8 .00 65.0 .00 6.5 1.00
Bolivia --> 369.3 .00 4.1 .65 4.2 .35
Brazil 133.3 .00 9.4 .03 1.1 .97
Canada --> 14.5 .88 533.6 .00 15.7 .12
Chile 66.6 .00 16.6 .00 1.8 1.00
Colombia 161.1 .00 9.2 .04 1.8 .96
CostaRica 181.6 .00 93.2 .00 7.8 1.00
Venezuela 180.9 .00 16.6 .01 6.0 .99
DominicanR. 175.3 .00 21.5 .00 2.3 1.00
Uruguay 23.1 .00 38.4 .00 5.8 1.00
Ecuador 212.2 .00 5.8 .13 .8 .87
ElSalvador 312.9 .00 10.0 .03 2.0 .97
Jamaica 73.8 .00 20.2 .00 2.5 1.00
Guatemala 404.9 .00 6.0 .17 1.7 .83
Haiti --> 792.1 .00 3.9 .99 11.2 .01
Honduras 395.9 .00 16.1 .00 4.1 1.00
Trinidad 164.1 .00 38.0 .00 5.6 1.00
Peru 167.6 .00 18.9 .00 4.9 1.00
Panama 133.9 .00 97.7 .00 3.4 1.00
Cuba 33.6 .00 39.7 .00 6.8 1.00
--> case misclassified
* case not used in computation

Classification matrix (cases in row categories classified into columns)
---------------------
Europe Islamic NewWorld %correct
Europe 20 0 0 100
Islamic 0 14 1 93
NewWorld 1 2 18 86

Total 21 16 19 93

Jackknifed classification matrix
--------------------------------
Europe Islamic NewWorld %correct
Europe 20 0 0 100
Islamic 0 13 2 87
NewWorld 1 2 18 86

Total 21 15 20 91
The output also includes the chi-square test for equality of covariance matrices. The
results are highly significant (p < 0.00005). Thus, we reject the hypothesis of equal
covariance matrices.
The Mahalanobis distances reveal that only four cases are misclassified: Turkey as
a New World country, Canada as European, and Haiti and Bolivia as Islamic.
The classification matrix indicates that 93% of the countries are correctly classified;
using the jackknifed results, the percentage drops to 91%. The latter percentage agrees
with that for the linear model using the same variables.
The output for the second model follows:
Between groups F-matrix -- df = 5 49
---------------------------------------------
Europe Islamic NewWorld
Europe 0.0
Islamic 51.5154 0.0
NewWorld 33.6025 17.9915 0.0

Mahalanobis distance-square from group means and
Posterior probabilities for group membership
Priors = .333 .333 .333
Europe Islamic NewWorld

NewWorld
------------
Argentina 30.9 .00 48.3 .00 4.3 1.00
Barbados 35.5 .00 68.7 .00 7.4 1.00
Bolivia 186.2 .00 10.1 .08 2.2 .92
Brazil 230.8 .00 8.1 .13 1.2 .87
Canada --> 19.4 .74 524.3 .00 16.3 .26
Chile 144.3 .00 17.2 .00 1.6 1.00
Colombia 475.1 .00 29.8 .00 1.9 1.00
CostaRica 834.5 .00 190.5 .00 10.3 1.00
Venezuela 932.5 .00 83.6 .00 8.8 1.00
DominicanR. 267.4 .00 18.6 .00 2.0 1.00
Uruguay 15.2 .04 60.5 .00 3.9 .96
Ecuador 276.0 .00 11.5 .02 1.0 .98
ElSalvador 498.0 .00 17.6 .00 1.7 1.00
Jamaica 312.0 .00 15.5 .00 .7 1.00
Guatemala 501.3 .00 7.9 .24 2.5 .76
Haiti --> 648.4 .00 4.6 .99 10.2 .01
Honduras 688.1 .00 31.8 .00 4.0 1.00
Trinidad 315.4 .00 43.1 .00 4.6 1.00
Peru 179.9 .00 16.3 .02 5.1 .98
Panama 411.0 .00 109.7 .00 3.6 1.00
Cuba 54.7 .00 54.5 .00 6.8 1.00
--> case misclassified
* case not used in computation

Classification matrix (cases in row categories classified into columns)
---------------------
Europe Islamic NewWorld %correct
Europe 20 0 0 100
Islamic 0 15 0 100
NewWorld 1 1 19 90

Total 21 16 19 96

This model does slightly better than the first one: the classification matrices here
show that 96% and 93%, respectively, are classified correctly. This is because Turkey
and Bolivia are classified correctly here but misclassified by the first model.
Example 7
Cross-Validation
At the end of the interactive stepping example, we reported the percentage of correct
classification for six models. The same sample was used to compute the estimates and
evaluate the success of the rules. We also reported results for the jackknifed
classification procedure that removes and replaces one case at a time. This approach,
however, may still give an overly optimistic picture. Ideally, we should try the rules on
a new sample and compare results with those for the original data. Since this usually
isn't practical, researchers often use a cross-validation procedure: that is, they
randomly split the data into two samples, use the first sample to estimate the
classification functions, and then use the resulting functions to classify the second
sample. The first sample is often called the learning sample and the second, the test
sample. The proportion of correct classification for the test sample is an empirical
measure for the success of the discrimination.
Cross-validation is easy to implement in discriminant analysis. Cases assigned a
weight of 0 are not used to estimate the discriminant functions but are classified into
groups. In this example, we generate a uniform random number (values range from 0
to 1.0) for each case, and when it is less than 0.65, the value 1.0 is stored in a new
weight variable named CASE_USE. If the random number is equal to or greater than
0.65, a 0 is placed in the weight variable. So, approximately 65% of the cases have a
weight of 1.0, and 35%, a weight of 0.
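The same bookkeeping can be sketched outside SYSTAT. The following Python/NumPy fragment is illustrative only; the nearest-group-mean rule is a stand-in for the estimated discriminant functions. It splits the cases with a random 0/1 weight, fits on the learning sample, and scores the test sample:

import numpy as np

rng = np.random.default_rng(0)

def cross_validate(X, group, p_learn=0.65):
    """X: cases-by-variables array; group: array of group codes."""
    case_use = rng.uniform(size=len(X)) < p_learn      # ~65% of cases get weight 1
    labels = np.unique(group)
    # fit on the learning sample only (group means as a stand-in classification rule)
    means = np.array([X[case_use & (group == g)].mean(axis=0) for g in labels])
    test = ~case_use                                    # the zero-weight (test) cases
    d = ((X[test, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    predicted = labels[d.argmin(axis=1)]
    return (predicted == group[test]).mean()            # proportion correct in the test sample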
Jackknifed classification matrix
--------------------------------
Europe Islamic NewWorld %correct
Europe 19 0 1 95
Islamic 0 14 1 93
NewWorld 1 1 19 90

Total 20 15 21 93

Eigen Canonical Cumulative proportion
values correlations of total dispersion
--------- ------------ ---------------------
5.585 0.921 0.801
1.391 0.763 1.000

Canonical scores of group means
-------------------------------
Europe -2.916 .501
Islamic 2.725 1.322
NewWorld .831 -1.422
We now request a cross-validation for each of the following six models using the
OURWORLD data:
1. BIRTH_RT DEATH_RT MIL
2. BIRTH_RT DEATH_RT MIL EDUC HEALTH
3. MIL B_TO_D LITERACY
4. MIL B_TO_D LITERACY EDUC HEALTH
5. MIL B_TO_D LITERACY DIFFRNCE
6. MIL B_TO_D LITERACY DIFFRNCE GDP_CAP
Use interactive forward stepping to toggle variables in and out of the model subsets.
The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LET diffrnce = educ - health
LET case_use = URN < .65
WEIGHT = case_use
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy diffrnce
PRINT NONE / FSTATS CLASS JCLASS
GRAPH NONE
START / FORWARD
STEP birth_rt death_rt mil
STEP educ health
STEP birth_rt death_rt educ health b_to_d literacy
STEP educ health
STEP educ health diffrnce
STEP gdp_cap
STOP
Here are the results from the first STEP after MIL enters:
Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
8 BIRTH_RT 57.86 0.640126 | 6 URBAN 7.41 0.415097
10 DEATH_RT 24.56 0.513344 | 12 BABYMORT 0.20 0.234804
23 MIL 13.43 0.760697 | 16 GDP_CAP 3.22 0.394128
| 19 EDUC 2.00 0.673136
| 21 HEALTH 4.68 0.828565
| 34 B_TO_D 0.16 0.209796
| 30 LIFEEXPM 0.42 0.136526
| 31 LIFEEXPF 0.83 0.104360
| 32 LITERACY 1.54 0.244547
| 40 DIFFRNCE 5.23 0.784797

Three classification matrices result. The first presents results for the learning sample,
the cases with CASE_USE values of 1.0. Overall, 95% of these countries are classified
correctly. The sample size is 13 + 9 + 16 = 38, or 67.9% of the original sample of 56
countries. The second classification table reflects those cases not used to compute
estimates, the test sample. The percentage of correct classification drops to 76% for
these 17 countries. The final classification table presents the jackknifed results for the
learning sample. Notice that the percentages of correct classification are closer to those
for the learning sample than for the test sample.
Now we add the variables EDUC and HEALTH, with the following results:
Classification matrix (cases in row categories classified into columns)
---------------------
Europe Islamic NewWorld %correct
Europe 13 0 0 100
Islamic 0 8 1 89
NewWorld 0 1 15 94

Total 13 9 16 95

Classification of cases with zero weight or frequency
-----------------------------------------------------
Europe Islamic NewWorld %correct
Europe 6 0 0 100
Islamic 0 4 2 67
NewWorld 2 0 3 60

Total 8 4 5 76

Jackknifed classification matrix
--------------------------------
Europe Islamic NewWorld %correct
Europe 13 0 0 100
Islamic 0 8 1 89
NewWorld 1 1 14 88

Total 14 9 15 92
Variable F-to-remove Tolerance | Variable F-to-enter Tolerance
-------------------------------------+-------------------------------------
8 BIRTH_RT 21.13 0.588377 | 6 URBAN 6.50 0.414511
10 DEATH_RT 16.52 0.508827 | 12 BABYMORT 0.07 0.221475
19 EDUC 2.24 0.103930 | 16 GDP_CAP 3.06 0.242491
21 HEALTH 4.88 0.127927 | 34 B_TO_D 0.32 0.198963
23 MIL 5.68 0.567128 | 30 LIFEEXPM 0.05 0.117494
| 31 LIFEEXPF 0.04 0.080161
| 32 LITERACY 1.75 0.238831
| 40 DIFFRNCE 0.00 0.000000

Classification matrix (cases in row categories classified into columns)
---------------------
Europe Islamic NewWorld %correct
Europe 13 0 0 100
Islamic 0 8 1 89
NewWorld 0 1 15 94

Total 13 9 16 95

After we add EDUC and HEALTH, the results here for the learning sample do not
differ from those for the previous model. However, for the test sample, the addition of
EDUC and HEALTH increases the percentage correct from 76% to 88%.
We continue by issuing the STEP specifications listed above, each time noting the
total percentage correct as well as the percentages for the Islamic and New World
groups. After scanning the classification results from both the test sample and the
jackknifed panel for the learning sample, we conclude that model 2 (BIRTH_RT, DEATH_RT,
MIL, EDUC, and HEALTH) is best and that model 1 performs the worst.
Classification of New Cases
Group membership is known in the current example. What if you have cases where the
group membership is unknown? For example, you might want to apply the rules
developed for one sample to a new sample.
When the value of the grouping variable is missing, SYSTAT still classifies the
case. For example, we set the group code for New World countries to missing
and request automatic forward stepping for the model containing BIRTH_RT,
DEATH_RT, MIL, EDUC, and HEALTH:
Classification of cases with zero weight or frequency
-----------------------------------------------------
Europe Islamic NewWorld %correct
Europe 6 0 0 100
Islamic 0 5 1 83
NewWorld 1 0 4 80

Total 7 5 5 88

Jackknifed classification matrix
--------------------------------
Europe Islamic NewWorld %correct
Europe 13 0 0 100
Islamic 0 8 1 89
NewWorld 0 1 15 94

Total 13 9 16 95
IF group = 3 THEN LET group$ = .
The following are the Mahalanobis distances and posterior probabilities for the
countries with missing group codes and also the classification matrix. The weight
variable is not used here.
Argentina, Barbados, Canada, Uruguay, and Cuba are classified as European; the other
16 countries are classified as Islamic.
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LET diffrnce = educ - health
IF group = 3 THEN LET group = .
LABEL group / 1=Europe, 2=Islamic
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy diffrnce
IDVAR = country$
PRINT / MAHAL
START / FORWARD
STEP / AUTO
STOP
Mahalanobis distance-square from group means and
Posterior probabilities for group membership
Priors = .500 .500
Europe Islamic

Not Grouped
------------
Argentina * 28.6 1.00 59.6 .00
Barbados * 25.9 1.00 71.9 .00
Bolivia * 120.7 .00 2.7 1.00
Brazil * 115.7 .00 10.0 1.00
Canada * 2.1 1.00 124.1 .00
Chile * 63.2 .00 35.5 1.00
Colombia * 204.0 .00 22.4 1.00
CostaRica * 306.5 .00 60.8 1.00
Venezuela * 297.0 .00 49.2 1.00
DominicanR. * 129.9 .00 10.8 1.00
Uruguay * 12.5 1.00 91.4 .00
Ecuador * 149.3 .00 8.4 1.00
ElSalvador * 183.7 .00 10.1 1.00
Jamaica * 100.2 .00 32.7 1.00
Guatemala * 155.3 .00 5.5 1.00
Haiti * 136.8 .00 1.4 1.00
Honduras * 216.6 .00 13.2 1.00
Trinidad * 132.6 .00 14.0 1.00
Peru * 99.4 .00 7.4 1.00
Panama * 160.5 .00 18.9 1.00
Cuba * 19.4 1.00 70.7 .00
--> case misclassified
* case not used in computation

Classification matrix (cases in row categories classified into columns)
---------------------
Europe Islamic %correct
Europe 19 0 100
Islamic 0 15 100

Total 19 15 100

Not Grouped 5 16
Chapter 12
Factor Analysis
Herb Stenson and Leland Wilkinson
Factor analysis provides principal components analysis and common factor analysis
(maximum likelihood and iterated principal axis). SYSTAT has options to rotate, sort,
plot, and save factor loadings. With the principal components method, you can also
save the scores and coefficients. Orthogonal methods of rotation include varimax,
equamax, quartimax, and orthomax. A direct oblimin method is also available for
oblique rotation. Users can explore other rotations by interactively rotating a 3-D
Quick Graph plot of the factor loadings. Various inferential statistics (for example,
confidence intervals, standard errors, and chi-square tests) are provided, depending on
the nature of the analysis that is run.
Statistical Background
Principal components (PCA) and common factor (MLA for maximum likelihood and
IPA for iterated principal axis) analyses are methods of decomposing a correlation or
covariance matrix. Although principal components and common factor analyses are
based on different mathematical models, they can be used on the same data and both
usually produce similar results. Factor analysis is often used in exploratory data
analysis to:
n Study the correlations of a large number of variables by grouping the variables in
factors so that variables within each factor are more highly correlated with
variables in that factor than with variables in other factors.
n Interpret each factor according to the meaning of the variables.
n Summarize many variables by a few factors. The scores from the factors can be
used as input data for t tests, regression, ANOVA, discriminant analysis, and so on.
Often the users of factor analysis are overwhelmed by the gap between theory and
practice. In this chapter, we try to offer practical hints. It is important to realize that you
may need to make several passes through the procedure, changing options each time,
until the results give you the necessary information for your problem.
If you understand the component model, you are on the way toward understanding
the factor model, so let's begin with the former.
A Principal Component
What is a principal component? The simplest way to see is through real data. The
following data consist of Graduate Record Examination verbal and quantitative scores.
These scores are from 25 applicants to a graduate psychology department.
VERBAL QUANTITATIVE
590 530
620 620
640 620
650 550
620 610
610 660
560 570
610 730
600 650
740 790
560 580
680 710
600 540
520 530
660 650
750 710
630 640
570 660
600 650
570 570
600 550
690 540
770 670
610 660
600 640
Now, we could decide to try linear regression to predict verbal scores from quantitative.
Or, we could decide to predict quantitative from verbal by the same method. The data
don't suggest which is a dependent variable; either will do. What if we aren't interested
in predicting either one separately but instead want to know how both variables hang
together jointly? This is what a principal component does. Karl Pearson, who
developed principal component analysis in 1901, described a component as a "line of
closest fit to systems of points in space." In short, the regression line indicates best
prediction, and the component line indicates best association.
The following figure shows the regression and component lines for our GRE data.
The regression of y on x is the line with the smallest slope (flatter than diagonal). The
regression of x on y is the line with the largest slope (steeper than diagonal). The
component line is between the other two. Interestingly, when most people are asked to
draw a line relating two variables in a scatterplot, they tend to approximate the
component line. It takes a lot of explaining to get them to realize that this is not the best
line for predicting the vertical axis variable (y) or the horizontal axis variable (x).
Notice that the slope of the component line is approximately 1, which means that the
two variables are weighted almost equally (assuming the axis scales are the same). We
could make a new variable called GRE that is the sum of the two tests:
GRE = VERBAL + QUANTITATIVE
This new variable could summarize, albeit crudely, the information in the other two. If
the points clustered almost perfectly around the component line, then the new
component variable could summarize almost perfectly both variables.
[Figure: scatterplot of Quantitative GRE Score against Verbal GRE Score (both axes 500 to 800), showing the two regression lines and the component line.]
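As a small illustration (Python with NumPy; an addition to the text, not SYSTAT output), the component line can be found directly from the covariance matrix of the two scores: its direction is the eigenvector with the largest eigenvalue, and that eigenvalue is the variance the component captures. For the GRE data the two weights come out nearly equal, which is why the simple sum is a serviceable summary.

import numpy as np

def first_component(x, y):
    """Direction of the component line and the share of variance it captures."""
    S = np.cov(np.column_stack([x, y]), rowvar=False)  # 2 x 2 covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(S)      # eigenvalues in ascending order
    direction = eigenvectors[:, -1]                    # weights defining the component line
    share = eigenvalues[-1] / eigenvalues.sum()        # proportion of total variance captured
    return direction, share

# e.g., direction, share = first_component(verbal, quantitative)
# with verbal and quantitative holding the 25 pairs of GRE scores listed above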
Multiple Principal Components
The goal of principal components analysis is to summarize a multivariate data set as
accurately as possible using a few components. So far, we have seen only one
component. It is possible, however, to draw a second component perpendicular to the
first. The first component will summarize as much of the joint variation as possible.
The second will summarize whats left. If we do this with the GRE data, of course, we
will have as many components as original variablesnot much of a saving. We usually
seek fewer components than variables, so that the variation left over is negligible.
Component Coefficients
In the above equation for computing the first principal component on our test data, we
made both coefficients equal. In fact, when you run the sample covariance matrix using
factor analysis in SYSTAT, the coefficients are as follows:
GRE = 0.008 * VERBAL + 0.01 * QUANTITATIVE
They are indeed nearly equal. Their magnitude is considerably less than 1 because
principal components are usually scaled to conserve variance. That is, once you
compute the components with these coefficients, the total variance on the components
is the same as the total variance on the original variables.
Component Loadings
Most researchers want to know the relation between the original variables and the
components. Some components may be nearly identical to an original variable; in other
words, their coefficients may be nearly 0 for all variables except one. Other
components may be a more even amalgam of several original variables.
Component loadings are the covariances of the original variables with the
components. In our example, these loadings are 51.085 for VERBAL and 62.880 for
QUANTITATIVE. You may have noticed that these are proportional to the coefficients;
they are simply scaled differently. If you square each of these loadings and add them
up separately for each component, you will have the variance accounted for by each
component.
Correlations or Covariances
Most researchers prefer to analyze the correlation rather than covariance structure
among their variables. Sample correlations are simply covariances of sample
standardized variables. Thus, if your variables are measured on very different scales or
if you feel the standard deviations of your variables are not theoretically significant,
you will want to work with correlations instead of covariances. In our test example,
working with correlations yields loadings of 0.879 for each variable instead of 51.085
and 62.880. When you factor the correlation instead of the covariance matrix, then the
loadings are the correlations of each component with each original variable.
For our test data, loadings of 0.879 mean that if you created a GRE component by
standardizing VERBAL and QUANTITATIVE and adding them together weighted by
the coefficients, you would find the correlation between these component scores and
the original VERBAL scores to be 0.879. The same would be true for QUANTITATIVE.
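A quick way to see where a loading like 0.879 comes from (a derivation added here for reference, not part of the SYSTAT output): for two standardized variables with correlation r, the correlation matrix has eigenvalues 1 + r and 1 - r, and the first component loads each variable by sqrt((1 + r)/2). A loading of 0.879 therefore implies r = 2*(0.879)² - 1, or about 0.55, for the verbal and quantitative scores.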
Signs of Component Loadings
The signs of loadings within components are arbitrary. If a component (or factor) has
more negative than positive loadings, you may change minus signs to plus and plus to
minus. SYSTAT does this automatically for components that have more negative than
positive loadings, and thus will occasionally produce components or factors that have
different signs from those in other computer programs. This occasionally confuses
users. In mathematical terms, a set of loadings and its negative are equivalent.
Factor Analysis
We have seen how principal components analysis is a method for computing new
variables that summarize variation in a space parsimoniously. For our test variables, the
equation for computing the first component was:
GRE = 0.008 * VERBAL + 0.01 * QUANTITATIVE
This component equation is linear, of the form:
Component = Linear combination of {Observed variables}
Factor analysts turn this equation around:
Observed variable = Linear combination of {Factors} + Error
This model was presented by Spearman near the turn of the century in the context of a
single intelligence factor and extended to multiple mental measurement factors by
Thurstone several decades later. Notice that the factor model makes observed variables
a function of unobserved factors. Even though this looks like a linear regression model,
none of the graphical and analytical techniques used for regression can be applied to
the factor model because there is no unique, observable set of factor scores or residuals
to examine.
Factor analysts are less interested in prediction than in decomposing a covariance
matrix. This is why the fundamental equation of factor analysis is not the above linear
model, but rather its quadratic form:
Observed covariances = Factor covariances + Error covariances
The covariances in this equation are usually expressed in matrix form, so that the
model decomposes an observed covariance matrix into a hypothetical factor
covariance matrix plus a hypothetical error covariance matrix. The diagonals of these
two hypothetical matrices are known, respectively, as communalities and
specificities.
In ordinary language, then, the factor model expresses variation within and relations
among observed variables as partly common variation among factors and partly
specific variation among random errors.
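In matrix form (a standard way of writing the model, added here for reference and using A for the loading matrix as elsewhere in this chapter), the quadratic equation is

S = AA′ + U

where S is the observed correlation or covariance matrix, AA′ is the hypothetical factor covariance matrix, and U is the hypothetical error covariance matrix; the diagonal of AA′ contains the communalities and the diagonal of U the specificities.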
Estimating Factors
Factor analysis involves several steps:
n First, the correlation or covariance matrix is computed from the usual cases-by-
variables data file or it is input as a matrix.
n Second, the factor loadings are estimated. This is called initial factor extraction.
Extraction methods are described in this section.
n Third, the factors are rotated to make the loadings more interpretable; that is,
rotation methods make the loadings for each factor either large or small, not in-
between. These methods are described in the next section.
Factors must be estimated iteratively in a computer. There are several methods
available. The most popular approach, available in SYSTAT, is to modify the diagonal
of the observed covariance matrix and calculate factors the same way components are
computed. This procedure is repeated until the communalities reproduced by the factor
covariances are indistinguishable from the diagonal of the modified matrix.
Rotation
Usually the initial factor extraction does not give interpretable factors. One of the
purposes of rotation is to obtain factors that can be named and interpreted. That is, if
you can make the large loadings larger than before and the smaller loadings smaller,
then each variable is associated with a minimal number of factors. Hopefully, the
variables that load strongly together on a particular factor will have a clear meaning
with respect to the subject area at hand.
It helps to study plots of loadings for one factor against those for another. Ideally,
you want to see clusters of loadings at extreme values for each factor: like what A and
C are for factor 1, and B and D are for factor 2 in the left plot, and not like E and F in
the middle plot.
In the middle plot, the loadings in groups E and F are sizeable for both factors 1 and 2.
However, if you lift the plot axes away from E and F, rotating them 45 degrees, and then
set them down as on the right, you achieve the desired effect. Sounds easy for two
factors. For three factors, imagine that the loadings are balls floating in a room and that
you rotate the floor and walls so that each loading is as close to the floor or a wall as it
can be. This concept generalizes to more dimensions.
Researchers let the computer do the rotation automatically. There are many criteria
for achieving a simple structure among component loadings, although Thurstones are
most widely cited. For p variables and m components:
n Each component should have at least m near-zero loadings.
n Few components should have nonzero loadings on the same variable.
SYSTAT provides five methods of rotating loadings: varimax, equamax, quartimax,
orthomax, and oblimin.
[Figure: three factor loading plots. Left: loadings clustered at the extremes of the axes (groups A, B, C, and D). Middle: groups E and F with sizeable loadings on both factors. Right: the same loadings after rotating the axes 45 degrees.]
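For readers who want to see what an automatic rotation does computationally, here is a compact sketch of the familiar SVD-based varimax iteration in Python with NumPy (an illustration of the general algorithm, not SYSTAT's own implementation): it searches for an orthogonal rotation matrix that drives each column's squared loadings toward either large or near-zero values.

import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Rotate a p x k loading matrix; gamma = 1 gives varimax."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # gradient of the orthomax criterion with respect to the rotation
        target = rotated ** 3 - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0))
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                          # nearest orthogonal matrix to the gradient
        new_criterion = s.sum()
        if criterion != 0 and new_criterion / criterion < 1 + tol:
            break                                  # criterion has stopped improving
        criterion = new_criterion
    return loadings @ rotation, rotation

In the same orthomax family, gamma = 0 corresponds to quartimax and gamma = k/2 to equamax, consistent with the Gamma option described later in this chapter.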
Principal Components versus Factor Analysis
SYSTAT can perform both principal components and common factor analysis. Some
view principal components analysis as a method of factor analysis, although there is a
theoretical distinction between the two. Principal components are weighted linear
composites of observed variables. Common factors are unobserved variables that are
hypothesized to account for the intercorrelations among observed variables.
One significant practical difference is that common factor scores are indeterminate,
whereas principal component scores are not. There are no sufficient estimators of
scores for subjects on common factors (rotated or unrotated, maximum likelihood, or
otherwise). Some computer models provide regression estimates of factor scores,
but these are not estimates in the usual statistical sense. This problem arises not
because factors can be arbitrarily rotated (so can principal components), but because
the common factor model is based on more unobserved parameters than observed data
points, an unusual circumstance in statistics.
In recent years, maximum likelihood factor analysis algorithms have been
devised to estimate common factors. The implementation of these algorithms in
popular computer packages has led some users to believe that the factor indeterminacy
problem does not exist for maximum likelihood factor estimates. It does.
Mathematicians and psychometricians have known about the factor indeterminacy
problem for decades. For a historical review of the issues, see Steiger (1979); for a
general review, see Rozeboom (1982). For further information on principal
components, consult Harman (1976), Mulaik (1978), Gnanadesikan (1977), or Mardia,
Kent, and Bibby (1979).
Because of the indeterminacy problem, SYSTAT computes subjects scores only
for the principal components model where subjects scores are a simple linear
transformation of scores on the factored variables. SYSTAT does not save scores from
a common factor model.
Applications and Caveats
While there is not room here to discuss more statistical issues, you should realize that
there are several myths about factors versus components:
Myth. The factor model allows hypothesis testing; the component model doesn't.
Fact. Morrison (1967) and others present a full range of formal statistical tests for
components.
Myth. Factor loadings are real; principal component loadings are approximations.
Fact. This statement is too ambiguous to have any meaning. It is easy to define things
so that factors are approximations of components.
Myth. Factor analysis is more likely to uncover lawful structure in your data; principal
components are more contaminated by error.
Fact. Again, this statement is ambiguous. With further definition, it can be shown to be
true for some data, false for others. It is true that, in general, factor solutions will have
lower dimensionality than corresponding component solutions. This can be an
advantage when searching for simple structure among noisy variables, as long as you
compare the result to a principal components solution to avoid being fooled by the sort
of degeneracies illustrated above.
Factor Analysis in SYSTAT
Factor Analysis Main Dialog Box
For factor analysis, from the menus choose:
Statistics
Data Reduction
Factor Analysis
The following options are available:
Model variables. Variables used to create factors.
Method. SYSTAT offers three estimation methods:
n Principal components analysis (PCA) is the default method of analysis.
n Iterated principal axis (IPA) provides an iterative method to extract common
factors by starting with the principal components solution and iteratively solving
for communalities.
n Maximum likelihood analysis (MLA) iteratively finds communalities and common
factors.
Display. You can sort factor loadings by size or display extended results. Selecting
Extended results displays all possible Factor output.
Sample size for matrix input. If your data are in the form of a correlation or covariance
matrix, you must specify the sample size on which the input matrix is based so that
inferential statistics (available with extended results) can be computed.
Matrix for extraction. You can factor a correlation matrix or a covariance matrix. Most
frequently, the correlation matrix is used. You can also delete missing cases pairwise
instead of listwise. Listwise deletes any case with missing data for any variable in the
list. Pairwise examines each pair of variables and uses all cases with both values
present.
Extraction parameters. You can limit the results by specifying extraction parameters.
n Minimum eigenvalue. Specify the smallest eigenvalue to retain. The default is 1.0
for PCA and IPA (not available with maximum likelihood). Incidentally, if you
specify 0, factor analysis ignores components with negative eigenvalues (which
can occur with pairwise deletion).
n Number of factors. Specify the number of factors to compute. If you specify both
the number of factors and the minimum eigenvalue, factor analysis uses whichever
criterion results in the smaller number of components.
n Iterations. Specify the number of iterations SYSTAT should perform (not available
for principal components). The default is 25.
n Convergence. Specify the convergence criterion (not available for principal
components). The default is 0.001.
Rotation Parameters
This dialog box specifies the factor rotation method.
The following methods are available:
n No rotation. Factors are not rotated.
n Varimax. An orthogonal rotation method that minimizes the number of variables
that have high loadings on each factor. It simplifies the interpretation of the factors.
n Equamax. A rotation method that is a combination of the varimax method, which
simplifies the factors, and the quartimax method, which simplifies the variables.
The number of variables that load highly on a factor and the number of factors
needed to explain a variable are minimized.
n Quartimax. A rotation method that minimizes the number of factors needed to
explain each variable. It simplifies the interpretation of the observed variables.
n Orthomax. Specifies families of orthogonal rotations. Gamma specifies the member
of the family to use. Varying Gamma changes maximization of the variances of the
loadings from columns (Varimax) to rows (Quartimax).
n Oblimin. Specifies families of oblique (non-orthogonal) rotations. Gamma
specifies the member of the family to use. For Gamma, specify 0 for moderate
correlations, positive values to allow higher correlations, and negative values to
restrict correlations.
Save
You can save factor analysis results for further analyses.
For the maximum likelihood and iterated principal axis methods, you can save only
loadings. For the principal components method, select from these options:
n Do not save results. Results are not saved.
n Factor scores. Standardized factor scores.
n Residuals. Residuals for each case. For a correlation matrix, the residual is the
actual z score minus the predicted z score using the factor scores times the loadings
to get the predicted scores. For a covariance matrix, the residuals are from
unstandardized predictions. With an orthogonal rotation, Q and PROB are also
saved. Q is the sum of the squared residuals, and PROB is its probability.
n Principal components. Unstandardized principal components scores with mean 0
and variance equal to the eigenvalue for the factor (only for PCA without rotation).
n Factor coefficients. Coefficients that produce standardized scores. For a correlation
matrix, multiply the coefficients by the standardized variables; for a covariance
matrix, use the original variables.
n Eigenvectors. Eigenvectors (only for PCA without a rotation). Use to produce
unstandardized scores.
n Factor loadings. Factor loadings.
n Save data with scores. Saves the selected item and all the variables in the working
data file as a new data file. Use with options for scores (not loadings, coefficients,
or other similar options).
If you save scores, the variables in the file are labeled FACTOR(1), FACTOR(2), and so
on. Any observations with missing values on any of the input variables will have
missing values for all scores. The scores are normalized to have zero mean and, if the
correlation matrix is used, unit variance. If you use the covariance matrix and perform
no rotations, SYSTAT does not standardize the component scores. The sum of their
variances is the same as for the original data.
If you want to use the score coefficients to get component scores for new data,
multiply the coefficients by the standardized data. SYSTAT does this when it saves
scores. Another way to do cross-validation is to assign a zero weight to those cases not
used in the factoring and to assign a unit weight to those cases used. The zero-weight
cases are not used in the factoring, but scores are computed for them.
When Factor scores or Principal components is requested, T² and PROB are also
saved. The former is the Hotelling T² statistic, which squares the standardized distance
from each case to the centroid of the factor space (that is, the sum of the squared,
standardized factor scores). PROB is the upper-tail probability of T². Use this statistic
to identify outliers within the factor space. T² is not computed with an oblique rotation.
Using Commands
After selecting a data file with USE filename, continue with:
FACTOR
MODEL varlist
SAVE filename / SCORES DATA LOAD COEF VECTORS PC RESID
ESTIMATE / METHOD = PCA or IPA or MLA ,
LISTWISE or PAIRWISE N=n CORR or COVA ,
NUMBER=n EIGEN=n ITER=n CONV=n SORT ,
ROTATE = VARIMAX or EQUAMAX or QUARTIMAX ,
or ORTHOMAX or OBLIMIN
GAMMA=n
Usage Considerations
Types of data. Data for factor analysis can be a cases-by-variables data file, a correlation
matrix, or a covariance matrix.
Print options. Factor analysis offers three categories of output: short (the default),
medium, and long. Each has specific output panels associated with it.
For Short, the default, panels are: Latent roots or eigenvalues (not MLA), initial and
final communality estimates (not PCA), component loadings (PCA) or factor pattern
(MLA, IPA), variance explained by components (PCA) or factors (MLA, IPA),
percentage of total variance explained, change in uniqueness and log likelihood at each
iteration (MLA only), and canonical correlations (MLA only). When a rotation is
requested: rotated loadings (PCA) or pattern (MLA, IPA) matrix, variance explained
by rotated components, percentage of total variance explained, and correlations among
oblique components or factors (oblimin only).
By specifying Medium, you get the panels listed for Short, plus: the matrix to factor,
the chi-square test that all eigenvalues are equal (PCA only), the chi-square test that
last k eigenvalues are equal (PCA only), and differences of original correlations or
covariances minus fitted values. For covariance matrix input (not MLA or IPA):
asymptotic 95% confidence limits for the eigenvalues and estimates of the population
eigenvalues with standard errors.
With Long, you get the panels listed for Short and Medium, plus: latent vectors
(eigenvectors) with standard errors (not MLA) and the chi-square test that the number
of factors is k (MLA only). With an oblimin rotation: direct and indirect contribution
of factors to variances and the rotated structure matrix.
Quick Graphs. Factor analysis produces a scree plot and a factor loadings plot.
Saving files. You can save factor scores, residuals, principal components, factor
coefficients, eigenvectors, or factor loadings as a new data file. For the iterated
principal axis and maximum likelihood methods, you can save only factor loadings.
You can save only eigenvectors and principal components for unrotated solutions using
the principal components method.
BY groups. Factor analysis produces separate analyses for each level of any BY
variables.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. Factor analysis uses FREQUENCY variables to duplicate cases for
rectangular data files.
Case weights. For rectangular data, you can weight cases using a WEIGHT variable.
Examples
Example 1
Principal Components
Principal components (PCA, the default method) is a good way to begin a factor
analysis (and possibly the only method you may need). If one variable is a linear
combination of the others, the program will not stop (MLA and IPA both require a
nonsingular correlation or covariance matrix). The PCA output can also provide
indications that:
n One or more variables have little relation to the others and, therefore, are not suited
for factor analysis, so in your next run, you might consider omitting them.
n The final number of factors may be three or four and not double or triple this
number.
To illustrate this method of factor extraction, we borrow data from Harman (1976),
who borrowed them from a 1937 unpublished thesis by Mullen. This classic data set is
widely used in the literature. For example, Jackson (1991) reports loadings for the
PCA, MLA, and IPA methods. The data are measurements recorded for 305 girls:
height, arm span, length of forearm, length of lower leg, weight, bitrochanteric
diameter (the upper thigh), chest girth, and chest width. Because the units of these
measurements differ, we analyze a correlation matrix:
          Height  Arm_Span  Forearm  Lowerleg  Weight  Bitro   Chestgir  Chestwid
Height     1.000
Arm_Span   0.846   1.000
Forearm    0.805   0.881    1.000
Lowerleg   0.859   0.826    0.801    1.000
Weight     0.473   0.376    0.380    0.436     1.000
Bitro      0.398   0.326    0.319    0.329     0.762   1.000
Chestgir   0.301   0.277    0.237    0.327     0.730   0.583   1.000
Chestwid   0.382   0.415    0.345    0.365     0.629   0.577   0.539     1.000
The correlation matrix is stored in the GIRLS file. SYSTAT knows that the file contains
a correlation matrix, so no special instructions are needed to read the matrix. Notice the
shortcut notation (..) for listing consecutive variables in a file. The input is:
FACTOR
USE girls
MODEL height .. chestwid
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=VARIMAX
The output follows:
Latent Roots (Eigenvalues)

1 2 3 4 5

4.6729 1.7710 0.4810 0.4214 0.2332

6 7 8

0.1867 0.1373 0.0965

Component loadings

1 2

HEIGHT 0.8594 0.3723
ARM_SPAN 0.8416 0.4410
LOWERLEG 0.8396 0.3953
FOREARM 0.8131 0.4586
WEIGHT 0.7580 -0.5247
BITRO 0.6742 -0.5333
CHESTWID 0.6706 -0.4185
CHESTGIR 0.6172 -0.5801

Variance Explained by Components

1 2

4.6729 1.7710

Percent of Total Variance Explained

1 2

58.4110 22.1373

Rotated Loading Matrix ( VARIMAX, Gamma = 1.0000)

1 2

ARM_SPAN 0.9298 0.1955
FOREARM 0.9191 0.1638
HEIGHT 0.8998 0.2599
LOWERLEG 0.8992 0.2295
WEIGHT 0.2507 0.8871
BITRO 0.1806 0.8404
CHESTGIR 0.1068 0.8403
CHESTWID 0.2509 0.7496
Notice that we did not specify how many factors we wanted. For PCA, the default
is to compute as many factors as there are eigenvalues greater than 1.0, so in this run,
you study results for two factors. After examining the output, you may want to specify
a minimum eigenvalue or, very rarely, a lower limit.
Unrotated loadings (and orthogonally rotated loadings) are correlations of the
variables with the principal components (factors). They are also the eigenvectors of the
correlation matrix multiplied by the square roots of the corresponding eigenvalues.
Usually these loadings are not useful for interpreting the factors. For some industrial
applications, researchers prefer to examine the eigenvectors alone.
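The eigenvalue/loading relationship can be verified with a few lines of NumPy (an illustration added here, not SYSTAT's code); applied to the GIRLS correlation matrix shown earlier, it retains the two components whose eigenvalues exceed 1.0.

import numpy as np

def pca_loadings(R, min_eigenvalue=1.0):
    """Unrotated component loadings from a correlation matrix R."""
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1]                # largest roots first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    keep = eigenvalues > min_eigenvalue                  # default retention rule
    # loadings = eigenvectors scaled by the square roots of their eigenvalues
    # (column signs are arbitrary, as noted under Signs of Component Loadings)
    return eigenvalues, eigenvectors[:, keep] * np.sqrt(eigenvalues[keep])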
The Variance explained for each component is the eigenvalue for the factor. The
first factor accounts for 58.4% of the variance; the second, 22.1%. The Total Variance
is the sum of the diagonal elements of the correlation (or covariance) matrix. By
summing the Percent of Total Variance Explained for the two factors
(58.411 + 22.137 = 80.548), you can say that more than 80% of the variance of all
eight variables is explained by the first two factors.
In the Rotated Loading Matrix, the rows of the display have been sorted, placing the
loadings > 0.5 for factor 1 first, and so on. These are the coefficients of the factors after
rotation, so notice that large values for the unrotated loadings are larger here and the
small values are smaller. The sums of squares of these coefficients (for each factor or
column) are printed below under the heading Variance Explained by Rotated
Components. Together, the two rotated factors explain more than 80% of the variance.
Factor analysis offers five types of rotation. Here, by default, the orthogonal varimax
method is used.
"Variance" Explained by Rotated Components

         1          2

      3.4973     2.9465

Percent of Total Variance Explained

         1          2

     43.7165    36.8318
To interpret each factor, look for variables with high loadings. The four variables
that load highly on factor 1 can be said to measure "lankiness," while the four that load
highly on factor 2 measure "stockiness." Other data sets may include variables that do not load
highly on any specific factor.
In the factor scree plot, the eigenvalues are plotted against their order (or associated
component). Use this display to identify large values that separate well from smaller
eigenvalues. This can help to identify a useful number of factors to retain. Scree is the
rubble at the bottom of a cliff; the large retained roots are the cliff, and the deleted ones
are the rubble.
The points in the factor loadings plot are variables, and the coordinates are the
rotated loadings. Look for clusters of loadings at the extremes of the factors. The four
variables at the right of the plot load highly on factor 1 and all reflect length. The
variables at the top of the plot load highly on factor 2 and reflect width.
Example 2
Maximum Likelihood
This example uses maximum likelihood for initial factor extraction and 2 as the
number of factors. Other options remain as in the principal components example. The
input is:
FACTOR
USE girls
MODEL height .. chestwid
ESTIMATE / METHOD=MLA N=305 NUMBER=2 SORT ROTATE=VARIMAX
The output follows:
Initial Communality Estimates

1 2 3 4 5

0.8162 0.8493 0.8006 0.7884 0.7488

6 7 8

0.6041 0.5622 0.4778

Iterative Maximum Likelihood Factor Analysis: Convergence = 0.001000.

Iteration Maximum Change in Negative log of
Number SQRT(uniqueness) Likelihood
1 0.722640 0.384050
2 0.243793 0.273332
3 0.051182 0.253671
4 0.010359 0.253162
5 0.000493 0.253162

Final Communality Estimates

1 2 3 4 5

0.8302 0.8929 0.8338 0.8006 0.9109

6 7 8

0.6363 0.5837 0.4633

Canonical Correlations

1 2

0.9823 0.9489

Factor pattern

1 2

HEIGHT 0.8797 0.2375
ARM_SPAN 0.8735 0.3604
LOWERLEG 0.8551 0.2633
FOREARM 0.8458 0.3442
WEIGHT 0.7048 -0.6436
BITRO 0.5887 -0.5383
CHESTWID 0.5743 -0.3653
CHESTGIR 0.5265 -0.5536

Variance Explained by Factors

1 2

4.4337 1.5179

Percent of Total Variance Explained

1 2

55.4218 18.9742

The first panel of output contains the communality estimates. The communality of a
variable is its theoretical squared multiple correlation with the factors extracted. For
MLA (and IPA), the initial communality estimate for each variable is its observed squared
multiple correlation with all the other variables.
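These initial estimates can be computed directly from the correlation matrix: by a standard identity, the squared multiple correlation of variable i with the others is 1 - 1/r(i,i), where r(i,i) is the ith diagonal element of the inverse correlation matrix. A short NumPy version (illustrative only; it requires a nonsingular matrix, as MLA and IPA do):

import numpy as np

def initial_communalities(R):
    """Squared multiple correlation of each variable with all the others."""
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))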
Rotated Pattern Matrix ( VARIMAX, Gamma = 1.0000)

1 2

ARM_SPAN 0.9262 0.1873
FOREARM 0.8942 0.1853
HEIGHT 0.8628 0.2928
LOWERLEG 0.8569 0.2576
WEIGHT 0.2268 0.9271
BITRO 0.1891 0.7750
CHESTGIR 0.1289 0.7530
CHESTWID 0.2734 0.6233

"Variance" Explained by Rotated Factors

1 2

3.3146 2.6370

Percent of Total Variance Explained

1 2

41.4331 32.9628

Percent of Common Variance Explained

1 2

55.6927 44.3073
The canonical correlations are the largest multiple correlations for successive
orthogonal linear combinations of factors with successive orthogonal linear
combinations of variables. These values are comfortably high. If, for other data, some
of the factors have values that are much lower, you might want to request fewer factors.
The loadings and amount of variance explained are similar to those found in the
principal components example. In addition, maximum likelihood reports the
percentage of common variance explained. Common variance is the sum of the
communalities. If A is the unrotated MLA factor pattern matrix, common variance is
the trace of AA′.
Number of Factors
In this example, we specified two factors to extract. If you were to omit this
specification and rerun the example, SYSTAT adds this report to the output:

The Maximum Number of Factors for Your Data is 4

SYSTAT will also report this message if you request more than four factors for these
data. This result is due to a theorem by Lederman and indicates that the degrees of
freedom allow estimates of loadings and communalities for only four factors.
If we set the print length to long, SYSTAT reports:

Chi-square Test that the Number of Factors is 4
CSQ = 4.3187 P = 0.1154 DF = 2.00

The results of this chi-square test indicate that you do not reject the hypothesis that
there are four factors (p value > 0.05). Technically, the hypothesis is that no more than
four factors are required. This, of course, does not negate 2 as the right number. For
the GIRLS data, here are rotated loadings for four factors:
Rotated Pattern Matrix ( VARIMAX, Gamma = 1.0000)

1 2 3 4

ARM_SPAN 0.9372 0.1984 -0.2831 0.0465
LOWERLEG 0.8860 0.2142 0.1878 0.1356
HEIGHT 0.8776 0.2819 0.1134 -0.0077
FOREARM 0.8732 0.1957 -0.0851 -0.0065
WEIGHT 0.2414 0.8830 0.1077 0.1080
BITRO 0.1823 0.8233 0.0163 -0.0784
CHESTGIR 0.1133 0.7315 -0.0048 0.5219
CHESTWID 0.2597 0.6459 -0.1400 0.0819
The loadings for the last two factors do not make sense. Possibly, the fourth factor has
one variable, CHESTGIR, but it still has a healthier loading on factor 2. This test is
based on an assumption of multivariate normality (as is MLA itself). If that assumption
does not hold, the test is invalid.
Example 3
Iterated Principal Axis
This example continues with the GIRLS data described in the principal components
example, this time using the IPA (iterated principal axis) method to extract factors. The
input is:
FACTOR
USE girls
MODEL height .. chestwid
ESTIMATE / METHOD=IPA SORT ROTATE=VARIMAX
The output is:
Initial Communality Estimates

1 2 3 4 5

0.8162 0.8493 0.8006 0.7884 0.7488

6 7 8

0.6041 0.5622 0.4778

Iterative Principal Axis Factor Analysis: Convergence = 0.001000.

Iteration Maximum Change in
Number SQRT(communality)
1 0.308775
2 0.039358
3 0.017077
4 0.008751
5 0.004934
6 0.002923
7 0.001776
8 0.001093
9 0.000677

Final Communality Estimates

1 2 3 4 5

0.8381 0.8887 0.8205 0.8077 0.8880

6 7 8

0.6403 0.5835 0.4921

Latent Roots (Eigenvalues)

1 2 3 4 5

4.4489 1.5100 0.1016 0.0551 0.0150

6 7 8

-0.0374 -0.0602 -0.0743

Factor pattern

1 2

HEIGHT 0.8561 0.3244
ARM_SPAN 0.8482 0.4114
LOWERLEG 0.8309 0.3424
FOREARM 0.8082 0.4090
WEIGHT 0.7500 -0.5706
BITRO 0.6307 -0.4924
CHESTWID 0.6074 -0.3509
CHESTGIR 0.5688 -0.5098

Variance Explained by Factors

1 2

4.4489 1.5100

Percent of Total Variance Explained

1 2

55.6110 18.8753

Rotated Pattern Matrix ( VARIMAX, Gamma = 1.0000)

1 2

ARM_SPAN 0.9203 0.2045
FOREARM 0.8874 0.1815
HEIGHT 0.8724 0.2775
LOWERLEG 0.8639 0.2478
WEIGHT 0.2334 0.9130
BITRO 0.1884 0.7777
CHESTGIR 0.1291 0.7529
CHESTWID 0.2581 0.6523

"Variance" Explained by Rotated Factors

1 2

3.3150 2.6439

Percent of Total Variance Explained

1 2

41.4377 33.0485

Percent of Common Variance Explained

1 2

55.6314 44.3686
Before the first iteration, the communality of a variable is its multiple correlation
squared with the remaining variables. At each iteration, communalities are estimated
from the loadings matrix, A, as the diagonal of AA′ (the row sums of squared loadings),
where the number of columns in A is the number of factors. Iterations continue until the largest change in
any communality is less than that specified with Convergence. Replacing the diagonal
of the correlation (or covariance) matrix with these final communality estimates and
computing the eigenvalues yields the latent roots in the next panel.
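The loop just described can be written compactly in Python/NumPy. This is a sketch of the general idea under simplifying assumptions, not SYSTAT's exact routine, which also handles covariance input, tracks the change in the square roots of the communalities, and applies other refinements:

import numpy as np

def iterated_principal_axis(R, n_factors, conv=0.001, max_iter=25):
    """Common factor loadings from a correlation matrix R by iterated principal axis."""
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))        # initial communalities (SMCs)
    for _ in range(max_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)                 # replace diagonal with communalities
        values, vectors = np.linalg.eigh(reduced)
        order = np.argsort(values)[::-1][:n_factors]  # keep the largest roots
        A = vectors[:, order] * np.sqrt(np.clip(values[order], 0.0, None))
        new_h2 = (A ** 2).sum(axis=1)                 # diagonal of AA' (row sums of squares)
        converged = np.max(np.abs(new_h2 - h2)) < conv
        h2 = new_h2
        if converged:
            break
    return A, h2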
Example 4
Rotation
Let's compare the unrotated and orthogonally rotated loadings from the principal
components example with those from an oblique rotation. The input is:
FACTOR
USE girls
PRINT = LONG
MODEL height .. chestwid
ESTIMATE / METHOD=PCA N=305 SORT
MODEL height .. chestwid
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=VARIMAX
MODEL height .. chestwid
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=OBLIMIN
We focus on the output directly related to the rotations:
Component loadings

1 2

HEIGHT 0.8594 0.3723
ARM_SPAN 0.8416 0.4410
LOWERLEG 0.8396 0.3953
FOREARM 0.8131 0.4586
WEIGHT 0.7580 -0.5247
BITRO 0.6742 -0.5333
CHESTWID 0.6706 -0.4185
CHESTGIR 0.6172 -0.5801

Variance Explained by Components

1 2

4.6729 1.7710

Percent of Total Variance Explained

1 2

58.4110 22.1373
Rotated Loading Matrix ( VARIMAX, Gamma = 1.0000)

1 2

ARM_SPAN 0.9298 0.1955
FOREARM 0.9191 0.1638
HEIGHT 0.8998 0.2599
LOWERLEG 0.8992 0.2295
WEIGHT 0.2507 0.8871
BITRO 0.1806 0.8404
CHESTGIR 0.1068 0.8403
CHESTWID 0.2509 0.7496

"Variance" Explained by Rotated Components

1 2

3.4973 2.9465

Percent of Total Variance Explained

1 2

43.7165 36.8318

Rotated Pattern Matrix (OBLIMIN, Gamma = 0.0)

1 2

ARM_SPAN 0.9572 -0.0166
FOREARM 0.9533 -0.0482
LOWERLEG 0.9157 0.0276
HEIGHT 0.9090 0.0604
WEIGHT 0.0537 0.8975
CHESTGIR -0.0904 0.8821
BITRO -0.0107 0.8642
CHESTWID 0.0876 0.7487

"Variance" Explained by Rotated Components

1 2

3.5273 2.9166

Percent of Total Variance Explained

1 2

44.0913 36.4569

Direct and Indirect Contributions of Factors To Variance

1 2

1 3.5087
2 0.0186 2.8979

Rotated Structure Matrix
1 2

ARM_SPAN 0.9350 0.4523
FOREARM 0.9500 0.3962
LOWERLEG 0.9277 0.4225
HEIGHT 0.9325 0.3629
WEIGHT 0.4407 0.9206
CHESTGIR 0.3620 0.8596
BITRO 0.4104 0.7865
CHESTWID 0.2900 0.8431
The values in Direct and Indirect Contributions of Factors to Variance are useful for
determining if part of a factor's contribution to Variance Explained is due to its
correlation with another factor. Notice that
3.509 + 0.019 = 3.528 (or 3.527)
is the Variance Explained for factor 1, and
2.898 + 0.019 = 2.917
is the Variance Explained for factor 2 (differences in the last digit are due to a
rounding error).
Think of the values in the Rotated Structure Matrix as correlations of the variables
with the factors. Here we see that the first four variables are highly correlated with the
first factor. The remaining variables are highly correlated with the second factor.
The factor loading plots illustrate the effects of the rotation methods. While the
unrotated factor loadings form two distinct clusters, both clusters have strong positive
loadings for factor 1. The "lanky" variables have moderate positive loadings on factor
2, while the "stocky" variables have negative loadings on factor 2. With the varimax
rotation, the "lanky" variables load highly on factor 1 with small loadings on factor 2;
the "stocky" variables load highly on factor 2. The oblimin rotation does a much better
job of centering each cluster at 0 on its minor factor.
Example 5
Factor Analysis Using a Covariance Matrix
Jackson (1991) describes a project in which the maximum thrust of ballistic missiles
was measured. For a specific measure called total impulse, it is necessary to calculate
the area under a curve. Originally, a planimeter was used to obtain the area, and later
an electronic device performed the integration directly but unreliably in its early usage.
As data, two strain gauges were attached to each of 40 Nike rockets, and both types of
measurements were recorded in parallel (making four measurements per rocket). The
covariance matrix of the measures is stored in the MISSLES file.
In this example, we illustrate features associated with covariance matrix input
(asymptotic 95% confidence limits for the eigenvalues, estimates of the population
eigenvalues with standard errors, and latent vectors (eigenvectors or characteristic
vectors) with standard errors). The input is:
FACTOR
USE missles
MODEL integra1 planmtr1 integra2 planmtr2
PRINT = LONG
ESTIMATE / METHOD=PCA COVA N=40
The output is:
Latent Roots (Eigenvalues)

1 2 3 4

335.3355 48.0344 29.3305 16.4096

Empirical upper bound for the first Eigenvalue = 398.0000.
Asymptotic 95% Confidence Limits for the Eigenvalues, N = 40.
Upper Limits:

1 2 3 4

596.9599 85.5102 52.2138 29.2122
Lower Limits:

1 2 3 4

233.1534 33.3975 20.3930 11.4093

Unbiased Estimates of Population Eigenvalues

1 2 3 4

332.6990 46.9298 31.0859 18.3953

Unbiased Estimates of Standard Errors of Eigenvalues

1 2 3 4

74.9460 10.1768 5.7355 3.2528


Chi-Square Test that all Eigenvalues are Equal, N = 40
CSQ = 110.6871 P = 0.0000 df = 9.00

Latent Vectors (Eigenvectors)

1 2 3 4

INTEGRA1 0.4681 0.6215 0.5716 0.2606
PLANMTR1 0.6079 0.1788 -0.7595 0.1473
INTEGRA2 0.4590 -0.1387 0.1677 -0.8614
PLANMTR2 0.4479 -0.7500 0.2615 0.4104

Standard Error for Each Eigenvector Element

1 2 3 4

INTEGRA1 0.0532 0.1879 0.2106 0.1773
PLANMTR1 0.0412 0.2456 0.0758 0.2066
INTEGRA2 0.0342 0.1359 0.2366 0.0519
PLANMTR2 0.0561 0.1058 0.2633 0.1276

Component loadings

1 2 3 4

INTEGRA1 8.5727 4.3072 3.0954 1.0559
PLANMTR1 11.1325 1.2389 -4.1131 0.5965
INTEGRA2 8.4051 -0.9616 0.9084 -3.4893
PLANMTR2 8.2017 -5.1983 1.4165 1.6625

Variance Explained by Components

1 2 3 4

335.3355 48.0344 29.3305 16.4096

Percent of Total Variance Explained

1 2 3 4

78.1467 11.1940 6.8352 3.8241

SYSTAT performs a test to determine if all eigenvalues are equal. The null hypothesis
is that all eigenvalues are equal against an alternative hypothesis that at least one root
is different. The results here indicate that you reject the null hypothesis (p < 0.00005).
At least one of the eigenvalues differs from the others.
The size and sign of the loadings reflect how the factors and variables are related.
The first factor has fairly similar loadings for all four variables. You can interpret this
factor as an overall average of the area under the curve across the four measures. The
second factor represents gauge differences because the signs are different for each. The
third factor is primarily a comparison between the first planimeter and the first
integration device. The last factor has no simple interpretation.
When there are four or more factors, the Quick Graph of the loadings is a SPLOM.
The first component represents 78% of the variability of the product, so plots of
loadings for factors 2 through 4 convey little information (notice that values in the
stripe displays along the diagonal concentrate around 0, while those for factor 1 fall to
the right).
Differences: Original Minus Fitted Correlations or Covariances

INTEGRA1 PLANMTR1 INTEGRA2 PLANMTR2

INTEGRA1 0.0000
PLANMTR1 0.0000 0.0000
INTEGRA2 0.0000 0.0000 0.0000
PLANMTR2 0.0000 0.0000 0.0000 0.0000
[Quick Graphs: a scree plot of the eigenvalues against the number of factors, and a factor loadings SPLOM for FACTOR(1) through FACTOR(4) with INTEGRA1, PLANMTR1, INTEGRA2, and PLANMTR2 as the plotted points.]
Example 6
Factor Analysis Using a Rectangular File
Begin this analysis from the OURWORLD cases-by-variables data file. Each case
contains information for one of 57 countries. We will study the interrelations among a
subset of 13 variables including economic measures (gross domestic product per capita
and U.S. dollars spent per person on education, health, and the military), birth and
death rates, population estimates for 1983, 1986, and 1990 plus predictions for 2020,
and the percentages of the population who can read and who live in cities.
We request principal components extraction with an oblique rotation. As a first step,
SYSTAT computes the correlation matrix. Correlations measure linear relations.
However, plots of the economic measures and population values as recorded indicate
a lack of linearity, so you use base 10 logarithms to transform six variables, and you
use square roots to transform two others. The input is:
FACTOR
USE ourworld
LET (gdp_cap, gnp_86, pop_1983, pop_1986, pop_1990, pop_2020),
= L10(@)
LET (mil,educ) = SQR(@)
MODEL urban birth_rt death_rt gdp_cap gnp_86 mil,
educ b_to_d literacy pop_1983 pop_1986,
pop_1990 pop_2020
PRINT=MEDIUM
SAVE pcascore / SCORES
ESTIMATE / METHOD=PCA SORT ROTATE=OBLIMIN
The output is:
Matrix to be factored

URBAN BIRTH_RT DEATH_RT GDP_CAP GNP_86

URBAN 1.0000
BIRTH_RT -0.8002 1.0000
DEATH_RT -0.5126 0.5110 1.0000
GDP_CAP 0.7636 -0.9189 -0.4012 1.0000
GNP_86 0.7747 -0.8786 -0.4518 0.9736 1.0000
MIL 0.6453 -0.7547 -0.1482 0.8657 0.8514
EDUC 0.6238 -0.7528 -0.2151 0.8996 0.9207
B_TO_D -0.3074 0.5106 -0.4340 -0.5293 -0.4411
LITERACY 0.7997 -0.9302 -0.6601 0.8337 0.8404
POP_1983 0.2133 -0.0836 0.0152 0.0583 0.0090
POP_1986 0.1898 -0.0523 0.0291 0.0248 -0.0215
POP_1990 0.1700 -0.0252 0.0284 -0.0015 -0.0447
POP_2020 0.0054 0.1880 0.0743 -0.2116 -0.2484

MIL EDUC B_TO_D LITERACY POP_1983

MIL 1.0000
EDUC 0.8869 1.0000
B_TO_D -0.6184 -0.5252 1.0000
LITERACY 0.6421 0.6869 -0.2737 1.0000
POP_1983 0.2206 -0.0062 -0.1526 -0.0050 1.0000
POP_1986 0.1942 -0.0306 -0.1358 -0.0327 0.9984
POP_1990 0.1727 -0.0513 -0.1070 -0.0534 0.9966
POP_2020 -0.0339 -0.2555 0.0617 -0.2360 0.9531

POP_1986 POP_1990 POP_2020

POP_1986 1.0000
POP_1990 0.9992 1.0000
POP_2020 0.9605 0.9673 1.0000


Latent Roots (Eigenvalues)

1 2 3 4 5

6.3950 4.0165 1.6557 0.4327 0.2390

6 7 8 9 10

0.0966 0.0812 0.0403 0.0251 0.0110

11 12 13

0.0054 0.0012 0.0002

Empirical upper bound for the first Eigenvalue = 7.4817.

Chi-Square Test that all Eigenvalues are Equal, N = 49
CSQ = 1542.2903 P = 0.0000 df = 78.00

Chi-Square Test that the Last 10 Eigenvalues Are Equal
CSQ = 636.4350 P = 0.0000 df = 59.89
Component loadings

1 2 3

GDP_CAP 0.9769 -0.0366 -0.0606
GNP_86 0.9703 -0.0846 0.0040
BIRTH_RT -0.9512 0.0136 -0.0774
LITERACY 0.8972 -0.1008 0.3004
EDUC 0.8927 -0.0857 -0.2296
MIL 0.8770 0.1501 -0.2909
URBAN 0.8393 0.1425 0.2300
B_TO_D -0.5166 -0.1225 0.7762
POP_1990 0.0382 0.9972 0.0394
POP_1986 0.0636 0.9966 0.0253
POP_1983 0.0945 0.9940 0.0248
POP_2020 -0.1796 0.9748 0.1002
DEATH_RT -0.4533 0.0820 -0.8662

Variance Explained by Components

1 2 3

6.3950 4.0165 1.6557

Percent of Total Variance Explained

1 2 3

49.1924 30.8964 12.7361
Rotated Pattern Matrix (OBLIMIN, Gamma = 0.0)

1 2 3

GDP_CAP 0.9779 -0.0399 0.0523
GNP_86 0.9714 -0.0816 -0.0146
BIRTH_RT -0.9506 0.0040 0.0843
EDUC 0.8961 -0.1049 0.2194
LITERACY 0.8956 -0.0700 -0.3112
MIL 0.8777 0.1242 0.2924
URBAN 0.8349 0.1658 -0.2285
B_TO_D -0.5224 -0.0501 -0.7787
POP_1990 0.0236 0.9977 0.0095
POP_1986 0.0491 0.9958 0.0234
POP_1983 0.0801 0.9932 0.0235
POP_2020 -0.1945 0.9805 -0.0510
DEATH_RT -0.4459 -0.0011 0.8730

"Variance" Explained by Rotated Components

1 2 3

6.3946 4.0057 1.6669

Percent of Total Variance Explained

1 2 3

49.1895 30.8129 12.8225

Correlations among Oblique Factors or Components

1 2 3

1 1.0000
2 0.0127 1.0000
3 -0.0020 0.0452 1.0000
Factor Loadings Plot
[3-D plot of the loadings of GDP_CAP, GNP_86, BIRTH_RT, EDUC, LITERACY, MIL, URBAN, B_TO_D, POP_1990, POP_1986, POP_1983, POP_2020, and DEATH_RT on the three rotated components]
By default, SYSTAT extracts three factors because three eigenvalues are greater than
1.0. On factor 1, seven or eight variables have high loadings. The eighth, B_TO_D (a
ratio of birth-to-death rate), has a higher loading on factor 3. With the exception of
BIRTH_RT, the other variables are economic measures, so let's identify this as the
economic factor. Clearly, the second factor can be named population, and the third,
less clearly, death rates.
The economic and population factors account for 80% (49.19 + 30.81) of the total
variance, so a plot of the scores for these factors should be useful for characterizing
differences among the countries. The third factor accounts for 13% of the total
variance, a much smaller amount than the other two factors. Notice, too, that only 7%
of the total variance is not accounted for by these three factors.
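The arithmetic behind these statements can be sketched outside SYSTAT. The following minimal Python example (numpy is assumed; the small correlation matrix is hypothetical, not the OURWORLD matrix) shows how the eigenvalue-greater-than-1.0 default and the percent-of-total-variance figures follow from the latent roots of a correlation matrix.

# Sketch (not SYSTAT syntax): the eigenvalue > 1 rule and percent of variance
# explained, computed from a correlation matrix.
import numpy as np

def pca_summary(R):
    """R is a p x p correlation matrix. Returns the latent roots sorted from
    largest to smallest, the number retained by the eigenvalue > 1 rule, and
    the percent of total variance each component explains."""
    eigvals = np.linalg.eigvalsh(R)[::-1]      # latent roots, largest first
    n_retained = int(np.sum(eigvals > 1.0))    # default cutoff used above
    pct = 100.0 * eigvals / eigvals.sum()      # trace of R equals p
    return eigvals, n_retained, pct

# Tiny illustrative 3 x 3 correlation matrix (hypothetical values)
R = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
vals, k, pct = pca_summary(R)
print(vals, k, pct)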
Revisiting the Correlation Matrix
Let's examine the correlation matrix for these variables. In an effort to group the
variables contributing to each factor, we order the variables according to their factor
loadings for the factor on which they load the highest. The input is:
CORR
USE ourworld
LET (gdp_cap, gnp_86, pop_1983, pop_1986, pop_1990,
pop_2020) = L10(@)
LET (mil,educ) = SQR(@)
PEARSON gdp_cap gnp_86 birth_rt educ literacy mil urban ,
pop_1990 pop_1986 pop_1983 pop_2020 b_to_d death_rt

The resulting matrix is:

Pearson correlation matrix

GDP_CAP GNP_86 BIRTH_RT EDUC LITERACY MIL URBAN
GDP_CAP 1.000
GNP_86 0.974 1.000
BIRTH_RT -0.919 -0.879 1.000
EDUC 0.900 0.921 -0.753 1.000
LITERACY 0.834 0.840 -0.930 0.687 1.000
MIL 0.866 0.851 -0.755 0.887 0.642 1.000
URBAN 0.764 0.775 -0.800 0.624 0.800 0.645 1.000
-------------------------------------------------------------------------------
POP_1990 -0.002 -0.045 -0.025 -0.051 -0.053 0.173 0.170
POP_1986 0.025 -0.021 -0.052 -0.031 -0.033 0.194 0.190
POP_1983 0.058 0.009 -0.084 -0.006 -0.005 0.221 0.213
POP_2020 -0.212 -0.248 0.188 -0.255 -0.236 -0.034 0.005
B_TO_D -0.529 -0.441 0.511 -0.525 -0.274 -0.618 -0.307
DEATH_RT -0.401 -0.452 0.511 -0.215 -0.660 -0.148 -0.513
POP_1990 POP_1986 POP_1983 POP_2020 B_TO_D DEATH_RT
POP_1990 1.000
POP_1986 0.999 1.000
POP_1983 0.997 0.998 1.000
POP_2020 0.967 0.960 0.953 1.000
--------------------------------------------------
B_TO_D -0.107 -0.136 -0.153 0.062 1.000
DEATH_RT 0.028 0.029 0.015 0.074 -0.434 1.000
Use an editor to insert the dotted lines.
The top triangle of the matrix shows the correlations of the variables within the
economic factor. BIRTH_RT has strong negative correlations with the other
variables. Correlations of the population variables with the economic variables are
displayed in the four rows below this top portion, and correlations of the death rates
variables with the economic variables are in the next two rows. Correlations within the
population factor are displayed in the top triangle of the bottom panel. The correlation
between the variables in factor 3 (B_TO_D and DEATH_RT) is -0.434 and is smaller
in absolute value than any of the other within-factor correlations.
Factor Scores
Look at the scores just stored in PCASCORE. First, merge the name of each country
and the grouping variable GROUP$ with the scores. The values of GROUP$ identify
each country as Europe, Islamic, or New World. Next, plot factor 2 against factor 1
(labeling points with country names) and factor 3 against factor 1 (labeling points with
the first letter of their group membership). Finally, use SPLOMs to display the scores,
adding 75% confidence ellipses for each subgroup in the plots and normal curves for
the univariate distributions. Repeat the latter using kernel density estimators.
The input is:
MERGE "C:\SYSTAT\PCASCORE.SYD"(FACTOR(1) FACTOR(2) FACTOR(3)),
"C:\SYSTAT\DATA\OURWORLD.SYD"(GROUP$ COUNTRY$)
PLOT FACTOR(2)*FACTOR(1) / XLABEL=Economic ,
YLABEL=Population SYMBOL=4,2,3,
SIZE= 1.250 LABEL=COUNTRY$ CSIZE=1.250
PLOT FACTOR(3)*FACTOR(1) / XLABEL=Economic ,
YLABEL=Death Rate COLOR=2,1,10,
SYMBOL=GROUP$ SIZE= 1.250 ,1.250 ,1.250
SPLOM FACTOR(1) FACTOR(2) FACTOR(3)/ GROUP=GROUP$ OVERLAY,
DENSITY=NORMAL ELL =0.750,
COLOR=2,1,10 SYMBOL=4,2,3,
DASH=1,1,4
SPLOM FACTOR(1) FACTOR(2) FACTOR(3)/ GROUP=GROUP$ OVERLAY,
DENSITY=KERNEL COLOR=2,1,10,
SYMBOL=4,2,3 DASH=1,1,4
The output is:
High scores on the economic factor show countries that are strong economically
(Germany, Canada, Netherlands, Sweden, Switzerland, Denmark, and Norway)
relative to those with low scores (Bangladesh, Ethiopia, Mali, and Gambia). Not
surprisingly, the population factor identifies Barbados as the smallest and Bangladesh,
Pakistan, and Brazil as largest. The questionable third factor (death rate) does help to
separate the New World countries from the others.
In each SPLOM, the dashed lines marking curves, ellipses, and kernel contours
identify New World countries. The kernel contours in the plot of factor 3 against factor
1 identify a pocket of Islamic countries within the New World group.
[Plot of Population factor scores (factor 2) against Economic factor scores (factor 1), with points labeled by country name; plot of Death Rate factor scores (factor 3) against Economic factor scores (factor 1), with points labeled E (Europe), I (Islamic), or N (New World)]
[SPLOMs of FACTOR(1), FACTOR(2), and FACTOR(3) with cases grouped by GROUP$ (Europe, Islamic, New World): one with normal curves and 75% confidence ellipses, the other with kernel density estimates]
Computation
Algorithms
Provisional methods are used for computing covariance or correlation matrices (see
Correlations for references). Components are computed by using a Householder
tridiagonalization and implicit QL iterations. Rotations are computed with a variant of
Kaiser's iterative algorithm, described in Mulaik (1972).
Missing Data
Ordinarily, Factor Analysis and other multivariate procedures delete all cases having
missing values on any variable selected for analysis. This is listwise deletion. For data
with many missing values, you may end up with too few complete cases for analysis.
Select Pairwise deletion if you want covariances or correlations computed separately
for each pair of variables selected for analysis. Pairwise deletion takes more time than
the standard listwise deletion because all possible pairs of variances and covariances
are computed. The same option is offered for Correlations, should you decide to create
a symmetric matrix for use in factor analysis that way. Also notice that Correlation
provides an EM algorithm for estimating correlation or covariance matrices when data
are missing.
Be careful. When you use pairwise deletion, you can end up with negative
eigenvalues for principal components or be unable to compute common factors at all.
With either method, it is desirable that the pattern of missing data be random.
Otherwise, the factor structure you compute will be influenced systematically by the
pattern of how values are missing.
References
Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.:
Lifetime Learning Publications.
Clarkson, D. B. and Jennrich, R. I. (1988). Quartic rotation criteria and algorithms.
Psychometrika, 53, 251-259.
Dixon, W. J. et al. (1985). BMDP statistical software manual. Berkeley: University of
California Press.
Gnanadesikan, R. (1977). Methods for statistical data analysis of multivariate
observations. New York: John Wiley & Sons, Inc.
Harman, H. H. (1976). Modern factor analysis, 3rd ed. Chicago: University of Chicago
Press.
Jackson, J. E. (1991). A user's guide to principal components. New York: John Wiley &
Sons, Inc.
Jennrich, R. I. and Robinson, S. M. (1969). A Newton-Raphson algorithm for maximum
likelihood factor analysis. Psychometrika, 34, 111-123.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate analysis. London:
Academic Press.
Morrison, D. F. (1976). Multivariate statistical methods, 2nd ed. New York: McGraw-Hill.
Mulaik, S. A. (1972). The foundations of factor analysis. New York: McGraw-Hill.
Rozeboom, W. W. (1982). The determinacy of common factors in large item domains.
Psychometrika, 47, 281-295.
Steiger, J. H. (1979). Factor indeterminacy in the 1930s and 1970s: Some interesting
parallels. Psychometrika, 44, 157-167.


Chapter 13
Linear Models
Each chapter in this manual normally has its own statistical background section. In
this part, however, Regression, ANOVA, and General Linear Models are grouped
together. There are two reasons for doing this. First, while some introductory
textbooks treat regression and analysis of variance as distinct, statisticians know that
they are based on the same underlying mathematical model. When you study what
these procedures do, therefore, it is helpful to understand that model and learn the
common terminology underlying each method. Second, although SYSTAT has three
commands (REGRESS, ANOVA, and GLM) and menu settings, it is a not-so-well-
guarded secret that these all lead to the same program, originally called MGLH (for
Multivariate General Linear Hypothesis). Having them organized this way means that
SYSTAT can use tools designed for one approach (for example, dummy variables in
ANOVA) in another (such as computing within-group correlations in multivariate
regression). This synergy is not usually available in packages that treat these models
independently.
Simple Linear Models
Linear models are models based on lines. More generally, they are based on linear
surfaces, such as lines, planes, and hyperplanes. Linear models are widely applied
because lines and planes often appear to describe well the relations among variables
measured in the real world. We will begin by examining the equation for a straight
line, and then move to more complex linear models.
Equation for a Line
A linear model looks like this:

y = a + bx

This is the equation for a straight line that you learned in school. The quantities in this
equation are:

y   a dependent variable
x   an independent variable

Variables are quantities that can vary (have different numerical values) in the same
equation. The remaining quantities are called parameters. A parameter is a quantity
that is constant in a particular equation, but that can be varied to produce other
equations in the same general family. The parameters are:

a   The value of y when x is 0. This is sometimes called a y-intercept (where a line intersects the y axis in a graph when x is 0).
b   The slope of the line, or the number of units y changes when x changes by one unit.

Let's look at an example. Here are some data showing the yearly earnings a partner
should theoretically get in a certain large law firm, based on annual personal billings
over quota (both in thousands of dollars):

EARNINGS BILLINGS
60 20
70 40
80 60
90 80
100 100
120 140
140 180
150 200
175 250
190 280
We can plot these data with EARNINGS on the vertical axis (dependent variable) and
BILLINGS on the horizontal (independent variable). Notice in the following figure that
all the points lie on a straight line.
What is the equation for this line? Look at the vertical axis value on the sloped line
where the independent variable has a value of 0. Its value is 50. A lawyer is paid
$50,000 even when billing nothing. Thus, a is 50 in our equation. What is b? Notice
that the line rises by $10,000 when billings change by $20,000. The line rises half as
fast as it runs. You can also look at the data and see that the earnings change by $1 as
billing changes by $2. Thus, b is 0.5, or a half, in our equation.
Why bother with all these calculations? We could use the table to determine a
lawyer's compensation, but the formula and the line graph allow us to determine wages
not found in the table. For example, we now know that $30,000 in billings would yield
earnings of $65,000:

EARNINGS = 50,000 + 0.5 × 30,000 = 65,000

When we do this, however, we must be sure that we can use the same equation on these
new values. We must be careful when interpolating, or estimating, wages for billings
between the ones we have been given. Does it make sense to compute earnings for
$25,000 in billings, for example? It probably does. Similarly, we must be careful when
extrapolating, or estimating from units outside the domain of values we have been
given. What about negative billings, for example? Would we want to pay an
embezzler? Be careful. Equations and graphs usually are meaningful only within or
close to the range of y values and domain of x values in the data.
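As a small illustration (ordinary Python, not SYSTAT syntax), the line can be written as a function and used for the interpolation just described; the range check is one way to honor the warning about extrapolating far beyond the data.

# Minimal sketch of the deterministic line EARNINGS = 50 + 0.5 * BILLINGS
# (both in thousands of dollars, as in the table above).
def predicted_earnings(billings):
    if not (0 <= billings <= 280):
        # outside the domain covered by the table and graph; extrapolation is risky
        raise ValueError("billings outside the range covered by the data")
    return 50 + 0.5 * billings

print(predicted_earnings(30))   # 65.0, that is, $65,000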
Regression
Data are seldom this clean unless we design them to be that way. Law firms typically
fine tune their partners' earnings according to many factors. Here are the real billings
and earnings for our law firm (these lawyers predate Reagan, Bush, Clinton, and
Gates):
EARNINGS BILLINGS
86 20
67 40
95 60
105 80
86 100
82 140
140 180
145 200
144 250
184 280

Our techniques for computing a linear equation won't work with these data. Look at
the following graph. There is no way to draw a straight line through all the data.
Given the irregularities in our data, the line drawn in the figure is a compromise. How
do we find a best fitting line? If we are interested in predicting earnings from the billing
data values rather well, a reasonable method would be to place a line through the points
so that the vertical deviations between the points and the line (errors in predicting
earnings) are as small as possible. In other words, these deviations (absolute
discrepancies, or residuals) should be small, on average, for a good-fitting line.
The procedure of fitting a line or curve to data such that residuals on the dependent
variable are minimized in some way is called regression. Because we are minimizing
vertical deviations, the regression line often appears to be more horizontal than we
might place it by eye, especially when the points are fairly scattered. It regresses
toward the mean value of y across all the values of x, namely, a horizontal line through
the middle of all the points. The regression line is not intended to pass through as many
points as possible. It is for predicting the dependent variable as accurately as possible,
given each value of the independent variable.
Least Squares
There are several ways to draw the line so that, on average, the deviations are small.
We could minimize the mean, the median, or some other measure of the typical
behavior of the absolute values of the residuals. Or we can minimize the sum (or mean)
of the squared residuals, which yields almost the same line in most cases. Using
squared instead of absolute residuals gives more influence to points whose y value is
farther from the average of all y values. This is not always desirable, but it makes the
mathematics simpler. This method is called ordinary least squares.
By specifying EARNINGS as the dependent variable and BILLINGS as the
independent variable in a MODEL statement, we can compute the ordinary least-
squares regression y-intercept as $62,800 and the slope as 0.375. These values do not
predict any single lawyers earnings exactly. They describe the whole firm well, in the
sense that, on the average, the line predicts a given earnings value fairly closely from
a given billings value.
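The intercept and slope quoted above can be verified with a short calculation outside SYSTAT. This sketch assumes Python with numpy and uses the billings and earnings values from the table.

# Sketch (not SYSTAT): ordinary least squares for the noisy lawyer data,
# reproducing the intercept (about 62.8) and slope (about 0.375).
import numpy as np

billings = np.array([20, 40, 60, 80, 100, 140, 180, 200, 250, 280], dtype=float)
earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], dtype=float)

sxy = np.sum((billings - billings.mean()) * (earnings - earnings.mean()))
sxx = np.sum((billings - billings.mean()) ** 2)
slope = sxy / sxx                                        # 0.375
intercept = earnings.mean() - slope * billings.mean()    # 62.838
print(round(intercept, 3), round(slope, 3))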
Estimation and Inference
We often want to do more with such data than draw a line on a picture. In order to
generalize, formulate a policy, or test a hypothesis, we need to make an inference.
Making an inference implies that we think a model describes a more general
population from which our data have been randomly sampled. In the present example,
this population is all possible lawyers who might work for this firm. To make an
inference about compensation, we need to construct a linear model for our population
that includes a parameter for random error. In addition, we need to change our notation
to avoid confusion later. We are going to use Greek to denote parameters and italic
Roman letters for variables. The error parameter is usually called ε:

y = α + βx + ε

Notice that ε is a random variable. It varies like any other variable (for example, x),
but it varies randomly, like the tossing of a coin. Since ε is random, our model forces y
to be random as well because adding fixed values (α and βx) to a random variable
produces another random variable. In ordinary language, we are saying with our model
that earnings are only partly predictable from billings. They vary slightly according to
many other factors, which we assume are random.
We do not know all of the factors governing the firm's compensation decisions, but
we assume:
- All the salaries are derived from the same linear model.
- The error in predicting a particular salary from billings using the model is independent of (not in any way predictable from) the error in predicting other salaries.
- The errors in predicting all the salaries come from the same random distribution.
Our model for predicting y in our population contains parameters, but unlike our perfect
straight line example, we cannot compute these parameters directly from the data. The
data we have are only a small sample from a much larger population, so we can only
estimate the parameter values using some statistical method on our sample data. Those
of you who have heard this story before may not be surprised that ordinary least
squares is one reasonable method for estimating parameters when our three
assumptions are appropriate. Without going into all the details, we can be reasonably
assured that if our population assumptions are true and if we randomly sample some
cases (that is, each case has an equal chance of being picked) from the population, the
least-squares estimates of α and β will, on average, be close to their values in the
population.
So far, we have done what seems like a sleight of hand. We delved into some
abstruse language and came up with the same least-squares values for the slope and
intercept as before. There is something new, however. We have now added conditions
that define our least-squares values as sample estimates of population values. We now
regard our sample data as one instance of many possible samples. Our compensation
model is like Plato's cave metaphor; we think it typifies how this law firm makes
compensation decisions about any lawyer, not just the ones we sampled. Before, we
were computing descriptive statistics about a sample. Now, we are computing
inferential statistics about a population.
Standard Errors
There are several statistics relevant to the estimation of α and β. Perhaps most
important is a measure of how variable we could expect our estimates to be if we
continued to sample data from our population and used least squares to get our
estimates. A statistic calculated by SYSTAT shows what we could expect this variation
to be. It is called, appropriately, the standard error of estimate, or Std Error in the
output. The standard error of the y-intercept, or regression constant, is in the first row
of the coefficients: 10.440. The standard error of the billing coefficient or slope is
0.065. Look for these numbers in the following output:

Dep Var: EARNINGS N: 10 Multiple R: 0.897 Squared multiple R: 0.804
Adjusted squared multiple R: 0.779 Standard error of estimate: 17.626

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)
CONSTANT 62.838 10.440 0.0 . 6.019 0.000
BILLINGS 0.375 0.065 0.897 1.000 5.728 0.000

Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 10191.109 1 10191.109 32.805 0.000
Residual 2485.291 8 310.661
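The standard errors (and the t statistics and p values built from them) can be reproduced outside SYSTAT. This sketch assumes Python with numpy and scipy and reuses the billings and earnings arrays from the earlier sketch.

# Sketch: coefficient standard errors, t, and two-tailed p for the simple regression.
import numpy as np
from scipy import stats

billings = np.array([20, 40, 60, 80, 100, 140, 180, 200, 250, 280], dtype=float)
earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], dtype=float)
n = len(billings)

sxx = np.sum((billings - billings.mean()) ** 2)
slope = np.sum((billings - billings.mean()) * (earnings - earnings.mean())) / sxx
intercept = earnings.mean() - slope * billings.mean()

resid = earnings - (intercept + slope * billings)
mse = np.sum(resid ** 2) / (n - 2)                                  # 310.661
se_slope = np.sqrt(mse / sxx)                                       # 0.065
se_intercept = np.sqrt(mse * (1 / n + billings.mean() ** 2 / sxx))  # 10.440

t_slope = slope / se_slope                                          # 5.728
p_two_tail = 2 * stats.t.sf(abs(t_slope), n - 2)                    # well below 0.001
print(round(se_intercept, 3), round(se_slope, 3), round(t_slope, 3), p_two_tail)
# With a single predictor, the F ratio in the ANOVA table equals t_slope squared.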
Hypothesis Testing
From these standard errors, we can construct hypothesis tests on these coefficients.
Suppose a skeptic approached us and said, "Your estimates look as if something is
going on here, but in this firm, salaries have nothing to do with billings. You just
happened to pick a sample that gives the impression that billings matter. It was the luck
of the draw that provided you with such a misleading picture. In reality, β is 0 in the
population because billings play no role in determining earnings."
We can reply, "If salaries had nothing to do with billings but are really just a mean
value plus random error for any billing level, then would it be likely for us to find a
coefficient estimate for β at least this different from 0 in a sample of 10 lawyers?"
To represent these alternatives as a bet between us and the skeptic, we must agree
on some critical level for deciding who will win the bet. If the likelihood of a sample
result at least this extreme occurring by chance is less than or equal to this critical level
(say, five times out of a hundred), we win; otherwise, the skeptic wins.
This logic might seem odd at first because, in almost every case, our skeptic's null
hypothesis would appear ridiculous, and our alternative hypothesis (that the skeptic is
wrong) seems plausible. Two scenarios are relevant here, however. The first is the
lawyer's. We are trying to make a case here. The only way we will prevail is if we
convince our skeptical jury beyond a reasonable doubt. In statistical practice, that
reasonable doubt level is relatively liberal: fewer than five times in a hundred. The
second scenario is the scientist's. We are going to stake our reputation on our model.
If someone sampled new data and failed to find nonzero coefficients, much less
coefficients similar to ours, few would pay attention to us in the future.
To compute probabilities, we must count all possibilities or refer to a mathematical
probability distribution that approximates these possibilities well. The most widely
used approximation is the normal curve, which we reviewed briefly in Chapter 1. For
large samples, the regression coefficients will tend to be normally distributed under the
assumptions we made above. To allow for smaller samples, however, we will add the
following condition to our list of assumptions:
n The errors in predicting the salaries come from a normal distribution.
If we estimate the standard errors of the regression coefficients from the data instead
of knowing them in advance, then we should use the t distribution instead of the
normal. The two-tail value for the probability represents the area under the theoretical
t probability curve corresponding to coefficient estimates whose absolute values are
more extreme than the ones we obtained. For both parameters in the model of lawyers
earnings, these values (given as P(2 tail)) are less than 0.001, leading us to reject our
null hypothesis at well below the 0.05 level.
At the bottom of our output, we get an analysis of variance table that tests the
goodness of fit of our entire model. The null hypothesis corresponding to the F ratio
(32.805) and its associated p value is that the billing variable coefficient is equal to 0.
This test overwhelmingly rejects the null hypothesis that both α and β are 0.
Multiple Correlation
In the same output is a statistic called the squared multiple correlation. This is the
proportion of the total variation in the dependent variable (EARNINGS) accounted for
by the linear prediction using BILLINGS. The value here (0.804) tells us that
approximately 80% of the variation in earnings can be accounted for by a linear
prediction from billings. The rest of the variation, as far as this model is concerned, is
random error. The square root of this statistic is called, not surprisingly, the multiple
correlation. The adjusted squared multiple correlation (0.779) is what we would
expect the squared multiple correlation to be if we used the model we just estimated on
a new sample of 10 lawyers in the firm. It is smaller than the squared multiple

correlation because the coefficients were optimized for this sample rather than for the
new one.
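The following sketch (plain Python; the sums of squares are copied from the ANOVA table above) shows where the squared multiple correlation and its adjusted version come from.

# Sketch: R-squared and adjusted R-squared from the ANOVA sums of squares
# (one predictor plus a constant, n = 10).
ss_regression = 10191.109
ss_residual = 2485.291
n, p = 10, 1                    # p = number of predictors, excluding the constant

r2 = ss_regression / (ss_regression + ss_residual)     # 0.804
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)          # 0.779
print(round(r2, 3), round(adj_r2, 3))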
Regression Diagnostics
We do not need to understand the mathematics of how a line is fitted in order to use
regression. You can fit a line to any x-y data by the method of least squares. The
computer doesnt care where the numbers come from. To have a model and estimates
that mean something, however, you should be sure the assumptions are reasonable and
that the sample data appear to be sampled from a population that meets the
assumptions.
The sample analogues of the errors in the population model are the residuals: the
differences between the observed and predicted values of the dependent variable.
There are many diagnostics you can perform on the residuals. Here are the most
important ones:
The errors are normally distributed. Draw a normal probability plot (PPLOT) of the
residuals.
The residuals should fall approximately on a diagonal straight line in this plot. When
the sample size is small, as in our law example, the line may be quite jagged. It is
difficult to tell by any method whether a small sample is from a normal population. You
can also plot a histogram or stem-and-leaf diagram of the residuals to see if they are
lumpy in the middle with thin, symmetric tails.
[Normal probability plot of the residuals: Expected Value for Normal Distribution plotted against RESIDUAL]
The errors have constant variance. Plot the residuals against the estimated values. The
following plot shows studentized residuals (STUDENT) against estimated values
(ESTIMATE). Studentized residuals are the true external kind discussed in Velleman
and Welsch (1981). Use these statistics to identify outliers in the dependent variable
space. Under normal regression assumptions, they have a t distribution with N - p - 1
degrees of freedom, where N is the total sample size and p is the number
of predictors (including the constant). Large values (greater than 2 or 3 in absolute
magnitude) indicate possible problems.

[Plot of studentized residuals (STUDENT) against estimated values (ESTIMATE)]
Our residuals should be arranged in a horizontal band within two or three units around
0 in this plot. Again, since there are so few observations, it is difficult to tell whether
they violate this assumption in this case. There is only one particularly large residual,
and it is toward the middle of the values. This lawyer billed $140,000 and is earning
only $80,000. He or she might have a gripe about supporting a higher share of the
firms overhead.
The errors are independent. Several plots can be done. Look at the plot of residuals
against estimated values above. Make sure that the residuals are randomly scattered
above and below the 0 horizontal and that they do not track in a snaky way across the
plot. If they look as if they were shot at the plot by a horizontally moving machine gun,
then they are probably not independent of each other. You may also want to plot
residuals against other variables, such as time, orientation, or other ways that might
influence the variability of your dependent measure. ACF PLOT in SERIES measures
whether the residuals are serially correlated. Here is an autocorrelation plot:
[Autocorrelation plot of the residuals, with confidence bands]
All the bars should be within the confidence bands if each residual is not predictable
from the one preceding it, and the one preceding that, and the one preceding that, and
so on.
All the members of the population are described by the same linear model. Plot Cook's
distance (COOK) against the estimated values.

[Plot of Cook's distance (COOK) against estimated values (ESTIMATE)]

Cook's distance measures the influence of each sample observation on the coefficient
estimates. Observations that are far from the average of all the independent variable
values or that have large residuals tend to have a large Cook's distance value (say,
greater than 2). Cook's D actually follows closely an F distribution, so aberrant values
depend on the sample size. As a rule of thumb, under the normal regression
assumptions, COOK can be compared to an F distribution with p and N - p degrees of
freedom. We don't want to find a large Cook's D value for an observation because it
would mean that the coefficient estimates would change substantially if we deleted that
observation. While none of the COOK values are extremely large in our example, could
it be that the largest one in the upper right corner is the founding partner in the firm?
Despite large billings, this partner is earning more than the model predicts.
Another diagnostic statistic useful for assessing the model fit is leverage, discussed
in Belsley, Kuh, and Welsch (1980) and Velleman and Welsch (1981). Leverage helps
to identify outliers in the independent variable space. Leverage has an average value
of p/N, where p is the number of estimated parameters (including the constant) and
N is the number of cases. What is a high value of leverage? In practice, it is useful to
examine the values in a stem-and-leaf plot and identify those that stand apart from the
rest of the sample. However, various rules of thumb have been suggested. For example,
values of leverage less than 0.2 appear to be safe; between 0.2 and 0.5, risky; and above
0.5, to be avoided. Another says that if p > 6 and (N - p) > 12, use 3p/N as a cutoff.
SYSTAT uses an F approximation to determine this value for warnings (Belsley, Kuh,
and Welsch, 1980).
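The quantities discussed in this section (leverage, studentized residuals, and Cook's distance) can be computed directly from the textbook formulas. The sketch below uses Python with numpy and the lawyer data; it is an illustration of what the saved statistics mean, not SYSTAT output.

# Sketch: leverage, externally studentized residuals, and Cook's D for the
# simple regression of EARNINGS on BILLINGS. p counts the constant, as in the text.
import numpy as np

billings = np.array([20, 40, 60, 80, 100, 140, 180, 200, 250, 280], dtype=float)
earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], dtype=float)
n = len(billings)

X = np.column_stack([np.ones(n), billings])          # design matrix with constant
p = X.shape[1]

beta, *_ = np.linalg.lstsq(X, earnings, rcond=None)
resid = earnings - X @ beta
mse = resid @ resid / (n - p)

H = X @ np.linalg.inv(X.T @ X) @ X.T                 # hat matrix
h = np.diag(H)                                       # leverage; averages p/N

r = resid / np.sqrt(mse * (1 - h))                   # internally studentized
student = r * np.sqrt((n - p - 1) / (n - p - r**2))  # externally studentized (STUDENT)
cook = (r**2 / p) * (h / (1 - h))                    # Cook's distance (COOK)

print(np.round(h, 3))
print(np.round(student, 2))
print(np.round(cook, 2))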
In conclusion, keep in mind that all our diagnostic tests are themselves a form of
inference. We can assess theoretical errors only through the dark mirror of our
observed residuals. Despite this caveat, testing assumptions graphically is critically
important. You should never publish regression results until you have examined these
plots.
Multiple Regression
A multiple linear model has more than one independent variable; that is:

y = a + bx + cz

This is the equation for a plane in three-dimensional space. The parameter a is still an
intercept term. It is the value of y when x and z are 0. The parameters b and c are still
slopes. One gives the slope of the plane along the x dimension; the other, along the
z dimension.
The statistical model has the same form:

y = α + βx + γz + ε

Before we run out of letters for independent variables, let's switch to a more frequently
used notation:

y = β0 + β1x1 + β2x2 + ε

Notice that we are still using Greek letters for unobservables and Roman letters for
observables.
Now, let's look at our law firm data again. We have learned that there is another
variable that appears to determine earnings: the number of hours billed per year by
each lawyer. Here is an expanded listing of the data:

EARNINGS BILLINGS HOURS
86 20 1771
67 40 1556
95 60 1749
105 80 1754
86 100 1594
82 140 1400
140 180 1780
145 200 1737
144 250 1645
184 280 1863

For our model, β1 is the coefficient for BILLINGS, and β2 is the coefficient for
HOURS. Let's look first at its graphical representation. The following figure shows the
plane fit by least squares to the points representing each lawyer. Notice how the plane
slopes upward on both variables. BILLINGS and HOURS both contribute positively to
EARNINGS in our sample.
Fitting this model involves no more work than fitting the simple regression model. We
specify one dependent and two independent variables and estimate the model as before.
Here is the result:
Dep Var: EARNINGS N: 10 Multiple R: .998 Squared Multiple R: .996
Adjusted Squared Multiple R: .995 Standard Error of Estimate: 2.678

Variable Coefficient Std Error Std Coef Tolerance t P(2 tail)
CONSTANT -139.925 11.116 0.000 . -12.588 0.000
BILLINGS 0.333 0.010 0.797 0.951 32.690 0.000
HOURS 0.124 0.007 0.449 0.951 18.429 0.000

Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 12626.210 2 6313.105 880.493 0.000
Residual 50.190 7 7.170

This time, we have one more row in our regression table for HOURS. Notice that its
coefficient (0.124) is smaller than that for BILLINGS (0.333). This is due partly to the
different scales of the variables. HOURS are measured in larger numbers than
BILLINGS. If we wish to compare the influence of each independent variable
independent of scales, we should look at the standardized coefficients. Here, we still
see that BILLINGS (0.797) play a greater role in predicting EARNINGS than do
HOURS (0.449). Notice also that both coefficients are highly significant and that our
overall model is highly significant, as shown in the analysis of variance table.
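As a cross-check outside SYSTAT, the following Python sketch (numpy assumed) refits the two-predictor model and recovers both the raw coefficients and the standardized coefficients discussed above.

# Sketch: multiple regression of EARNINGS on BILLINGS and HOURS.
import numpy as np

earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], dtype=float)
billings = np.array([20, 40, 60, 80, 100, 140, 180, 200, 250, 280], dtype=float)
hours = np.array([1771, 1556, 1749, 1754, 1594, 1400, 1780, 1737, 1645, 1863], dtype=float)

X = np.column_stack([np.ones(len(earnings)), billings, hours])
b, *_ = np.linalg.lstsq(X, earnings, rcond=None)
print(np.round(b, 3))                 # about [-139.9, 0.333, 0.124]

# Standardized coefficients rescale each slope by sd(x) / sd(y).
sd = lambda v: v.std(ddof=1)
std_coef = [b[1] * sd(billings) / sd(earnings), b[2] * sd(hours) / sd(earnings)]
print(np.round(std_coef, 3))          # about [0.797, 0.449]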
Variable Selection
In applications, you may not know which subset of predictor variables in a larger set
constitutes a good model. Strategies for identifying a good subset are many and
varied: forward selection, backward elimination, stepwise (either a forward or
backward type), and all subsets. Forward selection begins with the best predictor,
adds the next best and continues entering variables to improve the fit. Backward
selection begins with all candidate predictors in an equation and removes the least
useful one at a time as long as the fit is not substantially worsened. Stepwise begins
as either forward or backward, but allows poor predictors to be removed from the
candidate model or good predictors to re-enter the model at any step. Finally, all
subsets methods compute all possible subsets of predictors for each model of a given
size (number of predictors) and choose the best one.
Bias and variance tradeoff. Submodel selection is a tradeoff between bias and variance.
By decreasing the number of parameters in the model, its predictive capability is
enhanced. This is because the variance of the parameter estimates decreases. On the
other side, bias may increase because the true model may have a higher dimension.
So we'd like to balance smaller variance against increased bias. There are two aspects
to variable selection: selecting the dimensionality of the submodel (how many
variables to include) and evaluating the model selected. After you determine the
dimension, there may be several alternative subsets that perform equally well. Then,
knowledge of the subject matter, how accurately individual variables are measured, and
what a variable communicates may guide selection of the model to report.
A strategy. If you are in an exploratory phase of research, you might try this version of
backwards stepping. First, fit a model using all candidate predictors. Then identify the
least useful variable, remove it from the model list, and fit a smaller model. Evaluate
your results and select another variable to remove. Continue removing variables. For a
given size model, you may want to remove alternative variables (that is, first remove
variable A, evaluate results, replace A and remove B, etc.).
Entry and removal criteria. Decisions about which variable to enter or remove should be
based on statistics and diagnostics in the output, especially graphical displays of these
values, and your knowledge of the problem at hand.
You can specify your own alpha-to-enter and alpha-to-remove values (do not make
alpha-to-remove less than alpha-to-enter), or you can cycle variables in and out of the
equation (stepping automatically stops if this happens). The default values for these
options are Enter = 0.15 and Remove = 0.15. These values are appropriate for predictor
variables that are relatively independent. If your predictor variables are highly
correlated, you should consider lowering the Enter and Remove values well below
0.05.
When there are high correlations among the independent variables, the estimates of
the regression coefficients can become unstable. Tolerance is a measure of this
condition. It is 1 - R²; that is, one minus the squared multiple correlation between a
predictor and the other predictors included in the model. (Note that the dependent
variable is not used.) By setting a minimum tolerance value, variables highly correlated
with others already in the model are not allowed to enter.
As a rough guideline, consider models that include only variables that have absolute
"t" values well above 2.0 and tolerance values greater than 0.1. (We use quotation
marks here because t and other statistics do not have their usual distributions when you
are selecting subset models.)
Evaluation criteria. There is no one test to identify the dimensionality of the best
submodel. Recent research by Leo Breiman emphasizes the usefulness of cross-
validation techniques involving 80% random subsamples. Sample 80% of your file, fit
a model, use the resulting coefficients on the remaining 20% to obtain predicted values,
and then compute R² for this smaller sample. In over-fitting situations, the discrepancy
between the R² for the 80% sample and the 20% sample can be dramatic.
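A minimal sketch of this 80%/20% idea, written in Python with numpy, is shown below. The arrays X (cases by candidate predictors) and y are placeholders for your own data; the function illustrates the procedure and is not a SYSTAT command.

# Sketch: fit on a random 80% subsample, score the held-out 20%, compare R-squared.
import numpy as np

def r_squared(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def cross_validate(X, y, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(0.8 * len(y))
    train, test = idx[:cut], idx[cut:]

    Xt = np.column_stack([np.ones(len(train)), X[train]])
    b, *_ = np.linalg.lstsq(Xt, y[train], rcond=None)

    fit_train = Xt @ b
    fit_test = np.column_stack([np.ones(len(test)), X[test]]) @ b
    return r_squared(y[train], fit_train), r_squared(y[test], fit_test)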
A warning. If you do not have extensive knowledge of your variables and expect this
strategy to help you to find a true model, you can get into a lot of trouble. Automatic
stepwise regression programs cannot do your work for you. You must be able to
examine graphics and make intelligent choices based on theory and prior knowledge;
otherwise, you will be arriving at nonsense.
Moreover, if you are thinking of testing hypotheses after automatically fitting a
subset model, don't bother. Stepwise regression programs are the most notorious
source of pseudo p values in the field of automated data analysis. Statisticians seem
to be the only ones who know these are not real p values. The automatic stepwise
option is provided to select a subset model for prediction purposes. It should never be
used without cross-validation.
If you still want some sort of confidence estimate on your subset model, you might
look at tables in Wilkinson (1979), Rencher and Pun (1980), and Wilkinson and Dallal
(1982). These tables provide null hypothesis values for selected subsets given the
number of candidate predictors and final subset size. If you don't know this literature
already, you will be surprised at how large multiple correlations from stepwise
regressions on random data can be. For a general summary of these and other
problems, see Hocking (1983). For more specific discussions of variable selection
problems, see the previous references and Flack and Chang (1987), Freedman (1983),
and Lovell (1983). Stepwise regression is probably the most abused computerized
statistical technique ever devised. If you think you need automated stepwise regression
to solve a particular problem, it is almost certain that you do not. Professional
statisticians rarely use automated stepwise regression because it does not necessarily
find the best fitting model, the real model, or alternative plausible models.
Furthermore, the order in which variables enter or leave a stepwise program is usually
of no theoretical significance. You are always better off thinking about why a model
could generate your data and then testing that model.
Using an SSCP, a Covariance, or a
Correlation Matrix as Input
Normally for a regression analysis, you use a cases-by-variables data file. You can,
however, use a covariance or correlation matrix saved (from Correlations) as input. If
you use a matrix as input, specify the sample size that generated the matrix where the
number you type is an integer greater than 2.
You can enter an SSCP, a covariance, or a correlation matrix by typing it into the
Data Editor Worksheet, by using BASIC, or by saving it in a SYSTAT file. Be sure to
include the dependent as well as independent variables.
SYSTAT needs the sample size to calculate degrees of freedom, so you need to
enter the original sample size. Linear Regression determines the type of matrix (SSCP,
covariance, etc.) and adjusts appropriately. With a correlation matrix, the raw and
standardized coefficients are the same. Therefore, the Include constant option is
disabled when using SSCP, covariance, or correlation matrices. Because these
matrices are centered, the constant term has already been removed.
The following two analyses of the same data file produce identical results (except
that you don't get residuals with the second). In the first, we use the usual cases-by-
variables data file. In the second, we use the CORR command to save a covariance
matrix and then analyze that matrix file with the REGRESS command.
Here are the usual instructions for a regression analysis:
REGRESS
USE filename
MODEL Y = X(1) + X(2) + X(3)
ESTIMATE
Here, we compute a covariance matrix and use it in the regression analysis:

CORR
USE filename1
SAVE filename2
COVARIANCE X(1) X(2) X(3) Y
REGRESS
USE filename2
MODEL Y = X(1) + X(2) + X(3) / N=40
ESTIMATE
The triangular matrix input facility is useful for meta-analysis of published data and
missing-value computations. There are a few warnings, however. First, if you input
correlation matrices from textbooks or articles, you may not get the same regression
coefficients as those printed in the source. Because of round-off error, printed and raw
data can lead to different results. Second, if you use pairwise deletion with CORR, the
degrees of freedom for hypotheses will not be appropriate. You may not even be able
to estimate the regression coefficients because of singularities.
In general, when an incomplete data procedure is used to estimate the correlation
matrix, the estimate of regression coefficients and hypothesis tests produced from it are
optimistic. You can correct for this by specifying a sample size smaller than the
number of actual observations (preferably, set it equal to the smallest number of cases
used for any pair of variables), but this is a crude guess that you could refine only by
doing Monte Carlo simulations. There is no simple solution. Beware, especially, of
multivariate regressions (or MANOVA, etc.) with missing data on the dependent
variables. You can usually compute coefficients, but results from hypothesis tests are
particularly suspect.
Analysis of Variance
Often, you will want to examine the influence of categorical variables (such as gender,
species, country, and experimental group) on continuous variables. The model
equations for this case, called analysis of variance, are equivalent to those used in
linear regression. However, in the latter, you have to figure out a numerical coding for
categories so that you can use the codes in an equation as the independent variable(s).
Effects Coding
The following data file, EARNBILL, shows the breakdown of lawyers sampled by sex.
Because SEX is a categorical variable (numerical values assigned to MALE or
FEMALE are arbitrary), a code variable with the values 1 or -1 is used. It doesn't
matter which group is assigned 1, as long as the other is assigned -1.

EARNINGS SEX CODE
86 female -1
67 female -1
95 female -1
105 female -1
86 female -1
82 male 1
140 male 1
145 male 1
144 male 1
184 male 1

There is nothing wrong with plotting earnings against the code variable, as long as you
realize that the slope of the line is arbitrary because it depends on how you assign your
codes. By changing the values of the code variable, you can change the slope. Here is
a plot with the least-squares regression line superimposed.
Let's do a regression on the data using these codes. Here are the coefficients as
computed by ANOVA:

Variable Coefficient
Constant 113.400
Code 25.600
Notice that Constant (113.4) is the mean of all the data. It is also the regression
intercept because the codes are symmetrical about 0. The coefficient for Code (25.6)
is the slope of the line. It is also one half the difference between the means of the
groups. This is because the codes are exactly two units apart. This slope is often called
an effect in the analysis of variance because it represents the amount that the
categorical variable SEX affects BILLINGS. In other words, the effect of SEX can be
represented by the amount that the mean for males differs from the overall mean.
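The same numbers can be reproduced outside SYSTAT with a few lines of Python (numpy assumed); the code variable below uses -1 for females and 1 for males, matching the table above.

# Sketch: effects-coded regression for the EARNBILL data, reproducing the
# constant (grand mean, 113.4) and the coefficient for the code (effect, 25.6).
import numpy as np

earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], dtype=float)
code = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1], dtype=float)

X = np.column_stack([np.ones(len(code)), code])
b, *_ = np.linalg.lstsq(X, earnings, rcond=None)
print(np.round(b, 1))          # [113.4, 25.6]

# Equivalently, with equal group sizes: the constant is the average of the two
# group means, and the effect is half the difference between them.
female_mean, male_mean = earnings[:5].mean(), earnings[5:].mean()   # 87.8, 139.0
print((female_mean + male_mean) / 2, (male_mean - female_mean) / 2)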
Means Coding
The effects coding model is useful because the parameters (constant and slope) can be
interpreted as an overall level and as the effect(s) of treatment, respectively. Another
model, however, that yields the means of the groups directly is called the means model.
Here are the codes for this model:

EARNINGS SEX CODE1 CODE2
86 female 1 0
67 female 1 0
95 female 1 0
105 female 1 0
86 female 1 0
82 male 0 1
140 male 0 1
145 male 0 1
144 male 0 1
184 male 0 1

Notice that CODE1 is nonzero for all females, and CODE2 is nonzero for all males. To
estimate a regression model with these codes, you must leave out the constant. With
only two groups, only two distinct pieces of information are needed to distinguish
them. Here are the coefficients for these codes in a model without a constant:

Variable Coefficient
Code1 87.800
Code2 139.000

Notice that the coefficients are now the means of the groups.
Models
Let's look at the algebraic models for each of these codings. Recall that the regression
model looks like this:

y = β0 + β1x1 + ε

For the effects model, it is convenient to modify this notation as follows:

yj = μ + αj + ε

When x (the code variable) is 1, αj is equivalent to α1; when x is -1, αj is equivalent to
α2. This shorthand will help you later when dealing with models with many categories.
For this model, the parameter μ stands for the grand (overall) mean, and the
parameter α stands for the effect. In this model, our best prediction of the score of a group
member is derived from the grand mean plus or minus the deviation of that group from
this grand mean.
The means model looks like this:

yj = μj + ε

In this model, our best prediction of the score of a group member is the mean of that
group.
Hypotheses
As with regression, we are usually interested in testing hypotheses concerning the
parameters of the model. Here are the hypotheses for the two models:
H0: α1 = α2 = 0 (effects model)
H0: μ1 = μ2 (means model)
The tests of this hypothesis compare variation between the means to variation within
each group, which is mathematically equivalent to testing the significance of
coefficients in the regression model.

Dep Var: EARNINGS N: 10 Multiple R: .719 Squared Multiple R: .517

Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
SEX 6553.600 1 6553.600 8.563 0.019
Error 6122.800 8 765.350

In our example, the F ratio in the analysis of variance table tells you that the coefficient
for SEX is significant at p = 0.019, which is less than the conventional 0.05 value.
Thus, on the basis of this sample and the validity of our usual regression assumptions,
you can conclude that women earn significantly less than men in this firm.
The nice thing about realizing that ANOVA is specially-coded regression is that the
usual assumptions and diagnostics are appropriate in this context. You can plot
residuals against estimated values, for example, to check for homogeneity of variance.
Multigroup ANOVA
When there are more groups, the coding of categories becomes more complex. For the
effects model, there are one fewer coding variables than number of categories. For two
categories, you need only one coding variable; for three categories, you need two
coding variables:
Category Code
1 1 0
2 0 1
3 -1 -1
For the means model, the extension is straightforward:

Category Code
1 1 0 0
2 0 1 0
3 0 0 1
For multigroup ANOVA, the models have the same form as for the two-group ANOVA
above. The corresponding hypotheses for testing whether there are differences between
means are:
H0: α1 = α2 = α3 = 0 (effects model)
H0: μ1 = μ2 = μ3 (means model)
You do not need to know how to produce coding variables to do ANOVA. SYSTAT
does this for you automatically. All you need is a single variable that contains different
values for each group. SYSTAT translates these values into different codes. It is
important to remember, however, that regression and analysis of variance are not
fundamentally different models. They are both instances of the general linear model.
Factorial ANOVA
It is possible to have more than one categorical variable in ANOVA. When this
happens, you code each categorical variable exactly the same way as you do with
multi-group ANOVA. The coded design variables are then added as a full set of
predictors in the model.
ANOVA factors can interact. For example, a treatment may enhance bar pressing
by male rats, yet suppress bar pressing by female rats. To test for this possibility, you
can add (to your model) variables that are the product of the main effect variables
already coded. This is similar to what you do when you construct polynomial models.
For example, this is a model without an interaction:
This is a model that contains interaction:
y = CONSTANT + treat + sex
y = CONSTANT + treat + sex + treat*sex
358
Chapter 13
If the hypothesis test of the coefficients for the TREAT*SEX term is significant, then
you must qualify your conclusions by referring to the interaction. You might say, It
works one way for males and another for females.
Data Screening and Assumptions
Most analyses have assumptions. If your data do not meet the necessary assumptions,
then the resulting probabilities for the statistics may be suspect. Before an ANOVA,
look for:
- Violations of the equal variance assumption. Your groups should have the same dispersion or spread (their shapes do not differ markedly).
- Symmetry. The mean of each group should fall roughly in the middle of the spread (the within-group distributions are not extremely skewed).
- Independence of the group means and standard deviations (the size of the group means is not related to the size of their standard deviations).
- Gross outliers (no values stand apart from the others in the batch).
Graphical displays are useful for checking assumptions. For analysis of variance, try
dit plots, box-and-whisker displays, or bar charts with standard error bars.
Levene Test
Analysis of variance assumes that the data within cells are independent and normally
distributed with equal variances. This is the ANOVA equivalent of the regression
assumptions for residuals. When the homogeneous variance part of the assumptions is
false, it is sometimes possible to adjust the degrees of freedom to produce
approximately distributed F statistics.
Levene (1960) proposed a test for unequal variances. You can use this test to
determine whether you need an unequal variance F test. Simply fit your model in
ANOVA and save residuals. Then transform the residuals into their absolute values.
Merge these with your original grouping variable(s). Then redo your ANOVA on the
absolute residuals. If it is significant, then you should consider using the separate
variances test.
Before doing all this work, you should do a box plot by groups to see whether the
distributions differ. If you see few differences in the spread of the boxes, Levene's test
is unlikely to be significant.
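The procedure just described is easy to sketch outside SYSTAT. The Python example below (numpy and scipy assumed) runs a one-way ANOVA on the absolute deviations from the group means, which is the mean-centered form of Levene's test; the two-group lawyer data are used only as an illustration.

# Sketch: Levene's test as an ANOVA on absolute residuals from the group means.
import numpy as np
from scipy import stats

def levene_by_hand(*groups):
    abs_resid = [np.abs(np.asarray(g, dtype=float) - np.mean(g)) for g in groups]
    return stats.f_oneway(*abs_resid)

females = [86, 67, 95, 105, 86]
males = [82, 140, 145, 144, 184]
print(levene_by_hand(females, males))
print(stats.levene(females, males, center='mean'))   # same F and p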
Pairwise Mean Comparisons
The results in an ANOVA table serve only to indicate whether means differ
significantly or not. They do not indicate which mean differs from another.
To report which pairs of means differ significantly, you might think of computing a
two-sample t test for each pair; however, do not do this. The probability associated
with the two-sample t test assumes that only one test is performed. When several means
are tested pairwise, the probability of finding one significant difference by chance
alone increases rapidly with the number of pairs. If you use a 0.05 significance level to
test that means A and B are equal and to test that means C and D are equal, the overall
acceptance region is now 0.95 x 0.95, or 0.9025. Thus, the acceptance region for two
independent comparisons carried out simultaneously is about 90%, and the critical
region is 10% (instead of the desired 5%). For six pairs of means tested at the 0.05
significance level, the probability of a difference falling in the critical region is not 0.05
but
1 - (0.95)^6 = 0.265
For 10 pairs, this probability increases to 0.40. The result of following such a strategy
is to declare differences as significant when they are not.
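The arithmetic above generalizes easily; the short Python sketch below computes the familywise error rate for several numbers of tests, and the Bonferroni-style critical level obtained by dividing alpha by the number of comparisons (the approach described in the list that follows).

# Sketch: familywise error rate for k independent tests at alpha = 0.05,
# and the Bonferroni per-test critical level.
alpha = 0.05

for k in (2, 6, 10):
    familywise = 1 - (1 - alpha) ** k     # 0.0975, 0.265, 0.401
    print(k, round(familywise, 3))

k = 6                                      # for example, all pairs among four means
print(alpha / k)                           # Bonferroni critical level per comparison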
As an alternative to the situation described above, SYSTAT provides four
techniques to perform pairwise mean comparisons: Bonferroni, Scheffé, Tukey, and
Fisher's LSD. The first three methods provide protection for multiple tests. To
determine significant differences, simply look for pairs with probabilities below
your critical value (for example, 0.05 or 0.01).
There is an abundance of literature covering multiple comparisons (see Miller, 1985);
however, a few points are worth noting here:
- If you have a small number of groups, the Bonferroni pairwise procedure will often be more powerful (sensitive). For more groups, consider the Tukey method. Try all the methods in ANOVA (except Fisher's LSD) and pick the best one.
- All possible pairwise comparisons are a waste of power. Think about a meaningful subset of comparisons and test this subset with Bonferroni levels. To do this, divide your critical level, say 0.05, by the number of comparisons you are making. You will almost always have more power than with any other pairwise multiple comparison procedures.
- Some popular multiple comparison procedures are not found in SYSTAT. Duncan's test, for example, does not maintain its claimed protection level. Other stepwise multiple range tests, such as Newman-Keuls, have not been conclusively demonstrated to maintain overall protection levels for all possible distributions of means.
Linear and Quadratic Contrasts
Contrasts are used to test relationships among means. A contrast is a linear
combination of means μi with coefficients αi:

α1μ1 + α2μ2 + ... + αkμk = 0

where α1 + α2 + ... + αk = 0. In SYSTAT, hypotheses can be specified about contrasts
and tests performed. Typically, the hypothesis has the form:

H0: α1μ1 + α2μ2 + ... + αkμk = 0
The test statistic for a contrast is similar to that for a two-sample t test; the result of the
contrast (a relation among means, such as mean A minus mean B) is in the numerator
of the test statistic, and an estimate of within-group variability (the pooled variance
estimate or the error term from the ANOVA) is part of the denominator.
You can select contrast coefficients to test:
- Pairwise comparisons (test for a difference between two particular means)
- A linear combination of means that are meaningful to the study at hand (compare two treatments versus a control mean)
- Linear, quadratic, or the like increases (decreases) across a set of ordered means (that is, you might test a linear increase in sales by comparing people with no training, those with moderate training, and those with extensive training)
Many experimental design texts place coefficients for linear and quadratic contrasts for
three groups, four groups, and so on, in a table. SYSTAT allows you to type your
contrasts or select a polynomial option. A polynomial contrast of order 1 is linear; of
order 2, quadratic; of order 3, cubic; and so on.
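To make the mechanics concrete, here is a small illustrative sketch in Python (not SYSTAT syntax; the group means, cell sizes, and error mean square are invented). It evaluates a linear contrast across four ordered group means with the textbook coefficients (-3, -1, 1, 3) and forms the t-type statistic described above, with the pooled error term in the denominator.

import numpy as np

means = np.array([10.0, 12.0, 15.0, 19.0])   # hypothetical ordered group means
sizes = np.array([8, 8, 8, 8])               # hypothetical cases per group
mse   = 4.0                                  # hypothetical pooled error mean square from the ANOVA
c     = np.array([-3.0, -1.0, 1.0, 3.0])     # linear contrast coefficients

assert abs(c.sum()) < 1e-12                  # a valid contrast has coefficients that sum to zero
value = c @ means                            # c1*mean1 + c2*mean2 + ... + ck*meank
se    = np.sqrt(mse * np.sum(c**2 / sizes))  # standard error based on the pooled variance estimate
print(value, se, value / se)                 # a large |t| indicates a linear trend in the means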
Unbalanced Designs
An unbalanced factorial design occurs when the numbers of cases in the cells are unequal and not proportional across rows or columns. The following is an example of such a design:

[Data table: a 2 x 2 layout with factors A (levels A1, A2) and B (levels B1, B2) in which the cells contain unequal numbers of observations]
Unbalanced designs require a least-squares procedure like the General Linear Model because the usual approach of adding up sums of squared deviations from cell means and the grand mean does not yield maximum likelihood
estimates of effects. The General Linear Model adjusts for unbalanced designs when
you get an ANOVA table to test hypotheses.
However, the estimates of effects in the unbalanced design are no longer orthogonal
(and thus statistically independent) across factors and their interactions. This means
that the sum of squares associated with one factor depends on the sum of squares for
another or its interaction.
Analysts accustomed to using multiple regression have no problem with this
situation because they assume that their independent variables in a model are
correlated. Experimentalists, however, often have difficulty speaking of a main effect
conditioned on another. Consequently, there is extensive literature on hypothesis
testing methodology for unbalanced designs (for example, Speed and Hocking, 1976,
and Speed, Hocking, and Hackney, 1978), and there is no consensus on how to test
hypotheses with non-orthogonal designs.
Some statisticians advise you to do a series of hierarchical tests beginning with
interactions. If the highest-order interactions are insignificant, drop them from the
model and recompute the analysis. Then, examine the lower-order interactions. If they
are insignificant, recompute the model with main effects only. Some computer
programs automate this process and print sums of squares and F tests according to the
hierarchy (ordering of effects) you specify in the model. SAS and SPSS GLM, for
example, calls these Type I sums of squares.
This procedure is analogous to stepwise regression in which hierarchical subsets of models are tested. This example assumes you have specified the following model:

Y = CONSTANT + a + b + c + ab + ac + bc + abc

The hierarchical approach tests the following models:

Y = CONSTANT + a + b + c + ab + ac + bc + abc
Y = CONSTANT + a + b + c + ab + ac + bc
Y = CONSTANT + a + b + c + ab + ac
Y = CONSTANT + a + b + c + ab
Y = CONSTANT + a + b + c
Y = CONSTANT + a + b
Y = CONSTANT + a

The problem with this approach, however, is that plausible subsets of effects are ignored if you examine only one hierarchy. The following model, which may be the best fit to the data, is never considered:

Y = CONSTANT + a + b + ab
Furthermore, if you decide to examine all the other plausible subsets, you are really
doing all possible subsets regression, and you should use Bonferroni confidence levels
before rejecting a null hypothesis. The example above has 127 possible subset models
(excluding ones without a CONSTANT). Interactive stepwise regression allows you to
explore subset models under your control.
If you have done an experiment and have decided that higher-order effects
(interactions) are of enough theoretical importance to include in your model, you
should condition every test on all other effects in the model you selected. This is the
classical approach of Fisher and Yates. It amounts to using the default F values on the
ANOVA output, which are the same as the SAS and SPSS Type III sums of squares.
Probably the most important reason to stay with one model is that if you eliminate
a series of effects that are not quite significant (for example, p = 0.06), you could end
up with an incorrect subset model because of the dependencies among the sums of
squares. In summary, if you want other sums of squares, compute them. You can
supply the mean square error to customize sums of squares by using a hypothesis test
in GLM, selecting MSE, and specifying the mean square error and degrees of freedom.
Repeated Measures
In factorial ANOVA designs, each subject is measured once. For example, the
assumption of independence would be violated if a subject is measured first as a
control group member and later as a treatment group member. However, in a repeated
measures design, the same variable is measured several times for each subject (case).
A paired-comparison t test is the simplest form of a repeated measures design (for example, each subject has a before and an after measure).
Usually, it is not necessary for you to understand how SYSTAT carries out
calculations; however, repeated measures is an exception. It is helpful to understand
the quantities SYSTAT derives from your data. First, remember how to calculate a paired-comparison t test by hand (a short sketch after the following list illustrates these steps):
n For each subject, compute the difference between the two measures.
n Calculate the average of the differences.
n Calculate the standard deviation of the differences.
n Calculate the test statistic using this mean and standard deviation.
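Those four steps translate directly into a few lines of code. The sketch below is plain Python with invented before and after scores; it mirrors the hand calculation only and does not come from any SYSTAT file.

import numpy as np

before = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 16.0])   # hypothetical first measure
after  = np.array([14.0, 16.0, 12.0, 17.0, 13.0, 18.0])   # hypothetical second measure

d     = after - before                    # difference for each subject
d_bar = d.mean()                          # average of the differences
s_d   = d.std(ddof=1)                     # standard deviation of the differences
t     = d_bar / (s_d / np.sqrt(len(d)))   # paired-comparison t statistic
print(d_bar, s_d, t)                      # compare t with a t distribution on n - 1 degrees of freedom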
SYSTAT derives similar values from your repeated measures and uses them in
analysis-of-variance computations to test changes across the repeated measures (within
subjects) as well as differences between groups of subjects (between subjects). Tests
of the within-subjects values are called polynomial tests of order 1, 2,..., up to k, where
k is one less than the number of repeated measures. The first polynomial is used to test
linear changes (for example, do the repeated responses increase (or decrease) around a
line with a significant slope?). The second polynomial tests if the responses fall along
a quadratic curve, and so on.
For each case, SYSTAT uses orthogonal contrast coefficients to derive one number for each polynomial. For the coefficients of the linear polynomial, SYSTAT uses (-1, 0, 1) when there are three measures; (-3, -1, 1, 3) when there are four measures; and so on. When there are three repeated measures, SYSTAT multiplies the first by -1, the second by 0, and the third by 1, and sums these products (this sum is then multiplied by a constant to make the sum of squares of the coefficients equal to 1). Notice that when the responses are the same, the result of the polynomial contrast is 0; when the responses fall closely along a line with a steep slope, the polynomial differs markedly from 0.
For the coefficients of the quadratic polynomial, SYSTAT uses (1, -2, 1) when there are three measures; (1, -1, -1, 1) when there are four measures; and so on. The cubic and higher-order polynomials are computed in a similar way.
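The derived scores themselves are simple to compute. The following Python sketch (one invented subject with three repeated measures; an illustration, not SYSTAT code) rescales the linear coefficients (-1, 0, 1) and the quadratic coefficients (1, -2, 1) so that each set has a sum of squares of 1 and then forms the linear score, the quadratic score, and the total of the responses.

import numpy as np

y = np.array([210.0, 226.0, 239.0])                 # one subject's three repeated measures (hypothetical)

linear    = np.array([-1.0, 0.0, 1.0])
quadratic = np.array([1.0, -2.0, 1.0])
linear    = linear / np.sqrt(np.sum(linear ** 2))       # rescale so the sum of squared coefficients is 1
quadratic = quadratic / np.sqrt(np.sum(quadratic ** 2))

linear_score    = linear @ y       # near 0 if the responses are flat; large if they rise or fall steadily
quadratic_score = quadratic @ y    # large if the responses bend away from a straight line
total           = y.sum()          # the total response, used for the between-subjects test
print(linear_score, quadratic_score, total)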
Let's continue the discussion for a design with three repeated measures. Assume
that you record body weight once a month for three months for rats grouped by diet.
(Diet A includes a heavy concentration of alcohol and Diet B consists of normal lab
chow.) For each rat, SYSTAT computes a linear component and a quadratic
component. SYSTAT also sums the weights to derive a total response. These derived
values are used to compute two analysis of variance tables:
n The total response is used to test between-group differences; that is, the total is
used as the dependent variable in the usual factorial ANOVA computations. In the
example, this test compares total weight for Diet A against that for Diet B. This is
analogous to a two-sample t test using total weight as the dependent variable.
n The linear and quadratic components are used to test changes across the repeated
measures (within subjects) and also to test the interaction of the within factor with
the grouping factor. If the test for the linear component is significant, you can
report a significant linear increase in weight over the three months. If the test for
the quadratic component is also significant (but much less so than the linear
component), you might report that growth is predominantly linear, but there is a
significant curve in the upward trend.
n A significant interaction between Diet (the between-group factor) and the linear
component across time might indicate that the slopes for Diet A and Diet B differ.
This test may be the most important one for the experiment.
Assumptions in Repeated Measures
SYSTAT computes both univariate and multivariate statistics. Like all standard
ANOVA procedures, the univariate repeated measures approach requires that the
distributions within cells be normal. The univariate repeated measures approach also
requires that the covariances between all possible pairs of repeated measures be equal.
(Actually, the requirement is slightly less restrictive, but this difference is of little
practical importance.) Of course, the usual ANOVA requirement that all variances
within cells are equal still applies; thus, the covariance matrix of the measures should
have a constant diagonal and equal elements off the diagonal. This assumption is called
compound symmetry.
The multivariate analysis does not require compound symmetry. It requires that the
covariance matrices within groups (there is only one group in this example) be
equivalent and that they be based on multivariate normal distributions. If the classical
assumptions hold, then you should generally ignore the multivariate tests at the bottom
of the output and stay with the classical univariate ANOVA table because the multivariate tests will generally be less powerful.
There is a middle approach. The Greenhouse-Geisser and Huynh-Feldt statistics are used to adjust the probability for the classical univariate tests when compound symmetry fails. (Huynh-Feldt is a more recent adjustment to the conservative Greenhouse-Geisser statistic.) If the Huynh-Feldt p values are substantially different
from those under the column directly to the right of the F statistic, then you should be
aware that compound symmetry has failed. In this case, compare the adjusted p values
under Huynh-Feldt to those for the multivariate tests.
If all else fails, single degree-of-freedom polynomial tests can always be trusted. If
there are several to examine, however, remember that you may want to use Bonferroni
adjustments to the probabilities; that is, divide the normal value (for example, 0.05) by
the number of polynomial tests you want to examine. You need to make a Bonferroni
adjustment only if you are unable to use the summary univariate or multivariate tests
to protect the overall level; otherwise, you can examine the polynomials without
penalty if the overall test is significant.
Issues in Repeated Measures Analysis
Repeated measures designs can be generated in SYSTAT with a single procedure. You
need not worry about weighting cases in unbalanced designs or selecting error terms.
The program does this automatically; however, you should keep the following in mind:
n The sums of squares for the univariate F tests are pooled across subjects within
groups and their interactions with trials. This means that the traditional analysis
method has highly restrictive assumptions. You must assume that the variances
within cells are homogeneous and that the covariances across all pairs of cells are
equivalent (compound symmetry). There are some mathematical exceptions to this
requirement, but they rarely occur in practice. Furthermore, the compound
symmetry assumption rarely holds for real data.
n Compound symmetry is not required for the validity of the single degree-of-
freedom polynomial contrasts. These polynomials partition sums of squares into
orthogonal components. You should routinely examine the magnitude of these
sums of squares relative to the hypothesis sum of squares for the corresponding
univariate repeated measures F test when your trials are ordered on a scale.
n Think of the repeated measures output as an expanded traditional ANOVA table.
The effects are printed in the same order as they appear in Winer (1971) and other
texts, but they include the single degree-of-freedom and multivariate tests to
protect you from false conclusions. If you are satisfied that both are in agreement,
you can delete the additional lines in the output file.
n You can test any hypothesis after you have estimated a repeated measures design
and examined the output. For example, you can use polynomial contrasts to test
single degree-of-freedom components in an unevenly spaced design. You can also
use difference contrasts to do post hoc tests on adjacent trials.
Types of Sums of Squares
Some other statistics packages print several types of sums of squares for testing
hypotheses. The following names for these sums of squares are not statistical terms,
but they were popularized originally by SAS GLM.
Type I. Type I sums of squares are computed from the difference between the residual sums of squares of two different models. The particular models needed for the computation depend on the order of the variables in the MODEL statement. For example, if the model is

MODEL y = CONSTANT + a + b + a*b

then the sum of squares for A*B is produced from the difference between SSE (sum of squared error) in the two following models:

MODEL y = CONSTANT + a + b
MODEL y = CONSTANT + a + b + a*b

Similarly, the Type I sums of squares for B in this model are computed from the difference in SSE between the following models:

MODEL y = CONSTANT + a
MODEL y = CONSTANT + a + b

Finally, the Type I sums of squares for A are computed from the difference in residual sums of squares for the following:

MODEL y = CONSTANT
MODEL y = CONSTANT + a

In summary, to compute sums of squares, move from right to left and construct models that differ by the rightmost term only.
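The differencing recipe can be verified with any least-squares routine. The sketch below is Python (a tiny invented two-factor data set, with the factors coded 0/1 for simplicity rather than with SYSTAT's design variables); it reproduces the three Type I sums of squares by differencing residual sums of squares of exactly the nested models listed above.

import numpy as np

a = np.array([0, 0, 0, 0, 1, 1, 1])                # factor a, coded 0/1 (hypothetical data)
b = np.array([0, 0, 1, 1, 0, 1, 1])                # factor b, coded 0/1
y = np.array([3.0, 4.0, 6.0, 7.0, 5.0, 9.0, 8.0])  # response

def sse(terms):
    # residual sum of squares from a least-squares fit of y on a constant plus the given terms
    X = np.column_stack([np.ones_like(y)] + terms)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return resid @ resid

ss_a  = sse([]) - sse([a])                  # CONSTANT         versus CONSTANT + a
ss_b  = sse([a]) - sse([a, b])              # CONSTANT + a     versus CONSTANT + a + b
ss_ab = sse([a, b]) - sse([a, b, a * b])    # CONSTANT + a + b versus CONSTANT + a + b + a*b
print(ss_a, ss_b, ss_ab)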
Type II. Type II sums of squares are computed similarly to Type I except that main effects and interactions determine the ordering of differences instead of the MODEL statement order. For the above model, Type II sums of squares for the interaction are computed from the difference in residual sums of squares for the following models:

MODEL y = CONSTANT + a + b
MODEL y = CONSTANT + a + b + a*b

For the B effect, difference the following models:

MODEL y = CONSTANT + a + b
MODEL y = CONSTANT + a

For the A effect, difference the following (this is not the same as for Type I):

MODEL y = CONSTANT + a + b
MODEL y = CONSTANT + b

In summary, include interactions of the same order as well as all lower-order interactions and main effects when differencing to get an interaction. When getting sums of squares for a main effect, difference against all other main effects only.

Type III. Type III sums of squares are the default for ANOVA and are much simpler to understand. Simply difference from the full model, leaving out only the term in question. For example, the Type III sum of squares for A is taken from the following two models:

MODEL y = CONSTANT + b + a*b
MODEL y = CONSTANT + a + b + a*b
Type IV. Type IV sums of squares are designed for missing cells designs and are not
easily presented in the above terminology. They are produced by balancing over the
means of nonmissing cells not included in the current hypothesis.
SYSTAT's Sums of Squares
Printing more than one sum of squares in a table is potentially confusing to users. There
is a strong temptation to choose the most significant sum of squares without
understanding the hypothesis being tested.
A Type I test is produced by first estimating the full models and noting the error
term. Then, each effect is entered sequentially and tested with the error term from the
full model. Later, effects are conditioned on earlier effects, but earlier effects are not
conditioned on later effects. A Type II test is produced most easily with interactive
stepping (STEP). Type III is printed in the regression and ANOVA table. Finally, Type
IV is produced by the careful use of SPECIFY in testing means models. The advantage
of this approach is that the user is always aware that sums of squares depend on explicit
mathematical models rather than additions and subtractions of dimensionless
quantities.
Chapter 14
Linear Models I: Linear Regression
Leland Wilkinson and Mark Coward
The model for simple linear regression is:

y = β0 + β1x + ε

where y is the dependent variable, x is the independent variable, and the β's are the regression parameters (the intercept and the slope of the line of best fit). The model for multiple linear regression is:

y = β0 + β1x1 + β2x2 + ... + βpxp + ε
Both Regression and General Linear Model can estimate and test simple and multiple
linear regression models. Regression is easier to use than General Linear Model when
you are doing simple regression, multiple regression, or stepwise regression because
it has fewer options. To include interaction terms in your model or for mixture models,
use General Linear Model. With Regression, all independent variables must be
continuous; in General Linear Model, you can identify categorical independent
variables and SYSTAT will generate a set of design variables for each. Both General
Linear Model and Regression allow you to save residuals. In addition, you can test a
variety of hypotheses concerning the regression coefficients using General Linear
Model.
The ability to do stepwise regression is available in three ways: use the default
values, specify your own selection criteria, or at each step, interactively select a
variable to add or remove from the model.
For each model you fit in REGRESS, SYSTAT reports R², adjusted R², the standard error of the estimate, and an ANOVA table for assessing the fit of the model.
For each variable in the model, the output includes the estimate of the regression
coefficient, the standard error of the coefficient, the standardized coefficient, tolerance,
and a t statistic for measuring the usefulness of the variable in the model.
Linear Regression in SYSTAT
Regression Main Dialog Box
To obtain a regression analysis, from the menus choose:
Statistics
Regression
Linear
The following options can be specified:
Include constant. Includes the constant in the regression equation. Deselect this option
to remove the constant. You almost never want to remove the constant, and you should
be familiar with no-constant regression terminology before considering it.
Cases. If your data are in the form of a correlation matrix, specify the number of cases used to compute the correlation matrix.
Save. You can save residuals and other data to a new data file. The following
alternatives are available:
n Residuals. Saves predicted values, residuals, Studentized residuals, leverage for each observation, Cook's distance measure, and the standard error of predicted values.
n Residuals/Data. Saves the residual statistics given by Residuals plus all the
variables in the working data file, including any transformed data values.
n Partial. Saves partial residuals (see the sketch after this list). Suppose your model is:
Y = CONSTANT + X1 + X2 + X3
The saved file contains:
YPARTIAL(1): Residual of Y = CONSTANT + X2 + X3
XPARTIAL(1): Residual of X1 = CONSTANT + X2 + X3
YPARTIAL(2): Residual of Y = CONSTANT + X1 + X3
XPARTIAL(2): Residual of X2 = CONSTANT + X1 + X3
YPARTIAL(3): Residual of Y = CONSTANT + X1 + X2
XPARTIAL(3): Residual of X3 = CONSTANT + X1 + X2
n Partial/Data. Saves partial residuals plus all the variables in the working data file, including any transformed data values.
n Model. Saves statistics given in Residuals and the variables used in the model.
n Coefficients. Saves the estimates of the regression coefficients.
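As a sketch of what the partial residual pairs contain, the Python lines below (random illustrative data, not a SYSTAT SAVE file) build YPARTIAL(1) and XPARTIAL(1) by hand: each is a residual from a regression that leaves X1 out of the model. Plotting one against the other gives the familiar added-variable (partial residual) plot for X1.

import numpy as np

rng = np.random.default_rng(0)
n = 50
x1, x2, x3 = rng.normal(size=(3, n))                            # three hypothetical predictors
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(size=n)   # hypothetical response

def residuals(response, predictors):
    # residuals from a least-squares fit of response on a constant plus the given predictors
    X = np.column_stack([np.ones(n)] + predictors)
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    return response - X @ coef

ypartial_1 = residuals(y,  [x2, x3])    # Residual of Y  = CONSTANT + X2 + X3
xpartial_1 = residuals(x1, [x2, x3])    # Residual of X1 = CONSTANT + X2 + X3

# the slope of YPARTIAL(1) on XPARTIAL(1) equals the coefficient of X1 in the full model
print(np.polyfit(xpartial_1, ypartial_1, 1)[0])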
Regression Options
To open the Options dialog box, click Options in the Regression dialog box.
You can specify a tolerance level, select complete or stepwise entry, and specify entry
and removal criteria.
Tolerance. Prevents the entry of a variable that is highly correlated with the independent
variables already included in the model. Enter a value between 0 and 1. Typical values
are 0.01 or 0.001. The higher the value (closer to 1), the lower the correlation required
to exclude a variable.
Estimation. Controls the method used to enter and remove variables from the equation.
n Complete. All independent variables are entered in a single step.
n Mixture model. Constrains the independent variables to sum to a constant.
n Stepwise. Variables are entered or removed from the model one at a time.
The following alternatives are available for stepwise entry and removal:
n Backward. Begins with all candidate variables in the model. At each step, SYSTAT
removes the variable with the largest Remove value.
n Forward. Begins with no variables in the model. At each step, SYSTAT adds the
variable with the smallest Enter value.
n Automatic. For Backward, at each step SYSTAT automatically removes a variable
from your model. For Forward, SYSTAT automatically adds a variable to the
model at each step.
n Interactive. At each step in the model building, you select the variable to enter or
remove from the model.
You can also control the criteria used to enter and remove variables from the model:
n Enter. Enters a variable into the model if its alpha value is less than the specified
value. Enter a value between 0 and 1.
n Remove. Removes a variable from the model if its alpha value is greater than the
specified value. Enter a value between 0 and 1.
n Force. Forces the first n variables listed in your model to remain in the equation.
n FEnter. F-to-enter limit. Variables with F greater than the specified value are
entered into the model if Tolerance permits.
n FRemove. F-to-remove limit. Variables with F less than the specified value are
removed from the model.
n Max step. Maximum number of steps.
Using Commands
First, specify your data with USE filename. Continue with:
REGRESS
MODEL var=CONSTANT + var1 + var2 + / N=n
SAVE filename / COEF MODEL RESID DATA PARTIAL
ESTIMATE / TOL=n
(use START instead of ESTIMATE for stepwise model building)
START / FORWARD BACKWARD TOL=n ENTER=p REMOVE=p ,
FENTER=n FREMOVE=n FORCE=n
STEP / AUTO ENTER=p REMOVE=p FENTER=n FREMOVE=n
STOP

For hypothesis testing commands, see Chapter 16.
Usage Considerations
Types of data. Input can be the usual cases-by-variables data file or a covariance,
correlation, or sum of squares and cross-products matrix. Using matrix input requires
specification of the sample size which generated the matrix.
Print options. Using PRINT = MEDIUM, the output includes eigenvalues of X'X, condition indices, and variance proportions. PRINT = LONG adds the correlation matrix
of the regression coefficients to this output.
Quick Graphs. SYSTAT plots the residuals against the predicted values.
Saving files. You can save the results of the analysis (predicted values, residuals, and
diagnostics that identify unusual cases) for further use in examining assumptions.
BY groups. Separate regressions result for each level of any BY variables.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. REGRESS uses the FREQ variable to duplicate cases. This inflates the degrees of freedom to be the sum of the frequencies.
Case weights. REGRESS weights cases using the WEIGHT variable for rectangular
data. You can perform cross-validation if the weight variable is binary and coded 0 or
1. SYSTAT computes predicted values for cases with zero weight even though they are
not used to estimate the regression parameters.
Examples
Example 1
Simple Linear Regression
In this example, we explore the relation between gross domestic product per capita (GDP_CAP) and spending on the military (MIL) for 57 countries that report this information to the United Nations; we want to determine whether a measure of the financial well-being of a country is useful for predicting its military expenditures. Our model is:

mil = β0 + β1 * gdp_cap + ε

Initially, we plot the dependent variable against the independent variable. Such a plot
may reveal outlying cases or suggest a transformation before applying linear
regression. The input is:
The scatterplot follows:
USE ourworld
PLOT MIL*GDP_CAP / SMOOTH=LOWESS TENSION =0.500 ,
YLABEL=Military Spending,
SYMBOL=4 SIZE= 1.500 LABEL=NAME$ ,
CSIZE=2.000
To obtain the scatterplot, we created a new variable, NAME$, that had missing values for
all countries except Libya and Iraq. We then used the new variable to label plot points.
Iraq and Libya stand apart from the other countries; they spend considerably more for the military than countries with similar GDP_CAP values. The smoother indicates
that the relationship between the two variables is fairly linear. Distressing, however, is
the fact that many points clump in the lower left corner. Many data analysts would
want to study the data after log-transforming both variables. We do this in another
example, but now we estimate the coefficients for the data as recorded.
To fit a simple linear regression model to the data, the input is:
The output is:
REGRESS
USE ourworld
MODEL mil = CONSTANT + gdp_cap
ESTIMATE
1 case(s) deleted due to missing data.

Dep Var: MIL N: 56 Multiple R: 0.646 Squared multiple R: 0.417

Adjusted squared multiple R: 0.407 Standard error of estimate: 136.154

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT 41.857 24.838 0.0 . 1.685 0.098
GDP_CAP 0.019 0.003 0.646 1.000 6.220 0.000

Effect Coefficient Lower < 95%> Upper

CONSTANT 41.857 -7.940 91.654
GDP_CAP 0.019 0.013 0.025

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 717100.891 1 717100.891 38.683 0.000
Residual 1001045.288 54 18537.876
-------------------------------------------------------------------------------

*** WARNING ***
Case 22 is an outlier (Studentized Residual = 6.956)
Case 30 is an outlier (Studentized Residual = 4.348)

Durbin-Watson D Statistic 2.046
First Order Autocorrelation -0.032
SYSTAT reports that data are missing for one case. In the next line, it reports that 56
cases are used (N = 56). In the regression calculations, SYSTAT uses only the cases that
have complete data for the variables in the model. However, when only the dependent
variable is missing, SYSTAT computes a predicted value, its standard error, and a
leverage diagnostic for the case. In this sample, Afghanistan did not report military
spending.
When there is only one independent variable, Multiple R (0.646) is the simple
correlation between MIL and GDP_CAP. Squared multiple R (0.417) is the square of
this value, and it is the proportion of the total variation in the military expenditures
accounted for by GDP_CAP (GDP_CAP explains 41.7% of the variability of MIL).
Use Sum-of-Squares in the analysis of variance table to compute it:
717100.891 / (717100.891 + 1001045.288)
Adjusted squared multiple R is of interest for models with more than one independent
variable. Standard error of estimate (136.154) is the square root of the residual mean
square (18537.876) in the ANOVA table.
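Both of these quantities can be checked directly from the ANOVA table with a line or two of arithmetic (plain Python here, using the numbers printed in the output above; this is only a check, not part of SYSTAT):

import math

ss_regression = 717100.891
ss_residual   = 1001045.288
ms_residual   = 18537.876

r_squared = ss_regression / (ss_regression + ss_residual)   # squared multiple R
see = math.sqrt(ms_residual)                                # standard error of estimate
print(round(r_squared, 3), round(see, 3))                   # 0.417 and 136.154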
The estimates of the regression coefficients are 41.857 and 0.019, so the equation is:
mil = 41.857 + 0.019 * gdp_cap
The standard errors (Std Error) of the estimated coefficients are in the next column and
the standardized coefficients (Std Coef) follow. The latter are called beta weights by
some social scientists. Tolerance is not relevant when there is only one predictor.
Next are t statistics (t); the first (1.685) tests the significance of the difference of
the constant from 0 and the second (6.220) tests the significance of the slope, which is
equivalent to testing the significance of the correlation between military spending and
GDP_CAP.
F-ratio in the analysis of variance table is used to test the hypothesis that the slope
is 0 (or, for multiple regression, that all slopes are 0). The F is large when the
independent variable(s) helps to explain the variation in the dependent variable. Here,
there is a significant linear relation between military spending and GDP_CAP. Thus,
we reject the hypothesis that the slope of the regression line is zero (F-ratio = 38.683,
p value (P) < 0.0005).
It appears from the results above that GDP_CAP is useful for predicting spending
on the military; that is, countries that are financially sound tend to spend more on the
military than poorer nations. These numbers, however, do not provide the complete
picture. Notice that SYSTAT warns us that two countries (Iraq and Libya) with
unusual values could be distorting the results. We recommend that you consider
transforming the data and that you save the residuals and other diagnostic statistics.
Example 2
Transformations
The data in the scatterplot in the simple linear regression example are not well suited
for linear regression, as the heavy concentration of points in the lower left corner of the
graph shows. Here are the same data plotted in log units:
REGRESS
USE ourworld
PLOT MIL*GDP_CAP / SMOOTH=LOWESS TENSION =0.500,
XLABEL=GDP per capita,
XLOG=10 YLABEL=Military Spending YLOG=10,
SYMBOL=4,2,3,
SIZE= 1.250 LABEL=COUNTRY$ CSIZE=1.450
The scatterplot is:
Except possibly for Iraq and Libya, the configuration of these points is better for linear
modeling than that for the untransformed data.
We now transform both the y and x variables and refit the model. The input is:
The output follows:
REGRESS
USE ourworld
LET log_mil = L10(mil)
LET log_gdp = L10(gdp_cap)
MODEL log_mil = CONSTANT + log_gdp
ESTIMATE
1 case(s) deleted due to missing data.

Dep Var: LOG_MIL N: 56 Multiple R: 0.857 Squared multiple R: 0.734

Adjusted squared multiple R: 0.729 Standard error of estimate: 0.346

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT -1.308 0.257 0.0 . -5.091 0.000
LOG_GDP 0.909 0.075 0.857 1.000 12.201 0.000

Effect Coefficient Lower < 95%> Upper

CONSTANT -1.308 -1.822 -0.793
LOG_GDP 0.909 0.760 1.058

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 17.868 1 17.868 148.876 0.000
Residual 6.481 54 0.120

*** WARNING ***
Case 22 is an outlier (Studentized Residual = 4.004)

Durbin-Watson D Statistic 1.810
First Order Autocorrelation 0.070

The Squared multiple R for the variables in log units is 0.734 (versus 0.417 for the untransformed values). That is, we have gone from explaining 41.7% of the variability of military spending to 73.4% by using the log transformations. The F-ratio is now 148.876 (it was 38.683). Notice that we now have only one outlier (Iraq).

The Calculator

But what is the estimated model now?

log_mil = -1.308 + 0.909 * log_gdp

However, many people don't think in log units. Let's transform this equation (exponentiate each side of the equation):

10^log_mil = 10^(-1.308 + 0.909 * log_gdp)
mil = 10^(-1.308) * 10^(0.909 * log_gdp)
mil = 0.049 * (gdp_cap)^0.909
We used the calculator to compute 0.049. Type:
CALC 10^-1.308
and SYSTAT returns 0.049.
Example 3
Residuals and Diagnostics for Simple Linear Regression
In this example, we continue with the transformations example and save the residuals
and diagnostics along with the data. Using the saved statistics, we create stem-and-leaf
plots of the residuals and Studentized residuals. In addition, let's plot the Studentized residuals (to identify outliers in the y space) against leverage (to identify outliers in the x space) and use Cook's distance measure to scale the size of each plot symbol. In a second plot, we display the corresponding country names. The input is:
REGRESS
USE ourworld
LET log_mil = L10(mil)
LET log_gdp = L10(gdp_cap)
MODEL log_mil = CONSTANT + log_gdp
SAVE myresult / DATA RESID
ESTIMATE
USE myresult
STATS
STEM residual student
PLOT STUDENT*LEVERAGE / SYMBOL=4,2,3 SIZE=cook
PLOT student*leverage / LABEL=country$ SYMBOL=4,2,3
The output is:
Stem and Leaf Plot of variable: Stem and Leaf Plot of variable:
RESIDUAL, N = 56 STUDENT, N = 56
Minimum: -0.644 Minimum: -1.923
Lower hinge: -0.246 Lower hinge: -0.719
Median: -0.031 Median: -0.091
Upper hinge: 0.203 Upper hinge: 0.591
Maximum: 1.216 Maximum: 4.004

-6 42 -1 986
-5 6 -1 32000
-4 42 -0 H 88877766555
-3 554000 -0 M 443322111000
-2 H 65531 0 M 000022344
-1 9876433 0 H 555889999
-0 M 98433200 1 0223
0 222379 1 5
1 1558 2 3
2 H 009 * * * Outside Values * * *
3 0113369 4 0
4 27 1 cases with missing values excluded from plot.
5 1
6
7 7
* * * Outside Values * * *
12 1
1 cases with missing values excluded from plot.
In the stem-and-leaf plots, Iraq's residual is 1.216 and is identified as an Outside Value. The value of its Studentized residual is 4.004, which is very extreme for the t distribution.
The case with the most influence on the estimates of the regression coefficients stands out at the top left (that is, it has the largest plot symbol). From the second plot, we identify this country as Iraq. Its value of Cook's distance measure is large because its Studentized residual is extreme. On the other hand, Ethiopia (furthest to the right), the case with the next most influence, has a large value of Cook's distance because its value of leverage is large. Gambia has the third largest Cook value, and Libya, the fourth.
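For readers who want to see where these diagnostics come from, here is an illustrative Python sketch with generic simple-regression data (not the OURWORLD file). It computes leverage from the hat matrix, an internally studentized residual, and Cook's distance from the usual textbook formulas; SYSTAT's saved Studentized residuals may be the externally studentized (deleted-residual) variant, so the values are not meant to match the output above.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=30)
x[0], y[0] = 4.0, 8.0                          # plant one point that is extreme in both x and y

X = np.column_stack([np.ones_like(x), x])      # design matrix: constant plus the predictor
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages (diagonal of the hat matrix)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ coef                               # ordinary residuals
p = X.shape[1]                                 # number of estimated coefficients
mse = e @ e / (len(y) - p)

r = e / np.sqrt(mse * (1 - h))                 # internally studentized residuals
cook = r ** 2 * h / (p * (1 - h))              # Cook's distance: large when residual and leverage are both large
print(h[0], r[0], cook[0])                     # the planted case dominates all three measures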
Deleting an Outlier
Residual plots identify Iraq as the case with the greatest influence on the estimated
coefficients. Let's remove this case from the analysis and check SYSTAT's warnings.
The input is:
The output follows:
Now there are no warnings about outliers.
REGRESS
USE ourworld
LET log_mil = L10(mil)
LET log_gdp = L10(gdp_cap)
SELECT mil < 700
MODEL log_mil = CONSTANT + log_gdp
ESTIMATE
SELECT
Dep Var: LOG_MIL N: 55 Multiple R: 0.886 Squared multiple R: 0.785

Adjusted squared multiple R: 0.781 Standard error of estimate: 0.306

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT -1.353 0.227 0.0 . -5.949 0.000
LOG_GDP 0.916 0.066 0.886 1.000 13.896 0.000

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 18.129 1 18.129 193.107 0.000
Residual 4.976 53 0.094
-------------------------------------------------------------------------------


Durbin-Watson D Statistic 1.763
First Order Autocorrelation 0.086
Printing Residuals and Diagnostics
Let's look at some of the values in the MYRESULT file. We use the country name as the ID variable for the listing. The input is:

USE myresult
IDVAR = country$
FORMAT 10 3
LIST cook leverage student mil gdp_cap

The output is:

* Case ID * COOK LEVERAGE STUDENT MIL GDP_CAP
Ireland 0.013 0.032 -0.891 95.833 8970.885
Austria 0.023 0.043 -1.011 127.237 13500.299
Belgium 0.000 0.044 -0.001 283.939 13724.502
Denmark 0.000 0.045 -0.119 269.608 14363.064
(etc.)
Libya 0.056 0.022 2.348 640.513 4738.055
Somalia 0.009 0.072 0.473 8.846 201.798
Afghanistan . 0.075 . . 189.128
(etc.)

The value of MIL for Afghanistan is missing, so Cook's distance measure and Studentized residuals are not available (periods are inserted for these values in the listing).
Example 4
Multiple Linear Regression
In this example, we build a multiple regression model to predict total employment
using values of six independent variables. The data were originally used by Longley
(1967) to test the robustness of least-squares packages to multicollinearity and other
sources of ill-conditioning. SYSTAT can print the estimates of the regression
coefficients with more correct digits than the solution provided by Longley himself
if you adjust the number of decimal places. By default, the first three digits after the
decimal are displayed. After the output is displayed, you can use General Linear Model
to test hypotheses involving linear combinations of regression coefficients.
The input is:
The output follows:
REGRESS
USE longley
PRINT = LONG
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
ESTIMATE
Eigenvalues of unit scaled X'X

1 2 3 4 5
6.861 0.082 0.046 0.011 0.000

6 7
0.000 0.000

Condition indices

1 2 3 4 5
1.000 9.142 12.256 25.337 230.424

6 7
1048.080 43275.046

Variance proportions

1 2 3 4 5
CONSTANT 0.000 0.000 0.000 0.000 0.000
DEFLATOR 0.000 0.000 0.000 0.000 0.457
GNP 0.000 0.000 0.000 0.001 0.016
UNEMPLOY 0.000 0.014 0.001 0.065 0.006
ARMFORCE 0.000 0.092 0.064 0.427 0.115
POPULATN 0.000 0.000 0.000 0.000 0.010
TIME 0.000 0.000 0.000 0.000 0.000


6 7
CONSTANT 0.000 1.000
DEFLATOR 0.505 0.038
GNP 0.328 0.655
UNEMPLOY 0.225 0.689
ARMFORCE 0.000 0.302
POPULATN 0.831 0.160
TIME 0.000 1.000


Dep Var: TOTAL N: 16 Multiple R: 0.998 Squared multiple R: 0.995

Adjusted squared multiple R: 0.992 Standard error of estimate: 304.854

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT -3482258.635 890420.384 0.0 . -3.911 0.004
DEFLATOR 15.062 84.915 0.046 0.007 0.177 0.863
GNP -0.036 0.033 -1.014 0.001 -1.070 0.313
UNEMPLOY -2.020 0.488 -0.538 0.030 -4.136 0.003
ARMFORCE -1.033 0.214 -0.205 0.279 -4.822 0.001
POPULATN -0.051 0.226 -0.101 0.003 -0.226 0.826
TIME 1829.151 455.478 2.480 0.001 4.016 0.003
SYSTAT computes the eigenvalues by scaling the columns of the X matrix so that the diagonal elements of X'X are 1s and then factoring the X'X matrix. In this example, most of the eigenvalues of X'X are nearly 0, showing that the predictor variables
comprise a relatively redundant set.
Condition indices are the square roots of the ratios of the largest eigenvalue to each
successive eigenvalue. A condition index greater than 15 indicates a possible problem,
and an index greater than 30 suggests a serious problem with collinearity (Belsley, Kuh, and Welsch, 1980). The condition indices in the Longley example show a
tremendous collinearity problem.
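The square-root-of-ratio step is easy to mimic. The Python lines below use the four eigenvalues that are printed with enough digits above; because the display rounds to three decimals, the results only approximate the condition indices shown, and the three eigenvalues that print as 0.000 cannot be used from the display at all.

import numpy as np

eigenvalues = np.array([6.861, 0.082, 0.046, 0.011])     # largest eigenvalues of unit-scaled X'X, as printed
condition_indices = np.sqrt(eigenvalues[0] / eigenvalues)
print(np.round(condition_indices, 2))                    # roughly 1.00, 9.15, 12.21, 24.97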
Variance proportions are the proportions of the variance of the estimates accounted
for by each principal component associated with each of the above eigenvalues. You
should begin to worry about collinearity when a component associated with a high
condition index contributes substantially to the variance of two or more variables. This
is certainly the case with the last component of the Longley data. TIME, GNP, and
UNEMPLOY load highly on this component. See Belsley, Kuh, and Welsch (1980) for
more information about these diagnostics.
Effect Coefficient Lower < 95%> Upper

CONSTANT -3482258.635 -5496529.488 -1467987.781
DEFLATOR 15.062 -177.029 207.153
GNP -0.036 -0.112 0.040
UNEMPLOY -2.020 -3.125 -0.915
ARMFORCE -1.033 -1.518 -0.549
POPULATN -0.051 -0.563 0.460
TIME 1829.151 798.788 2859.515


Correlation matrix of regression coefficients

CONSTANT DEFLATOR GNP UNEMPLOY ARMFORCE
CONSTANT 1.000
DEFLATOR -0.205 1.000
GNP 0.816 -0.649 1.000
UNEMPLOY 0.836 -0.555 0.946 1.000
ARMFORCE 0.550 -0.349 0.469 0.619 1.000
POPULATN -0.411 0.659 -0.833 -0.758 -0.189
TIME -1.000 0.186 -0.802 -0.824 -0.549


POPULATN TIME
POPULATN 1.000
TIME 0.388 1.000

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 1.84172E+08 6 3.06954E+07 330.285 0.000
Residual 836424.056 9 92936.006
-------------------------------------------------------------------------------


Durbin-Watson D Statistic 2.559
First Order Autocorrelation -0.348
Adjusted squared multiple R is 0.992. The formula for this statistic is:

adjusted R² = R² - ((p - 1) / (n - p)) * (1 - R²)

where n is the number of cases and p is the number of predictors, including the constant.
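Applying the formula to the values reported above reproduces the printed figure (plain Python arithmetic; no SYSTAT involved):

n  = 16      # number of cases
p  = 7       # number of predictors, including the constant
r2 = 0.995   # squared multiple R from the output

adjusted = r2 - ((p - 1) / (n - p)) * (1 - r2)
print(round(adjusted, 3))    # 0.992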
Notice the extremely small tolerances in the output. Tolerance is 1 minus the squared multiple correlation between a predictor and the remaining predictors in the model. These tolerances signal that the predictor variables are highly intercorrelated, a worrisome situation. This multicollinearity can inflate the standard errors of the
coefficients, thereby attenuating the associated F statistics, and can threaten
computational accuracy.
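Tolerance can be reproduced directly from that definition. The sketch below (Python with generic made-up predictors rather than the Longley variables) regresses one predictor on the others and reports 1 minus the resulting squared multiple correlation:

import numpy as np

rng = np.random.default_rng(2)
z  = rng.normal(size=40)
x1 = z + 0.05 * rng.normal(size=40)    # x1 and x2 are nearly collinear by construction
x2 = z + 0.05 * rng.normal(size=40)
x3 = rng.normal(size=40)               # x3 is unrelated to the others

def tolerance(target, others):
    # 1 minus the squared multiple correlation of target with the other predictors
    X = np.column_stack([np.ones_like(target)] + others)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1 - r2

print(tolerance(x1, [x2, x3]))   # near 0: x1 is almost a linear function of the others
print(tolerance(x3, [x1, x2]))   # near 1: x3 adds nearly independent information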
Finally, SYSTAT produces the Correlation matrix of regression coefficients. In the
Longley data, these estimates are highly correlated, further indicating that there are too
many correlated predictors in the equation to provide stable estimates.
Scatterplot Matrix
Examining a scatterplot matrix of the variables in the model is often a beneficial first
step in any multiple regression analysis. Nonlinear relationships and correlated
predictors, both of which cause problems for multiple linear regression, can be
uncovered before fitting the model. The input is:
USE longley
SPLOM DEFLATOR GNP UNEMPLOY ARMFORCE POPULATN TIME TOTAL / HALF
DENSITY=HIST
The plot follows:
Notice the severely nonlinear distributions of ARMFORCE with the other variables, as
well as the near perfect correlations among several of the predictors. There is also a
sharp discontinuity between post-war and 1950s behavior on ARMFORCE.
Example 5
Automatic Stepwise Regression
Following is an example of forward automatic stepping using the LONGLEY data. The
input is:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
START / FORWARD
STEP / AUTO
STOP
The output is:
Step # 0 R = 0.000 R-Square = 0.000

Effect Coefficient Std Error Std Coef Tol. df F P

In
___
1 Constant

Out Part. Corr.
___
2 DEFLATOR 0.971 . . 1.00000 1 230.089 0.000
3 GNP 0.984 . . 1.00000 1 415.103 0.000
4 UNEMPLOY 0.502 . . 1.00000 1 4.729 0.047
5 ARMFORCE 0.457 . . 1.00000 1 3.702 0.075
6 POPULATN 0.960 . . 1.00000 1 166.296 0.000
7 TIME 0.971 . . 1.00000 1 233.704 0.000
-------------------------------------------------------------------------------

Step # 1 R = 0.984 R-Square = 0.967
Term entered: GNP

Effect Coefficient Std Error Std Coef Tol. df F P

In
___
1 Constant
3 GNP 0.035 0.002 0.984 1.00000 1 415.103 0.000

Out Part. Corr.
___
2 DEFLATOR -0.187 . . 0.01675 1 0.473 0.504
4 UNEMPLOY -0.638 . . 0.63487 1 8.925 0.010
5 ARMFORCE 0.113 . . 0.80069 1 0.167 0.689
6 POPULATN -0.598 . . 0.01774 1 7.254 0.018
7 TIME -0.432 . . 0.00943 1 2.979 0.108
-------------------------------------------------------------------------------

Step # 2 R = 0.990 R-Square = 0.981
Term entered: UNEMPLOY

Effect Coefficient Std Error Std Coef Tol. df F P

In
___
1 Constant
3 GNP 0.038 0.002 1.071 0.63487 1 489.314 0.000
4 UNEMPLOY -0.544 0.182 -0.145 0.63487 1 8.925 0.010

Out Part. Corr.
___
2 DEFLATOR -0.073 . . 0.01603 1 0.064 0.805
5 ARMFORCE -0.479 . . 0.48571 1 3.580 0.083
6 POPULATN -0.164 . . 0.00563 1 0.334 0.574
7 TIME 0.308 . . 0.00239 1 1.259 0.284
-------------------------------------------------------------------------------

Step # 3 R = 0.993 R-Square = 0.985
Term entered: ARMFORCE

Effect Coefficient Std Error Std Coef Tol. df F P

In
___
1 Constant
3 GNP 0.041 0.002 1.154 0.31838 1 341.684 0.000
4 UNEMPLOY -0.797 0.213 -0.212 0.38512 1 13.942 0.003
5 ARMFORCE -0.483 0.255 -0.096 0.48571 1 3.580 0.083

Out Part. Corr.
___
2 DEFLATOR 0.163 . . 0.01318 1 0.299 0.596
6 POPULATN -0.376 . . 0.00509 1 1.813 0.205
7 TIME 0.830 . . 0.00157 1 24.314 0.000
-------------------------------------------------------------------------------

Step # 4 R = 0.998 R-Square = 0.995
Term entered: TIME

Effect Coefficient Std Error Std Coef Tol. df F P

In
___
1 Constant
3 GNP -0.040 0.016 -1.137 0.00194 1 5.953 0.033
4 UNEMPLOY -2.088 0.290 -0.556 0.07088 1 51.870 0.000
5 ARMFORCE -1.015 0.184 -0.201 0.31831 1 30.496 0.000
7 TIME 1887.410 382.766 2.559 0.00157 1 24.314 0.000

Out Part. Corr.
___
2 DEFLATOR 0.143 . . 0.01305 1 0.208 0.658
6 POPULATN -0.150 . . 0.00443 1 0.230 0.642
-------------------------------------------------------------------------------
Dep Var: TOTAL N: 16 Multiple R: 0.998 Squared multiple R: 0.995

Adjusted squared multiple R: 0.994 Standard error of estimate: 279.396

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT -3598729.374 740632.644 0.0 . -4.859 0.001
GNP -0.040 0.016 -1.137 0.002 -2.440 0.033
UNEMPLOY -2.088 0.290 -0.556 0.071 -7.202 0.000
ARMFORCE -1.015 0.184 -0.201 0.318 -5.522 0.000
TIME 1887.410 382.766 2.559 0.002 4.931 0.000

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 1.84150E+08 4 4.60375E+07 589.757 0.000
Residual 858680.406 11 78061.855
-------------------------------------------------------------------------------
The steps proceed as follows:
n At step 0, no variables are in the model. GNP has the largest simple correlation and
F, so SYSTAT enters it at step 1. Note at this step that the partial correlation, Part.
Corr., is the simple correlation of each predictor with TOTAL.
n With GNP in the equation, UNEMPLOY is now the best candidate.
n The F for ARMFORCE is 3.58 when GNP and UNEMPLOY are included in the
model.
n SYSTAT finishes by entering TIME.
In four steps, SYSTAT entered four predictors. None was removed, resulting in a final
equation with a constant and four predictors. For this final model, SYSTAT uses all
cases with complete data for GNP, UNEMPLOY, ARMFORCE, and TIME. Thus, when
some values in the sample are missing, the sample size may be larger here than for the
last step in the stepwise process (there, cases are omitted if any value is missing among
the six candidate variables). If you don't want to stop here, you could move more
variables in (or out) using interactive stepping.
Example 6
Interactive Stepwise Regression
Interactive stepping helps you to explore model building in more detail. With data that
are as highly intercorrelated as the LONGLEY data, interactive stepping reveals the
dangers of thinking that the automated result is the only acceptable subset model. In
this example, we use interactive stepping to explore the LONGLEY data further. That
is, after specifying a model that includes all of the candidate variables available, we
request backward stepping by selecting Stepwise, Backward, and Interactive in the
Regression Options dialog box. After reviewing the results at each step, we use Step
to move a variable in (or out) of the model. When finished, we select Stop for the final
model. To begin interactive stepping, the input is:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
START / BACK
The output is:
Step # 0 R = 0.998 R-Square = 0.995

Effect Coefficient Std Error Std Coef Tol. df F P

In
___
1 Constant
2 DEFLATOR 15.062 84.915 0.046 0.00738 1 0.031 0.863
3 GNP -0.036 0.033 -1.014 0.00056 1 1.144 0.313
4 UNEMPLOY -2.020 0.488 -0.538 0.02975 1 17.110 0.003
5 ARMFORCE -1.033 0.214 -0.205 0.27863 1 23.252 0.001
6 POPULATN -0.051 0.226 -0.101 0.00251 1 0.051 0.826
7 TIME 1829.151 455.478 2.480 0.00132 1 16.127 0.003

Out Part. Corr.
___
none
-------------------------------------------------------------------------------
We begin with all variables in the model. We remove DEFLATOR because it has an unusually low tolerance and F value.
Type:

STEP deflator

The output is:
Dependent Variable TOTAL
Minimum tolerance for entry into model = 0.000000
Backward stepwise with Alpha-to-Enter=0.150 and Alpha-to-Remove=0.150

Step # 1 R = 0.998 R-Square = 0.995
Term removed: DEFLATOR

Effect Coefficient Std Error Std Coef Tol. df F P

In
___
1 Constant
3 GNP -0.032 0.024 -0.905 0.00097 1 1.744 0.216
4 UNEMPLOY -1.972 0.386 -0.525 0.04299 1 26.090 0.000
5 ARMFORCE -1.020 0.191 -0.202 0.31723 1 28.564 0.000
6 POPULATN -0.078 0.162 -0.154 0.00443 1 0.230 0.642
7 TIME 1814.101 425.283 2.459 0.00136 1 18.196 0.002

Out Part. Corr.
___
2 DEFLATOR 0.059 . . 0.00738 1 0.031 0.863
-------------------------------------------------------------------------------
POPULATN has the lowest F statistic and, again, a low tolerance.
Type:
STEP populatn

The output is:
Step # 2 R = 0.998 R-Square = 0.995
Term removed: POPULATN

Effect Coefficient Std Error Std Coef Tol. df F P

In
___
1 Constant
3 GNP -0.040 0.016 -1.137 0.00194 1 5.953 0.033
4 UNEMPLOY -2.088 0.290 -0.556 0.07088 1 51.870 0.000
5 ARMFORCE -1.015 0.184 -0.201 0.31831 1 30.496 0.000
7 TIME 1887.410 382.766 2.559 0.00157 1 24.314 0.000

Out Part. Corr.
___
2 DEFLATOR 0.143 . . 0.01305 1 0.208 0.658
6 POPULATN -0.150 . . 0.00443 1 0.230 0.642
-------------------------------------------------------------------------------
GNP and TIME both have low tolerance values. They could be highly correlated with one another, so we will take each out and examine the behavior of the other when we do.
Type:

STEP time
STEP time
STEP gnp

The output is:
Step # 3 R = 0.993 R-Square = 0.985
Term removed: TIME

Effect Coefficient Std Error Std Coef Tol. df F P

In
___
1 Constant
3 GNP 0.041 0.002 1.154 0.31838 1 341.684 0.000
4 UNEMPLOY -0.797 0.213 -0.212 0.38512 1 13.942 0.003
5 ARMFORCE -0.483 0.255 -0.096 0.48571 1 3.580 0.083

Out Part. Corr.
___
2 DEFLATOR 0.163 . . 0.01318 1 0.299 0.596
6 POPULATN -0.376 . . 0.00509 1 1.813 0.205
7 TIME 0.830 . . 0.00157 1 24.314 0.000
-------------------------------------------------------------------------------
Step # 4 R = 0.998 R-Square = 0.995
Term entered: TIME

Effect Coefficient Std Error Std Coef Tol. df F P

In
___
1 Constant
3 GNP -0.040 0.016 -1.137 0.00194 1 5.953 0.033
4 UNEMPLOY -2.088 0.290 -0.556 0.07088 1 51.870 0.000
5 ARMFORCE -1.015 0.184 -0.201 0.31831 1 30.496 0.000
7 TIME 1887.410 382.766 2.559 0.00157 1 24.314 0.000

Out Part. Corr.
___
2 DEFLATOR 0.143 . . 0.01305 1 0.208 0.658
6 POPULATN -0.150 . . 0.00443 1 0.230 0.642
-------------------------------------------------------------------------------
Step # 5 R = 0.996 R-Square = 0.993
Term removed: GNP

Effect Coefficient Std Error Std Coef Tol. df F P

In
___
1 Constant
4 UNEMPLOY -1.470 0.167 -0.391 0.30139 1 77.320 0.000
5 ARMFORCE -0.772 0.184 -0.153 0.44978 1 17.671 0.001
7 TIME 956.380 35.525 1.297 0.25701 1 724.765 0.000

Out Part. Corr.
___
2 DEFLATOR -0.031 . . 0.01385 1 0.011 0.920
3 GNP -0.593 . . 0.00194 1 5.953 0.033
6 POPULATN -0.505 . . 0.00889 1 3.768 0.078
-------------------------------------------------------------------------------
We are comfortable with the tolerance values in both models with three variables. With TIME in the model, the smallest F is 17.671, and with GNP in the model, the smallest F is 3.580. Furthermore, with TIME, the squared multiple correlation is 0.993, and with GNP, it is 0.985. Let's stop the stepping and view more information about the last model.
Type:

STOP

The output is:
Dep Var: TOTAL N: 16 Multiple R: 0.996 Squared multiple R: 0.993

Adjusted squared multiple R: 0.991 Standard error of estimate: 332.084

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT -1797221.112 68641.553 0.0 . -26.183 0.000
UNEMPLOY -1.470 0.167 -0.391 0.301 -8.793 0.000
ARMFORCE -0.772 0.184 -0.153 0.450 -4.204 0.001
TIME 956.380 35.525 1.297 0.257 26.921 0.000

Effect Coefficient Lower < 95%> Upper

CONSTANT -1797221.112 -1946778.208 -1647664.016
UNEMPLOY -1.470 -1.834 -1.106
ARMFORCE -0.772 -1.173 -0.372
TIME 956.380 878.978 1033.782

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 1.83685E+08 3 6.12285E+07 555.209 0.000
Residual 1323360.743 12 110280.062
-------------------------------------------------------------------------------
Our final model includes only UNEMPLOY, ARMFORCE, and TIME. Notice that its multiple correlation (0.996) is not significantly smaller than that for the automated stepping (0.998). Following are the commands we used:

REGRESS
USE longley
MODEL total=constant + deflator + gnp + unemploy +,
armforce + populatn + time
START / BACK
STEP deflator
STEP populatn
STEP time
STEP time
STEP gnp
STOP

Example 7
Testing whether a Single Coefficient Equals Zero
Most regression programs print tests of significance for each coefficient in an equation. SYSTAT has a powerful additional feature: post hoc tests of regression coefficients. To demonstrate these tests, we use the LONGLEY data and examine whether the DEFLATOR coefficient differs significantly from 0. The input is:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
ESTIMATE / TOL=.00001
HYPOTHESIS
EFFECT = deflator
TEST
The output is:
Dep Var: TOTAL N: 16 Multiple R: 0.998 Squared multiple R: 0.995

Adjusted squared multiple R: 0.992 Standard error of estimate: 304.854

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT -3482258.635 890420.384 0.0 . -3.911 0.004
DEFLATOR 15.062 84.915 0.046 0.007 0.177 0.863
GNP -0.036 0.033 -1.014 0.001 -1.070 0.313
UNEMPLOY -2.020 0.488 -0.538 0.030 -4.136 0.003
ARMFORCE -1.033 0.214 -0.205 0.279 -4.822 0.001
POPULATN -0.051 0.226 -0.101 0.003 -0.226 0.826
TIME 1829.151 455.478 2.480 0.001 4.016 0.003

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 1.84172E+08 6 3.06954E+07 330.285 0.000
Residual 836424.056 9 92936.006
-------------------------------------------------------------------------------
Test for effect called: DEFLATOR

Test of Hypothesis

Source SS df MS F P

Hypothesis 2923.976 1 2923.976 0.031 0.863
Error 836424.056 9 92936.006

-------------------------------------------------------------------------------
Notice that the error sum of squares (836424.056) is the same as the output residual sum of squares at the bottom of the ANOVA table. The probability level (0.863) is the same also. This probability level (> 0.05) indicates that the regression coefficient for DEFLATOR does not differ from 0.
You can test all of the coefficients in the equation this way, individually, or choose All to generate separate hypothesis tests for each predictor, or type:

HYPOTHESIS
ALL
TEST
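For a single coefficient, the post hoc F statistic is simply the square of the t statistic already printed in the coefficient table, so the two displays agree up to rounding. A line of Python arithmetic on the values above makes the point:

t_deflator = 0.177                       # t for DEFLATOR in the coefficient table
f_hypothesis = 2923.976 / 92936.006      # hypothesis MS divided by error MS from the test output
print(round(t_deflator ** 2, 3), round(f_hypothesis, 3))   # both are about 0.031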
Example 8
Testing whether Multiple Coefficients Equal Zero
You may wonder why you need to bother with testing when the regression output gives
you hypothesis test results. Try the following hypothesis test:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
ESTIMATE / TOL=.00001
HYPOTHESIS
EFFECT = deflator & gnp
TEST
The hypothesis output is:

Test for effect called: DEFLATOR

and
GNP

A Matrix

1 2 3 4 5
1 0.0 1.000 0.0 0.0 0.0
2 0.0 0.0 1.000 0.0 0.0


6 7
1 0.0 0.0
2 0.0 0.0

Test of Hypothesis

Source SS df MS F P

Hypothesis 149295.592 2 74647.796 0.803 0.478
Error 836424.056 9 92936.006

-------------------------------------------------------------------------------

Here, the error sum of squares is the same as that for the model, but the hypothesis sum of squares is different. We just tested the hypothesis that the DEFLATOR and GNP coefficients simultaneously are 0.
The A matrix printed above the test specifies the hypothesis that we tested. It has two degrees of freedom (see the F statistic) because the A matrix has two rows, one for each coefficient. If you know some matrix algebra, you can see that the matrix product AB using this A matrix and B as a column matrix of regression coefficients picks up only two coefficients: DEFLATOR and GNP. Notice that our hypothesis had the following matrix equation: AB = 0, where 0 is a null matrix.
If you don't know matrix algebra, don't worry; the ampersand method is equivalent. You can ignore the A matrix in the output.
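The remark about the matrix product is easy to verify numerically. In the Python sketch below, b is just the vector of estimated coefficients as printed (in model order), and A is the matrix shown in the output; A @ b extracts exactly the DEFLATOR and GNP coefficients, which is why AB = 0 states that those two population coefficients are simultaneously zero.

import numpy as np

# estimates in model order: CONSTANT, DEFLATOR, GNP, UNEMPLOY, ARMFORCE, POPULATN, TIME
b = np.array([-3482258.635, 15.062, -0.036, -2.020, -1.033, -0.051, 1829.151])

A = np.array([[0, 1, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0, 0]], dtype=float)

print(A @ b)    # [ 15.062  -0.036 ]: the DEFLATOR and GNP coefficients and nothing else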
Two Coefficients with an A Matrix
If you are experienced with matrix algebra, however, you can specify your own matrix
by using AMATRIX. When typing the matrix, be sure to separate cells with spaces and
press Enter between rows. The following simultaneously tests that DEFLATOR = 0 and
GNP = 0:

HYPOTHESIS
AMATRIX [0 1 0 0 0 0 0;
0 0 1 0 0 0 0]
TEST

You get the same output as above.
Why bother with AMATRIX when you can use EFFECT? Because in the A matrix, you can use any numbers, not just 0s and 1s. Here is a bizarre matrix:

1.0 3.0 0.5 64.3 3.0 2.0 0.0

You may not want to test this kind of hypothesis on the LONGLEY data, but there are important applications in the analysis of variance where you might.
Example 9
Testing Nonzero Null Hypotheses
You can test nonzero null hypotheses with a D matrix, often in combination with CONTRAST or AMATRIX. Here, we test whether the DEFLATOR coefficient
significantly differs from 30:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
ESTIMATE / TOL=.00001
HYPOTHESIS
AMATRIX [0 1 0 0 0 0 0]
DMATRIX [30]
TEST
The output is:
The same test of whether DEFLATOR differs from 30 can be specified more
efficiently using SPECIFY:
Example 10
Regression with Ecological or Grouped Data
If you have aggregated data, weight the regression by a count variable. This variable
should represent the counts of observations (n) contributing to the ith case. If n is not
an integer, SYSTAT truncates it to an integer before using it as a weight. The regression
results are identical to those produced if you had typed in each case.
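That equivalence between a frequency-weighted fit and a fit to the fully expanded file can be checked outside SYSTAT. The following is a minimal Python sketch with made-up numbers (the variables here are hypothetical, not the PLANTS file):

import numpy as np

# Grouped data: predictor x, response y, and an integer count for each aggregated case
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
count = np.array([3, 5, 2, 4])

def ols(xv, yv):
    X = np.column_stack([np.ones_like(xv), xv])   # design matrix with a constant
    return np.linalg.lstsq(X, yv, rcond=None)[0]

# Fit to the expanded file (each case typed in 'count' times)
b_expanded = ols(np.repeat(x, count), np.repeat(y, count))

# Same fit using weighted normal equations with the counts as frequencies
X = np.column_stack([np.ones_like(x), x])
W = np.diag(count.astype(float))
b_weighted = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(b_expanded, b_weighted)   # identical coefficient estimates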
We use, for this example, an ecological or grouped data file, PLANTS. The input is:
The output is:
Hypothesis.

A Matrix

1 2 3 4 5
0.0 1.000 0.0 0.0 0.0

6 7
0.0 0.0
Null hypothesis value for D
30.000
Test of Hypothesis

Source SS df MS F P

Hypothesis 2876.128 1 2876.128 0.031 0.864
Error 836424.056 9 92936.006
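For a single coefficient, this D-matrix test is just the squared t statistic for the difference between the estimate and the hypothesized value. A quick check in Python, using the DEFLATOR estimate, its standard error, and the residual mean square copied from the output above:

from scipy import stats

b, se = 15.062, 84.915          # DEFLATOR coefficient and standard error
mse, df_error = 92936.006, 9    # residual mean square and df from the ANOVA table

t = (b - 30.0) / se             # test of H0: DEFLATOR = 30
F = t ** 2                      # with 1 hypothesis df, F = t squared
ss_hyp = F * mse                # hypothesis sum of squares
p = stats.f.sf(F, 1, df_error)

print(round(ss_hyp, 1), round(F, 3), round(p, 3))   # about 2876.1, 0.031, 0.864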
HYPOTHESIS
SPECIFY deflator=30
TEST
REGRESS
USE plants
FREQ=count
MODEL co2 = CONSTANT + species
ESTIMATE
Dep Var: CO2 N: 76 Multiple R: 0.757 Squared multiple R: 0.573

Adjusted squared multiple R: 0.567 Standard error of estimate: 0.729

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

CONSTANT 13.738 0.204 0.0 . 67.273 0.000
SPECIES -0.466 0.047 -0.757 1.000 -9.961 0.000
Example 11
Regression without the Constant
To regress without the constant (intercept) term, or through the origin, remove the
constant from the list of independent variables. REGRESS adjusts accordingly. The
input is:
Some users are puzzled when they see a model without a constant having a higher
multiple correlation than a model that includes a constant. How can a regression with
fewer parameters predict better than another? It doesn't. The total sum of squares
must be redefined for a regression model with zero intercept. It is no longer centered
about the mean of the dependent variable. Other definitions of sums of squares can lead
to strange results, such as negative multiple correlations. If your constant is actually
near 0, then including or excluding the constant makes little difference in the output.
Kvålseth (1985) discusses the issues involved in summary statistics for zero-intercept
regression models. The definition of R² used in SYSTAT is Kvålseth's formula 7. This
was chosen because it retains its PRE (percentage reduction of error) interpretation and
is guaranteed to be in the (0,1) interval.
How, then, do you test the significance of a constant in a regression model? Include
a constant in the model as usual and look at its test of significance.
If you have a zero-intercept model where it is appropriate to compute a coefficient
of determination and other summary statistics about the centered data, use General
Linear Model and select Mixture model. This option provides Kvålseth's formula 1 for R²
and uses centered total sum of squares for other summary statistics.
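The point about redefining the total sum of squares can be illustrated with a small sketch (plain Python, toy data); this only shows why a zero-intercept R² is based on the uncentered total sum of squares, and does not reproduce SYSTAT's formulas.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.2, 9.8])

# Zero-intercept (through the origin) least-squares slope
b = (x @ y) / (x @ x)
resid = y - b * x
ss_res = resid @ resid

ss_total_uncentered = y @ y                       # sum of y squared
ss_total_centered = ((y - y.mean()) ** 2).sum()   # centered about the mean of y

r2_uncentered = 1 - ss_res / ss_total_uncentered  # stays in the (0,1) interval
r2_centered = 1 - ss_res / ss_total_centered      # can misbehave for poor zero-intercept fits

print(round(r2_uncentered, 4), round(r2_centered, 4))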
Effect Coefficient Lower < 95%> Upper

CONSTANT 13.738 13.331 14.144
SPECIES -0.466 -0.559 -0.372

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 52.660 1 52.660 99.223 0.000
Residual 39.274 74 0.531
-------------------------------------------------------------------------------
REGRESS
MODEL dependent = var1 + var2
ESTIMATE
References
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression diagnostics: Identifying
influential data and sources of collinearity. New York: John Wiley & Sons, Inc.
Flack, V. F. and Chang, P. C. (1987). Frequency of selecting noise variables in subset
regression analysis: A simulation study. The American Statistician, 41, 84–86.
Freedman, D. A. (1983). A note on screening regression equations. The American
Statistician, 37, 152–155.
Hocking, R. R. (1983). Developments in linear regression methodology: 1959–1982.
Technometrics, 25, 219–230.
Lovell, M. C. (1983). Data mining. The Review of Economics and Statistics, 65, 1–12.
Rencher, A. C. and Pun, F. C. (1980). Inflation of R-squared in best subset regression.
Technometrics, 22, 49–54.
Velleman, P. F. and Welsch, R. E. (1981). Efficient computing of regression diagnostics.
The American Statistician, 35, 234–242.
Weisberg, S. (1985). Applied linear regression. New York: John Wiley & Sons, Inc.
Wilkinson, L. (1979). Tests of significance in stepwise regression. Psychological Bulletin,
86, 168–174.
Wilkinson, L. and Dallal, G. E. (1982). Tests of significance in forward selection
regression with an F-to-enter stopping rule. Technometrics, 24, 25–28.
Chapter 15
Linear Models II:
Analysis of Variance
Leland Wilkinson and Mark Coward
SYSTAT handles a wide variety of balanced and unbalanced analysis of variance
designs. The Analysis of Variance (ANOVA) procedure includes all interactions in the
model and tests them automatically; it also provides analysis of covariance and
repeated measures designs. After you have estimated your ANOVA model, it is easy
to test post hoc pairwise differences in means or to test any contrast across cell means,
including simple effects.
For models with fixed and random effects, you can define error terms for specific
hypotheses. You can also do stepwise ANOVA (that is, Type I sums of squares).
Categorical variables are entered or deleted in blocks, and you can examine
interactively or automatically all combinations of interactions and main effects.
The General Linear Model (GLM) procedure is used for randomized block
designs, incomplete block designs, fractional factorials, Latin square designs, and
analysis of covariance with one or more covariates. GLM also includes repeated
measures, split plot, and crossover designs. It includes both univariate and
multivariate approaches to repeated measures designs.
Moreover, GLM features the means model for missing cells designs. Widely
favored for this purpose by statisticians (Searle, 1987; Hocking, 1985; Milliken and
Johnson, 1984), the means model allows tests of hypotheses in missing cells designs
(using what are often called Type IV sums of squares). Furthermore, the means model
allows direct tests of simple hypotheses (for example, within levels of other factors).
Finally, the means model allows easier use of population weights to reflect
differences in subclass sizes.
For both ANOVA and GLM, group sizes can be unequal for combinations of
grouping factors; but for repeated measures designs, each subject must have complete
data. You can use numeric or character values to code grouping variables.
You can store results of the analysis (predicted values and residuals) for further
study and graphical display. In ANCOVA, you can save adjusted cell means.
Analysis of Variance in SYSTAT
ANOVA: Estimate Model
To obtain an analysis of variance, from the menus choose:
Statistics
Analysis of Variance (ANOVA)
Estimate Model
Dependent. The variable(s) you want to examine. The dependent variable(s) should be
continuous numeric variables (for example, INCOME). For MANOVA (multivariate
analysis of variance), select two or more dependent variables.
Factor. One or more categorical variables (grouping variables) that split your cases into
two or more groups.
Missing values. Includes a separate category for cases with a missing value for the
variable(s) identified with Factor.
Covariates. A covariate is a quantitative independent variable that adds unwanted
variability to the dependent variable. An analysis of covariance (ANCOVA) adjusts or
removes the variability in the dependent variable due to the covariate (for example,
variability in cholesterol level might be removed by using AGE as a covariate).
Post hoc Tests. Post hoc tests determine which pairs of means differ significantly. The
following alternatives are available:
- Bonferroni. Multiple comparison test based on Student's t statistic. Adjusts the
  observed significance level for the fact that multiple comparisons are made.
- Tukey. Uses the Studentized range statistic to make all pairwise comparisons
  between groups and sets the experimentwise error rate to the error rate for the
  collection of all pairwise comparisons. When testing a large number of pairs of
  means, Tukey is more powerful than Bonferroni. For a small number of pairs,
  Bonferroni is more powerful.
- LSD. Least significant difference pairwise multiple comparison test. Equivalent to
  multiple t tests between all pairs of groups. The disadvantage of this test is that no
  attempt is made to adjust the observed significance level for multiple comparisons.
- Scheffé. The significance level of Scheffé's test is designed to allow all possible
  linear combinations of group means to be tested, not just the pairwise comparisons
  available in this feature. The result is that Scheffé's test is more conservative than
  other tests, meaning that a larger difference between means is required for
  significance.
Save file. You can save residuals and other data to a new data file. The following
alternatives are available:
- Residuals. Saves predicted values, residuals, Studentized residuals, leverages,
  Cook's D, and the standard error of predicted values. Only the predicted values and
  residuals are appropriate for ANOVA.
- Residuals/Data. Saves the statistics given by Residuals plus all of the variables in
  the working data file, including any transformed data values.
- Adjusted. Saves adjusted cell means from analysis of covariance.
- Adjusted/Data. Saves adjusted cell means plus all of the variables in the working
  data file, including any transformed data values.
- Model. Saves statistics given in Residuals and the variables used in the model.
- Coefficients. Saves estimates of the regression coefficients.
ANOVA: Hypothesis Test
Contrasts are used to test relationships among cell means. The Post hoc Tests on the
ANOVA dialog box are the simplest form because they compare two means at a
time. Use Specify or Contrast to define contrasts involving two or more means. For
example, contrast the average responses for two treatment groups against that for a
control group; or test whether average income increases linearly across cells ordered by
education (dropouts, high school graduates, college graduates). The coefficients for
the means of the first contrast might be (1, 1, -2) for a contrast of 1 * Treatment A plus
1 * Treatment B minus 2 * Control. The coefficients for the second contrast would be
(-1, 0, 1).
To define contrasts among the cell means, from the menus choose:
Statistics
Analysis of Variance (ANOVA)
Hypothesis Test
An ANOVA model must be estimated before any hypothesis tests can be performed.
Contrasts can be defined across the categories of a grouping factor or across the levels
of a repeated measure.
- Effects. Specify the factor (that is, grouping variable) to which the contrast applies.
  Selecting All yields a separate test of the effect of each factor in the ANOVA model,
  as well as tests of all interactions between those factors.
- Within. Use when specifying a contrast across the levels of a repeated measures
  factor. Enter the name assigned to the set of repeated measures.
Specify
To specify hypothesis test coefficients, click Specify in the ANOVA Hypothesis Test
dialog box.
To specify coefficients for a hypothesis test, use cell identifiers. Common hypothesis
tests include contrasts across marginal means or tests of simple effects. For a two-way
factorial ANOVA design with DISEASE (four categories) and DRUG (three
categories), you could contrast the marginal mean for the first level of drug against the
third level by specifying:
Note that square brackets enclose the value of the category (for example, for
GENDER$, specify GENDER$[male]). For the simple contrast of the first and third
levels of DRUG for the second disease only:
The syntax also allows statements like:
You have two error term options for hypothesis tests:
n Pooled. Uses the error term from the current model.
n Separate. Generates a separate variances error term.
DRUG[1] = DRUG[3]
DRUG[1] DISEASE[2] = DRUG[3] DISEASE[2]
-3*DRUG[1] - 1*DRUG[2] + 1*DRUG[3] + 3*DRUG[4]
Contrast
To specify contrasts, click Contrast in the ANOVA Hypothesis Test dialog box.
Contrast generates a contrast for a grouping factor or a repeated measures factor.
SYSTAT offers six types of contrasts.
- Custom. Enter your own custom coefficients. If your factor has, say, four ordered
  categories (or levels), you can specify your own coefficients, such as -3 -1 1 3, by
  typing these values in the Custom text box.
- Difference. Compare each level with its adjacent level.
- Polynomial. Generate orthogonal polynomial contrasts (to test linear, quadratic, or
  cubic trends across ordered categories or levels).
- Order. Enter 1 for linear, 2 for quadratic, and so on.
- Metric. Use Metric when the ordered categories are not evenly spaced. For
  example, when repeated measures are collected at weeks 2, 4, and 8, enter 2,4,8 as
  the metric.
- Sum. In a repeated measures ANOVA, total the values for each subject.
Repeated Measures
In a repeated measures design, the same variable is measured several times for each
subject (case). A paired-comparison t test is the simplest form of a repeated
measures design (for example, each subject has a before and after measure).
SYSTAT derives values from your repeated measures and uses them in analysis of
variance computations to test changes across the repeated measures (within subjects)
as well as differences between groups of subjects (between subjects). Tests of the
within-subjects values are called polynomial tests of order 1, 2, ..., up to k, where k is
one less than the number of repeated measures. The first polynomial is used to test
linear changes: do the repeated responses increase (or decrease) around a line with a
significant slope? The second polynomial tests whether the responses fall along a
quadratic curve, and so on.
To obtain a repeated measures analysis of variance, from the menus choose:
Statistics
Analysis of Variance (ANOVA)
Estimate Model
and click Repeated.
The following options are available:
Perform repeated measures analysis. Treats the dependent variables as a set of repeated
measures.
Optionally, you can assign a name for each set of repeated measures, specify the
number of levels, and specify the metric for unevenly spaced repeated measures.
- Name. Name that identifies each set of repeated measures.
- Levels. Number of repeated measures in the set. For example, if you have three
  dependent variables that represent measurements at different times, the number of
  levels is 3.
- Metric. Metric that indicates the spacing between unevenly spaced measurements.
  For example, if measurements were taken at the third, fifth, and ninth weeks, the
  metric would be 3, 5, 9.
Using Commands
To use ANOVA for analysis of covariance, insert COVARIATE before ESTIMATE.
After estimating a model, use HYPOTHESIS to test its parameters. Begin each test with
HYPOTHESIS and end with TEST.
Usage Considerations
Types of data. ANOVA requires a rectangular data file.
Print options. If PRINT=SHORT, output includes an ANOVA table. The MEDIUM length
adds least-squares means to the output. LONG adds estimates of the coefficients.
Quick Graphs. ANOVA plots the group means against the groups.
Saving files. ANOVA can save predicted values, residuals, Studentized residuals,
leverages, Cook's D, standard error of predicted values, adjusted cell means, and
estimates of the coefficients.
BY groups. ANOVA performs separate analyses for each level of any BY variables.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. You can use a FREQUENCY variable to duplicate cases.
Case weights. ANOVA uses a WEIGHT variable, if present, to weight cases.
ANOVA
USE filename
CATEGORY / MISS
DEPEND / REPEAT NAMES
BONF or TUKEY or LSD or SCHEFFE
SAVE filename / ADJUST, MODEL, RESID, DATA
ESTIMATE
HYPOTHESIS
EFFECT or WITHIN
ERROR
POST / LSD or TUKEY or BONF or SCHEFFE
POOLED or SEPARATE
or CONTRAST / DIFFERENCE or POLYNOMIAL or SUM or ORDER
or METRIC
or SPECIFY / POOLED or SEPARATE
AMATRIX
CMATRIX
TEST
Examples
Example 1
One-Way ANOVA
How does equipment influence typing performance? This example uses a one-way
design to compare average typing speed for three groups of typists. Fourteen beginning
typists were randomly assigned to three types of machines and given speed tests.
Following are their typing speeds in words per minute:
The data are stored in the SYSTAT data file named TYPING. The average speeds for
the typists in the three groups are 50.4, 46.5, and 69.8 words per minute, respectively.
To test the hypothesis that the three samples have the same population average speed,
the input is:
The output follows:
Electric   Plain old   Word processor
   52          52            67
   47          43            73
   51          47            70
   49          44            75
   53                        64
USE typing
ANOVA
CATEGORY equipmnt$
DEPEND speed
ESTIMATE
Dep Var: SPEED N: 14 Multiple R: 0.95 Squared multiple R: 0.91
Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P
EQUIPMNT$ 1469.36 2 734.68 53.52 0.00
Error 151.00 11 13.73
For the dependent variable SPEED, SYSTAT reads 14 cases. The multiple correlation
(Multiple R) for SPEED with the two design variables for EQUIPMNT$ is 0.952. The
square of this correlation (Squared multiple R) is 0.907. The grouping structure
explains 90.7% of the variability of SPEED.
The layout of the ANOVA table is standard in elementary texts; you will find
formulas and definitions there. F-ratio is the Mean-Square for EQUIPMNT$ divided
by the Mean-Square for Error. The distribution of the F ratio is sensitive to the
assumption of equal population group variances. The p value is the probability of
exceeding the F ratio when the group means are equal. The p value printed here is
0.000, so it is less than 0.0005. If the population means are equal, it would be very
unusual to find sample means that differ as much as these; you could expect such a
large F ratio fewer than five times out of 10,000.
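The F ratio in this table is easy to reproduce outside SYSTAT. A minimal Python sketch using the typing speeds listed above:

from scipy import stats

electric = [52, 47, 51, 49, 53]
plain_old = [52, 43, 47, 44]
word_proc = [67, 73, 70, 75, 64]

# One-way ANOVA: F is the between-groups mean square over the within-groups mean square
F, p = stats.f_oneway(electric, plain_old, word_proc)
print(round(F, 2), p)   # F is about 53.5, with p well below 0.0005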
The Quick Graph illustrates this finding. Although the typists using electric and
plain old typewriters have similar average speeds (50.4 and 46.5, respectively), the
word processor group has a much higher average speed.
Pairwise Mean Comparisons
An analysis of variance indicates whether (at least) one of the groups differs from the
others. However, you cannot determine which group(s) differ based on ANOVA
results. To examine specific group differences, use post hoc tests.
In this example, we use the Bonferroni method for the typing speed data used in the
one-way ANOVA example. As an aid in interpretation, we order the equipment
categories from least to most advanced. The input is:
SYSTAT assigns a number to each of the three groups and uses those numbers in the
output panels that follow:
In the first column, you can read differences in average typing speed for the group
using plain old typewriters. In the second row, you see that they average 3.9 words per
minute fewer than those using electric typewriters; but in the third row, you see that
they average 23.3 words per minute fewer than the group using word processors. To see whether
these differences are significant, look at the probabilities in the corresponding
locations at the bottom of the table.
The probability associated with 3.9 is 0.43, so you are unable to detect a difference
in performance between the electric and plain old groups. The probabilities in the third
row are both 0.00, indicating that the word processor group averages significantly
more words per minute than the electric and plain old groups.
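The Bonferroni probabilities in this table come from ordinary t tests based on the pooled model MSE, with each p value multiplied by the number of comparisons. Here is a sketch of the electric versus plain old comparison, using the group means, sample sizes, and MSE reported above:

from math import sqrt
from scipy import stats

mse, df_error = 13.727, 11           # model mean-square error and its df
n_plain, n_electric = 4, 5
diff = 50.4 - 46.5                   # electric mean minus plain old mean

se = sqrt(mse * (1 / n_plain + 1 / n_electric))
t = diff / se
p_raw = 2 * stats.t.sf(abs(t), df_error)
p_bonferroni = min(1.0, 3 * p_raw)   # three pairwise comparisons among three groups

print(round(t, 2), round(p_bonferroni, 2))   # about t = 1.57, adjusted p = 0.43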
USE typing
ORDER equipmnt$ / SORT='plain old', 'electric', 'word process'
ANOVA
CATEGORY equipmnt$
DEPEND speed / BONF
ESTIMATE
COL/
ROW EQUIPMNT$
1 plain old
2 electric
3 word process
Using least squares means.
Post Hoc test of SPEED
------------------------------------------------------------------------

Using model MSE of 13.727 with 11 df.
Matrix of pairwise mean differences:

1 2 3
1 0.0
2 3.90 0.0
3 23.30 19.40 0.0

Bonferroni Adjustment.
Matrix of pairwise comparison probabilities:

1 2 3
1 1.00
2 0.43 1.00
3 0.00 0.00 1.00
Example 2
ANOVA Assumptions and Contrasts
An important assumption in analysis of variance is that the population variances are
equalthat is, that the groups have approximately the same spread. When variances
differ markedly, a transformation may remedy the problem. For example, sometimes it
helps to take the square root of each value of the outcome variable (or log transform
each value) and use the transformed value in the analysis.
In this example, we use a subset of the cases from the SURVEY2 data file to address
the question, "For males, does average income vary by education?" We focus on those
who:
- Did not graduate from high school (HS dropout)
- Graduated from high school (HS grad)
- Attended some college (Some college)
- Graduated from college (College grad)
- Have an M.A. or Ph.D. (Degree +)
For each male subject (case) in the SURVEY2 data file, use the variables INCOME and
EDUC$. The means, standard deviations, and sample sizes for the five groups are
shown below:
Visually, as you move across the groups, you see that average income increases. But
considering the variability within each group, you might wonder if the differences are
significant. Also, there is a relationship between the means and standard deviations: as
the means increase, so do the standard deviations. They should be independent. If you
take the square root of each income value, there is less variability among the standard
deviations, and the relation between the means and standard deviations is weaker:
        HS dropout   HS grad   Some college   College grad   Degree +
mean      $13,389    $21,231      $29,294        $30,937      $38,214
sd         10,639     13,176       16,465         16,894       18,230
n              18         39           17             16           14

        HS dropout   HS grad   Some college   College grad   Degree +
mean        3.371      4.423        5.190          5.305        6.007
sd          1.465      1.310        1.583          1.725        1.516
A bar chart for the data will show the effect of the transformation. The input is:
The charts follow:
In the chart on the left, you can see a relation between the height of the bars (means)
and the length of the error bars (standard errors). The smaller means have shorter error
bars than the larger means. After transformation, there is less difference in length
among the error bars. The transformation helps eliminate the dependency between
the group means and the standard deviations.
To test for differences among the means:
USE survey2
SELECT sex$ = 'Male'
LABEL educatn / 1,2='HS dropout', 3='HS grad',
                4='Some college', 5='College grad',
                6,7='Degree +'
CATEGORY educatn
BEGIN
BAR income * educatn / SERROR FILL=.5 LOC=-3IN,0IN
BAR income * educatn / SERROR FILL=.35 YPOW=.5,
LOC=3IN,0IN
END
ANOVA
LET sqrt_inc = SQR(income)
DEPEND sqrt_inc
ESTIMATE
[Bar charts: mean INCOME by EDUCATN with standard-error bars, on the raw scale (left) and with a square-root-scaled axis (right).]
The output is:
The ANOVA table using the transformed income as the dependent variable suggests a
significant difference among the five means (p < 0.0005).
Tukey Pairwise Mean Comparisons
Which means differ? This example uses the Tukey method to identify significant
differences in pairs of means. Hopefully, you reach the same conclusions using either
the Tukey or Bonferroni methods. However, when the number of comparisons is very
large, the Tukey procedure may be more sensitive in detecting differences; when the
number of comparisons is small, Bonferroni may be more sensitive. The input is:
The output follows:
Dep Var: SQRT_INC N: 104 Multiple R: 0.49 Squared multiple R: 0.24

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

EDUCATN 68.62 4 17.16 7.85 0.00

Error 216.26 99 2.18
ANOVA
LET sqrt_inc = SQR(income)
DEPEND sqrt_inc / TUKEY
ESTIMATE
COL/
ROW EDUCATN
1 HS dropout
2 HS grad
3 Some college
4 College grad
5 Degree +
Using least squares means.
Post Hoc test of SQRT_INC
-------------------------------------------------------------------------------

Using model MSE of 2.184 with 99 df.
Matrix of pairwise mean differences:

1 2 3 4 5
1 0.0
2 1.052 0.0
3 1.819 0.767 0.0
4 1.935 0.883 0.116 0.0
5 2.636 1.584 0.817 0.701 0.0
Tukey HSD Multiple Comparisons.
Matrix of pairwise comparison probabilities:

1 2 3 4 5
1 1.000
2 0.100 1.000
3 0.004 0.387 1.000
4 0.002 0.268 0.999 1.000
5 0.000 0.007 0.545 0.694 1.000
The layout of the output panels for the Tukey method is the same as that for the
Bonferroni method. Look first at the probabilities at the bottom of the table. Four of
the probabilities indicate significant differences (they are less than 0.05). In the first
column, row 3, the average income for high school dropouts differs from those with
some college (p = 0.004), from college graduates (p = 0.002), and also from those with
advanced degrees (p < 0.0005). The fifth row shows that the difference between those
with advanced degrees and the high school graduates is significant (p = 0.007).
Contrasts
In this example, the five groups are ordered by their level of education, so you use these
coefficients to test linear and quadratic contrasts:

Linear      -2  -1   0   1   2
Quadratic    2  -1  -2  -1   2

Then you ask, "Is there a linear increase in average income across the five ordered
levels of education? A quadratic change?" The input follows:
HYPOTHESIS
NOTE 'Test of linear contrast',
'across ordered group means'
EFFECT = educatn
CONTRAST [-2 -1 0 1 2]
TEST
HYPOTHESIS
NOTE 'Test of quadratic contrast',
'across ordered group means'
EFFECT = educatn
CONTRAST [2 -1 -2 -1 2]
TEST
SELECT
The resulting output is:
The F statistic for testing the linear contrast is 29.089 (p value < 0.0005); for testing
the quadratic contrast, it is 1.008 (p value = 0.32). Thus, you can report that there is a
highly significant linear increase in average income across the five levels of education
and that you have not found a quadratic component in this increase.
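For a one-way design, the sum of squares for a contrast such as the linear one can be computed directly from the group means and sample sizes. A Python sketch using the square-root-income means, the group sizes, and the model MSE reported above:

import numpy as np
from scipy import stats

means = np.array([3.371, 4.423, 5.190, 5.305, 6.007])   # HS dropout ... Degree +
n = np.array([18, 39, 17, 16, 14])
c = np.array([-2, -1, 0, 1, 2])                          # linear contrast coefficients
mse, df_error = 2.184, 99

ss_contrast = (c @ means) ** 2 / np.sum(c ** 2 / n)      # single-df contrast sum of squares
F = ss_contrast / mse
p = stats.f.sf(F, 1, df_error)

print(round(ss_contrast, 2), round(F, 2), round(p, 5))   # about 63.5, 29.1, p < 0.0005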
Example 3
Two-Way ANOVA
Consider the following two-way analysis of variance design from Afifi and Azen
(1972), cited in Kutner (1974), and reprinted in BMDP manuals. The dependent
variable, SYSINCR, is the change in systolic blood pressure after administering one of
four different drugs to patients with one of three different diseases. Patients were
assigned randomly to one of the possible drugs. The data are stored in the SYSTAT file
AFIFI.
Test of linear contrast
across ordered group means
Test for effect called: EDUCATN

A Matrix

1 2 3 4 5
0.0 -4.00 -3.00 -2.00 -1.00
Test of Hypothesis

Source SS df MS F P

Hypothesis 63.54 1 63.54 29.09 0.00
Error 216.26 99 2.18

-------------------------------------------------------------------------------

Test of quadratic contrast
across ordered group means
Test for effect called: EDUCATN

A Matrix

1 2 3 4 5
0.0 0.0 -3.00 -4.00 -3.00
Test of Hypothesis

Source SS df MS F P

Hypothesis 2.20 1 2.20 1.01 0.32
Error 216.26 99 2.18
To obtain a least-squares two-way analysis of variance:
Because this is a factorial design, ANOVA automatically generates an interaction term
(DRUG * DISEASE). The output follows:
USE afifi
ANOVA
CATEGORY drug disease
DEPEND sysincr
SAVE myresids / RESID DATA
ESTIMATE
Dep Var: SYSINCR N: 58 Multiple R: 0.675 Squared multiple R: 0.456

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

DRUG 2997.472 3 999.157 9.046 0.000
DISEASE 415.873 2 207.937 1.883 0.164
DRUG*DISEASE 707.266 6 117.878 1.067 0.396

Error 5080.817 46 110.453
In two-way ANOVA, begin by examining the interaction. If the interaction is
significant, you must condition your conclusions about a given factor's effects on the
level of the other factor. The DRUG * DISEASE interaction is not significant (p =
0.396), so shift your focus to the main effects.
The DRUG effect is significant (p < 0.0005), but the DISEASE effect is not (p = 0.164).
Thus, at least one of the drugs differs from the others with respect to blood pressure
change, but blood pressure change does not vary significantly across diseases.
For each factor, SYSTAT produces a plot of the average value of the dependent
variable for each level of the factor. For the DRUG plot, drugs 1 and 2 yield similar
average blood pressure changes. However, the average blood pressure change for
drugs 3 and 4 is much lower. ANOVA tests the significance of the differences
illustrated in this plot.

[Quick Graphs: least-squares means of SYSINCR against DRUG$ and against DISEASE, plus interaction plots of SYSINCR by DRUG$ for each level of DISEASE.]
For the DISEASE plot, we see a gradual decrease in blood pressure change across
the three diseases. However, this effect is not significant; there is not enough variation
among these means to overcome the variation due to individual differences.
In addition to the plot for each factor, SYSTAT also produces plots of the average
blood pressure change at each level of DRUG for each level of disease. Use these plots
to illustrate interaction effects. Although the interaction effect is not significant in this
example, we can still examine these plots.
In general, we see a decline in blood pressure change across drugs. (Keep in mind
that the drugs are only artificially ordered. We could reorder the drugs, and although
the ANOVA results wouldn't change, the plots would differ.) The similarity of the
plots illustrates the nonsignificant interaction.
A close correspondence exists between the factor plots and the interaction plots. The
means plotted in the factor plot for DISEASE correspond to the weighted average of
the four points in each of the interaction plots. Similarly, each mean plotted in the
DRUG factor plot corresponds to the weighted average of the three corresponding
points across interaction plots. Consequently, the significant DRUG effect can be seen
in the differing means in each interaction plot. Can you see the nonsignificant
DISEASE effect in the interaction plots?
Least-Squares ANOVA
If you have an orthogonal design (equal number of cases in every cell), you will find
that the ANOVA table is the same one you get with any standard program. SYSTAT
can handle non-orthogonal designs, however (as in the present example). To
understand the sources for sums of squares, you must know something about least-
squares ANOVA.
As with one-way ANOVA, your specifying factor levels causes SYSTAT to create
dummy variables out of the classifying input variable. SYSTAT creates one fewer
dummy variable than the number of categories specified.
Coding of the dummy variables is the classic analysis of variance parameterization,
in which the sum of effects estimated for a classifying variable is 0 (Scheffé, 1959). In
our example, DRUG has four categories; therefore, SYSTAT creates three dummy
variables with the following scores for subjects at each level:
Because DISEASE has three categories, SYSTAT creates two dummy variables to be
coded as follows:
Now, because there are no continuous predictors in the model (unlike the analysis of
covariance), you have a complete design matrix of dummy variables as follows (DRUG
is labeled with an a, DISEASE with a b, and the grand mean with an m):
This example is used to explain how SYSTAT gets an error term for the ANOVA table.
Because it is a least-squares program, the error term is taken from the residual sum of
squares in the regression onto the above dummy variables. For non-orthogonal designs,
this choice is identical to that produced by BMDP2V and SPSS GLM with Type III sums
of squares. These, in general, will be the hypotheses you want to test on unbalanced
 1  0  0   for DRUG = 1 subjects
 0  1  0   for DRUG = 2 subjects
 0  0  1   for DRUG = 3 subjects
-1 -1 -1   for DRUG = 4 subjects

 1  0   for DISEASE = 1 subjects
 0  1   for DISEASE = 2 subjects
-1 -1   for DISEASE = 3 subjects
Treatment  Mean      DRUG         DISEASE          Interaction
 A  B       m     a1  a2  a3     b1  b2    a1b1 a1b2 a2b1 a2b2 a3b1 a3b2
 1  1       1      1   0   0      1   0      1    0    0    0    0    0
 1  2       1      1   0   0      0   1      0    1    0    0    0    0
 1  3       1      1   0   0     -1  -1     -1   -1    0    0    0    0
 2  1       1      0   1   0      1   0      0    0    1    0    0    0
 2  2       1      0   1   0      0   1      0    0    0    1    0    0
 2  3       1      0   1   0     -1  -1      0    0   -1   -1    0    0
 3  1       1      0   0   1      1   0      0    0    0    0    1    0
 3  2       1      0   0   1      0   1      0    0    0    0    0    1
 3  3       1      0   0   1     -1  -1      0    0    0    0   -1   -1
 4  1       1     -1  -1  -1      1   0     -1    0   -1    0   -1    0
 4  2       1     -1  -1  -1      0   1      0   -1    0   -1    0   -1
 4  3       1     -1  -1  -1     -1  -1      1    1    1    1    1    1
experimental data. You can construct other types of sums of squares by using an A matrix
or by running your ANOVA model using the Stepwise options in GLM. Consult the
references if you do not already know what these sums of squares mean.
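If you want to see this parameterization outside SYSTAT, the sum-to-zero (effects) coding for a factor is easy to generate. A minimal Python sketch for a four-level factor such as DRUG:

import numpy as np

def effect_code(levels, k):
    """Return sum-to-zero dummy variables (k-1 columns) for integer levels 1..k."""
    X = np.zeros((len(levels), k - 1))
    for row, lev in enumerate(levels):
        if lev < k:
            X[row, lev - 1] = 1.0      # levels 1..k-1 get a single 1
        else:
            X[row, :] = -1.0           # the last level is coded -1 on every column
    return X

drug = [1, 2, 3, 4]
print(effect_code(drug, 4))
# [[ 1.  0.  0.]
#  [ 0.  1.  0.]
#  [ 0.  0.  1.]
#  [-1. -1. -1.]]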
Post Hoc Tests
It is evident that only the main effect for DRUG is significant; therefore, you might
want to test some contrasts on the DRUG effects. A simple way would be to use the
Bonferroni method to test all pairwise comparisons of marginal drug means. However,
to compare three or more means, you must specify the particular contrast of interest.
Here, we compare the first and third drugs, the first and fourth drugs, and the first two
drugs with the last two drugs. The input is:
You need four numbers in each contrast because DRUG has four levels. You cannot use
CONTRAST to specify coefficients for interaction terms. It creates an A matrix only for
main effects. Following are the results of the above hypothesis tests:
HYPOTHESIS
EFFECT = drug
CONTRAST [1 0 -1 0]
TEST
HYPOTHESIS
EFFECT = drug
CONTRAST [1 0 0 -1]
TEST
HYPOTHESIS
EFFECT = drug
CONTRAST [1 1 -1 -1]
TEST
Test for effect called: DRUG

A Matrix

1 2 3 4 5
0.0 1.000 0.0 -1.000 0.0

6 7 8 9 10
0.0 0.0 0.0 0.0 0.0

11 12
0.0 0.0
Test of Hypothesis

Source SS df MS F P

Hypothesis 1697.545 1 1697.545 15.369 0.000
Error 5080.817 46 110.453
-------------------------------------------------------------------------------
Notice the A matrices in the output. SYSTAT automatically takes into account the
degree of freedom lost in the design coding. Also, notice that you do not need to
normalize contrasts or rows of the A matrix to unit vector length, as in some ANOVA
programs. If you use (2 0 -2 0) or (0.707 0 -0.707 0) instead of (1 0 -1 0), you get the
same sum of squares.
For the comparison of the first and third drugs, the F statistic is 15.369 (p value
< 0.0005), indicating that these two drugs differ. Looking at the Quick Graphs
produced earlier, we see that the change in blood pressure was much smaller for the
third drug.
Notice that in the A matrix created by the contrast of the first and fourth drugs, you
get (2 1 1) in place of the three design variables corresponding to the appropriate
columns of the A matrix. Because you selected the reduced form for coding of design
variables in which sums of effects are 0, you have the following restriction for the
DRUG effects:
Test for effect called: DRUG

A Matrix

1 2 3 4 5
0.0 2.000 1.000 1.000 0.0

6 7 8 9 10
0.0 0.0 0.0 0.0 0.0

11 12
0.0 0.0
Test of Hypothesis

Source SS df MS F P

Hypothesis 1178.892 1 1178.892 10.673 0.002
Error 5080.817 46 110.453
-------------------------------------------------------------------------------

Test for effect called: DRUG

A Matrix

1 2 3 4 5
0.0 2.000 2.000 0.0 0.0

6 7 8 9 10
0.0 0.0 0.0 0.0 0.0

11 12
0.0 0.0
Test of Hypothesis

Source SS df MS F P

Hypothesis 2982.934 1 2982.934 27.006 0.000
Error 5080.817 46 110.453

0 = α1 + α2 + α3 + α4

where each α is the effect for that level of DRUG. This means that

α4 = -(α1 + α2 + α3)

and the contrast DRUG(1) - DRUG(4) is equivalent to

α1 - [-(α1 + α2 + α3)]

which is

2α1 + α2 + α3

For the final contrast, SYSTAT transforms the (1 1 -1 -1) specification into contrast
coefficients of (2 2 0) for the dummy coded variables. The p value (< 0.0005) indicates
that the first two drugs differ from the last two drugs.
Simple Effects
You can do simple contrasts between drugs within levels of disease (although the lack
of a significant DRUG * DISEASE interaction does not justify it). To show how it is
done, consider a contrast between the first and third levels of DRUG for the first
DISEASE only. You must specify the contrast in terms of the cell means. Use the
terminology:
MEAN(DRUG index, DISEASE index) = M{i,j}

You want to contrast cell means M{1,1} and M{3,1}. These are composed of:

M{1,1} = μ + α1 + β1 + αβ11
M{3,1} = μ + α3 + β1 + αβ31

Therefore, the difference between the two means is:

M{1,1} - M{3,1} = α1 - α3 + αβ11 - αβ31

Now, if you consider the coding of the variables, you can construct an A matrix that
picks up the appropriate columns of the design matrix. Here are the column labels of
the design matrix (a means DRUG and b means DISEASE) to serve as a column ruler
over the A matrix specified in the hypothesis.
The corresponding input is:
The output follows:
After you understand how SYSTAT codes design variables and how the model
sentence orders them, you can take any standard ANOVA text like Winer (1971) or
Scheffé (1959) and construct an A matrix for any linear contrast.
Contrasting Marginal and Cell Means
Now look at how to contrast cell means directly without being concerned about how
they are coded. Test the first level of DRUG against the third (contrasting the marginal
means) with the following input:
m   a1  a2  a3   b1  b2   a1b1 a1b2 a2b1 a2b2 a3b1 a3b2
0    1   0  -1    0   0     1    0    0    0   -1    0
HYPOTHESIS
AMATRIX [0 1 0 -1 0 0 1 0 0 0 -1 0]
TEST
Hypothesis.

A Matrix

1 2 3 4 5
0.0 1.000 0.0 -1.000 0.0

6 7 8 9 10
0.0 1.000 0.0 0.0 0.0

11 12
-1.000 0.0
Test of Hypothesis

Source SS df MS F P

Hypothesis 338.000 1 338.000 3.060 0.087
Error 5080.817 46 110.453
HYPOTHESIS
SPECIFY drug[1] = drug[3]
TEST
To contrast the first against the fourth:
Finally, here is the simple contrast of the first and third levels of DRUG for the first
DISEASE only:
Screening Results
Let's examine the AFIFI data in more detail. To use the residuals to examine the
ANOVA assumptions, first plot the residuals against estimated values (cell means) to
check for homogeneity of variance. Use the Studentized residuals to reference them
against a t distribution. In addition, stem-and-leaf plots of the residuals and boxplots of
the dependent variable aid in identifying outliers. The input is:
HYPOTHESIS
SPECIFY drug[1] = drug[4]
TEST
HYPOTHESIS
SPECIFY drug[1] disease[1] = drug[3] disease[1]
TEST
USE afifi
ANOVA
CATEGORY drug disease
DEPEND sysincr
SAVE myresids / RESID DATA
ESTIMATE
DENSITY sysincr * drug / BOX
USE myresids
PLOT student*estimate / SYM=1 FILL=1
STATISTICS
STEM student
The plots suggest the presence of an outlier. The smallest value in the stem-and-leaf
plot seems to be out of line. A t statistic value of -2.647 corresponds to p < 0.01, and
you would not expect a value this small to show up in a sample of only 58 independent
values. In the scatterplot, the point corresponding to this value appears at the bottom
and badly skews the data in its cell (which happens to be DRUG1, DISEASE3). The
outlier in the first group clearly stands out in the boxplot, too.
To see the effect of this outlier, delete the observation with the outlying Studentized
residual. Then, run the analysis again. Following is the ANOVA output for the revised
data:
The differences are not substantial. Nevertheless, notice that the DISEASE effect is
substantially attenuated when only one case out of 58 is deleted. Daniel (1960) gives
an example in which one outlying case alters the fundamental conclusions of a
designed experiment. The F test is robust to certain violations of assumptions, but
factorial ANOVA is not robust against outliers. You should routinely do these plots for
ANOVA.
Stem and Leaf Plot of variable: STUDENT, N = 58
Minimum: -2.647
Lower hinge: -0.761
Median: 0.101
Upper hinge: 0.698
Maximum: 1.552

-2 6
-2
-1 987666
-1 410
-0 H 9877765
-0 4322220000
0 M 001222333444
0 H 55666888
1 011133444
1 55
Dep Var: SYSINCR N: 57 Multiple R: .710 Squared Multiple R: .503
Analysis of Variance
Source Sum-of-Squares DF Mean-Square F-Ratio P
DRUG 3344.064 3 1114.688 11.410 0.000
DISEASE 232.826 2 116.413 1.192 0.313
DRUG*DISEASE 676.865 6 112.811 1.155 0.347
Error 4396.367 45 97.697
Example 4
Single-Degree-of-Freedom Designs
The data in the REACT file involve yields of a chemical reaction under various
combinations of four binary factors (A, B, C, and D). Two reactions were observed
under each combination of experimental factors, so the number of cases per cell is two.
To analyze these data in a four-way ANOVA:
You can see the advantage of ANOVA over GLM when you have several factors; you
have to select only the main effects. With GLM, you have to specify the interactions
and identify which variables are categorical (that is, A, B, C, and D). The following
example is the full model using GLM:
The ANOVA output follows:
The output shows a significant main effect for the first factor (A) plus one significant
interaction (A*C*D).
USE react
ANOVA
CATEGORY a, b, c, d
DEPEND yield
ESTIMATE
MODEL yield = CONSTANT + a + b + c + d +,
a*b + a*c + a*d + b*c + b*d + c*d +,
a*b*c + a*b*d + a*c*d + b*c*d +,
a*b*c*d
Dep Var: YIELD N: 32 Multiple R: 0.755 Squared multiple R: 0.570


Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

A 369800.000 1 369800.000 4.651 0.047
B 1458.000 1 1458.000 0.018 0.894
C 5565.125 1 5565.125 0.070 0.795
D 172578.125 1 172578.125 2.170 0.160
A*B 87153.125 1 87153.125 1.096 0.311
A*C 137288.000 1 137288.000 1.727 0.207
A*D 328860.500 1 328860.500 4.136 0.059
B*C 61952.000 1 61952.000 0.779 0.390
B*D 3200.000 1 3200.000 0.040 0.844
C*D 3160.125 1 3160.125 0.040 0.844
A*B*C 81810.125 1 81810.125 1.029 0.326
A*B*D 4753.125 1 4753.125 0.060 0.810
A*C*D 415872.000 1 415872.000 5.230 0.036
B*C*D 4.500 1 4.500 0.000 0.994
A*B*C*D 15051.125 1 15051.125 0.189 0.669

Error 1272247.000 16 79515.437
Assessing Normality
Let's look at the study more closely. Because this is a single-degree-of-freedom study
(a 2^n factorial), each effect estimate is normally distributed if the usual assumptions for
the experiment are valid. All of the effects estimates, except the constant, have zero
mean and common variance (because dummy variables were used in their
computation). Thus, you can compare them to a normal distribution. SYSTAT
remembers your last selections, so the input is:
This reestimates the model and saves the regression coefficients (effects). The file has
one case with 16 variables (CONSTANT plus 15 effects). The effects are labeled X(1),
X(2), and so on because they are related to the dummy variables, not the original
variables A, B, C, and D. Let's transpose this file into a new file containing only the 15
effects and create a probability plot of the effects. The input is:
The resulting plot is:
These effects are indistinguishable from a random normal variable. They plot almost
on a straight line. What does it mean for the study and for the significant F tests?
SAVE effects / COEF
ESTIMATE
USE effects
DROP constant
TRANSPOSE
SELECT case > 1
PPLOT col(1) / FILL=1 SYMBOL=1 XLABEL='Estimates of Effects'
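The same kind of probability plot can be drawn outside SYSTAT. A hedged Python sketch, assuming the 15 saved effect estimates have been placed in a plain array (the values generated here are placeholders, not the REACT effects):

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

effects = np.random.default_rng(0).normal(size=15)   # stand-in for the 15 saved effects

# Normal probability plot: effects that are pure noise fall close to a straight line
stats.probplot(effects, dist="norm", plot=plt)
plt.xlabel("Normal quantiles")
plt.ylabel("Estimates of effects")
plt.show()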
It's time to reveal that the data were produced by a random number generator.
- If you are doing a factorial analysis of variance, the p values you see on the output
  are not adjusted for the number of factors. If you do a three-way design, look at
  seven tests (excluding the constant). For a four-way design, examine 15 tests. Out
  of 15 F tests on random data, expect to find at least one test approaching
  significance. You have two significant and one almost significant, which is not far
  out of line. The probabilities for each separate F test need to be corrected for the
  experimentwise error rate. Some authors devote entire chapters to fine distinctions
  between multiple comparison procedures and then illustrate them within a
  multifactorial design that is not corrected for the experimentwise error rate, as just
  demonstrated. Remember that a factorial design is a multiple comparison. If you
  have a single-degree-of-freedom study, use the procedure you used to draw a
  probability plot of the effects. Any effect that is really significant will become
  obvious.
- If you have a factorial study with more degrees of freedom on some factors, use the
  Bonferroni critical value for deciding which effects are significant. It guarantees
  that the Type I error rate for the study will be no greater than the level you choose.
  In the above example, this value is 0.05 / 15 (that is, 0.003).
- Multiple F tests based on a common denominator (mean-square error in this
  example) are correlated. This complicates the problem further. In general, the
  greater the discrepancy between numerator and denominator degrees of freedom
  and the smaller the denominator degrees of freedom, the greater the dependence of
  the tests. The Bonferroni tests are best in this situation, although Feingold and
  Korsog (1986) offer some useful alternatives.
Example 5
Mixed Models
Mixed models involve combinations of fixed and random factors in an ANOVA. Fixed
factors are assumed to be composed of an exhaustive set of categories (for example,
males and females), while random factors have category levels that are assumed to
have been randomly sampled from a larger population of categories (for example,
classrooms or word stems). Because of the mixing of fixed and random components,
expected mean squares for certain effects are different from those for fully fixed or
fully random designs. SYSTAT can handle mixed models because you can specify
error terms for specific hypotheses.
For example, let's analyze the AFIFI data with a mixed model instead of a fully
fixed factorial. Here, you are interested in the four drugs as wide-spectrum disease
killers. Because each drug is now thought to be effective against diseases in general,
you have sampled three random diseases to assess the drugs. This implies that
DISEASE is a random factor and DRUG remains a fixed factor. In this case, the error
term for DRUG is the DRUG * DISEASE interaction. To begin, run the same analysis
we performed in the two-way example to get the ANOVA table. To test for the DRUG
effect, specify drug * disease as the error term in a hypothesis test. The input is:
The output is:
Notice that the SS, df, and MS for the error term in the hypothesis test correspond to the
values for the interaction in the ANOVA table.
USE afifi
ANOVA
CATEGORY drug, disease
DEPEND sysincr
ESTIMATE
HYPOTHESIS
EFFECT = drug
ERROR = drug*disease
TEST
Dep Var: SYSINCR N: 58 Multiple R: 0.675 Squared multiple R: 0.456


Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

DRUG 2997.472 3 999.157 9.046 0.000
DISEASE 415.873 2 207.937 1.883 0.164
DRUG*DISEASE 707.266 6 117.878 1.067 0.396

Error 5080.817 46 110.453
Test for effect called: DRUG

Test of Hypothesis

Source SS df MS F P

Hypothesis 2997.472 3 999.157 8.476 0.014
Error 707.266 6 117.878
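The mixed-model F ratio is simply the DRUG mean square divided by the DRUG*DISEASE mean square, referred to an F distribution with 3 and 6 degrees of freedom. A quick Python check using the mean squares from the table above:

from scipy import stats

ms_drug, df_drug = 999.157, 3
ms_interaction, df_interaction = 117.878, 6   # error term for DRUG when DISEASE is random

F = ms_drug / ms_interaction
p = stats.f.sf(F, df_drug, df_interaction)
print(round(F, 3), round(p, 3))   # about 8.476 and 0.014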
Example 6
Separate Variance Hypothesis Tests
The data in the MJ20 data file are from Milliken and Johnson (1984). They are the
results of a paired-associate learning task. GROUP describes the type of drug
administered; LEARNING is the amount of material learned during testing. First we
perform Levenes test (Levene, 1960) to determine if the variances are equal across
cells. The input is:
Following is the ANOVA table of the absolute residuals:
Notice that the F is significant, indicating that the separate variances test is advisable.
Let's do several single-degree-of-freedom tests, following Milliken and Johnson. The
first is for comparing all drugs against the control; the second tests the hypothesis that
groups 2 and 3 together are not significantly different from group 4. The input is:
USE mj20
ANOVA
SAVE mjresids / RESID DATA
DEPEND learning
CATEGORY group
ESTIMATE
USE mjresids
LET residual = ABS(residual)
CATEGORY group
DEPEND residual
ESTIMATE
Dep Var: RESIDUAL N: 29 Multiple R: 0.675 Squared multiple R: 0.455


Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

GROUP 30.603 3 10.201 6.966 0.001

Error 36.608 25 1.464
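Levene's test is exactly this: a one-way ANOVA on the absolute deviations of each observation from its group mean. Outside SYSTAT it can be run directly; the following is a minimal Python sketch with made-up learning scores (not the MJ20 data):

from scipy import stats

# Hypothetical learning scores for four drug groups
g1 = [24, 26, 25, 27, 23]
g2 = [18, 30, 22, 29, 15]
g3 = [20, 21, 19, 22, 20]
g4 = [10, 28, 17, 31, 12]

# center='mean' matches the ANOVA-on-absolute-residuals version of the test
W, p = stats.levene(g1, g2, g3, g4, center='mean')
print(round(W, 3), round(p, 3))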
USE mj20
ANOVA
CATEGORY group
DEPEND learning
ESTIMATE
HYPOTHESIS
SPECIFY 3*group[1] = group[2] +group[3] + group[4] / SEPARATE
TEST
HYPOTHESIS
SPECIFY 2*group[4] = group[2] +group[3] / SEPARATE
TEST
Following is the output. The ANOVA table has been omitted because it is not valid
when variances are unequal.
Example 7
Analysis of Covariance
Winer (1971) uses the COVAR data file for an analysis of covariance in which X is the
covariate and TREAT is the treatment. Cases do not need to be ordered by the grouping
variable TREAT.
Before analyzing the data with an analysis of covariance model, be sure there is no
significant interaction between the covariate and the treatment. The assumption of no
interaction is often called the homogeneity of slopes assumption because it is
tantamount to saying that the slope of the regression line of the dependent variable onto
the covariate should be the same in all cells of the design.
Using separate variances estimate for error term.
Hypothesis.

A Matrix

1 2 3 4
0.0 -4.000 0.0 0.0
Null hypothesis value for D
0.0
Test of Hypothesis

Source SS df MS F P

Hypoth 242.720 1 242.720 18.115 0.004
Error 95.085 7.096 13.399

-------------------------------------------------------------------------------

Using separate variances estimate for error term.
> TEST
Hypothesis.

A Matrix

1 2 3 4
0.0 2.000 3.000 3.000
Null hypothesis value for D
0.0
Test of Hypothesis

Source SS df MS F P

Hypoth 65.634 1 65.634 17.819 0.001
Error 61.852 16.792 3.683
Parallelism is easy to test with a preliminary model. Use GLM to estimate this
model with the interaction between treatment (TREAT) and covariate (X) in the model.
The input is:
The output follows:
The probability value for the treatment by covariate interaction is 0.605, so the
assumption of homogeneity of slopes is plausible.
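The interaction F in this preliminary table can be reconstructed from the printed sums of squares: F is the TREAT*X mean square divided by the error mean square. A quick Python check:

from scipy import stats

ss_interaction, df_interaction = 0.667, 2
ss_error, df_error = 9.635, 15

F = (ss_interaction / df_interaction) / (ss_error / df_error)
p = stats.f.sf(F, df_interaction, df_error)
print(round(F, 3), round(p, 3))   # about 0.52 and 0.60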
Now, fit the usual analysis of covariance model by specifying:
For incomplete factorials and similar designs, you still must specify a model (using
GLM) to do analysis of covariance.
The output follows:
USE covar
GLM
CATEGORY treat
MODEL y = CONSTANT + treat + x + treat*x
ESTIMATE
Dep Var: Y N: 21 Multiple R: 0.921 Squared multiple R: 0.849


Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

TREAT 6.693 2 3.346 5.210 0.019
X 15.672 1 15.672 24.399 0.000
TREAT*X 0.667 2 0.334 0.519 0.605

Error 9.635 15 0.642
USE covar
ANOVA
PRINT=MEDIUM
CATEGORY treat
DEPEND y
COVARIATE x
ESTIMATE
Dep Var: Y N: 21 Multiple R: 0.916 Squared multiple R: 0.839


Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

TREAT 16.932 2 8.466 13.970 0.000
X 16.555 1 16.555 27.319 0.000

Error 10.302 17 0.606
-------------------------------------------------------------------------------
The treatment adjusted for the covariate is significant. There is a significant difference
among the three treatment groups. Also, notice that the coefficient for the covariate is
significant (F = 27.319, p < 0.0005). If it were not, the analysis of covariance could be
taking away a degree of freedom without reducing mean-square error enough to help you.
SYSTAT computes the adjusted cell means the same way it computes estimates
when saving residuals. Model terms (main effects and interactions) that do not contain
categorical variables (covariates) are incorporated into the equation by adding the
product of the coefficient and the mean of the term for computing estimates. The grand
mean (CONSTANT) is included in computing the estimates.
Example 8
One-Way Repeated Measures
In this example, six rats were weighed at the end of each of five weeks. A plot of each
rat's weight over the duration of the experiment follows:
ANOVA is the simplest way to analyze this one-way model. Because we have no
categorical variable(s), SYSTAT generates only the constant (grand mean) in the
Adjusted least squares means.
Adj. LS Mean SE N
TREAT =1 4.888 0.307 7
TREAT =2 7.076 0.309 7
TREAT =3 6.750 0.294 7
model. To obtain individual single-degree-of-freedom orthogonal polynomials, the
input is:
The output follows:
USE rats
ANOVA
DEPEND weight(1 .. 5) / REPEAT NAME=Time
PRINT MEDIUM
ESTIMATE
Number of cases processed: 6
Dependent variable means

WEIGHT(1) WEIGHT(2) WEIGHT(3) WEIGHT(4) WEIGHT(5)
2.500 5.833 7.167 8.000 8.333

-------------------------------------------------------------------------------

Univariate and Multivariate Repeated Measures Analysis


Within Subjects
---------------

Source SS df MS F P G-G H-F

Time 134.467 4 33.617 16.033 0.000 0.004 0.002
Error 41.933 20 2.097


Greenhouse-Geisser Epsilon: 0.3420
Huynh-Feldt Epsilon : 0.4273
-------------------------------------------------------------------------------


Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)

Source SS df MS F P

Time 114.817 1 114.817 38.572 0.002
Error 14.883 5 2.977


Polynomial Test of Order 2 (Quadratic)

Source SS df MS F P

Time 18.107 1 18.107 7.061 0.045
Error 12.821 5 2.564

Polynomial Test of Order 3 (Cubic)

Source SS df MS F P

Time 1.350 1 1.350 0.678 0.448
Error 9.950 5 1.990


The Huynh-Feldt p value (0.002) does not differ from the p value for the F statistic to
any significant degree. Compound symmetry appears to be satisfied and weight
changes significantly over the five trials.
The polynomial tests indicate that most of the trials effect can be accounted for by
a linear trend across time. In fact, the sum of squares for TIME is 134.467, and the sum
of squares for the linear trend is almost as large (114.817). Thus, the linear polynomial
accounts for roughly 85% of the change across the repeated measures.
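The linear-trend sum of squares can be recovered from the five weekly means. For equally spaced trials the linear contrast is (-2, -1, 0, 1, 2), and the single-df sum of squares is n times the squared contrast of the means divided by the sum of squared coefficients. A sketch in Python using the means printed above:

import numpy as np

means = np.array([2.500, 5.833, 7.167, 8.000, 8.333])   # weekly weight means
n = 6                                                    # number of rats
c = np.array([-2, -1, 0, 1, 2])                          # linear polynomial contrast

ss_linear = n * (c @ means) ** 2 / (c @ c)
print(round(ss_linear, 2))   # about 114.8, compared with 134.47 for the whole Time effect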
Unevenly Spaced Polynomials
Sometimes the underlying metric of the profiles is not evenly spaced. Let's assume that
the fifth weight was measured after the tenth week instead of the fifth. In that case, the
default polynomials have to be adjusted for the uneven spacing. These adjustments do
not affect the overall repeated measures tests of each effect (univariate or multivariate),
but they partition the sums of squares differently for the single-degree-of-freedom
tests. The input is:
Alternatively, you could request a hypothesis test, specifying the metric for the
polynomials:
Polynomial Test of Order 4

Source SS df MS F P

Time 0.193 1 0.193 0.225 0.655
Error 4.279 5 0.856

-------------------------------------------------------------------------------


Multivariate Repeated Measures Analysis

Test of: Time Hypoth. df Error df F P
Wilks Lambda= 0.011 4 2 43.007 0.023
Pillai Trace = 0.989 4 2 43.007 0.023
H-L Trace = 86.014 4 2 43.007 0.023
USE rats
ANOVA
DEPEND weight(1 .. 5) / REPEAT=5(1 2 3 4 10) NAME=Time
PRINT MEDIUM
ESTIMATE
HYPOTHESIS
WITHIN='Time'
CONTRAST / POLYNOMIAL METRIC=1,2,3,4,10
TEST
The last point has been spread out further to the right. The output follows:
The significance tests for the linear and quadratic trends differ from those for the
evenly spaced polynomials. Before, the linear trend was strongest; now, the quadratic
polynomial has the most significant results (F = 107.9, p < 0.0005).
You may have noticed that although the univariate F tests for the polynomials are
different, the multivariate test is unchanged. The latter measures variation across all
components. The ANOVA table for the combined components is not affected by the
metric of the polynomials.
Univariate and Multivariate Repeated Measures Analysis


Within Subjects
---------------

Source SS df MS F P G-G H-F

Time 134.467 4 33.617 16.033 0.000 0.004 0.002
Error 41.933 20 2.097


Greenhouse-Geisser Epsilon: 0.3420
Huynh-Feldt Epsilon : 0.4273
-------------------------------------------------------------------------------


Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)

Source SS df MS F P

Time 67.213 1 67.213 23.959 0.004
Error 14.027 5 2.805


Polynomial Test of Order 2 (Quadratic)

Source SS df MS F P

Time 62.283 1 62.283 107.867 0.000
Error 2.887 5 0.577

(We omit the cubic and quartic polynomial output.)

-------------------------------------------------------------------------------


Multivariate Repeated Measures Analysis

Test of: Time Hypoth. df Error df F P
Wilks Lambda= 0.011 4 2 43.007 0.023
Pillai Trace = 0.989 4 2 43.007 0.023
H-L Trace = 86.014 4 2 43.007 0.023
Difference Contrasts
If you do not want to use polynomials, you can specify a C matrix that contrasts
adjacent weeks. After estimating the model, input the following:
The output is:
Notice the C matrix that this command generates. In this case, each of the univariate F
tests covers the significance of the difference between the adjacent weeks indexed by
the C matrix. For example, F = 17.241 shows that the first and second weeks differ
significantly. The third and fourth weeks do not differ (F = 0.566). Unlike polynomials,
these contrasts are not orthogonal.
HYPOTHESIS
WITHIN=Time
CONTRAST / DIFFERENCE
TEST
Hypothesis.

C Matrix

1 2 3 4 5
1 1.000 -1.000 0.0 0.0 0.0
2 0.0 1.000 -1.000 0.0 0.0
3 0.0 0.0 1.000 -1.000 0.0
4 0.0 0.0 0.0 1.000 -1.000


Univariate F Tests

Effect SS df MS F P

1 66.667 1 66.667 17.241 0.009
Error 19.333 5 3.867

2 10.667 1 10.667 40.000 0.001
Error 1.333 5 0.267

3 4.167 1 4.167 0.566 0.486
Error 36.833 5 7.367

4 0.667 1 0.667 2.500 0.175
Error 1.333 5 0.267


Multivariate Test Statistics

Wilks Lambda = 0.011
F-Statistic = 43.007 df = 4, 2 Prob = 0.023

Pillai Trace = 0.989
F-Statistic = 43.007 df = 4, 2 Prob = 0.023

Hotelling-Lawley Trace = 86.014
F-Statistic = 43.007 df = 4, 2 Prob = 0.023
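As a quick check of the non-orthogonality noted above, this small numpy sketch (not SYSTAT code) rebuilds the same difference C matrix and shows that adjacent rows are correlated:

import numpy as np

k = 5                                  # five weekly measurements
C = np.zeros((k - 1, k))
for i in range(k - 1):                 # row i contrasts week i+1 with week i+2
    C[i, i], C[i, i + 1] = 1.0, -1.0

print(C)           # the same C matrix shown in the output above
print(C @ C.T)     # the off-diagonal -1 values show the rows are not orthogonal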
Summing Effects
To sum across weeks:
The output is:
In this example, you are testing whether the overall weight (across weeks) significantly
differs from 0. Naturally, the F value is significant. Notice the C matrix that is
generated. It is simply a set of 1s that, in the equation BC' = 0, sum all the coefficients
in B. In a group-by-trials design, this C matrix is useful for pooling trials and analyzing
group effects.
Custom Contrasts
To test any arbitrary contrast effects between dependent variables, you can use C
matrix, which has the same form (without a column for the CONSTANT) as A matrix.
The following commands test a linear trend across the five trials:
The output is:
HYPOTHESIS
WITHIN=Time
CONTRAST / SUM
TEST
Hypothesis.

C Matrix

1 2 3 4 5
1.000 1.000 1.000 1.000 1.000
Test of Hypothesis

Source SS df MS F P

Hypothesis 6080.167 1 6080.167 295.632 0.000
Error 102.833 5 20.567
HYPOTHESIS
CMATRIX [-2 -1 0 1 2]
TEST
Hypothesis.

C Matrix

1 2 3 4 5
-2.000 -1.000 0.0 1.000 2.000
Test of Hypothesis

Source SS df MS F P

Hypothesis 1148.167 1 1148.167 38.572 0.002
Error 148.833 5 29.767
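Note that this custom contrast reproduces the F and p of the order-1 polynomial test shown earlier; only the sums of squares change, because the coefficients are not normalized. A short arithmetic sketch (numbers copied from the two output panels, computed in Python):

c = [-2, -1, 0, 1, 2]
scale = sum(x * x for x in c)                      # 10, the sum of squared coefficients

ss_hyp_custom, ss_err_custom = 1148.167, 148.833   # custom-contrast output
ss_hyp_poly, ss_err_poly = 114.817, 14.883         # order-1 polynomial output

print(ss_hyp_custom / scale, ss_err_custom / scale)   # about 114.82 and 14.88
# The hypothesis/error ratio (and hence F = 5 * ratio = 38.572) is identical in both tables.
print(ss_hyp_custom / ss_err_custom, ss_hyp_poly / ss_err_poly)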
Example 9
Repeated Measures ANOVA for One Grouping Factor and
One Within Factor with Ordered Levels
The following example uses estimates of population for 1983, 1986, and 1990 and
projections for 2020 for 57 countries from the OURWORLD data file. The data are log
transformed before analysis. Here you compare trends in population growth for
European and Islamic countries. The variable GROUP$ contains codes for these
groups plus a third code for New World countries (we exclude these countries from this
analysis). To create a bar chart of the data after using YLOG to log transform them:
To perform a repeated measures analysis:
USE ourworld
SELECT group$ <> "NewWorld"
BAR pop_1983 .. pop_2020 / REPEAT OVERLAY YLOG,
GROUP=group$ SERROR FILL=.35, .8
USE ourworld
ANOVA
SELECT group$ <> "NewWorld"
CATEGORY group$
LET(pop_1983, pop_1986, pop_1990, pop_2020) = L10(@)
DEPEND pop_1983 pop_1986 pop_1990 pop_2020 / REPEAT=4 NAME=Time
ESTIMATE
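If you prefer to prepare the same data outside SYSTAT, the equivalent steps might look like this in pandas; the CSV file name and column names are assumptions based on the commands above, not part of the SYSTAT distribution:

import numpy as np
import pandas as pd

df = pd.read_csv("ourworld.csv")                      # hypothetical export of the OURWORLD file

df = df[df["GROUP$"] != "NewWorld"]                   # SELECT group$ <> "NewWorld"
pop_cols = ["POP_1983", "POP_1986", "POP_1990", "POP_2020"]
df[pop_cols] = np.log10(df[pop_cols])                 # LET (...) = L10(@)
print(df.groupby("GROUP$")[pop_cols].mean())          # group means of the log populations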
[Bar chart: mean POP_1983, POP_1986, POP_1990, and POP_2020 (log-scaled Measure axis, with standard error bars), grouped by GROUP$: Europe and Islamic.]
The output follows:
The within-subjects results indicate highly significant linear, quadratic, and cubic
changes across time. The pattern of change across time for the two groups also differs
significantly (that is, the TIME * GROUP$ interactions are highly significant for all
three tests).
Notice that there is a larger gap in time between 1990 and 2020 than between the
other values. Let's incorporate real time in the analysis with the following
specification:
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)

Source SS df MS F P

Time 0.675 1 0.675 370.761 0.000
Time*GROUP$ 0.583 1 0.583 320.488 0.000
Error 0.062 34 0.002


Polynomial Test of Order 2 (Quadratic)

Source SS df MS F P

Time 0.132 1 0.132 92.246 0.000
Time*GROUP$ 0.128 1 0.128 89.095 0.000
Error 0.049 34 0.001


Polynomial Test of Order 3 (Cubic)

Source SS df MS F P

Time 0.028 1 0.028 96.008 0.000
Time*GROUP$ 0.027 1 0.027 94.828 0.000
Error 0.010 34 0.000

-------------------------------------------------------------------------------


Multivariate Repeated Measures Analysis

Test of: Time Hypoth. df Error df F P
Wilks Lambda= 0.063 3 32 157.665 0.000
Pillai Trace = 0.937 3 32 157.665 0.000
H-L Trace = 14.781 3 32 157.665 0.000

Test of: Time*GROUP$ Hypoth. df Error df F P
Wilks Lambda= 0.076 3 32 130.336 0.000
Pillai Trace = 0.924 3 32 130.336 0.000
H-L Trace = 12.219 3 32 130.336 0.000
DEPEND pop_1983 pop_1986 pop_1990 pop_2020 / REPEAT=4(83,86,90,120),
NAME=TIME
ESTIMATE
The results for the orthogonal polynomials are shown below:
When the values for POP_2020 are positioned on a real time line, the tests for
quadratic and cubic polynomials are no longer significant. The test for the linear
TIME * GROUP$ interaction, however, remains highly significant, indicating that the
slope across time for the Islamic group is significantly steeper than that for the
European countries.
Example 10
Repeated Measures ANOVA for Two Grouping Factors and
One Within Factor
Repeated measures enables you to handle grouping factors automatically. The
following example is from Winer (1971). There are two grouping factors (ANXIETY
and TENSION) and one trials factor in the file REPEAT1. Following is a dot display of
the average responses across trials for each of the four combinations of ANXIETY and
TENSION.
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)

Source SS df MS F P

TIME 0.831 1 0.831 317.273 0.000
TIME*GROUP$ 0.737 1 0.737 281.304 0.000
Error 0.089 34 0.003


Polynomial Test of Order 2 (Quadratic)

Source SS df MS F P

TIME 0.003 1 0.003 4.402 0.043
TIME*GROUP$ 0.001 1 0.001 1.562 0.220
Error 0.025 34 0.001


Polynomial Test of Order 3 (Cubic)

Source SS df MS F P

TIME 0.000 1 0.000 1.653 0.207
TIME*GROUP$ 0.000 1 0.000 1.733 0.197
Error 0.006 34 0.000
The input is:
The model also includes an interaction between the grouping factors (ANXIETY *
TENSION). The output follows:
USE repeat1
ANOVA
DOT trial(1..4) / Group=anxiety,tension, line,repeat,serror
CATEGORY anxiety tension
DEPEND trial(1 .. 4) / REPEAT NAME=Trial
PRINT MEDIUM
ESTIMATE
Univariate and Multivariate Repeated Measures Analysis

Between Subjects
----------------

Source SS df MS F P

ANXIETY 10.083 1 10.083 0.978 0.352
TENSION 8.333 1 8.333 0.808 0.395
ANXIETY
*TENSION 80.083 1 80.083 7.766 0.024
[Dot plots: mean Measure (with standard error bars) for TRIAL(1) through TRIAL(4), one panel for each ANXIETY,TENSION combination: 1,1; 1,2; 2,1; 2,2.]
Error 82.500 8 10.313
Within Subjects
---------------

Source SS df MS F P G-G H-F

Trial 991.500 3 330.500 152.051 0.000 0.000 0.000
Trial
*ANXIETY 8.417 3 2.806 1.291 0.300 0.300 0.301
Trial
*TENSION 12.167 3 4.056 1.866 0.162 0.197 0.169
Trial
*ANXIETY
*TENSION 12.750 3 4.250 1.955 0.148 0.185 0.155
Error 52.167 24 2.174


Greenhouse-Geisser Epsilon: 0.5361
Huynh-Feldt Epsilon : 0.9023
-------------------------------------------------------------------------------


Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)

Source SS df MS F P

Trial 984.150 1 984.150 247.845 0.000
Trial
*ANXIETY 1.667 1 1.667 0.420 0.535
Trial
*TENSION 10.417 1 10.417 2.623 0.144
Trial
*ANXIETY
*TENSION 9.600 1 9.600 2.418 0.159
Error 31.767 8 3.971


Polynomial Test of Order 2 (Quadratic)

Source SS df MS F P

Trial 6.750 1 6.750 3.411 0.102
Trial
*ANXIETY 3.000 1 3.000 1.516 0.253
Trial
*TENSION 0.083 1 0.083 0.042 0.843
Trial
*ANXIETY
*TENSION 0.333 1 0.333 0.168 0.692
Error 15.833 8 1.979


Polynomial Test of Order 3 (Cubic)

Source SS df MS F P

Trial 0.600 1 0.600 1.051 0.335
Trial
*ANXIETY 3.750 1 3.750 6.569 0.033
Trial
*TENSION 1.667 1 1.667 2.920 0.126
Trial
*ANXIETY
*TENSION 2.817 1 2.817 4.934 0.057
Error 4.567 8 0.571

In the within-subjects table, you see that the trial effect is highly significant (F = 152.1,
p < 0.0005). Below that table, we see that the linear trend across trials (Polynomial
Order 1) is highly significant (F = 247.8, p < 0.0005). The hypothesis sums of squares for
the linear, quadratic, and cubic polynomials sum to the total hypothesis sum of squares for
trials (that is, 984.15 + 6.75 + 0.60 = 991.5). Notice that the total sum of squares is 991.5,
while that for the linear trend is 984.15. This means that the linear trend accounts for more
than 99% of the variability across the four trials. The assumption of compound symmetry
is not required for the test of linear trendso you can report that there is a highly
significant linear decrease across the four trials (F = 247.8, p < 0.0005).
Example 11
Repeated Measures ANOVA for Two Trial Factors
Repeated measures enables you to handle several trials factors, so we include an
example with two trial factors. It is an experiment from Winer (1971), which has one
grouping factor (NOISE) and two trials factors (PERIODS and DIALS). The trials
factors must be sorted into a set of dependent variables (one for each pairing of the two
factors groups). It is useful to label the levels with a convenient mnemonic. The file is
set up with variables P1D1 through P3D3. Variable P1D2 indicates a score in the
PERIODS = 1, DIALS = 2 cell. The data are in the file REPEAT2.
-------------------------------------------------------------------------------
Multivariate Repeated Measures Analysis

Test of: Trial Hypoth. df Error df F P
Wilks Lambda= 0.015 3 6 127.686 0.000
Pillai Trace = 0.985 3 6 127.686 0.000
H-L Trace = 63.843 3 6 127.686 0.000

Test of: Trial Hypoth. df Error df F P
*ANXIETY
Wilks Lambda= 0.244 3 6 6.183 0.029
Pillai Trace = 0.756 3 6 6.183 0.029
H-L Trace = 3.091 3 6 6.183 0.029

Test of: Trial Hypoth. df Error df F P
*TENSION
Wilks Lambda= 0.361 3 6 3.546 0.088
Pillai Trace = 0.639 3 6 3.546 0.088
H-L Trace = 1.773 3 6 3.546 0.088

Test of: Trial Hypoth. df Error df F P
*ANXIETY
*TENSION
Wilks Lambda= 0.328 3 6 4.099 0.067
Pillai Trace = 0.672 3 6 4.099 0.067
H-L Trace = 2.050 3 6 4.099 0.067
The input is:
Notice that REPEAT specifies that the two trials factors have three levels each. ANOVA
assumes the subscript of the first factor will vary slowest in the ordering of the
dependent variables. If you have two repeated factors (DAY with four levels and AMPM
with two levels), you should select eight dependent variables and type Repeat=4,2. The
repeated measures are selected in the following order:
From this indexing, it generates the proper main effects and interactions. When more
than one trial factor is present, ANOVA lists each dependent variable and the
associated level on each factor. The output follows:
USE repeat2
ANOVA
CATEGORY noise
DEPEND p1d1 .. p3d3 / REPEAT=3,3 NAMES=period,dial
PRINT MEDIUM
ESTIMATE
DAY1_AM DAY1_PM DAY2_AM DAY2_PM DAY3_AM DAY3_PM DAY4_AM DAY4_PM
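Here is a tiny Python sketch (not SYSTAT code) of the 'first factor varies slowest' rule applied to the hypothetical DAY-by-AMPM example:

from itertools import product

days = [1, 2, 3, 4]        # first repeated factor: varies slowest
ampm = ["AM", "PM"]        # second repeated factor: varies fastest

order = [f"DAY{d}_{t}" for d, t in product(days, ampm)]
print(order)   # DAY1_AM, DAY1_PM, DAY2_AM, ..., DAY4_PM, the order REPEAT=4,2 expects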
Dependent variable means

P1D1 P1D2 P1D3 P2D1 P2D2
48.000 52.000 63.000 37.167 42.167

P2D3 P3D1 P3D2 P3D3
54.167 27.000 32.500 42.500


-------------------------------------------------------------------------------

Univariate and Multivariate Repeated Measures Analysis

Between Subjects
----------------

Source SS df MS F P

NOISE 468.167 1 468.167 0.752 0.435
Error 2491.111 4 622.778


Within Subjects
---------------

Source SS df MS F P G-G H-F

period 3722.333 2 1861.167 63.389 0.000 0.000 0.000
period*NOISE 333.000 2 166.500 5.671 0.029 0.057 0.029
Error 234.889 8 29.361


Greenhouse-Geisser Epsilon: 0.6476
Huynh-Feldt Epsilon : 1.0000
dial 2370.333 2 1185.167 89.823 0.000 0.000 0.000
dial*NOISE 50.333 2 25.167 1.907 0.210 0.215 0.210
Error 105.556 8 13.194


Greenhouse-Geisser Epsilon: 0.9171
Huynh-Feldt Epsilon : 1.0000
period*dial 10.667 4 2.667 0.336 0.850 0.729 0.850
period*dial
*NOISE 11.333 4 2.833 0.357 0.836 0.716 0.836
Error 127.111 16 7.944


Greenhouse-Geisser Epsilon: 0.5134
Huynh-Feldt Epsilon : 1.0000
-------------------------------------------------------------------------------


Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)

Source SS df MS F P

period 3721.000 1 3721.000 73.441 0.001
period*NOISE 225.000 1 225.000 4.441 0.103
Error 202.667 4 50.667

dial 2256.250 1 2256.250 241.741 0.000
dial*NOISE 6.250 1 6.250 0.670 0.459
Error 37.333 4 9.333

period*dial 0.375 1 0.375 0.045 0.842
period*dial
*NOISE 1.042 1 1.042 0.125 0.742
Error 33.333 4 8.333


Polynomial Test of Order 2 (Quadratic)

Source SS df MS F P

period 1.333 1 1.333 0.166 0.705
period*NOISE 108.000 1 108.000 13.407 0.022
Error 32.222 4 8.056

dial 114.083 1 114.083 6.689 0.061
dial*NOISE 44.083 1 44.083 2.585 0.183
Error 68.222 4 17.056

period*dial 3.125 1 3.125 0.815 0.418
period*dial
*NOISE 0.125 1 0.125 0.033 0.865
Error 15.333 4 3.833


Polynomial Test of Order 3 (Cubic)

Source SS df MS F P

period*dial 6.125 1 6.125 0.750 0.435
period*dial
*NOISE 3.125 1 3.125 0.383 0.570
Error 32.667 4 8.167


Polynomial Test of Order 4

Source SS df MS F P

period*dial 1.042 1 1.042 0.091 0.778
period*dial
*NOISE 7.042 1 7.042 0.615 0.477
Error 45.778 4 11.444
-------------------------------------------------------------------------------
Using GLM, the input is:
Example 12
Repeated Measures Analysis of Covariance
To do repeated measures analysis of covariance, where the covariate varies within
subjects, you would have to set up your model like a split plot with a different record
for each measurement.
This example is from Winer (1971). This design has two trials (DAY1 and DAY2),
one covariate (AGE), and one grouping factor (SEX). The data are in the file WINER.
Multivariate Repeated Measures Analysis

Test of: period Hypoth. df Error df F P
Wilks Lambda= 0.051 2 3 28.145 0.011
Pillai Trace = 0.949 2 3 28.145 0.011
H-L Trace = 18.764 2 3 28.145 0.011

Test of: period*NOISE Hypoth. df Error df F P
Wilks Lambda= 0.156 2 3 8.111 0.062
Pillai Trace = 0.844 2 3 8.111 0.062
H-L Trace = 5.407 2 3 8.111 0.062

Test of: dial Hypoth. df Error df F P
Wilks Lambda= 0.016 2 3 91.456 0.002
Pillai Trace = 0.984 2 3 91.456 0.002
H-L Trace = 60.971 2 3 91.456 0.002

Test of: dial*NOISE Hypoth. df Error df F P
Wilks Lambda= 0.565 2 3 1.155 0.425
Pillai Trace = 0.435 2 3 1.155 0.425
H-L Trace = 0.770 2 3 1.155 0.425

Test of: period*dial Hypoth. df Error df F P
Wilks Lambda= 0.001 4 1 331.445 0.041
Pillai Trace = 0.999 4 1 331.445 0.041
H-L Trace = 1325.780 4 1 331.445 0.041

Test of: period*dial Hypoth. df Error df F P
*NOISE
Wilks Lambda= 0.000 4 1 581.875 0.031
Pillai Trace = 1.000 4 1 581.875 0.031
H-L Trace = 2327.500 4 1 581.875 0.031
GLM
USE repeat2
CATEGORY noise
MODEL p1d1 .. p3d3 = CONSTANT + noise / REPEAT=3,3
NAMES=period,dial
PRINT MEDIUM
ESTIMATE
The input follows:
The output is:
The F statistics for the covariate and its interactions, namely AGE (13.587) and
DAY * AGE (0.102), are not ordinarily published; however, they help you
understand the adjustment made by the covariate.
This analysis did not test the homogeneity of slopes assumption. If you want to test
the homogeneity of slopes assumption, run the following model in GLM first:
Then check to see if the SEX * AGE interaction is significant.
USE winer
ANOVA
CATEGORY sex
DEPEND day(1 .. 2) / REPEAT NAME=day
COVARIATE age
ESTIMATE
Dependent variable means

DAY(1) DAY(2)
16.500 11.875

-------------------------------------------------------------------------------

Univariate Repeated Measures Analysis
Between Subjects
----------------

Source SS df MS F P

SEX 44.492 1 44.492 3.629 0.115
AGE 166.577 1 166.577 13.587 0.014
Error 61.298 5 12.260


Within Subjects
---------------

Source SS df MS F P G-G H-F

day 22.366 1 22.366 17.899 0.008 . .
day*SEX 0.494 1 0.494 0.395 0.557 . .
day*AGE 0.127 1 0.127 0.102 0.763 . .
Error 6.248 5 1.250


Greenhouse-Geisser Epsilon: .
Huynh-Feldt Epsilon : .
MODEL day(1 .. 2) = CONSTANT + sex + age + sex*age / REPEAT
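Outside SYSTAT, an analogous check of the homogeneity of slopes for the between-subjects part of the design could be sketched with statsmodels; the data file and column names (score, sex, age) are assumptions, with score standing for a per-subject summary of the two days:

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("winer_summary.csv")     # hypothetical file: one row per subject

model = smf.ols("score ~ C(sex) * age", data=df).fit()
print(anova_lm(model, typ=2))             # a small p for C(sex):age suggests unequal slopes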
To use GLM:
Example 13
Multivariate Analysis of Variance
The data in the file MANOVA comprise a hypothetical experiment on rats assigned
randomly to one of three drugs. Weight loss in grams was observed for the first and
second weeks of the experiment. The data were analyzed in Morrison (1976) with a
two-way multivariate analysis of variance (a two-way MANOVA).
You can use ANOVA to set up the MANOVA model for complete factorials:
Notice that the only difference between an ANOVA and MANOVA model is that the
latter has more than one dependent variable. The output includes:
GLM
USE winer
CATEGORY sex
MODEL day(1 .. 2) = CONSTANT + sex + age / REPEAT NAME=day
ESTIMATE
USE manova
ANOVA
CATEGORY sex, drug
DEPEND week(1 .. 2)
ESTIMATE
Dependent variable means

WEEK(1) WEEK(2)
9.750 8.667

Estimates of effects B = (X'X)^-1 X'Y

WEEK(1) WEEK(2)

CONSTANT 9.750 8.667

SEX 1 0.167 0.167

DRUG 1 -2.750 -1.417

DRUG 2 -2.250 -0.167

SEX 1
DRUG 1 -0.667 -1.167

SEX 1
DRUG 2 -0.417 -0.417
Notice that each column of the B matrix is now assigned to a separate dependent
variable. It is as if we had done two runs of an ANOVA. The numbers in the matrix are
the analysis of variance effects estimates.
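The B matrix above is simply the least-squares solution B = (X'X)^-1 X'Y computed once for each dependent column. The following numpy sketch shows the mechanics on a made-up balanced 2-by-3 layout with random placeholder data (it is not the MANOVA file and not SYSTAT's own code):

import numpy as np

rng = np.random.default_rng(0)

sex = np.repeat([1, 2], 12)                    # 2 sexes x 3 drugs x 4 cases
drug = np.tile(np.repeat([1, 2, 3], 4), 2)
Y = rng.normal(size=(24, 2))                   # two dependent columns (weeks)

def effect_code(labels, levels):
    """One column per non-last level; the last level is coded -1 on every column."""
    cols = []
    for lev in levels[:-1]:
        col = np.where(labels == lev, 1.0, 0.0)
        col[labels == levels[-1]] = -1.0
        cols.append(col)
    return np.column_stack(cols)

S = effect_code(sex, [1, 2])                   # one SEX column
D = effect_code(drug, [1, 2, 3])               # two DRUG columns
SD = np.column_stack([S[:, 0] * D[:, j] for j in range(D.shape[1])])   # SEX*DRUG columns

X = np.column_stack([np.ones(len(Y)), S, D, SD])    # CONSTANT + SEX + DRUG + SEX*DRUG
B = np.linalg.solve(X.T @ X, X.T @ Y)               # (X'X)^-1 X'Y, one column per dependent variable
print(np.round(B, 3))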
You can also use GLM to set up the MANOVA model. With this approach, the
design does not have to be a complete factorial. With commands:
Testing Hypotheses
With more than one dependent variable, you do not get a single ANOVA table; instead,
each hypothesis is tested separately. Here are three hypotheses. Extended output for the
second hypothesis is used to illustrate the detailed output.
Following are the collected results:
GLM
USE manova
CATEGORY sex, drug
MODEL week(1 .. 2) = CONSTANT + sex + drug + sex*drug
ESTIMATE
HYPOTHESIS
EFFECT = sex
TEST
PRINT = LONG
HYPOTHESIS
EFFECT = drug
TEST
PRINT = SHORT
HYPOTHESIS
EFFECT = sex*drug
TEST
Test for effect called: SEX

Univariate F Tests

Effect SS df MS F P

WEEK(1) 0.667 1 0.667 0.127 0.726
Error 94.500 18 5.250

WEEK(2) 0.667 1 0.667 0.105 0.749
Error 114.000 18 6.333


Multivariate Test Statistics

Wilks Lambda = 0.993
F-Statistic = 0.064 df = 2, 17 Prob = 0.938

Pillai Trace = 0.007
F-Statistic = 0.064 df = 2, 17 Prob = 0.938

Hotelling-Lawley Trace = 0.008
F-Statistic = 0.064 df = 2, 17 Prob = 0.938
-------------------------------------------------------------------------------
Test for effect called: DRUG


Null hypothesis contrast AB

WEEK(1) WEEK(2)
1 -2.750 -1.417
2 -2.250 -0.167


Inverse contrast A(X'X)^-1 A'

1 2
1 0.083
2 -0.042 0.083


Hypothesis sum of product matrix H = B'A'(A(X'X)^-1 A')^-1 AB

WEEK(1) WEEK(2)
WEEK(1) 301.000
WEEK(2) 97.500 36.333


Error sum of product matrix G = E'E

WEEK(1) WEEK(2)
WEEK(1) 94.500
WEEK(2) 76.500 114.000


Univariate F Tests

Effect SS df MS F P

WEEK(1) 301.000 2 150.500 28.667 0.000
Error 94.500 18 5.250

WEEK(2) 36.333 2 18.167 2.868 0.083
Error 114.000 18 6.333


Multivariate Test Statistics

Wilks Lambda = 0.169
F-Statistic = 12.199 df = 4, 34 Prob = 0.000

Pillai Trace = 0.880
F-Statistic = 7.077 df = 4, 36 Prob = 0.000

Hotelling-Lawley Trace = 4.640
F-Statistic = 18.558 df = 4, 32 Prob = 0.000

THETA = 0.821 S = 2, M =-0.5, N = 7.5 Prob = 0.000

Test of Residual Roots

Roots 1 through 2
Chi-Square Statistic = 36.491 df = 4

Roots 2 through 2
Chi-Square Statistic = 1.262 df = 1
Matrix formulas (which are somewhat long) make explicit the hypothesis being tested.
For MANOVA, hypotheses are tested with sums-of-squares and cross-products
matrices. Before printing the multivariate tests, however, SYSTAT prints the univariate
tests. Each of these F statistics is constructed in the same way as the ANOVA model.
The sums of squares for hypothesis and error are taken from the diagonals of the
respective sum of product matrices. The univariate F test for the WEEK(1) DRUG
effect, for example, is computed from (301.0/2) over (94.5/18), or hypothesis mean
square divided by error mean square.
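You can check this arithmetic, and the multivariate statistics discussed next, directly from the H and G matrices printed above; a short numpy sketch (not SYSTAT code):

import numpy as np

H = np.array([[301.0, 97.5], [97.5, 36.333]])      # hypothesis sum of product matrix for DRUG
G = np.array([[94.5, 76.5], [76.5, 114.0]])        # error sum of product matrix
df_hyp, df_err = 2, 18

F_week1 = (H[0, 0] / df_hyp) / (G[0, 0] / df_err)  # 28.667, as in the univariate table
wilks = np.linalg.det(G) / np.linalg.det(G + H)    # 0.169
hl_trace = np.trace(H @ np.linalg.inv(G))          # 4.640
pillai = np.trace(H @ np.linalg.inv(H + G))        # 0.880

print(round(F_week1, 3), round(wilks, 3), round(hl_trace, 3), round(pillai, 3))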
The next statistics printed are for the multivariate hypothesis. Wilks' lambda
(likelihood-ratio criterion) varies between 0 and 1. Schatzoff (1966) has tables for its
percentage points. The following F statistic is Rao's approximate (sometimes exact) F
statistic corresponding to the likelihood-ratio criterion (see Rao, 1973). Pillai's trace
and its F approximation are taken from Pillai (1960). The Hotelling-Lawley trace and
Canonical Correlations

1 2
0.906 0.244
Dependent variable canonical coefficients standardized
by conditional (within groups) standard deviations

1 2
WEEK(1) 1.437 -0.352
WEEK(2) -0.821 1.231

Canonical loadings (correlations between conditional
dependent variables and dependent canonical factors)

1 2
WEEK(1) 0.832 0.555
WEEK(2) 0.238 0.971

-------------------------------------------------------------------------------
Test for effect called: SEX*DRUG


Univariate F Tests

Effect SS df MS F P

WEEK(1) 14.333 2 7.167 1.365 0.281
Error 94.500 18 5.250

WEEK(2) 32.333 2 16.167 2.553 0.106
Error 114.000 18 6.333


Multivariate Test Statistics

Wilks Lambda = 0.774
F-Statistic = 1.159 df = 4, 34 Prob = 0.346

Pillai Trace = 0.227
F-Statistic = 1.152 df = 4, 36 Prob = 0.348

Hotelling-Lawley Trace = 0.290
F-Statistic = 1.159 df = 4, 32 Prob = 0.347

THETA = 0.221 S = 2, M =-0.5, N = 7.5 Prob = 0.295
its F approximation are documented in Morrison (1976). The last statistic is the largest
root criterion for Roy's union-intersection test (see Morrison, 1976). Charts of the
percentage points of this statistic, found in Morrison and other multivariate texts, are
taken from Heck (1960).
The probability value printed for THETA is not an approximation. It is what you find
in the charts. In the first hypothesis, all the multivariate statistics have the same value
for the F approximation because the approximation is exact when there are only two
groups (see Hotelling's T2 in Morrison, 1976). In these cases, THETA is not printed
because it has the same probability value as the F statistic.
Because we requested extended output for the second hypothesis, we get additional
material.
Bartlett's Residual Root (Eigenvalue) Test
The chi-square statistics follow Bartlett (1947). The probability value for the first chi-
square statistic should correspond to that for the approximate multivariate F statistic in
large samples. In small samples, they might be discrepant, in which case you should
generally trust the F statistic more. The subsequent chi-square statistics are
recomputed, leaving out the first and later roots until the last root is tested. These are
sequential tests and should be treated with caution, but they can be used to decide how
many dimensions (roots and canonical correlations) are significant. The number of
significant roots corresponds to the number of significant p values in this ordered list.
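One common form of Bartlett's statistic, which reproduces the values in this example to within rounding of the printed correlations, is chi-square = -[N - 1 - (p + q + 1)/2] times the sum of ln(1 - r**2) over the roots being tested. A small numpy sketch using the canonical correlations from this output (N = 24 cases, p = 2 dependent variables, q = 2 hypothesis degrees of freedom):

import numpy as np

r = np.array([0.906, 0.244])        # canonical correlations from the output
N, p, q = 24, 2, 2

scale = N - 1 - (p + q + 1) / 2
for j in range(len(r)):
    chi2 = -scale * np.sum(np.log(1.0 - r[j:] ** 2))
    df = (p - j) * (q - j)
    print(f"roots {j + 1} through {len(r)}: chi-square = {chi2:.3f}, df = {df}")
# Prints roughly 36.5 with 4 df and 1.26 with 1 df, close to the 36.491 and 1.262 shown in the output.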
Canonical Coefficients
Dimensions with insignificant chi-square statistics in the prior tests should be ignored
in general. Corresponding to each canonical correlation is a canonical variate, whose
coefficients have been standardized by the within-groups standard deviations (the
default). Standardization by the sample standard deviation is generally used for
canonical correlation analysis or multivariate regression when groups are not present
to introduce covariation among variates. You can standardize these variates by the total
(sample) standard deviations with:
STANDARDIZE = TOTAL

inserted prior to TEST. Continue with the other test specifications described earlier.
Finally, the canonical loadings are printed. These are correlations and, thus, provide
information different from the canonical coefficients. In particular, you can identify
suppressor variables in the multivariate system by looking for differences in sign
between the coefficients and the loadings (which is the case with these data). See Bock
(1975) and Wilkinson (1975, 1977) for an interpretation of these variates.
Computation
Algorithms
Centered sums of squares and cross-products are accumulated using provisional
algorithms. Linear systems, including those involved in hypothesis testing, are solved
by using forward and reverse sweeping (Dempster, 1969). Eigensystems are solved
with Householder tridiagonalization and implicit QL iterations. For further
information, see Wilkinson and Reinsch (1971) or Chambers (1977).
References
Afifi, A. A. and Azen, S. P. (1972). Statistical analysis: A computer-oriented approach. New York: Academic Press.
Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.: Lifetime Learning Publications.
Bartlett, M. S. (1947). Multivariate analysis. Journal of the Royal Statistical Society, Series B, 9, 176–197.
Bock, R. D. (1975). Multivariate statistical methods in behavioral research. New York: McGraw-Hill.
Cochran, W. G. and Cox, G. M. (1957). Experimental designs, 2nd ed. New York: John Wiley & Sons, Inc.
Daniel, C. (1960). Locating outliers in factorial experiments. Technometrics, 2, 149–156.
Feingold, M. and Korsog, P. E. (1986). The correlation and dependence between two F statistics with the same denominator. The American Statistician, 40, 218–220.
Heck, D. L. (1960). Charts of some upper percentage points of the distribution of the largest characteristic root. Annals of Mathematical Statistics, 31, 625–642.
Hocking, R. R. (1985). The analysis of linear models. Monterey, Calif.: Brooks/Cole.
John, P. W. M. (1971). Statistical design and analysis of experiments. New York: Macmillan, Inc.
Kutner, M. H. (1974). Hypothesis testing in linear models (Eisenhart Model I). The American Statistician, 28, 98–100.
Levene, H. (1960). Robust tests for equality of variance. In I. Olkin, ed., Contributions to Probability and Statistics. Palo Alto, Calif.: Stanford University Press, 278–292.
Miller, R. (1985). Multiple comparisons. In Kotz, S. and Johnson, N. L., eds., Encyclopedia of Statistical Sciences, vol. 5. New York: John Wiley & Sons, Inc., 679–689.
Milliken, G. A. and Johnson, D. E. (1984). Analysis of messy data, Vol. 1: Designed experiments. New York: Van Nostrand Reinhold Company.
Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.
Neter, J., Wasserman, W., and Kutner, M. (1985). Applied linear statistical models, 2nd ed. Homewood, Ill.: Richard D. Irwin, Inc.
Pillai, K. C. S. (1960). Statistical table for tests of multivariate hypotheses. Manila: The Statistical Center, University of the Philippines.
Rao, C. R. (1973). Linear statistical inference and its applications, 2nd ed. New York: John Wiley & Sons, Inc.
Schatzoff, M. (1966). Exact distributions of Wilks' likelihood ratio criterion. Biometrika, 53, 347–358.
Scheffé, H. (1959). The analysis of variance. New York: John Wiley & Sons, Inc.
Searle, S. R. (1971). Linear models. New York: John Wiley & Sons, Inc.
Searle, S. R. (1987). Linear models for unbalanced data. New York: John Wiley & Sons, Inc.
Speed, F. M. and Hocking, R. R. (1976). The use of the R( ) notation with unbalanced data. The American Statistician, 30, 30–33.
Speed, F. M., Hocking, R. R., and Hackney, O. P. (1978). Methods of analysis of linear models with unbalanced data. Journal of the American Statistical Association, 73, 105–112.
Wilkinson, L. (1975). Response variable hypotheses in the multivariate analysis of variance. Psychological Bulletin, 82, 408–412.
Wilkinson, L. (1977). Confirmatory rotation of MANOVA canonical variates. Multivariate Behavioral Research, 12, 487–494.
Winer, B. J. (1971). Statistical principles in experimental design, 2nd ed. New York: McGraw-Hill.


Chapter 16
Linear Models III:
General Linear Models
Leland Wilkinson and Mark Coward
General Linear Model (GLM) can estimate and test any univariate or multivariate
general linear model, including those for multiple regression, analysis of variance or
covariance, and other procedures such as discriminant analysis and principal
components. With the general linear model, you can explore randomized block
designs, incomplete block designs, fractional factorial designs, Latin square designs,
split plot designs, crossover designs, nesting, and more. The model is:
Y = XB + e
where Y is a vector or matrix of dependent variables, X is a vector or matrix of
independent variables, B is a vector or matrix of regression coefficients, and e is a
vector or matrix of random errors. See Searle (1971), Winer (1971), Neter,
Wasserman, and Kutner (1985), or Cohen and Cohen (1983) for details.
In multivariate models, Y is a matrix of continuous measures. The X matrix can be
either continuous or categorical dummy variables, according to the type of model. For
discriminant analysis, X is a matrix of dummy variables, as in analysis of variance.
For principal components analysis, X is a constant (a single column of 1s). For
canonical correlation, X is usually a matrix of continuous right-hand variables (and Y
is the matrix of left-hand variables).
For some multivariate models, it may be easier to use ANOVA, which can handle
models with multiple dependent variables and zero, one, or more categorical
independent variables (with zero factors, only the constant is present). ANOVA
automatically generates interaction terms for the design factors.
After the parameters of a model have been estimated, they can be tested by any
general linear hypothesis of the following form:
ABC' = D
where A is a matrix of linear weights on coefficients across the independent variables
(the rows of B), C is a matrix of linear weights on the coefficients across dependent
variables (the columns of B), B is the matrix of regression coefficients or effects, and
D is a null hypothesis matrix (usually a null matrix).
For the multivariate models described in this chapter, the C matrix is an identity
matrix, and the D matrix is null. The A matrix can have several different forms, but
these are all submatrices of an identity matrix and are easily formed.
General Linear Models in SYSTAT
Model Estimation (in GLM)
To specify a general linear model using GLM, from the menus choose:
Statistics
General Linear Model (GLM)
Estimate Model
You can specify any multivariate linear model with General Linear Model. You must
select the variables to include in the desired model.
Dependent(s). The variable(s) you want to examine. The dependent variable(s) should
be continuous numeric variables (for example, income).
Independent(s). Select one or more continuous or categorical variables (grouping
variables). Independent variables that are not denoted as categorical are considered
covariates. Unlike ANOVA, GLM does not automatically include and test all
interactions. With GLM, you have to build your model. If you want interactions or
nested variables in your model, you need to build these components.
Model. The following model options allow you to include a constant in your model, do
a means model, specify the sample size, and weight cell means:
■ Include constant. The constant is an optional parameter. Deselect Include constant to obtain a model through the origin. When in doubt, include the constant.
■ Means. Specifies a fully factorial design using means coding.
■ Cases. When your data file is a symmetric matrix, specify the sample size that generated the matrix.
■ Weight. Weights cell means by the cell counts before averaging.
In addition, you can save residuals and other data to a new data file. The following
alternatives are available:
■ Residuals. Saves predicted values, residuals, Studentized residuals, and the standard error of predicted values.
■ Residuals/Data. Saves the statistics given by Residuals, plus all the variables in the working data file, including any transformed data values.
■ Adjusted. Saves adjusted cell means from analysis of covariance.
■ Adjusted/Data. Saves adjusted cell means plus all the variables in the working data file, including any transformed data values.
■ Partial. Saves partial residuals.
■ Partial/Data. Saves partial residuals plus all the variables in the working data file, including any transformed data values.
■ Model. Saves statistics given in Residuals and the variables used in the model.
■ Coefficients. Saves the estimates of the regression coefficients.
Categorical Variables
You can specify numeric or character-valued categorical (grouping) variables that
define cells. You want to categorize an independent variable when it has several
categories such as education levels, which could be divided into the following
categories: less than high school, some high school, finished high school, some
college, finished bachelor's degree, finished master's degree, and finished doctorate.
On the other hand, a variable such as age in years would not be categorical unless age
were broken into categories such as under 21, 21–65, and over 65.
To specify categorical variables, click the Categories button in the General Linear
Model dialog box.
Types of Categories. You can elect to use one of two different coding methods:
■ Effect. Produces parameter estimates that are differences from group means.
■ Dummy. Produces dummy codes for the design variables instead of effect codes.
Coding of dummy variables is the classic analysis of variance parameterization, in
which the sum of effects estimated for a classifying variable is 0. If your categorical
variable has k categories, k - 1 dummy variables are created.
Repeated Measures
In a repeated measures design, the same variable is measured several times for each
subject (case). A paired-comparison t test is the simplest form of a repeated
measures design (for example, each subject has a before and after measure).
SYSTAT derives values from your repeated measures and uses them in general
linear model computations to test changes across the repeated measures (within
subjects) as well as differences between groups of subjects (between subjects). Tests
of the within-subjects values are called Polynomial Test Of Order 1, 2,..., up to k,
where k is one less than the number of repeated measures. The first polynomial is used
to test linear changes: Do the repeated responses increase (or decrease) around a line
with a significant slope? The second polynomial tests if the responses fall along a
quadratic curve, etc.
To open the Repeated Measures dialog box, click Repeated in the General Linear
Model dialog box.
If you select Perform repeated measures analysis, SYSTAT treats the dependent
variables as a set of repeated measures. Optionally, you can assign a name for each set
of repeated measures, specify the number of levels, and specify the metric for unevenly
spaced repeated measures.
Name. Name that identifies each set of repeated measures.
Levels. Number of repeated measures in the set. For example, if you have three
dependent variables that represent measurements at different times, the number of
levels is 3.
Metric. Metric that indicates the spacing between unevenly spaced measurements. For
example, if measurements were taken at the third, fifth, and ninth weeks, the metric
would be 3, 5, 9.
General Linear Model Options
General Linear Model Options allows you to specify a tolerance level, select complete
or stepwise entry, and specify entry and removal criteria.
To open the Options dialog box, click Options in the General Linear Model dialog box.
The following options can be specified:
Tolerance. Prevents the entry of a variable that is highly correlated with the independent
variables already included in the model. Enter a value between 0 and 1. Typical values
are 0.01 or 0.001. The higher the value (closer to 1), the lower the correlation required
to exclude a variable.
Estimation. Controls the method used to enter and remove variables from the equation.
■ Complete. All independent variables are entered in a single step.
■ Mixture model. Constrains the independent variables to sum to a constant.
■ Stepwise. Variables are entered into or removed from the model, one at a time.
Stepwise Options. The following alternatives are available for stepwise entry and
removal:
■ Backward. Begins with all candidate variables in the model. At each step, SYSTAT removes the variable with the largest Remove value.
■ Forward. Begins with no variables in the model. At each step, SYSTAT adds the variable with the smallest Enter value.
■ Automatic. For Backward, at each step, SYSTAT automatically removes a variable from your model. For Forward, at each step, SYSTAT automatically adds a variable to the model.
■ Interactive. At each step in the model building, you select the variable to enter into or remove from the model.
You can also control the criteria used to enter and remove variables from the model:
■ Enter. Enters a variable into the model if its alpha value is less than the specified value. Enter a value between 0 and 1 (for example, 0.025).
■ Remove. Removes a variable from the model if its alpha value is greater than the specified value. Enter a value between 0 and 1 (for example, 0.025).
■ Force. Forces the first n variables listed in your model to remain in the equation.
■ FEnter. F-to-enter limit. Variables with F greater than the specified value are entered into the model if Tolerance permits.
■ FRemove. F-to-remove limit. Variables with F less than the specified value are removed from the model.
■ Max step. Maximum number of steps.
Pairwise Comparisons
Once you determine that your groups are different, you may want to compare pairs of
groups to determine which pairs differ.
To open the Pairwise Comparisons dialog box, from the menus choose:
Statistics
General Linear Model (GLM)
Pairwise Comparisons
Groups. You must specify the variable that defines the groups.
Test. General Linear Model provides several post hoc tests to compare levels of this
variable.
■ Bonferroni. Multiple comparison test based on Student's t statistic. Adjusts the observed significance level for the fact that multiple comparisons are made.
■ Tukey. Uses the Studentized range statistic to make all pairwise comparisons between groups and sets the experimentwise error rate to the error rate for the collection for all pairwise comparisons. When testing a large number of pairs of means, Tukey is more powerful than Bonferroni. For a small number of pairs, Bonferroni is more powerful.
■ Dunnett. The Dunnett test is available only with one-way designs. Dunnett compares a set of treatments against a single control mean that you specify. You can choose a two-sided or one-sided test. To test that the mean at any level (except the control category) of the experimental groups is not equal to that of the control category, select 2-sided. To test if the mean at any level of the experimental groups is smaller (or larger) than that of the control category, select 1-sided.
■ Fisher's LSD. Least significant difference pairwise multiple comparison test. Equivalent to multiple t tests between all pairs of groups. The disadvantage of this test is that no attempt is made to adjust the observed significance level for multiple comparisons.
■ Scheffé. The significance level of Scheffé's test is designed to allow all possible linear combinations of group means to be tested, not just the pairwise comparisons available in this feature. The result is that Scheffé's test is more conservative than other tests, meaning that a larger difference between means is required for significance.
Error Term. You can either use the mean square error specified by the model or you can
enter the mean square error.
■ Model MSE. Uses the mean square error from the general linear model that you ran.
■ MSE and df. You can specify your own mean square error term and degrees of freedom for mixed models with random factors, split-plot designs, and crossover designs with carry-over effects.
Hypothesis Tests
Contrasts are used to test relationships among cell means. The post hoc tests in GLM
Pairwise Comparison are the simplest form because they compare two means at a
time. However, general contrasts can involve any number of means in the analysis.
To test hypotheses, from the menus choose:
Statistics
General Linear Model (GLM)
Hypothesis Test
Contrasts can be defined across the categories of a grouping factor or across the levels
of a repeated measure.
Effects. Specify the factor (grouping variable) to which the contrast applies. For
principal components, specify the grouping variable for within-groups components (if
any). For canonical correlation, select All to test all of the effects in the model.
Within. Use when specifying a contrast across the levels of a repeated measures factor.
Enter the name assigned to the set of repeated measures in the Repeated Measures
subdialog box.
Error Term. You can specify which error term to use for the hypothesis tests.
■ Model MSE. Uses the mean square error from the general linear model that you ran.
■ MSE and df. You can specify your own mean square error and degrees of freedom if you know them from a previous model.
■ Between Subject(s) Effect(s). Select this option to use main effect error terms or interaction error terms in all tests. Specify interactions using an ampersand between variables.
Priors. Prior probabilities for discriminant analysis. Type a value for each group,
separated by spaces. These probabilities should add to 1. For example, if you have three
groups, priors might be 0.5, 0.3, and 0.2.
Standardize. You can standardize canonical coefficients using the total sample or a
within-groups covariance matrix.
■ Within groups is usually used in discriminant analysis to make comparisons easier when measures are on different scales.
■ Sample is used in canonical correlation.
Rotate. Specify the number of components to rotate.
Factor. In a factor analysis with grouping variables, factor the Hypothesis (between-
groups) matrix or the Error (within-groups) matrix. This allows you to compute
principal components on the hypothesis or error matrix separately, offering a direct
way to compute principal components on residuals of any linear model you wish to fit.
You can specify the matrix type as Correlations, SSCP, or Covariance.
Save scores and results. You can save the results to a SYSTAT data file. Exactly what
is saved depends on the analysis. When you save scores and results, extended output is
automatically produced. This enables you to see more detailed output when computing
these statistics.
Specify (in GLM)
To specify contrasts for between-subjects effects, click Specify in the Hypothesis Test
dialog box.
You can use GLM's cell means language to define contrasts across the levels of a
grouping variable in a multivariate model. For example, for a two-way factorial
ANOVA design with DISEASE (four categories) and DRUG (three categories), you
could contrast the marginal mean for the first level of drug against the third level by
specifying:
Note that square brackets enclose the value of the category (for example, for
GENDER$, specify GENDER$[male]). For the simple contrast of the first and third
levels of DRUG for the second disease only, specify:
The syntax also allows statements like:
In addition, you can specify the error term to use for the contrasts.
Pooled. Uses the error term from the current model.
Separate. Generates a separate variances error term.
DRUG[1] = DRUG[3]
DRUG[1] DISEASE[2] = DRUG[3] DISEASE[2]
-3*DRUG[1] - 1*DRUG[2] + 1*DRUG[3] + 3*DRUG[4]
Contrast
Contrast generates a contrast for a grouping factor or a repeated measures factor. To
open the Contrast dialog box, click Contrast in the Hypothesis Test dialog box.
SYSTAT offers several types of contrasts:
Custom. Enter your own custom coefficients. For example, if your factor has four
ordered categories (or levels), you can specify your own coefficients, such as -3 -1 1
3, by typing these values in the Custom text box.
Difference. Compares each level with its adjacent level.
Polynomial. Generates orthogonal polynomial contrasts (to test linear, quadratic, or
cubic trends across ordered categories or levels).
■ Order. Enter 1 for linear, 2 for quadratic, etc.
■ Metric. Use Metric when the ordered categories are not evenly spaced. For example, when repeated measures are collected at weeks 2, 4, and 8, enter 2,4,8 as the metric.
Sum. In a repeated measures ANOVA, totals the values for each subject.
A Matrix, C Matrix, and D Matrix
A matrix, C matrix, and D matrix are available for hypothesis testing in multivariate
models. You can test parameters of the multivariate model estimated or factor the
quadratic form of your model into orthogonal components. Linear hypotheses have the
form:

ABC' = D

These matrices (A, C, and D) may be specified in several alternative ways; if they are
not specified, they have default values. To specify an A matrix, click A matrix in the
Hypothesis Test dialog box.
A is a matrix of linear weights contrasting the coefficient estimates (the rows of B). The
A matrix has as many columns as there are regression coefficients (including the
constant) in your model. The number of rows in A determine how many degrees of
freedom your hypothesis involves. The A matrix can have several different forms, but
these are all submatrices of an identity matrix and are easily formed using Hypothesis
Test.
To specify a C matrix, click C matrix in the Hypothesis Test dialog box.
The C matrix is used to test hypotheses for repeated measures analysis of variance
designs and models with multiple dependent variables. C has as many columns as there
are dependent variables. For most multivariate models, C is an identity matrix.
To specify a D matrix, click D matrix in the Hypothesis Test dialog box.
D is a null hypothesis matrix (usually a null matrix). The D matrix, if you use it, must
have the same number of rows as A. For univariate multiple regression, D has only one
column. For multivariate models (multiple dependent variables), the D matrix has one
column for each dependent variable.
A matrix and D matrix are often used to test hypotheses in regression. Linear
hypotheses in regression have the form Aβ = D, where A is the matrix of linear weights
on coefficients across the independent variables (the rows of β), β is the matrix of
regression coefficients, and D is a null hypothesis matrix (usually a null matrix). The
A and D matrices can be specified in several alternative ways, and if they are not
specified, they have default values.
Using Commands
Select the data with USE filename and continue with:
For stepwise model building, use START in place of ESTIMATE:
To perform hypothesis tests:
Usage Considerations
Types of data. Normally, you analyze raw cases-by-variables data with General Linear
Model. You can, however, use a symmetric matrix data file (for example, a covariance
matrix saved in a file from Correlations) as input. If you use a matrix as input, you must
specify a value for Cases when estimating the model (under Group in the General
GLM
MODEL varlist1 = CONSTANT + varlist2 + var1*var2 + ,
var3(var4) / REPEAT=m,n,... REPEAT=m(x1,x2,...),
n(y1,y2,...) NAMES=name1,name2,..., MEANS,
WEIGHT N=n
CATEGORY grpvarlist / MISS EFFECT or DUMMY
SAVE filename / COEF MODEL RESID DATA PARTIAL ADJUSTED
ESTIMATE / MIX TOL=n
START / FORWARD or BACKWARD TOL=n ENTER=p REMOVE=p ,
FENTER=n FREMOVE=n FORCE=n MAXSTEP=n
STEP no argument or var or index / AUTO ENTER=p,
REMOVE=p FENTER=n FREMOVE=n
STOP
HYPOTHESIS
EFFECT varlist, var1&var2,
WITHIN name
CONTRAST [matrix] / DIFFERENCE or POLYNOMIAL or SUM
ORDER=n METRIC=m,n,
SPECIFY hypothesis lang / POOLED or SEPARATE
AMATRIX [matrix]
CMATRIX [matrix]
DMATRIX [matrix]
ALL
POST varlist / LSD or TUKEY or BONF=n or SCHEFFE or,
DUNNETT ONE or TWO CONTROL=levelname,
POOLED or SEPARATE
ROTATE=n
TYPE=CORR or COVAR or SSCP
STAND = TOTAL or WITHIN
FACTOR = HYPOTHESIS or ERROR
ERROR varlist or var1&var2 or value(df) or matrix
PRIORS m n p
TEST
Linear Model dialog box) to specify the sample size of the data file that generated the
matrix. The number you specify must be an integer greater than 2.
Be sure to include the dependent as well as independent variables in your matrix.
SYSTAT picks out the dependent variable you name in your model.
SYSTAT uses the sample size to calculate degrees of freedom in hypothesis tests.
SYSTAT also determines the type of matrix (SSCP, covariance, and so on) and adjusts
appropriately. With a correlation matrix, the raw and standardized coefficients are the
same; therefore, you cannot include a constant when using SSCP, covariance, or
correlation matrices. Because these matrices are centered, the constant term has
already been removed.
The triangular matrix input facility is useful for meta-analysis of published data
and missing value computations; however, you should heed the following warnings:
First, if you input correlation matrices from textbooks or articles, you may not get the
same regression coefficients as those printed in the source. Because of round-off error,
printed and raw data can lead to different results. Second, if you use pairwise deletion
with Correlations, the degrees of freedom for hypotheses will not be appropriate. You
may not even be able to estimate the regression coefficients because of singularities.
In general, correlation matrices containing missing data produce coefficient
estimates and hypothesis tests that are optimistic. You can correct for this by
specifying a sample size smaller than the number of actual observations (preferably set
it equal to the smallest number of cases used for any pair of variables), but this is a
guess that you can refine only by doing Monte Carlo simulations. There is no simple
solution. Beware, especially, of multivariate regressions (MANOVA and others) with
missing data on the dependent variables. You can usually compute coefficients, but
hypothesis testing produces results that are suspect.
Print options. General Linear Model produces extended output if you set the output length
to LONG or if you select Save scores and results in the Hypothesis Test dialog box.
For model estimation, extended output adds the following: total sum of product
matrix, residual (or pooled within groups) sum of product matrix, residual (or pooled
within groups) covariance matrix, and the residual (or pooled within groups)
correlation matrix.
For hypothesis testing, extended output adds A, C, and D matrices, the matrix of
contrasts, and the inverse of the cross products of contrasts, hypothesis and error sum
of product matrices, tests of residual roots, canonical correlations, coefficients, and
loadings.
Quick Graphs. If no variables are categorical, GLM produces Quick Graphs of residuals
versus predicted values. For categorical predictors, GLM produces graphs of the least
squares means for the levels of the categorical variable(s).
Saving files. Several sets of output can be saved to a file. The actual contents of the
saved file depend on the analysis. Files may include estimated regression coefficients,
model variables, residuals, predicted values, diagnostic statistics, canonical variable
scores, and posterior probabilities (among other statistics).
BY groups. Each level of any BY variables yields a separate analysis.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. GLM uses the FREQUENCY variable, if present, to duplicate cases.
Case weights. GLM uses the values of any WEIGHT variables to weight each case.
Examples
Example 1
One-Way ANOVA
The following data, KENTON, are from Neter, Wasserman, and Kutner (1985). The
data comprise unit sales of a cereal product under different types of package designs.
Ten stores were selected as experimental units. Each store was randomly assigned to
sell one of the package designs (each design was sold at two or three stores).
PACKAGE SALES
1 12
1 18
2 14
2 12
2 13
3 19
3 17
3 21
4 24
4 30
Numbers are used to code the four types of package designs; alternatively, you could
have used words. Neter, Wasserman, and Kutner report that cartoons are part of designs
1 and 3 but not designs 2 and 4; designs 1 and 2 have three colors; and designs 3 and 4
have five colors. Thus, string codes for PACKAGE$ might have been Cart 3, NoCart
3, Cart 5, and NoCart 5. Notice that the data does not need to be ordered by
PACKAGE as shown here. The input for a one-way analysis of variance is:
The output follows:
This is the standard analysis of variance table. The F ratio (11.217) appears significant,
so you could conclude that the package designs differ significantly in their effects on
sales, provided the assumptions are valid.
Pairwise Multiple Comparisons
SYSTAT offers five methods for comparing pairs of means: Bonferroni, Tukey-Kramer
HSD, Scheffé, Fisher's LSD, and Dunnett's test.
The Dunnett test is available only with one-way designs. Dunnett requires the value
of a control group against which comparisons are made. By default, two-sided tests are
computed. One-sided Dunnett tests are also available. Incidentally, for Dunnett's tests
on experimental data, you should use the one-sided option unless you cannot predict
from theory whether your experimental groups will have higher or lower means than
the control.
USE kenton
GLM
CATEGORY package
MODEL sales=CONSTANT + package
GRAPH NONE
ESTIMATE
Categorical values encountered during processing are:
PACKAGE (4 levels)
1, 2, 3, 4

Dep Var: SALES N: 10 Multiple R: 0.921 Squared multiple R: 0.849


Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

PACKAGE 258.000 3 86.000 11.217 0.007

Error 46.000 6 7.667
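As a cross-check of this table, here is a minimal numpy sketch (not SYSTAT code) that recomputes the sums of squares and the F ratio from the KENTON data listed above:

import numpy as np

sales = {1: [12, 18], 2: [14, 12, 13], 3: [19, 17, 21], 4: [24, 30]}

all_vals = np.concatenate([np.asarray(v, float) for v in sales.values()])
grand = all_vals.mean()

ss_between = sum(len(v) * (np.mean(v) - grand) ** 2 for v in sales.values())                # 258.0
ss_within = sum(((np.asarray(v, float) - np.mean(v)) ** 2).sum() for v in sales.values())   # 46.0

df_b, df_w = len(sales) - 1, len(all_vals) - len(sales)
F = (ss_between / df_b) / (ss_within / df_w)
print(ss_between, ss_within, round(F, 3))   # 258.0, 46.0, 11.217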
Comparisons for the pairwise methods are made across all pairs of least-squares
group means for the design term that is specified. For a multiway design, marginal cell
means are computed for the effects specified before the comparisons are made.
To determine significant differences, simply look for pairs with probabilities below
your critical value (for example, 0.05 or 0.01). All multiple comparison methods
handle unbalanced designs correctly.
After you estimate your ANOVA model, it is easy to do post hoc tests. To do a
Tukey HSD test, first estimate the model, then specify these commands:
The output follows:
Results show that sales for the fourth package design (five colors and no cartoons) are
significantly larger than those for packages 1 and 2. None of the other pairs differ
significantly.
HYPOTHESIS
POST package / TUKEY
TEST
COL/
ROW PACKAGE
1 1
2 2
3 3
4 4
Using least squares means.
Post Hoc test of SALES
Using model MSE of 7.667 with 6 df.
Matrix of pairwise mean differences:

1 2 3 4
1 0.0
2 -2.000 0.0
3 4.000 6.000 0.0
4 12.000 14.000 8.000 0.0

Tukey HSD Multiple Comparisons.
Matrix of pairwise comparison probabilities:

1 2 3 4
1 1.000
2 0.856 1.000
3 0.452 0.130 1.000
4 0.019 0.006 0.071 1.000
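Outside SYSTAT, a comparable Tukey HSD analysis can be sketched with statsmodels; because statsmodels computes its own pooled error term, the adjusted probabilities should agree closely (though perhaps not to every decimal) with the matrix above:

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

sales = np.array([12, 18, 14, 12, 13, 19, 17, 21, 24, 30])
package = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4])

print(pairwise_tukeyhsd(endog=sales, groups=package, alpha=0.05))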
Contrasts
This example uses two contrasts:
■ We compare the first and third packages using coefficients of (1, 0, -1, 0).
■ We compare the average performance of the first three packages with the last, using coefficients of (1, 1, 1, -3).
The input is:
For each hypothesis, we specify one contrast, so the test has one degree of freedom;
therefore, the contrast matrix has one row of numbers. These numbers are the same
ones you see in ANOVA textbooks, although ANOVA offers one advantage: you do
not have to standardize them so that their sum of squares is 1. The output follows:
For the first contrast, the F statistic (2.504) is not significant, so you cannot conclude
that the impact of the first and third package designs on sales is significantly different.
HYPOTHESIS
EFFECT = package
CONTRAST [1 0 -1 0]
TEST
HYPOTHESIS
EFFECT = package
CONTRAST [1 1 1 -3]
TEST
Test for effect called: PACKAGE

A Matrix

1 2 3 4
0.0 1.000 0.0 -1.000
Test of Hypothesis

Source SS df MS F P

Hypothesis 19.200 1 19.200 2.504 0.165
Error 46.000 6 7.667
-------------------------------------------------------------------------------

Test for effect called: PACKAGE

A Matrix

1 2 3 4
0.0 4.000 4.000 4.000
Test of Hypothesis

Source SS df MS F P

Hypothesis 204.000 1 204.000 26.609 0.002
Error 46.000 6 7.667
Incidentally, the A matrix contains the contrast. The first column (0) corresponds to the
constant in the model, and the remaining three columns (1, 0, -1) correspond to the
dummy variables for PACKAGE.
The last package design is significantly different from the other three taken as a
group. Notice that the A matrix looks much different this time. Because the effects sum
to 0, the last effect is minus the sum of the other three; that is, letting αi denote the
effect for level i of package,

α1 + α2 + α3 + α4 = 0

so

α4 = -(α1 + α2 + α3)

and the contrast is

α1 + α2 + α3 - 3α4

which is

α1 + α2 + α3 - 3(-α1 - α2 - α3)

which simplifies to

4*α1 + 4*α2 + 4*α3

Remember, SYSTAT does all this work automatically.
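For a one-way design, the hypothesis sum of squares for a single-degree-of-freedom contrast can also be reproduced by hand from the cell means as (Σ ci*mi)² / Σ(ci²/ni). The following Python sketch (not SYSTAT code) verifies both contrasts against the tables above:

import numpy as np
from scipy.stats import f

means = np.array([15.0, 13.0, 19.0, 27.0])   # cell means of SALES by PACKAGE
n = np.array([2, 3, 3, 2])                   # cell sizes
mse, dfe = 7.667, 6                          # error mean square and df from the ANOVA table

def contrast_test(c):
    c = np.asarray(c, dtype=float)
    ss = (c @ means) ** 2 / np.sum(c ** 2 / n)   # single-df hypothesis sum of squares
    F = ss / mse
    return round(ss, 3), round(F, 3), round(f.sf(F, 1, dfe), 3)

print(contrast_test([1, 0, -1, 0]))    # (19.2, 2.504, 0.165)
print(contrast_test([1, 1, 1, -3]))    # (204.0, 26.609, 0.002)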
Orthogonal Polynomials
Constructing orthogonal polynomials for between-group factors is useful when the
levels of a factor are ordered. To construct orthogonal polynomials for your between-
groups factors:
HYPOTHESIS
EFFECT = package
CONTRAST / POLYNOMIAL ORDER=2
TEST
The output is:
Make sure that the levels of the factor, after they are sorted by the procedure
(numerically or alphabetically), are ordered meaningfully on a latent dimension. If you
need a specific order, use LABEL or ORDER; otherwise, the results will not make sense.
In the example, the significant quadratic effect is the result of the fourth package
having a much larger sales volume than the other three.
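The quadratic test can be verified the same way as the contrasts in the previous example. For four equally spaced levels, the quadratic orthogonal polynomial coefficients are (1, -1, -1, 1); the Python sketch below (outside SYSTAT) reproduces the hypothesis sum of squares of 60.000:

import numpy as np
from scipy.stats import f

means = np.array([15.0, 13.0, 19.0, 27.0])
n = np.array([2, 3, 3, 2])
mse, dfe = 7.667, 6

quadratic = np.array([1.0, -1.0, -1.0, 1.0])               # quadratic polynomial coefficients
ss = (quadratic @ means) ** 2 / np.sum(quadratic ** 2 / n)
F = ss / mse
print(round(ss, 3), round(F, 3), round(f.sf(F, 1, dfe), 3))   # 60.0, 7.826, 0.031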
Effect and Dummy Coding
The effects in a least-squares analysis of variance are associated with a set of dummy
variables that SYSTAT generates automatically. Ordinarily, you do not have to concern
yourself with these dummy variables; however, if you want to see them, you can save
them in to a SYSTAT file. The input is:
Test for effect called: PACKAGE

A Matrix

1 2 3 4
0.0 0.0 -1.000 -1.000
Test of Hypothesis

Source SS df MS F P

Hypothesis 60.000 1 60.000 7.826 0.031
Error 46.000 6 7.667
USE kenton
GLM
CATEGORY package
MODEL sales=CONSTANT + package
GRAPH NONE
SAVE mycodes / MODEL
ESTIMATE
USE mycodes
FORMAT 12,0
LIST SALES x(1..3)
The listing of the dummy variables follows:
The variables X(1), X(2), and X(3) are the effects coding dummy variables generated
by the procedure. All cases in the first cell are associated with dummy values 1 0 0;
those in the second cell with 0 1 0; the third, 0 0 1; and the fourth, -1 -1 -1. Other least-
squares programs use different methods to code dummy variables. The coding used by
SYSTAT is most widely used and guarantees that the effects sum to 0.
If you had used dummy coding, these dummy variables would be saved:
This coding yields parameter estimates that are the differences between the mean for
each group and the mean of the last group.
Case Number SALES X(1) X(2) X(3)
1 12 1 0 0
2 18 1 0 0
3 14 0 1 0
4 12 0 1 0
5 13 0 1 0
6 19 0 0 1
7 17 0 0 1
8 21 0 0 1
9 24 -1 -1 -1
10 30 -1 -1 -1
SALES X(1) X(2) X(3)
12 1 0 0
18 1 0 0
14 0 1 0
12 0 1 0
13 0 1 0
19 0 0 1
17 0 0 1
21 0 0 1
24 0 0 0
30 0 0 0
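If you want to see how such coding schemes can be built by hand, the following Python sketch (illustrative only; it is not SYSTAT's internal code) constructs both matrices from the PACKAGE codes:

import numpy as np

package = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4])   # PACKAGE code for each of the 10 cases
levels = np.unique(package)                          # [1, 2, 3, 4]

# Dummy coding: indicator columns for all levels except the last (reference) group
dummy = np.column_stack([(package == lev).astype(int) for lev in levels[:-1]])

# Effects coding: same columns, but cases in the last group get -1 everywhere
effects = dummy.copy()
effects[package == levels[-1]] = -1

print(dummy)     # rows for package 4 are  0  0  0
print(effects)   # rows for package 4 are -1 -1 -1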
Example 2
Randomized Block Designs
A randomized block design is like a factorial design without an interaction term. The
following example is from Neter, Wasserman, and Kutner (1985). Five blocks of
judges were given the task of analyzing three treatments. Judges are stratified within
blocks, so the interaction of blocks and treatments cannot be analyzed. These data are
in the file BLOCK. The input is:
You must use GLM instead of ANOVA because you do not want the BLOCK*TREAT
interaction in the model. The output is:
Example 3
Incomplete Block Designs
Randomized blocks can be used in factorial designs. Here is an example from John
(1971). The data (in the file JOHN) involve an experiment with three treatment factors
(A, B, and C) plus a blocking variable with eight levels. Notice that data were collected
on 32 of the possible 64 experimental situations.
USE block
GLM
CATEGORY block, treat
MODEL judgment = CONSTANT + block + treat
ESTIMATE
Dep Var: JUDGMENT N: 15 Multiple R: 0.970 Squared multiple R: 0.940

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

BLOCK 171.333 4 42.833 14.358 0.001
TREAT 202.800 2 101.400 33.989 0.000

Error 23.867 8 2.983
The input is:
The output follows:
BLOCK A B C Y BLOCK A B C Y
1 1 1 1 101 5 1 1 1 87
1 2 1 2 373 5 2 1 2 324
1 1 2 2 398 5 1 2 1 279
1 2 2 1 291 5 2 2 2 471
2 1 1 2 312 6 1 1 2 323
2 2 1 1 106 6 2 1 1 128
2 1 2 1 265 6 1 2 2 423
2 2 2 2 450 6 2 2 1 334
3 1 1 1 106 7 1 1 1 131
3 2 2 1 306 7 2 1 1 103
3 1 1 2 324 7 1 2 2 445
3 2 2 2 449 7 2 2 2 437
4 1 2 1 272 8 1 1 2 324
4 2 1 1 89 8 2 1 2 361
4 1 2 2 407 8 1 2 1 302
4 2 1 2 338 8 2 2 1 272
USE john
GLM
CATEGORY block, a, b, c
MODEL y = CONSTANT + block + a + b + c +,
a*b + a*c + b*c + a*b*c
ESTIMATE
Dep Var: Y N: 32 Multiple R: 0.994 Squared multiple R: 0.988


Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

BLOCK 2638.469 7 376.924 1.182 0.364
A 3465.281 1 3465.281 10.862 0.004
B 161170.031 1 161170.031 505.209 0.000
C 278817.781 1 278817.781 873.992 0.000
A*B 28.167 1 28.167 0.088 0.770
A*C 1802.667 1 1802.667 5.651 0.029
B*C 11528.167 1 11528.167 36.137 0.000
A*B*C 45.375 1 45.375 0.142 0.711

Error 5423.281 17 319.017
Example 4
Fractional Factorial Designs
Sometimes a factorial design involves so many combinations of treatments that certain
cells must be left empty to save experimental resources. At other times, a complete
randomized factorial study is designed, but loss of subjects leaves one or more cells
completely missing. These models are similar to incomplete block designs because not
all effects in the full model can be estimated. Usually, certain interactions must be left
out of the model.
The following example uses some experimental data that contain values in only 8
out of 16 possible cells. Each cell contains two cases. The pattern of nonmissing cells
makes it possible to estimate only the main effects plus three two-way interactions. The
data are in the file FRACTION.
The input follows:
A B C D Y
1 1 1 1 7
1 1 1 1 3
2 2 1 1 1
2 2 1 1 2
2 1 2 1 12
2 1 2 1 13
1 2 2 1 14
1 2 2 1 15
2 1 1 2 8
2 1 1 2 6
1 2 1 2 12
1 2 1 2 10
1 1 2 2 6
1 1 2 2 4
2 2 2 2 6
2 2 2 2 7
USE fraction
GLM
CATEGORY a, b, c, d
MODEL y = CONSTANT + a + b + c + d + a*b + a*c + b*c
ESTIMATE
We must use GLM instead of ANOVA to omit the higher-way interactions that ANOVA
automatically generates. The output is:
When missing cells turn up by chance rather than by design, you may not know which
interactions to eliminate. When you attempt to fit the full model, SYSTAT informs you
that the design is singular. In that case, you may need to try several models before
finding an estimable one. It is usually best to begin by leaving out the highest-order
interaction (A*B*C*D in this example). Continue with subset models until you get an
ANOVA table.
Looking for an estimable model is not the same as analyzing the data with stepwise
regression because you are not looking at p values. After you find an estimable model,
stop and settle with the statistics printed in the ANOVA table.
Example 5
Nested Designs
Nested designs resemble factorial designs with certain cells missing (incomplete
factorials). This is because one factor is nested under another, so that not all
combinations of the two factors are observed. For example, in an educational study,
classrooms are usually nested under schools because it is impossible to have the same
classroom existing at two different schools (except as antimatter). The following
Dep Var: Y N: 16 Multiple R: 0.972 Squared multiple R: 0.944


Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

A 16.000 1 16.000 8.000 0.022
B 4.000 1 4.000 2.000 0.195
C 49.000 1 49.000 24.500 0.001
D 4.000 1 4.000 2.000 0.195
A*B 182.250 1 182.250 91.125 0.000
A*C 12.250 1 12.250 6.125 0.038
B*C 2.250 1 2.250 1.125 0.320

Error 16.000 8 2.000
example (in which teachers are nested within schools) is from Neter, Wasserman, and
Kutner (1985). The data (learning scores) look like this:
In the study, there are actually six teachers, not just two; thus, the design really looks
like this:
The data are set up in the file SCHOOLS.
TEACHER1 TEACHER2
SCHOOL1 25
29
14
11
SCHOOL2 11
6
22
18
SCHOOL3 17
20
5
2
TEACHER1 TEACHER2 TEACHER3 TEACHER4 TEACHER5 TEACHER6
SCHOOL1 25
29
14
11
SCHOOL2 11
6
22
18
SCHOOL3 17
20
5
2
TEACHER SCHOOL LEARNING
1 1 25
1 1 29
2 1 14
2 1 11
3 2 11
3 2 6
4 2 22
4 2 18
5 3 17
5 3 20
6 3 5
6 3 2
The input is:
The output follows:
Your data can use any codes for TEACHER, including a separate code for every teacher
in the study, as long as each different teacher within a given school has a different code.
GLM will use the nesting specified in the MODEL statement to determine the pattern of
nesting. You can, for example, allow teachers in different schools to share codes.
This example is a balanced nested design. Unbalanced designs (unequal number of
cases per cell) are handled automatically in SYSTAT because the estimation method
is least squares.
Example 6
Split Plot Designs
The split plot design is closely related to the nested design. In the split plot, however,
plots are often considered a random factor; therefore, you have to construct different
error terms to test different effects. The following example involves two treatments: A
(between plots) and B (within plots). The numbers in the cells are the YIELD of the
crop within plots.
USE schools
GLM
CATEGORY teacher, school
MODEL learning = CONSTANT + school + teacher(school)
ESTIMATE
Dep Var: LEARNING N: 12 Multiple R: 0.972 Squared multiple R: 0.945


Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

SCHOOL 156.500 2 78.250 11.179 0.009
TEACHER(SCHOOL) 567.500 3 189.167 27.024 0.001

Error 42.000 6 7.000
A1 A2
PLOT1 PLOT2 PLOT3 PLOT4
B1
0 3 4 5
B2
0 1 2 4
B3
5 5 7 6
B4
3 4 8 6
Here are the data from the PLOTS data file in the form needed by SYSTAT:
To analyze this design, you need two different error terms. For the between-plots
effects (A), you need plots within A. For the within-plots effects (B and A*B), you
need B by plots within A.
First, fit the saturated model with all the effects and then specify different error
terms as needed. The input is:
The output follows:
PLOT A B YIELD
1 1 1 0
1 1 2 0
1 1 3 5
1 1 4 3
2 1 1 3
2 1 2 1
2 1 3 5
2 1 4 4
3 2 1 4
3 2 2 2
3 2 3 7
3 2 4 8
4 2 1 5
4 2 2 4
4 2 3 6
4 2 4 6
USE plots
GLM
CATEGORY plot, a, b
MODEL yield = CONSTANT + a + b + a*b + plot(a) + b*plot(a)
ESTIMATE
Dep Var: YIELD N: 16 Multiple R: 1.000 Squared multiple R: 1.000

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

A 27.563 1 27.563 . .
B 42.688 3 14.229 . .
A*B 2.188 3 0.729 . .
PLOT(A) 3.125 2 1.562 . .
B*PLOT(A) 7.375 6 1.229 . .

Error 0.0 0 .
You do not get a full ANOVA table because the model is perfectly fit. The coefficient
of determination (Squared multiple R) is 1. Now you have to use some of the effects as
error terms.
Between-Plots Effects
Lets test for between-plots effects, namely A. The input is:
The output is:
The between-plots effect is not significant (p = 0.052).
Within-Plots Effects
To do the within-plots effects (B and A*B), the input is:
The output follows:
HYPOTHESIS
EFFECT = a
ERROR = plot(a)
TEST
Test for effect called: A

Test of Hypothesis

Source SS df MS F P

Hypothesis 27.563 1 27.563 17.640 0.052
Error 3.125 2 1.562
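The custom error term simply replaces the denominator of the F ratio. As a cross-check outside SYSTAT, the test can be reproduced directly from the two mean squares:

from scipy.stats import f

ms_a, df_a = 27.563, 1            # hypothesis mean square and df for effect A
ms_plot_a, df_plot_a = 1.562, 2   # PLOT(A) mean square serves as the error term

F = ms_a / ms_plot_a
print(round(F, 3), round(f.sf(F, df_a, df_plot_a), 3))   # about 17.6 and 0.052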
HYPOTHESIS
EFFECT = b
ERROR = b*plot(a)
TEST
HYPOTHESIS
EFFECT = a*b
ERROR = b*plot(a)
TEST
Test for effect called: B

Test of Hypothesis

Source SS df MS F P

Hypothesis 42.687 3 14.229 11.576 0.007
Error 7.375 6 1.229

-------------------------------------------------------------------------------
Here, we find a significant effect due to factor B (p = 0.007), but the interaction is not
significant (p = 0.642).
This analysis is the same as that for a repeated measures design with subjects as
PLOT, groups as A, and trials as B. Because this method becomes unwieldy for a large
number of plots (subjects), SYSTAT offers a more compact method for repeated
measures analysis as an alternative.
Example 7
Latin Square Designs
A Latin square design imposes a pattern on treatments in a factorial design to save
experimental effort or reduce within cell error. As in the nested design, not all
combinations of the square and other treatments are measured, so the model lacks
certain interaction terms between squares and treatments. GLM can analyze these
designs easily if an extra variable denoting the square is included in the file. The
following fixed effects example is from Neter, Wasserman, and Kutner (1985). The
SQUARE variable is represented in the cells of the design. For simplicity, the
dependent variable, RESPONSE, has been left out.
Test for effect called: A*B

Test of Hypothesis

Source SS df MS F P

Hypothesis 2.188 3 0.729 0.593 0.642
Error 7.375 6 1.229
day1 day2 day3 day4 day5
week1
D C A B E
week2
C B E A D
week3
A D B E C
week4
E A C D B
week5
B E D C A
You would set up the data as shown below (the LATIN file).
To do the analysis, the input is:
DAY WEEK SQUARE RESPONSE
1 1 D 18
1 2 C 17
1 3 A 14
1 4 E 21
1 5 B 17
2 1 C 13
2 2 B 34
2 3 D 21
2 4 A 16
2 5 E 15
3 1 A 7
3 2 E 29
3 3 B 32
3 4 C 27
3 5 D 13
4 1 B 17
4 2 A 13
4 3 E 24
4 4 D 31
4 5 C 25
5 1 E 21
5 2 D 26
5 3 C 26
5 4 B 31
5 5 A 7
USE latin
GLM
CATEGORY day, week, square
MODEL response = CONSTANT + day + week + square
ESTIMATE
The output follows:
Example 8
Crossover and Changeover Designs
In crossover designs, an experiment is divided into periods, and the treatment of a
subject changes from one period to the next. Changeover studies often use designs
similar to a Latin square. A problem with these designs is that there may be a residual
or carry-over effect of a treatment into the following period. This can be minimized by
extending the interval between experimental periods; however, this is not always
feasible. Fortunately, there are methods to assess the magnitude of any carry-over
effects that may be present.
Two-period crossover designs can be analyzed as repeated-measures designs. More
complicated crossover designs can also be analyzed by SYSTAT, and carry-over
effects can be assessed. Cochran and Cox (1957) present a study of milk production by
cows under three different feed schedules: A (roughage), B (limited grain), and C (full
grain). The design of the study has the form of two 3 × 3 Latin squares:
Dep Var: RESPONSE N: 25 Multiple R: 0.931 Squared multiple R: 0.867


Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

DAY 82.000 4 20.500 1.306 0.323
WEEK 477.200 4 119.300 7.599 0.003
SQUARE 664.400 4 166.100 10.580 0.001

Error 188.400 12 15.700
COW
Latin square 1 Latin square 2
Period I II III IV V VI
1
A B C A B C
2
B C A C A B
3
C A B B C A
The data are set up in the WILLIAMS data file as follows:
PERIOD is nested within each Latin square (the periods for cows in one square are
unrelated to the periods in the other). The variable RESIDUAL indicates the treatment
of the preceding period. For the first period for each cow, there is no preceding period.
The input is:
COW SQUARE PERIOD FEED CARRY RESIDUAL MILK
1 1 1 1 1 0 38
1 1 2 2 1 1 25
1 1 3 3 2 2 15
2 1 1 2 1 0 109
2 1 2 3 2 2 86
2 1 3 1 2 3 39
3 1 1 3 1 0 124
3 1 2 1 2 3 72
3 1 3 2 1 1 27
4 2 1 1 1 0 86
4 2 2 3 1 1 76
4 2 3 2 2 3 46
5 2 1 2 1 0 75
5 2 2 1 2 2 35
5 2 3 3 1 1 34
6 2 1 3 1 0 101
6 2 2 2 2 3 63
6 2 3 1 2 2 1
USE williams
GLM
CATEGORY cow, period, square, residual, carry, feed
MODEL milk = CONSTANT + cow + feed +,
period(square) + residual(carry)
ESTIMATE
The output follows:
There is a significant effect of feed on milk production and an insignificant residual or
carry-over effect in this instance.
Type I Sums-of-Squares Analysis
To replicate the Cochran and Cox Type I sums-of-squares analysis, you must fit a new
model to get their sums of squares. The following commands test the COW effect.
Notice that the Error specification uses the mean square error (MSE) from the previous
analysis. It also contains the error degrees of freedom (4) from the previous model.
The output follows:
Dep Var: MILK N: 18 Multiple R: 0.995 Squared multiple R: 0.990

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

COW 3835.950 5 767.190 15.402 0.010
FEED 2854.550 2 1427.275 28.653 0.004
PERIOD(SQUARE) 3873.950 4 968.488 19.443 0.007
RESIDUAL(CARRY) 616.194 2 308.097 6.185 0.060

Error 199.250 4 49.813
USE williams
GLM
CATEGORY cow
MODEL milk = CONSTANT + cow
ESTIMATE
HYPOTHESIS
EFFECT = cow
ERROR = 49.813(4)
TEST
Dep Var: MILK N: 18 Multiple R: 0.533 Squared multiple R: 0.284


Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

COW 5781.111 5 1156.222 0.952 0.484

Error 14581.333 12 1215.111

-------------------------------------------------------------------------------

The remaining term, PERIOD, requires a different model. PERIOD is nested within
SQUARE.
The resulting output is:
Test for effect called: COW

Test of Hypothesis

Source SS df MS F P

Hypothesis 5781.111 5 1156.222 23.211 0.005
Error 199.252 4 49.813
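Specifying ERROR = 49.813(4) simply fixes the denominator mean square and its degrees of freedom. The same F ratio and probability can be reproduced outside SYSTAT:

from scipy.stats import f

ms_cow, df_cow = 1156.222, 5   # hypothesis mean square and df from the COW-only model
mse, dfe = 49.813, 4           # error mean square and df carried over from the full model

F = ms_cow / mse
print(round(F, 3), round(f.sf(F, df_cow, dfe), 3))   # about 23.21 and 0.005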
USE williams
GLM
CATEGORY period square
MODEL milk = CONSTANT + period(square)
ESTIMATE
HYPOTHESIS
EFFECT = period(square)
ERROR = 49.813(4)
TEST
Dep Var: MILK N: 18 Multiple R: 0.751 Squared multiple R: 0.564


Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

PERIOD(SQUARE) 11489.111 4 2872.278 4.208 0.021

Error 8873.333 13 682.564

-------------------------------------------------------------------------------

> HYPOTHESIS
> EFFECT = period(square)
> ERROR = 49.813(4)
> TEST
Test for effect called: PERIOD(SQUARE)

Test of Hypothesis

Source SS df MS F P

Hypothesis 11489.111 4 2872.278 57.661 0.001
Error 199.252 4 49.813
Example 9
Missing Cells Designs (the Means Model)
When cells are completely missing in a factorial design, parameterizing a model can
be difficult. The full model cannot be estimated. GLM offers a means model
parameterization so that missing cell parameters can be dropped automatically from
the model, and hypotheses for main effects and interactions can be tested by specifying
cells directly. Examine Searle (1987), Hocking (1985), or Milliken and Johnson (1984)
for more information in this area.
Widely favored for this purpose by statisticians (Searle, 1987; Hocking, 1985;
Milliken and Johnson, 1984), the means model allows:
n Tests of hypotheses in missing cells designs (using Type IV sums of squares)
n Tests of simple hypotheses (for example, within levels of other factors)
n The use of population weights to reflect differences in subclass sizes
Effects coding is the default for GLM. Alternatively, means models code predictors as
cell means rather than as effects, which are deviations from a grand mean. The constant is omitted,
and the predictors are 1 for a case belonging to a given cell and 0 for all others. When
cells are missing, GLM automatically excludes null columns and estimates the
submodel.
The categorical variables are specified in the MODEL statement differently for a
means model than for an effects model. Here are some examples:
The first two models generate fully factorial designs (A by B and group by AGE by
SCHOOL$). Notice that they omit the constant and main effects parameters because the
means model does not include effects or a grand mean. Nevertheless, the number of
parameters is the same in the two models. The following are the effects model and the
means model, respectively, for a 2 × 3 design (two levels of A and three levels of B):
MODEL y = a*b / MEANS
MODEL y = group*age*school$ / MEANS
MODEL y = CONSTANT + A + B + A*B
Means and effects models can be blended for incomplete factorials and other designs.
All crossed terms (for example, A*B) will be coded with means design variables
(provided the MEANS option is present), and the remaining terms will be coded as
effects. The constant must be omitted, even in these cases, because it is collinear with
the means design variables. All covariates and effects-coded factors must
precede the crossed factors in the MODEL statement.
Here is an example, assuming A has four levels, B has two, and C has three. In this
design, there are 24 possible cells, but only 12 are nonmissing. The treatment
combinations are partially balanced across the levels of B and C.
A B m a1 b1 b2 a1b1 a1b2
1 1 1 1 1 0 1 0
1 2 1 1 0 1 0 1
1 3 1 1 -1 -1 -1 -1
2 1 1 -1 1 0 -1 0
2 2 1 -1 0 1 0 -1
2 3 1 -1 -1 -1 1 1
MODEL y = A*B / MEANS
A B a1b1 a1b2 a1b3 a2b1 a2b2 a2b3
1 1 1 0 0 0 0 0
1 2 0 1 0 0 0 0
1 3 0 0 1 0 0 0
2 1 0 0 0 1 0 0
2 2 0 0 0 0 1 0
2 3 0 0 0 0 0 1
MODEL y = A + B*C / MEANS
Nutritional Knowledge Survey
The following example, which uses the data file MJ202, is from Milliken and Johnson
(1984). The data are from a home economics survey experiment. DIFF is the change
in test scores between pre-test and post-test on a nutritional knowledge questionnaire.
GROUP classifies whether or not a subject received food stamps. AGE designates four
age groups, and RACE$ was their term for designating Whites, Blacks, and Hispanics.
Empty cells denote age/race combinations for which no data were collected. Numbers
within cells refer to cell designations in the Fisher LSD pairwise mean comparisons at
the end of this example.
First, fit the model. The input is:
A B C a1 a2 a3 b1c1 b1c2 b1c3 b2c1 b2c2 b2c3
1 1 1 1 0 0 1 0 0 0 0 0
3 1 1 0 0 1 1 0 0 0 0 0
2 1 2 0 1 0 0 1 0 0 0 0
4 1 2 -1 -1 -1 0 1 0 0 0 0
1 1 3 1 0 0 0 0 1 0 0 0
4 1 3 -1 -1 -1 0 0 1 0 0 0
2 2 1 0 1 0 0 0 0 1 0 0
3 2 1 0 0 1 0 0 0 1 0 0
2 2 2 0 1 0 0 0 0 0 1 0
4 2 2 -1 -1 -1 0 0 0 0 1 0
1 2 3 1 0 0 0 0 0 0 0 1
3 2 3 0 0 1 0 0 0 0 0 1
          Group 0              Group 1
     1    2    3    4     1    2    3    4
W    1    3    6          9   10   13   15
H              5                   12
B         2    4    7     8        11   14
USE mj202
GLM
CATEGORY group age race$
MODEL diff = group*age*race$ / MEANS
ESTIMATE
The output follows:
We need to test the GROUP main effect. The following notation is equivalent to
Milliken and Johnson's. Because of the missing cells, the GROUP effect must be
computed over means that are balanced across the other factors.
In the drawing at the beginning of this example, notice that this specification
contrasts all the numbered cells in group 0 (except 2) with all the numbered cells in
group 1 (except 8 and 15). The input is:
The output is:
Means Model


Dep Var: DIFF N: 107 Multiple R: 0.538 Squared multiple R: 0.289

***WARNING***
Missing cells encountered. Tests of factors will not appear.
Ho: All means equal.

Unweighted Means Model

Analysis of Variance


Source Sum-of-Squares df Mean-Square F-ratio P
Model 1068.546 14 76.325 2.672 0.003
Error 2627.472 92 28.559
HYPOTHESIS
NOTE GROUP MAIN EFFECT
SPECIFY ,
group[0] age[1] race$[W] + group[0] age[2] race$[W] +,
group[0] age[3] race$[B] + group[0] age[3] race$[H] +,
group[0] age[3] race$[W] + group[0] age[4] race$[B] =,
group[1] age[1] race$[W] + group[1] age[2] race$[W] +,
group[1] age[3] race$[B] + group[1] age[3] race$[H] +,
group[1] age[3] race$[W] + group[1] age[4] race$[B]
TEST
Hypothesis.

A Matrix

1 2 3 4 5
-1.000 0.0 -1.000 -1.000 -1.000

6 7 8 9 10
-1.000 -1.000 0.0 1.000 1.000

11 12 13 14 15
1.000 1.000 1.000 1.000 0.0
Null hypothesis value for D
0.0
Test of Hypothesis

Source SS df MS F P

Hypothesis 75.738 1 75.738 2.652 0.107
Error 2627.472 92 28.559
The computations for the AGE main effect are similar to those for the GROUP main
effect:
The output follows:
The GROUP by AGE interaction requires more complex balancing than the main
effects. It is derived from a subset of the means in the following specified combination.
Again, check Milliken and Johnson to see the correspondence.
HYPOTHESIS
NOTE AGE MAIN EFFECT
SPECIFY ,
GROUP[1] AGE[1] RACE$[B] + GROUP[1] AGE[1] RACE$[W] =,
GROUP[1] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[W];,
GROUP[0] AGE[2] RACE$[B] + GROUP[1] AGE[2] RACE$[W] =,
GROUP[0] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[W];,
GROUP[0] AGE[3] RACE$[B] + GROUP[1] AGE[3] RACE$[B] +,
GROUP[1] AGE[3] RACE$[W] =,
GROUP[0] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[B] +,
GROUP[1] AGE[4] RACE$[W]
TEST
Hypothesis.

A Matrix

1 2 3 4 5
1 0.0 0.0 0.0 0.0 0.0
2 0.0 -1.000 0.0 0.0 0.0
3 0.0 0.0 0.0 -1.000 0.0


6 7 8 9 10
1 0.0 0.0 -1.000 -1.000 0.0
2 0.0 1.000 0.0 0.0 -1.000
3 0.0 1.000 0.0 0.0 0.0


11 12 13 14 15
1 0.0 0.0 0.0 1.000 1.000
2 0.0 0.0 0.0 0.0 1.000
3 -1.000 0.0 -1.000 1.000 1.000

D Matrix

1 0.0
2 0.0
3 0.0

Test of Hypothesis

Source SS df MS F P

Hypothesis 41.526 3 13.842 0.485 0.694
Error 2627.472 92 28.559
The input is:
The output is:
HYPOTHESIS
NOTE GROUP BY AGE INTERACTION
SPECIFY ,
group[0] age[1] race$[W] - group[0] age[3] race$[W] -,
group[1] age[1] race$[W] + group[1] age[3] race$[W] +,
group[0] age[3] race$[B] - group[0] age[4] race$[B] -,
group[1] age[3] race$[B] + group[1] age[4] race$[B]=0.0;,
group[0] age[2] race$[W] - group[0] age[3] race$[W] -,
group[1] age[2] race$[W] + group[1] age[3] race$[W] +,
group[0] age[3] race$[B] - group[0] age[4] race$[B] -,
group[1] age[3] RACE$[B] + group[1] age[4] race$[B]=0.0;,
group[0] age[3] race$[B] - group[0] age[4] race$[B] -,
group[1] age[3] race$[B] + group[1] age[4] race$[B]=0.0
TEST
Hypothesis.

A Matrix

1 2 3 4 5
1 -1.000 0.0 0.0 -1.000 0.0
2 0.0 0.0 -1.000 -1.000 0.0
3 0.0 0.0 0.0 -1.000 0.0


6 7 8 9 10
1 1.000 1.000 0.0 1.000 0.0
2 1.000 1.000 0.0 0.0 1.000
3 0.0 1.000 0.0 0.0 0.0


11 12 13 14 15
1 1.000 0.0 -1.000 -1.000 0.0
2 1.000 0.0 -1.000 -1.000 0.0
3 1.000 0.0 0.0 -1.000 0.0

D Matrix

1 0.0
2 0.0
3 0.0

Test of Hypothesis

Source SS df MS F P

Hypothesis 91.576 3 30.525 1.069 0.366
Error 2627.472 92 28.559
The following commands are needed to produce the rest of Milliken and Johnson's
results. The remaining output is not listed.
Finally, Milliken and Johnson do pairwise comparisons:
HYPOTHESIS
NOTE RACE$ MAIN EFFECT
SPECIFY ,
group[0] age[2] race$[B] + group[0] age[3] race$[B] +,
group[1] age[1] race$[B] + group[1] age[3] race$[B] +,
group[1] age[4] race$[B] =,
group[0] age[2] race$[W] + group[0] age[3] race$[W] +,
group[1] age[1] race$[W] + group[1] age[3] race$[W] +,
group[1] age[4] race$[W];,
group[0] age[3] race$[H] + group[1] age[3] race$[H] =,
group[0] age[3] race$[W] + group[1] age[3] race$[W]
TEST
HYPOTHESIS
NOTE GROUP*RACE$
SPECIFY ,
group[0] age[3] race$[B] - group[0] age[3] race$[W] -,
group[1] age[3] race$[B] + group[1] age[3] race$[W]=0.0;,
group[0] age[3] race$[H] - group[0] age[3] race$[W] -,
group[1] age[3] race$[H] + group[1] age[3] race$[W]=0.0
TEST
HYPOTHESIS
NOTE 'AGE*RACE$'
SPECIFY ,
group[1] age[1] race$[B] - group[1] age[1] race$[W] -,
group[1] age[4] race$[B] + group[1] age[4] race$[W]=0.0;,
group[0] age[2] race$[B] - group[0] age[2] race$[W] -,
group[0] age[3] race$[B] + group[0] age[3] race$[W]=0.0;,
group[1] age[3] race$[B] - group[1] age[3] race$[W] -,
group[1] age[4] race$[B] + group[1] age[4] race$[W]=0.0
TEST
HYPOTHESIS
POST group*age*race$ / LSD
TEST
The following is the matrix of comparisons printed by GLM. The matrix of mean
differences has been omitted.
Within group 0 (cells 1 through 7), there are no significant pairwise differences in average test
score changes. The same is true within group 1 (cells 8 through 15).
COL/
ROW GROUP AGE RACE$
1 0 1 W
2 0 2 B
3 0 2 W
4 0 3 B
5 0 3 H
6 0 3 W
7 0 4 B
8 1 1 B
9 1 1 W
10 1 2 W
11 1 3 B
12 1 3 H
13 1 3 W
14 1 4 B
15 1 4 W
Using unweighted means.
Post Hoc test of DIFF

Using model MSE of 28.559 with 92 df.
Fisher's Least-Significant-Difference Test.
Matrix of pairwise comparison probabilities:

1 2 3 4 5
1 1.000
2 0.662 1.000
3 0.638 0.974 1.000
4 0.725 0.323 0.295 1.000
5 0.324 0.455 0.461 0.161 1.000
6 0.521 0.827 0.850 0.167 0.497
7 0.706 0.901 0.912 0.527 0.703
8 0.197 0.274 0.277 0.082 0.780
9 0.563 0.778 0.791 0.342 0.709
10 0.049 0.046 0.042 0.004 0.575
11 0.018 0.016 0.015 0.002 0.283
12 0.706 0.901 0.912 0.527 0.703
13 0.018 0.007 0.005 0.000 0.456
14 0.914 0.690 0.676 0.908 0.403
15 0.090 0.096 0.090 0.008 0.783
6 7 8 9 10
6 1.000
7 0.971 1.000
8 0.292 0.543 1.000
9 0.860 0.939 0.514 1.000
10 0.026 0.392 0.836 0.303 1.000
11 0.010 0.213 0.451 0.134 0.425
12 0.971 1.000 0.543 0.939 0.392
13 0.000 0.321 0.717 0.210 0.798
14 0.610 0.692 0.288 0.594 0.168
15 0.059 0.516 0.930 0.447 0.619
11 12 13 14 15
11 1.000
12 0.213 1.000
13 0.466 0.321 1.000
14 0.082 0.692 0.124 1.000
15 0.219 0.516 0.344 0.238 1.000
Example 10
Covariance Alternatives to Repeated Measures
Analysis of covariance offers an alternative to repeated measures in a pre-post design.
You can use the pre-test as a covariate in predicting the post-test. This example shows
how to do a two-group, pre-post design:
When using this design, be sure to check the homogeneity of slopes assumption. Use
the following commands to check that the interaction term, GROUP*PRE, is not
significant:
Example 11
Weighting Means
Sometimes you want to weight the cell means when you test hypotheses in ANOVA.
Suppose you have an experiment in which a few rats died before its completion. You
do not want the hypotheses tested to depend upon the differences in cell sizes (which
are presumably random). Here is an example from Morrison (1976). The data
(MOTHERS) are hypothetical profiles on three scales of mothers in each of four
socioeconomic classes.
Morrison analyzes these data with the multivariate profile model for repeated
measures. Because the hypothesis of parallel profiles across classes is not rejected, you
can test whether the profiles are level. That is, do the scales differ when we pool the
classes together?
Pooling unequal classes can be done by weighting each according to sample size or
averaging the means of the subclasses. First, lets look at the model and test the
hypothesis of equality of scale parameters without weighting the cell means.
GLM
USE filename
CATEGORY group
MODEL post = CONSTANT + group + pre
ESTIMATE
GLM
USE filename
CATEGORY group
MODEL post = CONSTANT + group + pre + group*pre
ESTIMATE
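The same homogeneity-of-slopes check can be sketched outside SYSTAT, for example with the statsmodels formula interface in Python. The file name and lowercase column names below are placeholders standing in for your own pre-post data:

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical file with POST, PRE, and GROUP columns (stand-ins for your own data)
data = pd.read_csv("prepost.csv")

ancova = smf.ols("POST ~ C(GROUP) + PRE", data=data).fit()              # parallel-slopes model
full = smf.ols("POST ~ C(GROUP) + PRE + C(GROUP):PRE", data=data).fit() # adds GROUP*PRE

# A significant improvement for the full model means the GROUP*PRE interaction matters,
# so the homogeneity-of-slopes assumption would be questionable.
print(anova_lm(ancova, full))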
The input is:
The output is:
Notice that the dependent variable means differ from the CONSTANT. The CONSTANT
in this case is a mean of the cell means rather than the mean of all the cases.
USE mothers
GLM
CATEGORY class
MODEL scale(1 .. 3) = CONSTANT + class
ESTIMATE
HYPOTHESIS
EFFECT = CONSTANT
CMATRIX [1 1 0; 0 1 1]
TEST
Dependent variable means

SCALE(1) SCALE(2) SCALE(3)
14.524 15.619 15.857

-1
Estimates of effects B = (X'X) X'Y

SCALE(1) SCALE(2) SCALE(3)

CONSTANT 13.700 14.550 14.988

CLASS 1 4.300 5.450 4.763

CLASS 2 0.100 0.650 -0.787

CLASS 3 -0.700 -0.550 0.012

Test for effect called: CONSTANT

C Matrix

1 2 3
1 1.000 -1.000 0.0
2 0.0 1.000 -1.000


Univariate F Tests

Effect SS df MS F P

1 14.012 1 14.012 4.652 0.046
Error 51.200 17 3.012

2 3.712 1 3.712 1.026 0.325
Error 61.500 17 3.618


Multivariate Test Statistics

Wilks' Lambda = 0.564
F-Statistic = 6.191 df = 2, 16 Prob = 0.010

Pillai Trace = 0.436
F-Statistic = 6.191 df = 2, 16 Prob = 0.010

Hotelling-Lawley Trace = 0.774
F-Statistic = 6.191 df = 2, 16 Prob = 0.010
Weighting by the Sample Size
If you believe (as Morrison does) that the differences in cell sizes reflect population
subclass proportions, then you need to weight the cell means to get a grand mean; for
example:
8*(μ1) + 5*(μ2) + 4*(μ3) + 4*(μ4)

Expressed in terms of our analysis of variance parameterization, this is:

8*(μ + α1) + 5*(μ + α2) + 4*(μ + α3) + 4*(μ + α4)

Because the sum of effects is 0 for a classification and because you do not have an
independent estimate of CLASS4, this expression is equivalent to

8*(μ + α1) + 5*(μ + α2) + 4*(μ + α3) + 4*(μ - α1 - α2 - α3)

which works out to

21*μ + 4*(α1) + 1*(α2) + 0*(α3)
Use AMATRIX to test this hypothesis.
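The collapse from the weighted cell means to this row of coefficients is easy to verify mechanically; the short Python sketch below (not SYSTAT code) reproduces the (21, 4, 1, 0) row from the subclass sizes:

import numpy as np

n = np.array([8, 5, 4, 4])               # subclass sizes for the four classes

# Substitute alpha4 = -(alpha1 + alpha2 + alpha3) into sum(n_i * (mu + alpha_i))
mu_coefficient = n.sum()                 # 21
alpha_coefficients = n[:3] - n[3]        # [4, 1, 0]

print(mu_coefficient, alpha_coefficients)   # 21 [4 1 0] -- the row supplied to AMATRIX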
The output is:
HYPOTHESIS
AMATRIX [21 4 1 0]
CMATRIX [1 -1 0; 0 1 -1]
TEST
Hypothesis.

A Matrix

1 2 3 4
21.000 4.000 1.000 0.0
C Matrix

1 2 3
1 1.000 -1.000 0.0
2 0.0 1.000 -1.000


Univariate F Tests

Effect SS df MS F P

1 25.190 1 25.190 8.364 0.010
Error 51.200 17 3.012

2 1.190 1 1.190 0.329 0.574
Error 61.500 17 3.618


This is the multivariate F statistic that Morrison gets. For these data, we prefer the
weighted means analysis because these differences in cell frequencies probably reflect
population base rates. They are not random.
Example 12
Hotelling's T-Square
You can use General Linear Model to calculate Hotelling's T-square statistic.
One-Sample Test
For example, to get a one-sample test for the variables X and Y, select both X and Y as
dependent variables.
The F test for CONSTANT is the statistic you want. It is the same as Hotelling's T-square
for the hypothesis that the population means for X and Y are 0.
You can also test against the hypothesis that the means of X and Y have particular
nonzero values (for example, 10 and 15) by using:
Multivariate Test Statistics

Wilks' Lambda = 0.501
F-Statistic = 7.959 df = 2, 16 Prob = 0.004

Pillai Trace = 0.499
F-Statistic = 7.959 df = 2, 16 Prob = 0.004

Hotelling-Lawley Trace = 0.995
F-Statistic = 7.959 df = 2, 16 Prob = 0.004
GLM
USE filename
MODEL x, y = CONSTANT
ESTIMATE
HYPOTHESIS
DMATRIX [10 15]
TEST
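The statistic GLM reports is equivalent to the classical formula T² = n(x̄ - μ0)'S⁻¹(x̄ - μ0), with F = (n - p)T² / (p(n - 1)). The sketch below computes it directly in Python for a two-column data matrix; the file name is a placeholder:

import numpy as np
from scipy.stats import f

data = np.loadtxt("xy.txt")          # hypothetical n-by-2 matrix with columns X and Y
mu0 = np.array([10.0, 15.0])         # hypothesized means (use zeros for the default test)

n, p = data.shape
diff = data.mean(axis=0) - mu0
S = np.cov(data, rowvar=False)       # sample covariance matrix

T2 = n * diff @ np.linalg.solve(S, diff)   # Hotelling's T-square
F = (n - p) * T2 / (p * (n - 1))           # equivalent F statistic
print(T2, F, f.sf(F, p, n - p))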
Two-Sample Test
For a two-sample test, you must provide a categorical independent variable that
represents the two groups. The input is:
Example 13
Discriminant Analysis
This example uses the IRIS data file. Fisher used these data to illustrate his discriminant
function. To define the model:
SYSTAT saves the canonical scores associated with the hypothesis. The scores are
stored in subscripted variables named FACTOR. Because the effects involve a
categorical variable, the Mahalanobis distances (named DISTANCE) and posterior
probabilities (named PROB) are saved in the same file. These distances are computed
in the discriminant space itself. The closer a case is to a particular group's location in
that space, the more likely it is that it belongs to that group. The probability of group
membership is computed from these distances. A variable named PREDICT that
contains the predicted group membership is also added to the file.
The output follows:
GLM
CATEGORY group
MODEL x,y = CONSTANT + group
ESTIMATE
USE iris
GLM
CATEGORY species
MODEL sepallen sepalwid petallen petalwid = CONSTANT +,
species
ESTIMATE
HYPOTHESIS
EFFECT = species
SAVE canon
TEST
Dependent variable means

SEPALLEN SEPALWID PETALLEN PETALWID
5.843 3.057 3.758 1.199

-1
Estimates of effects B = (X'X) X'Y

SEPALLEN SEPALWID PETALLEN PETALWID

CONSTANT 5.843 3.057 3.758 1.199

SPECIES 1 -0.837 0.371 -2.296 -0.953

SPECIES 2 0.093 -0.287 0.502 0.127

-------------------------------------------------------------------------------

Test for effect called: SPECIES


Null hypothesis contrast AB

SEPALLEN SEPALWID PETALLEN PETALWID
1 -0.837 0.371 -2.296 -0.953
2 0.093 -0.287 0.502 0.127


-1
Inverse contrast A(X'X) A'

1 2
1 0.013
2 -0.007 0.013


-1 -1
Hypothesis sum of product matrix H = B'A'(A(X'X) A') AB

SEPALLEN SEPALWID PETALLEN PETALWID
SEPALLEN 63.212
SEPALWID -19.953 11.345
PETALLEN 165.248 -57.240 437.103
PETALWID 71.279 -22.933 186.774 80.413


Error sum of product matrix G = E'E

SEPALLEN SEPALWID PETALLEN PETALWID
SEPALLEN 38.956
SEPALWID 13.630 16.962
PETALLEN 24.625 8.121 27.223
PETALWID 5.645 4.808 6.272 6.157


Univariate F Tests

Effect SS df MS F P

SEPALLEN 63.212 2 31.606 119.265 0.000
Error 38.956 147 0.265

SEPALWID 11.345 2 5.672 49.160 0.000
Error 16.962 147 0.115

PETALLEN 437.103 2 218.551 1180.161 0.000
Error 27.223 147 0.185

PETALWID 80.413 2 40.207 960.007 0.000
Error 6.157 147 0.042


The multivariate tests are all significant. The dependent variable canonical coefficients
are used to produce discriminant scores. These coefficients are standardized by the
within-groups standard deviations so you can compare their magnitude across
variables with different scales. Because they are not raw coefficients, there is no need
for a constant. The scores produced by these coefficients have an overall zero mean and
a unit standard deviation within groups.
Multivariate Test Statistics

Wilks' Lambda = 0.023
F-Statistic = 199.145 df = 8, 288 Prob = 0.000

Pillai Trace = 1.192
F-Statistic = 53.466 df = 8, 290 Prob = 0.000

Hotelling-Lawley Trace = 32.477
F-Statistic = 580.532 df = 8, 286 Prob = 0.000

THETA = 0.970 S = 2, M = 0.5, N = 71.0 Prob = 0.0

Test of Residual Roots

Roots 1 through 2
Chi-Square Statistic = 546.115 df = 8

Roots 2 through 2
Chi-Square Statistic = 36.530 df = 3

Canonical Correlations

1 2
0.985 0.471
Dependent variable canonical coefficients standardized
by conditional (within groups) standard deviations

1 2
SEPALLEN 0.427 0.012
SEPALWID 0.521 0.735
PETALLEN -0.947 -0.401
PETALWID -0.575 0.581

Canonical loadings (correlations between conditional
dependent variables and dependent canonical factors)

1 2
SEPALLEN -0.223 0.311
SEPALWID 0.119 0.864
PETALLEN -0.706 0.168
PETALWID -0.633 0.737

Group classification function coefficients
1 2 3
SEPALLEN 23.544 15.698 12.446
SEPALWID 23.588 7.073 3.685
PETALLEN -16.431 5.211 12.767
PETALWID -17.398 6.434 21.079

Group classification constants

1 2 3
-86.308 -72.853 -104.368
Canonical scores have been saved.
The group classification coefficients and constants comprise the Fisher discriminant
functions for classifying the raw data. You can apply these coefficients to new data and
assign each case to the group with the largest function value for that case.
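For example, applying the printed coefficients to a new flower amounts to computing one linear score per group and picking the largest. The measurements below are a hypothetical flower, not a case from the IRIS file, and the sketch is plain Python rather than SYSTAT syntax:

import numpy as np

# Coefficients (rows = groups) and constants from the classification function output above
coefs = np.array([[23.544, 23.588, -16.431, -17.398],
                  [15.698,  7.073,   5.211,   6.434],
                  [12.446,  3.685,  12.767,  21.079]])
constants = np.array([-86.308, -72.853, -104.368])

x = np.array([5.1, 3.5, 1.4, 0.2])   # hypothetical sepal/petal measurements for one flower

scores = coefs @ x + constants       # one Fisher classification score per group
print(scores, scores.argmax() + 1)   # this small-petaled flower is assigned to group 1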
Studying Saved Results
The CANON file that was just saved contains the canonical variable scores
(FACTOR(1) and FACTOR(2)), the Mahalanobis distances to each group centroid
(DISTANCE(1), DISTANCE(2), and DISTANCE(3)), the posterior probability for each
case being assigned to each group (PROB(1), PROB(2), and PROB(3)), the predicted
group membership (PREDICT), and the original group assignment (GROUP).
To produce a classification table of the group assignment against the predicted
group membership and a plot of the second canonical variable against the first, the
input is:
The output follows:
USE canon
XTAB
PRINT NONE/ FREQ CHISQ
TABULATE GROUP * PREDICT
PLOT FACTOR(2)*FACTOR(1) / OVERLAY GROUP=GROUP COLOR=2,1,3 ,
FILL=1,1,1 SYMBOL=4,8,5
Frequencies
GROUP (rows) by PREDICT (columns)

1 2 3 Total
+-------------------+
1 | 50 0 0 | 50
2 | 0 48 2 | 50
3 | 0 1 49 | 50
+-------------------+
Total 50 49 51 150


Test statistic Value df Prob
Pearson Chi-square 282.593 4.000 0.000
However, it is much easier to use the Discriminant Analysis procedure.
Prior Probabilities
In this example, there were equal numbers of flowers in each group. Sometimes the
probability of finding a case in each group is not the same across groups. To adjust the
prior probabilities for this example, specify 0.5, 0.3, and 0.2 as the priors:
General Linear Model uses the probabilities you specify to compute the posterior
probabilities that are saved in the file under the variable PROB. Be sure to specify a
probability for each level of the grouping variable. The probabilities should add up to 1.
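Under the usual normal model with equal covariance matrices, a posterior probability of this kind is proportional to the prior times exp(-D²/2), where D² is the squared Mahalanobis distance. The sketch below illustrates the arithmetic with hypothetical distances for a single case; it is not SYSTAT's internal code:

import numpy as np

priors = np.array([0.5, 0.3, 0.2])
d2 = np.array([1.2, 4.8, 9.5])        # hypothetical squared Mahalanobis distances for one case

weights = priors * np.exp(-0.5 * d2)  # prior times the normal density kernel
posterior = weights / weights.sum()
print(posterior)                      # analogous in form to the saved PROB(1..3) values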
Example 14
Principal Components Analysis (Within Groups)
General Linear Model allows you to partial out effects based on grouping variables and
to factor residual correlations. If between-group variation is significant, the within-
group structure can differ substantially from the total structure (ignoring the grouping
variable). However, if you are just computing principal components on a single sample
(no grouping variable), you can obtain more detailed output using the Factor Analysis
procedure.
The following data (USSTATES) comprise death rates by cause from nine census
divisions of the country for that year. The divisions are in the column labeled DIV, and
PRIORS 0.5 0.3 0.2
the U.S. Post Office two-letter state abbreviations follow DIV. Other variables include
ACCIDENT, CARDIO, CANCER, PULMONAR, PNEU_FLU, DIABETES, LIVER,
STATE$, FSTROKE, MSTROKE.
The variation in death rates between divisions in these data is substantial. Here is a
grouped box plot of the second variable, CARDIO, by division. The other variables
show similar regional differences.
If you analyze these data ignoring DIVISION$, the correlations among death rates
would be due substantially to between-division differences. You might want to
examine the pooled within-region correlations to see if the structure is different when
divisional differences are statistically controlled. Accordingly, you will factor the
residual correlation matrix after regressing medical variables onto an index variable
denoting the census regions. The input is:
USE usstates
GLM
CATEGORY division
MODEL accident cardio cancer pulmonar pneu_flu,
diabetes liver fstroke mstroke = CONSTANT + division
ESTIMATE
HYPOTHESIS
EFFECT = division
FACTOR = ERROR
TYPE = CORR
ROTATE = 2
TEST
The hypothesis commands compute the principal components on the error (residual)
correlation matrix and rotate the first two components to a varimax criterion. For other
rotations, use the Factor Analysis procedure.
The FACTOR options can be used with any hypothesis. Ordinarily, when you test a
hypothesis, the matrix product INV(G)*H is factored and the latent roots of this matrix
are used to construct the multivariate test statistic. However, you can indicate which
matrix, the hypothesis (H) matrix or the error (G) matrix, is to be factored. By
computing principal components on the hypothesis or error matrix separately, FACTOR
offers a direct way to compute principal components on residuals of any linear model
you wish to fit. You can use any A, C, and/or D matrices in the hypothesis you are
factoring, or you can use any of the other commands that create these matrices.
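Conceptually, the computation amounts to centering each variable within groups, correlating the residuals, and factoring that correlation matrix. The Python sketch below illustrates the idea; it is not SYSTAT's algorithm, and the array names are assumptions:

import numpy as np

def within_group_pca(X, group):
    """Principal components of the pooled within-group correlation matrix.

    X is a cases-by-variables array; group holds the group code for each case.
    """
    resid = np.asarray(X, dtype=float).copy()
    for g in np.unique(group):
        resid[group == g] -= resid[group == g].mean(axis=0)   # remove group means
    R = np.corrcoef(resid, rowvar=False)                      # within-group correlations
    roots, vectors = np.linalg.eigh(R)
    order = np.argsort(roots)[::-1]                           # largest roots first
    roots, vectors = roots[order], vectors[:, order]
    loadings = vectors * np.sqrt(roots)                       # unrotated loadings
    return roots, loadings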
The hypothesis output follows:
Factoring Error Matrix
1 2 3 4 5
1 1.000
2 0.280 1.000
3 0.188 0.844 1.000
4 0.307 0.676 0.711 1.000
5 0.113 0.448 0.297 0.396 1.000
6 0.297 0.419 0.526 0.296 -0.123
7 -0.005 0.251 0.389 0.252 -0.138
8 0.402 -0.202 -0.379 -0.190 -0.110
9 0.495 -0.119 -0.246 -0.127 -0.071


6 7 8 9
6 1.000
7 -0.025 1.000
8 -0.151 -0.225 1.000
9 -0.076 -0.203 0.947 1.000

Latent roots
1 2 3 4 5
3.341 2.245 1.204 0.999 0.475

6 7 8 9
0.364 0.222 0.119 0.033
Loadings
1 2 3 4 5
1 0.191 0.798 0.128 -0.018 -0.536
2 0.870 0.259 -0.097 0.019 0.219
3 0.934 0.097 0.112 0.028 0.183
4 0.802 0.247 -0.135 0.120 -0.071
5 0.417 0.146 -0.842 -0.010 -0.042
6 0.512 0.218 0.528 -0.580 0.068
7 0.391 -0.175 0.400 0.777 -0.044
8 -0.518 0.795 0.003 0.155 0.226
9 -0.418 0.860 0.025 0.138 0.204


Notice the sorted, rotated loadings. When interpreting these values, do not relate the
row numbers (1 through 9) to the variables. Instead, find the corresponding loading in
the Rotated Loadings table. The ordering of the rotated loadings corresponds to the
order of the model variables.
The first component rotates to a dimension defined by CANCER, CARDIO,
PULMONAR, and DIABETES; the second, by a dimension defined by MSTROKE and
FSTROKE (male and female stroke rates). ACCIDENT also loads on the second factor
but is not independent of the first. LIVER does not load highly on either factor.
6 7 8 9
1 0.106 -0.100 -0.019 -0.015
2 0.145 -0.254 0.177 0.028
3 0.039 -0.066 -0.251 -0.058
4 -0.499 0.085 0.044 0.015
5 0.216 0.220 -0.005 -0.002
6 0.093 0.241 0.063 0.010
7 0.154 0.159 0.046 0.009
8 -0.041 0.056 0.081 -0.119
9 0.005 0.035 -0.101 0.117

Rotated loadings on first 2 principal components
1 2
1 0.457 0.682
2 0.906 -0.060
3 0.909 -0.234
4 0.838 -0.047
5 0.441 -0.008
6 0.556 0.027
7 0.305 -0.300
8 -0.209 0.925
9 -0.093 0.951

Sorted rotated loadings on first 2 principal components
(loadings less than .25 made 0.)
1 2
1 0.909 0.0
2 0.906 0.0
3 0.838 0.0
4 0.556 0.0
5 0.0 0.951
6 0.0 0.925
7 0.457 0.682
8 0.305 -0.300
9 0.441 0.0
Example 15
Canonical Correlation Analysis
Suppose you have 10 dependent variables, MMPI(1) to MMPI(10), and 3 independent
variables, RATER(1) to RATER(3). Enter the following commands to obtain the
canonical correlations and dependent canonical coefficients:
The canonical correlations are displayed; if you want, you can rotate the dependent
canonical coefficients by using the Rotate option.
To obtain the coefficients for the independent variables, run GLM again with the model
reversed:
Example 16
Mixture Models
Mixture models decompose the effects of mixtures of variables on a dependent
variable. They differ from ordinary regression models because the independent
variables sum to a constant value. The regression model, therefore, does not include a
constant, and the regression and error sums of squares have one less degree of freedom.
Marquardt and Snee (1974) and Diamond (1981) discuss these models and their
estimation.
USE datafile
GLM
MODEL mmpi(1 .. 10) = CONSTANT + rater(1) + rater(2) + rater(3)
ESTIMATE
PRINT=LONG
HYPOTHESIS
STANDARDIZE
EFFECT=rater(1) & rater(2) & rater(3)
TEST
MODEL rater(1 .. 3) = CONSTANT + mmpi(1) + mmpi(2),
+ mmpi(3) + mmpi(4) + mmpi(5),
+ mmpi(6) + mmpi(7) + mmpi(8),
+ mmpi(9) + mmpi(10)
ESTIMATE
HYPOTHESIS
STANDARDIZE = TOTAL
EFFECT = mmpi(1) & mmpi(2) & mmpi(3) & mmpi(4) &,
mmpi(5) & mmpi(6) & mmpi(7) & mmpi(8) &,
mmpi(9) & mmpi(10)
TEST
Here is an example using the PUNCH data file from Cornell (1985). The study
involved effects of various mixtures of watermelon, pineapple, and orange juice on
taste ratings by judges of a fruit punch. The input is:
The output follows:
Not using a mixture model produces a much larger squared multiple R (0.999) and an F value of
2083.371, both of which are inappropriate for these data. Notice that the Regression
Sum-of-Squares has five degrees of freedom instead of six as in the usual zero-intercept
regression model. We have lost one degree of freedom because the predictors sum to 1.
Example 17
Partial Correlations
Partial correlations are easy to compute with General Linear Model. The partial
correlation of two variables (a and b) controlling for the effects of a third (c) is the
correlation between the residuals of each (a and b) after each has been regressed on the
third (c). You can therefore use General Linear Model to compute an entire matrix of
partial correlations.
USE punch
GLM
MODEL taste = watrmeln + pineappl + orange + ,
watrmeln*pineappl + watrmeln*orange + ,
pineappl*orange
ESTIMATE / MIX
Dep Var: TASTE N: 18 Multiple R: 0.969 Squared multiple R: 0.939

Adjusted squared multiple R: 0.913 Standard error of estimate: 0.232

Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)

WATRMELN 4.600 0.134 3.001 0.667 34.322 0.000
PINEAPPL 6.333 0.134 4.131 0.667 47.255 0.000
ORANGE 7.100 0.134 4.631 0.667 52.975 0.000
WATRMELN
*PINEAPPL 2.400 0.657 0.320 0.667 3.655 0.003
WATRMELN
*ORANGE 1.267 0.657 0.169 0.667 1.929 0.078
PINEAPPL
*ORANGE -2.200 0.657 -0.293 0.667 -3.351 0.006

Analysis of Variance

Source Sum-of-Squares df Mean-Square F-ratio P

Regression 9.929 5 1.986 36.852 0.000
Residual 0.647 12 0.054
For example, to compute the matrix of partial correlations for Y1, Y2, Y3, Y4, and
Y5, controlling for the effects of X, select Y1 through Y5 as dependent variables and X
as the independent variable. The input follows:
Look for the Residual Correlation Matrix in the output; it is the matrix of partial
correlations among the y's given x. If you want to compute partial correlations for
several x's, just select them (also) as independent variables.
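The same computation can be sketched outside SYSTAT: regress each y on x (with a constant, as in the MODEL statement), then correlate the residuals. The helper below is illustrative Python, not SYSTAT syntax:

import numpy as np

def partial_corr(Y, X):
    """Correlations among the columns of Y after removing the linear effect of X."""
    X = np.column_stack([np.ones(len(X)), X])     # include a constant, as in the MODEL statement
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta                          # residuals of each y given x
    return np.corrcoef(resid, rowvar=False)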
Computation
Algorithms
Centered sums of squares and cross products are accumulated using provisional
algorithms. Linear systems, including those involved in hypothesis testing, are solved
by using forward and reverse sweeping (Dempster, 1969). Eigensystems are solved
with Householder tridiagonalization and implicit QL iterations. For further
information, see Wilkinson and Reinsch (1971) or Chambers (1977).
References
Cohen, J. and Cohen, P. (1983). Applied multiple regression/correlation analysis for the
behavioral sciences. 2nd ed. Hillsdale, N.J.: Lawrence Erlbaum.
Linn, R. L., Centra, J. A., and Tucker, L. (1975). Between, within, and total group factor
analyses of student ratings of instruction. Multivariate Behavioral Research, 10,
277–288.
Neter, J., Wasserman, W., and Kutner, M. (1985). Applied linear statistical models. 2nd
ed. Homewood, Illinois: Richard D. Irwin, Inc.
Searle, S. R. (1971). Linear models. New York: John Wiley & Sons, Inc.
Winer, B. J. (1971). Statistical principles in experimental design. 2nd ed. New York:
McGraw-Hill.
GLM
MODEL y(1 .. 5) = CONSTANT + x
PRINT=LONG
ESTIMATE
Chapter 17
Logistic Regression
Dan Steinberg and Phillip Colla
LOGIT performs multiple logistic regression, conditional logistic regression, the
econometric discrete choice model, general linear (Wald) hypothesis testing, score
tests, odds ratios and confidence intervals, forward, backward and interactive
stepwise regression, Pregibon regression diagnostics, prediction success and
classification tables, independent variable derivatives and elasticities, model-based
simulation of response curves, deciles of risk tables, options to specify start values and
to separate data into learning and test samples, quasi-maximum likelihood standard
errors, control of significance levels for confidence interval calculations, zero/one
dependent variable coding, choice of reference group in automatic dummy variable
generation, and integrated plotting tools.
Many of the results generated by modeling, testing, or diagnostic procedures can
be saved to SYSTAT data files for subsequent graphing and display with the graphics
routines. LOGIT and PROBIT are aliases to the categorical multivariate general
modeling module called CMGLH, just as ANOVA, GLM, and REGRESSION are aliases
to the multivariate general linear module called MGLH.
Statistical Background
The LOGIT module is SYSTAT's comprehensive program for logistic regression
analysis and provides tools for model building, model evaluation, prediction,
simulation, hypothesis testing, and regression diagnostics. The program is designed
to be easy for the novice and can produce the results most analysts need with just three
simple commands. In addition, many advanced features are also included for
sophisticated research projects. Beginners can skip over any unfamiliar concepts and
gradually increase their mastery of logistic regression by working through the tools
incorporated here.
LOGIT will estimate binary (Cox, 1970), multinomial (Anderson, 1972), conditional
logistic regression models (Breslow and Day, 1980), and the discrete choice model
(Luce, 1959; McFadden, 1973). The LOGIT framework is designed for analyzing the
determinants of a categorical dependent variable. Typically, the dependent variable is
binary and coded as 0 or 1; however, it may be multinomial and coded as an integer
ranging from 1 to k or from 0 to k - 1.
Studies you can conduct with LOGIT include bioassay, epidemiology of disease
(cohort or case-control), clinical trials, market research, transportation research (mode
of travel), psychometric studies, and voter-choice analysis. The LOGIT module can also
be used to analyze ranked choice information once the data have been suitably
transformed (Beggs, Cardell, and Hausman, 1981).
This chapter contains a brief introduction to logistic regression and a description of
the commands and features of the module. If you are unfamiliar with logistic
regression, the textbook by Hosmer and Lemeshow (1989) is an excellent place to
begin; Breslow and Day (1980) provide an introduction in the context of case-control
studies; Train (1986) and Ben-Akiva and Lerman (1985) introduce the discrete-choice
model for econometrics; Wrigley (1985) discusses the model for geographers; and
Hoffman and Duncan (1988) review discrete choice in a demographic-sociological
context. Valuable surveys appear in Amemiya (1981), McFadden (1984, 1982, 1976),
and Maddala (1983).
Binary Logit
Although logistic regression may be applied to any categorical dependent variable, it
is most frequently seen in the analysis of binary data, in which the dependent variable
takes on only two values. Examples include survival beyond five years in a clinical
trial, presence or absence of disease, responding to a specified dose of a toxin, voting
for a political candidate, and participating in the labor force. The figure below
compares the ordinary least-squares linear model to the basic binary logit model on the
same data. Notice some features of the linear model in the upper panel of the figure:
n The linear model predicts values of y from minus to plus infinity. If the prediction
is intended to be for probabilities, this model is clearly inappropriate.
n The linear model does not pass through the means of x for either value of the
response. More generally, it does not appear to approach the data values very well.
We shouldn't blame the linear model for this; it is doing its job as a regression
estimator by shrinking back toward the mean of y for all x values (0.5). The linear
model is simply not designed to come near the data.
The lower panel illustrates a logistic model. By contrast, it is designed to fit binary
data, either when y is assumed to represent a probability distribution or when it is
taken simply as a binary measure we are attempting to predict.
Despite the difference in their graphical appearance, the linear and logit models are
only slight variants of one another. Assuming the possibility of more than one predictor
(x) variable, the linear model is:

y = Xb + e

where y is a vector of observations, X is a matrix of predictor scores, and e is a vector
of errors.
The logit model is:

y = exp(Xb + e) / [1 + exp(Xb + e)]

where the exponential function is applied to the vector argument. Rearranging terms,
we have:

y / (1 - y) = exp(Xb + e)

and logging both sides of the equation, we have:

log[y / (1 - y)] = Xb + e = b0 + Σj bj*Xij + ei   for all i = 1, ..., n

This last expression is one source of the term logit. The model is linear in the logs.
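The logistic curve in the lower panel of the figure is just the inverse of this log-odds transform. A small Python sketch (not part of LOGIT) shows how a linear predictor maps into probabilities and how the logit recovers it:

import numpy as np

def logistic(xb):
    """Map a linear predictor Xb onto probabilities between 0 and 1."""
    return np.exp(xb) / (1.0 + np.exp(xb))

xb = np.linspace(-4.0, 4.0, 9)
p = logistic(xb)
print(np.round(p, 3))                          # rises smoothly from about 0.02 to 0.98
print(np.allclose(np.log(p / (1 - p)), xb))    # the logit recovers the linear predictor: True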
Multinomial Logit
Multinomial logit is a logistic regression model having a dependent variable with
more than two levels (Agresti, 1990; Santer and Duffy, 1989; Nerlove and Press, 1973).
Examples of such dependent variables include political preference (Democrat,
Republican, Independent), health status (healthy, moderately impaired, seriously
impaired), smoking status (current smoker, former smoker, never smoked), and job
classification (executive, manager, technical staff, clerical, other). Outside of the
difference in the number of levels of the dependent variable, the multinomial logit is
very similar to the binary logit, and most of the standard tools of interpretation,
analysis, and model selection can be applied. In fact, the polytomous unordered logit
we discuss here is essentially a combination of several binary logits estimated
simultaneously (Begg and Gray, 1984). We use the term polytomous to differentiate
this model from the conditional logistic regression and discrete choice models
discussed below.
There are important differences between binary and multinomial models. Chiefly,
the multinomial output is more complicated than that of the binary model, and care
must be taken in the interpretation of the results. Fortunately, LOGIT provides some
new tools that make the task of interpretation much easier. There is also a difference in
dependent variable coding. The binary logit dependent variable is normally coded 0 or
1, whereas the multinomial dependent can be coded 1, 2, ..., k (that is, it starts at 1 rather than 0) or 0, 1, 2, ..., k - 1.
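In symbols, with k levels of the dependent variable and the highest level serving as the reference class (as in the submodels described in the multinomial example later in this chapter), the k - 1 simultaneous binary logits imply fitted probabilities that can be written, in our notation, as

P(y = j | x) = exp(x b_j) / [1 + Σ_m exp(x b_m)],   j = 1, ..., k - 1

where the sum runs over the k - 1 non-reference levels and P(y = k | x) = 1 / [1 + Σ_m exp(x b_m)]. A separate coefficient vector b_j is therefore estimated for every level except the reference.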
Conditional Logit
The conditional logistic regression model has become a major analytical tool in
epidemiology since the work of Prentice and Breslow (1978), Breslow et al. (1978),
Prentice and Pyke (1979), and the extended treatment of case-control studies in
Breslow and Day (1980). A mathematically similar model with the same name was
introduced independently and from a rather different perspective by McFadden (1973)
in econometrics. The models have since seen widespread use in the considerably
different contexts of biomedical research and social science, with parallel literatures on
sampling, estimation techniques, and statistical results. In epidemiology, conditional
logit is used to estimate relative risks in matched sample case-control studies (Breslow,
1982), whereas in econometrics a similar likelihood function is used to model
consumer choices as a function of the attributes of alternatives. We begin this section
with a treatment of the biomedical use of the conditional logistic model. A separate
section on the discrete choice model covers the econometric version and contains
certain fine points that may be of interest to all readers. A discussion of parallels in the
two literatures appears in Steinberg (1991).
In the traditional conditional logistic regression model, you are trying to measure
the risk of disease corresponding to different levels of exposure to risk factors. The data
have been collected in the form of matched sets of cases and controls, where the cases
have the disease, the controls do not, and the sets are matched on background variables
such as age, sex, marital status, education, residential location, and possibly other
health indicators. The matching variables combine to form strata over which relative
risks are to be estimated; thus, for example, a small group of persons of a given age,
marital status, and health history will form a single stratum. The matching variables
can also be thought of as proxies for a larger set of unobserved background variables
that are assumed to be constant within strata. The logit for the jth individual in the ith stratum can be written as:

logit(p_ij) = a_i + b X_ij

where X_ij is the vector of exposure variables and a_i is a parameter dedicated to the
stratum. Since case-control studies will frequently have a large number of small
matched sets, the a_i are nuisance parameters that can cause problems in estimation (Cox and Hinkley, 1974). In the example discussed below, there are 63 matched sets, each consisting of one case and four controls, with information on seven exposure variables for every subject.
The problem with estimating an unconditional model for these data is that we would
need to include 63 - 1 = 62 dummy variables for the strata. This would leave us with
possibly 70 parameters being estimated for a data set with only 315 observations.
Furthermore, increasing the sample size will not help because an additional stratum
parameter would have to be estimated for each additional matched set in the study
sample. By working with the appropriate conditional likelihood, however, the nuisance
parameters can be eliminated, simplifying estimation and protecting against potential
biases that may arise in the unconditional model (Cox, 1975; Chamberlain, 1980). The
conditional model requires estimation only of the relative risk parameters of interest.
LOGIT allows the estimation of models for matched sample case-control studies
with one case and any number of controls per set. Thus, matched pair studies, as well
as studies with varying numbers of controls per case, are easily handled. However, not
all commands discussed so far are available for conditional logistic regression.
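For reference, the conditional likelihood used here can be written, in our notation, for a matched set i containing one case (subscript 0) and M controls (subscripts 1 through M) as

L_i(b) = exp(b X_i0) / [exp(b X_i0) + Σ_{j=1..M} exp(b X_ij)]

The stratum parameter a_i cancels out of this ratio, and the full conditional likelihood is the product of these terms over all matched sets, so only the relative risk parameters b remain to be estimated.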
Discrete Choice Logit
Econometricians and psychometricians have developed a version of logit frequently
called the discrete choice model, or McFadden's conditional logit model
(McFadden, 1973, 1976, 1982, 1984; Hensher and Johnson, 1981; Ben-Akiva and
Lerman, 1985; Train, 1986; Luce, 1959). This multinomial model differs from the
standard polytomous logit in the interpretation of the coefficients, the number of
parameters estimated, the syntax of the model sentence, and options for data layout.
The discrete choice framework is designed specifically to model an individual's
choices in response to the characteristics of the choices. Characteristics of choices are
attributes such as price, travel time, horsepower, or calories; they are features of the
alternatives that an individual might choose from. By contrast, characteristics of the
chooser, such as age, education, income, and marital status, are attributes of a person.
The classic application of the discrete choice model has been to the choice of travel
mode to work (Domencich and McFadden, 1975). Suppose a person has three
alternatives: private auto, car pool, and commuter train. The individual is assumed to
have a utility function representing the desirability of each option, with the utility of
an alternative depending solely on its own characteristics. With travel time and travel
cost as key characteristics determining mode choice, the utility of each option could be
written as:

U_i = B_1 T_i + B_2 C_i + e_i,     i = 1, 2, 3

where i = 1, 2, 3 represents private auto, car pool, and train, respectively. In this random utility model, the utility U_i of the ith alternative is determined by the travel time T_i, the cost C_i of that alternative, and a random error term, e_i. Utility of an
alternative is assumed not to be influenced by the travel times or costs of other
alternatives available, although choice will be determined by the attributes of all
available alternatives. In addition to the alternative characteristics, utility is sometimes
also determined by an alternative specific constant.
The choice model specifies that an individual will choose the alternative with the
highest utility as determined by the equation above. Because of the random
component, we are reduced to making statements concerning the probability that a
given choice is made. If the error terms are distributed as i.i.d. extreme value, it can be
shown that the probability of the ith alternative being chosen is given by the familiar logit formula

Prob(U_i > U_j  for all j ≠ i) = exp(X_i b) / Σ_j exp(X_j b)
Suppose that for the first few cases our data are as follows:

Subject  Choice  Auto(1)  Auto(2)  Pool(1)  Pool(2)  Train(1)  Train(2)  Sex     Age
   1        1      20      3.50      35      2.00      65       1.10     Male     27
   2        3      45      6.00      65      3.00      65       1.00     Female   35
   3        1      15      1.00      30      0.50      60       1.00     Male     22
   4        2      60      5.50      70      2.00      90       2.00     Male     45
   5        3      30      4.25      40      1.75      55       1.50     Male     52
The third record has a person who chooses to go to work by private auto (choice = 1);
when he drives, it takes 15 minutes to get to work and costs one dollar. Had he
carpooled instead, it would have taken 30 minutes to get to work and cost 50 cents. The
train would have taken an hour and cost one dollar. For this case, the utility of each
option is given by
U(private auto) = b1*15 + b2*1.00 + error_13
U(car pool)     = b1*30 + b2*0.50 + error_23
U(train)        = b1*60 + b2*1.00 + error_33
The error term has two subscripts, one pertaining to the alternative and the other
pertaining to the individual. The error is individual-specific and is assumed to be
independent of any other error or variable in the data set. The parameters b1 and b2 are
common utility weights applicable to all individuals in the sample. In this example,
these are the only parameters, and their number does not depend on the number of
alternatives individuals can choose from. If a person also had the option of walking to
work, we would expand the model to include this alternative with
U(walking) = b1*70 + b2*0.00 + error_43
and we would still be dealing with only the two regression coefficients b1 and b2.
This highlights a major difference between the discrete choice and standard
polytomous logit models. In polytomous logit, the number of parameters grows with
the number of alternatives; if the value of NCAT is increased from 3 to 4, a whole new
vector of parameters is estimated. By contrast, in the discrete choice model without a
constant, increasing the number of alternatives does not increase the number of
discrete choice parameters estimated.
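To make the choice probabilities concrete, suppose (purely for illustration; these numbers are not estimates from any data set) that the common utility weights were b1 = -0.05 per minute and b2 = -0.40 per dollar. For the third record above, the systematic parts of the utilities would be

V(auto)  = -0.05*15 - 0.40*1.00 = -1.15
V(pool)  = -0.05*30 - 0.40*0.50 = -1.70
V(train) = -0.05*60 - 0.40*1.00 = -3.40

and applying the logit formula gives choice probabilities of roughly 0.59 for private auto, 0.34 for the car pool, and 0.06 for the train.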
Finally, we need to look at the optional constant. Optional is emphasized because it
is perfectly legitimate to estimate without a constant, and, in certain circumstances, it
is even necessary to do so. If we were to add a constant to the travel mode model, we
would obtain the following utility equations:

U_i = b_0i + b_1 T_i + b_2 C_i + e_i,     i = 1, 2, 3

where i = 1, 2, 3 represents private auto, car pool, and train, respectively. The constant here, b_0i, is alternative-specific, with a separate one estimated for each alternative: b_01 corresponds to private auto; b_02, to car pooling; and b_03, to train. Like polytomous logit, the constant pertaining to the reference group is normalized to 0 and is not estimated.
An alternative specific CONSTANT is entered into a discrete choice model to capture
unmeasured desirability of an alternative. Thus, the first constant could reflect the
convenience and comfort of having your own car (or in some cities the inconvenience
of having to find a parking space), and the second might reflect the inflexibility of
schedule associated with shared vehicles. With NCAT=3, the third constant will be
normalized to 0.
Stepwise Logit
Automatic model selection can be extremely useful for analyzing data with a large
number of covariates for which there is little or no guidance from previous research.
For these situations, LOGIT supports stepwise regression, allowing forward, backward,
mixed, and interactive covariate selection, with full control over forcing, selection
criteria, and candidate variables (including interactions). The procedure is based on
Peduzzi, Holford, and Hardy (1980).
Stepwise regression results in a model that cannot be readily evaluated using
conventional significance criteria in hypothesis tests, but the model may prove useful
for prediction. We strongly suggest that you separate the sample into learning and test
sets for assessment of predictive accuracy before fitting a model to the full data set. See
the cautionary discussion and references in Chapter 14.
Logistic Regression in SYSTAT
Estimate Model Main Dialog Box
Logistic regression analysis provides tools for model building, model evaluation,
prediction, simulation, hypothesis testing, and regression diagnostics.
Many of the results generated by modeling, testing, or diagnostic procedures can be
saved to SYSTAT data files for subsequent graphing and display. New data handling
features for the discrete choice model allow tremendous savings in disk space when
choice attributes are constant, and in some models, performance is greatly improved.
The Logit Estimate Model dialog box is shown below.
• Dependent. Select the variable you want to examine. The dependent variable should be a categorical numeric variable.
• Independent(s). Select one or more continuous or categorical variables. To add an interaction to your model, use the Cross button. For example, to add the term SEX*EDUCATION, add SEX to the Independent list and then add EDUCATION by clicking Cross.
• Conditional(s). Select conditional variables. To add interactive conditional variables to your model, use the Cross button. For example, to add the term SEX*EDUCATION, add SEX to the Conditional list and then add EDUCATION by clicking Cross.
• Include constant. The constant is an optional parameter. Deselect Include constant to obtain a model through the origin. When in doubt, include the constant.
• Prediction table. Produces a prediction-of-success table, which summarizes the classificatory power of the model.
• Quasi maximum likelihood. Specifies that the covariance matrix will be quasi-maximum likelihood adjusted after the first iteration. If this matrix is calculated, it will be used during subsequent hypothesis testing and will affect t ratios for estimated parameters.
• Save file. Saves specified statistics in filename.SYD.
Click the Options button to go to the Categories, Discrete Choice, and Estimation
Options dialog boxes.
Categories
You must specify numeric or string grouping variables that define cells. Do this for every categorical variable for which logistic regression analysis should generate design variables.
Categorical Variable(s). Categorize an independent variable when it has several categories; for example, education levels, which could be divided into the following categories: less than high school, some high school, finished high school, some college, finished bachelor's degree, finished master's degree, and finished doctorate. On the other hand, a variable such as age in years would not be categorical unless age were broken up into categories such as under 21, 21–65, and over 65.
Effect. Produces parameter estimates that are differences from group means.
Dummy. Produces dummy codes for the design variables instead of effect codes. Coding
of dummy variables is the classic analysis of variance parameterization, in which the
sum of effects estimated for a classifying variable is 0. If your categorical variable has
k categories, k - 1 dummy variables are created.
Discrete Choice
The discrete choice framework is designed specifically to model an individual's
choices in response to the characteristics of the choices. Characteristics of choices are
attributes such as price, travel time, horsepower, or calories; they are features of the
alternatives that an individual might choose from. You can define set names for groups
of variables, and create, edit, or delete variables.
Set Name. Specifies conditional variables. Enter a set name and then you can add and
cross variables. To create a new set, click New. Repeat this process until you have
defined all of your sets. You can edit existing sets by highlighting the name of the set
in the Set Name drop-down list. To delete a set, select the set in the drop-down list and
click Delete. When you click Continue, SYSTAT will check that each set name has a
definition. If a set name exists but no variables were assigned to it, the set is discarded
and the set name will not be in the drop-down list when you return to this dialog box.
Alternatives for discrete choice. Specify an alternative for discrete choice.
Characteristics of choice are features of the alternatives that an individual might choose
between. It is needed only when the number of alternatives in a choice model varies per
subject.
Number of categories. Specify the number of categories or alternatives the variable has.
This is needed only for the by-choice data layout where the values of the dependent
variable are not explicitly coded. This is only enabled when the Alternatives for discrete
choice field is not empty.
Options
The Logit Options dialog box allows you to specify convergence and a tolerance level,
select complete or stepwise entry, and specify entry and removal criteria.
Converge. Specifies the largest relative change in any coordinate before iterations
terminate.
Tolerance. Prevents the entry of a variable that is highly correlated with the independent
variables already included in the model. Enter a value between 0 and 1. Typical values
are 0.01 or 0.001. The higher the value (closer to 1), the lower the correlation required
to exclude a variable.
Estimation. Controls the method used to enter and remove variables from the equation.
• Complete. All independent variables are entered in a single step.
• Stepwise. Allows forward, backward, mixed, and interactive covariate selection, with full control over forcing, selection criteria, and candidates, including interactions. It results in a model that can be useful for prediction.
Stepwise Options. The following alternatives are available for stepwise entry and
removal:
• Backward. Begins with all candidate variables in the model. At each step, SYSTAT removes the variable with the largest Remove value.
• Forward. Begins with no variables in the model. At each step, SYSTAT adds the variable with the smallest Enter value.
• Automatic. For Backward, SYSTAT automatically removes a variable from your model at each step. For Forward, SYSTAT automatically adds a variable to the model at each step.
• Interactive. Allows you to use your own judgment in selecting variables for addition or deletion.
Probability. You can also control the criteria used to enter variables into and remove
variables from the model:
• Enter. Enters a variable into the model if its alpha value is less than the specified value. Enter a value between 0 and 1 (for example, 0.025).
• Remove. Removes a variable from the model if its alpha value is greater than the specified value. Enter a value between 0 and 1 (for example, 0.025).
Force. Forces the first n variables listed in your model to remain in the equation.
Max step. Specifies the maximum number of steps.
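In command form (see Using Commands below), a stepwise run replaces ESTIMATE with START, a sequence of STEPs, and STOP. The sketch below is only illustrative: the candidate list is taken from the HOSLEM data used in the examples later in this chapter, the 0.05/0.10 criteria are arbitrary, and whether a bare automatic STEP or an explicit variable name is appropriate depends on the selection mode you choose.

USE HOSLEM
LOGIT
MODEL LOW=CONSTANT+AGE+LWD+SMOKE+HT+UI
START / FORWARD ENTER=.05 REMOVE=.10
STEP / AUTO
STOP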
Deciles of Risk
After you successfully estimate your model using logistic regression, you can calculate
deciles of risk. This will help you make sure that your model fits the data and that the
results are not unduly influenced by a handful of unusual observations. In using the
deciles of risk table, please note that the goodness-of-fit statistics will depend on the
grouping rule specified.
Two grouping rules are available:
• Based on probability values. Groups observations into cells defined by cut points on the predicted probability scale; the cut points can be supplied explicitly or spaced equally by LOGIT. The goodness-of-fit statistics are then computed from the observed and expected counts in each cell.
• Based on equal counts per bin. Allocates approximately equal numbers of
observations to each cell. Enter the number of cells or bins in the Number of bins
text box.
Quantiles
After estimating your model, you can calculate quantiles for any single predictor in the
model. Quantiles of unadjusted data can be useful in assessing the suitability of a
functional form when you are interested in the unconditional distribution of the failure
times.
Covariate(s). The Covariate(s) list contains all of the variables specified in the
Independent list in the main Logit dialog box. You can set any of the covariates to a
fixed value by selecting the variable in the Covariates list and entering a value in the
Value text box. This constraint appears as variable name = value in the Fixed Value
Settings list after you click Add. The quantiles for the desired variable correspond to a
model in which the covariates are fixed at these values. Any covariates not fixed to a
value are assigned the value of 0.
Quantile Value Variable. By default, the first variable in the Independent variable list in
the main dialog box is shown in this field. You can change this to any variable from the
list. This variable name is then issued as the argument for the QNTL command.
Simulation
SYSTAT allows you to generate and save predicted probabilities and odds ratios, using
the last model estimated to evaluate a set of logits. The logits are calculated from a
combination of fixed covariate values that you specify in this dialog box.
Covariate(s). The Covariate(s) list contains all of the variables specified in the
Independent list on the main Logit dialog box. Select a covariate, enter a fixed value
for the covariate in the Value text box, and click Add.
Value. Enter the value over which the parameters of the simulation are to vary.
Fixed value settings. This box lists the fixed values on the covariates from which the
logits are calculated.
When you click OK, SYSTAT prompts you to specify a file to which the simulation
results will be saved.
Hypothesis
After you successfully estimate your model using logistic regression, you can perform
post hoc analyses.
Enter the hypotheses that you would like to test. All the hypotheses that you list will
be tested jointly in a single test. To test each restriction individually, you will have to
revisit this dialog box each time. To reference dummies generated from categorical
covariates, use square brackets, as in:

RACE[1] = 0

You can reproduce the Wald version of the t ratio by testing whether a coefficient is 0:

AGE = 0

If you don't specify a sub-vector, the first is assumed; thus, the constraint above is equivalent to:

AGE{1} = 0
Using Commands
After selecting a file with USE filename, continue with:
LOGIT
CATEGORY grpvarlist / MISS EFFECT DUMMY
NCAT=n
ALT var
SET parameter=condvarlist
MODEL depvar = indvarexp / CONSTANT
depvar = condvarlist;polyvarlist
ESTIMATE / PREDICT TOLERANCE=d CONVERGE=d QML MEANS CLASS
DERIVATIVE=INDIVIDUAL or AVERAGE
or
START / BACKWARD FORWARD ENTER=d REMOVE=d FORCE=n
MAXSTEP=n
STEP var or + or - / AUTO
(sequence of STEPs)
STOP
SAVE
DC / SMART=n P=p1,p2,
QNTL var / covar=d covar=d
SIMULATE var1=d1, var2=d2, / DO var1=d1,d2,d3, var2=d1,d2,d3
HYPOTHESIS
CONSTRAIN argument
TEST
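As a minimal illustration of how these pieces fit together, the following session estimates the two-predictor model used in the binary logit examples below and then tests the single constraint AGE = 0 described in the Hypothesis section:

USE HOSLEM
LOGIT
MODEL LOW=CONSTANT+LWD+AGE
ESTIMATE
HYPOTHESIS
CONSTRAIN AGE=0
TEST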
Usage Considerations
Types of data. LOGIT uses rectangular data only. The dependent variable is
automatically taken to be categorical. To change the order of the categories, use the
ORDER statement. For example,

ORDER CLASS / SORT=DESCENDING
LOGIT can also handle categorical predictor variables. Use the CATEGORY statement
to create them, and use the EFFECTS or DUMMY options of CATEGORY to determine
the coding method. Use the ORDER command to change the order of the categories.
Print options. For PRINT=SHORT, the output gives N, the type of association, parameter
estimates, and associated tests. PRINT=LONG gives, in addition to the above results, a
correlation matrix of the parameter estimates.
Quick Graphs. LOGIT produces no Quick Graphs. Use the saved files from ESTIMATE
or DC to produce diagnostic plots and fitted curves. See the examples.
Saving files. LOGIT saves simulation results, quantiles, or residuals and estimated
values.
BY groups. LOGIT analyzes data by groups.
Bootstrapping. Bootstrapping is not available in this procedure.
Case frequencies. LOGIT uses the FREQ variable, if present, to weight cases. This
inflates the total degrees of freedom to be the sum of the number of frequencies. Using
a FREQ variable does not require more memory, however. Cases whose value on the
FREQ variable is less than or equal to 0 are deleted from the analysis. The FREQ
variable may take non-integer values. When the FREQ command is in effect, separate
unweighted and weighted case counts are printed.
Weighting can be used to compensate for sampling schemes that stratify on the
covariates, giving results that more accurately reflect the population. Weighting is also
useful for market share predictions from samples stratified on the outcome variable in
discrete choice models. Such samples are known as choice-based in the econometric
literature (Manski and Lerman, 1977; Manski and McFadden, 1980; Coslett, 1980) and
are common in matched-sample case-control studies where the cases are usually over-
sampled, and in market research studies where persons who choose rare alternatives
are sampled separately.
Case weights. LOGIT does not allow case weighting.
Examples
The following examples begin with the simple binary logit model and proceed to more
complex multinomial and discrete choice logit models. Along the way, we will
examine diagnostics and other options used for applications in various fields.
Example 1
Binary Logit
To illustrate the use of binary logistic regression, we take this example from Hosmer
and Lemeshow's book Applied Logistic Regression, referred to below as H&L. Hosmer
and Lemeshow consider data on low infant birth weight (LOW) as a function of several
risk factors. These include the mother's age (AGE), mother's weight during last
menstrual period (LWT), race (RACE = 1: white, RACE = 2: black, RACE = 3: other),
smoking status during pregnancy (SMOKE), history of premature labor (PTL),
hypertension (HT), uterine irritability (UI), and number of physician visits during first
trimester (FTV). The dependent variable is coded 1 for birth weights less than 2500
grams and coded 0 otherwise. These variables have previously been identified as
associated with low birth weight in the obstetrical literature.
The first model considered is the simple regression of LOW on a constant and LWD,
a dummy variable coded 1 if LWT is less than 110 pounds and coded 0 otherwise. (See
H&L, Table 3.17.) LWD and LWT are similar variable names. Be sure to note which is
being used in the models that follow.
The input is:

USE HOSLEM
LOGIT
MODEL LOW=CONSTANT+LWD
ESTIMATE

The output begins with a listing of the dependent variable and the sample split between 0 (reference) and 1 (response) for the dependent variable. A brief iteration history follows, showing the progress of the procedure to convergence. Finally, the parameter estimates, standard errors, standardized coefficients (popularly called t ratios), p values, and the log-likelihood are presented.
Coefficients
We can evaluate these results much like a linear regression. The coefficient on LWD is
large relative to its standard error (t ratio = 2.91) and so appears to be an important
predictor of low birth weight. The interpretation of the coefficient is quite different
from ordinary regression, however. The logit coefficient tells how much the logit
increases for a unit increase in the independent variable, but the probability of a 0 or 1
outcome is a nonlinear function of the logit.
Odds Ratio
The odds-ratio table provides a more intuitively meaningful quantity for each
coefficient. The odds of the response are given by p / (1 - p), where p is the
probability of response, and the odds ratio is the multiplicative factor by which the odds
change when the independent variable increases by one unit. In the first model, being
a low-weight mother increases the odds of a low birth weight baby by a multiplicative
Variables in the SYSTAT Rectangular file are:
ID LOW AGE LWT RACE SMOKE
PTL HT UI FTV BWT RACE1
CASEID PTD LWD

Categorical values encountered during processing are:
LOW (2 levels)
0, 1

Binary LOGIT Analysis.

Dependent variable: LOW
Input records: 189
Records for analysis: 189
Sample split

Category choices
REF 59
RESP 130
Total : 189

L-L at iteration 1 is -131.005
L-L at iteration 2 is -113.231
L-L at iteration 3 is -113.121
L-L at iteration 4 is -113.121
Log Likelihood: -113.121
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT -1.054 0.188 -5.594 0.000
2 LWD 1.054 0.362 2.914 0.004
95.0 % bounds
Parameter Odds Ratio Upper Lower
2 LWD 2.868 5.826 1.412
Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 8.431 with 1 df Chi-sq p-value = 0.004
McFaddens Rho-Squared = 0.036
factor of 2.87, with lower and upper confidence bounds of 1.41 and 5.83, respectively.
Since the lower bound is greater than 1, the variable appears to represent a genuine risk
factor. See Kleinbaum, Kupper, and Chambliss (1982) for a discussion.
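The entries in the odds-ratio table follow directly from the coefficient and its standard error. For LWD,

odds ratio = e^1.054 ≈ 2.87,   95% bounds = e^(1.054 ± 1.96 × 0.362) ≈ (1.41, 5.83)

which reproduces the printed values up to rounding.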
Example 2
Binary Logit with Multiple Predictors
The binary logit example contains only a constant and a single dummy variable. We
consider the addition of the continuous variable AGE to the model.
The input is:

USE HOSLEM
LOGIT
MODEL LOW=CONSTANT+LWD+AGE
ESTIMATE / MEANS

The output follows:
Variables in the SYSTAT Rectangular file are:
ID LOW AGE LWT RACE SMOKE
PTL HT UI FTV BWT RACE1
CASEID PTD LWD

Categorical values encountered during processing are:
LOW (2 levels)
0, 1

Binary LOGIT Analysis.

Dependent variable: LOW
Input records: 189
Records for analysis: 189
Sample split

Category choices
REF 59
RESP 130
Total : 189

Independent variable MEANS
PARAMETER 0 -1 OVERALL
1 CONSTANT 1.000 1.000 1.000
2 LWD 0.356 0.162 0.222
3 AGE 22.305 23.662 23.238
L-L at iteration 1 is -131.005
L-L at iteration 2 is -112.322
L-L at iteration 3 is -112.144
L-L at iteration 4 is -112.143
L-L at iteration 5 is -112.143
Log Likelihood: -112.143
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT -0.027 0.762 -0.035 0.972
2 LWD 1.010 0.364 2.773 0.006
3 AGE -0.044 0.032 -1.373 0.170
95.0 % bounds
Parameter Odds Ratio Upper Lower
2 LWD 2.746 5.607 1.345
3 AGE 0.957 1.019 0.898
Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 10.385 with 2 df Chi-sq p-value = 0.006
McFaddens Rho-Squared = 0.044

We see the means of the independent variables overall and by value of the dependent variable. In this sample, there is a substantial difference between the mean LWD across birth weight groups but an apparently small AGE difference.
AGE is clearly not significant by conventional standards if we look at the coefficient/standard-error ratio. The confidence interval for the odds ratio (0.898, 1.019) includes 1.00, indicating no effect in relative risk, when adjusting for LWD. Before concluding that AGE does not belong in the model, H&L consider the interaction of AGE and LWD.

Example 3
Binary Logit with Interactions

In this example, we fit a model consisting of a constant, a dummy variable, a continuous variable, and an interaction. Note that it is not necessary to create a new interaction variable; this is done for us automatically by writing the interaction on the MODEL statement. Let's also add a prediction table for this model.
Following is the input:

USE HOSLEM
LOGIT
MODEL LOW=CONSTANT+LWD+AGE+LWD*AGE
ESTIMATE / PREDICTION
SAVE SIM319/SINGLE,SAVE ODDS RATIOS FOR H&L TABLE 3.19
SIMULATE CONSTANT=0,AGE=0,LWD=1 / DO LWD*AGE =15,45,5
USE SIM319
LIST
The output follows:
Variables in the SYSTAT Rectangular file are:
ID LOW AGE LWT RACE SMOKE
PTL HT UI FTV BWT RACE1
CASEID PTD LWD

Categorical values encountered during processing are:
LOW (2 levels)
0, 1
Total : 12

Binary LOGIT Analysis.

Dependent variable: LOW
Input records: 189
Records for analysis: 189
Sample split

Category choices
REF 59
RESP 130
Total : 189

L-L at iteration 1 is -131.005
L-L at iteration 2 is -110.937
L-L at iteration 3 is -110.573
L-L at iteration 4 is -110.570
L-L at iteration 5 is -110.570
Log Likelihood: -110.570
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 0.774 0.910 0.851 0.395
2 LWD -1.944 1.725 -1.127 0.260
3 AGE -0.080 0.040 -2.008 0.045
4 AGE*LWD 0.132 0.076 1.746 0.081
95.0 % bounds
Parameter Odds Ratio Upper Lower
2 LWD 0.143 4.206 0.005
3 AGE 0.924 0.998 0.854
4 AGE*LWD 1.141 1.324 0.984
Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 13.532 with 3 df Chi-sq p-value = 0.004
McFaddens Rho-Squared = 0.058
Model Prediction Success Table
Actual Predicted Choice Actual
Choice Response Reference Total

Response 21.280 37.720 59.000
Reference 37.720 92.280 130.000

Pred. Tot. 59.000 130.000 189.000
Correct 0.361 0.710
Success Ind. 0.049 0.022
Tot. Correct 0.601

Sensitivity: 0.361 Specificity: 0.710
False Reference: 0.639 False Response: 0.290
Simulation Vector
Fixed Parameter Value
1 CONSTANT 0.0
2 LWD 1.000
3 AGE 0.0

Likelihood-Ratio Statistic
At this point, it would be useful to assess the model as a whole. One method of model
evaluation is to consider the likelihood-ratio statistic. This statistic tests the hypothesis
that all coefficients except the constant are 0, much like the F test reported below linear
regressions. The likelihood-ratio statistic (LR for short) of 13.532 is chi-squared with
three degrees of freedom and a p value of 0.004. The degrees of freedom are equal to
the number of covariates in the model, not including the constant. McFadden's rho-
squared is a transformation of the LR statistic intended to mimic an R-squared. It is
always between 0 and 1, and a higher rho-squared corresponds to more significant
results. Rho-squared tends to be much lower than R-squared though, and a low number
does not necessarily imply a poor fit. Values between 0.20 and 0.40 are considered very
satisfactory (Hensher and Johnson, 1981).
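The rho-squared printed for this model is consistent with the usual definition in terms of the two log-likelihoods:

ρ² = 1 - LL(N) / LL(0) = 1 - (-110.570) / (-117.336) ≈ 0.058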
Models can also be assessed relative to one another. A likelihood-ratio test is
formally conducted by computing twice the difference in log-likelihoods for any pair
of nested models. Commonly called the G statistic, it has degrees of freedom equal to
the difference in the number of parameters estimated in the two models. Comparing the
current model with the model without the interaction, we have

G = 2 × (112.14338 - 110.56997) = 3.14684

with one degree of freedom, which has a p value of 0.076. This result corresponds to the bottom row of H&L's Table 3.17. The conclusion of the test is that the interaction
approaches significance.
Loop Parameter Minimum Maximum Increment
4 AGE*LWD 15.000 45.000 5.000
SYSTAT save file created.
7 records written to %1 save file.
Case number LOGIT SELOGIT PROB PLOWER PUPPER
ODDS ODDSL ODDSU LOOP(1)
1 0.04 0.66 0.51 0.22 0.79
1.04 0.28 3.79 15.00
2 0.70 0.40 0.67 0.48 0.82
2.01 0.91 4.44 20.00
3 1.36 0.42 0.80 0.63 0.90
3.90 1.71 8.88 25.00
4 2.02 0.69 0.88 0.66 0.97
7.55 1.95 29.19 30.00
5 2.68 1.03 0.94 0.66 0.99
14.63 1.94 110.26 35.00
6 3.34 1.39 0.97 0.65 1.00
28.33 1.85 432.77 40.00
7 4.00 1.76 0.98 0.64 1.00
54.86 1.75 1724.15 45.00
Prediction Success Table
The output also includes a prediction success table, which summarizes the
classificatory power of the model. The rows of the table show how observations from
each level of the dependent variable are allocated to predicted outcomes. Reading
across the first (Response) row we see that of the 59 cases of low birth weight, 21.28
are correctly predicted and 37.72 are incorrectly predicted. The second row shows that
of the 130 not-LOW cases, 37.72 are incorrectly predicted and 92.28 are correctly
predicted.
By default, the prediction success table sums predicted probabilities into each cell;
thus, each observation contributes a fractional amount to both the Response and
Reference cells in the appropriate row. Column sums give predicted totals for each
outcome, and row sums give observed totals. These sums will always be equal for
models with a constant.
The table also includes additional analytic results. The Correct row is the proportion
successfully predicted, defined as the diagonal table entry divided by the column total,
and Tot.Correct is the ratio of the sum of the diagonal elements in the table to the total
number of observations. In the Response column, 21.28 are correctly predicted out of
a column total of 59, giving a correct rate of 0.3607. Overall, 21.28 + 92.28 out of a total of 189 are correct, giving a total correct rate of 0.6009.
Success Ind. is the gain that this model shows over a purely random model that
assigned the same probability of LOW to every observation in the data. The model
produces a gain of 0.0485 over the random model for responses and 0.0220 for
reference cases. Based on these results, we would not think too highly of this model.
In the biostatistical literature, another terminology is used for these quantities. The
Correct quantity is also known as sensitivity for the Response group and specificity
for the Reference group. The False Reference rate is the fraction of those predicted to
respond that actually did not respond, while the False Response rate is the fraction of
those predicted to not respond that actually responded.
We prefer the prediction success terminology because it is applicable to the
multinomial case as well.
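For the table above, the two vocabularies line up as follows:

sensitivity   = 21.28 / 59  ≈ 0.361   (Correct, Response column)
specificity   = 92.28 / 130 ≈ 0.710   (Correct, Reference column)
total correct = (21.28 + 92.28) / 189 ≈ 0.601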
Simulation
To understand the implications of the interaction, we need to explore how the relative
risk of low birth weight varies over the typical child-bearing years. This changing
relative risk is evaluated by computing the logit difference for base and comparison
groups. The logit for the base group, mothers with LWD = 0, is written as L(0); the logit
for the comparison group, mothers with LWD = 1, is L(1). Thus,

L(0) = CONSTANT + B2*AGE
L(1) = CONSTANT + B1*LWD + B2*AGE + B3*LWD*AGE
     = CONSTANT + B1 + B2*AGE + B3*AGE

since, for L(1), LWD = 1. The logit difference is

L(1) - L(0) = B1 + B3*LWD*AGE

which is the coefficient on LWD plus the interaction multiplied by its coefficient. The difference L(1) - L(0) evaluated for a mother of a given age is a measure of the log relative
risk due to LWD being 1. This can be calculated simply for several ages, and converted
to odds ratios with upper and lower confidence bounds, using the SIMULATE
command.
SIMULATE calculates the predicted logit, predicted probability, odds ratio, upper
and lower bounds, and the standard error of the logit for any specified values of the
covariates. In the above command, the constant and age are set to 0, because these
coefficients do not appear in the logit difference. LWD is set to 1, and the interaction
is allowed to vary from 15 to 45 in increments of five years. The only printed output
produced by this command is a summary report.
SIMULATE does not print results when a DO LOOP is specified because of the
potentially large volume of output it can generate. To view the results, use the
commands:

USE SIM319
LIST
The results give the effect of low maternal weight (LWD) on low birth weight as a
function of age, where LOOP(1) is the value of AGE * LWD (which is just AGE) and
ODDSU and ODDSL are upper and lower bounds of the odds ratio. We see that the effect
of LWD goes up dramatically with age, although the confidence interval becomes quite
large beyond age 30. The results presented here are calculated internally within LOGIT
and thus differ slightly from those reported in H&L, who use printed output with fewer
decimal places of precision to obtain their results.
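Each line of the listing can be checked against the logit difference derived above. For the first case, LOOP(1) = 15, so B1 + B3 × 15 = -1.944 + 0.132 × 15 ≈ 0.04, matching the printed LOGIT, and e^0.04 ≈ 1.04 is the printed odds ratio; each additional five years of age adds 0.132 × 5 = 0.66 to the logit, which is exactly the spacing of the LOGIT column.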
Example 4
Deciles of Risk and Model Diagnostics
Before turning to more detailed model diagnostics, we fit H&L's final model. As a
result of experimenting with more variables and a large number of interactions, H&L
arrive at the model used here. The input is:
USE HOSLEM
LOGIT
CATEGORY RACE / DUMMY
MODEL LOW=CONSTANT+AGE+RACE+SMOKE+HT+UI+LWD+PTD+ ,
      AGE*LWD+SMOKE*LWD
ESTIMATE
SAVE RESID
DC / P=0.06850,0.09360,0.15320,0.20630,0.27810,0.33140,
     0.42300,0.49124,0.61146
USE RESID
PPLOT PEARSON / SIZE=VARIANCE
PLOT DELPSTAT*PROB/SIZE=DELBETA(1)

The categorical variable RACE is specified to have three levels. By default LOGIT uses the lowest category as the reference group, although this can be changed (see the discussion of categorical variable coding below). The model includes all of the main variables except FTV, with LWT and PTL transformed into dummy variable variants LWD and PTD, and two interactions. To reproduce the results of Table 5.1 of H&L, we specify a particular set of cut points for the deciles of risk table. Some of the results are:
Variables in the SYSTAT Rectangular file are:
ID LOW AGE LWT RACE SMOKE
PTL HT UI FTV BWT RACE1
CASEID PTD LWD

Categorical values encountered during processing are:
RACE (3 levels)
1, 2, 3
LOW (2 levels)
0, 1

Binary LOGIT Analysis.

Dependent variable: LOW
Input records: 189
Records for analysis: 189
Sample split

Category choices
REF 59
RESP 130
Total : 189

L-L at iteration 1 is -131.005
L-L at iteration 2 is -98.066
L-L at iteration 3 is -96.096
L-L at iteration 4 is -96.006
L-L at iteration 5 is -96.006
L-L at iteration 6 is -96.006
Log Likelihood: -96.006
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 0.248 1.068 0.232 0.816
2 AGE -0.084 0.046 -1.843 0.065
3 RACE_1 -0.760 0.464 -1.637 0.102
4 RACE_2 0.323 0.532 0.608 0.543
5 SMOKE 1.153 0.458 2.515 0.012
6 HT 1.359 0.661 2.055 0.040
7 UI 0.728 0.479 1.519 0.129
8 LWD -1.730 1.868 -0.926 0.354
9 PTD 1.232 0.471 2.613 0.009
10 AGE*LWD 0.147 0.083 1.779 0.075
11 SMOKE*LWD -1.407 0.819 -1.719 0.086
95.0 % bounds
Parameter Odds Ratio Upper Lower
2 AGE 0.919 1.005 0.841
3 RACE_1 0.468 1.162 0.188
4 RACE_2 1.382 3.920 0.487
5 SMOKE 3.168 7.781 1.290
6 HT 3.893 14.235 1.065
7 UI 2.071 5.301 0.809
8 LWD 0.177 6.902 0.005
9 PTD 3.427 8.632 1.360
10 AGE*LWD 1.159 1.363 0.985
11 SMOKE*LWD 0.245 1.218 0.049
Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 42.660 with 10 df Chi-sq p-value = 0.000
McFaddens Rho-Squared = 0.182
Deciles of Risk

Records processed: 189
Sum of weights = 189.000
Statistic p-value df
Hosmer-Lemeshow* 5.231 0.733 8.000
Pearson 183.443 0.374 178.000
Deviance 192.012 0.224 178.000
* Large influence of one or more deciles may affect statistic.

Category 0.069 0.094 0.153 0.206 0.278 0.331
0.423 0.491 0.611
Resp Obs 0.0 1.000 4.000 2.000 6.000 6.000
6.000 10.000 9.000
Exp 0.854 1.641 2.252 3.646 5.017 5.566
6.816 8.570 10.517
Ref Obs 18.000 19.000 14.000 18.000 14.000 12.000
12.000 9.000 10.000
Exp 17.146 18.359 15.748 16.354 14.983 12.434
11.184 10.430 8.483

Avg Prob 0.047 0.082 0.125 0.182 0.251 0.309
0.379 0.451 0.554

Category 1.000

Resp Obs 15.000
Exp 14.122
Ref Obs 4.000
Exp 4.878


Avg Prob 0.743
SYSTAT save file created.
189 records written to %1 save file.
Deciles of Risk
How well does a model fit the data? Are the results unduly influenced by a handful of
unusual observations? These are some of the questions we try to answer with our
model assessment tools. Besides the prediction success table and likelihood-ratio tests
(see the Binary Logit with Interactions example), the model assessment methods in
LOGIT include the Pearson chi-square, deviance and Hosmer-Lemeshow statistics, the
deciles of risk table, and a collection of residual, leverage, and influence quantities.
Most of these are produced by the DC command, which is invoked after estimating a
model.
[Probability plot of the PEARSON residuals: Expected Value for Normal Distribution versus PEARSON, with symbol size proportional to VARIANCE.]
The table in this example is generated by partitioning the sample into 10 groups
based on the predicted probability of the observations. The row labeled Category gives
the end points of the cells defining a group. Thus, the first group consists of all
observations with predicted probability between 0 and 0.069, the second group covers
the interval 0.069 to 0.094, and the last group contains observations with predicted
probability greater than 0.611.
The cell end points can be specified explicitly as we did or generated automatically
by LOGIT. Cells will be equally spaced if the DC command is given without any
arguments, and LOGIT will allocate approximately equal numbers of observations to
each cell when the SMART option is given, as:
which requests 10 cells. Within each cell, we are given a breakdown of the observed
and expected 0s (Ref) and 1s (Resp) calculated as in the prediction success table.
Expected ls are just the sum of the predicted probabilities of 1 in the cell. In the table,
it is apparent that observed totals are close to expected totals everywhere, indicating a
fairly good fit. This conclusion is borne out by the Hosmer-Lemeshow statistic of 5.23,
which is approximately chi-squared with eight degrees of freedom. H&L discuss the
degrees of freedom calculation.
In using the deciles of risk table, it should be noted that the goodness-of-fit statistics
will depend on the grouping rule specified and that not all statistics programs will
apply the same rules. For example, some programs assign all tied probabilities to the
same cell, which can result in very unequal cell counts. LOGIT gives the user a high
degree of control over the grouping, allowing you to choose among several methods.
The table also provides the Pearson chi-square and the sum of squared deviance
residuals, assuming that each observation has a unique covariate pattern.
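In command form, the two grouping rules of the DC command look like this (the explicit cut points in the first line are arbitrary illustrations; the example above used the cut points needed to match H&L's Table 5.1):

DC / P=0.25,0.50,0.75
DC / SMART=10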
Regression Diagnostics
If the DC command is preceded by a SAVE command, a SYSTAT data file containing
regression diagnostics will be created (Pregibon, 1981; Cook and Weisberg, 1984). The
SAVE file contains these variables:
ACTUAL        Value of Dependent Variable
PREDICT       Class Assignment (1 or 0)
PROB          Predicted probability
LEVERAGE(1)   Diagonal element of Pregibon hat matrix
LEVERAGE(2)   Component of LEVERAGE(1)
PEARSON       Pearson Residual for observation
VARIANCE      Variance of Pearson Residual
STANDARD      Standardized Pearson Residual
DEVIANCE      Deviance Residual
DELDSTAT      Change in Deviance chi-square
DELPSTAT      Change in Pearson chi-square
DELBETA(1)    Standardized Change in Beta
DELBETA(2)    Standardized Change in Beta
DELBETA(3)    Standardized Change in Beta
LEVERAGE(1) is a measure of the influence of an observation on the model fit and is
H&L's h. DELBETA(1) is a measure of the change in the coefficient vector due to the observation and is their Δβ (delta beta), DELPSTAT is based on the squared residual and is their Δχ² (delta chi-square), and DELDSTAT is the change in deviance and is their ΔD (delta D). As in linear regression, the diagnostics are intended to identify outliers and influential observations. Plots of PEARSON, DEVIANCE, LEVERAGE(1), DELDSTAT, DELPSTAT against the CASE will highlight unusual data points. H&L suggest plotting Δχ², ΔD, and Δβ against PROB and against h.
There is an important difference between our calculation of these measures and
those produced by H&L. In LOGIT, the above quantities are computed separately for
each observation, with no account taken of covariate grouping; whereas, in H&L,
grouping is taken into account. To obtain the grouped variants of these statistics,
several SYSTAT programming steps are involved. For further discussion and
interpretation of diagnostic graphs, see H&L's Chapter 5. We include the probability
plot of the residuals from our model, with the variance of the residuals used to size the
plotting characters.
We also display an example of the graph on the cover of H&L. The original cover
was plotted using SYSTAT Version 5 for the Macintosh. There are slight differences
between the two plots because of the scales and number of iterations in the model
fitting, but the examples are basically the same. H&L is an extremely valuable resource
for learning about graphical aids to diagnosing logistic models.
Example 5
Quantiles
In bioassay, it is common to estimate the dosage required to kill 50% of a target
population. For example, a toxicity experiment might establish the concentration of
nicotine sulphate required to kill 50% of a group of common fruit flies (Hubert, 1984).
More generally, the goal is to identify the level of a stimulus required to induce a 50%
response rate, where the response is any binary outcome variable and the stimulus is a
continuous covariate. In bioassay, stimuli include drugs, toxins, hormones, and
insecticides; the responses include death, weight gain, bacterial growth, and color
change, but the concepts are equally applicable to other sciences.
To obtain the LD50 in LOGIT, simply issue the QNTL command. However, don't
make the mistake of spelling quantile as QU, which means QUIT in SYSTAT. QNTL
will produce not only the LD50 but also a number of other quantiles as well, with upper
and lower bounds when they exist. Consider the following data from Williams (1986):
        RESPONSE   LDOSE   COUNT
CASE 1      1        -2       1
CASE 2      0        -2       4
CASE 3      1        -1       3
CASE 4      0        -1       2
CASE 5      1         0       2
CASE 6      0         0       3
CASE 7      1         1       4
CASE 8      0         1       1
CASE 9      1         2       5

Here, RESPONSE is the dependent variable, LDOSE is the logarithm of the dose (stimulus), and COUNT is the number of subjects with that response. The model estimated is:
USE WILL
FREQ=COUNT
LOGIT
MODEL RESPONSE=CONSTANT+LDOSE
ESTIMATE
QNTL
Following is the output:
Variables in the SYSTAT Rectangular file are:
RESPONSE LDOSE COUNT

Case frequencies determined by value of variable COUNT.

Categorical values encountered during processing are:
RESPONSE (2 levels)
0, 1

Binary LOGIT Analysis.

Dependent variable: RESPONSE
Analysis is weighted by COUNT
Sum of weights = 25.000
Input records: 9
Records for analysis: 9
Sample split

Weighted
Category Count Count
REF 5 15.000
RESP 4 10.000
Total : 9 25.000

L-L at iteration 1 is -17.329
L-L at iteration 2 is -13.277
L-L at iteration 3 is -13.114
L-L at iteration 4 is -13.112
L-L at iteration 5 is -13.112
Log Likelihood: -13.112
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 0.564 0.496 1.138 0.255
2 LDOSE 0.919 0.394 2.334 0.020
95.0 % bounds
Parameter Odds Ratio Upper Lower
2 LDOSE 2.507 5.425 1.159
Log Likelihood of constants only model = LL(0) = -16.825
2*[LL(N)-LL(0)] = 7.427 with 1 df Chi-sq p-value = 0.006
McFaddens Rho-Squared = 0.221

Evaluation Vector
1 CONSTANT 1.000
2 LDOSE VALUE


Quantile Table

Probability LOGIT LDOSE Upper Lower

0.999 6.907 6.900 44.788 3.518
0.995 5.293 5.145 33.873 2.536
0.990 4.595 4.385 29.157 2.105
0.975 3.664 3.372 22.875 1.519
0.950 2.944 2.590 18.042 1.050
0.900 2.197 1.777 13.053 0.530
0.750 1.099 0.582 5.928 -0.445
0.667 0.695 0.142 3.551 -1.047
0.500 0.0 -0.613 0.746 -3.364
0.333 -0.695 -1.369 -0.347 -7.392
0.250 -1.099 -1.809 -0.731 -9.987
0.100 -2.197 -3.004 -1.552 -17.266
0.050 -2.944 -3.817 -2.046 -22.281
0.025 -3.664 -4.599 -2.503 -27.126
0.010 -4.595 -5.612 -3.081 -33.416
0.005 -5.293 -6.372 -3.508 -38.136
0.001 -6.907 -8.127 -4.486 -49.055
This table includes LD (probability) values between 0.001 and 0.999. The median lethal LDOSE (log-dose) is -0.613 with upper and lower bounds of 0.746 and -3.364 for the default 95% confidence interval, corresponding to a dose of 0.542 with limits 2.11 and 0.0346.
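The LD50 itself is simply the value of LDOSE at which the fitted logit equals 0:

0.564 + 0.919 × LD50 = 0,   so   LD50 = -0.564 / 0.919 ≈ -0.613

and exponentiating, e^-0.613 ≈ 0.542, gives the corresponding dose.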
Indeterminate Confidence Intervals
Quantile confidence intervals are calculated using Fieller bounds (Finney, 1978),
which can easily include positive or negative infinity for steep dose-response
relationships. In the output, these are represented by the SYSTAT missing value. If this
happens, an alternative suggested by Williams (1986) is to calculate confidence bounds
using likelihood-ratio (LR) tests. See Cox and Oakes (1984) for a likelihood profile
example. Williams observes that the LR bounds seem to be invariably smaller than the
Fieller bounds even for well-behaved large-sample problems.
With SYSTAT BASIC, the search for the LR bounds can be conducted easily.
However, if you are not familiar with LR testing of this type, please refer to Cox and
Oakes (1984) and Williams (1986) for further explanation, because our account here
is necessarily brief.
We first estimate the model of RESPONSE on LDOSE reported above, which will be
the unrestricted model in the series of tests. The key statistic is the final log-likelihood
of -13.112. We then need to search for restricted models that force the LD50 to other values and that yield log-likelihoods no worse than -13.112 - 1.92 = -15.032. A
difference in log-likelihoods of 1.92 marks a 95% confidence interval because 2 * 1.92
= 3.84 is the 0.95 cutoff of the chi-squared distribution with one degree of freedom.
A restricted model is estimated by using a new independent variable and fitting a
model without a constant. The new independent variable is equal to the original minus
the value of the hypothesized LD50 bound. Values of the bounds will be selected by
trial and error. Thus, to test an LD50 value of 0.4895, we could type:
LOGIT
LET LDOSEB=LDOSE-.4895
MODEL RESPONSE=LDOSEB
ESTIMATE
LET LDOSEB=LDOSE+2.634
MODEL RESPONSE=LDOSEB
ESTIMATE

SYSTAT BASIC is used to create the new variable LDOSEB on the fly, and the new model is then estimated without a constant. The only important part of the results from a restricted model is the final log-likelihood. It should be close to -15.032 if we have
found the boundary of the confidence interval. We won't show the results of these estimations except to say that the lower bound was found to be -2.634 and is tested
using the second LET statement. Note that the value of the bound is subtracted from the
original independent variable, resulting in the subtraction of a negative number. While
the process of looking for a bound that will yield a log-likelihood of -15.032 for these
data is one of trial and error, it should not take long with the interactive program.
Several other examples are provided in Williams (1986). We were able to reproduce
most of his confidence interval results, but for several models his reported LD50 values
seem to be incorrect.
Quantiles and Logistic Regression
The calculation of LD values has traditionally been conducted in the context of simple
regressions containing a single predictor variable. LOGIT extends the notion to multiple
regression by allowing you to select one variable for LD calculations while holding the
values of the other variables constant at prespecified values. Thus,

USE HOSLEM
CATEGORY RACE
MODEL LOW = CONSTANT + AGE + RACE + SMOKE + HT +,
      UI + LWD + PTD
ESTIMATE
QNTL AGE / CONSTANT=1, RACE[1]=1, SMOKE=1, PTD=1,
      LWD=1, HT=1, UI=1

will produce the quantiles for AGE with the other variables set as specified. The Fieller
bounds are calculated, adjusting for all other parameters estimated.
Example 6
Multinomial Logit
We will illustrate multinomial modeling with an example, emphasizing what is new in
this context. If you have not already read the example on binary logit, this is a good
time to do so. The data used here have been extracted from the National Longitudinal
Survey of Young Men, 1979. Information on 200 individuals is supplied on school
enrollment status (NOTENR = 1 if not enrolled, 0 otherwise), log10 of wage (LW), age,
highest completed grade (EDUC), mother's education (MED), father's education (FED), an index of reading material available in the home (CULTURE = 1 for least, 3 for most), mean income of persons in father's occupation in 1960 (FOMY), an IQ
measure, a race dummy (BLACK = 0 for white), a region dummy (SOUTH = 0 for non-
South), and the number of siblings (NSIBS).
We estimate a model to analyze the CULTURE variable, predicting its value with
several demographic characteristics. In this example, we ignore the fact that the
dependent variable is ordinal and treat it as a nominal variable. (See Agresti, 1990, for
a discussion of the distinction.)
USE NLS
FORMAT=4
PRINT=LONG
LOGIT
MODEL CULTURE=CONSTANT+MED+FOMY
ESTIMATE / MEANS,PREDICT,CLASS,DERIVATIVE=INDIVIDUAL
PRINT

These commands look just like our binary logit analyses with the exception of the DERIVATIVE and CLASS options, which we will discuss below. The resulting output is:
Categorical values encountered during processing are:
CULTURE (3 levels)
1, 2, 3
Multinomial LOGIT Analysis.

Dependent variable: CULTURE
Input records: 200
Records for analysis: 200
Sample split

Category choices
1 12
2 49
3 139
Total : 200

Independent variable MEANS


PARAMETER 1 2 3 OVERALL
1 CONSTANT 1.0000 1.0000 1.0000 1.0000
2 MED 8.7500 10.1837 11.4460 10.9750
3 FOMY 4551.5000 5368.8571 6116.1367 5839.1750
L-L at iteration 1 is -219.7225
L-L at iteration 2 is -145.2936
L-L at iteration 3 is -138.9952
L-L at iteration 4 is -137.8612
L-L at iteration 5 is -137.7851
L-L at iteration 6 is -137.7846
L-L at iteration 7 is -137.7846
Log Likelihood: -137.7846
Parameter Estimate S.E. t-ratio p-value
Choice Group: 1
1 CONSTANT 5.0638 1.6964 2.9850 0.0028
2 MED -0.4228 0.1423 -2.9711 0.0030
3 FOMY -0.0006 0.0002 -2.6034 0.0092
Choice Group: 2
1 CONSTANT 2.5435 0.9834 2.5864 0.0097
2 MED -0.1917 0.0768 -2.4956 0.0126
3 FOMY -0.0003 0.0001 -2.1884 0.0286
95.0 % bounds
Parameter Odds Ratio Upper Lower
Choice Group: 1
2 MED 0.6552 0.8660 0.4958
3 FOMY 0.9994 0.9998 0.9989
Choice Group: 2
2 MED 0.8255 0.9597 0.7101
3 FOMY 0.9997 1.0000 0.9995
Log Likelihood of constants only model = LL(0) = -153.2535
2*[LL(N)-LL(0)] = 30.9379 with 4 df Chi-sq p-value = 0.0000
McFaddens Rho-Squared = 0.1009

Wald tests on effects across all choices

Wald Chi-Sq
Effect Statistic Signif df
1 CONSTANT 12.0028 0.0025 2.0000
2 MED 12.1407 0.0023 2.0000
3 FOMY 9.4575 0.0088 2.0000
Covariance Matrix

1 2 3 4 5
1 2.8777
2 -0.1746 0.0202
3 -0.0002 -0.0000 0.0000
4 0.5097 -0.0282 -0.0000 0.9670
5 -0.0274 0.0027 -0.0000 -0.0541 0.0059
6 -0.0000 -0.0000 0.0000 -0.0001 -0.0000
6
6 0.0000

Correlation Matrix

1 2 3 4 5
1 1.0000 -0.7234 -0.6151 0.3055 -0.2100
2 -0.7234 1.0000 -0.0633 -0.2017 0.2462
3 -0.6151 -0.0633 1.0000 -0.1515 -0.0148
4 0.3055 -0.2017 -0.1515 1.0000 -0.7164
5 -0.2100 0.2462 -0.0148 -0.7164 1.0000
6 -0.1659 -0.0149 0.2284 -0.5544 -0.1570
6
1 -0.1659
2 -0.0149
3 0.2284
4 -0.5544
5 -0.1570
6 1.0000

Individual variable derivatives averaged over all observations


PARAMETER 1 2 3
1 CONSTANT 0.2033 0.3441 -0.5474
2 MED -0.0174 -0.0251 0.0425
3 FOMY -0.0000 -0.0000 0.0001

The output begins with a report on the number of records read and retained for analysis.
This is followed by a frequency table of the dependent variable; both weighted and
unweighted counts would be provided if the FREQ option had been used. The means
table provides means of the independent variables by value of the dependent variable.
We observe that the highest educational and income values are associated with the most
reading material in the home. Next, an abbreviated history of the optimization process
lists the log-likelihood at each iteration, and finally, the estimation results are printed.
Note that the regression results consist of two sets of estimates, labeled Choice
Group 1 and Choice Group 2. It is this multiplicity of parameter estimates that
differentiates multinomial from binary logit. If there had been five categories in the
dependent variable, there would have been four sets of estimates, and so on. This
volume of output provides the challenge to understanding the results.
The results are a little more intelligible when you realize that we have really
estimated a series of binary logits simultaneously. The first submodel consists of the
two dependent variable categories 1 and 3, and the second consists of categories 2 and
3. These submodels always include the highest level of the dependent variable as the
reference class and one other level as the response class. If NCAT had been set to 25,
the 24 submodels would be categories 1 and 25, categories 2 and 25, through categories
24 and 25. We then obtain the odds ratios for the two submodels separately, comparing
dependent variable levels 1 against 3 and 2 against 3. This table shows that levels 1 and
2 are less likely as MED and FOMY increase, as the odds ratio is less than 1.
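The odds ratios and their 95% bounds can be checked directly from the printed coefficients and standard errors: the ratio is the exponentiated estimate, and the bounds are the exponentiated estimate plus or minus 1.96 standard errors. A minimal sketch (plain Python, not SYSTAT syntax) for MED in Choice Group 1:

import math

# Printed estimate and standard error for MED in Choice Group 1 (see output above)
beta, se = -0.4228, 0.1423
z = 1.96                                    # approximate 95% normal quantile

odds_ratio = math.exp(beta)                 # about 0.6552
lower = math.exp(beta - z * se)             # about 0.4958
upper = math.exp(beta + z * se)             # about 0.8660
print(round(odds_ratio, 4), round(lower, 4), round(upper, 4))

These values reproduce the MED row of the odds-ratio table above.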
Model Prediction Success Table


Actual Predicted Choice Actual
Choice 1 2 3 Total

1 1.8761 4.0901 6.0338 12.0000
2 3.6373 13.8826 31.4801 49.0000
3 6.4865 31.0273 101.4862 139.0000

Pred. Tot. 12.0000 49.0000 139.0000 200.0000
Correct 0.1563 0.2833 0.7301
Success Ind. 0.0963 0.0383 0.0351
Tot. Correct 0.5862
Model Classification Table


Actual Predicted Choice Actual
Choice 1 2 3 Total

1 1.0000 3.0000 8.0000 12.0000
2 0.0 4.0000 45.0000 49.0000
3 1.0000 5.0000 133.0000 139.0000

Pred. Tot. 2.0000 12.0000 186.0000 200.0000
Correct 0.0833 0.0816 0.9568
Success Ind. 0.0233 -0.1634 0.2618
Tot. Correct 0.6900
Wald Test Table
The coefficient/standard-error ratios (t ratios) reported next to each coefficient are a
guide to the significance of an individual parameter. But when the number of categories
is greater than two, each variable corresponds to more than one parameter. The Wald
test table automatically conducts the hypothesis test of dropping all parameters
associated with a variable, and the degrees of freedom indicates how many parameters
were involved. Because each variable in this example generates two coefficients, the
Wald tests have two degrees of freedom each. Given the high individual t ratios, it is
not surprising that every variable is also significant overall. The PRINT = LONG option
also produces the parameter covariance and correlation matrices.
Derivative Tables
In a multinomial context, we will want to know how the probabilities of each of the
outcomes will change in response to a change in the covariate values. This information
is provided in the derivative table, which tells us, for example, that when MED
increases by one unit, the probability of category 3 goes up by 0.042, and categories 1
and 2 go down by 0.017 and 0.025, respectively. To assess properly the effect of
father's income, the variable should be rescaled to hundreds or thousands of dollars (or
the FORMAT increased) because the effect of an increase of one dollar is very small.
The sum of the entries in each row is always 0 because an increase in probability in one
category must come about by a compensating decrease in other categories. There is no
useful interpretation of the CONSTANT row.
In general, the table shows how probability is reallocated across the possible values
of the dependent variable as the independent variable changes. It thus provides a global
view of covariate effects that is not easily seen when considering each binary submodel
separately. In fact, the overall effect of a covariate on the probability of an outcome can
be of the opposite sign of its coefficient estimate in the corresponding submodel. This
is because the submodel concerns only two of the outcomes, whereas the derivative
table considers all outcomes at once.
This table was generated by evaluating the derivatives separately for each individual
observation in the data set and then computing the mean; this is the theoretically
correct way to obtain the results. A quick alternative is to evaluate the derivatives once
at the sample average of the covariates. This method saves time (but at the possible cost
of accuracy) and is requested with the option DERIVATIVE=AVERAGE.
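The derivative table can be reproduced from the fitted probabilities and coefficients. The sketch below (Python/NumPy, not SYSTAT syntax) uses the standard multinomial-logit result that the derivative of the probability of category m with respect to a covariate is P_m times (b_m minus the probability-weighted average of the b_j); the array names X and B are hypothetical.

import numpy as np

def probabilities(X, B):
    """Multinomial logit probabilities; B holds one coefficient column per category
    (the reference category's column is all zeros)."""
    u = X @ B
    e = np.exp(u - u.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def derivative_table(X, B, var):
    """dP(category m)/d(covariate `var`) for each case:
    P_m * (B[var, m] - sum_j P_j * B[var, j])."""
    P = probabilities(X, B)
    b = B[var]                                   # coefficients of this covariate, one per category
    return P * (b[None, :] - (P @ b)[:, None])

# DERIVATIVE=INDIVIDUAL corresponds to averaging the per-case values:
#     derivative_table(X, B, var).mean(axis=0)
# DERIVATIVE=AVERAGE evaluates once at the covariate means instead:
#     derivative_table(X.mean(axis=0, keepdims=True), B, var)[0]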
Prediction Success
The PREDICT option instructs LOGIT to produce the prediction success table, which we
have already seen in the binary logit. (See Hensher and Johnson, 1981; McFadden,
1979.) The table will break down the distribution of predicted outcomes by actual
choice, with diagonals representing correct predictions and off-diagonals representing
incorrect predictions. For the multinomial model, the table will have dimensions NCAT
by NCAT with additional marginal results. For our example model, the core table is 3
by 3.
Each row of the table takes all cases having a specific value of the dependent
variable and shows how the model allocates those cases across the possible outcomes.
Thus in row 1, the 12 cases that actually had CULTURE = 1 were distributed by the
predictive model as 1.88 to CULTURE = 1, 4.09 to CULTURE = 2, and 6.03 to
CULTURE = 3. These numbers are obtained by summing the predicted probability of
being in each category across all of the cases with CULTURE actually equal to 1. A
similar allocation is provided for every value of the dependent variable.
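The row allocation is just a sum of fitted probabilities within each actual-choice group. A NumPy sketch of how such a table and its summary rows could be accumulated (illustrative, not the LOGIT source):

import numpy as np

def prediction_success_table(P, y):
    """P: n x k array of fitted probabilities; y: actual categories coded 1..k.
    Row m sums the fitted probabilities of all cases whose actual choice is m."""
    P, y = np.asarray(P, dtype=float), np.asarray(y)
    n, k = P.shape
    table = np.vstack([P[y == m].sum(axis=0) for m in range(1, k + 1)])
    correct = np.diag(table) / table.sum(axis=0)              # "Correct" row
    random_share = np.bincount(y, minlength=k + 1)[1:] / n    # constant-only model shares
    success_index = correct - random_share                    # "Success Ind." row
    total_correct = np.diag(table).sum() / n                  # "Tot. Correct"
    return table, correct, success_index, total_correct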
The prediction success table is also bordered by additional information: row totals are observed sums, and column totals are predicted sums and will be equal for any model containing a constant. The Correct row gives the ratio of the number correctly predicted in a column to the column total. Thus, among cases for which CULTURE = 1, the fraction correct is 1.8761 / 12 = 0.1563; for CULTURE = 3, the ratio is 101.4862 / 139 = 0.7301. The total correct gives the fraction correctly predicted overall and is computed as the sum Correct in each column divided by the table total. This is (1.8761 + 13.8826 + 101.4862) / 200 = 0.5862.
The success index measures the gain that the model exhibits in number correctly predicted in each column over a purely random model (a model with just a constant). A purely random model would assign the same probabilities of the three outcomes to each case, as illustrated below:

Random Probability Model
                                Predicted         Success Index =
                                Sample Fraction   Correct - Random Predicted
PROB (CULTURE=1) =  12/200 = 0.0600               0.1563 - 0.0600 = 0.0963
PROB (CULTURE=2) =  49/200 = 0.2450               0.2833 - 0.2450 = 0.0383
PROB (CULTURE=3) = 139/200 = 0.6950               0.7301 - 0.6950 = 0.0351

Thus, the smaller the success index in each column, the poorer the performance of the model; in fact, the index can even be negative.
Normally, one prediction success table is produced for each model estimated. However, if the data have been separated into learning and test subsamples with BY, a
separate prediction success table will be produced for each portion of the data. This can
provide a clear picture of the strengths and weaknesses of the model when applied to
fresh data.
Classification Tables
Classification tables are similar to prediction success tables except that predicted
choices instead of predicted probabilities are added into the table. Predicted choice is
the choice with the highest probability. Mathematically, the classification table is a
prediction success table with the predicted probabilities changed, setting the highest
probability of each case to 1 and the other probabilities to 0.
In the absence of fractional case weighting, each cell of the main table will contain
an integer instead of a real number. All other quantities are computed as they would be
for the prediction success table. In our judgment, the classification table is not as good
a diagnostic tool as the prediction success table. The option is included primarily for
the binary logit to provide comparability with results reported in the literature.
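As a companion to the sketch above, the classification table replaces each case's probability row with an indicator for its highest-probability category. A possible implementation (hypothetical helper, not SYSTAT code):

import numpy as np

def classification_table(P, y):
    """Same layout as the prediction success table, but each case contributes a
    full unit to the column of its highest-probability (predicted) choice."""
    P, y = np.asarray(P, dtype=float), np.asarray(y)
    k = P.shape[1]
    predicted = P.argmax(axis=1) + 1          # predicted choice, coded 1..k
    table = np.zeros((k, k))
    for actual, chosen in zip(y, predicted):
        table[actual - 1, chosen - 1] += 1
    return table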
Example 7
Conditional Logistic Regression
Data must be organized in a specific way for the conditional logistic model;
fortunately, this organization is natural for matched sample case-control studies. First,
matched samples must be grouped together; all subjects from a given stratum must be
contiguous. It is thus advisable to provide each set with a unique stratum number to
facilitate the sorting and tracking of records. Second, the dependent variable gives the
relative position of the case within a matched set. Thus, the dependent variable will be
an integer between 1 and NCAT, and if the case is first in each stratum, then the
dependent variable will be equal to 1 for every record in the data set.
To illustrate how to set up conditional logit models, we use data discussed at length
by Breslow and Day (1980) on cases of endometrial cancer in a retirement community
near Los Angeles. The data are reproduced in their Appendix III and are identified in
SYSTAT as MACK.SYS.
The data set includes the dependent variable CANCER, the exposure variables AGE,
GALL (gall bladder disease), HYP (hypertension), OBESE, ESTROGEN, DOSE, DUR
(duration of conjugated estrogen exposure), NON (other drugs), some transformations
of these variables, and a set identification number. The data are organized by sets, with
the case coming first, followed by four controls, and so on, for a total of 315 observations (63 × (4 + 1)).
To estimate a model of the relative risks of gall bladder disease, estrogen use, and their interaction, you may proceed as follows:
USE MACK
PRINT LONG
LOGIT
MODEL DEPVAR=GALL+EST+GALL*EST ;
ALT=SETSIZ
NCAT=5
ESTIMATE
There are three key points to notice about this sequence of commands. First, the NCAT command is required to let LOGIT know how many subjects there are in a matched set. Unlike the unconditional binary LOGIT, a unit of information in matched samples will typically span more than one line of data, and NCAT will establish the minimum size of each matched set. If each set contains the same number of subjects, the NCAT command completely describes the data organization. If there were a varying number of controls per set, the size of each set would be signaled with the ALT command, as in
ALT = SETSIZE
Here, SETSIZE is a variable containing the total number of subjects (number of controls plus 1) per set. Each set could have its own value.
The second point is that the matched set conditional logit never contains a constant; the constant is eliminated along with all other variables that do not vary among members of a matched set. The third point is the appearance of the semicolon at the end of the model. This is required to distinguish the conditional from the unconditional model.
After you specify the commands, the output produced includes:
Variables in the SYSTAT Rectangular file are:
CANCER GALL HYP OBESE EST DOS
DURATION NON REC DEPVAR GROUP OB
DOSGRP DUR DURGRP CEST SETSIZ
Conditional LOGIT, data organized by matched set.

Categorical values encountered during processing are:
DEPVAR (1 levels)
1

Conditional LOGIT Analysis.

Dependent variable: DEPVAR
Number of alternatives: SETSIZ
Input records: 315
Matched sets for analysis: 63
The output begins with a report on the number of SYSTAT records read and the number
of matched sets kept for analysis. The remaining output parallels the results produced
by the unconditional logit model. The parameters estimated are coefficients of a linear
logit, the relative risks are derived by exponentiation, and the interpretation of the
model is unchanged. Model selection will proceed as it would in linear regression; you
might experiment with logarithmic transformations of the data, explore quadratic and
higher-order polynomials in the risk factors, and look for interactions. Examples of
such explorations appear in Breslow and Day (1980).
Example 8
Discrete Choice Models
The CHOICE data set contains hypothetical data motivated by McFadden (1979). The
CHOICE variable represents which of the three transportation alternatives (AUTO,
POOL, TRAIN) each subject prefers. The first subscripted variable in each choice
category represents TIME and the second, COST. Finally, SEX$ represents the gender
of the chooser, and AGE, the age.
L-L at iteration 1 is -101.3946
L-L at iteration 2 is -79.0552
L-L at iteration 3 is -76.8868
L-L at iteration 4 is -76.7326
L-L at iteration 5 is -76.7306
L-L at iteration 6 is -76.7306
Log Likelihood: -76.7306
Parameter Estimate S.E. t-ratio p-value
1 GALL 2.8943 0.8831 3.2777 0.0010
2 EST 2.7001 0.6118 4.4137 0.0000
3 GALL*EST -2.0527 0.9950 -2.0631 0.0391
95.0 % bounds
Parameter Odds Ratio Upper Lower
1 GALL 18.0717 102.0127 3.2014
2 EST 14.8818 49.3621 4.4866
3 GALL*EST 0.1284 0.9025 0.0183
Log Likelihood of constants only model = LL(0) = 0.0000
McFaddens Rho-Squared = 4.56944E+15


Covariance Matrix

1 2 3
1 0.7798
2 0.3398 0.3743
3 -0.7836 -0.3667 0.9900

Correlation Matrix

1 2 3
1 1.0000 0.6290 -0.8918
2 0.6290 1.0000 -0.6024
3 -0.8918 -0.6024 1.0000
A basic discrete choice model is estimated with:
USE CHOICE
LOGIT
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=TIME+COST
ESTIMATE
There are two new features of this program. First, the word TIME is not a SYSTAT variable name; rather, it is a label we chose to remind us of time spent commuting. The group of names in the SET statement are valid SYSTAT variables corresponding, in order, to the three modes of transportation. Although there are three variable names in the SET variable, only one attribute is being measured.
Following is the output:
Categorical values encountered during processing are:
CHOICE (3 levels)
1, 2, 3
Categorical variables are effects coded with the highest value as reference.

Conditional LOGIT Analysis.

Dependent variable: CHOICE
Input records: 29
Records for analysis: 29
Sample split

Category choices
1 15
2 6
3 8
Total : 29

L-L at iteration 1 is -31.860
L-L at iteration 2 is -31.142
L-L at iteration 3 is -31.141
L-L at iteration 4 is -31.141
Log Likelihood: -31.141
Parameter Estimate S.E. t-ratio p-value
1 TIME -0.020 0.017 -1.169 0.243
2 COST -0.088 0.145 -0.611 0.541
95.0 % bounds
Parameter Odds Ratio Upper Lower
1 TIME 0.980 1.014 0.947
2 COST 0.915 1.216 0.689
Log Likelihood of constants only model = LL(0) = -29.645
McFaddens Rho-Squared = -0.050

Covariance Matrix

1 2
1 0.000
2 0.001 0.021

Correlation Matrix

1 2
1 1.000 0.384
2 0.384 1.000
The output begins with a frequency distribution of the dependent variable and a brief
iteration history and prints standard regression results for the parameters estimated.
A key difference between a conditional variable clause and a standard SYSTAT
polytomous variable is that each clause corresponds to only one estimated parameter
regardless of the value of NCAT, while each free-standing polytomous variable
generates NCAT - 1 parameters. The difference is best seen in a model that mixes both types of variables (see Hoffman and Duncan, 1988, or Steinberg, 1987, for further discussion).
Mixed Parameters
The following is an example of mixing polytomous and conditional variables:
USE CHOICE
LOGIT
CATEGORY SEX$
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=TIME+COST+SEX$+AGE
ESTIMATE
The hybrid model generates a single coefficient each for TIME and COST and two sets of parameters for the polytomous variables.
The resulting output is:
Categorical values encountered during processing are:
SEX$ (2 levels)
Female, Male
CHOICE (3 levels)
1, 2, 3

Conditional LOGIT Analysis.

Dependent variable: CHOICE
Input records: 29
Records for analysis: 29
Sample split

Category choices
1 15
2 6
3 8
Total : 29

L-L at iteration 1 is -31.860
L-L at iteration 2 is -28.495
L-L at iteration 3 is -28.477
L-L at iteration 4 is -28.477
L-L at iteration 5 is -28.477
Log Likelihood: -28.477
Parameter Estimate S.E. t-ratio p-value
1 TIME -0.018 0.020 -0.887 0.375
2 COST -0.351 0.217 -1.615 0.106
Varying Alternatives
For some discrete choice problems, the number of alternatives available varies across
choosers. For example, health researchers studying hospital choice pooled data from
several cities in which each city had a different number of hospitals in the choice set
(Luft et al., 1988). Transportation research may pool data from locations having train
service with locations without trains. Carson, Hanemann, and Steinberg (1990) pool
responses from two contingent valuation survey questions having differing numbers of
alternatives. To let LOGIT know about this, there are two ways of proceeding. The most
flexible is to organize the data by choice. With the standard data layout, use the ALT
command, as in
Choice Group: 1
3 SEX$_Female 0.328 0.509 0.645 0.519
4 AGE 0.026 0.014 1.850 0.064
Choice Group: 2
3 SEX$_Female 0.024 0.598 0.040 0.968
4 AGE -0.008 0.016 -0.500 0.617
95.0 % bounds
Parameter Odds Ratio Upper Lower
1 TIME 0.982 1.022 0.945
2 COST 0.704 1.078 0.460
Choice Group: 1
4 AGE 1.026 1.054 0.998
Choice Group: 2
4 AGE 0.992 1.024 0.961
Log Likelihood of constants only model = LL(0) = -29.645
2*[LL(N)-LL(0)] = 2.335 with 4 df Chi-sq p-value = 0.674
McFaddens Rho-Squared = 0.039
Wald tests on effects across all choices

Wald Chi-Sq
Effect Statistic Signif df
3 SEX$_Female 0.551 0.759 2.000
4 AGE 4.475 0.107 2.000

Covariance Matrix

1 2 3 4 5 6
1 0.000
2 0.001 0.047
3 0.002 0.009 0.259
4 -0.000 -0.001 0.002 0.000
5 0.002 -0.018 0.165 0.002 0.358
6 -0.000 0.001 0.002 0.000 0.003 0.000

Correlation Matrix

1 2 3 4 5 6
1 1.000 0.180 0.150 -0.076 0.146 -0.266
2 0.180 1.000 0.084 -0.499 -0.140 0.310
3 0.150 0.084 1.000 0.230 0.543 0.193
4 -0.076 -0.499 0.230 1.000 0.281 0.265
5 0.146 -0.140 0.543 0.281 1.000 0.323
6 -0.266 0.310 0.193 0.265 0.323 1.000
ALT=NCHOICES
where NCHOICES is a SYSTAT variable containing the number of alternatives
available to the chooser. If the value of the ALT variable is less than NCAT for an
observation, LOGIT will use only the first NCHOICES variables in each conditional
variable clause in the analysis.
With the standard data layout, the ALT command is useful only if the choices not
available to some cases all appear at the end of the choice list. Organizing data by
choice is much more manageable. One final note on varying numbers of alternatives:
if the ALT command is used in the standard data layout, the model may not contain a
constant or any polytomous variables; the model must be composed only of conditional
variable clauses. We will not show an example here because by now you must have
figured that we believe the by-choice layout is more suitable if you have data with
varying choice alternatives.
Interactions
A common practice in discrete choice models is to enter characteristics of choosers as
interactions with attributes of the alternatives in conditional variable clauses. When
dealing with large sets of alternatives, such as automobile purchase choices or hospital
choices, where the model may contain up to 60 different alternatives, adding
polytomous variables can quickly produce unmanageable estimation problems, even
for mainframes. In the transportation literature, it has become commonplace to
introduce demographic variables as interactions with, or other functions of, the discrete
choice variables. Thus, instead of, or in addition to, the COST group of variables,
AUTO(2), POOL(2), TRAIN(2), you might see the ratio of cost to income. These ratios
would be created with LET transformations and then added in another SET list for use
as a conditional variable in the MODEL statement. Interactions can also be introduced
this way. By confining demographic variables to appear only as interactions with
choice variables, the number of parameters estimated can be kept quite small.
Thus, an investigator might prefer
USE CHOICE
LOGIT
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET TIMEAGE=AUTO(1)*AGE,POOL(1)*AGE,TRAIN(1)*AGE
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=TIME+TIMEAGE+COST
ESTIMATE
as a way of entering demographics. The advantage to using only conditional clauses is
clear when dealing with a large value of NCAT as the number of additional parameters
estimated is minimized. The model above yields:
Categorical values encountered during processing are:
CHOICE (3 levels)
1, 2, 3

Conditional LOGIT Analysis.

Dependent variable: CHOICE
Input records: 29
Records for analysis: 29
Sample split

Category choices
1 15
2 6
3 8
Total : 29

L-L at iteration 1 is -31.860
L-L at iteration 2 is -28.021
L-L at iteration 3 is -27.866
L-L at iteration 4 is -27.864
L-L at iteration 5 is -27.864
Log Likelihood: -27.864
Parameter Estimate S.E. t-ratio p-value
1 TIME -0.148 0.062 -2.382 0.017
2 TIMEAGE 0.003 0.001 2.193 0.028
3 COST 0.007 0.155 0.043 0.966
95.0 % bounds
Parameter Odds Ratio Upper Lower
1 TIME 0.863 0.974 0.764
2 TIMEAGE 1.003 1.006 1.000
3 COST 1.007 1.365 0.742
Log Likelihood of constants only model = LL(0) = -29.645
2*[LL(N)-LL(0)] = 3.561 with 1 df Chi-sq p-value = 0.059
McFaddens Rho-Squared = 0.060


Covariance Matrix

1 2 3
1 0.004
2 -0.000 0.000
3 -0.001 0.000 0.024

Correlation Matrix

1 2 3
1 1.000 -0.936 -0.110
2 -0.936 1.000 0.273
3 -0.110 0.273 1.000
Constants
The models estimated here deliberately did not include a constant because the constant
is treated as a polytomous variable in LOGIT. To obtain an alternative specific constant,
enter the following model statement:
USE CHOICE
LOGIT
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=CONSTANT+TIME+COST
ESTIMATE
Two CONSTANT parameters would be estimated. For the discrete choice model with the type of data layout of this example, there is no need to specify the NCAT value because LOGIT determines this automatically by the number of variables between the brackets. If the model statement is inconsistent in the number of variables within brackets across conditional variable clauses, an error message will be generated.
Following is the output:
Categorical values encountered during processing are:
CHOICE (3 levels)
1, 2, 3

Conditional LOGIT Analysis.

Dependent variable: CHOICE
Input records: 29
Records for analysis: 29
Sample split

Category choices
1 15
2 6
3 8
Total : 29

L-L at iteration 1 is -31.860
L-L at iteration 2 is -25.808
L-L at iteration 3 is -25.779
L-L at iteration 4 is -25.779
L-L at iteration 5 is -25.779
Log Likelihood: -25.779
Parameter Estimate S.E. t-ratio p-value
1 TIME -0.012 0.020 -0.575 0.565
2 COST -0.567 0.222 -2.550 0.011
3 CONSTANT 1.510 0.608 2.482 0.013
3 CONSTANT -0.865 0.675 -1.282 0.200
95.0 % bounds
Parameter Odds Ratio Upper Lower
1 TIME 0.988 1.029 0.950
2 COST 0.567 0.877 0.367
Log Likelihood of constants only model = LL(0) = -29.645
2*[LL(N)-LL(0)] = 7.732 with 2 df Chi-sq p-value = 0.021
McFaddens Rho-Squared = 0.130

Example 9
By-Choice Data Format
In the standard data layout, there is one data record per case that contains information
on every alternative open to a chooser. With a large number of alternatives, this can
quickly lead to an excessive number of variables. A convenient alternative is to
organize data by choice; with this data layout, there is one record per alternative and as
many as NCAT records per case. The data set CHOICE2 organizes the CHOICE data
of the last example in this way. If you analyze the differences between the two data sets,
you will see that they are similar to those between the split-plot and multivariate layout
for the repeated measures design (see Analysis of Variance). To set up the same
problem in a by-choice layout, input the following:
The by-choice format requires that the dependent variable appear with the same value
on each record pertaining to the case. An ALT variable (here NCHOICES) indicating
the number of records for this case must also appear on each record. The by-choice
organization results in fewer variables on the data set, with the savings increasing with
the number of alternatives. However, there is some redundancy in that certain data
values are repeated on each record. The best reason for using a by-choice format is to
Wald tests on effects across all choices

Wald Chi-Sq
Effect Statistic Signif df
3 CONSTANT 8.630 0.013 2.000

Covariance Matrix

1 2 3 4
1 0.000
2 0.001 0.049
3 -0.001 -0.082 0.370
4 -0.005 0.056 0.046 0.455

Correlation Matrix

1 2 3 4
1 1.000 0.130 -0.053 -0.350
2 0.130 1.000 -0.606 0.372
3 -0.053 -0.606 1.000 0.113
4 -0.350 0.372 0.113 1.000
USE CHOICE2
LOGIT
NCAT=3
ALT=NCHOICES
MODEL CHOICE=TIME+COST ;
ESTIMATE
handle varying numbers of alternatives per case. In this situation, there is no need to
shuffle data values or to be concerned with choice order.
With the by-choice data format, the NCAT statement is required; it is the only way
for LOGIT to know the number of alternatives to expect per case. For varying numbers
of alternatives per case, the ALT statement is also required, although we use it here with
the same number of alternatives.
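The reshaping itself is mechanical. A sketch of the idea in plain Python (the field names auto_time, pool_cost, and so on are illustrative, not the SYSTAT variable names):

# Convert one-record-per-case ("standard") data to the one-record-per-alternative
# ("by-choice") layout: one output row per alternative, with the dependent variable
# and the ALT count repeated on every record of the case.
wide = [
    {"choice": 1, "auto_time": 30, "auto_cost": 4.0,
     "pool_time": 45, "pool_cost": 2.5, "train_time": 55, "train_cost": 1.5},
]

ALTERNATIVES = ["auto", "pool", "train"]

def to_by_choice(records):
    long_rows = []
    for rec in records:
        for alt in ALTERNATIVES:
            long_rows.append({
                "choice": rec["choice"],          # repeated on each record of the case
                "nchoices": len(ALTERNATIVES),    # plays the role of the ALT variable
                "time": rec[f"{alt}_time"],
                "cost": rec[f"{alt}_cost"],
            })
    return long_rows

for row in to_by_choice(wide):
    print(row)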
Because the number of alternatives (ALT) is the same for each case in this example, the
output is the same as the Mixed Parameters example.
Weighting Choice-Based Samples
For estimation of the slope coefficients of the discrete choice model, weighting is not
required even in choice-based samples. For predictive purposes, however, weighting is
necessary to forecast aggregate shares, and it is also necessary for consistent estimation
of the alternative specific dummies (Manski and Lerman, 1977).
The appropriate weighting procedure for choice-based sample logit estimation
requires that the sum of the weights equal the actual number of observations retained
in the estimation sample. For choice-based samples, the weight for any observation choosing the jth option is $W_j = S_j / s_j$, where $S_j$ is the population share choosing the jth option and $s_j$ is the choice-based sample share choosing the jth option.
As an example, suppose theatergoers make up 10% of the population and we have a choice-based sample consisting of 100 theatergoers ($Y = 1$) and 100 non-theatergoers ($Y = 0$). Although theatergoers make up only 10% of the population, they are heavily oversampled and make up 50% of the study sample. Using the above formulas, the correct weights would be $W_1 = 0.1 / 0.5 = 0.2$ and $W_0 = 0.9 / 0.5 = 1.8$,
USE CHOICE2
LOGIT
CATEGORY SEX$
NCAT=3
ALT=NCHOICES
MODEL CHOICE=TIME+COST ; AGE+SEX$
ESTIMATE
and the sum of the weights would be 100 × 1.8 + 100 × 0.2 = 200, as required. To
handle such samples, LOGIT permits non-integer weights and does not truncate them
to integers.
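A small sketch (plain Python, not SYSTAT syntax) of the weight computation and the check that the weights sum to the sample size:

def choice_based_weights(pop_shares, sample_counts):
    """W_j = S_j / s_j, where S_j is the population share and s_j the sample share."""
    n = sum(sample_counts.values())
    return {j: pop_shares[j] / (sample_counts[j] / n) for j in sample_counts}

# Theatergoer example from the text: 10% of the population, half of the sample.
w = choice_based_weights({1: 0.10, 0: 0.90}, {1: 100, 0: 100})
print(w)                                  # {1: 0.2, 0: 1.8}
print(sum(w[j] * 100 for j in w))         # 200.0, the number of observations retained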
Example 10
Stepwise Regression
LOGIT offers forward and backward stepwise logistic regression with single stepping
as an option. The simplest way to initiate stepwise regression is to substitute START
for ESTIMATE following a MODEL statement and then proceed with stepping with the
STEP command, just as in GLM or Regression.
An upward step consists of three components. First, the current model is estimated
to convergence. The procedure is exactly the same as regular estimation. Second, score
statistics for each additional effect are conducted, adjusted for variables already in the
model. The joint significance of all additional effects together is also computed.
Finally, the effect with the smallest significance level for its score statistic is identified.
If this significance level is below the ENTER option (0.05 by default), the effect is
added to the model.
A downward step also consists of three computational segments. First, the model is
estimated to convergence. Then Wald statistics are computed for each effect in the
model. Finally, the effect with the largest p value for its Wald test statistic is identified.
If this significance level is above the REMOVE criterion (by default 0.10), the effect is
removed from the model.
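The enter/remove logic can be summarized in pseudocode-like Python. The score_pvalue and wald_pvalue callables below are placeholders for the actual test computations, so this is only a sketch of the stepping rules described above, not LOGIT's implementation:

def stepwise(candidates, forced, score_pvalue, wald_pvalue,
             enter=0.05, remove=0.10, max_steps=10):
    """Generic up-and-down stepping on p-value thresholds."""
    model = list(forced)
    for _ in range(max_steps):
        changed = False
        # Upward step: add the effect with the smallest score-test p value, if below ENTER.
        outside = [e for e in candidates if e not in model]
        if outside:
            best = min(outside, key=lambda e: score_pvalue(model, e))
            if score_pvalue(model, best) < enter:
                model.append(best)
                changed = True
        # Downward step: drop the non-forced effect with the largest Wald p value, if above REMOVE.
        removable = [e for e in model if e not in forced]
        if removable:
            worst = max(removable, key=lambda e: wald_pvalue(model, e))
            if wald_pvalue(model, worst) > remove:
                model.remove(worst)
                changed = True
        if not changed:
            break
    return model

Keeping ENTER below REMOVE, as in the defaults, is what prevents an effect from cycling in and out of the model indefinitely.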
If you require certain effects to remain in the model regardless of the outcome of the
Wald test, force them into the model by listing them first on the model and using the
FORCE option of START. It is important to set the ENTER and REMOVE criteria
carefully because it is possible to have a variable cycle in and out of a model
repeatedly. The defaults are:
START / ENTER = .05, REMOVE = .10
although Hosmer and Lemeshow use
START / ENTER = .15, REMOVE = .20
in the example we reproduce below.
Hosmer and Lemeshow use stepwise regression in their search for a model of low
birth weight discussed in the Binary Logit section. We conduct a similar analysis
with:
USE HOSLEM
LOGIT
CATEGORY RACE
MODEL LOW=CONSTANT+PTL+LWT+HT+RACE+SMOKE+UI+AGE+FTV
START / ENTER=.15,REMOVE=.20
STEP / AUTO
Following is the output:
Variables in the SYSTAT Rectangular file are:
ID LOW AGE LWT RACE SMOKE
PTL HT UI FTV BWT RACE1
CASEID PTD LWD
Stepping parameters:
Significance to include = 0.150
Significance to remove = 0.200
Number of effects to force = 1
Maximum number of steps = 10
Direction : Up and Down


Categorical values encountered during processing are:
RACE (3 levels)
1, 2, 3
LOW (2 levels)
0, 1
Categorical variables are effects coded with the highest value as reference.
Binary Stepwise LOGIT Analysis.

Dependent variable: LOW
Input records: 189
Records for analysis: 189
Sample split

Category choices
REF 59
RESP 130
Total : 189

Step 0
Log Likelihood: -117.336
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT -0.790 0.157 -5.033 0.000
Score tests on effects not in model

Score Chi-Sq
Effect Statistic Signif df
2 PTL 7.267 0.007 1.000
3 LWT 5.438 0.020 1.000
4 HT 4.388 0.036 1.000
5 RACE 5.005 0.082 2.000
6 SMOKE 4.924 0.026 1.000
7 UI 5.401 0.020 1.000
8 AGE 2.674 0.102 1.000
9 FTV 0.749 0.387 1.000
Joint Score 30.959 0.000 9.000
Step 1
Log Likelihood: -113.946
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT -0.964 0.175 -5.511 0.000
2 PTL 0.802 0.317 2.528 0.011
Score tests on effects not in model

Score Chi-Sq
Effect Statistic Signif df
3 LWT 4.113 0.043 1.000
4 HT 4.722 0.030 1.000
5 RACE 5.359 0.069 2.000
6 SMOKE 3.164 0.075 1.000
7 UI 3.161 0.075 1.000
8 AGE 3.478 0.062 1.000
9 FTV 0.577 0.448 1.000
Joint Score 24.772 0.002 8.000
Step 2
Log Likelihood: -111.792
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT -1.062 0.184 -5.764 0.000
2 PTL 0.823 0.318 2.585 0.010
3 HT 1.272 0.616 2.066 0.039
Score tests on effects not in model

Score Chi-Sq
Effect Statistic Signif df
4 LWT 6.900 0.009 1.000
5 RACE 4.882 0.087 2.000
6 SMOKE 3.117 0.078 1.000
7 UI 4.225 0.040 1.000
8 AGE 3.448 0.063 1.000
9 FTV 0.370 0.543 1.000
Joint Score 20.658 0.004 7.000
Step 3
Log Likelihood: -107.982
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 1.093 0.841 1.299 0.194
2 PTL 0.726 0.328 2.213 0.027
3 HT 1.856 0.705 2.633 0.008
4 LWT -0.017 0.007 -2.560 0.010
Score tests on effects not in model

Score Chi-Sq
Effect Statistic Signif df
5 RACE 5.266 0.072 2.000
6 SMOKE 2.857 0.091 1.000
7 UI 3.081 0.079 1.000
8 AGE 1.895 0.169 1.000
9 FTV 0.118 0.732 1.000
Joint Score 14.395 0.026 6.000
Step 4
Log Likelihood: -105.425
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 1.405 0.900 1.560 0.119
2 PTL 0.746 0.328 2.278 0.023
3 HT 1.805 0.714 2.530 0.011
4 LWT -0.018 0.007 -2.607 0.009
5 RACE_1 -0.518 0.237 -2.190 0.029
6 RACE_2 0.569 0.318 1.787 0.074
Score tests on effects not in model

Score Chi-Sq
Effect Statistic Signif df
6 SMOKE 5.936 0.015 1.000
7 UI 3.265 0.071 1.000
8 AGE 1.019 0.313 1.000
9 FTV 0.025 0.873 1.000
Joint Score 9.505 0.050 4.000
Not all logistic regression programs compute the variable addition statistics in the same
way, so minor differences in output are possible. Our results listed in the Chi-Square
Significance column of the first step, for example, correspond to H&Ls first row in
their Table 4.15; the two sets of results are very similar but not identical. While our
method yields the same final model as H&L, the order in which variables are entered
Step 5
Log Likelihood: -102.449
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 0.851 0.913 0.933 0.351
2 PTL 0.602 0.335 1.797 0.072
3 HT 1.745 0.695 2.511 0.012
4 LWT -0.017 0.007 -2.418 0.016
5 RACE_1 -0.734 0.263 -2.790 0.005
6 RACE_2 0.557 0.324 1.720 0.085
7 SMOKE 0.946 0.395 2.396 0.017
Score tests on effects not in model

Score Chi-Sq
Effect Statistic Signif df
7 UI 3.034 0.082 1.000
8 AGE 0.781 0.377 1.000
9 FTV 0.014 0.904 1.000
Joint Score 3.711 0.294 3.000
Step 6
Log Likelihood: -100.993
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 0.654 0.921 0.710 0.477
2 PTL 0.503 0.341 1.475 0.140
3 HT 1.855 0.695 2.669 0.008
4 LWT -0.016 0.007 -2.320 0.020
5 RACE_1 -0.741 0.265 -2.797 0.005
6 RACE_2 0.585 0.323 1.811 0.070
7 SMOKE 0.939 0.399 2.354 0.019
8 UI 0.786 0.456 1.721 0.085
Score tests on effects not in model
Score Chi-Sq
Effect Statistic Signif df
8 AGE 0.553 0.457 1.000
9 FTV 0.056 0.813 1.000
Joint Score 0.696 0.706 2.000
Log Likelihood: -100.993
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 0.654 0.921 0.710 0.477
2 PTL 0.503 0.341 1.475 0.140
3 HT 1.855 0.695 2.669 0.008
4 LWT -0.016 0.007 -2.320 0.020
5 RACE_1 -0.741 0.265 -2.797 0.005
6 RACE_2 0.585 0.323 1.811 0.070
7 SMOKE 0.939 0.399 2.354 0.019
8 UI 0.786 0.456 1.721 0.085
95.0 % bounds
Parameter Odds Ratio Upper Lower
2 PTL 1.654 3.229 0.847
3 HT 6.392 24.964 1.637
4 LWT 0.984 0.998 0.971
5 RACE_1 0.477 0.801 0.284
6 RACE_2 1.795 3.379 0.953
7 SMOKE 2.557 5.586 1.170
8 UI 2.194 5.367 0.897
Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 32.686 with 7 df Chi-sq p-value = 0.000
McFaddens Rho-Squared = 0.139
is not the same because intermediate p values differ slightly. Once a final model is
arrived at, it is re-estimated to give true maximum likelihood estimates.
Example 11
Hypothesis Testing
Two types of hypothesis tests are easily conducted in LOGIT: the likelihood ratio (LR)
test and the Wald test. The tests are discussed in numerous statistics books, sometimes
under varying names. Accounts can be found in Maddala's text (1988), Cox and
Hinkley (1974), Rao (1973), Engel (1984), and Breslow and Day (1980). Here we
provide some elementary examples.
Likelihood-Ratio Test
The likelihood-ratio test is conducted by fitting two nested models (the restricted and
the unrestricted) and comparing the log-likelihoods at convergence. Typically, the
unrestricted model contains a proposed set of variables, and the restricted model omits
a selected subset, although other restrictions are possible. The test statistic is twice the
difference of the log-likelihoods and is chi-squared with degrees of freedom equal to
the number of restrictions imposed. When the restrictions consist of excluding
variables, the degrees of freedom are equal to the number of parameters set to 0.
If a model contains a constant, LOGIT automatically calculates a likelihood-ratio test
of the null hypothesis that all coefficients except the constant are 0. It appears on a line
that looks like:
2*[LL(N)-LL(0)] = 26.586 with 5 df, Chi-sq p-value = 0.00007
This example line states that twice the difference between the likelihood of the estimated model and the constants-only model is 26.586, which is a chi-squared deviate on five degrees of freedom. The p value indicates that the null hypothesis would be rejected.
To illustrate use of the LR test, consider a model estimated on the low birth weight data (see the Binary Logit example). Assuming CATEGORY=RACE, compare the following model
MODEL LOW = CONSTANT + LWD + AGE + RACE + PTD
with
MODEL LOW = CONSTANT + LWD + AGE
The null hypothesis is that the categorical variable RACE, which contributes two parameters to the model, and PTD are jointly 0. The model log-likelihoods are -104.043 and -112.143, and twice the difference (16.20) is chi-squared with three degrees of freedom under the null hypothesis. This value can also be more conveniently calculated by taking the difference of the LR test statistics reported below the parameter estimates and the difference in the degrees of freedom. The unrestricted model above has G = 26.587 with five degrees of freedom, and the restricted model has G = 10.385 with two degrees of freedom. The difference between the G values is 16.20, and the difference between degrees of freedom is 3.
Although LOGIT will not automatically calculate LR statistics across separate models, the p value of the result can be obtained with the command:
CALC 1-XCF(16.2,3)
Wald Test
The Wald test is the best known inferential procedure in applied statistics. To conduct a Wald test, we first estimate a model and then pose a linear constraint on the parameters estimated. The statistic is based on the constraint and the appropriate elements of the covariance matrix of the parameter vector. A test of whether a single parameter is 0 is conducted as a Wald test by dividing the squared coefficient by its variance and referring the result to a chi-squared distribution on one degree of freedom. Thus, each t ratio is itself the square root of a simple Wald test. Following is an example:
USE HOSLEM
LOGIT
CATEGORY RACE
MODEL LOW=CONSTANT+LWD+AGE+RACE+PTD
ESTIMATE
HYPOTHESIS
CONSTRAIN PTD=0
CONSTRAIN RACE[1]=0
CONSTRAIN RACE[2]=0
TEST
Following is the output (minus the estimation stage):
Entering hypothesis procedure.
Linear Restriction System

Parameter
EQN 1 2 3 4 5
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 1.000 0.0
3 0.0 0.0 0.0 0.0 1.000
EQN 6 RHS Q
1 1.000 0.0 1.515
2 0.0 0.0 -0.442
3 0.0 0.0 0.464
General linear Wald test results

ChiSq Statistic = 15.104
ChiSq p-value = 0.002
Degrees of freedom = 3
Note that this statistic of 15.104 is close to the LR statistic of 16.2 obtained for the same hypothesis in the previous section. Although there are three separate CONSTRAIN lines in the HYPOTHESIS paragraph above, they are tested jointly in a single test. To test each restriction individually, place a TEST after each CONSTRAIN. The restrictions being tested are each entered with separate CONSTRAIN commands. These can include any linear algebraic expression without parentheses involving the parameters. If interactions were present on the MODEL statement, they can also appear on the CONSTRAIN statement. To reference dummies generated from categorical covariates, use square brackets, as in the example for RACE. This constraint refers to the coefficient labeled RACE_1 in the output.
More elaborate tests can be posed in this framework. For example,
CONSTRAIN 7*LWD - 4.3*AGE + 1.5*RACE[2] = -5
or
CONSTRAIN AGE + LWD = 1
For multinomial models, the architecture is a little different. To reference a variable that appears in more than one parameter vector, it is followed with curly braces around the number corresponding to the Choice Group. For example,
CONSTRAIN CONSTANT{1} - CONSTANT{2} = 0
CONSTRAIN AGE{1} - AGE{2} = 0
Comparisons between Tests
The Wald and likelihood-ratio tests are classical testing methods in statistics. The
properties of the tests are based on asymptotic theory, and in the limit, as sample sizes
tend to infinity, the tests give identical results. In small samples, there will be
differences between results and conclusions, as has been emphasized by Hauck and
Donner (1977). Given a choice, which test should be used?
Most statisticians favor the LR test over the Wald for three reasons. First, the
likelihood is the fundamental measure on which model fitting is based. Cox and Oakes
(1984) illustrate this preference when they use the likelihood profile to determine
confidence intervals for a parameter in a survival model. Second, Monte Carlo studies
suggest that the LR test is more reliable in small samples. Finally, a nonlinear
constraint can be imposed on the parameter estimates and simply tested by estimating
restricted and unrestricted models. See the Quantiles example for an illustration
involving LD50 values. Also, you can use the FUNPAR option in NONLIN to do the
same thing.
Why bother with the Wald test, then? One reason is simplicity and computational
cost. The LR test requires estimation of two models to final convergence for a single
test, and each additional test requires another full estimation. By contrast, any number
of Wald tests can be run on the basis of one estimated model, and they do not require
an additional pass through the data.
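Both calculations are easy to reproduce outside LOGIT. The sketch below (Python, with SciPy assumed available) computes the LR statistic for the nested models of the earlier example and shows the generic Wald quadratic form for a linear restriction Cb = r; the wald helper is illustrative, not the LOGIT routine:

import numpy as np
from scipy.stats import chi2

# Likelihood-ratio test: twice the difference in log-likelihoods of nested fits.
ll_unrestricted, ll_restricted, df = -104.043, -112.143, 3   # values from the LR example above
lr = 2 * (ll_unrestricted - ll_restricted)
print(lr, chi2.sf(lr, df))          # 16.2 and the p value given by CALC 1-XCF(16.2,3)

# Wald test of the linear restriction C b = r using one fitted model's covariance matrix.
def wald(C, b, r, cov):
    d = C @ b - r
    stat = float(d @ np.linalg.solve(C @ cov @ C.T, d))
    return stat, chi2.sf(stat, C.shape[0])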
Example 12
Quasi-Maximum Likelihood
When a model to be estimated by maximum likelihood is misspecified, White (1982)
has shown that the standard methods for obtaining the variance-covariance matrix are
incorrect. In particular, standard errors derived from the inverse matrix of second
derivatives and all hypothesis tests based on this matrix are unreliable. Since
misspecification may be the rule rather than the exception, is there any safe way to
proceed with inference? White offers an alternative variance-covariance matrix that
simplifies (asymptotically) to the inverse Hessian when the model is not misspecified
and is correct when the model is misspecified. Calling the procedure of estimating a
misspecified model quasi-maximum likelihood estimation (QMLE), the proper QML
matrix is defined as
Q = H
1
GH
1
where $H^{-1}$ is the covariance matrix at convergence and G is the cumulated outer product of the gradient vectors.
White shows that for a misspecified model, the LR test is not asymptotically chi-
squared, and the Wald and likelihood-ratio tests are not asymptotically equivalent,
even when the QML matrix is used for Wald tests.
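A minimal sketch of the sandwich computation, assuming you already have the inverse Hessian at convergence and the per-case gradient (score) vectors as NumPy arrays; this illustrates the formula, not SYSTAT's internal code:

import numpy as np

def qml_covariance(hessian_inv, case_gradients):
    """White's sandwich estimate Q = H^-1 G H^-1, where G is the summed
    outer product of the per-case gradient (score) vectors."""
    G = sum(np.outer(g, g) for g in case_gradients)
    return hessian_inv @ G @ hessian_inv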
The best course of action appears to be to use only the QML version of the Wald test
when misspecification is a serious possibility. If the QML covariance matrix is
requested with the ESTIMATE command, a second set of parameter statistics will be
printed, reflecting the new standard errors, t ratios and p values; the coefficients are
unchanged. The QML covariance matrix will replace the standard covariance matrix
during subsequent hypothesis testing with the HYPOTHESIS command. Following is
an example:
Following is the output:
USE NLS
LOGIT
MODEL CULTURE=CONSTANT+IQ
ESTIMATE / QML
Categorical values encountered during processing are:
CULTURE (3 levels)
1, 2, 3

Multinomial LOGIT Analysis.

Dependent variable: CULTURE
Input records: 200
Records for analysis: 200
Sample split

Category choices
1 12
2 49
3 139
Total : 200

L-L at iteration 1 is -219.722
L-L at iteration 2 is -148.554
L-L at iteration 3 is -144.158
L-L at iteration 4 is -143.799
L-L at iteration 5 is -143.793
L-L at iteration 6 is -143.793
Log Likelihood: -143.793
Parameter Estimate S.E. t-ratio p-value
Choice Group: 1
1 CONSTANT 4.252 2.107 2.018 0.044
2 IQ -0.065 0.021 -3.052 0.002
Choice Group: 2
1 CONSTANT 3.287 1.275 2.579 0.010
2 IQ -0.041 0.012 -3.372 0.001
95.0 % bounds
Parameter Odds Ratio Upper Lower
Choice Group: 1
2 IQ 0.937 0.977 0.898
Choice Group: 2
2 IQ 0.960 0.983 0.937
Log Likelihood of constants only model = LL(0) = -153.254
2*[LL(N)-LL(0)] = 18.921 with 2 df Chi-sq p-value = 0.000
McFaddens Rho-Squared = 0.062

Note the changes in the standard errors, t ratios, p values, odds ratio bounds, Wald test
p values, and covariance matrix.
Computation
All calculations are in double precision.
Algorithms
LOGIT uses Gauss-Newton methods for maximizing the likelihood. By default, two
tolerance criteria must be satisfied: the maximum value for relative coefficient changes
must fall below 0.001, and the Euclidean norm of the relative parameter change vector
must also fall below 0.001. By default, LOGIT uses the second derivative matrix to
update the parameter vector. In discrete choice models, it may be preferable to use a
first derivative approximation to the Hessian instead. This option, popularized by
Berndt, Hall, Hall, and Hausman (1974), will be noted if it is used by the program.
BHHH uses the summed outer products of the gradient vector in place of the Hessian
matrix and generally will converge much more slowly than the default method.
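As an illustration of the default scheme (not the program's actual source), a Newton-type update for the binary logit with the relative-change stopping rules looks like this in NumPy:

import numpy as np

def newton_logit(X, y, tol=0.001, max_iter=25):
    """Newton iterations for the binary logit, stopping when both the maximum
    relative coefficient change and the norm of the relative change vector
    fall below `tol`."""
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1 - p))[:, None])   # X'WX with W = diag(p(1-p))
        step = np.linalg.solve(hessian, gradient)
        b = b + step
        rel = np.abs(step) / np.maximum(np.abs(b), 1e-8)
        if rel.max() < tol and np.linalg.norm(rel) < tol:
            break
    return b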
Missing Data
Cases with missing data on any variables included in a model are deleted.
Covariance matrix QML adjusted.
Log Likelihood: -143.793
Parameter Estimate S.E. t-ratio p-value
Choice Group: 1
1 CONSTANT 4.252 2.252 1.888 0.059
2 IQ -0.065 0.023 -2.860 0.004
Choice Group: 2
1 CONSTANT 3.287 1.188 2.767 0.006
2 IQ -0.041 0.011 -3.682 0.000
95.0 % bounds
Parameter Odds Ratio Upper Lower
Choice Group: 1
2 IQ 0.937 0.980 0.896
Choice Group: 2
2 IQ 0.960 0.981 0.939
Log Likelihood of constants only model = LL(0) = -153.254
2*[LL(N)-LL(0)] = 18.921 with 2 df Chi-sq p-value = 0.000
McFaddens Rho-Squared = 0.062
Basic Formulas
For the binary logistic regression model, the dependent variable for the ith case is $Y_i$, taking on values of 0 (nonresponse) and 1 (response), and the probability of response is a function of the covariate vector $x_i$ and the unknown coefficient vector $\beta$. We write this probability as:
$$\mathrm{Prob}(Y_i = 1 \mid x_i) = \frac{e^{x_i \beta}}{1 + e^{x_i \beta}}$$
and abbreviate it as $P_i$. The log-likelihood for the sample is given by
$$LL(\beta) = \sum_{i=1}^{n} \left[ Y_i \log P_i + (1 - Y_i) \log(1 - P_i) \right]$$
For the polytomous multinomial logit, the integer-valued dependent variable ranges from 1 to $k$, and the probability that the ith case has $Y = m$, where $1 \le m \le k$, is:
$$\mathrm{Prob}(Y_i = m \mid x_i) = \frac{e^{x_i \beta_m}}{\sum_{j=1}^{k} e^{x_i \beta_j}}$$
In this model, $k$ is fixed for all cases, there is a single covariate vector $x_i$, and $k$ parameter vectors $\beta_j$ are estimated. This last equation is identified by normalizing $\beta_k$ to 0.
McFadden's discrete choice model represents a distinct variant of the logit model based on Luce's (1959) probabilistic choice model. Each subject is observed to make a choice from a set $C_i$ consisting of $J_i$ elements. Each element is characterized by a separate covariate vector of attributes $Z_k$. The dependent variable $Y_i$ ranges from 1 to $J_i$, with $J_i$ possibly varying across subjects, and the probability that $Y_i = k$, where $1 \le k \le J_i$, is a function of the attribute vectors $Z_1$, $Z_2$, ..., and the parameter vector $\beta$. The probability that the ith subject chooses element m from his choice set is:
$$\mathrm{Prob}(Y_i = m \mid Z) = \frac{e^{Z_m \beta}}{\sum_{j \in C_i} e^{Z_j \beta}}$$
Heuristically, this equation differs from the previous one in the components that vary with alternative outcomes of the dependent variable. In the polytomous logit, the coefficients are alternative-specific and the covariate vector is constant; in the discrete choice model, while the attribute vector is alternative-specific, the coefficients are constant. The models also differ in that the range of the dependent variable can be case-specific in the discrete choice model, while it is constant for all cases in the polytomous model.
The polytomous logit can be recast as a discrete choice model in which each covariate x is entered as an interaction with an alternative-specific dummy, and the number of alternatives is constant for all cases. This reparameterization is used for the mixed polytomous discrete choice model.
Regression Diagnostics Formulas
The SAVE command issued before the deciles of risk command (DC) produces a SYSTAT save file with a number of diagnostic quantities computed for each case in the input data set. Computations are always conducted on the assumption that each covariate pattern is unique. The following formulas are based on the binary dependent variable $y_i$, which is either 0 or 1, and fitted probabilities $\hat{P}_i$, obtained from the basic logistic equation.
LEVERAGE(1) is the diagonal element of Pregibon's (1981) hat matrix, with formulas given by Hosmer and Lemeshow (1989) as their equations (5.7) and (5.8). It is defined as $v_j b_j$, where
$$b_j = x_j' (X' V X)^{-1} x_j$$
and $x_j$ is the covariate vector for the jth case, X is the data matrix for the sample including a constant, and V is a diagonal matrix whose general diagonal element is $v_i = \hat{P}_i (1 - \hat{P}_i)$, where $\hat{P}_i$ is the fitted probability for the ith case. $b_j$ is our LEVERAGE(2).
Thus LEVERAGE(1) is given by
$$h_j = v_j b_j$$
The PEARSON residual is
$$r_j = \frac{y_i - \hat{P}_i}{\sqrt{\hat{P}_i (1 - \hat{P}_i)}}$$
The VARIANCE of the residual is
$$v_j (1 - h_j)$$
and the standardized residual STANDARD is
$$r_{sj} = \frac{r_j}{\sqrt{1 - h_j}}$$
The DEVIANCE residual is defined as
$$d_j = \sqrt{-2 \ln \hat{P}_j}$$
for $y_j = 1$ and
$$d_j = -\sqrt{-2 \ln (1 - \hat{P}_j)}$$
otherwise.
DELDSTAT is the change in deviance and is
$$\Delta D_j = \frac{d_j^2}{1 - h_j}$$
DELPSTAT is the change in Pearson chi-square:
$$\Delta \chi^2 = r_{sj}^2$$
The final three saved quantities, DELBETA(1), DELBETA(2), and DELBETA(3), are measures of the overall change in the estimated parameter vector $\beta$. DELBETA(1) is a measure proposed by Pregibon,
$$\mathrm{DELBETA}(1) = \frac{r_{sj}^2\, h_j}{1 - h_j}$$
and DELBETA(3) is
$$\mathrm{DELBETA}(3) = \frac{r_{sj}^2\, h_j}{(1 - h_j)^2}$$
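The saved quantities can be reproduced from a fitted coefficient vector. A NumPy sketch under the assumption that X already contains the constant column (illustrative, not the LOGIT source):

import numpy as np

def logit_diagnostics(X, y, b):
    """Case diagnostics following the formulas above: leverage, Pearson and
    deviance residuals, and the change-in-fit statistics saved by LOGIT."""
    p = 1.0 / (1.0 + np.exp(-X @ b))                     # fitted probabilities
    v = p * (1 - p)
    XtVX_inv = np.linalg.inv(X.T @ (X * v[:, None]))
    bvals = np.einsum("ij,jk,ik->i", X, XtVX_inv, X)     # LEVERAGE(2): x_j'(X'VX)^-1 x_j
    h = v * bvals                                        # LEVERAGE(1)
    r = (y - p) / np.sqrt(v)                             # PEARSON residual
    r_s = r / np.sqrt(1 - h)                             # STANDARD (standardized residual)
    d = np.where(y == 1, np.sqrt(-2 * np.log(p)),        # DEVIANCE residual
                 -np.sqrt(-2 * np.log(1 - p)))
    deld = d**2 / (1 - h)                                # DELDSTAT
    delp = r_s**2                                        # DELPSTAT
    delbeta1 = r_s**2 * h / (1 - h)                      # DELBETA(1)
    return h, bvals, r, r_s, d, deld, delp, delbeta1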
References
Agresti, A. (1990). Categorical data analysis. New York: John Wiley & Sons, Inc.
Albert, A. and Anderson, J. A. (1984). On the existence of maximum likelihood estimates
in logistic regression models. Biometrika, 71, 110.
Amemiya, T. (1981). Qualitative response models: A survey. Journal of Economic
Literature, 14831536.
Begg, Colin B. and Gray, R. (1984). Calculation of polychotomous logistic regression
parameters using individualized regressions. Biometrika, 71, 1118.
Beggs, S., Cardell, N. S., and Hausman, J. A. (1981). Assessing the potential demand for
electric cars. Journal of Econometrics, 16, 119.
Ben-Akiva, M. and Lerman, S. (1985). Discrete choice analysis. Cambridge, Mass.: MIT Press.
Berndt, E. K., Hall, B. K., Hall, R. E., and Hausman, J. A. (1974). Estimation and inference
in non-linear structural models. Annals of Economic and Social Measurement, 3,
653665.
Breslow, N. (1982). Covariance adjustment of relative-risk estimates in matched studies.
Biometrics, 38, 661672.
Breslow, N. and Day, N. E. (1980). Statistical methods in cancer research, vol. II: The
design and analysis of cohort studies. Lyon: IARC.
Breslow, N., Day, N. E., Halvorsen, K.T, Prentice, R.L., and Sabai, C. (1978). Estimation
of multiple relative risk functions in matched case-control studies. American Journal of
Epidemiology, 108, 299307.
Carson, R., Hanemann, M., and Steinberg, S. (1990). A discrete choice contingent
valuation estimate of the value of kenai king salmon. Journal of Behavioral Economics,
19, 5368.
Chamberlain, G. (1980). Analysis of covariance with qualitative data. Review of Economic
Studies, 47, 225238.
Cook, D. R. and Weisberg, S. (1984). Residuals and influence in regression. New York:
Chapman and Hall.
Coslett, S. R. (1980). Efficient estimation of discrete choice models. In C. Manski and D.
McFadden, Eds., Structural Analysis of Discrete Data with Econometric Applications.
Cambridge, Mass.: MIT Press.
Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 269276.
Cox, D. R. and Hinkley, D.V. (1974). Theoretical statistics. London: Chapman and Hall.
Cox, D. R. and Oakes, D. (1984). Analysis of survival data. New York: Chapman and Hall.
Domencich, T. and McFadden, D. (1975). Urban travel demand: A behavioral analysis.
Amsterdam: North-Holland.
Engel, R. F. (1984). Wald, likelihood ratio and Lagrange multiplier tests in econometrics.
In Z. Griliches and M. Intrilligator, Eds., Handbook of Econometrics. New York: North-
Holland.
Finney, D. J. (1978). Statistical method in biological assay. London: Charles Griffin.
Hauck, W. W. (1980). A note on confidence bands for the logistic response Curve.
American Statistician, 37, 158160.
Hauck, W. W. and Donner, A. (1977). Wald's test as applied to hypotheses in logit analysis. Journal of the American Statistical Association, 72, 851-853.
Hensher, D. and Johnson, L. W. (1981). Applied discrete choice modelling. London:
Croom Helm.
Hoffman, S. and Duncan, G. (1988). Multinomial and conditional logit discrete choice
models in demography. Demography, 25, 415428.
Hosmer, D. W. and Lemeshow, S. (1989). Applied logistic regression. New York: John
Wiley & Sons, Inc.
Hubert, J. J. (1984). Bioassay, 2nd ed. Dubuque, Iowa: Kendall-Hunt.
Kalbfleisch, J. and Prentice, R. (1980). The statistical analysis of failure time data. New
York: John Wiley & Sons, Inc.
Kleinbaum, D., Kupper, L., and Chambliss, L. (1982). Logistic regression analysis of
epidemiologic data: Theory and practice. Communications in Statistics: Theory and
Methods, 11, 485547.
Luce, D. R. (1959). Individual choice behavior: A theoretical analysis. New York: John
Wiley & Sons, Inc.
Luft, H., Garnick, D., Peltzman, D., Phibbs, C., Lichtenberg, E., and McPhee, S. (1988).
The sensitivity of conditional choice models for hospital care to estimation technique.
Draft, Institute for Health Policy Studies. San Francisco: University of California.
Maddala, G. S. (1983). Limited-dependent and qualitative variables in econometrics.
Cambridge University Press.
Maddala, G. S. (1988). Introduction to econometrics. New York: MacMillan.
McFadden, D. (1982). Qualitative response models. In W. Hildebrand (ed.), Advances in
Econometrics. Cambridge University Press.
McFadden, D. (1973). Conditional logit analysis of qualitative choice behavior. In P.
Zarembka (ed.), Frontiers in Econometrics. New York: Academic Press.
McFadden, D. (1976). Quantal choice analysis: A survey. Annals of Economic and Social
Measurement, 5, 363390.
McFadden, D. (1979). Quantitative methods for analyzing travel behavior of individuals:
Some recent developments. In D. A. Hensher and P. R. Stopher (eds.), Behavioral
Travel Modelling. London: Croom Helm.
McFadden, D. (1984). Econometric analysis of qualitative response models. In Z. Griliches
and M. D. Intrilligator (eds.), Handbook of Econometrics, Volume III. Elsevier Science
Publishers BV.
Manski, C. and Lerman, S. (1977). The estimation of choice probabilities from choice
based samples. Econometrica, 8, 19771988.
Manski, C. and McFadden, D. (1980). Alternative estimators and sample designs for
discrete choice analysis. In C. Manski and D. McFadden (eds.), Structural Analysis of
Discrete Data with Econometric Applications. Cambridge, Mass.: MIT Press.
Manski, C. and McFadden, D., eds. (1981). Structural analysis of discrete data with
econometric applications. Cambridge, Mass.: MIT Press.
Nerlove, M. and Press, S. J. (1973). Univariate and multivariate loglinear and logistic
models. Rand Report No R-1306EDA/NIH.
Peduzzi, P. N., Holford, T. R., and Hardy, R. J. (1980). A stepwise variable selection
procedure for nonlinear regression models. Biometrics, 36, 511516.
Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9, 705724.
Prentice, R. and Breslow, N. (1978). Retrospective studies and failure time models.
Biometrika, 65, 153158.
Prentice, R. and Pyke, R. (1979). Logistic disease incidence models and case-control
studies. Biometrika, 66, 403412.
Rao, C. R. (1973). Linear statistical inference and its applications, 2nd ed. New York: John
Wiley & Sons, Inc.
Santer, T. J. and Duffy, D. E. (1989). The statistical analysis of discrete data. New York:
Springer-Verlag.
Steinberg, D. (1991). The common structure of discrete choice and conditional logistic
regression models. Unpublished paper. Department of Economics, San Diego State
University.
Steinberg, D. (1987). Interpretation and diagnostics of the multinomial and binary logistic
regression using PROC MLOGIT. SAS Users Group International, Proceedings of the
Twelfth Annual Conference, 10711073, Cary, N.C.: SAS Institute Inc.
Steinberg, D. and Cardell, N. S. (1987). Logistic regression on pooled choice based
samples and samples missing the dependent variable. Proceedings of the Social
Statistics Section. Alexandria, Va.: American Statistical Association, 158160.
Train, K. (1986). Qualitative choice analysis. Cambridge, Mass.: MIT Press.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica,
50, 125.
Williams, D. A. (1986). Interval estimation of the median lethal dose. Biometrics, 42,
641645.
Wrigley, N. (1985). Categorical data analysis for geographers and environmental
scientists. New York: Longman.
Chapter 18
Loglinear Models
Laszlo Engelman
Loglinear models are useful for analyzing relationships among the factors of a
multiway frequency table. The loglinear procedure computes maximum likelihood
estimates of the parameters of a loglinear model by using the Newton-Raphson
method. For each user-specified model, a test of fit of the model is provided, along
with observed and expected cell frequencies, estimates of the loglinear parameters
(lambdas), standard errors of the estimates, the ratio of each lambda to its standard
error, and multiplicative effects (EXP(lambda)).
For each cell, you can request its contribution to the Pearson chi-square or the
likelihood-ratio chi-square. Deviates, standardized deviates, Freeman-Tukey
deviates, and likelihood-ratio deviates are available to characterize departures of the
observed values from expected values.
When searching for the best model, you can request tests after removing each first-
order effect or interaction term one at a time individually or hierarchically (when a
lower-order effect is removed, so are its respective interaction terms). The models do
not need to be hierarchical.
A model can explain the frequencies well in most cells, but poorly in a few.
LOGLIN uses Freeman-Tukey deviates to identify the most divergent cell, fit a model
without it, and continue in a stepwise manner identifying other outlier cells that depart
from your model.
You can specify cells that contain structural zeros (cells that are empty naturally or
by design, not by sampling), and fit a model to the subset of cells that remain. A test
of fit for such a model is often called a test of quasi-independence.
Statistical Background
Researchers fit loglinear models to the cell frequencies of a multiway table in order to
describe relationships among the categorical variables that form the table. A loglinear
model expresses the logarithm of the expected cell frequency as a linear function
of certain parameters in a manner similar to that of analysis of variance.
To introduce loglinear models, recall how to calculate expected values for the Pearson chi-square statistic. The expected value for a cell in row i and column j is:

    (row i total) * (column j total) / (total table count)

Let's ignore the denominator, because it's the same for every cell. Write:

    R_i * C_j

(Part of each expected value comes from the row it's in and part from the column it's in.) Now take the log:

    ln(R_i * C_j) = ln R_i + ln C_j

and let:

    ln R_i = A_i     and     ln C_j = B_j

and write:

    A_i + B_j

This expected value is computed under the null hypothesis of independence (that is, there is no interaction between the table factors). If this hypothesis is rejected, you would need more information than A_i and B_j. In fact, the usual chi-square test can be expressed as a test that the interaction term is needed in a model that estimates the log of the cell frequencies. We write this model as:

    ln F_ij = A_i + B_j + AB_ij

or more commonly as:

    ln F_ij = θ + λ_i^A + λ_j^B + λ_ij^AB

where θ is an overall mean effect and the λ parameters sum to zero over the levels of the row factor and the column factor. For a particular cell in a three-way table (a cell in the i row, j column, and k level of the third factor) we write:

    ln F_ijk = θ + λ_i^A + λ_j^B + λ_k^C + λ_ij^AB + λ_ik^AC + λ_jk^BC + λ_ijk^ABC

The order of the effect is the number of indices in the subscript.

Notation in publications for loglinear model parameters varies. Grant Blank summarizes:

    SYSTAT                   FATHER + SON + FATHER*SON
    Agresti (1984)           log m_ij = μ + λ_i^F + λ_j^S + λ_ij^FS
    Fienberg (1980)          log m_ij = u + u_1(i) + u_2(j) + u_12(ij)
    Goodman (1978)           Φ_ij = θ + λ_i^A + λ_j^B + λ_ij^AB
    Haberman (1978)          log m_ij = μ + λ_i^A + λ_j^B + λ_ij^AB
    Knoke and Burke (1980)   G_ij = θ + λ_i^F + λ_j^S + λ_ij^FS

or, in multiplicative form,

    Goodman (1971)           F_ij = η r_i^A r_j^B r_ij^AB

where Φ_ij = log(F_ij), θ = log(η), λ_i^A = log(r_i^A), etc.

So, a loglinear model expresses the logarithm of the expected cell frequency as a linear function of certain parameters in a manner similar to that of analysis of variance. An important distinction between ANOVA and loglinear modeling is that in the latter, the focus is on the need for interaction terms, while in ANOVA, testing for main effects is the primary interest. Look back at the loglinear model for the two-way table: the usual chi-square tests the need for the AB_ij interaction, not for A alone or B alone.

The loglinear model for a three-way table is saturated because it contains all possible terms or effects. Various smaller models can be formed by including only selected combinations of effects (or, equivalently, testing that certain effects are 0). An important goal in loglinear modeling is parsimony, that is, to see how few effects are needed to estimate the cell frequencies. You usually don't want to test that the main effect of a factor is 0 because this is the same as testing that the total frequencies are equal for all levels of the factor. For example, a test that the main effect for SURVIVE$ (alive, dead) is 0 simply tests whether the total number of survivors equals the number of nonsurvivors. If no interaction terms are included and the test is not significant (that is, the model fits), you can report that the table factors are independent. When there are more than two second-order effects, the test of an interaction is conditional on the other interactions and may not have a simple interpretation.
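The arithmetic behind these formulas is easy to check outside SYSTAT. The following short Python sketch uses illustrative, made-up counts (not data from this manual) to compute expected frequencies under independence and to verify that their logarithms split into a row term plus a column term:

import numpy as np

# Illustrative 2 x 3 table of observed counts (hypothetical values)
O = np.array([[30.0, 20.0, 10.0],
              [15.0, 25.0, 20.0]])

R = O.sum(axis=1, keepdims=True)      # row totals R_i
C = O.sum(axis=0, keepdims=True)      # column totals C_j
N = O.sum()                           # total table count

E = R * C / N                         # expected counts under independence
A = np.log(R / np.sqrt(N))            # row effect A_i (one way to absorb the constant)
B = np.log(C / np.sqrt(N))            # column effect B_j
assert np.allclose(np.log(E), A + B)  # ln E_ij = A_i + B_j

pearson = ((O - E) ** 2 / E).sum()    # Pearson chi-square for independence
lr = 2 * (O * np.log(O / E)).sum()    # likelihood-ratio chi-square
print(E.round(2), round(float(pearson), 3), round(float(lr), 3))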
Fitting a Loglinear Model
To fit a loglinear model:
• First, screen for an appropriate model to test.
• Test the model and, if the test is significant, compare its results with those for models with one or more additional terms. If it is not significant, compare results with models with fewer terms.
• For the model you select as best, examine fitted values and residuals, looking for cells (or layers within the table) with large differences between observed and expected (fitted) cell counts.
How do you determine which effects or terms to include in your loglinear model?
Ideally, by using your knowledge of the subject matter of your study, you have a
specific model in mind; that is, you want to make statements regarding the
independence of certain table factors. Otherwise, you may want to screen for effects.
The likelihood-ratio chi-square is additive under partitioning for nested models. Two models are nested if all the effects of the first are a subset of those of the second. The likelihood-ratio chi-square is additive because the statistic for the second model can be subtracted from that for the first. The difference provides a test of the additional effects; that is, the difference in the two statistics has an asymptotic chi-square distribution with degrees of freedom equal to the difference between those for the two model chi-squares (or the difference between the number of effects in the two models). This property does not hold for the Pearson chi-square. The additive property of the likelihood-ratio chi-square is useful for screening effects to include in a model.
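As a concrete sketch of this additivity (hypothetical counts and fitted values, purely to show the arithmetic; in practice the fitted values come from the maximum likelihood fits of the two nested models):

import numpy as np
from scipy.stats import chi2

def lr_chisq(observed, fitted):
    # likelihood-ratio chi-square: 2 * sum( O * ln(O / E) )
    o, e = np.asarray(observed, float), np.asarray(fitted, float)
    return 2.0 * np.sum(o * np.log(o / e))

observed    = np.array([12.0, 30.0, 18.0, 25.0, 9.0, 26.0])   # hypothetical cell counts
fit_smaller = np.array([15.0, 27.0, 20.0, 22.0, 11.0, 25.0])  # model with fewer effects
fit_larger  = np.array([13.0, 29.0, 18.0, 24.0, 10.0, 26.0])  # adds one effect worth 2 df (say)

difference = lr_chisq(observed, fit_smaller) - lr_chisq(observed, fit_larger)
print(round(difference, 3), round(chi2.sf(difference, df=2), 4))  # test of the added effect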
If you are doing exploratory research and lack firm knowledge about which effects
to include, some statisticians suggest a strategy of starting with a large model and, step
by step, identifying effects to delete. (You compare each smaller model nested within
the larger one as described above.) But we caution you about multiple testing. If you
test many models in a search for your ideal model, remember that the p value
associated with a specific test is valid when you execute one and only one test. That is,
use p values as relative measures when you test several models.
Loglinear Models in SYSTAT
Loglinear Model Main Dialog Box
To open the Loglinear Model dialog box, from the menus choose:
Statistics
Loglinear Model
Estimate Model...
The following must be specified:
Model Terms. Build the model components (main effects and interactions) by adding
terms to the Model Terms text box. All variables should be categorical (either
numerical or character). Click Cross to add interactions.
Define Table. The variables that define the frequency table. Variables that are used in
the model terms must be included in the frequency table.
The following optional computational controls can also be specified:
• Convergence. The parameter convergence criterion.
• L Convergence. The log-likelihood convergence criterion.
• Tolerance. The tolerance limit.
• Iterations. The maximum number of iterations.
• Halvings. The maximum number of step halvings.
• Delta. The constant value added to the observed frequency in each cell.
You can save two sets of statistics to a file:
• Estimates. Saves, for each cell in the table, the observed and expected frequencies and their differences, standardized and Freeman-Tukey deviates, the contribution to the Pearson and likelihood-ratio chi-square statistics, the contribution to the log-likelihood, and the cell indices.
• Lambdas. Saves, for each level of each term in the model, the estimate of lambda, the standard error of lambda, the ratio of lambda to its standard error, the multiplicative effect (EXP(lambda)), and the indices of the table factors.
Loglinear Model Statistics
Loglinear models offer statistics for hypothesis testing, parameter estimation, and
individual cell examination.
The following statistics are available:
• Chi-square. Displays Pearson and likelihood-ratio chi-square statistics for lack of fit.
• Ratio. Displays lambda divided by the standard error of lambda. For large samples, this ratio can be interpreted as a standard normal deviate (z score).
• Maximized likelihood value. The log of the model's maximum likelihood value.
• Multiplicative effects. Multiplicative parameters, EXP(lambda). Large values indicate an increased probability for that combination of indices.
• Term. One at a time, LOGLIN removes each first-order effect and each interaction term from the model. For each smaller model, LOGLIN provides a likelihood-ratio chi-square for testing the fit of the model and the difference in the chi-square statistics between the smaller model and the full model.
• Hterm. Tests each term by removing it and its higher-order interactions from the model. These tests are similar to those in Term except that only hierarchical models are tested; if a lower-order effect is removed, so are the higher-order effects that include it.
To examine the parameters, you can request the coefficients of the design variables, the
covariance matrix of the parameters, the correlation matrix of the parameters, and the
additive effect of each level for each term (lambda).
In addition, for each cell you can choose to display the observed frequency, the expected frequency, the standardized deviate, the standard error of lambda, the observed minus the expected frequency, the likelihood-ratio deviate, the Freeman-Tukey deviate, the contribution to the Pearson chi-square, and the contribution to the model's log-likelihood.
Finally, you can select the number of cells to identify as outlandish. The first cell
has the largest Freeman-Tukey deviate (these deviates are similar to z scores when the
data are from a Poisson distribution). It is treated as a structural zero, the model is fit
to the remaining cells, and the cell with the largest Freeman-Tukey deviate is
identified. This process continues step by step, each time including one more cell as a
structural zero and refitting the model.
Structural Zeros
A cell is declared to be a structural zero when the probability is zero that there are
counts in the cell. Notice that such zero frequencies do not arise because of small
samples but because the cells are empty naturally (a male hysterectomy patient) or by
design (the diagonal of a two-way table comparing fathers' (rows) and sons' (columns)
occupations is not of interest when studying changes or mobility). A model can then
be fit to the subset of cells that remain. A test of fit for such a model is often called a
test of quasi-independence.
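Outside SYSTAT, the same idea can be sketched as a Poisson regression that simply omits the structural-zero cells; the 3-by-3 mobility-style table below is hypothetical, and statsmodels is assumed to be installed:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical 3 x 3 father-by-son occupation table (diagonal not of interest)
counts = np.array([[50, 20, 10],
                   [15, 60, 25],
                   [ 5, 30, 70]])
rows, cols = np.indices(counts.shape)
cells = pd.DataFrame({"n": counts.ravel(),
                      "father": rows.ravel(),
                      "son": cols.ravel()})
off_diagonal = cells["father"] != cells["son"]          # drop the structural-zero cells

fit = smf.glm("n ~ C(father) + C(son)", data=cells[off_diagonal],
              family=sm.families.Poisson()).fit()
print(round(fit.deviance, 3), int(fit.df_resid))        # LR chi-square test of quasi-independence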
To specify structural zeros, click Zero in the Loglinear Model dialog box.
The following can be specified:
No structural zeros. No cells are treated as structural zeros.
Make all empty cells structural zeros. Treats all empty cells with zero frequency as
structural zeros.
Define custom structural zeros. Specifies one or more cells for treatment as structural
zeros. List the index (n1, n2, ...) of each factor in the order in which the factor appears
in the table. If you want to select a layer or level of a factor, use 0s for the other factors
when specifying the indices. For example, in a table with four factors (TUMOR$ being
the fourth factor), to declare the third level of TUMOR$ as structural zeros, use 0 0 0 3.
Alternatively, you can replace the 0s with blanks or periods (. . . 3).
When fitting a model, LOGLIN excludes cells identified as structural zeros, and then, as
in a regression analysis with zero weight cases, it can compute expected values,
deviates, and so on, for all cells including the structural zero cells.
You might consider identifying cells as structural zeros when:
• It is meaningful to the study at hand to exclude some cells; for example, the diagonal of a two-way table crossing the occupations of fathers and sons.
• You want to determine whether an interaction term is necessary only because there are one or two aberrant cells. That is, after you select the best model, fit a second model with fewer effects and identify the outlier cells (the most outlandish cells) for the smaller model. Then refit the best model, declaring the outlier cells to be structural zeros. If the additional interactions are no longer necessary, you might report the smaller model, adding a sentence describing how the unusual cell(s) depart from the model.
Frequency Tables (Tabulate)
If you want only a frequency table and no analysis, from the menus choose:
Statistics
Loglinear Model
Tabulate...
Simply specify the table factors in the same order in which you want to view them from
left to right. In other words, the last variable selected defines the columns of the table
and cross-classifications of all preceding variables define the rows.
Although you can also form multiway tables using Crosstabs, tables for loglinear
models are more compact and easy to read. Crosstabs forms a series of two-way tables
stratified by all combinations of the other table factors. Loglinear models create one
table, with the rows defined by factor combinations. However, loglinear model tables
do not display marginal totals, whereas Crosstabs tables do.
Using Commands
First, specify your data with USE filename. Continue with:
LOGLIN
FREQ var
TABULATE var1*var2*
MODEL variables defining table = terms of model
ZERO CELL n1, n2,
SAVE filename / ESTIMATES or LAMBDAS
PRINT SHORT or MEDIUM or LONG or NONE ,
/ OBSFREQ CHISQ RATIO MLE EXPECT STAND ELAMBDA ,
TERM HTERM COVA CORR LAMBDA SELAMBDA DEVIATES ,
LRDEV FTDEV PEARSON LOGLIKE CELLS=n
ESTIMATE / DELTA=n LCONV=n CONV=n TOL=n ITER=n HALF=n
Usage Considerations
Types of data. LOGLIN uses a cases-by-variables rectangular file or data recorded as
frequencies with cell indices.
Print options. You can control what report panels appear in the output by globally
setting output length to SHORT, MEDIUM, or LONG. You can also use the PRINT
command in LOGLIN to request reports individually. You can specify individual panels
by specifying the particular option.
Short output panels include the observed frequency for each cell, the Pearson and
likelihood-ratio chi-square statistics, lambdas divided by their standard errors, the log
of the model's maximized likelihood value, and a report of the three most outlandish
cells.
Medium results include all of the above, plus the following: the expected frequency
for each cell (current model), standardized deviations, multiplicative effects, a test of
each term by removing it from the model, a test of each term by removing it and its
higher-order interactions from the model, and the five most outlandish cells.
Long results add the following: coefficients of design variables, the covariance
matrix of the parameters, the correlation matrix of the parameters, the additive effect
of each level for each term, the standard errors of the lambdas, the observed minus the
expected frequency for each cell, the contribution to the Pearson chi-square from each
cell, the likelihood-ratio deviate for each cell, the Freeman-Tukey deviate for each cell,
the contribution to the models log-likelihood from each cell, and the 10 most
outlandish cells.
As a PRINT option, you can also specify CELLS=n, where n is the number of
outlandish cells to identify.
Quick Graphs. LOGLIN produces no Quick Graphs.
Saving files. For each level of a term included in your model, you can save the estimate
of lambda, the standard error of lambda, the ratio of lambda to its standard error, the
multiplicative effect, and the marginal indices of the effect. Alternatively, for each cell,
you can save the observed and expected frequencies, its deviates (listed above), the
Pearson and likelihood-ratio chi-square, the contributions to the log-likelihood, and the
cell indices.
BY groups. LOGLIN analyzes each level of any BY variables separately.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. LOGLIN uses the FREQ variable, if present, to duplicate cases.
Case weights. WEIGHT variables have no effect in LOGLIN.
Examples
Example 1
Loglinear Modeling of a Four-Way Table
In this example, you use the Morrison breast cancer data stored in the CANCER data file (Bishop et al., 1975) and treat the data as a four-way frequency table:
CENTER$    Center or city where the data were collected
SURVIVE$   Survival: dead or alive
AGE        Age groups of under 50, 50 to 69, and 70 or over
TUMOR$     Tumor diagnosis (called INFLAPP by some researchers) with levels:
           Minimal inflammation and benign
           Greater inflammation and benign
           Minimal inflammation and malignant
           Greater inflammation and malignant
The CANCER data include one record for each of the 72 cells formed by the four table factors. Each record includes a variable, NUMBER, that has the number of women in the cell plus numeric or character value codes to identify the levels of the four factors that define the cell.
For the first model of the CANCER data, you include three two-way interactions. The input is:
USE cancer
LOGLIN
FREQ = number
LABEL age / 50=Under 50, 60=50 to 69, 70=70 & Over
ORDER center$ survive$ tumor$ / SORT=NONE
MODEL center$*age*survive$*tumor$ = center$ + age,
+ survive$ + tumor$,
+ age*center$,
+ survive$*center$,
+ tumor$*center$
PRINT SHORT / EXPECT LAMBDAS
ESTIMATE / DELTA=0.5
The MODEL statement has two parts: table factors and terms (effects to fit). Table factors appear to the left of the equals sign and terms are on the right. The layout of the table is determined by the order in which the variables are specified; for example, specify TUMOR$ last so its levels determine the columns.
The LABEL statement assigns category names to the numeric codes for AGE. If the
statement is omitted, the data values label the categories. By default, SYSTAT orders
string variables alphabetically, so we specify SORT = NONE to list the categories for
the other factors as they first appear in the data file.
We specify DELTA = 0.5 to add 0.5 to each cell frequency. This option is common
in multiway table procedures as an aid when some cell counts are small. It is of little use in practice and is used here only to make the results comparable with those reported elsewhere.
The output is:
Case frequencies determined by value of variable NUMBER.

Number of cells (product of levels): 72
Total count: 764

Observed Frequencies
====================
CENTER$ AGE SURVIVE$ | TUMOR$
| MinMalig MinBengn MaxMalig MaxBengn
---------+---------+---------+-------------------------------------------------
Tokyo Under 50 Dead | 9.000 7.000 4.000 3.000
Alive | 26.000 68.000 25.000 9.000
+
50 to 69 Dead | 9.000 9.000 11.000 2.000
Alive | 20.000 46.000 18.000 5.000
+
70 & Over Dead | 2.000 3.000 1.000 0.0
Alive | 1.000 6.000 5.000 1.000
---------+---------+---------+-------------------------------------------------
Boston Under 50 Dead | 6.000 7.000 6.000 0.0
Alive | 11.000 24.000 4.000 0.0
+
50 to 69 Dead | 8.000 20.000 3.000 2.000
Alive | 18.000 58.000 10.000 3.000
+
70 & Over Dead | 9.000 18.000 3.000 0.0
Alive | 15.000 26.000 1.000 1.000
---------+---------+---------+-------------------------------------------------
Glamorgn Under 50 Dead | 16.000 7.000 3.000 0.0
Alive | 16.000 20.000 8.000 1.000
+
50 to 69 Dead | 14.000 12.000 3.000 0.0
Alive | 27.000 39.000 10.000 4.000
+
70 & Over Dead | 3.000 7.000 3.000 0.0
Alive | 12.000 11.000 4.000 1.000
-----------------------------+-------------------------------------------------
Pearson ChiSquare 57.5272 df 51 Probability 0.24635
LR ChiSquare 55.8327 df 51 Probability 0.29814
Rafterys BIC -282.7342
Dissimilarity 9.9530

Expected Values
===============
CENTER$ AGE SURVIVE$ | TUMOR$
| MinMalig MinBengn MaxMalig MaxBengn
---------+---------+---------+-------------------------------------------------
Tokyo Under 50 Dead | 7.852 15.928 7.515 2.580
Alive | 28.076 56.953 26.872 9.225
+
50 to 69 Dead | 6.281 12.742 6.012 2.064
Alive | 22.460 45.563 21.498 7.380
+
70 & Over Dead | 1.165 2.363 1.115 0.383
Alive | 4.166 8.451 3.988 1.369
---------+---------+---------+-------------------------------------------------
Boston Under 50 Dead | 5.439 12.120 2.331 0.699
Alive | 10.939 24.378 4.688 1.406
+
50 to 69 Dead | 11.052 24.631 4.737 1.421
Alive | 22.231 49.542 9.527 2.858
+
70 & Over Dead | 6.754 15.052 2.895 0.868
Alive | 13.585 30.276 5.822 1.747
---------+---------+---------+-------------------------------------------------
Glamorgn Under 50 Dead | 9.303 10.121 3.476 0.920
Alive | 19.989 21.746 7.468 1.977
+
50 to 69 Dead | 14.017 15.249 5.237 1.386
Alive | 30.117 32.764 11.252 2.979
+
70 & Over Dead | 5.582 6.073 2.086 0.552
Alive | 11.993 13.048 4.481 1.186
-----------------------------+-------------------------------------------------

Log-Linear Effects (Lambda)
===========================

THETA
-------------
1.826
-------------

CENTER$
Tokyo Boston Glamorgn
-------------------------------------
0.049 0.001 -0.050
-------------------------------------

AGE
Under 50 50 to 69 70 & Over
-------------------------------------
0.145 0.444 -0.589
-------------------------------------

SURVIVE$
Dead Alive
-------------------------
-0.456 0.456
-------------------------

TUMOR$
MinMalig MinBengn MaxMalig MaxBengn
-------------------------------------------------
0.480 1.011 -0.145 -1.346
-------------------------------------------------

CENTER$ | AGE
| Under 50 50 to 69 70 & Over
---------+-------------------------------------
Tokyo | 0.565 0.043 -0.609
Boston | -0.454 -0.043 0.497
Glamorgn | -0.111 -0.000 0.112
---------+-------------------------------------

CENTER$ | SURVIVE$
| Dead Alive
---------+-------------------------
Tokyo | -0.181 0.181
Boston | 0.107 -0.107
Glamorgn | 0.074 -0.074
---------+-------------------------

CENTER$ | TUMOR$
| MinMalig MinBengn MaxMalig MaxBengn
---------+-------------------------------------------------
Tokyo | -0.368 -0.191 0.214 0.345
Boston | 0.044 0.315 -0.178 -0.181
Glamorgn | 0.323 -0.123 -0.036 -0.164
---------+-------------------------------------------------

Lambda / SE(Lambda)
===================

THETA
-------------
1.826
-------------

CENTER$
Tokyo Boston Glamorgn
-------------------------------------
0.596 0.014 -0.586
-------------------------------------

AGE
Under 50 50 to 69 70 & Over
-------------------------------------
2.627 8.633 -8.649
-------------------------------------

SURVIVE$
Dead Alive
-------------------------
-11.548 11.548
-------------------------

TUMOR$
MinMalig MinBengn MaxMalig MaxBengn
-------------------------------------------------
6.775 15.730 -1.718 -10.150
-------------------------------------------------

CENTER$ | AGE
| Under 50 50 to 69 70 & Over
---------+-------------------------------------
Tokyo | 7.348 0.576 -5.648
Boston | -5.755 -0.618 5.757
Glamorgn | -1.418 -0.003 1.194
---------+-------------------------------------

CENTER$ | SURVIVE$
| Dead Alive
---------+-------------------------
Tokyo | -3.207 3.207
Boston | 1.959 -1.959
Glamorgn | 1.304 -1.304
---------+-------------------------

CENTER$ | TUMOR$
| MinMalig MinBengn MaxMalig MaxBengn
---------+-------------------------------------------------
Tokyo | -3.862 -2.292 2.012 2.121
Boston | 0.425 3.385 -1.400 -0.910
Glamorgn | 3.199 -1.287 -0.289 -0.827
---------+-------------------------------------------------
Model ln(MLE): -160.563

The 3 most outlandish cells (based on FTD, stepwise):
======================================================

CENTER$
| AGE
| | SURVIVE$
ln(MLE) LR_ChiSq p-value Frequency | | | TUMOR$
--------- -------- -------- --------- - - - -
-154.685 11.755 0.001 7 1 1 1 2
-150.685 8.001 0.005 1 2 3 2 3
-145.024 11.321 0.001 16 3 1 1 1
Initially, SYSTAT produces a frequency table for the data. We entered cases for 72 cells. The total frequency count across these cells is 764; that is, there are 764 women in the sample. Notice that the order of the factors is the same order we specified in the MODEL statement. The last variable (TUMOR$) defines the columns; the remaining variables define the rows.
The test of fit is not significant for either the Pearson chi-square or the likelihood-ratio test, indicating that your model with its three two-way interactions does not disagree with the observed frequencies. The MODEL statement describes an association between study center and age, survival, and tumor status. However, at each center, the other three factors are independent. Because the overall goal is parsimony, we could explore whether any of the interactions can be dropped.
Raftery's BIC (Bayesian Information Criterion) adjusts the chi-square for both the complexity of the model (measured by degrees of freedom) and the size of the sample. It is the likelihood-ratio chi-square minus the degrees of freedom for the current model times the natural log of the sample size. If BIC is negative, you can conclude that the model is preferable to the saturated model. When comparing alternative models, select the model with the lowest BIC value.
The index of dissimilarity can be interpreted as the percentage of cases that need to be relocated in order to make the observed and expected counts equal. For these data, you would have to move about 9.95% of the cases to make the expected frequencies fit.
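Raftery's BIC printed above is simple to reproduce from the numbers in the output (a quick check in Python):

import math

lr_chisq, df, n = 55.8327, 51, 764       # likelihood-ratio chi-square, its df, total count
bic = lr_chisq - df * math.log(n)        # LR chi-square minus df times ln(sample size)
print(round(bic, 3))                     # about -282.734; the output shows -282.7342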
The expected frequencies are obtained by fitting the loglinear model to the observed
frequencies. Compare these values with the observed frequencies. Values for
corresponding cells will be similar if the model fits well.
After the expected values, SYSTAT lists the parameter estimates for the model you
requested. Usually, it is of more interest to examine these estimates divided by their
standard errors. Here, however, we display them in order to relate them to the expected
values. For example, the observed frequency for the cell in the upper left corner (Tokyo, Under 50, Dead, MinMalig) is 9. To find the expected frequency under your model, you add the estimates (from each panel, select the term that corresponds to your cell):

    theta      1.826        C*A    0.565
    CENTER$    0.049        C*S   -0.181
    AGE        0.145        C*T   -0.368
    SURVIVE$  -0.456
    TUMOR$     0.480

and then use SYSTAT's calculator to sum the estimates:

    CALC 1.826 + 0.049 + 0.145 - 0.456 + 0.480 + 0.565 - 0.181 - 0.368

and SYSTAT responds 2.06. Take the antilog of this value:

    CALC EXP(2.06)

and SYSTAT responds 7.846. In the panel of expected values, this number is printed as 7.852 (in its calculations, SYSTAT uses more digits following the decimal point). Thus, for this cell, the sample includes 9 women (observed frequency) and the model predicts 7.85 women (expected frequency).
The ratio of the parameter estimates to their asymptotic standard errors is part of the default output. Examine these values to better understand the relationships among the table factors. Because, for large samples, this ratio can be interpreted as a standard normal deviate (z score), you can use it to indicate significant parameters; for example, for an interaction term, significant positive (or negative) associations. In the CENTER$ by AGE panel, the ratio for young women from Tokyo is very large (7.348), implying a significant positive association, and that for older Tokyo women is extremely negative (-5.648). The reverse is true for the women from Boston. If you use the Column Percent option in XTAB to print column percentages for CENTER$ by AGE, you will see that among the women under 50, more than 50% are from Tokyo (52.1), while only 23% are from Boston. In the 70 and over age group, 14% are from Tokyo and 55% are from Boston.
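The same arithmetic can be reproduced outside SYSTAT; for example, in Python:

import math

# Lambda estimates for the cell (Tokyo, Under 50, Dead, MinMalig), taken from the panels above
lambdas = [1.826,    # theta
           0.049,    # CENTER$  = Tokyo
           0.145,    # AGE      = Under 50
          -0.456,    # SURVIVE$ = Dead
           0.480,    # TUMOR$   = MinMalig
           0.565,    # CENTER$ * AGE
          -0.181,    # CENTER$ * SURVIVE$
          -0.368]    # CENTER$ * TUMOR$
log_expected = sum(lambdas)
print(round(log_expected, 2), round(math.exp(log_expected), 3))   # 2.06 and about 7.846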
The Alive estimate for Tokyo shows a strong positive association (3.207) with survival in Tokyo. The relationship in Boston is negative (-1.959). In this study, the overall survival rate is 72.5%. In Tokyo, 79.3% of the women survived, while in Boston, 67.6% survived. There is a negative association for having a malignant tumor with minimal inflammation in Tokyo (-3.862). The same relationship is strongly positive in Glamorgan (3.199).
Cells that depart from the current model are identified as outlandish in a stepwise
manner. The first cell has the largest Freeman-Tukey deviate (these deviates are
similar to z scores when the data are from a Poisson distribution). It is treated as a
structural zero, the model is fit to the remaining cells, and the cell with the largest
Freeman-Tukey deviate is identified. This process continues step by step, each time
including one more cell as a structural zero and refitting the model.
For the current model, the cell corresponding to the youngest nonsurvivors from Tokyo with benign tumors and minimal inflammation (Tokyo, Under 50, Dead, MinBengn) differs the most from its expected value. There are 7
women in the cell and the expected value is 15.9 women. The next most unusual cell
is 2,3,2,3 (Boston, 70 & Over, Alive, MaxMalig), and so on.
Medium Output
We continue the previous analysis, repeating the same model, but changing the PRINT
(output length) setting to request medium-length results:
USE cancer
LOGLIN
FREQ = number
LABEL age / 50=Under 50, 60=50 to 69, 70=70 & Over
ORDER center$ survive$ tumor$ / SORT=NONE
MODEL center$*age*survive$*tumor$ = age # center$,
+ survive$ # center$,
+ tumor$ # center$
PRINT MEDIUM
ESTIMATE / DELTA=0.5
Notice that we use shortcut notation to specify the model. The output includes:
Standardized Deviates = (Obs-Exp)/sqrt(Exp)
===========================================
CENTER$ AGE SURVIVE$ | TUMOR$
| MinMalig MinBengn MaxMalig MaxBengn
---------+---------+---------+-------------------------------------------------
Tokyo Under 50 Dead | 0.410 -2.237 -1.282 0.262
Alive | -0.392 1.464 -0.361 -0.074
+
50 to 69 Dead | 1.085 -1.048 2.034 -0.044
Alive | -0.519 0.065 -0.754 -0.876
+
70 & Over Dead | 0.774 0.414 -0.109 -0.619
Alive | -1.551 -0.843 0.507 -0.315
---------+---------+---------+-------------------------------------------------
Boston Under 50 Dead | 0.241 -1.471 2.403 -0.836
Alive | 0.018 -0.077 -0.318 -1.186
+
50 to 69 Dead | -0.918 -0.933 -0.798 0.486
Alive | -0.897 1.202 0.153 0.084
+
70 & Over Dead | 0.864 0.760 0.062 -0.932
Alive | 0.384 -0.777 -1.999 -0.565
---------+---------+---------+-------------------------------------------------
Glamorgn Under 50 Dead | 2.196 -0.981 -0.255 -0.959
Alive | -0.892 -0.374 0.195 -0.695
+
50 to 69 Dead | -0.004 -0.832 -0.977 -1.177
Alive | -0.568 1.089 -0.373 0.592
+
70 & Over Dead | -1.093 0.376 0.633 -0.743
Alive | 0.002 -0.567 -0.227 -0.171
-----------------------------+-------------------------------------------------
Multiplicative Effects = exp(Lambda)
====================================

THETA
-------------
6.209
-------------

AGE
Under 50 50 to 69 70 & Over
-------------------------------------
1.156 1.559 0.555
-------------------------------------

CENTER$
Tokyo Boston Glamorgn
-------------------------------------
1.050 1.001 0.951
-------------------------------------

SURVIVE$
Dead Alive
-------------------------
0.634 1.578
-------------------------

TUMOR$
MinMalig MinBengn MaxMalig MaxBengn
-------------------------------------------------
1.616 2.748 0.865 0.260
-------------------------------------------------
CENTER$ | AGE
| Under 50 50 to 69 70 & Over
---------+-------------------------------------
Tokyo | 1.760 1.044 0.544
Boston | 0.635 0.958 1.644
Glamorgn | 0.895 1.000 1.118
---------+-------------------------------------

CENTER$ | SURVIVE$
| Dead Alive
---------+-------------------------
Tokyo | 0.835 1.198
Boston | 1.113 0.899
Glamorgn | 1.077 0.929
---------+-------------------------

CENTER$ | TUMOR$
| MinMalig MinBengn MaxMalig MaxBengn
---------+-------------------------------------------------
Tokyo | 0.692 0.826 1.238 1.412
Boston | 1.045 1.370 0.837 0.834
Glamorgn | 1.382 0.884 0.965 0.849
---------+-------------------------------------------------
Model ln(MLE): -160.563

Term tested The model without the term Removal of term from model
ln(MLE) Chi-Sq df p-value Chi-Sq df p-value
--------------- --------- -------- ---- -------- -------- ---- --------
AGE. . . . . . -216.120 166.95 53 0.0000 111.11 2 0.0000
CENTER$. . . . -160.799 56.31 53 0.3523 0.47 2 0.7894
SURVIVE$ . . . -234.265 203.24 52 0.0000 147.41 1 0.0000
TUMOR$ . . . . -344.471 423.65 54 0.0000 367.82 3 0.0000
CENTER$
* AGE. . . . . -196.672 128.05 55 0.0000 72.22 4 0.0000
CENTER$
* SURVIVE$ . . -166.007 66.72 53 0.0975 10.89 2 0.0043
CENTER$
* TUMOR$ . . . -178.267 91.24 57 0.0027 35.41 6 0.0000

Term tested The model without the term Removal of term from model
hierarchically ln(MLE) Chi-Sq df p-value Chi-Sq df p-value
--------------- --------- -------- ---- -------- -------- ---- --------
AGE. . . . . . -246.779 228.26 57 0.0000 172.43 6 0.0000
CENTER$. . . . -224.289 183.29 65 0.0000 127.45 14 0.0000
SURVIVE$ . . . -242.434 219.57 54 0.0000 163.74 3 0.0000
TUMOR$ . . . . -363.341 461.39 60 0.0000 405.56 9 0.0000

The 5 most outlandish cells (based on FTD, stepwise):
======================================================

CENTER$
| AGE
| | SURVIVE$
ln(MLE) LR_ChiSq p-value Frequency | | | TUMOR$
--------- -------- -------- --------- - - - -
-154.685 11.755 0.001 7 1 1 1 2
-150.685 8.001 0.005 1 2 3 2 3
-145.024 11.321 0.001 16 3 1 1 1
-140.740 8.569 0.003 6 2 1 1 3
-136.662 8.157 0.004 11 1 2 1 3
The goodness-of-fit tests provide an overall indication of how close the expected values are to the cell counts. Just as you study residuals for each case in multiple regression, you can use deviates to compare the observed and expected values for each cell. A standardized deviate is the square root of each cell's contribution to the Pearson chi-square statistic; that is, (the observed frequency minus the expected frequency)
divided by the square root of the expected frequency. These values are similar to z
scores. For the second cell in the first row, the expected value under your model is
considerably larger than the observed count (its deviate is -2.237, the observed count
is 7, and the expected count is 15.9). Previously, this cell was identified as the most
outlandish cell using Freeman-Tukey deviates.
Note that LOGLIN produces five types of deviates or residuals: standardized, the
observed minus the expected frequency, the likelihood-ratio deviate, the Freeman-
Tukey deviate, and the Pearson deviate.
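For that same cell, the first three of these deviates are easy to verify by hand (the Freeman-Tukey value below uses the usual sqrt(O) + sqrt(O+1) - sqrt(4E+1) form and is shown only as an illustration; the manual does not print it for this cell):

import math

obs, exp = 7.0, 15.928                                    # (Tokyo, Under 50, Dead, MinBengn)
standardized  = (obs - exp) / math.sqrt(exp)              # about -2.237, as printed above
raw_deviate   = obs - exp                                 # observed minus expected
freeman_tukey = math.sqrt(obs) + math.sqrt(obs + 1) - math.sqrt(4 * exp + 1)
print(round(standardized, 3), round(raw_deviate, 3), round(freeman_tukey, 3))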
Estimates of the multiplicative parameters equal exp(lambda). Look for values that
depart markedly from 1.0. Very large values indicate an increased probability for that
combination of indices and, conversely, a value considerably less than 1.0 indicates an
unlikely combination. A test of the hypothesis that a multiplicative parameter equals
1.0 is the same as that for lambda equal to 0; so use the values of (lambda)/S.E. to test
the values in this panel. For the CENTER$ by AGE interaction, the most likely
combination is women under 50 from Tokyo (1.76); the least likely combination is
women 70 and over from Tokyo (0.544).
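These multiplicative values are just the antilogs of the lambdas shown earlier; for the two CENTER$ by AGE cells mentioned above:

import math

print(round(math.exp(0.565), 3),    # Tokyo, Under 50: about 1.759 (printed as 1.760 from unrounded lambdas)
      round(math.exp(-0.609), 3))   # Tokyo, 70 & Over: about 0.544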
After listing the multiplicative effects, SYSTAT tests reduced models by removing
each first-order effect and each interaction from the model one at a time. For each
smaller model, LOGLIN provides:
• A likelihood-ratio chi-square for testing the fit of the model
• The difference in the chi-square statistics between the smaller model and the full model
The likelihood-ratio chi-square for the full model is 55.833. For a model that omits
AGE, the likelihood-ratio chi-square is 166.95. This smaller model does not fit the
observed frequencies (p value < 0.00005). To determine whether the removal of this
term results in a significant decrease in the fit, look at the difference in the statistics:
166.95 - 55.833 = 111.117, p value < 0.00005. The fit worsens significantly when AGE
is removed from the model.
From the second line in this panel, it appears that a model without the first-order
term for CENTER$ fits (p value = 0.3523). However, removing any of the two-way
interactions involving CENTER$ significantly decreases the model fit.
The hierarchical tests are similar to the preceding tests except that only hierarchical models are tested; if a lower-order effect is removed, so are the higher-order effects that include it. For example, when CENTER$ is removed hierarchically, the three interactions with CENTER$ are also removed. The reduction in the fit is significant (p < 0.00005). Although removing the first-order effect of CENTER$ alone does not significantly alter the fit, removing the higher-order effects involving CENTER$ decreases the fit substantially.
Example 2
Screening Effects
In this example, you pretend that no models have been fit to the CANCER data (that is,
you have not seen the other examples). As a place to start, first fit a model with all second-order interactions, finding that it fits. Then fit models nested within the first by using results from the HTERM (terms tested hierarchically) panel to guide your selection of terms to be removed.
Here's a summary of your instructions: you study the output generated from the first MODEL and ESTIMATE statements and decide to remove AGE by TUMOR$. After seeing the results for this smaller model, you decide to remove AGE by SURVIVE$, too.
USE cancer
LOGLIN
FREQ = number
PRINT NONE / CHI HTERM
MODEL center$*age*survive$*tumor$ = tumor$..center$^2
ESTIMATE / DELTA=0.5
MODEL center$*age*survive$*tumor$ = tumor$..center$^2,
- age*tumor$
ESTIMATE / DELTA=0.5
MODEL center$*age*survive$*tumor$ = tumor$..center$^2,
- age*tumor$,
- age*survive$
ESTIMATE / DELTA=0.5
MODEL center$*age*survive$*tumor$ = tumor$..center$^2,
- age*tumor$,
- age*survive$,
- tumor$*survive$
ESTIMATE / DELTA=0.5
The output follows:
All two-way interactions
Pearson ChiSquare 40.1650 df 40 Probability 0.46294
LR ChiSquare 39.9208 df 40 Probability 0.47378
Rafterys BIC -225.6219
Dissimilarity 7.6426

Term tested The model without the term Removal of term from model
hierarchically ln(MLE) Chi-Sq df p-value Chi-Sq df p-value
--------------- --------- -------- ---- -------- -------- ---- --------
TUMOR$ . . . . -361.233 457.17 58 0.0000 417.25 18 0.0000
SURVIVE$ . . . -241.675 218.06 48 0.0000 178.14 8 0.0000
AGE. . . . . . -241.668 218.04 54 0.0000 178.12 14 0.0000
CENTER$. . . . -213.996 162.70 54 0.0000 122.78 14 0.0000
SURVIVE$
* TUMOR$ . . . -157.695 50.10 43 0.2125 10.18 3 0.0171
AGE
* TUMOR$ . . . -153.343 41.39 46 0.6654 1.47 6 0.9613
AGE
* SURVIVE$ . . -154.693 44.09 42 0.3831 4.17 2 0.1241
CENTER$
* TUMOR$ . . . -169.724 74.15 46 0.0053 34.23 6 0.0000
CENTER$
* SURVIVE$ . . -156.501 47.71 42 0.2518 7.79 2 0.0204
CENTER$
* AGE. . . . . -186.011 106.73 44 0.0000 66.81 4 0.0000
Remove AGE * TUMOR$
Pearson ChiSquare 41.8276 df 46 Probability 0.64757
LR ChiSquare 41.3934 df 46 Probability 0.66536
Rafterys BIC -263.9807
Dissimilarity 7.8682

Term tested The model without the term Removal of term from model
hierarchically ln(MLE) Chi-Sq df p-value Chi-Sq df p-value
--------------- --------- -------- ---- -------- -------- ---- --------
TUMOR$ . . . . -361.233 457.17 58 0.0000 415.78 12 0.0000
SURVIVE$ . . . -242.434 219.57 54 0.0000 178.18 8 0.0000
AGE. . . . . . -241.668 218.04 54 0.0000 176.65 8 0.0000
CENTER$. . . . -215.687 166.08 60 0.0000 124.69 14 0.0000
SURVIVE$
* TUMOR$ . . . -158.454 51.61 49 0.3719 10.22 3 0.0168
AGE
* SURVIVE$ . . -155.452 45.61 48 0.5713 4.22 2 0.1214
CENTER$
* TUMOR$ . . . -171.415 77.54 52 0.0124 36.14 6 0.0000
CENTER$
* SURVIVE$ . . -157.291 49.29 48 0.4214 7.90 2 0.0193
CENTER$
* AGE. . . . . -187.702 110.11 50 0.0000 68.72 4 0.0000
Remove AGE * TUMOR$ and AGE * SURVIVE$
Pearson ChiSquare 45.3579 df 48 Probability 0.58174
LR ChiSquare 45.6113 df 48 Probability 0.57126
Rafterys BIC -273.0400
Dissimilarity 8.4720

Term tested The model without the term Removal of term from model
hierarchically ln(MLE) Chi-Sq df p-value Chi-Sq df p-value
--------------- --------- -------- ---- -------- -------- ---- --------
TUMOR$ . . . . -363.341 461.39 60 0.0000 415.78 12 0.0000
SURVIVE$ . . . -242.434 219.57 54 0.0000 173.96 6 0.0000
AGE. . . . . . -241.668 218.04 54 0.0000 172.43 6 0.0000
CENTER$. . . . -219.546 173.80 62 0.0000 128.19 14 0.0000
SURVIVE$
* TUMOR$ . . . -160.563 55.83 51 0.2981 10.22 3 0.0168
CENTER$
* TUMOR$ . . . -173.524 81.75 54 0.0087 36.14 6 0.0000
CENTER$
* SURVIVE$ . . -161.264 57.23 50 0.2245 11.62 2 0.0030
CENTER$
* AGE. . . . . -191.561 117.83 52 0.0000 72.22 4 0.0000
Remove AGE * TUMOR$, AGE * SURVIVE$, and TUMOR$ * SURVIVE$
Pearson ChiSquare 57.5272 df 51 Probability 0.24635
LR ChiSquare 55.8327 df 51 Probability 0.29814
Rafterys BIC -282.7342
Dissimilarity 9.9530

Term tested The model without the term Removal of term from model
hierarchically ln(MLE) Chi-Sq df p-value Chi-Sq df p-value
--------------- --------- -------- ---- -------- -------- ---- --------
TUMOR$ . . . . -363.341 461.39 60 0.0000 405.56 9 0.0000
SURVIVE$ . . . -242.434 219.57 54 0.0000 163.74 3 0.0000
AGE. . . . . . -246.779 228.26 57 0.0000 172.43 6 0.0000
CENTER$. . . . -224.289 183.29 65 0.0000 127.45 14 0.0000
CENTER$
* TUMOR$ . . . -178.267 91.24 57 0.0027 35.41 6 0.0000
CENTER$
* SURVIVE$ . . -166.007 66.72 53 0.0975 10.89 2 0.0043
CENTER$
* AGE. . . . . -196.672 128.05 55 0.0000 72.22 4 0.0000
The likelihood-ratio chi-square for the model that includes all two-way interactions is 39.9 (p value = 0.4738). If the AGE by TUMOR$ interaction is removed, the chi-square for the smaller model is 41.39 (p value = 0.6654). Does the removal of this interaction cause a significant change? No; chi-square = 1.47 (p value = 0.9613). This chi-square is computed as 41.39 minus 39.92 with 46 minus 40 degrees of freedom. The removal of this interaction results in the least change, so you remove it first. Notice also that the estimate of the maximized likelihood function is largest when this second-order effect is removed (-153.343).
The model chi-square for the second model is the same as that given for the first model with AGE * TUMOR$ removed (41.3934). Here, if AGE by SURVIVE$ is removed, the new model fits (p value = 0.5713) and the change between the model minus one interaction and that minus two interactions is insignificant (p value = 0.1214).
If SURVIVE$ by TUMOR$ is removed from the current model with four interactions, the new model fits (p value = 0.2981). The change in fit is not significant at the 0.01 level (p value = 0.0168). Should we remove any other terms? Looking at the HTERM panel for the model with three interactions, you see that a model without CENTER$ by SURVIVE$ has a marginal fit (p value = 0.0975) and the chi-square for the difference is significant (p value = 0.0043). Although the goal is parsimony and technically a model with only two interactions does fit, you opt for the model that also includes CENTER$ by SURVIVE$ because it is a significant improvement over the very smallest model.
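The difference test quoted in the first paragraph above is easy to verify with SciPy, using only numbers printed in the output:

from scipy.stats import chi2

g2_full, df_full   = 39.92, 40        # all two-way interactions
g2_small, df_small = 41.39, 46        # AGE * TUMOR$ removed
diff, ddf = g2_small - g2_full, df_small - df_full
print(round(diff, 2), round(chi2.sf(diff, ddf), 3))    # 1.47 with 6 df, p about 0.961
print(round(chi2.sf(g2_small, df_small), 3))           # fit of the smaller model, p about 0.665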
Example 3
Structural Zeros
This example identifies outliers and then declares them to be structural zeros. You
wonder if any of the interactions in the model fit in the example on loglinear modeling
for a four-way table are necessary only because of a few unusual cells. To identify the
unusual cells, first pull back from your ideal model and fit a model with main effects
only, asking for the four most unusual cells. (Why four cells? Because 5% of 72 cells
is 3.6 or roughly 4.)
USE cancer
LOGLIN
FREQ = number
ORDER center$ survive$ tumor$ / SORT=NONE
MODEL center$*age*survive$*tumor$ = tumor$ .. center$
PRINT / CELLS=4
ESTIMATE / DELTA=0.5
Of course this model doesn't fit, but following are selections from the output:
Pearson ChiSquare 181.3892 df 63 Probability 0.00000
LR ChiSquare 174.3458 df 63 Probability 0.00000
Rafterys BIC -243.8839
Dissimilarity 19.3853
The 4 most outlandish cells (based on FTD, stepwise):
======================================================

CENTER$
| AGE
| | SURVIVE$
ln(MLE) LR_ChiSq p-value Frequency | | | TUMOR$
--------- -------- -------- --------- - - - -
-203.261 33.118 0.000 68 1 1 2 2
-195.262 15.997 0.000 1 1 3 2 1
-183.471 23.582 0.000 25 1 1 2 3
-176.345 14.253 0.000 6 1 3 2 2
Next, fit your ideal model, identifying these four cells as structural zeros and also requesting PRINT / HTERM to test the need for each interaction term.
Defining Four Cells As Structural Zeros
Continuing from the analysis of main effects only, now specify your original model
with its three second-order effects:
MODEL center$*age*survive$*tumor$ = ,
(age + survive$ + tumor$) # center$
ZERO CELL=1 1 2 2 CELL=1 3 2 1 CELL=1 1 2 3 CELL=1 3 2 2
PRINT / HTERMS
ESTIMATE / DELTA=0.5
Following are selections from the output. Notice that asterisks mark the structural zero cells.
Number of cells (product of levels): 72
Number of structural zero cells: 4
Total count: 664

Observed Frequencies
====================
CENTER$ AGE SURVIVE$ | TUMOR$
| MinMalig MinBengn MaxMalig MaxBengn
---------+---------+---------+-------------------------------------------------
Tokyo Under 50 Dead | 9.000 7.000 4.000 3.000
Alive | 26.000 *68.000 *25.000 9.000
+
50 to 69 Dead | 9.000 9.000 11.000 2.000
Alive | 20.000 46.000 18.000 5.000
+
70 & Over Dead | 2.000 3.000 1.000 0.0
Alive | *1.000 *6.000 5.000 1.000
---------+---------+---------+-------------------------------------------------
Boston Under 50 Dead | 6.000 7.000 6.000 0.0
Alive | 11.000 24.000 4.000 0.0
+
50 to 69 Dead | 8.000 20.000 3.000 2.000
Alive | 18.000 58.000 10.000 3.000
+
70 & Over Dead | 9.000 18.000 3.000 0.0
Alive | 15.000 26.000 1.000 1.000
---------+---------+---------+-------------------------------------------------
Glamorgn Under 50 Dead | 16.000 7.000 3.000 0.0
Alive | 16.000 20.000 8.000 1.000
+
50 to 69 Dead | 14.000 12.000 3.000 0.0
Alive | 27.000 39.000 10.000 4.000
+
70 & Over Dead | 3.000 7.000 3.000 0.0
Alive | 12.000 11.000 4.000 1.000
-----------------------------+-------------------------------------------------
* indicates structural zero cells
Pearson ChiSquare 46.8417 df 47 Probability 0.47906
LR ChiSquare 44.8815 df 47 Probability 0.56072
Rafterys BIC -260.5378
Dissimilarity 10.1680
Term tested The model without the term Removal of term from model
hierarchically ln(MLE) Chi-Sq df p-value Chi-Sq df p-value
--------------- --------- -------- ---- -------- -------- ---- --------
AGE. . . . . . -190.460 132.87 53 0.0000 87.98 6 0.0000
SURVIVE$ . . . -206.152 164.25 50 0.0000 119.37 3 0.0000
TUMOR$ . . . . -326.389 404.72 56 0.0000 359.84 9 0.0000
CENTER$. . . . -177.829 107.60 61 0.0002 62.72 14 0.0000
CENTER$
* AGE. . . . . -158.900 69.75 51 0.0416 24.86 4 0.0001
CENTER$
* SURVIVE$ . . -149.166 50.28 49 0.4226 5.40 2 0.0674
CENTER$
* TUMOR$ . . . -162.289 76.52 53 0.0189 31.64 6 0.0000
The model has a nonsignificant test of fit, and so does a model without the CENTER$ by SURVIVE$ interaction (p value = 0.4226).
Eliminating Only the Young Women
Two of the extreme cells are from the youngest age group. What happens to the
CENTER$ by SURVIVE$ effect if only these cells are defined as structural zeros?
HTERM remains in effect.
MODEL center$*age*survive$*tumor$ =,
(age + survive$ + tumor$) # center$
ZERO CELL=1 1 2 2 CELL=1 1 2 3
ESTIMATE / DELTA=0.5
The output follows:
Number of cells (product of levels): 72
Number of structural zero cells: 2
Total count: 671
Pearson ChiSquare 50.2610 df 49 Probability 0.42326
LR ChiSquare 49.1153 df 49 Probability 0.46850
Rafterys BIC -269.8144
Dissimilarity 10.6372
Term tested The model without the term Removal of term from model
hierarchically ln(MLE) Chi-Sq df p-value Chi-Sq df p-value
--------------- --------- -------- ---- -------- -------- ---- --------
AGE. . . . . . -221.256 188.37 55 0.0000 139.25 6 0.0000
SURVIVE$ . . . -210.369 166.60 52 0.0000 117.48 3 0.0000
TUMOR$ . . . . -331.132 408.12 58 0.0000 359.01 9 0.0000
CENTER$. . . . -192.179 130.22 63 0.0000 81.10 14 0.0000
CENTER$
* AGE. . . . . -172.356 90.57 53 0.0010 41.45 4 0.0000
CENTER$
* SURVIVE$ . . -153.888 53.63 51 0.3737 4.52 2 0.1045
CENTER$
* TUMOR$ . . . -169.047 83.95 55 0.0072 34.84 6 0.0000
When the two cells for the young women from Tokyo are excluded from the model estimation, the CENTER$ by SURVIVE$ effect is not needed (p value = 0.3737).
Eliminating the Older Women
Here you define the two cells for the Tokyo women from the oldest age group as
structural zeros.
MODEL center$*age*survive$*tumor$ =,
(age + survive$ + tumor$) # center$
ZERO CELL=1 3 2 1 CELL=1 3 2 2
ESTIMATE / DELTA=0.5
The output is:
Number of cells (product of levels): 72
Number of structural zero cells: 2
Total count: 757
Pearson ChiSquare 53.4348 df 49 Probability 0.30782
LR ChiSquare 50.9824 df 49 Probability 0.39558
Rafterys BIC -273.8564
Dissimilarity 9.4583
Term tested The model without the term Removal of term from model
hierarchically ln(MLE) Chi-Sq df p-value Chi-Sq df p-value
--------------- --------- -------- ---- -------- -------- ---- --------
AGE. . . . . . -203.305 147.41 55 0.0000 96.42 6 0.0000
SURVIVE$ . . . -238.968 218.73 52 0.0000 167.75 3 0.0000
TUMOR$ . . . . -358.521 457.84 58 0.0000 406.86 9 0.0000
CENTER$. . . . -209.549 159.89 63 0.0000 108.91 14 0.0000
CENTER$
* AGE. . . . . -177.799 96.39 53 0.0003 45.41 4 0.0000
CENTER$
* SURVIVE$ . . -161.382 63.56 51 0.1114 12.58 2 0.0019
CENTER$
* TUMOR$ . . . -171.123 83.04 55 0.0086 32.06 6 0.0000
When the two cells for the women from the older age group are treated as structural zeros, the case for removing the CENTER$ by SURVIVE$ effect is much weaker than when the cells for the younger women are structural zeros. Here, the inclusion of the effect results in a significant improvement in the fit of the model (p value = 0.0019).
Conclusion
The structural zero feature allowed you to quickly focus on 2 of the 72 cells in your multiway table: the survivors under 50 from Tokyo, especially those with benign tumors and minimal inflammation. The overall survival rate for the 764 women is 72.5%, that for Tokyo is 79.3%, and that for the most unusual cell is 90.67%. Half of the Tokyo women under age 50 have MinBengn tumors (75 out of 151) and almost 10% of the 764 women (spread across 72 cells) are concentrated here. Possibly the protocol for study entry (including the definition of a tumor) was executed differently at this center than at the others.
Example 4
Tables without Analyses
If you want only a frequency table and no analysis, use TABULATE. Simply specify the
table factors in the same order in which you want to view them from left to right. In
other words, the last variable defines the columns of the table and cross-classifications
of the preceding variables the rows.
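Outside SYSTAT, a similarly compact multiway table can be built with pandas (a sketch; the file name and column names below are illustrative stand-ins for the CANCER data, which ships as a SYSTAT file):

import pandas as pd

# Hypothetical flat file with one row per cell: AGE, CENTER, SURVIVE, TUMOR, NUMBER
cancer = pd.read_csv("cancer.csv")

table = pd.crosstab(index=[cancer["AGE"], cancer["CENTER"], cancer["SURVIVE"]],
                    columns=cancer["TUMOR"],
                    values=cancer["NUMBER"],
                    aggfunc="sum")
print(table)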
For this example, we use data in the CANCER file. Here we use LOGLIN to display
counts for a 3 by 3 by 2 by 4 table (72 cells) in two dozen lines. The input is:
The resulting table is:
USE cancer
LOGLIN
FREQ = number
LABEL age / 50=Under 50, 60=59 to 69, 70=70 & Over
ORDER center$ / SORT=NONE
ORDER tumor$ / SORT =MinBengn, MaxBengn, MinMalig,
MaxMalig
TABULATE age * center$ * survive$ * tumor$
Number of cells (product of levels): 72
Total count: 764

Observed Frequencies
====================
AGE CENTER$ SURVIVE$ | TUMOR$
| MinBengn MaxBengn MinMalig MaxMalig
---------+---------+---------+-------------------------------------------------
Under 50 Tokyo Alive | 68.000 9.000 26.000 25.000
Dead | 7.000 3.000 9.000 4.000
+
Boston Alive | 24.000 0.0 11.000 4.000
Dead | 7.000 0.0 6.000 6.000
+
Glamorgn Alive | 20.000 1.000 16.000 8.000
Dead | 7.000 0.0 16.000 3.000
---------+---------+---------+-------------------------------------------------
59 to 69 Tokyo Alive | 46.000 5.000 20.000 18.000
Dead | 9.000 2.000 9.000 11.000
+
Boston Alive | 58.000 3.000 18.000 10.000
Dead | 20.000 2.000 8.000 3.000
+
Glamorgn Alive | 39.000 4.000 27.000 10.000
Dead | 12.000 0.0 14.000 3.000
---------+---------+---------+-------------------------------------------------
70 & Over Tokyo Alive | 6.000 1.000 1.000 5.000
Dead | 3.000 0.0 2.000 1.000
+
Boston Alive | 26.000 1.000 15.000 1.000
Dead | 18.000 0.0 9.000 3.000
+
Glamorgn Alive | 11.000 1.000 12.000 4.000
Dead | 7.000 0.0 3.000 3.000
-----------------------------+-------------------------------------------------
Computation
Algorithms
Loglinear modeling implements the algorithms of Haberman (1973).
References
Agresti, A. (1984). Analysis of ordinal categorical data. New York: Wiley-Interscience.
Agresti, A. (1990). Categorical data analysis. New York: Wiley-Interscience.
Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, Mass.: MIT Press.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data, 2nd ed.
Cambridge, Mass.: MIT Press.
Goodman, L. A. (1978). Analyzing qualitative/categorical data: Loglinear models and
latent structure analysis. Cambridge, Mass.: Abt Books.
Haberman, S. J. (1973). Loglinear fit for contingency tables, algorithm AS 51. Applied Statistics, 21, 218-224.
Haberman, S. J. (1978). Analysis of qualitative data, Vol. 1: Introductory topics. New
York: Academic Press.
Knoke, D. and Burke, P. S. (1980). Loglinear models. Newbury Park: Sage.
Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.
Chapter 19
Multidimensional Scaling
Leland Wilkinson
Multidimensional scaling offers nonmetric multidimensional scaling of a similarity or
dissimilarity matrix in one to five dimensions. Multidimensional scaling is a powerful
data reduction procedure that can be used on a direct similarity or dissimilarity matrix
or on one derived from rectangular data with Correlations. SYSTAT provides three
MDS loss functions (Kruskal, Guttman, and Young) that produce results comparable
to those from three of the major MDS packages (KYST, SSA, and ALSCAL). All
three methods perform a similar function: to compute coordinates for a set of points
in a space such that the distances between pairs of these points fit as closely as
possible to measured dissimilarities between a corresponding set of objects.
The family of procedures called principal components or factor analysis is related
to multidimensional scaling in function, but multidimensional scaling differs from
this family in important respects. Usually, but not necessarily, multidimensional
scaling can fit an appropriate model in fewer dimensions than can these other
procedures. Furthermore, if it is implausible to assume a linear relationship between
distances and dissimilarities, multidimensional scaling nevertheless provides a simple
dimensional model.
MDS also computes the INDSCAL (individual differences multidimensional
scaling) model (Carroll and Chang, 1970). The INDSCAL model fits
dissimilarity/similarity matrices for multiple subjects into one common space, with
jointly estimated weight parameters for each subject (that is, a dissimilarity matrix is
input for each subject and separate (monotonic) regression functions are computed).
MDS can fit the INDSCAL model using any of the three loss functions, although we
recommend using Kruskal's STRESS for this purpose.
Finally, MDS can fit the nonmetric unfolding model. This allows one to analyze
rank-order preference data.
Statistical Background
Multidimensional scaling (MDS) is a procedure for fitting a set of points in a space
such that the distances between points correspond as closely as possible to a given set
of dissimilarities between a set of objects. Dissimilarities may be measured directly, as
in psychological judgments, or derived indirectly, as in correlation matrices computed
on rectangular data.
Assumptions
Because MDS, like cluster analysis, operates directly on dissimilarities, no statistical
distribution assumptions are necessary. There are, however, other important
assumptions. First, multidimensional scaling is a spatial model. To fit points in the
kinds of spaces that MDS covers, you assume that your data satisfy metric conditions:
n The distance from an object to itself is 0.
n The distance from object A to object B is the same as that from B to A.
n The distance from object A to C is less than or equal to the distance from A to B
plus B to C. This is sometimes called the triangle inequality.
You may think these conditions are obvious, but there are numerous counter-examples
in psychological perception and elsewhere. For example, commuters often view the
distance from home to the city as closer than the distance from the city to home because
of traffic patterns, terrain, and psychological expectations related to time of day.
Framing or context effects can also disrupt the metric axioms, as Amos Tversky has
shown. For example, Miami is similar to Havana. Havana is similar to Moscow. Is
Miami similar to Moscow? If your data (objects) are not consistent with these three
axioms, do not use MDS.
Second, there are ways of deriving distances from rectangular data that do not
satisfy the metric axioms. The ones available in Correlations do, but if you are thinking
of using some other derived measure of similarity, check it carefully.
Finally, it is assumed that all of your objects will fit in the same metric space. It is
best if they diffuse somewhat evenly through this space as well. Don't expect to get
interpretable results for 25 nearly indistinguishable objects and one that is radically
different.
Collecting Dissimilarity Data
You can collect dissimilarities directly or compute them indirectly.
Direct Methods
Examples of direct dissimilarities are:
Distances. Take distances between objects (for example, cities) directly off a map. If
the scale is local, MDS will reproduce the map nicely. If the scale is global, you will
need three dimensions for an MDS fit. Two- or three-dimensional spatial distances can
be measured directly. Direct measures of social distance might include spatial
propinquity or the number of times or amount of time one individual interacts with
another.
Judgments. Ask subjects to give a numerical rating of the dissimilarity (for example, 0
to 10) between all pairs of objects.
Clusters. Ask people to sort objects into piles; or examine naturally occurring
aggregates, such as paragraphs, communities, and associations. Record 0 if two objects
occur in the same group and 1 if they do not. Sum these counts over replications or
judges.
Triads. Ask subjects to compare three objects at a time and report which two are most
similar (or which is the odd one out). Do this over all possible triads of objects. To
compute dissimilarities, sum over all triads, as for the clustering method. There are
usually many more triads than pairs of objects, so this method is more tedious;
however, it allows you to assess independently possible violations of the triangle
inequality.
Indirect Methods
Indirect dissimilarities are computed over a rectangular matrix whose columns are
objects and rows are attributes. You can transpose this matrix if you want to scale rows
instead. Possible indirect dissimilarities include:
Computed Euclidean distances. These are the square root of the sum-of-squared
discrepancies between columns of the rectangular matrix.
Negatives of correlations. For standardized data (mean of 0 and standard deviation of
1), Pearson correlations are proportional to Euclidean distances. For unstandardized
data, Pearson correlations are comparable to computing Euclidean distances after
standardizing. MDS automatically negates correlations if you do not. Other types of
correlations (for example, Spearman and gamma) are analogous to standardized
distances, but only approximately. Also, be aware that large negative correlations will
be treated as large distances and large positive correlations, as small distances. Make
sure that all variables are scored in the same direction before computing correlations.
If you find that a whole row of a correlation matrix is negative, reverse the variable by
multiplying by -1, and recompute the correlations.
Counts of discrepancies. Counting discrepancies between columns or using some of the
binary association measures in Correlations is closely related to computing the
Euclidean distance. These methods are also related to the clustering distance
calculations mentioned above for direct distances.
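To see how these indirect measures behave, here is a minimal sketch (in Python, outside SYSTAT) that derives both kinds of dissimilarity from a small rectangular matrix; the data are randomly generated purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 4))              # 20 attributes (rows) by 4 objects (columns)

# Computed Euclidean distances between every pair of columns
diff = data[:, :, None] - data[:, None, :]
euclid = np.sqrt((diff ** 2).sum(axis=0))

# Negatives of correlations: for standardized columns, 1 - r is proportional to the
# squared Euclidean distance, so -r orders the pairs of objects the same way
neg_corr = -np.corrcoef(data, rowvar=False)

print(np.round(euclid, 2))
print(np.round(neg_corr, 2))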
Scaling Dissimilarities
Once you have dissimilarities (or similarities, correlations, etc., which MDS
automatically transforms to dissimilarities), you may scale them. You do not need to
know how the computer does the calculations in order to use the program intelligently
as long as you pay attention to the following:
Stress and Iterations
Stress is the goodness-of-fit statistic that MDS tries to minimize. It consists of the
square root of the normalized squared discrepancies between interpoint distances in the
MDS plot and the smoothed distances predicted from the dissimilarities. Stress varies
between 0 and 1, with values near 0 indicating better fit. It is printed for each iteration,
which is one movement of all of the points in the plot toward a better solution. Make
sure that iterations proceed smoothly to a minimum. This is true for the examples in
this chapter. If you find that the stress values increase or decrease in uneven steps, you
should be suspicious.
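For readers who want the definition in computable form, here is a minimal sketch (in Python, not SYSTAT code) of Kruskal's STRESS formula 1 applied to a handful of made-up configuration distances and smoothed (fitted) distances.

import numpy as np

def stress1(distances, dhat):
    # square root of the normalized squared discrepancies between the
    # configuration distances and the smoothed (fitted) distances
    d = np.asarray(distances, dtype=float)
    f = np.asarray(dhat, dtype=float)
    return np.sqrt(np.sum((d - f) ** 2) / np.sum(d ** 2))

# made-up distances and fitted values; a value near 0 indicates a good fit
print(stress1([1.0, 2.0, 3.0, 4.0], [1.2, 1.8, 3.1, 3.9]))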
The Shepard Diagram
The Shepard diagram is a scatterplot of the distances between points in the MDS plot
against the observed dissimilarities (or similarities). The points in the plot should
adhere cleanly to a curve or straight line (which would be the smoothed distances). In
other words, you should look at a good Shepard plot and think it resembles the outcome
of a well-designed experiment. Check the examples in this chapter.
If the Shepard diagram resembles a stepwise or L-shaped function, beware. You
may have achieved a degenerate solution. Publish it and you will be excoriated by the
clergy.
The MDS Plot
The plot of points is what you seek. The points should be scattered fairly evenly through
the space. The orientation of axes is arbitraryremember we are scaling distances, not
axes. Feel free to reverse axes or rotate the solution. MDS rotates it to the largest
dimensions of variation, but these dont necessarily mean anything for your data.
You may interpret the axes as in principal components or factor analysis. More
often, however, you should look for clusters of objects or regular patterns among the
objects, such as circles, curved manifolds, and other structures. See the Guttman loss
function example for a good view of a circle.
Multidimensional Scaling in SYSTAT
Multidimensional Scaling Main Dialog Box
To open the Multidimensional Scaling dialog box, from the menus choose:
Statistics
Data Reduction
Multidimensional Scaling (MDS)...
620
Chapter 19
The following options are available:
Variable(s). Select the variables that contain the matrix of data to be analyzed.
Dimension. Number of dimensions in which to scale. The number of dimensions must
be a positive integer less than or equal to the number of variables that you scale.
R-metric. Constant for the Minkowski power metric for computing distances. For
ordinary Euclidean distance, enter 2. For city-block distance, enter 1. For values other
than 1 or 2, computation is slower because logarithms and exponentials are used.
The general formula for calculating distances is:

d_{jk} = \left( \sum_{i=1}^{p} \left| x_{ij} - x_{ik} \right|^{r} \right)^{1/r}

where r is the specified power and p is the number of dimensions.
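A small sketch (in Python, outside SYSTAT) of this formula shows how the choices r = 2 and r = 1 reduce to Euclidean and city-block distance; the two points used are arbitrary.

import numpy as np

def minkowski(xj, xk, r=2.0):
    # distance between points x_j and x_k with Minkowski power r
    return float((np.abs(np.asarray(xj) - np.asarray(xk)) ** r).sum() ** (1.0 / r))

a, b = [0.0, 3.0], [4.0, 0.0]
print(minkowski(a, b, r=2))   # 5.0, ordinary Euclidean distance
print(minkowski(a, b, r=1))   # 7.0, city-block distance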
Iterations. Limit for the number of iterations.
Converge. Iterations terminate when the maximum absolute difference between any
coordinate in the solution at iteration i versus iteration i - 1 is less than the specified
convergence criterion. Because the configuration is standardized to unit variance on
every iteration, iteration stops when no coordinate moves more than the specified
convergence criterion (0.005 by default) from its value on the previous iteration.
Most MDS programs terminate when stress reaches a predetermined value or
changes by less than a small amount. These programs can terminate prematurely,
however, because comparable stress values can result from different configurations.
The SYSTAT convergence criterion allows you to stop iterating when the
configuration ceases to change.
Weight. Adds weights for each dimension and each matrix (subject) into the calculation
of separate distances that are used in the minimization. For an individual differences
model, select Weight.
Split Loss. For an individual differences or unfolding model, split the calculation of the
loss function by rows of the matrix or by matrices. Splitting by rows is possible only
for a rectangular matrix.
Loss Function. MDS scales similarity and dissimilarity matrices using three loss
functions:
n Kruskal uses Kruskal's STRESS formula 1 scaling method.
n Guttman uses Guttman's coefficient of alienation scaling method.
n Young uses Young's S-STRESS scaling method, which allows you to scale using
the loss function featured in ALSCAL.
Iterations with Kruskal's method are faster but usually take longer to converge to a
minimum value than those with the Guttman method. The procedure used in the
latter has been found in simulations to be less susceptible to local minima than that
used in the Kruskal method (Lingoes and Roskam, 1973). We do not recommend
Young's S-STRESS loss function. Because it weights squares of distances, large
distances have more influence than smaller ones. Weinberg and Menil (1993)
summarized why this is a problem: error variances of dissimilarities tend to be
positively correlated with their means. If this is the case, large distances should be,
if anything, down-weighted relative to small distances.
When using the Kruskal loss function, choose the form of the function relating
distances to similarities (or dissimilarities):
n Mono specifies nonmetric scaling.
n Linear specifies metric scaling.
n Log specifies a log function, allowing a smooth curvilinear relation between
dissimilarities and distances.
n Power specifies a power function.
Shape. Specify the type of matrix input. For a similarities model, select Square. For an
unfolding model, select Rectangular and enter the number of rows in your matrix.
Save file. You can save three sets of output to a data file:
n Configuration saves the final configuration.
n Distances saves the matrix of distances between points in the final scaled
configuration.
n Residuals saves the data, distances, estimated distances, residuals, and the row and
column number of the original distance in the rectangular SYSTAT file.
With the residuals, MDS displays the root-mean-squared residuals for each point in its
output. Because STRESS is a function of the sum-of-squared residuals, the root-mean-
squared residuals are a measure of the influence of each point on the STRESS statistic.
This can help you identify ill-fitting points.
Multidimensional Scaling Configuration
SYSTAT offers several alternative initial configurations.
Compute configuration from data. By default, the configuration is computed from the
data. The method used depends on the loss function.
Use previous configuration. Uses the configuration from the previous scaling.
Define custom configuration. You can specify a custom starting configuration for the
scaling. There must be as many rows as items and columns as dimensions. When you
type a matrix, SYSTAT reads as many numbers in each row as you specify. It reads as
many rows as there are points to scale.
You can specify a configuration for confirmatory analysis. Enter a hypothesized
configuration and let the program iterate only once. Then look at the stress.
Using Commands
First, specify your data with USE filename. Continue with:

MDS
  MODEL varlist / ROWS=n SHAPE=SQUARE or RECT
  CONFIG = LAST
    or
  CONFIG [matrix]
  ESTIMATE / DIM=n R=n ITER=n WEIGHT CONVERGE=n ,
    LOSS=GUTTMAN or KRUSKAL or YOUNG ,
    REGRESS=MONO or LINEAR or LOG or POWER ,
    SPLIT=ROW or MATRIX
  SAVE filename / CONFIG or DIST or RESID

Usage Considerations
Types of data. MDS uses a data file that contains an SSCP, covariance, correlation, or
dissimilarity matrix. When you open the data file, MDS automatically recognizes its
type.
Print options. The output is standard for all PRINT lengths.
Quick Graphs. MDS produces a Shepard diagram for each matrix analyzed and a plot of
the final configuration. For solutions containing four or more dimensions, the final
configuration appears as a scatterplot matrix of all dimension pairs.
Saving files. You can save the final configuration, matrix of distances between points
in the final scaled configuration, distances, estimated distances, residuals, and the row
and column number of the original distance in SYSTAT data files.
BY groups. MDS produces separate analyses for each level of a BY variable.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ variables are not available in MDS.
Case weights. WEIGHT is not available in MDS.
Examples
Example 1
Kruskal Method
The data in the ROTHKOPF file are adapted from an experiment by Rothkopf (1957).
They were originally obtained from 598 subjects who judged whether or not pairs of
Morse code signals presented in succession were the same. Morse code signals for
letters and digits were used in the experiment, and all pairs were tested in each of two
possible sequences. For multidimensional scaling, the data for letter signals have been
averaged across sequence, and the diagonal (pairs of the same signal) has been omitted.
The data in this form were first scaled by Shepard.
The input is:

MDS
USE ROTHKPF1
MODEL a .. z
IDVAR = code$
ESTIMATE / LOSS=KRUSKAL

Use the shortcut notation (..) in MODEL for listing consecutive variables in the file
(otherwise, simply list each variable name separated by a space).
The program begins by generating an initial configuration of points whose
interpoint distances are a linear function of the input data. For this estimation, MDS
uses a metric multidimensional scaling. To do this, missing values in the input matrix
are replaced by mean values for the whole matrix. Then the values are converted to
distances by adding a constant.
The output is:
Monotonic Multidimensional Scaling
The data are analyzed as similarities
Minimizing Kruskal STRESS (form 1) in 2 dimensions
Iteration STRESS
--------- ------
0 0.263538
1 0.237909
2 0.218821
3 0.202184
4 0.190513
5 0.184340
6 0.181176
7 0.179394
8 0.178271
Stress of final configuration is: 0.17827
Proportion of variance (RSQ) is: 0.84502
The solution required eight iterations. Notice that STRESS reduces at each iteration.
Final STRESS values near zero may indicate the presence of a degenerate solution.
The Shepard diagram is a scatterplot of distances between points in the MDS plot
against the observed dissimilarities or similarities. In monotonic scaling, the regression
function has steps at various points. For most solutions, the function in this plot should
Coordinates in 2 dimensions
Variable Dimension
-------- ---------
1 2
.- -1.21 -.31
-... .59 -.45
-.-. .67 .05
-.. .06 -.44
. -1.54 .89
..-. .48 -.57
--. .22 .65
.... .03 -1.05
.. -1.45 -.38
.--- .78 .77
-.- .22 .02
.-.. .60 -.27
-- -.62 .76
-. -1.15 -.04
--- .47 1.02
.--. .63 .31
--.- .90 .56
.-. -.28 -.34
... -.66 -1.04
- -1.47 .95
..- -.31 -.75
...- .37 -.87
.-- .04 .13
-..- .83 -.15
-.-- .87 .38
--.. .94 .18
be relatively smooth (without large steps). If the function looks like one or two large
steps, you should consider setting REGRESSION to LOG or LINEAR under ESTIMATE.
Notice that large values of the data tend to have small distances in the configuration.
The diagram displays an overall decreasing trend because we are using similarities
(large data values indicate similar objects). For dissimilarities, the Shepard diagram
displays an increasing trend.
In the configuration plot, the points should be scattered fairly evenly through the
space. If you are scaling in more than two dimensions, you should examine plots of
pairs of axes or rotate the solution in three dimensions. The solution has been rotated
to principal axes (that is, the major variation is on the first dimension). This rotation is
not performed unless the scaling is in Euclidean space, as in the present example.
The two-dimensional solution clearly distinguishes short signals from long and dots
from dashes. Dashes tend to appear in the upper right and dots in the lower left. Long
codes tend to appear in the lower right and short in the upper left.
Regression Function
If you use the Kruskal or Young loss function, you can fit a MONOTONIC, LINEAR, or
LOG function of distances onto input dissimilarities. The standard option is
MONOTONIC multidimensional scaling. To avoid degenerate solutions, however, log or
linear scaling is sometimes handy. Log scaling is recommended for this purpose
because it allows a smooth curvilinear relation between dissimilarities and distances.
Example 2
Guttman Loss Function
To illustrate the Guttman loss function, this example uses judged similarities among 14
spectral colors (from Ekman, 1954). Nanometer wavelengths (W434, ..., W674) are
used to name the variables for each color. Blue-violets are in the 400s; reds are in the
600s. The judgments are averaged across 31 subjects; the larger the number for a pair
of colors, the more similar the two colors are. The file (EKMAN) has no diagonal
elements, and its type is SIMILARITY.
The Guttman method is used to scale these judgments in two dimensions to
determine whether the data fit a perceptual color wheel. The Kruskal loss function will
give you a similar result. The input is:

MDS
USE ekman
MODEL w434 .. w674
ESTIMATE / LOSS=GUTTMAN

The output is:
Monotonic Multidimensional Scaling
The data are analyzed as similarities
Minimizing Guttman/Lingoes Coefficient of Alienation in 2 dimensions
Iteration Alienation
--------- ----------
0 0.070826
1 0.042069
2 0.037770
3 0.036155
4 0.035069
Alienation of final configuration is: 0.03507
Proportion of variance (RSQ) is: 0.99623
Coordinates in 2 dimensions
Variable Dimension
-------- ---------
1 2
W434 .31 -.91
W445 .40 -.84
W465 .89 -.57
W472 .95 -.48
W490 .98 .11
W504 .81 .64
W537 .55 .89
W555 .33 .97
W584 -.54 .73
W600 -.83 .38
W610 -1.01 .06
W628 -1.01 -.18
W651 -.94 -.33
W674 -.90 -.47
The fit of configuration distances to original data is extremely close, as evidenced by
the low coefficient of alienation and clean Shepard diagram.
The resulting configuration is almost circular, a pattern Guttman (1954) called a
circumplex. There is a large gap at the bottom of the figure, however, because the
perceptual color between deep red and dark purple is not a spectral color.
Example 3
Individual Differences Multidimensional Scaling
The data in the COLAS file are taken from Schiffman, Reynolds, and Young (1981).
The data in this file have an unusual structure. The file consists of 10 dissimilarity
matrices stacked on top of each other. They are judgments by 10 subjects of the
dissimilarity (0–100) between pairs of colas. The example will fit the INDSCAL
(individual differences scaling) model to these data, seeking a common group space for
the 10 different colas and a parallel weight space for the 10 different judges.
The input follows:
MDS
USE colas
MODEL dietpeps .. dietrite
ESTIMATE / LOSS=KRUSKAL WEIGHT SPLIT=MATRIX DIM=3
The WEIGHT option tells SYSTAT to weight each matrix separately. Without this
option, all matrices would be weighted equally, and you would have a single pooled
solution. You want to use weighting so that you can see which subjects favor one
dimension over the others in their judgments. The MATRIX option of SPLIT tells
SYSTAT to compute separate (monotonic) regression functions for each subject
(matrix). Finally, scale the result in three dimensions, as did Schiffman et al. (1981).
The output is:
Monotonic Multidimensional Scaling
The data are analyzed as dissimilarities
There are 10 replicated data matrices
Dimensions are weighted separately for each matrix
Fitting is split between data matrices
Minimizing Kruskal STRESS (form 1) in 3 dimensions
Iteration STRESS
--------- ------
0 0.220899
1 0.184422
0 0.221307
1 0.184508
Stress of final configuration is: 0.18451
Proportion of variance (RSQ) is: 0.53501
Coordinates in 3 dimensions
Variable Dimension
-------- ---------
1 2 3
DIETPEPS -.61 .20 .78
RC .52 .05 .76
YUKON .42 -.09 -.87
PEPPER .27 -1.27 .06
SHASTA .80 .02 -.14
COKE .39 .84 -.35
DIETPEPR -.75 -.84 -.17
TAB -.79 .44 -.61
PEPSI .57 .22 .38
DIETRITE -.82 .43 .17
Matrix Weights
Matrix Stress RSQ Dimension
------ ------ --- ---------
1 2 3
1 .188 .548 .70 .43 .53
2 .200 .416 .45 .47 .72
3 .196 .468 .35 .52 .74
4 .171 .564 .59 .49 .61
5 .178 .594 .70 .37 .56
6 .172 .621 .70 .37 .57
7 .181 .552 .42 .58 .66
8 .180 .560 .48 .60 .61
9 .163 .625 .56 .50 .63
10 .212 .402 .44 .61 .62
[Shepard Diagram: 10 panels, one per subject matrix (1–10), each plotting Distances against Data]
The solution required four iterations. Notice that the second two iterations appear to be
a restart. That is exactly what they are. Because the fourth matrix has a missing value,
SYSTAT uses the EM algorithm to reestimate this value, compute a new metric
solution, and iterate two more times until convergence. This extra set of iterations did
not do much for you in this example because the stress is insignificantly higher than it
would have been had you stopped at only two iterations. With many missing values,
however, the EM algorithm will improve MDS solutions substantially.
For the INDSCAL model, you have a set of coordinates for the colas and one for the
subjects. In the three-dimensional graph of the coordinates, the colas are represented
by symbols and the subjects by vectors. The first dimension separates the diet colas
from the others. The second dimension differentiates between Dr. Pepper/diet Dr.
Pepper and the remaining colas.
For each subject, you have a contribution to overall stress and a separate squared
correlation (RSQ) between the predicted and obtained distances in the configuration.
Notice that subject 10 is fit worst (STRESS = 0.212) and subject 9 best (STRESS =
0.163). Furthermore, subjects 1, 5, and 6 have a high loading on the first dimension,
indicating that they place a higher emphasis on diet/nondiet differences than on cherry
cola/cola differences. Subjects 7, 8, and 10, on the other hand, emphasize the second
dimension more.
[Configuration plot: the 10 colas shown as labeled points and the 10 subjects (1–10) as vectors in the three-dimensional joint space]
Example 4
Nonmetric Unfolding
The COLRPREF data set contains color preferences among 15 SYSTAT employees for
five primary colors. This example uses the MDS unfolding model to scale the people
and the colors in two dimensions, such that each person's coordinate is near his or her
favorite color's coordinate and far from his or her least favorite color's coordinate. For
this example, use ROWS to specify the number of rows for a rectangular matrix and
SHAPE to specify the type of matrix input to use. When you enter these data for the
first time, you must remember to specify their type as DISSIMILARITY so that small
numbers are understood as meaning most similar (preferred).
To scale these with the unfolding model, specify:

MDS
USE colrpref
MODEL red .. blue / SHAPE=RECT
IDVAR=name$
ESTIMATE / SPLIT=ROWS

Notice that you are using the Kruskal loss function as the default. The output is shown
below:
Monotonic Multidimensional Scaling
The data are analyzed as dissimilarities
The data are rectangular (lower corner matrix)
Fitting is split between rows of data matrix
Minimizing Kruskal STRESS (form 1) in 2 dimensions
Iteration STRESS
--------- ------
0 0.148374
1 0.135423
2 0.125152
3 0.117255
4 0.111131
5 0.106394
6 0.102623
7 0.099539
8 0.096883
9 0.094497
0 0.107456
1 0.100496
2 0.096038
3 0.092748
4 0.090087
Stress of final configuration is: 0.09009
Proportion of variance (RSQ) is: 0.94001
Coordinates in 2 dimensions
Variable Dimension
-------- ---------
1 2
RED .25 -.49
ORANGE .53 -1.70
YELLOW -1.31 -.56
GREEN 1.39 .26
BLUE -.55 .79
Patrick .56 .78
Laszlo -.73 -.13
Mary -1.01 .11
Jenna .19 -.25
Julie -.70 -.22
Steve 1.18 -.76
Phil .61 .61
Mike -.80 -.02
Keith .27 .76
Kathy .05 .76
Leah -.72 .00
Stephanie .50 .58
Lisa .78 .21
Mark -.57 .50
John .06 -1.24
Row Fit Measures
Row Stress RSQ
--- ------ ---
Patrick .000 1.000
Laszlo .068 .970
Mary .004 1.000
Jenna .048 .983
Julie .272 .508
Steve .033 .993
Phil .061 .972
Mike .083 .958
Keith .172 .774
Kathy .000 1.000
Leah .067 .971
Stephanie .029 .994
Lisa .055 .981
Mark .000 1.000
John .025 .996
Nonmetric Unfolding and the EM Algorithm
The nonmetric unfolding model has often presented problems to MDS programs
because so much data are missing. If you think of the unfolding matrix as the lower
corner matrix in a larger triangular matrix of subjects + objects, you can visualize how
much data (namely, all of the subject-object comparisons) are missing. Since SYSTAT
uses the EM algorithm for missing values, unfolding models do not degenerate as
frequently. SYSTAT does a complete MDS using all available data and then estimates
missing dissimilarities/similarities using the distances in the solution. These estimated
values are then used to get a starting configuration for another complete iteration cycle.
This process continues until there are no changes between EM cycles.
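The impute-and-refit cycle can be sketched as a loop; the Python code below is only an illustration of the idea (not SYSTAT's implementation), and run_mds is a hypothetical placeholder for any complete MDS fit that returns a configuration and its matrix of interpoint distances.

import numpy as np

def em_mds(diss, run_mds, tol=1e-4, max_cycles=20):
    # diss: square dissimilarity matrix with np.nan marking missing entries;
    # run_mds: hypothetical stand-in for a complete MDS fit that returns
    # (configuration, matrix of interpoint distances)
    missing = np.isnan(diss)
    work = np.where(missing, np.nanmean(diss), diss)   # crude first fill
    config = None
    for _ in range(max_cycles):
        config, fitted = run_mds(work)                 # complete scaling of the filled matrix
        new = np.where(missing, fitted, work)          # replace missing values by model distances
        if np.max(np.abs(new - work)) < tol:           # stop when EM cycles no longer change
            break
        work = new
    return config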
The following example, from Borg and Lingoes (1987) adapted from Green and
Carmone (1970), shows how this works. This unfolding data set contains
dissimilarities only between the points delineating A and M, and these dissimilarities
are treated only as rank orders. Borg and Lingoes discuss the problems in fitting an
unfolding model to these data.
The input follows:

MDS
USE am
IDVAR = row$
MODEL / SHAPE=RECT
ESTIMATE / LOSS=GUTTMAN SPLIT=ROWS

Notice that the example uses the Guttman loss function, but the others provide similar
results. The output is shown below:
Monotonic Multidimensional Scaling
The data are analyzed as dissimilarities
The data are rectangular (lower corner matrix)
Fitting is split between rows of data matrix
Minimizing Guttman/Lingoes Coefficient of Alienation in 2 dimensions
Iteration Alienation
--------- ----------
0 0.076135
1 0.037826
2 0.023540
3 0.017736
4 0.013277
5 0.009962
Alienation of final configuration is: 0.00996
Proportion of variance (RSQ) is: 0.99925
Coordinates in 2 dimensions
Variable Dimension
-------- ---------
1 2
A1 -.94 -1.02
A2 -.89 -.98
A3 -1.09 -.41
A4 -1.07 -.40
A5 -1.19 .15
A6 -1.23 .34
A7 -1.54 .67
A8 -1.00 .55
A9 -.69 .47
A10 -.31 .36
A11 .01 .10
A12 .10 .10
A13 .13 .09
A14 -.85 .09
A15 -.74 .14
A16 -.57 .13
M1 .74 -1.08
M2 .43 -.52
M3 .20 -.56
M4 .01 -.43
M5 -.15 -.33
M6 -.21 -.18
M7 -.17 .12
M8 -.06 .22
M9 .18 .27
M10 .56 .24
M11 .59 .22
M12 .59 .22
M13 .83 .87
M14 .89 .66
M15 1.04 .21
M16 1.24 .16
M17 1.50 .23
M18 1.70 -.21
M19 1.94 -.49
Row Fit Measures
Row Stress RSQ
--- ------ ---
M1 .000 1.000
M2 .000 1.000
M3 .000 1.000
M4 .000 1.000
M5 .027 .993
M6 .022 .996
M7 .024 .997
M8 .016 .999
M9 .000 1.000
M10 .000 1.000
M11 .000 1.000
M12 .000 1.000
M13 .002 1.000
M14 .000 1.000
M15 .000 1.000
M16 .000 1.000
M17 .000 1.000
M18 .000 1.000
M19 .000 1.000
Example 5
Power Scaling Ratio Data
Because similarities or dissimilarities are often collected as rank-order data, the
nonmetric MDS model has to work backward in order to solve for a configuration
fitting the data. As J. D. Carroll has pointed out, the MDS model should really express
observed data as a function of distances between points in a configuration rather than
the other way around. If your data are direct or derived distances, however, you should
try setting REGRESSION = POWER with LOSS = FUNCTION. This way, you can fit a
Stevens power function to the data using distances between points in the configuration.
The results may not always differ much from nonmetric or linear or log MDS, but
SYSTAT will also tell you the exponent of the power function in the Shepard diagram.
Notice with this model that the data and distances are transposed in the Shepard
diagram because loss is being computed from errors in the data rather than the
distances. SYSTAT calls the loss for the power model PSTRESS to distinguish it from
Kruskal's STRESS. In PSTRESS, you use DATA and its DHAT instead of DIST and its
DHAT to compute the loss.
The HELM data set contains highly accurate estimates of distance between color
pairs by one experimental subject (CB). These are from Helm (1959) and reprinted by
Borg and Lingoes (1987). To scale these with the power model, specify:

MDS
USE helm
MODEL a .. s
ESTIMATE / REGRESS=POWER

The output is shown below:
Power regression function, where Dissimilarities=a*Distances^p
The data are analyzed as dissimilarities
Minimizing PSTRESS (STRESS with DIST and DATA exchanged) in 2 dimensions
Iteration PSTRESS
--------- -------
0 0.142060
1 0.131422
2 0.127135
3 0.125206
Stress of final configuration is: 0.12521
Estimated exponent for power regression is: 0.85154
Proportion of variance (RSQ) is: 0.91039
Coordinates in 2 dimensions
Variable Dimension
-------- ---------
1 2
A -.83 -.79
C .40 -1.09
E 1.13 -.50
G .98 .10
I .79 .48
K .33 .68
M -.21 .80
O -.73 .58
Q -1.00 .05
S -.87 -.32
SYSTAT estimated the power exponent of the function fitting distances to
dissimilarities as 0.85. Color and many other visual judgments show similar power
exponents less than 1.0.
Computation
This section summarizes algorithms separately for the Kruskal and Guttman methods.
The algorithms in these options substantially follow those of Kruskal (1964ab) and
Guttman (1968). MDS output should agree with other nonmetric multidimensional
scaling except for rotation, dilation, and translation of the configuration. Secondary
documentation can be found in Schiffman, Reynolds, and Young (1981) and the other
multidimensional scaling references. The summary assumes that dissimilarities are
input. If similarities are input, MDS inverts them.
Algorithms
Kruskal Method
The program begins by generating a configuration of points whose interpoint distances
are a linear function of the input data. For this estimation, MDS uses a metric
multidimensional scaling. Missing values in the input dissimilarities matrix are
replaced by mean values for the whole matrix. Then the values are converted to
distances by adding a constant. A scalar products matrix B is then calculated following
the procedures described in Torgerson (1958). The initial configuration matrix X in p
dimensions is computed from the first p eigenvectors of B using the Young-
Householder procedure (Torgerson, 1958).
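As an illustration of this metric start (not SYSTAT's code), the following Python sketch double-centers the squared distances and takes the leading eigenvectors; the three-point distance matrix at the end is invented.

import numpy as np

def torgerson_start(delta, p=2):
    # double-center the squared distances to get a scalar-products matrix B,
    # then build coordinates from the first p eigenvectors of B
    n = delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (delta ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    keep = np.argsort(vals)[::-1][:p]
    return vecs[:, keep] * np.sqrt(np.maximum(vals[keep], 0.0))

delta = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.0],
                  [2.0, 1.0, 0.0]])
print(np.round(torgerson_start(delta, p=2), 3))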
After an initial configuration is computed by the metric method, nonmetric
optimization begins (there are no metric pre-iterations). At the beginning of each
iteration, the configuration is normalized to have zero centroid and unit dispersion.
Next, Kruskal's DHAT (fitted) distance values are computed by a monotonic
regression of distances onto data. Tied data values are ordered according to their
corresponding distances in the configuration.
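Monotonic regression itself can be sketched with the pool-adjacent-violators idea; the Python fragment below is an illustration only (SYSTAT's internal routine may differ in details such as tie handling).

import numpy as np

def monotone_fit(distances):
    # pool-adjacent-violators: distances are listed in increasing order of the data,
    # and the fitted values are forced to be nondecreasing in that order
    blocks = [[float(d)] for d in distances]
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks) - 1):
            if np.mean(blocks[i]) > np.mean(blocks[i + 1]):
                blocks[i:i + 2] = [blocks[i] + blocks[i + 1]]   # pool the violating pair
                merged = True
                break
    fitted = []
    for block in blocks:
        fitted.extend([np.mean(block)] * len(block))
    return np.array(fitted)

print(monotone_fit([1.0, 3.0, 2.0, 5.0, 4.0]))   # [1.  2.5 2.5 4.5 4.5]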
Stress (formula 1) is calculated from fitted distances, observed distances, and input
data values. If the stress is less than 0.001 or has decreased in the last five iterations
less than 0.001 per iteration, or the number of iterations equals the number specified
by the user (default is 50), iterations terminate (that is, go to the next paragraph).
Otherwise, the negative gradient is computed for each point in the configuration by
taking the partial derivatives of stress with respect to each dimension. Points in the
configuration are moved along their gradients with a step size chosen as a function of
the rate of descent; the steeper the descent, the smaller the step size. This completes an
iteration.
After the last iteration, the configuration is shifted so that the origin lies in the
centroid. Thus, the point coordinates sum to 0 on each dimension. Moreover, the
configuration is normalized to unit size so that the sum of squares of its coordinates is
1. If the Minkowski constant is 2 (Euclidean scaling, which is the standard option), the
final configuration is rotated to its principal axis.
Guttman Method
The initial configuration for the Guttman option is computed according to Lingoes and
Roskam (1973). Principal components are computed on a matrix C,

c_{ij} = 1 - \frac{r_{ij}}{n(n-1)/2}

where r_{ij} are the ranks of the input dissimilarities (smallest rank corresponding to
smallest dissimilarity), and n is the number of points. The diagonal elements of C are
c_{ii} = \sum_{j \ne i} \left( 1 - \frac{r_{ij}}{n(n-1)/2} \right)

where the sum is taken over the entire row of the dissimilarity matrix.
For the iteration stage, the initial configuration is normalized as in the Kruskal
method. Then rank images corresponding to each distance in the configuration are
computed by permuting the configuration distances so that they mirror the rank order
of the original input dissimilarities. Ties in the data are handled as in the Kruskal
method. These rank images are used to compute the Guttman/Lingoes coefficient of
alienation. Iterations are terminated if this coefficient becomes arbitrarily small, if the
number of iterations exceeds the maximum, or if the change in its value becomes small.
Otherwise, the points in the configuration are moved five times using the same rank
images but different interpoint distances each time to compute a new negative gradient.
These five cycles within each iteration are what lengthens the calculations in the
Guttman method. This completes an iteration.
The final configuration is rotated and scaled as with the Kruskal method.
Guttman/Lingoes programs normalize the extreme values of the configuration to unity
and thus do not plot the configuration with a zero centroid, so MDS output corresponds
to their output within rigid motion and configuration size.
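For readers who want to experiment, here is a rough Python sketch of rank images together with one standard form of the coefficient of alienation; it is an illustration under those definitions, not a transcription of the Guttman/Lingoes code, and the numbers are invented.

import numpy as np

def rank_images(distances, dissimilarities):
    # permute the configuration distances so they mirror the rank order of the data
    images = np.empty_like(distances, dtype=float)
    images[np.argsort(dissimilarities)] = np.sort(distances)
    return images

def alienation(distances, images):
    # one standard form: K = sqrt(1 - mu^2), mu a product-moment of d and its rank image
    mu = np.sum(distances * images) / np.sqrt(np.sum(distances ** 2) * np.sum(images ** 2))
    return float(np.sqrt(max(1.0 - mu ** 2, 0.0)))

dist = np.array([0.9, 2.1, 1.4, 3.0])
diss = np.array([1.0, 2.0, 3.0, 4.0])
print(alienation(dist, rank_images(dist, diss)))   # small values mean the ranks are nearly reproduced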
Missing Data
Missing values in a similarity/dissimilarity matrix are ignored in the computation of
the loss function that determines how points in the configuration are moved. For
information on how this function is computed, see the discussion of algorithms.
If you compute a similarity matrix with Correlations for input to MDS, the matrix
will have no missing values unless all of your cases in the raw data have a constant or
missing value on one or more variables.
References
Borg, I. and Lingoes, J. (1981). Multidimensional data representations: When and why? Ann Arbor: Mathesis Press.
Borg, I. and Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer Verlag.
Carroll, J. D. and Arabie, P. (1980). Multidimensional scaling. M. R. Rosenzweig and L. W. Porter, eds. Annual Review of Psychology, 31, 607–649.
Carroll, J. D. and Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika, 35, 283–319.
Carroll, J. D. and Wish, M. (1974). Models and methods for three-way multidimensional scaling. D. H. Krantz, R. C. Atkinson, R. D. Luce, and P. Suppes, eds. Contemporary Developments in Mathematical Psychology, Vol. II: Measurement, Psychophysics, and Neural Information Processing. San Francisco: W. H. Freeman and Company.
Coombs, C. H. (1964). A theory of data. New York: John Wiley & Sons, Inc.
Davison, M. L. (1983). Multidimensional scaling. New York: John Wiley & Sons, Inc.
Ekman, G. (1954). Dimensions of color vision. Journal of Psychology, 38, 467–474.
Green, P. E. and Carmone, F. J. (1970). Multidimensional scaling and related techniques. Boston: Allyn and Bacon.
Green, P. E. and Rao, V. R. (1972). Applied multidimensional scaling. New York: Holt, Rinehart, and Winston.
Guttman, L. (1954). A new approach to factor analysis: The radex. P. F. Lazarsfeld, ed. Mathematical Thinking in the Social Sciences. New York: Free Press.
Guttman, L. (1968). A general nonmetric technique for finding the smallest coordinate space for a configuration of points. Psychometrika, 33, 469–506.
Helm, C. E. (1959). A multidimensional ratio scaling analysis of color relations. Technical Report, Princeton University and Educational Testing Service, June 1959.
Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1–27.
Kruskal, J. B. (1964). Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29, 115–129.
Kruskal, J. B. and Wish, M. (1978). Multidimensional scaling. Beverly Hills, Calif.: Sage Publications.
Lingoes, J. C. and Roskam, E. E. (1973). A mathematical and empirical study of two multidimensional scaling algorithms. Psychometrika Monograph Supplement, 19.
Rothkopf, E. Z. (1957). A measure of stimulus similarity and errors in some paired-associate learning tasks. Journal of Experimental Psychology, 53, 94–101.
Schiffman, S. S., Reynolds, M. L., and Young, F. W. (1981). Introduction to multidimensional scaling: Theory, methods, and applications. New York: Academic Press.
Shepard, R. N. (1963). Analysis of proximities as a study of information processing in man. Human Factors, 5, 33–48.
Shepard, R. N., Romney, A. K., and Nerlove, S., eds. (1972). Multidimensional scaling: Theory and application in the behavioral sciences. New York: Academic Press.
Takane, Y., Young, F. W., and de Leeuw, J. (1977). Nonmetric individual differences scaling: An alternating least squares method with optimal scaling features. Psychometrika, 42, 3–27.
Torgerson, W. S. (1958). Theory and methods of scaling. New York: John Wiley & Sons, Inc.
Weinberg, S. L. and Menil, V. C. (1993). The recovery of structure in linear and ordinal data: INDSCAL and ALSCAL. Multivariate Behavioral Research, 28:2, 215–233.
Chapter 20
Nonlinear Models
Laszlo Engelman
Nonlinear modeling estimates parameters for a variety of nonlinear models using a
Gauss-Newton (SYSTAT computes exact derivatives), Quasi-Newton, or Simplex
algorithm. In addition, you can specify a loss function other than least squares, so
maximum likelihood estimates can be computed. You can set lower and upper limits
on individual parameters. When the parameters are highly intercorrelated, and there
is concern about overfitting, you can fix the value of one or more parameters, and
Nonlinear Model will test the result against the full model. If the estimates have
trouble converging, or if they converge to a local minimum, Marquardting is available.
For assessing the certainty of the parameter estimates, Nonlinear Model offers
Wald confidence regions and Cook-Weisberg graphical confidence curves. The latter
are useful when it is unreasonable to assume that the estimates follow a normal
distribution. You can also save values of the loss function for plotting contours in a
bivariate display of the parameter space. This allows you to study the combinations
of parameter estimates with approximately the same loss function values.
When your response contains outliers, you may want to downweight their residuals
using one of Nonlinear Model's robust functions: median, Huber, Hampel,
bisquare, t, trim, or the pth power of the absolute value of the residuals.
You can specify functions of parameters (like LD50 for a logistic model).
SYSTAT evaluates the function at each iteration, and prints the standard error and the
Wald interval for the estimate after the last iteration.
Statistical Background
The following data are from a toxicity study for a drug designed to combat tumors. The
table shows the proportion of laboratory rats dying (Response) at each dose level
(Dose) of the drug. Clinical studies usually scale dose in natural logarithm units, which
are listed in the center column (Log Dose). We arbitrarily set the Log Dose to -4 for
zero Dose for the purpose of plotting and fitting with a linear model.

Dose      Log Dose   Response
0.00      -4.000     0.026
0.10      -2.303     0.120
0.25      -1.386     0.088
0.50      -0.693     0.169
1.00       0.000     0.281
2.50       0.916     0.443
5.00       1.609     0.632
10.00      2.303     0.718
25.00      3.219     0.820
50.00      3.912     0.852
100.00     4.605     0.879

Modeling the Dose-Response Function
The plot of Response against Log Dose is clearly curvilinear.
[Plot: RESPONSE against LOGDOS for the 11 dose levels]
The S-shaped function suggests that we could use a linear model with linear, quadratic,
and cubic terms (that is, a polynomial function) to fit a curved line to the data. Here are
the results:

Dep Var: RESPONSE  N: 11  Multiple R: 0.993  Squared multiple R: 0.986
Adjusted squared multiple R: 0.980  Standard error of estimate: 0.047

Effect                  Coefficient  Std Error  Std Coef  Tolerance        t  P(2 Tail)
CONSTANT                      0.314      0.021     0.0            .   15.241      0.000
LOGDOS                        0.166      0.013     1.344      0.168   12.418      0.000
LOGDOS*LOGDOS                 0.009      0.002     0.202      0.771    3.995      0.005
LOGDOS*LOGDOS*LOGDOS         -0.004      0.001    -0.492      0.152   -4.322      0.003

Notice that all the coefficients are highly significant and the overall fit is excellent
(R^2 = 0.986). Even the tolerances are relatively large, so we need not worry about
collinearity. The residual plots for this function are reasonably well behaved. There is
no significant autocorrelation in the residuals.
The following figure shows the observed data and the fitted curve.

[Plot: the observed RESPONSE values against LOGDOS with the fitted cubic curve overlaid]

How do the researchers interpret this plot? First of all, the curve is consistent with the
printed output; it fits extremely well in the range of the data. Putting the fitted curve
into ordinary language, we can say that fewer animals die at lower dosages and more
at higher. At the extremes, however, more animals die with extremely low dosages and
fewer animals die at extremely high dosages.
This is nonsense. While it is possible to imagine some drugs (arsenic, for example)
for which dose-response functions are nonmonotonic, the model we fit makes no sense
for a clinical drug of this sort. Second, the cubic function we fit extrapolates beyond
the 0–1 response interval. It implies that there is something beyond dying and
something less than living. Third, the parameters of the model we fit have no
theoretical interpretation.
Clinical researchers usually prefer to fit quantal response data like these with a
bounded monotonic response function of the following form:

proportion dying = \gamma + \frac{1}{1 + e^{\,\theta_1 - \theta_2 \log(dose)}}

where \gamma is the background response, or rate of dying, \theta_1 is a location parameter for the
curve, and \theta_2 is a slope parameter for the curve.
Estimating a quantity called LD50 is the usual purpose of this type of study. LD50
is the dose at which 50 percent of the animals are expected to die. LD50 is:

LD50 = e^{\,\theta_1 / \theta_2}

Notice how the parameters of this model make theoretical sense. We have a problem,
however. We cannot fit an intrinsically nonlinear model like this with a linear
regression program. We cannot even transform this equation, using logs or other
mathematical operators, to a linear form. The cubic linear model we fit before was
nonlinear in the data but linear in the parameters. Linear models involve additive
combinations of parameters. The model we want to fit now is nonlinear in the data and
nonlinear in the parameters.
We need a program that fits this type of model iteratively. NONLIN begins with
initial estimates of parameter values and modifies them in small steps until the fit of
the curve to the data is as close as possible.
Here is the result:
Notice how the curve tapers at the ends so that it is bounded by 0 and 1 on the Response
scale. This behavior fits our theoretical ideas about the effect of this drug. The value
for LD50 is 3.295, which is in raw dose units.
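If you want to reproduce this kind of fit outside SYSTAT, the following Python sketch fits the three-parameter logistic written above to the table of log doses and responses with scipy and then computes LD50; the parameter names g, t1, and t2 and the starting values are illustrative assumptions, not part of the manual.

import numpy as np
from scipy.optimize import curve_fit

# Log Dose and Response columns from the table earlier in this chapter
logdose = np.array([-4.0, -2.303, -1.386, -0.693, 0.0, 0.916,
                    1.609, 2.303, 3.219, 3.912, 4.605])
resp = np.array([0.026, 0.120, 0.088, 0.169, 0.281, 0.443,
                 0.632, 0.718, 0.820, 0.852, 0.879])

def logistic(x, g, t1, t2):
    # g = background rate, t1 = location, t2 = slope (illustrative names)
    return g + 1.0 / (1.0 + np.exp(t1 - t2 * x))

(g, t1, t2), _ = curve_fit(logistic, logdose, resp, p0=[0.05, 1.0, 1.0])
print("background:", g, "location:", t1, "slope:", t2)
print("estimated LD50 (raw dose units):", np.exp(t1 / t2))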
Interestingly, this model does not fit significantly better than the cubic polynomial.
Both have comparable sums of squared residuals. True, the cubic model has four
parameters and we have used only three. Nevertheless, this example should convince
you that blind searching for models that produce good fits is not good science. It is even
possible that a model with a poorer fit can be the true model generating data and one
with a better fit can be bogus.
Loss Functions
Nonlinear estimation includes a broad variety of statistical procedures. We have
performed nonlinear least squares, which is analogous to ordinary least squares. Both
methods minimize squared deviations of the dependent variable data values from
values estimated by the function at the same independent variable data points. In these
cases, loss is the sum of least squares.
Other types of loss functions can be defined which produce different estimates of
parameters in the same functions. The most widely used loss is negative log likelihood.
This loss is used for maximum likelihood estimation. Other loss functions are used for
robust estimators and nonparametric procedures.
Maximum Likelihood
A maximum likelihood estimate of a parameter is a value of that parameter in a given
distribution that has the highest probability of generating the observed sample data.
Sometimes maximum likelihood and least squares estimators coincide (as in fixed
effects, fully crossed, balanced factorial ANOVA), and at other times they diverge. In
our quantal response data example, the maximum likelihood estimates are different.
They can be computed in NONLIN by using the loss function.
In general, maximum likelihood estimates are found by maximizing the likelihood
function L with respect to the parameter vector \theta:

L(\theta) = \prod_{i=1}^{n} d(x_i, \theta)

where d(x_i, \theta) is the density of the response at each value of x. Equivalently, the
negative of the log of the likelihood function can be minimized:

-\log L = -\sum_{i=1}^{n} \ln d(x_i, \theta)

Here we outline four methods for computing maximum likelihood estimates in
NONLIN. To define them, we use a specific model and a specific density. The model is
the sum of two exponentials:

\hat{y} = p_1 e^{p_2 x} + p_3 e^{p_4 x}

and the distribution of y at each x is Poisson:

d(x_i, \theta) = \frac{e^{-\lambda} \lambda^{y}}{y!}

In our definitions, we also use the log of the density:

\ln d = y \ln\lambda - \lambda - \mathrm{LGM}(y + 1)

where LGM is the log gamma function for computing y!.
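The same minimization can be sketched directly in Python; the code below builds the Poisson negative log likelihood for the two-exponential model and hands it to a Simplex-style (Nelder-Mead) optimizer. The counts and starting values are invented for illustration; this is not SYSTAT's code.

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# Invented Poisson counts y observed at predictor values x
x = np.arange(1.0, 11.0)
y = np.array([18.0, 12.0, 9.0, 7.0, 6.0, 5.0, 4.0, 4.0, 3.0, 3.0])

def neg_loglik(p):
    # lambda (the Poisson mean) is the two-exponential model evaluated at each case
    lam = p[0] * np.exp(p[1] * x) + p[2] * np.exp(p[3] * x)
    if np.any(lam <= 0):
        return np.inf
    # sum over cases of -ln d = lambda - y*ln(lambda) + LGM(y + 1)
    return np.sum(lam - y * np.log(lam) + gammaln(y + 1.0))

fit = minimize(neg_loglik, x0=[15.0, -0.5, 5.0, -0.05], method="Nelder-Mead")
print(fit.x, fit.fun)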
Method 1. Set the LOSS function to -ln(density). In NONLIN, you can specify your own
loss function. Here we specify the negative of the log of the density function:

\mathrm{LOSS} = \lambda - y \ln\lambda + \mathrm{LGM}(y + 1)

For the estimate of lambda, we use \hat{y}, or estimate, as it is known to Nonlinear Model.
Using commands, we type:

MODEL Y = p1*EXP(p2*x) + p3*EXP(p4*x)
LOSS = estimate - y*LOG(estimate) + LGM(y+1)
ESTIMATE

Note that for this method, you need to specify only the loss function. This method can
be used for any distribution; however, the estimated standard errors may not be correct.
Method 2. Iteratively reweighted least squares. This method is appropriate for
distributions belonging to the exponential family (for example, normal, binomial,
multinomial, Poisson, and gamma). It provides meaningful standard errors for the
parameter estimates and useful residuals. For this method, you define a case weight
that is recomputed at each iteration:

weight = \frac{1}{\mathrm{variance}(y_i)}

For our Poisson distribution, the mean and variance are equal, so lambda is the
variance, and our estimate of the variance is estimate. Thus, the weight is:

weight = \frac{1}{\mathrm{estimate}}

Here's how to specify this method using NONLIN commands:

LET wt=1
WEIGHT = wt
MODEL y = p1*EXP(p2*x) + p3*EXP(p4*x)
RESET wt = 1 / estimate
ESTIMATE / SCALE

The standard deviations of the resulting estimates are the usual information theory
standard errors.
Method 3. Estimate ln(density) and reset the predicted value to \hat{y} + 1. For this method, the
data may follow any distribution and the standard errors are correct, but the method
does not yield correct residuals. You define a dummy outcome variable and estimate
the log of the density, and then reset the outcome variable to \hat{y} + 1 at each iteration.
For our example, with commands:

LET dummy = 0
MODEL dummy = -p1*EXP(p2*x) - p3*EXP(p4*x),
      + y*LOG(p1*EXP(p2*x) + p3*EXP(p4*x)),
      - LGM(y + 1)
RESET dummy = estimate + 1
ESTIMATE / SCALE

Method 4. Set the predicted value to zero and define the function as the square root of the
negative log density. This method is a variation of method 1, so it is appropriate for data
from any distribution and provides estimates of the parameters only. Here we trick
NONLIN by setting y = 0 for all cases:

f = \sqrt{-\ln d(x, \theta)}, so (y - f)^2 = \left(0 - \sqrt{-\ln d(x, \theta)}\right)^2 becomes -\ln d(x, \theta)

For our example, with commands:

LET dummy = 0
MODEL dummy = SQR(p1*EXP(p2*x) + p3*EXP(p4*x),
      - y*LOG(p1*EXP(p2*x) + p3*EXP(p4*x)),
      + LGM(y + 1))
ESTIMATE

Least Absolute Deviations
As an example of other types of loss functions, consider minimizing least absolute
values of deviations of the dependent variable data values from values estimated by the
function at the same independent variable data points. This procedure produces
estimates which, on average, are influenced less by outliers than the least squares
estimates. This is because squaring a large value increases its impact. While there are
more sophisticated robust procedures, least absolute values estimates are easy to
compute in NONLIN and fun to compare with least squares estimates.
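A quick way to see the difference is to fit the same straight line under both loss functions; the Python sketch below (not SYSTAT code) uses invented data with one outlier.

import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.0, 20.0])   # the last value is an outlier

def fit_line(loss):
    # minimize the summed loss of the residuals y - (a + b*x)
    objective = lambda b: np.sum(loss(y - (b[0] + b[1] * x)))
    return minimize(objective, x0=[0.0, 1.0], method="Nelder-Mead").x

print("least squares slope:           ", fit_line(np.square)[1])   # pulled up by the outlier
print("least absolute deviation slope:", fit_line(np.abs)[1])      # stays near 1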
Model Estimation
SYSTAT provides three algorithms for estimating your model: Gauss-Newton, Quasi-
Newton, and Simplex. The Gauss-Newton method with its exact derivatives produces
more accurate estimates of the asymptotic standard errors and covariances and can
converge in fewer iterations and more quickly than the other two algorithms.
Neither GN nor the Quasi-Newton method works if the derivatives are undefined
in the region in which you are seeking minimum values. Specifically, the first and
second derivatives must exist at all points for which the algorithm computes values.
However, the algorithms cannot identify situations where the derivatives do not exist.
Also, Quasi-Newton cannot detect when derivatives fluctuate rapidly; thus, Gauss-
Newton can be more accurate.
The Simplex algorithm does not have this requirement. It calculates a value for your
loss function at some point, looks to see if this value is less than values elsewhere, and
steps to a new point to try again. When the steps become small, iterations stop.
GN is the fastest method. Simplex is generally slower than the others, particularly
for least squares, because Simplex cannot make use of the information in the
derivatives to find how far to move its estimates at each step.
How Nonlinear Modeling Works
The estimation works as follows: the starting values of the parameters are selected by
the program or by you. Then the model (if stated) is evaluated for the first case in
double precision. The result of this function is called the estimate. Then the loss
function is evaluated for the first case, using the estimate from the model. If you did
not include a loss function, then loss is computed by squaring the residual for the first
case.
This procedure is repeated for all cases in the file and the loss is summed over cases.
The summed loss is then minimized using the Gauss-Newton, Quasi-Newton, or
Simplex algorithms. Iterations continue until both convergence criteria are met or the
maximum number of iterations is reached.
Problems
You may encounter numerous pitfalls (for example, dependencies, discontinuities,
local minima, and so on). Nonlinear Model offers several possibilities to overcome
these pitfalls, but in some instances, even your best efforts may be futile.
n Find reasonable starting values by considering approximately what the values
should be. Try plotting the data. For example in the contouring example, you could
let and estimate to be approximately 20.
n Try Marquardting.
n Use several different starting values for each method before you feel comfortable
with the final estimates. This can help you expose local minima. The Simplex
method is most robust against local minima. There is a trade-off, however, because
it is considerably slower.
n Try switching back and forth between Gauss-Newton, Quasi-Newton, and Simplex
without changing the starting values. That way, one may help you out of a
convergence or local minimum problem.
n If you get illegal function values for starting values, try some other estimates. For
some functions with many parameters, you may need high quality starting values
to get an estimable function at all.
n Never trust the output of an iterative nonlinear estimation procedure until you have
plotted estimates against predictors and you have tried several different starting
values. SYSTAT is designed so that you can quickly save estimates, residuals, and
model variables and plot them. All of the examples in this chapter were tested this
way. Although most began with default starting values for the parameters, they
were checked with other starting values.
Nonlinear Models in SYSTAT
Nonlinear Model Specification
To open the Nonlinear Model dialog box, from the menus choose:
Statistics
Regression
Nonlinear
Model/Loss...
Model specification. Specify a general algebraic equation model to be estimated. Terms
that are not variables are assumed to be parameters. If you want to use a function in the
model, choose a Function Type from the drop-down list, select the function in the
Functions list, and click Add.
Nonlinear modeling uses models resembling those for General Linear Model
(GLM). There is one critical difference, however. The Nonlinear Model statement is a
literal algebraic expression of variables and parameters. Choose any name you want
for these parameters. Any names you specify that are not variable names in your file
are assumed to be parameter names. Suppose you specify the following model for the
USSTATES data:
Since b0 and b1 are not variables (they are parameters), the following model is the same:
Parameter names can be any names that meet the requirements for SYSTAT numeric
variable names (eight characters beginning with a letter). However, unlike variable
names, parameter names may not have subscripts.
Any legal SYSTAT expression can be used in a model statement, including
trigonometric and other functions, plus the special variables CASE and COMPLETE.
The only restriction is that the dependent variable must be a variable in your file. Here
is a more complicated example:
liver = b0 + b1 * wine
liver = constant + beta * wine
654
Chapter 20
This model has two parameters (mu1 and mu2). Their values are conditional on the
value of division. Notice that the remaining parts of this expression involve relational
operations (division < 5 and division >= 5). SYSTAT evaluates these to 1 (true) or 0 (false).
You can perform piecewise regression by fitting different curves to different subsets
of your data:

y = (x <= 0)*10 + (x > 0 AND x < 1)*beta*x + (x >= 1)*20
In this model, y is 10 if x is less than or equal to 0, y is BETA*x if x is greater than 0
and less than 1, and y is 20 if x is greater than or equal to 1. These types of constraints
are useful for specifying bounded probability functions such as the cumulative uniform
distribution.
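The following short sketch (Python, not SYSTAT syntax) illustrates the mechanism: each relational comparison evaluates to 1 or 0, so exactly one branch of the expression is active for any given x. The parameter value beta = 20.0 is purely hypothetical.

def piecewise(x, beta=20.0):
    # Each comparison contributes 1 (true) or 0 (false), so exactly one
    # term is switched on for any value of x.
    return (x <= 0) * 10 + (0 < x < 1) * beta * x + (x >= 1) * 20

for x in (-2.0, 0.5, 3.0):
    print(x, piecewise(x))   # 10 below zero, beta*x in between, 20 at or above 1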
Estimation. You can specify a loss function other than least squares. From the drop-
down list, select Loss Function to perform loss analysis. When your response contains
outliers, you may want to downweight their residuals using a robust function by
selecting Robust.
Method. Three model estimation methods are available.
n Gauss-Newton. Computes exact derivatives.
n Quasi-Newton. Uses numeric estimates of the first and second derivatives.
n Simplex. Uses a direct search procedure.
Save file. You can save seven sets of statistics to a file.
n Data. The data, estimated values, and residuals.
n Residuals. The estimated values, residuals, and variables in the model.
n Residuals/Data. All of the above.
n Response Surface. Five levels of contours of the loss function surrounding the
converged minimum (like a response surface for the loss function in a 2-D
parameter space).
n Confidence Interval. Cook-Weisberg graphical confidence curves. These are useful
when it is unreasonable to assume that the estimates follow a normal distribution.
n Confidence Region. A closed curve that defines the n% confidence region for a pair
of parameters surrounding the converged minimum. Type a number, n, between 0
and 0.99 in the Confidence Region field to specify the size of the confidence region.
n Parameters. Parameter estimates.
Parameters. For Response Surface and Confidence Region, you must specify names
of two parameters. For Confidence Interval, you must specify the names of the
parameters. Use a comma between each parameter name.
Estimate
Click Estimate in the Nonlinear Model dialog box to open the Estimate dialog box.
SYSTAT offers several options for controlling model computation.
Marquardt. Marquardt method of inflating the diagonal of the (Jacobian'Jacobian)
matrix by n. This speeds convergence when initial values are far from the estimates and
when the estimates of the parameters are highly intercorrelated. This method is similar
to ridging, except that the inflation factor n is omitted from final iterations.
Start. Starting values for model parameters. Specify values for each parameter in the
order the parameters appear in your model (or loss statement if no model is specified).
Separate the values with commas or blanks. You can specify starting values for some
of the parameters and leave blanks for others.
SYSTAT chooses starting values if you do not. Specify starting values that give the
general shape of the function you expect as a result. For example, if you expect that the
function is a negative exponential function, then specify initial values that yield a
negative exponential function. Also, make sure that the starting values are in a
reasonable range. For example, if the function contains EXP(P*TIME) and TIME ranges
from 10,000 to 20,000, then the initial value of P should be around 1/10,000. If you
specified an initial value such as 0.1, the function would have extremely large values,
such as e^1000.
Minimum. Lower limits for the parameters, one number per parameter.
Maximum. Upper limits for the parameters, one number per parameter.
Iterations. Maximum number of iterations for fitting your model.
Half. Maximum number of step halvings. If the loss increases between two iterations,
Nonlinear Model halves the increment size, computes the loss at the midpoint, and
compares it to the residual sum of squares at the previous iteration. This process
continues until the residual sum of squares is less than that at the previous iteration or
until the maximum number of halvings is reached.
Tolerance. A check for near singularity. In order for SYSTAT to invert the matrix of
sums of cross-products of the derivatives with respect to the parameters, the matrix
cannot be singular. Use Tolerance to guard against this singularity problem. A
parameter estimate is not changed at an iteration if more than a (1 - TOL) proportion of the
sum of squares of partial derivatives with respect to that parameter can be expressed
with partial derivatives of other parameters.
Loss convergence. When the relative improvement in the loss function for an iteration
is less than the specified value, SYSTAT declares that a solution has been found. Note,
for convergence, both loss convergence and parameter convergence must be satisfied.
Parameter convergence. When the largest relative improvement of parameters for an
iteration is less than the specified value, SYSTAT considers that the estimates of the
parameters have converged. Each parameter estimate must satisfy this criterion.
Mean square error scale. Rescales the mean square error to 1 at the end of the iterations.
Fix. Specify names of parameters to be held fixed at a constant value. SYSTAT
estimates the remaining parameters and tests whether the result differs from that for the
full model. An example is p3 = 1.0.
Recompute
The dependent variable or the weight variable can be recomputed after each iteration,
using the current values of the parameters.
You can open the Recompute dialog box by clicking Recompute in the Nonlinear
Model dialog box.
Type the name of the dependent variable or the weight variable in the Variable field or
select the appropriate variable in the Variables list and click Add. If you want to use a
function in your expression, choose a Function Type from the drop-down list, select the
function in the Functions list, and click Add.
Functions of Parameters
Click Func Param in the Nonlinear Model dialog box to open the Functions of
Parameters dialog box.
SYSTAT allows you to estimate functions of parameters. Assign a name to each
function in the Parameter field. You can state up to four functions in this dialog box.
SYSTAT estimates each function and reports related statistics.
If you want to use a built-in function in the expression, choose a Function Type from
the drop-down list, select the function in the Functions list, and click Add.
Robust Analysis
When your dependent variable contains outliers, a robust regression procedure can
downweight their influence on the parameter estimates. Thus, the resulting estimates
reflect the great bulk of the data and are not sensitive to the value of a few unusual
cases.
To specify a robust analysis, select Robust under Estimation in the Nonlinear Model
dialog box and click Robust.
You must select Perform robust analysis to specify a robust estimation procedure.
Available methods include:
n Absolute. The sum of absolute values of residuals.
n Power. The sum of the nth power of absolute values of residuals.
n Trim. Trims the n proportion of the residuals (those with the largest absolute values)
and minimizes the sum of squares of the remaining residuals.
n Huber. The sum of MAD standardized residuals weighted by Huber.
n Hampel. The sum of MAD standardized residuals weighted by Hampel.
n T. A t distribution with df degrees of freedom.
n Bisquare. The sum of MAD standardized residuals weighted by Bisquare.
The parameters for Huber, Hampel, t, and Bisquare are defined in MAD units (median
absolute deviations from the median of the residuals).
Each procedure has a function that is used to construct a weight for each residual
(that is recomputed at each iteration). Here is the weighting scheme for the Hampel
procedure (the heavy line is the Hampel function):

[Figure: the Hampel weighting function plotted against the residual, with cutpoints a, b, and c marked on the horizontal axis]

for |residual| < a, the weight (ψ(residual)/residual) is 1.0
for a < |residual| < b, the weight is m/n
for b < |residual| < c, the weight is p/q
for c < |residual|, the weight is 0.0

Nonlinear Model's default values for a, b, and c are 1.7, 3.4, and 8.5, respectively. So,
if the size of the residual is less than 1.7, the weight is one; if it is over 8.5, the weight
is zero. As the residual increases in absolute value, the weight decreases.
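As a rough illustration, here is a small Python sketch of a three-part redescending weight of this kind. It uses the standard Hampel psi function with the default cutpoints a = 1.7, b = 3.4, and c = 8.5, assumes the residual has already been standardized in MAD units, and is only a sketch, not SYSTAT's internal code.

def hampel_weight(r, a=1.7, b=3.4, c=8.5):
    """Weight = psi(r)/r for a standard Hampel redescending function.
    r is a residual already standardized in MAD units."""
    u = abs(r)
    if u < 1e-12:
        return 1.0                            # limit of psi(r)/r as r -> 0
    if u <= a:
        return 1.0                            # full weight
    if u <= b:
        return a / u                          # psi(r) = a*sign(r)
    if u <= c:
        return a * (c - u) / ((c - b) * u)    # linear descent to zero
    return 0.0                                # residuals beyond c get no weight

print([round(hampel_weight(r), 3) for r in (0.5, 2.0, 5.0, 10.0)])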
Loss Functions for Nonlinear Model Estimation
As an alternative to least squares and robust regression, you can specify a custom loss
function to apply in model estimation. The default (least squares) loss function is
(depvar - estimate)^2. The word "estimate" in the function is the fitted value from
your model. It is a special Nonlinear Model word, so you should not name a variable
ESTIMATE. The model defines the parameters (that is, new parameters cannot be
introduced in the loss function).
To specify a loss function for a model, select Loss Function under Estimation in the
Nonlinear Model dialog box, and click Loss.
Expression. Enter the desired loss function. If you want to use a function in the
expression, choose a Function Type from the drop-down list, select the function in the
Functions list, and click Add.
Loss Functions for Analytic Function Minimization
You can also use nonlinear estimation to minimize an algebraic function. Such a
function requires no model specification. As a result, the loss function defines the
parameters and SYSTAT computes no estimates for a dependent variable.
To open the Loss dialog box, from the menus choose:
Statistics
Regression
Nonlinear
Loss...
Expression. Enter the desired loss function. If you want to use a function in the
expression, choose a Function Type from the drop-down list, select the function in the
Functions list, and click Add.
If estimation problems arise, use an alternative estimation method. The Simplex
method generally does better with algebraic expressions that incur roundoff error.
Using Commands
First, specify your data with USE filename. Continue with:
Usage Considerations
Types of data. NONLIN uses rectangular data only.
Print options. If you specify LONG output, casewise predictions and the asymptotic
correlation matrix of parameters are printed in addition to the default output.
Quick Graphs. NONLIN produces a scatterplot of the dependent variable against the
variables in the model expression. The fitted function appears as either a line or a
surface. If the model expression contains three or more variables, only the first two
appear in the plot.
Saving files. In nonlinear modeling, you can save residuals, estimated values, and
variables from your model statement, loss function values surrounding the converged
minimum, or data for plotting the Cook-Weisberg confidence intervals or two-
parameter confidence region.
BY groups. NONLIN produces separate results for each level of any BY variables.
NONLIN
MODEL var = function
LOSS function
RESET depvar = expression or weightvar = expression
ROBUST argument / ABSOLUTE or POWER=n or TRIM=n or HUBER=n ,
or HAMPEL=n1,n2,n3 or T=df or BISQUARE=n
FUNPAR name1=function1, name2=function2,
SAVE filename / DATA RESID RS=p1,p2 CI=p1,p2 CR=p1,p2 CONFI=n
ESTIMATE / GN or QUASI or SIMPLEX
MARQUADT=n START=n1,n2, MIN=n1,n2, MAX=n1,n2,,
ITER=n HALF=n TOL=n LCONV=n CONV=n SCALE
FIX p1=n1, p2=n2,
ESTIMATE
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. NONLIN uses a FREQUENCY variable, if present, to duplicate cases.
Case weights. You can weight cases in NONLIN by specifying a WEIGHT variable.
Examples
Example 1
Nonlinear Model with Three Parameters
For this first example, we do not specify any options specific to NONLIN; we simply
specify the model using the operators and functions available for SYSTAT's
transformations. Here, we use the default Gauss-Newton algorithm that computes
exact derivatives.
The Pattison data are from a 1987 JASA article by G. P. Y. Clarke (Clarke took the
data from an unpublished thesis by N. B. Pattinson). For 13 grass samples collected in
a pasture, Pattinson recorded the number of weeks since grazing began in the pasture
(TIME) and the weight of grass (GRASS) cut from 10 randomly sited quadrants. He
then fit the Mitscherlich equation. Here is the model with the Quick Graph from its fit:

GRASS = θ1 + θ2*e^(-θ3*TIME)
The input is:
USE pattison
NONLIN
PRINT=LONG
MODEL grass = p1 + p2*EXP(-p3*time)
ESTIMATE

The output follows:
Iteration
No. Loss P1 P2 P3
0 .220818D+02 .101000D+01 .102000D+01 .103000D+01
1 .120609D+02 .117014D+01 .182736D+00-.152631D+00
2 .112473D+02 .172163D+01-.530281D-01-.212060D+00
3 .530076D+01 .272740D+01-.314883D+00 .112491D+00
4 .281714D+01 .971285D+00 .251024D+01 .186373D+00
5 .127700D+00 .120930D+01 .223520D+01 .109079D+00
6 .540618D-01 .966518D+00 .251532D+01 .102374D+00
7 .534536D-01 .963226D+00 .251890D+01 .103061D+00
8 .534536D-01 .963120D+00 .251900D+01 .103055D+00
9 .534536D-01 .963121D+00 .251900D+01 .103055D+00

Dependent variable is GRASS

Source Sum-of-Squares df Mean-Square
Regression 70.871 3 23.624
Residual 0.053 10 0.005

Total 70.925 13
Mean corrected 3.309 12

Raw R-square (1-Residual/Total) = 0.999
Mean corrected R-square (1-Residual/Corrected) = 0.984
R(observed vs predicted) square = 0.984

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
P1 0.963 0.322 2.995 0.247 1.680
P2 2.519 0.266 9.478 1.927 3.111
P3 0.103 0.026 4.041 0.046 0.160

GRASS GRASS
Case Observed Predicted Residual
1 3.183 3.235 -0.052
2 3.059 3.013 0.046
3 2.871 2.812 0.059
4 2.622 2.631 -0.009
5 2.541 2.468 0.073
6 2.184 2.320 -0.136
7 2.110 2.188 -0.078
8 2.075 2.068 0.007
9 2.018 1.959 0.059
10 1.903 1.862 0.041
11 1.770 1.774 -0.004
12 1.762 1.695 0.067
13 1.550 1.623 -0.073

Asymptotic Correlation Matrix of Parameters
P1 P2 P3
P1 1.000
P2 -0.972 1.000
P3 0.984 -0.923 1.000
The estimates of parameters converged in nine iterations. At each iteration, Nonlinear
Model prints the number of the iteration, the loss, or the residual sum of squares (RSS),
and the estimates of the parameters. At step 0, the estimates of the parameters are the
starting values chosen by SYSTAT or specified by the user with the START option of
ESTIMATE. The residual sum of squares is

RSS = Σ w*(y - f)^2

where y is the observed value, f is the estimated value, and w is the value of the case
weight (its default is 1.0).
Sums of squares (SS) appearing in the output include:

Regression:     Σ w*f^2
Residual:       Σ w*(y - f)^2
Total:          Σ w*y^2
Mean corrected: Σ w*(y - ȳ)^2
The Raw R-square (Regression SS / Total SS) is the proportion of the variation in y that
is explained by the sum of squares due to regression. Some researchers object to this
measure because the means are not removed. The Mean corrected R-square tries to
adjust for this. Many researchers prefer the last measure, R(observed vs. predicted)
squared. It is the correlation squared between the observed values and the predicted
values.
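As a minimal check of these definitions (assuming unit case weights), the observed and predicted GRASS values printed above can be plugged into the formulas directly; small discrepancies come from rounding in the printed columns. This Python sketch is only an illustration and is not part of SYSTAT.

import numpy as np

y = np.array([3.183, 3.059, 2.871, 2.622, 2.541, 2.184, 2.110,
              2.075, 2.018, 1.903, 1.770, 1.762, 1.550])   # observed GRASS
f = np.array([3.235, 3.013, 2.812, 2.631, 2.468, 2.320, 2.188,
              2.068, 1.959, 1.862, 1.774, 1.695, 1.623])   # predicted GRASS

residual   = np.sum((y - f) ** 2)            # about 0.053
regression = np.sum(f ** 2)                  # about 70.87
total      = np.sum(y ** 2)                  # about 70.92
corrected  = np.sum((y - y.mean()) ** 2)     # about 3.31

raw_r2       = 1 - residual / total          # about 0.999
corrected_r2 = 1 - residual / corrected      # about 0.984
pred_r2      = np.corrcoef(y, f)[0, 1] ** 2  # about 0.984
print(round(raw_r2, 3), round(corrected_r2, 3), round(pred_r2, 3))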
A period (there is none here) for the asymptotic standard error indicates a problem
with the estimate (the correlations among the estimated parameters may be very high,
or the value of the function may not be affected if the estimate is changed). Read
Param/ASE, the estimate of each parameter divided by its asymptotic standard error,
roughly as a t statistic.
The Wald Confidence Intervals for the estimates are defined as EST ± t*A.S.E. for
the t distribution with residual degrees of freedom (df = 10 in this example). SYSTAT
prints the 95% confidence intervals. Use CONFI=n to specify a different confidence
level.
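A small Python sketch of the same calculation for P2, using the estimate and A.S.E. from the table above (scipy is used here only to supply the t quantile):

from scipy import stats

est, ase, df = 2.519, 0.266, 10
t_crit = stats.t.ppf(0.975, df)                  # two-sided 95% critical value
lower, upper = est - t_crit * ase, est + t_crit * ase
print(round(t_crit, 3), round(lower, 3), round(upper, 3))
# about 2.228, 1.926, 3.112; the printed interval is 1.927 to 3.111 (rounding)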
SYSTAT computes asymptotic standard errors and correlations by estimating the
INV(J'J) matrix after iterations have terminated. The correlations come from the
asymptotic covariance matrix, INV(J'J)*RMS, where J is the Jacobian and RMS is
the residual mean square. You should examine your model for redundant
parameters. If the J'J matrix is singular (parameters are very highly intercorrelated),
SYSTAT prints a period to mark parameters with problems. In this example, the
parameters are highly intercorrelated; the model may be overparameterized.
Example 2
Confidence Curves and Regions
Confidence curves and regions provide information about the certainty of your
parameter estimates. The usual Wald confidence intervals can be misleading when
intercorrelations among the parameters are high.
Confidence curves. Cook and Weisberg construct confidence curves by plotting an
assortment of potential estimates of a specific parameter on the y axis against the
absolute value of a t statistic derived from the residual sum of squares (RSS) associated
with each parameter estimate. To obtain the values for the x axis, SYSTAT:
n Computes the model as usual and saves RSS.
n Fixes the value of the parameter of interest (for example, at the estimate plus half
the standard error of the estimate), recomputes the model, and saves RSS*.
n Computes the t statistic:

t = sqrt[ (RSS* - RSS) / ( RSS / (n - p) ) ]
n Repeats the above steps for other estimates of the parameter.
Now SYSTAT plots each parameter estimate against the absolute value of its associated
t* statistic. Vertical lines at the 90, 95, and 99 percentage points of the t distribution
with (n - p) degrees of freedom provide a useful frequentist calibration of the plot.
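The sketch below (Python with scipy, on synthetic data rather than the Pattison file) mimics this recipe for one parameter: refit the remaining parameters over a grid of fixed values and convert each refit RSS* to the t statistic above. It illustrates the formula only; it is not SYSTAT's implementation.

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
x = np.linspace(0, 20, 25)
y = 1.0 + 2.5 * np.exp(-0.1 * x) + rng.normal(0, 0.05, x.size)   # synthetic data

def resid(p, fixed_p1=None):
    # If fixed_p1 is given, p holds only (p2, p3); otherwise p is (p1, p2, p3).
    p1, p2, p3 = (fixed_p1, *p) if fixed_p1 is not None else p
    return y - (p1 + p2 * np.exp(-p3 * x))

full = least_squares(resid, x0=[1.0, 1.0, 0.2])
rss, n, k = np.sum(full.fun ** 2), x.size, 3

for p1_fixed in np.linspace(0.5, 1.5, 5):
    part = least_squares(lambda q: resid(q, p1_fixed), x0=full.x[1:])
    rss_star = np.sum(part.fun ** 2)
    t_star = np.sqrt(max(rss_star - rss, 0.0) / (rss / (n - k)))
    print(round(p1_fixed, 2), round(t_star, 2))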
To illustrate the usefulness of confidence curves, we again use the Pattison data
used in the three-parameter nonlinear model example. Recall that the parameter
estimates were:

p1 = 0.963
p2 = 2.519
p3 = 0.103
To produce the Cook-Weisberg confidence curves for the model:
USE pattison
NONLIN
MODEL grass = p1 + p2*EXP(-p3*time)
SAVE pattci / CI=p1 p2 p3
ESTIMATE
SUBMIT pattci

Here are the results:

The nonvertical straight lines (blue on a computer monitor) are the Wald 95%
confidence intervals and the solid curves are the Cook-Weisberg confidence curves.
The vertical lines show the 90th, 95th, and 99th percentiles of the t distribution with
n - p = 10 degrees of freedom.
For P1 and P2, the coverage of the Wald intervals differs markedly from that of the
Cook-Weisberg (C-W) curves. The 95% interval for P1 on the C-W curve is
approximately from 0.58 to 1.45; the Wald interval extends from 0.247 to 1.68. The
steeply descending lower C-W curve indicates greater uncertainty for smaller estimates
of P1. For P2, the C-W interval ranges from 2.12 to 3.92; the Wald interval ranges from
1.9 to 3.1. The agreement between the two methods is better for P3. The C-W curves
show that the distributions of estimates for P1 and P2 are quite asymmetric.
Confidence region. SYSTAT also provides the CR option for confidence regions. When
there are more than two parameters in the model, this feature causes Nonlinear Model
to search for the best values of the additional parameters for each combination of
estimates for the first two parameters. Type:
USE pattison
NONLIN
MODEL grass = p1 + p2*EXP(-p3*time)
SAVE pattcr / CR=p1 p2
ESTIMATE
SUBMIT pattcr

The plot follows:

[Plot: the 95% confidence region for P1 (horizontal axis, about -0.8 to 1.6) and P2 (vertical axis, about 2.0 to 4.0)]

You can also specify the level of confidence. For example,

SAVE pattcr / CR=p1 p2 CONFI=.90
Example 3
Fixing Parameters and Evaluating Fit
In the three-parameter nonlinear model example, the R-square between the observed and
predicted values is 0.984, indicating good agreement between the data and fitted
values. However, there may be consecutive points across time where the fitted values
are consistently overestimated or underestimated. We can look for trends in the
residuals by plotting them versus TIME and connecting the points with a line. A stem-
and-leaf plot will tell us if extreme values are identified as outliers (outside values or
far outside values). The input is:
USE pattison
NONLIN
MODEL grass = p1 + p2*EXP(-p3*time)
SAVE myresids / DATA
ESTIMATE
USE myresids
PLOT RESIDUAL*TIME / LINE YLIMIT=0
STATISTICS
STEM RESIDUAL

The output is:

Stem and Leaf Plot of variable: RESIDUAL, N = 13
Minimum: -0.136
Lower hinge: -0.052
Median: 0.007
Upper hinge: 0.059
Maximum: 0.073

-1 3
-0 H 775
-0 H 00
0 M 044
0 H 5567
The results of a runs test would not be significant here. The large negative residual in
the center of the plot, -0.137, is not identified as an outlier in the stem-and-leaf plot.
We should probably be more concerned about the fact that the parameters are highly
intercorrelated: The correlation between P1 and P2 is 0.972, and the correlation
between P1 and P3 is 0.984. This might indicate that our model has too many
parameters. You can fix one or more parameters and let SYSTAT estimate the
remaining parameters. Suppose, for example, that similar studies report a value of P1
close to 1.0. You can fix P1 at 1.0 and then test whether the results differ from the
results for the full model.
To do this, first specify the full model. Use FIX to specify the parameter as P1 with
a value of 1. Then initiate the estimation process with ESTIMATE:
USE pattison
NONLIN
MODEL grass = p1 + p2*EXP(-p3*time)
ESTIMATE
FIX p1=1
SAVE pattci / CI=p2 p3
ESTIMATE
SUBMIT pattci

Here are selections from the output:
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
P1 1.000 0.0 . . .
P2 2.490 0.060 41.662 2.358 2.621
P3 0.106 0.004 23.728 0.096 0.116
Analysis of the effect of fixing parameter(s)
Source Sum-of-squares df Mean-square F-value p(F-value)
Parameter fix 0.000 1 0.000 0.014 0.908
Residual 0.053 10 0.005
The analysis of the effect of fixing parameter(s) F test tests the hypothesis that P1=1.0.
In our output, F = 0.014 (p = 0.908), indicating that there is no significant difference
between the two models. This is not surprising, considering the similarity of the results:

              Three parameters    P1 fixed at 1.0
P1            0.963               1.000
P2            2.519               2.490
P3            0.103               0.106
RSS           0.053               0.054
R-square      0.984               0.984
There are some differences between the two models. The correlation between P2 and
P3 is 0.923 for the full model and 0.810 when P1 is fixed. The most striking difference
is in the Wald intervals for P2 and P3. When P1 is fixed, the Wald interval for P2 is
less than one-fourth of the interval for the full model. The interval for P3 is less than
one-fifth the interval for the full model. Let's see what information the C-W curves
provide about the uncertainty of the estimates. Here are the curves for the model with
P1 fixed:
Compare these curves with the curves for the full model. The C-W curve for P2 has
straightened out and is very close to the Wald interval. If we were to plot the P2 C-W
curve for both models on the same axes, the wedge for the fixed P1 model would be
only a small slice of the wedge for the full model.
Example 4
Functions of Parameters
Frequently, researchers are not interested in the estimates of the parameters
themselves, but instead want to make statements about functions of parameters. For
example, in a logistic model, they may want to estimate LD50 and LD90 and determine
the variability of these estimates. You can specify functions of parameters in Nonlinear
Model. SYSTAT evaluates the function at each iteration and prints the standard error
and the Wald interval for the estimate after the last iteration.
We look at a quadratic function described by Cook and Weisberg. Here is the Quick
Graph that results from fitting the model:
This function reaches its maximum at -b/(2c). However, for the data given by Cook and
Weisberg, this maximum is close to the smallest x. That is, to the left of the maximum,
there is little of the response curve.
In SYSTAT, you can estimate the maximum (and get Wald intervals) directly from
the original quadratic by using FUNPAR. The input is:
USE quad
NONLIN
MODEL y = a + b*x + c*x^2
FUNPAR max = -b/(2*c)
ESTIMATE
The parameter estimates are:

Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
A 0.034 0.117 0.292 -0.213 0.282
B 0.524 0.555 0.944 -0.647 1.694
C -1.452 0.534 -2.718 -2.579 -0.325
MAX 0.180 0.128 1.409 -0.090 0.450

Using the Wald interval, we estimate that the maximum response occurs for an x value
between -0.09 and 0.45.
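A one-line arithmetic check (Python) of the function of parameters, using the B and C estimates from the table above:

b, c = 0.524, -1.452
print(round(-b / (2 * c), 3))   # 0.18, agreeing with the MAX estimate above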
C-W Curves
To obtain the C-W confidence curves for MAX, we have to re-express the model so that
MAX is a parameter of the model:

b = -2*c*MAX

so

y = a - (2*c*MAX)*x + c*x^2
The original model is easy to compute because it is linear. The reparameterized model
is not as well-behaved, so we use estimates from the first run as starting values and
request C-W confidence curves:

MODEL y = a - (2*c*max)*x + c*x^2
SAVE quadcw / CI=max
ESTIMATE / START=0.034,-1.452, 0.180
SUBMIT quadcw

The C-W confidence curves describe our uncertainty about the x value at which the
expected response is maximized much better than the Wald interval does.
The picture provides clear information about the MAX response in the positive
direction. We can be confident that the value is less than 0.4 because the C-W curve is
lower than the Wald interval on the 95th percentile line. The lower bound is much less
clear; it could certainly be lower than the Wald interval indicates.
Example 5
Contouring the Loss Function
You can save loss function values along contour curves and then plot the loss function.
For this example, we use the BOD data (Bates and Watts, 1988). These data were taken
from stream samples in 1967 by Marske. Each sample bottle was inoculated with a
mixed culture of microorganisms, sealed, incubated, and opened periodically for
analysis of dissolved oxygen concentration.
The data are:

DAYS   1     2     3     4     5     7
BOD    8.3   10.3  19.0  16.0  15.6  19.8

where DAYS is time in days and BOD is the biochemical oxygen demand. The six BOD
values are averages of two analyses on each bottle. An exponential decay model with
a fixed rate constant was estimated to predict biochemical oxygen demand:

BOD = θ1*(1 - e^(-θ2*DAYS))
Let's look at the contours of the parameter space defined by THETA_2 with
THETA_1. We use loss function data values stored in the BODRS data file. Here's how
we created the file:
USE bod
NONLIN
MODEL bod = theta_1*(1-EXP(-theta_2*days))
PRINT=LONG
SAVE bodrs / RS
ESTIMATE
SUBMIT bodrs
The output follows:
Dependent variable is BOD

Source Sum-of-Squares df Mean-Square
Regression 1401.390 2 700.695
Residual 25.990 4 6.498

Total 1427.380 6
Mean corrected 107.213 5

Raw R-square (1-Residual/Total) = 0.982
Mean corrected R-square (1-Residual/Corrected) = 0.758
R(observed vs predicted) square = 0.758

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
THETA_1 19.143 2.496 7.670 12.213 26.072
THETA_2 0.531 0.203 2.615 -0.033 1.095

BOD BOD
Case Observed Predicted Residual
1 8.300 7.887 0.413
2 10.300 12.525 -2.225
3 19.000 15.252 3.748
4 16.000 16.855 -0.855
5 15.600 17.797 -2.197
6 19.800 18.678 1.122

Asymptotic Correlation Matrix of Parameters
THETA_1 THETA_2
THETA_1 1.000
THETA_2 -0.853 1.000
[Contour plot of the loss function: THETA_1 on the horizontal axis (5 to 40), THETA_2 on the vertical axis (0 to 7)]

The kidney-shaped area near the center of the plot is the region where the loss function
is minimized. Any parameter value combination (that is, any point inside the kidney)
produces approximately the same loss function.
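The same surface can be sketched outside SYSTAT by evaluating the least-squares loss on a grid. The Python fragment below uses the BOD values listed earlier and is only an illustration of what the saved RS file represents.

import numpy as np

days = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 7.0])
bod  = np.array([8.3, 10.3, 19.0, 16.0, 15.6, 19.8])

t1 = np.linspace(5, 40, 71)        # THETA_1 grid
t2 = np.linspace(0.05, 7, 70)      # THETA_2 grid
T1, T2 = np.meshgrid(t1, t2)

# loss(theta_1, theta_2) = sum over cases of (bod - theta_1*(1 - exp(-theta_2*days)))^2
loss = sum((b - T1 * (1 - np.exp(-T2 * d))) ** 2 for d, b in zip(days, bod))

i, j = np.unravel_index(np.argmin(loss), loss.shape)
print(round(T1[i, j], 1), round(T2[i, j], 2), round(loss[i, j], 1))
# the grid minimum falls near the converged estimates (roughly 19 and 0.5, loss near 26)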
Example 6
Maximum Likelihood Estimation
Because NONLIN includes a loss function, you can maximize the likelihood of a
function in the model equation. The way to do this is to minimize the negative of the
log-likelihood.
Here is an example using the IRIS data. Let's compute the maximum likelihood
estimates of the mean and variance of SEPALWID assuming a normal distribution for
the first species in the IRIS data. For a sample of n independent normal random
variables, the log-likelihood function is:

L(μ, σ²) = -(n/2)*ln(2π) - (n/2)*ln(σ²) - (1/(2σ²))*Σ(Xi - μ)²
However, we can use the ZDF function as a shortcut. In this example, we minimize the
negative of the log-likelihood with LOSS and thus maximize the likelihood. SYSTAT's
small default starting values for MEAN and SIGMA (0.101 and 0.100) will produce
very large z scores ((x - mean) / sigma) and values of the density close to 0, so we
arbitrarily select larger starting values. We use the IRIS data. Under SELECT, we
specify SPECIES = 1. Then, we type in our LOSS statement. Finally, we use
ESTIMATE's START option to specify start values (2,2):

USE iris
NONLIN
SELECT species=1
LOSS = -log(zdf(sepalwid,mean,sigma))
ESTIMATE / START=2,2

The estimates are:

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
MEAN 3.428 0.053 65.255 3.322 3.534
SIGMA 0.375 0.037 10.102 0.301 0.450

Note that the least squares estimate of sigma (0.379) computed using STATISTICS is
larger than the biased maximum likelihood estimate here (0.375).
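The same idea can be checked independently of SYSTAT by minimizing the negative log-likelihood written out explicitly. The Python sketch below uses a synthetic sample as a stand-in for SEPALWID.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(3.4, 0.4, size=50)          # synthetic stand-in for SEPALWID

def negloglik(theta):
    mu, sigma = theta
    if sigma <= 0:
        return np.inf                      # keep the search in the legal region
    z = (x - mu) / sigma
    return 0.5 * x.size * np.log(2 * np.pi) + x.size * np.log(sigma) + 0.5 * np.sum(z ** 2)

fit = minimize(negloglik, x0=[2.0, 2.0], method="Nelder-Mead")
mu_hat, sigma_hat = fit.x
# The ML estimate of sigma divides by n, so it is a little smaller than the
# usual (n - 1) estimate.
print(round(mu_hat, 3), round(sigma_hat, 3), round(x.std(ddof=1), 3))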
Example 7
Iteratively Reweighted Least Squares for Logistic Models
Cox (1970) reports the following data on tests among objects for failures after certain
times. These data are in the COX data file; FAILURE is the number of failures and
COUNT is the total number of tests.
Cox uses a logistic model to fit the failures:

estimate = count*EXP(-b0 - b1*time) / (1 + EXP(-b0 - b1*time))

The log-likelihood function for the logit model is:

L(b0, b1) = Σ [ p*ln(estimate) + (1 - p)*ln(1 - estimate) ]

where the sum is over all observations. Because the counts differ at each time, the
variances of the failures also differ. If FAILURE is randomly sampled from a binomial,
then

VAR(failure) = estimate*(count - estimate) / count

Therefore, the weight is 1/variance:

w = count / (estimate*(count - estimate))

We use these variances to weight each case in the estimation. On each iteration, the
variances are recalculated from the new estimates and used anew in computing the
weighted loss function.
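The reweighting scheme itself is easy to sketch outside SYSTAT. The Python fragment below uses made-up times, counts, and failures (the COX values are not reproduced here) and alternates a weighted least-squares fit with a recomputation of the weights; it illustrates the idea rather than SYSTAT's internal algorithm.

import numpy as np
from scipy.optimize import least_squares

time  = np.array([10.0, 20.0, 30.0, 40.0])   # hypothetical test times
count = np.array([50.0, 50.0, 50.0, 50.0])   # hypothetical numbers of tests
fail  = np.array([2.0, 5.0, 11.0, 20.0])     # hypothetical failure counts

def expected(b):
    p = np.exp(-b[0] - b[1] * time) / (1 + np.exp(-b[0] - b[1] * time))
    return count * p                          # expected number of failures

b = np.array([1.0, 0.0])
w = np.ones_like(fail)
for _ in range(10):                           # a few reweighting passes
    fit = least_squares(lambda bb: np.sqrt(w) * (fail - expected(bb)), b)
    b = fit.x
    est = expected(b)
    w = count / (est * (count - est))         # 1 / VAR(failure)

print(np.round(b, 3))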
In the following commands, we use RESET to recompute the weight after each
iteration. The SCALE option of ESTIMATE rescales the mean square error to 1 at the
end of the iterations. The commands are:
USE cox
NONLIN
PRINT = LONG
LET w = 1
WEIGHT = w
MODEL failure = count*EXP(-b0 - b1*time)/,
(1 + EXP(-b0 - b1*time))
RESET w = count / (estimate*(count-estimate))
ESTIMATE / SCALE

The output follows:
Iteration
No. Loss B0 B1
0 .162222D+03 .101000D+00 .102000D+00
1 .161785D+02 .272314D+01-.109931D-01
2 .325354D+01 .419599D+01-.509510D-01
3 .754172D+00 .510574D+01-.736890D-01
4 .665897D+00 .539079D+01-.801623D-01
5 .674806D+00 .541501D+01-.806924D-01
6 .674876D+00 .541518D+01-.806960D-01

Dependent variable is FAILURE

Source Sum-of-Squares df Mean-Square
Regression 13.038 2 6.519
Residual 0.675 2 0.337

Total 13.712 4
Mean corrected 10.539 3

Raw R-square (1-Residual/Total) = 0.951
Mean corrected R-square (1-Residual/Corrected) = 0.936
R(observed vs predicted) square = 0.988
Standard Errors of Parameters are rescaled

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
B0 5.415 0.728 7.443 3.989 6.841
B1 -0.081 0.022 -3.610 -0.125 -0.037

FAILURE FAILURE
Case Observed Predicted Residual Case Weight
1 0.0 0.427 -0.427 2.360
2 2.000 2.132 -0.132 0.475
3 7.000 6.013 0.987 0.173
4 3.000 3.427 -0.427 0.371

Jennrich and Moore (1975) show that this method can be used for maximum likelihood
estimation of parameters from a distribution in the exponential family.
Example 8
Robust Estimation (Measures of Location)
Robust estimators provide methods other than the mean, median, or mode to estimate
the center of a distribution. The sample mean is the least squares estimate of location;
that is, it is the point at which the squared deviations of the sample values are at a
minimum. (The sample medians minimize absolute deviations instead of squared
deviations.) In terms of weights, the usual mean assigns a weight of 1.0 to each
observation, while the robust methods assign smaller weights to residuals far from the
center.
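The Python sketch below contrasts the least-squares estimate (the mean) with a trimmed mean, the median, and a simple Huber-type weighted estimate on a small made-up sample with one outlier. The trimming and Huber steps are rough approximations of the ideas used here, not SYSTAT's algorithms.

import numpy as np

x = np.array([3.0, 3.2, 3.4, 3.4, 3.5, 3.6, 3.7, 2.0])   # 2.0 is an outlier

mean, median = x.mean(), np.median(x)

# Trimmed mean: drop the observations with the largest absolute residuals.
k = max(1, int(round(0.1 * x.size)))
trimmed = np.delete(x, np.argsort(np.abs(x - mean))[-k:]).mean()

def huber_location(x, c=1.5, iters=20):
    """Downweight residuals beyond c MAD units and re-estimate the center."""
    mu = np.median(x)
    mad = np.median(np.abs(x - mu)) or 1e-9    # fixed MAD scale
    for _ in range(iters):
        r = np.abs(x - mu) / mad
        w = np.minimum(1.0, c / np.maximum(r, 1e-9))
        mu = np.sum(w * x) / np.sum(w)
    return mu

print(round(mean, 3), round(trimmed, 3), round(median, 3), round(huber_location(x), 3))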
In this example, we use sepal width of the Setosa iris flowers and SELECT
SPECIES = 1. We request the usual sample mean and then ask for a 10% trimmed
mean, a Hampel estimator, and the median. But first, let's view the distribution
graphically. Here are a box-and-whisker display and a dit plot of the data:

[Box plot and dit plot of SEPALWID, both plotted on a scale from 2 to 5]
Except for the outlier at the left, the distribution of SEPALWID is slightly right-skewed.
Mean
In the maximum likelihood example, we requested maximum likelihood estimates of
the mean and standard deviation. Here is the least squares estimate:
USE iris
NONLIN
SELECT species = 1
MODEL sepalwid = mean
ESTIMATE
The output is:
Iteration
No. Loss MEAN
0 .299377D+03 .101000D+01
1 .704080D+01 .342800D+01
2 .704080D+01 .342800D+01
3 .704080D+01 .342800D+01

Dependent variable is SEPALWID

Source Sum-of-Squares df Mean-Square
Regression 587.559 1 587.559
Residual 7.041 49 0.144

Total 594.600 50
Mean corrected 7.041 49

Raw R-square (1-Residual/Total) = 0.988
Mean corrected R-square (1-Residual/Corrected) = 0.0
R(observed vs predicted) square = 0.0

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
MEAN 3.428 0.054 63.946 3.320 3.536

Trimmed Mean
We enter the following commands after viewing the results for the mean. Note that
SYSTAT resets the starting values to their defaults when a new model is specified. If
MODEL is not given, SYSTAT uses the final values from the last calculation as starting
values for the current task.
For this trimmed mean estimate, SYSTAT deletes the five cases (0.1 * 50 = 5) with
the most extreme residuals. The input is:
MODEL sepalwid = trimmean
ROBUST TRIM = 0.1
ESTIMATE

The output follows:
Iteration
No. Loss TRIMMEAN
0 .560487D+03 .101000D+00
1 .704080D+01 .342800D+01
2 .344888D+01 .342800D+01
3 .337200D+01 .338667D+01
4 .337200D+01 .338667D+01
5 .337200D+01 .338667D+01

TRIM robust regression: 45 cases have positive psi-weights
The average psi-weight is 1.00000

Dependent variable is SEPALWID

Zero weights, missing data or estimates reduced degrees of freedom
Source Sum-of-Squares df Mean-Square
Regression 587.474 1 587.474
Residual 7.126 44 0.162

Total 594.600 45
Mean corrected 7.041 44

Raw R-square (1-Residual/Total) = 0.988
Mean corrected R-square (1-Residual/Corrected) = 0.0
R(observed vs predicted) square = 0.0

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
TRIMMEAN 3.387 0.060 56.451 3.266 3.508

The trimmed estimate deletes the outlier, plus the four flowers on the right side of the
distribution with width equal to or greater than 4.0 (if you select the LONG mode of
output, you would see that these flowers have the largest residuals).

Hampel
We now request a Hampel estimator using the default values for its parameters.

MODEL sepalwid = hamp_est
ROBUST HAMPEL
ESTIMATE

The output is:
Iteration
No. Loss HAMP_EST
0 .560487D+03 .101000D+00
1 .704080D+01 .342800D+01
2 .509172D+01 .342800D+01
3 .507163D+01 .341620D+01
4 .506858D+01 .341450D+01
5 .506825D+01 .341431D+01
6 .506822D+01 .341429D+01
7 .506821D+01 .341429D+01
8 .506821D+01 .341429D+01

HAMPEL robust regression: 50 cases have positive psi-weights
The average psi-weight is 0.94551
Dependent variable is SEPALWID

Source Sum-of-Squares df Mean-Square
Regression 587.550 1 587.550
Residual 7.050 49 0.144

Total 594.600 50
Mean corrected 7.041 49

Raw R-square (1-Residual/Total) = 0.988
Mean corrected R-square (1-Residual/Corrected) = 0.0
R(observed vs predicted) square = 0.0

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
HAMP_EST 3.414 0.054 63.648 3.306 3.522
Median
We let NONLIN minimize the absolute value of the residuals for an estimate of the
median.
MODEL sepalwid = median
ROBUST ABSOLUTE
ESTIMATE

The output is:
Iteration
No. Loss MEDIAN
0 .299377D+03 .101000D+01
1 .143680D+02 .342800D+01
2 .142988D+02 .341647D+01
3 .142499D+02 .340831D+01
4 .142214D+02 .340357D+01
5 .142081D+02 .340135D+01
6 .142028D+02 .340047D+01
7 .142010D+02 .340016D+01
8 .142003D+02 .340005D+01
9 .142001D+02 .340002D+01
10 .142000D+02 .340001D+01
11 .142000D+02 .340000D+01
12 .142000D+02 .340000D+01
13 .142000D+02 .340000D+01

ABSOLUTE robust regression: 50 cases have positive psi-weights
The average psi-weight is 2418627.93032

Dependent variable is SEPALWID

Source Sum-of-Squares df Mean-Square
Regression 587.520 1 587.520
Residual 7.080 49 0.144

Total 594.600 50
Mean corrected 7.041 49

Raw R-square (1-Residual/Total) = 0.988
Mean corrected R-square (1-Residual/Corrected) = 0.0
R(observed vs predicted) square = 0.0

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
MEDIAN 3.400 . . . .

If you request the median for these data in the Basic Statistics procedure, the value is 3.4.

Example 9
Regression
Usually, you would not use NONLIN for linear regression because other procedures are
available. If, however, you are concerned about the influence of outliers on the estimates
of the coefficients, you should try one of Nonlinear Model's robust procedures.
The example uses the OURWORLD data file and we model the relation of military
expenditures to gross domestic product using information reported by 57 countries to
the United Nations. Each country is a case in our file and MIL and GDP_CAP are our
two variables. In the transformation example for linear regression, we discovered that
both variables require a log transformation, and that Iraq and Libya are outliers.
Here is a scatterplot of the data. The solid line is the least squares line of best fit for
the complete sample (with its corresponding confidence band); the dotted line (and its
confidence band) is the regression line after deleting Iraq and Libya from the sample.
How do robust lines fit within original confidence bands?
Visually, we see the dotted line-of-best fit falls slightly below the solid line for the
complete sample. More striking, however, is the upper curve for the confidence band:
the dotted line is considerably lower than the solid one.
We can use NONLIN to fit a least squares regression line with the following input:
USE ourworld
NONLIN
LET log_mil = L10(mil)
LET log_gdp = L10(gdp_cap)
MODEL log_mil = intercept + slope*log_gdp
ESTIMATE
The output is:
Dependent variable is LOG_MIL

Zero weights, missing data or estimates reduced degrees of freedom
Source Sum-of-Squares df Mean-Square
Regression 194.332 2 97.166
Residual 6.481 54 0.120

Total 200.813 56
Mean corrected 24.349 55

Raw R-square (1-Residual/Total) = 0.968
Mean corrected R-square (1-Residual/Corrected) = 0.734
R(observed vs predicted) square = 0.734

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
INTERCEPT -1.308 0.257 -5.091 -1.822 -0.793
SLOPE 0.909 0.075 12.201 0.760 1.058

The estimate of the intercept (-1.308) and the slope (0.909) are the same as those
produced by GLM. The residual for Iraq (1.216) is identified as an outlier; its
Studentized value is 4.004. Libya's residual is 0.77.

1st Power
We now estimate the model using a least absolute values loss function (first power
regression). We do not respecify the model, so by default, SYSTAT uses our last
estimates as starting values. To avoid this, we specify START without an argument.
ROBUST ABSOLUTE
ESTIMATE / START

The output follows:
Iteration
No. Loss INTERCEPT SLOPE
0 .119361D+03 .101000D+00 .102000D+00
1 .147084D+02-.130751D+01 .909014D+00
2 .146579D+02-.135163D+01 .919628D+00
3 .146302D+02-.138083D+01 .926673D+00
4 .146142D+02-.140215D+01 .931814D+00
5 .146139D+02-.140402D+01 .932266D+00
6 .146135D+02-.140636D+01 .932831D+00
7 .146130D+02-.140918D+01 .933513D+00
8 .146125D+02-.141248D+01 .934310D+00
9 .146118D+02-.141622D+01 .935214D+00
10 .146111D+02-.142033D+01 .936207D+00
11 .146104D+02-.142471D+01 .937267D+00
12 .146096D+02-.142924D+01 .938362D+00
13 .146089D+02-.143375D+01 .939451D+00
14 .146082D+02-.143801D+01 .940481D+00
15 .146075D+02-.144174D+01 .941383D+00
16 .146070D+02-.144461D+01 .942075D+00
17 .146068D+02-.144633D+01 .942491D+00
18 .146066D+02-.144701D+01 .942656D+00
19 .146066D+02-.144717D+01 .942695D+00
20 .146066D+02-.144720D+01 .942701D+00
21 .146066D+02-.144720D+01 .942702D+00

ABSOLUTE robust regression: 56 cases have positive psi-weights
The average psi-weight is 40210712082202.71000

Dependent variable is LOG_MIL

Zero weights, missing data or estimates reduced degrees of freedom
Source Sum-of-Squares df Mean-Square
Regression 194.271 2 97.136
Residual 6.542 54 0.121

Total 200.813 56
Mean corrected 24.349 55

Raw R-square (1-Residual/Total) = 0.967
Mean corrected R-square (1-Residual/Corrected) = 0.731
R(observed vs predicted) square = 0.734

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
INTERCEPT -1.447 . . . .
SLOPE 0.943 . . . .

Huber
For the Hampel estimator, the weights begin to be less than 1.0 after the value of the
first parameter (1.7). For this Huber estimate, we let the weight taper off sooner by
setting the parameter at 1.5.

ROBUST HUBER = 1.5
ESTIMATE / START

The output is:
Iteration
No. Loss INTERCEPT SLOPE
0 .119361D+03 .101000D+00 .102000D+00
1 .648115D+01-.130751D+01 .909014D+00
2 .428867D+01-.130751D+01 .909014D+00
3 .426728D+01-.133847D+01 .913898D+00
4 .417969D+01-.135733D+01 .918472D+00
5 .418018D+01-.136897D+01 .921389D+00
6 .418202D+01-.137260D+01 .922285D+00
7 .418261D+01-.137367D+01 .922546D+00
8 .418278D+01-.137398D+01 .922623D+00
9 .418283D+01-.137407D+01 .922646D+00
10 .418285D+01-.137410D+01 .922653D+00
11 .418285D+01-.137411D+01 .922655D+00
12 .418285D+01-.137411D+01 .922655D+00
13 .418285D+01-.137411D+01 .922656D+00

HUBER robust regression: 56 cases have positive psi-weights
The average psi-weight is 0.92050

Dependent variable is LOG_MIL

Zero weights, missing data or estimates reduced degrees of freedom
Source Sum-of-Squares df Mean-Square
Regression 194.305 2 97.153
Residual 6.508 54 0.121

Total 200.813 56
Mean corrected 24.349 55

Raw R-square (1-Residual/Total) = 0.968
Mean corrected R-square (1-Residual/Corrected) = 0.733
R(observed vs predicted) square = 0.734

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
INTERCEPT -1.374 0.255 -5.398 -1.885 -0.864
SLOPE 0.923 0.073 12.567 0.775 1.070

5% Trim
In the linear regression version of this example, we removed Iraq from the sample by
specifying:

SELECT mil < 700 or SELECT country$ <> 'Iraq'

Here, we ask for 5% trimming (0.05*56 = 2.8, or 2 cases):

ROBUST TRIM = .05
ESTIMATE / START

The output is:
Iteration
No. Loss INTERCEPT SLOPE
0 .119361D+03 .101000D+00 .102000D+00
1 .648115D+01-.130751D+01 .909014D+00
2 .440626D+01-.130751D+01 .909014D+00
3 .433275D+01-.133192D+01 .905350D+00
4 .433275D+01-.133192D+01 .905350D+00
5 .433275D+01-.133192D+01 .905350D+00

TRIM robust regression: 54 cases have positive psi-weights
The average psi-weight is 1.00000

Dependent variable is LOG_MIL

Zero weights, missing data or estimates reduced degrees of freedom
Source Sum-of-Squares df Mean-Square
Regression 194.256 2 97.128
Residual 6.557 52 0.126

Total 200.813 54
Mean corrected 24.349 53

Raw R-square (1-Residual/Total) = 0.967
Mean corrected R-square (1-Residual/Corrected) = 0.731
R(observed vs predicted) square = 0.734

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
INTERCEPT -1.332 0.264 -5.049 -1.861 -0.803
SLOPE 0.905 0.077 11.829 0.752 1.059
Example 10
Piecewise Regression
Sometimes we need to fit two different regression functions to the same data. For
example, sales of a certain product might be strongly related to quality when
advertising budgets are below a certain level; that is, when sales are generated by
word of mouth. Above this advertising budget level, sales may be less strongly
related to quality of goods and driven more by marketing and advertising factors. In these
cases, we can fit different sections of the data with different models. It is easier to
combine these into a single model, however.
Here is an example of a quadratic function with a ceiling using data from Gilfoil
(1982). This particular study is one of several that show that dialog menu interfaces are
preferred by inexperienced computer users and that command based interfaces are
preferred by experienced users. The data for one subject are in the file LEARN. The
variable SESSION is the session number and TASKS is the number of user-controlled
tasks (as opposed to dialog) chosen by the subject during a session.
We fit these data with a quadratic model for earlier sessions and a ceiling for later
sessions. We use NONLIN to estimate the point where the learning hits this ceiling (at
six tasks). The input is:
USE learn
NONLIN
PRINT = LONG
MODEL tasks = b*session^2*(session<known) +,
b*known^2*(session>=known)
ESTIMATE

Note that the expressions (SESSION<KNOWN and SESSION>=KNOWN) control which
function is to be used: the quadratic or the horizontal line. The output follows:
Iteration
No. Loss B KNOWN
0 .313871D+03 .101000D+01 .102000D+01
1 .207272D+03 .505000D+00 .204177D+01
2 .175758D+03 .252500D+00 .311938D+01
3 .152604D+03 .126250D+00 .461304D+01
4 .122355D+03 .451977D-01 .802625D+01
5 .270318D+02 .552896D-01 .112719D+02
6 .161372D+02 .544354D-01 .105367D+02
7 .145557D+02 .620140D-01 .967811D+01
8 .144181D+02 .633275D-01 .965934D+01
9 .144181D+02 .633275D-01 .965971D+01
10 .144181D+02 .633275D-01 .965971D+01

Dependent variable is TASKS

Source Sum-of-Squares df Mean-Square
Regression 445.582 2 222.791
Residual 14.418 18 0.801

Total 460.000 20
Mean corrected 140.000 19

Raw R-square (1-Residual/Total) = 0.969
Mean corrected R-square (1-Residual/Corrected) = 0.897
R(observed vs predicted) square = 0.912

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
B 0.063 0.007 8.762 0.048 0.079
KNOWN 9.660 0.594 16.269 8.412 10.907

TASKS TASKS
Case Observed Predicted Residual
1 0.0 0.063 -0.063
2 0.0 0.253 -0.253
3 0.0 0.570 -0.570
4 1.000 1.013 -0.013
5 0.0 1.583 -1.583
6 1.000 2.280 -1.280
7 1.000 3.103 -2.103
8 6.000 4.053 1.947
9 6.000 5.130 0.870
10 6.000 5.909 0.091
11 5.000 5.909 -0.909
12 6.000 5.909 0.091
13 6.000 5.909 0.091
14 6.000 5.909 0.091
15 6.000 5.909 0.091
16 6.000 5.909 0.091
17 6.000 5.909 0.091
18 6.000 5.909 0.091
19 6.000 5.909 0.091
20 6.000 5.909 0.091

Asymptotic Correlation Matrix of Parameters
B KNOWN
B 1.000
KNOWN -0.928 1.000
From the Quick Graph, we see that the fit at the lower end is not impressive. We might
want to fit a truncated logistic model instead of a quadratic because learning is more
often represented with this type of function. This model would have a logistic curve at
the lower values of SESSION and a flat ceiling line at the upper end. We would also
need to specify a LOSS function to make the fit maximum likelihood.
Piecewise linear regression models with unknown breakpoints can be fitted
similarly. These models look like this:
y = b0 + b1*x + b2*(x-break)*(x>break)
If the break point is known, then you can use GLM to do ordinary regression to fit the
separate pieces. See Neter, Wasserman, and Kutner (1985) for an example.
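A quick Python sketch of that model form on synthetic data (scipy's least_squares stands in for NONLIN here; the data and starting values are made up):

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 60)
y = 1 + 0.5 * x + 1.5 * np.clip(x - 6, 0, None) + rng.normal(0, 0.2, x.size)

def resid(p):
    b0, b1, b2, brk = p
    return y - (b0 + b1 * x + b2 * (x - brk) * (x > brk))

fit = least_squares(resid, x0=[0.0, 1.0, 1.0, 5.0])
print(np.round(fit.x, 2))   # roughly recovers [1, 0.5, 1.5, 6]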
Example 11
Kinetic Models
You can also use NONLIN to test kinetic models. The following analysis models
competitive inhibition for an enzyme inhibitor. The data are adapted from a conference
session on statistical computing with microcomputers (Greco, et al., 1982). We will fit
three variables: initial enzyme velocity (V), concentration of the substrate (S), and
concentration of the inhibitor (I). The parameters of the model are the maximum
velocity (VMAX), the Michaelis constant (KM) and the dissociation constant of the
enzyme-inhibitor complex (KIS). The input is:
USE ENZYME
NONLIN
PRINT = LONG
MODEL V = VMAX*S / (KM*(1 + I/KIS) + S)
ESTIMATE / MIN = 0,0,0

The output follows:
Iteration
No. Loss VMAX KM KIS
0 .356767D+01 .101000D+01 .102000D+01 .103000D+01
1 .228856D+01 .100833D+01 .932638D+00 .122786D-06
2 .228607D+01 .100847D+01 .932573D+00 .124013D-04
3 .204340D+01 .102226D+01 .925997D+00 .125254D-02
4 .269664D-01 .125640D+01 .818472D+00 .227823D-01
5 .137491D-01 .125852D+01 .844568D+00 .268902D-01
6 .136979D-01 .125946D+01 .846743D+00 .271757D-01
7 .136979D-01 .125952D+01 .846854D+00 .271759D-01
8 .136979D-01 .125952D+01 .846857D+00 .271760D-01

Dependent variable is V

Source Sum-of-Squares df Mean-Square
Regression 15.404 3 5.135
Residual 0.014 43 0.000

Total 15.418 46
Mean corrected 5.763 45

Raw R-square (1-Residual/Total) = 0.999
Mean corrected R-square (1-Residual/Corrected) = 0.998
R(observed vs predicted) square = 0.998

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
VMAX 1.260 0.012 104.191 1.235 1.284
KM 0.847 0.027 31.876 0.793 0.900
KIS 0.027 0.001 31.033 0.025 0.029

You could try alternative models for these data, such as one for uncompetitive
inhibition,

MODEL V = VMAX*S / (KM + S + S*I/KII)

or one for noncompetitive inhibition,

MODEL V = VMAX*S / (KM + KM/KIS + S + S*I/KII)

where KII is the dissociation constant of the enzyme-inhibitor-substrate complex.

Example 12
Minimizing an Analytic Function
You can also use NONLIN to find the minimum of an algebraic function. Since this
requires no data, you need a trick. Use any data file. We do not use any of the variables
in this file, but SYSTAT requires a data file to be open to do a nonlinear estimation.
The minimization input is:
USE dose
NONLIN
LOSS=100*(U-V^2)^2+(1-V)^2
ESTIMATE / SIMPLEX
This particular function is from Rosenbrock (1960). We are using SIMPLEX to save
space and because it generally does better with algebraic expressions which incur
roundoff error. Here is the result:

Iteration
No. Loss U V
0 .102098D+01 .1010D+01 .1020D+01
1 .931215D+00 .1262D+01 .1126D+01
2 .216987D-02 .1005D+01 .1003D+01
3 .593092D-05 .9992D+00 .9996D+00
4 .689847D-08 .1000D+01 .1000D+01
5 .162557D-10 .1000D+01 .1000D+01
6 .793924D-13 .1000D+01 .1000D+01
7 .264812D-15 .1000D+01 .1000D+01
8 .140037D-17 .1000D+01 .1000D+01
9 .110445D-20 .1000D+01 .1000D+01
10 .165203D-23 .1000D+01 .1000D+01
Final value of loss function is 0.000

Wald Confidence Interval
Parameter Estimate A.S.E. Param/ASE Lower < 95%> Upper
U 1.000 . . . .
V 1.000 . . . .
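As an independent cross-check (outside SYSTAT), the same loss function can be handed to any derivative-free optimizer; the Python sketch below uses scipy's Nelder-Mead simplex and the same starting values.

from scipy.optimize import minimize

loss = lambda p: 100 * (p[0] - p[1] ** 2) ** 2 + (1 - p[1]) ** 2   # U = p[0], V = p[1]

fit = minimize(loss, x0=[1.01, 1.02], method="Nelder-Mead")
print(fit.x, fit.fun)   # both parameters go to 1 and the loss goes to 0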
Computation
Algorithms
The Quasi-Newton method is described in Fletcher (1972) and is sometimes called
modified Fletcher/Powell. Modifications include the LDL' Cholesky factorization of
the updated Hessian matrix. It is the same algorithm employed in SERIES for ARIMA
estimation. The Simplex method is adapted from O'Neill (1971), with several revisions
noted in Griffiths and Hill (1985).
The loss function is computed in two steps. First, the model statement is evaluated
for a case using current values of the parameters and data. Second, the LOSS statement
is evaluated using ESTIMATE (computed as the result of the model statement
evaluation) and other parameter and data values. These two steps are repeated for all
cases, over which the result of the loss function is summed. The summed LOSS is then
minimized by the Quasi-Newton or Simplex procedure. Step halvings are used in the
minimizations when model or loss statement evaluations overflow or result in illegal
values. If repeated step halvings down to machine epsilon (error limit) fail to remedy
this situation, iterations cease with an "Illegal values" message.
Asymptotic standard errors are computed by the central differencing finite
approximation of the Hessian matrix. Some nonlinear regression programs compute
standard errors by squaring the Jacobian matrix of first derivatives. Others use
different methods altogether. For linear models, all valid methods produce identical
results. For some nonlinear models, however, the results may differ. The Hessian
approach, which works well for nonlinear regression, is also ideally suited for
NONLIN's maximum likelihood estimation.
Missing Data
Missing values are handled according to the conventions of SYSTAT BASIC. That is,
missing values propagate in algebraic expressions. For example, X + . is a missing
value. The expression X = . is not missing, however. It is 1 if X is missing and 0 if not.
Thus, you can use logical expressions to put conditions on model or loss functions;
consider the following loss function:
(X<>.)*(Y - ESTIMATE)^2 + (X=.)*(Z - ESTIMATE)^2
Illegal expressions (such as division by 0 and negative square roots) are set to missing
values. If this happens when computing the loss statement for a particular case, the loss
function is set to an extremely large value (10^35). This way, parameter estimates are
forced to move away from regions of the parameter space that yield illegal function
evaluations.
Overflows (such as a positive number with an extremely large exponent) are set to
machine overflow (10^35). Negative overflows are set to the negative of this value.
Overflows usually cause the loss function to be large, so the program is forced to move
away from estimates that produce overflows.
These features mean that NONLIN tends to crash less frequently than most other
nonlinear estimation programs. It will continue for several iterations to try parameter
values that lower the loss value, even when some of these lead to a seemingly hopeless
result. It is your responsibility to check whether final estimates are reasonable,
however, by using both estimation methods, different starting values, and other
options.
References
Bates, D. M. and Watts, D. G. (1988). Nonlinear regression analysis and its applications.
New York: John Wiley & Sons, Inc.
Clarke, G. P. Y. (1987). Approximate confidence limits for a parameter function in
nonlinear regression. Journal of the American Statistical Association, 82, 221–230.
Cook, R. D. and Weisberg, S. (1990). Confidence curves in nonlinear regression. Journal
of the American Statistical Association, 85, 544–551.
Cox, D. R. (1970). The analysis of binary data. New York: Halsted Press.
Fletcher, R. (1972). FORTRAN subroutines for minimization by Quasi-Newton methods.
AERE R. 7125.
Gilfoil, D. M. (1982). Warming up to computers: A study of cognitive and affective
interaction over time. In Proceedings: Human factors in computer systems.
Washington, D.C.: Association for Computing Machinery.
Greco, W. R., et al. (1982). ROSFIT: An enzyme kinetics nonlinear regression curve fitting
package for a microcomputer. Computers and Biomedical Research, 15, 39–45.
Griffiths, P. and Hill, I. D. (1985). Applied statistics algorithms. Chichester: Ellis Horwood
Limited.
Hill, M. A. and Engelman, L. (1992). Graphical aids for nonlinear regression and
discriminant analysis. In Y. Dodge and J. Whittaker, eds., Computational Statistics,
vol. 2: Proceedings of the 10th Symposium on Computational Statistics. Physica-Verlag,
111–126.
Jennrich, R. I. and Moore, R. H. (1975). Maximum likelihood estimation by means of
nonlinear least squares. Proceedings of the Statistical Computing Section, American
Statistical Association, 57–65.
Neter, J., Wasserman, W., and Kutner, M. (1985). Applied linear statistical models,
2nd ed. Homewood, Ill.: Richard D. Irwin, Inc.
O'Neill, R. (1971). Function minimization using a simplex procedure. Algorithm AS 47.
Applied Statistics, 338.
Rousseeuw, P. J. and Leroy, A. M. (1987). Robust regression and outlier detection. New
York: John Wiley & Sons, Inc.
Chapter 21
Nonparametric Statistics
Leland Wilkinson
Nonparametric tests compute nonparametric statistics for groups of cases and pairs of
variables. Tests are available for two or more independent groups of cases, two or
more dependent variables, and for the distribution of a single variable.
Nonparametric tests do not assume that the data conform to a particular probability
distribution. Nonparametric models are often appropriate when the usual parameters,
such as mean and standard deviation based on normal theory, do not apply. Usually,
however, some other assumptions about shape and continuity are made. Note that if
you can find normalizing transformations for your data that allow you to use
parametric tests, you will usually be better off doing so.
Several nonparametric tests are available. The Kruskal-Wallis test and the two-
sample Kolmogorov-Smirnov test measure differences of a single variable across two
or more independent groups of cases. The sign test, the Wilcoxon signed-rank test,
and the Friedman test measure differences among related samples. The one-sample
Kolmogorov-Smirnov test and the Wald-Wolfowitz runs test examine the distribution
of a single variable.
Many nonparametric statistics are computed elsewhere in SYSTAT. Correlations
calculates matrices of coefficients, such as Spearman's rho, Kendall's tau-b,
Guttman's mu2, and Goodman-Kruskal gamma. Descriptive Statistics offers stem-
and-leaf plots, and Box Plot offers box plots with medians and quartiles. Time Series
can perform nonmetric smoothing. Crosstabs can be used for chi-square tests of
independence. Multidimensional Scaling (MDS) and Cluster Analysis work with
nonmetric data matrices. Finally, you can use Rank to compute a variety of rank-order
statistics.
Note: Beware of using nonparametric procedures to rescue bad data. In most cases,
these procedures were designed to apply to categorical or ranked data, such as rank
judgments and binary data. If you have data that violate distributional assumptions for
linear models, you should consider transformations or robust models before retreating
to nonparametrics.
Statistical Background
Nonparametric statistics is a misnomer. The term is ordinarily used to describe a
heterogeneous group of procedures that require relatively minimal assumptions about
the shape of distributions underlying an analysis. Frequently, however, nonparametric
models include parameters. These parameters are not necessarily ones like μ and σ,
which we see in typical parametric tests based on normal theory, but they are
parameters in a class of mathematical functions nonetheless.
In this context, a better term for nonparametric is distribution-free. That is, the data
for this class of statistical tests are not assumed to follow a specific probability
distribution. This does not mean, however, that we make no assumptions about
distributions in nonparametric methods. For example, in the Mann-Whitney and
Kruskal-Wallis tests, we assume that the underlying populations are continuous and
have the same shape.
Rank (Ordinal) Data
An aspect of many nonparametric tests is that they are invariant under rank-order
transformations of the data values. In other words, we may change actual data values
as long as we preserve relative ranks, and the results of our hypothesis tests will not
change. Data that can be replaced by rank-order values without losing information are
often called rank or ordinal data. For example, if we believe that the list (25, 54,
107.6, 3400) contains only ordinal information, then we can replace it with the list (1,
2, 3, 4) without loss of information.
Categorical (Nominal) Data
Some nonparametric methods are invariant under permutation transformations. That
is, we can interchange data values and get the same results, provided that all cases
sharing one value before the transformation share a single value after it. Data that can
be treated like this are often called categorical or nominal. For example, if we believe
the list (1, 1, 5, 5, 10, 10, 10) contains only nominal information, then we can replace
it with the list (red, red, green, green, blue, blue, blue) without loss of information.
Robustness
Sometimes, we may think our data contain more than nominal or ordinal information,
but we want to be extremely conservative. For example, our data may contain extreme
outliers. We could eliminate these outliers, downweight them, or apply some nonlinear
transformation to reduce their influence. An alternative, however, would be to use a
nonparametric test based on ranks. If we can afford to lose some power by using a
nonparametric test, we can gain robustness. If we find significant results with a
nonparametric test, no skeptic can challenge us on the basis of scale artifacts or
outliers. This is not to say that you should retreat to nonparametric methods every time
you find a histogram that does not look normal. If you can find a simple normalizing
transformation that works, such as logging the data, you will almost always be better
off using normal parametric methods.
Nonparametric Statistics for
Independent Samples in SYSTAT
Kruskal-Wallis Main Dialog Box
For the Kruskal-Wallis test, the values of a variable are transformed to ranks (ignoring
group membership) to test that there is no shift in the center of the groups (that is, the
centers do not differ). This is the nonparametric analog of a one-way analysis of
variance. When there are only two groups, this procedure reduces to the Mann-
Whitney test, the nonparametric analog of the two-sample t test.
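The same comparison can be cross-checked outside SYSTAT. The short Python sketch below is illustrative only; it assumes the SciPy library and uses invented group samples, and SciPy's tie corrections may differ in detail from SYSTAT's output.

# Illustration with SciPy (not SYSTAT); the data are invented.
from scipy.stats import kruskal, mannwhitneyu

europe   = [68, 71, 74, 69, 77]   # hypothetical URBAN percentages
islamic  = [22, 30, 18, 27, 25]
newworld = [49, 55, 44, 60, 52]

h, p = kruskal(europe, islamic, newworld)      # ranks all values, ignoring group labels
print("Kruskal-Wallis H =", round(h, 3), "p =", round(p, 4))

u, p2 = mannwhitneyu(europe, newworld, alternative="two-sided")  # two-group special case
print("Mann-Whitney U =", u, "p =", round(p2, 4))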
To open the Kruskal-Wallis dialog box, from the menus choose:
Statistics
Nonparametric Tests
Kruskal-Wallis
Variable(s). SYSTAT computes a separate test for each variable in the Variable(s) list.
Grouping Variable. The grouping variable can be string or numeric.
Two-Sample Kolmogorov-Smirnov Main Dialog Box
The two-sample Kolmogorov-Smirnov test tests whether two independent samples
come from the same distribution by comparing the two-sample cumulative distribution
functions. The null hypothesis is that both samples come from exactly the same distribution.
The distributions can be organized as two variables (two columns) or as a single
variable (column) with a second variable that identifies group membership. The latter
layout is necessary when sample sizes differ.
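For readers who want to see the comparison of cumulative distribution functions in code, here is a minimal sketch outside SYSTAT (SciPy assumed, invented samples of unequal size); the statistic is the largest vertical distance between the two empirical CDFs.

# Illustrative only; x and y are invented samples.
from scipy.stats import ks_2samp

x = [68, 71, 74, 69, 77, 80]
y = [49, 55, 44, 60, 52]

d, p = ks_2samp(x, y)   # D = max |F1(t) - F2(t)| over the pooled values
print("D =", round(d, 3), "two-sided p =", round(p, 4))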
To open the Two-Sample Kolmogorov-Smirnov dialog box, from the menus choose:
Statistics
Nonparametric Tests
Two sample KS
Variable(s). If each sample is a separate variable, both variables must be selected.
Selecting three or more variables yields a separate test for each pair of variables. If you
select only one variable, you must identify the grouping variable.
Grouping Variable. If the grouping variable has three or more levels, separate tests of
each pair of levels result. Selecting multiple variables and a grouping variable yields a
test comparing the groups for the first variable only.
Using Commands
First, specify your data with USE filename. Continue with:
NPAR
KRUSKAL varlist*grpvar
KS varlist*grpvar
Nonparametric Statistics for
Related Variables in SYSTAT
A need for comparing variables frequently arises in before and after studies, where
each subject is measured before and after a treatment. Here your goal is to determine
if any difference in response can be attributed to chance alone. As a test, researchers
often use the sign test or the Wilcoxon signed-rank test. For these tests, the
measurements need not be collected at different points in time; they simply can be two
measures on the same scale for which you want to test differences. If you have more
than two measures for each subject, the Friedman test can be used.
Sign Tests Main Dialog Box
The sign test compares two related samples and is analogous to the paired t test for
parametric data. For each case, the sign test computes the sign of the difference
between two variables. This test is attractive because of its simplicity and the fact that
the variance of the first measure in each pair may differ from that of the second.
However, you may be losing information since the magnitude of each difference is
ignored.
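The logic is easy to reproduce by hand: count the positive and negative differences and refer the smaller count to a binomial(n, 0.5) distribution. A minimal Python sketch, not part of SYSTAT, with invented paired values and SciPy assumed:

# Sign test on paired data; the magnitudes of the differences are ignored.
from scipy.stats import binomtest

before = [10, 12, 9, 14, 11, 13, 10]
after  = [12, 15, 9, 17, 12, 16, 11]

diffs = [a - b for a, b in zip(after, before) if a != b]   # drop zero differences
n_pos = sum(d > 0 for d in diffs)
n_neg = sum(d < 0 for d in diffs)

result = binomtest(min(n_pos, n_neg), n=n_pos + n_neg, p=0.5)
print(n_pos, "positive,", n_neg, "negative, two-sided p =", round(result.pvalue, 4))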
To open the Sign Tests dialog box, from the menus choose:
Statistics
Nonparametric Tests
Sign
Selecting three or more variables yields separate tests for each pair of variables.
Wilcoxon Signed-Rank Test Main Dialog Box
To open the Wilcoxon Tests dialog box, from the menus choose:
Statistics
Nonparametric Tests
Wilcoxon
The Wilcoxon test compares the rank values of the variables you select, pair by pair,
and displays the count of positive and negative differences. For ties, the average rank
is assigned. It then computes the sum of ranks associated with positive differences and
the sum of ranks associated with negative differences. The test statistic is the lesser of
the two sums of ranks.
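A hedged sketch of the same calculation with SciPy (not SYSTAT; invented paired data). Like the description above, scipy.stats.wilcoxon works from the paired differences and bases the statistic on the smaller of the two rank sums.

from scipy.stats import wilcoxon

before = [10, 12, 9, 14, 11, 13, 10]
after  = [12, 15, 8, 17, 12, 16, 11]

w, p = wilcoxon(after, before)   # signed-rank test on the pairwise differences
print("W =", w, "two-sided p =", round(p, 4))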
Friedman Tests Main Dialog Box
To open the Friedman Tests dialog box, from the menus choose:
Statistics
Nonparametric Tests
Friedman
The Friedman test computes a Friedman two-way analysis of variance on selected
variables. This test is a nonparametric extension of the paired t test, where, instead of
two measures, each subject has n measures (n > 2). In other terms, it is a nonparametric
analog of a repeated measures analysis of variance with one group. The Friedman test
is often used for analyzing ranks of three or more objects by multiple judges. That is,
there is one case for each judge and the variables are the judges' ratings of several types
of wine, consumer products, or even how a set of mothers relate to their children. The
Friedman statistic is used to test the hypothesis that there is no systematic response or
pattern across the variables (ratings).
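As a cross-check outside SYSTAT, the sketch below (SciPy assumed; the judges' ratings are invented) runs the same test. Each row is a judge and each column an object being rated.

from scipy.stats import friedmanchisquare

# Ratings of three products by eight judges (rows are judges).
ratings = [
    [1, 2, 3], [2, 1, 3], [1, 2, 3], [1, 3, 2],
    [2, 1, 3], [1, 2, 3], [2, 3, 1], [1, 2, 3],
]
product_a, product_b, product_c = zip(*ratings)   # one sequence per product

stat, p = friedmanchisquare(product_a, product_b, product_c)
print("Friedman chi-square =", round(stat, 3), "p =", round(p, 4))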
Using Commands
First, specify your data with USE filename. Continue with:
NPAR
SIGN varlist
WILCOXON varlist
FRIEDMAN varlist
Nonparametric Statistics for
Single Samples in SYSTAT
One-Sample Kolmogorov-Smirnov Main Dialog Box
The one-sample Kolmogorov-Smirnov test is used to compare the shape and location
of a sample distribution to a specified distribution. The Kolmogorov-Smirnov test and
its generalizations are among the handiest of distribution-free tests. The test statistic is
based on the maximum difference between two cumulative distribution functions
(CDF). In the one-sample test, one of the CDFs is continuous and the other is discrete.
Thus, it is a companion test to a probability plot.
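A minimal sketch of the same idea outside SYSTAT (SciPy and NumPy assumed; the sample is simulated): the statistic is the largest gap between the sample's empirical CDF and the CDF of the hypothesized distribution.

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=50)

# Compare the sample's empirical CDF with the standard normal CDF.
d, p = kstest(x, "norm", args=(0.0, 1.0))
print("MaxDif =", round(d, 3), "two-sided p =", round(p, 3))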
To open the One-Sample Kolmogorov-Smirnov dialog box, from the menus choose:
Statistics
Nonparametric Tests
One sample KS
Options. Allows you to choose the test distribution. Many options allow you to specify
parameters of the hypothesized distribution. For example, if you choose a Uniform
distribution, you can specify values for Min and Max. Distributions include:
n Uniform. Compares the data to the uniform(a,b) distribution.
n Normal. Compares the data to the normal distribution with the specified mean and
standard deviation.
n t. Compares the data to the t distribution with the specified degrees of freedom.
n F. Compares the data to the F distribution with the specified degrees of freedom.
n ChiSQ. Compares the data to the chi-square distribution with the specified degrees
of freedom.
n Gamma. Compares the data to the gamma(a) distribution.
n Beta. Compares the data to the beta(a,b) distribution.
n Exp. Compares the data to the exponential distribution with mean equaling the
location parameter and sd equaling the scale parameter.
n Logistic. Compares the data to logistic distribution with the specified location
parameter (mean) and scale parameter (sd).
n Range. Compares the data to the Studentized range(n,p) distribution.
n Weibull. Compares the data to the Weibull(n,p) distribution.
n Binomial. Compares the data to the binomial(n,p) distribution.
n Poisson. Compares the data to the Poisson distribution with mean=lambda.
n Lilliefors. The Lilliefors test uses the standard normal distribution. The variables
you select are automatically standardized, and the test determines whether the
standardized versions are normally distributed.
Wald-Wolfowitz Runs Main Dialog Box
The Wald-Wolfowitz runs test detects serial patterns in a run of numbers (for example,
runs of heads or tails in a series of coin tosses). The runs test measures such behavior
for dichotomous (or binary) variables.
To open the Wald-Wolfowitz Runs dialog box, from the menus choose:
Statistics
Nonparametric Tests
Wald-Wolfowitz Runs...
For continuous variables, use Cut to define a cutpoint to determine whether values
fluctuate in patterns above and below this cutpoint. This feature is useful for studying
trends in residuals from a regression analysis.
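The runs count and its large-sample z statistic are simple to compute directly. The sketch below is not SYSTAT code; it assumes the usual normal approximation for the Wald-Wolfowitz runs test and an invented residual-like series.

import math

def runs_test(values, cut):
    """Wald-Wolfowitz runs test around a cutpoint, normal approximation."""
    signs = [v > cut for v in values]
    n1 = sum(signs)            # cases above the cutpoint
    n2 = len(signs) - n1       # cases at or below the cutpoint
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    mean = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    var = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - mean) / math.sqrt(var)
    return runs, z

series = [3, 5, 8, 9, 7, 2, 1, 4, 2, 6, 9, 8]   # invented series
print(runs_test(series, cut=5))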
Using Commands
First, specify your data with USE filename. Continue with:
NPAR
RUNS varlist / CUT=n
KS varlist / distribution=parameters
Possible distributions for the Kolmogorov-Smirnov test include:

Distribution   Parameters
UNIFORM        min, max
NORMAL         mean, SD
T              df
F              df1, df2
CHISQ          df
GAMMA          a
BETA           a, b
EXP            mean, SD
LOGISTIC       mean, SD
RANGE          n, p
WEIBULL        n, p
BINOMIAL       n, p
POISSON        lambda
LILLIEFORS     (no parameters)
Usage Considerations
Types of data. NPAR uses rectangular data only.
Print options. The output is standard for all PRINT lengths.
Quick Graphs. NPAR produces no Quick Graphs.
Saving files. NPAR saves no statistics.
BY groups. You can perform tests using a BY variable. The output includes separate
tests for each level of the BY variable.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. NPAR uses a FREQUENCY variable (if present) to increase the
number of cases in the analysis.
Case weights. WEIGHT variables have no effect in NPAR.
Examples
Example 1
Kruskal-Wallis Test
For two or more independent groups, the Kruskal-Wallis test statistic tests whether the
k samples come from identically distributed populations. If the grouping variable has
only two levels, the Mann-Whitney (Wilcoxon) statistic is reported. For two groups,
the Kruskal-Wallis test and the Mann-Whitney U statistic are analogous to the
independent groups t test.
In this example, we compare the percentage of people who live in cities (URBAN)
for three groups of countries: European, Islamic, and New World. We use the
OURWORLD data file that has one record for each of 57 countries with the variables
URBAN and GROUP$. We include a box plot of URBAN grouped by GROUP$ to
illustrate the test. The input is:
NPAR
USE ourworld
DENSITY urban * group$ / BOX TRANS
KRUSKAL urban * group$
The output follows:
Categorical values encountered during processing are:
GROUP$ (3 levels)
Europe, Islamic, NewWorld

Kruskal-Wallis One-Way Analysis of Variance for 56 cases
Dependent variable is URBAN
Grouping variable is GROUP$

Group Count Rank Sum

Europe 19 765.000
Islamic 16 198.000
NewWorld 21 633.000

Kruskal-Wallis Test Statistic = 25.759
Probability is 0.000 assuming Chi-square distribution with 2 df
In the box plot, the median of each distribution is marked by the vertical bar inside the
box: the median for European countries is 69%; for Islamic countries, 24.5%; and for
New World countries, 50%. We ask, Is there a difference in typical values of URBAN
among these groups of countries?
Looking at the Kruskal-Wallis results, we find a p value < 0.0005. We conclude that
urbanization differs markedly across the three groups of countries.
Example 2
Mann-Whitney Test
When there are only two groups, Kruskal-Wallis provides the Mann-Whitney test.
Note that your grouping variable must contain exactly two values. Here we modify the
Kruskal-Wallis example by deleting the Islamic group. We ask, "Do European nations
tend to be more urban than New World countries?" The input is:
NPAR
USE ourworld
SELECT group$ <> "Islamic"
KRUSKAL urban * group$
The output follows:
Categorical values encountered during processing are:
GROUP$ (2 levels)
 Europe, NewWorld

Kruskal-Wallis One-Way Analysis of Variance for 40 cases
Dependent variable is URBAN
Grouping variable is GROUP$

 Group Count Rank Sum

 Europe 19 475.000
 NewWorld 21 345.000
Mann-Whitney U test statistic = 285.000
Probability is 0.020
Chi-square approximation = 5.370 with 1 df
The percentage of population living in urban areas is significantly greater for European
countries than for New World countries (p value = 0.02).
Example 3
Two-Sample Kolmogorov-Smirnov Test
The two-sample Kolmogorov-Smirnov test measures the discrepancy between two-
sample cumulative distribution functions.
In this example, we test if the distributions of URBAN, the proportion of people
living in cities, for European and New World countries have the same mean, standard
deviation, and shape. The input is:
NPAR
USE ourworld
SELECT group$ <> "Islamic"
KS urban * group$
The output follows:
Kolmogorov-Smirnov Two Sample Test results
Maximum differences for pairs of groups
 Europe NewWorld
 Europe 0.0
 NewWorld 0.520 0.0
Two-sided probabilities
 Europe NewWorld
 Europe .
 NewWorld 0.009 .
The two distributions differ significantly (p value = 0.009).
Example 4
Sign Test
Here, for a sample of countries (not subjects), we ask, "Does life expectancy differ for
males and females?" Using the OURWORLD data, we compare LIFEEXPF and
LIFEEXPM, using stem-and-leaf plots to illustrate the distributions. The sign test
counts the number of times male life expectancy is greater than that for females and
vice versa.
The input is:
STATISTICS
USE ourworld
STEM lifeexpf lifeexpm / LINES=10
NPAR
SIGN lifeexpf lifeexpm
The output follows:
For each case, SYSTAT first reports the number of differences that were positive and
the number that were negative. In two countries (Afghanistan and Bangladesh), the
males live longer than the females; the reverse is true for the other 55 countries. Note
that the layout of this output allows reports for many pairs of variables.
In the two-sided probabilities panel, the smaller count of differences (positive or
negative) is compared to the total number of nonzero differences. SYSTAT computes
a sign test on all possible pairs of specified variables. For each pair, the difference
between values on each case is calculated, and the number of positive and negative
differences is printed. The lesser of the two types of difference (positive or negative)
is then compared to the total number of nonzero differences. From this comparison, the
probability is computed according to the binomial (for a total less than or equal to 25)
or a normal approximation to the binomial (for a total greater than 25). A correction
for continuity (0.5) is added to the normal approximation's numerator, and the
denominator is computed from the null value of 0.5. The large sample test is thus
equivalent to a chi-square test for an underlying proportion of 0.5. The probability for
our test is 0.000 (or < 0.0005). We conclude that there is a significant difference in life
expectancy; females tend to live longer.
Stem and Leaf Plot of variable: LIFEEXPF, N = 57
Minimum: 44.000
Lower hinge: 65.000
Median: 75.000
Upper hinge: 79.000
Maximum: 83.000

 4 4
 4 679
 5 0234
 5 55667
 6 4
 6 H 567788889
 7 01344
 7 M 5666777778889999
 8 0000111111223

Stem and Leaf Plot of variable: LIFEEXPM, N = 57
Minimum: 40.000
Lower hinge: 61.000
Median: 68.000
Upper hinge: 73.000
Maximum: 75.000

 4 0
 * * * Outside Values * * *
 4 56789
 5 122334
 5 6
 6 H 01222444
 6 M 5556778899
 7 H 001111223333333334444
 7 55555
Sign test results

Counts of differences (row variable greater than column)
LIFEEXPM LIFEEXPF
LIFEEXPM 0 2
LIFEEXPF 55 0

Two-sided probabilities for each pair of variables
LIFEEXPM LIFEEXPF
LIFEEXPM 1.000
LIFEEXPF 0.000 1.000
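Given the counts above (2 and 55, with 57 nonzero differences), the normal approximation described earlier can be reproduced outside SYSTAT. This is an illustrative calculation with SciPy, and the exact continuity-correction convention SYSTAT uses may differ in detail:

import math
from scipy.stats import norm

k, n = 2, 57                                      # lesser count and total nonzero differences
z = (k + 0.5 - n * 0.5) / (0.5 * math.sqrt(n))    # continuity-corrected normal approximation
p = 2 * norm.cdf(z)                               # two-sided probability
print(round(z, 3), format(p, ".1e"))              # roughly z = -6.89, p far below 0.0005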
Example 5
Wilcoxon Test
Here, as in the sign test example, we ask, "Does life expectancy differ for males and
females?" The input is:
USE ourworld
NPAR
WILCOXON lifeexpf lifeexpm
The output is:
Wilcoxon Signed Ranks Test Results

Counts of differences (row variable greater than column)
 LIFEEXPM LIFEEXPF
LIFEEXPM 0 2
LIFEEXPF 55 0

Z = (Sum of signed ranks)/square root(sum of squared ranks)
 LIFEEXPM LIFEEXPF
LIFEEXPM 0.0
LIFEEXPF 6.535 0.0

Two-sided probabilities using normal approximation
 LIFEEXPM LIFEEXPF
LIFEEXPM 1.000
LIFEEXPF 0.000 1.000
Two-sided probabilities are computed from an approximate normal variate (Z in the
output) constructed from the lesser of the sum of the positive ranks and the sum of the
negative ranks (for example, Marascuilo and McSweeney, 1977, p. 338). The Z for our
test is 6.535 with a probability less than 0.0005. As with the sign test, we conclude that
females tend to live longer.
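The Z defined in the output heading can be reproduced directly from the signed ranks. A hedged sketch with SciPy, using invented paired differences rather than the OURWORLD data:

import numpy as np
from scipy.stats import rankdata

d = np.array([2.0, 3.0, -1.0, 3.0, 1.0, 4.0, 2.5])   # invented paired differences
r = rankdata(np.abs(d))                               # ranks of |d|, average rank for ties
z = np.sum(np.sign(d) * r) / np.sqrt(np.sum(r ** 2))  # (sum of signed ranks)/sqrt(sum of squared ranks)
print(round(float(z), 3))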
Example 6
Sign and Wilcoxon Tests for Multiple Variables
SYSTAT can compute a sign or Wilcoxon test on all pairs of specified variables (or all
numeric variables in your file). To illustrate the layout of the output, we add two more
variables to our request for a sign test: the birth-to-death ratios in 1982 and 1990. The
input follows:
NPAR
USE ourworld
SIGN b_to_d82 b_to_d lifeexpm lifeexpf
The resulting output is:
Sign test results

Counts of differences (row variable greater than column)
 B_TO_D82 LIFEEXPM LIFEEXPF B_TO_D
B_TO_D82 0 0 0 17
LIFEEXPM 57 0 2 57
LIFEEXPF 57 55 0 57
B_TO_D 36 0 0 0

Two-sided probabilities for each pair of variables
 B_TO_D82 LIFEEXPM LIFEEXPF B_TO_D
B_TO_D82 1.000
LIFEEXPM 0.000 1.000
LIFEEXPF 0.000 0.000 1.000
B_TO_D 0.013 0.000 0.000 1.000
The results contain some meaningless data. SYSTAT has ordered the variables as they
appear in the data file. When you specify more than two variables, there may be just a
few numbers of interest. In the first column, the birth-to-death ratio in 1982 is
compared with the birth-to-death ratio in 1990, and with male and female life
expectancy! Only the last entry is relevant: 36 countries have larger ratios in 1990
than they did in 1982. In the last column, you see that 17 countries have smaller ratios
in 1990. The life expectancy comparisons you saw in the last example are in the middle
of this table. In the two-sided probabilities panel, the probability for the birth-to-death
ratio comparison (0.013) is at the bottom of the first column. We conclude that the ratio
is significantly larger in 1990 than it was in 1982. Does this mean that the number of
births is increasing or that the number of deaths is decreasing?
Example 7
Friedman Test
In this example, we study dollars that each country spends per person for education,
health, and the military. We ask, "Do the typical values for the three expenditures differ
significantly?" We stratify our analysis and look within each type of country separately.
Here are the median expenditures:
EDUCATION HEALTH MILITARY
Europe 496.28 502.01 271.15
Islamic 13.67 4.28 22.80
New World 57.39 22.73 29.02
The input is:
NPAR
USE ourworld
BY group$
FRIEDMAN educ health mil
The output is:
The following results are for:
GROUP$ = Europe


Friedman Two-Way Analysis of Variance Results for 20 cases.


Variable Rank Sum

EDUC 43.000
HEALTH 52.000
MIL 25.000


Friedman Test Statistic = 18.900
Kendall Coefficient of Concordance = 0.472
Probability is 0.000 assuming Chi-square distribution with 2 df

The following results are for:
GROUP$ = Islamic


Friedman Two-Way Analysis of Variance Results for 15 cases.


Variable Rank Sum

EDUC 37.500
HEALTH 17.000
MIL 35.500


Friedman Test Statistic = 17.033
Kendall Coefficient of Concordance = 0.568
Probability is 0.000 assuming Chi-square distribution with 2 df

The following results are for:
GROUP$ = NewWorld


Friedman Two-Way Analysis of Variance Results for 21 cases.


Variable Rank Sum

EDUC 56.000
HEALTH 31.500
MIL 38.500


Friedman Test Statistic = 15.167
Kendall Coefficient of Concordance = 0.361
Probability is 0.001 assuming Chi-square distribution with 2 df
The Friedman test transforms the data for each country to ranks (1 for the smallest
value, 2 for the next, and 3 for the largest) and then sums the ranks for each variable.
Thus, if each country spent the least on the military, the rank sum for MIL would be 20.
The largest the rank sum could be is 60 (20 * 3). For these three groups of countries, no
expenditure is always the smallest or largest. In addition to the rank sums, SYSTAT
reports the Kendall coefficient of concordance, an estimate of the average correlation
among the expenditures.
For all three groups of countries, we reject the hypothesis that the expenditures are equal.
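The Kendall coefficient of concordance reported in each panel can be recovered from the Friedman statistic as W = chi-square / (N(k - 1)), where N is the number of cases and k the number of variables. The illustrative check below simply re-applies that relation to the statistics printed above; it is arithmetic, not SYSTAT code.

# Friedman statistic, number of cases, and number of variables from the panels above.
panels = {"Europe": (18.900, 20, 3), "Islamic": (17.033, 15, 3), "NewWorld": (15.167, 21, 3)}
for name, (chi2, n, k) in panels.items():
    w = chi2 / (n * (k - 1))
    print(name, round(w, 3))   # approximately 0.472, 0.568, and 0.361, as reported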
Example 8
One-Sample Kolmogorov-Smirnov Test
In this example, we use SYSTAT's random number generator to make a normally
distributed random number and then test it for normality. We use the variable Z as our
normal random number and the variable ZS as a standardized copy of Z. This may seem
strange because normal random numbers are expected to have a mean of 0 and a
standard deviation of 1. This is not exactly true in a sample, however, so we standardize
the observed values to make a variable that has exactly a mean of 0 and a standard
deviation of 1. The input follows:
BASIC
NEW
REPEAT 50
LET z=zrn
LET zs=z
RUN
SAVE NORM
STAND ZS/SD
USE norm
STATISTICS
STATS
NPAR
KS z zs / NORMAL
We use STATISTICS to examine the mean and standard deviation of our two variables.
Remember, if you correlated these two variables, the Pearson correlation would be 1.
Only their mean and standard deviations differ. Finally, we test Z for normality.
The output is:
 Z ZS
N of cases 50 50
Minimum -2.194 -2.271
Maximum 1.832 1.932
Mean -0.018 0.000
Standard Dev 0.958 1.000

Kolmogorov-Smirnov One Sample Test using Normal(0.00,1.00) distribution

Variable N-of-Cases MaxDif Probability (2-tail)

 Z 50.000 0.069 0.969
 ZS 50.000 0.065 0.983
Why are the probabilities different? The one-sample Kolmogorov-Smirnov test pays
attention to the shape, location, and scale of the sample distribution. Z and ZS have the
same shape in the population (they are both normal). Because ZS has been
standardized, however, it has a different location.
Thus, you should never use the Kolmogorov-Smirnov test with the normal
distribution on a variable you have standardized. The probability printed for ZS (0.983)
is misleading. If you select ChiSq, Normal, or Uniform, you are assuming that the
variable you are testing has been randomly sampled from a standard normal, uniform
(0 to 1), or chi-square (with stated degrees of freedom) population.
Lilliefors Test
Here we perform a Lilliefors test using the data generated for the one-sample
Kolmogorov-Smirnov example. Note that Lilliefors automatically standardizes the
variables you list and tests whether the standardized versions are normally distributed.
The input is:
USE norm
KS z zs / LILLIEFORS
The output is:
Kolmogorov-Smirnov One Sample Test using Normal(0.00,1.00) distribution

Variable N-of-Cases MaxDif Lilliefors Probability (2-tail)

 Z 50.000 0.065 0.895
 ZS 50.000 0.065 0.895
Notice that the probabilities are smaller this time even though MaxDif is the same as
before. The probability values for Z and ZS (0.895) are the same because this test pays
attention only to the shape of the distribution and not to the location or scale. Neither
significantly differs from normal.
This example was constructed to contrast Normal and Lilliefors. Many statistical
package users do a Kolmogorov-Smirnov test for normality on their standardized data
without realizing that they should instead do a Lilliefors test.
One last point: The Lilliefors test can be used for residual analysis in regression. Just
standardize your residuals and use Nonparametric Tests to test them for normality. If
you do this, you should always look at the corresponding normal probability plot.
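The difference between the two tests can also be seen by simulation outside SYSTAT. The sketch below (SciPy and NumPy assumed) is a Monte Carlo stand-in for Lilliefors' corrected critical values, not SYSTAT's approximation: it standardizes each simulated normal sample before computing its KS statistic, which is exactly the situation the Lilliefors table corrects for.

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
x = rng.normal(size=50)
zs = (x - x.mean()) / x.std(ddof=1)          # standardized copy of the sample

d_obs, p_naive = kstest(zs, "norm")          # naive p value, too optimistic for standardized data
sims = []
for _ in range(2000):                        # reference distribution when mean and sd are estimated
    s = rng.normal(size=50)
    s = (s - s.mean()) / s.std(ddof=1)
    sims.append(kstest(s, "norm").statistic)
p_lillie = np.mean(np.array(sims) >= d_obs)  # Monte Carlo Lilliefors-type p value

print(round(d_obs, 3), round(p_naive, 3), round(float(p_lillie), 3))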
Example 9
Wald-Wolfowitz Runs Test
We use the OURWORLD file and cut MIL (dollars per person each country spends on
the military) at its median and see whether countries with higher military expenditures
are grouped together in the file. (Be careful when you use a cutpoint on a continuous
variable, however. Your conclusions can change depending on the cutpoint you use.)
We include a scatterplot of the military expenditures against the case number (order of
each country in the file), adding a dotted line at the cutpoint of 53.889. The input is:
NPAR
USE ourworld
RUNS mil / CUT=53.889
IF (country$="Iraq" or country$="Libya" or country$="Canada"),
THEN LET country2$=country$
PLOT mil / LINE DASH=11 YLIM=53.9 LABEL=country2$ SYMBOL=2
Following is the output:
Wald-Wolfowitz runs test using cutpoint = 53.889
Probability
Variable Cases LE Cut Cases GT Cut Runs Z (2-tail)

MIL 28 28 17 -3.237 0.001
The test is significant (p value = 0.001). The military expenditures are not ordered
randomly in the file.
The European countries are first in the file, followed by Islamic and New World.
Looking at the plot, notice that the first 20 cases exceed the median. The remaining
cases are for the most part below the median. Iraq, Libya, and Canada stand apart from
the other countries in their group. When the line joining the MIL values crosses the
median line, a new run begins. Thus, the plot illustrates the 17 runs.
Computation
Algorithms
Probabilities for the Kolmogorov-Smirnov statistic for n < 25 are computed with an
asymptotic negative exponential approximation.
Lilliefors probabilities are computed by a nonlinear approximation to Lilliefors'
table. Dallal and Wilkinson (1986) recomputed Lilliefors' study using up to a million
replications for estimating critical values. They found a number of Lilliefors' values to
be incorrect. Consequently, the SYSTAT approximation uses the corrected values. The
approximation discussed in Dallal and Wilkinson and used in SYSTAT differs from
the tabled values by less than 0.01 and by less than 0.001 for p < 0.05.
References
Conover, W. J. (1980). Practical nonparametric statistics. 2nd ed. New York: John Wiley
& Sons, Inc.
Hollander, M. and Wolfe, D. A. (1973). Nonparametric statistical methods. New York:
John Wiley & Sons, Inc.
Lehmann, E. L. (1975). Nonparametrics. San Francisco: Holden-Day, Inc.
Marascuilo, L. A. and McSweeney, M. (1977). Nonparametric and distribution-free
methods for the social sciences. Belmont, Calif.: Wadsworth Publishing.
Mosteller, F. and Rourke, R. E. K. (1973). Sturdy statistics. Reading, Mass.: Addison-
Wesley.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York:
McGraw-Hill.
Chapter 22
Partially Ordered Scalogram
Analysis with Coordinates
Leland Wilkinson, Samuel Shye, Reuben Amar, and Louis Guttman
The POSAC module calculates a partial order scalogram analysis on a set of
multicategory items. It consolidates duplicate data profiles, computes profile
similarity coefficients, and iteratively computes a configuration of points in a two-
dimensional space according to the partial order model. POSAC produces Quick
Graphs of the configuration, labeled by either profile values or an ID variable. Shye
(1985) is the authoritative reference on POSAC. See also Borg's review (1987) for
more information. The best approach to setting up a study for POSAC analysis is to
use facet theory (see Canter, 1985).
Statistical Background
The figure below shows a pattern of bits in two dimensions, an instance of a partially
ordered set (POSET):

 1111
 1110 0111
 1100 0110 0011
 1000 0100 0010 0001
 0000

There are several interesting things about this pattern.
n The vertical dimension of the pattern runs from four 1s on the top to no 1s on the
bottom.
n The horizontal dimension runs from 1s on the left to 1s in the center to 1s on
the right.
n Except for the bottom row, each bit pattern is the result of an OR operation of the
two bit patterns below itself, as denoted by the arrows in the figure. For example,
(1111) = (1110) or (0111).
n There are 2^4 = 16 possible patterns for four bits. Only 11 patterns meet the
above requirements in two dimensions. The remaining patterns are: (1011),
(1101), (1010), (0101), and (1001).
n This structure is a lattice. We can move things around and still represent the POSET
geometrically as long as none of the arrows cross or head down instead of up.
Suppose we had real binary data involving presence or absence of attributes and
wanted to determine whether our data fit a POSET structure. We would have to do the
following:
n Order the attributes from left to right so that the horizontal dimension would show
1s moving from left to right in the plotted profile, as in the figure above.
n Sort the profiles of attributes from top to bottom.
n Sort the profiles from left to right.
n Locate any profiles not fitting the pattern and make sure the overall solution was
not influenced by them.
The fourth requirement is somewhat elusive and depends on the first. That is, if we had
patterns (1010) and (0101), exchanging the second and third bits would yield (1100)
and (0011), which would give us two extreme profiles in the third row rather than two
ill-fitting profiles. If we exchange bits for one profile, we must exchange them for all,
however. Thus, the global solution depends on the order of the bits as well as their
positioning.
POSAC stands for partially ordered scalogram analysis with coordinates. The
algorithm underlying POSAC computes the ordering and the lattice for cases-by-
attributes data. Developed originally by Louis Guttman and Shmuel Shye, POSAC fits
not only binary, but also multivalued data, into a two-dimensional space according to
the constraints we have discussed.
The following figure (a multivalue POSET) shows a partial ordering on some
multivalue profiles. Again, we see that the marginal values increase on the vertical
dimension (from 0 to 1 to 2 to 4 to 8) and the horizontal dimension distinguishes left
and right skew.

 2222
 2110 0112
 1100 0110 0011
 1000 0100 0010 0001
 0000
The following figure shows this distributional positioning more generally. For ordered
profiles with many values on each attribute, we expect the central profiles in the
POSAC to be symmetrically distributed, profiles to the left to be right-skewed, and
profiles to the right to be left-skewed.
Coordinates
There are two standard coordinate systems for displaying profiles. The first uses joint
and lateral dimensions to display the profiles as in the figures above. Profiles that have
similar sum scores fall at approximately the same latitude in this coordinate system.
Comparable profiles differing in their sum scores (for example, 112211 and 223322)
fall above and below each other at the same longitude.
The second coordinate display, the one printed in the SYSTAT plots, is a 45-degree
rotation of this set. These base coordinates have the joint dimension running from
southwest to northeast and the lateral dimension running from northwest to southeast.
The diamond pattern is transformed into a square.
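In the coordinate listings printed by SYSTAT (see the examples later in this chapter), this rotation appears to amount to simple averages and half-differences of the base coordinates: the joint value equals (DIM 1 + DIM 2)/2 and the lateral value equals (DIM 1 - DIM 2)/2 + 0.5 for the rows shown. The check below is only an illustration of that observation on values copied from the Example 1 listing, not a statement of the program's internal algorithm.

# Three rows copied from the Example 1 listing later in this chapter.
rows = [  # (DIM 1, DIM 2, Joint, Lateral)
    (0.933, 0.667, 0.800, 0.633),
    (0.667, 0.933, 0.800, 0.367),
    (0.867, 0.400, 0.633, 0.733),
]
for d1, d2, joint, lateral in rows:
    ok = abs((d1 + d2) / 2 - joint) < 1e-3 and abs((d1 - d2) / 2 + 0.5 - lateral) < 1e-3
    print(ok)   # True for each listed row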
POSAC in SYSTAT
POSAC Main Dialog Box
To open the POSAC dialog box, from the menus choose:
Statistics
Data Reduction
POSAC...
Model Variable(s). Specify the items to be scaled.
Iterations. Enter the maximum number of iterations that you wish to allow the program
to perform in order to estimate the parameters.
Convergence. Enter the convergence criterion. This is the largest relative change in any
coordinate before iterations terminate.
Using Commands
After selecting a file with USE filename, continue with:
POSAC
MODEL varlist
ESTIMATE / ITER=n,CONVERGE=d
The FREQ command is useful when data are aggregated and there is a variable in the
file representing frequency of profiles.
Usage Considerations
Types of data. POSAC uses rectangular data only. It is most suited for data with up to
nine categories per item. If your data have more than nine categories, the profile labels
will not be informative, since each item is displayed with a single digit in the profile
labels. If your data have many more categories in an item, the program may refuse the
computation. Similarly, POSAC can handle many items, but its interpretability and
usefulness as an analytical tool declines after 10 or 20 items. These practical
limitations are comparable to those for loglinear modeling and analysis of contingency
tables, which become complex and problematic for multiway tables.
Print options. The output is the same for all PRINT options.
Quick Graphs. POSAC produces a Quick Graph of the coordinates labeled either with
value profiles or an ID variable.
Saving files. POSAC saves the configuration into a SYSTAT file.
BY groups. POSAC analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ=<variable> increases the number of cases by the FREQ
variable.
Case weights. WEIGHT is not available in POSAC.
Examples
The following examples illustrate the features of the POSAC module. The first example
involves binary profiles that fit the POSAC model perfectly. The second example shows
an analysis for real binary data. The third example shows how POSAC works for
multicategory data.
Example 1
Scalogram AnalysisA Perfect Fit
The file BIT5 contains five-item binary profiles fitting a two-dimensional structure
perfectly. The input is:
USE BIT5
POSAC
MODEL X(1)..X(5)
ESTIMATE
The resulting output is:
Variables in the SYSTAT Rectangular file are:
X(1..5)

Reordered item weak monotonicity coefficients

X(5) X(4) X(3) X(2) X(1)

X(5) 1.000
X(4) 0.750 1.000
X(3) 0.111 0.667 1.000
X(2) -0.286 0.0 0.667 1.000
X(1) -0.391 -0.286 0.111 0.750 1.000

Iteration Loss
---------------
1 0.019009
2 0.005559
3 0.000291
4 0.000000

Final loss value: 0.000
Proportion of profile pairs correctly represented: 1.000
Score-distance weighted coefficient: 1.000

Label DIM 1 DIM 2 Joint Lateral Fit
11111 1.000 1.000 1.000 0.500 0.0
01111 0.933 0.667 0.800 0.633 0.0
11110 0.667 0.933 0.800 0.367 0.0
01110 0.600 0.600 0.600 0.500 0.0
00111 0.867 0.400 0.633 0.733 0.0
11100 0.400 0.867 0.633 0.267 0.0
01100 0.333 0.533 0.433 0.400 0.0
00011 0.800 0.200 0.500 0.800 0.0
11000 0.200 0.800 0.500 0.200 0.0
00110 0.533 0.333 0.433 0.600 0.0
00100 0.267 0.267 0.267 0.500 0.0
10000 0.067 0.733 0.400 0.167 0.0
00010 0.467 0.133 0.300 0.667 0.0
00001 0.733 0.067 0.400 0.833 0.0
01000 0.133 0.467 0.300 0.333 0.0
00000 0.000 0.000 0.000 0.500 0.0
EXPORT successfully completed.
POSAC first computes Guttman monotonicity coefficients and orders the matrix of
them using an SSA (multidimensional scaling) algorithm. These monotonicity
coefficients, which Shye (1985) discusses in detail, are similar to the MU2 coefficients
in the SYSTAT CORR module.
The next section of the output shows the iteration history and computed coordinates.
SYSTAT's POSAC module calculates the square roots of the coordinates before
display and plotting. This is done in order to make the lateral direction linear rather
than curvilinear. Notice that for the perfect data in this example, the profiles are
confined to the upper right triangle of the plot, as in the theoretical examples in Shye
(1985). If you are comparing output with the earlier Jerusalem program, remember to
include this transformation. Notice that the profiles are ordered in both the joint and
lateral directions.

[Quick Graph: POSAC Profile Plot, DIM(2) against DIM(1), with each point labeled by its five-bit profile; the 16 profiles run from 00000 near the origin to 11111 at the upper right.]
Example 2
Binary Profiles
The following data are reports of fear symptoms by selected United States soldiers
after being withdrawn from World War II combat. The data were originally reported by
Suchman in Stouffer et al. (1950). Notice that we use FREQ to represent duplicate
profiles.
The input is:
USE COMBAT
FREQ=COUNT
POSAC
MODEL POUNDING..URINE
ESTIMATE
The resulting output is:
Variables in the SYSTAT Rectangular file are:
POUNDING SINKING SHAKING NAUSEOUS STIFF FAINT
VOMIT BOWELS URINE COUNT

Case frequencies determined by value of variable COUNT.

Reordered item weak monotonicity coefficients

STIFF VOMIT NAUSEOUS FAINT SINKING

STIFF 1.000
VOMIT 0.682 1.000
NAUSEOUS 0.728 0.815 1.000
FAINT 0.716 0.665 0.844 1.000
SINKING 0.583 0.381 0.706 0.644 1.000
SHAKING 0.829 0.495 0.661 0.729 0.705
BOWELS 0.751 0.780 0.780 0.761 0.513
URINE 0.782 0.589 1.000 0.846 1.000
POUNDING 0.290 0.443 0.615 0.569 0.449

SHAKING BOWELS URINE POUNDING

SHAKING 1.000
BOWELS 0.617 1.000
URINE 0.763 0.960 1.000
POUNDING 0.709 1.000 1.000 1.000

Iteration Loss
---------------
1 4.611967
2 2.260031
3 1.193905
4 0.877537
5 0.898418

Final loss value: 0.878
Proportion of profile pairs correctly represented: 0.810
Score-distance weighted coefficient: 0.977

Label DIM 1 DIM 2 Joint Lateral Fit
111111111 1.000 1.000 1.000 0.500 0.0
111111101 0.918 0.980 0.949 0.469 2.577
101111111 0.939 0.878 0.908 0.531 10.242
111111001 0.857 0.898 0.878 0.480 11.973
111110101 0.694 0.755 0.724 0.469 13.251
101111101 0.816 0.837 0.827 0.490 7.571
101111011 0.878 0.816 0.847 0.531 9.357
111101001 0.306 0.939 0.622 0.184 8.880
011111001 0.959 0.510 0.735 0.724 6.641
101111001 0.714 0.653 0.684 0.531 10.411
111011001 0.653 0.796 0.724 0.429 11.101
110111001 0.551 0.714 0.633 0.418 8.689
011110001 0.796 0.388 0.592 0.704 7.238
001111001 0.776 0.490 0.633 0.643 4.255
100111001 0.490 0.673 0.582 0.408 6.911
111001001 0.265 0.918 0.592 0.173 12.063
011011001 0.837 0.347 0.592 0.745 9.030
111100001 0.245 0.857 0.551 0.194 10.225
111010001 0.469 0.694 0.582 0.388 13.307
011010101 0.898 0.245 0.571 0.827 5.937
001010111 0.980 0.184 0.582 0.898 1.793
101011001 0.531 0.612 0.571 0.459 8.332
101011000 0.061 0.735 0.398 0.163 8.936
111010000 0.041 0.959 0.500 0.041 9.716
001011001 0.633 0.306 0.469 0.663 4.639
100001101 0.286 0.776 0.531 0.255 18.117
101010001 0.408 0.551 0.480 0.429 6.088
011010001 0.755 0.224 0.490 0.765 10.334
001110001 0.673 0.327 0.500 0.673 7.902
110010001 0.388 0.633 0.510 0.378 9.413
000111001 0.612 0.286 0.449 0.663 6.454
101001001 0.224 0.592 0.408 0.316 6.892
100011001 0.429 0.531 0.480 0.449 7.752
001010001 0.449 0.082 0.265 0.684 7.128
000011001 0.510 0.265 0.388 0.622 8.843
000110001 0.592 0.163 0.378 0.714 7.337
000010001 0.327 0.122 0.224 0.602 1.155
100000001 0.082 0.449 0.265 0.316 5.579
001000001 0.367 0.143 0.255 0.612 7.827
000011000 0.204 0.367 0.286 0.418 9.295
001100000 0.184 0.469 0.327 0.357 10.533
100010000 0.020 0.571 0.296 0.224 10.084
000001001 0.347 0.204 0.276 0.571 8.718
000000101 0.735 0.041 0.388 0.847 15.543
010000001 0.571 0.102 0.337 0.735 18.115
000000001 0.163 0.020 0.092 0.571 6.259
000100000 0.122 0.408 0.265 0.357 10.401
010000000 0.143 0.429 0.286 0.357 13.698
000010000 0.102 0.061 0.082 0.520 11.087
000000000
EXPORT successfully completed.
The output shows an initial ordering of the symptoms that, according to the SSA, runs
from stiffness to loss of urine and bowel control and a pounding heart. The lateral
dimension follows this general ordering. Notice that the joint dimension runs from
absence of symptoms to presence of all symptoms.
Example 3
Multiple Categories
This example uses the crime data to construct a 2D solution of crime patterns. We first
recode the data into four categories for each item by using the CUT function. The cuts
are made at each standard deviation and the mean. Then, POSAC computes the
coordinates for these four category profiles. The input is:
USE CRIME
STANDARDIZE MURDER..AUTOTHFT
LET (MURDER..AUTOTHFT)=CUT(@,-1,0,1,4)
POSAC
MODEL MURDER..AUTOTHFT
ESTIMATE
The resulting output is:
Reordered item weak monotonicity coefficients

LARCENY AUTOTHFT BURGLARY ROBBERY RAPE

LARCENY 1.0000
AUTOTHFT 0.8215 1.0000
BURGLARY 0.9302 0.9497 1.0000
ROBBERY 0.8058 0.9003 0.8677 1.0000
RAPE 0.7858 0.7314 0.8504 0.9220 1.0000
ASSAULT 0.5161 0.6669 0.7424 0.8788 0.9207
MURDER 0.2802 0.4826 0.5793 0.6500 0.8230

ASSAULT MURDER

ASSAULT 1.0000
MURDER 0.9650 1.0000

Iteration Loss
---------------
1 0.451041
2 0.332829
3 0.130639
4 0.101641
5 0.085226
6 0.091481

Final loss value: 0.0852
Proportion of profile pairs correctly represented: 0.8163
Score-distance weighted coefficient: 0.9939

Label DIM 1 DIM 2 Joint Lateral Fit
4444444
4444443 0.9242 0.9895 0.9569 0.4673 2.0147
4343344 0.9574 0.8416 0.8995 0.5579 4.7697
4344433 0.8292 0.9465 0.8878 0.4413 2.5764
4343443 0.8165 0.9354 0.8760 0.4405 1.9954
4443432 0.7071 0.9789 0.8430 0.3641 1.0454
4443333 0.8539 0.9682 0.9111 0.4428 2.5587
3444243 0.7638 0.9014 0.8326 0.4312 3.1705
3334443 0.8660 0.8780 0.8720 0.4940 1.5690
3334433 0.8416 0.8539 0.8478 0.4939 1.1482
3333334 0.9354 0.8165 0.8760 0.5595 2.0266
2323444 0.9895 0.6455 0.8175 0.6720 0.4374
3333333 0.7773 0.8292 0.8032 0.4741 0.5635
3324333 0.8036 0.8036 0.8036 0.5000 3.8324
3322434 0.8898 0.7071 0.7984 0.5913 4.1468
3332333 0.7360 0.7773 0.7566 0.4793 2.5768
4442212 0.3819 0.9574 0.6697 0.2122 2.1545
4233322 0.5951 0.9242 0.7597 0.3355 3.0452
2232334 0.9465 0.6292 0.7878 0.6587 0.6921
4242322 0.5774 0.9129 0.7451 0.3322 2.6240
2222244 0.9682 0.5590 0.7636 0.7046 2.3395
1222344 0.9789 0.3536 0.6662 0.8127 2.1700
3323322 0.6455 0.7906 0.7180 0.4275 1.7497
3432122 0.4330 0.8898 0.6614 0.2716 4.2661
2323322 0.6922 0.6614 0.6768 0.5154 2.6771
2333222 0.6614 0.7217 0.6916 0.4699 2.3523
2222234 0.9129 0.5774 0.7451 0.6678 1.9414
3222233 0.6770 0.7360 0.7065 0.4705 2.0523
2432222 0.7500 0.7638 0.7569 0.4931 6.8248
2332222 0.6292 0.6770 0.6531 0.4761 2.8814
4222222 0.5590 0.8660 0.7125 0.3465 0.9197
1122333 0.7906 0.4787 0.6346 0.6559 4.2385
3222222 0.5401 0.7500 0.6450 0.3950 1.7113
1222233 0.7217 0.4082 0.5650 0.6567 2.2306
1222224 0.9014 0.2887 0.5950 0.8064 1.8188
1223222 0.6124 0.6124 0.6124 0.5000 6.1082
1112234 0.8780 0.2041 0.5410 0.8369 1.2590
2222222 0.5204 0.5401 0.5302 0.4902 1.1929
3122222 0.3536 0.6922 0.5229 0.3307 5.8713
2222211 0.4564 0.5000 0.4782 0.4782 2.5150
2212221 0.5000 0.5204 0.5102 0.4898 2.9364
2112212 0.4787 0.4564 0.4676 0.5111 3.5318
2212111 0.2500 0.5951 0.4226 0.3274 2.8411
1112122 0.4082 0.2500 0.3291 0.5791 2.1351
1212111 0.3227 0.3819 0.3523 0.4704 2.9375
1121211 0.2887 0.3227 0.3057 0.4830 3.6207
2111111 0.1443 0.4330 0.2887 0.3557 3.4967
1112111 0.2041 0.1443 0.1742 0.5299 0.3086
1111111
The configuration plot is labeled with the profile values. We can see that the larger
values generally fall in the upper extreme of the joint (diagonal) dimension. The lateral
dimension runs basically according to the ordering of the initial SSA, from property
crimes at the left end of each profile to person crimes at the right end. POSAC thus has
organized the states in two dimensions by frequency (low versus high) and by type of
crime (person versus property).
If we add
IDVAR=STATE$
before the ESTIMATE command, we can label the points with the state names. The
result is shown in the following POSAC profile plot:
POSAC and MDS
To see how the POSAC compares to a multidimensional scaling, we ran an MDS on the
transposed crime data. The following input program illustrates several important points
about SYSTAT and data analyses in this context. Our goal is to run an MDS on the
distances (differences) between states on crime incidence for the seven crimes. First,
we standardize the variables so that all of the crimes have a comparable influence on
the differences between states. This prevents a high-frequency crime, like auto theft,
from unduly influencing the crime differences. Next, we add a LABEL$ variable to the
file because TRANSPOSE renames the variables with its values if a variable with this
name is found in the source file. We save the transposed file into TCRIME and then use
CORR to compute Euclidean distances between the states. MDS then is used to analyze
the matrix of pairwise distances of the states ranging from Maine to Hawaii (the two-
letter state names are from the U.S. Post Office designations).
We save the MDS configuration instead of looking at the plot immediately because
we want to do one more thing. We are going to make the symbol sizes proportional to
the standardized level of the crimes (by summing them into a TOTAL crime variable).
States with the highest value on this variable rank highest, in general, on all crimes. By
merging SCRIME (produced by the original standardization) and CONF (produced by
MDS), we retain the labels and the crime values and the configuration coordinates.
The resulting graph follows:
USE CRIME
STANDARDIZE MURDER..AUTOTHFT
SAVE SCRIME
RUN
CORR
USE SCRIME
LET LABEL$=STATE$
TRANSPOSE MURDER..AUTOTHFT
SAVE TCRIME
EUCLID ME..HI
MDS
USE TCRIME
MODEL ME..HI
SAVE CONF / CONFIG
ESTIMATE
MERGE CONF SCRIME
LET TOTAL=SUM(MURDER..AUTOTHFT)
PLOT DIM(2)*DIM(1)/SIZE=TOTAL,LAB=STATE$,LEGEND=NONE
The resulting graph follows:
Notice that the first dimension comprises a frequency of crime factor since the size of
the symbols is generally larger on the left. This dimension is not much different from
the joint dimension in the POSAC configuration. The second dimension, however, is
less interpretable than the POSAC lateral dimension. It is not clearly person versus
property.
Computation
Calculations are in single precision for profile categories, with double precision
variables used where needed in the minimization to ensure accuracy.
Algorithms
POSAC uses algorithms developed by Louis Guttman and Samuel Shye. The SYSTAT
program is a recoding of the Hebrew University version using different minimization
algorithms, an SSA procedure to reorder the profiles according to a suggestion of
Guttman, and a memory model which allows large problems.
Missing Data
Profiles with missing data are excluded from the calculations.
References
Borg, I. (1987). Review of S. Shye, Multiple scaling. Psychometrika, 52, 304–307.
Borg, I. and Shye, S. (1995). Facet theory: Form and content. Thousand Oaks, Calif.: Sage
Publications.
Canter, D., ed. (1985). Facet theory approaches to social research. New York: Springer
Verlag.
Shye, S., ed. (1978). Theory construction and data analysis in the behavioral sciences. San
Francisco, Calif.: Jossey-Bass.
Shye, S. (1985). Multiple scaling: The theory and application of Partial Order Scalogram
Analysis. Amsterdam: North-Holland.
Stouffer, S. A., Guttman, L., Suchman, E. A., Lazarsfeld, P. F., Staf, S. A., and Clausen, J.
A. (1950). Measurement and Prediction. Princeton, N.J.: Princeton University Press.
Chapter 23
Path Analysis (RAMONA)
Michael W. Browne and Gerhard Mels
RAMONA implements the McArdle and McDonald Reticular Action Model (RAM)
for path analysis with manifest and latent variables. Input to the program is coded
directly from a path diagram without reference to any matrices.
RAMONA stands for RAM Or Near Approximation. The deviation from RAM is
minor: no distinction is made between residual variables and other latent variables.
As in RAM, only two parameter matrices are involved in the model. One represents
single-headed arrows in the path diagram (path coefficients) and the other, double-
headed arrows (covariance relationships).
RAMONA can correctly fit path analysis models to correlation matrices, and it
avoids the errors associated with treating a correlation matrix as if it were a covariance
matrix (Cudeck, 1989). Furthermore, you can request that both exogenous and
endogenous latent variable variances have unit variances. Consequently, estimates of
standardized path coefficients, with the associated standard errors, can be obtained,
and difficulties associated with the interpretation of unstandardized path coefficients
(Bollen, 1989) can be avoided.
Statistical Background
The Path Diagram
The input file for RAMONA is coded directly from a path diagram. We first briefly
review the main characteristics of path diagrams. More information can be found in
texts dealing with structural equation modeling (Bollen, 1989; Everitt, 1984; and
McDonald, 1985).
Look at the path diagram below. This is a model, adapted from
Jöreskog (1977), for a study of the stability of attitudes over time conducted by
Wheaton, Muthén, Alwin, and Summers (1977). Attitude scales measuring anomia
(ANOMIA) and powerlessness (POWRLS) were regarded as indicators of the latent
variable alienation (ALNTN) and administered to 932 persons in 1967 and 1971. A
socioeconomic index (SEI) and years of school completed (EDUCTN) were regarded
as indicators of the latent variable socioeconomic status (SES).
[Path diagram for the model: the latent variable SES, with indicators EDUCTN and SEI and error terms D1 and D2, and the latent variables ALNTN67 and ALNTN71, with indicators ANOMIA67, POWRLS67, ANOMIA71, and POWRLS71, error terms E1 through E4, and residuals Z1 and Z2. Several paths and variances carry fixed values of 1.0.]
In the path diagram, a manifest (observed) variable is represented by a square or
rectangular box:
while a circle or ellipse signifies a latent (unobservable) variable:
A dependence path is represented by a single-headed arrow emitted by the
explanatory variable and received by the dependent variable:
while a covariance path is represented by a double-headed arrow:
In many diagrams, variance paths are omitted. Because variances form an essential
part of a model and must be specified for RAMONA, we represent them here explicitly
by curved double-headed arrows (McArdle, 1988) with both heads touching the same
circle or square:
If a path coefficient, variance, or covariance is fixed (at a nonzero value), we attach the
value to the single- or double-headed arrow:
A variable that acts as an explanatory variable in all of its dependence relationships
(emits single-headed arrows but does not receive any) is exogenous (outside the
system):
A variable that acts as a dependent variable in at least one dependence relationship
(receives at least one single-headed arrow) is endogenous (inside the system), whether
or not it ever acts as an explanatory variable (emits any arrows):
A parameter in RAMONA is associated with each dependence path and covariance
path between two exogenous variables. Covariance paths are permitted only between
exogenous variables. For example, the following covariance paths are permissible:
Variances and covariances of endogenous variables are implied by the corresponding
explanatory variables and have no associated parameters in the model. Thus, an
endogenous variable may not have a covariance path with any other variable. The
covariance is a function of path coefficients and variances or covariances of exogenous
variables and is not represented by a parameter in the model. The following covariance
paths, for example, are not permissible:
Also, an endogenous variable does not have a free parameter representing its variance.
Its variance is a function of the path coefficients and variances of its explanatory
variables. Therefore, it may not have an associated double-headed arrow with no fixed
value:
Exogenous variables alone may have free parameters representing their variances:
We do, however, allow fixed variances for both endogenous and exogenous variables.
These two types of fixed variances are interpreted differently in the program:
n A fixed variance for an endogenous variable is treated as a nonlinear equality
constraint on the parameters in the model:
The fixed implied variance is represented by a dotted two-headed arrow instead of a
solid two-headed arrow because it is a nonlinear constraint on several other parameters
in the model and does not have a single fixed parameter associated with it.
n A fixed variance for an exogenous variable is treated as a model parameter with a
fixed value:
Every latent variable must emit at least one arrow. No latent variable can receive arrows
without emitting any:
The scale of every latent variable (exogenous or endogenous) should be fixed to avoid
indeterminate parameter values. Some ways for accomplishing this are:
n To fix one of the path coefficients, associated with an emitted arrow, to a nonzero
value (usually 1.0):
n To fix both the variance and path coefficient of an associated error term, if the
latent variable is endogenous (for example, Jöreskog and Goldberger, 1975):
n To fix the variance of the latent variable:
If a latent variable is endogenous and the third method is used, RAMONA fixes the
implied variance by means of equality constraints. Programs that do not have this
facility require the user to employ the first or second method to determine the scales of
endogenous latent variables.
Consider ALNTN67 in the path diagram. This latent variable is endogenous (it
receives arrows from SES and Z1). It also emits arrows to ANOMIA67 and
POWRLS67. Consequently, it is necessary to fix either the variance of ALNTN67, the
path coefficient from ALNTN67 to ANOMIA67, the path coefficient from ALNTN67 to
POWRLS67, or the variance of Z1. It is conventional to use 1.0 as the fixed value. Our
preference is to use the third method and fix the variance of ALNTN67 rather than use
the first or second method because we find standardized path coefficients easier to
interpret (Bollen, 1989). The first two methods result in latent variables with non-unit
variances. RAMONA does, however, allow the use of these methods.
The model shown in the path diagram is equivalent to Jöreskog's (1977) model but
makes use of different identification conditions. We apply nonlinear equality
constraints to fix the variances of the endogenous variables ALNTN67 and ALNTN71,
but treat the path coefficients from ALNTN67 to ANOMIA67 and from ALNTN71 to
ANOMIA71 as free parameters. Jöreskog fixed the path coefficients from ALNTN67 to
ANOMIA67 and from ALNTN71 to ANOMIA71 and did not apply any nonlinear
equality constraints.
An error term is an exogenous latent variable that emits only one single-headed
arrow and shares double-headed arrows only with other error terms. In the path
diagram, the variables E1, E2, E3, E4, D1, D2, Z1, and Z2 are error terms. RAMONA
treats error terms in exactly the same manner as other latent variables.
RAMONA's Model
Let v_1 be a p × 1 vector of manifest variables, let v_2 be an m × 1 vector of latent
variables, and let

\[
v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}
\qquad \text{(Equation 23-1)}
\]

be the t × 1 vector (t = p + m) representing all variables in the system, manifest and
latent. Suppose that B is a t × t matrix of path coefficients. The path coefficient
corresponding to the directed arrow from the jth element, v_j, of v to the ith element,
v_i, will appear in the ith row and jth column of B. Let v_x be a t × 1 vector formed from
v by replacing all elements corresponding to non-null rows of B by zeros. Thus, v_x
consists of exogenous variables with endogenous variables replaced by zeros. The
system of directed paths represented in the path diagram is then given by:

\[
v = Bv + v_x
\qquad \text{(Equation 23-2)}
\]

The formulation of the model given in Equation 23-1 differs only slightly from that of
RAM (McArdle and McDonald, 1984). All non-null elements of v_x are also elements
of v. Also, the non-null elements of v_x can, in some situations, be common factors
rather than residuals. Let

\[
\Phi = \operatorname{Cov}(v_x, v_x)
\]

be the t × t covariance matrix of v_x. Thus, the nonzero elements of Φ are parameters
associated with two-headed arrows in the path diagram. Null rows and columns of Φ
will be associated with endogenous variables in v.

Let Σ = Cov(v, v). It follows from Equation 23-2 that (McArdle and McDonald)

\[
\Sigma = (I - B)^{-1}\,\Phi\,\bigl[(I - B)^{-1}\bigr]'
\qquad \text{(Equation 23-3)}
\]

The manifest variable covariance matrix Σ_11 = Cov(v_1, v_1) is the first p × p submatrix
of Σ (see Equation 23-1). Specified values may be assigned to exogenous variable
covariances by applying constraints to appropriate diagonal elements of Φ.

The structural model employed by RAMONA is given in Equation 23-3. Both B
and Φ are large matrices with most of their elements equal to 0. Their nonzero
elements alone are stored in RAMONA. Sparse matrix methods are used in the
computation of (I − B)^{-1} and Σ. Details can be found in Mels (1988).

The covariance structure in Equation 23-3 differs from a formulation of Bentler and
Weeks (1980) in that there is a single matrix, B, for path coefficients instead of two.

Structural equation models are often fitted to sample correlation matrices. There are
many published studies where this has been done incorrectly (Cudeck, 1989).
RAMONA fits a correlation structure by introducing a duplicate standardized variable,
v_i*, with unit variance to correspond to each manifest variable v_i, i ≤ p, and then
taking

\[
v_i = \sigma_i v_i^{*} \qquad \text{for } i \le p
\]

where σ_i stands for the standard deviation of v_i. The duplicate variables are treated in
the same way as latent variables: variances are constrained to unity if they are
endogenous and fixed at unity if they are exogenous. Also, the standard deviation, σ_i,
is treated in the same way as a path coefficient. This procedure is equivalent to
expressing the manifest variable covariance matrix in the form

\[
\Sigma_{11} = D_{\sigma} P D_{\sigma}
\]

where D_σ is a diagonal matrix with the σ_i, i ≤ p, as diagonal elements, and P is the
manifest variable correlation matrix, which is treated as the covariance matrix of the
standardized duplicate variables v_i*, i ≤ p. Fitting the model to a sample correlation
matrix instead of a sample covariance matrix results in the estimates σ̂_i being replaced
by σ̂_i / s_i, where s_i is a sample standard deviation. These quantities are referred to as
Scaled Standard Deviations (nuisance parameters) in the output. Other parameter
estimates are not affected.

This approach involves the introduction of p additional parameters, σ_i, and p
additional constraints on the variances of the v_i*. The number of degrees of freedom is
not affected (unless some parameters or constraints are redundant), but computation
time is increased because of the additional parameters and additional constraints.
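To make Equations 23-2 and 23-3 concrete, consider a minimal system of our own construction (it is not one of the models discussed in this chapter): one manifest variable Y, one latent factor F with variance φ, and one error term E with variance ψ, with the paths Y <- F (coefficient λ) and Y <- E (coefficient fixed at 1). Then

\[
v = \begin{bmatrix} Y \\ F \\ E \end{bmatrix},\qquad
B = \begin{bmatrix} 0 & \lambda & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},\qquad
v_x = \begin{bmatrix} 0 \\ F \\ E \end{bmatrix},\qquad
\Phi = \begin{bmatrix} 0 & 0 & 0 \\ 0 & \phi & 0 \\ 0 & 0 & \psi \end{bmatrix},
\]

so that Equation 23-2 reduces to Y = λF + E, and Equation 23-3 gives

\[
\Sigma = (I - B)^{-1}\,\Phi\,\bigl[(I - B)^{-1}\bigr]' =
\begin{bmatrix}
\lambda^{2}\phi + \psi & \lambda\phi & \psi \\
\lambda\phi & \phi & 0 \\
\psi & 0 & \psi
\end{bmatrix}.
\]

The 1 × 1 manifest submatrix Σ_11 is the familiar variance decomposition Var(Y) = λ²φ + ψ.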
Path Analysis in SYSTAT
RAMONA Model Main Dialog Box
To open the RAMONA Model dialog box, from the menus choose:
Statistics
Path Analysis (RAMONA)...
Specify the paths or relationships in the path diagram. Include a statement for each
arrow. SYSTAT checks each variable to determine whether it is in the input file. (If it
is in the file, SYSTAT considers it manifest; if not, it is considered latent).
The relationships represented in the path diagram are of two types: dependence and
covariance. These relationships may be specified in any order. Parameter numbers and
values, if not the default values, are specified in parentheses after the variable name.
Dependence Relationships
A dependence relationship is indicated by the symbol <-, which relates directly to a
single-headed arrow in the path diagram. To code a dependence path, enter the
descriptive name of the dependent variable followed by the symbol <-. Then name the
explanatory variable, including the parameter number and the starting value for the
parameter involved within parentheses. For example,
dependent <- explanatory(1, 0.6)
The parameter number is an integer used to indicate fixed parameters or parameters
whose values are constrained to be equal. A fixed parameter must have a parameter
number of 0. Any free parameters whose values are required to be equal are assigned
the same parameter number. A free parameter that is not constrained to equality with
any other parameter may be assigned the symbol * instead of a parameter number. Its
parameter number is assigned within the program.
The starting value is a real number and is used to initialize the iterative process.
Some rules for choosing starting values are given by Bollen (1989). If you have
difficulty in deciding on a starting value, you can replace it with a *. The program then
chooses a very rough starting value. If a parameter is fixed with a 0 as the parameter
number, then the fixed value must replace the starting value. It is not permissible to use
a * instead of the fixed value.
Inspection of the path diagram in the Statistical Background section shows that
the endogenous manifest variable POWRLS67 receives single-headed arrows from the
latent variable ALNTN67 and the measurement error E2. These dependence
relationships can be coded as:
powrls67 <- alntn67(*,*),
powrls67 <- e2(0,1.0)
In the first path, the parameter is free and not constrained to equality with any other
parameter. The parameter number is replaced by a *. No starting value is specified and
it is replaced by a *. The parameter in the second path is fixed at 1.0 so that the
parameter number is 0 and the parameter value is 1.0.
The default is (*,*), so it is not necessary to type it:
powrls67 <- alntn67,
powrls67 <- e2(0,1.0)
It is not necessary to have a different statement for each path. Several paths with the
same dependent (receiving) variable can be combined into one statement. Since the
same endogenous variable, POWRLS67, is involved in two dependence relationships,
the two paths can be coded in a single statement as:
powrls67 <- alntn67 e2(0,1.0)
If the statement continues to a second line, place a comma at the end of the first line.
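For example, the single statement above could equally be written across two lines (this is our own illustration; the restart file in Example 2 continues statements in the same way):

powrls67 <- alntn67,
e2(0,1.0)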
Constraining parameters. If you want to constrain two or more free parameters to be
equal, assign them the same positive integer as the parameter number. Suppose you
want to constrain the path coefficients from SES to ALNTN67 and from SES to
ALNTN71 to be equal. You can specify:
alntn67 <- ses(7,*) z1(0,1.0),
alntn71 <- alntn67 ses(7,*) z2(0,1.0)
Providing starting values. You can provide starting values for free parameters. Suppose
that it is known from a previous run that the path coefficient of ALNTN67 to ALNTN71
is approximately 0.6. In this case, you can specify the following:
alntn71 <- alntn67(*,0.6) ses(7,*) z2(0,1.0)
When specifying dependence relationships, bear in mind that:
n Dependence relationships can be specified in any order.
n A statement can specify several dependence paths involving the same dependent
variable.
n Specified path numbers need not be sequential; for example, 5, 3, 9 can be used.
The program reassigns path numbers sequentially.
Covariance Relationships
A variance or a covariance relationship is indicated by the symbol <->, which relates
directly to the double-headed arrow in the path diagram. To specify a covariance path,
enter the name of one of the variables in the path, followed by the symbol <->. Then
enter the name of the other variable, and include the path number and the starting value
within parentheses. Unlike the dependence relationship, it does not matter which
variable is given first. For example,
e2 <-> e2(10,*)
You can replace the number and/or the starting value of a free parameter with the
symbol *. In this case, they are provided by the program. In the case of a fixed
parameter, however, you must specify 0 as the number of the parameter and provide
the fixed value of the parameter.
Inspection of the path diagram shows that double-headed arrows are used from the
measurement error E1 to itself to specify a variance and to E3 to specify a covariance.
These relationships are specified in the statement:
e1 <-> e1(*,*) e3(*,*)
or
e1 <-> e1 e3
Covariance paths can be constrained to be equal in the same manner as dependence
paths. Suppose you want to specify that the variances of the measurement errors E1,
E2, and E3 must be equal:
e1 <-> e1(10,*) e3,
e2 <-> e2(10,*),
e3 <-> e3(10,*)
You can again provide starting values for free parameters:
e3 <-> e3(*,0.32)
Variances of both exogenous and endogenous variables can be required to have fixed
values. Thus, both
ses <-> ses(0,1.0)
and
alntn67 <-> alntn67(0,1.0)
are acceptable. They are, however, treated differently within the program. The
exogenous latent variable, SES, has a parameter associated with its variance and it is
set equal to 1.0. There is no parameter representing the variance of the endogenous
latent variable, ALNTN67. This variance is a function of the path coefficient,
ALNTN67 <- SES, the variance of SES, and the variance of Z1. It is constrained to have
a value of 1.0 by RAMONA.
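Written out (our own illustration, using the facts that SES and Z1 are uncorrelated in this model and that the Z1 path coefficient is fixed at 1.0), the constraint that RAMONA imposes is

\[
\operatorname{Var}(ALNTN67) = b^{2}\operatorname{Var}(SES) + \operatorname{Var}(Z1) = 1.0
\]

where b denotes the ALNTN67 <- SES path coefficient. This is the nonlinear equality constraint on several parameters referred to in the Statistical Background section.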
When specifying covariance relationships, bear in mind that:
n Covariance paths can be specified in any order.
n Several covariance paths per statement can be specified. For example, the variance
of an exogenous variable as well as its covariances with other exogenous variables
can be specified in the same statement.
n Dependence paths and covariance paths must be specified in separate
substatements. The dependence path substatements must precede the covariance
path substatements.
n If every manifest endogenous variable has a corresponding measurement error with
an unconstrained variance, the coding of these variances can be omitted. When all
error path coefficients are fixed and no error variance paths are input for the
measurement errors, the program will automatically provide the error variance
paths.
n If there are exogenous manifest variables and if all of their variances and
covariances are present in the system and are unrestricted, the coding of these
variance and covariance paths can be omitted. When no variance and covariance
paths for exogenous manifest variables are input, the program will automatically
provide them.
RAMONA Options
To specify RAMONA options, click Options in the Model dialog box.
The following options can be specified:
Manifest Variable(s). Specify the manifest (observable) variables in the model.
Latent Variable Names. Specify the latent (unobservable) variables in the model. Decide
upon descriptive names, including names for the error terms. A systematic way of
organizing the names is to let the endogenous latent variables be followed by the
exogenous latent variables and to include the error terms last. It is not, however,
essential to do this.
You can specify the type of matrix to be analyzed.
n Covariance. If the input matrix is a correlation matrix (has unit diagonal elements),
the analysis is performed, but RAMONA prints a warning in the output. This is the
default.
n Correlation. If a covariance matrix is input, RAMONA rescales it to be a correlation
matrix.
RAMONA offers five methods of estimation:
n Maximum Wishart likelihood. This is the default.
n Generalized least squares. Assumes a Wishart distribution for S.
n Ordinary least squares. No measures of fit and no standard errors of estimators are
printed.
n ADF Gramian. Asymptotically distribution-free estimate that uses a biased but
Gramian (non-negative definite) estimate of the asymptotic covariance matrix.
n ADF Unbiased. Asymptotically distribution-free estimate that uses an unbiased
estimate of the asymptotic covariance matrix.
You can designate how starting values are to be scaled.
n Rescale. Rescales the starting values to satisfy the specified variance constraints.
RAMONA applies ordinary least squares initially. After partial convergence,
RAMONA switches to the method you specify. If starting values are poor, you are
advised to rescale them. If you require MWL estimates and supply poor starting
values or if you use the * alternative for the starting value, use this option. You
should also rescale for ADFG and ADFU if starting values are poor because the time
taken per iteration is less for GLS than for these two methods.
n Do not rescale. RAMONA uses the estimation procedure specified under Method
from the beginning of the iterative procedure. This option should always be used
with OLS.
Cases. Number of cases used to compute the matrix. The number of cases should
exceed the number of manifest variables, p, if you use the maximum Wishart likelihood
method or the generalized least squares method. If you use the ADF Gramian method
or the ADF Unbiased method, the number of cases must exceed 0.5p(p + 1); for
example, with p = 10 manifest variables, more than 55 cases are required.
Iterations. Maximum number of iterations allowed for the iterative procedure.
Convergence criterion. Limit for the residual cosine employed as a convergence
criterion.
Confidence. Confidence interval range.
Using Commands
First, specify your data with USE filename. Continue with:
Usage Considerations
Types of data. RAMONA uses a correlation or covariance matrix either read from a file
or computed from a rectangular file. When specifying ADFG or ADFU, a cases-by-
variables input file must be used.
Print options. Three lengths of output are available. The results included for each are:
n SHORT. The sample covariance (correlation) matrix, path coefficient estimates,
90% confidence intervals, standard errors and t statistics, and variance/covariance
or correlation estimates.
n MEDIUM. The panels listed for SHORT, plus details of the iterative procedure, the
reproduced covariance or correlation matrix, the matrix of residuals, and
information about equality constraints on variances (if applicable).
n LONG. The panels listed for MEDIUM, plus the asymptotic correlation matrix of the
estimators.
Quick Graphs. RAMONA produces no Quick Graphs.
Saving files. You cannot save specific RAMONA results to a file.
BY groups. For a rectangular file, RAMONA produces separate results for each BY
variable.
Bootstrapping. Bootstrapping is not available in this procedure.
Case frequencies. RAMONA uses a FREQUENCY variable, if present, to duplicate cases.
Case weights. RAMONA ignores WEIGHT variables.
RAMONA
MANIFEST var1,var2,
LATENT var1,var2,
MODEL depvar1<-expvar1(i,n1), expvar2(j,n2),
e1<->e2,e3(k,n3)
ESTIMATE / DISP= COVA or CORR
METHOD = MWL or GLS or OLS or ADFG or ADFU
START = ROUGH or CLOSE
NCASES=n ITER=n CONVG=n RESTART CONFI=n
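As a concrete sketch of this command skeleton (the data file MYCORR, the variables Y1 through Y3, the latent names FACTOR and E1 through E3, and the sample size of 100 are hypothetical and are not shipped with SYSTAT; following the worked examples below, the display option is written with TYPE rather than DISP, and the manifest and latent variables are inferred from the data file rather than declared), a one-factor model with three indicators could be specified as:

USE mycorr
RAMONA
MODEL y1 <- factor e1(0,1.0),
y2 <- factor e2(0,1.0),
y3 <- factor e3(0,1.0),
factor <-> factor(0,1.0),
e1 <-> e1,
e2 <-> e2,
e3 <-> e3
PRINT = MEDIUM
ESTIMATE / TYPE=CORR METHOD=MWL NCASES=100 CONFI=0.95

Because every manifest variable here has its own measurement error with an unconstrained variance, the three error variance paths could also be omitted; RAMONA would then add them automatically.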
Examples
Example 1
Path Analysis Basics
The covariance matrix of six manifest variables is shown below. These covariances and
variances were computed from a sample of 932 respondents and are stored in the EX1
data file.
           ANOMIA67  POWRLS67  ANOMIA71  POWRLS71    EDUCTN       SEI
ANOMIA67     11.834
POWRLS67      6.947     9.364
ANOMIA71      6.819     5.091    12.532
POWRLS71      4.783     5.028     7.495     9.986
EDUCTN        3.839     3.889     3.841     3.625     9.610
SEI          21.899    18.831    21.748    18.755    35.522   450.288

In this example, we specify the model illustrated in the Statistical Background section
on p. 729. The role of the manifest and latent variables is clear from the MODEL
statement below. Manifest variables are in the SYSTAT file (latent variables are not).
The input is:
USE ex1
RAMONA
MODEL anomia67 <- alntn67(*,*) e1(0,1.0),
powrls67 <- alntn67(*,*) e2(0,1.0),
anomia71 <- alntn71(*,*) e3(0,1.0),
powrls71 <- alntn71(*,*) e4(0,1.0),
eductn <- ses(*,*) d1(0,1.0),
sei <- ses(*,*) d2(0,1.0),
alntn67 <- ses(*,*) z1(0,1.0),
alntn71 <- alntn67(*,*) ses(*,*) z2(0,1.0),
ses <-> ses(0,1.0),
e1 <-> e1(*,*) e3(*,*),
e2 <-> e2(*,*) e4(*,*),
e3 <-> e3(*,*),
e4 <-> e4(*,*),
d1 <-> d1(*,*),
d2 <-> d2(*,*),
z1 <-> z1(*,*),
z2 <-> z2(*,*),
alntn71 <-> alntn71(0,1.0),
alntn67 <-> alntn67(0,1.0)
PRINT = MEDIUM
ESTIMATE / TYPE=CORR NCASES=932
If you were to specify explicitly the default values of the options for ESTIMATE, the last
statement would read:

ESTIMATE / TYPE=CORR METHOD=MWL START=ROUGH,
CONVG=0.0001 ITER=500

We use the default maximum Wishart likelihood method (METHOD = MWL) to analyze
the correlation matrix. Our analysis differs from Jöreskog's analysis in that the model
is treated as a correlation structure rather than a covariance structure. The display
correlation option of ESTIMATE (TYPE = CORR) identifies that the input is a correlation
matrix, and NCASES = 932 denotes the sample size used to compute it.

The output follows:
There are 6 manifest variables in the model. They are:
ANOMIA67 POWRLS67 ANOMIA71 POWRLS71 EDUCTN SEI


There are 11 latent variables in the model. They are:
ALNTN67 E1 E2 ALNTN71 E3 E4 SES D1 D2 Z1 Z2


RAMONA options in effect are:
Display Corr
Method MWL
Start Rough
Convg 0.000100
Maximum iterations 100
Number of cases 932
Restart No
Confidence Interval 90.000000

Number of manifest variables = 6
Total number of variables in the system = 23.

Details of Iterations
Iter Method Discr. Funct. Max.R.Cos. Max.Const. NRP NBD
-------- ------- -------------- ------------ ------------ ----------
0 OLS 2.990254 0.000000
1(0) OLS 9363.179841 0.999315 87.020340 0 0
1(1) OLS 67.826312 0.974346 9.357014 0 0
1(2) OLS 1.861094 0.657239 1.221196 0 0
2(0) OLS 0.863526 0.644690 0.787367 0 0
3(0) OLS 0.020374 0.512199 0.131453 0 0
4(0) OLS 0.001137 0.301991 0.004030 0 0
5(0) OLS 0.001007 0.001247 0.000027 0 0
5(0) MWL 0.005313 0.034276 0.000027 0 0
6(0) MWL 0.005095 0.009493 0.000065 0 0
7(0) MWL 0.005090 0.000712 0.000003 0 0
8(0) MWL 0.005090 0.000172 0.000000 0 0
9(0) MWL 0.005090 0.000014 0.000000 0 0
10(0) MWL 0.005090 0.000003 0.000000 0 0

Iterative procedure complete.

Convergence limit for residual cosines = 0.000100 on 2 consecutive
iterations.

Convergence limit for variance constraint violations = 5.00000E-07
Value of the maximum variance constraint violation = 1.29230E-11

Sample Correlation Matrix :
ANOMIA67 POWRLS67 ANOMIA71 POWRLS71 EDUCTN
ANOMIA67 1.000
POWRLS67 0.660 1.000
ANOMIA71 0.560 0.470 1.000
POWRLS71 0.440 0.520 0.670 1.000
EDUCTN -0.360 -0.410 -0.350 -0.370 1.000
SEI -0.300 -0.290 -0.290 -0.280 0.540
SEI
SEI 1.000

Number of cases = 932.


Reproduced Correlation Matrix :
ANOMIA67 POWRLS67 ANOMIA71 POWRLS71 EDUCTN
ANOMIA67 1.000
POWRLS67 0.660 1.000
ANOMIA71 0.560 0.469 1.000
POWRLS71 0.441 0.520 0.670 1.000
EDUCTN -0.367 -0.404 -0.357 -0.369 1.000
SEI -0.280 -0.308 -0.272 -0.281 0.540
SEI
SEI 1.000

Residual Matrix (correlations) :
ANOMIA67 POWRLS67 ANOMIA71 POWRLS71 EDUCTN
ANOMIA67 0.000
POWRLS67 0.000 0.000
ANOMIA71 -0.000 0.001 0.000
POWRLS71 -0.001 0.000 -0.000 0.000
EDUCTN 0.007 -0.006 0.007 -0.001 0.000
SEI -0.020 0.018 -0.017 0.001 0.000
SEI
SEI 0.000

Value of the maximum absolute residual = 0.020



ML Estimates of Free Parameters in Dependence Relationships

Point 90.00% Conf. Int. Standard T
Path Param # Estimate Lower Upper Error Value
--------------------------------- -------- ------------------ ------- -----
ANOMIA67 <- ALNTN67 1 0.774 0.733 0.816 0.025 30.73
POWRLS67 <- ALNTN67 2 0.852 0.810 0.894 0.026 33.06
ANOMIA71 <- ALNTN71 3 0.805 0.763 0.848 0.026 31.03
POWRLS71 <- ALNTN71 4 0.832 0.788 0.876 0.027 31.19
EDUCTN <- SES 5 0.842 0.789 0.894 0.032 26.44
SEI <- SES 6 0.642 0.592 0.691 0.030 21.30
ALNTN67 <- SES 7 -0.563 -0.620 -0.506 0.035 -16.26
ALNTN71 <- ALNTN67 8 0.567 0.500 0.634 0.041 13.88
ALNTN71 <- SES 9 -0.207 -0.281 -0.133 0.045 -4.60

Scaled Standard Deviations (nuisance parameters)

Variable Estimate
------------ ------------
ANOMIA67 1.000
POWRLS67 1.000
ANOMIA71 1.000
POWRLS71 1.000
EDUCTN 1.000
SEI 1.000
Values of Fixed Parameters in Dependence Relationships

Path Value
---------------------------- ------------
ANOMIA67 <- E1 1.000
POWRLS67 <- E2 1.000
ANOMIA71 <- E3 1.000
POWRLS71 <- E4 1.000
EDUCTN <- D1 1.000
SEI <- D2 1.000
ALNTN67 <- Z1 1.000
ALNTN71 <- Z2 1.000

ML estimates of free parameters in variance/covariance relationships

Point 90.00% Conf. Int. Standard T
Path Param # Estimate Lower Upper Error Value
--------------------------------- -------- ------------------ ------- -----
E1 <-> E1 10 0.400 0.341 0.470 0.039 10.25
E1 <-> E3 11 0.133 0.091 0.175 0.026 5.22
E2 <-> E2 12 0.274 0.211 0.357 0.044 6.24
E2 <-> E4 13 0.035 -0.009 0.080 0.027 1.30
E3 <-> E3 14 0.351 0.289 0.427 0.042 8.40
E4 <-> E4 15 0.308 0.243 0.390 0.044 6.94
D1 <-> D1 16 0.292 0.216 0.395 0.054 5.44
D2 <-> D2 17 0.588 0.528 0.656 0.039 15.22
Z1 <-> Z1 18 0.683 0.616 0.743 0.039 17.52
Z2 <-> Z2 19 0.503 0.448 0.557 0.033 15.08

Values of Fixed Parameters in Variance/Covariance Relationships

Path Value
----------------------------- ------------
SES <-> SES 1.000

Equality Constraints on Variances

Lagrange Standard
Constraint Value Multiplier Error
---------------------------- ------------ ------------ ------------
ALNTN71 <-> ALNTN71 1.0000 0.000 0.000
ALNTN67 <-> ALNTN67 1.0000 0.000 0.000
ANOMIA67 <-> ANOMIA67 1.0000 0.000 0.000
POWRLS67 <-> POWRLS67 1.0000 0.000 0.000
ANOMIA71 <-> ANOMIA71 1.0000 0.000 -0.000
POWRLS71 <-> POWRLS71 1.0000 0.000 0.000
EDUCTN <-> EDUCTN 1.0000 0.000 -0.000
SEI <-> SEI 1.0000 0.000 0.000

Maximum Likelihood Discrepancy Function

Measures of fit of the model
----------------------------
Sample Discrepancy Function Value : 0.005 (5.090285E-03)

Population discrepancy function value, Fo
Bias adjusted point estimate : 0.001
90.000 percent confidence interval :(0.0,0.011)

Root mean square error of approximation
Steiger-Lind : RMSEA = SQRT(Fo/df)
Point estimate : 0.014
90.000 percent confidence interval :(0.0,0.053)

Expected cross-validation index
Point estimate (modified aic) : 0.042
90.000 percent confidence interval :(0.041,0.052)
CVI (modified AIC) for the saturated model : 0.045
After a summary of the input specifications, SYSTAT produces details of the iteration
process. The number of the step halving step, carried out to yield a reduction in the
discrepancy function plus a penalty for constraint violations, is given in parentheses
next to the iteration number. Method indicates the method of estimation. Discr. Funct.
reports the discrepancy function value. Max. R. Cos. equals the absolute value of the
maximum residual cosine used to indicate convergence. Max. Const. is the absolute
value of the maximum violated variance constraint. This panel also includes the
number of apparently redundant parameters (NRP, the number of zero pivots of the
coefficient matrix of the normal equations) and the number of active bounds on
parameter values (NBD).
The values of NRP and NBD can change from iteration to iteration. If NRP has a
constant nonzero value for several iterations prior to convergence, this suggests that
the model could be overparameterized. The value of NBD indicates the number of
variance or correlation estimates on bounds at any iteration.
Next, the output includes three matrices: the sample correlation (covariance) matrix,
the correlation (covariance) matrix reproduced by the model, and the matrix of
residuals. The residual matrix is the difference between the sample correlation
(covariance) matrix and the reproduced correlation (covariance) matrix. If the input is
a correlation matrix (TYPE = CORR), the residual matrix will have null diagonal
elements.
For both the dependence and covariance relationships, SYSTAT prints estimates of
the free-path coefficients and the values of all fixed-path coefficients involved in the
model. The following values are reported for the free parameters:
n Path.
n Param #. The number of the parameter. This number need not be the same as the
number in the input file. (It is the number assigned to the parameter name in the
asymptotic covariance matrix of estimators given subsequently.)
n Point Estimate. The estimate of the path coefficient.
n 90.00% Conf. Int. A 90% confidence interval for the path coefficient (the default).
If you want to alter the confidence level, specify, for example, CONFI = 0.95.
n Standard Error. An estimate of the standard error of the estimator.
n T value. The value of the t statistic (ratio of estimate to standard error).
Test statistic: : 4.739
Exceedance probabilities:-
Ho: perfect fit (RMSEA = 0.0) : 0.315
Ho: close fit (RMSEA <= 0.050) : 0.929

Multiplier for obtaining test statistic = 931.000
Degrees of freedom = 4
Effective number of parameters = 17
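As a check on this final panel (our own arithmetic, not part of the RAMONA output): the 19 free path and variance/covariance parameters plus the 6 scaled standard deviations, less the 8 variance constraints, give the 17 effective parameters reported, and the 6(6 + 1)/2 = 21 distinct elements of the sample correlation matrix less these 17 effective parameters give the 4 degrees of freedom.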
If the input is a correlation matrix, the scaled standard deviations (nuisance parameters)
are reported with:
n The name of the manifest variable
n The ratio of the standard deviation reproduced from the model to the sample
standard deviation
After the covariance relationship output, SYSTAT presents information about equality
constraints on endogenous variable variances (if applicable):
n Constraint. The variance path that is constrained.
n Value. The value of the endogenous variable variance at convergence.
n Lagrange Multiplier. The value of the Lagrange multiplier at convergence.
n Standard Error. An estimate of the standard error of the Lagrange multiplier.
In most applications, the constraints on endogenous variable variances serve as
identification conditions and all Lagrange multipliers and standard errors are 0.
Example 2
Path Analysis with a Restart File
This example is based on Jöreskog's (1977) path analysis model for the Duncan,
Haller, and Portes (1971) data on peer influences on ambition. It illustrates a situation
where some manifest variables are exogenous. It also illustrates the use of a restart file
for creating a command file for a second run in which some modifications have been
made. The example consists of two runs. Jöreskog's original model is used for the first
run. The model is treated as a covariance structure; this is inappropriate because a
correlation matrix is used as input. In the second run, we use a restart file that treats the
model as a correlation structure.
The six manifest exogenous variables are:
REPARASP Respondent's parental aspiration
RESOCIEC Respondent's socioeconomic status
REINTGCE Respondent's intelligence
BFINTGCE Best friend's intelligence
BFSOCIEC Best friend's socioeconomic status
BFPARASP Best friend's parental aspiration
The four endogenous manifest variables are:
REOCCASP Respondent's occupational aspiration
BFEDASP Best friend's educational aspiration
REEDASP Respondent's educational aspiration
BFOCCASP Best friend's occupational aspiration
The latent endogenous variables are:
REAMBITN Respondent's ambition
BFAMBITN Best friend's ambition
And the exogenous error variables are E1, E2, E3, E4, Z1, and Z2.
[Path diagram for the peer-influences model: the exogenous manifest variables REPARASP, REINTGCE, RESOCIEC, BFPARASP, BFINTGCE, and BFSOCIEC point to the latent variables REAMBITN and BFAMBITN; REAMBITN points to REOCCASP and REEDASP, and BFAMBITN points to BFOCCASP and BFEDASP; the error terms E1 through E4, Z1, and Z2 enter with path coefficients fixed at 1.0.]
The correlation matrix for the manifest variables is stored in the file EX2. Following is
the input file for the first run:
USE ex2
RAMONA
MANIFEST = reintgce reparasp resociec reoccasp,
reedasp bfintgce bfparasp bfsociec,
bfoccasp bfedasp
LATENT = reambitn bfambitn e1 e2 e3 e4 z1 z2
MODEL reoccasp <- reambitn(0,1.0) e1(0,1.0),
reedasp <- reambitn e2(0,1.0),
bfedasp <- bfambitn e3(0,1.0),
bfoccasp <- bfambitn(0,1.0) e4(0,1.0),
reambitn <- bfambitn z1(0,1.0) reparasp,
reambitn <- reintgce resociec bfsociec,
bfambitn <- reambitn z2(0,1.0) resociec,
bfambitn <- bfsociec bfintgce bfparasp,
reparasp <-> reparasp reintgce resociec,
reparasp <-> bfsociec bfintgce bfparasp,
reintgce <-> reintgce resociec bfsociec,
reintgce <-> bfintgce bfparasp,
resociec <-> resociec bfsociec bfintgce,
resociec <-> bfparasp,
bfsociec <-> bfsociec bfintgce bfparasp,
bfintgce <-> bfintgce bfparasp,
bfparasp <-> bfparasp,
e1 <-> e1,
e2 <-> e2,
e3 <-> e3,
e4 <-> e4,
z1 <-> z1,
z2 <-> z2
PRINT = MEDIUM
OUTPUT BATCH = ex2b.syc / PROGRAM
ESTIMATE / TYPE=COVA NCASES=329 RESTART

You would specify the default values of other options for ESTIMATE as:

ESTIMATE / TYPE=COVA METHOD=MWL START=ROUGH ITER=500,
CONVG=0.0001 NCASES RESTART

The RESTART option of ESTIMATE creates a restart command file, EX2B.SYC, that is
submitted as input in the second run. RESTART tells RAMONA to take the estimated
parameter values and insert them as starting values in the MODEL statement. Note that
we must also type OUTPUT BATCH = filename to do this. Before the second run, we
modify EX2B.SYC to treat the model as a correlation structure.

Following Jöreskog's model, the path coefficients REOCCASP <- REAMBITN and
BFOCCASP <- BFAMBITN are set equal to 1 for identification purposes. The output
follows.
There are 10 manifest variables in the model. They are:
REINTGCE REPARASP RESOCIEC REOCCASP REEDASP BFINTGCE BFPARASP
BFSOCIEC BFOCCASP BFEDASP


There are 8 latent variables in the model. They are:
REAMBITN E1 E2 BFAMBITN E3 E4 Z1 Z2


RAMONA options in effect are:
Display Covar
Method MWL
Start Rough
Convg 0.000100
Maximum iterations 100
Number of cases 329
Restart Yes
Confidence Interval 90.000000

Number of manifest variables = 10
Total number of variables in the system = 18.
***WARNING***
A correlation matrix was provided although DISP=COV fit measures and
standard errors may be inappropriate.


Details of Iterations
Iter Method Discr. Funct. Max.R.Cos. Max.Const. NRP NBD
-------- ------- -------------- ------------ ------------ ----------
0 OLS 1.500570
1(0) OLS 0.325498 0.720392 0 0
2(0) OLS 0.023128 0.191263 0 0
3(0) OLS 0.019538 0.007112 0 0
3(0) MWL 0.085416 0.059603 0 0
4(0) MWL 0.082172 0.016527 0 0
5(0) MWL 0.082003 0.003878 0 0
6(0) MWL 0.081991 0.001141 0 0
7(0) MWL 0.081990 0.000260 0 0
8(0) MWL 0.081990 0.000081 0 0
9(0) MWL 0.081990 0.000018 0 0

Iterative procedure complete.


Convergence limit for residual cosines = 0.000100 on 2 consecutive
iterations.

Sample Covariance Matrix :
REINTGCE REPARASP RESOCIEC REOCCASP REEDASP
REINTGCE 1.000
REPARASP 0.184 1.000
RESOCIEC 0.222 0.049 1.000
REOCCASP 0.410 0.214 0.324 1.000
REEDASP 0.404 0.274 0.405 0.625 1.000
BFINTGCE 0.336 0.078 0.230 0.299 0.286
BFPARASP 0.102 0.115 0.093 0.076 0.070
BFSOCIEC 0.186 0.019 0.271 0.293 0.241
BFOCCASP 0.260 0.084 0.279 0.422 0.328
BFEDASP 0.290 0.112 0.305 0.327 0.367
BFINTGCE BFPARASP BFSOCIEC BFOCCASP BFEDASP
BFINTGCE 1.000
BFPARASP 0.209 1.000
BFSOCIEC 0.295 -0.044 1.000
BFOCCASP 0.501 0.199 0.361 1.000
BFEDASP 0.519 0.278 0.410 0.640 1.000

Number of cases = 329.


Reproduced Covariance Matrix :
REINTGCE REPARASP RESOCIEC REOCCASP REEDASP
REINTGCE 1.000
REPARASP 0.184 1.000
RESOCIEC 0.222 0.049 1.000
REOCCASP 0.393 0.239 0.357 0.999
REEDASP 0.417 0.254 0.379 0.623 0.999
BFINTGCE 0.336 0.078 0.230 0.258 0.274
BFPARASP 0.102 0.115 0.093 0.103 0.110
BFSOCIEC 0.186 0.019 0.271 0.255 0.270
BFOCCASP 0.255 0.095 0.282 0.330 0.351
BFEDASP 0.273 0.102 0.303 0.354 0.376
BFINTGCE BFPARASP BFSOCIEC BFOCCASP BFEDASP
BFINTGCE 1.000
BFPARASP 0.209 1.000
BFSOCIEC 0.295 -0.044 1.000
BFOCCASP 0.489 0.237 0.374 0.999
BFEDASP 0.525 0.254 0.401 0.639 0.999

Residual Matrix (covariances) :
REINTGCE REPARASP RESOCIEC REOCCASP REEDASP
REINTGCE 0.0
REPARASP 0.0 0.0
RESOCIEC 0.0 0.000 0.0
REOCCASP 0.018 -0.026 -0.033 0.001
REEDASP -0.013 0.020 0.026 0.001 0.001
BFINTGCE 0.0 0.0 0.0 0.042 0.013
BFPARASP 0.0 0.0 0.0 -0.027 -0.039
BFSOCIEC 0.0 0.0 0.0 0.038 -0.030
BFOCCASP 0.005 -0.011 -0.004 0.091 -0.023
BFEDASP 0.017 0.010 0.003 -0.027 -0.009
BFINTGCE BFPARASP BFSOCIEC BFOCCASP BFEDASP
BFINTGCE 0.0
BFPARASP 0.0 0.0
BFSOCIEC 0.0 0.0 0.0
BFOCCASP 0.011 -0.038 -0.013 0.001
BFEDASP -0.006 0.024 0.009 0.001 0.001


Value of the maximum absolute residual = 0.091

ML Estimates of Free Parameters in Dependence Relationships



Point 90.00% Conf. Int. Standard T
Path Param # Estimate Lower Upper Error Value
--------------------------------- -------- ------------------ ------- -----
REEDASP <- REAMBITN 1 1.062 0.914 1.210 0.090 11.80
BFEDASP <- BFAMBITN 2 1.073 0.940 1.206 0.081 13.23
REAMBITN <- BFAMBITN 3 0.174 0.032 0.316 0.086 2.02
REAMBITN <- REPARASP 4 0.164 0.100 0.228 0.039 4.23
REAMBITN <- REINTGCE 5 0.255 0.185 0.324 0.043 5.99
REAMBITN <- RESOCIEC 6 0.222 0.151 0.294 0.043 5.11
REAMBITN <- BFSOCIEC 7 0.079 0.001 0.156 0.047 1.68
BFAMBITN <- REAMBITN 8 0.185 0.054 0.317 0.080 2.33
BFAMBITN <- RESOCIEC 9 0.067 -0.004 0.138 0.043 1.55
BFAMBITN <- BFSOCIEC 10 0.218 0.151 0.284 0.040 5.38
BFAMBITN <- BFINTGCE 11 0.330 0.262 0.398 0.041 7.97
BFAMBITN <- BFPARASP 12 0.152 0.092 0.212 0.036 4.18

Values of Fixed Parameters in Dependence Relationships

Path Value
---------------------------- ------------
REOCCASP <- REAMBITN 1.000
REOCCASP <- E1 1.000
REEDASP <- E2 1.000
BFEDASP <- E3 1.000
BFOCCASP <- BFAMBITN 1.000
BFOCCASP <- E4 1.000
REAMBITN <- Z1 1.000
BFAMBITN <- Z2 1.000

ML estimates of free parameters in variance/covariance relationships

Point 90.00% Conf. Int. Standard T
Path Param # Estimate Lower Upper Error Value
--------------------------------- -------- ------------------ ------- -----
REPARASP <-> REPARASP 13 1.000 0.879 1.137 0.078 12.81
REPARASP <-> REINTGCE 14 0.184 0.092 0.276 0.056 3.28
REPARASP <-> RESOCIEC 15 0.049 -0.042 0.140 0.055 0.88
REPARASP <-> BFSOCIEC 16 0.019 -0.072 0.109 0.055 0.34
REPARASP <-> BFINTGCE 17 0.078 -0.013 0.169 0.055 1.41
REPARASP <-> BFPARASP 18 0.115 0.023 0.206 0.056 2.06
REINTGCE <-> REINTGCE 19 1.000 0.879 1.137 0.078 12.81
REINTGCE <-> RESOCIEC 20 0.222 0.129 0.315 0.057 3.93
REINTGCE <-> BFSOCIEC 21 0.186 0.094 0.278 0.056 3.31
REINTGCE <-> BFINTGCE 22 0.336 0.240 0.431 0.058 5.76
REINTGCE <-> BFPARASP 23 0.102 0.011 0.193 0.056 1.84
RESOCIEC <-> RESOCIEC 24 1.000 0.879 1.137 0.078 12.81
RESOCIEC <-> BFSOCIEC 25 0.271 0.177 0.365 0.057 4.73
RESOCIEC <-> BFINTGCE 26 0.230 0.137 0.323 0.057 4.06
RESOCIEC <-> BFPARASP 27 0.093 0.002 0.184 0.055 1.68
BFSOCIEC <-> BFSOCIEC 28 1.000 0.879 1.137 0.078 12.81
BFSOCIEC <-> BFINTGCE 29 0.295 0.200 0.390 0.058 5.12
BFSOCIEC <-> BFPARASP 30 -0.044 -0.135 0.047 0.055 -0.79
BFINTGCE <-> BFINTGCE 31 1.000 0.879 1.137 0.078 12.81
BFINTGCE <-> BFPARASP 32 0.209 0.116 0.301 0.056 3.70
BFPARASP <-> BFPARASP 33 1.000 0.879 1.137 0.078 12.81
E1 <-> E1 34 0.412 0.336 0.506 0.051 8.07
E2 <-> E2 35 0.337 0.262 0.434 0.052 6.50
E3 <-> E3 36 0.313 0.246 0.399 0.046 6.84
E4 <-> E4 37 0.404 0.335 0.487 0.046 8.75
Z1 <-> Z1 38 0.281 0.214 0.370 0.047 6.03
Z2 <-> Z2 39 0.229 0.173 0.303 0.039 5.86


Maximum Likelihood Discrepancy Function

Measures of fit of the model
----------------------------
Sample Discrepancy Function Value : 0.082 (8.199040E-02)

Population discrepancy function value, Fo
Bias adjusted point estimate : 0.033
90.000 percent confidence interval :(0.001,0.089)

Root mean square error of approximation
Steiger-Lind : RMSEA = SQRT(Fo/df)
Point estimate : 0.046
90.000 percent confidence interval :(0.008,0.075)

Expected cross-validation index
Point estimate (modified aic) : 0.320
90.000 percent confidence interval :(0.288,0.376)
CVI (modified AIC) for the saturated model : 0.335

Using the Restart File
A restart file was created during the first run to form an input file that specifies the
model represented in the path diagram. Now type the following modifications into the
EX2B restart file and save the file:
n TYPE = COV is replaced by TYPE = CORR.
n START = ROUGH is replaced by START = CLOSE.
n REOCCASP <- REAMBITN(0,1.0) is replaced by REOCCASP <- REAMBITN(*,1.0),
freeing a fixed-path coefficient.
n BFOCCASP <- BFAMBITN(0,1.0) is replaced by BFOCCASP <- BFAMBITN(*,1.0),
freeing a fixed-path coefficient.
n REAMBITN <-> REAMBITN(0,1.0) is added, imposing a variance constraint on an
endogenous latent variable.
n BFAMBITN <-> BFAMBITN(0,1.0) is added, imposing a variance constraint on an
endogenous latent variable.
Test statistic: : 26.893
Exceedance probabilities:-
Ho: perfect fit (RMSEA = 0.0) : 0.043
Ho: close fit (RMSEA <= 0.050) : 0.560

Multiplier for obtaining test statistic = 328.000
Degrees of freedom = 16
Effective number of parameters = 39
The modified restart file is shown below:
Note that we rounded some parameter values to shorten the commands. Also, the
START setting, ROUGH, has been changed to CLOSE (under ESTIMATE) because a
restart file is used.
USE ex2
RAMONA
MODEL reoccasp <- reambitn(*,1.0) e1(0,1.0),
reedasp <- reambitn(1,1.062) e2(0,1.0),
bfedasp <- bfambitn(2,1.073) e3(0,1.0),
bfoccasp <- bfambitn(*,1.0) e4(0,1.0),
reambitn <- bfambitn(3,0.174) z1(0,1.0),
reparasp(4,0.164) reintgce(5,0.255),
resociec(6,0.222) bfsociec(7,0.079),
bfambitn <- reambitn(8,0.185) z2(0,1.0),
resociec(9,0.668),
bfsociec(10,0.218),
bfintgce(11,0.330),
bfparasp(12,0.152),
reparasp <-> reparasp(13,1.0,),
reintgce(14,0.184),
resociec(15,0.049),
bfsociec(16,0.019),
bfintgce(17,0.078),
bfparasp(18,0.115),
reintgce <-> reintgce(19,1.000),
resociec(20,0.222),
bfsociec(21,0.186),
bfintgce(22,0.336),
bfparasp(23,0.102),
resociec <-> resociec(24,1.0),
bfsociec(25,0.271),
bfintgce(26,0.230),
bfparasp(27,0.093),
bfsociec <-> bfsociec(28,1.0),
bfintgce(29,0.29),
bfparasp(30,-0.044),
bfintgce <-> bfintgce(31,1.0),
bfparasp(32,0.209),
bfparasp <-> bfparasp(33,1.0),
e1 <-> e1(34,0.412),
e2 <-> e2(35,0.337),
e3 <-> e3(36,0.313),
e4 <-> e4(37,0.404),
z1 <-> z1(38,0.281),
z2 <-> z2(39,0.229),
reambitn <-> reambitn(0,1.0),
bfambitn <-> bfambitn(0,1.0)
PRINT = MEDIUM
ESTIMATE / TYPE=CORR START=CLOSE NCASES=329
Now execute this modified file (after you have edited it and saved it using FEDIT
or another text editor). The input is:
SUBMIT ex2b

The output is:
There are 10 manifest variables in the model. They are:
REINTGCE REPARASP RESOCIEC REOCCASP REEDASP BFINTGCE BFPARASP
BFSOCIEC BFOCCASP BFEDASP


There are 8 latent variables in the model. They are:
REAMBITN E1 E2 BFAMBITN E3 E4 Z1 Z2


RAMONA options in effect are:
Display Corr
Method MWL
Start Close
Convg 0.000100
Maximum iterations 100
Number of cases 329
Restart No
Confidence Interval 90.000000

Number of manifest variables = 10
Total number of variables in the system = 28.
Reading correlation matrix...

Details of Iterations
Iter Method Discr. Funct. Max.R.Cos. Max.Const. NRP NBD
-------- ------- -------------- ------------ ------------ ----------
0 MWL 0.081990 0.000000
1(0) MWL 0.081990 0.000005 0.000000 0 0
2(0) MWL 0.081990 0.000001 0.000000 0 0

Iterative procedure complete.


Convergence limit for residual cosines = 0.000100 on 2 consecutive
iterations.

Convergence limit for variance constraint violations = 5.00000E-07
Value of the maximum variance constraint violation = 7.33524E-12


Sample Correlation Matrix :
REINTGCE REPARASP RESOCIEC REOCCASP REEDASP
REINTGCE 1.000
REPARASP 0.184 1.000
RESOCIEC 0.222 0.049 1.000
REOCCASP 0.410 0.214 0.324 1.000
REEDASP 0.404 0.274 0.405 0.625 1.000
BFINTGCE 0.336 0.078 0.230 0.299 0.286
BFPARASP 0.102 0.115 0.093 0.076 0.070
BFSOCIEC 0.186 0.019 0.271 0.293 0.241
BFOCCASP 0.260 0.084 0.279 0.422 0.328
BFEDASP 0.290 0.112 0.305 0.327 0.367
BFINTGCE BFPARASP BFSOCIEC BFOCCASP BFEDASP
BFINTGCE 1.000
BFPARASP 0.209 1.000
BFSOCIEC 0.295 -0.044 1.000
BFOCCASP 0.501 0.199 0.361 1.000
BFEDASP 0.519 0.278 0.410 0.640 1.000

Number of cases = 329.
Reproduced Correlation Matrix :
REINTGCE REPARASP RESOCIEC REOCCASP REEDASP
REINTGCE 1.000
REPARASP 0.184 1.000
RESOCIEC 0.222 0.049 1.000
REOCCASP 0.393 0.240 0.357 1.000
REEDASP 0.417 0.254 0.379 0.624 1.000
BFINTGCE 0.336 0.078 0.230 0.258 0.274
BFPARASP 0.102 0.115 0.093 0.103 0.110
BFSOCIEC 0.186 0.019 0.271 0.255 0.270
BFOCCASP 0.255 0.095 0.282 0.330 0.351
BFEDASP 0.273 0.102 0.303 0.355 0.376
BFINTGCE BFPARASP BFSOCIEC BFOCCASP BFEDASP
BFINTGCE 1.000
BFPARASP 0.209 1.000
BFSOCIEC 0.295 -0.044 1.000
BFOCCASP 0.489 0.237 0.374 1.000
BFEDASP 0.525 0.254 0.401 0.640 1.000

Residual Matrix (correlations) :
REINTGCE REPARASP RESOCIEC REOCCASP REEDASP
REINTGCE 0.0
REPARASP 0.0 0.0
RESOCIEC 0.0 0.0 0.0
REOCCASP 0.017 -0.026 -0.033 0.000
REEDASP -0.013 0.020 0.025 0.001 0.000
BFINTGCE 0.0 0.0 0.0 0.042 0.012
BFPARASP 0.000 0.0 0.0 -0.027 -0.039
BFSOCIEC 0.0 0.0 0.0 0.038 -0.030
BFOCCASP 0.005 -0.011 -0.004 0.091 -0.023
BFEDASP 0.017 0.010 0.002 -0.028 -0.010
BFINTGCE BFPARASP BFSOCIEC BFOCCASP BFEDASP
BFINTGCE 0.0
BFPARASP 0.0 0.0
BFSOCIEC 0.0 0.0 0.0
BFOCCASP 0.011 -0.038 -0.013 0.000
BFEDASP -0.006 0.024 0.009 0.001 0.000


Value of the maximum absolute residual = 0.091



ML Estimates of Free Parameters in Dependence Relationships

Point 90.00% Conf. Int. Standard T
Path Param # Estimate Lower Upper Error Value
--------------------------------- -------- ------------------ ------- -----
REOCCASP <- REAMBITN 1 0.766 0.710 0.823 0.034 22.21
REEDASP <- REAMBITN 2 0.814 0.759 0.868 0.033 24.52
BFEDASP <- BFAMBITN 3 0.828 0.781 0.876 0.029 28.49
BFOCCASP <- BFAMBITN 4 0.772 0.721 0.823 0.031 24.75
REAMBITN <- BFAMBITN 5 0.175 0.034 0.317 0.086 2.04
REAMBITN <- REPARASP 6 0.214 0.133 0.294 0.049 4.36
REAMBITN <- REINTGCE 7 0.332 0.248 0.417 0.051 6.47
REAMBITN <- RESOCIEC 8 0.290 0.201 0.378 0.054 5.39
REAMBITN <- BFSOCIEC 9 0.103 0.002 0.204 0.061 1.69
BFAMBITN <- REAMBITN 10 0.184 0.055 0.313 0.078 2.35
BFAMBITN <- RESOCIEC 11 0.087 -0.005 0.178 0.056 1.55
BFAMBITN <- BFSOCIEC 12 0.282 0.200 0.365 0.050 5.62
BFAMBITN <- BFINTGCE 13 0.428 0.349 0.506 0.048 9.00
BFAMBITN <- BFPARASP 14 0.197 0.121 0.273 0.046 4.27


Scaled Standard Deviations (nuisance parameters)

Variable Estimate
------------ ------------
REOCCASP 1.000
REEDASP 1.000
BFOCCASP 1.000
BFEDASP 1.000
REPARASP 1.000
BFINTGCE 1.000
BFPARASP 1.000
BFSOCIEC 1.000
RESOCIEC 1.000
REINTGCE 1.000

Values of Fixed Parameters in Dependence Relationships

Path Value
---------------------------- ------------
REOCCASP <- E1 1.000
REEDASP <- E2 1.000
BFEDASP <- E3 1.000
BFOCCASP <- E4 1.000
REAMBITN <- Z1 1.000
BFAMBITN <- Z2 1.000

ML estimates of free parameters in variance/covariance relationships

Point 90.00% Conf. Int. Standard T
Path Param # Estimate Lower Upper Error Value
--------------------------------- -------- ------------------ ------- -----
REPARASP <-> REINTGCE 15 0.184 0.095 0.270 0.053 3.45
REPARASP <-> RESOCIEC 16 0.049 -0.042 0.139 0.055 0.89
REPARASP <-> BFSOCIEC 17 0.019 -0.072 0.109 0.055 0.34
REPARASP <-> BFINTGCE 18 0.078 -0.012 0.168 0.055 1.42
REPARASP <-> BFPARASP 19 0.115 0.024 0.203 0.054 2.10
REINTGCE <-> RESOCIEC 20 0.222 0.134 0.306 0.052 4.23
REINTGCE <-> BFSOCIEC 21 0.186 0.097 0.272 0.053 3.49
REINTGCE <-> BFINTGCE 22 0.336 0.253 0.414 0.049 6.85
REINTGCE <-> BFPARASP 23 0.102 0.012 0.191 0.055 1.87
RESOCIEC <-> BFSOCIEC 24 0.271 0.185 0.353 0.051 5.29
RESOCIEC <-> BFINTGCE 25 0.230 0.143 0.314 0.052 4.40
RESOCIEC <-> BFPARASP 26 0.093 0.003 0.182 0.055 1.70
BFSOCIEC <-> BFINTGCE 27 0.295 0.210 0.376 0.050 5.85
BFSOCIEC <-> BFPARASP 28 -0.044 -0.134 0.047 0.055 -0.79
BFINTGCE <-> BFPARASP 29 0.209 0.120 0.294 0.053 3.95
E1 <-> E1 30 0.413 0.334 0.509 0.053 7.80
E2 <-> E2 31 0.338 0.259 0.439 0.054 6.25
E3 <-> E3 32 0.314 0.244 0.404 0.048 6.51
E4 <-> E4 33 0.404 0.332 0.492 0.048 8.39
Z1 <-> Z1 34 0.479 0.390 0.570 0.055 8.64
Z2 <-> Z2 35 0.384 0.305 0.470 0.051 7.59

Values of Fixed Parameters in Variance/Covariance Relationships

Path Value
----------------------------- ------------
REPARASP <-> REPARASP 1.000
REINTGCE <-> REINTGCE 1.000
RESOCIEC <-> RESOCIEC 1.000
BFSOCIEC <-> BFSOCIEC 1.000
BFINTGCE <-> BFINTGCE 1.000
BFPARASP <-> BFPARASP 1.000

The discrepancy function values and measures of fit of the model are the same in both
runs, but the maximum likelihood estimates differ because of different identification
conditions. The standard errors in the second run differ (those in the first run were
incorrect). An appropriate warning has been output by RAMONA. Notice in the last
run that the Lagrange multipliers and the corresponding standard errors are 0 because
all equality constraints on endogenous variable variances act as identification
conditions, not constraints on the model. This is the case in most, but not all, practical
applications.
Equality Constraints on Variances

Lagrange Standard
Constraint Value Multiplier Error
---------------------------- ------------ ------------ ------------
REAMBITN <-> REAMBITN 1.0000 0.000 0.000
BFAMBITN <-> BFAMBITN 1.0000 0.000 -0.000
REOCCASP <-> REOCCASP 1.0000 0.000 -0.000
REEDASP <-> REEDASP 1.0000 0.000 0.000
BFOCCASP <-> BFOCCASP 1.0000 0.000 0.000
BFEDASP <-> BFEDASP 1.0000 0.000 0.000


Maximum Likelihood Discrepancy Function

Measures of fit of the model
----------------------------
Sample Discrepancy Function Value : 0.082 (8.199040E-02)

Population discrepancy function value, Fo
Bias adjusted point estimate : 0.033
90.000 percent confidence interval :(0.001,0.089)

Root mean square error of approximation
Steiger-Lind : RMSEA = SQRT(Fo/df)
Point estimate : 0.046
90.000 percent confidence interval :(0.008,0.075)

Expected cross-validation index
Point estimate (modified aic) : 0.320
90.000 percent confidence interval :(0.288,0.376)
CVI (modified AIC) for the saturated model : 0.335

Test statistic: : 26.893
Exceedance probabilities:-
Ho: perfect fit (RMSEA = 0.0) : 0.043
Ho: close fit (RMSEA <= 0.050) : 0.560

Multiplier for obtaining test statistic = 328.000
Degrees of freedom = 16
Effective number of parameters = 39
Example 3
Path Analysis Using Rectangular Input
This example (Mels and Koorts, 1989) illustrates how RAMONA uses the usual cases-
by-variables SYSTAT data file. Asymptotically distribution-free estimates are
obtained.
A questionnaire concerned with job satisfaction was completed by 213 nurses.
There are 10 manifest variables that serve as indicators of 4 latent variables: job
security (JOBSEC), attitude toward training (TRAING), opportunities for promotion
(PROMOT), and relations with superiors (RELSUP). The path diagram shows a model
to account for the causal relationships among the latent variables.
[Path diagram for the job-satisfaction model: UNFAIR, DCHARG, and UNEMP load on JOBSEC; ITRAIN, STRAIN, and ETRAIN load on TRAING; IPROMOT and OPROMOT load on PROMOT; ISUP and PROSUP load on RELSUP; TRAING, PROMOT, and RELSUP point to JOBSEC; each manifest variable receives a path fixed at 1.0 from its error term E1 through E10, and JOBSEC receives a path fixed at 1.0 from Z1.]
The input is:
USE ex3
RAMONA
MANIFEST = unfair dcharg unemp itrain strain etrain,
ipromot opromot isup prosup
LATENT = jobsec traing promot relsup e1 e2 e3 e4 e5,
e6 e7 e8 e9 e10 z1
MODEL unfair <- jobsec e1(0,1.0),
dcharg <- jobsec e2(0,1.0),
unemp <- jobsec e3(0,1.0),
itrain <- traing e4(0,1.0),
strain <- traing e5(0,1.0),
etrain <- traing e6(0,1.0),
ipromot <- promot e7(0,1.0),
opromot <- promot e8(0,1.0),
isup <- relsup e9(0,1.0),
prosup <- relsup e10(0,1.0),
jobsec <- traing promot relsup z1(0,1.0),
traing <-> traing (0,1.0),
promot <-> promot (0,1.0),
relsup <-> relsup (0,1.0),
traing <-> promot,
traing <-> relsup,
promot <-> relsup,
e1 <-> e1,
e2 <-> e2,
e3 <-> e3,
e4 <-> e4,
e5 <-> e5,
e6 <-> e6,
e7 <-> e7,
e8 <-> e8,
e9 <-> e9,
e10 <-> e10,
z1 <-> z1,
jobsec <-> jobsec(0,1.0)
PRINT = MEDIUM
ESTIMATE / TYPE=CORR METHOD=ADFU

The output is:
There are 10 manifest variables in the model. They are:
UNFAIR DCHARG UNEMP ITRAIN STRAIN ETRAIN IPROMOT OPROMOT
ISUP PROSUP


There are 15 latent variables in the model. They are:
JOBSEC E1 E2 E3 TRAING E4 E5 E6 PROMOT E7 E8 RELSUP E9
E10 Z1


RAMONA options in effect are:
Display Corr
Method ADFU
Start Rough
Convg 0.000100
Maximum iterations 100
Number of cases determined when data are read
Restart No
Confidence Interval 90.000000

Number of manifest variables = 10
Total number of variables in the system = 35.
Computing mean vector...
Computing covariance matrix and fourth order moments...
Computing ADF weight matrix...

Overall kurtosis = 19.754
Normalised = 9.305
Relative = 1.165




Individual
Variable kurtoses Normalised Relative
UNFAIR 1.395 4.155 1.465
DCHARG 1.866 5.560 1.622
UNEMP 0.181 0.540 1.060
ITRAIN -0.560 -1.669 0.813
STRAIN -1.102 -3.282 0.633
ETRAIN -0.730 -2.174 0.757
IPROMOT -1.006 -2.997 0.665
OPROMOT -0.757 -2.256 0.748
ISUP -0.945 -2.815 0.685
PROSUP -0.547 -1.628 0.818

Smallest relative pivot of covariance matrix of sample
covariances = 0.149

Details of Iterations
Iter Method Discr. Funct. Max.R.Cos. Max.Const. NRP NBD
-------- ------- -------------- ------------ ------------ ----------
0 OLS 1.254639 0.000000
1(0) OLS 0.398537 0.556472 0.405283 0 0
2(0) OLS 0.079200 0.115359 0.045912 0 0
3(0) OLS 0.075227 0.010971 0.000398 0 0
4(0) OLS 0.075196 0.002148 0.000018 0 0
4(0) ADFU 0.393299 0.361351 0.000018 0 0
5(0) ADFU 0.190011 0.084716 0.039737 0 0
6(0) ADFU 0.184936 0.019973 0.004648 0 0
7(0) ADFU 0.184639 0.003195 0.000205 0 0
8(0) ADFU 0.184609 0.001973 0.000061 0 0
9(0) ADFU 0.184606 0.000414 0.000002 0 0
10(0) ADFU 0.184605 0.000219 0.000001 0 0
11(0) ADFU 0.184605 0.000049 0.000000 0 0
12(0) ADFU 0.184605 0.000025 0.000000 0 0

Iterative procedure complete.


Convergence limit for residual cosines = 0.000100 on 2 consecutive
iterations.

Convergence limit for variance constraint violations = 5.00000E-07
Value of the maximum variance constraint violation = 1.13880E-08
Sample Correlation Matrix :
UNFAIR DCHARG UNEMP ITRAIN STRAIN
UNFAIR 1.000
DCHARG 0.438 1.000
UNEMP 0.249 0.455 1.000
ITRAIN 0.150 0.110 0.056 1.000
STRAIN 0.173 0.209 0.028 0.543 1.000
ETRAIN 0.184 0.168 -0.006 0.544 0.694
IPROMOT 0.134 0.210 0.169 0.082 0.240
OPROMOT 0.099 0.179 0.159 0.115 0.184
ISUP 0.154 0.177 0.140 0.284 0.456
PROSUP 0.213 0.212 0.038 0.263 0.337
ETRAIN IPROMOT OPROMOT ISUP PROSUP
ETRAIN 1.000
IPROMOT 0.237 1.000
OPROMOT 0.208 0.683 1.000
ISUP 0.348 0.389 0.319 1.000
PROSUP 0.262 0.263 0.185 0.475 1.000

Number of cases = 213.


Reproduced Correlation Matrix :
UNFAIR DCHARG UNEMP ITRAIN STRAIN
UNFAIR 1.000
DCHARG 0.481 1.000
UNEMP 0.382 0.602 1.000
ITRAIN 0.081 0.128 0.102 1.000
STRAIN 0.093 0.146 0.116 0.638 1.000
ETRAIN 0.089 0.140 0.111 0.609 0.695
IPROMOT 0.140 0.221 0.176 0.171 0.195
OPROMOT 0.121 0.192 0.152 0.148 0.169
ISUP 0.124 0.196 0.156 0.364 0.415
PROSUP 0.098 0.154 0.122 0.286 0.326
ETRAIN IPROMOT OPROMOT ISUP PROSUP
ETRAIN 1.000
IPROMOT 0.186 1.000
OPROMOT 0.161 0.743 1.000
ISUP 0.396 0.377 0.327 1.000
PROSUP 0.311 0.296 0.257 0.560 1.000

Residual Matrix (correlations) :
UNFAIR DCHARG UNEMP ITRAIN STRAIN
UNFAIR -0.000
DCHARG -0.043 -0.000
UNEMP -0.133 -0.148 -0.000
ITRAIN 0.068 -0.018 -0.045 0.000
STRAIN 0.080 0.062 -0.088 -0.095 0.000
ETRAIN 0.095 0.028 -0.117 -0.065 -0.000
IPROMOT -0.007 -0.011 -0.007 -0.089 0.045
OPROMOT -0.023 -0.013 0.007 -0.033 0.016
ISUP 0.030 -0.020 -0.016 -0.080 0.042
PROSUP 0.115 0.057 -0.084 -0.023 0.012
ETRAIN IPROMOT OPROMOT ISUP PROSUP
ETRAIN 0.000
IPROMOT 0.051 0.000
OPROMOT 0.047 -0.060 0.000
ISUP -0.047 0.011 -0.008 -0.000
PROSUP -0.049 -0.034 -0.072 -0.085 0.000


Value of the maximum absolute residual = 0.148



ADFU Estimates of Free Parameters in Dependence Relationships

Point 90.00% Conf. Int. Standard T
Path Param # Estimate Lower Upper Error Value
--------------------------------- -------- ------------------ ------- -----
UNFAIR <- JOBSEC 1 0.552 0.451 0.653 0.061 9.01
DCHARG <- JOBSEC 2 0.871 0.770 0.972 0.061 14.22
UNEMP <- JOBSEC 3 0.692 0.592 0.791 0.061 11.42
ITRAIN <- TRAING 4 0.748 0.670 0.826 0.047 15.78
STRAIN <- TRAING 5 0.853 0.808 0.899 0.028 30.81
ETRAIN <- TRAING 6 0.814 0.756 0.873 0.035 23.04
IPROMOT <- PROMOT 7 0.926 0.842 1.011 0.052 17.98
OPROMOT <- PROMOT 8 0.802 0.714 0.891 0.054 14.96
ISUP <- RELSUP 9 0.844 0.752 0.937 0.056 14.97
PROSUP <- RELSUP 10 0.663 0.568 0.758 0.058 11.48
JOBSEC <- TRAING 11 0.074 -0.129 0.277 0.123 0.60
JOBSEC <- PROMOT 12 0.192 0.075 0.310 0.071 2.70
JOBSEC <- RELSUP 13 0.132 -0.081 0.345 0.130 1.02


Scaled Standard Deviations (nuisance parameters)

Variable Estimate
------------ ------------
UNFAIR 1.008
DCHARG 0.962
UNEMP 0.974
ITRAIN 1.000
STRAIN 1.002
ETRAIN 0.983
IPROMOT 0.989
OPROMOT 1.001
ISUP 0.998
PROSUP 0.970

Values of Fixed Parameters in Dependence Relationships

Path Value
---------------------------- ------------
UNFAIR <- E1 1.000
DCHARG <- E2 1.000
UNEMP <- E3 1.000
ITRAIN <- E4 1.000
STRAIN <- E5 1.000
ETRAIN <- E6 1.000
IPROMOT <- E7 1.000
OPROMOT <- E8 1.000
ISUP <- E9 1.000
PROSUP <- E10 1.000
JOBSEC <- Z1 1.000

ADFU estimates of free parameters in variance/covariance relationships

Point 90.00% Conf. Int. Standard T
Path Param # Estimate Lower Upper Error Value
--------------------------------- -------- ------------------ ------- -----
TRAING <-> PROMOT 14 0.246 0.120 0.364 0.075 3.30
TRAING <-> RELSUP 15 0.576 0.452 0.677 0.069 8.40
PROMOT <-> RELSUP 16 0.482 0.354 0.593 0.073 6.62
E1 <-> E1 17 0.695 0.593 0.816 0.068 10.28
E2 <-> E2 18 0.242 0.117 0.499 0.107 2.26
E3 <-> E3 19 0.522 0.400 0.679 0.084 6.22
E4 <-> E4 20 0.440 0.338 0.574 0.071 6.21
E5 <-> E5 21 0.272 0.204 0.362 0.047 5.75
E6 <-> E6 22 0.337 0.254 0.446 0.058 5.85
E7 <-> E7 23 0.142 0.047 0.429 0.095 1.48
E8 <-> E8 24 0.356 0.239 0.530 0.086 4.14
E9 <-> E9 25 0.287 0.166 0.495 0.095 3.01
E10 <-> E10 26 0.560 0.448 0.702 0.077 7.31
Z1 <-> Z1 27 0.898 0.818 0.945 0.037 24.06

If the usual SYSTAT cases-by-variables file is used as input, then kurtosis estimates are
printed before the iteration details. These can be used to judge the appropriateness of
normality assumptions. They can also be used to apply corrections manually to test
statistics and standard errors if the user is willing to accept that the assumption of an
elliptical distribution is appropriate for the data (Shapiro and Browne, 1987).
Values of Fixed Parameters in Variance/Covariance Relationships

Path Value
----------------------------- ------------
TRAING <-> TRAING 1.000
PROMOT <-> PROMOT 1.000
RELSUP <-> RELSUP 1.000

Equality Constraints on Variances

Lagrange Standard
Constraint Value Multiplier Error
---------------------------- ------------ ------------ ------------
JOBSEC <-> JOBSEC 1.0000 0.000 -0.000
UNFAIR <-> UNFAIR 1.0000 0.000 -0.000
DCHARG <-> DCHARG 1.0000 0.000 -0.000
UNEMP <-> UNEMP 1.0000 0.000 -0.000
ITRAIN <-> ITRAIN 1.0000 0.000 0.000
STRAIN <-> STRAIN 1.0000 0.000 0.000
ETRAIN <-> ETRAIN 1.0000 0.000 -0.000
IPROMOT <-> IPROMOT 1.0000 0.000 -0.000
OPROMOT <-> OPROMOT 1.0000 0.000 0.000
ISUP <-> ISUP 1.0000 0.000 0.000
PROSUP <-> PROSUP 1.0000 0.000 -0.000

ADFU Discrepancy Function

Measures of fit of the model
----------------------------
Sample Discrepancy Function Value : 0.185 (1.846051E-01)

Population discrepancy function value, Fo
Bias adjusted point estimate : 0.048
90.000 percent confidence interval :(0.0,0.144)

Root mean square error of approximation
Steiger-Lind : RMSEA = SQRT(Fo/df)
Point estimate : 0.041
90.000 percent confidence interval :(0.0,0.071)

Expected cross-validation index
Point estimate (modified aic) : 0.430
90.000 percent confidence interval :(0.382,0.526)
CVI (modified AIC) for the saturated model : 0.519

Test statistic: : 39.136
Exceedance probabilities:-
Ho: perfect fit (RMSEA = 0.0) : 0.099
Ho: close fit (RMSEA <= 0.050) : 0.662

Multiplier for obtaining test statistic = 212.000
Degrees of freedom = 29
Effective number of parameters = 26
Example 4
Path Analysis and Standard Errors
Lawley and Maxwell (1971) gave correct standard errors for maximum likelihood
parameter estimates in a restricted factor analysis model for a correlation matrix. This
example shows how RAMONA can produce these correct standard errors. The method
used for calculating the standard errors differs from that of Lawley and Maxwell in that
RAMONA makes use of constrained optimization and Lawley and Maxwell obtained
their formula by applying the delta method to standardized estimates. It can be shown,
however, that the two methods are equivalent and produce the same results. Lawley and
Maxwell made use of a sample correlation matrix between nine ability tests
administered to 72 children.
We analyze the relationships in the path diagram using the correlation matrix. The
difference between the two runs is that we first treat the model (inappropriately) as a
covariance structure and then as a correlation structure. We specify TYPE as COVA in
the first run and CORR in the second.
[Path diagram: the latent factors Visual, Verbal, and Speed (variances fixed at 1.0) with dependence paths to the manifest variables Y1 through Y9, each of which also receives a unit path from its error term E1 through E9; Y9 depends on both Visual and Speed.]
Following is the input file for the first run:
The resulting output is:
USE ex4a
RAMONA
MANIFEST = y1 y2 y3 y4 y5 y6 y7 y8 y9
LATENT = visual verbal speed e1 e2 e3 e4 e5 e6,
e7 e8 e9
MODEL y1 <- visual e1(0,1.0),
y2 <- visual e2(0,1.0),
y3 <- visual e3(0,1.0),
y4 <- verbal e4(0,1.0),
y5 <- verbal e5(0,1.0),
y6 <- verbal e6(0,1.0),
y7 <- speed e7(0,1.0),
y8 <- speed e8 (0,1.0),
y9 <- visual speed e9(0,1.0),
visual <-> visual(0,1.0),
verbal <-> verbal(0,1.0),
speed <-> speed(0,1.0),
visual <-> verbal,
visual <-> speed,
verbal <-> speed
PRINT = MEDIUM
ESTIMATE / TYPE=COVA NCASES=72
There are 9 manifest variables in the model. They are:
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9


There are 12 latent variables in the model. They are:
VISUAL E1 E2 E3 VERBAL E4 E5 E6 SPEED E7 E8 E9


RAMONA options in effect are:
Display Covar
Method MWL
Start Rough
Convg 0.000100
Maximum iterations 100
Number of cases 72
Restart No
Confidence Interval 90.000000
Variance paths for errors were omitted from the job specification
and have been added by RAMONA.

Number of manifest variables = 9
Total number of variables in the system = 21.
Reading correlation matrix...
***WARNING***
A correlation matrix was provided although DISP=COV fit measures and
standard errors may be inappropriate.


Details of Iterations
Iter Method Discr. Funct. Max.R.Cos. Max.Const. NRP NBD
-------- ------- -------------- ------------ ------------ ----------
0 OLS 1.013354
1(0) OLS 0.437034 0.649686 0 0
2(0) OLS 0.143538 0.092248 0 0
3(0) OLS 0.135197 0.053602 0 0
4(0) OLS 0.134714 0.004511 0 0
4(0) MWL 0.472377 0.164664 0 0
5(0) MWL 0.425693 0.031463 0 0
6(0) MWL 0.421825 0.019794 0 0
7(0) MWL 0.421170 0.006232 0 0
8(0) MWL 0.421041 0.005613 0 0
9(0) MWL 0.421014 0.001271 0 0
10(0) MWL 0.421008 0.001616 0 0
11(0) MWL 0.421006 0.000284 0 0
12(0) MWL 0.421006 0.000478 0 0
13(0) MWL 0.421006 0.000085 0 0
14(0) MWL 0.421006 0.000144 0 0
15(0) MWL 0.421006 0.000028 0 0
16(0) MWL 0.421006 0.000044 0 0

Iterative procedure complete.


Convergence limit for residual cosines = 0.000100 on 2 consecutive
iterations.

Sample Covariance Matrix :
Y1 Y2 Y3 Y4 Y5
Y1 1.000
Y2 0.245 1.000
Y3 0.418 0.362 1.000
Y4 0.282 0.217 0.425 1.000
Y5 0.257 0.125 0.304 0.784 1.000
Y6 0.239 0.131 0.330 0.743 0.730
Y7 0.122 0.149 0.265 0.185 0.221
Y8 0.253 0.183 0.329 0.021 0.139
Y9 0.583 0.147 0.455 0.381 0.400
Y6 Y7 Y8 Y9
Y6 1.000
Y7 0.118 1.000
Y8 -0.027 0.601 1.000
Y9 0.235 0.385 0.462 1.000

Number of cases = 72.


Reproduced Covariance Matrix :
Y1 Y2 Y3 Y4 Y5
Y1 1.000
Y2 0.232 1.000
Y3 0.448 0.225 1.000
Y4 0.341 0.171 0.330 1.000
Y5 0.325 0.163 0.315 0.788 1.000
Y6 0.309 0.155 0.300 0.748 0.715
Y7 0.210 0.105 0.203 0.052 0.050
Y8 0.298 0.149 0.289 0.074 0.070
Y9 0.517 0.260 0.501 0.351 0.336
Y6 Y7 Y8 Y9
Y6 1.000
Y7 0.047 1.000
Y8 0.067 0.601 1.000
Y9 0.319 0.331 0.471 1.000

Residual Matrix (covariances) :
Y1 Y2 Y3 Y4 Y5
Y1 -0.000
Y2 0.013 0.000
Y3 -0.030 0.137 -0.000
Y4 -0.059 0.046 0.095 0.000
Y5 -0.068 -0.038 -0.011 -0.004 0.000
Y6 -0.070 -0.024 0.030 -0.005 0.015
Y7 -0.088 0.044 0.062 0.133 0.171
Y8 -0.045 0.034 0.040 -0.053 0.069
Y9 0.066 -0.113 -0.046 0.030 0.064
Y6 Y7 Y8 Y9
Y6 0.000
Y7 0.071 -0.000
Y8 -0.094 -0.000 -0.000
Y9 -0.084 0.054 -0.009 -0.000


Value of the maximum absolute residual = 0.171



ML Estimates of Free Parameters in Dependence Relationships

Point 90.00% Conf. Int. Standard T
Path Param # Estimate Lower Upper Error Value
--------------------------------- -------- ------------------ ------- -----
Y1 <- VISUAL 1 0.679 0.483 0.876 0.119 5.70
Y2 <- VISUAL 2 0.341 0.128 0.554 0.130 2.63
Y3 <- VISUAL 3 0.659 0.462 0.856 0.120 5.50
Y4 <- VERBAL 4 0.908 0.751 1.065 0.095 9.51
Y5 <- VERBAL 5 0.867 0.707 1.028 0.098 8.87
Y6 <- VERBAL 6 0.824 0.659 0.989 0.100 8.23
Y7 <- SPEED 7 0.651 0.435 0.866 0.131 4.97
Y8 <- SPEED 8 0.924 0.691 1.158 0.142 6.51
Y9 <- VISUAL 9 0.670 0.449 0.892 0.135 4.98
Y9 <- SPEED 10 0.192 -0.023 0.406 0.130 1.47

Values of Fixed Parameters in Dependence Relationships

Path Value
---------------------------- ------------
Y1 <- E1 1.000
Y2 <- E2 1.000
Y3 <- E3 1.000
Y4 <- E4 1.000
Y5 <- E5 1.000
Y6 <- E6 1.000
Y7 <- E7 1.000
Y8 <- E8 1.000
Y9 <- E9 1.000

ML estimates of free parameters in variance/covariance relationships

Point 90.00% Conf. Int. Standard T
Path Param # Estimate Lower Upper Error Value
--------------------------------- -------- ------------------ ------- -----
VISUAL <-> VERBAL 11 0.552 0.369 0.735 0.111 4.97
VISUAL <-> SPEED 12 0.474 0.239 0.708 0.143 3.32
VERBAL <-> SPEED 13 0.088 -0.131 0.307 0.133 0.66
E1 <-> E1 14 0.538 0.373 0.777 0.120 4.49
E2 <-> E2 15 0.884 0.664 1.177 0.154 5.75
E3 <-> E3 16 0.566 0.398 0.806 0.122 4.65
E4 <-> E4 17 0.175 0.100 0.308 0.060 2.92
E5 <-> E5 18 0.248 0.162 0.378 0.064 3.88
E6 <-> E6 19 0.321 0.224 0.459 0.070 4.59
E7 <-> E7 20 0.577 0.387 0.859 0.140 4.13
E8 <-> E8 21 0.146 0.014 1.473 0.205 0.71
E9 <-> E9 22 0.392 0.255 0.604 0.103 3.81

Analyzing the Correlation Structure
The maximum likelihood estimates and measures of fit from the two jobs are the same;
the standard errors differ. Those from the first job agree with the incorrect standard
errors in Lawley and Maxwell; those from the second job agree with Lawley and
Maxwell's correct standard errors. Comparison of iteration times in the two jobs shows
that the introduction of additional (nuisance) parameters and Lagrange multipliers
(TYPE = CORR) results in substantially slower iteration times. The second run differs
from the first only in that we specified TYPE = CORR instead of TYPE = COVA.
Values of Fixed Parameters in Variance/Covariance Relationships

Path Value
----------------------------- ------------
VISUAL <-> VISUAL 1.000
VERBAL <-> VERBAL 1.000
SPEED <-> SPEED 1.000


Maximum Likelihood Discrepancy Function

Measures of fit of the model
----------------------------
Sample Discrepancy Function Value : 0.421 (4.210057E-01)

Population discrepancy function value, Fo
Bias adjusted point estimate : 0.097
90.000 percent confidence interval :(0.0,0.354)

Root mean square error of approximation
Steiger-Lind : RMSEA = SQRT(Fo/df)
Point estimate : 0.065
90.000 percent confidence interval :(0.0,0.124)

Expected cross-validation index
Point estimate (modified aic) : 1.041
90.000 percent confidence interval :(0.944,1.298)
CVI (modified AIC) for the saturated model : 1.268

Test statistic: : 29.891
Exceedance probabilities:-
Ho: perfect fit (RMSEA = 0.0) : 0.153
Ho: close fit (RMSEA <= 0.050) : 0.330

Multiplier for obtaining test statistic = 71.000
Degrees of freedom = 23
Effective number of parameters = 22
The input is:
The output is:
USE ex4b
RAMONA
MANIFEST = y1 y2 y3 y4 y5 y6 y7 y8 y9
LATENT = visual verbal speed e1 e2 e3 e4 e5,
e6 e7 e8 e9
MODEL y1 <- visual e1(0,1.0),
y2 <- visual e2(0,1.0),
y3 <- visual e3(0,1.0),
y4 <- verbal e4(0,1.0),
y5 <- verbal e5(0,1.0),
y6 <- verbal e6(0,1.0),
y7 <- speed e7(0,1.0),
y8 <- speed e8(0,1.0),
y9 <- visual speed e9(0,1.0),
visual <-> visual(0,1.0),
verbal <-> verbal(0,1.0),
speed <-> speed(0,1.0),
visual <-> verbal,
visual <-> speed,
verbal <-> speed
PRINT = MEDIUM
ESTIMATE / TYPE=CORR NCASES=72
There are 9 manifest variables in the model. They are:
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9


There are 12 latent variables in the model. They are:
VISUAL E1 E2 E3 VERBAL E4 E5 E6 SPEED E7 E8 E9


RAMONA options in effect are:
Display Corr
Method MWL
Start Rough
Convg 0.000100
Maximum iterations 100
Number of cases 72
Restart No
Confidence Interval 90.000000
Variance paths for errors were omitted from the job specification
and have been added by RAMONA.

Number of manifest variables = 9
Total number of variables in the system = 30.
Reading correlation matrix...

Details of Iterations
Iter Method Discr. Funct. Max.R.Cos. Max.Const. NRP NBD
-------- ------- -------------- ------------ ------------ ----------
0 OLS 1.013354 0.0
1(0) OLS 0.437034 0.649686 0.192907 0 0
2(0) OLS 0.143538 0.092248 0.017916 0 0
3(0) OLS 0.135197 0.053602 0.006726 0 0
4(0) OLS 0.134714 0.004511 0.000055 0 0
4(0) MWL 0.472377 0.164664 0.000055 0 0
5(0) MWL 0.425693 0.031463 0.003106 0 0
6(0) MWL 0.421825 0.019794 0.001177 0 0
7(0) MWL 0.421170 0.006232 0.000072 0 0
8(0) MWL 0.421041 0.005613 0.000050 0 0
9(0) MWL 0.421014 0.001271 0.000003 0 0
10(0) MWL 0.421008 0.001616 0.000003 0 0
11(0) MWL 0.421006 0.000284 0.000000 0 0
12(0) MWL 0.421006 0.000478 0.000000 0 0
13(0) MWL 0.421006 0.000085 0.000000 0 0
14(0) MWL 0.421006 0.000144 0.000000 0 0
15(0) MWL 0.421006 0.000028 0.000000 0 0
16(0) MWL 0.421006 0.000044 0.000000 0 0

Iterative procedure complete.


Convergence limit for residual cosines = 0.000100 on 2 consecutive
iterations.

Convergence limit for variance constraint violations = 5.00000E-07
Value of the maximum variance constraint violation = 2.14295E-09


Sample Correlation Matrix :
Y1 Y2 Y3 Y4 Y5
Y1 1.000
Y2 0.245 1.000
Y3 0.418 0.362 1.000
Y4 0.282 0.217 0.425 1.000
Y5 0.257 0.125 0.304 0.784 1.000
Y6 0.239 0.131 0.330 0.743 0.730
Y7 0.122 0.149 0.265 0.185 0.221
Y8 0.253 0.183 0.329 0.021 0.139
Y9 0.583 0.147 0.455 0.381 0.400
Y6 Y7 Y8 Y9
Y6 1.000
Y7 0.118 1.000
Y8 -0.027 0.601 1.000
Y9 0.235 0.385 0.462 1.000

Number of cases = 72.


Reproduced Correlation Matrix :
Y1 Y2 Y3 Y4 Y5
Y1 1.000
Y2 0.232 1.000
Y3 0.448 0.225 1.000
Y4 0.341 0.171 0.330 1.000
Y5 0.325 0.163 0.315 0.788 1.000
Y6 0.309 0.155 0.300 0.748 0.715
Y7 0.210 0.105 0.203 0.052 0.050
Y8 0.298 0.149 0.289 0.074 0.070
Y9 0.517 0.260 0.501 0.351 0.336
Y6 Y7 Y8 Y9
Y6 1.000
Y7 0.047 1.000
Y8 0.067 0.601 1.000
Y9 0.319 0.331 0.471 1.000

Residual Matrix (correlations) :
Y1 Y2 Y3 Y4 Y5
Y1 -0.000
Y2 0.013 0.000
Y3 -0.030 0.137 -0.000
Y4 -0.059 0.046 0.095 0.000
Y5 -0.068 -0.038 -0.011 -0.004 0.000
Y6 -0.070 -0.024 0.030 -0.005 0.015
Y7 -0.088 0.044 0.062 0.133 0.171
Y8 -0.045 0.034 0.040 -0.053 0.069
Y9 0.066 -0.113 -0.046 0.030 0.064
Y6 Y7 Y8 Y9
Y6 0.000
Y7 0.071 -0.000
Y8 -0.094 -0.000 -0.000
Y9 -0.084 0.054 -0.009 -0.000


Value of the maximum absolute residual = 0.171



ML Estimates of Free Parameters in Dependence Relationships

Point 90.00% Conf. Int. Standard T
Path Param # Estimate Lower Upper Error Value
--------------------------------- -------- ------------------ ------- -----
Y1 <- VISUAL 1 0.679 0.537 0.822 0.086 7.87
Y2 <- VISUAL 2 0.341 0.143 0.539 0.121 2.83
Y3 <- VISUAL 3 0.659 0.513 0.804 0.089 7.44
Y4 <- VERBAL 4 0.908 0.850 0.967 0.036 25.52
Y5 <- VERBAL 5 0.867 0.801 0.934 0.041 21.41
Y6 <- VERBAL 6 0.824 0.747 0.901 0.047 17.66
Y7 <- SPEED 7 0.651 0.480 0.821 0.103 6.29
Y8 <- SPEED 8 0.924 0.741 1.108 0.111 8.30
Y9 <- VISUAL 9 0.670 0.485 0.856 0.113 5.96
Y9 <- SPEED 10 0.192 -0.021 0.404 0.129 1.48


Scaled Standard Deviations (nuisance parameters)

Variable Estimate
------------ ------------
Y1 1.000
Y2 1.000
Y3 1.000
Y4 1.000
Y5 1.000
Y6 1.000
Y7 1.000
Y8 1.000
Y9 1.000

Values of Fixed Parameters in Dependence Relationships

Path Value
---------------------------- ------------
Y1 <- E1 1.000
Y2 <- E2 1.000
Y3 <- E3 1.000
Y4 <- E4 1.000
Y5 <- E5 1.000
Y6 <- E6 1.000
Y7 <- E7 1.000
Y8 <- E8 1.000
Y9 <- E9 1.000

ML estimates of free parameters in variance/covariance relationships

Point 90.00% Conf. Int. Standard T
Path Param # Estimate Lower Upper Error Value
--------------------------------- -------- ------------------ ------- -----
VISUAL <-> VERBAL 11 0.552 0.344 0.708 0.111 4.97
VISUAL <-> SPEED 12 0.474 0.210 0.674 0.143 3.32
VERBAL <-> SPEED 13 0.088 -0.132 0.299 0.133 0.66
E1 <-> E1 14 0.538 0.376 0.771 0.117 4.59
E2 <-> E2 15 0.884 0.758 1.030 0.082 10.74
E3 <-> E3 16 0.566 0.403 0.794 0.117 4.85
E4 <-> E4 17 0.175 0.096 0.322 0.065 2.72
E5 <-> E5 18 0.248 0.155 0.395 0.070 3.52
E6 <-> E6 19 0.321 0.216 0.476 0.077 4.17
E7 <-> E7 20 0.577 0.393 0.847 0.135 4.29
E8 <-> E8 21 0.146 0.014 1.491 0.206 0.71
E9 <-> E9 22 0.392 0.250 0.615 0.107 3.66

Values of Fixed Parameters in Variance/Covariance Relationships

Path Value
----------------------------- ------------
VISUAL <-> VISUAL 1.000
VERBAL <-> VERBAL 1.000
SPEED <-> SPEED 1.000

Equality Constraints on Variances

Lagrange Standard
Constraint Value Multiplier Error
---------------------------- ------------ ------------ ------------
Y1 <-> Y1 1.0000 0.000 -0.000
Y2 <-> Y2 1.0000 0.000 -0.000
Y3 <-> Y3 1.0000 0.000 0.000
Y4 <-> Y4 1.0000 0.000 0.000
Y5 <-> Y5 1.0000 0.000 -0.000
Y6 <-> Y6 1.0000 0.000 -0.000
Y7 <-> Y7 1.0000 0.000 -0.000
Y8 <-> Y8 1.0000 0.000 -0.000
Y9 <-> Y9 1.0000 0.000 0.000


Maximum Likelihood Discrepancy Function

Measures of fit of the model
----------------------------
Sample Discrepancy Function Value : 0.421 (4.210057E-01)

Population discrepancy function value, Fo
Bias adjusted point estimate : 0.097
90.000 percent confidence interval :(0.0,0.354)

Root mean square error of approximation
Steiger-Lind : RMSEA = SQRT(Fo/df)
Point estimate : 0.065
90.000 percent confidence interval :(0.0,0.124)

Expected cross-validation index
Point estimate (modified aic) : 1.041
90.000 percent confidence interval :(0.944,1.298)
CVI (modified AIC) for the saturated model : 1.268

Test statistic: : 29.891
Exceedance probabilities:-
Ho: perfect fit (RMSEA = 0.0) : 0.153
Ho: close fit (RMSEA <= 0.050) : 0.330

Multiplier for obtaining test statistic = 71.000
Degrees of freedom = 23
Effective number of parameters = 22
Computation
Algorithms
Let γ denote the parameter vector and Σ(γ) the covariance structure. Parameter estimates
are obtained by minimizing a discrepancy function, F(S, Σ(γ)), specified using
METHOD. Alternatives are:

MWL    Maximum Wishart likelihood.
       F(S, Σ) = ln|Σ| − ln|S| + tr[SΣ⁻¹] − p

GLS    Generalized least squares assuming a Wishart distribution for S.
       F(S, Σ) = ½ tr[{S⁻¹(S − Σ)}²]

OLS    Ordinary least squares.
       F(S, Σ) = ½ tr[(S − Σ)²]

ADFU, ADFG    Asymptotically distribution-free methods.
       F(S, Σ) = (s − σ)′ Ω⁻¹ (s − σ)
       where s and σ are column vectors with ½p(p + 1) elements formed from
       the distinct elements of S and Σ, respectively, and Ω is an estimate of the
       asymptotic covariance matrix of sample covariances. For ADFU, Ω is
       unbiased (Browne, 1982) but need not be positive definite. If Ω is
       indefinite, the program moves automatically from ADFU to ADFG. With
       ADFG, Ω is biased but Gramian (Browne, 1982).

An iterative Gauss-Newton computing procedure with constraints (Browne and Du
Toit, 1992) is used to obtain parameter estimates. With MWL, the weight matrix is
respecified on each iteration. The procedure is then equivalent to the Aitchison and
Silvey (1960) adaptation of the Fisher scoring method to deal with equality constraints.
Some computer programs can yield negative estimates of variances. This does not
happen with RAMONA. Bounds are imposed to ensure that variance estimates are
non-negative and that all correlation estimates lie between −1 and +1. The imposition
of these bounds can result in the convergence of RAMONA in situations where
programs that do not impose them fail to converge. In some cases, a program that
allows negative variance estimates and does converge will yield a smaller discrepancy
function value than RAMONA.
Iteration is continued until the largest absolute residual cosine (Browne, 1982) falls
below a tolerance, specified in CONVG, on two consecutive iterations.
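The discrepancy functions above are straightforward to evaluate for a given sample covariance matrix S and model-implied matrix Σ. The following Python sketch is purely illustrative; it is not RAMONA's code, and the function names and example matrices are hypothetical.

import numpy as np

def mwl_discrepancy(S, Sigma):
    # Maximum Wishart likelihood: ln|Sigma| - ln|S| + tr[S Sigma^-1] - p
    p = S.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    return (np.log(np.linalg.det(Sigma)) - np.log(np.linalg.det(S))
            + np.trace(S @ Sigma_inv) - p)

def gls_discrepancy(S, Sigma):
    # Generalized least squares: (1/2) tr[{S^-1 (S - Sigma)}^2]
    R = np.linalg.inv(S) @ (S - Sigma)
    return 0.5 * np.trace(R @ R)

def ols_discrepancy(S, Sigma):
    # Ordinary least squares: (1/2) tr[(S - Sigma)^2]
    D = S - Sigma
    return 0.5 * np.trace(D @ D)

# Hypothetical 2 x 2 matrices, for illustration only.
S = np.array([[1.0, 0.4], [0.4, 1.0]])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
print(mwl_discrepancy(S, Sigma), gls_discrepancy(S, Sigma), ols_discrepancy(S, Sigma))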

Confidence Intervals
Approximate 90% confidence intervals are given for parameter estimates associated
with dependence paths and with covariance paths. Confidence intervals for path
coefficients and covariances (variances unrestricted) are provided under the
assumption of a normal distribution for the estimator (Browne, 1974) and are
symmetric about the parameter estimate. Confidence intervals for other parameters are
nonsymmetric about the parameter estimate (Browne, 1974) and are obtained under the
following assumptions:
n Correlation coefficients (covariances with both corresponding variances restricted to
unity): a normal distribution is assumed for the z-transform, ½ ln[(1 + ρ̂)/(1 − ρ̂)]
(Browne, 1974).
n Variances: a normal distribution is assumed for the natural logarithm, ln ψ̂
(Browne, 1974).
n Error variances under a correlation structure (corresponding dependent
variable variances are constrained to unity): a normal distribution is assumed
for ln(ψ̂⁻¹ − 1) (Browne, 1974).
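As an illustration of how such nonsymmetric intervals arise (a delta-method sketch under the stated normality assumptions; it is not RAMONA's code, and the function names are hypothetical):

import math

Z90 = 1.645  # two-sided 90% standard normal quantile

def variance_ci(psi_hat, se):
    # Assumes ln(psi_hat) is approximately normal; by the delta method,
    # SE(ln psi_hat) is roughly se / psi_hat.
    half = Z90 * se / psi_hat
    return psi_hat * math.exp(-half), psi_hat * math.exp(half)

def error_variance_ci(psi_hat, se):
    # Assumes ln(1/psi_hat - 1) is approximately normal (error variance with the
    # corresponding dependent variable variance constrained to unity).
    g = math.log(1.0 / psi_hat - 1.0)
    half = Z90 * se / (psi_hat * (1.0 - psi_hat))  # delta-method half-width on the g scale
    lo_g, hi_g = g - half, g + half
    # Back-transform: psi = 1 / (1 + exp(g)); a larger g means a smaller psi.
    return 1.0 / (1.0 + math.exp(hi_g)), 1.0 / (1.0 + math.exp(lo_g))

print(error_variance_ci(0.898, 0.037))

Applied to the Z1 <-> Z1 row of the ADFU output above (estimate 0.898, standard error 0.037), the second transform gives approximately (0.819, 0.945), which is in line with the printed interval for that row.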
Measures of Fit of a Model
This section provides a brief description of the measures of fit output by RAMONA.
Further information concerning these measures of fit can be found in Browne and
Cudeck (1992).
Let N = n + 1 be the sample size; p, the number of manifest variables; and q, the
number of free parameters in the model. Then the number of degrees of freedom is
d = ½p(p + 1) − q. The sample covariance matrix is denoted by S and the
corresponding population covariance matrix by Σ_0.
The minimal sample discrepancy function value is

F̂ = Min_γ F(S, Σ(γ))

and the corresponding minimal population discrepancy function value is

F_0 = Min_γ F(Σ_0, Σ(γ))
Now F_0 is bounded below by 0 and takes on a value of 0 if and only if Σ_0 satisfies the
structural model exactly. Therefore, we can regard F_0 as a measure of badness-of-fit of
the model, Σ(γ), to the population covariance matrix, Σ_0.
We assume that the test statistic nF̂ has an approximate noncentral chi-square
distribution with d degrees of freedom and a noncentrality parameter λ = nF_0. This will
be true if the discrepancy function is correctly specified for the distribution of the data,
F_0 is small enough, and N is large enough (Steiger, Shapiro, and Browne, 1985). Then
the expected value of F̂ will be approximately F_0 + d/n, so that F̂ is a biased estimator
of F_0. As a less biased point estimator of F_0 we use:

F̂_0 = Max{F̂ − d/n, 0}

We also provide a 90% confidence interval on F_0 as suggested by Steiger and Lind
(1980). Let Φ(x | λ, d) be the cumulative distribution function of a noncentral chi-square
distribution with noncentrality parameter λ and d degrees of freedom. Given x = nF̂
and d, the lower limit, λ_L, of the 90% confidence interval on λ = nF_0 is the
solution for λ of the equation

Φ(x | λ, d) = 0.95

and the upper limit λ_U is the solution for λ of

Φ(x | λ, d) = 0.05

A 90% confidence interval on F_0 is then given by (n⁻¹λ_L ; n⁻¹λ_U).
Because F_0 cannot increase if additional parameters are added, it gives little
guidance about when to stop adding parameters. It is preferable to use the root mean
square error of approximation (Steiger and Lind, 1980):

RMSEA = √(F_0 / d)

as a measure of the fit per degree of freedom of the model. This population measure of
badness-of-fit is also bounded below by 0 and will be 0 only if the model fits perfectly.
It will decrease if the inclusion of additional parameters substantially reduces F_0 but
will increase if the inclusion of additional parameters reduces F_0 only slightly.
Consequently, it can give some guidance as to how many parameters to use. Practical
experience has suggested that a value of the RMSEA of about 0.05 or less indicates a
close fit of the model in relation to the degrees of freedom. A value of about 0.08 or
less indicates a reasonable fit of the model in relation to the degrees of freedom.
A point estimate of the RMSEA is given by

Estimate:    √(F̂_0 / d)

and a 90% confidence interval by

Interval Estimate (Equation 23-4):    ( √(λ_L / (nd)) ; √(λ_U / (nd)) )

The RMSEA does not depend on sample size and therefore does not take into account
the fact that it is unwise to fit a model with many parameters if N is small. A measure
of fit that does this is the expected cross-validation index (ECVI). Consider two
samples of size N: a calibration sample C and a validation sample V. Suppose that the
model is fitted to the calibration sample yielding a reproduced covariance matrix Σ̂_C.
The discrepancy between Σ̂_C and the validation sample covariance matrix S_V is then
measured with the discrepancy function, yielding F(S_V, Σ̂_C) as a measure of stability
under cross-validation. A difficulty with this approach is that two samples are required.
One can avoid a second sample by estimating the expected value of F(S_V, Σ̂_C) from a
single sample. Assume that the discrepancy function is correctly specified for the
distribution of the data. Taking expectations over calibration samples and validation
samples gives the expected cross-validation index:

ECVI = E[F(S_V, Σ̂_C)] ≈ F_0 + (d + 2q)/n          (Equation 23-5)

A point estimate of the ECVI is given by (Browne and Cudeck, 1990):

Estimate (Equation 23-6):    F̂ + 2q/n

If METHOD is set to MWL, this point estimate of the ECVI is related by a linear
transformation to the Akaike Information Criterion (Akaike, 1973) and will lead to the
same conclusions.
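To make these definitions concrete, the sketch below evaluates F̂_0, the RMSEA point estimate with its 90% interval, and the ECVI from the quantities RAMONA prints (test statistic, degrees of freedom, multiplier n, and effective number of parameters). It is an illustration of the formulas only, not RAMONA's code; the function name is hypothetical, and the example values come from the first run above.

import numpy as np
from scipy.stats import chi2, ncx2
from scipy.optimize import brentq

def fit_measures(stat, d, n, q):
    # stat = n * F_hat (the test statistic), d = degrees of freedom,
    # n = multiplier (N - 1), q = effective number of parameters.
    F_hat = stat / n
    F0_hat = max(F_hat - d / n, 0.0)

    def cdf(lam):  # Phi(stat | lam, d), with the central case at lam = 0
        return chi2.cdf(stat, d) if lam == 0 else ncx2.cdf(stat, d, lam)

    lam_L = 0.0 if cdf(0.0) < 0.95 else brentq(lambda lam: cdf(lam) - 0.95, 0.0, stat)
    lam_U = 0.0 if cdf(0.0) < 0.05 else brentq(lambda lam: cdf(lam) - 0.05, 0.0, 10.0 * stat + 100.0)

    rmsea = np.sqrt(F0_hat / d)
    rmsea_ci = (np.sqrt(lam_L / (n * d)), np.sqrt(lam_U / (n * d)))
    ecvi = F_hat + 2.0 * q / n
    ecvi_saturated = 2.0 * (d + q) / n
    return F0_hat, rmsea, rmsea_ci, ecvi, ecvi_saturated

# First run above: test statistic 29.891, d = 23, multiplier n = 71, q = 22.
print(fit_measures(29.891, 23, 71, 22))

For that run this reproduces, to rounding, the printed values: F̂_0 of 0.097, an RMSEA of 0.065 with interval lower bound 0.0, an ECVI of 1.041, and a saturated-model value of 1.268.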
The point estimate in Equation 23-6 will decrease if an additional parameter reduces
F̂ sufficiently and increases otherwise. This will give some guidance as to the number
of parameters to retain. However, the amount of reduction in F̂ required before an
increase in the point estimate occurs is affected by the sample size. If n is very large,
increasing the number of parameters will tend to reduce the point estimate of the ECVI.
One should also bear in mind that sampling variability affects the point estimates.
An approximate 90% confidence interval on the ECVI may be obtained from:

Interval Estimate (Equation 23-7):    ( (λ_L + d + 2q)/n ; (λ_U + d + 2q)/n )

It can happen that nF̂ − d < λ_L, so that the point estimate in Equation 23-6 is smaller
than the lower limit of the confidence interval in Equation 23-7. In particular, this will
be true if the (approximately unbiased) point estimate in Equation 23-6 is less than the
lower bound (d + 2q)/n for the approximation to the ECVI given in Equation 23-5.
For comparative purposes, RAMONA also provides the ECVI of the saturated
model where no structure is imposed on Σ_0:

ECVI (Saturated Model) = 2(d + q)/n

The test statistic nF̂ is also output by RAMONA. We follow convention in
providing the exceedance probability, 1 − Φ(nF̂ | 0, d), for a test of the point hypothesis

H_0: F_0 = 0          (Equation 23-8)

which implies that the model holds exactly. Our opinion, however, is that this null
hypothesis is implausible and that it does not much help to know whether or not the
statistical test has been able to detect that it is false. More relevant is the exceedance
probability for an interval hypothesis of close fit, which we define by

H_0: RMSEA ≤ 0.05          (Equation 23-9)

and which implies that λ ≤ λ* = nd(0.05)². The exceedance probability output by
RAMONA is given by 1 − Φ(nF̂ | λ*, d).
Note that the null hypothesis of perfect fit in Equation 23-8 is not rejected at the 5%
level if λ_L = 0 or, equivalently, the lower limit of the confidence interval in Equation
23-4 is 0. The null hypothesis of close fit in Equation 23-9 is not rejected at the 5%
level if the lower limit of the confidence interval in Equation 23-4 is not greater than
0.05.
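The two exceedance probabilities follow directly from the noncentral chi-square distribution. The sketch below is again illustrative only (hypothetical function name, not RAMONA's code); the example values are those of the first run above.

from scipy.stats import chi2, ncx2

def exceedance_probabilities(stat, d, n, rmsea0=0.05):
    # P-values for H0: perfect fit (RMSEA = 0) and H0: close fit (RMSEA <= rmsea0).
    p_perfect = 1.0 - chi2.cdf(stat, d)           # 1 - Phi(nF | 0, d)
    lam_star = n * d * rmsea0 ** 2                # noncentrality under the close-fit hypothesis
    p_close = 1.0 - ncx2.cdf(stat, d, lam_star)   # 1 - Phi(nF | lambda*, d)
    return p_perfect, p_close

# First run above: statistic 29.891, d = 23, n = 71 (printed values: 0.153 and 0.330).
print(exceedance_probabilities(29.891, 23, 71))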
When METHOD is set to MWL, two sets of measures of fit are output. One is based
on the maximum likelihood discrepancy function value

F̂ = ln|Σ̂| − ln|S| + tr[SΣ̂⁻¹] − p

and the other on the generalized least squares discrepancy function value

F̂ = ½ tr[{Σ̂⁻¹(S − Σ̂)}²]

When the model fits well, the differences between the two sets of fit measures should
be small (Browne, 1974).
References
Aitchison, J. and Silvey, S. D. (1960). Maximum likelihood estimation procedures and
associated tests of significance. Journal of the Royal Statistical Society, Series B, 22,
154–171.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood
principle. Second International Symposium on Information Theory, B. N. Petrov and F.
Csaki, eds. Budapest: Akademiai Kiado.
Bentler, P. M. and Weeks, D. G. (1980). Linear structural equations with latent variables.
Psychometrika, 45, 289–308.
Bollen, K. A. (1989). Structural equations with latent variables. New York: John Wiley &
Sons, Inc.
Browne, M. W. (1974). Generalized least squares estimators in the analysis of covariance
structures. South African Statistical Journal, 8, 1–24. (Reprinted in Latent Variables in
Socioeconomic Models, D. J. Aigner and A. S. Goldberger, eds. 205–226. Amsterdam:
North Holland.)
Browne, M. W. (1982). Covariance structures. In Topics in Applied Multivariate Analysis,
D. M. Hawkins, ed. 72–141. Cambridge: Cambridge University Press.
Browne, M. W. and Cudeck, R. (1990). Single sample cross-validation indices for
covariance structures. Multivariate Behavioral Research, 24, 445–455.
Browne, M. W. and Cudeck, R. (1992). Alternative ways of assessing model fit. Testing
Structural Equation Models, K. A. Bollen and J. S. Long, eds. Beverly Hills, Calif.: Sage.
Browne, M. W. and Du Toit, S. H. C. (1992). Automated fitting of nonstandard models.
Multivariate Behavioral Research.
Cudeck, R. (1989). Analysis of correlation matrices using covariance structure models.
Psychological Bulletin, 105, 317–327.
Duncan, O. D., Haller, A. O., and Portes, A. (1971). Peer influence on aspirations, a
reinterpretation. Causal Models in the Social Sciences, H. M. Blalock, ed. 219–244.
Aldine-Atherton.
Everitt, B. S. (1984). An introduction to latent variable models. New York: Chapman and
Hall.
Guttman, L. (1954). A new approach to factor analysis: The radex. Mathematical Thinking
in the Social Sciences, P. F. Lazarsfeld, ed. 258–348. Glencoe: The Free Press.
Jöreskog, K. G. (1977). Structural equation models in the social sciences: Specification,
estimation and testing. Applications of Statistics, P. R. Krishnaiah, ed. 265–287.
Amsterdam: North Holland.
Lawley, D. N. and Maxwell, A. E. (1971). Factor analysis as a statistical method. 2nd ed.
New York: American Elsevier.
McArdle, J. J. (1988). Dynamic but structural equation modeling of repeated measures data.
Handbook of Multivariate Experimental Psychology, J. R. Nesselroade and R. B.
Cattell, eds. 2nd ed. 561–614. New York: Plenum.
McArdle, J. J. and McDonald, R. P. (1984). Some algebraic properties of the Reticular
Action Model for moment structures. British Journal of Mathematical and Statistical
Psychology, 37, 234–251.
McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale: Erlbaum.
Mels, G. (1988). A general system for path analysis with latent variables. M. Sc. thesis,
University of South Africa.
Mels, G. and Koorts, A. S. (1989). Causal models for various job aspects. SAIPA, 24,
144–156.
Shapiro, A. and Browne, M. W. (1987). Analysis of covariance structures under elliptical
distributions. Journal of the American Statistical Association, 82, 1092–1097.
Steiger, J. H. and Lind, J. (1980). Statistically based tests for the number of common
factors. Paper presented at the annual meeting of the Psychometric Society: Iowa City.
Steiger, J. H., Shapiro, A., and Browne, M. W. (1985). On the asymptotic distribution of
sequential chi-square statistics. Psychometrika, 50, 253–264.
Wheaton, B., Muthén, B., Alwin, D. F., and Summers, G. F. (1977). Assessing reliability
and stability in panel models. Sociological Methodology 1977, D. R. Heise, ed. 84–136.
San Francisco: Jossey-Bass.
Acknowledgments
The development of this program was partially supported by the Institute of Statistical
Research of the South African Human Sciences Research Council, the South African
Foundation for Research Development, and the University of South Africa.
The authors are indebted to Professor S.H.C. du Toit and to Mrs. Yvette Seymore
for a number of subroutines used in the program.


Chapter 24
Perceptual Mapping
Leland Wilkinson
PERMAP offers two types of tools. The first is a group of procedures for fitting
subjects and objects in a common space. This group includes Carroll's (1972) internal
and external unfolding models, MDPREF and PREFMAP, as well as Gabriel's (1971)
BIPLOT, which is a minor modification of MDPREF. The second is a set of procedures
for relating one dimensional configuration to another, generally called a Procrustes
rotation. Both the orthogonal Procrustes and the more general canonical rotations are
available.
PERMAP is a misnomer. Although most of the techniques it incorporates have been
used for perceptual mapping, they have applications outside of market research or
psychology and, like the biplot technique, may even have their origins elsewhere.
Furthermore, classical perceptual mapping techniques, such as multidimensional
scaling, correspondence analysis, and principal components, are found elsewhere in
SYSTAT. In the end, since almost all of the methods in this module involve a singular
value decomposition and are not bulky enough to deserve their own modules, they
have been collected into a single grab bag.
Statistical Background
Perceptual mapping involves a variety of techniques for displaying the judgments of
a set of objects by a group of subjects. Most of these techniques were developed in the
1970s by psychometricians, but they were soon adopted by market researchers and
scientists for analyzing a variety of preference and similarity data.
In applied usage, especially among market researchers, perceptual mapping is an
even more general term. Some commercial perceptual mapping programs are based
on classical statistical or psychometric models. Some of these methods include
Fisher's linear discriminant function, correspondence analysis, factor analysis, and
multidimensional scaling. Indeed, any procedure that produces a set of coordinates in
a q-dimensional space from an n × p matrix, with q ≤ min(n, p), can be considered
perceptual mapping in the broad, applied sense. Quantitative theoretical market
researchers (for example, Green and Tull, 1975, and Lilien, Kotler, and Moorthy,
1992) use the term in this more general sense as well.
The origin of the term can be found in classical psychometrics (see Cliff, 1973, for
a review). Soon after the development of psychometric spatial models, some
psychologists thought scaling methods could be used to derive cognitive maps from
subjects' ratings of stimuli. These maps would be pictures of the mental structures
used to perceive and integrate information. Following the classic linguistic studies of
Osgood, Suci, and Tannenbaum (1957), researchers produced intriguing cognitive
maps of stimuli such as countries, cities, adjectives, colors, and consumer products (for
example, Wish, Deutsch, and Biener, 1972, and Milgram and Jodelet, 1976).
Not long afterwards, perceptual and memory psychologists abandoned the cognitive
map model and developed theories based on information processing, problem solving,
and associative memory. Research by Shepard and Cooper (1982) and Kosslyn (1981),
for example, focused specifically on the storage and processing of mental images
rather than inferring spatial structure among nonspatial stimuli from associations
between responses to attributes. Shepard's psychometric findings on mental rotations,
for example, were subsequently confirmed at the physiological level (Dow, 1990).
While no longer an active theoretical model, perceptual mapping can be useful as a
general collection of procedures for presenting statistical analyses to nontechnical
clients. Like classification trees, perceptual maps can show complex relations
relatively simply without algebra or statistical parameters. It is easier for many clients
to judge a distance on a map than to evaluate a conditional probability. Thus,
perceptual mapping techniques can be useful for data that have nothing to do with
perception.
Preference Mapping
A variety of algebraic and geometric models of preferences have been developed over
the last century. The unidimensional preference model (Coombs, 1950) is presented in
the following figure. Imagine that three subjects have expressed their preferences for
each of five objects (A,B,C,D,E). If their preferences can be represented by a single
dimension, the following figure is one of several possible models. Each subject's
preference strength on the single attribute dimension is represented by a normal
distribution. In this model, the farther an object is from the mean of the subject's
preference distribution, the less that object is preferred. Following this rule, in the
following figure, the ordering of preferences for the five objects is shown above each
subject's curve. Thus, the leftmost subject prefers object B most and E least, while the
rightmost subject prefers E most and A least. The following figure is the
unidimensional preference model for normal curves.
Coombs devised a method for recovering a unidimensional preference scale from the
subjects' rankings of the objects. His procedure is called unfolding. If we assume that
the distances to objects from a subject's ideal point on the scale are all positive and
follow the usual distance axioms (see Chapter 19), then direction doesn't enter into the
calculation of preference. We can therefore imagine folding the scale about the
subject's ideal point to see the point of view of that subject. Coombs discovered that if
there are enough subjects and objects, we can unfold the scale from the given orderings
of the subjects' preferences without knowing the strengths of the preferences. In
general, the more subjects and objects, the less room there is to represent the preference
orderings correctly by moving the locations of the preference curves. Like MDS itself,
the system becomes highly constrained to allow only one solution. The MDS procedure
in SYSTAT can be used to compute Coombs' solution for unidimensional data.
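As a small illustration of the folding idea (a sketch only; the scale positions and ideal point below are hypothetical, and this is not the MDS module's algorithm), the preference order implied by an ideal point on a unidimensional scale is simply the ordering of the objects by their absolute distance from that point:

def folded_preference_order(positions, ideal_point):
    # Order objects from most to least preferred: increasing distance from the ideal point.
    return sorted(positions, key=lambda label: abs(positions[label] - ideal_point))

# Hypothetical unidimensional scale values for objects A through E and one ideal point.
positions = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 3.0, "E": 4.0}
print(folded_preference_order(positions, 0.8))   # ['B', 'A', 'C', 'D', 'E']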
Students of Coombs (Bennet and Hays, 1960) extended the unfolding model to
higher dimensions. The multidimensional preference model for normal curves shows
how this works. As in the unidimensional preference model for normal curves, there
are three subjects and five objects. The closest subject's preference curve leads to
preferences of BCEAD, in left-to-right order. The other two subjects have preference
curves nearest object E in the center of the configuration. Consequently, their most
preferred object is E. In the multidimensional model, distance is calculated in all
directions from the center of the subject's preference curve. Again, the SYSTAT MDS
module can be used to find solutions for multidimensional unfolding problems. The
following figure is the multidimensional preference model for normal curves.
Preference curves do not have to be normal, symmetric, or even probability
distributions. Carroll (1972) devised an unfolding procedure based on a quadratic
preference curve model. He called the procedure external because it relied on
quantitative ratings of the subjects' preferences and a previously determined fixed
configuration of objects in a space. While ordinary unfolding begins with an n × p
matrix of n subjects' rank orderings of p objects, external unfolding begins with a
p × q matrix of p objects' coordinates in q dimensions and a p × n matrix of n
subjects' ratings of their preferences for the p objects.
The unidimensional preference model for quadratic curves shows Carroll's model
in one dimension. Unlike the normal curve model, the quadratic preference curves
involve negative preferences. The subject on the left in the unidimensional preference
model for normal curves, for example, is indifferent about object E (or more indifferent
than she is about object D). The subject on the left in the unidimensional preference
model for quadratic curves, on the other hand, likes object E least. Carroll's model is
therefore appropriate for data following a bipolar (approach-avoidance) preference
model. The following is the unidimensional preference model for quadratic curves.
[Figure: unidimensional preference model for quadratic curves. Objects A through E lie on the attribute dimension; the three subjects' preference orderings, from left to right, are BACDE, CDBEA, and EDCBA, and the vertical axis shows strength of preference.]
Carroll fits each subject's vector of preferences to a configuration of objects via
ordinary least squares. In fact, the preference curves in the previous unidimensional
preference model for quadratic curves are really inverted (negative) quadratic loss for
each subject when Carroll's least squares fitting method is used to fit her vector of
preferences to the coordinates of the objects.
Carroll offers four fitting methods, three of which appear in SYSTAT. The first,
called the VECTOR model in SYSTAT, is simply a multiple regression of the
preference vector on the coordinates themselves:

s_ij = Σ_{k=1}^{q} a_ik x_jk + b_i + e_ij

where s_ij is the preference scale value of the jth stimulus for the ith subject. The
coefficients a_ik are estimated from regressing y (the vector of preferences) on X (the
p × q matrix of coordinates) and then transforming the coefficients.
This is called a vector model because the resulting fit is displayed as a vector
superimposed on the object configuration. Preferences are predicted from the
perpendicular projections of the objects' coordinates onto each vector.
The second model, called the CIRCLE model in SYSTAT, is the one in the figures
above. It is fit by regressing each subject's preferences on the coordinates and squared
coordinates of the object configuration. From the coefficients in this regression, the
ideal points are established in the coordinate space of the object configuration. In two
dimensions, the intersection of each preference surface with the zero preference plane
is a circle. The basic algebraic model is

s_ij = a_i d_ij² + b_i + e_ij
where

d_ij² = Σ_{k=1}^{q} (x_jk − y_ik)²

The third model, called the ELLIPSE model in SYSTAT, allows for differential
weighting of preference dimensions. It uses weights in computing the distances instead
of the ordinary regression in the circular model. As a result, each preference curve may
be elliptical at the zero preference plane. The model is

s_ij = a_i d_ij² + b_i + e_ij

where

d_ij² = Σ_{k=1}^{q} w_ik (x_jk − y_ik)²
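The following Python sketch illustrates the first two of these fits under an ordinary least squares formulation. It is not the PREFMAP/PERMAP code; the function names and the small configuration and ratings are hypothetical. For the circle model, the ideal point is recovered from the regression coefficients as described above.

import numpy as np

def fit_vector_model(X, s):
    # Regress preferences s (length p) on object coordinates X (p x q) plus a constant.
    A = np.column_stack([X, np.ones(len(s))])
    coef, *_ = np.linalg.lstsq(A, s, rcond=None)
    return coef[:-1]  # fitted direction for the subject's preference vector

def fit_circle_model(X, s):
    # Regress s on the squared length of each object plus its coordinates,
    # then recover the ideal point from the coefficients.
    sq = np.sum(X ** 2, axis=1)
    A = np.column_stack([sq, X, np.ones(len(s))])
    coef, *_ = np.linalg.lstsq(A, s, rcond=None)
    a, linear = coef[0], coef[1:-1]
    ideal_point = -linear / (2.0 * a)  # a < 0 here means preference falls off with distance
    return a, ideal_point

# Hypothetical configuration of five objects in two dimensions and one subject's ratings.
X = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-0.5, 0.5], [-1.0, 0.0]])
s = np.array([2.0, 4.0, 5.0, 4.0, 2.0])
print(fit_vector_model(X, s))
print(fit_circle_model(X, s))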
Biplots and MDPREF
The CORAN procedure in SYSTAT performs a correspondence analysis on a tabular
matrix. The singular value decomposition is used to compute row and column
coordinates in a single configuration. These coordinates are popularly represented as a
set of vectors for the columns and a set of points for the rows.
Biplots (Gabriel, 1971) are a singular value decomposition of a general n × p
matrix. MDPREF (Carroll, 1972) is the same model except that the vectors (column
coordinates) are standardized to have equal length. This is because Carroll developed
the procedure for representing preferences with the vector model based on n subjects'
preferences for p objects.
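A minimal sketch of the idea (not the PERMAP source; scaling conventions for biplots vary, and the names and example matrix are hypothetical) computes both sets of coordinates from the singular value decomposition of the centered matrix, with MDPREF rescaling the column vectors to unit length:

import numpy as np

def biplot_coordinates(X, ndim=2):
    # SVD-based biplot: rows become points, columns become direction vectors.
    Xc = X - X.mean(axis=0)                 # center the columns
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    row_coords = U[:, :ndim] * d[:ndim]     # principal coordinates for the rows
    col_coords = Vt[:ndim].T                # loadings (directions) for the columns
    return row_coords, col_coords

def mdpref_coordinates(X, ndim=2):
    # MDPREF-style variant: force the column vectors to equal (unit) length.
    rows, cols = biplot_coordinates(X, ndim)
    cols = cols / np.linalg.norm(cols, axis=1, keepdims=True)
    return rows, cols

# Hypothetical 4 x 3 matrix (rows are objects, columns are subjects).
X = np.array([[1.0, 2.0, 3.0], [2.0, 1.0, 4.0], [3.0, 3.0, 1.0], [4.0, 4.0, 2.0]])
print(mdpref_coordinates(X))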
Procrustes Rotations
Procrustes rotations involve matching a source configuration to a target. SYSTAT
offers two types of rotations. The first is a classical orthogonal Procrustes rotation
(Schönemann, 1966). This produces a fit by rotating and transposing axes and is
especially suited for principal components and factor analyses.
The second method, called canonical rotation in SYSTAT, is a general translation,
rotation, reflection, and uniform dilation transformation that is ideally suited for
multidimensional scaling and any procedure where location, scale, and orientation are
arbitrary. This method is documented in Borg and Groenen (1997).
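A common way to compute the orthogonal Procrustes rotation is through the singular value decomposition of the cross-product of the two configurations. The sketch below is an illustration under the usual least squares formulation, not the PERMAP source; the names and the simulated configurations are hypothetical.

import numpy as np

def orthogonal_procrustes(target, source):
    # Find an orthogonal matrix T minimizing || target - source @ T || (Frobenius norm).
    U, _, Vt = np.linalg.svd(source.T @ target, full_matrices=False)
    T = U @ Vt
    return T, source @ T

# Hypothetical example: a configuration and a rotated, slightly noisy copy of it.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
angle = 0.5
R = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
Y = X @ R + 0.01 * rng.normal(size=X.shape)
T, Y_rotated = orthogonal_procrustes(X, Y)
print(np.round(T, 2))   # close to the transpose (inverse) of R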
Perceptual Mapping in SYSTAT
Perceptual Mapping Main Dialog Box
To open the Perceptual Mapping dialog box, from the menus choose:
Statistics
Data Reduction
Perceptual Mapping
Dependent(s). The dependent variables should be continuous or categorical numeric
variables (for example, income).
Independent(s). The independent variables should be continuous or categorical
variables (grouping variables).
Method. The following methods are available.
n Biplot. Requires only a dependent variable. Biplots are a singular value
decomposition of a general n × p matrix.
n Canonical rotations. Requires both a dependent and independent variable. Relates a
one-dimensional configuration to another. Canonical rotation is a general
translation, rotation, reflection, and uniform dilation transformation that is ideally
suited for multidimensional scaling and any procedure where location, scale, and
orientation are arbitrary.
n Circle. Requires both a dependent and an independent variable. The columns of the
first set are fit to the configuration in the second.
n Ellipse. Requires both a dependent and an independent variable. The columns of the
first set are fit to the configuration in the second.
n MDPREF. Requires only a dependent variable. MDPREF is a biplot except that the
vectors (column coordinates) are the same unit length.
n Procrustes. Requires both a dependent and an independent variable. Procrustes
rotation relates a one-dimensional configuration to another and involves matching
a source configuration to a target. It produces a fit by rotating and transposing axes
and is especially suited for principal components and factor analyses.
n Vector. Requires both a dependent and an independent variable. The columns of the
first set are fit to the configuration in the second.
Standardize. Standardizes the data before fitting.
Dimension. Specifies the number of dimensions in which to do the scaling.
Polarity. Specifies the polarity of the preferences when doing preference mapping. If
the smaller number indicates the least and the higher number the most, select Positive.
For example, a questionnaire may include the question, "Rate a list of movies where
one star is the worst and five stars is the best." If the higher number indicates a lower
ranking and the lower number indicates a higher ranking, select Negative. For
example, a questionnaire may include the question, "Rank your favorite sports team
where 1 is the best and 10 is the worst."
Using Commands
After selecting a file with USE filename, continue with:
Usage Considerations
Types of data. PERMAP uses only rectangular data.
Print options. The output is standard for all PRINT options.
Quick Graphs. PERMAP produces Quick Graphs for every analysis. You can turn these
off with GRAPH=NONE.
Saving files. PERMAP does not save coordinates.
BY groups. PERMAP analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. PERMAP uses the FREQ variable, if present, to duplicate cases. This
inflates the total degrees of freedom to be the sum of the number of frequencies. Using
a FREQ variable does not require more memory, however.
Case weights. PERMAP does not use WEIGHT.
PERMAP
MODEL varlist or depvarlist = indvarlist
ESTIMATE / METHOD = BIPLOT
MDPREF
VECTOR
CIRCLE
ELLIPSE
PROCRUSTES
CANONICAL ,
STANDARDIZE ,
DIMENSION = n ,
POLARITY= POSITIVE
NEGATIVE
Examples
Example 1
Vector Model
The PREFMAP procedure of Carroll is implemented through a model that regresses a
set of subjects (the left side of the model equation) onto the coordinates of a set of
objects (the right side of the model equation). The file SYMP contains coordinates from
a multidimensional scaling of disease symptoms from Wilkinson, Blank, and Gruber
(1996). It also contains, for a selected set of diseases, indicators for the presence or
absence of a symptom. These are informal ratings.
The input for fitting the vector model to the data is:
Following is the output:
USE SYMP
IDVAR=SYMPTOM$
PERMAP
MODEL LYME MALARIA YELLOW RABIES FLU = DIM1 DIM2
ESTIMATE / METHOD=VECTOR
Variables in the SYSTAT Rectangular file are:
SYMPTOM$ DIM1 DIM2 LYME MALARIA YELLOW RABIES FLU
Configuration has been centered prior to fitting.
External unfolding via vector model.
Goodness of fit for subjects
Subject R-square F-ratio df p
1 0.007 0.056 2 15 0.946
2 0.029 0.226 2 15 0.800
3 0.173 1.574 2 15 0.240
4 0.059 0.474 2 15 0.632
5 0.079 0.642 2 15 0.540
Regression coefficients for subjects
1 2 3
1 0.0 0.057 -0.002
2 0.0 0.096 -0.070
3 0.0 0.170 0.177
4 0.0 -0.073 0.161
5 0.0 -0.084 0.147

Subject coordinates
1 2
1 0.999 -0.032
2 0.806 -0.592
3 0.692 0.722
4 -0.415 0.910
5 -0.496 0.869

EXPORT successfully completed.
Example 2
Circle Model
The circular model places the diseases near the symptoms they most involve. The input
for fitting the circle model to the data is:
Following is the output:
USE SYMP
IDVAR=SYMPTOM$
PERMAP
MODEL LYME MALARIA YELLOW RABIES FLU = DIM1 DIM2
ESTIMATE / METHOD=CIRCLE
Variables in the SYSTAT Rectangular file are:
SYMPTOM$ DIM1 DIM2 LYME MALARIA YELLOW RABIES FLU
Configuration has been centered prior to fitting.
External unfolding via circular ideal point model.
Goodness of fit for subjects
Subject R-square F-ratio df "p"
1 0.271 1.735 3 14 0.206
2 0.385 2.926 3 14 0.071
3 0.265 1.685 3 14 0.216
4 0.257 1.615 3 14 0.231
5 0.079 0.401 3 14 0.755 Anti-Ideal
Regression coefficients for subjects
1 2 3 4
1 0.379 0.001 -0.046 -0.380
2 0.449 0.029 -0.123 -0.450
3 0.191 0.142 0.155 -0.191
4 0.334 -0.123 0.122 -0.335
5 -0.009 -0.083 0.148 0.009
Subject coordinates
1 2
1 0.017 -1.000
2 0.232 -0.973
3 0.675 0.738
4 -0.710 0.704
5 0.487 -0.874

EXPORT successfully completed.
Example 3
Internal Model
The DIVORCE file includes grounds for divorce in the United States in 1971. It is
adapted from Wilkinson, Blank, and Gruber (1996), and originally from Long (1971).
We will do an MDPREF analysis on these data to plot the rows and columns in a
common space. The input is:
USE DIVORCE
IDVAR STATE$
PERMAP
MODEL ADULTERY..SEPARATE
ESTIMATE / METHOD=MDPREF
Following is the output:
Variables in the SYSTAT Rectangular file are:
ADULTERY CRUELTY DESERT SUPPORT FELONY IMPOTENT
PREGNANT DRUGS CONTRACT INSANE BIGAMY SEPARATE
STATE$
Configuration has been centered prior to fitting.
MDPREF (Biplot) Analysis
Eigenvalues
1 2 3 4 5 6 7 8 9
19.196 18.277 9.792 9.132 8.035 6.483 4.825 3.595 2.278
10 11 12
0.866 0.581 0.000

Vector coordinates
1 2
1 0.657 0.754
2 0.540 0.842
3 0.661 0.751
4 0.078 0.997
5 0.746 0.666
6 0.969 -0.248
7 0.793 -0.610
8 0.626 0.780
9 0.694 -0.720
10 -0.657 -0.754
11 0.851 -0.525
12 -0.982 0.191

Object coordinates
1 2
1 0.102 0.117
2 0.048 0.092
3 -0.032 0.088
: : :
. . .
49 -0.165 0.007
50 0.176 0.057

EXPORT successfully completed.
Following is a graph of the output:
The biplot looks similar, with all of the grounds for divorce vectors approximately
equal in length because the original data have comparable variances on these variables.
Example 4
Procrustes Rotation
In a profound but seldom-cited dissertation, Wilkinson (1975) scaled perceptions of
cars and dogs among car club and dog club members. The file CARDOG contains the
INDSCAL configurations of the scalings of cars and dogs. Wilkinson paired cars and
dogs by using subjects responses on additional rating scales of attributes. INDSCAL
dimensions, on the other hand, are claimed to have an intrinsic canonical orientation
that ordinarily precludes rotation (see the references in Multidimensional Scaling).
The question here, then, is whether a Procrustes rotation guided by the extrinsically
based pairings will change the original INDSCAL configurations. We will rotate cars
to dogs. The input is:
USE CARDOG
PERMAP
MODEL C1,C2 = D1,D2
ESTIMATE/METHOD=PROCRUSTES
Following is the output:
Orthogonal Procrustes Rotation
Rotation matrix T

1 2

1 0.98 0.21
2 -0.21 0.98
Target (X) coordinates
1 2

1 21.00 -1.00
2 13.00 -6.00
3 -4.00 26.00
4 -9.00 20.00
5 3.00 15.00
6 -2.00 14.00
7 -20.00 1.00
8 -8.00 -16.00
9 -1.00 -26.00
10 4.00 -20.00
11 6.00 -15.00
12 8.00 6.00

Rotated (Y) coordinates
1 2

1 20.19 1.17
2 16.53 -5.73
3 -14.92 22.41
4 -9.00 18.55
5 9.69 12.25
6 -2.05 9.79
7 -12.15 -0.51
8 -15.47 -23.68
9 -4.65 -26.52
10 3.44 -11.54
11 -6.65 -2.42
12 11.95 1.49
The rotation matrix in the output is nearly an identity matrix. Unlike the nonmetric
multidimensional scalings in Wilkinson's dissertation, which required rotation to
common orientation, the INDSCAL analyses recovered the apparently canonical
dimensions. These were agile-clumsy (horizontal) and big-small (vertical).
In place of the Procrustes output, which is normally separate scatterplots of the two
sets, we present a plot of the superimposed configurations.
Computation
All computations are in double precision.
Algorithms
The algorithms are documented in the Statistical Background section above. Most
involve a singular value decomposition computed in the standard manner.
Missing data
Cases and variables with missing data are omitted from the calculations.
References
Bennet, J. F. and Hays, W. L. (1960). Multidimensional unfolding: Determining the
dimensionality of ranked preference data. Psychometrika, 25, 27–43.
Borg, I. and Groenen, P. J. F. (1997). Modern multidimensional scaling: Theory and
applications. New York: Springer Verlag.
Carroll, J. D. (1972). Individual differences in multidimensional scaling. In R. N. Shepard,
A. K. Romney, S. B. Nerlove (eds.), Multidimensional scaling: Theory and applications
in the behavioral sciences, Vol. 1, 105–155. New York: Seminar Press.
Cliff, N. (1973). Scaling. Annual Review of Psychology, 24, 473–506.
Coombs, C. H. (1950). Psychological scaling without a unit of measurement.
Psychological Review, 57, 148–158.
Dow, B. M. (1990). Nested maps in macaque monkey visual cortex. In K. N. Leibovic
(ed.), Science of vision, 84–124. New York: Springer Verlag.
Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal
component analysis. Biometrika, 58, 453–467.
Green, P. E. and Tull, D. S. (1975). Research for marketing decisions, 3rd ed. Englewood
Cliffs, N.J.: Prentice Hall.
Kosslyn, S. M. (1981). Image and mind. Cambridge: Harvard University Press.
Lilien, G. L., Kotler, P., and Moorthy, K. S. (1992). Marketing models. Englewood Cliffs,
N.J.: Prentice Hall.
Long, L. H. (ed.) (1971). The world almanac. New York: Doubleday.
Milgram, S. and Jodelet, D. (1976). Psychological maps of Paris. In H. M. Proshansky, W.
H. Itelson, and L. G. Revlin (eds.), Environmental Psychology. New York: Holt,
Rinehart, and Winston.
Osgood, C. E., Suci, G. J., and Tannenbaum, P. H. (1957). The measurement of meaning.
Urbana, Ill.: University of Illinois Press.
Schönemann, P. H. (1966). The generalized solution of the orthogonal Procrustes problem.
Psychometrika, 31, 1–16.
Schwartz, E. J. (1981). Computational anatomy and functional architecture of striate
cortex: A spatial mapping approach to perceptual coding. Visual Research, 20, 645–669.
Wilkinson, L. (1975). The effect of involvement on similarity and preference structures.
Unpublished dissertation, Yale University.
Wilkinson, L., Blank, G., and Gruber, C. (1996). Desktop Data Analysis with SYSTAT.
Upper Saddle River, N.J.: Prentice-Hall.
Wish, M., Deutsch, M., and Biener, L. (1972). Differences in perceived similarity of
nations. In R. N. Shepard, A. K. Romney, S. B. Nerlove (eds.), Multidimensional
scaling: Theory and applications in the behavioral sciences, Vol. 2, 289–313. New
York: Seminar Press.


Chapter 25
Probit Analysis
Dan Steinberg
The PROBIT module calculates maximum likelihood estimates of the parameters of
the PROBIT general linear model. A modified Gauss-Newton algorithm is used to
compute the estimates. Conventionally, the dependent variable is coded as 0 or 1,
although the PROBIT module will automatically recode values of the dependent
variable because it assumes that it is categorical. Models may include categorical
predictors (dummy coded), as well as interaction terms.
Statistical Background
The PROBIT model provides an appropriate method for estimating a multiple
regression or analysis of variance or covariance when the dependent variable is
categorical and can take only one of two values. PROBIT analyzes models of the form
y = Xb + e

which is interpreted to mean

Prob(y = 1) = Φ(Xb)

where y is the binary dependent variable, X is a vector of independent variables, b is
a vector of unknown regression coefficients, e is a normally distributed random error,
and Φ is the cumulative normal distribution. The normality assumption is an essential
characteristic of the PROBIT model; alternative distributional assumptions give rise to
different statistical models, such as LOGIT.
The purpose of PROBIT analysis is to produce an estimate of the probability that the
value of the dependent variable is equal to 1 for any set of independent variable values,
and to identify those independent variables that are significant predictors of the
outcome. The estimated coefficients, b, are assumed to generate a predicted z score, Xb.
Interpreting the Results
The simplest interpretation of the output is obtained by noting the significant variables
and identifying them as useful predictors of the dependent variable. More sophisticated
interpretation requires scaling the coefficients into derivatives. While the predicted
effect of an independent variable on the z score is linear, the effect on the probability
that the dependent variable equals 1 is nonlinear. The derivative of this probability with
respect to the ith independent variable is given by

b_i f(Xb)

where f(Xb) is the normal density evaluated at the predicted z score.
This derivative will differ for each observation in the data set. The correct way to
estimate this derivative for the sample is to evaluate it for each observation and then
average all the observations. A good approximation can be obtained by evaluating the z
score for the mean set of X's and using the above scaling formula, or by using the normal
density evaluated at the z score that would split the sample to match the observed split.
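As a sketch of this scaling (illustrative only, not the PROBIT module's code; the data rows, variable roles, and coefficient values below are hypothetical), the average derivative can be computed by evaluating the normal density at each case's predicted z score and averaging:

import numpy as np
from scipy.stats import norm

def probit_marginal_effects(X, b):
    # Average of b_i * f(Xb) over the sample, where f is the standard normal density.
    # X includes a leading column of ones for the constant; b matches its columns.
    z = X @ b                  # predicted z scores
    density = norm.pdf(z)      # f(Xb) for each case
    return density.mean() * b  # average derivative for each coefficient

# Hypothetical design matrix (constant, education, age) and coefficient vector.
X = np.column_stack([np.ones(4), [10.0, 12.0, 14.0, 16.0], [18.0, 19.0, 20.0, 21.0]])
b = np.array([0.5, -0.2, 0.1])
print(probit_marginal_effects(X, b))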
Probit Analysis in SYSTAT
Probit Analysis Main Dialog Box
To open the Probit dialog box, from the menus choose:
Statistics
Regression
Probit...
Dependent. The variable you want to examine. The dependent variable should be a
binary numeric variable. Normally, the dependent variable is coded so that the larger
value denotes the reference group.
Independent(s). Select one or more variables. Categorical variables must be designated
using the Category button. To add an interaction to your model use the Cross button.
For example, add income to the Independent list and then use the Cross button to add
education, which will look like income*education. A variable with a positive
correlation with the dependent variable should have a positive coefficient when fitted
alone. To reverse the direction of this coding, use ORDER with a descending sort for
the dependent variable.
Include constant. Includes the constant in the regression equation. Deselect this option
to remove the constant. You rarely want to remove the constant, and you should be
familiar with no-constant regression terminology before considering it.
Save file. Saves specified statistics into filename SYD.
Categories
Independent variables (predictors) in a PROBIT model can be either categorical or
continuous. To prevent category codes from being treated as continuous data, specify
categorical variables as such using the Categories dialog box.
All independent variables selected for the model appear in the variable list.
Categorical Variable(s). You want to categorize an independent variable when it has
several categories such as education levels, which could be divided into the following
categories: less than high school, some high school, finished high school, some college,
finished bachelors degree, finished masters degree, and finished doctorate. On the
other hand, a variable such as age in years would not be categorical unless age were
broken up into categories such as under 21, 2165, and over 65.
You must indicate the coding method to apply to categorical variables. The two
available options include:
n Effect. Produces parameter estimates that are differences from group means.
n Dummy. Produces dummy codes for the design variables instead of effect codes.
Coding of dummy variables is the classic analysis of variance parameterization, in
which the sum of effects estimated for a classifying variable is 0. If your categorical
variable has k categories, k − 1 dummy variables are created.
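The sketch below illustrates the two coding schemes (it is not the PROBIT module's code; the function names and the example variable are hypothetical). Both produce k − 1 columns; effect coding additionally codes the reference (highest) category as −1 in every column:

import numpy as np

def dummy_codes(values):
    # k - 1 dummy columns; the highest category is the reference (coded all zeros).
    levels = sorted(set(values))
    ref = levels[-1]
    codes = np.column_stack([[1.0 if v == lev else 0.0 for v in values]
                             for lev in levels[:-1]])
    return codes, ref

def effect_codes(values):
    # Effect coding: like dummy coding, but the reference category is coded -1.
    codes, ref = dummy_codes(values)
    codes[np.array(values) == ref] = -1.0
    return codes, ref

# Hypothetical three-level categorical variable.
culture = [1, 2, 3, 2, 1, 3]
print(dummy_codes(culture)[0])
print(effect_codes(culture)[0])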
Using Commands
Select a data file using USE filename and continue with:

PROBIT
MODEL yvar = CONSTANT + xvarlist + xvar*xvar + ... / PROBIT
CATEGORY grpvarlist / MISS EFFECT or DUMMY
ESTIMATE

PROBIT must be specified in MODEL. Use an * between variables to specify
interactions.
Usage Considerations
Types of data. PROBIT uses rectangular data only.
Print options. Coefficient estimates and their covariance matrix are printed in all
circumstances.
Quick Graphs. PROBIT produces no Quick Graphs.
Saving files. In the PROBIT model, the predicted value of the dependent variable is a
normal z score. If you want to save this variable, you can issue a SAVE command. This
will produce a SYSTAT system file with the variable ZSCORE, the predicted z score
from the last model estimated, and MILLS, the hazard function evaluated at the
predicted z score. By using the cumulative normal probability function in the DATA
module, you can convert the z score into a predicted probability. The MILLS variable
is often used as a selectivity bias correction variable in regression models with
nonrandom sampling. For further details, see the references at the end of this chapter.
Additional variables saved are SEZSCORE (standard errors), PROB (corresponding
probability), DENSITY (associated density value), and confidence intervals for the
parameters.
BY groups. PROBIT analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is not available in this procedure.
Case frequencies. FREQ = <variable> increases the number of cases by the FREQ
variable. This feature does not use extra memory.
Case weights. WEIGHT is not available in PROBIT.
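As a rough illustration of the conversion described under "Saving files" above (plain Python, outside SYSTAT; the hazard convention shown is one common choice and may not match SYSTAT's exact definition of MILLS):

from scipy.stats import norm

def probability_and_hazard(zscore):
    prob = norm.cdf(zscore)                    # predicted probability that y = 1
    # Inverse Mills ratio under one common convention; treat as illustrative only.
    mills = norm.pdf(zscore) / norm.cdf(zscore)
    return prob, mills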
Examples
Example 1
Probit Analysis (Simple Model)
This example shows a simple linear PROBIT model. The data, which have been
extracted from the National Longitudinal Survey of Young Men, 1979, include school
enrollment status (NOTENR = 1 if not enrolled), age (AGE), highest completed grade
(EDUC), mother's education (MED), an index of reading material available in the
home (CULTURE), and an IQ score (IQ) for 200 individuals. The input is:

USE NLS
PROBIT
MODEL NOTENR = CONSTANT + EDUC + AGE / PROBIT
ESTIMATE

The resulting output is:
Variables in the SYSTAT Rectangular file are:
NOTENR CONSTONE BLACK SOUTH EDUC AGE
FED MED CULTURE NSIBS LW IQ
FOMY

Categorical values encountered during processing are:
NOTENR (2 levels)
0, 1
Categorical variables are effects coded with the highest value as reference.


Binary Probit Analysis


Dependent variable : NOTENR
Input records : 200
Records kept for analysis: 200

Convergence achieved after 4 iterations.
Relative tolerance = 0.000

Number of observations : 200.000

Number with NOTENR = 0 (non-response) : 28.000
Number with NOTENR = 1 (response) : 172.000

Results of estimation


Log Likelihood: -75.240
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 2.187 1.148 1.905 0.057
2 EDUC -0.161 0.051 -3.185 0.001
3 AGE 0.048 0.040 1.184 0.236
-2 * L.L. ratio = 11.505 with 2 degrees of freedom
Chi-Sq. p-value = 0.003



Covariance Matrix
1 2 3
1 1.318
2 -0.027 0.003
3 -0.035 -0.000 0.002
PROBIT always reports the number of cases processed and the means of the dependent
variable for each of the two subgroups defined by the value of the dependent variable.
If all observations have the same value of the dependent variable, PROBIT will return
an error message indicating that the model cannot be estimated.

Before printing the coefficient estimates, PROBIT reports whether it has achieved
convergence, the value of the likelihood function, the percentage change in the
likelihood achieved in the last iteration, the convergence criterion, the number of
iterations required, the size of the two subsamples of the data, and the likelihood ratio
chi-square test of the null hypothesis that all coefficients except the constant are equal
to 0. If there isn't a constant specified on the model statement, this last statistic will not
be computed. Next, the coefficient estimates, standard errors, and t statistics are
presented. The coefficients, analogous to regression coefficients, represent the change
in the z score that is predicted by a unit change in the independent variable. Finally, the
variance-covariance matrix of the coefficient estimates is printed. This matrix,
analogous to the inverse of X'X in a linear regression model, can be used to conduct
hypothesis tests.
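As an illustration of how the printed estimates and covariance matrix can be combined (plain Python; the numbers are simply read off the output above, and the 95% interval is a standard large-sample approximation rather than something PROBIT prints):

import math

est, var = 2.187, 1.318                  # CONSTANT estimate and its variance
se = math.sqrt(var)                      # about 1.148, matching the printed S.E.
t_ratio = est / se                       # about 1.905, matching the printed t-ratio
ci_95 = (est - 1.96 * se, est + 1.96 * se)
print(round(se, 3), round(t_ratio, 3), ci_95)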
Example 2
Probit Analysis with Interactions
This example adds an interaction term to the simple model from the previous example.
When doing so, it is useful to standardize variables in product terms so that they do not
"soak up" variance from main effects simply because they become highly correlated
due to scale effects. You can compare the results with and without standardization. The
input is:

USE NLS
STANDARDIZE EDUC AGE
PROBIT
MODEL NOTENR= CONSTANT + EDUC + AGE + EDUC*AGE / PROBIT
ESTIMATE

The resulting output is:
Variables in the SYSTAT Rectangular file are:
NOTENR CONSTONE BLACK SOUTH EDUC AGE
FED MED CULTURE NSIBS LW IQ
FOMY

200 cases and 13 variables processed and saved.

Categorical values encountered during processing are:
NOTENR (2 levels)
0, 1
Categorical variables are effects coded with the highest value as reference.


Binary Probit Analysis


Dependent variable : NOTENR
Input records : 200
Records kept for analysis: 200
Convergence achieved after 4 iterations.
Relative tolerance = 0.000

Number of observations : 200.000

Number with NOTENR = 0 (non-response) : 28.000
Number with NOTENR = 1 (response) : 172.000

Results of estimation


Log Likelihood: -72.805
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 1.176 0.125 9.381 0.000
2 EDUC -0.479 0.137 -3.505 0.000
3 AGE 0.067 0.126 0.530 0.596
4 EDUC*AGE 0.274 0.128 2.141 0.032
-2 * L.L. ratio = 16.375 with 3 degrees of freedom
Chi-Sq. p-value = 0.001



Covariance Matrix
1 2 3 4
1 0.016
2 -0.006 0.019
3 -0.000 0.002 0.016
4 0.002 -0.006 -0.004 0.016
Notice that the education main effect remains significant and the interaction is itself
moderately significant in this expanded model.

Computation

All of the computations are in double precision.

Algorithms

PROBIT maximizes the likelihood function for the binary PROBIT model by the
Newton-Raphson method.

Missing Data

Cases with missing data for any variable in the model are deleted.
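The iteration mentioned under "Algorithms" above is easy to sketch. The code below is an illustrative re-implementation (a Fisher-scoring variant of Newton-Raphson in plain Python), not SYSTAT's source code; it assumes a 0/1 response vector y and a design matrix X that already contains the constant column:

import numpy as np
from scipy.stats import norm

def probit_fit(X, y, tol=1e-6, max_iter=50):
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        z = X @ b
        p = norm.cdf(z).clip(1e-10, 1 - 1e-10)
        d = norm.pdf(z)
        score = X.T @ (d * (y - p) / (p * (1 - p)))   # gradient of the log likelihood
        w = d ** 2 / (p * (1 - p))                    # expected (Fisher) information weights
        info = X.T @ (X * w[:, None])
        step = np.linalg.solve(info, score)
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    return b, np.linalg.inv(info)   # estimates and their approximate covariance matrix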
References
Amemiya, T. (1981). Qualitative response models: A survey. Journal of Economic
Literature, December, 1483-1536.
Finney, D. J. (1971). Probit analysis, 3rd ed. Cambridge: Cambridge University Press.
Heckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153-161.
McFadden, D. (1982). Qualitative response models. In W. Hildebrand (ed.), Advances in
Econometrics. Cambridge: Cambridge University Press.
Chapter 26
Set and Canonical Correlation
Jacob Cohen and Leland Wilkinson
SETCOR computes set correlations (Cohen, 1982) and canonical correlations
(Hotelling, 1935, 1936). Although it is based on algorithms developed initially for the
mainframe program CORSET (Cohen and Nee, 1983) and subsequently for the PC
program SETCORAN (Eber and Cohen, 1987), the SYSTAT program is completely
new source code. Recent corrections in statistical tests by Jacob Cohen and Charles
Lewis (1988) have been incorporated into the SYSTAT version.
Finally, SETCOR also computes the Stewart and Love (1968) canonical
redundancy index and rotates canonical variates.
Statistical Background
Set correlation (SC) is a realization of the multivariate general linear model and
therefore is a natural generalization of simple and multiple correlation. In its standard
form, it generalizes bivariate and multiple regression to their multivariate analogue.
The standard univariate and multivariate methods provided by the SYSTAT MGLH
module (for example, multivariate analysis of variance and covariance, discriminant
function analysis) may be viewed as special cases of SC. SC thus provides a single
general framework for the study of association. In contrast to canonical correlation, it
yields a partitioning of variance in terms of the original variables, rather than their
canonical transformations.
Sets
The building blocks of SC are sets of variables, which may be categorical or
quantitative. They may also comprise interactions or products of measured variables.
If they are nonlinearly related to each other, they should be transformed prior to
analysis to avoid misleading conclusions. The same assumptions underlying ANOVA,
linear regression, and other linear models are appropriate to SC.
Partialing
By partialing a set A from a set B (residualizing B by A), a new set B|A is produced
whose variables have zero correlations with the set A variables. (The notation B.A,
used in some of the cited papers, is equivalent to the notation B|A here.) This device
has several uses in data analysis, including the statistical adjustment for irrelevant or
spurious sources of variance or covariance (as in the analysis of covariance), the
representation of curvilinear components and of interactions (Cohen, 1978), and the
representation of contrasts among means.
In MRC (multiple regression/correlation analysis), the use of sets and partialing applies
to the right side of the equation, the independent variables, which is where the
multiplicity lies. The dependent variable y
is a single variable. SC is a generalization of MRC such that a set of dependent
variables Y may be related to a set X, either of which may be a partialed set. Given that
virtually any information may be expressed as a set of variables, SC offers the
possibility of a flexible general data-analysis method.
The basic reference for SC is Cohen (1982), reprinted in Cohen & Cohen (1983,
Appendix 4), referred to hereafter as CSC. Cohen & Nee (1984) give estimators for
two measures of association (shrunken $R^2_{Y,X}$ and $T^2_{Y,X}$), and Cohen (1988b, Chapter 10)
provides a full treatment of power analysis in SC. Van den Burg and Lewis (1988)
describe the properties of the association measures and provide formal proofs. The
various devices for the representation of information as sets of variables are described
and illustrated in detail in Cohen and Cohen (1983, Chapters 4-9 and 11), referred to
hereafter as C&C. This chapter focuses on the nuts and bolts of the method and
illustrates its chief features as they are represented in the input and output of SETCOR.
Notation

In what follows, the symbols $Y_B$ and $X_B$ represent basic sets: set $Y_B$ may be a set of
dependent variables Y, or a set of dependent variables Y from which another set $Y_P$ has
been partialed, represented as $Y|Y_P$. Similarly, set $X_B$ may be a set of independent
variables X, or a set of independent variables X from which another set $X_P$ has been
partialed, $X|X_P$. (The term "basic" replaces the term "generic" used in CSC.) All references
to sets Y and X in formulas that follow are to be understood to mean $Y_B$ and $X_B$, the
left-hand and right-hand sets, whether or not either has been partialed. Where Y
and $Y_B$ or X and $X_B$ must be distinguished, this will be done in the notation.
Measures of Association Between Sets

It is desirable that a measure of association between sets be a natural generalization of
multiple $R^2$, bounded by 0 and 1, invariant over full-rank linear transformation
(rotation) of either or both sets, and symmetrical (that is, $R^2_{Y,X} = R^2_{X,Y}$). Of the measures
of multivariate association that have been proposed (Cramer and Nicewander, 1979),
three have been found to be particularly useful: multivariate $R^2_{Y,X}$ and the symmetric
($T^2_{Y,X}$) and asymmetric ($P^2_{Y,X}$) squared trace correlations.

$R^2_{Y,X}$: Proportion of Generalized Variance

Using determinants of correlation matrices,

$$R^2_{Y,X} = 1 - \frac{|R_{YX}|}{|R_{YY}|\,|R_{XX}|}$$

where
n $R_{YX}$ is the full correlation matrix of the $Y_B$ and $X_B$ variables,
n $R_{YY}$ is the matrix of correlations among the variables of set $Y_B$, and
n $R_{XX}$ is the matrix of correlations among the variables of set $X_B$.

This equation also holds when variance-covariance (S) or sums of squares and cross-
products (C) matrices replace the correlation matrices.

$R^2_{Y,X}$ may also be written as a function of the q squared canonical correlations ($C_i^2$),
where $q = \min(k_Y, k_X)$ is the number of variables in the smaller of the two basic sets:

$$R^2_{Y,X} = 1 - (1 - C_1^2)(1 - C_2^2)\cdots(1 - C_q^2)$$

$R^2_{Y,X}$ is a generalization of the simple bivariate $r^2_{y,x}$ and of multiple $R^2$ and is properly
interpreted as the proportion of the generalized variance (multivariance) of set $Y_B$
accounted for by set $X_B$ (or vice versa, because like all product-moment correlation
coefficients, it is symmetrical). Generalized variance is the generalization of the
univariate concept of variance to a set of variables and is defined here as the
determinant of the covariance matrix of the variables in the set. You can interpret
proportions of generalized variance much as you interpret proportions of variance of a
single variable. $R^2_{Y,X}$ does not vary with changes in location or scale of the variables,
with nonsingular transformations of the variables within each set (for example,
orthogonal or oblique rotations), or with different single degree-of-freedom codings of
nominal scales. $R^2_{Y,X}$ makes possible a multiplicative decomposition in terms of
squared (multiple) partial (but not semipartial) correlations. See CSC and Van den
Burg and Lewis (1988) for the justification of these statements and a discussion of
these and other properties of $R^2_{Y,X}$.
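As a small numerical check of the determinant formula above (plain Python; R is assumed to be a full correlation matrix with the k_Y dependent variables listed first, which is an assumption for this sketch rather than a description of SETCOR's internal organization):

import numpy as np

def r2_generalized_variance(R, ky):
    # 1 - |R| / (|R_YY| * |R_XX|): the proportion of generalized variance
    R = np.asarray(R, dtype=float)
    Ryy, Rxx = R[:ky, :ky], R[ky:, ky:]
    return 1.0 - np.linalg.det(R) / (np.linalg.det(Ryy) * np.linalg.det(Rxx))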
$T^2_{Y,X}$ and $P^2_{Y,X}$: Proportions of Additive Variance

Two other useful measures of multivariate association are based on the trace of the
variance-covariance matrix $M_{YX} = S_{YY}^{-1} S_{YX} S_{XX}^{-1} S_{XY}$, where Y and X are again
taken as basic. It can be shown that the trace of this matrix,

$$\mathrm{tr}(M_{YX}) = \sum_{i=1}^{q} C_i^2$$

is the sum of the q squared canonical correlations. $T^2_{Y,X}$, the symmetric squared trace
correlation, is the trace divided by q, or the mean of the q squared canonical
correlations,

$$T^2_{Y,X} = \frac{\mathrm{tr}(M_{YX})}{q} = \frac{\sum_{i=1}^{q} C_i^2}{q}$$

A space may be defined by a set of variables, and any nonsingular linear
transformation (for example, rotation) of these variables defines the same space.
Assume that $k_Y < k_X$ and consider any orthogonalizing transformation of the (basic) Y
variables. Find the multiple $R^2$s of each of the orthogonalized Y variables with set $X_B$;
their sum equals $\mathrm{tr}(M_{YX})$, so the mean of these multiple $R^2$s is $T^2_{Y,X}$. This symmetric
squared trace correlation also has a proportion of variance interpretation, but unlike
$R^2_{Y,X}$, the definition of variance is that of additive (or total) variance, the sum of the unit
variances of the smaller set; that is, q. $T^2_{Y,X}$ provides an additive decomposition into
squared semipartial (but not partial) correlations. It may, however, decrease when a
variable is added to the lesser of $k_Y$ and $k_X$ (CSC; van den Burg and Lewis, 1988).

$P^2_{Y,X}$ is the trace divided by $k_Y$, the number of dependent variables, and is therefore
asymmetric. When $k_Y > k_X$, its maximum is $k_X / k_Y$. It shares with $R^2_{Y,X}$ and multiple
$R^2$ the property that the addition of a variable to either X or Y cannot result in a
decrease. When $k_Y \le k_X$ (the usual case), $P^2_{Y,X} = T^2_{Y,X}$. In a comprehensive analysis
of their properties, van den Burg and Lewis (1988) argue that $P^2_{Y,X}$ (rather than $T^2_{Y,X}$)
is a direct generalization of multiple $R^2$.
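In the same spirit, given the q squared canonical correlations, all three association measures follow directly from the formulas above (again an illustrative sketch in plain Python, not SETCOR code):

import numpy as np

def association_measures(squared_canonical_corrs, ky):
    c2 = np.asarray(squared_canonical_corrs, dtype=float)
    q = c2.size                       # q = min(k_Y, k_X)
    r2 = 1.0 - np.prod(1.0 - c2)      # proportion of generalized variance
    t2 = c2.sum() / q                 # symmetric squared trace correlation
    p2 = c2.sum() / ky                # asymmetric squared trace correlation
    return r2, t2, p2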
Interpretations

The varied uses of partialing (residualization), made familiar by MRC, make possible
a functional analysis of data in terms of research factors as defined above. The basic
set $X_B$, made up of $X|X_P$, may be used in the following ways:
n The statistical control of the research factor(s) in $X_P$ when X is to be related to $Y_B$.
If the model to be tested posits an independent effect of X on $Y_B$, then $X|X_P$ holds
$X_P$ constant; without $X_P$ partialed, the effect found for X may be a spurious
consequence of the association of $X_P$ with both X and $Y_B$. In the analysis of
covariance (univariate and multivariate), partialing the covariates also has the
effect of reducing the error variance, and thus increases power.
n The representation of interactions of any order for research factors of any kind. For
example, the $U \times V$ interaction set is constructed as $X|X_P$, where X is UV, the set of
$k_U k_V$ product variables that result from multiplying each of the variables in
research factor U by each of the variables in research factor V, and $X_P$ is $U + V$, the
$k_U + k_V$ variables of the combined U and V research factors (C&C, Chapter 8).
n The representation of curve components in polynomial (curvilinear) regression.
For example, for the cubic component of a variable v, set X is $v^3$ and set $X_P$ is made
up of v and $v^2$ (C&C, Chapter 6).
n The representation of a particular contrast within a set of means of the categories
of a nominal scale. Here, X contains a single suitably coded variable and $X_P$
contains the remaining variables carrying other contrasts (C&C, Chapter 5).
n The purification of a variable to its uniqueness, as when X is made up of one
subtest of a battery of intercorrelated measures and $X_P$ contains the remaining
subtests. Examples of X are the Digit Symbol subtest of the Wechsler Adult
Intelligence Scale or the Schizophrenia scale score of the Minnesota Multiphasic
Personality Inventory, with $X_P$ in each instance the respective remaining
subtest/scale scores.
n The use of missing data as positive information. Here, X represents a research
factor for which some subjects, having no data, are assigned an arbitrary constant
(usually the mean), and $X_P$ is a single binary variable coded 1 for the cases with
missing data and 0 for those with data present (C&C, Chapter 7).

In SC, the partialing devices described above for the set $X_P$ may equally be employed
in the $Y_B$ set as $Y|Y_P$. Thus, you may control a dependent variable for age, sex, and
socioeconomic status, or represent curve components, interactions, missingness, or
uniqueness of a dependent variable or set of dependent variables. (See CSC for
examples.)
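Because partialing is just ordinary least-squares residualization, it is easy to illustrate outside SETCOR. A minimal sketch (numpy only, with an intercept included by assumption; SETCOR's internal computations are organized differently):

import numpy as np

def partial_set(B, A):
    # Return B|A: each column of B with set A (plus a constant) regressed out.
    n = A.shape[0]
    A1 = np.column_stack([np.ones(n), A])
    coefs, *_ = np.linalg.lstsq(A1, B, rcond=None)
    return B - A1 @ coefs              # residuals have zero correlation with set A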
Types of Association between Sets

Given the option of partialing, there are five types of association possible in SC:

Whole: set Y with set X
Partial: set $Y|Y_P$ with set $X|X_P$ (where $X_P = Y_P$)
Y semipartial: set $Y|Y_P$ with set X
X semipartial: set Y with set $X|X_P$
Bipartial: set $Y|Y_P$ with set $X|X_P$

Formulas for the covariance matrices required for the computation of $R^2_{Y,X}$ and $T^2_{Y,X}$
for the five types of association are given in CSC, Table 1. Following an SC analysis,
further analytic detail is provided by output for MRC analyses for each basic y variable
on the set of basic x variables, y and x being single variables in their respective sets.
Thus, it is for the individual variables, partialed or whole depending on the type of
association, that the regression and correlation results are provided. The information
provided in the output for these individual basic variables (betas, multiple $R^2$s)
facilitates the interpretation of the SC results of the $X_B$ and $Y_B$ sets that they constitute.
Testing the Null Hypothesis

Throughout this section, X and Y are to be understood as basic. For purposes of testing
the hypothesis of no association between sets X and Y, we treat Y as the dependent
variable set, X as independent, and employ the fixed model. Wilks' likelihood ratio $\Lambda$
is the ratio of the determinant of the error covariance matrix E to the determinant of the
sum of the error (E) and hypothesis (H) covariance matrices,

$$\Lambda = \frac{|E|}{|E + H|}$$

where H is the variance-covariance accounted for in the Y variables by X,

$$H = S_{YX}\,S_{XX}^{-1}\,S_{XY}$$

The definition of E depends on whether the test is to employ Model 1 or Model 2 error.
Model 1 error is defined as

$$E_1 = S_{YY} - S_{Y,XX_P}\,R_{XX_P}^{-1}\,S_{XX_P,Y}$$

that is, the residual covariance matrix when the covariance associated with sets X and $X_P$
has been removed.

Model 2 error is employed when there exists a set G, made up of variables in neither
X nor $X_P$, that can be used to account for additional variance in $S_{YY}$ and thus reduce E
below $E_1$ in the interest of unbiasedness and increased statistical power. This occurs
when, with multiple research factors, the analyst wishes to use "pure" error, for
example, the within-cell variance in a factorial design. In this case, the error-reducing
set G is made up of the variables comprising the research factors (main effects) and
interactions other than the factor or interaction under test, as is done traditionally in
MANOVA and MANCOVA factorial designs.

$$E_2 = S_{YY} - S_{Y,XX_PG}\,R_{XX_PG}^{-1}\,S_{XX_PG,Y}$$

In whole and Y semipartial association, where $X_P$ does not exist, it is dropped from $E_1$
and $E_2$. Formulas for the H and E matrices for the five types of association are given
in CSC, Table 2. (See "Algorithms" later in this chapter for corrections to the CSC formulas.)
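For the whole-association case with Model 1 error and no partialed sets, these matrices reduce to familiar partitioned-covariance expressions. A brief numerical sketch (plain Python, illustrative only; S is assumed to be a covariance matrix with the k_Y dependent variables listed first):

import numpy as np

def wilks_lambda_whole(S, ky):
    S = np.asarray(S, dtype=float)
    Syy, Syx, Sxx = S[:ky, :ky], S[:ky, ky:], S[ky:, ky:]
    H = Syx @ np.linalg.inv(Sxx) @ Syx.T    # hypothesis covariance
    E = Syy - H                             # Model 1 error with no partialed sets
    return np.linalg.det(E) / np.linalg.det(E + H)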
When Model 1 error (no set G) is used, for the whole, partial, and Y semipartial
types of association, it can be shown that

$$\Lambda = 1 - R^2_{Y,X}$$

Once $\Lambda$ is determined for a sample, Rao's F test (1973) may be applied to test the null
hypothesis. As adapted for SC, the test is quite general, covering all five types of
association and both error models. When $k_Y = 1$, where multivariate $R^2_{Y,X}$ specializes
to multiple $R^2$, the Rao F test specializes to the standard null hypothesis F test for
MRC. For this case, and for the case where the smaller set is made up of no more than
two variables, the Rao F test is exact; otherwise, it is approximate (Cohen & Nee, 1987).

$$F = \frac{1 - \Lambda^{1/s}}{\Lambda^{1/s}} \cdot \frac{v}{u}$$

where

u = numerator df = $k_Y k_X$,
v = denominator df = $ms + 1 - u/2$, where
m = $N - \max(k_{Y_P},\, k_{X_P} + k_G) - (k_Y + k_X + 3)/2$, and

$$s = \sqrt{\frac{k_Y^2 k_X^2 - 4}{k_Y^2 + k_X^2 - 5}}$$

except that when $k_Y^2 k_X^2 = 4$, s = 1. For partial $R^2_{Y,X}$, set $X_P$ = set $Y_P$, so $k_{X_P} = k_{Y_P}$ is the
number of variables in the set that is being partialed, and for whole $R^2_{Y,X}$, neither set
$X_P$ nor set $Y_P$ exists. Further, $k_{Y_P}$, $k_{X_P}$, and $k_G$ are 0 when the set does not exist for the
type of association or error model in question. The test assumes that the variables in X
are fixed and those in Y are multivariate normal, but the test is quite robust against
assumption failure (Cohen & Nee, 1989; Olson, 1974).

Estimates of the Population $R^2_{Y,X}$, $T^2_{Y,X}$, and $P^2_{Y,X}$

Like all squared correlations, $R^2_{Y,X}$, $T^2_{Y,X}$, and $P^2_{Y,X}$ are positively biased. Shrunken
values (almost unbiased population estimates) are given by

$$\tilde{R}^2_{Y,X} = 1 - (1 - R^2_{Y,X})\left(\frac{v + u}{v}\right)^{s}$$

$$\tilde{T}^2_{Y,X} = 1 - (1 - T^2_{Y,X})\left(\frac{w + u}{w}\right)^{q}$$

and

$$\tilde{P}^2_{Y,X} = \tilde{T}^2_{Y,X}\,\frac{q}{k_Y}$$

where w is the denominator df of the Pillai (1960) F test for $T^2_{Y,X}$,

$$w = q\,[\,N - \max(k_Y, k_X) - \max(k_{Y_P}, k_{X_P}) - 1\,]$$

(Cohen & Nee, 1984). When q = 1, both $\tilde{R}^2_{Y,X}$ and $\tilde{T}^2_{Y,X}$ specialize to Wherry's (1931)
formula for the shrunken multiple $R^2$.
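A numerical sketch of the Rao F approximation and the shrunken R-square as reconstructed above, for a whole association with Model 1 error (plain Python, illustrative only; not SETCOR's internal code). Plugging in the values printed in Example 1 later in this chapter (R-square = 0.367, N = 48, two y variables, three x variables) reproduces the df of 6 and 86 shown there:

import math

def rao_f_and_shrunken_r2(r2, n_cases, ky, kx):
    u = ky * kx                                   # numerator df
    s = 1.0 if ky * kx == 2 else math.sqrt((ky**2 * kx**2 - 4) / (ky**2 + kx**2 - 5))
    m = n_cases - (ky + kx + 3) / 2               # whole association: no partialed sets, no set G
    v = m * s + 1 - u / 2                         # denominator df
    lam = 1.0 - r2                                # Lambda = 1 - R-square in this case
    f = (1 - lam ** (1 / s)) / lam ** (1 / s) * v / u
    shrunken = 1 - (1 - r2) * ((v + u) / v) ** s
    return f, u, v, shrunken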
Set and Canonical Correlations in SYSTAT
Set and Canonical Correlations Main Dialog Box
To open the Set and Canonical Correlations dialog box, from the menus choose:
Statistics
Correlations
Set and Canonical...
To do a SETCOR analysis, first specify a model.
Dependent(s). Enter the dependent variables you want to examine.
Dependent Partial(s). You can specify the variables to be partialed out of the dependent
set, which produces a new set whose variables have zero correlation with the partialed
set. The partial variables are optional for dependent variables. The simple canonical
correlation model does not have a partial variable list.
Independent(s). Select one or more continuous or categorical variables (grouping
variables).
Independent Partial(s). You can specify the variables to be partialed out of the
independent set, which produces a new set whose variables have zero correlation with
the partialed set. The partial variables are optional for independent variables. The
simple canonical correlation model does not have a partial variable list.
Error. Specify a set of variables to be used in computing error terms for statistical tests.
Error variables are optional.
Set and Canonical Correlation Options
The Options dialog box controls the rotation and sample size options for estimation.
The following options can be specified:
n Rotate first. Enter the number of canonical factors to rotate using the varimax
rotation.
n Number of cases. If you enter a triangular matrix (correlation), specify the number
of cases from which the matrix was computed. This is required if you are using a
correlation matrix instead of raw data.
Using Commands
After specifying the data with USE filename, continue with:

SETCOR
MODEL yvarlist | ypartials = xvarlist | xpartials
ERROR = varlist
ESTIMATE / N=n ROTATE=n

Usage Considerations
Types of data. SETCOR normally uses rectangular data. SETCOR will also accept a
lower triangular Pearson correlation matrix, which is produced by SYSTAT's CORR
module, in which case n must be specified in the ESTIMATE command. You may be
tempted to solve missing data problems by using a correlation matrix produced by
CORR using its pairwise deletion option with an average or minimum n. Although this
may produce reasonable results when data are randomly missing, you should consider
Wilkinson's (1988) and Cohen and Cohen's (1983) warnings concerning pairwise
matrices. A better technique for dealing with missing data is to use the maximum
likelihood (EM) estimation of the covariance or correlation matrix in the CORR
module. See Chapter 7 of Cohen and Cohen (1983) for a detailed discussion of missing
data.
Unlike the MGLH module in SYSTAT, SETCOR cannot handle products of variables
when sets are defined. Thus, with AGE and SEX as variables in a rectangular file,
naming AGE*SEX or AGE*AGE as a variable in a set will result in an error message.
To use product (or other) functions of variables, they must be created using SYSTAT
BASIC.
SETCOR can use nominal (qualitative or categorical) scales in any of the sets it
employs by means of a variety of coding methods. The most useful of these, which are
dummy, effects, and contrast coding, are discussed in detail in Chapter 5 of Cohen and
Cohen (1983), and their use in set correlation illustrated in Cohen (1982) and in Cohen
(1988a). The CATEGORY command codes these variables.
Print options. For PRINT=SHORT, the output gives n, the type of association, the
variables in sets YPARTIAL, XPARTIAL, and G (when present), the Rao F (with its df
and p value), the R-square, T-square, and P-square measures and their shrunken
values, and the following results for the basic y and x variables: the within-set
correlation matrices for the y and x variables, the rectangular between-set correlation
matrix, the betas for estimating each y variable from the x set (with their standard
errors and p values), a matrix of the intercorrelations of the estimated y values whose
diagonal is the multiple R-square of each y variable with the x set, and
estimated y values whose diagonal is the multiple of each y variable with the x set, and
the F test and p for the latter.
PRINT=LONG gives, in addition to the above results for basic y and x, the Stewart
and Love redundancy index for y given xb, the canonical correlations and their Bartlett
chi-square tests, and the canonical coefficients, loadings, and redundancies for both
sets. When PRINT=LONG, the option ROTATE rotates the dependent and independent
canonical loadings and the canonical correlations.
Quick Graphs. SETCOR produces no Quick Graphs.
Saving files. SETCOR does not save results.
BY groups. SETCOR analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. SETCOR uses the FREQ variable, if present, to duplicate cases. This
inflates the total degrees of freedom to be the sum of the number of frequencies. Using
a FREQ variable does not require more memory, however.
Case weights. SETCOR weights sums of squares and cross products using the WEIGHT
variable for rectangular data input. It does not require extra memory.
Examples
Example 1
Canonical Correlations (Simple Model)
This example shows a simple canonical correlation model. The data in the ANXIETY
file are a correlation matrix based on 48 cases; the model relates a set of two variables,
S(1) and S(2), to a set of three variables, P(1) through P(3). Because the input is a
correlation matrix rather than raw data, the number of cases is given with N on the
ESTIMATE command. The input is:
USE ANXIETY
SETCOR
MODEL S(1)..S(2) = P(1)..P(3)
ESTIMATE / N=48
The resulting output is:
Variables in the SYSTAT Correlation file are:
S(1..2) P(1..3)

Whole set correlation analysis (Y VS. X)

Number of cases on which analysis is based: 48.
RAO F = 3.675 df = 6.0, 86.0 Prob= 0.003
R-Square = 0.367 Shrunk R-Square = 0.275
T-Square = 0.202 Shrunk T-Square = 0.090
P-Square = 0.202 Shrunk P-Square = 0.090

Within basic set y correlations

S(1) S(2)

S(1) 1.000
S(2) 0.303 1.000

Within basic set x correlations

P(1) P(2) P(3)

P(1) 1.000
P(2) 0.304 1.000
P(3) 0.403 0.589 1.000

Between basic y (col) and basic x (row) correlations

S(1) S(2)

P(1) 0.391 0.106
P(2) 0.298 -0.028
P(3) 0.197 0.321

Estimated (from x-set) y intercorrelations (R-square on diagonal)

S(1) S(2)

S(1) 0.193
S(2) 0.002 0.175

Significance tests for prediction of each basic y variable

Variable F-statistic Probability
S(1) 3.505 0.023
S(2) 3.115 0.036

Betas predicting basic y (col) from basic x (row) variables

S(1) S(2)

P(1) 0.353 -0.001
P(2) 0.243 -0.332
P(3) -0.088 0.517

Standard errors of betas

S(1) S(2)

P(1) 0.149 0.150
P(2) 0.168 0.170
P(3) 0.175 0.177
T-statistics for betas

S(1) S(2)

P(1) 2.374 -0.010
P(2) 1.442 -1.953
P(3) -0.503 2.921

Probabilities for betas

S(1) S(2)

P(1) 0.022 0.992
P(2) 0.156 0.057
P(3) 0.618 0.005

Stewart-Love canonical redundancy index = 0.184

Canonical correlations

1 2

0.511 0.377

Bartlett test of residual correlations

Correlations 1 through 2
Chi-square statistic = 20.087 df = 6 prob= 0.003

Correlations 2 through 2
Chi-square statistic = 6.754 df = 2 prob= 0.034

Canonical coefficients for dependent (y) set

1 2

S(1) -0.893 0.551
S(2) 0.796 0.684

Canonical loadings (y variable by factor correlations)

1 2

S(1) -0.652 0.759
S(2) 0.525 0.851

Canonical redundancies for dependent set

1 2

0.092 0.092

Canonical coefficients for independent (x) set

1 2

P(1) -0.618 0.513
P(2) -0.941 -0.247
P(3) 0.959 0.809

Canonical loadings (x variable by factor correlations)

1 2


P(1) -0.518 0.764
P(2) -0.564 0.385
P(3) 0.156 0.870

Canonical redundancies for independent set

1 2

0.053 0.071
Because this is a whole association, the basics are the original unpartialed variables.
Note that there is considerable shrinkage. The overall association (R-square = 0.367) is
nontrivial and significant (PROB = 0.003).

The individual regression analyses provide detail: p(1..3) are significantly related to
s(1) (multiple R-square = 0.193), with p(1) yielding a significant beta. They are also
significantly related to s(2) (multiple R-square = 0.175), with p(3)'s beta significant.
Example 2
Partial Set Correlation Model

This example shows a partial set correlation model. In a large-scale longitudinal study
of childhood and adolescent mental health (Cohen and Brook, 1987), data were
obtained on personal qualities that the subjects admired and what they thought other
children admired, as well as the sex and age of the subjects. The admired qualities were
organized into scales for antisocial, materialistic, and conventional values for the self
and as ascribed to others. In one phase of the investigation, the researchers wanted to
study the relationship between the sets of self versus others. However, several of these
scales exhibited sex differences, were nonlinearly (specifically quadratically) related
to age, and/or were differently related to age for the sexes. For the self-other
association to be assessed free of the confounding influence of age, sex, and their
interactions, it was desirable to partial those effects from the association. Accordingly,
using SYSTAT BASIC, the product of SEX and AGE and the squares of AGE and of
that product (SEXAGE, AGESQ, and SEXAGESQ) were created, which, together
with AGE and SEX, constituted both the YPARTIAL and
XPARTIAL sets in the partial association. The resulting rectangular data file,
ADMIRE.SYD, was analyzed as follows:
USE ADMIRE
SETCOR
MODEL ANTISO_O,MATER_O,CONVEN_O | ,
 AGE,SEX,AGESQ,SEXAGE,SEXAGESQ = ,
 ANTISO_S,MATER_S,CONVEN_S | ,
 AGE,SEX,AGESQ,SEXAGE,SEXAGESQ
ESTIMATE

The resulting output is:
Variables in the SYSTAT Rectangular file are:
ID$ ANTISO_S MATER_S CONVEN_S ANTISO_O MATER_O
CONVEN_O AGE SEX AGESQ SEXAGE SEXAGESQ

Partial set correlation analysis (Y|YPAR VS. X|XPAR, WHERE YPAR=XPAR)

Number of cases on which analysis is based: 755.

Dependent set y partialled by these variables:
AGE
SEX
AGESQ
SEXAGE
SEXAGESQ

Independent set x partialled by these variables:
AGE
SEX
AGESQ
SEXAGE
SEXAGESQ
RAO F = 52.169 df = 9.0, 1810.9 Prob= 0.000
R-Square = 0.429 Shrunk R-Square = 0.422
T-Square = 0.169 Shrunk T-Square = 0.159
P-Square = 0.169 Shrunk P-Square = 0.159

Within basic set y correlations

ANTISO_O MATER_O CONVEN_O

ANTISO_O 1.000
MATER_O 0.200 1.000
CONVEN_O -0.417 0.105 1.000

Within basic set x correlations

ANTISO_S MATER_S CONVEN_S

ANTISO_S 1.000
MATER_S 0.206 1.000
CONVEN_S -0.258 0.063 1.000

Between basic y (col) and basic x (row) correlations

ANTISO_O MATER_O CONVEN_O

ANTISO_S 0.393 0.077 -0.066
MATER_S 0.133 0.456 0.046
CONVEN_S -0.111 0.120 0.351
Estimated (from x-set) y intercorrelations (R-square on diagonal)

ANTISO_O MATER_O CONVEN_O

ANTISO_O 0.157
MATER_O 0.052 0.216
CONVEN_O -0.028 0.053 0.124

Significance tests for prediction of each basic y variable

Variable F-statistic Probability
ANTISO_O 46.436 0.000
MATER_O 68.673 0.000
CONVEN_O 35.358 0.000

Betas predicting basic y (col) from basic x (row) variables

ANTISO_O MATER_O CONVEN_O

ANTISO_S 0.377 0.009 0.022
MATER_S 0.056 0.448 0.018
CONVEN_S -0.017 0.094 0.356

Standard errors of betas

ANTISO_O MATER_O CONVEN_O

ANTISO_S 0.036 0.034 0.036
MATER_S 0.035 0.033 0.035
CONVEN_S 0.035 0.034 0.036

T-statistics for betas

ANTISO_O MATER_O CONVEN_O

ANTISO_S 10.543 0.249 0.616
MATER_S 1.611 13.430 0.520
CONVEN_S -0.486 2.783 9.965

Probabilities for betas

ANTISO_O MATER_O CONVEN_O

ANTISO_S 0.000 0.803 0.538
MATER_S 0.108 0.000 0.603
CONVEN_S 0.627 0.006 0.000

Stewart-Love canonical redundancy index = 0.166

Canonical correlations

1 2 3

0.487 0.401 0.329

Bartlett test of residual correlations

Correlations 1 through 3
Chi-square statistic = 418.286 df = 9 prob= 0.000

Correlations 2 through 3
Chi-square statistic = 216.191 df = 4 prob= 0.000

Correlations 3 through 3
Chi-square statistic = 85.145 df = 1 prob= 0.000
The partial association is substantial (0.429), significant, and because of the large n and
small x and y sets, hardly affected by shrinkage. The within and between basic x and y
set correlation coefficients are all partial correlation coefficients because the basic x
and y sets are respectively X|XPARTIAL and Y|YPARTIAL with XPARTIAL=YPARTIAL,
and it is for these partialed variables that the multiple-regression output (betas, multiple
R squares, etc.) is given.
For example, the significant beta = 0.377 for ANTISO_S in estimating ANTISO_O
is for both variables with AGE, SEX, AGESQ, SEXAGE, and SEXAGESQ partialed,
and ANTISO_S is further partialed by MATER_S and CONVEN_S. Note that each _O
variable has significant betas with its paired _S variable. In addition, MATER_O's beta
for CONVEN_S is significant. All the partialed y variables have significant multiple R-
squares with the partialed x set, that for MATER_O being the largest.
Canonical coefficients for dependent (y) set

1 2 3

ANTISO_O 0.462 1.039 -0.114
MATER_O 0.733 -0.671 -0.320
CONVEN_O 0.471 0.448 0.919

Canonical loadings (y variable by factor correlations)

1 2 3

ANTISO_O 0.412 0.718 -0.561
MATER_O 0.875 -0.416 -0.246
CONVEN_O 0.356 -0.056 0.933

Canonical redundancies for dependent set

1 2 3

0.084 0.037 0.045

Canonical coefficients for independent (x) set

1 2 3

ANTISO_S 0.392 0.986 -0.077
MATER_S 0.745 -0.585 -0.404
CONVEN_S 0.470 0.196 0.910

Canonical loadings (x variable by factor correlations)

1 2 3

ANTISO_S 0.425 0.815 -0.395
MATER_S 0.856 -0.369 -0.362
CONVEN_S 0.416 -0.095 0.904

Canonical redundancies for independent set

1 2 3

0.086 0.043 0.040
Example 3
Contingency Table Analysis
From the perspective of set correlations, a two-way contingency table displays the
association between two nominal scales, each represented by a suitably coded set of
variables. A nominal scale of n levels (categories) is coded as n - 1 variables, and when
each is partialed by the other n - 2 variables, it carries a specific contrast or comparison,
its nature depending on the type of coding employed. The major types of coding
(dummy, effects, and contrast) are described in Chapter 5 of Cohen and Cohen (1983);
their use in contingency table analysis is illustrated in Cohen (1982).
Zwick and Cramer (1986) compared the application of various multivariate
methods in the analysis of contingency tables using a fictitious example from
Marascuilo and Levin (1983), and Cohen (1988a) provides a complete set correlation
analysis of this example. It is of responses by 500 men to the question "Does a woman
have the right to decide whether an unwanted birth can be terminated during the first
three months of pregnancy?" The response alternatives were crosstabulated with
religion, resulting in the following table of frequencies:
            Protestant  Catholic  Jewish  Other  Total
Yes              76        115       41     77    309
No               64         82        8     12    166
No opinion       11          6        2      6     25
Total           151        203       51     95    500
Religion and response are represented by ordinal numbers in the data file
SURVEY3.SYD. Religion is effects-coded as E(1), E(2), and E(3). When from each of
these the other two are partialed, the resulting variable compares that group with the
unweighted combination of all four groups; that is, it estimates that group's effect.
Notice how we use the FREQ command to determine the cell frequencies:
USE SURVEY3
CATEGORY RELIGION$ RESPONSE$
FREQ=COUNT
SETCOR
MODEL RESPONSE$=RELIGION$
ESTIMATE
The resulting output is:
Variables in the SYSTAT Rectangular file are:
RELIGION$ RESPONSE$ COUNT

Case frequencies determined by value of variable COUNT.

Categorical values encountered during processing are:
RELIGION$ (4 levels)
Catholic, Jewish, Other, Protestant
RESPONSE$ (3 levels)
No, No_opinion, Yes

Whole set correlation analysis (Y VS. X)

Number of cases on which analysis is based: 640.
RAO F = 50.311 df = 6.0, 1270.0 Prob= 0.000
R-Square = 0.347 Shrunk R-Square = 0.341
T-Square = 0.182 Shrunk T-Square = 0.174
P-Square = 0.182 Shrunk P-Square = 0.174

Within basic set y correlations

RESPONSE$ 1 RESPONSE$ 2

RESPONSE$ 1 1.000
RESPONSE$ 2 -0.569 1.000

Within basic set x correlations

RELIGION$ 1 RELIGION$ 2 RELIGION$ 3

RELIGION$ 1 1.000
RELIGION$ 2 -0.123 1.000
RELIGION$ 3 -0.269 -0.381 1.000

Between basic y (col) and basic x (row) correlations

RESPONSE$ 1 RESPONSE$ 2

RELIGION$ 1 -0.147 0.189
RELIGION$ 2 -0.186 0.274
RELIGION$ 3 0.545 -0.405

Estimated (from x-set) y intercorrelations (R-square on diagonal)

RESPONSE$ 1 RESPONSE$ 2

RESPONSE$ 1 0.298
RESPONSE$ 2 -0.217 0.195

Significance tests for prediction of each basic y variable

Variable F-statistic Probability
RESPONSE$ 89.841 0.000

Betas predicting basic y (col) from basic x (row) variables

RESPONSE$ 1 RESPONSE$ 2

RELIGION$ 1 0.006 0.129
RELIGION$ 2 0.027 0.174
RELIGION$ 3 0.557 -0.304

Standard errors of betas

RESPONSE$ 1 RESPONSE$ 2

RELIGION$ 1 0.036 0.019
RELIGION$ 2 0.037 0.020
RELIGION$ 3 0.038 0.020

T-statistics for betas

RESPONSE$ 1 RESPONSE$ 2

RELIGION$ 1 0.168 6.846
RELIGION$ 2 0.735 8.866
RELIGION$ 3 14.551 -15.080

Probabilities for betas

RESPONSE$ 1 RESPONSE$ 2

RELIGION$ 1 0.867 0.000
RELIGION$ 2 0.463 0.000
RELIGION$ 3 0.000 0.000

Stewart-Love canonical redundancy index = 0.246

Canonical correlations

1 2

0.558 0.228

Bartlett test of residual correlations

Correlations 1 through 2
Chi-square statistic = 271.251 df = 6 prob= 0.000

Correlations 2 through 2
Chi-square statistic = 34.099 df = 2 prob= 0.000

Canonical coefficients for dependent (y) set

1 2

RESPONSE$ 1 0.815 0.904
RESPONSE$ 2 -0.279 1.184

Canonical loadings (y variable by factor correlations)

1 2

RESPONSE$ 1 0.973 0.229
RESPONSE$ 2 -0.743 0.670

Canonical redundancies for dependent set

1 2

0.233 0.013

Canonical coefficients for independent (x) set

1 2

RELIGION$ 1 -0.056 -0.690
RELIGION$ 2 -0.047 -1.008
RELIGION$ 3 0.965 -0.626

Canonical loadings (x variable by factor correlations)

1 2

RELIGION$ 1 -0.309 -0.398
RELIGION$ 2 -0.408 -0.684
RELIGION$ 3 0.998 -0.056

Canonical redundancies for independent set

1 2

0.131 0.011
The whole association is modest (0.347) but highly significant, and provides some
Fisherian protection for the tests of specific hypotheses that follow. To determine
where this overall association is coming from, assess the association of religion with
the Yes-No contrast C(1).C(2) using y semipartial association.

To analyze the effects of the religious groups on the Yes-No contrast, we turn to the
betas for E(1..3). Since these are partial regression coefficients, each reflects a comparison
of its religious group with an equally weighted combination of the four groups on the Yes
versus No contrast. For example, the Protestant group (beta = 0.129) shows a greater
proclivity to respond No (compared to Yes), with t = 6.846, df = 495, p = 0.000. (For
dealing with the implicitly coded Other group, see Chapter 5 of Cohen and Cohen,
1983.) Further analyses of these data using contrast functions of religious group
membership and bipartial association are given in Cohen (1988a).
Computation

All the computations are in double precision.

Algorithms

Table 2 in Cohen (1982) contains errors in two of the matrix expressions for the Y
semipartial. The expression for H should read (in Cohen's notation)

$$H = C_{D.C,\,B.C}\;C_{B.C}^{-1}\;C'_{D.C,\,B.C}$$

and in $E_2$, B should be replaced with B.C. $E_1$ is correct as is. We are indebted to Charles
Lewis for this correction.
Missing Data
When a rectangular data file is used in SETCOR, the program computes a Pearson
correlation matrix on all the numeric variables in the file on a listwise basis. This means
that if a value is missing for any variable in the file, the case is dropped and n is reduced
accordingly. If the pattern of missing data makes n small, you should impute missing
values by maximum likelihood (EM) in the CORR module.
References

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method.
Multivariate Behavioral Research, 17, 301-341.
Cohen, J. (1988a). Set correlation and contingency tables. Applied Psychological
Measurement, 12, 425-434.
Cohen, J. (1988b). Statistical power analysis for the behavioral sciences, 2nd ed. Hillsdale,
N.J.: Lawrence Erlbaum Associates.
Cohen, P. and Brook, J. (1987). Family factors related to the persistence of
psychopathology in childhood and adolescence. Psychiatry, 50, 332-345.
Cohen, J. and Cohen, P. (1983). Applied multiple regression/correlation analysis for the
behavioral sciences, 2nd ed. Hillsdale, N.J.: Lawrence Erlbaum Associates.
Cohen, J. and Nee, J. C. M. (1983). CORSET, a Fortran IV program for set correlation.
Educational and Psychological Measurement, 43, 817-820.
Cohen, J. and Nee, J. C. M. (1984). Estimators for two measures of association for set
correlation. Educational and Psychological Measurement, 44, 907-917.
Cohen, J. and Nee, J. C. M. (1987). A comparison of two noncentral F approximations,
with applications to power analysis in set correlation. Multivariate Behavioral
Research, 22, 483-490.
Cohen, J. and Nee, J. C. M. (1989). Robustness of Type I error and power in set correlation
analysis of contingency tables. Multivariate Behavioral Research, 23.
Cramer, E. M. and Nicewander, W. A. (1979). Some symmetric, invariant measures of set
association. Psychometrika, 44, 43-54.
Eber, H. W. and Cohen, J. (1987). SETCORAN: A PC program to implement set
correlation as a general multivariate data-analytic method. Atlanta: Psychological
Resources.
Hotelling, H. (1935). The most predictable criterion. Journal of Educational Psychology,
26, 139-142.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321-377.
Marascuilo, L. A. and Levin, J. R. (1983). Multivariate statistics in the social sciences.
Monterey, Calif.: Brooks/Cole.
Olson, C. L. (1974). Comparative robustness of six tests in multivariate analysis of
variance. Journal of the American Statistical Association, 69, 894-908.
Pedhazur, E. J. (1982). Multiple regression in behavioral research, 2nd ed. New York:
Holt, Rinehart & Winston.
Pillai, K. C. S. (1960). Statistical tables for tests of multivariate hypotheses. Manila: The
Statistical Institute, University of the Philippines.
Rao, C. R. (1973). Linear statistical inference and its applications, 2nd ed. New York: John
Wiley & Sons, Inc.
Stewart, D. and Love, W. (1968). A general canonical correlation index. Psychological
Bulletin, 70, 160-163.
van den Burg, W. and Lewis, C. (1988). Some properties of two measures of multivariate
association. Psychometrika, 53, 109-122.
Wherry, R. J. (1931). The mean and second moment coefficient of the multiple correlation
coefficient in samples from a normal population. Biometrika, 22, 353-361.
Wilkinson, L. (1988). SYSTAT: The system for statistics. Evanston, Ill.: SYSTAT, Inc.
Wilks, S. S. (1932). Certain generalizations in the analysis of variance. Biometrika, 24,
471-494.
Zwick, R. and Cramer, E. M. (1986). A multivariate perspective on the analysis of
categorical data. Applied Psychological Measurement, 10, 141-145.
Chapter 27
Signal Detection Analysis
Herb Stenson
The SIGNAL module provides analyses of data that are appropriate for the theory of
signal detection as described by Green and Swets (1966), Egan (1975), and many
others. For some interesting recent applications, see Swets and Pickett (1982), Swets
(1986), and Kraemer (1988).
The response data to be analyzed by SIGNAL can consist of from 2 to 11 response
categories. Thus, either binary or rating-scale data can be analyzed. An iterative
technique is used in order to produce maximum likelihood estimates of all model
parameters, including the locations of the category boundaries. Graphical displays of
ROC curves are available in addition to the numerical output.
SIGNAL allows analyses based on a number of statistical models in addition to the
more usual normal distribution and nonparametric models. The additional models are
the logistic, negative exponential, chi-square, Poisson, and gamma distribution
models. These models are useful for various types of detection tasks in which the sets
of assumptions concerning the nature of the detector dictate one of these models. For
a discussion of these alternative models, see Egan (1975).
The parameter estimates from any model can be saved into a SYSTAT file, as can
the coordinates of any ROC curve.
Statistical Background
The theory of signal detectability (TSD) emerged after World War II as a synthesis of
existing methods for representing the characteristics of a receiver or sensing device
(Peterson, Birdsall, and Fox, 1954). Although its origins were in electrical
engineering, the abstraction of the theory made it especially suited to analysis of
human perception in general and of any system involving detection of a weak signal
against a background of noise: perception of visual and auditory signals, medical
diagnosis based on signs or symptoms, robotic perception, and so on. TSD is now
widely used in medical research for evaluating the sensitivity and specificity of
diagnostic equipment and clinicians. In radiology, for example, TSD can be used to
quantify the performance of radiologists reading diagnostic X-rays when the signal
(true diagnosis) is known from subsequent events or external criteria. See Hanley and
McNeil (1982) for an example.
Detection Parameters
The signal detection theory literature abounds with various indices of the detectability
of a signal and associated parameters. Because of this proliferation and the confusion
that it can cause, the indices and parameters that are estimated by SIGNAL are described
here in some detail. You can read more about these indices in books by Swets and
Pickett (1982), Egan (1975), and Green and Swets (1966). Coombs, Dawes, and
Tversky (1970) provide the best summary.
Except for the NPAR model, the output printed by SIGNAL contains a standard set of
parameters and indices of detectability for all of the models. The NPAR model is
nonparametric, so there are no parameters to estimate. The only index of detection that
is given for it is the area under the ROC curve obtained by joining the points on the
ROC graph by straight lines. See Bamber (1975) for ways to test hypotheses about this
area measure.
For every model involving a statistical distribution, SIGNAL prints the mean and
standard deviation of the noise (N) distribution and the mean and standard deviation of
the signal+noise (S + N) distribution. For compactness, let us call these MN, SN, MS,
and SS, respectively. For the normal distribution model, MN will always be 0 and SN
will always be unity because these two parameters are chosen as the origin and unit of
the scale for the decision axis. They are a part of the standard output because they do
not take on these fixed values for all of the models.

Using these means and standard deviations, SIGNAL computes and prints three
measures of the separation of the S + N and N distributions for each of the models.
These measures are labeled as D-Prime, D Sub-A, and Sakitt D in the output.

D-Prime (d') is the most common index of detectability used in detection research.
It is defined as (MS - MN)/SN. As has been pointed out by many authors, it suffers
from the lack of information about the standard deviation of the S + N distribution
when this information is available. The other two measures computed by SIGNAL take
this information into account.

D Sub-A (d_a) uses as a denominator the square root of the mean of the N and S + N
variances. Let us call this number x. Thus, you would square both SN and SS, add these
squares, divide by two, and take the square root of the result in order to find x. Then
the index D Sub-A is defined as (MS - MN)/x. This index is related to the area under
the ROC curve for normal distributions and has other statistical niceties. See Simpson
and Fitter (1973) for further discussion.

Sakitt D is another measure of detectability that takes into account the variances of
both the N and S + N distributions. It was proposed by Sakitt (1973). It uses as a
denominator the square root of the product of SN and SS. Thus, this index is defined as
(MS - MN)/sqrt(SN x SS). Egan (1975) proposes this as the best detection index for
chi-square, gamma, and Poisson models.

In addition to these measures of the separation of the S + N and N distributions,
SIGNAL also prints in the output for each model the ratio of SS to SN. It is labeled SD-
Ratio, which stands for standard deviation ratio.

The most general measure of detection available is the area under the theoretical
ROC curve that is fitted to your data. This measure is computed by SIGNAL and is
labeled ROC Area in the output for each model. The remainder of the output is
discussed in sections that follow, where each model is described.
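To make the arithmetic concrete, the following sketch (plain Python, not SIGNAL itself) computes the three separation indices and the SD-Ratio from the four printed quantities:

import math

def detection_indices(mn, sn, ms, ss):
    d_prime = (ms - mn) / sn                      # D-Prime
    x = math.sqrt((sn ** 2 + ss ** 2) / 2.0)      # root mean of the two variances
    d_sub_a = (ms - mn) / x                       # D Sub-A
    sakitt_d = (ms - mn) / math.sqrt(sn * ss)     # Sakitt D
    sd_ratio = ss / sn                            # SD-Ratio
    return d_prime, d_sub_a, sakitt_d, sd_ratio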
Signal Detection Analysis in SYSTAT
Signal Detection Analysis Main Dialog Box
To open the Signal Detection dialog box, from the menus choose:
Statistics
Signal Detection...
Signal detection models are computed with a model specification and estimation stage.
Stimulus. Select the variable that shows the true state of the signal on each trial for
signal-detection data. The stimulus variable can only contain the numbers 0 (for noise
occurrences) and 1 (for signal+noise occurrences) to indicate the stimulus state on each trial.
The stimulus variable remains in effect until you change it or use a different data file.
Response(s). Select the variable(s) that contain the response(s) to the stimulus by one
or more detectors. The response variable can contain numbers only between -10 and
+10, and there can be only 11 categories of response (for example, 0 through 10 or -5
through +5). If the input data contain decimals, they will be truncated instead of
rounded. The response variable remains in effect until you change it or use a different
data file.
Model type. With the exception of the nonparametric model, each model assumes that
the trials on which noise alone occurred (N) and the trials on which both signal and
noise occurred (S + N) are samples from a particular kind of statistical distribution,
with possibly different parameters for the N and S + N distributions. Possible models
include:
n Chi Square. Select to use a chi-square model. You can think of the chi-square model
as a generalization of the exponential model. To enter a fixed value for the degrees
of freedom, select Fix value and enter a positive number in the Degrees of Freedom
box. Alternatively you can specify a starting value for the degrees of freedom and
signal analysis will attempt to find the best-fitting value for the degrees of freedom
during iteration. To specify a starting value for degrees of freedom, select Estimate
and enter a positive number in the Degrees of Freedom box.
n Exponential. Select to use a negative exponential model. The negative exponential
density function is algebraically identical to a chi-square density function with two
degrees of freedom. However, its simple properties and usefulness justify treating
it as a separate model.
n Gamma. Select to use a gamma model. Use the gamma model when your
experiment can be described as a Poisson process and the detector uses the time
required to accumulate a fixed number of rare events as the basis of the response.
The length of a trial is determined by the detector as opposed to a Poisson counting
observer, who counts events during a fixed-interval trial. You can specify both the
M0 parameter and the number of events (R) that the detector is waiting to
accumulate. If you do not enter a fixed value for R, the default starting value (5)
will be used and the program will estimate a value of R at every iteration.
n Logistic. Select to use a logistic model that models the noise or signal+noise
distributions. The logistic distribution is a good approximation for the normal, and
it is mathematically more tractable.
n Normal. Select to indicate that the noise (N) and signal+noise (S + N) distributions
are Gaussian.
n Npar. Select to use a nonparametric model, which is a simple way to get a quick
look at your data. If you believe that the assumptions of any of the parametric
models that you can use are not justifiable for your data, then use the nonparametric
model.
n Poisson. Select to use a Poisson model when the detector is basing a response on a
small number of countable, rare events that occur on each trial. For a Poisson
model, you can specify the mean of the noise distribution. The mean of the Poisson
distribution is the average number of occurrences per trial of the event being
counted. The mean for the signal+noise distribution will be estimated to give the
best fit to your data.
Scaling constant. For both the logistic and exponential models, the default scaling
constant is $\pi/\sqrt{3}$, which is approximately 1.814. With the default value in effect, the
standard deviation of the noise distribution will be 1.00. You can set the scaling
constant to be any positive number.
Iterations. This option controls the maximum number of iterations that you want to
allow the program to perform in order to estimate the parameters. The default value is
50, which for most applications is more than enough. However, if you have a lot of
response categories, a small value of CONVERGE, and difficult data, you may need
more than this for some models.
Converge. The CONVERGE option controls the degree of accuracy sought in the
estimations. Its default value is 0.001, which means that the estimates are to be
accurate to 0.001 times their values. You can set CONVERGE to any number from 0.1
to 0.00001.
Using Commands
After selecting the data with USE filename, continue with:
Type must be one of the following:
For a single detector, the response variable list contains a single variable. Multiple
detectors (for example, judges) of a single signal can be fit in a single model.
When analyzing your data, SIGNAL computes initial estimates (starting values) for
each parameter that is to be estimated by the iterative process. Typing ESTIMATE again
after an analysis has finished will cause SIGNAL to use the most recent estimates of
parameters as starting values for continuing the analysis, rather than computing new
ones. You can, if you want, change the options for ESTIMATE when restarting the
program in this way. However, this restart procedure will not work if you specify a new
MODEL or use FREQ after an analysis terminates. It also will not work if you are using
the BY command. You can, however, use any of the other commands (except USE)
before restarting the program.
The values of CONVERGE and ITER always revert to their default values for each
use of ESTIMATE. Therefore, they must be stated explicitly each time that you use
ESTIMATE if you don't want to use the default settings. The value that you use for
CONVERGE is irrelevant, and therefore unnecessary, if you specify ITER=0. Because
no iterations will occur, there is no accuracy of estimation to worry about. Similarly,
since no iterations are used for the NPAR model, both ITER and CONVERGE are
inappropriate options for this model.
The capability to restart the program using the most recent values, combined with
the options for ESTIMATE, enables you to be flexible in the way that you approach the
analysis of your data. You could, for example, set CONVERGE to a large value, such
as 0.1, and set ITER to a small value, such as 4. This will cause the iterative process to
proceed rather quickly. After a look at the output, you could restart the program by
typing ESTIMATE again with a smaller value for CONVERGE, and perhaps with a
different number for ITER.
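A sketch of that strategy in command form, with placeholder file and variable names (whether the model keyword must be repeated on the second ESTIMATE is an assumption; the text above only requires that CONVERGE and ITERATIONS be restated):

USE filename
SIGNAL
MODEL response = stimulus
ESTIMATE / NORMAL CONVERGE=0.1 ITERATIONS=4
ESTIMATE / NORMAL CONVERGE=0.001 ITERATIONS=50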
Usage Considerations
Types of data. The format of input data for SIGNAL is quite flexible in order to easily
accommodate data from a variety of experimental designs commonly used in signal
detection studies. The program requires a SYSTAT data set containing a minimum of
two numeric variables: One that shows the true state of the signal on each trial of your
experiment, and the other that shows the response of a detector to that signal state.
Thus, the cases in your SYSTAT file represent trials (instances of the signal or lack
thereof) in a detection experiment.
If you have more than one detector responding to exactly the same sequence of
stimuli, responses from the additional detectors can also be coded as variables in the
SYSTAT data file. In this case, there should be only one variable that indicates the true
state of the signal on each trial. You could also have more than one variable that
designates the true state of the signal on each trial. For example, if each detector was
exposed to a different sequence of signal states, you could have a separate variable that
indicates the true state of the signal for each detector.
The example below shows how to enter data for a hypothetical experiment in which
each of three detectors (HS, LB, and LW) responded on each of five trials to exactly
the same sequence of signal states. (You would, of course, have many more trials than
this for a real experiment.) Imagine that a response was to be one of the numbers -2,
-1, 0, 1, or 2, with a -2 indicating that the detector was sure that no signal was present
on a trial, a 2 indicating that the detector was sure that a signal was present on a trial,
and the other numbers indicating degrees of certainty as to whether a signal was
present or not. The true state of the signal (present or not present) is coded as the
variable labeled STATE.
BASIC
SAVE MYFILE
INPUT STATE,HS,LB,LW
RUN
1 0 1 2
0 1 2 1
0 0 -1 0
1 2 1 1
0 1 0 1
Another way to encode these same data would be to create either a string or numeric
variable to identify a detector (for example, UNIT$), a variable to show the true state
of the stimulus (for example, STATE), and a third variable to indicate the response of
the detector (for example, RATING). Then, on each line of the data set, you would enter
the identifier for the detector, the state of the stimulus on the trial in question, and the
response of the detector. Such a data set would then contain as many cases as there
were trials times the number of detectors. You then could use the SELECT command
within the SIGNAL module to identify which of the detectors you want to analyze, or
you could use the BY command to analyze each detector sequentially. This would be
an easy way to enter data if each detector had been exposed to a different sequence of
signal states. However, this would not be an optimal way to enter data when each
detector was exposed to exactly the same sequence of signal states because you are
repeating the same set of numbers representing the signal states for each detector.
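As a rough sketch of this layout (the file name LONGFILE and the numeric identifier UNIT are hypothetical; detectors HS, LB, and LW are coded here as 1, 2, and 3, and only the first two trials of the MYFILE example are shown):

BASIC
SAVE LONGFILE
INPUT UNIT,STATE,RATING
RUN
1 1 0
2 1 1
3 1 2
1 0 1
2 0 2
3 0 1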
The availability of negative numbers as response options makes it possible to
encode responses from a particular kind of signal detection task that is sometimes used.
In this task, the detector (usually a human detector) is to specify first whether or not a
given trial contained a signal, and then is asked to rate his or her confidence that his or
her response is correct on, say, a five-point rating scale. A way to encode such data for
SIGNAL would be to treat all confidence ratings on trials when the subject reported the
absence of a signal as the numbers -1 through -5, and to treat the ratings on trials when
the subject reported that there was a signal present as the numbers +1 through +5. A
similar encoding strategy can be used when a detector reports the presence or absence
of a signal, and the reaction time for the response is categorized and used as a
confidence rating, with quick times indicating a high degree of confidence in the
response. You would encode the reaction times into categories acceptable to SIGNAL
for this experimental paradigm.
SIGNAL treats the response categories as ordinal data. Thus, it makes no difference
in the analysis what numbers are used, even if there are gaps in the sequence used. All
that is necessary is that the higher numbers indicate a signal-like response and the
lower ones indicate a noise-like response. For example, using the response
categories 1, 2, and 3 would result in the same analysis as using the categories -6, 0,
and +2 for the same data. Only the category labels would be affected in the program
output. Notice that gaps can occur in the response category sequence either because
certain numbers were not available to the detector as response options or because the
detector never used one of the available options. The program obviously cannot
distinguish which of these is the case.
You can specify more than one variable as a response. This allows the pooling of
responses of detectors that were exposed to the same stimulus sequence, or the pooling
of responses from one detector that was exposed to the same sequence of stimuli on
more than one occasion. Each occasion would have to be entered into the data set as a
separate variable. For example, to pool the responses from detectors HS and LW from
the data set MYFILE (above), you would type:
MODEL HS,LW = STATE
The resulting signal detection analysis would treat all responses from these two
detectors as being from the same detector. Thus, the resulting detection parameter
estimates would apply to this group of detectors instead of to one of them individually.
If you used a different coding scheme, like the one used for the data set SWETSDTA
in the first example, where each detector has an identifier code, you could pool
detectors by simply not using a SELECT or BY command when you analyzed the data.
SIGNAL would then treat all response entries as coming from the same detector. You
could also use the SELECT command with multiple identifier variables, such as sex and
age, to pool data within these subgroups. The resulting analyses would then apply to
whatever group(s) you selected to pool.
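A sketch of the per-detector alternative mentioned above (the exact placement of the BY command within the command stream is an assumption):

USE SWETSDTA
SIGNAL
BY SUBJ
MODEL RATING=SIGNAL
FREQ=COUNT
ESTIMATE

Omitting both BY and SELECT here would instead pool all detectors into a single analysis, as described above.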
Print options. The output is standard for all PRINT options.
Quick Graphs. SIGNAL plots the receiver operating characteristic (ROC) curve.
Saving files. If you save before you estimate, SIGNAL will save parameter estimates into
a file. If you add SAVE / ROC, SIGNAL will save the ROC curve.
BY groups. SIGNAL analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is not available in this procedure.
Case frequencies. If a file contains frequencies of each type of response to each of two
stimulus states, the frequencies of responses then can be used as a FREQ variable in the
SIGNAL module. This can be useful if your data are already aggregated in this way or
if you want to make up a table of hypothetical data to model some signal detection task.
Case weights. SIGNAL does not allow case weighting.
Examples
Example 1
Normal Distribution Model for Signal Detection
This example shows frequency data for two detectors (subjects) in a study by Swets,
Tanner, and Birdsall (1961) as reported by Swets and Pickett (1982, pp. 216–219).
Each of the subjects in the experiment used a six-category rating scale to indicate his
or her confidence that a signal was present on each of 597 trials when the signal was
present, and on 591 randomly-mixed trials on which the signal was not present. The
COUNT variable shows the number of times a subject gave a particular rating to a
given signal state. Notice that the identifier SUBJ is a numeric variable in this case (but
would not have to be).
By far, the most common model used for signal detection analysis is the normal
(Gaussian) model, in which the noise (N) distribution and the signal+noise (S+N)
distribution are both assumed to be Gaussian density functions, with the same variance
in the case of binary response data, or with possibly unequal variances in the case of
more than two response categories.
Here we use the data set named SWETSDTA that was described earlier. To perform
a signal detection analysis using the normal distribution model for the first subject in
the data set, we type:

USE SWETSDTA
SELECT SUBJ=1
SIGNAL
MODEL RATING=SIGNAL
FREQ=COUNT
ESTIMATE

Notice that the SELECT command is used to specify which detector to analyze.
SELECT remains in effect throughout any subsequent analyses until you change the
selection by using the SELECT command again (or cancel it completely by typing
SELECT with nothing after it).
The FREQ command is used in the same way as it is in the rest of SYSTAT. Here it
specifies the variable that shows the frequencies with which response categories were
used for the two different signal states. If you were using a data set that was coded in
a manner similar to MYFILE, you obviously would not use the FREQ command.
Following is the output produced:

Variables in the SYSTAT Rectangular file are:
SUBJ SIGNAL RATING COUNT

Case frequencies determined by value of variable COUNT.

Number of stimulus events (cases) responded to: 1188

Number of detectors (variables) observing an event: 1

Number of response categories used: 6

Number of responses to noise events: 591

Number of responses to signal events: 597

Total number of responses: 1188

Number of instances of missing data: 0

Response Frequency Joint Probability Conditional Prob. Cum.Cond.Prob.
Category Noise Signal Noise Signal Noise Signal Noise Signal
1 174 46 0.14646 0.03872 0.29442 0.07705 0.29442 0.07705
2 172 57 0.14478 0.04798 0.29103 0.09548 0.58545 0.17253
3 104 66 0.08754 0.05556 0.17597 0.11055 0.76142 0.28308
4 92 101 0.07744 0.08502 0.15567 0.16918 0.91709 0.45226
5 41 154 0.03451 0.12963 0.06937 0.25796 0.98646 0.71022
6 8 173 0.00673 0.14562 0.01354 0.28978 1.00000 1.00000

Total 591 597 0.4975 0.5025 1.0000 1.0000
Initial estimates of parameters: Gaussian model

Mean(Noise) SD(Noise) Mean(Sig+Noise) SD(Sig+Noise)
0.0 1.000 1.495 1.392
D-Prime D Sub-A Sakitt D SD-Ratio ROC Area
1.495 1.234 1.267 1.392 0.808
Upper Category Boundaries:
-0.523 0.204 0.706 1.366 2.229

Goodness of Fit: Log(Likelihood) ChiSq(3 df) Prob(ChiSq)
-1921.418606 2.364 0.500
Iterative maximum-likelihood estimation of parameters with tolerance = 0.00100
Iter -Log(Like) D-Prime SD-Ratio Category Boundaries
0 .1921419D+04 .14949D+01 .13917D+01 -.52284D+00 .20391D+00 .70596D+00
.13661D+01 .22293D+01
1 .1920995D+04 .15075D+01 .14088D+01 -.53627D+00 .20024D+00 .70545D+00
.13599D+01 .22829D+01
2 .1920986D+04 .15219D+01 .14174D+01 -.53395D+00 .20431D+00 .71050D+00
.13673D+01 .22951D+01
3 .1920985D+04 .15182D+01 .14160D+01 -.53279D+00 .20448D+00 .70969D+00
.13656D+01 .22945D+01
4 .1920985D+04 .15182D+01 .14160D+01 -.53279D+00 .20448D+00 .70969D+00
.13656D+01 .22945D+01
5 .1920985D+04 .15189D+01 .14166D+01 -.53304D+00 .20410D+00 .70956D+00
.13662D+01 .22937D+01
6 .1920985D+04 .15189D+01 .14166D+01 -.53304D+00 .20390D+00 .70956D+00
.13662D+01 .22937D+01
Final parameter estimates using upper category boundaries: Gaussian model
Mean(Noise) SD(Noise) Mean(Sig+Noise) SD(Sig+Noise)
0.0 1.000 1.519 1.417

Ctgry Upper
Label Far HR FINV(FAR) FINV(HR) Boundary Beta Log(Beta)
1 0.7056 0.9229 -0.541 -1.425 -0.533 0.285 -1.255
2 0.4146 0.8275 0.216 -0.944 0.204 0.468 -0.758
3 0.2386 0.7169 0.711 -0.574 0.710 0.771 -0.260
4 0.0829 0.5477 1.386 -0.120 1.366 1.785 0.579
5 0.0135 0.2898 2.210 0.554 2.294 8.437 2.133
D-Prime D Sub-A Sakitt D SD-Ratio ROC Area
1.519 1.239 1.276 1.417 0.809

Goodness of Fit: Log(Likelihood) ChiSq(3 df) Prob(ChiSq)
-1920.984805 1.497 0.683
EXPORT successfully completed.
The meanings of the first two sections of output are self-evident. The first is a report
on the data read, and the second is a tabulation of frequencies and probabilities (relative
frequencies) compiled from the input data. The final column of the frequency table is
Cumulative Conditional Probabilities. The false-alarm rates (FAR) and hit-rates (HR)
shown later in the output are computed by subtracting these cumulative probabilities
from unity. This results in FAR and HR being associated with the upper category
boundary of a labeled category. The report on the data read and the frequency table are
a standard part of the output for every model that can be used in SIGNAL.
The next section of output, labeled Initial estimates of parameters, contains the
detection parameters discussed earlier as well as a line of numbers labeled Upper
Category Boundaries. The latter are the standard normal deviates (z scores) that
correspond to the area of the noise distribution that is above the upper boundary of each
successive response category. Notice that there is one fewer of these than there are
categories. This section of output is referred to as initial estimates because they are
the starting estimates that the iterative procedure uses to compute maximum likelihood
estimates of these same parameters.
At the end of the output for initial estimates, there is a line labeled Goodness of Fit.
The data on this line indicate how well the initial parameter estimates account for the
empirical data in your data file. The estimated parameters and category boundaries for
the two normal distributions (N and S+N) allow the estimation of the probability that
a given response will occur given either an N trial or an S+N trial. We compute these
probabilities for each response that occurred on each trial of the experiment. The
product of all these probabilities gives us the probability (likelihood) of obtaining the
data that we in fact obtained, given that the model and its parameters are correct.
Instead of computing this product, SIGNAL finds the natural logarithm of each
probability and adds them together, rather than multiplying the probabilities
themselves. The first goodness-of-fit indicator is the sum of these logarithms for the
input data, given the estimates of the model that were made from the data. Thus, it is
labeled Log(Likelihood).
The log-likelihood is useful for certain analytic purposes, but it is not very
intuitively appealing. For this reason, SIGNAL also computes a Pearson chi-square
statistic indicating how well the model with its parameter estimates fits the input data.
The theoretical probability of each type of response, mentioned in the preceding
paragraph, allows the calculation of an expected frequency of each response for both
N and S+N trials. Differences between the actual frequencies and the expected
frequencies, based on the model, are used to calculate the chi-square statistic in the
usual way. For the normal model, it will have degrees of freedom three fewer than the
number of response categories used. This chi-square value, its degrees of freedom, and
the associated chi-square probability are shown along with the log(likelihood) as
goodness-of-fit statistics. The probability of the chi-square value will be unity if the fit
is perfect and will approach 0 for very bad fits to the data.
The next section of output is a history of the iterative estimation of the model
parameters. The value of log(likelihood) is shown for each iteration along with the
estimated values of the model parameters for the iteration. As you can see from the
output, the value of log(likelihood) decreases at each iteration until it levels off, and
the program ceases to iterate. As the value of log(likelihood) decreases, the likelihood
of having gotten the data that were obtained increases, hence the term maximum
likelihood estimation. When the program can no longer produce significant increases
in this likelihood by adjusting the parameters and is not producing parameter values
that differ much from iteration to iteration, it ceases. The letter D that appears in this
numerical output should be interpreted in the same way an E is when using scientific
notation. Using D rather than E merely signifies that double-precision arithmetic is
being used in the calculations.
As you can see in the output for the iterative estimations, SIGNAL estimates D-
Prime, the SD-Ratio and the Upper Category Boundaries on each iteration. D-Prime
could just as easily be labeled Mean(S+N), and SD-Ratio could be labeled Standard
Deviation of S+N, because we have assumed the mean and standard deviation of N to
be 0 and 1, respectively. As stated earlier, the numbers for the upper category
boundaries are standard normal deviates (z scores) relative to the N distribution.
Following the history of iteration is a table showing the final estimates of the
parameters along with some other information.
In this table of final parameter estimates, you will see a column labeled FINV(FAR)
and another labeled FINV(HR). These are the z scores corresponding to the FAR and
HR, respectively. These z scores are the inverse function of the FAR and HR values,
hence the more general label FINV. This is very useful when models other than the
normal distribution are used because then we are not necessarily dealing with standard
normal deviates. Also shown in this table are the Upper Category Boundaries, which
have already been described, and two columns labeled Beta and Log(Beta). Beta is the
ratio of the height of the normal distribution for S+N to the height of the normal
distribution for N at a given upper category boundary. Log(Beta) is the natural
logarithm of Beta.
Following the table of final estimates are the computed values for all of the
detection indices described earlier, as well as the values of the same goodness-of-fit
measures that were described for the table of initial estimates. The plot shows the usual
ROC curve for the input data. HR is plotted against FAR, and the theoretical ROC
curve that results from the final parameter estimates is shown.
When you analyze data that have fewer than four response categories using the
NORMAL model, you will notice that no iterations occur. This is because the HR and
FAR data can always be fit perfectly by an algebraic procedure for fewer than four
categories. There are not enough degrees of freedom in the data to allow any error of
estimation. Thus, for these cases, all that you will get is a table of final estimates, and
the goodness-of-fit measures will show a perfect fit.
Example 2
Nonparametric Model for Signal Detection
If you use the NPAR option of the ESTIMATE statement, you will get some very simple
output. In addition to the data report and the frequency table, you will get the HR and
FAR for each category and the area under the nonparametric ROC function. This ROC
is constructed by connecting the empirical points on the graph with straight lines. The
area referred to is the area to the right and below the function defined by these lines.
Bamber (1975) showed that this nonparametric ROC was essentially the same thing
that mathematicians call an ordinal dominance graph. He showed that the area under such
a graph is closely related to the Mann-Whitney U statistic, thus enabling hypotheses
about such an area to be tested.
The nonparametric model is a simple way to get a quick look at your data, and if you
believe that the assumptions of any of the parametric models that you can use are not
justifiable for your data, then NPAR is the model for you. The input is:

USE SWETSDTA
SELECT SUBJ=1
SIGNAL
MODEL RATING=SIGNAL
FREQ=COUNT
ESTIMATE / NPAR

Following is the output:

Variables in the SYSTAT Rectangular file are:
SUBJ SIGNAL RATING COUNT

Case frequencies determined by value of variable COUNT.

Number of stimulus events (cases) responded to: 1188

Number of detectors (variables) observing an event: 1

Number of response categories used: 6

Number of responses to noise events: 591

Number of responses to signal events: 597

Total number of responses: 1188

Number of instances of missing data: 0

856
Chapter 27
Example 3
Logistic Model for Signal Detection
This model uses the logistic distribution as the model for the N and S+N
distributions. The cumulative probability function for the logistic distribution is
1 / (1 + exp(-Y)), where Y is the random variable on which the distribution is
defined. In SIGNAL, Y is replaced by cX, where c is a scaling constant, and X is the
decision axis of the detection model. The default value of c is π/√3, which is
approximately 1.81. This has the effect of making the variance of X equal unity. Thus,
with the default in effect, the standard deviation of the N distribution will be 1.00. You
can set the value of c to be any positive number because c is an option that can be
appended to the ESTIMATE command for this model.
Response Frequency Joint Probability Conditional Prob. Cum.Cond.Prob.
Category Noise Signal Noise Signal Noise Signal Noise Signal
1 174 46 0.14646 0.03872 0.29442 0.07705 0.29442 0.07705
2 172 57 0.14478 0.04798 0.29103 0.09548 0.58545 0.17253
3 104 66 0.08754 0.05556 0.17597 0.11055 0.76142 0.28308
4 92 101 0.07744 0.08502 0.15567 0.16918 0.91709 0.45226
5 41 154 0.03451 0.12963 0.06937 0.25796 0.98646 0.71022
6 8 173 0.00673 0.14562 0.01354 0.28978 1.00000 1.00000

Total 591 597 0.4975 0.5025 1.0000 1.0000
Nonparametric analysis using upper category boundaries
False-alarm rates for successive categories:
0.706 0.415 0.239 0.083 0.014
Hit rates for successive categories:
0.923 0.827 0.717 0.548 0.290
Area under ROC = 0.803
EXPORT successfully completed.
You might very well want to use 1.7 for c because, if you do, the cumulative
probabilities for the N distribution will differ from standard normal (Gaussian)
probabilities by less than 0.01 for all values of X (with a mean of 0). The standard
deviation of X for the N distribution will not be unity, however. It will be equal to
π/(1.7√3), or about 1.07. Thus, the logistic distribution is a good approximation for
the normal, and it is mathematically more tractable.
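For example, assuming (as in the command syntax summarized under Using Commands) that the scaling constant is supplied with the C option on ESTIMATE, the value 1.7 could be requested with:

ESTIMATE / LOGISTIC,C=1.7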
The format of the output from SIGNAL for the logistic model does not differ from
that for the normal model, except that the program reports the value of c that was used,
and the variance of N is not necessarily unity, depending on this value of c. The values
of FINV in both the numerical and the Quick Graph output are computed as
FINV = ln(p / (1 - p)) / c,
where the probability p is either an HR or an FAR. As with the normal model, the
values shown for upper category boundaries are scaled on the N distribution. You can
examine the input and output by changing NPAR to LOGISTIC in the last example.
Treisman and Faulkner (1985) show the relationships between ROC curves derived
from a logistic model and aspects of the choice theory proposed by Luce (1959). Thus,
the use of this model lends theoretical elegance to signal detection theory.
Example 4
Negative Exponential Model for Signal Detection
The negative exponential density function is algebraically identical to a chi-square
density function with two degrees of freedom. However, its simple properties and
usefulness justify treating it as a separate model. A simple (somewhat limited) way to
think of the distribution is to imagine a detector that on each trial of an experiment
receives two random observations from a normal distribution that is either N or S+N
in nature. The detector is built to compute the variance, or sum of squared deviations
of the two observations, and bases its response on that computation. Thus, it tries to
distinguish S+N from N, based on the variance of each. A chi-square distribution
with two degrees of freedom is an appropriate model for such a detector.
The cumulative probability function for the negative exponential distribution is
1 - e^(-cX), where X is a random variable (the decision axis in this case) and c is a constant.
The mean and standard deviation of the distribution are both equal to 1/c. Therefore,
if we have an N distribution with the value c = c_N and an S+N distribution with
the value c = c_S, D-Prime is (c_N / c_S) - 1 and the ratio of the S+N standard
deviation to the N standard deviation is c_N / c_S. A little algebra shows that the ROC is
given by a power law: The HR is the FAR raised to the c_S/c_N power. The area
under the ROC curve is 1 / (1 + (c_S / c_N)).
The implementation of this model in SIGNAL allows the user to choose a value of c
for the N distribution. The program then finds the best-fitting value of c for the S+N
distribution given the input data. Thus, the user-supplied value of c is simply a scaling
constant for the decision axis, and there is only one parameter left for the program to
estimate from the data (in addition to the category boundaries). The default value of c
for the N distribution is unity. This is the only option for the MODEL statement when
the model is exponential.
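Assuming the same C option on ESTIMATE supplies this constant for the exponential model (an assumption based on the option list under Using Commands), a sketch might look like:

ESTIMATE / EXPONENTIAL,C=2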
The format of the output for this model differs from the format of the normal model
in only two respects. First, the values of c are given for the N and S+N distributions
in both the table of initial estimates and the table of final estimates. Second, rather than
listing D-Prime and SD-Ratio as parameters being estimated during the iterative
process, the value of the mean of S+N and the value of the mean of N are listed
instead. The mean of N is 1/c_N and, of course, remains constant during the iterations.
It is simply filling space in the table. The mean of S+N is 1/c_S, where c_S is the
constant for the S+N distribution. Thus, iteratively estimating the mean is the same
thing as iteratively estimating c_S.
The values for FINV in the numerical as well as the Quick Graph output are
computed by finding the logarithm of the probability involved and dividing it by -c_N or
-c_S, whichever is appropriate. The upper category boundaries shown in the output are
scaled using the standard deviation of the N distribution as the unit of measure and
absolute 0 as the origin. With regard to origin, notice that the linear ROC line always
starts at the (0,0) coordinate of the plot, as it must for the exponential model.
As you will notice in the output, the degrees of freedom for the chi-square goodness
of fit are equal to the number of response categories minus 2, rather than minus 3 as
for the normal and logistic distributions. This is because there is one less parameter to
estimate for the exponential model than for the other two. The input is:
USE SWETSDTA
SELECT SUBJ=1
SIGNAL
MODEL RATING=SIGNAL
FREQ=COUNT
ESTIMATE / EXPONENTIAL
The output is as follows:
Variables in the SYSTAT Rectangular file are:
SUBJ SIGNAL RATING COUNT

Case frequencies determined by value of variable COUNT.

Number of stimulus events (cases) responded to: 1188

Number of detectors (variables) observing an event: 1

Number of response categories used: 6

Number of responses to noise events: 591

Number of responses to signal events: 597

Total number of responses: 1188

Number of instances of missing data: 0

Response Frequency Joint Probability Conditional Prob. Cum.Cond.Prob.
Category Noise Signal Noise Signal Noise Signal Noise Signal
1 174 46 0.14646 0.03872 0.29442 0.07705 0.29442 0.07705
2 172 57 0.14478 0.04798 0.29103 0.09548 0.58545 0.17253
3 104 66 0.08754 0.05556 0.17597 0.11055 0.76142 0.28308
4 92 101 0.07744 0.08502 0.15567 0.16918 0.91709 0.45226
5 41 154 0.03451 0.12963 0.06937 0.25796 0.98646 0.71022
6 8 173 0.00673 0.14562 0.01354 0.28978 1.00000 1.00000

Total 591 597 0.4975 0.5025 1.0000 1.0000
Initial estimates of parameters: exponential model
Multiplicative constant for noise = 1.000
Multiplicative constant for signal+noise = 0.258

Mean(Noise) SD(Noise) Mean(Sig+Noise) SD(Sig+Noise)
1.000 1.000 3.870 3.870
D-Prime D Sub-A Sakitt D SD-Ratio ROC Area
2.870 1.015 1.459 3.870 0.795
Upper Category Boundaries:
0.346 0.871 1.424 2.480 4.333

Goodness of Fit: Log(Likelihood) ChiSq(4 df) Prob(ChiSq)
-1927.548455 15.180 0.004
Iterative maximum-likelihood estimation of parameters with
Tolerance =0.00100
Iter -Log(Like) Mean(S+N) Mean(N) Category Boundaries
0 .1927548D+04 .38702D+01 .10000D+01 .34633D+00 .87132D+00 .14240D+01
.24800D+01 .43331D+01
1 .1922809D+04 .39983D+01 .10000D+01 .34193D+00 .84605D+00 .14066D+01
.24759D+01 .48555D+01
2 .1922807D+04 .40085D+01 .10000D+01 .34293D+00 .84660D+00 .14073D+01
.24772D+01 .48693D+01
3 .1922807D+04 .40085D+01 .10000D+01 .34293D+00 .84660D+00 .14073D+01
.24772D+01 .48693D+01
4 .1922807D+04 .40088D+01 .10000D+01 .34263D+00 .84679D+00 .14077D+01
.24758D+01 .48705D+01
Final parameter estimates using upper category boundaries: Exponential model
Multiplicative constant for noise = 1.000
Multiplicative constant for signal+noise = 0.249
Mean(Noise) SD(Noise) Mean(Sig+Noise) SD(Sig+Noise)
1.000 1.000 4.009 4.009

Ctgry Upper
Label Far HR FINV(FAR) FINV(HR) Boundary Beta Log(Beta)
1 0.7056 0.9229 0.349 0.080 0.343 0.323 -1.131
2 0.4146 0.8275 0.881 0.189 0.847 0.471 -0.753
3 0.2386 0.7169 1.433 0.333 1.408 0.718 -0.332
4 0.0829 0.5477 2.490 0.602 2.476 1.600 0.470
5 0.0135 0.2898 4.302 1.239 4.871 9.651 2.267
D-Prime D Sub-A Sakitt D SD-Ratio ROC Area
3.009 1.030 1.503 4.009 0.800

Goodness of Fit: Log(Likelihood) ChiSq(4 df) Prob(ChiSq)
-1922.806648 5.579 0.233
EXPORT successfully completed.
Egan (1975) discusses the negative exponential model at some length. He points out its
relationship to the Rayleigh distribution and notes that it represents the probability
distribution of a randomly selected sinusoid in the Rice model of Gaussian noise.
Example 5
Chi-Square Model for Signal Detection
We can think of the chi-square model as a generalization of the exponential model that
was just discussed. Imagine that the hypothetical detector now receives k random
observations of N, or k random observations of S+N on a trial, and that N and S+N
are normally distributed. As before, the detector bases its response on the sums of
squared deviations for the k observations. The appropriate model for such a detector is
the (unstandardized) chi-square distribution with k degrees of freedom (or k - 1
degrees of freedom if the detector does not know and must estimate the true means of
N and S+N from the data). For SIGNAL, we assume that k is the same for both S+N
and N trials.
Let us designate a variable that represents the sum of squared deviations for k
observations from the N distribution divided by the population variance of the parent
(normal) distribution as CSN. Let CSS be the corresponding variable for the S+N
trials. The distributions of sums of squared deviations are then CSN × Var(N) and
CSS × Var(S+N), the so-called unstandardized chi-square distributions. Here, Var(N)
represents the variance of the parent noise distribution, and Var(S+N) represents the
variance of the signal+noise distribution. Some algebra shows that the sum of squared
deviations for S+N is the sum of squared deviations for N times Var(S+N)/Var(N),
which turns out to be the ratio of the standard deviations (not variances) of the
unstandardized chi-square variables for S+N and N. We must be careful here to
distinguish the variances of the parent N and S+N distributions, Var(N) and Var(S+N),
from the variance or standard deviations of the unstandardized chi-square
distributions to which they give rise. It is the latter that form the model of what the
detector is doing.
In the SIGNAL output for this model, one of the estimated parameters is called SD-
Ratio. This is the ratio of the standard deviation of the S+N unstandardized chi-square
variable to the standard deviation of the N unstandardized chi-square variable. As stated
above, this is also the ratio of Var(S+N) to Var(N). The unstandardized chi-square for
S+N is a constant times the unstandardized chi-square for N. The constant is the SD-Ratio.
The means and standard deviations that SIGNAL prints for the N and S+N
distributions are based on the assumption that Var(N) is unity. The mean for the N
unstandardized chi-square is then simply the degrees of freedom, k. The mean for the S+N
unstandardized chi-square is then SD-Ratio times k. Of course, the SD-Ratio then has unity as
its denominator so that it is just equal to the standard deviation of the unstandardized
chi-square distribution for S+N.
The other parameter that can be estimated by the program is the correct degrees of
freedom (df) to use. It is allowed to be a non-integer value. You do not have to allow
SIGNAL to try to estimate df. If you want to fix df at some value, you can use an option of
the MODEL command to do this. For example, if you want to fix df at 4, you should type:

ESTIMATE / CHISQ,DF=4

Consequently, the df will not change during the iterations. You can fix df at any positive
value. There is another option available here that will give you some flexibility about
the df. You may, for example, type:

ESTIMATE / CHISQ,START,DF=12.8
This will cause SIGNAL to use 12.8 (or any other positive number that you type) as a
starting value for the iterations. SIGNAL will (probably) move away from this starting
value during the iterations in an attempt to find the best-fitting value for df.
There is a potential problem with allowing SIGNAL to do this. The procedure for
trying to iteratively determine df seems particularly prone to the problem of local
minima for some kinds of data. This means that, at some point in the process, the
parameter estimates do not change much for a few iterations, and so the program stops
iterating as though the minimum value for the log(likelihood) has been found.
However, it is not the global minimum but a local minimum that has been found. It is
a good idea to always use the START option to start the program at a different value of
df once you think that you have found the maximum likelihood solution. There may be
a still better value for df. Better yet, if you have some idea of what df should be, either
fix df at this value or START at it. The default starting value for df is 10.
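A sketch of that advice: after an apparent solution has been found, restart from two quite different starting values and compare the final log(likelihood) values (the particular values 5 and 20 are arbitrary):

ESTIMATE / CHISQ,START,DF=5
ESTIMATE / CHISQ,START,DF=20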
If you allow the program to find the df, a degree of freedom is lost for the Pearson
goodness-of-fit statistic. This will not occur for the initial estimates because these
are based on either the default starting value or a number that you assign to df.
However, if you let the program iteratively find df, then the Pearson chi-square will show one
less degree of freedom for the final estimates.
These degrees of freedom are the number of categories minus 3 if the program
estimates df, or the number of categories minus 2 if you fix df at a value. There is an
unresolved theoretical problem here. If you use three categories of response and allow
SIGNAL to iteratively find df, there will be one degree of freedom for the initial Pearson
statistic (the starting value for df is given) and zero degrees of freedom for the final
Pearson statistic. Zero degrees of freedom would seem to imply that a perfect fit is a
necessity, but it is not, as can easily be demonstrated with an example. In this case, the
program still computes the empirical value for the Pearson statistic and finds the
probability of that value based on one degree of freedom. However, the printout will
show that there are zero degrees of freedom. There seems to be no resolution for this
problem at present. This model also is a bit slower in execution than some others
because of the complexity of finding inverse chi-square values for probabilities, and because
of the necessity to use an iterative technique to measure the area under the ROC.
As with the other models, the values for the upper category boundaries are measured
on the N distribution. The printout for this model is so similar to the others that have
been described that it will not be discussed further here. For further information on this
model and ways in which it can be used, see Egan (1975).
Example 6
Poisson Model for Signal Detection
This model is appropriate when the detector is basing a response on a count of rare
events of some kind that occur during a trial. On N trials, only a very few of the events
are liable to occur; and on S+N trials, more of the events occur, although they are still
rare. Think, for example, of a rare form of bacteria that is present to some small degree
in every person. Suppose that the presence of a certain disease is indicated by a small
increase in the count of this bacteria as seen on a microscope slide. A slide containing
a very small number of these organisms is considered to be from a normal person (a
noise trial), and a slide with more of the bacteria is considered to have come from a
diseased person (the signal+noise condition). It must be decided on the basis of small
differences in count whether a person is diseased or not (or a rating scale of the
likelihood of disease could be used). When the number of possibilities for bacteria is
large, as on the slide, but the probability of finding very many is small, the Poisson
model is appropriate.
The Poisson is a discrete distribution with probabilities defined only for the non-
negative integers. This would seem to lead to a theoretical ROC function that was
composed of discrete points on the graph, one for each integer. However, Egan (1975)
and others have argued that a guessing strategy when an ambiguous count is received by
the detector allows us to close the ROC by connecting the points with straight lines.
If the detector were mechanical or electrical, you would have to assume that when
an ambiguous count was received on a trial, the detector would sometimes act as
though the next highest integer was appropriate and sometimes act as though the next
lowest was appropriate, perhaps with unequal probabilities. This behavior allows us to
close the ROC with the aforementioned straight lines. This is the approach taken in
SIGNAL.
The decision axis is the scale of non-negative real numbers; and while the
probabilities of various counts theoretically can occur only at integer values, the
closing of the ROC implies that boundaries between response categories can occur at
any real number. Thus, the scale of the decision axis is fixed by the non-arbitrariness
of the counting numbers. The question then becomes, What two Poisson distributions
defined on these numbers best fit the response data given?
Formulas for the Poisson distribution are given in many statistics texts. The mean (λ)
of the Poisson distribution is the average number of occurrences per trial of the
event being counted. A trial can be a spatial and/or temporal entity. The variance of the
Poisson is also λ. Thus, there are two model parameters to deal with here (in addition
to category boundaries): the mean of the N distribution and the mean of the S+N
distribution.
If you fix one of these means at some value, a priori, then there is only the other
mean and the category boundaries for SIGNAL to estimate. That is exactly what one of
the ESTIMATE options allows you to do. You can specify that the mean of the N
distribution be fixed at some value. For example, if you type

ESTIMATE / POISSON,MEAN=4

the program will fix the mean of the N distribution at 4 and then estimate the value for
the S+N mean that gives the best fit to your data. The two means completely specify
the two distributions. The default value for the MEAN option is 5.
The START option used here works in a manner similar to that described for the chi-
square model. It makes the initial value of the mean of N equal to the set value. The
program would then include values of the mean of N in the iterative process, trying to
find a best-fitting value for the mean (that is, trying to find the most appropriate
Poisson distribution). The same comments that were made above in the chi-square
model regarding the Pearson fit statistic apply here as well.
Like the chi-square model, the iterative routine seems to be susceptible to a local
minimum problem for many Poisson data sets. Thus, the best strategy, if you do not
know the value of the mean for N, is to try a wide range of fixed values for it in
successive runs of the model. Then pick the fixed value that gave you the lowest value
of log(likelihood) and use it along with the START option to allow the program to
iterate near that value.
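A sketch of that strategy (the particular MEAN values are arbitrary; each ESTIMATE is a separate run of the model):

ESTIMATE / POISSON,MEAN=2
ESTIMATE / POISSON,MEAN=4
ESTIMATE / POISSON,MEAN=6
ESTIMATE / POISSON,MEAN=4,START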
By the time you have read this far, the nature of the output should be self-evident to
you. It is very much the same for all of the models discussed. The program is somewhat
slower for the Poisson because of the iterative techniques that needed to be used to find
FINV for HR and FAR and to find the area under the ROC.
Example 7
Gamma Model for Signal Detection
Suppose that in an experiment, the N and S+N trials can be described as a Poisson
process as described for the Poisson model. That is, a small number of discrete,
countable, and rare events occur on each trial. But now suppose that the detector adopts
or is programmed for the following strategy: The detector uses the time required to
accumulate a fixed number of the rare events as the basis of the response. If that fixed
number accumulates very slowly, the detector gives a noise-like response. If the
fixed number accumulates more rapidly, the detector gives a signal-like response.
Thus, the detector's response (binary or rating category) is based on time: the time it
takes to accumulate a predetermined number of discrete events. Notice that the length
of a trial is determined by the detector, as opposed to the Poisson-counting observer
described above, who counts events during a fixed-interval trial. In the former case, the
gamma distribution is an appropriate detection model.
Formulas for the gamma distribution are found in advanced statistics textbooks.
Suffice it to say here that the distribution has two parameters: m and r. Let us call the
number of events that the detector is waiting to accumulate r. The mean of the gamma
distribution is r/m, so that m times the mean is r, and m is then a scaling constant. For
a detection problem, there is only one value of r; but there are two values of m: one for
the N distribution and one for the S+N distribution. Let us call these m_0 and m_1,
respectively. The mean of the time that it takes for r (Poisson) events to accumulate if
the N process is in effect is then r/m_0, and the mean of the time that it takes for r
events to accumulate if the S+N process is in effect is r/m_1. The variance of a
gamma distribution is r divided by m², so knowing r and m defines both the mean and
variance.
Thus, in addition to category boundaries, there are three parameters for our model:
R, M0, and M1. In SIGNAL, we fix M0 at a value predetermined by the program (the
default value) or by the user via the option described below. The default value is unity.
If you want to change the value of M0, for example, to 3, you would type:

ESTIMATE / GAMMA,M0=3

The value of M1 is then estimated by the program after M0 is determined. Note that M0
is never estimated. It either keeps its default value or the value that you assign. These
values, along with R, determine the means and variances of the N and S+N
distributions. The value of R has a default value of 5. You can change it in a manner
similar to that for M0. If you want to change both M0 and R, you must list them both
in the same MODEL statement.
If you do not exercise the option to fix R, then the default starting value will be used
and the program will estimate a value of R at every iteration. You can also choose to
let this happen but pick your own starting value for R, just as in the previous two
models. For example, you could type

ESTIMATE / GAMMA,M0=3,START

to accept the default starting value of R and let R change over iterations. All of the discussion
of local minima in the previous two models applies here as well. Do not let the program
do your thinking for you.
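For example, assuming both options can be listed together as in the syntax summary under Using Commands, fixing M0 at 3 and R at 8 might look like:

ESTIMATE / GAMMA,M0=3,R=8

Adding START, by analogy with the chi-square model's START,DF usage, would treat 8 as a starting value for R rather than a fixed value.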
If you think about it, you will realize that the N and S+N distributions have to be
reversed from their usual positions on the decision axis for this model. The N trials
result in longer waiting times, and the S+N trials result in shorter waiting times. Also,
for the same reason, HR and FAR are in the lower tail instead of the upper tails of the
distributions in this case. This has all been taken care of for you with SIGNAL.
Everything that needs to be reversed has been reversed so that, for example, you do not
get negative values of D-Prime when you should not. (That doesn't mean that negative
values cannot occur.)
Again, the output is in the same approximate format as for the other models already
described. The few differences should be self-evident. It should also be evident that
this model was specifically designed to handle waiting-time data. If you use it as a
general GAMMA model, you will have to remember all of the design features mentioned
here, especially the reversal of the direction of the decision axis and the fact that HR
and FAR are computed from the lower tails of the distributions.
For more in-depth discussion of this model, you should again consult Egan (1975),
who also makes some interesting comparisons of Poisson counting detectors versus
gamma timing detectors.
Computation
All arithmetic is double precision.
Algorithms
The algorithm used to minimize negative log-likelihood is an adaptation of the Nelder-
Mead simplex method as presented by O'Neill (1985). This method does not require
derivatives, making it useful for all of the present models simultaneously. It is,
however, less time-efficient than methods that use derivatives.
The area under the ROC is not directly computable for certain of the models
(CHISQ, POISSON, and GAMMA). An algorithm was written to approximate this area
by successively dividing it into smaller and smaller trapezoids and using the
trapezoidal rule to accumulate the area. On successive iterations, the area is subdivided
into trapezoidal panels, first two, then four, then eight, etc. If the increase in
accumulated area is less than the value of CONVERGE from one iteration to the next,
the subroutine ceases and returns the most recent value of the area. If, after 512 panels
have been constructed, the stopping rule has not been met, the routine prints a warning
message, returns the most recent area estimate, and ceases.
The initial estimates for the NORMAL and LOGISTIC models are obtained by finding
the eigenvector of the FINV(HR) and FINV(FAR) vectors. The category boundaries are
located by projecting the data points onto this vector and then scaling to the units of the
N distribution. Similar methods are used for the other models, with the restriction that
the least-squares vector on which the boundaries are located must pass through the
point 0,0 on the linear ROC.
Missing Data
Missing data are treated as though the trial or trials that are missing the data did not
exist only for the particular detector missing the data.
References
Bamber, D. (1975). The area above the ordinal dominance graph and the area below the
receiver operating characteristic graph. Journal of Mathematical Psychology, 12,
387–415.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Coombs, C. H., Dawes, R. M., and Tversky, A. (1970). Mathematical psychology: An
elementary introduction. Englewood Cliffs, N.J.: Prentice-Hall, Inc.
Green, D. M. and Swets, J. A. (1966). Signal detection theory and psychophysics. New
York: John Wiley & Sons, Inc.
Hanley, J. A. and McNeil, B. J. (1982). The meaning and use of the area under a receiver
operating characteristic (ROC) curve. Radiology, 143, 29–36.
Kraemer, H. C. (1988). Assessment of 2 x 2 associations: Generalization of signal detection
methodology. The American Statistician, 42, 37–49.
Luce, D. (1959). Individual choice behavior. New York: John Wiley & Sons, Inc.
O'Neill, R. (1985). Function minimization using a simplex procedure. In Griffiths, P. and
Hill, I. D. (eds.), Applied statistics algorithms. Chichester, England: Ellis Horwood
Limited. 79–87.
Peterson, W. W., Birdsall, T. G., and Fox, W. C. (1954). The theory of signal detectability.
Institute of Radio Engineers Transactions, PGIT-4, 171–212.
Sakitt, B. (1973). Indices of discriminability. Nature, 241, 133–134.
Simpson, A. J. and Fitter, M. J. (1973). What is the best index of detectability?
Psychological Bulletin, 80, 481–488.
Swets, J. A. (1986). Indices of discrimination or diagnostic accuracy: Their ROCs and
implied models. Psychological Bulletin, 99, 110–117.
Swets, J. A. and Pickett, R. M. (1982). Evaluation of diagnostic systems. New York:
Academic Press.
Swets, J. A., Tanner, W. P., and Birdsall, T. G. (1961). Decision processes in perception.
Psychological Review, 68, 301–340.
Treisman, M. and Faulkner, A. (1985). On the choice between choice theory and signal
detection theory. Quarterly Journal of Experimental Psychology, 37A, 387–405.
Chapter 28
Spatial Statistics
Leland Wilkinson
Spatial statistics compute a variety of statistics on a 2-D or 3-D spatially oriented
data set. Variograms assist in the identification of spatial models. Kriging offers 2-D
or 3-D kriging methods for spatial prediction. Simulation realizes a spatial model
using Monte Carlo methods. Finally, a variety of point-based statistics are produced,
including areas (volumes) of Voronoi polygons, nearest-neighbor distances, counts
of polygon facets, and quadrat counts. Graphs are automatically plotted and
summary statistics are printed for many of these statistics.
The geostatistical routines in SYSTAT Spatial are based on GSLIB (Deutsch and
Journel, 1998). Point statistics are computed from a Voronoi/Delaunay partition of 2-D
or 3-D configurations.
Statistical Background
Spatial statistics involve a variety of methods for analyzing spatially distributed data.
SYSTAT Spatial covers two principal areas: fixed-point methods (kriging and
Gaussian simulation) and random-point methods (nearest-neighbor distances,
polygon area/volumes, quadrat procedures). All of these procedures can be defined
through a basic spatial model.
The Basic Spatial Model
The basic spatial model can be defined as follows. Assume s_0 ∈ R^d is a site or
point in d-dimensional Euclidean space. The random variable Z(s_0) represents a
possible observation of some quantity or quality at the site s_0. Instead of fixing s_0 at
a single site, however, we can let s vary over the index set D ⊂ R^d, so as to make Z(s)
a stochastic process or a random field:

{Z(s) : s ∈ D}
There are two principal variants of this random process. If D is a fixed subset of R^d,
then we have a geostatistical model in which points are at predetermined locations and
Z(s) is sampled at these points. Although points are taken at fixed locations, the
geostatistical model assumes s can vary continuously over R^d. On the other hand, if D
is a random subset, then we have a point process in which Z(s), or s itself, is sampled
at random locations.
Conventional statistical procedures are unsuited for either of these models for
several reasons. The major reason involves independence of observations. As with
time series (see the SERIES module in SYSTAT), spatial models normally involve
dependence among observations. We cannot assume that errors for a conventional
statistical model applied to spatial data will be independent.
In the geostatistical model, the value of Z(s_i) is usually correlated with the value of
Z(s_j). For example, if we sample groundwater level at a site, the value we find there can
be predicted, in part, by the value at a nearby site. Furthermore, the nature of this
relationship may be more complex than the one-dimensional dependencies found in a
time series. The dependency structure may vary with direction, distance, and time. We
expect nearby sites to be related. We might even find that groundwater level is related
(perhaps negatively) to the level at a distant site and this relationship might vary over
time.
In the point process model, we face a similar dependency issue even though sites
are randomly distributed. We also encounter another problem: the statistics we
construct in order to examine patterns and test hypotheses are not usually normal. The
distribution of distances between pairs of points, the counts of points in fixed areas, the
areas of clear space around points, and other spatial statistics are not normally
distributed. So even if our interest is focused only on the distribution of sites (that is,
we have no Z(s) or Z(s) = s), we cannot resort to conventional statistical procedures.
I will first introduce the approaches designed to handle these problems in
geostatistics and then summarize the basic approaches in point processes. The
examples in the following sections should further highlight these issues.
The Geostatistical Model
The classic geostatistical model involves the random variate Z(s) over a fixed field for
s defined on the real numbers. The set

{Z(s) : s ∈ D}

where D is the collection of sites being studied, is a random function of the sites.
The cumulative distribution function of Z(s) is

F(s; z) = Prob{Z(s) ≤ z}

Fitting models and making inferences about Z(s) requires us to make a global
assumption about its behavior over all members of the set D. That is, summarizing Z(s)
usually requires using information from neighboring sites, and how that information is
used depends on our global assumptions. These assumptions usually involve some
form of stationarity. In its strong form, stationarity requires that

F(s_1; z_1, s_2; z_2, ..., s_n; z_n) = F(s_1 + h; z_1, s_2 + h; z_2, ..., s_n + h; z_n)

for all s_i ∈ D, s_i + h ∈ D, h ∈ R^d, and any finite n.
Because h acts as a translation vector, this condition is also called stationarity under
translation. In geostatistical modeling, we often use a weaker form of the stationarity
assumption:

E(Z(s)) = μ          for all s ∈ D, and
cov(Z(s_1), Z(s_2)) = C(s_1 - s_2)          for all s_1, s_2 ∈ D

The parameter μ is called the stationary mean. The function C(.) is called a
covariogram. These two conditions define weak, or second-order, stationarity for Z(s).
The first implies that the mean is invariant over sites. The second implies that the
covariance of random functions Z(.) between all pairs of sites is a function of the
difference between sites. Furthermore, if C(.) is independent of direction (that is, it
depends only on the Euclidean distance between s_1 and s_2), then we call it isotropic.
Variogram
Instead of C(.), geostatisticians usually work with a different but related function. This
function is constructed from the variance of the first differences of the process:

2γ(s_1, s_2) = var(Z(s_1) - Z(s_2))

The function 2γ(h), where h is the difference over all sites, is called the variogram,
and the function γ(h) is called the semi-variogram. The classical estimator of the
variogram function is:

2γ̂(h) = (1/N_ij) Σ Σ [Z(s_i) - Z(s_j)]²   (summed over i = 1, ..., n and j < i), where h = s_i - s_j
When the data are irregularly spaced, the classical variogram estimator is computed
with a tolerance region. The following figure shows the parameters for defining this
region. The angle parameter determines the direction along which we want to compute
the variogram. The number of lags, the lag distance, and the lag tolerance determine
the maximum distance in this direction. The bandwidth determines the width of the
band covering sites to be included in the calculations. And the angle tolerance
determines the amount of tapering at the origin-end of the covering region. If this value
is greater than 90 degrees, SYSTAT creates an omni-directional variogram, in which
the full 360 degree sweep is used for computing lags. For three-dimensional spatial
fields, these parameters are extended to the depth dimension from the usual horizontal
(East) and vertical (North) dimensions.
[Figure: tolerance region for variogram computation, showing lags 1 through 7, the lag distance, lag tolerance, bandwidth, angle, and angle tolerance relative to the North and East axes]
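To make the estimator concrete, here is a minimal sketch in Python with NumPy (not SYSTAT code; the data and parameter values are hypothetical) of the classical semi-variogram estimate binned with a lag tolerance, in the spirit of the tolerance-region parameters described above.

import numpy as np

def empirical_semivariogram(coords, z, nlag, xlag, xltol):
    # Classical semi-variogram estimate, binned by lag distance with a tolerance.
    # coords: (n, 2) site coordinates; z: (n,) observed values.
    n = len(z)
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
    half_sq = 0.5 * (z[:, None] - z[None, :]) ** 2
    iu = np.triu_indices(n, k=1)                  # use each pair of sites once
    d, half_sq = d[iu], half_sq[iu]
    gamma = np.full(nlag, np.nan)
    for k in range(1, nlag + 1):
        in_lag = np.abs(d - k * xlag) <= xltol
        if in_lag.any():
            gamma[k - 1] = half_sq[in_lag].mean()   # average of half squared differences
    return np.arange(1, nlag + 1) * xlag, gamma

rng = np.random.default_rng(0)
sites = rng.uniform(0, 10, size=(200, 2))
values = np.sin(sites[:, 0]) + 0.2 * rng.standard_normal(200)
lags, gam = empirical_semivariogram(sites, values, nlag=10, xlag=1.0, xltol=0.5)
print(np.round(gam, 3))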
Variogram Models
We often want to construct a variogram model that fits our empirical variogram well.
The smooth functions we use for variogram models not only help summarize the
behavior of our process but they also give us a numerical method for fitting Z(s) by
least squares. There are several popular functions for modeling the semi-variogram.
The ones provided in SYSTAT (with scalar h = |h|) are:
n Spherical

$$\gamma(h) = \begin{cases} 0, & h = 0 \\ c_0 + c\left[1.5\left(\frac{h}{a}\right) - 0.5\left(\frac{h}{a}\right)^{3}\right], & 0 < h \le a \\ c_0 + c, & h > a \end{cases}$$

[Plot of the spherical semi-variogram, $\gamma(h)$ against h]
n Exponential

$$\gamma(h) = \begin{cases} 0, & h = 0 \\ c_0 + c\left[1 - \exp\left(-\frac{3h}{a}\right)\right], & h > 0 \end{cases}$$

[Plot of the exponential semi-variogram, $\gamma(h)$ against h]

n Gaussian

$$\gamma(h) = \begin{cases} 0, & h = 0 \\ c_0 + c\left[1 - \exp\left(-\left(\frac{3h}{a}\right)^{2}\right)\right], & h > 0 \end{cases}$$

[Plot of the Gaussian semi-variogram, $\gamma(h)$ against h]

n Power

$$\gamma(h) = \begin{cases} 0, & h = 0 \\ c_0 + c\,h^{a}, & h > 0, \quad 0 < a < 2 \end{cases}$$
n Hole, or wave

$$\gamma(h) = \begin{cases} 0, & h = 0 \\ c_0 + c\left[1 - \cos\left(\frac{h}{a}\right)\right], & h > 0 \end{cases}$$

[Plot of the hole (wave) semi-variogram, $\gamma(h)$ against h]

The hole model is sometimes parameterized equivalently with sin instead of cos. There is a variant of the hole effect model that includes a damping factor (usually exponential) for larger values of h. This model is not included in SYSTAT.
For most of these models, the parameter c is called the sill, and the parameter a is called the range. When appropriate, the sill is the maximum value of the function on the ordinate axis; in other cases, it is an asymptote. The range is the value of h for which $\gamma(h) = c_0 + c$. For other models, it is the value of h for which $\gamma(h)$ approaches $c_0 + c$. Another parameter is called the nugget effect ($c_0$). The nugget is an offset parameter measured at h near zero. It raises the height of the entire curve (except at h = 0, where $\gamma(h) = 0$). Estimating these parameters for typical geophysical data presents some difficulties, as Cressie (1991) discusses, but most smoothing methods that use a variogram model are fairly robust against minor deviations in their values.
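For a numerical illustration of these forms, here is a short Python sketch (not SYSTAT code) that evaluates the spherical and exponential semi-variogram models exactly as written above, with illustrative nugget, sill, and range values.

import numpy as np

def spherical(h, c0, c, a):
    # gamma(0) = 0; rises as 1.5(h/a) - 0.5(h/a)^3 up to the range a; then flat at c0 + c
    h = np.asarray(h, dtype=float)
    g = np.where(h <= a, c0 + c * (1.5 * h / a - 0.5 * (h / a) ** 3), c0 + c)
    return np.where(h == 0, 0.0, g)

def exponential(h, c0, c, a):
    # approaches the sill c0 + c asymptotically; reaches about 95% of it near h = a
    h = np.asarray(h, dtype=float)
    return np.where(h == 0, 0.0, c0 + c * (1.0 - np.exp(-3.0 * h / a)))

h = np.linspace(0.0, 20.0, 5)
print(np.round(spherical(h, c0=0.05, c=0.05, a=10.0), 4))
print(np.round(exponential(h, c0=0.05, c=0.05, a=10.0), 4))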
Variogram models can be combined in what Deutsch and Journel (1998) call the
nested model. This is a linear combination of submodels with separate parameter
specifications for each. SYSTAT allows up to three submodels in a specification.
Pannatier (1996) offers an interactive program for variogram modeling
(VARIOWIN) that is based on the same parameterizations as GSLIB and SYSTAT.
Both VARIOWIN and SYSTAT offer variants of semi-variograms derived from
GSLIB, such as the covariogram, correlogram, and semi-madogram. See Deutsch and
Journel (1998) for details on these methods.
Anisotropy
If the variogram for a process is not identical for all directions, then it lacks isotropy.
This anisotropy condition requires a more complex variogram model. The more basic
type of anisotropy is geometric. This condition can be modeled by weighting distance
according to direction, in a manner similar to the computation of Mahalanobis distance
in discriminant analysis (see the Discriminant procedure). That is, we compute
$\gamma(h_w)$ instead of $\gamma(h)$, where

$$h = \lVert h \rVert = \left[(s_1 - s_2)^{T}(s_1 - s_2)\right]^{1/2}$$

and

$$h_w = \lVert h_w \rVert = \left[(s_1 - s_2)^{T}\,W\,(s_1 - s_2)\right]^{1/2}$$
The weight matrix W is usually a positive definite composition of linear
transformations involving rotation and dilatation. This turns the circular isometric
locus for the isotropic model into an ellipse. SYSTAT specifies this matrix through
several parameters. The first group specifies the angles for rotation: ANG1 is a
deviation from north in a clockwise direction, ANG2 is a deviation from horizontal (for
3-D models), and ANG3 is a tilt angle. The second group specifies the shape of the
ellipse that comprises a level curve for a given distance calculation: AHMAX is the
maximum extent, AHMIN is the minimum extent, and AVERT is the 3-D (vertical)
extent. An anisotropy index is calculated from these measures:
ANIS1=AHMIN/AHMAX and, for 3-D, ANIS2=AVERT/AHMAX.
A second type of anisotropy is called zonal. This condition exists when different models apply to different directions. SYSTAT allows this type of modeling through nested variogram models. When the anisotropy parameter settings are different for each type of model in a nested structure, we have a zonal anisotropy model. See Journel and Huijbregts (1978) or Deutsch and Journel (1998) for further details.
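As a rough sketch of geometric anisotropy (Python, not SYSTAT's implementation; the angle and ratio arguments only mirror the roles of ANG1 and ANIS1), the anisotropic distance can be computed by rotating the separation vector and stretching its minor-axis component before taking the Euclidean length.

import numpy as np

def anisotropic_distance(s1, s2, ang1_deg, anis1):
    # Rotate by the azimuth (clockwise from north), then divide the minor-axis
    # component by the anisotropy ratio anis1 = ahmin/ahmax before measuring length.
    dx, dy = np.subtract(s1, s2)                    # East and North components
    theta = np.radians(ang1_deg)
    u = np.cos(theta) * dy + np.sin(theta) * dx     # component along the major axis
    v = -np.sin(theta) * dy + np.cos(theta) * dx    # component along the minor axis
    return np.sqrt(u ** 2 + (v / anis1) ** 2)

print(anisotropic_distance((0, 0), (3, 4), ang1_deg=0, anis1=1.0))   # 5.0, ordinary Euclidean
print(anisotropic_distance((0, 0), (3, 4), ang1_deg=0, anis1=0.5))   # anisotropy inflates the distance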
Simple Kriging
The most popular geostatistical prediction method is called kriging, named after a
South African mining engineer (Krige, 1966). Cressie (1990) provides a history of its
origins in a variety of fields, including meteorology and physics. The simple kriging
prediction model for Z(s) is:

$$\hat{Z}(s) = \sum_{i=1}^{n} \lambda_i Z(s_i) + \left(1 - \sum_{i=1}^{n} \lambda_i\right)\mu$$

where the $\lambda_i$ are weights and $\mu$ is the stationary mean. Kriging estimates weights that minimize the error variance over all estimated points $\hat{Z}(s)$, not necessarily measured
at the given sites. The numerical estimation procedure requires a variogram model to
be specified through the MODEL statement in SYSTAT.
Ordinary Kriging
By setting $\sum_{i=1}^{n} \lambda_i = 1$, we restrict the model and exclude the stationary mean. This
constrained model is called ordinary kriging, the default method used in SYSTAT. The
model is of the same form as that for simple kriging. Notice that, because the sum of
the kriging weights is 1, the last summation term drops out of the model.
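To show what the weights look like, here is a small Python sketch (not the GSLIB/SYSTAT routine) that solves the ordinary kriging system at one target location from a semi-variogram model; the Lagrange-multiplier row enforces that the weights sum to 1, and the model and site coordinates are purely illustrative.

import numpy as np

def ordinary_kriging_weights(coords, target, gamma):
    # coords: (n, 2) data locations; target: (2,) prediction location;
    # gamma: callable semi-variogram with gamma(0) = 0.
    n = len(coords)
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gamma(d)
    A[n, n] = 0.0                          # Lagrange-multiplier row and column
    b = np.ones(n + 1)
    b[:n] = gamma(np.sqrt(((coords - target) ** 2).sum(-1)))
    sol = np.linalg.solve(A, b)
    return sol[:n], sol[n]                 # weights and Lagrange multiplier

# exponential semi-variogram with nugget 0.05, sill contribution 0.05, range 10
gamma = lambda h: np.where(h == 0, 0.0, 0.05 + 0.05 * (1 - np.exp(-3 * h / 10)))
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
w, mult = ordinary_kriging_weights(pts, np.array([1.0, 1.0]), gamma)
print(np.round(w, 3), round(float(w.sum()), 6))    # the weights sum to 1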
Universal Kriging
We may not be able to assume that $E(Z(s)) = \mu$ for all $s \in D$, as we do in simple and ordinary kriging. Instead, we may want to assume that $E(Z(s))$ is a linear combination of known functions $\{f_0(s), \ldots, f_p(s)\}$. This allows us to model Z(s) with
trend components across the field. While these functions may be more general, it is
customary to fit polynomial components to model this global trend in the following
form:

$$Z(s) = \sum_{j=0}^{p} \beta_j f_j(s) + \delta(s)$$
The term $\delta(s)$ represents a stationary random process. SYSTAT offers linear and
quadratic function terms for this type of modeling, including interactions. The terms
are specified in the TREND command. Deutsch and Journel (1998) eschew the term
universal kriging and instead call this method kriging with a trend model. The
SMOOTH=KRIG option of the PLOT command in SYSTAT is an independent
implementation of universal kriging. This exploratory smoother does not offer the full
modeling and output capabilities of the KRIG command in SPATIAL, however.
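The trend functions are simply extra regressors built from the spatial coordinates. A minimal Python sketch (illustrative only, not SYSTAT's TREND implementation) of the full quadratic set corresponding to terms such as X, Y, XX, YY, and XY:

import numpy as np

def quadratic_trend_terms(x, y):
    # columns f0..f5: constant, x, y, x^2, y^2, x*y
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.column_stack([np.ones_like(x), x, y, x ** 2, y ** 2, x * y])

print(quadratic_trend_terms([0.0, 1.0, 2.0], [2.0, 3.0, 4.0]))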
Simulation
Stochastic simulation offers the opportunity to create a realization of a spatial process
to view the implications of a particular model or to estimate standard errors through
Monte Carlo methods. Gaussian simulation generates the realization $\{z(s): s \in D\}$ from the multivariate Gaussian $N(\mu, \Sigma)$, where the parameters for the
centroid vector and covariance matrix are taken from the stationary means and
covariances over the field.
SYSTAT implements the LUSIM algorithm from GSLIB (Deutsch and Journel,
1998). This algorithm requires the number of grid points and number of data points to
be relatively small. It is designed to be most suited for a large number of realizations
at a small number of nodes. SYSTAT executes one realization per use of the command,
however, so the simulation is less useful for this purpose. To compensate, the memory
requirements have been increased so that somewhat larger problems can be handled.
See Deutsch and Journel (1998) and Haining (1990) for further details.
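The idea behind LU simulation can be sketched in a few lines of Python (this is not the GSLIB LUSIM code; the grid, covariogram, and parameter values are illustrative): build the covariance matrix among the grid nodes, take its Cholesky (LU) factor, and multiply it into a vector of standard normal draws.

import numpy as np

rng = np.random.default_rng(1)

# a small grid of nodes (LU simulation is meant for modest grid sizes)
gx, gy = np.meshgrid(np.linspace(0, 10, 8), np.linspace(0, 10, 8))
nodes = np.column_stack([gx.ravel(), gy.ravel()])

# exponential covariogram C(h) = c * exp(-3h/a) with stationary mean mu
c, a, mu = 0.05, 10.0, 0.9
h = np.sqrt(((nodes[:, None, :] - nodes[None, :, :]) ** 2).sum(-1))
C = c * np.exp(-3.0 * h / a) + 1e-8 * np.eye(len(nodes))   # small jitter for stability

L = np.linalg.cholesky(C)                     # the "LU" (Cholesky) factor
realization = mu + L @ rng.standard_normal(len(nodes))
print(realization.reshape(gx.shape).round(2))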
Point Processes
Cressie (1991) and Upton and Fingleton (1985) cover various models and applications
that can be loosely grouped under the heading of point processes. Unlike traditional
geostatistical methods, our focus of interest in these areas is on the distribution of sites
themselves or functions of that distribution. We usually consider the location of sites
in these cases to be a random variable.
The statistical indexes of the distribution of sites in a field are numerous. Most are
based on some fundamental geometric measures. A biological example helps to
illustrate these measures. The following plot shows the location of fiddler-crab holes
in an 80 by 80 centimeter plot of the Pamet river marsh in Truro, Massachusetts
(Wilkinson, 1998).
We might ask a number of questions about these data. First of all, are the holes
randomly distributed in the plane? SYSTAT does not offer test statistics for answering
this question directly but does provide the fundamental measures needed for a variety
of such tests. The most widely used measure for spatial hypotheses of this sort is the
nearest-neighbor distance, represented by the minimum spanning tree in the following
figure (drawn with the SPAN option of the PLOT command in SYSTAT):
Upton and Fingleton (1985, Table 1.10) and Cressie (1993, Table 8.6) discuss a large
number of statistical tests that are simple functions of these distances. The density of
the nearest-neighbor distance under complete spatial randomness in two dimensions is:
$$g(d) = 2\lambda\pi d \exp\left(-\lambda\pi d^{2}\right), \quad d > 0$$
where $\lambda$ is an intensity parameter, like the Poisson $\lambda$. Here is a graph of this density:

[Plot of the nearest-neighbor distance density g(d)]
The density resembles the chi-square, with the shape evolving from normal to skewed
as a function of the intensity parameter.
SYSTAT plots a histogram of these nearest-neighbor distances after the POINT
command is issued. For our crabs, this distance has substantive meaning that can be
useful for further modeling with other variables. The nearest-neighbor distance is the
shortest distance from any crab hole to another in the sampled area. A crab is, all other
things being equal, most likely to compete with the nearest-neighbor crab for local
resources (absent remote foraging).
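A brief Python sketch (using SciPy, outside SYSTAT; the point pattern is simulated, not the crab data) of computing nearest-neighbor distances and comparing their mean with the value expected under complete spatial randomness, which is 1/(2*sqrt(lambda)) for intensity lambda:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
side = 80.0                                   # e.g., an 80-by-80 cm plot
pts = rng.uniform(0, side, size=(150, 2))     # hypothetical point locations

tree = cKDTree(pts)
d, _ = tree.query(pts, k=2)                   # k=2: the nearest point other than itself
nn = d[:, 1]

lam = len(pts) / side ** 2                    # intensity: points per unit area
print("mean nearest-neighbor distance:", round(float(nn.mean()), 3))
print("CSR expectation 1/(2*sqrt(lambda)):", round(1.0 / (2.0 * np.sqrt(lam)), 3))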
Another statistic used for tests of randomness is the Voronoi area, or volume if we
have a 3-D configuration (flying crabs?). The following plot shows the Voronoi
polygons (Dirichlet tiles) for the crab data. I used the VORONOI option of the PLOT
command to draw these:
The Voronoi polygons delimit the area around each hole (point) within which every
possible point is closer to the crab's hole than to any other. For our crabs, they might
represent the area around each hole in which a crab might wander before hitting a
neighbor who wanders with equal vigor. Upton and Fingleton (1985) discuss statistical
tests based on these areas and the wide applications based on this measure. Okabe et
al. (1992) cover Voronoi tessellations in depth.
A third statistic, also based on the Voronoi tessellation, is the count of the number
of facets in each Voronoi polygon. For our crabs, this is a measure of the number of
near neighbors each crab must contend with. It is positively correlated with the area
measure, but is nevertheless distributed differently. Upton and Fingleton (1985)
discuss applications.
A fourth measure of point intensity is the quadrat count. We simply count the
number of points found in a set of rectangles defined by a grid (the SYSTAT GRID
command). Upton and Fingleton (1985) discuss statistical tests based on this measure.
Not surprisingly, several are chi-square based, following the rationale for using chi-
square tests on binned one-dimensional variates.
Finally, Cressie (1991) and Upton and Fingleton (1985) discuss edge effects that can
influence the distribution of many of these statistics. For example, the Voronoi areas
(volumes) for points at the periphery of the configuration may be infinite or, because of
the distribution of a few neighboring points, substantial outliers. These edge points also
tend to have fewer neighbors as candidates for distance calculations. Consequently, it
is often useful to be able to identify the points that lie on the convex hull in two or three
dimensions. The following figure shows the hull for the 2-D crab data.
You may want to exclude points on the hull from analyses based on some of the above
measures. Cressie (1991) discusses other methods for eliminating edge effects,
including bordering the configuration and excluding points in the border.
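These fundamental measures can be reproduced outside SYSTAT for checking; a Python sketch with SciPy (illustrative only) computes Voronoi cell areas and facet counts for interior points and uses the convex hull to flag the edge points discussed above.

import numpy as np
from scipy.spatial import Voronoi, ConvexHull

rng = np.random.default_rng(3)
pts = rng.uniform(0, 80, size=(60, 2))         # hypothetical site locations

vor = Voronoi(pts)
hull_points = set(ConvexHull(pts).vertices)    # points on the convex hull (edge effects)

for i in range(len(pts)):
    region = vor.regions[vor.point_region[i]]
    if i in hull_points or -1 in region or len(region) == 0:
        continue                               # skip edge points and unbounded cells
    poly = vor.vertices[region]
    x, y = poly[:, 0], poly[:, 1]
    # shoelace formula for the cell area; the facet count is the number of vertices
    area = 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
    print(i, len(region), round(float(area), 2))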
Spatial Statistics in SYSTAT
Deutsch and Journel (1998) discuss various kriging models and provide the algorithms
in a package called GSLIB, on which the kriging program in SYSTAT is based. If you
do not already own this book, you should buy it before using the kriging and simulation
methods in SYSTAT. The theoretical and applied material in this book provides
essential background that necessarily exceeds the scope of a computer manual. For
other procedures in SYSTAT Spatial, you can consult other references given in this
chapter.
Spatial Statistics Main Dialog Box
To open the Spatial Statistics main dialog box, from the menus choose:
Statistics
Spatial Statistics
Specify the variables and select one of the following analyses:
n Variogram. Computes spatial dissimilarity measures over varying distances.
n Kriging. Generates predictions by minimizing the error variance over all estimated
points.
n Simulation. Uses the multivariate normal distribution to generate a realization for a
defined model. Simulation is often used to study a particular model or to estimate
standard errors.
n Point statistics. Yields areas (volumes) of Voronoi polygons, nearest-neighbor
distances, counts of polygon facets, and quadrat counts for sites by treating site
locations as a random variable.
Define options specific to each analysis using the Options button.
For Kriging and Simulation, select the graph type to display:
n Tile. Produces a plot contoured with shading fill patterns in color gradations.
n Contour. Produces a plot contoured with gradation lines.
n Surface. Produces a three-dimensional surface plot.
If the analysis involves East, North, and Depth variables, no graphs are produced.
Model Options
The Model Options dialog box offers settings for defining the variogram model. Up to
three nesting structures are allowed.
Nesting Structure. Specify the number of nested structures.
For each structure, specify the form of the model. Alternatives include:
n Spherical. Near the origin, the spherical model is linear. The tangent to the curve at
the origin reaches the sill at a distance of two-thirds of the distance at which the
curve reaches the sill.
n Exponential. Near the origin, the exponential model is also linear. The tangent to
the curve at the origin reaches the sill at a distance of one-third of the distance at
which the value of the curve reaches 95% of the sill.
n Gaussian. Near the origin, this model is parabolic.
n Power. This model does not reach a sill. For exponents between 0 and 1, the model
is concave; for values between 1 and 2, the model is convex. An exponent of 1
yields a linear variogram.
n Hole. The hole model oscillates around the sill.
In addition, specify the following:
Nugget effect. Enter the value at which the variogram intersects the vertical axis.
Specifying a nugget raises the height of the variogram.
Sill. Enter the maximum value attained by the function. For some models, the sill is the
asymptote of the function.
Rotating and dilating the orientation helps control for geometric anisotropy by
transforming the ellipse (or ellipsoid) into a circle (or sphere). All angles must be
specified in degrees.
n First rotation angle. The clockwise deviation from north.
n Second rotation angle. The deviation from horizontal.
n Third rotation angle. The tilt angle.
n Maximum range. The maximum extent. For power models, the maximum range
defines the exponent.
n Minimum range. The minimum extent.
n Vertical range. The vertical extent.
In two dimensions, the anisotropy index is the minimum extent divided by the
maximum extent. In three dimensions, a second index is the vertical extent divided by
the maximum extent.
Variogram
The Variogram dialog box provides the settings for specifying how the variogram is to
be computed. For irregularly spaced data, SYSTAT computes variogram estimates
using a tolerance region defined by lag and azimuth parameters.
The lag parameters determine the maximum distance in the direction defined by the
azimuth angle.
n Number of lags. Enter the number of lags used for calculating the spatial similarity
measure.
n Separation distance. Specify the length of each lag.
n Tolerance. Specify a length to add to the separation distance to account for data on
an irregular grid. This value is usually one-half of the separation distance (or
smaller).
The azimuth parameters define the direction and width of the region used for the
variogram.
n Angle. Defines the direction (in degrees clockwise from the North axis) along
which the variogram is computed.
n Tolerance. Specify the amount of tapering (in degrees) near the origin for the
covering region. For values exceeding 90 degrees, an omni-directional variogram
results.
n Bandwidth. Specify the width of the band covering sites. Variogram calculations
include points lying within the specified value in either direction from the vector
defined by the azimuth angle.
For three-dimensional models, three additional dip parameters extend the variogram to
include the depth dimension. (The dip angle is measured in degrees clockwise from the
East axis.)
Variograms in SYSTAT differ with respect to the spatial dissimilarity measure $\gamma(h)$
used. Select one of the following measures:
n Semivariogram. Half of the average squared difference.
n Covariance. The covariance between points.
n Correlogram. Standardized covariances.
n General. The semi-variogram divided by the squared mean for each lag.
n Pairwise. Half of the average squared normalized difference, where each pair is
normalized by their mean before squaring.
n Log. Semi-variogram of the logged values.
n Madogram. Mean absolute deviation.
Grid
The Grid dialog box offers settings for determining the size and shape of the grid used
for the kriging estimates and for quadrat counting in the point methods.
For each axis, specify:
n Minimum. The minimum value for the grid along the axis.
n Number of nodes. The number of points along the axis.
n Maximum. The maximum value along the axis.
SYSTAT uses equal spacing between consecutive nodes for each axis.
Kriging
Kriging yields estimates of the dependent variable based on nearby points, taking into
account spatial relationships.
Number of Discretization Points. Specify the number of points for each block in the
kriging analysis.
Trend. Defines polynomial trend components to add to the universal kriging analysis.
X, Y, and Z correspond to the East, North, and Depth variables, respectively. For
example, suppose the x axis (East) variable is LONG. Selecting XX adds the term
LONG*LONG to the kriging model.
Search Radius. Defines the size of the region used to compute the kriging estimates.
Search Ellipsoid. Defines the orientation of the region used to compute the kriging
estimates.
Three types of kriging are available:
n Ordinary. Constrains the sum of the kriging weights to be 1.
n Simple. Uses unconstrained weights. For simple kriging, specify the stationary
mean.
n Universal. Kriging with polynomial trend components.
Using Commands
First, specify your data with USE filename. Continue with:
There are two arguments in varlist for 2-D distributions and three for 3-D. Submodels
are expressed by using slashes up to three times and specifying the optional arguments
separately for each submodel, all in one statement.
For variograms:
SPATIAL
MODEL var = varlist / NUGGET = d,
SILL = d,
ANG1 = d,
ANG2 = d,
ANG3 = d,
AHMIN = d,
AHMAX = d,
AVERT = d,
TYPE = SPHERICAL
EXPONENTIAL
GAUSSIAN
POWER
HOLE,
/ repeat options,
/ repeat options
VARIOGRAM /NLAG = n,
XLAG = d,
XLTOL = d,
AZM = d,
ATOL = d,
BANDH = d,
DIP = d,
DTOL = d,
BANDV = d,
TYPE = SEMI
COVARIANCE
CORRELOGRAM
GENERAL
PAIRWISE
LOG
MADOGRAM
For kriging:
For universal kriging, use TYPE=ORDINARY and the TREND option. In addition,
specify the form of the trend using the TREND command:
The syntax of the GRID statement is:
The syntax of the SIMULATE and POINT statements follows:
KRIG / NXDIS = n,
NYDIS = n,
NZDIS = n,
NDMIN = d,
NDMAX = d,
RADMIN = d,
RADMAX = d,
RADVER = d,
SANG1 = d,
SANG2 = d,
SANG3 = d,
SKMEAN = d,
TREND,
TYPE = SIMPLE
ORDINARY,
GRAPH = CONTOUR
TILE
SURFACE
TREND xvar + yvar + zvar + ,
xvar*xvar + yvar*yvar + zvar*zvar +
xvar*yvar + xvar*zvar + yvar*zvar
GRID/XMIN = d,
YMIN = d,
ZMIN = d,
XMAX = d,
YMAX = d,
ZMAX = d,
NX = n,
NY = n,
NZ = n
SIMULATE/GRAPH = CONTOUR
TILE
SURFACE,
POINT varlist
Usage Considerations
Types of data. SPATIAL uses rectangular data only. The basis (spatial) variables are
expected to be measures of latitude, longitude, depth, or other spatial dimensions. The
dependent variable is expected to be symmetrically distributed or transformed to a
symmetrical distribution.
Print options. There are no print options. Output reports parameter settings. Data are
saved into files. Graphs show the distributions and fitted models.
Quick Graphs. SPATIAL produces variograms, kriging surfaces, simulations, and
nearest-neighbor histograms. You can choose the type of graph used to display the
results of the KRIG and SIMULATE commands by using the GRAPH option.
Saving files. SPATIAL saves variograms, kriging estimates, simulated values, and point
statistics.
BY groups. SPATIAL analyzes data BY groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. SPATIAL allows bootstrapping.
Case frequencies. FREQ=<variable> increases the number of cases by the FREQ
variable.
Case weights. WEIGHT is not available in SPATIAL.
Examples
The examples begin with a kriging analysis of a spatial data set and proceed to
simulation and point processes. Data in the Point Statistics example are used with the
permission of Kooijman.
Example 1
Kriging (Ordinary)
The data in this example were taken from a compilation of worldwide carbon and
nitrogen soil levels for more than 3500 scattered sites. These data were compiled by P. J.
Zinke and A. G. Stangenberger of the Department of Forestry and Resource Management
at the University of California, Berkeley. The full data set is available at the U.S. Carbon
Dioxide Information Analysis Center (CDIAC) site on the World Wide Web. For our
purposes, I have restricted the data to the continental U.S. and have averaged duplicate
measurements at single sites by analyzing BY the LAT and LON variables using the STATS
module and saving the averages.
The first step in the analysis is to examine the dependent variable (CARBON). The
sample histogram for this variable is positively skewed.
Here is the resulting histogram of the carbon levels:
We can use the Dynamic Explorer in the graphics window to transform the CARBON
variable so that it looks more normally distributed. A value near 0 in the X-Power spin-
box produces a normal-appearing histogram, suggesting that a log transformation can
approximately symmetrize these data.
USE SOIL
HIST CARBON
LET CARBON = L10(CARBON)
HIST CARBON
Here is the histogram for the log10 transformed data:
We will proceed using these log-transformed data.
The first step on the way to fitting a kriging surface is to identify a model through
the variogram. We can get preliminary guidance for choosing a variogram model by
using the default values of the VARIOGRAM command:
The MODEL statement specifies that CARBON is to be a function of LON (longitude of
the sampling site) and LAT (latitude). The VARIOGRAM statement produces an omni-
directional semi-variogram by default. The output follows:
SPATIAL
MODEL CARBON = LON LAT
VARIOGRAM
Structural Model
Nugget (c0): 0.000000
First rotation angle (azimuth, or degrees clockwise from North): 0.000000
Second rotation angle (dip, or degrees down from azimuthal): 0.000000
First anisotropy index (anis1=ahmin/ahmax): 1.000000
Sill (c): 1.000000
Range (a): 1.000000
Semivariogram
Direction: 0.000000
Number of lags: 10
Lag distance: 5.630000
Lag tolerance: 2.815000
Angular tolerance: 90.000000
Maximum horizontal bandwidth: 5.630000
The semi-variogram suggests several things. First, we need a nugget and sill value to
offset our model semi-variogram high enough to reach a maximum value somewhere
around 0.10. Second, we need to specify a range value so that the model variogram
asymptotes near a distance of 10.
By adding options to the MODEL statement, we manage to fit a theoretical
variogram to the observed results. The AHMAX and AHMIN parameters specify that the
range for both the major and the minor axes is 10 degrees. We choose a lag distance
(XLAG) of 1 degree (latitude and longitude) to base our variogram model on relatively
local detail. Finally, we again use the default angular tolerance to produce an omni-
directional semi-variogram.
The output follows:
MODEL CARBON = LON LAT /
SILL=.05,NUGGET=.05,AHMAX=10,AHMIN=10
VARIOGRAM / XLAG=1
Structural Model
Nugget (c0): 0.050000
First rotation angle (azimuth, or degrees clockwise from North): 0.000000
Second rotation angle (dip, or degrees down from azimuthal): 0.000000
First anisotropy index (anis1=ahmin/ahmax): 1.000000
Sill (c): 0.050000
Range (a): 10.000000
Semivariogram
Direction: 0.000000
Number of lags: 10
Lag distance: 1.000000
Lag tolerance: 0.500000
Angular tolerance: 90.000000
Maximum horizontal bandwidth: 1.000000
[Semivariogram plot: Coefficient versus Distance]
We can check whether we need to worry about anisotropy by checking semi-
variograms for different angles. Here are two excursions, 90 degrees separated. By
setting angular tolerance (ATOL) to 20 degrees, we keep the lagging window narrow at
its origin-end.
The output is:
VARIOGRAM / XLAG=1,AZM=45,ATOL=20
VARIOGRAM / XLAG=1,AZM=135,ATOL=20
Semivariogram
Direction: 45.000000
Number of lags: 10
Lag distance: 1.000000
Lag tolerance: 0.500000
Angular tolerance: 20.000000
Maximum horizontal bandwidth: 1.000000
Semivariogram
Direction: 135.000000
Number of lags: 10
Lag distance: 1.000000
Lag tolerance: 0.500000
Angular tolerance: 20.000000
Maximum horizontal bandwidth: 1.000000

[Semivariogram plots of Coefficient versus Distance for the 45-degree and 135-degree directions]
There is enough difference to suggest the possibility of anisotropy, but not enough to
radically alter the results. Some of this variety is likely due to the uneven scattering of
the sites. This can induce negative or fluctuating spatial correlations, as evidenced by
a wavy semi-variogram. If these are pronounced, one might consider a wave or hole-
effect semi-variogram model.
We will keep our metric spherical nevertheless. This uneven scattering is not
comforting, however; kriging normally should rest on fairly evenly distributed sampling
sites or on a regular grid. Our data are received, however, so the sites are given.
Now we can use the spherical model to fit a surface to the carbon data. We add a
GRID statement to specify the grid points where estimates are to be made. We also
SAVE the estimates into a file called KRIG. The options to the KRIG statement specify
the minimum number of data points to be included in an estimate (NDMIN), the
maximum (NDMAX), the minimum and maximum radii for searching for neighboring
sites to include in an estimate (RADMIN and RADMAX), and finally the type of graph (a
CONTOUR plot).
GRID / NX=10, XMIN=-125, XMAX=-65,
NY=10, YMIN=30, YMAX=50
SAVE KRIG
KRIG / NDMIN=2, NDMAX=20,
RADMAX=5, RADMIN=5,
GRAPH=CONTOUR
Here is the output:
Ordinary Kriging
Search radius1: 5.000000
Search radius2: 5.000000
Search angle 1: 0.000000
Search angle 2: 0.000000

Number of blocks used in estimation: 92

Average estimated value: 0.882166
Variance of estimated values: 0.023829

Our model shows the lowest soil carbon concentrations in the southwest and the highest in the north, particularly the northwest.
We can use our saved data to overlay these estimates on a map of the U.S. I have used a stereographic projection and set the axis limits to make the two graphs correspond.

BEGIN
USE USSTATES
MAP / PROJ=STEREO,AX=4,SC=2,YMIN=20,YMAX=50
USE KRIG
PLOT ESTIMATE*GRID(2)*GRID(1)/PROJ=STEREO,CONTOUR,
YMIN=20,YMAX=50,SMOO=INVS,
AX=0,SC=0,SIZE=0,ZTICK=20
END

[Contour map: the kriging estimates overlaid on a stereographic map of the continental U.S., Latitude versus Longitude]
A final note: We have fit a surface to data distributed on the continental U.S. This
represents a relatively small portion of the global sphere, so I have assumed the data to
lie on a plane. The map projection makes clear that there is some distortion to be
expected when we ignore the spherical nature of the coordinates, however. Cressie
(1991) discusses spherical kriging methods, but they are not available in SYSTAT.
Smaller areas, such as state, province, or county data, should be little cause for concern.
Example 2
Simulation
We now compute a single realization based on the model we fit in the kriging example.
USE soil
LET CARBON = L10(CARBON)
SPATIAL
MODEL CARBON = LON LAT / SILL=.05,NUGGET=.05,AHMAX=10,
AHMIN=10
SIMULATE / GRAPH=CONTOUR
The resulting output is:
The results follow the same pattern found in the kriging. Higher carbon levels occur in
the northeast and northwest. The time to compute a single simulation is greatly affected
by the number of grid nodes specified in the GRID command. Grids larger than 10 cuts
per variate, particularly for larger data sets, can increase the memory and time
requirements substantially.
Example 3
Point Statistics
The data for this example are from Kooijman (1979), reprinted in Upton and Fingleton
(1990). They consist of the locations of beadlet anemones (Actinia equina) on the
surface of a boulder at Quiberon Island, off the Brittany coast, in May, 1976. I have
added bordering histograms to the scatterplot shown in Figure 1.26 of Upton and
Fingleton. The size of the points is proportional to the measured diameter of the
anemones (D).
USE KOOIJMAN
PLOT Y*X / HEIGHT=2IN, WIDTH=3IN, SIZE=D, BORDER=HIST
The bordered histograms reveal that the distribution of the anemones is fairly uniform
in both marginal directions.
We can get an elevated view of the distribution by computing a 3-D histogram of
the anemone locations. A 3-D density kernel provides a smooth estimate of the density.
It is available in the SYSTAT graphing module:
DEN .*Y*X / HEIGHT=2IN, WIDTH=3IN, ALT=2IN, AXES=CORNER,
ZGRID
DEN .*Y*X / KERNEL, HEIGHT=2IN, WIDTH=3IN, AXES=0,
SCALES=0
With a few exceptions, the intensity of the distribution appears to be fairly uniform over the sampled area.
Now we proceed to examine the measures of spatial variation. The first graph shows
the Voronoi tessellation of this configuration. The second, the minimum spanning tree,
highlights the nearest-neighbor distances. The final graph, the convex hull, highlights
the outermost bordering points:
PLOT Y*X / VORONOI, HEIGHT=2IN, WIDTH=3IN
PLOT Y*X / SPAN, HEIGHT=2IN, WIDTH=3IN
PLOT Y*X / HULL, HEIGHT=2IN, WIDTH=3IN
[Plots of Y versus X for the anemone locations: the Voronoi tessellation, the minimum spanning tree, and the convex hull]
Now we proceed to compute the various point statistics:

SPATIAL
SAVE POINTS
POINT X Y

Here is the histogram of the nearest-neighbor distances:
[Histogram of the nearest-neighbor distances]
We can do a probability plot of these distances by merging the point measures with the
original data:
MERGE KOOIJMAN POINTS
PPLOT DISTANCE

[Probability plot: Expected Value for Normal Distribution versus DISTANCE]

They appear to be quite normally distributed. Now we examine the relation of Kooijman's measurement of the diameter of the anemones to the other spatial measures. We also construct a new variable called CROWDING by taking the inverse of VOLUME. The D variable is Kooijman's anemone diameter. It is correlated with distance, as Upton and Fingleton point out, but even more strongly related to the
inverse Voronoi area of the anemones. This relationship holds even after deleting the
four outlying values of CROWDING.
CORR
LET CROWDING = 1/VOLUME
PEARSON D..CROWDING

The output follows:

Means
       D   DISTANCE     VOLUME    FACETS   CROWDING
   4.263      7.176    954.669     5.829      0.005

Pearson correlation matrix

              D   DISTANCE     VOLUME    FACETS   CROWDING
D         1.000
DISTANCE  0.249      1.000
VOLUME    0.092      0.144      1.000
FACETS    0.085      0.285      0.082     1.000
CROWDING -0.291     -0.680     -0.342    -0.237      1.000

Number of observations: 217

Here is the SPLOM of these measures output by the CORR procedure:

[Scatterplot matrix (SPLOM) of D, DISTANCE, VOLUME, FACETS, and CROWDING]

Example 4
Unusual Distances
The transformation and programming capabilities of SYSTAT can be used to compute
statistics needed for other spatial analyses. The following example computes Euclidean
and city-block distances for the crab data and plots them against each other. The
distances are computed from a central point (px, py) in the field. The city-block
distances have particular significance for fiddler crabs in Truro because the
encroachment of recent residential and commercial development may force the crabs
to follow a rectangular traffic grid to go about their business. The input is:
USE CRABS
LET PY=40
LET PX=40
LET EUCL=SQR((Y-PY)^2+(X-PX)^2)
LET CITY=ABS(Y-PY)+ABS(X-PX)
PLOT EUCL*CITY / BORDER=HIST,XMIN=0,XMAX=70,YMIN=0,YMAX=70

[Plot of EUCL versus CITY with bordering histograms]

Several other useful distance statistics can be calculated directly from coordinate information. Distance between two points on the circumference of a circle given angle coordinates in degrees is:

LET CDIST = 2*3.14159*RADIUS*ABS(ANG1-ANG2)/360.
The great-circle global distance in statute miles between two points is:

REM DEGRAD = RADIANS PER DEGREE
REM AY = NORTH LATITUDE OF POINT A
REM AX = WEST LONGITUDE OF POINT A
REM BY = NORTH LATITUDE OF POINT B
REM BX = WEST LONGITUDE OF POINT B
REM MR = STATUTE MILES PER RADIAN
REM THIS EXAMPLE SETS THE REFERENCE POINT (AX,AY) NEAR CHICAGO
LET DEGRAD=2*3.1415926/360
LET MR=69.09332414/DEGRAD
LET AY=45*DEGRAD
LET AX=-90*DEGRAD
LET BY=LABLAT*DEGRAD
LET BX=LABLON*DEGRAD
LET GCDIST = MR * ACS(SIN(AY)*SIN(BY) ,
+ COS(AY)*COS(BY)*COS(AX-BX))
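The same arithmetic can be checked outside SYSTAT. A Python sketch of the spherical law of cosines with the miles-per-radian constant used above (the test coordinates are illustrative):

import math

MR = 69.09332414 / (2 * math.pi / 360)       # statute miles per radian, as in the SYSTAT code

def great_circle_miles(lat_a, lon_a, lat_b, lon_b):
    # great-circle distance from degree coordinates via the spherical law of cosines
    ay, ax, by, bx = map(math.radians, (lat_a, lon_a, lat_b, lon_b))
    return MR * math.acos(math.sin(ay) * math.sin(by)
                          + math.cos(ay) * math.cos(by) * math.cos(ax - bx))

# reference point near Chicago, as in the example above
print(round(great_circle_miles(45.0, -90.0, 41.9, -87.6), 1))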
Computation
All computations are in double precision.
Missing Data
Cases with missing data are deleted from all analyses.
Algorithms
SPATIAL uses kriging, simulation, and variogram algorithms documented in Deutsch
and Journel (1998). Point statistics are computed by a Voronoi tessellation algorithm.
SYSTAT applies the inverse distance smoother to the estimated grid values for kriging
and simulation when producing Quick Graphs (see the description of this algorithm in
SYSTAT Graphics). For sparser grids, this can lead to a high degree of interpolation of
the estimated values. To view the actual estimates, save the results into a file and plot
them separately without a smoother. You can also specify a large number of grid points
(more than 40) to minimize the effects of the inverse smoother.
References
Cressie, N. A. C. (1990). The origins of kriging. Mathematical Geology, 22, 239–252.
Cressie, N. A. C. (1991). Statistics for spatial data. New York: John Wiley & Sons, Inc.
Deutsch, C. V. and Journel, A. G. (1998). GSLIB: Geostatistical software library and user's guide (2nd ed.). New York: Oxford University Press.
Haining, R. (1990). Spatial data analysis in the social and environmental sciences. Cambridge: Cambridge University Press.
Journel, A. G. and Huijbregts, C. J. (1978). Mining geostatistics. New York: Academic Press.
Kooijman, S. A. L. M. (1979). The description of point patterns. In R. M. Cormack and J. K. Ord (eds.), Spatial and Temporal Analysis in Ecology. Fairland, Md.: International Co-operative Publishing House, pp. 305–332.
Krige, D. G. (1966). Two-dimensional weighted average trend surfaces for ore evaluation. In Proceedings of the Symposium on Mathematical Statistics and Computer Applications in Ore Valuation. Johannesburg, 13–38.
Okabe, A., Boots, B., and Sugihara, K. (1992). Spatial tessellations: Concepts and applications of Voronoi diagrams. New York: John Wiley & Sons, Inc.
Pannatier, Y. (1996). VARIOWIN: Software for spatial data analysis in 2D. New York: Springer-Verlag.
Ripley, B. D. (1981). Spatial statistics. New York: John Wiley & Sons, Inc.
Upton, G. J. G. and Fingleton, B. (1990). Spatial data analysis by example (2 vols.). New York: John Wiley & Sons, Inc.
Wilkinson, L. (1998). The grammar of graphics. Unpublished manuscript.
Chapter 29
Survival Analysis
Dan Steinberg, Dale Preston, Doug Clarkson, and Phillip Colla
SURVIVAL can be used to explore grouped, right-censored, and interval-censored
survival data and to estimate nonparametric, partially parametric, and fully parametric
models by maximum likelihood. The SURVIVAL module's ability to handle disjoint
and overlapping interval-censored data and combinations of interval censoring, right
censoring, and exact failure times is a major enhancement over other programs.
The facilities provided in SURVIVAL include the Kaplan-Meier estimator,
Turnbull's generalization of the Kaplan-Meier estimator for interval-censored data,
plots of failure and censoring times, quantile plots for standardized reference
distributions, log-rank tests, the proportional hazards (Cox) regression, and the
Weibull, log-normal, logistic, and exponential regression models. All models can be
estimated with or without covariates, either directly or by stepwise regression
procedures. The Kaplan-Meier estimator, quantile plots, and Cox regression all
permit stratification. The survivor function, hazard function, reliabilities, and
quantiles can be generated from parametric models for specific covariate values, and
the baseline hazards can be derived from the Cox and stratified Cox models.
The results of most analytic techniques can be saved into SYSTAT files for further
manipulation and analysis with other SYSTAT modules.
Statistical Background
SURVIVAL contains a collection of tools for the analysis of survival or reliability data.
Typically, the dependent variable is a duration, such as the length of time it takes a
woman to conceive after cessation of birth control pills, the survival times of cancer
patients on experimental drugs, or the time a motor runs before it fails. The methods
have been used to analyze a broad range of topics including unemployment durations,
stability of marriages, people's willingness to pay for public goods, and the lengths of
pieces of yarn. It could conceivably be used for the modeling of any strictly positive
quantity. These topics are also studied under other names: reliability, duration,
waiting time, failure time, event history, and transition data analysis are each titles
under which survival topics have been discussed. (References are provided below.)
The distinguishing mark of survival analysis, besides the special parametric models
typically used, is that the dependent variable can be censored. SURVIVAL allows for
two types of such incomplete data: right-censored and interval-censored data. When a
case is right-censored, the dependent variable is known to be greater than a specified
number, but its true value is not known. When data are subject to interval censoring,
failure times may be known only to have occurred within some specified time interval.
Left censoring can be handled by SURVIVAL when it coincides with interval censoring
with a zero lower bound.
Interval censoring naturally arises in data collected by periodic inspection (Nelson,
1978). For example, a utility company might check gas meters at three-month
intervals. A study of the time between meter failures would be conducted on interval-
censored data because the exact failure times would never be known. Meters that had
failed would only be known to have failed within some three-month interval, and
meters that had not failed would be censored.
In general, censoring can occur because a study is ended after a predetermined time
period, after a fixed number of failures has occurred, because of periodic inspection,
because cases are subject to competing risks (Cox and Oakes, 1984), or for other
reasons. A fairly extensive discussion can be found in Lawless (1982). For the methods
of this program to be applicable, the censoring scheme should have nothing to do with
the future survival of the case. That is, the censoring process cannot be informative.
Conditional on having survived to some time t, cases that are censored at that time
should be representative of all cases with the same explanatory variables surviving to
time t. If the fact that a case is censored provides information about its expected
lifetime that distinguishes it from other cases that have not been censored, the
assumptions underlying the models estimable with SURVIVAL are violated. For
example, censoring will not be independent of future survival if an investigator
removes all persons with good or bad prognoses; results will also be subject to severe
bias if patients remove themselves from a study when they feel they are making little
progress. (See Cox and Oakes, 1984, or Lagakos, 1979, for further discussion.) For the
remainder of this chapter, we assume that the censoring scheme, whatever it may be,
is not informative; you should check the conditions under which your data were
gathered to ensure that this condition is met prior to analysis.
Graphics
We're going to reproduce (approximately) Figures 2.1 and 2.2 in Parmar and Machin (1995) to give you an idea of how survival measurements differ from other types of data. This should also give you some ideas about using SYSTAT's graphics to produce
survival graphs for publication. The first figure shows patients entering a prospective
clinical study at different dates, with known survival times indicated by a solid black
symbol and censored times by a pale symbol. The input file looks like this:
BASIC
INPUT ENTRY$,DAYS_IN,DAYS_OUT,CENSOR,SURVIVAL
RUN
01/01/91 0 910 0 910
01/01/91 0 752 1 752
03/26/91 86 1092 0 1006
04/26/91 116 452 1 336
06/23/91 175 1098 1 923
07/09/91 190 308 1 118
07/22/91 203 817 0 614
08/02/91 214 763 1 549
09/01/91 244 1098 1 854
10/07/91 280 432 0 152
12/14/91 348 645 1 297
12/26/91 360 1001 0 641
~
SAVE PMA
LET PATIENT=CASE
LET ENTER=DOC(ENTRY$,MM/DD/YY)
LET EXIT=ENTER + DAYS_OUT - DAYS_IN - 2
RUN
EXIT
USE PMA
CATEGORY PATIENT,CENSOR
BEGIN
PLOT PATIENT*EXIT / XFORM=MMM. YYYY,
XTICK=2,XPIP=12,
XMIN=33238,XMAX=34333,
INDENT,SIZE=1.5,
YREVERSE,VECTOR=ENTER,PATIENT,
HEIGHT=3IN,WIDTH=4IN,SC=2,AX=C,
XLAB= ,YLAB=PATIENT,
FILL=CENSOR,LEGEND=NONE
DRAW LINE / FROM=1.4IN,0IN,TO=1.4IN,3IN
DRAW LINE / FROM=3.81IN,0IN,TO=3.8IN,3IN
WRITE Patient accrual period /LOC=.7IN,3.3IN
HEI=5PT,WID=5PT,
CENTER
WRITE Observation only period/LOC=2.6IN,3.3IN,
HEI=5PT,WID=5PT,
CENTER
END
We've included the raw data so that you can see something about entering and coding
time. First of all, if your data source does not have day-of-the-century values (which
most spreadsheets use for their time variable), it is easier to import the data as ASCII
dates. Don't worry about the year 2000. SYSTAT's dates are good for several more
centuries.
Parmar and Machin use British notation in their Table 2.1 for the dates (for example,
26.12.91 instead of 12/26/91). If you want to enter data that way, change the day-of-
century conversion in the program from
to
Notice how the separator symbol is understood by SYSTAT because you put it in the
format string. Any character other than Y, M, D, H, M, or S will do as well. By
converting dates to day-of-the-century form, we can now do date arithmetic,
calculating the exit times for our graph. We then exit BASIC and plot our first graph.
Notice, also, how powerful the formatting facility for dates is once we code time as
day-of-the-century. We can request the output format for any axis with a simple format
string. SYSTAT takes care of choosing round tick-mark values (allowing for leap years
and different-length months). Again, if you want another date format, simply change
it. For example,

XFORM=DD MMM, YYYY
Most of the commands and options are needed to duplicate Parmar and Machin's
format. The main idea, however, is that we are seeking a graph that shows how entry
and exit times from a study fit on a common time line. Notice, incidentally, that we
treat PATIENT as a categorical variable, so that each patient is given a tick mark,
instead of treating the patient IDs as numbers on a continuous scale. Following is the
result:
[Figure: patient entry and exit dates plotted on a calendar time line, with the patient accrual and observation-only periods marked]
The second graph changes the time line from calendar time to survival time. Following
is the input:
This time, Parmar and Machin order the patients according to survival time, so we use
the ORDER command to sort the indices. Since PATIENT is categorical, the tick marks
will be labeled in that order. We create a ZERO value so that we can draw the lines for
the dot plot starting at zero survival time.
We used a simple recoding of SURVIVAL to get months:

LET SURVIVAL=SURVIVAL/30
If we were concerned about accuracy, we could do the time arithmetic exactly with
SYSTAT's date functions. (See SYSTAT: Data for more information.) The difference
between 30 and 31 days could not be detected in the range of this graph, however.
USE PMA
CATEGORY PATIENT
ORDER PATIENT/SORT=6,10,11,4,8,7,12,2,9,1,5,3
LET ZERO=0
LET SURVIVAL=SURVIVAL/30
PLOT PATIENT*SURVIVAL / ,
INDENT,SIZE=1.5,YREVERSE,
HEIGHT=3IN,WIDTH=4IN,SC=2,AX=2,
XMIN=0,XMAX=36,XTICK=6,XPIP=6,
VECTOR=ZERO,PATIENT,FILL=CENSOR,
LEGEND=NONE,
XLAB=Survival time (months),
YLAB=Patient
Following is the result:

[Figure: survival time in months for each patient, ordered by survival time]
Parametric Modeling
Parametric modeling in SURVIVAL involves the fitting of a fully specified probability
model (up to a finite number of unknown parameters) by the method of maximum
likelihood. Because the a priori commitment to a specific functional form can result in
rather poor fits, it is important to explore the fitted model, to examine generalized
residuals, and to compare the fitted survivor function to nonparametric and partially
parametric models.
The parametric models available in SURVIVAL are based on the exponential,
Weibull, log-normal, and log-logistic distributions. Each model can be fit with or
without covariates. The exponential and Weibull distributions have two options to
allow for the alternative parameterizations discussed below. The Weibull, log-normal,
and log-logistic distributions are each specified as two-parameter distributions
generalized to include the effects of covariates on survival times. Each is an
accelerated life model in which the logarithm of survival time is a linear function of
the covariates.
Accelerated Failure Time Distributions
A random variable has an accelerated failure time distribution if the natural logarithm of time can be modeled as

$$\ln(t) = \mu + \beta z + \sigma w$$

where $\mu$, $\beta$, and $\sigma$ are parameters to be estimated, z is a vector of covariates, and w is a random variable with the known distribution function F(w). Writing

$$w(t) = \frac{\ln(t) - \mu - \beta z}{\sigma}$$

the survivor function of t is given by

$$S(t) = 1 - F\left(w(t)\right)$$

The distributions available for accelerated life models in SURVIVAL use the following definitions of F(w):

Distribution        F(w)                   Model
extreme value       1 - exp[-exp(w)]       EWB, EEXP
logistic            1/[1 + exp(-w)]        LGST
standard normal     Φ(w)                   LNOR

The Weibull and exponential models can also be estimated in the more familiar proportional hazards parameterization with the WB and EXP commands. The survivor function is now written as

$$S(t) = \exp\left[-\left(t / \lambda\right)^{\gamma}\right]$$

where $\lambda$ is proportional to the mean of the distribution and $\gamma$ is the shape parameter. The exponential distribution is a special case of the Weibull distribution with the shape parameter constrained to 1. In terms of the accelerated life formulation, $\lambda$ equals $\exp(\mu + \beta z)$ and $\gamma$ equals $1/\sigma$. The Weibull distribution is the only distribution that can be equivalently parameterized as either a proportional hazards or an accelerated life model.
Some authors prefer an alternative parameterization of the Weibull model (for example, Cox and Oakes, 1984). To facilitate comparisons with different texts, the SURVIVAL output includes several transformations of these parameters. Regardless of the parameterization, the log-likelihood, coefficient estimates, and standard errors for the covariates will be the same. Parameter estimates and standard errors for the location and shape parameters will differ, however. Choose whatever parameterization is most convenient.
In the output, the fundamental scale (shape) and location parameters are labeled _B(1)_ and _B(2)_, respectively. The table below lists their meaning for each of the possible models:

Model   Hazard Shape                     _B(1)_      _B(2)_
WB      increasing for shape > 1,        shape       location; proportional to mean time
        decreasing for shape < 1
EWB     increasing for shape > 1,        shape       location; proportional to log mean time
        decreasing for shape < 1
EXP     constant                         shape = 1   location; equal to mean time if no covariates
EEXP    constant                         shape = 1   location; equal to log mean time if no covariates
LNOR    non-monotonic                    scale       location
LGST    decreasing for scale > 1,        scale       location
        single-peaked for scale < 1
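As a numerical check of the relationship between the accelerated life and proportional hazards forms described above (a Python sketch with illustrative parameter values), the survivor function computed from S(t) = 1 − F(w(t)) with the extreme value F matches the proportional hazards form when λ = exp(μ + βz) and γ = 1/σ:

import numpy as np

mu, beta, sigma = 1.2, 0.5, 0.8          # illustrative location, covariate effect, and scale
z = 1.0                                  # a single covariate value
t = np.array([0.5, 1.0, 2.0, 5.0])

# accelerated life form: w(t) = (ln t - mu - beta*z) / sigma with extreme value F
w = (np.log(t) - mu - beta * z) / sigma
s_aft = 1.0 - (1.0 - np.exp(-np.exp(w)))         # S(t) = 1 - F(w(t))

# proportional hazards form
lam, gam = np.exp(mu + beta * z), 1.0 / sigma
s_ph = np.exp(-(t / lam) ** gam)

print(np.allclose(s_aft, s_ph))                  # True: the two forms agree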
Choosing a Parametric Form
Quantile plots of the unadjusted data can be useful in assessing the suitability of a
functional form when we are interested in the unconditional distribution of the failure
times. When the unconditional distribution may differ substantially from the
conditional distribution (conditioning on covariates), the PPLOT output may not be
helpful in deciding on a model with covariates. You can also examine the quantile
quick graph plot produced automatically after fitting parametric models.
Selection of a parametric form can also be guided by thinking about the shape of the
hazard. This is the approach taken by Barlow and Proschan (1965) and Allison (1984),
among others. Since the probability models available in SURVIVAL have sharply
different implications for the hazard, any strong prior notions about the hazard time
profile can rule out certain models.
The table above lists hazard shapes that are possible for each of the failure models.
For example, the exponential model implies a hazard that is constant over time. This
means that given a set of covariates, the conditional probability of failure does not
depend on the length of the survival time and exhibits duration independence. In
contrast, the Weibull model will imply either an increasing or a decreasing hazard,
depending on the value of the shape parameter. For example, for much mechanical
equipment, the conditional probability of failure is an increasing function of its age
(survival time), and a Weibull model is often appropriate.
The remaining two models are a little more complex. The log-normal model yields
a nonmonotonic hazard rising to a peak and then declining. If the scale parameter is
large, however, the hazard will look like an increasing function over any range of
outcomes with appreciable probability. Finally, the log-logistic hazard will be
decreasing if the scale parameter is greater than 1; otherwise, it will be nonmonotonic
with a single maximum (Cox and Oakes, 1984).
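To see these shapes numerically, here is a Python sketch (illustrative parameter values only) of a monotone Weibull hazard and a log-logistic hazard that rises and then falls:

import numpy as np

t = np.linspace(0.1, 5.0, 6)

def weibull_hazard(t, lam, gam):
    # monotone: increasing when gam > 1, decreasing when gam < 1, constant when gam = 1
    return (gam / lam) * (t / lam) ** (gam - 1.0)

def loglogistic_hazard(t, alpha, k):
    # k is the shape (the reciprocal of the scale in the accelerated life notation):
    # the hazard decreases for k <= 1 and is single-peaked for k > 1
    return (k / alpha) * (t / alpha) ** (k - 1.0) / (1.0 + (t / alpha) ** k)

print(np.round(weibull_hazard(t, lam=2.0, gam=1.5), 3))       # increasing
print(np.round(weibull_hazard(t, lam=2.0, gam=0.7), 3))       # decreasing
print(np.round(loglogistic_hazard(t, alpha=2.0, k=2.0), 3))   # rises, then falls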
It is important to be aware of the potential effect of unobserved heterogeneity on the
estimated hazard function. In general, when cases differ along unmeasured dimensions
that are relevant to the hazard, the estimated hazard will tend to exhibit a more negative
duration dependence than would be obtained with a correctly specified model. For
example, Cox and Oakes (1984) point out that if every case has an exponential hazard
with a mean parameter distributed as a gamma random variable, the population hazard
(a compound exponential) will follow a Pareto distribution with negative duration
dependence. This topic has been discussed briefly in the biostatistical literature
(Vaupel et al., 1979; Hougaard, 1984; Manton et al., 1986) and has received
considerable attention in the econometric literature. See, for example, the Journal of
Econometrics, Vol. 28 (1985), which is devoted to duration analysis.
Survival Analysis in SYSTAT
Survival Analysis Main Dialog Box
Survival analyses are computed by specifying a model and estimating it. This is true
for both parametric models, such as the Weibull, and for nonparametric models, such
as Cox regression and Kaplan-Meier curves. For all models, including the Kaplan-
Meier and others without covariates, specifying a model may simply be a way of
naming the survival, censoring, or strata variables. Post hoc analyses, such as plotting
survivor functions, computing life tables from a model, and requesting quantiles, are
also available.
To open the Survival dialog box, from the menus choose:
Statistics
Survival...
Time. Specify the survival variable. The survival variable is usually a measurement of
time, such as the survival duration of a cancer patient or the length of a spell of
unemployment, but it could be a weight, a trip length, or any other variable for which
negative or zero values would be meaningless.
Covariate(s). Specify covariate variables. Covariates are quantitative predictor
variables.
Censor Status. Specify a censoring indicator variable. The censoring variable is usually
an indicator variable coded as 1 for durations that are complete (uncensored) and 0 for
durations that are incomplete (censored). The censoring variable is sometimes called
an event variable because it indicates whether or not an event, such as a birth or death,
was observed. Survival analysis allows for but does not require censored data; if your
observations are all current, each case would have the censoring variable equal to 1.
Lower Time Bound. Specify a lower-bound variable, which is used for interval
censoring. This variable need not appear in the data set if the data are subject to right
censoring and exact failures alone.
The coding of the survival variable, the censoring variable, and the lower-bound
variable depends on whether the data are interval-censored or not. Use the following
coding scheme:

Case status          Survival variable     Censoring variable     Lower-bound variable
Exact failure        failure time          1
Right censored       censoring time        0
Interval censored    upper bound           1                      lower bound
If the lower-bound variable is specified, it should be coded according to the above
scheme. Certain internal data changes are made to the lower bound and censoring
variables as the data are entered. For exact failures, the lower bound, if it is included,
is set equal to the survival variable. For right-censored cases, the lower bound, if it is
being input, is set to 1. For interval-censored cases, the lower-bound value should be
non-negative and less than or equal to the survival variable value. If SURVIVAL finds
an interval-censored observation with the lower bound equal to the survival time, the
censoring is changed to an exact failure (the censoring variable is set to 1). These
changes are made solely for the convenience of SURVIVAL, and you will see them only
if you save the data during the input process.
In addition, you can specify a stratification (blocking) variable.
Estimation Options
Survival estimation options are available when you click Options.
Estimation options allow you to specify a convergence criterion and a tolerance level,
select complete or stepwise estimation, and specify entry and removal criteria.
Converge. Sets the convergence criterion. This is the largest relative change in any
coordinate before iterations terminate.
Tolerance. Prevents the entry of a variable that is highly correlated with the
independent variables already included in the model. Enter a value between 0 and 1.
Typical values are 0.01 or 0.001. The higher the value (closer to 1), the lower the
correlation required to exclude a variable.
Estimation. Controls the method used to enter and remove variables from the equation.
For complete estimation, in which all independent variables are entered in a single
step, you can enter start values. Start values for the computation routines are calculated
automatically whenever a model is specified. We suggest that you use these start values
unless you have compelling reasons to provide your own or wish to conduct score tests
with the Cox model.
For stepwise estimation, in which variables are entered or removed from the model
one at a time, the following alternatives are available for stepwise entry and removal:
n Backward. Begins with all candidate variables in the model. At each step, SYSTAT
removes the variable with the largest Remove value.
n Forward. Begins with no variables in the model, and at each step SYSTAT adds the
variable with the smallest Enter value.
n Automatic. For Backward, at each step SYSTAT automatically removes a variable
from your model. For Forward, SYSTAT automatically adds a variable to the
model at each step.
n Interactive. At each step in the model building process, you select the variable to
enter or remove from the model.
Probability. You can also control the criteria used to enter and remove variables from
the model:
n Enter. Enters a variable into the model if its alpha value is less than the specified
value. Enter a value between 0 and 1 (for example, 0.025).
n Remove. Removes a variable from the model if its alpha value is greater than the
specified value. Enter a value between 0 and 1 (for example, 0.025).
Force. Forces the first n variables listed in your model to remain in the equation.
Max Step. You can define the maximum number of steps that the stepwise estimation
should perform.
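As an illustration of how these estimation options combine, the following sketch runs a backward stepwise Cox analysis; the variable names come from the MELANOMA examples later in this chapter, and the REMOVE and MAXSTEP values are purely illustrative.

USE MELANOMA
SURVIVAL
MODEL TIME = ULCER,DEPTH,NODES / CENSOR=CENSOR
START / COX, BACK, REMOVE=.05, MAXSTEP=20
STEP / AUTO
STOP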
Tables and Graphs
You can also select from a variety of output tables and regression models when you
click Options.
Table Type. From the drop-down list, select the type of table you would like displayed.
n Survival K-M. Survival K-M is a simple nonparametric estimator that produces a life
table and a plot of the estimated survivor curve.
n Actuarial Life. Divides the time period of observations into time intervals. Within
each interval, the number of failing observations is recorded.
n Actuarial Hazard. Requests that the hazard function be tabled instead of the standard
actuarial survival curve.
n Conditional Life. Requests that the conditional survival be tabled instead of the
standard actuarial survival curve. This table displays the probability of survival
given an interval.
n Parametric Quantiles. Requests approximate confidence intervals for quantiles and
quick graphs based on the last parametric model estimated.
n Parametric Reliability. Requests reliability confidence intervals and quick graphs
based on the last parametric model estimated.
n Parametric Hazard. Requests Quick Graphs and approximate confidence intervals
for values of the hazard function at specified times, based on the last parametric
model estimated.
Model Type. Select the type of regression model you want to use from the drop-down
list. You can choose from Cox regression, logistic model, exponential model, extreme
value exponential model, Weibull model, extreme value Weibull model, and log-
normal model.
Tables, quantiles, hazards, and reliabilities vary as a function of the covariates in the
model (if any). SYSTAT offers two methods for dealing with covariates:
n Condition on mean covariate. The survivor curve by default will be evaluated with
all covariates set to their means.
n Condition on fixed covariate values. You can specify fixed values on the covariates
over which tables are produced. Highlight a covariate, enter the fixed value in the
Value field, and click Add. The fixed value on the covariate will be displayed in
the Fixed value settings list.
In addition, you can specify the following plot options:
n Log time. Expresses the x axis in units of the log of time, or log(time).
n Maximum time. For reliabilities, hazards, and actuarial life tables, you can specify
the maximum time limit of the plot. This should always be expressed as a time even
if you select a Log time axis.
n Number of bins. For reliabilities, hazards, and actuarial life tables, enter the desired
number of bins along the time or log time axis. If not specified, 10 bins are used.
n Survivor function. Plots the survivor function on the y axis.
n Cumulative hazard. The negative of the log of the survivor function is plotted on the
y axis.
n Log cumulative hazard. Plots the log of the negative of the log of the survivor
function (log(-log(survivor))) on the y axis.
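In command terms, these choices map onto options of the ACT and LTAB commands described under Using Commands below. For example, assuming a model has already been estimated, a sketch such as the following requests an actuarial table with 10 bins up to time 2000 on a log(time) axis, and a log-cumulative hazard version of the life-table plot:

ACT 2000,10 / TLOG
LTAB / LCHAZ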
Time Varying Covariates
To specify time varying covariates, click the Options button in the Survival dialog box.
You can define set names for time-dependent covariates and create, edit, or delete time-
dependent covariates.
Parameter. To modify an existing time-dependent variable, select it from the Parameter
drop-down list. To set up a new time-dependent covariate, click New. Define the
covariate in the large text field on the right. You can use existing variables and choose
functions of different types. You may state as many functions of parameters as you
want.
You must define a function for each covariate selected in this dialog box. When you
click Continue, SYSTAT will check that each time-dependent covariate has a
definition. If a name exists but no variables were assigned to it, the time-dependent
covariate is discarded and the name will not be in the drop-down list when you return
to this dialog box. To delete a covariate, select it in the Parameter drop-down list and
click Delete.
Using Commands
After selecting the data file with USE filename, continue with:
SURVIVAL
MODEL timevar = covarlist | tdcovarlist /,
CENSOR=var LOWER=var STRATA=var
FUNPAR tdcovar=expression
(There is one FUNPAR statement for each time-dependent covariate)
ESTIMATE / method , START=d,d,d ... , TOLERANCE=d ,
CONVERGE=d
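As a purely hypothetical illustration of the time-dependent covariate pattern above, the fragment below lists one ordinary covariate (AGE) and one time-dependent covariate (AGET) defined by a FUNPAR statement; the variable names and the expression are placeholders only, and the functions actually available in the expression are those offered in the time-varying covariates dialog box.

SURVIVAL
MODEL TIME = AGE | AGET / CENSOR=CENSOR
FUNPAR AGET=AGE+TIME
ESTIMATE / COX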
Stepwise model fitting is accomplished with the START, STEP, and STOP commands
in place of ESTIMATE:
START / method , BACKWARD FORWARD ENTER=p REMOVE=p,
FORCE=n , MAXSTEP=n TOLERANCE=d CONVERGE=d
STEP var or + or - or / AUTO
STOP
METHOD is one of:
COX LGST EXP EEXP
WB EWB LNOR
Finally, there are several commands for producing tables and graphs following a model
estimation:
LTAB / TLOG covar=d,covar=d CHAZ LCHAZ
ACT d,n / TLOG LIFE CONDITIONAL HAZARD
QNTL / TLOG
RELIABILITY d,n / TLOG
HAZARD d,n / TLOG
Usage Considerations
Types of data. SURVIVAL uses rectangular data and distinguishes three types of data
organization, depending on the type of censoring:
n Data are either exact failures or right-censored.
n Interval-censored and right-censored data intervals do not overlap, and right
censoring occurs at the upper boundary of an interval--no exact failures.
n Any other data type, typically, interval-censored data with overlapping intervals,
or a mixture of interval-censored and exact failure data.
SURVIVAL automatically classifies the data; the type of data will determine the kinds
of analysis you can perform. The fully parametric models can be estimated for any type
of data, but the Cox proportional hazards model can be fit only to the first data type,
and the K-M estimator is replaced with Turnbull's (1976) generalized K-M estimator
for the third data type. When checking for overlapping intervals, SURVIVAL does not
consider a shared endpoint to be an overlap.
The CATEGORY command for categorizing variables works only for stratification.
If you have categorical covariates, recode them with the CODE command in SYSTAT
BASIC before using SURVIVAL.
Print options. PRINT=LONG adds covariance matrices of parameters to the output.
Quick Graphs. Quick Graphs produced by SURVIVAL include Kaplan-Meier curves and
survival functions for parametric models.
Saving files. Almost every command in SURVIVAL allows you to save selected output
to a SYSTAT data file. Any command that produces a table or a plot permits a prior
SAVE command; this is especially useful if you wish to pursue another type of analysis
not presently supported within SURVIVAL.
BY groups. BY group analysis is not allowed in SURVIVAL.
Bootstrapping. Bootstrapping is not available in this procedure.
Case frequencies. FREQ=<variable> increases the number of cases by the FREQ
variable. It does not use extra memory.
Case weights. WEIGHT is not available in SURVIVAL.
Examples
Example 1
Life Tables: The Kaplan-Meier Estimator
The nonparametric estimator available in SURVIVAL is the Kaplan-Meier or product
limit estimator (for example, Kaplan and Meier, 1958, or Lee, 1980). The input for this
follows. Notice that we estimate a model based solely on time and the censoring
structure and then ask for the survival table with the LTAB command. The default
survival table produced by LTAB is Kaplan-Meier. The input is:
USE MELANOMA
SURVIVAL
MODEL TIME/CENSOR=CENSOR
ESTIMATE
LTAB
Following is the output:
Variables in the SYSTAT Rectangular file are:
TIME CENSOR WEIGHT SEX PA PB
ULCER DEPTH NODES SEX$


Time variable: TIME
Censor variable: CENSOR
Weight variable: 1.0
Input records: 69
Records kept for analysis: 69

Weighted
Censoring Observations Observations

Exact Failures 36
Right Censored 33


Type 1, exact failures and right censoring only.
Analyses/estimates: Kaplan-Meier, Cox and parametric models
Overall time range: [ 72.000 , 7307.000]
Failure time range: [ 72.000 , 1606.000]

Kaplan-Meier estimation
All the data will be used

Number Number K-M Standard
At Risk Failing Time Probability Error

69.000 1.000 72.000 0.986 0.014
68.000 1.000 125.000 0.971 0.020
67.000 1.000 127.000 0.957 0.025
66.000 1.000 133.000 0.942 0.028
65.000 1.000 142.000 0.928 0.031
64.000 1.000 151.000 0.913 0.034
63.000 1.000 154.000 0.899 0.036
62.000 1.000 176.000 0.884 0.039
61.000 1.000 184.000 0.870 0.041
60.000 1.000 229.000 0.855 0.042
59.000 1.000 251.000 0.841 0.044
58.000 1.000 256.000 0.826 0.046
57.000 1.000 320.000 0.812 0.047
56.000 1.000 362.000 0.797 0.048
55.000 1.000 391.000 0.783 0.050
54.000 1.000 414.000 0.768 0.051
53.000 1.000 422.000 0.754 0.052
52.000 1.000 434.000 0.739 0.053
51.000 1.000 441.000 0.725 0.054
49.000 1.000 465.000 0.710 0.055
48.000 1.000 471.000 0.695 0.055
47.000 1.000 495.000 0.680 0.056
45.000 1.000 544.000 0.665 0.057
44.000 1.000 584.000 0.650 0.058
43.000 1.000 645.000 0.635 0.058
42.000 1.000 659.000 0.620 0.059
41.000 1.000 749.000 0.605 0.059
39.000 1.000 788.000 0.589 0.060
37.000 1.000 803.000 0.573 0.060
36.000 1.000 812.000 0.557 0.061
32.000 1.000 1020.000 0.540 0.061
31.000 1.000 1042.000 0.523 0.062
28.000 1.000 1151.000 0.504 0.062
26.000 1.000 1239.000 0.484 0.063
13.000 1.000 1579.000 0.447 0.068
12.000 1.000 1606.000 0.410 0.072

Group size = 69.000
Number failing = 36.000
Product limit likelihood = -173.084

Mean survival time = 1034.958

Survival Quantiles

75.000% 422.000
50.000% 1151.000
41.000% 1606.000

Kaplan-Meier estimation
EXPORT successfully completed.
The standard error reported for the survivor function is computed using Greenwood's
formula (Kalbfleisch and Prentice, 1980). The K-M estimate is a step function with
jumps at each exact failure time.
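For reference, Greenwood's formula in its usual form estimates the variance of the Kaplan-Meier survivor function at time t as

\[ \widehat{\mathrm{Var}}\bigl[\hat{S}(t)\bigr] = \hat{S}(t)^{2} \sum_{t_{\{j\}} \le t} \frac{d_{\{j\}}}{r_{\{j\}}\bigl(r_{\{j\}} - d_{\{j\}}\bigr)} \]

where d{j} and r{j} are the number failing and the number at risk at each exact failure time, as tabled above; the reported standard error is the square root of this quantity.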
By default, the plots produced by the K-M option are of the survivor function
plotted against time. You can also obtain the cumulative hazard plot (time against the
negative of the log of the survivor function) or log-cumulative hazard plots
(log(-log(survivor))) with the CHAZ and LCHAZ options of the LTAB command.
Example 2
Actuarial Life Tables
Actuarial life tables divide the time period of observations into time intervals. Within
each interval, the number of failing observations is recorded. To see an actuarial table,
you must first specify and estimate a model. We already did so in the Life Tables
example.
We use the required TIME parameter to specify the maximum time (2000) and the
optional number of intervals (4) to keep the table brief. The default number of intervals
is 10. The input is:
USE MELANOMA
SURVIVAL
MODEL TIME/CENSOR=CENSOR
ESTIMATE
ACT 2000,4
Following is the output:
Lower Number
Interval Interval Interval Entering Number Number
Bound Midpoint Width Interval Failed Censored

0.0 250.000 500.000 62.000 22.000 1.000
500.000 750.000 500.000 39.000 8.000 5.000
1000.000 1250.000 500.000 26.000 4.000 14.000
1500.000 1750.000 500.000 8.000 2.000 6.000
Hazard Function
You can request that the hazard function be tabled instead of the standard actuarial
survival curve. Following is the input:
ACT 2000,4 / HAZARD
Following is the output:
Lower Probability S.E.
Interval Interval Density S.E. Hazard Hazard
Bound Midpoint Function PDF Rate Rate

0.0 250.000 0.001 0.003 0.001 0.000
500.000 750.000 0.000 0.002 0.000 0.000
1000.000 1250.000 0.000 0.002 0.000 0.000
1500.000 1750.000 0.000 0.004 0.001 0.001
Conditional Survival
You can also request that the conditional survival be tabled instead of the standard
actuarial survival curve. This table displays the probability of survival given an
interval. Following is the input:
USE MELANOMA
SURVIVAL
MODEL TIME/CENSOR=CENSOR
ESTIMATE
ACT 2000,4 / CONDITIONAL
Following is the output:
Number Conditional Conditional Cumulative S.E.
Interval Exposed Probability Probability Probability Cum. Prob.
Midpoint To Risk Of Failure Of Survival Of Survival Of Survival

250.000 61.500 0.358 0.642 1.000
750.000 36.500 0.219 0.781 0.642 0.061
1250.000 19.000 0.211 0.789 0.502 0.065
1750.000 5.000 0.400 0.600 0.396 0.069
Example 3
Stratified Kaplan-Meier Estimation
Nonparametric analysis can be refined further by the use of a stratification variable.
There is no fixed upper limit on the number of strata that can be used, and with
moderate-sized data sets, a large number of strata are possible. In the MELANOMA
data set, the variable SEX is coded as 1 for males and 0 for females, respectively. By
adding the STRATA option, we will get a single plot with the two estimated survivor
curves:
USE MELANOMA
SURVIVAL
MODEL TIME / CENSOR=CENSOR, STRATA=SEX
LABEL SEX / 1='Male', 0='Female'
PRINT=LONG
ESTIMATE
LTAB
Following is the output:
Variables in the SYSTAT Rectangular file are:
TIME CENSOR WEIGHT SEX PA PB
ULCER DEPTH NODES SEX$


Time variable: TIME
Censor variable: CENSOR
Weight variable: 1.0
Input records: 69
Records kept for analysis: 69

Weighted
Censoring Observations Observations

Exact Failures 36
Right Censored 33


Type 1, exact failures and right censoring only.
Analyses/estimates: Kaplan-Meier, Cox and parametric models
Overall time range: [ 72.000 , 7307.000]
Failure time range: [ 72.000 , 1606.000]

Stratification on SEX specified, 2 levels

Kaplan-Meier estimation
With stratification on SEX
All the data will be used


The following results are for SEX = Female.

Number Number K-M Standard
At Risk Failing Time Probability Error

31.000 1.000 133.000 0.968 0.032
30.000 1.000 184.000 0.935 0.044
29.000 1.000 251.000 0.903 0.053
28.000 1.000 320.000 0.871 0.060
27.000 1.000 391.000 0.839 0.066
26.000 1.000 414.000 0.806 0.071
25.000 1.000 434.000 0.774 0.075
23.000 1.000 471.000 0.741 0.079
22.000 1.000 544.000 0.707 0.082
20.000 1.000 788.000 0.672 0.085
19.000 1.000 812.000 0.636 0.088
15.000 1.000 1151.000 0.594 0.092
13.000 1.000 1239.000 0.548 0.095
5.000 1.000 1579.000 0.438 0.124
4.000 1.000 1606.000 0.329 0.133

Group size = 31.000
Number failing = 15.000
Product limit likelihood = -58.200

Mean survival time = 1142.022

Survival Quantiles

74.000% 471.000
55.000% 1239.000
33.000% 1606.000


The following results are for SEX = Male.

Number Number K-M Standard
At Risk Failing Time Probability Error

38.000 1.000 72.000 0.974 0.026
37.000 1.000 125.000 0.947 0.036
36.000 1.000 127.000 0.921 0.044
35.000 1.000 142.000 0.895 0.050
34.000 1.000 151.000 0.868 0.055
33.000 1.000 154.000 0.842 0.059
32.000 1.000 176.000 0.816 0.063
31.000 1.000 229.000 0.789 0.066
30.000 1.000 256.000 0.763 0.069
29.000 1.000 362.000 0.737 0.071
28.000 1.000 422.000 0.711 0.074
27.000 1.000 441.000 0.684 0.075
26.000 1.000 465.000 0.658 0.077
25.000 1.000 495.000 0.632 0.078
23.000 1.000 584.000 0.604 0.080
22.000 1.000 645.000 0.577 0.081
21.000 1.000 659.000 0.549 0.081
20.000 1.000 749.000 0.522 0.082
18.000 1.000 803.000 0.493 0.082
16.000 1.000 1020.000 0.462 0.083
15.000 1.000 1042.000 0.431 0.083

Group size = 38.000
Number failing = 21.000
Product limit likelihood = -89.404

Mean survival time = 703.643

Survival Quantiles

74.000% 362.000
49.000% 803.000
43.000% 1042.000

Kaplan-Meier estimation

Log-rank test, stratification on SEX strata range 1 to 2

Method: MANTEL
Chi-Sq statistic: 0.568 with 1 df
Significance level (p value): 0.451

Method: BRESLOW-GEHAN
Chi-Sq statistic: 1.589 with 1 df
Significance level (p value): 0.207

Method: TARONE-WARE
Chi-Sq statistic: 1.167 with 1 df
Significance level (p value): 0.280
EXPORT successfully completed.
The plot can be used to see if the survivor curves are similar in shape and how far apart
they lie. By computing the survivor curves in their log(-log(survivor)) transforms (with
the LCHAZ option), you can check for parallelism. Parallel curves, even if the curves
themselves are not linear, suggest that the stratification variable acts as a covariate in a
proportional hazards model. (See Kalbfleisch and Prentice, 1980, and the Cox
regression examples for further discussion of this point.)
If the SAVE command is issued just prior to the LTAB command, a log-cumulative
hazard function for each stratum will be saved.
Rank Tests
This output includes three variations of the log-rank test. The first, the Mantel-
Haenszel test, is what is conventionally called the log-rank test. The remaining tests are
versions of Wilcoxon tests, and they offer different weighting schemes in calculating
the difference between observed and expected failures at each failure time in a
contingency table analysis. The simple log-rank test uses unit weights so that each
failure time has equal weighting. The Breslow-Gehan version weights each failure
time by the total number at risk at that time so that earlier times receive greater weight
than later times. The Tarone-Ware version weights by the square root of the total
number at risk, placing less emphasis on later failure times.
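In generic two-group notation (not SYSTAT's internal bookkeeping), all three statistics can be written as a weighted comparison of observed and expected failures over the K distinct failure times,

\[ \chi^{2} = \frac{\Bigl[\sum_{j=1}^{K} w_j \,(d_{1j} - e_{1j})\Bigr]^{2}}{\sum_{j=1}^{K} w_j^{2}\, v_{1j}} \]

where d_{1j} and e_{1j} are the observed and expected failures in one group at the jth failure time and v_{1j} is the corresponding hypergeometric variance; the weights are w_j = 1 for the Mantel (log-rank) test, w_j = n_j (the total number at risk) for Breslow-Gehan, and w_j equal to the square root of n_j for Tarone-Ware.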
Discussions of log-rank tests can be found in Kalbfleisch and Prentice (1980),
Lawless (1982), Miller (1981), and Cox and Oakes (1984). The tests themselves were
introduced by Mantel and Haenszel (1959), Gehan (1965), Breslow (1970), and Tarone
and Ware (1977). If there are no tied failures, the simple LRANK test is equivalent to a
score test of the proportional hazards model containing a dummy variable for each
stratum.
Example 4
Turnbull Estimation: K-M for Interval-Censored Data
The Kaplan-Meier estimator, as originally introduced in 1958, is restricted to exact
failure and right-censored data. It is simply defined as

\[ S_{KM}(t_{\{i\}}) = \prod_{j^*} \Bigl( 1 - \frac{d_{\{j\}}}{r_{\{j\}}} \Bigr) \]
where j* is the set j such that t{j} < t{i}, d{j} is the number of deaths at time t{j}, and
r{j} is the number at risk (those that have not yet failed or been censored immediately
before time t{j}).
For data of type 2, in which there are disjoint intervals and possibly right censoring,
the K-M estimator is extended so that the above definition still applies. Now d{j}
denotes the number of failures in the jth interval and right censoring is assumed to have
occurred immediately after the upper boundary of the appropriate interval.
For data of type 3, the generalization of the Kaplan-Meier estimator requires a major
departure from this equation. A version of this generalized K-M estimator was first
suggested by Peto (1973) and was further developed by Turnbull (1976). As type 3 data
have overlapping interval-censored data and may have exact failures and right
censoring as well, the first task is to determine the intervals over which the survivor
function is estimated to decrease. Because this estimator is not discussed in the
standard texts, we provide a brief exposition of the method here.
When data are of type 3, every case is considered to have left and right time
boundaries (L{i}, R{i}) defining its interval of censoring or failure. Cases with exact
failures have L{i} = R{i}; a right-censored observation will have R{i} equal to infinity;
and an interval-censored failure will have L{i} < R{i}. The Peto-Turnbull
generalization begins by identifying a unique set of disjoint time intervals for which
failure probabilities will be estimated. These intervals are constructed by selecting
lower boundaries from the left boundaries and upper boundaries from the right
boundaries, such that these new intervals do not contain any observed L{i} or R{i}
except at the boundaries.
For example, consider the type 3 data set TYPE3A:
LTIME TIME WEIGHT CENSOR
1.0 2.0 4 1
1.0 2.0 5 1
1.9 3.0 5 1
4.0 5.1 3 1
4.0 4.2 8 1
5.0 6.0 10 1
7.0 8.0 6 1
7.0 9.0 4 1
There are seven different observed intervals in the data, and the following four intervals
are generated by the Turnbull estimator:
lower (q) upper (p)
1.9 2.0
4.0 4.2
5.0 5.1
7.0 8.0
The lower and upper boundaries are referred to as q's and p's, respectively, by both Peto
(1973) and Turnbull (1976). We will explain the determination of the first interval.
Cases 1 through 3 overlap each other somewhere on the interval (1.0, 3.0) with distinct
left boundaries being 1.0 and 1.9 and distinct right boundaries being 2.0 and 3.0. The
interval (1.9, 2.0) is the only interval constructible out of these boundaries that does not
itself contain another endpoint. For example, (1.0, 2.0) contains the left endpoint 1.9.
The constructed intervals are of minimal size and involve a maximal overlap of cases
spanning the interval.
A similar method is used to generate the remaining intervals. Intuitively, the goal is
to determine where in the interval the probability of failure lies. Given that a failure
occurs between 1.9 and 3.0, and also that a failure occurs between 1.0 and 2.0, our
attempt to assign all the probability to the smallest possible interval leads to the choice
of the subinterval (1.9, 2.0).
Turnbull shows that a maximum likelihood nonparametric cumulative distribution
function (CDF) can assign probability only to these intervals. Further, for a given set
of probability assignments, the likelihood is independent of the behavior of the CDF
within the interval, meaning that the CDF may be entirely arbitrary within the interval
(Wang, 1987).
The second stage of the generalized Kaplan-Meier estimator computation is to
assign probability to each (q{i}, p{i}) interval, which will define the CDF that
maximizes the likelihood of the data. The solution vector of probabilities s is obtained
by the EM algorithm of Dempster, Laird, and Rubin (1977). Specifically, the observed
frequency distribution of the data should be equal to the expected frequency, given s.
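A sketch of the self-consistency (EM) update in generic notation, which may differ in bookkeeping details from SURVIVAL's implementation: let α_{ij} = 1 if the jth Turnbull interval (q_j, p_j) lies inside case i's interval (L_i, R_i) and 0 otherwise, and let n be the number of (weighted) cases. The probabilities s_j are then updated by

\[ s_j^{(k+1)} = \frac{1}{n} \sum_{i=1}^{n} \frac{\alpha_{ij}\, s_j^{(k)}}{\sum_{l} \alpha_{il}\, s_l^{(k)}} \]

and the iterations continue until the log-likelihood stops increasing, which is what the iteration log in the output below shows.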
Following is the input for the analysis:
USE TYPE3A
SURVIVAL
MODEL TIME / CENSOR=CENSOR, LOWER=LTIME
FREQ=WEIGHT
ESTIMATE
LTAB
The TYPE3A data set yields the following output:
Variables in the SYSTAT Rectangular file are:
LTIME TIME WEIGHT CENSOR

Case frequencies determined by value of variable WEIGHT.


Time variable: TIME
Censor variable: CENSOR
Weight variable: WEIGHT
Weight variable: LTIME
Sorting was found to be required on the following special variables:
TIME
Sorting activated, input continues.

Case frequencies determined by value of variable WEIGHT.
Input records: 8
Records kept for analysis: 8

Weighted
Censoring Observations Observations

Exact Failures 0
Right Censored 0
Interval Censored 8

Type 3, general censoring (left censoring and/or nondistinct intervals).
Analyses/estimates: Kaplan-Meier (generalized) and parametric models
Overall time range: [ 1.000 , 9.000]
Failure time range: [ 1.000 , 9.000]

Turnbull K-M estimation
All the data will be used

Iter L-L
0 -60.304
1 -59.757
2 -59.757

Convergence achieved in 2 iterations
Final convergence criterion: 0.0 -59.757

Turnbull K-M Density
Lower Time Upper Time Probability Change

1.900 2.000 0.689 0.311
4.000 4.200 0.481 0.207
5.000 5.100 0.222 0.259
7.000 8.000 0.0 0.222
Turnbull K-M estimation
The EM algorithm is frequently slow to converge, but it has the advantage of increasing
the likelihood on each iteration. For a theoretical discussion of EM convergence, see
Wu (1983).
Example 5
Cox Regression
Proportional hazards regression (Cox, 1972) is a hybrid model--partly nonparametric,
in that it allows for an arbitrary survivor function like the Kaplan-Meier estimator, and
partly parametric, in that covariates are assumed to induce proportional shifts of the
arbitrary hazard function. The Kaplan-Meier (product limit) estimator is equivalent to
the Cox model without covariates. In SURVIVAL, Cox models are allowed only for type
1 data. The proportional hazards model is assumed to take the form

\[ h(t, z) = b(t)\, f(z, \beta) \]

where b(t) is the nonparametric baseline hazard, and f(z, β) is a parametric shift
function of the covariate vector z and the parameter vector β. Typically, f(z, β) is
specified as exp(zβ), where zβ is an inner vector product, and this is the form used
in SURVIVAL.
SURVIVAL reports maximum likelihood estimates for β and allows access to h(t, z)
and b(t) via the LTAB command. Models are specified with the MODEL command and
a list of covariates.
Following is an example:
USE MELANOMA
SURVIVAL
MODEL TIME = ULCER,DEPTH,NODES / CENSOR=CENSOR
ESTIMATE / COX
The input above will fit the proportional hazards model with three covariates and
display:
Variables in the SYSTAT Rectangular file are:
TIME CENSOR WEIGHT SEX PA PB
ULCER DEPTH NODES SEX$


Time variable: TIME
Censor variable: CENSOR
Weight variable: 1.0
Input records: 69
Records kept for analysis: 69

Weighted
Censoring Observations Observations

Exact Failures 36
Right Censored 33

Covariate means

ULCER = 1.507
DEPTH = 2.562
NODES = 3.246

Type 1, exact failures and right censoring only.
Analyses/estimates: Kaplan-Meier, Cox and parametric models
Overall time range: [ 72.000 , 7307.000]
Failure time range: [ 72.000 , 1606.000]

Cox Proportional Hazards Estimation
Time variable: TIME
Censoring: CENSOR

Weight variable: 1.0
Lower time: Not specified

Iter Step L-L
0 0 -137.527
1 0 -136.100
2 0 -127.887
3 0 -127.813
4 0 -127.813

Results after 4 iterations
Final convergence criterion: 0.000
Maximum gradient element: 0.000
Initial score test of regression: 37.083 with 3 df
Significance level (p value): 0.000
Final log-likelihood: -127.813
-2*[LL(0)-LL(4)] TEST: 19.429 with 3 df
Significance level (p value): 0.000
Parameter Estimate S.E. t-ratio p-value
ULCER -0.776 0.376 -2.063 0.039
DEPTH 0.094 0.050 1.885 0.059
NODES 0.131 0.053 2.490 0.013

95.0 % Confidence Intervals
Parameter Estimate Lower Upper
ULCER -0.776 -1.514 -0.039
DEPTH 0.094 -0.004 0.192
NODES 0.131 0.028 0.235

Covariance matrix

ULCER DEPTH NODES
ULCER 0.142
DEPTH 0.006 0.002
NODES -0.005 -0.000 0.003


Correlation matrix

ULCER DEPTH NODES
ULCER 1.000
DEPTH 0.293 1.000
NODES -0.255 -0.052 1.000
We are provided with a summary of the iteration log. The partial likelihood began at -
137.527 when the parameter vector was all 0s and ended at -127.813.
The output also reports the score test of the hypothesis that all three coefficients are
equal to their start values (in this case 0); the chi-square statistic is 37.083 with three
degrees of freedom and has a p value less than 0.001. This test is analogous to the F-
test reported by the SYSTAT module MGLH for a linear regression, and is simply a test
of the hypothesis that the gradient of the log-likelihood function is 0 when evaluated
at the start values of the coefficients.
Since the start values are 0 for the Cox model (unless specifically set otherwise by
the user), the statistic yields a test of a standard null hypothesis. (Other null hypotheses
could be conveniently tested by using the START and MAXIT = 1 options of the
ESTIMATE command.) Asymptotically, the score test above is equal to the likelihood-
ratio test defined as twice the difference between the final and initial likelihood values.
In larger samples, there is typically good agreement between the two tests. In small
samples, as in this case, the statistics may be quite different.
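Using the values in the iteration log above, the likelihood-ratio statistic labeled -2*[LL(0)-LL(4)] works out to

\[ -2\,\bigl[\,LL(0) - LL(4)\,\bigr] = -2\,\bigl[\,(-137.527) - (-127.813)\,\bigr] = 19.428 \]

which matches the printed 19.429 up to rounding, and is noticeably smaller than the score test value of 37.083 computed at the start values.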
When comparing the parameter estimates of a Cox model with those of a fully
parametric model such as the Weibull, it is important to note that the coefficients are
expected to have opposite signs and will differ by a scale factor. If the data actually
follow a Weibull model with coefficients β and shape parameter σ, then the
proportional hazards parameters will be (-β/σ) (Kalbfleisch and Prentice, 1980).
Example 6
Stratified Cox Regression
The proportional hazards assumption implies that groups with different values of the
covariates have unchanging relative hazard functions over time. Thus, in a study of
male and female survival, the ratio of male to female hazard functions would be
assumed constant if sex were a covariate. If we thought the hazard function for males
was increasing relative to the hazard function for females over time, we would have a
violation of the proportional hazards assumption for the SEX variable.
To accommodate such potential assumption violations, an important generalization
of the Cox model is allowed in SURVIVAL. This is the use of stratification (sometimes
also referred to as blocking). Stratification relaxes the assumption of a single
underlying baseline hazard for the entire population. Instead, it permits each stratum to
have its own baseline hazard, with considerably different stratum-specific time profiles
possible. Stratification stops short of estimating a separate model for each group
because the coefficients for the covariates are common across all the strata. To estimate
a stratified Cox model, we proceed with the following input. We have added an LTAB
command to plot a cumulative hazard life table for the model.
Following is the output:
USE MELANOMA
SURVIVAL
MODEL TIME = ULCER,DEPTH,NODES / ,
CENSOR=CENSOR,STRATA=SEX
ESTIMATE / COX
LTAB / ULCER=0,DEPTH=0,NODES=0,LCHAZ
Variables in the SYSTAT Rectangular file are:
TIME CENSOR WEIGHT SEX PA PB
ULCER DEPTH NODES SEX$


Time variable: TIME
Censor variable: CENSOR
Weight variable: 1.0
Input records: 69
Records kept for analysis: 69

Weighted
Censoring Observations Observations

Exact Failures 36
Right Censored 33

Covariate means

ULCER = 1.507
DEPTH = 2.562
NODES = 3.246

Type 1, exact failures and right censoring only.
Analyses/estimates: Kaplan-Meier, Cox and parametric models
Overall time range: [ 72.000 , 7307.000]
Failure time range: [ 72.000 , 1606.000]

Stratification on SEX specified, 2 levels

Cox Proportional Hazards Estimation
with stratification on SEX
Time variable: TIME
Censoring: CENSOR
Weight variable: 1.0
Lower time: Not specified

Iter Step L-L
0 0 -112.564
1 0 -108.343
2 0 -103.570
3 0 -103.533
4 0 -103.533

Results after 4 iterations
Final convergence criterion: 0.000
Maximum gradient element: 0.000
Initial score test of regression: 32.533 with 3 df
Significance level (p value): 0.000
Final log-likelihood: -103.533
-2*[LL(0)-LL(4)] TEST: 18.063 with 3 df
Significance level (p value): 0.000
Parameter Estimate S.E. t-ratio p-value
ULCER -0.817 0.385 -2.123 0.034
DEPTH 0.083 0.053 1.587 0.112
NODES 0.131 0.057 2.289 0.022

95.0 % Confidence Intervals
Parameter Estimate Lower Upper
ULCER -0.817 -1.570 -0.063
DEPTH 0.083 -0.020 0.186
NODES 0.131 0.019 0.243

Covariance matrix

ULCER DEPTH NODES
ULCER 0.148
DEPTH 0.006 0.003
NODES -0.006 -0.000 0.003



Correlation matrix

ULCER DEPTH NODES
ULCER 1.000
DEPTH 0.301 1.000
NODES -0.287 -0.096 1.000





Life table for last Cox model
1 evaluation covariate vector
All the data will be used


The following results are for SEX = 0.
No tied failure times

Model Model
Number Number Survival Hazard
At Risk Failing Time Probability Rate

31.000 1.000 133.000 0.941 0.059
30.000 1.000 184.000 0.883 0.062
29.000 1.000 251.000 0.826 0.065
28.000 1.000 320.000 0.769 0.069
27.000 1.000 391.000 0.712 0.074
26.000 1.000 414.000 0.658 0.076
25.000 1.000 434.000 0.606 0.078
23.000 1.000 471.000 0.554 0.086
22.000 1.000 544.000 0.501 0.096
20.000 1.000 788.000 0.445 0.112
19.000 1.000 812.000 0.392 0.117
15.000 1.000 1151.000 0.337 0.142
13.000 1.000 1239.000 0.277 0.177
5.000 1.000 1579.000 0.159 0.427
4.000 1.000 1606.000 0.070 0.556

Group size = 31.000
Number failing = 15.000


The following results are for SEX = 1.
No tied failure times

Model Model
Number Number Survival Hazard
At Risk Failing Time Probability Rate

38.000 1.000 72.000 0.996 0.004
37.000 1.000 125.000 0.953 0.044
36.000 1.000 127.000 0.909 0.046
35.000 1.000 142.000 0.866 0.048
34.000 1.000 151.000 0.823 0.049
33.000 1.000 154.000 0.782 0.050
32.000 1.000 176.000 0.742 0.051
31.000 1.000 229.000 0.703 0.052
30.000 1.000 256.000 0.665 0.055
29.000 1.000 362.000 0.627 0.057
28.000 1.000 422.000 0.590 0.059
27.000 1.000 441.000 0.552 0.063
26.000 1.000 465.000 0.514 0.069
25.000 1.000 495.000 0.476 0.074
23.000 1.000 584.000 0.439 0.077
22.000 1.000 645.000 0.401 0.086
21.000 1.000 659.000 0.361 0.099
20.000 1.000 749.000 0.324 0.105
18.000 1.000 803.000 0.287 0.113
16.000 1.000 1020.000 0.250 0.129
15.000 1.000 1042.000 0.215 0.139

Group size = 38.000
Number failing = 21.000
Cox estimation

Log-rank test, stratification on SEX strata range 1 to 2

Method: MANTEL
Chi-Sq statistic: 0.568 with 1 df
Significance level (p value): 0.451

Method: BRESLOW-GEHAN
Chi-Sq statistic: 1.589 with 1 df
Significance level (p value): 0.207

Method: TARONE-WARE
Chi-Sq statistic: 1.167 with 1 df
Significance level (p value): 0.280
EXPORT successfully completed.
Stratification allows the survival pattern to vary markedly for cases with different
values of the stratification variable while keeping the coefficients governing hazard
shifts common across strata. In the above models, allowing SEX to be a stratification
variable does not alter the coefficients by much.
Comparison of the baseline hazards across the strata allows you to decide whether
the stratification variable can be modeled as a covariate. If the log(-log(survivor)) plots
are roughly parallel, the stratification variable is acting to shift the baseline hazard and
is correctly considered to be a covariate. If, on the other hand, the curves are quite
different in shape, the variable is best left as a stratification variable and should not be
included as a covariate. Only one stratification variable can be specified at any given
time.
The baseline survivor function derived from the Cox model is produced with the LTAB
command followed by zero settings for the covariates, as in our example. By adding the
LCHAZ option, we get a baseline hazard for each of the two sexes and a log(-log) plot of
the survivor functions against time. As Kalbfleisch and Prentice (1980) point out, this
technique can be applied repeatedly, swapping the roles of covariates and stratification
variables until you are satisfied with a particular model. With so few data points, it is
difficult to draw firm conclusions, but the log(-log(survivor)) curves do look largely
parallel. This suggests that SEX can appear as a covariate in this model, albeit not a
significant one.
A more conservative analytic procedure than stratification would first split the
sample into the subgroups that are suspected of having different survival behavior and
then estimate separate models for each group. A likelihood-ratio test based on the
summed log-likelihoods of the separate subgroup models and the likelihood for the
stratified Cox model could form the basis of a test of whether stratification is sufficient
to capture the group differences. If stratification is accepted, you could then proceed to
investigate whether the stratification variable could enter as a covariate.
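In symbols, if LL_g is the maximized partial log-likelihood of the separate Cox model for group g (g = 1, ..., G), LL_strat is the log-likelihood of the stratified model, and p is the number of covariates, the test sketched here is, under the usual large-sample assumptions,

\[ \chi^{2} = 2\Bigl[\sum_{g=1}^{G} LL_g - LL_{\mathrm{strat}}\Bigr], \qquad df = p\,(G - 1) \]

with a large value indicating that common coefficients across strata are not adequate.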
Example 7
Stepwise Regression
When there is little theoretical reason to prefer one model specification over another,
stepwise methods of covariate selection can be useful, particularly if there is a large
number of potential covariates. SURVIVAL allows both forward and backward stepwise
covariate selection, with optional forcing of certain covariates into the model and
control over the addition and deletion criteria. The stepping can be used with any
model (except stratified ones) in SURVIVAL, although the forward selection
(STEP/FORWARD) cannot be used with the Cox model unless at least one covariate is
forced into the model. In general, we advise you to use backward elimination
(STEP/BACKWARD) with all stepwise procedures because it is less likely to miss
potentially valuable predictors.
The criterion for adding a variable is based on a Lagrange multiplier test (or Rao's
score test) of the hypothesis that the variable has a zero coefficient when added to the
current list of covariates (Peduzzi, Holford, and Hardy, 1980; Engel, 1984). The signed
square root of this chi-square statistic on one degree of freedom is then treated as a
normal random variable for significance computation.
For variable deletion, the t statistic (actually the asymptotic normal statistic) based
on the ratio of the coefficient to its standard error, as derived from the inverse of the
information matrix, is used. If the default ENTER and REMOVE levels are overridden,
care should be taken to prevent cycling of variables into and out of the model.
Stepwise model selection in nonlinear contexts is subject to the same criticisms as
stepwise linear regression. In particular, conventional hypothesis testing can be
misleading, and models will look much better than they really are. For a general
discussion of stepwise modeling problems, see Hocking (1983) and additional
references cited for General Linear Models.
Following is an example. The input is:
USE MELANOMA
SURVIVAL
MODEL TIME = ULCER,DEPTH,NODES / CENSOR=CENSOR
START / COX,BACK,REMOVE=.05
STEP / AUTO
We have changed the remove p value to 0.05 from the default of 0.15 in order to force
out of the model any nonsignificant effects.
The output follows:
Variables in the SYSTAT Rectangular file are:
TIME CENSOR WEIGHT SEX PA PB
ULCER DEPTH NODES SEX$


Time variable: TIME
Censor variable: CENSOR
Weight variable: 1.0
Input records: 69
Records kept for analysis: 69

Weighted
Censoring Observations Observations

Exact Failures 36
Right Censored 33

Covariate means

ULCER = 1.507
DEPTH = 2.562
NODES = 3.246

Type 1, exact failures and right censoring only.
Analyses/estimates: Kaplan-Meier, Cox and parametric models
Overall time range: [ 72.000 , 7307.000]
Failure time range: [ 72.000 , 1606.000]


Step number 0
Log-likelihood = -127.813
t-ratio p
Further stepping impossible.
Variables included:
ULCER -2.063 0.039
DEPTH 1.885 0.059
NODES 2.490 0.013


Step number 1
Log-likelihood = -129.259
t-ratio p
Further stepping impossible.
Variables included:
ULCER -2.577 0.010
NODES 2.524 0.012
Variables excluded:
DEPTH 1.941 0.052


Final Model Summary

Parameter Estimate S.E. t-ratio p-value
ULCER -0.926 0.359 -2.577 0.010
NODES 0.136 0.054 2.524 0.012

95.0 % Confidence Intervals
Parameter Estimate Lower Upper
ULCER -0.926 -1.631 -0.222
NODES 0.136 0.030 0.241

Covariance matrix

ULCER NODES
ULCER 0.129
NODES -0.005 0.003


Correlation matrix

ULCER NODES
ULCER 1.000
NODES -0.280 1.000
Example 8
The Weibull Model for Fully Parametric Analysis
This example fits an accelerated life model using the Weibull distribution. When we fit
parametric models, we automatically get a plot of the log failure times against the
quantiles of the chosen distribution. Following is the input:
USE MELANOMA
SURVIVAL
MODEL TIME = ULCER,DEPTH,NODES / CENSOR=CENSOR
ESTIMATE / EWB
QNTL
Following is the output:
Variables in the SYSTAT Rectangular file are:
TIME CENSOR WEIGHT SEX PA PB
ULCER DEPTH NODES SEX$


Time variable: TIME
Censor variable: CENSOR
Weight variable: 1.0
Input records: 69
Records kept for analysis: 69

Weighted
Censoring Observations Observations

Exact Failures 36
Right Censored 33

Covariate means

ULCER = 1.507
DEPTH = 2.562
NODES = 3.246

Type 1, exact failures and right censoring only.
Analyses/estimates: Kaplan-Meier, Cox and parametric models
Overall time range: [ 72.000 , 7307.000]
Failure time range: [ 72.000 , 1606.000]


Weibull distribution B(1)--shape, B(2)--location
Extreme value parameterization
Time variable: TIME
Censoring: CENSOR

Weight variable: 1.0
Lower time: Not specified

Iter Step L-L Method
0 0 -346.029 BHHH
1 0 -333.961 BHHH
2 0 -325.721 BHHH
3 0 -318.696 BHHH
4 0 -316.158 BHHH
5 0 -312.058 N-R
6 0 -307.552 BHHH
7 0 -306.814 BHHH
8 1 -306.615 N-R
9 0 -306.510 N-R
10 0 -306.508 N-R
11 0 -306.508 N-R

Results after 11 iterations
Final convergence criterion: 0.000
Maximum gradient element: 0.000
Initial score test of regression: 14.738 with 5 df
Significance level (p value): 0.012
Final log-likelihood: -306.508

Parameter Estimate S.E. t-ratio p-value
_B(1)_ (SHAPE) 1.202 0.161 7.470 0.000
_B(2)_ (LOCATION) 7.277 0.728 9.990 0.000
ULCER 0.776 0.431 1.800 0.072
DEPTH -0.154 0.057 -2.675 0.007
NODES -0.063 0.020 -3.162 0.002

1.0/_B(1)_ = 0.832, EXP(_B(2)_) = 1446.887

947
Survi val Anal ysi s
Mean
Vector Failure Time Variance

ZERO 1595.592 3716876.399
MEAN 900.377 1183539.543

Coefficient of variation: 1.208

95.0 % Confidence Intervals
Parameter Estimate Lower Upper
_B(1)_ (SHAPE) 1.202 0.886 1.517
_B(2)_ (LOCATION) 7.277 5.849 8.705
ULCER 0.776 -0.069 1.622
DEPTH -0.154 -0.266 -0.041
NODES -0.063 -0.102 -0.024

Covariance matrix
_B(1)_ _B(2)_ ULCER DEPTH NODES
_B(1)_ 0.026
_B(2)_ 0.003 0.531
ULCER 0.007 -0.288 0.186
DEPTH -0.001 -0.021 0.007 0.003
NODES -0.000 -0.003 0.001 0.000 0.000



Correlation matrix
_B(1)_ _B(2)_ ULCER DEPTH NODES
_B(1)_ 1.000
_B(2)_ 0.024 1.000
ULCER 0.108 -0.915 1.000
DEPTH -0.132 -0.511 0.291 1.000
NODES -0.077 -0.199 0.079 0.020 1.000
Group size = 69.0000
Number failing = 36.0000

Quantile 95.0 confidence intervals
for last model estimated: EWB (Weibull distribution)


Covariate vector:
ULCER=1.5072, DEPTH=2.5620, NODES=3.2464

Lower Upper Log Of S.E. Of
Estimated Time Time Estimated Log
Quantile Time Bound Bound Time Time

0.9990 0.6373 0.0786 5.1662 -0.4506 1.0677
0.9950 4.4185 0.8945 21.8251 1.4858 0.8149
0.9900 10.1932 2.5485 40.7694 2.3217 0.7073
0.9750 30.9354 10.1861 93.9516 3.4319 0.5668
0.9500 72.2627 29.1689 179.0230 4.2803 0.4629
0.9000 171.6176 84.2625 349.5338 5.1453 0.3629
0.7500 573.7870 353.0871 932.4371 6.3523 0.2477
0.6667 866.6446 560.8401 1339.1924 6.7646 0.2220
0.5000 1650.6876 1101.2413 2474.2710 7.4089 0.2065
0.3333 2870.8589 1861.9127 4426.5399 7.9624 0.2209
0.2500 3796.5470 2386.6769 6039.2630 8.2418 0.2368
0.1000 6985.1899 3989.1996 12231.2453 8.8515 0.2858
0.0500 9583.1488 5152.7473 17822.8689 9.1678 0.3166
0.0250 12306.2149 6287.2253 24087.4026 9.4179 0.3427
0.0100 16065.7917 7752.8895 33292.0604 9.6844 0.3718
0.0050 19013.9159 8840.9176 40892.7006 9.8529 0.3907
0.0010 26151.5267 11313.1214 60452.1357 10.1717 0.4275
In the output, the fundamental scale (shape) and location parameters are labeled _B(1)_
and _B(2)_, respectively. Also, notice that SURVIVAL selected the BHHH method in
early iterations to ensure a positive definite information matrix. It then switched to
conventional Newton-Raphson.
The probability plot should follow a relatively straight line if the distribution used
is appropriate. You should compute several distributions and examine this graph as a
diagnostic aid. We also present the fitted distribution's quantiles in a table and Quick
Graph.
Computation
All computations are in double precision.
Algorithms
Start values for the computation routines are calculated automatically whenever a
model is specified. In SURVIVAL, start values are obtained from a linear regression
based on an accelerated life model without covariates. The model is

\[ \ln(t) = \alpha + \sigma w \]

which specifies the log failure time to be the sum of a constant and a parametric error.
We rewrite this in terms of the probability of failure before time t, denoted by p, as

\[ \ln(t) = \alpha + \sigma F^{-1}(p) \]

where F is the CDF of the Weibull, log-normal, or log-logistic distribution. A linear
regression of the observed failure times on a constant and the appropriate transform of
the Kaplan-Meier estimate of p for each time yields start values for α and σ. For the
WB form of the Weibull model, we use λ = e^(-α/σ) and γ = 1/σ.
Missing Data
SURVIVAL will analyze only cases that have valid data for every special variable and
all covariates listed in the MODEL command. If any one of these variables is missing
for a case, that record will not be input. Consequently, if you want to analyze data
containing missing values for some of the covariates and retain the maximum possible
number of cases for each analysis, use the CORR procedure to estimate the missing
values via the EM algorithm and save your data with the imputed values.
Parameters
In SURVIVAL, we use the accelerated life parameterization for convenience in
computing and interpreting the results. The models behave well and converge quickly,
and the notion of a covariate accelerating life is intuitive. Some other texts and
programs prefer a different parameterization, most typically for the Weibull and
exponential models. To facilitate comparisons, SURVIVAL output prints
transformations of the shape and location parameters that will match other
parameterizations, and the optional WB and EXP commands use a proportional hazards
parameterization. If you observe a difference in the shape and location parameters but
identical covariate coefficients (or identical except for sign), you have come across a
difference in parameterization. This is no cause for concern; from a mathematical point
of view, the sets of results are identical except for a transformation of some parameters.
Centering
In SURVIVAL, the default is to input data without centering. If you do opt to center, and
this is advisable particularly for estimation of the WB model, you will discover that
your location parameter (_B(2)_) may change. This is analogous to the change in the
intercept you would see in a multiple regression if you centered some of your data.
Again, the change is of no consequence.
Log-Likelihood
The most common discrepancy between SURVIVAL and textbook results is in the
reported log-likelihood at convergence. Some authors such as Kalbfleisch and Prentice
(1980) prefer to eliminate any terms in the log-likelihood that are constants or
exclusively functions of the data (that is, not functions of the unknown parameters).
Thus, in the Weibull model, they drop an ln(t) term from the likelihood contribution
of each uncensored case. While this does not in any way affect the maximum
likelihood solutions for the parameter vector, it does result in a log-likelihood much
smaller in magnitude than that reported by SURVIVAL. For example, Kalbfleisch and Prentice (1980)
report a Weibull model estimated on the data in their Table 1.2 as having a log-
likelihood of -22.952; SURVIVAL reports -144.345. The coefficients and standard
errors are identical for both normalizations, however. A similar divergence will be
noted on the log-logistic and log-normal models. All of the differences are innocuous
and are the result of different normalizations; they do not represent any real differences
in results.
Iterations
The maximum likelihood procedures in SURVIVAL are iterative. The basic iteration
consists of determination of the gradient of the log-likelihood with respect to the
parameter vector, calculation of a parameter change vector, and evaluation of the log-
likelihood based on the updated parameter vector. If the new log-likelihood is larger
than that of the previous iteration, the iteration is considered complete and a new
iteration is begun; if not, a step halving is initiated. SURVIVAL will continue to iterate
until convergence has been attained.
A step halving is required if some metric of the parameter change vector is too large,
resulting in a more negative log-likelihood. This change vector is simply cut in half,
and the log-likelihood is reevaluated. If this log-likelihood is an improvement, the
iteration is considered complete and a new iteration is begun; otherwise, another step
halving is done.
During this process, if either the total number of complete iterations or the total
number of step halvings for a single iteration becomes greater than or equal to the limit
specified in the MAXIT option, estimation will stop, and a message stating that the
iteration limit was encountered will be given followed by the parameter values, log-
likelihood, etc., at this point.
The iteration limit will usually be a problem only for models that typically converge
slowly, such as WB and EXP. On the other hand, as the parameter estimates approach
their final values and the convergence criterion is almost satisfied, SURVIVAL may
have difficulty in improving the log-likelihood. Successively smaller steps will be
required to get an improved log-likelihood for the iteration, since there is only a little
room left for improvement this far along anyway. If iteration i results in a log-
likelihood very close to the optimal value, but the overall convergence criterion is not
yet satisfied, then many step halvings are required on iteration i + 1 to get an
improvement, and the step halving limit may be encountered. This may not be a
problem. If the parameters that are printed out appear not to have met the convergence
criterion, they probably are near their optimal values anyway. Intelligent control of the
convergence criterion and iteration and step halving limits is important here.
Singular Hessian
SURVIVAL will not estimate models that include an exact linear dependency among
covariates or that include a constant covariate. For either situation, the Hessian (matrix
of second derivatives) is singular, and a message to that effect will be printed in the
output. The problem of covariate interdependency is common to all models
(parametric and proportional hazards). Stratified proportional hazards models add
another level of complexity. If one of the covariates is constant within a stratum, a
singular Hessian can result.
Survival Models
We use the symbol F(t) to represent the CDF for the continuous non-negative random
variable T. Within SURVIVAL, we require that all failure times be strictly positive (that
is, zero failure times are not permitted). The survivor function is defined as

\[ S(t) = 1 - F(t) = \mathrm{Prob}(T > t) \]

The density function is

\[ f(t) = dF(t)/dt \]

and the hazard function is

\[ h(t) = -\frac{d[\ln S(t)]}{dt} = \frac{f(t)}{S(t)} \]
Censoring occurs when the value of t is not observed completely but is restricted to an
interval on the real line. In general, an observation is interval-censored if all we know
is that the failure time falls between times t_l and t_u, or

\[ t_l < t < t_u \]

The censoring is called right censoring when

\[ t_l < t < \infty \]

In some contexts, if t_l = 0 and t_u is finite, the censoring is called left censoring, but
we do not distinguish this from general interval censoring in SURVIVAL.
Proportional Hazards Models
Cox's proportional hazards model can be written as

\[ h(t, z, \beta) = h_0(t) \exp(z\beta) \]

where h_0(t) is the nonparametric baseline hazard. The survivor function is then

\[ S(t, z, \beta) = S_0(t)^{q} \]

where q = exp(zβ). SURVIVAL allows each stratum i to have its own baseline hazard
h_{0i}(t).
The Cox model is estimated by maximizing the partial likelihood that does not
include the baseline hazard h_0(t). For tied failure times, we use Breslow's
generalization of the Cox likelihood. Denoting the ordered failure times for the ith
stratum by t_{(1i)}, ..., t_{(m_i i)},

\[ L = \prod_{i=1}^{I} \prod_{j=1}^{m_i} \frac{\exp(s_{(ji)} \beta)}{\Bigl[ \sum_{R_{t_{(ji)}}} \exp(z\beta) \Bigr]^{d_{(ji)}}} \]
where m_i is the number of failures in the ith stratum, d_{(ji)} is the number of failures
in stratum i at time t_{(ji)}, s_{(ji)} is the vector sum of covariate vectors for each of these
observations, and R_{t_{(ji)}} is the risk set at failure time t_{(ji)}. When there are no tied
failures, this formula reduces to Cox's original likelihood.
The recovery of the baseline hazard for a stratum follows Prentice and Kalbfleisch
(1979). Defining

\[ \alpha_j = 1 - \exp(\beta z_{(ji)}) \Big/ \Bigl[ \sum_{R_{t_{(ji)}}} \exp(\beta z') \Bigr] \]

the baseline hazard for covariate vector z = 0 is

\[ S(t; 0) = \prod_{t_j < t} \alpha_j \]

and for tied failure times,

\[ \alpha_j = \exp\Bigl( -d_{(ji)} \Big/ \Bigl[ \sum_{R_{t_{(ji)}}} \exp(\beta z') \Bigr] \Bigr) \]
ln(-ln(survivor)) Plots and Quantile Plots
We can write the log(-log(survivor)) equation as

\[ \ln\bigl(-\ln S(t, z, \beta)\bigr) = \ln\bigl(-\ln S_0(t)\bigr) + z\beta \]

which shows that for different values of the covariate vector z, the curve is simply
shifted by an additive constant zβ. Although the baseline curve ln(-ln S_0(t)) need
not be linear, the curves for different strata satisfying the proportional hazards
assumption will be parallel.
For the Weibull model, the baseline hazard can be written as

\[ S(t) = e^{-\lambda t^{\gamma}} \]

so that

\[ \ln\bigl(-\ln S(t)\bigr) = \ln\lambda + \gamma \ln(t) \]

which will plot as a straight line against log(t).
Convergence and Score Tests
The convergence criterion is based on the relative increase in the likelihood between
iterations. If

\[ \bigl[ L^{(i)} - L^{(i-1)} \bigr] / L^{(i)} < \mathrm{converge} \]

convergence is achieved, where L^{(i)} is the value of the log-likelihood at the ith
iteration.
The relative change in the log-likelihood is also used to decide whether first
derivatives or the Newton-Raphson method is used in the search algorithm. By default,
if the relative increase exceeds the user-defined threshold, only first derivatives are
calculated, and the sum of the outer products of the gradient vector are used as an
approximation to the matrix of second derivatives (Berndt, Hall, Hall, and Hausman,
1974); below the threshold, the Newton-Raphson method is used.
The score test (Rao, 1977; Engel, 1984) is a Lagrange multiplier (LM) test of the
hypothesis that the entire parameter vector β of the Cox model is 0. The statistic is
computed as

\[ S = U(\beta_0)' \, I(\beta_0)^{-1} \, U(\beta_0) \]

where U(β_0) is the score (gradient) vector evaluated at parameter vector β_0, and I(β_0)
is an estimate of the information matrix also evaluated at β_0. Under the null hypothesis
that β = β_0, S is asymptotically distributed as a chi-square variate with degrees of
freedom equal to the number of elements in β. In SURVIVAL, the score test is
computed for β_0 equal to the start values for β. Ordinarily, these are 0 for the Cox
model, but they may be overridden with the START option.
Stepwise Regression
The stepwise algorithm follows the suggestion of Peduzzi, Holford, and Hardy (1980)
and, if unrestricted, begins with a test for downward stepping. The criterion for deletion
of a variable is based on the t statistic or, more correctly, the asymptotic normal
statistic, computed as the ratio of the coefficient to its estimated standard error.
A step up is based on a score test of the hypothesis that a potential covariate not
currently in the model has a coefficient of 0. If the model currently has p covariates, to
test for the addition of the (p + 1)th covariate, we need to evaluate the information
matrix I and the score vector U under the null hypothesis. Writing β̂ for the current
parameter vector obtained from maximizing the log-likelihood for p parameters, and
partitioning the score vector U = (U_1, U_2), the score statistic is

\[ U(\hat{\beta}, 0)' \, I(\hat{\beta}, 0)^{-1} \, U(\hat{\beta}, 0) = U_2' \, I^{22} \, U_2 \]

where I^{22} is the partitioned inverse of I. The statistic could be expanded to test for a
set of potential covariates jointly but is implemented for a single covariate only in the
current version of SURVIVAL. The resulting scalar is asymptotically a chi-square
variate on one degree of freedom, whose square root is treated as a standard normal.
Variances of Quantiles, Hazards, and Reliabilities

The pth quantile of a distribution for the random variable T is that value of t for which
$F(t) = p$. For the accelerated life model we have

$$\ln(t) = z'\beta + \sigma w$$

and for a given p, a point estimate for $\ln(t_p)$ is obtained from

$$\ln(\hat{t}_p) = z'\hat{\beta} + \hat{\sigma} F^{-1}(p)$$

where $F^{-1}$ is the inverse of the extreme value, normal, or logistic distribution,
depending on the model in use. The variance of $\ln(\hat{t}_p)$ is derived under the assumption
that the estimated parameters are multivariate normal with mean and covariance matrix
given by the maximum likelihood solutions. The confidence intervals are computed in
terms of $\ln(t)$ and then transformed to the time scale.

Confidence intervals for reliabilities are computed from asymptotic approximations
based on a first-order Taylor series expansion in terms of the estimated parameters of
the model. This is sometimes called the delta method (Rao, 1977). In SURVIVAL, we
compute confidence intervals in terms of the log-odds ratio $\ln\!\left(\hat{p} / (1 - \hat{p})\right)$ because
this quantity does not have any range restrictions and is more nearly a linear function
of the parameters. The confidence intervals for the log odds are then transformed to the
probability scale.
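A minimal sketch of that log-odds construction (Python, hypothetical inputs; SURVIVAL's own routine is not shown here): given an estimated reliability and a delta-method standard error for it, the interval is built on the logit scale and then transformed back.

import math
from scipy.stats import norm

def reliability_interval(p_hat, se_p, level=0.95):
    # Delta method: Var(logit p) is approximately Var(p) / (p * (1 - p))**2
    z = norm.ppf(0.5 + level / 2.0)
    logit = math.log(p_hat / (1.0 - p_hat))
    se_logit = se_p / (p_hat * (1.0 - p_hat))
    lower, upper = logit - z * se_logit, logit + z * se_logit
    back = lambda x: 1.0 / (1.0 + math.exp(-x))   # inverse logit
    return back(lower), back(upper)

print(reliability_interval(0.85, 0.04))   # hypothetical estimate and standard error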
References
Allison, P. (1984). Event history analysis. Beverly Hills, Calif.: Sage Publications.
Anderson, J. A. and Senthilselvan, A. (1980). Smooth estimates for the hazard function. Journal of the Royal Statistical Society, Series B, 42, 322–327.
Barlow, R. E. and Proschan, F. (1965). Mathematical theory of reliability. New York: John Wiley & Sons, Inc.
Berndt, E. K., Hall, B., Hall, R. E., and Hausman, J. A. (1974). Estimation and inference in non-linear structural models. Annals of Economic and Social Measurement, 3, 653–665.
Breslow, N. (1970). A generalized Kruskal-Wallis test for comparing K samples subject to unequal patterns of censorship. Biometrika, 57, 579–594.
Breslow, N. (1974). Covariance analysis of censored survival data. Biometrics, 30, 89–99.
Cox, D. R. (1972). Regression models and life tables. Journal of the Royal Statistical Society, Series B, 34, 187–220.
Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 269–276.
Cox, D. R. and Oakes, D. (1984). Analysis of survival data. New York: Chapman and Hall.
Cox, D. R. and Snell, E. J. (1968). A general definition of residuals. Journal of the Royal Statistical Society, Series B, 30, 248–275.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Elandt-Johnson, R. C. and Johnson, N. L. (1980). Survival models and data analysis. New York: John Wiley & Sons, Inc.
Elbers, C. and Ridder, G. (1982). True and spurious duration dependence: The identifiability of the proportional hazards model. Review of Economic Studies, 49, 402–411.
Engle, R. F. (1984). Wald, likelihood ratio and Lagrange multiplier tests in econometrics. In Z. Griliches and M. Intrilligator (eds.), Handbook of Econometrics. New York: North-Holland.
Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily singly censored samples. Biometrika, 52, 203–223.
Gross, A. J. and Clark, V. A. (1975). Survival distributions: Reliability applications in the biomedical sciences. New York: John Wiley & Sons, Inc.
Han, A. and Hausman, J. (1986). Semiparametric estimation of duration and competing risks models. Department of Economics, Massachusetts Institute of Technology, Cambridge, Mass.
Heckman, J. and Singer, B. (1984). The identifiability of the proportional hazards model. Review of Economic Studies, 51, 321–341.
Heckman, J. and Singer, B. (1984). A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica, 52, 271–320.
Hougaard, P. (1984). Life table methods for heterogeneous populations: Distributions describing the heterogeneity. Biometrika, 71.
Kalbfleisch, J. and Prentice, R. (1980). The statistical analysis of failure time data. New York: John Wiley & Sons, Inc.
Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 457–481.
Lagakos, S. (1979). General right censoring and its impact on the analysis of survival data. Biometrics, 35, 139–156.
Lancaster, T. (1985). Generalized residuals and heterogeneous duration models: With applications to the Weibull model. Journal of Econometrics, 28, 155–169.
Lancaster, T. (1988). Econometric analysis of transition data. Cambridge: Cambridge University Press.
Lawless, J. F. (1982). Statistical models and methods for lifetime data. New York: John Wiley & Sons, Inc.
Lee, E. T. (1980). Statistical methods for survival data analysis. Belmont, Calif.: Wadsworth.
Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.
Manton, K. G., Stallard, E., and Vaupel, J. (1986). Alternative models for the heterogeneity of mortality risks among the aged. Journal of the American Statistical Association, 81, 635–644.
Miller, R. (1981). Survival analysis. New York: John Wiley & Sons, Inc.
Nelson, W. (1978). Life data analysis for units inspected once for failure. IEEE Transactions on Reliability, R-27, 4, 274–279.
Nelson, W. (1982). Applied life data analysis. New York: John Wiley & Sons, Inc.
Parmar, M. K. B. and Machin, D. (1995). Survival analysis: A practical approach. New York: John Wiley & Sons, Inc.
Peduzzi, P. N., Holford, T. R., and Hardy, R. J. (1980). A stepwise variable selection procedure for nonlinear regression models. Biometrics, 36, 511–516.
Prentice, R. L. and Kalbfleisch, J. D. (1979). Hazard rate models with covariates. Biometrics, 35, 25–39.
Preston, D. and Clarkson, D. B. (1983). SURVREG: A program for the interactive analysis of survival regression models. The American Statistician, 37, 174.
Rao, C. R. (1977). Linear statistical inference and its applications, 2nd ed. New York: John Wiley & Sons, Inc.
Steinberg, D. and Monforte, F. (1987). Estimating the effects of job search assistance and training programs on the unemployment durations of displaced workers. In K. Lang and J. Leonard (eds.), Unemployment and the Structure of Labor Markets. London: Basil Blackwell.
Tarone, R. E. and Ware, J. (1977). On distribution-free tests for equality of survival distributions. Biometrika, 64, 156–160.
Turnbull, B. W. (1976). The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society, Series B, 38, 290–295.
Vaupel, J. W., Manton, K. G., and Stallard, E. (1979). The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography, 16, 439–454.
Wang, M. (1987). Nonparametric estimation of survival distributions with interval censored data. Johns Hopkins University, Baltimore, Md.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11, 95–103.
Chapter 30
T Tests
Laszlo Engelman
The following t tests are available on the Statistics menu:

Two Groups    Two-sample (independent) t test. The values of the variable of interest (for
              example, INCOME) are stored in a single column and SYSTAT uses codes
              of a grouping variable (for example, GENDER) to separate the cases into
              two groups (the codes can be numbers or characters). SYSTAT tests whether
              the difference between the two means differs from 0.

Paired        Paired comparison (dependent) t test. For each case used in a paired t test,
              SYSTAT computes the differences between values of two variables (columns)
              and tests whether the average differs from 0.

One-Sample    One-sample t test. For the one-sample t test, values of a single variable are
              compared against a constant that you specify.

Statistical Background

The following figure shows four different curves, all probability densities. These
were drawn with

FPLOT F = TDF(T, df)

using SYSTAT's t density function, where df is the degrees-of-freedom parameter.
Plotting probability and cumulative density functions is often a good way to see what
the distributions we use for confidence intervals and hypothesis tests look like. The
dashed curve is the normal density. The other three are t densities with 1, 2, and 5
degrees of freedom. Notice that as degrees of freedom increase, the shape of the curve
approaches the normal density.

[Figure: probability densities f(t) plotted against t (from -5 to 5) for the normal distribution (dashed) and for t distributions with 1, 2, and 5 degrees of freedom.]
The t distribution was found by William S. Gosset (1908) while working at the
Guinness brewery in Dublin. His discovery came out of the practical need to analyze
the results of small-sample experiments in the brewing process. Prior to Gosset's work,
the normal curve, or its parametric variant called the error function, was used to
approximate the uncertainty of the mean of a set of measurements. As Gosset and
others had noted, this approximation was satisfactory only for large samples.
Incidentally, the authorship of his 1908 Biometrika paper under the pseudonym
"Student" arose from the Guinness corporate desire to conceal this new technology
from competitors. It also reflected Gosset's gratitude for help from the statistician Karl
Pearson, his Professor.
Gosset provided a method that enabled reasonable inferences to be drawn from the
mean of small random samples. He identified this distribution (later named t by Sir
Ronald Fisher) by taking repeated random samples of size 4 from a data set and
examining the behavior of a statistic based on the ratio of the sample mean to the
sample standard deviation.
Although Gosset used real data and a somewhat different parameterization of the
problem, you can get an idea of what he discovered by running the following Monte
Carlo experiment:
BASIC
RSEED=3333
REPEAT 750
DIM X(4)
REM Generate 750 samples (n=4) of normal random numbers
FOR I=1 TO 4
LET X(I)=ZRN
NEXT
REM Compute a t statistic for each sample
LET T=AVG(X(1)..X(4))/(STD(X(1)..X(4))/SQR(4))
RUN
REM Plot the ordered t values against their expected values
REM Use a theoretical t distribution with 3 df
PPLOT T / T=3

Here is the probability plot output:

[Figure: probability plot of the 750 simulated t values (T) against the expected values for a t distribution with 3 degrees of freedom; both axes range from -20 to 20.]

Gosset examined batches of 750 means and noted the extremely long tails of the
distribution as compared to the normal. Notice in the probability plot that we see the
two most extreme cases having absolute t values greater than 10. This would be
incredible if we had a sample of 750 from a standard normal distribution. The
theoretical chance of seeing values this large or larger from a random sample of a t
distribution with 3 degrees of freedom, on the other hand, is:

CALC 2*(1 - TCF(10,3))

which yields 0.0021, or about two cases, which is what we found in this sample.
Degrees of Freedom
The parameter called degrees of freedom enters into the definition of the t distribution
because the estimate of the standard deviation in the denominator depends on it. If we
compute n deviations from the mean of n observations, we can predict perfectly the last
(or any) deviation by knowing the mean, n, and the other observations. Stigler
(1986) notes that this concept originated in the development of the chi-square
distribution in the 19th century. The chi-square is involved in the computation of t
because it is the distribution of the sample variance (the square of the standard
deviation).
When degrees-of-freedom are large enough (say, greater than 30), the t and standard
normal (z) distributions are practically indistinguishable. You can see this by
comparing TDF(T,30) to ZDF(Z). On the other hand, you can see from the figure on
p. 960 that for small degrees of freedom, the difference between the two curves is
substantial, particularly in the tails. This is the gist of Gosset's accomplishment. Gosset
quantified the amount of bias scientists are likely to encounter by ignoring degrees of
freedom and using z instead of t.
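The TDF versus ZDF comparison can also be reproduced outside SYSTAT. A short Python sketch with SciPy (illustrative only) prints the two-sided tail probability beyond 2.0 for several t distributions and for the normal:

from scipy.stats import t, norm

for df in (1, 2, 5, 30):
    print(df, 2 * t.sf(2.0, df))      # two-sided tail area under a t distribution
print("normal", 2 * norm.sf(2.0))     # what you get by ignoring degrees of freedom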
The T Test
We call a test based on the distribution of

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

on $n - 1$ degrees of freedom a t test. Fisher (1925) extended this form to a variety of
applications, including tests on regression coefficients and differences of means. Many
procedures in SYSTAT use this distribution for various purposes. TTEST focuses on the
classic tests of means.

The simplest test is called a one-sample t test. This is a test of the form:

Null hypothesis: $\mu = \mu_0$
Alternative hypothesis: $\mu \neq \mu_0$

It is occasionally overlooked in practice that $\mu_0$ may be any real value; "null" does not
mean zero. Indeed, this is the point of the one-sample test: to assess the credibility
of an observed mean value given an expected value and an estimate of error.
Mathematical constants such as $\pi$ and certain physical constants are obvious
candidates for $\mu_0$ in a hypothesis testing framework.
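For example, testing a sample of measurements against a nonzero constant comes down to the following arithmetic. The sketch uses Python with made-up data; in SYSTAT the equivalent request is the one-sample form of TEST described later in this chapter.

import numpy as np
from scipy import stats

x = np.array([3.19, 3.08, 3.21, 3.17, 3.10, 3.15])   # hypothetical measurements
mu0 = 3.14159                                         # hypothesized mean

n = len(x)
t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
print(t_stat, p_value)                # stats.ttest_1samp(x, mu0) agrees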
The prevalence of t tests on differences is perhaps what has led to linking of "null"
and "zero" in the minds of some applied researchers. There are two species of this test.
The first is called the paired t test (dependent t test). In this context, we seek to assess
the credibility of an observed difference between the means of two repeated
observations $x_1$ and $x_2$ being due to a process where no difference exists in the
population. Since the difference between two normally distributed variables is itself a
normally distributed variable with mean $\mu_1 - \mu_2$ and variance $\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}$, we
construct a test on:

Null hypothesis: $\mu_1 - \mu_2 = 0$
Alternative hypothesis: $\mu_1 - \mu_2 \neq 0$
Computing this test is a matter of taking differences of our data and treating the new
variable comprising these differences as we do in a single-sample t test. This
differencing improves the power of an experiment when the covariance $\sigma_{12}$ is large
(and positive) relative to the variances because we are implicitly subtracting twice its
value from the sum of the variances when we compute our estimate of the variance of
the difference. This generally occurs when the pairs of measurements are on the same
individual or siblings, for example. Woe to the researcher who encounters negative
covariances, however. This may happen when negative feedback biases the
measurement process or other lurking variables cause one measurement to covary
negatively with the other in each pair.
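The differencing argument is easy to see in a small sketch (Python, hypothetical paired measurements): the paired t test is exactly a one-sample t test on the within-pair differences.

import numpy as np
from scipy import stats

x1 = np.array([12.1, 14.3, 11.8, 13.5, 12.9, 15.0])   # first measurement per subject
x2 = np.array([11.4, 13.6, 11.5, 12.8, 12.1, 14.2])   # second measurement, same subjects

d = x1 - x2
t_by_hand = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
print(t_by_hand, stats.ttest_rel(x1, x2).statistic)   # the two values match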
The second species of t test on differences is slightly different. We have two sets of
independent measurements rather than one set of pairs of measurements. We cannot
difference pairs of measurements because there are no pairs; the samples may not even
be the same size. Our hypotheses are:
Null hypothesis: $\mu_1 - \mu_2 = 0$
Alternative hypothesis: $\mu_1 - \mu_2 \neq 0$
Computing this test requires us to get an estimate of the variance of the difference by
using the estimates of the separate variances. If the samples are indeed independent,
then this variance is simply the sum of the separate variances and its estimate is a
weighted sum of the separate variance estimates. We lose a degree of freedom for each
sample in the process, however, since the sums of the separate samples are each
constrained.
Pooling
Estimating the variance of a difference between means of measurements on two
independent samples involves pooling sources of variance from each separate sample.
This pooling requires us to assume we are combining homogeneous sources of
variation. If this is not true, then the computed t statistic does not follow the t
distribution. This problem, summarized in Snedecor and Cochran (1989), has been
attacked by a number of statisticians. SYSTAT provides a separate variances t test for
this condition; the test approximates the true distribution of the statistic when the
assumption of equal variance (but not distributional shape) is violated.
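A sketch of the two computations (Python with SciPy, hypothetical data): the pooled test weights the two sample variances together, while the separate variances test keeps them apart and pays for it with reduced, Welch-Satterthwaite style degrees of freedom. Whether this matches SYSTAT's separate variance formula exactly is not claimed here; it only illustrates the idea.

import numpy as np
from scipy import stats

a = np.array([20.1, 23.4, 18.7, 25.2, 22.8, 19.9, 24.1])   # group 1 (hypothetical)
b = np.array([30.5, 41.2, 28.9, 55.0, 33.3])               # group 2, much more variable

t_pooled, p_pooled = stats.ttest_ind(a, b, equal_var=True)       # pooled variance test
t_separate, p_separate = stats.ttest_ind(a, b, equal_var=False)  # separate variances test

va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
df_separate = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
print(t_pooled, p_pooled, t_separate, p_separate, df_separate)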
Assumptions
Throughout this discussion, we have been assuming that the distribution of
measurements that determine the means we are testing is normal. For large
independent random samples, normality is not as much of a concern because the
central limit theorem tells us that the distribution of sample means is normal in these
cases, even when the sampled variable is substantially non-normal. The t test, on the
other hand, is designed for small samples.
This situation has led Freedman, Pisani, and Purves (1980) to note a predicament:
we need to assume normality for the t test, but our sample is too small to be of use in
assessing the validity of this assumption. This predicament would be more worrisome
were it not for the robustness of the t test against violations of this assumption. There
is a considerable Monte Carlo literature in this area, whose general finding is that the
primary condition that should concern us is substantial skewness rather than
symmetrical departures from normality. SYSTAT provides a graphic with each t test
that includes box and dot plots superimposed on normal curves to assess this condition
informally. Because of the issue noted by Freedman et al., generally it is not helpful to
compute statistical tests of normality on samples sized appropriately for t tests prior to
doing the tests. Graphical inspection is to be preferred.
T Tests in SYSTAT
Two-Sample T Test Main Dialog Box
To open the Two-Sample T Test dialog box, from the menus choose:
Statistics
t-test
Two Groups...
The following must be specified to perform a two-sample t test:
Variable(s). Select the variables for which t tests are desired. Each variable corresponds
to a separate t test. When testing several variables, use the optional p value adjustments
to control for multiple tests.
Grouping variable. The t test compares the means for the two groups defined by this
variable.
You can also request optional confidence intervals for the mean differences.
Paired T Test Main Dialog Box
To open the Paired T Test dialog box, from the menus choose:
Statistics
t-test
Paired
The following must be specified to perform a paired t test:
Variable(s). Select the variables for which t tests are desired. If more than two variables
are selected, each variable pair results in a separate t test. When testing several variable
pairs, use the optional p value adjustments to control for multiple tests.
You can also request optional confidence intervals for the paired mean differences.
One-Sample T Test Main Dialog Box
To open the One-Sample T Test dialog box, from the menus choose:
Statistics
t-test
One-Sample
The following must be designated to perform the test:
Variable(s). Select the variables for which t tests are desired. Each variable corresponds
to a separate t test. When testing several variables, use the optional p value adjustments
to control for multiple tests.
Mean. The constant value to which you want to compare the sample mean for each
selected variable.
You can also request optional confidence intervals for the mean differences.
T Test Options
SYSTAT allows you to request tests for several variables with one specification. The
p value associated with the t test assumes that you are making one and only one test.
The probability of finding a significant difference by chance alone rapidly increases
with the number of tests. So, you should avoid requesting tests for many variables and
reporting only those that appear to be significant.
What do you do when you want to study test results for several variables? As
protection for multiple testing, SYSTAT offers two adjustments to the probabilities:
• Dunn-Sidak. The Dunn-Sidak adjustment is appropriate when more than one test is
  performed simultaneously. For n tests, the Dunn-Sidak adjusted probability is
  1 - (1 - p)^n.
• Bonferroni. The Bonferroni adjustment is appropriate when more than one test is
  performed simultaneously. It drops the second- and higher-order terms from the
  expression 1 - (1 - p)^n. The Bonferroni adjusted probability is n * p (see the example
  following this list).

Another option available for all three t tests is confidence intervals. You can specify the
confidence using the following option:

• Confidence. Confidence level for the confidence interval. Enter a value between 0
  and 1 to specify the likelihood that the confidence interval, a range of values based
  on the difference between the sample means (or between the sample mean and a
  specified constant for a one-sample test), includes the difference between the
  population means. Typical values for the confidence level are 0.95 and 0.99.
  Higher values (closer to 1) produce wider confidence intervals; lower values
  produce narrower confidence intervals.
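The two adjustments amount to a one-line calculation each. A small Python illustration (not SYSTAT code); p is a single unadjusted probability and n is the number of simultaneous tests:

def dunn_sidak(p, n):
    return 1 - (1 - p) ** n        # Dunn-Sidak adjusted probability

def bonferroni(p, n):
    return min(n * p, 1.0)         # Bonferroni adjusted probability, capped at 1 here

print(dunn_sidak(0.02, 3), bonferroni(0.02, 3))   # roughly 0.0588 and 0.06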
Using Commands
To request a two-sample t test, specify your data with USE filename and continue with:
TTEST
TEST varlist * grpvar / BONF DUNN CONFI=n
Alternatively, to request a paired t test, continue with:

TTEST
TEST varlist / BONF DUNN CONFI=n

Finally, to request a one-sample t test, continue with:

TTEST
TEST varlist = constant / BONF DUNN CONFI=n
Usage Considerations (T Tests)
Types of data. Test variables must be numeric. The grouping variable for the two-
sample t test can contain either numbers or characters.
Print options. The output is standard for all PRINT options.
Quick Graphs. TTEST produces Quick Graphs. The graph produced depends on the test.
• The two-sample t test produces a Quick Graph combining three graphical displays
  for each group: a box plot displaying the sample median, quartiles, and outliers (if
  any), a normal curve calculated using the sample mean and standard deviation, and
  a dit plot displaying each observation.
• The paired t test produces a Quick Graph in which, for each case pair, a line
  connects the values on the two variables.
• The one-sample t test produces a Quick Graph combining three graphical displays:
  a box plot displaying the sample median, quartiles, and outliers (if any), a normal
  curve calculated using the sample mean and standard deviation, and a dit plot
  displaying each observation.
Saving files. TTEST does not save the results of the analysis.
BY groups. TTEST analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. TTEST uses the FREQ variable, if present, to duplicate cases.
Case weights. TTEST uses the WEIGHT variable, if present, to weight cases.
Examples
Example 1
Two-Sample T Test
Do males tend to earn more than females? We use the SURVEY2 data to test whether
the average income for males differs from that for females. The SURVEY2 data file has
one case for each subject, with the annual income (INCOME) and a numeric or
character code to identify the sex (the values female and male are stored in the
grouping variable SEX$). Note that the cases do not need to be ordered by the values
of the grouping variable.
In addition to the Quick Graph, which SYSTAT automatically provides with each
test, we show alternative ways of viewing these data: a box-and-whiskers plot, a dual
histogram, and a kernel density estimator for each group. The input follows:

TTEST
USE survey2
TEST income * sex$
DENSITY income * sex$ / BOX TRANS
DENSITY income / DUAL=sex$ FILL=0,1
DENSITY income / GROUP=sex$ OVERLAY KERNEL DASH=1,6

The output is:
Two-sample t test on INCOME grouped by SEX$

Group N Mean SD
Female 152 20.257 14.828
Male 104 24.971 16.418

Separate Variance t = -2.346 df = 206.2 Prob = 0.020
Difference in Means = -4.715 95.00% CI = -8.676 to -0.753

Pooled Variance t = -2.391 df = 254 Prob = 0.018
Difference in Means = -4.715 95.00% CI = -8.597 to -0.832
The average yearly income for males in this sample is almost $5,000 more than that for
females ($24,971 versus $20,257). The standard deviation (SD) for males (16.4) is also
larger than that for females (14.8).
The p values (Prob) for both tests indicate a significant difference in the average
incomes of males and females. That is, for the separate variance test, t = 2.346 with 206.2
degrees of freedom and an associated probability of 0.02. The values for the pooled test
are t = 2.391, df = 254, and p value = 0.018. Which result should you use? Use the pooled
test when you are comfortable that the population variances in the two groups are equal.
Scan graphical displays for similar shapes and note that the more the sample variances
differ, the more the degrees of freedom for the separate variance test drop. You pay a
penalty for unequal variances: diminished degrees of freedom mean that your effective
sample size decreases. Here, we would use the separate variance t test.
The difference in means is $4,715. The separate variances estimate of the 95%
confidence interval for this mean difference extends from $753 to $8,676. Note that
the interval using the pooled variance estimate is shorter.
For each group, three graphical displays are combined in the Quick Graph: a box
plot displaying the sample median, quartiles, and outliers (if any), a normal curve
calculated using the sample mean and standard deviation, and a dit plot displaying each
observation. The median incomes differ more than the mean incomes displayed in the
t test output. The distribution of female incomes is more right-skewed than the
distribution of male incomes. The box plot and normal curve indicate that the
distribution of male incomes is fairly symmetric.
Example 2
Bonferroni and Dunn-Sidak Adjustments
How do developed and emerging nations differ? We use the OURWORLD file with
data for 57 countries. Variables recorded for each case (country) include URBAN
(percentage of the population living in urban areas), LIFEEXPF (years of life
expectancy for females), LIFEEXPM (years of life expectancy for males), and GDP$
(grouping variable with codes Developed and Emerging). The input is:
TTEST
USE ourworld
FORMAT=8
TEST urban lifeexpf lifeexpm * gdp$ / BONF DUNN
FORMAT

Following are the results (we used an editor to delete the difference in means and
confidence intervals):
Two-sample t test on URBAN grouped by GDP$

Group N Mean SD
Developed 29 66.10344828 16.84243117
Emerging 27 38.55555556 19.69446102

Separate Variance t = 5.60601761 df = 51.4 Prob = 0.00000083
Dunn-Sidak Adjusted Prob = 0.00000248
Bonferroni Adjusted Prob = 0.00000248

Pooled Variance t = 5.63775383 df = 54 Prob = 0.00000065
Dunn-Sidak Adjusted Prob = 0.00000194
Bonferroni Adjusted Prob = 0.00000194

On the average, 66.1% of the inhabitants of developed nations live in urban areas, while
38.6% of those in emerging nations live in urban areas. Note that the sample size, N, is
29 + 27 = 56, but there are 57 cases in the OURWORLD file (the value of URBAN for
Belgium is missing). Compare the df for the two tests: 51.4 versus 54. Thus,
considering graphical displays (not shown), the standard deviations, and the small
difference between the dfs for the two tests, we are not uncomfortable reporting results
for the pooled variance test. Significantly more people in developed nations live in
urban areas than do people in emerging nations (t = 5.638, df = 54, p value < 0.0005).
Simply view this output as an illustration of the mechanics of the adjustment
features. A difference between a probability of 0.00000083 and 0.00000248 is
negligible, considering possible problems in sampling, errors in the data, or a failure
to meet necessary assumptions. However, if you scan the results for 100 variables, a
probability of 0.0006 for a separate variance t test is not significant when multiple
testing is considered, since the Bonferroni adjusted probability would be 0.06.
Focusing on female life expectancy, the standard deviation (SD) for the emerging
nations is more than two times larger than that for the developed nations, and the df
for the separate variance test drops to 33.6. Using the separate variance test, we
conclude that an average life expectancy of 77.4 years differs significantly from 62
years (t = 6.782, df = 33.6, p value < 0.0005).
Two-sample t test on LIFEEXPF grouped by GDP$

Group N Mean SD
Developed 30 77.43333333 4.47740175
Emerging 27 62.00000000 11.03490964

Separate Variance t = 6.78218991 df = 33.6 Prob = 0.00000009
Dunn-Sidak Adjusted Prob = 0.00000027
Bonferroni Adjusted Prob = 0.00000027

Pooled Variance t = 7.04827869 df = 55 Prob = 0.00000000
Dunn-Sidak Adjusted Prob = 0.00000001
Bonferroni Adjusted Prob = 0.00000001

Two-sample t test on LIFEEXPM grouped by GDP$

Group N Mean SD
Developed 30 70.83333333 3.83345827
Emerging 27 58.70370370 9.96846881

Separate Variance t = 5.93974079 df = 32.9 Prob = 0.00000117
Dunn-Sidak Adjusted Prob = 0.00000351
Bonferroni Adjusted Prob = 0.00000351

Pooled Variance t = 6.18109495 df = 55 Prob = 0.00000008
Dunn-Sidak Adjusted Prob = 0.00000025
Bonferroni Adjusted Prob = 0.00000025
Conclusions regarding male life expectancy are similar to those for females, except
that for males, life expectancy is, on the average, shorter than that for females: 70.8
years in developed nations and 58.7 in emerging nations. You could use a paired t test
to check if the sex difference is significant.
Example 3
T Test Assumptions
In this example, we examine the dollar amounts that Islamic and New World countries
spend per person on health. We request tests of health dollars as measured, in square
root units, and for log-transformed values. Since SYSTAT requires that the grouping
variable has two values, we remove the European countries from the sample (that is,
group$ <> "Europe"). The input is:
USE ourworld
TTEST
SELECT group$ <> "Europe"
TEST health * group$
LET sqhealth=SQR(health)
TEST sqhealth * group$
LET lghealth=L10(health)
TEST lghealth * group$
SELECT

Following are the results (we omit the differences in means and confidence intervals):
Two-sample t test on HEALTH grouped by GROUP$

Group N Mean SD
Islamic 15 20.336 41.736
NewWorld 21 85.955 200.531

Separate Variance t = -1.456 df = 22.4 Prob = 0.159
Pooled Variance t = -1.243 df = 34 Prob = 0.222
Two-sample t test on SQHEALTH grouped by GROUP$

Group N Mean SD
Islamic 15 3.194 3.295
NewWorld 21 6.890 6.357

Separate Variance t = -2.271 df = 31.5 Prob = 0.030
Pooled Variance t = -2.057 df = 34 Prob = 0.047
Two-sample t test on LGHEALTH grouped by GROUP$

Group N Mean SD
Islamic 15 0.664 0.777
NewWorld 21 1.442 0.622

Separate Variance t = -3.214 df = 25.9 Prob = 0.003
Pooled Variance t = -3.338 df = 34 Prob = 0.002
The output includes the results for dollars spent per person for health as recorded
(HEALTH), in square root units (SQHEALTH), and in log units (LGHEALTH). In the
first panel, it appears that New World countries spend considerably more than Islamic
nations ($85.96 versus $20.34, on the average). For these untransformed samples,
however, this difference is not significant (t = 1.456, df = 22.4, p value = 0.159).
For the untransformed data, the standard deviation for the New World countries
(200.531) is almost five times larger than that for the Islamic nations (41.736). For
SQHEALTH, the former is approximately two times larger than the latter. For
LGHEALTH, the difference has reversed. The Islamic group exhibits more spread.
Some analysts might want to try a transform between a square root and a log. For
example,
LET cuberoot = HEALTH^.333
But remember, we are selecting a transform using samples of 15 and 21 per group, and
the results will have to be explained to others. The graphical displays and test results for
the logged data appear to be okay. We conclude that, on the average, New World
countries spend significantly more for health than do Islamic nations (t = 3.338, df = 34,
p value = 0.002 for data analyzed in log units).
Each Quick Graph combines three graphical displays: a box plot displays the
sample median, quartiles, and outliers (if any), a normal curve calculated using the
sample mean and standard deviation, and a dit plot that displays each observation. In
the box plots and normal curves, notice that the shapes of the Islamic and New World
distributions are most similar for the data in log units. Canada and Libya are far outside
values in the box plots for the raw data and the data in square root units. The log
transformation tames these outliers.
Example 4
Paired T Test
Do females live longer than males? For each of the 57 countries in the OURWORLD
data file, life expectancy is recorded for females and males. Each case (country) has
two measures in the same units (years of life expectancy), so we use the paired
comparison t test to test if the means are equal. We include box-and-whiskers plots to
illustrate any differences. The input is:
TTEST
BEGIN
DENSITY lifeexpf / BOX TRANS SCALE=2 AXES=2 ,
xmin=35 xmax=85 XLAB='Female Life Expectancy',
LOC=-2.5IN,0IN
DENSITY lifeexpm / BOX TRANS SCALE=2 AXES=2 ,
xmin=35 xmax=85 XLAB='Male Life Expectancy'
LET dif = lifeexpf - lifeexpm
DENSITY dif / BOX TRANS XLAB='Difference' LOC=2.5IN,0IN
END
GRAPH NONE
TEST lifeexpf lifeexpm
The output follows:

Paired samples t test on LIFEEXPF vs LIFEEXPM with 57 cases

 Mean LIFEEXPF = 70.123
 Mean LIFEEXPM = 65.088
 Mean Difference = 5.035 95.00% CI = 4.415 to 5.655
 SD Difference = 2.337 t = 16.264
 df = 56 Prob = 0.000

[Figure: box plots of female life expectancy and male life expectancy (both scaled from 35 to 85 years) and of the female-minus-male difference (-5 to 10 years).]

The graphs confirm that females tend to live longer, and that only two countries have
negative differences (males live longer than females).

In the sample, females, on the average, tend to live 70.123 years and males tend to
live 65.088 years. The mean difference between female and male life expectancy is
5.035 years. A 95% confidence interval for this difference in means extends from
4.415 years to 5.655 years.

The interval is computed as follows:

Mean Difference ± t{0.975; df} * (SD Difference)/SQR(n)

A difference of 5.035 years departs significantly from 0 (t = 16.264, df = 56,
p value < 0.005). Females do tend to live longer. To calculate the t statistic manually,
first, for each country, compute the difference between female and male life
expectancy. Then, compute the average and the standard deviation (SD) of the
differences. Finally, calculate t where n is the number of countries (or pairs):

t = (average difference)*SQR(n)/SD

The Bonferroni and Dunn-Sidak adjustments to probability levels are available for
protection for multiple testing.
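As a cross-check of the arithmetic above, here is a short Python sketch that reproduces the t statistic and the confidence interval from the rounded summary values in the output; the last digits can differ slightly from SYSTAT's, which works from the unrounded data.

import numpy as np
from scipy import stats

n, mean_diff, sd_diff = 57, 5.035, 2.337     # values reported in the output above
se = sd_diff / np.sqrt(n)

t_stat = mean_diff / se                                 # about 16.27
margin = stats.t.ppf(0.975, df=n - 1) * se              # about 0.62
print(t_stat, mean_diff - margin, mean_diff + margin)   # interval about 4.41 to 5.66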
Example 5
One-Sample T Test
Will Europe's population remain stable? You read that for the population to remain
stable, the ratio of the birth rate to the death rate should not exceed 1.25; that is, five
births for every four deaths. Should you reject the null hypothesis that the average
European birth-to-death ratio is 1.25? The input follows:

USE ourworld
TTEST
SELECT group$ = "Europe"
TEST b_to_d = 1.25

The output is:

One-sample t test of B_TO_D with 20 cases; Ho: Mean = 1.250

 Mean = 1.257 95.00% CI = 1.157 to 1.357
 SD = 0.213 t = 0.147
 df = 19 Prob = 0.884

[Figure: histogram of B_TO_D for the European countries, with counts on the vertical axis.]

The average birth-to-death ratio for the European countries in the sample is 1.257. We
are unable to reject the null hypothesis that the population value is 1.25 (t = 0.147,
df = 19, p value = 0.884). We have no evidence that Europe's population will increase
in size.
Do we reach the same conclusion for Islamic nations? Repeat the previous steps,
except specify Islamic as GROUP$. The output is:

One-sample t test of B_TO_D with 16 cases; Ho: Mean = 1.250

 Mean = 3.478 95.00% CI = 2.850 to 4.107
 SD = 1.179 t = 7.557
 df = 15 Prob = 0.000

[Figure: histogram of B_TO_D for the Islamic countries, with counts on the vertical axis.]

The average birth-to-death ratio for the Islamic countries is 3.478 (more than 2.5 times
greater than that of the Europeans). The Islamic birth-to-death ratio differs
significantly from 1.25 (t = 7.557, df = 15, p value < 0.0005). We anticipate a
population explosion among these nations.

(As in the other t tests, the Bonferroni and Dunn-Sidak adjustments to probability
levels are available for protection for multiple testing.)
References
Fisher, R. A. (1925). Statistical methods for research workers. London: Oliver and Boyd.
Freedman, D., Pisani, R., and Purves, R. (1980). Statistics. New York: W. W. Norton & Co.
Snedecor, G. W. and Cochran, W. G. (1989). Statistical methods (8th ed.). Ames: Iowa
State University Press.
Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before
1900. Cambridge: Harvard University Press.
Student. (1908). The probable error of a mean. Biometrika, 6, 1–25.
Chapter 31
Test Item Analysis
Herb Stenson
TESTAT provides classical analysis and logistic item-response analysis of tests that
are composed of responses to each of a set of test items (variables) by each of a set of
respondents (cases). Classical analysis provides test summary statistics, reliability
coefficients, standard errors of measurement for selected score intervals, item
analysis statistics, and summary statistics for individual cases. Graphical as well as
numerical displays are provided.
You also can score individual items for each respondent provided that test items
are of the right versus wrong variety. However, TESTAT is not limited to these kinds
of data; it will accept and analyze any sort of numerical variables that can be used in
SYSTAT. Thus, data from true-false tests, multiple-choice tests, rating scales,
physiological measures, etc., can all be analyzed with TESTAT using the classical test
theory model.
Analysis using logistic, item-response theory is implemented in TESTAT using an
iterative, maximum likelihood procedure to estimate item difficulties, item
discrimination indices, and subjects' abilities.
Either a one- or two-parameter logistic model can be selected. Item histograms can
be printed to examine the fit of each item to the model. TESTAT can save subject
scores into a SYSTAT file.
If you use BY, a test can be analyzed for any subgroups of respondents (cases) that
you specify. You also have the option of specifying subsets of items (variables) as a
subtest to be analyzed. TESTAT also can save item difficulties and discrimination
indices into a file for item banking.
Statistical Background
The two statistical approaches to analyzing data from psychological and educational
tests have been termed "classical" and "latent trait." The classical model assumes that
items are imperfect measurements of an underlying factor. Like common factor
analysis, a single theoretical (unobserved) factor is assumed to comprise a "true"
source of variation and random error accounts for the remaining variation in observed
scores. Since we cannot observe this "true" factor, we can estimate it by making
assumptions that the random errors are independent and, usually, normally distributed.
Thus, the sum of the item scores can yield an estimate of the "true" score.
The classical model has no role for items of different difficulty. Indeed, it is
assumed that any differences in responding to items is due to the ability of the subjects
and not to the difficulty of the items. Consequently, tests developed under the classical
model tend to have banks of items all of a similar average difficulty or response
pattern.
The latent trait model, on the other hand, postulates an underlying distribution that
relates item responses to a theoretical trait. This distribution is usually (as in TESTAT)
assumed to be logistic, but it can take other forms. In its parameterization, the latent
trait model specifically separates subject abilities (individual differences) and item
difficulties (scale differences). Tests developed under the latent trait model tend to
have a pool of items that vary in difficulty. Some items are failed (or not endorsed) by
most subjects and some are passed (or endorsed) by most subjects. Because of this, a
latent trait test is especially well suited for measuring larger ranges of abilities or
opinions. In addition, the latent trait model allows a more precise description of the
performance of an item than simply the item-test correlation. This helps in screening
for poor items in a test.
Because of its more elegant parameterization, the latent trait model is generally
regarded by test experts to be superior to the classical model for developing surveys
and tests of attributes. Indeed, despite the popularity of the classical model (and its
associated statistics such as Cronbach's alpha, item-test correlations, and factor
loadings) among nonprofessionals and applied researchers, the latent trait model is the
one used by the well-known psychological and educational testing organizations. The
continuing popularity of the older classical model may be due to its relative simplicity
and the lack of availability of latent trait software in the major statistical packages.
Until SYSTAT introduced latent trait modeling in a general statistical package, it was
confined to specialized software available at selected academic and commercial sites.
SYSTAT offers both methods, but we strongly recommend that you learn and apply
the latent trait model to develop tests that you intend to reuse.
Classical Model
The principal statistics in the classical model are reliability measures that represent
how well a set of items relate to each other (assuming that they all measure a common
factor). The reliability, or internal-consistency, coefficients that are produced by
TESTAT are the coefficient of correlation between the odd and even test scores, the
Spearman-Brown coefficient based on the odd-even correlation, the Guttman-Rulon
coefficient, coefficient alpha for all items, coefficient alpha for odd-numbered items,
and coefficient alpha for even-numbered items.
The Spearman-Brown coefficient is based on the assumption that the two halves of
the test are strictly parallel. The Guttman-Rulon coefficient is based on the assumption
that the two halves are parallel in every sense except for having different variances. We
call it the Guttman-Rulon coefficient here because the two different formulas for
computing it proposed by Guttman (1945) and Rulon (1939) are algebraically
equivalent. Coefficient alpha is the internal consistency measure proposed by
Cronbach (1951). It is algebraically equivalent to Formula 20 by Kuder and
Richardson (KR20) when the test data are dichotomously scored items.
Coefficient alpha deserves a little more discussion here. First, it should be noted that
while this coefficient cannot take on values greater than 1.0, it has no lower limit.
Therefore, it not only can take on negative values but it can take on negative values less
than -1.0, unlike the Pearson correlation coefficient. If you get a value of alpha less than
0, it is because a substantial number of test items have negative correlations with the
total test score (or with other items, which is the same thing). You can check the effect
of reverse scoring the offending items by using the KEY command with the "+" and "-"
option (as described later).
Second, a version of alpha called standardized alpha is often computed. This
coefficient reflects the average size of item-total correlations as opposed to item-total
covariances. TESTAT does not produce it for the following reasons. Alpha can be
interpreted as the lower limit of reliability for a test that is scored by summing the item
scores. If standardized alpha is computed, this coefficient is the lower limit of
reliability for a test that is scored by first converting the scores for each item so that all
items have equal variances, and then summing these converted scores. Thus, this latter
version of alpha does not accurately describe your test unless the items have equal
variances. If you need this coefficient, you could first use the DATA module to convert
each of your items to z scores and then run these data with TESTAT. The total scores
on the test will then be appropriate for the alpha that is computed, which will be the so-
called standardized alpha.
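Coefficient alpha is simple enough to verify by hand. The sketch below is a Python illustration using the usual computing formula, k/(k - 1) times (1 - sum of item variances / variance of the total score), on made-up binary items; it is not TESTAT's code. As suggested above, converting each item to z scores first gives the so-called standardized alpha.

import numpy as np

def coefficient_alpha(items):
    # items: cases-by-items matrix of scores
    X = np.asarray(items, dtype=float)
    k = X.shape[1]
    return k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum() / X.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
X = (ability + rng.normal(scale=1.5, size=(200, 5)) > 0).astype(float)   # 5 binary items

print(coefficient_alpha(X))                            # alpha for the raw items
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)       # convert each item to z scores
print(coefficient_alpha(Z))                            # the "standardized" version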
More information about all of these test statistics can be found in standard textbooks
such as those by Allen and Yen (1979) and Crocker and Algina (1986).
Latent Trait Model
The latent trait model assigns a probability distribution to responses to each item.
Usually, this is a logistic distribution, but it also can be normal. The following figure
shows distributions for five hypothetical items. Each curve displays the probability of
a correct response on each item by students of different levels of ability. Each item has
a common shaped curve based on the cumulative logistic (or normal) distribution
function. The only parameter distinguishing the curves is their location. Easier items
appear to the left and more difficult items appear to the right. The model generating
this graph is called the one-parameter, or Rasch, model.

[Figure: item characteristic curves for five hypothetical items under the one-parameter (Rasch) model; the probability of a correct response is plotted against ability (-5 to 5), and the curves differ only in location.]
Often, it is more plausible to assume that items vary in discrimination as well as in
difficulty. Items with steeper curves discriminate between subjects of different ability
more effectively than items with shallower curves. Not surprisingly, this is called a
two-parameter model. The following figure shows an example for five hypothetical
items. Notice that the second and third items from the left differ noticeably in
discrimination as well as in difficulty.
[Figure: item characteristic curves for five hypothetical items under the two-parameter model; the probability of a correct response is plotted against ability (-5 to 5), and the curves differ in both location (difficulty) and steepness (discrimination).]
TESTAT fits a one- or two-parameter model to binary responses on a test. The observed
data fall into only two categories. The model assumes that these observations were
generated by a continuous probability distribution. The computational machinery is
similar to that used in logistic regression, but a separate logistic curve must be fit for
every item on a test. Correspondingly, a separate curve must be fit for every subject.
The graph predicting subjects' response probabilities looks like the figures above,
except the x axis is Difficulty instead of Ability.
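The curves themselves are easy to generate. Below is a minimal Python sketch using the generic two-parameter logistic form, probability = 1 / (1 + exp(-a(ability - b))); TESTAT's internal scaling of a and b may differ, so treat this only as a picture of the model.

import numpy as np

def icc(ability, difficulty, discrimination=1.0):
    # Two-parameter logistic item characteristic curve; with a common
    # discrimination for every item it reduces to the one-parameter (Rasch) form.
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

theta = np.linspace(-5, 5, 11)
print(np.round(icc(theta, difficulty=-2.0), 2))                     # an easy item
print(np.round(icc(theta, difficulty=2.0), 2))                      # a hard item
print(np.round(icc(theta, difficulty=0.0, discrimination=2.5), 2))  # a highly discriminating item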
Test Item Analysis in SYSTAT
Classical Test Item Analysis Main Dialog Box
To open the Classical Test Item Analysis dialog box, from the menus choose:
Statistics
Test Item Analysis
Classical
Variable(s). Select a set of test items and move these into the Variable(s) list.
Key. You can alter the nature of the data by scoring each item response as correct or
incorrect or by reversing the scoring scale. For each variable enter a scoring key value.
Reliabilities. By default, split-half reliabilities and summary statistics are based on an
odd-even split. Instead of using the odd-even split, you can select Split-Half to use the
first half of the items versus the last half of items.
Save file. Saves subject scores into filename.SYD. The file will include on each record
the name of an item and its average score.
Logistic Test Item Analysis Main Dialog Box
To open the Logistic Test Item Analysis dialog box, from the menus choose:
Statistics
Test Item Analysis
Logistic
If your data are binary and are coded as zeros and ones, you can analyze your data
using item-response theory with the logistic function as the item characteristic curve.
Variable(s). Select a set of test items and move these into the Variable(s) list.
Key. You can alter the nature of the data by scoring each item response as correct or
incorrect or by reversing the scoring scale. For each variable enter a scoring key value.
Model Options. Choose between a one-parameter or a two-parameter model. If you
select One parameter logistic, the item discrimination index will be the same for every
item, but may change values during the iterative process due to rescaling of the
abilities. If you select Two parameter logistic, each item can have a different
discrimination index.
Estimation Options. The following can be specified:
• Steps. Indicate the maximum number of steps that are to be allowed.
• Iterations. Enter the maximum number of iterations allowed when estimating a
  single subject's ability or a single item's parameters within a stage.
• Converge. Specify the stopping convergence criterion. Setting a small convergence
  will decrease the number of steps required to reach a final set of estimates.
• LConverge. Specify a value for the likelihood of convergence. The default value is
  0.005. This means that if the likelihood of the data increases by less than 0.5 percent,
  the program will stop at the end of that step. That is, if the likelihood ratio is less
  than 1.005 at the end of a step, the program will stop and print out the most recent
  parameter estimates.
Save file. Saves subject scores into filename.SYD.
Using Commands
Select a data file using USE filename and continue with:
Usage Considerations
Types of data. By default, TESTAT will use whatever data are in the data set to perform
the analyses. However, if you want to alter the nature of these data by scoring each item
response as correct or incorrect, or by reversing the scoring scale, you can use KEY. It
has two forms.
The first form is used as a scoring key to score each item response as a 0 or a 1,
which can mean incorrect and correct, or any other meaningful binary designation.
To use this form, you must provide the scoring key as a sequence of non-negative
TESTAT
MODEL varlist
KEY (values)
ESTIMATE / CLASSICAL or LOG1 or LOG2,
HALF STEPS=n ITER=n CONVERGE=d LCONVERGE=d
numbers corresponding in a one-to-one fashion to the sequence of items on the test (or
subtest). The numbers in your data set must not be negative.
Suppose, for example, that your data set consists of five questions that must be
answered "true" or "false" and that you have coded the respondents' answers as 0s and
1s. If the correct answer to questions 1, 2, and 4 is "true," and the correct answer to
the remainder of the questions is "false," then you would type the KEY command prior
to the ESTIMATE command as follows:
This would cause the item responses to be scored according to your scoring key prior
to the analysis by the ESTIMATE command. If you want to create a SYSTAT data set
containing the scored items, you must also precede the ESTIMATE command with a
SAVE command, naming the data set into which the scored data are to be saved. This
data set will contain 1s and 0s indicating correct or incorrect responses for each item
and case.
In a similar fashion, the responses to multiple-choice items can also be scored as 0s
or 1s using the scoring key. If, for example, your five-question test was made up of
four-alternative multiple-choice items, then you can use the numbers 1 through 4 to
indicate the correct answers in the scoring key. Of course, the respondents' answers
must also be entered into the input data set as the numbers 1, 2, 3, or 4. Suppose that
your input data set containing responses to the five questions was named MYFILE.
Then the following commands would score these data as 0s and 1s, save them into a
SYSTAT data set named SCORED, and produce test and item histograms:
The data set that is saved (SCORED in this case) will contain as an extra variable the
total score for each subject (case).
The second form of the KEY command is used to reverse the scoring of selected
items. It can be used when the largest data values for one item indicate the same thing
as the smallest data values for another item. This scoring key consists of a sequence of
+ and signs to indicate that the item scores are to be multiplied by a +1 or 1,
thus reversing the direction of scoring in the case of 1. Reversing the scoring scale in
this way will not affect item variances, but it will alter the possible ranges of item
means and total scores. Thus, you might use this method to check the effect on alpha
KEY = 1,1,0,1,0
USE MYFILE
TESTAT
MODEL var1,var2,var3,var4,var5
SAVE SCORED
KEY = 3,1,4,2,1
ESTIMATE
of reversing one or more items. If this increases alpha and it makes sense in the context
of the test, you may want to use the DATA module to change the scoring of such items
so that the highest response score possible would be replaced by the lowest response
score possible and so on for the lowest and any intermediate responses.
For example, the following commands will save the input from five items,
multiplied by their corresponding weights of +1 and -1, into the data set WEIGHT, and
produce the default output of ESTIMATE using first-half, last-half as the split-half
option:

SAVE WEIGHT
KEY +,-,-,+,+
ESTIMATE / HALF
Print options. The default output statistics consist of summary statistics for the test and
a set of reliability (internal consistency) coefficients. The output statistics are the mean,
standard deviation, standard error of the mean, maximum and minimum values, and
the number of cases on which these were computed for the following summary
variables: total score (summed across the variables), total score/number of items, total
score on odd-numbered items, and total score on even-numbered items.
If the total number of items (variables) in the data set is odd, then the total score for
odd-numbered items will be based on one more item than the total score for even-
numbered items.
Note that the standard deviations, in keeping with tradition in test theory, are based
on sums of squared deviations divided by N, rather than N - 1. To give unbiased
estimates, the standard errors of means are computed by dividing the standard
deviations by the square root of N - 1.
If you want to see item analysis statistics in addition to the test statistics, use PRINT
LONG.
The first set of additional data that will be provided when you use PRINT LONG is
the approximate standard error of measurement for total test scores in each of 15 score
intervals. These intervals are each 1/2 standard deviation wide and are centered at the
mean. Thus, they are the so-called Stanine intervals. The intervals are shown in both z
score and total score metrics, so that, even if you have no need for these standard errors
of measurement, the table will be useful for seeing how various total scores translate
into z scores.
The standard error of estimate shown for an interval is the square root of the average
squared difference between odd and even scores (or first minus last half for cases
whose total score is in the interval). This is a method recommended by Livingston
(1982) and studied empirically by Lord (1984). Lord showed that standard errors of
estimate computed by Livingston's method approximate the standard errors of
estimate that he got using a three-parameter logistic model to analyze a large set of
achievement test data. However, Lord cautions against the use of these estimates if the
number of cases in an interval is small or if the interval is near the minimum or
maximum total score that is possible.
The second set of additional data that is provided when you use PRINT LONG is a
set of item statistics that are useful in performing an item analysis of a test. Shown for
each item are the item mean and standard deviation, the correlation of the item with the
total score, the item reliability index (item-total correlation times standard deviation),
the item-total correlation if the item is excluded from the total, and the value of
coefficient alpha if the item is excluded from the test.
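Two of these quantities are easy to mimic directly. The following Python sketch (hypothetical data and my own function, not TESTAT's) computes, for each item, the correlation with the total score excluding that item and coefficient alpha with the item deleted.

import numpy as np

def item_analysis(items):
    X = np.asarray(items, dtype=float)
    total = X.sum(axis=1)
    results = []
    for j in range(X.shape[1]):
        rest = total - X[:, j]                      # total score with item j excluded
        r_excluded = np.corrcoef(X[:, j], rest)[0, 1]
        others = np.delete(X, j, axis=1)
        k = others.shape[1]
        alpha_excluded = k / (k - 1) * (1 - others.var(axis=0, ddof=1).sum()
                                        / others.sum(axis=1).var(ddof=1))
        results.append((r_excluded, alpha_excluded))
    return results

rng = np.random.default_rng(1)
X = (rng.normal(size=(150, 1)) + rng.normal(size=(150, 6)) > 0).astype(float)
for r, a in item_analysis(X):
    print(round(r, 3), round(a, 3))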
Quick Graphs. If your input data are binary right versus wrong data, each item plot
shows the percentage of the cases in a z-score interval that got the item correct. That
is, the axis labeled "Scaled Mean-Item Score" shows the percentage correct for each
interval. However, if your data are not of the right versus wrong variety, then the
"Scaled Mean-Item Score" is the mean-item score for cases in an interval, scaled so that
its minimum possible value is 0 and its maximum possible value is 100. (The minimum
and maximum values are found by locating the largest and smallest data values that
exist in the input data). Note that the N and percentage listed next to the histograms are
the number of cases and percentage of cases with scores in an interval, not the
percentage correct. The column labeled SCORE gives the actual score.
For the latent trait models, Quick Graphs of the fitted logistic curves are plotted for
each item in a grouped array.
Saving files. You can save average item scores (difficulties or, in the case of binary
items, p values) into a SYSTAT file. The file will include, on each record, the name
of an item and its average score.
BY groups. TESTAT analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. TESTAT uses the FREQ variable, if present, to duplicate cases. This
inflates the total degrees of freedom to be the sum of the number of frequencies. Using
a FREQ variable does not require more memory, however.
Case weights. TESTAT weights sums of squares and cross products using the WEIGHT
variable for rectangular data input. It does not require extra memory.
Examples
Example 1
Classical Test Analysis
The following data are reports of fear symptoms by selected United States soldiers
after being withdrawn from World War II combat. The data were originally reported by
Suchman in Stouffer et al. (1950). The variable COUNT contains the number of
soldiers in each profile of symptom reports.
Notice that we use the FREQ command to implement the case weighting variable
COUNT. TESTAT weights the cases according to this count before computing
statistics. We also save the estimates. The input is:
USE COMBAT
TESTAT
MODEL POUNDING..URINE
FREQ=COUNT
IDVAR=COUNT
SAVE TEMP/ITEM
ESTIMATE/CLASSICAL

Following is the output:
Variables in the SYSTAT Rectangular file are:
POUNDING SINKING SHAKING NAUSEOUS STIFF FAINT
VOMIT BOWELS URINE COUNT

Case frequencies determined by value of variable COUNT.

Data below are based on 93 complete cases for 9 data items.

Test score statistics

Total Average Odd Even
Mean 4.538 0.504 2.473 2.065
Std Dev 2.399 0.267 1.333 1.277
Std Err 0.250 0.028 0.139 0.133
Maximum 9.000 1.000 5.000 4.000
Minimum 1.000 0.111 0.0 0.0
N cases 93.000 93.000 93.000 93.000

Internal consistency data

Split-half correlation 0.690
Spearman-Brown Coefficient 0.816
Guttman (Rulon) Coefficient 0.816
Coefficient Alpha - all items 0.787
Coefficient Alpha - odd items 0.613
Coefficient Alpha - even items 0.661

Approximate standard error of measurement of total score
for 15 z score intervals

z score  Total score    N   Std Error
 -3.750    -4.458       0      .
 -3.250    -3.258       0      .
 -2.750    -2.059       0      .
 -2.250    -0.860       0      .
 -1.750     0.340      10    1.000
 -1.250     1.539      16    1.000
 -0.750     2.739       6    1.000
 -0.250     3.938      29    1.390
  0.250     5.137      10    1.095
  0.750     6.337       8    1.000
  1.250     7.536       8    0.0
  1.750     8.735       6    1.000
  2.250     9.935       0      .
  2.750    11.134       0      .
  3.250    12.334       0      .

Item reliability statistics

                                   Item-   Item     Excl    Excl
                                   Total   Reliab   Item    Item
Item  Label      Mean   Std Dev    R       Index    R       Alpha
 1    POUNDING   0.903   0.296     0.331   0.098    0.215   0.794
 2    SINKING    0.785   0.411     0.499   0.205    0.354   0.782
 3    SHAKING    0.559   0.496     0.678   0.336    0.539   0.757
 4    NAUSEOUS   0.613   0.487     0.721   0.351    0.599   0.747
 5    STIFF      0.538   0.499     0.693   0.346    0.559   0.754
 6    FAINT      0.452   0.498     0.715   0.356    0.588   0.749
 7    VOMIT      0.376   0.484     0.622   0.301    0.472   0.767
 8    BOWELS     0.215   0.411     0.625   0.257    0.502   0.763
 9    URINE      0.097   0.296     0.503   0.149    0.402   0.777

Use PRINT=LONG to see item histograms for this test.

Example 2
Logistic Model (One Parameter)
If your data are binary and are coded as 0s and 1s or recoded with the KEY command,
you can analyze your data using item-response theory with the LOGISTIC function as
the item characteristic curve. Either a one-parameter (Rasch) model or a two-parameter
logistic model can be implemented by using the MODEL command. The one-parameter
model is the default. The input is:

USE COMBAT
TESTAT
MODEL POUNDING..URINE
FREQ=COUNT
IDVAR=COUNT
SAVE TEMP/ITEM
ESTIMATE/LOG1
Under the single-parameter logistic model, the item discrimination index will be the
same for every item but may change values during the iterative process due to rescaling
of the abilities. The initial values of all parameters are computed by a technique given
by Cohen (1979) to approximate the abilities and item difficulties of a one-parameter
logistic model. They are scaled to have a mean of 0 and a standard deviation of 1 for
the ability estimates.
Following is the output:
Case frequencies determined by value of variable COUNT.

93 cases were processed, each containing 9 items
6 cases were deleted by editing for missing data or for zero or
Perfect total scores after item editing.
0 items were deleted by editing for missing data or for zero or
Perfect total scores after item editing.

Data below are based on 87 cases and 9 items

Total score mean = 4.230, standard deviation = 2.164

-Log(Likelihood) using initial parameter estimates = 270.981602

STEP 1 convergence criterion = 0.050000

Stage 1: estimate ability with item parameter(s) constant.

-Log(Likelihood) Change Likelihood Ratio
270.070977 -0.910626 2.485877

Greatest change in ability estimate was for case 87

Change from old estimate = 0.134095 , current estimate = 2.005331

Stage 2: estimate item parameter(s) with ability constant.

-Log(Likelihood) Change Likelihood Ratio
269.662219 -0.408757 1.504946

Greatest change in difficulty estimate was for item BOWELS
Change from old estimate = 0.084109, current estimate = 1.301014
Current value of discrimination index = 1.205582

STEP 2 convergence criterion = 0.050000

Stage 1: estimate ability with item parameter(s) constant.

-Log(Likelihood) Change Likelihood Ratio
269.590283 -0.071937 1.074588

Greatest change in ability estimate was for case 80

Change from old estimate = 0.006024 , current estimate = 2.011354

Stage 2: estimate item parameter(s) with ability constant.

-Log(Likelihood) Change Likelihood Ratio
269.548875 -0.041408 1.042277

Greatest change in difficulty estimate was for item BOWELS
Change from old estimate = 0.031751, current estimate = 1.315291
Current value of discrimination index = 1.225624
Three levels of the iterative process must be distinguished here. The program operates
in what are labeled STEPS in the output. Each step consists of two stages. In stage 1,
the subjects' abilities are estimated, one at a time, holding the item parameters constant
at their most recent values. At the end of stage 1, the resulting abilities are rescaled so
as to have a mean of 0 and a standard deviation of 1. The item parameters are also
rescaled to conform to the new ability scale. In stage 2, the item parameter(s) are
estimated, one item at a time, holding the abilities constant at their most recent values.
A new step is then begun, if necessary, in which this two-stage process is repeated.
Within each stage is the third level of the iterative process, called ITER. Here, as a
single ability (in stage 1) or a single item's parameters (in stage 2) are being estimated,
successive iterations are performed until the parameter being estimated does not
change by more than a tolerance value called TOL. When this criterion is met, the
program moves on to estimate the ability for the next case (in stage 1) or the next item's
parameters (in stage 2). This iterative process is repeated within a stage until the data
are exhausted. Then the next stage is begun.
There are two criteria for stopping the stepwise process. At the end of each stage,
the likelihood of obtaining the test data that are in the input data set is computed, given
the current values of all parameters. The negative logarithm of this likelihood and the
change in this value from the previous stage are printed. The ratio of the current
likelihood to the previous likelihood is also computed and printed. If, at the end of a
step (after stage 2), this likelihood ratio is less than a value specified by a stopping
criterion called LCONVERGE, no further steps are run, and the final item parameters
are printed. This is the first stopping criterion.
The second stopping criterion relies on the maximum change in parameter estimates
between stages. At the end of a stage, the maximum change in the parameters being
estimated in that stage is printed. If, at the end of a step, no parameter estimated in
either stage of that step changed more than the value of CONVERGE, the stagewise
process is terminated, and the final item parameters are printed. Thus, the program will
stop entering new steps whenever either of the two stopping criteria is met, whichever
occurs first.
The final parameter estimates are, thus, a type of maximum likelihood estimate.
However, you should realize that because the process alternates between estimating
item parameters and abilities, the final parameter estimates are not a true maximum
likelihood estimate of all parameters simultaneously. As with other programs that use
this same type of alternating estimation technique, the process does converge for all
but very unusual data sets.
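To make the structure of steps, stages, and stopping rules concrete, here is a schematic numpy sketch of an alternating scheme of this kind for the one-parameter (Rasch) case. It is an illustration under simplifying assumptions, not TESTAT's actual algorithm (which is based on Fletcher-Powell minimization); the argument names converge and lconverge merely echo the CONVERGE and LCONVERGE criteria described above, and their default values here are arbitrary.

import numpy as np

def neg_log_likelihood(x, theta, b):
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def rasch_alternating(x, max_steps=20, converge=0.05, lconverge=1.001):
    """x: cases-by-items 0/1 matrix with no zero or perfect total scores."""
    x = np.asarray(x, dtype=float)
    n_cases, n_items = x.shape
    theta = np.zeros(n_cases)                 # abilities
    b = np.zeros(n_items)                     # item difficulties
    prev_nll = neg_log_likelihood(x, theta, b)
    for step in range(max_steps):
        # Stage 1: update abilities, holding the item difficulties fixed (one Newton step each).
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        d_theta = (x - p).sum(axis=1) / (p * (1 - p)).sum(axis=1)
        theta = theta + d_theta
        # Rescale abilities to mean 0, SD 1 and rescale the difficulties to match.
        mu, sd = theta.mean(), theta.std() or 1.0
        theta, b = (theta - mu) / sd, (b - mu) / sd
        # Stage 2: update item difficulties, holding the abilities fixed.
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        d_b = -(x - p).sum(axis=0) / (p * (1 - p)).sum(axis=0)
        b = b + d_b
        # Stopping rules: likelihood ratio for the step, and largest parameter change.
        nll = neg_log_likelihood(x, theta, b)
        ratio = np.exp(prev_nll - nll)
        prev_nll = nll
        if ratio < lconverge or max(np.abs(d_theta).max(), np.abs(d_b).max()) < converge:
            break
    return theta, b

Each pass through the loop corresponds to one STEP; the inner iterations that TESTAT runs for each single ability or item (the ITER level) are collapsed here into one vectorized update per stage.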
Example 3
Logistic Model (Two Parameter)
The 20-item version of the Social Desirability Scale described by Strahan and Gerbasi
(1972) was administered as embedded items in another test to 359 undergraduate
students in psychology. The social desirability items were scored for the social
desirability of the response and coded as 0s and 1s in a SYSTAT data set named
SOCDES.SYD. The following commands were used to produce the output for this
example:
USE SOCDES
TESTAT
MODEL X(1..20)
SAVE TEMP / ITEMS
ESTIMATE / LOG2,STEP=2,CONVERGE=.1

Following is the output:
359 cases were processed, each containing 20 items
4 cases were deleted by editing for missing data or for zero or
Perfect total scores after item editing.
0 items were deleted by editing for missing data or for zero or
Perfect total scores after item editing.

Data below are based on 355 cases and 20 items

Total score mean = 9.386, standard deviation = 3.992

-Log(Likelihood) using initial parameter estimates = 3634.928345

STEP 1 convergence criterion = 0.100000

Stage 1: estimate ability with item parameter(s) constant.

-Log(Likelihood) Change Likelihood Ratio
3634.122209 -0.806136 2.239239

Greatest change in ability estimate was for case 22

Change from old estimate = -0.105724 , current estimate = 2.956047

Stage 2: estimate item parameter(s) with ability constant.

-Log(Likelihood) Change Likelihood Ratio
3622.569856 -11.552353 104021.506702

Greatest change in difficulty estimate was for item X(19)
Change from old estimate = 0.021791, current estimate = 0.946823
Greatest change in discrimination estimate was for item X(8)
Change from old estimate = -0.163600, current estimate = 0.530705

STEP 2 convergence criterion = 0.100000

Stage 1: estimate ability with item parameter(s) constant.

-Log(Likelihood) Change Likelihood Ratio
3619.922513 -2.647343 14.116484

Greatest change in ability estimate was for case 66

Change from old estimate = -0.180754 , current estimate = -2.265529

Stage 2: estimate item parameter(s) with ability constant.

-Log(Likelihood) Change Likelihood Ratio
3612.343024 -7.579488 1957.627103

Greatest change in difficulty estimate was for item X(4)
Change from old estimate = -0.580964, current estimate = -1.770080
Greatest change in discrimination estimate was for item X(4)
Change from old estimate = -0.153424, current estimate = 0.407619
You can see that the second item discriminates better than the first and that both items
seem to fit the model moderately well.
Computation
All calculations are in double precision arithmetic, with provisional algorithms used
for calculating all means and sums of squares that are needed. The formulas for all of
the statistics that are shown in the output can be found in the references given below.
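The phrase "provisional algorithms" refers to one-pass updating formulas for the mean and the sum of squared deviations; a common form of such an update (often attributed to Welford) looks like this sketch, which is illustrative rather than SYSTAT's exact code:

def provisional_mean_ss(values):
    """One-pass (provisional) mean and sum of squared deviations about the mean."""
    n, mean, ss = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n            # provisional mean after n values
        ss += delta * (x - mean)     # provisional sum of squares
    return mean, ss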
Algorithms
Provisional algorithms are used for means and sums of squares. The calculations for
the classical and logistic models are as follows:
Classical Model
Your data must have at least four variables (test items). The number of cases
(respondents) must be at least two. Cases with missing data are not used in any of the
statistical analyses. Such cases are identified in the case by case listing, if this listing
is requested. If you want to substitute a value, such as 0, for missing data, you should
do this when you create the SYSTAT data set in the DATA module.
During the calculations, TESTAT creates two temporary data sets on your data disk.
Together, they are about as large as your input data set, so you should make sure that
there is enough room for them on your disk.
Logistic Model
While the number of variables (items) may be as small as 4, unreliable results will be
obtained if the number is less than about 20. The minimum number of cases
(respondents) is two, but this is obviously far too small a number for reliable results.
As with the classical model, cases with missing data are not used in any of the
calculations. In addition, the item-response routines require that no case have either a
0 or perfect total score on the test or subtest being analyzed. Thus, an editing routine
finds and marks such cases for exclusion from the analysis. Likewise, any item
(variable) that is responded to in exactly the same way by all respondents must be
excluded from the analysis. The editing routine looks for such items after first
excluding offending cases. Once any such items are marked for exclusion, the routine
again looks for inappropriate cases, using only the remaining items. It iterates in this
fashion until no inappropriate cases or items remain. Any items or cases that have been
excluded from the analysis are reported by the output routines. The same temporary
data sets that are mentioned above for the classical model are also created for the
logistic model. Make sure that your disk has room for them.
The algorithm for finding the maximum likelihood (actually, the minimum of the
negative logarithm of the likelihood) for each ability in stage 1 and for each item's
parameter(s) in stage 2 is based on Fletcher-Powell minimization (Press et al., 1986).
The logistic model that is used in this program is the now-familiar two-parameter
formula found in Lord (1980), Hulin, Drasgow, and Parsons (1983), and many other
references.
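In the notation usually used for this model (the exact presentation in TESTAT may differ, and some treatments insert a scaling constant of 1.7 in the exponent), the probability of a correct response to item j by a person with ability θ is:

\[
P_j(\theta) = \frac{1}{1 + e^{-a_j(\theta - b_j)}}
\]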
The discrimination parameter for an item is a and the difficulty is b, while the
subject's ability is θ. In TESTAT, the function to be minimized is designed to place
limits on the values of θ and a by driving the iterative routine away from estimates
greater than these limits. The limits are 6.00 for the absolute value of θ and 3.00 for the
absolute value of a, the discrimination index. If your data imply a lot of items with
extreme values of a, or a lot of extreme values of θ, you may find that the program will
start to oscillate around some value of the likelihood ratio that is not less than the
stopping value. You cannot change these limits because making them very much
larger could result in illegally large values of the exponent in the model.
As with any iterative estimation procedure, you should beware of local minima. If
you suspect that such a problem exists after inspecting your output, try running the first
few steps with a very large value of CONVERGE and then switching to a smaller value.
Missing Data
Any case with missing values on any item is deleted.
References
Allen, M. J. and Yen, W. M. (1979). Introduction to measurement theory. Belmont, Calif.: Wadsworth.
Cohen, L. (1979). Approximate expressions for parameter estimates in the Rasch model. British Journal of Mathematical and Statistical Psychology, 32, 113-120.
Coombs, C. H., Dawes, R. M., and Tversky, A. (1970). Mathematical psychology: An elementary introduction. Englewood Cliffs, N.J.: Prentice-Hall, Inc.
Crocker, L. and Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255-282.
Hulin, C. L., Drasgow, F., and Parsons, C. K. (1983). Item response theory: Application to psychological measurement. Homewood, Ill.: Dow Jones-Irwin.
Livingston, S. (1982). Estimation of conditional standard error of measurement for stratified tests. Journal of Educational Measurement, 19, 135-138.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, N.J.: Erlbaum.
Lord, F. M. (1984). Standard error of measurement at different ability levels. Technical Report Number RR-84-8. Princeton, N.J.: Educational Testing Service.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1986). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.
Rulon, P. J. (1939). A simplified procedure for determining the reliability of a test by split-halves. Harvard Educational Review, 9, 99-103.
Stouffer, S. A., Guttman, L., Suchman, E. A., Lazarsfeld, P. F., Star, S. A., and Clausen, J. A. (1950). Measurement and prediction. Princeton, N.J.: Princeton University Press.
Strahan, R. and Gerbasi, K. C. (1972). Short, homogeneous versions of the Crowne-Marlowe social desirability scale. Journal of Clinical Psychology, 28, 191-193.
Chapter 32
Time Series
Leland Wilkinson and Yuri Balasanov
Time Series implements a wide variety of time series models, including linear and
nonlinear filtering, Fourier analysis, seasonal decomposition, nonseasonal and
seasonal exponential smoothing, and the Box-Jenkins (1976) approach to
nonseasonal and seasonal ARIMA. You can save results from transformations,
smoothing, the deseasonalized series, and forecasts for use in other SYSTAT
procedures.
The general strategy for time series analysis is to:
n Plot the series using T-plot, ACF, PACF, or CCF.
n Transform the data to stabilize the variance across time or to make the series
stationary using Transform.
n Smooth the series using moving averages, running medians, or general linear
filters using LOWESS or Exponential smoothing.
n Fit your model using ARIMA.
n Examine the results by plotting the smoothed or forecasted results.
Before performing a particular time series analysis, you can specify how missing
values should be handled.
n Interpolate. Interpolates missing values by using DWLS (Distance Weighted
Least Squares). DWLS interpolates by locally quadratic approximating curves
that are weighted by the distance to each nonmissing point in the series. With this
algorithm, all nonmissing values in the series contribute to the missing data
estimates, and thus complex local features can be modeled by the interpolant.
n Delete. Prevents interpolation and only the leading nonmissing values are retained
for analysis. In series that begin with one or more missing values, the series is
deleted from the first missing value following one or more nonmissing values. This
option enables you to forecast missing values from a nonmissing subsection of the
series, for example. You can then insert these forecasts into the series and repeat
the procedure later in the series if necessary.
Statistical Background
Time series analysis can range from the purely exploratory to the confirmatory testing
of formal models. Series encompasses both exploratory and confirmatory methods.
Among the exploratory methods are smoothing and plotting. Among confirmatory
models are two general approaches: time domain and frequency domain. In time-
domain models, we examine the behavior of variables over time directly. In frequency-
domain models, we examine frequency (periodic) components contributing to a time
series.
Time-domain (autoregressive, moving average, and trend) models represent a series
as a function of previous points in the same series or as a systematic trend over time.
Time-domain models can fit complex patterns of time series with just a few
parameters. Makridakis, Wheelwright, and McGee (1983), McCleary and Hay (1980),
and Nelson (1973) introduce these models, while Box and Jenkins (1976) provide the
primary reference for ARIMA models.
Frequency-domain (spectral) models decompose a series into a sum of sinusoidal
(waveform) elements. These models are particularly useful when a series arises from
a relatively small set of cyclical functions. Bloomfield (1976) introduces these models.
In this introduction, we will discuss exploratory methods (smoothing), time-domain
models (ARIMA, seasonal decomposition, exponential smoothing), and frequency-
domain (Fourier) models.
Smoothing
Smoothing is a complex topic whose applications exceed the space available here; consult Velleman
and Hoaglin (1981) or Bloomfield (1976) for more complete discussions.
Moving Averages
One of the simplest smoothers is a moving average. If a data point consists of a smooth
component plus random error, then when we average several points surrounding that point,
the errors should tend to cancel each other out.
Here are two possible moving averages three and four points wide. The window
shows which points are being averaged. The boldface shows which point in the series
is replaced with the average.
Three-point window
Series      y1  y2  y3  y4  y5  y6  y7  y8  y9
Window      y1  y2  y3
                y2  y3  y4
                    y3  y4  y5
                        y4  y5  y6
                            y5  y6  y7
                                y6  y7  y8
                                    y7  y8  y9
New series  y1  x2  x3  x4  x5  x6  x7  x8  y9

Four-point window
Series      y1  y2  y3  y4  y5  y6  y7  y8  y9  y10
Window      y1  y2  y3  y4
                y2  y3  y4  y5
                    y3  y4  y5  y6
                        y4  y5  y6  y7
                            y5  y6  y7  y8
                                y6  y7  y8  y9
                                    y7  y8  y9  y10
New series  y1  y2  x3  x4  x5  x6  x7  x8  x9  y10

Notice that the four-point window does not have a point in the series at its center.
Consequently, we replace the right point of the two in the middle with the average of
the four points. This rule is followed for all even windows except two-point windows.
Two-point windows can thus be used to shift asymmetrical smoothings back to the left.
If you prefer algebra, then the following description shows how the three-point
window smooths y into x:

x1 = y1
x2 = (y1 + y2 + y3) / 3
x3 = (y2 + y3 + y4) / 3

Notice also that the first and last points in the series are unchanged by the three-point
window of moving averages. The four-point window leaves the first two and last two
points unchanged.
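The same three-point running mean can be written in a few lines of illustrative Python (not SYSTAT syntax):

def running_mean3(y):
    """Three-point moving average; the first and last points are left unchanged."""
    x = list(y)
    for i in range(1, len(y) - 1):
        x[i] = (y[i - 1] + y[i] + y[i + 1]) / 3.0
    return x

A four-point version would replace x[i] with the mean of y[i-2] through y[i+1], leaving the first two and last two points unchanged.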
Weighted Running Smoothing
If you know something about filter design (see Bloomfield, 1976), you can construct a
more general linear filter by using weights. In the examples, we illustrate seven- and
four-point moving averages with equal weights.
The smoothings in the examples used even weights of 1 for each member in the
window since we did not specify otherwise. We could, however, set these weights to
any real number; for example, 1,2,1. Some of you may recognize these as Hanning
weights (Chambers, 1977; Velleman and Hoaglin, 1981). It is possible to show
algebraically that weighting by (1,2,1) in a three-observation window is the same as
smoothing twice with equal weights in a two-observation window. The DWLS
smoothing method for graphics is a form of weighting in which weights are determined
by distance weighted least squares.
Running Median Smoothers
Now, let's look at another smoother: running medians. Sometimes it's handy to have
a more robust filter when you suspect the data do not contain Gaussian noise. You can
choose this filter with the Median option. It works like the Mean option, except the
values in the series are replaced by the median of the window instead of the mean.
Can you see why running mean and running median smoothers with a window of
two are the same?
We can use combinations of these smoothers to construct more complex nonlinear
filters. The following sequence of smoothings comprises a nonlinear filter because it
doesn't involve a simple weighted average of the values in a window (except for the
final Hanning step). It uses a combination of running medians instead:
Running median smoother, window 4
Running median smoother, window 2
Running median smoother, window 5
Running median smoother, window 3
Running means smoother, window 3, weights 1, 2, 1
You can read about this filter (called 4253H) in Velleman and Hoaglin (1981). It is due
to the work of Tukey (1977). It happens to be a generally effective compound smoother
because it clears outliers out of the sequence in the early stages and polishes up the
smooth later. Velleman and Hoaglin use this smoother twice on the same data by
smoothing the data, smoothing the residuals from this smooth, and adding the two
together. You can do this by using Save with the last smoothing to save the smoothed
values into a SYSTAT file. You can then merge the files and compute residuals. In the
final step, you can smooth the residuals.
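The 4253H sequence can be sketched as follows (illustrative Python; SYSTAT carries out these steps with its Median and Mean smoothing options, and the handling of the two-point window here simply follows the description given earlier in this section):

import statistics

def running_median(y, w):
    """Running median of width w; for even widths the right of the two middle points is
    replaced, except that a two-point window replaces the left point (shifting the smooth left)."""
    x = list(y)
    left = w // 2 if w != 2 else 0
    right = w - 1 - left
    for i in range(left, len(y) - right):
        x[i] = statistics.median(y[i - left:i + right + 1])
    return x

def hanning(y):
    """Running mean of width 3 with weights 1, 2, 1."""
    x = list(y)
    for i in range(1, len(y) - 1):
        x[i] = (y[i - 1] + 2 * y[i] + y[i + 1]) / 4.0
    return x

def smooth_4253h(y):
    for w in (4, 2, 5, 3):
        y = running_median(y, w)
    return hanning(y)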
LOWESS Smoothing
Cleveland (1979) presented a method for smoothing values of Y paired with a set of
ordered X values. Chambers et al. (1983) introduce this technique and present some
clear examples. If you are not a statistician, by the way, and want some background
information on recent advances in statistics, read the Chambers book (and Velleman
and Hoaglin if you don't know about Tukey's work).
Scatterplot smoothing allows you to look for a functional relation between Y and X
without prejudging its shape (or its monotonicity). The method for finding smoothed
values involves a locally weighted robust regression. SYSTAT implements
Clevelands LOWESS algorithm on equally spaced data values. You can also use
LOWESS on scatterplots of unequally spaced data values.
ARIMA Modeling and Forecasting
The following data show the U.S. birth rate (per 1000) for several decades during and
following World War II. They were compiled from federal statistics, principally the
U.S. census.
YEAR  RATE    YEAR  RATE
1943  22.7    1965  19.4
1944  21.2    1966  18.4
1945  20.4    1967  17.8
1946  24.1    1968  17.5
1947  26.6    1969  17.8
1948  24.9    1970  18.4
1949  24.5    1971  17.2
1950  24.1    1972  15.6
1951  24.9    1973  14.9
1952  25.1    1974  14.9
1953  25.1    1975  14.8
1954  25.3    1976  14.8
1955  25.0    1977  15.4
1956  25.2    1978  15.3
1957  25.3    1979  15.9
1958  24.5    1980  15.9
1959  24.3    1981  15.9
1960  23.7    1982  15.9
1961  23.3    1983  15.5
1962  22.4    1984  15.7
1963  21.7    1985  15.7
1964  21.0

These data are a time series because they comprise values on a variable distributed
across time. How can you use these data to forecast birth rates up to the year 2000? A
popular statistical method for such a forecast is linear regression. Let's try it. Here is a
plot of birth rates against year with the least squares line. The data points are connected
so that you can see the series more clearly.
[Plot of RATE against YEAR (1940 to 2000) with the least-squares line; the data points are connected]
What's wrong with this forecasting method? You may want to read Chapter 14 (if you
haven't already). There, we discussed assumptions needed for estimating a model
using least squares. We can legitimately fit a line to these data by least squares for the
explicit purpose of getting predicted values on the line as close as possible, on average,
to observed values in the data. In forecasting, however, we want to use a fitted model
to extrapolate beyond the series. The fitted linear model is:

RATE = 579.342 - 0.285 * YEAR

If we want our estimates of the slope and intercept in this model to be unbiased, we
need to assume that the errors (ε) in the population model are independent of each other
and of YEAR. Does our data plot give us any indication of this?
On the contrary, it appears from the data that the randomness in this model is related
to YEAR. Take any two adjacent years' data. On average, if there is an underprediction
one year, there will be an underprediction the next. If there is overprediction one year,
there is likely to be overprediction the next. These data clearly violate the assumption
of independence in the errors.
Autocorrelation
There is a statistical index that reveals how correlated the residuals are. It is called the
autocorrelation. The first-order autocorrelation is the ordinary Pearson correlation of
a series of numbers with the same series shifted by one observation (y2, y1; y3, y2; ...;
yn, yn-1). In our residuals from the linear model, this statistic is 0.953. If you remember
about squaring correlation coefficients to reveal proportion of variance, this means that
over 89 percent of the variation in error from predicting one year's birth rate can be
accounted for by the error in predicting the previous year's.
The second-order autocorrelation is produced by correlating the series (y3, y1; y4, y2;
...; yn, yn-2). Computing this statistic involves shifting the series down two years. As you
may now infer, we can keep shifting and computing autocorrelations for as many years
as there are in the series. There is a simple graphical way to display all these
autocorrelations. It looks like a bar graph of the autocorrelations sequenced by year, or
index in the series. The first bar is the first autocorrelation (0.953). The next highest
bar is the second, and so on. Here it is:

[Autocorrelation Plot of the residuals: correlation (-1.0 to 1.0) plotted against lag (0 to 50)]
This autocorrelation plot tells us about all the autocorrelations in the residuals from the
linear model. As you can see, there is a strong dependence in the residuals. As we shift
the series far enough back, the autocorrelations become negative, because the series
crosses the prediction line and the residuals become negative. Over the entire series,
there are three crossings and three corresponding shifts in sign among the
autocorrelations.
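Computed directly, a lag-k autocorrelation is just the Pearson correlation between the series and a copy of itself shifted k observations; the sketch below is illustrative Python (the year and rate arrays are assumed to hold the birth-rate data listed above):

import numpy as np

def autocorrelation(x, k):
    """Pearson correlation of a series with the same series shifted k observations."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[k:], x[:-k])[0, 1]

# slope, intercept = np.polyfit(year, rate, 1)       # least-squares line
# residuals = rate - (intercept + slope * year)
# r1 = autocorrelation(residuals, 1)                 # first-order autocorrelation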
Autoregressive Models
We would have the same serial correlation problem if we refined our model to include
a quadratic term:

RATE = β0 + β1*YEAR + β2*YEAR^2 + ε

You can try this model with MGLH, but you will find a large autocorrelation in the
residuals even though the curve fits the data more closely. How can we construct a
model that includes the autocorrelation structure itself?
The autoregressive model does this:

RATE_i = β0 + β1*RATE_(i-1) + ε_i

Notice that this model expresses a year's birth rate as a function of the previous year's
birth rate, not as a function of YEAR. Time becomes a sequencing variable, not a
predictor.
To fit this, we fit an AR(1) model with the ARIMA procedure. Here is the result,
with forecasts extending to the year 2000:

[Plot of RATE against YEAR (1940 to 2000), with the AR(1) forecasts shown as a dotted line extending to 2000]

The forecasted values are represented by the dotted line. Unlike the regression model
forecast, the autoregressive forecast begins at the last birth rate value and drifts back
toward the mean of the series. This forecast behavior is typical of this particular model,
which is often called a random walk.
Moving Average Models
There is another series model that can account for fluctuations across time. The
moving average model looks like this:

y_i = ε_i - θ*ε_(i-1)

This models a series as a cumulation of random shocks or disturbances. If this model
represented someone's spending habits, for example, then whether the person went on
a spending spree one day would depend on whether he or she went on one the day
before. Unlike the autoregressive model, which represents an observation as a function
of previous observations' values, the moving average model represents an observation
as a function of the previous observations' errors. So that you can see the difference
between the two, here are examples of first-order autoregressive, or AR(1), and moving
average, or MA(1), series:
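Series with this character can be generated with a short simulation if you want to see the contrast yourself (illustrative Python; the coefficient value 0.8 is an arbitrary choice, not taken from the manual):

import numpy as np

rng = np.random.default_rng(0)
n = 200
e = rng.normal(size=n)              # random shocks

ar1 = np.zeros(n)                   # AR(1): each value depends on the previous value
for i in range(1, n):
    ar1[i] = 0.8 * ar1[i - 1] + e[i]

ma1 = np.empty(n)                   # MA(1): each value depends on the previous shock
ma1[0] = e[0]
for i in range(1, n):
    ma1[i] = e[i] - 0.8 * e[i - 1]

An AR(1) series wanders away from its mean and returns slowly, while an MA(1) series shows only short-lived dependence from one point to the next.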
ARMA Models
Autoregressive and moving average models can be mixed to make autoregressive-
moving average models. They can be mixed with different orders, for example, AR(2)
plus MA(1), which is often expressed as ARMA(2,1). A text on forecasting will offer
instances of these more complicated models. You could visually add the two sample
series above, however, to see how an ARMA(1,1) model would look.
Identifying Models
Before you can fit an AR, MA, or ARMA model, you need to identify which model is
appropriate for your series. You can look at the series plot to find distinctive patterns,
as in the figure contrasting AR(1) and MA(1) directly above. Real data seldom fit these
ideal types as clearly, however. There are several powerful tools that distinguish these
families of models. We have already seen one: the autocorrelation function plot (ACF).
The partial autocorrelation function plot (PACF) provides additional information
about serial correlation. To identify models, we use both of these plots.
Stationarity
Before doing these plots, however, you should be sure the series is stationary. This
means:
n The mean of the series is constant across time. You can use the Trend transformation
to remove linear trend from the series. This will not reduce quadratic or other
curvilinear trend, however. A better method is to Difference the data. This
transformation replaces values by the differences between each value and the
previous value, thereby removing trend. For cyclical series, like monthly sales,
seasonal differencing may be required before fitting a model (see below). Data that
are drifting up or down across the series generally should be differenced.
n The variance of the series is constant across time. If the series variation is increasing
around its mean level across time, try a Log transformation. If it is decreasing
around its mean level across time (a rare occurrence), try a Square transformation.
You should generally do this before differencing.
n The autocorrelations of the series depend only on the difference in time points and not
on the time period itself. If the first half of the ACF looks different from the second,
try seasonal differencing after identifying a period on which the data are
fluctuating. Monthly, quarterly, seasonal, annual data often cycle this way.
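The transformations in the list above have simple direct counterparts; for example (illustrative Python, not SYSTAT syntax, with a made-up monthly series):

import numpy as np

y = np.arange(1.0, 121.0)                  # stand-in for a 120-point monthly series with trend
logged = np.log(y)                         # stabilizes variance that grows with the level
differenced = np.diff(logged)              # lag-1 differences remove trend
seasonal = logged[12:] - logged[:-12]      # lag-12 (seasonal) differences for monthly data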
ACF Plots
The autocorrelation function plot displays the pattern of autocorrelations. We have
seen in this introduction an ACF plot of the residuals from a linear fit to birth rate. The
slow decay of the autocorrelations after the first indicates autoregressive behavior in
the residuals.
PACF Plots
The partial autocorrelation function plot displays autocorrelations, but each one below
the first is conditioned on the previous autocorrelation. The PACF shows the
relationship of points in a series to preceding points after partialing out the influence
of intervening points. We examine them for effects that do not depend linearly on
previous (smaller lag) autocorrelations.
Identification Using ACF and PACF Plots Together
Let's summarize our identification strategy. First, make sure the series is stationary. If
variance is nonconstant, transform it with Log or Square. If trend is present, remove it
with differencing. Finally, if seasonality is present, remove it with seasonal
differencing. Then examine ACF and PACF plots together. Following is a
chart of possible types of patterns. Underneath each series is the ACF plot on the left
and the PACF plot on the right. Several rows contain more than one possible plot. This
is because coefficients in the model can be negative or positive and these plots show
different combinations of signs.
Finally, remember that differencing can remove both trend and autoregressive effects.
If an AR(1) model fits your data, as with our birth rate example, then differencing will
produce only white noise and your ACF will look uniformly random. As a result,
differencing is like constraining an autoregressive parameter to be exactly one.

[Chart: example series with their ACF (left) and PACF (right) plots for AR(1), AR(2), MA(1), MA(2), ARMA(1,1), ARMA(0,0)(1,0), ARMA(0,0)(0,1), and ARMA(0,0)(1,1) models]
Estimating the ARIMA Model
When you have identified the model as AR, MA, or a mixture, then you can fit it by
specifying the AR order (P=n) and the MA order (Q=n). The I in ARIMA stands for
Integrated and is a parameter that has to do with differencing. Any differencing you
do while identifying the model will be included automatically in calculating your
forecasts.
When you have estimated the model, pay attention to the standard errors of the
parameters. If a parameter estimate is much smaller in absolute value than two standard
errors away from zero, then it is probably unnecessary in the model. Refit the model
without it. If you are uncertain about model identification, you can sometimes use this
rule of thumb to compare two different models. The mean square error (MSE) of the
model fit can also guide you. Generally, you are looking for a parsimonious model with
small MSE.
Problems with Forecast Models
Forecasting is a vast field, and we cannot begin to explain even the basics in such a
brief discussion. Makridakis, Wheelwright, and McGee (1983) cover the topic fairly
extensively. SYSTAT contains several methods for forecasting, with which you can
experiment on these data. Exponential smoothing, for example, should provide similar
forecasts to the ARIMA model. Keep in mind several things as you go:
n There is nothing like extrinsic knowledge. We use forecasting methods for
SYSTAT budget planning. We always compare them to staff predictions of sales,
however. In general, averaging staff predictions does better than the data-driven
forecasting models. The reason is simple: staff know about external factors that
are likely to affect sales. These are one-time events that are not easily included in
models. Although we are not experts on the stock market, we would bet the same
is true for investing. Chartist models that are based solely on the trends in stocks
will not do as well, on average, as strategies based on knowledge of companies'
economic performance and, in the illegal extreme, inside trading information.
n Always examine your residuals. The same reasons for using residual diagnostics in
ordinary linear regression apply to nonlinear forecasting models. In both cases, you
want to see independence, or white noise.
n Don't extrapolate too far. As in regression, predictions beyond the data are shaky.
The farther you stray from the ends of the data, the less reliable are the predictions.
The confidence limits on the forecasts will give you some flavor of this.
Box and Jenkins (1976) provide the primary reference for these procedures. Financial
forecasters should consult Nelson (1973) and Vandaele (1983) for applied
introductions. Social scientists should look at McCleary and Hay (1980) for
applications to behavioral data.
Many treatments (including Box and Jenkins) outline the ARIMA modeling process
in three stages: Identification, Estimation, and Diagnosis. This is the outline we have
followed in this introduction. With SYSTAT you identify models with Transform,
Case plot, ACF plot, and PACF plot, estimate them with ARIMA, and diagnose their
adequacy with more plots. For more complex problems, you may have to use other
procedures, also.
ARIMA (AutoRegressive Integrated Moving Average) models can fit many time
series with remarkably few parameters. Sometimes, ARIMA and Fourier models can
be used effectively on the same data. As with other modeling procedures, decisions
about appropriateness of competing models must rest on theoretical grounds.
Nevertheless, a researcher should lean toward ARIMA models when it is reasonable to
assume that points in a process are primarily functions of previous points and their
errors, rather than periodic signal plus noise.
Seasonal Decomposition and Adjustment
A time series can be viewed as a sum of individual components that may include a term
for location (level or mean value), a trend component (long-term movements in the
level of a series over time), a seasonal component, and an irregular component (the part
unique to each time point). We can use the Mean transformation to remove the mean
(location) from a series, Trend to remove a linear trend from a series, and Difference
to eliminate either a trend or a seasonal effect from a series. Each of these
transformations changes the scale of the series but does not directly provide
information about the form of the trend or the seasonal component.
Alternatively, you may want to adjust the values in a series for the seasonal
component but leave the series in the same scale or unit. This enables you to interpret
the value units in the same way as the original series and to compare values in the series
after removing differences due to seasonality.
For example, sales data for many products are strongly seasonal. More suntan lotion
is sold in the summer than in the winter. It is therefore difficult to compare suntan
lotion sales from month to month (going up? going down?) without first taking
seasonal differences into account.
Seasonal differences can be accounted for by determining a factor for each period
of the cycle. Quarterly data may have a seasonal factor for each of the four quarters.
Monthly data may have a seasonal factor for each of the twelve months.
Seasonal factors can take either of two forms: additive (fixed) or multiplicative
(proportional). An additive seasonal factor is a fixed number of units above or below
the general level of the series; for example, 10,000 more bottles of suntan lotion were
sold in July than the average month. In a multiplicative or proportional model, the
seasonal factor is a percentage of the level of the series; for example, 200% more
bottles of suntan lotion were sold in July than in the average month.
Additive seasonal effects are removed from a series by subtracting estimates of the
appropriate seasonal factor from each point in the series. Multiplicative seasonal
effects are removed by dividing each point by the appropriate seasonal factor. Seasonal
computes either additive or multiplicative seasonal factors for a series and uses them
to adjust the original series.
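As a rough illustration of the two kinds of factors for monthly data, consider the following sketch (illustrative Python; SYSTAT's Seasonal procedure estimates its factors differently, so treat this only as a picture of the additive versus multiplicative logic):

import numpy as np

def seasonal_adjust(y, period=12, multiplicative=True):
    """Crude seasonal factors: each period's mean relative to the overall level."""
    y = np.asarray(y, dtype=float)
    level = y.mean()
    cycle = np.arange(len(y)) % period
    factors = np.array([y[cycle == m].mean() for m in range(period)])
    if multiplicative:
        factors = factors / level            # e.g., 1.5 means 50% above the level of the series
        adjusted = y / factors[cycle]
    else:
        factors = factors - level            # e.g., +10,000 units above the level of the series
        adjusted = y - factors[cycle]
    return factors, adjusted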
Exponential Smoothing
Exponential smoothing forecasts future observations as weighted averages (a running
smoother) of previous observations. For simple exponential smoothing, each forecast
is the new estimate of location for the series. For models with trend and/or seasonal
components, exponential smoothing smooths the location, trend, and seasonal
components separately. For each component, you must specify a smoothing weight
between 0 and 1. In practice, weights between 0.10 and 0.30 are most frequently used.
The Exponential Smoothing option allows you to specify a linear or percentage
growth (also called exponential or multiplicative) trend or neither, and an additive or
multiplicative seasonal component or neither. There is always a location component.
Thus, there are nine possible smoothing models from which you can choose.
Smoothing with a linear trend component and no seasonal component is Holts
method. Smoothing with both a linear trend and a multiplicative seasonal term is
Winters three-parameter model.
The exponential smoothing procedure obtains initial estimates of seasonal
components in the same manner as Seasonal. If there is a trend component, SYSTAT
uses regression (after adjusting values for any seasonal effects) to estimate the initial
values of the location and trend parameters. If there is neither a trend nor a seasonal
component, the first value in the series is used as the initial estimate of location.
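Simple exponential smoothing and the linear-trend (Holt) version can be sketched as follows (illustrative Python; the initial values here are chosen in a simpler way than the procedure described above):

def simple_exponential(y, alpha=0.2):
    """One-step-ahead smoothed values; the first value is the initial estimate of location."""
    level = y[0]
    smoothed = [level]
    for value in y[1:]:
        level = alpha * value + (1 - alpha) * level
        smoothed.append(level)
    return smoothed

def holt(y, alpha=0.2, beta=0.2):
    """Exponential smoothing with a linear trend component (Holt's method)."""
    level, trend = y[0], y[1] - y[0]
    forecasts = [level + trend]
    for value in y[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        forecasts.append(level + trend)
    return forecasts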
Fourier Analysis
If you believe your series is cyclical, such as astronomical or behavioral data, then
you should consider Fourier analysis. The Fourier model decomposes a series into a
finite sum of trigonometric components: sine and cosine waves of different
frequencies. If your data are cyclical at a particular frequency, such as monthly, then a
few Fourier components might be sufficient to capture most of the nonrandom
variation.
Fourier analysis decomposes a time series just as a musical waveform can be
decomposed into a fundamental wave plus harmonics. The French mathematician
Fourier devised this decomposition around the beginning of the nineteenth century and
applied it to heat transfer and other physical and mathematical problems. This
transformation is of the general form:

f(t) = x0 + x1*sin(t) + x2*cos(t) + x3*sin(2t) + x4*cos(2t) + ...
The Fourier decomposition can be useful for designing a filter to smooth noise and for
analyzing the spectral composition of a time series. The most frequent application
involves constructing a periodogram which displays the squared amplitude
(magnitude) of the trigonometric components versus their frequencies. Fourier can be
used to construct these displays. For further details on Fourier analysis, consult
Brigham (1974) or Bloomfield (1976).
Fourier transforms are time consuming to compute because they involve numerous
trigonometric functions. Cooley and Tukey (1965) developed a fast algorithm for
computing the transform on a discrete series that makes the spectral analysis of lengthy
series practical. A variant of this Fast Fourier Transform algorithm is implemented in
SYSTAT.
The discrete Fourier transform should be done on series with lengths (number of
cases) that are powers of 2. If you do not have samples of 32, 64, 128, 256, etc., you
should pad your series with zeros up to the next power of 2. If you have a series called
Series with only 102 cases, for example, you can recode to add zeros to cases 103
through 128. If you do not pad the file in this way, the Fourier procedure finds the
highest power of 2 less than the number of cases in the file and transforms only that
number of cases. (In this example, it would have transformed only the first 64 cases.)
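Padding to the next power of 2 and computing a periodogram can be sketched with numpy's FFT routines (illustrative Python, not the SYSTAT Fourier procedure):

import numpy as np

def periodogram(y):
    """Squared magnitude of the discrete Fourier transform after zero-padding to a power of 2."""
    y = np.asarray(y, dtype=float)
    n = 1 << (len(y) - 1).bit_length()          # next power of 2 (102 cases -> 128)
    padded = np.concatenate([y, np.zeros(n - len(y))])
    spectrum = np.fft.rfft(padded)
    freqs = np.fft.rfftfreq(n)                  # frequencies in cycles per observation
    return freqs, np.abs(spectrum) ** 2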
A useful graph to accompany Fourier analysis is the periodogram. This graph plots
magnitude (or squared magnitude) against frequency. It reveals the relative
contribution of different frequency waveforms to the overall shape of the series. If the
periodogram contains one large spike, then it means that the series can be fit well by a
single sinusoidal waveform. The periodogram is itself like a series, so sometimes you
may want to smooth it with one of the Series smoothers.
Fourier analysis is often used to construct a filter, which works like running
smoothers. A filter allows variation of only a limited band of frequencies to pass
through. A low-pass filter, for example, removes high-frequency information. It is
often used to remove noise in radio transmissions, recordings, and photographs. A
high-pass filter, on the other hand, removes low-frequency variation. It is used as one
method for detecting edges in photographs. You can construct filters in SYSTAT by
computing the Fourier transform, deleting real and imaginary components for low or
high frequencies, and then using the inverse transform to produce a smoothed
waveform.
If you reproduce a series from a few low-frequency Fourier components, the
resulting smooth will be similar to that achieved by a running window of an
appropriate width. The Fourier method will constrain the smooth to be more regularly
periodic, however, since the selected trigonometric components will completely
determine the periodicity of the smooth.
Graphical Displays for Time Series in SYSTAT
Plotting data, autocorrelations, and partial autocorrelations is often one of the first
steps in understanding time series data. SYSTAT provides several graphical displays,
each of which is discussed in turn.
T-Plot Main Dialog Box
T-plot provides time series plots. This can give you a general idea of a series, enabling
you to detect a long-term trend, seasonal fluctuations, and gross outliers.
To open the T-plot dialog box, from the menus choose:
Statistics
Time Series
T-plot...
The variable you select is the dependent (vertical axis) variable, and the case number
(time series observation) is the independent variable (horizontal axis). The points are
connected with a line.
Time Main Dialog Box
Time labels the sequence(s) of values in a file with identifiers that represent the cycle
and the periodicity. The identifiers label the T-plot x axis for each time point.
To open the Time dialog box, from the menus choose:
Statistics
Time Series
Time...
The following options can be specified:
Origin. Starting point of the time series, expressed as a year.
Period. Periods within each year. The value defines the number of observations within
each year. For example, specify 12 for months, 52 for weeks, 365 for days, etc. If there
is only one observation per year, specify 1.
First. Starting point of the period of observation. For example, if the period is months
within each year and the first observation is for June, the First value would be 6.
Date format. The date display format for values on the x axis (time axis). Select a format
from the drop-down list.
ACF Plot Main Dialog Box
Autocorrelation plots show the correlations of a time series with itself shifted down
a specified number of cases. Plots of autocorrelations help you to investigate the
relation of each time point to previous time points. If the autocorrelation at lag 1 is
high, then each value is highly correlated with the value at the previous time point. If
the autocorrelation at lag 12 is high for data collected monthly, then each month is
highly correlated with the same month a year before (for example, for monthly sales
data, sales in December may be more related to those in previous Decembers than to
those in November or January).
To open the ACF Plot dialog box, from the menus choose:
Statistics
Time Series
ACF...
You can specify the maximum number of lags to plot. The plot contains the
autocorrelations for all lags between 1 and the number specified.
PACF Plot Main Dialog Box
Partial autocorrelation plots show the relationship of points in a series to preceding
points after partialing out the influence of intervening points.
To open the PACF Plot dialog box, from the menus choose:
Statistics
Time Series
PACF...
You can specify the maximum number of lags to plot. The plot contains the partial
autocorrelations for all lags between 1 and the number specified.
CCF Plot Main Dialog Box
Cross-correlation plots help to identify relations between two different series and any
time delays to the relations. A correlation for a negative lag indicates the relation of the
values in the first series to values in the second series that number of periods earlier.
The correlation at lag 0 is the usual Pearson correlation. Similarly, correlations at
positive lags relate values in the first series to subsequent values in the second series.
To open the CCF Plot dialog box, from the menus choose:
Statistics
Time Series
CCF...
You can specify the number of lags to plot. Approximately half of the lags will be
positive and half will be negative.
Using Commands
To graph a time series, first specify your data with USE filename. Continue with:

SERIES
TIME origin period first
TPLOT var / LAG=n
ACF var / LAG=n
PACF var / LAG=n
CCF var1, var2 / LAG=n

Transformations of Time Series in SYSTAT

Transformations Main Dialog Box
To open the Transformations dialog box, from the menus choose:
Statistics
Time Series
Transform...
Available transformations include:
n Mean. Subtracts the mean from each value in the series.
n Log. Replaces the values in a series with their natural logarithms, and thus removes
nonstationary variability, such as increasing variability over time.
n Square. Squares the values in a series. This is useful for producing periodograms
and for normalizing variance across the series.
n Trend. Removes linear trend from a series.
n Difference. Replaces each value by the difference between it and the previous
value, thereby removing trend (nonstationarity in level over time). Using
differences between each successive value is called lag 1. A Lag option allows
seasonal differences (for example, for data collected monthly, request a lag of 12).
n Percent Change. Replaces each value by the difference from the previous value
expressed as a percentage change: the difference in values divided by the previous
value.
n Index. Replaces each value by the ratio of the value to the value of a base
observation, which you can specify for Base. By default, SYSTAT uses the first
observation in the series.
n Taper. Smooths the series with the split-cosine-bell taper. Tapering weights the
middle of a series more than the endpoints. Use it prior to a Fourier decomposition
to reduce leakage between components. For Proportion, enter the proportion (P)
of the series to be tapered. Choose a weight function that varies between a boxcar
(P=0) and a full cycle of a cosine wave from trough to trough (P=1). For
intermediate values of P, the weight function is flat in the center section and cosine
tapered at either end. Default=0.5.
You can pile up transformation commands in any order, as long as you do not encounter
a mathematically undefined result. In that case, SYSTAT displays an error message and
the variable is restored to its original value in the file.
All transformations are in place. That is, the series is stored in the active work area
and the transformed values are written over the old ones. The original file is not altered,
however, because all the work is done in the memory of the computer. To save the
results of a transformation to a SYSTAT file, select Save file.
Clear Series
You can clear any past series transformations from memory and restore the original
values of the series. It is not possible to clear only the latest transformation (unless you
are saving to files after each step); Clear Series undoes all the transformations.
To clear series transformations, from the menus choose:
Statistics
Time Series
Clear Series...
Using Commands
To transform a time series, first specify your data with USE filename. Continue with:

SERIES
DIFFERENCE var / LAG=n MISS=n
LOG var
PCNTCHANGE var
MEAN var
SQUARE var
TREND var
INDEX var / BASE=n
TAPER var / P=n

CLEAR var clears transformations from memory.

Smoothing a Time Series in SYSTAT
Sometimes, with a noisy time series, you simply want to view some sort of smoothed
version of the series even though you have no idea what type of function generated the
series. A variety of techniques can smooth, or filter, the noise from such a series.

Smooth Main Dialog Box
To open the Smooth dialog box, from the menus choose:
Statistics
Time Series
Smooth...
Smooth provides the following methods for smoothing time series:
n Mean. Running means (moving averages). Mean of a span of series values
surrounding and including the current value. Specify the number of values
(observations) to use in the calculation.
n Median. Running medians. Median of a span of series values surrounding and
including the current value. Specify the number of values (observations) to use in
the calculation.
n Weight. General linear filters in which you can specify your own weights. Smooth
transforms the weights before using them so that they sum to 1.0. Weight=1, 2, 1
is the same as Weight=0.25, 0.5, 0.25 or Weight=3, 6, 3.
To save the results of a smoothing operation to a SYSTAT file, select Save file.
LOWESS Main Dialog Box
Cleveland (1979) presented a method for smoothing values of Y paired with a set of
ordered X values. Chambers et al. (1983) introduce this technique and present some
clear examples. If you are not a statistician, and want a glimpse of some of the details,
read the Chambers book and Velleman and Hoaglin (if you are unfamiliar with Tukeys
work).
Scatterplot smoothing enables you to look for a functional relation between Y and
X without prejudging its shape (or its monotonicity). LOWESS is a smoothing method
that uses an iterative locally weighted least-squares method to fit a curve to a set of
points.
To open the LOWESS dialog box, from the menus choose:
Statistics
Time Series
Lowess...
Tension. Tension determines the stiffness of the smooth. It varies between 0 and 1, with
a default of 0.5.
To save the results of a LOWESS smooth to a SYSTAT file, select Save file.
Exponential Smoothing Main Dialog Box
To open the Exponential Smoothing dialog box, from the menus choose:
Statistics
Time Series
Exponential...
The following options can be specified:
Smooth. Specify a smoothing weight between 0 and 1. In practice, weights between 0.1
and 0.3 are most frequently used.
Trend components. You can supply a weight for either a Linear or Percentage trend
component. Values usually range between 0.1 and 0.3. The default is 0.2.
Forecast. Number or the range of new cases to predict. For example, a value of 10
produces forecasts for 10 time points; a range from 144 to 154 produces forecasts for
time points 144 through 154.
Seasonal components. You can supply a weight for either Additive or Multiplicative.
Values usually range between 0.1 and 0.3. The default is 0.2.
Seasonal periodicity. Indicates the repetitive cyclical variation, such as the number of
months in a year or the number of days in a week. The default is 12 (as in months in a
year).
To save the forecasts and residuals to a SYSTAT file, select Save file.
Using Commands
To smooth a time series, first specify your data with USE filename. Continue with:
Seasonal Adjustments in SYSTAT
Transformations can remove the mean, trends, and seasonal effects from a time series.
However, transformations alter the scale of a time series and also yield no information
regarding the form of the removed trend or seasonal effect. As an alternative, you can
use seasonal adjustments to account for seasonal factors while maintaining the original
scale of the time series.
Seasonal Adjustment Main Dialog Box
To open the Seasonal Adjustment dialog box, from the menus choose:
Statistics
Time Series
Seasonal Adjustment...
The following options can be specified:
n Term. An Additive seasonal factor is a fixed number of units above or below the
general level of the series. In a Multiplicative model, the seasonal factor is a
percentage of the level of the series.
n Season. Indicates the periodicity, that is, the repetitive cyclical variation, such as the
number of months in a year or the number of days in a week. The default is 12.
To save the deseasonalized series to a SYSTAT file, select Save file.
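The idea behind a multiplicative adjustment can be sketched in Python as follows. The seasonal index computed here is simply each period position's mean as a percentage of the overall mean; SYSTAT's indices are likely computed from a more refined decomposition, so treat this only as an illustration of dividing each observation by its seasonal index.

import numpy as np

def deseasonalize(series, period=12):
    """Multiplicative seasonal adjustment.

    Each observation is divided by the seasonal index (in percent)
    for its position within the period, so the adjusted series keeps
    its original scale but loses the repetitive seasonal swing.
    """
    y = np.asarray(series, float)
    pos = np.arange(len(y)) % period
    index = np.array([y[pos == p].mean() for p in range(period)]) / y.mean()
    return 100 * index, y / index[pos]

y = [110, 125, 150, 118, 115, 130, 158, 122, 121, 138, 165, 128]  # made-up quarterly data
indices, adjusted = deseasonalize(y, period=4)
print(np.round(indices, 1))
print(np.round(adjusted, 1))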
Using Commands
To seasonally adjust a time series, first specify your data with USE filename.
Continue with:
SERIES
ADJSEASON var / ADDITIVE MULTIPLICATIVE SEASON=n
ARIMA Models in SYSTAT
ARIMA (AutoRegressive Integrated Moving Average) models combine autoregressive
techniques with the moving average approach. Consequently, each case is a function
of previous cases and previous errors.
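The dependence on previous cases and previous errors is easy to see in a simulation. The Python sketch below (parameter values are arbitrary) generates an ARMA(1,1) series; an ARIMA model applies the same kind of recursion to a differenced series.

import numpy as np

def simulate_arma11(n, phi=0.6, theta=0.4, seed=0):
    """Simulate an ARMA(1,1) series:  y[t] = phi*y[t-1] + e[t] + theta*e[t-1].

    Each value depends on the previous value (the AR part) and on the
    previous random error (the MA part).
    """
    rng = np.random.default_rng(seed)
    e = rng.normal(0.0, 1.0, n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + e[t] + theta * e[t - 1]
    return y

print(simulate_arma11(10))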
ARIMA Main Dialog Box
To open the ARIMA dialog box, from the menus choose:
Statistics
Time Series
ARIMA...
ARIMA provides ARIMA models for time series. The following options can be
specified:
AR parameters. Number of autoregressive parameters.
Seasonal AR. Number of seasonal autoregressive parameters.
MA parameter. Number of moving average parameters.
Seasonal MA. Number of seasonal moving average parameters.
Seasonal periodicity. Defines the seasonal periodicity.
Estimate constant. Includes a constant in the model.
Options. You can specify the number of iterations for the ARIMA model, the
convergence criterion, and a backcast value to extend the series backwards (forecasting
in reverse). Although it slows down computation, you should use backcasting, especially for
seasonal models, and choose a length greater than the seasonal period.
Don't play around with convergence unless you are failing to get convergence of the
estimates after many iterations. It is better to increase the number of iterations than to
decrease the convergence criterion, since your estimates will be more precise. In any
case, it cannot be set greater than one tenth. Sometimes models fail to converge after
many iterations because you have misspecified them.
Forecast. Number or the range of new cases to predict. For example, a value of 10
produces forecasts for 10 time points; a range from 144 to 154 produces forecasts for
time points 144 through 154.
To save the residuals to a SYSTAT file, select Save file.
Using Commands
To fit an ARIMA model, first specify your data with USE filename. Continue with:
SERIES
ARIMA var / P=n PS=n Q=n QS=n SEASON=n CONSTANT,
BACKCAST=n ITER=n CONV=n,
FORECAST=n (or time1,time2)
Fourier Models in SYSTAT
Fourier models are particularly well-suited to cyclical time series. These models
decompose a time series into a sum of trigonometric components.
Fourier Main Dialog Box
The Fourier model decomposes a time series into a finite sum of sine and cosine waves
of different frequencies. If your data are cyclical at a particular frequency, such as
monthly, then a few Fourier components might capture most of the nonrandom
variation.
To open the Fourier dialog box, from the menus choose:
Statistics
Time Series
Fourier...
The Lag specification indicates the number of cases to use in the analysis.
If you select two variables, the inverse transformation is computed. The first
variable selected is used as the real component, and the second variable is used as the
imaginary component. To save the real and imaginary components in a SYSTAT file,
select Save file. Real and imaginary components are saved instead of magnitude and
phase because that allows you to do an inverse Fourier transform.
For example, assume that you have saved the results of a direct transformation into
a file MYFOUR. That file should contain two variables, REAL and IMAG, which are
the two components of the transformation. To obtain the inverse Fourier
transformation:
USE myfour
FOURIER real imag
Since you specify two variables, SYSTAT assumes that you want the inverse
transformation, and that the first variable is the real component, and the second, the
imaginary component. The work is done in the active work area, so the resulting real
series is stored in the active work area occupied by REAL (or whatever you called the
first variable corresponding to the real component).
If you absolutely must have magnitude and phase in a SYSTAT file instead of the
real and imaginary components, do the following transformations:
USE myfour
LET magnitude = SQR(real*real + imag*imag)
LET phase = ATN(imag/real)
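For comparison, the same bookkeeping in Python (illustrative only, not SYSTAT's internal code): the FFT returns the real and imaginary components directly, magnitude and phase are derived from them, and keeping the real and imaginary parts is what makes the inverse transform possible. Note that arctan2 is a four-quadrant version of the simple ATN(imag/real) ratio used above.

import numpy as np

y = np.array([4.0, 6.0, 10.0, 6.0, 4.0, 2.0, 0.0, 2.0])   # made-up series, length a power of 2

# direct transform: real and imaginary components
spec = np.fft.fft(y)
real, imag = spec.real, spec.imag

# magnitude and phase, as in the LET statements above
magnitude = np.sqrt(real**2 + imag**2)
phase = np.arctan2(imag, real)          # four-quadrant arctangent avoids dividing by zero

# keeping real and imaginary parts lets you invert the transform
back = np.fft.ifft(real + 1j * imag).real
print(np.allclose(back, y))             # True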
Using Commands
To fit a Fourier model, first specify your data with USE filename. Continue with:
SERIES
FOURIER varlist / LAG=n
Usage Considerations
Types of data. For time series analysis, each case (row) in the data represents an
observation at a different time. The observations are assumed to be taken at equally
spaced time intervals.
Print options. Output is standard for all PRINT lengths.
Quick Graphs. Smoothing and seasonal adjustments yield a time series plot. Forecasting
in ARIMA results in a time series plot of the original series with the forecasts. Fourier
analysis produces periodograms (the squared magnitude against frequencies).
Saving files. You can save transformed, smoothed, deseasonalized, and forecasted
values, as well as both the real and imaginary parts of the Fourier transform.
BY groups. BY variables are not available in SERIES.
Bootstrapping. Bootstrapping is not available in this procedure.
Case frequencies. FREQ variables have no effect in SERIES.
Case weights. SERIES does not allow case weighting.
Examples
Example 1
Time Series Plot
To illustrate these displays, we use monthly counts of international airline passengers
during 1949–1960. Box and Jenkins call the series G. Each of the 144 monthly counts is
stored as a case in the SYSTAT file named AIRLINE.
TPLOT provides a graphical view of the raw data. Here we plot the AIRLINE
passenger data. The input follows:
The resulting plot is:
Notice that the counts tend to peak during the summer months each year and that the
number of passengers tends to increase over time (a positive trend). Notice also that the
spread or variance tends to increase over time. One way to deal with this problem is to
log-transform the data.
Applying the log transformation requires the following commands:
USE airline
SERIES
TIME 1949 12
TPLOT pass
LOG pass
TPLOT pass
[Series Plot: PASS, 1949 to 1961]
The resulting plot follows:
Compare this plot with the previous one: the variance across time now appears more
stable, but there is still a positive upward trend over time.
Example 2
Autocorrelation Plot
To display an autocorrelation plot, the input is:
The plot is:
USE airline
SERIES
LOG pass
ACF pass
[Series Plot: PASS (logged), 1949 to 1961]
[Autocorrelation Plot: Correlation versus Lag, 0 to 60]
Note that we use the logged values of PASS. The shading in the display indicates the
size of the correlation at each lag (that is, like a bar chart). The correlation of each value
with the previous value in time (lag 1) is close to 1.0; with values 12 months before (lag
12), it is around 0.75. The curved line marks approximate 95% confidence levels for
the significance of each correlation. Notice the slow decay of these values. To most
investigators, this indicates that the series should be differenced.
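The quantity being plotted is the ordinary sample autocorrelation. A minimal Python sketch (with a made-up trended, seasonal series) follows; the approximate two-standard-error bound 2/sqrt(n) is also computed, although SYSTAT's curved confidence band may be calculated somewhat differently.

import numpy as np

def acf(series, max_lag=12):
    """Sample autocorrelations r(1), ..., r(max_lag): the lag-k
    cross-product of the mean-centered series with itself, divided
    by the sum of squares (the lag-0 term)."""
    y = np.asarray(series, float)
    y = y - y.mean()
    denom = np.sum(y * y)
    return np.array([np.sum(y[k:] * y[:-k]) / denom
                     for k in range(1, max_lag + 1)])

t = np.arange(48)
y = 100 + 2 * t + 10 * np.sin(2 * np.pi * t / 12)   # made-up trended, seasonal series
print(np.round(acf(y, max_lag=13), 2))
print(2 / np.sqrt(len(y)))        # rough 95% bound for a white-noise series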
Example 3
Partial Autocorrelation Plot
To display a partial autocorrelation plot, the input is:
The plot is:
The first autocorrelation is the same as in the ACF plot. There are no previous
autocorrelations, so it is not adjusted. The second-order autocorrelation was close to
0.90 in the ACF plot, but after adjusting for the first autocorrelation, it is reduced to
0.118.
USE airline
SERIES
LOG pass
PACF pass
[Partial Autocorrelation Plot: Correlation versus Lag, 0 to 60]
Example 4
Cross-Correlation Plot
This example uses the SPNDMONY file, which contains two quarterly series,
SPENDING (consumer expenditures) and MONEY (money stock) in billions of current
dollars for the United States during the years 1952–1956. The first record (case) in the
file contains the SPENDING and MONEY dollars for the first quarter of 1952; the
second record, dollars for the second quarter of 1952, and so on (that is, each case
contains the SPENDING and MONEY values for one quarter). These series are analyzed by
Chatterjee and Price (1977).
The input follows:
The resulting plot is:
There is strong correlation between the two series at lag 0, tapering off the further one
goes in either direction. This is true of all cross-correlation functions between two
trended series. Since both series are increasing, early values in both series tend to be
small, and final values tend to be large. This produces a large positive correlation.
USE spndmony
SERIES
CCF spending money / LAG=15
[Cross Correlation Plot: Correlation versus Lag, -10 to 10]
Differencing
To better understand the relationship, if any, between the series, difference them to
remove the common trend and then display a new CCF plot.
To difference both series:
This shows a significant negative correlation at only one time interval: +3 lags of the
series. Since we selected SPENDING first, we see that consumer expenditures are
negatively correlated with the money stock three quarters later. Thus, consumer
spending may be a leading indicator of money stock.
USE spndmony
SERIES
DIFFERENCE spending
DIFFERENCE money
CCF spending money / LAG=15
[Cross Correlation Plot: Correlation versus Lag, -10 to 10]
Example 5
Differencing
Let's replace the values of the series in the AIRLINE data with the difference between
each value and the previous value, that is, first-order (lag 1) differencing. The input is:
The output follows:
USE airline
SERIES
TIME 1949 12
LOG pass
DIFFERENCE pass
TPLOT pass
ACF pass / LAG=15
PACF pass / LAG=15
[Series Plot: PASS (logged, differenced), 1949 to 1961]
[Autocorrelation Plot: Correlation versus Lag, 0 to 16]
The strong upward trend seen in the undifferenced time series plots is not evident here.
Notice also that the scale on this plot ranges from approximately -0.2 to +0.2, while on
the plot in the time series plot example, it ranges from 4.6 to 6.4.
The very strong lag 12 ACF and PACF correlations with a decay of strong
correlations for shorter lags suggest that the series is seasonal. (We suspected this after
seeing the first plot of the data.) Differencing this monthly series by lag 12 can remove
cycles from the series.
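In code, differencing amounts to subtracting the value lag observations earlier. The sketch below (made-up data) chains a first difference with a lag-12 difference, as the next section does with the airline series.

import numpy as np

def difference(series, lag=1):
    """Replace each value with its difference from the value `lag`
    observations earlier.  The first `lag` values have no predecessor,
    so the differenced series is shorter (SYSTAT sets those values to
    missing instead)."""
    y = np.asarray(series, float)
    return y[lag:] - y[:-lag]

t = np.arange(48)
y = np.log(100 + 2 * t + 10 * np.sin(2 * np.pi * t / 12))  # made-up seasonal series

d1 = difference(y, lag=1)     # removes the overall trend
d12 = difference(d1, lag=12)  # removes the yearly cycle
print(len(y), len(d1), len(d12))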
Order 12 Differencing
Here, we will difference by order 12 and look at the plots. Let's summarize what has
happened to the original data. First, the data were replaced by their log values. Next,
the data were replaced by their first-order differences. Now we replace these
differences with order 12 differences with the following commands:
USE airline
SERIES
LOG pass
DIFFERENCE pass
DIFFERENCE pass / LAG=12
ACF pass / LAG=15
PACF pass / LAG=15
[Partial Autocorrelation Plot: Correlation versus Lag, 0 to 16]
The autocorrelations and partial autocorrelations after differencing by order 12 are
shown below:
The ACF display has spikes at lag 1 and lag 12. We conclude that the number of airline
passengers this month depends on the number last month and on the number one year
ago during the same month.
[Autocorrelation Plot: Correlation versus Lag, 0 to 16]
[Partial Autocorrelation Plot: Correlation versus Lag, 0 to 16]
Example 6
Moving Averages
The SYSTAT file AIRCRAFT contains the results of a flutter test (amplitude of
vibration) of an aircraft wing (Bennett and Desmarais, 1975). Although the model for
these data is known, we are going to try to recover a smooth series without using this
information.
Let's try a seven-point moving average on the AIRCRAFT data and smooth the
resulting smoothed series with a four-point moving average. This should remove some
of the jitters. The input is:
The output follows:
USE aircraft
SERIES
SMOOTH flutter / MEAN=7
SMOOTH flutter / MEAN=4
[Series Plot: FLUTTER versus Case, 0 to 70]
The second plot is even smoother than the first. We chose the lengths of the window by
trial and error after looking at the data to see how much they jitter to the left and right
of each point relative to the overall pattern of the series. You will do better if you know
something about the function generating the data.
Example 7
Smoothing (A 4253H Filter)
To fit a 4253H filter to the AIRCRAFT data, the input is:
A Quick Graph follows each Smooth request. (To omit the display, type
GRAPH=NONE.) The displays shown below correspond to the first request (MEDIAN=4)
and the final smooth (a running means smoother with weights).
USE aircraft
SERIES
SMOOTH flutter / MEDIAN=4
SMOOTH flutter / MEDIAN=2
SMOOTH flutter / MEDIAN=5
SMOOTH flutter / MEDIAN=3
SMOOTH flutter / WT=1,2,1
[Series Plot: FLUTTER versus Case, 0 to 70]
The previous smooth (MEDIAN=3) is marked by dashed lines.
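A compound smoother like this one is just a sequence of simple passes. The Python sketch below runs running medians of spans 4, 2, 5, and 3 and then a Hanning (1, 2, 1) pass over a made-up jittery series; the handling of even spans and of the series ends is a guess at one common convention and need not match SYSTAT exactly.

import numpy as np

def running_median(series, span):
    """Running median of the given span.  An odd span is centered on the
    current point; an even span uses one common alignment convention."""
    y = np.asarray(series, float)
    n = len(y)
    out = y.copy()
    half = span // 2
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, lo + span)
        out[i] = np.median(y[lo:hi])
    return out

rng = np.random.default_rng(1)
y = np.abs(np.sin(np.arange(64) / 5)) + rng.normal(0, 0.3, 64)  # made-up jittery series
smooth = y
for span in (4, 2, 5, 3):          # the median part of the sequence above
    smooth = running_median(smooth, span)
smooth = np.convolve(smooth, [0.25, 0.5, 0.25], mode="same")    # the Hanning (1, 2, 1) step
print(np.round(smooth[:5], 3))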
Example 8
LOWESS Smoothing
Here is the flutter variable (in the AIRCRAFT data) smoothed with LOWESS
smoothing. Use a Tension value of 0.18 to get more of the local detail. The input is:
USE aircraft
SERIES
SMOOTH flutter / LOWESS=.18
[Series Plot: FLUTTER versus Case, 0 to 70 (MEDIAN=4 smooth)]
[Series Plot: FLUTTER versus Case, 0 to 70 (final smooth)]
The resulting plot is:
And the Winner Is
The actual function used to generate the data in the moving average and 4253H filter
examples is shown below:
Y(t) = 1 - exp(-0.03 t) cos(0.3 t)
where t = 1, 2, ..., 64 (the index number of the series). We added normal (Gaussian)
noise to this function in inverse proportion to the square root of t. We leave it to the
reader to design an optimal filter for the Weight option after looking at the noise
distribution in the plot. The generating function on the data is shown below:
USE aircraft
BEGIN
PLOT flutter * time / HEI=1.5IN WID=3.5IN,
XMIN=0 XMAX=70 YMIN=0 YMAX=3,
XLABEL=Time YLABEL=Flutter,
SYMB=1 SIZE=.75 FILL=1 COLOR=BLACK
FPLOT y=1-EXP(-0.03*t)*COS(.3*t) + (0.35) ; ,
HEI=1.5IN WID=3.5IN,
XMIN=0 XMAX=70 YMIN=0 YMAX=3,
XLAB= YLAB= AXES=NONE SCALE=NONE
END
[Series Plot: FLUTTER versus Case, 0 to 70 (LOWESS smooth)]
Is there a winner? The LOWESS smooth looks pretty good. Usually, for Gaussian data
like these, it is hard to beat running means. Running medians and LOWESS do
extremely well on non-Gaussian data, however, because they are less susceptible to
outliers in the series. You will also find that exploratory smoothing requires a lot of fine
tuning with window widths (tension) and weights.
Example 9
Multiplicative Seasonal Factor
We use the same AIRLINE data, from Box and Jenkins (1976), used in the time
series plot example. If you examine the plot there, you can see the strong periodicities.
The size of the periodicities depends on the level of the series, so we know that the form
of seasonality is multiplicative. Each year, the number of passengers peaks during July
and August, but there are also jagged spikes in the data that correspond, apparently, to
holidays like Christmas and Easter.
Here we adjust the airline series for the multiplicative seasonal effect implied by the
series plot. The input follows:
USE airline
SERIES
TIME 1949 12
ADJSEASON pass / MULTIPLICATIVE
The output is:
Airline travel appears heaviest during the summer months, June (6) through
September (9).
The plot shows that the trend and the irregular components remain, but the seasonal
component has been removed from the series.
Example 10
Multiplicative Seasonality with a Linear Trend
In the time series plot example, we looked at the AIRLINE data from Box and Jenkins.
The plot of the series shows a strong increasing trend and what looks like multiplicative
Series originates at: 1949. Periodicity: 12. First Period: 1.

Adjust series for a seasonal periodicity of 12.

PASS copied from SYSTAT file into active work area


Seasonal indices for the series are:

1: 91.077
2: 88.133
3: 100.825
4: 97.321
5: 98.305
6: 111.296
7: 122.636
8: 121.652
9: 105.997
10: 92.200
11: 80.397
12: 90.164

Series is transformed.
[Series Plot: PASS (seasonally adjusted), 1949 to 1961]
seasonality. We could try to forecast this series with a model having a linear trend and
multiplicative seasonality. The input is:
The output follows:
USE airline
SERIES
EXPONENTIAL pass / SMOOTH=.3 LINEAR=.4 MULT=.4 FORECAST=10
Smooth location parameter with coefficient = 0.300
Linear trend with smoothing coefficient = 0.400
Multiplicative seasonality with smoothing coefficient = 0.400

Initial values

Seasonal indices for the series are:

1: 91.077
2: 88.133
3: 100.825
4: 97.321
5: 98.305
6: 111.296
7: 122.636
8: 121.652
9: 105.997
10: 92.200
11: 80.397
12: 90.164

Initial smoothed value = 88.263
Initial trend parameter = 2.645

Final values

Seasonal indices for the series are:

1: 87.628
2: 82.996
3: 95.362
4: 98.155
5: 101.936
6: 117.471
7: 133.389
8: 130.644
9: 107.631
10: 93.352
11: 78.956
12: 86.086

Final smoothed value = 497.921
Final trend parameter = 8.374

Within series MSE = 325.915, SE = 18.053

Obs Forecast

145. 443.655
146. 427.158
147. 498.784
148. 521.612
149. 550.243
150. 643.937
151. 742.367
152. 738.027
153. 617.035
154. 542.995
The output begins with the model and initial parameter estimates. The initial smoothed
value is a regression estimate of the level of the seasonally adjusted series immediately
before the first observation in the sample. The initial trend parameter is the slope of the
regression of observations on observation numberthe increase or decrease from one
observation to the next due to the overall trend. For a percentage growth model, the
trend parameter is the expected percentage change from the previous to the current
observation due to trend.
After the values are smoothed, SYSTAT prints the final estimates of the seasonal,
location, and trend parameters, plus the within-series forecast error. You can vary the
smoothing coefficients and see if they reduce the standard error.
Alternative Smoothing Coefficients
In an attempt to reduce the standard error, we alter the smoothing coefficients:
The output is:
CLEAR
USE airline
SERIES
TIME 1949 12
EXPONENTIAL pass / SMOOTH=.2 LINEAR=.2 MULT=.2 FORECAST=10
Smooth location parameter with coefficient = 0.200
Linear trend with smoothing coefficient = 0.200
Multiplicative seasonality with smoothing coefficient = 0.200

Final values

Seasonal indices for the series are:

1: 90.960
2: 87.249
3: 99.846
4: 98.817
5: 100.173
6: 113.664
7: 126.653
8: 124.419
9: 105.075
10: 91.727
11: 79.163
12: 88.002

Final smoothed value = 499.423
Final trend parameter = 4.106

Within series MSE = 220.026, SE = 14.833

We get a smaller within-series forecast error (220.026 versus 325.915).
In-Series Forecasts
Sometimes it's best to develop a model on a portion of a series and see how well it
predicts the remainder. There are 12 years of airline data for a total of 144 monthly
observations. The following commands develop the smoothing model with the first 10
years of data (120 observations) and predict the final 2 years (observations 121–144):
Obs Forecast

Jan, 1961 458.008
Feb, 1961 442.905
Mar, 1961 510.954
Apr, 1961 509.745
May, 1961 520.853
Jun, 1961 595.666
Jul, 1961 668.936
Aug, 1961 662.247
Sep, 1961 563.597
Oct, 1961 495.771
CLEAR
USE airline
SERIES
EXPONENTIAL / SMOOTH=.3 LINEAR=.4 MULT=.4 FORECAST=121 .. 144
[Series Plot: PASS, 1949 to 1962]
Output from this procedure includes the following forecasts:
Note that the within-series standard error is not the same as in the previous run because
it's now based on only the first 120 observations. The error for the actual forecasts
(65.891) is much larger than that for the in-series forecasts (17.091).
For a thorough review of issues and developments in exponential smoothing
models, see Gardner (1985). For an introduction to these models, see any introductory
forecasting book, such as Makridakis, Wheelwright, and McGee (1983).
Example 11
ARIMA Models
The first thing to consider in modeling the AIRLINE passenger data is the increasing
variance in the series over time. We logged the data (in the time series plot example)
and found that the variance stabilized. An upward trend remained, however, so we
differenced the series (in the differencing example). We now identify which ARIMA
Within series MSE = 292.090, SE = 17.091

Obs Forecast

Feb, 1959 354.228
Mar, 1959 423.781
Apr, 1959 426.328
May, 1959 447.235
Jun, 1959 530.144
Jul, 1959 588.677
Aug, 1959 580.174
Sep, 1959 484.919
Oct, 1959 420.917
Nov, 1959 365.250
Dec, 1959 407.409
Jan, 1960 427.066
Feb, 1960 418.752
Mar, 1960 499.821
Apr, 1960 501.697
May, 1960 525.153
Jun, 1960 621.184
Jul, 1960 688.343
Aug, 1960 677.034
Sep, 1960 564.765
Oct, 1960 489.287
Nov, 1960 423.786
Dec, 1960 471.840
Forecast MSE = 1904.887 SE = 43.645
parameters we want to estimate by plotting the data in several ways. The parameters of
the ARIMA model are:
For seasonal ARIMA models, we need three additional parameters: Seasonal AR,
Seasonal I, and Seasonal MA. Their definitions are the same as above, except that they
apply to points that are not adjacent in a series. The AIRLINE data involve seasonal
parameters, for example, because dependencies extend across years as well as months.
Checking ACF and PACF Displays
There appears to be at least some differencing needed for the AIRLINE data because
the series drifts across time (overall level of passengers increases). ACF and PACF
plots give us more detailed information on this. The ACF and PACF plots are shown
below (here we limit the lags to 15).
Name Description
AR (autoregressive): Each point is a weighted function of a previous point plus random error.
I (difference): Each point's value is a constant difference from a previous point's value.
MA (moving average): Each point is a weighted function of a previous point's random error plus its own random error.
[Autocorrelation Plot: Correlation versus Lag, 0 to 16]
Notice that the autocorrelations are substantial and well outside two standard errors on
the plot. There are two bulges in the ACF plot at lag=1 and lag=12, suggesting the
nonseasonal (monthly) and seasonal (yearly) dependencies that we supposed. The
PACF plot shows the same dependencies more distinctly. Here are the autocorrelations
and partial autocorrelations of the differenced series:
[Partial Autocorrelation Plot: Correlation versus Lag, 0 to 16]
[Autocorrelation Plot: Correlation versus Lag, 0 to 16]
Now we have only 143 points in the series because the first point had no prior value to
remove. It was therefore set to missing. The two plots show that the differencing has
substantially removed the monthly changes in trend. We still have the seasonal (yearly)
trend, however. Therefore, difference again and then replot. The autocorrelations and
partial autocorrelations after differencing by order 12 are shown below. With
commands:
DIFFERENCE / LAG=12
ACF / LAG=15
PACF / LAG=15
[Partial Autocorrelation Plot: Correlation versus Lag, 0 to 16]
[Autocorrelation Plot: Correlation versus Lag, 0 to 16]
Most of the dependency seems to have been removed. Although there are some
autocorrelations and partial autocorrelations outside two standard errors, we will not
difference again. We will fit a model first because over-differencing can mask the
effects of MA parameters. In fact, the pattern in this last plot suggests one regular and
one seasonal MA parameter because there are ACF spikes (instead of bulges) at lags 1
and 13, and the PACF shows decay at lags 1 and 13.
Consult the references previously cited for more information on how to read these
plots for identification.
Fitting an ARIMA Model
Here we fit a seasonal multiplicative ARIMA model with no autoregressive parameter,
one difference parameter, one moving average parameter, no seasonal autoregressive
parameter, one seasonal difference parameter, and one seasonal moving average
parameter. The input is:
We save the residuals into a file to check the adequacy of the model by using the
various facilities available in SYSTAT. You can also do normal probability plots,
USE airline
SERIES
LOG pass
DIFFERENCE
DIFFERENCE / LAG=12
SAVE resid
ARIMA /Q=1 QS=1 SEASON=12 BACKCAST=13
USE resid
ACF / LAG=15
[Partial Autocorrelation Plot: Correlation versus Lag, 0 to 16]
stem-and-leaf plots, Kolmogorov-Smirnov tests, and other statistical tests on residuals.
We focus on the serial dependence among the residuals by creating an autocorrelation
plot.
The output is:
None of the autocorrelations are significant.
Iteration Sum of Squares Parameter values
0 .2392764D+00 .100 .100
1 .1835532D+00 .345 .433
2 .1764962D+00 .449 .633
3 .1759952D+00 .416 .592
4 .1758742D+00 .409 .613
5 .1758463D+00 .392 .614
6 .1758443D+00 .396 .613
7 .1758443D+00 .396 .613
8 .1758443D+00 .396 .613
9 .1758443D+00 .396 .613
Final value of MSE is 0.001
Index Type Estimate A.S.E. Lower <95%> Upper
1 MA 0.396 0.093 0.212 0.579
2 SMA 0.613 0.074 0.467 0.760
Asymptotic correlation matrix of parameters

1 2

1 1.000
2 -0.171 1.000
[Autocorrelation Plot: Correlation versus Lag, 0 to 16]
ARIMA Forecasting
We could have added forecasting by specifying 10 cases to be forecast. The input is:
SYSTAT forecasts the future values of the series:
The forecast origin in this case is taken as the last point in the series. From there, the
model computes and prints 10 new points with their upper and lower 95% asymptotic
confidence intervals. SYSTAT automatically plots the forecasts.
USE airline
SERIES
LOG pass
DIFFERENCE
DIFFERENCE / LAG=12
SAVE resid
ARIMA /Q=1 QS=1 SEASON=12 BACKCAST=13 FORECAST=10
Forecast Values
Period Lower95 Forecast Upper95
145. 418.862 450.296 484.090
146. 391.976 426.557 464.189
147. 438.329 482.099 530.240
148. 443.296 492.246 546.602
149. 453.832 508.378 569.480
150. 516.623 583.439 658.898
151. 587.307 668.338 760.549
152. 581.136 666.091 763.464
153. 484.248 558.844 644.931
154. 427.648 496.754 577.028
[Series Plot: PASS versus Case, 0 to 200]
Example 12
Fourier Modeling of Temperature
Lets look at a typical Fourier application. The data in the NEWARK file are 64 average
monthly temperatures in Newark, New Jersey, beginning in January, 1964. The data are
from the U.S. government, cited in Chambers et al. (1983). Notice that their
fluctuations look something like a sine wave, so we might expect that they could be
modeled adequately by the sum of a relatively small number of trigonometric
components. We have taken exactly 64 measurements to fulfill the powers of 2 rule.
We remove the series mean before the decomposition. The input is:
The output follows:
USE newark
SERIES
TIME 1964,12
TPLOT temp
MEAN temp
FOURIER temp / LAG=15
Fourier components of TEMP

Index Frequency Real Imaginary Magnitude Phase Periodogram
1 0.0 0.0 0.0 0.0 . 0.0
2 0.01563 -0.763 -0.363 0.845 -2.697 14.535
3 0.03125 -0.803 -0.177 0.822 -2.924 13.760
4 0.04687 -1.587 -0.779 1.768 -2.685 63.683
5 0.06250 -1.658 -1.817 2.460 -2.310 123.262
6 0.07813 -6.248 -7.214 9.544 -2.285 1855.631
7 0.09375 2.606 3.633 4.471 0.948 407.199
8 0.10938 1.040 1.786 2.067 1.044 87.038
9 0.12500 0.592 0.936 1.107 1.007 24.978
10 0.14063 0.438 0.588 0.733 0.930 10.954
11 0.15625 -0.127 1.135 1.142 1.682 26.558
12 0.17188 0.067 0.715 0.718 1.477 10.507
13 0.18750 -0.255 0.785 0.825 1.885 13.860
14 0.20313 0.140 0.132 0.192 0.756 0.749
15 0.21875 -0.071 0.291 0.299 1.811 1.823
The Quick Graph displays a periodogram, that is, the squared magnitude against
frequencies. Notice that our hunch was largely correct. There is one primary peak at a
relatively low frequency. This periodogram differs from that produced in earlier
versions of SYSTAT. SYSTAT now uses
N/pi*(squared magnitude)
where N is the number of cases in the file.
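A hypothetical Python version of the computation (using a made-up series whose cycle length is 16) is sketched below. The absolute scale of an FFT differs from program to program, so the ordinates will not match the SYSTAT output; what matters is that the peak lands at the cyclical frequency.

import numpy as np

n = 64
t = np.arange(n)
temp = 50 + 20 * np.sin(2 * np.pi * t / 16)   # made-up series with a 16-observation cycle
temp = temp - temp.mean()                     # remove the series mean first

spec = np.fft.fft(temp)
freq = np.arange(n // 2 + 1) / n              # frequencies 0, 1/64, ..., 0.5
magnitude = np.abs(spec[: n // 2 + 1])

periodogram = (n / np.pi) * magnitude**2      # N/pi * squared magnitude, as in the text
print(freq[np.argmax(periodogram)])           # 0.0625, i.e. one cycle every 16 points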
[Series Plot: TEMP, January 1964 to January 1971]
[Periodogram: Squared Magnitude versus Frequency, 0 to 0.25]
Two final points follow. First, some analysts prefer to plot the logs of these values
against frequency. We could do this in the following way:
SQUARE temp
LOG temp
TPLOT temp
Logging, by the way, looks noisier than the plot above but can reveal significant spikes
that might be hidden in the raw periodogram.
The second point involves smoothing the periodogram. Often it is best to taper the
series first before computing the periodogram. This makes the spikes more pronounced
in the log-periodogram plot:
MEAN temp
TAPER temp
FOURIER temp
SQUARE temp
LOG temp
TPLOT temp
Since we didn't specify a value, the split-cosine-bell taper used its default, 0.5.
Computation
All of the time series smoothers and Fourier routines are computed in single precision.
Estimation for ARIMA models is performed in double precision and forecasting is
done in single precision.
Algorithms
The LOWESS algorithm for XY and scatterplot smoothing is documented in Cleveland
(1979) and Cleveland (1981). The Fast Fourier Transform is due to Gentleman and
Sande (1966), and documented further in Bloomfield (1976).
ARIMA models are estimated with a set of algorithms. Residuals and unconditional
sums of squares for the seasonal multiplicative model are calculated by an algorithm
in McLeod and Sales (1983). The sums of squares are minimized iteratively by a quasi-
Newton method due to Fletcher (1972). A penalty function for inadmissible values of
the parameters makes this procedure relatively robust when values are near the
circumference of the unit circle. Standard errors for the parameter estimates are
computed from the inverse of the numeric estimate of the Hessian matrix, following
Fisher (1922). Forecasting is performed via the difference equations documented in
Chapter 5 of Box and Jenkins (1976).
References
Bloomfield, P. (1976). Fourier analysis of time series: An introduction. New York: John
Wiley & Sons, Inc.
Box, G. E. P. and Jenkins, G. M. (1976). Time series analysis: Forecasting and control.
Revised edition. Oakland, Calif.: Holden-Day, Inc.
Brigham, E. O. (1974). The fast Fourier transform. New York: Prentice-Hall.
Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. (1983). Graphical methods
for data analysis. Belmont, Calif.: Wadsworth International Group.
Chatterjee and Price. (1977). Regression analysis by example. New York: John Wiley &
Sons, Inc.
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots.
Journal of the American Statistical Association, 74, 829–836.
Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the machine computation of
complex Fourier series. Mathematics of Computation, 19, 297–301.
Gardner, E. S. (1985). Exponential smoothing: The state of the art. Journal of Forecasting,
4, 1–28.
Makridakis, W., Wheelwright, S. C., and McGee, U. E. (1983). Forecasting: Methods and
applications. 2nd ed. New York: John Wiley & Sons, Inc.
McCleary, R. and Hay, R. A., Jr. (1980). Applied time series analysis for the social
sciences. Beverly Hills: Sage Publications.
Nelson, C. R. (1973). Applied time series analysis for managerial forecasting. San
Francisco: Holden-Day, Inc.
Tukey, J. W. (1977). Exploratory data analysis. Reading: Addison-Wesley.
Vandaele, W. (1983). Applied time series and Box-Jenkins models. New York: Academic
Press.
Velleman, P. F. and Hoaglin, D. C. (1981). Applications, basics, and computing of
exploratory data analysis. Belmont: Duxbury Press.
Chapter 33
Two-Stage Least Squares
Dan Steinberg
The TSLS module is designed for estimation of simultaneous equations systems via
Two-Stage Least Squares (TSLS) and Two-Stage Instrumental Variables (White,
1984). In the first stage, the independent variables are regressed on the instrumental
variables. In the second stage, the dependent variable is regressed on the predicted
values of the independent variables (determined from the first stage). TSLS produces
heteroskedasticity-consistent standard errors for ordinary least squares (OLS) models
and instrumental variables models and provides diagnostic tests for heteroskedasticity
and nonlinearity. TSLS also computes regressions with polynomially distributed lag
structure in the errors.
Statistical Background
Two-stage least squares was introduced by Theil in the early 1950s in unpublished
memoranda and independently by Basmann (1957). Theil's textbook (1971) treats the
topic extensively; other textbooks include Johnston (1984), Judge et al. (1986),
Maddala (1977), and Mardia et al. (1979).
Two-Stage Least Squares Estimation
Two-Stage Least Squares (TSLS) is the most common example of an instrumental
variables (IV) estimator. The IV estimator is appropriate if we want to fit the statistical
model
Equation 33-1
y = Xb + ε
when some of the regressors in X are correlated with the errors ε. This can occur if
some of the Xs are measured with error or when some of the Xs are dependent
variables in a larger system of equations.
To use the instrumental variables procedure, we must have some variables Z in our
data set that are uncorrelated with the error terms (E[Z'ε] = 0). These variables,
which are called the instrumental variables, can include some or all of the variables
X of the model and any other variables in the data. To estimate a model, there must be
at least as many instrumental variables as there are regressors.
Heteroskedasticity
The problem of heteroskedasticity is discussed in Theil (1971) and extensively in
Judge et al. (1985), which includes numerous references. The approach to
heteroskedasticity taken in this module, which is to produce correct standard errors for
the OLS case, was introduced by Eicker (1963, 1967) and Hinkley (1977). It was
rediscovered independently by White (1980), who also extended its application to the
TSLS context (White, 1982). A technical account of the theory underlying all of the
methods used in this module appears in White (1984). The basic statistical model of
regression analysis can be written as
Equation 33-2
y = Xb + ε
where y is the dependent variable, X is a vector of independent variables, b is a vector
of unknown regression coefficients, and ε is an unobservable random variable. If the
regressors are uncorrelated with the random error ε (E[X'ε] = 0), ordinary least squares
(OLS) will generally produce consistent and asymptotically normal estimators. Further,
if the errors have constant variance for all of the observations in the data set, the usual
t statistics are correct and hypothesis testing can be conducted on the basis of the
variance-covariance matrix of the coefficient estimates. These are the assumptions
underlying the estimation and hypothesis testing of MGLH and other major regression
packages. If either of these assumptions is false, features of TSLS can be used to obtain
valid hypothesis tests and consistent parameter estimates.
We estimate heteroskedasticity-consistent standard errors because they are correct
asymptotically under a broad set of assumptions. If the random errors in a regression
model exhibit heteroskedasticity, the conventional standard errors and covariance
matrix are usually inconsistent. The t statistics are erroneous, and any hypothesis tests
that employ the covariance matrix estimate will also be incorrect (have the wrong size).
The heteroskedasticity-consistent standard errors, by contrast, are correct, whether or
not heteroskedasticity is present.
There is no way to tell whether the robust standard errors will be larger or smaller
than the OLS results, but they may differ substantially. The classical approach to
heteroskedasticity is to postulate an exact functional form for the second moments of
the errors. Some analysts assume, for example, that the variance of the error for each
observation is proportional to the square of one of the independent variables. (See
Judge et al. for further details.) The model is estimated by generalized (or weighted)
least squares (GLS) with weights obtained from least-square residuals. Of course, this
approach requires that the assumptions of the analyst are correct. If these assumptions
are incorrect, the standard errors resulting from GLS will also be incorrect.
The heteroskedasticity-consistent standard errors computed in TSLS are not based
on any attempt to correct for, or otherwise model, the heteroskedasticity. Instead,
essentially nonparametric estimates of the OLS standard errors are computed. We still
get OLS coefficients, but the variances of the coefficients are revised. White (1980)
showed that this is possible for virtually any type of heteroskedasticity.
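A bare-bones version of the sandwich computation can be written in a few lines of Python. This is the basic HC0 form on simulated heteroskedastic data; SYSTAT may apply finite-sample adjustments, so the sketch is illustrative rather than a reproduction of the module's output.

import numpy as np

def ols_with_hc0(X, y):
    """OLS coefficients with White's heteroskedasticity-consistent
    (HC0) standard errors:  cov(b) = (X'X)^-1 X' diag(e^2) X (X'X)^-1."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y                 # ordinary least squares
    e = y - X @ b                         # residuals
    meat = X.T @ (X * (e**2)[:, None])    # X' diag(e^2) X
    cov_hc = xtx_inv @ meat @ xtx_inv     # the "sandwich" estimator
    return b, np.sqrt(np.diag(cov_hc))

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1 + 0.5 * x + rng.normal(0, 0.2 + 0.2 * x)   # made-up heteroskedastic data
X = np.column_stack([np.ones(200), x])
b, se = ols_with_hc0(X, y)
print(b, se)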
Two-Stage Least Squares in SYSTAT
Two-Stage Least Squares Main Dialog Box
To open the dialog box, from the menus choose:
Statistics
Regression
2 Stage Least Squares....
SYSTAT computes Two-Stage Least Squares by first specifying a model and then
estimating it.
Dependent. The variable you want to examine. The dependent variable should be
continuous and numeric.
Independent(s). Select one or more continuous or categorical variables (grouping
variables). To add an interaction to your model, click the Cross button. For example,
to add the term sex*education, add sex to the Independent(s) list and then add
education by clicking the Cross button.
Instrumental(s). Select the instrumental variable(s) that you want to estimate.
Instrumentals may be continuous or categorical. To add an interaction to your model,
click the Cross button. For example, to add the term sex*education, add sex to the
Instrumental(s) list and then add education by clicking the Cross button. The number
of instrumental variables must equal or exceed the number of independent variables.
Lags. Specify the number of lags for variables in the independent or instrumental
variable target list. Highlight the variable in the general variable list, enter the number
of lags, and click Add. The variable appears in the independent or instrumental
variables list with a colon followed by the number of lags.
Include constant. Indicate whether you want the constant turned on or off. In practice,
the constant is almost always included.
Heteroskedasticity consistent. Computes heteroskedasticity-consistent standard errors,
which are correct whether or not heteroskedasticity is present.
Save file. Saves statistics into filename.SYD.
Using Commands
Select a data file with USE filename and continue with:
The CONSTANT term is almost always included. The second block of variables is the
set of predictors (regressors). Any predictors can be declared categorical with the
CATEGORY statement. This will encode them with dummy variables. After a vertical
bar, list the instrumental variables.
Usage Considerations
Types of data. TSLS uses rectangular data.
Print options. PRINT=LONG adds the covariance matrix of the coefficient estimates.
Quick Graphs. No Quick Graphs are produced by TSLS.
Saving files. Predicted values and residuals can be saved to a SYSTAT system file with
the SAVE command. The SAVE command must be issued before the ESTIMATE
command. The SAVE command works across the BY command. If several BY groups
are being analyzed, the predicted values, etc., are saved for each BY group in a single
SYSTAT file.
BY groups. TSLS analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is not available in this procedure.
Case frequencies. FREQ=<variable> increases the number of cases by the FREQ
variable.
Case weights. WEIGHT is not available in TSLS.
TSLS
MODEL yvar = CONSTANT + var + ... + var1*var2 + ...,
| ivar + ... + ivar1*ivar2 + ...
ESTIMATE / HC
Examples
Example 1
Heteroskedasticity-Consistent Standard Errors
An example will illustrate how to use TSLS to diagnose a regression model and obtain
correct answers if the classical regression assumptions are violated. The data that we
are using were extracted from the National Longitudinal Survey of Young Men, 1979.
Information for 38 men is available on natural log of wage (LW), highest completed
grade (EDUC), mothers education (MED), fathers education (FED), race
(BLACK=1), AGE, and several other variables. We want to estimate a model relating
wage to education, race, and age. The input for a simple linear model with
heteroskedasticity-consistent standard errors is:
The resulting output is:
USE NLS
TSLS
MODEL LW=CONSTANT+EDUC+BLACK+AGE
ESTIMATE / HC
Input records: 200
Records kept for analysis: 200

Ordinary least squares results (OLS)
Dependent variable: LW

N: 200, mean of dependent variable: 6.080000
R-squared: 0.140278
Adjusted R-squared: 0.127119, uncentered R-squared (R0-squared): 0.994372

Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 4.483 0.327 13.726 0.000
2 EDUC 0.023 0.013 1.715 0.088
3 BLACK -0.207 0.135 -1.530 0.128
4 AGE 0.050 0.011 4.605 0.000

F(3,196) = 10.660242, prob= 0.000002

Standard error of regression: 0.462279
Regression sum of squares: 6.834354
Residual sum of squares: 41.885646


Covariance matrix of regression coefficients

1 2 3 4
1 0.107
2 -0.002 0.000
3 -0.009 0.000 0.018
4 -0.003 -0.000 0.000 0.000
Initial TSLS output reports conventional regression output, which could also have been
obtained in MGLH. The standard errors reported are obtained from the diagonal of the
matrix , where is the sum of squared residuals divided by the degrees of
freedom; the t statistics are the classical ones as well. The one new statistic is the
uncentered , reported as R0-squared . This statistic has no bearing on the
goodness of fit of the regression and should be routinely ignored. It is reported only as
a computational convenience for those who want to use the Lagrange multiplier tests
discussed by Engle (1984).
In addition to producing what White (1980) called heteroskedasticity-consistent
standard errors, the TSLS module calculates three diagnostic statistics for the linear
model. The first is the usual Durbin-Watson statistic. This is the same test for
autocorrelation that appears in the MGLH module.
The second is the White (1980) specification test, which explicitly checks the
residuals for heteroskedasticity. Under the null hypothesis of homoskedasticity, this
statistic has an asymptotic chi-square distribution. If this statistic is large, we have
evidence of heteroskedasticity. In the example, this statistic is 7.907 with eight degrees
of freedom, indicating that we cannot reject the null hypothesis of homoskedasticity.
The White test is actually a general test of misspecification (White, 1982) and is
sensitive to various departures of the model and data from standard assumptions. A
significant statistic is evidence that something is wrong with the model, but it does not
identify the source of the problem. It may be heteroskedasticity, but it could also be
left-out variables or nonlinearity, for example.
The third is the nonlinearity test, which checks for neglected nonlinearities in the
regression function. It is simply a test of the joint hypothesis that all possible
interactions, including squared regressors, have zero coefficients in a full model. A
Heteroskedastic consistent results

Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 4.483 0.335 13.386 0.000
2 EDUC 0.023 0.014 1.633 0.104
3 BLACK -0.207 0.163 -1.266 0.207
4 AGE 0.050 0.012 4.254 0.000

Specification test statistic df p-value
Durbin-Watson 1.836
White specification 7.907 8.000 0.443
Nonlinearity 7.710 5.000 0.173

Heteroskedastic consistent covariance matrix of regression coefficients

1 2 3 4
1 0.112
2 -0.002 0.000
3 -0.015 0.000 0.027
4 -0.003 -0.000 0.000 0.000
Lagrange multiplier test, it too has an approximate chi-square distribution under the
null hypothesis of correct specification. Again, large values for this statistic are
evidence for neglected interactions. In our example, there is no evidence of non-
linearity on the basis of this broad test.
Each of the latter two tests involves supplementary regressions with a possibly large
number of additional independent variables. The White test is computed by regressing
the squared OLS residuals on all the squares and cross-products of the Xs, and the
nonlinearity test regresses residuals on these same cross-products.
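A sketch of that auxiliary regression in Python follows. It uses the common n·R² form of the statistic and builds the regressor set mechanically from levels, squares, and cross-products; SYSTAT's exact construction (for example, how it handles dummy variables whose square duplicates the level) may differ, which is why the degrees of freedom here need not match the output above.

import numpy as np

def white_test(X, e):
    """White specification test: n * R-squared from regressing the
    squared residuals e**2 on the regressors, their squares, and their
    cross-products (plus a constant)."""
    X = np.asarray(X, float)          # columns = regressors, no constant
    n, k = X.shape
    cols = [np.ones(n)] + [X[:, i] for i in range(k)]
    cols += [X[:, i] * X[:, j] for i in range(k) for j in range(i, k)]
    Z = np.column_stack(cols)
    y = e ** 2
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ b
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return n * r2, Z.shape[1] - 1     # statistic and nominal degrees of freedom

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 200))
e = rng.normal(size=200) * (1 + np.abs(x1))       # made-up heteroskedastic residuals
print(white_test(np.column_stack([x1, x2]), e))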
Example 2
Two-Stage Least Squares
In this example, we use the extended MODEL statement to construct a TSLS model:
The first part of the MODEL statement is identical to what we have seen in MGLH. It
has a dependent variable and a list of independent variables, characterizing a structural
equation. This is the theoretical model we want to estimate. The vertical bar (|) in the
middle of the statement signifies the end of the structural equation and indicates that a
list of instrumental variables follows. In this example, our structural equation relates
the logarithm of the wage, LW, to EDUC, BLACK, AGE, and a CONSTANT. The list of
instrumental variables follows, and it consists of CONSTANT, MED, FED, BLACK,
and AGE. At this point, all we need to know is that there are at least as many
instrumental variables to the right of the vertical bar as there are regressors to the left.
Satisfying this condition means that TSLS will attempt to fit the model.
What exactly does our model mean? In our example, CONSTANT, BLACK, and
AGE appear both in the structural equation and in the list of instrumental variables. By
using them in this way, the analyst expresses confidence that these are conventional
independent variables, uncorrelated with the error term ε. These exogenous variables
can be said to be instruments for themselves. The variable EDUC, however, appears
only in the structural equation. This is a signal that the analyst wants to consider EDUC
as an endogenous variable that might be correlated with the error term. On the right-hand
side of the vertical bar, MED and FED appear as instrumental variables only. They can
thus be said to be instruments for EDUC.
USE NLS
TSLS
MODEL LW=CONSTANT+EDUC+BLACK+AGE | CONSTANT+MED+FED+BLACK+AGE
ESTIMATE
The total number of instruments is five, which is one greater than the number of
regressors. Notice that if the lists before and after the vertical bar are identical, the
procedure reduces mathematically to OLS; TSLS, however, will do a lot of extra work
to discover this.
Some analysts prefer to think of Two-Stage Least Squares as involving a literal pair
of estimated regressions. From this point of view, we have, for example, the following
two equations:
MODEL LW = CONSTANT + EDUC + BLACK + AGE
MODEL EDUC = CONSTANT + FED + MED
The first equation is the structural equation for LW, and the second is a possible
structural equation for EDUC, relating EDUC to the education levels of parents.
Because EDUC is itself seen to be a dependent variable in the larger set of equations,
it cannot properly appear as a regressor in a standard regression. The two-stage
technique involves estimating the equation for EDUC first, forming predicted values
for EDUC, say EDUCHAT, and then estimating the model
MODEL LW = CONSTANT + EDUCHAT + BLACK + AGE
instead. Notice, however, that to estimate a TSLS model in a literal pair of regressions,
the equation for EDUC would have to expand to include all of the exogenous
independent variables appearing in the equation for LW. That is, the correct first-stage
regression would actually be
MODEL EDUC = CONSTANT + FED + MED + BLACK + AGE
although we thought the shorter model was the true model. Also, the standard errors
obtained from a literal two-stage estimation are not correct, as they must be calculated
from actual and not predicted values of the independent variables. Fortunately, these
details are taken care of for you by TSLS. Just make sure to partition your variables into
exogenous and endogenous groups and list all the exogenous variables to the right of
the vertical bar.
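To make the two-stage logic concrete, here is a Python sketch on simulated data loosely patterned after this example (the coefficients and variables are invented, not the NLS file). The first stage regresses the endogenous regressor on every exogenous variable and instrument; the second stage replaces it by its fitted values. The coefficients agree with TSLS, but, as noted above, the standard errors from the literal second-stage regression would not be correct.

import numpy as np

rng = np.random.default_rng(0)
n = 200
med, fed = rng.normal(12, 2, n), rng.normal(12, 2, n)   # made-up stand-ins for MED and FED
black = rng.integers(0, 2, n).astype(float)
age = rng.uniform(25, 40, n)
u = rng.normal(0, 1, n)                                  # unobserved error correlated with EDUC
educ = 2 + 0.4 * med + 0.4 * fed + 0.5 * u + rng.normal(0, 1, n)
lw = 4.5 + 0.05 * educ - 0.2 * black + 0.05 * age + u

def fit(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

# first stage: regress EDUC on ALL exogenous variables and instruments
Z = np.column_stack([np.ones(n), med, fed, black, age])
educ_hat = Z @ fit(Z, educ)

# second stage: replace EDUC with its fitted values
X2 = np.column_stack([np.ones(n), educ_hat, black, age])
print(fit(X2, lw))   # these coefficients match TSLS; the naive second-stage
                     # standard errors would not, as noted in the text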
Following is the output:
Example 3
Two-Stage Instrumental Variables
As in the case of the OLS estimator, the standard errors calculated for the TSLS
estimator will be incorrect in the presence of heteroskedasticity. Although we could
calculate heteroskedasticity-consistent standard errors, it turns out that sometimes we
can do even better. If the number of instrumental variables is strictly greater than the
number of regressors in the presence of heteroskedasticity, the two-stage instrumental
variables (TSIV) estimator is more efficient than TSLS. This means that in TSIV, the
coefficient estimates as well as the standard errors may differ somewhat from TSLS.
Observe, though, that large differences between TSIV and TSLS coefficients may
indicate model misspecification (for example, the variables assumed to be exogenous
are not truly exogenous). As in the case of heteroskedasticity-consistent OLS,
computation of the TSIV estimator does not require knowledge of the form of
heteroskedasticity. See White (1982b, 1984) for more on the TSIV estimator.
The following sequence of statements tells TSLS to estimate the model by TSIV as
well as by TSLS. The only difference from TSLS is that the HC option is requested.
Input records: 200
Records kept for analysis: 200

Instrumental variables, OLS results (TSLS)
Dependent variable: LW

N: 200, mean of dependent variable: 6.080000
Instruments: CONSTANT + MED + FED + BLACK + AGE

Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 4.695 0.491 9.564 0.000
2 EDUC 0.006 0.033 0.177 0.860
3 BLACK -0.223 0.138 -1.610 0.109
4 AGE 0.051 0.011 4.614 0.000

Standard error of regression: 0.464223
Residual sum of squares: 42.238554

Covariance matrix of regression coefficients

1 2 3 4
1 0.241
2 -0.013 0.001
3 -0.020 0.001 0.019
4 -0.002 -0.000 0.000 0.000
USE NLS
TSLS
MODEL LW=CONSTANT+EDUC+BLACK+AGE | CONSTANT+MED+FED+BLACK+AGE
ESTIMATE / HC
If the number of instruments is the same as the number of regressors, the TSIV and
TSLS coefficient estimators are identical. In this case, only the standard errors printed
under the TSIV results will differ from TSLS. These are the heteroskedasticity-
consistent standard errors for the usual IV estimator.
Following is the output:
Example 4
Polynomially Distributed Lags
In this example, we use the extended MODEL statement to construct an OLS model with
polynomially distributed lags. The degree of lag is expressed after the variable name,
separated by a colon:
Input records: 200
Records kept for analysis: 200

Instrumental variables, OLS results (TSLS)
Dependent variable: LW

N: 200, mean of dependent variable: 6.080000
Instruments: CONSTANT + MED + FED + BLACK + AGE

Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 4.695 0.491 9.564 0.000
2 EDUC 0.006 0.033 0.177 0.860
3 BLACK -0.223 0.138 -1.610 0.109
4 AGE 0.051 0.011 4.614 0.000

Standard error of regression: 0.464223
Residual sum of squares: 42.238554

Covariance matrix of regression coefficients

1 2 3 4
1 0.241
2 -0.013 0.001
3 -0.020 0.001 0.019
4 -0.002 -0.000 0.000 0.000

Instrumental variables, heteroscedastistic consistent results (2SIV)

Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 4.612 0.514 8.976 0.000
2 EDUC 0.020 0.032 0.610 0.542
3 BLACK -0.156 0.169 -0.924 0.357
4 AGE 0.046 0.012 3.995 0.000
1 2 3 4
1 0.264
2 -0.014 0.001
3 -0.031 0.001 0.029
4 -0.003 -0.000 0.000 0.000
USE NLS
TSLS
MODEL LW = CONSTANT + EDUC:3 + BLACK + AGE
EST
Following is the output:
Computation
All computations are in double precision.
Algorithms
TSLS computes least-squares estimates via standard high-precision algorithms.
Specific details are given in the references.
Missing Data
Cases with missing data on any variable in the model are deleted before estimation.
Input records: 200
Records kept for analysis: 197
Records deleted for missing incomplete data: 3

Ordinary least squares results (OLS)
Dependent variable: LW

N: 197, mean of dependent variable: 6.086294
R-squared: 0.146857
Adjusted R-squared: 0.119916, uncentered R-squared (R0-squared): 0.994479

Parameter Estimate S.E. t-ratio p-value
1 CONSTANT 4.723 0.461 10.250 0.000
2 EDUC 0.020 0.014 1.445 0.150
3 EDUC<1> 0.010 0.014 0.700 0.485
4 EDUC<2> -0.013 0.014 -0.968 0.334
5 EDUC<3> -0.009 0.014 -0.625 0.532
6 BLACK -0.240 0.136 -1.759 0.080
7 AGE 0.049 0.011 4.411 0.000

F(6,190) = 5.450993, prob= 0.000032

Standard error of regression: 0.461989
Regression sum of squares: 6.980559
Residual sum of squares: 40.552436


Covariance matrix of regression coefficients

1 2 3 4 5 6 7
1 0.212
2 -0.002 0.000
3 -0.002 -0.000 0.000
4 -0.002 -0.000 -0.000 0.000
5 -0.003 -0.000 -0.000 -0.000 0.000
6 -0.013 0.000 -0.000 0.000 0.000 0.019
7 -0.004 -0.000 0.000 0.000 0.000 0.000 0.000
References
Breusch, T. S. and Pagan, A. R., (1979). A simple test for heteroskedasticity and random
coefficient variation. Econometrica, 47, 1287–1294.
Eicker, F. (1963). Asymptotic normality and consistency of the least squares estimators for
families of linear regressions. Annals of Mathematical Statistics, 34, 447–456.
Eicker, F. (1967). Limit theorems for regressions with unequal and dependent errors. In
Proceedings of the fifth Berkeley symposium on mathematical statistics and probability,
Vol. 1. Berkeley: University of California Press.
Engle, R. F. (1984). Wald, likelihood ratio and Lagrange multiplier tests in econometrics.
In Griliches, Z. and Intrilligator, M. D. (eds.), Handbook of econometrics, Vol. II. New
York: Elsevier.
Hinkley, D. D. (1977). Jackknifing in unbalanced situations. Technometrics, 19, 285–292.
Johnston, J. (1984). Econometric methods, 3rd ed. New York: McGraw-Hill.
Judge, G. G., Griffith, W. E., Hill, R. C., Lutkepohl, H., and Lee, T. C. (1985). The theory
and practice of econometrics, 2nd ed. New York: John Wiley & Sons, Inc.
MacKinnon, J. G. and White, H. (1985). Some heteroskedasticity-consistent covariance
matrix estimators with improved finite sample properties. Journal of Econometrics, 29,
305–325.
Maddala, G. S. (1977). Econometrics. New York: McGraw-Hill.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate analysis. New York:
Academic Press.
Theil, H. (1971). Principles of econometrics. New York: John Wiley & Sons, Inc.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct
test for heteroskedasticity. Econometrica, 48, 817–838.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica,
50, 1–25.
White, H. (1982). Instrumental variables estimation with independent observations.
Econometrica, 50, 483–500.
White, H. (1984). Asymptotic theory for econometricians. New York: Academic Press.
Index
A matrix, 469
accelerated failure time distribution, 914
ACF plots, 1018
additive trees, 62, 68
AID, 35, 37
Akaike Information Criterion, 784
alternative hypothesis, 13
analysis of covariance, 401
examples, 432, 448
model, 402
analysis of variance, 212, 457
algorithms, 455
ANOVA command, 408
assumptions, 358
between-group differences, 364
bootstrapping, 408
compared to loglinear modeling, 587
compared to regression trees, 35
contrasts, 360, 404, 405, 406
data format, 408
examples, 409, 412, 416, 427, 429, 431, 432, 434,
440, 442, 445, 448, 450
factorial, 357
hypothesis tests, 356, 404, 405, 406
interactions, 357
model, 402
multivariate, 363, 366
overview, 401
post hoc tests, 359, 402
Quick Graphs, 408
repeated measures, 363, 406
residuals, 402
unbalanced designs, 361
unequal variances, 358
usage, 408
within-subject differences, 364
Anderberg dichotomy coefficients, 120, 126
angle tolerance, 872
anisotropy, 876, 884
geometric, 876
zonal, 876
ARIMA models, 1003, 1013, 1026
algorithms, 1056
ARMA models, 1008
autocorrelation plots, 344, 1006, 1009, 1018
Automatic Interaction Detection, 35
autoregressive models, 1006
axial designs, 235
backward elimination, 349
bandwidth, 872
BASIC, 912
basic statistics. See descriptive statistics
between-group differences
in analysis of variance, 364
bias, 349
binary logit, 518
compared to multinomial logit, 520
binary trees, 33
biplots, 794, 795
Bisquare procedure, 658
Bonferroni inequality, 37
Bonferroni test, 127, 359, 402, 463, 967
bootstrap, 19, 20
algorithms, 28
bootstrap-t method, 19
command, 20
data format, 20
examples, 21, 24, 25, 26
missing data, 28
naive bootstrap, 19
overview, 17
Quick Graphs, 20
usage, 20
Box and Behnken designs, 230, 235
Box and Hunter designs, 230, 232
box plot, 212
Bray-Curtis measure, 119, 126
C matrix, 469
canonical correlation analysis, 457
bootstrapping, 827
data format, 826, 827
examples, 828, 831, 835
interactions, 827
model, 825
nominal scales, 827
overview, 817
partialed variables, 825
Quick Graphs, 827
rotation, 826
usage, 827
canonical correlations, 256
canonical rotation, 795
categorical data, 695
categorical predictors, 35
CCF plots, 1019
central limit theorem, 964
centroid designs, 235
CHAID, 36, 37
chi-square, 162
chi-square test for independence, 150
circle model
in perceptual mapping, 793
city-block distance, 126
classical analysis, 980
classification functions, 250
classification trees, 36
algorithms, 50
basic tree model, 32
bootstrapping, 43
commands, 42
compared to discriminant analysis, 36, 39
data format, 43
displays, 40
examples, 44, 46, 48
loss functions, 38, 40
missing data, 50
mobiles, 31
model, 40
overview, 31
pruning, 37
Quick Graphs, 43
saving files, 43
stopping criteria, 37, 42
usage, 43
cluster analysis
additive trees, 68
algorithms, 85
bootstrapping, 70
commands, 69
data types, 70
distances, 66
examples, 71, 76, 79, 80, 82, 83
exclusive clusters, 54
hierarchical clustering, 64
k-means clustering, 67
missing values, 85
overlapping clusters, 54
overview, 53
Quick Graphs, 70
saving files, 70
usage, 70
Cochran's test of linear trend, 170
coefficient of alienation, 619, 640
coefficient of determination. See multiple
correlation
coefficient of variation, 213
Cohen's kappa, 166, 170
communalities, 302
compound symmetry, 364
conditional logistic regression model, 520
conditional logit model, 522
confidence curves, 652
confidence intervals, 11, 213
path analysis, 782
conjoint analysis
additive tables, 88
algorithms, 112
bootstrapping, 95
commands, 95
compared to logistic regression, 92
data format, 95
examples, 96, 100, 103, 107
missing data, 113
model, 93
multiplicative tables, 89
overview, 87
Quick Graphs, 95
saving files, 95
usage, 95
contingency coefficient, 167, 170
contour plots, 882
contrast coefficients, 363
contrasts
in analysis of variance, 360
convex hulls, 881
Cook's distance, 345
Cook-Weisberg graphical confidence curves, 652
correlations, 55, 115
algorithms, 145
binary data, 126
bootstrapping, 129
canonical, 817
commands, 128
continuous data, 125
data format, 129
dissimilarity measures, 126
distance measures, 126
examples, 129, 132, 134, 135, 137, 140, 143, 145
missing values, 124, 146
options, 127
Quick Graphs, 129
rank-order data, 126
saving files, 129
set, 817
usage, 129
correlograms, 885
correspondence analysis, 790, 794
algorithms, 158
bootstrapping, 152
commands, 152
data format, 152
examples, 153, 155
missing data, 151, 158
model, 151
multiple correspondence analysis, 151
overview, 149
Quick Graphs, 152
simple correspondence analysis, 151
usage, 152
covariance matrix, 125
covariance paths
path analysis, 732
covariograms, 871
Cramér's V, 167
critical level, 13
Cronbach's alpha. See descriptive statistics
cross-correlation plots, 1019
crossover designs, 457
crosstabulation, 205
bootstrapping, 173
commands, 173
data format, 173
examples, 175, 177, 179, 180, 181, 183, 188, 190,
191, 194, 196, 198, 199, 201, 202
multiway, 172
one-way, 160, 162, 168
overview, 159
Quick Graphs, 173
standardizing tables, 161
two-way, 160, 163, 169, 170
usage, 173
cross-validation, 38, 250, 350
D matrix, 469
D Sub-A (dₐ), 843
dates, 912
degrees of freedom, 962
dendrograms, 57, 70
dependence paths
path analysis, 731
descriptive statistics, 1
basic statistics, 213, 214
bootstrapping, 217
commands, 217
Cronbach's alpha, 216
data format, 217
overview, 207
Quick Graphs, 217
stem-and-leaf plots, 215
usage, 217
design of experiments, 92, 231
bootstrapping, 237
Box and Behnken designs, 235
Box and Hunter designs, 232
commands, 236
examples, 237, 238, 240, 241, 242, 243, 244
factorial designs, 232
Latin square designs, 233
mixture designs, 235
overview, 229
Plackett-Burman designs, 234
Quick Graphs, 237
Taguchi designs, 234
usage, 237
dichotomy coefficients
Anderberg, 126
Jaccard, 126
positive matching, 126
simple matching, 126
Tanimoto, 126
difference contrasts, 468
difficulty, 997
discrete choice model, 522
compared to polytomous logit, 523
discriminant analysis, 457
bootstrapping, 258
commands, 257
compared to classification trees, 36
data format, 258
estimation, 254
examples, 258, 263, 268, 276, 283, 285, 291
linear discriminant function, 250
linear discriminant model, 246
model, 253
multiple groups, 252
options, 254
overview, 245
prior probabilities, 252
Quick Graphs, 258
statistics, 256
stepwise estimation, 254
usage, 258
discrimination parameter, 997
dissimilarities
direct, 617
indirect, 617
distance measures, 55, 115
distances
nearest-neighbor, 879
dit plots, 15
dot histogram plots, 15
D-Prime (d'), 842
dummy codes, 460
Duncan's test, 360
Dunnett test, 463
Dunn-Sidak test, 127, 967
ECVI, 784
edge effects, 881
effects codes, 353, 460
eigenvalues, 256
ellipse model
in perceptual mapping, 794
EM algorithm, 333
EM estimation, 147
for correlations, 127
endogenous variables
path analysis, 732
equamax rotation, 303, 307
Euclidean distances, 617
exogenous variables
path analysis, 732
expected cross-validation index, 784
exponential distribution, 914
exponential model, 874, 884
exponential smoothing, 1014
external unfolding, 792
factor analysis, 301, 790
algorithms, 333
bootstrapping, 309
commands, 309
compared to principal components analysis, 304
convergence, 305
correlations vs. covariances, 301
data format, 309
eigenvalues, 305
eigenvectors, 308
examples, 311, 314, 318, 320, 324, 327
iterated principal axis, 305
loadings, 308
maximum likelihood, 305
missing values, 333
number of factors, 305
overview, 297
principal components, 305
Quick Graphs, 309
residuals, 308
rotation, 303, 307
save, 308
scores, 308
usage, 309
factor loadings, 980
factorial analysis of variance, 357
factorial designs, 230, 232
Fieller bounds, 549
filters, 1016
Fisher's exact test, 166, 170
Fisher's linear discriminant function, 790
Fisher's LSD, 359, 463
fixed variance
path analysis, 734
Fletcher-Powell minimization, 996
forward selection, 349
Fourier analysis, 1015, 1028
fractional factorial designs, 457
Freeman-Tukey deviates, 590
frequencies, 20, 43, 95, 129, 152, 173, 217, 258, 309,
373, 408, 471, 533, 594, 623, 661, 702, 719,
745, 797, 811, 827, 847, 924, 968, 985, 1029,
1063
frequency, 891
frequency tables. See crosstabulation
Friedman test, 698
gamma coefficients, 126
Gaussian model, 874, 884
Gauss-Newton method, 651, 652
general linear models
algorithms, 516
bootstrapping, 471
categorical variables, 460
commands, 471
contrasts, 465, 467, 468, 469
data format, 471
examples, 473, 480, 482, 483, 485, 488, 490, 494,
502, 505, 506, 510, 514, 515
hypothesis tests, 465
mixture model, 462
model estimation, 458
overview, 457
post hoc tests, 463
Quick Graphs, 471
repeated measures, 461
residuals, 458
stepwise regression, 462
usage, 471
generalized least squares, 743, 1061
generalized variance, 820
geostatistical models, 870, 871
Gini index, 38, 40
GLM. See general linear models
Goodman-Kruskal gamma, 126, 167, 170
Goodman-Kruskal lambda, 170
Greenhouse-Geisser statistic, 365
Guttman's coefficient of alienation, 619
Guttman's loss function, 639
Guttman's mu2 monotonicity coefficients, 119, 126
Guttman-Rulon coefficient, 981
Hadi outlier detection, 123, 127
Hampel procedure, 658
Hanning weights, 1002
hazard function
heterogeneity, 917
heteroskedasticity, 1060
heteroskedasticity-consistent standard errors, 1061
hierarchical clustering, 56, 64
hinge, 209
histograms
nearest-neighbor, 891
hole model, 875, 884
Holt's method, 1014
Huber procedure, 658
Huynh-Feldt statistic, 365
hypothesis
alternative, 13
null, 13
testing, 12
hypothesis testing, 341
ID3, 37
incomplete block designs, 457
independence, 163
in loglinear models, 586
INDSCAL model, 615
inertia, 150
inferential statistics, 7
instrumental variables, 1059
internal consistency, 981
interquartile range, 209
interval-censored data, 910
isotropic, 871
item-response analysis. See test item analysis
item-test correlations, 980
Jaccard dichotomy coefficients, 120, 126
jackknife, 18, 20
jackknifed classification matrix, 250
Kendall's tau-b coefficients, 126, 167, 170
k-means clustering, 60, 67
Kolmogorov-Smirnov test, 696
KR20, 981
kriging, 882
ordinary, 877, 888
simple, 877, 888
trend components, 877
universal, 878
Kruskal's loss function, 638
Kruskal's STRESS, 619
Kruskal-Wallis test, 694, 695
Kukoc statistic, 7
Kulczynski measure, 126
kurtosis, 213
lags
number of lags, 872
latent trait model, 980, 982
Latin square designs, 230, 233, 457
lattice, 716
lattice designs, 235
Lawley-Hotelling trace, 256
least absolute deviations, 650
Levene test, 358
leverage, 346
likelihood-ratio chi-square, 166, 170, 588, 590
compared to Pearson chi-square, 588
Lilliefors test, 713
linear contrasts, 360
linear discriminant function, 250
linear discriminant model, 246
linear models
analysis of variance, 401
general linear models, 457
linear regression, 369
linear regression, 11, 341
bootstrapping, 373
commands, 373
data format, 373
estimation, 371
examples, 374, 377, 380, 383, 387, 390, 394, 396,
397, 398, 399
model, 370
overview, 369
Quick Graphs, 373
residuals, 343, 370
stepwise, 349, 371
tolerance, 371
usage, 373
using a correlation matrix as input, 351
using a covariance matrix as input, 351
using an SSCP matrix as input, 351
listwise deletion, 146, 333
loadings, 300, 301
logistic item-response analysis, 996
one-parameter model, 982
two-parameter model, 982
logistic regression, 517
algorithms, 576
bootstrapping, 533
categorical predictors, 526
compared to conjoint analysis, 92
compared to linear model, 518
conditional variables, 525
confidence intervals, 549
convergence, 528
data format, 533
deciles of risk, 529
discrete choice, 527
dummy coding, 526
effect coding, 526
estimation, 528
examples, 534, 536, 537, 542, 547, 550, 558, 565,
567, 571, 574
missing data, 576
model, 525
options, 528
overview, 517
post hoc tests, 531
prediction table, 525
print options, 533
quantiles, 530, 550
Quick Graphs, 533
simulation, 531
stepwise estimation, 528
tolerance, 528
usage, 533
weights, 533
logit model, 519
loglinear modeling
bootstrapping, 594
commands, 593
compared to analysis of variance, 587
compared to crosstabulation, 593
convergence, 589
data format, 594
examples, 595, 605, 608, 612
frequency tables, 593
model, 589
overview, 585
parameters, 590
Quick Graphs, 594
saturated models, 587
statistics, 590
structural zeros, 591
usage, 594
log-logistic distribution, 914
log-normal distribution, 914
loss functions, 38, 647
multidimensional scaling, 638
LOWESS smoothing, 1003
low-pass filter, 1016
LSD test, 402, 463
madograms, 885
Mahalanobis distances, 246, 256
Mann-Whitney test, 694, 695
MANOVA. See analysis of variance
Mantel-Haenszel test, 172
MAR, 147
Marquardt method, 655
mass, 150
matrix displays, 57
maximum likelihood estimates, 648
maximum likelihood factor analysis, 304
maximum Wishart likelihood, 743
MCAR, 147
McFadden's conditional logit model, 522
McNemar's test, 166, 170
MDPREF, 794, 795
MDS. See multidimensional scaling
mean, 3, 208, 213
means coding, 354
median, 4, 208, 213
meta-analysis, 352
MGLH. See general linear models
midrange, 209
minimum spanning trees, 879
Minkowski metric, 619
mixture designs, 231, 235
models, 10
estimation, 10
mosaic plots, 882
moving average, 1001, 1007
mu2 monotonicity coefficients, 126
multidimensional scaling, 790
algorithms, 638
assumptions, 616
bootstrapping, 623
commands, 623
configuration, 619, 622
confirmatory, 622
convergence, 619
data format, 623
dissimilarities, 617
distance metric, 619
examples, 624, 626, 628, 632, 636
Guttman method, 639
individual differences, 615
Kruskal method, 638
log function, 619
loss functions, 619
matrix shape, 619
metric, 619
missing values, 640
nonmetric, 619
overview, 615
power function, 619
Quick Graphs, 623
residuals, 619
Shepard diagrams, 619, 623
usage, 623
multinomial logit, 520
compared to binary logit, 520
multiple correlation, 342
multiple correspondence analysis, 150
multiple regression, 346
multivariate analysis of variance, 366
mutually exclusive, 162
nesting, 457
Newman-Keuls test, 360
Newton-Raphson method, 585
nodes, 33
nominal data, 694
nonlinear modeling
algorithms, 690
computation, 690
estimation, 651
loss functions, 647
missing data, 691
problems in, 651
nonlinear models, 643
bootstrapping, 661
commands, 661
computation, 655
convergence, 655
data format, 661
examples, 662, 665, 668, 671, 673, 675, 676, 678,
681, 686, 688, 689
functions of parameters, 657
loss functions, 652, 659, 660
model, 652
parameter bounds, 655
Quick Graphs, 661
recalculation of parameters, 656
robust estimation, 658
starting values, 655
usage, 661
nonmetric unfolding model, 615
nonparametric statistics
algorithms, 713
bootstrapping, 702
commands, 697, 699, 701
data format, 702
examples, 702, 704, 705, 707, 708, 710, 712
Friedman test, 698
independent samples tests, 695, 696
Kolmogorov-Smirnov test, 696, 699
Kruskal-Wallis test, 695
Mann-Whitney test, 695
one-sample tests, 701
overview, 693
Quick Graphs, 702
related variables tests, 697, 698
sign test, 697
usage, 702
Wald-Wolfowitz runs test, 701
Wilcoxon signed-rank test, 698
normal distribution, 209
NPAR model, 842
nugget, 875
null hypothesis, 12
oblimin rotation, 303, 307
Occam's razor, 91
odds ratio, 170
omni-directional variograms, 872
ORDER, 913
ordinal data, 694
ordinary least squares, 743
orthomax rotation, 303, 307
PACF plots, 1018
pairwise deletion, 146, 333
pairwise mean comparisons, 359
parameters, 10
parametric modeling, 914
partial autocorrelation plots, 1008, 1009, 1018
partialing
in set correlation, 821
partially ordered scalogram analysis with
coordinates
algorithms, 728
bootstrapping, 719
commands, 718
convergence, 718
data format, 719
displays, 717
examples, 720, 721, 724
missing data, 728
model, 718
overview, 715
Quick Graphs, 719
usage, 719
path analysis
algorithms, 781
bootstrapping, 745
commands, 745
confidence intervals, 743, 782
covariance paths, 732
covariance relationships, 741
data format, 745
dependence paths, 731
dependence relationships, 739
endogenous variables, 732
estimation, 743
examples, 746, 751, 764, 771
exogenous variables, 732
fixed parameters, 739, 741
fixed variance, 734
free parameters, 739, 741
latent variables, 743
manifest variables, 743
measures of fit, 782
model, 737, 739
overview, 729
path diagrams, 729
Quick Graphs, 745
starting values, 743
usage, 745
variance paths, 732
path diagrams, 729
Pearson chi-square, 163, 168, 170, 586, 590
compared to likelihood-ratio chi-square, 588
Pearson correlation, 117, 123, 125
perceptual mapping
algorithms, 804
bootstrapping, 797
commands, 797
data format, 797
examples, 798, 799, 800, 802
methods, 795
missing data, 804
model, 795
overview, 789
Quick Graphs, 797
usage, 797
periodograms, 1015
permutation tests, 162
phi coefficient, 38, 40, 167, 170
Pillai trace, 256
Plackett-Burman designs, 234
point processes, 870, 878
polynomial contrasts, 360, 363, 468
pooled variances, 964
populations, 7
POSET, 715
positive matching dichotomy coefficients, 120
power model, 874, 884
preference curves, 792
preference mapping, 790
PREFMAP, 795
principal components analysis, 297, 298, 457
coefficients, 300
compared to factor analysis, 304
compared to linear regression, 299
loadings, 300
prior probabilities, 252
probability plots, 15, 343
probit analysis
algorithms, 814
bootstrapping, 811
categorical variables, 809
commands, 810
data format, 811
dummy coding, 809
effect coding, 809
examples, 811, 813
interpretation, 808
missing data, 814
model, 807, 808
overview, 807
Quick Graphs, 811
saving files, 811
usage, 811
Procrustes rotations, 794, 795
proportional hazards models, 915
QSK coefficient, 126
quadrat counts, 869, 881, 882
quadratic contrasts, 360
quantile plots, 916
quantitative symmetric dissimilarity coefficient, 119
quartimax rotation, 303, 307
quasi-independence, 591
Quasi-Newton method, 651, 652
random fields, 870
random samples, 8
random variables, 340
random walk, 1007
randomized block designs, 457
range, 209, 213, 875
rank-order coefficients, 126
Rasch model, 982
receiver operating characteristic curves. See signal
detection analysis
regression
linear, 11, 369
logistic, 517
two-stage least-squares, 1059
regression trees, 35
algorithms, 50
basic tree model, 32
bootstrapping, 43
commands, 42
compared to analysis of variance, 35
compared to stepwise regression, 36
data format, 43
displays, 40
examples, 44, 46, 48
loss functions, 38, 40
missing data, 50
mobiles, 31
model, 40
overview, 31
pruning, 37
Quick Graphs, 43
saving files, 43
stopping criteria, 37, 42
usage, 43
reliability, 981
repeated measures, 363, 461
assumptions, 364
response surfaces, 92, 652
right-censored data, 910
RMSEA, 783
robustness, 695
ROC curves, 841, 842, 847
root mean square error of approximation, 783
rotation, 303
running median smoothers, 1002
Sakitt D, 843
samples, 8
sampling. See bootstrap
saturated models
loglinear modeling, 587
scalogram. See partially ordered scalogram analysis
with coordinates
scatterplot matrix, 117
Scheffé test, 359, 402, 463
screening designs, 235
SD-Ratio, 843
seasonal decomposition, 1013
second-order stationarity, 871
semi-variograms, 872, 885
set correlations, 817
assumptions, 818
measures of association, 819
missing data, 839
partialing, 818
See also canonical correlation analysis
Shepard diagrams, 619, 623
sign test, 697
signal detection analysis
algorithms, 866
bootstrapping, 847
chi-square model, 843
commands, 846
convergence, 843
data format, 847
examples, 855, 856, 857, 860, 863, 864
exponential model, 843
gamma model, 843
logistic model, 843
missing data, 867
nonparametric model, 843
normal model, 843
overview, 841
Poisson model, 843
Quick Graphs, 847
ROC curves, 841, 847
usage, 847
variables, 843
sill, 875
similarity measures, 115
simple matching dichotomy coefficients, 120
Simplex method, 651, 652
simulation, 878
singular value decomposition (SVD), 149, 794, 804
skewness, 211, 213
positive, 4
slope, 346
smoothing, 1000
Somers' d coefficients, 167, 170
sorting, 5
Sosa statistic, 21, 62
spatial statistics, 869
algorithms, 906
azimuth, 885
bootstrapping, 891
commands, 889
data, 891
dip, 885
examples, 891, 898, 899, 904
grid, 887
kriging, 877, 882, 888
lags, 885
missing data, 906
models, 869, 884
nested models, 876
nesting structures, 884
nugget effect, 875, 884
plots, 882
point statistics, 882
Quick Graphs, 891
sill, 875, 884
simulation, 878, 882
trends, 882
variograms, 872, 882, 885
Spearman coefficients, 119, 126, 167
Spearman-Brown coefficient, 981
specificities, 302
spectral models, 1000
spherical model, 873, 884
split plot designs, 457
split-half reliabilities, 983
standard deviation, 3, 209, 213
standard error of estimate, 341
standard error of kurtosis, 213
standard error of skewness, 213
standard error of the mean, 11, 213
standardization, 55
standardized alpha, 981
standardized deviates, 149, 590
standardized values, 6
stationarity, 871, 1009
statistics
defined, 1
descriptive, 1
inferential, 7
See also descriptive statistics
stem-and-leaf plots, 3, 208
See also descriptive statistics, 215
stepwise regression, 349, 362, 524
stochastic processes, 870
stress, 618, 639
structural equation models. See path analysis
Stuart's tau-c coefficients, 167, 170
Studentized residuals, 344
subpopulations, 211
subsampling, 18
sum of cross-products matrix, 125
sums of squares
type I, 361, 366
type II, 367
type III, 362, 367
type IV, 367
surface plots, 882
survival analysis
algorithms, 948
bootstrapping, 924
censoring, 910, 917, 952
centering, 949
coding variables, 917
commands, 923
convergence, 954
Cox regression, 921
data format, 924
estimation, 919
examples, 925, 928, 929, 933, 936, 938, 943, 945
exponential model, 921
graphs, 921
logistic model, 921
log-likelihood, 950
log-normal model, 921
missing data, 949
models, 917, 951
overview, 909
parameters, 949
plots, 911, 953
proportional hazards models, 952
Quick Graphs, 924
singular Hessian, 951
stepwise, 954
stepwise estimation, 919
tables, 921
time varying covariates, 922
usage, 924
variances, 955
Weibull model, 921
symmetric matrix, 117
t distributions, 960
compared to normal distributions, 962
t tests
assumptions, 964
Bonferroni adjustment, 967
bootstrapping, 968
commands, 967
confidence intervals, 967
data format, 968
degrees of freedom, 962
Dunn-Sidak adjustment, 967
examples, 969, 971, 973, 975, 977
one-sample, 962, 966
overview, 959
paired, 963, 965
Quick Graphs, 968
separate variances, 964
two-sample, 963, 965
usage, 968
Taguchi designs, 230, 234
Tanimoto dichotomy coefficients, 120, 126
tau-b coefficients, 126, 170
tau-c coefficients, 170
test item analysis
algorithms, 996
bootstrapping, 985
classical analysis, 980, 981, 983, 996
commands, 985
data format, 985
examples, 989, 990, 993
logistic item-response analysis, 982, 984, 996
missing data, 997
overview, 979
Quick Graphs, 985
reliabilities, 983
scoring items, 983, 984
statistics, 985
usage, 985
tetrachoric correlation, 120, 121, 126
theory of signal detectability (TSD), 841
time domain models, 1000
time series, 999
algorithms, 1056
ARIMA models, 1003, 1026
bootstrapping, 1029
clear series, 1021
commands, 1020, 1022, 1025, 1026, 1027, 1029
data format, 1029
examples, 1030, 1031, 1032, 1033, 1035, 1038,
1039, 1040, 1042, 1043, 1047, 1054
forecasts, 1024
Fourier transformations, 1028
missing values, 999
moving average, 1001, 1022
overview, 999
plot labels, 1017
plots, 1016, 1017, 1018, 1019
Quick Graphs, 1029
running means, 1002, 1022
running medians, 1002, 1022
seasonal adjustments, 1013, 1025
smoothing, 1000, 1022, 1023, 1024
stationarity, 1009
transformations, 1020, 1021
trends, 1024
usage, 1029
tolerance, 350
T-plots, 1016
transformations, 211
tree-clustering methods, 37
tree diagrams, 57
triangle inequality, 616
Tukey pairwise comparisons test, 359, 402, 463
Tukey's jackknife, 18
twoing, 38
two-stage least squares
algorithms, 1070
bootstrapping, 1063
commands, 1063
data format, 1063
estimation, 1059
examples, 1064, 1066, 1069
heteroskedasticity-consistent standard errors,
1061
lagged variables, 1061
missing data, 1070
model, 1061
overview, 1059
Quick Graphs, 1063
usage, 1063
type I sums of squares, 361, 366
type II sums of squares, 367
type III sums of squares, 362, 367
type IV sums of squares, 367
unbalanced designs
in analysis of variance, 361
uncertainty coefficient, 170
unfolding models, 791
variance, 213
variance paths
path analysis, 732
varimax rotation, 303, 307
variograms, 872, 882, 891
model, 873
vector model
in perceptual mapping, 793
Voronoi polygons, 869, 880, 882
Wald-Wolfowitz runs test, 701
wave model, 875
Weibull distribution, 914
weight, 891
weighted running smoothing, 1002
weights, 20, 43, 95, 129, 152, 173, 217, 258, 309, 373,
408, 471, 533, 594, 623, 661, 702, 719, 745,
797, 811, 827, 847, 924, 968, 985, 1029, 1063
Wilcoxon signed-rank test, 695, 698
Wilks' lambda, 250, 256
Wilks' trace, 256
Winters' three-parameter model, 1014
within-subjects differences
in analysis of variance, 364
XTAB procedure, 205
Yates correction, 166, 170
y-intercept, 346
Young's S-STRESS, 619
Yule's Q, 167, 170
Yule's Y, 167, 170
