You are on page 1of 10

Review of Software to Fit Generalized Estimating Equation

Regression Models
Nicholas J. HORTON and Stuart R. LIPSITZ
Researchers are often interested in analyzing data that arise
from a longitudinal or clustered design. Although there are
a variety of standard likelihood-based approaches to anal-
ysis when the outcome variables are approximately multi-
variate normal, models for discrete-type outcomes generally
require a dierent approach. Liang and Zeger formalized an
approach to this problem using generalized estimating equa-
tions (GEEs) to extend generalized linear models (GLMs)
to a regression setting with correlated observations within
subjects. In this article, we briey review GLM, the GEE
methodology, introduce some examples, and compare the
GEE implementations of several general purpose statistical
packages (SAS, Stata, SUDAAN, and S-Plus). We focus
on the user interface, accuracy, and completeness of imple-
mentations of this methodology.
KEY WORDS: Computer software for statistical analy-
sis; Generalized estimating equations; Missing data.
1. INTRODUCTION
Generalized linear models (GLMs) (McCullagh and
Nelder 1989) are a standard method used to t regression
models for univariate data that are presumed to follow an
exponential family distribution. Frequently researchers are
interested in analyzing data that arise from a longitudi-
nal, repeated measures or clustered design, and there ex-
ists correlation between observations on a given subject. If
the outcomes are approximately multivariate normal, then
there are well established methods of analysis (Laird and
Ware 1982) that have been widely implemented in gen-
eral purpose statistical packages. But if the outcomes are
binary or counts, general likelihood based approaches are
less tractable. For clustered binary outcomes, several ap-
proaches have been suggested (e.g., Fitzmaurice and Laird
Nicholas J. Horton is Research Fellow, Department of Biostatistics, Har-
vard School of Public Health, Boston, MA 02115 (Email: horton@hsph.
harvard.edu). Stuart R. Lipsitz is Associate Professor, Dana Farber Can-
cer Institute and Department of Biostatistics, Harvard School of Public
Health, Boston, MA 02115. We are grateful for the support provided by
NIMH grant R01-MH54693 and NIH grants CA 55576 and CA 55670.
We would like to thank Nan Laird, Vincent Carey, Rick Williams, James
Hardin and Gordon Johnston, as well as the Section Editor for their help-
ful comments. We also thank Gwen Zahner for use of the Connecticut
child surveys, which were conducted under contract to the Connecticut
Department of Children and Youth Services.
1993), but these have not been incorporated into general
purpose statistical computing packages.
Generalized estimating equations (GEEs) were developed
to extend the GLM to accommodate correlated data, and are
widely used by researchers in a number of elds. In this
article we will review GLMs and the GEE methodology,
and through an example, compare the GEE implementations
of several general purpose statistical packages (including
SAS, Stata, SUDAAN, and S-Plus). We will begin by briey
reviewing the methodology.
2. BRIEF REVIEW OF GLMS AND GEES
McCullagh and Nelder (1989) introduced the GLM for
exponential family data with the form
f
Y
(y, , ) = exp {(y b())/a() + c(y, )} ,
where a(.), b(.), and c(.) are given, is the canonical param-
eter, and is the dispersion parameter. The GLM is then
given by
g(
i
) = g(E[Y
i
]) = x

i
,
where x
i
is a p 1 vector of covariates for the ith subject,
and is a p1 vector of regression parameters. One of the
attractive properties of the GLM is that it allows for lin-
ear as well as non-linear models under a single framework.
It is possible to t models where the underlying data are
normal, inverse Gaussian, gamma, Poisson, binomial, geo-
metric, and negative binomial by suitable choice of the link
function g(.) (Hilbe 1994).
Liang and Zeger (1986) and Zeger and Liang (1986) intro-
duced generalized estimating equations (GEEs) to account
for the correlation between observations in generalized lin-
ear regression models. One aspect of their approach builds
upon previous methods of variance estimation developed to
protect against inappropriate assumptions about the vari-
ance (Huber 1967; White 1980, 1982). GEEs are used to
characterize the marginal expectation of a set of outcomes
as a function of a set of study variables. In a marginal
model, the analyst is interested in modeling the marginal
expectation (average response for observations sharing the
same covariates) as a function of explanatory variables. Dig-
gle, Liang, and Zeger (1994) provided a detailed review of
marginal models as well as other approaches (including ran-
dom eects models and transition (markov) models).
Let Y
ij
, i = 1, . . . , n, j = 1, . . . , t be the jth outcome
for the ith subject, where we assume that observations on
dierent subjects are independent, though we allow for as-
sociation between outcomes observed on the same subject.
In the GEE setting, we are not assuming that Y
ij
is a mem-
ber of the exponential family, but we are assuming that the
mean and variance are characterized as in the GLM. We
160 The American Statistician, May 1999, Vol. 53, May 1999 c 1999 American Statistical Association
Table 1. Common Working Correlation Models
Structure Denition Example # Parameters
Independence
Ru,v = 1 if u = v
= 0 otherwise
_
_
_
1 0 : : : 0
0 1 : : : 0
.
.
.
.
.
.
.
.
.
.
.
.
0 0 : : : 1
_
_
_
0
Exchangeable
Ru,v = 1 if u = v
= otherwise
_
_
_
1 : : :
1 : : :
.
.
.
.
.
.
.
.
.
.
.
.
: : : 1
_
_
_
1
Unstructured
Ru,v = 1 if u = v
=
u,v otherwise
_
_
_
1
1,2
: : :
1,t

1,2
1 : : :
2,t
.
.
.
.
.
.
.
.
.
.
.
.

1,t

2,t
: : : 1
_
_
_
t(t-1)/2
Auto-regressive
Ru,v = 1 if u = v
=
|uv|
otherwise
_
_
_
_
1 : : :
t 1
1 : : :
t 2
.
.
.
.
.
.
.
.
.
.
.
.

t 1

t 2
: : : 1
_
_
_
_
1
M-dependent
Ru,v = 1 if u = v
=
|uv|
otherwise
_
_
_
1
1
: : :
t 1

1
1 : : :
t 2
.
.
.
.
.
.
.
.
.
.
.
.

t 1

t 2
: : : 1
_
_
_
0<M t 1
Fixed
Ru,v = 1 if u = v
= ru,v otherwise
_
_
_
1 r
1,2
: : : r
1,t
r
1,2
1 : : : r
2,t
.
.
.
.
.
.
.
.
.
.
.
.
r
1,t
r
2,t
: : : 1
_
_
_
0 (User specied)
assume the marginal regression model
g(E[Y
ij
]) = x

ij
, (1)
where x
ij
is a p 1 vector of study variables (covariates)
for the ith subject at the jth outcome, consists of the p re-
gression parameters of interest and g(.) is the link function.
Common choices for the link function might be g(a) = a
for measured data (the identity link) g(a) = log(a) for count
data (log link), or g(a) = log(a/(1a)) for binary data (logit
link). Since likelihood methods for binary data do not com-
monly exist in general purpose statistical software, GEEs
have been a popular approach to regression model tting
for this type of data. For binary data with the logit link, we
have that
log(E[Y
ij
]/(1 E[Y
ij
])) = x

ij
,
which implies that
E[Y
ij
] =
ij
=
exp(x

ij
)
1 + exp(x

ij
)
,
and since the outcomes are binary, we have that
var(Y
ij
) = V
ij
=
exp(x

ij
)
(1 + exp(x

ij
))
2
.
In addition to this marginal mean model, we need to
model the covariance structure of the correlated observa-
tions on a given subject. Assuming no missing data, the
t t covariance matrix of Y
i
is modeled as
V
i
= A
1/2
i
R()A
1/2
i
,
where A
i
is a diagonal matrix of variance functions v(u
ij
),
and R() is the working correlation matrix of Y
i
indexed
by a vector of parameters . We will now describe speci-
cations for R.
2.1 Specication of Working Correlation Matrix
There are a variety of common structures that may be
appropriate to use to model the working correlation matrix.
Table 1 displays a number of such matrices.
Issues guiding the choice of correlation structures are be-
yond the scope of this article (see Diggle et al. 1994 for a
readable discussion), but in general if the number of ob-
servations per cluster is small in a balanced and complete
design, then an unstructured matrix is recommended. For
datasets with mistimed measurements, it may be reasonable
to consider a model where the correlation is a function of
the time between observations (i.e., M-dependent or auto-
regressive). For datasets with clustered observations (i.e.,
rat litters), there may be no logical ordering for observa-
tions within a cluster and an exchangeable structure may
be most appropriate.
Comparisons of estimates and standard errors from sev-
eral dierent correlation structures may indicate sensitivity
The American Statistician, May 1999, Vol. 53, May 1999 161
Figure 1. Example of Fixed Working Correlation Matrix.
to mispecication of the variance structure. For both the in-
dependence working structure and the xed working struc-
ture, no estimation of is performed (for the xed struc-
ture, the user must specify a t t matrix mat). We note
that use of the exchangeable (also referred to as compound
symmetry) working correlation matrix with measured data
and identity link function is equivalent to a random eects
model with a random intercept per cluster.
For the corr=xed option, mat should be a symmetric
matrix with 1s on the diagonal, as seen in Figure 1, which
species a banded structure with a xed correlation and lin-
ear decline as the distance between observations increases.
2.2 Empirical and Model Based Variance Estimators
Zeger and Liang (1986) referred to V
i
as a working
matrix because it is not required to be correctly specied
for the parameter estimates and the estimated variance of
the parameter estimates in model (1) to be consistent (as
long as the mean model itself is correct and there is no
missing data). However, Liang and Zeger (1986) showed
that there can be important gains in eciency realized by
correctly specifying the working correlation matrix.
A set of estimating equations are solved (through an iter-
ative process) to nd the value of the estimator

. An em-
pirical variance estimator can be used to estimate var(

).
This variance estimator is also referred to as a sandwich
or robust estimator. Another variance estimate available
from GEE models is the model-based (or naive) estimate,
which is consistent when both the mean model and the co-
variance model are correctly specied. Since in general the
analyst will not know the correct covariance structure, the
empirical variance estimate will be preferred when the num-
ber of clusters is large. When the number of clusters is
small, say < 20, the model based variance estimator may
have better properties (Prentice 1988) even if the work-
ing variance is wrong. This is because the robust variance
estimator is asymptotically unbiased, but could be highly
biased when the number of clusters is small.
In addition to the Zeger robust estimator, SUDAAN
also supports the Binder (1983) estimate of variance (Taylor
series approximation) and a jackknife estimator of variance
can be calculated using a working independence assump-
tion.
2.3 Missing Data Issues
Longitudinal or clustered studies often have missing data,
either by design or happenstance. If a litter in a teratology
study is the level of clustering, litter size may vary between
litters. Patients in an observational study may miss appoint-
ments or drop out of the study. The protocol for a clinical
trial may call for patients to be observed at specied inter-
vals, but their actual observations may take place at varying
times. Such unbalanced and/or incomplete data can compli-
cate GEE analyses. If the missingness can be thought of as
being missing completely at random (MCAR) in the sense
of Little and Rubin (1987), then the consistency results es-
tablished by Liang and Zeger (1986) hold. However, the no-
tation and calculations for arbitrary missing data patterns
are more complicated than in the balanced and complete
case.
Robins, Rotnitzky, and Zhao (1995) proposed methods
to allow for data that is missing at random (MAR). Their
inverse probability censoring weight (IPCW) approach re-
quires that the missingness law be modeled, and that
weights corresponding to the inverse probability of miss-
ingness be included in the GEE. This will yield consistent
parameter estimates, but the variance will tend to be incor-
rect (since the weights are being estimated but are treated as
constants by default). Unfortunately, the method of Robins
et al. (1995) only works well when there is dropoutthat is,
once a subject misses a time, that subject is not seen again.
Often subjects miss a single observation, and then are seen
at the next time. The probability of the missingness pattern
over time is not estimable with a simple logistic regression
in this case, so the Robins et al. (1995) method is more dif-
cult to implement. Lee, Laird, and Johnston (in revision)
propose a modication to the GEE approach that combines
restricted maximum likelihood (REML) estimating equa-
tions for the parameters in the variance-covariance matrix.
We will consider how these approaches may be carried out
in existing packages.
In summary, when tting GEEs, the analyst must consider
not only the model for the mean, but the model for the
variance and the underlying missingness process. We will
now describe the software packages to be reviewed, and
describe how to carry out an analysis in each package.
3. SOFTWARE PACKAGES TO BE REVIEWED
We review four packages that are commonly used to t
GEEs: SAS, Stata, SUDAAN, and S-Plus.
SASThe version of SAS used for the evaluation was
SAS/STAT Release 6.12 (SAS Institute 1996). GEE sup-
port has been included in PROC GENMOD. Information
about SAS is available from the SAS Institute web page
(http://www.sas.com).
StataThe version of Stata used for the evaluation
was 5.0 (Stata Corp 1997). GEE models can be t in
Stata using the xtgee command, part of the xt cross-
sectional time-series analysis tools. Information about
Stata is available from the Stata Corporation web page
http://www.stata.com.
SUDAANThe version of SUDAAN used for the evalu-
ation was 7.5.3 (Shaw, Barnwell, and Bieler 1997). Informa-
tion about SUDAAN is available from the Research Trian-
gle Institute web page (http://www.rti.org/patents/sudaan/
sudaan.html). PROC LOGISTIC, PROC MULTILOG, and
PROC REGRESS allow tting and evaluation of models us-
ing GEEs for binary and continuous outcomes. Support for
162 Statistical Computing Software Reviews
Table 2. Syntax to Specify a Given Correlation Structure
Structure SAS Stata SUDAAN S-Plus
Independence corr=inde corr(ind) R=INDEPENDENT corstr="independent"
Exchangeable corr=exch corr(exc) R=EXCHANGE corstr="exchangeable"
Unstructured corr=un corr(uns) Not available corstr="unstructured"
Auto-regressive corr=ar corr(ar 1) Not available corstr="ar1"
M-dependent corr=mdep(m) corr(sta m) Not available Can be done
Fixed corr=fixed (mat) corr(fix mat) Not available Can be done
count data in PROC LOGLINK is planned for the next re-
lease. Because SUDAAN was developed for analysis of
complex survey sampling data, it is particularly well suited
to the analysis of repeated measures and clustered data.
S-PlusThe version of S-Plus used for the evaluation
was 3.4 (Mathsoft 1996). Information about S-Plus is avail-
able from the MathSoft web page (http://www.mathsoft.
com/splus). Although GEE support is not built in to S-
Plus, a package to implement GEEs (YAGS or Yet An-
other GEE solver) is available from Vincent Carey and
easily added as a library to S-Plus. The library can be
found on the web (http://www.biostat.harvard.edu/carey)
and precompiled binaries can be found at Brian Ripleys
web page (http://www.stats.ox.ac.uk/pub/SWin/). Instal-
lation of YAGS took 10 minutes and the le (which in-
cludes the source, documentation, and test data) was ap-
proximately 1/3 of a megabyte in size. Another pack-
age available for analysis of repeated measures de-
signs within S-Plus is the Oswald system, available
at http://www.maths.lancs.ac.uk/Software/Oswald. We do
not further discuss Oswald, and concentrate only on YAGS.
To t a GEE, the analyst must rst answer several ques-
tions: what is the appropriate family of distributions for the
data (i.e., binary, count, or measured)? What link function
is appropriate? What is a reasonable model for the corre-
lation between observations? What is an appropriate mean
model? What variance estimator should be used?
The analyst must specify both the distribution family
(which determines the approach estimator of , the scale
or dispersion parameter) and the working correlation ma-
trix. SAS, Stata, and S-Plus all support the following dis-
tribution families: Gaussian (normal), Bernoulli/binomial,
Poisson, and Gamma. Stata is planning support for the in-
verse Gaussian and negative binomial distribution families
in a forthcoming release. SUDAAN supports only the Gaus-
sian and Bernoulli/binomial distributions, though support
for Poisson regression (PROC LOGLINK) is being imple-
mented for the next release. SUDAAN also allows a cumu-
lative logit link (for ordered multinomial outcomes) and a
generalized logit link (for nominal multinomial regression).
The packages will default to the canonical link for each
distribution, but other options are available in some pack-
ages. For example, for the binomial distribution, the probit
link is available in Stata, while SAS and S-Plus support the
probit and complementary log-log links (though support for
these links is planned for release 6.0 of Stata). SUDAAN
supports only the canonical (logit) link. Some caution must
be exercised when choosing distributions and links, since
some combinations do not make sense. S-Plus would not
allow the combination of a binomial family with an identity
link, though SAS and Stata t the model with this combi-
nation.
All packages allow the specication of the mean model
in a straightforward fashion.
Table 2 displays the commands to specify a given cor-
relation structure in the packages under review. SAS and
Stata can display the estimated working correlation matrix,
while SUDAAN and S-Plus will display the elements of
from which the correlation matrix can be constructed for
some set of observation times or clustering values. Since
clusters have no natural ordering, and because SUDAAN is
designed for analysis of clustered data, it only supports the
independence and exchangeable working correlation struc-
tures. Finally, all packages will display both empirical and
model-based variance estimates. The default for Stata is
to display the model-based estimates, while SAS and SU-
DAAN default to the empirical (sandwich) estimates. S-Plus
displays both estimates.
We now consider an example GEE model t using these
software packages.
4. EXAMPLE: MENTAL HEALTH SERVICE
UTILIZATION
To conduct the software comparison, we analyzed data
from a study of mental health utilization by children.
The study design has been reported elsewhere (Zahner,
Pawelkiewicz, DeFrancesco, and Adnopoz 1992; Zahner,
Jacobs, Freeman, and Trainor 1993), as has a substan-
tive analysis of the service utilization data (Zahner and
Daskalakis 1997). Subjects included 2,519 children, aged
611, who were part of two cross-sectional surveys con-
ducted in eastern Connecticut in the late 1980s. A goal of
these surveys was to study determinants of mental health
service utilization.
Parents of the children completed survey questionnaires
that solicited information on child characteristics. The pri-
mary outcomes were service use in three settings: general
health, school, and mental health. For a given setting, ser-
vice use was dened as a parental report that the child had
ever seen a provider or been in a special program for a be-
havioral problem. If the particular service was used, the out-
come (SERV) was coded 1, and coded 0 otherwise. Clearly
these binary outcomes are correlated for a given child.
In this study it is of interest to relate the rate of ser-
vice use in the three settings to both child and family
characteristics. Covariates thought to be predictive of ser-
The American Statistician, May 1999, Vol. 53, May 1999 163
Figure 2. First 12 Observations of the Service Use Dataset.
vice use included age (OLD: 0=age 6 to 8, 1=age 9 to
11), gender (BOY: 0=female, 1=male), and academic prob-
lems (ACADPROB: 0=no academic problems, 1=repeated
a grade, advised to repeat grade).
In the logistic regression analysis for repeated binary
measures we adjusted for setting (using indicators for
SCHOOL and MENTAL; i.e., we used general services as
baseline), the above covariates, and the interaction between
setting and the three covariates. Our dataset consisted of
one line per setting per subject, along with a variable ID
(subject identier), and a variable SETTING which was set
to equal 2 if the setting was GENERAL, 1 if MENTAL,
and 0 if SCHOOL. This variable is needed to determine
the proper ordering of observations when calculating the
working correlation matrix. Figure 2 displays the rst dozen
lines of the dataset, which includes three outcomes for each
of four subjects. Table 3 displays the parameter estimates
and variance estimates for those parameters (both empirical
and model-based) for the model with an unstructured work-
ing correlation matrix and a logit link. We t a model us-
ing an independence and exchangeable working correlation
structure, but in this example, the parameter estimates and
standard error estimates were identical to the rst decimal
place. We also t a model in SUDAAN using the Binder
variance estimate as well as a jackknife variance estimate
using an independence working correlation structure. Both
yielded similar results, though we note that the jackknife
estimate took 25 minutes to calculate on a Pentium PC, as
opposed to approximately one minute for the other models.
Table 4 displays the estimated working correlation ma-
trices for the independence, exchangable and unstructured
working correlation structures.
The parameter estimates were identical (to the third dec-
imal place) for all packages tting the unstructured model.
Figure 3 displays the syntax needed to specify this model
for the packages under review (since SUDAAN does not
support the unstructured working correlation matrix, an ex-
changeable model was t).
One question of interest in this model is whether the ef-
fect of the covariates on service use is the same across
service settings. If the eect is the same, another question
of interest is whether the covariate is associated with the
outcome. We can test the former hypothesis for the OLD
covariate with the null hypothesis:
H
0
: OLD MENTAL = OLD SCHOOL = 0.
We can construct a Wald test statistic T =

( var(

))
1
,
where

is a 21 vector containing the parameter estimates
for OLD*MENTAL and OLD*SCHOOL and

are the vari-
ances and covariances for the parameters being tested. This
test statistic will have an approximate
2
distribution with
two degrees of freedom under the null hypothesis. If we
do not reject this null hypothesis, we may be interested in
testing
H
0
: OLD = OLD MENTAL = OLD SCHOOL = 0.
Similarly, a three df test statistic may be constructed to test
this hypothesis.
It was trivial to test these hypotheses in Stata. Figure 4
gives the input and output from the extremely exible test
command in Stata. With the accumulate option, multi-
dimensional hypothesis tests can be constructed. SUDAAN
also allowed for testing of arbitrary hypotheses in this fash-
ion. SAS allows testing of contrasts using the CONTRAST
command, though the current version will not test the GEE
model (this is planned to be rectied in a future release).
In addition, SAS plans to add an ESTIMATE statement to
provide estimates of linear functions of the regression pa-
rameters. For other types of tests, SAS and S-Plus require
Table 3. Parameter Estimates and Estimated Standard Errors
With Unstructured Working Correlation Matrix
Empirical Model
Parameter Estimate std err std err
INTERCEPT 2.944 .149 .145
MENTAL .352 .193 .194
SCHOOL .185 .174 .171
OLD .123 .144 .144
BOY .365 .146 .147
ACADPROB .724 .145 .146
OLD*MENTAL .291 .190 .190
OLD*SCHOOL .331 .162 .163
BOY*MENTAL .278 .189 .193
BOY*SCHOOL .154 .165 .167
ACADPROB*MENTAL .184 .191 .193
ACADPROB*SCHOOL 1.136 .167 .168
Table 4. Estimated Working Correlation Matrices for Dierent Working
Correlation Structures (Independence, Exchangeable, Unstructured)
School Mental General
School 1
Mental 0 .196 .165 1
General 0 .196 .198 0 .196 .227 1
164 Statistical Computing Software Reviews
Figure 3. Syntax to Fit GEEs for Example Model.
the analyst to perform the matrix multiplication, but they do
facilitate saving the estimated covariance matrix of the pa-
rameters to automate these computations. Figure 5 displays
the code needed to calculate the value of the test statistic
for the OLD covariate. Note that when the user requests
the covariance matrix of the parameters using the GEER-
COV make statement, SAS actually creates a matrix with
covariances in the upper triangle, and correlations in the
lower triangle.
We conclude from our example that that there is a sig-
nicant interaction between service setting and academic
problems (
2
2
= 52.3, p < .0001) but not for age and setting
(
2
2
= 4.6, p = .10) or gender and setting (
2
2
= 2.2, p =
0.33). Overall, boys have a higher proportion of mental
health service use than girls (
2
3
= 8.2, p = .04) and older
children tend to have used them more than younger children
(
2
3
= 20.6, p = .0001).
The American Statistician, May 1999, Vol. 53, May 1999 165
Table 5. Complete Data From Articial Example
Y
1
Y
2
X Count
0 0 0 324
0 0 1 261
0 1 0 43
0 1 1 36
1 0 0 72
1 0 1 107
1 1 0 61
1 1 1 96
We will next consider an articial data example with
missing data.
5. EXAMPLE: MISSING DATA
We constructed an articial dataset consisting of 1000
paired binary observations from the following underlying
distribution:
logit(E[Y
ij
]) =
0
+
1
TIME +
2
X
i
(2)
where X
i
is a dichotomous covariate; TIME = 1 if j =
2, 0 otherwise;
0
= 1;
1
= .50;
2
= .50; and
corr(Y
1
, Y
2
) = .40. Table 5 displays one sample from this
underlying distribution which we used as our complete
case sample. We t models with an exchangeable (satu-
rated in this setting) correlation matrix which yielded pa-
rameter estimates (and standard errors):

0
: .943(.092),

1
: .500(.080),

2
: .500(.118). We created a dataset using
the following (MAR) missingness law:
P(Y
2
is observed|Y
1
, X) =
_

_
.9 if Y
1
= 1 and X = 1
.7 if Y
1
= 1 and X = 0
.5 if Y
1
= 0 and X = 1
.3 if Y
1
= 0 and X = 0
.
The missingness mechanism is considered to be missing
at random (MAR) because it does not depend on the un-
observed values of Y
2
. We t models using all available
Figure 4. Stata Code to Calculate Value of Multidimensional Wald
Test Statistic.
cases in all packages using an independence, exchangeable
and unstructured working correlation assumption. Table 6
displays the parameter estimates from one sample gener-
ated using this missingness law. Because of dierences in
how the unstructured and exchangeable correlation matri-
ces are estimated in the dierent packages, and because the
dataset was unbalanced, the results are slightly dierent.
The estimates under the working independence assumption
are much dierent, however, and appear highly biased. In-
cluding a correlation between Y
1
and Y
2
in the working
variance model (either exchangeable or unstructured) ap-
pears to reduce this bias somewhat because of the high cor-
relation (.4) between Y
1
and Y
2
.
We considered two approaches to address the bias of
GEE methods with MAR data: the IPCW method of
Robins, Rotnitzky, and Zhao (1995), and the normal ap-
proximation technique of Lee et al. (in revision). For the
Robins approach, we t a saturated logistic model for

P(Y
2
is observed|Y
1
, X).
We conducted a simulation by creating 500 datasets using
the above missingness law, and t all three models (available
case (AC), IPCW, and REML) to each dataset. Table 7 dis-
plays the average of the parameter estimates from each of
the simulations, along with the parameters from the sample
generated using model (2). We note that for this missing-
ness law, there is considerable bias in the parameter esti-
mates using the available case technique. Use of the GEE-
REML technique decreases the bias, while use of the IPCW
technique tends to eliminate the bias in this example. Fig-
ure 6 displays the SAS code to t models using the IPCW
approach. We note that the standard error estimates from
PROC GENMOD will be biased for the true values because
by default GENMOD assumes that the weights are known,
when in fact they are estimated from the data.
6. ADDITIONAL NOTES ON THE SOFTWARE
PACKAGES
We now consider some additional notes on the software
packages. One minor problem was found with Stata and
SUDAAN, which required id variables to be numeric. It
was possible to recode our id variable to a unique integer,
but this was an annoyance. Stata does not support weights,
which would preclude use of the IPCW method for handling
MAR data.
SUDAANs roots in complex survey sampling are both an
advantage and a disadvantage. One advantage relates to de-
scriptive tables and statistics. While not exactly GEE mod-
eling, SUDAAN facilitates the calculation of the correct
standard error for descriptive statistics. As an example in
teratology, it is straightforward to calculate the proportion
of fetuses (clustered by litter) that are malformed, along
with a condence interval for that proportion using a sand-
wich variance estimate. One disadvantage of SUDAAN is
that it has limited support for correlation structures (only
independence and exchangeable) though it has support for
multiple nesting levels, which is not supported directly by
any of the other packages. For example, consider a dataset
which consisted of three repeated measurements over time
166 Statistical Computing Software Reviews
Figure 5. SAS Code to Calculate Value of Multidimensional Wald Test Statistic (not necessary in future versions of SAS).
Table 6. Parameter Estimates From Missing Data Example
Package Correlation Int Time X
All Independence .9104 .2508 .4424
SAS Exchangeable .9574 .5984 .5262
Stata Exchangeable .9574 .5984 .5262
SUDAAN Exchangeable .9442 .5075 .5034
S-Plus Exchangeable .9574 .5984 .5262
SAS Unstructured .9574 .5984 .5262
Stata Unstructured .9571 .5962 .5256
S-Plus Unstructured .9571 .5962 .5256
on each of two parents (mothers and fathers) in a family. Let
the observed data be arranged such that Y
i
= (Y

iM
, Y

iF
)

is the vector of observations for the ith cluster, where Y


iM
represents the vector of three observations on the mother,
and Y
iF
represents the vector of three observations on the
father. Given enough clusters, it might be feasible to con-
sider estimating an unstructured 6 6 correlation matrix,
which can be done in SAS, S-Plus, or Stata. Use of a xed
matrix might also be reasonable to consider. While arbi-
trary, it might be plausible to consider that observations
on the same person would be relatively highly correlated
over time (.5 if one time point apart, .4 if two time points
apart), that observations at the same time on dierent par-
ents would be less highly correlated (.2), and that observa-
Table 7. Results From Missing Data Simulation
Parameter Actual sample AC IPCW REML

0
.943 .980 .941 .980

1
.500 .596 .502 .540

2
.500 .569 .498 .570
tions on dierent parents at dierent times would have a
small correlation (.1). This could be implemented using a
xed working correlation matrix of the form
R =
_
_
_
_
_
_
_
_
1.0 .5 .4 .2 .1 .1
.5 1.0 .5 .1 .2 .1
.4 .5 1.0 .1 .1 .2
.2 .1 .1 1.0 .5 .4
.1 .2 .1 .5 1.0 .5
.1 .1 .2 .4 .5 1.0
_
_
_
_
_
_
_
_
SUDAAN allows multiple levels of clustering and avoids
use of arbitrary xed matrices or estimation of many nui-
sance parameters. The program allows a robust variance to
be calculated at one stage of the design (e.g., parents) and
an exchangeable correlation to be calculated at a lower level
(time within parents) using the NEST statement.
The S-Plus GEE implementation oers the most general
coding for working correlation matrices, which allows for
arbitrarily complex or special-case models. The disadvan-
tage of this approach is that some additional work is re-
quired by the analyst to take advantage of this versatility.
The American Statistician, May 1999, Vol. 53, May 1999 167
Figure 6. Syntax to Fit IPCW GEEs With MAR Data in SAS.
All vendors provided quick (within four to ve hours in
all cases), cordial, and accurate responses to our emailed re-
quests for technical support. SUDAAN had the most exten-
sive documentation regarding GEEs, though the oerings
for both SAS and Stata were complete and well organized.
The documentation for YAGS was more fragmentary and
required more basic knowledge on the part of the analyst.
All packages provided examples to augment their reference
documentation.
7. DISCUSSION
In this article we have reviewed the methodology under-
lying GEEs, t models for two examples using four soft-
ware packages, and considered extensions to handle miss-
ing at random data. In general, the packages were easy to
use, the implementations were similar, and the results cal-
culated were similar. Analysis of a dataset with missing
observations yielded somewhat diering results between
packages, and may warrant further investigation. Certain
aspects of the interface (i.e., calculating multidimensional
hypothesis tests, use of weights, etc.) were particularly well-
implemented in certain packages. On the whole, GEEs are
well-supported by all of these software packages, and are
straightforward to use.
[Received January 1999. Revised February 1999.]
REFERENCES
Binder, D. A. (1983), On the Variances of Asympototically Normal Es-
timators From Complex Surveys, International Statistical Review, 51,
279292.
Diggle, P. J., Liang, K. Y., and Zeger, S. L. (1994), Analysis of Longitudinal
Data, Oxford: Clarendon Press.
Fitzmaurice, G. M., and Laird, N. M. (1993), A Likelihood-Based Method
for Analysing Longitudinal Binary Responses, Biometrika, 80, 141
151.
Hilbe, J. M. (1994), Generalized Linear Models, The American Statisti-
cian, 48, 255265.
168 Statistical Computing Software Reviews
Huber, P. J. (1967), The Behavior of Maximum Likelihood Estimates
Under Non-standard Conditions, in Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability, vol. 1, pp. 221
233.
Laird, N. M., and Ware, J. H. (1982), Random-Eects Models for Lon-
gitudinal Data, Biometrics, 38, 963974.
Lee, H., Laird, N. M., and Johnston, G. (in revision), Combining GEE and
REML for Estimation of Generalized Linear Models With Incomplete
Multivariate Data.
Liang, K.-Y., and Zeger, S. L. (1986), Longitudinal Data Analysis Using
Generalized Linear Models, Biometrika, 73, 1322.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing
Data, New York: Wiley.
MathSoft (1996), Splus Version 3.4, Supplement, Seattle, WA: Data Anal-
ysis Products Division.
McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models, New
York: Chapman and Hall.
Prentice, R. L. (1988), Correlated Binary Regression With Covariates
Specic to Each Binary Observation, Biometrics, 44, 10331048.
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1995), Analysis of Semi-
parametric Regression Models for Repeated Outcomes in the Presence
of Missing Data, Journal of the American Statistical Association, 90,
106121.
SAS Institute (1996), SAS/STAT Software: Changes and Enhancements for
Release 6.12, Cary, NC: SAS Institute, Inc.
Shah, B. V., Barnwell, B. G., Bieler, G. S. (1997), SUDAAN Users Manual,
Release 7.5, Research Triangle Park, NC: Research Triangle Institute.
StataCorp (1997), Stata Statistical Software: Release 5.0, College Station,
TX: Stata Corporation.
White, H. (1980), A Heteroskedasticity-Consistent Covariance Matrix Es-
timator and a Direct Test for Heteroskedasticity, Econometrica, 48,
817838.
(1982), Maximum Likelihood Estimation of Misspecied Mod-
els, Econometrica, 50, 125.
Zahner, G. E. P., and Daskalakis, C. (1997), Factors Associated With
Mental Health, General Health and School-Based Service Use for Psy-
chopathology, American Journal of Public Health, 87, 14401448.
Zahner, G. E. P., Jacobs, J. H., Freeman, D. H., and Trainor, K. (1993),
Rural-Urban Child Psychopathology in a Northeastern U.S. State:
19861989, Journal of the American Academy of Child Adolescent Psy-
chiatry, 32, 378387.
Zahner, G. E. P., Pawelkiewicz, W., DeFrancesco, J. J., and Adnopoz, J.
(1992), Childrens Mental Health Service Needs and Utilization Pat-
terns in an Urban Community, Journal of the American Academy of
Child Adolescent Psychiatry, 31, 951960.
Zeger, S. L., and Liang, K.-Y. (1986), Longitudinal Data Analysis for
Discrete and Continuous Outcomes, Biometrics, 42, 121130.
The American Statistician, May 1999, Vol. 53, May 1999 169

You might also like