
Psychological Methods
2002, Vol. 7, No. 1, 105–125
Copyright 2002 by the American Psychological Association, Inc.
1082-989X/02/$5.00  DOI: 10.1037//1082-989X.7.1.105

Combining Effect Size Estimates in Meta-Analysis With Repeated Measures and Independent-Groups Designs

Scott B. Morris, Illinois Institute of Technology
Richard P. DeShon, Michigan State University

When a meta-analysis on results from experimental studies is conducted, differences in the study design must be taken into consideration. A method for combining results across independent-groups and repeated measures designs is described, and the conditions under which such an analysis is appropriate are discussed. Combining results across designs requires that (a) all effect sizes be transformed into a common metric, (b) effect sizes from each design estimate the same treatment effect, and (c) meta-analysis procedures use design-specific estimates of sampling variance to reflect the precision of the effect size estimates.

Extracting effect sizes from primary research reports is often the most challenging step in conducting a meta-analysis. Reports of studies often fail to provide sufficient information for computing effect size estimates, or they include statistics (e.g., results of significance tests or probability values) other than those needed by the meta-analyst. In addition, a set of primary research studies will often use different experimental designs to address the same research question. Although not commonly recognized, effect sizes from different experimental designs often estimate different population parameters (Ray & Shadish, 1996) and cannot be directly compared or aggregated unless adjustments for the design are made (Glass, McGaw, & Smith, 1981; Morris & DeShon, 1997).

The issue of combining effect sizes across different research designs is particularly important when the primary research literature consists of a mixture of independent-groups and repeated measures designs. For example, consider two researchers attempting to determine whether the same training program results in improved outcomes (e.g., smoking cessation, job performance, academic achievement). One researcher may choose to use an independent-groups design, in which one group receives the training and the other group serves as a control. The difference between the groups on the outcome measure is used as an estimate of the treatment effect. The other researcher may choose to use a single-group pretest–posttest design, in which each individual is measured before and after treatment has occurred, allowing each individual to be used as his or her own control.¹ In this design, the difference between the individuals' scores before and after the treatment is used as an estimate of the treatment effect.

Both researchers in the prior example are interested in addressing the same basic question—is the training program effective? However, the fact that the researchers chose different research designs to address this question results in a great deal of added complexity for the meta-analyst. When the research base consists entirely of independent-groups designs, the calculation of effect sizes is straightforward and has been described in virtually every treatment of meta-analysis (Hedges & Olkin, 1985; Hunter & Schmidt, 1990; Rosenthal, 1991). Similarly, when the studies all use repeated measures designs, methods exist for conducting meta-analysis on the resulting effect sizes (Becker, 1988; Dunlap, Cortina, Vaslow, & Burke, 1996; Gibbons, Hedeker, & Davis, 1993).

¹ Repeated measures is a specific form of the correlated-groups design, which also includes studies with matched or yoked pairs. The common feature of all these designs is that observations are not independent, because two or more data points are contributed by each individual or each matched pair. Because all correlated-groups designs have the same statistical properties, the methodology presented in this article applies to all correlated-groups designs. However, for simplicity, the method is described in terms of the pretest–posttest repeated measures design.

Scott B. Morris, Institute of Psychology, Illinois Institute of Technology; Richard P. DeShon, Department of Psychology, Michigan State University. Correspondence concerning this article should be addressed to Scott B. Morris, Institute of Psychology, Illinois Institute of Technology, 3101 South Dearborn, Chicago, Illinois 60616. E-mail: scott.morris@iit.edu


However, in many research areas, such as training effectiveness (Burke & Day, 1986; Dilk & Bond, 1996; Guzzo, Jette, & Katzell, 1985), organizational development (Neuman, Edwards, & Raju, 1989), and psychotherapy (Lipsey & Wilson, 1993), the pool of studies available for a meta-analysis will include both repeated measures and independent-groups designs. In these cases, the meta-analyst is faced with concerns about whether results from the two designs are comparable.

Although our discussion focuses on combining effect size estimates from independent-groups and repeated measures designs, it is important to note that this is a general problem in meta-analysis and is not specific to these two designs. Unless a set of studies consists of perfect replications, differences in design may result in studies that do not estimate the same population effect size. Essentially the same issues have been raised for meta-analysis with other types of designs, such as different factorial designs (Cortina & Nouri, 2000; Morris & DeShon, 1997) or studies with nonequivalent control groups (Shadish, Navarro, Matt, & Phillips, 2000).

The combination of effect sizes from alternate designs raises several important questions. Is it possible to simply combine these effect sizes and perform an overall meta-analysis? Do these effect sizes provide equivalent estimates of the treatment effect? How does the mixture of designs affect the computational procedures of the meta-analysis?

When dealing with independent-groups and repeated measures designs, the current literature does not offer consistent guidance on these issues. Some researchers have recommended that studies using a single-group pretest–posttest design should be excluded from a meta-analysis (e.g., Lipsey & Wilson, 1993). Others have combined effect sizes across designs, with little or no discussion of whether the two designs provide comparable estimates (e.g., Eagly, Makhijani, & Klonsky, 1992; Gibbons et al., 1993).

Our perspective is that effect size estimates can be combined across studies only when these studies provide estimates of the same population parameter. In some cases, studies that use different designs will estimate different parameters, and therefore effect sizes from these studies should not be combined. In other cases, it will be possible to obtain comparable effect size estimates despite differences in the research design. The goal of this article is to discuss the conditions under which effect sizes should and should not be combined, so that researchers can make informed decisions about the best way to treat alternate designs in a particular research domain.

Much of this article focuses on the question of whether effect sizes are comparable across the alternate designs. For effect size estimates to be meaningfully compared across studies, it is necessary that (a) all effect sizes estimate the same treatment effect and (b) all effect sizes be scaled in the same metric. These two issues are reflected in the two parameters that compose the standardized mean difference effect size. The numerator reflects the mean difference between treatment conditions, and the denominator reflects the standard deviation of the population. If the effect sizes from different studies estimate different population mean differences or different population standard deviations, they cannot be meaningfully combined. For instance, studies with different operationalizations of the independent variable may produce different treatment effects (Cortina & DeShon, 1998; Hunter & Schmidt, 1990). Also, the magnitude of the treatment effect can be influenced by the experimental design. Alternate experimental designs control for different sources of bias, potentially leading to different estimates of treatment effects. As a result, in many meta-analyses the experimental design is examined as a moderator of the effect size.

Another factor that affects the comparability of effect sizes across studies is the scaling of the effect size. Although the use of the standardized mean difference adjusts for differences in the scaling of the dependent variable across studies, it does not guarantee that the effect sizes have comparable metrics. Differences in study design can lead to different definitions of the relevant populations and, therefore, different standard deviations. For example, Morris and DeShon (1997) showed that the within-cell standard deviation from a factorial analysis of variance (ANOVA) reflects a population where the other factors in the design are fixed. The standard deviation from a t test, on the other hand, does not control for these other factors and, therefore, may reflect a different population. Effect size estimates computed from these different standard deviations would not be in the same metric and could not be meaningfully combined.

Similarly, the error term from a repeated measures t test is a function of the standard deviation of change scores, whereas the error term from an independent-groups t test is a function of the standard deviation of raw scores.

These two tests reflect different conceptualizations of the relevant population, and both conceptualizations have been adopted as the basis for a repeated measures effect size. Some have argued that the effect size from a repeated measures study should be defined in terms of the standard deviation of raw scores (Becker, 1988; Dunlap et al., 1996; Hunter & Schmidt, 1990). Others (Gibbons et al., 1993; Johnson & Eagly, 2000) have defined the repeated measures effect size using the standard deviation of change scores. Both definitions of the effect size are reasonable; however, they reflect different population parameters. As long as all effect sizes are defined consistently, the analyst may select the effect size metric that best reflects the research question under investigation.

The purpose of this presentation is to highlight that effect sizes can be combined across independent-groups and repeated measures designs. However, doing so requires that (a) all effect sizes be transformed into a common metric, (b) effect sizes from each design estimate the same treatment effect, and (c) meta-analysis procedures use design-specific estimates of sampling variance to reflect the precision of the effect size estimates. In the following sections, we review the effect sizes that have been defined for alternate designs and discuss the conditions under which they will provide comparable estimates. In addition, we provide a general method whereby the meta-analysis can be conducted in the metric most appropriate for the analyst's research question.

Alternate Definitions of the Effect Size

In many research domains, the pool of primary studies contains a mixture of independent-groups and repeated measures designs, which could lead to different definitions of the effect size. We consider three common designs that can be distinguished along two dimensions. First, some designs use repeated measurements of the outcome variable (e.g., before and after treatment), whereas other designs measure the outcome only at a single point in time (posttreatment). Second, some designs compare results across independent groups (e.g., treatment and control groups), whereas other designs examine only the treatment group. These factors define three designs, each of which has led researchers to develop distinct definitions of the effect size.

In the independent-groups posttest design, the outcome is measured at a single point in time and is compared across independent groups that receive different treatments (e.g., experimental and control groups). When independent groups are analyzed, the research question focuses on the difference between groups relative to the variability within groups. The relevant parameters reflect the posttest means of the two treatment populations and the common standard deviation of scores within each population ($\sigma_{post}$). For simplicity, we refer to experimental and control treatments ($\mu_{post,E}$ and $\mu_{post,C}$), but the method can be generalized to other contrasts as well. Hedges (1981, 1982) has defined the independent-groups effect size as follows:

$$\delta_{IG} = \frac{\mu_{post,E} - \mu_{post,C}}{\sigma_{post}}. \quad (1)$$

If we assume homogeneity of variance, the best estimate of $\sigma_{post}$ is the pooled within-group standard deviation of posttest scores ($SD_{post,P}$). Therefore, the sample estimator of the independent-groups effect size is as follows:

$$d_{IG} = \frac{M_{post,E} - M_{post,C}}{SD_{post,P}}, \quad (2)$$

where $M_{post,E}$ and $M_{post,C}$ are the sample posttest means of the experimental and control groups, respectively.

In the single-group pretest–posttest design, all participants receive the same treatment, and scores on the outcome are compared before and after treatment is administered. The research questions in the repeated measures design focus on change within a person, relative to the variability of change scores. Hence, these data are analyzed in reference to the population of change scores. This is illustrated by the repeated measures t test, where the denominator is a function of the standard deviation of change scores rather than the standard deviation of raw scores. Gibbons et al. (1993) defined the repeated measures effect size ($\delta_{RM}$) in terms of the population mean change ($\mu_{D,E}$) and standard deviation of change scores ($\sigma_{D,E}$) in the experimental group,

$$\delta_{RM} = \frac{\mu_{D,E}}{\sigma_{D,E}}, \quad (3)$$

which is estimated by the sample statistic,

$$d_{RM} = \frac{M_{D,E}}{SD_{D,E}} = \frac{M_{post,E} - M_{pre,E}}{SD_{D,E}}. \quad (4)$$

Here, $M_{D,E}$ is the sample mean change, or the mean difference between pre- and posttest scores, in the experimental group ($M_{pre,E}$ and $M_{post,E}$), and $SD_{D,E}$ represents the sample standard deviation of change scores.

In the independent-groups pretest–posttest design, the outcome is measured before and after treatment, and different groups receive different treatments (e.g., experimental and control groups). Becker (1988) recommended first computing an effect size within each treatment condition and then subtracting the control-group from the experimental-group effect size. The effect size for each treatment condition is defined as the pretest–posttest change divided by the pretest standard deviation ($\sigma_{pre}$). Because pretest standard deviations are measured before any treatment has occurred, they will not be influenced by the experimental manipulations and are therefore more likely to be consistent across studies (Becker, 1988). If homogeneity of pretest variances is assumed, the effect size for the independent-groups pretest–posttest design ($\delta_{IGPP}$) is

$$\delta_{IGPP} = \frac{\mu_{post,E} - \mu_{pre,E}}{\sigma_{pre}} - \frac{\mu_{post,C} - \mu_{pre,C}}{\sigma_{pre}}, \quad (5)$$

which is estimated by the sample statistic,

$$d_{IGPP} = \frac{M_{post,E} - M_{pre,E}}{SD_{pre,E}} - \frac{M_{post,C} - M_{pre,C}}{SD_{pre,C}}. \quad (6)$$
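All three estimators compute directly from commonly reported summary statistics. The following minimal Python sketch (our own illustration, not code from the article; function and variable names are ours) implements Equations 2, 4, and 6:

```python
import math

def d_ig(m_post_e, m_post_c, sd_post_e, sd_post_c, n_e, n_c):
    """Independent-groups effect size, Equation 2. The denominator is
    the pooled within-group posttest standard deviation, assuming
    homogeneity of variance."""
    sd_post_pooled = math.sqrt(
        ((n_e - 1) * sd_post_e**2 + (n_c - 1) * sd_post_c**2) / (n_e + n_c - 2)
    )
    return (m_post_e - m_post_c) / sd_post_pooled

def d_rm(m_post_e, m_pre_e, sd_d_e):
    """Repeated measures effect size, Equation 4: mean pre-post change
    divided by the standard deviation of change scores."""
    return (m_post_e - m_pre_e) / sd_d_e

def d_igpp(m_post_e, m_pre_e, sd_pre_e, m_post_c, m_pre_c, sd_pre_c):
    """Independent-groups pretest-posttest effect size, Equation 6:
    standardized change in the experimental group minus standardized
    change in the control group, each scaled by its pretest SD."""
    return ((m_post_e - m_pre_e) / sd_pre_e
            - (m_post_c - m_pre_c) / sd_pre_c)
```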
Combining Results Across Designs

For effect size estimates to be combined across studies, it is essential that they estimate the same population parameter. Any differences in the designs of those studies could result in effect sizes that estimate different parameters. The three effect sizes defined above illustrate the potential for inconsistencies across independent-groups and repeated measures designs. Because each effect size is defined in terms of a different mean contrast and a different standard deviation, it will be appropriate to combine them in a meta-analysis only when the relevant parameters are equivalent across designs. In some cases, it will be reasonable to assume that the parameters are equivalent or can be transformed into an equivalent form. In other cases, effect sizes from alternate designs will not be comparable and should not be combined in a meta-analysis.

It is appropriate to combine effect sizes across designs as long as three requirements can be satisfied. First, all effect size estimates must be placed in the same metric before aggregation is possible. Effect sizes for repeated measures data typically use different standard deviations than the effect size for the independent-groups posttest design. The use of different standard deviations results in incompatible scales, unless all effect sizes are transformed into a common metric.

Second, the meta-analyst must determine whether the effect sizes from different designs provide equally good estimates of the treatment effect. Some designs provide better control for sources of bias and therefore more accurately estimate the treatment effect. Combining results across designs is not appropriate if the designs yield effect sizes that are differentially affected by biasing factors. Therefore, before combining effect sizes across different designs, the meta-analyst must determine that potential sources of bias do not impact the effect size estimates. This could be accomplished conceptually, based on knowledge of the research methodologies used, or empirically, through moderator analysis.

Third, different designs estimate the treatment effect with more or less precision. Differences in precision should be taken into account when aggregating effect sizes across studies. This can be accomplished by weighting studies by the estimated sampling variance of the effect size, which is partly a function of the study design. Each of these issues is discussed in detail in the following sections.
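As an illustration of the third requirement, the usual fixed-effects aggregation weights each study by the inverse of its estimated sampling variance (Hedges & Olkin, 1985). Here is a minimal sketch of that weighting, our own illustration rather than code from the article:

```python
def weighted_mean_effect(effect_sizes, sampling_variances):
    """Fixed-effects weighted mean effect size: each estimate is
    weighted by the inverse of its design-specific sampling variance,
    so more precise designs contribute more to the mean."""
    weights = [1.0 / v for v in sampling_variances]
    mean_d = sum(w * d for w, d in zip(weights, effect_sizes)) / sum(weights)
    var_mean = 1.0 / sum(weights)  # sampling variance of the mean
    return mean_d, var_mean
```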
Comparability of Metrics

Making accurate inferences when combining effect sizes across studies in meta-analysis requires that the effect sizes all be in the same metric (Glass et al., 1981). Unless the scale of the dependent variable is standardized, differences in the measures used across studies could create artificial differences in effect size. Meta-analysis procedures entail the use of standardized measures of effect size, such as the correlation coefficient (Hunter & Schmidt, 1990; Rosenthal, 1991) or the standardized mean difference between groups (Glass et al., 1981; Hedges, 1982), to accomplish this requirement. However, the use of a standardized effect size does not guarantee comparable scaling. Effect sizes from alternate designs may use different standard deviations (e.g., the standard deviation of pretest vs. posttest scores or of raw scores vs. change scores). Effect sizes from alternate designs will be comparable only if these standard deviations are the same or can be transformed into a common parameter.

One reason for different scaling of the effect size stems from the use of pretest versus posttest standard deviations (Carlson & Schmidt, 1999). As shown in Equations 2, 4, and 6, the independent-groups effect size uses only posttest standard deviations; the independent-groups pretest–posttest effect size uses only pretest standard deviations; and the repeated measures effect size uses the standard deviation of difference scores, which is influenced by both pre- and posttest standard deviations. Thus, these effect sizes will be comparable only when the variability of scores is constant across time periods. This is consistent with a compound symmetric error structure for repeated measures data (Winer, 1971). To the extent that treatment or time affects individuals differentially (a subject by time interaction), scores will grow more or less variable over time (Cook & Campbell, 1979). In these cases, effect sizes computed using different standard deviations will not be comparable.

Even when the variance of scores is homogeneous across time, steps must be taken to ensure that effect sizes estimated from alternate designs are in the same metric. Because they use different standard deviations, $d_{IG}$ and $d_{RM}$ are not in the same metric. Fortunately, it is possible to translate effect sizes from one metric to the other. Such transformations have been discussed in previous work on meta-analysis, but these methods allow only transformation into the raw-score metric (e.g., Dunlap et al., 1996; Glass et al., 1981). We adopt a more flexible approach whereby the researcher may transform all effect sizes into the metric most appropriate to address the research question.

Choosing an effect size metric. Before transforming effect sizes into a common metric, the meta-analyst must decide what the metric will be. Different metrics resulting from alternate study designs each represent legitimate, but different, definitions of the population. The choice depends on how the meta-analyst wishes to frame the research question.

The repeated measures and independent-groups designs reflect different ways of framing the research question, which lead to different definitions of the population effect size. Specifically, in the independent-groups design, individuals are assigned to different treatment conditions. The focus is on group differences in the level of the outcome measure. In contrast, in a repeated measures design, the same individual is observed under multiple treatment conditions. The interest is in how the individual's performance changes as a result of successive trials (Keppel, 1982).

One reason researchers choose an independent-groups or repeated measures design is based on the match of the design with the research question. If the research question concerns the effectiveness of alternate treatments (e.g., different dosages of a drug), the focus is on whether differences between treatment groups exist. Hence, the independent-groups design is appropriate. On the other hand, research on change within an individual (e.g., learning or practice effects) is better analyzed using a repeated measures design, because this design allows the same individual to be tracked across conditions, thereby facilitating the analysis of change.

In a similar fashion, the definition of the effect size should reflect the research question. If the focus of the meta-analysis is on group differences, the mean difference between conditions is compared with the variability of scores within each condition (i.e., the raw-score metric). On the other hand, if the research focus is on change, mean change due to treatment should be compared with the variability of change scores (i.e., the change-score metric). The focus of the question on the level of performance versus the change in performance leads to the use of different standard deviations and thus to different definitions of the effect size.

Even when the mean difference is equivalent across the two designs ($\mu_{D,E} = \mu_{post,E} - \mu_{post,C}$), $d_{IG}$ and $d_{RM}$ can differ considerably, because the mean difference is related to populations with different standard deviations. The difference between $\sigma_{post}$ and $\sigma_D$ is a function of the correlation between pre- and posttest scores ($\rho$). Assuming equal standard deviations in pre- and posttest populations,

$$\sigma_D = \sigma\sqrt{2(1 - \rho)}. \quad (7)$$

Consequently, the two definitions of the effect size are also related as a function of the pretest–posttest correlation,

$$\delta_{RM} = \frac{\mu_D}{\sigma\sqrt{2(1 - \rho)}} = \frac{\delta_{IG}}{\sqrt{2(1 - \rho)}}. \quad (8)$$

When $\rho$ is greater than .5, $\sigma_D$ will be smaller than $\sigma$, and as a result, the repeated measures effect size will be larger than the independent-groups effect size. In contrast, when $\rho$ is less than .5, $\sigma_D$ will be greater than $\sigma$, and the independent-groups effect size will be larger. The two effect sizes will produce the same result only when $\rho = .5$, but even in this case they may have different interpretations.

Research on the stability of performance over time suggests that the pretest–posttest correlation will often exceed .5, both for simple perceptual tasks (Fleishman & Hempel, 1955) and in more complex domains, such as job performance (Rambo, Chomiak, & Price, 1983). Thus, use of the change-score metric will often produce larger effect sizes than the raw-score metric.

The different interpretations of the two effect size metrics can be illustrated through an example. Kelsey (1961) conducted a study to investigate the effect of mental practice on task performance using a repeated measures design. For the purpose of this illustration, we assume that the same mean difference would have been obtained if the practice and no-practice conditions were independent groups (the appropriateness of this assumption is discussed in the following section). The independent-groups effect size would reflect the mean difference between practice and no-practice conditions, relative to the pooled within-group standard deviation:

$$d_{IG} = \frac{45.0 - 37.9}{12.3} = 0.58. \quad (9)$$

This effect size estimates the average improvement relative to the variability in task performance in the population. Specifically, after mental practice, the average performance was 0.58 standard deviations above the average performance without practice. An alternative interpretation of the effect size is based on the overlap between distributions. Assuming that the populations are normally distributed with equal variance, one could conclude that the average performance after mental practice was greater than the performance of 72% of the no-practice population (see Figure 1).

Figure 1. Graphic interpretation of independent-groups effect size. The dashed line represents the distribution of scores without treatment. The solid line represents the distribution of scores with treatment.

The results could also be represented as a repeated measures effect size, where the mean difference is divided by the standard deviation of change scores:

$$d_{RM} = \frac{45.0 - 37.9}{8.5} = 0.84. \quad (10)$$

Here, the effect size indicates that the average improvement was 0.84 standard deviations above zero. The interpretation of $d_{RM}$ can be represented graphically by plotting the pre- and posttest scores for each individual (see Figure 2). For ease of interpretation, all of the individuals whose scores are depicted in the figure have the same mean score (pre- and posttest combined). Because mean differences between subjects do not influence the change scores, equating subjects on the mean score does not alter the result and provides a clearer picture of the variance in change scores. If we assume that the slopes of the lines have a normal distribution, a $d_{RM}$ of 0.84 implies that the change would be positive for 80% of the cases. That is, mental practice would be expected to produce an improvement in task performance for 80% of the population.

Figure 2. Graphic interpretation of the repeated measures effect size. Each line represents the pretest–posttest difference for one individual.

The example reflects the common situation where the change-score metric produced a larger effect size estimate than the raw-score metric (because $\rho$ is greater than .5). This difference does not indicate over- or underestimation by one of the methods but rather reflects a difference in the focus of the research question.

The choice of a metric for the effect size should be guided by the analyst's research question. If the research focuses on differences across alternate treatments, the raw-score metric is preferred. On the other hand, if the focus of the research is on individual change, the change-score metric is most appropriate.
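The arithmetic of the Kelsey example, including both percentage interpretations, can be checked with a short script (our own illustration; the means and standard deviations are those used in Equations 9 and 10):

```python
from statistics import NormalDist

m_practice, m_no_practice = 45.0, 37.9
sd_raw, sd_change = 12.3, 8.5

d_ig = (m_practice - m_no_practice) / sd_raw     # 0.58, Equation 9
d_rm = (m_practice - m_no_practice) / sd_change  # 0.84, Equation 10

# Pre-post correlation implied by Equation 7, SD_D = sigma*sqrt(2(1 - rho)):
rho = 1 - (sd_change / sd_raw) ** 2 / 2          # about .76, i.e., > .5

# Overlap interpretations, assuming normal distributions:
pct_overlap = NormalDist().cdf(d_ig)    # ~.72: treated mean exceeds 72%
pct_improve = NormalDist().cdf(d_rm)    # ~.80: 80% expected to improve
```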

In many situations, the same research question could be framed in terms of either metric. For example, the effectiveness of a training program could be expressed as the difference between training and no-training groups, suggesting the raw-score metric. Alternately, effectiveness could be defined as the amount of change produced as a result of training, suggesting the change-score metric. In either case, the effect size would reflect the difference between performance with and without training but would represent this difference in terms of different standard deviations. The choice will depend on whether the meta-analyst conceives of the relevant population as reflecting the level of versus the change in the outcome variable.

When choosing a metric for the effect size, researchers should also consider whether studies are sampled from populations with different values for $\rho$, the correlation between pre- and posttest scores. If $\rho$ differs across studies, the variance of change scores will be heterogeneous, and $d_{RM}$ from these studies will be standardized in different metrics. If study characteristics that moderate $\rho$ can be identified (e.g., length of time between repeated measurements), subsets of studies with homogeneous $\rho$ could be analyzed separately. Alternately, effect sizes could be defined in the raw-score metric, which does not depend on the value of $\rho$. Because the raw-score metric is not sensitive to variations in $\rho$, it is recommended for situations in which the homogeneity of $\rho$ cannot be assumed and cannot be tested empirically.

When the research question does not clearly suggest use of one metric over the other, several additional factors should influence the choice. For example, it will generally be best to define the effect sizes in terms of the predominant design used in the pool of studies to be meta-analyzed. Studies that use a particular design are more likely to report the data needed to compute the effect size for that design. Therefore, matching the effect size metric to the design of the majority of studies will greatly facilitate the computation of effect sizes.

Another consideration is the ease of communicating results. A major advantage of the raw-score metric is its familiarity. The independent-groups effect size has been used in numerous meta-analyses, and most readers are familiar with its interpretation. Because the change-score metric represents a departure from this common approach, we recommend its use only in those situations in which the research question clearly calls for the analysis of change.

Transformations to produce a common metric. When the population correlation between pre- and posttest scores is known, any of the effect sizes can be transformed into either the raw-score or change-score metric. It should be noted that these transformations only correct for differences in the metric (i.e., the standard deviation) of the effect size. They will not overcome disparities in how different designs estimate the mean difference between groups (i.e., differences in control for biasing factors).

To transform a repeated measures effect size into the raw-score metric, use

$$d_{IG} = d_{RM}\sqrt{2(1 - \rho)}. \quad (11)$$

To transform an independent-groups effect size into the change-score metric, use

$$d_{RM} = \frac{d_{IG}}{\sqrt{2(1 - \rho)}}. \quad (12)$$

The transformed effect size in Equation 11 is similar to the effect size proposed for repeated measures data by Becker (1988; cf. Morris, 2000):

$$d_{IG} = \frac{M_{post} - M_{pre}}{SD_{pre}}. \quad (13)$$

Although the two approaches will not produce exactly the same value, they are equivalent estimates of the population effect size. As long as pre- and posttest scores have equal population variances, the two estimates will have identical expectation and sampling variance. However, if variances are not equal over time, the use of the pretest standard deviation in Equation 13 would be preferable, because this value is unaffected by the treatment and therefore should be more consistent across studies (Becker, 1988).
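Equations 11 and 12 amount to a pair of one-line conversions. A minimal sketch (our own; it assumes equal pre- and posttest variances and a known, or estimated, value of $\rho$):

```python
import math

def rm_to_raw(d_rm, rho):
    """Equation 11: change-score metric -> raw-score metric."""
    return d_rm * math.sqrt(2 * (1 - rho))

def raw_to_rm(d_ig, rho):
    """Equation 12: raw-score metric -> change-score metric."""
    return d_ig / math.sqrt(2 * (1 - rho))
```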
Others have suggested pooling pre- and posttest standard deviations rather than using the pretest standard deviation in Equation 13 (Dunlap et al., 1996; Taylor & White, 1992). The increase in the degrees of freedom should result in a better estimate of $\sigma$ and thus a more precise estimate of $\delta_{IG}$. Unfortunately, the distributional properties of this estimator are unknown. Because the pooled standard deviation is computed from nonindependent samples, the degrees of freedom are less than if all observations were independent. However, the exact reduction in degrees of freedom is not known, and therefore a precise estimate of the sampling variance cannot be computed. Given the need to estimate the sampling variance, it is preferable to estimate the effect size using either Equation 11 or 13, for which the sampling variance is known.

Effect sizes from the independent-groups pretest–posttest design could also be computed in either metric. If it can be assumed that the variance of scores is homogeneous across time, the effect size defined in Equation 6 ($d_{IGPP}$) will be in the same metric as $d_{IG}$. Alternatively, the effect size could be computed in the repeated measures metric. This could be accomplished either by replacing the pretest standard deviation in Equation 6 with the standard deviation of change scores or by applying the transformation given in Equation 12.

Comparability of Treatment Effects

Often, meta-analysis includes studies involving a variety of experimental and quasi-experimental designs. Various designs have been developed to control for different sources of potential bias. To combine results across different designs, we therefore need to assume that the potential sources of bias do not impact effect size estimates. In the following sections, we outline the potential sources of bias for independent-groups and repeated measures designs and discuss the assumptions needed to combine effect sizes across these designs. We also describe methods to empirically evaluate the impact of biasing factors (e.g., moderator analysis), as well as methods that can be used to correct for bias.

The magnitude of the treatment effect (in the original metric) is represented by the difference between group means (i.e., the numerator of the effect size estimate). To illustrate the situations in which alternate designs provide comparable estimates of the treatment effect, we first describe a general model that articulates several key sources of bias. Then, we discuss how these biases influence the results in various research designs. This model is intended to illustrate the types of bias that can occur in independent-groups and repeated measures designs. A more comprehensive discussion of the strengths and weaknesses of alternate designs can be found in a text on quasi-experimental design (e.g., Cook & Campbell, 1979). Furthermore, researchers conducting meta-analysis should consider the design issues most relevant to the particular research domain they are studying.

Consider a study in which participants are assigned to either a treatment or a control group, and a dependent variable is measured in both groups before and after the treatment is administered. The pretest score for the $i$th individual in group $j$ is denoted by $Pre_{ij}$. Similarly, posttest scores are denoted by $Post_{ij}$. The various factors influencing scores in each condition are illustrated in Figure 3.

Figure 3. Potential sources of bias in treatment effect estimates. Pre = pretest score; Post = posttest score; $\mu$ = mean of pretest population; $\alpha$ = selection effect; $\gamma$ = time effect; $\Delta$ = treatment effect; $\beta$ = relationship between pre- and posttest scores; $\epsilon$ = random error term. Subscripts indicate individual participants ($i$), treatment versus control groups ($E$ and $C$), and pre- versus posttest scores (1 and 2).

Pretest scores are influenced by the population grand mean ($\mu$), a selection effect ($\alpha_j$), and a random error term ($\epsilon_{ij1}$). The population grand mean refers to the pretest mean of the common population from which participants are sampled, before any selection, treatment, or other events occur. $\epsilon_{ij1}$ and $\epsilon_{ij2}$ refer to the random error terms affecting pretest and posttest scores for individual $i$ in treatment group $j$. The model assumes that all errors are independent and that the expected error in each condition is zero. The selection effect is equal to the population difference between the group pretest mean and the grand mean. The selection effect reflects any factors that produce systematic group differences between experimental and control groups on the pretest. For example, suppose that men were assigned to the experimental condition, whereas women were assigned to the control condition: $\mu$ would refer to the grand mean across gender, $\alpha_E$ would be the difference between the mean pretest score for men and the grand mean, and $\alpha_C$ would be the difference between the pretest mean for women and the grand mean. Such effects are likely in nonequivalent control group designs (Cook & Campbell, 1979), in which self-selection or other nonrandom factors can influence group membership. If participants are randomly assigned to experimental and control conditions, $\alpha_j$ would be zero.

Posttest scores will be influenced to some degree by the individual's standing on the pretest. The slope of the relationship between pre- and posttest scores is indicated by $\beta_j$. If pre- and posttest scores have equal variances, $\beta_j$ is the within-group correlation between the pretest and posttest. The model assumes that this relationship is the same for all participants within a group but may differ across groups.

Posttest scores are also potentially influenced by a time effect ($\gamma_j$) and a treatment effect ($\Delta_E$). $\gamma_j$ reflects any factors that might systematically alter scores between the pretest and posttest but are unrelated to the treatment. Examples of such effects are maturation, history, or fatigue (Cook & Campbell, 1979). $\Delta_E$ reflects the change in scores that is directly caused by the treatment. Because the control group does not receive the treatment, $\Delta_C = 0$ by definition and is therefore excluded from Figure 3. Both $\Delta_E$ and $\gamma_j$ are assumed to equally affect all individuals within a treatment condition; however, the time effect is not necessarily the same for treatment and control groups.

Following the model, pretest scores can be written as

$$Pre_{ij} = \mu + \alpha_j + \epsilon_{ij1}. \quad (14)$$

It is further assumed that, absent any time or treatment effect, the expected pre- and posttest means would be equal. Specifically, it is assumed that the expected value on the posttest, given a score at the mean of the pretest, would equal the expected value of the pretest, or

$$E(Post_{ij} \mid Pre_{ij} = \mu + \alpha_j,\ \Delta = \gamma_j = 0) = \mu + \alpha_j. \quad (15)$$

From this assumption, posttest scores in the presence of treatment and time effects can be written as

$$Post_{ij} = \mu + \alpha_j + \beta_j(Pre_{ij} - \mu - \alpha_j) + \gamma_j + \Delta_j + \epsilon_{ij2} = \mu + \alpha_j + \beta_j(\epsilon_{ij1}) + \gamma_j + \Delta_j + \epsilon_{ij2}, \quad (16)$$

where $\Delta_j = 0$ for the control group. We can use this model to examine how alternate designs would estimate the treatment effect.

Independent-groups posttest design. In the independent-groups posttest design, the treatment effect is computed from the difference between the two posttest means. The expected value of the difference between means is as follows:

$$E(M_{Post,E} - M_{Post,C}) = \Delta_E + (\alpha_E - \alpha_C) + (\gamma_E - \gamma_C). \quad (17)$$

If one were to further assume that time affected both groups equally, and that there was no selection bias, the expected value would equal the true treatment effect.

Single-group pretest–posttest design. If a single group is measured before and after treatment, the treatment effect is estimated from the mean of the change scores. The difference between $Post_E$ and $Pre_E$ is as follows:

$$Post_{iE} - Pre_{iE} = \beta_E(\epsilon_{iE1}) + \gamma_E + \Delta_E + \epsilon_{iE2} - \epsilon_{iE1}. \quad (18)$$

The expected value of the average change score is

$$E(M_{Post,E} - M_{Pre,E}) = \Delta_E + \gamma_E. \quad (19)$$

Thus, the standard pretest–posttest design accurately estimates the treatment effect only when the time effect is zero.

Independent-groups pretest–posttest design. If pre- and posttest scores are available for both the treatment and control groups, a common method of analysis would be to test for the difference across groups in the mean pretest–posttest change. The change score for the experimental group is given in Equation 18. The change score for the control group is

$$Post_{iC} - Pre_{iC} = \beta_C(\epsilon_{iC1}) + \gamma_C + \epsilon_{iC2} - \epsilon_{iC1}, \quad (20)$$

and the expected value of the difference between average change scores is

$$E[(M_{Post,E} - M_{Pre,E}) - (M_{Post,C} - M_{Pre,C})] = \Delta_E + (\gamma_E - \gamma_C). \quad (21)$$

Therefore, this design accurately estimates the treatment effect when the time effect is equivalent across groups.
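The expectations in Equations 17, 19, and 21 can be collected in a small helper (our own illustration; all arguments are hypothetical population values) that makes the design-specific biases explicit:

```python
def expected_estimates(delta_e, alpha_e, alpha_c, gamma_e, gamma_c):
    """Expected treatment-effect estimate for each design, given the
    true effect (delta_e), selection effects (alpha), and time
    effects (gamma)."""
    return {
        # Equation 17: selection and differential time effects
        "independent-groups posttest":
            delta_e + (alpha_e - alpha_c) + (gamma_e - gamma_c),
        # Equation 19: the time effect in the treated group
        "single-group pretest-posttest":
            delta_e + gamma_e,
        # Equation 21: only the differential time effect
        "independent-groups pretest-posttest":
            delta_e + (gamma_e - gamma_c),
    }
```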
When the assumption of equal time effects cannot be met, data from this design can still be used to estimate the same treatment effect (subject to the same bias) as either of the other two designs, simply by using the appropriate means. That is, the treatment effect could be estimated from the difference between posttest means (comparable to the independent-groups posttest design) or from the mean posttest–pretest difference in the experimental group (comparable to the single-group pretest–posttest design). However, both of these estimates ignore part of the available data and therefore will provide less precise and potentially more biased estimates when the assumption is met.

Data from this design can also be analyzed using analysis of covariance (ANCOVA), with pretest scores as the covariate.

A related approach is to compute the mean difference between covariance-adjusted means or residualized gain scores, which are defined as the residual term after regressing the posttest scores onto pretest scores. Both approaches provide the same estimate of the treatment effect (Glass et al., 1981). These approaches are particularly useful when the pretest and posttest scores are in different metrics (e.g., because of the use of different measures), in which case the gain score would be difficult to interpret. Unfortunately, because the comparison is based on adjusted means, the treatment effect estimated with these methods will be comparable to the other designs only under restrictive conditions. According to Maris (1998), ANCOVA will provide an unbiased estimate of the treatment effect only when selection into groups is based on the individual's standing on the covariate. Except in unusual cases (e.g., the regression-discontinuity design; Cook & Campbell, 1979), this condition is unlikely to be met, and therefore the difference between adjusted means will not be comparable to treatment effects from other designs.

When are estimates equivalent? As can be seen by comparing Equations 17, 19, and 21, each estimate of the treatment effect is subject to different sources of bias. Table 1 provides a summary of the designs, the effect size metrics, and the potential sources of bias that may influence the effect size estimates. In the absence of bias, all effect sizes provide equivalent estimates of the treatment effect. However, the effect size estimates will often differ when the sources of bias have a nontrivial effect on the results. Therefore, it is not appropriate to aggregate effect sizes across the different designs unless the potential sources of bias can be ruled out through either rational or empirical methods.

The independent-groups posttest design does not control for selection effects and therefore will often be incompatible with results from the other designs. Lack of random assignment to treatment conditions can bias the estimate from the independent-groups posttest design but has no effect on the other designs. When it can be assumed that assignment to groups is random, the expected value of the selection effect should be zero.

In the single-group pretest–posttest design, it is assumed that all change over time is due to the treatment. In contrast, the other designs require the assumption that change over time is equivalent across conditions. Only when there is no time effect will all three designs be unbiased.

Table 1
Susceptibility of Alternate Effect Size Estimates to Potential Sources of Bias

| Study design | Effect size metric | Effect size estimate | Selection effect ($\alpha$) | Time effect ($\gamma$) | Differential time effect ($\gamma_E - \gamma_C$) | Subject $\times$ Treatment interaction |
|---|---|---|---|---|---|---|
| Independent-groups posttest^b | Raw score | $(M_{post,E} - M_{post,C})/SD_{post,P}$ | ✓ | | ✓ | ✓ |
| Single-group pretest–posttest | Raw score | $(M_{post,E} - M_{pre,E})/SD_{pre,E}$ | | ✓ | | |
| Single-group pretest–posttest | Change score | $(M_{post,E} - M_{pre,E})/SD_{D,E}$ | | ✓ | | ✓ |
| Independent-groups pretest–posttest | Raw score | $M_{D,E}/SD_{pre,E} - M_{D,C}/SD_{pre,C}$ | | | ✓ | |
| Independent-groups pretest–posttest | Change score | $M_{D,E}/SD_{D,E} - M_{D,C}/SD_{D,C}$ | | | ✓ | ✓ |

The selection, time, and differential time columns indicate potential bias in the estimate of the treatment effect ($\Delta$); the Subject $\times$ Treatment column indicates potential bias in the estimate of $\sigma$.^a

Note. post = posttest; E = experimental group; C = control group; P = pooled (i.e., the standard deviation was pooled across experimental and control groups); pre = pretest; D = pre–post difference scores.
^a We assume that the pretest standard deviation provides an unbiased estimate of $\sigma$. ^b The sources of bias will be the same regardless of the effect size metric.

In many areas of research, it is unrealistic to assume that there will be no time effect in the control group. For example, if a study examines change over long periods of time, it is very likely that maturation or history effects would occur, suggesting that the time effect will be nonzero. Research on psychotherapy (Lipsey & Wilson, 1993) and training (Carlson & Schmidt, 1999) frequently demonstrates nonzero change in the control group.

However, in other domains, there may be no reason to expect a change absent treatment. In experimental research comparing performance of the same subject under different conditions, many researchers use counterbalancing of conditions, so that the mean difference between treatments will not be influenced by the order of presentation. A time effect would have the same impact on treatment and control group means and therefore would not bias the estimate of the treatment effect. Other research applies repeated measures over relatively short periods of time under controlled laboratory conditions. In such cases, maturation or history should have little effect. For example, in a meta-analysis on the effects of self-reference on memory (Symons & Johnson, 1997), studies using a single-group pretest–posttest design produced a mean effect size that was very close to the estimate from studies that included a control group.

Furthermore, the assumption of no change in the control group may be viable for variables that are resistant to spontaneous change. For example, in a meta-analysis on psychological treatments for insomnia, Murtagh and Greenwood (1995) combined studies using both the independent-groups posttest design and the single-group pretest–posttest design. The authors argued that it was appropriate to aggregate effect sizes across these two designs, because spontaneous recovery from chronic insomnia is unlikely to occur. This decision was supported by their results, which showed a similar mean effect size across designs. Similarly, in a meta-analysis of training studies, Carlson and Schmidt (1999) found changes in the control group for some variables, but no substantial change was found for measures of trainee attitudes.

Even when time effects occur, effect sizes can be combined across designs under certain conditions. When there is an equivalent time effect across groups, both designs involving independent groups will be unbiased and therefore can be combined. This would be reasonable if the time effect was due to maturation and group assignment was unrelated to maturation rates. Other types of time effects, such as fatigue, may be related to the treatment, and therefore it may not be appropriate to assume equivalence across groups. In such cases, it may be reasonable to assume no time effect in the control group. With this assumption, effect sizes from the single-group pretest–posttest and independent-groups pretest–posttest designs can be combined, although both will be biased estimates of the treatment effect.

In addition to the sources of bias discussed in this section (selection, time, and differential time effects), Table 1 also indicates susceptibility to bias produced by a subject by treatment interaction. This form of bias will not affect the estimation of the treatment effect, but it can bias the estimation of the standard deviation, as discussed in the Comparability of Metrics section. Because the subject by treatment interaction will inflate or deflate posttest variance, any effect size that uses posttest standard deviations is susceptible to this form of bias. Effect sizes from the single-group pretest–posttest and the independent-groups pretest–posttest designs will be unbiased if computed in the raw-score metric (using pretest standard deviations) but will be susceptible to bias when computed in the change-score metric. Results from the independent-groups posttest-only design will be susceptible to this bias regardless of the metric of the effect size, because only posttest standard deviations would be available (Carlson & Schmidt, 1999).

An alternative way to justify aggregation of effect sizes across designs would be to determine empirically whether alternate designs provide similar estimates of the effect size. As a first step in a meta-analysis, a moderator test could be performed to compare effect sizes across designs. If the mean effect sizes differ substantially, then separate analyses should be performed for each design. However, if similar mean effect size estimates are found for the alternate designs, they could be combined into a single meta-analysis. Of course, it would also be possible that differences in the bias in the various estimators would be confounded with differences in other study characteristics. As a means of getting around this problem, the study design could be used as one of several moderators analyzed simultaneously in the meta-analysis (Hedges & Olkin, 1985).
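One way to implement such a moderator test is a between-groups homogeneity statistic that compares the weighted mean effect size of each design. The sketch below is our own illustration in the spirit of Hedges and Olkin (1985), not code from the article:

```python
def q_between(groups):
    """Q-between statistic with study design as the moderator.

    groups: one (effect_sizes, sampling_variances) pair per design.
    Under the null hypothesis that all designs estimate the same
    effect, Q_B is approximately chi-square distributed with
    (number of groups - 1) degrees of freedom."""
    means, weights = [], []
    for effects, variances in groups:
        w = [1.0 / v for v in variances]
        weights.append(sum(w))
        means.append(sum(wi * di for wi, di in zip(w, effects)) / sum(w))
    grand = sum(w * m for w, m in zip(weights, means)) / sum(weights)
    return sum(w * (m - grand) ** 2 for w, m in zip(weights, means))
```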
Correcting for sources of bias. Often, it will not be possible to assume that potential biasing factors have no effect. In some cases, it may still be possible to integrate effect sizes across designs if the relevant sources of bias can be estimated from the available data.

One of the advantages of aggregating results across studies is that the strengths of one study can compensate for the weaknesses of another. If some studies provide adequate data to estimate a potential source of bias, this estimate can be applied to other studies in order to obtain unbiased estimates of the effect size in all studies.

Consider a meta-analysis in which some of the studies use a single-group pretest–posttest design and others have an independent-groups pretest–posttest design. Whenever there is a nonzero pretest–posttest change in the control group ($\gamma_C$), effect sizes from the two designs will estimate different parameters. If it can be assumed that the time effect ($\gamma$) is the same for the treatment and control groups, the independent-groups pretest–posttest design will provide an unbiased estimate of the population effect size, whereas the single-group pretest–posttest design will overestimate the effect of treatment. However, when sufficient information is available, it is possible to obtain an unbiased estimate using effect sizes from both designs.

Becker (1988) described two methods that can be used to integrate results from single-group pretest–posttest designs with those from independent-groups pretest–posttest designs. In both cases, meta-analytic procedures are used to estimate the bias due to a time effect. The methods differ in whether the correction for the bias is performed on the aggregate results or separately for each individual effect size. The two methods are briefly outlined below, but interested readers should refer to Becker (1988) for a more thorough treatment of the issues.

The first approach would be to aggregate the pretest–posttest effect sizes separately for the treatment and control groups. For each independent-groups pretest–posttest study, one effect size would be computed based on the pre- and posttest means of the treatment group, and a separate effect size would be computed based on the pre- and posttest means for the control group. The single-group pretest–posttest design would only provide an effect size for the treatment group. In order to account for the fact that multiple effect sizes are included from the same study, a mixed model analysis could be used to estimate the mean standardized pretest–posttest change for the treatment and control groups. The result for the control group provides an estimate of the time effect, and the difference between the two estimates provides an unbiased estimate of the population effect size. A similar method has been suggested by Li and Begg (1994).

A disadvantage of this method is that separate effect size estimates are required for treatment and control groups. Therefore, it would not be possible to integrate these results with effect sizes from studies using an independent-groups posttest design. Using Becker's (1988) second method, it may be possible to combine results from all three designs.

Rather than correcting for bias at the aggregate level, it is also possible to introduce a bias correction for individual studies. A preliminary meta-analysis would be conducted on control groups from the studies with independent-groups pretest–posttest designs. The mean standardized pretest–posttest change from these studies provides an estimate of the time effect ($\gamma_C$). If this time effect is assumed to be the same in the treatment condition, then the mean time effect can be subtracted from each of the effect size estimates for the single-group pretest–posttest designs. As a result, both designs will provide an unbiased estimate of the treatment effect. Furthermore, under conditions in which the independent-groups posttest design provides an unbiased estimate of the treatment effect, effect sizes from all three designs will be comparable and could therefore be combined in the same meta-analysis.
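In code, the study-level correction of this second method reduces to subtracting the estimated standardized time effect from each single-group effect size. A minimal sketch (our own; gamma_hat is assumed to come from the preliminary meta-analysis of control groups):

```python
def correct_single_group_effects(d_values, gamma_hat):
    """Subtract the estimated standardized pre-post change in control
    groups (gamma_hat) from each single-group pretest-posttest effect
    size, removing the assumed-common time effect."""
    return [d - gamma_hat for d in d_values]
```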
Because this method uses the results from one set of studies to estimate the correction factor for other studies, the effect size estimates will not be independent. Therefore, standard meta-analysis models, which assume independence, will not be appropriate. Consequently, Becker (1988) recommended the use of a generalized weighted least squares model for aggregating the results.

An important assumption of this method is that the source of bias (i.e., the time effect) is constant across studies. This assumption should be tested as part of the initial meta-analysis used to estimate the pretest–posttest change in the control group. If effect sizes are heterogeneous, the investigator should explore potential moderators, and if moderators are found, separate time effects could be estimated for subsets of studies.

Similar methods could be used to estimate and control for other sources of bias. For example, Shadish et al. (2000) conducted a separate meta-analysis on the difference between pretest scores for treatment and control groups. This pretest effect size provided an estimate of the degree of bias resulting from nonrandom assignment.

Sampling Variance Estimates

Sampling variance refers to the extent to which a statistic is expected to vary from study to study, simply as a function of sampling error.

Estimates of sampling error are used in a meta-analysis when computing the mean and testing the homogeneity of effect sizes. Sampling variance is largely a function of the sample size but is also influenced by the study design. For example, when $\rho$ is large, the repeated measures design will provide more precise estimates of population parameters, and the resulting effect size will have a smaller sampling variance.

Variance formulas have been developed for both the independent-groups effect size (Hedges, 1981) and the repeated measures effect size (Gibbons et al., 1993). In addition, Becker (1988; cf. Morris, 2000) and Hunter and Schmidt (1990) have developed slightly different formulas for an effect size in the raw-score metric estimated from repeated measures data. Rather than relying on these separate formulas, we present a general form of the variance that encompasses the three existing procedures, as well as a new situation, that is, an effect size in the change-score metric estimated from an independent-groups posttest design.

The general form for the sampling variance ($\sigma^2_{e_i}$) for an effect size in either metric is

$$\sigma^2_{e_i} = \left(\frac{A^2}{\tilde{n}}\right)\left(\frac{df}{df - 2}\right)\left(1 + \frac{\tilde{n}}{A^2}\delta^2_*\right) - \frac{\delta^2_*}{[c(df)]^2}. \quad (22)$$

The derivation of this formula is presented in Appendix A. Each of the variables in Equation 22 can take on different values, based on the design of the study and the metric of the effect size. $\delta_*$ refers to the population effect size in the metric chosen for the meta-analysis. The bias function $c(df)$ is approximated by the following (Hedges, 1982):

$$c(df) = 1 - \frac{3}{4df - 1}. \quad (23)$$

When the data are from an independent-groups posttest design, $\tilde{n} = n_E n_C/(n_E + n_C)$, and $df = n_E + n_C - 2$. If the data are from a single-group pretest–posttest design, $\tilde{n}$ is the number of paired observations, and $df = n - 1$. In addition, if the result is expressed in a different metric than the original design, the appropriate transformation (as illustrated in Equations 11 and 12) is substituted for $A$. The resulting variance for each type of effect size is indicated in Table 2.

The sampling variance formulas are somewhat more complex when the effect size is estimated from an independent-groups pretest–posttest design. As shown in Equation 6, separate effect size estimates would be calculated for the treatment and control groups. The difference between these two component effect sizes provides the best estimate of the overall effect size for the study. The variance of this combined effect size is equal to the sum of the variances for the two components (Becker, 1988). Thus, the variance would be estimated for each group, using the appropriate equation from Table 2, and then summed.

Table 2
Sampling Variance of the Effect Size as a Function of the Study Design and the Metric Used in the Meta-Analysis

Single-group pretest–posttest design, change-score metric:
$$\sigma^2_e = \left(\frac{1}{n}\right)\left(\frac{n - 1}{n - 3}\right)\left(1 + n\,\delta_{RM}^2\right) - \frac{\delta_{RM}^2}{[c(n - 1)]^2}$$

Single-group pretest–posttest design, raw-score metric:
$$\sigma^2_e = \left[\frac{2(1 - \rho)}{n}\right]\left(\frac{n - 1}{n - 3}\right)\left[1 + \frac{n}{2(1 - \rho)}\,\delta_{IG}^2\right] - \frac{\delta_{IG}^2}{[c(n - 1)]^2}$$

Independent-groups posttest design, raw-score metric:
$$\sigma^2_e = \left(\frac{1}{\tilde{n}}\right)\left(\frac{N - 2}{N - 4}\right)\left(1 + \tilde{n}\,\delta_{IG}^2\right) - \frac{\delta_{IG}^2}{[c(N - 2)]^2}$$

Independent-groups posttest design, change-score metric:
$$\sigma^2_e = \left[\frac{1}{2(1 - \rho)\,\tilde{n}}\right]\left(\frac{N - 2}{N - 4}\right)\left[1 + 2(1 - \rho)\,\tilde{n}\,\delta_{RM}^2\right] - \frac{\delta_{RM}^2}{[c(N - 2)]^2}$$

Note. n is the number of paired observations in a single-group pretest–posttest design; δRM and δIG are the population effect sizes in the change-score and raw-score metrics, respectively; c(df) is the bias function defined in Equation 23; ñ = nE nC/(nE + nC); N is the combined number of observations in both groups (nE + nC).
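Equations 22 and 23, together with the design-specific values of A, ñ, and df, translate directly into code. The sketch below is illustrative rather than part of the original presentation (all names are ours); the final four assignments reproduce the four rows of Table 2 by choosing the appropriate arguments, with ρ, δ, and the sample sizes assumed for the example.

```python
import math

def c(df):
    """Small-sample bias function, Equation 23 (Hedges, 1982)."""
    return 1.0 - 3.0 / (4.0 * df - 1.0)

def sampling_variance(delta, n_tilde, df, a=1.0):
    """General sampling variance of an effect size, Equation 22.

    delta   -- population effect size in the metric chosen for the analysis
    n_tilde -- n for a single-group pretest-posttest design, or
               nE * nC / (nE + nC) for an independent-groups posttest design
    df      -- n - 1 or nE + nC - 2, respectively
    a       -- metric transformation constant (1.0 when no conversion is
               needed; otherwise the multiplier from Equations 11 and 12)
    """
    return ((a**2 / n_tilde) * (df / (df - 2.0))
            * (1.0 + (n_tilde / a**2) * delta**2)
            - delta**2 / c(df)**2)

# The four rows of Table 2, evaluated at illustrative values:
rho, delta, n, nE, nC = 0.60, 0.50, 30, 25, 25
n_tilde, N = nE * nC / (nE + nC), nE + nC

v1 = sampling_variance(delta, n, n - 1)                              # RM design, change metric
v2 = sampling_variance(delta, n, n - 1,
                       a=math.sqrt(2 * (1 - rho)))                   # RM design, raw metric
v3 = sampling_variance(delta, n_tilde, N - 2)                        # IG design, raw metric
v4 = sampling_variance(delta, n_tilde, N - 2,
                       a=1 / math.sqrt(2 * (1 - rho)))               # IG design, change metric
```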
Operational Issues in Estimating Effect Sizes

Estimating the Pretest–Posttest Correlation

Transforming effect sizes into alternate metrics requires an estimate of the population correlation between pre- and posttest scores. Others have avoided this step by computing the independent-groups effect size from means and standard deviations (Becker, 1988; Dunlap et al., 1996; Hunter & Schmidt, 1990). However, the pretest–posttest correlation is also used in the estimate of the sampling variance and therefore will have to be estimated regardless of which approach is used. In addition, it may be possible to estimate ρ from studies in which sufficient data are available and to generalize this estimate to other studies. This is similar to common procedures for estimating study artifacts (e.g., reliability) based on incomplete information in primary research reports (Hunter & Schmidt, 1990). Thus, the analyst would first perform a preliminary meta-analysis on the pretest–posttest correlations and then use the result as the value of ρ in the transformation formula. A variety of methods exist to aggregate correlation coefficients across studies, involving various corrections, weighting functions, and so forth (e.g., Hedges & Olkin, 1985; Hunter & Schmidt, 1990; Rosenthal, 1991). We do not advocate any particular approach but rather rely on the meta-analyst to determine the most appropriate methods for the estimation of ρ.

It is not necessary to assume that a single value of ρ is appropriate for all studies. Some may feel that ρ will change as a function of study characteristics, such as the length of time between pre- and posttest measures (Dunlap et al., 1996). A test for homogeneity of effect size (Hedges & Olkin, 1985) can be used to evaluate whether the estimates of ρ are consistent across studies. If this test is significant, the differences in ρ could be modeled as part of the initial meta-analysis, and then appropriate values estimated for each study.

As noted earlier, the homogeneity of ρ also has implications for the choice of an effect size metric. Effect sizes defined in the change-score metric should be combined only for studies where ρ is the same. To use the change-score metric when ρ varies across studies, the researcher should perform separate meta-analyses for subsets of studies that have homogeneous ρ. Alternatively, the meta-analysis could be conducted in the raw-score metric, which is not affected by variations in ρ.

Although the pretest–posttest correlation may not be included in all study reports, it is often possible to compute this value from available data. If both the pre- and posttest standard deviations ($SD_{\mathrm{pre}}$ and $SD_{\mathrm{post}}$) are known, as well as the standard deviation of difference scores ($SD_D$),

$$r = \frac{SD_{\mathrm{pre}}^2 + SD_{\mathrm{post}}^2 - SD_D^2}{2\,SD_{\mathrm{pre}}\,SD_{\mathrm{post}}} \tag{24}$$

can be used to compute the pretest–posttest correlation. If the pre- and posttest standard deviations are not known, r can also be estimated from the pooled standard deviation ($SD_P$),

$$r = 1 - \frac{SD_D^2}{2\,SD_P^2}. \tag{25}$$

The derivations of Equations 24 and 25 are presented in Appendix B.

If the variance of difference scores is not reported, it can often be computed from test statistics and then substituted into one of the above equations. For example, if the means, sample size, and repeated measures t test ($t_{RM}$) are reported,

$$SD_D^2 = \frac{n\,(M_{\mathrm{post}} - M_{\mathrm{pre}})^2}{t_{RM}^2}. \tag{26}$$

Estimating Effect Sizes From Test Statistics

If the means and standard deviations are unavailable, the effect size can also be computed from test statistics, using familiar conversion formulas. When an independent-groups t test is reported ($t_{IG}$), the independent-groups effect size estimate can be computed as follows (Glass et al., 1981):

$$d_{IG} = t_{IG}\,\sqrt{\frac{n_E + n_C}{n_E\,n_C}}, \tag{27}$$

where $n_E$ and $n_C$ are the sample sizes from the two treatment conditions. Similarly, the repeated measures t test ($t_{RM}$) can be transformed into a repeated measures effect size (Rosenthal, 1991),

$$d_{RM} = \frac{t_{RM}}{\sqrt{n}}, \tag{28}$$

where n is the number of individuals or matched pairs in the experiment.
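All of the conversions above are simple arithmetic and are easy to script. The helpers below are an illustrative sketch, with function names and signatures that are ours rather than the authors':

```python
import math

def r_from_sds(sd_pre, sd_post, sd_d):
    """Pretest-posttest correlation from reported SDs (Equation 24)."""
    return (sd_pre**2 + sd_post**2 - sd_d**2) / (2 * sd_pre * sd_post)

def r_from_pooled_sd(sd_p, sd_d):
    """Correlation from the pooled SD and difference-score SD (Equation 25)."""
    return 1 - sd_d**2 / (2 * sd_p**2)

def sd_d_sq_from_t(n, m_post, m_pre, t_rm):
    """Difference-score variance recovered from a paired t test (Equation 26)."""
    return n * (m_post - m_pre)**2 / t_rm**2

def d_ig_from_t(t_ig, n_e, n_c):
    """Independent-groups d from a two-sample t test (Equation 27)."""
    return t_ig * math.sqrt((n_e + n_c) / (n_e * n_c))

def d_rm_from_t(t_rm, n):
    """Repeated measures d from a paired t test (Equation 28)."""
    return t_rm / math.sqrt(n)
```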
For studies with an independent-groups pretest–posttest design, the effect size can be computed from test statistics, as long as the designs provide equivalent estimates of the treatment effect. The estimate depends on the particular statistical test used. If a t test was computed on the difference between treatment- and control-group gain scores, then the repeated measures effect size can be estimated using

$$d_{RM} = t_{\mathrm{gain}}\,\sqrt{\frac{n_E + n_C}{n_E\,n_C}}. \tag{29}$$

An equivalent test would be to conduct a 2 × 2 ANOVA with one between-groups factor (experimental vs. control group) and one repeated factor (pre- vs. posttest). The square root of the F test on the group by time interaction is equivalent to $t_{\mathrm{gain}}$ and therefore can be substituted into Equation 29. Because the F value does not indicate the direction of the difference, the meta-analyst must also specify whether the effect size is positive or negative, depending on the pattern of means. For larger factorial designs, adjustments for the other factors should also be considered (Cortina & Nouri, 2000).

If the data were analyzed using residualized gain scores or ANCOVA, the significance test will be based on the difference between adjusted means, and therefore it will not be possible to estimate the effect size for unadjusted means from the significance test. As noted above, the difference between adjusted means will often not estimate the same treatment effect as other designs. However, in cases in which the estimates are deemed to be comparable, Glass et al. (1981) provided formulas for estimating the independent-groups effect sizes from statistical tests. If a t test on the difference in residualized gain scores ($t_r$) is reported,

$$d_{IG} = t_r\,\sqrt{(1 - \rho^2)\,\frac{n_E + n_C}{n_E\,n_C}}. \tag{30}$$

The F test from an ANCOVA can be translated into an independent-groups effect size using

$$d_{IG} = 2\,\sqrt{\frac{F\,(1 - \rho^2)\,(df_w - 1)}{(n_E + n_C)\,(df_w - 2)}}, \tag{31}$$

where $df_w$ is the residual within-groups degrees of freedom.
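The same pattern extends to these gain-score and covariance-adjusted conversions. As before, the following is an illustrative sketch under the stated comparability assumptions, with names of our choosing:

```python
import math

def d_rm_from_gain_t(t_gain, n_e, n_c):
    """Repeated measures d from a t test on gain scores (Equation 29).

    sqrt(F) for the group-by-time interaction may be substituted for
    t_gain, with the sign set from the pattern of means.
    """
    return t_gain * math.sqrt((n_e + n_c) / (n_e * n_c))

def d_ig_from_residual_t(t_r, rho, n_e, n_c):
    """Independent-groups d from a residualized-gain t test (Equation 30)."""
    return t_r * math.sqrt((1 - rho**2) * (n_e + n_c) / (n_e * n_c))

def d_ig_from_ancova_f(f, rho, n_e, n_c, df_w):
    """Independent-groups d from an ANCOVA F test (Equation 31).

    The sign must be supplied from the pattern of adjusted means.
    """
    return 2 * math.sqrt(f * (1 - rho**2) * (df_w - 1)
                         / ((n_e + n_c) * (df_w - 2)))
```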
Conducting the Meta-Analysis

Meta-analysis on effect sizes from alternate designs can be performed using standard procedures, as long as (a) the effect sizes are first transformed into a common metric and (b) the appropriate sampling variance formulas are used when estimating the mean and testing for homogeneity of effect size. We present methods for the fixed-effects model (Hedges & Olkin, 1985). The procedures can be readily generalized to random-effects models as well (Hedges & Vevea, 1998).

The population effect size is generally estimated from a weighted mean of the effect size estimates. The rationale for weighting is to allow more precise estimates to have greater influence on the mean. If an effect size has a small sampling variance, values are likely to fall close to the population parameter. On the other hand, if the sampling variance is large, an individual estimate can differ substantially from the population effect size simply because of sampling error. By giving greater weight to more precise estimates, the resulting mean will be more accurate (i.e., will be less biased and have a smaller mean square error; Hedges & Olkin, 1985).

Precision is largely a function of the sample size; larger samples produce more precise estimates. As a result, some have recommended weighting by sample size (Hunter & Schmidt, 1990). When all studies are from the same design, this produces an average effect size that is very close to the optimal precision-weighted estimate (Hedges & Olkin, 1985). However, when multiple designs are included, both the design and the sample size influence precision. For example, in a repeated measures design, each participant is treated as his or her own control, thereby reducing error variance due to individual differences. The smaller error variance results in a more precise estimate of the mean difference and, consequently, a more precise effect size estimate.

The mean effect size will be most accurate when the estimates from the individual studies are weighted by the reciprocal of the sampling variance (Hedges & Olkin, 1985). In addition, because the design influences the formula for the sampling variance, variance weighting accounts for both sample size and study design. The variance-weighted mean effect size is

$$\bar{d} = \frac{\sum_i w_i d_i}{\sum_i w_i}, \tag{32}$$

where the weights ($w_i$) are defined as the reciprocal of the sampling variance ($1/\hat{\sigma}^2_{e_i}$) estimated from Equation 22.
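A direct implementation of Equation 32 requires nothing more than the effect sizes and their design-specific variances. The helper below is a minimal sketch (the name is ours):

```python
def weighted_mean_effect(d, var):
    """Variance-weighted mean effect size, Equation 32.

    d   -- list of effect size estimates, all in one common metric
    var -- list of design-specific sampling variances (Table 2)
    """
    w = [1.0 / v for v in var]
    return sum(wi * di for wi, di in zip(w, d)) / sum(w)
```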
As Hedges (1982) noted, an inherent problem in meta-analysis is that it is necessary to know the population effect size in order to estimate the sampling variance, which in turn is needed to estimate the population effect size. This problem can be solved by first computing an unweighted average effect size and then using this value as the estimate of δ in the variance formula. These variance estimates are then used to compute the weighted mean effect size using Equation 32.

Differences in study designs must also be taken into account in tests for homogeneity of effect size. Tests for homogeneity are based on the comparison of the observed variance of the effect size ($\hat{\sigma}^2_d$) to the theoretical variance due to sampling error ($\hat{\sigma}^2_e$). Again, the effect of study design will be incorporated by using appropriate formulas for the sampling variance. As with the mean, variance-weighted estimates are generally used. Thus, the observed variance is

$$\hat{\sigma}^2_d = \frac{\sum_i w_i (d_i - \bar{d})^2}{\sum_i w_i}, \tag{33}$$

where $w_i = 1/\hat{\sigma}^2_{e_i}$. The variance due to sampling error is estimated from the weighted average of the individual study variances, or

$$\hat{\sigma}^2_e = \frac{\sum_{i=1}^{k} w_i \hat{\sigma}^2_{e_i}}{\sum_{i=1}^{k} w_i} = \frac{k}{\sum_{i=1}^{k} 1/\hat{\sigma}^2_{e_i}}, \tag{34}$$

where k is the number of studies in the meta-analysis.

Once the design effects have been taken into account in the estimates of $\hat{\sigma}^2_d$ and $\hat{\sigma}^2_e$, standard tests for homogeneity proceed normally. Using the Hunter and Schmidt (1990) 75% rule, the effect size would be viewed as homogeneous if

$$\frac{\hat{\sigma}^2_e}{\hat{\sigma}^2_d} > .75. \tag{35}$$

Alternatively, Hedges (1982) recommended evaluating homogeneity of effect size with a significance test, which can be written as

$$Q = \frac{k\,\hat{\sigma}^2_d}{\hat{\sigma}^2_e}. \tag{36}$$

The Q statistic is tested against a chi-square distribution with k − 1 degrees of freedom.

The Q statistic can also be used to test for categorical moderators of the effect size. A separate test statistic ($Q_j$) is computed for studies within each of the J levels of the moderator, as well as the overall statistic for all studies ($Q_T$). The difference between levels of the moderator can be tested using a between-groups statistic ($Q_B$), where

$$Q_B = Q_T - \sum_{j=1}^{J} Q_j \tag{37}$$

with df = J − 1. Methods for simultaneous analysis of multiple moderators are described in Hedges and Olkin (1985).

As an example, a small meta-analysis of the interpersonal skills training literature was performed. For purposes of illustration, separate meta-analyses were conducted using both the raw-score and the change-score metrics. This is not recommended in practice, where a single meta-analysis should be conducted using the metric that best addresses the research question.

Table 3 presents 15 effect sizes addressing the efficacy of interpersonal skills training taken from 10 studies. The first step in this meta-analysis was to determine the design used and the relevant sample size information for each computed effect size. The first column in Table 3 contains the author of the study. The second column identifies the design used in the study. Nine of the effect sizes were based on single-group pretest–posttest designs, and the remaining six used independent-groups posttest designs. The third and fourth columns contain sample size information for the groups (these numbers may be unequal for independent-groups posttest designs).

After identifying the design, the effect size estimates were computed. For the independent-groups posttest designs this computation was based on Equation 2 (if descriptive statistics were available) or Equation 27 (if a t test or F test was available). For effect sizes taken from single-group pretest–posttest designs, Equation 4 or Equation 28 was used, depending on whether descriptive statistics were available.
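The homogeneity and moderator statistics in Equations 33 through 37 can be scripted in the same way. The sketch below is ours; it mirrors the definitions above, including the identity that Q equals the weighted sum of squared deviations.

```python
def homogeneity_stats(d, var):
    """Observed variance, error variance, and Q (Equations 33, 34, and 36).

    The ratio var_err / var_obs can be compared with the Hunter and
    Schmidt 75% rule (Equation 35); Q is referred to a chi-square
    distribution with k - 1 degrees of freedom.
    """
    k = len(d)
    w = [1.0 / v for v in var]
    d_bar = sum(wi * di for wi, di in zip(w, d)) / sum(w)
    var_obs = sum(wi * (di - d_bar) ** 2 for wi, di in zip(w, d)) / sum(w)
    var_err = k / sum(w)
    q = k * var_obs / var_err   # identical to sum(w_i * (d_i - d_bar)**2)
    return var_obs, var_err, q

def q_between(q_total, q_within):
    """Between-groups moderator statistic, Equation 37 (df = J - 1)."""
    return q_total - sum(q_within)
```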
Table 3
Example of a Meta-Analysis Based on Raw-Score and Change-Score Effect Sizes

Study                                     Design  n1   n2   dIG   dRM   c     VarIG  wIG     wIG*dIG  VarRM  wRM    wRM*dRM
Smith-Jentsch, Salas, and Baker (1996)    IG      30   30   0.61  0.69  0.99  0.08   13.12   8.04     0.10   10.30  7.09
Smith-Jentsch, Salas, and Baker (1996)    IG      30   30   0.89  1.00  0.99  0.08   13.12   11.62    0.10   10.30  10.25
Moses and Ritchie (1976)                  IG      90   93   0.96  1.07  1.00  0.02   41.15   39.34    0.03   32.29  34.68
Fiedler and Mahar (1979)                  IG      9    9    1.34  1.51  0.95  0.29   3.50    4.70     0.36   2.75   4.14
Fiedler and Mahar (1979)                  IG      190  215  0.26  0.29  1.00  0.01   91.42   23.59    0.01   71.74  20.80
Engelbrecht and Fischer (1995)            IG      41   35   0.52  0.59  0.99  0.06   16.67   8.69     0.08   13.08  7.65
Leister et al. (1977)                     RM      27   27   0.75  0.84  0.97  0.05   20.45   15.36    0.06   16.09  13.58
Leister et al. (1977)                     RM      27   27   0.50  0.57  0.97  0.05   20.45   10.31    0.06   16.09  9.11
R. M. Smith, White, and Montello (1992)   RM      12   12   1.25  1.40  0.93  0.13   7.60    9.48     0.17   5.98   8.38
R. M. Smith, White, and Montello (1992)   RM      12   12   1.10  1.23  0.93  0.13   7.60    8.32     0.17   5.98   7.35
P. E. Smith (1976)                        RM      36   36   2.89  3.25  0.98  0.04   28.13   81.30    0.05   22.14  71.96
Gist et al. (1991)                        RM      44   44   0.84  0.94  0.98  0.03   34.96   29.33    0.04   27.51  25.94
Gist et al. (1991)                        RM      35   35   0.83  0.93  0.98  0.04   27.28   22.61    0.05   21.46  20.00
Campion and Campion (1987)                RM      127  127  0.38  0.43  0.99  0.01   105.74  40.50    0.01   83.19  35.77
Harris and Fleishmann (1955)              RM      39   39   0.11  0.13  0.98  0.03   30.69   3.44     0.04   24.15  3.04

Note. n1 and n2 are the sample sizes in the experimental and control group or the first and second time period, depending on the experimental design; d is the effect size estimate; c is the bias function in Equation 23; Var is the estimated sampling variance of the effect sizes; w is the value used to weight the effect size estimates. IG = independent groups; RM = repeated measures. The statistics are reported for both the raw-score and change-score effect sizes as indicated by the subscripts IG and RM, respectively.
Next, Equations 11 and 12 were used to convert effect sizes from the independent-groups posttest studies into the change-score metric, and effect sizes from the single-group pretest–posttest studies into the raw-score metric, so that the results of each study were computed in each metric. Again, this step was performed only to illustrate the procedure, and only one metric should be used in practice. A small complication arises when making these transformations. Both Equation 11 and Equation 12 require information concerning the population correlation between the repeated measures. As discussed earlier, we believe that an aggregate of the correlational data across the single-group pretest–posttest designs provides the best estimate of the population correlation. Therefore, a small meta-analysis of the correlations across the single-group pretest–posttest studies was undertaken. Only the last four studies in Table 3 contained sufficient information to estimate the correlation of responses across time (.69, .72, .55, and .58). Using Hedges and Olkin's (1985) method to meta-analyze these correlations yielded a variance-weighted average correlation of .61. Before incorporating this estimate of the correlation into our analysis, the homogeneity of the correlations was examined. The Q test yielded a nonsignificant chi-square result, χ²(3, N = 245) = 3.26, p > .05, indicating that the null hypothesis of homogeneity was not rejected.

The raw-score and change-score effect sizes for each study are presented in the fifth and sixth columns of Table 3. Notice that the change-score effect sizes are uniformly larger than the raw-score effect sizes, as will generally be the case.

Once the effect sizes were computed, the next step in the meta-analysis was to estimate the sampling variance of each effect size. In addition to the values already computed, two additional pieces of information were required to use these formulas—the bias function, c, and the population effect size. The value for the bias function, c, may be computed for both effect size metrics using Equation 23.
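The preliminary meta-analysis of the four pretest–posttest correlations described above can be reproduced with Fisher's z transformation, the standard machinery behind Hedges and Olkin's (1985) approach. The pairing of correlations with sample sizes below is our assumption, taken from the order of the table rows; the result matches the reported average of .61.

```python
import math

rs = [0.69, 0.72, 0.55, 0.58]   # correlations from the last four studies
ns = [44, 35, 127, 39]          # assumed pairing with the Table 3 samples

zs = [math.atanh(r) for r in rs]   # Fisher z transformation
ws = [n - 3 for n in ns]           # inverse-variance weights
z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
r_bar = math.tanh(z_bar)           # ~0.61
```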
There are many alternative ways to estimate the population effect size needed to compute the sampling variance. For most purposes, the simple average of the effect sizes in Table 3 serves as a reasonable estimate of this parameter. Across all 15 effect sizes, the average raw-score effect size was 0.882 and the average change-score effect size was 0.991.

The sampling variance of each effect size was estimated using the equations in Table 2 and the values just presented. As shown in Table 2, different equations were required depending on the original study design and whether the effect size was analyzed in the original metric or transformed. Equation 32 was used to estimate the variance-weighted mean effect size. The weights were defined as the reciprocal of the sampling variance for each effect size estimate. On the basis of the values in Table 3, the mean effect sizes were computed by taking the sum of the column of weighted effect sizes and dividing it by the sum of the column of weights. For the change-score effect sizes this equation yielded a value of 0.77, and for the raw-score effect sizes this value was 0.69. As expected, the effect size was slightly larger in the change-score metric than in the raw-score metric. In either case, the results indicate a moderately large improvement due to training.
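The two reported means can be verified directly from the weight columns of Table 3 (values as printed in the table; small rounding differences are expected):

```python
w_ig  = [13.12, 13.12, 41.15, 3.50, 91.42, 16.67, 20.45, 20.45,
         7.60, 7.60, 28.13, 34.96, 27.28, 105.74, 30.69]
wd_ig = [8.04, 11.62, 39.34, 4.70, 23.59, 8.69, 15.36, 10.31,
         9.48, 8.32, 81.30, 29.33, 22.61, 40.50, 3.44]
w_rm  = [10.30, 10.30, 32.29, 2.75, 71.74, 13.08, 16.09, 16.09,
         5.98, 5.98, 22.14, 27.51, 21.46, 83.19, 24.15]
wd_rm = [7.09, 10.25, 34.68, 4.14, 20.80, 7.65, 13.58, 9.11,
         8.38, 7.35, 71.96, 25.94, 20.00, 35.77, 3.04]

print(round(sum(wd_ig) / sum(w_ig), 2))   # 0.69, raw-score metric
print(round(sum(wd_rm) / sum(w_rm), 2))   # 0.77, change-score metric
```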
Next, effect sizes were tested for homogeneity. Sampling variance accounted for only a small proportion of the observed variance in effect sizes (.08 for both effect size metrics), which falls far below the Hunter and Schmidt (1990) 75% rule (Equation 35). Thus, for both metrics, there was substantial variance in the effect size estimates that could not be attributed to sampling error. Hedges's chi-square test of homogeneity resulted in similar conclusions. The chi-square homogeneity test was significant for both the raw-score metric, χ²(14, N = 1161) = 184.90, p < .01, and the change-score metric, χ²(14, N = 1161) = 183.78, p < .01. As with the 75% rule, this means that there is evidence of heterogeneity among the effect sizes, and a search for possible moderator variables is warranted.

As noted earlier, different designs may be subject to different sources of bias and therefore may not provide comparable estimates of effect size. To examine whether effect sizes could be aggregated across designs, we tested for study design as a moderator of the effect size. The moderator analysis was conducted using the fixed-effects analysis described in Hedges and Olkin (1985) for the raw-score metric. When considered separately, the effect sizes from independent-groups posttest designs were heterogeneous, χ²(5, N = 802) = 18.72, p < .05, as were the effect sizes from the single-group pretest–posttest designs, χ²(8, N = 359) = 149.96, p < .05. The variance-weighted average effect size (in the raw-score metric) was 0.78 for the single-group pretest–posttest designs and 0.54 for the independent-groups posttest designs. The test of moderation indicated that the design was a significant moderator of the effect sizes, χ²(1, N = 1161) = 16.22, p < .05 (applying Equation 37, Q_B = 184.90 − [18.72 + 149.96] = 16.22). Thus, there is evidence that even after accounting for differences in the effect sizes due to the metric, the effect sizes still differ across the designs. Substantive or bias explanations would have to be explored to identify the cause of this difference, and these effect sizes should not be combined in a single estimate of the effect of training on interpersonal skills.

Conclusion

When a meta-analysis is conducted, it is often desirable to combine results across independent-groups and repeated measures designs. When effect sizes are combined across these designs, it is critical that a number of steps be followed. Appropriate transformations must be used to ensure that all effect sizes are in a common metric. In addition, meta-analysis procedures should use design-specific sampling variance formulas to specify the precision of effect size estimates. Finally, unless the researcher can justify, based on rational analysis or empirical moderator analyses, that the alternate designs estimate the same treatment effect, the effect sizes from the two designs should not be combined. This procedure provides maximal flexibility to the meta-analyst. Researchers can choose to analyze effect sizes in either the raw-score or change-score metric, depending on which metric best reflects the pool of studies and the research question, and can readily incorporate results from studies using different designs.
should estimate the same treatment effect. Thus, although the two designs are not comparable in general, they may be comparable in specific research domains.

The aggregation of effect sizes across designs could also be justified empirically. As a first step in conducting a meta-analysis, the researcher should test for mean differences between the effect sizes from alternate designs. If systematic differences are found, results from different designs must be analyzed separately. Alternately, the magnitude of various sources of bias could be estimated as part of the meta-analysis (Becker, 1988; Li & Begg, 1994; Shadish et al., 2000). However, if no differences between designs are found, effect sizes could be combined. Naturally, the strongest case for aggregation can be made when both rational and empirical justification can be provided.

A limitation of the repeated measures effect size is that it compares only two time periods. In many research areas in which the repeated measures effect size would be of greatest interest (e.g., practice or learning effects), it is often beneficial to observe the trajectory of growth curves across multiple observations (Keppel, 1982). The use of a single pretest–posttest comparison might miss important information about the shape of growth trajectories. However, it should be noted that this problem is inherent in any meta-analysis using standardized mean differences, not just the methods proposed here. Becker (1988) suggested that this could be addressed by computing multiple comparisons within each study and then using a meta-analysis procedure that models both within- and between-studies effects.

Combining effect sizes estimated from studies using different research designs is a challenging and often time-consuming process. In this presentation, we detailed the methods required to appropriately combine effect sizes from repeated measures and independent-groups designs and highlighted the inferential hazards that may be encountered when doing so. However, it should be emphasized that the difficulties highlighted above (e.g., differences in metric and differential susceptibility to bias) are not unique to combining effect sizes across dependent- and independent-groups designs. In fact, virtually all of these issues should be considered when combining effect sizes across different independent-groups designs (cf. Morris & DeShon, 1997). We recommend that when a meta-analysis is conducted, it should be common practice to record differences in design and to examine experimental design as a moderator of the effect sizes.

References

Becker, B. J. (1988). Synthesizing standardized mean-change measures. British Journal of Mathematical and Statistical Psychology, 41, 257–278.

Burke, M. J., & Day, R. R. (1986). A cumulative study of the effectiveness of managerial training. Journal of Applied Psychology, 71, 232–245.

Campion, M. A., & Campion, J. E. (1987). Evaluation of an interviewee skills training program in a natural field experiment. Personnel Psychology, 40, 675–691.

Carlson, K. D., & Schmidt, F. L. (1999). Impact of experimental design on effect size: Findings from the research literature on training. Journal of Applied Psychology, 84, 851–862.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin.

Cortina, J. M., & DeShon, R. P. (1998). Determining relative importance of predictors with the observational design. Journal of Applied Psychology, 83, 798–804.

Cortina, J. M., & Nouri, H. (2000). Effect size for ANOVA designs. Thousand Oaks, CA: Sage.

Dilk, M. N., & Bond, G. R. (1996). Meta-analytic evaluation of skills training research for individuals with severe mental illness. Journal of Consulting and Clinical Psychology, 64, 1337–1346.

Dunlap, W. P., Cortina, J. M., Vaslow, J. B., & Burke, M. J. (1996). Meta-analysis of experiments with matched groups or repeated measures designs. Psychological Methods, 1, 170–177.

Eagly, A. H., Makhijani, M. G., & Klonsky, B. G. (1992). Gender and the evaluation of leaders: A meta-analysis. Psychological Bulletin, 111, 3–22.

Engelbrecht, A. S., & Fischer, A. H. (1995). The managerial performance implications of a developmental assessment center process. Human Relations, 48, 387–404.

Fiedler, F. E., & Mahar, L. (1979). A field experiment validating contingency model leadership training. Journal of Applied Psychology, 64, 247–254.

Fleishman, E. A., & Hempel, W. E., Jr. (1955). The relation between abilities and improvement with practice in a visual discrimination reaction task. Journal of Experimental Psychology, 49, 301–312.
Gibbons, R. D., Hedeker, D. R., & Davis, J. M. (1993). Estimation of effect size from a series of experiments involving paired comparisons. Journal of Educational Statistics, 18, 271–279.

Gist, M. E., Stevens, C. K., & Bavetta, A. G. (1991). Effects of self-efficacy and post-training intervention on the acquisition and maintenance of complex interpersonal skills. Personnel Psychology, 44, 837–861.

Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.

Guzzo, R. A., Jette, R. D., & Katzell, R. A. (1985). The effects of psychologically-based intervention programs on worker productivity: A meta-analysis. Personnel Psychology, 38, 275–291.

Harris, E. F., & Fleishmann, E. A. (1955). Human relations training and the stability of leadership patterns. Journal of Applied Psychology, 39, 20–25.

Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.

Hedges, L. V. (1982). Estimation of effect size from a series of independent experiments. Psychological Bulletin, 92, 490–499.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego, CA: Academic Press.

Hedges, L. V., & Vevea, J. L. (1998). Fixed- and random-effects models in meta-analysis. Psychological Methods, 3, 486–504.

Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.

Johnson, B. T., & Eagly, A. H. (2000). Quantitative synthesis of social psychological research. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (pp. 496–528). Cambridge, England: Cambridge University Press.

Kelsey, I. B. (1961). Effects of mental practice and physical practice upon muscular endurance. Research Quarterly, 32, 47–54.

Keppel, G. (1982). Design and analysis: A researcher's handbook (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.

Leister, A., Borden, D., & Fiedler, F. E. (1977). Validation of contingency model leadership training: Leader match. Academy of Management Journal, 20, 464–470.

Li, Z., & Begg, C. B. (1994). Random effects models for combining results from controlled and uncontrolled studies in a meta-analysis. Journal of the American Statistical Association, 89, 1523–1527.

Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.

Maris, E. (1998). Covariance adjustment versus gain scores—Revisited. Psychological Methods, 3, 309–327.

Morris, S. B. (2000). Distribution of the standardized mean change effect size for meta-analysis on repeated measures. British Journal of Mathematical and Statistical Psychology, 53, 17–29.

Morris, S. B., & DeShon, R. P. (1997). Correcting effect sizes computed from factorial ANOVA for use in meta-analysis. Psychological Methods, 2, 192–199.

Moses, J. L., & Ritchie, R. J. (1976). Supervisory relationships training: A behavioral evaluation of a behavioral modeling program. Personnel Psychology, 29, 337–343.

Murtagh, D. R. R., & Greenwood, K. M. (1995). Identifying effective psychological treatments for insomnia: A meta-analysis. Journal of Consulting and Clinical Psychology, 63, 79–89.

Neuman, G. A., Edwards, J. E., & Raju, N. S. (1989). Organizational development interventions: A meta-analysis of their effects on satisfaction and other attitudes. Personnel Psychology, 42, 461–489.

Rambo, W. W., Chomiak, A. M., & Price, J. M. (1983). Consistency of performance under stable conditions of work. Journal of Applied Psychology, 68, 78–87.

Ray, J. W., & Shadish, W. R. (1996). How interchangeable are different estimators of effect size? Journal of Consulting and Clinical Psychology, 64, 1316–1325.

Rosenthal, R. (1991). Meta-analytic procedures for social research (Rev. ed.). Newbury Park, CA: Sage.

Shadish, W. R., Navarro, A. M., Matt, G. E., & Phillips, G. (2000). The effects of psychological therapies under clinically representative conditions: A meta-analysis. Psychological Bulletin, 126, 512–529.

Smith, P. E. (1976). Management modeling training to improve morale and customer satisfaction. Personnel Psychology, 29, 351–359.

Smith, R. M., White, P. E., & Montello, P. A. (1992). Investigation of interpersonal management training for administrators. Journal of Educational Research, 85, 242–245.

Smith-Jentsch, K. A., Salas, E., & Baker, D. P. (1996). Training team performance-related assertiveness. Personnel Psychology, 49, 909–936.

Symons, C. S., & Johnson, B. T. (1997). The self-reference effect in memory: A meta-analysis. Psychological Bulletin, 121, 371–394.

Taylor, M. J., & White, K. R. (1992). An evaluation of alternative methods for computing standardized mean difference effect size. Journal of Experimental Education, 61, 63–72.

Winer, B. J. (1971). Statistical principles in experimental design. New York: McGraw-Hill.
Appendix A

Derivation of the Variance of the Independent-Groups and Repeated Measures Effect Size

Hedges (1981) and Gibbons et al. (1993) have derived the variance of the independent-groups and repeated measures effect size, respectively. In both cases, normality and homogeneity of variance across populations are assumed. Both derivations resulted in the same equation for the sampling variance, except for differences in the effect size parameter ($\delta_{IG}$ or $\delta_{RM}$, both referred to as $\delta_*$ here), n, and df,

$$\sigma^2_{e_i} = \left(\frac{1}{\tilde{n}}\right)\left(\frac{df}{df - 2}\right)\left(1 + \tilde{n}\,\delta_*^2\right) - \frac{\delta_*^2}{c(df)^2}. \tag{A1}$$

For the independent-groups effect size, df = nE + nC − 2 and

$$\tilde{n} = \frac{n_E\,n_C}{n_E + n_C}. \tag{A2}$$

For the repeated measures effect size, df = n − 1, and ñ = n.

If the effect size is transformed by a multiplicative constant ($d_T = A\,d_*$), the variance will be $A^2$ times the variance of the original effect size. The variance will be a function of the population effect size in the original metric ($\delta_*$). However, for consistency, the equation should be presented in terms of the chosen metric. This can be accomplished by replacing $\delta_*$ with $\delta_T/A$. Thus, the variance of the transformed effect size is as follows:

$$\sigma^2_e = A^2\left[\left(\frac{1}{\tilde{n}}\right)\left(\frac{df}{df - 2}\right)\left(1 + \tilde{n}\,\frac{\delta_T^2}{A^2}\right) - \frac{\delta_T^2/A^2}{c(df)^2}\right] = \left(\frac{A^2}{\tilde{n}}\right)\left(\frac{df}{df - 2}\right)\left(1 + \frac{\tilde{n}}{A^2}\,\delta_T^2\right) - \frac{\delta_T^2}{c(df)^2}. \tag{A3}$$
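The transformation argument is easy to spot-check numerically. The snippet below (ours, with arbitrary illustrative values) confirms that $A^2$ times Equation A1 evaluated at $\delta_T/A$ equals the right-hand side of Equation A3:

```python
def var_a1(delta, n_tilde, df, c):
    """Equation A1 as a function of the effect size parameter."""
    return ((1 / n_tilde) * (df / (df - 2)) * (1 + n_tilde * delta**2)
            - delta**2 / c**2)

A, delta_t, n_tilde, df = 0.9, 0.5, 20, 19
c = 1 - 3 / (4 * df - 1)
lhs = A**2 * var_a1(delta_t / A, n_tilde, df, c)
rhs = ((A**2 / n_tilde) * (df / (df - 2)) * (1 + (n_tilde / A**2) * delta_t**2)
       - delta_t**2 / c**2)
assert abs(lhs - rhs) < 1e-12   # the two forms of Equation A3 agree
```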
Appendix B

Estimation of Pretest–Posttest Correlation From Standard Deviations

The variance of the difference scores ($SD_D^2$) can be written as a function of the pre- and posttest variances ($SD_{\mathrm{pre}}^2$ and $SD_{\mathrm{post}}^2$) and the pretest–posttest correlation (r):

$$SD_D^2 = SD_{\mathrm{pre}}^2 + SD_{\mathrm{post}}^2 - 2\,r\,SD_{\mathrm{pre}}\,SD_{\mathrm{post}}. \tag{B1}$$

Solving for r, we obtain

$$r = \frac{SD_{\mathrm{pre}}^2 + SD_{\mathrm{post}}^2 - SD_D^2}{2\,SD_{\mathrm{pre}}\,SD_{\mathrm{post}}}. \tag{B2}$$

If it is assumed that the pre- and posttest variances are both equal to the pooled variance ($SD_P^2$), Equation B2 reduces to

$$r = 1 - \frac{SD_D^2}{2\,SD_P^2}. \tag{B3}$$

Received June 13, 2000
Revision received April 15, 2001
Accepted August 21, 2001