
Meta-analyses in mental health research
A practical guide

Pim Cuijpers, Ph.D
Colophon

Copyright 2016 Pim Cuijpers

Author: Pim Cuijpers, Ph.D
A hardcopy of this book can be ordered at: meta-analysis@vu.nl
Design & layout: Tjerk Zweers
ISBN 978-90-825305-0-6
Meta-analyses in mental health research
A practical guide

Pim Cuijpers, Ph.D
Contents

0. Introduction. What are meta-analyses and why are they important?... 3


Traditional, systematic and meta-analytic reviews......................................................3
Different types of meta-analyses........................................................................................5
Advantages and problems of meta-analyses..................................................................7
Key points...................................................................................................................................10

Step 1. Defining research questions for meta-analyses: PICO.................. 15


Introduction..............................................................................................................................15
Inclusion and exclusion criteria for meta-analyses...................................................16
Control conditions in mental health research.............................................................19
Which design to use in a meta-analysis?........................................................................22
Key points...................................................................................................................................24

Step 2. Searching bibliographical databases..................................................... 29


Selecting bibliographical databases................................................................................30
Other ways to identify studies for inclusion................................................................34
Searching in bibliographical databases..........................................................................35
Boolean operators, truncation, wildcards, proximity operators, and search
filters............................................................................................................................................39
A simplified example..............................................................................................................41
Key points...................................................................................................................................43

Step 3. Selection of studies, retrieval of data and risk of bias........... 47


Selection of studies................................................................................................................47
Working with two independent assessors....................................................................51
Selection of full-text studies for inclusion in your meta-analysis........................51
Data extraction: Characteristics of the included studies........................................52
Assessing methodological quality and bias..................................................................54
Sources of bias..........................................................................................................................54
Selection bias............................................................................................................................55
Detection bias..........................................................................................................................56
Attrition bias.............................................................................................................................57
Reporting bias..........................................................................................................................58
Other potential threats to validity...................................................................................59
Researcher allegiance...........................................................................................................59
Assessing risk of bias: the Cochrane Risk of bias assessment tool .....................60
Key points...................................................................................................................................63

Step 4. Calculating and pooling effect sizes....................................................... 67
Effect sizes based on continuous outcomes ................................................................68
How to find the data needed for calculating effect sizes?......................................70
Interpreting effect sizes.......................................................................................................72
More outcomes in one study..............................................................................................75
Effect sizes based on dichotomous outcomes ............................................................77
Pooling of effect sizes............................................................................................................82
When can effect sizes be pooled?.....................................................................................83
The random and the fixed effects model.......................................................................86
The forest plot: An excellent summary of a meta-analysis.....................................87
Sensitivity analyses................................................................................................................90
Key points...................................................................................................................................91

Step 5. Examining heterogeneity and potential publication bias.............. 95


Visual inspection of the forest plot..................................................................................95
Test for homogeneity (Q) (is it significant?)..................................................................98
Quantifying heterogeneity.................................................................................................98
Examining the causes of heterogeneity.........................................................................99
Subgroup analyses to examine sources of heterogeneity................................... 100
Choosing subgroups for subgroup analyses.............................................................. 102
Metaregression analyses.................................................................................................. 104
Publication bias and other reporting biases.............................................................. 106
Testing for publication bias with indirect methods:
The funnel plot...................................................................................................................... 108
Key points................................................................................................................................ 113

Step 6. Writing and publishing meta-analyses...............................................117


Publishing the protocol of your meta-analysis......................................................... 117
The PRISMA Statement..................................................................................................... 118
The structure of a paper on a meta-analysis............................................................. 119
Key points................................................................................................................................ 124

References....................................................................................................................127

Tables and figures

Figure 1. Randomized trials in PubMed from 1965-2013...................................4


Table 2.1 Websites for searching bibliographical databases............................32
Table 2.2 MeSH tree on mental disorders from PubMed ..................................36
Figure 3.1 PRISMA flowchart..........................................................................................48
Table 3.2. Save results from your search in Pubmed in a file and
import in Endnote ..........................................................................................50
Figure 3.3 Graphical representation of risk of bias in studies included in a
meta-analysis ...................................................................................................61
Table 4.1 Conversion of effect sizes to numbers-needed-to-treat
(NNT) ..................................................................................................................74
Table 4.2 Possible dichotomous outcomes in a randomized
controlled trial.................................................................................................79
Table 4.3 Examples of data to calculate effect sizes for dichotomous
outcomes............................................................................................................81
Table 4.4 The number of studies needed to find effect sizes, for low,
medium and high between study variance and power
of 0.80 and 0.90...............................................................................................85
Figure 5.1 Forest plot for psychotherapy versus placebo for adult
depression.........................................................................................................96
Figure 5.2 Forest plot for psychotherapy for depression in older adults
versus control groups....................................................................................97
Figure 5.3 Metaregression analysis of number of sessions as predictor
of the effect size in studies examining the effects of
psychotherapy for adult depression..................................................... 105
Figure 5.4 Funnel plot of standard error by Hedges g in studies
comparing CBT with control groups ................................................... 109
Table 6.1 Overview of the structure of a paper on a meta-analysis............ 120

0. Introduction
What are meta-analyses and why are
they important?

The number of biomedical studies that are published in scientific journals is increasing rapidly. Currently between one and two million biomedical articles are published each year (Björk, Roos, & Lauri, 2008; Harnad et al., 2008; Houghton & Vickery, 2005), and according to the largest biomedical database, PubMed, the number of randomized trials examining the effects of interventions in the biomedical field increased from 39 in 1965 to 16,054 in 2013, with a consistent increase over the years (see Figure 1).
If all trials found the same effect for an intervention, this increasing number of trials would not be a problem. Each new trial would simply confirm what earlier trials found, and the precision with which the effect size is estimated would increase with each new trial. Unfortunately, in mental health and the social sciences randomized trials typically do not result in the same outcomes, and the effect sizes found for one intervention can differ quite a lot.

Traditional, systematic and meta-analytic reviews


With an increasing number of randomized trials examining one inter-
vention it becomes more and more difficult to keep an overview of these
studies. Reviews are aimed at helping readers of scientific studies to keep
an overview of a research field.
Traditional reviews are written by experts in a field, with an emphasis on the authority of the author. It is usually unclear how the studies described in the review were selected, and it may very well be that only studies that support the author's opinion are cited. It is also unclear in traditional reviews how the available evidence is summarized, whether important outcomes are discussed for some studies but not for others, and whether the assessment of included studies was done in a systematic way. The value of the conclusions of such reviews can therefore not be verified.
Systematic reviews, on the other hand, follow general rules for scientific research. They have a clear objective and a carefully formulated
research question. That question is examined with predefined eligibility
criteria for studies and an explicit, reproducible methodology. These re-
views search in a systematic way for all relevant studies, they assess the
validity of the included studies in a systematic way, and they also present
a synthesis of the outcomes systematically.
Meta-analyses are a specific type of systematic review. They meet all
criteria for a systematic review, but also statistically integrate the find-
ings of included studies into one estimate of the size of the effect and its
significance.
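The procedures for this are described in Step 4 of this book; the sketch below is only a minimal illustration (in Python, with invented effect sizes and standard errors) of what statistically integrating findings means: each study's effect size is weighted by the inverse of its variance, and the weighted average is the pooled estimate.

    # Minimal illustration (hypothetical numbers): fixed-effect,
    # inverse-variance pooling of study effect sizes into one estimate.
    import math

    effect_sizes = [0.45, 0.62, 0.30]   # hypothetical Hedges' g per study
    std_errors = [0.15, 0.20, 0.10]     # hypothetical standard errors

    weights = [1 / se ** 2 for se in std_errors]   # inverse-variance weights
    pooled = sum(w * g for w, g in zip(weights, effect_sizes)) / sum(weights)
    se_pooled = math.sqrt(1 / sum(weights))

    print(f"pooled g = {pooled:.2f} "
          f"(95% CI {pooled - 1.96 * se_pooled:.2f} "
          f"to {pooled + 1.96 * se_pooled:.2f})")

Note how the most precise study (the one with the smallest standard error) gets the largest weight, and how the pooled confidence interval is narrower than that of any single study.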

Figure 1. Randomized trials in PubMed from 1965-2013

[Bar chart: annual number of randomized trials indexed in PubMed, 1965-2013]


In this book we will focus on meta-analyses and how they can be conducted. The book can also be used for systematic reviews: simply follow the steps described here, but skip the statistical procedures for integrating the results of the included studies.

Different types of meta-analyses


There are different types of meta-analyses. In this book we focus on traditional or pairwise meta-analyses. In these meta-analyses one comparison is examined, for example between a treatment and a control group or another treatment, and all studies examining this comparison are included. The meta-analysis then estimates the pooled difference between the treatment and the comparison condition.
There are, however, also other types of meta-analyses. In network
meta-analyses different comparisons can be included at the same time.
Suppose for example that several different treatments for one disorder
have been tested in randomized trials. With traditional meta-analyses we would have to conduct a separate meta-analysis for each treatment compared with a control group, and several more meta-analyses would be needed to examine the trials in which patients were randomized to two or more of the treatments (and not to a control group). The advantage of a network meta-analysis is that all of these comparisons can be included in one and the same analysis. Network meta-analyses are also called multiple treatment comparison or mixed treatment comparison meta-analyses.
In individual patient data meta-analyses the primary data of the studies that are identified through a systematic review are collected and integrated into one big dataset. These individual patient data from multiple studies make it possible to obtain a better estimate of the pooled effect size, because the same analytic strategies can be applied to all included studies. For example, data from patients who were randomized to one of the conditions but dropped out before the end of the study can be estimated across all included studies with the same method.

The Cochrane Collaboration


Cochrane (previously called the Cochrane Collaboration) has
played a key role in the development and influence of meta-anal-
yses in the biomedical sciences. It was founded in 1993 under the
leadership of Iain Chalmers. It was named after Archie Cochrane,
who made several calls for up-to-date, systematic reviews of all
relevant randomized controlled trials in health care (Cochrane,
1972). Cochrane is a global independent network of researchers,
professionals, patients, carers, and people interested in health,
with its headquarters in London, and local centers and branches
in a few dozen cities across the world. Cochrane currently
has 37,000 contributors from more than 130 countries, and they
work together to produce credible, accessible health informa-
tion that is free from commercial sponsorship and other conflicts
of interest. Cochrane reviews meet high standards for system-
atic reviews and meta-analyses, and the Cochrane Handbook
(available at: http://handbook.cochrane.org) (Higgins, J.P.T. &
Green, S., 2011) is an invaluable resource for all technical aspects
of meta-analyses and systematic reviews.


The analyses of the outcomes can also be conducted in one uniform way across all studies (Riley, Lambert, & Abo-Zaid, 2010). Because all datasets are integrated, the statistical power to test predictors of outcome is much higher than in individual randomized trials, which are usually designed with just enough power to detect an overall effect of the treatment. For examining predictors of outcome, at least four times as many participants are needed as for finding an effect of the treatment (Brookes et al., 2004). On average, 64% of the trials in any field contribute their primary data to individual patient data meta-analyses (Riley, Simmonds, & Look, 2007).
In this book we will not describe the methods of network or individual
patient data meta-analyses, but we will focus on traditional meta-analy-
ses.

Advantages and problems of meta-analyses


Integrating the results of multiple trials in a meta-analysis has several
advantages. Because the individual studies are combined the statistical
power to detect (the absence of) effects is higher than for the individual
studies. That makes it possible to make a more precise and accurate esti-
mation of the true effect of an intervention. Because the studies that are
included in a meta-analysis are examined systematically, it is also possible
to explore inconsistencies between studies and to examine whether the
effects of the intervention differ among specific subgroups of studies.
Furthermore, meta-analyses make it possible to make an estimate of the
number of studies that were conducted but not published (publication
bias; see chapter 5).
It is not surprising therefore that meta-analyses are used by all major
stakeholders. Doctors and therapists use meta-analyses when decision
tools and treatment guidelines are developed, because meta-analyses
provide the best evidence for the effects of a treatment. Policy makers
make use of meta-analyses to help them decide which treatments are included in health care programs and how these treatments are paid for.


Patients can make use of the results of meta-analyses when they have to
make decisions about treatments. And finally, researchers can use the results of meta-analyses in several ways: to generate new research questions that are not well examined in the existing trials, to identify methodological limitations of existing trials, or to estimate sample sizes for future trials.
There are, however, also several problems with meta-analyses. A meta-analysis can never be better than the studies it summarizes. If none of the included studies has been conducted according to established methods for trials and the risk of bias (chapter 3) is high, then a meta-analysis integrating the results of these studies cannot solve the problem of the risk of bias in these studies. Sources of bias in trials cannot be controlled by the method of meta-analysis. This feature of meta-analyses is often referred to as garbage in, garbage out.
Another problem of meta-analyses, especially in mental health care, is that there are always differences between studies, for example in terms of the exact inclusion criteria, setting, recruitment methods, or treatments and the delivery of treatments. Trials examining an intervention are hardly ever exact replications of each other. Some critics say that such studies cannot be compared because of these differences: in their view, meta-analyses combine apples and oranges.
The file drawer problem refers to the fact that not all relevant studies are actually published, and they are therefore often not included in meta-analyses. When these unpublished studies have negative outcomes that do not support the effectiveness of an intervention while published studies do, meta-analyses may seriously overestimate the true effects of the intervention. We will come back to this problem in chapter 5.
A final problem of meta-analyses is the agenda-driven bias of researchers who conduct the meta-analyses. Many meta-analyses are written by researchers who are biased towards the intervention they examine in the meta-analysis. This kind of bias is often called researcher allegiance.

The start of modern meta-analyses


The term meta-analysis was coined by Gene V. Glass, who explained its use in his famous presidential address to the American Educational Research Association in 1976 in San Francisco. In this address he introduced the idea of standardized mean differences or effect sizes (the difference between two groups in standard deviations) and how they can be pooled across studies. Since this first introduction, the technique of meta-analysis has spread across all areas of science.

A few years later, in 1980, Gene Glass was involved in the most extensive illustration of this new method, published on the outcomes of psychotherapies. This book, with the title The Benefits of Psychotherapy, included the effect sizes from 375 outcome studies of psychotherapies for neurotic disorders with more than 4,000 patients in the treatment and control conditions. The book was a response to the famous article by Eysenck in 1952 (Eysenck, 1952), in which he claimed that psychotherapy is not effective: many patients receiving treatment do indeed get better, but they also get better without treatment. Smith and colleagues found an effect size of 0.68 for the psychotherapies, which by today's standards is a moderate to large effect. Modern meta-analyses, however, have shown that this effect size was probably considerably overestimated, because of the high risk of bias in many studies and publication bias. But at the time this book refuted the article by Eysenck with a new and innovative method, and Eysenck seemed to have been wrong about the effects of psychotherapy.


We will see in chapter 3 that researcher allegiance may affect the outcomes of individual randomized controlled trials, but this is probably also true for systematic reviews, where researchers with an allegiance towards a particular intervention may be inclined to interpret the outcomes of included trials more positively than independent researchers would.

Key points
- Because of the exponential growth of research there is a need to integrate the results of multiple studies
- Traditional reviews are not systematic and transparent; they cannot solve the need for integration of research
- Systematic reviews have a reproducible methodology
- In a meta-analysis the results of individual studies are statistically integrated into one (more precise) outcome
- Systematic reviews have many advantages for professionals, patients, policy makers and researchers
- But there are also risks: bias, garbage in and out, and combining apples and oranges

Step 1
Defining research questions for meta-
analyses: PICO

Introduction
All good scientific research starts with a well-formulated research question that is important and relevant. That is no different for meta-analyses. Before doing a meta-analysis it is important to think about the goal of the meta-analysis, why it is important, and how it compares with earlier meta-analyses. In step
how this new meta-analysis compares with earlier meta-analyses. In step
6 (on Reporting and publishing meta-analyses) we will see that a pub-
lished meta-analysis begins with an Introduction section in which these
issues are described, including the background of the problem, the impor-
tance of the problem, earlier (meta-analytic) research and why this new
meta-analysis is needed.
In meta-analyses of randomized trials examining an intervention, an
Introduction section should always end with the research question, in
which the Participants, the Interventions, the Comparisons and the Out-
comes are specified. Together these four parts of the research question
are summarized as PICO. A good research question of a meta-analysis
always contains the four elements of the PICO acronym.
A research question for a meta-analysis could be for example: What is
the efficacy of cognitive behavior therapy (CBT) on sleep diary outcomes, com-
pared with control, for the treatment of adults with chronic insomnia (Trau-
er, Qian, Doyle, Rajaratnam, & Cunnington, 2015). All four elements of
the PICO are in there: P: adults with chronic insomnia; I: CBT; C: control
groups; O: sleep diary outcomes.
Or another example: The effects of eye movement desensitization and
reprocessing (EMDR) versus cognitive-behavioral therapy (CBT) for adult posttraumatic stress disorder (Chen, Zhang, Hu, & Liang, 2015). It also has all
elements of the PICO acronym: P: adults with posttraumatic stress; I:
EMDR; and C: CBT. The outcomes are not specified in this research ques-
tion, but most researchers in this field know that these treatments are fo-
cused on symptoms of posttraumatic stress, so in this case it is not needed
to state that explicitly in the research question.
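Before moving on, it can be useful to write the four elements down in a fixed format. Below is a minimal sketch (our own illustration in Python, not a tool described in this book) using the insomnia example above; recording a PICO as structured data doubles as a checklist that all four elements have been specified.

    # Minimal sketch: a PICO recorded as structured data.
    from dataclasses import dataclass

    @dataclass
    class PICO:
        participants: str
        intervention: str
        comparison: str
        outcome: str

    insomnia = PICO(
        participants="adults with chronic insomnia",
        intervention="cognitive behavior therapy (CBT)",
        comparison="control conditions",
        outcome="sleep diary outcomes",
    )
    print(insomnia)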
A research question that is well formulated according to the PICO acronym is not only necessary in itself, but it also helps with defining the inclusion and exclusion criteria for a meta-analysis.

Inclusion and exclusion criteria for meta-analyses


It is very important for meta-analyses to explicitly define in- and exclusion criteria. Simple research questions according to the PICO acronym are never specific enough for deciding which studies should be included and which should be excluded. When working on a meta-analysis, dozens of very specific and detailed questions will come up that are relevant for including or excluding a study. It is important, therefore, to think through these in- and exclusion criteria in as much detail as possible before starting the meta-analysis.
For participants it is important to think about all their characteristics that may be relevant for the in- and exclusion criteria. For example, let's say that we want to do a meta-analysis of treatments of generalized anxiety disorders (Cuijpers, Sijbrandij, et al., 2014). In that case we have to specify the inclusion criteria much further.
So for this meta-analysis, do we want to include only people who meet
diagnostic criteria for a generalized anxiety disorder according to a diag-
nostic interview, or do we also include people who score high on worry-
ing according to a self-report instrument measuring generalized anxiety?
Do we include only children, only adolescents, adults, or older adults? Or do
we focus on any age group? And do we want to include only studies among


people with generalized anxiety who are in treatment in mental health


care, or do we also include studies in which participants are recruited
from the general population?
Then there may also be exclusion criteria, for example when a study focused on patients with generalized anxiety disorder and a comorbid other mental disorder, like depression or a personality disorder. If there is a possibility that these studies have different effects than other studies, one option is to exclude them from the meta-analysis (or to include them and examine later in subgroup analyses whether they do indeed have different effects). Or it may be relevant to exclude studies in inpatients, because this population and the setting they are in are very different from outpatients.
It is not possible to give a definitive overview of factors that have to be considered for the in- and exclusion of studies, but it is always important to consider the definition of the disorder or condition, the setting of the studies, and the demographic characteristics of the population.
In the same way it is important to specify which interventions should
be examined in the studies that are included. So in our example on gener-
alized anxiety disorder, do we only include studies on cognitive behavior
therapy? Or do we focus on psychotherapies in general, including for ex-
ample psychodynamic therapy, interpersonal psychotherapy, or non-di-
rective counseling? And suppose we chose CBT: would that include therapies that mainly focus on the behavioral technique called exposure, or should they explicitly focus on changing cognitions (cognitive restructuring)? And what about the third wave therapies, like acceptance and commitment therapy: would they be considered CBT or not? And if we did focus only on CBT, would we also include a study in which CBT is combined with another treatment module, like hypnosis?
In the same way we could look at treatment format: Would we include
only individual treatments or also group treatments? And what about
guided self-help therapies and Internet-based therapies? Is there a minimum or maximum to the number of treatment sessions or the length of


the treatment? Should the treatment be delivered by a specific therapist
(for example a therapist who has been trained in delivering this therapy)?
As with the participants, it is not possible to give a definitive overview of factors related to the interventions that have to be considered for in- and exclusion. But it is important to look at least at the content of the intervention, the format, the length, and the person delivering the intervention.
Then the comparators have to be specified. Meta-analyses of ran-
domized trials always focus on a specific comparison, an intervention
versus a control group or versus another intervention. So after specify-
ing the intervention, it is also necessary to specify the comparator, which
means the condition with which the intervention is compared. That can
be no treatment, usual care, placebo or a waiting list control condition. Or
it could be another treatment, or a placebo treatment. The same factors
that were described for the interventions can also be applied for the com-
parators, especially if this is an alternative treatment.
But control group comparisons in mental health research and espe-
cially in psychological interventions are complicated and each type of
control group has its own strengths and limitations. In the next paragraph
we will briefly describe the most used types of control conditions in men-
tal health research. For now it is important to realize that in a PICO the control or comparison condition should also be specified in as much detail as possible.
Outcomes also need to be specified in a meta-analysis. Interventions
in mental health can have all kinds of outcomes, directly related to men-
tal health, like quality of life, depression, anxiety, or suicidality, or more
indirectly, like social skills, resilience or social functioning. Outcomes can
be continuous (on a scale ranging from zero or low to high), like level of
depressive or anxiety symptoms, or they can be dichotomous (yes or no),
like response, remission, clinical significant change, or death. They can be
based on self-report scales by patients, on clinician-rated instruments, on


observational scales, or on reports by significant others. And outcomes can be measured directly after the intervention or at follow-up assessments some time after the intervention.
In most biomedical research it is customary to choose one primary outcome, and from a methodological point of view that is the best option. However, in many trials on psychological interventions, especially the older ones, no single primary outcome is defined; instead, several different measures, often measuring the same construct, are included. In me-
ta-analyses it is also important to define the main outcome that will be
examined. For example in our example of the meta-analysis of psycholog-
ical interventions for generalized anxiety disorder this could be anxiety.
This would mean that all outcome instruments measuring anxiety will be
included in the main analyses. In Step 4 we will see how this can be han-
dled statistically.

Control conditions in mental health research


As indicated above, the Comparator in the PICO acronym indicates
the comparison group that is compared with the intervention group. In
research on psychological interventions, choosing the right control group
in randomized trials as well as in meta-analyses can be complicated. It is
beyond the scope of this book to go into too much detail, but a general discussion is important, because the comparator in a PICO is an important component of the research question of a meta-analysis.
In trials of pharmacological treatments usually pill placebos are used
as control conditions. Pill placebos are typically sugar pills with no active
substance. In these trials patients, clinicians, outcome assessors and researchers are blinded to which patients get the medication and which get the placebo. Therefore placebo-controlled trials are considered to be the gold standard for testing the effectiveness of drugs. However, this design also has disadvantages. For example, the assumption that patients are indeed blinded to whether they receive the pill placebo or the active medication is often not met: patients often know which of the two


they receive, because the placebo does not have the side effects that many active medications have (Moncrieff, Wessely, & Hardy, 2004).
For psychological interventions, it is still more complicated to choose
the right control group (Mohr et al., 2009). That is true for randomized
trials but also for meta-analyses in which the results of these individual
trials are integrated. Waiting list control groups are often used to examine the effects of psychological interventions, because they motivate people to participate in the control group and the trial: the participants will at least get some kind of intervention eventually. However, when people are on a waiting list they probably do nothing to solve their problems, because they are waiting for the therapy (Mohr et al., 2009, 2014). The waiting list thus discourages them from using their normal ways of coping with problems. If they had been assigned to a care-as-usual control group, part of them would possibly have taken other actions to solve their problems. And because these participants are willing to be randomized to a waiting list control group, they probably have high expectations of the effects of the therapy, and it is known that high expectations result in better outcomes of interventions. So, waiting list control groups may considerably overestimate the effects of an intervention, for example when compared to usual care.
Another type of control group that is often used in randomized trials of psychological interventions is care-as-usual. People participating in such a trial have the chance to get either the intervention or nothing. Nothing here means that they can use the care they would usually get if they did not participate in the trial. A big problem is that usual care depends very much on the setting and country where the trial is conducted. For example, many European countries have a national health insurance system in which all patients have access to health care; whether psychological treatments are part of this differs per country. And in countries without such a national system, patients have to pay for their own insurance and health care, which means that access depends very much on income. All this implies that care-as-usual can vary considerably between


setting and country and that it may be problematic to pool the data from
trials using care-as-usual conditions.
Pill placebos are also sometimes used as control groups in trials of psychological interventions. In these trials there is usually also a condition in which participants receive a medication, because a placebo condition alone in such a trial does not make much sense. The advantage is that this design makes it possible to examine the effects of psychological interventions against the same comparator that is used for medications aimed at the same condition.
Another type of control group that is often used is the placebo therapy. In such a control condition a very basic intervention is usually given, often based on client-centered therapy, in which the therapist is empathic, friendly and supportive, but does not use specific intervention techniques. The idea is that this control condition provides the common elements that all therapies share, and that it is compared with a real intervention in which specific techniques are used. This should make it possible to examine what the specific techniques contribute to the effects of the intervention, above the basic support given in all interventions. However, this approach is problematic for several reasons. First, non-directive supportive counseling (the control condition) can have considerable effects in itself. For example, we found that in depression, counseling has effects that are comparable to those of cognitive behavior therapy (Cuijpers et al., 2012). We also found that the studies that use these control groups are often heavily influenced by researcher allegiance, meaning that the researcher is a proponent of a specific intervention: in studies without researcher allegiance the effects of counseling were comparable to those of other therapies. The other problem is that when such a control condition is not delivered as a serious intervention (because it is meant as a control group), participants will probably notice that it is not a serious intervention, and the effects will be influenced by this. So, when a psychological placebo is not convincing enough, the superior effect of the active intervention may be caused by this. And it is often very


difficult to differentiate between serious control interventions and control interventions that are not serious enough.
There are many other types of control groups in psychological intervention research, but it is beyond the scope of this book to discuss them all in depth.

Which design to use in a meta-analysis?


The PICO acronym is especially relevant for randomized controlled trials. In this book we focus on meta-analyses of randomized trials of interventions in mental health care, so the PICO will be the most important lead for formulating research questions and translating them into search strings for bibliographical databases (see the next step). However, the results of studies with all kinds of other designs can also be integrated in meta-analyses.
For example, there are other designs that are aimed at examining the effects of interventions. Important examples are non-randomized controlled studies and open studies. In non-randomized trials participants are not randomized to an intervention or comparison condition, meaning that the assignment was not based on chance but on a systematic procedure. For example, patients in mental health care receiving a treatment are compared with people with a comparable profile taken from an epidemiological survey. Or all participants who show interest are placed in an intervention group, and when the group is full the rest are placed on a waiting list. Such non-randomized controlled studies can be included in meta-analyses, but it must be clear that the results of these studies are much more uncertain than those of randomized trials (because of the risk of bias, described in step 3).
In the same way open trials can be examined. In open trials there is no control group at all, and only the scores at baseline and after the intervention are available. For such studies effect sizes can be calculated indicating the improvement of participants within these interventions. However, because there is no control group, it cannot be said whether this improvement is caused by the intervention or by the natural course of the


problem at which the intervention is aimed. So, the results of these stud-
ies (including meta-analyses) should be interpreted with caution.
Meta-analyses can also integrate the results of studies that are not aimed at examining the effects of interventions. In principle, meta-analyses can integrate all outcomes of studies that have a standard error (see Step 4). For example, we recently examined whether people suffering from depression have a higher risk of dying in the following period than people without depression (Cuijpers, Vogelzangs, et al., 2014). We examined that in almost 300 prospective cohort studies in which some people had depression at baseline and others did not. All of these studies had assessed how many of these people had died at follow-up. It was indeed found that people with depression had a larger chance of dying than people without depression.
In the same way it is possible to pool the results of studies that have
examined the correlation between two variables, studies that have ex-
amined psychometric properties of psychological measurement instru-
ments, studies that examine differences between populations, and all
kinds of other studies.


Key points
- Any meta-analysis begins with a good research question, just like any other study
- Good research questions for meta-analyses of randomized trials follow the PICO acronym and they define and specify:
  - Participants
  - Intervention
  - Comparison
  - and Outcome
- Comparison conditions for psychological interventions (such as waiting lists, care-as-usual, psychological placebos) may be problematic
- Apart from randomized trials, the results of all kinds of other studies can be integrated in meta-analyses as well

Step 2
Searching bibliographical databases

After defining your research question using the PICO acronym, the
next step is to find trials that have examined this research question and
that can be included in your meta-analysis. In this chapter we will describe
how you can identify these studies. First you will have to choose the bib-
liographical databases that you will search. Then you will have to develop
a strategy for searching in these databases and identify the studies that
meet your inclusion criteria.
If you work in an institute, like a university, that has a library where information specialists or librarians work, it is highly recommended to involve one of these information specialists in the process of identifying studies for inclusion in your meta-analysis. It is the job of information specialists to know which databases are available and how to search them. The quality of your searches increases considerably when you involve an information specialist.
In this chapter we will present databases that can be accessed on the Internet. We will give the names and URLs of these websites in a separate table (Table 2.1). However, the names and URLs of these websites often change, so if a URL is not correct, please search the Internet for the correct address.
Some of the databases that you will use are free, such as PubMed. For others, like PsycInfo or the Cochrane database of trials, you will need a subscription. Unfortunately, a meta-analysis in mental health and the social sciences requires access to some of these paid databases and cannot be conducted without it.


Selecting bibliographical databases


For a meta-analysis in mental health research, there are three bibliographical databases that should be searched:
- PubMed. This database provides free access to MEDLINE, the National Library of Medicine's database of citations and abstracts in the fields of medicine, nursing, dentistry, veterinary medicine, health care systems, and preclinical sciences. It can be accessed by anyone. PubMed now has 25 million citations and abstracts from MEDLINE, life science journals, and online books. It has abstracts from more than 5,600 biomedical journals published in the United States and worldwide, and goes back to 1946.
- PsycInfo is a bibliographical database from the American Psychological Association that includes nearly 4 million bibliographic records centered on psychology and the behavioral and social sciences. The records come from more than 2,500 scholarly journals, books and theses. It is not free.
- Cochrane Central Register of Controlled Trials (CENTRAL). This database is developed by the Cochrane Collaboration and contains only randomized trials. These trials are identified by highly sensitive searches in other bibliographical databases, but also by hand-searching the contents of about 2,400 scientific journals. CENTRAL cannot be accessed for free in most countries.

Many meta-analyses in mental health and social sciences are limited


to studies that are identified through searches in these three core data-
bases. However, there are many other databases that can be useful and
identify other studies that meet your inclusion criteria.
Embase is another important general biomedical bibliographical database that includes many journals that are not included in PubMed (and the other way around). It includes 30 million abstracts and indices from published, peer-reviewed biomedical literature, in-press publications and conferences. Unfortunately, it is not free like PubMed, but if you want to search broadly it is advised to search Embase as well.
Apart from these general databases there are all kinds of more subject-
specific databases, for example BiblioMap (a database of health promo-
tion research with over 14,000 records), CINAHL (aimed at nursing sci-
ence), ERIC (a large database on education-related literature), and Age-
Line (on aging issues from individual, national and global perspectives).
Citation databases are another source for identifying studies to be included in meta-analyses, such as ISI Web of Knowledge (Thomson Reuters), Scopus (Elsevier) and Google Scholar. These databases track how often papers are cited by other papers, but they also offer the possibility to search for specific types of papers. Unfortunately, ISI Web of Knowledge and Scopus are not free.

Citation databases can also be used to find relevant articles in another way. If you have identified studies that meet your inclusion criteria, you can first find them in the database and then check the studies that cite them; these citing studies may be new studies that also meet your inclusion criteria. This can be a lot of work, because older studies especially may be cited a lot, so you could choose to follow only a selection of these studies.
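This forward citation search can be automated to some extent. The sketch below is a hedged illustration using the free NCBI E-utilities via the Biopython package; the PMID and email address are placeholders, and this route only finds citing articles that are available in PubMed Central, so it is no substitute for the citation databases listed above.

    # Minimal sketch (assumes the Biopython package; PMID is a placeholder).
    from Bio import Entrez

    Entrez.email = "your.name@example.org"  # NCBI asks for an email address

    # Fetch PubMed records that cite a given included study:
    handle = Entrez.elink(dbfrom="pubmed", db="pubmed",
                          linkname="pubmed_pubmed_citedin", id="12345678")
    result = Entrez.read(handle)
    handle.close()

    citing_ids = [link["Id"]
                  for linkset in result[0].get("LinkSetDb", [])
                  for link in linkset["Link"]]
    print(len(citing_ids), "citing records found")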


Table 2.1 Websites for searching bibliographical databases

Core databases
- PubMed: Database of the US National Library of Medicine. www.ncbi.nlm.nih.gov/pubmed
- PsycInfo: Database from the American Psychological Association on the behavioral and social sciences. www.apa.org/pubs/databases/psycinfo
- Cochrane Central Register of Controlled Trials (CENTRAL): Database of randomized trials in health care.
- Embase: 30 million abstracts and indices from published, peer-reviewed biomedical literature, in-press publications and conferences. www.elsevier.com/solutions/embase-biomedical-research

Subject specific databases
- Bibliomap: Health promotion research. https://eppi.ioe.ac.uk/webdatabases/Intro.aspx?ID=7
- CINAHL (nursing science): Cumulative Index to Nursing and Allied Health Literature. Access to CINAHL is provided on the Web by EBSCO Publishing, Ovid Technologies and ProQuest.
- ERIC (education-related literature): The Educational Resources Information Center is a large database on education-related literature, supported by the U.S. Department of Education's Office of Educational Research and Improvement and administered by the U.S. National Library of Education (NLE). http://eric.ed.gov
- AgeLine (aging issues): AgeLine indexes 213 journals, books, book chapters and reports. Designed for researchers, professionals, students and general consumers, AgeLine addresses aging issues from individual, national and global perspectives. www.ebscohost.com/academic/ageline

Citation databases
- Thomson Reuters Web of Knowledge: Thomson Reuters citation database. www.webofknowledge.com
- Scopus: Elsevier citation database. www.elsevier.com/solutions/scopus
- Google Scholar: The largest citation database, developed by Google. scholar.google.com

National and regional databases
- LILACS: Scientific and technical literature of Latin America and the Caribbean. http://lilacs.bvsalud.org/en/
- Chinese Biomedical Literature Database (CBM): Institute of Medical Information & Library. www.imicams.ac.cn/publish/default/eng
- China National Knowledge Infrastructure (chkd-cnki): Database of Chinese studies. http://oversea.cnki.net/kns55/default.aspx
- indMED: Database covering peer-reviewed Indian biomedical journals. http://indmed.nic.in/indmed.html

Dissertations and theses
- ProQuest dissertations: Database of dissertations. www.proquest.com/products-services/dissertations
- ProQuest dissertations UK & Ireland: Database of dissertations from Great Britain and Ireland. www.proquest.com/products-services/pqdt_uk_ireland.html
- Deutsche Nationalbibliothek: The German National Library offers access to German dissertations. www.dnb.de/DE/Wir/Kooperation/dissonline/dissonline_node.html
- CNKI: Database of Chinese theses. http://oversea.cnki.net/kns55/Navi/CDMDNavi.aspx?NaviID=36&XueKE=1

Other reviews and guidelines
- DARE: The Database of Abstracts of Reviews of Effects of the University of York. www.crd.york.ac.uk/CRDWeb
- National Guideline Clearinghouse: NGC is a public resource for evidence-based clinical practice guidelines. www.guideline.gov

Trial registers
- WHO Clinical Trials Search Portal: http://apps.who.int/trialsearch/

There are also many national and regional databases, such as LILACS
(a database of the scientific and technical literature of Latin America and
the Caribbean) and indMED, covering Indian biomedical journals. There
are also several Chinese databases you can search (see the overview in Xia, Wright, & Adams, 2008), such as the Chinese Biomedical Literature
Database from the Institute of Medical Information & Library and the
China National Knowledge Infrastructure.
Some databases offer access to doctoral dissertations, such as ProQuest, which offers access to dissertations from the US and also has a separate collection of dissertations from the United Kingdom and Ireland. Other websites offer access to German (German National Library) and Chinese (CNKI) dissertations.

Other ways to identify studies for inclusion


Apart from searching in bibliographical databases to identify studies,
there are several other methods that can be helpful in the identification of
studies for your meta-analysis.
- After the selection of the studies through the searches has been completed, you can go through the references of the included studies to see if trials are cited that you did not identify through your searches.
- It is also important to find earlier meta-analyses or treatment guidelines and verify whether they have identified studies that you did not find through the searches. You can identify earlier meta-analyses and treatment guidelines by searching the same bibliographical databases in which you searched for studies to be included. In addition, you could search DARE (Database of Abstracts of Reviews of Effects; Table 2.1) from the University of York, which is a database aimed at collecting all meta-analyses of treatments. Or you can search the website of the National Guideline Clearinghouse (Table 2.1), which can help with finding clinical practice guidelines on the subject. Practice guidelines are typically based on the results of meta-analyses and recent trials.
- Another method to identify studies is to do hand searches in major journals. This means that you first identify the most important journals in the field (for example by looking at which journals have published most of the trials that you have identified through your searches). Then you look through the tables of contents of these major journals to identify further studies.
- You can also contact key experts in the field, for example researchers who have published several trials in this field or who have contributed in another way to the research in this field. They may know of studies that were recently finished or that are still ongoing but meet the inclusion criteria.
- Finally, you can try to identify trials that are currently being conducted and may not have been published yet, through trial registers. Because there are many trial registers, the Clinical Trials Search Portal from the WHO (Table 2.1) provides access to a central database containing the trial registration data sets provided by more than fifteen national trial registries.

Searching in bibliographical databases


When you have decided which databases you want to include in your searches, you will have to develop a search strategy and search strings for each of the databases you will search. The starting point for your searches is the PICO we discussed in Step 1. You should translate your PICO into search terms and combine these terms into search strings. You can do that by taking each of the elements of the PICO and finding the right terms indicating each of these elements.


Table 2.2 MeSH tree on mental disorders from PubMed

Mental Disorders [F03]
  Adjustment Disorders [F03.075]
  Anxiety Disorders [F03.080]
    Agoraphobia [F03.080.100]
    Neurocirculatory Asthenia [F03.080.500]
    Obsessive-Compulsive Disorder [F03.080.600]
      Obsessive Hoarding [F03.080.600.500]
    Panic Disorder [F03.080.700]
    Phobic Disorders [F03.080.725]
    Stress Disorders, Traumatic [F03.080.931]
      Battered Child Syndrome [F03.080.931.124]
      Combat Disorders [F03.080.931.249]
      Stress Disorders, Post-Traumatic [F03.080.931.500]
      Stress Disorders, Traumatic, Acute [F03.080.931.550]
  Delirium, Dementia, Amnestic, Cognitive Disorders [F03.087]
  Dissociative Disorders [F03.300]
  Eating Disorders [F03.375]
  Factitious Disorders [F03.400]
  Impulse Control Disorders [F03.500]
  Mental Disorders Diagnosed in Childhood [F03.550]
  Mood Disorders [F03.600]
    Affective Disorders, Psychotic [F03.600.150]
      Bipolar Disorder [F03.600.150.150]
        Cyclothymic Disorder [F03.600.150.150.300]
    Depressive Disorder [F03.600.300]
      Depression, Postpartum [F03.600.300.350]
      Depressive Disorder, Major [F03.600.300.375]
      Depressive Disorder, Treatment-Resistant [F03.600.300.387]
      Dysthymic Disorder [F03.600.300.400]
      Premenstrual Dysphoric Disorder [F03.600.300.550]
      Seasonal Affective Disorder [F03.600.300.700]
  Neurotic Disorders [F03.650]
  Personality Disorders [F03.675]
    Antisocial Personality Disorder [F03.675.050]
    Borderline Personality Disorder [F03.675.100]
    Compulsive Personality Disorder [F03.675.150]
    Dependent Personality Disorder [F03.675.200]
    Histrionic Personality Disorder [F03.675.400]
      Hysteria [F03.675.400.500]
    Paranoid Personality Disorder [F03.675.600]
    Passive-Aggressive Personality Disorder [F03.675.625]
    Schizoid Personality Disorder [F03.675.700]
    Schizotypal Personality Disorder [F03.675.725]
  Schizophrenia and Disorders with Psychotic Features [F03.700]
  Sexual and Gender Disorders [F03.800]
  Sleep Disorders [F03.870]
  Somatoform Disorders [F03.875]
  Substance-Related Disorders [F03.900]

The development of search strings is usually not very straightforward.
Usually you will first develop a search strategy, try it out, and evaluate it
in terms of the number of records retrieved. If necessary, you adapt your
search and run it again. You continue with that until you have your final
search strings.
In each search you will have to find a balance between sensitivity and
precision. If you want to identify all studies that meet your inclusion
criteria, you could work through the full PubMed database and the other
databases in full (if that were physically possible, because these contain
millions of records). This would be an enormous amount of work, but you
would be sure not to miss any study that meets your criteria. On the other
end of the spectrum you could do a very narrow search, which results in a
smaller number of records. That would be much less work, but the risk that
you will miss records of studies that meet your inclusion criteria is much
larger. Any search should strike a balance between these extremes: on the
one hand broad searches with a low risk of missing studies and large numbers
of records, and on the other hand narrow searches with fewer records and a
higher risk of missing studies.
An important aspect of searches is the difference between text words and
key words. Text words are the words in the titles and abstracts of the
records. Key words are the words that are attached to each abstract and
that capture the core of the content of the study. In PubMed these key
words are called MeSH terms. MeSH stands for Medical Subject Heading.
These MeSH terms are attached to papers separately from the abstract and
title, so it may very well be that a specific term is not mentioned in the
abstract or title but still is a key word. It is therefore important to use
both text words and key words in your searches.
You can find the right key words by looking at studies that meet your
inclusion criteria and seeing which key words are attached to them. But you
can also look in the thesaurus of the key words. Each of the databases
uses its own thesaurus, and the key words in such a thesaurus are
hierarchically structured into a taxonomy. In PubMed this taxonomy is called
the MeSH tree. In Table 2.2 you can see the branch of the MeSH tree
for Mental disorders (taken from www.nlm.nih.gov/mesh/trees.html). In
this branch you can follow each of the sub-branches into new branches.
As an example we have given in Table 2.2 the sub-branches for Anxiety
disorders, Mood disorders and Personality disorders. But each of the other
branches also has sub-branches (see www.nlm.nih.gov/mesh/trees.html). An
alternative is to search for terms that are included in the MeSH tree in
the MeSH tree browser at www.ncbi.nlm.nih.gov/mesh.
As indicated earlier, each database has its own taxonomy. Embase,
for example, has the Emtree thesaurus. The taxonomies of the databases
are built in different ways, but they can all be searched by entering the
right key words.

Boolean operators, truncation, wildcards, proximity operators, and search filters

If you develop search strings consisting of separate search terms you
will have to make use of Boolean operators, like AND, OR, and NOT. If
you put AND between two words, the resulting records will include
both terms. If you put OR between two words, the resulting records will
include either the first or the second term. The inclusion of NOT will
exclude records containing a specific term.
Brackets can help with defining which terms should be combined with AND,
OR or NOT. Suppose, for example, that you want to identify trials on
psychological treatment for depression. In that case you could develop a
search string combining terms for depression and treatment, which could
look like: (depression OR mood disorder) AND (psychotherapy OR cognitive
behavior therapy OR interpersonal psychotherapy).
You can see in this example that the key concepts (depression and
therapy) should both be present in the records that you want to identify
(because they are connected with AND), and that the key concepts can be
identified with different terms (because they are connected with OR). Of
course this is only an example illustrating the use of AND and OR, not a
real search string.
Most databases allow you to use truncation, wildcards and proximity
operators, which can be useful when you do searches. Truncation is a
searching technique in which a word ending is replaced by a symbol,
usually the asterisk (*). For example, if you use random* as a search
term in PubMed, you will get all the records that include terms like
random, randomized, randomised, and randomly. A wildcard (?) replaces
a single letter in a word. For example, the term m?n will identify records
with the terms man, men, min, mun, etc. Proximity operators can also be
used (in Ovid databases, for example: ADJ). For example, depression ADJ3
disorder returns records where depression and disorder are within three
words of each other, in any order.
Search filters can also be very useful when conducting searches. At
the website of the InterTASC Information Specialists' Sub-Group
Search Filter Resource (www.york.ac.uk/inst/crd/intertasc) you can find
an overview of search filters for many different types of studies in several
biomedical databases. PubMed and other databases also have their
own search filters. For example, PubMed has a publication type tag for
randomized trials (Randomized Controlled Trial[Publication Type]) and you
can limit your searches with this tag. But there are several other
search filters you can use for limiting searches to randomized trials. For
example, a simple but more sensitive search filter for randomized trials in
PubMed is: randomized controlled trial [pt] OR controlled clinical trial [pt]
OR randomized [tiab] OR randomly [tiab]. And the Cochrane Collaboration
has developed a highly sensitive search string for randomized trials, which is
aimed at missing as few studies as possible (while accepting that it results in large
numbers of records that have to be screened): ((randomized controlled trial [pt])
OR (controlled clinical trial [pt]) OR (randomized [tiab]) OR (placebo [tiab]) OR
(drug therapy [sh]) OR (randomly [tiab]) OR (trial [tiab]) OR (groups [tiab])) NOT
((animals [mh] NOT humans [mh])).


A simplified example
So how does all this work when you actually do searches for a me-
ta-analysis? Suppose that you want to do a meta-analysis of cognitive be-
havior therapy for depression compared with waiting list control groups,
and that you would start your search in PubMed. A simple search strategy
could focus on three of the four elements of your PICO:

(depressive disorder, major[MeSH Terms] OR depressive disorder[MeSH Terms])
AND
(Cogniti* AND (therapy OR treatment OR intervent*))
AND
(randomized controlled trial [pt] OR controlled clinical trial [pt]
OR randomized [tiab] OR randomly [tiab])

You could also add search terms for the waiting list component (the
comparator), but it is also possible to leave that out. If you leave it out, you
will get all studies on cognitive therapy, regardless of the type of control
group or comparator (because you don't limit the searches to the waiting
list control group). You can do that when the terms for the waiting list would
be ambiguous or unclear, and the number of resulting records does not
become too large because of it. When you do the search without waiting
list terms you will get about 1,500 records (search conducted in
2015). If you add terms for the waiting list (Waiting AND list) you will end up
with only about 50. About 1,500 records is not too many, while only 50 seems
very few, with a risk of missing too many studies (because the term waiting
list may not be included in the abstract or title). So, in this case it seems
better to leave out the terms for the waiting list.
This issue can also be illustrated with searches for psychotherapy
for generalized anxiety disorder. Search terms that can be used for
identifying studies on generalized anxiety disorder are: generalized anxiety
OR generalised anxiety OR worry*. If you combine these terms with the
search string for randomized trials, you will find only about 1,400 records
(in 2015). So you do not need to extend your search with terms for the
other parts of the PICO, because this search is not too broad.


Key points
Search at least PubMed, PsycINFO, and the Cochrane database
Develop search strings based on your PICO
Find a balance between sensitivity and precision in your search strategy
Ask help from a librarian
Use text words as well as key words (MeSH terms)
Use Boolean operators, truncation, wildcards, proximity operators, and search filters.

Step 3. Selection of studies,
retrieval of data and risk of bias

In Step 2 you identified and searched bibliographical databases with


the search strings that match your research question optimally. When
you have finished these searches you will have to do the actual selection
of the studies. In this chapter we will describe how that selection is done.
Furthermore, we will describe the steps that you have to take in order to
retrieve the data you need from the included studies.
In the publication of your meta-analysis you will have to add a PRISMA
flowchart (see also Step 6). In that flowchart you describe the process of
selecting studies, from the bibliographical databases to the inclusion of
the studies in your meta-analysis. The PRISMA flowchart is given in
Figure 3.1.

Selection of studies
When you have finished the searches you will first have to save the
results in files. These files should be in a format that can be read by the
reference software that you use, such as Endnote or Reference Manager.
It is beyond the scope of this book to describe exactly how these packages
work. But it is important that you use one of these packages, because they
can help you to remove duplicate abstracts.
Because you have used more than one bibliographical database for
your searches, it is highly probable that these databases have identified
identical abstracts. You will have to remove these duplicate abstracts.
Removal of duplicate abstracts is a requirement for the PRISMA flowchart
of the selection process of studies that you will have to make, according
to the PRISMA guidelines for systematic reviews. So, searching the
results of each of the databases separately (and thus reading some
abstracts in more than one database) is not possible if you follow the
PRISMA guidelines.
Figure 3.1 PRISMA flowchart

Identification:
# of records identified through database searching
# of additional records identified through other sources
# of records after duplicates removed

Screening:
# of records screened
# of records excluded

Eligibility:
# of full-text articles assessed for eligibility
# of full-text articles excluded, with reasons

Included:
# of studies included in qualitative synthesis
# of studies included in quantitative synthesis (meta-analysis)

From: Moher D, Liberati A, Tetzlaff J, Altman DG, The PRISMA Group (2009). Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLoS Med 6(6): e1000097. doi:10.1371/journal.pmed.1000097. For more information, visit www.prisma-statement.org.

First you should save the results of your searches in files. How that
is done depends on the database you have searched. Table 3.2 describes
how you can save a file from PubMed. But each database is different, and
it is not possible to give a comprehensive description of how it is done
for all databases. The next step is to import these files into your
reference software package.
When you have imported the results from all databases that you have
searched into your reference software, you can start removing duplicate
records. Most reference software packages have methods to do that au-
tomatically.
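If you want to script this step yourself, a minimal sketch in Python could look like the following (the record fields and function names are our own illustration, not part of Endnote or of any database export format):

import re

def normalize(title):
    """Lowercase a title and strip everything but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

def deduplicate(records):
    """Keep the first record per DOI (or normalized title when the DOI is missing)."""
    seen, unique = set(), []
    for record in records:
        key = record.get("doi") or normalize(record.get("title", ""))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

In practice the automatic deduplication of your reference software is usually sufficient; a script like this mainly helps when you want to document exactly how many duplicates were removed for the PRISMA flowchart.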
It is important that you make notes of how many records you found
in each of the databases and how many records are left after removing
the duplicates. These numbers are needed for the PRISMA flowchart.
And don't forget to note the date on which you did the searches! You
will have to report that in your paper.
After you have imported all results from the bibliographical databases
into your reference software and have removed the duplicates, you are
ready for the selection of records. You can do that in your reference soft-
ware or you can export the records to a word processor document.
The selection of records in this phase is only about the decision to
retrieve the full text of a study or not. So if there is any doubt whether a
record may describe a study that meets the inclusion criteria for your
meta-analysis, you should retrieve the full text of that study.
The PRISMA flowchart does not require you to specify why a record
is excluded. You can simply say that records are excluded based on title
and abstract. But you will have to report exactly how many records were
excluded.


Table 3.2. Save results from your search in
PubMed in a file and import in Endnote

When you have conducted a search in PubMed, you can download
the abstracts from that search by following these steps:
Below the search box, you can see a link called "Send to:".
Click on this link.
You will then see a menu, called "Choose destination".
Check "File", and the menu expands.
Click on the box next to "Summary (text)" (below "Format").
Then choose "MEDLINE".
Click on "Create File".
Then you can save this file to your computer.
This file can be imported into your reference software, like
Endnote or Reference Manager. All bibliographical databases
offer comparable ways of saving selections of studies that can
be imported into reference software.

If you have saved your file, you first open the Endnote library into
which you want to import it.
Click on "Import" in the "File" menu.
Click on the box next to "Import options".
If "PubMed (NLM)" is not one of the options you can choose
from, click on "Other filters" and select it from the menu.
Click on "Import" and the records will be added to your
Endnote library.
You can add files from other bibliographical databases through
the same procedure, selecting the right import filter.


Working with two independent assessors

At this point you will preferably work with two independent researchers
who select the records. It has been shown that when the selection of
records is done by only one researcher, some studies are missed. Working
with two independent researchers is therefore important. As we will see
along the way, the extraction of the data from the included studies and
the calculation of effect sizes should also be done by two independent
researchers. If that is not possible because there is insufficient manpower
available, then at least the inclusion of studies (after retrieval of the
full texts) and the calculation of effect sizes should be done by two
independent researchers.
When these two researchers work independently on the selection of
records and studies, and on the retrieval of data from the studies, they
will have to compare their results. They will certainly find differences.
These differences are usually solved by discussing them and the reasons
why the decisions were made. If they cannot agree on how to handle these
differences, usually a senior researcher involved in the project will lead
the discussion and take a final decision.

Selection of full-text studies for inclusion in your meta-analysis

When you have selected the studies for which you want to read the
full text, you should collect these from your library, and when they are not
available you will have to get copies from other libraries. Alternatively,
for newer papers, you can contact the authors and ask for a copy, or you
can check whether they have posted the full text of their paper in a repository
at their university. Websites like ResearchGate (www.researchgate.net)
also provide many full-text research papers.
If you have collected the full texts of these papers you can start with
the selection of the studies you want to include. You will have to make a
comprehensive list of inclusion and exclusion criteria that can help you
with the decision to include or exclude studies. These inclusion and
exclusion criteria are again based on your PICO. You will have to read each of
the papers carefully and see if it meets your inclusion criteria. This should
also be done by two independent researchers, with disagreements solved
by discussion and a final decision by a third researcher.
If you decide to exclude a full-text study, you have to give a reason for
that. The PRISMA flowchart requires that you report how many full-text
papers you retrieved, how many were excluded, and what the reasons for
exclusion were. This process is often not very clear-cut, because studies can
be excluded for several reasons, and giving one reason why a study was
excluded can be confusing when there were other reasons why it would have
been excluded. A solution is to make a hierarchy of reasons for exclusion.
In most meta-analyses, however, researchers indicate the first reason they
found in the article why it should be excluded.
In this phase a decision for inclusion is not definite. You may include
a study because it meets the inclusion criteria, but find out later, for
example, that it is impossible to calculate an effect size because not all
the necessary data are given in the paper. Or you may find out that the
study is actually a secondary paper about another study that you
already included.

Data extraction: Characteristics of the included studies


But at some point, after reading all these papers carefully and excluding
the ones that certainly don't meet your criteria for inclusion, you can
start with the data extraction. Overall there are three major groups of
data that you will extract from the included studies.
1. Characteristics of the studies
2. Risk of bias or quality assessment
3. Data to calculate effect sizes

In the rest of this chapter we will describe the first two categories of
data to be retrieved. The third category, about the calculation of effect
sizes, will be described in the next chapter.


There are no fixed rules for which characteristics of the studies should
be extracted. When you have read scientific articles about the subject of
your meta-analysis and you have read the full-texts of the studies you
have retrieved, you probably know which characteristics of the studies
you should collect. But it is not uncommon that during the process of ex-
tracting data you come up with other characteristics you should retrieve.
In general you can say that you should at least collect data about the
elements of your PICO. That means that you should collect data about
the participants in the trials, the interventions, and the comparators. The
outcomes (the last part of the PICO) are used for the calculation of the
effect sizes (described in the next chapter).
So, for example, for the participants you can rate how they were
recruited, what the exact definition of their problem was (for treatment
studies), which exclusion criteria were used in the trials, and
sociodemographic characteristics like age, gender, socioeconomic status,
and the proportion of participants from minority groups.

Characteristics of the interventions that can be collected include,
for example, the type of intervention (like cognitive therapy), the number
of sessions, the format (individual, group), the training of the therapists,
the supervision of the therapists, the manual used, etc.
For the comparator it can be rated which type of control group was
used, or if the comparator is an alternative treatment, the characteristics
of this alternative treatment.
Other characteristics of the studies that can be collected include for
example the year in which the study was published, the country where it
was conducted, or the sample size used in the study.
Meta-analyses always contain a table describing the characteristics
of the included studies. So when collecting the characteristics from the
published studies it is useful to make such a table from the start and fill
in the cells for each of the included studies while retrieving the data.
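If you keep this table in a spreadsheet or CSV file from the start, filling it in is straightforward. A minimal sketch in Python (the column names are only an illustration of the kind of PICO-based characteristics discussed above, not a required format):

import csv

FIELDS = ["study", "year", "country", "recruitment", "diagnosis",
          "intervention", "n_sessions", "format", "comparator", "n_randomized"]

with open("included_studies.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    # One row per included study; missing cells stay empty until extracted.
    writer.writerow({"study": "Study 1", "year": 2014, "country": "Netherlands",
                     "intervention": "cognitive therapy", "format": "group",
                     "comparator": "waiting list", "n_randomized": 151})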


Assessing methodological quality and bias

The assessment of the risk of bias is one of the essential parts of any
meta-analysis. As indicated in the Introduction to this book, a meta-analysis
can never be better than the studies it summarizes. If all studies
included in a meta-analysis have a high risk of bias, very little can be
concluded from the results of that meta-analysis, because the outcomes can
be heavily influenced by this risk of bias. This "garbage in, garbage out"
problem is one of the main problems of meta-analyses.
Risk of bias is not the same as the quality of a study, although there is
overlap between the concepts (Higgins & Green, 2011). Bias is a systematic
error in a study, or a deviation from the truth, in results or inferences.
Risk of bias refers to the weak spots of randomized trials, where the
researchers (usually without intending to) can influence the outcomes of
the study. Because the results may still be unbiased, it is more appropriate
to talk about risk of bias than about bias.
Quality refers more to how well a study has been designed and
conducted. Unfortunately, there are no definitive criteria for what constitutes
high quality. There are many quality rating scales for randomized trials,
but they differ from each other in which items are included, and that varies
across types of interventions (Higgins & Green, 2011). For example, quality in
psychotherapy research is different from quality in medication trials. In
general it is therefore better to rate risk of bias than to rate quality,
because risk of bias is much more straightforward.

Sources of bias
There are several sources of bias that should be assessed when conducting
a meta-analysis. More information about these sources of bias and how to
assess them can be found in the Cochrane Handbook for Systematic Reviews
of Interventions (Higgins & Green, 2011), which gives an excellent overview
of the different types of risk of bias and is available for free online
(http://handbook.cochrane.org). Below we will describe the different sources
of bias in randomized trials. After that we will describe how these sources
of bias can be assessed in meta-analyses.

Selection bias
Selection bias refers to systematic differences between the groups
that were randomized in the trial. One of the strong characteristics of
randomized trials is that participants are assigned to conditions in a random
way. When that is done correctly, there are no systematic baseline differences
between the two (or more) groups that are randomized. If, however, this
assignment is not done well, there may be systematic differences between
the groups, which violates this basic principle of randomized trials.
That also means that if differences between these groups are found after
treatment, these may not be caused by the treatment but by the systematic
differences between the randomized groups.
Selection bias can result from errors in the randomization process.
First, the assignment of participants in trials should be done in a random
way, meaning that it is only chance that determines whether a person is
assigned to one condition or the other. The process of generating the order
in which participants are assigned to conditions is usually called sequence
generation. There are several adequate ways to do this, for example using
a random numbers table or a computerized random number generator (like
www.random.org). Coin tossing or throwing dice, for example, are also
adequate ways to generate random numbers. Incorrect ways of generating
the order in which participants are assigned to conditions include
assignment by date of birth, date of admission, or patient record number.
Other incorrect ways of assigning participants include assignment based
on the judgment of a clinician or the preference of the participant.
But adequate sequence generation is not the only point where the
allocation of participants can go wrong. It is also important that the
researchers and participants cannot foresee the assignment, because then
they could influence the process of randomization. It is therefore important
to conceal the allocation as much as possible from researchers and
participants. This allocation concealment can be realized by asking an
independent person, who is not involved in the trial, to do the assignment
to conditions. An alternative is to use sequentially numbered, opaque,
sealed envelopes containing the condition to which the participant is
assigned.
Until about twenty years ago the method of randomization was usually
not described at all in studies on mental health problems; it was usually
only reported that participants were randomized. Since the concept of risk
of bias was introduced (Higgins et al., 2011), the methods of randomization
are more often described in reports of trials, but this is still not always
done.

Detection bias
Detection bias refers to systematic differences between groups
in how outcomes are determined. Detection bias can be prevented by
blinding (or masking) of participants, the personnel involved in the study,
and outcome assessors.
In medication trials it is possible to fully blind the patients who
participate. Patients receive either the medication that is tested or a
placebo pill that looks exactly the same as the medication but lacks the
active substance. In such trials neither the patients nor the doctors who
treat them know whether they have received the medication or the placebo.
In psychological interventions this blinding is typically not possible.
Participants know whether they are randomized to an intervention or to a
waiting list, usual care or pill placebo. Blinding is simply not possible
in most psychological interventions in mental health care. That means that
the effects that are found for psychological interventions can very well
be caused by factors other than the specific techniques used in the
intervention. For example, it is very well possible that the effects are
(partly) caused by the expectations a patient has of the intervention. It
is well known that expectations are associated with the outcomes of
therapies (Constantino, Arnkoff, Glass, Ametrano, & Smith, 2011).
Unfortunately there is no solution for this problem.
It is, however, possible to blind the assessors of outcome. If the people
who assess outcomes are not blinded, they may be convinced that the
participants who received the intervention are better off than the ones who
did not receive it. Because of this they may be inclined to interpret what
a participant says as a positive outcome. There is considerable empirical
evidence that lack of blinding of outcome assessors does indeed lead to
more positive outcomes of interventions than blinded assessment (Higgins
& Green, 2011).
Simply not telling the outcome assessors to which condition the
participants are assigned is important, but it is also important to ask study
participants and assessors not to talk about the intervention when the
interview is conducted. Because if they do, the assessors will still learn
to which condition the participants are assigned.
In many psychological interventions there is another problem, namely
that the outcomes of the trial are assessed with self-report measures, and
not through interviews with (blinded) assessors. When participants fill in
a self-report measure and know to which condition they are assigned, they
cannot, of course, fill in the questionnaire in a blinded way. So blinding
is not possible in that case either. However, some research has found that
self-report measures are more conservative than assessments of outcome by
assessors and lead to lower estimates of the effects of interventions
(Cuijpers, Li, Hofmann, & Andersson, 2010). Self-report measures are also
more conservative than assessments by blinded assessors. So it is not clear
whether the impossibility of blinding self-report measures really is a
problem.

Attrition bias
One of the core elements of randomized trials is that all participants
who are randomized are also included in the analyses of the outcomes. In
earlier studies this element of trials was not considered important, and the
analyses of the outcomes were usually applied only to those who completed
the study. The participants who dropped out of the study were simply
ignored, and in some early trials it was even considered appropriate to
replace drop-outs.
Nowadays it is generally understood that analyzing all randomized
participants is important for estimating the true effect of an intervention.
It is very well possible that the participants who drop out are also the
ones who do not benefit from the intervention or who get worse because of
the intervention or during the intervention. Focusing only on those who do
not drop out probably inflates the effect sizes considerably. There is also
empirical evidence from meta-analyses that studies that only include study
completers find higher effects than studies that include all randomized
participants in the analyses (Cuijpers, van Straten, Bohlmeijer, Hollon, &
Andersson, 2010).
But how can the data of participants who drop out be used in the
analyses? We do not have these data, so how can they be used? There are
several ways to estimate, or impute, these missing data, such as using the
last observation that is available (last observation carried forward),
multiple imputation techniques, or mixed models for repeated measurements
(Crameri, von Wyl, Koemeda, Schulthess, & Tschuschke, 2015).
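To illustrate the simplest of these techniques, last observation carried forward, here is a minimal Python sketch (a real analysis would use the multiple imputation or mixed model routines of a statistical package):

def locf(scores):
    """Replace missing assessments (None) by the last observed score."""
    filled, last = [], None
    for score in scores:
        if score is not None:
            last = score
        filled.append(last)
    return filled

print(locf([21, 18, None, None, 12]))  # [21, 18, 18, 18, 12]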
Note that there is a difference between participants who drop out of
the intervention and those who drop out of the study. People who drop
out of the intervention can still participate in the study by participating
in the assessments of outcome after the intervention. The problem of im-
puting missing data is mostly relevant for the participants who drop out
of the study.

Reporting bias
In many randomized controlled trials several outcome measures are
used. Researchers can be inclined to report only the outcomes for which
significant effects are found, or the outcomes with the largest effect sizes.
This can affect the outcomes of meta-analyses considerably, because the
other, non-significant outcomes are just as valid. It is also possible, for
example, that the authors conducted analyses to optimize the positive
outcomes of the trial, while these analyses differed from the originally
planned analyses.
For earlier studies it often cannot be examined whether there is reporting
bias. But in more recent years trials must be registered in a trial registry
(http://apps.who.int/trialsearch), and it is also more and more common
to publish the protocol of a trial in a protocol paper. It is possible to
verify in these documents which outcome measures, analyses and other
methods were planned, and whether the published paper deviated from that.

Other potential threats to validity

There are all kinds of other potential threats to validity, such as
extreme baseline imbalance between the randomized groups (randomization
should result in groups of about equal size with comparable characteristics).
Smaller samples have a larger chance of baseline imbalance; the larger the
sample size, the smaller the chance of imbalance between the groups. If
there is still considerable imbalance, something may have gone wrong with
the randomization. Another potential threat to validity is a claim that a
study has been fraudulent.

Researcher allegiance
In research on psychological interventions, the problem of researcher
allegiance is also a potential threat to validity. Researcher allegiance
can be defined as a researcher's "belief in the superiority of a treatment
[and] the superior validity of the theory of change that is associated
with the treatment" (Leykin & DeRubeis, 2009, p. 55). Many meta-analyses
have shown that researcher allegiance is associated with considerably
better outcomes for the preferred treatment (Dragioti, Dimoliatis,
Fountoulakis, & Evangelou, 2015; Munder, Brütsch, Leonhart, Gerger, &
Barth, 2013a).


Researcher allegiance is measured by checking several characteristics
of the studies. For example, it is checked whether the authors present
evidence or beliefs that the examined therapy is effective, even though
that has yet to be established in the study, or whether a clear hypothesis
is given why the treatment is superior to others. Another indication of
researcher allegiance is that the intervention or the manual used was
developed by one of the authors.

Assessing risk of bias: the Cochrane Risk of bias assessment tool

Risk of bias in randomized trials can best be assessed with the Cochrane
Risk of bias assessment tool (Higgins & Green, 2011). With this tool all
major types of bias can be assessed (except researcher allegiance, see
below). Extensive information about the tool and how to apply it can be
found on the website devoted to the tool (www.riskofbias.info). With the
tool the six major types of bias are assessed:
Adequate sequence generation
Allocation concealment
Blinding
Incomplete outcome data addressed
Selective reporting
Other potential threats to validity
For each of these types of bias it is described, with detailed
explanations, when it is present or not. For each criterion there are three
possibilities:
Low risk of bias: when the paper clearly describes that this type
of bias was handled well.
High risk of bias: when the paper describes a procedure indicating
that the risk of bias is present.
Unclear risk of bias: when the paper does not give enough information
to say whether there was a risk of bias or not. For example, most
older trials of psychological interventions simply stated that
participants were randomized, without describing how the sequence
generation was conducted or how the allocation was concealed.

For the assessment of the risk of bias it is again very important that
this is done by two independent researchers and that disagreements are
discussed until agreement is reached (when needed, a third, senior
reviewer is involved).
Reporting of risk of bias in a meta-analysis is very important. First,
risk of bias should be clearly reported for each of the included studies.
This can be done in the table describing the characteristics of the
included studies that should be part of any meta-analysis. Second, it is
also important to report the aggregated results for the risk of bias across
all included studies, for example the percentage of studies that meet each
criterion and the number of studies that meet all criteria. Third, a
graphical representation of the risk of bias can also be useful. Figure 3.3
gives an example of how risk of bias can be represented graphically.

Figure 3.3 Graphical representation of risk of bias in studies included in a meta-analysis (Kok et al., n.d.). [Stacked bar chart showing, for each risk of bias domain: random sequence generation (selection bias); allocation concealment (selection bias); blinding of outcome assessment (detection bias); blinding of participants (performance bias); and incomplete outcome data (attrition bias), the percentage of studies (0% to 100%) rated as low, unclear, or high risk of bias.]
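Aggregating the ratings for such a figure is simple once the risk of bias table exists. A minimal Python sketch (the study names, domains, and ratings are invented for illustration):

from collections import Counter

ratings = {
    "Study 1": {"sequence generation": "low", "allocation concealment": "unclear"},
    "Study 2": {"sequence generation": "low", "allocation concealment": "high"},
}

for domain in ["sequence generation", "allocation concealment"]:
    counts = Counter(study[domain] for study in ratings.values())
    shares = ", ".join(f"{counts.get(level, 0) / len(ratings):.0%} {level}"
                       for level in ("low", "unclear", "high"))
    print(f"{domain}: {shares}")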


Apart from assessing the risk of bias in the studies that are included in
a meta-analysis it is of course also very important to examine whether the
risk of bias has an effect on the outcomes of the meta-analysis. We will
describe in Step 5 how that can be done.
Unfortunately, the Cochrane risk of bias assessment tool does not as-
sess researcher allegiance. This has to be assessed, therefore, separately
from the Cochrane tool.


Key points
Work through the records retrieved in the searches and obtain the
full texts of the papers that may meet your inclusion criteria.
Preferably work with two independent researchers
Read the retrieved full texts of the papers carefully
Make a clear overview of inclusion and exclusion criteria to guide
this process
Extract relevant data from the included studies on the participants,
the intervention and general study characteristics
Assessment of the validity of studies is of vital importance
The Cochrane risk of bias tool is a good instrument that assesses
the major types of risk of bias: adequate sequence generation;
allocation concealment; blinding; incomplete outcome data; selective
reporting; and other potential threats to validity
For psychological interventions, researcher allegiance is also an
important threat to the validity of the included trials.

Step 4. Calculating and pooling effect
sizes

Meta-analyses are aimed at statistically integrating the results of
individual studies. Before that integration is possible, however, it is
necessary to calculate the outcomes of each study in such a way that pooling
across studies is possible. That means that these outcomes have to be
calculated in a standardized way. If, for example, one study measures
depression with the Beck Depression Inventory (Beck, Ward, Mendelson,
Mock, & Erbaugh, 1961), and a second study uses another instrument, like
the Hamilton Rating Scale for Depression (Hamilton, 1960), these outcomes
cannot simply be integrated, because of the differences between the
instruments. Somehow the scores on these instruments have to be
standardized, so they can be pooled across studies.
In the social sciences and mental health research the most used
standardized effect sizes are Cohen's d and Hedges' g. These effect sizes
are based on continuous outcomes (which can take any value in a given
range) and indicate the difference between two groups in terms of standard
deviations. When the term "effect size" is used in mental health research,
usually either Cohen's d or Hedges' g is meant. Technically this is not
correct, however, because other standardized outcome measures, for example
those based on dichotomous outcomes (such as the relative risk (RR) or
the odds ratio (OR)), are also effect sizes. However, the term "effect size"
is used so much in mental health research that we will also use it here to
indicate Cohen's d or Hedges' g; when we mean another effect size (like
the RR or OR) we will say so explicitly.
In the following paragraphs we will describe how effect sizes can be
calculated and what they mean. We will first describe the most used
effect sizes based on continuous outcomes, and then the most common effect
sizes based on dichotomous outcomes. It is also possible to use effect
sizes for counts, rates, and ordinal outcomes; because such effect sizes
are not much used in meta-analyses in mental health research, we will not
focus on them here.

Effect sizes based on continuous outcomes


When researchers in mental health and the social sciences speak about
effect sizes, they usually mean Cohen's d (Cohen, 1988). Cohen's d is the
difference between the means of the intervention group and the control
(or comparison) group, divided by the pooled standard deviation of the
two groups. In a formula:

d = (Mintervention - Mcontrol) / SDpooled

In this formula, M stands for the mean and SD stands for the standard
deviation. The pooled SD is calculated as:

SDpooled = √((SDintervention² + SDcontrol²) / 2)

Hedges' g is calculated using the same formula; only the method to
calculate the pooled SD is somewhat different from that of Cohen's d, and
Hedges' g is assumed to be more accurate when the sample size of the
study is small. In general it is therefore better to use Hedges' g than
Cohen's d. In biomedical research Cohen's d and Hedges' g are often called
the "standardized mean difference".
A third type of effect size is Glass's Δ (delta). This uses the same
formula as Cohen's d, but instead of the pooled standard deviation it uses
the standard deviation of the control group. Most software packages for
meta-analyses offer facilities to help with the calculation of effect
sizes. And on the Internet many calculators are available that can also
help with the calculation of effect sizes.
So in order to calculate an effect size, you need the mean (M), the
standard deviation (SD) and the sample size (N) of the two groups that
you are comparing (the I and C from the PICO acronym). But where can you
find these data in an article? Usually the data you need are in a table
presenting the outcomes of the trial. For example, look at the following
open-access paper describing a randomized trial examining the efficacy of
mindfulness-based cognitive therapy as a public mental health intervention
for adults with mild to moderate depressive symptomatology (Pots,
Meulenbeek, Veehof, Klungers, & Bohlmeijer, 2014): http://journals.plos.
org/plosone/article?id=10.1371/journal.pone.0109789.
You will find the outcome data in the table on page 7. As you can see,
the table presents the M and SD of the outcome measures, as well as the
N for each of the two conditions. If a paper does not report these data in
a table, it is necessary to read the text of the results section carefully,
because sometimes these data are only presented as text. In this study the
main outcome measure is the CES-D (Center for Epidemiological Studies
Depression scale) (Roberts, 1980). When we take the M, SD and N of
the treatment group at post-test (M = 11.79; SD = 8.76; N = 76) and of the
waiting list control group (M = 16.43; SD = 9.94; N = 75), then this results
in a Cohen's d of 0.50. This d is also reported in the table itself.
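These calculations are easy to script. A minimal Python sketch of both formulas (the function names are our own; the Hedges' g variant uses the degrees-of-freedom-weighted pooled SD and the usual small-sample correction 1 - 3 / (4·df - 1)):

import math

def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d with the simple pooled SD: sqrt((sd1^2 + sd2^2) / 2)."""
    sd_pooled = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / sd_pooled

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Hedges' g: df-weighted pooled SD plus a small-sample correction."""
    df = n1 + n2 - 2
    sd_pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    return (m1 - m2) / sd_pooled * (1 - 3 / (4 * df - 1))

# The CES-D example above (control minus intervention, so that a positive
# d means the intervention group ended up with lower depression scores):
print(round(cohens_d(16.43, 9.94, 11.79, 8.76), 2))  # about 0.50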
Unfortunately, not all trials give the exact data that are needed for the
calculation of effect sizes. Sometimes the M is reported, but not the SD.
However, the SD can also be calculated from other statistics, such as the
standard error (SE) or the 95% confidence interval (CI) around the mean.

The formula to calculate the SD from the SE is: SD = SE × √N.

If the 95% CI around the mean is given, the SD can be calculated with
the following formula: SD = ((M - CIlower) / 1.96) × √N. In this formula
CIlower indicates the lower bound of the 95% CI.
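Continuing the sketch above, these two conversions could look like this (again only an illustration):

import math

def sd_from_se(se, n):
    """Standard deviation from the standard error: SD = SE * sqrt(N)."""
    return se * math.sqrt(n)

def sd_from_ci(mean, ci_lower, n):
    """SD from the lower bound of a 95% confidence interval around the mean."""
    return (mean - ci_lower) / 1.96 * math.sqrt(n)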


Cohen's d and Hedges' g can also be calculated from statistical tests in
which the difference between the two groups is tested, for example from
the t-value of a t-test, or from the p-value indicating the significance of
the difference between the two groups. It is beyond the scope of this book
to provide the statistical details of these calculations, and most software
packages offer help with the conversion of these statistics into effect
sizes.
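For the t-test the conversion is simple enough to show here; a sketch, assuming an independent-samples t-test with group sizes n1 and n2:

import math

def d_from_t(t, n1, n2):
    """Cohen's d from an independent-samples t statistic: d = t * sqrt(1/n1 + 1/n2)."""
    return t * math.sqrt(1 / n1 + 1 / n2)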
The formulas to calculate effect sizes given above are only valid when
the effect size is based on the difference between two groups, like a
treatment and a control group. If an effect size is calculated that measures
the improvement from pre-test (baseline) to post-test, then these formulas
cannot be used as described above. The reason is that the pre- and post-test
scores are not independent of each other, and the correlation between the
pre- and post-test scores is needed to calculate the effect size.
Unfortunately, this correlation is hardly ever reported in trials, and
therefore researchers typically assume a value for it (for example 0.7).
An alternative is to calculate Glass's Δ as the effect size, because this
uses only the standard deviation of one condition (here the baseline
measure) and not the pooled standard deviation of both measurements. For
Glass's Δ the correlation between pre-test and post-test is therefore not
needed.

How to find the data needed for calculating effect sizes?


In order to find the data for the effect size calculations you will have
to read the article carefully. Most meta-analyses of psychological
interventions use effect sizes indicating the difference between an
intervention and a control condition in which participants don't receive
the intervention, or in which they receive an alternative intervention
(such as medication or exercise). This difference is measured after the
experimental intervention has ended. The data for calculating the effect
sizes are often found in a table.
For example, a recent randomized trial examined the effects of a
web-based guided self-help intervention for employees with depressive
symptoms (Geraedts, Kleiboer, Wiezer, van Mechelen, & Cuijpers, 2014;
www.ncbi.nlm.nih.gov/pmc/articles/PMC4026573/). When you read this
paper, you will see that all the data you need for the calculation of effect
sizes are in Table 2 (www.ncbi.nlm.nih.gov/pmc/articles/PMC4026573/
table/table2/). As we said in the previous paragraph, we need the mean,
standard deviation and N of the intervention group and the comparison
group. In Table 2 of this article you find the N of the two groups in the
title of the table, and the means and standard deviations of the main
outcome measure (depression according to the CES-D) at post-test in the
first two lines with data (in the column "Posttest, mean (SD)"). Using
these data with the formula for Cohen's d, you will find that the effect
size is d = 0.25.
Unfortunately, not all studies report these outcomes very clearly. That
can be especially problematic in older papers. Sometimes they do report
the means and standard deviations of the two conditions you are comparing,
but you have to read the paper very carefully to retrieve the N of the
participants in each condition: you have to look for the number of
randomized participants in one paragraph, try to find information about
attrition in another, and find information about the table (does it report
all randomized participants or only those who reported outcomes?) in yet
another paragraph.
One also cannot always rely automatically on the data reported in the
table with the outcomes. For example, in the study we just described you
should also look carefully at which data you should select. In this table
two different values for the post-test means and standard deviations are
given: one for all randomized participants and one for the completers of
the intervention. For the calculation of the effect size you can only
choose one of these (and usually the intention-to-treat data in the first
line are preferred).


So when you have collected the data from the studies you want to include
in your meta-analysis, you have a table that could look like this (for the
first three studies; "ther" indicates the therapy group and "ctr" the
control group). This is enough to calculate effect sizes for each study
and pool them across studies.

Study Outcome Mther SDther Nther Mctr SDctr Nctr
Study 1 Instrument A 100 10 90 120 12 95
 Instrument B 50 6 90 60 9 95
Study 2 Instrument A 80 15 30 95 17 26
 Instrument C 15 3 30 17 4 26
Study 3 Instrument D 44 5 56 49 7 59

Interpreting effect sizes


One of the big advantages of working with effect sizes is that they
allow you to look at the size of an effect, instead of only at the question
whether two conditions differ significantly from each other. Significance
testing does not assess how large an effect is, but only whether a
difference between groups is significant. That in itself is not very
informative, however, because a statistical test depends on the sample
size, the effect size, and its variance. Thus large studies, compared to
small studies, are more likely to find statistically significant effects.
In contrast to the p-value, an effect size captures the size of an effect,
regardless of its significance (Cuijpers, Turner, Koole, van Dijke, &
Smit, 2014).
Effect sizes can be considered small, moderate or large. Cohen
(1988) suggested considering effect sizes of 0.20 as small, 0.50 as
moderate and 0.80 as large. Lipsey & Wilson (1993) collected several
hundred meta-analyses of educational and psychological interventions and
estimated, based on those meta-analyses, that effect sizes smaller than
d = 0.32 are small, 0.33 to 0.55 moderate, and larger than 0.56 large.
Although these thresholds give an indication of the size of an effect,
effect sizes are still statistical concepts and do not automatically say
something about the clinical relevance of a finding. As explained earlier,
an effect size indicates the difference between two groups in terms of
standard deviations. But whether that is clinically relevant cannot be
determined from the size of the effect alone. For example, an effect size
of 0.1 in terms of years of survival would be considered by most clinicians
a very important and strong effect, whereas the same effect size of 0.1 in
terms of social skills or knowledge about mental health would likely not
be considered clinically meaningful by most clinicians (Cuijpers, Turner,
et al., 2014). Thus, there is little correspondence between the effect size
and its clinical relevance. It has been suggested that an effect size of
0.5 can be seen as a generic threshold for clinical relevance (Fournier et
al., 2010; Kirsch et al., 2008; National Institute for Clinical Excellence,
2009), but this is inaccurate and misleading, because it does not take into
account the clinical relevance of the outcome measure.
Another disadvantage of the effect size is that it is difficult to explain
its clinical relevance to patients and clinicians. If a clinician wants to
explain to a patient what an effect size of g = 0.5 means, he or she would
have to say something like: "If you get this treatment, the average patient
will score 0.5 standard deviation better than patients who do not get the
treatment." And then the clinician has to explain what a standard deviation
is. It can hardly be expected that patients then know what they can expect
from this treatment.


Table 4.1 Conversion of effect sizes to numbers-needed-to-be-treated (NNT) a)

d NNT d NNT d NNT d NNT d NNT d NNT


0.01 166.67 0.26 6.85 0.51 3.55 0.76 2.44 1.01 1.91 1.26 1.59
0.02 83.33 0.27 6.58 0.52 3.50 0.77 2.42 1.02 1.89 1.27 1.59
0.03 62.5 0.28 6.41 0.53 3.42 0.78 2.39 1.03 1.87 1.28 1.58
0.04 45.45 0.29 6.17 0.54 3.36 0.79 2.36 1.04 1.86 1.29 1.57
0.05 35.71 0.30 5.95 0.55 3.31 0.80 2.34 1.05 1.85 1.30 1.56
0.06 29.41 0.31 5.75 0.56 3.25 0.81 2.30 1.06 1.83 1.31 1.55
0.07 25.00 0.32 5.56 0.57 3.18 0.82 2.28 1.07 1.82 1.32 1.54
0.08 21.74 0.33 5.43 0.58 3.14 0.83 2.26 1.08 1.81 1.33 1.53
0.09 20.00 0.34 5.26 0.59 3.09 0.84 2.23 1.09 1.79 1.34 1.52
0.10 17.86 0.35 5.10 0.60 3.05 0.85 2.21 1.10 1.77 1.35 1.52
0.11 16.13 0.36 5.00 0.61 2.99 0.86 2.19 1.11 1.76 1.36 1.51
0.12 14.71 0.37 4.85 0.62 2.96 0.87 2.16 1.12 1.75 1.37 1.50
0.13 13.51 0.38 4.72 0.63 2.91 0.88 2.15 1.13 1.74 1.38 1.49
0.14 12.82 0.39 4.59 0.64 2.86 0.89 2.13 1.14 1.72 1.39 1.48
0.15 11.90 0.40 4.50 0.65 2.82 0.90 2.10 1.15 1.71 1.40 1.47
0.16 11.11 0.41 4.39 0.66 2.78 0.91 2.08 1.16 1.70 1.41 1.47
0.17 10.42 0.42 4.27 0.67 2.75 0.92 2.07 1.17 1.69 1.42 1.46
0.18 9.80 0.43 4.20 0.68 2.70 0.93 2.04 1.18 1.68 1.43 1.45
0.19 9.43 0.44 4.10 0.69 2.67 0.94 2.02 1.19 1.67 1.44 1.45
0.20 8.93 0.45 4.00 0.70 2.63 0.95 2.01 1.20 1.66 1.45 1.44
0.21 8.47 0.46 3.91 0.71 2.60 0.96 1.99 1.21 1.64 1.46 1.43
0.22 8.06 0.47 3.85 0.72 2.56 0.97 1.97 1.22 1.63 1.47 1.42
0.23 7.69 0.48 3.76 0.73 2.54 0.98 1.95 1.23 1.62 1.48 1.42
0.24 7.46 0.49 3.68 0.74 2.50 0.99 1.94 1.24 1.61 1.49 1.41
0.25 7.14 0.50 3.62 0.75 2.48 1.00 1.91 1.25 1.60 1.50 1.40

a) The NNTs are calculated according to the method provided by: Kraemer HC, Kupfer
DJ. Size of treatment effects and their importance to clinical research and practice.
Biological Psychiatry 2006; 59: 990-996 (Kraemer & Kupfer, 2006)


One way to solve this problem is to convert the effect size to the
numbers-needed-to-be-treated (NNT). The NNT indicates the number of
patients that have to be treated in order to generate one additional
positive outcome (Laupacis et al., 1988). It has the advantage that the
clinical meaning of the NNT is easier to understand than that of the
effect size. In the next paragraph on dichotomous outcomes we will see
that the NNT is the inverse of the risk difference between two conditions.
So, for example, if 30% of the patients in the control group improve and
50% improve in the treatment group, the risk difference is 20% (50% - 30%)
and the NNT is 5 (= 1/0.20).
There are at least five methods to convert an effect size to the NNT,
all of which assume that the scores follow a normal or near-normal
distribution (da Costa et al., 2012; Furukawa & Leucht, 2011). Four of
these methods require an estimate of the event rate in one or both of the
conditions, and each of these four methods is superior to the fifth
(Kraemer & Kupfer, 2006). However, because this fifth method does not
need an estimate of a variable that is often not available, it is still
used in many meta-analyses. With this fifth method it is possible to
calculate an NNT for each value of the effect size; in Table 4.1 the NNT
is given for each value of the effect size.
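The conversion used for Table 4.1 can be scripted as follows; a minimal sketch, assuming the Kraemer and Kupfer formula NNT = 1 / (2·Φ(d/√2) - 1), where Φ is the standard normal cumulative distribution:

from statistics import NormalDist

def nnt_from_d(d):
    """NNT from an effect size d (Kraemer & Kupfer, 2006)."""
    auc = NormalDist().cdf(d / 2 ** 0.5)
    return 1 / (2 * auc - 1)

print(round(nnt_from_d(0.50), 2))  # 3.62, matching Table 4.1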

More outcomes in one study


Usually each study (or each comparison between two conditions)
should have one effect size, and not more. In biomedical trials it is
common to have one primary outcome measure. In many trials in the mental
health field, however, it is not defined what the primary and secondary
outcomes are, especially in older studies. That means that one trial can
have multiple outcomes that measure the same construct. In the field of
depression trials, for example, it is very common that outcomes are
reported for self-reported depressive symptoms (for example the Beck
Depression Inventory) and for clinician-rated depression severity (for
example the Hamilton Depression Rating Scale), without one of them being
the primary outcome. So what to do in such a situation?
There are several solutions to this problem:

1. One of the instruments (or a category of instruments, such as
clinician-rated measures) is chosen, and if a study does not include
this instrument, it is excluded from the meta-analysis. This solution
has the advantage that using one instrument may reduce heterogeneity
(see Step 5). Furthermore, because only one instrument is used, it is
also possible to express the effect size in exact points on the scale.
The disadvantage is that not all relevant studies may be included in
the meta-analysis, reducing the statistical power and the
representativeness of the sample of studies.

2. There is a pre-defined hierarchy in which the instruments are
included. For example, the most used or best validated instrument is
chosen first; if that instrument is not available, the second most
used or best validated is chosen, and so on. This option has the
advantage that all studies can be included (optimal power and
representativeness), but it may also lead to increased heterogeneity
because different instruments are used.

3. All instruments that measure the same outcome within a study are first pooled within the study, and then the pooled effect size for all measures together is used in the pooling across studies. This solution also has the advantage that all studies can be included (optimal power and representativeness), but it may also lead to increased heterogeneity because different instruments are used. An additional advantage is that no data within studies are wasted and that a better estimate of the effect size within the study can be obtained. One problem with this approach is that multiple outcomes within one study are not independent from each other, and the correlation between the outcomes should be accounted for when pooling them within the study.

Each of these three solutions has its pros and cons, and none of them is preferred for every meta-analysis; which solution should be chosen depends on the set of studies and their outcome measures.

Effect sizes based on dichotomous outcomes


Although effect sizes based on continuous outcomes are the most used in meta-analyses in mental health research and the social sciences, many outcomes are dichotomous. For example, a patient participating in a trial can be considered a responder to treatment (for example because he or she scores below a cut-off on a rating scale), or a patient may drop out of treatment. Such outcomes are dichotomous (yes/no) instead of continuous. In the general medical field dichotomous outcomes are used more than continuous outcomes, for example when the effect of a treatment on survival is examined (the proportion of patients who survived after treatment compared to no treatment). Dichotomous outcomes have the advantage that it is much easier for patients and clinicians to understand what the outcome means. A patient is either better or not, and he or she responded to a treatment or not, which is easier to understand than a certain score on a continuum of outcomes.
In order to understand the different types of effect sizes based on dichotomous outcomes it is important to have a look at Table 4.2, where the different options for dichotomous outcomes of a randomized trial are described. The event in this table can be any dichotomous outcome (e.g., response, drop-out, death), but to make it easier to understand we call it success (versus fail). The formulas for the major types of effect sizes based on dichotomous outcomes are also given in that table.


Software packages for meta-analyses

There are several software packages that allow users to conduct meta-analyses, including:

SPSS and SAS (meta-analysis macros for SAS and SPSS, developed by David B. Wilson, are available from: http://mason.gmu.edu/~dwilsonb/ma.html)
STATA
Review Manager (developed by the Cochrane Collaboration; available from: http://tech.cochrane.org/revman/download)
Several packages for meta-analyses in R, including metafor, mvmeta and mada (Schwarzer et al., 2015)
Comprehensive Meta-analysis

It is beyond the scope of this book to give an extensive review of these packages, but probably the easiest package to use is Comprehensive Meta-analysis. Most researchers doing meta-analyses in the social sciences and mental health use this package. A trial version is available from: www.meta-analysis.com. Data can be copied directly from any spreadsheet program, there are many options available for calculating effect sizes, and most of the relevant analyses are available and easy to use. There are also video tutorials available for most basic and advanced analyses (www.meta-analysis.com/pages/videotutorials.php).


The Relative Risk (RR) is the risk that participants in the treatment group have the event (or success), divided by the risk that participants in the comparison group have the event. The risk is simply the proportion of participants who have the event. If the RR is 1, the risks for the event in the treatment and comparison groups are exactly the same, so there is no difference between the treatment and comparison group. If the 95% confidence interval of the RR does not include 1, the RR is significant at the p<0.05 level.

Table 4.2 Possible dichotomous outcomes in a randomized controlled trial

            Event (success)   No event (fail)   Total
Therapy           a                  b          (a+b)
Control           c                  d          (c+d)

Odds ratio (OR) = (a*d) / (b*c)
  = odds of success in the treatment group / odds of success in the comparison group

Relative risk (RR) = (a / (a+b)) / (c / (c+d))
  = risk of success in the treatment group / risk of success in the comparison group

Risk difference (RD) = (a / (a+b)) - (c / (c+d))
  = risk in the therapy group - risk in the control group

Numbers-needed-to-treat (NNT) = 1/RD = one divided by the risk difference
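
To make these formulas concrete, they can be computed directly from the four cell counts of Table 4.2. A minimal sketch in R (the function name and the example counts, taken from the worked example below, are ours):

    # Effect sizes for a 2x2 table: a/b = events/non-events in the
    # therapy group, c/d = events/non-events in the control group
    dichotomous_es <- function(a, b, c, d) {
      rd <- a / (a + b) - c / (c + d)           # risk difference
      list(OR  = (a * d) / (b * c),             # odds ratio
           RR  = (a / (a + b)) / (c / (c + d)), # relative risk
           RD  = rd,
           NNT = 1 / rd)                        # numbers-needed-to-treat
    }
    dichotomous_es(a = 30, b = 20, c = 10, d = 40)
    # OR = 6, RR = 3, RD = 0.40, NNT = 2.5 (the example used in the text below)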


So suppose that a trial shows that 30 out of the 50 participants in a treatment group respond, while in the control group only 10 of the 50 participants respond. In that case the RR is 30/50 (the risk in the treatment group) divided by 10/50 (the risk in the control group), which is 0.6 divided by 0.2, and that equals 3. This RR of 3 means that the chance of response is 3 times higher in the treatment group than in the control group.
The RR does not imply a direction for the outcome. So it is also possible to divide the proportion of responders in the control group by the proportion in the treatment group, exactly the opposite. In that case the RR is 0.33 (=0.2/0.6). So the RR of 3 and the RR of 0.33 have the same meaning, except that the direction of the comparison was reversed.

The Odds Ratio (OR) is somewhat more complicated to understand. The OR represents the odds that an event will occur in the treatment group, compared to the odds of the event occurring in the comparison group. The odds itself is the ratio of the probability that a particular event will occur to the probability that it will not occur. The OR can be any number between zero and infinity.
The OR was developed as a statistic that resembles the RR. Just as with the RR, an OR of 1 indicates that the odds for an event in the treatment group are the same as the odds in the comparison group, and when the 95% confidence interval of the OR does not include 1, the OR is significant at the p<0.05 level. The OR and RR are different concepts, but the value of the OR approximates the value of the RR when the event occurs in 10% of the cases or less.

The Risk Difference (RD) indicates the difference between the risk for the event in the treatment group and the risk for the event in the comparison group. So in the example above, the risk in the treatment group was 60% (30/50) and the risk in the control group was 20% (10/50). This means that the RD in this case was 40%.


As indicated earlier in this paragraph, the Numbers-needed-to-be-treated (NNT) indicates the number of patients that have to be treated in order to generate one more event (positive outcome) than no treatment (Laupacis et al., 1988). The NNT is the inverse of the RD (Table 4.2). In our example, the NNT = 1/RD = 1/0.40 = 2.5. This means that in this example, on average 2.5 patients have to be treated in order to have one more success than in the control group.

The RR and OR can be pooled in meta-analyses and can be considered to be effect sizes (although in the social science and mental health literature the term effect size is typically used for Cohen's d and Hedges' g). Risk differences should preferably not be pooled, because these are absolute outcomes and pooling them typically results in very high levels of heterogeneity (Higgins & Green, 2011).
In order to calculate the RR or OR for each study, the papers about the studies should again be examined carefully. The table below gives (simulated) examples of the data that should be collected for calculating RRs and ORs that can be pooled in a meta-analysis. Success can stand for any dichotomous outcome (such as recovery, but also relapse or survival).

Table 4.3 Examples of data to calculate effect sizes for dichotomous outcomes

             THERAPY                CONTROL
          Nsuccess   Ntotal      Nsuccess   Ntotal
Study 1       8         75          15         80
Study 2      12        100          16        120
Study 3       3         40           5         35
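
Data such as these can be entered directly into one of the R packages mentioned earlier in this chapter. A sketch using metafor, assuming it is installed; the RRs are computed and pooled on the log scale and back-transformed afterwards:

    library(metafor)

    dat <- data.frame(study = c("Study 1", "Study 2", "Study 3"),
                      ai  = c(8, 12, 3),      # N success, therapy
                      n1i = c(75, 100, 40),   # N total, therapy
                      ci  = c(15, 16, 5),     # N success, control
                      n2i = c(80, 120, 35))   # N total, control

    # Compute log relative risks and their sampling variances,
    # then pool them with a random effects model
    dat <- escalc(measure = "RR", ai = ai, n1i = n1i,
                  ci = ci, n2i = n2i, data = dat)
    res <- rma(yi, vi, data = dat, method = "REML")
    exp(coef(res))  # back-transform the pooled log RR to an RR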


Pooling of effect sizes


When you have collected the data for the effect size calculations and you have calculated the effect sizes for each of the included studies, you can calculate the mean of these effect sizes. Calculating the mean of the effect sizes is also called pooling. The idea is that the pooled effect size is the best estimate of the true effect size that can be made on the basis of all currently available evidence. Pooling the results of a set of individual studies not only yields the best estimate of the true effect size; it has additional advantages.
Pooling makes it possible to detect smaller effect sizes than would be possible with individual studies. That is because the results of multiple studies are integrated, and the statistical power of these integrated studies is therefore higher. Pooling also makes it possible to examine whether the effects differ in specific subgroups of studies. Another possibility is to examine whether continuous characteristics of studies are associated with the effect size, for example whether the effect size is related to publication year or to the number of sessions in an intervention. In the next chapter we will discuss these moderator analyses (subgroup analyses and metaregression analyses).
It is not an option to simply calculate the unweighted mean of the effect sizes of the individual studies, because then small studies would have the same weight as large studies. Suppose we have two studies, one with an effect size of g=0.80 and 30 participants in the trial, and another with an effect size of g=0.20 and 400 participants in the trial. If we simply calculated the mean we would end up with a pooled effect size of g=0.50. However, the trial with 400 participants probably gives a better estimate of the true effect size, and therefore this study should have more weight than the smaller study.
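
The effect of weighting can be illustrated with a small sketch in R, using simple inverse-variance (fixed effect) weights and a standard large-sample approximation of the sampling variance of g; the numbers follow the example above, assuming equal group sizes:

    # Two studies: g = 0.80 with 30 participants (15 per group)
    # and g = 0.20 with 400 participants (200 per group)
    g  <- c(0.80, 0.20)
    n1 <- c(15, 200); n2 <- c(15, 200)

    # Standard approximation of the sampling variance of g
    vi <- (n1 + n2) / (n1 * n2) + g^2 / (2 * (n1 + n2))

    w <- 1 / vi           # inverse-variance weights
    sum(w * g) / sum(w)   # weighted mean: about 0.24, not 0.50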
One key issue when pooling effect sizes is the variability among studies, which is also called heterogeneity. For the analyses and the interpretation of the results of meta-analyses, heterogeneity is a key issue. There are several types of heterogeneity. Clinical heterogeneity refers to variability among the participants, interventions and outcomes of the included studies. Methodological heterogeneity refers to variability in study design and risk of bias.
Statistical heterogeneity refers to the variability in the effect sizes that are found for the included studies. When statistical heterogeneity is present, the observed effect sizes are more different from each other than would be expected due to chance (random error) alone. Statistical heterogeneity is the result of clinical or methodological heterogeneity or both. In meta-analysis, statistical heterogeneity is typically referred to simply as heterogeneity.
Heterogeneity is a very important issue, especially in meta-analyses in the social sciences and mental health, because the studies typically vary considerably, and clinical, methodological and statistical heterogeneity are often high. Examining sources of (statistical) heterogeneity is therefore a key issue in meta-analyses. In the next chapter we will describe the different methods of exploring possible sources of heterogeneity. But heterogeneity also has to be understood when pooling studies, because its presence is the key criterion for deciding which model will be used to pool the data.

When can effect sizes be pooled?


When the number of studies is too small, pooling does not make sense. But when are there enough studies for pooling? There is no straightforward answer to this question. Whether there are enough studies depends on the number of studies, the number of participants per study, the quality or risk of bias of the studies, and the heterogeneity (statistical and clinical) of the set of studies.
It is possible to do power calculations for meta-analyses to estimate how many studies with how many participants are needed to find a specific effect size (Borenstein, Hedges, Higgins, & Rothstein, 2009). The table below shows how many studies are needed with 20, 30, 40, or 50 participants per condition to find effect sizes between d=0.20 and d=0.60. Important for such power calculations is the between-study variance (Tau-square). These calculations were done for low, medium and high between-study variance and for a power (1 - beta) of 0.80 and 0.90.
The number of studies and participants is not enough to decide whether pooling of studies is useful. If the majority of studies has a high risk of bias and/or if clinical and statistical heterogeneity is high, then pooling may still not be useful. There are no good guidelines for when heterogeneity and risk of bias are acceptable enough for pooling. If pooling is not useful, it is still possible to write a systematic review without pooling the effect sizes.


Table 4.4 The number of studies needed to find effect sizes, for low,
medium and high between study variance and power of 0.80 and 0.90

                Power = 0.80                       Power = 0.90
                Between-study variance             Between-study variance
     N per                               N per
d    condition  Low   Medium   High      condition  Low   Medium   High
0.2 20 26 32 39 20 35 44 52
30 18 22 26 30 23 29 35
40 13 17 20 40 18 22 26
50 11 13 16 50 14 18 21

0.3 20 12 15 18 20 16 20 23
30 8 10 12 30 11 13 16
40 6 8 9 40 8 10 12
50 5 6 7 50 7 8 10

0.4 20 7 9 10 20 9 11 14
30 5 6 7 30 6 8 9
40 4 5 5 40 5 6 7
50 3 4 4 50 4 5 6

0.5 20 5 6 7 20 5 8 9
30 3 3 5 30 4 5 6
40 3 3 4 40 3 4 5
50 2 3 3 50 3 3 4

0.6 20 3 4 5 20 4 5 6
30 2 3 3 30 3 4 4
40 2 2 3 40 2 3 3
50 2 2 2 50 2 2 3
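
The values in Table 4.4 can be approximated with a short power calculation. A sketch in R, under the assumption (following Borenstein et al., 2009) that low, medium and high between-study variance correspond to a Tau-square of one third, two thirds and the full within-study variance; results may differ from the table by a study or so because of rounding conventions:

    # Approximate number of studies needed for a random effects
    # meta-analysis (normal approximation, two-sided alpha = 0.05)
    studies_needed <- function(d, n, tau2_ratio, power = 0.80) {
      v <- 2 / n + d^2 / (4 * n)       # variance of d, n per condition
      tau2 <- tau2_ratio * v           # between-study variance
      z <- qnorm(0.975) + qnorm(power)
      ceiling((v + tau2) * (z / d)^2)  # number of studies needed
    }
    # d = 0.2, 20 participants per condition, low variance, power 0.80:
    studies_needed(0.2, 20, tau2_ratio = 1/3)  # about 26-27; the table gives 26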


The random and the fixed effects model


The results of individual studies can be pooled according to two models, the fixed effect and the random effects model. In the fixed effect model it is assumed that all studies are exact replications of each other and all studies share a common (true) effect size (Borenstein et al., 2009). All factors that could influence the effect size are exactly the same in all studies. Because all studies estimate the same identical (true) effect size, the observed effect sizes vary only because of the random error inherent in each study.
In the random effects model, effect sizes can also differ from each other because of random error (as in the fixed effect model). But in addition they differ because the true effect sizes themselves are sampled from a population of effect sizes. When a group of studies is included in a random effects model it is not assumed that they are identical, and there is no assumption that the effect size is the same for all studies. Each study is allowed its own underlying true effect size, because there are differences between studies in terms of participants, interventions and design.
Most randomized trials in a specific subfield of mental health or social science cannot be assumed to be exact replications of each other, and most trials do differ from each other to a certain extent. In meta-analyses in the field of mental health and the social sciences, the random effects model should therefore typically be preferred over the fixed effect model, unless it is very clear that the included trials are indeed exact replications of each other.
The 95% confidence intervals around the effect size found in random effects models are usually broader than in the fixed effect model, and are therefore more conservative. When heterogeneity between studies is low or zero, the outcomes of the fixed effect and the random effects model are the same or almost the same.
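
In the metafor package in R, the two models differ only in the method argument. A minimal sketch, assuming a data frame dat with one effect size (yi) and its sampling variance (vi) per study:

    library(metafor)

    # dat is assumed to contain one effect size (yi) and its
    # sampling variance (vi) per study, e.g. from escalc()
    fixed  <- rma(yi, vi, data = dat, method = "FE")    # fixed effect model
    random <- rma(yi, vi, data = dat, method = "REML")  # random effects model

    # With low heterogeneity the two pooled estimates are (almost)
    # identical; with heterogeneity the random effects CI is wider
    summary(fixed); summary(random)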
Some researchers conducting meta-analyses assume that the fixed effect model should be used when heterogeneity is low (in the next chapter we will explain how heterogeneity can be calculated). When doing that, the outcomes of the random and fixed effect models will be comparable, but when there are differences between the studies it is still better to use the random effects model. The decision to use the fixed effect or the random effects model should be based on knowledge about the studies, and whether they share a common effect size, not on a statistical test of heterogeneity.
From this perspective it is also important to stress that the confidence intervals around the level of heterogeneity are often broad. Even when the estimated heterogeneity is low, its 95% confidence interval could still include a high level of heterogeneity, and it is therefore uncertain whether heterogeneity is indeed low (Ioannidis, Patsopoulos, & Evangelou, 2007). When using the random effects model it is important to examine possible sources of heterogeneity. Are these studies indeed heterogeneous, and can we find explanations for this heterogeneity? In the next chapter we will focus on the methods that are available for examining heterogeneity.

The forest plot: An excellent summary of a meta-analysis


The so-called forest plot is an excellent summary of a meta-analysis (Lewis & Clarke, 2001). In the figure below a typical forest plot is presented. It is the result of a meta-analysis of the effects of programs aimed at reducing the stigma associated with mental disorders (Griffiths, Carron-Arthur, Parsons, & Reid, 2014); it is freely available at www.ncbi.nlm.nih.gov/pmc/articles/PMC4102289.
This meta-analysis contained 19 trials examining the effects of anti-stigma programs on personal stigma. The studies, the effect size for each study, and the 95% CI around each effect size are presented in the figure, and the pooled effect size is also presented (on the line labeled Overall).
Note that the length of the line for the 95% CI indicates the size of the study. The larger the study, the narrower the 95% CI will be, and the more precisely the effect size can be estimated. So, broad 95% CIs indicate small studies and narrow 95% CIs indicate large studies.

Personal stigma and social distance

Study                     Standardized effect size (95% CI)
Kiropoulos et al (41)      0.90 (0.60 to 1.19)
Wood & Wahl (48)           0.65 (0.27 to 1.02)
Farrer et al (32)          0.58 (0.00 to 1.15)
Campbell et al (25)        0.46 (0.04 to 0.88)
Finkelstein et al (33)     0.42 (0.03 to 0.81)
Jorm et al (38)            0.39 (0.03 to 0.75)
Jorm et al (39)            0.38 (-0.20 to 0.95)
Bayar et al (21)           0.27 (0.00 to 0.54)
Penn et al (52)            0.27 (-0.18 to 0.72)
Irvine et al (23)          0.22 (-0.09 to 0.52)
Dias-Vieira (50)           0.21 (0.03 to 0.38)
Corrigan et al (29)        0.20 (-0.19 to 0.59)
Kitchener & Jorm (42)      0.17 (-0.08 to 0.42)
Gulliver et al (37)        0.14 (-0.67 to 0.95)
Jorm et al (51)            0.14 (-0.11 to 0.38)
Corrigan et al (27)        0.10 (-0.35 to 0.55)
Brown et al (24)           0.08 (-0.35 to 0.51)
Griffiths et al (17)       0.00 (-0.23 to 0.23)
Sharp (47)                -0.06 (-0.43 to 0.32)
Overall                    0.28 (0.17 to 0.39)

(Effect sizes plotted on an axis from -1 to 1.5.)
From: Griffiths et al., World Psychiatry 2014

For example, in the figure the study by Gulliver et al. (Gulliver et al., 2012) is clearly a small study, while the study by Griffiths et al. (Griffiths, Christensen, Jorm, Evans, & Groves, 2004) is large.
In this forest plot the studies are presented according to the size of the effects, with the first study having the highest effect size and the last study the lowest. Other meta-analyses present the studies in alphabetical order, or according to the year in which they were published (as in cumulative meta-analyses, see next chapter). But the order in which the studies are presented is not important for understanding how the forest plot can be seen as the core of a meta-analysis.
Below this same figure is given again, but with some illustrative points. The pooled effect size is indicated with a black dot on the last line of the plot, and the 95% CI is the line drawn through this dot. As can be seen from this figure, the lower and upper limits of the 95% CI around the pooled effect size are 0.17 and 0.39. The blue vertical lines indicate these limits of the 95% CI.
The 95% CI of the first study (Kiropoulos, Griffiths, & Blashki, 2011) does not overlap with the blue lines, and therefore the 95% CI of this study does not overlap with the pooled 95% CI of all studies. When this is the case, such a study is often considered to be an outlier. An outlier is a study that differs considerably from the other studies in a meta-analysis. There is not one best way of identifying outliers, but one simple method is to look at whether the 95% CI of the study overlaps with the 95% CI of the pooled effect size. In the next chapter we will discuss how to handle potential outliers when examining heterogeneity in a meta-analysis. Here it is sufficient to illustrate how an outlier can be identified.

[The same forest plot for personal stigma and social distance, repeated with annotations pointing out the 95% CI around each effect size, the effect size and 95% CI of one individual study, and the pooled effect size and 95% CI on the line Overall.
From: Griffiths et al., World Psychiatry 2014]


In brief, the forest plot is an excellent summary of a meta-analysis, with a summary of each individual study (effect size + 95% CI) and the pooled outcome, and it can be used to identify potential outliers.

Sensitivity analyses
When a meta-analysis is conducted, many decisions are made about the inclusion of specific studies, participants, outcomes and designs. For example, in the next chapter we will give an example of a meta-analysis we conducted on psychological treatments of depression in old age. But it is not clear how old age should be defined: some studies include only people older than 55, others use 60 or 65 years as the cut-off for being included in the trial. In psychological treatments, different outcome measures are usually used that measure the same construct. Also, studies with different levels of risk of bias may be included.
Sensitivity analyses can be helpful for examining whether such decisions have affected the outcomes. It is, for example, in most cases useful to limit the analyses to studies with low risk of bias to see if that leads to different outcomes than when all studies (including the ones with higher risk of bias) are included. And when multiple outcome measures are used to examine the effects of the interventions, sensitivity analyses can examine whether specific measures lead to different results. Any other decision made in the meta-analysis can be examined in sensitivity analyses in the same way.
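
In practice, such sensitivity analyses often amount to re-running the pooling on a subset of the studies. A sketch with metafor's subset argument; the column names rob and instrument are hypothetical:

    library(metafor)

    # Full analysis versus a sensitivity analysis limited to
    # studies with low risk of bias (rob is a hypothetical column)
    res_all <- rma(yi, vi, data = dat, method = "REML")
    res_low <- rma(yi, vi, data = dat, method = "REML",
                   subset = (rob == "low"))

    # The same logic works for any other decision, e.g. only
    # studies that used a particular outcome instrument:
    res_bdi <- rma(yi, vi, data = dat, method = "REML",
                   subset = (instrument == "BDI"))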


Key points
Cohen's d indicates the difference between the treated and the control group at post-test in terms of standard deviations
Effect sizes (d) of 0.2 are small, 0.5 moderate and 0.8 large
Many other statistics can be pooled in meta-analyses
Pooling means the statistical integration of the results of multiple studies into one overall effect size
Whether or not it is useful to pool effect sizes across studies depends on the number of studies, the number of participants per study, the heterogeneity of the set of studies and the risk of bias of the set of studies
The forest plot gives a good summary of a meta-analysis, with the effect size for each study, the 95% confidence interval (which also indicates the size of the study), the variation in effect sizes, outliers and the pooled effect size

Step 5

Examining heterogeneity and potential publication bias

Step 5. Examining heterogeneity and potential publication bias

In the previous chapter we defined (statistical) heterogeneity as the variability in the effect sizes that are found for the studies included in a meta-analysis, and noted that when it is present, the observed effect sizes are more different from each other than would be expected due to chance (random error) alone. We also saw that exploring heterogeneity is a key issue when conducting meta-analyses. In this chapter we will describe the different methods that can be used to examine heterogeneity and possible sources of heterogeneity. We will also describe how possible publication bias, or small sample bias, can be examined using the funnel plot.

Visual inspection of the forest plot


The first method to examine heterogeneity is to look carefully at the forest plot. In the previous chapter we explained how the forest plot is an excellent summary of a meta-analysis that reveals many things about the individual studies, but also about heterogeneity.
As a further illustration of how the forest plot reveals heterogeneity, we present a few examples. In Figure 5.1 you can see the forest plot from a meta-analysis examining the effects of psychotherapy for adult depression compared with pill placebo control groups (Cuijpers, Turner, et al., 2014). In this meta-analysis there are no outliers (defined as studies whose 95% confidence interval does not overlap with the 95% confidence interval of the pooled effect size). The effect sizes do differ, but the large studies are quite close to the mean effect size (Barrett et al., 2001; Williams et al., 2000). Some small studies deviate more from the mean effect size (Jarrett et al., 1999; Mynors-Wallis, Davies, Gray, Barbour, & Gath, 1997), but because they are smaller that could be explained by chance. So visual inspection suggests there is not much heterogeneity, and as we will see later in this chapter, that is true (although the uncertainty around this estimate is also high).

Figure 5.1 Forest plot for psychotherapy versus placebo for adult depression (Cuijpers, Turner, et al., 2014)

Study                    g       95% CI
Barber, 2011 0,06 -0,33 to 0,45
Barrett, 2001 -0,00 -0,31 to 0,31
DeRubeis, 2005 0,31 -0,05 to 0,67
Dimidjian, 2006 BA 0,23 -0,20 to 0,67
Dimidjian, 2006 CT 0,27 -0,17 to 0,71
Elkin, 1990 CBT 0,23 -0,12 to 0,59
Elkin, 1990 IPT 0,36 0,00 to 0,71
Hegerl, 2009 0,34 0,01 to 0,67
Jarrett, 1999 0,58 0,11 to 1,05
Mynors-Wallis, 1995 0,68 0,14 to 1,21
Sloane, 1985 0,08 -0,59 to 0,75
Williams, 2000 0,20 -0,03 to 0,44
Pooled 0,25 0,14 to 0,35
(Hedges' g plotted from -1,00 to 1,00; left: favours placebo, right: favours therapy)

Now compare this with the forest plot that is presented in Figure 5.2. This is from a meta-analysis of psychological treatments of depression in older adults compared with control groups (waiting list, care-as-usual, placebo). We have marked the 95% confidence interval around the pooled effect size with two vertical red lines, so it is easier to see which studies are outliers. The outliers (the studies in which the 95% confidence interval of the effect size does not overlap with that of the pooled effect size) are marked with red ovals. Large studies have narrower 95% confidence intervals, and as can be seen from the figure, some large studies are also outliers (e.g., Joling et al., 2011; Williams et al., 2000). That is remarkable, because it could be expected that large studies make better estimates of the true effect size; so if large studies are outliers, then it is very probable that heterogeneity is high. That, combined with the fact that the number of outliers is considerable, suggests that heterogeneity is indeed high. Later in this chapter we will show methods to quantify heterogeneity, and there we will see that heterogeneity is indeed very high in this meta-analysis.

Figure 5.2 Forest plot for psychotherapy for depression in older adults versus control groups (Cuijpers, Karyotaki, Pot, Park, & Reynolds, 2014)

Study                    g      95% CI          p
Arean,1993 - pst 1,35 0,65 - 2,05 0,00
Arean,1993 - rem 0,45 -0,12 - 1,03 0,12
Burns, 2007 0,40 -0,03 - 0,84 0,07
Chan, 2013 1,43 0,59 - 2,27 0,00
Choi, 2012 - ftf 0,30 -0,28 - 0,89 0,30
Choi, 2012 - tele 0,43 -0,14 - 1,01 0,14
Ekkers, 2011 0,59 0,16 - 1,01 0,01
Floyd, 2004 - bibl 0,76 -0,00 - 1,52 0,05
Floyd, 2004 - ind 1,56 0,60 - 2,52 0,00
Fry, 1983 - str rem 3,17 2,61 - 3,74 0,00
Fry, 1983 - unstr rem 1,73 1,29 - 2,17 0,00
Gitlin, 2013 0,46 0,19 - 0,74 0,00
Haringsma, 2005 0,45 0,07 - 0,83 0,02
Hautzinger, 2004 0,94 0,47 - 1,40 0,00
Heckman, 2011 -cop 0,38 0,09 - 0,67 0,01
Heckman, 2011 - sup 0,25 -0,04 - 0,53 0,09
Joling, 2011 0,10 -0,20 - 0,40 0,50
Korte, 2012 0,51 0,23 - 0,79 0,00
Laidlaw, 2008 0,42 -0,20 - 1,03 0,18
Lamers, 2010 0,26 -0,00 - 0,51 0,05
Landreville, 1997 0,30 -0,50 - 1,10 0,46
Mossey, 1996 0,25 -0,28 - 0,78 0,36
Pot, 2010 0,39 0,09 - 0,69 0,01
Preschl, 2012 0,71 0,05 - 1,38 0,04
Scogin, 1987 1,34 0,33 - 2,36 0,01
Scogin, 1989 -bt 0,34 -0,29 - 0,97 0,29
Scogin, 1989 - cbt 1,05 0,41 - 1,69 0,00
Serfaty, 2009 0,17 -0,19 - 0,53 0,35
Serrano, 2004 0,95 0,33 - 1,57 0,00
Serrano, 2012 0,39 -0,53 - 1,30 0,41
Sloane, 1985 0,11 -0,56 - 0,79 0,74
Snarksi, 2011 -0,12 -0,83 - 0,59 0,74
Spek, 2007 cbt 0,30 0,03 - 0,58 0,03
Teri, 1997 - ba 0,70 -0,06 - 1,46 0,07
Teri, 1997 - pst 0,88 0,09 - 1,66 0,03
Van Schaik, 2006 0,07 -0,26 - 0,40 0,67
Watt, 2000 -integr 1,90 0,79 - 3,01 0,00
Watt, 2000 -instr 1,55 0,52 - 2,59 0,00
Williams, 2000 0,15 -0,13 - 0,43 0,30
Wuthrich, 2013 0,74 0,23 - 1,25 0,00
POOLED 0,64 0,47 - 0,80 0,00
(Hedges' g plotted from -0,75 to 1,50; left: favours control, right: favours therapy)


Test for homogeneity (Q) (is it significant?)


Another method to examine heterogeneity is the Q-test, which tests whether the observed effect sizes are significantly more different from each other than would be expected due to chance (heterogeneity). If the Q-test is significant, there is evidence for heterogeneity.
The problem with this test is that a non-significant result suggests that there is no heterogeneity, and that is not necessarily true. The power of the Q-test depends very much on the number of studies included in the meta-analysis. If the number of included studies is small, which is very common in meta-analyses, the test may suggest there is no heterogeneity while in fact there is simply not enough power to detect it. It is sometimes suggested to use a p-value of 0.10 for the Q-test to solve this problem, but this is not really a solution because it does not address the underlying problem of low power.

Quantifying heterogeneity
Another way to examine heterogeneity is to quantify it. The I2 statistic is the most used method to quantify heterogeneity (Higgins, Thompson, Deeks, & Altman, 2003). It is the percentage of the total variance that can be explained by heterogeneity. The formula for calculating I2 is:

I2 = ((Q - df) / Q) x 100%

I2 ranges from 0% (no heterogeneity; the differences between the effect sizes can be completely explained by chance alone) to 100% (all of the observed variance reflects true differences between studies). The advantage over the Q-test is that it does not simply test whether there is significant heterogeneity, but gives the level of heterogeneity as a percentage. In general it is assumed that 25% indicates low heterogeneity, 50% moderate and 75% high heterogeneity (Higgins et al., 2003).


It is important to realize that I2 is itself an uncertain indicator of heterogeneity. It is therefore important not only to calculate I2 itself, but also the 95% confidence interval around it (Ioannidis et al., 2007). Especially when the number of studies is small and the number of participants per study is small, the uncertainty of I2 can be considerable. For example, it is very well possible that the I2 found in a small meta-analysis is zero, while the 95% confidence interval around I2 ranges from 0 to 80%. In this case the Q-test is also not significant, and many meta-analyses would conclude that there is no heterogeneity. But in reality it is not known how much heterogeneity there is: it could be 0%, but just as well 75 or 80%.
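
In metafor, Q, I2 and the confidence interval around I2 can be obtained directly. A minimal sketch, again assuming a data frame dat with yi and vi:

    library(metafor)

    res <- rma(yi, vi, data = dat, method = "REML")
    res$QE; res$QEp   # Q statistic and its p-value (test for homogeneity)
    res$I2            # I2 as a percentage
    confint(res)      # 95% CIs for tau2, I2 and H2: often strikingly wide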

Examining the causes of heterogeneity


When a meta-analysis finds heterogeneity it is important to examine its possible sources. If heterogeneity is high and it is not possible to identify explanations for it, then there is a major problem with the meta-analysis, because in that case the mean effect size is not very meaningful. The effect sizes of individual studies can be above or below the mean effect size, and it is unclear why that is the case. In principle that means that the mean effect size cannot be interpreted very well. If results are very heterogeneous and the effect sizes point in different directions, it may even be advisable not to do a meta-analysis at all, because the results may be misleading.
As indicated earlier, when analyses are conducted with the fixed effect model, heterogeneity is ignored, because it is assumed that all studies are exact replications of each other and share a common (true) effect size. But in mental health and the social sciences the random effects model should be preferred, and ignoring heterogeneity is not a real option.
We already discussed one method to examine possible sources of heterogeneity earlier in this chapter, namely the identification of outliers. When a study is an outlier, and excluding this study from the meta-analysis results in a considerable drop in the level of heterogeneity, the cause of the heterogeneity may lie in that study. It is important to examine such a study in more detail and see why its outcomes differ so much from the other studies. These differences can concern anything, ranging from the participants, to the intervention, the outcomes, the general design of the study, or specific types of bias.

Subgroup analyses to examine sources of heterogeneity


When heterogeneity is found in a meta-analysis it is very well possible that this heterogeneity is caused by differences between subgroups of studies. In subgroup analyses the total set of studies is divided into two or more subgroups, and for each of these subgroups the effect size is calculated, as well as the heterogeneity. Furthermore, a test is conducted to examine whether the effect sizes of the subgroups differ significantly from each other. Usually these subgroup analyses are done with a mixed effects model, in which the effect sizes within the subgroups are pooled with a random effects model, while the test of whether the effect sizes differ significantly between the subgroups is done with a fixed effect model.
For example, in one meta-analysis we examined the effects of Internet-based treatments of depression and included all studies that compared a psychological intervention delivered through the Internet with a control group (Andersson & Cuijpers, 2009). When all 15 studies were taken together there was quite some heterogeneity (I2=57%) and the mean effect size was d=0.41. To examine the possible sources of this heterogeneity a series of subgroup analyses was conducted. In one of these subgroup analyses, the total set of studies was split into two groups: one in which there was no support from a professional for the patient working through the intervention, and one in which patients did receive support from a coach or psychologist, through email or telephone. The studies with professional support found an effect size of d=0.61 and low heterogeneity (24%), while the studies without professional support resulted in an effect size of d=0.25 and also low heterogeneity (10%). Furthermore, the difference between these two groups of studies was highly significant (p<0.001). None of the other subgroup analyses that were conducted resulted in a significant difference between subgroups or low levels of heterogeneity within the subgroups.
This subgroup analysis found exactly what subgroup analyses are meant for in this context, namely differences in effect sizes between subgroups of studies and lower levels of heterogeneity within these subgroups. The results of this example suggest that the overall pooled effect size of all studies together is not very useful, and that it is better to look separately at the effect sizes of the studies with support and those without support, as these are probably two different types of interventions.
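
A subgroup analysis of this kind can be run as a moderator test. A sketch with metafor, where support is a hypothetical 0/1 column indicating whether an intervention was guided:

    library(metafor)

    # Mixed effects subgroup analysis: random effects pooling within
    # subgroups, fixed effect test of the difference between them
    res_mod <- rma(yi, vi, mods = ~ factor(support), data = dat,
                   method = "REML")
    res_mod   # the QM test tests whether the subgroups differ

    # Pooled effect size and heterogeneity per subgroup:
    rma(yi, vi, data = dat, subset = (support == 1), method = "REML")
    rma(yi, vi, data = dat, subset = (support == 0), method = "REML")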
Although results like this are very interesting and useful, and may point at causes of heterogeneity, they should always be considered with caution, because they are only observational and not causal. That means that there is no strong evidence that it is indeed this type of subgroup that is responsible for the differential outcomes of the subgroups. What we see is that there is a difference between the subgroups, but that may very well be explained by other factors that were not measured. In our example it could very well be that the interventions without support are much briefer than the ones with support, and that it is in fact the length of the program that causes the differences in effects between the two subgroups of studies. Or it may be that interventions without support are simpler (because the developers assume that when patients have to work through the intervention all on their own, a simple intervention is more feasible) than the ones with support. It is also very well possible that the true cause of the difference between the subgroups is unknown and cannot be found in the published papers.
So when significant differences are found between subgroups of studies, these cannot be seen as causal evidence, and there is no real solution for this problem. In our example that means that if we want to be sure that Internet interventions with support and those without support differ significantly from each other, new randomized trials should be conducted in which patients are randomized to the same intervention with support in one arm and no support in the other arm. If these trials indeed support the difference between interventions with and without support, this can be considered strong evidence, much stronger than the observational results from the subgroup analyses.
Subgroup analyses are not exclusively conducted to examine sources of heterogeneity. When heterogeneity is low, it can still be useful to conduct subgroup analyses. First, because, as we saw earlier, the level of heterogeneity is often uncertain (broad 95% confidence intervals), especially when the number of trials and the number of participants in those trials are small. But even when heterogeneity is low and the number of trials is large, it may still be relevant to examine whether there are differences between the effect sizes of subgroups. In our example, it would have been logical, based on pre-specified hypotheses, to examine whether there are differences between Internet interventions with and without support.

Choosing subgroups for subgroup analyses


Important issues in subgroup analyses are how many subgroup analyses can be done and which subgroups should be selected for these analyses. In general it can be said that the more studies with more participants are in the meta-analysis, the more subgroup analyses can be done. But that also depends on the number of studies and participants within the subgroups that are examined, how many subgroups are examined in one analysis (for example only two types of interventions, or three or more), and the validity (risk of bias) of the studies in the total sample and in each of the subgroups. That implies that there are no fixed numbers for the number of subgroup analyses that can be done. But the general rule is: fewer subgroup analyses when the number of studies is small. When 10 studies are available in a meta-analysis, it makes little sense to do more than two or three subgroup analyses, and doing no subgroup analysis at all can also be a good choice.


The other question is which subgroups should be chosen for the subgroup analyses. In principle any characteristic of the participants, the intervention, the comparison group, the outcomes, or the design of the study can be used for subgroup analyses. When risk of bias is high in some of the studies, it is usually a good idea to at least examine whether there is a difference between the studies with low and high risk of bias (although that can also be examined with a metaregression analysis, see below). Or, even better, to examine each of the items of the risk of bias assessment tool separately in a series of subgroup analyses (because each of these items can have an independent effect on the outcomes, and summing the risk of bias items for each study may obscure this).
Unfortunately, there are no fixed rules for choosing which characteristics should be included in the subgroup analyses. Knowing a field of interventions and the studies examining their effects usually leads to enough ideas for subgroup analyses. The big risk is that meta-analytic researchers simply run all possible subgroup analyses with all characteristics of the studies that have been extracted, but select only the significant ones for reporting in their article. This is one of the reasons why it is useful to publish the protocol for a meta-analysis and to specify in advance which subgroup analyses are planned.
So when interpreting the findings of subgroup analyses it is important to see whether the analyses were planned in advance. Apart from that, it is also important to assess whether the findings of the subgroup analyses can be explained by other or external evidence. For example, our finding that Internet therapies with and without support differ from each other is in line with the assumption that human contact is needed for psychological interventions to be effective (Mohr, Cuijpers, & Lehman, 2011). That makes the finding more credible and stronger.
The size of the difference between the subgroups is also important. Statistical significance is not the only relevant issue, because it is very much dependent on power and on the number of studies and participants in those studies. As indicated earlier, the size of the difference between subgroups in terms of differential effect sizes is not the only issue that matters; the clinical meaning of that differential effect size matters as well.

Metaregression analyses
Metaregression analyses can also be used to examine sources of heterogeneity. In a bivariate metaregression analysis the association between a continuous characteristic of the studies and the effect sizes is examined. For example, the association between the effect size and the number of sessions in an intervention could be examined in a metaregression analysis. In Figure 5.3 the association between the effect size and the number of therapy sessions for therapies in adult depression is graphically represented, based on a meta-analysis in which we examined this, among other things (Cuijpers, Huibers, Ebert, Koole, & Andersson, 2013). The straight line indicates the regression line, the curved lines are the 95% confidence intervals, and the dots are the individual studies. In general, metaregression analyses should not be conducted when the number of studies is smaller than 10 (Higgins & Green, 2011).
There are different statistical methods for doing the metaregression
analyses, but we will not go into them because that is too technical. We
will focus on the main outcomes of a metaregression analysis and how to
interpret the results.
What is important in a metaregression analysis is the slope of the regression line. When the regression line is completely horizontal, there is no association between the effect size and the predictor (in this case the number of sessions of the therapy that is examined): the effect size is the same for any value of the predictor. If, however, the line is not horizontal, that indicates that the effect size differs for different values of the predictor. In our example, there was a small, positive slope for the association between the number of treatment sessions and the effect size.
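
Such a bivariate metaregression is a one-line model in metafor. A sketch, with nsessions as a hypothetical column holding the number of sessions per study; a multivariate model simply adds predictors:

    library(metafor)

    # Bivariate metaregression: effect size regressed on the number
    # of sessions (a mixed effects model with one continuous predictor)
    res_mr <- rma(yi, vi, mods = ~ nsessions, data = dat,
                  method = "REML")
    res_mr    # the coefficient for nsessions is the slope

    # A multivariate model adds further predictors, including dummies
    # (support is again a hypothetical 0/1 variable):
    res_mv <- rma(yi, vi, mods = ~ nsessions + support, data = dat,
                  method = "REML")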
The limitations that were mentioned for subgroup analyses also apply to metaregression analyses: they too only provide observational evidence.

Figure 5.3 Metaregression analysis of number of sessions as predictor of the effect size in studies examining the effects of psychotherapy for adult depression
[Scatter plot of Hedges' g (vertical axis, -0.50 to 2.50) against the number of sessions (horizontal axis, 2.5 to 25.0), with the regression line and 95% confidence bands.]

In our example, the (small) association between effect size and number of sessions cannot be considered as evidence that such an association really exists.
Apart from bivariate metaregression analyses it is also possible to do multivariate metaregression analyses, in which more than one predictor is included simultaneously. In these multivariate models it is also possible to include categorical variables, as in normal regression analyses. When categorical variables are included in metaregression analyses, they should be recoded into dummy variables (variables indicating 1 or 0, for the presence or absence of the characteristic). In our example of Internet interventions with or without support, we could simply enter one variable indicating support (1) or no support (0). But just as in normal regression analyses, it is also possible to include variables with more than two categories, where one of the categories is the reference category.


In multivariate metaregression analyses a series of (continuous and categorical) predictors can be included in one model. For example, we could examine whether the association between the effect size and the number of treatment sessions remains significant after adjusting for other characteristics of the participants, the interventions and the study. We did this and found that the association was no longer significant (Cuijpers, Huibers, et al., 2013).
Of course, multivariate metaregression analyses can only be conducted when the number of studies is large. And again, the results of these analyses are only observational.

Publication bias and other reporting biases


From the start of modern meta-analysis it has been recognized that the file drawer problem can seriously influence the outcomes of meta-analyses (Rosenthal, 1979), as was described in the Introduction of this book. The term file drawer problem is not used much anymore, but has been replaced by the more specific term publication bias. Publication bias refers to the problem that not all studies that are conducted in a certain area are actually published. Companies (especially pharmaceutical companies), but also authors, editors and journals, are inclined to favor publication of studies which show significant and large effects of interventions. If a study shows no or only small effects, the chance increases that it is not published (or that only selected, positive outcomes are reported; see the paragraph on selective outcome reporting). And that is a big problem for meta-analyses. Meta-analyses try to estimate the true effect of an intervention by integrating the results of all trials examining the effects of that intervention. If negative studies are not published, or are published less often, this may lead to a considerable overestimation of the true effect of the intervention.
There is considerable evidence for publication bias in the social sciences and mental health research. There is much evidence for publication bias among drug trials in depression (Turner, Matthews, Linardatos, Tell, & Rosenthal, 2008), anxiety disorders (Roest et al., 2015) and schizophrenia (Turner, Knoepflmacher, & Shapley, 2012). In the field of psychotherapy there is also direct evidence for publication bias. In one study we identified studies on psychotherapy for adult depression that were funded by the NIH in the United States and checked which ones were published and which were not (Driessen, Hollon, Bockting, Cuijpers, & Turner, 2015). We found that 13 (24%) of the 55 trials did not result in publications, and two other trials never started. When the results of the unpublished studies were added to the published ones, the effect size indicating the difference between psychotherapy and control groups dropped by 25%.
There is also more indirect evidence for publication bias and reporting bias. A study from the 1950s already found that 97% of all published studies in psychology rejected the null hypothesis (Sterling, 1959). More recently it was found that 92% of studies in psychology and psychiatry publish positive results (Fanelli, 2010). Furthermore, the odds of reporting a positive result were around 5 times higher for papers in psychology and psychiatry compared to space science, more than 2 times higher in the social sciences compared to the physical sciences, and more than 3 times higher in studies applying behavioral and social methodologies to people compared to physical and chemical studies on non-biological material (Fanelli, 2010). In addition, the number of studies in psychology with a p-value just below 0.05 has been found to be much higher than would be expected based on chance (Masicampo & Lalande, 2012).
Publication bias is not the only type of reporting bias (Higgins & Green, 2011). There are several other types of reporting bias, such as time lag bias (the phenomenon that some studies are published later than others, depending on the nature and direction of the results), outcome reporting bias (already mentioned in the previous chapter, because it is one of the items examined with the Cochrane Risk of Bias Assessment tool), and language bias (when studies in another language are not identified and these studies differ in terms of the nature and direction of the results). In this chapter we will focus mainly on publication bias.

Testing for publication bias with indirect methods: The funnel plot
In some cases it is possible to examine publication bias directly. For example, the Food and Drug Administration (FDA) in the USA requires that a drug has been tested in randomized controlled trials before it can be admitted to the American market. That allows researchers to compare the results that are published in scientific journals with the trials that have been conducted (and submitted to the FDA) but not published. In an important paper (Turner et al., 2008) it was shown that including the unpublished trials in a comprehensive meta-analysis of antidepressants resulted in a 25% reduction of the effect size compared with pill placebo. In the field of psychotherapy we already mentioned the study of NIH-funded trials in which we compared the published with the unpublished trials and also found a 25% drop in effect size when the unpublished studies were included in the meta-analysis (Driessen et al., 2015).
But in many cases it is not feasible to compare published with unpublished trials. In these cases it is possible to get an indirect impression of whether there is publication bias (although, as we will see later, there are also other explanations for these phenomena). The indirect estimates of publication bias are based on the assumption that large studies (with many participants) can make a more precise estimate of the effect size, while the effect sizes found in smaller studies can deviate more from the pooled effect size because they are less precise. Random variations of the effect sizes are larger in studies with few participants than in studies with many participants. This difference can be represented graphically in a funnel plot, where the effect size is represented on the horizontal axis and the size of the study on the vertical axis (Sterne et al., 2011).


Figure 5.4 Funnel plot of standard error by Hedges' g in studies comparing CBT with control groups (Cuijpers, Berking, et al., 2013)
[Two panels: without and with missing studies imputed. Vertical axis: standard error (0.0 to 0.8); horizontal axis: Hedges' g (-3 to 3).]


In Figure 5.4 we have given an example of a funnel plot. On the vertical axis is the standard error, which indicates the size of the study: the higher on the vertical axis, the larger the study in terms of participants. On the horizontal axis the effect size (Hedges' g) is given. Each of the circles represents one of the studies. This figure is based on a meta-analysis of cognitive behavior therapy (CBT) for adult depression (Cuijpers, Berking, et al., 2013).
When a study is smaller (fewer participants), its effect size can be expected to deviate more from the mean effect size because of chance: as it is smaller, it is less precise. When a study is larger, the chance is smaller that its effect size differs from the mean effect size by chance, so it will be closer to the pooled effect size. But all effect sizes plotted like this deviate from the mean effect size by chance and nothing else; the smaller studies more than the larger studies, but it is still all only chance. And if these effect sizes differ from the mean effect size only by chance, they should deviate in both directions, positive and negative.
Thus, Figure 5.4 should be symmetrical, with as many small studies (lower in the figure) on the right as on the left. But even without any testing it can already be seen that there are more studies to the right of the mean effect size (positive studies) than to the left (negative studies). This visual inspection of the funnel plot already suggests that there are more positive studies than negative ones.
There are several tests for the asymmetry of the funnel plot. Two widely used tests are Begg and Mazumdar's test (Begg & Mazumdar, 1994) and Egger's test of the intercept (Egger, Davey Smith, Schneider, & Minder, 1997). They test whether the funnel plot is symmetrical, and if they are significant, it can be concluded that there is significant publication bias (or another bias, see below). Another approach to missing studies in a funnel plot was developed by Duval and Tweedie (Duval & Tweedie, 2000). They developed a method to estimate how many studies are missing from the funnel plot, to impute the missing studies, and to estimate the effect size after imputation of these missing studies. In the second part of Figure 5.4 this method has been applied to the studies on CBT. The black dots represent the imputed studies, the ones that should have been there but are in fact not. In this case the number of imputed studies was 27 (there were 94 studies included in this meta-analysis), and after taking these imputed studies into consideration, the mean effect size indicating the difference between CBT and control groups after treatment dropped from g=0.71 to g=0.53. Egger's test and Begg and Mazumdar's test were also highly significant in this meta-analysis (both p<0.001).
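
All three procedures mentioned here are available in metafor. A minimal sketch, again assuming a data frame dat with yi and vi:

    library(metafor)

    res <- rma(yi, vi, data = dat, method = "REML")
    funnel(res)     # draw the funnel plot
    ranktest(res)   # Begg and Mazumdar's rank correlation test
    regtest(res)    # Egger's test of the intercept

    # Duval and Tweedie's trim-and-fill: impute the studies presumed
    # missing and re-estimate the pooled effect size
    trimfill(res)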
The funnel plot is a very useful tool for detecting possible publication bias, but there are also several important risks associated with its use. First of all, it requires a considerable number of studies, generally at least 30 (Lau, Ioannidis, Terrin, Schmid, & Olkin, 2006), although that also depends on the effect size and the size of the studies. Furthermore, how the funnel plot looks also depends on other factors; for example, the type of outcome (dichotomous versus continuous, and RRs versus ORs) leads to differences in funnel plots (Lau et al., 2006), as does the parameter on the vertical axis (sample size, standard error, etc.). When heterogeneity is high (as in our example), funnel plots may also lead to false interpretations (Terrin, Schmid, Lau, & Olkin, 2003).
Another issue is that the funnel plot only shows that small studies with negative effects (smaller than the mean effect) have not been published, although based on chance they should have been. It is very well possible that this is caused by publication bias, because authors, editors and journals prefer positive effects and are less interested in papers that report no or negative effects of an intervention. But the funnel plot itself cannot be considered evidence for publication bias, and a better term for the phenomenon may be 'small sample bias'. There may be other reasons why studies with small samples have larger effects than larger studies. For example, small studies may focus more on high-risk patients, have a shorter follow-up (and differ because treatment effects decrease over time), or target different populations (Lau et al., 2006).
In the case of psychological interventions it may be that the highly skilled therapists who developed a new therapy delivered the treatment themselves in a small pilot trial, while in later, larger trials the therapy was delivered by other therapists. The delivery of a new therapy by a famous professor may also raise the expectations of patients. And small pilot studies may, for example, use waiting list control groups more often than larger studies do.
In sum, the funnel plot is a useful tool to examine publication bias (or better: small sample bias), but its results should be considered with caution.


Key points
- Exploring heterogeneity is a key issue in meta-analyses
- Heterogeneity can be examined with several methods:
  - visual inspection of the forest plot
  - a test for homogeneity
  - calculating I2, an indicator of heterogeneity in percentages (the percentage of the total variance that can be explained by heterogeneity)
- I2 is very useful, but should also be considered with caution, because the uncertainty around it (the 95% confidence interval) is often considerable, especially in smaller meta-analyses
- Examining the causes of heterogeneity is important; the causes can be examined with several methods:
  - explore why outliers differ from other studies
  - conduct subgroup analyses
  - conduct bivariate metaregression analyses
  - conduct multivariate metaregression analyses
- Publication bias and other reporting biases are an important problem for meta-analyses
- The funnel plot can be used to examine publication bias or small sample bias
- But the funnel plot should be interpreted with caution, because of the requirements for funnel plot analyses and because asymmetry is not in itself evidence of publication bias

Step 6. Writing and publishing meta-analyses

When you have followed the first five steps in this book you have con-
ducted the basic steps of a meta-analysis. You have formulated a research
question according to the PICO acronym, you have searched bibliographical databases to identify the studies that can answer your research question, you have carefully selected the studies that meet the inclusion criteria,
you have extracted the data of each of these studies including character-
istics of the participants, the intervention, the comparator and risk of bias,
you have calculated effect sizes and pooled these according to the random
effects model, you have examined sources of heterogeneity in subgroup
and metaregression analyses, and you have examined small sample bias.
So when you have done all that you are ready to publish the results of your
meta-analysis. In this step we will describe the publishing of the protocol
for your meta-analysis, the PRISMA guidelines for publishing meta-anal-
yses, and a stepwise guide for what each of the parts of the publication of
your meta-analysis should contain.

Publishing the protocol of your meta-analysis


In fact, you should have started with the publication of the protocol of your meta-analysis before actually beginning with the meta-analysis. Many of the risks of bias that exist for randomized trials, and that are assessed in meta-analyses, also apply to meta-analyses themselves. The searches, the selection and inclusion of studies, the choice of outcome measures, the decision about which subgroup and metaregression analyses to conduct: all these decisions can influence the findings of meta-analyses.
We already mentioned researcher allegiance and the agenda-driven bias of researchers. Many meta-analyses are written by researchers who are biased towards the intervention they examine in the meta-analysis. They are inclined to be positive about the outcomes of their meta-analysis. There is considerable evidence that researcher allegiance leads to an overestimation of the effects found in meta-analyses (Munder, Brütsch, Leonhart, Gerger, & Barth, 2013b).
The best way to prevent researcher allegiance and other kinds of bias is to publish a protocol containing the details of the planned meta-analysis. That makes it possible to verify whether the final publication is in line with the planned meta-analysis.
In 2011 the first international registry for systematic reviews (PROSPERO) was launched (www.crd.york.ac.uk/PROSPERO). Since then several open access journals have started to publish protocols of meta-analyses, including Systematic Reviews (a BioMed Central journal that started in 2012) and, for example, BMJ Open.
In 2015 a PRISMA guideline for protocols (PRISMA-P) was published
in Systematic Reviews (Moher et al., 2015) and BMJ (Shamseer et al.,
2015). The PRISMA-P checklist describes all elements of the planned me-
ta-analysis, from the authors and financial support for the meta-analysis,
to the rationale, objectives and methods for the review.

The PRISMA Statement


The PRISMA Statement is a guide for authors of meta-analyses on what should be reported (Moher, Liberati, Tetzlaff, & Altman, 2009).
PRISMA stands for Preferred Reporting Items for Systematic Reviews
and Meta-Analyses. The PRISMA statement contains an evidence-based
minimum set of items for reporting in systematic reviews and meta-anal-
yses that has been accepted by most journals in the biomedical field. Au-
thors of meta-analyses are advised to use PRISMA to improve the report-
ing of systematic reviews and meta-analyses.
In Step 3 we already described the PRISMA flowchart of the process
of selecting studies for inclusion in a meta-analysis. This flowchart re-
quires that several results from the searches and the inclusion process are reported. In the same way, the PRISMA statement gives an overview
of other aspects of the meta-analysis that should be reported by authors,
including aspects of the Introduction (the rationale, the PICO), the meth-
ods (for example the inclusion criteria, the data extraction, the methods
used for assessing risk of bias), the results (such as a description of the
included studies, risk of bias for all studies, effect sizes), and the discus-
sion (including a summary of the main findings, the limitations, and impli-
cations for future research). All lists as well as the full PRISMA Statement
are available at the website (www.prisma-statement.org).
The PRISMA statement is one of the guidelines that exist for the re-
porting of many types of scientific studies, for example randomized trials
(the CONSORT statement), observational studies (STROBE), and qualitative research (SRQR). An overview of these statements can be found
at the website of the EQUATOR network (www.equator-network.org)
where much additional information can be found such as extensions for
specific types of meta-analyses (such as the PRISMA-P statement we al-
ready mentioned, but also the PRISMA extension for individual patient
data meta-analyses).

The structure of a paper on a meta-analysis


When you have conducted your meta-analysis, you are ready to write a paper about it for a scientific journal. In this section we will summarize the key elements that such a paper should contain. In general, a paper on a meta-analysis follows the general rules for writing any other report on a scientific study in social science and mental health. It begins with an introduction explaining why the study is important and stating the research question, followed by a section explaining the methods you have used, a section with the results, and a discussion section. But there are also specific subjects that should be addressed in a paper reporting on a meta-analysis. In Table 6.1 the most important elements of such a paper are summarized; in this section I will go through these elements.


Table 6.1 Overview of the structure of a paper on a meta-analysis

Basic information
- Title
- Authors
- Funding support and conflicts of interest
- Reference to published protocol
- Abstract

Introduction
- Explain the background
- The importance of the problem
- Earlier (meta-analytic) research
- Why this new meta-analysis is needed
- End with the research question (PICO)

Methods
- Identification (searches) and selection of studies (inclusion/exclusion criteria)
- Data extraction and quality assessment
- Analyses (which effect size, how it was calculated, pooling, the model used, heterogeneity, publication bias, subgroup and metaregression analyses)

Results
- Selection and inclusion of studies (PRISMA flowchart)
- Characteristics of included studies, including quality/validity; table with selected characteristics
- Outcomes: pooled effect sizes
- Other analyses: subgroup and metaregression analyses, publication bias

Discussion
- Summary of main results
- What this adds to existing knowledge
- Implications for research and practice
- Future research
- Limitations
- Conclusion

Tables and figures
- Figure: flowchart of inclusion of studies (PRISMA flow diagram)
- Figure: forest plot
- Table: selected characteristics of included studies
- Preferably: a table summarizing the meta-analyses

In the Introduction of the paper it is important to explain in a few paragraphs the background and the importance of the problem that the study focuses on. It is also important to explain which earlier (meta-analytic) research has been conducted and why this new meta-analysis is needed. Reasons for doing a new meta-analysis are, for example, that many new trials have been conducted since the last meta-analysis, which may lead to different outcomes, or that the earlier meta-analyses were not conducted well enough (for example, no risk of bias assessment, no examination of the causes of high heterogeneity, or no examination of publication bias). The introduction should end with the PICO of the meta-analysis.
In the Methods section it should be reported whether a protocol for the meta-analysis was published. Furthermore, the searches should be described: the bibliographical databases that were searched, the search strings that were used (one full search string for one database should be reported in the text or in an appendix), and the inclusion and exclusion criteria for the studies (based on the PICO). It should also be reported whether the searches were conducted by two independent researchers.
A subsection of the Methods should focus on the data extraction from the included studies. In that subsection the process should be described (again, was it done by two independent researchers?), as well as the details of the data that were extracted. The extracted data include data on the participants, the intervention, the comparator, and the general aspects of the study (such as year of publication and country). Furthermore, the method of assessing risk of bias or quality should be described.
In another subsection it should be reported which effect size is used (Cohen's d, Hedges' g, RR, OR) and which outcome data are used to calculate the effect sizes. The method of pooling should be reported (typically the random effects model), how heterogeneity is measured (Q-test, I2), how sources of heterogeneity are examined (subgroup analyses, bivariate and multivariate metaregression analyses), and by which method funnel plot asymmetry (publication bias or small sample bias) is examined (e.g., Egger's test, Begg and Mazumdar's test, Duval and Tweedie's trim and fill procedure).
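To make these analysis choices concrete, here is a minimal sketch, with made-up study data, of one common way to compute them: random effects pooling with the DerSimonian-Laird estimator, together with the Q-test and I2.

    # A minimal sketch of DerSimonian-Laird random effects pooling with
    # the Q statistic and I2; effect sizes and variances are made-up.
    import numpy as np
    from scipy import stats

    g = np.array([0.9, 0.4, 1.3, 0.7, 0.2, 1.1, 0.6, 0.8])   # Hedges' g
    v = np.array([0.45, 0.20, 0.50, 0.30, 0.15, 0.40, 0.25, 0.35]) ** 2  # variances

    w = 1 / v
    theta_fe = np.sum(w * g) / np.sum(w)            # fixed-effect mean
    Q = np.sum(w * (g - theta_fe) ** 2)             # homogeneity statistic
    df = len(g) - 1
    p_Q = stats.chi2.sf(Q, df)                      # p-value of the Q-test
    I2 = max(0.0, (Q - df) / Q) * 100               # % of variance from heterogeneity

    C = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / C)                   # between-study variance
    w_re = 1 / (v + tau2)                           # random effects weights
    theta_re = np.sum(w_re * g) / np.sum(w_re)      # pooled random effects mean
    se_re = np.sqrt(1 / np.sum(w_re))
    ci = (theta_re - 1.96 * se_re, theta_re + 1.96 * se_re)

    print(f"g = {theta_re:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f}), "
          f"Q = {Q:.2f} (p = {p_Q:.3f}), I2 = {I2:.0f}%")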
In the Results section, the results of the selection and inclusion process should be described first, including the presentation of the PRISMA flowchart, the resulting numbers of records from each database, the number of full-text papers that were retrieved, the number of studies that were included, and the reasons why the other full-text papers were excluded.
In another subsection of the Results the characteristics of the included studies should be reported (according to the characteristics that were collected, as described in the Methods section). A descriptive table of the included studies should also be presented, with the most important characteristics of each study. When the number of studies is too large, this table can be added as an appendix to the paper. A description of the quality or risk of bias of the studies (including a description for each individual study) should also be given here.
Then the outcomes of the meta-analyses should be given, including the pooled effect sizes (with 95% confidence intervals), heterogeneity, the results of sensitivity analyses, potential outliers, and the asymmetry of the funnel plot (publication bias). The results of subgroup and metaregression analyses should also be reported, and a forest plot with the main analyses should be presented.
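A basic forest plot can be sketched as follows in Python with matplotlib (again with made-up data); dedicated meta-analysis software produces richer versions with study weights and a pooled-effect diamond.

    # A minimal forest plot sketch; labels, g, and se are made-up values.
    import numpy as np
    import matplotlib.pyplot as plt

    labels = [f"Study {i + 1}" for i in range(8)]
    g = np.array([0.9, 0.4, 1.3, 0.7, 0.2, 1.1, 0.6, 0.8])
    se = np.array([0.45, 0.20, 0.50, 0.30, 0.15, 0.40, 0.25, 0.35])
    pooled = np.average(g, weights=1 / se**2)

    y = np.arange(len(g))[::-1]                    # first study on top
    plt.errorbar(g, y, xerr=1.96 * se, fmt="s")    # effect and 95% CI per study
    plt.axvline(0, color="grey")                   # line of no effect
    plt.axvline(pooled, linestyle="--")            # pooled effect size
    plt.yticks(y, labels)
    plt.xlabel("Hedges' g (95% CI)")
    plt.show()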
The Discussion section is very much the same as the discussion section of other papers in social science and mental health research. First, a summary of the main findings is given, together with a description of what the study adds to the existing knowledge. Then the implications for future research and clinical practice are given. An important paragraph concerns the limitations of the study: Were there enough studies? Was heterogeneity not too high? Was it possible to explain the causes of heterogeneity? Was the risk of bias in the set of studies not too high? Finally, a conclusion should be given about the outcomes and consequences of the study.
It is also important to present tables and figures in the paper. Some must be included: the PRISMA flowchart and, as we indicated, the forest plot, which is in many ways the core of a meta-analysis. A descriptive table with the major characteristics of the studies should also be included. The risk of bias of each study should be reported as well, but that can be integrated in the descriptive table. Another table with the main results of the analyses and subgroup analyses is also very useful and makes the paper easier for the reader to understand. Apart from these figures and tables, others can of course be included, for example an extra figure with another forest plot (for example of a subset of studies) or a figure describing the results of a metaregression analysis.


Key points
- It is advisable to develop a protocol before starting with a meta-analysis and to publish that protocol, for example at the PROSPERO website
- The PRISMA Statement contains an evidence-based minimum set of items for reporting in systematic reviews and meta-analyses and should be used when writing a paper on a meta-analysis
- A paper on a meta-analysis follows the general rules for writing any other report on a scientific study in social science and mental health
- The main structure of a paper on a meta-analysis consists of the main sections Introduction, Methods, Results and Discussion
- Figures that should be included are the PRISMA flowchart and the forest plot
- Tables that should be included are a descriptive table with the major characteristics of the studies and a table summarizing the main results of the analyses and subgroup analyses

References

Andersson, G., & Cuijpers, P. (2009). Internet-based and other computerized psychological treatments for adult depression: a meta-analysis. Cognitive Behaviour Therapy, 38(4), 196–205. http://doi.org/10.1080/16506070903318960
Barrett, J. E., Williams Jr, J. W., Oxman, T. E., Frank, E., Katon, W., Sullivan, M., … Sengupta, A. S. (2001). Treatment of dysthymia and minor depression in primary care: A randomized trial in patients aged 18 to 59 years. Journal of Family Practice, 50, 405–412.
Beck, A. T., Ward, C. H., Mendelson, M., Mock, J., & Erbaugh, J. (1961). An inventory for measuring depression. Archives of General Psychiatry, 4, 561–571.
Begg, C. B., & Mazumdar, M. (1994). Operating characteristics of a rank correlation test for publication bias. Biometrics, 50(4), 1088–1101.
Björk, B.-C., Roos, A., & Lauri, M. (2008). Global annual volume of peer reviewed scholarly articles and the share available via different Open Access options. In ELPUB2008. Retrieved from http://ocs.library.utoronto.ca/index.php/Elpub/2008/paper/view/689
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to Meta-Analysis. Chichester, UK: Wiley.
Brookes, S. T., Whitely, E., Egger, M., Smith, G. D., Mulheran, P. A., & Peters, T. J. (2004). Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. Journal of Clinical Epidemiology, 57(3), 229–236. http://doi.org/10.1016/j.jclinepi.2003.08.009
Chen, L., Zhang, G., Hu, M., & Liang, X. (2015). Eye movement desensitization and reprocessing versus cognitive-behavioral therapy for adult posttraumatic stress disorder: systematic review and meta-analysis. The Journal of Nervous and Mental Disease, 203(6), 443–451. http://doi.org/10.1097/NMD.0000000000000306
Cochrane, A. (1972). Effectiveness and Efficiency: Random Reflections on Health Services. London: Nuffield Provincial Hospitals Trust.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Constantino, M. J., Arnkoff, D. B., Glass, C. R., Ametrano, R. M., & Smith, J. Z. (2011). Expectations. Journal of Clinical Psychology, 67(2), 184–192. http://doi.org/10.1002/jclp.20754
Crameri, A., von Wyl, A., Koemeda, M., Schulthess, P., & Tschuschke, V. (2015). Sensitivity analysis in multiple imputation in effectiveness studies of psychotherapy. Frontiers in Psychology, 6. http://doi.org/10.3389/fpsyg.2015.01042
Cuijpers, P., Berking, M., Andersson, G., Quigley, L., Kleiboer, A., & Dobson, K. S. (2013). A meta-analysis of cognitive-behavioural therapy for adult depression, alone and in comparison with other treatments. Canadian Journal of Psychiatry / Revue Canadienne de Psychiatrie, 58(7), 376–385.
Cuijpers, P., Driessen, E., Hollon, S. D., van Oppen, P., Barth, J., & Andersson, G. (2012). The efficacy of non-directive supportive therapy for adult depression: a meta-analysis. Clinical Psychology Review, 32(4), 280–291. http://doi.org/10.1016/j.cpr.2012.01.003
Cuijpers, P., Huibers, M., Ebert, D. D., Koole, S. L., & Andersson, G. (2013). How much psychotherapy is needed to treat depression? A metaregression analysis. Journal of Affective Disorders, 149(1–3), 1–13. http://doi.org/10.1016/j.jad.2013.02.030
Cuijpers, P., Karyotaki, E., Pot, A. M., Park, M., & Reynolds, C. F. 3rd. (2014). Managing depression in older age: psychological interventions. Maturitas, 79(2), 160–169. http://doi.org/10.1016/j.maturitas.2014.05.027
Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: a meta-analysis. Clinical Psychology Review, 30(6), 768–778. http://doi.org/10.1016/j.cpr.2010.06.001
Cuijpers, P., Sijbrandij, M., Koole, S., Huibers, M., Berking, M., & Andersson, G. (2014). Psychological treatment of generalized anxiety disorder: a meta-analysis. Clinical Psychology Review, 34(2), 130–140. http://doi.org/10.1016/j.cpr.2014.01.002
Cuijpers, P., Turner, E. H., Koole, S. L., van Dijke, A., & Smit, F. (2014). What is the threshold for a clinically relevant effect? The case of major depressive disorders. Depression and Anxiety, 31(5), 374–378. http://doi.org/10.1002/da.22249
Cuijpers, P., Turner, E. H., Mohr, D. C., Hofmann, S. G., Andersson, G., Berking, M., & Coyne, J. (2014). Comparison of psychotherapies for adult depression to pill placebo control groups: a meta-analysis. Psychological Medicine, 44(4), 685–695. http://doi.org/10.1017/S0033291713000457
Cuijpers, P., van Straten, A., Bohlmeijer, E., Hollon, S. D., & Andersson, G. (2010). The effects of psychotherapy for adult depression are overestimated: a meta-analysis of study quality and effect size. Psychological Medicine, 40(2), 211–223. http://doi.org/10.1017/S0033291709006114
Cuijpers, P., Vogelzangs, N., Twisk, J., Kleiboer, A., Li, J., & Penninx, B. W. (2014). Comprehensive meta-analysis of excess mortality in depression in the general community versus patients with specific illnesses. The American Journal of Psychiatry, 171(4), 453–462. http://doi.org/10.1176/appi.ajp.2013.13030325
da Costa, B. R., Rutjes, A. W. S., Johnston, B. C., Reichenbach, S., Nüesch, E., Tonia, T., … Jüni, P. (2012). Methods to convert continuous outcomes into odds ratios of treatment response and numbers needed to treat: meta-epidemiological study. International Journal of Epidemiology, 41(5), 1445–1459. http://doi.org/10.1093/ije/dys124
Dragioti, E., Dimoliatis, I., Fountoulakis, K. N., & Evangelou, E. (2015). A systematic appraisal of allegiance effect in randomized controlled trials of psychotherapy. Annals of General Psychiatry, 14, 25. http://doi.org/10.1186/s12991-015-0063-1
Driessen, E., Hollon, S. D., Bockting, C. L. H., Cuijpers, P., & Turner, E. H. (2015). Does publication bias inflate the apparent efficacy of psychological treatment for major depressive disorder? A systematic review and meta-analysis of US National Institutes of Health-funded trials. PloS One, 10(9), e0137864. http://doi.org/10.1371/journal.pone.0137864
Duval, S., & Tweedie, R. (2000). Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics, 56(2), 455–463.
Egger, M., Davey Smith, G., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ (Clinical Research Ed.), 315(7109), 629–634.
Eysenck, H. J. (1952). The effects of psychotherapy: an evaluation. Journal of Consulting Psychology, 16(5), 319–324.
Fanelli, D. (2010). Positive results increase down the hierarchy of the sciences. PLoS ONE, 5(4), e10068. http://doi.org/10.1371/journal.pone.0010068
Fournier, J. C., DeRubeis, R. J., Hollon, S. D., Dimidjian, S., Amsterdam, J. D., Shelton, R. C., & Fawcett, J. (2010). Antidepressant drug effects and depression severity: a patient-level meta-analysis. JAMA, 303(1), 47–53. http://doi.org/10.1001/jama.2009.1943
Furukawa, T. A., & Leucht, S. (2011). How to obtain NNT from Cohen's d: comparison of two methods. PloS One, 6(4), e19070. http://doi.org/10.1371/journal.pone.0019070
Geraedts, A. S., Kleiboer, A. M., Wiezer, N. M., van Mechelen, W., & Cuijpers, P. (2014). Short-term effects of a web-based guided self-help intervention for employees with depressive symptoms: randomized controlled trial. Journal of Medical Internet Research, 16(5), e121. http://doi.org/10.2196/jmir.3185
Griffiths, K. M., Carron-Arthur, B., Parsons, A., & Reid, R. (2014). Effectiveness of programs for reducing the stigma associated with mental disorders: A meta-analysis of randomized controlled trials. World Psychiatry, 13(2), 161–175. http://doi.org/10.1002/wps.20129
Griffiths, K. M., Christensen, H., Jorm, A. F., Evans, K., & Groves, C. (2004). Effect of web-based depression literacy and cognitive-behavioural therapy interventions on stigmatising attitudes to depression: randomised controlled trial. The British Journal of Psychiatry, 185, 342–349. http://doi.org/10.1192/bjp.185.4.342
Gulliver, A., Griffiths, K. M., Christensen, H., Mackinnon, A., Calear, A. L., Parsons, A., … Stanimirovic, R. (2012). Internet-based interventions to promote mental health help-seeking in elite athletes: an exploratory randomized controlled trial. Journal of Medical Internet Research, 14(3), e69. http://doi.org/10.2196/jmir.1864
Hamilton, M. (1960). A rating scale for depression. Journal of Neurology, Neurosurgery, and Psychiatry, 23, 56–62.
Harnad, S., Brody, T., Vallières, F., Carr, L., Hitchcock, S., Gingras, Y., … Hilf, E. R. (2008). The Access/Impact Problem and the Green and Gold Roads to Open Access: An Update. Serials Review, 34(1), 36–40. http://doi.org/10.1080/00987913.2008.10765150
Higgins, J. P. T., Altman, D. G., Gøtzsche, P. C., Jüni, P., Moher, D., Oxman, A. D., … Sterne, J. A. C. (2011). The Cochrane Collaboration's tool for assessing risk of bias in randomised trials. BMJ, 343, d5928. http://doi.org/10.1136/bmj.d5928
Higgins, J. P. T., & Green, S. (2011). Cochrane Handbook for Systematic Reviews of Interventions, Version 5.1.0 [updated March 2011]. The Cochrane Collaboration.
Higgins, J. P. T., Thompson, S. G., Deeks, J. J., & Altman, D. G. (2003). Measuring inconsistency in meta-analyses. BMJ (Clinical Research Ed.), 327(7414), 557–560. http://doi.org/10.1136/bmj.327.7414.557
Houghton, J., & Vickery, G. (2005). Digital Broadband Content: Scientific Publishing. Directorate for Science, Technology and Industry; Committee for Information, Computer and Communications Policy.
Ioannidis, J. P. A., Patsopoulos, N. A., & Evangelou, E. (2007). Uncertainty in heterogeneity estimates in meta-analyses. BMJ, 335(7626), 914–916. http://doi.org/10.1136/bmj.39343.408449.80
Jarrett, R. B., Schaffer, M., McIntire, D., Witt-Browder, A., Kraft, D., & Risser, R. C. (1999). Treatment of atypical depression with cognitive therapy or phenelzine: a double-blind, placebo-controlled trial. Archives of General Psychiatry, 56, 431–437.
Joling, K. J., van Hout, H. P., van't Veer-Tazelaar, P. J., van der Horst, H. E., Cuijpers, P., van de Ven, P. M., & van Marwijk, H. W. (2011). How effective is bibliotherapy for very old adults with subthreshold depression? A randomized controlled trial. American Journal of Geriatric Psychiatry, 19, 256–265. http://doi.org/10.1097/JGP.0b013e3181ec8859
Kiropoulos, L. A., Griffiths, K. M., & Blashki, G. (2011). Effects of a multilingual information website intervention on the levels of depression literacy and depression-related stigma in Greek-born and Italian-born immigrants living in Australia: a randomized controlled trial. Journal of Medical Internet Research, 13(2), e34. http://doi.org/10.2196/jmir.1527
Kirsch, I., Deacon, B. J., Huedo-Medina, T. B., Scoboria, A., Moore, T. J., & Johnson, B. T. (2008). Initial severity and antidepressant benefits: a meta-analysis of data submitted to the Food and Drug Administration. PLoS Medicine, 5(2), e45. http://doi.org/10.1371/journal.pmed.0050045
Kok, R. N., Donker, T., Batelaan, N. M., Beekman, A. T., Van Straten, A., & Cuijpers, P. (n.d.). Psychological treatment of specific phobias: a meta-analysis. Submitted.
Kraemer, H. C., & Kupfer, D. J. (2006). Size of treatment effects and their importance to clinical research and practice. Biological Psychiatry, 59(11), 990–996. http://doi.org/10.1016/j.biopsych.2005.09.014
Lau, J., Ioannidis, J. P. A., Terrin, N., Schmid, C. H., & Olkin, I. (2006). The case of the misleading funnel plot. BMJ (Clinical Research Ed.), 333(7568), 597–600. http://doi.org/10.1136/bmj.333.7568.597
Lewis, S., & Clarke, M. (2001). Forest plots: trying to see the wood and the trees. BMJ (Clinical Research Ed.), 322(7300), 1479–1480.
Leykin, Y., & DeRubeis, R. J. (2009). Allegiance in psychotherapy outcome research: Separating association from bias. Clinical Psychology: Science and Practice, 16(1), 54–65. http://doi.org/10.1111/j.1468-2850.2009.01143.x
Masicampo, E. J., & Lalande, D. R. (2012). A peculiar prevalence of p values just below .05. Quarterly Journal of Experimental Psychology (2006), 65(11), 2271–2279. http://doi.org/10.1080/17470218.2012.711335
Moher, D., Liberati, A., Tetzlaff, J., & Altman, D. G. (2009). Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ, 339, b2535. http://doi.org/10.1136/bmj.b2535
Moher, D., Shamseer, L., Clarke, M., Ghersi, D., Liberati, A., Petticrew, M., … PRISMA-P Group. (2015). Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015 statement. Systematic Reviews, 4, 1. http://doi.org/10.1186/2046-4053-4-1
Mohr, D. C., Cuijpers, P., & Lehman, K. (2011). Supportive accountability: a model for providing human support to enhance adherence to eHealth interventions. Journal of Medical Internet Research, 13(1), e30. http://doi.org/10.2196/jmir.1602
Mohr, D. C., Ho, J., Hart, T. L., Baron, K. G., Berendsen, M., Beckner, V., … Duffecy, J. (2014). Control condition design and implementation features in controlled trials: a meta-analysis of trials evaluating psychotherapy for depression. Translational Behavioral Medicine, 4(4), 407–423. http://doi.org/10.1007/s13142-014-0262-3
Mohr, D. C., Spring, B., Freedland, K. E., Beckner, V., Arean, P., Hollon, S. D., … Kaplan, R. (2009). The selection and design of control conditions for randomized controlled trials of psychological interventions. Psychotherapy and Psychosomatics, 78(5), 275–284. http://doi.org/10.1159/000228248
Moncrieff, J., Wessely, S., & Hardy, R. (2004). Active placebos versus antidepressants for depression. The Cochrane Database of Systematic Reviews, (1), CD003012. http://doi.org/10.1002/14651858.CD003012.pub2
Munder, T., Brütsch, O., Leonhart, R., Gerger, H., & Barth, J. (2013a). Researcher allegiance in psychotherapy outcome research: an overview of reviews. Clinical Psychology Review, 33(4), 501–511. http://doi.org/10.1016/j.cpr.2013.02.002
Munder, T., Brütsch, O., Leonhart, R., Gerger, H., & Barth, J. (2013b). Researcher allegiance in psychotherapy outcome research: an overview of reviews. Clinical Psychology Review, 33(4), 501–511. http://doi.org/10.1016/j.cpr.2013.02.002
Mynors-Wallis, L., Davies, I., Gray, A., Barbour, F., & Gath, D. (1997). A randomised controlled trial and cost analysis of problem-solving treatment for emotional disorders given by community nurses in primary care. British Journal of Psychiatry, 170, 113–119.
National Institute for Clinical Excellence. (2009). The Treatment and Management of Depression in Adults. Partial Update of Clinical Practice Guideline No 23. London: National Institute for Clinical Excellence.
Pots, W. T. M., Meulenbeek, P. A. M., Veehof, M. M., Klungers, J., & Bohlmeijer, E. T. (2014). The efficacy of mindfulness-based cognitive therapy as a public mental health intervention for adults with mild to moderate depressive symptomatology: a randomized controlled trial. PloS One, 9(10), e109789. http://doi.org/10.1371/journal.pone.0109789
Riley, R. D., Lambert, P. C., & Abo-Zaid, G. (2010). Meta-analysis of individual participant data: rationale, conduct, and reporting. BMJ (Clinical Research Ed.), 340, c221.
Riley, R. D., Simmonds, M. C., & Look, M. P. (2007). Evidence synthesis combining individual patient data and aggregate data: a systematic review identified current practice and possible methods. Journal of Clinical Epidemiology, 60(5), 431–439. http://doi.org/10.1016/j.jclinepi.2006.09.009
Roberts, R. E. (1980). Reliability of the CES-D Scale in different ethnic contexts. Psychiatry Research, 2(2), 125–134.
Roest, A. M., de Jonge, P., Williams, C. D., de Vries, Y. A., Schoevers, R. A., & Turner, E. H. (2015). Reporting bias in clinical trials investigating the efficacy of second-generation antidepressants in the treatment of anxiety disorders: A report of 2 meta-analyses. JAMA Psychiatry, 72(5), 500–510. http://doi.org/10.1001/jamapsychiatry.2015.15
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641. http://doi.org/10.1037/0033-2909.86.3.638
Shamseer, L., Moher, D., Clarke, M., Ghersi, D., Liberati, A., Petticrew, M., … PRISMA-P Group. (2015). Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015: elaboration and explanation. BMJ (Clinical Research Ed.), 349, g7647.
Sterling, T. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance, or vice versa. Journal of the American Statistical Association, 54(285), 30–34.
Sterne, J. A. C., Sutton, A. J., Ioannidis, J. P. A., Terrin, N., Jones, D. R., Lau, J., … Higgins, J. P. T. (2011). Recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials. BMJ, 343, d4002. http://doi.org/10.1136/bmj.d4002
Terrin, N., Schmid, C. H., Lau, J., & Olkin, I. (2003). Adjusting for publication bias in the presence of heterogeneity. Statistics in Medicine, 22, 2113–2126.
Trauer, J. M., Qian, M. Y., Doyle, J. S., Rajaratnam, S. M. W., & Cunnington, D. (2015). Cognitive behavioral therapy for chronic insomnia: A systematic review and meta-analysis. Annals of Internal Medicine, 163(3), 191–204. http://doi.org/10.7326/M14-2841
Turner, E. H., Knoepflmacher, D., & Shapley, L. (2012). Publication bias in antipsychotic trials: an analysis of efficacy comparing the published literature to the US Food and Drug Administration database. PLoS Medicine, 9(3), e1001189. http://doi.org/10.1371/journal.pmed.1001189
Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Rosenthal, R. (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. The New England Journal of Medicine, 358(3), 252–260. http://doi.org/10.1056/NEJMsa065779
Williams, J. W., Barrett, J., Oxman, T., Frank, E., Katon, W., Sullivan, M., … Sengupta, A. (2000). Treatment of dysthymia and minor depression in primary care: A randomized controlled trial in older adults. JAMA, 284, 1519–1526.
Xia, J., Wright, J., & Adams, C. E. (2008). Five large Chinese biomedical bibliographic databases: accessibility and coverage. Health Information and Libraries Journal, 25(1), 55–61. http://doi.org/10.1111/j.1471-1842.2007.00734.x