Journal of Child Psychology and Psychiatry 56:9 (2015), pp 936–948 doi:10.1111/jcpp.12442

Thresholds and accuracy in screening tools for early


detection of psychopathology
R. Christopher Sheldrick,1 James C. Benneyan,2 Ivy Giserman Kiss,3 Margaret J. Briggs-Gowan,4 William Copeland,5 and Alice S. Carter3
1Developmental-Behavioral Pediatrics, Tufts University School of Medicine, Boston, MA; 2Healthcare Systems Engineering Institute, Colleges of Engineering and Health Sciences, Northeastern University, Boston, MA; 3Department of Psychology, University of Massachusetts Boston, Boston, MA; 4Department of Psychiatry, University of Connecticut Health Center, Farmington, CT; 5Department of Psychiatry and Behavioral Sciences, Duke University School of Medicine, Durham, NC, USA

Background: The accuracy of any screening instrument designed to detect psychopathology among children is
ideally assessed through rigorous comparison to ‘gold standard’ tests and interviews. Such comparisons typically
yield estimates of what we refer to as ‘standard indices of diagnostic accuracy’, including sensitivity, specificity,
positive predictive value (PPV), and negative predictive value. However, whereas these statistics were originally
designed to detect binary signals (e.g., diagnosis present or absent), screening questionnaires commonly used in
psychology, psychiatry, and pediatrics typically result in ordinal scores. Thus, a threshold or ‘cut score’ must be
applied to these ordinal scores before accuracy can be evaluated using such standard indices. To better understand
the tradeoffs inherent in choosing a particular threshold, we discuss the concept of ‘threshold probability’. In
contrast to PPV, which reflects the probability that a child whose score falls at or above the screening threshold has
the condition of interest, threshold probability refers specifically to the likelihood that a child whose score is equal to
a particular screening threshold has the condition of interest. Method: The diagnostic accuracy and threshold
probability of two well-validated behavioral assessment instruments, the Child Behavior Checklist Total Problem
Scale and the Strengths and Difficulties Questionnaire total scale, were examined in relation to a structured
psychiatric interview in three de-identified datasets. Results: Although both screening measures were effective in
identifying groups of children at elevated risk for psychopathology in all samples (odds ratios ranged from 5.2 to 9.7),
children who scored at or near the clinical thresholds that optimized sensitivity and specificity were unlikely to meet
criteria for psychopathology on gold standard interviews. Conclusions: Our results are consistent with the view that
screening instruments should be interpreted probabilistically, with attention to where along the continuum of
positive scores an individual falls. Keywords: Assessment, screening, psychopathology, developmental
psychopathology, methodology.

Introduction
As many studies document, few children who experience psychopathology are detected as part of routine care, and fewer still receive appropriate treatment (Briggs-Gowan, Horwitz, Schwab-Stone, Leventhal, & Leaf, 2000; Sheldrick, Merchant, & Perrin, 2011; U.S. Department of Health and Human Services, 2000; Wang et al., 2005). Use of validated screening instruments is commonly recommended to improve identification of psychopathology in community settings. It is well-recognized that to be considered valid, screening instruments must display acceptable levels of accuracy in detecting psychopathology. Moreover, to be used effectively, it is critically important that the strengths and limitations of screening instruments be understood to inform implementation.

Psychologists, psychiatrists, pediatricians, and others who work in healthcare or health sciences are generally familiar with concepts such as sensitivity and specificity. By extension, they are therefore familiar with the fundamental concepts used to evaluate diagnostic accuracy. However, a full understanding of diagnostic accuracy—which includes the ability to effectively communicate results—requires a high level of facility with these concepts. We therefore begin with an in-depth discussion of the basic premises of diagnostic accuracy. Our hope is that this introduction will serve as a review, while also providing a useful example for how these concepts can be communicated.

Review of diagnostic accuracy
There is strong consensus regarding the qualities of studies that offer the best evidence of screening accuracy. A rich literature (e.g., Jaeschke et al., 1994) and two detailed expert consensus statements, the STARD (Bossuyt et al., 2003) and the QUADAS-2 (Whiting et al., 2011), detail the criteria for conducting and reporting diagnostic accuracy studies. This 'gold standard' methodology prescribes comparison of screening results to 'gold standard' tests and structured interviews that are considered to be our best indices of psychopathology, typically yielding what we will hereafter refer to as the standard indices of diagnostic accuracy:

Conflict of interest statement: No conflicts declared.

© 2015 Association for Child and Adolescent Mental Health.


Published by John Wiley & Sons Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main St, Malden, MA 02148, USA

Sensitivity: the proportion of children with psychopathology who are correctly classified as having a disorder by the screening test
Specificity: the proportion of children without psychopathology who are correctly classified as not having a disorder by the screening test
Positive predictive value (PPV): the proportion of children with positive screening results who have psychopathology (and are therefore correctly classified)
Negative predictive value (NPV): the proportion of children with negative screening results who do not have psychopathology (and are therefore correctly classified)

Consistent with QUADAS-2 recommendations, these indices are usually presented together with a 'four-fold' or '2 × 2' table of screening and criterion test results from which these statistics can be easily calculated. Figure 1 depicts such a table and reviews how all four indices are calculated.

For review, standard indices of diagnostic accuracy are not independent of one another. All represent ratios that rely on the same four cells within the 2 × 2 table, and all four indices include only correct classifications in their numerator: either true positives or true negatives. As many recall from introductory lessons on chi-square tests, 2 × 2 tables have only 3 degrees of freedom. Thus, if prevalence is known, knowledge of any two standard indices of diagnostic accuracy is sufficient to calculate all other indices. For example, if sensitivity, specificity, and prevalence are known, PPV can be readily calculated. It is therefore not possible to raise or lower PPV without altering at least one of the other indices. For a given prevalence, PPV cannot be changed without also changing sensitivity and/or specificity.

Standard indices of diagnostic accuracy and 2 × 2 tables are sufficient for describing the results of binary statistical tests. However, the results of most tests in psychology, psychiatry, and pediatrics are not purely binary. Most produce ordinal raw scores that are transformed into binary results using a 'cut score' or 'threshold'. As an example, consider a screener with 20 true/false questions that yields raw scores from 0 to 20. A threshold might indicate that individuals who receive a score of 12 or higher are classified as screening positive, whereas those with a score of less than 12 are classified as screening negative. The values for standard indices of diagnostic accuracy would vary if the threshold on the same screening instrument was raised (e.g., to 13) or lowered (e.g., to 11). This is the basis of assessing diagnostic accuracy by applying receiver operating characteristic (ROC) curves, which depict sensitivity and (1 minus) specificity for every possible threshold on a screening instrument.

STARD and QUADAS-2 criteria make at least two references to screening tests that have ordinal results. First, they state that to achieve the highest quality evidence, screening thresholds should be stipulated before an accuracy study is conducted. Second, they state that the full distribution of screening scores (i.e., not just a 2 × 2 table) should be presented for individuals who test positive and for those who test negative on the gold standard criterion against which a screening test is evaluated. Figure 2A offers a hypothetical example of what these distributions might look like. All individuals are assumed to be members of one of two populations: those with clinically significant psychopathology and those without. Inspection of this figure reveals that, on average, children with disorders receive higher scores on the screening test (as indicated on the horizontal axis) than do children without such disorders; however, the tails of the two distributions overlap considerably. Figure 2B presents the same distributions, except that the population with psychopathological disorders has been flipped below the x-axis to avoid overlap in the figure, and one of the many possible thresholds has been added. Figure 2C is identical to Figure 2B, except that a higher threshold has been chosen.

Comparison of Figure 2B and C with Figure 1 reveals many similarities. Children who screen positive are depicted on the right, while those who screen negative are depicted on the left. Similarly, those with disorders are represented below the

                         Screening Results
                           −        +
    Diagnosis −           TN       FP      TN+FP    Specificity = TN / (TN+FP)
    Diagnosis +           FN       TP      FN+TP    Sensitivity = TP / (FN+TP)
                        TN+FN    FP+TP
              NPV = TN / (TN+FN)    PPV = TP / (FP+TP)

    For example:
                            Screener
                           −        +
    Diagnosis −           62       23        85     Sensitivity = 73%
    Diagnosis +            4       11        15     Specificity = 73%
                          66       34               PPV = 32%
                                                    NPV = 94%

Figure 1. 2 × 2 table for a hypothetical screener. Note: shaded cells = correctly classified
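The four indices in Figure 1 can be computed directly from the cell counts; a minimal sketch (function name ours) using the figure's example cells, TP = 11, FP = 23, FN = 4, TN = 62:

```python
def indices_from_2x2(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard indices of diagnostic accuracy from a 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # correct among those with the disorder
        "specificity": tn / (tn + fp),  # correct among those without it
        "ppv": tp / (tp + fp),          # correct among screen positives
        "npv": tn / (tn + fn),          # correct among screen negatives
    }

# The hypothetical screener in Figure 1:
ix = indices_from_2x2(tp=11, fp=23, fn=4, tn=62)
print({k: round(v, 2) for k, v in ix.items()})
# → {'sensitivity': 0.73, 'specificity': 0.73, 'ppv': 0.32, 'npv': 0.94}
```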


[Figure 2: three panels of hypothetical screening score distributions (scores 0–9). In panel A, the population without disorders is normal with M = 0, SD = 1, and the population with disorders is normal with M = 1.2, SD = 1. In panels B and C, the distribution for the population with disorders is flipped below the x-axis and a vertical threshold divides screen negative (left) from screen positive (right), with a higher threshold in C.]

Figure 2 (A) Hypothetical distributions of screening scores. (B) Application of a screening threshold. (C) Effect of a higher screening threshold
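The operating points quoted for Figures 2B and 2C can be reproduced numerically. Assuming, as in the figure, unit-variance normal score distributions with means 0 (without disorders) and 1.2 (with disorders), the sketch below uses threshold values of roughly 0.6 and 2.05; these values are back-solved assumptions chosen to match the quoted percentages, not numbers taken from the paper:

```python
from math import erf, sqrt

def norm_cdf(x: float, mu: float = 0.0, sd: float = 1.0) -> float:
    """Normal CDF via the error function (no external libraries needed)."""
    return 0.5 * (1.0 + erf((x - mu) / (sd * sqrt(2.0))))

def sens_spec(threshold, mu_unaffected=0.0, mu_affected=1.2, sd=1.0):
    sensitivity = 1.0 - norm_cdf(threshold, mu_affected, sd)  # affected scoring at/above t
    specificity = norm_cdf(threshold, mu_unaffected, sd)      # unaffected scoring below t
    return sensitivity, specificity

for t in (0.6, 2.05):  # thresholds comparable to Figures 2B and 2C
    sens, spec = sens_spec(t)
    print(f"threshold {t}: sensitivity {sens:.0%}, specificity {spec:.0%}")
# → threshold 0.6: sensitivity 73%, specificity 73%
# → threshold 2.05: sensitivity 20%, specificity 98%
```

Raising the threshold trades sensitivity for specificity, exactly as the text describes.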


horizontal axis, while those without disorders are depicted above. Likewise, the meaning of indices like sensitivity, specificity, PPV, and NPV has not changed. Note that Figure 2B and C both include normal distributions and depict a population of children with a 15% prevalence of disorders. The threshold set in Figure 2B yields sensitivity = 73% and specificity = 73%, which is identical to the example in Figure 1. In contrast, the threshold in Figure 2C yields sensitivity = 20% and specificity = 98%. This illustrates how moving the threshold along the screening score distribution can affect sensitivity and specificity.

There is also an important difference between Figures 2B and C—namely, the threshold is higher in Figure 2C than in Figure 2B. What is immediately obvious is that with the higher threshold in Figure 2C, fewer individuals will score positive on the screening test (reflected by the fact that there is less area under the curves to the right of the threshold). The effect on correct and incorrect classifications requires somewhat closer inspection. Any clinical threshold or cut score will yield correct classifications (i.e., true positives and true negatives), as well as errors (i.e., false positives and false negatives). As the threshold changes from Figure 2B to C, the proportions of both true positives (lower right quadrant) and false positives (upper right quadrant) fall, while the proportions of true negatives (upper left quadrant) and false negatives (lower left quadrant) rise.

Attending to these changes, one can discern the effect of raising a threshold on standard indices of diagnostic accuracy. Among children with disorders (depicted by the curve below the horizontal axis), those who are correctly classified are the ones to the right of the threshold (i.e., the shaded area). Thus, it is simple to visualize the fact that sensitivity (i.e., the proportion of children with disorders who are correctly classified) falls as the threshold rises (in this case, from 73% to 20%). The scores of children without disorders are depicted by the curve above the horizontal axis, and those who are correctly classified as negatives are the ones to the left of the threshold. Thus, it is simple to visualize the fact that specificity (i.e., the proportion of children without disorders who are correctly classified) rises as the threshold increases (in this case, from 73% to 98%).

Visualizing PPV (i.e., the proportion of children with positive screening results who are correctly classified) requires even closer inspection. As the threshold rises from Figure 2B to C, the absolute numbers of both true positives and false positives decrease. Thus, both the numerator and the denominator in the equation for PPV decrease as the threshold rises. To visualize the effect, consider all children who screen positive with scores to the right of the thresholds in Figure 2B and C, whether they have disorders (and are below the horizontal axis) or do not have disorders (and are above the horizontal axis). Now consider the overall proportion who are correctly classified among those who screen positive—i.e., the true positives in the lower right quadrant. Clearly, the proportion of true positives relative to total screened positives is greater in Figure 2C (where PPV = 63%) than in 2B (where PPV = 32%). Thus, assuming normal distributions of scores, whenever the prevalence of the disorder identified by the screener is below 50%, PPV rises as the threshold rises.

To summarize, as thresholds rise, specificity and PPV also rise, but sensitivity falls. This dependency, or negative correlation, between sensitivity and specificity has long been recognized in meta-analyses of screening tests and assessment instruments, thus necessitating the use of methods that account for this interdependency for accurate analyses (Trikalinos, Uwe, & Joseph, 2009). It also highlights an important point about any screening test that has ordinal or continuous results: if higher sensitivity is desired, it can often be realized by lowering the threshold, albeit at the cost of lower specificity and PPV. Conversely, if higher PPV is required, it can often be realized by raising the threshold, albeit at the cost of lower sensitivity.

The concept of threshold probability
While sensitivity, specificity, PPV, and NPV offer a great deal of information about the diagnostic accuracy of a screening test, all are group-level statistics, describing the performance of a screening test across the full distribution of scores. To assist in making an informed decision about an individual patient who scored positive on a screening test, one would ideally know the probability that that particular patient had been correctly classified given the screening score. Clearly, there is significant heterogeneity among positive scores. For example, children with very high scores on a valid screening test are more likely to be affected by psychopathology than those who score at or near the threshold. Unfortunately, the clinical utility of PPV as a metric is limited in that it offers a weighted average probability across all individuals who score positive, regardless of their underlying score.

We therefore highlight an extremely useful (but less commonly known) statistic referred to as 'threshold probability', which is prominently used in the field of decision analysis (Pauker & Kassirer, 1980; Vickers & Elkin, 2006), and in our case refers to the probability of being in the population affected by psychopathology for an individual whose score falls at the threshold. Figure 3A offers a visual depiction of a single threshold probability. Because standard indices of diagnostic accuracy apply to groups of individuals with a range of screening scores, they can be conceptualized by visualizing the area bounded by different curves. Because


threshold probability refers to individuals who receive a single score on a screening instrument, it can be visualized by attending to the proportion of individuals who score at the threshold who are in the affected population—or, to be more specific, the length of the line reflecting the frequency of affected individuals compared to the total line indicating the frequency of scores at the threshold (see Figure 3A). Thus, if a screener displayed properties like those depicted in Figure 3A, 32% of all children who score positive would be expected to actually experience psychopathology (i.e., PPV = 32%). However, only 15% of children scored at the threshold would be expected to experience psychopathology (i.e., threshold probability = 15%).

It is possible, and often preferable, to have more than one screening threshold for a given instrument. Different thresholds can correspond to different levels of risk for the condition of interest, and each would be associated with a different threshold probability. Taken a step further, just as an ROC curve depicts sensitivity and specificity for each score on a screening instrument, it is possible to determine the threshold probability for each score on a screening instrument. For very low scores, we would expect the threshold probability to approach 0; for very high scores, we would expect the threshold probability to approach 100%, and along the continuum, we would expect the full range of probabilities.
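The contrast between PPV and threshold probability just described can be computed directly. Under the same assumptions as the running example (normal score distributions with means 0 and 1.2, SD 1, and 15% prevalence), a threshold near 0.6 (an assumed value chosen to match the 73%/73% operating point) gives a PPV near 32% while the probability at the threshold score itself is 15%:

```python
from math import erf, exp, pi, sqrt

def norm_pdf(x, mu, sd):
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2.0 * pi))

def norm_cdf(x, mu, sd):
    return 0.5 * (1.0 + erf((x - mu) / (sd * sqrt(2.0))))

def threshold_probability(x, prev, mu0=0.0, mu1=1.2, sd=1.0):
    """P(disorder | score == x): prevalence-weighted density of affected
    scores at x, over the total density at x (point densities)."""
    affected = prev * norm_pdf(x, mu1, sd)
    unaffected = (1.0 - prev) * norm_pdf(x, mu0, sd)
    return affected / (affected + unaffected)

def ppv(threshold, prev, mu0=0.0, mu1=1.2, sd=1.0):
    """P(disorder | score >= threshold): tail areas, not point densities."""
    tp = prev * (1.0 - norm_cdf(threshold, mu1, sd))
    fp = (1.0 - prev) * (1.0 - norm_cdf(threshold, mu0, sd))
    return tp / (tp + fp)

t, prevalence = 0.6, 0.15
print(f"PPV at threshold {t}: {ppv(t, prevalence):.0%}")                              # → 32%
print(f"threshold probability at score {t}: {threshold_probability(t, prevalence):.0%}")  # → 15%
```

The only difference between the two functions is integrating over the tail versus evaluating at a single point, which is exactly the distinction the text draws.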

[Figure 3: panel A repeats the two score distributions from Figure 2 (M = 0, SD = 1 above the axis; M = 1.2, SD = 1 below), with a vertical line at the threshold whose lower segment, f[diagnosis | score = x], marks the frequency of affected individuals at that score. Panel B plots, for every possible score, the ratio f[diagnosis | score = x] / f[score = x] on a 0%–90% axis, alongside PPV as a reference value.]

Figure 3 (A) Visualizing threshold probability. (B) Threshold probability & PPV


Comparable to the depiction of sensitivity and specificity for each possible screening score in an ROC curve, Figure 3B depicts the threshold probabilities for every possible score on the screening instrument depicted in the examples above. As reference values, the PPV is included, as is the probability of being in the affected population among those who screen negative (i.e., 1 − NPV). Note that because PPV represents a weighted average for all individuals who score positive (i.e., everyone who scores above the threshold), its value is greater than threshold probability. The threshold probability at any one point along the distribution is determined by two factors: the accuracy of the screening instrument (in this case represented by the standardized difference between two normally distributed populations, which is fixed in this example), and the choice of where to place a threshold (which can vary anywhere along the curve as desired).

Threshold probability has strong implications for screening implementation in clinical settings. For example, if guidelines recommend an evidence-based screening instrument that in practice has a threshold probability of 15% (as is the case in the example depicted by Figure 3B), then clinicians must be willing and able to ensure that patients who screen positive and have a 15% chance of suffering from psychopathology receive treatment and/or further evaluation. If this is not the case, whether because of lack of training or lack of resources, then it is unlikely that evidence-based screening instruments will be used as recommended. Moreover, if clinicians choose to use a higher threshold than is recommended, sensitivity will be lower, and it is therefore unlikely that as many children with psychopathology will be detected as was initially expected.

Unfortunately, estimating threshold probability based on empirical data is not straightforward. Whereas calculation of standard indices of diagnostic accuracy requires only a 2 × 2 table, threshold probability requires an estimate of the distribution of screening scores among both healthy individuals and among those affected by psychopathological disorders. In this paper, we explore methods to analyze threshold probability in three large datasets. All three include structured diagnostic interviews conducted by interviewers trained to research reliability. All three also include well-regarded evidence-based screening instruments: the Strengths & Difficulties Questionnaire (SDQ), the Child Behavior Checklist (CBCL), or both. Thus, these datasets offer a unique opportunity to explore threshold probability for well-accepted behavioral screening instruments.

Methods
We sought to assess the psychometric properties of the total scales of two well-validated behavioral assessment instruments: the Total Problems scale of the CBCL (Achenbach & Rescorla, 2001), and the total scale of the SDQ (Goodman, 1997, 2001; Goodman & Scott, 1999). With approval from two university Institutional Review Boards, we analyzed three de-identified datasets, each of which included a structured psychiatric interview designed to assess psychopathology among children (i.e., a gold standard) as well as the CBCL and/or the SDQ.

Samples
We analyzed data from three samples:
1. The National Comorbidity Study-Adolescent Supplement (NCS-A) is a nationally representative survey of adolescents aged 13–18 years (M = 15.2; SD = 1.5) in the continental U.S. that focuses on psychiatric disorders and related constructs. A sample of 6,483 adolescents was selected from a dual sampling frame of households and schools and interviewed face-to-face by trained lay interviewers using the World Health Organization (WHO) Composite International Diagnostic Interview Version 3.0 (CIDI) (Kessler & Üstün, 2004; Merikangas, Avenevoli, Costello, Koretz, & Kessler, 2009). One parent or surrogate was mailed a questionnaire including demographic questions and the SDQ. The survey was administered by staff from the Institute for Social Research at the University of Michigan. Study procedures (Kessler et al., 2009a, 2009b; Merikangas et al., 2009) and analyses regarding the diagnostic accuracy of the SDQ (He, Burstein, Schmitz, & Merikangas, 2013) have been detailed in numerous publications.
2. The North Carolina (NC) sample included 636 children and adolescents aged 7–17 (M = 12.8; SD = 2.4) who were recruited from the Duke Pediatric Primary Care Clinics in Durham, North Carolina. Parents received five questionnaires including the CBCL and the SDQ. Because the purpose of the study was to compare several 'gold standard' psychiatric interviews, parents then completed two separate interviews over two visits. For the purposes of this study, all analyses were conducted using results from the Child and Adolescent Psychiatric Assessment (CAPA) (Angold & Costello, 2000). Further details regarding study procedures are reported elsewhere (Angold et al., 2012).
3. The Greater New Haven (NH) sample was part of a longitudinal representative population study of children selected from birth records provided by the State of Connecticut Department of Public Health (N = 8,404) and born healthy between July 1995 and September 1997 in the New Haven–Meriden Standard Metropolitan Statistical Area of the 1990 Census (n = 1,329). Parents of a subsample enriched for child psychopathology (n = 442; 77.6% response rate, 69.5% of eligible sample) were interviewed in the child's kindergarten or first-grade year with the Diagnostic Interview Schedule for Children, Version IV (DISC-IV) (Schaffer et al., 1996). Analyses in this study include 402 children aged 5–7 (M = 6.0; SD = 0.3) whose parents completed the CBCL (Achenbach & Rescorla, 2001). Sampling weights were applied to adjust for selection and retention biases. Study details are reported elsewhere (Briggs-Gowan, Carter, Skuban, & Horwitz, 2001; Carter et al., 2010).

Thus, data were available to support four analyses: two samples were available for the CBCL and two samples were available for the SDQ. For the CAPA, the CIDI, and the DISC-IV, all diagnostic categories were included except specific phobia.

Analyses
Analysis of threshold probability necessitates estimation of the probability of having versus not having a diagnosis associated with specific scores on an assessment instrument. Because sampling variation can make point estimates associated with individual screening scores unstable, we sought to harness full


sample data by estimating separate probability density functions (pdf) for children with diagnoses and for those without. Specifically, we attempted to fit normal, lognormal and Weibull curves to each distribution (including a shift parameter in the latter two cases), estimating all parameters via maximum likelihood, and then assessing goodness of fit using both Kolmogorov–Smirnov and chi-square methods with a rejection criterion of alpha = 0.05. Pdfs were converted to cumulative distribution functions (cdf) to assess sensitivity and PPV. Threshold probability (i.e., the probability of having a diagnosis conditioned on a particular screening score) was estimated using the equation Pt = pdf_diagnosis / (pdf_diagnosis + pdf_no diagnosis). For purposes of comparison, sensitivity and PPV were also analyzed directly from frequency tables using the standard nonparametric equations. Finally, after applying recommended thresholds to both the CBCL and the SDQ, we calculated risk ratios to reflect the relative risk of having a diagnosis conditional on screening results. To facilitate comparison with other studies, associated odds ratios (OR), which are more commonly reported, were also calculated.

Results
Observed frequencies and probability density functions are presented for the CBCL in upper portions of Figure 4 and for the SDQ in upper portions of Figure 5. Whereas normal distributions offered optimal fit for the CBCL scores among children with diagnoses in both the North Carolina (mu = 57.6, sigma = 10.0) and New Haven samples (mu = 54.5, sigma = 10.3), all other distributions displayed significant skew. As a result, Weibull distributions offered optimal fit for the CBCL scores of children without diagnoses in both the North Carolina (alpha = 3.6, beta = 36.9, shift = 12.6) and New Haven (alpha = 3.9, beta = 29.8, shift = 16.8) samples. Log-normal distributions offered optimal fit for all distributions of SDQ scores, including the scores of children with diagnoses in the North Carolina sample (mu = 3.0, sigma = 0.3, shift = 8.5) and the NCS-A (mu = 3.2, sigma = 0.3, shift = 13.5), and for SDQ scores of children without diagnoses in the North Carolina sample (mu = 2.0, sigma = 0.5, shift = 3.2) and the NCS-A (mu = 2.0, sigma = 0.5, shift = 2.3).

Both instruments displayed strong ability to stratify children according to risk (see Table 1 for risk ratios, threshold probabilities and standard indices of diagnostic accuracy). In the NCS-A, children scoring above the threshold on the SDQ were 2.1 times as likely to receive a diagnosis as were children who scored negative (OR = 6.5), whereas in the North Carolina sample, children were 2.4 times as likely to receive a diagnosis if they screened positive on the SDQ (OR = 6.2). Children who screened positive on the CBCL were 2.7 times as likely to receive a diagnosis in the North Carolina sample (OR = 6.5), and 5.1 times as likely to receive a diagnosis in the New Haven sample (OR = 9.7).

Table 1 Relative risk, odds ratios, and standard indices of diagnostic accuracy for CBCL and SDQ across samples

Recommended screening thresholds
  CBCL, NC:    RR 2.7 (95% CI 2.2–3.2); OR 6.5 (4.2–10.0); TP 88, FP 38, TN 367, FN 131; sensitivity 40.2% (33.6–47.0); specificity 90.6% (87.3–93.3); PPV 69.8% (61.0–77.7); estimate of threshold probability 56.9%
  CBCL, NH:    RR 5.1 (95% CI 3.4–7.9); OR 9.7 (4.8–19.7); TP 21, FP 19, TN 325, FN 37; sensitivity 36.2% (24.0–49.9); specificity 94.5% (91.5–96.6); PPV 52.5% (36.1–68.5); estimate of threshold probability 40.8%
  SDQ, NC:     RR 2.4 (95% CI 1.9–2.9); OR 6.2 (3.3–11.4); TP 42, FP 15, TN 389, FN 177; sensitivity 19.2% (14.2–25.0); specificity 96.3% (94.0–97.9); PPV 73.7% (60.3–84.5); estimate of threshold probability 77.7%
  SDQ, NCS-A:  RR 2.1 (95% CI 2.0–2.2); OR 6.5 (5.3–7.9); TP 536, FP 130, TN 3,554, FN 2,266; sensitivity 19.1% (17.7–20.6); specificity 96.5% (95.8–97.0); PPV 80.5% (77.3–83.4); estimate of threshold probability 77.9%

Sample-specific screening thresholds
  CBCL, NC:    RR 3.3 (95% CI 2.5–4.2); OR 6.3 (4.4–9.1); TP 159, FP 120, TN 285, FN 60; sensitivity 72.6% (66.2–78.4); specificity 70.4% (65.7–74.8); PPV 57.0% (51.0–62.9); estimate of threshold probability 36.4%
  CBCL, NH:    RR 5.1 (95% CI 3.0–8.7); OR 6.9 (3.7–12.8); TP 42, FP 95, TN 249, FN 16; sensitivity 72.4% (59.1–83.3); specificity 72.4% (67.3–77.0); PPV 30.7% (23.5–38.8); estimate of threshold probability 9.4%
  SDQ, NC:     RR 3.1 (95% CI 2.5–4.0); OR 6.0 (4.2–8.7); TP 154, FP 114, TN 290, FN 65; sensitivity 70.3% (63.8–76.3); specificity 71.8% (67.1–76.1); PPV 57.5% (51.3–63.5); estimate of threshold probability 42.5%
  SDQ, NCS-A:  RR 2.4 (95% CI 2.3–2.5); OR 5.2 (4.6–5.8); TP 1,712, FP 858, TN 2,826, FN 1,090; sensitivity 61.1% (59.3–62.9); specificity 76.7% (75.3–78.1); PPV 66.6% (64.8–68.4); estimate of threshold probability 54.2%

CBCL, Child Behavior Checklist; NC, North Carolina sample; NCS-A, National Comorbidity Study – Adolescent Supplement; NH, New Haven sample; SDQ, Strengths & Difficulties Questionnaire.

Despite substantial differences in prevalence, estimates of PPV and threshold probability were relatively consistent across samples; however, sen-

[Figure 4: paired panels for the CBCL in the North Carolina and New Haven samples. Upper panels: distributions of screening scores (CBCL scores 25–70) by diagnostic status with estimated probability density functions. Lower panels: parametric estimates of sensitivity, PPV, and threshold probability (0%–100%) across CBCL scores.]

Figure 4 Child Behavior Checklist (CBCL) results. Note: CBCL, Child Behavior Checklist; pdf, probability density function; PPV, positive predictive value. Vertical lines represent recommended screening thresholds. In the upper figures, bars indicate observed frequencies, and lines indicate pdfs that have been fitted to these frequencies. In the lower figures, threshold probability has been calculated for each possible screening score based on the pdfs. Sensitivity and PPV have been calculated based on frequency tables of observed results. The area shaded gray represents the 95% confidence interval of the PPV
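The per-score threshold probabilities referenced in the figure note (calculated from fitted pdfs via the Pt equation in the Analyses section) can be sketched as follows. This is a simplified, stdlib-only illustration of the normal case only (the study also fit shifted lognormal and Weibull curves and checked fit with Kolmogorov–Smirnov and chi-square tests, which are omitted here); the score lists are invented for illustration and are not study data:

```python
from math import exp, pi, sqrt
from statistics import mean, pstdev

def norm_pdf(x, mu, sd):
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2.0 * pi))

def fit_normal(scores):
    """Maximum-likelihood normal fit: sample mean and population SD."""
    return mean(scores), pstdev(scores)

def threshold_probability(x, dx_scores, no_dx_scores):
    """Pt = f_dx(x) / (f_dx(x) + f_no_dx(x)), with each fitted pdf
    weighted by its group size so prevalence is built in."""
    mu1, sd1 = fit_normal(dx_scores)
    mu0, sd0 = fit_normal(no_dx_scores)
    f_dx = len(dx_scores) * norm_pdf(x, mu1, sd1)
    f_no = len(no_dx_scores) * norm_pdf(x, mu0, sd0)
    return f_dx / (f_dx + f_no)

# Illustrative (made-up) CBCL-like T-scores, not study data:
dx = [52, 55, 58, 60, 63, 66, 70]
no_dx = [38, 41, 43, 45, 46, 48, 50, 52, 55, 58]
print(round(threshold_probability(60, dx, no_dx), 2))
```

Evaluating this function at every score traces out curves like those in the lower panels of Figures 4 and 5, rising from near 0 at low scores toward 1 at high scores.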

sitivity was uniformly low. For the SDQ, PPV was estimated at 73.7% and 80.5%, threshold probability was estimated at 77.7% and 77.9%, and sensitivity was estimated at 19.2% and 19.1% in the North Carolina sample and NCS-A, respectively. For the CBCL, PPV was estimated at 69.8% and 52.5% and threshold probability was estimated at 56.9% and 40.8% in the North Carolina sample and New Haven sample, respectively. However, sensitivity was poor, estimated at 40.2% and 36.2%. This apparent consistency across samples may mask significant sources of uncertainty. For example, confidence intervals for estimates of PPV grow rapidly as thresholds rise and the number of children with positive results falls (see Figures 4 and 5). Moreover, estimates of PPV actually fall for higher thresholds of the SDQ in both samples (see Figure 5), and estimates of threshold probability therefore follow a similar pattern. Uncertainty in estimates in the tails of distributions may be due to low sample size and the presence of outliers, similar to observations made in previous studies (Dujardin, Van den Ende, Van Gompel, Unger, & Van der Stuyft, 1994), or possibly for substantive reasons, such as if high scores detect problems that are not identified by the gold standard (e.g., developmental disorders or parent psychopathology).

In an effort to account for sample-specific differences in prevalence and screening score distributions, we also examined psychometric characteristics at thresholds determined post hoc using Youden's index. Such thresholds maximize the sum of sensitivity and specificity. For the SDQ, PPV was estimated at 57.5% and 66.6%, threshold probability was estimated at 42.5% and 54.2%, and sensitivity was estimated at 70.3% and 61.1% in the North Carolina sample and NCS-A, respectively. For the CBCL, PPV was estimated at 57.0% and 30.7%, threshold probability was estimated at 36.4% and 9.4%, and sensitivity was estimated at 72.6% and 72.4% in the North Carolina sample and New Haven sample, respectively. In comparison to recommended thresholds, these post hoc thresholds achieved lower levels of PPV and threshold probability but displayed desired levels of sensitivity.

Discussion
The results of this study confirm the utility of behavioral screening instruments for risk stratifica-

© 2015 Association for Child and Adolescent Mental Health.


944 R. Christopher Sheldrick et al. J Child Psychol Psychiatr 2015; 56(9): 936–48

0.1

0.08
SDQ SDQ
Distributions of screening scores by

North Carolina Sample National Comorbidity Study –


diagnostic status and estimated

Adolescent Supplement
probability density functions

0.06

0.04

0.02

0 0
0 5 10 15 20 25 30 0 5 10 15 20 25 30 35

–0.02

–0.04
100%
Parametric estimates of Sensitivity, PPV

90%

80%
and threshold probability

70%

60%

50%

40%

30%

20%

10%

0%
0 5 10 15 20 25 30 0 5 10 15 20 25 30
SDQ score SDQ score

Figure 5 Strengths & Difficulties Questionnaire (SDQ) results. Note: SDQ, Strengths & Difficulties Quesionnaire; pdf, probability density
function; PPV, Positive predictive value. Vertical lines represent recommended screening thresholds. In the upper figures, bars indicate
observed frequencies, and lines indicate pdfs that have been fitted to these frequencies. In the lower figures, threshold probability has
been calculated for each possible screening score based on the pdfs. Sensitivity and PPV have been calculated based on frequency tables
of observed results. The area shaded gray represents the 95% confidence interval of the PPV

tion. In each case, children who screened positive adequate sensitivity for two well-validated screening
were approximately 2–5 times more likely than instruments. If clinicians have a similar perception,
children who screened negative to qualify for a it is reasonable to hypothesize that they might not
psychiatric diagnosis. However, the precise interpre- trust scores that fall close to the threshold. If so,
tation of screening results was highly dependent on clinicians may in effect shift the values of standard
the threshold chosen, and results were not uniform indices of diagnostic accuracy by implementing
across samples. Moreover, thresholds that offered higher thresholds than recommended. Such a strat-
acceptable sensitivity also offered diminished PPV egy would offer clinicians a higher PPV and threshold
and threshold probabilities. These findings are con- probability, but also lower sensitivity. This hypoth-
sistent with previous publications that have noted esis is consistent with evidence that when pediatri-
that the probability of misclassification rises as cians identify developmental and behavioral
scores approach the decision threshold (Hummel, disorders in standard care, specificity far outweighs
1999; Robins, 1985; Swets, Dawes, & Monahan, sensitivity (Sheldrick et al., 2011). It is also consis-
2000). tent with several implementation trials of screening
We again emphasize that for evidence-based instruments where clinicians have referred or trea-
screening instruments to be implemented with suc- ted only a proportion of children who score positive
cess and fidelity, resources must be sufficient for (Guevara et al., 2013).
children who score positive to receive treatment or Moreover, while screening instruments often yield
further assessment, regardless of whether they score binary results—either ‘positive’ or ‘negative’—our
at or well above the threshold. It is important to findings underscore the importance of understand-
remember that for a screening instrument with a ing screening results probabilistically. A particular
given ROC curve, high sensitivity comes with a threshold or score on a screening instrument defines
tradeoff—that is, comparatively low PPV and even groups of children who range in their risk of expe-
lower threshold probability. We were struck by how riencing psychopathology. While we can say with
low a threshold probability was required to achieve confidence that children who score positive on val-

© 2015 Association for Child and Adolescent Mental Health.


doi:10.1111/jcpp.12442 Thresholds and accuracy in screening tools 945

idated screening instruments are at much higher ations of screening accuracy is not. It is well
risk for having psychopathology than are children established that less-than-perfect reliability in an
who score negative, specifying the probabilities is independent variable attenuates the magnitude of
difficult given variation across samples. Thus, in the correlation that will be observed with a depen-
making decisions when using validated screening dent variables. Thus, if a diagnostic interview
instruments, clinicians may confront the fact that displays a reliability of rxx, the maximum correla-
not only are the predictive validities lower than they tion that should be expected with a screening
might prefer, but that there is also considerable instrument that measures the same construct is
uncertainty regarding the validity of estimates of SQRT(rxx). However, imperfect reliability among
diagnostic accuracy of the screening tools in their reference tests can also inflate associations with
own setting. Many have persuasively argued that screening instruments. In particular, this can
over-reliance on clinical intuition often leads to occur if the error variances of two tests are
error, and that identification of psychopathology correlated (Trikalinos et al., 2009), for example if
can be improved through use of valid screening both a diagnostic test and a screening instrument
instruments over clinical intuition alone (Dawes, tend to make the same errors for the same reasons.
Faust, & Meehl, 1989). However, overconfidence in Many screeners and interview rely on the same,
the course of standard clinical decision making is single, informant. In this scenario, any response
also a primary source of medical error (Berner & bias on the part of the parent will influence the
Graber, 2008), and failure to appreciate the likeli- results of both tests, thus inflating their associa-
hood of error in screening results and misinterpre- tion for reasons that have nothing to do with the
tation of their probabilistic nature creates the risk of child’s experience of psychopathology. While
over-valuing the information screening instruments accounting for these biases analytically is difficult
provide. There is a rich and diverse literature on the at best, a recent study offers practical guidance. In
interpretation of objective tests and their incorpora- a study in which Angold and colleagues assessed
tion with clinical findings (e.g., Hummel, 1999), and the accuracy with which each of three different
opinions regarding the relative value of objective ‘gold standard’ interviews for psychopathology in
‘actuarial’ methods versus clinical intuition vary children detected the other, the most accurate
widely across scientific fields (Swets et al., 2000). example was consistent with a sensitivity and
In medical and mental health fields, many have specificity of approximately 85% (Angold et al.,
argued for an evidence-based approach that uses 2012). If this is the highest accuracy with which
multiple thresholds (each associated with different one structured interview can detect the results of
diagnostic likelihood ratios) to account for variable another, then it may be reasonable to consider
predictive values among positive screening results these values as a practical limit on the observed
(Jaeschke et al., 1994; Straus, Glasziou, Richard- accuracy of screening instruments.
son, & Haynes, 2011; Youngstrom, 2013, 2014; Other limitations are also apparent. As is clear
Youngstrom, Choukas-Bradley, Calhoun, & Jensen- from our results, the precise shapes of distributions
Doss, 2014). In addition, we argue that training that of screening scores cannot be assessed with
includes information about the tradeoffs between certainty. This finding is consistent with prior inves-
threshold probabilities and sensitivity might counter tigations that note differences in test score distribu-
the rational use of higher than recommended screen- tions across populations (Brenner & Gefeller, 1997;
ing thresholds. Dans, Dans, Guyatt, & Richardson, 1998; Moons,
It is important to note many limitations to this van Es, Deckers, Habbema, & Grobbee, 1997; Sox
study and to the evaluation of diagnostic accuracy et al., 1990). Thus, analyses of threshold probabil-
of behavioral screening instruments more gener- ities are prone to error, and sensitivity analyses are
ally. First and foremost, there are both theoretical recommended to specify a range of plausible values.
and practical limits to the evaluation of the accu- Moreover, threshold probability and PPV are both
racy of screening instruments. Although they are directly influenced by underlying prevalence. Thus,
commonly referred to as ‘gold standards’, the differences in prevalence suggested by the use of
structured diagnostic interviews by which we different ‘gold standard’ diagnostic tools, or the use
assess psychopathology in children lack perfect of different impairment criteria within the same ‘gold
reliability. This fact is commonly reflected by inter- standard’ diagnostic tool, will influence results.
rater reliabilities with moderate kappa scores, and Moreover, attention to differences in prevalence
by modest retest reliabilities (Angold et al., 2012). between local populations and standardization sam-
For this reason, they are described in the statistical ples is critical. For example, prevalence might be
literature as examples of ‘fuzzy’ gold standards, expected to be lower in general pediatrics than in a
and there exists a large literature on this topic specialty referral clinic, and this will influence both
(Kraemer, 1992; Pepe, 2003; Phelps & Hutson, PPV and threshold probability.
1995; Zhou, McClish, & Obuchowski, 2009). While Overall, analysis of standard indices of diagnostic
the imperfect nature of structured diagnostic inter- accuracy in general and of threshold probability in
views is widely recognized, its influence on evalu- particular has significant limitations. Despite these

© 2015 Association for Child and Adolescent Mental Health.


946 R. Christopher Sheldrick et al. J Child Psychol Psychiatr 2015; 56(9): 936–48

limitations, the concept of threshold probability expected benefits of treatment or further evaluation
offers a framework for rationally considering and outweigh the expected costs. For children who
evaluating optimal screening thresholds. This is in score negative, the probability of having a disorder
contrast to standard approaches, which often rely on is low enough that the expected costs of treatment
arbitrary standards that do not consider the costs or further evaluation exceed the expected benefits.
and benefits of available assessment or intervention Therefore, for children who score precisely at the
options. For example, in recent years, 70% has often threshold, the probability of having a disorder is
been referred to as a minimally acceptable value for such that expected costs and benefits are equiva-
sensitivity and specificity. However, there is evidence lent (Pauker & Kassirer, 1980). Thus, a threshold
to suggest that standards for the accuracy of screen- and its associated threshold probability can reveal
ing instruments have fallen over time, perhaps in important information about the perceptions of
implicit recognition of the limits on sensitivity and costs and benefits among those who use or recom-
specificity in studies of accuracy. For example, in mend it. Moreover, this framework provided an
1974, a review of pediatric screening instruments explicit rationale for setting multiple thresholds
commented that the Denver Developmental Screen- (Pauker & Kassirer, 1980). For example, one
ing Test displayed ‘low sensitivity (81%) and speci- threshold could be set to reflect the costs and
ficity (77%), with a resultant large problem of benefits of further assessment, whereas a more
overreferral’ (Bailey et al., 1974). Today, a screening stringent threshold might be set to reflect the costs
test with this level of accuracy would be widely and benefits of typical interventions. Others have
accepted as falling well within commonly accepted recommended methods to optimize thresholds that
limits. However, neither standard makes an are functionally similar (Swets et al., 2000) or have
informed argument to support any given threshold. suggested reasonable approaches to setting thresh-
To make informed choices about screening thresh- olds when complete information on costs and
olds, it is useful to understand their relationship benefits are unavailable, as is typical in most cases
with cost-benefit analysis. Setting a threshold rep- (Kraemer, 1992; Straus et al., 2011). To conclude,
resents a decision, and this decision reflects the we recommend further research on the psychomet-
perceptions and values of the person or group of ric properties of instruments and the utility of
people setting the threshold. Any given threshold using different thresholds with a particular focus
implies that for all children who score above it, the on external validity across samples. In addition, we
expected benefits of being classified as positive recommend close attention to the best available
outweigh the expected costs. Conversely, it implies evidence on diagnostic accuracy and cost effective-
that for all children who score below the threshold, ness when setting and conceptualizing screening
expected costs of being classified as positive out- thresholds. As important, we recommend even
weigh expected benefits. Expected costs and benefits closer attention to what those data do—and do
result directly from the consequences of classifica- not—mean. Results of even the best screening
tion—for example, the amount and type of treatment instruments should be interpreted as probabilities,
and/or further assessment that a ‘positive’ label not as definitive findings, and even our under-
triggers. Costs and benefits may include multiple standing of those probabilities is subject to uncer-
factors, including expected changes in symptoms tainty. Thus, we argue that if empirically validated
associated with available interventions, economic screening instruments are to fulfill their potential
costs associated with interventions and mainte- for improving the detection of childhood psychopa-
nance or exacerbation of symptoms, risk of side thology, understanding the limits of evidence for
effects, stigma, or family distress, and impact on diagnostic accuracy is at least as important as
child and family quality of life (Godoy & Carter, understanding its strengths.
2013). How individuals weigh these multiple factors
can vary considerably and is dependent on their
perspectives and values. Thus, decisions about
where and how to set clinical thresholds reflect both
Acknowledgements
This original article was invited by the journal as part of
perceptions of consequences and value judgments
a special issue; it has undergone full, external peer
regarding the costs and benefits of identification, review. This study was made possible in part by funding
diagnosis, and treatment. from the National Institute of Mental Health
Assuming that quantitative estimates of costs (R01MH104400).
and benefits are both valid and available, then
threshold probability has a straightforward inter-
pretation in cost-benefit analysis. Screening instru- Correspondence
ments classify children according to risk for R. Christopher Sheldrick, Department of Pediatric Tufts
psychopathology. As described above, for children Medical Center, 800 Washington Street #854, Boston,
who score positive, the risk or probability of having MA 02111, USA; Email: rsheldrick@tuftsmedicalcenter.
a disorder is considered to be high enough that the org

© 2015 Association for Child and Adolescent Mental Health.



Key points
• Questionnaires designed to screen for psychopathology typically yield ordinal scores. One or more clinical thresholds or ‘cut scores’ are typically applied to yield ‘positive’ and ‘negative’ screening results.
• ‘Threshold probability’ refers to the likelihood that a child whose score is equal to a particular clinical threshold experiences psychopathology.
• Despite being effective in identifying groups of children at elevated risk for psychopathology, two well-validated screening instruments displayed low threshold probabilities—i.e., children who scored at or near the recommended clinical thresholds were unlikely to meet criteria for psychopathology on gold standard interviews.
• Screening results should be interpreted probabilistically, with attention to where along the continuum of positive scores an individual falls. Threshold probabilities offer a useful metric to evaluate and optimize clinical thresholds.
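The contrast drawn above between sensitivity-maximizing thresholds (Youden's index) and cost-aware thresholds (Pauker & Kassirer, 1980) can be illustrated with a toy sketch. All counts and cost/benefit weights below are invented for illustration; none come from the study's data:

```python
# Toy example: 100 affected and 900 unaffected children (prevalence 10%),
# with invented counts of each group scoring at or above each candidate cut.
affected_at_or_above = {10: 95, 15: 80, 20: 60, 25: 35, 30: 15}
unaffected_at_or_above = {10: 450, 15: 180, 20: 63, 25: 18, 30: 5}
N_AFF, N_UNAFF = 100, 900

def metrics(cut):
    """Sensitivity, specificity, and PPV for a given cut score."""
    tp = affected_at_or_above[cut]
    fp = unaffected_at_or_above[cut]
    sens = tp / N_AFF
    spec = 1 - fp / N_UNAFF
    ppv = tp / (tp + fp)
    return sens, spec, ppv

# Youden's index J = sensitivity + specificity - 1; pick the cut maximizing it.
youden_cut = max(affected_at_or_above, key=lambda c: metrics(c)[0] + metrics(c)[1] - 1)

# Pauker-Kassirer: act when P(disorder | score) exceeds p* = C / (C + B),
# where C is the net cost of acting on an unaffected child and B the net
# benefit of acting on an affected child (illustrative weights below).
C, B = 1.0, 4.0
p_star = C / (C + B)
```

With these invented counts, Youden's index selects the lower, sensitivity-favoring cut score, which carries a lower PPV than a stricter cut, mirroring the tradeoff reported in this study; the quantity p* shows how explicit cost and benefit estimates would instead fix the threshold probability at which acting and not acting break even.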

References
Achenbach, T.M., & Rescorla, L.A. (2001). Manual for the ASEBA school-age forms & profiles. Burlington: University of Vermont, Research Center for Children, Youth, and Families.
Angold, A., & Costello, E.J. (2000). The child and adolescent psychiatric assessment (CAPA). Journal of the American Academy of Child and Adolescent Psychiatry, 39, 39–48.
Angold, A., Erkanli, A., Copeland, W., Goodman, R., Fisher, P.W., & Costello, E.J. (2012). Psychiatric diagnostic interviews for children and adolescents: A comparative study. Journal of the American Academy of Child and Adolescent Psychiatry, 51, 506–517.
Bailey, E.N., Kiehl, P.S., Akram, D.S., Loughlin, H.H., Metcalf, T.J., Jain, R., & Perrin, J.M. (1974). Screening in pediatric practice. Pediatric Clinics of North America, 21, 123–165.
Berner, E.S., & Graber, M.L. (2008). Overconfidence as a cause of diagnostic error in medicine. The American Journal of Medicine, 121, S2–S23.
Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., ... & Lijmer, J.G. (2003). The STARD statement for reporting studies of diagnostic accuracy: Explanation and elaboration. Annals of Internal Medicine, 138, W1–W12.
Brenner, H., & Gefeller, O. (1997). Variation of sensitivity, specificity, likelihood ratios and predictive values with disease prevalence. Statistics in Medicine, 16, 981–991.
Briggs-Gowan, M.J., Carter, A.S., Skuban, E.M., & Horwitz, S.M. (2001). Prevalence of social-emotional and behavioral problems in a community sample of 1- and 2-year-olds. Journal of the American Academy of Child and Adolescent Psychiatry, 40, 811–819.
Briggs-Gowan, M.J., Horwitz, S.M., Schwab-Stone, M.E., Leventhal, J.M., & Leaf, P.J. (2000). Mental health in pediatric settings: Distribution of disorders and factors related to service use. Journal of the American Academy of Child and Adolescent Psychiatry, 39, 841–849.
Carter, A.S., Wagmiller, R.J., Gray, S.A.O., McCarthy, K.J., Horwitz, S.M., & Briggs-Gowan, M.J. (2010). Prevalence of DSM-IV disorder in a representative healthy birth cohort at school entry: Sociodemographic risks and social adaptation. Journal of the American Academy of Child and Adolescent Psychiatry, 49, 686–698.
Dans, A.L., Dans, L.F., Guyatt, G.H., Richardson, S., & Evidence-Based Medicine Working Group (1998). Users' guides to the medical literature: XIV. How to decide on the applicability of clinical trial results to your patient. JAMA, 279, 545–549.
Dawes, R.M., Faust, D., & Meehl, P.E. (1989). Clinical versus actuarial judgment. Science, 243, 1668–1674.
Dujardin, B., Van den Ende, J., Van Gompel, A., Unger, J.P., & Van der Stuyft, P. (1994). Likelihood ratios: A real improvement for clinical decision making? European Journal of Epidemiology, 10, 29–36.
Godoy, L., & Carter, A.S. (2013). Identifying and addressing mental health risks and problems in primary care pediatric settings: A model to promote developmental and cultural competence. American Journal of Orthopsychiatry, 83, 73–88.
Goodman, R. (1997). The strengths and difficulties questionnaire: A research note. Journal of Child Psychology and Psychiatry, 38, 581–586.
Goodman, R. (2001). Psychometric properties of the strengths and difficulties questionnaire (SDQ). Journal of the American Academy of Child and Adolescent Psychiatry, 40, 1337–1345.
Goodman, R., & Scott, S. (1999). Comparing the strengths and difficulties questionnaire and the child behavior checklist: Is small beautiful? Journal of Abnormal Child Psychology, 27, 17–24.
Guevara, J.P., Gerdes, M., Localio, R., Huang, Y.V., Pinto-Martin, J., Minkovitz, C.S., ... & Pati, S. (2013). Effectiveness of developmental screening in an urban setting. Pediatrics, 131, 30.
He, J., Burstein, M., Schmitz, A., & Merikangas, K.R. (2013). The strengths and difficulties questionnaire (SDQ): The factor structure and scale validation in U.S. adolescents. Journal of Abnormal Child Psychology, 41, 583–595.
Hummel, T.J. (1999). The usefulness of tests in clinical decisions. In J.W. Lichtenberg, & R.K. Goodyear (Eds.), Scientist-practitioner perspectives on test interpretation (pp. 59–112). Needham Heights, MA: Allyn & Bacon.
Jaeschke, R., Guyatt, G.H., Sackett, D.L., Guyatt, G., Bass, E., Brill-Edwards, P., & Wilson, M. (1994). Users' guides to the medical literature: III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? JAMA, 271, 703–707.
Kessler, R.C., Avenevoli, S., Costello, E.J., Green, J.G., Gruber, M.J., Heeringa, S., ... & Zaslavsky, A.M. (2009a). National comorbidity survey replication adolescent supplement (NCS-A): II. Overview and design. Journal of the American Academy of Child and Adolescent Psychiatry, 48, 380–385.
Kessler, R.C., Avenevoli, S., Costello, E.J., Green, J.G., Gruber, M.J., Heeringa, S., ... & Zaslavsky, A.M. (2009b). Design and field procedures in the US National Comorbidity Survey Replication Adolescent Supplement (NCS-A). International Journal of Methods in Psychiatric Research, 18, 69–83.
Kessler, R.C., & Üstün, T.B. (2004). The World Mental Health (WMH) Survey Initiative Version of the World Health Organization (WHO) Composite International Diagnostic Interview (CIDI). International Journal of Methods in Psychiatric Research, 13, 93–121.
Kraemer, H.C. (1992). Evaluating medical tests: Objective and quantitative guidelines. Newbury Park, CA: Sage Publications.
Merikangas, K.R., Avenevoli, S., Costello, E.J., Koretz, D., & Kessler, R.C. (2009). National Comorbidity survey replication adolescent supplement (NCS-A): I. Background and measures. Journal of the American Academy of Child and Adolescent Psychiatry, 48, 367–379.
Moons, K.G., van Es, G.A., Deckers, J.W., Habbema, J.D.F., & Grobbee, D.E. (1997). Limitations of sensitivity, specificity, likelihood ratio, and Bayes' theorem in assessing diagnostic probabilities: A clinical example. Epidemiology, 8, 12–17.
Pauker, S.G., & Kassirer, J.P. (1980). The threshold approach to clinical decision making. The New England Journal of Medicine, 302, 1109–1117.
Pepe, M.S. (2003). The statistical evaluation of medical tests for classification and prediction. Oxford: Oxford University Press.
Phelps, C.E., & Hutson, A. (1995). Estimating diagnostic test accuracy using a 'fuzzy gold standard'. Medical Decision Making, 15, 44–57.
Robins, L.N. (1985). Epidemiology: Reflections on testing the validity of psychiatric interviews. Archives of General Psychiatry, 42, 918–924.
Shaffer, D., Fisher, P., Dulcan, M.K., Davies, M., Piacentini, J., & Schwab-Stone, M.E. (1996). The NIMH diagnostic interview schedule for children, Version 2.3 (DISC-2.3): Description, acceptability, prevalence rates, and performance in the MECA study. Journal of the American Academy of Child and Adolescent Psychiatry, 35, 865–877.
Sheldrick, R.C., Merchant, S., & Perrin, E.C. (2011). Identification of developmental-behavioral problems in primary care: A systematic review. Pediatrics, 128, 356–363.
Sox, H.C., Hickman, D.H., Marton, K.I., Moses, L., Skeff, K.M., Sox, C.H., & Neal, E.A. (1990). Using the patient's history to estimate the probability of coronary artery disease: A comparison of primary care and referral practices. The American Journal of Medicine, 89, 7–14.
Straus, S.E., Glasziou, P., Richardson, W.S., & Haynes, R.B. (2011). Evidence-based medicine: How to practice and teach EBM (4th edn). New York: Churchill Livingstone.
Swets, J.A., Dawes, R.M., & Monahan, J. (2000). Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest, 1, 1–26.
Trikalinos, T.A., Siebert, U., & Lau, J. (2009). Decision-analytic modeling to evaluate benefits and harms of medical tests: Uses and limitations. Medical Decision Making, 29, E22–E29.
U.S. Department of Health and Human Services (2000). Mental health: A report of the surgeon general. Washington, DC: US Government Printing Office.
Vickers, A.J., & Elkin, E.B. (2006). Decision curve analysis: A novel method for evaluating prediction models. Medical Decision Making, 26, 565–574.
Wang, P.S., Lane, M., Olfson, M., Pincus, H.A., Wells, K.B., & Kessler, R.C. (2005). Twelve-month use of mental health services in the United States: Results from the National Comorbidity Survey Replication. Archives of General Psychiatry, 62, 629–640.
Whiting, P.F., Rutjes, A.W.S., Westwood, M.E., Mallett, S., Deeks, J.J., Reitsma, J.B., ... & The QUADAS-2 Group. (2011). QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Annals of Internal Medicine, 155, 529–536.
Youngstrom, E.A. (2013). Future directions in psychological assessment: Combining evidence-based medicine innovations with psychology's historical strengths to enhance utility. Journal of Clinical Child and Adolescent Psychology, 42, 139–159.
Youngstrom, E.A. (2014). A primer on receiver operating characteristic analysis and diagnostic efficiency statistics for pediatric psychology: We are ready to ROC. Journal of Pediatric Psychology, 39, 204–221.
Youngstrom, E.A., Choukas-Bradley, S., Calhoun, C.D., & Jensen-Doss, A. (2014). Clinical guide to the evidence-based assessment approach to diagnosis and treatment. Cognitive and Behavioral Practice, 22, 20–35.
Zhou, X.H., McClish, D.K., & Obuchowski, N.A. (2009). Statistical methods in diagnostic medicine (Vol. 569). New York: John Wiley & Sons.

Accepted for publication: 13 May 2015
Published online: 19 June 2015

© 2015 Association for Child and Adolescent Mental Health.
