ABSTRACT
Objective: Research replication, or repeating a study de novo, is the scientific standard for building evidence and identifying
spurious results. While replication is ideal, it is often expensive and time-consuming. Reproducibility, or reanalysis of data
to verify published findings, is one proposed minimum alternative standard. While a lack of research reproducibility has
been identified as a serious and prevalent problem in biomedical research and a few other fields, little work has been done
to examine the reproducibility of public health research. We examined reproducibility in 6 studies from the public health
services and systems research subfield of public health research.
Design: Following the methods described in each of the 6 papers, we computed the descriptive and inferential statistics for
each study. We compared our results with the original study results and examined the percentage differences in descriptive
statistics and differences in effect size, significance, and precision of inferential statistics. All project work was completed
in 2017.
Results: We found consistency between original and reproduced results for each paper in at least 1 of the 4 areas examined.
However, we also found some inconsistencies. We identified incorrect transcription of results and omission of detail about
data management and analyses as the primary contributors to the inconsistencies.
Recommendations: Increasing reproducibility, or reanalysis of data to verify published results, can improve the quality of
science. Researchers, journals, employers, and funders can all play a role in improving the reproducibility of science through
several strategies including publishing data and statistical code, using guidelines to write clear and complete methods
sections, conducting reproducibility reviews, and incentivizing reproducible science.
KEY WORDS: open science, public health services and systems research, reproducibility
Copyright © 2018 Wolters Kluwer Health, Inc. Unauthorized reproduction of this article is prohibited.
2 Harris, et al • 00(00), 1–9 Reproducing Public Health Services and Systems Research
sample of psychology papers,10 and 11% of P values were incorrect in a sample of medical papers.11 Lack of reproducibility is serious and prevalent, with just 21% of 67 drug studies and 40% to 60% of 100 psychological studies being successfully reproduced.4,6,12 While most of the work on reproducibility in science has been in the biomedical or psychology fields, 1 recent paper in the journal Science examined economics papers and found that 61% (11 of 18) papers were reproducible.13 Less has been done to examine the reproducibility of public health research.

Public health services and systems research (PHSSR) is a subfield of public health research concerned with the “organization, financing, and delivery of public health services, and the impact of these services on public health.”14 Governmental public health practitioners use research evidence, including PHSSR, to select programs and justify program selection to their partners, evaluate programs, prepare funding applications, and conduct needs assessments.15,16

One of the major sources of data for PHSSR is the National Profile of Local Health Departments Study (Profile), a comprehensive survey of all local health departments (LHDs) nationwide conducted regularly by the National Association of County and City Health Officials (NACCHO). Survey questions cover LHD funding, workforce, programs, and partnerships. Profile data are available for public use at no cost through the Inter-university Consortium for Political and Social Research or can be requested directly by completing a data use agreement form and e-mailing it to NACCHO. Given that the data are publicly available at no cost and widely used in PHSSR, we sought to examine the reproducibility of published studies that used the NACCHO Profile study data. We had 3 goals: (1) examine reproducibility in a sample of published PHSSR studies using publicly available data; (2) demonstrate one strategy for increasing research reproducibility; and (3) encourage public health researchers to begin adopting reproducible research strategies.

Methods

Goal 1: Examine reproducibility in PHSSR studies

Sample selection

In early 2017, we conducted a search to identify published journal articles using the 2013 NACCHO Profile data. We searched 10 sources: PubMed, Google Scholar, Ebsco, Web of Knowledge, Scopus, EMBASE, CINAHL Plus, Academic Complete, Web of Science, and CINAHL. Our search terms were as follows: “National Association of County & City Health Officials,” “National Association of County and City Health Officials,” and “NACCHO.” We reviewed the resulting papers and restricted the results to those using 2013 NACCHO Profile data where all or most of the analyses were conducted without combining it with other years of Profile data or other data sources (n = 15). We reviewed all 15 papers and recorded the following: variables examined, type of statistical analyses conducted, inclusion of detail on missing data handling, inclusion of detail on variable recoding and combining, whether analyses used the NACCHO core data and/or 1 or more of the modules, and detail on inclusion of another data set, if any. We used this summary information to select 5 of the 15 articles (33.3%) that (1) included clear descriptions of data management and analyses and (2) applied commonly used statistical methods for PHSSR.17 Harris has also published PHSSR using NACCHO data but has no publications using 2013 NACCHO data alone for analyses; we reproduced the results of a study using 2008 Profile data led by Harris to determine whether we would be more successful in reproducing our own work.

Reproducing descriptive and inferential statistics

The first step in reproducing the research was to mimic the work described in each manuscript to reproduce descriptive statistics. We examined tables and methods sections in each paper to identify variables used, recoding, whether weights were used, and types of descriptive analyses conducted. All authors worked together to manage and analyze the data in order to reproduce the reported descriptive statistics. Once we completed data management and were able to reproduce the descriptive statistics as closely as possible to the original published manuscript, we reproduced bivariate and multivariate inferential model(s) on the same managed data. We reproduced statistics reported in all tables in each manuscript that used only NACCHO 2013 Profile data (or 2008 Profile data for Paper 6); we did not reproduce tables that included other data sources or statistics displayed in figures.

In Paper 1, we were not able to closely reproduce the tables of descriptive statistics without additional information from the authors. For this paper, rurality was coded on the basis of Rural-Urban Commuting Area (RUCA) codes, which allow classification of health department jurisdictions by how rural or urban they are. The NACCHO data can be requested with RUCA codes, but we were unsure of the exact categories used in Paper 1, which lacked detail about the codes. We e-mailed the lead author and were sent the RUCA codes used. The RUCA codes for Paper 1 are included on GitHub with the
00 2018 • Volume 00, Number 00 www.JPHMP.com 3
SPSS, Stata, and R command files for all results reported here (https://github.com/coding2share/phssr-reproducibility/).

Comparing reproduced results with original results

We examined how similar or different the reproduced statistics were compared with the published results. For descriptive statistics reported alone or as part of a table of bivariate analyses, we determined the percentage difference between the original and reproduced value. In computing the percentage difference, we subtracted the original values from the reproduced values and divided by the originals. The resulting percentages were positive for reproduced values greater than the originals and negative for reproduced values less than the originals. For each paper, we determined the mean and standard deviation of the percentage difference across all descriptive statistics reported in each table.

For multivariate inferential statistics, we compared original and reproduced statistics in 3 areas: effect size, statistical significance, and precision. For effect size comparisons, we used a strategy similar to the one used for descriptive statistics, computing the percentage difference between original and reproduced odds ratios. For significance, we used the cutoff for significance reported (or implied) in the published manuscript and noted whether reproduced statistical tests matched the significance of original test results. For those not matching, we recorded whether the result became significant or nonsignificant compared with the original. Statistical significance was also compared for bivariate inferential analyses. Precision was examined by comparing the mean width of confidence intervals in the original and reproduced results.

Examining manuscript features

Finally, after reproducing results from the 6 papers, we reviewed each paper and noted whether it included detail we felt would have aided us in more accurately reproducing results: (1) the type of bivariate analysis conducted; (2) test statistic values for bivariate analyses (eg, χ2, F); (3) the name or a detailed description of the weight variable used (if any); (4) detail on how missing data were handled; (5) detail on variable combining and recoding; and (6) sample sizes and raw frequencies for groups analyzed.

Goal 2: Demonstrate a strategy for research reproducibility

During the process of reproducing the 6 studies, we organized and annotated all of the SPSS and Stata commands we used to get from the raw NACCHO Profile data to our reproduced results. Summary figures were produced in R. The SPSS, Stata, and R command files are available on GitHub: https://github.com/coding2share/phssr-reproducibility/. Each of our analyses reproducing a given manuscript used the statistical program noted by that paper. Data and codebooks for the Profile study data are available from NACCHO (http://nacchoprofilestudy.org/).

Results

Sample characteristics

The 5 studies using 2013 NACCHO Profile study data were published between 2015 and 2017 in 3 distinct journals with 4 distinct lead authors. The additional study by Harris was published in 2013 and used 2008 NACCHO Profile study data. Descriptive tables reported frequencies, percentages, means, and standard deviations. Bivariate inferential statistics, such as χ2 or correlation analyses, were included in 4 of the 6 papers. The 5 papers using 2013 data used weights in analysis and conducted multivariate logistic regression for the primary analysis; the Harris paper neither weighted the data nor reported logistic regression.

Reproducing descriptive statistics

Following the methods and results sections of the 6 papers, we computed descriptive statistics from NACCHO Profile Study data sets obtained directly from NACCHO. Figure 1 shows the mean percentage difference between the originally reported and reproduced statistics. Most tables of descriptive statistics were accurately reproduced, with the exception of Paper 1 Table 1 and Paper 6 Table 1 (Figure 1; see Supplemental Digital Content Appendix A, available at http://links.lww.com/JPHMP/A396). Tables showing the original and reproduced values are shown in Appendix A.

Most differences between original and reproduced values in Paper 1 Table 1 were small, with a few exceptions (see Supplemental Digital Content Appendix A, available at http://links.lww.com/JPHMP/A396). One exception was the administrator highest degree variable, where the category “Associates degree” had 1.8% of the original study participants but 3.8% of the reproduced participants, more than doubling the percentage. Likewise, there was a large difference in the percentage of LHDs that had a community health improvement plan (CHIP) and developed an agency-wide strategic plan in the original (47.8%) compared with the reproduced (56.2%). These variables were created by combining variables (degree variable) or recoding
FIGURE 1 Difference Between Original and Reproduced Descriptive and Inferential Statistics Reported in 5 Published Public Health Services and Systems Research (PHSSR) Studies Using National Association of County & City Health Officials (NACCHO) 2013 Profile Data (Papers 1-5) and 1 PHSSR Study by Harris Using NACCHO 2008 Profile Data (Paper 6)
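The summary plotted in Figure 1 is the signed percentage difference defined in the Methods: reproduced minus original, divided by original, averaged within each table. A minimal sketch of that comparison (Python for illustration only; the study's own scripts are the SPSS, Stata, and R files on GitHub, and the values below are hypothetical, not taken from any of the 6 papers):

```python
# Signed percentage difference as described in the Methods:
# (reproduced - original) / original * 100, positive when the
# reproduced value exceeds the original.
from statistics import mean, stdev

def pct_diff(original, reproduced):
    return (reproduced - original) / original * 100

def summarize_table(originals, reproduceds):
    """Mean and standard deviation of percentage differences
    across all descriptive statistics reported in one table."""
    diffs = [pct_diff(o, r) for o, r in zip(originals, reproduceds)]
    return mean(diffs), stdev(diffs)

# Hypothetical values for one table (not data from the studies):
original = [47.8, 1.8, 62.0]
reproduced = [56.2, 3.8, 62.0]
m, s = summarize_table(original, reproduced)
```

A perfectly reproduced table would yield a mean difference of 0; large positive or negative means flag tables, such as Paper 1 Table 1, that warrant a closer look.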
a multicategory variable into a binary variable (CHIP variable). The manuscript included very clear instructions for combining and recoding; we reviewed our syntax in light of the instructions in the manuscript but were not able to identify any obvious differences between how we constructed the variables and how the authors described constructing them. No sample sizes or unweighted frequencies were reported, so we did not have a way to compare the number of LHDs recoded inconsistently from the original to the reproduced.

The differences between original and reproduced results in Paper 6 Table 1 (see Supplemental Digital Content Appendix A, available at http://links.lww.com/JPHMP/A396) were due to an error in transcription by the paper authors or the journal staff. Specifically, percentages in the land use planning, tobacco prevention and control, influenza, and obesity rows in Paper 6 Table 1 appear to have been swapped between rows, suggesting the row labels were reordered but the values were not. Other than this, the reproduced table matched the original nearly exactly.

Reproducing inferential models

Comparing effect size

On average, the reproduced effect sizes were accurate for most papers, with mean differences between original and reproduced being less than 5% for all but 1 paper. For Paper 1 Table 4, the mean difference between original and reproduced effect sizes was 11.1%. We examined the differences between original and reproduced odds ratios and found 4 odds ratios in Paper 1 that were very different from the original, including both categories of rurality, whether or not the health department had a board of health, and whether the health department was in the state governance category (Figure 2). We thought that perhaps the large differences seen between original and reproduced descriptive statistics for
FIGURE 2 Original and Reproduced Odds Ratios and Confidence Intervals From Paper 4 Model 1
Abbreviation: CI indicates confidence interval.
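The precision comparison defined in the Methods reduces to confidence interval width, and significance for an odds ratio reduces to whether its interval excludes 1.0. A minimal sketch (Python for illustration only, not the study's published code; the endpoints are the Paper 5 board-of-health intervals quoted in the Discussion):

```python
# Precision is compared as the width of each 95% confidence
# interval; an odds-ratio CI that contains 1.0 is nonsignificant.

def ci_width(lower, upper):
    return upper - lower

def is_significant(lower, upper):
    """An odds ratio is significant when its CI excludes 1.0."""
    return not (lower <= 1.0 <= upper)

# Paper 5 board-of-health odds ratio CIs quoted in the Discussion:
original = (0.50, 0.98)     # excludes 1.0 -> significant
reproduced = (0.52, 1.03)   # contains 1.0 -> nonsignificant
```

The two intervals here have nearly identical widths, which illustrates how a small shift in the endpoints, rather than a loss of precision, can flip the significance status of a reproduced result.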
this paper may have resulted in the low accuracy of the regression model, but the variables with the least accuracy in the descriptive analyses were not used in the regression model. We were unable to identify the source of the sizeable differences between observed and reproduced regression results for this model.

Comparing statistical significance

All but 1 of the bivariate tables reproduced were completely accurate with respect to statistical significance status (Figure 3). The multivariate models each had 1 or more changes to statistical significance status, with the exception of Paper 4. Paper 6 had by far the most changes in statistical significance; 10 of the 40 correlations changed from statistically significant to nonsignificant when reproduced. In an examination of documents submitted by Harris for publication of Paper 6, we found that 4 of the 10 correlations that went from significant in the original to nonsignificant in the reproduced results were not marked as significant in the submitted table but were marked as such in the published table. At some point in the process from submission to publication, 4 correlations were transcribed incorrectly as significant and not caught during the proof review or any other stage before publication.

Comparing precision

In addition to relatively high reproducibility in statistical significance, the precision of effect size estimates was relatively stable across all of the studies except for Paper 1. The reproduced results of Paper 1 were more precise than the original results (Figure 2), which suggests that smaller standard deviations and/or larger sample sizes were used to compute the reproduced confidence intervals. Figure 4 shows the mean difference in width between original and reproduced confidence intervals for odds ratios across studies. Supplemental Digital Content Appendix B, available at http://links.lww.com/JPHMP/A397, shows original and reproduced odds ratios for Papers 1 through 5.

Examining manuscript features

Three of the 4 papers that included bivariate analysis named the type of bivariate analysis conducted, but none of the tables of bivariate analyses included test statistic value(s) such as χ2 or t statistics. Of the 5 papers using weights, 2 were specific about the weighting variable used. The other 3 papers stated that “appropriate” or “proper” weights were used. All papers gave overall response rates for the study, and 1 paper (Paper 5) stated that all included variables had
FIGURE 3 Percentage of Bivariate and Multivariate Inferential Results That Maintained or Changed Statistical Significance Status When Reproduced for 6 Public Health Services and Systems Research Studies
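The tallies behind Figure 3 amount to comparing significance status test by test at each paper's reported cutoff, as described in the Methods. A minimal sketch of that bookkeeping (Python for illustration only; the P values below are hypothetical, not results from any of the 6 papers):

```python
# For each test, compare significance status between the original
# and reproduced P values at the paper's reported cutoff (alpha).

def status_changes(original_p, reproduced_p, alpha=0.05):
    """Count tests that kept their status, became significant,
    or became nonsignificant when reproduced."""
    kept = became_sig = became_nonsig = 0
    for p_orig, p_repro in zip(original_p, reproduced_p):
        orig_sig = p_orig < alpha
        repro_sig = p_repro < alpha
        if orig_sig == repro_sig:
            kept += 1
        elif repro_sig:
            became_sig += 1
        else:
            became_nonsig += 1
    return kept, became_sig, became_nonsig

# Hypothetical P values (not taken from the studies):
original = [0.001, 0.030, 0.049, 0.200]
reproduced = [0.001, 0.045, 0.061, 0.150]
```

Results sitting near the cutoff, such as the third pair above, are the ones most likely to flip status when small data management differences shift the reproduced P value.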
First, typos and other errors when transferring text during the writing and publishing process accounted for numerous differences between original and reproduced results. The switching of 4 rows of data in Paper 6 Table 1 without editing the labels was the biggest instance we found, but not the only one. Paper 2 Table 1 had 4 frequency and percentage values switched (see Supplemental Digital Content Appendix A, available at http://links.lww.com/JPHMP/A396). In addition, all of the significant P values in Paper 1 Table 2 were shown as P = .001, while reproduced results for this paper indicated that these P values were all P < .001. Although it seems unlikely that all the P values would be exactly equal to .001, without the test statistics for this table, it is difficult to check whether this difference is real or a set of typos. In Paper 6, 4 superscripts indicating significance were mistakenly added to nonsignificant correlations at some point between submission and publication and were not discovered during the publishing process. Altogether, 3 of the 6 papers examined included transcription errors in reporting values. An additional paper, Paper 5, included a labeling error that did not result in misreported values but caused some initial confusion during data management. Specifically, a less than symbol (<) was printed in Tables 3 and 4 where there should have been a greater than or equal to (≥) symbol.

The omission of detail throughout the research process also hindered accuracy in reproducing these papers. For example, our initial attempt to reproduce the unweighted frequencies in Paper 2 Table 1 resulted in lower values in the “No” category for 5 variables. To reproduce the frequencies exactly, we tried assigning missing data into the category for “No.” So, for example, an LHD missing data on the variable for strategic plan developed would be placed into the “No” category, indicating they were not developing a strategic plan. This strategy was not intuitive to us and was not described in the methods of the paper, but it did result in accurate frequencies. We found a similar recoding of missing values to “No” in Paper 3 for several variables, with no mention of this strategy in the article text (see command file at https://github.com/coding2share/phssr-reproducibility). In Paper 3, the authors included unweighted percentages, which were useful in determining how to treat the missing values.

Likewise, while 3 of the 4 papers including bivariate statistics named the test they were using, none of the 4 papers reporting bivariate analyses included the values of test statistics for the bivariate analyses. Although the values of test statistics, like χ2 or U, are often not inherently meaningful, they do allow the determination or verification of the P value for a statistical test. As an example of how this may have improved reproducibility, the percentages reported in the original Paper 1 Table 2 were accurately reproduced, so we expected to also find accurately reproduced P values. However, the P values differed slightly between original and reproduced results. We believe that this was due to a transcription error; however, including bivariate test statistics would have verified our belief or alerted us to some analytic or reporting choice the authors made that we did not include in our process.

The differences found in effect size, precision, and statistical significance in the reproduced analyses likely resulted from not knowing data management details, like the handling of missing data during recoding in Papers 2 and 3. If a small change in data management results in significance changes or less precise results, it suggests that findings from these studies may not be very robust and should be interpreted with caution. For example, in Paper 5, the differences between observed and reproduced descriptive statistics were minor; however, in the inferential results, the variable indicating whether the local board of health had authority to approve the LHD budget was a significant protective factor against LHD budget cuts in the original analyses but nonsignificant in the reproduced analyses. The confidence intervals in the original (0.50-0.98) and reproduced (0.52-1.03) were close but just different enough to change the statistical significance status. The authors concluded that this significant protective factor was one of a few indicators that board of health actions influence LHD financial stability, which may or may not be the case depending on how the data were managed. Instead of focusing on P values,18 a focus on effect sizes and practically relevant differences is one possible strategy for addressing this concern. Finally, as mentioned, we reproduced analyses in the software used in the original manuscript. This was at the recommendation of peer reviewers. Initially, we conducted all analyses in R. However, when comparing R results to the original results obtained from Stata and SPSS, we noticed a number of differences in statistical significance that gave us pause. In examining possible reasons for these differences, we found slight differences in algorithms19,20 and more substantial differences in how weights are applied across different statistical packages for some statistical tests. Understanding and describing how the program we use applies weights in our analyses could increase the ability of others to reproduce our work.

Study limitations include the small number of studies reproduced, the small selection of studies based on data availability, and the use of traditional statistical methods. These studies were not likely to be representative of the larger population of PHSSR studies or public health research. However, given data
Implications for Policy & Practice

■ Increasing reproducibility, or reanalysis of data to verify published results, can improve the quality of science.
■ To increase reproducibility, include detail in publications such as specific data management decisions, sample sizes for all analyses, unweighted frequencies for weighted analyses, and the name and test statistic for all statistical tests.
■ Checking proofs of publications closely, including all labels and numbers in tables, can identify errors to correct in reporting or typesetting and increase reproducibility.
■ Researchers, journals, employers, and funders can improve reproducibility by using existing publishing guidelines; publishing statistical code, output, and data when possible; conducting reproducibility reviews; and incentivizing reproducible science.

reproducibility.28 In addition, journals improve reproducibility by (1) requiring authors to follow existing guidelines for reporting study results;23,25-27 (2) requiring authors to submit data and statistical source code;7,11,13,23 (3) using reviewers with statistical expertise;11 and (4) offering reproducibility reviews.7 Funders and employers of researchers could consider incentivizing or funding reproducible research, reproduction efforts, and strategies for increasing reproducibility such as sharing data and statistical code.6,29 While not all of these suggestions are appropriate for all projects, especially for those with restricted use data or data with private or confidential information, all projects can employ some of these strategies. Increasing research reproducibility in public health can help ensure that researchers are providing practitioners with the best possible science to use in determining how to spend scarce resources and improve public health.
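The Implications above call for reporting the name and value of every test statistic; with a reported statistic and its degrees of freedom, a reader can recompute the P value and check it against the published one. A minimal sketch (Python for illustration only, not part of the study's published code; the 1-degree-of-freedom chi-square case has a closed form via the complementary error function):

```python
# Recompute the upper-tail p value for a reported chi-square
# statistic with 1 degree of freedom: p = erfc(sqrt(x / 2)).
# For other degrees of freedom, use a statistics library's
# chi-square survival function instead.
from math import erfc, sqrt

def chi2_pvalue_df1(statistic):
    """Upper-tail P value for a chi-square statistic, df = 1."""
    return erfc(sqrt(statistic / 2))

# A reported chi-square of about 3.84 (df = 1) corresponds to
# p near .05; statistics above roughly 10.83 correspond to
# p < .001 -- the kind of check that could distinguish a table
# showing P = .001 from one meaning P < .001.
```

A check like this is only possible when the test statistic itself is published, which is the reason the Implications box asks authors to report it alongside the test name.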
evidence-based practices: a review of the literature. Am J Prev Med. 2012;43(3):309-319.
17. Harris JK, Beatty KE, Barbero C, et al. Methods in public health services and systems research. Am J Prev Med. 2012;42(5):S42-S57.
18. Nuzzo R. Statistical errors. Nature. 2014;506(7487):150.
19. Park HM. Comparing group means: t-tests and one-way ANOVA using Stata, SAS, R, and SPSS. http://hdl.handle.net/2022/19735. Published 2009. Accessed October 11, 2017.
20. Park HM. Linear regression models for panel data using SAS, Stata, LIMDEP, and SPSS. http://www.indiana.edu/∼statmath/stat/all/panel/panel.pdf. Published 2015. Accessed October 11, 2017.
21. Steen RG. Retractions in the medical literature: how many patients are put at risk by flawed research? J Med Ethics. 2011;37(11):688-692.
22. Steen RG. Retractions in the scientific literature: is the incidence of research fraud increasing? J Med Ethics. 2011;37(4):249-253.
23. Begley CG, Ellis LM. Raise standards for preclinical cancer research. Nature. 2012;483(7391):531-533.
24. Ioannidis JP, Greenland S, Hlatky MA, et al. Increasing value and reducing waste in research design, conduct, and analysis. Lancet. 2014;383(9912):166-175.
25. International Committee of Medical Journal Editors (ICMJE). Uniform requirements for manuscripts submitted to biomedical journals: writing and editing for biomedical publication. Haematologica. 2004;89(3):264.
26. Santori G. Journals should drive data reproducibility. Nature. 2016;535(7612):355.
27. EQUATOR Network. Enhancing the quality and transparency of health research. www.equator-network.org. Published October 4, 2015. Accessed October 11, 2017.
28. Lowndes JSS, Best BD, Scarborough C, et al. Our path to better science in less time using open data science tools. Nat Ecol Evol. 2017;1:0160.
29. Ioannidis JP. How to make more published research true. PLoS Med. 2014;11(10):e1001747.